# <center> Natural Language Processing (NLP)</center>
[Natural language processing](https://es.wikipedia.org/wiki/Procesamiento_de_natural_languages) (NLP; abbreviated PLN in Spanish) is a field of computer science, artificial intelligence and linguistics that studies the interactions between computers and human language. It deals with the formulation and investigation of computationally efficient mechanisms for communication between people and machines through natural language, that is, the world's languages. The goal is not to study communication through natural languages in the abstract, but to design communication mechanisms that are computationally efficient, meaning that they can be carried out by programs that execute or simulate communication.

![elgif](https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif)

NLP is considered one of the great challenges of artificial intelligence: how can a program really understand the meaning of a text? How can it understand neologisms, irony, jokes or poetry? If the strategy or algorithm we use does not overcome these difficulties, the results will be of little use to us.
In NLP it is not enough to understand individual words: you must understand the set of words that make up a sentence and the set of sentences that make up a paragraph, so that the analysis gives a global meaning to the text or discourse and supports good conclusions.

Our language is full of ambiguities: words with several meanings, turns of phrase, and meanings that change with the context. This makes NLP one of the most difficult tasks to master.

Therefore, the difficulty of NLP lies at several levels:

Ambiguity:

- Lexical level: for example, a word with several meanings
- Referential level: anaphora, metaphors, etc.
- Structural level: semantics is needed to understand the structure of a sentence
- Pragmatic level: double meanings, irony, humor
- Gap detection

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-data" data-toc-modified-id="The-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The data</a></span></li><li><span><a href="#We-bring-all-the-data-to-a-dataframe-from-MySQL" data-toc-modified-id="We-bring-all-the-data-to-a-dataframe-from-MySQL-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>We bring all the data to a dataframe from MySQL</a></span></li><li><span><a href="#NLP" data-toc-modified-id="NLP-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>NLP</a></span><ul class="toc-item"><li><span><a href="#Stop-Words" data-toc-modified-id="Stop-Words-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Stop Words</a></span></li><li><span><a href="#Tokenize" data-toc-modified-id="Tokenize-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Tokenize</a></span></li></ul></li><li><span><a href="#WordClouds" data-toc-modified-id="WordClouds-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>WordClouds</a></span><ul class="toc-item"><li><span><a href="#We-generate-a-WordCloud-of-a-song" data-toc-modified-id="We-generate-a-WordCloud-of-a-song-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>We generate a WordCloud of a song</a></span></li><li><span><a href="#We-can-also-generate-it-from-a-column-of-an-entire-dataframe" data-toc-modified-id="We-can-also-generate-it-from-a-column-of-an-entire-dataframe-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>We can also generate it from a column of an entire dataframe</a></span></li></ul></li><li><span><a href="#We-translate" data-toc-modified-id="We-translate-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>We translate</a></span></li><li><span><a href="#Sentiment-analysis" data-toc-modified-id="Sentiment-analysis-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Sentiment analysis</a></span><ul class="toc-item"><li><span><a href="#TextBlob" data-toc-modified-id="TextBlob-6.1"><span 
class="toc-item-num">6.1&nbsp;&nbsp;</span>TextBlob</a></span></li><li><span><a href="#NLTK" data-toc-modified-id="NLTK-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>NLTK</a></span></li></ul></li></ul></div>

In [None]:
#!pip install googletrans==4.0.0-rc1
#!pip install spacy
#!pip install nltk
#!pip install wordcloud
#!pip install langdetect
#!pip install textblob
#!python -m spacy download es_core_news_sm
#!python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_sm

In [None]:
# Data management
import pandas as pd
import string

# Databases
import sqlalchemy as alch
from getpass import getpass
from pymongo import MongoClient

# Languages
import re

import spacy
import es_core_news_sm

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords

from wordcloud import WordCloud
from langdetect import detect
from textblob import TextBlob

# Visualization
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline

## The data


## We bring all the data to a dataframe from MySQL

In [None]:
password = getpass("Enter your password: ")
dbName = "spotify"
connectionData=f"mysql+pymysql://root:{password}@localhost/{dbName}"
engine = alch.create_engine(connectionData)

In [None]:
query = "SELECT * FROM newone"
df = pd.read_sql_query(query, engine)
df

In [None]:
print(df.iloc[3]["lyrics"][:10])

In [None]:
as_it_was = df.iloc[3]["lyrics"]
as_it_was

In [None]:
len(as_it_was.split(" "))

## NLP

### Stop Words

Stop words are words that carry little meaning on their own, such as articles, pronouns and prepositions. They are filtered out before or after processing natural language data.

See the [spaCy library documentation](https://spacy.io/api/doc).

In [None]:
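# Load the English model first: nlp must exist before its stop-word list is read
nlp = spacy.load("en_core_web_sm")
stop = nlp.Defaults.stop_words
#stop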

In [None]:
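new_list = []
# split() with no argument also breaks on newlines, which lyrics contain
for element in as_it_was.split():
    # spaCy's stop words are lowercase, so compare in lowercase
    if element.lower() not in stop:
        new_list.append(element)
string_without_stop = " ".join(new_list)
print(string_without_stop)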
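
Splitting on whitespace leaves punctuation glued to the words, so a token such as `you,` never matches the stop-word list. The following is a minimal sketch of a more tolerant filter. It assumes a tiny hypothetical stop-word set (`SAMPLE_STOP_WORDS`) and a made-up English sentence instead of spaCy's list and the real lyrics.

```python
import re

# Hypothetical, deliberately tiny stop-word set; spaCy's list is much larger
SAMPLE_STOP_WORDS = {"in", "the", "it", "is", "you"}

def remove_stop_words(text, stop_words):
    # Lowercase the text and keep only runs of ASCII letters and apostrophes,
    # which drops punctuation attached to words (enough for this English sample)
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(word for word in words if word not in stop_words)

sample = "In the morning, it is the music. You know it!"
cleaned = remove_stop_words(sample, SAMPLE_STOP_WORDS)
print(cleaned)
```

The pattern only covers unaccented letters, so Spanish text would need a wider character class or a proper tokenizer such as spaCy's.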

### Tokenize
One way to normalize our tokens is through stemming and lemmatization.
Stemming strips suffixes to reduce a word to its stem, which is not always a real word. Lemmatization is somewhat more complex: it analyzes the vocabulary and the morphology of the word to return its base form (unconjugated, singular, etc.).
[This article](https://medium.com/escueladeinteligenciaartificial/procesamiento-de-lenguaje-natural-stemming-y-lemmas-f5efd90dca8) covers both in more detail.
We are going to tokenize after removing the stop words.

![](https://d2mk45aasx86xg.cloudfront.net/difference_between_Stemming_and_lemmatization_8_11zon_452539721d.webp)
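
To make the difference concrete, here is a small sketch that contrasts the two. It uses NLTK's `PorterStemmer` for stemming. For lemmatization it uses `TOY_LEMMAS`, a hand-written lookup table invented for this example, because a real lemmatizer (such as spaCy's `token.lemma_`) needs a downloaded model.

```python
from nltk.stem import PorterStemmer

words = ["holding", "holds", "studies"]

# Stemming: rule-based suffix stripping; the result need not be a real word
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# Lemmatization stand-in: a hypothetical lookup table written by hand.
# A real lemmatizer derives this from vocabulary and morphology.
TOY_LEMMAS = {"holding": "hold", "holds": "hold", "studies": "study"}
lemmas = [TOY_LEMMAS[word] for word in words]

for word, stem, lemma in zip(words, stems, lemmas):
    print(f"{word:10} stem={stem:8} lemma={lemma}")
```

The stem of `studies` is not a dictionary word, while its lemma is. This is why lemmatization is used in the cells that follow.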

In [None]:
nlp = spacy.load("en_core_web_sm")
nlp

In [None]:
tokens = nlp(string_without_stop)
#tokens

In [None]:
lemmatized = []
for token in tokens:
    lemmatized.append(token.lemma_)

In [None]:
detect("what is this language")

We are going to write a function that tokenizes the lyrics of our songs regardless of whether they are in Spanish or English.

In [None]:
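# Load each spaCy model once; reloading a model on every call is slow
MODELS = {
    "en": spacy.load("en_core_web_sm"),
    "es": spacy.load("es_core_news_sm"),
}

def tokenizer(txt):
    try:
        # Detect once: langdetect can give different answers between calls
        language = detect(txt)
    except Exception:
        return "Not able to analyze"
    if language not in MODELS:
        return "Not english nor spanish"

    tokens = MODELS[language](txt)
    filtered = []
    for token in tokens:
        if not token.is_stop:
            lemma = token.lemma_.lower().strip()
            # Keep alphabetic lemmas only: drops punctuation and digits,
            # keeps accented letters such as those in Spanish words
            if lemma.isalpha():
                filtered.append(lemma)
    return " ".join(filtered)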
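
When filtering lemmas from Spanish and English lyrics, it matters whether the character test accepts accented letters. The sketch below compares an ASCII-only pattern with `str.isalpha()` on a few sample strings. The word list is made up for illustration.

```python
import re

# Pattern that accepts unaccented Latin letters only
ASCII_ONLY = re.compile(r"^[a-zA-Z]+$")

candidates = ["love", "corazón", "niño", "?", "24"]

for lemma in candidates:
    ascii_match = bool(ASCII_ONLY.search(lemma))
    print(f"{lemma!r:12} ascii-only={ascii_match!s:6} isalpha={lemma.isalpha()}")

# What each filter would keep
kept_ascii = [word for word in candidates if ASCII_ONLY.search(word)]
kept_alpha = [word for word in candidates if word.isalpha()]
```

Both filters drop punctuation and digits, but only `str.isalpha()` keeps words such as `corazón`.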

In [None]:
detect("diga'm-ho bé")

In [None]:
detect("no vestiu els nostres boscos de dol")

In [None]:
detect("takk fyrir")

In [None]:
detect("salam")

In [None]:
detect("som-hi")

In [None]:
detect("hello how are you doing")

In [None]:
tokenizer("hello how are you doing")

In [None]:
df["tokenized"] = df["lyrics"].apply(tokenizer)
df

We check that it works by passing the lyrics of one song to the function.

In [None]:
test = tokenizer(df.loc[8]["lyrics"])
test

In [None]:
# SO FAR: remove stop words
# TOKENIZE THE STRINGS
# WITH THE STRING TOKENIZED -> lemmatization
# NEXT STEP: create lemmatization of the words: holding / holds / etc into hold

# RECAP

**FIRST PART**
- I get the link for a spotify playlist
- I sign up as a spotify developer
- I get the token for authentication
- I get the token for doing queries (I get the token by doing one request)
- I get all the info from a playlist
- I save songs, users into a dataframe

- I use another API to get lyrics for every song

**SECOND PART**
- I create a relational design for a database
- I create check functions to filter out inserts that already exist
- I loop over the dataframe to try to insert those values in the tables
- *pending debugging*

**THIRD PART**
- Clean the text: remove stop words
- Tokenize: isolating the words
- From the tokens: filter symbols
- We can do lemmatization

- THEN: df w/lyrics, & processed lyrics
- Polarity and subjectivity of those texts

Goal: ETL & a bit of analysis

In [None]:
print(df.iloc[0]["lyrics"])

## WordClouds
A word cloud, or tag cloud, is a visual representation of the words that make up a text, in which words that appear more frequently are drawn larger.

![wordcloud](https://i.imgur.com/8I8aJ1N.png)
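
The link between frequency and size can be made concrete by counting words. This is a small sketch on a made-up tokenized string (`sample_tokens` is an assumption, not data from the notebook), using only the standard library.

```python
from collections import Counter

# Made-up, already tokenized text standing in for one row of the "tokenized" column
sample_tokens = "love night love dance night love"

# Count how often each word appears; a word cloud scales words by these counts
frequencies = Counter(sample_tokens.split())
print(frequencies.most_common())

# Removing duplicates with set() discards exactly this frequency information
unique_words = set(sample_tokens.split())
print(len(unique_words))
```

A string with its repetitions intact lets the word cloud reflect frequency. If the words are deduplicated first, every word is weighted equally.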

### We generate a WordCloud of a song

In [None]:
test = df.iloc[5]["tokenized"]
test

In [None]:
# having a string with no repeated words

" ".join(set(test.split(" ")))

In [None]:
wordcloud = WordCloud(width=1600,height=400).generate(" ".join(set(test.split(" "))))
plt.figure(figsize=(15,10), facecolor="k")
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
#plt.savefig('images/wordcloud.png', facecolor='k', bbox_inches='tight')
plt.show();

In [None]:
wordcloud = WordCloud(width=1600,height=400).generate(test)  # full text with repetitions, so word size reflects frequency
plt.figure(figsize=(15,10), facecolor="k")
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
#plt.savefig('images/wordcloud.png', facecolor='k', bbox_inches='tight')
plt.show();

### We can also generate it from a column of an entire dataframe
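
One way to do this is to merge every row of the column into a single string. The sketch below assumes a hypothetical stand-in dataframe (`sample_df`) with a `tokenized` column like the one built earlier. Rows without lyrics are dropped before joining.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe
sample_df = pd.DataFrame({"tokenized": ["love night dance", None, "night sky love"]})

# Drop missing rows, then merge every song into a single string
all_lyrics = " ".join(sample_df["tokenized"].dropna())
print(all_lyrics)
```

The resulting string can be passed to `WordCloud(...).generate()` in the same way as the single-song text.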

## We translate
Somewhat to our regret, although there are libraries that work in Spanish (spaCy's Spanish model works very well), NLP tools generally work better in English, so let's translate the lyrics.
The TextBlob library, which we will use later for sentiment analysis, can also translate, but we are going to use googletrans instead. Be careful when installing it: the stable release has known issues, so install the pre-release version pinned below.
We create a column in the dataframe with all the translated lyrics and keep the original as well, in case we need it.

⚠️ PLEASE INSTALL THE LIBRARY EXACTLY AS SHOWN BELOW ⚠️ [stackoverflow](https://stackoverflow.com/questions/52455774/googletrans-stopped-working-with-error-nonetype-object-has-no-attribute-group)

`pip install googletrans==4.0.0-rc1`

In [None]:
# Let's see how to translate a sentence

In [None]:
import googletrans
trans = googletrans.Translator()

In [None]:
esp = "que tengas un buen day"
en = trans.translate(esp, dest="en")
en.text

In [None]:
cat = "no vestiu els nostres cosos de vermell"

In [None]:
en = trans.translate(cat, dest="en")
en.text

In [None]:
en.text

In [None]:
# detect() tells us which language a text is written in
# we can then pass the text to Google Translate
# so that all the lyrics end up in the same language

In [None]:
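def translate_into(string):
    try:
        trans = googletrans.Translator()
        detect(string)  # raises if the language cannot be detected, which triggers the fallback
        first = trans.translate(string, dest="en")
        return first.text
    except Exception:
        return string  # fall back to the original text instead of returning None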

In [None]:
translate_into("labas rytas")

In [None]:
df["tokenized_en"] = df["tokenized"].apply(translate_into)
df

As before, we wrap the logic in a function so that it can be automated and the code reused.

In [None]:
df

## Sentiment analysis
### TextBlob
`TextBlob(the_string).sentiment`

**Arguments:** `string`<br>
**Returns:** `polarity` & `subjectivity`


The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float in the range [-1.0, 1.0], where -1.0 is very negative and 1.0 is very positive. Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

TextBlob builds on two libraries, NLTK and pattern. See the [documentation](https://textblob.readthedocs.io/en/dev/) and this [introductory tutorial](https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/).

In [None]:
blob = TextBlob("This is the worst and it sucks")
blob

In [None]:
blob.sentiment

In [None]:
blob.sentiment.polarity

In [None]:
blob.sentiment.subjectivity

### NLTK
The Natural Language Toolkit, or more commonly NLTK, is a set of symbolic and statistical natural language processing libraries and programs for the Python programming language. NLTK includes graphical demonstrations and sample data.

In this case we will also get the polarity, using the [SentimentIntensityAnalyzer](https://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.vader) class from the `nltk.sentiment.vader` module.

`sia.polarity_scores(the_string)`

**Arguments:** `string`<br>
**Returns:** a dictionary with the `neg`, `neu`, `pos` and `compound` scores

In [None]:
nltk.downloader.download('vader_lexicon')

In [None]:
sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores(df.iloc[5]["tokenized_en"])

In [None]:
df.iloc[5]["tokenized_en"]

In [None]:
a_sentence = "the table is red"

In [None]:
sia.polarity_scores(a_sentence)

In [None]:
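def sa(x):
    # Return only the compound score so the column is numeric and can be averaged
    try:
        return sia.polarity_scores(x)["compound"]
    except Exception:
        return float("nan")  # texts that cannot be scored are skipped by mean()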

In [None]:
sa("You'd Be Crazy to Miss This Summer Super Sale ~2.50 Acres Rosamond, Kern County, CA")

In [None]:
sa("Your Wait Is Over, Rush Today For Sumer Sale!!!")

In [None]:
sa("This 2.50-acre parcel is nestled in Kern County, CA which is a great location from where you get easy access to everywhere. It is located just a few hours from Los Angeles. This is a good place for people who want to retreat from society and get away from it all to experience the ultimate relaxation that they’ve been dreaming of!Let the possibilities wash over you as you explore a stress-free and peaceful life at this quiet location in Mojave. It’s high time to allow yourself to get rid of the modern robotic life and embrace the brand-new simplified lifestyle. Don’t wait another decade to only wish you would have invested in this land!")

Information about the [compound](https://github.com/cjhutto/vaderSentiment#about-the-scoring) score:
it is the sum of the valence scores of each word in the text, normalized to lie between -1 (most negative) and 1 (most positive).
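
The normalization step can be sketched in a few lines. This follows the formula as I understand it from the VADER reference implementation, `x / sqrt(x² + alpha)` with `alpha = 15`. The classification thresholds (±0.05) are the ones suggested in the VADER README. The raw sums in the loop are invented inputs, not scores from the lyrics.

```python
import math

def normalize(score, alpha=15):
    # Squash an unbounded sum of valence scores into the range [-1, 1]
    norm_score = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm_score))

def label(compound):
    # Thresholds suggested in the VADER README
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Invented raw sums, for illustration only
for raw_sum in (-4.0, 0.0, 1.5):
    compound = normalize(raw_sum)
    print(raw_sum, round(compound, 4), label(compound))
```

Because of the square root in the denominator, long texts with many emotional words approach -1 or 1 but never exceed them.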

In [None]:
df.sample()

In [None]:
df["sentiment"] = df["tokenized_en"].apply(sa)

In [None]:
df

In [None]:
summary = df.groupby(["ironhacker"])["sentiment"].mean().sort_values().to_frame().reset_index()

In [None]:
summary
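
The same group-and-average pattern, shown on a small invented dataframe so the result is easy to check by hand. The names and scores in `sample` are hypothetical, and one score is missing to show how `mean()` treats it.

```python
import pandas as pd

# Hypothetical data: the names and scores are invented for illustration
sample = pd.DataFrame({
    "ironhacker": ["ana", "ana", "luis", "luis"],
    "sentiment": [0.8, 0.2, -0.5, float("nan")],
})

sample_summary = (
    sample.groupby("ironhacker")["sentiment"]
    .mean()          # NaN rows are skipped when averaging
    .sort_values()   # most negative person first
    .reset_index()
)
print(sample_summary)
```

The average only works when the `sentiment` column holds numbers, which is why a single score per song is needed instead of a dictionary of scores.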

In [None]:
fig = px.bar(summary, x="ironhacker", y="sentiment")  # plot the per-person averages computed above
fig.show()

- NLP

- Work with strings: regex
- Python: split, replace, move things around

- Tokenization: words
- Lemmatization: roots of the words

- Wordclouds: generate images and save them
- Numeric values out of texts: compound & subjectivity 

- Group by, plot, I can see the sentiment analysis for some text