# <center> Natural Language Processing (NLP)</center>
[Natural language processing](https://es.wikipedia.org/wiki/Procesamiento_de_natural_languages) (NLP; abbreviated PLN in Spanish) is a field of computer science, artificial intelligence, and linguistics that studies the interactions between computers and human language. It deals with formulating and investigating computationally efficient mechanisms for communication between people and machines through natural language, that is, the world's languages. The goal is not to study communication through natural languages in the abstract, but to design communication mechanisms that are computationally efficient, meaning that they can be carried out by programs that execute or simulate communication.

![elgif](https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif)

NLP is considered one of the great challenges of artificial intelligence because it raises some of the hardest questions in the field: how can a program really understand the meaning of a text? How can it understand neologisms, irony, jokes, or poetry? If the strategy or algorithm we use does not overcome these difficulties, the results will be of little use to us.
In NLP it is not enough to understand individual words: you must understand the set of words that make up a sentence and the set of sentences that make up a paragraph, giving a global meaning to the text or discourse in order to draw good conclusions.

Our language is full of ambiguities: words with several meanings, turns of phrase, and senses that change depending on the context. This makes NLP one of the most difficult tasks to master.

The difficulty of NLP therefore shows up at several levels:

Ambiguity:

- Lexical level: a single word can have several meanings
- Referential level: anaphora, metaphors, etc.
- Structural level: semantics is needed to understand the structure of a sentence
- Pragmatic level: double meanings, irony, humor
- Gap detection

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#The-data" data-toc-modified-id="The-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The data</a></span></li><li><span><a href="#We-bring-all-the-data-to-a-dataframe-from-MySQL" data-toc-modified-id="We-bring-all-the-data-to-a-dataframe-from-MySQL-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>We bring all the data to a dataframe from MySQL</a></span></li><li><span><a href="#We-translate" data-toc-modified-id="We-translate-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>We translate</a></span></li><li><span><a href="#Sentiment-analysis" data-toc-modified-id="Sentiment-analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Sentiment analysis</a></span><ul class="toc-item"><li><span><a href="#TextBlob" data-toc-modified-id="TextBlob-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>TextBlob</a></span></li><li><span><a href="#NLTK" data-toc-modified-id="NLTK-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>NLTK</a></span></li></ul></li><li><span><a href="#Adding-to-SQL" data-toc-modified-id="Adding-to-SQL-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Adding to SQL</a></span></li><li><span><a href="#Further-processing" data-toc-modified-id="Further-processing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Further processing</a></span><ul class="toc-item"><li><span><a href="#Tokenize:-lemmatization" data-toc-modified-id="Tokenize:-lemmatization-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Tokenize: lemmatization</a></span></li><li><span><a href="#Entity-recognition" data-toc-modified-id="Entity-recognition-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Entity recognition</a></span></li></ul></li><li><span><a href="#WordClouds" data-toc-modified-id="WordClouds-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>WordClouds</a></span><ul class="toc-item"><li><span><a href="#We-generate-a-WordCloud-of-a-song" 
data-toc-modified-id="We-generate-a-WordCloud-of-a-song-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>We generate a WordCloud of a song</a></span></li><li><span><a href="#We-can-also-generate-it-from-a-column-of-an-entire-dataframe" data-toc-modified-id="We-can-also-generate-it-from-a-column-of-an-entire-dataframe-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>We can also generate it from a column of an entire dataframe</a></span></li></ul></li></ul></div>

In [None]:
#!pip install googletrans==4.0.0-rc1
#!pip install spacy
#!python -m spacy download es_core_news_sm
#!pip install nltk
#!pip install wordcloud
#!pip install langdetect
#!pip install textblob
#!python -m spacy download en_core_web_lg
#!python -m spacy download en_core_web_sm

In [None]:
# Data management
import pandas as pd
import string

# Databases
import sqlalchemy as alch
from getpass import getpass
from pymongo import MongoClient

# Languages
import re

import spacy
import es_core_news_sm

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords

from wordcloud import WordCloud
from langdetect import detect
from textblob import TextBlob

# Visualization
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline

## The data


## We bring all the data to a dataframe from MySQL
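A minimal sketch of this step, assuming the database is reachable through SQLAlchemy with the `pymysql` driver. The database name `lyrics`, the table name `songs`, and both helper functions are hypothetical names chosen for illustration, not the notebook's original code.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, driver="pymysql"):
    """Build a SQLAlchemy connection URL (the driver name is an assumption)."""
    # quote_plus escapes characters such as '@' or '/' inside the password.
    return f"mysql+{driver}://{user}:{quote_plus(password)}@{host}/{database}"


def load_dataframe(url, query):
    """Run `query` against the database at `url` and return a pandas DataFrame."""
    # Imported inside the function so the URL helper has no third-party dependency.
    import pandas as pd
    import sqlalchemy as alch

    engine = alch.create_engine(url)
    with engine.connect() as connection:
        return pd.read_sql(alch.text(query), connection)


# Hypothetical usage; getpass keeps the password out of the notebook:
# from getpass import getpass
# url = build_mysql_url("root", getpass("MySQL password: "), "localhost", "lyrics")
# df = load_dataframe(url, "SELECT * FROM songs")
```

Escaping the password matters because a raw `@` or `/` would otherwise be read as part of the URL structure.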

## We translate
Somewhat to our regret, although some libraries work in Spanish (spaCy's Spanish-trained models work very well), NLP libraries generally work better in English. Other libraries are less accurate, and even spaCy performs best in English, so let's translate the lyrics.
The TextBlob library, which we will use later for sentiment analysis, can also translate, but we will use googletrans instead. Be careful when installing it:
`pip install googletrans==4.0.0-rc1`
You have to install this pre-release version because the official release has issues.
We create a column in the dataframe with all the translated lyrics and keep the original as well, in case we need it.

⚠️ PLEASE INSTALL THE LIBRARY AS IT SAYS ABOVE ⚠️ [stackoverflow](https://stackoverflow.com/questions/52455774/googletrans-stopped-working-with-error-nonetype-object-has-no-attribute-group)

`pip install googletrans==4.0.0-rc1`

In [None]:
# Let's see how to translate a sentence

In [None]:
import googletrans

As before, we keep automating and wrapping everything in functions so that we can reuse the code.
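One possible shape for such a reusable function is sketched here. It assumes the translator exposes the synchronous `translate(text, dest=...)` call of the pinned googletrans pre-release and that the language detector behaves like `langdetect.detect`; other releases may behave differently. The function name and the stand-in translator are hypothetical, and both dependencies are passed in as arguments so the sketch runs offline.

```python
from types import SimpleNamespace


def translate_to_english(text, translator, detect_language):
    """Return `text` in English, skipping the request when it is already English.

    `translator` needs a `.translate(text, dest=...)` method whose result has a
    `.text` attribute; `detect_language` maps a string to a language code.
    """
    if not text or not text.strip():
        return text  # nothing to translate
    try:
        if detect_language(text) == "en":
            return text
    except Exception:
        # Detection can fail on very short or symbol-only text;
        # in that case we let the translator decide.
        pass
    return translator.translate(text, dest="en").text


# Stand-in translator (hypothetical) so the sketch runs without network access.
fake_translator = SimpleNamespace(
    translate=lambda text, dest: SimpleNamespace(text=f"[{dest}] {text}")
)
print(translate_to_english("Hola mundo", fake_translator, lambda _: "es"))
```

With the real libraries, the call would look like `translate_to_english(lyrics, googletrans.Translator(), langdetect.detect)`, and it can be applied to a whole dataframe column with `.apply(...)`.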

## Sentiment analysis
### TextBlob
`TextBlob(the_string).sentiment`

**Arguments:** `string`<br>
**Returns:** `polarity` & `subjectivity`


The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float in the range [-1.0, 1.0]. Subjectivity is a float in the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.
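A small sketch of how that named tuple can be unpacked into values ready to become dataframe columns. The helper names are hypothetical, and the `Sentiment` tuple defined here is only a stand-in with the same shape as TextBlob's result, so the unpacking can be tried without TextBlob installed.

```python
from collections import namedtuple


def sentiment_to_row(sentiment):
    """Unpack a (polarity, subjectivity) named tuple into a plain dict."""
    polarity, subjectivity = sentiment
    return {"polarity": polarity, "subjectivity": subjectivity}


def textblob_sentiment(text):
    """Score `text` with TextBlob's default analyzer."""
    # Imported here so sentiment_to_row can be tried without TextBlob installed.
    from textblob import TextBlob

    return sentiment_to_row(TextBlob(text).sentiment)


# Stand-in with the same shape as TextBlob's result, to show the unpacking.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
print(sentiment_to_row(Sentiment(polarity=0.5, subjectivity=0.6)))
```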

TextBlob is built on top of two libraries, NLTK and pattern. Here are the [documentation](https://textblob.readthedocs.io/en/dev/) and an introductory tutorial:
https://www.analyticsvidhya.com/blog/2018/02/natural-language-processing-for-beginners-using-textblob/

### NLTK
The Natural Language Toolkit, or more commonly NLTK, is a set of symbolic and statistical natural language processing libraries and programs for the Python programming language. NLTK includes graphical demonstrations and sample data.

In this case we will also get the polarity, using the [SentimentIntensityAnalyzer](https://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.vader) class.

`sia.polarity_scores(the_string)`

**Arguments:** `string`<br>
**Returns:** polarity scores (`neg`, `neu`, `pos`, `compound`)

More information about the [compound](https://github.com/cjhutto/vaderSentiment#about-the-scoring) score:
it is the sum of the valence scores of the words in the text, normalized to lie between -1 and 1.
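To see what "normalized" means here, this sketch reproduces the squashing formula as we understand it from the VADER reference implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`. Treat the constant as an assumption to verify against the version you have installed. `compound_only` is a hypothetical helper that works with any object exposing `polarity_scores`.

```python
import math


def normalize(valence_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def compound_only(text, analyzer):
    """Return just the compound score from a VADER-style analyzer."""
    return analyzer.polarity_scores(text)["compound"]


# Larger sums move toward -1 or 1 but never reach them.
for total in (-8, -1, 0, 1, 8):
    print(total, round(normalize(total), 4))
```

With NLTK, the real analyzer is created with `SentimentIntensityAnalyzer()` after running `nltk.download("vader_lexicon")` once.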

We check that it works by passing a song's lyrics to the function.

## Adding to SQL

`alter table: new column`

`seed the database: row by row`
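The two steps above could look roughly like this with SQLAlchemy. This is a sketch under assumptions: the table `songs`, the column `polarity`, the key column `id`, and all function names are hypothetical. Table and column names cannot be sent as bound parameters, so they are validated before being placed in the statement, while the values go through bound parameters.

```python
def _safe_identifier(name):
    """Reject anything that is not a plain identifier before it reaches the SQL."""
    if not name.isidentifier():
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def add_column_sql(table, column):
    """Statement for the `alter table` step (FLOAT is an assumed column type)."""
    return (
        f"ALTER TABLE {_safe_identifier(table)} "
        f"ADD COLUMN {_safe_identifier(column)} FLOAT"
    )


def update_row_sql(table, column, key_column):
    """Statement for the row-by-row step; the values are bound parameters."""
    names = [_safe_identifier(name) for name in (table, column, key_column)]
    return "UPDATE {} SET {} = :value WHERE {} = :key".format(*names)


def store_scores(engine, table, column, key_column, scores):
    """Add `column` to `table`, then fill it from a {key: value} mapping."""
    from sqlalchemy import text  # imported here: the builders above are pure Python

    with engine.begin() as connection:  # commits on success, rolls back on error
        connection.execute(text(add_column_sql(table, column)))
        statement = text(update_row_sql(table, column, key_column))
        for key, value in scores.items():
            connection.execute(statement, {"value": value, "key": key})


print(add_column_sql("songs", "polarity"))
print(update_row_sql("songs", "polarity", "id"))
```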

## Further processing

spaCy library documentation: https://spacy.io/api/doc

### Tokenize: lemmatization
One way to normalize our tokens is through stemming and lemmatization.
Stemming consists of removing or replacing suffixes to obtain the root of the word. Lemmatization is a bit more complex: it analyzes the vocabulary and its morphology to return the base form of the word (unconjugated, singular, etc.).
Read [this](https://medium.com/escueladeinteligenciaartificial/procesamiento-de-lenguaje-natural-stemming-y-lemmas-f5efd90dca8) interesting article.
When tokenizing, we will first remove the stop words.

![](https://d2mk45aasx86xg.cloudfront.net/difference_between_Stemming_and_lemmatization_8_11zon_452539721d.webp)
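A minimal sketch of tokenizing with stop-word removal and lemmatization, assuming a spaCy pipeline is passed in as `nlp`. The function name is hypothetical; the token attributes `lemma_`, `is_stop`, `is_punct`, and `is_space` are part of spaCy's token API.

```python
def lemmatize(text, nlp):
    """Tokenize `text` with a spaCy-style pipeline and return lowercase lemmas,
    dropping stop words, punctuation, and whitespace tokens."""
    return [
        token.lemma_.lower()
        for token in nlp(text)
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


# Usage with a real English pipeline (requires the model to be downloaded):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# lemmatize("The cats were running through the gardens", nlp)
```

Passing `nlp` as an argument means the model is loaded once and reused for every song, which is much faster than loading it inside the function.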

### Entity recognition
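As a sketch of what this step can look like, assuming a spaCy pipeline with a named-entity component is passed in as `nlp`: each recognized entity is available in `doc.ents` with its text and its label. Both helper names are hypothetical.

```python
from collections import Counter


def extract_entities(text, nlp):
    """Return (entity text, entity label) pairs found by a spaCy-style pipeline."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def entity_label_counts(text, nlp):
    """Count how many entities of each label appear in `text`."""
    return Counter(label for _, label in extract_entities(text, nlp))


# Usage with a real English pipeline (requires the model to be downloaded):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# extract_entities("Some translated lyrics that mention people and places", nlp)
```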

## WordClouds
A word cloud or tag cloud is a visual representation of the words that make up a text, in which the words that appear more frequently are drawn larger.

![wordcloud](https://i.imgur.com/8I8aJ1N.png)

### We generate a WordCloud of a song
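A sketch of the idea for a single song: first count how often each word appears, then let the `wordcloud` library size each word by its count. The function names, the tokenizing regular expression, and the image settings are illustrative assumptions.

```python
import re
from collections import Counter


def word_frequencies(text, stop_words=frozenset()):
    """Count how often each word appears, ignoring case and stop words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(word for word in words if word not in stop_words)


def draw_wordcloud(frequencies):
    """Render the frequencies as an image; more frequent words are drawn larger."""
    # Imported here so the counting helper works without the plotting stack.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# Made-up text; the stop words here are a tiny hand-written set.
print(word_frequencies("Sing, sing a song; sing it loud!", stop_words={"a", "it"}))
```

In the notebook, the stop words could come from `nltk.corpus.stopwords.words("english")` instead of a hand-written set.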

### We can also generate it from a column of an entire dataframe
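For a whole column, one simple approach is to join every row into a single string and build the cloud from that. The sketch assumes the column may contain missing values, which are skipped. `column_to_text` and the column name `lyrics_en` are hypothetical.

```python
def column_to_text(values):
    """Join every string in an iterable (for example a dataframe column) into
    one text, skipping missing values such as None or NaN."""
    return " ".join(value for value in values if isinstance(value, str))


# Hypothetical usage with the dataframe of translated lyrics:
# from wordcloud import WordCloud
# text = column_to_text(df["lyrics_en"])
# cloud = WordCloud(background_color="white").generate(text)

print(column_to_text(["la la", None, float("nan"), "love"]))
```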