<h1 style="color: #492c68;">NLP: Sentiment Analysis</h1>

We are going to dive into a basic introduction to <mark>Sentiment Analysis</mark> through <mark>Natural Language Processing</mark>.

- As always, we read our dataset. Let's see what we got

In [None]:
## Basic Libraries
import pandas as pd
import numpy as np

## EDA Libraries
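import matplotlib.pyplot as plt  # plt conventionally refers to the pyplot submodule
import seaborn as sns
import plotly.express as px  # px conventionally refers to plotly.express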

## Settings configuration
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv("reviews.csv")

In [None]:
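## Normalize column names: lowercase, underscores instead of spaces and dots, no colons

# regex=False makes every replacement literal, so "." is not read as a regex wildcard
df.columns = df.columns.str.lower().str.replace(" ", "_", regex=False).str.replace(".", "_", regex=False).str.replace(":", "", regex=False)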

In [None]:
df.shape

In [None]:
df.head(5)

From df, we are interested in <mark>"score"</mark>, <mark>"summary"</mark> and <mark>"text"</mark>

- "text": the most important variable in our project. These texts will be broken down into tokens and processed to figure out their sentiment.
- "score" and "summary": useful for comparing against our new results.

In [None]:
df = df[["text", "score", "summary"]]

In [None]:
df

- Before we go further, let's plot "score" to see its distribution and have an idea of the data we are working with

In [None]:
sns.countplot(data=df, x="score")  # one bar per rating, bar height = number of reviews

- The dataset is too large. It is easier to work with a fraction of it, so we keep the first 500 rows.

In [None]:
df = df.head(500)

<h2 style="color: #327a81;">NLP Preprocessing: NLTK Basics</h2>

<mark>NLTK</mark> stands for <mark>Natural Language Toolkit</mark>. It's a basic tool for <mark>NLP beginners</mark> and <mark>quick model processing</mark>. It's not too fancy, but it is useful for understanding the basics.

- Let's see how NLTK works

In [None]:
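## Importing the library and downloading its resources

import nltk
nltk.download('punkt')  # tokenizer package
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
# Note: newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead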

- For this demo, we take an arbitrary index as an example

In [None]:
example = df["text"][65]
print(example)

- NLTK can tokenize any text. This is a crucial step in ML preprocessing.
- Tokenizing may seem similar to splitting strings, but it does more than that: for example, it separates punctuation from words and splits contractions.
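
To see why tokenizing is not the same as plain splitting, here is a small sketch that only needs the standard library. The regular expression is a simplified stand-in chosen for illustration, not the algorithm behind `nltk.word_tokenize`, and the sample sentence is made up.

```python
import re

sentence = "I loved it, but it's pricey!"

# Plain splitting only cuts on whitespace, so punctuation stays glued to the words
split_tokens = sentence.split()

# Simplified tokenizer: a word (optionally with an inner apostrophe) or one punctuation mark
simple_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

print(split_tokens)
print(simple_tokens)
```

Unlike this sketch, NLTK also splits contractions into separate tokens, which is why its output for the review text looks different from a simple split.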

In [None]:
tokens = nltk.word_tokenize(example) # word_tokenize breaks the string into a list of word and punctuation tokens
print(tokens)

In [None]:
nltk.pos_tag(tokens) # pos_tag labels each token with its part of speech. Search for the Penn Treebank tag list to learn what the abbreviations mean

<h2 style="color: #327a81;">Sentiment Analysis: VADER and RoBERTa</h2>

There are a lot of ways to do Sentiment Analysis. For this introduction, we will look at two different approaches.

- <mark>VADER</mark> to see the <mark>positive/negative valence</mark> of the sentences (basic)
- <mark>RoBERTa</mark> to do a <mark>mood analysis</mark> (complex)

<h3 style="color: #60b671;">VADER</h3>

VADER stands for <mark>Valence Aware Dictionary and sEntiment Reasoner</mark>. This NLTK tool is used to figure out whether a piece of text expresses positive, negative or neutral emotions.

- This method uses a <mark>"bag of words"</mark> approach. This means that the analysis relies on the weight of each word: every word carries its own positive or negative charge, with little awareness of the wider context.
- It can be <mark>more accessible for beginners</mark> than RoBERTa. It's recommended for fast prototypes, but it has many limitations (it struggles with context and with intentions such as sarcasm).
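
The word-weight idea can be sketched with a toy scorer. This is only an illustration under stated assumptions: the mini-lexicon and its valence values are invented, and the real VADER lexicon is far larger and adds heuristics (for negation and intensifiers, for instance). The squashing step follows the form VADER uses to normalize its compound score, with `alpha=15`.

```python
import math

# Hypothetical mini-lexicon: these valence values are made up for illustration
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5, "awful": -3.0}

def toy_compound(text, alpha=15):
    """Sum the valence of each known word, then squash the total into (-1, 1)."""
    words = text.lower().split()
    total = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    return total / math.sqrt(total * total + alpha)

print(toy_compound("great food but awful service"))  # opposite valences cancel out
print(toy_compound("good food and great service"))   # positive total
print(toy_compound("not good"))                      # still positive: "not" is ignored
```

The last call shows the main weakness of a pure bag of words: without extra rules, the negation is invisible to the scorer.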

In [None]:
## Importing the NLTK sentiment analyzer

from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon required by SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

- Once we have defined our sia, we can test it with our previous example

In [None]:
print(example)

In [None]:
sia.polarity_scores(example) # neg/neu/pos are proportions between 0 and 1; compound ranges from -1 to 1

- The scores for the example look reasonable. The text leans neutral/positive, with little or no negative weight.
- Does this match the real score and its summary?
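
When a single label is easier to compare with the ratings than four numbers, the compound value can be bucketed. The sketch below assumes the commonly used cut-offs of 0.05 and -0.05; treat them as a convention, not as something fixed by this notebook. The helper name `label_from_compound` and the sample dictionary are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a compound score in [-1, 1] into a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical dictionary shaped like the output of sia.polarity_scores(...)
sample_scores = {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.8}

print(label_from_compound(sample_scores["compound"]))
```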

In [None]:
df.loc[65] # Check the example's polarity with real ratings, to see if it fits

Now that we know how polarity scores work, let's apply this function to all texts

In [None]:
# Polarity Scores through Iterrows on the "text" column

pol_scores = {}

for i, row in df.iterrows():
    text = row["text"]
    scores = sia.polarity_scores(text)
    pol_scores[i] = scores

- We can compare the polarity scores with "score" and "summary" to check if they make sense.

In [None]:
pd.DataFrame(pol_scores).T

In [None]:
# Let's unify both dataframes (original + polarity scores) to see if our sentiments match

vaders = pd.DataFrame(pol_scores).T

vaders = pd.concat([df, vaders], axis=1)  

- Let's check how the reviews with a score of 1 fared after the polarity analysis. We would expect them to lean toward the negative side

In [None]:
vaders.groupby("score").get_group(1)

- We can compare ratings with polarity scores through a quick plot.
- We can use "compound" to see this.

In [None]:
sns.barplot(data=vaders, x="score", y="compound")
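
By default, `sns.barplot` draws the mean of `y` for each value of `x`. A minimal sketch of that aggregation, using only the standard library and invented (score, compound) pairs, not values from the dataset:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (score, compound) pairs, made up for illustration
rows = [(1, -0.6), (1, -0.2), (3, 0.1), (5, 0.8), (5, 0.6)]

# Group the compound values by rating
by_score = defaultdict(list)
for score, compound in rows:
    by_score[score].append(compound)

# The height of each bar corresponds to the mean compound value of its group
mean_compound = {score: mean(values) for score, values in sorted(by_score.items())}
print(mean_compound)
```

If VADER agrees with the ratings, these means should rise as the score goes from 1 to 5.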

<h3 style="color: #60b671;">RoBERTa</h3>

RoBERTa stands for <mark>Robustly Optimized BERT Pretraining Approach</mark>. It is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model.

- Transformers are a step forward in the NLP field. This deep learning architecture is built around the <mark>attention mechanism</mark>, which allows it to weigh the different parts of an input sequence against each other.
- The attention mechanism is key to capturing a text's <mark>context</mark>. Complex syntactic structures or different intentions, like sarcasm, can be handled better by Transformers.
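
The core of the attention mechanism can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention, assuming tiny hand-written vectors; real Transformers first pass the token embeddings through learned projections to obtain queries, keys and values, and run many attention heads in parallel.

```python
import math

def softmax(values):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Hypothetical 2-dimensional vectors standing in for three tokens
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(tokens, tokens, tokens)
print(result)
```

Every token ends up represented as a blend of all tokens in the sequence, weighted by similarity, which is how context flows between words.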

In [None]:
## Importing Transformers specific classes

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

- We will use a pretrained RoBERTa-based model named Emotion English DistilRoBERTa-base. It's a fine-tuned model that can <mark>classify text into emotion moods</mark> (anger, disgust, fear, joy, neutral, sadness and surprise).

In [None]:
# Name of the pretrained model on the Hugging Face Hub

MODEL = "j-hartmann/emotion-english-distilroberta-base"

In [None]:
# Let's see how it works through a pipeline example

from transformers import pipeline

# top_k=None returns the score of every label (it replaces the deprecated return_all_scores=True)
classifier = pipeline("text-classification", model=MODEL, top_k=None)

In [None]:
# Try any example message

classifier("It was really nice to meet you")
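
The classifier returns one dictionary per emotion, each with a `label` and a `score`. A small sketch for picking the dominant emotion follows; the numbers are invented for illustration, and the helper name `top_emotion` is hypothetical. Depending on the transformers version, the result for a single string may be wrapped in an extra list, in which case the first element would be passed instead.

```python
# Hypothetical scores shaped like the classifier output for one text
sample_output = [
    {"label": "anger", "score": 0.01},
    {"label": "disgust", "score": 0.01},
    {"label": "fear", "score": 0.01},
    {"label": "joy", "score": 0.90},
    {"label": "neutral", "score": 0.04},
    {"label": "sadness", "score": 0.01},
    {"label": "surprise", "score": 0.02},
]

def top_emotion(scores):
    """Return the entry with the highest score."""
    return max(scores, key=lambda item: item["score"])

best = top_emotion(sample_output)
print(best["label"], best["score"])
```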

- After seeing how the pipeline works, let's apply the model to our texts

In [None]:
# Load the tokenizer and the model that we will use

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
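# Define the function for obtaining the scores

import torch

def mooder(text):
    max_length = 512  # longest sequence the model accepts; longer texts are truncated
    encoded_text = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True, padding="longest")
    with torch.no_grad():  # inference only, so gradients are not needed
        output = model(**encoded_text)
    # Softmax turns the raw logits into probabilities, comparable to the pipeline scores
    scores = torch.softmax(output.logits[0], dim=-1).numpy()
    # Label order assumed to match model.config.id2label
    moods = {
        "anger": scores[0],
        "disgust": scores[1],
        "fear": scores[2],
        "joy": scores[3],
        "neutral": scores[4],
        "sadness": scores[5],
        "surprise": scores[6]
    }

    return moods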

In [None]:
roberta =  df["text"].apply(mooder) # We apply the function defined before to have all results

In [None]:
roberta = pd.DataFrame(roberta)
roberta

In [None]:
roberta_scores = pd.json_normalize(roberta["text"]) # Expand each dictionary into one column per emotion

In [None]:
moods = pd.concat([df, roberta_scores], axis=1)

- Let's take a sneak peek at the results

In [None]:
moods.sample(10)

- We can sort by any mood to check whether its scores match our expectations

In [None]:
moods.sort_values(by="disgust", ascending=False).head(15)

- As we did before with VADER, we can plot the ratings against a specific mood score to see if they fit

In [None]:
sns.barplot(data=moods, x="score", y="disgust")