# Sentiment Analysis in Python

[Tutorial Video](https://www.youtube.com/channel/UCxladMszXan-jfgzyeIMyvw)

In this notebook we will do some sentiment analysis in Python using three different approaches:
1. VADER (Valence Aware Dictionary and sEntiment Reasoner) - Bag of words approach
2. Roberta Pretrained Model from 🤗 Hugging Face
3. Hugging Face Pipeline

# Step 0. Read in Data and NLTK Basics

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot') #Stylesheets for the plot

import nltk

In [None]:
# Read in the data (more than 500k rows)
df = pd.read_csv('../input/amazon-fine-food-reviews/Reviews.csv')
print(df.shape)
# Downsample the data to the first 500 rows
df = df.head(500)
print(df.shape)

In [None]:
df.head()

In [None]:
# content of first text
# df['Text'].values[0]

## Quick EDA
#### Exploratory Data Analysis
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
[Read More](https://www.ibm.com/topics/exploratory-data-analysis)

```
.value_counts()
```
- Counts how many times each value occurs.
- Here we count how many times each rating (1, 2, 3, 4, 5) occurs.
- Returns the counts ordered from most to least frequent.

```
.sort_index()
```
- Sorts the counts by index, so the ratings appear in order from 1 to 5.

```
.plot()
```
- Plots the counts as a bar graph titled "Count of Reviews by Stars".
- `figsize` sets the figure size as (width, height) in inches.

```
set_xlabel('Review Stars')
```
- Labels the x-axis of the plot.
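
To see what `.value_counts()` and `.sort_index()` each contribute, here is a minimal sketch on a small made-up series of ratings. The `scores` values are hypothetical and only stand in for `df['Score']`.

```python
import pandas as pd

# Hypothetical star ratings, standing in for df['Score']
scores = pd.Series([5, 5, 4, 1, 5, 3, 1, 4, 5, 2])

by_frequency = scores.value_counts()    # most common rating comes first
by_rating = by_frequency.sort_index()   # reordered by rating: 1, 2, 3, 4, 5

print(by_frequency)
print(by_rating)
```

Without `.sort_index()` the bars would be drawn in order of frequency rather than in order of star rating.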


In [None]:
ax = df['Score'].value_counts().sort_index() \
    .plot(kind='bar',
          title='Count of Reviews by Stars',
          figsize=(10, 5))
ax.set_xlabel('Review Stars')
plt.show()

We can see here that most reviews are 5-star reviews. The number of 1-star reviews is almost equal to the number of 3-star reviews, and 2-star reviews are the least common.

## Basic NLTK

We take the review text at index 50 as an example. This example is a negative review.

In [None]:
example = df['Text'][50]
print(example)

```
word_tokenize
```
- Splits the text into a list of tokens (words and punctuation)
- Splits on whitespace and separates symbols from words (including at the apostrophe)

```
[:10]
```
- Slicing to get the first 10 tokens

In [None]:
tokens = nltk.word_tokenize(example)
tokens[:10]

```
pos_tag()
```
- NLTK finds the part of speech each token belongs to
- The result is a list of (token, tag) pairs


[POS Tags](https://www.guru99.com/pos-tagging-chunking-nltk.html)

In [None]:
tagged = nltk.pos_tag(tokens)
tagged[:10]

`ne_chunk` takes these tagged tokens and groups them into chunks, labeling the named entities it recognizes.

In [None]:
entities = nltk.chunk.ne_chunk(tagged)
entities.pprint()

# Step 1. VADER Sentiment Scoring

We will use NLTK's `SentimentIntensityAnalyzer` to get the neg/neu/pos scores of the text.

- This uses a "bag of words" approach:
    1. Each word is looked up in a sentiment lexicon and scored.
    2. The word scores are combined into a total score.
    

> It takes all the words in the sentence.
> Each word has a positive, negative, or neutral value.
> These values are then added up to give a total score.

> Stop words such as "and" and "the" only give the sentence its structure, so they carry no sentiment value of their own.
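
The idea of scoring each word and adding the scores up can be sketched with a toy lexicon. This is not VADER's implementation: `TOY_LEXICON`, its values, and `toy_valence` are made up for illustration, and the real VADER lexicon is far larger and comes with extra rules (for negation, for example) that this sketch leaves out.

```python
# Toy lexicon with made-up valence values (hypothetical, not VADER's real lexicon)
TOY_LEXICON = {"happy": 2.7, "love": 3.2, "good": 1.9,
               "sad": -2.1, "worst": -3.1, "bad": -2.5}

def toy_valence(text):
    """Sum the valence of every known word; unknown words contribute 0."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

print(toy_valence("I am so happy!"))
print(toy_valence("This is the worst thing ever."))
print(toy_valence("I am not happy"))  # still positive: this toy ignores negation
```

Words such as "I", "am", and "the" are not in the lexicon, so they add nothing to the total.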

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm #progress bar tracker for looping the data

sia = SentimentIntensityAnalyzer() #initialize our sentiment analyzer

In [None]:
# Example:
sia.polarity_scores('I am so happy!')
# The result holds the negative (neg), neutral (neu) and positive (pos) scores
# The compound score is a single overall score for the text
# compound ranges from -1.0 (most negative) to 1.0 (most positive)

In [None]:
sia.polarity_scores('This is the worst thing ever.')
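
The compound score stays between -1.0 and 1.0 because the summed word valences are squashed into that range. As far as I know, the VADER code does this with the formula below and `alpha = 15`; treat both the formula and the constant as an assumption to check against the library source, not as a quote from it.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open range (-1, 1)."""
    # alpha=15 is assumed to match VADER's default; verify against the library
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Larger sums move the result closer to -1 or 1 without ever reaching them
for raw in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(raw, round(normalize(raw), 4))
```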

In [None]:
sadText = "Dear Diary, Today the weather is fine and the sky is so bright. But I am still sad. I miss the old days. When can I be happy again?"
sia.polarity_scores(sadText)

In [None]:
happyText= "I am no longer sad. I am no longer scared. I am not alone anymore"
sia.polarity_scores(happyText)

In [None]:
sia.polarity_scores(example)

In [None]:
# Run the polarity score on the entire dataset
res = {} # dictionary mapping the id of each text to its polarity scores
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

In [None]:
#res

In [None]:
pd.DataFrame(res).T # sentiment scores, one row per review Id

In [None]:
vaders = pd.DataFrame(res).T # sentiment scores, one row per review Id
vaders = vaders.reset_index().rename(columns={'index': 'Id'}) # turn the index into an 'Id' column
vaders = vaders.merge(df, how='left') # left merge with the review data on the shared 'Id' column

In [None]:
# Now we have sentiment score and metadata
vaders.head()
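
The dictionary-to-DataFrame steps above can be followed on a small made-up example. `toy_res` and `toy_df` are hypothetical stand-ins for `res` and `df`, and `on="Id"` is spelled out here only to make the join column explicit.

```python
import pandas as pd

# Hypothetical scores keyed by review Id, shaped like the `res` dictionary
toy_res = {
    101: {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    102: {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
}
# Hypothetical metadata, standing in for the reviews DataFrame
toy_df = pd.DataFrame({"Id": [101, 102], "Score": [5, 1]})

toy_scores = pd.DataFrame(toy_res).T   # transpose: one row per Id
toy_scores = toy_scores.reset_index().rename(columns={"index": "Id"})
toy_merged = toy_scores.merge(toy_df, how="left", on="Id")
print(toy_merged)
```

`pd.DataFrame(toy_res)` puts each Id in a column, so the transpose `.T` is what turns the Ids into rows.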

## Plot VADER results

- We expect 1-star reviews to have a negative compound score and 5-star reviews to have a positive compound score.


This bar plot shows the compound score for each star rating.

In [None]:
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

### Positive, Neutral, and Negative Scores of each Rating

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(12, 3))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.tight_layout()
plt.show()

# Step 2. Roberta Pretrained Model

- Use a model trained on a large corpus of data.
- A transformer model accounts for the words themselves and also for their context in relation to other words.

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# VADER results on example
print(example)
sia.polarity_scores(example)

In [None]:
# Run for Roberta Model
encoded_text = tokenizer(example, return_tensors='pt')
output = model(**encoded_text)
scores = output[0][0].detach().numpy()
scores = softmax(scores)
scores_dict = {
    'roberta_neg' : scores[0],
    'roberta_neu' : scores[1],
    'roberta_pos' : scores[2]
}
print(scores_dict)

In [None]:
def polarity_scores_roberta(example):
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg' : scores[0],
        'roberta_neu' : scores[1],
        'roberta_pos' : scores[2]
    }
    return scores_dict
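
The model returns raw scores (logits), and `softmax` turns them into values between 0 and 1 that sum to 1. Here is a minimal sketch of that step in plain NumPy; `softmax_1d` and `toy_logits` are hypothetical names and numbers, not output from the Roberta model.

```python
import numpy as np

def softmax_1d(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits in the order (negative, neutral, positive)
toy_logits = np.array([2.0, 0.5, -1.0])
probs = softmax_1d(toy_logits)
print(probs, probs.sum())
```

The largest logit gets the largest probability, so in this made-up case the text would be scored as mostly negative.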

In [None]:
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row['Text']
        myid = row['Id']
        vader_result = sia.polarity_scores(text)
        vader_result_rename = {}
        for key, value in vader_result.items():
            vader_result_rename[f"vader_{key}"] = value
        roberta_result = polarity_scores_roberta(text)
        both = {**vader_result_rename, **roberta_result}
        res[myid] = both
    except RuntimeError:
        # Skip reviews the model cannot process (most likely texts that are too long for it)
        print(f'Broke for id {myid}')

In [None]:
results_df = pd.DataFrame(res).T
results_df = results_df.reset_index().rename(columns={'index': 'Id'})
results_df = results_df.merge(df, how='left')

## Compare Scores between models

In [None]:
results_df.columns

# Step 3. Combine and compare

In [None]:
sns.pairplot(data=results_df,
             vars=['vader_neg', 'vader_neu', 'vader_pos',
                  'roberta_neg', 'roberta_neu', 'roberta_pos'],
            hue='Score',
            palette='tab10')
plt.show()

# Step 4. Review Examples

- Positive 1-Star and Negative 5-Star Reviews

Let's look at some examples where the model score and the review score differ the most.

In [None]:
results_df.query('Score == 1') \
    .sort_values('roberta_pos', ascending=False)['Text'].values[0]

In [None]:
results_df.query('Score == 1') \
    .sort_values('vader_pos', ascending=False)['Text'].values[0]

In [None]:
# Negative sentiment 5-star review

In [None]:
results_df.query('Score == 5') \
    .sort_values('roberta_neg', ascending=False)['Text'].values[0]

In [None]:
results_df.query('Score == 5') \
    .sort_values('vader_neg', ascending=False)['Text'].values[0]
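
The filter-then-sort pattern used in these cells can be tried on a small made-up table. `toy_results` and its rows are hypothetical and only stand in for `results_df`.

```python
import pandas as pd

# Hypothetical results, standing in for results_df
toy_results = pd.DataFrame({
    "Score": [1, 1, 5, 5],
    "roberta_pos": [0.05, 0.70, 0.95, 0.20],
    "Text": ["awful", "great, NOT", "loved it", "tough to beat"],
})

# Among the 1-star reviews, pick the one the model rated most positive
mismatch = (toy_results.query("Score == 1")
            .sort_values("roberta_pos", ascending=False)["Text"]
            .values[0])
print(mismatch)
```

`query` keeps only the rows with the chosen star rating, and sorting in descending order puts the biggest disagreement between model and rating in the first position.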

# Extra: The Transformers Pipeline
- Quick & easy way to run sentiment predictions

In [None]:
from transformers import pipeline

sent_pipeline = pipeline("sentiment-analysis")

In [None]:
sent_pipeline('I love sentiment analysis!')

In [None]:
sent_pipeline('Make sure to like and subscribe!')

In [None]:
sent_pipeline('booo')

# The End