<a href="https://colab.research.google.com/github/SaikumarUCM/Sentimental-Analysis-on-Amazon-Reviews/blob/main/Sentimental_Analysis_on_Amazon_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis in Python


In this notebook we will do sentiment analysis in Python using three different techniques:
1. VADER (Valence Aware Dictionary and sEntiment Reasoner) - Bag of words approach
2. RoBERTa pretrained model from 🤗 (Hugging Face)
3. Hugging Face pipeline

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


plt.style.use('ggplot')

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('vader_lexicon')


In [None]:
import kagglehub
path = kagglehub.dataset_download("snap/amazon-fine-food-reviews")
print(path)

In [None]:
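# Read in the data and keep only the first 500 reviews
df = pd.read_csv(path + '/Reviews.csv')
df = df.head(500)

# Preview the first rows (displayed because it is the last expression in the cell)
df.head()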

In [None]:
df['Score'].value_counts().sort_index()

In [None]:
ax = df['Score'].value_counts().sort_index().plot(kind='bar', title='Count of Reviews by Stars', figsize=(10, 5))

ax.set_xlabel('Review Stars')
ax.set_ylabel('Number of Reviews')
plt.show()

# Basic NLTK (Natural Language ToolKit)

In [None]:
df.head()

In [None]:
df['Text'][0]

In [None]:
example= df["Text"][50]
tokens= nltk.word_tokenize(example)
tokens[:10]

In [None]:
tagged= nltk.pos_tag(tokens)
tagged[:10]

In [None]:

entities= nltk.chunk.ne_chunk(tagged)

entities.pprint()

# Step 1. VADER Sentiment Scoring

We will use NLTK's `SentimentIntensityAnalyzer` to get the neg/neu/pos scores of the text.

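- This uses a "bag of words" approach:
    1. Each word is looked up in a sentiment lexicon; words that are not in the lexicon (including most stop words) contribute nothing.
    2. The word scores are combined into a total score, with only a few simple rules (e.g. negation and intensifiers) accounting for neighboring words.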
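
To make the idea concrete, here is a minimal sketch of a bag-of-words scorer. The lexicon `TOY_LEXICON` and the function `toy_bow_score` are invented for illustration; they are not the VADER lexicon or NLTK's code, which use a much larger word list plus additional rules.

```python
# Toy bag-of-words sentiment scorer.
# TOY_LEXICON is made up for this example and is NOT the VADER lexicon.
TOY_LEXICON = {
    "happy": 2.0,
    "love": 3.0,
    "good": 2.0,
    "bad": -2.5,
    "hate": -3.0,
    "worst": -3.0,
}

def toy_bow_score(text):
    """Sum the lexicon valence of every word; unknown words count as 0."""
    words = text.lower().split()
    # Strip simple punctuation so that "happy!" still matches "happy".
    cleaned = [w.strip(".,!?;:\"'") for w in words]
    return sum(TOY_LEXICON.get(w, 0.0) for w in cleaned)

print(toy_bow_score("I am so happy"))                 # 2.0
print(toy_bow_score("This is the worst thing ever"))  # -3.0
```

Note that word order plays no role here: the score depends only on which words appear.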

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm


sia = SentimentIntensityAnalyzer()

In [None]:
sia.polarity_scores("I am so happy")

In [None]:
sia.polarity_scores("This is the worst thing ever")

In [None]:
example

In [None]:
sia.polarity_scores(example)
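
The `compound` value in the result is a single summary score between -1 and 1. As far as I know, NLTK's VADER obtains it by squashing the summed word valences with x / sqrt(x² + α), using α = 15. The sketch below reproduces only that normalization step; `normalize_compound` is a hypothetical name and the input totals are made up.

```python
import math

def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)

# Made-up valence totals, just to show the shape of the mapping.
for total in (-4.0, 0.0, 1.5, 4.0):
    print(total, round(normalize_compound(total), 4))
```

Larger totals move the result toward ±1 without ever exceeding it, which is why strongly worded reviews saturate near the extremes.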

In [None]:
# Run the Polarity score on the entire dataset

res={}

for i, row_data in tqdm(df.iterrows(), total=len(df)):
    text = row_data['Text']
    myid = row_data['Id']
    res[myid] = sia.polarity_scores(text)


In [None]:
vaders= pd.DataFrame(res).T
print(vaders)

In [None]:
vaders = vaders.reset_index().rename(columns={'index':'Id'})
print(vaders)

In [None]:
vaders = vaders.merge(df, how='left')
vaders.head()

## Plot Vader Results

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(12, 3))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
sns.barplot(data=vaders, x='Score', y='compound', ax=axs[3])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
axs[3].set_title('Compound')
plt.tight_layout()
plt.show()

# Step 3. RoBERTa Pretrained Model

- Use a model trained on a large corpus of data.
- A transformer model accounts not only for the words themselves but also for their context in relation to other words.
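
A small sketch of why context matters: two sentences with opposite meanings can contain exactly the same words, so any scorer that ignores word order cannot tell them apart. The helper `bag_of_words` and both sentences are invented for this illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts, discarding word order and simple punctuation."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return Counter(w for w in words if w)

a = "The food was good, not bad."
b = "The food was bad, not good."

# Same multiset of words, opposite meaning.
print(bag_of_words(a) == bag_of_words(b))  # True
```

A model that reads the words in sequence can, in principle, distinguish these two sentences.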

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


In [None]:
# VADER results on example

print(example)
sia.polarity_scores(example)

In [None]:
# Run for the Roberta Model

encoded_text= tokenizer(example, return_tensors='pt')
encoded_text

In [None]:
output = model(**encoded_text)
output

In [None]:
scores= output[0][0].detach().numpy()
scores

In [None]:
scores= softmax(scores)
scores
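
The softmax step turns the model's raw output scores (logits) into probabilities that sum to 1. Below is a plain-Python sketch of the same computation for a single list of numbers; the notebook itself uses `scipy.special.softmax`, and the logits here are made up.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large inputs.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [-1.2, 0.3, 2.1]  # made-up values, not real model output
probs = softmax(example_logits)

labels = ["roberta_neg", "roberta_neu", "roberta_pos"]
print(dict(zip(labels, probs)))
```

The largest logit always receives the largest probability, so the ranking of the three classes is unchanged.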

In [None]:
scores_dict={
    'roberta_neg': scores[0],
    'roberta_neu': scores[1],
    'roberta_pos': scores[2]
}

print(scores_dict)


In [None]:
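def polarity_scores_roberta(example):
    """Return RoBERTa negative/neutral/positive probabilities for a text."""
    encoded_text = tokenizer(example, return_tensors='pt')
    output = model(**encoded_text)
    # output[0] holds the logits; take the first (and only) item in the batch
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)
    scores_dict = {
        'roberta_neg': scores[0],
        'roberta_neu': scores[1],
        'roberta_pos': scores[2]
    }
    return scores_dict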



In [None]:
res = {}

for i, row_data in tqdm(df.iterrows(), total=len(df)):
    try:
        text = row_data['Text']
        myid = row_data['Id']

        # VADER scores, with keys prefixed so they do not clash with RoBERTa's
        vader_results = sia.polarity_scores(text)
        vader_results_rename = {}
        for key, value in vader_results.items():
            vader_results_rename[f"vader_{key}"] = value

        roberta_results = polarity_scores_roberta(text)

        both = {**vader_results_rename, **roberta_results}
        res[myid] = both
    except RuntimeError:
        # Skip reviews for which the RoBERTa model raises an error
        print(f'Broke for id {myid}')

In [None]:
# Peek at the first four entries
count = 0
for i, j in res.items():
    print(i, j)
    count += 1
    if count > 3:
        break



In [None]:
results_df= pd.DataFrame(res).T
results_df

In [None]:
results_df = results_df.reset_index().rename(columns={'index':'Id'})
results_df= results_df.merge(df, how='left')
results_df.head()

## Compare Score between models

In [None]:
results_df.columns

In [None]:
sns.pairplot(data=results_df,
             vars=['vader_neg', 'vader_neu', 'vader_pos',
                   'roberta_neg', 'roberta_neu', 'roberta_pos'],
             hue='Score',
             palette='tab10')

plt.show()

# Step 4. Review Examples

- Positive 1-Star and Negative 5-Star Reviews

Let's look at some examples where the model's sentiment score and the review's star rating differ the most.
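
The lookup pattern used in this section (filter by star rating, then take the row with the highest model score) can be sketched in plain Python. The `rows` list and the `most_extreme` helper are hypothetical stand-ins for `results_df`, and every value in them is invented.

```python
# Invented rows standing in for results_df; none of these are real reviews.
rows = [
    {"Id": 1, "Score": 1, "roberta_pos": 0.02, "Text": "Stale and tasteless."},
    {"Id": 2, "Score": 1, "roberta_pos": 0.81, "Text": "Great idea, shame it arrived broken."},
    {"Id": 3, "Score": 5, "roberta_pos": 0.97, "Text": "Delicious, will buy again."},
]

def most_extreme(rows, stars, key):
    """Return the review with the given star rating that has the highest value for `key`."""
    candidates = [r for r in rows if r["Score"] == stars]
    return max(candidates, key=lambda r: r[key])

# The 1-star review that the (made-up) model scores as most positive.
print(most_extreme(rows, stars=1, key="roberta_pos")["Text"])
```

Reviews surfaced this way are worth reading by hand, since they often mix praise and complaints or use sarcasm.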

In [None]:
results_df.query('Score == 1').sort_values('roberta_pos', ascending=False)['Text'].values[0]

In [None]:
results_df.query('Score == 1').sort_values('vader_pos', ascending=False)['Text'].values[0]

In [None]:
# Negative Sentiment 5 star review

In [None]:
results_df.query('Score == 5').sort_values('roberta_neg', ascending=False)['Text'].values[0]

In [None]:
results_df.query('Score == 5').sort_values('vader_neg', ascending=False)['Text'].values[0]

# Extra: The Transformers Pipeline
- Quick & easy way to run sentiment predictions

In [1]:
from transformers import pipeline

sent_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


In [2]:
sent_pipeline('I love sentiment analysis!')

[{'label': 'POSITIVE', 'score': 0.9997853636741638}]

In [3]:
sent_pipeline('I hate this!')

[{'label': 'NEGATIVE', 'score': 0.9995765089988708}]