# Sentiment Analysis in Python

In this notebook we will do some sentiment analysis in Python using the following techniques:

1. VADER (Valence Aware Dictionary and sEntiment Reasoner) - Bag of words approach
2. RoBERTa Pretrained Model from Hugging Face 🤗
3. Hugging Face

# Step 0. Read in Data and NLTK Basics

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

import nltk

In [None]:
# Read Data 
df = pd.read_csv('/kaggle/input/amazon-fine-food-reviews/Reviews.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Quick EDA

ax = df['Score'].value_counts().sort_index().plot(
    kind='bar', title='Count of Reviews by Stars', figsize=(10, 5))
ax.set_xlabel('Review Star')

# Basic NLTK

In [None]:
example = df['Text'][0]
print(example)

### Tokenization

In [None]:
tokens = nltk.word_tokenize(example)
tokens[:10]

### Part of Speech

In [None]:
tagged = nltk.pos_tag(tokens)
tagged[:10]

### Tagged POS into Chunks


In [None]:
entities = nltk.chunk.ne_chunk(tagged)
entities.pprint()

# Step 1. VADER Sentiment Scoring

We will use NLTK's `SentimentIntensityAnalyzer` to get the neg/neu/pos scores of the text.

* This uses a "bag of words" approach:
    1. Stop words are removed
    2. Each word is scored and the scores are combined into a total score
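
The scoring idea above can be sketched in a few lines of plain Python. This is only an illustration, not NLTK's implementation: `TOY_LEXICON` and `toy_compound` are hypothetical names, and the valence values are made up. The real VADER lexicon is far larger and adds rules for negation, punctuation, and capitalization. The final normalization (`x / sqrt(x² + 15)`) follows the form VADER uses for its compound score, as far as I know.

```python
import math

# Hypothetical mini-lexicon (word -> valence). The values are invented
# for illustration and are not taken from the real VADER lexicon.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}


def toy_compound(text, alpha=15.0):
    """Sum the valence of known words and squash the total into (-1, 1)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    # Words missing from the lexicon contribute 0 to the total.
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    # Normalize so long texts cannot grow the score without bound.
    return total / math.sqrt(total * total + alpha)


print(toy_compound("The food was great!"))
print(toy_compound("Terrible service and bad coffee."))
```

Because each word is looked up independently, the order of the words has no influence on the result.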

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm

sia = SentimentIntensityAnalyzer() 

In [None]:
sia.polarity_scores('Hi how are you?')

In [None]:
sia.polarity_scores(example)

In [None]:
# Run the polarity score on the entire dataset
res = {}
for _, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

In [None]:
df

In [None]:
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index':'Id'})
vaders

In [None]:
# Join the VADER scores back onto the review metadata by review Id
vaders = vaders.merge(df, on='Id', how='left')

In [None]:
vaders.head()

In [None]:
ax = sns.barplot(data=vaders, x='Score', y='compound')
ax.set_title('Compound Score by Amazon Star Review')
plt.show()

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
sns.barplot(data=vaders, x='Score', y='pos', ax=axs[0])
sns.barplot(data=vaders, x='Score', y='neu', ax=axs[1])
sns.barplot(data=vaders, x='Score', y='neg', ax=axs[2])
axs[0].set_title('Positive')
axs[1].set_title('Neutral')
axs[2].set_title('Negative')
plt.show()

# Step 3. RoBERTa Pretrained Model

* Uses a model trained on a large corpus of data
* A transformer model accounts for each word as well as its context in relation to the other words
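
To see why context matters, here is a small sketch of what a pure bag-of-words representation loses. The helper `bag_of_words` and both sentences are made up for illustration. The two sentences have opposite meanings, yet their word counts are identical, so any model that only sees the counts cannot tell them apart.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase words, discarding the order they appear in."""
    return Counter(text.lower().split())


a = "the food was good not bad"
b = "the food was bad not good"

# Same words, opposite meaning, identical bag-of-words representation.
print(bag_of_words(a) == bag_of_words(b))
```

A context-aware model reads the words in sequence, so it has a chance to treat "good not bad" and "bad not good" differently.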

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax
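
The `softmax` import is presumably there to turn the model's raw output scores (logits) into probabilities that sum to 1. Below is a minimal sketch of that step in plain Python rather than SciPy. The logit values and the negative/neutral/positive label order are assumptions for illustration; the actual order depends on the model checkpoint that is loaded.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one review (not real model output).
logits = [-1.2, 0.3, 2.5]
labels = ["neg", "neu", "pos"]  # assumed label order

probs = softmax(logits)
scores = dict(zip(labels, probs))
print(scores)
```

The largest logit always receives the largest probability, so the predicted label is the same before and after the transformation; softmax only makes the scores comparable across reviews.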