# Sentiment Analysis using VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool used in NLP. It is specifically designed to analyze and determine the sentiment (positive, negative, or neutral) expressed in textual data, such as social media posts, customer reviews, and online comments.

 VALDER also considers how strong or intense the sentiment is.

It's a helpful tool for quickly figuring out how people feel about something based on what they write.

## Get the Data
We will use the [amazon-reviews](https://www.kaggle.com/datasets/kritanjalijain/[amazon-reviews) dataset on Kaggle.

To download a dataset with kaggle:
1. `pip install kaggle`
2. Download the kaggle.json file by clicking on "Create new token" in the API section of [your profile settings

3. Place the kaggle.json file in: `Users/username/.kaggle/kaggle.json`

In [None]:
import zipfile
from kaggle.api.kaggle_api_extended import KaggleApi

import pandas as pd

# Authenticate and download the dataset using the Kaggle API
api = KaggleApi()
api.authenticate()
api.dataset_download_files('kritanjalijain/amazon-reviews')

# Extract the dataset file

with zipfile.ZipFile('amazon-reviews.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset')

In [22]:
df = pd.read_csv('dataset/train.csv', header=None, names=['sentiment', 'review', 'reviewText'])
# Keep only 10,000 rows as the dataset is too large
df = df[0:10000, :]
df.head()

Unnamed: 0,sentiment,review,reviewText
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,2,Amazing!,This soundtrack is my favorite music of all ti...
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."


In [None]:
df.shape

## Clean the Data
We iterate over the dataframe and check if a review is empty or contains only whitespace characters. If so, we drop the row.

In [None]:
"""
# Remove NaN values and empty strings
df.dropna(inplace=True)

blanks = []

for i, label, review in df.itertuples():
    if type(review)==str:
        if review.isspace():
            blanks.append(i)

df.drop(blanks, inplace=True)
"""

NameError: name 'df' is not defined

## Import VADER

In [13]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\balde\AppData\Roaming\nltk_data...


True

In [14]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

## Apply VADER to the dataset
The polarity_score method returns a dictionary of scores for the input string. The compound score is a metric that calculates the sum of all the lexicon ratings which have been normalized between -1 (most extreme negative) and +1 (most extreme positive).

In [15]:
sid.polarity_scores('Spring is such a great time of the year')

{'neg': 0.0, 'neu': 0.631, 'pos': 0.369, 'compound': 0.6249}

Now, let's apply VADER to the reviewText column of the dataset. We will create a new column called scores that contains the polarity scores for each review.

In [23]:
df['scores'] = df['reviewText'].apply(lambda review: sid.polarity_scores(review))

Unnamed: 0,sentiment,review,reviewText,scores
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...,"{'neg': 0.093, 'neu': 0.651, 'pos': 0.256, 'co..."
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,"{'neg': 0.019, 'neu': 0.851, 'pos': 0.129, 'co..."
2,2,Amazing!,This soundtrack is my favorite music of all ti...,"{'neg': 0.04, 'neu': 0.691, 'pos': 0.269, 'com..."
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...,"{'neg': 0.092, 'neu': 0.628, 'pos': 0.28, 'com..."
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine...","{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp..."


In [33]:
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['comp_score'] = df['compound'].apply(lambda c: 2 if c >=0 else 1)   # 2 for positive, 1 for negative

In [30]:
df.head()

Unnamed: 0,sentiment,review,reviewText,scores,compound,comp_score
0,2,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...,"{'neg': 0.093, 'neu': 0.651, 'pos': 0.256, 'co...",0.9454,pos
1,2,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,"{'neg': 0.019, 'neu': 0.851, 'pos': 0.129, 'co...",0.8481,pos
2,2,Amazing!,This soundtrack is my favorite music of all ti...,"{'neg': 0.04, 'neu': 0.691, 'pos': 0.269, 'com...",0.9854,pos
3,2,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...,"{'neg': 0.092, 'neu': 0.628, 'pos': 0.28, 'com...",0.9753,pos
4,2,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine...","{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp...",0.9781,pos


## Evaluate the Model

In [34]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print(accuracy_score(df['sentiment'], df['comp_score']))
print(classification_report(df['sentiment'], df['comp_score']))
print(confusion_matrix(df['sentiment'], df['comp_score']))

0.6877631574926669
              precision    recall  f1-score   support

           1       0.85      0.46      0.59   1799880
           2       0.63      0.92      0.75   1799913

    accuracy                           0.69   3599793
   macro avg       0.74      0.69      0.67   3599793
weighted avg       0.74      0.69      0.67   3599793

[[ 824331  975549]
 [ 148439 1651474]]
