**Sentiment analysis**

Hello and welcome to my sentiment analysis project. It is conducted on a Spotify reviews dataset that contains user reviews of the application. The CSV file has two columns: the first contains the review text and the second contains a label for the general sentiment of the review.

The sentiment analysis technique we will apply in this project is NLTK's VADER.

Sentiment analysis is a powerful data analysis tool that gives insight into the sentiment of a text. Two common first steps in text processing, which we will explore before applying VADER, are the following:

    1. Tokenizing the string. The process of breaking up each review into the words that make it up and populating a list with them.
    2. Part-of-speech (POS) tagging. The process of labeling each token with its corresponding grammatical category.
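To make the tokenization step concrete, here is a minimal sketch that uses only the standard library. The function name `simple_tokenize` and the regular expression are my own assumptions for illustration; this is a rough stand-in and not how NLTK's `word_tokenize` works internally (for example, NLTK splits contractions such as "don't" differently).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Great app, love it!"))
```

The result is a flat list of strings, which is the shape of data the later part-of-speech step expects.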


**Import necessary libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import random



In [None]:
df = pd.read_csv("/kaggle/input/spotify-dataset/DATASET.csv")
df.head()

Let's first explore the dataset! We want an overview of the proportion of positive and negative reviews, and we also want to look at some example reviews.

Since our label column contains categorical values, we first perform a get_dummies operation. This creates one new column per label, holding a true/false value that indicates whether that label is present in the row. Summing these columns gives the counts we need for a pie chart.
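As a small illustration of what get_dummies produces, here is a sketch on a made-up three-row label column (the toy data is an assumption, not taken from the Spotify dataset):

```python
import pandas as pd

# Hypothetical toy labels, only for illustration
toy = pd.DataFrame({"label": ["POSITIVE", "NEGATIVE", "POSITIVE"]})

# One indicator column per distinct label
dummies = pd.get_dummies(toy["label"])

# Summing each indicator column counts the rows carrying that label
counts = dummies.sum()
print(counts)
```

If only the counts are needed, `toy["label"].value_counts()` should give the same numbers without the intermediate indicator columns.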

In [None]:
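dummy_df = pd.get_dummies(df['label'])
dummy_df.head()

# Count the reviews per label
positive, negative = dummy_df['POSITIVE'].sum(), dummy_df['NEGATIVE'].sum()
data, labels = [positive, negative], ["Positive", "Negative"]

# plt.pie returns the wedge and text artists, not a figure/axes pair,
# so we create the figure and axes explicitly
fig, ax = plt.subplots()
ax.pie(data, labels=labels, startangle=140, autopct='%1.1f%%')
plt.show()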


In [None]:
randomInt = random.randint(0,500)
example = df['Review'][randomInt]
print(f"The review number: {randomInt}", example)

Let's do some tokenization. First we need to download the NLTK data used for tokenization, then we will tokenize an example review.

In [None]:
nltk.download('punkt')

In [None]:
example_tokenized = word_tokenize(example)
example_tokenized

Next we apply part-of-speech (POS) tagging to the tokenized review. This labels each word with a short tag code for its grammatical category. It is an important step because English has many words that can be used in different grammatical contexts. For example, the word 'run' can, depending on the context, be a:

    1. Verb - I run every morning.
    2. Noun - We did a test run of the system yesterday.

After this we pass the tagged tokens to NLTK's named-entity chunker. It groups the POS-tagged tokens into a tree in which sequences that look like named entities (people, organizations, places) are collected into labeled chunks.

In [None]:
pos_example = nltk.pos_tag(example_tokenized)
pos_example
pos_example_chunk = nltk.chunk.ne_chunk(pos_example)
pos_example_chunk.pprint()

We have now walked through the basic preprocessing steps. Note that NLTK's `SentimentIntensityAnalyzer` takes the raw review string and splits it into words itself, so the tokenized and tagged output above serves as illustration rather than as input to VADER. VADER is a rule-based sentiment analysis approach that suits shorter and more informal texts, such as tweets and reviews, because it works well when the writer expresses sentiment through slang, emoticons and exaggerated words.

It is a very lightweight approach to sentiment analysis, since it relies on a lexicon lookup in the spirit of a bag-of-words (BoW) approach rather than on training a model to evaluate the sentiment of a text. VADER uses a dictionary of roughly 7,500 manually rated words and tokens, each with a 'valence score' on a scale from -4 (most negative) to +4 (most positive). It looks up each token in the string, adjusts for cues such as negation and intensifiers, and then aggregates the scores into a numerical summary that we can interpret. The resulting 'compound' score is normalized to lie between -1 and 1, alongside the neg, neu and pos proportions, which each lie between 0 and 1.
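The lookup-and-aggregate idea can be sketched in a few lines. This is a simplified illustration and not VADER itself: the lexicon `TOY_LEXICON` and its values are made up, the function name `toy_compound` is hypothetical, and the rules for negation, intensifiers and punctuation are left out. The normalization `x / sqrt(x^2 + alpha)` with `alpha = 15` is, to my understanding, the form NLTK's implementation uses for the compound score; treat that as an assumption.

```python
import math

# Hypothetical toy lexicon; the values are invented for illustration
# and are not taken from the real VADER lexicon.
TOY_LEXICON = {"love": 3.0, "great": 3.0, "good": 2.0,
               "bad": -2.5, "terrible": -3.0}

def toy_compound(text, alpha=15.0):
    """Sum the valence of known words and squash the total into (-1, 1)."""
    tokens = text.lower().split()
    # Unknown words contribute 0; strip trailing punctuation before lookup
    total = sum(TOY_LEXICON.get(tok.strip(".,!?"), 0.0) for tok in tokens)
    return total / math.sqrt(total * total + alpha)

print(toy_compound("I love this app, great playlists!"))
print(toy_compound("Terrible update, bad design."))
```

The first review lands well above zero and the second well below, while a review with no known words scores exactly 0.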

The shortcomings of VADER lie in its inability to handle more nuanced text, especially sarcasm. Because the individual words of a sarcastic review can read as positive, VADER may score the review as positive even when it is obviously sarcastic. For example:

*Wow, this product is truly amazing. I mean, who doesn't love waiting for hours only to find out it doesn’t work at all? The best part? It breaks after one use, so I don’t even have to worry about storage space! Five stars for creativity on how badly designed this is.*

Some words and phrases that can seriously throw VADER off are:

    1. Amazing 
    2. Best part
    3. Five stars

Run through VADER, a review like this is likely to get a positive score, even though the customer was unhappy with the product.

This aside, we can continue implementing VADER. 

In [None]:
nltk.download('vader_lexicon')

In [None]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores(example)

Cool, it works! Let's do this on the first 1000 rows of our Spotify dataframe, store the scores in a temporary dictionary keyed by row Id, and then merge them into our sliced dataframe.

In [None]:
df_1000 = df[:1000].reset_index(drop=True)
df_1000['Id'] = df_1000.index
# Named 'results' rather than 'dict' to avoid shadowing the built-in dict type
results = {}
df_1000.head()

In [None]:
# Score every review and store the result under its row Id
for i, row in df_1000.iterrows():
    text = row['Review']
    rowId = row['Id']
    results[rowId] = sia.polarity_scores(text)

Let's merge it with the dataframe! To do this, we need to convert the dictionary to a dataframe and make sure both dataframes share a column to merge on. We will use the Id column we created above for this.

In [None]:
vaders_df = pd.DataFrame(results).T
vaders_df['Id'] = vaders_df.index
vaders_df = vaders_df.merge(df_1000, on = 'Id', how = 'right')
vaders_df
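The dictionary-to-dataframe step above can be reproduced on toy data to see what each stage does. The two reviews and their scores below are made up for illustration and are not real VADER output:

```python
import pandas as pd

reviews = pd.DataFrame({"Id": [0, 1],
                        "Review": ["Love it", "Keeps crashing"]})

# Hypothetical scores keyed by Id, shaped like polarity_scores output
scores = {
    0: {"neg": 0.0, "neu": 0.2, "pos": 0.8, "compound": 0.6},
    1: {"neg": 0.7, "neu": 0.3, "pos": 0.0, "compound": -0.5},
}

# Keys become columns first, so transpose to get one row per review
scores_df = pd.DataFrame(scores).T
scores_df["Id"] = scores_df.index

# how='right' keeps every row of the reviews dataframe
merged = scores_df.merge(reviews, on="Id", how="right")
print(merged)
```

`pd.DataFrame.from_dict(scores, orient="index")` should build the same one-row-per-review frame without the transpose, which some readers may find easier to follow.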

Cool! We have attached the VADER scores, including the compound value, to each review! Let's do some exploratory data analysis on this new dataframe.

We will show a histogram of the distribution of the compound scores, and a scatterplot of the relationship between the length of a review (in characters) and its compound score.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12, 6))

sns.histplot(vaders_df['compound'], kde = True, bins = 10, ax=ax[0])
ax[0].set_title('Distribution of VADER Compound Scores')
ax[0].set_xlabel('Compound Score')
ax[0].set_ylabel('Frequency')

vaders_df['Review_length'] = vaders_df['Review'].apply(len)
sns.scatterplot(x = "compound", y = "Review_length", data=vaders_df,ax=ax[1])
ax[1].set_title('Review Length vs VADER Sentiment Score')

plt.tight_layout()
plt.show()