<a href="https://colab.research.google.com/github/Nishanth-thiyakarajan/Sentiment-Analysis-using-Spacy/blob/main/Movie%20reviews_Sentiment_Analysis_using_NLTK_and_Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing the Packages

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.8/42.8 MB 16.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Importing and Loading the Libraries

In [43]:
import spacy
import pandas as pd
import numpy as np
import nltk
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [28]:
nlp = spacy.load("en_core_web_md")
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

# Importing the Dataset

In [7]:
df = pd.read_csv("moviereviews.csv",sep="\t")
df

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...


# Data Exploration and Cleaning

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   2000 non-null   object
 1   review  1965 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [9]:
df.isnull().sum()

Unnamed: 0,0
label,0
review,35


Here, the review column has 35 null values, yet those rows still carry a sentiment label. A sentiment cannot be measured for a missing review, so we remove the rows whose review is null.

In [10]:
df.drop(df[df['review'].isnull()].index,inplace=True)
df

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...
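
The same clean-up can also be written with `dropna`. A minimal sketch on a small made-up frame (the three rows are invented for illustration and are not taken from moviereviews.csv):

```python
import pandas as pd

# Toy stand-in for the reviews frame; the values are invented.
toy = pd.DataFrame({
    "label": ["neg", "pos", "pos"],
    "review": ["a dull film", None, "a lovely film"],
})

# Keep only the rows whose 'review' is not null.
# The surviving rows keep their original index labels.
cleaned = toy.dropna(subset=["review"])
print(cleaned.index.tolist())
```

As with `drop`, the index is not renumbered, which is why the cleaned frame in this notebook still runs up to 1999.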


Because the review column holds text, some entries may not be null but still contain nothing except whitespace (blanks). To find them, we can use isspace(), which is True for strings made up only of whitespace characters.

In [21]:
df[df['review'].str.isspace()]

Unnamed: 0,label,review
57,neg,
71,pos,
147,pos,
151,pos,
283,pos,
307,pos,
313,neg,
323,pos,
343,pos,
351,neg,


In [22]:
blanks = []
# Collect the index of every review that consists only of whitespace.
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
blanks

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]
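
The loop can also be condensed into a boolean mask. A sketch on an invented toy frame, which assumes that nulls might still be present and therefore checks the type before calling `isspace()`:

```python
import pandas as pd

# Toy frame with one normal review, two blank reviews and one null.
toy = pd.DataFrame({
    "label": ["neg", "pos", "pos", "neg"],
    "review": ["fine", "   ", None, "\n"],
})

# True only for strings made up entirely of whitespace.
# Note that an empty string "" is not matched, because "".isspace() is False.
blank_mask = toy["review"].apply(lambda rv: isinstance(rv, str) and rv.isspace())

toy_blanks = toy.index[blank_mask].tolist()
print(toy_blanks)
```

The resulting list can be passed to `drop` in the same way as `blanks`.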

We have found 27 blank reviews, and now we remove them.


In [23]:
df.drop(blanks,inplace=True)
df

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...
...,...,...
1995,pos,"i like movies with albert brooks , and i reall..."
1996,pos,it might surprise some to know that joel and e...
1997,pos,the verdict : spine-chilling drama from horror...
1998,pos,i want to correct what i wrote in a former ret...


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1938 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   1938 non-null   object
 1   review  1938 non-null   object
dtypes: object(2)
memory usage: 45.4+ KB


After the data cleaning, 1938 reviews remain and are ready for the next steps.

## Checking the Need for Data Sampling

In [25]:
df.label.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
neg,969
pos,969


We can see that the negative and positive labels are equally frequent (969 each), so no resampling is needed.

# vader_lexicon from nltk

Now we use "vader_lexicon" from the nltk library. The pretrained spaCy pipeline loaded above does not include a sentiment analysis component, so we use nltk instead. In nltk, the VADER lexicon is used through SentimentIntensityAnalyzer to score sentiment.


In [29]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

Now we compute the sentiment scores for every review, both positive and negative, and store them in a new column.

In [30]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df

Unnamed: 0,label,review,scores
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co..."
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com..."
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com..."
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co..."
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co..."
...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.763, 'pos': 0.164, 'co..."
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.238, 'neu': 0.688, 'pos': 0.074, 'co..."
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.702, 'pos': 0.147, 'com..."
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.131, 'neu': 0.71, 'pos': 0.16, 'comp..."





Each entry in scores is a dictionary such as:

{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'compound': -0.9125}

Note: these values are computed from the words used in the review column, which are looked up in VADER's lexicon of English words.

        neg - proportion of the text rated negative
        neu - proportion of the text rated neutral
        pos - proportion of the text rated positive
        compound - overall score, normalized to the range -1 to +1

The compound value ranges from -1 (most negative) to +1 (most positive) and summarizes the overall sentiment of the review in a single number.

Let us take the above value as an example:
neg (0.121) is higher than pos (0.101), so the sentiment leans negative. The compound value is -0.9125 (strongly negative).
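
For intuition on why compound stays between -1 and +1: as far as we understand VADER, it sums the rule-adjusted word valences into a raw score x and then squashes it as x / sqrt(x² + α), with α = 15 by default. The sketch below reproduces only that squashing step. `normalize_compound` and the raw sums fed to it are illustrative names and values, not part of the nltk API.

```python
import math

def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Invented raw sums: the larger the magnitude, the closer the result gets to -1 or +1.
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize_compound(raw), 4))
```

Long reviews contain many scored words, so their raw sum tends to be large and the compound value saturates near -1 or +1. That would be consistent with values such as 0.9951 and -0.9993 in the tables of this notebook.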

Let us store the compound value in a new column.

In [33]:
df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])
df

Unnamed: 0,label,review,scores,compound
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484
...,...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.763, 'pos': 0.164, 'co...",0.9991
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.238, 'neu': 0.688, 'pos': 0.074, 'co...",-0.9993
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.702, 'pos': 0.147, 'com...",-0.5966
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.131, 'neu': 0.71, 'pos': 0.16, 'comp...",0.9387
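
If the other three scores are needed as columns too, the dictionaries can be expanded in one step. A sketch on a toy frame: the first dictionary is the example quoted earlier, the second is invented.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "scores": [
            {"neg": 0.121, "neu": 0.778, "pos": 0.101, "compound": -0.9125},
            {"neg": 0.050, "neu": 0.700, "pos": 0.250, "compound": 0.8},  # invented
        ]
    },
    index=[0, 2],  # non-contiguous index, as in the cleaned frame
)

# Build one column per dictionary key, aligned on the original index.
expanded = pd.DataFrame(toy["scores"].tolist(), index=toy.index)
toy = toy.join(expanded)
print(toy[["neg", "neu", "pos", "compound"]])
```

Passing `index=toy.index` matters here, because the cleaned frame no longer has a contiguous 0..n-1 index.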


Now we create a new column called pred_label so that we can compare the predicted labels with the actual ones. A compound value of 0 or above is treated as pos, and anything below 0 as neg.

In [42]:
df['pred_label'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df

Unnamed: 0,label,review,scores,compound,pred_label
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg
...,...,...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.763, 'pos': 0.164, 'co...",0.9991,pos
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.238, 'neu': 0.688, 'pos': 0.074, 'co...",-0.9993,neg
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.702, 'pos': 0.147, 'com...",-0.5966,neg
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.131, 'neu': 0.71, 'pos': 0.16, 'comp...",0.9387,pos


In [44]:
accuracy_score(df['label'],df['pred_label'])

0.6357069143446853

In [47]:
print(classification_report(df['label'],df['pred_label']))
print(confusion_matrix(df['label'],df['pred_label']))


              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938

[[427 542]
 [164 805]]


Here we can see that the accuracy is only 0.64 and the recall for neg is just 0.44: more than half of the negative reviews (542 of 969) are predicted as positive.

## Hyperparameter Tuning
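
The figures in the report can be recomputed by hand from the confusion matrix. In scikit-learn's layout the rows are the actual labels and the columns are the predicted labels, in sorted order (neg, pos). A short sketch using the four counts printed above:

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted.
neg_as_neg, neg_as_pos = 427, 542
pos_as_neg, pos_as_pos = 164, 805

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

# Accuracy: share of all reviews that land on the diagonal.
accuracy = (neg_as_neg + pos_as_pos) / total

# Recall: share of each actual class that was predicted correctly (row-wise).
neg_recall = neg_as_neg / (neg_as_neg + neg_as_pos)
pos_recall = pos_as_pos / (pos_as_neg + pos_as_pos)

# Precision: share of each predicted class that was correct (column-wise).
neg_precision = neg_as_neg / (neg_as_neg + pos_as_neg)
pos_precision = pos_as_pos / (neg_as_pos + pos_as_pos)

print(round(accuracy, 4))
print(round(neg_precision, 2), round(neg_recall, 2))
print(round(pos_precision, 2), round(pos_recall, 2))
```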


In [48]:
for i in [-0.9,-0.8,-0.7,-0.6,-0.5,-0.4,-0.3,-0.2,-0.1,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]:
    print("------------------------------------------------------------------------------------------------------------------------------")
    print(i)
    df['pred_label'] = df['compound'].apply(lambda c: 'pos' if c >=i else 'neg')
    print(accuracy_score(df['label'],df['pred_label']))

------------------------------------------------------------------------------------------------------------------------------
-0.9
0.5995872033023736
------------------------------------------------------------------------------------------------------------------------------
-0.8
0.6155830753353974
------------------------------------------------------------------------------------------------------------------------------
-0.7
0.6238390092879257
------------------------------------------------------------------------------------------------------------------------------
-0.6
0.6279669762641898
------------------------------------------------------------------------------------------------------------------------------
-0.5
0.630546955624355
------------------------------------------------------------------------------------------------------------------------------
-0.4
0.631062951496388
------------------------------------------------------------------------------------------------

Here we can see that the accuracy is highest at the threshold 0.9, where it reaches 0.67. So we tune further above 0.9.

In [49]:
for i in [0.91,0.92,0.93,0.94,0.95,0.96,0.97,0.98,0.99]:
    print("------------------------------------------------------------------------------------------------------------------------------")
    print(i)
    df['pred_label'] = df['compound'].apply(lambda c: 'pos' if c >=i else 'neg')
    print(accuracy_score(df['label'],df['pred_label']))

------------------------------------------------------------------------------------------------------------------------------
0.91
0.6692466460268318
------------------------------------------------------------------------------------------------------------------------------
0.92
0.6723426212590299
------------------------------------------------------------------------------------------------------------------------------
0.93
0.6718266253869969
------------------------------------------------------------------------------------------------------------------------------
0.94
0.672858617131063
------------------------------------------------------------------------------------------------------------------------------
0.95
0.6769865841073271
------------------------------------------------------------------------------------------------------------------------------
0.96
0.6785345717234262
-----------------------------------------------------------------------------------------------

Here we can see that the accuracy is highest at the threshold 0.98, where it reaches 0.6847. So we tune further between 0.980 and 0.990.

In [50]:
for i in [0.980,0.981,0.982,0.983,0.984,0.985,0.986,0.987,0.988,0.989,0.990]:
    print("------------------------------------------------------------------------------------------------------------------------------")
    print(i)
    df['pred_label'] = df['compound'].apply(lambda c: 'pos' if c >=i else 'neg')
    print(accuracy_score(df['label'],df['pred_label']))

------------------------------------------------------------------------------------------------------------------------------
0.98
0.6847265221878225
------------------------------------------------------------------------------------------------------------------------------
0.981
0.6857585139318886
------------------------------------------------------------------------------------------------------------------------------
0.982
0.6847265221878225
------------------------------------------------------------------------------------------------------------------------------
0.983
0.6836945304437565
------------------------------------------------------------------------------------------------------------------------------
0.984
0.6842105263157895
------------------------------------------------------------------------------------------------------------------------------
0.985
0.6842105263157895
-----------------------------------------------------------------------------------------

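Here we can see that, among the values shown, the accuracy peaks at 0.6857585139318886 for the threshold 0.981. The neighbouring threshold 0.982 scores 0.6847265221878225, which is almost the same. The final prediction uses 0.982.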
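
The three manual passes above amount to a coarse-to-fine search over the threshold. A compact sketch of a single sweep follows. `accuracy_at` and `best_threshold` are hypothetical helpers, and the compound scores and labels are invented toy values rather than the movie review data.

```python
def accuracy_at(threshold, compounds, labels):
    """Accuracy when compound >= threshold is predicted as 'pos'."""
    preds = ["pos" if c >= threshold else "neg" for c in compounds]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

def best_threshold(compounds, labels, candidates):
    """Return the first candidate that reaches the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(t, compounds, labels))

# Invented toy data: the negatives reach up to 0.93, the positives start at 0.995.
compounds = [-0.91, -0.25, 0.40, 0.93, 0.995, 0.999]
labels = ["neg", "neg", "neg", "neg", "pos", "pos"]

# Candidate thresholds from -0.90 to 0.99 in steps of 0.01.
candidates = [k / 100 for k in range(-90, 100)]

best = best_threshold(compounds, labels, candidates)
print(best, accuracy_at(best, compounds, labels))
```

Because the threshold is chosen on the same reviews it is scored on, the reported accuracy is probably somewhat optimistic. Selecting the threshold on one part of the data and scoring it on the rest would be a fairer check.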

In [51]:
df['pred_label'] = df['compound'].apply(lambda c: 'pos' if c >= 0.982 else 'neg')
df

Unnamed: 0,label,review,scores,compound,pred_label
0,neg,how do films like mouse hunt get into theatres...,"{'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co...",-0.9125,neg
1,neg,some talented actresses are blessed with a dem...,"{'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com...",-0.8618,neg
2,pos,this has been an extraordinary year for austra...,"{'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com...",0.9951,pos
3,pos,according to hollywood movies made in last few...,"{'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co...",0.9972,pos
4,neg,my first press screening of 1998 and already i...,"{'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co...",-0.2484,neg
...,...,...,...,...,...
1995,pos,"i like movies with albert brooks , and i reall...","{'neg': 0.073, 'neu': 0.763, 'pos': 0.164, 'co...",0.9991,pos
1996,pos,it might surprise some to know that joel and e...,"{'neg': 0.238, 'neu': 0.688, 'pos': 0.074, 'co...",-0.9993,neg
1997,pos,the verdict : spine-chilling drama from horror...,"{'neg': 0.15, 'neu': 0.702, 'pos': 0.147, 'com...",-0.5966,neg
1998,pos,i want to correct what i wrote in a former ret...,"{'neg': 0.131, 'neu': 0.71, 'pos': 0.16, 'comp...",0.9387,neg


In [52]:
print(classification_report(df['label'],df['pred_label']))
print(confusion_matrix(df['label'],df['pred_label']))

              precision    recall  f1-score   support

         neg       0.66      0.75      0.70       969
         pos       0.71      0.62      0.66       969

    accuracy                           0.68      1938
   macro avg       0.69      0.68      0.68      1938
weighted avg       0.69      0.68      0.68      1938

[[725 244]
 [367 602]]


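Now the accuracy (0.68 versus 0.64) and the macro F1-score (0.68 versus 0.62) are a bit higher than the first time. The gain comes from the neg class, whose recall rises from 0.44 to 0.75, while the recall for pos falls from 0.83 to 0.62.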
