<a href="https://colab.research.google.com/github/MikaTina/Project_ML-SII/blob/main/Dictionary-based_Sentiment_Analysis_on_IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Import Common Libraries**



In [1]:
import pandas as pd
import os
import numpy as np
import tensorflow as tf

# **Import Dataset**

In [2]:
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz", 
                                  origin=URL,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


In [3]:
# The shutil module offers high-level operations
# on files and directory trees.
import os
import shutil
# Path to the extracted dataset root ("aclImdb")
main_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
# Path to the training split ("aclImdb/train")
train_dir = os.path.join(main_dir, 'train')
# Path to the test split ("aclImdb/test")
test_dir = os.path.join(main_dir, 'test')
# Remove the "unsup" folder, since this is a supervised learning task.
# ignore_errors=True keeps the cell re-runnable once the folder is gone.
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir, ignore_errors=True)
# List the contents of the final train and test folders
print(os.listdir(train_dir))
print(os.listdir(test_dir))

['unsupBow.feat', 'pos', 'labeledBow.feat', 'neg', 'urls_unsup.txt', 'urls_neg.txt', 'urls_pos.txt']
['pos', 'labeledBow.feat', 'neg', 'urls_neg.txt', 'urls_pos.txt']


# **Load data from directories**

In [4]:
dir_train_pos = os.listdir(train_dir + "/pos")
dir_train_neg = os.listdir(train_dir + "/neg")
dir_test_pos = os.listdir(test_dir + "/pos")
dir_test_neg = os.listdir(test_dir + "/neg")

In [5]:
data = []
pos_train_dir = os.path.join(train_dir,'pos')
neg_train_dir = os.path.join(train_dir,'neg')
pos_test_dir = os.path.join(test_dir,'pos')
neg_test_dir = os.path.join(test_dir,'neg')

# (directory, file names, label) for each split, in the order in which
# rows are appended: train/pos, train/neg, test/pos, test/neg
sources = [
  (pos_train_dir, dir_train_pos, "Positive"),
  (neg_train_dir, dir_train_neg, "Negative"),
  (pos_test_dir, dir_test_pos, "Positive"),
  (neg_test_dir, dir_test_neg, "Negative"),
]

for directory, file_names, label in sources:
  for element in file_names:
    # Each file holds one review on a single line. The encoding is set
    # explicitly so the result does not depend on the platform default,
    # and "with" closes the file even if reading fails.
    with open(os.path.join(directory, element), "r", encoding="utf-8") as input_file:
      line = input_file.readline()
    data.append([line, label])

# **DataFrame Creation**

In [6]:
df = pd.DataFrame(data, columns=['review', 'sentiment'])

In [7]:
df

| | review | sentiment |
|---|---|---|
| 0 | I've received this movie from a cousin in Norw... | Positive |
| 1 | I saw this movie, and the play, and I have to ... | Positive |
| 2 | Well, What can I say, other than these people ... | Positive |
| 3 | With various Bogdanoviches and Gazzaras scatte... | Positive |
| 4 | Chase has created a true phenomenon with The S... | Positive |
| ... | ... | ... |
| 49995 | A cheesy "B" crime thriller of the early '50, ... | Negative |
| 49996 | A total and absolute waste of time. Bad acting... | Negative |
| 49997 | This movie changed my life! Hogan's performanc... | Negative |
| 49998 | This demented left-wing wipe-out trivializes D... | Negative |


# **First method: pysentiment2**
pysentiment2 is a library for dictionary-based sentiment analysis.

In [8]:
pip install pysentiment2

Collecting pysentiment2
  Downloading https://files.pythonhosted.org/packages/d1/7c/596f3028260310d6206b7b88fe7d37fdb367913bfac9195912b27ab3cadb/pysentiment2-0.1.1-py3-none-any.whl (1.9MB)
Installing collected packages: pysentiment2
Successfully installed pysentiment2-0.1.1


In [9]:
import pysentiment2 as ps

The library provides two dictionaries: Harvard IV-4, for general-purpose sentiment analysis, and the Loughran and McDonald Financial Sentiment Dictionaries, for financial text. Since movie reviews are general-purpose text, this work uses Harvard IV-4.

In [10]:
hiv4 = ps.HIV4()

The following counters are used to evaluate the results of the sentiment analysis. All four enter the accuracy, while precision, recall and F1-score are computed from the true positives, false positives and false negatives.

In [11]:
# variables initialization
true_pos_ps=0     # number of positive reviews classified as positive
true_neg_ps=0     # number of negative reviews classified as negative
false_pos_ps=0    # number of negative reviews classified as positive
false_neg_ps=0    # number of positive reviews classified as negative

The score computed using this library has four components:

* Negative: number of negative tokens
* Positive: number of positive tokens
* Polarity: computed as (Pos-Neg)/(Pos+Neg), where Pos and Neg are the numbers of positive and negative tokens; it is negative when negative tokens outnumber positive ones, and positive in the opposite case
* Subjectivity: computed as (Pos+Neg)/N, where N is the total number of tokens

In this work we use only the Polarity component: a review is classified as positive if Polarity is greater than 0, as neutral if it is exactly equal to 0, and as negative if it is lower than 0.
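As a small illustration of the Polarity formula and the sign-based rule described above, the following sketch works directly on token counts rather than calling the library. The function names `polarity` and `classify` are hypothetical, and returning 0.0 for a text with no dictionary hits is an assumption made here to avoid a division by zero.

```python
def polarity(pos: int, neg: int) -> float:
    """Polarity as described above: (Pos - Neg) / (Pos + Neg)."""
    total = pos + neg
    # Assumption: a text with no dictionary hits is treated as neutral (0.0)
    return 0.0 if total == 0 else (pos - neg) / total


def classify(polarity_value: float) -> str:
    """Map the sign of the polarity to a class label."""
    if polarity_value > 0:
        return "Positive"
    if polarity_value < 0:
        return "Negative"
    return "Neutral"


# Hypothetical counts: 6 positive and 2 negative tokens
p = polarity(6, 2)
print(p, classify(p))  # 0.5 Positive
```

With equal counts, for example 3 positive and 3 negative tokens, the polarity is 0 and the review falls into the neutral case.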

In [None]:
for row in df.itertuples():
  # tokenize the review
  tokens = hiv4.tokenize(row.review)
  # compute the score
  score = hiv4.get_score(tokens)
  polarity = score['Polarity']
  # update the counters:
  # labeled positive and classified as positive -> true_pos_ps
  if polarity > 0 and row.sentiment == 'Positive':
    true_pos_ps += 1
  # labeled negative and classified as negative -> true_neg_ps
  elif polarity < 0 and row.sentiment == 'Negative':
    true_neg_ps += 1
  # labeled negative and classified as positive -> false_pos_ps
  elif polarity > 0 and row.sentiment == 'Negative':
    false_pos_ps += 1
  # labeled positive and classified as negative -> false_neg_ps
  elif polarity < 0 and row.sentiment == 'Positive':
    false_neg_ps += 1
  # classified as neutral (polarity equal to 0): always counted as an error,
  # assigned alternately (by row index parity, not randomly):
  # even row index -> false_pos_ps
  elif polarity == 0 and row.Index % 2 == 0:
    false_pos_ps += 1
  # odd row index -> false_neg_ps
  else:
    false_neg_ps += 1
  # print progress every 500 rows
  if row.Index % 500 == 0:
    print(row.Index)
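The counting rule in the loop above can also be factored into a small helper, which makes the handling of neutral predictions easier to see in isolation. This is only a sketch: the name `tally` and the keys `tp`, `tn`, `fp`, `fn` are hypothetical, and the sample scores are made up for illustration.

```python
from collections import Counter


def tally(counts: Counter, score: float, label: str, index: int) -> None:
    """Update confusion counts with the same rule as the loop above."""
    if score > 0:
        counts["tp" if label == "Positive" else "fp"] += 1
    elif score < 0:
        counts["tn" if label == "Negative" else "fn"] += 1
    else:
        # Neutral prediction: always an error, split by row index parity
        counts["fp" if index % 2 == 0 else "fn"] += 1


# Made-up (score, label) pairs, for illustration only
samples = [(0.4, "Positive"), (-0.2, "Positive"), (0.0, "Negative"), (0.0, "Positive")]
counts = Counter()
for index, (score, label) in enumerate(samples):
    tally(counts, score, label, index)
print(dict(counts))
```

Note that a neutral prediction is counted as a false positive or a false negative regardless of the true label, so it always lowers the accuracy.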

## Results analysis

In [13]:
# ACCURACY
accuracy_ps = (true_pos_ps+true_neg_ps)/(true_pos_ps+true_neg_ps+false_pos_ps+false_neg_ps)
print('The accuracy percentage is ' + str(accuracy_ps*100) + '%')

The accuracy percentage is 61.46%


In [14]:
# PRECISION
precision_ps = true_pos_ps/(true_pos_ps+false_pos_ps)
print('The precision is ' + str(precision_ps))

The precision is 0.5895928266941055


In [15]:
# RECALL
recall_ps = true_pos_ps/(true_pos_ps+false_neg_ps)
print('The recall is ' + str(recall_ps))

The recall is 0.7906564163217031


In [16]:
# F1-SCORE
f_score_ps=(2*precision_ps*recall_ps)/(precision_ps+recall_ps)
print('The f1-score is ' + str(f_score_ps))

The f1-score is 0.675479959582351
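The four metrics above can be gathered into one function, which avoids repeating the formulas for each method. The sketch below is a hypothetical helper (`classification_metrics` is not part of the notebook), the zero-denominator guards are an added assumption, and the counts passed in are illustrative only, not the results of this notebook.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    # Guard every division so empty inputs yield 0.0 instead of an error
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only
metrics = classification_metrics(tp=40, tn=30, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In the notebook, the same function could be called with `true_pos_ps`, `true_neg_ps`, `false_pos_ps` and `false_neg_ps`.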


# **Second method: NLTK and Vader**
NLTK (Natural Language Toolkit) is a library for NLP, while VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis model tuned to sentiment expressed in social media.


In [17]:
pip install nltk



In [18]:
import nltk

In [19]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [20]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()



In [21]:
# variables initialization
true_pos_nltk=0     # number of positive reviews classified as positive
true_neg_nltk=0     # number of negative reviews classified as negative
false_pos_nltk=0    # number of negative reviews classified as positive
false_neg_nltk=0    # number of positive reviews classified as negative

The score computed using this library has four components:

* Positive: proportion of the text that falls in the positive category
* Neutral: proportion of the text that falls in the neutral category
* Negative: proportion of the text that falls in the negative category
* Compound: sum of the valence ratings of the individual words, normalized to lie between -1 (extremely negative) and +1 (extremely positive)

In this work we use only the compound component: a review is classified as positive if compound is greater than 0, as neutral if it is exactly equal to 0, and as negative if it is lower than 0.
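To give an intuition for how a sum of word ratings can be squashed into the range from -1 to +1, here is a minimal sketch. It assumes the normalization x / sqrt(x² + α) with α = 15, which is, to my understanding, what the reference VADER implementation uses; treat both the formula and the constant as assumptions rather than as a description of NLTK's internals. The function name `normalize` and the raw sums are illustrative.

```python
import math


def normalize(score: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the interval [-1, 1].

    Assumed form: score / sqrt(score^2 + alpha), with alpha = 15.
    """
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, norm))


# Illustrative raw valence sums
for raw in (-4.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize(raw), 4))
```

Under this assumption a raw sum of 0 maps to a compound of exactly 0, which is the neutral case of the classification rule.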

In [None]:
for row in df.itertuples():
  # compute the score
  score = vader.polarity_scores(row.review)
  compound = score['compound']
  # update the counters:
  # labeled positive and classified as positive -> true_pos_nltk
  if compound > 0 and row.sentiment == 'Positive':
    true_pos_nltk += 1
  # labeled negative and classified as negative -> true_neg_nltk
  elif compound < 0 and row.sentiment == 'Negative':
    true_neg_nltk += 1
  # labeled negative and classified as positive -> false_pos_nltk
  elif compound > 0 and row.sentiment == 'Negative':
    false_pos_nltk += 1
  # labeled positive and classified as negative -> false_neg_nltk
  elif compound < 0 and row.sentiment == 'Positive':
    false_neg_nltk += 1
  # classified as neutral (compound equal to 0): always counted as an error,
  # assigned alternately (by row index parity, not randomly):
  # even row index -> false_pos_nltk
  elif compound == 0 and row.Index % 2 == 0:
    false_pos_nltk += 1
  # odd row index -> false_neg_nltk
  else:
    false_neg_nltk += 1
  # print progress every 500 rows
  if row.Index % 500 == 0:
    print(row.Index)

## Results analysis

In [23]:
# ACCURACY
accuracy_nltk = (true_pos_nltk+true_neg_nltk)/(true_pos_nltk+true_neg_nltk+false_pos_nltk+false_neg_nltk)
print('The accuracy percentage is ' + str(accuracy_nltk*100) + '%')

The accuracy percentage is 69.504%


In [24]:
# PRECISION
precision_nltk = true_pos_nltk/(true_pos_nltk+false_pos_nltk)
print('The precision is ' + str(precision_nltk))

The precision is 0.6475913646410513


In [25]:
# RECALL
recall_nltk = true_pos_nltk/(true_pos_nltk+false_neg_nltk)
print('The recall is ' + str(recall_nltk))

The recall is 0.8556226747209665


In [26]:
# F1-SCORE
f_score_nltk=(2*precision_nltk*recall_nltk)/(precision_nltk+recall_nltk)
print('The f1-score is ' + str(f_score_nltk))

The f1-score is 0.737212188060113
