## Sentiment Analysis on Movie Reviews

Sentiment is binary: reviews with a rating < 5 are given a sentiment score of 0, and reviews with a rating >= 7 are given a sentiment score of 1.
No individual movie has more than 30 reviews.
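
The labeling rule can be written as a small function. This is a minimal sketch: the name `rating_to_sentiment` is hypothetical, and it assumes integer ratings, with the in-between ratings (5 and 6) left unlabeled because the rule does not cover them.

```python
from typing import Optional

def rating_to_sentiment(rating: int) -> Optional[int]:
    """Return 0 for rating < 5, 1 for rating >= 7, and None otherwise."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    return None  # ratings 5 and 6 fall outside the rule (assumption: left unlabeled)

labels = [rating_to_sentiment(r) for r in (3, 8, 5)]
print(labels)  # [0, 1, None]
```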

##### Data fields
- id - Unique ID of each review
- sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
- review - Text of the review

### Problem Statement:

Build a sentiment analysis model that classifies movie reviews as positive or negative based on their text. Use techniques such as word embeddings (e.g. Word2Vec), bag of words, etc.

To accomplish this, follow the steps below:
- Data preprocessing
- Feature extraction
- Model building
- Evaluation of the model's performance
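
As a rough picture of how these steps fit together, here is a minimal end-to-end sketch using scikit-learn's `make_pipeline`. The toy corpus and labels are invented for illustration and are not taken from the dataset; the choice of TF-IDF plus logistic regression is just one possible combination.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only (not from the dataset).
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful acting boring story"]
train_labels = [1, 1, 0, 0]
test_texts = ["great story", "boring film"]
test_labels = [1, 0]

# The vectorizer covers preprocessing (lowercasing, stop-word removal)
# and feature extraction; the classifier is the model-building step.
pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(),
)
pipeline.fit(train_texts, train_labels)

# Evaluation on held-out text.
predictions = pipeline.predict(test_texts)
print(accuracy_score(test_labels, predictions))
```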

In [18]:
# importing necessary libraries

import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Reading the dataset
data = pd.read_csv('labeledTrainData.tsv', delimiter='\t')
data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [3]:
data.shape

(792, 3)

In [4]:
data.dtypes

id           object
sentiment     int64
review       object
dtype: object

In [5]:
#Summary of the dataset
data.describe()

Unnamed: 0,sentiment
count,792.0
mean,0.479798
std,0.499907
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [6]:
# Checking for null values
data.isna().sum()                                # No null values present

id           0
sentiment    0
review       0
dtype: int64

In [7]:
data['review'][1]

'\\The Classic War of the Worlds\\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \\"critics\\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \\"critics\\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \\"critics\\" perceive to be its shortcomings."'

In [8]:
#sentiment count
data['sentiment'].value_counts()            # Dataset is roughly balanced

0    412
1    380
Name: sentiment, dtype: int64

In [9]:
# Checking for duplicate values
data.duplicated().sum()                    # No duplicate values present

0

## Basic Preprocessing
- Removing HTML tags
- Lowercasing
- Removing punctuation
- Removing stopwords
- Lemmatization
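
The steps listed above can be chained into one function. The sketch below is self-contained, so it substitutes a tiny hand-written stop-word set (`TOY_STOPWORDS`, a hypothetical stand-in for NLTK's English list) and leaves out lemmatization, which needs the WordNet download. The name `clean_text` is also an assumption for illustration.

```python
import re
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
TOY_STOPWORDS = {"the", "is", "a", "of", "and", "to", "it"}

TAG_PATTERN = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(raw_text: str) -> str:
    """Strip HTML tags, lowercase, drop punctuation and stop words."""
    text = TAG_PATTERN.sub("", raw_text)
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in TOY_STOPWORDS]
    return " ".join(tokens)

print(clean_text("The plot is thin,<br /> but the acting is GREAT!"))
# plot thin but acting great
```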

#### Removing HTML tags

In [10]:
import re
def remove_tags(raw_text):
    cleaned_text = re.sub(re.compile('<.*?>'), '', raw_text)
    return cleaned_text

data['review'] = data['review'].apply(remove_tags)

In [11]:
data['review'][1]

'\\The Classic War of the Worlds\\" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells\' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \\"critics\\" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the \\"critics\\". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells\' classic novel, and we found it to be very entertaining. This made it easy to overlook what the \\"critics\\" perceive to be its shortcomings."'

#### Lowercasing

In [12]:
data['review'] = data['review'].apply(lambda x:x.lower())

#### Removing punctuation

In [13]:
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))


data['review'] = data['review'].apply(lambda text: remove_punctuation(text))

In [14]:
data['review'][1]

'the classic war of the worlds by timothy hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate h g wells classic book mr hines succeeds in doing so i and those who watched his film with me appreciated the fact that it was not the standard predictable hollywood fare that comes out every year eg the spielberg version with tom cruise that had only the slightest resemblance to the book obviously everyone looks for different things in a movie those who envision themselves as amateur critics look only to criticize everything they can others rate a movie on more important baseslike being entertained which is why most people never agree with the critics we enjoyed the effort mr hines put into being faithful to hg wells classic novel and we found it to be very entertaining this made it easy to overlook what the critics perceive to be its shortcomings'

#### Removing stopwords

In [25]:
import nltk

# Download the NLTK stopwords resource
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [26]:
#Removing stopwords
import nltk
from nltk.corpus import stopwords

sw_list = stopwords.words('english')

data['review'] = data['review'].apply(lambda x: [item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))

In [27]:
data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff going moment mj ive started listening mu...
1,2381_9,1,classic war world timothy hines entertaining f...
2,7759_3,0,film start manager nicholas bell giving welcom...
3,3630_4,0,must assumed praised film greatest filmed oper...
4,9495_8,1,superbly trashy wondrously unpretentious 80 ex...


#### Lemmatization

In [28]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

data["review"] = data["review"].apply(lambda text: lemmatize_words(text))
data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff going moment mj ive started listening mu...
1,2381_9,1,classic war world timothy hines entertaining f...
2,7759_3,0,film start manager nicholas bell giving welcom...
3,3630_4,0,must assumed praised film greatest filmed oper...
4,9495_8,1,superbly trashy wondrously unpretentious 80 ex...


## Feature Extraction Techniques:
- BOW
- TF-IDF
- Word2Vec


In [30]:
x = data.iloc[:,2:3]
x.head()

Unnamed: 0,review
0,stuff going moment mj ive started listening mu...
1,classic war world timothy hines entertaining f...
2,film start manager nicholas bell giving welcom...
3,must assumed praised film greatest filmed oper...
4,superbly trashy wondrously unpretentious 80 ex...


In [31]:
y= data['sentiment']
y.head()

0    1
1    1
2    0
3    0
4    1
Name: sentiment, dtype: int64

In [32]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=134)

In [33]:
x_train

Unnamed: 0,review
769,recently purchased collection one awesome seri...
297,riding giant amazing movie really show people ...
205,shannon leethe daughter bruce leedelivers high...
68,king mask beautifully told story pit familial ...
482,like people wa intrigued heard concept film es...
...,...
114,reallyand incredible film though isnt populare...
375,penultimate episode star trek third season exc...
376,whatever possessed guy ritchie remake wertmull...
550,last read nancy drew book 20 year ago much mem...


### BOW

Bag-of-words (BOW) model: converts each text document into a numerical vector of word counts over a fixed vocabulary.
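
To make the idea concrete, the sketch below builds count vectors by hand for two toy sentences (invented for illustration). It mirrors what a count-based vectorizer does conceptually, but it is not scikit-learn's implementation; the helper name `bag_of_words` is hypothetical.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]

# Vocabulary: every distinct token, sorted so each word gets a fixed column.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
print(vectors)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```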

In [None]:
cv = CountVectorizer(stop_words = "english", min_df = 10, max_df=200, max_features = 200)

In [None]:
# convert the sparse matrix to a dense NumPy array with toarray()
x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

In [None]:
print(cv.vocabulary_)

{'noir': 122, 'price': 132, 'program': 135, 'grant': 72, 'prince': 133, 'brown': 20, 'fat': 57, 'favor': 58, 'learned': 102, 'golden': 70, 'tape': 174, 'al': 1, 'seven': 157, 'jerry': 88, 'soap': 166, 'douglas': 47, 'jr': 93, 'magnificent': 108, 'grand': 71, 'prove': 136, 'dressed': 48, 'judge': 94, 'davis': 37, 'loss': 105, 'vhs': 187, 'japan': 86, 'fred': 65, 'tune': 184, 'paris': 126, 'danny': 36, 'shakespeare': 159, 'stewart': 170, 'bank': 10, 'johnny': 92, 'murdered': 118, 'knowledge': 99, 'round': 145, 'native': 119, 'joan': 91, 'heroine': 76, 'burn': 23, 'allen': 3, 'keeping': 96, 'screaming': 152, 'president': 130, 'speech': 167, 'gangster': 66, 'behavior': 12, 'treated': 181, 'lake': 101, 'ed': 52, 'boat': 15, 'drunk': 50, 'shadow': 158, 'jackson': 85, 'scared': 151, 'sake': 148, 'wedding': 192, 'angel': 6, 'river': 142, 'charlie': 30, 'russian': 147, 'eat': 51, 'batman': 11, 'tim': 179, 'cgi': 29, 'hunter': 82, 'loving': 106, 'rob': 143, 'pulled': 138, 'international': 84, 'a

In [None]:
len(cv.vocabulary_)

200

### TF-IDF

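Term Frequency-Inverse Document Frequency (TF-IDF) model: converts text documents into a matrix of TF-IDF features, which weight each term by how often it appears in a document and down-weight terms that appear in many documents.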
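
A hand-rolled sketch of the weighting on a toy corpus (invented for illustration). It uses the smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which to my understanding matches `TfidfVectorizer`'s defaults; treat it as an illustration rather than a reference implementation. The helper name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie"]
vocabulary = sorted({token for doc in docs for token in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = {term: sum(term in doc.split() for doc in docs) for term in vocabulary}

def tfidf_vector(doc):
    """Raw term counts times smoothed IDF, scaled to unit L2 norm."""
    counts = Counter(doc.split())
    weights = [
        counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term in vocabulary
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

for doc in docs:
    print([round(w, 3) for w in tfidf_vector(doc)])
```

In this toy corpus "movie" occurs in both documents, so it receives a lower weight than "good" or "bad", which each occur in only one.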

In [None]:
tfidfvec = TfidfVectorizer(stop_words = "english", min_df = 10, max_df=200, max_features = 2000)

In [None]:
# convert the sparse matrix to a dense NumPy array with toarray()
x_train_tfidf= tfidfvec.fit_transform(x_train['review']).toarray()
x_test_tfidf = tfidfvec.transform(x_test['review']).toarray()

In [None]:
print(tfidfvec.vocabulary_)

{'weakness': 1958, 'destiny': 487, 'murderous': 1169, 'cox': 408, 'scale': 1555, 'afternoon': 57, 'rental': 1466, 'lip': 1043, 'tense': 1803, 'tight': 1831, 'significant': 1623, 'statement': 1703, 'thirty': 1818, 'secretary': 1569, 'bleak': 179, 'enter': 587, 'vincent': 1931, 'ad': 42, 'assistant': 115, 'meeting': 1115, 'hoped': 855, 'sleazy': 1647, 'letter': 1030, 'discovers': 514, 'spy': 1694, 'join': 966, 'gain': 725, 'freedom': 711, 'seat': 1568, 'unlikely': 1907, 'factor': 640, 'sensitive': 1580, 'hearing': 818, 'aid': 59, 'rank': 1418, 'noir': 1202, 'farm': 652, 'lying': 1074, 'burning': 242, 'iti': 940, 'unable': 1885, 'fictional': 662, 'conspiracy': 373, 'internet': 922, 'price': 1356, 'ran': 1416, 'refuse': 1443, 'program': 1374, 'overrated': 1257, 'reader': 1425, '710': 24, 'mainstream': 1085, 'bag': 136, 'thrill': 1822, 'itthe': 941, '25': 19, 'hardcore': 805, 'appropriate': 105, 'critical': 423, 'religion': 1455, 'remaining': 1458, 'acceptable': 32, 'revealed': 1494, 'bbc':

In [None]:
len(tfidfvec.vocabulary_)

2000

### Word2Vec

In [None]:
# using pre-trained model

In [34]:
import gensim

In [35]:
from gensim.models import Word2Vec,KeyedVectors

In [42]:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2023-08-13 08:18:41--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.200.240, 52.217.121.0, 52.216.137.22, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.200.240|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2023-08-13 08:18:41 ERROR 404: Not Found.



In [43]:
!gzip -d GoogleNews-vectors-negative300.bin.gz

gzip: GoogleNews-vectors-negative300.bin.gz: No such file or directory


In [None]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')

vec_king = wv['king']



In [36]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True,limit=500000)

FileNotFoundError: ignored

In [None]:
# Stopwords were already removed during preprocessing; repeating the step is
# harmless and also turns the DataFrames into Series of review text
X_train = x_train['review'].apply(lambda x: [item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))

X_test = x_test['review'].apply(lambda x: [item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))

In [None]:
import spacy
import en_core_web_sm
# Load the spacy model. This takes a few seconds.
nlp = en_core_web_sm.load()
# Process a sentence using the model
doc = nlp(x_train['review'].values[0])
print(doc.vector)

In [45]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz
!python -m spacy download en_core_web_sm

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz
  ERROR: HTTP error 404 while getting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz
ERROR: Could not install requirement https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz because of HTTP error 404 Client Error: Not Found for url: https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz for URL https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.5.tar.gz
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 56.3 MB/

In [None]:
input_arr = []
for item in X_train.values:
    doc = nlp(item)
    input_arr.append(doc.vector)

input_arr = np.array(input_arr)

In [None]:
input_test_arr = []
for item in X_test.values:
    doc = nlp(item)
    input_test_arr.append(doc.vector)

input_test_arr = np.array(input_test_arr)
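
Each review here is reduced to a single vector through `doc.vector`, which in spaCy defaults to the average of the token vectors. The sketch below shows that averaging idea on its own, with a hypothetical 2-dimensional embedding table (`toy_embeddings`) invented for illustration; real embeddings have far more dimensions, and the helper name `average_vector` is an assumption.

```python
# Hypothetical 2-dimensional embeddings, invented for illustration.
toy_embeddings = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

def average_vector(text, embeddings, dim=2):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[tok] for tok in text.split() if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]

print(average_vector("good movie", toy_embeddings))     # [0.5, 0.5]
print(average_vector("unknown words", toy_embeddings))  # [0.0, 0.0]
```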

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(input_arr,y_train)

In [None]:
from sklearn.metrics import accuracy_score

y_pred = gnb.predict(input_test_arr)
accuracy_score(y_test,y_pred)

## Model Building and Evaluation

### 1. Logistic regression

In [None]:
## Using BOW

In [None]:
from sklearn.linear_model import LogisticRegression
lr= LogisticRegression()

In [None]:
lr.fit(x_train_bow,y_train)
y_pred = lr.predict(x_test_bow)

In [None]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, classification_report
accuracy_score(y_test,y_pred)

0.5922

In [None]:
## Using Tf-IDF

In [None]:
lr= LogisticRegression()
lr.fit(x_train_tfidf,y_train)

LogisticRegression()

In [None]:
y_pred = lr.predict(x_test_tfidf)
accuracy_score(y_test,y_pred)

0.7522

### 2. GaussianNB

In [None]:
## Using BOW

In [None]:
#Gaussian Naive Bayes algorithm
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(x_train_bow,y_train)

GaussianNB()

In [None]:
y_pred = gnb.predict(x_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.5714

In [None]:
## Using TF-IDF

In [None]:
gnb = GaussianNB()

gnb.fit(x_train_tfidf,y_train)

GaussianNB()

In [None]:
y_pred = gnb.predict(x_test_tfidf)

accuracy_score(y_test,y_pred)

0.7356

### 3. RandomForestClassifier

In [None]:
## Using BOW

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(x_train_bow,y_train)
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test,y_pred)

0.5804

In [None]:
## Using TF-IDF

In [None]:
rf = RandomForestClassifier()

rf.fit(x_train_tfidf,y_train)
y_pred = rf.predict(x_test_tfidf)
accuracy_score(y_test,y_pred)

0.7402

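- With these settings, logistic regression with TF-IDF features performs best (accuracy 0.7522), ahead of Random Forest (0.7402) and Gaussian Naive Bayes (0.7356) on the same features. TF-IDF beats BOW for every model, although the two vectorizers were capped differently (`max_features` of 2000 vs. 200), so the comparison is not like for like.

Next, we increase the number of features to 3000 and retrain the Random Forest model.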
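
Comparing models is easier to trust when every model sees the same features. The sketch below evaluates each vectorizer and classifier pair in one loop with an identical `max_features` cap. The toy corpus is invented for illustration, so the printed scores say nothing about the real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for illustration only.
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful acting boring story"]
train_labels = [1, 1, 0, 0]
test_texts = ["great story", "boring film"]
test_labels = [1, 0]

vectorizers = {"BOW": CountVectorizer, "TF-IDF": TfidfVectorizer}
models = {
    "LogisticRegression": LogisticRegression,
    "GaussianNB": GaussianNB,
    "RandomForest": lambda: RandomForestClassifier(random_state=0),
}

results = {}
for vec_name, make_vectorizer in vectorizers.items():
    vectorizer = make_vectorizer(max_features=3000)  # same cap for both
    x_tr = vectorizer.fit_transform(train_texts).toarray()
    x_te = vectorizer.transform(test_texts).toarray()
    for model_name, make_model in models.items():
        model = make_model()
        model.fit(x_tr, train_labels)
        score = accuracy_score(test_labels, model.predict(x_te))
        results[(vec_name, model_name)] = score

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```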

In [None]:
## With max features = 3000

In [None]:
cv = CountVectorizer(max_features=3000)

x_train_bow = cv.fit_transform(x_train['review']).toarray()
x_test_bow = cv.transform(x_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(x_train_bow,y_train)
y_pred = rf.predict(x_test_bow)
accuracy_score(y_test,y_pred)

0.8276

In [None]:
tfidfvec = TfidfVectorizer(max_features=3000)

x_train_tfidf = tfidfvec.fit_transform(x_train['review']).toarray()
x_test_tfidf = tfidfvec.transform(x_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(x_train_tfidf,y_train)
y_pred = rf.predict(x_test_tfidf)
accuracy_score(y_test,y_pred)

0.8366

In [None]:
print('The accuracy score is: ', accuracy_score(y_test,y_pred))

The accuracy score is:  0.8366


In [None]:
from sklearn.metrics import accuracy_score, recall_score, f1_score, confusion_matrix, classification_report

In [None]:
recall_score(y_test,y_pred)

0.8328723824575267

In [None]:
f1_score(y_test,y_pred)

0.8376713689648321

In [None]:
confusion_matrix(y_test,y_pred)

array([[2075,  394],
       [ 423, 2108]], dtype=int64)
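
scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. As a check, the scores reported above can be recomputed by hand from those four counts:

```python
# Values copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 2075, 394
fn, tp = 423, 2108

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.8366
print(round(recall, 4))     # 0.8329
print(round(precision, 4))  # 0.8425
print(round(f1, 4))         # 0.8377
```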

In [None]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.84      0.84      2469
           1       0.84      0.83      0.84      2531

    accuracy                           0.84      5000
   macro avg       0.84      0.84      0.84      5000
weighted avg       0.84      0.84      0.84      5000



#### Using N-grams

In [None]:
# # N-grams (unigrams + bigrams); uses the x_train/x_test DataFrames from the split above
# cv = CountVectorizer(ngram_range=(1,2),max_features=5000)

# x_train_bow = cv.fit_transform(x_train['review']).toarray()
# x_test_bow = cv.transform(x_test['review']).toarray()

# rf = RandomForestClassifier()

# rf.fit(x_train_bow,y_train)
# y_pred = rf.predict(x_test_bow)
# accuracy_score(y_test,y_pred)
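
With `ngram_range=(1,2)` the vectorizer counts both single words (unigrams) and pairs of adjacent words (bigrams). A minimal sketch of what those features look like for one sentence; the helper name `ngrams` and the example sentence are invented for illustration:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a good movie".split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
# ['not', 'a', 'good', 'movie', 'not a', 'a good', 'good movie']
```

Bigrams keep some word order that single-word counts discard, at the cost of a much larger vocabulary.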