**Sentiment Analysis of IMDB Movie Reviews**


This notebook is based heavily on the notebook by [Lakshmipathi N](https://www.kaggle.com/lakshmi25npathi) found on [Kaggle](https://www.kaggle.com/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews).

**Import necessary libraries**

In [1]:
#Load the libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
# https://online.stat.psu.edu/stat504/lesson/1/1.7
from utils import preprocesser_text, binarize_sentiment, train_test_split, evaluate

import os
import warnings

**Import the training dataset**

In [2]:
#importing the training data
imdb_data=pd.read_csv('data/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


**Exploratory data analysis**

In [3]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


**Sentiment count**

In [4]:
#sentiment count
imdb_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

The dataset is balanced: each class has 25,000 reviews.

**Preprocessing the reviews**

In [5]:
imdb_data = preprocesser_text(imdb_data)

Pandas Apply:   0%|          | 0/50000 [00:00<?, ?it/s]
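
`preprocesser_text` comes from the project's `utils` module, which is not shown here. The sketch below is only a guess at the kind of per-review cleaning such a helper might apply (HTML removal, punctuation removal, lowercasing, stopword removal). The function name `normalize_review` and the tiny stopword list are hypothetical. The stemmed tokens in the later output (for example `thi`, `wa`) suggest a stemming step as well, which this sketch leaves out.

```python
import html
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "that", "this", "is", "was", "it", "to", "and"}


def normalize_review(text: str) -> str:
    """Hypothetical stand-in for the normalization of a single review."""
    text = html.unescape(text)                   # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML tags such as <br />
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)   # drop punctuation and symbols
    tokens = text.lower().split()                # lowercase and tokenize on whitespace
    return " ".join(t for t in tokens if t not in STOPWORDS)


example = normalize_review("A wonderful little production. <br /><br />The filming is great!")
print(example)
```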

**Splitting into training and test sets**

In [6]:
#Split into train and test sets, then show the first normalized train review
norm_train, norm_test = train_test_split(imdb_data)
print(norm_train.sentiment.value_counts())
print(norm_test.sentiment.value_counts())
norm_train_reviews=norm_train.review
norm_train_reviews[0]

negative    20007
positive    19993
Name: sentiment, dtype: int64
positive    5007
negative    4993
Name: sentiment, dtype: int64


'one review ha mention watch 1 oz episod youll hook right thi exactli happen meth first thing struck oz wa brutal unflinch scene violenc set right word go trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison exp

**Normalized test reviews**

In [7]:
#Normalized test reviews
norm_test_reviews=norm_test.review
norm_test_reviews.loc[40005]

'hickori dickori dock wa good poirot mysteri confess read book despit avid agatha christi fan adapt isnt without problem time humour valiant attempt get right wa littl overdon event lead final solut rather rush also thought slow moment mysteri felt pad howev love hickori dickori dock wa film veri similar visual style brilliant abc murder realli set atmospher dark camera work dark light darker moment somewhat creepi thi wa help one haunt music score poirot adapt mayb disturb one one two buckl shoe gave nightmar plot complex essenti ingredi though convolut buckl shoeand way good thing act wa veri good david suchet impeccablei know cant use thi word forev cant think better word describ hi perform seri poirot phillip jackson paulin moran justic integr charact brilliantli student great person well develop whole particularli damian lewi leonard solid mysteri doesnt rank along best 7510 bethani cox'
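
The `train_test_split` used above is the project's own helper from `utils`, not the scikit-learn function of the same name. The index labels in the outputs (training rows up to 39999, a test row at 40005) are consistent with a simple positional split that keeps the original index. A minimal sketch under that assumption, with the hypothetical name `ordered_split`:

```python
import pandas as pd


def ordered_split(df: pd.DataFrame, n_train: int):
    """Hypothetical positional split: the first n_train rows train, the rest test."""
    train = df.iloc[:n_train]
    test = df.iloc[n_train:]
    return train, test


toy = pd.DataFrame({
    "review": ["r0", "r1", "r2", "r3", "r4"],
    "sentiment": ["positive", "negative", "positive", "negative", "positive"],
})
train, test = ordered_split(toy, n_train=4)

# The test frame keeps its original index labels, so rows are addressed with .loc
print(test.review.loc[4])
```

Because the index is preserved, the first test row is reached with its original label rather than with 0.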

**Bag of words model**

The bag of words model converts each text document into a numerical vector of token counts.

> Convert a collection of text documents to a matrix of token counts.
> This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
> If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
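
To make the idea of token counts concrete, here is a small pure-Python sketch that counts all word n-grams of length 1 to 3 in one document. It only illustrates the principle; it is not how `CountVectorizer` is implemented, and the helper names `ngrams` and `bag_of_words` are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram with n_min <= n <= n_max as a string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def bag_of_words(document, n_min=1, n_max=3):
    """Count the n-grams of a whitespace-tokenized document."""
    return Counter(ngrams(document.split(), n_min, n_max))


counts = bag_of_words("good movie good plot")
print(counts)
```

Four tokens already produce eight distinct n-grams, which gives a feel for how quickly the vocabulary grows on 40,000 reviews.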

In [8]:
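#Count vectorizer for bag of words
#Note: max_df=1 is an integer, so only n-grams that occur in at most one document are kept
#(the float max_df=1.0 would mean "up to 100% of the documents")
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))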
#transformed train reviews
cv_train_reviews=cv.fit_transform(norm_train_reviews)
#transformed test reviews
cv_test_reviews=cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)
#vocab=cv.get_feature_names_out()  # feature names (get_feature_names() in scikit-learn < 1.0)

BOW_cv_train: (40000, 6209089)
BOW_cv_test: (10000, 6209089)
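
In scikit-learn, an integer `max_df` is an absolute document count, while a float is a proportion of documents. The toy corpus below is made up to show the difference between `max_df=1` and `max_df=1.0`. This may help explain why the vocabulary above is so large and consists of n-grams seen in a single review.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good plot"]

# Integer 1: keep only terms found in at most one document
rare_only = CountVectorizer(max_df=1).fit(docs)
# Float 1.0 (the default): keep terms found in up to 100% of the documents
all_terms = CountVectorizer(max_df=1.0).fit(docs)

rare_vocab = sorted(rare_only.vocabulary_)
full_vocab = sorted(all_terms.vocabulary_)
print(rare_vocab)
print(full_vocab)
```

With the integer setting, the terms shared between documents ("good", "movie") are dropped and only the one-off terms remain.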


**Term Frequency-Inverse Document Frequency model (TFIDF)**

It converts text documents to a matrix of tf-idf features.
> The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

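As a simplified intuition, ignoring the logarithm and the added 1:

tf(t, d) = (# occurrences of word t in d) / (max # occurrences of word t over all documents)

idf(t) = # documents / # documents containing word t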


tf-idf(t, d) = tf(t, d) * idf(t)
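
The quoted formula can be checked by hand on a tiny corpus. The sketch below implements idf(t) = ln(n / df(t)) + 1 (the `smooth_idf=False` variant, with the natural logarithm) and uses the raw term count as tf. The corpus and function names are made up for illustration. `TfidfVectorizer` with default settings additionally smooths the idf and L2-normalizes each row, so its values will differ from these.

```python
import math

docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot", "movie"],
]


def idf(term, documents):
    """idf(t) = ln(n / df(t)) + 1, the smooth_idf=False variant."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log(n / df) + 1


def tf_idf(term, doc, documents):
    """Raw count of the term in the document multiplied by its idf."""
    return doc.count(term) * idf(term, documents)


idf_movie = idf("movie", docs)   # "movie" occurs in all 3 documents
idf_bad = idf("bad", docs)       # "bad" occurs in 1 of 3 documents
score = tf_idf("bad", docs[1], docs)
print(idf_movie, idf_bad, score)
```

A term that occurs in every document gets an idf of exactly 1 rather than 0, which is the effect of the added 1 described in the quote.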

In [9]:
#Tfidf vectorizer
#Note: the integer max_df=1 keeps only n-grams that occur in at most one document
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train_reviews)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test_reviews)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (40000, 6209089)
Tfidf_test: (10000, 6209089)


**Split and binarize the sentiment data**

In [14]:
#Splitting the sentiment data
train_sentiments=norm_train.sentiment
test_sentiments=norm_test.sentiment

test_sentiments = binarize_sentiment(test_sentiments)
train_sentiments = binarize_sentiment(train_sentiments)
print(train_sentiments)
print(test_sentiments)

0        1
1        1
2        1
3        0
4        1
        ..
39995    1
39996    1
39997    1
39998    0
39999    0
Name: sentiment, Length: 40000, dtype: int64
40000    0
40001    0
40002    0
40003    0
40004    0
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 10000, dtype: int64
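
`binarize_sentiment` is another helper from `utils` that is not shown. The output is consistent with mapping `positive` to 1 and `negative` to 0 (row 0 is positive and becomes 1, row 3 is negative and becomes 0). A minimal sketch under that assumption, with the hypothetical name `binarize_labels`:

```python
import pandas as pd


def binarize_labels(labels: pd.Series) -> pd.Series:
    """Hypothetical mapping: 'positive' -> 1, 'negative' -> 0."""
    return labels.map({"positive": 1, "negative": 0})


encoded = binarize_labels(pd.Series(["positive", "negative", "positive"], name="sentiment"))
print(encoded)
```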


**Modelling the dataset**

Let us build a logistic regression model for both the bag of words and the tfidf features.

In [15]:
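#training the model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
#Fitting the model for Bag of words
lr_bow=lr.fit(cv_train_reviews,train_sentiments)
print(lr_bow)
#Fitting the model for tfidf features
#Note: fit() returns the estimator itself, so lr, lr_bow and lr_tfidf are one object
#that, after this call, holds only the weights learned from the tfidf features
lr_tfidf=lr.fit(tv_train_reviews,train_sentiments)
print(lr_tfidf)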

LogisticRegression(C=1, max_iter=500, random_state=42)
LogisticRegression(C=1, max_iter=500, random_state=42)
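
In scikit-learn, `fit()` returns the estimator itself, so fitting one estimator twice leaves a single model with the weights from the second call. One way to keep a separate model per feature set is `sklearn.base.clone`, sketched here on made-up toy data (all variable names are illustrative):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Two toy feature matrices for the same four labelled samples (hypothetical data)
X_a = np.array([[0.0], [1.0], [2.0], [3.0]])
X_b = np.array([[3.0], [2.0], [1.0], [0.0]])
y = np.array([0, 0, 1, 1])

base = LogisticRegression(max_iter=500, random_state=42)

# fit() returns the estimator itself, so both names point at one object
# that holds only the weights from the most recent fit
shared_a = base.fit(X_a, y)
shared_b = base.fit(X_b, y)
same_object = shared_a is shared_b

# clone() copies the hyperparameters into a fresh, unfitted estimator
model_a = clone(base).fit(X_a, y)
model_b = clone(base).fit(X_b, y)
coef_a = model_a.coef_[0, 0]
coef_b = model_b.coef_[0, 0]
print(same_object, coef_a, coef_b)
```

The two cloned models learn coefficients of opposite sign, which shows that each one kept the fit for its own feature matrix.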


**Logistic regression model performance on test dataset**

In [18]:
#Predictions on the bag of words test features
lr_bow_predict=lr.predict(cv_test_reviews)
print(lr_bow_predict)
#Predictions on the tfidf test features
lr_tfidf_predict=lr.predict(tv_test_reviews)
print(lr_tfidf_predict)

#Predictions on the bag of words training features
lr_bow_predict_train=lr.predict(cv_train_reviews)
#Predictions on the tfidf training features
lr_tfidf_predict_train=lr.predict(tv_train_reviews)


[0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]


**Accuracy of the model**

In [20]:
#Accuracy score for bag of words
print("lr_bow_score test:",evaluate(test_sentiments,lr_bow_predict)[0])
#Accuracy score for tfidf features
print("lr_tfidf_score test:",evaluate(test_sentiments,lr_tfidf_predict)[0])
#Accuracy score for bag of words
print("lr_bow_score train:",evaluate(train_sentiments,lr_bow_predict_train)[0])
#Accuracy score for tfidf features
print("lr_tfidf_score train:",evaluate(train_sentiments,lr_tfidf_predict_train)[0])

lr_bow_score test: 0.7512
lr_tfidf_score test: 0.75
lr_bow_score train: 0.996275
lr_tfidf_score train: 0.996275
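
The `evaluate` helper comes from the project's `utils` module, which is not shown. Judging from how it is indexed in this notebook, it appears to return an accuracy at position 0 and a text report at position 1. A minimal sketch of such a helper, assuming it wraps `accuracy_score` and `classification_report` from scikit-learn (the name `evaluate_sketch` and the toy labels are hypothetical):

```python
from sklearn.metrics import accuracy_score, classification_report


def evaluate_sketch(y_true, y_pred):
    """Hypothetical helper returning (accuracy, text report)."""
    accuracy = accuracy_score(y_true, y_pred)
    report = classification_report(
        y_true, y_pred, target_names=["Negative", "Positive"]
    )
    return accuracy, report


accuracy, report = evaluate_sketch([0, 0, 1, 1], [0, 1, 1, 1])
print(accuracy)
print(report)
```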


**Print the classification report**

In [21]:
#Classification report for bag of words
print(evaluate(test_sentiments,lr_bow_predict)[1])

#Classification report for tfidf features
print(evaluate(test_sentiments,lr_tfidf_predict)[1])

              precision    recall  f1-score   support

    Negative       0.75      0.75      0.75      4993
    Positive       0.75      0.75      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000

              precision    recall  f1-score   support

    Negative       0.74      0.77      0.75      4993
    Positive       0.76      0.73      0.75      5007

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000


**Linear support vector machine trained with stochastic gradient descent, for bag of words and tfidf features**

In [22]:
#training the linear svm
svm=SGDClassifier(loss='hinge',max_iter=2000,random_state=42)
#fitting the svm for bag of words
svm_bow=svm.fit(cv_train_reviews,train_sentiments)
print(svm_bow)
#fitting the svm for tfidf features
#Note: this refits the same svm object, so svm_bow and svm_tfidf both refer to the tfidf fit
svm_tfidf=svm.fit(tv_train_reviews,train_sentiments)
print(svm_tfidf)

SGDClassifier(max_iter=2000, random_state=42)
SGDClassifier(max_iter=2000, random_state=42)


**Model performance on test data**

In [33]:
#Predicting the model for bag of words
svm_bow_predict=svm.predict(cv_test_reviews)
print(svm_bow_predict)
#Predicting the model for tfidf features
svm_tfidf_predict=svm.predict(tv_test_reviews)
print(svm_tfidf_predict)

#Predicting the model for bag of words
svm_bow_predict_train=svm.predict(cv_train_reviews)
##Predicting the model for tfidf features
svm_tfidf_predict_train=svm.predict(tv_train_reviews)

[1 1 0 ... 1 1 1]
[1 1 1 ... 1 1 1]


**Accuracy of the model**

In [34]:
#Accuracy score for bag of words
print("svm_bow_score test:",evaluate(test_sentiments,svm_bow_predict)[0])
#Accuracy score for tfidf features
print("svm_tfidf_score test:",evaluate(test_sentiments,svm_tfidf_predict)[0])

#Accuracy score for bag of words
print("svm_bow_score train:",evaluate(train_sentiments,svm_bow_predict_train)[0])
#Accuracy score for tfidf features
print("svm_tfidf_score train:",evaluate(train_sentiments,svm_tfidf_predict_train)[0])

svm_bow_score test: 0.5829
svm_tfidf_score test: 0.5112
svm_bow_score train: 0.990425
svm_tfidf_score train: 0.990425


**Print the classification report**

In [25]:
#Classification report for bag of words 
print(evaluate(test_sentiments,svm_bow_predict)[1])
#Classification report for tfidf features
print(evaluate(test_sentiments,svm_tfidf_predict)[1])

              precision    recall  f1-score   support

    Negative       0.94      0.18      0.30      4993
    Positive       0.55      0.99      0.70      5007

    accuracy                           0.58     10000
   macro avg       0.74      0.58      0.50     10000
weighted avg       0.74      0.58      0.50     10000

              precision    recall  f1-score   support

    Negative       1.00      0.02      0.04      4993
    Positive       0.51      1.00      0.67      5007

    accuracy                           0.51     10000
   macro avg       0.75      0.51      0.36     10000
weighted avg       0.75      0.51      0.36     10000
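
A report with very high precision and very low recall for one class, as for `Negative` above, means the model almost always predicts the other class. The pure-Python sketch below reproduces that pattern on made-up labels. The function name and data are illustrative only.

```python
def precision_recall_f1(y_true, y_pred, target):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Five negatives (0) and five positives (1); the model predicts 1 for all but one sample
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

neg = precision_recall_f1(y_true, y_pred, target=0)
pos = precision_recall_f1(y_true, y_pred, target=1)
print(neg)
print(pos)
```

The single negative prediction is correct, so negative precision is perfect, but four of the five negatives are missed, so negative recall is low.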



**Multinomial Naive Bayes for bag of words and tfidf features**

In [26]:
#training the model
mnb=MultinomialNB()
#fitting the naive Bayes model for bag of words
mnb_bow=mnb.fit(cv_train_reviews,train_sentiments)
print(mnb_bow)
#fitting the naive Bayes model for tfidf features
#Note: this refits the same mnb object, so mnb_bow and mnb_tfidf both refer to the tfidf fit
mnb_tfidf=mnb.fit(tv_train_reviews,train_sentiments)
print(mnb_tfidf)

MultinomialNB()
MultinomialNB()


**Model performance on test data**

In [37]:
#Predicting the model for bag of words
mnb_bow_predict=mnb.predict(cv_test_reviews)
print(mnb_bow_predict)
#Predicting the model for tfidf features
mnb_tfidf_predict=mnb.predict(tv_test_reviews)
print(mnb_tfidf_predict)

#Predicting the model for bag of words
mnb_bow_predict_train=mnb.predict(cv_train_reviews)
##Predicting the model for tfidf features
mnb_tfidf_predict_train=mnb.predict(tv_train_reviews)

[0 0 0 ... 0 1 1]
[0 0 0 ... 0 1 1]


**Accuracy of the model**

In [38]:
#Accuracy score for bag of words
print("mnb_bow_score test:",evaluate(test_sentiments,mnb_bow_predict)[0])
print("mnb_bow_score train:",evaluate(train_sentiments,mnb_bow_predict_train)[0])
#Accuracy score for tfidf features
print("mnb_tfidf_score test:",evaluate(test_sentiments,mnb_tfidf_predict)[0])
print("mnb_tfidf_score train:",evaluate(train_sentiments,mnb_tfidf_predict_train)[0])

mnb_bow_score test: 0.751
mnb_bow_score train: 0.996275
mnb_tfidf_score test: 0.7509
mnb_tfidf_score train: 0.996275


**Print the classification report**

In [None]:
#Classification report for bag of words 
print(evaluate(test_sentiments,mnb_bow_predict)[1])
#Classification report for tfidf features
print(evaluate(test_sentiments,mnb_tfidf_predict)[1])