# Nafisur Rahman
nafisur21@gmail.com<br>
https://www.linkedin.com/in/nafisur-rahman

# Sentiment Analysis
Predicting the sentiment (positive or negative) of IMDB movie reviews.

## About this Project
This is a Kaggle project based on the Kaggle dataset "Bag of Words Meets Bags of Popcorn". The original dataset is available from the Stanford website: http://ai.stanford.edu/~amaas/data/sentiment/.<br>
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment is binary: reviews with an IMDB rating < 5 have a sentiment score of 0, and reviews with a rating >= 7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. The data fields are:<br>
* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review

## A. Loading libraries and Dataset

### Importing Packages

In [1]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

%matplotlib inline

### Loading the dataset

In [2]:
raw_data_train=pd.read_csv('labeledTrainData.tsv',sep='\t')
raw_data_test=pd.read_csv('testData.tsv',sep='\t')

### Basic visualization of dataset

In [3]:
print(raw_data_train.shape)
print(raw_data_test.shape)

(25000, 3)
(25000, 2)


In [4]:
raw_data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB


In [5]:
raw_data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
id        25000 non-null object
review    25000 non-null object
dtypes: object(2)
memory usage: 390.7+ KB


In [6]:
raw_data_train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [7]:
raw_data_test.head()

| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |


In [8]:
raw_data_train['review'][0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## B. Data Cleaning and Text Preprocessing

Removing tags and markup

In [9]:
from bs4 import BeautifulSoup
soup=BeautifulSoup(raw_data_train['review'][0],'lxml').text
soup

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

Removing non-letters

In [10]:
import re
re.sub('[^a-zA-Z]',' ',raw_data_train['review'][0])

'With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay  br    br   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him  br    br   The actual feature film bit when it finally sta

Word tokenization

In [11]:
from nltk.tokenize import word_tokenize
word_tokenize((raw_data_train['review'][0]).lower())

['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 "'ve",
 'started',
 'listening',
 'to',
 'his',
 'music',
 ',',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 ',',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again',
 '.',
 'maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 '.',
 'moonwalker',
 'is',
 'part',
 'biography',
 ',',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 '.',
 'some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'mj',
 "'s",
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 '

Removing stopwords

In [12]:
from nltk.corpus import stopwords
from string import punctuation
Cstopwords=set(stopwords.words('english')+list(punctuation))

In [13]:
Cstopwords

{'!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'need

In [14]:
[w for w in word_tokenize(raw_data_train['review'][0]) if w not in Cstopwords]

['With',
 'stuff',
 'going',
 'moment',
 'MJ',
 "'ve",
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'The',
 'Wiz',
 'watched',
 'Moonwalker',
 'Maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'Moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'Some',
 'subtle',
 'messages',
 'MJ',
 "'s",
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 "m'kay.",
 'br',
 'br',
 'Visually',
 'impressive',
 'course',
 'Michael',
 'Jackson',
 'unless',
 'remotely',
 'like',
 'MJ',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'Some',
 'may',
 'call',
 'MJ',
 'egotist',
 'consenting',
 'making',
 'movie',
 'BUT',
 'MJ',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'him.',
 'br',
 'br',
 'The',


### Defining a function that performs all preprocessing steps in one go

In [15]:
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
from string import punctuation

stemmer=SnowballStemmer('english')
# English stop words plus punctuation marks, stored as a set for fast lookup
Cstopwords=set(stopwords.words('english')+list(punctuation))

def clean_review(df):
    """Clean a pandas Series of raw reviews and return a list of cleaned strings."""
    review_corpus=[]
    for review in df:
        # strip HTML tags and markup
        review=BeautifulSoup(review,'lxml').text
        # replace every non-letter character with a space
        review=re.sub('[^a-zA-Z]',' ',review)
        # lowercase, tokenize, drop stop words, then stem the remaining tokens
        review=[stemmer.stem(w) for w in word_tokenize(str(review).lower()) if w not in Cstopwords]
        review=' '.join(review)
        review_corpus.append(review)
    return review_corpus

In [16]:
df=raw_data_train['review']
clean_train_review_corpus=clean_review(df)
clean_train_review_corpus[0]

'stuff go moment mj start listen music watch odd documentari watch wiz watch moonwalk mayb want get certain insight guy thought realli cool eighti mayb make mind whether guilti innoc moonwalk part biographi part featur film rememb go see cinema origin releas subtl messag mj feel toward press also obvious messag drug bad kay visual impress cours michael jackson unless remot like mj anyway go hate find bore may call mj egotist consent make movi mj fan would say made fan true realli nice actual featur film bit final start minut exclud smooth crimin sequenc joe pesci convinc psychopath power drug lord want mj dead bad beyond mj overheard plan nah joe pesci charact rant want peopl know suppli drug etc dunno mayb hate mj music lot cool thing like mj turn car robot whole speed demon sequenc also director must patienc saint came film kiddi bad sequenc usual director hate work one kid let alon whole bunch perform complex danc scene bottom line movi peopl like mj one level anoth think peopl stay

In [17]:
df1=raw_data_test['review']
clean_test_review_corpus=clean_review(df1)
clean_test_review_corpus[0]

'natur film main theme mortal nostalgia loss innoc perhap surpris rate high older viewer younger one howev craftsmanship complet film anyon enjoy pace steadi constant charact full engag relationship interact natur show need flood tear show emot scream show fear shout show disput violenc show anger natur joyc short stori lend film readi made structur perfect polish diamond small chang huston make inclus poem fit neat truli masterpiec tact subtleti overwhelm beauti'

In [18]:
df=raw_data_train
df['clean_review']=clean_train_review_corpus
df.head()

| | id | sentiment | review | clean_review |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | stuff go moment mj start listen music watch od... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` | classic war world timothi hine entertain film ... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | film start manag nichola bell give welcom inve... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | must assum prais film greatest film opera ever... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbl trashi wondrous unpretenti exploit hoo... |


## C. Creating Features
1. Bag of Words (CountVectorizer)
2. Term frequency (tf)
3. Term frequency-inverse document frequency (tf-idf)

### 1. Bag of Words model

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

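To limit the size of the feature vectors, we should bound the vocabulary. Below, we cap it at the 10,000 most frequent terms (`max_features=10000`) and drop any term that appears in fewer than 50 reviews (`min_df=50`), remembering that stop words have already been removed. In practice only 4,865 terms pass the `min_df` filter, so the 10,000 cap is never reached.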

In [20]:
bow=CountVectorizer(max_features=10000,min_df=50)

In [21]:
train_data_features = bow.fit_transform(df['clean_review'])
train_data_features.shape

(25000, 4865)
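
To make the interaction between `min_df` and `max_features` concrete, here is a minimal sketch on a toy corpus. The documents and the names `docs`, `toy_bow`, `toy_counts` and `toy_vocab` are illustrative assumptions, not part of the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, already "cleaned" like the reviews above)
docs = [
    "good movi good act",
    "bad movi bad plot",
    "good plot",
    "movi",
]

# min_df=2 keeps only terms that occur in at least 2 documents
# (here: good, movi, plot); max_features=2 then keeps the 2 surviving
# terms with the highest total count across the corpus.
toy_bow = CountVectorizer(max_features=2, min_df=2)
toy_counts = toy_bow.fit_transform(docs)

# vocabulary_ maps each kept term to its column index
toy_vocab = sorted(toy_bow.vocabulary_)
print(toy_vocab)
print(toy_counts.toarray())
```

In this toy case both filters remove terms; in the notebook only `min_df` does, because fewer than 10,000 terms survive it.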

In [22]:
train_data_features = train_data_features.toarray()

In [23]:
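# Note: on scikit-learn 1.0 and later use bow.get_feature_names_out() instead;
# get_feature_names() was removed in scikit-learn 1.2.
vocab = bow.get_feature_names()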
print(vocab)

['abandon', 'abc', 'abduct', 'abil', 'abl', 'abomin', 'abort', 'abound', 'abraham', 'abrupt', 'absenc', 'absent', 'absolut', 'absorb', 'absurd', 'abund', 'abus', 'abysm', 'academi', 'accent', 'accept', 'access', 'accid', 'accident', 'acclaim', 'accompani', 'accomplish', 'accord', 'account', 'accur', 'accuraci', 'accus', 'ace', 'achiev', 'acid', 'acknowledg', 'acquaint', 'acquir', 'across', 'act', 'action', 'activ', 'actor', 'actress', 'actual', 'ad', 'adam', 'adapt', 'add', 'addict', 'addit', 'address', 'adequ', 'adjust', 'admir', 'admit', 'adolesc', 'adopt', 'ador', 'adult', 'advanc', 'advantag', 'adventur', 'advertis', 'advic', 'advis', 'aesthet', 'affair', 'affect', 'afford', 'aforement', 'afraid', 'africa', 'african', 'aftermath', 'afternoon', 'afterward', 'age', 'agenc', 'agenda', 'agent', 'aggress', 'ago', 'agoni', 'agre', 'ah', 'ahead', 'aid', 'aim', 'aimless', 'air', 'airplan', 'airport', 'aka', 'akin', 'akshay', 'al', 'ala', 'alan', 'alarm', 'albeit', 'albert', 'album', 'alcoh

In [24]:
import numpy as np

# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)

288 abandon
125 abc
55 abduct
562 abil
1259 abl
83 abomin
92 abort
63 abound
93 abraham
136 abrupt
118 absenc
83 absent
1850 absolut
154 absorb
427 absurd
73 abund
398 abus
110 abysm
298 academi
704 accent
781 accept
165 access
344 accid
246 accident
118 acclaim
197 accompani
271 accomplish
329 accord
297 account
349 accur
82 accuraci
204 accus
75 ace
578 achiev
102 acid
109 acknowledg
73 acquaint
97 acquir
971 across
8794 act
3694 action
268 activ
6876 actor
1588 actress
5065 actual
793 ad
414 adam
835 adapt
1147 add
261 addict
499 addit
183 address
148 adequ
59 adjust
440 admir
875 admit
112 adolesc
162 adopt
230 ador
887 adult
275 advanc
172 advantag
773 adventur
228 advertis
262 advic
197 advis
78 aesthet
419 affair
430 affect
139 afford
126 aforement
343 afraid
212 africa
284 african
52 aftermath
197 afternoon
183 afterward
1726 age
79 agenc
86 agenda
455 agent
103 aggress
1033 ago
55 agoni
779 agre
119 ah
396 ahead
289 aid
325 aim
57 aimless
842 air
106 airplan
96 airport
195 aka

115 conserv
1585 consid
235 consider
491 consist
146 conspiraci
708 constant
64 constitut
252 construct
107 consum
200 contact
765 contain
96 contempl
236 contemporari
71 contempt
61 contend
415 content
200 contest
264 context
1159 continu
156 contract
84 contradict
116 contrari
307 contrast
256 contribut
268 contriv
697 control
210 controversi
130 conveni
298 convent
357 convers
75 convert
341 convey
233 convict
1105 convinc
128 convolut
247 cook
84 cooki
998 cool
201 cooper
934 cop
105 cope
741 copi
259 core
70 corman
65 corn
197 corner
263 corni
197 corpor
200 corps
387 correct
265 corrupt
464 cost
689 costum
76 couch
7922 could
507 count
101 counter
76 counterpart
137 countless
1086 countri
103 countrysid
85 coup
1894 coupl
192 courag
2520 cours
229 court
186 cousin
927 cover
70 cow
95 coward
243 cowboy
132 cox
253 crack
306 craft
123 craig
52 cram
1054 crap
248 crappi
363 crash
53 crave
123 craven
84 crawl
101 craze
699 crazi
84 cream
1683 creat
174 creation
474 creativ
219 creato

83 josh
147 journalist
460 journey
336 joy
302 jr
476 judg
82 judgment
124 judi
230 juli
189 julia
60 juliet
677 jump
90 june
204 jungl
92 junior
195 junk
57 juri
418 justic
220 justifi
100 justin
114 juvenil
136 kane
79 kansa
115 kapoor
57 karat
116 karen
78 karl
301 kate
118 kay
308 keaton
79 keen
2520 keep
88 keith
433 kelli
124 ken
111 kennedi
116 kenneth
750 kept
289 kevin
501 key
102 khan
587 kick
3152 kid
63 kiddi
281 kidnap
3701 kill
1700 killer
209 kim
3052 kind
275 kinda
1038 king
98 kingdom
114 kirk
305 kiss
125 kitchen
63 knee
897 knew
156 knife
146 knight
308 knock
7529 know
304 knowledg
1081 known
271 kong
144 korean
133 kubrick
128 kudo
82 kumar
243 kung
149 kurt
552 la
137 lab
133 label
89 labor
54 lace
1821 lack
79 lacklust
78 ladder
61 laden
1145 ladi
160 laid
255 lake
759 lame
86 lampoon
592 land
60 landmark
205 landscap
171 lane
90 lang
566 languag
784 larg
143 larger
183 larri
3207 last
1309 late
2202 later
201 latest
97 latin
362 latter
2930 laugh
504 laughabl
247

74 setup
360 seven
129 seventi
1663 sever
1726 sex
453 sexi
1000 sexual
103 sh
113 shade
318 shadow
195 shake
298 shakespear
83 shaki
133 shall
273 shallow
743 shame
82 shameless
273 shape
581 share
185 shark
78 sharon
223 sharp
72 shatter
61 shave
100 shaw
100 shed
247 sheer
81 sheet
103 shelf
80 shell
95 shelley
61 shelter
248 sheriff
142 shi
70 shield
157 shift
394 shine
446 ship
103 shirley
162 shirt
1032 shock
83 shoddi
176 shoe
1115 shoot
84 shootout
360 shop
2142 short
77 shortcom
66 shorter
2998 shot
59 shotgun
157 shoulder
156 shout
86 shove
9878 show
162 showcas
85 showdown
186 shower
994 shown
70 shred
55 shrink
194 shut
100 sibl
560 sick
82 sicken
82 sid
1490 side
127 sidekick
180 sidney
54 sigh
397 sight
503 sign
64 signal
302 signific
156 silenc
475 silent
965 silli
155 silver
1094 similar
87 simmon
299 simon
1023 simpl
1965 simpli
91 simplic
103 simplist
146 simpson
51 simul
85 simultan
221 sin
246 sinatra
2906 sinc
196 sincer
939 sing
353 singer
950 singl
164 sinist
187

In [25]:
X=train_data_features
X.shape

(25000, 4865)

In [26]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [27]:
y=df['sentiment'].values
y.shape

(25000,)

## D. Machine Learning

#### Splitting data into Training and Test set

In [28]:
# train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(20000, 4865) (20000,)
(5000, 4865) (5000,)


In [29]:
# fraction of positive reviews in train and test
print('mean positive review in train : {0:.3f}'.format(np.mean(y_train)))
print('mean positive review in test : {0:.3f}'.format(np.mean(y_test)))

mean positive review in train : 0.502
mean positive review in test : 0.490
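
The plain random split leaves slightly different class balances in the two sets (0.502 vs 0.490). One optional way to equalize them is the `stratify` argument of `train_test_split`. The following is a minimal sketch on synthetic data; `X_toy`, `y_toy` and the split names are illustrative assumptions, and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (hypothetical data)
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 3, size=(1000, 5))
y_toy = np.array([0, 1] * 500)  # exactly balanced labels

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(np.mean(y_tr), np.mean(y_te))
```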


### 1. Naive Bayes Classifier

In [30]:
from sklearn.naive_bayes import GaussianNB
model_nb=GaussianNB()
model_nb.fit(X_train,y_train)
y_pred_nb=model_nb.predict(X_test)
print('accuracy for Naive Bayes Classifier :',accuracy_score(y_test,y_pred_nb))
print('confusion matrix for Naive Bayes Classifier:\n',confusion_matrix(y_test,y_pred_nb))

accuracy for Naive Bayes Classifier : 0.749
confusion matrix for Naive Bayes Classifier:
 [[2149  399]
 [ 856 1596]]
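
`GaussianNB` models each feature as a continuous, normally distributed value. For word-count features, a commonly used alternative is `MultinomialNB`, which models counts directly. The sketch below is an assumption-labeled illustration on a tiny hand-made count matrix (`X_toy`, `y_toy` and `toy_nb` are hypothetical names); it is not the model used in this notebook and makes no claim about accuracy on the review data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical count matrix: columns are counts of ["good", "bad"]
X_toy = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y_toy = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

toy_nb = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
toy_nb.fit(X_toy, y_toy)

# One review with only "good" tokens, one with only "bad" tokens
toy_pred = toy_nb.predict(np.array([[4, 0], [0, 4]]))
print(toy_pred)
```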


### 2. Random Forest

In [31]:
from sklearn.ensemble import RandomForestClassifier

In [32]:
model_rf=RandomForestClassifier(random_state=0)

In [33]:
%%time
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators':[100,200],'criterion':['entropy','gini'],
              'min_samples_leaf':[2,5,7],
              'max_depth':[5,6,7]
               }
grid_search = GridSearchCV(estimator = model_rf,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
print('Best Accuracy :',best_accuracy)
print('Best parameters:\n',best_parameters)

Best Accuracy : 0.81605
Best parameters:
 {'criterion': 'gini', 'max_depth': 7, 'min_samples_leaf': 2, 'n_estimators': 200}
Wall time: 27min 4s
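
With the default `refit=True`, a fitted `GridSearchCV` object also exposes `best_estimator_`, a model already retrained on the full training data with `best_params_`. The sketch below shows that pattern on small synthetic data so that it runs quickly; the data, the reduced grid and the names `toy_search`, `best_rf` and `toy_acc` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical; not the review features)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring="accuracy",
    cv=3,
)
toy_search.fit(X_tr, y_tr)

# best_estimator_ is refit on all of X_tr using best_params_,
# so it can be evaluated on the held-out split directly.
best_rf = toy_search.best_estimator_
toy_acc = accuracy_score(y_te, best_rf.predict(X_te))
print(toy_search.best_params_, toy_acc)
```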


In [34]:
%%time
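# Note: these hyperparameters are set by hand and differ from the
# best_parameters reported by the grid search above.
model_rf=RandomForestClassifier(n_estimators=100,max_depth=6,criterion='entropy',min_samples_leaf=3,max_features=9,
                                min_samples_split=2,min_impurity_decrease=0,random_state=0)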
model_rf.fit(X_train,y_train)
y_pred_rf=model_rf.predict(X_test)
print('accuracy for Random Forest Classifier :',accuracy_score(y_test,y_pred_rf))
print('confusion matrix for Random Forest Classifier:\n',confusion_matrix(y_test,y_pred_rf))

accuracy for Random Forest Classifier : 0.8092
confusion matrix for Random Forest Classifier:
 [[1848  700]
 [ 254 2198]]
Wall time: 1.64 s
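
The two confusion matrices can be read more closely than accuracy alone. scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for binary labels the layout is `[[TN, FP], [FN, TP]]`. The sketch below recomputes accuracy and derives precision and recall for the positive class from the matrices printed above; the helper name `summarize` is an illustrative assumption.

```python
import numpy as np

def summarize(cm):
    """Return (accuracy, precision, recall) for the positive class (label 1)."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Confusion matrices copied from the outputs above
cm_nb = [[2149, 399], [856, 1596]]
cm_rf = [[1848, 700], [254, 2198]]

for name, cm in [("Naive Bayes", cm_nb), ("Random Forest", cm_rf)]:
    acc, prec, rec = summarize(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f}")
```

The recomputed accuracies match the 0.749 and 0.8092 reported above. The matrices also show that the two models err in opposite directions: Naive Bayes produces more false negatives (856), while the Random Forest produces more false positives (700).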
