In [68]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB



In [19]:
df= pd.read_csv('IMDB_Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [20]:
print(df.isnull().sum())
print(df.shape)

review       0
sentiment    0
dtype: int64
(50000, 2)


So there are no missing entries in the data. The data set contains 50,000 movie reviews. Since this is a large data set, a subset of 10,000 entries is used to keep run times manageable. Of these, 7,000 are positive reviews and 3,000 are negative. This ratio is chosen deliberately, to learn how to deal with imbalanced data, because real-world projects often involve dirty and imbalanced data.

In [21]:
df_positive = df[df['sentiment']=='positive'][:7000]
df_negative = df[df['sentiment']=='negative'][:3000]
df_new = pd.concat([df_positive,df_negative])
df_new.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive


In [22]:
df_new[6999:7001]

Unnamed: 0,review,sentiment
14162,Radio is a true story about a man who did what...,positive
3,Basically there's a family where a little boy ...,negative


Now the data frame is ready. The next step is splitting it into a training set and a test set. Before that, note that this data set is imbalanced, as decided earlier. So how to deal with imbalanced data? There are two common methods. One is undersampling: reducing the positive reviews to the number of negative reviews. The other is its opposite, oversampling: increasing the negative reviews to the number of positive reviews. The reference material used undersampling, so I decided to try oversampling here. Later I should work out how to choose the better of the two methods for this case.

In [26]:
neg_len = 3000
pos_len = 7000
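# Oversampling attempt: fails as written, because sampling more rows than exist needs replace=True.
#df_neg = df_negative.sample(n=pos_len)
# Undersampling: randomly keep 3000 of the 7000 positive reviews.
df_pos = df_positive.sample(n=neg_len)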
df_pos

Unnamed: 0,review,sentiment
9786,"I first saw this film in the late 60's, and tr...",positive
6757,"OK, so I don't watch too many horror movies - ...",positive
11843,Ray is one of those movies that makes you paus...,positive
9387,"Disgused as an Asian Horror, ""A Tale Of Two Si...",positive
3661,"Okay, I know I shouldn't like this movie but I...",positive
...,...,...
14015,This is one of the best Fred Astaire-Ginger Ro...,positive
8569,"This is actually a groovy-neat little flick, m...",positive
2593,I cannot believe it has been 25 yrs since I fi...,positive
7488,"I put this second version of ""The Man Who Knew...",positive


The oversampling attempt failed, so I have done undersampling instead. There are dedicated modules and methods for resampling, but for now I am going with this. This topic should be covered in detail later.
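For reference, the oversampling that was attempted above can be sketched with plain pandas. This is a minimal sketch on made-up toy frames, not the notebook's actual data: `df_neg_over` and `df_over` are hypothetical names, and `random_state=0` is an arbitrary seed. The key point is `replace=True`, since drawing more rows than the frame contains is only possible with replacement.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames: 7 positive and 3 negative reviews.
df_positive = pd.DataFrame({"review": [f"good {i}" for i in range(7)],
                            "sentiment": "positive"})
df_negative = pd.DataFrame({"review": [f"bad {i}" for i in range(3)],
                            "sentiment": "negative"})

# Oversample the minority class: draw rows with replacement until it
# matches the size of the majority class.
df_neg_over = df_negative.sample(n=len(df_positive), replace=True, random_state=0)

df_over = pd.concat([df_positive, df_neg_over])
print(df_over["sentiment"].value_counts())
```

One caveat with this approach: oversampled rows are exact duplicates, so if the frame is split into training and test sets afterwards, copies of the same review can end up in both. Splitting first and oversampling only the training part avoids that.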

In [29]:
df_bal = pd.concat([df_pos,df_negative])
df_bal


Unnamed: 0,review,sentiment
9786,"I first saw this film in the late 60's, and tr...",positive
6757,"OK, so I don't watch too many horror movies - ...",positive
11843,Ray is one of those movies that makes you paus...,positive
9387,"Disgused as an Asian Horror, ""A Tale Of Two Si...",positive
3661,"Okay, I know I shouldn't like this movie but I...",positive
...,...,...
5939,Something somewhere must have terribly gone wr...,negative
5942,This was the next to last film appearance by J...,negative
5946,I give this movie a 4 cause I'm a die hard fan...,negative
5947,"Are we serious??? I mean wow ... just, wow. I ...",negative


Splitting the data into a training set and a test set. Here 67% of the data is used as the training set and 33% as the test set.

In [44]:
from sklearn.model_selection import train_test_split

R_train, R_test, S_train, S_test = train_test_split(df_bal['review'], df_bal['sentiment'], test_size=0.33)

# Display the test reviews (33% of the 6000 balanced rows)
R_test


9536     After the initial shock of realizing the guts ...
2184     This has to be the funniest stand up comedy I ...
6185     This movie is sort of a Carrie meets Heavy Met...
7316     Will and Ted's Bodacious journey is an existen...
7152     Aside from the great movie METROPOLIS, this is...
                               ...                        
4477     I didn't like watching DS9 compared to other S...
5655     There wasn't much thought put into the story l...
5160     Kidman and Law lack the chemistry to make this...
13450    Let me get the bad out of the way first, James...
495      "American Nightmare" is officially tied, in my...
Name: review, Length: 1980, dtype: object
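The split above draws a different random partition on every run, so the accuracies reported later will vary slightly between runs. A possible variant, sketched here on a made-up toy frame, fixes the seed and also keeps the class ratio identical in both parts. The values `random_state=42` and the toy `df_bal` are assumptions for illustration; `stratify` and `random_state` are standard `train_test_split` parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced frame standing in for df_bal.
df_bal = pd.DataFrame({
    "review": [f"review {i}" for i in range(100)],
    "sentiment": ["positive"] * 50 + ["negative"] * 50,
})

R_train, R_test, S_train, S_test = train_test_split(
    df_bal["review"], df_bal["sentiment"],
    test_size=0.33,
    stratify=df_bal["sentiment"],  # keep the positive/negative ratio in both parts
    random_state=42,               # arbitrary seed so the split is repeatable
)

print(len(R_train), len(R_test))
print(S_test.value_counts())
```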

Our data is raw text. For analysis, it must be converted into numerical vectors. There are three common methods: bag of words (BOW), word2vec, and one-hot encoding. Here BOW is used. Of the two BOW weighting schemes, plain word counts and TF-IDF (Term Frequency-Inverse Document Frequency), TF-IDF is chosen for this application because it down-weights words that occur in most reviews.
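To make the TF-IDF weighting concrete, here is a small hand-rolled sketch on three made-up documents. It follows the formula that `TfidfVectorizer` uses with its default settings (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row). The tokenisation is simplified to whitespace splitting, and all names (`docs`, `doc_freq`, `tfidf_row`) are hypothetical.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good acting"]  # made-up reviews
tokenized = [d.split() for d in docs]                    # simplified tokenizer
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency: rarer terms get a larger weight.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, [round(x, 3) for x in row])
```

In this toy corpus "movie" appears in two of the three documents while "acting" appears in only one, so "acting" receives the higher IDF weight.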

In [56]:
from sklearn.feature_extraction import text

In [70]:
tfidf = text.TfidfVectorizer(stop_words = 'english')
train_R_vector = tfidf.fit_transform(R_train)

In [59]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
pd.DataFrame.sparse.from_spmatrix(train_R_vector,
                                  index=R_train.index,
                                  columns=tfidf.get_feature_names_out())

Unnamed: 0,00,000,00001,007,00am,00s,01,02,04,05,...,zzzzip,álvaro,ángel,æon,élan,ís,ísnt,île,óli,önsjön
3145,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11058,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5523,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5275,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4389,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4367,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5935,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
684,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6979,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Similarly, the test set of reviews should also be transformed into numerical vectors.

In [71]:
# tfidf is already fitted on the training data, so the test data is only transformed, not fitted again.
test_R_vector = tfidf.transform(R_test)
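The reason for calling only `transform` on the test data can be seen in a small sketch. The documents below are made up, and the vectorizer is created without stop words to keep the example simple. The vocabulary and IDF weights are learned from the training documents alone, so a word that appears only in the test data gets no column and is silently ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a wonderful little film", "a dull and boring film"]  # made-up
test_docs = ["a wonderful masterpiece"]                              # made-up

tfidf = TfidfVectorizer()
train_vec = tfidf.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vec = tfidf.transform(test_docs)        # reuses them, learns nothing new

print(sorted(tfidf.vocabulary_))
print(train_vec.shape, test_vec.shape)
print("masterpiece" in tfidf.vocabulary_)
```

Both matrices have the same number of columns, which is what lets a model trained on one make predictions on the other.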


Now the data is all set for modeling. 

In [87]:
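#model = LogisticRegression()
#model = RandomForestClassifier()
#model = SVC()
#model = DecisionTreeClassifier()
# Assign the instance to `model`; binding it to the name GaussianNB would
# shadow the class and leave `model` pointing at the previously fitted model.
model = GaussianNB()

# GaussianNB does not accept sparse matrices, so convert to dense arrays
# (this uses much more memory); the other models also accept the sparse matrices directly.
model.fit(train_R_vector.toarray(), S_train)
S_pred = model.predict(test_R_vector.toarray())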


In [77]:
from sklearn.metrics import accuracy_score as acy


In [88]:
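# Run the model cell and this cell once per model, uncommenting the matching line each time.
# Accuracies from earlier runs remain in the notebook's memory for the prints below.
#accuracy_RFC = acy(S_test,S_pred)
#accuracy_LR = acy(S_test,S_pred)
#accuracy_SVC = acy(S_test,S_pred)
#accuracy_DTC = acy(S_test,S_pred)
accuracy_GNB = acy(S_test,S_pred)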
print("RFC=",accuracy_RFC)
print("LR=",accuracy_LR)
print("SVC=",accuracy_SVC)
print("DTC=",accuracy_DTC)
print("GNB=",accuracy_GNB)

RFC= 0.843939393939394
LR= 0.8712121212121212
SVC= 0.8696969696969697
DTC= 0.702020202020202
GNB= 0.705050505050505


Among these models, Logistic Regression and SVC are the most accurate, at about 0.87 each. However, SVC seemed to take noticeably longer to run.
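Comparing models by commenting lines in and out depends on the order in which cells were run. One possible alternative is to loop over the models and record both accuracy and elapsed time, which also puts a number on the impression about SVC's running time. The sketch below uses a tiny made-up corpus so that it runs on its own; the corpus, the `models` and `results` names, and `random_state=0` are all assumptions, and on the real data the notebook's `train_R_vector`, `test_R_vector`, `S_train` and `S_test` would take their place.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus standing in for the review data.
train_texts = ["great film", "loved it", "wonderful acting", "great acting",
               "terrible film", "hated it", "awful acting", "terrible plot"]
train_labels = ["positive"] * 4 + ["negative"] * 4
test_texts = ["great plot", "awful film"]
test_labels = ["positive", "negative"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)
X_test = tfidf.transform(test_texts)

models = {
    "LR": LogisticRegression(),
    "SVC": SVC(),
    "DTC": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, train_labels)
    pred = model.predict(X_test)
    elapsed = time.perf_counter() - start
    # Store accuracy together with the combined fit and predict time.
    results[name] = (accuracy_score(test_labels, pred), elapsed)

for name, (acc, secs) in results.items():
    print(f"{name}: accuracy={acc:.3f}, fit+predict time={secs:.4f}s")
```

Timings on a corpus this small say nothing about the real data; the point is only the structure of the loop.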