# Sentiment Analysis of Movie Reviews

### Data and Problem Description: 

The dataset is the IMDB dataset from Kaggle. It contains two columns, 'review' and 'sentiment', and has 50k movie reviews.

The objective is to find the best-suited machine learning model to predict the sentiment of a given movie review.

### Preprocessing the Data:

In [1]:
# importing the required packages
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction import text

from sklearn.metrics import accuracy_score, f1_score, recall_score, auc


In [2]:
#importing the data file and reading it as a data frame
df= pd.read_csv('IMDB_Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [15]:
df_positive = df[df['sentiment']=='positive'][:5000]
df_negative = df[df['sentiment']=='negative'][:5000]
df = pd.concat([df_positive,df_negative])
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive


In [16]:
print(df.isnull().sum())
print(df.shape)

review       0
sentiment    0
dtype: int64
(10000, 2)


So there are no missing entries in the 10,000-review subsample (5,000 positive and 5,000 negative reviews).

In [17]:
R_train,R_test,S_train,S_test = train_test_split(df['review'],df['sentiment'],test_size=.33)

### Text Representations:

Reviews are stored as raw text, but classification algorithms expect numerical feature vectors, so the text must be converted into suitable numerical vectors. Text representation techniques include bag of words, word2vec, and one-hot encoding. Here the frequency of words is the important aspect: for example, a review that contains many positive words such as "happy", "glad", or "super" is likely to be a positive review. A bag-of-words representation is therefore used in this work, with TF-IDF weighting, which weights each word by how often it occurs in a review and down-weights words that occur in many reviews.
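
To make the weighting concrete, here is a minimal sketch on three made-up reviews (the toy sentences and the names `toy_reviews`, `toy_tfidf`, and `toy_matrix` are assumptions for illustration, not part of the IMDB data). It shows that a word occurring in fewer documents gets a larger weight than an equally frequent word that occurs in more documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up reviews, used only for illustration.
toy_reviews = [
    "a happy and super movie",
    "a boring and dull movie",
    "happy happy ending",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_reviews)

vocab = toy_tfidf.vocabulary_  # maps each word to its column index
print(sorted(vocab))           # the words kept after stop-word removal
print(toy_matrix.shape)        # (number of reviews, number of distinct words)

# In the first review every word occurs once. "movie" appears in two reviews
# and "super" in only one, so "super" receives the larger TF-IDF weight.
row0 = toy_matrix.toarray()[0]
print(row0[vocab["super"]] > row0[vocab["movie"]])
```

Calling `toarray()` is reasonable only for a toy matrix like this one; the real review matrix is best kept sparse.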

In [18]:
tfidf = text.TfidfVectorizer(stop_words = 'english') # stop words to remove words like is, and, for, etc.
train_R_vector = tfidf.fit_transform(R_train)

In [19]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame.sparse.from_spmatrix(train_R_vector,
                                  index=R_train.index,
                                  columns=tfidf.get_feature_names_out())

Unnamed: 0,00,000,00001,007,00am,00s,01,0126,02,03,...,álex,álvaro,ángel,æon,élan,émigrés,était,önsjön,überwoman,ünfaithful
7973,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8300,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5877,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8198,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1924,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
test_R_vector = tfidf.transform(R_test)  # tfidf is already fitted on the training data, so the test data is only transformed, not fitted again
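
The fit-then-transform split can be illustrated with a small sketch on made-up sentences (`train_texts`, `test_texts`, and `vec` are hypothetical names). It shows that the test matrix reuses the columns learned from the training text, and that words never seen during fitting get no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["great acting and great story", "terrible plot"]  # made-up data
test_texts = ["great plot, zzzunseen soundtrack"]                 # made-up data

vec = TfidfVectorizer(stop_words="english")
train_vec = vec.fit_transform(train_texts)  # learns the vocabulary and IDF weights
test_vec = vec.transform(test_texts)        # reuses them without changing them

# Both matrices have the same columns, so a model trained on one can score the other.
print(train_vec.shape[1] == test_vec.shape[1])

# Words that were not in the training text have no column and are ignored.
print("soundtrack" in vec.vocabulary_)
```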


### Modeling


In [43]:
model = RandomForestClassifier()
#model =  SVC()
model.fit(train_R_vector,S_train)
S_pred = model.predict(test_R_vector)
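
Since the objective is to find the best-suited model, one possible way to compare candidates is sketched below on a tiny made-up corpus (the corpus, the `candidates` dictionary, and the fixed `random_state` are assumptions for illustration). `MultinomialNB` is swapped in for the imported `GaussianNB`, because `GaussianNB` requires dense input and the TF-IDF matrix is sparse.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus standing in for the IMDB reviews.
positive = ["I loved this wonderful film", "great acting and a superb story",
            "a delightful, happy movie", "brilliant direction, loved it"] * 5
negative = ["I hated this awful film", "terrible acting and a boring story",
            "a dull, sad movie", "poor direction, hated it"] * 5
texts = positive + negative
labels = ["positive"] * len(positive) + ["negative"] * len(negative)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, stratify=labels, random_state=0)

vec = TfidfVectorizer(stop_words="english")
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

candidates = {
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVC (linear kernel)": SVC(kernel="linear"),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
}

# Fit every candidate on the same training matrix and score it on the same test matrix.
results = {}
for name, clf in candidates.items():
    clf.fit(X_train_vec, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test_vec))

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Scores on such a toy corpus say nothing about the real reviews. The loop only shows the mechanics, and on the actual data cross-validation would likely give a more reliable ranking than a single split.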


In [44]:
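accuracy = accuracy_score(S_test,S_pred)
# the labels are strings, so the positive class has to be named explicitly
#f1 = f1_score(S_test,S_pred,pos_label='positive')
#recall = recall_score(S_test,S_pred,pos_label='positive')
# sklearn.metrics.auc expects the x and y coordinates of a curve, not labels and predictions,
# so it cannot be called as auc(S_test,S_pred)
accuracy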

0.8403030303030303
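
Accuracy is only one view of the result. The sketch below computes F1, recall, and ROC AUC on hand-made labels (the lists `y_true`, `y_pred`, and `y_score` are made up for illustration and are not the notebook's results). It assumes 'positive' is the class of interest, and it uses `roc_auc_score`, which takes scores rather than hard predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

# Made-up labels and predictions, for illustration only.
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
# Hypothetical predicted probabilities of the "positive" class.
y_score = [0.9, 0.2, 0.4, 0.1, 0.8]

acc = accuracy_score(y_true, y_pred)                      # 4 of 5 correct
rec = recall_score(y_true, y_pred, pos_label="positive")  # 2 of 3 positives found
f1 = f1_score(y_true, y_pred, pos_label="positive")       # precision 1.0, recall 2/3

# roc_auc_score ranks the scores, so the labels are converted to 0/1 explicitly.
y_true_binary = [1 if label == "positive" else 0 for label in y_true]
roc_auc = roc_auc_score(y_true_binary, y_score)

print(acc, rec, f1, roc_auc)
```

For the random forest in this notebook, the scores would presumably come from `model.predict_proba(test_R_vector)`, taking the column that matches 'positive' in `model.classes_`.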

In [45]:
review = "I liked this movie. I like it's direction and I like the way cast is slected"
review_df = pd.DataFrame([review],columns=['Review'])
review_df

Unnamed: 0,Review
0,I liked this movie. I like it's direction and ...


In [46]:
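# pass the text column: iterating over a DataFrame yields its column names, not its rows
review_vector = tfidf.transform(review_df['Review'])
pd.DataFrame.sparse.from_spmatrix(review_vector,columns=tfidf.get_feature_names_out())
s=model.predict(review_vector)
print(s)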

['positive']
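
Scoring a new review needs the same fitted vectorizer as training. A minimal sketch of bundling the two steps in a scikit-learn `Pipeline` follows (the training sentences and the names `sentiment_pipeline`, `new_review`, and `new_df` are assumptions for illustration). It also shows why a DataFrame should not be passed to the vectorizer directly.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data standing in for the IMDB reviews.
train_texts = ["I loved this wonderful film", "great cast and direction",
               "I hated this awful film", "boring cast and poor direction"] * 5
train_labels = ["positive", "positive", "negative", "negative"] * 5

# The pipeline vectorizes the text and then classifies it in one call.
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(random_state=0),
)
sentiment_pipeline.fit(train_texts, train_labels)

new_review = "I liked this movie and loved the cast"
new_df = pd.DataFrame([new_review], columns=["Review"])

# Iterating over a DataFrame yields its column names, not its rows,
# so the text column is selected before it is handed to the pipeline.
print(list(new_df))  # ['Review']
prediction = sentiment_pipeline.predict(new_df["Review"])
print(prediction)
```

Passing a plain list such as `[new_review]` works equally well and avoids the DataFrame altogether.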


This is only a baseline model. The project is still in progress, and this file will be updated later.