### **Note:** Preprocessing steps such as lemmatization and stopword removal were already done in the data wrangling step so that exploratory data analysis could be performed.

# In this notebook I first prepare the data for modeling and then compare models to see which performs best.

## 1. To prepare the data for modeling I use the bag-of-words method.

## 2. The three models I implement and compare are Logistic Regression, Random Forest, and Support Vector Machine.
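
As a rough illustration of what bag of words produces, here is a minimal sketch on two made-up sentences (the `docs` list is hypothetical and not taken from the IMDB data). Each document becomes a row of word counts, with one column per vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, for illustration only).
docs = ["good movie good plot", "bad movie"]

vectorizer = CountVectorizer()            # default settings: unigrams only
counts = vectorizer.fit_transform(docs)   # sparse matrix, documents x vocabulary

# Vocabulary terms ordered by their column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Word order is discarded: only how often each term occurs in a document is kept.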

In [1]:
# import the packages used in this notebook
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score, classification_report

In [2]:
df = pd.read_csv('../input/imdb-reviews-cleaned/imdb_reviews_cleaned.csv')
df.head()

Unnamed: 0,review,sentiment,clean_reviews,clean_reviews_str,review_word_count,review_length
0,One of the other reviewers has mentioned that ...,positive,"['one', 'reviewer', 'ha', 'mentioned', 'watchi...",one reviewer ha mentioned watching oz episode ...,168,1098
1,A wonderful little production. <br /><br />The...,positive,"['wonderful', 'little', 'production', 'filming...",wonderful little production filming technique ...,86,646
2,I thought this was a wonderful way to spend ti...,positive,"['thought', 'wa', 'wonderful', 'way', 'spend',...",thought wa wonderful way spend time hot summer...,88,583
3,Basically there's a family where a little boy ...,negative,"['basically', 'family', 'little', 'boy', 'jake...",basically family little boy jake think zombie ...,64,425
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"['petter', 'matteis', 'love', 'time', 'money',...",petter matteis love time money visually stunni...,126,846


In [3]:
# encode the labels as integers: positive -> 1, negative -> 0
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [4]:
df

Unnamed: 0,review,sentiment,clean_reviews,clean_reviews_str,review_word_count,review_length
0,One of the other reviewers has mentioned that ...,1,"['one', 'reviewer', 'ha', 'mentioned', 'watchi...",one reviewer ha mentioned watching oz episode ...,168,1098
1,A wonderful little production. <br /><br />The...,1,"['wonderful', 'little', 'production', 'filming...",wonderful little production filming technique ...,86,646
2,I thought this was a wonderful way to spend ti...,1,"['thought', 'wa', 'wonderful', 'way', 'spend',...",thought wa wonderful way spend time hot summer...,88,583
3,Basically there's a family where a little boy ...,0,"['basically', 'family', 'little', 'boy', 'jake...",basically family little boy jake think zombie ...,64,425
4,"Petter Mattei's ""Love in the Time of Money"" is...",1,"['petter', 'matteis', 'love', 'time', 'money',...",petter matteis love time money visually stunni...,126,846
...,...,...,...,...,...,...
49995,I thought this movie did a down right good job...,1,"['thought', 'movie', 'right', 'good', 'job', '...",thought movie right good job wa creative origi...,81,506
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0,"['bad', 'plot', 'bad', 'dialogue', 'bad', 'act...",bad plot bad dialogue bad acting idiotic direc...,59,399
49997,I am a Catholic taught in parochial elementary...,0,"['catholic', 'taught', 'parochial', 'elementar...",catholic taught parochial elementary school nu...,115,789
49998,I'm going to have to disagree with the previou...,0,"['going', 'disagree', 'previous', 'comment', '...",going disagree previous comment side maltin on...,114,815


In [5]:
X = df['clean_reviews_str']

y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
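
The split above is unseeded, so the exact rows in each set, and therefore the scores further down, will differ between runs. A minimal sketch of a reproducible, class-balanced split follows. The toy `texts` and `labels` and the seed value 42 are assumptions for illustration, not what was used to produce the results in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (hypothetical data).
texts = [f"review {i}" for i in range(10)]
labels = [0, 1] * 5

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te), sorted(y_te))
```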

In [6]:
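# Bag of words over unigrams and bigrams.
# Note: max_df=1 is an integer, so it is read as an absolute document count:
# only terms that occur in at most one training review are kept.
# The float max_df=1.0 (the default) would keep terms of any document frequency.
cv = CountVectorizer(min_df=0, max_df=1, ngram_range=(1,2))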

X_train = cv.fit_transform(X_train)

X_test = cv.transform(X_test)

print(X_train.shape, X_test.shape)

(40000, 2025867) (10000, 2025867)
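
`CountVectorizer` treats an integer `max_df` as an absolute document count and a float as a proportion of documents, so `max_df=1` and `max_df=1.0` behave very differently. The sketch below shows the difference on a made-up corpus (the `docs` list is hypothetical). It may help explain why the matrix above has over two million columns: with the integer setting, every term shared by two or more reviews is discarded.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): "great" and "film" each appear in two documents.
docs = ["great film", "great cast", "dull film", "odd pacing"]

# Integer max_df=1: keep only terms found in at most one document.
rare_only = CountVectorizer(max_df=1).fit(docs)

# Float max_df=1.0: keep terms found in up to 100% of documents.
all_terms = CountVectorizer(max_df=1.0).fit(docs)

rare_vocab = sorted(rare_only.vocabulary_)
full_vocab = sorted(all_terms.vocabulary_)

print(rare_vocab)
print(full_vocab)
```

In the first vocabulary the shared words "great" and "film" are gone, and words shared across reviews are usually the ones that carry sentiment.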


## Logistic Regression

In [7]:
logr = linear_model.LogisticRegression()

model_logr = logr.fit(X_train, y_train)

y_pred_logr = model_logr.predict(X_test)

In [8]:
print(classification_report(y_test, y_pred_logr))

print('The accuracy score is:', accuracy_score(y_test, y_pred_logr))
print('The precision score is:', precision_score(y_test, y_pred_logr))
print('The recall score is:', recall_score(y_test, y_pred_logr))
print('The f1_score is:', f1_score(y_test, y_pred_logr))

              precision    recall  f1-score   support

           0       0.65      0.80      0.72      5069
           1       0.73      0.56      0.63      4931

    accuracy                           0.68     10000
   macro avg       0.69      0.68      0.67     10000
weighted avg       0.69      0.68      0.67     10000

The accuracy score is: 0.6795
The precision score is: 0.729643427354976
The recall score is: 0.5560738186980329
The f1_score is: 0.6311428242605593


## Support Vector Classifier

In [9]:
svc = SVC()

model_svc = svc.fit(X_train, y_train)

y_pred_svc = model_svc.predict(X_test)

In [10]:
print(classification_report(y_test, y_pred_svc))

print('The accuracy score is:', accuracy_score(y_test, y_pred_svc))
print('The precision score is:', precision_score(y_test, y_pred_svc))
print('The recall score is:', recall_score(y_test, y_pred_svc))
print('The f1_score is:', f1_score(y_test, y_pred_svc))

              precision    recall  f1-score   support

           0       0.51      1.00      0.67      5069
           1       0.84      0.02      0.03      4931

    accuracy                           0.51     10000
   macro avg       0.68      0.51      0.35     10000
weighted avg       0.67      0.51      0.36     10000

The accuracy score is: 0.5129
The precision score is: 0.8409090909090909
The recall score is: 0.015007097951733928
The f1_score is: 0.029487945805937436


## Random Forest Classifier

In [11]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

model_rf = rf.fit(X_train, y_train)

y_pred_rf = model_rf.predict(X_test)

In [12]:
print(classification_report(y_test, y_pred_rf))

print('The accuracy score is:', accuracy_score(y_test, y_pred_rf))
print('The precision score is:', precision_score(y_test, y_pred_rf))
print('The recall score is:', recall_score(y_test, y_pred_rf))
print('The f1_score is:', f1_score(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.51      1.00      0.68      5069
           1       0.95      0.02      0.03      4931

    accuracy                           0.52     10000
   macro avg       0.73      0.51      0.36     10000
weighted avg       0.73      0.52      0.36     10000

The accuracy score is: 0.5151
The precision score is: 0.9456521739130435
The recall score is: 0.017643480024335835
The f1_score is: 0.0346406529962174


| Model                  | Accuracy | Precision | Recall | F1     |
| ---------------------- | -------- | --------- | ------ | ------ |
| Logistic Regression    | 0.6795   | 0.7296    | 0.5561 | 0.6311 |
| Support Vector Machine | 0.5129   | 0.8409    | 0.0150 | 0.0295 |
| Random Forest          | 0.5151   | 0.9457    | 0.0176 | 0.0346 |

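### In this case Logistic Regression is the best: it has the highest accuracy (0.6795) and F1 (0.6311). Support Vector Machine and Random Forest have higher precision, but their recall is below 0.02, which means they label almost every review as negative. Their accuracy (about 0.51) is roughly what always predicting the negative class would score on this test set (5069 / 10000 = 0.5069).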
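
To see why high precision alone says little here, this short sketch recomputes F1 from the precision and recall values reported above and compares accuracy with a majority-class baseline. The helper name `f1_from` is made up for this example. The inputs are the rounded figures from the summary table and the class-0 support (5069 of 10000) from the classification reports.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for the positive class, as reported in the table.
reported = {
    "Logistic Regression": (0.7296, 0.5561),
    "Support Vector Machine": (0.8409, 0.0150),
    "Random Forest": (0.9457, 0.0176),
}

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}

# Accuracy of a classifier that always predicts the negative class.
majority_baseline = 5069 / 10000

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.4f}")
print(f"Majority-class baseline accuracy = {majority_baseline:.4f}")
```

Because F1 is a harmonic mean, a recall near zero pulls it near zero no matter how high precision is.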