## Movie Classifier based on reviews
#### Life cycle of a Machine Learning project
- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model
### 1) Problem statement
NLP challenge: perform sentiment analysis on the IMDB dataset of 50K movie reviews.
### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download
- The data consists of 2 columns (`review` and `sentiment`) and 50,000 rows.
### 2.1 Import Data and Required Packages
#### Importing the Pandas, NumPy, Matplotlib, Seaborn and Warnings libraries

In [114]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [115]:
rawData = pd.read_csv("data/IMDB_Dataset.csv")
rawData

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [116]:
rawData.shape

(50000, 2)

In [117]:
rawData.isna().sum()

review       0
sentiment    0
dtype: int64

In [27]:
# Count fully duplicated rows
rawData.duplicated().sum()


np.int64(418)

In [28]:
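# De-duplicated copy, used for the inspection cells below
# (the modeling cells further down keep using rawData)
df = rawData.drop_duplicates()
df.shape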

(49582, 2)

In [118]:
df.nunique()

review       49582
sentiment        2
dtype: int64

In [119]:
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [120]:
df['sentiment'].value_counts()

sentiment
positive    24884
negative    24698
Name: count, dtype: int64

In [122]:
sentiment = []
for l in rawData.sentiment:
    if l == "positive":
        sentiment.append(1)
    elif l == "negative":
        sentiment.append(0)

#sentiment
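
The loop above skips any label that is neither "positive" nor "negative", which would leave the list shorter than the DataFrame and make the later column assignment fail. A minimal sketch of a stricter encoding follows; the names `LABELS` and `encode_labels` are hypothetical and not part of the notebook. With pandas, the same lookup could be done through `Series.map`.

```python
# Hypothetical helper, not from the notebook: encode sentiment strings as 0/1
# and fail loudly on anything unexpected instead of silently skipping it.
LABELS = {"positive": 1, "negative": 0}


def encode_labels(labels):
    """Return a list of 0/1 codes with the same length as `labels`."""
    unknown = sorted(set(labels) - LABELS.keys())
    if unknown:
        raise ValueError(f"unexpected sentiment labels: {unknown}")
    return [LABELS[label] for label in labels]


encoded = encode_labels(["positive", "negative", "positive"])
print(encoded)
```

Raising on unknown labels keeps the encoded list aligned with the rows it came from.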

In [123]:
rawData['sentiment']= sentiment
rawData.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [124]:
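# Features: the review text (kept as a one-column DataFrame)
X=rawData.drop('sentiment',axis=1)
# Target: the 0/1 sentiment labels as a 1-D Series, which scikit-learn expects
y=rawData['sentiment']
X.head()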

Unnamed: 0,review
0,One of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...
2,I thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is..."


In [125]:
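# Replace every non-letter character (punctuation, digits) with a space.
# HTML tags such as <br /> are reduced to the token "br", as the output shows.
X.replace("[^a-zA-Z]"," ",regex=True, inplace=True)

# Convert the reviews to lowercase
X["review"]=X["review"].str.lower()
X.head()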

Unnamed: 0,review
0,one of the other reviewers has mentioned that ...
1,a wonderful little production br br the...
2,i thought this was a wonderful way to spend ti...
3,basically there s a family where a little boy ...
4,petter mattei s love in the time of money is...
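
The cleaned output above still contains stray `br` tokens left over from `<br />` tags. One possible refinement, sketched here with the standard `re` module, is to strip HTML tags before removing non-letters. The function name `clean_review` is hypothetical, and the tag pattern is a simple assumption that would not handle every kind of HTML.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")         # naive match for HTML tags like <br />
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z]")  # same character class as the notebook


def clean_review(text):
    """Drop HTML tags, replace non-letters with spaces, lowercase, collapse spaces."""
    text = TAG_PATTERN.sub(" ", text)
    text = NON_LETTER_PATTERN.sub(" ", text)
    return " ".join(text.lower().split())


cleaned = clean_review("A wonderful little production. <br /><br />The end!")
print(cleaned)
```

Collapsing whitespace is an extra step the notebook does not perform; it only makes the result easier to read and does not change what a word tokenizer would see.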


In [74]:
# Look for reviews shorter than 5 characters after cleaning
X[(X.review.str.len() < 5)]

Unnamed: 0,review


In [None]:
#X[~X.applymap(lambda x: len(str(x)) > 10).any(axis=1)]

In [126]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42,test_size=0.3,shuffle=True)

In [127]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [128]:
# Collect the training reviews into a plain list of strings
train_reviews = X_train["review"].tolist()

train_reviews

['as much as i love trains  i couldn t stomach this movie  the premise that one could steal a locomotive and  drive  from arkansas to chicago without hitting another train along the way has to be right up there on the impossible plot lines hit board  imagine two disgruntled nasa employees stealing the  crawler  that totes the shuttles to and fro and driving it to new york and you get the idea  br    br   having said all that  it s a nice try  wilford brimely is at his quaker oats best  and levon helm turns a good performance as his dimwitted but well meaning sidekick  bob balaban is suitably wormy as the corporate guy  and the  little guy takes on goliath  story gets another airing ',
 'this was a very good ppv  but like wrestlemania xx some    years later  the wwe crammed so many matches on it  some of the matches were useless  i m not going to go through every match on the card because it would take forever to do  br    br   however major highlights included the huge pop for demoliti

In [129]:
## Bag of words over word bigrams: ngram_range=(2,2) counts pairs of adjacent words only
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(train_reviews)
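
To make the bigram features concrete, here is a small pure-Python sketch of what gets counted for a single review. It is an illustration only, not scikit-learn's implementation: it mimics `CountVectorizer`'s default token pattern (words of two or more characters) but builds no shared vocabulary and no sparse matrix. The name `bigram_counts` is hypothetical.

```python
import re
from collections import Counter

# Mirrors CountVectorizer's default token pattern: tokens of 2+ word characters,
# so single-letter words such as "a" or "i" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bigram_counts(text):
    """Count adjacent word pairs in one document."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))


counts = bigram_counts("a wonderful little production")
print(counts)
```

Because only pairs are counted, a review with fewer than two tokens contributes no features at all.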

In [None]:
# Train a random forest classifier (200 trees, entropy split criterion)
randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(traindataset,y_train)

In [None]:
## Predict for the test dataset
# Collect the test reviews into a plain list of strings
test_reviews = X_test["review"].tolist()

test_reviews

In [None]:
test_dataset = countvector.transform(test_reviews)
predictions = randomclassifier.predict(test_dataset)

In [87]:
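## Evaluate the predictions: confusion matrix, accuracy and classification report
## Note: the output below covers 25,000 test rows, whereas test_size=0.3 on 50,000 rows
## gives 15,000, so it appears to come from an earlier run with a different split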
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
matrix=confusion_matrix(y_test,predictions)
print(matrix)
score=accuracy_score(y_test,predictions)
print(score)
report=classification_report(y_test,predictions)
print(report)

[[10592  1891]
 [ 1791 10726]]
0.85272
              precision    recall  f1-score   support

           0       0.86      0.85      0.85     12483
           1       0.85      0.86      0.85     12517

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000



In [112]:
## Evaluate the predictions on the 15,000-row test set (matches test_size=0.3)
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
matrix=confusion_matrix(y_test,predictions)
print(matrix)
score=accuracy_score(y_test,predictions)
print(score)
report=classification_report(y_test,predictions)
print(report)

[[6347 1064]
 [1001 6588]]
0.8623333333333333
              precision    recall  f1-score   support

           0       0.86      0.86      0.86      7411
           1       0.86      0.87      0.86      7589

    accuracy                           0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000

