# Data Engineering project: ML models

Import all the libraries:

In [68]:
import pandas as pd
import numpy as np
import nltk
import pickle
import string
import re
import os, sys
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
from bs4 import BeautifulSoup
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import unicodedata
import plotly.graph_objects as go
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Awn\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## The dataset: IMDB movie reviews

Import the dataset:

In [69]:
df = pd.read_csv("IMDB Dataset.csv")
df.head(15)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


In [70]:
df.shape

(50000, 2)

We have 50000 reviews in the dataset.

In [71]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,25000


In [72]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
review       50000 non-null object
sentiment    50000 non-null object
dtypes: object(2)
memory usage: 781.3+ KB


In [73]:
df['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

- 50% of the reviews are positive.
- 50% of the reviews are negative.
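
The same balance can be read off directly as proportions, which is also what makes plain accuracy a reasonable metric later on. A minimal sketch on a toy frame (the four rows below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataframe; the rows are illustrative only.
toy = pd.DataFrame({
    "review": ["great", "awful", "loved it", "boring"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

# normalize=True returns the share of each class instead of raw counts.
shares = toy["sentiment"].value_counts(normalize=True)
print(shares)
```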

## Preprocessing part

Let's look at one review:

In [74]:
df.loc[39]['review']

'After sitting through this pile of dung, my husband and I wondered whether it was actually the product of an experiment to see whether a computer program could produce a movie. It was that listless and formulaic. But the U.S. propaganda thrown in your face throughout the film proves--disappointingly--that it\'s the work of humans. Call me a conspiracy theorist, but quotes like, "We have to steal the Declaration of Independence to protect it" seem like ways to justify actions like the invasion of Iraq, etc. The fact that Nicholas Cage spews lines like, "I would never use the Declaration of Independence as a bargaining chip" with a straight face made me and my husband wonder whether the entire cast took Valium before shooting each scene. The "reasoning" behind each plot turn and new "clue" is truly ridiculous and impossible to follow. And there\'s also a bonus side plot of misogyny, with Dr. Whatever-Her-Name-Was being chided by all involved for "never shutting up." She\'s clearly in th

We first lowercase the text, then replace the HTML line breaks and a few special characters with spaces:

In [75]:
df['review'] = df['review'].str.lower()
# regex=False treats every pattern as a literal string ("+" is a regex metacharacter)
df['review'] = df['review'].str.replace("<br /><br />", " ", regex=False)
df['review'] = df['review'].str.replace("+", " ", regex=False)
df['review'] = df['review'].str.replace("--", " ", regex=False)
df['review'] = df['review'].str.replace("-", " ", regex=False)
df['review'] = df['review'].str.replace("'", " ", regex=False)
df['review'] = df['review'].str.replace('"', " ", regex=False)

We then remove all remaining punctuation:

In [76]:
# Build the translation table once instead of rebuilding a set for every character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(s):
    return s.translate(PUNCT_TABLE)

df['review'] = df['review'].apply(remove_punctuation)
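
The two cleaning cells above can also be folded into a single function, which is easier to reuse on new text. This is only a sketch: `clean_review` and the constant names are hypothetical and are not part of the notebook.

```python
import string

# Characters the notebook replaces with a space; every other punctuation
# character is deleted outright.
TO_SPACE = "+-'\""
CLEAN_TABLE = {ord(c): " " for c in TO_SPACE}
CLEAN_TABLE.update({ord(c): None for c in string.punctuation if c not in TO_SPACE})

def clean_review(text):
    """Lowercase, drop HTML line breaks, then normalise punctuation."""
    text = text.lower()
    # These two replacements must run before translate(), because
    # '<', '/', '>' and '-' are themselves punctuation characters.
    text = text.replace("<br /><br />", " ").replace("--", " ")
    return text.translate(CLEAN_TABLE)

print(clean_review("A wonderful little production. <br /><br />The filming"))
```

With pandas, it would be applied as `df['review'].apply(clean_review)`.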


In [77]:
df.loc[39]['review']

'after sitting through this pile of dung my husband and i wondered whether it was actually the product of an experiment to see whether a computer program could produce a movie it was that listless and formulaic but the us propaganda thrown in your face throughout the film proves disappointingly that it s the work of humans call me a conspiracy theorist but quotes like  we have to steal the declaration of independence to protect it  seem like ways to justify actions like the invasion of iraq etc the fact that nicholas cage spews lines like  i would never use the declaration of independence as a bargaining chip  with a straight face made me and my husband wonder whether the entire cast took valium before shooting each scene the  reasoning  behind each plot turn and new  clue  is truly ridiculous and impossible to follow and there s also a bonus side plot of misogyny with dr whatever her name was being chided by all involved for  never shutting up  she s clearly in the movie only for look

In [84]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there s a family where a little boy ...,negative
4,petter mattei s love in the time of money is...,positive


Label encoding:

In [85]:
sentiment_map = {'positive':1, 'negative':0}

df['sentiment'] = df['sentiment'].map(sentiment_map)

All the positive reviews are represented with a 1, and all the negative ones with a 0.

In [86]:
df2= df.copy()
df2.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tec...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there s a family where a little boy ...,0
4,petter mattei s love in the time of money is...,1


## Classification with four models: Naive Bayes, SVM, Random Forest, Logistic Regression

Let's see which classifier has the best accuracy.

We split the data into 80% for training and 20% for testing:

In [88]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
    df2['review'], df2['sentiment'], test_size=0.2, random_state=42)

We vectorize the text with TF-IDF, keeping the 5,000 most frequent terms:

In [89]:
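Tfidf_vect = TfidfVectorizer(max_features=5000)
# Caveat: fitting on the full dataset lets the test reviews influence the
# vocabulary and IDF weights; fitting on Train_X alone would avoid that.
Tfidf_vect.fit(df['review'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)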
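
One way to make sure the vectorizer never sees the test reviews is to wrap it in a scikit-learn pipeline together with the classifier. The sketch below uses a tiny made-up corpus, and logistic regression is chosen only as an example; it is not the code that produced the scores reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus (made up); 1 = positive, 0 = negative.
texts = ["great movie", "loved this film", "wonderful acting", "really enjoyable",
         "terrible movie", "hated this film", "awful acting", "really boring"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# fit() learns the vocabulary and IDF weights from the training texts only;
# predict() reuses them unchanged on the test texts.
model = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(len(predictions), "test reviews classified")
```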

### Naive Bayes

We fit the Naive Bayes classifier on the training set and compute its accuracy score on the test set:

In [91]:
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
predictions_NB = Naive.predict(Test_X_Tfidf)
print("Naive Bayes Accuracy Score -> ",accuracy_score(Test_Y, predictions_NB)*100)

Naive Bayes Accuracy Score ->  85.31


### SVM

In [93]:
# degree and gamma are ignored by the linear kernel, so they are left out
SVM = svm.SVC(C=1.0, kernel='linear')
SVM.fit(Train_X_Tfidf,Train_Y)
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Accuracy Score -> ",accuracy_score(Test_Y,predictions_SVM)*100)

SVM Accuracy Score ->  89.45


### Random Forest

In [112]:
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
text_classifier.fit(Train_X_Tfidf,Train_Y)
predictions = text_classifier.predict(Test_X_Tfidf)
print(accuracy_score(Test_Y, predictions))
print("Random Forest Accuracy Score -> ",accuracy_score(Test_Y, predictions)*100)

0.8569
Random Forest Accuracy Score ->  85.69


### Logistic regression

In [94]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=42)
clf.fit(Train_X_Tfidf,Train_Y)
y_pred = clf.predict(Test_X_Tfidf)
print("Logistic Regression Accuracy Score -> ",accuracy_score(Test_Y, y_pred)*100)

Logistic Regression Accuracy Score ->  89.68


All four models reach a test accuracy above 85%; Logistic Regression (89.68%) and the linear SVM (89.45%) perform best.
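
Running every classifier through one helper keeps each printed label tied to the model that produced the score. The sketch below is an assumption about how this could be organised: `compare_models` is a hypothetical helper and the corpus is made up, so the printed scores say nothing about the IMDB results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return {name: test accuracy in percent}."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test)) * 100
    return results

# Made-up toy corpus, only to make the sketch runnable; 1 = positive, 0 = negative.
train_texts = ["great movie", "loved this film", "wonderful acting",
               "terrible movie", "hated this film", "awful acting"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["wonderful film", "awful movie"]
test_labels = [1, 0]

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(C=1.0, kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
}
results = compare_models(models, X_train, train_labels, X_test, test_labels)
for name, score in results.items():
    print(f"{name} Accuracy Score -> {score:.2f}")
```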

Let's now try VADER (through NLTK's SentimentIntensityAnalyzer), a lexicon- and rule-based analyzer that needs no training and is better suited to our application.

### Vader Sentiment

In [95]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

score = SentimentIntensityAnalyzer()

In [96]:
df1 = df2.copy()
df1.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,1
1,a wonderful little production the filming tec...,1
2,i thought this was a wonderful way to spend ti...,1
3,basically there s a family where a little boy ...,0
4,petter mattei s love in the time of money is...,1


We compute all four VADER scores for each review: pos, neg, neu and compound.

In [97]:
df1['scores'] = df1['review'].apply(lambda review: score.polarity_scores(review))

df1.head()

Unnamed: 0,review,sentiment,scores
0,one of the other reviewers has mentioned that ...,1,"{'neg': 0.204, 'neu': 0.748, 'pos': 0.048, 'co..."
1,a wonderful little production the filming tec...,1,"{'neg': 0.054, 'neu': 0.76, 'pos': 0.186, 'com..."
2,i thought this was a wonderful way to spend ti...,1,"{'neg': 0.105, 'neu': 0.651, 'pos': 0.244, 'co..."
3,basically there s a family where a little boy ...,0,"{'neg': 0.136, 'neu': 0.782, 'pos': 0.082, 'co..."
4,petter mattei s love in the time of money is...,1,"{'neg': 0.052, 'neu': 0.791, 'pos': 0.157, 'co..."


In [99]:
df1['compound']  = df1['scores'].apply(lambda score_dict: score_dict['compound'])

In [98]:
df1['pos']  = df1['scores'].apply(lambda score_dict: score_dict['pos'])

In [100]:
df1['neg']  = df1['scores'].apply(lambda score_dict: score_dict['neg'])

In [101]:
df1['neu']  = df1['scores'].apply(lambda score_dict: score_dict['neu'])

In [102]:
df1.head()

Unnamed: 0,review,sentiment,scores,pos,compound,neg,neu
0,one of the other reviewers has mentioned that ...,1,"{'neg': 0.204, 'neu': 0.748, 'pos': 0.048, 'co...",0.048,-0.9951,0.204,0.748
1,a wonderful little production the filming tec...,1,"{'neg': 0.054, 'neu': 0.76, 'pos': 0.186, 'com...",0.186,0.9693,0.054,0.76
2,i thought this was a wonderful way to spend ti...,1,"{'neg': 0.105, 'neu': 0.651, 'pos': 0.244, 'co...",0.244,0.9813,0.105,0.651
3,basically there s a family where a little boy ...,0,"{'neg': 0.136, 'neu': 0.782, 'pos': 0.082, 'co...",0.082,-0.8858,0.136,0.782
4,petter mattei s love in the time of money is...,1,"{'neg': 0.052, 'neu': 0.791, 'pos': 0.157, 'co...",0.157,0.9766,0.052,0.791
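
The four column assignments above can also be done in one step by expanding the dictionaries into their own dataframe. A minimal sketch, using the score values of the first two rows shown above on a toy frame (the variable names are illustrative):

```python
import pandas as pd

# Two rows shaped like the output of polarity_scores().
toy = pd.DataFrame({"scores": [
    {"neg": 0.204, "neu": 0.748, "pos": 0.048, "compound": -0.9951},
    {"neg": 0.054, "neu": 0.76, "pos": 0.186, "compound": 0.9693},
]})

# Each dictionary key becomes a column; the original index is preserved
# so the new columns line up with the reviews when joined back.
expanded = pd.DataFrame(toy["scores"].tolist(), index=toy.index)
toy = toy.join(expanded)
print(toy[["neg", "neu", "pos", "compound"]])
```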


Since our dataset only labels reviews as positive or negative, we use the sign of the compound score for our predictions (a score of 0 or more counts as positive):

In [103]:
df1['prediction'] = df1['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')
df2= df1.copy()

In [104]:
df2.head()

Unnamed: 0,review,sentiment,scores,pos,compound,neg,neu,prediction
0,one of the other reviewers has mentioned that ...,1,"{'neg': 0.204, 'neu': 0.748, 'pos': 0.048, 'co...",0.048,-0.9951,0.204,0.748,neg
1,a wonderful little production the filming tec...,1,"{'neg': 0.054, 'neu': 0.76, 'pos': 0.186, 'com...",0.186,0.9693,0.054,0.76,pos
2,i thought this was a wonderful way to spend ti...,1,"{'neg': 0.105, 'neu': 0.651, 'pos': 0.244, 'co...",0.244,0.9813,0.105,0.651,pos
3,basically there s a family where a little boy ...,0,"{'neg': 0.136, 'neu': 0.782, 'pos': 0.082, 'co...",0.082,-0.8858,0.136,0.782,neg
4,petter mattei s love in the time of money is...,1,"{'neg': 0.052, 'neu': 0.791, 'pos': 0.157, 'co...",0.157,0.9766,0.052,0.791,pos


In [105]:
from sklearn.metrics import accuracy_score

Label encoding:



In [106]:
prediction_map = {'pos':1, 'neg':0}

df2['prediction'] = df2['prediction'].map(prediction_map)

The resulting dataframe holds all the VADER scores and the predictions:

In [107]:
df2

Unnamed: 0,review,sentiment,scores,pos,compound,neg,neu,prediction
0,one of the other reviewers has mentioned that ...,1,"{'neg': 0.204, 'neu': 0.748, 'pos': 0.048, 'co...",0.048,-0.9951,0.204,0.748,0
1,a wonderful little production the filming tec...,1,"{'neg': 0.054, 'neu': 0.76, 'pos': 0.186, 'com...",0.186,0.9693,0.054,0.760,1
2,i thought this was a wonderful way to spend ti...,1,"{'neg': 0.105, 'neu': 0.651, 'pos': 0.244, 'co...",0.244,0.9813,0.105,0.651,1
3,basically there s a family where a little boy ...,0,"{'neg': 0.136, 'neu': 0.782, 'pos': 0.082, 'co...",0.082,-0.8858,0.136,0.782,0
4,petter mattei s love in the time of money is...,1,"{'neg': 0.052, 'neu': 0.791, 'pos': 0.157, 'co...",0.157,0.9766,0.052,0.791,1
5,probably my all time favorite movie a story of...,1,"{'neg': 0.017, 'neu': 0.761, 'pos': 0.222, 'co...",0.222,0.9828,0.017,0.761,1
6,i sure would like to see a resurrection of a u...,1,"{'neg': 0.024, 'neu': 0.85, 'pos': 0.126, 'com...",0.126,0.9403,0.024,0.850,1
7,this show was an amazing fresh innovative ide...,0,"{'neg': 0.146, 'neu': 0.642, 'pos': 0.213, 'co...",0.213,0.9302,0.146,0.642,1
8,encouraged by the positive comments about this...,0,"{'neg': 0.168, 'neu': 0.657, 'pos': 0.174, 'co...",0.174,0.2362,0.168,0.657,1
9,if you like original gut wrenching laughter yo...,1,"{'neg': 0.092, 'neu': 0.478, 'pos': 0.43, 'com...",0.430,0.9432,0.092,0.478,1


In [108]:
predict= df2['prediction']
sent= df2['sentiment']

Let's see the accuracy of the VADER model:

In [111]:
print("Accuracy Score of the Vader model: {0:.2%}".format(accuracy_score(sent, predict)))

Accuracy Score of the Vader model: 69.23%
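
Accuracy alone does not show which kind of error dominates. As a rough illustration, the sketch below tallies the confusion counts for the ten rows displayed above only, not for the full dataset:

```python
from collections import Counter

# True labels and VADER predictions for the ten rows displayed above
# (1 = positive, 0 = negative).
sentiment = [1, 1, 1, 0, 1, 1, 1, 0, 0, 1]
prediction = [0, 1, 1, 0, 1, 1, 1, 1, 1, 1]

# Count each (true label, predicted label) pair.
counts = Counter(zip(sentiment, prediction))
tp, tn = counts[(1, 1)], counts[(0, 0)]
fp, fn = counts[(0, 1)], counts[(1, 0)]
accuracy = (tp + tn) / len(sentiment)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.0%}")
# prints: TP=6 TN=1 FP=2 FN=1 accuracy=70%
```

In this small sample, two of the three negative reviews are predicted as positive. Ten rows are far too few to generalize, but the same tally on the whole dataframe would show whether this pattern holds.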


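Since the dataset only contains positive and negative labels, whereas VADER also accounts for neutral sentiment, this mismatch can partly explain why its accuracy is lower than that of the previous models. Our preprocessing probably plays a role too: VADER relies on cues such as capitalization, exclamation marks and contractions like "n't", all of which were stripped above.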

However, for our application, we will use the VADER model, because one of the requirements is to tell whether a sentence is positive, neutral or negative.
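
For that three-way output, the compound score can be bucketed with a neutral band around zero. The sketch below assumes the ±0.05 cut-offs commonly suggested for VADER; the function name and the `threshold` parameter are illustrative and not part of the notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a three-way label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Two compound scores taken from the table above, plus a neutral case.
for c in (0.9693, -0.8858, 0.0):
    print(c, label_from_compound(c))
```

The width of the neutral band is a design choice: a wider band labels more sentences as neutral, so it may need tuning on sentences typical of the application.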