<h1>IMDb Comments, Sentiment Analysis with NLP</h1>

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from bs4 import BeautifulSoup
import re
import nltk
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

In [2]:
#Load the dataframe (tab-separated file; quoting=3 keeps quote characters as plain text)
df = pd.read_csv('IMDb_comments.tsv', delimiter="\t", quoting=3)

In [3]:
#Check data
df.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
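
The quote characters visible around the `id` and `review` values come from `quoting=3`, which is the value of `csv.QUOTE_NONE`: quotes are treated as ordinary text instead of field delimiters. A minimal sketch with the standard `csv` module on a made-up sample row (illustrative only, not read from `IMDb_comments.tsv`; `read_rows` is a hypothetical helper):

```python
import csv
import io

# Made-up sample in the same layout as the dataset (tab-separated, quoted fields).
raw = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n'

def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))

stripped = read_rows(raw, csv.QUOTE_MINIMAL)  # default mode: quotes delimit fields
literal = read_rows(raw, csv.QUOTE_NONE)      # quoting=3: quotes are kept as text

print(csv.QUOTE_NONE)  # 3
print(stripped[1])
print(literal[1])
```

Keeping the quotes is harmless here, since the later regex step replaces every non-letter character, including the leftover quotes.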


In [4]:
#Length of the df
len(df)

25000

In [5]:
len(df["review"])

25000

In [6]:
#Download the stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/halil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
#Cleaning the data:
#first walk through the steps on a single review,
#then apply them to the whole dataset
sample_review = df.review[0]
sample_review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [8]:
#Remove the HTML tags (name the parser explicitly to avoid a parser-guessing warning)
sample_review = BeautifulSoup(sample_review, "html.parser").get_text()
sample_review

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 2

In [9]:
#Regex step: replace every character that is not a letter with a space
sample_review = re.sub(r'[^A-Za-z]', ' ', sample_review)
sample_review

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m
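
Note that the pattern `[^A-Za-z]` also removes digits and apostrophes, which is why "i've" becomes "i ve" and the number of minutes disappears from the output above. A small sketch of the effect on a made-up sentence, next to a variant pattern that keeps apostrophes (the variant is only an illustration, not what this notebook uses):

```python
import re

text = "i've watched it for 20 minutes. m'kay!"

# Pattern used in this notebook: everything except letters becomes a space.
letters_only = re.sub(r"[^A-Za-z]", " ", text)

# Hypothetical variant: also keep apostrophes so contractions stay in one piece.
keep_apostrophes = re.sub(r"[^A-Za-z']", " ", text)

print(letters_only.split())
print(keep_apostrophes.split())
```

With the notebook's pattern, fragments such as `ve`, `s` and `m` become separate tokens; most of them are later dropped as stop words.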

In [10]:
#Lowercase all letters
sample_review = sample_review.lower()
sample_review

' with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    m

In [11]:
#Split the text into a list of words; stop words ('the', 'is', 'are', ...) are removed in a later step
sample_review = sample_review.split()
sample_review


['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'started',
 'listening',
 'to',
 'his',
 'music',
 'watching',
 'the',
 'odd',
 'documentary',
 'here',
 'and',
 'there',
 'watched',
 'the',
 'wiz',
 'and',
 'watched',
 'moonwalker',
 'again',
 'maybe',
 'i',
 'just',
 'want',
 'to',
 'get',
 'a',
 'certain',
 'insight',
 'into',
 'this',
 'guy',
 'who',
 'i',
 'thought',
 'was',
 'really',
 'cool',
 'in',
 'the',
 'eighties',
 'just',
 'to',
 'maybe',
 'make',
 'up',
 'my',
 'mind',
 'whether',
 'he',
 'is',
 'guilty',
 'or',
 'innocent',
 'moonwalker',
 'is',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'which',
 'i',
 'remember',
 'going',
 'to',
 'see',
 'at',
 'the',
 'cinema',
 'when',
 'it',
 'was',
 'originally',
 'released',
 'some',
 'of',
 'it',
 'has',
 'subtle',
 'messages',
 'about',
 'mj',
 's',
 'feeling',
 'towards',
 'the',
 'press',
 'and',
 'also',
 'the',
 'obvious',
 'message',
 'of',
 'drugs',

In [12]:
len(sample_review)

437

In [13]:
#Remove the stop words ('the', 'is', 'are', ...); a set makes the membership test fast
swords = set(stopwords.words("english"))
sample_review = [w for w in sample_review if w not in swords]
sample_review

['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
 'documentary',
 'watched',
 'wiz',
 'watched',
 'moonwalker',
 'maybe',
 'want',
 'get',
 'certain',
 'insight',
 'guy',
 'thought',
 'really',
 'cool',
 'eighties',
 'maybe',
 'make',
 'mind',
 'whether',
 'guilty',
 'innocent',
 'moonwalker',
 'part',
 'biography',
 'part',
 'feature',
 'film',
 'remember',
 'going',
 'see',
 'cinema',
 'originally',
 'released',
 'subtle',
 'messages',
 'mj',
 'feeling',
 'towards',
 'press',
 'also',
 'obvious',
 'message',
 'drugs',
 'bad',
 'kay',
 'visually',
 'impressive',
 'course',
 'michael',
 'jackson',
 'unless',
 'remotely',
 'like',
 'mj',
 'anyway',
 'going',
 'hate',
 'find',
 'boring',
 'may',
 'call',
 'mj',
 'egotist',
 'consenting',
 'making',
 'movie',
 'mj',
 'fans',
 'would',
 'say',
 'made',
 'fans',
 'true',
 'really',
 'nice',
 'actual',
 'feature',
 'film',
 'bit',
 'finally',
 'starts',
 'minutes',
 'excluding',
 'smooth',
 'crim

In [14]:
len(sample_review)

219

In [15]:
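#The steps work on a single review. Now wrap them in a function to apply to all reviews
swords = set(stopwords.words("english"))  #build the stop word set once, not once per review

def process(review):
    review = BeautifulSoup(review, "html.parser").get_text()  #strip HTML tags
    review = re.sub(r'[^a-zA-Z]', ' ', review)  #replace non-letters with spaces
    review = review.lower()
    words = review.split()
    words = [w for w in words if w not in swords]  #drop stop words
    return " ".join(words)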

In [16]:
#Clean every review, printing progress every 1000 reviews
train_x_all = []
for r in range(len(df["review"])):
    if (r + 1) % 1000 == 0:
        print("Number of reviews processed. ", r+1)
    train_x_all.append(process(df["review"][r]))


Number of reviews processed.  1000
Number of reviews processed.  2000
Number of reviews processed.  3000
Number of reviews processed.  4000
Number of reviews processed.  5000
Number of reviews processed.  6000
Number of reviews processed.  7000
Number of reviews processed.  8000
Number of reviews processed.  9000
Number of reviews processed.  10000
Number of reviews processed.  11000
Number of reviews processed.  12000
Number of reviews processed.  13000
Number of reviews processed.  14000
Number of reviews processed.  15000
Number of reviews processed.  16000
Number of reviews processed.  17000
Number of reviews processed.  18000
Number of reviews processed.  19000
Number of reviews processed.  20000
Number of reviews processed.  21000
Number of reviews processed.  22000
Number of reviews processed.  23000
Number of reviews processed.  24000
Number of reviews processed.  25000
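
The loop looks each review up by its integer label, which relies on the dataframe having the default 0..n-1 index. A sketch of an alternative that iterates over the values directly; `clean_all` and its parameters are hypothetical names, and `str.lower` stands in for `process` so the example runs on its own:

```python
def clean_all(reviews, clean_fn, report_every=1000):
    """Apply clean_fn to every review, printing progress along the way."""
    cleaned = []
    for count, review in enumerate(reviews, start=1):
        cleaned.append(clean_fn(review))
        if count % report_every == 0:
            print("Number of reviews processed. ", count)
    return cleaned

# Stand-in data and cleaning function, just to show the call shape.
demo = clean_all(["Good FILM", "Bad Film", "OK film"], str.lower, report_every=2)
print(demo)
```

In this notebook the call would presumably be `train_x_all = clean_all(df["review"], process)`, since iterating over a pandas Series yields its values.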


In [17]:
train_x_all

['stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate worki

In [18]:
#Train test split
x=train_x_all
y=np.array(df["sentiment"])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)


In [19]:
x_train

['definite resounding movie absolute dud recommended friend much sort thing watched movie anticipation informed changed moved altered uplifted positive mystical things could happen suddenly see truth may sound like someone already predisposed poo pooing anything dealing metaphysical metaphysical physical boundaries existence believe person try open presentation decide accordingly terms content thing found mildly interesting informative bit peptides emotions addiction cellular receptors unifying element could find documentary part film rest documentary rambled around several topics never seemed unify cohere try tie conclude point stuff native americans able see ships columbus came told authorities film happened compared scientific work done visual cognition famous gorilla video example visit visual cognition lab university illinois site may convincing point made however seemed like unsupported mystical mumbo jumbo film one film two found documentary part mildly interesting hear people t

In [20]:
#Create bag of words
vectorizer = CountVectorizer(max_features=5000)
x_train = vectorizer.fit_transform(x_train)

In [21]:
x_train

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 1776690 stored elements and shape (22500, 5000)>
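
`CountVectorizer(max_features=5000)` keeps the 5000 most frequent words across the training reviews and turns each review into a row of word counts. The sketch below imitates that idea in plain Python on three made-up reviews; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn's implementation (which also applies its own tokenization rules):

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Map the max_features most frequent words to column indices."""
    totals = Counter(word for doc in docs for word in doc.split())
    top_words = [word for word, _ in totals.most_common(max_features)]
    return {word: col for col, word in enumerate(sorted(top_words))}

def transform(docs, vocab):
    """Count vocabulary words per document; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocab])
    return rows

docs = ["good movie good acting", "bad movie bad", "good story"]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)

print(vocab)
print(matrix)
```

Words outside the vocabulary ("acting", "story") simply contribute nothing, which is also what happens to unseen words when the fitted vectorizer is later applied to the test reviews.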

In [22]:
x_train = x_train.toarray()

In [23]:
x_train.shape, y_train.shape

((22500, 5000), (22500,))
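
Calling `toarray()` turns the sparse matrix into a dense one, which costs far more memory. A rough estimate from the shapes printed above, assuming 8-byte `int64` values and 4-byte row and column indices in the sparse format (the index width is an assumption):

```python
n_rows, n_cols = 22500, 5000
stored = 1_776_690  # non-zero entries reported for the sparse matrix

# Dense: one 8-byte value per cell.
dense_bytes = n_rows * n_cols * 8

# Sparse (CSR): values + column indices + one row pointer per row (plus one).
sparse_bytes = stored * 8 + stored * 4 + (n_rows + 1) * 4

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

`RandomForestClassifier.fit` also accepts sparse input, so the conversion may be optional here.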

In [24]:
y_train

array([0, 1, 1, ..., 1, 1, 0])

In [25]:
#Create random forest model and fit
model = RandomForestClassifier(n_estimators= 100, random_state=42)
model.fit(x_train, y_train)

In [27]:
#Vectorize the test data with the vocabulary learned from the training data
test_xx = vectorizer.transform(x_test)

In [28]:
test_xx

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 198330 stored elements and shape (2500, 5000)>

In [29]:
test_xx = test_xx.toarray()

In [30]:
test_xx.shape

(2500, 5000)

In [31]:
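#Prediction process
test_predict = model.predict(test_xx)
#roc_auc_score returns the ROC AUC, not the accuracy;
#on hard 0/1 predictions it equals the balanced accuracy
auc = roc_auc_score(y_test, test_predict)

In [32]:
print("The ROC AUC score: ",auc)

The ROC AUC score:  0.8412705119537828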
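
The score above is computed from hard 0/1 predictions. ROC AUC is normally computed from scores or probabilities, and thresholding first can change the result. A plain-Python sketch on made-up numbers; `roc_auc` is a hypothetical helper based on the pairwise-ranking definition of AUC, not scikit-learn's implementation:

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.4, 0.45, 0.9]                 # made-up model scores
hard_labels = [int(p >= 0.5) for p in probabilities]  # thresholded at 0.5

print(roc_auc(y_true, probabilities))
print(roc_auc(y_true, hard_labels))
```

In this notebook, the probability-based variant would presumably be `roc_auc_score(y_test, model.predict_proba(test_xx)[:, 1])`; its value is not shown here because it was not computed in the original run.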


In [33]:
#F1-score, recall and precision
from sklearn.metrics import classification_report
print(classification_report(y_test, test_predict))

              precision    recall  f1-score   support

           0       0.83      0.85      0.84      1228
           1       0.85      0.84      0.84      1272

    accuracy                           0.84      2500
   macro avg       0.84      0.84      0.84      2500
weighted avg       0.84      0.84      0.84      2500
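
For reference, the precision, recall and F1 columns of the report follow from counts of true positives, false positives and false negatives. A minimal sketch on made-up labels; `precision_recall_f1` is a hypothetical helper and does not handle the zero-denominator cases:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```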



In [34]:
#Test the model on a new review
input_data = "I absolutely love this movie. The acting is amazing, and the story is fantastic."
input_data = process(input_data)
input_data = vectorizer.transform([input_data])
input_data = input_data.toarray()
model.predict(input_data)

array([1])
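
The vectorize-then-predict steps can also be bundled so that new text goes through the same transformation as the training data. A sketch using scikit-learn's `make_pipeline` on a tiny made-up training set; the texts, labels and the `predict_sentiment` helper are assumptions for illustration, and the inputs are expected to be already cleaned (for example with `process`):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy, already-cleaned training texts (made up, not from the IMDb file).
texts = [
    "love movie amazing acting",
    "fantastic story great cast",
    "boring plot terrible acting",
    "awful movie waste time",
]
labels = [1, 1, 0, 0]

# Vectorizer and classifier chained into a single estimator.
sentiment_pipeline = make_pipeline(
    CountVectorizer(max_features=5000),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
sentiment_pipeline.fit(texts, labels)

def predict_sentiment(cleaned_text):
    """Return the predicted label (0 or 1) for one cleaned review."""
    return int(sentiment_pipeline.predict([cleaned_text])[0])

print(predict_sentiment("amazing fantastic story"))
```

A pipeline like this keeps the fitted vocabulary and the model together, so there is no risk of transforming new text with a different vectorizer.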