# IMDB Movie Reviews Sentiment Analysis

In this project I try to perform sentiment analysis of IMDB movie reviews using NLP techniques

In [58]:
import pandas as pd

In [59]:
data = pd.read_csv('imdb.csv')

data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


The next two lines convert the positive and negative sentiments to 1 and 0 respectively so we can use it later for our ML Model.

In [60]:
data.loc[data['sentiment']=='positive','sentiment'] = 1

In [61]:
data.loc[data['sentiment']=='negative','sentiment'] = 0

In [69]:
data["sentiment"] = pd.to_numeric(data["sentiment"])
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


Now we have two columns, the review and the sentiment.

In [70]:
review = data['review']

label = data['sentiment']

We save the reviews column to a variable called **review** and the labels to a variable called **label**.

In [71]:
review.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [72]:
label.head()

0    1
1    1
2    1
3    0
4    1
Name: sentiment, dtype: int64

## Pre Processing
Now in this section we have to process the data by:
1. Converting all the rows to lower case.
2. Removing stop words like i, me , you, our, your etc
3. Removing hyperlinks,numbers,punctuations etc.

Now we import the nltk library. NLTK is a toolkit build for working with NLP in Python. It provides us various text processing libraries with a lot of test datasets.

In [73]:
import nltk
import re
import string

In [74]:
nltk.download('stopwords')

stop_words = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We download the stopwords we want to remove from the dataset.

In [75]:
nltk.download('punkt')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [76]:
def pre_process(txt):
    lowered_text = txt.lower()
    
    removed_numbers = re.sub(r'\d+','',lowered_text) # re. is for regular expressions. Substitutes digits with an empty string.
    
    removed_punctuation = removed_numbers.translate(str.maketrans('','',string.punctuation)) # This removes punctuation from the text and replaces it with an empty string
    
    # now we split the text to obtain tokens and then remove the stopwords.
    
    word_tokens = word_tokenize(removed_punctuation)
    
    processed_text = ''.join([word for word in word_tokens if word not in stop_words])
    
    return processed_text

In [77]:
processed = review.apply(pre_process) #.apply applies a function across a pandas dataframe.

processed

0        onereviewersmentionedwatchingozepisodeyoullhoo...
1        wonderfullittleproductionbrbrfilmingtechniqueu...
2        thoughtwonderfulwayspendtimehotsummerweekendsi...
3        basicallytheresfamilylittleboyjakethinkstheres...
4        pettermatteislovetimemoneyvisuallystunningfilm...
                               ...                        
49995    thoughtmovierightgoodjobwasntcreativeoriginalf...
49996    badplotbaddialoguebadactingidioticdirectingann...
49997    catholictaughtparochialelementaryschoolsnunsta...
49998    imgoingdisagreepreviouscommentsidemaltinonesec...
49999    oneexpectsstartrekmovieshighartfansexpectmovie...
Name: review, Length: 50000, dtype: object

We have now processed the text but we still need to tokenize it.

In [78]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

input_data = vectorizer.fit_transform(processed)
input_data

<50000x54131 sparse matrix of type '<class 'numpy.int64'>'
	with 54605 stored elements in Compressed Sparse Row format>

We have now created our sparse matrix with number of reviews as rows(50000) and all the words in the dataset as columns after removing the stopwords(54605)

In [79]:
print(input_data)

  (0, 33523)	1
  (1, 52330)	1
  (2, 47296)	1
  (3, 3292)	1
  (4, 34949)	1
  (5, 36048)	1
  (6, 45390)	1
  (7, 42384)	1
  (8, 10557)	1
  (9, 24508)	1
  (10, 34963)	1
  (11, 40135)	1
  (12, 20553)	1
  (13, 5754)	1
  (14, 12330)	1
  (15, 22987)	1
  (16, 13935)	1
  (17, 29071)	1
  (18, 38317)	1
  (19, 2837)	1
  (20, 45084)	1
  (21, 46037)	1
  (22, 173)	1
  (23, 14792)	1
  (24, 52689)	1
  :	:
  (49977, 9287)	1
  (49977, 23754)	1
  (49977, 21977)	1
  (49978, 18524)	1
  (49979, 40021)	1
  (49980, 44949)	1
  (49981, 38674)	1
  (49982, 19772)	1
  (49983, 25347)	1
  (49984, 19141)	1
  (49985, 20457)	1
  (49986, 28372)	1
  (49987, 38251)	1
  (49988, 15399)	1
  (49989, 17655)	1
  (49990, 23482)	1
  (49991, 23902)	1
  (49992, 22565)	1
  (49993, 39103)	1
  (49994, 48636)	1
  (49995, 47217)	1
  (49996, 3100)	1
  (49997, 5777)	1
  (49998, 20717)	1
  (49999, 33005)	1


Now we can feed the matrix to a machine learning model. In this case we'll use the Logistic Regression model since we are trying to classify it into positive or negative.

In [80]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(input_data, label)

LogisticRegression()

In [86]:
def prediction_input(sentence):
    processed = pre_process(sentence)
    input_data = vectorizer.transform([processed])
    prediction = model.predict(input_data)
    
    if (prediction[0] == 1):
        print('This is a Positive Sentiment Sentence.')
    elif (prediction[0] == 0):
        print('This is a Negative Sentiment Sentence.')

In [87]:
review_input = input("What is your review: ")
prediction_input(review_input)

What is your review: that movie was bad
This is a Positive Sentiment Sentence.


In [88]:
from sklearn import metrics
accuracy_score = metrics.accuracy_score(model.predict(input_data), label)
print("accuracy_score without data pre-processing = " + str('{:04.2f}'.format(accuracy_score*100))+" %")

accuracy_score without data pre-processing = 100.00 %
