# IMDB Movie Reviews Sentiment Analysis

In this project I try to perform sentiment analysis of IMDB movie reviews using NLP techniques

In [36]:
import pandas as pd

In [37]:
data = pd.read_csv('imdb_reviews.csv')

data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Because the dataset doesn't have any headers it uses the first row as the header. Therefore we have to add the argument **'header = None'** when reading the dataset."

In [5]:
data = pd.read_csv('imdb_reviews.csv', header = None)
data.head()

Unnamed: 0,0,1
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


Now we have two columns, the review and the labels. 0 indicates a negative review while 1 indicates a positive review

In [6]:
review = data[0]

label = data[1]

We save the reviews column to a variable called **review** and the labels to a variable called **label**.

In [7]:
review.head()

0    A very, very, very slow-moving, aimless movie ...
1    Not sure who was more lost - the flat characte...
2    Attempting artiness with black & white and cle...
3         Very little music or anything to speak of.  
4    The best scene in the movie was when Gerardo i...
Name: 0, dtype: object

In [8]:
label.head()

0    0
1    0
2    0
3    0
4    1
Name: 1, dtype: int64

## Pre Processing
Now in this section we have to process the data by:
1. Converting all the rows to lower case.
2. Removing stop words like i, me , you, our, your etc
3. Removing hyperlinks,numbers,punctuations etc.

Now we import the nltk library. NLTK is a toolkit build for working with NLP in Python. It provides us various text processing libraries with a lot of test datasets.

In [10]:
import nltk
import re
import string

In [11]:
nltk.download('stopwords')

stop_words = nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


We download the stopwords we want to remove from the dataset.

In [12]:
nltk.download('punkt')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to C:\Users\IFEANYI
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [17]:
def pre_process(txt):
    lowered_text = txt.lower()
    
    removed_numbers = re.sub(r'\d+','',lowered_text) # re. is for regular expressions. Substitutes digits with an empty string.
    
    removed_punctuation = removed_numbers.translate(str.maketrans('','',string.punctuation)) # This removes punctuation from the text and replaces it with an empty string
    
    # now we split the text to obtain tokens and then remove the stopwords.
    
    word_tokens = word_tokenize(removed_punctuation)
    
    processed_text = ''.join([word for word in word_tokens if word not in stop_words])
    
    return processed_text

In [18]:
processed = review.apply(pre_process) #.apply applies a function across a pandas dataframe.

processed

0       slowmovingaimlessmoviedistresseddriftingyoungman
1         surelostflatcharactersaudiencenearlyhalfwalked
2      attemptingartinessblackwhiteclevercameraangles...
3                               littlemusicanythingspeak
4      bestscenemoviegerardotryingfindsongkeepsrunnin...
                             ...                        
743              gotboredwatchingjessicelangetakeclothes
744    unfortunatelyvirtuefilmsproductionworklostregr...
745                                     wordembarrassing
746                                     exceptionallybad
747                 insultonesintelligencehugewastemoney
Name: 0, Length: 748, dtype: object

We have now processed the text but we still need to tokenize it.

In [19]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

input_data = vectorizer.fit_transform(processed)
input_data

<748x748 sparse matrix of type '<class 'numpy.int64'>'
	with 754 stored elements in Compressed Sparse Row format>

We have now created our sparse matrix with number of reviews as rows(748) and all the words in the dataset as columns after removing the stopwords(748)

In [20]:
print(input_data)

  (0, 614)	1
  (1, 648)	1
  (2, 41)	1
  (3, 373)	1
  (4, 64)	1
  (5, 560)	1
  (6, 711)	1
  (7, 575)	1
  (8, 67)	1
  (9, 380)	1
  (10, 50)	1
  (11, 440)	1
  (12, 618)	1
  (13, 116)	1
  (14, 565)	1
  (15, 44)	1
  (16, 563)	1
  (17, 313)	1
  (18, 524)	1
  (18, 686)	1
  (19, 643)	1
  (20, 236)	1
  (21, 182)	1
  (22, 488)	1
  (23, 516)	1
  :	:
  (723, 632)	1
  (724, 109)	1
  (725, 324)	1
  (726, 712)	1
  (727, 716)	1
  (728, 336)	1
  (729, 339)	1
  (730, 569)	1
  (731, 595)	1
  (732, 387)	1
  (733, 633)	1
  (734, 554)	1
  (735, 444)	1
  (736, 566)	1
  (737, 36)	1
  (738, 483)	1
  (739, 205)	1
  (740, 350)	1
  (741, 376)	1
  (742, 452)	1
  (743, 268)	1
  (744, 697)	1
  (745, 730)	1
  (746, 196)	1
  (747, 325)	1


Now we can feed the matrix to a machine learning model. In this case we'll use the Logistic Regression model since we are trying to classify it into positive or negative.

In [27]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(input_data, label)

LogisticRegression()

In [35]:
def prediction_input(sentence):
    processed = pre_process(sentence)
    input_data = vectorizer.transform([processed])
    prediction = model.predict(input_data)
    
    if (prediction[0] == 1):
        print('This is a Positive Sentiment Sentence.')
    elif (prediction[0] == 0):
        print('This is a Negative Sentiment Sentence.')

prediction_input('That movie was bad')

This is a Negative Sentiment Sentence.


In [30]:
prediction_input(review_input)

This is a Positive Sentiment Sentence.
