<a href="https://colab.research.google.com/github/Prabhulakshman/Sentiment-Analysis-/blob/main/IMDB_Review_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install Kaggle Library

In [1]:
!pip install kaggle



Import Dependencies

In [2]:
import os
import json

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


os and json are part of the Python standard library
ZipFile is used to extract the zip file downloaded from Kaggle
Pandas is used to load the CSV file
Dataset: IMDB Dataset.csv, 50k reviews labelled positive or negative
Sequential builds a model as a linear stack of layers
Dense is a fully connected layer
Embedding is the first layer; it maps word indices to dense vectors

Data Collection using Kaggle API

In [3]:
with open("kaggle.json") as f:
  kaggle_dictionary=json.load(f)

Create a Kaggle account
Sign in -> Profile -> Settings -> API -> Create New Token -> kaggle.json (downloaded automatically)
Upload the file through the Files panel (folder icon on the left) in the Colab notebook

Set Up Kaggle Credentials as Environment Variables


In [5]:
kaggle_dictionary.keys()

dict_keys(['username', 'key'])

Store the username and key values from kaggle_dictionary as environment variables

In [None]:
os.environ['KAGGLE_USERNAME']=kaggle_dictionary['username']
os.environ['KAGGLE_KEY']=kaggle_dictionary['key']
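
The two steps above (read the token file, export its values) can be sketched as one helper. The function name set_kaggle_env and its path parameter are illustrative assumptions, not part of the Kaggle API; the helper only reads the JSON file and copies the two keys shown above into os.environ.

```python
import json
import os

def set_kaggle_env(path="kaggle.json"):
    """Read a Kaggle token file and export its credentials (hypothetical helper)."""
    # The context manager closes the file even if parsing fails.
    with open(path) as f:
        creds = json.load(f)
    # The Kaggle CLI reads these two environment variables.
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    # Return only the key names so the secret itself is never printed.
    return sorted(creds.keys())
```

Returning the key names rather than the values keeps the API key out of the notebook output if the cell result is displayed.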

Download the dataset as a zip file
On Kaggle, search for the IMDB Dataset of 50K Movie Reviews and open it. Near the Download button in the top-right corner, click the three dots and choose "Copy API Command" to get the download command.

In [7]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 74% 19.0M/25.7M [00:00<00:00, 198MB/s]
100% 25.7M/25.7M [00:00<00:00, 215MB/s]


In [8]:
!ls

imdb-dataset-of-50k-movie-reviews.zip  kaggle.json  sample_data


Unzip the Dataset File

In [9]:
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", "r") as zip_ref:
  zip_ref.extractall()
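
Before extracting, it can help to look inside the archive. The sketch below uses ZipFile.namelist() from the standard library; the helper name list_archive is an assumption made for illustration.

```python
from zipfile import ZipFile

def list_archive(zip_path):
    """Return the file names stored in a zip archive without extracting it."""
    with ZipFile(zip_path, "r") as zip_ref:
        return zip_ref.namelist()

# Usage in this notebook (assumes the download step has already run):
# list_archive("imdb-dataset-of-50k-movie-reviews.zip")
```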

In [10]:
!ls

'IMDB Dataset.csv'   imdb-dataset-of-50k-movie-reviews.zip   kaggle.json   sample_data


Data Preprocessing and Loading the dataset

In [11]:
data=pd.read_csv("IMDB Dataset.csv")

In [13]:
data.shape

(50000, 2)

Print the First Five Rows

In [14]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Print the Last Five Rows

In [15]:
data.tail()

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


Check the Class Distribution (the positive and negative classes should be roughly equal; otherwise the model can become biased toward the majority class)

In [16]:
data["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64
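
The same check can be expressed as class shares rather than raw counts. This is a small sketch on made-up labels standing in for data["sentiment"]; the 0.05 tolerance is an arbitrary assumption, not a standard threshold.

```python
import pandas as pd

# Toy labels standing in for data["sentiment"]; the values are made up.
labels = pd.Series(["positive", "negative", "positive", "negative"], name="sentiment")

# normalize=True returns class shares instead of raw counts.
shares = labels.value_counts(normalize=True)

# Treat the classes as balanced when their shares differ by less than the tolerance.
is_balanced = bool((shares.max() - shares.min()) < 0.05)
print(shares)
print("balanced:", is_balanced)
```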

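Convert positive to 1 and negative to 0 (numerical values). The first call below only displays the converted frame; data itself stays unchanged until the call with inplace=True.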

In [21]:
data.replace({"sentiment":{"positive":1,"negative":0}})

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1
...,...,...
49995,I thought this movie did a down right good job...,1
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",0
49997,I am a Catholic taught in parochial elementary...,0
49998,I'm going to have to disagree with the previou...,0


In [22]:
data.replace({"sentiment":{"positive":1,"negative":0}},inplace=True)

inplace=True modifies data directly, so no reassignment such as "data=data.replace({"sentiment":{"positive":1,"negative":0}})" is needed
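
An alternative way to encode the labels is Series.map with explicit reassignment. This is a sketch on a made-up toy frame, not the notebook's data; one property of map is that labels missing from the mapping become NaN, which makes unexpected values easy to spot.

```python
import pandas as pd

# Toy frame with the same two columns as the IMDB CSV; the rows are made up.
toy = pd.DataFrame({
    "review": ["great film", "dull plot", "loved it"],
    "sentiment": ["positive", "negative", "positive"],
})

# Series.map returns a new Series, so the result is assigned back explicitly.
toy["sentiment"] = toy["sentiment"].map({"positive": 1, "negative": 0})
print(toy["sentiment"].tolist())
```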

In [24]:
data["sentiment"].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

Split data into Train Data and Test Data

In [27]:
train_data, test_data=train_test_split(data,test_size=0.2,random_state=42)

Train Data Size is 80% and Test Data Size is 20%

In [29]:
print(test_data.shape)
print(train_data.shape)

(10000, 2)
(40000, 2)
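
To make the 80/20 arithmetic concrete, here is a pure-Python sketch of a seeded shuffle split. It is not scikit-learn's implementation and will not select the same rows as train_test_split; the function name simple_split is hypothetical.

```python
import random

def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

train_rows, test_rows = simple_split(list(range(50)))
print(len(train_rows), len(test_rows))  # 40 10
```

Fixing the seed plays the same role as random_state=42 above: rerunning the cell yields the same partition.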


Data Pre Processing

Tokenize Text Data
Tokenizer maps each word to an integer index, so every review becomes a sequence of integers (num_words caps the vocabulary at the most frequent words)

In [31]:
tokenizer=Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train=pad_sequences(tokenizer.texts_to_sequences(train_data["review"]),maxlen=200)
X_test=pad_sequences(tokenizer.texts_to_sequences(test_data["review"]),maxlen=200)

texts_to_sequences() converts text into sequences of integers
Train data => [1] Fit, [2] Transform
Test data => [1] Transform only (the tokenizer is never fitted on test data)
pad_sequences() pads or truncates every sequence to the same length (maxlen=200) so the inputs have a uniform shape
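
The fit, transform, and pad steps can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it splits on whitespace only and skips the punctuation filtering the real Tokenizer applies. The helper names fit_vocab, to_sequence, and pad_pre are hypothetical. It does mirror three Keras defaults: index 0 is reserved for padding, unknown words are dropped, and padding and truncation both happen at the front of the sequence.

```python
from collections import Counter

def fit_vocab(texts, num_words):
    """Rank words by frequency on the training texts; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Keep only indices below num_words, so rarer words fall out of the vocabulary.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def to_sequence(text, vocab):
    """Map known words to integers; unknown words are simply dropped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train_texts = ["good movie", "bad movie", "good good plot"]
vocab = fit_vocab(train_texts, num_words=4)  # fit on training text only
encoded = pad_pre(to_sequence("a good bad film", vocab), maxlen=5)
print(vocab)    # {'good': 1, 'movie': 2, 'bad': 3}
print(encoded)  # [0, 0, 0, 1, 3]
```

The leading zeros in the printed X_train and X_test arrays above come from this kind of front padding.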

In [32]:
print(X_train)
print(X_test)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]
[[   0    0    0 ...  995  719  155]
 [  12  162   59 ...  380    7    7]
 [   0    0    0 ...   50 1088   96]
 ...
 [   0    0    0 ...  125  200 3241]
 [   0    0    0 ... 1066    1 2305]
 [   0    0    0 ...    1  332   27]]


In [34]:
Y_train=train_data["sentiment"]
Y_test=test_data["sentiment"]

In [35]:
print(Y_train)
print(Y_test)

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64
33553    1
9427     1
199      0
12447    1
39489    0
        ..
28567    0
25079    1
18707    1
15200    0
5857     1
Name: sentiment, Length: 10000, dtype: int64


Building the LSTM Model (Long Short-Term Memory)
LSTM is a kind of RNN (Recurrent Neural Network)
It is used for sequential data such as text and time series

In [37]:
model=Sequential()
model.add(Embedding(input_dim=5000,output_dim=128,input_length=200))
model.add(LSTM(128,dropout=0.2,recurrent_dropout=0.2))
model.add(Dense(1,activation="sigmoid"))

All the layers are from tensorflow.keras.layers
Embedding is the first layer; it represents each word index as a 128-dimensional vector
dropout=0.2 randomly drops 20% of the LSTM's input connections during training to reduce overfitting
recurrent_dropout=0.2 does the same for the recurrent connections that carry the hidden state from one time step to the next
Both are regularization parameters
Dense is the output layer; each of its neurons is connected to every output of the previous layer
The sigmoid activation function outputs a probability (P <= 0.5 => negative class, P > 0.5 => positive class)
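
As a worked illustration of the sigmoid output and the 0.5 cut-off, here is a short standard-library sketch. The helper name label_from_probability is an assumption; its strict > comparison follows the check used in predict_sentiment later in the notebook.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def label_from_probability(p, threshold=0.5):
    """Map a probability to a class name using a strict > comparison."""
    return "Positive" if p > threshold else "Negative"

for z in (-2.0, 0.0, 2.0):
    p = sigmoid(z)
    print(z, round(p, 3), label_from_probability(p))
```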


In [38]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 200, 128)          640000    
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 771713 (2.94 MB)
Trainable params: 771713 (2.94 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
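
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four weight sets (three gates plus the cell candidate), each with input weights, recurrent weights, and a bias.

```python
vocab_size, embed_dim, lstm_units = 5000, 128, 128

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: four weight sets, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense: one weight per LSTM unit plus a single bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 640000 131584 129 771713
```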


Compile the Model

In [39]:
model.compile(loss="binary_crossentropy",optimizer="adam",metrics=["accuracy"])
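
For intuition about the loss named in compile(), binary cross-entropy for a single example is -(y*log(p) + (1-y)*log(1-p)). The sketch below is a plain-Python illustration, not the Keras implementation; the eps clipping value is an assumption chosen to keep log() away from zero.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one example; eps clips the probability so log() stays finite."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(binary_crossentropy(1, 0.9))  # confident and correct: small loss
print(binary_crossentropy(1, 0.1))  # confident and wrong: large loss
```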

Training the Model

In [40]:
# Keras ignores validation_split when validation_data is given, so only validation_data is passed
model.fit(X_train,Y_train,epochs=10,batch_size=64,validation_data=(X_test,Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7dcff6a37e50>

Model Evaluation Performance Metrics

In [41]:
loss,accuracy=model.evaluate(X_test,Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

Test Loss: 0.3893789052963257
Test Accuracy: 0.8777999877929688
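
The accuracy figure is the share of test reviews whose thresholded prediction matches the true label. A minimal sketch of that arithmetic, using made-up probabilities and labels and assuming a 0.5 threshold; the function name accuracy_from_probabilities is hypothetical.

```python
def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    """Share of examples whose thresholded prediction equals the true label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up probabilities and labels, only to show the arithmetic.
probs = [0.91, 0.20, 0.65, 0.40]
truth = [1, 0, 0, 0]
print(accuracy_from_probabilities(probs, truth))  # 0.75
```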


Building a Predictive System

In [42]:
def predict_sentiment(review):
  # tokenize and pad the review input
  sequence=tokenizer.texts_to_sequences([review])
  padded_sequence=pad_sequences(sequence,maxlen=200)
  # prediction
  prediction=model.predict(padded_sequence)
  # return the predicted sentiment
  if prediction[0][0] > 0.5:
    return "Positive"
  else:
    return "Negative"


Example Review

In [43]:
new_review=" This movie isfantastic and I loved it. "
predict_sentiment(new_review)
print(f" The Sentiment of movie is {predict_sentiment(new_review)}")

 The Sentiment of movie is Positive


In [44]:
new_review=" This movie is very bad and not a good movie. "
predict_sentiment(new_review)
print(f" The Sentiment of movie is {predict_sentiment(new_review)}")

 The Sentiment of movie is Negative


In [45]:
new_review=" It's a watch worth movie. "
predict_sentiment(new_review)
print(f" The Sentiment of movie is {predict_sentiment(new_review)}")

 The Sentiment of movie is Positive


In [46]:
new_review=" Movie made me emotional and climax is stuuning. "
predict_sentiment(new_review)
print(f" The Sentiment of movie is {predict_sentiment(new_review)}")

 The Sentiment of movie is Positive


In [47]:
new_review=" Kids are not interested to watch the movie once again. "
predict_sentiment(new_review)
print(f" The Sentiment of movie is {predict_sentiment(new_review)}")

 The Sentiment of movie is Negative
