<a href="https://colab.research.google.com/github/MikelCerio/IMDB-Sentiment-Analysis/blob/main/IMDB_SENTIMENT_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with LSTM

## Get the data

In [12]:
! pip install kaggle



### DATA COLLECTION - KAGGLE API

### Importing dependencies

In [13]:
import os
import json

from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D


In [14]:
# read the Kaggle credentials and close the file afterwards
with open('kaggle.json') as f:
    kaggle_dictionary = json.load(f)

In [15]:
# setup kaggle credentials as environment variables
os.environ["KAGGLE_USERNAME"] = kaggle_dictionary["username"]
os.environ["KAGGLE_KEY"] = kaggle_dictionary["key"]

In [16]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [17]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews


Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


In [18]:
!ls

'IMDB Dataset.csv'   imdb-dataset-of-50k-movie-reviews.zip   kaggle.json   sample_data


In [19]:
# unzip the dataset file
with ZipFile('imdb-dataset-of-50k-movie-reviews.zip', 'r') as zipObj:
   # Extract all the contents of zip file in current directory
   zipObj.extractall()

### Loading the dataset

In [20]:
data = pd.read_csv("IMDB Dataset.csv")

In [21]:
data.shape

(50000, 2)

In [22]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [23]:
data['sentiment'].value_counts()

| sentiment | count |
|---|---|
| positive | 25000 |
| negative | 25000 |


### Replacing negative and positive labels with 0 and 1

In [24]:
# map the string labels to integers (positive -> 1, negative -> 0)
data['sentiment'] = data['sentiment'].map({"positive": 1, "negative": 0})



In [25]:
data.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 |
| 1 | A wonderful little production. <br /><br />The... | 1 |
| 2 | I thought this was a wonderful way to spend ti... | 1 |
| 3 | Basically there's a family where a little boy ... | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 |


In [26]:
# split data into training data and test data
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

### Data Preprocessing

### Text Preprocessing

Before training the model, we need to convert the text data into a numerical format that a neural network can understand. We use `Tokenizer` and `pad_sequences` for this purpose.

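#### Step 1: Tokenizer
Builds a vocabulary from the training reviews and ranks words by frequency. With `num_words=5000`, only the most frequent words (indices 1 to 4999) are kept when texts are converted; rarer words are dropped.

Assigns a unique integer to each word, starting at 1 for the most frequent word. Index 0 is reserved for padding.

`fit_on_texts(...)`: builds the vocabulary from the training data only, so the test set does not influence it.

🔍 Purpose: Converts words into integers so the neural network can process them.

#### Step 2: texts_to_sequences(...)
Converts each review into a sequence of integers, one per in-vocabulary word. Words outside the vocabulary are skipped because no `oov_token` is set.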
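
To make Steps 1 and 2 concrete, here is a minimal plain-Python sketch of frequency-ranked indexing with a `num_words` cutoff. It imitates the behaviour described above rather than calling Keras; the helper names `build_word_index` and `to_sequence`, and the simplified regular expression used to split words, are assumptions made for illustration.

```python
import re
from collections import Counter

# Simplified word splitter (an assumption; Keras uses its own filter list).
WORD_PATTERN = r"[a-z0-9']+"

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(WORD_PATTERN, text.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only known words whose index is below num_words; skip the rest."""
    tokens = re.findall(WORD_PATTERN, text.lower())
    return [word_index[t] for t in tokens
            if t in word_index and word_index[t] < num_words]

toy_reviews = ["good good good movie", "good movie bad ending"]
word_index = build_word_index(toy_reviews)
sequence = to_sequence("good bad unknown ending", word_index, num_words=4)

print(word_index)  # {'good': 1, 'movie': 2, 'bad': 3, 'ending': 4}
print(sequence)    # [1, 3]
```

In the toy example, "unknown" is skipped because it was never seen, and "ending" is skipped because its index (4) is not below `num_words`.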

#### Step 3: pad_sequences(..., maxlen=200)
Ensures all sequences have the same length (200 tokens).

Shorter sequences are padded with zeros at the beginning.

Longer sequences are truncated from the beginning by default, so the last 200 tokens of a long review are kept.

🔍 Why? Batches are fed to the network as fixed-shape arrays, so every sequence in a batch must have the same length.
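
The padding and truncation rules from Step 3 can be sketched in a few lines of plain Python. This mimics the default `pad_sequences` behaviour (zeros added at the front, extra tokens removed from the front); the helper name `pad_pre` is hypothetical and not part of Keras.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]  # drop tokens from the front if too long
    return [value] * (maxlen - len(trimmed)) + trimmed

short = pad_pre([7, 8, 9], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 7, 8, 9]
print(long)   # [3, 4, 5, 6, 7]
```

This is why most rows of a padded training matrix start with runs of zeros: those reviews have fewer than `maxlen` in-vocabulary tokens.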

In [27]:
# Tokenize text data: fit the vocabulary on the training reviews only
tokenizer = Tokenizer(num_words=5000, split=' ')
tokenizer.fit_on_texts(train_data['review'].values)
# Convert reviews to integer sequences and pad/truncate them to 200 tokens
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data['review']), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data['review']), maxlen=200)

In [28]:
print(X_train)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]


### Extract variable "Sentiment" to create Y

In [29]:
Y_train = train_data['sentiment']
Y_test = test_data['sentiment']

In [30]:
print(Y_train)

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64


## LSTM - Long Short-Term Memory

In [34]:
model = Sequential()
model.add(Embedding(5000, 128, input_length=200))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))



## Model Explanation

This model is a **sequential neural network** designed for binary text classification tasks (e.g., detecting whether a movie review is positive or negative).


1. `Embedding(5000, 128, input_length=200)`  
Converts each word (represented as an integer) into a 128-dimensional vector.

- **5000**: maximum number of words considered (vocabulary size).  
- **128**: vector size per word.  
- **input_length**: length of input sequences (number of words per text).

🔍 Transforms text into dense vectors that the neural network can process.

---

2. `SpatialDropout1D(0.4)`  
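Randomly drops 40% of the embedding dimensions (channels) during training. Each dropped dimension is zeroed at every word position of a sequence, rather than dropping individual values independently.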

🔍 Helps prevent overfitting and improves the model’s generalization.

---

3. `LSTM(128, dropout=0.2, recurrent_dropout=0.2)`  
LSTM (Long Short-Term Memory) layer with 128 units. It reads the sequence one step at a time and carries a memory state forward, so it can use word order and context.

- **dropout**: drops 20% of the inputs to this layer.  
- **recurrent_dropout**: drops 20% of the recurrent (memory) connections.

🔍 Ideal for working with text sequences.

---

4. `Dense(1, activation='sigmoid')`  
Output layer with a single neuron and sigmoid activation (returns values between 0 and 1).

🔍 Returns a probability to classify the text into one of two classes: 0 or 1.
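
To illustrate item 2, the following plain-Python sketch applies spatial dropout to one toy sequence: a single keep/drop decision is made per embedding channel and reused at every timestep, and kept values are scaled by `1 / (1 - rate)` as in standard dropout. The function name `spatial_dropout_1d` and the toy shapes (3 timesteps, 8 channels) are assumptions for illustration; Keras performs the equivalent operation on tensors.

```python
import random

def spatial_dropout_1d(sequence, rate, rng):
    """Drop whole embedding channels for every timestep of one sequence."""
    n_channels = len(sequence[0])
    # One keep/drop decision per channel, shared by all timesteps.
    keep = [rng.random() >= rate for _ in range(n_channels)]
    scale = 1.0 / (1.0 - rate)  # rescale survivors to preserve the expected sum
    return [[x * scale if k else 0.0 for x, k in zip(step, keep)]
            for step in sequence]

rng = random.Random(0)
toy_sequence = [[1.0] * 8 for _ in range(3)]  # 3 timesteps, 8 channels
dropped = spatial_dropout_1d(toy_sequence, rate=0.4, rng=rng)

for step in dropped:
    print(step)
```

Every printed row has zeros in the same columns, which is the difference from ordinary dropout, where each value would be dropped independently.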


In [35]:
model.summary()
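
The per-layer parameter counts that `model.summary()` reports for a built model can be checked by hand. The sketch below assumes the standard Keras parameterizations: an embedding matrix of size vocabulary × dimension, an LSTM with four gates that each have input weights, recurrent weights, and a bias, and a dense layer with one weight per input plus a bias.

```python
vocab_size, embed_dim, lstm_units = 5000, 128, 128

# Embedding: one 128-dimensional vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 640000 131584 129 771713
```

Most of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model's size.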

### Compile the Model

Before training, we need to **compile** the model. This step defines how the model will learn.

In [36]:
# compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Parameters:

* loss='binary_crossentropy'
This is the loss function used for binary classification problems.
🔍 It measures how far the predicted probabilities are from the actual labels (0 or 1).

* optimizer='adam'
Adam is an optimization algorithm that adjusts the weights to minimize the loss.
🔍 It's widely used because it combines momentum with RMSprop-style per-parameter adaptive learning rates.

* metrics=['accuracy']
We want to monitor the accuracy during training and evaluation.
🔍 This tells us what fraction of predictions were correct.
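
As a rough illustration of what the loss measures, binary cross-entropy for a batch is the mean of `-(y * log(p) + (1 - y) * log(1 - p))` over the samples. The plain-Python sketch below is not the Keras implementation; the function name and the clipping constant `eps` are assumptions chosen to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithm finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident = binary_crossentropy([1, 0], [0.9, 0.1])  # predictions close to the labels
wrong = binary_crossentropy([1, 0], [0.1, 0.9])      # predictions far from the labels

print(round(confident, 4))  # 0.1054
print(round(wrong, 4))      # 2.3026
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.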

## Training the Model

Now we train the model using the training data.

In [37]:
model.fit(X_train, Y_train, epochs=5, batch_size=64, validation_split=0.2)

Epoch 1/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 358s 704ms/step - accuracy: 0.6987 - loss: 0.5513 - val_accuracy: 0.8372 - val_loss: 0.3762
Epoch 2/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 342s 684ms/step - accuracy: 0.8418 - loss: 0.3750 - val_accuracy: 0.8105 - val_loss: 0.4066
Epoch 3/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 340s 680ms/step - accuracy: 0.8622 - loss: 0.3328 - val_accuracy: 0.8643 - val_loss: 0.3198
Epoch 4/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 382s 681ms/step - accuracy: 0.8839 - loss: 0.2891 - val_accuracy: 0.8525 - val_loss: 0.3508
Epoch 5/5
500/500 ━━━━━━━━━━━━━━━━━━━━ 386s 688ms/step - accuracy: 0.8961 - loss: 0.2612 - val_accuracy: 0.8662 - val_loss: 0.3198


<keras.src.callbacks.history.History at 0x7a484ca533d0>

Parameters:
X_train, Y_train
These are the input data (features) and labels (targets) for training.

epochs=5
The model will go through the entire training dataset 5 times.
🔁 More epochs may improve performance, but also increase training time and risk of overfitting.

batch_size=64
The training data is divided into batches of 64 samples.
📦 The model updates its weights after each batch to learn more efficiently.

validation_split=0.2
The last 20% of the training data is set aside for validation and is not used to update the weights.
📊 This allows us to monitor how well the model performs on data it has not trained on during training.
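
The `500/500` step count in the training log follows from these settings. A short arithmetic sketch, using the 40,000 training rows shown earlier (the variable names are illustrative):

```python
import math

n_train = 40000          # rows in the training split
validation_split = 0.2
batch_size = 64

n_val = int(n_train * validation_split)   # samples held out for validation
n_fit = n_train - n_val                   # samples used for weight updates
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_val, n_fit, steps_per_epoch)  # 8000 32000 500
```

Each epoch therefore performs 500 weight updates on 32,000 reviews and then scores the remaining 8,000.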

## Model Evaluation

In [38]:
# evaluate returns [loss, accuracy] because the model was compiled with metrics=['accuracy']
loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")

313/313 ━━━━━━━━━━━━━━━━━━━━ 28s 87ms/step - accuracy: 0.8694 - loss: 0.3113
Test Loss: 0.3053390085697174
Test Accuracy: 0.8736000061035156
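
For a sigmoid output, the accuracy metric amounts to thresholding each predicted probability at 0.5 and counting matches with the true labels. A minimal plain-Python sketch on made-up toy values (the function name and the toy numbers are assumptions, not results from this model):

```python
def accuracy_from_probabilities(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.91, 0.20, 0.45, 0.60]  # illustrative probabilities only
toy_accuracy = accuracy_from_probabilities(toy_labels, toy_probs)

print(toy_accuracy)  # 0.5
```

The first two toy predictions fall on the correct side of the threshold and the last two do not, giving 2 correct out of 4.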


## Building a Predictive System

In [39]:
def predict_sentiment(review):
    # Tokenize the review and pad it to the length the model expects
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)
    # model.predict returns an array of shape (1, 1); take the scalar probability
    probability = model.predict(padded_sequence)[0][0]
    return "Positive" if probability > 0.5 else "Negative"

The `predict_sentiment(review)` function:
Converts the input text to a sequence of numbers using the tokenizer.

Pads the sequence to match the model's input shape.

Makes a prediction using the trained model.

Returns "Positive" if the probability is greater than 0.5, otherwise "Negative".

In [40]:
# example usage
new_review = 'This movie was fantastic. I love it.'
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 584ms/step
The sentiment of the review is: Positive


In [41]:
# example of usage
new_review = 'This movie was not that good'
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 115ms/step
The sentiment of the review is: Negative
