# End-to-End Deep Learning Project Using Simple RNN for Sentiment Analysis

This project demonstrates how to build a simple Recurrent Neural Network (RNN) to perform sentiment analysis on movie reviews using the IMDB dataset. The model classifies movie reviews as either positive or negative based on the review text.

## Overview

- **Dataset:** IMDB movie reviews dataset (10,000 most frequent words)
- **Task:** Binary classification (positive/negative sentiment)
- **Model:** Sequential Keras model with an Embedding layer, a Simple RNN layer, and a Dense output layer with sigmoid activation
- **Training:** Uses early stopping to prevent overfitting
- **Output:** Saves the trained model to disk

## Detailed Explanation of the Code

1. **Data Loading:**
   - The IMDB dataset is loaded with a vocabulary size limited to the top 10,000 most common words.
   - Training and test data are split into reviews (`X_train`, `X_test`) and sentiment labels (`y_train`, `y_test`).

2. **Data Inspection:**
   - Prints the shapes of training and testing datasets.
   - Prints a sample review (encoded as integers) and its corresponding label.
   - Maps the integer-encoded review back to words for interpretability.

3. **Preprocessing:**
   - Reviews are padded or truncated to a fixed length (`max_len=500`) using `pad_sequences` so that all input sequences have the same length.

4. **Model Building:**
   - The model uses an Embedding layer to convert integer-encoded words into dense vectors of fixed size (128).
   - A SimpleRNN layer with 128 units and ReLU activation processes the sequence data.
   - A Dense layer with a sigmoid activation outputs the probability of the review being positive.

5. **Compilation:**
   - The model is compiled using the Adam optimizer and binary cross-entropy loss, tracking accuracy as a metric.

6. **Training:**
   - Uses 20% of the training data for validation.
   - Implements EarlyStopping to stop training if validation loss doesn’t improve for 5 consecutive epochs and restores the best weights.
   - Trains for up to 10 epochs, with batch size 32.

7. **Saving the Model:**
   - The trained model is saved to the file `Simple_rnn_imdb.h5` (the name used in the final `model.save` call).
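
Steps 4 and 5 pair a sigmoid output with binary cross-entropy loss. As a minimal sketch in plain Python (the helper names `sigmoid` and `binary_cross_entropy` are illustrative and not part of the project code), this is roughly the per-example quantity that the loss averages over a batch:

```python
import math

def sigmoid(z: float) -> float:
    """Squash a raw score (logit) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true: int, p: float, eps: float = 1e-7) -> float:
    """Loss for one example; p is the predicted probability of the positive class."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

p_uncertain = sigmoid(0.0)                     # a logit of 0 maps to probability 0.5
print(binary_cross_entropy(1, p_uncertain))    # coin-flip prediction: loss equals ln(2)
print(binary_cross_entropy(1, sigmoid(3.0)))   # confident and correct: small loss
print(binary_cross_entropy(0, sigmoid(3.0)))   # confident and wrong: large loss
```

A loss close to ln 2 (about 0.693) therefore means the model is doing no better than guessing, which is a useful reference point when reading the training log.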

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN

In [2]:
## load the imdb data set

max_features = 10000  # vocabulary size: keep only the most frequent words
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

In [3]:
# Print the shapes of the features and of their matching labels
print(f'Training set: {X_train.shape}', f'Training label shape: {y_train.shape}')
print(f'Test set size: {X_test.shape[0]}', f'Testing label shape: {y_test.shape}')

Training set: (25000,) Training label shape: (25000,)
Test set size: 25000 Testing label shape: (25000,)


In [4]:
## sample review and its label
sample_review = X_train[0]
sample_label = y_train[0]
print(f"Sample review: {sample_review}")
print(f"Sample label: {sample_label}")

Sample review: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Sample label: 1


In [5]:
## mapping of word indices back to words
word_index = imdb.get_word_index()

In [6]:
reverse_word_index = {value:key for key,value in word_index.items() }

In [7]:
decoded_review = ' '.join([reverse_word_index.get(i-3,'?') for i in sample_review])
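
The `i-3` offset in the decoding cell exists because `imdb.load_data` shifts every word index by 3 by default and reserves the low ids for markers (0 for padding, 1 for start of sequence, 2 for unknown words). The sketch below repeats the same decoding on a toy vocabulary so it can be run without downloading the dataset; `decode_review` and the toy values are hypothetical names introduced only for illustration.

```python
def decode_review(encoded, word_index, index_from=3):
    """Turn a list of IMDB-style integer ids back into a string."""
    reverse_word_index = {rank: word for word, rank in word_index.items()}
    # ids below index_from are reserved markers, so they decode to '?'
    return " ".join(reverse_word_index.get(i - index_from, "?") for i in encoded)

# Toy vocabulary standing in for imdb.get_word_index() (hypothetical values)
toy_word_index = {"the": 1, "film": 2, "great": 3}
toy_review = [1, 4, 5, 2, 6]  # start marker, "the", "film", unknown, "great"
print(decode_review(toy_review, toy_word_index))
```

The start marker and the unknown-word marker both come out as `?`, which is why the decoded IMDB reviews begin with a `?` and contain one wherever a word fell outside the 10,000-word vocabulary.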

In [8]:
#padding 
max_len = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
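
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`), filling with zeros. The pure-Python helper below, `pad_pre`, is a hypothetical stand-in that mimics that default behaviour for a single sequence, so the effect can be seen without TensorFlow:

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

print(pad_pre([5, 6, 7], maxlen=5))              # short review: zeros added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # long review: earliest tokens dropped
```

Padding at the front keeps the actual words next to the final time step, which is the only hidden state the `SimpleRNN` layer passes on to the `Dense` layer here.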

In [9]:
## build the Simple RNN model
model  = Sequential()
model.add(Embedding(max_features, 128,input_length=max_len))
model.add(SimpleRNN(128,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
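
To make the recurrence inside the `SimpleRNN` layer concrete, here is a rough NumPy sketch of one forward pass over a single embedded review. The weights are small random placeholders (Keras uses its own initializers and learned values), and the variable names `W`, `U`, `b` are assumptions chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, units, timesteps = 128, 128, 500

# Hypothetical small random weights, for illustration only
W = rng.normal(scale=0.01, size=(embed_dim, units))  # input-to-hidden kernel
U = rng.normal(scale=0.01, size=(units, units))      # hidden-to-hidden kernel
b = np.zeros(units)                                  # bias

x = rng.normal(size=(timesteps, embed_dim))  # one embedded, padded review
h = np.zeros(units)                          # initial hidden state
for x_t in x:
    # ReLU recurrence: the new state depends on the current input and previous state
    h = np.maximum(0.0, x_t @ W + h @ U + b)

print(h.shape)  # the final state is what the Dense layer receives
```

ReLU is unbounded, so the hidden state can grow over 500 time steps; this may be related to the very large loss reported in the first training epoch. The Keras default for `SimpleRNN` is `tanh`, which keeps the state within (-1, 1).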



In [10]:
#model.build(input_shape=(None, max_len))

In [11]:
model.summary()
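
The summary output is not reproduced here, but the parameter counts can be worked out by hand from the layer sizes. The arithmetic below assumes the usual parameterization: one vector per vocabulary entry for the embedding, an input kernel, a recurrent kernel and a bias for the RNN, and weights plus a bias for the output layer.

```python
vocab_size, embed_dim, units = 10000, 128, 128

embedding_params = vocab_size * embed_dim              # one 128-d vector per word id
rnn_params = embed_dim * units + units * units + units  # kernel + recurrent kernel + bias
dense_params = units * 1 + 1                            # weights + bias for one output

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.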

In [12]:
## create an instance of early stopping Callback
from tensorflow.keras.callbacks import EarlyStopping

In [13]:
earlystopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
earlystopping

<keras.src.callbacks.early_stopping.EarlyStopping at 0x1ab0933fbc0>
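
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Keras source: `simulate_early_stopping` is a hypothetical helper, and the loss values are the `val_loss` figures from the first six epochs of the training log in this notebook.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience limit is never reached.
    """
    best = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

logged = [0.6147, 0.4421, 0.4285, 0.4125, 0.4457, 0.4847]
print(simulate_early_stopping(logged, patience=5))  # not triggered within 6 epochs
print(simulate_early_stopping(logged, patience=2))  # a stricter setting would stop earlier
```

With `patience=5` and a best validation loss at epoch 4, the rule could trigger at epoch 9 at the earliest, and only if no later epoch improves on epoch 4.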

In [16]:
## TRAIN THE MODEL
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(
    X_train,y_train,epochs=10,batch_size = 32,validation_split=0.2,callbacks=[earlystopping]
)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 50s 78ms/step - accuracy: 0.5827 - loss: 89.9680 - val_accuracy: 0.6620 - val_loss: 0.6147
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 49s 78ms/step - accuracy: 0.7570 - loss: 0.5146 - val_accuracy: 0.8012 - val_loss: 0.4421
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 45s 73ms/step - accuracy: 0.8577 - loss: 0.3558 - val_accuracy: 0.8122 - val_loss: 0.4285
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 46s 73ms/step - accuracy: 0.9072 - loss: 0.2343 - val_accuracy: 0.8360 - val_loss: 0.4125
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 71ms/step - accuracy: 0.9349 - loss: 0.1708 - val_accuracy: 0.8248 - val_loss: 0.4457
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 44s 70ms/step - accuracy: 0.9587 - loss: 0.1157 - val_accuracy: 0.8210 - val_loss: 0.4847
Epoch 7/10

In [17]:
model.save('Simple_rnn_imdb.h5')
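
To score a new review with the saved model, the raw text first has to be encoded the same way as the training data. The sketch below is one possible way to do that, assuming the default `imdb.load_data` conventions (start marker 1, unknown-word marker 2, index offset 3) and naive lowercase whitespace tokenization; `encode_review` and the toy vocabulary are hypothetical and stand in for `imdb.get_word_index()`.

```python
def encode_review(text, word_index, num_words=10000, index_from=3,
                  start_char=1, oov_char=2):
    """Encode raw text as IMDB-style integer ids (assumed conventions)."""
    ids = [start_char]
    for word in text.lower().split():
        rank = word_index.get(word)
        idx = rank + index_from if rank is not None else oov_char
        # ids at or above num_words fall outside the kept vocabulary
        ids.append(idx if idx < num_words else oov_char)
    return ids

# Toy vocabulary (hypothetical values) so the sketch runs without the dataset
toy_word_index = {"the": 1, "film": 2, "great": 3, "rare": 9998}
print(encode_review("The film great rare unseen", toy_word_index))
```

The resulting list would then be padded to `max_len` with `pad_sequences` and passed to `model.predict`, with a probability above 0.5 read as positive sentiment.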

