<a href="https://colab.research.google.com/github/123vartika123/Sentiment-Analysis-Using-RNN-on-IMDB-Movie-Reviews/blob/main/RNN_Project_Sentiment_Analysis_on_IMDB_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **RNN Project: Sentiment Analysis on IMDB Dataset**

In [1]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
import numpy as np

# 1. Load IMDB dataset
vocab_size = 10000  # Use top 10,000 words
max_length = 100    # Maximum review length
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad with trailing zeros (or truncate) so every review has exactly max_length tokens
x_train = pad_sequences(x_train, maxlen=max_length, padding='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post')

# 2. Build RNN Model
model = Sequential([
    Embedding(vocab_size, 32),                          # Embedding layer (input_length is deprecated in recent Keras)
    SimpleRNN(32, return_sequences=False),              # RNN layer
    Dense(1, activation='sigmoid')                      # Output layer for binary classification
])

# 3. Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# 4. Train the model
history = model.fit(
    x_train, y_train,
    epochs=5,
    batch_size=64,
    validation_split=0.2
)

# 5. Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f"Test Accuracy: {accuracy:.2f}")

# 6. Predict a sample review
def decode_review(encoded_review):
    # Ids 0 (padding), 1 (start) and 2 (unknown) are reserved, so word ids are offset by 3
    word_index = imdb.get_word_index()
    reverse_word_index = {v: k for k, v in word_index.items()}
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

sample_review = x_test[0]
sample_label = y_test[0]
prediction = model.predict(np.array([sample_review]))[0][0]

print("\nSample Review:")
print(decode_review(sample_review))
print(f"True Label: {'Positive' if sample_label == 1 else 'Negative'}")
print(f"Predicted Sentiment: {'Positive' if prediction >= 0.5 else 'Negative'}")


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step
Epoch 1/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 15s 42ms/step - accuracy: 0.5560 - loss: 0.6870 - val_accuracy: 0.6058 - val_loss: 0.6554
Epoch 2/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 18s 59ms/step - accuracy: 0.7362 - loss: 0.5618 - val_accuracy: 0.8024 - val_loss: 0.4714
Epoch 3/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 14s 38ms/step - accuracy: 0.8923 - loss: 0.2799 - val_accuracy: 0.8168 - val_loss: 0.4616
Epoch 4/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 19s 33ms/step - accuracy: 0.9598 - loss: 0.1342 - val_accuracy: 0.7434 - val_loss: 0.6899
Epoch 5/5
313/313 ━━━━━━━━━━━━━━━━━━━━ 9s 28ms/step - accuracy: 0.9761 - loss: 0.0907 - val_accuracy: 0.7726 - val_loss: 0.6614
Test Accuracy: 0.77
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 149ms/step

Sample Review:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_

In [2]:
sample_review

array([   1,  591,  202,   14,   31,    6,  717,   10,   10,    2,    2,
          5,    4,  360,    7,    4,  177, 5760,  394,  354,    4,  123,
          9, 1035, 1035, 1035,   10,   10,   13,   92,  124,   89,  488,
       7944,  100,   28, 1668,   14,   31,   23,   27, 7479,   29,  220,
        468,    8,  124,   14,  286,  170,    8,  157,   46,    5,   27,
        239,   16,  179,    2,   38,   32,   25, 7944,  451,  202,   14,
          6,  717,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0], dtype=int32)

In [3]:
sample_label

0

In [4]:
# Decode a review back to words
word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(encoded_review):
    return ' '.join([reverse_word_index.get(i - 3, '?') for i in encoded_review])

# Print a few training examples
print("Sample Training Data:\n")
for i in range(5):  # Print the first 5 reviews
    print(f"Review {i + 1}:")
    print(decode_review(x_train[i]))
    print(f"Sentiment: {'Positive' if y_train[i] == 1 else 'Negative'}")
    print("=" * 80)

Sample Training Data:

Review 1:
cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Sentiment: Positive
Review 2:
funny in equal ? the hair is big lots of boobs ? men wear those cut ? shirts that show off their ? sickening that men actually wore them and the music is just ? trash that plays over and over again in almost every scene there is trashy music boobs and ? taking away bodies and the gym still doesn't close for ? all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old l



### **Code Explanation**
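1. **Dataset**: The IMDB dataset contains 25,000 labeled movie reviews for training and 25,000 for testing. Each review is already encoded as a sequence of integers, where each integer stands for a word and smaller integers correspond to more frequent words.
2. **Data Preprocessing**:
   - Only the 10,000 most frequent words are kept; rarer words are replaced by an out-of-vocabulary index.
   - Reviews are padded with trailing zeros, or truncated, so that every sequence has exactly 100 tokens.
3. **Model Architecture**:
   - **Embedding Layer**: Maps each word index to a trainable 32-dimensional dense vector (word embedding).
   - **SimpleRNN Layer**: Reads the sequence one token at a time with 32 hidden units and returns the final hidden state as a summary of the review.
   - **Dense Layer**: A single sigmoid unit that outputs the probability that the review is positive.
4. **Training**: Uses the Adam optimizer and binary cross-entropy loss for 5 epochs with a batch size of 64, holding out 20% of the training data for validation.
5. **Evaluation**: Measures accuracy on the test set.
6. **Prediction**: Predicts the sentiment of one test review, using 0.5 as the decision threshold, and decodes the review back into words.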
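
Two preprocessing details are easy to miss: how padding and truncation behave, and why `decode_review` subtracts 3 from every index. The sketch below imitates both in plain Python on a toy vocabulary. `pad_post`, `decode`, and `toy_word_index` are hypothetical names used only for illustration. The behaviour assumed is the `pad_sequences` default of `truncating='pre'` and the `imdb.load_data` defaults of `start_char=1`, `oov_char=2`, and `index_from=3`.

```python
# Hypothetical helpers that imitate the notebook's preprocessing on a toy vocabulary.
# Assumes pad_sequences(padding='post') with its default truncating='pre',
# and imdb.load_data defaults: 0 = padding, 1 = start, 2 = unknown word,
# so a word with rank r in the word index is stored as r + 3.

def pad_post(seq, maxlen, value=0):
    """Keep the last `maxlen` tokens, then right-pad with `value`."""
    kept = list(seq[-maxlen:])
    return kept + [value] * (maxlen - len(kept))

toy_word_index = {"the": 1, "film": 2, "great": 3}  # word -> frequency rank (toy data)
reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode(encoded):
    """Map stored ids back to words; reserved ids 0-2 come out as '?'."""
    return " ".join(reverse_index.get(i - 3, "?") for i in encoded)

encoded = [1, 4, 5, 2, 6]            # <start> the film <unknown> great
padded = pad_post(encoded, 8)
print(padded)                        # [1, 4, 5, 2, 6, 0, 0, 0]
print(decode(padded))                # ? the film ? great ? ? ?
print(pad_post(range(1, 11), 4))     # [7, 8, 9, 10]
```

This is consistent with the outputs in the notebook: `sample_review` ends in a run of zeros because that review is shorter than 100 tokens, and the decoded training reviews show `?` wherever a word fell outside the 10,000-word vocabulary.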

---

### **Results**
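In the run shown above, the model reaches about **77% accuracy** on the test set after training for 5 epochs. Training accuracy climbs to about 98% while validation accuracy peaks near 82% at epoch 3 and then drops, which points to overfitting. You can try to improve the model by: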
- Increasing the RNN size.
- Adding dropout layers for regularization.
- Using LSTM or GRU instead of SimpleRNN for better performance on sequential data.
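
As a starting point for the last two suggestions, here is a minimal sketch that swaps `SimpleRNN` for an `LSTM` and adds dropout. It is a variant under stated assumptions, not the notebook's model, and no accuracy figure is claimed for it. `build_lstm_model`, its default sizes, and the dropout rates are assumptions, and `mask_zero=True` is an addition so that the recurrent layer skips the zero padding.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

def build_lstm_model(vocab_size=10000, embed_dim=32, units=32, drop_rate=0.5):
    """Hypothetical variant of the notebook's model (all sizes are assumptions)."""
    model = Sequential([
        # mask_zero=True marks id 0 as padding so the LSTM ignores those steps
        Embedding(vocab_size, embed_dim, mask_zero=True),
        # dropout= is applied to the layer's inputs during training only
        LSTM(units, dropout=0.2),
        # Randomly zero a fraction of the LSTM output features during training
        Dropout(drop_rate),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_lstm_model()

# Two dummy right-padded "reviews" of length 100, used only to check the output shape
dummy = np.zeros((2, 100), dtype="int32")
dummy[:, :5] = [1, 14, 22, 16, 43]
probs = model.predict(dummy, verbose=0)
print(probs.shape)  # (2, 1)
```

The model can be trained with the same `model.fit(...)` arguments used earlier. Passing `tf.keras.callbacks.EarlyStopping(monitor='val_loss', restore_best_weights=True)` in `callbacks` is one way to stop training before validation loss starts rising.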
