In [60]:
from tensorflow.keras.datasets import imdb


- The num_words parameter limits the vocabulary to the 10,000 most frequent words in the dataset.
- Any less frequent word (frequency rank 10,000 or beyond) is replaced with the out-of-vocabulary (OOV) token.
- This reduces computational cost and ignores rare words that are unlikely to contribute much to learning.
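
As a rough illustration of the replacement rule described above, the sketch below caps a toy review by hand. The helper name cap_vocabulary and the toy indices are made up for this example; the OOV index of 2 is the Keras default, and the real work happens inside imdb.load_data.

```python
# Hypothetical helper, for illustration only; Keras applies this rule inside load_data.
OOV_INDEX = 2  # default index Keras uses for out-of-vocabulary words


def cap_vocabulary(review, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the OOV index."""
    return [idx if idx < num_words else oov_index for idx in review]


toy_review = [1, 14, 22, 9999, 10000, 25000]  # made-up indices
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)  # [1, 14, 22, 9999, 2, 2]
```

With num_words=10000 the largest index that survives is 9999, which is why the Embedding layer later needs only 10,000 rows.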

In [61]:
max_features = 10000
(X_train,y_train),(x_test,y_test) = imdb.load_data(num_words=max_features)

In [62]:
import numpy as np
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))


{0: 12500, 1: 12500}


In [63]:
print(X_train[0])  # First training review as a sequence of word indices
print(x_test[0])  # First test review as a sequence of word indices


[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
[1, 591, 202, 14, 31, 6, 717, 10, 10, 2, 2, 5, 4, 360, 7, 4,

In [64]:
# convert the sequence of numbers to words
word_index = imdb.get_word_index()
reverse_word_index = dict((value,key) for key,value in word_index.items())
decoded_review = ' '.join([reverse_word_index.get(i-3,'?') for i in X_train[0]])
print(decoded_review)

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi

- Here i-3 is used because the IMDB dataset, as loaded by Keras, reserves the first 3 indices for special tokens:

0 → Padding (<PAD>)

1 → Start of a sequence (<START>)

2 → Unknown words (<UNK>)

- So every real word index in an encoded review is shifted up by 3, and we subtract 3 before looking the word up in reverse_word_index. The special tokens have no entry of their own, which is why they are decoded as '?'.
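
The offset can be checked on a toy example. The sketch below uses a tiny hand-written mapping in place of imdb.get_word_index(); its five entries are inferred from the first encoded review and its decoded text above, and the names toy_word_index and decode are made up for this illustration.

```python
INDEX_FROM = 3  # offset reserved for the special tokens 0, 1 and 2

# Tiny stand-in for imdb.get_word_index(); the real mapping is far larger.
toy_word_index = {"this": 11, "film": 19, "was": 13, "just": 40, "brilliant": 527}
reverse_index = {idx: word for word, idx in toy_word_index.items()}


def decode(encoded_review):
    """Map encoded indices back to words; unknown or special indices become '?'."""
    return " ".join(reverse_index.get(i - INDEX_FROM, "?") for i in encoded_review)


decoded = decode([1, 14, 22, 16, 43, 530])
print(decoded)  # ? this film was just brilliant
```

The leading 1 is the start-of-sequence token, so it has no word and shows up as '?', just as in the full decoded review.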

In [65]:
# Pad (or truncate) each review so that all reviews have the same length
from tensorflow.keras.preprocessing import sequence
max_len = 500
X_train = sequence.pad_sequences(X_train,maxlen=max_len,padding='post')
x_test = sequence.pad_sequences(x_test,maxlen=max_len,padding='post')
# X_train
x_test


array([[   1,  591,  202, ...,    0,    0,    0],
       [   1,   14,   22, ...,    0,    0,    0],
       [  33,    6,   58, ...,    9,   57,  975],
       ...,
       [   1,   13, 1408, ...,    0,    0,    0],
       [   1,   11,  119, ...,    0,    0,    0],
       [   1,    6,   52, ...,    0,    0,    0]], dtype=int32)
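
To make the padding behaviour concrete, here is a minimal pure-Python sketch of what padding='post' does together with the default truncating='pre' of pad_sequences. The helper pad_post and the toy sequences are assumptions made for illustration; this is not the Keras implementation.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic padding='post' with truncating='pre' for a single sequence."""
    seq = list(seq)[-maxlen:]  # truncating='pre' keeps the last maxlen tokens
    return seq + [value] * (maxlen - len(seq))  # padding='post' appends zeros


short_review = pad_post([1, 591, 202], maxlen=5)
long_review = pad_post([1, 7, 8, 33, 6, 58, 9], maxlen=5)
print(short_review)  # [1, 591, 202, 0, 0]
print(long_review)   # [8, 33, 6, 58, 9]
```

This is consistent with the array above: short reviews end in zeros, while the third row has lost its leading start token because a review longer than 500 tokens is cut from the front.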

Use of each layer in the model:
- Embedding layer: This layer takes the integer-encoded reviews and looks up the embedding vector for each word index. These vectors are learned as the model trains. The vectors add a dimension to the output array, so the resulting dimensions are (batch, sequence, embedding), here (None, 500, 128).
- SimpleRNN layer: This layer processes the sequence of word embeddings one time step at a time, updating its hidden state at each step. Because return_sequences is left at its default of False, it returns only the final hidden state, with shape (batch, units), here (None, 128).
- Dropout and BatchNormalization layers: Dropout randomly zeroes a fraction of activations during training to reduce overfitting, and BatchNormalization normalizes the RNN output before the final layer.
- Dense layer: This layer takes the final hidden state from the SimpleRNN layer and returns a single sigmoid output, the predicted probability that the review is positive.
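
The recurrence inside a simple RNN can be sketched in a few lines of NumPy. The sizes, random weights and variable names below are toy assumptions chosen for readability; they are not the notebook's trained model, and Keras adds details (weight initialisation, batching) that are left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, embed_dim, units = 6, 4, 3  # toy sizes, not the notebook's 500/128/128

x = rng.normal(size=(timesteps, embed_dim))  # one embedded review
W_x = rng.normal(size=(embed_dim, units))    # input-to-hidden weights
W_h = rng.normal(size=(units, units))        # hidden-to-hidden weights
b = np.zeros(units)                          # bias

h = np.zeros(units)                          # initial hidden state
for x_t in x:
    # The same weights are reused at every time step.
    h = np.tanh(x_t @ W_x + h @ W_h + b)

w_out = rng.normal(size=units)               # stand-in for the Dense(1) weights
prob = 1.0 / (1.0 + np.exp(-(h @ w_out)))    # sigmoid gives a value in (0, 1)

print(h.shape)  # (3,)
```

Only the last value of h leaves the loop, which mirrors a recurrent layer that returns its final hidden state rather than one output per word.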


In [66]:
# create the simple rnn model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Dropout, BatchNormalization

model = Sequential([
    Embedding(max_features, 128, input_length=max_len),
    Dropout(0.3),  # Prevent overfitting
    SimpleRNN(128, activation='tanh'),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1, activation='sigmoid')  # Binary classification
])

In [67]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 500, 128)          1280000   
                                                                 
 dropout (Dropout)           (None, 500, 128)          0         
                                                                 
 simple_rnn_4 (SimpleRNN)    (None, 128)               32896     
                                                                 
 batch_normalization (Batch  (None, 128)               512       
 Normalization)                                                  
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 1)                 129       
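
The parameter counts in the summary can be reproduced by hand. The short check below is a sketch that applies the standard formulas for each layer type to the sizes used in this notebook; the variable names are made up for the example.

```python
vocab_size, embed_dim, units = 10000, 128, 128

embedding_params = vocab_size * embed_dim     # one 128-d vector per word index
rnn_params = units * (embed_dim + units + 1)  # input weights + recurrent weights + bias
batchnorm_params = 4 * units                  # gamma, beta, moving mean, moving variance
dense_params = units * 1 + 1                  # weights + bias for a single output

print(embedding_params, rnn_params, batchnorm_params, dense_params)
# 1280000 32896 512 129
```

Nearly all parameters sit in the embedding table, so the vocabulary size is the main lever on model size.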
                                                      

In [68]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [69]:
# adding callbacks for early stopping and tensorboard
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard
earlyStoppingCallback = EarlyStopping(patience=5,restore_best_weights=True,monitor='val_loss')
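
For intuition about patience=5, the sketch below replays the stopping rule on a made-up list of validation losses. The helper early_stopping_epoch and the loss values are hypothetical; the logic is a simplified version of what the EarlyStopping callback does with its default min_delta of 0.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop after `patience` epochs without improvement
    return len(val_losses) - 1, best_epoch  # ran out of epochs first


toy_losses = [0.60, 0.52, 0.55, 0.58, 0.57, 0.61, 0.63, 0.70]  # made-up values
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=5)
print(stop_epoch, best_epoch)  # 6 1
```

With restore_best_weights=True, the model would then roll back to the weights from the best epoch rather than keep those from the epoch at which training stopped.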


In [70]:
from datetime import datetime
# tensor board callback
log_dir = 'classification_logs/fit/' + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = TensorBoard(log_dir= log_dir,histogram_freq =1)

In [71]:
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    callbacks=[tensorboard_callback, earlyStoppingCallback],
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [75]:
model.save('imdb_model.h5')


In [76]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [77]:
%tensorboard --logdir classification_logs/fit

Launching TensorBoard...