In [2]:
import numpy as np

docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [8]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [9]:
tokenizer = Tokenizer(oov_token='<nothing>')

This code snippet from your notebook initializes a Tokenizer object from the tensorflow.keras.preprocessing.text module.

Here's a breakdown:

tokenizer = Tokenizer(oov_token='<nothing>'): This line creates an instance of the Tokenizer class and assigns it to the variable tokenizer.
- The oov_token='<nothing>' argument specifies that any word encountered later that was not in the vocabulary learned during fitting (an out-of-vocabulary word) is replaced by the token '<nothing>'. This is a common way to handle words that never appeared in the training data.

In [10]:
tokenizer.fit_on_texts(docs)

In [11]:
tokenizer.word_index

{'<nothing>': 1,
 'india': 2,
 'jeetega': 3,
 'hip': 4,
 'ki': 5,
 'jai': 6,
 'kohli': 7,
 'sachin': 8,
 'dhoni': 9,
 'go': 10,
 'hurray': 11,
 'bhai': 12,
 'bharat': 13,
 'mata': 14,
 'modi': 15,
 'ji': 16,
 'inquilab': 17,
 'zindabad': 18}

In [12]:
tokenizer.word_counts

OrderedDict([('go', 1),
             ('india', 4),
             ('hip', 2),
             ('hurray', 1),
             ('jeetega', 3),
             ('bhai', 1),
             ('bharat', 1),
             ('mata', 1),
             ('ki', 2),
             ('jai', 2),
             ('kohli', 2),
             ('sachin', 2),
             ('dhoni', 2),
             ('modi', 1),
             ('ji', 1),
             ('inquilab', 1),
             ('zindabad', 1)])
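
The ordering of word_index follows from word_counts: words are ranked by frequency, ties keep first-seen order, and the OOV token takes index 1. The following is a minimal pure-Python sketch of that ranking, not the Keras implementation. build_word_index is a hypothetical helper, and it only lowercases and splits on spaces (the real Tokenizer also filters punctuation).

```python
from collections import OrderedDict

docs = ['go india', 'india india', 'hip hip hurray',
        'jeetega bhai jeetega india jeetega', 'bharat mata ki jai',
        'kohli kohli', 'sachin sachin', 'dhoni dhoni',
        'modi ji ki jai', 'inquilab zindabad']

def build_word_index(texts, oov_token=None):
    """Hypothetical helper: rank words by frequency, most frequent first."""
    counts = OrderedDict()
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)]
    vocab = ([oov_token] if oov_token is not None else []) + ranked
    # Indices start at 1, which leaves 0 free for padding.
    return {word: i for i, word in enumerate(vocab, start=1)}, counts

word_index, word_counts = build_word_index(docs, oov_token='<nothing>')
print(word_index['india'], word_index['jeetega'], word_index['go'])  # 2 3 10
```

For this small corpus the sketch reproduces the indices shown in the tokenizer.word_index output.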

In [13]:
tokenizer.document_count

10

In [14]:
sequences = tokenizer.texts_to_sequences(docs)
sequences

[[10, 2],
 [2, 2],
 [4, 4, 11],
 [3, 12, 3, 2, 3],
 [13, 14, 5, 6],
 [7, 7],
 [8, 8],
 [9, 9],
 [15, 16, 5, 6],
 [17, 18]]

This code snippet from your notebook uses the tokenizer object to convert the text data in the docs list into sequences of numbers.

Here's a breakdown:

- sequences = tokenizer.texts_to_sequences(docs): This line takes the docs list (your text data) and uses the tokenizer's texts_to_sequences method to convert each document into a sequence of integers. Each integer is the index of a word in the tokenizer's word index (shown in the tokenizer.word_index output above). Any out-of-vocabulary word is replaced by the index of the oov_token specified when the tokenizer was created (1 here). The resulting list of sequences is stored in the sequences variable.
- sequences: This line simply displays the contents of the sequences variable, showing the numerical representation of your text data.
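
The lookup described above can be sketched in plain Python. This is an illustration only, not how Keras implements it: to_sequences is a hypothetical helper, the word index is copied from the tokenizer.word_index output, and 'team' is an assumed example of a word the tokenizer never saw.

```python
# Word index copied from the tokenizer.word_index output above.
word_index = {'<nothing>': 1, 'india': 2, 'jeetega': 3, 'hip': 4, 'ki': 5,
              'jai': 6, 'kohli': 7, 'sachin': 8, 'dhoni': 9, 'go': 10,
              'hurray': 11, 'bhai': 12, 'bharat': 13, 'mata': 14,
              'modi': 15, 'ji': 16, 'inquilab': 17, 'zindabad': 18}

def to_sequences(texts, word_index, oov_token='<nothing>'):
    """Hypothetical helper: map known words to their index, unknown words to the OOV index."""
    oov_index = word_index[oov_token]
    return [[word_index.get(word, oov_index) for word in text.lower().split()]
            for text in texts]

print(to_sequences(['go india', 'go team india'], word_index))
# [[10, 2], [10, 1, 2]]
```

The unseen word 'team' maps to 1, the index reserved for the OOV token.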

In [15]:
from keras.utils import pad_sequences

In [16]:
# All input sentences must have the same length, so we pad them

sequences = pad_sequences(sequences,padding='post')

This code snippet uses the pad_sequences function from Keras to ensure that all sequences in your dataset have the same length. This is a common preprocessing step before feeding sequences into a neural network.

Here's a breakdown:

 - sequences = pad_sequences(sequences, padding='post'): This line takes the sequences list (which contains your numerical representations of the text data) and pads them.
   - sequences: This is the input list of sequences.
   - padding='post': This argument specifies that padding should be added to the end of each sequence. If a sequence is shorter than maxlen (which defaults to the length of the longest sequence when not specified), zeros are added to the end. If a sequence is longer than maxlen, it is truncated; truncation is controlled by the separate truncating argument, which defaults to 'pre' and therefore removes values from the beginning of the sequence.
The result is a NumPy array where each row represents a padded sequence of integers, all having the same length.


Padding is needed here because when you're working with sequences of text data in neural networks, especially with layers like SimpleRNN, the network expects inputs of a fixed size.

Here's why padding is important:

  - Fixed Input Size: Neural networks, particularly recurrent neural networks (RNNs) and convolutional neural networks (CNNs) used for sequence processing, are typically designed to work with input tensors of a consistent shape. Padding ensures that all your input sequences have the same length, creating a uniform input shape for the network.
  - Batch Processing: When you train a neural network, you usually process data in batches. Each batch needs to have the same dimensions. Padding allows you to create batches of sequences with the same length, making it possible to feed them into the network efficiently.
  - Matrix Operations: Many of the operations within neural network layers involve matrix multiplications. These operations require the input matrices to have compatible dimensions. Padding helps ensure that the sequence data is in a format that allows for these matrix operations to be performed correctly.


By padding the sequences, you are essentially making them the same length, which is a necessary step to feed them into a neural network model that expects fixed-size inputs. In this specific case, pad_sequences adds zeros to the end of the shorter sequences to match the length of the longest sequence, as you specified padding='post'.
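
To make the effect of the padding argument concrete, here is a small pure-Python sketch. pad is a hypothetical helper written for this illustration, not the Keras function. It pads to the longest sequence and contrasts padding='post' with padding='pre', which is the pad_sequences default.

```python
def pad(seqs, padding='post', value=0):
    """Hypothetical helper: pad every sequence to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [value] * (maxlen - len(s))
        padded.append(list(s) + fill if padding == 'post' else fill + list(s))
    return padded

seqs = [[10, 2], [4, 4, 11], [3, 12, 3, 2, 3]]
print(pad(seqs, padding='post'))  # [[10, 2, 0, 0, 0], [4, 4, 11, 0, 0], [3, 12, 3, 2, 3]]
print(pad(seqs, padding='pre'))   # [[0, 0, 0, 10, 2], [0, 0, 4, 4, 11], [3, 12, 3, 2, 3]]
```

Zero is a safe padding value here because the tokenizer's word indices start at 1.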

In [17]:
sequences

array([[10,  2,  0,  0,  0],
       [ 2,  2,  0,  0,  0],
       [ 4,  4, 11,  0,  0],
       [ 3, 12,  3,  2,  3],
       [13, 14,  5,  6,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [ 9,  9,  0,  0,  0],
       [15, 16,  5,  6,  0],
       [17, 18,  0,  0,  0]], dtype=int32)

In [18]:
from keras.datasets import imdb # Prebuilt dataset that ships with Keras
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [19]:
(X_train,y_train),(X_test,y_test) = imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [27]:
X_train # the data is already preprocessed and integer-encoded

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1

In [21]:
X_train.shape

(25000,)

In [22]:
X_test.shape

(25000,)

In [24]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [25]:
y_train.shape

(25000,)

In [26]:
 X_train[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 22665,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 21631,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 19193,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 10311,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 31050,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 12118,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5

In [29]:
len(X_train[0]),len(X_train[1])

(218, 189)

# We can see that each input has a different length

In [30]:
len(X_train[2])

141

In [33]:
X_train = pad_sequences(X_train,padding='post',maxlen=50) # For demonstration purposes we limit each input to 50 tokens; for real training, remove the maxlen parameter
X_test = pad_sequences(X_test,padding='post',maxlen=50)

This code snippet uses the pad_sequences function from Keras again, but this time it's applied to the X_train and X_test datasets loaded from the IMDB dataset.

Here's a breakdown:

- X_train = pad_sequences(X_train, padding='post', maxlen=50): This line pads the X_train dataset.
   - X_train: The input sequences (numerical representations of movie reviews).
   - padding='post': Specifies that padding (zeros) should be added to the end of each sequence.
   - maxlen=50: This is a crucial argument here. It explicitly sets the length of every sequence to 50. If a sequence is shorter than 50, it's padded with zeros at the end. If a sequence is longer than 50, it's truncated from the beginning, because the truncating argument defaults to 'pre'; only the last 50 tokens of each review are kept. The comment in the code mentions this is for demonstration purposes and you might remove maxlen for real training.
- X_test = pad_sequences(X_test, padding='post', maxlen=50): This line does the same padding operation for the X_test dataset, ensuring that the test data also has sequences of length 50.
By applying pad_sequences with maxlen=50, you are creating a consistent input shape of (number of samples, 50) for your neural network model, which is necessary for training.



In [35]:
X_train.shape

(25000, 50)

In [34]:
X_train[0]

array([2071,   56,   26,  141,    6,  194, 7486,   18,    4,  226,   22,
         21,  134,  476,   26,  480,    5,  144,   30, 5535,   18,   51,
         36,   28,  224,   92,   25,  104,    4,  226,   65,   16,   38,
       1334,   88,   12,   16,  283,    5,   16, 4472,  113,  103,   32,
         15,   16, 5345,   19,  178,   32], dtype=int32)
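
The padded X_train[0] above ends with ..., 178, 32, the same tokens that end the original 218-token review. That is the effect of pad_sequences truncating from the front by default (truncating='pre'), independently of padding='post'. The sketch below imitates that behaviour in plain Python. pad_or_truncate is a hypothetical helper, and the 10-element list is a stand-in for a review.

```python
def pad_or_truncate(seq, maxlen, padding='post', truncating='pre', value=0):
    """Hypothetical helper imitating the padding and truncating options."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return list(seq) + fill if padding == 'post' else fill + list(seq)

review = list(range(1, 11))  # stand-in for a 10-token review
print(pad_or_truncate(review, 4))                     # [7, 8, 9, 10]
print(pad_or_truncate(review, 4, truncating='post'))  # [1, 2, 3, 4]
print(pad_or_truncate([5, 6], 4))                     # [5, 6, 0, 0]
```

To keep the beginning of each review instead of the end, truncating='post' can be passed to pad_sequences.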

In [36]:
model = Sequential()

model.add(SimpleRNN(32,input_shape=(50,1),return_sequences=False)) # 50 time steps and 1 input feature
model.add(Dense(1,activation='sigmoid'))

model.summary()



This code snippet defines a simple Recurrent Neural Network (RNN) model using Keras for a binary classification task, likely for the IMDB sentiment analysis dataset you loaded earlier.

Here's a breakdown:

- model = Sequential(): This line initializes a Sequential model, which is a linear stack of layers.
- model.add(SimpleRNN(32, input_shape=(50, 1), return_sequences=False)): This adds a SimpleRNN layer to the model.
  - 32: This is the number of units (neurons) in the RNN layer. More units can capture more complex patterns but also increase computational cost.
  - input_shape=(50, 1): This specifies the shape of the input to this layer. It indicates that each input sample will be a sequence of 50 time steps, and each time step has 1 feature. In the context of your padded sequences with maxlen=50, this means each padded sequence of 50 integers is treated as a sequence of 50 time steps, with each integer being the single feature at that time step.
  - return_sequences=False: This means the RNN layer will output only the final hidden state for each input sequence, not the hidden state at each time step. This is suitable for classification tasks where you need a single output per sequence.
- model.add(Dense(1, activation='sigmoid')): This adds a Dense (fully connected) layer to the model.
  - 1: This is the number of units in the dense layer. For binary classification (like positive/negative sentiment), you need one output unit.
  - activation='sigmoid': The sigmoid activation function is used here. It squashes the output of the dense layer to a value between 0 and 1, which can be interpreted as the probability of the input belonging to the positive class.
- model.summary(): This line prints a summary of the model's architecture, including the layers, their output shapes, and the number of parameters.

The model is designed to take your padded sequences of length 50 as input, process them through the SimpleRNN layer to capture sequential information, and then use the Dense layer with a sigmoid activation to output a single probability score for binary classification.
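
The table printed by model.summary() is not shown above, but the parameter counts can be worked out by hand. The arithmetic below is a sketch that assumes the default layer settings (a bias term in both layers).

```python
units = 32      # SimpleRNN(32)
input_dim = 1   # one feature per time step, from input_shape=(50, 1)

# SimpleRNN: input weights + recurrent weights + bias
rnn_params = units * input_dim + units * units + units
# Dense(1): one weight per RNN unit + one bias
dense_params = units * 1 + 1

print(rnn_params, dense_params, rnn_params + dense_params)  # 1088 33 1121
```

The count does not depend on the 50 time steps, because the same weights are reused at every step.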




In a recurrent neural network (like the SimpleRNN layer you've used), the network processes input sequences one element (or time step) at a time.

- When return_sequences=True, the RNN layer outputs the hidden state for each time step in the input sequence. This means if your input sequence has 50 time steps, the RNN layer will output a sequence of 50 hidden states. This is useful when you're building models where the output at each time step is important, such as in sequence-to-sequence tasks like machine translation or text generation.
- When return_sequences=False (which is the default behavior and what you have in your code), the RNN layer only outputs the hidden state from the last time step of the input sequence. It discards the hidden states from all the previous time steps. This is typically used when you want to produce a single output for the entire input sequence, such as in classification tasks like the sentiment analysis you're doing with the IMDB dataset. You feed the entire movie review sequence into the RNN, and you only need one final output (the sentiment prediction) after the RNN has processed the whole sequence.
In your specific model, you have a Dense layer after the SimpleRNN layer that outputs a single value (the probability of positive sentiment). This Dense layer expects a single input vector per sample, not a sequence of hidden states. Therefore, setting return_sequences=False in the SimpleRNN layer provides the correct input format for the subsequent Dense layer.
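
The difference between the two settings can be seen in a minimal NumPy sketch of a simple RNN forward pass. It is an illustration under assumptions, not the Keras implementation: the weights are random rather than trained, and the names W, U and b are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
timesteps, features, units = 50, 1, 32

# Randomly initialised weights stand in for trained ones (illustrative only).
W = rng.normal(scale=0.1, size=(features, units))  # input -> hidden
U = rng.normal(scale=0.1, size=(units, units))     # hidden -> hidden
b = np.zeros(units)

x = rng.normal(size=(timesteps, features))  # one input sequence
h = np.zeros(units)                         # initial hidden state
states = []
for x_t in x:
    # The same weights are applied at every time step.
    h = np.tanh(x_t @ W + h @ U + b)
    states.append(h)

all_states = np.stack(states)  # what return_sequences=True corresponds to
last_state = all_states[-1]    # what return_sequences=False corresponds to
print(all_states.shape, last_state.shape)  # (50, 32) (32,)
```

A Dense layer placed after the RNN receives the (32,) vector per sample in the return_sequences=False case.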

In [37]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.fit(X_train,y_train,epochs=5,validation_data=(X_test,y_test))

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 19s 21ms/step - accuracy: 0.5094 - loss: 0.6952 - val_accuracy: 0.5026 - val_loss: 0.6959
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 15s 14ms/step - accuracy: 0.5103 - loss: 0.6934 - val_accuracy: 0.5041 - val_loss: 0.6941
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 13s 17ms/step - accuracy: 0.5081 - loss: 0.6933 - val_accuracy: 0.4996 - val_loss: 0.6938
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 13s 17ms/step - accuracy: 0.5101 - loss: 0.6926 - val_accuracy: 0.5028 - val_loss: 0.6933
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 11s 14ms/step - accuracy: 0.5083 - loss: 0.6930 - val_accuracy: 0.5086 - val_loss: 0.6935


<keras.src.callbacks.history.History at 0x7e41b66cb8d0>
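
In the log above, training and validation accuracy stay close to 0.50, which is chance level for this balanced binary task. One plausible reason is that each word index is fed to the RNN as a single raw number, so index 5345 looks like a value about a thousand times larger than index 5 even though word indices carry no such magnitude. The notebook imports Embedding but does not use it. An embedding layer instead looks up a learned vector for each index. The NumPy sketch below only illustrates that lookup with a random table. The sizes vocab_size and embed_dim are assumptions chosen for the example, and vocab_size must exceed the largest index used.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10000, 8  # assumed sizes, for illustration only

# One row per word index; in a real model these rows are learned during training.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

tokens = np.array([15, 16, 5345, 19, 178, 32])  # last six tokens of X_train[0]
vectors = embedding_table[tokens]               # integer-array indexing = lookup

print(vectors.shape)  # (6, 8)
```

With such a layer in front, the RNN would see embed_dim features per time step instead of 1. This variant was not run here, so no accuracy figure is claimed for it.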