## Stage 1: Installing dependencies and setting up the GPU environment

In [48]:
!pip install numpy



## Stage 2: Importing project dependencies

In [49]:
import numpy as np
import tensorflow as tf

from tensorflow.keras.datasets import imdb  # IMDB movie-review sentiment dataset; see https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [50]:
tf.__version__

'2.17.0'

## Stage 3: Dataset preprocessing

### Setting up dataset parameters

In [51]:
number_of_words = 20000
max_len = 100  # every review is fixed to 100 tokens: shorter reviews are padded, longer ones are truncated

### Loading the IMDB dataset

In [52]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=number_of_words)  # keep only the 20,000 most frequent words; rarer words are replaced by an out-of-vocabulary index
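
To make the effect of `num_words` concrete, here is a minimal sketch of a vocabulary cap. `cap_vocabulary` is a hypothetical helper written for illustration, not part of Keras; it assumes the default out-of-vocabulary index of 2 that `imdb.load_data` uses.

```python
# Illustrative sketch only: cap_vocabulary is a hypothetical helper, not a Keras function.
OOV_INDEX = 2  # assumed default out-of-vocabulary index in imdb.load_data


def cap_vocabulary(sequence, num_words, oov_index=OOV_INDEX):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [idx if idx < num_words else oov_index for idx in sequence]


# A made-up review encoded as word indices (lower index = more frequent word).
review = [1, 14, 22, 19999, 20000, 45012, 7]
capped = cap_vocabulary(review, num_words=20000)
print(capped)  # [1, 14, 22, 19999, 2, 2, 7]
```

The review keeps its length; only the rare words are collapsed into a single placeholder index.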

### Padding all sequences to be the same length

Next, we pad all the sequences to the same length.

After padding, every review has exactly 100 elements. A review shorter than 100 tokens is filled with pad tokens (zeros) up to length 100, and a review longer than 100 tokens is truncated to 100.

In [53]:
X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_len)  # pad/truncate the training reviews

In [54]:
X_test = tf.keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_len)  # pad/truncate the test reviews
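
As a rough illustration of what padding does, the NumPy sketch below mimics the default behaviour of `pad_sequences` (zeros added at the front, long sequences cut from the front). `pad_pre` is a hypothetical helper written for this example; it is not the Keras implementation.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` tokens of long ones."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy; the row stays all padding
        kept = seq[-maxlen:]          # truncate from the front
        out[row, -len(kept):] = kept  # right-align the tokens that are kept
    return out


toy_reviews = [[5, 6, 7], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy_reviews, maxlen=4)
print(padded)
# [[0 5 6 7]
#  [3 4 5 6]]
```

With `maxlen=100`, the same logic turns every review into a row of exactly 100 integers, which is what lets the data be stacked into a single 2-D array.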

### Setting up Embedding Layer parameters

The embedding layer creates a vector representation for each word in the reviews. Instead of using pre-trained word vectors, we use the embedding layer to train the word vectors ourselves, stored in one large matrix.

Each row of this matrix corresponds to one of the 20,000 words in the dataset vocabulary, and the columns hold the learned representation (encoding) of that word.

By using this embedding layer, we learn those word representations jointly with the other weights in the network.

In [55]:
vocab_size = number_of_words
vocab_size

20000

In [56]:
embed_size = 128
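
Conceptually, an embedding lookup is just row indexing into that matrix. The sketch below uses NumPy with a tiny vocabulary and random values standing in for learned weights; the names `toy_vocab_size`, `toy_embed_size`, and `embedding_matrix` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small stand-ins for vocab_size=20000 and embed_size=128.
toy_vocab_size, toy_embed_size = 10, 4
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embed_size))

# Two padded "reviews" of length 4, encoded as word indices (0 is the pad token).
batch = np.array([[0, 0, 3, 7],
                  [0, 2, 2, 9]])

# Indexing with an integer array returns one matrix row per word index.
embedded = embedding_matrix[batch]
print(embedded.shape)  # (2, 4, 4): (batch, sequence length, embedding size)
```

In the real model the same lookup yields a tensor of shape `(batch, 100, 128)`, and the matrix entries are updated by backpropagation rather than fixed.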

## Stage 4: Building a Recurrent Neural Network

### Defining the model

In [57]:
model = tf.keras.Sequential()

### Adding the Embedding Layer

In [58]:
model.add(tf.keras.layers.Embedding(vocab_size, embed_size, input_shape=(X_train.shape[1],)))

# vocab_size (input_dim): the number of rows in the embedding matrix, one row per word in the 20,000-word vocabulary.
# embed_size (output_dim): the number of columns used to encode each word; here every word becomes a 128-dimensional vector.
# input_shape: the length of each padded review, taken from the second dimension (index 1) of X_train.


### Adding the LSTM Layer

- units: 128
- activation: tanh

In [59]:
model.add(tf.keras.layers.LSTM(units=128, activation='tanh'))  # tanh is the default activation for the LSTM layer
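
For intuition about what the layer computes at each time step, here is a minimal NumPy sketch of the standard LSTM cell equations. It is an illustration under simplifying assumptions (random weights, tiny sizes), not the Keras implementation; `lstm_step` and all variable names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with the four gates stacked along the last axis."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[:, 0 * units:1 * units])   # input gate
    f = sigmoid(z[:, 1 * units:2 * units])   # forget gate
    g = np.tanh(z[:, 2 * units:3 * units])   # candidate cell state (the tanh activation)
    o = sigmoid(z[:, 3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g                 # new cell state
    h_t = o * np.tanh(c_t)                   # new hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
batch, steps, input_dim, units = 2, 5, 3, 4  # tiny stand-ins for (128, 100, 128, 128)
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)

x = rng.normal(size=(batch, steps, input_dim))
h = np.zeros((batch, units))
c = np.zeros((batch, units))
for t in range(steps):
    h, c = lstm_step(x[:, t, :], h, c, W, U, b)

print(h.shape)  # (2, 4): only the final hidden state is passed on
```

Only the last hidden state is kept here, which mirrors how an LSTM layer feeds a single vector per review into the next layer when it does not return full sequences.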

### Adding the Dense output layer

- units: 1
- activation: sigmoid

In [60]:
model.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))  # the label is 0 or 1 (negative or positive), so a single sigmoid output neuron is enough

### Compiling the model

In [61]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])  # rmsprop is a common optimizer choice for RNNs; binary_crossentropy fits the 0/1 output; accuracy is tracked because this is binary classification
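
To show how the sigmoid output and the binary cross-entropy loss fit together, here is a small NumPy sketch. The functions are written for illustration (the clipping constant `eps` is an assumption) and are not the Keras implementations; the logits and labels are made up.

```python
import numpy as np


def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))


logits = np.array([2.0, -1.0, 0.0])   # made-up raw scores from the output neuron
labels = np.array([1.0, 0.0, 1.0])    # 1 = positive review, 0 = negative review

probs = sigmoid(logits)
loss = binary_crossentropy(labels, probs)
print(probs, loss)
```

A confident correct prediction contributes little to the loss, while a prediction of exactly 0.5 contributes ln 2 (about 0.693), which is roughly where the training loss starts before the model has learned anything.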

In [62]:
model.summary()
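
The parameter counts that `model.summary()` reports can be checked by hand. The sketch below assumes the standard layout of a Keras LSTM layer (input kernel, recurrent kernel, and bias for each of the four gates) and the layer sizes used in this notebook.

```python
vocab_size, embed_size, lstm_units = 20000, 128, 128

# Embedding: one 128-dimensional row per word in the vocabulary.
embedding_params = vocab_size * embed_size

# LSTM: four gates, each with an input kernel, a recurrent kernel, and a bias.
lstm_params = 4 * (embed_size * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 2560000 131584 129 2691713
```

Almost all of the weights sit in the embedding matrix, so the vocabulary size is the main driver of model size here.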

### Training the model

In [63]:
model.fit(X_train, y_train, epochs=3, batch_size=128)  # each batch holds 128 reviews; epochs=3 means the model passes over the whole training set 3 times

Epoch 1/3
196/196 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - accuracy: 0.5971 - loss: 0.6494
Epoch 2/3
196/196 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.8227 - loss: 0.4039
Epoch 3/3
196/196 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.8624 - loss: 0.3285


<keras.src.callbacks.history.History at 0x7e74efafb5b0>
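
The `196/196` in the log is the number of batches per epoch. A quick check, assuming the 25,000 training reviews of the Keras IMDB split:

```python
import math

num_train_reviews = 25000  # size of the Keras IMDB training split
batch_size = 128

steps_per_epoch = math.ceil(num_train_reviews / batch_size)
last_batch_size = num_train_reviews - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, last_batch_size)  # 196 40
```

The final batch of each epoch is smaller than the others because 25,000 is not a multiple of 128.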

### Evaluating the model

In [64]:
test_loss, test_accuracy = model.evaluate(X_test, y_test)

782/782 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.8471 - loss: 0.3472

In [66]:
print("Test accuracy: {}".format(test_accuracy))

Test accuracy: 0.8493599891662598
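
The accuracy metric compares thresholded sigmoid outputs with the true labels. The sketch below reproduces that idea in NumPy on made-up probabilities; `binary_accuracy` is a hypothetical helper and the 0.5 threshold is the usual default. It also checks the `782` steps in the evaluation log, assuming 25,000 test reviews and the default evaluation batch size of 32.

```python
import math

import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    y_pred = (y_prob > threshold).astype("int32")
    return float(np.mean(y_pred == y_true))


toy_probs = np.array([0.91, 0.12, 0.55, 0.40])  # made-up sigmoid outputs
toy_labels = np.array([1, 0, 0, 1])             # made-up true sentiments
toy_accuracy = binary_accuracy(toy_labels, toy_probs)
print(toy_accuracy)  # 0.5

# Number of evaluation batches: 25,000 test reviews in batches of 32.
eval_steps = math.ceil(25000 / 32)
print(eval_steps)  # 782
```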
