[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/francisco-ortin/data-science-course/blob/main/deep-learning/rnn/lazy.ipynb)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# Training Deep ANNs with lazy loading of data

In the previous notebook, we limited the amount of data loaded into memory to avoid memory issues. However, deep models have many parameters, which must be learned from large amounts of data. This creates a problem: we need the data in memory to train the model, but we cannot load all of it at once.

To solve this problem, we load the data lazily, one small chunk (mini-batch) at a time. The Keras `fit` method accepts a generator that yields the mini-batches, so the whole dataset never has to be in memory at once. This technique is essential when dealing with large datasets (big data), which are very common when training deep learning models.
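To make the idea concrete, here is a minimal sketch of a lazy mini-batch generator in plain Python. The `batches` function and the toy `samples` list are illustrative names, not part of this notebook; the point is only that each batch is built when it is requested, not before.

```python
from typing import Iterator


def batches(data: list[int], batch_size: int) -> Iterator[list[int]]:
    """Yield consecutive slices of `data`; each slice is built only when requested."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


samples = list(range(10))          # toy stand-in for a dataset
batch_iter = batches(samples, 4)   # creating the generator slices nothing yet
first_batch = next(batch_iter)     # the first batch is produced on demand
remaining = list(batch_iter)       # the last batch holds the 2 leftover samples
print(first_batch, remaining)
```

The generator used later in this notebook follows the same pattern, but it also transforms each batch before yielding it.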

In [1]:
# make sure the required packages are installed
%pip install pandas numpy seaborn matplotlib scikit-learn keras tensorflow tensorflow-hub --quiet
# if running in colab, install the required packages and copy the necessary files
directory='data-science-course/deep-learning/rnn'
if get_ipython().__class__.__module__.startswith('google.colab'):
    !git clone https://github.com/francisco-ortin/data-science-course.git  2>/dev/null
    !cp --update {directory}/*.py .
    !mkdir -p img data
    !cp {directory}/data/* data/.
    !cp {directory}/img/* img/.

import os
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow as tf
import tensorflow_hub as hub
import zipfile

Note: you may need to restart the kernel to use updated packages.


## Parameters

Now, we use the 100,000 most frequent words (5,000 in the previous notebook). We use 49,000 reviews for training (1,000 in the previous notebook), lazily loading the batches into memory and applying the ELMo embedding transformation to each batch. For validation and test, we keep using 500 reviews each.

In [2]:
# Keep only the `vocabulary_size` most frequent words. The higher the value, the more memory is needed.
vocabulary_size = 100_000
# Maximum length of the reviews, in tokens (truncating the reviews speeds up training)
max_review_length = 80
# Max number of epochs to train the models (we use early stopping)
n_epochs = 50
# Embedding dimensions (ELMo embeddings have 1024 dimensions)
embedding_dim = 1024
# Number of sentences for validation and test, the remaining ones (49,000) will be used for training
n_sentences_val, n_max_sentences_test = 500, 500
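
A rough back-of-the-envelope estimate shows why the embedded training set cannot be held in memory. This sketch assumes that the embeddings are stored as 32-bit floats, that every padded review yields exactly `max_review_length` embedded tokens, and that the batch size is 32 (the value passed to the training function later in this notebook).

```python
n_train, max_review_length, embedding_dim = 49_000, 80, 1024
bytes_per_float32 = 4  # assumption: embeddings stored as float32
batch_size = 32        # value used when training later in the notebook

# Size of the ELMo embeddings for the whole training set vs. a single mini-batch
full_bytes = n_train * max_review_length * embedding_dim * bytes_per_float32
batch_bytes = batch_size * max_review_length * embedding_dim * bytes_per_float32

print(f"Whole training set: {full_bytes / 1e9:.2f} GB")
print(f"One mini-batch:     {batch_bytes / 2**20:.0f} MiB")
```

Under these assumptions, the full embedded training set needs about 16 GB, while a single mini-batch needs only 10 MiB.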

## Load the dataset

We load the dataset from the Keras API and split it into 500 reviews for testing, 500 for validation, and 49,000 for training.

In [3]:
(X_train, y_train), (X_half, y_half) = keras.datasets.imdb.load_data(num_words=vocabulary_size)
assert n_sentences_val + n_max_sentences_test <= len(X_half), "Not enough sentences for validation and test."
X_test, X_val = X_half[-n_max_sentences_test:], X_half[-(n_sentences_val + n_max_sentences_test):-n_max_sentences_test]
y_test, y_val = y_half[-n_max_sentences_test:], y_half[-(n_sentences_val + n_max_sentences_test):-n_max_sentences_test]
# concat to X_train the remaining samples not used for validation and test
X_train = np.concatenate((X_train, X_half[:-(n_sentences_val + n_max_sentences_test)]), axis=0)
y_train = np.concatenate((y_train, y_half[:-(n_sentences_val + n_max_sentences_test)]), axis=0)
X_half, y_half = None, None  # free memory

print(f"Training sequences: {len(X_train):,}.\nValidation sequences: {len(X_val):,}.\nTesting sequences: {len(X_test):,}.")

Training sequences: 49,000.
Validation sequences: 500.
Testing sequences: 500.
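
The negative-index slicing used for the split can be hard to read, so here is a small sketch of the same pattern on a toy list. The names `held_out`, `n_val` and `n_test` are illustrative stand-ins for `X_half`, `n_sentences_val` and `n_max_sentences_test`.

```python
held_out = list(range(10))  # toy stand-in for X_half
n_val, n_test = 2, 3

test = held_out[-n_test:]                    # the last n_test items
val = held_out[-(n_val + n_test):-n_test]    # the n_val items just before them
extra_train = held_out[:-(n_val + n_test)]   # everything else goes to training

print(extra_train, val, test)
```

The three slices are disjoint and together cover the whole list, so no review ends up in more than one set.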


The rest of the loading process is the same as in the previous notebook.

In [4]:
# Let's print some reviews. We need to convert the integers (token ids) back to words.
word_to_index = {word: index+3 for word, index in keras.datasets.imdb.get_word_index().items()}  # word -> integer dictionary
# The IMDB dataset reserves the first 4 indices for special tokens: <PAD>, <START>, <OOV>, <END>
index_to_word = {value: key for key, value in word_to_index.items()}  # integer -> word dictionary
index_to_word[0] = "<PAD>"
index_to_word[1] = "<START>"
index_to_word[2] = "<OOV>"
index_to_word[3] = "<END>"


def decode_review(encoded_review: list[int]) -> str:
    """Decode a review from a list of integers to a string."""
    return ' '.join(index_to_word.get(word_index, "<OOV>") for word_index in encoded_review)

# We show the first reviews and their corresponding sentiment.
print("First reviews in training set, with the corresponding labels:")
for (i, (review, label)) in enumerate(zip(X_train[:5], y_train[:5])):
    print(f"Review {i + 1}: {decode_review(review)}.\nLabel: {label}.")

# We add padding to the reviews to have the same length. We use the `post` mode to pad and truncate at the end of the reviews.
X_train = keras.preprocessing.sequence.pad_sequences(X_train, maxlen=max_review_length, padding="post", truncating="post")
X_val = keras.preprocessing.sequence.pad_sequences(X_val, maxlen=max_review_length, padding="post", truncating="post")
X_test = keras.preprocessing.sequence.pad_sequences(X_test, maxlen=max_review_length, padding="post", truncating="post")

First reviews in training set, with the corresponding labels:
Review 1: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such
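
As an illustration of what `padding="post"` and `truncating="post"` do, the following sketch reimplements that behaviour in plain Python. The `pad_post` helper is hypothetical and written only for this example; the notebook itself relies on `pad_sequences`, whose default padding value is 0.

```python
def pad_post(sequences: list[list[int]], maxlen: int, value: int = 0) -> list[list[int]]:
    """Cut each sequence to `maxlen` items at the end, then pad at the end with `value`."""
    padded = []
    for seq in sequences:
        kept = seq[:maxlen]                                    # like truncating="post"
        padded.append(kept + [value] * (maxlen - len(kept)))   # like padding="post"
    return padded


toy_reviews = [[1, 14, 22], [1, 5, 6, 7, 8, 9]]  # made-up token ids
padded_reviews = pad_post(toy_reviews, maxlen=4)
print(padded_reviews)
```

The short review is filled with zeros (the `<PAD>` index) at the end, and the long one loses its last tokens.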

## ELMo embeddings

The `get_elmo_embeddings` function returns the ELMo embeddings for a list of sentences. It is used to convert the small test and validation sets to ELMo embeddings. For training, we have to do it lazily, since we cannot load all the data into memory at once.

In [5]:
elmo = hub.load("https://tfhub.dev/google/elmo/3")

def get_elmo_embeddings(sentences: list[str]) -> np.ndarray:
    """
    Get ELMo embeddings for a list of sentences.
    :param sentences: the sentences to embed, as whitespace-separated tokens
    :return: an array of shape (n_sentences, max_tokens, 1024)
    """
    # The 'elmo' output holds one 1024-dimensional embedding per token
    embeddings = elmo.signatures['default'](tf.constant(sentences))['elmo']
    return embeddings.numpy()  # Convert the tensor to a NumPy array for easier manipulation

X_val_elmo = get_elmo_embeddings([decode_review(review) for review in X_val])
X_test_elmo = get_elmo_embeddings([decode_review(review) for review in X_test])


## Lazy data generation for training

The following `generate_data_lazy` function generates the training data in a lazy way, batch after batch, to avoid memory issues.

In [6]:
from collections.abc import Iterator

def generate_data_lazy(X_train_p: np.ndarray, y_train_p: np.ndarray, batch_size: int,
                       n_epochs_p: int) -> Iterator[tuple[np.ndarray, np.ndarray]]:
    """
    Generate training data in a lazy way, batch after batch, to avoid memory issues.
    In this way, the whole dataset is never loaded into memory at once.
    :param X_train_p: The padded training sequences of token ids, with shape (n_sentences_train, max_review_length)
    :param y_train_p: The training labels, with shape (n_sentences_train,)
    :param batch_size: The batch size
    :param n_epochs_p: The number of epochs
    :return: A generator that yields (X_batch, y_batch) tuples; X_batch holds the ELMo embeddings with shape
             (batch_size, max_review_length, embedding_dim), and the last batch of an epoch may be smaller
    """
    n_samples = X_train_p.shape[0]
    for epoch in range(n_epochs_p):
        # Step through the data in strides of batch_size; the last batch holds the remaining samples
        for from_index in range(0, n_samples, batch_size):
            to_index = min(from_index + batch_size, n_samples)
            X_train_elmo_batch = get_elmo_embeddings([decode_review(review) for review in X_train_p[from_index:to_index]])
            yield X_train_elmo_batch, np.array(y_train_p[from_index:to_index])


Now, the `compile_train_evaluate` function is modified to train the model with the lazy data generator `generate_data_lazy`. The rest of the parameters are the same as in the previous notebook.

In [7]:
def compile_train_evaluate(model_p: keras.Model, X_train_p: np.ndarray, y_train_p: np.ndarray,
                           x_val_p: np.ndarray, y_val_p: np.ndarray, x_test_p: np.ndarray, y_test_p: np.ndarray,
                           batch_size: int, epochs: int) -> tuple[float, float, keras.Model]:
    """
    Compile, train and evaluate the model.
    :param model_p: the model to compile, train and evaluate
    :param X_train_p: train X sequences (padded token ids, embedded lazily batch by batch)
    :param y_train_p: train y labels
    :param x_val_p: validation X sequences (already converted to ELMo embeddings)
    :param y_val_p: validation y labels
    :param x_test_p: test X sequences (already converted to ELMo embeddings)
    :param y_test_p: test y labels
    :param batch_size: batch size
    :param epochs: number of epochs
    :return: (test_loss, test_accuracy, model)
    """
    # we compile and train the model
    model_p.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    early_stopping_callback = tf.keras.callbacks.EarlyStopping(patience=2, restore_best_weights=True)
    # Number of batches the generator yields per epoch (ceiling division, so the last partial batch is counted)
    steps_per_epoch = (X_train_p.shape[0] + batch_size - 1) // batch_size
    # batch_size is not passed to fit because the generator already yields batches
    model_p.fit(generate_data_lazy(X_train_p, y_train_p, batch_size, epochs),
                epochs=epochs,
                steps_per_epoch=steps_per_epoch,
                validation_data=(x_val_p, y_val_p),
                callbacks=[early_stopping_callback])
    # Evaluate the model on the test set
    loss, accuracy = model_p.evaluate(x_test_p, y_test_p)
    return loss, accuracy, model_p
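
Since a generator has no length, `steps_per_epoch` tells `fit` how many batches make up one epoch, and it should match the number of batches the generator yields per epoch. The sketch below computes that number with ceiling division; `batches_per_epoch` is a hypothetical helper used only for this illustration.

```python
def batches_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover all samples, counting a final partial batch."""
    return (n_samples + batch_size - 1) // batch_size


steps = batches_per_epoch(49_000, 32)   # 1,531 full batches plus one partial batch
leftover = 49_000 % 32                  # number of samples in that partial batch
print(steps, leftover)

# The shortcut `n // batch_size + 1` over-counts when the size is an exact multiple
print(batches_per_epoch(64, 32), 64 // 32 + 1)
```

With 49,000 reviews and a batch size of 32, both expressions give the same value, but they differ whenever the number of samples is an exact multiple of the batch size.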

## Model compilation, training, and evaluation

We create the same model as in the previous notebook, compile, train, and evaluate it.

In [None]:
inputs = keras.Input(shape=(None, embedding_dim), dtype="float32")
# Add 2 bidirectional LSTM layers
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, dropout=0.2))(inputs)
x = layers.Bidirectional(layers.LSTM(64))(x)
# Dropout layer with 20% rate
x = layers.Dropout(0.2)(x)
# Add a classifier (sigmoid activation function for binary classification)
outputs = layers.Dense(1, activation="sigmoid")(x)
elmo_model = keras.Model(inputs, outputs)
elmo_model.summary()

# Compile, train and evaluate the model
test_loss, test_accuracy, elmo_model = compile_train_evaluate(elmo_model, X_train, y_train, X_val_elmo, y_val, X_test_elmo, y_test, 32, n_epochs)
print(f"Test loss: {test_loss:.4f}.\nTest accuracy: {test_accuracy:.4f}.")

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None, 1024)]      0         
                                                                 
 bidirectional (Bidirection  (None, None, 128)         557568    
 al)                                                             
                                                                 
 bidirectional_1 (Bidirecti  (None, 128)               98816     
 onal)                                                           
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 1)                 129       
                                                                 
Total params: 656513 (2.50 MB)
Trainable params: 656513 (2.50




Epoch 1/50
Epoch 2/50

## Inference

Let's test the model with the same reviews as in the previous notebook. Feel free to add more reviews to test the model.

In [None]:
example_reviews = ["The movie was a great waste of time. It is awful and boring.",
                   "I loved the movie. The plot was amazing.",
                   "This movie is not worth watching.",
                   "Although the film is not a masterpiece, you may have a good time if your expectations are not high."]


review_embeddings = get_elmo_embeddings(example_reviews)
predictions = elmo_model.predict(review_embeddings)

for i, prediction in enumerate(predictions):
    print(f"Review {i + 1}: {example_reviews[i]}")
    print(f"Probability of being positive: {prediction[0]:.4f}.\n")
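
The model outputs a probability, so a threshold is needed to turn it into a label. The following sketch assumes the usual 0.5 cut-off; the `to_label` helper and the probability values are made up for illustration and are not outputs of the model.

```python
def to_label(probability: float, threshold: float = 0.5) -> str:
    """Map the probability of the positive class to a sentiment label."""
    return "positive" if probability >= threshold else "negative"


example_probabilities = [0.08, 0.93, 0.41]  # made-up values, not model outputs
labels = [to_label(p) for p in example_probabilities]
print(labels)
```

A different threshold may be preferable if false positives and false negatives have different costs.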

## ✨ Questions

1. Has this model improved on the accuracy of the similar model in the previous notebook?
Answer: Yes, accuracy has improved from 0.7320 to XXXXX.
2. What is the main reason that explains the previous answer?
Answer: The model is trained on more data. A secondary reason is the larger vocabulary, which leaves fewer OOV tokens.
3. When do you think you will need to train a model in this way?
Answer: When the dataset is too large to fit in memory, which is common when training deep models.

### Answers

*Write your answers here.*
