# Sequence to Sequence Learning with Keras (Beta)
Author: Hayson Cheung [hayson.cheung@mail.utoronto.ca]\
Adapted from: Ilya Sutskever, Oriol Vinyals, Quoc V. Le

In this notebook, we learn from the work of Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, Sequence to Sequence Learning with Neural Networks, NIPS 2014. We will implement a simple sequence-to-sequence model using LSTMs in Keras.

The model maps sequences of German words to sequences of English words. It is trained on a dataset of English sentences and their corresponding German sentences, and the goal is to translate German sentences into English.

## Initialization & Hyper Params

Import TensorFlow

In [None]:
# sample.ipynb
print("Importing Tensorflow")

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.layers import Embedding


Warning: training this yourself on a CPU takes an extremely long time.

In [None]:
# Check if GPU is available
import tensorflow as tf

num_GPUs = len(tf.config.list_physical_devices('GPU'))
# CPU_FALLBACK is used later to pick the training device
CPU_FALLBACK = num_GPUs == 0
if num_GPUs > 0:
    print(f"Number of GPUs: {num_GPUs}")
else:
    print("No GPUs available, the code is intended to be run on a GPU for faster training.")
    # Built-in input() works both in a terminal and in a notebook
    answer = input("Do you want to continue training on a CPU? [Yes/No] ")
    if answer.strip().lower() not in ('yes', 'y'):
        print("Exiting...")
        raise SystemExit
    print("Tensorflow is running on CPU")


Define the dimension of the latent space. It is a hyperparameter; typically, we take powers of 2.

In [None]:
# Parameters

# Latent dimension is the number of hidden units |h(t)| in the LSTM cell

# In accordance with the paper (which uses 1000; 1024 is the nearest power of 2):
""" 
LATENT_DIM = 1024 # Number of LSTM units per layer
EMBEDDING_DIM = 1024  # Embedding dimension
NUM_LAYERS = 4  # Deep LSTM layers
"""

# SMALLER DIMENSIONS FOR TESTING
LATENT_DIM = 256 
EMBEDDING_DIM = 256
NUM_LAYERS = 4
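
To get a feeling for what these hyperparameters cost, the sketch below counts the trainable parameters of the stacked LSTM layers alone (embeddings and the output layer are not included). The helper names `lstm_param_count` and `stack_param_count` are made up for this illustration; the formula is the usual one for a Keras LSTM layer with biases: four gates, each with input weights, recurrent weights, and a bias.

```python
def lstm_param_count(input_dim: int, units: int) -> int:
    """Trainable parameters of a single LSTM layer with biases."""
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)


def stack_param_count(embedding_dim: int, latent_dim: int, num_layers: int) -> int:
    """Parameters of num_layers stacked LSTM layers fed by an embedding."""
    total = 0
    input_dim = embedding_dim
    for _ in range(num_layers):
        total += lstm_param_count(input_dim, latent_dim)
        input_dim = latent_dim  # deeper layers read the previous layer's output
    return total


small_stack = stack_param_count(256, 256, 4)     # the testing configuration
large_stack = stack_param_count(1024, 1024, 4)   # the commented-out configuration
print(f"256-dim stack:  {small_stack:,} parameters")
print(f"1024-dim stack: {large_stack:,} parameters")
```

The larger configuration has roughly 16 times as many LSTM parameters per stack, which is one reason the smaller dimensions are used for testing.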


## Load Data (make sure the path to the data file is correct)

Choose a data set online in TMX format. Each translation unit starts with `<tu>`, contains `<seg>English</seg>` and `<seg>Deutsch</seg>`, and ends with `</tu>`.

In [None]:
import load_data

load_data.main()

from load_data import INPUT_VOCAB_SIZE, OUTPUT_VOCAB_SIZE, MAX_INPUT_LENGTH, MAX_OUTPUT_LENGTH

print(f"Input vocab size: {INPUT_VOCAB_SIZE}")
print(f"Output vocab size: {OUTPUT_VOCAB_SIZE}")
print(f"Max input length: {MAX_INPUT_LENGTH}")
print(f"Max output length: {MAX_OUTPUT_LENGTH}")
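
The TMX layout described above can be read with the standard library alone. The following is only a sketch of the idea and not the implementation inside `load_data`: the function name `read_pairs` and the sample text are made up for illustration. In standard TMX files each `<seg>` is wrapped in a `<tuv>` element that carries the language, so the sketch searches for `<seg>` anywhere inside a `<tu>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-sentence sample in TMX layout
SAMPLE_TMX = """<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>I am a student</seg></tuv>
      <tuv xml:lang="de"><seg>Ich bin ein Student</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text):
    """Return (first segment, second segment) for every translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = [seg.text or "" for seg in tu.iter("seg")]
        if len(segments) == 2:  # skip units that are not a clean sentence pair
            pairs.append((segments[0], segments[1]))
    return pairs


pairs = read_pairs(SAMPLE_TMX)
for english, german in pairs:
    print(f"{german!r} -> {english!r}")
```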

## ENCODER and DECODER

In the two LSTM models, the encoder LSTM model will take the input sequence and return the encoder states. The decoder LSTM model will take the output sequence and the encoder states as input and return the output sequence. The encoder and decoder models are defined separately and then combined to form the final model.

We would also like to implement a learning rate schedule: the paper starts reducing the learning rate after 5 epochs.

In [None]:
# Define Encoder
encoder_input = Input(shape=(MAX_INPUT_LENGTH,))
encoder_embedding = Embedding(INPUT_VOCAB_SIZE, EMBEDDING_DIM)(encoder_input)

encoder_lstm = []
encoder_states = []
x = encoder_embedding
for i in range(NUM_LAYERS):
    # Every layer returns its full output sequence (fed to the next layer) and its final states
    lstm_layer = LSTM(LATENT_DIM, return_sequences=True, return_state=True, kernel_initializer=tf.keras.initializers.RandomUniform(-0.08, 0.08))
    x, state_h, state_c = lstm_layer(x)
    if i == NUM_LAYERS - 1:
        # Only the states of the last layer are kept and handed to the decoder
        encoder_states = [state_h, state_c]
    encoder_lstm.append(lstm_layer)

# Define Decoder
# shape=(None,) accepts any sequence length: MAX_OUTPUT_LENGTH tokens per
# sentence during training, and one token per step during inference
decoder_input = Input(shape=(None,))
decoder_embedding = Embedding(OUTPUT_VOCAB_SIZE, EMBEDDING_DIM)(decoder_input)

decoder_lstm = []
x = decoder_embedding
for i in range(NUM_LAYERS):
    lstm_layer = LSTM(LATENT_DIM, return_sequences=True, return_state=True, kernel_initializer=tf.keras.initializers.RandomUniform(-0.08, 0.08))
    if i == NUM_LAYERS - 1:
        # Only the last decoder layer is initialized with the encoder states
        x, _, _ = lstm_layer(x, initial_state=encoder_states)
    else:
        # Lower layers start from the default (zero) initial state
        x = lstm_layer(x)[0]
    decoder_lstm.append(lstm_layer)

# Output Layer
decoder_dense = Dense(OUTPUT_VOCAB_SIZE, activation='softmax')
decoder_output = decoder_dense(x)

In [None]:
# Define Model
model = Model([encoder_input, decoder_input], decoder_output)

# Optimizer with Gradient Clipping
optimizer = SGD(learning_rate=0.7, clipnorm=5)

# Compile Model
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None
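
The `clipnorm=5` argument above rescales a gradient whenever its L2 norm exceeds 5. The sketch below shows that rescaling on a plain Python list; `clip_by_norm` is a made-up helper for illustration, not a Keras function. One caveat worth knowing: as far as the Keras documentation describes it, `clipnorm` is applied to the gradient of each weight separately, while the paper scales the whole gradient by its overall norm, which corresponds more closely to the optimizer argument `global_clipnorm`.

```python
import math


def clip_by_norm(gradient, clipnorm):
    """Scale the gradient down so that its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        return list(gradient)  # small gradients are left untouched
    scale = clipnorm / norm
    return [g * scale for g in gradient]


large_gradient = [6.0, 8.0]                 # L2 norm is 10
clipped = clip_by_norm(large_gradient, 5.0)
print(clipped)                              # same direction, norm reduced to 5
```

Clipping keeps the direction of the update and only limits its size, which helps against the exploding gradients that deep LSTMs are prone to.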

## Training the Model
This is where we train the model. We use the encoder input and decoder input to predict the decoder output. The model is trained on the dataset of English sentences and their corresponding German sentences.

This takes a while to run. We can save the model and load it later. Below is the lr (learning rate) schedule:

In [None]:
def lr_schedule(epoch, lr):
    if epoch >= 5:
        return lr * 0.5
    return lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
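
To see what this schedule does over a run, the sketch below replays it in plain Python. It assumes the starting learning rate of 0.7 and the 8 epochs used in this notebook, and that the callback passes the current learning rate into the function at the start of every epoch (epochs are counted from 0), so the halving compounds.

```python
def lr_schedule(epoch, lr):
    # Same rule as above: halve the learning rate from epoch index 5 onwards
    if epoch >= 5:
        return lr * 0.5
    return lr


lr = 0.7      # initial learning rate of the SGD optimizer
rates = []
for epoch in range(8):
    lr = lr_schedule(epoch, lr)   # the returned value becomes the new rate
    rates.append(lr)

for epoch, rate in enumerate(rates):
    print(f"epoch {epoch}: lr = {rate:.4f}")
```

The first five epochs run at 0.7, after which the rate is halved once per epoch. The paper halves every half epoch instead, so this is a coarser version of the same idea.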

### Explanation of the data set

- `encoder_input_train`: Training data for the encoder (German sentences).
- `decoder_input_train`: Training data for the decoder (English sentences with `<start>` token).
- `decoder_target_train`: Target data for the decoder (English sentences).
- `encoder_input_val`: Validation data for the encoder (German sentences).
- `decoder_input_val`: Validation data for the decoder (English sentences with `<start>` token).
- `decoder_target_val`: Target data for the decoder (English sentences).
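
The relation between decoder input and decoder target is easiest to see on one sentence: the target is the input shifted one step to the left, so at every position the decoder is asked to predict the next token. The sketch below illustrates this with made-up token ids. The helper `make_decoder_pair`, the ids for the start and end tokens, and the padding at the end of the sequence are all assumptions for this example; `load_data` may prepare the arrays differently.

```python
def make_decoder_pair(token_ids, sos_id, eos_id, max_len, pad_id=0):
    """Build (decoder input, decoder target) for one tokenized sentence."""
    decoder_input = [sos_id] + token_ids     # starts with the start token
    decoder_target = token_ids + [eos_id]    # same tokens, shifted left, ends with the end token

    def pad(seq):
        # Pad with pad_id at the end, then cut to max_len
        return (seq + [pad_id] * max_len)[:max_len]

    return pad(decoder_input), pad(decoder_target)


# Hypothetical ids: 1 = start token, 2 = end token, 11..13 = words
dec_in, dec_target = make_decoder_pair([11, 12, 13], sos_id=1, eos_id=2, max_len=6)
print("decoder input: ", dec_in)
print("decoder target:", dec_target)
```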



In [None]:
# Load the prepared data set and train on the GPU (or on the CPU as a fallback)
from load_data import encoder_input_train, decoder_input_train, decoder_target_train, encoder_input_val, decoder_input_val, decoder_target_val
device = '/CPU:0' if CPU_FALLBACK else '/GPU:0'
with tf.device(device):
    history = model.fit(
        [encoder_input_train, decoder_input_train],
        decoder_target_train,
        batch_size=128,
        epochs=8,
        validation_data=([encoder_input_val, decoder_input_val], decoder_target_val),
        callbacks=[lr_callback],
        verbose=1
    )

## Plotting the Training Loss

In [None]:
import matplotlib.pyplot as plt
# Plot the training and validation loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Training and Validation Loss Over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

As you can see, training the model takes a long time (a little more than 2 hours), and this is on a reduced dataset.

With the TED dataset, a single epoch takes a little more than an hour.

Also, the model does not work after the training above. Can you see why?

<details>

### Overfitting

From GPT

Overfitting in a seq2seq model using LSTMs can occur due to a number of factors. Here are the most likely ones:

Insufficient Training Data: If the dataset is too small or doesn't adequately represent the variety of real-world data the model will encounter, the model can memorize the training data instead of learning generalizable patterns.

Model Complexity: LSTM networks have a large number of parameters. If the architecture is too complex (too many layers or units), the model may overfit, especially with limited data.

Lack of Regularization: If regularization techniques like dropout or L2 regularization (weight decay) are not applied, the model may overfit by relying too heavily on specific features of the training data.

Training for Too Many Epochs: Training for too long without early stopping or monitoring the validation loss can lead to the model memorizing the training data.

Noisy Data: If the training data contains a lot of noise (irrelevant or inconsistent information), the model may end up fitting that noise rather than learning the underlying patterns.

Batch Size: A very small batch size can lead to noisy updates that could cause overfitting, while too large of a batch size might lead to poor generalization.

Lack of Data Augmentation: For certain types of data (such as text), data augmentation techniques (like paraphrasing) can help increase the diversity of the training set and reduce overfitting.

Do you have a fix for that?
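
One possible fix for the "too many epochs" point is early stopping: watch the validation loss and stop once it has not improved for a few epochs. The sketch below shows the idea in plain Python on made-up validation losses; `best_epoch_with_patience` is a hypothetical helper, not part of Keras. In Keras the same behaviour is available through `tf.keras.callbacks.EarlyStopping` (arguments `monitor`, `patience`, `restore_best_weights`), and dropout can be added through the `dropout` and `recurrent_dropout` arguments of `LSTM`.

```python
def best_epoch_with_patience(val_losses, patience):
    """Return (best epoch, epoch at which training stops)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1  # ran to the end


# Made-up validation losses: improving at first, then getting worse
val_losses = [2.9, 2.4, 2.2, 2.3, 2.5, 2.8]
best, stopped = best_epoch_with_patience(val_losses, patience=2)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```
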
</details>

### We save the model:

In [None]:
model.save("seq2seq_model.h5")

## Model Inference

Below is the code to load the model for inference and to translate sentences on the user end.

In [None]:
from load_data import input_tokenizer, output_tokenizer
from tensorflow.keras.models import load_model
import numpy as np

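# Reload the saved model. Note that the inference models below are built from
# the in-memory layers defined above, so this cell must run in the same
# session as the cells that built and trained the model.
model = load_model("seq2seq_model.h5")

# Extract Encoder Model
encoder_model = Model(encoder_input, encoder_states)

# Define Decoder Model
# Each decoder layer gets its own (h, c) state inputs so that a decoding step
# runs through the whole LSTM stack, not only through the last layer.
decoder_states_inputs = []
decoder_states = []
x = decoder_embedding
for lstm_layer in decoder_lstm:
    state_input_h = Input(shape=(LATENT_DIM,))
    state_input_c = Input(shape=(LATENT_DIM,))
    x, state_h, state_c = lstm_layer(x, initial_state=[state_input_h, state_input_c])
    decoder_states_inputs += [state_input_h, state_input_c]
    decoder_states += [state_h, state_c]
decoder_outputs = decoder_dense(x)

decoder_model = Model(
    [decoder_input] + decoder_states_inputs,
    [decoder_outputs] + decoder_states
)

# Chat Function
def decode_sequence(input_text):
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    print(input_seq)
    input_seq = tf.keras.preprocessing.sequence.pad_sequences(input_seq, maxlen=MAX_INPUT_LENGTH)
    encoder_h, encoder_c = encoder_model.predict(input_seq)
    # As in training, the lower decoder layers start from zero states and only
    # the last layer is initialized with the encoder states.
    zero_state = np.zeros((1, LATENT_DIM), dtype="float32")
    states_value = [zero_state] * (2 * (NUM_LAYERS - 1)) + [encoder_h, encoder_c]
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = output_tokenizer.word_index['sos']
    stop_condition = False
    decoded_sentence = ""

    while not stop_condition:
        outputs = decoder_model.predict([target_seq] + states_value)
        # First output: token probabilities; remaining outputs: new states of every layer
        output_tokens, states_value = outputs[0], list(outputs[1:])
        sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_word = output_tokenizer.index_word.get(sampled_token_index, '')
        decoded_sentence += sampled_word + " "
        if sampled_word == 'eos' or len(decoded_sentence.split()) > MAX_OUTPUT_LENGTH:
            stop_condition = True
        # Feed the chosen token back in as the next decoder input
        target_seq[0, 0] = sampled_token_index

    return decoded_sentence.strip()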
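
The loop in `decode_sequence` is a greedy decoder: at every step it takes the most likely token and feeds it back in as the next input. The sketch below isolates that control flow from Keras. The step function `toy_step` is a made-up stand-in for the decoder model (it simply prefers "previous token + 1"), and all token ids are hypothetical. Unlike `decode_sequence`, this sketch does not include the end token in its output.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Greedy decoding: always pick the highest-scoring next token."""
    token = start_id
    state = None
    output = []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        # argmax over the vocabulary
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == end_id:
            break  # the end token finishes the sentence
        output.append(token)
    return output


def toy_step(token, state):
    """Hypothetical decoder step over a vocabulary of 6 token ids."""
    vocab_size = 6
    scores = [0.0] * vocab_size
    scores[(token + 1) % vocab_size] = 1.0
    return scores, state


decoded = greedy_decode(toy_step, start_id=1, end_id=5, max_len=10)
print(decoded)
```

Greedy decoding is the simplest choice. The paper uses a beam search decoder instead, which keeps several candidate translations alive at every step.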


In [None]:

# Test the Model
print(decode_sequence("Ich bin ein Student"))
print(decode_sequence("Hallo, wie geht es dir?"))
print(decode_sequence("Guten Morgen"))

# Interactive translation loop (type 'exit' to quit)
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        break
    response = decode_sequence(user_input)
    print("Bot:", response)