# Sequence to Sequence Learning with Keras (Beta)
Author: Hayson Cheung [hayson.cheung@mail.utoronto.ca]\
Adapted from: Ilya Sutskever, Oriol Vinyals, Quoc V. Le

In this notebook, we follow Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks" (NIPS 2014), and implement a simple sequence-to-sequence model using LSTMs in Keras. The model is trained on a dataset of German sentences paired with their English translations.

The model maps a sequence of German words to a sequence of English words. The goal is to translate German sentences into English.

## Initialization & Hyper Params

Import the Keras classes we need from TensorFlow.

In [None]:
# sample.ipynb
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding


Training this model yourself on a CPU would take an extremely long time, so first check that a GPU is available.

In [None]:
# Check that a GPU is available
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  1


Define the dimension of the latent space. It is a hyperparameter, typically chosen as a power of 2.
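
To get a feel for what this hyperparameter controls, the sketch below counts the trainable weights of a single LSTM layer as a function of its number of units. It uses the standard LSTM parameter formula (four gates, each with an input kernel, a recurrent kernel, and a bias). The function name `lstm_param_count` is made up for illustration, and the example assumes the input dimension equals the number of units, as it does when the embedding size is set to the latent dimension.

```python
def lstm_param_count(units: int, input_dim: int) -> int:
    """Trainable weights in one standard LSTM layer.

    Each of the 4 gates has an input kernel (input_dim x units),
    a recurrent kernel (units x units), and a bias vector (units).
    """
    return 4 * (units * (input_dim + units) + units)


# Assumption: input_dim == units (embedding size equal to the latent dimension)
for units in (64, 128, 256, 512):
    print(units, lstm_param_count(units, input_dim=units))
```

Doubling the latent dimension roughly quadruples the size of the LSTM, which is worth keeping in mind when training time is already a concern.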

In [None]:
# Parameters

# Latent dimension is the number of hidden units |h(t)| in the LSTM cell
LATENT_DIM = 256

## Load Data (make sure the path to the data file is correct)

Choose a dataset online in TMX format. Each entry should start with `<tu>`, followed by `<seq>English</seq>` and `<seq>Deutsch</seq>`, and end with `</tu>`.
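
As a rough illustration of that layout, here is a minimal sketch that reads sentence pairs with Python's standard `xml.etree.ElementTree`. It is not the implementation of `load_data`, which is not shown in this notebook. The helper `read_pairs` and the miniature `SAMPLE` string are hypothetical, and the tag names are parameters because TMX files commonly wrap each segment in `<tuv>` and `<seg>` rather than `<seq>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature file following the layout described above:
# one <tu> per sentence pair, with two <seq> children (English, then German).
SAMPLE = """<body>
<tu><seq>It can be a very complicated thing, the ocean.</seq>
<seq>Das Meer kann ziemlich kompliziert sein.</seq></tu>
</body>"""


def read_pairs(xml_text, unit_tag="tu", segment_tag="seq"):
    """Return a list of (english, german) tuples found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for unit in root.iter(unit_tag):
        segments = [(seg.text or "").strip() for seg in unit.iter(segment_tag)]
        if len(segments) == 2:  # skip units that do not hold exactly one pair
            pairs.append((segments[0], segments[1]))
    return pairs


pairs = read_pairs(SAMPLE)
print(pairs)
```

For a file that uses the usual TMX tags, calling `read_pairs(text, segment_tag="seg")` should work the same way, since `iter` also finds nested elements.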

In [None]:
import load_data

load_data.main("de-en.tmx")

from load_data import INPUT_VOCAB_SIZE, OUTPUT_VOCAB_SIZE, MAX_INPUT_LENGTH, MAX_OUTPUT_LENGTH, input_tokenizer, \
    output_tokenizer

print(f"Input vocab size: {INPUT_VOCAB_SIZE}")
print(f"Output vocab size: {OUTPUT_VOCAB_SIZE}")
print(f"Max input length: {MAX_INPUT_LENGTH}")
print(f"Max output length: {MAX_OUTPUT_LENGTH}")

print(output_tokenizer.word_index)

Reading data...


Processing lines: 100%|██████████| 94377/94377 [00:00<00:00, 252967.51it/s]


Data read successfully
Sample data:
['fish,health,mission blue,oceans,science', '899', 'Stephen Palumbi: Der Spur des Quecksilbers folgen', 'Das Meer kann ziemlich kompliziert sein.', 'Und was menschliche Gesundheit ist, kann auch ziemlich kompliziert sein.']
['fish,health,mission blue,oceans,science', '899', 'Stephen Palumbi: Following the mercury trail', 'It can be a very complicated thing, the ocean.', 'And it can be a very complicated thing, what human health is.']
Tokenizing Input
Tokenizing Output
Input vocab size: 33096
Output vocab size: 19409
Max input length: 424
Max output length: 412


## ENCODER and DECODER

The model uses two LSTMs. The encoder LSTM reads the input sequence and returns its final states. The decoder LSTM takes the decoder input sequence, uses the encoder states as its initial state, and returns the output sequence. The encoder and decoder are defined separately and then combined to form the final model.

In [None]:
# Define Encoder
encoder_input = Input(shape=(MAX_INPUT_LENGTH,))

encoder_embedding = Embedding(INPUT_VOCAB_SIZE, LATENT_DIM)(encoder_input)
encoder_lstm = LSTM(LATENT_DIM, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Define Decoder
decoder_input = Input(shape=(MAX_OUTPUT_LENGTH,))
decoder_embedding = Embedding(OUTPUT_VOCAB_SIZE, LATENT_DIM)(decoder_input)
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)

# Dense layer with softmax activation: for every time step it outputs a
# probability distribution over the output vocabulary
decoder_dense = Dense(OUTPUT_VOCAB_SIZE, activation='softmax')
decoder_output = decoder_dense(decoder_outputs)
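
To make the softmax step concrete, here is a small pure-Python sketch of what that activation computes for a single time step. The three scores are made up and stand for a three-word vocabulary; Keras applies the same operation over all `OUTPUT_VOCAB_SIZE` entries.

```python
import math


def softmax(logits):
    """Turn raw scores into positive values that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for a 3-word vocabulary
best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely word
print(probs, best)
```

The largest score always receives the largest probability, which is why taking the argmax of the softmax output picks the same word as taking the argmax of the raw scores.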

In [None]:
# Define the model
model = Model([encoder_input, decoder_input], decoder_output)

# Compile the model
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None


## Training the Model
This is where we train the model. We use the encoder input and decoder input to predict the decoder output. The model is trained on the dataset of English sentences and their corresponding German sentences.

This takes a while to run. We can save the model and load it later.

### Explanation of the data set
- `encoder_input_train`: training data for the encoder (German sentences).
- `decoder_input_train`: training data for the decoder (English sentences with `<start>` token).
- `decoder_target_train`: target data for the decoder (English sentences).
- `encoder_input_val`: validation data for the encoder (German sentences).
- `decoder_input_val`: validation data for the decoder (English sentences with `<start>` token).
- `decoder_target_val`: target data for the decoder (English sentences).
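
The relationship between the decoder input and the decoder target can be sketched as follows. This is an assumption about what `load_data` does, since its code is not shown here: the usual setup (teacher forcing) feeds the decoder the target sentence shifted by one position. The helper `make_decoder_pair` is hypothetical, and the marker words `sos` and `eos` are the ones the inference code in this notebook looks up in the tokenizer.

```python
def make_decoder_pair(target_tokens, start="sos", end="eos"):
    """Build the decoder input and decoder target for one sentence.

    The input starts with the start marker; the target is the same
    sentence shifted one step to the left and closed by the end marker.
    """
    decoder_input = [start] + target_tokens
    decoder_target = target_tokens + [end]
    return decoder_input, decoder_target


dec_in, dec_tgt = make_decoder_pair(["i", "am", "sad"])
print(dec_in)
print(dec_tgt)
```

At every position the decoder sees the correct previous word and is asked to predict the next one, so `dec_tgt[t]` is the label for the step whose input is `dec_in[t]`.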



In [None]:
# Data Set Preparation
from load_data import encoder_input_train, decoder_input_train, decoder_target_train, encoder_input_val, decoder_input_val, decoder_target_val
with tf.device('/GPU:0'):
  history = model.fit(
      [encoder_input_train, decoder_input_train],  # Inputs for encoder and decoder
      decoder_target_train,  # Target data for decoder
      batch_size=32,  # Adjust as needed
      epochs=30,  # Adjust as needed
      validation_data=([encoder_input_val, decoder_input_val], decoder_target_val),
      verbose=1
  )

import matplotlib.pyplot as plt
# Plot the training loss
plt.plot(history.history['loss'], label='Training Loss')
plt.title('Training Loss Over Epochs')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Epoch 1/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 216s 367ms/step - accuracy: 0.9577 - loss: 0.3229 - val_accuracy: 0.9601 - val_loss: 0.2801
Epoch 2/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 279s 396ms/step - accuracy: 0.9609 - loss: 0.2716 - val_accuracy: 0.9615 - val_loss: 0.2665
Epoch 3/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 245s 368ms/step - accuracy: 0.9620 - loss: 0.2593 - val_accuracy: 0.9624 - val_loss: 0.2575
Epoch 4/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 263s 370ms/step - accuracy: 0.9629 - loss: 0.2474 - val_accuracy: 0.9627 - val_loss: 0.2520
Epoch 5/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 261s 369ms/step - accuracy: 0.9636 - loss: 0.2381 - val_accuracy: 0.9630 - val_loss: 0.2483
Epoch 6/30
590/590 ━━━━━━━━━━━━━━━━━━━━ 263s 370ms/step - accuracy: 0.9640 - loss: 0.2302 - val_accuracy: 0.9633 - val_loss: 0.2457
Epoc

As you can see, training the model takes a long time (a little more than 2 hours for 30 epochs).
And this is on a reduced dataset.

With the full TED dataset, a single epoch takes a little more than an hour.

Also, the model trained above doesn't work. Can you see why?

<details>
### Overfitting

From GPT

Overfitting in a seq2seq model using LSTMs can occur due to a number of factors. Here are the most likely ones:

Insufficient Training Data: If the dataset is too small or doesn't adequately represent the variety of real-world data the model will encounter, the model can memorize the training data instead of learning generalizable patterns.

Model Complexity: LSTM networks have a large number of parameters. If the architecture is too complex (too many layers or units), the model may overfit, especially with limited data.

Lack of Regularization: If regularization techniques like dropout or L2 regularization (weight decay) are not applied, the model may overfit by relying too heavily on specific features of the training data.

Training for Too Many Epochs: Training for too long without early stopping or monitoring the validation loss can lead to the model memorizing the training data.

Noisy Data: If the training data contains a lot of noise (irrelevant or inconsistent information), the model may end up fitting that noise rather than learning the underlying patterns.

Batch Size: A very small batch size leads to noisy gradient updates and unstable training, while a very large batch size can lead to poorer generalization.

Lack of Data Augmentation: For certain types of data (such as text), data augmentation techniques (like paraphrasing) can help increase the diversity of the training set and reduce overfitting.

Do YOU have any fix to that?
</details>
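
One of the remedies mentioned above, early stopping, is simple enough to sketch in plain Python. The function below is a simplified illustration of the patience rule, not Keras code; in Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback, which can be passed to `model.fit` through its `callbacks` argument. The name `early_stopping_epoch` and the `hypothetical` loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` epochs in a row.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None


# val_loss values taken from the first six epochs of the training log
logged = [0.2801, 0.2665, 0.2575, 0.2520, 0.2483, 0.2457]
print(early_stopping_epoch(logged))

# Made-up curve where the validation loss starts rising after epoch 3
hypothetical = [0.30, 0.27, 0.26, 0.265, 0.27, 0.28]
print(early_stopping_epoch(hypothetical))
```

On the logged values the rule never triggers, because the validation loss is still decreasing in every epoch that was recorded.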

We save the model:

In [None]:
model.save("/content/seq2seq_model.h5")


## Model Inference

Below is the code to load the model for inference and to translate text on the user's end.

In [None]:
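# Load the model
import tensorflow as tf
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from load_data import INPUT_VOCAB_SIZE, OUTPUT_VOCAB_SIZE, MAX_INPUT_LENGTH, MAX_OUTPUT_LENGTH, input_tokenizer, output_tokenizer
model = load_model("/content/seq2seq_model.h5")

# Recover the trained layers from the loaded model, so inference also works in a fresh session.
# Assumption: the encoder embedding comes before the decoder embedding in model.layers
# (check model.summary() if in doubt).
encoder_embedding_layer, decoder_embedding_layer = [l for l in model.layers if isinstance(l, Embedding)]
encoder_lstm = next(l for l in model.layers if isinstance(l, LSTM) and not l.return_sequences)
decoder_lstm = next(l for l in model.layers if isinstance(l, LSTM) and l.return_sequences)
decoder_dense = next(l for l in model.layers if isinstance(l, Dense))
LATENT_DIM = decoder_lstm.units

# Encoder model: padded input sequence -> final LSTM states [h, c]
encoder_input = Input(shape=(MAX_INPUT_LENGTH,))
_, state_h, state_c = encoder_lstm(encoder_embedding_layer(encoder_input))
encoder_model = Model(encoder_input, [state_h, state_c])

# Decoder model: one token plus the previous states -> next-token probabilities and new states
decoder_input = Input(shape=(1,))
decoder_state_input_h = Input(shape=(LATENT_DIM,))
decoder_state_input_c = Input(shape=(LATENT_DIM,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

# Reuse the trained decoder embedding; a new Embedding layer would start from random weights
decoder_embedded = decoder_embedding_layer(decoder_input)

decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedded, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
    [decoder_input] + decoder_states_inputs,  # input: [decoder_input, h, c]
    [decoder_outputs] + decoder_states  # output: [output, h, c]
)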

# map indexes back into real words
idx2word_input = {v:k for k, v in input_tokenizer.word_index.items()}
idx2word_target = {v:k for k, v in output_tokenizer.word_index.items()}
import numpy as np

def decode_sequence(input_seq):
    # Step 1: Get encoder states
    states_value = encoder_model.predict(input_seq)

    # Step 2: Generate empty target sequence of length 1
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = output_tokenizer.word_index['sos']

    # Step 3: Loop to generate the translated sequence
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_word = idx2word_target.get(sampled_token_index, '<UNK>')

        # Append the sampled word to the decoded sentence
        decoded_sentence += ' ' + sampled_word

        # Exit condition: either hit max length or find stop token
        if (sampled_word == 'eos' or len(decoded_sentence.split()) > MAX_OUTPUT_LENGTH):
            stop_condition = True

        # Update the target sequence (of length 1)
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence.strip()

def translate(input_text):
    # Tokenize the input sequence
    input_seq = input_tokenizer.texts_to_sequences([input_text])
    input_seq = tf.keras.preprocessing.sequence.pad_sequences(input_seq, maxlen=MAX_INPUT_LENGTH)

    # Get the translated sentence
    translated_sentence = decode_sequence(input_seq)
    return translated_sentence
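
The decoding loop in `decode_sequence` is a greedy search: at each step it keeps only the most likely next word. The sketch below isolates that idea in pure Python so it can be run without a trained model. The function `greedy_decode`, the `toy_step` stand-in for `decoder_model.predict`, and the probabilities in `TABLE` are all hypothetical. Unlike `decode_sequence`, this version does not include the end marker in its result.

```python
def greedy_decode(step, start_token, end_token, max_len):
    """Pick the most likely next token until end_token or max_len is reached."""
    token, state, output = start_token, None, []
    while len(output) < max_len:
        probs, state = step(token, state)
        token = max(probs, key=probs.get)  # argmax over the distribution
        if token == end_token:
            break
        output.append(token)
    return output


# Made-up next-word probabilities, keyed by the previous word.
TABLE = {
    "sos": {"i": 0.9, "am": 0.05, "eos": 0.05},
    "i": {"am": 0.8, "i": 0.1, "eos": 0.1},
    "am": {"sad": 0.7, "am": 0.1, "eos": 0.2},
    "sad": {"eos": 0.9, "sad": 0.1},
}


def toy_step(token, state):
    """Stand-in for one decoder step: return (distribution, new state)."""
    return TABLE[token], state


print(greedy_decode(toy_step, "sos", "eos", max_len=10))
```

The `max_len` bound matters in practice: a poorly trained decoder may never emit the end marker, and without the bound the loop would not terminate.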


In [None]:
# Test the model
print(translate("Ich bin ein Student."))  # I am a student.
print(translate("Ich bin traurig."))  # I am sad.
print(translate("Ich bin müde."))  # I am tired.


