# Overview

In this notebook, we implement a basic character-level recurrent sequence-to-sequence model. We apply it to translating short English sentences into short French sentences, character-by-character. Normally, word-level models are more common in machine translation domain.

We will start with input sentences from a English sentences and corresponding target sequences from French sentences. And we will use LSTM as an encoder turns input sequences to 2 state vectors(the last LSTM state and discard the outputs). A LSTM decoder is trained to turn the target sequences into the same sequence but offset by one timestep in the future, a training process called "teacher forcing" in this context. It uses as initial state the state vectors from the encoder. 

In inference mode, when we want to decode unknown input sequences, we encode the input sequence into state vectors. -Start with a target sequence of size 1(just the start-of-sequence character). - Feed the state vectors and 1-char target sequence to the decoder to produce predictions for the next character - Sample the next character using these predictions(here we use argmax). And append the sampled character to the target sequence- Repeat until we generate the end-of-sequence character or we hit the character limit.


# Loading the Dataset

In [1]:
%%capture
!wget http://www.manythings.org/anki/fra-eng.zip

In [2]:
!unzip fra-eng.zip

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 


# Preparing the Data

In [3]:
import os

num_samples=10000 # number of samples to train on

# vectorize the data
input_texts=[]
target_texts=[]
input_characters=set()
target_characters=set()

data_path=os.path.join('', "fra.txt")

with open(data_path, "r", encoding="utf-8") as f:
    lines=f.read().split("\n")
    

for line in lines[:min(num_samples, len(lines)-1)]:
    input_text, target_text,_=line.split("\t")
    # we use "tab" as the "start sequence" character for the targets, and "\n" as "end sequence" character
    target_text="\t"+target_text+"\n"
    input_texts.append(input_text)
    target_texts.append(target_text)
    
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)
    for char in target_text:
        if char not in target_characters:
            target_characters.add(char)

input_characters=sorted(list(input_characters))
target_characters=sorted(list(target_characters))
num_encoder_tokens=len(input_characters)
num_decoder_tokens=len(target_characters)
max_encoder_seq_length=max([len(txt) for txt in input_texts])
max_decoder_seq_length=max([len(txt) for txt in target_texts])

print("Number of samples:", len(input_texts))
print("Number of unique input tokens:", num_encoder_tokens)
print("Number of unique output tokens:", num_decoder_tokens)
print("Max sequence length for inputs:", max_encoder_seq_length)
print("Max sequence length for outputs:", max_decoder_seq_length)

Number of samples: 10000
Number of unique input tokens: 70
Number of unique output tokens: 93
Max sequence length for inputs: 14
Max sequence length for outputs: 59


In [4]:
import numpy as np

input_token_index=dict([(char, i) for i, char in enumerate(input_characters)])
target_token_index=dict([(char, i) for i, char in enumerate(target_characters)])

encoder_input_data=np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens), dtype="float32"
)

decoder_input_data=np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

decoder_target_data=np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens), dtype="float32"
)

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
    for t, char in enumerate(input_text):
        encoder_input_data[i, t, input_token_index[char]]=1.0
    encoder_input_data[i, t+1:, input_token_index[" "]]=1.0
    for t, char in enumerate(target_text):
        decoder_input_data[t, t,target_token_index[char]]=1.0
        if t>0:
            decoder_target_data[i, t-1, target_token_index[char]]=1.0
    decoder_input_data[i, t+1:, target_token_index[" "]]=1.0
    decoder_target_data[i, t:, target_token_index[" "]]=1.0

# Build the Model

In [5]:
import keras

latent_dim=256 # Latnet dimensionality of the encoding space

# Define an input sequence and process it
encoder_inputs=keras.Input(shape=(None, num_encoder_tokens))
encoder=keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c=encoder(encoder_inputs)

# we discard encoder_outputs and only keep the states
encoder_states=[state_h, state_c]

# set up the decoder, using encoder_states as initial state
decoder_inputs=keras.Input(shape=(None, num_decoder_tokens))

# we set up our decoder to return full output sequences, and to return internal states as well. We do not use the return
# states in the trianing model, but we will use them in inference.
decoder_lstm=keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _=decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense=keras.layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs=decoder_dense(decoder_outputs)

# define the model that will turn `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model=keras.Model(
    [encoder_inputs, decoder_inputs],
    decoder_outputs
)

model

2024-03-14 06:08:22.211606: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-14 06:08:22.211708: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-14 06:08:22.329168: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


<Functional name=functional_1, built=True>

# Train the Model

In [6]:
import tensorflow as tf

tf.config.experimental.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [7]:
tf.debugging.set_log_device_placement(True)

In [8]:
epochs=50 # number of epochs to train for
batch_size=64 # batch size of training

model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])

model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=batch_size,
    epochs=epochs,
    validation_split=0.2
)

# save model
model.save("s2s_model.keras")

Epoch 1/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 16ms/step - accuracy: 0.7055 - loss: 1.5872 - val_accuracy: 0.7129 - val_loss: 1.1143
Epoch 2/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.7442 - loss: 0.9866 - val_accuracy: 0.7129 - val_loss: 1.0468
Epoch 3/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.7442 - loss: 0.9334 - val_accuracy: 0.7193 - val_loss: 1.0117
Epoch 4/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.7484 - loss: 0.9007 - val_accuracy: 0.7165 - val_loss: 0.9951
Epoch 5/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.7486 - loss: 0.8928 - val_accuracy: 0.7222 - val_loss: 0.9844
Epoch 6/50
[1m125/125[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - accuracy: 0.7504 - loss: 0.8847 - val_accuracy: 0.7243 - val_loss: 0.9689
Epoch 7/50
[1m125/125

# Inference

* Encoding input and retrieve intial decoder state
* Runing one step of decoder with this initial state and a "start of sequence" token as target. Output wil be the bext target token
* Repeat with the current target token and curent states

In [9]:
model=keras.models.load_model("s2s_model.keras")

encoder_inputs=model.input[0] # input_1
encoder_outputs, state_h_enc, state_c_enc=model.layers[2].output # lstm_1
encoder_states=[state_h_enc, state_c_enc]
encoder_model=keras.Model(encoder_inputs, encoder_states)

decoder_inputs=model.input[1] # input_2
decoder_state_input_h=keras.Input(shape=(latent_dim,))
decoder_state_input_c=keras.Input(shape=(latent_dim,))
decoder_states_inputs=[decoder_state_input_h, decoder_state_input_c]
decoder_lstm=model.layers[3]
decoder_outputs, state_h_dec, state_c_dec=decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)

decoder_states=[state_h_dec, state_c_dec]
decoder_dense=model.layers[4]
decoder_outputs=decoder_dense(decoder_outputs)
decoder_model=keras.Model(
    [decoder_inputs]+decoder_states_inputs, [decoder_outputs]+decoder_states
)

reverse_input_char_index=dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index=dict((i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # encode the input as state vectors
    states_value=encoder_model.predict(input_seq, verbose=0)
    
    # generate empty target sequence of length 1
    target_seq=np.zeros((1,1, num_decoder_tokens))
    # populate the first character of target sequence with the start character
    target_seq[0,0, target_token_index["\t"]]=1.0
    
    # sampling loop for a batch of sequences (to simplify, here we assume a batch of size 1)
    stop_condition=False
    decoded_sentence=""
    while not stop_condition:
        output_tokens, h,c=decoder_model.predict(
            [target_seq]+states_value, verbose=0
        )
        
        # sample a token
        sampled_token_index=np.argmax(output_tokens[0,-1,:])
        sampled_char=reverse_target_char_index[sampled_token_index]
        decoded_sentence+=sampled_char
        
        
        # exit condition: either hit max length of find stop character
        if sampled_char=="\n" or len(decoded_sentence) >max_decoder_seq_length:
            stop_condition=True
        
        # update the target sequence (of length 1)
        target_seq=np.zeros((1,1, num_decoder_tokens))
        target_seq[0,0, sampled_token_index]=1.0
        
        # update states
        states_value=[h,c]
    return decoded_sentence

In [10]:
for seq_index in range(5):
    # take one sequence (part of the trianing set)
    # for trying out decoding
    input_seq=encoder_input_data[seq_index:seq_index+1]
    decoded_sentence=decode_sequence(input_seq)
    print("-")
    print("Input sentence:", input_texts[seq_index])
    print("Decoded sentence:", decoded_sentence)

-
Input sentence: Go.
Decoded sentence: Fouez                                                       
-
Input sentence: Go.
Decoded sentence: Fouez                                                       
-
Input sentence: Go.
Decoded sentence: Fouez                                                       
-
Input sentence: Go.
Decoded sentence: Fouez                                                       
-
Input sentence: Hi.
Decoded sentence: Fattee                                                      


# Acknowledge
* https://github.com/keras-team/keras-io/blob/master/examples/nlp/lstm_seq2seq.py
* https://keras.io/examples/nlp/lstm_seq2seq/
* https://huggingface.co/blog/ray-rag