# <b> PROJECT: Character-level Language Translation using an LSTM-based Seq2Seq Model

OBJECTIVE : 
           
We will implement a character-level sequence-to-sequence model that reads the input one character at a time and generates the output one character at a time (English -> French).
             
> Example : "the cat sat on the mat" -> [Seq2Seq model] -> "le chat etait assis sur le tapis"

___________________
</b>
Here's a summary of our process:

1) Turn the sentences into three NumPy arrays, encoder_input_data, decoder_input_data, and decoder_target_data:

    * encoder_input_data is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters) containing a one-hot vectorization of the English sentences.
    * decoder_input_data is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters) containing a one-hot vectorization of the French sentences.
    * decoder_target_data is the same as decoder_input_data but offset by one timestep: decoder_target_data[:, t, :] is the same as decoder_input_data[:, t + 1, :].

    

2) Train a basic LSTM-based Seq2Seq model to predict decoder_target_data given encoder_input_data and decoder_input_data. Our model uses teacher forcing: during training the decoder is fed the true previous character rather than its own prediction.

3) Decode some sentences to check that the model is working (i.e. turn samples from encoder_input_data into corresponding samples from decoder_target_data).
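
The one-timestep offset between decoder_input_data and decoder_target_data described in step 1 can be checked on a toy example. The following is a minimal sketch, assuming a made-up five-symbol vocabulary and a hypothetical helper `one_hot_pair`; it is not the notebook's preprocessing code.

```python
import numpy as np

def one_hot_pair(target, vocab, max_len):
    """Return (decoder_input, decoder_target) one-hot arrays for one sentence."""
    index = {ch: i for i, ch in enumerate(vocab)}
    dec_in = np.zeros((max_len, len(vocab)), dtype="float32")
    dec_out = np.zeros((max_len, len(vocab)), dtype="float32")
    for t, ch in enumerate(target):
        dec_in[t, index[ch]] = 1.0
        if t > 0:
            # The target at step t - 1 is the input character at step t.
            dec_out[t - 1, index[ch]] = 1.0
    return dec_in, dec_out

vocab = ["\t", "\n", "a", "l", "v"]   # hypothetical toy vocabulary
sentence = "\tva\n"                    # start marker, text, end marker
dec_in, dec_out = one_hot_pair(sentence, vocab, max_len=len(sentence))

# For every step that has a successor, target[t] equals input[t + 1].
offset_holds = np.array_equal(dec_out[:-1], dec_in[1:])
print(offset_holds)
```

The last target row stays empty in this sketch because the sentence has no character after the end marker.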

________________________________________________________________

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
# Importing Libraries

import pandas as pd
import numpy as np

import tensorflow as tf 
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input , LSTM , Dense

In [3]:
## Defining some Model Training Parameters:

batch_size= 64              # Batch size for the training 
epochs = 100                # Number of epochs to train for

latent_dim = 256            # latent dimensionality for ENCODING SPACE
num_samples = 10000         # Number of Samples to train on


In [4]:
# Getting the data:
data = pd.read_csv('/content/drive/MyDrive/UNIV.AI/NLP Intro /Datasets/fra.txt', sep = '\t', header= None)
data

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Go. | Va ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 1 | Go. | Marche. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 2 | Go. | En route ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Go. | Bouge ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 4 | Hi. | Salut ! | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
| ... | ... | ... | ... |
| 208901 | A carbon footprint is the amount of carbon dio... | Une empreinte carbone est la somme de pollutio... | CC-BY 2.0 (France) Attribution: tatoeba.org #1... |
| 208902 | Death is something that we're often discourage... | La mort est une chose qu'on nous décourage sou... | CC-BY 2.0 (France) Attribution: tatoeba.org #1... |
| 208903 | Since there are usually multiple websites on a... | Puisqu'il y a de multiples sites web sur chaqu... | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |
| 208904 | If someone who doesn't know your background sa... | Si quelqu'un qui ne connaît pas vos antécédent... | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |


In [5]:
data = data.rename(columns= {0:'English', 1: 'French'}).drop(2, axis= 1)
data.head()

| | English | French |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Go. | Marche. |
| 2 | Go. | En route ! |
| 3 | Go. | Bouge ! |
| 4 | Hi. | Salut ! |


In [6]:
## Prepare the data for modeling

# Collect the input and target texts.
input_text = data['English'][:num_samples]              # English sentences are the input text
target_text = '\t'+ data['French'][:num_samples]+ '\n'    # '\t' marks the start and '\n' the end of each target sequence

In [7]:
input_text[:10]

0     Go.
1     Go.
2     Go.
3     Go.
4     Hi.
5     Hi.
6    Run!
7    Run!
8    Run!
9    Run!
Name: English, dtype: object

In [8]:
target_text[:10]

0                              \tVa !\n
1                           \tMarche.\n
2                        \tEn route !\n
3                           \tBouge !\n
4                           \tSalut !\n
5                            \tSalut.\n
6                           \tCours !\n
7                          \tCourez !\n
8    \tPrenez vos jambes à vos cous !\n
9                            \tFile !\n
Name: French, dtype: object
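
The start and end markers only work if they never occur inside a sentence. The following is a small sketch of that check on plain Python strings, assuming three sample sentences copied from the output above; the names `START` and `END` are illustrative and not part of the notebook.

```python
sentences = ["Va !", "Marche.", "En route !"]   # sample French lines

START, END = "\t", "\n"   # marker choice mirrors the notebook

# The markers are only unambiguous if no sentence already contains them.
markers_are_free = all(START not in s and END not in s for s in sentences)

wrapped = [START + s + END for s in sentences]
# Stripping one character from each end recovers the original text.
recovered = [w[1:-1] for w in wrapped]
print(markers_are_free, recovered == sentences)
```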

In [9]:
# Collect the unique characters used in the input and target texts:
input_char = set()
target_char = set()

for i in range(num_samples):
    for char in (input_text[i]):
        input_char.add(str(char))

    for char in (target_text[i]):
        target_char.add(str(char))

In [10]:
print(f'''
Length of Input Characters (ENGLISH)  : {len(input_char)}
Length of target Characters (FRENCH)  : {len(target_char)}
''')


Length of Input Characters (ENGLISH)  : 71
Length of target Characters (FRENCH)  : 93
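
The same character inventory can be built without an explicit double loop. This is a minimal sketch on a toy list, assuming a hypothetical helper `char_vocab`; it relies on the fact that `set.union` accepts any iterables, including strings.

```python
def char_vocab(sentences):
    """Sorted list of the distinct characters across all sentences."""
    return sorted(set().union(*sentences))

english = ["Go.", "Hi.", "Run!"]   # toy sample, not the full dataset
vocab = char_vocab(english)
print(vocab)   # ['!', '.', 'G', 'H', 'R', 'i', 'n', 'o', 'u']
```

Sorting makes the character order, and therefore the token indices derived from it, reproducible across runs.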



In [11]:
# Declare a few parameters that we will need later:

input_char = sorted(list(input_char))
target_char= sorted(list(target_char))

num_encoder_tokens = len(input_char)
num_decoder_tokens = len(target_char)

max_encoder_seq_len = max([len(texts) for texts in input_text])
max_decoder_seq_len = max([len(texts) for texts in target_text])


print (f'''
PARAMETERS : 
* Number of Samples                 : {len(input_text)}
* Number of unique input tokens     : {num_encoder_tokens} 
* Number of unique target tokens    : {num_decoder_tokens} 
* Max Sequence length of Input      : {max_encoder_seq_len}
* Max Sequence length of target     : {max_decoder_seq_len}


''')


PARAMETERS : 
* Number of Samples                 : 10000
* Number of unique input tokens     : 71 
* Number of unique target tokens    : 93 
* Max Sequence length of Input      : 15
* Max Sequence length of target     : 59





In [12]:
# Assign an integer index to every character in the input and target texts:

input_token_index = dict([(char, i) for i , char in enumerate (input_char)])

target_token_index = dict([(char, i) for i , char in enumerate (target_char)])
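
The mapping from characters to indices, and back, can be illustrated on a tiny vocabulary. This is a sketch under the assumption of a five-character toy alphabet; `token_index` and `reverse_index` are illustrative names.

```python
chars = sorted(set("Go.Hi."))          # hypothetical mini vocabulary
token_index = {ch: i for i, ch in enumerate(chars)}
reverse_index = {i: ch for ch, i in token_index.items()}

encoded = [token_index[ch] for ch in "Hi."]
decoded = "".join(reverse_index[i] for i in encoded)
print(encoded, decoded)   # [2, 3, 0] Hi.
```

A character that never appeared when the dictionary was built has no entry, so looking it up raises a `KeyError`.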

In [13]:
## One-hot representation using NumPy:

# Allocate zero-filled arrays with the required shapes:

encoder_input_data = np.zeros((len(input_text), max_encoder_seq_len, num_encoder_tokens), dtype= 'float32')
decoder_input_data = np.zeros((len(input_text), max_decoder_seq_len, num_decoder_tokens), dtype= 'float32')
decoder_target_data = np.zeros((len(input_text), max_decoder_seq_len, num_decoder_tokens), dtype= 'float32')

display(encoder_input_data.shape, decoder_input_data.shape, decoder_target_data.shape)

(10000, 15, 71)

(10000, 59, 93)

(10000, 59, 93)
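
Dense one-hot arrays grow quickly with vocabulary size and sequence length. The following sketch works out the memory footprint of the three shapes printed above at 4 bytes per float32 value; the dictionary layout is just for illustration.

```python
import numpy as np

shapes = {
    "encoder_input_data": (10000, 15, 71),
    "decoder_input_data": (10000, 59, 93),
    "decoder_target_data": (10000, 59, 93),
}
bytes_per_value = np.dtype("float32").itemsize   # 4 bytes

sizes_mb = {
    name: int(np.prod(shape)) * bytes_per_value / 1e6
    for name, shape in shapes.items()
}
total_mb = sum(sizes_mb.values())
for name, mb in sizes_mb.items():
    print(f"{name}: {mb:.2f} MB")
print(f"total: {total_mb:.2f} MB")
```

This comes to about 42.6 MB for the encoder array and about 219.5 MB for each decoder array, which is one reason the notebook limits itself to the first 10000 pairs.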

In [14]:
# One-hot encode every sentence pair:

for i, (input_text_, target_text_) in enumerate(zip(input_text, target_text)):
    for t, char in enumerate(input_text_):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    # Pad the remaining encoder timesteps with the space character.
    encoder_input_data[i, t + 1:, input_token_index[' ']] = 1.
    for t, char in enumerate(target_text_):
        # decoder_input_data holds the full target sequence, start character included.
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            # decoder_target_data is ahead by one timestep
            # and does not include the start character.
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
    # Pad the remaining decoder timesteps with the space character.
    decoder_input_data[i, t + 1:, target_token_index[' ']] = 1.
    decoder_target_data[i, t:, target_token_index[" "]] = 1.


In [15]:
encoder_input_data[0][0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0.], dtype=float32)
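
In the row printed above, the single 1 sits at index 26, which by construction is the position of the first character of the first sentence ("Go.") in the sorted input_char list. Reading a one-hot row back into a character is an argmax followed by a lookup. The sketch below shows this on a hypothetical four-character vocabulary with a hypothetical helper `decode_one_hot`.

```python
import numpy as np

vocab = [" ", ".", "G", "o"]   # hypothetical sorted vocabulary

def decode_one_hot(matrix, vocab):
    """Map each one-hot row of a (timesteps, vocab_size) array back to its character."""
    return "".join(vocab[i] for i in matrix.argmax(axis=-1))

matrix = np.zeros((5, len(vocab)), dtype="float32")
for t, ch in enumerate("Go."):
    matrix[t, vocab.index(ch)] = 1.0
matrix[3:, vocab.index(" ")] = 1.0   # pad with spaces, as the notebook does

text = decode_one_hot(matrix, vocab)
print(repr(text))   # 'Go.  '
```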

MODELING: LSTM (Seq2Seq)

In [16]:
# Define the encoder input layer and run it through the LSTM.

encoder_inputs  = Input(shape= (None, num_encoder_tokens))

encoder = LSTM (latent_dim, return_state= True)

encoder_outputs , state_h, state_c = encoder(encoder_inputs)

# We don't need encoder_outputs; only the final encoder states are passed on to the decoder:

encoder_states = [state_h, state_c]
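
What the encoder hands to the decoder is the pair of final hidden and cell states. The following is a plain NumPy sketch of the standard LSTM recurrence, meant only to show where those two vectors come from. It is not Keras's implementation; the weight shapes, gate layout (input, forget, candidate, output), toy sizes, and the helper name `lstm_final_state` are all assumptions of the sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_state(inputs, W, U, b):
    """Run an LSTM over inputs of shape (timesteps, features); return final (h, c).

    W: (features, 4 * units), U: (units, 4 * units), b: (4 * units,)
    """
    units = U.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in inputs:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)          # four gate pre-activations
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
features, units = 5, 3                       # toy sizes, not 71 and 256
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)
one_hot_inputs = np.eye(features)[[0, 2, 1, 4]]   # four one-hot timesteps

h, c = lstm_final_state(one_hot_inputs, W, U, b)
print(h.shape, c.shape)   # (3,) (3,)
```

Whatever the input length, the result is two fixed-size vectors, which is why the encoder can summarize sentences of different lengths for the decoder.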

In [17]:
# setting up the decoder, using the encoder states as initial state:

decoder_inputs = Input(shape = (None , num_decoder_tokens))

decoder_lstm  = LSTM (latent_dim, return_sequences= True , return_state= True)

# We don't need the decoder states during training; we only want the output sequence:

decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state =encoder_states)

decoder_dense = Dense(num_decoder_tokens, activation= 'softmax')
decoder_outputs = decoder_dense (decoder_outputs)

In [18]:
# COMPILING THE MODEL (encoder_input_data + decoder_input_data --> decoder_target_data)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer = 'rmsprop',
              loss = 'categorical_crossentropy',
              metrics = ['accuracy'])


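# RUN THE MODEL:
# Note: the epoch count is hard-coded to 170 here; the `epochs = 100` variable defined earlier is not used.
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size = batch_size,
          epochs = 170,
          validation_split = 0.25)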

Epoch 1/170
Epoch 2/170
...
Epoch 78

<keras.callbacks.History at 0x7f661810e850>
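
The loss named in the compile step compares each predicted character distribution with the one-hot target. The following NumPy sketch computes the same quantity by hand for two timesteps over a three-symbol vocabulary; the helper and its `eps` clipping value are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy over timesteps; y_true is one-hot, y_pred sums to 1 per row."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=-1)))

y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
y_pred = np.array([[0.5, 0.25, 0.25],
                   [0.25, 0.5, 0.25]])

loss = categorical_crossentropy(y_true, y_pred)   # equals -log(0.5)
print(round(loss, 4))   # 0.6931
```

Because padding is encoded as a one-hot space rather than masked, the padded timesteps also count toward the loss and accuracy reported during training.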

In [19]:
# Inference (sampling):
# Steps:
# 1) Encode the input and retrieve the initial decoder state.
# 2) Run one decoder step with that initial state and the 'start of sequence' token as input.
# 3) The output is the next target token.
# 4) Repeat with the current target token and the current states.
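
The four steps above amount to a greedy decoding loop. The sketch below shows the control flow in isolation, with a hypothetical `toy_step` function standing in for the trained decoder; neither function is part of the notebook.

```python
def greedy_decode(step_fn, initial_state, start_token, end_token, max_len):
    """Feed the last predicted token back in until end_token or max_len is reached."""
    token, state, output = start_token, initial_state, []
    while len(output) < max_len:
        token, state = step_fn(token, state)
        if token == end_token:
            break
        output.append(token)
    return "".join(output)

def toy_step(token, state):
    """Hypothetical stand-in for the decoder: emits the next character of a fixed text."""
    text, position = state
    next_token = text[position] if position < len(text) else "\n"
    return next_token, (text, position + 1)

result = greedy_decode(toy_step, ("Salut !", 0), "\t", "\n", max_len=59)
truncated = greedy_decode(toy_step, ("Salut !", 0), "\t", "\n", max_len=3)
print(result, truncated)   # Salut ! Sal
```

The `max_len` guard matters because an undertrained model may never emit the end token.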

In [20]:
# DEFINING THE SAMPLING MODEL:

encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape = (latent_dim, ))
decoder_state_input_c = Input(shape = (latent_dim, ))
decoder_state_inputs = [decoder_state_input_h,decoder_state_input_c ]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state= decoder_state_inputs)

decoder_states  =  [state_h, state_c] 
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model([decoder_inputs]+ decoder_state_inputs, [decoder_outputs] + decoder_states)

In [21]:
# Reverse-lookup token indices to decode sequences back into readable text:
reverse_input_char_index = dict((i,char) for char, i in input_token_index.items())

reverse_target_char_index  = dict((i, char) for char, i in target_token_index.items())


In [26]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq, verbose = 0)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value, verbose = 0)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_len):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence
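
decode_sequence expects a one-hot array of shape (1, timesteps, num_encoder_tokens). To try a sentence that is not already in encoder_input_data, it first has to be encoded the same way. The following is a sketch of such a helper; `encode_sentence` and the toy vocabulary are hypothetical, and the space padding mirrors the notebook's preprocessing.

```python
import numpy as np

def encode_sentence(sentence, token_index, max_len):
    """One-hot encode a sentence into shape (1, max_len, num_tokens), padded with spaces."""
    if len(sentence) > max_len:
        raise ValueError("sentence is longer than max_len")
    unknown = set(sentence) - set(token_index)
    if unknown:
        raise ValueError(f"characters not in vocabulary: {sorted(unknown)}")
    encoded = np.zeros((1, max_len, len(token_index)), dtype="float32")
    for t, ch in enumerate(sentence):
        encoded[0, t, token_index[ch]] = 1.0
    encoded[0, len(sentence):, token_index[" "]] = 1.0
    return encoded

toy_index = {ch: i for i, ch in enumerate(sorted(set(" Help!")))}  # hypothetical vocabulary
input_seq = encode_sentence("Help!", toy_index, max_len=8)
print(input_seq.shape)   # (1, 8, 6)
```

With the notebook's own objects, the call would look like `decode_sequence(encode_sentence("Help!", input_token_index, max_encoder_seq_len))`.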

In [28]:
# Decode a range of training sentences and print the results:

for seq_index in range(30,100):
  # Taking one sequence from training set to decode:
  input_seq = encoder_input_data[seq_index: seq_index+1]
  decode_sent = decode_sequence(input_seq)

  print (f'''
  Input Sequence    : {input_text[seq_index]}
  Decoded Sequence  : {decode_sent}
  ''')


  Input Sequence    : Help!
  Decoded Sequence  : Aide-moi !

  

  Input Sequence    : Hide.
  Decoded Sequence  : Cachez-vous.

  

  Input Sequence    : Hide.
  Decoded Sequence  : Cachez-vous.

  

  Input Sequence    : Jump!
  Decoded Sequence  : Saute.

  

  Input Sequence    : Jump.
  Decoded Sequence  : Saute.

  

  Input Sequence    : Stop!
  Decoded Sequence  : Arrête-toi !

  

  Input Sequence    : Stop!
  Decoded Sequence  : Arrête-toi !

  

  Input Sequence    : Stop!
  Decoded Sequence  : Arrête-toi !

  

  Input Sequence    : Wait!
  Decoded Sequence  : Attendez.

  

  Input Sequence    : Wait!
  Decoded Sequence  : Attendez.

  

  Input Sequence    : Wait!
  Decoded Sequence  : Attendez.

  

  Input Sequence    : Wait.
  Decoded Sequence  : Attends.

  

  Input Sequence    : Wait.
  Decoded Sequence  : Attends.

  

  Input Sequence    : Wait.
  Decoded Sequence  : Attends.

  

  Input Sequence    : Wait.
  Decoded Sequence  : Attends.

  

  Input Sequence  