# Building a toy language translator

We will use the **Tab-delimited Bilingual Sentence Pairs** data. You can download the data set for a particular language pair from http://www.manythings.org/anki/

#### How the data looks:

English + TAB + The Other Language

Tom broke the window.	トムは窓を割った。<br>
Tom checked the time.	トムは時間を確認した。
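
As a small illustration of this format, the sketch below splits one sample line into its two sentences. The helper name `split_pair` is hypothetical, and keeping only the first two fields is an assumption made here in case a file carries extra tab-separated columns.

```python
def split_pair(line):
    """Split one tab-delimited line into (english, other_language).

    Hypothetical helper: only the first two fields are kept, so any
    extra tab-separated columns are ignored.
    """
    fields = line.rstrip('\n').split('\t')
    return fields[0], fields[1]


english, japanese = split_pair('Tom broke the window.\tトムは窓を割った。')
print(english)   # Tom broke the window.
print(japanese)  # トムは窓を割った。
```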

## Loading libraries

In [1]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

Using TensorFlow backend.


## Some parameters

In [2]:
batch_size = 64
epochs = 60 
latent_dim = 200  # Latent dimensionality of the encoding space for the characters
num_samples = 20000 # number of pairs for training (English sentence, Spanish sentence)

## Importing and processing sequences

In [3]:
data_path = './data/translation/spa-eng/spa.txt'

### Reading the lines of the file

In [4]:
inputs = []
targets = []
input_chars = set()
target_chars = set()
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

## Last line is blank: remove    
lines = lines[:-1]

In [5]:
print("Number of examples: {:,} ".format(len(lines)))

Number of examples: 118,121 


### Getting (x,y) pairs

- For each line: extract inputs and targets
- For the target sequence, indicate the beginning of the sequence with a TAB (\t) and the *end of sequence* with a NEW LINE (\n).
- Build the sets of unique characters for input and target sequences

In [6]:
for line in lines[: min(num_samples, len(lines))]:
    x_seq, y_seq = line.split('\t')
    y_seq = '\t' + y_seq + '\n'

    inputs.append(x_seq); targets.append(y_seq)
    
    # Sets ignore duplicates, so no membership check is needed
    input_chars.update(x_seq)
    target_chars.update(y_seq)

In [7]:
input_chars = sorted(list(input_chars))
target_chars = sorted(list(target_chars))
num_encoder_tokens = len(input_chars)
num_decoder_tokens = len(target_chars)
max_encoder_seq_length = max([len(txt) for txt in inputs])
max_decoder_seq_length = max([len(txt) for txt in targets])

print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

Number of unique input tokens: 73
Number of unique output tokens: 89
Max sequence length for inputs: 20
Max sequence length for outputs: 70


### Mapping each character to an integer index

In [8]:
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_chars)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_chars)])

### Building input data

Each input sentence will be represented by a matrix of dimensions `max_encoder_seq_length x num_encoder_tokens` of ones and zeros: entry (i, j) is 1 if the character at position i is token j, and 0 otherwise.

`decoder_target_data` will be ahead of `decoder_input_data` by one timestep and will not include the start character.

In [9]:
# Encoder inputs
encoder_input_data = np.zeros(
    (len(inputs), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')

# Decoder inputs
decoder_input_data = np.zeros(
    (len(inputs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

# Decoder targets
decoder_target_data = np.zeros(
    (len(inputs), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')

for i, (x_seq, y_seq) in enumerate(zip(inputs, targets)):
    for t, char in enumerate(x_seq):
        encoder_input_data[i, t, input_token_index[char]] = 1.
    for t, char in enumerate(y_seq):
        # decoder_target_data is ahead of decoder_input_data by one timestep
        decoder_input_data[i, t, target_token_index[char]] = 1.
        if t > 0:
            decoder_target_data[i, t - 1, target_token_index[char]] = 1.
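
To make the one-timestep offset concrete, here is a minimal sketch on a single toy target string. The four-character vocabulary and all `toy_*` names are assumptions for illustration; they are not part of the notebook's data.

```python
import numpy as np

# Toy vocabulary (assumption): start character, end character and two letters
toy_chars = sorted(['\t', '\n', 'a', 'b'])
toy_index = {char: i for i, char in enumerate(toy_chars)}

toy_target = '\tab\n'  # start character + "ab" + end character
steps, vocab = len(toy_target), len(toy_chars)

toy_decoder_input = np.zeros((steps, vocab), dtype='float32')
toy_decoder_target = np.zeros((steps, vocab), dtype='float32')

for t, char in enumerate(toy_target):
    toy_decoder_input[t, toy_index[char]] = 1.
    if t > 0:
        # The target at step t - 1 is the input character at step t
        toy_decoder_target[t - 1, toy_index[char]] = 1.

# Read the one-hot rows back as characters; the last target row stays all zeros
input_read = [toy_chars[i] for i in toy_decoder_input.argmax(axis=1)]
target_read = [toy_chars[i] for i in toy_decoder_target[:-1].argmax(axis=1)]
print(input_read)   # ['\t', 'a', 'b', '\n']
print(target_read)  # ['a', 'b', '\n']
```

The decoder input starts with the start character, while the target starts one character later and never contains it.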

## Model for training

#### Building the encoder

In [10]:
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

#### Building the decoder

In [11]:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
## Obtain probabilities for each token
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

#### Complete model: chain encoder and decoder, compile and train

This model will transform `encoder_inputs` and `decoder_inputs` into `decoder_outputs`.

In [12]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.1)
# Save model
model.save('./SavedModels/seq2seq.h5')

Train on 18000 samples, validate on 2000 samples
Epoch 1/60
...
Epoch 60/60




## Model for inference

#### Encoder part:

`encoder_model` transforms `encoder_inputs` into `encoder_states`

In [13]:
encoder_model = Model(encoder_inputs, encoder_states)

#### Decoder part:

`decoder_model` transforms (`decoder_inputs`, `decoder_states_inputs`) into (`decoder_outputs`, `decoder_states`)

In [14]:
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

#### Reverse mapping: (index: char)

In [15]:
reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())
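
As a quick check of how the forward and reverse mappings fit together, the sketch below encodes a word to indices and decodes it back. The vocabulary built from the word `hola` and the `toy_*` names are assumptions for illustration only.

```python
# Toy vocabulary, assumed for illustration only
toy_chars = sorted(set('hola'))
toy_token_index = {char: i for i, char in enumerate(toy_chars)}
toy_reverse_index = {i: char for char, i in toy_token_index.items()}

# char -> index, then index -> char
encoded = [toy_token_index[char] for char in 'hola']
decoded = ''.join(toy_reverse_index[i] for i in encoded)
print(encoded)  # [1, 3, 2, 0]
print(decoded)  # hola
```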

### Getting predictions:

The translated sequences will be produced character by character, following these steps:

For the first character:
1. Produce the zero-matrix representing the input sentence
1. For each character in the input sentence, fill with 1 in the corresponding place
1. Get the initial state from the encoder
1. Generate an empty target sequence of length 1
1. Populate the first character of target sequence with the start character

For the following output characters, start with an empty decoded sentence and repeat these steps while `stop_condition` is `False`:

1. Feed `target_seq` and `states_value` to get a prediction from `decoder_model`
1. Extract the character with the highest probability from `output_tokens` 
1. Add this character to the decoded sentence
1. Check for exit condition: either hit max length or find stop character.
1. Reset the target sequence for next character
1. Update states
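
The loop described above can be sketched without a trained model. In this minimal sketch, `toy_decoder_step` is a hypothetical stand-in for `decoder_model.predict`: it follows a fixed script instead of learned weights, and its state is just a step counter. The toy vocabulary and `max_len` are also assumptions.

```python
import numpy as np

toy_chars = ['\t', '\n', 'h', 'i']  # toy vocabulary (assumption)
toy_index = {c: i for i, c in enumerate(toy_chars)}
max_len = 10                        # toy length limit (assumption)


def toy_decoder_step(target_seq, state):
    """Hypothetical stand-in for decoder_model.predict.

    Gives the highest probability to 'h', then 'i', then the newline
    stop character. The state is only a step counter.
    """
    script = 'hi\n'
    probs = np.full((1, 1, len(toy_chars)), 0.1)
    probs[0, 0, toy_index[script[min(state, len(script) - 1)]]] = 0.7
    return probs, state + 1


def greedy_decode(initial_state):
    # Length-1 target sequence holding the start character
    target_seq = np.zeros((1, 1, len(toy_chars)))
    target_seq[0, 0, toy_index['\t']] = 1.
    state = initial_state
    decoded = ''
    while True:
        # Predict, then keep the most probable character
        output_tokens, state = toy_decoder_step(target_seq, state)
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))
        sampled_char = toy_chars[sampled_index]
        decoded += sampled_char
        # Exit on the stop character or when the length limit is exceeded
        if sampled_char == '\n' or len(decoded) > max_len:
            break
        # Feed the sampled character back in as the next input
        target_seq = np.zeros((1, 1, len(toy_chars)))
        target_seq[0, 0, sampled_index] = 1.
    return decoded


print(repr(greedy_decode(0)))  # 'hi\n'
```

Taking the `argmax` at every step is greedy decoding: it never revisits an earlier choice, which keeps the loop simple.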

### Translate a sentence

In [16]:
def decode_sentence(input_sentence):
    #Produce the zero-matrix representing the input sentence
    input_seq = np.zeros((1, max_encoder_seq_length, num_encoder_tokens))
    # For each character in the input sentence, fill with 1 in the corresponding place
    for t, char in enumerate(input_sentence):
        input_seq[0, t, input_token_index[char]] = 1.
    #Get the initial state from the encoder
    states_value = encoder_model.predict(input_seq)
    # Generate an empty target sequence of length 1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        #Feed target_seq and states_value to get a prediction from decoder_model
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        #Extract the character with the highest probability from output_tokens
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        #Add this character to the decoded sentence
        decoded_sentence += sampled_char

        # Check for exit condition
        if (sampled_char == '\n' or len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True
        # Reset the target sequence for next character
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        states_value = [h, c]

    return decoded_sentence

In [19]:
sentences = ['Wait.', 'This is yours.', 'Are you OK?', 'I am here.']
for sentence in sentences:
    print(sentence, ':', decode_sentence(sentence))

Wait. : Espere.

This is yours. : Esto es el momento.

Are you OK? : ¿Eres normal?

I am here. : Estoy aquí.



## This is a very bad translator :(

Some reasons:

1. Incomplete dataset
1. Very small dataset: only the first 20,000 of the 118,121 sentence pairs are used
1. Character-level model
1. In a real translator this would be just a piece of a larger system

Attribution: <br>
Most of the code is based on:

https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py