# Pipple LSTM Language Translator Introduction

This notebook runs you through all basic steps when building a machine learning language translator, using LSTM networks. The notebook is configured to be used in Google Colaboratory: a free of use Jupyter Notebook environment running on Google servers.

Notebooks are documents which contain both computer code (e.g. python) and rich text elements (paragraph, equations, figures, links, etc…). Notebooks are both human-readable documents containing the analysis description and the results (figures, tables, etc..) as well as executable commands which can be run to perform data analysis. Google Colaboratory combines these features with other Google services such as Google Drive. 

Before continuing this tutorial make sure you select 'GPU' as hardware accelerator. This will speed up the time of training a the language translator model substantially. 

Runtime -> Change runtime type -> Hardware accelerator = 'GPU' -> Save

---



# Table of Contents


1.   Use case
2.   Retrieve data
3.   Data preparation
4.   Sequence-to-sequence models
5.   LSTM inputs
6.   Creating a train and test set
7.   Building the sequence-to-sequence model
8.   Trainng the sequence-to-sequence model
9.   Inferencing from the model
10.  Improvements on the current architecture



---
---
# 1. Use case


In this tutorial you will be building your own LSTM language translator! Neural machine translation (NMT) is one of the many use cases the LSTM networks offer. Specifically, we will be introducing sequence-to-sequence models, using an encoder and decoder framework with LSTM as main block. The encoder-decoder architecture is currenty the state of the art in machine translation. Google Translate started using these kinds of models in production in late 2016, and the world happily translated ever after. 

We will train our LSTM language translator to translate English to Dutch:)



---
---

# 2. Retrieve Data

The data we use comes from sentences from the Taboeba Project (http://www.manythings.org/anki/) and can be retrieved by running the cell block below.


In [1]:
# clone github repository
!git clone https://github.com/PippleNL/lstm_lecture.git


# import packages
import zipfile
from os.path import join


dataset = 'nld-eng'

# path to zip
local_zip = f'lstm_lecture/{dataset}.zip' 

# extract zip file
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall(f'/tmp/{dataset}')
zip_ref.close()

# data directory
base_dir = '/tmp/'

Cloning into 'lstm_lecture'...
remote: Enumerating objects: 6, done.[K
remote: Counting objects: 100% (6/6), done.[K
remote: Compressing objects: 100% (5/5), done.[K
remote: Total 6 (delta 0), reused 3 (delta 0), pack-reused 0[K
Unpacking objects: 100% (6/6), done.


The data set is located at /tmp/nld-eng/nld.txt.

Below are some bash commands listed to guide you through the repository and let you get a feeling of how the data set is structured.

In [2]:
# list all files and directories in temp directory
!ls '/tmp'

dap_multiplexer.445384d9b81d.root.log.INFO.20210120-145735.51
dap_multiplexer.INFO
debugger_afkmqbzwz
initgoogle_syslog_dir.0
nld-eng


In [3]:
# list all files nld-eng folder.
!ls '/tmp/nld-eng'

_about.txt  nld.txt


Run the cell block below to get an impression of how the data looks like

In [4]:
# import packages
import pandas as pd


# translation settings
translation = 'nld-eng'

# data directory and text file path
data_dir = join(base_dir, translation)
inf_path = join(data_dir, f"{translation.split('-')[0]}.txt")

# show data source
df = pd.read_table(inf_path, names =['source', 'target', 'comments'])
df.sample(5)

Unnamed: 0,source,target,comments
49859,I will have my sister pick you up at the station.,Ik zal mijn zus je laten oppikken aan het stat...,CC-BY 2.0 (France) Attribution: tatoeba.org #3...
9726,He finally arrived.,Hij is uiteindelijk aangekomen.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
13934,Have you tried sushi?,Heb je sushi geprobeerd?,CC-BY 2.0 (France) Attribution: tatoeba.org #4...
17122,Raise your right hand.,Steek uw rechterhand op.,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
37432,I don't know where my gloves are.,Ik weet niet waar mijn handschoenen zijn.,CC-BY 2.0 (France) Attribution: tatoeba.org #7...


---
---
# 3. Data preparation

Data preparation is an essential step in any machine learning tasks. For this use case, we do the following for both the source and target sentences:

- Convert text to lower case
- Remove quotes
- Remove all special characters like @, !, *, &, ect
- Remove digits from the sentences
- Remove extra with spaces

Next to that we add a START_ and _END token to the target sentences as this is very useful for training and inference purposes. These tags help in knowing when to start and end a translation

In [5]:
# import packages
import re
from string import punctuation, digits


def data_prep(prep_series: pd.Series):
  """
  Data preparation
  :param prep_series: Pandas Series object
  :return: Pandas Series object
  """
  # convert to lower case
  prep_series = prep_series.apply(lambda x: x.lower())

  # remove quotes
  prep_series = prep_series.apply(lambda x: re.sub("'", '', x))

  # remove special characters
  prep_series = prep_series.apply(lambda x: ''.join(ch for ch in x if ch not in set(punctuation)))

  # remove digits
  num_digits = str.maketrans('', '', digits)
  prep_series = prep_series.apply(lambda x: x.translate(num_digits))
  
  # remove extra spaces
  prep_series = prep_series.apply(lambda x: x.strip())
  prep_series = prep_series.apply(lambda x: re.sub(" +", " ", x))

  if prep_series.name == 'target':
    # add start and end tokens
    prep_series = prep_series.apply(lambda x: 'START_ ' + x + ' _END')

  return prep_series

# prepate source and target sentences
df.source = data_prep(df.source)
df.target = data_prep(df.target)
df.sample(5)

Unnamed: 0,source,target,comments
42366,how many books do you read per month,START_ hoeveel boeken leest ge per maand _END,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
1523,tom is poor,START_ tom is arm _END,CC-BY 2.0 (France) Attribution: tatoeba.org #1...
25177,im glad to see you again,START_ ik ben blij jullie weer te zien _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
1050,be merciful,START_ wees genadig _END,CC-BY 2.0 (France) Attribution: tatoeba.org #2...
34086,he gets a haircut once a month,START_ hij knipt zijn haar eens per maand _END,CC-BY 2.0 (France) Attribution: tatoeba.org #1...


---
---
# 4. Sequence-to-sequence models

Sequence to sequence models map a source sequence to a target sequence, both text sentences in this use case. The source sequence is input language to the machine translation system, the target sequence is the output language.

In [6]:
from IPython.display import HTML
HTML("""
    <video alt="s2s" controls>
        <source src="https://jalammar.github.io/images/seq2seq_4.mp4" type="video/mp4">
    </video>
""")

Under the hood, the model is composed of an encoder and a decoder.
The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be RNNs or likewise, such as LSTM networks. The vector represents the cell state (long-term memory) and the hidden state (working memory), in case of an LSTM, capturing the meaning of the sentence in a universal language (consisting of numbers).

![context](https://jalammar.github.io/images/context.png)

You can set the size of the context vector when you set up your model. It is basically the number of hidden units in the encoder LSTM. These visualizations show a vector of size 4, but in real world applications the context vector would be of a size like 256, 512, or 1024.

For a simple and clear introduction to RNNs, please watch the video of Luis Serrano:) https://www.youtube.com/watch?v=UNmqTiOnRfg

---
---
# 5. LSTM inputs

By design, a LSTM takes three inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state and a cell state. The word, however, needs to be represented by a vector. 

![embedding](https://jalammar.github.io/images/embedding.png)

To transform a word into a vector, we turn to the class of methods called “word embedding” algorithms. These algorithms turn words into vector spaces that capture a lot of the meaning/semantic information of the words. One of the most famous embedding algorithms is word2vec, developed by Google.  

![word2vec](https://miro.medium.com/max/3010/1*OEmWDt4eztOcm5pr2QbxfA.png)

More information on embedding algorithms can be found on: https://machinelearningmastery.com/what-are-word-embeddings/



In keras, we illustrate this as follows:

![input_keras](https://miro.medium.com/max/532/1*AQKRJsRdWx2HZ85H1yWoKw.png)

---
---
# 6. Creating a train and test set

When creating a train and test set on which a machine translator model can be trained and evaluated, a couple of things should thought of throughoutly:

- The vocabularies of both the source and target language should be mappable to the embedding layers, having the semantic meaning of the word
- The lenght of sentences vary accross sentences to translate
- The word in a sequence at t + 1, is the target value of the word in a sequence at t
- Data sets are often too large to entirely fit into memory, therefore it is often preferred to generate data batches

In [7]:
from itertools import chain
import numpy as np


# creating vocabulary mappings
all_source_words = sorted(list(set(chain(*[x.split() for x in df.source]))))
all_target_words = sorted(list(set(chain(*[x.split() for x in df.target]))))

# creating a word to index mapping for source and target
source_word2idx = {x: i for i, x in enumerate(all_source_words)}
target_word2idx = {x: i for i, x in enumerate(all_target_words)}

# creating an index to word mapping for source and target vocabulary
source_idx2word = {value: key for key, value in source_word2idx.items()}
target_idx2word = {value: key for key, value in target_word2idx.items()}

# determine the maximum length of sentences in the source and target data
max_source_length = max([len(x.split()) for x in df.source])
max_target_length = max([len(x.split()) for x in df.target])


# define a generate batch function yielding input data to both the encoder and decoder and the target sentences mapped to its vocabulary features (decoded)
def generate_batch(X: np.ndarray, y: np.ndarray, max_source_length: int, max_target_length: int, num_decoder_tokens: int, batch_size=128):
  """
  Generating batches of data
  """
  while True:
    # loop over input samples to generate samples of size batch size 
    for j in range(0, len(X), batch_size):
      # initialize encoder input, decoder input and decoded target data
      encoder_input_data = np.zeros((batch_size, max_source_length), dtype='float32')
      decoder_input_data = np.zeros((batch_size, max_target_length), dtype='float32')
      decoder_target_data = np.zeros((batch_size, max_target_length, num_decoder_tokens), dtype='float32')

      # convert input and output sentences to structure which can be trained by LSTM
      X_batch = X[j:j + batch_size]
      y_batch = y[j:j + batch_size]
      for i, (input_text, target_text) in enumerate(zip(X_batch, y_batch)):
        # map each word of the source sentence to its vocabulary integer
        for t, word in enumerate(input_text.split()):
          encoder_input_data[i, t] = source_word2idx[word] 

        # map each word of the target sentence to its vocabulary integer
        for t, word in enumerate(target_text.split()):
          # exclude the last token in target sentence (i.e. _END) as this token should not be input for a next word in the sequence
          if t < len(target_text.split()) - 1: 
              decoder_input_data[i, t] = target_word2idx[word]
          
          # add boolean array of true next words in the sequence (i.e. a word in the sequence at t + 1 is the target value of the word in a sequence at t)
          # offset by 1 time step
          if t>0:
              decoder_target_data[i, t - 1, target_word2idx[word]] = 1.
      
      # yield batch data
      yield([encoder_input_data, decoder_input_data], decoder_target_data)

The data set is splitted into a train and test set, before these are fed into the generate batch function. Moreover, the data is shuffeled which helps the model to be more robust. For your knowledge, the original dataset is sorted on sentence length.

In [8]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split


# shuffle the data
df = shuffle(df)

# create training and test data set
X, y = df.source, df.target
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.1)

---
--- 
# 7. Building the sequence-to-sequence model

Recall that we need to create an encoder and decoder LSTM network. 

Moreover, we will be training our own embedding layers to capture the necessary meanings of the input and output words.

Next to that, we will be using the method 'Teacher Forcing' to train the sequence-to-sequence model. Teacher Forcing works by using the actual or expected output from the training data set at the current time step y(t) as input in the next time step X(t+1), rather than the output generated by the network. 

- Embedding dimension = 100
- Latent dimension = 256 (i.e. dimension of context vectors c and h)
- Number of encoder tokens = lenght of source vocabulary (i.e. used in embedding layer)
- Number of decoder tokens = length of target vocabulary + 1 (i.e. as we zero pad for non existing words in a sequence of length max target sequence)

In [9]:
# set some parameters for the encoder and decoder
embedding_dim = 100
latent_dim = 256
num_encoder_tokens = len(all_source_words)
num_decoder_tokens = len(all_target_words) + 1  # zero pad

### Encoder

The encoder will encode the input sequence into a context vector, consisting of the working memory and its long-term memory.
1. Input sequence goes into input layer
2. Input layer connects to embedding layer (this maps the input to a 3D array of shape (batch samples, max sentence length, word features)
3. The embedding layer connects to the LSTM layer(s)

We are particulary interested in the context of this encoder, which is achieved by setting return_state=True and discarding the final outputs of the LSTM

In [10]:
from keras.layers import Input, LSTM, Embedding


# 1. input sequence goes into input layer
encoder_inputs = Input(shape=(None,))

# 2. input layer connects to embedding layer
# mask_zero is set True which implies that the input value of 0 is a special padding value that should be masked out
enc_emb = Embedding(num_encoder_tokens, embedding_dim, mask_zero=True)(encoder_inputs)

# 3. Connect embedding layer to LSTM layer
encoder_lstm = LSTM(latent_dim, return_state=True)
final_enc_outputs, state_h, state_c = encoder_lstm(enc_emb)

# discard the final encoder outputs and only keep the states (used as input for LSTM layer in decoder)
encoder_context = [state_h, state_c]

### Decoder

The decoder will decode the context vector into an output sequence, item by item.

1. Output sequence goes into input layer (i.e. remember it is a sequential process)
2. Input layer is again connected to embedding layer
3. LSTM layer in the decoder takes the input of the embedding layer and the context of the encoder as its input
4. Add a dense layer including a softmax activation to map the output of the decoder to a word in the target vocabulary

In case of the decoder, we want to return the output of the LSTM layers in between time steps, as well as the internal working memory (h(i) )and long-term memory states (c(i)). These internal states, c(i) and h(i), are not used during training, but will be used during the inference phase.


In [11]:
from keras.layers import Dense


# 1. output sequence goes into input layer Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))

# 2. input layer is connected to embedding layer
dec_emb_layer = Embedding(num_decoder_tokens, embedding_dim, mask_zero=True)
dec_emb = dec_emb_layer(decoder_inputs)

# 3. LSTM layer takes input of embedding layer and the context of the encoder
# return sequence is set True as we are interested in the output prections for all time steps
# internal states (h and c) are not used during training
# LSTM layer is initialized with the context vector of the encoder
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_context)

# 4. adding a dense layer incl. softmax activation such that mapping to target vocabulary can be done
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

The total sequence to sequence model takes both encoder and decoder inputs and predicts decoder outputs

In [12]:
from keras.models import Model

# the model that takes encoder and decoder input to output decoder outputs
train_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# print summary
train_model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 100)    868400      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 100)    1231400     input_2[0][0]                    
______________________________________________________________________________________________

---
---
# 8. Training the sequence-to-sequence model

To train the model, in keras, we first compile the model and then fit the data (using batches) to the model. To fit the model using batches of input data, we use the fit_generator() method. 

More specifically, we use the following settings:
- optimizer: RMSPROP
- loss function: categorical crossentropy, as we use categorical labels (one-hot encoded vectors for each target word)
- metrics: accuracy
- batch size: 128
- epochs: 100

In [13]:
# parameters
train_samples = len(X_train) # total training samples
val_samples = len(X_test)    # total validation or test samples
batch_size = 128
epochs = 30
optimizer = 'rmsprop'
loss = 'categorical_crossentropy'
metrics = ['acc']

# compiling the model
train_model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

# define batch generator for train and validation purposes
train_gen = generate_batch(X_train, y_train, max_source_length=max_source_length, max_target_length=max_target_length, num_decoder_tokens=num_decoder_tokens, batch_size=batch_size)
test_gen = generate_batch(X_test, y_test, max_source_length=max_source_length, max_target_length=max_target_length, num_decoder_tokens=num_decoder_tokens, batch_size=batch_size)

# training the model
train_model.fit_generator(generator=train_gen,
                          steps_per_epoch=train_samples//batch_size,
                          epochs=epochs,
                          validation_data=test_gen,
                          validation_steps=val_samples//batch_size)



Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<tensorflow.python.keras.callbacks.History at 0x7f35f667f320>

---
---
# 9. Inferencing from the model

During inference we want to translate unknown input sequence to predict the output, and evaluate from there. Batch size equals 1 for simplicity.

1. Encode the input sequences into context vectors (1 context vector for each sample)
2. Predict the decoded sentence one by one, where the first input to the decoder will the context vector of the encoder and the embedded _START 
3. The output of the decoder will be fed as an input to the decoder for the next time step

  ![inference](https://miro.medium.com/max/554/1*AyyknGa07gMGVLhiWCruNQ.png)

4. Convert the decoded output (one-hot encoded) vector to the word from the target dictionary
5. Append the generated target word to the target sentence
6. Repeat the steps till we hit the _END tag or a sentence limit

### Define the inference model

By defining the inference model, we basically mean, use the trained layers from the trained model, but now also use the working memory (h(i)) and long term memory (c(i)) from the decoder model as input for the next output sequence.

In [14]:
# inference model uses encoder model to convert input sequence to context vectors
encoder_model = Model(encoder_inputs, encoder_context)

# decoder gets new context output (i.e. h(i), c(i)) from previous time step
dec_context_input = [Input(shape=(latent_dim, )), Input(shape=(latent_dim, ))]

# decoder gets embedded target word in sequence as input; using trained embedding layer
dec_emb_inf = dec_emb_layer(decoder_inputs)

# set the lstm initial state to the states from the previous time step; using trained lstm layer
dec_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(dec_emb_inf, initial_state=dec_context_input)
dec_context_inf = [state_h_inf, state_c_inf]

# use the trained dense softmax layer to generate vocabulary predictions, using the newly generated context vectors
dec_outputs_inf = decoder_dense(dec_outputs_inf)

# combine steps in a total inference model; decoder inputs and its context vectors output decoded outputs and decoded context vectors
decoder_model = Model([decoder_inputs] + dec_context_input, [dec_outputs_inf] + dec_context_inf)

# print model summary
decoder_model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 100)    1231400     input_2[0][0]                    
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 256)]        0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 256)]        0                                            
____________________________________________________________________________________________

Inference multiple test sequences from the inference model

In [15]:
output_sentence_limit = 50


def inference_from_model(input_seq: np.ndarray, output_sentence_limit=50):
  """
  Decode a single sentence using the trained model. That is batch_size = 1
  """
  # 1. encode the input into context vectors
  states_value = encoder_model.predict(input_seq)

  # 2. start with a target sequence of size 1 (just the start-of-sequence word)
  target_seq = np.zeros((1,1))
  target_seq[0, 0] = target_word2idx['START_']

  decoded_sentence = ''
  while True:
    # print(f'target_seq: {target_seq}')
    # 3. feed the state vectors and start word target sequence to the decoder to produce predictions for the next word
    output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

    # 4. sample the next word using the predictions done by decoder
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    sampled_word = target_idx2word[sampled_token_index]

    # 5. append the sampled character to the target sequence
    target_seq[0, 0] = sampled_token_index
    decoded_sentence += ' ' + sampled_word

    # Update states
    states_value = [h, c]

    # exit loop if stop condition is met
    if sampled_word == '_END' or len(decoded_sentence) > output_sentence_limit:
      break
    
  # return sentence
  return decoded_sentence

#### Inference train set

In [16]:
# redefine test set generator having batch size one for simplicity
train_gen = generate_batch(X_train, y_train, max_source_length=max_source_length, max_target_length=max_target_length, num_decoder_tokens=num_decoder_tokens, batch_size=1)

# inference from the model x times
x = 10

for i, ((input_seq, actual_output), _) in enumerate(train_gen):
  # inference from model
  inference_sentence = inference_from_model(input_seq)
  print(f'Input source sentence: {X_train[i]}')
  print(f'Actual target translation: {y_train[i]}')
  print(f'Predicted target translation: {inference_sentence}')
  print()

  if i >= x:
    break

Input source sentence: anyone can write his own name
Actual target translation: START_ iedereen kan zijn eigen naam schrijven _END
Predicted target translation:  een kan ik zijn eigen naam maken _END

Input source sentence: the balloon is filled with air
Actual target translation: START_ de ballon is gevuld met lucht _END
Predicted target translation:  een onder het ligt onder _END

Input source sentence: tom doesnt know what mary does for a living
Actual target translation: START_ tom weet niet wat mary voor de kost doet _END
Predicted target translation:  een weet niet wat tom en mary voor het halen _END

Input source sentence: it isnt a secret
Actual target translation: START_ het is geen geheim _END
Predicted target translation:  een het is niet geheim _END

Input source sentence: how long would you say theyve been there tom
Actual target translation: START_ hoe lang denk je dat ze daar zijn geweest tom _END
Predicted target translation:  een keer denk je denk dat tom daar me zo ri

#### Inference test set

In [17]:
# redefine test set generator having batch size one for simplicity
test_gen = generate_batch(X_test, y_test, max_source_length=max_source_length, max_target_length=max_target_length, num_decoder_tokens=num_decoder_tokens, batch_size=1)

# inference from the model x times
x = 10

for i, ((input_seq, actual_output), _) in enumerate(test_gen):
  # inference from model
  inference_sentence = inference_from_model(input_seq)
  print(f'Input source sentence: {X_test[i]}')
  print(f'Actual target translation: {y_test[i]}')
  print(f'Predicted target translation: {inference_sentence}')
  print()

  if i >= x:
    break

Input source sentence: how many books do you read a month
Actual target translation: START_ hoeveel boeken lees je per maand _END
Predicted target translation:  een boeken heeft u per maand van de engels houdt _END

Input source sentence: do you like it when i do this
Actual target translation: START_ vind je het leuk als ik dat doe _END
Predicted target translation:  een reden hoe je het doen is ik _END

Input source sentence: everybody loves him
Actual target translation: START_ iedereen houdt van hem _END
Predicted target translation:  een kind is gek _END

Input source sentence: when i was a child i believed in santa claus
Actual target translation: START_ toen ik nog klein was geloofde ik in de kerstman _END
Predicted target translation:  een kind was ik hem in de buurt van een autoongeluk

Input source sentence: we can win this war
Actual target translation: START_ we kunnen deze oorlog winnen _END
Predicted target translation:  een kunnen jullie deur nemen _END

Input source sen

---
---
# 10. Improvements on current architecture

As you might notice, the results are not quite as estonishing as we would have hoped. Of course this, amongst others, has to do with the limited training capacities. Other possibilities to investigate include:

- Include bi-directional LSTM layers


```
# replace parts in encode
encoder_lstm = LSTM(latent_dim, return_state=True)
final_enc_outputs, state_h, state_c = encoder_lstm(enc_emb)
# by
encoder_lstm = Bidirectional(LSTM(latent_dim, return_state=True), merge_mode='concat')
final_enc_outputs, forward_state_h, forward_state_c, backward_state_h, backward_state_c = encoder_lstm(enc_emb)
state_h = Concatenate()([forward_state_h, backward_state_h])
state_c = Concatenate()([forward_state_c, backward_state_c])

# replace parts in decoder
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
# by
decoder_lstm = LSTM(latent_dim*2, return_sequences=True, return_state=True)
```


- Include attention layers
- Include pre-trained embedded vectors

All of these will be covered in LSTM lecture part 2:) We hope to see you there as well!!

---