# Deep Learning & Generative ChatBots
By using **neural networks** with many hidden layers, an approach known as **deep learning**, generative chatbot models **can build sentences that are completely original** rather than retrieving them from a list of possible responses.

1. [**Long Short-Term Memory (LSTM) networks**](#LSTMs)
2. [**seq2seq**](#seq2seq)

<a name='LSTMs'></a>
## Long Short-Term Memory (LSTM) networks
**Recurrent Neural Networks (RNNs)** are designed to process inputs in **temporal order**, updating their predictions for later steps based on what came earlier (as in speech recognition and machine translation). **LSTMs** are a special type of RNN that can generate language that is both **persistent** across interactions and **adaptable** to new conversations.
 
The chain structure of RNNs makes them a natural match for **data with a clear temporal ordering** or list-like structure, such as **human language**, where words appear one after another. **Standard RNNs** are a good fit for tasks that involve sequences, like the translation of a sentence from one language to another. However, as the **gap between the relevant context and the word to predict** grows, standard RNNs become less and less accurate (the **long-term dependency problem**). LSTMs were designed to address this problem.

The most important aspect of an LSTM is the way it combines transformed input data with its **state** (**cell memory**), which is represented as vectors. Two states are produced at the first step in the sequence and then carried forward as subsequent inputs are processed: the cell state and the hidden state.

The **cell state** carries information through the network as we process a sequence of inputs. At each timestep (each step in the sequence), gates control how much of the new input is added to the cell state and how much of the existing cell state is kept. A filtered version of the updated cell state, called the **hidden state**, is the output of that timestep and is fed back into the LSTM at the next timestep. The final output of the network is often taken from the final hidden state, or from an average across the hidden states of all timesteps.

The persistence of the majority of a cell state across data transformations, combined with incremental additions controlled by the gates, allows for important **information from the initial input data to be maintained** in the neural network. Ultimately, this allows for information from far earlier in the input data to be used in decisions at any point in the model.
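To make the roles of the gates, the cell state, and the hidden state concrete, here is a minimal NumPy sketch of a single LSTM timestep. It is an illustration under stated assumptions, not how Keras implements its layer: the function name `lstm_step`, the random weights, and the gate ordering inside `z` are all hypothetical choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep (illustrative sketch).

    W: (4 * units, input_dim), U: (4 * units, units), b: (4 * units,)
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:units])                # input gate: how much new input to add
    f = sigmoid(z[units:2 * units])        # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * units:3 * units])    # candidate values computed from the input
    o = sigmoid(z[3 * units:4 * units])    # output gate: how much cell state to expose
    c = f * c_prev + i * g                 # updated cell state
    h = o * np.tanh(c)                     # hidden state: filtered view of the cell state
    return h, c

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
input_dim, units = 3, 4
W = rng.normal(size=(4 * units, input_dim))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

# Both states start at zero and are carried across the whole sequence
h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(5, input_dim))
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, c.shape)
```

Because the cell state is updated by scaling and adding rather than being overwritten, information from early timesteps can survive many steps when the forget gate stays close to 1.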

<a name="seq2seq"> </a>
## Sequence-to-Sequence (seq2seq)
One of the most **common neural models used for text generation** is the seq2seq model, a type of **encoder-decoder model** that uses RNNs such as LSTMs to generate output token by token or character by character.

seq2seq models are used for machine translation, text summarization, chatbots, Named Entity Recognition (NER), and speech recognition.

seq2seq networks have two parts:
1. An **encoder** that accepts language (or audio or video) input. The output matrix of the encoder is discarded, but its state is preserved as a vector.
2. A **decoder** that takes the encoder’s final state (or memory) as its initial state. A technique called “teacher forcing” is used to train the decoder to predict the following text (characters or words) in a target sequence given the previous text.

## Building a pretty limited **English-to-Spanish translator**
Several **neural network libraries** can be used to build a seq2seq model; here we use **TensorFlow with the Keras API**.

We’ll need the following for our Keras implementation:

1. **Vocabulary sets** for both our input (English) and target (Spanish) data
2. The **total number of unique word tokens** we have for each set
3. The **maximum sentence length** we’re using for each language

We also need to **mark the start and end of each document** (sentence) in the target samples so that the model recognizes where to begin and end its text generation. 

One way to do this is adding ```<START>``` at the beginning and ```<END>``` at the end of each target document (in this case, this will be our Spanish sentences). 

For example, ```Estoy feliz.``` becomes ```<START> Estoy feliz. <END>```.
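As a small sketch of that marking step, the helper below wraps a sentence in the two markers. The name `mark_target` is hypothetical. Unlike the one-line example above, this sketch also separates punctuation from words, using the same regular expression as the notebook's preprocessing code.

```python
import re

START, END = "<START>", "<END>"

def mark_target(sentence):
    """Tokenize a target sentence and wrap it in start/end markers."""
    # Words (including apostrophes) become one token; each punctuation mark is its own token
    tokens = re.findall(r"[\w']+|[^\s\w]", sentence)
    return " ".join([START] + tokens + [END])

print(mark_target("Estoy feliz."))  # <START> Estoy feliz . <END>
```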

```python
from tensorflow import keras
import re

# Importing our translations
data_path = "engspan.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

print(lines)
```

```
["We'll see.\tDespués veremos.", "We'll see.\tYa veremos.", "We'll try.\tLo intentaremos.", "We've won!\t¡Hemos ganado!", 'Well done.\tBien hecho.', "What's up?\t¿Qué hay?", 'Who cares?\t¿A quién le importa?', 'Who drove?\t¿Quién condujo?', 'Who drove?\t¿Quién conducía?', 'Who is he?\t¿Quién es él?', 'Who is it?\t¿Quién es?']
```


```python
# Building empty lists to hold sentences
input_docs = []
target_docs = []

# Building empty vocabulary sets
input_tokens = set()
target_tokens = set()

for line in lines:
    # Input and target sentences are separated by tabs
    input_doc, target_doc = line.split('\t')
    # Appending each input sentence to input_docs
    input_docs.append(input_doc)
    # Splitting words from punctuation
    target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
    # Redefine target_doc and append it to target_docs
    target_doc = '<START> ' + target_doc + ' <END>'
    target_docs.append(target_doc)

    # Split each sentence into tokens and add each token to its vocabulary set.
    # For the first pair the tokens are: We'll / see / .  and  <START> / Después / veremos / . / <END>
    for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
        input_tokens.add(token)
    for token in target_doc.split():
        target_tokens.add(token)

input_tokens = sorted(input_tokens)
target_tokens = sorted(target_tokens)

# Create num_encoder_tokens and num_decoder_tokens
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

# Longest sentence (in tokens) for each language.
# Note: the regex splits <START> and <END> into three tokens each, so the
# decoder length is slightly over-counted; the extra timesteps are just zero padding.
max_encoder_seq_length = max(len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs)
max_decoder_seq_length = max(len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs)
```


### Training Setup (part 1)
For each sentence, **Keras expects a NumPy matrix containing one-hot vectors** for each token.

In order to **vectorize our data** and later **translate it from vectors** we need:
1. Features dictionary for English
2. Features dictionary for Spanish
3. Reverse features dictionary for English (where the keys and values are swapped)
4. Reverse features dictionary for Spanish  

Once we have all four we will vectorize the data. We will need vectors to input into our encoder and decoder, as well as a vector of target data we can use to train the decoder.
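As a quick illustration of how a features dictionary and its reverse work together, the sketch below builds both for a tiny, hypothetical four-token vocabulary and checks that an index can be mapped back to its token.

```python
# Hypothetical toy vocabulary, sorted so that indices are reproducible
tokens = sorted({"We'll", "see", "try", "."})

# token -> index (used to vectorize text)
features_dict = {token: i for i, token in enumerate(tokens)}
# index -> token (used to turn model output back into text)
reverse_features_dict = {i: token for token, i in features_dict.items()}

indices = [features_dict[token] for token in ["We'll", "try", "."]]
decoded = [reverse_features_dict[i] for i in indices]

print(features_dict)
print(indices, decoded)
```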

Because each matrix is almost all zeros, we’ll use ```numpy.zeros()``` from the NumPy library to build them out:  

```encoder_input_data = np.zeros((len(input_docs), max_encoder_seq_length, num_encoder_tokens), dtype='float32')```

We defined a NumPy matrix of zeros called encoder_input_data with two arguments:
1. The **shape of the matrix** — in our case the number of documents (or sentences) by the maximum token sequence length (the longest sentence we want to see) by the number of unique tokens (or words)
2. The **data type** we want — in our case NumPy’s float32, which can speed up the processing a bit

```python
import numpy as np

print('Number of samples:', len(input_docs))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

# build input features dictionary
input_features_dict = dict([(token, i) for i, token in enumerate(input_tokens)])
# build target features_dictionary
target_features_dict = dict([(token, i) for i, token in enumerate(target_tokens)])

# reverse-lookup token index to decode sequences back to something readable
# build reverse input features dictionary
reverse_input_features_dict = dict((i, token) for token, i in input_features_dict.items())
# same for reverse target features dictionary
reverse_target_features_dict = dict((i, token) for token, i in target_features_dict.items())

# build a Numpy matrix of zeros
encoder_input_data = np.zeros((len(input_docs), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
print("\nHere's the first item in the encoder input matrix:\n", encoder_input_data[0], "\n\nThe number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.")

# build out the decoder_input_data matrix
decoder_input_data = np.zeros((len(input_docs), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
# build out the decoder_target_data matrix
decoder_target_data = np.zeros((len(input_docs), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
```

```
Number of samples: 11
Number of unique input tokens: 18
Number of unique output tokens: 27
Max sequence length for inputs: 4
Max sequence length for outputs: 12

Here's the first item in the encoder input matrix:
 [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

The number of columns should match the number of unique input tokens and the number of rows should match the maximum sequence length for input sentences.
```


### Training Setup (part 2)
At this point we need to fill out the 1s in each vector. We can loop over each English-Spanish pair in our training sample using the features dictionaries to add a 1 for the token in question.

You’ll notice the vectors have timesteps — we use these to track where in a given document (sentence) we are.

To build out a three-dimensional NumPy matrix of one-hot vectors, we can assign a value of 1 for a given word at a given timestep in a given line:

```matrix_name[line, timestep, features_dict[token]] = 1.```
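The assignment pattern above can be tried on a toy example first. The sketch below uses two hypothetical, already-tokenized sentences; summing over the last axis shows that every real token contributes exactly one 1, while padded timesteps stay all zeros.

```python
import numpy as np

# Two hypothetical, already-tokenized sentences
docs = [["We'll", "see", "."], ["Who", "is", "it", "?"]]

vocab = sorted({token for doc in docs for token in doc})
features_dict = {token: i for i, token in enumerate(vocab)}
max_len = max(len(doc) for doc in docs)

# Shape: (number of sentences, longest sentence, vocabulary size)
one_hot = np.zeros((len(docs), max_len, len(vocab)), dtype="float32")
for line, doc in enumerate(docs):
    for timestep, token in enumerate(doc):
        one_hot[line, timestep, features_dict[token]] = 1.0

print(one_hot.shape)        # (2, 4, 7)
print(one_hot.sum(axis=2))  # one 1 per real token, 0 for the padded timestep
```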

Keras will fit — or train — the seq2seq model using these matrices of one-hot vectors:
* the encoder input data
* the decoder input data
* the decoder target data

Hang on a second, why build two matrices of decoder data? Aren’t we just encoding and decoding?

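The reason has to do with a technique known as **teacher forcing** that most seq2seq models employ during training. Here’s the idea: at each timestep, the decoder is given the correct Spanish token from the previous timestep (rather than its own earlier prediction) as input, and it is trained to predict the current timestep’s target token. The decoder input data holds those previous tokens, and the decoder target data holds the same tokens shifted one timestep ahead.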
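The one-timestep offset between decoder input and decoder target can be seen on a single sentence. This is only a token-level sketch with hypothetical variable names; the notebook expresses the same offset with one-hot matrices.

```python
target_doc = "<START> Ya veremos . <END>"
tokens = target_doc.split()

# What the decoder is given at each timestep ...
decoder_input_tokens = tokens[:-1]
# ... and what it is trained to predict at that same timestep
decoder_target_tokens = tokens[1:]

for given, expected in zip(decoder_input_tokens, decoder_target_tokens):
    print(f"given {given:>8} -> predict {expected}")
```

The target sequence never contains `<START>`, and the last thing the decoder learns to predict is `<END>`.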

```python
for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):

    for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
        # Assign 1. for the current line, timestep, & word in encoder_input_data:
        encoder_input_data[line, timestep, input_features_dict[token]] = 1.

    for timestep, token in enumerate(target_doc.split()):
        # Assign 1. for the current line, timestep, & word in decoder_input_data:
        decoder_input_data[line, timestep, target_features_dict[token]] = 1.

        # This check sits inside the loop so that every token after <START> is recorded,
        # not just the last one.
        if timestep > 0:
            # decoder_target_data is ahead by 1 timestep and doesn't include the start token.
            decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.
```

### Encoder Training Setup

Deep learning models in Keras are built in layers, where each layer is a step in the model.

Our encoder requires two layer types from Keras:
1. An **input layer**, which defines a matrix to hold all the one-hot vectors that we’ll feed to the model.
2. An **LSTM layer**, with some output dimensionality.

Next, we **set up the input layer**, which needs the shape of a single input sample.

In this case, each sample is a sequence of one-hot vectors, so the shape is the number of timesteps by the number of encoder tokens. We know the number of encoder tokens, but sentences vary in length, so we pass ```None``` for the timestep dimension and the layer accepts sequences of any length. The batch size (how many sentences we’re feeding the model at a time) is not part of ```shape``` at all; Keras handles varying batch sizes on its own.

For the **LSTM layer** we need to **select the dimensionality** (the size of the LSTM’s hidden states, which helps determine how closely the model molds itself to the training data) and decide whether to return the state (in this case we do).

**The only thing we want from the encoder is its final states.**

```python
from keras.layers import Input, LSTM
from keras.models import Model

# create the input layer
encoder_inputs = Input(shape=(None, num_encoder_tokens))

# create the LSTM layer:
encoder_lstm = LSTM(256, return_state=True)

# retrieve the outputs and states:
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)

# put the states together in a list:
encoder_states = [state_hidden, state_cell]
```

### Decoder Training Setup
The decoder looks a lot like the encoder, with an input layer and an LSTM layer that we use together.


This time **we care about full return sequences**. With our decoder, we pass in the state data from the encoder along with the decoder inputs, and we keep the output instead of the states.

We also need to run the output through a final layer with the **Softmax** activation function, which gives us a probability distribution over the target tokens at each timestep. This final layer also transforms our LSTM output from whatever dimensionality we gave it (in our case, 256) to the number of unique tokens in the target vocabulary (27 for the sample data above).
 
Keras could do this with several layer types, but **Dense is the least complex**.
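To see what that final layer does to the shapes, here is a rough NumPy sketch of a dense projection followed by softmax. The weights are random and the names `kernel` and `bias` are assumptions for illustration; this is not the trained Keras layer, only the same arithmetic on made-up numbers.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
latent_dim, vocab_size, timesteps = 256, 27, 5

# Stand-ins for the decoder LSTM output and the dense layer's parameters
lstm_outputs = rng.normal(size=(timesteps, latent_dim))
kernel = rng.normal(scale=0.05, size=(latent_dim, vocab_size))
bias = np.zeros(vocab_size)

probabilities = softmax(lstm_outputs @ kernel + bias)

print(probabilities.shape)         # (5, 27): one distribution per timestep
print(probabilities.sum(axis=-1))  # each row sums to 1 (up to rounding)
```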

```python
from keras.layers import Dense


# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# the decoder input and LSTM layers
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# this time we care about full return sequences
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)

# retrieve the LSTM outputs and states
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# build a final Dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax')

# filter outputs through the Dense layer
decoder_outputs = decoder_dense(decoder_outputs)
```

<a name="buildseq2seq"> </a>
### Build & Train seq2seq

1. **Define the seq2seq model** using the ```Model()``` function from Keras. To make it a seq2seq model, feed it the encoder and decoder inputs, as well as the decoder output:

```model = Model([encoder_inputs, decoder_inputs], decoder_outputs)```

2. **Train** the model. First, compile everything. Keras models need two arguments to compile:

* An ```optimizer``` (here RMSprop, a variant of gradient descent with adaptive learning rates) to help minimize our error rate (how bad the model is at guessing the true next word given the previous words in a sentence).
* A ```loss``` function (here a logarithm-based cross-entropy function) to determine the error rate.

We also add **accuracy** as a metric to watch while training.

```model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])```

3. **Fit** the compiled model to both the encoder and decoder input data (what we pass into the model), the decoder target data (what we expect the model to return given the data we passed in), and some numbers we can adjust as needed:

* ```batch_size``` (smaller batch sizes mean more training time; whether smaller or larger batches give better results depends on the problem)
* Number of ```epochs``` or cycles of training (more epochs mean a model that is more trained on the dataset, and that the process will take more time)
* ```validation_split``` (the fraction of the data set aside for validation, which helps determine when to stop training, rather than used for training)

```model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=10, epochs=100, validation_split=0.2)```

```python
# Encoder training setup
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(256, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

# Decoder training setup:
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(256, return_sequences=True, return_state=True)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# building the training model
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

print("Model summary:\n")
training_model.summary()
print("\n\n")

# compile the model
training_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])


# batch size and number of epochs
batch_size = 50
epochs = 50

print("Training the model:\n")
# train the model:
training_model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
                   batch_size=batch_size, epochs=epochs, validation_split=0.2)
```

```
Model summary:

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
 input_7 (InputLayer)           [(None, None, 18)]   0           []

 input_8 (InputLayer)           [(None, None, 27)]   0           []

 lstm_6 (LSTM)                  [(None, 256),        281600      ['input_7[0][0]']
                                 (None, 256),
                                 (None, 256)]


Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x2303dfaa8e0>
```

<a name="Testing"> </a>
### Setup for Testing
To generate some original output text, the seq2seq model architecture needs to be **redefined in pieces**.

**The model used for training only works when we already know the target sequence**. This time, we have no idea what the Spanish should be for the English we pass in! So we need a model that will **decode step-by-step** instead of using teacher forcing.

1. Build an encoder model with our encoder inputs and the placeholders for the encoder’s output states:

```encoder_model = Model(encoder_inputs, encoder_states)```

2. We need placeholders for the decoder’s input states, which we can build as input layers and store together. We don’t know what we want to decode yet or what hidden state we’re going to end up with, so we need to do everything step-by-step. We need to pass the encoder’s final hidden state to the decoder, sample a token, and get the updated hidden state back. Then we’ll be able to (manually) pass the updated hidden state back into the network:

```
latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]
```

3. Using the decoder LSTM and decoder dense layer (with the activation function) that we trained earlier, we’ll create new decoder states and outputs:

```
decoder_outputs, state_hidden, state_cell = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)

# Saving the new LSTM output states:
decoder_states = [state_hidden, state_cell]
```

4. Redefine the decoder output by passing it through the dense layer:

```decoder_outputs = decoder_dense(decoder_outputs)```

5. Set up the decoder model. This is where we bring together:

* the decoder inputs (the decoder input layer)
* the decoder input states (the final states from the encoder)
* the decoder outputs (the NumPy matrix we get from the final output layer of the decoder)
* the decoder output states (the memory throughout the network from one word to the next)

```decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)```

```python
# Building the encoder test model
encoder_model = Model(encoder_inputs, encoder_states)

latent_dim = 256
# Building the two decoder state input layers
decoder_state_input_hidden = Input(shape=(latent_dim,))

decoder_state_input_cell = Input(shape=(latent_dim,))

# Put the state input layers into a list
decoder_states_inputs = [decoder_state_input_hidden,
                         decoder_state_input_cell]

# Call the decoder LSTM
decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]

# Redefine the decoder outputs
decoder_outputs = decoder_dense(decoder_outputs)

# Build the decoder test model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
```

<a name="TestFunction"> </a>
### The Test Function (part 1)
1. Build a function that:
    1. Accepts a NumPy matrix representing the test English sentence input
    2. Uses the encoder and decoder we’ve created to generate Spanish output

Inside the test function, we’ll run our new English sentence through the encoder model. The ```.predict()``` method takes in new input (as a NumPy matrix) and gives us output states that we can pass on to the decoder:

```states = encoder_model.predict(test_input)```  
*(test_input is a NumPy matrix representing an English sentence)*

2. Build an empty NumPy array for our Spanish translation, giving it three dimensions:

```target_seq = np.zeros((1, 1, num_decoder_tokens))```  
*(batch size: 1, number of tokens to start with: 1, number of tokens in our target vocabulary)*

We already know the first token in our Spanish sentence, ```"<START>"```, so we can give ```"<START>"``` a value of 1 at the first timestep:

```target_seq[0, 0, target_features_dict['<START>']] = 1.```
    
Before we start decoding, we’ll need a string to which we can add our translation, word by word:

```decoded_sentence = ''```
    
This is the variable that we will ultimately return from the function.

```python
def decode_sequence(test_input):
    # Encode the input as state vectors:
    encoder_states_value = encoder_model.predict(test_input)

    # Set decoder states equal to encoder final states
    decoder_states_value = encoder_states_value

    # Generate empty target sequence of length 1:
    target_seq = np.zeros((1, 1, num_decoder_tokens))

    # Populate the first token of target sequence with the start token:
    target_seq[0, 0, target_features_dict['<START>']] = 1.

    decoded_sentence = ''

    return decoded_sentence

for seq_index in range(10):
    test_input = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(test_input)
    print('-')
    print('Input sentence:', input_docs[seq_index])
    print('Decoded sentence:', decoded_sentence)
```

```
-
Input sentence: We'll see.
Decoded sentence: 
-
Input sentence: We'll see.
Decoded sentence: 
-
Input sentence: We'll try.
Decoded sentence: 
-
Input sentence: We've won!
Decoded sentence: 
-
Input sentence: Well done.
Decoded sentence: 
-
Input sentence: What's up?
Decoded sentence: 
-
Input sentence: Who cares?
Decoded sentence: 
-
Input sentence: Who drove?
Decoded sentence: 
-
Input sentence: Who drove?
Decoded sentence: 
-
Input sentence: Who is he?
Decoded sentence: 
```


### Test Function (part 2)
**Translation time**:
1. Decode the sentence word by word using the output state that we retrieved from the encoder (which becomes our decoder’s initial hidden state). 
2. Update the decoder hidden state after each word so that we use previously decoded words to help decode new ones.

To tackle one word at a time, we need a while loop that will run until one of two things happens (we don’t want the model generating words forever):

* The current token is ```"<END>"```.
* The decoded Spanish sentence length hits the maximum target sentence length.
    
Inside the while loop, the decoder model can use the current target sequence (beginning with the ```"<START>"``` token) and the current state (initially passed to us from the encoder model) to get a bunch of possible next words and their corresponding probabilities. In Keras, it looks something like this:

```output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict([target_seq] + decoder_states_value)```  
    
3. Use NumPy’s ```np.argmax()``` function to determine the token (word) with the highest probability and add it to the decoded sentence:

```sampled_token_index = np.argmax(output_tokens[0, -1, :])```  
*(slicing [0, -1, :] gives us a specific token vector within the 3d NumPy matrix)*   
```sampled_token = reverse_target_features_dict[sampled_token_index]```  
*(the reverse features dictionary translates back from index to Spanish)*  
```decoded_sentence += " " + sampled_token```  

4. Update a few values for the next word in the sequence:

```target_seq = np.zeros((1, 1, num_decoder_tokens))```  
```target_seq[0, 0, sampled_token_index] = 1.```  
*(move to the next timestep of the target sequence)*  
```decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]```  
*(update the states with values from the most recent decoder prediction)*  
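The control flow of that loop can be tested without a trained model. In the sketch below, `fake_decoder_step` is a hypothetical stand-in for the decoder's prediction step that simply follows a fixed script; the point is only to show the argmax choice and the two stop conditions working together. Unlike the notebook's function, this sketch leaves the `<END>` token out of the returned string.

```python
import numpy as np

vocab = ["<START>", "<END>", "Ya", "veremos", "."]
index_of = {token: i for i, token in enumerate(vocab)}
scripted = ["Ya", "veremos", ".", "<END>"]

def fake_decoder_step(step_number):
    """Hypothetical stand-in for the decoder: returns token probabilities."""
    probs = np.full(len(vocab), 0.05)
    next_token = scripted[min(step_number, len(scripted) - 1)]
    probs[index_of[next_token]] = 0.8
    return probs

def greedy_decode(max_tokens):
    decoded = []
    # Stop condition 1: the token budget is used up
    while len(decoded) < max_tokens:
        probs = fake_decoder_step(len(decoded))
        token = vocab[int(np.argmax(probs))]  # pick the most probable token
        # Stop condition 2: the model produced the end marker
        if token == "<END>":
            break
        decoded.append(token)
    return " ".join(decoded)

print(greedy_decode(max_tokens=10))  # Ya veremos .
print(greedy_decode(max_tokens=2))   # Ya veremos
```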

And now we can test it all out!

```python
def decode_sequence(test_input):
    encoder_states_value = encoder_model.predict(test_input)
    decoder_states_value = encoder_states_value
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_features_dict['<START>']] = 1.
    decoded_sentence = ''
    decoded_token_count = 0

    stop_condition = False
    while not stop_condition:
        # Run the decoder model to get possible output tokens (with probabilities) & states
        output_tokens, new_decoder_hidden_state, new_decoder_cell_state = decoder_model.predict([target_seq] + decoder_states_value)

        # Choose token with highest probability
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_features_dict[sampled_token_index]
        decoded_sentence += " " + sampled_token
        decoded_token_count += 1

        # Exit condition: either hit max length (counted in tokens) or find stop token.
        # This check and the updates below must be inside the while loop;
        # otherwise stop_condition never changes and the loop runs forever.
        if sampled_token == '<END>' or decoded_token_count >= max_decoder_seq_length:
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        # Update states
        decoder_states_value = [new_decoder_hidden_state, new_decoder_cell_state]

    return decoded_sentence

for seq_index in range(10):
    test_input = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(test_input)
    print('-')
    print('Input sentence:', input_docs[seq_index])
    print('Decoded sentence:', decoded_sentence)
```

The program can be improved by:
1. using a larger data set
2. increasing the size of the model
3. adding more epochs for training
4. converting the one-hot vectors into word embeddings during training

Using word embeddings rather than one-hot vectors lets semantically similar words have similar vector representations, which helps the LSTM generalize.
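One way to see how embeddings relate to one-hot vectors: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The sketch below uses a small random matrix as a stand-in for learned embeddings; the sizes and names are assumptions for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 5, 3
rng = np.random.default_rng(42)

# Stand-in for a learned embedding matrix: one dense row per token
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_index = 2
one_hot = np.zeros(vocab_size)
one_hot[token_index] = 1.0

via_matmul = one_hot @ embedding_matrix       # one-hot vector times matrix
via_lookup = embedding_matrix[token_index]    # direct row lookup

print(np.allclose(via_matmul, via_lookup))  # True
```

An embedding layer performs the row lookup directly, so the model works with short dense vectors instead of long, mostly empty one-hot vectors.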