Since this notebook needs no external data file, it is extremely simple to open in Colab and train your models there. This is highly recommended, since LSTMs take a long time to train.

In [1]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np
import pandas as pd
from datetime import datetime

Using TensorFlow backend.


# **Task 3:**

**We're going to build a network that converts dates from one format into another.**

For example, given a date string such as "14-03-2020", we want our network to read this string character by character and output "The 14th of March 2020".

Since our data is a sequence, each part derives its meaning from the parts before it.
A "2" as the second month character could encode either February or December, depending on which character preceded it. This kind of problem is handled well by recurrent neural networks.
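
To make this context dependence concrete, here is a small sketch in plain Python. It is not part of the assignment code, and the helper name `month_name` is an assumption made for illustration only.

```python
from datetime import datetime

def month_name(short_date):
    """Return the month name encoded in a 'dd-mm-yyyy' string (hypothetical helper)."""
    return datetime.strptime(short_date, '%d-%m-%Y').strftime('%B')

# Both strings have '2' as the second month character (index 4),
# but the character before it decides which month is meant.
for s in ['14-02-2020', '14-12-2020']:
    print(s, '-> month characters', repr(s[3:5]), '->', month_name(s))
```

A model that looked at each character in isolation could not tell these two cases apart, which is why we want a network that carries state from one character to the next.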

We're going to build this network with LSTMs, which are recurrent learning cells.


Below is a model that performs sequence-to-sequence conversion where the input and output have different lengths; the example shown is English-to-French translation. It is similar to the encoder-decoder style of machine translation we have learnt about in class.


![alt text](https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png)






Below is a function that generates the dataset, giving you date entries in two formats for as many days (from 14 April 1990 onwards) as you'd like.
Go ahead and test it to see what values it returns.

### Make Dataset

In [None]:
def make_short_date(dt):
    return dt.strftime('%d-%m-%Y')

def make_long_date(dt):
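    date = dt.strftime('%d')
    # 11, 12 and 13 take 'th', unlike other days ending in 1, 2 or 3
    if date in ('11', '12', '13'):
        suffix = 'th'
    elif date[-1] == '1':
        suffix = 'st'
    elif date[-1] == '2':
        suffix = 'nd'
    elif date[-1] == '3':
        suffix = 'rd'
    else:
        suffix = 'th'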
    month = dt.strftime('%B')
    year = dt.strftime('%Y')
    
    return date + suffix + ' of ' + month + ' ' + year

def make_dataset(n):
    dates = pd.date_range(datetime(1990, 4, 14), periods=n, normalize=True)
    
    x = dates.map(make_short_date).values
    y = dates.map(make_long_date).values
    
    return x, y

In [3]:
x, y = make_dataset(50)
x[:5], y[:5]

(array(['14-04-1990', '15-04-1990', '16-04-1990', '17-04-1990',
        '18-04-1990'], dtype=object),
 array(['14th of April 1990', '15th of April 1990', '16th of April 1990',
        '17th of April 1990', '18th of April 1990'], dtype=object))

We've set some hyperparameters for you here. We're going to start with 10,000 training examples and see how well our model trains with that.

In [None]:
batch_size = 64  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 10000  # Number of samples to train on.

In [5]:
dataset = make_dataset(num_samples)
print(dataset[1][0])

14th of April 1990


### Part 1 - Generation and preparation of dataset

Prepare the dataset for training. The following steps will have to be taken.

We need a total of 3 datasets: 
1. encoder_input (our original data)
2. decoder_input (the target data with start and end tokens added) -> our start token is a "\t" character, and the stop character "\n".
3. decoder_target (target data without a start token, but with an end token) 

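decoder_input and decoder_target differ because decoder_target is decoder_input shifted one step ahead: at each step the decoder sees the previous character and must predict the next one. Once the model is trained, we will pass the decoder a sequence containing only a "\t" and it will generate the rest of the sentence, ending with the "\n" token.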

Here is an example of this format of data for a single sample.

encoder_input: "14-03-2019"
decoder_input: "\tThe 14th of March 2019\n"
decoder_target: "The 14th of March 2019\n"
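
As a minimal sketch of how these three strings relate, the snippet below builds them for the sample above. The helper name `make_decoder_strings` is an assumption for illustration and is not part of the assignment code.

```python
def make_decoder_strings(target_text, start_token='\t', end_token='\n'):
    """Build the decoder input and decoder target strings for one sample."""
    decoder_input = start_token + target_text + end_token
    decoder_target = target_text + end_token
    return decoder_input, decoder_target

encoder_input = '14-03-2019'
decoder_input, decoder_target = make_decoder_strings('The 14th of March 2019')

print(repr(encoder_input))
print(repr(decoder_input))
print(repr(decoder_target))

# The target is the decoder input shifted left by one character.
print(decoder_input[1:] == decoder_target)
```

The shift by one character is what lets the decoder learn "given everything so far, predict the next character".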

Now that we know what the target is for the dataset, it's time to start converting it into a form the network can understand and work with.
We need each sample to be an n*m one-hot numpy array (all 0s, with a single 1 in each row marking the character at that position), where n is the maximum length of the sequence and m is the vocabulary size.

An input would go from "14-05-19" to an array of shape (1, 8, 11), where 1 is the batch size, 8 is the sequence length and 11 is the vocabulary size (the ten digits plus '-').
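
The shape arithmetic can be checked with a short sketch. It assumes an input vocabulary of '-' followed by the ten digits (11 characters); the names `vocab`, `char_to_index` and `one_hot` are illustrative, not part of the assignment code.

```python
import numpy as np

# Assumed input vocabulary: '-' plus the ten digits (11 characters in total)
vocab = ['-'] + [str(d) for d in range(10)]
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(text):
    """Encode a string as a (len(text), len(vocab)) one-hot array."""
    encoded = np.zeros((len(text), len(vocab)))
    for position, ch in enumerate(text):
        encoded[position, char_to_index[ch]] = 1.0
    return encoded

# Add a leading batch dimension of size 1
batch = one_hot('14-05-19')[np.newaxis, ...]
print(batch.shape)
```

Each row contains exactly one 1, so the whole array sums to the sequence length.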

To do this, complete the following:

1. Create a list of all possible vocab for the input and output target data (use a set)
2. Use this set to create a dictionary that can convert characters into ints

    *For instance, you'll have a 'char_2_index' dictionary such that "char_2_index['-'] = 13"*
3. Convert these lists of ints into a 2d numpy array (3d when considering batches)
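
One possible way to carry out the first two steps is sketched below. The function name `build_char_index` and the choice to put the special tokens first and sort the rest are assumptions, so the exact index values will differ from the ones used elsewhere in this notebook.

```python
def build_char_index(texts, extra_tokens=()):
    """Map every character seen in `texts` (plus extra tokens) to an int."""
    chars = set()
    for text in texts:
        chars.update(text)
    # Special tokens first, then the remaining characters in sorted order
    ordered = list(extra_tokens) + sorted(chars - set(extra_tokens))
    return {ch: i for i, ch in enumerate(ordered)}

targets = ['14th of April 1990', '15th of April 1990']
char_2_index = build_char_index(targets, extra_tokens=('\t', '\n'))
index_2_char = {i: ch for ch, i in char_2_index.items()}

# Convert a string to a list of ints and back again
encoded = [char_2_index[ch] for ch in targets[0]]
print(encoded)
print(''.join(index_2_char[i] for i in encoded))
```

Sorting the set gives a reproducible mapping between runs, which matters if you want to reuse a trained model with the same indices.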

In [6]:
inp_list=['-','0','1','2','3','4','5','6','7','8','9']
out_list=['\t','\n',]
inpDict={}
outDict={'-': 0, '0': 1, '1': 2, '2': 3, '3': 4, '4': 5, '5': 6, '6': 7, '7': 8, '8': 9, '9': 10}
date_len=10
max_output_len=27

count=11
input_shape=(date_len,len(inp_list))     # date_len (10) is the length of a 'dd-mm-yyyy' date string
output_shape=(max_output_len,41)
dec_inp_shape=(max_output_len,41)

def populate_outlist(data):
    for i in range(len(data)):
        for j in range(len(data[i])):
            if(data[i][j] not in out_list):
                out_list.append(data[i][j])

def char2array(data):
    last_index=0
    enc_I_array=np.zeros((len(data[0]), *input_shape))
    
    for i in range(len(data[0])):        
        date_array=np.zeros((*input_shape,))
        for k in range(date_len):
            index=inpDict[data[0][i][k]]
            date_array[k][index]=1

        enc_I_array[i]=date_array
    
    dec_out_array=np.zeros((len(data[1]), *output_shape))       #decoded output array
    dec_inp_array=np.zeros((len(data[1]), *dec_inp_shape))
    
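    for i in range(len(data[1])):        
        date_array=np.zeros((*output_shape,))
        dec_array=np.zeros((*dec_inp_shape,))
        dec_array[0][outDict['\t']]=1      # the start token opens the decoder input
        for j in range(len(data[1][i])):
            index=outDict[data[1][i][j]]
            date_array[j][index]=1
            dec_array[j+1][index]=1
        # Append the end token to both arrays so the decoder can learn when to stop
        end=len(data[1][i])
        date_array[end][outDict['\n']]=1
        dec_array[end+1][outDict['\n']]=1

        dec_out_array[i]=date_array
        dec_inp_array[i]=dec_array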
        
    
    return enc_I_array,dec_inp_array,dec_out_array


populate_outlist(dataset[1])
# print(len(out_list))
for i in range(len(inp_list)):      #Populate my_dict
    inpDict[inp_list[i]]=i

# print(inpDict)
    
for j in range(len(out_list)):
    if(out_list[j] not in outDict):
        outDict[out_list[j]]=count
        count+=1

# print(inpDict)
# print(outDict)

enc_inp,dec_inp,dec_output=char2array(dataset)

print(enc_inp)
print(dec_output)


[[[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 1.]
  [0. 1. 0. ... 0. 0. 0.]]

 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 1.]
  [0. 1. 0. ... 0. 0. 0.]]

 [[0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 1.]
  [0. 1. 0. ... 0. 0. 0.]]

 ...

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 1. 0. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 1. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 1. 0. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 1.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [0. 1. 0. ... 0. 0. 0.]
  [0. 0. 1. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]]]
[[[0. 0. 1

**Example:**

Input sentence: 14-04-2019 converted into a 3D tensor would result in the following:


[[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]

  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  
  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
  
  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
  
  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]]


**Part 2 - Setting up the network**

Before we begin, fill in the appropriate variables in the following lines of code and run them to get an overview of what your network will be training with.

In [7]:
print('Number of samples:', len(enc_inp))
print('Number of unique input tokens:', len(inpDict))
print('Number of unique output tokens:', len(outDict))
print('Max sequence length for inputs:', date_len)
print('Max sequence length for outputs:', max_output_len)

Number of samples: 10000
Number of unique input tokens: 11
Number of unique output tokens: 41
Max sequence length for inputs: 10
Max sequence length for outputs: 27


Great, now you have to set up an encoder-decoder network. 

This will require 2 LSTMs:

1. An encoder LSTM (size - latent dimension as we defined above):
  - We'll pass our encoder_input data to this
  - We will run it through the LSTM and keep only the states it returns (discard the network output; we only need the h and c states)
 
2. A decoder LSTM (size - latent dimension):
  - We'll pass our decoder_input data to this (with the '\t' and '\n' tokens added and encoded)
  - We will also pass it a specific initial state (states h and c, taken from the encoder network)
  
Following this LSTM, you will need a dense layer whose size is the number of output tokens (the output vocab) to convert the result into a one-hot encoded target. Figure out which activation this layer requires.
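
As a hint, the dense layer has to produce one probability per output character at each timestep, so that it can be compared against a one-hot target with categorical cross-entropy. The numpy sketch below shows how a softmax does this for a made-up vector of four scores; the function and the example values are illustrative assumptions, not taken from the model.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities along the last axis."""
    # Subtracting the max keeps the exponentials numerically stable
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Made-up scores for a toy vocabulary of 4 characters at one timestep
logits = np.array([[2.0, 0.5, -1.0, 0.0]])
probs = softmax(logits)

print(probs)
print(probs.sum(axis=-1))         # each row sums to 1
print(np.argmax(probs, axis=-1))  # index of the most likely character
```

Taking the argmax of these probabilities is what the prediction loop later in this notebook does to pick the next character.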

In [8]:
encoder_inputs = Input(shape=(None, len(inpDict)))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

Instructions for updating:
Colocations handled automatically by placer.


In [None]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, len(outDict)))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the 
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(len(outDict), activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

In [None]:
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [None]:
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([enc_inp,dec_inp],dec_output,batch_size=1,epochs=20,validation_split=0.2)


**Model structure**

So you have 

  1. (encoder_input) -> encoder LSTM -> (output, states)
  2. (decoder_input, states) -> decoder LSTM -> Dense (decoder_output)
  
For the overall model: 
1. Inputs - [encoder_input, decoder_input]
2. Outputs - [decoder_target]

Model Optimizer - RMSProp
Model Loss - categorical_crossentropy


In [None]:
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [None]:
reverse_input_char_index = dict(
    (i, char) for char, i in inpDict.items())
reverse_target_char_index = dict(
    (i, char) for char, i in outDict.items())

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, len(outDict)))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, outDict['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
           len(decoded_sentence) > max_output_len):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, len(outDict)))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence

In [26]:
for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = enc_inp[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    print('-')
    print('Input sentence:', dataset[0][seq_index])
    print('Decoded sentence:', decoded_sentence)

-
Input sentence: 14-04-1990
Decoded sentence: 14th of April 19900820ec2000
-
Input sentence: 15-04-1990
Decoded sentence: 15th of April 19900820ec2000
-
Input sentence: 16-04-1990
Decoded sentence: 16th of April 19900820ec2000
-
Input sentence: 17-04-1990
Decoded sentence: 17th of April 19900820ec2000
-
Input sentence: 18-04-1990
Decoded sentence: 18th of April 19900820ec2000
-
Input sentence: 19-04-1990
Decoded sentence: 19th of April 19900820ecembe
-
Input sentence: 20-04-1990
Decoded sentence: 20th of April 19900820ec2000
-
Input sentence: 21-04-1990
Decoded sentence: 21th of April 19900820ec2000
-
Input sentence: 22-04-1990
Decoded sentence: 22th of April 19900820ec2000
-
Input sentence: 23-04-1990
Decoded sentence: 23th of April 19900820ecembe
-
Input sentence: 24-04-1990
Decoded sentence: 24th of April 19900820ec2000
-
Input sentence: 25-04-1990
Decoded sentence: 25th of April 19900820ec2000
-
Input sentence: 26-04-1990
Decoded sentence: 26th of April 19900820ec2000
-
Input sent

**Generating results**

Now that you've trained the network, you need to create two smaller subnetworks so that you can use them independently for predictions:

1. An encoder model to give you (encoder_input) -> (model states)
2. A decoder model to give you (model_states + start_token) -> (next character)

You will have to use these as follows: 

  1. Encode the input and retrieve the initial decoder state.
  
  2. Run one step of the decoder with this initial state and a "start of sequence" token as target. The output will be the next target token.
  
  3. Repeat with the current target token and current states.

The following illustration should help clarify this prediction loop. 



![alt text](https://blog.keras.io/img/seq2seq/seq2seq-inference.png)

**Part 3 - Improving result**

Now that you've got a working model, answer the following questions. 

1. What does the model return for a date from 1987? Why?
2. What about a date from 2034?
3. Now try the same date but in year 2134, what does the model return? Why is this so?
4. How do we fix this problem?


Answers:

1. The model gives the correct result, i.e. the correct translation.
2. No, it can't translate this date since it has not seen this year in training.
3. Same: it can't predict the year.
4. Add dates from these years to the training data so that the model can learn weights for them; otherwise those weights will be zero.
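
One possible way to act on answer (4) is sketched below: instead of consecutive days from a fixed start, sample random dates from a much wider range so that more years, including future ones, appear in training. Everything here is an assumption made for illustration: the function names, the 1900 to 2199 range, the fixed seed, and the ordinal helper that gives 11, 12 and 13 the 'th' suffix. It uses only the standard library and keeps the zero-padded day that `strftime('%d')` produces.

```python
import random
from datetime import date, timedelta

def ordinal_suffix(day):
    """Return 'st', 'nd', 'rd' or 'th' for a day of the month."""
    if 11 <= day % 100 <= 13:
        return 'th'
    return {1: 'st', 2: 'nd', 3: 'rd'}.get(day % 10, 'th')

def make_random_dataset(n, start=date(1900, 1, 1), end=date(2199, 12, 31), seed=0):
    """Sample n random dates in [start, end] as (short, long) string pairs."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    span_days = (end - start).days
    x, y = [], []
    for _ in range(n):
        d = start + timedelta(days=rng.randint(0, span_days))
        x.append(d.strftime('%d-%m-%Y'))
        y.append('%02d%s of %s %d' % (d.day, ordinal_suffix(d.day),
                                      d.strftime('%B'), d.year))
    return x, y

x_new, y_new = make_random_dataset(5)
for short, long_form in zip(x_new, y_new):
    print(short, '->', long_form)
```

Because the years are now spread over several centuries, the encoder sees many more digit combinations in the year position, which is what the model was missing for dates such as those in 2034 or 2134.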



In [None]:
# improve the 'generate dataset' function to overcome the limitations you've highlighted in the previous part, use your answer to (4) for this
# code this function below

What did you change in this new version of the function?

How will it help improve model results for the specific data points we mentioned earlier that our model had trouble with?

In [None]:
# Demonstrate the improvement


In [28]:
!pip install numpy==1.16.1
import numpy as np

Collecting numpy==1.16.1
  Downloading https://files.pythonhosted.org/packages/f5/bf/4981bcbee43934f0adb8f764a1e70ab0ee5a448f6505bd04a87a2fda2a8b/numpy-1.16.1-cp36-cp36m-manylinux1_x86_64.whl (17.3MB)
     |████████████████████████████████| 17.3MB 3.4MB/s 
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.8 which is incompatible.
Installing collected packages: numpy
  Found existing installation: numpy 1.16.3
    Uninstalling numpy-1.16.3:
      Successfully uninstalled numpy-1.16.3
Successfully installed numpy-1.16.1


In [1]:

# LSTM for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Using TensorFlow backend.


Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None
Instructions for updating:
Use tf.cast instead.
Epoch 1/3
 3264/25000 [==>...........................] - ETA: 5:59 - loss: 0.6791 - acc: 0.5680

KeyboardInterrupt: ignored