Exercise from Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
Train an Encoder–Decoder model that can convert a date string from one format to another (e.g., from “April 22, 2019” to “2019-04-22”).

In [1]:
import tensorflow as tf

In [2]:
import tensorflow_addons as tfa

In [3]:
import numpy as np
from tensorflow import keras

Let us make the training set. Instead of using a pre-built function, I will build it by hand to practice string handling. Also, even though the randomization method here is not the most computationally efficient, I chose it to have more control.

In [4]:
import random

In [5]:
def dataset_builder(number_of_iterations):
    output = []
    
    day_list = list(range(1, 29))  # days 1-28 only, so every day is valid in every month
    month_list = ["January", "February", "March", "April", "May", "June", 
                  "July", "August", "September", "October", "November", "December"]
    month_number_list = list(range(12))  # 0-based indices into month_list
    list_year = list(range(1809, 2021))  # years 1809-2020 inclusive
    
    for i in range(number_of_iterations):
        day = random.choice(day_list)
        month_number = random.choice(month_number_list)
        month = month_list[month_number]
        year = random.choice(list_year)
        
        long_date = str(month)+" "+str(day)+", "+str(year)
        short_date = str(year)+"-"+str(month_number+1)+"-"+str(day)
        
        time_step = [long_date, short_date]
        output.append(time_step)
    
    return np.array(output)
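
Note that the exercise's target format is zero-padded (“2019-04-22”), while the builder above writes months and days without padding. If padded output is wanted, a format specifier is enough. The sketch below uses a hypothetical helper, `to_iso`, that is not part of the notebook, and it assumes a 1-based month number.

```python
def to_iso(year, month_number, day):
    """Format a date as YYYY-MM-DD with zero-padded month and day.

    month_number is 1-based here (1 = January); this helper is hypothetical.
    """
    return f"{year:04d}-{month_number:02d}-{day:02d}"


print(to_iso(2019, 4, 22))  # 2019-04-22
print(to_iso(1960, 4, 9))   # 1960-04-09
```

Padding would also change the output vocabulary, since '04' and '4' are different tokens.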

In [6]:
len(list(range(1809, 2021)))

212

Testing...

In [7]:
test_dataset = dataset_builder(10)
test_dataset

array([['April 9, 1960', '1960-4-9'],
       ['January 14, 1940', '1940-1-14'],
       ['June 9, 1957', '1957-6-9'],
       ['May 18, 1828', '1828-5-18'],
       ['February 24, 1955', '1955-2-24'],
       ['April 7, 1989', '1989-4-7'],
       ['May 3, 1809', '1809-5-3'],
       ['May 4, 1831', '1831-5-4'],
       ['November 2, 1867', '1867-11-2'],
       ['September 20, 1830', '1830-9-20']], dtype='<U18')

Great, everything works. Let us build datasets of a few different sizes so we can train on each of them.

In [8]:
dataset_500, dataset_5000, dataset_50000 = dataset_builder(500), dataset_builder(5000), dataset_builder(50000)

Now let us preprocess the data. First, we build the dictionaries that map each token to a number. We start by listing all possible tokens.

In [9]:
input_data = [str(year) for year in list(range(1809, 2021))]
for month in ["January", "February", "March", "April", "May", "June", 
                  "July", "August", "September", "October", "November", "December"]:
    input_data.append(month)
for day in list(range(1, 29)):
    input_data.append(str(day)+",")

In [10]:
input_data[-1]

'28,'

In [11]:
output_data = [str(year) for year in list(range(1809, 2021))]
for day in list(range(1, 29)): # month numbers 1-12 are already among these strings, so we do not add them separately
    output_data.append(str(day))

In [12]:
len(input_data), len(output_data)

(252, 240)

Then we create the dictionaries

In [13]:
input_dictionary = dict([(y,x+1) for x,y in enumerate(sorted(set(input_data)))])

In [14]:
output_dictionary = dict([(y,x+1) for x,y in enumerate(sorted(set(output_data)))])

Let us just try the dictionaries out

In [15]:
test_dataset[0][0].split()

['April', '9,', '1960']

In [16]:
test_input = test_dataset[0][0].split()
[input_dictionary[x] for x in test_input]

[241, 240, 163]

In [17]:
input_dictionary['April'],input_dictionary['12,'],input_dictionary['1833']

(241, 4, 35)

In [18]:
test_dataset[0][1].split("-")

['1960', '4', '9']

In [19]:
test_output = test_dataset[0][1].split("-")
[output_dictionary[x] for x in test_output]

[163, 235, 240]

In [20]:
output_dictionary['1833'],output_dictionary['4'],output_dictionary['12']

(35, 235, 4)
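
The ids may look surprising ('4' maps to 235 while '12' maps to 4). This happens because `sorted` compares strings character by character rather than numerically. The sketch below rebuilds the output vocabulary the same way as the cells above and checks the ids just shown.

```python
# Rebuild the output vocabulary: years 1809-2020 plus the numbers 1-28
output_data = [str(year) for year in range(1809, 2021)]
output_data += [str(day) for day in range(1, 29)]

# Ids start at 1, so id 0 is never assigned to a token
output_dictionary = {token: index + 1
                     for index, token in enumerate(sorted(set(output_data)))}

# Lexicographic order: '1' < '10' < '11' < '12' < ... < '18' < '1809' < ...
print(sorted(set(output_data))[:4])  # ['1', '10', '11', '12']
print(output_dictionary["12"], output_dictionary["1833"], output_dictionary["4"])  # 4 35 235
```

The ordering does not matter for training; any one-to-one mapping from tokens to ids would work.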

Good, let us make the sets, already numberized

In [21]:
def preprocess(data, percentage_train, percentage_valid):
    # Inputs: the dataset, the fraction of the data used for training, and the
    # fraction used for validation (both in decimal form, e.g. 0.8 and 0.1).
    # The test fraction is whatever remains after subtracting both.
    # Outputs: X_train, X_valid, X_test, y_train, y_valid, y_test, plus the shifted decoder inputs for each split.
    X_list, y_list = data[:,0], data[:,1]
    X = []
    y = []
    X_decoder = []
    
    decoder_value = len(output_dictionary) + 1 #start-of-sequence id, placed first so the decoder inputs are the targets shifted one step
    
    for i in range(len(X_list)):
        X_i = X_list[i].split()
        y_i = y_list[i].split("-")
        
        X_i_numberized = [input_dictionary[x] for x in X_i] #numberize the data
        y_i_numberized = [output_dictionary[x] for x in y_i]
        
        X_decoder_i = [decoder_value]
        X_decoder_i = X_decoder_i + y_i_numberized[:-1] #creates a list so that we shift every entry by one
        
        X.append(np.array(X_i_numberized))
        y.append(np.array(y_i_numberized))
        X_decoder.append(np.array(X_decoder_i))
    
    X = np.array(X)
    y = np.array(y)
    X_decoder = np.array(X_decoder)
    
    
    train_size = int(len(data)*percentage_train) #int() truncates, so the train and validation sets may be a point or two
    #smaller than requested; the remainder goes to the test set, so no data is lost
    valid_size = int(len(data)*percentage_valid)
    test_size = int(len(data)-train_size-valid_size)
    
    X_train, X_valid, X_test = X[:train_size], X[train_size:train_size+valid_size], X[train_size+valid_size:]
    y_train, y_valid, y_test = y[:train_size], y[train_size:train_size+valid_size], y[train_size+valid_size:]
    X_train_decoder, X_valid_decoder, X_test_decoder = X_decoder[:train_size], X_decoder[train_size:train_size+valid_size], X_decoder[train_size+valid_size:]
    
    return  X_train, X_valid, X_test, y_train, y_valid, y_test, X_train_decoder, X_valid_decoder, X_test_decoder

In [22]:
preprocess(test_dataset, 0.8, 0.1)

(array([[241, 240, 163],
        [245,   6, 143],
        [247, 240, 160],
        [249,  10,  30],
        [244, 229, 158],
        [241, 238, 192],
        [249, 234,  11],
        [249, 235,  33]]),
 array([[250, 203,  69]]),
 array([[252, 204,  32]]),
 array([[163, 235, 240],
        [143,   1,   6],
        [160, 237, 240],
        [ 30, 236,  10],
        [158, 203, 229],
        [192, 235, 238],
        [ 11, 236, 234],
        [ 33, 236, 235]]),
 array([[ 69,   3, 203]]),
 array([[ 32, 240, 204]]),
 array([[241, 163, 235],
        [241, 143,   1],
        [241, 160, 237],
        [241,  30, 236],
        [241, 158, 203],
        [241, 192, 235],
        [241,  11, 236],
        [241,  33, 236]]),
 array([[241,  69,   3]]),
 array([[241,  32, 240]]))
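
The last three arrays are the decoder inputs: each target sequence shifted right by one step, with the start id 241 (`len(output_dictionary) + 1`) in front. This is the usual teacher forcing setup, where the decoder sees the previous correct token at each step. A minimal sketch of the shift on its own, with `shift_right` as a hypothetical name:

```python
def shift_right(target_ids, start_id):
    """Return the decoder input: the start id followed by the target minus its last item."""
    return [start_id] + list(target_ids[:-1])


start_id = 240 + 1        # len(output_dictionary) + 1 in the notebook
target = [163, 235, 240]  # '1960', '4', '9'
print(shift_right(target, start_id))  # [241, 163, 235]
```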

In [23]:
embed_size = 20
vocab_size_input = len(input_data)
vocab_size_output = len(output_data)

In [24]:
units = 252

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

encoder_embeddings = keras.layers.Embedding(vocab_size_input + 1, embed_size)(encoder_inputs)

decoder_embedding_layer = keras.layers.Embedding(vocab_size_output + 2, embed_size)  # decoder ids come from the output vocabulary (0..240) plus the start id 241
decoder_embeddings = decoder_embedding_layer(decoder_inputs)

encoder = keras.layers.LSTM(units, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(units)
output_layer = keras.layers.Dense(vocab_size_output + 1)

decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell,sampler,output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(decoder_embeddings,initial_state=encoder_state)
Y_proba = keras.layers.Activation("softmax")(final_outputs.rnn_output)
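
An `Embedding` layer needs an `input_dim` strictly greater than the largest id it will receive, and the output `Dense` layer needs one unit per possible target id. A quick sanity check of the minimum sizes, assuming the vocabularies built earlier (252 input tokens, 240 output tokens, ids starting at 1, start id 241):

```python
vocab_size_input = 252   # len(input_data) in the notebook
vocab_size_output = 240  # len(output_data) in the notebook
start_id = vocab_size_output + 1  # id of the decoder start token

# Largest id each layer must handle
max_encoder_id = vocab_size_input  # ids 1..252, 0 unused
max_decoder_input_id = start_id    # ids 1..240 plus the start id 241
max_target_id = vocab_size_output  # targets never contain the start id

print(max_encoder_id + 1)        # 253: minimum input_dim of the encoder embedding
print(max_decoder_input_id + 1)  # 242: minimum input_dim of the decoder embedding
print(max_target_id + 1)         # 241: units needed in the output Dense layer
```

Larger values than these minimums also work; they only add unused rows or units.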

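Let us run the model for each set. Note that model_500 and model_5000 below are built from the same input and output tensors, so they share the same layers and weights: model_5000 starts from the weights model_500 has already learned rather than from scratch.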

In [25]:
X_train, X_valid, X_test, y_train, y_valid, y_test, X_train_decoder, X_valid_decoder, X_test_decoder = preprocess(dataset_500, 0.8,0.1)

In [26]:
X_train.shape

(400, 3)
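
The 400 rows follow from the size arithmetic inside `preprocess`. A small sketch that mirrors it, with `split_sizes` as a hypothetical helper name:

```python
def split_sizes(n_samples, percentage_train, percentage_valid):
    """Mirror the arithmetic in preprocess: truncate, then give the remainder to the test set."""
    train_size = int(n_samples * percentage_train)
    valid_size = int(n_samples * percentage_valid)
    test_size = n_samples - train_size - valid_size
    return train_size, valid_size, test_size


print(split_sizes(500, 0.8, 0.1))   # (400, 50, 50)
print(split_sizes(5000, 0.8, 0.1))  # (4000, 500, 500)
```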

In [27]:
model_500 = keras.models.Model(inputs=[encoder_inputs, decoder_inputs],outputs=[Y_proba])

In [28]:
optimizer = keras.optimizers.Nadam()
model_500.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [29]:
history = model_500.fit([X_train, X_train_decoder], y_train, epochs=20, validation_data=([X_valid, X_valid_decoder], y_valid))

Train on 400 samples, validate on 50 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [30]:
model_500.evaluate([X_test, X_test_decoder], y_test)



[3.352668743133545, 0.35333332]

In [31]:
X_train, X_valid, X_test, y_train, y_valid, y_test, X_train_decoder, X_valid_decoder, X_test_decoder = preprocess(dataset_5000, 0.8,0.1)

In [32]:
model_5000 = keras.models.Model(inputs=[encoder_inputs, decoder_inputs],outputs=[Y_proba])

In [33]:
optimizer = keras.optimizers.Nadam()
model_5000.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

In [34]:
history = model_5000.fit([X_train, X_train_decoder], y_train, epochs=20, validation_data=([X_valid, X_valid_decoder], y_valid))

Train on 4000 samples, validate on 500 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [35]:
model_5000.evaluate([X_test, X_test_decoder], y_test)



[0.036339461877942084, 0.9993333]
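
The accuracy reported here appears to be counted per output token rather than per date: 0.9993333 on 500 test dates (1,500 tokens) is consistent with a single wrong token. It is also measured with the correct previous tokens fed to the decoder. The sketch below contrasts per-token accuracy with exact-match accuracy on made-up predictions; both function names are hypothetical.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of individual tokens predicted correctly."""
    pairs = [(t, p)
             for true_seq, pred_seq in zip(y_true, y_pred)
             for t, p in zip(true_seq, pred_seq)]
    return sum(t == p for t, p in pairs) / len(pairs)


def sequence_accuracy(y_true, y_pred):
    """Fraction of dates where every token is correct."""
    return sum(list(t) == list(p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [[163, 235, 240], [143, 1, 6]]
y_pred = [[163, 235, 240], [143, 1, 7]]  # made-up predictions, one wrong token
print(token_accuracy(y_true, y_pred))     # 5 of 6 tokens correct
print(sequence_accuracy(y_true, y_pred))  # 0.5
```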

Let us make a function that makes predictions. First we must reverse the dictionary

In [36]:
dictionary_reverse = dict((v, k) for k, v in output_dictionary.items())

Finally

In [37]:
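def prediction(input_string):
    # Numberize the input date, e.g. "April 22, 2019" -> three token ids
    input_numbers = np.array([input_dictionary[x] for x in input_string.split()]).reshape(1, 3)

    # The model was trained with teacher forcing, so there is no target to feed at
    # inference time: start from the start id and feed each prediction back in
    decoder_numbers = np.full((1, 1), len(output_dictionary) + 1)
    for _ in range(3):
        probas = model_5000.predict([input_numbers, decoder_numbers])
        next_id = np.argmax(probas[0, -1])
        decoder_numbers = np.concatenate([decoder_numbers, [[next_id]]], axis=1)

    # Drop the start id and map the ids back to strings ("?" for an id outside the dictionary)
    tokens = [dictionary_reverse.get(int(x), "?") for x in decoder_numbers[0, 1:]]
    return "-".join(tokens)

On my birthday

In [38]:
prediction("February 4, 1998")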