<h1> Chatbot with a seq2seq Model</h1> 

<h3>Introduction</h3>
<p>Using a "sequence to sequence" model is one of the most stimulating ways to develop a chatbot, but also one of the most computationally demanding. The model must be trained on a large corpus of dialogues: it receives a message as input and produces another message as output.</p>
<p>I found the algorithm used in the fourth week of the HSE course on NLP (https://github.com/hse-aml/natural-language-processing/tree/master/week4) very clean and clear, and I will use it as a base model, at least for structuring the task. There are several useful datasets for our purpose; as suggested by HSE, I will start by relying on the Cornell Movie Dialogues.</p>

<h4>Data exploration and preparation</h4> 
<p>First of all I need to prepare our corpus of data. It is useful to reuse the functions already written for this purpose in <a href="https://github.com/Conchylicultor/DeepQA" target="_blank">DeepQA</a>. I import the files 'cornelldata.py' and 'textdata.py'.</p>

In [1]:
from datasets import *

In [79]:
dataset_path = 'data/cornell/' 
max_sentence_len = 50  # I use just short sentences for the beginning  

data = readCornellData(dataset_path, max_len = max_sentence_len)

100%|██████████| 83097/83097 [00:05<00:00, 15465.27it/s]


<h4>Now, we just explore the data.</h4> 
<p>It is necessary to always keep in mind the size of the dataset and how it is structured. It is equally important to be clear about its content: in this case each item is a question / answer pair.</p>

In [80]:
initial_data_len = len(data)
print('Size of our dataset: ', initial_data_len, '\n')
print('Three lines of our dataset: ', data[:3], '\n')

print('The same lines in a more readable form: ')
for line in data[:8]:
    que, ans = line
    print(' Q:', que, '\n', 'A:', ans)

Size of our dataset:  101349 

Three lines of our dataset:  [('gosh if only we could find kat a boyfriend', 'let me see what i can do'), ('cesc ma tete this is my head', 'right see youre ready for the quiz'), ('thats because its such a nice one', 'forget french')] 

The same lines in a more readable form: 
 Q: gosh if only we could find kat a boyfriend 
 A: let me see what i can do
 Q: cesc ma tete this is my head 
 A: right see youre ready for the quiz
 Q: thats because its such a nice one 
 A: forget french
 Q: there 
 A: where
 Q: you have my word as a gentleman 
 A: youre sweet
 Q: hi 
 A: looks like things worked out tonight huh
 Q: you know chastity 
 A: i believe we share an art instructor
 Q: have fun tonight 
 A: tons


<h4>Data preparations</h4>
<p>I prepare the sentences for our training. For this purpose I rely on the Python NLTK module (<a href="https://www.nltk.org/" target="_blank">Natural Language Toolkit</a>), in particular on its list of stopwords (the filter below does not use them yet). I then create a function that receives a sentence and applies a filter that simplifies it for our purposes: it converts the sentence to lowercase, removes symbols outside the alphabet and collapses unnecessary spacing.</p>

In [81]:
import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def prepare_text(sentence):
    '''A filter function to prepare our sentences.'''
    
    # Raw strings avoid invalid-escape warnings in the patterns.
    GOOD_SYMBOLS_RE = re.compile(r'[^a-z ]')
    REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;#+_]')
    REPLACE_SEVERAL_SPACES = re.compile(r'\s+')

    sentence = sentence.lower()                            # lowercase everything
    sentence = REPLACE_BY_SPACE_RE.sub(' ', sentence)      # separators become spaces
    sentence = GOOD_SYMBOLS_RE.sub('', sentence)           # drop anything outside a-z and space
    sentence = REPLACE_SEVERAL_SPACES.sub(' ', sentence)   # collapse repeated whitespace
    
    return sentence

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marcofosci/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
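<p>To see what the filter does to a single sentence, here is a minimal, self-contained sketch. The function name 'clean_sentence' and the sample string are my own assumptions for illustration; the three patterns are the ones used in 'prepare_text'.</p>

```python
import re

# Same three patterns as in prepare_text, compiled once at module level.
GOOD_SYMBOLS_RE = re.compile(r'[^a-z ]')
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;#+_]')
REPLACE_SEVERAL_SPACES = re.compile(r'\s+')

def clean_sentence(sentence):
    '''Hypothetical stand-in for prepare_text, applying the same four steps.'''
    sentence = sentence.lower()
    sentence = REPLACE_BY_SPACE_RE.sub(' ', sentence)
    sentence = GOOD_SYMBOLS_RE.sub('', sentence)
    sentence = REPLACE_SEVERAL_SPACES.sub(' ', sentence)
    return sentence

sample = "Hello, World!  It's #1."   # made-up sentence, not taken from the corpus
cleaned = clean_sentence(sample)
print(repr(cleaned))                 # 'hello world its '
```

<p>Note the trailing space in the result: the filter collapses whitespace but does not strip it, so a call to 'strip()' could be added if that matters.</p>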


<br/>
<p>I create a function to apply the filter above to all the dataset:</p>

In [82]:
def prepare_data(data):
    '''A utility function to prepare the whole dataset'''

    new_data = []
    for line in data:
        new_line = []
        for sentence in line:    
            new_line.append(prepare_text(sentence))
        new_data.append(new_line)
        
    return new_data

In [83]:
data = prepare_data(data)

<h4>A dictionary</h4>
<p>At this point I create a dictionary of the symbols that are allowed. It is the dictionary that the neural network will use to interpret the inputs and generate the outputs. In our case the dictionary contains the lowercase letters of the alphabet, the space, and three symbols ('^', '$', '#') that we will use as start symbol, end symbol and padding respectively.</p>

In [84]:
letter2id = {symbol:i for i, symbol in enumerate('^$#abcdefghijklmnopqrstuvwxyz ')}
id2letter = {i:symbol for symbol, i in letter2id.items()}

start_symbol = '^'
end_symbol = '$'
padding_symbol = '#'

Just to get a clear view of our dictionary and conversion tools:

In [85]:
print(letter2id, ' \n', id2letter)

{'^': 0, '$': 1, '#': 2, 'a': 3, 'b': 4, 'c': 5, 'd': 6, 'e': 7, 'f': 8, 'g': 9, 'h': 10, 'i': 11, 'j': 12, 'k': 13, 'l': 14, 'm': 15, 'n': 16, 'o': 17, 'p': 18, 'q': 19, 'r': 20, 's': 21, 't': 22, 'u': 23, 'v': 24, 'w': 25, 'x': 26, 'y': 27, 'z': 28, ' ': 29}  
 {0: '^', 1: '$', 2: '#', 3: 'a', 4: 'b', 5: 'c', 6: 'd', 7: 'e', 8: 'f', 9: 'g', 10: 'h', 11: 'i', 12: 'j', 13: 'k', 14: 'l', 15: 'm', 16: 'n', 17: 'o', 18: 'p', 19: 'q', 20: 'r', 21: 's', 22: 't', 23: 'u', 24: 'v', 25: 'w', 26: 'x', 27: 'y', 28: 'z', 29: ' '}
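<p>A short sketch of how the two mappings work together. The helper names 'encode' and 'decode' are assumptions of mine, not functions used elsewhere in this notebook.</p>

```python
alphabet = '^$#abcdefghijklmnopqrstuvwxyz '
letter2id = {symbol: i for i, symbol in enumerate(alphabet)}
id2letter = {i: symbol for symbol, i in letter2id.items()}

def encode(text):
    '''Hypothetical helper: characters -> ids (raises KeyError on unknown symbols).'''
    return [letter2id[ch] for ch in text]

def decode(ids):
    '''Hypothetical helper: ids -> characters.'''
    return ''.join(id2letter[i] for i in ids)

ids = encode('hi there')
print(ids)
print(decode(ids))
```

<p>Any character outside the dictionary (an uppercase letter, a digit) raises a KeyError, which is why the sentences have to pass through the filter first.</p>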


<h4>Converting the Dataset to make it digestible to the Neural Network</h4>
<p>I create a function that converts a sentence into a padded sequence of symbol indices, and then another function that does the opposite and converts a sequence of indices back into symbols.</p>

In [86]:
def sentence_to_ids(sentence, symbol2id, padded_len):
    ''' 
    Converts a sequence of symbols to a padded sequence of their ids.
    
    Input:
        sentence: (str), a sequence of our dictionary symbols
        symbol2id: (dict), a mapping from original symbols to ids
        padded_len: (int), the desirable length of the sequence.

    Output: 
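        sent_ids: (list) a list of ids 
        sent_len: (int) the number of ids up to and including the end symbol.
    '''
    
    # Convert at most padded_len characters to ids.
    sent_ids = [symbol2id[sentence[i]] for i in range(min(len(sentence),padded_len))]
    
    # Mark the end of the sentence with '$'.
    # Note: when the sentence is longer than padded_len the end symbol is appended,
    # so the result holds padded_len + 1 ids.
    if len(sentence) == padded_len:
        sent_ids[-1] = symbol2id['$']
    else:
        sent_ids += [symbol2id['$']]
    
    # Fill the remaining positions with the padding symbol '#'.
    for i in range(padded_len-len(sentence)-1): 
        sent_ids += [symbol2id['#']]                     
        
    # Position of the end symbol, counted from 1.
    sent_len = sent_ids.index(symbol2id['$']) + 1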
    
    return sent_ids, sent_len

In [87]:
def ids_to_sentence(ids_sentence, id2symbol):
    ''' Converts a sequence of ids into a sequence of symbols'''
    return [id2symbol[i] for i in ids_sentence]
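<p>When a sentence is longer than 'padded_len', 'sentence_to_ids' returns 'padded_len + 1' ids. If every row must have exactly 'padded_len' ids, a variant along these lines could be used. This is only a sketch: the name 'pad_sentence' and its two keyword arguments are assumptions of mine, and it is not the function used in the rest of the notebook.</p>

```python
alphabet = '^$#abcdefghijklmnopqrstuvwxyz '
letter2id = {symbol: i for i, symbol in enumerate(alphabet)}

def pad_sentence(sentence, symbol2id, padded_len, end_symbol='$', padding_symbol='#'):
    '''Hypothetical variant of sentence_to_ids that always returns padded_len ids.'''
    # Keep at most padded_len - 1 characters so the end symbol always fits.
    ids = [symbol2id[ch] for ch in sentence[:padded_len - 1]]
    ids.append(symbol2id[end_symbol])
    sent_len = len(ids)
    # Fill the remaining positions with the padding symbol.
    ids += [symbol2id[padding_symbol]] * (padded_len - sent_len)
    return ids, sent_len

long_ids, long_len = pad_sentence('gosh if only we could find kat a boyfriend', letter2id, 20)
short_ids, short_len = pad_sentence('hi', letter2id, 20)
print(len(long_ids), long_len)
print(len(short_ids), short_len)
```

<p>For sentences shorter than 'padded_len' the result is the same as with 'sentence_to_ids'; only the long sentences lose one more character to make room for '$'.</p>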

<p>Here is an example of their behaviour. <br/> I consider just the first eight lines of our data.</p>

In [88]:
Xa, Ya = [], []
for line in data[:8]:
    que, ans = line
    Xa.append(que)
    Ya.append(ans)
    
print(Xa, '\n', Ya)

['gosh if only we could find kat a boyfriend', 'cesc ma tete this is my head', 'thats because its such a nice one', 'there', 'you have my word as a gentleman', 'hi', 'you know chastity', 'have fun tonight'] 
 ['let me see what i can do', 'right see youre ready for the quiz', 'forget french', 'where', 'youre sweet', 'looks like things worked out tonight huh', 'i believe we share an art instructor', 'tons']


<p>The input data of our neural network must have the same length within a batch. For this reason the conversion pads each line to 'n' indices (in our case 20). </p>
<p>To get an idea of what the indices represent we can convert our list back with the function 'ids_to_sentence'. Each sentence ends with the symbol '$'. When the sentence is shorter than 20 characters, it is followed by as many '#' symbols as there are missing characters. Sentences longer than 20 characters are cut to their first 20 characters and the '$' is appended, so these lines hold 21 indices.</p>
<p>In addition to the list of indices, the 'sentence_to_ids' function also returns the length of each sentence, counting the final '$'.</p>

In [210]:
se = []
sl = []
for x in Xa:
    sentence, s_len = sentence_to_ids(x, letter2id, 20)
    se.append(sentence)
    print(sentence)
    sl.append(s_len)
    
print('')
for j in se:
    print(''.join(ids_to_sentence(j, id2letter)))

print('\nThe length of each sentence:', sl)

[9, 17, 21, 10, 29, 11, 8, 29, 17, 16, 14, 27, 29, 25, 7, 29, 5, 17, 23, 14, 1]
[5, 7, 21, 5, 29, 15, 3, 29, 22, 7, 22, 7, 29, 22, 10, 11, 21, 29, 11, 21, 1]
[22, 10, 3, 22, 21, 29, 4, 7, 5, 3, 23, 21, 7, 29, 11, 22, 21, 29, 21, 23, 1]
[22, 10, 7, 20, 7, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[27, 17, 23, 29, 10, 3, 24, 7, 29, 15, 27, 29, 25, 17, 20, 6, 29, 3, 21, 29, 1]
[10, 11, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[27, 17, 23, 29, 13, 16, 17, 25, 29, 5, 10, 3, 21, 22, 11, 22, 27, 1, 2, 2]
[10, 3, 24, 7, 29, 8, 23, 16, 29, 22, 17, 16, 11, 9, 10, 22, 1, 2, 2, 2]

gosh if only we coul$
cesc ma tete this is$
thats because its su$
there$##############
you have my word as $
hi$#################
you know chastity$##
have fun tonight$###

The length of each sentence: [21, 21, 21, 6, 21, 3, 18, 17]


<h4>Batch generation</h4>
<p>We do not pass the whole dataset into the neural network at one time. We will use mini-batches. We therefore create a function that allows us to create them starting from the initial dataset.</p>

In [90]:
def batch_to_ids(sentences, symbol2id, max_len):
    '''
    Prepares batches of indices: all sequences are padded to match the longest sequence in the batch
       plus the end symbol; if that is longer than max_len, then max_len is used instead.
    
    Input:
        sentences: (list of str), the original sequences
        symbol2id: (dict), a mapping from original symbols to ids
        max_len: (int), max len of sequences allowed.
    
    Output:
        batch_ids: (list), lists of ids, 
        batch_ids_len: (list), actual lengths.
    '''
    
    max_len_in_batch = min(max(len(s) for s in sentences) + 1, max_len)
    
    batch_ids, batch_ids_len = [], []
    
    for sent in sentences:
        ids, ids_len = sentence_to_ids(sent, symbol2id, max_len_in_batch)
        batch_ids.append(ids)
        batch_ids_len.append(ids_len)
    
    return batch_ids, batch_ids_len
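<p>The only new logic in 'batch_to_ids' is the choice of the padded length. A minimal sketch of that rule on its own, with a helper name ('padded_len_for_batch') that I made up for the example:</p>

```python
def padded_len_for_batch(sentences, max_len):
    '''Hypothetical helper: longest sentence plus one slot for '$', capped at max_len.'''
    return min(max(len(s) for s in sentences) + 1, max_len)

batch = ['there', 'hi', 'you know chastity']   # three questions from the dataset

print(padded_len_for_batch(batch, max_len=50))  # 18: 17 characters plus the end symbol
print(padded_len_for_batch(batch, max_len=6))   # 6: the cap wins
```

<p>Padding to the longest sentence of the batch, rather than to a global maximum, keeps batches of short sentences small.</p>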

<p>We split our dataset into batches of size batch_size:</p>


In [91]:
def generate_batches(samples, batch_size):
    '''
    A generator of batches of size batch_size     
    '''
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        yield X, Y

<p>An example of the two previous functions: from the first 20 pairs we generate 4 batches, the last one with only 2 elements.<br/>
I use just the xxx value (only the questions of our initial data pairs), but we can do the same for our labels yyy.</p>

In [215]:
xxx, yyy = [],[]
for xxx, yyy in generate_batches(data[:20], 6):
    print(xxx)
    xxx_ = batch_to_ids(xxx, letter2id, max_len = 6)
    print(xxx_)

['gosh if only we could find kat a boyfriend', 'cesc ma tete this is my head', 'thats because its such a nice one', 'there', 'you have my word as a gentleman', 'hi']
([[9, 17, 21, 10, 29, 11, 1], [5, 7, 21, 5, 29, 15, 1], [22, 10, 3, 22, 21, 29, 1], [22, 10, 7, 20, 7, 1], [27, 17, 23, 29, 10, 3, 1], [10, 11, 1, 2, 2, 2]], [7, 7, 7, 6, 7, 3])
['you know chastity', 'have fun tonight', 'i was', 'well no', 'then thats all you had to say', 'but']
([[27, 17, 23, 29, 13, 16, 1], [10, 3, 24, 7, 29, 8, 1], [11, 29, 25, 3, 21, 1], [25, 7, 14, 14, 29, 16, 1], [22, 10, 7, 16, 29, 22, 1], [4, 23, 22, 1, 2, 2]], [7, 7, 6, 7, 7, 4])
['do you listen to this crap', 'i figured youd get to the good stuff eventually', 'what good stuff', 'the real you', 'no', 'wow']
([[6, 17, 29, 27, 17, 23, 1], [11, 29, 8, 11, 9, 23, 1], [25, 10, 3, 22, 29, 9, 1], [22, 10, 7, 29, 20, 7, 1], [16, 17, 1, 2, 2, 2], [25, 17, 25, 1, 2, 2]], [7, 7, 7, 7, 3, 4])
['she okay', 'they do to']
([[21, 10, 7, 29, 17, 13, 1], [22, 10, 7

<h3>The Neural Network</h3>
<p>Encoder-Decoder is an architecture suited to tasks where we have a sequence of input data and need to generate a sequence of output data (sequence to sequence models). Simplifying, with this kind of neural network we divide the problem into two parts: in the first one, the encoding phase, we read the sequence of input data and build an abstract representation of it; in the second one, the decoding phase, we start from that abstract representation to generate the output data.</p> 
<p>Basically, I will use two Recurrent Neural Networks: the first one encodes the input sequence into a real-valued vector, and the second one decodes this vector into the output sequence.</p>
<p>To create the neural network I will use TensorFlow (the cells below rely on the 1.x API, such as tf.placeholder and tf.contrib).</p>

In [93]:
import tensorflow as tf
print(tf.__version__)

1.1.0


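<p>I create the model as a class and I call it <b>Model</b>. Its methods are written as plain functions, one per cell, and then attached to the class with 'classmethod', so the tensors they create are stored on the class itself.</p>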

In [94]:
class Model(object):
    pass

<p>The structure of our model includes: the declaration of the placeholders, the creation of the embeddings, the encoding phase, the decoding phase, the calculation of the loss, the optimization, and then two separate outputs for the predictions, one for the training phase and one to generate answers for new sentences.</p>

In [95]:
def init_model(self, vocab_size, embeddings_size, hidden_size, 
               max_iter, start_symbol_id, end_symbol_id, padding_symbol_id):
    
    self.__declare_placeholders()
    self.__create_embeddings(vocab_size, embeddings_size)
    self.__build_encoder(hidden_size)
    self.__build_decoder(hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id)
    
    # Compute loss and back-propagate.
    self.__compute_loss()
    self.__perform_optimization()
    
    # Get predictions for evaluation.
    self.train_predictions = self.train_outputs.sample_id
    self.infer_predictions = self.infer_outputs.sample_id

In [96]:
Model.__init__ = classmethod(init_model)

<p>I declare the placeholders that will contain the input batches and their lengths, the output labels and their lengths, and then two parameters: the learning rate and the dropout keep probability.</p>

In [97]:
def declare_placeholders(self):
    '''
    Specifies placeholders for the model
    '''
    
    # Placeholders for input and its actual lengths.
    self.input_batch = tf.placeholder(shape=(None, None), dtype=tf.int32, name='input_batch')
    self.input_batch_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='input_batch_lengths')
   
    # Placeholders for groundtruth and its actual lengths.
    self.ground_truth = tf.placeholder(shape=(None, None), dtype=tf.int32, name='ground_truth') 
    self.ground_truth_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='ground_truth_lengths') 

    self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
    self.learning_rate_ph = tf.placeholder(shape=[], dtype=tf.float32) 

In [98]:
Model.__declare_placeholders = classmethod(declare_placeholders)

<p>I start to build the layers of the neural network. I initialize the embeddings matrix with random values:</p>

In [99]:
def create_embeddings(self, vocab_size, embeddings_size):
    '''
    Specifies embeddings layer and embeds an input batch
    '''

    random_initializer = tf.random_uniform((vocab_size, embeddings_size), -1.0, 1.0)
    self.embeddings = tf.Variable(random_initializer, name='embeddings', dtype = tf.float32)  
    
    # Perform embeddings lookup for self.input_batch. 
    self.input_batch_embedded = tf.nn.embedding_lookup(self.embeddings, self.input_batch)  

In [100]:
Model.__create_embeddings = classmethod(create_embeddings)
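<p>To make the lookup concrete, here is a NumPy sketch of what I understand the embedding layer to compute: each id selects one row of the matrix. The sizes match the ones used later (30 symbols, 20 dimensions); the seed and the two example rows are arbitrary assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for reproducibility only
vocab_size, embeddings_size = 30, 20

# One row of random values in [-1, 1) per symbol of the dictionary.
embeddings = rng.uniform(-1.0, 1.0, size=(vocab_size, embeddings_size))

# A batch of two padded sentences: 'hi$#' and 'the$'.
input_batch = np.array([[10, 11, 1, 2],
                        [22, 10, 7, 1]])

# Integer indexing picks one embedding row per id.
input_batch_embedded = embeddings[input_batch]
print(input_batch_embedded.shape)         # (2, 4, 20)
```

<p>The result has one more dimension than the input: batch size, sequence length, embedding size.</p>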

<h4>Encoding phase</h4>
<p>I encode the input sequences to a real-valued vector. Input of the RNN is an embedded input batch. Since sentences in the same batch could have different actual lengths, I provide input lengths to avoid unnecessary computations. The final encoder state will be passed to a second RNN (decoder).</p>
<p>I create the GRU cell and insert it into the Recurrent Neural Network. I use dropout on the cell inputs to reduce the risk of overfitting.</p> 

In [101]:
def build_encoder(self, hidden_size):
    '''
    Specifies encoder architecture and computes its output
    '''

    # Create GRUCell with dropout.
    encoder_cell = tf.contrib.rnn.DropoutWrapper(
        tf.contrib.rnn.GRUCell (num_units = hidden_size), 
        input_keep_prob = self.dropout_ph,
        dtype = tf.float32
        ) 
    
    # Create RNN with the predefined cell.
    _, self.final_encoder_state = tf.nn.dynamic_rnn(
        cell = encoder_cell,
        inputs = self.input_batch_embedded,
        sequence_length = self.input_batch_lengths,
        dtype = tf.float32
        ) 

In [102]:
Model.__build_encoder = classmethod(build_encoder)
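<p>The role of the sequence lengths is easier to see in a small NumPy sketch. It uses one common formulation of the GRU, without biases, which is not necessarily the exact parameterization of the TensorFlow cell; all names and sizes here are assumptions for illustration. The point is that the state of a sentence stops changing once its real length is reached, so padding does not influence the final encoder state.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_size, hidden_size, rng):
    '''Random weights for a bias-free GRU (illustrative values only).'''
    shapes = {'Wz': (input_size, hidden_size), 'Uz': (hidden_size, hidden_size),
              'Wr': (input_size, hidden_size), 'Ur': (hidden_size, hidden_size),
              'Wh': (input_size, hidden_size), 'Uh': (hidden_size, hidden_size)}
    return {name: rng.normal(0.0, 0.1, size=shape) for name, shape in shapes.items()}

def gru_step(x, h, p):
    '''One GRU step for a whole batch.'''
    z = sigmoid(x @ p['Wz'] + h @ p['Uz'])                  # update gate
    r = sigmoid(x @ p['Wr'] + h @ p['Ur'])                  # reset gate
    h_candidate = np.tanh(x @ p['Wh'] + (r * h) @ p['Uh'])  # candidate state
    return (1.0 - z) * h + z * h_candidate

def encode(inputs, lengths, p, hidden_size):
    '''Runs the GRU over time and freezes each row after its real length.'''
    batch_size, max_time, _ = inputs.shape
    lengths = np.asarray(lengths)
    h = np.zeros((batch_size, hidden_size))
    for t in range(max_time):
        h_new = gru_step(inputs[:, t, :], h, p)
        still_running = (t < lengths)[:, None]   # False once a row is past its end
        h = np.where(still_running, h_new, h)
    return h

rng = np.random.default_rng(0)
params = init_gru(input_size=4, hidden_size=8, rng=rng)
inputs = rng.normal(size=(2, 5, 4))   # batch of 2, 5 time steps, 4 features
lengths = [5, 2]                      # the second row has 3 padded steps
final_state = encode(inputs, lengths, params, hidden_size=8)
print(final_state.shape)              # (2, 8)
```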

<h4>Decoding phase</h4>
<p>Now I have to generate the output sequences. To do this, I create a second RNN that will act as a decoder.</p>
<p>During training the decoder also uses information about the true labels. However, during the prediction stage (which I call <i>inference</i>), the decoder can only feed its own output from the previous step into the next step. To separate the training phase from the inference phase I create two distinct decoders that share the same weights. TrainingHelper and GreedyEmbeddingHelper provide the two different ways of choosing the next input.</p>

<p>The decoding layer is also made up of GRU cells.</p>

In [180]:
def build_decoder(self, hidden_size, vocab_size, max_iter, start_symbol_id, end_symbol_id):
    '''
    Specifies decoder architecture and computes the output.
    
        Uses different helpers:
          - for train: feeding ground truth
          - for inference: feeding generated output

        As a result, self.train_outputs and self.infer_outputs are created. 
        Each of them contains two fields:
          rnn_output (predicted logits)
          sample_id (predictions).
    '''

    # Use start symbols as the decoder inputs at the first time step.
    batch_size = tf.shape(self.input_batch)[0]
    start_tokens = tf.fill([batch_size], start_symbol_id)
    ground_truth_as_input = tf.concat([tf.expand_dims(start_tokens, 1), self.ground_truth], 1)
    
    # Use the embedding layer defined before to look up embeddings for ground_truth_as_input. 
    self.ground_truth_embedded = tf.nn.embedding_lookup(self.embeddings, ground_truth_as_input) 
     
    # Create TrainingHelper for the train stage.
    train_helper = tf.contrib.seq2seq.TrainingHelper(self.ground_truth_embedded, 
                                                     self.ground_truth_lengths)

    # Create GreedyEmbeddingHelper for the inference stage.
    # You should provide the embedding layer, start_tokens and index of the end symbol.
    infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
        embedding = self.embeddings, 
        start_tokens = start_tokens, 
        end_token = end_symbol_id
        ) 
    
  
    def decode(helper, scope, reuse=None):
        '''
        Creates decoder and returns the results of the decoding with a given helper
        '''
        
        with tf.variable_scope(scope, reuse=reuse):
            # Create GRUCell with dropout. Do not forget to set the reuse flag properly.
            decoder_cell = tf.contrib.rnn.DropoutWrapper(
                tf.contrib.rnn.GRUCell (num_units = hidden_size, reuse = reuse), 
                input_keep_prob = self.dropout_ph,
                dtype = tf.float32
                ) 
            
            # Create a projection wrapper.
            decoder_cell = tf.contrib.rnn.OutputProjectionWrapper(decoder_cell, vocab_size, reuse = reuse)
            
            # Create BasicDecoder, pass the defined cell, a helper, and initial state.
            # The initial state should be equal to the final state of the encoder!
            decoder = tf.contrib.seq2seq.BasicDecoder(
                cell = decoder_cell,
                helper = helper,
                initial_state = self.final_encoder_state
                ) 

            # The first returning argument of dynamic_decode contains two fields:
            #   rnn_output (predicted logits)
            #   sample_id (predictions)
            outputs, dec_final_state = tf.contrib.seq2seq.dynamic_decode(
                decoder = decoder, 
                output_time_major = False, 
                impute_finished = True,
                maximum_iterations = max_iter
                )

            return outputs
        
    self.train_outputs = decode(train_helper, 'decode')
    self.infer_outputs = decode(infer_helper, 'decode', reuse=True)

In [181]:
Model.__build_decoder = classmethod(build_decoder)
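<p>The difference between the two helpers can be sketched without TensorFlow. Below, 'toy_step' is a made-up stand-in for the decoder cell (it always votes for the next id in a 5-symbol vocabulary); the two loops are my own simplified reading of what feeding the ground truth and feeding the generated output mean, not the library implementation.</p>

```python
import numpy as np

VOCAB_SIZE = 5   # toy vocabulary, unrelated to the real dictionary

def toy_step(token_id, state):
    '''Stand-in for the decoder cell: votes for (token_id + 1) % VOCAB_SIZE.'''
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_id + 1) % VOCAB_SIZE] = 1.0
    return logits, state

def decode_with_ground_truth(step_fn, state, start_id, ground_truth):
    '''Training-style decoding: the input at each step is the true previous symbol.'''
    inputs = [start_id] + list(ground_truth[:-1])
    predictions = []
    for token_id in inputs:
        logits, state = step_fn(token_id, state)
        predictions.append(int(np.argmax(logits)))
    return predictions

def decode_greedy(step_fn, state, start_id, end_id, max_iter):
    '''Inference-style decoding: the input at each step is the previous prediction.'''
    token_id, predictions = start_id, []
    for _ in range(max_iter):
        logits, state = step_fn(token_id, state)
        token_id = int(np.argmax(logits))
        predictions.append(token_id)
        if token_id == end_id:
            break
    return predictions

train_preds = decode_with_ground_truth(toy_step, None, start_id=0, ground_truth=[4, 4, 3])
infer_preds = decode_greedy(toy_step, None, start_id=0, end_id=3, max_iter=10)
print(train_preds)   # [1, 0, 0]
print(infer_preds)   # [1, 2, 3]
```

<p>With the ground truth as input a wrong prediction does not propagate to the next step; in greedy decoding every prediction becomes the next input, and the loop stops at the end symbol or after max_iter steps.</p>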

<p>And now the <b>loss function</b>:</p>

In [105]:
def compute_loss(self):
    '''
    Computes sequence loss (masked cross-entropy loss with logits)
    '''
    
    weights = tf.cast(tf.sequence_mask(self.ground_truth_lengths), dtype=tf.float32)

    self.loss = tf.contrib.seq2seq.sequence_loss(
        self.train_outputs.rnn_output, 
        self.ground_truth, 
        weights
        )

In [106]:
Model.__compute_loss = classmethod(compute_loss)
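<p>As a sketch of what a masked sequence loss computes, here is a NumPy version. I am assuming the default behaviour of averaging over all non-padded positions of the batch; the function name and the toy inputs are mine.</p>

```python
import numpy as np

def masked_sequence_loss(logits, targets, lengths):
    '''Cross-entropy averaged over the real (non-padded) positions only.

    logits: (batch, time, vocab), targets: (batch, time) ids, lengths: (batch,)
    '''
    batch_size, max_time, _ = logits.shape
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the target symbol at every position.
    batch_idx = np.arange(batch_size)[:, None]
    time_idx = np.arange(max_time)[None, :]
    nll = -log_probs[batch_idx, time_idx, targets]
    # 1.0 for positions inside the sentence, 0.0 for padding.
    mask = (time_idx < np.asarray(lengths)[:, None]).astype(np.float64)
    return (nll * mask).sum() / mask.sum()

logits = np.zeros((2, 4, 30))            # an untrained model: uniform over 30 symbols
targets = np.zeros((2, 4), dtype=int)
lengths = [4, 2]
loss = masked_sequence_loss(logits, targets, lengths)
print(loss, np.log(30))
```

<p>A uniform guess over the 30 symbols of our dictionary gives ln 30, about 3.40, which is close to the loss printed at the very first training step further down.</p>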

... and the <b>optimization function</b> (I use the Adam optimizer and pass the learning rate through a placeholder, so that I can set it at each training step):

In [107]:
def perform_optimization(self):
    '''
    Specifies train_op that optimizes self.loss
    '''
    
    self.train_op = tf.contrib.layers.optimize_loss(
        loss = self.loss,
        global_step = tf.train.get_global_step(),
        learning_rate = self.learning_rate_ph,
        optimizer = 'Adam',
        clip_gradients= 1.0,
        )

In [108]:
Model.__perform_optimization = classmethod(perform_optimization)
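<p>The 'clip_gradients = 1.0' argument limits the size of the update. I assume that a numeric value here means clipping by the global norm of all gradients; under that assumption the operation can be sketched in NumPy as follows (the function name is mine):</p>

```python
import numpy as np

def clip_by_global_norm(gradients, clip_norm):
    '''Rescales all gradients together so that their joint L2 norm is at most clip_norm.'''
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    # The scale is 1.0 when the norm is already small enough.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in gradients], global_norm

big = [np.array([3.0]), np.array([4.0])]      # global norm 5.0
small = [np.array([0.3]), np.array([0.4])]    # global norm 0.5

clipped_big, norm_big = clip_by_global_norm(big, clip_norm=1.0)
clipped_small, norm_small = clip_by_global_norm(small, clip_norm=1.0)
print(clipped_big, norm_big)
print(clipped_small, norm_small)
```

<p>Because all gradients share the same scale factor, the direction of the update is preserved; only its length is reduced.</p>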

<p>The function below transmits the batch data and parameters to the Neural Network:</p>

In [109]:
def train_on_batch(self, session, X, X_seq_len, Y, Y_seq_len, learning_rate, dropout_keep_probability):
    feed_dict = {
            self.input_batch: X,
            self.input_batch_lengths: X_seq_len,
            self.ground_truth: Y,
            self.ground_truth_lengths: Y_seq_len,
            self.learning_rate_ph: learning_rate,
            self.dropout_ph: dropout_keep_probability
        }
    pred, loss, _ = session.run([
            self.train_predictions,
            self.loss,
            self.train_op], feed_dict=feed_dict)

    return pred, loss

In [110]:
Model.train_on_batch = classmethod(train_on_batch)

<p>Finally the two <b>prediction functions</b>, with and without the loss function:</p>

In [111]:
def predict_for_batch(self, session, X, X_seq_len):
    feed_dict = {
            self.input_batch: X,
            self.input_batch_lengths: X_seq_len
        }
    
    pred = session.run([
            self.infer_predictions
            ], feed_dict=feed_dict)[0]
    return pred

def predict_for_batch_with_loss(self, session, X, X_seq_len, Y, Y_seq_len):

    feed_dict = {
            self.input_batch: X,
            self.input_batch_lengths: X_seq_len,
            self.ground_truth: Y,
            self.ground_truth_lengths: Y_seq_len
        }
    
    pred, loss = session.run([
            self.infer_predictions,
            self.loss,
        ], feed_dict=feed_dict)
    return pred, loss

In [112]:
Model.predict_for_batch = classmethod(predict_for_batch)
Model.predict_for_batch_with_loss = classmethod(predict_for_batch_with_loss)

<h4>Training phase</h4>
<p>It is time to establish the parameters for the neural network. I used the sentences of our dataset with a maximum length of 50 characters. As the basic parameters for the moment, I will consider those already tested during the HSE course.</p>

In [119]:
tf.reset_default_graph()

model = Model(
    vocab_size = len(letter2id),
    embeddings_size = 20,
    max_iter = 51,
    hidden_size = 512,
    start_symbol_id = letter2id['^'],
    end_symbol_id = letter2id['$'],
    padding_symbol_id = letter2id['#']
    ) 

batch_size = 128 
n_epochs = 10 
learning_rate = 0.001 
dropout_keep_probability = 0.5 
max_len = 50 

<h4>Training set and Test set</h4>
<p>Before starting the training phase, I divide my dataset into trainset and testset (to be able to test the results later). <br/>The ratio between the two will be 80% and 20%.</p>

In [120]:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [122]:
# Number of full batches per epoch
n_step = int(len(train_set) / batch_size)
print(n_step)

633
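<p>Note that this integer division counts only the full batches, while 'generate_batches' also yields a final, smaller batch when the data does not divide evenly. A small sketch with made-up data (20 pairs, batches of 6):</p>

```python
import math

def generate_batches(samples, batch_size):
    '''Same generator as above, repeated here to keep the example self-contained.'''
    X, Y = [], []
    for i, (x, y) in enumerate(samples, 1):
        X.append(x)
        Y.append(y)
        if i % batch_size == 0:
            yield X, Y
            X, Y = [], []
    if X and Y:
        yield X, Y

samples = [('q%d' % i, 'a%d' % i) for i in range(20)]   # placeholder pairs

batch_sizes = [len(X) for X, Y in generate_batches(samples, 6)]
full_batches = len(samples) // 6
all_batches = math.ceil(len(samples) / 6)

print(batch_sizes)                  # [6, 6, 6, 2]
print(full_batches, all_batches)    # 3 4
```

<p>So an epoch can run one more step than n_step reports; 'math.ceil' gives the number of steps actually executed.</p>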


<p>For convenience I create a log directory that I will use later to evaluate the performance of the algorithm (the trend of the loss and the time taken by the training phase).</p>

In [123]:
from datetime import datetime
now = datetime.utcnow().strftime('%Y_%m_%d_-%H_%M_%S')
root_logdir = 'tf_logs'
logdir = '{}/run-{}'.format(root_logdir, now)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

session = tf.Session()
session.run(tf.global_variables_initializer())
            
invalid_number_prediction_counts = []
all_model_predictions = []
all_ground_truth = []

<p>And now the training phase:</p>

In [124]:
import random  # needed for random.shuffle below, unless already imported above

print('Start training... \n')
for epoch in range(n_epochs):  
    random.shuffle(train_set)
    random.shuffle(test_set)
    
    print('Train: epoch', epoch + 1)
    for n_iter, (X_batch, Y_batch) in enumerate(generate_batches(train_set, batch_size=batch_size)):

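        # prepare the data (X_batch and Y_batch) for training
        # using function batch_to_ids
        # (batch_size is passed as max_len here; since the sentences hold at most
        # 50 characters, this cap should never truncate anything)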
        X, X_seq_len = batch_to_ids(X_batch, letter2id, batch_size) 
        Y, Y_seq_len = batch_to_ids(Y_batch, letter2id, batch_size) 
        
        predictions, loss = Model.train_on_batch(session, X, X_seq_len, Y, Y_seq_len, learning_rate, dropout_keep_probability) 

        if n_iter % 200 == 0:
            file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())            
            print("Epoch: [%d/%d], step: [%d/%d], loss: %f" % (epoch + 1, n_epochs, n_iter + 1, n_step, loss))
                
    X_sent, Y_sent = next(generate_batches(test_set, batch_size=batch_size))

    # prepare test data (X_sent and Y_sent) for predicting 
    # quality and computing value of the loss function
    # using function batch_to_ids
    X, X_seq_len = batch_to_ids(X_sent, letter2id, batch_size) 
    Y, Y_seq_len = batch_to_ids(Y_sent, letter2id, batch_size) 
    
    predictions, loss = Model.predict_for_batch_with_loss(session, X, X_seq_len, Y, Y_seq_len) 
    print('Test: epoch', epoch + 1, 'loss:', loss,)
    for x, y, p  in list(zip(X, Y, predictions))[:3]:
        print('X:',''.join(ids_to_sentence(x, id2letter)))
        print('Y:',''.join(ids_to_sentence(y, id2letter)))
        print('O:',''.join(ids_to_sentence(p, id2letter)))
        print('')

    model_predictions = []
    ground_truth = []
    invalid_number_prediction_count = 0
    # This block appears to be inherited from the HSE course example, where
    # predictions and ground truth are numbers: both are converted to integers
    # to compute metrics, and pairs that fail the conversion are only counted.
    # With dialogue sentences int() always raises ValueError, so here every
    # pair ends up in invalid_number_prediction_count.
    for X_batch, Y_batch in generate_batches(test_set, batch_size=batch_size):

        X, X_seq_len = batch_to_ids(X_batch, letter2id, batch_size) 
        test_ids_predictions = (Model.predict_for_batch(session, X, X_seq_len))
        test_predictions = list(''.join(ids_to_sentence(i, id2letter)) for i in test_ids_predictions)
        test_pred_end = list(k[:k.find('$')] for k in test_predictions)
        
        # convert test predictions and ground truth to integers and count errors
        for z in zip(test_pred_end, Y_batch):
            try:
                model_predictions.append(int(z[0]))
                ground_truth.append(int(z[1]))              
            except ValueError:
                invalid_number_prediction_count += 1

    all_model_predictions.append(model_predictions)
    all_ground_truth.append(ground_truth)
    invalid_number_prediction_counts.append(invalid_number_prediction_count)
            
print('\n...training finished.')
file_writer.close()

Start training... 

Train: epoch 1
Epoch: [1/10], step: [1/633], loss: 3.406866
Epoch: [1/10], step: [201/633], loss: 2.247396
Epoch: [1/10], step: [401/633], loss: 1.910273
Epoch: [1/10], step: [601/633], loss: 1.863026
Test: epoch 1 loss: 1.63839
X: woodward$##########################################
Y: hmm$##############################################
O: i dont know what are you doing to the stope$

X: that too$##########################################
Y: would the station put me up at a good hotel$######
O: i dont know what are you doing to the stope$

X: sure see you later$################################
Y: bye$##############################################
O: i dont know what are you doing to the stope$

Train: epoch 2
Epoch: [2/10], step: [1/633], loss: 1.824005
Epoch: [2/10], step: [201/633], loss: 1.795414
Epoch: [2/10], step: [401/633], loss: 1.719810
Epoch: [2/10], step: [601/633], loss: 1.688869
Test: epoch 2 loss: 1.56471
X: five bucks$##################################

<h4>Some tests</h4>
<p>It is time to run some informal tests of our network's ability to respond sensibly to new questions.</p>
<p>For convenience I have created a function with which I can ask the neural network new questions and see the answers it generates.</p>

In [156]:
def predict_question(self, session, xx, xx_len):
    feed_dict = {
            self.input_batch: xx,
            self.input_batch_lengths: xx_len
        }
    
    pred = session.run([
            self.infer_predictions
            ], feed_dict=feed_dict)
    return pred

In [157]:
Model.predict_question = classmethod(predict_question)

In [220]:
question = ['i like to go to swim',]
xx_len = [len(question[0]), ]
print(xx_len)
xx = [sentence_to_ids(question[0], letter2id, 50)[0], ]
print(xx)

[20]
[[11, 29, 14, 11, 13, 7, 29, 22, 17, 29, 9, 17, 29, 22, 17, 29, 21, 25, 11, 15, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]


In [221]:
predictions = Model.predict_question(session, xx, xx_len)
print('Test: ')
print(question[0], '\n', ''.join(ids_to_sentence(predictions[0][0], id2letter)))

Test: 
i like to go to swim 
 what are you talking about$


In [222]:
question = ['are you a bot',]
xx_len = [len(question[0]), ]
xx = [sentence_to_ids(question[0], letter2id, 50)[0], ]
predictions = Model.predict_question(session, xx, xx_len)
print('Test: ')
print(question[0], '\n', ''.join(ids_to_sentence(predictions[0][0], id2letter)), '\n')

Test: 
are you a bot 
 yes i do$ 



In [225]:
xx = []
xx_len = []

questions = ['what is your name', 'please come with me', 'hi', 'how are you', 'do you love me', 'what s the meaning of life']
for q in questions:
    xx_len += [len(q)]
    xx += [sentence_to_ids(q, letter2id, 50)[0]]
    
predictions = Model.predict_question(session, xx, xx_len)
print('Test: ')

for i, q in enumerate (questions):
    print(q, '\n', ''.join(ids_to_sentence(predictions[0][i], id2letter)), '\n')

Test: 
what is your name 
 its a place to be so sure i want to go to the polic 

please come with me 
 i dont know what you want$^^^^^^^^^^^^^^^^^^^^^^^^^ 

hi 
 her the way you want to go to me$^^^^^^^^^^^^^^^^^^ 

how are you 
 im sorry i dont know what you want$^^^^^^^^^^^^^^^^ 

do you love me 
 yes i do$^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 

what s the meaning of life 
 i dont know what you want to do$^^^^^^^^^^^^^^^^^^^ 



In [226]:
xx = []
xx_len = []

questions = ['how is the weather today', 'let s have a dinner', 'are you a bot', 'why not']
for q in questions:
    xx_len += [len(q)]
    xx += [sentence_to_ids(q, letter2id, 50)[0]]
    
predictions = Model.predict_question(session, xx, xx_len)
print('Test: ')

for i, q in enumerate (questions):
    print(q, '\n', ''.join(ids_to_sentence(predictions[0][i], id2letter)), '\n')

Test: 
how is the weather today 
 the way i want to go to the police$ 

let s have a dinner 
 what are you talking about$^^^^^^^^ 

are you a bot 
 yes i do$^^^^^^^^^^^^^^^^^^^^^^^^^^ 

why not 
 because i want to talk to you$^^^^^ 
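<p>The answers above still carry the end symbol and a tail of '^' characters. My guess is that the tail is filler with id 0 (which maps to '^' in our dictionary) for sentences that finished before the longest one in the batch. A small helper, hypothetical and not part of the code above, can cut each answer at the first '$':</p>

```python
def trim_reply(raw_reply, end_symbol='$'):
    '''Hypothetical helper: keeps only the text before the first end symbol.'''
    end = raw_reply.find(end_symbol)
    # find() returns -1 when the symbol is missing: keep the whole text then.
    return raw_reply if end == -1 else raw_reply[:end]

finished = trim_reply('yes i do$^^^^^^^^^^')
unfinished = trim_reply('its a place to be so sure i want to go to the polic')
print(finished)
print(unfinished)
```

<p>Checking for -1 matters: slicing directly with 'k[:k.find('$')]' would drop the last character of an answer that never produced the end symbol.</p>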

