# Seq2Seq Chatbot
1. Tutorial: https://medium.com/predict/creating-a-chatbot-from-scratch-using-keras-and-tensorflow-59e8fc76be79
2. Dataset download: https://www.kaggle.com/kausr25/chatterbotenglish#botprofile.yml
3. Background on the Keras Tokenizer: https://stackoverflow.com/questions/51956000/what-does-keras-tokenizer-method-exactly-do
4. Padding (`pad_sequences`): https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
5. One-hot encoding (`to_categorical`): https://stackoverflow.com/questions/41494625/issues-using-keras-np-utils-to-categorical/53430549

In [1]:
import os
import yaml
import numpy as np
import tensorflow as tf
from mute_tf_warnings import tf_mute_warning
from tensorflow.keras import preprocessing, utils
tf_mute_warning()



In [2]:
dir_path = './data_seq2seq/'
files_list = os.listdir(dir_path + os.sep) # our dataset files in a list
files_list.remove('.ipynb_checkpoints')
files_list.remove('politics.yml')
files_list.remove('gossip.yml')
files_list.remove('history.yml')
files_list.remove('movies.yml')
files_list.remove('money.yml')
files_list.remove('trivia.yml')
files_list.remove('sports.yml')
files_list.remove('psychology.yml')
files_list.remove('literature.yml')


questions = []
answers = list()
for filepath in files_list:
    # Use a context manager so each file is closed after it is parsed.
    with open(dir_path + os.sep + filepath, 'rb') as stream:
        docs = yaml.safe_load(stream)

    conversations = docs['conversations']
    
    for con in conversations:
        if len(con) > 2:
            questions.append(con[0])
            replies = con[1:]
            ans = ''
            
            for rep in replies:
                ans += ' ' + rep
            answers.append(ans)
            
        elif len(con) > 1:
            questions.append(con[0])
            answers.append(con[1])

In [3]:
answers[0]

'My brain does not require any beverages.'

In [4]:
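answers_with_tags = list()
questions_kept = list()
for question, answer in zip(questions, answers):
    # Keep only pairs whose answer is a string. Filtering both lists together
    # avoids popping from `questions` while iterating, which shifts indices.
    if isinstance(answer, str):
        questions_kept.append(question)
        answers_with_tags.append(answer)
questions = questions_kept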

answers = list()
for i in range( len( answers_with_tags ) ):
    answers.append('<START>' + answers_with_tags[i] + '<END>' )

    
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print(answers[0])

<START>My brain does not require any beverages.<END>


In [5]:
print(len( tokenizer.word_index )+1)

1186


In [6]:
love = 0
for word in questions:
#     word = word.lower()
    if word.find('What') != -1:
        love += 1
print(love)

60


In [7]:
hey = 'I love you'
print(hey.find('me'))

-1


# Preparing data for Seq2Seq model
Our model requires three arrays, namely `encoder_input_data`, `decoder_input_data` and `decoder_output_data`.

1. For `encoder_input_data` :
    * Tokenize the questions. Pad them to their maximum length.
2. For `decoder_input_data` :
    * Tokenize the answers. Pad them to their maximum length.
3. For `decoder_output_data` :
    * Tokenize the answers. Remove the first element from all the `tokenized_answers`. This is the `<START>` element which we added earlier. Then pad the sequences and one-hot encode them.
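To make the relationship between the three arrays concrete, here is a minimal pure-Python sketch on a single made-up question/answer pair. The vocabulary and the helper names (`to_ids`, `pad_post`, `one_hot`) are illustrative assumptions standing in for the Keras `Tokenizer`, `pad_sequences` and `to_categorical` calls used in the notebook.

```python
# Toy illustration of the three arrays; the vocabulary and sentences are made up.
word_index = {"start": 1, "end": 2, "hi": 3, "hello": 4, "there": 5}

def to_ids(text):
    """Map a whitespace-separated sentence to integer token ids."""
    return [word_index[w] for w in text.lower().split()]

def pad_post(seq, maxlen):
    """Right-pad a sequence with zeros (id 0 is reserved for padding)."""
    return seq + [0] * (maxlen - len(seq))

def one_hot(ids, vocab_size):
    """Turn each id into a one-hot row of length vocab_size."""
    return [[1 if i == t else 0 for i in range(vocab_size)] for t in ids]

question = "hi"
answer = "start hello there end"
vocab_size = len(word_index) + 1  # +1 for the padding id 0

encoder_input = pad_post(to_ids(question), 2)
decoder_input = pad_post(to_ids(answer), 5)
# The target is the answer shifted left by one position (start token dropped).
decoder_target_ids = pad_post(to_ids(answer)[1:], 5)
decoder_target = one_hot(decoder_target_ids, vocab_size)

print(encoder_input)       # [3, 0]
print(decoder_input)       # [1, 4, 5, 2, 0]
print(decoder_target_ids)  # [4, 5, 2, 0, 0]
```

At every position `t`, the decoder sees `decoder_input[t]` and is trained to predict `decoder_target_ids[t]`, which is why the target is the same sequence shifted by one token.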

In [5]:
# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences(questions)
maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions, maxlen=maxlen_questions, padding='post')
encoder_input_data = np.array(padded_questions)
print( encoder_input_data.shape)

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max( len(x) for x in tokenized_answers )
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers, maxlen=maxlen_answers, padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape )

# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers ) # here we are removing the <start> sequence
for i in range(len(tokenized_answers)):
    tokenized_answers[i] = tokenized_answers[i][1:]

padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers, maxlen=maxlen_answers, padding='post')
onehot_answers = utils.to_categorical( padded_answers, VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape)

# Saving all the arrays to storage
np.save('./saved arrays/enc_in_data.npy', encoder_input_data)
np.save('./saved arrays/dec_in_data.npy', decoder_input_data)
np.save('./saved arrays/dec_tar_data.npy', decoder_output_data)

(303, 9)
(303, 74)
(303, 74, 1186)


## 3) Defining the Encoder-Decoder model
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : Converts integer token sequences into fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here, so that the padding zeros are ignored )**
*   LSTM layer : Provides access to Long Short-Term Memory cells.
*   Dense layer : Applies a softmax over the vocabulary at every decoder time step.

Working : 

1.   The `encoder_input_data` goes into the Embedding layer (  `encoder_embedding` ). 
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together form the `encoder_states` ).
3.   These states are set as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through its own Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( initialised with the encoder states ) to produce sequences.
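The state handoff in the steps above can be sketched with a scalar LSTM cell written in plain Python. This is only an illustration of the standard LSTM update equations: the weights are arbitrary made-up values, the hidden size is 1, and the encoder and decoder share one weight set for brevity, whereas the Keras model in this notebook uses two separate 200-unit LSTM layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, w):
    """One step of a scalar LSTM cell; w maps gate name -> (w_x, w_h, bias)."""
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h + b)
    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    o = gate("o", sigmoid)      # output gate
    g = gate("g", math.tanh)    # candidate cell value
    c_new = f * c + i * g       # cell state `c`
    h_new = o * math.tanh(c_new)  # hidden state `h`
    return h_new, c_new

# Arbitrary illustrative weights, identical for all four gates.
weights = {name: (0.5, 0.1, 0.0) for name in "ifog"}

def run(inputs, h=0.0, c=0.0):
    """Feed a sequence through the cell and return the final (h, c)."""
    for x in inputs:
        h, c = lstm_step(x, h, c, weights)
    return h, c

# "Encoder": read the question and keep only the final states.
encoder_h, encoder_c = run([1.0, -0.5, 0.25])

# "Decoder": start from the encoder states instead of zeros.
decoder_h, decoder_c = run([0.3], h=encoder_h, c=encoder_c)
cold_h, cold_c = run([0.3])  # same input, but no encoder context

print((decoder_h, decoder_c) != (cold_h, cold_c))  # True
```

The same decoder input yields different states depending on the initial `h` and `c`, which is how information about the question reaches the decoder.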

**Important points :**


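*   `200` is the output dimension of the Embedding layers; the same value is used for the number of LSTM units.
*   `embedding_matrix` ( a pre-trained GloVe embedding ) is not loaded in the code below, so both Embedding layers are trained from scratch.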


<center><img style="float: center;" src="https://cdn-images-1.medium.com/max/1600/1*bnRvZDDapHF8Gk8soACtCQ.gif"></center>


Image credits to [Hackernoon](https://hackernoon.com/tutorial-3-what-is-seq2seq-for-text-summarization-and-why-68ebaa644db0).


In [6]:
encoder_inputs = tf.keras.layers.Input(shape=(None, ))
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200, mask_zero=True)(encoder_inputs)
encoder_outputs, state_h, state_c = tf.keras.layers.LSTM( 200, return_state=True)(encoder_embedding)
encoder_states = [ state_h, state_c ]

decoder_inputs = tf.keras.layers.Input(shape=(None, ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 200, mask_zero=True)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200, return_state=True, return_sequences=True)
decoder_outputs, _,_ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation=tf.keras.activations.softmax )
output = decoder_dense(decoder_outputs)

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 200)    237200      input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 200)    237200      input_2[0][0]                    
______________________________________________________________________________________________

## 4) Training the model
We train the model for 150 epochs with a batch size of 50, using the `RMSprop` optimizer and the `categorical_crossentropy` loss function.
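As a rough illustration of what `categorical_crossentropy` measures for a single decoder time step, the sketch below computes the loss between a one-hot target and a predicted distribution in plain Python. The function name mirrors the Keras loss, but this is a simplified stand-in: the `eps` clipping value is an assumption, and Keras additionally averages over time steps and batch entries.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    # Clip probabilities away from zero so log() is always defined.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 0, 1, 0]                 # the correct word has index 2
confident = [0.05, 0.05, 0.85, 0.05]  # most probability on the correct word
unsure = [0.25, 0.25, 0.25, 0.25]     # uniform guess over 4 words

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident < loss_unsure)  # True
```

Only the probability assigned to the correct word contributes to the loss, so a uniform guess over `V` words costs `log(V)` per step.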

In [7]:
model.fit([encoder_input_data, decoder_input_data], decoder_output_data, batch_size=50, epochs=150) #75
model.save('./seq_saved_model/model.h5')

Epoch 1/150
...
Epoch 77/150
Epoch 78

## 5) Defining inference models
We create inference models which help in predicting answers.

**Encoder inference model** : Takes the question as input and outputs LSTM states ( `h` and `c` ).

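**Decoder inference model** : Takes 2 inputs. The first is the LSTM states ( the output of the encoder model on the first step, then the decoder's own states ), and the second is the answer input sequence ( the most recently generated token, beginning with the `start` token ). It outputs the predicted next-word probabilities for the question which we fed to the encoder model, together with its updated state values.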

In [8]:
def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    decoder_state_input_h = tf.keras.layers.Input(shape=(200, ))
    decoder_state_input_c = tf.keras.layers.Input(shape=(200, ))
    
    decoder_states_inputs = [ decoder_state_input_h, decoder_state_input_c ]
    
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_embedding, initial_state=decoder_states_inputs)
    decoder_states = [ state_h, state_c ]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    
    return encoder_model, decoder_model

## 6) Talking with our Chatbot

First, we define a method `str_to_tokens` which converts `str` questions to integer tokens with padding. Note that a word which is not in the tokenizer's vocabulary raises a `KeyError`.

In [9]:
def str_to_tokens(sentence: str):
    words = sentence.lower().split()
    tokens_list = list()
    for word in words:
        tokens_list.append( tokenizer.word_index[ word ])
    return preprocessing.sequence.pad_sequences([tokens_list], maxlen=maxlen_questions, padding='post')

1.   First, we take a question as input and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains the `<start>` element.
4.   We input this sequence in the `dec_model`.
5.   We replace the `<start>` element with the element which was predicted by the `dec_model` and update the state values.
6.   We carry out the above steps iteratively till we hit the `<end>` tag or the maximum answer length.
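The loop described in the steps above is a greedy decoding loop. The sketch below shows its control flow in plain Python with a hypothetical `fake_decoder_step` function standing in for `dec_model.predict`; the vocabulary and the fixed transitions are made up purely so that the example runs without a trained model. Unlike the notebook cell, it does not append the `end` token to the reply.

```python
# Made-up vocabulary; id 0 is left unused to mimic the padding id.
index_word = {1: "start", 2: "end", 3: "hello", 4: "there"}
word_index = {w: i for i, w in index_word.items()}

def fake_decoder_step(token_id, state):
    """Hypothetical stand-in for dec_model.predict.

    Returns (scores over the vocabulary, new state). The transitions are
    hard-coded: start -> hello -> there -> end.
    """
    next_id = {1: 3, 3: 4, 4: 2}.get(token_id, 2)
    scores = [0.0] * 5
    scores[next_id] = 1.0
    return scores, state + 1  # the "state" is just a step counter here

def greedy_decode(initial_state, max_len=10):
    token_id = word_index["start"]   # step 3: begin with the start token
    state = initial_state            # step 2: states from the encoder
    words = []
    while len(words) < max_len:      # step 6: stop at the maximum length
        scores, state = fake_decoder_step(token_id, state)  # step 4
        # Step 5: the most likely token becomes the next decoder input.
        token_id = max(range(len(scores)), key=scores.__getitem__)
        word = index_word.get(token_id)
        if word is None or word == "end":  # step 6: stop at the end tag
            break
        words.append(word)
    return " ".join(words)

reply = greedy_decode(initial_state=0)
print(reply)  # hello there
```

Greedy decoding always picks the single most likely word at each step; other strategies such as sampling or beam search are possible but are not used in this notebook.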

In [10]:
enc_model, dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict( str_to_tokens(input('You: ')))
    empty_target_seq = np.zeros((1,1))
    empty_target_seq[0,0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition:
        dec_outputs, h, c = dec_model.predict([empty_target_seq] + states_values)
        sampled_word_index = int(np.argmax(dec_outputs[0,-1,:]))
        # Look up the predicted word; index 0 is padding and has no entry.
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        if sampled_word is not None:
            decoded_translation += f' {sampled_word}'

        # Stop at the end tag, on a padding prediction, or at the maximum answer length.
        if sampled_word in (None, 'end') or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True

        # Feed the predicted token and the updated states back into the decoder.
        empty_target_seq = np.zeros((1,1))
        empty_target_seq[0,0] = sampled_word_index
        states_values = [ h, c ]
    print( decoded_translation )

You:  hi


 hello end


You:  hello


 greetings end


You:  welcome


KeyError: 'welcome'
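The `KeyError` above occurs because `welcome` is not in `tokenizer.word_index`, and `str_to_tokens` indexes the dictionary directly. One possible workaround, sketched here in plain Python with a made-up vocabulary and a hypothetical helper `str_to_token_ids`, is to skip out-of-vocabulary words before padding. Keras's `Tokenizer.texts_to_sequences` behaves similarly when no `oov_token` is set, so calling it instead of indexing `word_index` would be another option.

```python
# Made-up vocabulary standing in for tokenizer.word_index.
word_index = {"hi": 1, "hello": 2, "how": 3, "are": 4, "you": 5}

def str_to_token_ids(sentence, word_index, maxlen):
    """Convert a sentence to padded ids, skipping out-of-vocabulary words."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[:maxlen]                       # truncate overly long inputs
    return ids + [0] * (maxlen - len(ids))   # post-pad with zeros

known = str_to_token_ids("hello how are you", word_index, maxlen=6)
unknown = str_to_token_ids("welcome", word_index, maxlen=6)

print(known)    # [2, 3, 4, 5, 0, 0]
print(unknown)  # [0, 0, 0, 0, 0, 0]
```

An input made up entirely of unknown words becomes all padding, so the caller may want to detect that case and ask the user to rephrase rather than pass it to the encoder.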