# Chatbot using Seq2Seq LSTM models
In this notebook, we will assemble a seq2seq LSTM model using the Keras Functional API to create a working chatbot that answers the questions put to it.

Chatbots have become applications in their own right. You can choose a field or domain and gather data on the questions people ask about it. For example, we could build a chatbot for an e-commerce website, or for a school website where parents can get information about the school.


Messaging platforms like Allo have implemented chatbot services to engage users. The famous [Google Assistant](https://assistant.google.com/), [Siri](https://www.apple.com/in/siri/), [Cortana](https://www.microsoft.com/en-in/windows/cortana) and [Alexa](https://www.alexa.com/) may have been built using similar models.

So, let's start building our Chatbot.


## 1) Importing the packages

We will import [TensorFlow](https://www.tensorflow.org) and our beloved [Keras](https://www.tensorflow.org/guide/keras). Also, we import other modules which help in defining model layers.






In [1]:
import numpy as np
import tensorflow as tf
import pickle
from tensorflow.keras import layers , activations , models , preprocessing, utils

# tf.__version__ works in both TensorFlow 1.x and 2.x; tf.VERSION was removed in 2.x
print( tf.__version__ )




1.14.0




## 2) Preprocessing the data

### A) Download the data

The dataset comes from [chatterbot/english on Kaggle](https://www.kaggle.com/kausr25/chatterbotenglish) by [kausr25](https://www.kaggle.com/kausr25). It contains pairs of questions and answers on a number of subjects such as food, history and AI.

The raw data can be found in this repo: https://github.com/shubham0204/Dataset_Archives


In [2]:

#import requests, zipfile, io

#r = requests.get( 'https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true' ) 
#z = zipfile.ZipFile(io.BytesIO(r.content))
#z.extractall()

#from tensorflow.keras import preprocessing , utils
#import os
#import yaml
#
#dir_path = 'chatbot_nlp/data'
#files_list = os.listdir(dir_path + os.sep)
#
#questions = list()
#answers = list()
#
#for filepath in files_list:
#    stream = open( dir_path + os.sep + filepath , 'rb')
#    docs = yaml.safe_load(stream)
#    conversations = docs['conversations']
#    for con in conversations:
#        if len( con ) > 2 :
#            questions.append(con[0])
#            replies = con[ 1 : ]
#            ans = ''
#            for rep in replies:
#                ans += ' ' + rep
#            answers.append( ans )
#        elif len( con )> 1:
#            questions.append(con[0])
#            answers.append(con[1])
#
#answers_with_tags = list()
#for i in range( len( answers ) ):
#    if type( answers[i] ) == str:
#        answers_with_tags.append( answers[i] )
#    else:
#        questions.pop( i )
#
#answers = list()
#for i in range( len( answers_with_tags ) ) :
#    answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )
#
#tokenizer = preprocessing.text.Tokenizer()
#tokenizer.fit_on_texts( questions + answers )
#VOCAB_SIZE = len( tokenizer.word_index )+1
#print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))#

In [3]:
from nltk.tokenize import sent_tokenize
import numpy as np
import re
import pandas as pd
import string
from string import digits

#lines= pd.read_table('jpn.txt', names=['questions', 'answers'])
#data_path = '../../Dataset/Bangladesh Dummy dataset from wikipedia.txt'
data_path = '../../../Dataset/Dummy data from wikipedia short.txt'


with open(data_path,'r', encoding='utf-8') as f:
    lines = f.read()

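# Collapse all whitespace (including newlines) into single spaces
lines = " ".join(lines.split())

sentences = sent_tokenize(lines)
sent = np.asarray(sentences)
# Group consecutive sentences into pairs, dropping a trailing unpaired sentence if there is one
n_pairs = len(sent) // 2
sentTwo = sent[:2 * n_pairs].reshape(n_pairs, 2)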


In [4]:
sentTwo

array([["Bangladesh officially the People's Republic of Bangladesh (গণপ্রজাতন্ত্রী বাংলাদেশ Gônoprojatontri Bangladesh), is a country in South Asia.",
        "While it is the 92nd-largest country, spanning 147,570 square kilometres (56,980 sq mi), it is the world's 8th-most populous country with a population nearing 163 million, making it one of the most densely populated countries in the world."],
       ['Bangladesh shares land borders with India to the west, north and the east and Myanmar to the east, whereas the Bay of Bengal lies to its south.',
        'Dhaka, its capital and largest city, is also the economic, political and the cultural hub of the country.'],
       ['Chittagong, the largest sea port, is the second largest city.',
        "The country's geography is dominated by the Ganges delta which empties into the Bay of Bengal the combined waters of several river systems, including those of the Brahmaputra and the Ganges."],
       ['As a result, the country is criss-cross

In [5]:
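def create_dataset(dataset, look_back=1):
    """Build (input, target) pairs from column 0 of `dataset`.

    Each input is `look_back` consecutive rows and the target is the row that
    follows them. Column 1 is not used, and the last possible pair is skipped.
    """
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        a = dataset[i:(i + look_back), 0]
        dataX.append(a)
        dataY.append(dataset[i + look_back, 0])
    return np.array(dataX), np.array(dataY)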

In [6]:
col1, col2 = create_dataset(sentTwo)
#col1 = col1.reshape(-1)
col2 = col2.reshape(-1,1)
#col2.shape
dt = np.concatenate((col1,col2), axis=1)

lines = pd.DataFrame(dt, columns=['questions','answers'])
# Lowercase all characters
lines.questions=lines.questions.apply(lambda x: x.lower())
lines.answers=lines.answers.apply(lambda x: x.lower())

# to install mecab
# sudo apt install mecab mecab-ipadic-utf8
#import MeCab
#wakati = MeCab.Tagger("-Owakati")
#lines.answers = lines.answers.apply(lambda x: wakati.parse(x).strip("\n"))

# Remove quotes
lines.questions=lines.questions.apply(lambda x: re.sub("'", '', x))
lines.answers=lines.answers.apply(lambda x: re.sub("'", '', x))
exclude = set(string.punctuation) # Set of all special characters

# Remove all the special characters
lines.questions=lines.questions.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
lines.answers=lines.answers.apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

# Remove all numbers from text
#remove_digits = str.maketrans('', '', digits)
#lines.questions=lines.questions.apply(lambda x: x.translate(remove_digits))
#lines.answers = lines.answers.apply(lambda x: re.sub("[123456789]", "", x))
# Remove extra spaces

lines.questions=lines.questions.apply(lambda x: x.strip())
lines.answers=lines.answers.apply(lambda x: x.strip())
lines.questions=lines.questions.apply(lambda x: re.sub(" +", " ", x))
lines.answers=lines.answers.apply(lambda x: re.sub(" +", " ", x))

# Add start and end tokens to target sequences
lines.answers = lines.answers.apply(lambda x : '<START> ' + x + ' <END>')
lines.head(10)
#lines.answers.tail(10)

|   | questions | answers |
|---|-----------|---------|
| 0 | bangladesh officially the peoples republic of ... | `<START>` bangladesh shares land borders with in... |
| 1 | bangladesh shares land borders with india to t... | `<START>` chittagong the largest sea port is the... |
| 2 | chittagong the largest sea port is the second ... | `<START>` as a result the country is crisscrosse... |
| 3 | as a result the country is crisscrossed by num... | `<START>` the country also features the longest ... |
| 4 | the country also features the longest natural ... | `<START>` bangladesh forms the largest and easte... |
| 5 | bangladesh forms the largest and eastern part ... | `<START>` in the ancient and classical period of... |
| 6 | in the ancient and classical period of the ind... | `<START>` the principalities were notable for th... |
| 7 | the principalities were notable for their over... | `<START>` islam was introduced during the pala e... |
| 8 | islam was introduced during the pala empire th... | `<START>` following the decline of the mughal em... |
| 9 | following the decline of the mughal empire in ... | `<START>` the borders of modern bangladesh were ... |


In [7]:
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( lines.questions + lines.answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

VOCAB SIZE : 562


In [8]:
word_index = tokenizer.word_index

# Keep only words whose index is below 15000 (all of them for this small vocabulary)
index2word = {v: k for k, v in word_index.items() if v < 15000}

# Invert back to a word -> index mapping
word2index = {w: i for i, w in index2word.items()}

### B) Reading the data from the files

We parse each of the `.yaml` files.

*   Concatenate two or more sentences if the answer has two or more of them.
*   Remove unwanted data types which are produced while parsing the data.
*   Append `<START>` and `<END>` to all the `answers`.
*   Create a `Tokenizer` and load the whole vocabulary ( `questions` + `answers` ) into it.
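
The parsing steps above can be sketched on an in-memory example. The `demo_conversations` list is made up for illustration and only mimics the shape of the parsed `.yaml` data (a list of lists, with the question first and the replies after it); `to_pairs` is a hypothetical helper, not part of the original notebook.

```python
# Hypothetical stand-in for docs['conversations'] from one parsed .yaml file.
demo_conversations = [
    ["What is AI?", "Artificial Intelligence.", "It is a branch of computer science."],
    ["Hello", "Hi there"],
    ["Orphan line"],            # no reply, so it is skipped
    ["Are you a bot?", True],   # YAML can parse a bare reply as a non-string
]

def to_pairs(conversations):
    """Return (questions, answers) with multi-part replies joined and tagged."""
    questions, answers = [], []
    for con in conversations:
        if len(con) < 2:
            continue  # nothing to answer with
        replies = con[1:]
        if not all(isinstance(r, str) for r in replies):
            continue  # drop pairs whose replies are not plain text
        questions.append(con[0])
        answers.append("<START> " + " ".join(replies) + " <END>")
    return questions, answers

demo_questions, demo_answers = to_pairs(demo_conversations)
print(demo_questions)
print(demo_answers)
```

Dropping the question together with its non-string answer keeps the two lists aligned, which matters because they are later zipped into training pairs.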






### C) Preparing data for Seq2Seq model

Our model requires three arrays, namely `encoder_input_data`, `decoder_input_data` and `decoder_output_data`.

For `encoder_input_data` :
* Tokenize the `questions`. Pad them to their maximum length.

For `decoder_input_data` :
* Tokenize the `answers`. Pad them to their maximum length.

For `decoder_output_data` :

* Tokenize the `answers`. Remove the first element from all the `tokenized_answers`. This is the `<START>` element which we added earlier.
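
To make the relationship between the two decoder arrays concrete, here is a small sketch with made-up token ids (`1` standing for `<START>`, `2` for `<END>`, `0` for padding; these ids are assumptions for illustration, not the ones the `Tokenizer` assigns). The target is the decoder input shifted left by one step, so at every position the model learns to predict the next token.

```python
PAD, START, END = 0, 1, 2  # illustrative ids only

def pad_post(seq, maxlen, value=PAD):
    """Right-pad a list of ids to maxlen (no truncation), mirroring padding='post'."""
    return seq + [value] * (maxlen - len(seq))

toy_answers = [
    [START, 7, 8, 9, END],
    [START, 5, END],
]
toy_maxlen = max(len(a) for a in toy_answers)

toy_decoder_input = [pad_post(a, toy_maxlen) for a in toy_answers]
# Drop the leading START token, then pad back to the same length.
toy_decoder_target = [pad_post(a[1:], toy_maxlen) for a in toy_answers]

for inp, tgt in zip(toy_decoder_input, toy_decoder_target):
    print(inp, "->", tgt)
```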



In [9]:
questions = lines.questions 
answers = lines.answers

# encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences( questions )
maxlen_questions = max( [ len(x) for x in tokenized_questions ] )
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions , maxlen=maxlen_questions , padding='post' )
encoder_input_data = np.array( padded_questions )
print( encoder_input_data.shape , maxlen_questions )

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
maxlen_answers = max([ len(x) for x in tokenized_answers ])
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )

# decoder_output_data
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape )

# Saving all the arrays to storage
#np.save( 'Saved Arrays/enc_in_data.npy' , encoder_input_data )
#np.save( 'Saved Arrays/dec_in_data.npy' , decoder_input_data )
#np.save( 'Saved Arrays/dec_tar_data.npy' , decoder_output_data )


(57, 44) 44
(57, 47) 47
(57, 47, 562)


## 3) Defining the Encoder-Decoder model
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : For converting token vectors to fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here )**
*   LSTM layer : Provides access to Long Short-Term Memory cells.
*   Dense layer : Maps each decoder output to a softmax distribution over the vocabulary.

Working : 

1.   The `encoder_input_data` goes into the Embedding layer ( `encoder_embedding` ). 
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together form `encoder_states` ).
3.   These states are set as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through the Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( which holds the encoder states ) to produce sequences.
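
The flow of states in the steps above can be sketched in plain NumPy. This is only an illustration under stated assumptions: the weights are random, the sizes are toy values, the gate ordering inside `lstm_step` is an arbitrary choice, and names such as `run_lstm` are hypothetical. It is not how Keras implements its `LSTM` layer, but it shows how the encoder's final `h` and `c` become the decoder's starting state.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, UNITS = 4, 3  # toy sizes; the notebook uses 50 for both

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(emb, units):
    """Random weights for one LSTM layer: input kernel, recurrent kernel, bias."""
    return (rng.normal(size=(emb, 4 * units)) * 0.1,
            rng.normal(size=(units, 4 * units)) * 0.1,
            np.zeros(4 * units))

def lstm_step(x, h, c, params):
    """One time step: returns the new hidden state h and cell state c."""
    W, U, b = params
    i, f, g, o = np.split(x @ W + h @ U + b, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def run_lstm(xs, h, c, params):
    """Run over a sequence, collecting h at every step."""
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return np.stack(outputs), h, c

enc_params, dec_params = init_lstm(EMB, UNITS), init_lstm(EMB, UNITS)
question_emb = rng.normal(size=(5, EMB))  # 5 embedded question tokens
answer_emb = rng.normal(size=(6, EMB))    # 6 embedded answer tokens

# Steps 1-2: the encoder reads the question; only its final states are kept.
_, enc_h, enc_c = run_lstm(question_emb, np.zeros(UNITS), np.zeros(UNITS), enc_params)
# Steps 3-5: the decoder starts from those states and emits one vector per step.
dec_outputs, _, _ = run_lstm(answer_emb, enc_h, enc_c, dec_params)

print(enc_h.shape, enc_c.shape, dec_outputs.shape)
```

In the real model each row of `dec_outputs` would then pass through the Dense softmax layer to give a distribution over the vocabulary.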

**Important points :**


*   `50` is the dimensionality of the GloVe embedding vectors.
*   `embedding_matrix` holds the pretrained GloVe vectors for the words in our vocabulary.


<center><img style="float: center;" src="https://cdn-images-1.medium.com/max/1600/1*bnRvZDDapHF8Gk8soACtCQ.gif"></center>


Image credits to [Hackernoon](https://hackernoon.com/tutorial-3-what-is-seq2seq-for-text-summarization-and-why-68ebaa644db0).










In [10]:
import numpy as np
from pathlib import Path

home = str(Path.home())

embeddings_index = {}
with open(home + '/GlovePretrainedVectors/glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
# The with-block closes the file automatically
print("Glove Loaded")

Glove Loaded


In [11]:
embedding_dimention = 50
def embedding_matrix_creater(embedding_dimention, word_index):
    # Row i holds the GloVe vector of the word with index i; row 0 is reserved for padding.
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimention))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
        # Words not found in the embedding index stay all-zeros.
    return embedding_matrix

embedding_matrix = embedding_matrix_creater(embedding_dimention, word_index=word2index)

In [12]:
import tensorflow as tf
#from keras.layers import Embedding

embed_layer = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=50, trainable=True,)
embed_layer.build((None,))
embed_layer.set_weights([embedding_matrix])

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


In [13]:


encoder_inputs = tf.keras.layers.Input(shape=( None , ))
#encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 50 , mask_zero=True ) (encoder_inputs)

encoder_embedding = embed_layer(encoder_inputs)

encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 50 , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( None ,  ))
#decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, 50 , mask_zero=True) (decoder_inputs)

decoder_embedding = embed_layer(decoder_inputs)

decoder_lstm = tf.keras.layers.LSTM( 50 , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 50)     28100       input_1[0][0]                    
                                                                 input_2[0][0]                    
___________________________________________________________________________

## 4) Training the model
We train the model for 100 epochs with a batch size of 10, using the `RMSprop` optimizer and the `categorical_crossentropy` loss function.
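
As a rough illustration of what `categorical_crossentropy` measures, the sketch below computes it by hand for a single time step over a toy vocabulary of 4 tokens. The function here is a simplified stand-in written for this example; Keras additionally averages over time steps and batch items.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# Toy vocabulary of 4 tokens; the correct next token is index 2.
y_true = [0.0, 0.0, 1.0, 0.0]
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

print(categorical_crossentropy(y_true, confident))  # about 0.16
print(categorical_crossentropy(y_true, unsure))     # about 1.39
```

The loss is simply the negative log-probability assigned to the correct token, so a confident correct prediction costs little and a uniform guess costs `log(4)`.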

In [14]:

model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=10, epochs=100 ) 
model.save( 'model.h5' ) 


Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/100
...
Epoch 100/100


## 5) Defining inference models
We create inference models which help in predicting answers.

**Encoder inference model** : Takes the question as input and outputs the LSTM states ( `h` and `c` ).

**Decoder inference model** : Takes two inputs: the LSTM states ( the output of the encoder model ) and the answer input sequence, fed one token at a time beginning with the `<start>` token. It outputs a probability distribution over the next answer token, along with its updated state values.

In [15]:

def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    decoder_state_input_h = tf.keras.layers.Input(shape=( 50 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 50 ,))
    
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model


## 6) Talking with our Chatbot

First, we define a function `str_to_tokens` which converts a `str` question to integer tokens with padding.


In [16]:

def str_to_tokens( sentence : str ):
    words = sentence.lower().split()
    tokens_list = list()
    #return tokens_list
    for word in words:
        tokens_list.append( tokenizer.word_index[ word ] ) 
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')


In [17]:
str_to_tokens("Bangladesh")

array([[11,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)




1.   First, we take a question as input and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains only the `<start>` element.
4.   We feed this sequence into the `dec_model`.
5.   We replace the `<start>` element with the element predicted by the `dec_model` and update the state values.
6.   We repeat the above steps until we hit the `<end>` tag or the maximum answer length.
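
The loop described in these steps is a greedy decoder. A minimal, framework-free sketch is shown below; `greedy_decode`, `toy_step` and `TOY_TABLE` are hypothetical names, and `toy_step` merely stands in for `dec_model.predict` by looking up fixed next-token probabilities.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, max_len):
    """Repeatedly feed back the most probable token until end_id or max_len."""
    token, state, decoded = start_id, init_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if token == end_id:
            break
        decoded.append(token)
    return decoded

# Fixed transition table: current token -> next-token probabilities.
# Token 1 plays the role of <start>, token 2 the role of <end>.
TOY_TABLE = {
    1: [0.0, 0.0, 0.1, 0.8, 0.1],    # after start, prefer token 3
    3: [0.0, 0.0, 0.1, 0.1, 0.8],    # after 3, prefer token 4
    4: [0.0, 0.0, 0.9, 0.05, 0.05],  # after 4, prefer end
}

def toy_step(token, state):
    """Stand-in for the decoder model; the state is just a step counter."""
    return TOY_TABLE[token], state + 1

print(greedy_decode(toy_step, init_state=0, start_id=1, end_id=2, max_len=10))
```

Because the decoder always picks the single most probable token, a weakly trained model tends to fall into repetitive output, which is a plausible reading of the near-identical answers this notebook produces for different questions.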







In [18]:
enc_model , dec_model = make_inference_models()

for _ in range(10):
    states_values = enc_model.predict( str_to_tokens( input( 'Enter question : ' ) ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        sampled_word = None
        for word , index in tokenizer.word_index.items() :
            if sampled_word_index == index :
                decoded_translation += ' {}'.format( word )
                sampled_word = word
        
        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
            
        empty_target_seq = np.zeros( ( 1 , 1 ) )  
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ] 

    print( decoded_translation )


Enter question : Bangladesh
 the the is is of the largest sea and rivers and the world end
Enter question : Muslim
 the the is is of the largest sea and rivers and and inland end
Enter question : islam
 the the is is of the largest sea and rivers and and inland end
Enter question : cat


KeyError: 'cat'
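
The `KeyError` above is raised because `str_to_tokens` looks up every word directly in `tokenizer.word_index`, and `cat` never appeared in the training text. One possible way around it, sketched here with a small made-up dictionary in place of the real `tokenizer.word_index`, is to skip unknown words. `safe_tokens` and `toy_word_index` are hypothetical names introduced for this example.

```python
# Hypothetical stand-in for tokenizer.word_index; real ids will differ.
toy_word_index = {"bangladesh": 11, "is": 3, "a": 7, "country": 9}

def safe_tokens(sentence, word_index, maxlen):
    """Map known words to ids, drop unknown ones, and right-pad with zeros."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[:maxlen]
    return ids + [0] * (maxlen - len(ids))

print(safe_tokens("Bangladesh is a cat", toy_word_index, maxlen=6))
```

An alternative is to pass `oov_token` when constructing the Keras `Tokenizer`, so that unseen words map to a dedicated index instead of being dropped.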