# LSTM Chatbot with Tensorflow using MetaLWOz Dataset
A chatbot trained on user/bot dialogue pairs from the [MetaLWOz](https://www.microsoft.com/en-us/research/project/metalwoz/) dataset. The run shown here uses only 3 of the 47 topics (13561 sentence pairs) because of memory limits.

### A) Import libraries

In [None]:
import numpy as np 
import os
import re
import json
import tensorflow as tf
from tensorflow.keras import layers, activations, models, preprocessing

### B) Reading the data from the files

In [None]:
!wget https://download.microsoft.com/download/E/B/8/EB84CB1A-D57D-455F-B905-3ABDE80404E5/metalwoz-v1.zip -O metalwoz-v1.zip
!unzip metalwoz-v1.zip
dir_path = 'dialogues'
files_list = os.listdir(dir_path + os.sep)

--2023-02-04 15:51:40--  https://download.microsoft.com/download/E/B/8/EB84CB1A-D57D-455F-B905-3ABDE80404E5/metalwoz-v1.zip
Resolving download.microsoft.com (download.microsoft.com)... 23.39.1.112, 2600:1406:3c:393::317f, 2600:1406:3c:3a4::317f
Connecting to download.microsoft.com (download.microsoft.com)|23.39.1.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5639228 (5.4M) [application/octet-stream]
Saving to: ‘metalwoz-v1.zip’


2023-02-04 15:51:40 (36.9 MB/s) - ‘metalwoz-v1.zip’ saved [5639228/5639228]

Archive:  metalwoz-v1.zip
  inflating: LICENSE.pdf             
  inflating: tasks.txt               
  inflating: dialogues/AGREEMENT_BOT.txt  
  inflating: dialogues/ALARM_SET.txt  
  inflating: dialogues/APARTMENT_FINDER.txt  
  inflating: dialogues/APPOINTMENT_REMINDER.txt  
  inflating: dialogues/AUTO_SORT.txt  
  inflating: dialogues/BANK_BOT.txt  
  inflating: dialogues/BUS_SCHEDULE_BOT.txt  
  inflating: dialogues/CATALOGUE_BOT.txt  
  inflating

The `dialogues` directory contains 47 .txt files, one per topic. It is convenient to treat each file as a sequence of JSON objects with missing commas between them. All the JSON objects are parsed and stored in one variable, `parsed_objects`.

Using all 47 topics overloads the RAM (>30 GB for the creation of `decoder_output_data` in the next section), so only 3 files are used. Note that `os.listdir` returns files in arbitrary order, so the 3 selected topics can differ between runs.

In [None]:
files_list=files_list[0:3]
print(files_list)

['PHONE_PLAN_BOT.txt', 'ALARM_SET.txt', 'PET_ADVICE.txt']


In [None]:
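#https://stackoverflow.com/questions/54663739/how-to-analyze-json-objects-that-are-not-separated-by-comma-preferably-in-pytho

def parse_unformatted_json(files_list, dir_path):
  """Parse every JSON object found in the given topic files into one list."""
  decoder = json.JSONDecoder()
  parsed_objects = []

  for topic in files_list: # one file per selected topic
    with open(os.path.join(dir_path, topic), "r") as f:
      # raw_decode does not skip leading whitespace, so strip it up front
      content = f.read().strip()
    while content:
      # decode one object from the front, then drop it from the remaining text
      value, new_start = decoder.raw_decode(content)
      content = content[new_start:].strip()
      parsed_objects.append(value)

  return parsed_objects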

parsed_objects = parse_unformatted_json(files_list, dir_path)
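
The parsing approach above can be tried on a small string without downloading anything. The following is a minimal sketch, not the notebook's function: the helper name `iter_concatenated_json` and the two sample objects are made up for illustration. It passes a start index to `raw_decode` instead of re-slicing the string, which avoids copying the remaining text after every object.

```python
import json

def iter_concatenated_json(text):
    """Yield JSON objects that are written back to back without commas."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it manually
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # returns the decoded object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# hypothetical sample in the same shape as the dataset files
sample = (
    '{"id": "a", "turns": ["Hello", "hi"]}\n'
    '{"id": "b", "turns": ["Hello", "bye"]}\n'
)
objects = list(iter_concatenated_json(sample))
print(len(objects), objects[1]["id"])
```

For files of a few megabytes either variant is fast enough; the index-based form mainly matters for much larger inputs.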

In [None]:
print("Total number of dialogues =", len(parsed_objects))

Total number of dialogues = 2601


Below is an example instance of `parsed_objects`. The values of `turns` are going to be extracted. The `turns` list always starts with the bot's greeting line, which is removed later.

In [None]:
parsed_objects[0]

{'id': '5a0fafb4',
 'user_id': '62edbdf3',
 'bot_id': '5b89b8eb',
 'domain': 'PHONE_PLAN_BOT',
 'task_id': '105bb6ba',
 'turns': ['Hello how may I help you?',
  'what can you do?',
  'I am a bot that can help you with your mobile plan issues. I can do things like give you details of the different plans.',
  'great, i need to upgrade my plan',
  'I can do that! Do you have a specific plan in mind?',
  'i want one with free calling',
  'Do you need free calling before 7 PM or 24/7 free calling?',
  "it must be 24/7 free calling, that's what i need",
  'Okay! We offer that. Can you please give me your number so I can get your account information.',
  '555 1212',
  'Thanks! Please provide your security question answer so I can apply changes to your account.']}

Extract the values of the `turns` key from every parsed object, across all selected topics, and split them into user lines and bot lines.

In [None]:
def extract_dialogue_pairs(parsed_objects):
  user=[]
  bot_raw=[]
  bot = list()

  for v in parsed_objects:
    if len(v['turns']) % 2 == 1:        # odd number of turns: the bot has the last line
      last_line = len(v['turns'])
    else:                               # even number of turns: the user has the last line, which has no reply and is dropped
      last_line = len(v['turns'])-1

    for i in range(1,last_line): # start from 1 so as to not include bot's first line (the welcoming line)
      if i % 2 == 0:
        bot_raw.append(v['turns'][i])
      else:
        user.append(v['turns'][i])

    # the length of dialogues should be equal, i.e. for every user's question there should be a bot's answer
    assert (len(user)==len(bot_raw))
  
  for i in range(len(bot_raw)) :
    bot.append( '<START> ' + bot_raw[i] + ' <END>' )

  return (user, bot)

The decoder progresses by taking the tokens it has already emitted as inputs, so before it has emitted anything it needs a token to start with, i.e. \<START>.

The \<END> token lets the decoder emit sequences of arbitrary length: the decoder signals that it is done by emitting \<END>. Without an end token there would be no way to know when the reply is finished, and continuing to emit tokens would produce gibberish.

In [None]:
user, bot = extract_dialogue_pairs(parsed_objects)
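
The pairing rule used by `extract_dialogue_pairs` (skip the greeting, then alternate user and bot lines, dropping a trailing user line that has no reply) can be checked on a toy dialogue. This is a sketch with a hypothetical helper `pair_turns` and invented turns; it uses slicing and `zip` rather than the index loop above, and does not add the start and end markers.

```python
def pair_turns(turns):
    """Return (user, bot) pairs from one dialogue that starts with a bot greeting."""
    user_lines = turns[1::2]  # indices 1, 3, 5, ... are user lines
    bot_lines = turns[2::2]   # indices 2, 4, 6, ... are bot replies
    # zip stops at the shorter list, so an unanswered final user line is dropped
    return list(zip(user_lines, bot_lines))

# invented dialogue with an even number of turns: the user speaks last
toy_turns = [
    "Hello how may I help you?",
    "set an alarm",
    "For what time?",
    "7 am",
    "Done.",
    "thanks",
]
pairs = pair_turns(toy_turns)
for user_line, bot_line in pairs:
    print(repr(user_line), "->", repr(bot_line))
```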

In [None]:
# There are 13561 sentence pairs
print(len(user))
print(len(bot))

13561
13561


Example of 5 sentence pairs between user and bot

In [None]:
user[0:5]

['what can you do?',
 'great, i need to upgrade my plan',
 'i want one with free calling',
 "it must be 24/7 free calling, that's what i need",
 '555 1212']

In [None]:
bot[0:5]

['<START> I am a bot that can help you with your mobile plan issues. I can do things like give you details of the different plans. <END>',
 '<START> I can do that! Do you have a specific plan in mind? <END>',
 '<START> Do you need free calling before 7 PM or 24/7 free calling? <END>',
 '<START> Okay! We offer that. Can you please give me your number so I can get your account information. <END>',
 '<START> Thanks! Please provide your security question answer so I can apply changes to your account. <END>']

In [None]:
# Create a Tokenizer and load the whole vocabulary (user + bot) into it.
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(user + bot)
VOCAB_SIZE = len(tokenizer.word_index)+1
print('VOCAB SIZE : ',VOCAB_SIZE)

VOCAB SIZE :  4713
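
The markers were written as `<START>` and `<END>`, but `Tokenizer()` is created here with its default settings, which lowercase the text and treat characters such as `<` and `>` as separators. The markers therefore end up in the vocabulary as the plain words `start` and `end`. The sketch below emulates that default splitting in plain Python so it runs without TensorFlow; the filter string is assumed to match the Keras default, and the helper name `emulate_default_split` is made up for illustration.

```python
# Assumed to match the default `filters` argument of the Keras Tokenizer
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def emulate_default_split(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in KERAS_DEFAULT_FILTERS})
    return text.lower().translate(table).split()

words = emulate_default_split("<START> I can do that! <END>")
print(words)

# the apostrophe is not in the filter list, so contractions stay intact
print(emulate_default_split("that's what i need"))
```

One consequence, under this assumption, is that a literal "start" or "end" inside a reply shares its index with the corresponding marker.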


### C) Preparing data for Seq2Seq model
Our model requires three arrays namely `encoder_input_data`, `decoder_input_data` and `decoder_output_data`.

*   `encoder_input_data` : Tokenize the questions. Pad or truncate them to length `MAX_LEN`.
*   `decoder_input_data` : Tokenize the answers. Pad or truncate them to length `MAX_LEN`.
*   `decoder_output_data` : Tokenize the answers and remove the first element (the \<START> token added earlier) from each tokenized answer, so that the target at every position is the next token of the decoder input. Then pad or truncate to length `MAX_LEN` and one-hot encode.

A maximum length of 30 tokens per sentence is used, which seems to be enough for most sentences. The longest question has 43 tokens and the longest answer has 68 (including the start and end markers), so longer sentences are truncated.
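
The relation between the decoder input and the decoder target is a shift by one position. The following plain-Python sketch shows it on a single made-up answer; the token ids, the short `max_len` and the helper `pad_post` are illustrative assumptions, and the helper only mimics post-padding and post-truncation.

```python
def pad_post(seq, max_len, pad_id=0):
    """Truncate at the end, then pad at the end with pad_id."""
    seq = seq[:max_len]
    return seq + [pad_id] * (max_len - len(seq))

START, END = 1, 2                # hypothetical ids for the start and end markers
answer = [START, 7, 8, 9, END]   # hypothetical tokenized answer
max_len = 6

decoder_input = pad_post(answer, max_len)        # what the decoder sees
decoder_target = pad_post(answer[1:], max_len)   # what it should predict

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f"step {step}: input {seen} -> target {expected}")
```

At every step the target is the token that follows the current input, which is what lets the trained decoder be run one token at a time during inference.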

In [None]:
MAX_LEN = 30

#encoder_input_data
tokenized_questions = tokenizer.texts_to_sequences(user)
maxlen_questions = max([len(x) for x in tokenized_questions])
encoder_input_data = preprocessing.sequence.pad_sequences(tokenized_questions, maxlen = MAX_LEN, padding = 'post', truncating='post')
print(encoder_input_data.shape, maxlen_questions)

# decoder_input_data
tokenized_answers = tokenizer.texts_to_sequences(bot)
maxlen_answers = max([len(x) for x in tokenized_answers])
decoder_input_data = preprocessing.sequence.pad_sequences(tokenized_answers , maxlen=MAX_LEN , padding='post', truncating='post')
print(decoder_input_data.shape, maxlen_answers)

# decoder_output_data
from tensorflow.keras import utils

tokenized_answers = tokenizer.texts_to_sequences(bot)

# remove <START> tag
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
padded_answers = preprocessing.sequence.pad_sequences(tokenized_answers, maxlen=MAX_LEN, padding='post', truncating='post')

# convert to 3d matrix (num_sentences x MAX_LEN x VOCAB_SIZE)
# for every word (out of the 30) in every sentence (out of num_sentences) convert the word to its one hot encoding representation
decoder_output_data = utils.to_categorical(padded_answers, VOCAB_SIZE)
print( decoder_output_data.shape )

(13561, 30) 43
(13561, 30) 68
(13561, 30, 4713)
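
The shape of `decoder_output_data` explains the memory pressure mentioned earlier. A rough estimate, assuming the one-hot array stores 4-byte floats and the padded integer array stores 4-byte integers (both are assumptions about the defaults, not something shown in the outputs above):

```python
def array_bytes(shape, bytes_per_value=4):
    """Size in bytes of a dense array with the given shape."""
    total = bytes_per_value
    for dim in shape:
        total *= dim
    return total

# shapes taken from the cell output above
one_hot_size = array_bytes((13561, 30, 4713))
integer_size = array_bytes((13561, 30))

print(f"one-hot targets: {one_hot_size / 1e9:.2f} GB")
print(f"integer targets: {integer_size / 1e6:.2f} MB")
```

Under these assumptions the one-hot targets for only 3 topics already take several gigabytes, and the size grows with both the number of sentences and the vocabulary. One commonly used alternative, not used in this notebook, is to keep the integer targets and train with the `sparse_categorical_crossentropy` loss.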


### D) Defining the Encoder-Decoder model

The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : Converts token indices to fixed-size dense vectors.
*   LSTM layer : Provides the Long Short-Term Memory cells.
*   Dense layer : Maps each decoder output to softmax probabilities over the vocabulary.

How it works :

1.   The `encoder_input_data` goes into the Embedding layer ( `encoder_embedding` ).
2.   The output of the Embedding layer goes to the LSTM cells, which produce 2 state vectors ( `h` and `c`, together the `encoder_states` ).
3.   These states are used as the initial states of the decoder's LSTM cells ( `decoder_lstm` ).
4.   The `decoder_input_data` goes through its own Embedding layer.
5.   The decoder embeddings go to the decoder's LSTM cells, which start from the encoder states `h` and `c` and produce the output sequences.

In [None]:
EMB_DIM=150

encoder_inputs = tf.keras.layers.Input(shape=(MAX_LEN, ))
# use mask_zero=True to mask out the zeros from the padding
encoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True) (encoder_inputs)
# connect lstm layer with encoder embedding
encoder_outputs, state_h , state_c = tf.keras.layers.LSTM(200, return_state=True)(encoder_embedding)
encoder_states = [state_h , state_c]

decoder_inputs = tf.keras.layers.Input(shape=(MAX_LEN, ))
decoder_embedding = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(200 , return_state=True , return_sequences=True)
# use the encoder states produced before as initial input for the decoder
decoder_outputs , _ , _ = decoder_lstm (decoder_embedding, initial_state=encoder_states)
# convert to probabilities
decoder_dense = tf.keras.layers.Dense(VOCAB_SIZE, activation=tf.keras.activations.softmax) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='categorical_crossentropy')

model.summary()

Model: "model_5"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_7 (InputLayer)           [(None, 30)]         0           []                               
                                                                                                  
 input_8 (InputLayer)           [(None, 30)]         0           []                               
                                                                                                  
 embedding_2 (Embedding)        (None, 30, 150)      706950      ['input_7[0][0]']                
                                                                                                  
 embedding_3 (Embedding)        (None, 30, 150)      706950      ['input_8[0][0]']                
                                                                                            

### E) Training the model

In [None]:
model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=50, epochs=35)

Epoch 1/35
...
Epoch 35/35


<keras.callbacks.History at 0x7fc4ecee9430>

In [None]:
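# Uncomment to save the trained model; the load below reads it back from disk.
#model.save('model.h5')
# Note: load_model returns a new model object. The layer variables defined above
# (encoder_inputs, decoder_lstm, decoder_dense, ...) still refer to the model built in this session.
model = tf.keras.models.load_model('model.h5')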

# Show the model architecture
model.summary()

### F) Defining inference models
Create the inference models used to predict answers.

*Encoder inference model* : Takes the question as input and outputs the LSTM states ( `h` and `c` ).

*Decoder inference model* : Takes 2 inputs: the LSTM states ( the output of the encoder model ) and the answer tokens generated so far ( fed one token at a time, beginning with the start token ). It outputs the next-token probabilities for the answer to the question fed to the encoder model, together with its updated state values.

In [None]:
def make_inference_models():
  encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
  
  decoder_state_input_h = tf.keras.layers.Input(shape=(200 ,))
  decoder_state_input_c = tf.keras.layers.Input(shape=(200 ,))
  
  decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
  
  decoder_outputs, state_h, state_c = decoder_lstm(
      decoder_embedding , initial_state=decoder_states_inputs)
  decoder_states = [state_h, state_c]
  decoder_outputs = decoder_dense(decoder_outputs)
  decoder_model = tf.keras.models.Model(
      [decoder_inputs] + decoder_states_inputs,
      [decoder_outputs] + decoder_states)
  
  return encoder_model , decoder_model
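
The way the two inference models are meant to be used is a greedy decoding loop: start from the start token and the encoder states, pick the most probable next token, feed it back in, and stop at the end token or at a length limit. The sketch below shows only that control flow in plain Python. The `step` callable stands in for the decoder model, and `toy_step`, the token ids and the scripted outputs are invented for illustration. Unlike the chat loop in this notebook, this sketch does not include the end token in its result.

```python
def greedy_decode(step, start_id, end_id, initial_state, max_len):
    """Repeatedly pick the most probable next token until end_id or max_len."""
    token, state = start_id, initial_state
    output = []
    while len(output) < max_len:
        probs, state = step(token, state)
        # greedy choice: index of the highest probability
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output

def toy_step(token, state):
    """Stand-in for the decoder: emits 3, then 4, then the end token (2)."""
    script = [3, 4, 2]
    target = script[min(state, len(script) - 1)]
    probs = [0.0] * 5
    probs[target] = 1.0
    return probs, state + 1  # the 'state' here is just a step counter

decoded = greedy_decode(toy_step, start_id=1, end_id=2, initial_state=0, max_len=10)
print(decoded)
```

The explicit length limit matters: without it, a model that never emits the end token would loop forever.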


### G) Talking with Chatbot

In [None]:
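def str_to_tokens(sentence : str ):
  # split with the same rules (lowercasing, punctuation filters) the Tokenizer used during fitting,
  # so that words such as "that's" match their vocabulary entries
  words = preprocessing.text.text_to_word_sequence(sentence)
  tokens_list = list()
  for word in words:
    try: 
      tokens_list.append(tokenizer.word_index[ word ]) 
    except KeyError:
      # the word was never seen during fitting, so it has no index
      print("I don't understand the word", word)
      return None
  return preprocessing.sequence.pad_sequences([tokens_list], maxlen=MAX_LEN , padding='post', truncating='post')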

In [None]:
enc_model , dec_model = make_inference_models()

for _ in range(10): 
    question = str_to_tokens( input( 'Enter question : ' ) )
    if question is None:
      break

    # encode the question; the encoder states initialise the decoder
    states_values = enc_model.predict(question)
    # feed the decoder one token at a time, beginning with the start token
    empty_target_seq = np.zeros(( 1 , 1 ))
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs, h, c = dec_model.predict([ empty_target_seq ] + states_values, verbose = 0)
        # greedy decoding: keep the most probable next token
        sampled_word_index = int(np.argmax( dec_outputs[0, -1, :] ))
        # index 0 is the padding index and has no word
        sampled_word = tokenizer.index_word.get(sampled_word_index)
        if sampled_word is not None:
            decoded_translation += ' {}'.format(sampled_word)

        if sampled_word == 'end' or len(decoded_translation.split()) > MAX_LEN:
            stop_condition = True

        # the sampled token and the updated states are the next decoder inputs
        empty_target_seq = np.zeros(( 1 , 1 ))  
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ] 

    print(decoded_translation)

Enter question : How many walks per day a Labrador needs?




 what is the dog end
Enter question : How much space a dog needs?
 what kind of dog do you have end
Enter question : Should I bath my cat?
 well that's a good idea need me to help you with anything else end
Enter question : What is an easy meal to make?
 the sound is that all set end
Enter question : Can you set my alarm for 6 am tomorrow?
 yes i can do that for you end
Enter question : What to see in France?
I don't understand the word france
