<a href="https://colab.research.google.com/github/Rachhh53/chatbot/blob/main/Chatbot_using_LSTM_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Chatbot using Seq2Seq LSTM models**

This project builds a conversational chatbot using sequence-to-sequence (Seq2Seq) LSTM models. 
Sequence-to-sequence learning trains a model to convert sequences from one domain into sequences in another domain; here, questions are converted into answers. 

# Step 1: Import all the packages 

In [1]:
import numpy as np 
import tensorflow as tf
import pickle
from tensorflow.keras import layers, activations, models, preprocessing

# Step 2: Download all the data from Kaggle

In [2]:
# !pip install kaggle 

In [3]:
# from google.colab import files
# files.upload()

In [4]:
# !mkdir -p ~/.kaggle

In [5]:
# !cp kaggle.json ~/.kaggle/

In [6]:
# !ls ~/.kaggle

In [7]:
# !chmod 600 /root/.kaggle/kaggle.json

In [8]:
# !kaggle datasets download -d kausr25/chatterbotenglish

In [9]:
# !unzip /content/chatterbotenglish.zip

In [10]:
# !wget https://github.com/shubham0204/Dataset_Archives/blob/master/chatbot_nlp.zip?raw=true -O chatbot_nlp.zip
# !unzip chatbot_nlp.zip

# Step 3: Preprocessing the data

### a) Reading the data from the files
We parse each of the .yml files and then:

1. Concatenate two or more sentences if the answer has two or more of them.
2. Remove unwanted data types which are produced while parsing the data.
3. Append <START> and <END> to all the answers.
4. Create a Tokenizer and load the whole vocabulary ( questions + answers ) into it.
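
As a rough illustration of steps 1 to 3, the sketch below applies them to a small hand-written list standing in for the parsed `conversations` entries. The sample data and the helper name `build_pairs` are made up for this example; they are not part of the dataset or of the notebook code.

```python
def build_pairs(conversations):
    """Turn parsed conversations into (questions, tagged answers)."""
    questions, answers = [], []
    for con in conversations:
        # skip entries that have no reply at all
        if len(con) < 2:
            continue
        replies = con[1:]
        # step 2: drop pairs whose replies did not all parse as plain strings
        if not all(isinstance(rep, str) for rep in replies):
            continue
        questions.append(con[0])
        # step 1: join multi-sentence answers; step 3: add the tags
        answers.append('<START> ' + ' '.join(replies) + ' <END>')
    return questions, answers

# hypothetical sample in the shape of one file's 'conversations' list
sample = [
    ['What is AI?', 'Artificial intelligence.', 'A branch of computer science.'],
    ['Are you a robot?', 'Yes.'],
    ['Is it true?', True],   # a non-string value produced by the parser
    ['Hello'],               # no reply
]
questions, answers = build_pairs(sample)
print(questions)
print(answers)
```

Filtering the question and its answer in the same pass keeps the two lists aligned, which matters because the model pairs them by position.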

In [11]:
from tensorflow.keras import preprocessing, utils
import os
import yaml

The dataset contains .yml files, each holding pairs of questions and answers on a subject such as history, bot profile or science.
We can read them as follows:

In [12]:
dir_path = '/content/chatbot_nlp/data'
files_list = os.listdir(dir_path + os.sep)

In [13]:
# create separate lists for input sequences and target sequences
questions = list()
answers = list()

for filepath in files_list:
    # read each file in binary mode; the with block closes it afterwards
    with open( dir_path + os.sep + filepath , 'rb') as stream:
        docs = yaml.safe_load(stream)
    conversations = docs['conversations']
    for con in conversations:
        # the reply spans two or more sentences
        if len( con ) > 2 :
            questions.append(con[0])
            replies = con[ 1 : ]
            ans = ''
            # concatenate the sentences into a single answer string
            for rep in replies:
                ans += ' ' + rep
            answers.append( ans )
        # the reply is a single sentence
        elif len( con )> 1:
            questions.append(con[0])
            answers.append(con[1])

# target sequences
# keep only the pairs whose answer was parsed as a string; filtering both lists
# together keeps each question aligned with its answer
pairs = [ ( q , a ) for q , a in zip( questions , answers ) if type( a ) == str ]
questions = [ q for q , _ in pairs ]
answers_with_tags = [ a for _ , a in pairs ]

answers = list()
for i in range( len( answers_with_tags ) ) :
    # the tags tell the model where to start and stop generating text
    answers.append( '<START> ' + answers_with_tags[i] + ' <END>' )

tokenizer = preprocessing.text.Tokenizer()
# update vocabulary
tokenizer.fit_on_texts( questions + answers )
VOCAB_SIZE = len( tokenizer.word_index )+1
print( 'VOCAB SIZE : {}'.format( VOCAB_SIZE ))

VOCAB SIZE : 1894


In [14]:
tokenizer.word_index.keys()

dict_keys(['end', 'start', 'you', 'a', 'i', 'the', 'is', 'of', 'to', 'what', 'are', 'do', 'not', 'and', 'me', 'it', 'in', 'have', 'that', 'am', 'tell', 'as', 'get', 'can', 'my', 'when', "i'm", 'your', 'how', 'joke', 'like', 'be', 'an', 'feel', 'about', 'who', 'computer', 'or', 'for', "don't", 'no', 'by', 'cross', 'with', 'software', 'on', 'all', 'think', 'much', 'but', 'very', 'which', 'at', 'he', 'why', 'know', 'any', 'could', 'was', 'so', 'one', 'should', 'from', 'make', 'more', 'we', 'if', 'robots', 'will', 'did', 'die', 'favorite', 'stock', 'been', 'say', 'emotion', 'human', 'mad', 'robot', 'read', 'hal', 'does', 'feeling', "that's", 'right', 'really', 'bad', 'said', 'just', 'yet', 'up', 'eat', 'would', 'computers', 'chat', 'market', 'time', 'hard', 'try', 'work', 'some', 'sense', 'emotions', 'gossip', 'well', 'yes', 'too', 'than', 'capable', 'programmed', 'this', 'ever', 'sad', 'makes', 'myself', 'has', "it's", 'other', 'people', 'only', 'them', 'good', 'money', 'never', 'experien
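
The vocabulary above contains `start` and `end` rather than `<START>` and `<END>` because the Keras `Tokenizer` lowercases text and strips most punctuation by default, while keeping apostrophes (hence `i'm`). The sketch below imitates that default behaviour in plain Python. `keras_like_split` is a name made up for this illustration, and the filter string is the documented default of the `filters` argument as far as I know; treat it as an assumption to check against your installed version.

```python
# Default value of Tokenizer(filters=...) as documented for tf.keras
# (assumption: unchanged in your version). Note that it has no apostrophe.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_split(text, filters=DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: ' ' for ch in filters})
    return text.lower().translate(table).split()

print(keras_like_split("<START> I'm fine, thanks. <END>"))
```

This is why the inference loop later looks up `tokenizer.word_index['start']` and compares against `'end'` without the angle brackets.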

### b) Preparing data for Seq2Seq model

This model requires three arrays: encoder_input_data, decoder_input_data and decoder_output_data.

For encoder_input_data:
Tokenize the questions and pad them to the length of the longest question.

For decoder_input_data:
Tokenize the answers and pad them to the length of the longest answer.

For decoder_output_data:
Tokenize the answers and remove the first element from every tokenized answer. This is the <START> element which was added earlier, so each target is the decoder input shifted one step ahead. The result is then padded and one-hot encoded.
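
To make the relationship between the two decoder arrays concrete, the sketch below builds both from one toy tokenized answer. The token ids and the helper `pad_post` are invented for this illustration; `pad_post` only mimics what `pad_sequences(..., padding='post')` does for a single sequence that is not longer than `maxlen`.

```python
def pad_post(seq, maxlen, value=0):
    """Right-pad a list of token ids with `value` (assumes len(seq) <= maxlen)."""
    return seq + [value] * (maxlen - len(seq))

# toy ids: 2 = start, 1 = end, the rest are ordinary words
tokenized_answer = [2, 15, 7, 42, 1]
maxlen = 6

decoder_input = pad_post(tokenized_answer, maxlen)
# the target is the same sequence shifted left by one step (start removed)
decoder_target = pad_post(tokenized_answer[1:], maxlen)

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(step, 'input:', seen, '-> target:', expected)
```

At each time step the decoder sees the token in `decoder_input` and is trained to predict the token at the same position in `decoder_target`, that is, the next word of the answer.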

In [15]:
from gensim.models import Word2Vec
import re

In [16]:
vocab = []
for word in tokenizer.word_index:
  vocab.append(word)

def tokenize(sentences):
  tokens_list = []
  vocabulary = []
  for sentence in sentences:
    sentence = sentence.lower()
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    tokens = sentence.split()
    vocabulary += tokens
    tokens_list.append(tokens)
  return tokens_list, vocabulary

In [17]:
#encoder_input_data

# transform text to integer
tokenized_questions = tokenizer.texts_to_sequences( questions )
# find length of longest question
maxlen_questions = max( [len(x) for x in tokenized_questions ] )
# ensure all vectors are the same length by padding 0 to the end
padded_questions = preprocessing.sequence.pad_sequences( tokenized_questions, maxlen = maxlen_questions, padding = 'post')
encoder_input_data = np.array(padded_questions)
print(encoder_input_data.shape, maxlen_questions)

(564, 22) 22


The output above shows the shape of encoder_input_data and the length of the longest question in the corpus.

In [18]:
# decoder_input_data

# transform text to integer
tokenized_answers = tokenizer.texts_to_sequences( answers )
# find length of longest answer
maxlen_answers = max( [ len(x) for x in tokenized_answers ] )
# ensure all vectors are the same length by padding 0 to the end
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
decoder_input_data = np.array( padded_answers )
print( decoder_input_data.shape , maxlen_answers )

(564, 74) 74


In [19]:
# decoder_output_data

# transform text to integer
tokenized_answers = tokenizer.texts_to_sequences( answers )
for i in range(len(tokenized_answers)) :
    tokenized_answers[i] = tokenized_answers[i][1:]
# ensure all vectors are the same length by padding 0 to the end
padded_answers = preprocessing.sequence.pad_sequences( tokenized_answers , maxlen=maxlen_answers , padding='post' )
# convert to binary class matrix
onehot_answers = utils.to_categorical( padded_answers , VOCAB_SIZE )
decoder_output_data = np.array( onehot_answers )
print( decoder_output_data.shape )

(564, 74, 1894)
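
The one-hot target is by far the largest array in this pipeline. The short calculation below estimates its size, assuming `to_categorical` returns 32-bit floats (its documented default dtype, as far as I know). If memory becomes a problem, a commonly used alternative, which this notebook does not use, is to keep the integer targets and switch the loss to `sparse_categorical_crossentropy`.

```python
samples, timesteps, vocab_size = 564, 74, 1894   # shape printed above
bytes_per_value = 4                              # float32 (assumed default)

n_values = samples * timesteps * vocab_size
size_mib = n_values * bytes_per_value / 2**20

print('values :', n_values)
print('size   : {:.1f} MiB'.format(size_mib))
```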


# Step 4: Defining Encoder Decoder Model





In [20]:
dimensionality = 200

# input is a sequence of token ids (all questions are padded to the same length)
encoder_inputs = tf.keras.layers.Input(shape=( maxlen_questions , ))
# embedding layer maps each token id to a dense vector; mask_zero=True makes later layers skip the 0 padding
encoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, dimensionality , mask_zero=True ) (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( dimensionality , return_state=True )( encoder_embedding )
encoder_states = [ state_h , state_c ]

decoder_inputs = tf.keras.layers.Input(shape=( maxlen_answers ,  ))
decoder_embedding = tf.keras.layers.Embedding( VOCAB_SIZE, dimensionality , mask_zero=True) (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( dimensionality , return_state=True , return_sequences=True )
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( VOCAB_SIZE , activation=tf.keras.activations.softmax ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 22)]         0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, 74)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 22, 200)      378800      ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, 74, 200)      378800      ['input_2[0][0]']                
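
The embedding parameter counts in the summary can be checked by hand, and the same arithmetic gives the sizes of the remaining layers, which are cut off in the output shown here. The LSTM and Dense figures below come from the standard parameter formulas for those layer types, not from the notebook output.

```python
vocab_size = 1894   # VOCAB_SIZE printed earlier
units = 200         # dimensionality
embed_dim = 200     # the embedding width equals dimensionality here

# Embedding: one vector of length embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense softmax layer: weights plus one bias per output word
dense_params = units * vocab_size + vocab_size

print('embedding:', embedding_params)   # agrees with 378800 in the summary
print('lstm     :', lstm_params)
print('dense    :', dense_params)
```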
                                                                                              

# Step 5: Training the Model

We train the model with the RMSprop optimizer and the categorical_crossentropy loss function. The training cell, left commented out here, uses a batch size of 50 and 300 epochs.
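
For a single time step, categorical cross-entropy reduces to the negative log of the probability that the model assigned to the correct word. The plain-Python sketch below illustrates this with a made-up four-word vocabulary; the function mirrors the textbook definition and is not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a probability vector."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0, 0, 1, 0]                   # the correct word is index 2
confident = [0.05, 0.05, 0.85, 0.05]    # most mass on the correct word
unsure = [0.25, 0.25, 0.25, 0.25]       # uniform guess

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)
```

The loss is lower when more probability sits on the correct word, which is what training pushes the decoder towards at every position of the answer.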

In [21]:
# model.fit([encoder_input_data , decoder_input_data], decoder_output_data, batch_size=50, epochs=300 ) 
# model.save( 'model.h5' )

# Step 6: Defining Inference Models

Encoder inference model: takes a question as input and outputs the LSTM states (h and c).

Decoder inference model: takes two inputs, the LSTM states and the answer input sequence. It outputs the predicted answer for the question that was fed to the encoder model, together with its updated state values.

In [22]:
def make_inference_models():
    
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)
    
    decoder_state_input_h = tf.keras.layers.Input(shape=( dimensionality ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( dimensionality ,))
    
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_embedding , initial_state=decoder_states_inputs)
    
    decoder_states = [state_h, state_c]

    decoder_outputs = decoder_dense(decoder_outputs)
    
    decoder_model = tf.keras.models.Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)
    
    return encoder_model , decoder_model

# Step 7: Talking with the Chatbot

We define a function str_to_tokens, which converts a question string into a padded sequence of integer tokens. The conversation loop then works as follows:

1. First, we take a question as input and predict the state values using enc_model.
2. We set the state values in the decoder's LSTM.
3. Then, we generate a sequence which contains the <start> element.
4. We input this sequence in the dec_model.
5. We replace the <start> element with the element which was predicted by the dec_model and update the state values.
6. We carry out the above steps iteratively till we hit the <end> tag or the maximum answer length.
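
The loop described in steps 3 to 6 is a greedy decoder. The sketch below shows its control flow in isolation, with a made-up `fake_decoder_step` function standing in for `dec_model.predict` and a tiny invented vocabulary. It illustrates the loop only and does not involve the trained network.

```python
INDEX_WORD = {1: 'end', 2: 'start', 3: 'hello', 4: 'there'}   # toy vocabulary
WORD_INDEX = {w: i for i, w in INDEX_WORD.items()}

def fake_decoder_step(token, state):
    """Stand-in for the decoder: returns per-word scores and a new state.

    The state is just a step counter here; a real LSTM state is a pair of vectors.
    """
    script = ['hello', 'there', 'end']
    next_word = script[min(state, len(script) - 1)]
    scores = [0.0] * (max(INDEX_WORD) + 1)
    scores[WORD_INDEX[next_word]] = 1.0
    return scores, state + 1

def greedy_decode(initial_state, max_len=10):
    """Feed each predicted token back in until 'end' or max_len words."""
    token, state, words = WORD_INDEX['start'], initial_state, []
    while len(words) < max_len:
        scores, state = fake_decoder_step(token, state)
        # greedy choice: take the highest-scoring word id
        token = max(range(len(scores)), key=scores.__getitem__)
        word = INDEX_WORD.get(token)   # id 0 (padding) has no word
        if word is None or word == 'end':
            break
        words.append(word)
    return ' '.join(words)

print(greedy_decode(initial_state=0))
```

Greedy decoding always commits to the single most likely word at each step; it is simple and fast, at the cost of never revisiting an earlier choice.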



In [23]:
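# load the trained weights into the layers defined in Step 4; make_inference_models()
# reuses those layer objects, whereas models.load_model would build a separate copy
model.load_weights('model.h5')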

In [24]:
def str_to_tokens( sentence : str ):

    sentence = re.sub('[^a-zA-Z]', ' ', sentence)
    words = sentence.lower().split()
    tokens_list = list()
  
    for word in words:
        if word in tokenizer.word_index.keys():
          tokens_list.append( tokenizer.word_index[ word ] ) 

    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=maxlen_questions , padding='post')
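
One consequence of this function is that words missing from the vocabulary are silently dropped, and apostrophes are replaced by spaces before the lookup. The plain-Python sketch below reproduces the cleaning and lookup steps with a made-up `toy_word_index` to show the effect; padding is left out, and `lookup_tokens` is a name invented for this example.

```python
import re

# hypothetical slice of a vocabulary; the ids are invented
toy_word_index = {'what': 10, 'is': 7, "i'm": 27, 'i': 5, 'a': 4, 'robot': 79}

def lookup_tokens(sentence, word_index):
    """Same cleaning and lookup as str_to_tokens, without the padding."""
    words = re.sub('[^a-zA-Z]', ' ', sentence).lower().split()
    return [word_index[w] for w in words if w in word_index]

print(lookup_tokens('What is a flibbertigibbet?', toy_word_index))
print(lookup_tokens("I'm a robot", toy_word_index))
```

Because the Tokenizer keeps apostrophes while this regular expression removes them, a contraction such as `i'm` that exists in the vocabulary is never matched at inference time; it is split into `i` and `m` instead.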


In [29]:
enc_model , dec_model = make_inference_models()

print('Type goodbye to stop the conversation at any time.')
while True:
    u_input = input( 'Enter question : ' )
    # leave the loop before running the model on the exit keyword
    if u_input.strip().lower() == 'goodbye':
        break
    # encode the question into the initial decoder states (h and c)
    states_values = enc_model.predict( str_to_tokens( u_input ) )
    # start decoding from the 'start' token
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = tokenizer.word_index['start']
    stop_condition = False
    decoded_translation = ''
    while not stop_condition :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        # greedy decoding: take the most probable next word
        sampled_word_index = np.argmax( dec_outputs[0, -1, :] )
        # index 0 is padding and has no entry, so the lookup may return None
        sampled_word = tokenizer.index_word.get( int( sampled_word_index ) )

        if sampled_word == 'end' or len(decoded_translation.split()) > maxlen_answers:
            stop_condition = True
        elif sampled_word is not None:
            decoded_translation += ' {}'.format( sampled_word )

        # feed the predicted word and the updated states back into the decoder
        empty_target_seq = np.zeros( ( 1 , 1 ) )
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ]

    print( decoded_translation )

Type goodbye to stop the conversation at any time.
Enter question : hello chatbot!




 resemblance invented why no glad giant multiplication thousands plato's didn't data's mumble working man physicist relative embarassed oses oses oses oses oses oses oses owned owned everything prefer 20th erased burn burn pennsylvania continent device indvidual volumes dental really link link carolina he v conversations anyone toying ai would district considered accomplish hope fever 20th sapient work currency braggadaccio context classless 20th skiddoo russia's russia's russia's russia's russia's value wish relationships wave recall vineland operates


KeyboardInterrupt: ignored

# Conversion to TFLite 

We can convert the encoder and decoder inference models to TensorFlow Lite models so that they can run on edge devices.


In [None]:
#!pip install tf-nightly

In [None]:
# converter = tf.lite.TFLiteConverter.from_keras_model( enc_model )
# buffer = converter.convert()
# open( 'enc_model.tflite' , 'wb' ).write( buffer )

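# converter = tf.lite.TFLiteConverter.from_keras_model( dec_model )
# buffer = converter.convert()   # convert the decoder too, otherwise the encoder buffer is written again
# open( 'dec_model.tflite' , 'wb' ).write( buffer )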