**Introduction:** As part of an AI project, a conversational chatbot was developed to answer the common queries of international students, who otherwise spend much of their time sending enquiries to the BU Helpdesk.

**Data set**: A JavaScript Object Notation (JSON) intents file is used as the dataset. The intents file contains many questions that a student is likely to ask, each mapped to possible responses.

The tag on each intent item identifies the group that a message belongs to. With this information, a neural network is trained to take a sentence and classify it as one of the tags in the intents file. A response is then picked from the matching group and returned to the user. The intents file has 105 tags, which serve as the classification classes, and more than 300 questions and responses for training the neural network.

The dataset comprises three JSON attributes.

**1.Tag**: Unique across the dataset. These values are the classes used when training the model, and questions are classified by this value.

**2.Patterns**: The questions expected from the student, used to train the model.

**3.Responses**: The responses given by the chatbot.
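
As an illustration of the three attributes, a pair of intent entries might look like the sketch below. The tags, patterns, and responses shown are hypothetical and are not taken from BUIntents.json; the top-level key `objectives` is the one read by the loader code in this notebook.

```python
import json

# Hypothetical intent entries; the real BUIntents.json content is not shown here.
sample_intents = {
    "objectives": [
        {
            "tag": "greeting",
            "patterns": ["Hi", "Hello there", "Good morning"],
            "responses": ["Hello! How can I help you?"],
        },
        {
            "tag": "library_hours",
            "patterns": ["When is the library open?", "Library opening times"],
            "responses": ["Please check the library page for opening times."],
        },
    ]
}

# Round-trip through JSON text, as the notebook would read it from a file.
loaded = json.loads(json.dumps(sample_intents))

# Every tag is a class label, so it must be unique across the dataset.
tags = [item["tag"] for item in loaded["objectives"]]
assert len(tags) == len(set(tags))

for item in loaded["objectives"]:
    print(item["tag"], "->", len(item["patterns"]), "patterns")
```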

The dataset was compiled from content on the https://www.bournemouth.ac.uk/ website.

**Used libraries:**
- numpy
- nltk
- tensorflow 2.0
- pandas
- json
- itertools
- random
- datetime
- IPython

**Dependent files**: This notebook depends on one dataset file:

1.BUIntents.json - Dataset



# Code Execution Environment setup
This Jupyter notebook can be run in Google Colaboratory or in an Anaconda virtual environment by setting the **environment** variable in the code below.

**For Google Colaboratory:** set the variable to **Colab**

**For Anaconda virtual environment:** set the variable to **Local**

The default value is "Colab", which activates the Colab-specific code: in Colab, Google Drive must be mounted and the nltk packages must be downloaded. Neither step is required in a local Anaconda virtual environment.

In [0]:
# set 'Colab' for Google Colaboratory or 'Local' for an Anaconda environment
environment = 'Colab'

# Import libraries



In [0]:
import random
import json
import itertools
import numpy as np
import pandas as pd
import os
from datetime import datetime
from IPython.display import display, HTML

# libraries needed for TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.models import load_model

# libraries needed for nltk
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wordnetlemmatizer = WordNetLemmatizer()

# these downloads are needed only on Colab
if environment == "Colab":
    # download the punkt tokenizer models and the WordNet corpus used by the lemmatizer
    nltk.download('punkt')
    nltk.download('wordnet')


# Method to load the dataset
This method loads the JSON file into a pandas DataFrame.

In [0]:
def getJSONdata(filename):
    # parse the intents file and build one DataFrame row per intent
    with open(filename, 'r') as file:
        data = json.load(file)
    return pd.DataFrame.from_dict(data['objectives'], orient='columns')

# Method to convert to lower case
This method converts any object to a lower-case string. A list, a dictionary, or any other type can be passed in: the value is first converted with str() and then lower-cased with str.lower().

In [0]:
def toLower(data):
    return str(data).lower()

# Method to Match the patterns with tags
Each pattern is split into a list of words using **"nltk.word_tokenize"** rather than being kept as a single string. Each token list is then paired with its tag, so that the model can be trained on patterns labelled with their classes. This method returns the list of (tokenized pattern, tag) pairs.

In [0]:
def matchPatternsWithTags(row):
  patterns2tags = []
  # loop through all the patterns of this intent
  for pattern in row['patterns']:
    # tokenize the pattern
    tword = nltk.word_tokenize(pattern)
    # pair the tokenized words with the intent's tag
    patterns2tags.append((tword, row['tag']))
  return patterns2tags
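
To make the returned structure concrete, the sketch below pairs tokenized patterns with a tag. It uses a simple regular-expression tokenizer as a stand-in for nltk.word_tokenize so that it runs without the nltk data files; the two tokenizers do not split text identically in every case. The helper names and the sample row are hypothetical.

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize: runs of word characters,
    # plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def match_patterns_with_tags(row, tokenize=simple_tokenize):
    # One (token list, tag) pair per pattern in the intent row.
    return [(tokenize(pattern), row["tag"]) for pattern in row["patterns"]]

# Hypothetical intent row.
row = {"tag": "greeting", "patterns": ["Hello there", "How are you?"]}
pairs = match_patterns_with_tags(row)
print(pairs)
```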

# Method to tokenize the patterns
Each pattern is tokenized using **"nltk.word_tokenize"**. This method returns the tokens of all patterns in the row as a single flat list.

In [0]:
def getTokenizedWords(row):
    tokenized_words = []
    for pattern in row['patterns']:          
        tword = nltk.word_tokenize(pattern)        
        tokenized_words.extend(tword)   
    return tokenized_words

# Method for lemmatization
Lemmatization reduces each token to its base form (its lemma). This method lower-cases the tokenized words, lemmatizes them using **"nltk.stem.WordNetLemmatizer"**, and returns a sorted list of unique lemma words. It also removes the special characters ['!', '?', ',', '.','&'].


In [0]:
def getLemmaWords(tokenized_words):
   specialCharacters = ['!', '?', ',', '.','&']
   # lower-case and lemmatize every token that is not a special character
   lwords = [wordnetlemmatizer.lemmatize(toLower(w)) for w in tokenized_words if w not in specialCharacters]
   # de-duplicate and sort so the vocabulary order is stable
   return sorted(set(lwords))
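
The vocabulary-building step can be sketched without nltk as follows. The lemmatizer is passed in as a function; the default used here returns the word unchanged and is an assumption for illustration only, not a real lemmatizer, so plural forms are not reduced as WordNetLemmatizer would reduce them. The function name is hypothetical.

```python
SPECIAL_CHARACTERS = ['!', '?', ',', '.', '&']

def build_vocabulary(tokens, lemmatize=lambda word: word):
    # Drop special characters, lower-case, lemmatize, de-duplicate, and sort.
    lemmas = {lemmatize(tok.lower()) for tok in tokens if tok not in SPECIAL_CHARACTERS}
    return sorted(lemmas)

tokens = ["Hello", "there", "hello", "How", "are", "you", "?"]
vocabulary = build_vocabulary(tokens)
print(vocabulary)
```

Sorting matters because the position of each word in this list becomes the position of its feature in the bag-of-words vector.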

# Mount the Google Drive
To access the dataset file and the saved weights on Google Drive, the drive needs to be mounted. Mounting is needed only once, on the first run, and must be authorized on the Google authentication page. Follow the instructions on that page to proceed.

In [0]:
# This code executes only on Colab
if environment == 'Colab':
  # mount Google Drive (needed on the first run in Colab)
  from google.colab import drive
  drive.mount('/content/drive')

# Assigning the dataset and weights path
The code below finds the dataset and saved-weights locations automatically by walking the directory tree under the current working directory until it reaches the folder that contains the notebook file.

In [0]:
# locate the folder containing the notebook; the dataset and weights live next to it
modelSavedPath = ''
datasetpath = ''
for root, dirs, files in os.walk(os.getcwd()):
  if 'BUChatbot.ipynb' in files:
    modelSavedPath = os.path.join(root, 'BUChatbotModel.h5')
    datasetpath = os.path.join(root, 'BUIntents.json')
    break
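
The same search can be wrapped in a small helper that returns None when the marker file is not found, which makes a missing notebook easier to detect than an empty path string. The function name and the temporary demonstration directory below are hypothetical.

```python
import os
import tempfile

def find_dir_containing(marker, start):
    # Walk the tree under `start` and return the first folder holding `marker`.
    for root, _dirs, files in os.walk(start):
        if marker in files:
            return root
    return None

# Demonstration in a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    project = os.path.join(tmp, "project")
    os.makedirs(project)
    open(os.path.join(project, "BUChatbot.ipynb"), "w").close()

    found = find_dir_containing("BUChatbot.ipynb", tmp)
    found_name = os.path.basename(found)
    dataset_path = os.path.join(found, "BUIntents.json")
    missing = find_dir_containing("NoSuchFile.ipynb", tmp)

print(found_name, missing)
```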

# Method to Prepare the train data and the model

In this method, the training data is prepared using the bag-of-words approach. For every (pattern, tag) pair, each word in the lemma words list is checked against the lemmatized words of the pattern: the entry is 1 if the lemma word occurs in the pattern and 0 otherwise.

The classes list has one entry per tag. The entry for the tag that the pattern belongs to is set to 1, and all other entries are 0.

Each training row is [bagofwords, classes_list]

**X traindata = bagofwords**

**Y traindata = classes_list**
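
A small worked example of the two encodings, using a made-up five-word vocabulary and two made-up tags (both are assumptions for illustration; the real lists come from the dataset):

```python
def bag_of_words(pattern_words, vocabulary):
    # 1 where the vocabulary word occurs in the pattern, 0 elsewhere.
    return [1 if word in pattern_words else 0 for word in vocabulary]

def one_hot(tag, tags):
    # 1 at the index of the pattern's tag, 0 elsewhere.
    row = [0] * len(tags)
    row[tags.index(tag)] = 1
    return row

vocabulary = ["are", "bye", "hello", "how", "you"]   # sorted lemma words
tags = ["goodbye", "greeting"]                        # sorted class names

x = bag_of_words(["how", "are", "you"], vocabulary)
y = one_hot("greeting", tags)
print(x)  # [1, 0, 0, 1, 1]
print(y)  # [0, 1]
```

The input vector always has the length of the vocabulary and the label vector the length of the tag list, which is what fixes the input and output sizes of the network.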


**Preparing the Model:**
A **Sequential model** with dense and dropout layers is used.

The dropout rate is set to 0.5: at each training update, every input unit to the layer is set to 0 with probability 0.5. Dropout is used mainly to reduce overfitting.
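
A minimal sketch of what dropout does during training, written in plain Python rather than with the Keras layer. It assumes the usual "inverted dropout" formulation, in which the units that are kept are scaled by 1/(1 - rate) so that the expected sum of the inputs is unchanged; at inference time the input passes through untouched. The function name is hypothetical.

```python
import random

def dropout(values, rate, training=True, rng=random):
    # At inference time dropout is a no-op.
    if not training:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    # Zero each unit with probability `rate`; scale the survivors.
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

With a rate of 0.5, each surviving activation is doubled, and on average half of the activations are zeroed in any one update.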

The first dense layer has 128 units with the ReLU activation function and receives the bag-of-words input.

The second dense (hidden) layer has 64 units with the ReLU activation function.

The output layer has as many units as there are classes, with a softmax activation function.

**ReLU** (rectified linear unit) outputs its input directly when the input is positive and outputs zero otherwise.

**Softmax:** Softmax converts a real vector into a vector of categorical probabilities. The output elements lie between 0 and 1 and sum to 1. Softmax is used as the activation of the last layer of a classification network because the result can be interpreted as a probability distribution.
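
Both activation functions are short enough to write out by hand. The sketch below is a plain-Python illustration with made-up inputs, not the Keras implementation:

```python
import math

def relu(x):
    # Pass positive inputs through, clamp everything else to zero.
    return x if x > 0 else 0.0

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([relu(v) for v in [-2.0, 0.0, 3.5]])   # [0.0, 0.0, 3.5]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest input receives the largest probability, and the ordering of the inputs is preserved in the outputs.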

For model compilation, the **SGD optimizer** and the **categorical_crossentropy** loss function are used, with accuracy as the metric.
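
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability that the model assigns to the true class. A short sketch with made-up probabilities (the `eps` guard is an assumption added here to avoid log(0)):

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # -sum(y_true * log(y_pred)); eps guards against log(0).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]                # one-hot target: the second class is correct
confident = [0.05, 0.90, 0.05]    # hypothetical softmax outputs
unsure = [0.40, 0.30, 0.30]

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the true class gets a high probability and grows as that probability falls, which is what training minimizes.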

The model is trained for 50 epochs with a batch size of 5. The decay is calculated as lr/epochs = 0.01/50 = 0.0002.
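
The decay value and its effect can be worked through as below. The schedule shown, lr / (1 + decay * iterations), is the time-based decay applied by the legacy Keras optimizers when a `decay` argument is given; treat it as an assumption for other optimizer versions. Note that `iterations` counts batch updates, not epochs. The figure of 300 training patterns is a round number based on the dataset description, not an exact count.

```python
import math

initial_lr = 0.01
epochs = 50
batch_size = 5
decay = initial_lr / epochs          # lr / epochs, about 0.0002

def decayed_lr(iteration):
    # Time-based decay (assumed legacy Keras behaviour).
    return initial_lr / (1.0 + decay * iteration)

num_patterns = 300                   # assumed round figure
steps_per_epoch = math.ceil(num_patterns / batch_size)
total_steps = steps_per_epoch * epochs

print(decay)
print(steps_per_epoch, total_steps)
print(decayed_lr(0), decayed_lr(total_steps))
```

Under these assumptions the learning rate falls from 0.01 to roughly 0.00625 by the end of training, a gentle reduction rather than a sharp one.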

verbose = 1 shows a progress bar with one line per epoch.

The trained model (architecture and weights, BUChatbotModel.h5) is saved to the same folder as the notebook.

In [0]:
def prepareTrainDataAndModel(patterns2tags, lemmawords, tag):
    traindata = []
    # template one-hot row with one entry per class
    empty_row = [0] * len(tag)
    for i in patterns2tags:
        # lemmatize the lower-cased words of this pattern
        l_words = [wordnetlemmatizer.lemmatize(word.lower()) for word in i[0]]
        # bag of words: 1 if the vocabulary word occurs in the pattern, else 0
        bagofwords = [1 if word in l_words else 0 for word in lemmawords]

        classes_list = list(empty_row)
        # set the index of the current class to 1
        classes_list[tag.index(i[1])] = 1
        traindata.append([bagofwords, classes_list])

    random.shuffle(traindata)
    # split into inputs and labels; avoids building a ragged numpy array,
    # which recent numpy versions reject
    traindata_x = [row[0] for row in traindata]
    traindata_y = [row[1] for row in traindata]

    # build the model
    chatbotmodel = Sequential()
    chatbotmodel.add(Dense(128, input_shape=(len(traindata_x[0]),), activation='relu'))
    chatbotmodel.add(Dropout(0.5))
    chatbotmodel.add(Dense(64, activation='relu'))
    chatbotmodel.add(Dropout(0.5))
    chatbotmodel.add(Dense(len(traindata_y[0]), activation='softmax'))

    # learning_rate replaces the deprecated lr alias
    sgdoptimizer = SGD(learning_rate=0.01, decay=0.0002, momentum=0.9)
    chatbotmodel.compile(loss='categorical_crossentropy', optimizer=sgdoptimizer, metrics=['accuracy'])

    # train the model
    history = chatbotmodel.fit(np.array(traindata_x), np.array(traindata_y), epochs=50, batch_size=5, verbose=1)
    # save the full model; the training history is not an argument of save()
    chatbotmodel.save(modelSavedPath)
    chatbotmodel.summary()
   

# Method to train the model

**The methods are called in the following sequence**

1.Call the getJSONdata method to load the dataset into a data frame.

2.Call the getTokenizedWords method and add the results to a new column in the data frame.

3.Call the matchPatternsWithTags method and add the results to a new column in the data frame.

4.Call the getLemmaWords method, passing the tokenized words as the input parameter.

5.Call the prepareTrainDataAndModel method, passing the patterns2tags list, the lemma words list, and the sorted tags list as input parameters.


In [0]:
df = getJSONdata(datasetpath)
df['tokenizedwords'] = df.apply(getTokenizedWords, axis=1)
df['patterns2tags'] = df.apply(matchPatternsWithTags, axis=1)
lemmawords = getLemmaWords(list(itertools.chain(*df['tokenizedwords'])))
prepareTrainDataAndModel(list(itertools.chain(*df['patterns2tags'])), lemmawords, sorted(list(set(df['tag']))))


# Methods to process user input into a bag of words and predictions
The model is now trained, so predictions can be obtained from the saved model. A neural network only understands numeric input, but the user asks questions as text.

So the same preprocessing steps are repeated for the user input.

1. Get responses from JSON
2. Tokenization
3. Lemmatization
4. Bag of words
5. Predict the output


The code below loads the saved model from the same folder as the notebook file.

The tag names, in the same sorted order that was used for training, are the classes used for classification, so each output index maps to the right tag.

In [0]:
classes = sorted(list(set(df['tag'])))
chatbotmodel = load_model(modelSavedPath)

**Getting the Bot response for user question**
This method converts the user question to a bag of words and predicts the class for that question.
ERROR_THRESHOLD is set to 0.25; predictions with a probability at or below this threshold are filtered out, and each remaining prediction keeps its intent index.

The prediction with the highest probability is used to respond to the user.
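
The filtering and ranking step can be tried in isolation. The class names and probabilities below are hypothetical; in the notebook they come from the sorted tag list and from the model's softmax output.

```python
ERROR_THRESHOLD = 0.25

def rank_intents(probabilities, classes, threshold=ERROR_THRESHOLD):
    # Keep classes whose probability exceeds the threshold, best first.
    kept = [(classes[i], p) for i, p in enumerate(probabilities) if p > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

classes = ["fees", "greeting", "visa"]      # hypothetical tags
probabilities = [0.30, 0.10, 0.60]          # hypothetical softmax output

ranked = rank_intents(probabilities, classes)
print(ranked)  # [('visa', 0.6), ('fees', 0.3)]

# When the probability mass is spread thinly, nothing passes the threshold
# and the bot falls back to its default reply.
fallback = rank_intents([0.2] * 5, ["a", "b", "c", "d", "e"])
print(fallback)  # []
```

With many classes, an empty result is a useful signal that the question does not resemble any trained pattern closely enough.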

In [0]:
def getBotResponse(user_input):
    # tokenize the user input
    input_words = nltk.word_tokenize(user_input)
    # lemmatize the user input
    input_words = [wordnetlemmatizer.lemmatize(iword.lower()) for iword in input_words]
    # convert the input into a bag of words over the training vocabulary
    bagofwords = [1 if word in input_words else 0 for word in lemmawords]

    # predict the class probabilities
    result = chatbotmodel.predict(np.array([bagofwords]))[0]
    # threshold cutoff
    ERROR_THRESHOLD = 0.25
    # keep [class index, probability] pairs above the threshold
    predictions = [[i,r] for i,r in enumerate(result) if r>ERROR_THRESHOLD]

    # sort the predictions, highest probability first
    predictions.sort(key=lambda x: x[1], reverse=True)
    prediction_list = []
    for r in predictions:
        prediction_list.append({"intent": classes[r[0]], "probability": str(r[1])})

    # print the prediction list and probabilities for reference
    print(prediction_list)
    # if no intent passes the threshold, reply with a fallback message
    if(len(prediction_list) == 0):
        return "I am sorry, I didn't get you"

    tag = prediction_list[0]['intent']
    responses = df.query('tag == "'+ tag + '"')['responses'].tolist()
    # the localtime response is evaluated as a Python expression;
    # eval is acceptable here only because the intents file is trusted
    if(toLower(tag) == 'localtime'):
        return str(eval(responses[0]))

    # pick a random response when several are available
    if(type(responses[0]) is list):
        response = random.choice(responses[0])
    else:
        response = responses[0]

    return response

# Testing the Chatbot
The bot can be tested in either of two ways:

1. Print statement

2. ipywidgets GUI

Either one can be used for testing.


**Displaying the output using a simple print statement**

In [0]:
response = getBotResponse("bye")
display(HTML(response))

**Displaying the output using ipywidgets**

In [0]:
import ipywidgets as widgets
from ipywidgets import GridspecLayout,Layout,VBox,HBox

# This method is called when the send button is clicked
def on_send_click(b):
    # get the user input and clear the entry box
    usermsg = TextEntryBox.value
    TextEntryBox.value = ''

    if usermsg != '':
        # display the user message in the chat display box
        ChatDisplayBox.value += "you: " + usermsg + '<br>'
        # get the response
        response = getBotResponse(usermsg)
        # display the response
        ChatDisplayBox.value += "BUHelpDesk: " + response + '<br>'

# fill it in with widgets
layout1 = Layout(flex='0 1 auto', height='500px', min_height='500px', width='600px',overflow_y='auto')
layout1.border= "2px solid lightblue" 
layout2 = Layout(flex='0 1 auto', height='50px', min_height='50px', width='500px')
layout3= Layout(flex='0 1 auto', height='30px', min_height='30px', width='100px')
ChatDisplayBox = widgets.HTML(value='', layout=layout1, disabled=True)
TextEntryBox = widgets.Text(value='',layout=layout2)
sendButton = widgets.Button(description='Send',layout=layout3)
sendButton.on_click(on_send_click)
vb_left = VBox([TextEntryBox], layout=Layout(width='500px'))
vb_right = VBox([sendButton], layout=Layout(width='100px'))
hb = HBox([vb_left,vb_right])
display(ChatDisplayBox)
hb

