For my final capstone, my goal is to build a chatbot trained on gardening-related questions and answers that can help users with their gardening issues.

This notebook is the third step of my final capstone:
- Part 1: https://colab.research.google.com/drive/12wB6STgkfORS_LsU6AEJC6Qr5PmIK6KQ?usp=sharing

- Part 2: https://colab.research.google.com/drive/1VP7-CyF1MibmaugyUTvBsWOrqgio_Ke-?usp=sharing

# Final Capstone, Part 3: Chatbot model using Deep Learning techniques

The dataset consists of 14,868 questions and answers scraped from the gardening blog on houzz.com and processed in steps 1 and 2. In step 1, I extracted the category/intent of each question using topic modeling techniques. In step 2, I summarized the answers.

There are 3 columns:
- Cleaned_A_summ: the summarized answer obtained in step 2
- Topic: the topic obtained in step 1
- Question: the original blog post title, which we use as the question

Based on my research and my hardware limitations, I will format this dataset and train a 3-layer neural network using stochastic gradient descent with Nesterov accelerated gradient.



In [21]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [22]:
#authenticate and create user
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [23]:
import pandas as pd
downloaded_text1 = drive.CreateFile({'id':"1-57sRquo8BGeIW2UJIwT4ZyjaJKTBZnM"})   
downloaded_text1.GetContentFile('final_capstone_summarized_df.csv')  
df = pd.read_csv('final_capstone_summarized_df.csv')

In [24]:
import random

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer
lemmatizer = WordNetLemmatizer()
stemmer = LancasterStemmer()

# things we need for Tensorflow / Keras
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.optimizers import SGD
from tensorflow.keras.callbacks import EarlyStopping

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [25]:
df.head()

Unnamed: 0.1,Unnamed: 0,Cleaned_A_orig,Cleaned_A_summ,Topic,Question,word_count,sentence_count,avg_sentence_length,char_count,word_density,punc_count
0,0,i’ve no idea what the kratky method is well i ...,i’ve no idea what the kratky method is well i ...,Indoor_Gardening,Kratky lettuce spots,26,1,26.0,105,0.245283,0
1,1,"yes it s magic! when you thin of it, hydroponi...",excellent question hydroangel! i saw your comm...,Hydroponics,What keeps grow plugs together?,28,1,28.0,129,0.215385,2
2,2,i m not expert but i think adding a little at ...,"once the roots become air roots, topping off w...",Perennial_Gardening,Possible to refill container when doing Kratky...,14,1,14.0,68,0.202899,1
3,3,it s the back-pressure of the water in the man...,"if your reservoir were taller, you could have ...",Compost_Fertilization,Pump runs for about a minute then loses pressure,24,1,24.0,98,0.242424,1
4,4,update and a bump. some of the cuttings are st...,can i dome each individual clone with a plasti...,Organic_Gardening,Cloning Blueberries using an Aeroponic System ...,71,1,71.0,295,0.239865,4


In [26]:
df.drop(df.columns[:2], axis=1, inplace=True)
df.drop(df.columns[3:], axis=1, inplace=True)

df_final=df.reset_index(drop=True)

df_final.head()

Unnamed: 0,Cleaned_A_summ,Topic,Question
0,i’ve no idea what the kratky method is well i ...,Indoor_Gardening,Kratky lettuce spots
1,excellent question hydroangel! i saw your comm...,Hydroponics,What keeps grow plugs together?
2,"once the roots become air roots, topping off w...",Perennial_Gardening,Possible to refill container when doing Kratky...
3,"if your reservoir were taller, you could have ...",Compost_Fertilization,Pump runs for about a minute then loses pressure
4,can i dome each individual clone with a plasti...,Organic_Gardening,Cloning Blueberries using an Aeroponic System ...


In [27]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14868 entries, 0 to 14867
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Cleaned_A_summ  14868 non-null  object
 1   Topic           14868 non-null  object
 2   Question        14868 non-null  object
dtypes: object(3)
memory usage: 348.6+ KB


Here I organize the data into 3 lists:

- documents: a list of pairs where the first element is the list of tokens in the question and the second element is its topic
- classes: the list of distinct topics
- words: the list of distinct words that appear in the questions
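
To make the three structures concrete, here is a minimal sketch on a toy dataset. The rows and the regex-based `tokenize` helper are hypothetical stand-ins (the notebook itself uses `df_final` and `nltk.word_tokenize`), so treat this as an illustration of the shapes involved rather than the exact pipeline.

```python
import re

# Hypothetical mini-dataset standing in for df_final[['Question', 'Topic']]
rows = [
    ("What keeps grow plugs together?", "Hydroponics"),
    ("Kratky lettuce spots", "Indoor_Gardening"),
    ("When should I water lettuce?", "Watering"),
]

def tokenize(text):
    # Simplified stand-in for nltk.word_tokenize: lowercase alphabetic tokens only
    return re.findall(r"[a-z]+", text.lower())

# documents: (list of question tokens, topic) pairs
documents = [(tokenize(question), topic) for question, topic in rows]
# classes: sorted list of distinct topics
classes = sorted({topic for _, topic in rows})
# words: sorted vocabulary of every token seen in any question
words = sorted({token for tokens, _ in documents for token in tokens})

print(documents[1])  # (['kratky', 'lettuce', 'spots'], 'Indoor_Gardening')
print(classes)       # ['Hydroponics', 'Indoor_Gardening', 'Watering']
print(len(words))    # 12
```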

In [28]:
#creating a new column with tokenized questions
from nltk.tokenize import word_tokenize

df_final['Question_Tokenized']=df_final['Question'].apply(word_tokenize) 

q_t=df_final['Question_Tokenized'].to_list() 

In [29]:
#checking
df_final.head()

Unnamed: 0,Cleaned_A_summ,Topic,Question,Question_Tokenized
0,i’ve no idea what the kratky method is well i ...,Indoor_Gardening,Kratky lettuce spots,"[Kratky, lettuce, spots]"
1,excellent question hydroangel! i saw your comm...,Hydroponics,What keeps grow plugs together?,"[What, keeps, grow, plugs, together, ?]"
2,"once the roots become air roots, topping off w...",Perennial_Gardening,Possible to refill container when doing Kratky...,"[Possible, to, refill, container, when, doing,..."
3,"if your reservoir were taller, you could have ...",Compost_Fertilization,Pump runs for about a minute then loses pressure,"[Pump, runs, for, about, a, minute, then, lose..."
4,can i dome each individual clone with a plasti...,Organic_Gardening,Cloning Blueberries using an Aeroponic System ...,"[Cloning, Blueberries, using, an, Aeroponic, S..."


In [30]:
#creating a list from the topic column
topic_list=df_final['Topic'].to_list()

In [31]:
#merging the 2 lists into the "documents" list
def merge(list1, list2):
    # pair the i-th element of list1 with the i-th element of list2
    return list(zip(list1, list2))

documents = merge(q_t, topic_list)

In [32]:
#Print length
len(documents)

14868

In [33]:
#Print first 5 examples
documents[0:5]

[(['Kratky', 'lettuce', 'spots'], 'Indoor_Gardening'),
 (['What', 'keeps', 'grow', 'plugs', 'together', '?'], 'Hydroponics'),
 (['Possible',
   'to',
   'refill',
   'container',
   'when',
   'doing',
   'Kratky',
   'method',
   '?'],
  'Perennial_Gardening'),
 (['Pump', 'runs', 'for', 'about', 'a', 'minute', 'then', 'loses', 'pressure'],
  'Compost_Fertilization'),
 (['Cloning',
   'Blueberries',
   'using',
   'an',
   'Aeroponic',
   'System',
   '-',
   'Help',
   'Needed'],
  'Organic_Gardening')]

In [34]:
#Creating the word list
words = {
    x for x in ' '.join(
        df_final['Question'].str.lower().tolist()
    ).split() if x.isalpha()
} 

In [35]:
words=list(words)

In [36]:
#creating the classes list
classes = set(topic_list)
classes=list(classes)

In [37]:
#looking at the 3rd value in the list
classes[2]

'Perennial_Gardening'

In [38]:
#Cleaning and lemmatizing

ignore_words = ['?', '!','/','-']

words = [lemmatizer.lemmatize(w.lower()) for w in words if w not in ignore_words]
words = sorted(list(set(words)))

classes = sorted(list(set(classes)))

print (len(documents), "documents")

print (len(classes), "classes", classes)

print (len(words), "unique lemmatized words", words)

14868 documents
10 classes ['Compost_Fertilization', 'Herb_Garden', 'Hydroponics', 'Indoor_Gardening', 'Issue_Identification', 'Organic_Gardening', 'Perennial_Gardening', 'Pest_Disease', 'Trees', 'Watering']


## Chatbot model

In [39]:
# initializing training data
training = []
output_empty = [0] * len(classes)
for doc in documents:
    # lemmatize and lowercase each token of the question so related word forms map to the same base word
    pattern_words = {lemmatizer.lemmatize(word.lower()) for word in doc[0]}
    # bag of words: 1 if the vocabulary word occurs in the question, 0 otherwise
    bag = [1 if w in pattern_words else 0 for w in words]

    # output is a '0' for each topic and '1' for the current topic
    output_row = list(output_empty)
    output_row[classes.index(doc[1])] = 1

    training.append([bag, output_row])
# shuffle the examples and turn them into an np.array
# dtype=object is needed because each row holds two lists of different lengths
random.shuffle(training)
training = np.array(training, dtype=object)
# split into features and labels. X - bag-of-words vectors, Y - one-hot topics
train_x = list(training[:,0])
train_y = list(training[:,1])
print("Training data created")

Training data created
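
As an illustration of what a single training row looks like, the sketch below encodes one question against a tiny vocabulary. The `words` and `classes` values here are hypothetical placeholders, and `encode` is a helper introduced only for this example; the real lists come from the cells above.

```python
# Hypothetical vocabulary and class list (the real ones are built earlier in the notebook)
words = ["aphid", "lettuce", "plant", "spot", "water"]
classes = ["Indoor_Gardening", "Pest_Disease", "Watering"]

def encode(tokens, topic):
    token_set = set(tokens)
    # bag of words: 1 if the vocabulary word occurs in the question, else 0
    bag = [1 if w in token_set else 0 for w in words]
    # one-hot label: 1 at the index of the question's topic
    label = [0] * len(classes)
    label[classes.index(topic)] = 1
    return bag, label

# "kratky" is not in the toy vocabulary, so it is simply ignored
bag, label = encode(["lettuce", "spot", "kratky"], "Indoor_Gardening")
print(bag)    # [0, 1, 0, 1, 0]
print(label)  # [1, 0, 0]
```

Because out-of-vocabulary tokens leave no trace in the vector, a question made up entirely of unseen words produces an all-zero input.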


In [40]:
# Create model - 3 layers. First layer 128 neurons, second layer 64 neurons and 3rd output layer contains number of neurons
# equal to number of intents to predict output intent with softmax
model = Sequential()
model.add(Dense(128, input_shape=(len(train_x[0]),), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(train_y[0]), activation='softmax'))

# Compile model. Stochastic gradient descent with Nesterov accelerated gradient gives good results for this model
# More info on this https://ruder.io/optimizing-gradient-descent/
# `learning_rate` replaces the deprecated `lr` argument
# note: recent Keras releases may reject `decay`; drop it or use a learning-rate schedule there
sgd = SGD(learning_rate=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

#fitting and saving the model
#es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,patience=2)
hist = model.fit(np.array(train_x), np.array(train_y), epochs=25, batch_size=5, verbose=1)
# save only the model: the second positional argument of save() is `overwrite`, not the training history
model.save('chatbot_model.h5')

print("model created")

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
model created
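
For intuition about the optimizer used above, here is a minimal pure-Python sketch of SGD with Nesterov momentum on a one-dimensional toy objective. It uses the classical "look-ahead" formulation with the same `lr=0.01` and `momentum=0.9` as the model cell; Keras may arrange the update differently internally, and `nesterov_sgd` is a name introduced only for this example.

```python
def nesterov_sgd(grad, w, lr=0.01, momentum=0.9, steps=500):
    """Minimize a 1-D function with SGD + Nesterov momentum (look-ahead form)."""
    v = 0.0
    for _ in range(steps):
        # evaluate the gradient at the look-ahead point, not at w itself
        g = grad(w + momentum * v)
        # update the velocity, then move the parameter along it
        v = momentum * v - lr * g
        w = w + v
    return w

# toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = nesterov_sgd(lambda w: 2.0 * (w - 3.0), w=0.0)
print(round(w_final, 3))  # 3.0
```

The only difference from plain momentum is where the gradient is evaluated: at `w + momentum * v` instead of at `w`.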


In [41]:
from keras.models import load_model
model = load_model('chatbot_model.h5')
import random


In [42]:
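def clean_up_sentence(sentence):
    # tokenize the pattern
    sentence_words = nltk.word_tokenize(sentence)
    # lemmatize each word, matching the preprocessing used to build `words`
    # (the vocabulary was lemmatized, so stemmed tokens would rarely match it)
    sentence_words = [lemmatizer.lemmatize(word.lower()) for word in sentence_words]
    return sentence_words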

# return bag of words array: 0 or 1 for each word in the bag that exists in the sentence
def bow(sentence, words, show_details=False):
    # tokenize the pattern
    sentence_words = clean_up_sentence(sentence)
    # bag of words
    bag = [0]*len(words)  
    for s in sentence_words:
        for i,w in enumerate(words):
            if w == s: 
                bag[i] = 1
                if show_details:
                    print ("found in bag: %s" % w)

    return(np.array(bag))

In [43]:
p = bow("What can I do against aphids?", words)
print (p)

[0 0 0 ... 0 0 0]


In [44]:
list_of_int = set(df_final['Topic'])
list_of_int = list(list_of_int)
list_of_intents_2 = pd.DataFrame(list_of_int,columns=['Tag'])
list_of_intents_2.head()

Unnamed: 0,Tag
0,Issue_Identification
1,Herb_Garden
2,Perennial_Gardening
3,Organic_Gardening
4,Watering


In [55]:
#import random
def predict_class(sentence, model):
    # filter out predictions below a threshold
    p = bow(sentence, words,show_details=False)
    res = model.predict(np.array([p]))[0]
    ERROR_THRESHOLD = 0.25
    results = [[i,r] for i,r in enumerate(res) if r>ERROR_THRESHOLD]
    # sort by strength of probability
    results.sort(key=lambda x: x[1], reverse=True)
    return_list = []
    for r in results:
        return_list.append({"intent": classes[r[0]], "probability": str(r[1])})
    return return_list

def getResponse(ints, intents_json):
    # no intent cleared the threshold: return a fallback instead of raising an IndexError
    if not ints:
        return ["Sorry, I don't have an answer for that."]
    tag = ints[0]['intent']
    # select all rows whose topic matches the predicted intent
    df_temp = intents_json[intents_json['Topic'] == tag]
    # pick one summarized answer at random from that topic
    result = df_temp['Cleaned_A_summ'].sample(n=1)
    return result.to_list()

def chatbot_response(msg):
    ints = predict_class(msg, model)
    res = getResponse(ints, df_final)
    return res

In [56]:
predict_class('What can you do against aphids',model)

[{'intent': 'Indoor_Gardening', 'probability': '0.7952408'}]
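
The thresholding step in `predict_class` can be sketched in isolation. The example below uses a hand-written `softmax` and a hypothetical four-class subset, and shows that when the model is uncertain no class may clear the 0.25 threshold, so the result can be an empty list. `filter_intents` is an illustrative helper, not part of the notebook.

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_intents(probs, classes, threshold=0.25):
    # keep classes strictly above the threshold, strongest first
    kept = [(c, p) for c, p in zip(classes, probs) if p > threshold]
    return sorted(kept, key=lambda cp: cp[1], reverse=True)

classes = ["Hydroponics", "Pest_Disease", "Trees", "Watering"]  # hypothetical subset

confident = softmax([0.1, 3.0, 0.2, 0.3])
uncertain = softmax([0.0, 0.0, 0.0, 0.0])  # uniform: each class gets exactly 0.25

print([c for c, _ in filter_intents(confident, classes)])  # ['Pest_Disease']
print(filter_intents(uncertain, classes))                  # []
```

Any code that consumes the prediction therefore needs to handle the empty case before indexing into it.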

In [57]:
GREETING_INPUTS = ["hello", "hi", "greetings", "what's up","hey"]
GREETING_RESPONSES = ["hello", "hi", "hey", "hi there"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

In [65]:
print("MontyBot: I can help you with gardening questions!")

# below is the chatting
while True:
    user_input = input("User: ").lower()

    if user_input == 'bye':
        print("MontyBot: Bye! It was a great chat.")
        break
    if user_input in ('thanks', 'thank you'):
        print("MontyBot: You're welcome.")
        break

    reply = greeting(user_input)
    if reply is not None:
        print("MontyBot: " + reply)
    else:
        print("MontyBot: ", end="")
        print(chatbot_response(user_input))

MontyBot: I can help you with gardening questions!
User: hi
MontyBot: hey
User: how are you
MontyBot: ['i gave the plant the fertilizer that comes in small balls 2-3x couple of tablespoons over the summer']
User: cool
MontyBot: ['wayne yes, use the wood shavings and sawdust.great, that ll put all these shavings to some good use! since you re posting in the organic gardening forum, i shouldn t have to remind you that organic gardeners do not use roundup, and certainly would never flood the area with roundup']
User: what about aphids?
MontyBot: ['if you see it we will see it many pix in one post otherwise i am usually too lazy to do it myself lol and the one time i do it wont open lol welcome to gw come and visit the perennial and annual forums we need more action ken thanks for the info on how to load pix']
User: when should you water
MontyBot: ['thai basil is my favorite culinary basil']
User: what about pest
MontyBot: ['i have since purchased great wall and smith s best and they survi

Source for this step: https://towardsdatascience.com/how-to-create-a-chatbot-with-python-deep-learning-in-less-than-an-hour-56a063bdfc44

# Conclusions and future developments

This was a very challenging project in which I discovered many new techniques and NLP models. Throughout the project, I chose to implement simple steps first and build on top of them later. Overall, the results are far from satisfactory and the chatbot can't answer questions accurately.

However here are some possible avenues for improvement:
- Data:
  - The data was scraped and isn't of great quality. More time could be spent finding better data to feed to the models
  - The data could be further preprocessed and cleaned to try to better prepare for the models
- Topic Modeling:
  - I would like to explore other topic modeling models
- Text Summarization:
  - The summarization technique that I chose is purely extractive. I attempted to implement BERT models during this project and would like to get them working.
  - I would also like to explore other summarization techniques
- Chatbot:
  - I built the chatbot using Deep Learning techniques, but the core mechanic is still too broad: it randomly chooses an answer from the predicted topic. With 14,868 answers spread over 10 topics, each topic holds roughly 1,500 answers on average, so a random pick is almost certain to miss the right one.
  - I want to explore other chatbot models for example: https://towardsdatascience.com/build-a-simple-chatbot-with-python-and-google-search-c000aa3f73f0
- Future functionalities and developments: this was 100% a passion project that I will continue working on after Thinkful. Among the functionalities that I would like this project to have:
  - A planting recommendation engine for the bot: based on criteria such as water, space, and sun, it could give the user a list of plants that would thrive
  - A personality: I would love for the bot to tell jokes and have a "style"
  - An image recognition feature: users could submit a picture asking which plant or disease it shows, and the bot would identify it
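
On the chatbot point above, one possible direction (an assumption on my part, not something implemented in this notebook) is to replace the random pick with a similarity search inside the predicted topic. The sketch below scores each stored question against the user's question with Jaccard token overlap and returns the answer paired with the best match. The rows, `tokens`, `jaccard`, and `best_answer` are all hypothetical and only illustrate the idea.

```python
import re

# Hypothetical (question, answer) pairs that all belong to one predicted topic
topic_rows = [
    ("How do I get rid of aphids on roses?", "spray the leaves with soapy water"),
    ("When should I prune apple trees?", "prune in late winter while dormant"),
    ("Best mulch for tomatoes?", "straw works well around tomatoes"),
]

def tokens(text):
    # lowercase alphabetic tokens as a set
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    # overlap between two token sets, from 0.0 (disjoint) to 1.0 (identical)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_answer(user_question, rows):
    query = tokens(user_question)
    # return the answer whose original question overlaps most with the query
    question, answer = max(rows, key=lambda row: jaccard(query, tokens(row[0])))
    return answer

print(best_answer("what about aphids?", topic_rows))  # spray the leaves with soapy water
```

A TF-IDF or embedding-based similarity would likely work better than raw token overlap, but the structure of the lookup would stay the same.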
