# Retrieval-Based ChatBot
One of the most popular chatbot implementations in use today!

Retrieval-Based ChatBots perform **three main tasks**:

1. [**Intent Classification**](#IntentClassification)  
Classify the intent of the message from user input.
    1. [Intent Classification with Bag-of-Words](#ICBoW) 


2. [**Entity Recognition**](#EntityRecognition)  
Entities are often the proper nouns of a message.
    1. [Entity Recognition with Part-of-Speech tagging](#ERPOS)
    2. [Entity Recognition with Word Embeddings](#ERWE)


3. [**Response Selection**](#ResponseSelection)  
Retrieve the best-fit response from a collection of pre-defined responses.


4. [**Bringing them all together!**](#ChatBot)  

In [44]:
import re
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords

def pos_tagging(word):
    """Return the most likely WordNet part of speech ('n', 'v', 'a', or 'r') for a word."""
    
    # get the WordNet synsets (sets of synonyms) that contain the word
    probable_pos = wordnet.synsets(word)
    # instantiate Counter()
    pos_counts = Counter()
    
    # count the POS of the word's synonyms
    pos_counts["n"] = len([synonym for synonym in probable_pos if synonym.pos()=="n"])
    pos_counts["v"] = len([synonym for synonym in probable_pos if synonym.pos()=="v"])
    pos_counts["a"] = len([synonym for synonym in probable_pos if synonym.pos()=="a"])
    pos_counts["r"] = len([synonym for synonym in probable_pos if synonym.pos()=="r"])
    
    # find the most common POS of the word's synonyms
    most_likely_pos = pos_counts.most_common(1)[0][0]
    
    return most_likely_pos

# create a set with the english stopwords
stop_words = set(stopwords.words("english"))

def preprocess_text(text):
    """
    1. Strips the text of punctuation
    2. Lower-cases letters
    3. Splits the text into tokens
    4. Removes stopwords
    5. Lemmatizes tokens
    """
    # strip the text of punctuation and lower-case letters
    cleaned = re.sub(r'\W+', ' ', text).lower()
    # tokenize text
    tokenized = word_tokenize(cleaned)
    # remove stopwords
    no_stop = [token for token in tokenized if token not in stop_words]
    # instantiate WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    # lemmatize text with POS
    normalized = [lemmatizer.lemmatize(token, pos_tagging(token)) for token in no_stop]
    
    return normalized

<a name='ICBoW'> </a>
# Intent Classification with BoW
Use **word frequency** to measure how closely each pre-defined response matches the intent of the user's message (**similarity**).

Best suited where the **order of words** does not contain much information about the intent of a user's query.
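
As a small illustration of why word order carries no weight in a bag-of-words representation, the sketch below builds two bags from the same words in a different order. It uses a plain `str.split()` as a stand-in tokenizer (an assumption made for brevity; it is not the notebook's `preprocess_text`), and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(text):
    """Build a BoW dictionary using a naive whitespace tokenizer (illustration only)."""
    return Counter(text.lower().split())

bow_1 = bag_of_words("does the dress run large")
bow_2 = bag_of_words("the dress does run large")

# Same words in a different order produce the same bag
print(bow_1 == bow_2)
```

Because the two bags compare equal, any similarity measure built on them cannot tell the two messages apart.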

## **Process**:
#### 1. Preprocess user's query and pre-defined responses 

* using a custom-made function that:
    * *Removes punctuation and extra whitespace*  
    * *Lower-cases letters*  
    * *Tokenizes text*  
    * *Removes stopwords*  
    * *Lemmatizes tokens using Part-of-Speech tagging*  

    
#### 2. Construct a Bag-of-Words dictionary for the processed user's query and responses

* `Counter(tokens)` (from `collections`) returns a BoW dictionary, i.e. `{word: frequency}`

#### 3. Count the number of similar words occurring between the user's query and each response

* using custom-made function
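
The three steps above can be sketched with the standard library alone. This is a simplified stand-in rather than the notebook's pipeline: `TOY_STOP_WORDS`, `simple_preprocess`, and `count_shared_words` are hypothetical names, the stopword list is a tiny hand-picked set, and lemmatization is skipped entirely.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list used only for this illustration
TOY_STOP_WORDS = {"the", "is", "of", "a", "in", "what"}

def simple_preprocess(text):
    """Strip punctuation, lower-case, split on whitespace, and drop stopwords."""
    cleaned = re.sub(r"\W+", " ", text).lower()
    return [token for token in cleaned.split() if token not in TOY_STOP_WORDS]

def count_shared_words(bow_a, bow_b):
    """Count the distinct words that appear in both bags."""
    return len(set(bow_a) & set(bow_b))

# Steps 1 and 2: preprocess and build BoW dictionaries
query = Counter(simple_preprocess("What is the fit of the dress?"))
responses = [
    "The dress has a loose fit.",
    "The dress comes in green.",
    "Shipping takes two days.",
]
bows = [Counter(simple_preprocess(response)) for response in responses]

# Step 3: score each response by its overlap with the query
scores = [count_shared_words(query, bow) for bow in bows]
print(scores)
```

Without lemmatization, inflected forms such as "fits" and "fit" would count as different words, which is one reason the notebook's own preprocessing lemmatizes tokens first.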

In [59]:
from collections import Counter

"""
1. Preprocess user's query and pre-defined responses
"""

# user's input message
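# NOTE: the two string literals below are not wrapped in parentheses, so only the
# first one is assigned to user_message; the second is a separate expression with
# no effect. Wrapping both in parentheses would use the full message, and the
# printed results further down would then differ.
user_message = "Hello! What is the fit of the 'Elosie' dress? My shoulders are broad, so I often size up" 
" for a comfortable fit. Do dress sizes run large or small? Especially in the shoulders?"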

# potential responses
response_a = "All of our dresses are cut from a polyester blend for a strechy fit"
response_b = "The 'Elosie' dress runs large. I suggest you take your regular size or smaller for the best fit."
response_c = "The 'Elosie' dress comes in green, lavender, and orange."

# process both user's input & potential responses
user_message_processed = preprocess_text(user_message)
response_a_processed = preprocess_text(response_a)
response_b_processed = preprocess_text(response_b)
response_c_processed = preprocess_text(response_c)

"""
2. Construct a Bag-of-Words dictionary for the processed user's query and responses
"""
# create BoW dictionaries, i.e. {'word': word frequency}, for user's input and potential responses
bow_user_message = Counter(user_message_processed)
bow_response_a = Counter(response_a_processed)
bow_response_b = Counter(response_b_processed)
bow_response_c = Counter(response_c_processed)

# print BoW dictionaries
print("User's input BoW dictionary:\n", bow_user_message, '\n')
print("Response's a BoW dictionary:\n", bow_response_a, '\n')
print("Response's b BoW dictionary:\n", bow_response_b, '\n')
print("Response's c BoW dictionary:\n", bow_response_c, '\n')

"""
3. Count the number of similar words occurring between user's query and each response
"""

def compare_overlap(user_message, possible_response):
    """Count the distinct words shared by the user_message and a potential response."""
    similar_words = 0
    # iterate over tokens in user_message
    for token in user_message:
        # if the token also exists in the response
        if token in possible_response:
            # increase similar words by 1
            similar_words += 1
    # return the number of similar words
    return similar_words

# print the number of similar words between message and responses
print("Number of similar words between user message and\n")
print("response A:", compare_overlap(bow_user_message, bow_response_a))

print("\nresponse B:", compare_overlap(bow_user_message, bow_response_b))

print("\nresponse C:", compare_overlap(bow_user_message, bow_response_c))

User's input BoW dictionary:
 Counter({'hello': 1, 'fit': 1, 'elosie': 1, 'dress': 1, 'shoulder': 1, 'broad': 1, 'often': 1, 'size': 1}) 

Response's a BoW dictionary:
 Counter({'dress': 1, 'cut': 1, 'polyester': 1, 'blend': 1, 'strechy': 1, 'fit': 1}) 

Response's b BoW dictionary:
 Counter({'elosie': 1, 'dress': 1, 'run': 1, 'large': 1, 'suggest': 1, 'take': 1, 'regular': 1, 'size': 1, 'small': 1, 'best': 1, 'fit': 1}) 

Response's c BoW dictionary:
 Counter({'elosie': 1, 'dress': 1, 'come': 1, 'green': 1, 'lavender': 1, 'orange': 1}) 

Number of similar words between user message and

response A: 2

response B: 4

response C: 2


In [98]:
import spacy

# load a spaCy model that ships with word vectors (named word2vec here)
word2vec = spacy.load("en_core_web_lg")

# a list of nouns
message_nouns = ['shirts', 'weekend', 'package']

# a broad category (the blank spot)
category = word2vec("clothes")

# join words into a single string with a space for separator
tokens = word2vec(" ".join(message_nouns))

def compute_similarity(tokens, category):
    """Calculate the similarity between a string and a "blank spot" word."""
    output_list = list()
    # for each word in a string
    for token in tokens:
        # store the word, the "blank spot" word, and their similarity score
        # for a multi-word category, similarity() compares against the average of its token vectors
        output_list.append([token.text, category.text, token.similarity(category)])
    return output_list


# print the similarity between each word and the "blank spot" category
similarity_scores = compute_similarity(tokens, category)
for score in similarity_scores:
    print(score)

# assign the word with the highest similarity to the blank_spot, i.e. shirts
blank_spot = max(similarity_scores, key=lambda score: score[2])[0]

# response to the user (the parentheses join the two string literals)
bot_response = (
    f"Hey! I just checked my records, your shipment containing {blank_spot} is en route. "
    "Expect it within the next two days!"
)

# print bot_response
print('\n', bot_response)

['shirts', 'clothes', 0.678414398517753]
['weekend', 'clothes', 0.2510121169200076]
['package', 'clothes', 0.16207362417098703]

 Hey! I just checked my records, your shipment containing shirts is en route. Expect it within the next two days!


<a name="ResponseSelection"> </a>
## Response Selection

In [87]:
stop_words = set(stopwords.words("english"))

def preprocess(input_sentence):
    """Lower-case a string, strip punctuation, tokenize it, and remove stopwords."""
    # lower case letters
    input_sentence = input_sentence.lower()
    # remove punctuation
    input_sentence = re.sub(r'[^\w\s]', '', input_sentence)
    # split string into individual words
    tokens = word_tokenize(input_sentence)
    # remove stopwords
    return [token for token in tokens if token not in stop_words]

  
def extract_nouns(tagged_message):
    """Return a list with just the nouns from a list of words."""
    message_nouns = list()
    for token in tagged_message:
        if token[1].startswith("N"):
            message_nouns.append(token[0])
    return message_nouns

In [91]:
from nltk import pos_tag

user_message = "Good morning... will it rain in Chicago later this week?"

blank_spot = "illinois city"

# a selection of responses to match to the blank spot
response_a = "The average temperature this weekend in {} will be 88 degrees. Bring your sunglasses!"
response_b = "Forget about your umbrella; there is no rain forecasted in {} this weekend."
response_c = "This weekend, a warm front from the southeast will keep skies near {} clear."

responses = [response_a, response_b, response_c]

# build BoW dictionaries for the user message and each response
bow_user_message = Counter(preprocess(user_message))
processed_responses = [Counter(preprocess(response)) for response in responses]

# score each response by its word overlap with the user message
similarity_list = [compare_overlap(doc, bow_user_message) for doc in processed_responses]

# select response with best intent fit
response_index = similarity_list.index(max(similarity_list))

# extract candidate entities (nouns) with POS tagging
tagged_user_message = pos_tag(preprocess(user_message))
message_nouns = extract_nouns(tagged_user_message)

# score each noun against the blank spot using word vectors
tokens = word2vec(" ".join(message_nouns))
category = word2vec(blank_spot)
word2vec_result = compute_similarity(tokens, category)

# select highest scoring entity
print(word2vec_result,'\n')
entity = max(word2vec_result, key=lambda result: result[2])[0]

# select final response with titlecase
final_response = responses[response_index].format(entity.title())
print(final_response)

[['morning', 'illinois city', 0.26479153177805814], ['rain', 'illinois city', 0.2857365552501409], ['chicago', 'illinois city', 0.7571821357578838], ['week', 'illinois city', 0.2169059489038729]] 

Forget about your umbrella; there is no rain forecasted in Chicago this weekend.


<a name="ChatBot"> </a>
# Bringing Them All Together!

# Intent Classification with BoW

>**The goal is to find the best response from a list of pre-defined responses.**
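
One detail worth keeping in mind when picking the best response from a list of overlap scores: `list.index(max(...))` returns the first position of the maximum, so ties are silently resolved in favour of the earlier response. The scores and response labels below are made up for illustration.

```python
# Hypothetical overlap scores for four candidate responses
similarity_list = [2, 4, 4, 1]
responses = ["response A", "response B", "response C", "response D"]

# index() returns the first position of the maximum, so B wins the tie with C
response_index = similarity_list.index(max(similarity_list))
print(responses[response_index])

# Collect every tied index, in case a tie-breaking rule is wanted
best_score = max(similarity_list)
tied_indices = [i for i, score in enumerate(similarity_list) if score == best_score]
print(tied_indices)
```

With short messages and small response sets, ties are common, so the order of the response list can end up deciding the answer.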

In [117]:
def find_intent_match(responses, user_message):
    """Select the response that best matches the intent of the user message."""

    # clean, tokenize, and count term frequency of user_message
    bow_user_message = Counter(preprocess_text(user_message))
    # clean, tokenize, and count term frequency of each potential response
    processed_responses = [Counter(preprocess_text(response)) for response in responses]
    # count the words shared between bow_user_message and each processed response
    similarity_list = [compare_overlap(doc, bow_user_message) for doc in processed_responses]
    # select the index of the highest similarity score in similarity_list
    response_index = similarity_list.index(max(similarity_list))
    # return the element at index response_index in responses
    return responses[response_index]

# Entity Recognition
After determining the best method for the classification of a user’s intent, there is the task of recognizing entities within a user’s message.

### 1. Entity Recognition with POS tagging
POS tagging is commonly used to identify entities within a user message, as most entities are nouns.

>**The goal is to get a list with just the nouns from the user's query.**
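
A minimal sketch of the noun filter, run on hand-written `(word, tag)` pairs instead of real tagger output so that it needs no NLTK data. The tags follow the Penn Treebank convention, where noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all begin with `NN`; `nouns_only` is a hypothetical name used only here.

```python
# Hand-written (word, tag) pairs standing in for the output of a POS tagger
tagged_message = [
    ("rain", "NN"),
    ("chicago", "NNP"),
    ("later", "RB"),
    ("weeks", "NNS"),
    ("go", "VB"),
]

def nouns_only(tagged_tokens):
    """Keep the words whose Penn Treebank tag starts with 'NN'."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]

print(nouns_only(tagged_message))
```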


### 2. Entity Recognition with Word Embeddings
While POS tagging extracts key entities in a user message, it does not provide context that allows a chatbot to believably integrate an entity reference into a predefined response.

In order to produce a coherent response, the chatbot must **insert entities from a user message** into the blank spots.

>**The goal is to get the best noun to fill the blank spot in the bot's response.**
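
To make the idea concrete without loading a language model, the sketch below scores each noun against a blank-spot category with cosine similarity and keeps the highest-scoring one. The three-dimensional vectors are invented for illustration; real word vectors have hundreds of dimensions, and `toy_vectors`, `category_vector`, and `cosine_similarity` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for real word embeddings
toy_vectors = {
    "chicago": [0.9, 0.1, 0.0],
    "rain": [0.1, 0.9, 0.2],
    "week": [0.0, 0.2, 0.9],
}
# Made-up vector for the blank-spot category (e.g. "city")
category_vector = [1.0, 0.2, 0.1]

# Pick the noun whose vector points in the direction closest to the category
best_noun = max(
    toy_vectors,
    key=lambda word: cosine_similarity(toy_vectors[word], category_vector),
)

# Insert the entity into the blank spot of a pre-defined response
response = "There is no rain forecasted in {} this weekend.".format(best_noun.title())
print(response)
```

Taking the maximum over all candidates avoids hard-coding the position of the expected entity in the list of nouns.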

In [102]:
from nltk import pos_tag
import spacy

# load word2vec model
word2vec = spacy.load("en_core_web_lg")

# get the user's query
user_message = input('How can I help you with? ')

def extract_nouns(tagged_message):
    """Return a list with just the nouns from a list of words."""
    message_nouns = []
    # for each word in the list of POS-tagged words
    for token in tagged_message:
        # if the word is tagged as a NOUN
        if 'NN' in token[1]:
            # add the word at the end of the list
            message_nouns.append(token[0])
    # return the list of nouns
    return message_nouns

blank_spot = "resume"


def find_entities(user_message):
    """Return the noun in the user's query that is most similar to blank_spot."""
    
    # preprocess and POS-tag user's query
    tagged_user_message = pos_tag(preprocess_text(user_message))
    # extract the nouns from user's query
    message_nouns = extract_nouns(tagged_user_message)
    # join the nouns into a single space-separated string and run the word-vector model
    tokens = word2vec(" ".join(message_nouns))
    # a broad category
    category = word2vec(blank_spot)
    # compute the similarity between tokens and category
    word2vec_result = compute_similarity(tokens, category)
    
    # sort by similarity score (ascending) and return the best-scoring noun
    word2vec_result.sort(key=lambda x: x[2])
    return word2vec_result[-1][0]

print(find_entities(user_message))

How can I help you with? I want to take a telephone
telephone


In [None]:
import re
from collections import Counter
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import spacy
    

class ResumeChatBot:
    """A representation of a ChatBot intended to serve a resume-style website."""
    
    # define a list of stopwords
    stop_words = set(stopwords.words("english"))
    
    # load a word2vec model
    word2vec = spacy.load("en_core_web_lg")
    
    # create possible responses to potential questions
    response_a = "You can find my {} attached to this page in .pdf or .doc format."
    response_b = "You can find my past and current projects on my GitHub page along with my {}."
    response_c = "{} is included in my LinkedIn profile!"
    blank_spot = "resume"
    
    # create a list with all the responses as elements
    responses_list = [response_a, response_b, response_c]
    
    # exit commands so the user can leave the chat, e.g. in response to
    # the question "Do you need something else?"
    exit_commands = ('quit', 'exit', 'bye', 'goodbye', 'nothing', 'nope', 'no', 'not')
    
    
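    def make_exit(self, user_message):
        """Say goodbye and return True if the message contains an exit command."""
        # compare whole words so that, e.g., 'no' does not match inside 'know'
        words = re.findall(r"\w+", user_message.lower())
        for exit_command in self.exit_commands:
            if exit_command in words:
                print("Thanks for taking the time to visit my website. I hope to hear back from you soon."
                       " Have a great and productive day!")
                return True
        return False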

    
    def chat(self): 
        """Greet the user and keep responding until they enter an exit command."""
        
        # start chat with a Welcoming message
        user_message = input("Hello there. Welcome to my personal website. How can I help you today? ")
        # if the response does not include an exit_command
        while not self.make_exit(user_message):
            # call .respond method passing the user_message as an argument
            user_message = self.respond(user_message)

  
    def respond(self, user_message):
        """Print the best response to the user's input and ask for the next message."""
        # call .find_intent_match method passing responses and user_message as arguments
        best_response = self.find_intent_match(self.responses_list, user_message)
        # call .find_entities method passing user_message as an argument
        entity = self.find_entities(user_message)
        # fill the blank spot of the selected response with the entity and show it
        print(best_response.format(entity))
        input_message = input("Do you need something else? ")
        return input_message


    def find_intent_match(self, responses_list, user_message):
        """Select the response that best matches the intent of the user message."""
        
        # clean, tokenize, and count term frequency of user_message
        # (preprocess and compare_overlap are the helper functions defined in earlier cells)
        bow_user_message = Counter(preprocess(user_message))
        # clean, tokenize, and count term frequency of each potential response
        processed_responses = [Counter(preprocess(response)) for response in responses_list]
        # count the words shared between bow_user_message and each processed response
        similarity_list = [compare_overlap(doc, bow_user_message) for doc in processed_responses]
        # select the index of the highest similarity score in similarity_list
        response_index = similarity_list.index(max(similarity_list))
        # return the element at index response_index in responses_list
        return responses_list[response_index]

    
    def find_entities(self, user_message):
        """Select the best word to fill the blank_spot."""

        # preprocess and POS-tag user's query
        # (preprocess, extract_nouns and compute_similarity are defined in earlier cells)
        tagged_user_message = pos_tag(preprocess(user_message))
        # extract the nouns from user's query
        message_nouns = extract_nouns(tagged_user_message)
        # join the nouns into a single space-separated string and run the word-vector model
        tokens = self.word2vec(" ".join(message_nouns))
        # a broad category
        category = self.word2vec(self.blank_spot)
        # compute the similarity between tokens and category
        word2vec_result = compute_similarity(tokens, category)

        # return the best-scoring noun, or fall back to blank_spot if there are no nouns
        word2vec_result.sort(key=lambda x: x[2])
        if len(word2vec_result) >= 1:
            return word2vec_result[-1][0]
        return self.blank_spot


test = ResumeChatBot()
test.chat()