# Simple Chatbot/Query Search for Animals

The goal of this notebook is to create a simple chatbot that can answer basic questions about animals. To do this, we will scrape a set of Wikipedia articles about animals, separate each article into sentences, use the TF-IDF method to find the sentence that most closely matches the input query, and output that sentence as the chatbot's response.

In [1]:
# Web scraping
import requests
from bs4 import BeautifulSoup
import string

# nltk
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer

# Utility
import numpy as np
import re

## Gathering and Cleaning the Data

First, we need a function to scrape the data from a Wikipedia link, which is written below.

In [2]:
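def scrape_wikipedia(url):
    """Return the paragraph text of a Wikipedia article, or None if it cannot be fetched."""
    response = requests.get(url)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # The article body lives in the div with id "mw-content-text"
    content = soup.find('div', {'id': 'mw-content-text'})
    if content is None:
        return None
    # Concatenate the text of every <p> element, each preceded by a space
    return "".join(" " + paragraph.text for paragraph in content.find_all('p'))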

Now that we have our function, we can create a list of links that we want to scrape, and store the texts of all the articles for processing.  Below, you can see which Wikipedia articles we chose to scrape for this project.

In [3]:
list_of_wiki_pages = ["https://en.wikipedia.org/wiki/Dog",
                      "https://en.wikipedia.org/wiki/Cat",
                      "https://en.wikipedia.org/wiki/Horse",
                      "https://en.wikipedia.org/wiki/Lion",
                      "https://en.wikipedia.org/wiki/Tiger",
                      "https://en.wikipedia.org/wiki/Cattle",
                      "https://en.wikipedia.org/wiki/Snake",
                      "https://en.wikipedia.org/wiki/Insect",
                      "https://en.wikipedia.org/wiki/Bear",
                      "https://en.wikipedia.org/wiki/Wolf",
                      "https://en.wikipedia.org/wiki/Koala",
                      "https://en.wikipedia.org/wiki/Vertebrate",
                      "https://en.wikipedia.org/wiki/Dinosaur",
                      "https://en.wikipedia.org/wiki/Red_panda",
                      "https://en.wikipedia.org/wiki/Capybara",
                      "https://en.wikipedia.org/wiki/Reptile",
                      "https://en.wikipedia.org/wiki/Giraffe",
                      "https://en.wikipedia.org/wiki/Deer",
                      "https://en.wikipedia.org/wiki/Cheetah",
                      "https://en.wikipedia.org/wiki/Turtle",
                      "https://en.wikipedia.org/wiki/Axolotl",
                      "https://en.wikipedia.org/wiki/Chordate",
                      "https://en.wikipedia.org/wiki/Shark",
                      "https://en.wikipedia.org/wiki/Sloth",
                      "https://en.wikipedia.org/wiki/Frog",
                      "https://en.wikipedia.org/wiki/Quokka",
                      "https://en.wikipedia.org/wiki/Gorilla",
                      "https://en.wikipedia.org/wiki/Raccoon",
                      "https://en.wikipedia.org/wiki/Leopard",
                      "https://en.wikipedia.org/wiki/Otter",
                      "https://en.wikipedia.org/wiki/Hippopotamus",
                      "https://en.wikipedia.org/wiki/Narwhal",
                      "https://en.wikipedia.org/wiki/Meerkat",
                      "https://en.wikipedia.org/wiki/Sheep",
                      "https://en.wikipedia.org/wiki/Goat",
                      "https://en.wikipedia.org/wiki/Donkey",
                      "https://en.wikipedia.org/wiki/Ferret",
                      "https://en.wikipedia.org/wiki/Squirrel",
                      "https://en.wikipedia.org/wiki/Amphibian",
                      "https://en.wikipedia.org/wiki/Hamster",
                      "https://en.wikipedia.org/wiki/Spider",
                      "https://en.wikipedia.org/wiki/Platypus",
                      "https://en.wikipedia.org/wiki/Lizard",
                      "https://en.wikipedia.org/wiki/Hedgehog",
                      "https://en.wikipedia.org/wiki/Glaucus_atlanticus",
                      "https://en.wikipedia.org/wiki/Hyena",
                      "https://en.wikipedia.org/wiki/Crocodile",
                      "https://en.wikipedia.org/wiki/Rhinoceros"]

output_text = ""
for link in list_of_wiki_pages:
    article_text = scrape_wikipedia(link)
    if article_text is not None:  # skip pages that could not be downloaded
        output_text += " " + article_text

Note that we combined all the text from all articles into a single string.  In the next part we will break up the string into a new list where every element is a sentence from the string.  

Before doing that, we need to do some simple text cleaning.  Note that this is NOT the part where we do the standard NLP preprocessing steps such as removing stopwords or lemmatizing; that part will come later.  This part is meant to clean up the formatting and remove messy text that carried over from scraping the data to make the eventual output more readable. 

In [4]:
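output_text = re.sub(r'\[.*?\]', '', output_text) # remove anything in square brackets (e.g. citation markers)
output_text = output_text.replace("\xa0", " ")    # replace non-breaking spaces with regular spaces
output_text = output_text.replace("\n", " ")      # replace newlines with spaces
output_text = re.sub(r':\u200a[0-9]+\u200a ', '', output_text) # remove leftover markers: colon, hair space, digits, hair space
output_text = re.sub(r':\u200a[a-z]+\u200a ', '', output_text) # same pattern with lower-case letters instead of digits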

sentences = sent_tokenize(output_text)
sentences[:5]

['    The dog (Canis familiaris or Canis lupus familiaris) is a domesticated descendant of the wolf.',
 "Also called the domestic dog, it is derived from extinct Pleistocene wolves, and the modern wolf is the dog's nearest living relative.",
 'The dog was the first species to be domesticated by humans.',
 'Hunter-gatherers did this, over 15,000 years ago in Germany, which was before the development of agriculture.',
 'Due to their long association with humans, dogs have expanded to a large number of domestic individuals and gained the ability to thrive on a starch-rich diet that would be inadequate for other canids.']

Since the first article was about dogs, the first five sentences in the example output above have to do with dogs.  You can verify that the sentences above are the first five sentences in the article about dogs at https://en.wikipedia.org/wiki/Dog.
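
To see what the main cleaning steps do in isolation, here is a minimal sketch that applies the bracket, non-breaking-space, and newline fixes to a short made-up string. The helper name `clean_scraped_text` and the sample text are assumptions for illustration only; they are not part of the notebook.

```python
import re

def clean_scraped_text(text):
    """Apply three of the notebook's formatting fixes to a string (illustrative helper)."""
    text = re.sub(r'\[.*?\]', '', text)  # drop bracketed markers such as [1] or [a]
    text = text.replace("\xa0", " ")     # non-breaking space -> regular space
    text = text.replace("\n", " ")       # newline -> space
    return text

# Made-up sample that imitates scraped text
sample = ("The dog[1] is a domesticated\xa0descendant of the wolf.[a]\n"
          "It was the first species to be domesticated.")
print(clean_scraped_text(sample))
```

The cleaned string keeps its punctuation and capitalization, which is what lets `sent_tokenize` split it into readable sentences afterwards.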

The sentences in `sentences` are going to be the outputs of our chatbot. However, for the program to work out which sentence best matches each query, the sentences need further preprocessing. We do NOT want the preprocessed sentences to be the chatbot's output: preprocessing strips away punctuation and stopwords, which would make the sentences hard to read and the chatbot's responses unrealistic.

## Preprocessing

We will write a function to do the preprocessing of each sentence.

In [5]:
def string_formatter(input_string):
    """Lower-case a sentence, remove stopwords and punctuation, and lemmatize it."""
    # Build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))

    # Tokenize and put to lower case
    tokens = [word.lower() for word in word_tokenize(input_string)]
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Remove punctuation (hyphens become spaces so hyphenated words are split)
    tokens = [re.sub(r'[^\w\s]', '', word.replace("-", " ")) for word in tokens]

    # Penn Treebank POS tags grouped by the WordNet tag that wnl.lemmatize expects
    verb_tags = {"VB", "VBG", "VBD", "VBN", "VBP", "VBZ"}
    adjective_tags = {"JJ", "JJR", "JJS"}
    adverb_tags = {"RB", "RBR", "RBS", "WRB"}

    def wordnet_tag(penn_tag):
        if penn_tag in verb_tags:
            return wordnet.VERB
        if penn_tag in adjective_tags:
            return wordnet.ADJ
        if penn_tag in adverb_tags:
            return wordnet.ADV
        return wordnet.NOUN  # everything else is treated as a noun

    # Lemmatize each token using its POS tag
    wnl = WordNetLemmatizer()
    lemma_words = [wnl.lemmatize(word, pos=wordnet_tag(tag)) for word, tag in nltk.pos_tag(tokens)]

    sentence = " ".join(lemma_words)
    sentence = re.sub(r'\s+', ' ', sentence).strip() # remove extra spaces between words
    return sentence

training_text = [string_formatter(sentence) for sentence in sentences]
training_text[:5]

['dog canis familiaris canis lupus familiaris domesticate descendant wolf',
 'also call domestic dog derive extinct pleistocene wolf modern wolf dog s near living relative',
 'dog first specie domesticate human',
 'hunter gatherers 15000 year ago germany development agriculture',
 'due long association human dog expand large number domestic individual gain ability thrive starch rich diet would inadequate canid']

Compared to the original first five sentences, the text above is a lot more bare-bones. This is because all the punctuation and stopwords have been removed, and each word has been lower-cased and lemmatized.

## Creating the TF-IDF Matrix

Now that we have our list of cleaned sentences, we can create our TF-IDF matrix. A TF-IDF matrix is a matrix where each row corresponds to a sentence (a "document") and each column corresponds to a word in the entire corpus. Each element is a measure that considers how often the word appears in the sentence and how many sentences the word appears in. A common textbook formula for the TF-IDF element value is $[1 + \log_{10}(tf)] \times \log_{10}\left(\frac{N}{df}\right)$, where tf (term frequency) is how often the word appears in the sentence, df (document frequency) is the number of sentences the word appears in, and N is the total number of sentences. Note that scikit-learn's `TfidfVectorizer`, used below, applies a variant of this formula by default (raw term counts, a smoothed natural-log IDF, and L2 normalization of each row), so its values differ from the textbook formula while following the same idea.
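
As a worked example of the formula above, the sketch below computes the weight by hand for a tiny made-up corpus of four preprocessed "sentences". The function name `tfidf_weight` and the toy corpus are assumptions for illustration, and the numbers will not match what `TfidfVectorizer` produces.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """Log-scaled TF-IDF weight: (1 + log10(tf)) * log10(N / df)."""
    if tf == 0 or df == 0:
        return 0.0  # a word that is absent contributes nothing
    return (1 + math.log10(tf)) * math.log10(n_docs / df)

# Made-up corpus of already-preprocessed sentences
docs = [
    "house cat mate feral cat",
    "average lifespan pet cat rise",
    "cheetah fast land animal",
    "bear many predator",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def document_frequency(term):
    """Number of sentences that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

# Weights of three words in the first sentence
for term in ("cat", "house", "cheetah"):
    tf = tokenized[0].count(term)
    df = document_frequency(term)
    print(term, "tf =", tf, "df =", df, "weight =", round(tfidf_weight(tf, df, n_docs), 3))
```

In this toy corpus "house" receives a larger weight than "cat" in the first sentence, even though "cat" occurs twice there, because "cat" appears in more sentences and is therefore discounted more heavily.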

In [6]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(training_text).toarray()
feature_names = tfidf_vectorizer.get_feature_names_out()

Since the elements of the TF-IDF matrix are mostly 0 (most sentences only contain a fraction of the words in the corpus), I will not output a part of the matrix as an example.

## Outputting Results

Now that we have our TF-IDF matrix, we can move on to the last part: matching a query to a sentence. We do this by selecting the columns of the TF-IDF matrix that correspond to the words in the query and summing the selected elements across each row. The row with the largest sum indicates the sentence with the highest relevance to the query.

We will write a function to take in a query and output the most relevant sentence.  Note that this function will be calling on global variables defined throughout this notebook.

In [7]:
def animal_query_response(query):
    tokenized_query = word_tokenize(string_formatter(query))

    # Create a boolean mask over the vocabulary that keeps only the columns of the tfidf matrix whose word appears in the tokenized query
    word_filter = np.zeros(len(feature_names)).astype(int)
    for i, word in enumerate(feature_names):
        if word in tokenized_query:
            word_filter[i] = 1
    word_filter = word_filter.astype(bool)

    closest_sentence_index = tfidf_matrix[:, word_filter].sum(axis = 1).argmax()
    return sentences[closest_sentence_index]
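
To make the column-selection step concrete, here is a toy version that uses plain Python lists in place of the NumPy matrix. All names ending in `_toy`, the `best_match` helper, and the weights are made up for illustration. One deliberate difference, which is an assumption of this sketch rather than the notebook's behaviour: it returns `None` when no query word is in the vocabulary, whereas an `argmax` over all-zero row sums picks index 0.

```python
# Toy vocabulary and sentences (illustrative only)
feature_names_toy = ["cat", "cheetah", "fast", "house", "land"]
sentences_toy = [
    "House cats often mate with feral cats.",
    "The cheetah is the world's fastest land animal.",
]
# One row per sentence, one column per vocabulary word (made-up weights)
tfidf_toy = [
    [0.6, 0.0, 0.0, 0.8, 0.0],
    [0.0, 0.6, 0.5, 0.0, 0.6],
]

def best_match(query_tokens):
    """Return the sentence whose selected TF-IDF columns have the largest row sum."""
    # Indices of the vocabulary words that occur in the query
    keep = [j for j, word in enumerate(feature_names_toy) if word in query_tokens]
    # Sum only the kept columns for every row
    scores = [sum(row[j] for j in keep) for row in tfidf_toy]
    if max(scores) == 0:
        return None  # no overlap between query and vocabulary
    return sentences_toy[scores.index(max(scores))]

print(best_match(["fast", "land", "animal"]))
print(best_match(["octopus", "heart"]))
```

Words that are not in the vocabulary, such as "animal" here, are simply ignored, so the score depends only on the overlap between the query and each sentence.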

Let us see what query results we get.  Note that we already know ahead of time that a lot of the query tests will likely not produce results that make sense.  We will discuss this in the "Conclusion and Next Steps" section.

In [8]:
def query_input_output(query):
    print("Input:", query)
    print("Output:", animal_query_response(query))
    print()

queries = ["dog breeds",
           "aniMals thAT ARE excellEnt Climbers",
           "What is the lifespan of an average house cat?",
           "Which animal is the fastest land mammal?",
           "What is the purpose of a chameleon's ability to change color?",
           "How many hearts does an octopus have?",
           "Which big cat is known for its distinctive black mane?",
           "Which snake is the longest in the world?"]

for query in queries:
    query_input_output(query)

Input: dog breeds
Output: In conformation shows, also referred to as breed shows, a judge familiar with the specific dog breed evaluates individual purebred dogs for conformity with their established breed type as described in the breed standard.

Input: aniMals thAT ARE excellEnt Climbers
Output: Deer are also excellent jumpers and swimmers.

Input: What is the lifespan of an average house cat?
Output: House cats often mate with feral cats.

Input: Which animal is the fastest land mammal?
Output: The cheetah is the world's fastest land animal.

Input: What is the purpose of a chameleon's ability to change color?
Output: The chameleons in general use their ability to change their coloration for signalling rather than camouflage, but some species such as Smith's dwarf chameleon do use active colour change for camouflage purposes.

Input: How many hearts does an octopus have?
Output: Bears do not have many predators.

Input: Which big cat is known for its distinctive black mane?
Output: 

A lot of our results do not match up very well (which we will discuss below), but other queries such as "Which animal is the fastest land mammal?" get a perfect response in "The cheetah is the world's fastest land animal". In this notebook I only show a few selected examples, but it would not take much more work to turn this into a chatbot by wrapping the function in a loop that repeatedly asks for an input and returns an output.
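
A minimal sketch of such a loop is shown below. The function `run_chat`, its parameters, and the exit words are assumptions for illustration. The input and output functions are passed in so the loop can be tried with scripted input; in the notebook it would presumably be called as `run_chat(animal_query_response)`.

```python
def run_chat(respond, read_line=input, write_line=print, exit_words=("quit", "exit")):
    """Repeatedly read a query and answer it with `respond` until an exit word is typed."""
    while True:
        try:
            query = read_line("You: ").strip()
        except EOFError:
            break  # input stream closed
        if query.lower() in exit_words:
            break
        if not query:
            continue  # ignore empty lines
        write_line("Bot: " + respond(query))

# Scripted demonstration with a stand-in responder that just upper-cases the query
scripted = iter(["dog breeds", "", "quit"])
replies = []
run_chat(lambda q: q.upper(),
         read_line=lambda prompt: next(scripted),
         write_line=replies.append)
print(replies)
```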

## Conclusion and Next Steps

Given that the chatbot only scores the TF-IDF-weighted overlap between the words in the query and the words in each sentence, and gives zero weight to the context of a query, it makes sense that a lot of the prompts did not produce a strong result. For example, the question "What is the lifespan of an average house cat?" produced the answer "House cats often mate with feral cats". This is because, under this scoring, the words in the query matched best with the words in that sentence. The article on cats does contain the sentence "The average lifespan of pet cats has risen in recent decades". However, the sentence "House cats often mate with feral cats" contains the word "house", which also appears in the query. It is therefore likely that the TF-IDF matrix gave more weight to "house" than it did to "lifespan", probably because of how often each word occurs in other sentences (recall that TF-IDF also considers how many sentences a word appears in). Furthermore, the returned sentence contains the word "cat" twice, whereas the other option contains it only once.

Clearly, context matters. Next steps for this chatbot would likely be to move on from the TF-IDF method and use contextual embeddings from models such as BERT or GPT. At that point we could also consider using the Wikipedia articles to train a model that generates its own answers, instead of simply outputting the best-matching sentence.