# Shelby Howard CISB63 Final Project

## My Trivia Chatbot:

For my final project, I developed an automated trivia chatbot that prompts users with random questions and tracks their scores. The data comes from the TriviaQA dataset hosted by the University of Washington, which I reduced to question-and-answer pairs. I wrote functions to clean the data, check user responses for correctness, and run the chatbot. The chatbot combines these functions in a loop that asks questions, evaluates answers, and gives feedback to the user.

## Download data

Data: https://nlp.cs.washington.edu/triviaqa/

In [1]:
#Import tarfile so I can extract the downloaded data
import tarfile

#My path
path = "/Users/shelbyhoward/Downloads/Triva Final/data.tar.gz"

extract_path = "/Users/shelbyhoward/Downloads/Triva Final/extracted_data"
#Extract tar.gz file
with tarfile.open(path, 'r:gz') as tar:
    tar.extractall(extract_path)
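
Re-running the cell above extracts the whole archive again each time. A minimal sketch of a guard that skips extraction when the target folder already has content is shown below. `extract_if_needed` is a hypothetical helper name, and the demo builds a tiny throwaway archive instead of using the real TriviaQA download.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_if_needed(archive_path, target_dir):
    """Extract a .tar.gz archive unless target_dir already has files in it.

    Returns True if extraction ran, False if it was skipped.
    (Hypothetical helper, not part of the original notebook.)
    """
    target = Path(target_dir)
    if target.is_dir() and any(target.iterdir()):
        return False  # Something is already there, so skip the slow step
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(target))
    return True


# Demo on a tiny throwaway archive (a stand-in for data.tar.gz)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    sample = tmp / "sample.json"
    sample.write_text('{"Data": []}')
    archive = tmp / "data.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        tar.add(str(sample), arcname="sample.json")

    first_run = extract_if_needed(archive, tmp / "extracted_data")
    second_run = extract_if_needed(archive, tmp / "extracted_data")
    extracted_ok = (tmp / "extracted_data" / "sample.json").exists()
    print(first_run, second_run, extracted_ok)
```

The first call extracts and the second call is skipped, so the slow step only runs once per folder.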

In [40]:
#Import libraries and load dataset
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from textblob import TextBlob
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
import nltk
import random
import json

In [4]:
#Import the NLTK packages
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/shelbyhoward/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shelbyhoward/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/shelbyhoward/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/shelbyhoward/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/shelbyhoward/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## Create the Dataframe and Display Data

In [5]:
#Using pandas, load the data into a dataframe from the extracted path
df = pd.read_json('extracted_data/triviaqa-unfiltered/unfiltered-web-train.json')

In [6]:
#Print the unfiltered head
df.head()

| | Data | Domain | Split | VerifiedEval | Version |
|---|---|---|---|---|---|
| 0 | {'Answer': {'Aliases': ['Presidency of Harry S... | unfiltered-web | train | False | 1 |
| 1 | {'Answer': {'Aliases': ['(Harry) Sinclair Lewi... | unfiltered-web | train | False | 1 |
| 2 | {'Answer': {'Aliases': ['Park Grove (1895)', '... | unfiltered-web | train | False | 1 |
| 3 | {'Answer': {'Aliases': ['Beer Cans'], 'Normali... | unfiltered-web | train | False | 1 |
| 4 | {'Answer': {'Aliases': ['30's', '30’s', '30s',... | unfiltered-web | train | False | 1 |


In [7]:
#Print the unfiltered tail
df.tail()

| | Data | Domain | Split | VerifiedEval | Version |
|---|---|---|---|---|---|
| 87617 | {'Answer': {'Aliases': ['Rock Lobster by the B... | unfiltered-web | train | False | 1 |
| 87618 | {'Answer': {'Aliases': ['Wascally wabbit', 'Bu... | unfiltered-web | train | False | 1 |
| 87619 | {'Answer': {'Aliases': ['All the kings horses ... | unfiltered-web | train | False | 1 |
| 87620 | {'Answer': {'Aliases': ['Okypete', 'Snatchers'... | unfiltered-web | train | False | 1 |
| 87621 | {'Answer': {'Aliases': ['Butterfingers Snacker... | unfiltered-web | train | False | 1 |


In [8]:
#Print the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87622 entries, 0 to 87621
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Data          87622 non-null  object
 1   Domain        87622 non-null  object
 2   Split         87622 non-null  object
 3   VerifiedEval  87622 non-null  bool  
 4   Version       87622 non-null  int64 
dtypes: bool(1), int64(1), object(3)
memory usage: 2.8+ MB


## Clean up the Data

In the following section I focused on separating my data into two parts, questions and answers. This made the data easier to work with and made mistakes easier to spot in case my chatbot ran into issues.

In [9]:
#Using a function I need to extract all of the questions and answers from the data
def extract_question_answer(entry):
    try:
        #using entry.get, get the questions
        question = entry.get('Question', 'Unknown Question')
        #using entry.get, get the answers
        answer = entry.get('Answer', {}).get('NormalizedValue', 'Unknown Answer')
        return {"Question": question, "Answer": answer}
    except AttributeError:
        return None
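
To make the behavior of this function concrete, here is a small usage example. The function body is repeated so the example runs on its own, and the three entries are invented for illustration; they only mimic the shape of the `Data` column.

```python
def extract_question_answer(entry):
    # Same logic as the function above, repeated so this example is self-contained
    try:
        question = entry.get('Question', 'Unknown Question')
        answer = entry.get('Answer', {}).get('NormalizedValue', 'Unknown Answer')
        return {"Question": question, "Answer": answer}
    except AttributeError:
        return None


# Hypothetical entries shaped like the 'Data' column (values invented for illustration)
complete = {
    "Question": "What colour is the sky on a clear day?",
    "Answer": {"Aliases": ["Blue"], "NormalizedValue": "blue"},
}
missing_answer = {"Question": "A question with no answer field"}
not_a_dict = "unexpected string"

result_complete = extract_question_answer(complete)
result_missing = extract_question_answer(missing_answer)
result_bad = extract_question_answer(not_a_dict)

print(result_complete)
print(result_missing)
print(result_bad)
```

A complete entry yields a question/answer pair, an entry without an `Answer` field gets the `'Unknown Answer'` placeholder, and a value that is not a dictionary returns `None` because it has no `.get` method.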

In [10]:
#Apply the function to every entry and drop any entries that could not be parsed (returned None)
trivia_data = df['Data'].apply(extract_question_answer).dropna()

In [11]:
#Move the data into a dataframe
trivia_df = pd.DataFrame(trivia_data.tolist())
#New path
cleaned_data_path = "cleaned_trivia_data.json"

trivia_df.to_json(cleaned_data_path, orient="records", indent=4)
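
One thing to keep in mind: when a field is missing, `extract_question_answer` falls back to `'Unknown Question'` or `'Unknown Answer'` instead of returning `None`, so `dropna()` would not remove those rows if any exist. A minimal sketch of filtering the placeholders as well is below. It works on a plain list of records (the same shape as the saved JSON), and both the sample records and the `has_real_content` helper are hypothetical.

```python
PLACEHOLDERS = {"Unknown Question", "Unknown Answer"}


def has_real_content(record):
    """Keep a record only if both fields are non-empty and not placeholders."""
    question = str(record.get("Question", "")).strip()
    answer = str(record.get("Answer", "")).strip()
    return bool(question) and bool(answer) and not ({question, answer} & PLACEHOLDERS)


# Sample records: the first mirrors the cleaned data, the rest are invented edge cases
records = [
    {"Question": "Where in England was Dame Judi Dench born?", "Answer": "york"},
    {"Question": "Unknown Question", "Answer": "beer cans"},
    {"Question": "A question whose answer was missing", "Answer": "Unknown Answer"},
    {"Question": "A question with a blank answer", "Answer": "   "},
]

usable = [record for record in records if has_real_content(record)]
print(len(usable))
```

Only the first record survives, since the others contain a placeholder or a blank answer.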

In [13]:
#Import the needed package
import json

#Path to the cleaned JSON data
cleaned_data_path = "cleaned_trivia_data.json"

#Open the file and check our work by printing out the first few questions
try:
    with open(cleaned_data_path, 'r') as file:
        cleaned_data = json.load(file)

    print("Sample Cleaned Data:")
    for entry in cleaned_data[:5]:
        print(f"Question: {entry['Question']}")
        print(f"Answer: {entry['Answer']}")
        print("-" * 30)
except FileNotFoundError:
    print("Try again")


Sample Cleaned Data:
Question: Who was President when the first Peanuts cartoon was published?
Answer: harry truman
------------------------------
Question: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
Answer: sinclair lewis
------------------------------
Question: Where in England was Dame Judi Dench born?
Answer: york
------------------------------
Question: William Christensen of Madison, New Jersey, has claimed to have the world's biggest collection of what?
Answer: beer cans
------------------------------
Question: In which decade did Billboard magazine first publish and American hit chart?
Answer: 30s
------------------------------


## Using NLP to ensure correct matching

I chose to use natural language processing (NLP) to make the chatbot more user-friendly by allowing flexible answer matching, accounting for variations in user input.

In [41]:
#Set up the lemmatizer and the English stopword list for further text cleaning
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

In [23]:
#Create a function for tokenizing, lowercasing, and removing stopwords
def preprocess_text(text):
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    #Returns a list of cleaned tokens
    filtered_tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
    return filtered_tokens
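
To see what this kind of preprocessing keeps and discards without downloading any NLTK data, here is a dependency-free approximation. It is not the function above: a regular expression stands in for `word_tokenize`, and `SMALL_STOP_WORDS` is a tiny hand-picked stand-in for NLTK's English stopword list, so results on real text can differ.

```python
import re

# Hand-picked stand-in for stopwords.words("english"); the real list is much longer
SMALL_STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}


def preprocess_text_simple(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in SMALL_STOP_WORDS]


tokens_sparrow = preprocess_text_simple("The House Sparrow!")
tokens_movie = preprocess_text_simple("  sleepless IN seattle ")
tokens_only_stopword = preprocess_text_simple("the")

print(tokens_sparrow)
print(tokens_movie)
print(tokens_only_stopword)
```

Case, punctuation, and extra spaces disappear, and an input made only of stopwords reduces to an empty list.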

In [24]:
#Create a function to compare user answers with correct answers using NLP techniques.
def is_answer_correct(user_answer, correct_answer):
    user_tokens = preprocess_text(user_answer)
    correct_tokens = preprocess_text(correct_answer)
    #Returns True if the two sets of tokens match exactly, otherwise False
    return set(user_tokens) == set(correct_tokens)
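
The comparison above accepts only the single `NormalizedValue` string, so a partial or alternative answer is marked wrong. The raw entries also carry an `Aliases` list (visible in the `df.head()` output), and one possible extension is to accept a match against any alias. This is a rough sketch only: `normalize` is a simplified stand-in for `preprocess_text`, `matches_any_alias` is a hypothetical name, and the alias values are invented.

```python
import re


def normalize(text):
    """Simplified stand-in for preprocess_text: lowercase alphanumeric tokens as a set."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


def matches_any_alias(user_answer, aliases):
    """Return True if the user's tokens equal the tokens of at least one alias."""
    user_tokens = normalize(user_answer)
    if not user_tokens:
        return False  # Never accept an empty answer
    return any(user_tokens == normalize(alias) for alias in aliases)


# Invented alias list for illustration; real lists would come from entry['Answer']['Aliases']
aliases = ["Harry Truman", "Harry S. Truman", "President Truman"]

match_full = matches_any_alias("harry s truman", aliases)
match_reordered = matches_any_alias("Truman, Harry", aliases)
match_wrong = matches_any_alias("Roosevelt", aliases)
match_empty = matches_any_alias("", aliases)

print(match_full, match_reordered, match_wrong, match_empty)
```

Because sets ignore word order, "Truman, Harry" still matches the alias "Harry Truman", while a wrong or empty answer is rejected.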

## Chatbot Function

Creating dedicated functions beforehand for data cleaning and response validation kept the code modular and made the chatbot easier to maintain and expand.

In [33]:
#Create a function to run my trivia chatbot
def trivia_chatbot(json_path):
    #Open the file
    with open(json_path, 'r') as file:
        questions = json.load(file)

    #To keep count, start the counters at 0
    correct_count = 0  # Track correct answers
    asked_questions_count = 0  # Track how many questions were asked

    #Start with an opening statement
    print("Welcome to your personal Trivia Chatbot! Type 'quit' when you are ready to exit.\n")

    #While loop for questions
    while True:
        #Randomly select a trivia question
        question = random.choice(questions)
        #Ask the user the question
        print("Question:", question['Question'])
        
        #Get the user's answer
        user_answer = input("Your Answer: ").strip()
        #If the user enters quit, the program will exit
        if user_answer.lower() == "quit":
            print(f"Thanks for playing! You got {correct_count}/{asked_questions_count} correct!")
            break
        
        #Check if the answer is correct
        correct_answer = question['Answer']
        #If it matches, print correct
        if is_answer_correct(user_answer, correct_answer):
            print("Correct! Let's do another....\n")
            #Increase correct count
            correct_count += 1 
        else:
            print(f"Wrong! The correct answer was: {correct_answer}\n")
        #Increment the number of questions asked
        asked_questions_count += 1  
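
Because `random.choice` draws with replacement, the loop above can ask the same question twice in one session. One way to avoid that is to shuffle a copy of the list once and walk through it. The sketch below also takes the answer source as a parameter so the loop can be exercised without typing. `run_quiz`, `get_answer`, and the plain lowercase comparison are assumptions for illustration, not the notebook's implementation.

```python
import random


def run_quiz(questions, get_answer, rng=None):
    """Ask each question at most once; stop on 'quit' or when questions run out.

    Returns (correct_count, asked_count).
    """
    rng = rng or random.Random()
    order = list(questions)      # Copy so the caller's list is untouched
    rng.shuffle(order)           # Random order without repeats

    correct_count = 0
    asked_count = 0
    for question in order:
        user_answer = get_answer(question["Question"]).strip()
        if user_answer.lower() == "quit":
            break
        asked_count += 1
        # Simplified check; the notebook would call is_answer_correct here
        if user_answer.lower() == question["Answer"].lower():
            correct_count += 1
    return correct_count, asked_count


# Scripted demo: answer every question with "pumice"
sample_questions = [
    {"Question": "An insect of the order Diptera is more commonly called a what?", "Answer": "fly"},
    {"Question": "What type of volcanic stone is commonly used during pedicures?", "Answer": "pumice"},
]
score = run_quiz(sample_questions, get_answer=lambda prompt: "pumice", rng=random.Random(0))
quit_score = run_quiz(sample_questions, get_answer=lambda prompt: "quit")
print(score, quit_score)
```

For interactive play, `get_answer` could be something like `lambda q: input(f"Question: {q}\nYour Answer: ")`. Passing the answer source in as a function also makes the scoring logic easy to check with scripted answers.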

## Final Trivia Chatbot

In [34]:
#Run the chatbot
trivia_chatbot(cleaned_data_path)

Welcome to the Trivia Chatbot! Type 'exit' to quit.

Question: What is Frigophobia the fear of


Your Answer:  fridges


Wrong! The correct answer was: being cold

Question: Give a year in the life of prison reformer Elizabeth Fry.


Your Answer:  a


Wrong! The correct answer was: 1780 1845

Question: What is the common name for the bird Passer domesticus?


Your Answer:  a


Wrong! The correct answer was: house sparrow

Question: In what is now Iran in 1951 to 1954, what led to the Shah fleeing the country, the Prime Minister trying to dissolve parliament, crude oil being imported from Kuwait, and a breakdown of diplomatic relations with Britain and the USA?


Your Answer:  a


Wrong! The correct answer was: persian oil dispute

Question: The principal Deputy Speaker of the House of Commons has the official title Chairman of Ways and Means. Who is the current holder of the post, the MP for Chorley?


Your Answer:  a


Wrong! The correct answer was: lindsay hoyle

Question: At which racecourse did Lester Piggott ride both his first and last winners?


Your Answer:  Kuntucky Derby 


Wrong! The correct answer was: haydock park

Question: What type of volcanic stone is commonly used during pedicures?


Your Answer:  Lava Rock


Wrong! The correct answer was: pumice

Question: The film 'You've Got Mail' reunited the male and female leads from which hit 1993 movie?


Your Answer:  


Wrong! The correct answer was: sleepless in seattle tom hanks meg ryan

Question: The President of which African country resigned in February 2011 after widespread protests calling for his departure?


Your Answer:  


Wrong! The correct answer was: egypt

Question: Name the Danish maritime explorer who served with the Russian fleet and gave his name to a Strait, a Sea, an Island, a Glacier and Land Bridge?


Your Answer:  


Wrong! The correct answer was: vitus bering

Question: Die Dreigroschenoper (The Threepenny Opera) by Kurt Weill was based on which work?


Your Answer:  


Wrong! The correct answer was: beggars opera

Question: What is the name given to a class of coal that contains a high percentage of fired carbon and burns smokelessly with intense heat?


Your Answer:  High


Wrong! The correct answer was: anthracite

Question: "Which airline used to promote itself as ""The world's favourite airline""?"


Your Answer:  American Airline


Wrong! The correct answer was: british airways

Question: Which now defunct car manufacturer produced models called Vedette, Aronde and Ariane? SIMCA


Your Answer:  Toyota


Wrong! The correct answer was: 2 with which sporting event is dane tom kristensen particularly associated having won it eight times le mans

Question: Living History' is the memoir of which contemporary US politician?


Your Answer:  a


Wrong! The correct answer was: hillary clinton

Question: Sir Humphry Repton achieved fame in the 18th century in which sphere of activity?


Your Answer:  a


Wrong! The correct answer was: landscape gardening

Question: "What model/rapper had a #1 hit in 2014 with her song ""Fancy""?"


Your Answer:  Iggy


Wrong! The correct answer was: iggy azalea

Question: Who is the former Chief Constable of Merseyside, now Commissioner of the Metropolitan Police?


Your Answer:  a


Wrong! The correct answer was: bernard hogan howe

Question: According to official EU population statistics, which is the largest city in the European Union to begin with the letter 'T'?


Your Answer:  a


Wrong! The correct answer was: turin

Question: Whose wife “Wouldn’t have a Willie nor a Sam”?


Your Answer:  a


Wrong! The correct answer was: enery eighth

Question: Typically related to glass or crystal, the term dichroism refers to two or more what?


Your Answer:  a


Wrong! The correct answer was: colours colors

Question: An insect of the order Diptera is more commonly called a what?


Your Answer:  a


Wrong! The correct answer was: fly

Question: How many countries joined together to create the original European Economic Community following upon the Treaty of Rome in 1957?


Your Answer:  13


Wrong! The correct answer was: six

Question: What William S Burroughs 1961 book popularised the rock music term 'heavy metal', and provided the names for at least two rock bands of the 1970s?


Your Answer:  w


Wrong! The correct answer was: soft machine

Question: The unrecognized state of Biafra existed from 1967-1970 in the south-east of which country?


Your Answer:  Nigeria


Correct! 🎉

Question: In which city will the 2017 World Athletics Championships take place?


Your Answer:  quit


Wrong! The correct answer was: london

Question: Which type of seeds are traditionally used in a recipe for seed cake?


Your Answer:  exit


Thanks for playing! You got 1/26 correct!
