# Analyzing Jeopardy Questions
In this guided project, we download a dataset of Jeopardy questions and examine it in several ways, hoping to find patterns or strategies that could help a data analyst competing on the show win more easily.

Note: In this dataset, the "Question" column contains the quiz prompts. The "Answer" column does not contain any question words like "Who is...?"

In [26]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


We will be analyzing the words that appear in Jeopardy questions and answers, so we must first sanitize the text to avoid issues with punctuation and case sensitivity. We do this with regular expressions: each string is lowercased and every character that is not a letter, digit, or whitespace is removed.

In [27]:
import re

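def normalize_text(string):
    # Lowercase, then strip everything except letters, digits, and whitespace.
    string = string.lower()
    # Raw string so that \s is passed to the regex engine unchanged.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string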

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy.loc[1984, 'clean_question']


'this bird seen here is the provincial bird of prince edward island'
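To make the effect of the cleaning step concrete, here is a minimal standalone sketch that applies the same idea to two short strings. The sample strings are illustrative only, and the function is a compact restatement of `normalize_text`, not a change to the notebook's version.

```python
import re

def normalize_text(text):
    # Lowercase first, then drop anything that is not a letter, digit, or whitespace.
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

# Illustrative inputs (not rows pulled from the dataset).
sample = 'In 1963, live on "The Art Linkletter Show"'
cleaned = normalize_text(sample)
print(cleaned)                       # in 1963 live on the art linkletter show
print(normalize_text("McDonald's"))  # mcdonalds
```

Note that the apostrophe is removed without inserting a space, so "McDonald's" becomes a single token.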

The question values also need to be cleaned: we strip the dollar signs and commas and convert the result to an integer. We also convert the "Air Date" column to datetime values.

In [28]:
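def normalize_values(text):
    # Strip the dollar sign and commas, e.g. "$2,000" -> "2000".
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Values that are not numeric after cleaning are treated as 0.
        text = 0
    return text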
    
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
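As a quick check of the value cleaning, the sketch below runs a simplified variant on three sample strings. It keeps digits only (a slightly stricter pattern than the one in the notebook), and the `"None"` entry is a hypothetical example of a non-numeric value, not a claim about what the dataset contains.

```python
import re

def normalize_values(text):
    # Keep only the digits, so "$2,000" becomes "2000".
    digits = re.sub(r"[^0-9]", "", text)
    try:
        return int(digits)
    except ValueError:
        # No digits at all (for example a placeholder string): fall back to 0.
        return 0

examples = ["$200", "$2,000", "None"]  # hypothetical sample values
cleaned = [normalize_values(v) for v in examples]
print(cleaned)  # [200, 2000, 0]
```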


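Now that the data has been cleaned, we can begin the analysis. The first question we will explore is how often words in the answer also appear in the question. For each row we compute the fraction of answer words (ignoring "the") that occur in the question, then average that fraction over all rows.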

In [29]:
def word_count(row):
    split_question = row['clean_question'].split(" ")
    split_answer = row['clean_answer'].split(" ")
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(word_count, axis=1)

jeopardy.answer_in_question.mean()
    

0.060493257069335914

On average, only about 6% of the words in an answer also appear in its question, so the question text rarely gives the answer away.
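The per-row fraction can be illustrated on a single made-up question and answer pair. This is a sketch that takes two strings rather than a DataFrame row, and it drops every occurrence of "the" from the answer, which differs slightly from `list.remove` (that removes only the first occurrence).

```python
def answer_overlap(clean_question, clean_answer):
    question_words = clean_question.split(" ")
    # Ignore "the", which is too common to be informative.
    answer_words = [w for w in clean_answer.split(" ") if w != "the"]
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical example: "statue" and "liberty" match, "of" does not.
ratio = answer_overlap("this statue stands on liberty island",
                       "the statue of liberty")
print(ratio)
```

Here two of the three remaining answer words appear in the question, so the ratio is 2/3.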

Next, we will iterate through each question in the dataset. Each question is split into words, and only words of six letters or more are kept. These words are checked against a set of words that have appeared in earlier questions, and the number of matches is divided by the number of words kept. The words from the current question are then added to the set. The average of these match fractions gives a sense of how useful it may be to study old questions.

In [30]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6908737315671878

While about 69% of the words of six letters or more had already appeared in an earlier question (as of March 14th, 2000), the total number of words in the terms_used set is 34,814. According to TestYourVocab.com, the average native English-speaking adult knows about 25,000-30,000 words, which means studying repeated words would be an extremely inefficient strategy at best.
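The running-set logic is easier to follow on a toy input. The three questions below are invented for illustration; the loop mirrors the one above but works on a plain list of strings.

```python
# Hypothetical, already-normalized questions in the order they are processed.
questions = [
    "this president signed the emancipation proclamation",
    "the emancipation proclamation took effect in this year",
    "this planet is the largest in the solar system",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep words of six letters or more.
    words = [w for w in question.split(" ") if len(w) > 5]
    # Count words already seen in earlier questions.
    matches = sum(1 for w in words if w in terms_used)
    # Only now add the current words, so a question never matches itself.
    terms_used.update(words)
    overlaps.append(matches / len(words) if words else 0)

print(overlaps)
print(sum(overlaps) / len(overlaps))
```

The second question reuses "emancipation" and "proclamation" but introduces "effect", so its overlap is 2/3; the first and third have no earlier matches. The result depends on the order in which questions are processed.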

Perhaps it would be better to know which words are more likely to appear in questions of higher dollar value. Next, we will code each question as high or low value (high value being greater than $800) and count how often selected words appear in each group.

In [31]:
def value_assignment(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(value_assignment, axis=1)

def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            elif row['high_value'] == 0:
                low_count += 1
    return low_count, high_count

observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
print(comparison_terms)
print(observed_expected)



['incredible', 'newsweek', 'portsmouth', 'fibrinogen', 'tombstone']
[(1, 1), (1, 1), (1, 1), (1, 1), (1, 1)]


Note: For the purpose of the guided project and to save computing time, the comparison terms were limited to 5 terms taken from terms_used. Because sets are unordered, which 5 terms are selected can differ between runs.
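Since computing time is the constraint here, one possible alternative is to count every word in a single pass instead of rescanning the whole dataset once per term. The sketch below assumes the data is available as `(clean_question, high_value)` pairs; `count_all_usage` and the sample rows are hypothetical and not part of the notebook.

```python
from collections import Counter

def count_all_usage(rows):
    """Return (low_counts, high_counts): number of questions containing each word."""
    low_counts, high_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # set() so a word is counted once per question, like `word in split_list`.
        for word in set(clean_question.split(" ")):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return low_counts, high_counts

# Hypothetical rows: (clean_question, high_value)
rows = [
    ("this tombstone marks the grave of a famous outlaw", 1),
    ("a tombstone in this arizona town", 0),
    ("this incredible hulk actor", 0),
]
low_counts, high_counts = count_all_usage(rows)
usage = (low_counts["tombstone"], high_counts["tombstone"])
print(usage)  # (1, 1)
```

With the DataFrame from this notebook, the pairs could be produced with `zip(jeopardy['clean_question'], jeopardy['high_value'])`, after which any term can be looked up without another scan.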

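With our list of observed counts, we can now run a chi-squared test on each term to see whether its usage is biased toward high- or low-value questions to a statistically significant degree. The expected counts are the term's total occurrences split in proportion to the number of high- and low-value questions in the dataset.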

In [37]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeopardy)
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_low, exp_high])
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

[Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963)]
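All five results are identical because all five terms have the same observed counts, (1, 1). To show what the test computes, here is a hand-rolled sketch for two categories using only the standard library. The row totals are hypothetical round numbers, not the dataset's actual counts, and `chi_squared_1df` is an illustrative helper rather than the SciPy implementation.

```python
import math

def chi_squared_1df(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two categories there is 1 degree of freedom, where the
    # upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals (assumptions for illustration only).
total_rows = 20000
high_rows = 5000
low_rows = total_rows - high_rows

observed = (1, 1)  # (low, high), the same order as count_usage returns
share = sum(observed) / total_rows
expected = (share * low_rows, share * high_rows)

stat, p_value = chi_squared_1df(observed, expected)
print(stat, p_value)
```

With expected counts this small (well below 5), the chi-squared approximation is generally considered unreliable, so terms that occur only a couple of times cannot tell us much either way. Terms with higher frequencies would give a more meaningful test.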