# How to win Jeopardy

This project analyses a dataset of questions from the game show Jeopardy to look for an effective way to train for the game. Each question is assigned an amount of money, and a contestant who answers correctly wins that amount. Different questions carry different amounts.

In [1]:
import pandas as pd
import csv

jeopardy = pd.read_csv("jeopardy.csv")

First we'll explore the data.

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


The column names need cleaning: all but the first have a leading space, which makes them awkward to reference during analysis.

In [25]:
new_columns = []
for c in jeopardy.columns:
    c = c.strip()
    new_columns.append(c)
jeopardy.columns = new_columns
print(jeopardy.columns)

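Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer', 'clean_question', 'clean_answer', 'clean_value',
       'answer_in_question', 'question_overlap', 'high_value'],
      dtype='object')

Note that this cell was re-run after the rest of the notebook (hence In [25]), so the output also lists the columns created in the later cells.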
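The same cleanup can also be written without an explicit loop. A minimal sketch, assuming a small stand-in dataframe (`df` and its three column names are made up for illustration) rather than the real dataset:

```python
import pandas as pd

# Hypothetical stand-in frame with the same kind of leading-space column names.
df = pd.DataFrame(columns=["Show Number", " Air Date", " Round"])

# Index.str.strip() removes leading and trailing whitespace from every label at once.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

Passing `skipinitialspace=True` to `pd.read_csv` is another option worth trying, since it skips spaces that follow a delimiter when the file is read.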


For later analysis we also need to clean the question and answer columns by lowercasing them and removing all punctuation.

In [5]:
import re
def norm(string):
    string = str(string)
    string = string.lower()
    string = re.sub(r'[^\w\s]', '', string)
    return string
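To see what this normalisation does to real text, here is a small standalone sketch. The function name `normalize_text` is hypothetical; it applies the same lowercasing and the same regular expression to three strings taken from the sample rows above.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop every character that is not a word character or whitespace."""
    text = str(text).lower()
    return re.sub(r"[^\w\s]", "", text)

samples = [
    "Signer of the Dec. of Indep.",
    "McDonald's",
    "ESPN's TOP 10 ALL-TIME ATHLETES",
]

cleaned = [normalize_text(s) for s in samples]
for raw, clean in zip(samples, cleaned):
    print(f"{raw!r} -> {clean!r}")
```

One side effect to keep in mind: punctuation is deleted rather than replaced with a space, so "ALL-TIME" becomes the single token "alltime" and "McDonald's" becomes "mcdonalds".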

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(norm)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm)

The 'Value' column must also be cleaned (turned into int values) to allow analysis.

In [7]:
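def numeric(x):
    # Strip punctuation such as '$' and ','; str() guards against missing (NaN) values.
    x = re.sub(r'[^\w\s]', '', str(x))
    try:
        x = int(x)
    except ValueError:
        # Entries with no usable number (e.g. 'None') count as 0.
        x = 0
    return x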
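A quick way to check the conversion is to run it on a few representative inputs. This is a standalone sketch: `parse_value` is a hypothetical name, and the inputs are assumed examples of what the 'Value' column may contain (a plain dollar amount, an amount with a thousands separator, the string 'None', and a missing value).

```python
import re

def parse_value(raw):
    """Convert a dollar string such as '$2,000' to an int; return 0 if no number is present."""
    digits = re.sub(r"[^\w\s]", "", str(raw))
    try:
        return int(digits)
    except ValueError:
        return 0

examples = ["$200", "$2,000", "None", float("nan")]
parsed = [parse_value(v) for v in examples]
print(parsed)
```

Mapping unparseable entries to 0 keeps the column numeric, at the cost of treating questions with no dollar value as low value later on.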


In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(numeric)

The values in the 'Air Date' column can be turned into datetime objects for easier analysis.

In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

The first strategy we will test for winning Jeopardy is deducing the answer from the question. To test this we will determine how often words in the answer also appear in the question. We will exclude the word 'the', which is unlikely to help in deducing the answer from the question.

In [10]:
#Here we create a function to turn the words of the question and answer into separate lists, and determine how many
#terms that appear in the question also appear in the answer. We will not include the word 'the'.

def ans_in_q(row):
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)
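The calculation is easier to follow on a few hand-made examples. The sketch below uses a hypothetical helper, `answer_overlap`, that takes the two cleaned strings directly instead of a dataframe row. Unlike the `remove` call above, which drops only the first 'the', it drops every occurrence. The question and answer strings are invented for illustration.

```python
def answer_overlap(clean_question, clean_answer):
    """Return the share of answer words (ignoring 'the') that also occur in the question."""
    question_words = clean_question.split()
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Both answer words appear in the question.
full = answer_overlap("new york is the capital of this state", "new york")
# Only 'adams' appears in the question.
half = answer_overlap("this adams was the second us president", "john adams")
# No answer word appears in the question.
none = answer_overlap("the city of yuma in this state has a record average", "arizona")
# The answer is empty once 'the' is removed.
empty = answer_overlap("a very common english word", "the")

print(full, half, none, empty)
```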

In [11]:
#Here we will apply the function to each row in the dataset.

jeopardy['answer_in_question'] = jeopardy.apply(func=ans_in_q, axis=1)

In [12]:
#Here we find the mean proportion of terms that appear in both the question and the answer.

jeopardy['answer_in_question'].mean()

0.05792070323661354

On average, only about 5.8% of the words in an answer also appear in the wording of the question. This suggests that deducing the answer from the question is not a viable strategy for winning Jeopardy.

The next strategy we will test is finding how often terms are repeated (i.e. have already appeared in earlier Jeopardy questions). This will tell us whether studying past Jeopardy questions may be an effective way of winning Jeopardy.

In [13]:
#First, we sort the dataframe by date of question asked.

jeopardy.sort_values(by='Air Date', inplace=True)

In [14]:
#Next we loop through the dataframe and count how often complex terms (terms of >5 characters) are repeated. We will 
#create a new column "question overlap", which counts how many complex terms in each question have also appeared in
#past questions.

terms_used = set()
question_overlap = []
for index, row in jeopardy.iterrows():
    match_count = 0
    split_question = row['clean_question'].split()
    split_question = [x for x in split_question if len(x) > 5]
    for w in split_question:
        if w in terms_used:
            match_count += 1
        else:
            terms_used.add(w)
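    # Questions with no complex terms get an overlap of 0.
    if split_question:
        match_per_word = match_count/len(split_question)
    else:
        match_per_word = 0
    question_overlap.append(match_per_word)

# Assign the list directly: the dataframe was sorted, so wrapping the values in
# pd.Series would realign them on the old index labels and attach them to the wrong rows.
jeopardy['question_overlap'] = question_overlap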
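A point worth checking whenever a column is built from a list after sorting: pandas aligns a Series on its index labels when it is assigned as a column, while a plain list is assigned by position. A minimal sketch on a hypothetical three-row frame (the column names and scores are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "air_date": pd.to_datetime(["2004-12-31", "1997-03-01", "2010-06-15"]),
    "question": ["second", "first", "third"],
})

# Sorting reorders the rows but keeps the original index labels: [1, 0, 2].
df = df.sort_values(by="air_date")

# Scores computed row by row, in sorted order.
scores = [10, 20, 30]

df["from_list"] = scores               # positional: first sorted row gets 10
df["from_series"] = pd.Series(scores)  # label-aligned: the row labelled 0 gets 10

print(df[["question", "from_list", "from_series"]])
```

In this toy frame the two columns disagree for the first two rows. The mean of the column is unaffected, because the same values are present either way, but any row-level use of it would be. Calling `reset_index(drop=True)` after sorting is another way to keep positions and labels in step.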

In [15]:
#Now we find the mean value of repeated terms per question.

print(jeopardy['question_overlap'].mean())

0.8726690023368798


On average, around 87% of the words of 6 characters or more in each question had already been used in earlier Jeopardy questions in the dataset. This suggests that studying past questions could be a promising route to learning how to win Jeopardy.
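To make the metric concrete, the sketch below runs the same bookkeeping on three invented questions. The function name `overlap_per_question` and the `min_length` parameter are hypothetical, and the input is a plain list of already cleaned strings in chronological order.

```python
def overlap_per_question(clean_questions, min_length=6):
    """For each question, return the share of its long words already seen in earlier text."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = 0
        for w in words:
            if w in terms_seen:
                matches += 1
            else:
                terms_seen.add(w)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

questions = [
    "this italian astronomer defended copernicus",
    "copernicus placed this star at the center",
    "an italian astronomer and a polish astronomer",
]

overlaps = overlap_per_question(questions)
print(overlaps)
print(sum(overlaps) / len(overlaps))
```

The first question scores 0 because nothing has been seen yet, so the earliest questions in a dataset pull the mean down slightly. A word repeated within one question also counts as a match on its second occurrence, as "astronomer" does in the third question here.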

Next we can determine if some subjects in past questions are better to study than others. To do this we will try to find patterns in what types of terms are likely to appear in higher value questions (questions where the money on offer is higher) vs. lower value questions.

In [16]:
#First we will write a function that classifies a question as either high value (1) or low value (0).

def value_question(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0

In [17]:
#Next we apply the function to each row of the dataframe.

jeopardy['high_value'] = jeopardy.apply(value_question, axis=1)

In [18]:
#Next we can write a function that takes in a word and a row and determines (1) if the word appears in the question
#and (2) if the question is high value

def highword(row, word):
    high_count = 0
    words_in_q = row['clean_question'].split()
    if word in words_in_q:
        if row['high_value'] == 1:
            high_count += 1
    return high_count
    

In [19]:
#Next we write a function that does the same but for low value questions.

def lowword(row, word):
    low_count = 0
    words_in_q = row['clean_question'].split()
    if word in words_in_q:
        if row['high_value'] == 0:
            low_count += 1
    return low_count
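Since the two functions differ only in the value class they test, both counts can be gathered in a single pass. The sketch below is one possible way to do that, not a replacement for the approach above: `count_word_by_value` is a hypothetical helper, and the rows are invented dictionaries with the same two keys the functions above rely on.

```python
def count_word_by_value(rows, word):
    """Return [high_count, low_count] for questions that contain `word` as a whole word."""
    high = 0
    low = 0
    for row in rows:
        if word in row["clean_question"].split():
            if row["high_value"] == 1:
                high += 1
            else:
                low += 1
    return [high, low]

rows = [
    {"clean_question": "this planet is the largest in the solar system", "high_value": 1},
    {"clean_question": "the red planet", "high_value": 0},
    {"clean_question": "this planet has rings", "high_value": 0},
    {"clean_question": "capital of france", "high_value": 1},
]

planet_counts = count_word_by_value(rows, "planet")
france_counts = count_word_by_value(rows, "france")
plan_counts = count_word_by_value(rows, "plan")
print(planet_counts, france_counts, plan_counts)
```

Because the question is split into words before the membership test, "plan" does not match "planet". With a dataframe, `jeopardy.to_dict("records")` would produce rows in this shape.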

In [20]:
#Next we take a random sample of 10 terms to try out our analysis.

import random
#random.sample needs a sequence, so the set is converted to a list first.
sample_terms = random.sample(list(terms_used), 10)

In [21]:
#Here we make a list of the observed frequencies of each word in high value questions and low value questions.

observed = []
for w in sample_terms:
    high_count = jeopardy.apply(lambda x: highword(x, w), axis=1).sum()
    low_count = jeopardy.apply(lambda x: lowword(x, w), axis=1).sum()
    observed.append([high_count, low_count])
    
    

In [22]:
#Here are the observed frequencies.

print(observed)

[[6, 13], [1, 3], [0, 1], [0, 1], [0, 1], [0, 3], [1, 0], [1, 0], [0, 1], [1, 4]]


We can see that the sampled terms show variable frequencies among high and low value questions, but we cannot tell whether the differences in frequency for each word are statistically significant. To find out, we can perform a chi square analysis.

In [23]:
#For a chi square analysis we need the expected frequencies of high and low value questions. We start by
#counting how many questions of each kind there are in the dataset as a whole.

high_value_count = len(jeopardy[jeopardy['high_value'] == 1].index)
low_value_count = len(jeopardy[jeopardy['high_value'] == 0].index)

In [24]:
#Now we can perform the chi square test. For each word we add its observed high value and low value counts and
#divide by the number of questions in the dataset, which gives the proportion of questions containing the word.
#Multiplying that proportion by the overall high value count and low value count gives the expected number of
#high value and low value questions for that word. With the observed and expected frequencies in hand, we run
#the chi square test and append the result to a list.


from scipy.stats import chisquare
chi_squared = []
for i in observed:
    total = sum(i)
    total_prop = total/len(jeopardy.index)
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    chi_squared.append(chisquare(i, [expected_high, expected_low]))

chi_squared

[Power_divergenceResult(statistic=0.09977335504492044, pvalue=0.7521017865262274),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=1.184929392700054, pvalue=0.27635474913315955),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.1702839704934861, pvalue=0.6798595573662745)]
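The arithmetic behind one of these results can be reproduced by hand. The sketch below works through a single word with observed counts [6, 13]. The dataset totals (`total_rows`, `high_value_count`, `low_value_count`) are hypothetical round numbers chosen to keep the arithmetic readable, not the real counts from this dataset, so the statistic differs from the output above. The p-value uses the closed form for one degree of freedom, which is the case when there are two categories.

```python
import math

def chi_square_two_cells(observed, expected):
    """Pearson chi square statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > x) has the closed form erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical dataset totals (assumptions, not taken from the real data).
total_rows = 20_000
high_value_count = 5_600
low_value_count = 14_400

observed = [6, 13]                       # high value and low value counts for one word
total_prop = sum(observed) / total_rows  # share of all questions containing the word
expected = [total_prop * high_value_count, total_prop * low_value_count]

statistic, p_value = chi_square_two_cells(observed, expected)
print(expected)
print(statistic, p_value)
```

Here the expected counts come out at roughly 5.32 and 13.68, very close to the observed 6 and 13, which is why the statistic is small and the p-value is far above 0.05.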

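Unfortunately, none of the terms returned a significant result: every p-value is above 0.05, so none of the sampled words appears in high value questions more or less often than the overall split of high and low value questions would predict. On this evidence, looking for words that are more common in higher value questions is not a promising route to winning Jeopardy.

However, this is only a random sample of 10 words, and most of them occur fewer than five times in the dataset, where the chi square test is unreliable. A larger sample, restricted to more frequent terms, might still turn up words with a significant result.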

Overall, the most promising route to winning Jeopardy appears to be studying past questions, whether high value or low value.