# Winning Jeopardy

In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

Data Dictionary:-  
* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
Answer -- the text of the answer.

In [113]:
import pandas as pd
import re

In [114]:
jeopardy = pd.read_csv("jeopardy.csv")

In [115]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [116]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [117]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [118]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [119]:
def normalizer(string):
    string = string.lower()
    string = re.sub("\W", " ", string)
    return string

In [120]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalizer)

jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalizer)

In [121]:
jeopardy['Value'].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389'], dtype=object)

In [122]:
def dollar_to_num(string):
    string = re.sub("\W", "", string)
    try:
        string = int(string)
    except:
        string = 0
    return string

In [123]:
jeopardy["Value"] = jeopardy["Value"].apply(dollar_to_num)

jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

Normalized the question and answer columns for analysis and cleaned the air date column

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.  

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [124]:
def match_counter(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0

    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [125]:
jeopardy["answer_in_question"] = jeopardy.apply(match_counter, axis=1)

In [126]:
jeopardy["answer_in_question"].mean()

0.06294645581984949

From the above output we find that the mean is really too low, thus our first point to deduce an answer from the question seems negligible for our winning strategy.

Now let us investigate our second point mentioned above - how often new questions are repeats of older ones.

We can't completely answer this, because we only have about 10% of the full Jeopardy questions dataset, but we can investigate it at least.

In [127]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')
jeopardy.reset_index(drop=True, inplace=True)

In [128]:
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    split_question = [i for i in split_question if len(i) >= 6]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
    
jeopardy["question_overlap"] = question_overlap

In [129]:
jeopardy["question_overlap"].mean()

0.5254786645114725

From the results above it may seem that there is quite a bit of overlap in questions (around 50%), suggesting that old questions are often recycled, but we need to consider the fact that we are only checking for the words not the phrases and also we are only looking at a small subset of data. Therefore no concreate conclusion can be drawn here without further/deeper investigation.

Now, we will try to find question that have high value to maximize the reward.

We, will define high and low value questions like this:-  
* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.

In [130]:
def val_category(row):
    if row['Value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [131]:
jeopardy["high_value"] = jeopardy.apply(val_category, axis=1)

In [132]:
def usage_counter(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        if word in row["clean_question"].split():
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [133]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]
print(comparison_terms)

observed_expected = []

['carrots', 'scottsboro', 'reference', 'imagine', 'chardonnay', 'airplay', 'cutthroat', 'propelled', 'saturn', 'scores']


In [134]:
for term in comparison_terms:
    observed_expected.append(usage_counter(term))

In [135]:
observed_expected  # (high_count, low_count)

[(0, 6),
 (1, 0),
 (0, 9),
 (1, 1),
 (1, 0),
 (1, 0),
 (0, 3),
 (0, 2),
 (3, 6),
 (3, 3)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [148]:
high_value_count = (jeopardy["high_value"] == 1).sum()
low_value_count = (jeopardy["high_value"] == 0).sum()

chi_squared = []

In [156]:
from scipy.stats import chisquare
import numpy as np

for i, j in observed_expected:
    total = i + j
    total_prop = total / len(jeopardy)
    exp_high_val = total_prop * high_value_count
    exp_low_val = total_prop * low_value_count
    
    observed = np.array([i, j])
    expected = np.array([exp_high_val, exp_low_val])
    chi_squared.append(chisquare(observed, expected))

In [157]:
chi_squared

[Power_divergenceResult(statistic=2.411777076761304, pvalue=0.120425590069509),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=3.6176656151419557, pvalue=0.05716903708519498),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.09564350170321084, pvalue=0.75712159875701),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886)]

## Chi-squared results
None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.