# Winning Jeopardy
<p>Jeopardy is a US TV show in which participants answer questions to win money. We will look at a data set of Jeopardy questions to find out whether there are patterns that could help a contestant win the show.</p>
<p>The data set contains 2000 rows and can be found <a href="https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file">here</a>.</p>

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.isnull().sum()

Show Number    0
 Air Date      0
 Round         0
 Category      0
 Value         0
 Question      0
 Answer        0
dtype: int64

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()

In [5]:
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

We want to normalise the `Question` and `Answer` columns by converting the strings to lowercase and removing all punctuation. We define a function to do that and then apply it to both columns:

In [13]:
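def normalise_col(col):
    col = col.str.lower()
    # regex=True is needed in pandas >= 2.0, where str.replace matches literally by default
    col = col.str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
    col = col.str.replace(r'\s+', ' ', regex=True)     # collapse whitespace runs
    return col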

In [14]:
jeopardy['clean_answer'] = normalise_col(jeopardy['Answer'])
jeopardy['clean_question'] = normalise_col(jeopardy['Question'])

In [15]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,copernicus,for the last 8 years of his life galileo was u...
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,jim thorpe,no 2 1912 olympian football star at carlisle i...
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,arizona,the city of yuma in this state has a record av...
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,mcdonalds,in 1963 live on the art linkletter show this c...
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,john adams,signer of the dec of indep framer of the const...
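
As a quick check of what the two patterns do, here is a minimal sketch that applies the same steps to single strings with the standard `re` module. The helper name `normalise_text` is hypothetical and only mirrors `normalise_col` for illustration.

```python
import re

def normalise_text(text):
    # Hypothetical single-string counterpart of normalise_col
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace runs
    return text

print(normalise_text("McDonald's"))
print(normalise_text("No. 2: 1912 Olympian;  football star"))
```

The apostrophe, the full stop, the colon and the semicolon are removed, and the double space is reduced to a single one.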


Furthermore, we need to normalise the `Value` column by removing the dollar sign and converting the string to `int`. The `Air Date` column should be converted from a string to a datetime:

In [16]:
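def normalise_values(string):
    # Strip the dollar sign and any other punctuation
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        string = int(string)
    except ValueError:
        # Values that cannot be parsed as an integer are treated as 0
        string = 0
    return string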

In [19]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [20]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,copernicus,for the last 8 years of his life galileo was u...,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,jim thorpe,no 2 1912 olympian football star at carlisle i...,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,arizona,the city of yuma in this state has a record av...,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,mcdonalds,in 1963 live on the art linkletter show this c...,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,john adams,signer of the dec of indep framer of the const...,200


In [21]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_answer              object
clean_question            object
clean_value                int64
dtype: object
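
The sketch below runs the same clean-up on a few hand-written inputs to show how the fallback behaves. The sample strings are assumptions chosen for illustration, not values taken from the data set, and `parse_value` is a hypothetical stand-in for `normalise_values`.

```python
import re

def parse_value(raw):
    # Hypothetical stand-in for normalise_values
    cleaned = re.sub(r'[^A-Za-z0-9\s]', '', raw)
    try:
        return int(cleaned)
    except ValueError:
        return 0  # anything that is not a plain integer falls back to 0

# Made-up inputs for illustration
samples = ['$200', '$2,000', 'None']
parsed = [parse_value(s) for s in samples]
print(parsed)
```

Because unparseable values become 0, they are indistinguishable from genuinely low values in any later analysis based on `clean_value`.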

Next, we want to know how often questions are repeated or very similar. We will test:
<ul>
<li> How often the answer can be deduced from the question</li>
<li> How often questions are repeated</li>
</ul>
<p>This function computes the fraction of the words in `Answer` (after dropping "the") that also occur in `Question`:</p>

In [32]:
def match_counting(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)

In [33]:
jeopardy['answer_in_question'] = jeopardy.apply(match_counting, axis = 1)
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in the question. It's definitely better to study than to hope to derive the answer from the words used in the question.
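
To make the ratio concrete, the following sketch applies the same logic to one hand-written answer/question pair. The function `answer_overlap` and the example strings are hypothetical; they only illustrate how `match_counting` arrives at its fraction.

```python
def answer_overlap(clean_answer, clean_question):
    # Same steps as match_counting, but on two plain strings
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up example: 'new' and 'york' occur in the question, 'times' does not
ratio = answer_overlap('the new york times',
                       'this new york paper is nicknamed the gray lady')
print(ratio)
```

Two of the three remaining answer words are found in the question, so the row would score 2/3. The column mean is the average of these per-row fractions.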

## Repeated questions
How many times was a question repeated?

In [35]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values(by = ['Air Date'])
for ii, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word)>5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /=len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()        

0.6894947317226771

We only compared the longer words (six or more characters) in each question. On average, about 69% of them had already appeared in an earlier question, so many terms recur regularly. However, this does not test whether the exact same questions have been reused. On top of that, we are only investigating 2000 examples and not every question that has ever been asked on the show.
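
A small sketch of the same bookkeeping on three made-up questions shows how the score is built up over time. The function name `overlap_scores`, its `min_len` parameter and the toy questions are assumptions for illustration only.

```python
def overlap_scores(questions, min_len=6):
    # questions must already be in chronological order
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_len]
        matches = 0
        for word in words:
            if word in seen:
                matches += 1
            seen.add(word)
        scores.append(matches / len(words) if words else 0)
    return scores

# Made-up questions, already normalised
toy_questions = [
    'galileo studied planets',
    'planets orbit the sun',
    'galileo observed jupiter',
]
scores = overlap_scores(toy_questions)
print(scores)
```

The second question scores 1.0 although it shares only a single term with the first one, which illustrates why a high average overlap does not imply that whole questions are recycled.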

## High vs low value questions
<p>We want to figure out which terms correspond to high-value questions using a chi-squared test. Questions worth more than $800 count as high value:</p>

In [39]:
jeopardy.loc[jeopardy['clean_value']>800,'high_value'] = 1
jeopardy.loc[jeopardy['clean_value']<=800, 'high_value'] = 0

In [47]:
def counts(xx):
    low_counts = 0
    high_counts = 0
    for ii, row in jeopardy.iterrows():
        split_row = row['clean_question'].split(' ')
        if xx in split_row:
            if row['high_value'] == 1:
                high_counts +=1
            else:
                low_counts +=1
    return high_counts, low_counts

In [48]:
from random import choice

# Observed (high, low) counts for 10 randomly chosen terms
observed_expected = []
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

for item in comparison_terms:
    observed_expected.append(counts(item))

observed_expected

[(1, 0),
 (1, 0),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 3),
 (0, 1),
 (0, 3),
 (1, 0),
 (0, 1)]

In [55]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for item in observed_expected:
    # Share of all questions that contain the term
    total = sum(item)
    total_prop = total/jeopardy.shape[0]
    # Expected counts if the term were spread evenly over high- and low-value questions
    high_value_prop = total_prop*high_value_count
    low_value_prop = total_prop*low_value_count

    observed = np.array([item[0],item[1]])
    expected = np.array([high_value_prop, low_value_prop])
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
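
For reference, the statistic that `chisquare` reports can be reproduced by hand. The sketch below does so for a term seen once in a high-value question and never in a low-value one. The group sizes (300 high, 700 low) are made-up numbers, not the counts from this data set, and `chi_squared_two_groups` is a hypothetical helper. With two categories there is one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_groups(observed, group_sizes):
    # observed: (high, low) counts of a term
    # group_sizes: (high, low) number of questions in each group
    n_questions = sum(group_sizes)
    term_share = sum(observed) / n_questions
    expected = [term_share * size for size in group_sizes]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical group sizes, chosen only for illustration
statistic, p_value = chi_squared_two_groups((1, 0), (300, 700))
print(statistic, p_value)
```

Expected counts this small, well below the common rule of thumb of about 5 per cell, make the chi-squared approximation unreliable, so results for terms that occur only a handful of times should be read with caution. In the output above, none of the ten p-values is below 0.05, so none of the sampled terms shows a significant difference between high- and low-value questions.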