# Jeopardy
Let's say we want to compete on Jeopardy and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to look for patterns in the questions that could help us win.

In [1]:
import pandas as pd

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
#remove leading whitespace of columns
jeopardy.columns = jeopardy.columns.str.lstrip()

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [6]:
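#normalise text: lowercase, strip punctuation, collapse runs of whitespace
import re
def normalise(text):
    text = text.lower()
    #raw strings keep \s from being treated as an invalid string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text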

In [7]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise)

In [8]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise)
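
To make the effect of `normalise` concrete, here is a small standalone check. The function is repeated so the snippet runs on its own, and the padded input strings are made up for illustration. One thing it shows is that leading and trailing spaces are collapsed to a single space rather than removed.

```python
import re

def normalise(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse each run of whitespace into one space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

examples = ["McDonald's", "Hello,   World!", "  Jim Thorpe "]
cleaned = [normalise(e) for e in examples]
print(cleaned)  # ['mcdonalds', 'hello world', ' jim thorpe ']
```

Because edge spaces survive, a later `split(' ')` on such a string would produce empty-string entries; calling `.strip()` first would be one way to avoid that, if it turns out to matter for this dataset.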

In [9]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [10]:
#remove the dollar signs and commas from values and turn them into ints
def normalise_numeric(text):
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        text = int(text)
    except ValueError:
        #anything that is not a number becomes 0
        text = 0
    return text

In [11]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_numeric)
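
As a quick illustration of how `normalise_numeric` behaves, the sketch below runs it on a few sample strings. The function is repeated so the snippet is self-contained. The inputs are assumptions chosen to show each branch, not values confirmed to be in the dataset.

```python
import re

def normalise_numeric(text):
    # Strip punctuation such as '$' and ',' and convert what is left to an int.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        return int(text)
    except ValueError:
        # Non-numeric input falls back to 0.
        return 0

samples = ['$200', '$2,000', 'None']
values = [normalise_numeric(s) for s in samples]
print(values)  # [200, 2000, 0]
```

Note that the fallback means a missing value and a genuine zero are indistinguishable in `clean_value`.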

In [12]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])


In [13]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Deduction

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) recur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [14]:
# function that takes in a row from our DF, splits the answer and
# question into lists of words and counts the words they share
def common_words(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    #'the' carries no meaning in questions or answers
    if 'the' in split_answer:
        split_answer.remove('the')
    #avoid division by zero at the end of this function
    if len(split_answer) == 0:
        return 0
    #keep track of how many answer words also appear in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    #what proportion of words in answer are in question
    return match_count / len(split_answer)


jeopardy['answer_in_question'] = jeopardy.apply(common_words, axis=1)

In [15]:
jeopardy['answer_in_question'].value_counts()

0.000000    17475
0.500000     1447
0.333333      494
0.250000      156
1.000000      124
0.666667      104
0.200000       68
0.166667       27
0.400000       26
0.142857       21
0.750000       17
0.600000        9
0.125000        9
0.285714        7
0.800000        2
0.428571        2
0.181818        2
0.571429        2
0.300000        2
0.111111        2
0.350000        1
0.444444        1
0.875000        1
Name: answer_in_question, dtype: int64

In [16]:
jeopardy['answer_in_question'].mean()

0.05898946462474639

The vast majority of answers cannot be found within the question. However, there are 1,447 answers for which half of the words in the answer can be found in the question. As an example of how the statistic is calculated, suppose the question was 'name a section of the United States Military' and the answer was 'United States Air Force'. Here 'united' and 'states' can be found in the question but 'air' and 'force' cannot, so the calculation is match_count/len(split_answer) = 2/4 = 0.5.

Looking at the mean of about 0.06, it is very unlikely that we will find the answer within the question, so we'll probably have to find another way to get the answer.
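
The 'United States Air Force' example can be checked with a short standalone sketch. The helper name `answer_overlap` is made up for this illustration. It takes plain strings rather than a DataFrame row, but otherwise follows the same steps as `common_words`.

```python
def answer_overlap(question, answer):
    """Share of answer words (ignoring one 'the') that also appear in the question."""
    split_question = question.lower().split(' ')
    split_answer = answer.lower().split(' ')
    if 'the' in split_answer:
        split_answer.remove('the')
    # Guard against dividing by zero for an empty answer.
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

ratio = answer_overlap('name a section of the United States Military',
                       'United States Air Force')
print(ratio)  # 0.5
```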

# Reappearing questions
Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

To do this, we can:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
  - If it does, increment a counter.
  - Add each word to terms_used.
  
This will enable us to check whether the terms in questions have been used previously. Only looking at words with six or more characters enables us to filter out words like 'the' and 'than', which are commonly used but don't tell us a lot about a question.
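
Before running this on the real data, the steps above can be sketched on a toy list of questions. The function name `overlap_ratios` and the three sample questions are invented for illustration, and the list is assumed to be in air-date order already.

```python
def overlap_ratios(questions, min_len=6):
    """For each question, the share of long words already seen in earlier text."""
    terms_used = set()
    ratios = []
    for question in questions:
        # Keep only words with at least min_len characters.
        words = [w for w in question.split(' ') if len(w) >= min_len]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        ratios.append(match_count / len(words) if words else 0)
    return ratios

toy_questions = [
    'this italian astronomer studied planets',
    'planets orbit this yellow star',
    'the italian dish',
]
ratios = overlap_ratios(toy_questions)
print(ratios)  # [0.0, 0.5, 1.0]
```

The first question has nothing to match against, the second reuses 'planets' but not 'yellow', and the only long word in the third ('italian') has been seen before.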

In [17]:
#proportion of previously seen terms for each question
question_overlap = []
#set of all terms seen so far
terms_used = set()

#sort the DataFrame by air date
jeopardy.sort_values("Air Date", axis=0, inplace=True)

In [18]:
for index, data in jeopardy.iterrows():
    split_question = data['clean_question'].split(' ')
    #only keep words which are not less than 6 chars long
    split_question = [x for x in split_question if len(x) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    if(len(split_question) > 0):
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
question_overlap_mean = jeopardy['question_overlap'].mean()
print(question_overlap_mean)

    

0.6894031359073245


On average, about 69% of the words of six or more characters in a question have already appeared in an earlier question. Although these are recurring words, not recurring phrases, that is still a significant amount and is worth looking into.

# Low/High value questions

Let's say we only want to study terms that show up in high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.
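
For reference, the statistic in question is Pearson's chi-squared. The notation below is introduced only for this note: $O_{\text{high}}$ and $O_{\text{low}}$ are the numbers of high- and low-value questions containing a given term, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high- and low-value questions overall, and $N = N_{\text{high}} + N_{\text{low}}$. Assuming the term is unrelated to question value, its occurrences should split in the same proportion as the questions themselves:

$$
E_{\text{high}} = (O_{\text{high}} + O_{\text{low}}) \cdot \frac{N_{\text{high}}}{N}, \qquad
E_{\text{low}} = (O_{\text{high}} + O_{\text{low}}) \cdot \frac{N_{\text{low}}}{N}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

A larger $\chi^2$ means the observed split is further from the expected one.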

In [19]:
#is this question valuable by our definition
def value_func(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
jeopardy['high_value'] = jeopardy.apply(value_func,axis=1)

In [20]:
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [21]:
import random
def high_low_count(word):
    low_count = 0
    high_count = 0
    for index, data in jeopardy.iterrows():
        split_question = data['clean_question'].split(' ')
        if word in split_question:
            if data['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
    

In [22]:
#try our function on 10 random words in our vocabulary
#random.sample needs a sequence, so convert the set to a list first
comparison_terms = random.sample(list(terms_used), k=10)

In [23]:
comparison_terms

['hauptmann',
 'nigerian',
 'regulates',
 'australian',
 'ribosomal',
 'hrefhttpwwwjarchivecommedia20110505dj13jpg',
 'creating',
 'krakow',
 'sportsmanship',
 'seahawks']

In [24]:
#observed (high, low) counts for each term, used in the chi-squared test
observed_expected = []
for term in comparison_terms:
    counts = high_low_count(term)
    observed_expected.append(counts)
    

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
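
As a sanity check on the arithmetic, here is a hand computation for a single term using only the standard library. It takes the high/low totals from the `high_value` counts above (5734 and 14265) and the observed counts reported for 'hauptmann' in this notebook (0 high, 1 low). The p-value line relies on the fact that, with one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`. This is a sketch for checking the numbers, not a replacement for `scipy.stats.chisquare`.

```python
import math

# Totals from jeopardy['high_value'].value_counts() in this notebook.
high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

# Observed counts for one term ('hauptmann': 0 high-value, 1 low-value).
observed_high, observed_low = 0, 1
term_total = observed_high + observed_low

# Expected counts if the term were independent of question value.
expected_high = term_total * high_value_count / total_rows
expected_low = term_total * low_value_count / total_rows

chi_squared_stat = ((observed_high - expected_high) ** 2 / expected_high
                    + (observed_low - expected_low) ** 2 / expected_low)

# One degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_squared_stat / 2))

print(round(chi_squared_stat, 4), round(p_value, 4))  # 0.402 0.5261
```

Both numbers agree with the values the notebook reports for this term.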

In [25]:
#totals of high and low values to work out expectations
high_value_count = jeopardy['high_value'].sum()
low_value_count = len(jeopardy) - high_value_count

In [26]:
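table = pd.DataFrame(data=observed_expected,index=comparison_terms, columns=['High_value', 'Low_value'])
table['Total'] = table.sum(axis=1)
#DataFrame.append was removed in pandas 2.0; pd.concat displays the table with a totals row
pd.concat([table, table.sum().rename('Total').to_frame().T])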

Unnamed: 0,High_value,Low_value,Total
hauptmann,0,1,1
nigerian,0,1,1
regulates,1,1,2
australian,7,8,15
ribosomal,1,0,1
hrefhttpwwwjarchivecommedia20110505dj13jpg,1,0,1
creating,4,3,7
krakow,1,2,3
sportsmanship,1,0,1
seahawks,0,1,1


In [27]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for index, data in table.iterrows():
    #skip the totals row if one has been added to the table
    if index != 'Total':
        #share of all questions that contain this term
        total_proportion = data['Total'] / len(jeopardy)
        #counts we would expect if the term were independent of question value
        expected_high_value_count = high_value_count * total_proportion
        expected_low_value_count = low_value_count * total_proportion
        chi_squared.append(chisquare(data[['High_value', 'Low_value']], [expected_high_value_count, expected_low_value_count]))


In [28]:
pvalues = [x[1] for x in chi_squared]

In [32]:
p_vals = pd.DataFrame(data=pvalues,index=comparison_terms, columns=['p'])

In [33]:
p_vals

Unnamed: 0,p
hauptmann,0.526077
nigerian,0.526077
regulates,0.504778
australian,0.123279
ribosomal,0.114733
hrefhttpwwwjarchivecommedia20110505dj13jpg,0.114733
creating,0.095769
krakow,0.858289
sportsmanship,0.114733
seahawks,0.526077


In [34]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.375162392980576, pvalue=0.12327877776969601),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.774619927181822, pvalue=0.09576938744167536),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

# P-values
None of the p-values fall below the 0.05 threshold, so we cannot conclude that any of these words depends on the value of the question in which it appears. The word closest to the threshold is 'creating' (p ≈ 0.096), which hints at a possible dependency, but the evidence is weak.

At the same time, it should be noted that the frequencies of our words (how many questions in the entire dataset they appear in) are relatively low, and the chi-squared approximation is unreliable when expected counts are this small. It would be interesting to investigate whether dependencies arise when we focus our attention on words with a higher total frequency.
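
One possible way to set up that follow-up is sketched below. It counts high- and low-value occurrences for every term in a single pass and keeps only the terms that appear in enough questions. The helper names, the `min_total` threshold and the toy rows are all assumptions made for illustration. On the real data, the `(clean_question, high_value)` pairs would come from the `jeopardy` DataFrame.

```python
from collections import Counter

def count_terms_by_value(rows, min_len=6):
    """Count, per term, how many high- and low-value questions contain it."""
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set, so each term is counted at most once per question.
        terms = {w for w in clean_question.split(' ') if len(w) >= min_len}
        target = high_counts if high_value == 1 else low_counts
        target.update(terms)
    return high_counts, low_counts

def frequent_terms(high_counts, low_counts, min_total=5):
    """Terms that appear in at least min_total questions overall."""
    totals = high_counts + low_counts
    return sorted(term for term, n in totals.items() if n >= min_total)

# Hypothetical (clean_question, high_value) pairs.
toy_rows = [
    ('this australian city hosted the olympics', 1),
    ('australian animals include this marsupial', 0),
    ('creating this australian landmark took years', 1),
]
high, low = count_terms_by_value(toy_rows)
frequent = frequent_terms(high, low, min_total=2)
print(frequent)  # ['australian']
print(high['australian'], low['australian'])  # 2 1
```

The resulting counts could then be fed to the same chi-squared calculation, restricted to terms whose expected counts are large enough for the test to be meaningful.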