# Winning Jeopardy: a statistical analysis

Jeopardy is a popular US TV show in which participants answer questions for a chance to win money. It has run for many decades and has become a major force in popular culture.

In this notebook we perform a statistical analysis of a dataset of 20,000 Jeopardy questions to determine whether it is possible to gain an advantage. The data is a sample from the full dataset of 200,000 questions, which can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
%matplotlib inline
import pandas as pd

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing the columns

Before performing our analysis, we normalize the question and answer columns by converting them to lowercase and removing punctuation.

In [5]:
def norm_string(string):
    return (string
            .lower()
            .replace(',', '')
            .replace('.', '')
            .replace(':', '')
            .replace(';', '')
            .replace('"', '')
            .replace("'", '')
            .replace('!', '')
            .replace('?', '')
           )    

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_string)
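
The function above removes a fixed list of punctuation marks. As a hedged alternative, the sketch below strips every ASCII punctuation character with `str.translate`. The name `norm_string_all` is hypothetical and not part of the notebook. Because it removes more characters than `norm_string` (hyphens and angle brackets, for example), using it would change the numbers reported below.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def norm_string_all(text):
    """Lowercase `text` and strip all ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)

print(norm_string_all('In 1963, live on "The Art Linkletter Show"...'))
# in 1963 live on the art linkletter show
```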

Next we normalize the _Value_ column and the _Air Date_ column. We convert the _Value_ column to numeric values and the _Air Date_ column to datetime.

In [7]:
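def norm_dollars(string):
    # Strip the dollar sign, e.g. '$200' -> '200'
    value = string.replace('$', '')
    try:
        value = int(value)
    except ValueError:
        # Anything that is not a plain integer (e.g. 'None' or '2,000') is set to 0
        value = 0
    return value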

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_dollars)

In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
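
Note that `norm_dollars` maps any value containing a thousands separator, such as `'$2,000'`, to 0, because `int('2,000')` raises an error. If the data contains such values, a vectorized conversion along the following lines may be preferable. This is a minimal sketch: `clean_values` is a hypothetical helper, and switching to it would change the results shown later in the notebook.

```python
import pandas as pd

def clean_values(values):
    """Convert strings like '$200' or '$2,000' to integers; unparseable entries become 0."""
    # Drop dollar signs and thousands separators before parsing.
    stripped = values.str.replace(r'[$,]', '', regex=True)
    # errors='coerce' turns anything non-numeric (e.g. 'None') into NaN.
    return pd.to_numeric(stripped, errors='coerce').fillna(0).astype(int)

sample = pd.Series(['$200', '$2,000', 'None'])
print(clean_values(sample).tolist())
# [200, 2000, 0]
```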

# Analysis of Jeopardy questions

We are now interested in finding:
* how often the answer can be deduced from the question
* how often new questions are repeats of older questions

This will give a guideline on how much time should be spent on studying previous questions.

## Is the answer contained in the question?

We begin by answering the first question. Our approach is to compute, for each row, the fraction of the words in the answer (ignoring "the") that also occur in the question.

In [10]:
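def question_analysis(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # Ignore "the", which is common enough to produce spurious matches
    split_answer = [word for word in split_answer if word != 'the']
    # An answer made up only of "the" leaves nothing to match
    if len(split_answer) == 0:
        return 0
    # Fraction of answer words that also appear in the question
    match_count = sum(word in split_question for word in split_answer)
    return match_count / len(split_answer)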

In [11]:
jeopardy['answer_in_question'] = jeopardy.apply(question_analysis, axis=1)
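
To make the metric concrete, here is a small standalone sketch of the same computation on a single question and answer pair. The function name `answer_overlap` and the example text are invented for illustration and do not come from the dataset.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split(' ') if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split(' '))
    return sum(w in question_words for w in answer_words) / len(answer_words)

# 'of' and 'england' appear in the question, 'george' and 'iii' do not.
print(answer_overlap('this king of england lost the american colonies',
                     'george iii of england'))
# 0.5
```

The example also shows a weakness of the measure: short function words such as "of" count as matches even though they reveal nothing about the answer.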

In [12]:
jeopardy['answer_in_question'].mean()

0.05548872307085218

In [13]:
jeopardy['answer_in_question'].describe()

count    19999.000000
mean         0.055489
std          0.161427
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: answer_in_question, dtype: float64

## Conclusions

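On average, about 5.5% of the words in an answer also appear in the question (mean 0.055). The 75th percentile is 0, so for at least three quarters of the questions no answer word appears in the question at all. The maximum is 1, meaning that in some cases every word of the answer appears in the question, but this is rare. Hearing the question therefore seldom gives away the answer.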

## How often are questions repeats of older questions?

We now investigate how often questions are repeats of older Jeopardy questions. To answer this, we check, for each question, what fraction of its words of six or more characters have already appeared in questions earlier in the dataset. Restricting the check to longer words filters out most common words, and the resulting fraction serves as a proxy for our original question.

In [14]:
question_overlap = []
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [y for y in split_question if len(y)>5]
    
    match_count = 0
    if len(split_question)>0:    
        for word in split_question:
            match_count += int(word in terms_used)
        for word in split_question:
            terms_used.add(word)
        question_overlap.append(match_count / len(split_question))
    else:
        question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
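
The loop above can be illustrated on a toy input. The following sketch reimplements it as a standalone function; `overlap_with_history` and the three example questions are made up for illustration.

```python
def overlap_with_history(questions, min_length=6):
    """For each question, the fraction of its long words already seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        if words:
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            # No long words at all: scored 0, as in the notebook.
            overlaps.append(0)
        seen.update(words)
    return overlaps

toy_questions = [
    'the largest planet in the solar system',
    'this planet has the largest moon in the system',
    'a red dwarf',
]
print(overlap_with_history(toy_questions))
# [0.0, 1.0, 0]
```

The second toy question scores 1.0 even though it asks something different from the first, which is why this measure is only a proxy for repeated questions.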

In [15]:
jeopardy['question_overlap'].describe()

count    19999.000000
mean         0.685725
std          0.299306
min          0.000000
25%          0.500000
50%          0.750000
75%          1.000000
max          1.000000
Name: question_overlap, dtype: float64

In [16]:
# Compute percentages of value counts
jeopardy['question_overlap'].value_counts(normalize=True).apply(lambda x: x*100).head(10)

1.000000    30.546527
0.500000    10.885544
0.666667    10.070504
0.000000     7.960398
0.750000     7.695385
0.800000     6.140307
0.600000     4.380219
0.833333     4.110206
0.333333     3.815191
0.400000     2.195110
Name: question_overlap, dtype: float64

## Conclusions

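We have shown that most of the longer words in a question have already appeared in earlier questions. More precisely:
* for ~30% of questions, every word of six or more characters has appeared before
* for ~8% of questions, none of these words has appeared before (this includes questions with no such words, which are scored 0)
* 75% of questions have an overlap of at least 50% with the words used in earlier questions.

This measures the reuse of individual words rather than of whole questions, so it is only a proxy for recycled questions. With that caveat, the analysis suggests that studying past questions is a winning strategy.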

# Typical words in high-value questions

Next we study which words are more likely to appear in high-value questions. To do this, we first split questions into two categories, depending on their value:

* Low value: Value < 800
* High value: Value >= 800

This gives us the proportion of low-value and high-value questions. For each word, we multiply its total number of occurrences (counted as the number of questions containing the word) by these two proportions to obtain the expected number of occurrences in low-value and high-value questions respectively. We can then compare these expected numbers with the observed occurrences and test the difference for statistical significance with a chi-squared test.
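
Written out, the procedure described above amounts to the following. The notation is introduced here for illustration only: $N$ is the total number of questions, $N_{\text{low}}$ and $N_{\text{high}}$ are the numbers of low- and high-value questions, $n_w$ is the number of questions containing the word $w$, and $O_{\text{low}}$, $O_{\text{high}}$ are the observed counts of $w$ in each category.

```latex
\begin{align}
E_{\text{low}}(w)  &= n_w \, \frac{N_{\text{low}}}{N}, &
E_{\text{high}}(w) &= n_w \, \frac{N_{\text{high}}}{N}, \\
\chi^2(w) &= \frac{\bigl(O_{\text{low}} - E_{\text{low}}(w)\bigr)^2}{E_{\text{low}}(w)}
           + \frac{\bigl(O_{\text{high}} - E_{\text{high}}(w)\bigr)^2}{E_{\text{high}}(w)}
\end{align}
```

With two categories the statistic has one degree of freedom.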

In [17]:
def classify_value(row):
    if row['clean_value'] >= 800:
        return 1
    elif row['clean_value'] < 800:
        return 0

In [18]:
jeopardy['high_value'] = jeopardy.apply(classify_value, axis=1)

In [19]:
def low_high_value_count(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        question_split = row['clean_question'].split(' ')
        if word in question_split:
            high_count += row['high_value']
            low_count += 1 - row['high_value']
    return low_count, high_count

In [20]:
observed_expected = []
comparison_terms = list(terms_used)[:20]
comparison_terms

['abbreviating',
 'haunts',
 'cinchers',
 'boosters',
 'abusive',
 'blinked',
 'bullets',
 'grovers',
 'joseph',
 'auburns',
 'simulator',
 'unable',
 'reformed',
 'jack-jack',
 'mozarts',
 'infant',
 'olenska',
 'prescriptions',
 'corcoran</a>)',
 'cleavers']

In [21]:
for term in comparison_terms:
    observed_expected.append(low_high_value_count(term))
observed_expected

[(1, 0),
 (2, 1),
 (1, 0),
 (2, 0),
 (1, 0),
 (0, 1),
 (0, 2),
 (0, 1),
 (24, 10),
 (1, 0),
 (2, 0),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (3, 0),
 (1, 0),
 (1, 0),
 (1, 0),
 (0, 1)]

In [22]:
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])
chi_squared = []

In [23]:
from scipy.stats import chisquare

for item in observed_expected:
    total = sum(item)
    total_prop = total/jeopardy.shape[0]
    
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    expected = [expected_low_count, expected_high_count]
    
    chi_squared.append(chisquare(f_obs=item, f_exp=expected))
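
As a sanity check on what `chisquare` computes, the statistic for a single word can be worked out by hand with the standard library. This is a sketch under one assumption: the category sizes 12047 (low) and 7952 (high) are not printed anywhere in the notebook; they are values consistent with the statistics shown in the output below.

```python
import math

def chi_squared_1df(observed_low, observed_high, n_low, n_high):
    """Chi-squared statistic and p-value for one word (two categories, 1 degree of freedom)."""
    total = observed_low + observed_high
    n_questions = n_low + n_high
    expected_low = total * n_low / n_questions
    expected_high = total * n_high / n_questions
    statistic = ((observed_low - expected_low) ** 2 / expected_low
                 + (observed_high - expected_high) ** 2 / expected_high)
    # With 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMPTION: category sizes inferred from the printed statistics, not stated in the notebook.
N_LOW, N_HIGH = 12047, 7952

# 'joseph' was observed 24 times in low-value and 10 times in high-value questions.
stat, p = chi_squared_1df(24, 10, N_LOW, N_HIGH)
print(round(stat, 4), round(p, 4))
```

Under that assumption the result should be close to the statistic of about 1.5207 and p-value of about 0.2175 reported for this word. For words seen only once or twice, the expected counts are well below 5, a range in which the chi-squared approximation is usually considered unreliable.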

In [24]:
chi_squared

[Power_divergenceResult(statistic=0.6600813480534574, pvalue=0.41653122582698476),
 Power_divergenceResult(statistic=0.05176339364874122, pvalue=0.8200227626475018),
 Power_divergenceResult(statistic=0.6600813480534574, pvalue=0.41653122582698476),
 Power_divergenceResult(statistic=1.3201626961069148, pvalue=0.2505628539124093),
 Power_divergenceResult(statistic=0.6600813480534574, pvalue=0.41653122582698476),
 Power_divergenceResult(statistic=1.5149647887323945, pvalue=0.2183830639074722),
 Power_divergenceResult(statistic=3.029929577464789, pvalue=0.0817415642907401),
 Power_divergenceResult(statistic=1.5149647887323945, pvalue=0.2183830639074722),
 Power_divergenceResult(statistic=1.5206863338832604, pvalue=0.21751566001383427),
 Power_divergenceResult(statistic=0.6600813480534574, pvalue=0.41653122582698476),
 Power_divergenceResult(statistic=1.3201626961069148, pvalue=0.2505628539124093),
 Power_divergenceResult(statistic=0.08752306839292578, pvalue=0.7673499986093351),
 Power_div

## Conclusions

We cannot conclude that any of the words in our small sample is significantly more likely to appear in either low- or high-value questions. A full analysis including all words should be performed, but such an analysis will necessarily incur many Type I errors (false positives), because a very large number of tests would be run at once.
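
One common way to limit Type I errors when running many tests is a Bonferroni correction, which divides the significance level by the number of tests. The sketch below is not part of the original analysis: `bonferroni_significant` is a hypothetical helper, and the p-values are a few of those printed above, rounded to four decimals.

```python
def bonferroni_significant(p_values, n_tests=None, alpha=0.05):
    """Flag p-values that stay below alpha after a Bonferroni correction."""
    if n_tests is None:
        n_tests = len(p_values)
    # Each individual test must clear the stricter threshold alpha / n_tests.
    threshold = alpha / n_tests
    return [p < threshold for p in p_values]

# A few p-values rounded from the output above; 20 words were tested in total.
sample_p_values = [0.4165, 0.8200, 0.0817, 0.2175]
print(bonferroni_significant(sample_p_values, n_tests=20))
# [False, False, False, False]
```

The correction is conservative: with thousands of words the per-test threshold becomes very small, so only very strong effects would be flagged.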