# Winning Jeopardy

In this project, we'll briefly analyze historical data from the popular game show Jeopardy! to see whether past episodes offer any insight into how a contestant might maximize their chances of winning.

To begin, we'll inspect and clean the data: tidying the column names, normalizing the question and answer text, converting the value column from a series of strings to a series of integers, and converting the date strings to datetime objects.

In [1]:
import pandas as pd
import re

jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
# Standardize column names
new_columns = []
for col in jeopardy.columns:
    if col[0] == ' ':
        col = col.replace(' ', '', 1)
    new_columns.append(col)

jeopardy.columns = new_columns

jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
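
As an alternative to the loop above, the same cleanup can be sketched with the pandas string accessor on the column index. The `toy` frame below is a hypothetical stand-in for `jeopardy`, built only to mimic the leading spaces in the CSV header. Note that `str.strip()` also removes trailing whitespace, which the loop does not.

```python
import pandas as pd

# Hypothetical stand-in for the real data: only the column labels matter here.
toy = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Index.str.strip() removes leading and trailing whitespace from every label.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```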

In [4]:
def normalize_string(string):
    # Lowercase the text, then drop everything except letters, digits and whitespace
    string = string.lower()
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    return string

clean_question = jeopardy['Question'].apply(normalize_string)
clean_answer = jeopardy['Answer'].apply(normalize_string)
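
To make the effect of the normalization concrete, here is a small self-contained sketch that applies the same lowercase-and-strip rule to two sample strings. The second string is an illustrative input, not a row quoted from the dataset.

```python
import re

def normalize_string(string):
    # Lowercase, then remove anything that is not a letter, digit or whitespace.
    string = string.lower()
    return re.sub(r'[^A-Za-z0-9\s]', '', string)

print(normalize_string("McDonald's"))
# mcdonalds
print(normalize_string('In 1963, live on "The Art Linkletter Show"'))
# in 1963 live on the art linkletter show
```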

In [5]:
print(jeopardy['Value'].head())

0    $200
1    $200
2    $200
3    $200
4    $200
Name: Value, dtype: object


In [6]:
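def normalize_dollars(string):
    # Strip punctuation such as '$' and ',' before converting to an integer
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        clean = int(string)
    except ValueError:
        # Values that are not numeric after cleaning are recorded as 0
        clean = 0

    return clean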

clean_value = jeopardy['Value'].apply(normalize_dollars)
print(clean_value.head())

0    200
1    200
2    200
3    200
4    200
Name: Value, dtype: int64
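
The sketch below shows how this conversion behaves on a few inputs. The strings `'$2,000'` and `'None'` are hypothetical examples chosen to exercise the comma and fallback paths; whether they occur in the dataset is an assumption.

```python
import re

def normalize_dollars(string):
    # Remove punctuation, then try to parse what is left as an integer.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        return int(string)
    except ValueError:
        # Anything that is not a number after cleaning falls back to 0.
        return 0

print(normalize_dollars('$200'))    # 200
print(normalize_dollars('$2,000'))  # 2000
print(normalize_dollars('None'))    # 0
```

One consequence of this design is that a missing or non-numeric value becomes indistinguishable from a genuine value of 0, so any later threshold on the cleaned column treats such rows as low value.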


In [7]:
jeopardy['clean_question'] = clean_question
jeopardy['clean_answer'] = clean_answer
jeopardy['clean_value'] = clean_value

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

When looking at this historical data, we want to consider a few questions to see if we can learn anything from previous contestants. The questions we'll be looking to answer here are:

  1. How often does the answer appear in the question?
  2. How much overlap is there between terms in new questions and terms in old questions? 

In [9]:
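def funcs(row):
    # Fraction of the answer's words that also appear in the question
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0

    # Ignore the first 'the', which would otherwise inflate the match rate
    if 'the' in split_answer:
        split_answer.remove('the')

    # Guard against division by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0

    for item in split_answer:
        if item in split_question:
            match_count += 1

    return match_count/len(split_answer)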


In [10]:
answer_in_question = jeopardy.apply(funcs, axis = 1)
means = answer_in_question.mean()
print(means)

0.060493257069335914


On average, only about 6% of the words in an answer also appear in its question, so we can't rely on the question to give a strong indication of the answer.
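
To illustrate what this per-row score measures, here is a minimal sketch of the same calculation on plain strings. The helper name `answer_fraction` and the sample question and answer pairs are hypothetical and not taken from the dataset.

```python
def answer_fraction(clean_question, clean_answer):
    # Share of answer words (ignoring the first 'the') found in the question.
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

print(answer_fraction('this city of light is the capital of france', 'paris'))
# 0.0
print(answer_fraction('this ocean is the largest on earth', 'the pacific ocean'))
# 0.5
```

In the second pair, "ocean" appears in the question but "pacific" does not, so one of the two remaining answer words matches.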

In [11]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [r for r in split_question if len(r) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        
    for word in split_question:
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
mean_overlap = jeopardy['question_overlap'].mean()

print(mean_overlap)

0.6876260592169776


There is substantial overlap: on average, about 69% of the terms in a question (counting only words of six or more characters) have already appeared in an earlier question. This measures single words only, not phrases or whole questions, but the high overlap may indicate that some topics are likely to reappear in future episodes.
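
The running-overlap calculation can be sketched on a few made-up questions to show how the ratio evolves as terms accumulate. The helper `overlap_ratios` and the toy questions are hypothetical; the logic follows the loop above, assuming the questions are already cleaned and in chronological order.

```python
def overlap_ratios(questions, min_length=6):
    # For each question, the share of its long words already seen earlier.
    terms_used = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        ratios.append(seen / len(words) if words else 0)
    return ratios

toy_questions = [
    'this american president signed the declaration',
    'this french president lived in a palace',
    'the declaration was signed by this american',
]
ratios = overlap_ratios(toy_questions)
print(ratios)
```

The first question scores 0 because no terms have been seen yet, the second reuses only "president" (one of its three long words), and the third reuses all three of its long words.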

In [12]:
def check_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    
    return value

jeopardy['high_value'] = jeopardy.apply(check_value, axis = 1)

def word_occurence(word):
    high_count = 0
    low_count = 0
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
            
    return high_count, low_count

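# Observed (high, low) counts for a small sample of terms.
# Sets are unordered, so this sample of five terms can differ between runs.
observed_expected = []
comparison_terms = list(terms_used)[0:5]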

for word in comparison_terms:
    observed_expected.append(word_occurence(word))
    
print(observed_expected)

[(0, 3), (0, 1), (2, 1), (0, 1), (3, 14)]


In [14]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for item in observed_expected:
    total = item[0] + item[1]
    total_prop = total/jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    expected = np.array([expected_high, expected_low])
    observed = np.array([item[0], item[1]])
    
    chi_squared.append(chisquare(observed, expected))

print(chi_squared)

[Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868263753), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=1.0102851115076668, pvalue=0.314834544813388)]


### Chi-Squared Results

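None of the five p-values is below 0.05, so for these terms there is no statistically significant difference in usage between high-value and low-value questions. The observed counts are also very small (four of the five terms appear three times or fewer), and the chi-squared test is unreliable at such low frequencies, so the test would need to be rerun on more frequent terms before drawing firm conclusions.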
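
To show what the test computes for a single term, here is a minimal sketch of the statistic and its p-value using only the standard library. The row totals (`total_rows`, `high_value_count`, `low_value_count`) are hypothetical round numbers, not the dataset's real counts, and the helper names are illustrative. With two categories the test has one degree of freedom, for which the p-value can be written with the complementary error function.

```python
import math

def chi_squared_statistic(observed, expected):
    # Pearson's statistic: sum of (observed - expected)^2 / expected.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_one_dof(statistic):
    # Upper-tail probability of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical totals for illustration only; NOT the dataset's real counts.
total_rows = 1000
high_value_count = 300
low_value_count = 700

# Observed (high, low) counts for one term, as in the last tuple printed above.
observed = (3, 14)
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic = chi_squared_statistic(observed, expected)
print(statistic, p_value_one_dof(statistic))
```

Because the expected counts are scaled from the term's own total, they sum to the same total as the observed counts, which a goodness-of-fit test of this kind requires.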