# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

## Reading the data

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| Unnamed: 0 | Show Number | Air Date | Round | Category | Value | Question | Answer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
# remove spaces in column names
jeopardy.columns = jeopardy.columns.str.replace(" ","")

In [3]:
# normalize Question and Answer

import re

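def normalize_str(string):
    # Lowercase, then drop everything except letters, digits, and whitespace
    string = string.lower()
    # Raw string so that \s is passed to the regex engine unchanged
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    return string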

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_str)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_str)

In [4]:
# normalize Value and Air Date

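def normalize_val(string):
    # Strip punctuation such as '$' and ',' before converting to an integer
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        integer = int(string)
    except ValueError:
        # Non-numeric values fall back to 0
        integer = 0
    return integer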

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_val)

In [6]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

You can answer the first question by seeing how many words in the answer also occur in the question. You can answer the second question by seeing how often complex words (6 or more characters) reoccur. We'll work on the first question now, and come back to the second.

In [9]:
def count_matches(row):
    
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the') # 'the' is common
    if len(split_answer) == 0:
        return 0
    
    for item in split_answer:
        if item in split_question:
            match_count+=1
    
    return match_count/len(split_answer)

In [10]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [11]:
jeopardy['answer_in_question'].mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question, so hearing the question alone rarely gives the answer away.
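To make the metric concrete, here is a minimal sketch of the same per-row calculation applied to two made-up clues. The helper name `answer_word_fraction` and the sample strings are assumptions for illustration and do not come from the dataset. Unlike the cell above, this variant drops every occurrence of 'the' and ignores empty strings produced by repeated spaces.

```python
def answer_word_fraction(clean_question, clean_answer):
    """Return the share of answer words that also appear in the question."""
    # split() with no argument discards empty strings from repeated spaces
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Hypothetical, already-normalized clues
no_overlap = answer_word_fraction('this state is home to the city of yuma', 'arizona')
half_overlap = answer_word_fraction('the new york city borough named for a king',
                                    'the kings borough')
print(no_overlap, half_overlap)
```

In the second clue only 'borough' matches ('kings' is not the same token as 'king'), so one of the two remaining answer words is found. Averaging this fraction over all rows is what produces the figure above.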

## Terms often used

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate')

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [w for w in split_question if len(w) >= 6]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count+=1
    for word in split_question:
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6876260592169802

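On average, about 69% of the long words in a question have already appeared in an earlier question. This only compares single words rather than whole phrases, so it is a rough signal, but it suggests that repeated questions are worth looking into.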
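The chronological bookkeeping is easier to follow on a tiny example. The sketch below repeats the same idea on three made-up questions; the function name `overlap_by_question` and the sample text are hypothetical.

```python
def overlap_by_question(questions, min_length=6):
    """For each question, in order, return the share of its long words seen earlier."""
    terms_seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            # Count matches before this question's own words are recorded
            overlaps.append(sum(w in terms_seen for w in words) / len(words))
        else:
            overlaps.append(0)
        terms_seen.update(words)
    return overlaps


# Hypothetical normalized questions, oldest first
sample = [
    'this italian astronomer was under house arrest',
    'the astronomer copernicus was born in this country',
    'name this river',
]
print(overlap_by_question(sample))
```

The first question has nothing to match against, the second reuses 'astronomer' out of its three long words, and the third has no long words at all, so it contributes 0. This is why sorting by air date matters: the order decides which questions count as "earlier".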

## Question value

Let's say you only want to study terms that show up in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

In [16]:
def get_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
        
    return value

jeopardy['high_value'] = jeopardy.apply(get_value, axis=1)

In [17]:
def count_usage(word):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        split_q = row['clean_question'].split(' ')
        
        if word in split_q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

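# Observed (high_count, low_count) pairs, one per term
observed_expected = []

# Sets are unordered, so this picks an arbitrary five terms
comparison_terms = list(terms_used)[:5]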

for term in comparison_terms:
    v = count_usage(term)
    observed_expected.append(v)

In [18]:
observed_expected

[(6, 11), (0, 1), (0, 1), (8, 15), (0, 1)]

In [20]:
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

chi_squared = []

from scipy.stats import chisquare
import numpy as np

for item in observed_expected:
    total = sum(item)
    total_prop = total/jeopardy.shape[0]
    
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([exp_high, exp_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3645894470821917, pvalue=0.5459683564099789),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4200146034379173, pvalue=0.5169297575235856),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
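
The expected counts in the loop above assume a term is spread across high- and low-value rows in the same proportion as the rows themselves. As a sketch of what `chisquare` computes for two categories, the statistic and p-value can be reproduced with the standard library alone. The function name and the row totals below are made-up placeholders, not the dataset's actual counts.

```python
import math


def chi_squared_two_groups(observed_high, observed_low, n_high_rows, n_low_rows):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total_rows = n_high_rows + n_low_rows
    total_uses = observed_high + observed_low
    # Expected counts if the term were used at the same rate in both groups
    expected_high = total_uses * n_high_rows / total_rows
    expected_low = total_uses * n_low_rows / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # With 1 degree of freedom, P(chi2 > x) equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical totals: 300 high-value rows and 700 low-value rows
stat, p = chi_squared_two_groups(observed_high=10, observed_low=10,
                                 n_high_rows=300, n_low_rows=700)
print(stat, p)
```

A term whose observed counts match the 30/70 split exactly (for example 6 and 14) gives a statistic of 0, and the statistic grows as the observed split moves away from that proportion.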

## Results

None of the terms had a significant difference in usage between high value and low value rows (every p-value is above 0.5). Additionally, three of the five terms appear in only one question, so their frequencies are far below 5 and the chi-squared test isn't reliable for them. It would be better to run this test with only terms that have higher frequencies.
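
One possible way to find higher-frequency terms is to count every term in a single pass and keep only those above a cutoff, instead of scanning the whole dataset once per term. The sketch below is an assumption about how that could look, not part of the original analysis. `count_term_usage` and `min_total` are hypothetical names, and the cutoff of 5 is only a placeholder: the usual rule of thumb concerns expected counts, so a larger cutoff may be needed in practice.

```python
from collections import Counter


def count_term_usage(rows, min_total=5):
    """Count how many high- and low-value questions use each long term.

    `rows` is an iterable of (clean_question, high_value) pairs.
    Returns {term: (high_count, low_count)} for terms used in at least
    `min_total` questions.
    """
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set counts each term once per question, like `word in split_q`
        terms = {w for w in clean_question.split() if len(w) >= 6}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return {
        term: (high_counts[term], low_counts[term])
        for term in set(high_counts) | set(low_counts)
        if high_counts[term] + low_counts[term] >= min_total
    }


# Hypothetical rows; with the real data you could pass
# zip(jeopardy['clean_question'], jeopardy['high_value'])
rows = [
    ('this president signed the treaty', 1),
    ('name the president', 0),
    ('which president was a general', 0),
]
print(count_term_usage(rows, min_total=3))
```

The resulting `(high_count, low_count)` pairs have the same shape as the entries in `observed_expected`, so they could be fed into the same chi-squared loop.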