# Guided Project: Winning Jeopardy (Hypothesis Testing)

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, I will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

In [1]:
import pandas as pd
import numpy as np
import re
from scipy.stats import chisquare

In [2]:
jeopardy = pd.read_json('/Users/miesner.jacob/Desktop/DataQuest/JEOPARDY_QUESTIONS.json')

## Data Exploration

In [3]:
jeopardy.head(5)

Unnamed: 0,air_date,answer,category,question,round,show_number,value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200


In [4]:
jeopardy.shape

(216930, 7)

In [5]:
jeopardy.columns

Index(['air_date', 'answer', 'category', 'question', 'round', 'show_number',
       'value'],
      dtype='object')

In [6]:
jeopardy.columns = jeopardy.columns.str.strip()

In [7]:
jeopardy

Unnamed: 0,air_date,answer,category,question,round,show_number,value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200
5,2004-12-31,the ant,3-LETTER WORDS,"'In the title of an Aesop fable, this insect s...",Jeopardy!,4680,$200
6,2004-12-31,the Appian Way,HISTORY,'Built in 312 B.C. to link Rome & the South of...,Jeopardy!,4680,$400
7,2004-12-31,Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 8: 30 steals for the Birmingham Barons; 2...,Jeopardy!,4680,$400
8,2004-12-31,Washington,EVERYBODY TALKS ABOUT IT...,"'In the winter of 1971-72, a record 1,122 inch...",Jeopardy!,4680,$400
9,2004-12-31,Crate & Barrel,THE COMPANY LINE,'This housewares store was named for the packa...,Jeopardy!,4680,$400


## Data Cleaning

In [8]:
def normalize_string(string):
    # Lowercase the text, then drop everything except letters, digits, and whitespace.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

In [9]:
jeopardy['clean_question'] = jeopardy['question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_string)

In [10]:
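def normalize_dollars(string):
    # Strip "$" and "," so that a value such as "$1,000" becomes "1000".
    string = re.sub(r"[^A-Za-z0-9\s]", "", str(string))
    try:
        string = int(string)
    except ValueError:
        # Missing values (e.g. Final Jeopardy! rows) become 0.
        string = 0
    return string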

In [11]:
jeopardy['clean_value'] = jeopardy['value'].apply(normalize_dollars)

In [12]:
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

## Data Analysis

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [13]:
def answer_question(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)    

In [14]:
jeopardy["answer_in_question"] = jeopardy.apply(answer_question, axis=1)

In [15]:
jeopardy['answer_in_question'].mean()

0.05776013093881471

### Answer terms in the question

On average, only about 5.8% of the words in an answer also appear in its question. This means that you should probably study and not rely on deducing the answer from the question.
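To make the metric concrete, the sketch below recomputes the same kind of fraction on a made-up answer and question (not rows from the dataset). `answer_overlap` is a hypothetical helper name; unlike `answer_question` above, it drops every occurrence of "the" and splits on any whitespace, which is a simplification assumed here for brevity.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring "the") that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Made-up example: "of" and "money" appear in the question, "love" does not (2 of 3).
example = answer_overlap("the love of money", "money is the root of all evil")
print(example)

# An answer consisting only of "the" has no words left to match.
print(answer_overlap("the", "anything at all"))
```

A value near 0 means the question gives away almost none of the answer's words, which is the situation the dataset-wide mean suggests.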

In [19]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('air_date').reset_index(drop=True)

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8654721654522508

### Question overlap

There is about 87% overlap between terms in new questions and terms in old questions. This only covers the questions in this dataset, and it only looks at single terms longer than five characters rather than whole phrases. This means that the overlap is not as significant as the percentage would suggest, but it does indicate that studying old questions is worthwhile.
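The way the overlap score accumulates can be seen on a tiny example. The sketch below applies the same idea (compare each question's long words against all words seen in earlier questions) to three made-up questions listed in chronological order; `overlap_scores` and the sample questions are hypothetical and only for illustration.

```python
def overlap_scores(questions, min_len=6):
    """For each question, the share of its long words already seen in earlier questions."""
    terms_seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        if words:
            score = sum(w in terms_seen for w in words) / len(words)
        else:
            score = 0.0
        # Only add this question's words after scoring it.
        terms_seen.update(words)
        scores.append(score)
    return scores


# Made-up questions, oldest first.
sample = [
    "river mentioned most often in the bible",
    "longest river in africa mentioned by herodotus",
    "herodotus mentioned this river",
]
print(overlap_scores(sample))  # [0.0, 0.25, 1.0]
```

The first question always scores 0 because nothing has been seen yet, and later questions tend to score higher as the vocabulary grows, which is one reason a high average overlap should be read with caution.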

Let's say you only want to study terms that pertain to high-value questions rather than low-value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. First I will need to narrow down the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

In [21]:
def valuefier(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [23]:
jeopardy['high_value'] = jeopardy.apply(valuefier, axis = 1)
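The same flag can likely be computed without a row-wise `apply`. The sketch below assumes an integer `clean_value` column like the one built earlier, and uses a small made-up frame named `toy` rather than the real data.

```python
import pandas as pd

# Made-up values covering both sides of the 800 cut-off (0 stands for a missing value).
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})

# Comparing the whole column at once yields booleans; astype(int) maps them to 1/0.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy)
```

On a frame with a couple of hundred thousand rows, a column-wise comparison like this is usually much faster than calling a Python function once per row.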

In [24]:
jeopardy

Unnamed: 0,air_date,answer,category,question,round,show_number,value,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,1984-09-10,the Jordan,LAKES & RIVERS,'River mentioned most often in the Bible',Jeopardy!,1,$100,river mentioned most often in the bible,the jordan,100,0.000000,0.000000,0
1,1984-09-10,Independence Hall,NATIONAL LANDMARKS,"'Site where John Hancock signed his ""John Hanc...",Double Jeopardy!,1,$800,site where john hancock signed his john hancock,independence hall,800,0.000000,0.000000,0
2,1984-09-10,the love of money,THE BIBLE,"'According to 1st Timothy, it is the ""root of ...",Double Jeopardy!,1,$1000,according to 1st timothy it is the root of all...,the love of money,1000,0.333333,0.000000,1
3,1984-09-10,Mr. Wizard,'50'S TV,'Name under which experimenter Don Herbert tau...,Double Jeopardy!,1,$1000,name under which experimenter don herbert taug...,mr wizard,1000,0.000000,0.000000,1
4,1984-09-10,the Capitol,NATIONAL LANDMARKS,'D.C. building shaken by November '83 bomb blast',Double Jeopardy!,1,$1000,dc building shaken by november 83 bomb blast,the capitol,1000,0.000000,0.000000,1
5,1984-09-10,John Wilkes Booth,NOTORIOUS,"'After the deed, he leaped to the stage shouti...",Double Jeopardy!,1,$1000,after the deed he leaped to the stage shouting...,john wilkes booth,1000,0.000000,0.000000,1
6,1984-09-10,oath,4-LETTER WORDS,'The president takes one before stepping into ...,Double Jeopardy!,1,$1000,the president takes one before stepping into o...,oath,1000,0.000000,0.000000,1
7,1984-09-10,Martin Luther King Day,HOLIDAYS,'The third Monday of January starting in 1986',Final Jeopardy!,1,,the third monday of january starting in 1986,martin luther king day,0,0.000000,0.000000,0
8,1984-09-10,Sean Connery,ACTORS & ROLES,"'He may ""Never Say Never Again"" when asked to ...",Jeopardy!,1,$300,he may never say never again when asked to be ...,sean connery,300,0.000000,0.000000,0
9,1984-09-10,a blintz,FOREIGN CUISINE,'Jewish crepe filled with cheese',Jeopardy!,1,$300,jewish crepe filled with cheese,a blintz,300,0.000000,0.000000,0


In [27]:
def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


observed_expected = []
comparison_terms = list(terms_used)[:5]
for term in comparison_terms:
    result = word_count(term)
    observed_expected.append(result)

In [28]:
observed_expected

[(0, 5), (0, 2), (0, 1), (0, 1), (0, 1)]

In [29]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=1.974882321166757, pvalue=0.15993058334750943),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695)]
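To see where statistics like these come from, here is a minimal sketch that computes the chi-squared statistic by hand for a single term. The row totals (300 high-value and 700 low-value rows) are hypothetical, chosen only to keep the arithmetic easy; they are not the totals in the dataset. With two categories there is one degree of freedom, so the p-value can be obtained from `math.erfc`.

```python
import math

# Hypothetical totals, NOT taken from the Jeopardy dataset.
high_value_count = 300
low_value_count = 700
total_rows = high_value_count + low_value_count

# Observed (high, low) counts for one term, in the same order as observed_expected.
observed = (0, 5)

# Expected counts if the term were spread in proportion to the row totals.
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 4), round(p_value, 4))
```

For the same inputs, `scipy.stats.chisquare(observed, expected)` should report the same statistic and p-value, since it uses the number of categories minus one as the degrees of freedom by default.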

### Statistical Test Result (Chi-Square)

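None of the terms had a significant difference in usage between high-value and low-value rows (all p-values are well above 0.05). In addition, every sampled term appears five times or fewer, and the chi-squared test is unreliable when expected counts are this small. It would be better to rerun the test on terms that occur more frequently.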
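The five terms tested here were simply the first five drawn from a set, so they are arbitrary and happen to be rare, and the chi-squared test is generally considered unreliable when expected counts are very small. One option, sketched below under the assumption that the questions have already been normalized, is to pick the most frequent long terms as candidates instead. `most_frequent_terms` and the sample questions are hypothetical.

```python
from collections import Counter


def most_frequent_terms(clean_questions, n=5, min_len=6):
    """Return the n long terms that appear in the most questions, with their counts."""
    counts = Counter()
    for question in clean_questions:
        # Count each term once per question, mirroring the `word in split_question` check.
        counts.update({w for w in question.split() if len(w) >= min_len})
    return counts.most_common(n)


# Made-up questions for illustration only.
sample_questions = [
    "capital of france",
    "capital of germany",
    "largest capital in europe",
    "germany borders france",
]
print(most_frequent_terms(sample_questions, n=1))  # [('capital', 3)]
```

With the real data, the input would be the `clean_question` column, and the returned terms could be passed to `word_count` in place of `comparison_terms`.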