# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

## The dataset

This dataset has 216.930 rows with questions and the right answer.

In [190]:
import pandas as pd

dataset = pd.read_csv("dataset/JEOPARDY_CSV.csv")
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       213296 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216927 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [191]:
#formatting the column names for improved readability
new_columns = []
for c in dataset.columns:
    c = c.strip()
    c = c.replace(" ", "_")
    c = c.lower()
    new_columns.append(c)

dataset.columns = new_columns
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   show_number  216930 non-null  int64 
 1   air_date     216930 non-null  object
 2   round        216930 non-null  object
 3   category     216930 non-null  object
 4   value        213296 non-null  object
 5   question     216930 non-null  object
 6   answer       216927 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [192]:
dataset.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [193]:
import string

# normalizing the values
dataset['normalized_question'] = dataset['question'].str.lower().str.translate(str.maketrans('', '', string.punctuation)).copy()
dataset['normalized_answer'] = dataset['answer'].str.lower().str.translate(str.maketrans('', '', string.punctuation)).copy()
dataset.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,normalized_question,normalized_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [194]:
# Let's drop these rows that have no answer
dataset['normalized_answer'].isna().sum()

3

In [195]:
dataset = dataset.dropna(subset=['normalized_answer']).copy()
dataset['normalized_answer'].isna().sum()

0

In [196]:
# converting the 'value' column to int
dataset['value'] = dataset['value'].str.replace("$", "").str.replace(",", "").fillna("0").astype(int).copy()
dataset.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,normalized_question,normalized_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [197]:
# converting the 'air_time' column to datetime object
dataset['air_date'] = pd.to_datetime(dataset['air_date']).copy()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 216927 entries, 0 to 216929
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   show_number          216927 non-null  int64         
 1   air_date             216927 non-null  datetime64[ns]
 2   round                216927 non-null  object        
 3   category             216927 non-null  object        
 4   value                216927 non-null  int64         
 5   question             216927 non-null  object        
 6   answer               216927 non-null  object        
 7   normalized_question  216927 non-null  object        
 8   normalized_answer    216927 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(6)
memory usage: 16.6+ MB


In [198]:
def count_matches(row: pd.Series) -> float:
    
    split_question: list = row['normalized_question'].split()
    split_anwers: list = row['normalized_answer'].split()
    match_count = 0
    
    # remove the word 'the' since it's commonly found in answers and questions
    if 'the' in split_anwers:
        split_anwers.remove('the')

    # This prevents a division by zero error later
    if len(split_anwers) == 0:
        return 0
    
    for word in split_anwers:
        if word in split_question:
            match_count += 1

    return match_count / len(split_anwers)

dataset['answer_in_question'] = dataset.apply(count_matches, axis=1)
print(dataset['answer_in_question'].mean())

0.05789203416806156


Less than 6% of the anwers are in the question itself. We can't rely on seeing the answer in the question so it looks like we actually have to study for it.

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

In [199]:
question_overlap = []
terms_used = set([])

for index, row in dataset.sort_values(by=['air_date']).iterrows():
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0

    for word in split_question:
        if word in terms_used:
            match_count += 1

    for word in split_question:
        terms_used.add(word)
        
    if len(split_question) > 0:
        question_overlap.append(match_count / len(split_question))

dataset["question_overlap"] = question_overlap
dataset.head()


Unnamed: 0,show_number,air_date,round,category,value,question,answer,normalized_question,normalized_answer,answer_in_question,question_overlap
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,0.0,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,0.0,1.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,0.0,1.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,0.0,1.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,0.0,1.0


In [200]:
dataset["question_overlap"].mean()

0.9999953901542915

## Low Value vs High Value Questions

Let's say we only want to study questions that pertain to high value questions instead of low value questions. This will help us earn more money when we're on Jeopardy.

In [201]:
def determine_value(row: pd.Series):
    value = 0
    if row['value'] > 800:
        value = 1
    return value

dataset["high_value"] = dataset.apply(determine_value, axis=1)
dataset.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,normalized_question,normalized_answer,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,0.0,0.0,0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,0.0,1.0,0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,0.0,1.0,0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,0.0,1.0,0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,0.0,1.0,0


In [202]:
def count_usage(term):
    low_count = 0
    high_count = 0

    for i, row in dataset.iterrows():
        if term in row["normalized_question"].split():
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [203]:
from random import choice

# Randomly pick ten elements of terms_used and append them to a list called comparison_terms
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

comparison_terms

['originally', 'capital', 'william', 'originally', 'originally', 'william', 'designed', 'building', 'originally', 'william']


In [204]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(192, 448),
 (772, 1909),
 (314, 695),
 (192, 448),
 (192, 448),
 (314, 695),
 (185, 351),
 (181, 488),
 (192, 448),
 (314, 695)]

In [214]:
from scipy.stats import chisquare

high_value_count = dataset[dataset["high_value"] == 1].shape[0]
low_value_count = dataset[dataset["high_value"] == 0].shape[0]

chi_squared = []
for observed in observed_expected:
    # Add up both items in the list (high and low counts) to get the total count,
    total = sum(observed)

    # Divide total by the number of rows in the dataset to get the proportion across the dataset
    total_prop = total / dataset.shape[0]

    # Multiply total_prop by high_value_count to get the expected term count for high value rows
    high_value_exp = total_prop * high_value_count

    # Multiply total_prop by low_value_count to get the expected term count for low value rows
    low_value_exp = total_prop * low_value_count
    
    observed = [observed[0], observed[1]]
    expected = [high_value_exp, low_value_exp]

    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.8956695183041069, pvalue=0.34394553248310644),
 Power_divergenceResult(statistic=0.30512649189854474, pvalue=0.5806862771751663),
 Power_divergenceResult(statistic=3.9121623915838972, pvalue=0.0479378787828858),
 Power_divergenceResult(statistic=0.8956695183041069, pvalue=0.34394553248310644),
 Power_divergenceResult(statistic=0.8956695183041069, pvalue=0.34394553248310644),
 Power_divergenceResult(statistic=3.9121623915838972, pvalue=0.0479378787828858),
 Power_divergenceResult(statistic=10.152043192187406, pvalue=0.0014414139789220711),
 Power_divergenceResult(statistic=0.5226790294953705, pvalue=0.46970111626326394),
 Power_divergenceResult(statistic=0.8956695183041069, pvalue=0.34394553248310644),
 Power_divergenceResult(statistic=3.9121623915838972, pvalue=0.0479378787828858)]

## Chi-Squared Results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.