# Winning Jeopardy

## Jeopardy Questions

[Jeopardy](https://www.jeopardy.com/) is a popular TV show in the US where participants answer questions to win money. In this project, we will work with a dataset of Jeopardy questions to look for patterns to improve the chances of winning.

For those unfamiliar with the show, the contestant usually has to give the answer in the form of a question, while the actual question is phrased as an answer. An example question could be, "This famous white beagle is from the Peanuts series." The contestant would be required to answer with a question such as, "Who is Snoopy?" Although the format of Jeopardy's questions and answers differs from that of most quiz shows, this dataset lists questions and answers in the conventional format, where a contestant is asked a question and provides an answer.

In [1]:
import pandas as pd
import re
import random
from scipy import stats
import numpy as np

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Most of the column names start with a space, so we'll remove all spaces from the column names for ease of use.

In [4]:
jeopardy.columns = jeopardy.columns.str.replace(' ','')
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing Text

Before analyzing the questions, we need to normalize the `Question` and `Answer` columns by removing punctuation and converting the text to lowercase. This keeps the same word from being treated as different words (for example, `Galileo` and `galileo,`), which matters when we compare words later.

In [5]:
def normalize_string(string):
    return re.sub(r'[^\w\s]','',string).lower()

jeopardy['CleanQuestion'] = jeopardy['Question'].apply(normalize_string)
jeopardy['CleanAnswer'] = jeopardy['Answer'].apply(normalize_string)

jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,CleanQuestion,CleanAnswer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
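To make the effect of the pattern concrete, here is a small sketch that applies the same `normalize_string` to a few strings. The `samples` list is made up for illustration and is not taken from the dataset.

```python
import re

def normalize_string(string):
    # Same pattern as above: drop everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', string).lower()

# Hypothetical sample strings, for illustration only
samples = ["McDonald's", "ALL-TIME Athletes", "No. 2: 1912 Olympian;"]
normalized = [normalize_string(s) for s in samples]
print(normalized)
# ['mcdonalds', 'alltime athletes', 'no 2 1912 olympian']
```

Because punctuation is deleted rather than replaced with a space, hyphenated words are joined into one word (`alltime`). Digits survive because `\w` matches them.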


## Normalizing Columns

To compare questions and answers by value, we need to normalize the `Value` column. We'll strip punctuation such as the dollar sign and convert the string to an integer, falling back to 0 when the conversion fails. In addition, we will convert the `AirDate` column to a datetime object for easier searching and sorting.

In [6]:
def normalize_value(value):
    value = re.sub(r'[^\w\s]','',value)
    try:
        value = int(value)
    except (TypeError, ValueError):
        value = 0
    return value

jeopardy['CleanValue'] = jeopardy['Value'].apply(normalize_value)
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,CleanQuestion,CleanAnswer,CleanValue
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
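The first five rows only show the `$200` case, so a short sketch of how `normalize_value` treats other inputs may help. The input strings below are hypothetical and chosen only to exercise each branch.

```python
import re

def normalize_value(value):
    # Strip punctuation such as '$' and ',' before converting
    value = re.sub(r'[^\w\s]', '', value)
    try:
        value = int(value)
    except (TypeError, ValueError):
        # Anything that is not a whole number falls back to 0
        value = 0
    return value

# Hypothetical inputs, for illustration only
print(normalize_value('$200'))    # 200
print(normalize_value('$2,000'))  # 2000
print(normalize_value('None'))    # 0
```

One consequence of the fallback is that any row without a numeric value ends up with a `CleanValue` of 0, so it is grouped with the lowest-value questions in any later comparison by value.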


In [7]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
ShowNumber       19999 non-null int64
AirDate          19999 non-null datetime64[ns]
Round            19999 non-null object
Category         19999 non-null object
Value            19999 non-null object
Question         19999 non-null object
Answer           19999 non-null object
CleanQuestion    19999 non-null object
CleanAnswer      19999 non-null object
CleanValue       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Answers in Questions

The following code measures how often a word in the answer also appears in the question. This check may seem arbitrary, since an answer should generally differ substantially from all but the simplest questions, so we expect little overlap.

In [8]:
def answer_in_question(row):
    match_count = 0
    # Split answer and question into lists of words
    split_answer = row['CleanAnswer'].split()
    split_question = row['CleanQuestion'].split()
    # Remove the first 'the' from the answer and the question
    try:
        split_answer.remove('the')
    except ValueError:
        pass
    try:
        split_question.remove('the')
    except ValueError:
        pass
    # Avoid divide by zero in case there is no answer
    if len(split_answer) == 0:
        return 0
    # Find how often part of the answer is in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
    
    
jeopardy['AnswerInQuestion'] = jeopardy.apply(answer_in_question, axis=1)
jeopardy['AnswerInQuestion'].mean()

0.05854908546293115

On average, only about 6% of the words in an answer also appear in its question, so the question almost never gives away the answer.
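To see how the per-row score behaves, here is a condensed restatement of `answer_in_question` applied to a few rows. The `example_rows` dictionaries are hypothetical stand-ins for DataFrame rows, not data from the file.

```python
def answer_in_question(row):
    split_answer = row['CleanAnswer'].split()
    split_question = row['CleanQuestion'].split()
    for words in (split_answer, split_question):
        if 'the' in words:
            words.remove('the')  # list.remove drops only the first occurrence
    if not split_answer:
        return 0  # avoid dividing by zero when nothing is left of the answer
    matches = sum(word in split_question for word in split_answer)
    return matches / len(split_answer)

# Hypothetical rows, for illustration only
example_rows = [
    {'CleanAnswer': 'the beatles',
     'CleanQuestion': 'the beatles recorded this album in 1969'},
    {'CleanAnswer': 'new york city',
     'CleanQuestion': 'this city is home to the empire state building'},
    {'CleanAnswer': 'arizona',
     'CleanQuestion': 'the city of yuma is in this state'},
]
scores = [answer_in_question(row) for row in example_rows]
print(scores)
```

The three scores are 1.0 (every remaining answer word is in the question), one third (only `city` matches), and 0.0 (no match). The column mean reported above is the average of these per-row fractions.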

## Recycled Questions

If Jeopardy frequently reuses questions, studying past questions would be more useful than studying general knowledge. Our dataset contains only about 10% of the full set of questions, but a pattern may still be visible in this subset.

In [9]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by=['AirDate'], ascending=True)

for row in jeopardy.itertuples():
    # Split the question into a list of words at least six characters long
    split_question = row.CleanQuestion.split()
    split_question = [word for word in split_question if len(word) >= 6]
    match_count = 0
    # Check if the word has been used before
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Add the words to terms_used after checking the entire question
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['QuestionOverlap'] = question_overlap
print(jeopardy['QuestionOverlap'].mean())

0.6876235590919739


On average, about 69% of the longer words in a question have already appeared in an earlier question. That is a high rate of term reuse, but it says little about how often whole questions are reused. For example, the word `automobile` might appear in a large number of completely different questions.
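One way to separate reused questions from reused vocabulary is to count questions whose normalized text repeats exactly. This is a minimal sketch, assuming a hypothetical list `clean_questions` that stands in for the `CleanQuestion` column; it is not part of the original analysis.

```python
from collections import Counter

# Hypothetical normalized questions standing in for jeopardy['CleanQuestion']
clean_questions = [
    'this automobile maker introduced the model t',
    'this state is home to the city of yuma',
    'this automobile maker introduced the model t',
    'the automobile was banned on this michigan island',
]

counts = Counter(clean_questions)
# Number of rows whose exact text already appeared in an earlier row
repeated_rows = sum(n - 1 for n in counts.values())
repeat_share = repeated_rows / len(clean_questions)
print(repeated_rows, repeat_share)  # 1 0.25
```

In this toy list `automobile` occurs in three questions, yet only one row is a true repeat. The check only catches verbatim repeats after normalization, so reworded questions would still be missed.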

## Low-Value vs High-Value Questions

In Jeopardy, questions have different values. If certain terms appear more frequently in high-value questions, those subjects could be useful to study. We'll separate questions into two categories:

* `High-Value` when `Value` is greater than \$800.
* `Low-Value` in all other cases.

Once done, we will randomly sample ten terms from `terms_used` and count how often each one appears in high-value and low-value questions.

In [10]:
jeopardy['HighValue'] = jeopardy['CleanValue'].apply(
    lambda x: 1 if x > 800 else 0)

def word_value(word):
    low_count = 0
    high_count = 0
    for row in jeopardy.itertuples():
        split_question = row.CleanQuestion.split()
        if word in split_question:
            if row.HighValue == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# random.sample needs a sequence, so convert the set of terms to a list
terms_list = list(terms_used)
comparison_terms = random.sample(terms_list, 10)

observed_expected = []

for term in comparison_terms:
    observed_expected.append(word_value(term))

observed_expected

[(1, 0),
 (0, 3),
 (1, 2),
 (0, 1),
 (1, 1),
 (0, 1),
 (1, 2),
 (1, 0),
 (0, 1),
 (0, 1)]
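`word_value` walks the entire DataFrame once per term, which becomes slow when many terms are checked. A possible alternative is to tally every word in a single pass and then look terms up. The sketch below assumes a hypothetical list `rows` of (clean question, high-value flag) pairs standing in for the DataFrame, and `word_value_fast` is an illustrative name.

```python
from collections import Counter

# Hypothetical (clean question, high-value flag) pairs standing in for the DataFrame rows
rows = [
    ('this famous beagle appears in peanuts', 1),
    ('this beagle carried darwin around the world', 0),
    ('darwin published this famous book in 1859', 0),
]

high_counts = Counter()
low_counts = Counter()
for question, high_value in rows:
    # set() counts a word once per question, matching `word in split_question`
    target = high_counts if high_value == 1 else low_counts
    target.update(set(question.split()))

def word_value_fast(word):
    # A Counter returns 0 for words it has never seen
    return high_counts[word], low_counts[word]

print(word_value_fast('beagle'))  # (1, 1)
print(word_value_fast('darwin'))  # (0, 2)
```

Each lookup is then a dictionary access, and the returned tuple keeps the same (high, low) order as `word_value`.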

## Applying the Chi-Squared Test

Now that we have the observed counts for a few different words, we can calculate the expected values along with the chi-squared values and p-values.

In [11]:
from scipy import stats
import numpy as np

high_value_count = sum(jeopardy['HighValue'])
low_value_count = len(jeopardy) - high_value_count

chi_squared = []
for item in observed_expected:
    total = sum(item)
    total_prop = total / len(jeopardy)
    high_exp = total_prop * high_value_count
    low_exp = total_prop * low_value_count
    observed = np.array([item[0], item[1]])
    expected = np.array([high_exp, low_exp])
    chi_squared.append(stats.chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
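To show what goes into each result, the arithmetic for a single term can be written out by hand. The totals below are hypothetical round numbers chosen to keep the calculation easy to follow, not the actual counts from the dataset.

```python
import math

# Hypothetical totals, chosen only to keep the arithmetic easy to follow
total_questions = 20000
high_value_count = 6000
low_value_count = total_questions - high_value_count

observed = (1, 0)  # a term seen once in a high-value question, never in a low-value one

total = sum(observed)
total_prop = total / total_questions
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# With two categories there is one degree of freedom, where the
# upper-tail probability equals erfc(sqrt(chi_sq / 2))
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(expected, chi_sq, p_value)
```

Here the expected counts are about 0.3 and 0.7, giving a statistic of about 2.33. Expected counts this far below 5, a commonly cited minimum for the chi-squared approximation, mean the resulting p-value should be read with caution.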

None of the ten sampled terms shows a statistically significant difference in usage between high-value and low-value questions: every p-value is above 0.05. Each term also appears in at most three questions, so the expected counts are too small for the chi-squared test to be reliable. Even if there were a difference, the number of subjects that would need to be studied would still be very large, which makes this analysis of little practical use.