# Winning Jeopardy: Finding the Leading Strategies
[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is an iconic TV quiz show in the USA that reverses the typical question/answer format of most other quiz shows. Contestants are presented with general knowledge clues in the form of answers and must phrase their responses as questions to earn money.

The goal of this project is to find patterns in the questions that could inform a winning strategy. We'll work with a dataset of over 200,000 rows taken from the beginning of the full set of Jeopardy questions, available for download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Each row represents a single question on a single episode of Jeopardy.

In [1]:
import pandas as pd
import numpy as np
import csv
import re

jeopardy = pd.read_csv("jeopardy.csv")

print(f'Number of rows: {jeopardy.shape[0]}')
print(f'Number of columns: {jeopardy.shape[1]}')
print(f'Number of missing values: {jeopardy.isnull().sum().sum()}')

jeopardy.head()

Number of rows: 216930
Number of columns: 7
Number of missing values: 3637


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

## Cleaning the Data
Our dataset has 3637 missing values and most of our column names have unwanted spaces in front of them. We'll need to clean our data before we begin our analysis.

In [3]:
# Removing leading spaces from column names
cleaned_column_names = []
for column in jeopardy.columns:
    cleaned_column_names.append(column.lstrip())

jeopardy.columns = cleaned_column_names
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
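
The loop above can also be expressed with pandas' vectorized string methods. The following is a minimal sketch on a small stand-in frame; `demo` and its column names are illustrative assumptions, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-in with the same kind of leading spaces as the raw file
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Value'])

# Index.str.lstrip() removes leading whitespace from every column label at once
demo.columns = demo.columns.str.lstrip()

print(list(demo.columns))
```

Using `.str.strip()` instead would also remove trailing spaces, which may be a safer default for files of unknown origin.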

In [4]:
# Removing rows with missing values
jeopardy = jeopardy.dropna()
print(f'Number of missing values: {jeopardy.isnull().sum().sum()}')

Number of missing values: 0


## Normalizing Columns
To make our data easier to work with, we'll normalize the data in the following columns:
- `Question` and `Answer` – making text lowercase and removing punctuation.
- `Value` – removing the dollar signs and converting each value to numeric.
- `Air Date` – converting to datetime.

In [5]:
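def normalize_text(text):
    # Lowercase, strip punctuation, and collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_value(text):
    # Strip "$" and "," so the value can be parsed; non-numeric values become 0
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text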

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
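
To make the effect of the two normalization functions concrete, here is a small self-contained sketch that applies them to individual strings. The sample inputs `"$2,000"` and `"None"` are assumptions chosen for illustration; only `"McDonald's"` is taken from the rows shown above.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

def normalize_value(text):
    # Remove "$" and "," before parsing; fall back to 0 for non-numeric text
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

sample_text = normalize_text("McDonald's")   # apostrophe is removed
sample_value = normalize_value("$2,000")     # hypothetical value with a comma
sample_missing = normalize_value("None")     # hypothetical non-numeric value

print(sample_text, sample_value, sample_missing)
```

Note that mapping unparseable values to 0 places them in the low value group later on, which is a design choice rather than a necessity.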


## Answers in Questions
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer can be deduced from the question.
- How often questions are repeated.

To answer the first question, we can check how many times words in the answer also occur in the question.

In [6]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

jeopardy["answer_in_question"].mean()

0.058228398435179705

On average, only about 6% of the words in an answer also appear in its question. That is too rare for the question itself to be a reliable source of the answer, so let's turn to our second question.
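
To see what the score measures, the sketch below runs the same matching logic on three made-up rows. The rows are hypothetical, and plain dictionaries stand in for dataframe rows.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # "the" is so common that matching on it would be meaningless
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    # Fraction of answer words that also occur in the question
    return match_count / len(split_answer)

# Hypothetical rows, for illustration only
rows = [
    {"clean_question": "the city of yuma is in this state", "clean_answer": "arizona"},
    {"clean_question": "this big apple is also called new york", "clean_answer": "new york city"},
    {"clean_question": "the most common english word", "clean_answer": "the"},
]

scores = [count_matches(row) for row in rows]
print(scores)
```

The first row shares no answer words with its question, the second shares two of three, and the third becomes empty once "the" is removed, so it scores 0.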

## Recycled Questions
Now we're going to find out how often questions are repeated.

The idea is to check whether the words in each question have been used in earlier questions. To make sure we only compare meaningful words, we'll use a stop word list to filter out common filler words such as *the* and *it*.

In [7]:
from stop_words import get_stop_words
stop_words = get_stop_words('en')
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Combining stop word lists from different libraries 
stopword_list = list(set(stop_words+list(STOPWORDS)+list(ENGLISH_STOP_WORDS)+['1','2','3','4','5','6','7','8','9','0']))

print(stopword_list)

['each', 'others', 'whereafter', 'several', 'being', 'and', "she'll", 'was', 'they', 'see', 'nobody', 'show', 'before', 'whose', "we've", 'became', 'thereafter', 'nor', 'who', 'thereby', 'the', 'down', "they'd", 'how', 'were', 'ltd', 'otherwise', 'your', 'when', 'do', 'behind', 'anyone', 'their', 'detail', "it's", "they'll", 'already', 'few', 'further', 'some', 'both', 'beforehand', 'less', 'give', 'twelve', 'them', 'made', 'since', 'empty', 'eleven', '1', 'while', 'rather', 'becoming', 'myself', 'it', "i'd", 'regarding', 'except', 're', 'describe', 'above', 'are', "wasn't", 'couldnt', 'against', '3', 'sincere', "we'll", 'whereupon', 'wherein', 'never', 'someone', 'nevertheless', 'keep', 'out', 'without', 'doesn', 'below', 'us', 'third', 'sixty', 'fill', 'ought', 'five', 'beyond', 'full', 'take', 'say', 'sometime', 'this', 'get', 'because', 'un', 'upon', 'indeed', "why's", 'thus', '5', 'thick', "he'll", 'sometimes', "that's", '4', 'is', 'latterly', 'enough', 'namely', 'eight', 'hereupo

In [8]:
question_overlap = []
terms_used = set()
terms_used_list = []

# sort_values returns a new dataframe, so the result must be assigned back
jeopardy = jeopardy.sort_values(by=['Air Date'])

for i, row in jeopardy.iterrows():
    split_question=row['clean_question'].split(' ')
    match_count = 0
    
    # Removing stop words
    split_question = [word for word in split_question if word not in stopword_list]
    
    for word in split_question:
        if word in terms_used:
            match_count += 1     # the number of repeated words
            
    for word in split_question:
        terms_used.add(word)   # a set of unique words in all the questions
        terms_used_list.append(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
question_overlap_pct = round(jeopardy['question_overlap'].mean()*100)
print(f'{question_overlap_pct}% of meaningful word overlap in questions.')

92% of meaningful word overlap in questions.


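On average, about 92% of the meaningful words in a question have already appeared in a question processed before it. This measures the overlap of single words rather than whole phrases, so it doesn't prove that questions are repeated, but it is high enough to make question recycling worth investigating further.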
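
The following toy run shows how the overlap score behaves. It is a sketch under stated assumptions: the three questions and the tiny stop word set are made up for illustration.

```python
# Hypothetical, very small stop word set (the notebook combines three libraries)
stop_words = {"the", "of", "this", "is", "in", "a"}

# Hypothetical questions, assumed to be in chronological order
questions = [
    "this city is the capital of france",
    "the capital of this country is paris",
    "a river in france",
]

terms_used = set()
overlaps = []
for question in questions:
    words = [w for w in question.split() if w not in stop_words]
    # Share of this question's meaningful words that were seen before
    seen_before = sum(1 for w in words if w in terms_used)
    overlaps.append(seen_before / len(words) if words else 0)
    terms_used.update(words)

print(overlaps)
```

The third question reuses *france* and scores 0.5 even though it asks about something entirely different, which is why a high word overlap is only a hint of recycled questions.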

## Low Value vs High Value Questions
Since we want to earn as much money as we can on Jeopardy, let's figure out which words correspond to high-value questions using a chi-squared test, so we can know which topics are most worth studying. First, we'll need to split the questions into two categories:
- low value – less than or equal to \$800.
- high value – greater than \$800.

We're going to perform the chi-squared test on a handful of the most frequent words in our dataset to see which ones correlate the most with high value questions. We're limiting the number of words because running the chi-squared test across all the words in our dataset would take a very long time to compute.

In [9]:
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
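
The same flag can be computed without a row-wise `apply`, which is usually faster on large frames. This is a minimal sketch on a hypothetical `demo` frame, not the real dataset:

```python
import pandas as pd

# Hypothetical values for illustration; 0 stands for an unparseable value
demo = pd.DataFrame({'clean_value': [200, 800, 1000, 0, 2000]})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0
demo['high_value'] = (demo['clean_value'] > 800).astype(int)

print(demo)
```

Exactly \$800 stays in the low value group because the comparison is strict, matching the definition above.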

In [10]:
# Creating a list of the 25 most frequent words in all the questions
comparison_terms = list(pd.Series(terms_used_list).value_counts()[:25].index)

print(f'The 25 most frequent words in the whole dataset:'
      f'\n{comparison_terms}')

The 25 most frequent words in the whole dataset:
['city', 'named', 'called', 'like', 'seen', 'new', 'country', 'man', 'type', 'film', 'clue', 'state', 'title', 'crew', 'known', 'word', 'said', 'years', 'played', 'novel', 'term', 'wrote', 'president', 'american', 'capital']


Many of these words, such as *like* or *seen*, are not useful topics of study, so we'll remove them and keep only the topics people can prepare for.

In [11]:
words_to_remove = ['named', 'called', 'like', 'seen', 'new', 'man', 'type', 'clue', 'crew', 'known', 'word', 'said', 'played', 'term']
comparison_terms = [x for x in comparison_terms if x not in words_to_remove]

print(f'Useful comparison terms for studying:'
      f'\n{comparison_terms}')

Useful comparison terms for studying:
['city', 'country', 'film', 'state', 'title', 'years', 'novel', 'wrote', 'president', 'american', 'capital']


In [12]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question=row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Extracting high and low values for each of the comparison terms
observed_high_low = []
for word in comparison_terms:
    observed_high_low.append(count_usage(word))
    
print(f'\nNumber of times each word occurred in high and low value questions:'
      f'\n{observed_high_low}')


Number of times each word occurred in high and low value questions:
[(1830, 4430), (1375, 3318), (1361, 3197), (1152, 3195), (1345, 2603), (873, 2110), (1022, 1849), (964, 1847), (814, 1955), (913, 1817), (772, 1853)]
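
Calling `count_usage` once per word scans the whole dataset for every term. As an alternative sketch, all terms can be counted in a single pass. The helper name `count_usage_all` and the sample rows are hypothetical.

```python
from collections import Counter

def count_usage_all(rows, terms):
    """Return (high_count, low_count) per term, in the order of `terms`."""
    term_set = set(terms)
    high, low = Counter(), Counter()
    for clean_question, high_value in rows:
        # Each term is counted at most once per question
        present = term_set.intersection(clean_question.split(' '))
        (high if high_value == 1 else low).update(present)
    return [(high[term], low[term]) for term in terms]

# Hypothetical (clean_question, high_value) pairs for illustration
rows = [
    ("this city is the capital of france", 1),
    ("the capital city of this state", 0),
    ("this novel was written in this city", 0),
]

counts = count_usage_all(rows, ['city', 'capital', 'novel'])
print(counts)
```

With the real data, the rows could presumably be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.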


## Applying the Chi-Squared Test
Now that we've found the observed counts for our comparison terms, we can compute the expected counts, chi-squared values, and p-values:

In [13]:
from scipy.stats import chisquare

# Counting high and low value questions
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

# Counting chi-squared and p-values for each word
chi_squared = []
for item in observed_high_low:
    total = item[0] + item[1] # the number of questions a word occurs in
    total_prop = total / len(jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.5809512438929748, pvalue=0.4459397050776779),
 Power_divergenceResult(statistic=0.5766656287591106, pvalue=0.4476222639375934),
 Power_divergenceResult(statistic=2.5098772876099753, pvalue=0.11313474098463278),
 Power_divergenceResult(statistic=11.1757612387094, pvalue=0.0008287288635621066),
 Power_divergenceResult(statistic=53.493013668195644, pvalue=2.5951408355367423e-13),
 Power_divergenceResult(statistic=0.3197743857612405, pvalue=0.5717432623900001),
 Power_divergenceResult(statistic=64.75146512443725, pvalue=8.496641676371857e-16),
 Power_divergenceResult(statistic=41.42296927532989, pvalue=1.2260824877894515e-10),
 Power_divergenceResult(statistic=0.4859780120408699, pvalue=0.4857269278885131),
 Power_divergenceResult(statistic=28.741930404210784, pvalue=8.269376867432918e-08),
 Power_divergenceResult(statistic=0.48030800377642013, pvalue=0.4882828366922436)]
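
To show what `chisquare` computes, here is the calculation for a single word written out by hand, using the counts for *novel* (1022 high, 1849 low) and the overall totals of 61,422 high value and 151,871 low value questions reported in this notebook. It relies on the fact that with two categories there is one degree of freedom, in which case the p-value equals `erfc(sqrt(statistic / 2))`. Treat it as an illustrative sketch rather than a replacement for SciPy.

```python
import math

high_value_count = 61_422
low_value_count = 151_871
total_questions = high_value_count + low_value_count

# Observed counts for the word 'novel'
observed_high, observed_low = 1022, 1849
word_total = observed_high + observed_low

# Expected counts if 'novel' followed the overall high/low split
expected_high = word_total * high_value_count / total_questions
expected_low = word_total * low_value_count / total_questions

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# One degree of freedom: survival function reduces to erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(statistic / 2))

print(f'expected high: {expected_high:.1f}, expected low: {expected_low:.1f}')
print(f'statistic: {statistic:.4f}, p-value: {p_value:.3e}')
```

The statistic should agree with the value of roughly 64.75 that SciPy reports for *novel*, and the observed high count sits above the expected one.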

To make these results more readable, let's create a dataframe that only includes the words with a p-value lower than 0.05, meaning that the difference between observed and expected counts is statistically significant and unlikely to be due to chance alone. We'll also include how often each word appears in high and low value questions.

In [14]:
chi_squared_list = []
for i in range(len(comparison_terms)):
    chi_squared_row = []
    
    # Adding the word associated with each pair of chi-squared and p-value 
    chi_squared_row.append(comparison_terms[i])  
    
    chi_squared_row.append(list(chi_squared[i])[0]) # chi squared value
    chi_squared_row.append(list(chi_squared[i])[1]) # p value
    chi_squared_row.append(observed_high_low[i][0]) # number of appearances in high value questions
    chi_squared_row.append(observed_high_low[i][1]) # number of appearances in low value questions
    chi_squared_list.append(chi_squared_row)
    
readable_chi_squared = pd.DataFrame(chi_squared_list, columns=['Word', 'Chi_squared', 'p_val', 'High', 'Low'])
readable_chi_squared = readable_chi_squared.sort_values(by=['Chi_squared'], ascending=False).reset_index(drop=True)
readable_chi_squared = readable_chi_squared[readable_chi_squared['p_val']<0.05]

readable_chi_squared

| | Word | Chi_squared | p_val | High | Low |
|---|---|---|---|---|---|
| 0 | novel | 64.751465 | 8.496642e-16 | 1022 | 1849 |
| 1 | title | 53.493014 | 2.595141e-13 | 1345 | 2603 |
| 2 | wrote | 41.422969 | 1.226082e-10 | 964 | 1847 |
| 3 | american | 28.74193 | 8.269377e-08 | 913 | 1817 |
| 4 | state | 11.175761 | 0.0008287289 | 1152 | 3195 |


These 5 words have statistically significant p-values, meaning their split between high and low value questions differs from what we'd expect by chance. The p-value doesn't tell us the direction of that difference, which is why we included each word's counts in the dataframe. In raw counts, all 5 words appear more often in low value questions, but raw counts are misleading if low value questions are simply more common overall.

Let's check how the questions are split between the two categories.

In [15]:
print(f'Number of high value questions: {high_value_count:,}'
      f'\nNumber of low value questions:  {low_value_count:,}')

Number of high value questions: 61,422
Number of low value questions:  151,871


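Only about 29% of the questions are high value, so in raw counts almost any word will show up more often in low value questions. What the chi-squared test measures is the deviation from that baseline. Relative to it, *novel*, *title*, *wrote*, and *american* are over-represented in high value questions (33% to 36% of their appearances, against the expected 29%), while *state* is under-represented (about 27%).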
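
One way to read the direction of each significant result is to compare a word's share of high value appearances with the overall share of high value questions. The sketch below does this with the counts reported above. The `observed` dictionary is simply copied from the results table, and the variable names are illustrative.

```python
high_value_count = 61_422
low_value_count = 151_871

# Overall share of high value questions, used as the baseline
baseline = high_value_count / (high_value_count + low_value_count)

# (high, low) counts copied from the results table
observed = {
    'novel': (1022, 1849),
    'title': (1345, 2603),
    'wrote': (964, 1847),
    'american': (913, 1817),
    'state': (1152, 3195),
}

direction = {}
for word, (high, low) in observed.items():
    share = high / (high + low)
    # A share above the baseline means the word leans towards high value questions
    direction[word] = 'high' if share > baseline else 'low'
    print(f'{word:<9} {share:.1%} high value (baseline {baseline:.1%})')
```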

## Conclusion
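We found five words that appear commonly in Jeopardy questions and whose distribution across high and low value questions differs significantly from what we'd expect by chance:
- Novel
- Title
- Wrote
- American
- State

In raw counts, all five appear more often in low value questions (worth \$800 or less), but that mostly reflects how common low value questions are. Relative to the overall split, *novel*, *title*, *wrote*, and *american* appear in high value questions (worth more than \$800) more often than expected, which makes them the most promising topics to study, while *state* leans towards low value questions.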