### Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv and contains roughly the first 20,000 rows (19,999 once loaded) of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Let's load it and look at the first few rows:

In [127]:
import pandas as pd
pd.options.display.max_columns = 50

jeopardy = pd.read_csv("jeopardy.csv")

### Exploration and clean up

In [128]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [129]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

### Clean up 

It seems that most of the column names have a leading space in front of the text. Let's clean up the column names to make them easier to work with.

In [130]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [131]:
jeopardy.columns = ['show_number', 'air_date', 'round', 'category', 'value',
       'question', 'answer']

jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')
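Typing the names by hand works fine for seven columns. As an alternative, here is a minimal sketch of doing the same cleanup programmatically. `clean_column_name` is a hypothetical helper, not part of the original notebook, and the list below simply copies the raw names shown above.

```python
def clean_column_name(name):
    """Strip surrounding whitespace, lowercase, and use underscores for spaces."""
    return name.strip().lower().replace(" ", "_")

# Column names as they appear in the raw file (note the leading spaces).
raw_columns = ["Show Number", " Air Date", " Round", " Category",
               " Value", " Question", " Answer"]

clean_columns = [clean_column_name(c) for c in raw_columns]
print(clean_columns)

# With the DataFrame loaded, the same idea would be:
# jeopardy.columns = [clean_column_name(c) for c in jeopardy.columns]
```

This avoids retyping the names if the file ever gains or reorders columns.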

Before we can start analysing the questions, we first need to normalize the text and value columns: words shouldn't be treated as different because of capitalisation or punctuation, and values should be stored as a numeric dtype.

In [132]:
import re

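def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and ","; values that aren't numbers become 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text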
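As a quick sanity check of the two helpers, the sketch below runs compact equivalents on a few hand-picked strings. The inputs `"$2,000"` and `"None"` are illustrative assumptions about what the value column might contain, not values confirmed from the file.

```python
import re

def normalize_text(text):
    # Lowercase and keep only letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

def normalize_values(text):
    # Remove punctuation, then try to read what is left as an integer.
    digits = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(digits)
    except ValueError:
        return 0

print(normalize_text("McDonald's"))   # mcdonalds
print(normalize_values("$2,000"))     # 2000
print(normalize_values("None"))       # 0
```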

In [133]:
jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["value"].apply(normalize_values)
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [134]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
show_number       19999 non-null int64
air_date          19999 non-null datetime64[ns]
round             19999 non-null object
category          19999 non-null object
value             19999 non-null object
question          19999 non-null object
answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


### Defining the strategy

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

#### Answer in question

In [135]:
def find_answer_in_question(row):
    """Calculates the proportion of the words in the answer present in the question"""
    split_question = set(row["clean_question"].split(" "))
    split_answer = set(row["clean_answer"].split(" "))
    # "the" is too common to count as a meaningful match.
    split_answer.discard("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(find_answer_in_question, axis=1)
    
jeopardy["answer_in_question"].mean() * 100
    

5.980069899265868

Thus we can see that on average about 6% of the words in the answer also appear in the question. This is a small amount and so should not influence our strategy. For interest's sake, let's also look at how many times all the words in the answer were also in the question.

In [136]:
def find_whole_answer_in_question(row):
    """Returns 1 if all words in answer were also in question, otherwise None (0 for an empty answer)"""
    split_question = set(row["clean_question"].split(" "))
    split_answer = set(row["clean_answer"].split(" "))
    # "the" is too common to count as a meaningful match.
    split_answer.discard("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    if match_count >= len(split_answer):
        return 1

number_whole_answers_in_question = jeopardy.apply(find_whole_answer_in_question, axis=1)

sum(number_whole_answers_in_question.dropna())

123.0

These 123 questions represent just 0.615 percent of all questions, so we cannot rely on this strategy.
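To make the two measures concrete, here is a small sketch that applies the same word-overlap logic to two made-up question and answer pairs. The function `answer_overlap` and both example strings are hypothetical and only illustrate the calculation.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring "the") that also appear in the question."""
    question_words = set(clean_question.split(" "))
    answer_words = set(clean_answer.split(" "))
    answer_words.discard("the")
    if not answer_words:
        return 0
    return len(answer_words & question_words) / len(answer_words)

# One of the two remaining answer words ("king") appears in the question.
partial = answer_overlap("this scottish king was slain by macduff", "the king macbeth")
# The single answer word appears in the question, so the whole answer is given away.
full = answer_overlap("this red planet is named for the roman god mars", "mars")

print(partial)  # 0.5
print(full)     # 1.0
```

A value of 1.0 corresponds to the "whole answer in question" case counted above.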

#### Repeated questions

Let's now investigate how often terms from previous questions are reused, to determine whether studying old questions would help us. Note that because our dataset does not contain every Jeopardy question ever asked, we will not be able to answer this question perfectly, but we should be able to get an idea of the answer.

In [137]:
question_overlap = []

terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    for word in split_question:
        if len(word) < 6:
            split_question.remove(word)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        question_overlap.append(match_count / len(split_question))
    else:
        question_overlap.append(0)
        
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.804079368539106

There is about 80% overlap between terms in new questions and terms in old questions. However, this only looks at a small set of questions, and it looks at single terms rather than phrases, so on its own the result is not strong evidence. It does suggest that the recycling of questions is worth investigating further.
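One caveat about the loop above: it removes items from `split_question` while iterating over it, which in Python can skip elements, so some short tokens (possibly including empty strings) may survive the filter. The sketch below uses a made-up word list to show the effect next to a list comprehension that checks every word. The overlap figure reported above might shift if the filter were written this way.

```python
words = ["in", "the", "", "1963", "company", "introduced"]

# Removing while iterating: the loop skips the element after each removal.
buggy = list(words)
for word in buggy:
    if len(word) < 6:
        buggy.remove(word)

# Building a new list checks every element exactly once.
filtered = [word for word in words if len(word) >= 6]

print(buggy)
print(filtered)
```

In this toy case the in-place version still contains words shorter than six characters, while the comprehension keeps only the long ones.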

#### High value questions

Let's say we only want to study terms that pertain to high value questions rather than low value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms from terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
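The steps listed above can be worked through by hand for a single term. The sketch below uses hypothetical totals and observed counts, chosen only to keep the arithmetic easy to follow, and computes the statistic directly as the sum of (observed - expected)² / expected.

```python
# Hypothetical totals, not taken from the dataset.
total_rows = 1000
high_value_rows = 300
low_value_rows = total_rows - high_value_rows

# Hypothetical observed counts for one term.
observed_high, observed_low = 6, 4

# Share of all questions that contain the term.
term_total = observed_high + observed_low
term_prop = term_total / total_rows

# If the term were unrelated to value, it would be spread in proportion
# to the number of high and low value rows.
expected_high = term_prop * high_value_rows
expected_low = term_prop * low_value_rows

chi_squared = ((observed_high - expected_high) ** 2 / expected_high
               + (observed_low - expected_low) ** 2 / expected_low)

print(expected_high, expected_low, round(chi_squared, 3))
```

Here the term shows up in high value questions twice as often as expected (6 versus about 3), which is what drives the statistic up.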

In [143]:
def high_low_value(row):
    if row["clean_value"] > 800:
        return 1
    else:
        return 0
    
jeopardy["high_value"] = jeopardy.apply(high_low_value, axis = 1)
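The row-wise `apply` above can also be written as a vectorized comparison, which is usually faster on larger frames. This is a minimal sketch on a toy DataFrame; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the clean_value column.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# The comparison yields booleans; astype(int) turns them into 1 and 0.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy)
```

Note that a value of exactly 800 lands in the low value group, matching the function above.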

In [145]:
def high_low_count(term):
    high_count = 0
    low_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if term in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []

terms_used_list = list(terms_used)
comparison_terms = terms_used_list[:5]

for term in comparison_terms:
    observed_expected.append(high_low_count(term))
    
observed_expected

[(1417, 3301), (0, 1), (1, 0), (1, 0), (2, 8)]
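Calling `high_low_count` once per term rescans every row each time, which is why running it for all terms would be slow. One possible alternative, sketched here under the assumption that a single pass over the rows is acceptable, is to tally every term at once with `collections.Counter`. The `rows` list is a hypothetical stand-in for the `clean_question` and `high_value` columns.

```python
from collections import Counter

# Hypothetical (clean_question, high_value) pairs standing in for the rows.
rows = [
    ("this scottish king was slain by macduff", 1),
    ("this scottish island is famous for whisky", 0),
    ("macduff slays this scottish king in act five", 0),
]

high_counts = Counter()
low_counts = Counter()
for clean_question, high_value in rows:
    # A set counts each term once per question, like `term in split_question`.
    terms = {word for word in clean_question.split(" ") if len(word) >= 6}
    if high_value == 1:
        high_counts.update(terms)
    else:
        low_counts.update(terms)

print(high_counts["scottish"], low_counts["scottish"])
print(high_counts["macduff"], low_counts["macduff"])
```

The observed pair for any term is then `(high_counts[term], low_counts[term])`, and a `Counter` returns 0 for terms it has never seen.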

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [147]:
from scipy.stats import chisquare

high_value_count = jeopardy["high_value"].sum()
low_value_count = jeopardy.shape[0] - high_value_count

chi_squared = []
for x in observed_expected:
    total = sum(x)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    chi_squared.append(chisquare(x, [expected_high, expected_low]))
    
chi_squared

[Power_divergenceResult(statistic=4.28257258858768, pvalue=0.038505027441259804),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595)]

The only statistically significant value is the first one in the list above, with a p-value of 3.85% (we are assuming a significance level of 5%). Let's have a look at which term this is and determine the direction of the difference.

In [150]:
comparison_terms[0]

''

Thus the only significant term is "", which is effectively not a term at all and should be removed from terms_used.
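One way to keep tokens like the empty string out of the test, along with terms that occur too rarely for the chi-squared approximation to be dependable, is to filter candidates before testing. The sketch below is an assumption-laden illustration: `is_testable` is a hypothetical helper, `high_share` is a made-up proportion of high value rows, and the minimum expected count of 5 is a common rule of thumb rather than something taken from this analysis.

```python
def is_testable(term, high_count, low_count, high_share, min_expected=5):
    """Return True if the term is non-empty and both expected counts reach min_expected."""
    if not term.strip():
        return False
    total = high_count + low_count
    expected_high = total * high_share
    expected_low = total * (1 - high_share)
    return min(expected_high, expected_low) >= min_expected

# high_share=0.3 is hypothetical, not computed from the dataset.
print(is_testable("", 1417, 3301, high_share=0.3))
print(is_testable("rareword", 1, 0, high_share=0.3))
print(is_testable("company", 40, 60, high_share=0.3))
```

Only the last call passes: the first term is empty and the second occurs too rarely.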

### Conclusion

Thus we can see that the only strategy explored that could be of some assistance in winning Jeopardy is studying past questions, but this still needs to be investigated further using a more comprehensive dataset.