# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

Imagine that we want to compete on Jeopardy, and we're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

## Jeopardy Questions

The dataset is named `jeopardy.csv` and contains `216930` rows of Jeopardy questions, which we can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Here's the beginning of the file:

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.shape)
jeopardy.head()

(216930, 7)


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
### Removing spaces from column names
jeopardy.columns = [name.strip() for name in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1   Air Date     216930 non-null  object
 2   Round        216930 non-null  object
 3   Category     216930 non-null  object
 4   Value        216930 non-null  object
 5   Question     216930 non-null  object
 6   Answer       216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


Some columns could use more appropriate types: `Air Date` should be a datetime and `Value` should be an integer. The `Answer` column also has two missing values.

## Normalizing Text

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea of normalization is to ensure that we put words in lowercase and remove punctuation so `Don't` and `don't` aren't considered to be different words when we compare them.

In [8]:
import re

### Convert string to lowercase and remove all punctuation
def normalize_text(text):
    text = str(text).lower()
    # Raw strings keep \s from being read as an invalid string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

In [9]:
### Normalize Question and Answer columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
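To see what this normalization does in practice, here is a small standalone sketch. The sample strings are hypothetical and chosen only for illustration; the function mirrors `normalize_text` but is redefined here so the snippet runs on its own.

```python
import re


def normalize_text(text):
    """Lowercase the text, drop punctuation, and collapse runs of whitespace."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)


# Hypothetical sample strings, not rows from the dataset
samples = ["Don't", "don't", "Signer of the Dec. of   Indep."]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)  # ['dont', 'dont', 'signer of the dec of indep']
```

Note that `str(text)` turns a missing value into the string `nan`, so the two rows with a missing `Answer` end up with `clean_answer` equal to `nan` rather than an empty string.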

## Normalizing Columns

Now that we've normalized the text columns, there are also some other columns to normalize.

The `Value` column should be numeric, so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime instead of a string, so that we can work with it more easily.

In [16]:
### Remove punctuation and convert to int
def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [17]:
### Normalize values
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [19]:
### Format datetime
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
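As a quick check of the value cleaning, the following sketch runs the same logic on a few hypothetical inputs written in the style of the `Value` column. It is an illustration only and catches `ValueError` specifically, which is an assumption about the failure we expect from `int()`.

```python
import re


def normalize_values(text):
    """Strip punctuation and convert to int; fall back to 0 if that fails."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0


# Hypothetical inputs, not taken from the dataset
raw_values = ["$200", "$2,000", "None"]
clean_values = [normalize_values(v) for v in raw_values]
print(clean_values)  # [200, 2000, 0]
```

Any value that cannot be parsed becomes `0`, so such rows will later be grouped with the low-value questions.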

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question
- How often questions are repeated

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (6 or more characters) recur. We'll work on the first question and come back to the second.

In [21]:
### Count relative matches of answers in questions
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [22]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].mean()

0.05792070323661354

On average, only about `6%` of the words in an answer also appear in its question. This is a small share, so we probably can't count on the question itself to reveal the answer. We'll probably have to study.
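To make the per-row score concrete, here is a minimal sketch of the same calculation on plain strings. The function name `answer_word_share` and the example strings are hypothetical; the logic follows `count_matches`.

```python
def answer_word_share(clean_answer, clean_question):
    """Share of answer words (ignoring one 'the') that also occur in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)


# Hypothetical example: one of the two remaining answer words is in the question
share = answer_word_share("the arizona desert", "this arizona city is very hot")
print(share)  # 0.5
```

An answer consisting only of `the` leaves no words to compare, so the score is `0` instead of a division by zero.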

## Recycled Questions

We want to investigate how often new questions are repeats of older ones.

To do this, we can:

- Sort `jeopardy` in order of ascending air date
- Maintain a set called `terms_used` that will be empty initially
- Iterate through each row of `jeopardy`
- Split `clean_question` into words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`

This allows us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell you a lot about a question.

In [24]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8722104187855287

On average, about `87%` of the longer terms in a question have already appeared in an earlier question. However, this only compares single terms, not phrases, so it is weak evidence on its own. Still, it suggests that the recycling of questions is worth investigating further.
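The overlap score can be traced by hand on a tiny example. The sketch below applies the same idea to a list of already-normalized strings in air-date order; the function name `question_overlap_scores` and the toy questions are hypothetical.

```python
def question_overlap_scores(questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    terms_seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        seen_before = sum(1 for w in words if w in terms_seen)
        scores.append(seen_before / len(words) if words else 0)
        # Add the words only after scoring, so a question never matches itself
        terms_seen.update(words)
    return scores


# Hypothetical questions, assumed to be sorted by air date
toy_questions = [
    "galileo studied astronomy",
    "astronomy fascinated galileo greatly",
    "the sun is hot",
]
scores = question_overlap_scores(toy_questions)
print(scores)  # [0.0, 0.5, 0]
```

Because the set of seen terms only grows, later questions have more chances to match, which is one reason a high average on a large dataset should be read with care.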

## Low Value vs High Value Questions

Let's say we only want to study terms that are associated with high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value - Any row where `clean_value` is `800` or less
- High value - Any row where `clean_value` is greater than `800`

We'll then be able to loop through each of the terms collected in the previous section, `terms_used`, and:

- Find the number of low value questions the word occurs in
- Find the number of high value questions the word occurs in
- Find the percentage of questions the word occurs in
- Based on the percentage of questions the word occurs in, find expected counts
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. 
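In symbols, the steps above might be written as follows. The symbol names are chosen here for illustration: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for a term, $n_{\text{high}}$ and $n_{\text{low}}$ are the numbers of high-value and low-value rows, and $N = n_{\text{high}} + n_{\text{low}}$.

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, n_{\text{high}},
\qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, n_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$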

In [27]:
### Classify values
def classify_value(row):
    if row["clean_value"] > 800:
        return 1
    return 0

In [28]:
jeopardy["high_value"] = jeopardy.apply(classify_value, axis=1)

In [29]:
### Counting terms in low and high values
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split():
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [30]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (10, 10),
 (0, 1),
 (0, 1),
 (9, 25),
 (1, 1),
 (1, 0),
 (1, 0)]
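The per-term scan is slow because it walks every row once for each term. One possible alternative, sketched here under the assumption that the rows are available as `(clean_question, high_value)` pairs, is to count all terms in a single pass. The function name `count_usage_single_pass` and the toy rows are hypothetical.

```python
from collections import Counter


def count_usage_single_pass(rows):
    """Count, per term, how many high- and low-value questions contain it."""
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # set() counts a term once per question, matching count_usage
        terms = set(clean_question.split())
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts


# Hypothetical rows: (normalized question text, high_value flag)
toy_rows = [
    ("bologna is a city in italy", 1),
    ("bologna sausage is popular", 0),
    ("italy borders france", 0),
]
high_counts, low_counts = count_usage_single_pass(toy_rows)
print(high_counts["bologna"], low_counts["bologna"])  # 1 1
print(high_counts["france"], low_counts["france"])    # 0 1
```

With the DataFrame from this project, the pairs could be produced with `zip(jeopardy["clean_question"], jeopardy["high_value"])`.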

## Applying the Chi-Squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value. We'll use a significance level of 5%, so a result counts as significant when its p-value is below `0.05`.

In [35]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=4.6338644448358, pvalue=0.03134688230199346),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.05693531027303224, pvalue=0.8114070706676093),
 Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]

In [32]:
comparison_terms[3]

'bologna'

In [33]:
high_value_count

61422

In [34]:
low_value_count

155508

One term, `bologna`, shows a significant difference (p-value below `0.05`) in usage between high-value and low-value rows: it appears in high-value questions more often than expected. The other sampled terms show no significant difference.
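To see where the statistic for `bologna` comes from, the calculation can be reproduced by hand using only the counts printed above. This is a sketch that relies on the standard library alone; the variable names are chosen for illustration, and the p-value uses the fact that with one degree of freedom the chi-squared survival function reduces to `erfc(sqrt(x / 2))`.

```python
import math

# Counts taken from the outputs above
high_value_count = 61422
low_value_count = 155508
observed_high, observed_low = 10, 10  # the (10, 10) tuple for 'bologna'

total_rows = high_value_count + low_value_count
term_share = (observed_high + observed_low) / total_rows

# Expected counts if the term were spread evenly across both groups
expected_high = term_share * high_value_count
expected_low = term_share * low_value_count

statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give one degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(expected_high, 2), round(expected_low, 2))  # 5.66 14.34
print(round(statistic, 4), round(p_value, 4))           # 4.6339 0.0313
```

The expected high-value count of about `5.66` is only just above the common rule-of-thumb minimum of 5 for the chi-squared approximation, and the terms that occur only once or twice fall far below it, so their p-values should be read with caution.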

## Conclusion 

We've explored the Jeopardy questions data set and found out that:

- On average, only about `6%` of the words in an answer also appear in its question, which is a low share.
- On average, about `87%` of the longer terms in a question have appeared in earlier questions. However, this compares single terms, not phrases.
- Some terms occur more often in one value category than in the other. For example, the term `'bologna'` is more characteristic of the `High value` category. We can prepare better by paying more attention to such specific words and themes.