# Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) 
.

As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

Let's take a look at the dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re 


In [2]:
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

### Data Cleaning

We can see that a few columns have a space before their names, let's correct that.

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


We can see that the 'Air Date' column is not recognized as a Datetime object, plus we can modify the 'Value' column to have an *int* instead of an object.

In [6]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'],errors='coerce') 

In [7]:
jeopardy['Value'].value_counts()

$400       3892
$800       2980
$200       2784
$600       1890
$1000      1796
$2000      1074
$1200      1069
$1600      1027
$100        804
$500        798
$300        764
None        336
$1,000      184
$2,000      149
$3,000       70
$1,500       50
$1,200       42
$4,000       32
$5,000       23
$1,800       22
$1,400       20
$1,600       19
$2,500       18
$700         15
$2,200       11
$3,600        8
$2,400        8
$6,000        7
$7,000        7
$900          6
           ... 
$4,600        2
$12,000       2
$5,600        2
$4,700        1
$2,300        1
$1,020        1
$2,127        1
$5,200        1
$1,700        1
$5,400        1
$6,100        1
$2,021        1
$10,800       1
$9,000        1
$1,111        1
$4,100        1
$5,800        1
$7,500        1
$6,800        1
$750          1
$3,900        1
$367          1
$4,500        1
$3,300        1
$6,200        1
$8,200        1
$1,492        1
$7,400        1
$3,389        1
$2,900        1
Name: Value, Length: 76,

In [8]:
#We deal with the 'None' value by replacing them by 0
jeopardy['Value'].replace('None', 0 , inplace = True)
jeopardy['Value'].replace(0, '0', inplace=True)
jeopardy.loc[55,'Value']

'0'

In [9]:
jeopardy['Value'] = jeopardy['Value'].str.replace("$","").str.replace(",","").astype(int)

In [10]:
#Let's not forget to rename our columns to remember the currency!
jeopardy.rename(columns = {'Value' : 'Value in $'}, inplace = True)

We also need to normalize our 'Question' and 'Answer' columns to ensure that *Don't* and *don't* aren't considered to be different words when you compare them for example.


In [11]:
def normalization(text):
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    return text

In [12]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalization)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalization)

### How often the answer is deducible from the question

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second!

So : How often the answer is deducible from the question?


In [13]:
def deducible(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)

In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(deducible,axis = 1)

In [15]:
jeopardy['answer_in_question'].mean()

0.06049325706933587

The answer only appears in the question about 6% of the time. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

### How often new questions are repeats of older questions.

Now let's focus on the second question

In [16]:
jeopardy.sort_values(by = 'Air Date',inplace = True, ascending = True)

In [24]:
terms_used = set()
question_overlap = []


for i,row in jeopardy.iterrows():
    clean_question = row["clean_question"].split(" ")
    clean_question = [word for word in clean_question if len(word) > 5]
    counter = 0
    for word in clean_question:
        if word in terms_used:
            counter += 1
    for word in clean_question:
        terms_used.add(word)
    if len(clean_question) > 0:
        match_count = counter/len(clean_question)
    question_overlap.append(match_count) 

In [28]:
jeopardy = jeopardy.assign(question_overlap = question_overlap)
jeopardy["question_overlap"].mean()


0.7019788296638052

### Question overlap

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

In [35]:
def value(row):
    if row["Value in $"] <= 800:
        value = 0
    else:
        value = 1
    return value

jeopardy["high value"] = jeopardy.apply(value,axis = 1)

Let's loop through each of the terms from the last screen, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

In [37]:
def low_high_count(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        clean_question = row['clean_question'].split(" ")
        if word in clean_question:
            if row["high value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count
            

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [46]:
comparison_terms = []
import random

comparison_terms = random.sample(terms_used,10)

Now let's use those 10 words with our function : 

In [48]:
observed_expected = []
for word in comparison_terms:
    observed_expected.append(low_high_count(word))
    
observed_expected    

[(2, 4),
 (1, 1),
 (0, 1),
 (2, 8),
 (0, 1),
 (1, 0),
 (1, 8),
 (1, 1),
 (0, 3),
 (0, 1)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [59]:
high_value_count = len(jeopardy[jeopardy['value'] == 1])
low_value_count = len(jeopardy[jeopardy['value'] == 0])

from scipy.stats import chisquare

chi_squared = []

for val1,val2 in observed_expected:
    total = val1 + val2
    total_prop = total/jeopardy.shape[0]
    
    high_value_expected = total_prop*high_value_count
    low_value_expected = total_prop*low_value_count
    
    observed = np.array([val1, val2])
    expected = np.array([high_value_expected,low_value_expected])
    chi_squared.append(chisquare(observed, expected))    

In [61]:
chi_squared

[Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.3570460299240277, pvalue=0.24405008712856013),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

### Chi-squared results
None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.