# Winning Jeopardy

In this project, you'll work with a dataset of Jeopardy questions and look for patterns that could give a contestant an edge.

The dataset is named jeopardy.csv and contains the first 20,000 rows of a full dataset of Jeopardy questions, which you can download <a href="https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/">here</a>.

In [1]:
import pandas as pd 
import re

jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams




Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- <span style="color:navy">Show Number</span> - the Jeopardy episode number
- <span style="color:navy">Air Date</span> - the date the episode aired
- <span style="color:navy">Round</span> - the round of Jeopardy
- <span style="color:navy">Category</span> - the category of the question
- <span style="color:navy">Value</span> - the number of dollars the correct answer is worth
- <span style="color:navy">Question</span> - the text of the question
- <span style="color:navy">Answer</span> - the text of the answer


## Data Cleaning


As we can see, some of the column names have a leading space. Let's strip it, replace the remaining spaces with underscores, and lowercase the names so the columns are easier to work with.

In [3]:
# Strip surrounding spaces, replace inner spaces with underscores, lowercase.
jeopardy.columns = [col.strip().replace(' ', '_').lower() for col in jeopardy.columns]
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.dtypes

show_number     int64
air_date       object
round          object
category       object
value          object
question       object
answer         object
dtype: object

In [25]:
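def normalize_text(str_):
    # Lowercase the text and keep only runs of word characters,
    # which drops punctuation ("McDonald's" -> "mcdonald s").
    return ' '.join(re.findall(r'\w+', str_.lower()))

def normalize_value(str_):
    # Keep digits only ("$2,000" -> "2000"); values without digits become 0.
    digits = re.sub(r'[^0-9]', '', str_)
    return int(digits) if digits else 0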

jeopardy['clean_question'] = jeopardy['question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_text)
jeopardy['clean_value_$'] = jeopardy['value'].apply(normalize_value)
jeopardy['clean_air_date'] = pd.to_datetime(jeopardy['air_date'])


In [26]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value_$,clean_air_date
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,2004-12-31
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,2004-12-31
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,2004-12-31
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonald s,200,2004-12-31
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,2004-12-31
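
To see what the two normalization helpers do to individual values, here is a small standalone sketch. The functions are restated so the snippet runs on its own, and the sample strings other than `McDonald's` and `$200` are made up for illustration rather than taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, then keep only runs of word characters (drops punctuation).
    return ' '.join(re.findall(r'\w+', text.lower()))

def normalize_value(value):
    # Keep digits only; a value with no digits at all is treated as 0.
    digits = re.sub(r'[^0-9]', '', value)
    return int(digits) if digits else 0

text_samples = ["McDonald's", 'In 1963, live on "The Show"']
value_samples = ['$200', '$2,000', 'None']  # '$2,000' and 'None' are hypothetical inputs

clean_texts = [normalize_text(s) for s in text_samples]
clean_values = [normalize_value(s) for s in value_samples]
print(clean_texts)
print(clean_values)
```

Note that the apostrophe in `McDonald's` becomes a space, which is why the cleaned answer in the table above reads `mcdonald s`.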


## Answer in question

Let's find out how often the answer can be deduced from the question itself. We can estimate this by computing, for each row, the fraction of words in the answer that also occur in the question.

In [30]:
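def answer_in_question(row):
    # Fraction of the answer's words that also appear in the question.
    match_count = 0
    answer = row['clean_answer'].split(' ')
    # "the" is too common to be informative; drop its first occurrence.
    if 'the' in answer:
        answer.remove('the')
    question = row['clean_question'].split(' ')
    # Avoid dividing by zero when the answer was only "the".
    if not answer:
        return 0
    for el in answer:
        if el in question:
            match_count += 1
    return match_count / len(answer)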

jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)
mean = jeopardy['answer_in_question'].mean()
mean

0.06294645581984949

On average, only about 6% of the words in an answer also appear in the question. This isn't a huge number, and it means that we probably can't just hope that hearing a question will enable us to figure out the answer.
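
As a quick illustration of how this per-row fraction behaves, here is a minimal sketch that applies the same logic to two made-up question/answer pairs (already lowercased and stripped of punctuation). The function name `answer_word_fraction` and the sample strings are assumptions for the example, not part of the dataset.

```python
def answer_word_fraction(clean_question, clean_answer):
    # Same idea as answer_in_question, but on plain strings instead of a row.
    answer = clean_answer.split(' ')
    if 'the' in answer:
        answer.remove('the')  # drops the first "the" only
    if not answer:
        return 0
    question = clean_question.split(' ')
    matches = sum(1 for word in answer if word in question)
    return matches / len(answer)

# Hypothetical examples.
no_overlap = answer_word_fraction(
    'the city of yuma in this state has a record', 'arizona')
partial_overlap = answer_word_fraction(
    'new york is home to this statue', 'the statue of liberty')
print(no_overlap, partial_overlap)
```

In the second pair only `statue` out of `statue`, `of`, `liberty` appears in the question, so the fraction is one third.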

## Repetitive questions

Let's explore how often questions are repeated by checking how often longer words (6 or more characters) from a question have already appeared in earlier questions.

In [35]:
jeopardy = jeopardy.sort_values("clean_air_date")
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    match_count = 0
    split_question = re.findall(r'\w{6,}',row['clean_question'])
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    if match_count:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.7217938859138121

On average, about 72% of the long terms in a question have already appeared in earlier questions. This measure looks only at single terms, not phrases, and is based on a subset of the full dataset, so it is weak evidence on its own. Still, it suggests that the recycling of questions is worth investigating further.
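
To make the overlap measure concrete, here is a small sketch that runs the same bookkeeping on three made-up, already-normalized questions in chronological order. The function name `term_overlap` and the questions are assumptions for illustration.

```python
import re

def term_overlap(questions):
    # For each question, the share of its 6+ character terms seen before.
    terms_used = set()
    overlaps = []
    for question in questions:
        terms = re.findall(r'\w{6,}', question)
        match_count = 0
        for term in terms:
            if term in terms_used:
                match_count += 1
            else:
                terms_used.add(term)
        overlaps.append(match_count / len(terms) if terms else 0.0)
    return overlaps

# Hypothetical questions.
sample_questions = [
    'this planet is the largest in the solar system',
    'the largest moon of this planet is titan',
    'this composer wrote the planets suite',
]
overlaps = term_overlap(sample_questions)
print(overlaps)
```

The third question scores zero even though it mentions `planets`, because the comparison is on exact terms. This is one reason a single-term overlap is only a rough signal of recycled questions.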

## Low value vs high value questions

Next, let's check whether particular terms appear more often in high-value questions (worth more than $800) than in low-value ones. We count how often each of 10 randomly chosen terms occurs in each group, then compare the observed counts with the expected counts using a chi-squared test.

In [42]:
from random import choice

def determine_value(row):
    value = 0
    if row["clean_value_$"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

def word_estimator(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value']:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]
observed_expected = []

for word in comparison_terms:
    observed_expected.append(word_estimator(word))


In [47]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=4.235420876606389, pvalue=0.03958880694352712),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
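
To help read these results, the sketch below computes the chi-squared statistic by hand for a term that occurs exactly once. The high/low split of 300 and 700 questions is a made-up assumption, not the real counts, and `chi_square_statistic` is a hypothetical helper written for this example rather than the scipy function.

```python
def chi_square_statistic(observed, expected):
    # Pearson statistic: sum over categories of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical split of questions (not the dataset's actual counts).
high_value_count = 300
low_value_count = 700
total_rows = high_value_count + low_value_count

# A term that occurs once in total, so the expected counts are tiny.
total_prop = 1 / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Observed counts are ordered (high, low), as in word_estimator.
stat_low_only = chi_square_statistic((0, 1), expected)   # seen once, low value
stat_high_only = chi_square_statistic((1, 0), expected)  # seen once, high value
print(stat_low_only, stat_high_only)
```

For a term seen once, the two possible statistics are reciprocals of each other. The repeated values 0.40 and 2.49 in the output above are also reciprocals, which is consistent with several of the sampled terms occurring only once. With expected counts this small the chi-squared approximation is unreliable, so these p-values should be read with caution; terms with higher frequencies would give a more meaningful test.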