# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades and is a major force in popular culture. Let's say a friend of mine wants to compete on Jeopardy, and my job is to find any edge that could help him win. In this project, I will work with a dataset of Jeopardy questions to look for patterns in the questions that could help my friend win. The dataset is available on [Reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [40]:
import pandas as pd
import numpy as np
import random as rd
from scipy.stats import chisquare
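# ENGLISH_STOP_WORDS is exposed publicly from the text module; the private stop_words module was removed in newer scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS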



In [3]:
# Read the dataset
jeopardy=pd.read_json('JEOPARDY_QUESTIONS1.json')

In [6]:
# Explore the data
print(jeopardy.head())
print('\n')
print(jeopardy.columns)
print('\n')
print(jeopardy.shape)
print('\n')
print(jeopardy.info())


                          category    air_date  \
0                          HISTORY  2004-12-31   
1  ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31   
2      EVERYBODY TALKS ABOUT IT...  2004-12-31   
3                 THE COMPANY LINE  2004-12-31   
4              EPITAPHS & TRIBUTES  2004-12-31   

                                            question value       answer  \
0  'For the last 8 years of his life, Galileo was...  $200   Copernicus   
1  'No. 2: 1912 Olympian; football star at Carlis...  $200   Jim Thorpe   
2  'The city of Yuma in this state has a record a...  $200      Arizona   
3  'In 1963, live on "The Art Linkletter Show", t...  $200  McDonald\'s   
4  'Signer of the Dec. of Indep., framer of the C...  $200   John Adams   

       round  show_number  
0  Jeopardy!         4680  
1  Jeopardy!         4680  
2  Jeopardy!         4680  
3  Jeopardy!         4680  
4  Jeopardy!         4680  


Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       

In [12]:
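# Clean the question and answer columns: strip punctuation and lowercase
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
jeopardy['clean_question']=jeopardy['question'].str.replace(r'[^\w\s]','',regex=True).str.lower()
jeopardy['clean_answer']=jeopardy['answer'].str.replace(r'[^\w\s]','',regex=True).str.lower()
jeopardy['clean_answer']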

0                             copernicus
1                             jim thorpe
2                                arizona
3                              mcdonalds
4                             john adams
                       ...              
216925                          turandot
216926                        a titmouse
216927                      clive barker
216928                          geronimo
216929    grigori alexandrovich potemkin
Name: clean_answer, Length: 216930, dtype: object
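
To make the effect of the cleaning pattern concrete, here is a minimal standalone sketch that uses Python's `re` module instead of pandas. The helper name `normalize` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

def normalize(text):
    """Drop every character that is not a word character or whitespace, then lowercase."""
    return re.sub(r"[^\w\s]", "", text).lower()

# Illustrative inputs only
samples = ["McDonald's", "Jim Thorpe", "EPITAPHS & TRIBUTES"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

Note that removing `&` leaves two consecutive spaces behind, which is harmless here because the later steps split on whitespace.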

In [25]:
# Clean the value column: strip '$' and ',' and treat missing values as 0
jeopardy['clean_value']=jeopardy['value'].str.replace(r'\W+','',regex=True).fillna(0).astype(int)
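
As a rough illustration of what the value cleaning does to individual entries, the sketch below applies the same idea to plain Python values. The function name `clean_value` and the sample inputs are assumptions made for this example.

```python
import re

def clean_value(value):
    """Turn a string such as '$2,000' into the integer 2000; a missing value becomes 0."""
    if value is None:
        return 0
    # Remove every run of non-word characters ('$', ',', spaces)
    return int(re.sub(r"\W+", "", value))

print([clean_value(v) for v in ["$200", "$2,000", None]])  # [200, 2000, 0]
```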

In [26]:
# Convert air_date to datetime
jeopardy['air_date']=pd.to_datetime(jeopardy['air_date'])

## How often the answer is deducible from the question

In [28]:
# Fraction of the words in the answer (ignoring 'the') that also appear in the question
def answer_match_question(row):
    split_answer=row['clean_answer'].split()
    split_question=row['clean_question'].split()
    match_count=0
    
    # 'the' is common in both answers and questions, but it carries no information about the answer
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    else:
        for word in split_answer:
            if word in split_question:
                match_count += 1
    return match_count/len(split_answer)
jeopardy['answer_in_question']=jeopardy.apply(answer_match_question,axis=1)        

In [29]:
# On average, only about 6% of an answer's words also appear in its question
jeopardy['answer_in_question'].mean()

# So my friend can't count on the question giving the answer away.
# He'll probably have to study.

0.05637826071470733
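
The score behind that mean can be checked by hand on a pair of strings. The sketch below is a simplified, standalone variant of the row function: it drops every occurrence of "the" rather than only the first, and the name `answer_overlap` and the example strings are made up for illustration.

```python
def answer_overlap(answer, question):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(question.split())
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# Made-up, already-cleaned strings
print(answer_overlap("the eiffel tower", "this tower in paris opened in 1889"))  # 0.5
print(answer_overlap("arizona", "this state is home to the grand canyon"))      # 0.0
```

A score of 0.5 means half of the answer's words were present in the question, so the column mean is an average share of matching words, not a share of questions that contain the full answer.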

## How often new questions are repeats of older questions

In [42]:
# Sort the data in order of ascending air date
jeopardy.sort_values(by='air_date',inplace=True)

# Initiate a list and a set
question_overlap=[]
terms_used=set()

# Loop through each row of the data
for index,row in jeopardy.iterrows():
    split_question=row['clean_question'].split()
    # Remove stopwords 
    split_question=[word for word in split_question if word not in ENGLISH_STOP_WORDS]
    match_count=0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question)>0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap']=question_overlap
          


In [43]:
# On average, about 92% of the non-stopword terms in a question have already appeared in an earlier question
jeopardy['question_overlap'].mean()
# This compares single terms, not phrases, so it says little about whole questions being repeated.
# Still, it's worth looking further into the recycling of questions

0.9171296488547745
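
One way to build intuition for such a high number is to run the same bookkeeping on a handful of strings. The sketch below is a standalone toy version of the loop; the function name `term_overlap`, the tiny stop-word set, and the three questions are assumptions for illustration only.

```python
def term_overlap(questions, stop_words=frozenset()):
    """For each question, in order, return the share of its terms already seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if w not in stop_words]
        if words:
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            overlaps.append(0)
        # Only add the words after scoring, so a question never matches itself
        seen.update(words)
    return overlaps

toy_questions = [
    "this river flows through paris",
    "this river flows through london",
    "paris london",
]
overlaps = term_overlap(toy_questions, stop_words={"this", "through"})
print(overlaps)
```

In this toy run the first question scores 0, the second about 0.67, and the third 1.0. The vocabulary only ever grows, so later questions tend to score higher, which may explain part of the high average on the full dataset.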

## Figure out which terms correspond to high-value questions using a chi-squared test

In [46]:
# Questions worth more than $800 count as high value
# Assigning the whole column at once avoids the SettingWithCopyWarning raised by chained indexing
jeopardy['high_value']=(jeopardy['clean_value'] > 800).astype(int)


In [47]:
# Count the high-value and low-value questions a word occurs in; returns (high_count, low_count)
def word_count(word):
    low_count=0
    high_count=0
    for index,row in jeopardy.iterrows():
        split_question=row['clean_question'].split()
        if word in split_question:
            if row['high_value']==1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count

In [49]:
# Randomly pick 10 elements of terms_used
# random.sample needs a sequence; passing a set raises a TypeError on Python 3.11+
comparison_terms=rd.sample(list(terms_used),10)
observed_expected=[]
# For each element calculate the observed counts
for word in comparison_terms:
    observed_expected.append(word_count(word))

In [50]:
# Count the number of high value and low value questions
high_value_count=jeopardy['high_value'].value_counts()[1]
low_value_count=jeopardy['high_value'].value_counts()[0]
chi_squared=[]

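# Calculate chi-squared: the expected counts assume each word is split between
# high- and low-value questions in the same proportion as the dataset overall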
for obs in observed_expected:
    total=sum(obs)
    total_prop=total/jeopardy.shape[0]
    high_count_exp=total_prop*high_value_count
    low_count_exp=total_prop*low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_count_exp, low_count_exp])
    chi_squared.append(chisquare(observed, expected))

In [63]:
# None of the terms had a significant difference in usage between high value and low value rows (every p-value is above 0.1)
chi_squared


[Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.9267728889671603, pvalue=0.3357028942299553),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=0.07446818777814278, pvalue=0.7849388502668134)]
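
For readers who want to see where these numbers come from, the sketch below recomputes the statistic and p-value using only the standard library. It assumes two categories (one degree of freedom), for which the p-value can be written with the complementary error function. The function names and the observed/expected counts in the first example are hypothetical.

```python
import math

def chi_squared_stat(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_1dof(stat):
    """Upper-tail probability of a chi-squared distribution with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: a word seen in 30 high-value and 70 low-value questions,
# where 25 and 75 would be expected under the overall split
stat = chi_squared_stat([30, 70], [25, 75])
print(stat, p_value_1dof(stat))

# Feeding in the first statistic from the output above should land close to its reported p-value (about 0.53)
print(p_value_1dof(0.3949764642333513))
```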

In [62]:
# Every observed count is below 5, so the chi-squared approximation is unreliable here.
# It would be better to rerun the test using only terms that occur more often
observed_expected

[(0, 1),
 (2, 2),
 (0, 1),
 (1, 1),
 (1, 0),
 (0, 1),
 (1, 3),
 (0, 1),
 (0, 2),
 (2, 4)]
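
A possible next step, sketched under assumptions, is to count every term in one pass and keep only terms that occur in enough questions before testing them. The function names, the `min_total` threshold of 5, and the toy rows below are all illustrative; a threshold on expected counts rather than observed totals may be more appropriate for the real data.

```python
from collections import Counter

def count_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs. Returns (high, low) Counters."""
    high, low = Counter(), Counter()
    for question, is_high in rows:
        target = high if is_high else low
        # set() counts each word at most once per question
        target.update(set(question.split()))
    return high, low

def frequent_terms(high, low, min_total=5):
    """Terms that occur in at least min_total questions overall, sorted alphabetically."""
    return sorted(w for w in set(high) | set(low) if high[w] + low[w] >= min_total)

# Toy rows standing in for (clean_question, high_value) pairs from the DataFrame
toy_rows = (
    [("river paris", 1)] * 3
    + [("river london", 0)] * 4
    + [("paris opera", 0)] * 2
)
high, low = count_by_value(toy_rows)
print(frequent_terms(high, low))  # ['paris', 'river']
print(high["river"], low["river"])  # 3 4
```

Counting everything in a single pass also avoids rescanning the whole dataset once per word.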