# Winning Jeopardy

A [data set](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/) containing nearly 200,000 questions from the famous trivia show Jeopardy has been compiled. The file analyzed here contains 19,999 of those questions. A future participant might gain a winning advantage by examining previously asked questions and looking for patterns. Let's see whether any significant information can be extracted.

### Exploring the Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(3)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |


In [5]:
#clean column labels
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalize Data and Datatypes

Most of this data set is stored as strings (the pandas `object` dtype). It is wise to clean and normalize these strings before moving forward. In particular, the strings will be converted to lowercase and stripped of any punctuation.

The Value and Air Date columns should be normalized as well: Value will be converted to an integer and Air Date to a datetime.

In [21]:
import re

def normalize_pd_object(s):
    #make lowercase, remove punctuation, and collapse repeated whitespace
    s = s.lower()
    s = re.sub(r'[^A-Za-z0-9\s]', '', s)
    s = re.sub(r'\s+', ' ', s)
    return s


def normalize_pd_value(s):
    #lowercase and remove punctuation such as '$' and ','
    s = s.lower()
    s = re.sub(r'[^A-Za-z0-9\s]', '', s)
    
    #convert to integer; values that are not numeric become 0
    try:
        s = int(s)
    except ValueError:
        s = 0
    
    return s



In [22]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_pd_object)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_pd_object)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_pd_value)

In [28]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [29]:
jeopardy[ ['clean_question', 'clean_answer', 'clean_value', 'AirDate'] ].head(5)

| | clean_question | clean_answer | clean_value | AirDate |
|---|---|---|---|---|
| 0 | for the last 8 years of his life galileo was u... | copernicus | 200 | 2004-12-31 |
| 1 | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 2004-12-31 |
| 2 | the city of yuma in this state has a record av... | arizona | 200 | 2004-12-31 |
| 3 | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 2004-12-31 |
| 4 | signer of the dec of indep framer of the const... | john adams | 200 | 2004-12-31 |


In [31]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
ShowNumber        19999 non-null int64
AirDate           19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


The data is now normalized for further analysis.
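
The same cleaning could also be expressed with pandas' vectorized string methods instead of `apply`. The sketch below is an alternative, not the code used above: the toy `questions` and `values` Series are made up for illustration, and edge cases that `pd.to_numeric` parses differently from `int` may not match `normalize_pd_value` exactly.

```python
import pandas as pd

# Toy inputs, made up for illustration (not rows from the data set)
questions = pd.Series(["For the last 8 years, Galileo was...", "No. 2:  1912 Olympian"])
values = pd.Series(["$200", "$2,000", "None"])

# Lowercase, drop punctuation, then collapse repeated whitespace
clean_questions = (
    questions.str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)

# Strip punctuation, then coerce anything that is not numeric to 0
clean_values = (
    pd.to_numeric(
        values.str.lower().str.replace(r"[^a-z0-9\s]", "", regex=True),
        errors="coerce",
    )
    .fillna(0)
    .astype(int)
)

print(clean_questions.tolist())
print(clean_values.tolist())
```

On a frame of this size the speed difference is unlikely to matter much; the main appeal is that the whole transformation reads as one chain.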

## Answering a Few Questions

To find the most productive way to study for Jeopardy, it is worth asking a few questions about the available data.

- How often is the answer deducible from the question?
- How often are new questions repeats of older questions?

Let's look at the first question in detail. Another way to phrase it: how often do the words in a question contain the answer to that question? This can be found by comparing the words in the question with the words in the answer and tallying the matches.
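
To make the tally concrete, here is a minimal sketch on toy strings. The helper name `answer_match_ratio`, its `stopwords` parameter, and the example strings are assumptions made up for illustration; they are not part of the data set or the notebook.

```python
def answer_match_ratio(question, answer, stopwords=("the",)):
    """Fraction of answer words (minus stopwords) that also appear in the question."""
    question_words = set(question.split())
    answer_words = [w for w in answer.split() if w not in stopwords]
    if not answer_words:  # avoid dividing by zero
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Toy, already-normalized strings (made up for illustration)
full_match = answer_match_ratio("this theory of copernicus got galileo arrested", "copernicus")
partial_match = answer_match_ratio("the city of yuma is in this state", "the state of arizona")
print(full_match, partial_match)
```

A ratio of 1 means every answer word appears in the question, and 0 means none do.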

In [44]:

def count_ans_in_question(row):
    #split text into words and compare answer with question
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
    #remove common words
    split_answer = [w for w in split_answer if w != 'the']
        
    #avoid div by 0 error
    if len(split_answer) == 0:
        return 0
    
    #check for each word in question
    for item in split_answer:
        if item in split_question:
            match_count += 1
    
    #ratio of matched words to all words
    return match_count / len(split_answer)



In [36]:
jeopardy['answer_in_question'] = jeopardy.apply(count_ans_in_question, axis=1)
print(jeopardy['answer_in_question'].mean())

0.05834744478926688


On average, only about 6% of an answer's words appear in its question, so the answer can rarely be deduced from the question. Now for the second question: how often are old questions recycled into new questions (at least within this sample)?

In [43]:
#RECYCLED QUESTIONS

questions_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    
    for item in split_question:
        if item in terms_used:
            match_count += 1
        
        terms_used.add(item)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    questions_overlap.append(match_count)



In [42]:
jeopardy['question_overlap'] = questions_overlap
jeopardy['question_overlap'].mean()

0.6895495514032576

This result shows that, on average, about 69% of the significant words (longer than 5 characters) in a question have already appeared in earlier questions. This could suggest that studying old questions is the most efficient method, but that might not be the case. The procedure only looked at single repeated words, not at repeated phrases, so heavy word reuse does not necessarily mean that whole questions are recycled. A more detailed analysis of recycled questions would be worthwhile if time allows.
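
A phrase-level check could look something like the sketch below, which measures overlap of consecutive word pairs instead of single words. The function name `phrase_overlap`, the choice of `n=2`, and the toy questions are assumptions for illustration. Unlike the loop above, it adds a question's phrases to the seen set only after scoring that question.

```python
def phrase_overlap(questions, n=2):
    """For each question, in order, the share of its n-word phrases seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        words = question.split()
        phrases = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if phrases:
            matched = sum(1 for p in phrases if p in seen)
            overlaps.append(matched / len(phrases))
        else:
            overlaps.append(0.0)  # question shorter than n words
        seen.update(phrases)  # only later questions can match these
    return overlaps


# Toy, already-normalized questions in air-date order (made up for illustration)
toy_questions = [
    "this state is home to the grand canyon",
    "the grand canyon is in this state",
    "this river carved the grand canyon",
]
print(phrase_overlap(toy_questions))
```

With the real data one might pass the date-sorted `clean_question` column, though common word pairs would probably need filtering first.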

## Low and High Value Questions

Before diving into a more detailed analysis of recycled questions, it is worth considering an alternative study strategy. Jeopardy's scoring system awards more points for more difficult or niche questions. Here, high-value questions are those worth more than 800 points, and low-value questions are those worth 800 points or less.

It is possible to analyze which words are used most in high-value questions, which could serve as a proxy for which subjects to study. This is done by computing a chi-squared statistic from each word's occurrence counts.

The first step is to separate the questions as high or low value, then calculate a chi-squared statistic.

In [45]:

def high_low(row):
    #determine high(1) or low(0)-value questions
    value = 0   
    if row['clean_value'] > 800:
        value = 1
        
    return value


jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)
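
The same flag could also be computed without a row-wise `apply`, by comparing the whole column at once. This is a sketch on a toy frame with made-up values standing in for `jeopardy`:

```python
import pandas as pd

# Toy frame standing in for jeopardy (values made up for illustration)
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})

# Compare the whole column to the threshold, then cast True/False to 1/0
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy["high_value"].tolist())
```

Note that a value of exactly 800 lands in the low-value group, matching the `> 800` test in `high_low`.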

In [46]:

def count_use(word):
    #count how many high-value and low-value questions contain the word
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split():
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count
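
Because `count_use` scans every row for each word, checking many words gets slow. One possible way around that is to tally all words in a single pass and then look counts up. The helper name `build_word_counts` and the toy data below are assumptions for illustration.

```python
from collections import Counter


def build_word_counts(questions, high_flags):
    """Count, per word, how many high-value and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for question, is_high in zip(questions, high_flags):
        words = set(question.split())  # each question counts once per word
        if is_high == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts


# Toy data (made up); with the real frame one might pass
# jeopardy['clean_question'] and jeopardy['high_value'] instead
toy_questions = ["this canyon is grand", "this river is long", "grand river"]
toy_flags = [1, 0, 0]
high_counts, low_counts = build_word_counts(toy_questions, toy_flags)
print(high_counts["grand"], low_counts["grand"])
print(high_counts["river"], low_counts["river"])
```

A `Counter` returns 0 for words it has never seen, so the lookup needs no special case.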



In [51]:
from random import choice
#find the observed counts for a random sample of 10 words, since the calculation is slow

terms_list = list(terms_used)
comparison_terms = [choice(terms_list) for _ in range(10)]
observed_expected = []

for word in comparison_terms:
    observed_expected.append(count_use(word))
    
observed_expected

[(1, 1),
 (1, 2),
 (2, 1),
 (0, 3),
 (1, 2),
 (2, 0),
 (0, 1),
 (0, 1),
 (1, 1),
 (1, 0)]

Now to compute the chi-squared statistic and p-value for each term.
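
As a rough illustration of what the statistic measures, the sketch below works one word through by hand. The totals (300 high-value and 700 low-value questions) and the helper name `chi_squared_stat` are made up for illustration and are not the counts from this data set. The result should equal the `statistic` that `scipy.stats.chisquare` reports for the same observed and expected arrays.

```python
def chi_squared_stat(high_obs, low_obs, n_high, n_low):
    """Chi-squared statistic for one word's high/low counts against its expected counts."""
    total = high_obs + low_obs  # assumed to be > 0
    share = total / (n_high + n_low)  # share of all questions containing the word
    high_exp = share * n_high
    low_exp = share * n_low
    return (high_obs - high_exp) ** 2 / high_exp + (low_obs - low_exp) ** 2 / low_exp


# Made-up totals: 300 high-value and 700 low-value questions,
# and a word seen in 2 high-value and 0 low-value questions
stat = chi_squared_stat(2, 0, n_high=300, n_low=700)
print(round(stat, 4))
```

The expected counts here are 0.6 and 1.4, so even a word that appears only in high-value questions yields a modest statistic when its total count is this small.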

In [52]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

#### Results

Taken at face value, only one of the ten sampled terms has a p-value below 0.05 (statistic of about 4.98, p of about 0.026). However, every observed frequency was below 5, which makes the chi-squared test unreliable, so this sample shows no trustworthy difference in usage between high-value and low-value questions. The test would be more illuminating if it were run on high-frequency terms.
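
A common rule of thumb asks for an expected count of at least 5 in each group. Under that assumption, the sketch below estimates how often a word must occur overall before the test becomes reasonable. The helper name `min_total_for_expected` and the totals are made up for illustration.

```python
import math


def min_total_for_expected(n_high, n_low, min_expected=5):
    """Smallest total word count whose expected count is >= min_expected in both groups."""
    n = n_high + n_low
    # The smaller group is the binding constraint on the expected count
    return math.ceil(min_expected * n / min(n_high, n_low))


# Made-up totals: 300 high-value and 700 low-value questions
needed = min_total_for_expected(300, 700)
print(needed)
```

With those toy totals a word would need to appear in at least 17 questions, far more than the 1 to 3 occurrences seen in the sampled terms.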

## Potential Next Steps

- Create a better filter for 'significant' words
- Apply chi-squared testing to higher-frequency terms
- Explore and examine the question categories
- Analyze the full data set
- Check question overlap/recycling with phrases instead of words