# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades and is a major force in popular culture. We'll work with a dataset of Jeopardy questions to look for patterns in the questions that could help you win.

The dataset is named `jeopardy.csv` and contains 20,000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [1]:
import pandas as pd

In [2]:
jeopardy = pd.read_json('JEOPARDY_QUESTIONS1.json')

In [3]:
jeopardy.head()

| | category | air_date | question | value | answer | round | show_number |
|---|---|---|---|---|---|---|---|
| 0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 |
| 1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 |
| 2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 |
| 3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 |
| 4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | $200 | John Adams | Jeopardy! | 4680 |


In [4]:
clean_names = []
for c in jeopardy.columns:
    name = c.replace('_', '')
    name = name.capitalize()
    clean_names.append(name)
    
jeopardy.columns = clean_names

In [5]:
for c in jeopardy.columns:
    print(c)

Category
Airdate
Question
Value
Answer
Round
Shownumber


Before we can analyze the Jeopardy questions, we need to normalize the text columns (the Question and Answer columns): lowercase everything and strip punctuation, so that the same word always compares equal.

In [6]:
import re

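def normalize(string):
    # Lowercase, drop everything except letters, digits, and whitespace,
    # then collapse runs of whitespace into a single space.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = re.sub(r"\s+", " ", string)

    return string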

In [7]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

In [8]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
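As a quick check of what `normalize` does, the sketch below applies the same function to two sample strings. Both strings are made up for illustration and are not taken from the dataset.

```python
import re

def normalize(string):
    # Same steps as the notebook cell: lowercase, strip punctuation,
    # collapse repeated whitespace.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = re.sub(r"\s+", " ", string)
    return string

# Illustrative inputs (assumptions, not rows from the dataset).
sample_question = "What is  the   Capital of France?"
sample_answer = r"McDonald\'s & Co."

clean_question = normalize(sample_question)
clean_answer = normalize(sample_answer)

print(clean_question)  # what is the capital of france
print(clean_answer)    # mcdonalds co
```

Note that punctuation is removed rather than replaced by a space, so `McDonald\'s` becomes the single token `mcdonalds`.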

Now that we've normalized the text columns, there are two other columns to clean up.

The *Value* column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The *Airdate* column should be a datetime, not a string, so that we can sort and compare dates.

In [9]:
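def integer(number):
    # Strip the dollar sign and any other punctuation, then convert to int.
    # Values that cannot be parsed as an integer become 0.
    number = str(number)
    number = re.sub(r"[^A-Za-z0-9\s]", "", number)
    try:
        number = int(number)
    except ValueError:
        number = 0

    return number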

In [10]:
jeopardy['clean_value'] = jeopardy['Value'].apply(integer)

In [11]:
jeopardy['Airdate'] = pd.to_datetime(jeopardy['Airdate'])
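To make the two conversions concrete, here is a minimal sketch that runs the same `integer` logic and `pd.to_datetime` on a few sample values. The sample values and the names `sample_values` and `sample_dates` are assumptions chosen for illustration.

```python
import re
import pandas as pd

def integer(number):
    # Same logic as the notebook cell: strip punctuation, fall back to 0.
    number = str(number)
    number = re.sub(r"[^A-Za-z0-9\s]", "", number)
    try:
        return int(number)
    except ValueError:
        return 0

# Illustrative values: a plain amount, one with a comma, and a non-numeric entry.
sample_values = ["$200", "$2,000", "None"]
converted = [integer(v) for v in sample_values]
print(converted)  # [200, 2000, 0]

# Date strings become datetime values that can be sorted and compared.
sample_dates = pd.to_datetime(pd.Series(["2004-12-31", "1999-01-05"]))
years = sample_dates.dt.year.tolist()
print(years)  # [2004, 1999]
```

Mapping unparseable values to 0 keeps the column numeric, at the cost of treating unknown values as the lowest possible value.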

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [12]:
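def deducible(row):
    # Fraction of the answer's words that also appear in the question.
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    match_count = 0

    # "the" is common enough to inflate the score, so drop its first occurrence.
    if 'the' in split_answer:
        split_answer.remove('the')

    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0

    for word in split_answer:
        if word in split_question:
            match_count += 1

    return match_count / len(split_answer)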

In [13]:
jeopardy['answer_in_question'] = jeopardy.apply(deducible, axis=1)
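The score is easier to read with a few worked rows. The sketch below uses the same logic as `deducible`, applied to plain dictionaries instead of DataFrame rows. The three question and answer pairs are made up for illustration.

```python
def deducible(row):
    # Fraction of answer words (ignoring one "the") found in the question.
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0

    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical, already-normalized rows.
no_overlap = {'clean_question': 'this state is home to the city of yuma',
              'clean_answer': 'arizona'}
half_overlap = {'clean_question': 'this john was the second us president',
                'clean_answer': 'john adams'}
full_overlap = {'clean_question': 'this company appeared on the art show in 1963',
                'clean_answer': 'the art show'}

scores = [deducible(r) for r in (no_overlap, half_overlap, full_overlap)]
print(scores)  # [0.0, 0.5, 1.0]
```

A score of 0 means no answer word appears in the question, and 1 means every answer word does.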

In [14]:
mean_answer_in_question = jeopardy['answer_in_question'].mean()
mean_answer_in_question

0.05637826071470733

On average, only about 5.6% of the words in an answer also appear in its question. The overlap is small, but it shows that it is still worth listening carefully to the question.

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

In [15]:
question_repeated = []
words_used = set()

In [16]:
jeopardy = jeopardy.sort_values(by=['Airdate'])

In [17]:
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    
    split_question = [word for word in split_question if len(word) > 5]
     
    match_count = 0
    
    for word in split_question:
        if word in words_used:
            match_count += 1
            
    for word in split_question:
        words_used.add(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_repeated.append(match_count)
    
jeopardy['question_repeated'] = question_repeated

mean_question_repeated = jeopardy['question_repeated'].mean()
mean_question_repeated

0.8654625652481788

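On average, about 87% of the long words in a question had already appeared in an earlier question. This measures word overlap rather than whole repeated questions, but it suggests that questions are recycled, so it is worth studying past questions.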
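The repeat score can be traced by hand on a tiny example. The following sketch wraps the same loop in a hypothetical helper, `repeat_ratios`, and runs it on three made-up questions that are assumed to be in air-date order already.

```python
def repeat_ratios(clean_questions):
    # For each question, compute the share of its long words (more than
    # 5 characters) that were already seen in earlier questions.
    words_used = set()
    ratios = []
    for question in clean_questions:
        long_words = [w for w in question.split(' ') if len(w) > 5]
        match_count = sum(1 for w in long_words if w in words_used)
        words_used.update(long_words)
        ratios.append(match_count / len(long_words) if long_words else 0)
    return ratios

# Made-up, already-normalized questions in chronological order.
toy_questions = [
    "this planet is known as the red planet",
    "the largest planet in our solar system",
    "this author wrote about a planet of apes",
]

ratios = repeat_ratios(toy_questions)
print(ratios)
```

The first question scores 0 because nothing has been seen yet. The second scores 1/3 (`planet` out of `largest`, `planet`, `system`), and the third scores 1/2 (`planet` out of `author`, `planet`). Early questions always score low, which is one reason the mean is only a rough indicator.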

Let's say we only want to study terms that appear in high-value questions rather than low-value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [18]:
def low_high(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
        
    return value

In [19]:
jeopardy['high_value'] = jeopardy.apply(low_high, axis=1)

In [20]:
def low_high_count(word):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count
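The function above rescans every row for each word, which is why testing all words would be slow. One possible alternative, shown here only as a sketch and not as the notebook's method, is to count every word in a single pass. The helper name `count_all_words` and the toy rows are assumptions for illustration.

```python
from collections import defaultdict

def count_all_words(rows):
    # rows: iterable of (clean_question, high_value) pairs.
    # Returns {word: (high_count, low_count)}, counting each word at most
    # once per question, like the `word in split_question` check.
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        for word in set(question.split(' ')):
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return {word: tuple(pair) for word, pair in counts.items()}

# Made-up rows: (normalized question, high_value flag).
toy_rows = [
    ("this river flows through egypt", 1),
    ("this river flows through paris", 0),
    ("capital of egypt", 1),
]

word_counts = count_all_words(toy_rows)
print(word_counts["river"])  # (1, 1)
print(word_counts["egypt"])  # (2, 0)
```

With a real DataFrame, the pairs could be produced with something like `zip(jeopardy['clean_question'], jeopardy['high_value'])`.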

In [21]:
len(words_used)

106074

In [22]:
from random import choice

words_used_list = list(words_used)
comparison_words = [choice(words_used_list) for _ in range(10)]

In [23]:
observed_expected = []

for word in comparison_words:
    observed_expected.append(low_high_count(word))

observed_expected

[(1, 0),
 (1, 2),
 (0, 1),
 (1, 0),
 (1, 2),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 2),
 (1, 0)]

In [24]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.03723409388907139, pvalue=0.846989214486915),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.03723409388907139, pvalue=0.846989214486915),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]
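To see where these statistics come from, the sketch below computes the chi-squared statistic by hand as the sum of `(observed - expected)**2 / expected`, using the same way of deriving expected counts as the cell above. The row totals and the observed counts are made-up numbers, and `chi_squared_statistic` is a hypothetical helper, not a SciPy function.

```python
def chi_squared_statistic(observed, expected):
    # Pearson's chi-squared statistic: sum of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up totals: 100 questions, 40 high value and 60 low value.
total_rows = 100
high_rows = 40
low_rows = 60

# Made-up counts for one word: 8 high-value and 2 low-value questions.
observed = (8, 2)

# If the word were unrelated to value, its 10 occurrences would split
# in the same 40/60 proportion as the questions themselves.
total_prop = sum(observed) / total_rows
expected = (total_prop * high_rows, total_prop * low_rows)

statistic = chi_squared_statistic(observed, expected)
print(expected, round(statistic, 4))
```

Here the expected counts are 4 and 6, so the statistic is 16/4 + 16/6, or about 6.67. The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.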

None of the terms showed a significant difference in usage between high-value and low-value rows (every p-value is above 0.05). Additionally, the frequencies were all lower than 5, so the chi-squared test is not reliable here. It would be better to rerun this test using only terms that have higher frequencies.
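One way to follow up on this, sketched under assumptions, is to keep only terms whose total count reaches a minimum before testing them. The threshold name `min_count`, its value of 5, and the toy counts below are all illustrative choices.

```python
def frequent_terms(word_counts, min_count=5):
    # Keep terms whose combined high-value and low-value count
    # is at least min_count.
    return {
        word: counts
        for word, counts in word_counts.items()
        if sum(counts) >= min_count
    }

# Made-up (high_count, low_count) pairs per term.
toy_counts = {
    "river": (12, 30),
    "egypt": (1, 0),
    "treaty": (9, 4),
    "quasar": (2, 2),
}

kept = frequent_terms(toy_counts)
print(sorted(kept))  # ['river', 'treaty']
```

The surviving terms could then be passed through the same expected-count and `chisquare` steps as before. A stricter variant would filter on the expected counts rather than the observed totals.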