In [1]:
%cd C:\\Users\\debie\\Documents\\anaconda_space

C:\Users\debie\Documents\anaconda_space


# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win this game.

In [2]:
import pandas as pd

In [3]:
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
col_name = {'Show Number' : 'show_number', ' Air Date' : 'air_date', ' Round' : 'round', ' Category' : 'category', ' Value' : 'value', ' Question' : 'question', ' Answer' : 'answer'}
jeopardy.rename(mapper = col_name, axis = 1, inplace = True)
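
The mapping above is written out by hand. As a hedged alternative, the same names can be derived by stripping whitespace, lowercasing, and replacing spaces with underscores. The sketch below works on a plain list copied from the `jeopardy.columns` output; `raw_columns`, `tidy`, and `clean_columns` are names made up for this illustration.

```python
# Column names as printed by jeopardy.columns (note the leading spaces).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

def tidy(name):
    # Hypothetical helper: trim whitespace, lowercase, use underscores.
    return name.strip().lower().replace(' ', '_')

clean_columns = [tidy(name) for name in raw_columns]
print(clean_columns)
```

With the DataFrame loaded, assigning the list back (`jeopardy.columns = clean_columns`) should give the same column names as the `rename` call.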


Before working with the data, we need to normalize the 'question' and 'answer' columns (lowercase, punctuation removed) and clean up the 'value' column, which stores amounts as strings such as '$200'.

In [13]:
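import re

def normal(string):
    # Lowercase, drop punctuation, and collapse runs of whitespace.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string.lower())
    string = re.sub(r'\s+', ' ', string)
    return string

def normalize_values(string):
    # Strip punctuation such as '$' and ',' and convert to an integer.
    # Non-numeric values (for example 'None') become 0.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        value = int(string)
    except ValueError:
        value = 0
    return value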
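
To make the effect of normalization concrete, here is a minimal standalone sketch that runs on a few sample strings. The helper names `normalize_text` and `normalize_value` are made up for this illustration, and `normalize_value` uses a slightly different approach from the cell above: it keeps only digits before converting, rather than relying on `int` to fail.

```python
import re

def normalize_text(text):
    # Lowercase, remove anything that is not a letter, digit, or whitespace,
    # then collapse repeated whitespace into single spaces.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text)

def normalize_value(raw):
    # Keep only the digits; an empty result (e.g. for 'None') maps to 0.
    digits = re.sub(r"[^0-9]", "", str(raw))
    return int(digits) if digits else 0

sample_values = ["$200", "$1,000", "None"]
cleaned_values = [normalize_value(v) for v in sample_values]

print(normalize_text("McDonald's"))
print(cleaned_values)
```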

In [17]:
jeopardy['clean_question'] = jeopardy['question'].apply(normal)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normal)
jeopardy['clean_value'] = jeopardy['value'].apply(normalize_values)
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [18]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

    How often the answer can be deduced from the question.
    How often questions are repeated.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.


In [30]:
def q_a(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
    
jeopardy['answer_in_question'] = jeopardy.apply(q_a, axis=1)
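
As a rough illustration of what this score measures, the sketch below applies the same idea to three made-up question/answer pairs. The function name `answer_in_question` and the sample strings are assumptions for this example. One deliberate difference from `q_a`: this version drops every occurrence of 'the' from the answer, whereas `list.remove` only drops the first.

```python
def answer_in_question(clean_question, clean_answer):
    # Ignore 'the' in the answer, since it says nothing about the content.
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Fraction of answer words that also appear in the question.
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized (question, answer) pairs.
examples = [
    ('john adams signed this document', 'the john adams'),
    ('thorpe was a 1912 olympian', 'jim thorpe'),
    ('the city of yuma is in this state', 'arizona'),
]
scores = [answer_in_question(q, a) for q, a in examples]
print(scores)
```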

In [33]:
jeopardy['answer_in_question'].mean()

0.059001965249777744

On average, only about 6% of the words in an answer also appear in its question. That is too low to rely on the question text to give away the answer, so we will probably have to study.

We now want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

In [37]:
question_overlap = []
terms_used = set()
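# Note: sort_values returns a sorted copy and does not modify jeopardy in place.
# As written, the result is discarded and the loop below runs in file order;
# use jeopardy = jeopardy.sort_values('air_date') to process chronologically.
jeopardy.sort_values('air_date')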

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.6908737315671878

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so the figure is not very meaningful on its own. Still, it suggests that the recycling of questions is worth a closer look.
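
To see how the overlap score behaves, the loop above can be traced on three made-up questions. This is only a sketch: `overlap_scores`, `min_length`, and `toy_questions` are names introduced here, and the questions are invented. Each score is the share of a question's long words (6 or more characters) that already appeared in an earlier question.

```python
def overlap_scores(questions, min_length=6):
    seen = set()      # long words encountered in earlier questions
    scores = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        # Questions without any long word get a score of 0.
        scores.append(matches / len(words) if words else 0)
    return scores

toy_questions = [
    'galileo studied planets',       # nothing seen before
    'planets orbit around the sun',  # 'planets' seen, 'around' new
    'galileo looked around',         # 'galileo' and 'around' seen
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question always scores 0 because nothing has been seen yet, which is one reason the order of the rows matters for this measure.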

Let's say you only want to study terms that pertain to high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

    Low value -- Any row where Value is 800 or less.
    High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms collected earlier in terms_used, and:

    Find the number of low value questions the word occurs in.
    Find the number of high value questions the word occurs in.
    Find the percentage of questions the word occurs in.
    Based on the percentage of questions the word occurs in, find expected counts.
    Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
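
The steps above can be written compactly as follows. The notation is introduced here for illustration and does not come from the notebook: $N$ is the total number of questions, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high and low value questions, and $O_{\text{high}}$ and $O_{\text{low}}$ are the observed numbers of high and low value questions containing a given term.

```latex
p = \frac{O_{\text{high}} + O_{\text{low}}}{N}, \qquad
E_{\text{high}} = p \, N_{\text{high}}, \qquad
E_{\text{low}} = p \, N_{\text{low}}

\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
```

A large $\chi^2$ means the term is split between high and low value questions very differently from what the overall proportions would predict.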

In [40]:
def val(row):
    if int(row['clean_value']) > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(val, axis=1)

In [46]:
def val_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
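
Calling `val_count` scans the whole DataFrame once per word, which is why running it for every term would be slow. One possible alternative, sketched here on plain Python data, is to count all terms in a single pass with `collections.Counter`. The function `count_terms` and the rows in `toy_rows` are made up for this illustration.

```python
from collections import Counter

def count_terms(rows):
    # rows: iterable of (clean_question, high_value) pairs.
    high_counts = Counter()
    low_counts = Counter()
    for question, high_value in rows:
        # A set counts each word once per question, which matches the
        # `word in split_question` check in val_count.
        words = set(question.split(' '))
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Invented rows: (normalized question, 1 for high value / 0 for low value).
toy_rows = [
    ('this planet has rings', 1),
    ('this planet is closest to the sun', 0),
    ('rings of this planet were seen by galileo', 0),
]
high_counts, low_counts = count_terms(toy_rows)
print(high_counts['planet'], low_counts['planet'])
```

On the real data, the pairs could come from something like `zip(jeopardy['clean_question'], jeopardy['high_value'])`, after which any term's counts are a dictionary lookup away.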

In [47]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(val_count(term))

observed_expected

[(1, 0),
 (0, 2),
 (0, 1),
 (2, 2),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1)]

In [48]:
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [51]:
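from scipy.stats import chisquare
import numpy as np

# Totals taken from the value_counts() output above.
high_value_count = 5734
low_value_count = 14265
chi_squared = []

# Each entry is a (high_count, low_count) pair of observed counts.
# Avoid naming the loop variable `list`, which would shadow the built-in.
for high, low in observed_expected:
    total = high + low
    # Share of all questions that contain the term.
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were spread evenly across both groups.
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([high, low])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))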

chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
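
As a sanity check, the first statistic can be reproduced by hand from numbers already shown: observed counts of (1, 0) and group totals of 5734 and 14265. The sketch below uses plain Python, with variable names chosen for this illustration. It also prints the expected counts, which are below 1 here. A common rule of thumb, also noted in the SciPy documentation for `chisquare`, is that observed and expected frequencies should be at least 5, so the p-values for such rare terms should be read with caution.

```python
# Group totals from the value_counts() output.
high_total, low_total = 5734, 14265
n_questions = high_total + low_total

# Observed counts for the first sampled term: (high, low) = (1, 0).
observed_high, observed_low = 1, 0

# Expected counts if the term were spread in proportion to group size.
share = (observed_high + observed_low) / n_questions
expected_high = share * high_total
expected_low = share * low_total

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

print(expected_high, expected_low)
print(statistic)
```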