In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Column names carry stray spaces (e.g. ' Air Date'); remove every space
col = jeopardy.columns.tolist()
col = [c.replace(' ', '') for c in col]
jeopardy.columns = col
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [4]:
import re
def norm_text(text):
    # Lowercase, then strip everything except letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(norm_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_text)
jeopardy.head(10)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ita...,the appian way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel


In [5]:
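def norm_values(text):
    # Strip punctuation such as "$" and "," so that "$200" becomes "200"
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        val = int(text)
    except ValueError:
        # Values that cannot be parsed as an integer fall back to 0
        val = 0
    return val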
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_values)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [27]:
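def compare_question_answer(row):
    # Fraction of the answer's words that also appear in the question
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    # Drop the first "the", which is too common to be a meaningful match
    if "the" in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0

    for ans in split_answer:
        if ans in split_question:
            match_count += 1
    return match_count / len(split_answer)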

jeopardy['answer_in_question'] = jeopardy.apply(compare_question_answer, axis=1)
jeopardy['answer_in_question'].mean()

0.060493257069335872

**Questions vs. Answers**

Based on the analysis above, studying only the questions is likely an ineffective way to prepare for Jeopardy: on average, only about 6% of an answer's words also appear in its question.
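To make the metric concrete, here is a minimal, pandas-free sketch of the same per-row calculation. The function name `answer_overlap` and the two sample question/answer pairs are hypothetical and are not rows from the dataset. Unlike the cell above, this version drops every "the" from the answer rather than only the first one.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring "the") that occur in the question."""
    question_words = set(clean_question.split())
    # Variation on the notebook cell: drop every "the", not just the first
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Hypothetical, already-normalized examples
full_match = answer_overlap("the appian way was built in 312 bc", "the appian way")
no_match = answer_overlap("this state is home to the city of yuma", "arizona")
print(full_match, no_match)
```

A score of 1.0 means every remaining answer word appears in the question, and 0.0 means none do, so a dataset-wide mean near 0.06 indicates that most answers share no words with their questions.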

In [24]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q)>=6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.69087373156719623

**Repeat Questions Check**

In this small sample of Jeopardy questions, about 70% of the longer words (six or more characters) in each question had, on average, already appeared in an earlier question. The check considers only single words, not phrases, so it does not directly show that questions are recycled. Still, the overlap is high enough that a closer look at recycled questions is warranted.
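One possible follow-up, sketched here under assumptions, is to repeat the check on two-word phrases instead of single words. The helper names `ngrams` and `phrase_overlap` and the three sample questions are hypothetical. The logic mirrors the cell above: for each question, count how many of its phrases were already seen in earlier questions.

```python
def ngrams(words, n):
    """Return consecutive n-word tuples from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_overlap(questions, n=2):
    """For each question, the share of its n-grams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        grams = ngrams(question.split(), n)
        if grams:
            overlaps.append(sum(g in seen for g in grams) / len(grams))
        else:
            overlaps.append(0.0)
        # Only add this question's phrases after scoring it
        seen.update(grams)
    return overlaps


# Hypothetical, already-normalized questions
sample = [
    "this roman road linked rome and southern italy",
    "this roman road is named for appius claudius",
    "this insect shares a fable title",
]
overlaps = phrase_overlap(sample)
print(overlaps)
```

Phrase overlap is a stricter signal than word overlap, since two unrelated questions can share common words but are less likely to share the same word sequences.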

In [33]:
def value_check(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(value_check, axis=1)

def word_value(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        clean_question = row['clean_question'].split(" ")
        if word in clean_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
observed_expected = []
comparison_terms = list(terms_used)[:5]
for item in comparison_terms:
    val_counts = word_value(item)
    observed_expected.append(val_counts)
observed_expected

[(0, 2), (5, 6), (1, 5), (0, 1), (1, 0)]

In [31]:
from scipy.stats import chisquare
import numpy as np
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]
chi_squared = []
for items in observed_expected:
    total = sum(items)
    total_prop = total / jeopardy.shape[0]
    high_expected = high_value_count * total_prop
    low_expected = low_value_count * total_prop
    
    observed = np.asarray([items[0], items[1]])
    expected = np.asarray([high_expected, low_expected])
    chi_squared.append(chisquare(observed, expected))
chi_squared
    

[Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=1.5150423082236086, pvalue=0.21837128417807639),
 Power_divergenceResult(statistic=0.42281054506129573, pvalue=0.51553795812945302),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)]

**Chi-squared Results**

In this small comparison sample, none of the five chi-squared tests is significant at the conventional 0.05 level (the smallest p-value is about 0.11), so there is no evidence that any of these words appears disproportionately in high-value or low-value questions. However, every observed count was below 7, and the chi-squared approximation is unreliable at such low frequencies. The test should be rerun on high-frequency terms to get more stable and valid results.
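The low-frequency concern can be illustrated with a small hand calculation. The sketch below recomputes the expected counts and the chi-squared statistic without SciPy, then applies the common rule of thumb that every expected count should be at least 5. The observed pair `(5, 6)` comes from the output above, but `N_HIGH` and `N_LOW` are hypothetical totals chosen for illustration, not the real dataset counts, so the statistic will not match the notebook output.

```python
def expected_counts(observed, n_high, n_low):
    """Expected (high, low) counts if the word is spread evenly across questions."""
    total = sum(observed)
    share = total / (n_high + n_low)
    return (n_high * share, n_low * share)


def chi_squared_stat(observed, expected):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical totals of high-value and low-value questions (assumption)
N_HIGH, N_LOW = 300, 700

observed = (5, 6)  # (high_count, low_count) for one term
expected = expected_counts(observed, N_HIGH, N_LOW)
stat = chi_squared_stat(observed, expected)

# Rule of thumb: the approximation needs every expected count to be >= 5
reliable = min(expected) >= 5
print(expected, stat, reliable)
```

Under these assumed totals the expected high-value count falls below 5, so even the most frequent of the five terms would be too rare for the test. Filtering terms by their minimum expected count before calling `chisquare` is one way to restrict the analysis to high-frequency terms.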