In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
from scipy.stats import chisquare

data = pd.read_csv('jeopardy.csv')
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
#removing spaces from column names
data.columns = data.columns.str.replace(' ', '')

In [3]:
data.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


We want to normalize the text in the Question and Answer columns, so we will lowercase it and remove all punctuation.

In [4]:
def normalize_string(row):
    for punc in string.punctuation:
        if punc in row:
            row = row.replace(punc, '')
    return(row.lower())

data['clean_question'] = data['Question'].apply(normalize_string)
data['clean_answer'] = data['Answer'].apply(normalize_string)

In [5]:
data[['clean_question', 'clean_answer']].head()

Unnamed: 0,clean_question,clean_answer
0,for the last 8 years of his life galileo was u...,copernicus
1,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,the city of yuma in this state has a record av...,arizona
3,in 1963 live on the art linkletter show this c...,mcdonalds
4,signer of the dec of indep framer of the const...,john adams
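
The loop over `string.punctuation` can also be written as a single pass with `str.translate`. The following is a minimal sketch with the same goal (lowercase, strip ASCII punctuation); `normalize_text` and `PUNCT_TABLE` are hypothetical names that are not part of the notebook above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_text(text):
    """Lowercase the text and drop ASCII punctuation in one pass."""
    return text.translate(PUNCT_TABLE).lower()

sample = 'In 1963, live on "The Art Linkletter Show", th...'
print(normalize_text(sample))
print(normalize_text("McDonald's"))
```

Applied as `data['Question'].apply(normalize_text)`, this should give the same result as `normalize_string` for plain strings.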


We will convert the Value column to a numeric type and AirDate to datetime. Values that cannot be parsed as integers are set to 0.

In [6]:
def normalize_value(row):
    for punc in string.punctuation:
        if punc in row:
            row = row.replace(punc, '')
    try:
        return(int(row))
    except ValueError:
        return(0)

data['clean_value'] = data['Value'].apply(normalize_value)
data['AirDate'] = pd.to_datetime(data['AirDate'])

In [7]:
data[['clean_value', 'AirDate']].head()

Unnamed: 0,clean_value,AirDate
0,200,2004-12-31
1,200,2004-12-31
2,200,2004-12-31
3,200,2004-12-31
4,200,2004-12-31
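
For comparison, the same conversion can be sketched with vectorized pandas calls instead of a row-by-row function. The toy frame below is made up for illustration, and the 'None' entry is an assumption about what a non-numeric Value might look like; it is not taken from the output above.

```python
import pandas as pd

# Toy stand-in for the 'Value' and 'AirDate' columns (made-up rows).
toy = pd.DataFrame({
    'Value': ['$200', '$2,000', 'None'],
    'AirDate': ['2004-12-31', '2005-01-03', '2005-01-04'],
})

# Keep only digits, then convert; strings left empty become NaN and then 0.
digits = toy['Value'].str.replace(r'\D', '', regex=True)
toy['clean_value'] = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)
toy['AirDate'] = pd.to_datetime(toy['AirDate'])

print(toy.dtypes)
print(toy)
```

Note that this variant keeps any digits it finds, so a mixed string such as '12abc' would become 12 here, whereas `normalize_value` would return 0.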


The main ways to prepare for Jeopardy are to study questions from previous episodes, to study general knowledge, or to just wing it! So we want to look at two things here:

1) How often the answer is deducible from the question - we will see how many times words in the answer also occur in the question.

2) How often new questions are repeats of older questions - we will assess how often 'complex' words (6 or more characters) reoccur.

In [8]:
#writing a function to match words in question and answer
def match_answer(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return(match_count / len(split_answer))

data['answer_in_question'] = data.apply(match_answer, axis=1)
print('The mean percentage of words that appear in both the answers and questions is ' + 
      str(round(data['answer_in_question'].mean() * 100, 2)) + '%')

The mean percentage of words that appear in both the answers and questions is 5.89%
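
To make the metric concrete, here is a small sketch of the same calculation on a few made-up question/answer pairs (already lowercased and stripped of punctuation). `answer_overlap` is a hypothetical standalone version of `match_answer` that takes two strings instead of a row.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (after dropping one 'the') found in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    if 'the' in answer_words:
        answer_words.remove('the')  # removes only the first 'the'
    if not answer_words:
        return 0
    hits = sum(word in question_words for word in answer_words)
    return hits / len(answer_words)

# Made-up examples: no overlap, partial overlap, full overlap.
print(answer_overlap('the city of yuma in this state has a record', 'arizona'))
print(answer_overlap('this king george lost the american colonies', 'george iii'))
print(answer_overlap('this second president was the father of john quincy adams', 'john adams'))
```

A mean of 5.89% therefore says that the typical row looks like the first example: none of the answer words appear in the question.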


On average, only 5.89% of the words in an answer also appear in its question - that is not a lot! Trying to deduce answers from questions is probably not the best strategy. Well then, should we bother studying questions from previous episodes? To find out, we'll keep track of all the complex words (6 or more characters) seen so far and check, for each question, what fraction of its complex words have already been used. Averaging those fractions gives an overall measure of how much vocabulary is recycled.

In [9]:
question_overlap = []
terms_used = {} #maps each 'complex' word (6 or more characters) to how many times it has reappeared after its first use
for index, row in data.iterrows():
    split_question = row['clean_question'].split(' ') #splitting question into a list of words
    split_question = [word for word in split_question if len(word) > 5] #only keeping words with 6 or more characters
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1 #if word in question matches with any of the words used in the past, add point
            terms_used[word] += 1
        else:
            terms_used[word] = 0
    if len(split_question) > 0:
        match_count /= len(split_question) #fraction of this question's complex words that were already in the word bank
    question_overlap.append(match_count)

data['question_overlap'] = question_overlap
data['question_overlap'].mean()

0.6919577992203563
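
The bookkeeping in the loop above is easier to follow on a tiny input. The sketch below mirrors the same logic with a set instead of a dict; `overlap_scores` and the three toy questions are made up for illustration.

```python
def overlap_scores(questions, min_len=6):
    """For each question, the fraction of its long words already in the word bank."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        hits = 0
        for word in words:
            if word in seen:
                hits += 1        # word was used before
            else:
                seen.add(word)   # first sighting: add it to the word bank
        scores.append(hits / len(words) if words else 0)
    return scores

toy_questions = [
    'this french painter founded impressionism',
    'this french island lies off africa',
    'he founded the french academy',
]
print(overlap_scores(toy_questions))
```

The first question scores 0 because the word bank starts empty, so the result depends on row order: a question is only compared against the rows that come before it.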

A mean overlap of about 69% may seem high, but note that we are only matching individual words of 6 or more characters, not whole questions or phrases. Nonetheless, let's stick with this analysis and look for the most efficient way to prepare for high-value questions. That is, which words in our word bank show up in high-value questions (worth more than 800 USD) more or less often than we would expect?

In [10]:
def assign_value(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0

data['high_value'] = data.apply(assign_value, axis=1)
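
The row-wise `apply` can also be expressed as a vectorized comparison. A minimal sketch on a made-up column, assuming the same threshold of 800:

```python
import pandas as pd

# Made-up values; 800 itself counts as low value because the test is strictly '>'.
toy = pd.DataFrame({'clean_value': [200, 800, 1000, 0, 2000]})

# Boolean comparison on the whole column, converted to 0/1.
toy['high_value'] = (toy['clean_value'] > 800).astype(int)
print(toy)
```

On the full dataset this would read `data['high_value'] = (data['clean_value'] > 800).astype(int)`.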

We'll take a look at the 50 most commonly found words from our dataset and run chi-square tests. These are the main steps we will be taking:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
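
The steps above can be sketched for a single word using only the standard library. The counts (168, 346) are the first pair in the `observed_expected` output further down. The totals of 5734 high-value and 14265 low-value questions are an assumption, since the notebook never prints them. `chi_squared_for_word` is a hypothetical helper, not part of the notebook.

```python
import math

def chi_squared_for_word(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one word."""
    n_questions = high_total + low_total
    # Share of all questions that contain the word.
    word_share = (high_obs + low_obs) / n_questions
    # Expected counts if the word were spread evenly across both groups.
    expected = (word_share * high_total, word_share * low_total)
    observed = (high_obs, low_obs)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give 1 degree of freedom, where the upper-tail
    # probability reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMPTION: 5734 high-value and 14265 low-value questions in total.
stat, p = chi_squared_for_word(168, 346, high_total=5734, low_total=14265)
print(round(stat, 3), round(p, 4))
```

With these assumed totals the statistic comes out at roughly 4.05, in line with the first result that `scipy.stats.chisquare` reports in the notebook.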

In [19]:
import operator
sorted_terms_used = sorted(terms_used.items(), key=operator.itemgetter(1), reverse=True)
example_terms_used = []
for pair in sorted_terms_used[:50]:
    example_terms_used.append(pair[0])
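
As an aside, the sort-and-slice above can also be written with `collections.Counter`. A small sketch on made-up counts (`toy_terms` stands in for `terms_used`):

```python
from collections import Counter

# Made-up counts standing in for the notebook's terms_used dict.
toy_terms = {'called': 9, 'island': 5, 'french': 7, 'meaning': 2}

# most_common(n) returns the n (word, count) pairs with the largest counts.
top_words = [word for word, count in Counter(toy_terms).most_common(3)]
print(top_words)
```

On the real data this would be `Counter(terms_used).most_common(50)`.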

In [20]:
#counts how many high-value and low-value questions contain the given word
def value_count(word):
    low_count = 0
    high_count = 0
    for index, row in data.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return(high_count, low_count)

observed_expected = []

for word in example_terms_used:
    observed_expected.append(value_count(word))

observed_expected

[(168, 346),
 (141, 332),
 (77, 212),
 (79, 203),
 (71, 191),
 (68, 181),
 (61, 186),
 (77, 173),
 (78, 168),
 (97, 146),
 (108, 133),
 (73, 134),
 (57, 124),
 (55, 124),
 (42, 134),
 (55, 108),
 (50, 119),
 (54, 110),
 (52, 108),
 (58, 102),
 (45, 115),
 (54, 102),
 (46, 107),
 (42, 110),
 (43, 108),
 (34, 112),
 (51, 92),
 (36, 103),
 (52, 89),
 (40, 100),
 (62, 79),
 (65, 73),
 (43, 95),
 (41, 79),
 (30, 85),
 (36, 79),
 (22, 97),
 (26, 84),
 (26, 84),
 (30, 74),
 (31, 72),
 (45, 54),
 (44, 54),
 (22, 76),
 (23, 74),
 (35, 62),
 (25, 68),
 (34, 61),
 (25, 69),
 (28, 67)]
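
Calling `value_count` once per word walks the whole dataset 50 times. If that becomes slow, the same counts can be gathered in a single pass. The sketch below shows one way to do it; `count_by_value` and the toy rows are hypothetical.

```python
from collections import Counter

def count_by_value(rows, target_words):
    """rows: iterable of (clean_question, high_value) pairs.
    Returns {word: (high_count, low_count)} from one pass over the rows."""
    targets = set(target_words)
    high, low = Counter(), Counter()
    for question, is_high in rows:
        # Each target word is counted at most once per question.
        present = targets.intersection(question.split(' '))
        (high if is_high else low).update(present)
    return {word: (high[word], low[word]) for word in target_words}

toy_rows = [
    ('this french island', 1),
    ('a french painter', 0),
    ('an island nation', 0),
    ('the island of java', 0),
]
print(count_by_value(toy_rows, ['french', 'island']))
```

On the real data the rows could be supplied as `zip(data['clean_question'], data['high_value'])` with `example_terms_used` as the target words.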

In [21]:
high_value_count = data[data['high_value'] == 1].shape[0]
low_value_count = data[data['high_value'] == 0].shape[0]

chi_squared = []
for values in observed_expected:
    total = sum(values)
    total_prop = total / data.shape[0] #the percentage of questions the word occurs in
    high_prop = total_prop * high_value_count #finding expected counts based on percentage within sample
    low_prop = total_prop * low_value_count
    
    observed = np.array([values[0], values[1]])
    expected = np.array([high_prop, low_prop])
    chisq = chisquare(observed, expected)
    chi_squared.append(chisq)

chi_squared

[Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866),
 Power_divergenceResult(statistic=0.29967829483482744, pvalue=0.5840841713114313),
 Power_divergenceResult(statistic=0.5810990283039111, pvalue=0.4458818590919339),
 Power_divergenceResult(statistic=0.05956570730840162, pvalue=0.8071836789959332),
 Power_divergenceResult(statistic=0.3166666119780599, pvalue=0.5736178092344641),
 Power_divergenceResult(statistic=0.22592591114717697, pvalue=0.6345612982626103),
 Power_divergenceResult(statistic=1.9084254764809114, pvalue=0.16713826420470967),
 Power_divergenceResult(statistic=0.55386193833867, pvalue=0.45674398774097136),
 Power_divergenceResult(statistic=1.108644756518943, pvalue=0.2923767382010634),
 Power_divergenceResult(statistic=15.028296538003147, pvalue=0.00010591119029347305),
 Power_divergenceResult(statistic=30.70509560211122, pvalue=3.003751982853918e-08),
 Power_divergenceResult(statistic=4.401396413478652, pvalue=0.03590951591318824),
 ...

In [22]:
#printing out words with significant chi-square scores
for i, stats in enumerate(chi_squared):
    if stats[1] < 0.05: #significance threshold of 0.05 on the p-value
        print(example_terms_used[i])

called
targetblankherea
french
island
meaning
founded
reports
targetblankthisa
popular
italian
german


Here are the main findings that might help contestants who are short on time decide what to study when targeting high-value questions:
- 'french', 'italian' and 'german' were all significant. Focus on European countries!
- Study a bit on islands as well.
- Take a look at some major companies that people 'founded' throughout history.

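The other words are either too vague or too common to act on, and 'targetblankherea' and 'targetblankthisa' appear to be leftover HTML link markup rather than real words. If we want a more robust analysis, we may want to increase our sample size in the future.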
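
A significant chi-square result only says that a word's split between high- and low-value questions differs from the expected split; it does not say in which direction. One simple check is to compare the word's high-value share with the overall high-value share. The sketch below does this for two pairs taken from the `observed_expected` output. The totals (5734 high, 14265 low) are an assumption, and `high_value_lift` is a hypothetical helper.

```python
def high_value_lift(high_obs, low_obs, high_total, low_total):
    """Word's high-value share divided by the overall high-value share.
    Above 1: the word leans toward high-value questions; below 1: toward low-value."""
    word_share = high_obs / (high_obs + low_obs)
    overall_share = high_total / (high_total + low_total)
    return word_share / overall_share

# ASSUMPTION: 5734 high-value and 14265 low-value questions in total.
for pair in [(108, 133), (22, 97)]:
    lift = high_value_lift(*pair, high_total=5734, low_total=14265)
    print(pair, round(lift, 2))
```

Only words with a lift above 1 are worth prioritizing when the goal is high-value questions.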