In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# remove the leading space from some of the column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
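
Hard-coding the names works, but the list has to be edited by hand if the file's columns ever change. A minimal sketch of an alternative that strips whitespace from whatever names were read; `raw_columns` is an illustrative copy of the names printed above, not a variable from the notebook.

```python
# Column names as read from the CSV, copied from the output above.
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# Strip leading and trailing whitespace from each name.
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```

On the DataFrame itself, the same idea would be `jeopardy.columns = jeopardy.columns.str.strip()`.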

In [4]:
import re

# Define a function that takes in a string, converts it to lowercase,
# removes all punctuation, and returns the normalized string.

def normalize(string):
    string = string.lower()
    string = re.sub(r'[^\w\s]', '', string)
    return string
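
To make the regular expression concrete, here is a small sketch that runs `normalize` on a few strings. The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace, so underscores survive and spacing is left untouched. The third input is a made-up example.

```python
import re

def normalize(string):
    string = string.lower()
    string = re.sub(r'[^\w\s]', '', string)
    return string

# Apostrophes, periods and colons are dropped; letters, digits and spaces stay.
print(normalize("McDonald's"))
print(normalize("No. 2: 1912 Olympian"))
# \w also matches the underscore, so it is not treated as punctuation.
print(normalize("snake_case stays"))
```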

In [5]:
# normalize the Question column
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_question"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [6]:
# Question column looks good. Now, let's normalize Answer column.
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)
jeopardy["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

In [7]:
# A function to normalize dollar values. It takes in a string,
# removes punctuation (note that dollar values are not fractional),
# and converts the string to an integer.
# If the conversion fails, it assigns zero.
# It returns the integer value.
def normalize_value(string):
    string = re.sub(r'[^\w\s]', '', string)
    try:
        num = int(string)
    except Exception:
        num = 0
    return num

jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_value)
jeopardy["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
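
A short sketch of how `normalize_value` behaves on different inputs. The strings `"$2,000"` and `"None"` are hypothetical examples chosen to exercise the comma handling and the fallback branch; they are not taken from the output above.

```python
import re

def normalize_value(string):
    string = re.sub(r'[^\w\s]', '', string)
    try:
        num = int(string)
    except Exception:
        num = 0
    return num

print(normalize_value("$200"))    # dollar sign removed
print(normalize_value("$2,000"))  # dollar sign and comma removed
print(normalize_value("None"))    # not a number, so the fallback value is used
```

Because non-numeric values fall back to 0, any such rows end up looking like very cheap questions in later steps.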

In [8]:
# convert "Air Date" column to a datetime column
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

To prepare for our Jeopardy! game, we'd like to know whether we should study past questions, study general knowledge, or not study at all. To do that, it would be helpful to figure out two things:
    * How often the answer is deducible from the question.
    * How often new questions are repeats of older questions.
    
To help us answer these questions, we'll do a frequency count of words in the Question and Answer columns of our dataset.

In [9]:

def qa_match(row):#takes in a row
    """
    Finds the proportion of word matches between questions and their respective answers
    """
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    else:
        for word in split_answer:
            if word in split_question:
                match_count += 1
        result = match_count/len(split_answer)
        return result
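
To see what `qa_match` returns, the sketch below calls it on three hand-made rows. Plain dictionaries stand in for DataFrame rows here, and the questions and answers are invented for illustration.

```python
def qa_match(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")  # removes only the first "the"
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows: full match, partial match, no match.
full = {"clean_answer": "the john adams",
        "clean_question": "john adams signed the declaration"}
partial = {"clean_answer": "jim thorpe",
           "clean_question": "jim brown played football"}
none = {"clean_answer": "copernicus",
        "clean_question": "galileo was under house arrest"}

scores = [qa_match(full), qa_match(partial), qa_match(none)]
print(scores)
```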

In [10]:
# compute the proportion of words in clean_answer that also occur in clean_question
jeopardy["answer_in_question"] = jeopardy.apply(qa_match, axis=1)


In [11]:
#mean of the answer_in_question column; to get a general sense of how often it happens
aiq_mean = jeopardy["answer_in_question"].mean()
aiq_mean

0.05900196524977763

This calculation shows that, on average, only about 6% of the words in an answer come directly from the question. The average is a little misleading, because it may lead us to conclude straightaway that answers never come from the wording of questions. However, there are a few cases where all the words in the answer appear in the question. We may need to segment the dataset into rows with a nonzero answer_in_question score and rows without, to see whether those questions have something in common that can help our preparation.

In [12]:
# investigate the idea from the markdown above
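
One possible way to do that segmentation is sketched below. A toy DataFrame with made-up scores stands in for `jeopardy`, so the numbers it produces say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for the jeopardy DataFrame; the scores are invented.
toy = pd.DataFrame({"answer_in_question": [0.0, 0.5, 1.0, 0.0, 0.0]})

# Rows where at least one answer word appears in the question.
nonzero = toy[toy["answer_in_question"] > 0]
share_nonzero = len(nonzero) / len(toy)

# Rows where every answer word appears in the question.
full_matches = int((toy["answer_in_question"] == 1).sum())

print(share_nonzero, full_matches)
```

Applied to `jeopardy`, the `nonzero` subset could then be inspected by category or round to look for patterns.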

Let's investigate if the show repeats questions. 

Because this dataset contains only about 10 percent of the full Jeopardy question dataset, we cannot do a direct comparison. However, we can track the terms used in the dataset we have and see if there are any patterns there.

In [19]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    split_question = [w for w in split_question if len(w) >=6] #a way of removing insignificant words
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(split_question) > 0:
        match_count /=len(split_question) #turns match_count into a fraction
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap
mean_overlap = jeopardy["question_overlap"].mean()

In [20]:
mean_overlap

0.6895470512782512

On average, about 69% of the terms in a question (counting only words of six or more characters, which are presumably nontrivial) have already appeared in an earlier question. This measures the reuse of single terms rather than whole questions, but it suggests that studying past questions is a worthwhile way to prepare for the game.
So, we could focus on this and supplement our study with patterns from cases where most of the words in the answer come from the corresponding question.
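
To show how the overlap score is built up, here is a small sketch of the same loop on three invented questions, using plain lists instead of the DataFrame. The earliest question always scores 0 because nothing has been seen yet.

```python
# Hypothetical cleaned questions, already in air-date order.
questions = [
    "galileo studied planets",
    "planets orbit around stars",
    "galileo observed jupiter",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words of six or more characters, as in the cell above.
    words = [w for w in question.split() if len(w) >= 6]
    match_count = 0
    for w in words:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(words) > 0:
        match_count /= len(words)
    overlaps.append(match_count)

print(overlaps)
```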

# High-Value Terms
Let's say you only want to study high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

    * Low value -- Any row where `Value` is `800` or less.
    * High value -- Any row where `Value` is greater than `800`.
You'll then be able to loop through each of the terms from the last cell, `terms_used`, and:

    * Find the number of low value questions the word occurs in.
    * Find the number of high value questions the word occurs in.
    * Find the percentage of questions the word occurs in.
    * Based on the percentage of questions the word occurs in, find expected counts.
    * Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [21]:
def value_score(row):
    """
    Returns 1 if the row's question is high value (clean_value > 800), otherwise 0
    """
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

In [22]:
jeopardy["high_value"] = jeopardy.apply(value_score, axis = 1)
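
The same flag can also be computed without a row-wise `apply`. The sketch below is an alternative, not the approach used in this notebook, and it runs on a toy DataFrame with made-up values.

```python
import pandas as pd

# Toy stand-in for the jeopardy DataFrame; the values are invented.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})

# The comparison yields a boolean column; astype(int) turns it into 1/0.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high_value"].tolist())
```

Note that a value of exactly 800 is classed as low value, matching `value_score`.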

In [23]:
def value_counts(word):
    
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split()
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
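
`value_counts` scans every row once per term, which is why running it for all terms would be slow. A possible alternative, sketched here on invented data, is to walk the questions once and tally high and low counts for every term along the way. The `rows` list and the `counts` dictionary are hypothetical names.

```python
from collections import defaultdict

# Hypothetical (clean_question, high_value) pairs.
rows = [
    ("galileo studied planets", 1),
    ("planets orbit around stars", 0),
    ("galileo observed jupiter", 0),
]

# Maps each term to [high_count, low_count].
counts = defaultdict(lambda: [0, 0])
for question, high_value in rows:
    # set() counts a word once per question, like `word in split_question`.
    for word in set(question.split()):
        if high_value == 1:
            counts[word][0] += 1
        else:
            counts[word][1] += 1

print(counts["galileo"], counts["stars"])
```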

In [26]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for i in range(10)]

observed_expected = []

for term in comparison_terms:
    v = value_counts(term)
    observed_expected.append(v)

observed_expected

[(1, 0),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 2),
 (0, 3),
 (0, 5),
 (1, 0),
 (0, 1),
 (15, 44)]

In [28]:
high_value_rows = jeopardy[jeopardy["high_value"] == 1]
low_value_rows = jeopardy[jeopardy["high_value"] == 0]

high_value_count = len(high_value_rows)
low_value_count = len(low_value_rows)

print(high_value_count)
print(low_value_count)

5734
14265


In [31]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []

for i in observed_expected:
    total = i[0] + i[1]   # each element of observed_expected is a (high_count, low_count) tuple
    total_prop = total/len(jeopardy)
    expected_high_value_count = total_prop * high_value_count
    expected_low_value_count = total_prop * low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([expected_high_value_count, expected_low_value_count])
    chi_squared.append(chisquare(observed, expected))

In [32]:
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498966),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.3042931605199027, pvalue=0.5812034286869123)]
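
To make the calculation behind `chisquare` explicit, the sketch below recomputes the statistic by hand for the last term, using the counts `(15, 44)` and the row totals printed above. It assumes every row is either high or low value, so the total number of rows is the sum of the two counts. The p-value uses the fact that, with one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math

high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

observed_high, observed_low = 15, 44  # last tuple in observed_expected

# Expected counts if the term were spread evenly across all rows.
total = observed_high + observed_low
total_prop = total / total_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give one degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))

print(statistic, p_value)
```

The two numbers should agree with the last entry of `chi_squared` above up to floating-point rounding.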

# Chi-Squared Results
None of the terms showed a significant difference in usage between high- and low-value rows: every p-value was above 0.05 (the smallest was about 0.11). All but one of the sampled terms also occurred in five or fewer questions, which is too few for the chi-squared test to be reliable. It would be far better to run this test only on terms with higher frequencies.

We may try this out with the whole dataset and see if we are able to pinpoint specific terms with significant difference between high and low value rows.
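
One way to restrict the test to terms with enough data is sketched below. It keeps only terms whose expected high and low counts are both at least 5, a common rule of thumb for the chi-squared test. The threshold `MIN_EXPECTED` and the helper `has_enough_data` are assumptions introduced here, not part of the analysis above.

```python
high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

# (high_count, low_count) pairs copied from the observed_expected output.
observed_counts = [(1, 0), (1, 1), (0, 1), (0, 1), (0, 2),
                   (0, 3), (0, 5), (1, 0), (0, 1), (15, 44)]

MIN_EXPECTED = 5  # rule-of-thumb threshold, chosen for illustration

def has_enough_data(high, low):
    """Return True if both expected counts reach MIN_EXPECTED."""
    total = high + low
    expected_high = total * high_value_count / total_rows
    expected_low = total * low_value_count / total_rows
    return min(expected_high, expected_low) >= MIN_EXPECTED

testable = [pair for pair in observed_counts if has_enough_data(*pair)]
print(testable)
```

In this sample only one term passes the filter, which is consistent with the conclusion that a random sample of terms is dominated by rare words.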