## Winning Jeopardy

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Let's start by reading in the dataset and familiarising ourselves with it.

In [1]:
# Import libraries
import pandas as pd
import re
from scipy.stats import chisquare
from random import choice

# Read dataset & view first five rows
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


### Clean Columns

Some of the column names have spaces in front of them. Let's remove the spaces first.

In [2]:
# Remove spaces in front of column names
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalize Text Columns

Before doing any sort of analysis, we'll first normalize the 'Question' and 'Answer' columns.

We'll write a function to remove all punctuation and convert all text to lowercase.

In [3]:
def normalize_string(string):
    # Raw string so that \s reaches the regex engine as a whitespace class.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = string.lower()
    return string

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_string)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_string)

print(jeopardy["clean_question"][:3])
print(jeopardy["clean_answer"][:3])

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
Name: clean_question, dtype: object
0    copernicus
1    jim thorpe
2       arizona
Name: clean_answer, dtype: object


### Normalize Value and Air Date Columns

Right now the Value column is a string with a dollar sign in front. We'll remove the dollar sign and convert the column to a numeric type.

We'll also convert the Air Date column to a datetime type to make it easier to work with.

In [4]:
# Func to normalize Value column.
def normalize_value(string):
    # Strip the dollar sign, commas and any other punctuation.
    value = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        value = int(value)
    except ValueError:
        # Non-numeric values fall back to 0.
        value = 0
    return value

# Normalize Value column.
jeopardy["Value"] = jeopardy["Value"].apply(normalize_value)

# Convert Air Date to datetime data type.
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


In [5]:
# Check Air Date and Value dtypes.
jeopardy[['Air Date', 'Value']].dtypes

Air Date    datetime64[ns]
Value                int64
dtype: object
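
To see how the value normalization treats different inputs, here is a small standalone check. The input strings are made up for illustration; in particular, the `"None"` placeholder is an assumption about how a missing value might appear, not something shown in the output above.

```python
import re

def normalize_value(string):
    # Same logic as the cell above: strip punctuation, then try to parse an integer.
    value = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        return int(value)
    except ValueError:
        # Anything that is not a plain number falls back to 0.
        return 0

# Illustrative inputs only (not taken from the dataset).
samples = ["$200", "$2,000", "None"]
parsed = [normalize_value(s) for s in samples]
print(parsed)  # [200, 2000, 0]
```

Because the fallback is 0, any row whose value cannot be parsed ends up in the low-value group later on.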

### Answers in Questions?

To decide whether to study past questions, general knowledge, or something else, it would help to know:

- How often the answer can be deduced from the question.
- How often questions are repeated.

To answer the first question, we can look at how many times words in the answer also appear in the question.

To answer the second question, we can look at how often complex words (words with six or more characters) are repeated.

Let's attempt to answer the first question first.

In [6]:
# Func to find the proportion of words in Answer that also appear in Question.
def count_matches(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    #remove 'the' as it doesn't have any meaningful use in the answer. 
    if "the" in split_answer:
        split_answer.remove("the")
    #return 0 to prevent division by zero later on.    
    if len(split_answer) == 0:
        return 0
    for w in split_answer:
        if w in split_question:
            match_count +=1
    return match_count/len(split_answer)

# Apply func to every row in the dataset and add proportion to new column "answer_in_question"
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

# Find the mean of the "answer_in_question" column
jeopardy["answer_in_question"].mean()

0.060493257069335914

On average, only about 6% of the words in an answer also appear in its question. This tells us we shouldn't rely on hearing the answer in the question and should probably study.
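
To make the score concrete, here is the same matching logic applied to a single made-up row. The row contents are an assumption for illustration, and a plain dict stands in for a DataFrame row.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # Drop one "the", as in the cell above.
    if "the" in split_answer:
        split_answer.remove("the")
    # Guard against division by zero.
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for w in split_answer if w in split_question)
    return match_count / len(split_answer)

# Made-up example row (not from the dataset).
row = {
    "clean_question": "this tower in paris was built for the 1889 worlds fair",
    "clean_answer": "the eiffel tower",
}
score = count_matches(row)
print(score)  # 0.5
```

After "the" is dropped, the answer has two words, and only "tower" appears in the question, so the score is one half.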

Let's now try to answer the second question: how often are questions repeated? As mentioned, we'll look at how often complex words (words with six or more characters) are repeated.

In [7]:
# Empty list to store match count
question_overlap = []
# Empty set to store terms used
terms_used = set()
# Sort dataset by "Air Date"
jeopardy = jeopardy.sort_values("Air Date")

# Find repeated terms in questions.
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [w for w in split_question if len(w) > 5]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6894031359073217

On average, about 69% of the complex terms in a question have already appeared in an earlier question. Although this seems high, we've only looked at single terms rather than phrases, and only at 10% of the full Jeopardy question dataset. Still, it's worth taking a closer look at repeated questions.
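
The running-set idea is easier to follow on a toy example. The sketch below applies the same loop to three made-up questions, assumed to be already cleaned and in air-date order; the function name `question_overlaps` is introduced here for illustration.

```python
def question_overlaps(questions, min_length=6):
    """For each question, return the share of its long words seen in earlier text."""
    terms_used = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        match_count = 0
        for w in words:
            if w in terms_used:
                match_count += 1
            terms_used.add(w)
        if len(words) > 0:
            match_count /= len(words)
        overlaps.append(match_count)
    return overlaps

# Made-up questions (not from the dataset).
questions = [
    "this european capital hosted the olympics",
    "the olympics returned to this european capital",
    "name this planet",
]
overlaps = question_overlaps(questions)
print(overlaps)  # [0.0, 0.75, 0.0]
```

The second question reuses three of its four long words, which shows how a high overlap can come from shared vocabulary rather than from a repeated question.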

### Low Value vs High Value Questions

Let's say we only want to study terms that pertain to high-value questions rather than low-value ones. This would help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. First we need to put our questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.



In [8]:
#Func to determine if value of the question is high or low.
def determine_value(row):
    value = 0
    if row["Value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
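
As a quick check of where the boundary falls, here is the same rule applied to a few plain numbers. This standalone version takes a value rather than a row, which is a simplification for illustration.

```python
def determine_value(value):
    # Mirrors the rule above: strictly greater than 800 counts as high value.
    return 1 if value > 800 else 0

labels = {v: determine_value(v) for v in (0, 200, 800, 1000, 2000)}
print(labels)  # {0: 0, 200: 0, 800: 0, 1000: 1, 2000: 1}
```

A question worth exactly 800 is classed as low value.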

Next, for each term, we need to find the number of low-value questions and the number of high-value questions it occurs in.

To save time, we'll only use a random sample of ten terms.
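
Drawing terms one at a time with `choice` can pick the same term more than once. If distinct terms are wanted, `random.sample` is one alternative. The sketch below is a minimal illustration: the small `terms_used` set is a hypothetical stand-in for the real one, and the fixed seed is only there to make the draw reproducible.

```python
import random

# Hypothetical stand-in for the terms_used set built earlier.
terms_used = {
    "universe", "character", "paranoia", "evicted", "philips", "dryden",
    "rested", "snippy", "capital", "olympics", "european", "planet",
}

rng = random.Random(0)  # fixed seed so the sample is reproducible
# sample() needs a sequence; sorting first makes the draw independent of set ordering.
comparison_terms = rng.sample(sorted(terms_used), 10)

print(len(comparison_terms), len(set(comparison_terms)))  # 10 10
```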

In [9]:
# Count how many times the term was used in low & high value questions
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return term, low_count, high_count

comparison_terms = [choice(list(terms_used)) for _ in range(10)]

high_low_value_counts = []

for term in comparison_terms:
    high_low_value_counts.append(count_usage(term))

high_low_value_counts

[('snippy', 1, 0),
 ('universe', 7, 5),
 ('dryden', 1, 0),
 ('vanderbilts', 0, 1),
 ('paranoia', 1, 0),
 ('character', 100, 40),
 ('evicted', 1, 0),
 ('rested', 1, 0),
 ('targetblankjacoba', 0, 1),
 ('philips', 2, 1)]
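
The term 'targetblankjacoba' looks like the residue of an HTML anchor tag (something like `target="_blank"` followed by link text) that lost its punctuation during normalization. That reading is an inference from the term itself, not something confirmed by the dataset. If it holds, one option is to strip tags before removing punctuation. A minimal sketch, with a hypothetical function name and a made-up sample string:

```python
import re

def normalize_string_strip_html(text):
    # Hypothetical variant of normalize_string: remove anything that looks like an HTML tag first.
    text = re.sub(r"<[^>]+>", " ", text)
    # Then remove punctuation, as before.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Lowercase and collapse the extra whitespace left by the removed tags.
    return " ".join(text.lower().split())

# Made-up sample string (not from the dataset).
sample = 'This <a href="http://example.com/pic.jpg" target="_blank">Jacob</a> wrestled an angel'
cleaned = normalize_string_strip_html(sample)
print(cleaned)  # this jacob wrestled an angel
```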

### Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
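
Before handing the work to `scipy`, it may help to see the calculation by hand. The sketch below computes the expected counts, the chi-squared statistic, and the p-value for one term using only the standard library. The totals of 700 low-value and 300 high-value questions are made-up numbers, not the dataset's. With two categories the test has one degree of freedom, so the p-value can be written as `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_for_term(low_obs, high_obs, low_total, high_total):
    """Chi-squared statistic and p-value for one term (two categories, 1 degree of freedom)."""
    total_questions = low_total + high_total
    # Share of all questions that contain the term.
    term_share = (low_obs + high_obs) / total_questions
    # Expected counts if the term were spread evenly across both groups.
    expected_low = term_share * low_total
    expected_high = term_share * high_total
    statistic = ((low_obs - expected_low) ** 2 / expected_low
                 + (high_obs - expected_high) ** 2 / expected_high)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Made-up totals: 700 low-value and 300 high-value questions.
statistic, p_value = chi_squared_for_term(low_obs=7, high_obs=5, low_total=700, high_total=300)
print(round(statistic, 4))  # 0.7778
```

Here the expected counts are 8.4 and 3.6, so each group is off by 1.4 and the statistic is 1.96/8.4 + 1.96/3.6.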

In [10]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in high_low_value_counts:
    total = obs[1] + obs[2]
    total_prop = total/jeopardy.shape[0]
    expected_hi_val = total_prop * high_value_count
    expected_low_val = total_prop * low_value_count
    observed = [obs[1], obs[2]]
    expected = [expected_low_val, expected_hi_val]
    chi_square, p_value = chisquare(observed, expected)
    chi_squared.append([obs[0], chi_square, p_value])
    

pd.DataFrame(chi_squared, columns = ['term', 'chi_squared', 'p_value'])

Unnamed: 0,term,chi_squared,p_value
0,snippy,0.401963,0.526077
1,universe,0.990915,0.319519
2,dryden,0.401963,0.526077
3,vanderbilts,2.487792,0.114733
4,paranoia,0.401963,0.526077
5,character,0.000685,0.979125
6,evicted,0.401963,0.526077
7,rested,0.401963,0.526077
8,targetblankjacoba,2.487792,0.114733
9,philips,0.031881,0.858289
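
The repeated values in the table have a simple explanation. Write $p$ for the share of high-value questions in the dataset (notation introduced here, not used in the original). A term that appears once, in a low-value question, has observed counts $(1, 0)$ and expected counts $(1-p,\ p)$, so:

```latex
\chi^2 = \frac{\bigl(1-(1-p)\bigr)^2}{1-p} + \frac{(0-p)^2}{p}
       = \frac{p^2}{1-p} + p
       = \frac{p}{1-p}
```

Setting this equal to the reported 0.401963 gives $p \approx 0.287$. By the same argument, a term that appears once, in a high-value question, gives $(1-p)/p \approx 2.4878$, which matches the rows for 'vanderbilts' and 'targetblankjacoba'. These values therefore reflect the overall split between high- and low-value questions rather than anything about the individual terms.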


None of the terms showed a statistically significant difference in usage between high- and low-value questions; the smallest p-value is about 0.11. However, the chi-squared test is only appropriate when the expected frequencies are at least 5, and apart from 'character' every term here has at least one expected frequency below 5. It would be worth running the test again using only terms with high usage counts.
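
One way to act on this is to filter terms by their expected counts before testing. The sketch below reuses the observed counts from the sample above. The high-value share of 0.29 is an approximation assumed for illustration (the exact totals are not shown in the output), and the helper name `has_enough_data` is hypothetical.

```python
# Observed (term, low_count, high_count) tuples copied from the sample above.
counts = [
    ("snippy", 1, 0), ("universe", 7, 5), ("dryden", 1, 0),
    ("vanderbilts", 0, 1), ("paranoia", 1, 0), ("character", 100, 40),
    ("evicted", 1, 0), ("rested", 1, 0), ("targetblankjacoba", 0, 1),
    ("philips", 2, 1),
]

# Assumed share of high-value questions (approximate, for illustration only).
HIGH_SHARE = 0.29

def has_enough_data(low_count, high_count, high_share=HIGH_SHARE, minimum=5):
    """True when both expected counts reach the minimum."""
    total = low_count + high_count
    expected_high = total * high_share
    expected_low = total * (1 - high_share)
    return min(expected_high, expected_low) >= minimum

usable = [term for term, low, high in counts if has_enough_data(low, high)]
print(usable)  # ['character']
```

With this sample only 'character' passes, so a rerun would need a larger or more deliberately chosen set of terms.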