# Winning Jeopardy
## Introduction
Jeopardy is a popular TV show in the US where participants answer questions to win money. Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.
## Dataset
The dataset contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can find here:
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
Each row in the dataset represents a single question on a single episode of Jeopardy:
- Show Number: the Jeopardy episode number of the show this question was in.
- Air Date: the date the episode aired.
- Round: the round of Jeopardy that the question was asked in.
- Category: the category of the question.
- Value: the number of dollars answering the question correctly is worth.
- Question: the text of the question.
- Answer: the text of the answer.

## Clean data

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.shape
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Let's drop the leading space from each column name.

In [26]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB
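Instead of retyping every name, the leading spaces can also be stripped programmatically. The snippet below is a minimal sketch on a plain list that copies the column names printed above; `raw_columns` and `clean_columns` are illustrative names. On the DataFrame itself, the equivalent should be `jeopardy.columns = jeopardy.columns.str.strip()`.

```python
# Column names as printed by jeopardy.columns (note the leading spaces)
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# str.strip() removes leading and trailing whitespace from each name
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```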


### Findings:
- Columns 'Air Date' and 'Value' are stored as strings (`object`) rather than as dates and numbers.
- Columns 'Question' and 'Answer' mix lowercase and uppercase.

In [27]:
# Verify our findings
jeopardy[:3]

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |


In [28]:
# Normalize 'Question' and 'Answer' column
import re
def norm_str(text):
    # Lowercase, then strip everything except letters, digits and whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(norm_str)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_str)
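To make the effect of this normalization concrete, here is a small self-contained sketch. The function name `normalize_text` is illustrative, and the sample string is adapted from the question preview shown earlier.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits and whitespace."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

sample = "No. 2: 1912 Olympian; football star"
print(normalize_text(sample))
# prints: no 2 1912 olympian football star
```

Note that punctuation is deleted rather than replaced with a space, so a token such as "ESPN's" becomes "espns".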

In [29]:
# Normalize 'Value' column and clean 'Air Date' column
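def norm_val(val):
    # Strip punctuation such as '$' and ',' before converting to an integer
    val = re.sub(r"[^A-Za-z0-9\s]", "", val)
    try:
        val = int(val)
    except ValueError:
        # Non-numeric values are treated as 0
        val = 0
    return val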
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_val)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB
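The value cleaning can be checked on a few hand-written inputs. This is a sketch only: `parse_dollar_value` is an illustrative name, and the strings `"$1,000"` and `"None"` are assumed examples of what the 'Value' column might contain, not values confirmed by the output above.

```python
import re

def parse_dollar_value(raw):
    """Strip punctuation such as '$' and ',' and convert to int; fall back to 0."""
    digits = re.sub(r"[^A-Za-z0-9\s]", "", raw)
    try:
        return int(digits)
    except ValueError:
        # Anything that is not a plain number (assumed here: the text "None")
        return 0

for raw in ["$200", "$1,000", "None"]:
    print(raw, "->", parse_dollar_value(raw))
```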


## Should I study past questions, study general knowledge or not study at all?
### How often is the answer deducible from the question?
We can calculate the share of words in the answer that also occur in the question.

In [30]:
def first_question(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    com_words = ['the', 'a', 'an', 'and', 'to', 'in', 'at', 'this', 'or']
    for i in com_words:
        if i in split_answer:
            split_answer.remove(i)
    if len(split_answer) == 0:
        return 0  # Prevent division by zero
    match_count = 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(first_question, axis=1)
jeopardy['answer_in_question'].mean() * 100

4.383096069555393

About 4.4%. On average, roughly one answer word in 23 also occurs in the question. Not too bad.
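The metric is easier to follow on a single made-up row. The sketch below is a simplified variant of the function above, not a drop-in replacement: it removes every occurrence of a stop word (here only "the") and uses a set for the lookup. The name `answer_word_share` and the sample strings are illustrative.

```python
def answer_word_share(clean_question, clean_answer, stop_words=("the",)):
    """Fraction of answer words (stop words excluded) that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w not in stop_words]
    if not answer_words:
        return 0.0  # nothing left to match, avoid dividing by zero
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up example: "state" and "of" occur in the question, "arizona" does not
share = answer_word_share("this state is home to the city of yuma",
                          "the state of arizona")
print(share)
```

Here two of the three remaining answer words match, so the share for this row is 2/3.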

### How often are new questions repeats of older questions?
We can answer this by seeing how often complex words (6 or more characters) reoccur.

In [31]:
jeopardy = jeopardy.sort_values('Air Date')
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
    for w in split_question:
        terms_used.add(w)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean() * 100

68.76260592169801

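On average, about 69% of the longer words in a question have already appeared in an earlier question. This measures overlap of single words rather than whole repeated questions, but it suggests that looking back at old questions is more worthwhile than studying general knowledge.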
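The overlap calculation can be traced by hand on a tiny example. This is a minimal sketch, assuming the questions are already cleaned and in chronological order; `overlap_scores` and the three toy questions are made up for illustration.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words already seen in earlier questions."""
    seen_terms = set()
    scores = []
    for question in questions:
        terms = [w for w in question.split() if len(w) >= min_length]
        if terms:
            score = sum(1 for w in terms if w in seen_terms) / len(terms)
        else:
            score = 0.0  # no long words in this question
        scores.append(score)
        seen_terms.update(terms)  # only now make this question's words "old"
    return scores

toy_questions = [
    "this italian astronomer was under house arrest",
    "this astronomer proposed a heliocentric system",
    "a short one",
]
print(overlap_scores(toy_questions))
# prints: [0.0, 0.25, 0.0]
```

In the second question only "astronomer" has been seen before, out of four long words, which gives 0.25.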

### Aim for that coin!
Let's find out which words are associated with higher-value questions, using a chi-squared test.

In [32]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [35]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))
print(observed_expected)

[(1, 0), (0, 1), (0, 1), (2, 1), (0, 3)]


In [36]:
from scipy.stats import chisquare
high_value_count = len(jeopardy[jeopardy.high_value == 1])
low_value_count = len(jeopardy[jeopardy.high_value == 0])
chi_squared = []
for l in observed_expected:
    total = sum(l)
    total_prob = total / jeopardy.shape[0]
    expected_high = total_prob * high_value_count
    expected_low = total_prob * low_value_count
    chi_squared.append(chisquare([l[0], l[1]], [expected_high, expected_low]))
print(chi_squared)

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344), Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047)]
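To show what the test computes, the sketch below reproduces the statistic by hand for one term, using only the standard library. All counts are hypothetical (300 high-value rows, 700 low-value rows, a term seen 40 and 60 times); they are not taken from the dataset. With two categories the test has one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value for one term split across two groups."""
    total_rows = n_high + n_low
    term_total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to group size
    expected_high = term_total * n_high / total_rows
    expected_low = term_total * n_low / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # One degree of freedom: the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts, chosen only to illustrate the arithmetic
stat, p = chi_squared_two_groups(40, 60, n_high=300, n_low=700)
print(round(stat, 3), p)
```

With these numbers the expected counts are 30 and 70, so the statistic is 100/30 + 100/70, about 4.762.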


None of the terms had a significant difference in usage between high value and low value rows (all p-values are above 0.05). Additionally, every term occurred fewer than 5 times, so the chi-squared test isn't reliable here. It would be better to run this test only on terms that have higher frequencies.
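One way to select higher-frequency terms is to count every term in a single pass and keep only those above a threshold. The following is a sketch under assumptions: `frequent_term_counts`, the `min_total` threshold and the toy rows are all illustrative, and filtering on the total observed count is only a rough proxy for the usual rule of thumb about expected frequencies.

```python
from collections import Counter

def frequent_term_counts(rows, min_total=5):
    """Return {term: (high_count, low_count)} for terms seen in at least min_total rows.

    rows is an iterable of (clean_question, is_high_value) pairs.
    """
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, is_high_value in rows:
        # set() counts each term once per question, like the membership test above
        for term in set(clean_question.split()):
            if is_high_value:
                high_counts[term] += 1
            else:
                low_counts[term] += 1
    result = {}
    for term in set(high_counts) | set(low_counts):
        high, low = high_counts[term], low_counts[term]
        if high + low >= min_total:
            result[term] = (high, low)
    return result

# Toy rows with a low threshold so that something survives the filter
toy_rows = [
    ("capital of france", 1),
    ("capital of spain", 0),
    ("largest planet", 0),
]
frequent = frequent_term_counts(toy_rows, min_total=2)
print(frequent)
```

On the real data, the pairs could come from the `clean_question` and `high_value` columns, and the resulting counts could then be passed to the chi-squared test in place of the five arbitrary terms.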