# Winning Jeopardy
In this project, we explore `jeopardy.csv`, a dataset containing the first 20,000 questions from a larger dataset of Jeopardy questions.

The goal is to find out whether there are any patterns in the questions that could help you win.

<img alt="Imgur" src="https://dq-content.s3.amazonaws.com/Nlfu13A.png">

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the dollar amount a correct answer is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

The full dataset can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
import pandas as pd
import numpy as np

jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


# Data Cleaning

In [2]:
# remove spaces from column names
print(jeopardy.columns)

jeopardy.columns = jeopardy.columns.str.replace(' ', '')

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalizing Columns

The `normalize_text()` function will:
- Take in a string.
- Remove all punctuation in the string.
- Convert the string to lowercase.
- Return the string.

In [4]:
import re

def normalize_text(string):
    string = re.sub(r'[^\w\s]','', string)
    string = string.lower()
    return string

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
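
As a quick check of what the normalization does, the sketch below applies the same function to a few strings taken from the sample rows above. The list name `samples` is only for illustration.

```python
import re

def normalize_text(string):
    # Drop every character that is neither a word character nor whitespace,
    # then lowercase the result.
    string = re.sub(r'[^\w\s]', '', string)
    return string.lower()

# Illustrative inputs borrowed from the sample rows shown earlier
samples = ["McDonald's", "EVERYBODY TALKS ABOUT IT...", "Signer of the Dec. of Indep."]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Note that `\w` also matches the underscore, so underscores would survive this cleaning step.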

The `normalize_value()` function will:
- Take in a string.
- Remove any punctuation in the string.
- Convert the string to an integer.
- If the conversion has an error, assign 0 instead.

In [5]:
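def normalize_value(string):
    try:
        # strip punctuation such as '$' before converting, e.g. '$200' -> 200
        string = re.sub(r'[^\w\s]', '', string)
        integer = int(string)
        return integer
    except (ValueError, TypeError):
        # non-numeric or missing values fall back to 0
        return 0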
    
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
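
The fallback to 0 is easiest to see on a few inputs. The sketch below is a minimal version of the same idea; the inputs `"None"` and `float("nan")` are hypothetical stand-ins for missing values, and whether they occur in `jeopardy.csv` is an assumption.

```python
import re

def normalize_value(string):
    try:
        # Remove punctuation such as '$' and ',' and convert what is left.
        return int(re.sub(r'[^\w\s]', '', string))
    except (ValueError, TypeError):
        # ValueError: text that is not a number; TypeError: non-string input.
        return 0

# Hypothetical inputs: two dollar amounts and two kinds of missing value
values = ["$200", "$1,000", "None", float("nan")]
converted = [normalize_value(v) for v in values]
print(converted)
```

Catching only `ValueError` and `TypeError` keeps unrelated errors visible instead of silently turning them into 0.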

In [6]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [7]:
# Convert AirDate column to datetime
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [8]:
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Answers in Questions
To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

Below, the `count_matches()` function computes the fraction of words in the answer that also occur in the question.

In [9]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()   
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)


jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [10]:
# mean of answer_in_question column
jeopardy['answer_in_question'].mean()

0.059001965249777744

On average, only about 6% of the words in an answer also appear in its question. That is a small share, so you can't rely on the answer being in the question.
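
To make the metric concrete, here is a small sketch that runs the same matching logic on three made-up rows. Plain dictionaries stand in for DataFrame rows, and the questions and answers are invented for illustration.

```python
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    # Ignore one occurrence of 'the', which carries no information
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    # Fraction of answer words that also occur in the question
    return match_count / len(split_answer)

# Invented rows, already normalized
rows = [
    {'clean_question': 'this state is home to the city of yuma', 'clean_answer': 'arizona'},
    {'clean_question': 'the new york giants share a stadium with this team', 'clean_answer': 'new york jets'},
    {'clean_question': 'this word is the most common in english', 'clean_answer': 'the'},
]
scores = [count_matches(r) for r in rows]
print(scores)
```

The second row scores two thirds because `new` and `york` appear in the question but `jets` does not, which shows that the column holds a per-row fraction rather than a yes/no flag.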

# Recycled Questions
Steps taken to check for recycled questions:
- Sort `jeopardy` in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.
    
This lets you check whether the terms in a question have been used before. Looking only at words with six or more characters filters out common words like `the` and `than`, which don't tell you much about a question.

In [11]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate', ascending=True)

for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
print(np.mean(question_overlap))

0.6894006357823182


On average, about 69% of the long words in a question had already appeared in an earlier question. Keep in mind that this is a small sample of questions and that the check compares single words, not phrases. That makes the figure only weakly informative, but question recycling is worth a closer look.
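
The toy run below illustrates why a word-level overlap can be high even when no question is actually repeated. It is a sketch of the same loop on four invented, already normalized questions listed in air-date order; the function name `overlap_scores` and the `min_len` parameter are assumptions made for this example.

```python
def overlap_scores(questions, min_len=6):
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words with at least `min_len` characters
        words = [w for w in question.split(' ') if len(w) >= min_len]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Share of long words already seen; 0 when there are none
        scores.append(match_count / len(words) if words else 0)
    return scores

# Invented questions, oldest first
questions = [
    'this president signed the declaration',
    'which president lived in virginia',
    'the capital of virginia',
    'who was it',
]
scores = overlap_scores(questions)
print(scores)
```

The second and third questions each score 0.5 because they share one long word with an earlier question, even though all four questions are different.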

# Low Value vs High Value Questions
Let's say you only want to study terms that appear in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:
- Low value -- Any row where `clean_value` is 800 or less.
- High value -- Any row where `clean_value` is greater than 800.

In [12]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [13]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [14]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (1, 3),
 (2, 0),
 (1, 0),
 (0, 1),
 (7, 6),
 (0, 1),
 (1, 0),
 (0, 2),
 (1, 0)]
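
Because `count_usage()` scans every row once per term, counting many terms this way is slow. One possible alternative, sketched below under the assumption that each question is available as a `(clean_question, high_value)` pair, is to tally every term in a single pass with `collections.Counter`. The helper name `count_all_usage` and the sample rows are hypothetical.

```python
from collections import Counter

def count_all_usage(rows):
    """Tally, for every word, how many high- and low-value questions contain it."""
    high_counts = Counter()
    low_counts = Counter()
    for question, high_value in rows:
        # set() counts a word once per question, like `word in split_question`
        for word in set(question.split(' ')):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return high_counts, low_counts

# Invented rows: (clean_question, high_value)
rows = [
    ('this composer wrote the opera', 1),
    ('this composer was born in vienna', 0),
    ('vienna is the capital of this country', 0),
]
high_counts, low_counts = count_all_usage(rows)
usage = (high_counts['composer'], low_counts['composer'])
print(usage)
```

With the DataFrame from this project, the pairs could be produced with `zip(jeopardy['clean_question'], jeopardy['high_value'])`.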

# Applying the chi-squared test
The chi-squared test below is run on only 10 randomly chosen terms. For each term, it compares the observed counts in high- and low-value questions with the counts you would expect if the term were used equally often in both groups. Terms with the highest chi-squared values have the biggest differences in usage. Attempting to do this for all the words would take a very long time.

In [15]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for oe in observed_expected:
    total = sum(oe)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = [oe[0], oe[1]]
    expected = [high_value_exp, low_value_exp]
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.028652015627377, pvalue=0.04473366998729752),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
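
To show what goes into each result, the sketch below works through the same calculation by hand for a single term, using only the standard library. The row counts of 5000 high-value and 15000 low-value questions are hypothetical, not taken from the dataset, and the function name is made up for this example.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_value_count, low_value_count):
    total_rows = high_value_count + low_value_count
    total = observed_high + observed_low
    # Expected counts if the term were spread proportionally across both groups
    expected_high = total * high_value_count / total_rows
    expected_low = total * low_value_count / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Two categories give 1 degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# A term seen in 7 high-value and 6 low-value questions (hypothetical row counts)
statistic, p_value = chi_squared_two_groups(7, 6, 5000, 15000)
print(statistic, p_value)
```

Here the expected counts are 3.25 and 9.75, so the term appears in high-value questions more often than its overall frequency would predict.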

# Conclusion
**Answer in questions** : On average, only about 6% of the words in an answer appeared in its question. It's not recommended to rely on the answer being in the question to win.

**Recycled questions** : There was about 69% overlap between the terms in old and new questions. `jeopardy.csv` contains only a small sample of the full dataset, and the analysis compares single words rather than phrases. Despite those issues, it's worth looking more into recycled questions.

**Chi-squared test results** : Two of the ten randomly chosen terms had p-values below 0.05 (about 0.026 and 0.045); the others showed no significant difference in usage between high value and low value rows. However, nine of the ten terms occurred fewer than 5 times in total, so the chi-squared test isn't reliable for them. It would be better to run this test with only terms that have higher frequencies.
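
One way to restrict the test to terms with higher frequencies is to require a minimum expected count in both groups before testing a term. The sketch below assumes the common rule of thumb of at least 5 expected observations per group; the helper `enough_data`, the 25% high-value share, and the term counts are all hypothetical.

```python
def enough_data(high_count, low_count, high_share, min_expected=5):
    total = high_count + low_count
    # Expected counts under proportional usage across the two groups
    expected_high = total * high_share
    expected_low = total * (1 - high_share)
    return min(expected_high, expected_low) >= min_expected

# Invented (high_count, low_count) pairs per term
counts = {'composer': (7, 6), 'vienna': (0, 1), 'capital': (30, 70)}
testable = [term for term, (high, low) in counts.items()
            if enough_data(high, low, high_share=0.25)]
print(testable)
```

Only terms that pass this filter would then be handed to the chi-squared test.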