# Guided Project: Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

In this project, we will work with a dataset of Jeopardy questions to look for patterns in the questions that could help a contestant win.
The dataset contains 20,000 rows taken from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

    Show Number - the Jeopardy episode number
    Air Date - the date the episode aired
    Round - the round of Jeopardy
    Category - the category of the question
    Value - the number of dollars the correct answer is worth
    Question - the text of the question
    Answer - the text of the answer



In [18]:
# Import packages
import pandas as pd
import numpy as np
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
import re

In [19]:
# Read and view data
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


### Preparing the data

In [20]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [21]:
# Fix spaces in column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [22]:
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

The 'Air Date' and 'Value' columns have the wrong data types. 'Air Date' should be converted to datetime and 'Value' should be assigned a numeric data type. The 'Question' and 'Answer' columns contain free text with punctuation and a mix of upper and lower case letters; these can be normalized to make them easier to compare.

In [23]:
# Function to clean and normalize text
def norm_strings(words):
    # Lowercase, then remove everything except letters, digits and whitespace
    words = words.lower()
    words = re.sub(r'[^A-Za-z0-9\s]', '', words)
    return words

In [24]:
# Apply the function above to clean the Question and Answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_strings)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_strings)

In [25]:
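# Function to clean the numeric column Value
def norm_num(text):
    # Remove punctuation such as "$" and ","
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        text = int(text)
    except ValueError:
        # Values that are not numbers are treated as 0
        text = 0
    return text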

In [26]:
# Apply the function to the numeric column
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_num)
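
As a quick sanity check of the two cleaning helpers, the sketch below re-declares them in compact form and runs them on a few made-up strings. The sample question and the sample values (including the non-numeric `"None"`) are assumptions chosen for illustration, not rows taken from the dataset.

```python
import re

def norm_strings(words):
    # Lowercase, then drop everything except letters, digits and whitespace
    return re.sub(r"[^a-z0-9\s]", "", words.lower())

def norm_num(text):
    # Strip punctuation such as "$" and ","; non-numeric values become 0
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Hypothetical inputs, for illustration only
sample_question = "The city of Yuma, in this state..."
sample_values = ["$200", "$1,000", "None"]

clean_q = norm_strings(sample_question)
clean_vals = [norm_num(v) for v in sample_values]
print(clean_q)     # the city of yuma in this state
print(clean_vals)  # [200, 1000, 0]
```

Note that any value that cannot be parsed as an integer is silently mapped to 0, so such rows end up in the low value group later on.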

In [27]:
# Convert date column to datetime format
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [28]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Is it worth studying past questions?

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

    How often the answer can be deduced from the question.
    How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question, and the second by seeing how often complex words (6 or more characters) recur across questions.

In [29]:
# Function to compute the proportion of words in each answer that also appear in its own question

def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [30]:
jeopardy['answer_in_question'].mean()


0.05900196524977763

From the result above, we can conclude that, on average, only about 6% of the words in an answer also appear in its question. Hearing the question is therefore unlikely to give away the answer, so one should probably study.
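
To make the metric concrete, here is a minimal sketch of the same idea on a single made-up question and answer pair. The function takes two strings instead of a DataFrame row, and it drops every occurrence of "the" rather than only the first one; both are simplifications assumed for this illustration.

```python
def count_matches(clean_question, clean_answer):
    # Ignore "the", which says nothing about the answer
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Share of answer words that also occur in the question
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# Hypothetical pair: "state" and "of" occur in the question, "arizona" does not
share = count_matches("this state is home to the city of yuma",
                      "the state of arizona")
print(share)  # 2 of 3 answer words match
```

A value of 1 would mean every word of the answer is already in the question, and 0 means none of them are.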

We cannot fully answer how often old questions are repeated, since we only have part of the full dataset. However, it is still worth investigating with the data available.

In [31]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    # Keep only words with 6 or more characters
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    # Count the words that already appeared in earlier questions
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Record this question's words for later questions
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169802

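The mean value indicates that, on average, about 69% of the complex words in a question have already appeared in earlier questions. This measures the overlap of single words rather than whole questions or phrases, so it does not prove that questions are repeated, but it is an important hint that studying past questions is worthwhile.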
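
The overlap measure can be illustrated on a tiny list of made-up questions, processed in order. This is only a sketch: the helper name `question_overlaps`, its `min_len` parameter, and the three sample questions are assumptions for illustration.

```python
def question_overlaps(clean_questions, min_len=6):
    seen = set()      # long words met in earlier questions
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_len]
        # Share of this question's long words that were seen before
        matches = sum(w in seen for w in words)
        overlaps.append(matches / len(words) if words else 0)
        seen.update(words)
    return overlaps

# Hypothetical questions, assumed to be sorted by air date
toy_questions = [
    "the famous painter of france",
    "a famous composer of austria",
    "who is he",
]
overlaps = question_overlaps(toy_questions)
print(overlaps)
```

The first question has nothing to overlap with, the second shares one of its three long words ("famous") with the first, and the third has no long words at all, so it scores 0.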

### How to win big

To win big, a player can decide to focus on high value questions: answering a few of these can earn more than answering many low value questions. We will treat questions with a value over 800 as high value, and questions with a value of 800 or less as low value.
We can then use a chi-squared test to check whether particular words appear in high value questions more or less often than we would expect by chance.
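
As a rough illustration of the chi-squared computation, the sketch below works through one term with made-up counts: 1000 questions, 300 of them high value, and a hypothetical term seen 20 times in high value and 30 times in low value questions. It uses only the standard library. With two categories there is one degree of freedom, and in that case the p-value equals `erfc(sqrt(statistic / 2))`, which should agree with what `scipy.stats.chisquare` reports.

```python
import math

# Hypothetical counts, for illustration only
total_questions = 1000
high_value_count = 300
low_value_count = total_questions - high_value_count
observed = (20, 30)  # (high value uses, low value uses) of one term

# Expected uses if the term were spread evenly over all questions
total_prop = sum(observed) / total_questions
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value for one degree of freedom
pvalue = math.erfc(math.sqrt(statistic / 2))

print(expected, statistic, pvalue)
```

Here the term is expected 15 times in high value questions and 35 times in low value ones, so the observed 20 and 30 lean toward high value questions, but not strongly enough to be significant at the usual 0.05 level.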



In [32]:

def check_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(check_value, axis=1)

In [44]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
    

In [45]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (1, 0),
 (1, 2),
 (0, 3),
 (0, 1),
 (1, 2),
 (3, 2),
 (1, 0),
 (0, 1),
 (0, 1)]

In [46]:

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared



[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

### Results

None of the 10 randomly selected words showed a statistically significant difference in usage between low and high value questions: all p-values are well above 0.05. In addition, each word occurs at most 5 times in total, so the expected counts are below 5 and the result of the chi-squared test isn't reliable for these words.
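
One possible way around the low counts, sketched here under stated assumptions, is to test only terms that occur often enough for both expected counts to reach 5. The helper name `frequent_terms`, its parameters, and the toy questions are hypothetical and not part of the analysis above; `high_share` stands for the fraction of questions that are high value.

```python
from collections import Counter

def frequent_terms(clean_questions, high_share, min_expected=5, min_len=6):
    # Count in how many questions each long word appears
    counts = Counter()
    for question in clean_questions:
        counts.update({w for w in question.split() if len(w) >= min_len})
    # The smaller of the two expected counts decides whether the test is usable
    smaller_share = min(high_share, 1 - high_share)
    return sorted(term for term, n in counts.items()
                  if n * smaller_share >= min_expected)

# Hypothetical data: "famous" in 25 questions, "painter" in 20, "composer" in 5
toy_questions = ["famous painter"] * 20 + ["famous composer"] * 5
usable = frequent_terms(toy_questions, high_share=0.3)
print(usable)  # ['famous', 'painter']
```

With a high value share of 0.3, a term needs to appear in at least 17 questions before its smaller expected count reaches 5, which is why "composer" is left out.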