# Winning Jeopardy

![](https://s.abcnews.com/images/Entertainment/GTY_ken_jennings_jeopardy_jt_141126_16x9_608.jpg)

Image Source: Ken Jennings poses in this handout photo.
Jeopardy Productions via Getty Images.

## Table of Contents
---
- [Introduction](#Introduction)
- [Data Preparation](#Data-Preparation)
  - [Setting the environment](#Setting-the-environment)
  - [Reading and inspecting the data](#Reading-and-inspecting-the-data)
- [Data Cleaning](#Data-Cleaning)
  - [Removing white spaces from column names](#Removing-white-spaces-from-column-names)
  - [Normalizing Question and Answer Columns](#Normalizing-Question-and-Answer-Columns)
  - [Normalizing the Value and Air Date Columns](#Normalizing-the-Value-and-Air-Date-Columns)
- [Data Analysis](#Data-Analysis)
  - [How often the answer can be used for a question](#How-often-the-answer-can-be-used-for-a-question)
  - [How often questions are repeated](#How-often-questions-are-repeated)
  - [Low Value vs High Value Questions](#Low-Value-vs-High-Value-Questions)
  - [Statistical Analysis](#Statistical-Analysis)
- [Conclusions](#Conclusions)

## Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset contains 20,000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

## Data Preparation 

### Setting the environment

In [1]:
import pandas as pd
import re
import random
import numpy as np
from scipy.stats import chisquare

###  Reading and inspecting the data

In [2]:
# Read the csv file, parsing dates from the "Air Date" column
jeopardy = pd.read_csv("jeopardy.csv", parse_dates = [" Air Date"])

# Display dataframe information
print("Shape of the dataframe:", jeopardy.shape)
print("Columns of the dataframe:", jeopardy.columns)
print(jeopardy.info())
print(jeopardy.head())

Shape of the dataframe: (19999, 7)
Columns of the dataframe: Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null datetime64[ns]
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB
None
   Show Number   Air Date      Round                         Category  Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680 2004-12-31  Jeopardy!             

## Data Cleaning

### Removing white spaces from column names

In [3]:
# Original columns with white space
print("Columns of the dataframe:", jeopardy.columns)

Columns of the dataframe: Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [4]:
# Remove white spaces from column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

print("Columns of the dataframe:", jeopardy.columns)

Columns of the dataframe: Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
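
Typing the names by hand works for seven columns. A more general option, sketched here on a small stand-in frame rather than the real file (the `demo` frame and its values are illustrative assumptions), is to strip the whitespace from every label with the `str.strip()` accessor on the column index.

```python
import pandas as pd

# Stand-in frame with padded labels similar to the raw CSV (values are illustrative only)
demo = pd.DataFrame(
    [[4680, "HISTORY", "$200"]],
    columns=["Show Number", " Category", " Value"],
)

# Strip leading and trailing whitespace from every column label
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

This avoids retyping the labels and keeps working if the file gains or loses a column.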


### Normalizing Question and Answer Columns

We'll continue by converting `Question` and `Answer` columns to lowercase. We will also remove all punctuation.

In [5]:
def normalize(string):
    """
    Takes a string as an input. Transforms the string by removing the punctuation and 
    converting to lowercase.
    
    Args:
        string: String to be normalized
    
    Returns:
        normalized_string: Lowercase string without punctuation
    """
    
    lower_string = str(string).lower()
    normalized_string = re.sub(r"[^\w\s]", "", lower_string) # ^ means negation. So [^\w\s] will match a character which
                                                             # does not belong to either the word or the whitespace group
    return normalized_string

# Apply the normalization function to the "Question" and "Answer" columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)

# Display results
print(jeopardy.head())

   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  
0  for the last 8 years of his life galileo was 
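
To make the effect of the pattern `[^\w\s]` concrete, here is a small stand-alone sketch that applies the same normalization to a few sample strings. The third sample is made up for illustration; it shows that `\w` also covers digits and underscores, so those characters survive.

```python
import re

def normalize(text):
    """Lowercase the text and drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", str(text).lower())

samples = ["McDonald's", "ESPN's TOP 10 ALL-TIME ATHLETES", "snake_case, stays!"]
cleaned = [normalize(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Note that removing apostrophes and hyphens merges the surrounding letters ("all-time" becomes "alltime"), which matters later when words are compared one by one.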

### Normalizing the Value and Air Date Columns

We will now convert the `Value` column to a numeric column by removing the dollar sign from the beginning of each value. We'll also convert the `Air Date` column to a datetime column.

In [7]:
def norm_value(value):
    """
    Takes a string as input and removes the "$" character and then transforms it to an
    integer. If errors occur (e.g., string is empty), assigns a value of zero.
    
    Args:
        value: Input string to be normalized.
    
    Returns:
        int_value: Value as integer and without "$" character.
    """
    
    clean_value = re.sub(r"\D", "", value)  # \D matches any single character that is not a digit
    try:
        int_value = int(clean_value)
    except ValueError:  # raised when no digits are left, e.g. an empty string
        int_value = 0
    return int_value

# Apply the normalization function to the "Value" column
jeopardy["clean_value"] = jeopardy["Value"].apply(norm_value)

# Apply the pd.to_datetime function to the "Air Date" column
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

# Display results
print(jeopardy.head())

   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  clean_value  
0  for the last 8 years of his life
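
The same conversion can also be written with vectorized pandas string methods instead of a row-by-row function. The following is a minimal sketch on an illustrative Series; the sample values, including `"None"` as a stand-in for entries without a dollar amount, are assumptions and not taken from the dataset.

```python
import pandas as pd

# Illustrative raw values; "None" stands in for entries with no dollar amount (assumption)
raw = pd.Series(["$200", "$1,000", "None"])

# Drop every non-digit character; entries without digits become empty strings
digits = raw.str.replace(r"\D", "", regex=True)

# Empty strings are coerced to NaN, then filled with 0 before casting to int
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean.tolist())
```

As in `norm_value`, entries that contain no digits end up as zero.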

## Data Analysis

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer can be used for a question.
* How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question. Similarly, we can answer the second question by seeing how often complex words (six or more characters) reoccur.

Let's work on the first question.

### How often the answer can be used for a question

In [11]:
def match_words(row):
    """
    Takes a row as input, and returns the proportion of words in the "clean_answer" column
    that are also in the "clean_question" column.
    
    Args:
        row: row of the dataframe to work with.
    
    Returns:
        p_answer_in_question : proportion of words in the "clean_answer" column that 
        are also in the "clean_question" column.
    """
    
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    p_answer_in_question = match_count / len(split_answer)
    return p_answer_in_question

# Apply the function to the dataframe
jeopardy["answer_in_question"] = jeopardy.apply(match_words, axis=1)

# Calculate the mean value of the new column
mean_value = jeopardy["answer_in_question"].mean()

# Display the results
print("The mean percentage of words in the answer that also occur in the question is {}%.".format(round(mean_value * 100, 1)))

The mean percentage of words in the answer that also occur in the question is 5.9%.


As the mean percentage of words in the answer which also occur in the question is under 6%, we do not recommend trying to answer a question with words already contained in it.
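
As a rough illustration of what this proportion measures, the sketch below computes it for two hand-written question and answer pairs (the first pair is invented for the example). It is a variant of `match_words`, not the same function: it filters out every occurrence of "the", whereas `list.remove` drops only the first one.

```python
def answer_overlap(clean_answer, clean_question):
    """Return the share of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())  # set gives fast membership tests
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical pair: two of the three remaining answer words occur in the question
partial = answer_overlap("the new york times", "this new york paper is nicknamed the gray lady")

# Pair modelled on the first row of the dataset: no answer word occurs in the question
none = answer_overlap("copernicus", "for the last 8 years of his life galileo was under house arrest")

print(partial, none)
```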


### How often questions are repeated

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about `10%` of the full Jeopardy question dataset, but we can investigate it at least. We'll only analyze words with six or more characters to filter out words like "the" and "than", which are commonly used but don't tell us a lot about a question.

In [16]:
# Initiate empty list and set that are to be used later
question_overlap = []
terms_used = set()

# Sort the dataframe by the "Air Date" column in ascending order
jeopardy = jeopardy.sort_values(by = "Air Date")

# For each row of the dataframe, split the "clean_question" column, keep words with >5 characters,
# and check if each word has been previously used - "terms_used" set. Calculate the proportion of
# used words per row.
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

# Assign the list of proportions calculated above to the "question_overlap" column.
jeopardy["question_overlap"] = question_overlap

# Convert the column to float type
jeopardy["question_overlap"] = jeopardy["question_overlap"].astype(float)

# Calculate mean proportion of the new column
mean_val = jeopardy["question_overlap"].mean()

# Display results from the bottom (more chances to get proportions different from zero, as the words have been previously used)
print(jeopardy.tail())

# Display mean proportion of used words
print()
print("The mean proportion of previously used terms per question is {}%.".format(round(mean_val * 100, 1)))

      Show Number   Air Date             Round                  Category  \
1953         6294 2012-01-19  Double Jeopardy!   WEAPONS OF WORLD WAR II   
1954         6294 2012-01-19  Double Jeopardy!   ACTING PRESIDENTS ON TV   
1942         6294 2012-01-19         Jeopardy!                    INLETS   
1943         6294 2012-01-19         Jeopardy!  THE EVOLUTION OF "M"USIC   
1922         6294 2012-01-19         Jeopardy!           THAT'S BUSINESS   

      Value                                           Question  \
1953   $800  Ships in the U.S. Navy's Casablanca class of "...   
1954   $800  Dennis Haysbert & D.B. Woodside as David & Way...   
1942  $1000  North Carolina's Albemarle Sound is no deeper ...   
1943  $1000  In the 2000s "Makes Me Wonder" got this group ...   
1922   $400  In 1997 Tyco International moved to this U.K. ...   

                 Answer                                     clean_question  \
1953  aircraft carriers  ships in the us navys casablanca class of e
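
The running-set idea behind the loop is easier to follow on a tiny example. The sketch below applies the same logic to three invented questions, assumed to be already normalized and in air-date order; the function name and the sample strings are hypothetical.

```python
def term_overlap(questions, min_length=6):
    """For each question, return the share of its long terms that were seen in earlier terms."""
    seen = set()
    overlaps = []
    for question in questions:
        terms = [w for w in question.split() if len(w) >= min_length]
        matches = 0
        for word in terms:
            if word in seen:
                matches += 1
            seen.add(word)  # remember the term for later questions
        overlaps.append(matches / len(terms) if terms else 0.0)
    return overlaps

# Invented, already normalized questions in chronological order
questions = [
    "this president signed the treaty",
    "the treaty was signed in paris",
    "this capital hosted the president",
]
overlaps = term_overlap(questions)
print(overlaps)
```

The first question has no history to match against, so early rows are always close to zero. This is why the result depends on sorting by air date first.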

### Low Value vs High Value Questions

The previous analysis shows about a `70%` overlap between terms in new questions and terms in older ones. It covers only a small subset of questions, and it compares single terms rather than phrases, so on its own it is weak evidence. It does suggest that the recycling of questions is worth a closer look.

Let's say you only want to study terms that tend to appear in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than `800`.
* High value -- Any row where Value is `800` or more.

You'll then be able to loop through each of the terms in `terms_used` and find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.

In [25]:
# Create a column with value 1 for "clean_value" values of 800 or more, and 0 otherwise
jeopardy["high_value"] = jeopardy["clean_value"].apply(lambda x: 1 if x >= 800 else 0)

# Create the function to determine high and low counts for words
def high_low_counts(word):
    """
    Takes a word as input, and returns the number of times it appeared in 
    high value and low value questions.
    
    Args:
        word: Word to be searched in previous questions.
    
    Returns:
        high_count: Number of times the word appeared in high value questions.
        low_count: Number of times the word appeared in low value questions.
    """
    
    low_count = 0
    high_count= 0
    for index, row in jeopardy.iterrows():
        split_question = row["clean_question"].split()
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Take a random sample from the terms_used set to test the function
random.seed(1)
comparison_terms = random.sample(list(terms_used), 10)
print("The random sample of terms is:")
print()
print('\n'.join(comparison_terms))

# Apply the function in the sample, and display the results
observed_expected = []
for word in comparison_terms:
    observed_expected.append(high_low_counts(word))
print()
print("The observed high and low values counts are:")
print()
observed_expected

The random sample of terms is:

dreamily
hardings
carolyn
safinas
liquefy
trifle
officially
genome
goulet
competing

The observed high and low values counts are:



[(1, 0),
 (1, 0),
 (0, 2),
 (0, 1),
 (1, 1),
 (1, 0),
 (9, 12),
 (0, 1),
 (0, 1),
 (0, 1)]
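
Calling `high_low_counts` once per word rescans the whole dataframe each time, which becomes slow for more than a handful of terms. As a sketch of an alternative, the counts for all terms can be collected in one pass with two `collections.Counter` objects. The function name, the `(clean_question, clean_value)` tuple layout and the sample rows below are assumptions made for illustration.

```python
from collections import Counter

def count_terms_by_value(rows, threshold=800, min_length=6):
    """Count, per term, how many high-value and low-value questions contain it."""
    high, low = Counter(), Counter()
    for clean_question, clean_value in rows:
        # A set counts each term once per question, like `word in split_question`
        terms = {w for w in clean_question.split() if len(w) >= min_length}
        target = high if clean_value >= threshold else low
        target.update(terms)
    return high, low

# Invented (clean_question, clean_value) pairs
rows = [
    ("officially named in 1997", 1000),
    ("officially the smallest genome", 400),
    ("the genome project", 200),
]
high, low = count_terms_by_value(rows)
print(high["officially"], low["officially"], high["genome"], low["genome"])
```

With a real dataframe, the pairs could be produced with `zip(jeopardy["clean_question"], jeopardy["clean_value"])`.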

### Statistical Analysis

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
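
Written out, the calculation in the following cell amounts to the formulas below. The symbols are introduced here only for illustration: $N$ is the total number of rows, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high-value and low-value rows, and $O_{\text{high}}$, $O_{\text{low}}$ are the observed counts for one term.

$$
E_{\text{high}} = N_{\text{high}} \cdot \frac{O_{\text{high}} + O_{\text{low}}}{N},
\qquad
E_{\text{low}} = N_{\text{low}} \cdot \frac{O_{\text{high}} + O_{\text{low}}}{N}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

For a term seen once in a high-value question and never in a low-value one, writing $p = N_{\text{high}} / N$ gives $E_{\text{high}} = p$ and $E_{\text{low}} = 1 - p$, so

$$
\chi^2 = \frac{(1 - p)^2}{p} + \frac{(0 - (1 - p))^2}{1 - p} = \frac{(1 - p)^2}{p} + (1 - p) = \frac{1 - p}{p}.
$$

Every term with the same pair of observed counts therefore receives the same chi-squared value.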

In [27]:
high_value_count = jeopardy["high_value"].value_counts()[1]
low_value_count = jeopardy["high_value"].value_counts()[0]
chi_squared = []

# `observed` holds (high_count, low_count) for one term; the name avoids shadowing the built-in `list`
for observed in observed_expected:
    total = sum(observed)
    total_prop = total / jeopardy.shape[0]
    expected_high = high_value_count * total_prop
    expected_low = low_value_count * total_prop
    chisq_value, p_value = chisquare(np.array(observed), np.array([expected_high, expected_low]))
    chi_squared.append([chisq_value, p_value])

chi_squared

[[1.295042460408538, 0.25512076479610835],
 [1.295042460408538, 0.25512076479610835],
 [1.5443509082853344, 0.21397134128528295],
 [0.7721754541426672, 0.3795448984353682],
 [0.03360895727560264, 0.8545410902144307],
 [1.295042460408538, 0.25512076479610835],
 [0.004366889982650542, 0.9473121844751238],
 [0.7721754541426672, 0.3795448984353682],
 [0.7721754541426672, 0.3795448984353682],
 [0.7721754541426672, 0.3795448984353682]]

In [28]:
# Transform the observed_expected values to dataframe
# high_low_counts returns (high_count, low_count), so the column labels follow that order
results = pd.DataFrame(observed_expected, index = comparison_terms, columns = ["High value count", "Low value count"])

# Add the chi square and p values as columns
results["Chi"] = chi_squared
results[["Chi Square", "p value"]] = pd.DataFrame(results.Chi.tolist(), index= results.index)
results.drop("Chi", axis=1, inplace = True)

# Display the results
print(results)

            High value count  Low value count  Chi Square   p value
dreamily                   1                0    1.295042  0.255121
hardings                   1                0    1.295042  0.255121
carolyn                    0                2    1.544351  0.213971
safinas                    0                1    0.772175  0.379545
liquefy                    1                1    0.033609  0.854541
trifle                     1                0    1.295042  0.255121
officially                 9               12    0.004367  0.947312
genome                     0                1    0.772175  0.379545
goulet                     0                1    0.772175  0.379545
competing                  0                1    0.772175  0.379545


None of the sampled terms has a p-value below 0.05, so for these terms there is no statistically significant difference in how often they appear in high-value versus low-value questions.
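
One caveat is worth checking before reading much into such p-values: the chi-squared approximation is usually considered reliable only when every expected count is about 5 or more, and most sampled terms occur only once or twice. The sketch below recomputes the statistic in plain Python and also reports the smallest expected count. The function name and the high-value share of `0.4` are assumptions for illustration; the p-value uses the identity for one degree of freedom, $p = \operatorname{erfc}(\sqrt{\chi^2 / 2})$.

```python
import math

def chi_squared_1df(observed_high, observed_low, p_high):
    """Return (statistic, p-value, smallest expected count) for one term."""
    total = observed_high + observed_low
    expected_high = total * p_high
    expected_low = total * (1 - p_high)
    stat = ((observed_high - expected_high) ** 2 / expected_high
            + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with one degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, min(expected_high, expected_low)

# Hypothetical share of high-value rows; a term seen once, in a high-value question
stat, p_value, min_expected = chi_squared_1df(1, 0, p_high=0.4)
print(round(stat, 3), round(p_value, 3), min_expected)

if min_expected < 5:
    print("Expected counts are too small for the test to be reliable.")
```

Restricting the test to terms whose expected counts pass this check would be a natural filter before ranking terms by chi-squared value.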

## Conclusions 

In this project, we used a dataset of Jeopardy questions to look for patterns that could help a contestant win. We found that:

* On average, about 6% of the words in an answer also appear in the question, so the chance of deducing the answer from the question is quite low.
* About 69% of the complex words (six or more characters) in questions have appeared in earlier questions, so studying past questions could be helpful.
* For the sample of words we analyzed, there is no statistically significant difference between how often they appear in high-value and low-value questions.

A possible next step is to find the high-value questions that contain the terms most strongly associated with high values. Those questions could then be recommended for study.