# Statistical Insights into Jeopardy Questions

## 1. Introducing and Loading the Data

*Jeopardy* is a popular TV show in the US where participants answer questions to win money. It's been running for many years and has become a major force in popular culture. Imagine we want to compete on *Jeopardy* and are searching for any advantage to help us win. In this project, we'll work with a dataset of *Jeopardy* questions to identify patterns that could give us an edge.

The dataset, named `jeopardy.csv`, contains `19,999` rows from the start of a full *Jeopardy* question dataset, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Each row represents a single question from a specific episode of *Jeopardy*. Below are explanations of each column:

- `Show Number` – the episode number of *Jeopardy*. 
- `Air Date` – the date the episode aired.
- `Round` – the round of *Jeopardy*.   
- `Category` – the category of the question.   
- `Value` – the dollar amount the correct answer is worth. 
- `Question` – the text of the question.
- `Answer` – the text of the answer.

To begin, let's import the necessary libraries and load the dataset.

In [1]:
# Import the relevant libraries
from scipy.stats import chisquare
from random import choice
import pandas as pd
import numpy as np
import re

# Load the Jeopardy dataset, and display the first few rows and information of the dataset
jeopardy = pd.read_csv("Datasets/jeopardy.csv")
display(jeopardy.head())
jeopardy.info()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1    Air Date    19999 non-null  object
 2    Round       19999 non-null  object
 3    Category    19999 non-null  object
 4    Value       19663 non-null  object
 5    Question    19999 non-null  object
 6    Answer      19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


The dataset contains `7` columns; the `Show Number` is an integer, while the rest of the columns are stored as objects. Also, there are some missing values in the `Value` column (`336` entries). Except for the first column, there are spaces at the beginning of the remaining column names, which need to be removed.

## 2. Renaming and Normalizing Columns

Before we can start analyzing the *Jeopardy* questions, we need to normalize the `Question` and `Answer` columns. This involves converting words to lowercase and removing punctuation, so words like `Don't` and `don't` will not be treated as different when compared.

For the `Value` column, we need to remove the dollar sign from the beginning of each value and convert the column from text to numeric. In addition, the `Air Date` column should be converted to a datetime format instead of a string.

To get started, we'll first rename our columns to remove the leading spaces.

In [2]:
# Display the column names before the change
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Rename columns to remove leading spaces and display the updated column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Next, we'll create two functions named `normalize_text` and `normalize_value` to help us transform the data. Before that, we'll drop all rows with missing values in the `Value` column, leaving us with `19,663` observations.

In [4]:
# Remove rows with missing values (only the "Value" column has any)
jeopardy.dropna(inplace=True)

# Normalize text by converting to lowercase, removing punctuation, and extra spaces
def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # Remove everything except letters, digits, and whitespace
    text = re.sub(r"\s+", " ", text)            # Collapse runs of whitespace into a single space
    return text

# Remove non-numeric characters from values and convert to integer
def normalize_value(text):
    text = re.sub("[^0-9]", "", text)
    text = int(text)
    return text

# Apply normalization functions to the "Question", "Answer", and "Value" columns
jeopardy["Question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["Answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["Value"] = jeopardy["Value"].astype(str).apply(normalize_value)

# Convert 'Air Date' column to datetime format
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,200,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,signer of the dec of indep framer of the const...,john adams


As we see, the questions and answers have been normalized, with text in lowercase and punctuation removed, ensuring consistency for easier analysis and comparison. Also, the dollar sign has been removed from the beginning of each value.
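
To make the effect of the two helpers concrete, here is a small standalone sketch that reimplements them and applies them to a few strings. `McDonald's` comes from the table above; the other inputs (`Don't`, `$2,000`) are made-up examples, not rows taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop punctuation, and collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_value(text):
    # Keep only the digits and convert them to an integer
    return int(re.sub(r"[^0-9]", "", text))

# "Don't" and "don't" collapse to the same token after normalization
print(normalize_text("Don't"), normalize_text("don't"))
print(normalize_text("McDonald's"))
print(normalize_value("$2,000"))
```

Raw strings (`r"..."`) are used for the patterns so that `\s` is passed to the regex engine unchanged.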

## 3. Determining Answer Presence in Questions

To decide whether to study past questions or focus on general knowledge, it helps to assess how often the words of an answer appear in its own question, and how frequently questions are repeated.

We can address the first inquiry by examining how many times words in answers also appear in questions, while the second inquiry can be explored by analyzing how often complex words in questions (with at least `6` characters) recur.

Note that words such as `the` and `than` are commonly found in answers and questions, but they don't provide meaningful assistance in identifying the answer. Therefore, we will remove these words from the answers. For now, let's focus on the first inquiry.

In [5]:
# Count the number of matches between the words in the answer and the question for each row
def count_matches(row):
    
    # Split the answer and question into individual words
    split_answer = row["Answer"].split()
    split_question = row["Question"].split()
    
    # Remove common unhelpful words from the answer
    # (list.remove drops only the first occurrence of each word)
    for word in ['the', 'than']:
        if word in split_answer:
            split_answer.remove(word)
    
    # Count the number of words in the answer that match words in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    # Return 0 if there are no valid words in the answer
    if len(split_answer) == 0:
        return 0
    
    # Return the proportion of matched words to total valid words in the answer
    return match_count / len(split_answer)

# Apply the function to each row, create a new column for the results,
# and calculate and round the mean of the column to 4 decimal places
jeopardy["Answer in Question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["Answer in Question"].mean().round(4)

0.0594

On average, only about `5.94%` of an answer directly overlaps with its corresponding question. This isn't a significant number, indicating that we can't rely solely on hearing a question to determine the answer. Therefore, we'll need to find a more effective method for identifying the correct answer.
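
The per-row proportion can be traced by hand on a few toy inputs. The sketch below is a standalone variant of the matching logic, not the notebook's exact function: the name `answer_overlap` and the sample strings are hypothetical, and it filters out every occurrence of `the` and `than` rather than only the first one.

```python
STOPWORDS = {"the", "than"}

def answer_overlap(answer, question):
    # Keep only the answer words that are not stopwords
    answer_words = [word for word in answer.split() if word not in STOPWORDS]
    if not answer_words:
        return 0
    # A set makes each membership check constant time
    question_words = set(question.split())
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Every remaining answer word appears in the question
print(answer_overlap("the art linkletter show", "in 1963 live on the art linkletter show"))
# Two of the three answer words ("new", "city") appear in the question
print(answer_overlap("new york city", "this city in new jersey"))
# No overlap at all
print(answer_overlap("jim thorpe", "1912 olympian football star"))
```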

## 4. Evaluating Overlap Between New and Old Questions

Let’s say we want to investigate how often new questions are repeats of older ones. While we can’t fully answer this question since we only have about `10%` of the complete *Jeopardy* questions dataset, we can at least explore it.

Focusing only on words with `6` or more characters will help us filter out common words like `the` and `than`, which do not provide much insight into the questions.

In [6]:
# Initialize a list for overlap proportions and a set to track terms used
overlap_proportions = []
terms_used = set()

# Sort the jeopardy DataFrame by "Show Number" and "Air Date"
jeopardy = jeopardy.sort_values(by=["Show Number", "Air Date"])

# Iterate over each row in the dataset
for i, row in jeopardy.iterrows():
    
    # Split the question into individual words and filter for words longer than 5 characters
    split_question = row["Question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    
    # Count how many of the words from the current question have been used in previous questions
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    
    # Add the words from the current question to the set of terms used
    for word in split_question:
        terms_used.add(word)
    
    # Calculate the proportion of matched words to total long words in the question
    if len(split_question) > 0:
        match_count /= len(split_question)
    overlap_proportions.append(match_count)

# Add the question overlap values to the DataFrame and calculate the mean
jeopardy["Question Overlap"] = overlap_proportions
jeopardy["Question Overlap"].mean().round(4)

0.6843

On average, about `68.43%` of the long terms in a question have already appeared in an earlier question. This analysis covers only a small share of all questions and looks at individual terms rather than phrases, so the figure does not show that whole questions are reused. It does suggest that further investigation into the recycling of questions may be worthwhile.
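
A tiny trace may help show how this measure behaves. The following sketch runs the same running-set idea over three invented questions (they are not from the dataset): the first question can never overlap with anything, and later ones score higher simply because the set of seen terms keeps growing.

```python
# Hypothetical, already-normalized questions in airing order
questions = [
    "galileo studied planets",
    "planets orbit the sun",
    "galileo observed jupiter planets",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words with 6 or more characters
    words = [word for word in question.split() if len(word) > 5]
    if words:
        seen_before = sum(word in terms_used for word in words)
        overlaps.append(seen_before / len(words))
    else:
        overlaps.append(0)
    # Only now add this question's words, so a question never matches itself
    terms_used.update(words)

print(overlaps)
```

Because the score depends on how many questions came before, a high average overlap on its own says more about shared vocabulary than about repeated questions.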

## 5. Assessing Term Frequencies by Question Value

Let's say we want to mainly focus on studying high-value questions, which can help us earn more money when we're on *Jeopardy*. We can determine which terms correspond to high-value questions using a **chi-squared** test. However, we first need to categorize the questions into two groups:
- `High value` – Any row where the `Value` is greater than `800` (a bit above the mean of `748.34`).
- `Low value` – Any row where the `Value` is `800` or less.

To identify the words with the most significant differences in usage between high- and low-value questions, we can select the words with the highest associated chi-squared values. Analyzing all the words would be time-consuming, so we'll focus on a small sample for now.

In [7]:
# Classify whether the question value is high (1) or low (0)
def classify_value(row):
    value = 0
    if row["Value"] > 800:
        value = 1
    return value

# Apply the classification to each row and create a new "High Value" column
jeopardy["High Value"] = jeopardy.apply(classify_value, axis=1)
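
The boundary case is worth checking explicitly, since the comparison is a strict `>`. Below is a minimal standalone sketch of the same rule applied to a few made-up values; the `threshold` parameter is an addition for illustration. In pandas, the row-wise `apply` could likely be replaced by the vectorized expression `(jeopardy["Value"] > 800).astype(int)`.

```python
def classify_value(value, threshold=800):
    # Strictly greater than the threshold counts as high value (1)
    return 1 if value > threshold else 0

# Hypothetical dollar values, including the boundary itself
values = [200, 800, 1000, 2000]
labels = [classify_value(value) for value in values]
print(labels)
```

A question worth exactly `800` lands in the low-value group.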

Next, we’ll determine the frequency of specific terms in high and low value questions by selecting a random sample and comparing the results.

In [8]:
# Count the number of times a term appears in high and low value questions
def count_usage(term):
    
    # Initialize counts for low and high value questions
    low_count = 0
    high_count = 0
    
    # Iterate over each row to check if the term is present in the question
    for i, row in jeopardy.iterrows():
        if term in row["Question"].split(" "):
            
            # Increment the appropriate count based on whether the question is high or low value
            if row["High Value"] == 1:
                high_count += 1
            else:
                low_count += 1
                
    # Return the counts of the term occurrence in high and low value questions
    return high_count, low_count


# Select 10 random terms from the list of terms used
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for i in range(10)]

# Count the occurrences of each selected term in high and low value questions
term_frequencies = []
for term in comparison_terms:
    term_frequencies.append(count_usage(term))

# Convert term frequencies into a DataFrame with specified column names
term_freq_df = pd.DataFrame(term_frequencies, columns=["High Value Count", "Low Value Count"])
term_freq_df

Unnamed: 0,High Value Count,Low Value Count
0,0,1
1,0,1
2,0,3
3,1,0
4,1,0
5,1,0
6,1,0
7,0,1
8,1,0
9,0,1


Of the `10` randomly selected terms, half appear exclusively in low-value questions and the other half appear exclusively in high-value questions. However, each term occurs only `1` to `3` times in total, so these counts are far too small to conclude that any term is associated with either group.
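
Because the terms are drawn with `choice` and no seed, rerunning the cell produces a different sample, and the same term can be picked more than once. One possible way to make the draw reproducible and duplicate-free is sketched below; the seed value and the toy `terms_used` set are assumptions made for illustration.

```python
import random

# Hypothetical stand-in for the set of long terms collected earlier
terms_used = {"galileo", "planets", "jupiter", "olympian", "carlisle", "linkletter"}

# A dedicated, seeded generator keeps the sample stable across runs
rng = random.Random(42)

# Sorting first gives a deterministic order; sample() draws without replacement
comparison_terms = rng.sample(sorted(terms_used), 3)
print(comparison_terms)
```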

## 6. Running the Chi-Squared Test

Now that we've found the frequencies of `10` different terms in high and low-value questions, we can compute the expected counts for these terms and their respective chi-squared statistics.

In [9]:
# Count the number of high and low value questions
high_val_questions = jeopardy[jeopardy["High Value"] == 1].shape[0]
low_val_questions = jeopardy[jeopardy["High Value"] == 0].shape[0]
chi_squared = []

# Iterate through the term frequencies
for observed in term_frequencies:
    
    # Calculate the total frequency and proportion of the term relative to all questions
    total = sum(observed)
    total_prop = total / jeopardy.shape[0]
    
    # Compute expected counts based on the total proportion
    high_value_exp = total_prop * high_val_questions
    low_value_exp = total_prop * low_val_questions
    
    # Create arrays for observed and expected frequencies, 
    # and append the chi-squared result for the current term
    observed = np.array([observed[0], observed[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

# Display the chi-squared values
chi_squared

[Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=1.234977385311221, pvalue=0.26644123439963835),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591),
 Power_divergenceResult(statistic=2.429194279734914, pvalue=0.11909409782120144),
 Power_divergenceResult(statistic=0.41165912843707375, pvalue=0.5211285963246591)]
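
The repeated values in this output follow directly from the formula. With two categories, the test has one degree of freedom, and a term seen once in a low-value question always yields the ratio of high-value to low-value questions as its statistic. The sketch below recomputes the statistic and p-value by hand, without SciPy; the group sizes `300` and `700` are hypothetical and chosen only to keep the arithmetic readable, and `chi_squared_1df` is an illustrative helper name.

```python
from math import erfc, sqrt

def chi_squared_1df(observed, expected):
    """Return (statistic, p_value) for a two-category chi-squared test."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the survival function is erfc(sqrt(x / 2))
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical group sizes (not the real counts from the dataset)
high_total, low_total = 300, 700
all_questions = high_total + low_total

# A term seen once, in a low-value question: (high count, low count)
observed = (0, 1)
term_total = sum(observed)
expected = (term_total * high_total / all_questions,
            term_total * low_total / all_questions)

statistic, p_value = chi_squared_1df(observed, expected)
print(expected, statistic, p_value)
```

Here the statistic equals `300 / 700`. The same reasoning explains why, in the output above, the term seen three times in low-value questions has a statistic exactly three times that of the terms seen once.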

None of the terms show a statistically significant difference in usage between high-value and low-value questions: the lowest p-value is `0.119`, well above the conventional `0.05` threshold. In addition, no sampled term appears more than `3` times, so the expected counts are too small for the chi-squared approximation to be reliable. It would be more meaningful to conduct the chi-squared test using only terms with higher frequencies.
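
One way to find higher-frequency terms is to count every long term in a single pass and keep only those above a minimum total. The sketch below shows the idea on invented `(question, high_value)` pairs; the rows, the names, and the cutoff `min_total = 2` are all assumptions for illustration. A real analysis would pick a cutoff large enough that expected counts reach a commonly used minimum such as `5`.

```python
from collections import Counter

# Hypothetical (normalized question, high-value flag) pairs
rows = [
    ("galileo studied planets", 1),
    ("planets orbit the sun", 0),
    ("galileo observed jupiter", 0),
    ("jupiter has many moons", 0),
]

high_counts, low_counts = Counter(), Counter()
for question, high_value in rows:
    # set() counts a term at most once per question
    for term in set(question.split()):
        if len(term) > 5:
            target = high_counts if high_value == 1 else low_counts
            target[term] += 1

# Keep only terms whose combined count reaches the cutoff
min_total = 2
frequent_terms = {
    term: (high_counts[term], low_counts[term])
    for term in set(high_counts) | set(low_counts)
    if high_counts[term] + low_counts[term] >= min_total
}
print(frequent_terms)
```

This also avoids rescanning the whole dataset once per term.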

## 7. Conclusion

In this project, we analyzed a dataset of *Jeopardy* questions to explore any potential strategies for winning. The dataset originally contained `19,999` rows, where each row represented a single question from a specific episode of *Jeopardy*. It included `7` columns, with the `Show Number` as an integer and the rest stored as objects.

We began by renaming the columns to remove leading spaces, then dropped rows with missing values, reducing the dataset to `19,663` observations. To ensure consistency, we normalized the text in the questions and answers by converting everything to lowercase and removing punctuation. Additionally, we stripped the dollar sign from the values for easier analysis.

To determine whether studying past questions or focusing on general knowledge would be beneficial, we analyzed how often answers appear across questions, and how frequently questions are reused. We first examined the overlap between answers and questions by identifying how often words from answers also appear in the corresponding questions. Then, we explored the recurrence of complex words (`6+` characters) in questions to assess repetition.

On average, only `5.94%` of an answer directly overlaps with its corresponding question, suggesting it's not possible to predict answers just from hearing the question. In addition, around `68.43%` of terms in new questions overlap with older ones, though this analysis was limited to a small sample and focused on single words rather than entire phrases. Despite its limitations, this suggests that further exploration into question recycling could be useful.

To find which terms correspond to high-value questions with a chi-squared test, we first categorized the questions into high-value and low-value groups. We then counted how often a random sample of terms appeared in each group. Of the `10` randomly chosen terms, half appeared only in low-value questions and the other half only in high-value questions.

We then calculated the expected counts for these `10` terms and their chi-squared statistics. None of the terms showed a significant difference in usage between high- and low-value questions, with the lowest p-value being `0.119`. Additionally, because the frequencies of the selected terms did not exceed `3`, the chi-squared test's validity was diminished. A more meaningful analysis would involve applying the chi-squared test to terms with higher occurrences.