# Jeopardy Questions

<img src="https://images.unsplash.com/photo-1604815887789-c076c46a1110?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=871&q=80" width="800" height="100">

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `jeopardy.csv`, and contains `20000` rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Here's the beginning of the file:

<img src="https://dq-content.s3.amazonaws.com/Nlfu13A.png" width="800" height="100">

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

## Load the dataset

We should read the dataset into a DataFrame and examine its structure to become familiar with it.

In [1]:
# Import required libraries
import re
import random
import numpy as np
import pandas as pd
from scipy.stats import chisquare

In [2]:
# Read the dataset
jeopardy = pd.read_csv('jeopardy.csv')

# View first five rows
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


We will use the `str.startswith()` and `str.endswith()` methods to check whether any column names have leading or trailing spaces.

In [3]:
# Check for leading whitespaces in column names
has_leading_spaces = jeopardy.columns.str.startswith(' ').any()

# Check for trailing whitespaces in column names
has_trailing_spaces = jeopardy.columns.str.endswith(' ').any()

# Display the output
if has_leading_spaces or has_trailing_spaces:
    print("The DataFrame has column names with leading or trailing whitespaces.")
else:
    print("The DataFrame does not have column names with leading or trailing whitespaces.")

The DataFrame has column names with leading or trailing whitespaces.


Based on the output above, some of the column names contain leading or trailing whitespace. We can fix this by stripping the whitespace from the column names.

In [4]:
# Remove the leading and trailing whitespaces
jeopardy.columns = jeopardy.columns.str.strip()

has_leading_spaces = jeopardy.columns.str.startswith(' ').any()
has_trailing_spaces = jeopardy.columns.str.endswith(' ').any()
if has_leading_spaces or has_trailing_spaces:
    print("The DataFrame has column names with leading or trailing whitespaces.")
else:
    print("The DataFrame does not have column names with leading or trailing whitespaces.")

The DataFrame does not have column names with leading or trailing whitespaces.
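
The prefix and suffix checks above only look for a literal space character. A slightly more general sketch compares each name with its stripped form, which also flags tabs and newlines and reports which names are affected. The helper name and the column names below are made up for illustration; with the real data, `jeopardy.columns` could be passed instead.

```python
def names_with_outer_whitespace(names):
    """Return the names that change when leading/trailing whitespace is stripped."""
    return [name for name in names if name != name.strip()]

# Hypothetical column names, for illustration only
example_columns = ["Show Number", " Air Date", " Round", "Category\t"]
print(names_with_outer_whitespace(example_columns))
```

An empty list means every name is already clean.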


## Normalizing Text

Before we can begin analyzing the Jeopardy questions, we need to normalize all of the text columns, such as the `Question` and `Answer` columns. Normalization involves converting all words to lowercase and removing punctuation, to ensure that similar words are not treated differently when comparing them. For example, both `Don't` and `don't` would be converted to `dont`.

In [5]:
# Function to convert string into lowercase and remove all punctuation
def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text
    
# Normalize 'Question' column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

# Normalize 'Answer' column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

# View results
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
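
To see what the regular expression in `normalize_text` keeps and drops, here is a small standalone check. The pattern `[^\w\s]` removes every character that is neither a word character nor whitespace; since `\w` includes digits and the underscore, those survive. The sample strings are invented for illustration.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop punctuation (same logic as the function above)."""
    return re.sub(r'[^\w\s]', '', text.lower())

# Invented sample strings, for illustration only
samples = ["Don't", "McDonald's", "No. 2: 1912 Olympian;", "snake_case stays"]
for sample in samples:
    print(repr(sample), '->', repr(normalize_text(sample)))
```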


## Normalizing Columns

After normalizing the text columns, there are still some other columns that need to be normalized.

First, we need to convert the `Value` column to a numeric format, to make it easier to work with. This involves removing the dollar sign from the beginning of each value and converting the column from text to numeric.

Additionally, we need to convert the `Air Date` column to a datetime format, instead of a string. This will allow us to work with it more easily during analysis.

In [6]:
# Function to normalize dollar values and also convert string to integer
def normalize_value(value):
    value = re.sub(r'[^\w\s]', '', value)
    try:
        # convert the string to an integer
        value = int(value)
    except ValueError:
        # if conversion fails, assign 0 instead
        value = 0
    return value
    

# Normalize 'Value' column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

# Convert 'Air Date' from string to datetime data type
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

# View results
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
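
The same conversion can also be sketched without a row-wise `apply`, using pandas string methods. This is an alternative under assumptions, not the approach used above: it assumes every entry in `Value` is a string, and it treats any entry without digits (such as `None`) as `0`. The sample values are illustrative.

```python
import pandas as pd

values = pd.Series(["$200", "$2,000", "None"])  # illustrative sample

clean = (
    values.str.replace(r"[^\d]", "", regex=True)  # keep digits only
          .pipe(pd.to_numeric, errors="coerce")   # empty strings become NaN
          .fillna(0)                              # no digits at all -> 0
          .astype(int)
)
print(clean.tolist())
```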

## Answers in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it helps to answer two questions:

- How often is the answer contained in the question itself?
- How often are questions repeated?

We can answer the second question by examining how often complex words (i.e., those with six or more characters) recur in the dataset. To answer the first question, we need to determine how often words in the answer also appear in the question.

For now, we will focus on the first question and address the second one later.

In [7]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1            
    
    return match_count/len(split_answer)


# Create column to count match
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 |
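
A toy example makes the ratio concrete. The sketch below applies the same idea to plain strings; the helper name `match_fraction` and the sample texts are hypothetical. It differs from `count_matches` in one small way: it drops every occurrence of `the`, whereas `list.remove` drops only the first.

```python
def match_fraction(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Invented answer/question pairs, for illustration only
print(match_fraction("the appalachian trail", "this trail runs from georgia to maine"))
print(match_fraction("copernicus", "galileo was under house arrest"))
```

In the first pair, one of the two remaining answer words (`trail`) appears in the question, so the ratio is one half; in the second pair nothing matches.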


Now we find the mean of the `answer_in_question` column using the [mean](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html) method on Series.

In [8]:
round(jeopardy['answer_in_question'].mean(), 3)

0.059

After calculating the mean of the `answer_in_question` column, which measures the frequency with which words in the answer also appear in the question, we can use this information to inform our studying strategy for Jeopardy.

The resulting mean of 0.059 means that, on average, only about 6% of the words in an answer also appear in its question. In other words, we usually cannot work out the answer from the wording of the question alone. Instead, it may be more effective to focus on general knowledge and broaden one's range of expertise.

## Recycled Questions

We want to investigate how often new questions repeat older ones. We only have about `10%` of the full Jeopardy question dataset, but we can still get a rough idea.

To do this, we can:

- Sort `jeopardy` in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.

This allows us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.

In [9]:
# Create an empty list and set
question_overlap = []
terms_used = set()

# Sort 'jeopardy' by ascending air date
sorted_jeopardy = jeopardy.sort_values(by='Air Date')
for i, row in sorted_jeopardy.iterrows():
    # Split row on the space character
    split_question = row['clean_question'].split(' ')
    # Remove words less than 6 characters long
    split_question = [word for word in split_question if len(word) >= 6]
    
    # Match counter
    match_count = 0
    for word in split_question:
        # Add 1 to 'match_count' if word occurs
        if word in terms_used:
            match_count += 1
            
    # Add each word to 'terms_used'
    for word in split_question:
        terms_used.add(word)
    # Convert the count to a proportion if any long words remain
    if len(split_question) > 0:
        match_count /= len(split_question)
    # Append results to 'question_overlap'
    question_overlap.append(match_count)

# Create the column 'question_overlap'
sorted_jeopardy['question_overlap'] = question_overlap
# Calculate mean of column 'question_overlap'
round(sorted_jeopardy['question_overlap'].mean(), 3)

0.688

A `question_overlap` value of `0.688` means that, on average, about `69%` of the words with six or more characters in a question have already appeared in earlier questions. This suggests that questions are recycled to some extent, and that studying previous questions could be a useful strategy. However, this analysis uses only a fraction of the full dataset, and it measures overlap of single words rather than whole phrases, so it does not show that entire questions are repeated. It also ignores factors such as category and difficulty, which could affect how likely a question is to be reused.
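
To make the procedure concrete, here is a minimal sketch of the same running-overlap idea on three invented questions, assumed to be already in chronological order. The function name `overlap_scores` and the `min_len` parameter are assumptions for this example.

```python
def overlap_scores(questions, min_len=6):
    """For each question, return the share of its long words already seen earlier."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        if words:
            scores.append(sum(w in seen for w in words) / len(words))
        else:
            scores.append(0)
        seen.update(words)  # remember these words for later questions
    return scores

# Invented questions, for illustration only
toy_questions = [
    "the ancient egyptian pharaoh",
    "another egyptian pharaoh ruled",
    "a modern painter",
]
print([round(score, 3) for score in overlap_scores(toy_questions)])
```

Only the second question reuses earlier long words (`egyptian` and `pharaoh`, two of its three), even though it is a different question from the first.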

## Low Value vs High Value Questions

To increase our chances of earning more money on Jeopardy, we can focus on studying high-value questions rather than low-value questions. We can use a chi-squared test to identify which terms are associated with high-value questions.

To do this, we need to split the questions into two categories based on their value:

- Low value -- Any row where `Value` is `800` or less.
- High value -- Any row where `Value` is greater than `800`.

Next, we can iterate through each term in the `terms_used` set and:

- Count the number of times the term appears in low value questions.
- Count the number of times the term appears in high value questions.
- Calculate the percentage of questions the term appears in.
- Based on the percentage of questions the term appears in, calculate the expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

Finally, we can identify the terms that have the largest differences in usage between high and low value questions by selecting the terms with the highest chi-squared values. However, since computing this for all terms can be time-consuming, we'll just do it for a small sample.

In [10]:
def clean_value(row):
    # Flag the row as high value (1) if it is worth more than $800, else low value (0)
    return 1 if row['clean_value'] > 800 else 0

# Assign result to the new column 'high_value'
sorted_jeopardy['high_value'] = sorted_jeopardy.apply(clean_value, axis=1)
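
Because the rule is a single comparison, the same flag can also be computed with a vectorized expression instead of a row-wise `apply`. A minimal sketch on illustrative values, assuming `clean_value` is already an integer column:

```python
import pandas as pd

toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})  # illustrative values

# Boolean comparison converted to 1/0; a value of exactly 800 counts as low
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy["high_value"].tolist())
```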

In [11]:
def count_usage(word):
    low_count = 0
    high_count = 0
    
    for _, row in sorted_jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [12]:
# Set the seed
random.seed(42)

# Convert 'terms_used' from set into list
terms_used_list = list(terms_used)

# Pick random ten elements of 'terms_used'
comparison_terms = random.sample(terms_used_list, 10)

# Create an empty list
observed_expected = []

# Loop over 'comparison_terms'
for word in comparison_terms:
    observed_expected.append((count_usage(word)))
    
# View results
observed_expected

[(0, 1),
 (1, 0),
 (0, 1),
 (0, 3),
 (4, 13),
 (0, 1),
 (0, 1),
 (1, 2),
 (0, 1),
 (1, 2)]
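
`count_usage` scans the whole DataFrame once per term, which is what makes the full computation slow. One possible alternative, sketched here on toy data, is to tally every word in a single pass with `collections.Counter` and then look terms up. The names `high_counts`, `low_counts`, and `count_usage_fast` are hypothetical.

```python
from collections import Counter

# Toy (clean_question, high_value) pairs standing in for the DataFrame rows
rows = [
    ("ancient egyptian pharaoh", 1),
    ("egyptian river delta", 0),
    ("modern painter egyptian egyptian", 0),
]

high_counts, low_counts = Counter(), Counter()
for question, is_high in rows:
    words = set(question.split(' '))  # each question counts once per word
    if is_high == 1:
        high_counts.update(words)
    else:
        low_counts.update(words)

def count_usage_fast(word):
    """Return (high_count, low_count), in the same order as count_usage."""
    return high_counts[word], low_counts[word]

print(count_usage_fast("egyptian"))
print(count_usage_fast("pharaoh"))
```

Using a set per question mirrors the original membership test, which counts a question once no matter how often the word appears in it.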

## Applying the Chi-squared Test

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [13]:
# Find number of rows where 'high_value' is 1 and 0
high_value_count = (sorted_jeopardy['high_value'] == 1).sum()
low_value_count = (sorted_jeopardy['high_value'] == 0).sum()

# Create an empty list
chi_squared = []

# Loop through 'observed_expected'
for obs in observed_expected:
    # Total occurrences of the term (high count + low count)
    total = sum(obs)
    # Calculate total proportion across dataset
    total_prop = total/len(sorted_jeopardy)
    # Calculate expected word count for high and low value rows
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    # Compute chi-squared value and p-value given the expected and observed counts
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
# View results
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=0.21978793356318777, pvalue=0.6392015498682628),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]
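
For reference, the statistic that `chisquare` returns is the sum of `(observed - expected)**2 / expected` over both groups. The sketch below computes it by hand. The group sizes `5734` and `14265` are assumptions: they are not printed above, and were chosen because they appear to reproduce the statistics in the output.

```python
def chi_squared_stat(observed, group_sizes):
    """Chi-squared statistic for one term, given its observed count in each group."""
    total_rows = sum(group_sizes)
    total_obs = sum(observed)
    stat = 0.0
    for obs, size in zip(observed, group_sizes):
        # Expected count if the term were spread in proportion to group size
        expected = total_obs * size / total_rows
        stat += (obs - expected) ** 2 / expected
    return stat

# Assumed (high, low) row counts; not shown in the notebook output
assumed_sizes = (5734, 14265)

print(chi_squared_stat((0, 1), assumed_sizes))
print(chi_squared_stat((1, 0), assumed_sizes))
```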

## Chi-Squared Results

Looking at the chi-squared values and their associated p-values, none of the results is statistically significant at a significance level of 0.05. The result with the smallest p-value is the 2nd one, with a chi-squared value of about 2.49 and a p-value of about 0.115.

Since every p-value is greater than 0.05, we cannot reject the null hypothesis that these terms are used equally often in high and low value questions. Most of the sampled terms also occur only a handful of times, so the expected counts are small and the chi-squared test is not very reliable for them.

However, it is important to note that we have only looked at a small sample of terms, so it is possible that there are other terms with significant differences in usage between high and low value questions that we have not analyzed. Additionally, it is possible that there are confounding variables that we have not taken into account that could affect the results.
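
One way to act on these caveats is to screen out rare terms before testing them. A common rule of thumb is that a chi-squared test is only trustworthy when every expected count is at least 5. The sketch below applies that rule; the helper names are hypothetical, and the group sizes `5734` and `14265` are assumed for illustration rather than taken from the notebook output.

```python
def min_expected_count(total_obs, group_sizes):
    """Smallest expected count across groups for a term seen total_obs times."""
    total_rows = sum(group_sizes)
    return min(total_obs * size / total_rows for size in group_sizes)

def is_testable(total_obs, group_sizes, threshold=5):
    """Rule-of-thumb check: every expected count should reach the threshold."""
    return min_expected_count(total_obs, group_sizes) >= threshold

assumed_sizes = (5734, 14265)  # assumed (high, low) row counts, illustrative only

for total in (1, 3, 17, 18):
    print(total, is_testable(total, assumed_sizes))
```

Under these assumed sizes, even a term that appears 17 times falls just short of the threshold, which suggests sampling from frequent terms rather than from all of `terms_used`.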