# Jeopardy

### Intro

In this project, we will conduct a light analysis of a dataset that contains information about the popular game show Jeopardy. Our objective is to identify key aspects of the data that can assist individuals in preparing to participate in the game.

We will start by examining the dataset to see if it reveals any useful correlations or patterns. Based on our findings, we will provide recommendations on how to effectively use the dataset for studying for Jeopardy.

The dataset we will be using, `jeopardy.csv`, includes the first 20,000 rows from a larger Jeopardy dataset. You can find the complete dataset [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/?rdt=44838). Let's begin by importing the necessary libraries and data.

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np

# Read in the data
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky


As we can see in the results above, our dataset includes several columns that cover various aspects of 19,999 questions (rows) from Jeopardy. Before we can analyze the dataset for insights, it is essential to ensure that the data is clean. In the code below, we will begin by standardizing and cleaning our column names.

In [2]:
# observing original column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# cleaning column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
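
As a hedged alternative to typing the names out by hand, the leading spaces could be stripped programmatically. The sketch below uses a toy DataFrame (an assumption for illustration, not the real data) with the same leading-space problem:

```python
import pandas as pd

# Toy frame whose column names mimic the leading spaces in the raw header
df = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Index.str.strip() removes leading and trailing whitespace from every name
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This keeps working if columns are added or reordered, whereas a hard-coded list must match the file exactly.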

With the column names cleaned, we next check the data type and non-null count of each column before cleaning the columns themselves.

In [4]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19663 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Cleaning

Our first task is to determine whether this dataset contains information that can be used for studying or preparing for Jeopardy. To achieve this, it would be helpful to know two things:

- How often do words from the answer appear in the question?
- Are questions frequently repeated?

To address these questions, we need to clean and format the `Question` and `Answer` columns. In the code below, we create a function that lowercases the text and removes punctuation, and we store the cleaned text in new columns for later analysis. We then clean the `Value` column and convert the `Air Date` column to a datetime type. The `Value` column has 336 missing entries (19,663 non-null values out of 19,999); the cleaning function below converts these, along with any other non-numeric values, to 0.

In [5]:
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
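
To see what the normalization does, here is a small standalone check. The function is copied from the cell above, and the sample strings are taken from or modeled on the rows shown earlier:

```python
import re

def normalize_text(text):
    # Same logic as the notebook cell: lowercase, drop punctuation,
    # collapse runs of whitespace into a single space
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

samples = ["McDonald's", "No. 2:  1912 Olympian", "...& the New Power Generation"]
for s in samples:
    print(repr(normalize_text(s)))
```

Note that a string starting with punctuation followed by a space keeps a leading space after cleaning. This does not affect the word-level comparisons below, but adding `.strip()` would be a reasonable refinement.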

In [7]:
def normalize_values(text):
    text = str(text)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
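
A quick standalone check of the value cleaning, with the function copied from the cell above. The inputs are illustrative examples of the kinds of values we would expect in the `Value` column, including a missing entry:

```python
import re

def normalize_values(text):
    # Same logic as the notebook cell: strip punctuation, then try int()
    text = str(text)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

for raw in ["$200", "$2,000", "None", float("nan")]:
    print(repr(raw), "->", normalize_values(raw))
```

Anything that cannot be parsed as an integer, including missing values, becomes 0, so such rows are treated as low value later on.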

In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [10]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,200
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,200
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,200
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200


### Answers in Questions

Now that the relevant columns are clean, we will write a function that computes, for each row, the fraction of the answer's words that also appear in the question (ignoring "the").

In [11]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
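
To make the returned fraction concrete, the sketch below runs the same function on a few made-up rows. Plain dictionaries stand in for DataFrame rows, and the questions are invented for illustration:

```python
def count_matches(row):
    # Same logic as the notebook cell
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows, not taken from the dataset
rows = [
    {"clean_answer": "the great gatsby", "clean_question": "this great novel by f scott fitzgerald"},
    {"clean_answer": "arizona", "clean_question": "the city of yuma in this state"},
    {"clean_answer": "the", "clean_question": "this word is the most common in english"},
]
scores = [count_matches(r) for r in rows]
print(scores)
```

In the first row, one of the two remaining answer words ("great") appears in the question, so the score is 0.5. An answer that consists only of "the" scores 0.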

In [12]:
print(jeopardy["answer_in_question"].mean())

0.05900196524977763


The result shows that, on average, only about 6% of the words in an answer also appear in its question. Answers can therefore rarely be read off from the wording of the question, so this is not a productive study strategy. We will now examine the next option: how often questions are reused.

### Recycled Questions

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

print(jeopardy["question_overlap"].mean())

0.6876260592169802


The value above is the average share of a question's longer words (six or more characters) that had already appeared in an earlier question: about 69%. This does not show that whole questions are repeated, because the code compares individual words rather than phrases, and we are working with a relatively small sample of questions. Still, the overlap is high enough to flag question recycling as an area of research for someone looking to enter Jeopardy. With more data and time, comparing phrases or whole questions could show how often questions are actually repeated.
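
One possible way to move from single words toward phrases is to compare runs of consecutive words (n-grams). The sketch below is only an illustration under our own assumptions: `overlap_scores` is a hypothetical helper, and the three questions are made up. With `n=1` it reproduces the word-level logic used above.

```python
def overlap_scores(questions, n=1):
    """For each question (in order), return the share of its n-grams of
    long words (more than 5 characters) already seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) > 5]
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if grams:
            scores.append(sum(g in seen for g in grams) / len(grams))
        else:
            scores.append(0)
        seen.update(grams)
    return scores

# Made-up questions, assumed to be sorted by air date already
questions = [
    "famous italian astronomer galileo",
    "italian astronomer kepler studied planets",
    "planets orbit",
]
word_scores = overlap_scores(questions, n=1)
pair_scores = overlap_scores(questions, n=2)
print(word_scores, pair_scores)
```

On this toy input the pair-level overlap is lower than the word-level overlap, which is the kind of gap we would want to measure on the full dataset before drawing conclusions about repeated questions.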

### Low Value vs. High Value Questions

Next, we will examine the terms that correlate with high-value questions. In Jeopardy, high-value questions are those that carry a larger monetary amount. The more high-value questions a contestant answers correctly, the greater their potential earnings in the game. Our initial step will be to identify and categorize questions as either high value or not. In the code below, a question will be classified as high value if it has a value exceeding 800.

In [14]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
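
The same flag could also be computed without a row-wise `apply`. This is a minimal sketch on a toy frame (the values are made up), assuming `clean_value` is already an integer column:

```python
import pandas as pd

# Toy values: 0 stands for a missing or unparseable Value
df = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# The comparison yields booleans; astype(int) turns them into 1 / 0
df["high_value"] = (df["clean_value"] > 800).astype(int)

print(df["high_value"].tolist())
```

Note that a value of exactly 800 is classified as low value, matching the strict `>` in the function above.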

Now that we have determined the value of each question, we will write a function that will count the number of times a term is used in high-value questions and the number of times it is used in low-value questions.

In [15]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if term in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
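
Because `count_usage` scans every row once per term, counting many terms this way is slow. A hedged alternative is to count all terms in a single pass. In the sketch below, `count_all_terms` is a hypothetical helper and the rows are made up:

```python
from collections import defaultdict

def count_all_terms(rows):
    """rows: iterable of (clean_question, high_value) pairs.
    Returns {term: [high_count, low_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() counts a term once per question, like `term in split_question`
        for term in set(question.split(" ")):
            counts[term][0 if high_value == 1 else 1] += 1
    return counts

# Made-up rows for illustration
rows = [
    ("galileo studied planets", 1),
    ("planets orbit the sun", 0),
    ("galileo galileo", 0),
]
counts = count_all_terms(rows)
print(dict(counts))
```

With the real data, the pairs could be supplied as `zip(jeopardy["clean_question"], jeopardy["high_value"])`.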

Using our new function, we will count how many times a term appears in high-value and low-value questions. Then, we will calculate how many times we would expect the term to appear in each group if it were unrelated to question value. From the observed and expected counts, we will compute a chi-squared statistic and its p-value, which will help us assess whether there is a significant relationship between that term and high-value questions. Analyzing every term in our `terms_used` vocabulary would be time-consuming, so we will focus on a random sample of ten terms for now.
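
The arithmetic behind this test can be checked by hand for a single term. The sketch below uses only the standard library, and the totals of 5,000 high-value and 15,000 low-value questions are hypothetical round numbers, not the real counts from our dataset:

```python
import math

def chi_squared_two_cells(observed, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term.
    observed is (high_count, low_count)."""
    total = sum(observed)
    n = high_total + low_total
    # Expected counts if the term were spread in proportion to group size
    expected = (total * high_total / n, total * low_total / n)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# A term seen 0 times in high-value and 3 times in low-value questions
stat, p = chi_squared_two_cells((0, 3), high_total=5000, low_total=15000)
print(round(stat, 3), round(p, 3))
```

Here the expected counts are 0.75 and 2.25, so the statistic is 0.75 + 0.25 = 1.0, with a p-value of about 0.32. For two cells this is the same calculation that `scipy.stats.chisquare` performs.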

In [16]:
from random import choice

# selecting our 10 random terms
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_high_low = []

# calculating how many times each term was observed in high and low value questions
for term in comparison_terms:
    observed_high_low.append(count_usage(term))

observed_high_low

[(0, 3),
 (0, 1),
 (0, 7),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 3),
 (1, 0),
 (0, 1),
 (1, 0)]

In [17]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for observed in observed_high_low:
    total = sum(observed)
    total_prop = total/jeopardy.shape[0]

    # calculating the number of times a term was expected 
    # to appear in high and low value questions
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    # calculating the chi-squared statistics and p-values for the correlation 
    # of terms to high and low value questions
    observed = np.array([observed[0], observed[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=np.float64(1.205888538380652), pvalue=np.float64(0.27214791766901714)),
 Power_divergenceResult(statistic=np.float64(0.401962846126884), pvalue=np.float64(0.5260772985705469)),
 Power_divergenceResult(statistic=np.float64(2.813739922888188), pvalue=np.float64(0.09346026076900307)),
 Power_divergenceResult(statistic=np.float64(0.401962846126884), pvalue=np.float64(0.5260772985705469)),
 Power_divergenceResult(statistic=np.float64(2.487792117195675), pvalue=np.float64(0.11473257634454047)),
 Power_divergenceResult(statistic=np.float64(0.401962846126884), pvalue=np.float64(0.5260772985705469)),
 Power_divergenceResult(statistic=np.float64(0.02636443308440769), pvalue=np.float64(0.871013484688921)),
 Power_divergenceResult(statistic=np.float64(2.487792117195675), pvalue=np.float64(0.11473257634454047)),
 Power_divergenceResult(statistic=np.float64(0.401962846126884), pvalue=np.float64(0.5260772985705469)),
 Power_divergenceResult(statistic=np.float64(2.487

The p-values for the ten sampled terms range from about 0.09 to 0.87. Because all of them are higher than 0.05, we cannot conclude that there is a meaningful link between any of these terms and whether questions are high-value or low-value. The chi-squared statistics are correspondingly small, ranging from about 0.03 to 2.81. Note also that each sampled term appears at most seven times, so at least one expected count per term falls below the common rule of thumb of five, which makes the chi-squared test unreliable here. Overall, these results give no evidence that the sampled terms relate to the monetary value of the questions, and testing more frequent terms would give a more trustworthy answer.
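
A common rule of thumb is that a chi-squared test needs an expected count of at least five in every cell. As a hedged sketch of how rare terms could be screened out before testing, the helper below checks that condition. `has_enough_data`, the group totals, and the term counts are all hypothetical:

```python
def has_enough_data(observed, high_total, low_total, min_expected=5):
    """Return True if both expected counts reach min_expected."""
    total = sum(observed)
    n = high_total + low_total
    expected_high = total * high_total / n
    expected_low = total * low_total / n
    return min(expected_high, expected_low) >= min_expected

# Hypothetical totals and term counts, not taken from the dataset
HIGH_TOTAL, LOW_TOTAL = 5000, 15000
term_counts = {"rare": (0, 3), "common": (12, 28)}

testable = [term for term, obs in term_counts.items()
            if has_enough_data(obs, HIGH_TOTAL, LOW_TOTAL)]
print(testable)
```

Only terms that pass this check would then be passed to the chi-squared test.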

### Conclusion

In this project, we performed a focused analysis on a portion of data from the Jeopardy game show. Our aim was to identify useful insights for individuals studying or preparing to participate in Jeopardy. Based on our findings, we recommend further exploration in two key areas:

1. The potential for questions to be repeated (which requires more data and time).
2. The factors that make a question high-value.

Given the limited size of our dataset, we could not rule out question repetition, which remains worth considering for anyone studying the game. A sample of this size may not represent the full range of questions encountered in Jeopardy, so we cannot draw definitive conclusions about patterns of repetition.

Our analysis also found no evidence that the sampled terms are linked to high-value or low-value questions, which suggests that other factors may play a larger role in determining question value. Is it the complexity of the question, the subject matter, or perhaps the phrasing? Further investigation in this area could better prepare participants to answer high-value questions and improve their chances of success in the game.