# Project: Winning Jeopardy

## Introduction
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that we want to compete on Jeopardy, and we are looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

We will work with a dataset named jeopardy.csv, which contains 20,000 rows from the beginning of a full dataset of Jeopardy questions.

In [3]:
from IPython.display import Image

# Replace the file path with the actual path to your image
Image(url="https://dq-content.s3.amazonaws.com/Nlfu13A.png")

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

Show Number - the Jeopardy episode number

Air Date - the date the episode aired

Round - the round of Jeopardy

Category - the category of the question

Value - the number of dollars the correct answer is worth

Question - the text of the question

Answer - the text of the answer

In [4]:
#Read in the data 
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

In [5]:
#print out the first 5 rows
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [6]:
#print out the columns of jeopardy
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [7]:
#Remove leading and trailing spaces from each item in jeopardy.columns
jeopardy.columns = jeopardy.columns.str.strip()


## 2. Normalizing Text

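Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (the Question and Answer columns) by lowercasing them and removing punctuation, so that differences in case or punctuation do not stop two words from matching.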

In [8]:
import string

def normalize_string(text):
    # Convert the string to lowercase
    normalized_text = text.lower()

    # Remove punctuation from the string
    normalized_text = normalized_text.translate(str.maketrans('', '', string.punctuation))
    
    return normalized_text

In [9]:
# Apply the normalization function to the Question column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)

In [10]:
# Apply the normalization function to the Answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)
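As a quick sanity check, the sketch below applies the same normalization to two sample strings modeled on the preview above. The function is repeated so the snippet runs on its own; the names `sample` and `cleaned` are illustrative only.

```python
import string

def normalize_string(text):
    # Lowercase, then delete every character listed in string.punctuation
    return text.lower().translate(str.maketrans('', '', string.punctuation))

sample = "No. 2: 1912 Olympian; football star"
cleaned = normalize_string(sample)
print(cleaned)                         # no 2 1912 olympian football star
print(normalize_string("McDonald's"))  # mcdonalds
```

Note that apostrophes are deleted rather than replaced with a space, so a possessive such as McDonald's collapses into a single token.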

## 3. Normalizing Columns

The Value column should be numeric so that we can manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

In [11]:
import re

def normalize_dollar_value(value_text):
    # Strip every non-digit character (dollar signs, commas, letters)
    digits = re.sub(r'[^\d]', '', value_text)
    
    try:
        # Convert the remaining digits to an integer
        value = int(digits)
    except ValueError:
        # No digits were left, so the conversion fails: assign 0
        value = 0
    
    return value
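The short sketch below shows how this function behaves on a few inputs. Only "$200" appears in the preview above; "$2,000" and "None" are hypothetical inputs chosen to exercise the comma and the fallback branch. The function is repeated in compact form so the snippet runs on its own.

```python
import re

def normalize_dollar_value(value_text):
    # Keep only the digits; fall back to 0 when nothing numeric is left
    digits = re.sub(r'[^\d]', '', value_text)
    try:
        return int(digits)
    except ValueError:
        return 0

# Hypothetical inputs, apart from "$200"
for raw in ["$200", "$2,000", "None"]:
    print(raw, "->", normalize_dollar_value(raw))
```

Because unparseable values become 0, they are indistinguishable from a genuine value of 0 in later steps.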

In [12]:
# Apply the function to the Value column and create the clean_value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar_value)

In [13]:
# Convert Air Date column to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

## 4. Answers in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1. How often the answer can be deduced from the question.
2. How often questions are repeated.

We can answer the second question by seeing how often complex words (6 or more characters) recur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [14]:
def calculate_match_ratio(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    
    # Remove 'the' from split_answer
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    match_ratio = match_count / len(split_answer)
    return match_ratio
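To see what the ratio means on a single row, the sketch below runs a compact copy of the function on a hypothetical row written as a plain dict (a DataFrame row is indexed by column name in the same way). The sample answer and question are invented for illustration.

```python
def calculate_match_ratio(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # Drop the first 'the', as in the notebook, since it carries no meaning
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical row: one of the two remaining answer words is in the question
row = {
    'clean_answer': 'the mississippi river',
    'clean_question': 'this river flows into the gulf of mexico',
}
ratio = calculate_match_ratio(row)
print(ratio)  # 0.5
```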

In [16]:
# Apply the function to each row and create the answer_in_question column
jeopardy['answer_in_question'] = jeopardy.apply(calculate_match_ratio, axis=1)

# Calculate the mean of the answer_in_question column
mean_answer_in_question = jeopardy['answer_in_question'].mean()

In [17]:
print(mean_answer_in_question)

0.06035277385469894


### Understanding the Mean of "Answer in Question"

The mean of the "answer_in_question" column is about 0.06: on average, only about 6% of the words in an answer also appear in its question. This tells us how often the answer can be read off the question itself, which is useful in shaping a studying strategy for Jeopardy.

If the mean were relatively high, it would suggest significant overlap between answers and questions. In that case, studying the questions more thoroughly could help you recall the answers during the game.

Here the mean is low, which indicates that questions rarely contain the words of their answers. We cannot expect to hear the answer in the question, so understanding the broader concepts, categories, and themes behind the questions becomes crucial.

By considering the mean of "answer_in_question," you can tailor your studying approach accordingly. For high mean values, reinforcing the connections between answers and questions can be beneficial. For low mean values, focusing on building a deeper knowledge base and understanding the context of the questions can be more advantageous.

Remember that the mean provides a general overview, and it's essential to complement it with other studying techniques such as exploring various categories, expanding your knowledge in different subject areas, and practicing buzzer timing for a well-rounded preparation strategy.


## 5. Recycled Questions

We want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [19]:
# Create an empty list and set
question_overlap = []
terms_used = set()

# Sort jeopardy by ascending air date
jeopardy = jeopardy.sort_values('Air Date')

# Iterate through each row of jeopardy
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) >= 6]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)

# Assign question_overlap to the question_overlap column of jeopardy
jeopardy['question_overlap'] = question_overlap

# Find the mean of the question_overlap column and print it
mean_question_overlap = jeopardy['question_overlap'].mean()
print(mean_question_overlap)


0.6889055316620328
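The loop above is easier to follow on a tiny input. The sketch below wraps the same logic in a hypothetical helper, `overlap_scores`, and runs it on three invented questions that are assumed to be already normalized and sorted by air date.

```python
def overlap_scores(questions, min_length=6):
    """Return, per question, the share of long words already seen earlier."""
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Questions with no long words score 0, as in the notebook
        scores.append(match_count / len(words) if words else 0)
    return scores

toy_questions = [
    "this country borders france",   # nothing seen yet
    "this country exports coffee",   # "country" was seen before
    "a cat sat",                     # no word of 6+ characters
]
scores = overlap_scores(toy_questions)
print(scores)
```

Only "country" repeats, so the second question scores one third while the other two score zero. Short words such as "this" never count, however often they recur.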


### Understanding the Question Overlap and its Implications

The mean value of the "question_overlap" column is about 0.69. This column measures, for each question, the proportion of its words of six or more characters that already appeared in an earlier question (by air date). A higher mean suggests a higher likelihood of question recycling, while a lower mean indicates less repetition. Note that the measure compares single words rather than phrases and covers only part of the full dataset, so a high value is a reason to look more closely at recycling, not proof of it.

Based on the calculated mean of the "question_overlap" column, we can make a few observations:

1. High Question Overlap: If the mean is relatively high, it implies that there is a substantial degree of question recycling occurring over time. This means that previous questions have been reused, and familiarity with past questions could potentially give an advantage in preparing for future games.

2. Low Question Overlap: Conversely, if the mean is low, it suggests that there is minimal repetition of questions. This indicates a greater emphasis on creating new and unique questions, making it less likely that studying past questions alone will significantly enhance performance.

It's important to note that the mean of the "question_overlap" column provides a general overview and doesn't provide insights into specific question categories or time periods. It's possible that certain categories or time periods may exhibit higher or lower question overlap than the overall mean.

To prepare effectively for Jeopardy, considering the question overlap can help inform your studying strategy:

- High overlap: Focus on reviewing and studying previous Jeopardy questions to identify common themes, topics, and recurring patterns. Familiarity with past questions can improve your chances of answering correctly.

- Low overlap: Emphasize broadening your knowledge base, exploring diverse subject areas, and understanding underlying concepts rather than relying heavily on memorizing specific questions and answers.

Remember that studying strategies should be well-rounded, combining category exploration, general knowledge enhancement, and regular practice with buzzer timing to succeed in the dynamic and diverse world of Jeopardy.


## 6. Low Value vs High Value Questions

Let's say we only want to study high-value questions instead of low-value questions. This will help us earn more money when we are on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We will first need to narrow down the questions into two categories:

- Low value: any row where Value is 800 or less.
- High value: any row where Value is greater than 800.

In [20]:
import random
from scipy.stats import chisquare

# Function to categorize rows into high value (1) or low value (0)
def categorize_value(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
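The boundary case is worth checking explicitly. The sketch below applies a compact copy of the function to hypothetical rows written as dicts, including a value of exactly 800 and a value of 0 (which is what unparseable values become after normalization).

```python
def categorize_value(row):
    # 1 for high value (strictly above 800), 0 otherwise
    return 1 if row['clean_value'] > 800 else 0

# Hypothetical rows; 800 itself and unparsed values (0) land in the low group
labels = [categorize_value({'clean_value': v}) for v in (200, 800, 1000, 0)]
print(labels)  # [0, 0, 1, 0]
```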





In [None]:
# Apply the function to each row and create the high_value column
jeopardy['high_value'] = jeopardy.apply(categorize_value, axis=1)



In [None]:
# Function to count occurrences of a word in high and low value questions
def count_usage(word):
    low_count = 0
    high_count = 0

    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count
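Because `count_usage` walks the whole DataFrame once per term, it becomes slow as the number of terms grows. The sketch below is one possible alternative, not the notebook's method: a hypothetical helper, `count_usage_many`, that counts all requested terms in a single pass over plain (question, label) pairs. The toy rows are invented.

```python
from collections import defaultdict

def count_usage_many(rows, terms):
    """Count, for each term, the high and low value questions containing it.

    rows is an iterable of (clean_question, high_value) pairs.
    """
    wanted = set(terms)
    counts = defaultdict(lambda: [0, 0])  # term -> [high_count, low_count]
    for clean_question, high_value in rows:
        # A set makes each question count at most once per term
        for word in set(clean_question.split(' ')) & wanted:
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return {term: tuple(counts[term]) for term in terms}

toy_rows = [
    ("this country borders france", 1),
    ("this country exports coffee", 0),
    ("coffee grows in this country", 0),
]
usage = count_usage_many(toy_rows, ["country", "coffee", "zebra"])
print(usage)
```

With the notebook's DataFrame, the pairs could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`. Each tuple is ordered (high_count, low_count), matching `count_usage`.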



In [None]:
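# Randomly select ten terms from terms_used for comparison.
# random.sample needs a sequence (sets raise TypeError on Python 3.11+), so convert to a list.
comparison_terms = random.sample(list(terms_used), 10)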



In [None]:
# List to store the observed (high_count, low_count) pair for each term
observed_expected = []



In [21]:
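# Loop through each term and collect its observed (high, low) counts.
# Reset the list first so that re-running this cell does not append duplicates.
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))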

print(observed_expected)


[(0, 1), (12, 28), (4, 2), (0, 1), (1, 0), (0, 2), (1, 3), (1, 3), (0, 1), (0, 3)]


## 7. Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [22]:
from scipy.stats import chisquare

# Find the counts of high value and low value rows
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

# List to store chi-squared results
chi_squared = []

# Loop through each observed (high, low) pair and derive its expected counts
for observed in observed_expected:
    total = sum(observed)
    total_prop = total / len(jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    chi_squared.append(chisquare(observed, f_exp=[expected_high, expected_low]))

print(chi_squared)


[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.034523405991355754, pvalue=0.8525978776056389),
 Power_divergenceResult(statistic=4.235420876606389, pvalue=0.03958880694352712),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047)]

### Analysis of Chi-Squared Results

After performing the chi-squared test on the observed and expected counts for a sample of terms, let's examine the obtained results to identify any statistically significant findings.

The chi-squared test helps determine whether there is a significant association between how often a term occurs and the value (high or low) of the questions it occurs in. The p-value associated with each chi-squared statistic is the probability of seeing a deviation from the expected counts at least this large if there were no association at all. A smaller p-value is stronger evidence against the null hypothesis of no association.

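Upon reviewing the chi-squared values and associated p-values, the following observations can be made:

- Statistically Significant Results: A term with a low p-value (e.g., below a chosen significance level such as 0.05) shows a significant association between the term's occurrence and high or low value questions. In the run shown, only one term, with observed counts of 4 high and 2 low, falls below 0.05 (p ≈ 0.04).

- Lack of Statistically Significant Results: A term with a high p-value (e.g., above the significance level) shows no significant association between the term's occurrence and the value of the questions. This is the case for the other nine terms, whose p-values range from about 0.11 to 0.87.

These results should be read with caution. Eight of the ten sampled terms appear in fewer than five questions, and the one significant term appears in only six. The chi-squared approximation is unreliable when counts are that small, so even the one p-value below 0.05 is weak evidence. Re-running the test on terms that occur more frequently would give more trustworthy results.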
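To make the test less of a black box, the sketch below recomputes the statistic and p-value for a single term using only the standard library. The helper name `chi_squared_two_groups` is hypothetical. The group sizes 5734 and 14265 are an assumption: the notebook never prints `high_value_count` or `low_value_count`, and these values were chosen because they are consistent with the printed statistics.

```python
import math

def chi_squared_two_groups(observed, group_sizes):
    """Chi-squared statistic and p-value for one term split across two groups.

    observed is (high_count, low_count); group_sizes is
    (high_value_count, low_value_count) for the whole dataset.
    """
    total = sum(observed)
    n_rows = sum(group_sizes)
    # Expected counts if the term were spread in proportion to group size
    expected = [total * size / n_rows for size in group_sizes]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two categories there is one degree of freedom, for which the
    # chi-squared survival function reduces to erfc(sqrt(statistic / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMED group sizes (not printed in the notebook)
group_sizes = (5734, 14265)
statistic, p_value = chi_squared_two_groups((4, 2), group_sizes)
print(round(statistic, 3), round(p_value, 3))
```

Under these assumed sizes, the term observed 4 times in high-value and 2 times in low-value questions should come out close to the statistic of about 4.24 and p-value of about 0.04 reported by `chisquare`. The expected high count here is below 2, which illustrates why such a result rests on very little data.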



