<img src="Jeopardy_S37_OnSetLogo-smaller.webp" style="display:block; margin:auto" width=350>

<h1><center>Project: Winning Jeopardy</center></h1>

`Jeopardy!` is a long-running American television game show created by Merv Griffin. It premiered in 1964 and has since become one of the most iconic and enduring game shows in television history. Contestants are presented with general knowledge clues in the form of answers and must phrase their responses in the form of questions.

The game is divided into three rounds: the Jeopardy round, the Double Jeopardy round, and Final Jeopardy. In the first two rounds, contestants select clues from a game board that is divided into categories and dollar values. Contestants accumulate money by providing correct responses, and in the Final Jeopardy round they wager part of their winnings based on their confidence in the category.

The show's format, challenging questions, and iconic theme music have made it a cultural phenomenon, with fans spanning generations. It has won numerous awards and accolades throughout its history and remains a beloved staple of American television.

#### Goal

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, we'll work with a dataset of `Jeopardy!` questions to figure out some patterns in the questions that could help you win.

#### Data

The dataset contains 216,930 rows of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

The data contains the following columns:

- `show_number` - the Jeopardy episode number
- `air_date` - the date the episode aired
- `round` - the round of Jeopardy
- `category` - the category of the question
- `value` - the number of dollars a correct answer is worth
- `question` - the text of the question
- `answer` - the text of the answer


*This project was completed as part of the Data Science Career Path offered by dataquest.io.*

#### The Data

In [1]:
# Import relevant packages
import pandas as pd
import numpy as np
from scipy.stats import chisquare
import re
import random

In [2]:
# Read in the json data
jeopardy = pd.read_json('jeo.json')

In [3]:
# Check information of the dataset
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     216930 non-null  object
 1   air_date     216930 non-null  object
 2   question     216930 non-null  object
 3   value        213296 non-null  object
 4   answer       216930 non-null  object
 5   round        216930 non-null  object
 6   show_number  216930 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [4]:
# Show the first rows of the dataset
jeopardy.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680


In [5]:
# Remove any spaces from the column names
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.columns

Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')

Before we can start analyzing the `Jeopardy` questions, we need to normalize the text columns (`question` and `answer`). The idea is to lowercase the words and remove punctuation so that `Don't` and `don't` aren't considered different words when we compare them.

In [6]:
# Create a function to clean the question and answer columns
def normalize(string):
    string = string.lower()
    # Raw strings keep "\s" from being read as an invalid string escape
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    string = re.sub(r"\s+", " ", string)
    return string

# Use apply with normalize function for the specific columns
jeopardy['clean_question'] = jeopardy['question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize)

In [7]:
# print the first rows of jeopardy dataset
jeopardy.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number,clean_question,clean_answer
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680,for the last 8 years of his life galileo was u...,copernicus
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680,the city of yuma in this state has a record av...,arizona
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680,in 1963 live on the art linkletter show this c...,mcdonalds
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680,signer of the dec of indep framer of the const...,john adams


Now that we've normalized the text columns, there are some other columns to normalize. The `value` column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric. The `air_date` column should also be a datetime, not a string, so that we can work with it more easily.

In [8]:
# Create a function to clean the value column
def normalize_value(string):
    string = re.sub(r"[^A-Za-z0-9\s]", "", str(string))
    try:
        string = int(string)
    except ValueError:
        string = 0
    return string

# Use function for the value column
jeopardy['clean_value'] = jeopardy['value'].apply(normalize_value)
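
As an alternative to applying a Python function row by row, the same cleanup can be sketched with vectorized pandas string methods. The sample values below are made up for illustration, and the names `values`, `digits`, and `clean` are assumptions that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical sample of raw values, including a missing one
values = pd.Series(["$200", "$2,000", None])

# Keep only the digits; missing entries stay missing
digits = values.str.replace(r"[^0-9]", "", regex=True)

# Convert to numbers, turning anything unparseable or missing into 0
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean.tolist())
```

Like `normalize_value`, this maps missing values to 0, so those rows end up in the low-value group later on.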

In [9]:
# Transform air_date to datetime format
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [10]:
# Check first rows of jeopardy
jeopardy.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number,clean_question,clean_answer,clean_value
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680,for the last 8 years of his life galileo was u...,copernicus,200
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680,the city of yuma in this state has a record av...,arizona,200
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680,signer of the dec of indep framer of the const...,john adams,200


In [11]:
# Show information about the adjusted jeopardy dataset
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   category        216930 non-null  object        
 1   air_date        216930 non-null  datetime64[ns]
 2   question        216930 non-null  object        
 3   value           213296 non-null  object        
 4   answer          216930 non-null  object        
 5   round           216930 non-null  object        
 6   show_number     216930 non-null  int64         
 7   clean_question  216930 non-null  object        
 8   clean_answer    216930 non-null  object        
 9   clean_value     216930 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 16.6+ MB


To figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We'll answer the first question by seeing how many times words in the answer also occur in the question. To achieve this, we'll write a function that takes a row of `jeopardy` as a Series.

In [12]:
# Create a function which checks if words occur in questions and answers
def repeat_question(series):
    split_answer = series['clean_answer'].split()
    split_question = series['clean_question'].split()
    
    match_count = 0

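    # Drop 'the' with a list comprehension; removing items from a list
    # while iterating over it can skip elements
    split_answer = [value for value in split_answer if value != 'the']
    if len(split_answer) == 0:
        return 0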
        
    for item in split_answer:
        if item in split_question:
            match_count += 1

    result = match_count / len(split_answer)
    return result

# Use the function on the dataset, set axis to 1 and create a new column called 'answer_in_question'
jeopardy['answer_in_question'] = jeopardy.apply(repeat_question, axis=1)

In [13]:
# Check the mean value of new column
jeopardy['answer_in_question'].mean()

0.05578233100688502

Above we created the function `repeat_question()`. It first splits the `clean_answer` and `clean_question` values into lists of words and initializes `match_count` to 0. It then removes the word `the` from `split_answer`, since it is common and carries no information. If no words remain, the function returns 0 to avoid dividing by zero. Next, the function iterates through `split_answer` and increments `match_count` for each word that also appears in `split_question`. Finally, it returns `match_count` divided by the length of `split_answer`, which is the fraction of answer words found in the question.
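
To make the ratio concrete, here is a minimal standalone sketch of the same idea applied to two already-normalized strings. The helper name `answer_in_question` and the first sample clue are made up for illustration.

```python
def answer_in_question(clean_question, clean_answer):
    """Return the fraction of answer words (ignoring 'the') found in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical clue whose answer is given away by the question
ratio = answer_in_question("this new york city borough is home to the bronx zoo", "the bronx")
print(ratio)  # 1.0

# A clue that shares no words with its answer
print(answer_in_question("signer of the dec of indep", "john adams"))  # 0.0
```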

On average, only about 6% of the words in an answer also appear in the question. This isn't a huge number, and it means that we probably can't just hope that hearing a question will enable us to determine the answer. We'll probably have to study.

#### Recycled Questions

Now let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because the dataset doesn't include every Jeopardy question ever asked, but we can at least investigate it. Let's do this next.

In [14]:
# Initiate a set and a list
terms_used = set()
question_overlap = []

# Sort by air date so that earlier questions are processed first
jeopardy = jeopardy.sort_values("air_date")

# Loop through the sorted rows
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words with more than 5 characters
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

0.8654625652482993


#### Code Explanation

1. First we sort `jeopardy` by `air_date` in ascending order and loop through each row. In each iteration, `i` is the index of the row and `row` is the row's data (a Series). We split the row's `clean_question` value into a list of words, and the list comprehension keeps only the words longer than 5 characters. This filters out words that are 5 characters or shorter.

2. We set `match_count` to 0. This variable counts the words in the current question that have been used previously (i.e., are already in the `terms_used` set).

3. We loop through the filtered `split_question` list and increment `match_count` by 1 for every word that already exists in `terms_used`.

4. We loop through `split_question` again and add every word to `terms_used`, regardless of whether it was already there. This set keeps track of all unique words encountered across all questions so far.

5. If any words remain in `split_question` after filtering, we divide `match_count` by `len(split_question)`. This gives the fraction of words in the current question that have been used before.

6. Finally, we append the overlap ratio (`match_count`) to the `question_overlap` list. This list holds one ratio per question, indicating how much of each question's content has been reused from previous questions.
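
The steps above can be traced on a tiny example. The three questions below are invented for illustration and are assumed to be already normalized and in air-date order.

```python
# Hypothetical normalized questions in air-date order
questions = [
    "mars is known as the red planet",
    "the largest planet in the solar system",
    "largest desert on the african continent",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words with more than 5 characters
    terms = [w for w in question.split() if len(w) > 5]
    seen_before = sum(1 for w in terms if w in terms_used)
    overlaps.append(seen_before / len(terms) if terms else 0)
    terms_used.update(terms)

print(overlaps)
```

The first question has no earlier terms to match, the second reuses `planet` (one of three terms), and the third reuses `largest` (one of four terms).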

#### Low Value vs. High Value Questions

There is about an 87% overlap between terms in new questions and terms in old questions. This looks only at single terms with more than 5 characters, not at phrases, which makes the figure relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

To compare low-value and high-value questions, we first flag each question: 1 (high value) if `clean_value` is above 800, otherwise 0 (low value).

In [15]:
# Create a function that returns 1 for high-value questions (more than $800) and 0 otherwise
def high_low_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    
    return value

# Use the function to create a new column high_value
jeopardy['high_value'] = jeopardy.apply(high_low_value, axis=1)

In [16]:
# Check the distribution of the high_value column
round(jeopardy['high_value'].value_counts(normalize=True) * 100, 2)

high_value
0    71.69
1    28.31
Name: proportion, dtype: float64

In [17]:
# Check the info of dataset jeopardy
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 216930 entries, 84523 to 105930
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   category            216930 non-null  object        
 1   air_date            216930 non-null  datetime64[ns]
 2   question            216930 non-null  object        
 3   value               213296 non-null  object        
 4   answer              216930 non-null  object        
 5   round               216930 non-null  object        
 6   show_number         216930 non-null  int64         
 7   clean_question      216930 non-null  object        
 8   clean_answer        216930 non-null  object        
 9   clean_value         216930 non-null  int64         
 10  answer_in_question  216930 non-null  float64       
 11  question_overlap    216930 non-null  float64       
 12  high_value          216930 non-null  int64         
dtypes: datetime64[ns](1), float64(2), int64(3), object(7)

In [18]:
# Create a function that counts the high-value and low-value questions containing a given word
def high_low_count(word):
    low_count = 0
    high_count = 0

    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1

    return high_count, low_count

The code above defines a function called `high_low_count`, which takes a word as input and counts how many high-value questions (`high_count`) and low-value questions (`low_count`) in the Jeopardy dataset contain that word. It iterates through each row of the dataset, splits the clean question into words, and checks whether the input word is present. Depending on the value of the `high_value` column, it increments the corresponding count. Finally, it returns a tuple containing the high-value and low-value counts for the input word.
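
Because `high_low_count` scans the whole dataset once per word, checking many words gets slow. A possible alternative, sketched here under the assumption that all words of interest are known up front, counts every word in a single pass. The function name `count_high_low` and the sample rows are hypothetical.

```python
from collections import Counter

def count_high_low(rows, words):
    """rows: iterable of (clean_question, high_value) pairs; words: terms to count.

    Returns {word: (high_count, low_count)}, counting each question at most once per word.
    """
    wanted = set(words)
    high, low = Counter(), Counter()
    for clean_question, high_value in rows:
        present = wanted & set(clean_question.split())
        target = high if high_value == 1 else low
        target.update(present)
    return {w: (high[w], low[w]) for w in words}

# Hypothetical rows: (normalized question, high_value flag)
rows = [
    ("towels and footballs were sold here", 1),
    ("these towels are made of cotton", 0),
    ("footballs are inflated with this gas", 0),
]
counts = count_high_low(rows, ["towels", "footballs", "kenosha"])
print(counts)  # {'towels': (1, 1), 'footballs': (1, 1), 'kenosha': (0, 0)}
```

With the DataFrame from this project, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.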

In [19]:
# Create a list
comparison_terms = []

# Loop in a range of 10 and append a random word to the list
for _ in range(10):
    comparison_terms.append(random.choice(list(terms_used)))

# Check the 10 words appended
comparison_terms

['dentona',
 'hrefhttpwwwjarchivecommedia20100615dj08jpg',
 'cryophobics',
 'protograph',
 'kenosha',
 'veggiesbr',
 'sacrificing',
 'hrefhttpwwwjarchivecommedia20110628j06jpg',
 'footballs',
 'towels']

In [20]:
# Create an empty list
observed_expected = []

# Loop through the list with 10 words and use the function high_low_count 
for value in comparison_terms:
    result = high_low_count(value)
    observed_expected.append(result)

# Check the values for these 10 specific words
observed_expected

[(0, 2),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 4),
 (0, 1),
 (2, 1),
 (0, 1),
 (3, 18),
 (4, 5)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

The code below calculates the expected counts and performs a chi-squared test for each pair of observed counts. It iterates through each pair in `observed_expected`, computes the total count and the term's share of all questions, derives the expected counts for high-value and low-value questions, builds arrays of observed and expected counts, runs `chisquare` from `scipy.stats`, and appends the chi-squared statistic and p-value to a list called `chi_squared`.


In [21]:
# Create an empty list
chi_squared = []

# Save high and low value to variables
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

for value in observed_expected:
    total = sum(value)
    total_prop = total / len(jeopardy)
    exp_high_value = total_prop * high_value_count
    exp_low_value = total_prop * low_value_count

    expected = np.array([exp_high_value, exp_low_value])
    observed = np.array([value[0], value[1]])
    chisquared, p_value = chisquare(observed, expected)
    chi_squared.append([chisquared, p_value])

# Show the result, first: chisquared value, second: p_value
chi_squared

[[0.7899529284667026, 0.3741143592744989],
 [2.5317964247338085, 0.11157312838169751],
 [0.3949764642333513, 0.5296950912486695],
 [2.5317964247338085, 0.11157312838169751],
 [1.5799058569334052, 0.2087742545638461],
 [0.3949764642333513, 0.5296950912486695],
 [2.1740540543895293, 0.14035579428041794],
 [0.3949764642333513, 0.5296950912486695],
 [2.0361210587719096, 0.15360089742564473],
 [1.1536838223971917, 0.2827793222886609]]
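
To see what the test computes, the statistic for a single term can be reproduced by hand. The sketch below uses only the standard library and the rounded high-value share of 28.31% from the distribution shown earlier, so its results differ slightly from the SciPy output. The function name `chi_squared_two_groups` is an assumption for illustration.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_share):
    """Chi-squared statistic and p-value for one term split into two groups."""
    total = observed_high + observed_low
    expected = (total * high_share, total * (1 - high_share))
    observed = (observed_high, observed_low)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give one degree of freedom, for which the
    # p-value equals erfc(sqrt(statistic / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Observed counts (3, 18) from the list above; share rounded to 0.2831
statistic, p_value = chi_squared_two_groups(3, 18, 0.2831)
print(statistic, p_value)
```

This gives a statistic of roughly 2.04 and a p-value of roughly 0.15, in line with the ninth entry of `chi_squared`.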

#### Conclusion

None of the terms had a significant difference in usage between high value and low value rows (all p-values are above 0.05). Additionally, most of the observed frequencies were lower than 5, so the chi-squared test isn't reliable for these terms. It would be better to run this test with only terms that have higher frequencies.
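
One way to act on this, sketched here as an assumption and not as part of the original analysis, is to keep only terms that occur often enough for the expected count in the smaller group to reach 5, a common rule of thumb for the chi-squared test. The function name `frequent_terms` and the sample data are hypothetical.

```python
import math
from collections import Counter

def frequent_terms(clean_questions, high_share, min_expected=5):
    """Return (terms, min_total): terms found in at least min_total questions."""
    smaller_share = min(high_share, 1 - high_share)
    # Smallest total for which total * smaller_share >= min_expected
    min_total = math.ceil(min_expected / smaller_share)
    counts = Counter()
    for question in clean_questions:
        # Count each term once per question, as high_low_count does
        counts.update(set(question.split()))
    terms = {term for term, n in counts.items() if n >= min_total}
    return terms, min_total

# Hypothetical data: 'towels' appears in 18 questions, 'kenosha' in 3
sample = ["towels"] * 18 + ["kenosha"] * 3
terms, min_total = frequent_terms(sample, 0.2831)
print(terms, min_total)  # {'towels'} 18
```

With a high-value share of about 28%, a term needs to appear in at least 18 questions before the expected high-value count reaches 5.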