# Project: Winning Jeopardy
---

## 1. Overview
---

In this project, a dataset of Jeopardy (a US TV game show) questions is used to look for patterns in the questions that could help in winning the game.

The dataset used is `jeopardy.csv`, which contains just under `20,000` rows (`19,999` in the copy loaded here). The columns/variables within the dataset are as follows:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer
---

In [14]:
# Importing libraries used throughout this lesson
import pandas as pd
import numpy as np
import string
import collections  # for creating dictionary frequency tables
from random import choice
from scipy.stats import chisquare

## 2. Exploring and Preparing the Data
---

In [15]:
# Reading jeopardy.csv as a pandas DataFrame (df)
jeop_df = pd.read_csv('data/jeopardy.csv', low_memory=False)

# display first three rows
jeop_df.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


### 2.1. Fixing Column Names
---

This section simply cleans up the column names by removing white space before and after the column label.

In [16]:
# checking column names
jeop_df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [17]:
# remove spaces from column names
jeop_df.columns = jeop_df.columns.str.strip()

# checking column names again
jeop_df.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### 2.2. Normalising Text
---

This section normalises the text in both the question and answer columns. The text is converted to lowercase and punctuation is removed, which makes like-for-like word comparisons much easier.

In [18]:
import string  # this is here as a reminder that this library is being used

def normalise_text(text: str) -> str:
    
    """
        This function normalizes text by converting it to lowercase and removing all punctuation.

        Args:
            text (str): A string to be normalized.

        Returns:
            str: The normalized string.
    """
    
    # Convert the text to lowercase
    text = text.lower()
    
    # Remove all punctuation from the text
    text = text.translate(str.maketrans("", "", string.punctuation))
    
    return text
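
As a quick sanity check, the snippet below applies the same two steps (lowercasing, then stripping punctuation) to a made-up string. The sample text is hypothetical and not taken from the dataset, and the function is redefined in compact form only so that the snippet runs on its own.

```python
import string

def normalise_text(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation (same two steps as above)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Hypothetical sample string, used only for illustration
sample = "No. 2: 1912 Olympian; football star!"
cleaned = normalise_text(sample)
print(cleaned)  # no 2 1912 olympian football star
```

Note that `string.punctuation` only covers ASCII punctuation, so characters such as curly quotes would be left in place by this approach.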

In [19]:
# applying the text normalising function to the question and answer columns
jeop_df['clean_question'] = jeop_df['Question'].apply(normalise_text)
jeop_df['clean_answer'] = jeop_df['Answer'].apply(normalise_text)

# display first three rows to check result
jeop_df.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona


### 2.3. Normalising Columns
---

In this section the `Value` column is converted to a numeric type by removing the `$` sign and casting the string to an int. Additionally, the `Air Date` column is converted to a datetime type.

In [20]:
# checking column value types
jeop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Show Number     19999 non-null  int64 
 1   Air Date        19999 non-null  object
 2   Round           19999 non-null  object
 3   Category        19999 non-null  object
 4   Value           19663 non-null  object
 5   Question        19999 non-null  object
 6   Answer          19999 non-null  object
 7   clean_question  19999 non-null  object
 8   clean_answer    19999 non-null  object
dtypes: int64(1), object(8)
memory usage: 1.4+ MB


In [21]:
from typing import Optional
import string  # this is here as a reminder that this library is being used

def normalize_dollar_value(value) -> Optional[int]:
    
    """
        This function normalizes dollar values by removing punctuation and converting to an integer.
        It returns None if the value is missing or the conversion fails.

        Args:
            value: A raw dollar value, usually a string such as "$200"; may also be a float or NaN.

        Returns:
            Optional[int]: The normalized dollar value as an integer, or None if it cannot be parsed.
    """
    
    if isinstance(value, str):
        # Remove punctuation from string
        text_without_punctuation = value.translate(str.maketrans("", "", string.punctuation))
        # Convert the text to an integer
        try:
            return int(text_without_punctuation)
        except ValueError:
            return None
    elif isinstance(value, float) and np.isnan(value):
        # Handle NaN values
        return None
    elif isinstance(value, float):
        # Handle float values
        return int(value)
    else:
        return None

In [22]:
# applying the normalize_dollar_value function to the Value column
jeop_df['clean_value'] = jeop_df['Value'].apply(normalize_dollar_value)

# Displaying first three rows to check changes
jeop_df.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200.0
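
The same conversion could also be done without a row-wise `apply`. The snippet below is a minimal sketch of a vectorised alternative, assuming the raw values look like `$200` or `$1,000`; the sample Series is hypothetical and this is not the approach used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the raw `Value` strings, including missing entries
values = pd.Series(["$200", "$1,000", None, "None"])

# Drop every non-digit character, then convert; unparseable entries become NaN
digits = values.str.replace(r"[^\d]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce")
print(clean.tolist())
```

As with `clean_value` above, the result holds floats rather than ints because the missing entries are represented as NaN.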


In [23]:
# Convert the Air Date column to datetime
jeop_df['Air Date'] = pd.to_datetime(jeop_df['Air Date'])

# checking column value types
jeop_df['Air Date'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 19999 entries, 0 to 19998
Series name: Air Date
Non-Null Count  Dtype         
--------------  -----         
19999 non-null  datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 156.4 KB


## 3. Answers in Questions and Repeating Questions
---

In order to decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:

1. How often the answer can be deduced from the question.
2. How often questions are repeated.

The first question can be answered by checking how many of the words in the answer also appear in the question. The second can be answered by checking how often complex words (six or more characters) recur.

### 3.1. Answer Appearing in the Question
---

In [24]:
def match_count(series):
    """
        This function takes in a row in Jeopardy as a Pandas Series, and 
        returns the ratio of how many words in the answer also appear in the question.

        Args:
            series: A Pandas Series containing a row of Jeopardy data.

        Returns:
            The ratio of how many words in the answer also appear in the question, 
            as a float between 0 and 1.
    """
    # Split the clean_answer column around spaces and assign to the variable split_answer.
    split_answer = series['clean_answer'].split()
    # Split the clean_question column around spaces and assign to the variable split_question.
    split_question = series['clean_question'].split()
    
    # Create a variable called match_count, and set it to 0.
    match_count = 0
    
    # If the word "the" is in split_answer, remove it using the remove method on lists.
    # The word "the" doesn't have any meaningful use in finding the answer.
    if "the" in split_answer:
        split_answer.remove("the")
        
    # If the length of split_answer is 0, return 0. This prevents a division by zero error later.
    if len(split_answer) == 0:
        return 0
    
    # Loop through each item in split_answer, and see if it occurs in split_question. 
    # If it does, add 1 to match_count.
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    # Divide match_count by the length of split_answer, and return the result.
    return match_count / len(split_answer)
    

In [25]:
# Apply the match_count function to each row in the DataFrame, 
# and assign the result to a new column called 'answer_in_question'
jeop_df['answer_in_question'] = jeop_df.apply(match_count, axis=1)

jeop_df['answer_in_question'].mean()

0.058861482035140716

---
**Observation from this section (3.1.)**

On average, only about `6%` of the words in an answer also appear in its question. This is a small share, so a contestant cannot rely on hearing the question and inferring the answer from it.
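
To make the ratio concrete, here is a small standalone sketch of the same idea applied to a single made-up question/answer pair. The helper name `answer_match_ratio` and the sample question are hypothetical. Unlike `match_count`, this version drops every occurrence of "the" rather than only the first.

```python
def answer_match_ratio(clean_question: str, clean_answer: str) -> float:
    """Share of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    return sum(w in question_words for w in answer_words) / len(answer_words)

# Hypothetical pair: of the four remaining answer words, only "of" is in the question
ratio = answer_match_ratio(
    "although published by this organization for 62 years it is a magazine of note",
    "the girl scouts of america",
)
print(ratio)  # 0.25
```

A common word such as "of" is enough to produce a non-zero ratio, so even the `6%` average probably overstates how much of the answer the question gives away.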

### 3.2. Repeating Questions
---

The dataset used here makes up only about `10%` of all Jeopardy questions, so it is not possible to fully answer how often questions are repeated. It can still be investigated, however, and that is what this section does.

In [26]:
import collections  # this is here as a reminder that this library is being used

# Sorting the df by dateTime column in ascending order
jeop_df.sort_values(by='Air Date', inplace=True)

question_overlap = []
terms_used = set()

for index, row in jeop_df.iterrows():
    
    # splitting string into a list of words
    split_question = row['clean_question'].split(' ')
    
    # Filtering split_question list by words >=6 char long
    split_question = [s for s in split_question if len(s) >= 6]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
    for word in split_question:
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
    
jeop_df["question_overlap"] = question_overlap
jeop_df["question_overlap"].mean()

0.687124288096678

---
**Observation from this section (3.2.)**

There is a `69%` overlap between terms (words of six or more characters) in new questions and terms in older questions. On its own this is likely not meaningful, as the analysis considers only single words and does not consider phrases. It does, however, give more reason to consider the possibility of recycled questions.
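
The toy example below sketches why word-level overlap can be high even when no question is repeated. The helper `overlap_ratios` and the two sample questions are hypothetical; the logic mirrors the loop above (words of six or more characters, compared against everything seen so far).

```python
def overlap_ratios(questions, min_len=6):
    """For each question, in order, the share of its long words already seen earlier."""
    seen = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(w in seen for w in words)
        ratios.append(matches / len(words) if words else 0)
        seen.update(words)
    return ratios

# Two different (made-up) questions that share the word "capital"
toy_questions = [
    "this cambodian capital lies on a river",
    "this european capital lies on the danube",
]
ratios = overlap_ratios(toy_questions)
print(ratios)
```

The second question scores one third because "capital" was already seen, even though it asks about something else entirely.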

## 4. Considering Low Value vs. High Value Questions
---

This section considers whether studying for high-value questions instead of low-value questions is worth it. This can be done by finding which terms correspond to high-value questions using a chi-squared test. First, the questions need to be split into two categories by value:

- Low value: any row where `Value` is `800` or less (as the code below is written, rows with a missing value also fall into this group).
- High value: any row where `Value` is greater than `800`.

Then, for each term in the `terms_used` set:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
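
The steps above can be sketched for a single term as follows. All counts here are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the dataset.

```python
from scipy.stats import chisquare

# Hypothetical totals, not taken from the dataset
n_high, n_low = 6000, 14000   # number of high- and low-value questions
obs_high, obs_low = 30, 20    # questions of each kind that contain the term

total = obs_high + obs_low
term_share = total / (n_high + n_low)  # share of all questions containing the term

# Expected counts if the term were spread evenly across both groups
exp_high = term_share * n_high  # about 15
exp_low = term_share * n_low    # about 35

stat, p_value = chisquare([obs_high, obs_low], [exp_high, exp_low])
print(stat, p_value)
```

Here the statistic is (30 - 15)²/15 + (20 - 35)²/35 ≈ 21.43, so a term with these counts would appear in high-value questions far more often than expected by chance.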

In [27]:
# creating a column in jeop_df of boolean 0,1 for high-value questions

def question_value(row: pd.Series) -> int:
    
    # Set the initial value to 0
    value = 0
    # Check if the clean_value is greater than 800
    if row['clean_value'] > 800:
        # If the clean_value is greater than 800, set the value to 1
        value = 1
        
    # Return the value
    return value

# applying the above function to each row in the df
jeop_df["high_value"] = jeop_df.apply(question_value, axis=1)

# display three random rows
jeop_df.sample(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
17755,3779,2001-01-25,Jeopardy!,STYLISH CELEBRITIES,$500,"This brunette beauty won an Emmy for ""Once and...",Sela Ward,this brunette beauty won an emmy for once and ...,sela ward,500.0,0.0,0.6,0
11828,3053,1997-12-03,Double Jeopardy!,WORLD CITIES,$600,This Cambodian capital lies at the confluence ...,Phnom Penh,this cambodian capital lies at the confluence ...,phnom penh,600.0,0.0,0.5,0
19755,4925,2006-01-27,Jeopardy!,MAGAZINES,$400,Although published by this organization for 62...,the Girl Scouts of America,although published by this organization for 62...,the girl scouts of america,400.0,0.25,0.833333,0


In [28]:
def count_usage(term):
    
    # Initialize low and high count to 0
    low_count = 0
    high_count = 0
    
    # Loop through each row in jeop_df DataFrame
    for index, row in jeop_df.iterrows():
        
        # Split the clean_question column of the row by space 
        # and check if term is present in it
        if term in row["clean_question"].split(" "):
            
            # If the term is present, check if the question is high 
            # value or not and increment the respective counter
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
                
                
    # Return high and low count as a tuple
    return high_count, low_count



from random import choice # this is here as a reminder that this library is being used

# converting terms_used set into a list
terms_used_list = list(terms_used)

# randomly pick 10 words from terms_used_list 
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 1),
 (0, 2),
 (1, 0),
 (0, 1),
 (0, 4),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1)]

In [29]:
from scipy.stats import chisquare # this is here as a reminder that this library is being used

# Counting the number of high and low value rows
high_value_count = jeop_df[jeop_df["high_value"] == 1].shape[0]
low_value_count = jeop_df[jeop_df["high_value"] == 0].shape[0]

# Creating an empty list to store the results of the chi-squared test
chi_squared = []

# Iterating through each observed (high_count, low_count) pair
for obs in observed_expected:
    
    # Calculating the total number of occurrences for the pair
    total = sum(obs)
    
    # Calculating the proportion of the pair in the entire dataset
    total_prop = total / jeop_df.shape[0]
    
    # Calculating the expected counts for high and low value rows
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    # Creating arrays for the observed and expected counts
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    # Performing the chi-squared test and appending the result to the list
    chi_squared.append(chisquare(observed, expected))

# Returning the list of chi-squared test results
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454022),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.2047940943922556),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

## 5. Chi-Squared Results
---

None of the ten randomly sampled terms showed a significant difference in usage between high- and low-value rows (every p-value was above `0.05`). Moreover, every term occurred fewer than `5` times, which makes the chi-squared test unreliable. The test should be repeated using only terms with higher frequencies.
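
One possible way to select higher-frequency terms is sketched below using `collections.Counter`. The sample questions and the `min_questions` threshold are hypothetical; on the full dataset the list would come from `jeop_df['clean_question']` and the threshold would need to be much larger, since a common rule of thumb asks for expected counts of at least 5 in each group.

```python
from collections import Counter

# Hypothetical cleaned questions standing in for jeop_df['clean_question']
questions = [
    "this european capital lies on the danube",
    "this cambodian capital lies on a river",
    "the capital of this country is vienna",
]

# Count how many questions each long word appears in (set() avoids double-counting)
question_freq = Counter(
    word for q in questions for word in set(q.split()) if len(word) >= 6
)

min_questions = 3  # illustrative threshold only
frequent_terms = [term for term, n in question_freq.items() if n >= min_questions]
print(frequent_terms)  # ['capital']
```

The resulting list could then replace the random sample in `comparison_terms`, so that the chi-squared test is only run on terms with enough occurrences to be informative.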