# Introduction: 

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

## Context:

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here:

Here are explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

# Part 1: Reading Data & Initial look

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

In [2]:
# print first 5 rows of dataframe
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [3]:
# print the column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces in front.
Remove the spaces from each item in jeopardy.columns.
Assign the result back to jeopardy.columns to fix the column names in jeopardy.

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


# Part 1.5: Normalizing & Cleaning Data

Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (the Question and Answer columns).

To normalize a string, convert it to lowercase and remove all punctuation.

In [5]:
import string

def str_normalize(text):
    """
    Normalize a string by lowercasing it and removing punctuation:
    Take in a string.
    Convert the string to lowercase.
    Remove all punctuation in the string.
    Return the string.
    """
    lower = text.lower()
    return lower.translate(str.maketrans('', '', string.punctuation))

print(str_normalize('No. 2: 1912 Olympian; football star at Carlis'))

no 2 1912 olympian football star at carlis


In [6]:
# Normalize the Question and Answer column
jeopardy['clean_question'] = jeopardy['Question'].apply(str_normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(str_normalize)

There are also some other columns to normalize.

The Value column should be numeric, so that you can manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, so that you can work with it more easily.

In [7]:
import string

def number_normalize(text):
    """
    Normalize a dollar amount stored as text:
    Take in a string.
    Remove any punctuation in the string.
    Convert the string to an integer.
    Use 0 instead if the conversion fails.
    Return the integer.
    """
    clean = text.translate(str.maketrans('', '', string.punctuation))
    try:
        return int(clean)
    except ValueError:
        return 0

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(number_normalize)

In [9]:
# convert Air Date column to datetime column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [10]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


# Part 2: Count matches

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

You can answer the second question by seeing how often complex words (6 or more characters) recur. You can answer the first question by seeing how many of the words in the answer also occur in the question. We'll work on the first question and come back to the second.
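
To make the first measure concrete, here is a minimal sketch on a single made-up clue. The helper name `answer_word_fraction` and the example strings are assumptions for illustration only; unlike the notebook's function, this version drops every occurrence of "the" from the answer rather than just one.

```python
def answer_word_fraction(clean_answer, clean_question):
    """Return the share of answer words that also appear in the question."""
    # Ignore "the", which carries no information about the answer
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized answer and question
fraction = answer_word_fraction(
    "the indian ocean",
    "this ocean lies between africa and australia",
)
print(fraction)  # 0.5: "ocean" is in the question, "indian" is not
```

A value of 1.0 would mean every answer word is given away by the question, and 0 would mean none is.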

In [11]:
def frequency_ans_question(row):
    """
    Return the fraction of words in the answer that also appear in the question.
    One occurrence of "the" is dropped from the answer before counting.
    """
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for ans in split_answer:
        if ans in split_question:
            match_count +=1
    return match_count / len(split_answer)

In [12]:
jeopardy['answer_in_question'] = jeopardy.apply(frequency_ans_question,axis=1)

In [13]:
print('Percentage of answers in questions: {}%'.format(jeopardy['answer_in_question'].mean()*100))

Percentage of answers in questions: 5.886148203514072%


On average, only about 6 percent of the words in an answer also appear in its question, so you can't count on the question giving the answer away. It would be better to study general knowledge.

# Part 3: Count Question Overlap

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    - If it does, increment a counter.
    - Add each word to terms_used
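
The steps above can be sketched on a few toy questions before running them on the full DataFrame. The function name `overlap_fractions` and the sample questions are hypothetical; the list is assumed to be sorted from oldest to newest already.

```python
def overlap_fractions(questions, min_len=6):
    """For each question, return the share of its long words seen earlier."""
    terms_used = set()
    fractions = []
    for question in questions:
        # Keep only words with at least min_len characters
        words = [w for w in question.split() if len(w) >= min_len]
        matches = 0
        for w in words:
            if w in terms_used:
                matches += 1
            terms_used.add(w)
        fractions.append(matches / len(words) if words else 0)
    return fractions

# Hypothetical normalized questions, oldest first
fractions = overlap_fractions([
    "this author wrote hamlet",
    "this danish prince is the title character of hamlet",
    "he was a poet",
])
print(fractions)  # [0.0, 0.25, 0]
```

Only "hamlet" in the second question had been seen before, which gives 1 of 4 long words; the third question has no long words at all, so it scores 0.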

In [14]:
#create empty list for count of question overlap
question_overlap = []

#set used to add distinct words to it
terms_used = set()

#sort jeopardy by air date, oldest first
jeopardy = jeopardy.sort_values(by=['Air Date'],ascending=True)

#iterate through the dataframe using iterrows()
for index, value in jeopardy.iterrows():
    #look the column up by name instead of by position
    clean_question = value['clean_question']
    #split the array of strings by space
    split_question = clean_question.split(' ')
    
    #remove any strings less than 6 characters long
    split_question = [i for i in split_question if len(i) >= 6]
    
    #use match count variable to track counts of words
    match_count=0
    
    #loop through split_question and if term occurs in set add 1 , and append words to set
    for q in split_question:
        if q in terms_used:
            match_count += 1
        terms_used.add(q)

    if len(split_question) > 0:
        match_count /= len(split_question)
            
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

In [15]:
print('Mean of occurrence of overlapped questions {}'.format(jeopardy['question_overlap'].mean()))

Mean of occurrence of overlapped questions 0.6889055316620328


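On average, about 69 percent of the longer words in a question had already appeared in an earlier question. This counts single words rather than phrases or whole questions, so it does not by itself show that questions are repeated.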

# Part 3.5: Maximizing High Value Questions

Let's say you only want to study high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms collected in the previous part, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
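
As a worked example of these steps for a single term, here is a minimal sketch that uses only the standard library. All of the counts are hypothetical and were chosen to keep the arithmetic easy to follow. With two categories the test has 1 degree of freedom, and in that case the p-value has the closed form erfc(sqrt(chi_sq / 2)).

```python
import math

# Hypothetical totals: 1,000 questions, 300 high value and 700 low value
high_value_count = 300
low_value_count = 700
total_rows = high_value_count + low_value_count

# Hypothetical observed counts for one term: (high value, low value)
observed = (2, 8)

# Share of all questions that contain the term
total_prop = sum(observed) / total_rows

# Counts expected if the term were equally common in both groups
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value for 1 degree of freedom
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(expected, round(chi_sq, 3), round(p_value, 2))
```

Here the expected counts are 3 and 7, so the statistic is 1/3 + 1/7 = 10/21, about 0.476. The p-value is well above 0.05, so this made-up term would show no significant difference between the two groups.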

In [16]:
def high_value(row):
    """
    Create a function that takes in a row from a Dataframe, and:
    If the clean_value column is greater than 800, assign 1 to value.
    Otherwise, assign 0 to value.
    Return value.
    """
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [17]:
jeopardy['high_value'] = jeopardy.apply(high_value,axis=1)

In [19]:
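def count_value(string):
    """
    Count the high value and low value questions that contain the given word.
    Return (high_count, low_count).
    """
    low_count = 0
    high_count = 0
    for index, value in jeopardy.iterrows():
        # look the columns up by name rather than by position
        split_question = value['clean_question'].split(' ')
        # match whole words only, not substrings of longer words
        if string in split_question:
            if value['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count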
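
Because `count_value` rescans every row of jeopardy for each term, checking all terms this way is slow. One possible alternative, sketched here under stated assumptions, is to tally every long word in a single pass. The name `tally_terms` and the toy rows are hypothetical and not part of the notebook, and the sketch counts whole words only.

```python
from collections import defaultdict

def tally_terms(rows, min_len=6):
    """Map each long word to [high_count, low_count] in a single pass.

    rows is an iterable of (clean_question, high_value) pairs.
    """
    counts = defaultdict(lambda: [0, 0])
    for clean_question, high_value in rows:
        # A set counts each word at most once per question
        for word in set(clean_question.split()):
            if len(word) >= min_len:
                counts[word][0 if high_value == 1 else 1] += 1
    return counts

# Hypothetical rows: (clean_question, high_value)
rows = [
    ("this danish prince is the title character of hamlet", 1),
    ("this author wrote hamlet", 0),
    ("this prince became king in 1952", 0),
]
counts = tally_terms(rows)
print(counts["hamlet"])  # [1, 1]
print(counts["prince"])  # [1, 1]
```

With the DataFrame, the pairs could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.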

In [23]:
import random
# random.sample needs a sequence, so convert the set to a list first
comparison_terms = random.sample(list(terms_used), k=10)
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_value(term))

In [24]:
observed_expected

[(0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 5),
 (1, 0),
 (1, 3),
 (0, 1),
 (0, 4)]

# Part 5: Compute Chi Squared values from expected counts

In [59]:
# Find the number of rows in jeopardy where high_value is 1
high_value_count = sum(jeopardy['high_value']==1)

#Find the number of rows in jeopardy where high_value is 0
low_value_count = sum(jeopardy['high_value']==0)

In [64]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for value in observed_expected:
    total = sum(value)
    total_prop = total / len(jeopardy)
    exp_highvalue = total_prop*high_value_count
    exp_lowvalue = total_prop*low_value_count
    
    observed = np.array([value[0], value[1]])
    expected = np.array([exp_highvalue, exp_lowvalue])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498966),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948)]

Our sample is tiny, so these values should be taken with a grain of salt, but none of the chi-squared tests gives p < 0.05. For the sampled terms, there is no statistically significant difference in usage between high value and low value questions.
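
A common rule of thumb is that the chi-squared approximation is only dependable when every expected count is at least 5. The sketch below applies that check to the observed counts printed above. The group sizes are hypothetical placeholders, not the notebook's actual `high_value_count` and `low_value_count`, and `expected_counts` is a helper name introduced only for this example.

```python
def expected_counts(observed, high_total, low_total):
    """Return (expected_high, expected_low) for one term's observed counts."""
    share = sum(observed) / (high_total + low_total)
    return share * high_total, share * low_total

# Observed (high, low) counts taken from the sample output above
observed_sample = [(0, 1), (0, 1), (1, 0), (0, 1), (0, 1),
                   (0, 5), (1, 0), (1, 3), (0, 1), (0, 4)]

# Hypothetical group sizes, chosen only to keep the arithmetic simple
high_total, low_total = 6000, 14000

# Keep only the terms whose smaller expected count reaches 5
reliable = [
    obs for obs in observed_sample
    if min(expected_counts(obs, high_total, low_total)) >= 5
]
print(reliable)  # []: no sampled term has expected counts of at least 5
```

Under these assumed group sizes, even the most frequent sampled term appears in only 5 questions, so none of the tests would meet the rule of thumb. Running the test on terms with higher counts would give more dependable results.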

# Conclusion:

- On average, only about 6 percent of the words in an answer also appear in its question, so the question rarely gives the answer away. It would be better to study general knowledge.

- The mean term overlap is 0.6889055316620328, meaning about 69 percent of the longer words in a question had already appeared in earlier questions. This measures single words rather than whole questions, so it does not show that questions themselves repeat, and studying previous questions is unlikely to pay off on its own.

- In our sample (which is tiny), none of the chi-squared tests gives p < 0.05, so there is no statistically significant difference in term usage between high value and low value questions.


Overall, the best plan of action to win at Jeopardy is to prepare with general knowledge as best you can. Hopefully the game masters provide a study guideline.