### Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. If you need help at any point, you can consult our solution notebook here.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

In [2]:
import pandas as pd

In [5]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- **Show Number** -- the Jeopardy episode number of the show this question was in.
- **Air Date** -- the date the episode aired.
- **Round** -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- **Category** -- the category of the question.
- **Value** -- the number of dollars answering the question correctly is worth.
- **Question** -- the text of the question.
- **Answer** -- the text of the answer.

In [20]:
jeopardy.columns.tolist()

['Show Number',
 ' Air Date',
 ' Round',
 ' Category',
 ' Value',
 ' Question',
 ' Answer']

The columns have leading whitespace in most of them, we will remove them

In [28]:
columns = []
for i in jeopardy.columns:
    columns.append(i.strip())

In [29]:
jeopardy.columns = columns

In [32]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalization

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns)

In [47]:
import re
    
def normalize(text):
    string_lower = text.lower()
    out = re.sub("[^A-Za-z0-9\s]", "", text)
    return  out

In [49]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

In [50]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

Now that we've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow us to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable us to work with it more easily.

In [51]:
def normalize_value(text):
    try:
        res = int(re.sub("[^A-Za-z0-9\s]", "", text))
    except:
        res = 0

    return res 

In [54]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [56]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [57]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

### Studying strategy

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (> 6 characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

#### Answers in questions

In [63]:
def count_matches(s):
    split_answer = s['clean_answer'].split(" ")
    split_question = s['clean_question'].split(" ")
    
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
        
    if len(split_answer) == 0:
        return 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

In [64]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [65]:
jeopardy['answer_in_question'].mean()

0.044881387009743423

**Findings:** 0.044 is the mean of how often the answer is deducible from the question

#### Recycled questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    - If it does, increment a counter.
    - Add each word to terms_used.

This will enable us to check if the terms in questions have been used previously or not. Only looking at words greater than 6 characters enables us to filter out words like the and than, which are commonly used, but don't tell us a lot about a question.

In [67]:
question_overlap = []
terms_used = set([])

In [69]:
for i, row in jeopardy.iterrows():
    
    split_question = row['clean_question'].split(" ")
    for i in split_question:
        if len(i) < 6:
            split_question.remove(i)

    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
        
    if len(split_question) > 0:
        match_count / len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()


6.6992349617480871

**Findings:** not much of the old question being recycled, so studying old question is not giving us any edge

#### Low value vs High value questions

In [74]:
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    
    return value

In [75]:
jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [80]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
            
    return high_count, low_count
        

In [81]:
observed_expected = []
comparison_terms = list(terms_used)[:5]

In [82]:
for term in comparison_terms:
    observed_expected.append(count_usage(term))

In [83]:
observed_expected

[(1417, 3301), (21, 78), (0, 1), (2, 7), (2, 0)]

In [99]:
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 1])

chi_squared = []
for i in observed_expected:
    total = i[0] + i[1] 
    #total = sum(i)

    total_prop = total/jeopardy.shape[0]
    high_val_exp = total_prop * high_value_count
    low_val_exp = total_prop * low_value_count
    
    observed = np.array([i[0], i[1]])
    expected = np.array([high_val_exp, low_val_exp])

    chi_squared.append(chisquare(observed, expected))

In [100]:
chi_squared

[Power_divergenceResult(statistic=2809.1097440640356, pvalue=0.0),
 Power_divergenceResult(statistic=88.64664619618398, pvalue=4.7202623898421272e-21),
 Power_divergenceResult(statistic=2.0612207886292468, pvalue=0.15108908362537679),
 Power_divergenceResult(statistic=7.7000782886100074, pvalue=0.005521843038654611),
 Power_divergenceResult(statistic=4.1224415772584937, pvalue=0.042317962210945609)]

**Findings:** The only significant chi-square result between high values and low values rows is the first