# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. 

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win.

**Project Goal**
In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

This project is part of the guided project series on [Dataquest](http://dataquest.io/).

## Jeopardy Questions

Let's explore the Jeopardy questions. The dataset contains the following columns:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

Let's start our exploration by loading the data and reviewing the dataset.

In [54]:
import pandas as pd
from string import punctuation
import re
from scipy.stats import chisquare

# read in the dataset
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

We can see that most of the column names start with an extra space. Let's remove the spaces and keep only the names.

In [55]:
# remove all spaces from the column names
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
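
Note that replacing `' '` removes every space, so `Show Number` becomes `ShowNumber`. If only the stray leading space should go, stripping is an alternative. The sketch below uses plain Python on the column names printed above; with pandas the equivalent call would be `jeopardy.columns.str.strip()`. The rest of this project keeps the space-free names, so this is shown for comparison only.

```python
# Column names as printed by the first cell (note the leading spaces)
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# Approach used above: drop every space, including the ones inside a name
no_spaces = [name.replace(' ', '') for name in raw_columns]

# Alternative: strip only leading and trailing whitespace
stripped = [name.strip() for name in raw_columns]

print(no_spaces[:2])   # ['ShowNumber', 'AirDate']
print(stripped[:2])    # ['Show Number', 'Air Date']
```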

In [56]:
jeopardy.shape

(19999, 7)

In [57]:
jeopardy.head()

| | ShowNumber | AirDate | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


Our dataset contains 19,999 questions in 7 columns.

## Normalizing Text

Before we start our analysis, we need to normalize the text: convert it to lowercase and remove punctuation. There are two ways to go about this. Let's see which one is faster.

In [58]:
# function to normalize the text using python string library
def normalize_text(s):
    s = str(s).lower()
    s = s.translate(str.maketrans('', '', punctuation))
    return s

# function to normalize the text using regex
def normalize_text_reg(s):
    s = str(s).lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    return s

s = 'Hello.?! World.*'
%timeit normalize_text(s)
%timeit normalize_text_reg(s)

4.93 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.63 µs ± 460 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


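We can observe that the two functions perform about the same: the regex version is marginally faster on average (4.63 µs vs 4.93 µs), but the difference is smaller than the standard deviation of the measurements. We will use the regex version to normalize our `Question` and `Answer` columns.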
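
The two functions are not interchangeable on every input. `str.translate` removes only the characters listed in `string.punctuation`, while the regular expression removes everything that is not an ASCII letter, digit, or whitespace, which includes accented letters. A small check on a made-up string (the sample text is an assumption for illustration):

```python
import re
from string import punctuation

def normalize_text(s):
    # lowercase, then delete ASCII punctuation characters
    return str(s).lower().translate(str.maketrans('', '', punctuation))

def normalize_text_reg(s):
    # lowercase, then delete anything that is not an ASCII letter, digit or whitespace
    return re.sub(r"[^A-Za-z0-9\s]", "", str(s).lower())

# \u00e9 is an "e" with an acute accent
sample = "Ferde Grof\u00e9's Caf\u00e9!"

translated = normalize_text(sample)    # keeps the accented letters
regexed = normalize_text_reg(sample)   # drops the accented letters

print(translated)
print(regexed)
```

For English-only text the results are the same, so the choice matters mainly for names and loanwords.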

In [59]:
# normalize our columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text_reg)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text_reg)

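The output below was captured after the later cells had been run (note the cell number), so the rows are already sorted by air date and the table already includes columns that are created further down.

In [77]:
jeopardy.head()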

| | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | question_overlap | high_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19325 | 10 | 1984-09-21 | Final Jeopardy! | U.S. PRESIDENTS | | Adventurous 26th president, he was 1st to ride... | Theodore Roosevelt | adventurous 26th president he was 1st to ride ... | theodore roosevelt | 0 | 0.0 | 0.0 | 0 |
| 19301 | 10 | 1984-09-21 | Double Jeopardy! | LABOR UNIONS | $200 | Notorious labor leader missing since '75 | Jimmy Hoffa | notorious labor leader missing since 75 | jimmy hoffa | 200 | 0.0 | 0.0 | 0 |
| 19302 | 10 | 1984-09-21 | Double Jeopardy! | 1789 | $200 | Washington proclaimed Nov. 26, 1789 this first... | Thanksgiving | washington proclaimed nov 26 1789 this first n... | thanksgiving | 200 | 0.0 | 0.0 | 0 |
| 19303 | 10 | 1984-09-21 | Double Jeopardy! | TOURIST TRAPS | $200 | Both Ferde Grofe' & the Colorado River dug thi... | the Grand Canyon | both ferde grofe the colorado river dug this ... | the grand canyon | 200 | 0.0 | 0.5 | 0 |
| 19304 | 10 | 1984-09-21 | Double Jeopardy! | LITERATURE | $200 | Depending on the book, he could be a "Jones", ... | Tom | depending on the book he could be a jones a sa... | tom | 200 | 0.0 | 0.0 | 0 |


## Normalizing Columns

Now that we have normalized the string columns, let's also normalize the `Value` and `AirDate` columns. We will remove the $ sign from the `Value` column and convert the column to numeric; we will also convert `AirDate` to datetime.

In [60]:
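# function to normalize value column: strip non-alphanumeric characters, then parse
def normalize_value(s):
    s = re.sub(r"[^A-Za-z0-9\s]", "", str(s))
    try:
        s = int(s)
    except ValueError:
        # non-numeric or missing values count as 0
        s = 0
    return s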

# clean value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
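
A quick way to sanity-check the value cleaning is to run it on a few representative inputs. The sketch below is a self-contained variant of the function above (it converts its input with `str` first so that missing values do not raise). The sample inputs are assumptions about what the `Value` column can contain; they are not taken from the dataset.

```python
import re

def normalize_value(s):
    # keep only letters, digits and whitespace, then try to parse an integer
    s = re.sub(r"[^A-Za-z0-9\s]", "", str(s))
    try:
        return int(s)
    except ValueError:
        # anything that is not a plain number is treated as 0
        return 0

# hypothetical inputs: plain amount, thousands separator, placeholder text, missing value
samples = ["$200", "$2,000", "None", float("nan")]
cleaned = [normalize_value(v) for v in samples]
print(cleaned)  # [200, 2000, 0, 0]
```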

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by checking how many words in the answer also occur in the question. We can answer the second by checking how often the meaningful words in a question (everything except stopwords) have already appeared in earlier questions. Let's tackle the first question first.

In [61]:
# function to count words from answer in the question
def count_terms(row):
    # split the column values by space
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()    
    
    # initialize count of words
    match_count = 0
    
    # remove 'the' from the answer (first occurrence only)
    if 'the' in split_answer:
        split_answer.remove('the')
    
    # avoid dividing by zero when no answer words are left
    if len(split_answer) == 0:
        return 0
    
    # count occurrences of answer words in the question
    for i in split_answer:
        if i in split_question:
            match_count += 1
    
    # return the fraction of answer words found in the question
    return match_count / len(split_answer)
    
# count the words
jeopardy['answer_in_question'] = jeopardy.apply(count_terms, axis=1)

# calculate the mean
jeopardy['answer_in_question'].mean()

0.05900196524977763

We can see that, on average, only about 6% of the words in an answer also appear in its question. This tells us that it's quite difficult to deduce the answer from the question.
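
To make the metric concrete, here is the same calculation on two question/answer pairs. The function mirrors `count_terms` but takes plain strings; its name, `answer_overlap`, is an assumption for illustration. The first pair is made up, and the second is taken from the `head()` output shown earlier.

```python
def answer_overlap(clean_question, clean_answer):
    # fraction of answer words (ignoring the first 'the') found in the question
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# made-up pair: both remaining answer words appear in the question
full = answer_overlap("this grand canyon was dug by the colorado river",
                      "the grand canyon")

# pair from the dataset preview: no answer word appears in the question
none = answer_overlap("notorious labor leader missing since 75", "jimmy hoffa")

print(full, none)  # 1.0 0.0
```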

## Recycled Questions

Let's move on to our second question, where we need to find out how often new questions are repeats of older ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

We should also remove uninformative words (i.e. stopwords) to get a more meaningful measure of question overlap. For this we will use the `nltk` package's list of English stopwords.

In [86]:
from nltk.corpus import stopwords

question_overlap = []
terms_used = set()
stop_words = set(stopwords.words('english'))

# sort jeopardy rows by air date in ascending order
jeopardy.sort_values('AirDate', inplace=True)

# iterate through rows
for i, row in jeopardy.iterrows():
    # split the question by space
    split_question = row['clean_question'].split()
    
    # remove stopwords (keep only the informative words)
    split_question = [word for word in split_question if word not in stop_words]
    
    # initialize count
    match_count = 0
    
    # loop through words in split_question: count matches
    for w in split_question:
        if w in terms_used:
            match_count += 1

        # add words to terms used set
        terms_used.add(w)
            
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.7919779258717805

Our results show that, on average, about 79% of the non-stopword terms in a question have already appeared in an earlier question. This does not mean that 79% of questions are recycled: the measure looks at single terms rather than phrases or whole questions, and we only have a sample of the full dataset. It does suggest that question recycling is worth a closer look, and that spending some time reviewing past questions might help us prepare for new ones.
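
Term-level overlap can be high even when no question is repeated. In the toy example below, none of the three made-up questions repeats an earlier one, yet the later ones still score a noticeable overlap because individual words recur. The questions and the tiny stop word set are assumptions for illustration; the real analysis uses the `nltk` list.

```python
# hypothetical, already-normalized questions in air-date order
questions = [
    "this river flows through egypt",
    "this country borders egypt and libya",
    "the longest river in this country",
]
stop_words = {"this", "the", "in", "and", "through"}  # stand-in for nltk's list

terms_used = set()
overlaps = []
for question in questions:
    words = [w for w in question.split() if w not in stop_words]
    # count the words of this question that were already seen
    seen = 0
    for w in words:
        if w in terms_used:
            seen += 1
        terms_used.add(w)
    overlaps.append(seen / len(words) if words else 0)

print(overlaps)
```

The third question reuses "river" and "country" and so scores two thirds, although it asks something new.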

## Low value vs high value questions

We can also look at which questions carry a high dollar value. Focusing on those could help us earn more money on Jeopardy.
We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

Let's write the code which will help us identify value pairs for terms.

In [63]:
# flag a question as high value (1) if it is worth more than 800, else low value (0)
def calculate_value(row):
    value = 1 if row['clean_value'] > 800 else 0
    return value

# determine questions with high value
jeopardy['high_value'] = jeopardy.apply(calculate_value, axis=1)

# function to calculate value pairs (high, low count) for any word
def calc_word(word):
    high_count = 0
    low_count = 0
    
    for index, row in jeopardy.iterrows():
        q = row['clean_question'].split()
        if word in q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [73]:
# empty dictionary to hold the word and 
# counts of high and low value questions the term appears in
observed_expected = {}

# sample 10 terms to run the value function on
# (set order is arbitrary, so the sample can differ between runs)
comparison_terms = list(terms_used)[:10]

# calculate value pairs (high and low count) for each word in the sample
for word in comparison_terms:
    observed_expected[word] = calc_word(word)

In [74]:
observed_expected

{'belushis': (1, 0),
 'functions': (2, 2),
 'brobdingnagian': (0, 1),
 'daddyo': (1, 0),
 'hrefhttpwwwjarchivecommedia20091117dj01jpg': (0, 1),
 'vereen': (1, 0),
 '62foottall': (0, 1),
 'galaxys': (0, 1),
 'seducing': (1, 0),
 'blaisdell': (0, 2)}
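
`calc_word` scans the whole DataFrame once per term, which becomes slow for more than a handful of terms. One alternative is to tally every word in a single pass with `collections.Counter`. The sketch below works on a small made-up list of `(clean_question, high_value)` pairs rather than on the DataFrame; the rows and the helper name `count_all_words` are assumptions for illustration.

```python
from collections import Counter

def count_all_words(rows):
    # rows: iterable of (clean_question, high_value) pairs
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # use a set so a word is counted once per question, as in calc_word
        for word in set(question.split()):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return high_counts, low_counts

# hypothetical rows
rows = [
    ("this king ruled england", 1),
    ("this king of pop", 0),
    ("england won this cup", 0),
]
high_counts, low_counts = count_all_words(rows)

print(high_counts["king"], low_counts["king"])  # 1 1
print(high_counts["cup"], low_counts["cup"])    # 0 1
```

With the DataFrame, `zip(jeopardy['clean_question'], jeopardy['high_value'])` would supply the pairs.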

## Applying chi-squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [75]:
# find total sum of high and low value questions
high_value_count = jeopardy["high_value"].sum()
low_value_count = jeopardy.shape[0] - high_value_count

# dictionary to hold chisquare and p-value for each term
chi_squared = {}

# loop through the dictionary 
for key, value in observed_expected.items():
    
    # calculate total value of high and low
    total = value[0] + value[1]
    
    # calculate word occurrence proportion across the dataset
    total_prop = total / jeopardy.shape[0]
    
    # find expected counts for high and low value rows
    high_exp, low_exp = total_prop * high_value_count, total_prop * low_value_count
    
    # calculate chi-squared statistic value and associated p-value
    chi_squared[key] = chisquare(value, (high_exp, low_exp))

In [76]:
chi_squared

{'belushis': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 'functions': Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483469),
 'brobdingnagian': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'daddyo': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 'hrefhttpwwwjarchivecommedia20091117dj01jpg': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'vereen': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 '62foottall': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'galaxys': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'seducing': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 'blaisdell': Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)}

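The results of the chi-squared test show that none of the sample terms is statistically significant at the $p = 0.05$ level, so we cannot say that any of them corresponds to high-value (or low-value) questions. Note that each of these terms occurs only once to four times, so the expected counts are well below 5, the usual rule of thumb for the chi-squared test to be reliable. The test would be more informative on terms that appear more often.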
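
For reference, the statistic and p-value for a single term can be reproduced by hand. The sketch below uses only the standard library. The totals (20,000 questions, 6,000 of them high value) are round hypothetical numbers, not the counts from this dataset, and the function name is an assumption. With two categories the test has one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
from math import erfc, sqrt

def chi_squared_for_term(high_obs, low_obs, high_total, low_total):
    # expected counts if the term were spread in proportion to the totals
    share = (high_obs + low_obs) / (high_total + low_total)
    high_exp, low_exp = share * high_total, share * low_total
    statistic = ((high_obs - high_exp) ** 2 / high_exp
                 + (low_obs - low_exp) ** 2 / low_exp)
    # survival function of the chi-squared distribution with 1 degree of freedom
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# a term seen once, in a high-value question (hypothetical totals)
statistic, p_value = chi_squared_for_term(1, 0, 6000, 14000)
print(round(statistic, 4), round(p_value, 4))
```

Under these assumed totals, a term that occurs only once cannot reach $p < 0.05$ whichever group it falls in, which illustrates why rare terms tell us little.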