# Jeopardy

In this project, we work with a portion of the full Jeopardy dataset.  The full dataset is available __[here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)__.  We will look for patterns in this data that could help us win the game.

## Data

The data we have available in the file is as follows:

- `Show Number` - the episode number of the Jeopardy show
- `Air Date` - the date this episode aired on television
- `Round` - the round (Jeopardy!, Double Jeopardy!, Final Jeopardy!, Tiebreaker) in which the question appeared
- `Category` - the category of the question
- `Value` - the dollar value of the question
- `Question` - the text of the question
- `Answer` - the text of the correct answer

Each row in the data represents one of the questions that has been asked on Jeopardy.

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# read in the data
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# review the column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# some column names have leading spaces - let's fix this
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
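
Hard-coding the list works, but it has to be kept in sync with the file by hand. A minimal sketch of a more general fix, assuming the only problem is surrounding whitespace, strips each name instead. The `raw_columns` list below simply copies the names printed above; on the real DataFrame the equivalent would be `jeopardy.columns = jeopardy.columns.str.strip()`.

```python
# column names as printed by jeopardy.columns above
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# strip leading/trailing whitespace from every name
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```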

In [5]:
# review our data
jeopardy.describe(include='all')

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
count,19999.0,19999,19999,19999,19999,19999,19999
unique,,336,4,3581,76,19988,14963
top,,2007-11-13,Jeopardy!,TELEVISION,$400,[audio clue],Japan
freq,,62,9901,51,3892,5,22
mean,4312.730537,,,,,,
std,1374.121672,,,,,,
min,10.0,,,,,,
25%,3393.0,,,,,,
50%,4582.0,,,,,,
75%,5431.0,,,,,,


In [6]:
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

## Normalize Questions and Answers

The `Question` and `Answer` columns contain strings of text.  In order to be able to perform comparisons, we will need to normalize them so that differences in capitalization and punctuation don't cause the same word to be treated as two different words.

To handle this, let's write a function to normalize these two columns of data.  This function will do the following:

- Take in a string
- Convert the string to all lowercase
- Remove punctuation
- Return the string

We can then apply this function to both the `Question` and `Answer` columns.
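
As a quick check of the steps listed above, the sketch below applies them to a few sample strings. The function name `normalize_text` and the sample strings are illustrative only, and the regular expression is one possible definition of "punctuation" (anything that is not a letter, digit, or whitespace).

```python
import re

def normalize_text(text):
    # lowercase, then drop everything except letters, digits, and whitespace
    return re.sub(r'[^a-z0-9\s]', '', text.lower())

samples = ["McDonald's", "Signer of the Dec. of Indep.", "[audio clue]", "all-time"]
for s in samples:
    print(repr(s), '->', repr(normalize_text(s)))
```

Note that removing punctuation outright joins hyphenated words (`all-time` becomes `alltime`); replacing punctuation with a space would be an alternative if that matters for the comparison.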

In [7]:
# define the function to normalize the strings
import re
def normalize_string(string):
    my_string = string.lower()
    my_string = re.sub(r'[^A-Za-z0-9\s]', '', my_string)
    return my_string

In [8]:
# apply function to Question and Answer
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)

In [9]:
# review data
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Normalize Value and Air Date

We can also work on normalizing the `Value` and `Air Date` columns.  We would like `Value` to be numeric, and we would like `Air Date` to be a date/time.  Let's normalize these columns as follows:

For `Value`:
- Take in a string
- Remove punctuation
- Convert the string to an integer
- If the conversion fails, set to zero
- Return the integer

For `Air Date`, we can use the pandas `to_datetime` function to convert the column to a datetime type.
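
The following is a minimal sketch of the `Value` steps on sample strings, using only the standard library. `to_int_or_zero` is a hypothetical helper that keeps digits only, a slightly stricter variant of removing punctuation, and the sample values (including the non-numeric `'None'`) are made up for illustration. The last lines show the same idea for a date string with `datetime.strptime`, used here as a standard-library stand-in for `pd.to_datetime`.

```python
import re
from datetime import datetime

def to_int_or_zero(text):
    # keep digits only, then convert; fall back to 0 if nothing numeric is left
    digits = re.sub(r'[^0-9]', '', text)
    try:
        return int(digits)
    except ValueError:
        return 0

for raw in ['$200', '$2,000', 'None']:
    print(raw, '->', to_int_or_zero(raw))

# an ISO-formatted date string parses into a real datetime object
air_date = datetime.strptime('2004-12-31', '%Y-%m-%d')
print(air_date.year, air_date.month, air_date.day)
```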

In [10]:
# create a normalize function for Value
def normalize_value(text):
    value = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value

In [11]:
# apply this function to Value
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [12]:
# convert Air Date to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [13]:
# review data
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [14]:
# review data
jeopardy.describe(include='all')

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
count,19999.0,19999,19999,19999,19999,19999,19999,19999,19999,19999.0
unique,,336,4,3581,76,19988,14963,19987,14224,
top,,2007-11-13 00:00:00,Jeopardy!,TELEVISION,$400,[audio clue],Japan,audio clue,japan,
freq,,62,9901,51,3892,5,22,5,22,
first,,1984-09-21 00:00:00,,,,,,,,
last,,2012-01-19 00:00:00,,,,,,,,
mean,4312.730537,,,,,,,,,748.336267
std,1374.121672,,,,,,,,,653.988299
min,10.0,,,,,,,,,0.0
25%,3393.0,,,,,,,,,400.0


## Is the answer deducible from the question?

One thing we can investigate is whether the answer is deducible from the question.  We can see how many times words in the answer also occur as words in the question.

To do this, we will do the following:
- Split the `clean_question` and `clean_answer` on spaces
- Remove the word "the" from the answer, as it doesn't add any value
- Check how many of the remaining words in the answer occur in the question
- Divide this count by the number of remaining words in the answer

We can then evaluate the result to see what it tells us about how to prepare for Jeopardy questions.
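
To make the ratio concrete, here is a small sketch of the steps above applied to two question/answer pairs. The function name `overlap_fraction` and the sample strings are hypothetical, and it uses `split()` without an argument, which also tolerates repeated spaces.

```python
def overlap_fraction(clean_question, clean_answer):
    # words of the answer, ignoring 'the'
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0  # avoid dividing by zero for empty answers
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# 'canyon' is in the question, 'grand' is not: 1 of 2 answer words match
print(overlap_fraction('this canyon in arizona is a mile deep', 'the grand canyon'))  # 0.5
# the answer word does not occur in the question at all
print(overlap_fraction('the city of yuma in this state has a record', 'arizona'))  # 0.0
```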

In [15]:
# define a function to do the splits and counts from above
def deduce(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # remove all occurrences of the word 'the' from split_answer
    split_answer = [word for word in split_answer if word != 'the']
    # if answer length is zero, return zero (so don't divide by zero later)
    if len(split_answer) == 0:
        return 0
    
    # count matches
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    result = match_count / len(split_answer)
    return result

In [16]:
# apply our deduce function to each row of our data
jeopardy['answer_in_question'] = jeopardy.apply(deduce, axis=1)

In [17]:
# what is the average value of our answer_in_question column we created
jeopardy['answer_in_question'].mean()

0.059877607599993714

On average, only about 6% of the words in an answer also appear in the question, so we can't rely on just pulling the answer out of the question.  We will need to find other ways to answer the questions to win the game.

## Are questions repeats of old questions?

While the answer may not be just given to us in the question, if we study old questions that are asked, are we likely to find answers to current questions?  We can look at the dataset and find if common terms are used repeatedly.

To review this, we will do the following:

- Sort our dataset in order of `Air Date`
- Create a set to hold common terms
- Iterate through each row of our data
- Split `clean_question` into words, remove any words shorter than six characters, and check if each word is in the set
- Add to a counter for each repeated word, and add words to the set
- Divide the count by the number of remaining words in the question
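
The steps above can be sketched on a toy example. The rows below are hypothetical `(air_date, clean_question)` pairs, listed out of order on purpose so that the sorting step is visible; nothing here comes from the real dataset.

```python
from datetime import date

# hypothetical (air_date, clean_question) pairs, deliberately out of order
rows = [
    (date(2005, 1, 2), 'this president signed the declaration'),
    (date(2004, 12, 31), 'the declaration was signed in this building'),
    (date(2006, 3, 5), 'this building houses the president'),
]

terms_used = set()
overlaps = []
for air_date, question in sorted(rows):  # oldest question first
    # keep only words of six or more characters
    words = [w for w in question.split() if len(w) > 5]
    match_count = 0
    for w in words:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    overlaps.append(match_count / len(words) if words else 0)

print(overlaps)
```

The oldest question has no earlier terms to match, while each of the two later questions reuses two of its three long words.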

In [18]:
# create list of overlapping questions
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

In [19]:
jeopardy['question_overlap'].mean()

0.6925960057338647

On average, about 69% of the terms (words of six or more characters) in a question also appear in previous questions in our dataset.  It may be a good approach to study previous questions in order to be able to answer current questions.

However, note that this is only a portion of the full dataset, and we are only looking at repeated words, rather than phrases.  If we want to employ this as an approach, we should probably investigate the full dataset and incorporate phrases as well.

## High Value Questions

Could we focus our study on the topics that tend to be asked about in high value questions?  This would allow us to answer the questions that would generate more money, and therefore could help us win the game.

Let's break down the questions into low value (where value is less than or equal to \$800) and high value (where value is greater than \$800).  We can then loop through the terms we collected above and count how many high value and how many low value questions each term appears in.  With these observed counts, we can run a chi-squared test against the counts we would expect if the term were spread evenly across both groups.

We can then find the words with the biggest differences between high and low value questions.  As this is an example only and doing this for all of our words would take a very long time, we will do this for just a small sample of our words.
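
As a worked example of the arithmetic behind the test, the sketch below computes the statistic by hand for a term that appears in zero high value questions and one low value question. The group sizes (5,734 high value and 14,265 low value rows) are not printed anywhere in this notebook; they are an assumption, chosen because they add up to the 19,999 rows and reproduce the statistics in the chi-squared output. The p-value uses the closed form for one degree of freedom.

```python
import math

# assumed group sizes (not printed in the notebook; chosen because they
# reproduce the statistics in the chi-squared output)
high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

def chi_squared_for_word(high_observed, low_observed):
    # expected counts if the word were spread evenly across all questions
    total_prop = (high_observed + low_observed) / total_rows
    expected = [total_prop * high_value_count, total_prop * low_value_count]
    observed = [high_observed, low_observed]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # p-value for one degree of freedom (two categories)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_for_word(0, 1)
print(round(statistic, 4), round(p_value, 4))
```

With a single observation the expected counts are far below 1, which is one reason the resulting p-values should be read with caution.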

In [20]:
# create a function to differentiate between high and low value questions
def high_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [21]:
# apply the high_value function to our data
jeopardy['high_value'] = jeopardy.apply(high_value, axis=1)

In [22]:
# create a function to count word usage for high and low values
def word_counts(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [23]:
# get counts for our small sample
observed_expected = []
sample_words = list(terms_used)[:20]
for word in sample_words:
    observed_expected.append(word_counts(word))

observed_expected

[(2, 4),
 (1, 0),
 (0, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 7),
 (1, 0),
 (0, 6),
 (0, 2),
 (0, 1),
 (0, 3),
 (0, 1),
 (2, 3),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1)]

In [24]:
# calculate chi-squared values
from scipy.stats import chisquare

count = jeopardy.shape[0]
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []

for i in observed_expected:
    total = sum(i)
    total_prop = total / count
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([high_value_expected, low_value_expected])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.813739922888188, pvalue=0.09346026076900309),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.411777076761304, pvalue=0.120425590069509),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResul

None of the terms in our sample had a statistically significant difference between high and low value questions.  Note that all of the terms in our sample had relatively low frequencies (all were less than 10), and the chi-squared test is not reliable when counts are this small, so it may make more sense to focus on those terms that are most commonly used.
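
One possible way to pick the most commonly used terms is sketched below with `collections.Counter`. The `clean_questions` list is a hypothetical stand-in for the `clean_question` column, and counting each term once per question is an assumption made so that the counts line up with the per-question counts used in the test.

```python
from collections import Counter

# hypothetical cleaned questions standing in for jeopardy['clean_question']
clean_questions = [
    'this country borders france and germany',
    'this country is the largest in africa',
    'the capital of this african country is nairobi',
]

term_counts = Counter()
for question in clean_questions:
    # count each term once per question, keeping words of six or more characters
    term_counts.update({w for w in question.split() if len(w) > 5})

# the most frequent term and the number of questions it appears in
print(term_counts.most_common(1))
```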