# Patterns of Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20,000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). 

In [49]:
import pandas as pd
file = pd.read_csv('jeopardy.csv')

In [50]:
file.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [51]:
file.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [52]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


We see there are 7 columns present in the dataset:
- Show Number: the Jeopardy episode number
- Air Date: the date the episode aired
- Round: the round of Jeopardy
- Category: the category of the question
- Value: the number of dollars the correct answer is worth
- Question: the text of the question
- Answer: the text of the answer

Some of the column names also have a leading space, so let's strip those.

In [53]:
file.columns = file.columns.str.strip() # Removes white spaces

### Normalize Question, Answer and Value Columns
Before analysis, we need to normalize all of the text columns by transforming values to lower case and removing punctuation. 

In [54]:
import string
def normalize(inputs):
    inputs = inputs.lower()  # Convert to lower case
    # str.maketrans('', '', string.punctuation) builds a translation table;
    # its third argument is the string of characters to delete.
    # translate() applies that table, removing all punctuation.
    inputs = inputs.translate(str.maketrans('', '', string.punctuation))
    
    return inputs

In [55]:
file['clean_question'] = file['Question'].apply(normalize)
file['clean_answer'] = file['Answer'].apply(normalize)

In [56]:
file[['Question','Answer','clean_question','clean_answer']]

Unnamed: 0,Question,Answer,clean_question,clean_answer
0,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
5,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant
6,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ita...,the appian way
7,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan
8,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington
9,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel


### Normalizing other columns
The Value column should be numeric for easier analysis, so we need to remove the dollar sign (and any thousands separator) from each row and convert the text to a number. The Air Date column should also be a datetime rather than a string.

In [57]:
file['Value'].value_counts()[:20]

$400      3892
$800      2980
$200      2784
$600      1890
$1000     1796
$2000     1074
$1200     1069
$1600     1027
$100       804
$500       798
$300       764
None       336
$1,000     184
$2,000     149
$3,000      70
$1,500      50
$1,200      42
$4,000      32
$5,000      23
$1,800      22
Name: Value, dtype: int64

We can see that the value 'None' appears 336 times. Let's change it to zero so that the column is numeric and consistent with the other values.

In [58]:
def normalize_value(inputs):
    inputs = inputs.lower()
    inputs = inputs.translate(str.maketrans('','', string.punctuation))
    if inputs == 'none': #If the value is none, transform it to 0
        inputs = '0'
    inputs = int(inputs)
    return inputs

In [59]:
file['clean_value'] = file['Value'].apply(normalize_value)

In [60]:
file['clean_value'].value_counts()[0] #Prints counts of 0 value

336

This worked correctly: all 336 'None' values were transformed to 0.
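As a hedged alternative, the same conversion can be sketched by stripping only the dollar sign and the thousands separator instead of all punctuation. The helper name `parse_value` is hypothetical and not part of the original notebook.

```python
def parse_value(raw):
    """Convert a value string such as '$1,000' or 'None' to an int."""
    cleaned = raw.replace('$', '').replace(',', '').strip()
    # Treat the literal string 'None' (any case) or an empty string as 0
    if cleaned.lower() in ('none', ''):
        return 0
    return int(cleaned)

# Sample strings taken from the value counts shown above
examples = ['$200', '$1,000', 'None']
parsed = [parse_value(v) for v in examples]
print(parsed)
```

It could be applied with `file['Value'].apply(parse_value)`, which should produce the same integers as `normalize_value` for the value formats listed above.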

In [61]:
file['Air Date'] = pd.to_datetime(file['Air Date'])
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


We can see that the clean_value column is now an integer, and the Air Date column has been changed to datetime.

### Does the answer come through the question?

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second by seeing how often longer words (six or more characters) recur. We'll work on the first question and come back to the second.

In [62]:
def matching(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split() #Separate by spaces
    if 'the' in split_answer:
        split_answer.remove('the')  # 'the' is too common to be informative (removes the first occurrence only)
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer: #If word is in both answer and question
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [63]:
file['answer_in_question'] = file.apply(matching, axis = 1) 
#Axis = 1 to make sure we apply it to rows
mean_count = file['answer_in_question'].mean()
#Average across whole column
print('The mean count of the column answer_in_question is:',
     mean_count)

The mean count of the column answer_in_question is: 0.058861482035140716


On average, only about 6% of the words in an answer also appear in its question. This means that hearing the question alone is unlikely to give the answer away. We'll need to look further.
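To make the metric concrete, here is a minimal sketch of the same calculation on two made-up rows. The function name `answer_overlap` and the sample strings are illustrative assumptions; the logic follows the `matching` function above but takes plain strings rather than a DataFrame row.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') found in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized rows for illustration only
no_match = answer_overlap('this state is home to the city of yuma', 'arizona')
half_match = answer_overlap('this appian road linked rome to the south',
                            'the appian way')
print(no_match, half_match)
```

A score of 0 means no answer word appears in the question, and 0.5 means half of them do, so a column mean near 0.06 indicates that most rows look like the first case.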

### Repeated questions
Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, we can:
- Sort by ascending air date
- Maintain a set called terms_used that will be empty initially
- Iterate through each row
- Split clean_question, remove words shorter than six characters, and check whether each remaining word is already in terms_used
    - If so, increment a counter
    - Add each word to terms_used
    
This allows you to check whether the terms in a question have been used previously. Only looking at words with six or more characters filters out words like "the" and "than", which are commonly used but don't tell you a lot about a question.
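The procedure can be sketched on a toy example first. The three questions below are made up and assumed to be already normalized and sorted by air date; only the loop structure mirrors the approach described here.

```python
# Made-up, already-normalized questions in chronological order
questions = [
    'this planet is closest to the sun',
    'the largest planet in the solar system',
    'galileo observed the moons of this planet',
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words with six or more characters
    long_words = [w for w in question.split() if len(w) > 5]
    match_count = 0
    for w in long_words:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    # Share of this question's long words that were seen earlier
    overlaps.append(match_count / len(long_words) if long_words else 0)

print(overlaps)
```

The first question has no overlap because nothing has been seen yet, while the later two each reuse "planet", which is one of their three long words.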

In [64]:
question_overlap = []
terms_used = set()

file = file.sort_values('Air Date')

for i, row in file.iterrows():
    split_question = row['clean_question'].split()
    split_question = [value for value in split_question if len(value) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)

file['question_overlap'] = question_overlap

print(file['question_overlap'].mean())

0.6889055316620328


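On average, about 69% of the longer words in a question have already appeared in an earlier question. This only measures single-word overlap, not repeated phrases or whole questions, so it could be inconsequential, but it leads me to believe there is more to look at in comparing questions.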

### The big money questions
Let's say you only want to study high-value questions instead of low-value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:
- Low value: any row where Value is 800 or less
- High value: any row where Value is greater than 800

You'll then be able to loop through each of the terms in terms_used, and:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts. 
- Compute the chi squared value based on expected counts and observed counts for high and low value questions. 

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
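As a worked illustration of the expected-count and chi-squared steps, here is a minimal pure-Python sketch for a single word. The function name and the counts (300 high-value and 700 low-value questions) are hypothetical and chosen only to keep the arithmetic easy; the p-value uses the closed form for one degree of freedom, which is what a two-category test has.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one word."""
    total_questions = high_total + low_total
    # Proportion of all questions that contain the word
    word_prop = (observed_high + observed_low) / total_questions
    # Counts we would expect if the word were spread evenly across both groups
    expected_high = word_prop * high_total
    expected_low = word_prop * low_total
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical word: appears in 4 high-value and 6 low-value questions
statistic, p_value = chi_squared_two_groups(4, 6, 300, 700)
print(statistic, p_value)
```

Here the expected counts are 3 and 7, so the statistic is 1/3 + 1/7, which is small: the word's usage is close to what the group sizes alone would predict.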

In [65]:
def greater_800(row):
    binary = 0
    if row['clean_value'] > 800:
        binary = 1
    return binary    
# Returns 1 for a high-value question (more than $800), otherwise 0

In [66]:
file['high_value'] = file.apply(greater_800, axis = 1)

In [67]:
def counting(word):
    low_count = 0
    high_count = 0
    for i, row in file.iterrows():
        question = row['clean_question'].split()
        if word in question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [68]:
import random
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
comparison_terms = random.sample(list(terms_used), 10)  # Retrieves 10 random words

In [69]:
observed_expected = []

for word in comparison_terms:
    observed_expected.append(counting(word)) 
#Counts how many times a word was in a high value question vs low value

In [70]:
observed_expected

[(1, 0),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 0),
 (4, 6),
 (0, 1),
 (1, 1),
 (2, 4)]

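Each tuple is (high_count, low_count). Most of the sampled words appear only once in the whole dataset, and only two appear more than twice, so the counts we have to work with are very small.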

### Chi Squared Values
Now we can compute the chi-squared values from our observed counts and the calculated expected counts.

In [71]:
high_value_count = len(file[file['high_value'] == 1])
low_value_count = len(file[file['high_value'] == 0])

In [72]:
from scipy.stats import chisquare
import numpy as np

expected = []
chi_squared = []

for observations in observed_expected:
    total = sum(observations)
    total_prop = total / len(file)
    
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([observations[0],observations[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.6275336335698622, pvalue=0.42826143908800296),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781)]

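Most of the observed counts are below 5, which makes the chi-squared test unreliable, and none of the p-values is below the conventional 0.05 threshold. So none of these words shows a significant difference in usage between high- and low-value questions. Restricting the test to words that appear more often (at least 5 to 10 times) would make it more meaningful.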
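One way to pick such words, sketched here under assumptions, is to count how many questions each long word appears in and keep only those above a threshold. The function name `frequent_terms`, its default thresholds, and the sample questions are hypothetical.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    """Return long words that occur in at least min_count questions."""
    counts = Counter()
    for question in clean_questions:
        # Use a set so a word is counted once per question
        long_words = {w for w in question.split() if len(w) >= min_length}
        counts.update(long_words)
    return sorted(word for word, n in counts.items() if n >= min_count)

# Made-up questions for illustration only
sample = ['the capital of france'] * 5 + ['a famous french painter'] * 2
selected = frequent_terms(sample)
print(selected)
```

With the notebook's data this could be called as `frequent_terms(file['clean_question'])`, and the resulting list could replace the random sample passed to `counting`.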