## Winning Jeopardy 

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, [available here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). 

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
#inspect the column names (most of them have a leading space)
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
#strip the stray whitespace from the column names
jeopardy = jeopardy.rename(columns=lambda x: x.strip())
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalization
Before doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). We want to ensure that we lowercase words and remove punctuation so `Don't` and `don't` aren't considered to be different words when we compare them.

In [4]:
import re
#normalization function: lowercase & remove punctuation
def normalize(string):
    string = string.lower()
    #raw string so that \w and \s reach the regex engine unchanged
    string = re.sub(r'[^\w\s]', '', string)
    return string

#test
a = 'Poo/.'
normalize(a)

'poo'

In [5]:
#normalize the question & answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [6]:
#turn the Value column into an integer; non-numeric entries such as "None" become 0
def turnumeric(string):
    #strip "$", "," and any other punctuation
    s = re.sub(r'[^\w\s]', '', string)
    try:
        n = int(s)
    except ValueError:
        n = 0
    return n

#test
a = "None"
b ='$2,000'
print(turnumeric(a), turnumeric(b))

0 2000


In [7]:
jeopardy['clean_value'] = jeopardy['Value'].apply(turnumeric)

In [8]:
#convert air date to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

#test
jeopardy['Air Date'].head(1)

0   2004-12-31
Name: Air Date, dtype: datetime64[ns]

## Picking the topic to study

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (six or more characters) recur.
We can answer the first question by seeing how many of the words in the answer also occur in the question.

We'll work on the first question now, and come back to the second.
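
To make the first measure concrete, here is a minimal sketch on two made-up question/answer pairs. The strings and the helper name `answer_overlap` are illustrative assumptions, not rows or code from the notebook. It computes the share of answer words (ignoring `the`) that also appear in the question.

```python
def answer_overlap(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0.0  # avoid dividing by zero for empty answers
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# hypothetical, already-normalized examples
no_hint = answer_overlap('the city of yuma is in this state', 'arizona')
partial_hint = answer_overlap('the yankees play in this new york borough', 'new york city')
print(no_hint, partial_hint)
```

A value near 0 means the question gives away almost none of the answer's words, while a value near 1 means the answer is nearly spelled out in the question.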

In [9]:
def splitter(row):
    #split the cell by white space into list
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    #remove 'the' (list.remove drops only the first occurrence)
    if 'the' in split_answer:
        split_answer.remove('the')
        
    #prevent division by 0 later
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    #check how many words in answer occur in the question
    for word in split_answer:
        if word in split_question:
            match_count +=1
            
    #return the proportion of answer words that also appear in the question
    return match_count/len(split_answer)

In [10]:
#create a column with the proportional value
jeopardy['answer_in_question'] = jeopardy.apply(splitter, axis=1)

In [11]:
#find mean answer in question
mean_answer_in_q = jeopardy['answer_in_question'].mean()
mean_answer_in_q

0.059001965249777744

On average, only about 6% of the words in an answer also appear in its question (a mean proportion of ~0.06). The answer can therefore rarely be deduced from the question alone. We'll have to study.

## New questions as repeats of old questions

We want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate.

As a proxy for question overlap, we can use the occurrence of complex/long words.

In [12]:
jeopardy_sorted = jeopardy.sort_values('Air Date')

In [13]:
question_overlap = []
terms_used = set()

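#NOTE: this loop iterates over jeopardy (file order), not jeopardy_sorted
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    
    #remove short words
    #NOTE: removing items while iterating over the same list can skip the item after each removal
    for word in split_question:
        if len(word) < 6:
            split_question.remove(word)
    
    match_count = 0 
    
    #add all long words to a total set of words
    for word in split_question:
        terms_used.add(word)
        
        #increment the counter of how many times the same word has been used before
        #NOTE: the word was added to terms_used just above, so this test is always true
        if word in terms_used:
            match_count += 1
        
        #calc an avg of prev used words contained in a question
        #NOTE: this division runs once per word, not once per question
        if len(split_question) > 0:
            match_count = match_count/len(split_question)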
      
    question_overlap.append(match_count)

In [14]:
jeopardy['question_overlap'] = question_overlap
mean_overlap = jeopardy['question_overlap'].mean()

print(mean_overlap)

0.18957336291968835


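The mean overlap between questions is around 19%. Taken at face value, this suggests roughly a 1-in-5 chance that a new question is related to something that has already been asked about. The figure is only a rough proxy, though: the loop adds each word to `terms_used` before checking for it and rescales `match_count` after every word, so it does not measure the reuse of earlier terms directly. Studying topics covered in past questions may still increase our chances of doing well in Jeopardy.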
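
For comparison, here is a minimal sketch of one way to measure reuse of earlier terms more directly. It assumes the questions are already normalized and sorted oldest first. The function name `overlap_with_earlier` and the sample strings are hypothetical, and the sketch has not been run on `jeopardy.csv`, so it says nothing about what the real figure would be.

```python
def overlap_with_earlier(clean_questions, min_len=6):
    """For each question (oldest first), return the share of its long words
    that already appeared in an earlier question."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        # build a new list instead of removing items while iterating
        long_words = [w for w in question.split() if len(w) >= min_len]
        # test membership BEFORE adding this question's words
        matches = sum(1 for w in long_words if w in terms_seen)
        terms_seen.update(long_words)
        # divide once per question; questions without long words score 0
        overlaps.append(matches / len(long_words) if long_words else 0.0)
    return overlaps

# hypothetical normalized questions, oldest first
sample = [
    'galileo spent years under house arrest',
    'galileo observed jupiter through a telescope',
    'this planet is called jupiter',
]
overlaps = overlap_with_earlier(sample)
print(overlaps)
```

The first question can never overlap with anything, so its score is 0; later questions score higher as the set of seen terms grows.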

## High value questions
Let's say we only want to study terms that tend to appear in high-value questions rather than low-value ones. This will help us earn more money on Jeopardy. We can figure out which terms correspond to high-value questions using a chi-squared test.

To do it we will divide questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.

We'll then loop through each of the terms from terms_used, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.
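
The steps above can be sketched on toy data as follows. This is an illustration under stated assumptions, not the notebook's own code: the helper `chi_squared_for_word` and the sample rows are made up, both value groups are assumed to be non-empty, and the p-value uses the closed form for one degree of freedom instead of calling `scipy`.

```python
import math

def chi_squared_for_word(word, rows):
    """rows: list of (clean_question, high_value) pairs, high_value is 0 or 1.
    Returns (statistic, p_value) for the word's high/low split."""
    high_total = sum(1 for _, high in rows if high == 1)
    low_total = len(rows) - high_total

    # observed: number of high/low value questions that contain the word
    high_count = sum(1 for q, high in rows if high == 1 and word in q.split())
    low_count = sum(1 for q, high in rows if high == 0 and word in q.split())
    if high_count + low_count == 0:
        return float('nan'), float('nan')  # word never occurs, nothing to test

    # expected: split the word's occurrences by the overall high/low shares
    total_prop = (high_count + low_count) / len(rows)
    expected = [total_prop * high_total, total_prop * low_total]
    observed = [high_count, low_count]  # same (high, low) order as expected

    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # two categories give 1 degree of freedom, for which the chi-squared
    # survival function equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# hypothetical rows: 4 high value and 6 low value questions
rows = ([('famous opera composer', 1)] * 3 + [('capital of peru', 1)]
        + [('soap opera star', 0)] + [('largest planet', 0)] * 5)
statistic, p_value = chi_squared_for_word('opera', rows)
print(round(statistic, 3), round(p_value, 3))
```

Keeping `observed` and `expected` in the same order matters, because the test compares them position by position.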

In [15]:
#input: row, output 0/1

def check_value(row):
    v = 0
    if row['clean_value'] > 800:
        v = 1
    return v

In [16]:
#create new column
jeopardy['high_value'] = jeopardy.apply(check_value, axis=1)
#test
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [17]:
#check the number of questions the word appears in, for low and high value questions
#input: word, output: (low_count, high_count)

def word_frequency(word):
    low_count = 0
    high_count = 0 
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()

        if word in split_question:
            if row['high_value'] == 1:
                high_count +=1
            else:
                #NOTE: this adds 0, so low_count always stays 0; += 1 would count the low value matches
                low_count += 0
    return low_count, high_count

In [18]:
observed_expected = []

#convert a set to a list & pick 5 items for comparison
comparison_terms = list(terms_used)[:5]
comparison_terms

['northanger', 'blackhearts', 'death', 'multideck', 'exulans']

In [19]:
for t in comparison_terms:
    frequencies = word_frequency(t)
    observed_expected.append(frequencies)

In [20]:
observed_expected

[(0, 1), (0, 1), (0, 36), (0, 0), (0, 1)]

We can compute the expected counts and the chi-squared value.

In [21]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

In [22]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for l in observed_expected:
    #l is a (low_count, high_count) tuple
    total = sum(l)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    #NOTE: observed is ordered (low, high) while expected is ordered (high, low)
    observed = np.array([l[0],l[1]])
    expected = np.array([expected_high,expected_low])
    
    chi_squared.append(chisquare(observed, expected))



In [23]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=14.470662460567823, pvalue=0.00014235955089733572),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
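
The order of the two arrays passed to `chisquare` matters. The following sketch recomputes the statistic by hand for a term recorded as `(0, 1)`, using the group sizes from the `value_counts()` output above (5734 high, 14265 low). The helper `chi_stat` is a hypothetical stand-in for the statistic that `scipy.stats.chisquare` returns. It compares the pairing used in cell 22, observed `(low, high)` against expected `(high, low)`, with a pairing where both sides use the same order.

```python
def chi_stat(observed, expected):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

high_value_count, low_value_count = 5734, 14265  # from value_counts() above
n_rows = high_value_count + low_value_count

low_count, high_count = 0, 1  # recorded tuple for a term seen once
total_prop = (low_count + high_count) / n_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# pairing used in cell 22: observed (low, high) against expected (high, low)
mismatched = chi_stat([low_count, high_count], [expected_high, expected_low])
# both sides in the same (low, high) order
aligned = chi_stat([low_count, high_count], [expected_low, expected_high])
print(round(mismatched, 4), round(aligned, 4))
```

The mismatched pairing reproduces the recorded statistic of about 0.402, while the aligned pairing gives about 2.49 for the same counts.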

## Results
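The frequencies of the terms in this set were very low: four of the five terms appear at most once, and because `word_frequency` never increments `low_count`, every low-value count is recorded as 0. Only `death` (36 high-value occurrences) produces a small p-value (about 0.0001). The p-values of about 0.53 for the other terms indicate no significant difference between high and low value questions, and `multideck` yields `nan` because both of its counts are 0. With frequencies this low the chi-squared approximation is unreliable, so these results should not be read as evidence either way.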

### Next steps
Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Ideas:
* Manually create a list of words to remove, like the, than, etc.
* Find a list of stopwords to remove.
* Remove words that occur in more than a certain percentage (like 5%) of questions.
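
The last idea can be sketched as a document-frequency filter. This is a minimal illustration: the function name `frequent_words`, the sample questions and the 50% threshold used on the toy data are assumptions chosen to keep the example small.

```python
from collections import Counter

def frequent_words(clean_questions, max_share=0.05):
    """Return the words that occur in more than max_share of the questions."""
    doc_freq = Counter()
    for question in clean_questions:
        # set() so that a word counts at most once per question
        doc_freq.update(set(question.split()))
    n_questions = len(clean_questions)
    return {w for w, n in doc_freq.items() if n / n_questions > max_share}

# hypothetical normalized questions
sample = [
    'the city of yuma',
    'the first president',
    'the largest planet',
    'a river in egypt',
]
too_common = frequent_words(sample, max_share=0.5)
print(too_common)
```

Dividing by the number of questions makes the threshold read directly as "share of questions that contain the word".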

Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
* Use the apply method to make the code that calculates frequencies more efficient.
* Only select terms that have high frequencies across the dataset, and ignore the others.
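
As an alternative to calling a per-word function that scans every row, the counts for many terms can be collected in a single pass. The sketch below uses plain Python dictionaries rather than `apply`. The function name `count_by_value` and the sample rows are hypothetical.

```python
from collections import Counter

def count_by_value(rows, terms):
    """rows: iterable of (clean_question, high_value) pairs.
    terms: set of words to track.
    Returns {word: (low_count, high_count)} after one pass over the rows."""
    low, high = Counter(), Counter()
    for question, high_value in rows:
        # set() so that each question contributes at most 1 per word
        words = set(question.split()) & terms
        if high_value == 1:
            high.update(words)
        else:
            low.update(words)
    return {w: (low[w], high[w]) for w in terms}

# hypothetical rows
rows = [
    ('ancient empire of rome', 1),
    ('rome is the capital', 0),
    ('empire state building', 0),
]
counts = count_by_value(rows, {'rome', 'empire', 'viking'})
print(counts)
```

With this layout the cost grows with the number of rows, not with rows times terms, which is what makes testing thousands of terms feasible.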

Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
* See which categories appear the most often.
* Find the probability of each category appearing in each round.

Use the whole Jeopardy dataset ([here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)) instead of the subset we used above.

Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
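
One simple way to work with phrases is to compare adjacent word pairs (bigrams). The following is a minimal sketch, with hypothetical helper names and sample questions.

```python
def bigrams(clean_question):
    """Set of adjacent word pairs in a normalized question."""
    words = clean_question.split()
    return set(zip(words, words[1:]))

def shared_bigrams(question_a, question_b):
    """Word pairs that occur in both questions."""
    return bigrams(question_a) & bigrams(question_b)

# hypothetical normalized questions
a = 'this city hosted the winter olympics'
b = 'the winter olympics began in this year'
shared = shared_bigrams(a, b)
print(shared)
```

Both questions contain `this`, but only the pairs around `winter olympics` are shared, which is the kind of context a single-word comparison misses.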

In [24]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.111111,0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.090909,0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0,0.111111,0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0,0.125,0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0,0.090909,0


In [25]:
#try to find the % of questions that have a specific word in them

def appear_percentage(word):
    appearances_of_word = 0
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
    
        if word in split_question:
            appearances_of_word += 1
            
    return appearances_of_word / len(jeopardy['clean_question'])

In [26]:
appear_percentage('life')

0.008600430021501074

In [27]:
#create a dictionary of all unique words in questions + times they appear
#calculate % of appearances of each word
#remove each word that has over 5% of appearances (0.05)

jeopardy['split'] = jeopardy['clean_question'].str.split()

In [28]:
word_list = {}

#count frequencies
for row in jeopardy['split']:
    for word in row:
        if word in word_list:
            word_list[word] += 1
        else:
            word_list[word] = 1

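#convert counts to a ratio
#NOTE: this divides by the number of unique words, len(word_list), not by the total number of words or questions
for key, value in word_list.items():
    v = value/len(word_list)
    word_list[key] = v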

In [29]:
#keep the words whose count exceeds 1.3% of the number of unique words
freq_words = []

for key, value in word_list.items():
    if value > 0.013:
        freq_words.append(key)
        
print(freq_words)

['for', 'the', 'of', 'his', 'was', 'this', '2', 'at', 'with', 'city', 'in', 'state', 'has', 'a', 'on', 'its', 'an', 'to', 'named', 'first', 'and', 'from', 'it', 'i', 'you', 'he', 'that', 'one', 'by', 'name', 'is', 'not', 'or', 'who', 'be', 'these', 'country', 'man', 'are', 'as', 'when', 'can', 'seen', 'like', 'new', 'called', 'us', 'have', 'were', 'type', 'she', 'her', 'about', 'but']


In [30]:
#remove freq words

def cleaner(row):
    split_question = row['clean_question'].split()
    
    #remove each word from the list
    resultwords  = [word for word in split_question if word not in freq_words]
    result = ' '.join(resultwords)

    return result

jeopardy['clean_question_1'] = jeopardy.apply(cleaner, axis=1)
jeopardy['clean_question_1'].head(5)

0    last 8 years life galileo under house arrest e...
1    no 1912 olympian football star carlisle indian...
2    yuma record average 4055 hours sunshine each year
3    1963 live art linkletter show company served b...
4    signer dec indep framer constitution mass seco...
Name: clean_question_1, dtype: object

We can now repeat the overlap analysis on `clean_question_1`. As before, we only have about 10% of the full Jeopardy question dataset, so the occurrence of complex/long words serves as a proxy for question overlap. This time the list of long words is more precise, because more of the simple/stop words have been removed.

In [31]:
#repeat the process with overlap
question_overlap_1 = []
terms_used_1 = set()
word_overlap = {}

for index, row in jeopardy.iterrows():
    split_question = row['clean_question_1'].split()
    
    #remove short words
    #NOTE: removing items while iterating over the same list can skip the item after each removal
    for word in split_question:
        if len(word) < 6:
            split_question.remove(word)
    
    match_count = 0 
    
    #add all long words to a total set of words
    for word in split_question:
        terms_used_1.add(word)
        
        #increment the counter of how many times the same word has been used before
        #NOTE: the word was added to terms_used_1 just above, so this test is always true
        if word in terms_used_1:
            match_count += 1
        
        #calc an avg of prev used words contained in a question
        #NOTE: this division runs once per word, not once per question
        if len(split_question) > 0:
            match_count = match_count/len(split_question)
          
        #store the running match_count for this word (later occurrences overwrite earlier ones)
        word_overlap[word] = match_count
        
    question_overlap_1.append(match_count)

In [32]:
jeopardy['question_overlap_1'] = question_overlap_1
mean_overlap_1 = jeopardy['question_overlap_1'].mean()

print(mean_overlap_1)

0.28063873856356764


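The mean of `question_overlap_1` is about 0.28, up from 0.19 before the frequent words were removed. Taken at face value, each new question has roughly a 28% chance of using a term from a previous question, although the loop only approximates term reuse because it adds each word to `terms_used_1` before checking for it.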

In [33]:
w_overlap = pd.DataFrame.from_dict(word_overlap, orient='index', columns=['percentage'])
w_overlap.head()

Unnamed: 0,percentage
8,0.199996
life,0.166665
galileo,0.125
house,0.75
arrest,0.199846


Next, we want to keep only the words whose overlap value lies within the interquartile range (IQR), that is, between the first and third quartiles. The assumption is that these words contain clues about the topics that are best to study: the words with the highest values are probably still fairly common nouns and verbs, and the words with the lowest values are less likely to pop up again. Both extremes are treated as outliers and removed.

We'll filter the `w_overlap` DataFrame to keep only the words in the middle of the distribution, between Q1 and Q3.

In [34]:
w_overlap.sort_values(by='percentage', inplace=True)

#compute IQR
Q1 = w_overlap['percentage'].quantile(0.25)
Q3 = w_overlap['percentage'].quantile(0.75)
IQR = Q3 - Q1

In [35]:
#filter db to show only words in the IQR

overlap_iqr = w_overlap[(w_overlap['percentage'] > Q1) & (w_overlap['percentage'] < Q3)]

In [36]:
w_overlap_len = w_overlap.shape[0]
overlap_iqr_len =  overlap_iqr.shape[0]
print("We eliminated", w_overlap_len - overlap_iqr_len, "elements, and are left with", overlap_iqr_len, "words.")

We eliminated 15036 elements, and are left with 14648 words.


For comparison, we can also see how the result differs if we only keep values that fall within one standard deviation (SD) of the mean.

In [37]:
#calculate SD & pick words within one sd of the mean
#ddof = 1 because this is a sample of a whole dataset 
std = w_overlap.std(axis=0, ddof=1)
mean = w_overlap.mean()
v1 = mean-std
v2 = mean+std
print(v1, v2)

percentage    0.081222
dtype: float64 percentage    0.338646
dtype: float64


In [38]:
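#filter to keep only values that fall within 1sd of the mean
#v1 and v2 are one-element Series, so select the value by label instead of calling float() on them
overlap_1_std = w_overlap[(w_overlap['percentage'] > v1['percentage']) & (w_overlap['percentage'] < v2['percentage'])]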
o_1_std_len = overlap_1_std.shape[0]

In [39]:
print("There are", o_1_std_len, "words within 1 std of the mean, which eliminates", w_overlap_len-o_1_std_len,"words." )

There are 26191 words within 1 std of the mean, which eliminates 3493 words.


We'll use the IQR to select the most promising words for the chi-squared analysis.

We'll use the same function as before to calculate the frequencies.

In [40]:
#convert the df to a list
iqr_words = overlap_iqr.index.to_list()

In [41]:
hrefs = [x for x in iqr_words if 'href' in x]
print(hrefs[:5])
print("At the moment with all the hrefs the list has", len(iqr_words), "elements.")

['hrefhttpwwwjarchivecommedia20030107_dj_26jpg', 'hrefhttpwwwjarchivecommedia20040712_j_23ajpg', 'hrefhttpwwwjarchivecommedia20050502_j_02ajpg', 'hrefhttpwwwjarchivecommedia20060317_j_27jpg', 'hrefhttpwwwjarchivecommedia20040521_j_28jpg']
At the moment with all the hrefs the list has 14648 elements.


Many of these entries are not real words. They look like the remains of `href` links to files (ending in `jpg` or similar) whose punctuation was stripped during normalization. It would be best to remove them.

In [42]:
iqr_words = [x for x in iqr_words if 'href' not in x]
print("After removing the hrefs the list has", len(iqr_words), "elements.")

After removing the hrefs the list has 14428 elements.
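
Instead of filtering these tokens afterwards, the markup could be removed before punctuation is stripped. The sketch below assumes that the raw `Question` text contains HTML tags such as `<a href="...">`, which is inferred from the tokens above and not confirmed against the data. The function name `normalize_without_tags`, the example string and its URL are hypothetical, and the tag pattern is deliberately simple.

```python
import re

def normalize_without_tags(text):
    """Drop HTML tags, then lowercase and remove punctuation."""
    text = re.sub(r'<[^>]+>', ' ', text)  # remove tags such as <a href="...">
    text = re.sub(r'[^\w\s]', '', text.lower())
    return ' '.join(text.split())  # collapse leftover whitespace

# hypothetical raw question text
raw = 'This <a href="http://www.example.com/media/clue.jpg">bird</a> is seen here'
cleaned = normalize_without_tags(raw)
print(cleaned)
```

With the tags gone before normalization, the link target never turns into a long pseudo-word, so no `href` filter is needed later.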


In [43]:
#create empty columns to store values
overlap_iqr = overlap_iqr.reindex(columns=["percentage", "high_count", "low_count"])
overlap_iqr.head(2)

Unnamed: 0,percentage,high_count,low_count
dragon,0.142857,,
corps,0.142857,,


To fill in these columns, the plan is to:

* reset the index so that each word becomes a regular `word` column that can be passed to the frequency function;
* look each word up in the split questions and assign the resulting counts to the `high_count` and `low_count` columns.

In [44]:
#move the words out of the index into a regular 'word' column
overlap_iqr = overlap_iqr.reset_index()
overlap_iqr = overlap_iqr.rename(columns={'index':'word'})
overlap_iqr.head(2)

Unnamed: 0,word,percentage,high_count,low_count
0,dragon,0.142857,,
1,corps,0.142857,,


In [85]:
#check the number of questions the word appears in, for low and high value questions
#input: word
#output: (low_count, high_count)

def word_frequency_1(word):
    low_count = 0
    high_count = 0 

    for index, row in jeopardy.iterrows():
        split_question = row['clean_question_1'].split()
    
        if word in split_question:
            if row['high_value'] == 1:
                high_count +=1
            else:
                #NOTE: this adds 0, so low_count always stays 0; += 1 would count the low value matches
                low_count += 0
                
    return low_count, high_count

In [86]:
iqr_5 = iqr_words[:5]
iqr_5

['dragon', 'corps', 'flawed', 'secret', 'annual']

In [87]:
observed_expected_1 =[]
for t in iqr_5:
    frequencies = word_frequency_1(t)
    observed_expected_1.append(frequencies)

In [89]:
observed_expected_1

[(0, 3), (0, 8), (0, 1), (0, 11), (0, 5)]

In [98]:
slice_5 = overlap_iqr[:5]
slice_5

Unnamed: 0,word,percentage,high_count,low_count
0,dragon,0.142857,,
1,corps,0.142857,,
2,flawed,0.142857,,
3,secret,0.142857,,
4,annual,0.142857,,


In [110]:
#compute the counts for one row; meant to be used with apply(axis=1)
def count_values(row):
    low_high = word_frequency_1(row['word'])
    return pd.Series({'low_count': low_high[0], 'high_count': low_high[1]})

In [112]:
#apply row by row; this returns a new DataFrame instead of writing into the slice
test = slice_5.apply(count_values, axis=1)
