# Project: Winning Jeopardy

The following dataset contains 20,000 rows of the full dataset of jeopardy questions. The goal of the jeopardy analysis project is to find out if there are certain strategies that could help a person win based on the data and see any other patterns in the data

In [80]:
import pandas as pd
jeopardy= pd.read_csv('data/jeopardy.csv')
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [43]:
columns = []
for col in jeopardy.columns:
    col = col.strip()
    columns.append(col)
jeopardy.columns = columns
print(jeopardy.columns)
    

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [44]:
import string
#The function normale cleans the question column so the text is normalised for analysis
def normale(stri):
    stri = stri.lower()
    for char in string.punctuation:
        stri = stri.replace(char, '')
    return stri
jeopardy['clean_question'] = jeopardy['Question'].apply(normale)

In [45]:
print(jeopardy['clean_question'].head())

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object


In [46]:
#Uses same function to normalise answers
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normale)
print(jeopardy['clean_answer'].head())

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object


In [47]:
#For dollar values, takes out commas and dollar signs and converts them to integers
def normal_dollar(stri):
    stri = str(stri)
    stri = stri.replace(',', '')
    stri = stri.replace('$', '')
    if stri == 'nan' or stri == 'None':
        return 0
    else:
        stri = int(stri)
        return stri
jeopardy['clean_value'] = jeopardy['Value'].apply(normal_dollar)

print(jeopardy['clean_value'].head(10))

0    200
1    200
2    200
3    200
4    200
5    200
6    400
7    400
8    400
9    400
Name: clean_value, dtype: int64


In [48]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [49]:
#This checks for each row of the data, whether the answer is found inside the question. 
def detect_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count  = 0 
    for word in split_answer:
        if word in split_question:
            match_count+=1
        return match_count/len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(detect_matches, axis = 1)


In [50]:
#Sees if after the filtering there is still data
print(jeopardy[jeopardy['answer_in_question']!=0].head()) 

     Show Number   Air Date             Round           Category  Value  \
31          4680 2004-12-31  Double Jeopardy!     AIRLINE TRAVEL   $400   
67          5957 2010-07-06         Jeopardy!  RHYMES WITH SMART   $400   
73          5957 2010-07-06         Jeopardy!  RHYMES WITH SMART   $600   
79          5957 2010-07-06         Jeopardy!  RHYMES WITH SMART   $800   
100         5957 2010-07-06  Double Jeopardy!     JUST THE FACTS  $1200   

                                              Question             Answer  \
31   It can be a place to leave your puppy when you...           a kennel   
67   Small, slender missile thrown at a board in a ...             a dart   
73   It can be a separating line in your hair or a ...             a part   
79             A graphic representation of information            a chart   
100  <a href="http://www.j-archive.com/media/2010-0...  a German Shepherd   

                                        clean_question       clean_answer  \
31   it c

In [51]:
print(jeopardy['answer_in_question'].mean()) 
jeopardy['Air Date'].sort_values(ascending=True)

0.029624080410369708


19325   1984-09-21
19301   1984-09-21
19302   1984-09-21
19303   1984-09-21
19304   1984-09-21
           ...    
1953    2012-01-19
1954    2012-01-19
1955    2012-01-19
1945    2012-01-19
1922    2012-01-19
Name: Air Date, Length: 19999, dtype: datetime64[ns]

Around three percent of answers are inside the corresponding question. With such a low chance, it is a bad idea assume the answer will be in the question.

This compiles a set of all unique terms from the questions and also keeps track of a match count, which is seeing 
if a new word is already in the set of terms. The question overlap list takes for each row, what proportion of terms in the question are previously used terms.

In [52]:
terms_used = set()
question_overlap = []
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [word for word in split_question if len(word)>5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count+= 1
        terms_used.add(word)
    if len(split_question)>0:
        question_overlap.append(match_count/len(split_question))
    else: 
        question_overlap.append(0)
jeopardy['question_overlap'] = question_overlap 


In [53]:
print(jeopardy['question_overlap'].mean())


0.6919577992203563


This number is larger than I expected! Terms used in the questions here are lots of times found in previous questions. Based on term overlap, a good strategy is to study other important words in each question than the answer because these words and topics are likely to appear in future questions

In [54]:
import numpy as np
from random import choice
def high(row):
    if row['clean_value']>800:
        value = 1
    else: 
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(high, axis=1)
#For each word count the number of times word is in high value vs low value question
def high_low(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count+=1
            else:
                low_count+=1
    return high_count, low_count

terms_used_list =list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]
observed_expected = [] #High-low value pairs for each term
for term in comparison_terms:
    observed_expected.append(high_low(term))
    
observed_expected

[(2, 3),
 (1, 5),
 (0, 1),
 (6, 3),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1)]

Summary of the above function:

1)Makes a new column in jeopardy that sets a question to high value if the value is >800 dollars

2)Creates a function that counts how many times a given word is in either a high value or low value question, returns those values in a list

3)Draws a sample of 10 words, and appends the high_low count for each word in a list

In [55]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]#shape[0] denotes length

In [56]:
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

In [57]:
chi_squared = []
from scipy.stats import chisquare
import numpy as np
for obs_list in observed_expected:
    total = sum(obs_list)
    total_prop = total/jeopardy.shape[0]
    high_prop = total_prop*high_value_count #expected high values and low values of the same sample size
    low_prop = total_prop*low_value_count
    
    chisquared, pvalue = chisquare([obs_list[0], obs_list[1]], [high_prop, low_prop])
    chi_squared.append([chisquared, pvalue])

In [58]:
chi_squared

[[0.3137668167849311, 0.5753778622944691],
 [0.42281054506129573, 0.515537958129453],
 [0.401962846126884, 0.5260772985705469],
 [6.353131314909584, 0.011717429728771048],
 [2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469],
 [2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469],
 [0.401962846126884, 0.5260772985705469],
 [0.401962846126884, 0.5260772985705469]]

The chisquare test was performed for each of the 10 terms from the sample list. It was comparing the proportion of high and low value questions for the term with the total proportion of high and low value terms for the whole dataset to see if each terms has a significant difference from the total proportion. 

The results show that only one of the terms have a p value less than the standard 0.05, which means that high and low value questions are not associated with specific terms

I would like to do a chisquare test to see if the expected length of question for high or value questions is the same as the expected length of general questions

Steps

1) Compute the average value of high value question length and low value question length

2) Do chi2 for comparing high and low to expected value of a general question

In [59]:
high_value_list  = jeopardy[jeopardy['high_value'] == 1]
low_value_list = jeopardy[jeopardy['high_value'] == 0]
high_value_sample = [choice(list(high_value_list['Question'])) for _ in range(10)]
low_value_sample = [choice(list(low_value_list['Question'])) for _ in range(10)]


chi_low = []
question_length_average = np.mean(jeopardy['Question'].str.len())
print(question_length_average)
def chi_question(sample):
    chi_high = []
    for question in sample:
      
        chisquared, pvalue = chisquare([len(question),question_length_average])
        chi_high.append([chisquared, pvalue])
    return chi_high

def chi_question(sample):
    
    chi_low= []
    for question in sample:
        print(len(question))
        
        chisquared, pvalue = chisquare([len(question),question_length_average])
        chi_low.append([chisquared, pvalue])  
        
    return chi_low

high_list = chi_question(high_value_sample)
low_list = chi_question(low_value_sample)
        



22    1939 Oscar winner: "...you are a credit to you...
24    This Asian political party was founded in 1885...
25    No. 5: Only center to lead the NBA in assists;...
26    The Kirschner brothers, Don & Bill, named this...
27    Revolutionary War hero: "His spirit is in Verm...
Name: Question, dtype: object
86.2193609680484
106
102
105
111
83
66
104
54
33
72
33
99
56
66
114
76
86
58
89
60


In [79]:
print("Chi square results of high value question length samples\n")
print(high_list)
print("\n")
print("Chi square results of low value question length samples\n")
print("____________")
print(low_list)

Chi square results of high value question length samples

[[2.0355581172564934, 0.15365777284612617], [1.3230762604651947, 0.25004070352402297], [1.8445433593275096, 0.1744193700071655], [3.1136906022698763, 0.07763685620590016], [0.061247631378010745, 0.8045347652108468], [2.685746118997656, 0.10124979119591554], [1.66203441529627, 0.1973288649260355], [7.40330874440334, 0.006510402096936676], [23.757050523081645, 1.0929427320661183e-06], [1.277910775916291, 0.25828785282901556]]


Chi square results of low value question length samples

____________
[[23.757050523081645, 1.0929427320661183e-06], [0.8818988102071178, 0.34768213757831434], [6.421135428406073, 0.011277003800947514], [2.685746118997656, 0.10124979119591554], [3.8545917901853293, 0.04960999757522686], [0.6437908395893718, 0.422341403684643], [0.0002794066476187961, 0.986663612523435], [5.5216742613458125, 0.0187822743620027], [0.04412727784929335, 0.8336169909353146], [4.701531213250492, 0.03013575757337761]]


Majority of p values in both the high value and low value question lengths is more than 0.05. The high p values of question lengths in either the high or low value length set are not significantly different from the expected value of the question length. So, the question length is not very different between high and low value questions.

In [62]:
print(jeopardy['Category'].value_counts()[:20])

TELEVISION          51
U.S. GEOGRAPHY      50
LITERATURE          45
BEFORE & AFTER      40
AMERICAN HISTORY    40
HISTORY             40
AUTHORS             39
WORD ORIGINS        38
WORLD CAPITALS      37
BODIES OF WATER     36
SPORTS              36
SCIENCE             35
MAGAZINES           35
RHYME TIME          35
SCIENCE & NATURE    35
WORLD GEOGRAPHY     33
WORLD HISTORY       32
HISTORIC NAMES      32
ANNUAL EVENTS       32
BIRDS               31
Name: Category, dtype: int64


In [63]:
def match_count(datalist):
    terms_used = set()
    question_overlap = []
    for index, row in datalist.iterrows():
        split_question = row['clean_question'].split()
        split_question = [word for word in split_question if len(word)>5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count+= 1
            terms_used.add(word)
        if len(split_question)>0:
            question_overlap.append(match_count/len(split_question))
        else: 
            question_overlap.append(0)
    return question_overlap
        


In [64]:
tele_jeopardy = jeopardy[jeopardy['Category']=='TELEVISION']
geo_jeopardy = jeopardy[jeopardy['Category']=='U.S. GEOGRAPHY']
sports_jeopardy = jeopardy[jeopardy['Category']=='SPORTS']
american_jeopardy = jeopardy[jeopardy['Category'] == 'AMERICAN HISTORY']


In [74]:
print("Repeat terms for categores:\n")

print("Geography: ")
print(np.mean(match_count(geo_jeopardy)))
print("\n")
print("Television: ")
print(np.mean(match_count(tele_jeopardy)))
print("\n")
print("Sports: ")
print(np.mean(match_count(sports_jeopardy)))
print("\n")
print("American: ")
print(np.mean(match_count(american_jeopardy)))

Repeat terms for categores:

Geography: 
0.22261904761904763


Television: 
0.2222222222222222


Sports: 
0.20634920634920637


American: 
0.1309126984126984


This final test does the repeat terms test that was done earlier in the analysis (seeing if terms in the questions were used in previous questions). Surprisingly, the per-category term use is far lower than the general term use decimal (0.69). This test is also useful to see which categories have similar topics. American History has least similar terms compared to geogrpahy, television and sports. 