## Cleaning our Data
First we have to clean up the formatting of our column names and normalise them.

In [11]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")
print(jeopardy.columns)
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [12]:
print(jeopardy["Question"].head())

0    For the last 8 years of his life, Galileo was ...
1    No. 2: 1912 Olympian; football star at Carlisl...
2    The city of Yuma in this state has a record av...
3    In 1963, live on "The Art Linkletter Show", th...
4    Signer of the Dec. of Indep., framer of the Co...
Name: Question, dtype: object


In [13]:
import string

punct = set(string.punctuation)

def qa_normaliser(row):
    row = row.lower()
    row = ''.join(i for i in row if i not in punct)
    return row

jeopardy["clean_question"] = jeopardy["Question"].apply(qa_normaliser)
print(jeopardy["clean_question"].iloc[0])

jeopardy["clean_answer"] = jeopardy["Answer"].apply(qa_normaliser)
print(jeopardy["clean_answer"].iloc[0])


for the last 8 years of his life galileo was under house arrest for espousing this mans theory
copernicus


In [14]:
def val_normaliser(row):
    if row == "None":
        row = 0
    else:
        row = ''.join(i for i in row if i not in punct)
    return int(row)
    
jeopardy["clean_value"] = jeopardy["Value"].apply(val_normaliser)
print(jeopardy["clean_value"])

0         200
1         200
2         200
3         200
4         200
5         200
6         400
7         400
8         400
9         400
10        400
11        400
12        600
13        600
14        600
15        600
16        600
17        600
18        800
19        800
20        800
21        800
22       2000
23        800
24       1000
25       1000
26       1000
27       1000
28       1000
29        400
         ... 
19969    1200
19970    1200
19971    1500
19972    1200
19973    1200
19974    1200
19975    1600
19976    1600
19977    1600
19978    1600
19979    1600
19980    1600
19981    1200
19982    2000
19983    2000
19984    2000
19985    2000
19986    2000
19987       0
19988     100
19989     100
19990     100
19991     100
19992     100
19993     100
19994     200
19995     200
19996     200
19997     200
19998     200
Name: clean_value, dtype: int64


In [15]:
print(jeopardy["Air Date"])
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
print(jeopardy["Air Date"])

0        2004-12-31
1        2004-12-31
2        2004-12-31
3        2004-12-31
4        2004-12-31
5        2004-12-31
6        2004-12-31
7        2004-12-31
8        2004-12-31
9        2004-12-31
10       2004-12-31
11       2004-12-31
12       2004-12-31
13       2004-12-31
14       2004-12-31
15       2004-12-31
16       2004-12-31
17       2004-12-31
18       2004-12-31
19       2004-12-31
20       2004-12-31
21       2004-12-31
22       2004-12-31
23       2004-12-31
24       2004-12-31
25       2004-12-31
26       2004-12-31
27       2004-12-31
28       2004-12-31
29       2004-12-31
            ...    
19969    2009-05-14
19970    2009-05-14
19971    2009-05-14
19972    2009-05-14
19973    2009-05-14
19974    2009-05-14
19975    2009-05-14
19976    2009-05-14
19977    2009-05-14
19978    2009-05-14
19979    2009-05-14
19980    2009-05-14
19981    2009-05-14
19982    2009-05-14
19983    2009-05-14
19984    2009-05-14
19985    2009-05-14
19986    2009-05-14
19987    2009-05-14


## Do we study past questions, general knowledge, or not study at all?
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question. We'll do this below, will a little bit of cleaning beforehand.

In [16]:
def find_match(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ") 
    
    split_answer = [i for i in split_answer if i not in ("the")]
    
    if len(split_answer) == 0:
        return 0
        
    match_count = 0   
    
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count/len(split_answer)
        
    
jeopardy["answer_in_question"] = jeopardy.apply(find_match,axis=1)


print(jeopardy["answer_in_question"].mean())
print(len(jeopardy[jeopardy["answer_in_question"] == 1.0])/len(jeopardy))
print(len(jeopardy[jeopardy["answer_in_question"] == 0])/len(jeopardy))

    

0.0582264069913
0.006150307515375769
0.8757437871893594


Each returned value from the function is basically saying - "Out of all the words in the question, this is the percentage of those words that match the words given in the answer (and vice versa)." For example, if there's a match count of 9, and the length of the question is 10. there is a 90% match.


By calculating the mean of all these match percentages, only 5.8% of the whole question contains words that fit the answer on average. In other words, it seems we can only gleam the answer from the question only ~ 5.8% of the time. Worse still, on average, only 0.6% of answers contain all the words mentioned in the question. 87% of the answers do not contain a single word from the question.

Therefore, we can conclude that deducing the answer from the question is generally going to be fruitless. Since following this strategy is akin to shooting in the dark, two questions remain:

- Should we study past questions? or ;
- Should we cover all the bases and study general knowledge?

In [17]:


question_overlap = []
terms_used = set()

for i,row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [i for i in split_question if len(i) > 5]    
    match_count = 0
    
    for i in split_question:
        if i in terms_used:
            match_count += 1
        terms_used.add(i)
        
    if len(split_question) > 0:
        percentage = match_count / len(split_question)
        
    question_overlap.append(percentage)

jeopardy["question_overlap"] = question_overlap

print(jeopardy["question_overlap"].mean())


0.706609081113


Above, the match count keeps track of the frequency of the term used. For example, let's use the split_question = ['tiniest','indian','desert']. If both 'tiniest' and 'indian' have been used in prior questions, that means that there is a 2/3 overlap, or ~ 67% when the dataframe is sorted by date in descending order. 

Here, we find that the mean is ~ 70%. While the context of how the words are used within the question is taken out, it is a rough indication of that many of the terms are used time and again. If anything, it is a signal to investigate further if prior questions are recycled/rephrased.

Do terms really affect the value of the question.

In [18]:
jeopardy = jeopardy.sort_values("Air Date")

def find_value(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy["high_value"] = jeopardy.apply(find_value,axis=1)

def high_low_count(term):
    low_count = 0
    high_count = 0
    
    for i,row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []

comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    result = high_low_count(term)
    observed_expected.append(result)

print(comparison_terms)
print("High-low count:")
print(observed_expected)
    

['baddie', 'hoochie', 'divides', 'hrefhttpwwwjarchivecommedia20060426j07mp3thisa', 'xenophon']
High-low count:
[(0, 1), (0, 1), (2, 2), (1, 0), (0, 1)]


Here we have our observed counts. For example, the term "Valley" is in 5 high value questions, and 29 low value questions. Now we can computer our expected counts and the chi-square value.

In [24]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for i in observed_expected:
    total = sum(i)
    total_prop = total/len(jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count

    observed = np.array([i[0], i[1]])
    expected = np.array([expected_high, expected_low])
    chisq,pval = chisquare(observed, expected)
    chi_squared.append([chisq,pval])

print(chi_squared) 

[[0.40196284612688399, 0.52607729857054686], [0.40196284612688399, 0.52607729857054686], [0.88975496332255899, 0.34554371914834681], [2.4877921171956752, 0.11473257634454047], [0.40196284612688399, 0.52607729857054686]]


All the p-values are above 0.05, which points to the use of terms not significantly different across both high and low value questions. However, a rule of thumb for chi-square tests is to have at least 3 observations, but alot of the terms have frequencies of less than that making our results pretty useless if i'm being honest.