# Project 17 - Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. In this project we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win. You can find the dataset [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
import pandas as pd
data = pd.read_json('JEOPARDY_QUESTIONS1.json')
data

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680
...,...,...,...,...,...,...,...
216925,RIDDLE ME THIS,2006-05-11,'This Puccini opera turns on the solution to 3...,$2000,Turandot,Double Jeopardy!,4999
216926,"""T"" BIRDS",2006-05-11,'In North America this term is properly applie...,$2000,a titmouse,Double Jeopardy!,4999
216927,AUTHORS IN THEIR YOUTH,2006-05-11,"'In Penny Lane, where this ""Hellraiser"" grew u...",$2000,Clive Barker,Double Jeopardy!,4999
216928,QUOTATIONS,2006-05-11,"'From Ft. Sill, Okla. he made the plea, Arizon...",$2000,Geronimo,Double Jeopardy!,4999


From the table above we can see that our data has 216930 rows and 7 columns. The columns are:
- `category` - the category of the question
- `air_date` - the date the episode aired
- `question` - the text of the question
- `value` - the number of dollars the correct answer is worth
- `answer` - the text of the correct answer
- `round` - the round of Jeopardy
- `show_number` - the Jeopardy episode number

## Normalizing Text

Before we begin our analysis, we need to normalize the text: we convert the text columns to lowercase (so that, for example, `Don't` and `don't` are treated as the same word) and remove all punctuation.

In [2]:
import re
def normalize_text(text):
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text

#testing
print("these two should look the same")
print(normalize_text("THese,, two ShoUld loOk!! The same,.."))

these two should look the same
these two should look the same 


Now we can normalize the `question` and `answer` columns with our function.

In [3]:
data.loc[:,'clean_question'] = data['question'].apply(normalize_text)
data.loc[:,'clean_answer'] = data['answer'].apply(normalize_text)
data.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number,clean_question,clean_answer
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680,for the last 8 years of his life galileo was ...,copernicus
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680,no 2 1912 olympian football star at carlisle ...,jim thorpe
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680,the city of yuma in this state has a record a...,arizona
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680,in 1963 live on the art linkletter show this ...,mcdonald s
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680,signer of the dec of indep framer of the cons...,john adams


## Normalizing Columns

We also want to normalize the `value` column by removing the dollar sign and converting it to an integer, and to convert the `air_date` column from string to datetime.

In [4]:
def normalize_value(value):
    '''
    return value as an integer, without punctuation or the dollar sign
    return 0 if the value cannot be converted
    '''
    value = re.sub(r'\W', "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value
        
#test
print(normalize_value("$200,000"))

200000


In [5]:
#normalizing value column
data['value'] = data['value'].astype('str')
data['clean_value'] = data['value'].apply(normalize_value)

#normalizing date column
data['air_date'] = pd.to_datetime(data['air_date'])

In [6]:
data.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number,clean_question,clean_answer,clean_value
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680,for the last 8 years of his life galileo was ...,copernicus,200
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680,no 2 1912 olympian football star at carlisle ...,jim thorpe,200
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680,the city of yuma in this state has a record a...,arizona,200
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680,in 1963 live on the art linkletter show this ...,mcdonald s,200
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680,signer of the dec of indep framer of the cons...,john adams,200


In [7]:
data.dtypes

category                  object
air_date          datetime64[ns]
question                  object
value                     object
answer                    object
round                     object
show_number                int64
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

The output above shows that `clean_value` now holds integers and `air_date` holds datetimes. Next we are going to find out whether it is smart to study past questions, study general knowledge, or not study at all.

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

If questions are repeated, they are probably not repeated word for word (the wording could differ, for example). Because of this, we are going to check how often complex words (6 characters or more) reoccur. For example, for the question in row number 3, `in 1963 live on the art linkletter show this...`, we can check how often the word `linkletter` reoccurs.

In [8]:
def answer_in_question(row):
    '''
    returns the share of terms in clean_answer that also occur in clean_question
    '''
    match_count = 0
    #splitting questions and answers to a list of words
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    #remove "the" from the list
    if "the" in split_answer:
        split_answer.remove('the')
    #prevents dividing by 0 later
    if len(split_answer) == 0:
        return 0
    #checks if the answer is in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)

data['answer_in_question'] = data.apply(answer_in_question, axis=1)
mean_answer_in_question = data['answer_in_question'].mean()
mean_answer_in_question

0.06141691161144138

Above we created a function that checks how much of the answer can be found in the question. On average, only about 6% of the words in an answer also appear in its question, so we shouldn't expect hearing the question to give the answer away.
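
To make the score easier to read, here is a minimal sketch of the same logic applied to plain strings instead of DataFrame rows. The function name `answer_overlap` and both example rows are invented for illustration and are not taken from the dataset.

```python
def answer_overlap(clean_answer, clean_question):
    """Share of answer words (ignoring the first 'the') found in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    # same rule as in answer_in_question: drop one "the" from the answer
    if "the" in answer_words:
        answer_words.remove("the")
    # an empty answer cannot match anything and would divide by zero
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# invented rows, only to illustrate the score
half_match = answer_overlap("water power", "samuel slater built a mill run by water")
no_match = answer_overlap("copernicus", "galileo was under house arrest")
print(half_match, no_match)
```

In the first row one of the two answer words appears in the question, so the score is 0.5; in the second row none do, so the score is 0.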

## Recycled Questions

Let's now find out how often new questions repeat the older ones in our dataset.

In [9]:
question_overlap = []
terms_used = set()

#rows are processed in the order they appear in the file; sort_values returns
#a new DataFrame, so to go chronologically use data = data.sort_values('air_date')
#looping each row
for index, row in data.iterrows():
    split_question = row['clean_question'].split(" ")
    ### remove words with less than 6 characters
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    #checks if words are in terms_used
    for word in split_question:
        if word in terms_used:
            match_count += 1
    #add words to terms_used
    for word in split_question:
        terms_used.add(word)
    #calculate match_count per words used
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count) 
    
#question overlaps to column
data['question_overlap'] = question_overlap

#set to list just to show some example values
set_to_list = list(terms_used)
print(set_to_list[:10])

question_overlap_mean = data['question_overlap'].mean()
question_overlap_mean

['10_j_05a', 'toppler', 'javier', 'sinning', 'mauritius', 'durrell', 'kaside', 'huxtables', 'saucer', 'voorhees']


0.9003767772153685

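Above we can see examples of the values in our set, meaning words that are 6 or more characters long. The mean of `question_overlap` is very high, about 90%. This means that most of the long terms in a question have already appeared in an earlier question, which suggests that topics are recycled. The overlap also grows as more rows are processed: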
- first 20000 rows: around 72%
- first 50000 rows: around 81%
- first 100000 rows: around 86%

In [10]:
first_20000_mean = data['question_overlap'][:20000].mean()
first_50000_mean = data['question_overlap'][:50000].mean()
first_100000_mean = data['question_overlap'][:100000].mean()

print(first_20000_mean)
print(first_50000_mean)
print(first_100000_mean)

0.723582840725705
0.8104615722460577
0.8600331081817143


However, our code only looks at single terms. A shared term doesn't necessarily mean that the question is the same, for example:
- When was Crash Bandicoot published?
- How many players have played Crash Bandicoot?
  
Both of these questions contain the term `Bandicoot`, which would make them overlap, yet the questions are completely different. It could still be beneficial to study the terms that overlap the most: if you know everything about `Bandicoot`, you could answer both of these questions.
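
The two example questions can be run through a small sketch of the overlap logic to see what score the second one would get. The helper `overlap_scores` is a hypothetical, list-based version of the loop above, assuming the questions are already normalized.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    terms_used = set()
    scores = []
    for question in questions:
        # keep only words with at least min_length characters
        words = [word for word in question.split() if len(word) >= min_length]
        seen_before = sum(1 for word in words if word in terms_used)
        scores.append(seen_before / len(words) if words else 0)
        terms_used.update(words)
    return scores

questions = [
    "when was crash bandicoot published",
    "how many players have played crash bandicoot",
]
scores = overlap_scores(questions)
print(scores)
```

The first question has nothing to overlap with, so it scores 0. The second question has three long words (`players`, `played`, `bandicoot`) and only `bandicoot` has been seen before, so it scores one third even though it asks something entirely different.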

## Low Value vs High Value Questions

Next we are going to find out which terms correspond to high-value questions using a chi-squared test. Let's make two categories:
- Low value: value is 800 or less
- High value: value is more than 800

We will create a function that labels each question as low value (0) or high value (1).

In [19]:
def value_finder(value):
    if value > 800:
        value = 1
    else:
        value = 0
    return value

data['high_value'] = data['clean_value'].apply(value_finder)
data.sample(3)

Unnamed: 0,category,air_date,question,value,answer,round,show_number,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
169784,THE PERFECT GIFT,1998-12-23,"'This greeting card company operates ""Gold Cro...",$100,Hallmark,Jeopardy!,3293,this greeting card company operates gold crow...,hallmark,100,0.0,1.0,0
172622,"""FI""",1999-04-21,'The basic pitch of this small flute is usuall...,$200,Fife,Double Jeopardy!,3378,the basic pitch of this small flute is usuall...,fife,200,0.0,1.0,0
104290,18TH CENTURY AMERICA,2007-11-02,'Samuel Slater settled in R.I. & built America...,$1600,water power,Double Jeopardy!,5325,samuel slater settled in r i built america s ...,water power,1600,0.5,1.0,1


Next we will create a function that finds the terms used in low and high value questions. We will choose 10 words from `terms_used` to test our function.

In [12]:
def term_counter(text):
    '''
    returns how many high value and low value questions contain the term
    '''
    low_count = 0
    high_count = 0
    for index, row in data.iterrows():
        split_question = row['clean_question'].split(" ")
        if text in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

#10 words from the set terms_used (picked earlier)
comparison_terms = set_to_list[:10]
print(comparison_terms)

observed_expected = []
for term in comparison_terms:
    term_counts = term_counter(term)
    observed_expected.append(term_counts)

print(observed_expected)

['10_j_05a', 'toppler', 'javier', 'sinning', 'mauritius', 'durrell', 'kaside', 'huxtables', 'saucer', 'voorhees']
[(1, 0), (0, 1), (6, 8), (2, 4), (3, 13), (0, 3), (0, 1), (0, 2), (1, 5), (0, 3)]


From above we can see 10 different words and their counts in high value and low value questions. For example, the word `javier` shows up in 6 high value questions and 8 low value questions.
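
Because `term_counter` scans every row once per term, it gets slow as the number of terms grows. One possible alternative, sketched below under the assumption that the questions are available as `(clean_question, high_value)` pairs, is to collect the counts for every term in a single pass. The function name `count_terms_by_value` and the three example rows are invented for illustration.

```python
from collections import Counter

def count_terms_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs."""
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # set() so a term counts once per question, as in term_counter
        terms = set(clean_question.split())
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

# invented rows, only to illustrate the counting
rows = [
    ("javier bardem played this role", 1),
    ("javier sotomayor set this record", 0),
    ("this island nation is mauritius", 0),
]
high_counts, low_counts = count_terms_by_value(rows)
print(high_counts["javier"], low_counts["javier"])
print(high_counts["mauritius"], low_counts["mauritius"])
```

With the DataFrame from this project, the pairs could be produced with `zip(data['clean_question'], data['high_value'])`, after which the counts for any term are a dictionary lookup away.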

## Applying the Chi-squared Test

Above we calculated only the observed values for 10 different terms. To use the chi-squared test, we also need expected values, which depend on how many high value and low value questions there are overall. Let's count those now.

In [13]:
high_value_count = data[data['high_value'] == 1].shape[0]
low_value_count = data[data['high_value'] == 0].shape[0] 
print(high_value_count)
print(low_value_count)

61422
155508


Our dataset has 61422 high value questions and 155508 low value questions.

In [14]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []
for high_observed, low_observed in observed_expected:
    total = high_observed + low_observed
    total_prop = total/data.shape[0]
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count

    observed = np.array([high_observed, low_observed])
    expected = np.array([expected_high, expected_low])
    chi_value = chisquare(observed, expected)
    chi_squared.append(chi_value)

chi_squared

[Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=1.4587975000965432, pvalue=0.2271215479927517),
 Power_divergenceResult(statistic=0.07446818777814278, pvalue=0.7849388502668134),
 Power_divergenceResult(statistic=0.7210743923775407, pvalue=0.395791714978203),
 Power_divergenceResult(statistic=1.184929392700054, pvalue=0.27635474913315955),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=0.4010346717612653, pvalue=0.5265553925560025),
 Power_divergenceResult(statistic=1.184929392700054, pvalue=0.27635474913315955)]

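Above we can see the results of the chi-squared test for 10 terms. None of these 10 terms shows a statistically significant difference between high and low value questions, because every `pvalue` is over 0.05. These terms are also so rare that their expected counts are small (below 5 in the high value category for all ten), which makes the chi-squared test unreliable for them.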
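
To make the calculation concrete, here is a sketch that works through the term `javier` by hand, using only the standard library. The row and category counts are copied from the outputs above. The p-value uses the identity that, with one degree of freedom, the chi-squared survival function equals `erfc(sqrt(statistic / 2))`; this is a shortcut for the two-category case only, not how `scipy` computes it in general.

```python
import math

total_rows = 216930
high_value_count = 61422
low_value_count = 155508

# observed counts for `javier` from the term_counter output
observed_high, observed_low = 6, 8
total = observed_high + observed_low

# expected counts if `javier` were spread like the dataset as a whole
expected_high = total / total_rows * high_value_count
expected_low = total / total_rows * low_value_count

statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)
# one degree of freedom (two categories), so the erfc identity applies
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(expected_high, 2), round(expected_low, 2))
print(round(statistic, 4), round(p_value, 4))
```

The statistic and p-value should agree with the third result in the list above (a statistic of about 1.459 and a p-value of about 0.227).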

## Applying the Chi-squared Test for Higher Frequency Terms

We can't really run our test for all the terms, because there are roughly 69,000 of them and running the code would take too long. Instead, we will run the chi-squared test on the terms with the highest frequency.

In [20]:
#instead of a set, let's make a list of all words
word_list = []
for index, row in data.iterrows():
    split_question = row['clean_question'].split(" ")
    ### remove words with less than 6 characters
    split_question = [word for word in split_question if len(word) > 5]
    #instead of checking whether the word is already in a set, just add it to the list
    for word in split_question:
        word_list.append(word)
print(len(word_list))

939591


We now have a list of almost a million words. Let's choose only the 10 most frequent words.

In [16]:
word_counts = {}
for word in word_list:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

#sort words by count, most frequent first, and keep the top 10
sorted_words = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
top_10_words = sorted_words[:10]

top_10_words_list = [word for word, count in top_10_words]
print(top_10_words_list)

['archive', 'target', '_blank', 'country', 'called', 'president', 'american', 'became', 'played', 'before']
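
The counting and sorting above could also be written with `collections.Counter` from the standard library. The sketch below uses a tiny invented `word_list` so that it runs on its own; with the real list, `most_common(10)` would return the ten most frequent words.

```python
from collections import Counter

# invented stand-in for the real word_list
word_list = ["country", "called", "country", "president", "called", "country"]

word_counts = Counter(word_list)
# most_common(n) returns (word, count) pairs, most frequent first
top_words = [word for word, count in word_counts.most_common(2)]
print(top_words)
```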


Now we have the 10 most frequently used words. Let's apply the chi-squared test to them.

In [17]:
observed_expected = []
for word in top_10_words_list:
    term_counts = term_counter(word)
    observed_expected.append(term_counts)
observed_expected

[(4732, 5802),
 (3885, 4614),
 (3864, 4570),
 (1647, 4332),
 (1700, 3721),
 (883, 2297),
 (1053, 2115),
 (903, 2230),
 (751, 2178),
 (787, 2114)]

In [18]:
chi_squared = []
p_values = []
for high_observed, low_observed in observed_expected:
    total = high_observed + low_observed
    total_prop = total/data.shape[0]
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count

    observed = np.array([high_observed, low_observed])
    expected = np.array([expected_high, expected_low])
    chi_value, p_value = chisquare(observed, expected)
    chi_squared.append(chi_value)
    p_values.append(p_value)

for word, p in zip(top_10_words_list, p_values):
    print(f"{word}: pvalue: {p}")

archive: pvalue: 0.0
target: pvalue: 1.4383186286040423e-277
_blank: pvalue: 1.0207344253417881e-278
country: pvalue: 0.18758212679269265
called: pvalue: 6.462707834956508e-07
president: pvalue: 0.49362469359699845
american: pvalue: 7.6417470118628e-10
became: pvalue: 0.5279394222257846
played: pvalue: 0.0013169445415356527
before: pvalue: 0.15635593921158333


## Statistically Significant Terms

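Above we can see that, of the top 10 words, several have p-values below 0.05. For the following words, the split between high value and low value questions differs significantly from what we would expect by chance:
- `archive`
- `target`
- `_blank`
- `called`
- `american`
- `played`

The first three are most likely not real question words. They look like leftovers of HTML links embedded in the question text (an address on a site with `archive` in its name, opened with `target="_blank"`), which `normalize_text` splits into separate terms. Such links show up in high value questions far more often than expected: `archive` appears in 4732 high value questions, about 45% of its occurrences, while high value questions make up only about 28% of the dataset.

We only used the top 10 words because even with 10 words, this code takes some time to run. However, we could use this approach for many more words to find out which ones are statistically significant.

A statistically significant word is one that appears in high value (or low value) questions more often than chance would explain; it does not mean the word is common. It could be beneficial to study questions that contain such words. In particular, someone studying old questions should probably pay extra attention to questions that contain a link (the terms `archive`, `target` and `_blank`), because they are over-represented among high value questions.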
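
One plausible source of the terms `archive`, `target` and `_blank` is an HTML link inside the question text. The sketch below runs a made-up clue through the same normalization to show which long terms survive. The clue, the file name and the exact URL are assumptions for illustration, not a row from the dataset.

```python
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r"\W", " ", text)   # \W keeps letters, digits and underscores
    text = re.sub(r"\s+", " ", text)
    return text

# hypothetical clue; the URL and file name are made up for illustration
clue = ('<a href="http://www.j-archive.com/media/clip.jpg" '
        'target="_blank">This bird</a> is seen here')

long_terms = [word for word in normalize_text(clue).split() if len(word) > 5]
print(long_terms)
```

Because `\W` does not match the underscore, `_blank` survives as a 6-character term, and the link alone contributes all the long terms of this clue. Stripping HTML tags before normalizing would keep such terms out of the analysis.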