## Winning Jeopardy!

![Image](https://www.nerdwallet.com/assets/blog/wp-content/uploads/2015/08/Winning-money.png)

### Introduction

[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV show in the US where participants answer questions to win money. In this project we'll try to find a way to become the winner and get as much money as possible.

We'll be working with the data set from [reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). It contains **216930** questions and other info. Let's take a look.

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | \$200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | \$200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | \$200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | \$200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | \$200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Show Number  216930 non-null  int64         
 1    Air Date    216930 non-null  datetime64[ns]
 2    Round       216930 non-null  object        
 3    Category    216930 non-null  object        
 4    Value       216930 non-null  object        
 5    Question    216930 non-null  object        
 6    Answer      216928 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 11.6+ MB


### Cleaning

Before we start, let's clean up the data. Several column names have a leading space. Let's fix that.

In [3]:
cols_to_fix = jeopardy.columns
print('Old \n', cols_to_fix)

cols_to_fix = cols_to_fix.str.replace(r'^ ', '', regex=True)
print('Fixed \n',cols_to_fix)

jeopardy.columns = cols_to_fix

Old 
 Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Fixed 
 Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


Let's also normalize the strings in the `Question` and `Answer` columns. By that I mean:
* Every word should be in lowercase
* No punctuation

The `re` library will help us with that. But two values are missing in the `Answer` column, so let's remove those rows first.

In [4]:
import re

jeopardy.dropna(subset=['Answer'], inplace=True)

def clean_qa(string):
    '''
    Take a string and return it in lowercase and
    without any punctuation 
    '''
    string = string.lower()
    string = re.sub(r'[^\w\s]', '', string)
    return string

jeopardy['clean_question'] = jeopardy['Question'].apply(clean_qa)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_qa)
jeopardy[['clean_question', 'clean_answer']].head()

| | clean_question | clean_answer |
|---|---|---|
| 0 | for the last 8 years of his life galileo was u... | copernicus |
| 1 | no 2 1912 olympian football star at carlisle i... | jim thorpe |
| 2 | the city of yuma in this state has a record av... | arizona |
| 3 | in 1963 live on the art linkletter show this c... | mcdonalds |
| 4 | signer of the dec of indep framer of the const... | john adams |
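
To see what the pattern `[^\w\s]` keeps and what it drops, here is a small check of the same normalization on a few strings. Apart from `McDonald's`, the sample strings are made up for illustration. Note that `\w` also matches digits and underscores, and that hyphenated words are joined rather than split.

```python
import re

def clean_qa(string):
    """Lowercase a string and drop every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', string.lower())

# Illustrative inputs; only "McDonald's" appears in the data shown above
samples = ["McDonald's", "No. 2: 1912 Olympian", "well-known", "snake_case"]
cleaned = [clean_qa(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(f'{before!r} -> {after!r}')
```

If hyphenated terms matter for the later word counts, replacing punctuation with a space instead of an empty string would be one alternative, at the cost of splitting words such as `McDonald's` in two.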


We can also convert the `Value` column to a numeric type to make it easier to work with. But we must remove the **\$** and **,** signs first.

Some values are missing; we'll replace them with 0.

In [5]:
def clean_value(string):
    '''
    Take a string, remove $ and , signs and try to convert it to int
    
    If it can't be converted, return 0
    '''
    string = re.sub(r'[$,]', '', string)
    
    try:
        string = int(string)
    except ValueError:
        string = 0
        
    return string

jeopardy['clean_value'] = jeopardy['Value'].apply(clean_value)
jeopardy['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
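
A quick check of the conversion on a few hand-written inputs may help here. The strings below are illustrative, not rows quoted from the data set. The sketch also shows a side effect worth keeping in mind: anything that can't be parsed becomes 0, so such rows can no longer be told apart from a question that is really worth nothing.

```python
import re

def clean_value(string):
    """Strip $ and , then convert to int; fall back to 0 when the text is not a number."""
    string = re.sub(r'[$,]', '', string)
    try:
        return int(string)
    except ValueError:
        return 0

# Hypothetical raw values, chosen to cover the interesting cases
raw_values = ['$200', '$2,000', 'None', '']
parsed = [clean_value(v) for v in raw_values]
print(parsed)
```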

### Analysis

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
* How often the answer can be deduced from the question
* How often questions are repeated

#### The answer from the question

We can answer the first question by counting how many words in the answer also occur in the question. We'll exclude `the` from the count because it is common in both answers and questions but carries no meaning.

Let's do this.

In [6]:
def count_match(row):
    '''
    Take a row from the dataframe and count how many
    words from the answer are also in the question
    
    Return the share of answer words that match
    '''
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    
    match_count = 0
    
    #Remove the first `the` from the answer, if present
    if 'the' in split_answer:
        split_answer.remove('the')
        
    if split_answer:
        for word in split_answer:
            if word in split_question:
                match_count += 1
                
        return match_count / len(split_answer)
        
    else:
        return 0

jeopardy['answer_in_question'] = jeopardy.apply(count_match, axis=1)
print(jeopardy['answer_in_question'].mean())
jeopardy['answer_in_question'].head()

0.05792123724515945


0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: answer_in_question, dtype: float64

As we can see, on average only about **5.79%** of the words in an answer also appear in the question. This means we can't hope to find the right answer just by hearing the question. We should probably study.
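
To make the score concrete, here is the same matching logic applied to three made-up question/answer pairs. The pairs are hypothetical and already normalized, and the function takes two strings instead of a dataframe row so that the sketch runs on its own.

```python
def count_match(clean_question, clean_answer):
    """Share of answer words that also occur in the question (one 'the' is ignored)."""
    split_question = clean_question.split()
    split_answer = clean_answer.split()

    # Drop the first 'the' from the answer, as in the cell above
    if 'the' in split_answer:
        split_answer.remove('the')

    if not split_answer:
        return 0

    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Hypothetical (question, answer) pairs, lowercased and without punctuation
pairs = [
    ('this state is home to the grand canyon', 'arizona'),
    ('the sequel to this film was titled jaws 2', 'jaws'),
    ('the hudson river separates new york from this state', 'new jersey'),
]
scores = [count_match(question, answer) for question, answer in pairs]
print(scores)
```

A score of 1 means every answer word was already in the question, and 0 means none was. The mean reported above is the average of these per-row shares.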

#### Repeated questions

Next we want to investigate how often new questions repeat older ones. For that purpose we'll:
* go from the oldest questions to the most recent
* look at every word that has at least 6 characters
* add each such word to a running vocabulary
* count how many of a question's words have already occurred

In [7]:
#Sort df by date first
jeopardy.sort_values(by=['Air Date'], inplace=True)

#Prepare the dictionary
terms_used = {}

def overlap_count(question):
    '''
    Take a question, remove all short words (< 6 chars)
    and make up a dictionary of unique words
    
    Return relative overlap number
    '''
    split_question = question.split()
    
    #Remove short words (< 6 char)
    split_question = [word for word in split_question if len(word)>5]
    
    #Count the occurrence of word
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    
    #Fill the dictionary
    for word in split_question:
        if word in terms_used:
            terms_used[word] += 1
        else:
            terms_used[word] = 1
    
    if len(split_question) != 0:
        overlap = match_count / len(split_question)
        return overlap
    else:
        return match_count
        
jeopardy['question_overlap'] = jeopardy['clean_question'].apply(overlap_count)

print(jeopardy['question_overlap'].mean())
jeopardy['question_overlap'].head()

0.8721734034757384


84523    0.0
84565    0.0
84566    0.0
84567    0.0
84568    0.0
Name: question_overlap, dtype: float64

There is about **87%** overlap between terms in new questions and terms in older questions. So we can use old questions to prepare for the game.

But this data set includes only about **10%** of all Jeopardy questions, and we looked only at single words, not phrases. Studying overlapping questions isn't the ultimate solution.
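
The running-vocabulary idea can be sketched on three made-up questions. This is a simplified version under stated assumptions: the questions are hypothetical, and a `set` replaces the count dictionary because only membership matters for the overlap score.

```python
def overlap_scores(questions, min_len=6):
    """For each question, in order, return the share of its long words seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        scores.append(matches / len(words) if words else 0)
        # Only now add this question's words, so a question never matches itself
        seen.update(words)
    return scores

# Hypothetical questions, oldest first
questions = [
    'this country borders france and portugal',
    'this country became a republic in 1931',
    'a short one',
]
scores = overlap_scores(questions)
print(scores)
```

The first question always scores 0 because the vocabulary starts empty, which is why the oldest rows printed above are all 0.0. The score measures shared vocabulary, so a high average does not by itself mean that whole questions are repeated.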

#### Question value

We could decide to pay attention to high-value questions only. To do that, we must separate all questions into two categories, for example:
* low value - any question worth less than \$800
* high value - any question worth \$800 or more

Then we can find which terms correspond to the high-value questions.

In [8]:
def high_low(value):
    '''
    Put a label on the question depending on its value
    '''
    if value < 800:
        return 0
    else:
        return 1
    
jeopardy['high_value'] = jeopardy['clean_value'].apply(high_low)
jeopardy['high_value'].head()

84523    0
84565    1
84566    1
84567    1
84568    1
Name: high_value, dtype: int64

Now we can start to count word occurrences in each category.

In [9]:
#Prepare the dictionary
terms_high_low = {w:[0,0] for w in terms_used.keys()}

def high_low_count(row):
    '''
    Takes a question, removes all short words (< 6 chars)
    and makes up a dictionary of unique words according
    question value
    
    Returns nothing, updates dict
    '''
    split_question = row['clean_question'].split()
    
    #Remove short words (< 6 char)
    split_question = [word for word in split_question if len(word)>5]
    
    #Fill the dictionary: index 0 counts high-value questions, index 1 low-value ones
    if row['high_value'] == 1:
        for word in split_question:
            terms_high_low[word][0] += 1
    else:
        for word in split_question:
            terms_high_low[word][1] += 1
    
    return None

#Update the dict
jeopardy.apply(high_low_count, axis=1)

import random

#Take 10 random terms
random.seed(0)
comparison_terms = random.sample(tuple(terms_used), k=10)

observed_expected = []
for word in comparison_terms:
    result = terms_high_low[word]
    observed_expected.append(result)
    
observed_expected

[[0, 1],
 [1, 0],
 [0, 1],
 [0, 1],
 [5, 8],
 [1, 0],
 [2, 0],
 [1, 0],
 [0, 1],
 [1, 0]]

Now we've found the observed counts for the chi-squared test.
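
The layout of each pair is easy to mix up, so here is a toy version of the counting step. The three rows are hypothetical stand-ins for dataframe rows, and a `defaultdict` replaces the pre-filled dictionary. As above, position 0 holds the high-value count and position 1 the low-value count.

```python
from collections import defaultdict

# Hypothetical (clean_question, high_value) pairs standing in for dataframe rows
rows = [
    ('this french painter founded impressionism', 1),
    ('this french cheese is named for a village', 0),
    ('the capital of this country is canberra', 0),
]

# term -> [count in high-value questions, count in low-value questions]
term_counts = defaultdict(lambda: [0, 0])
for question, high_value in rows:
    column = 0 if high_value == 1 else 1
    for word in question.split():
        if len(word) > 5:  # keep words with at least 6 characters
            term_counts[word][column] += 1

print(dict(term_counts))
```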

#### Chi-Squared test

We are almost ready for the chi-squared test; we just need the expected values. We can get them from each term's total count as a proportion of all questions and the overall numbers of high-value and low-value questions.

In [10]:
from scipy.stats import chisquare

#Number of questions in each category
total_high = jeopardy['high_value'].sum()
total_low = len(jeopardy) - total_high

chi_squared = []
for pair in observed_expected:
    pair_sum = pair[0] + pair[1]
    total_prop = pair_sum / len(jeopardy)
    
    #Expected values
    high_exp = total_high * total_prop
    low_exp = total_low * total_prop
    
    chi_result = chisquare(pair, [high_exp, low_exp])
    chi_squared.append(chi_result)
    
chi_squared

[Power_divergenceResult(statistic=0.754427963702829, pvalue=0.3850779221952164),
 Power_divergenceResult(statistic=1.3255076006089066, pvalue=0.24960599546245862),
 Power_divergenceResult(statistic=0.754427963702829, pvalue=0.3850779221952164),
 Power_divergenceResult(statistic=0.754427963702829, pvalue=0.3850779221952164),
 Power_divergenceResult(statistic=0.10931382247720889, pvalue=0.7409266954088055),
 Power_divergenceResult(statistic=1.3255076006089066, pvalue=0.24960599546245862),
 Power_divergenceResult(statistic=2.651015201217813, pvalue=0.10348378920405768),
 Power_divergenceResult(statistic=1.3255076006089066, pvalue=0.24960599546245862),
 Power_divergenceResult(statistic=0.754427963702829, pvalue=0.3850779221952164),
 Power_divergenceResult(statistic=1.3255076006089066, pvalue=0.24960599546245862)]

None of the terms had a significant difference in usage between high-value and low-value rows. Also, most of the frequencies were low, so the chi-squared test is not reliable for them.
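
With only two cells, the test can be reproduced by hand: the statistic is the sum of (observed - expected)² / expected, and with one degree of freedom the p-value equals `erfc(sqrt(statistic / 2))`. The sketch below applies this to a term seen once, in a high-value question. The totals `total_high` and `total_low` are hypothetical round numbers, not the real totals of the data set.

```python
import math

def chi_squared_two_cells(observed, expected):
    """Pearson chi-squared statistic and p-value for two cells (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, for illustration only
total_high, total_low = 430, 570
total = total_high + total_low

observed = [1, 0]  # [high-value count, low-value count]
total_prop = sum(observed) / total
expected = [total_high * total_prop, total_low * total_prop]

statistic, p_value = chi_squared_two_cells(observed, expected)
print(statistic, p_value)
```

For a single occurrence the statistic reduces to `total_low / total_high` (or its inverse for `[0, 1]`), whatever the word is. That is why the same two values repeat in the output above.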

#### High frequency chi-Squared

It would be better to run this test with only terms that have higher frequencies.

In [11]:
top_used = pd.Series(terms_used).sort_values(ascending=False)
top_used.head(10)

called       5461
country      4868
became       3162
played       3011
president    3010
before       2909
american     2837
capital      2772
famous       2497
french       2488
dtype: int64

Let's apply the chi-squared test to the top 10 words.

In [12]:
top_10 = top_used.iloc[:10]

chi_squared_top = {}

for word in top_10.index:    
    total_prop = top_10[word] / len(jeopardy)

    #Expected values
    high_exp = total_high * total_prop
    low_exp = total_low * total_prop
    
    chi_result = chisquare(terms_high_low[word], [high_exp, low_exp])
    chi_squared_top[word] = chi_result
    
chi_squared_top

{'called': Power_divergenceResult(statistic=15.42649873971819, pvalue=8.577700638823399e-05),
 'country': Power_divergenceResult(statistic=0.03755079185319079, pvalue=0.8463479388356194),
 'became': Power_divergenceResult(statistic=0.13680455884183906, pvalue=0.7114786079420359),
 'played': Power_divergenceResult(statistic=7.98615820916753, pvalue=0.004713633071089914),
 'president': Power_divergenceResult(statistic=2.2058789247447224, pvalue=0.13748551321649827),
 'before': Power_divergenceResult(statistic=0.9415299409384695, pvalue=0.33188469035879187),
 'american': Power_divergenceResult(statistic=25.84258518926645, pvalue=3.70424931415654e-07),
 'capital': Power_divergenceResult(statistic=0.5312119125428831, pvalue=0.4660977915180684),
 'famous': Power_divergenceResult(statistic=0.703109189836983, pvalue=0.4017409294552422),
 'french': Power_divergenceResult(statistic=105.06948798342381, pvalue=1.1792691791457759e-24)}

We've got statistically significant results (p < 0.05) for a few words:
* called
* played
* american
* french

Only two of them (`american` and `french`) look meaningful as topics, but the chi-squared test alone doesn't tell us the direction of the difference between high-value and low-value questions. Let's check the counts for both types of questions.

In [14]:
print('american', terms_high_low['american'])
print('french', terms_high_low['french'])

american [1354, 1483]
french [1323, 1165]


### Conclusion

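The first value is the number of occurrences in high-value questions; the second is the number in low-value questions. In raw counts, **`american`** occurs more often in **low-value** questions and **`french`** more often in **high-value** ones. Raw counts have to be read against the expected counts, though. The statistic of about 1.33 for the `[1, 0]` pairs equals the ratio of low-value to high-value questions, while `american` has only about 1.10 low-value occurrences per high-value one (1483 / 1354). So, relative to expectation, both words are over-represented in high-value questions.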
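
The direction of each difference can be read by comparing the observed high-value count with the expected one. In this minimal sketch the observed counts are the ones printed above. The ratio `low_to_high` is an assumption: it is taken from the statistic printed earlier for a `[1, 0]` pair, which for a single occurrence equals `total_low / total_high`.

```python
# Observed [high-value, low-value] counts printed above
observed = {'american': [1354, 1483], 'french': [1323, 1165]}

# Assumption: ratio of low-value to high-value questions, taken from the
# chi-squared statistic reported for a term seen once in a high-value question
low_to_high = 1.3255076006089066
share_high = 1 / (1 + low_to_high)

directions = {}
for word, (high, low) in observed.items():
    expected_high = (high + low) * share_high
    directions[word] = 'above' if high > expected_high else 'below'
    print(f'{word}: observed high = {high}, expected high = {expected_high:.0f}')

print(directions)
```

An observed count above the expected one means the word appears in high-value questions more often than its overall frequency would suggest, even when its raw low-value count is larger.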

So we can't pick a guaranteed strategy for winning Jeopardy from the statistically significant chi-squared results alone. This task needs further exploration.