# Winning Jeopardy


Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv and contains roughly 20,000 rows (19,999 entries, as the `info()` output below shows) from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Here's the beginning of the file:

## Jeopardy Questions

In [10]:
import pandas as pd
import numpy as np

jeopardy = pd.read_csv('jeopardy.csv')

In [11]:
jeopardy.head()

|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [12]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1    Air Date    19999 non-null  object
 2    Round       19999 non-null  object
 3    Category    19999 non-null  object
 4    Value       19999 non-null  object
 5    Question    19999 non-null  object
 6    Answer      19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [13]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [14]:
jeopardy.columns = jeopardy.columns.str.replace(" ","")

In [15]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing Text

We need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you put words in lowercase and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

In [16]:
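# Function to normalise a string: strip punctuation and convert to lowercase

import re

def normalise(string):
    # Drop every character that is not a word character or whitespace
    string = re.sub(r'[^\w\s]','',string)
    string = string.lower()
    return string

normalise('Wo he, .,??m mje bfubf uWED')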

'wo he m mje bfubf uwed'
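As a quick check of the motivation given above, the sketch below re-declares the same normalisation in one line and confirms that `Don't` and `don't` end up as the same token. Note that the apostrophe is stripped as punctuation, so both become `dont`.

```python
import re

def normalise(string):
    # Same approach as the function above: drop punctuation, then lowercase.
    return re.sub(r'[^\w\s]', '', string).lower()

# The two spellings differ only in case, so they normalise to the same token.
same = normalise("Don't") == normalise("don't")
print(normalise("Don't"), same)
```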

In [17]:
##Normalize the Question column.

jeopardy['clean_question'] = jeopardy["Question"].apply(normalise)

##Normalize the Answer column.

jeopardy['clean_answer'] = jeopardy["Answer"].apply(normalise)

jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

## Normalizing Columns

There are also some other columns to normalize.

The Value column should be numeric so that you can manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should be a datetime, not a string, so that you can work with it more easily.

Normalizing the "Value" and "Air Date" columns:

In [18]:
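def normalise_dollar(dollar):
    # Strip punctuation such as the dollar sign and thousands separators
    dollar = re.sub(r'[^\w\s]','',dollar)
    try:
        dollar = int(dollar)
    except ValueError:
        # Values that are not numeric are treated as 0
        dollar = 0
    return dollar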


normalise_dollar("$20")

20

In [19]:
## Normalise the Value column

jeopardy['clean_value'] = jeopardy["Value"].apply(normalise_dollar)


## Convert AirDate to datetime
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])
jeopardy["AirDate"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: AirDate, dtype: datetime64[ns]
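The same two conversions can also be written with vectorised pandas operations instead of a row-by-row `apply`. The following is a minimal sketch on a small made-up frame; the sample values (including the non-numeric `"None"` entry) are assumptions for illustration, not rows from jeopardy.csv.

```python
import pandas as pd

# Hypothetical sample values that mimic the formats of the Value and AirDate columns.
sample = pd.DataFrame({
    "Value": ["$200", "$2,000", "None"],
    "AirDate": ["2004-12-31", "2010-07-06", "1997-02-14"],
})

# Keep only the digits, then convert; entries with no digits become NaN and are set to 0.
digits = sample["Value"].str.replace(r"[^\d]", "", regex=True)
sample["clean_value"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

# pd.to_datetime accepts a whole Series, so no apply is needed.
sample["AirDate"] = pd.to_datetime(sample["AirDate"])

print(sample.dtypes)
```

Treating unparseable values as 0 mirrors the behaviour of `normalise_dollar`; whether 0 is the right placeholder depends on how the column is used later.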

##  Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

You can answer the second question by seeing how often complex words (six or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [20]:
def match_counter(row):

    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    
    for word in split_answer:  
        if word in split_question:
            match_count+=1
    return match_count/len(split_answer)

#Count how many times terms in clean_answer occur in clean_question.    
jeopardy['answer_in_question']= jeopardy.apply(match_counter, axis = 1)

In [21]:
jeopardy["answer_in_question"].mean()

0.059001965249777744

On average, 5.9% of the words in an answer also occur in its question. This is a very low proportion, so I won't rely on it much for studying purposes; i.e., you have to know your stuff and can't guess the answer from the question.
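To make the per-row score concrete, here is a small standalone sketch of the same calculation on a single made-up question/answer pair. The function name `answer_overlap` and the example strings are hypothetical. Unlike `match_counter`, it drops every occurrence of "the" rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalised question/answer pair (not taken from the dataset).
question = "this iron tower in paris was completed in 1889"
answer = "the eiffel tower"
fraction = answer_overlap(question, answer)
print(fraction)  # 1 of the 2 remaining answer words ("tower") is in the question
```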

## Recycled Questions

Finding the mean question overlap:

In [22]:
question_overlap=[]
terms_used = set()

jeopardy.sort_values(by=['AirDate'], inplace=True) # This now sorts in date order

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    
    match_count = 0
    
    for term in split_question:
        if term in terms_used:
            match_count+=1
            
    for word in split_question:
        terms_used.add(word)
    if len(split_question)>0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.6876235590919714

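On average, 68.8% of the longer words (six or more characters) in a question have already appeared in an earlier question, so there is a fair amount of recycling. This only measures overlap of single words rather than whole phrases, but it suggests that studying past questions is worthwhile, so I would start there.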
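The overlap score is easier to follow on a tiny example. The sketch below applies the same idea (words of six or more characters, compared against everything seen in earlier questions) to three made-up questions. The function name `overlap_scores` and the sample strings are assumptions for illustration.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of long words already seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        long_words = [w for w in question.split() if len(w) >= min_length]
        matches = sum(1 for w in long_words if w in seen)
        seen.update(long_words)
        # Questions without any long words get a score of 0.
        scores.append(matches / len(long_words) if long_words else 0)
    return scores

# Made-up, already-normalised questions in air-date order.
toy_questions = [
    "napoleon invaded russia in this year",
    "this emperor crowned himself before napoleon invaded europe",
    "short words only",
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet. The second reuses "napoleon" and "invaded" out of seven long words. The third has no long words at all, which shows why the order of the rows matters for this measure.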


## Low Value vs High Value Questions

In [23]:
def q_value(row):
    value = 0
    if row["clean_value"] >800:
        value = 1
    return value
        
jeopardy['high_value'] = jeopardy.apply(q_value, axis =1)



In [24]:
def q_value_count(word):
    
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [25]:
import random

# random.sample needs a sequence, so convert the set to a list first
comparison_terms = random.sample(list(terms_used), 10)

comparison_terms

['pirating',
 'fleurs',
 'trendy',
 'mounds',
 'frenchman',
 'biscuits',
 'suites',
 'reallife',
 'designate',
 'appalling']

In [26]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(q_value_count(term)) 

In [27]:
observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (1, 3),
 (5, 7),
 (0, 1),
 (1, 3),
 (3, 7),
 (0, 1),
 (0, 1)]

## Applying the Chi-squared Test

1. Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.
2. Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.
3. Create an empty list called chi_squared.
4. Loop through each list in observed_expected.
   - Add up both items in the list (high and low counts) to get the total count, and assign to total.
   - Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
   - Multiply total_prop by high_value_count to get the expected term count for high value rows.
   - Multiply total_prop by low_value_count to get the expected term count for low value rows.
   - Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
   - Append the results to chi_squared.
5. Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.

In [28]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

for observation in observed_expected:
    total = sum(observation)
    total_prop = total/jeopardy.shape[0]
    
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    
    observed = np.array([observation[0], observation[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.9909151991757656, pvalue=0.3195187946580277),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.008630851497838939, pvalue=0.9259811180040979),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
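The expected-count and chi-squared steps described in this section can also be checked by hand for a single term. The sketch below uses only the standard library. The high/low row counts are hypothetical round numbers chosen for illustration (the real counts are not printed in this notebook), so the result will not match the values above exactly.

```python
import math

# Hypothetical counts, chosen only for illustration (not taken from the dataset).
high_value_count = 6000
low_value_count = 14000
n_rows = high_value_count + low_value_count

# Observed (high, low) counts for one term, as in observed_expected.
observed_high, observed_low = 5, 7

total = observed_high + observed_low
total_prop = total / n_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = ((observed_high - expected_high) ** 2 / expected_high
        + (observed_low - expected_low) ** 2 / expected_low)

# With two categories there is one degree of freedom, for which the
# chi-squared survival function reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(expected_high, expected_low, chi2, p_value)
```

With expected counts this small, the usual rule of thumb that every expected count should be at least 5 is not met, which is worth keeping in mind when reading the p-values.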

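None of the p-values is below 0.05, so none of these terms shows a statistically significant difference in usage between high-value and low-value questions. The observed frequencies are also very low (most terms appear only once or a handful of times), which means the chi-squared test is not very reliable here. It would be better to run the test only on terms with higher frequencies.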