
# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win if we had an opportunity to compete on Jeopardy.

In [1]:
import pandas as pd

In [2]:
csv = "https://drive.google.com/u/0/uc?id=0BwT5wj_P7BKXUl9tOUJWYzVvUjA&export=download" # dataset with 200K records
# csv = "jeopardy.csv" # sample dataset with 20K records
jeopardy = pd.read_csv(csv)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# remove spaces from column names
jeopardy.columns = jeopardy.columns.str.replace(' ', '')

## Normalize Columns
Normalize columns of text so that we can compare them and words will not be counted as different if they have different capitalization or punctuation.

In [5]:
import re

def clean_column(s):
    '''Function takes a string as input, converts it to lowercase, removes 
    punctuation, and returns the clean string.
    '''
    s = str(s)
    s = s.lower()
    # remove anything that is not a word character or a white space
    s = re.sub(r'[^\w\s]', '', s)
    return s

In [6]:
# clean question and answer columns so they will be ready for processing
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_column)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_column)

In [7]:
jeopardy

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
...,...,...,...,...,...,...,...,...,...
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot,this puccini opera turns on the solution to 3 ...,turandot
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse,in north america this term is properly applied...,a titmouse
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker,in penny lane where this hellraiser grew up th...,clive barker
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo,from ft sill okla he made the plea arizona is ...,geronimo


In [8]:
def normalize_values(s):
    '''Function takes string as input, removes punctuation, converts string to 
    integer, returns integer. If there is an error, returns 0.
    '''
    # remove dollar signs and commas by removing everything except word characters
    s = re.sub(r'[^\w]', '', s)
    # if the new value will not convert to int, return zero
    try:
        val = int(s)
    except ValueError:
        val = 0
    return val

In [9]:
# clean dollar sign and punctuation from value column and convert to integer
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)

In [10]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   ShowNumber      216930 non-null  int64 
 1   AirDate         216930 non-null  object
 2   Round           216930 non-null  object
 3   Category        216930 non-null  object
 4   Value           216930 non-null  object
 5   Question        216930 non-null  object
 6   Answer          216928 non-null  object
 7   clean_question  216930 non-null  object
 8   clean_answer    216930 non-null  object
 9   clean_value     216930 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 16.6+ MB


In [11]:
# correct the data type of the date column
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

## Answer in question
To figure out whether we need to study, we will look at how often the answer can be deduced from the question itself. We will investigate how many words in the answer also occur in the question.

In [12]:
import requests

# get list of stopwords to filter out
stopwords_list = requests.get("https://gist.githubusercontent.com/rg089/35e00abf8941d72d419224cfd5b5925d/raw/12d899b70156fd0041fa9778d657330b024b959c/stopwords.txt").content
stopwords = set(stopwords_list.decode().splitlines())

Stopwords are words on a stop list that are filtered out before or after processing natural language data because they provide no useful information for deciding which category a text belongs to. This may be because they carry little meaning on their own (prepositions, conjunctions, etc.) or because they are too frequent in the classification context. There is no single universal list of stopwords used by all natural language processing tools, so we picked a publicly available list.

In [13]:
def word_match(df):
    '''Function takes row in jeopardy dataframe. Returns ratio of words in 
    answer that match words in question to total words in answer.
    '''
    # split strings into lists
    split_answer = df['clean_answer'].split()
    split_question = df['clean_question'].split()

    # remove stopwords from the list
    split_answer = [word for word in split_answer if word not in stopwords]

    # to avoid division by zero error, return 0 if length of list is zero
    if len(split_answer) == 0:
        return 0

    # count words in answer that exist in question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1

    # return ratio of matching words / total words
    return match_count / len(split_answer)

In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(word_match, axis=1)

In [15]:
jeopardy['answer_in_question'].mean()

0.03788726513937311

### Answer in question results
On average, only about 3.8% of the non-stopword words in an answer also appear in its question. That is a small share, so hearing the question will rarely give away the answer, and we will need to study.
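
To make the ratio concrete, here is a minimal standalone sketch of the same calculation on two made-up rows. The name `match_ratio`, the tiny stopword set, and the example strings are assumptions for illustration, not part of the dataset.

```python
def match_ratio(clean_answer, clean_question, stopwords):
    """Share of non-stopword answer words that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w not in stopwords]
    # avoid dividing by zero when the answer contains only stopwords
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical stopword list and rows, for illustration only.
toy_stopwords = {"the", "of", "a"}
ratio_none = match_ratio("copernicus", "galileo was under house arrest", toy_stopwords)
ratio_half = match_ratio("the art of war", "this art form uses paper", toy_stopwords)
print(ratio_none, ratio_half)
```

In the second row only "art" and "war" survive the stopword filter, and one of the two appears in the question, so the ratio is one half.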

## Recycled questions
To decide whether studying past questions is worthwhile, we will investigate how often new questions repeat older ones. As a proxy, we will see how often complex words (six or more characters) recur. Looking only at words with six or more characters filters out words like 'the' and 'than', which are common but tell us little about a question.

In [16]:
# sort dataframe by date
jeopardy.sort_values('AirDate', inplace=True)

In [17]:
import datetime

In [18]:
# print(datetime.datetime.now().strftime('%H:%M:%S:%f'))
# question_overlap = []
# terms_used = set([])

# for index, row in jeopardy.iterrows():
#     split_question = row['clean_question'].split()
#     split_question = [word for word in split_question if len(word) > 5]
#     match_count = 0
#     for word in split_question:
#         if word in terms_used:
#             match_count +=1
#         terms_used.add(word)
#     if len(split_question) > 0:
#         match_count /= len(split_question)
#     question_overlap.append(match_count)

# jeopardy['question_overlap'] = question_overlap
# print(datetime.datetime.now().strftime('%H:%M:%S:%f'))

The above block of code works, but it runs slowly. Let's write a function and use the apply method instead to see if it runs faster.

In [19]:
print(datetime.datetime.now().strftime('%H:%M:%S:%f'))

# empty set that will hold all words in questions
terms_used = set([])

def overlap(s):
    '''Function takes string as input and returns ratio of words that match 
    previous questions to total words.
    '''
    # split string to list
    split_question = s.split()
    # filter out words shorter than 6 characters
    split_question = [word for word in split_question if len(word) > 5]
    # count of how many times a word matches words in previous questions
    match_count = 0
    for word in split_question:
        # if the word is in the set of terms used, add 1 to match count
        if word in terms_used:
            match_count +=1
        # if the word is not already in the set of terms used, add it
        terms_used.add(word)
    # to avoid divide by zero error, check if there are words in split question
    if len(split_question) > 0:
        # divide matching words by total words in the question
        match_count /= len(split_question)
    return match_count

# create new column using apply
jeopardy['question_overlap'] = jeopardy['clean_question'].apply(overlap)
print(datetime.datetime.now().strftime('%H:%M:%S:%f'))

21:35:08:708001
21:35:09:806616


With the 20K sample size dataset, the first code block takes about 3 seconds to run. The second code block, using apply, runs in about 1 second. We will comment out the first code block to save processing time.
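
The timings above come from printing timestamps before and after the work. As an alternative, here is a small sketch that measures elapsed time directly with `time.perf_counter` from the standard library. The helper `time_call` and the stand-in workload `count_long_words` are hypothetical names, not part of the notebook.

```python
import time

def time_call(func, *args):
    """Return (result, elapsed_seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical workload standing in for the overlap computation.
def count_long_words(text):
    # repeat the word list 1000 times so there is something to measure
    return sum(1 for word in text.split() * 1000 if len(word) > 5)

result, elapsed = time_call(count_long_words, "which president became famous")
print(f"{result} long words counted in {elapsed:.4f} s")
```

A single measurement like this is noisy, so for a fairer comparison it may help to repeat each version several times and compare the best or median run.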

In [20]:
jeopardy['question_overlap'].mean()

0.8726690023368967

### Recycled questions results
On average, about 87% of the complex words in a question already appeared in earlier questions. This measure looks only at single terms, not phrases, so on its own it says little about whether whole questions are repeated, but it does suggest that question recycling is worth investigating further.
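
One possible next step, sketched here under assumptions, is to measure overlap on pairs of adjacent words (bigrams) instead of single terms. The function name `bigram_overlap` and the toy questions are hypothetical. Unlike the loop above, this sketch only adds a question's pairs to the set after the whole question has been scored.

```python
def bigram_overlap(questions):
    """For each question, return the share of its word pairs seen in earlier questions."""
    seen = set()
    ratios = []
    for question in questions:
        words = question.split()
        # adjacent word pairs, e.g. ("this", "country")
        bigrams = list(zip(words, words[1:]))
        if bigrams:
            matches = sum(1 for pair in bigrams if pair in seen)
            ratios.append(matches / len(bigrams))
        else:
            # one-word or empty questions have no pairs to compare
            ratios.append(0)
        seen.update(bigrams)
    return ratios

# Hypothetical cleaned questions in air-date order, for illustration only.
toy_questions = [
    "this country borders france",
    "this country exports coffee",
    "coffee",
]
ratios = bigram_overlap(toy_questions)
print(ratios)
```

Phrase-level overlap should come out much lower than the single-term figure, which would give a more realistic picture of how often questions are actually reused.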

## Low value vs high value questions
In order to earn more money while we are on Jeopardy, we will study topics that pertain to high value questions instead of low value questions.

We can figure out which terms correspond to high value questions by using a chi-squared test. We will first need to sort the questions into two categories:
* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.
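
As a sketch of how this split could be computed without a row-by-row function, the comparison can be applied to the whole column at once. The small DataFrame below is made up to show the boundary case; a value of exactly 800 is treated as low, matching the `> 800` test in the notebook's code.

```python
import pandas as pd

# Hypothetical values, chosen to show the boundary case at 800.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 2000, 0]})

# The comparison yields booleans; astype(int) turns them into 1 (high) and 0 (low).
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
high_flags = toy["high_value"].tolist()
print(high_flags)
```

On a large DataFrame this vectorized form is usually much faster than `apply` with `axis=1`, because the comparison runs on the whole column rather than one row at a time.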


In [21]:
def high_low(df):
    '''Function takes row of jeopardy dataframe as input. If the clean_value 
    column is greater than 800, return 1 if not return 0.
    '''
    if df['clean_value'] > 800:
        return 1
    else:
        return 0

In [22]:
# create indicator column that contains 1 for high value and 0 for low value
jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)

In [23]:
def high_low_count(word):
    '''Function takes word as input, loops through each row in the dataframe, 
    and counts how many times it was used in high and low value questions.
    '''
    low_count = 0
    high_count = 0
    # loop through rows in dataframe
    for index, row in jeopardy.iterrows():
        # split string to list
        split_question = row['clean_question'].split()
        # if the word is in this row
        if word in split_question:
            # if question is high value add 1 to the high count
            if row['high_value'] == 1:
                high_count+=1
            # if question is low value add one to the low count
            else:
                low_count+=1
    # return number of times word appeared in high and low value questions
    return high_count, low_count

In [24]:
# create dataframe that counts how many times each word in the dataset is used
freq_df = jeopardy['clean_question'].str.split(expand=True).stack().value_counts().reset_index()
freq_df.head()

Unnamed: 0,index,0
0,the,180064
1,this,126776
2,of,114349
3,in,103144
4,a,101607


In [25]:
# rename columns
freq_df.rename(columns={'index': 'word', 0: 'question_count'}, inplace=True)

In [26]:
# filter the dataframe by terms used, since it only includes complex words
freq_df = freq_df[freq_df['word'].isin(terms_used)].copy()
freq_df.head()

Unnamed: 0,word,question_count
39,called,5461
49,country,4868
78,became,3162
81,played,3011
82,president,3010


For our final chi-square approximation to be valid, the expected frequency should be at least 5. Let's see how many questions a word has to appear in to get expected values of at least 5.

In [27]:
# total questions in dataset
total_questions = jeopardy.shape[0]

# get total number of high & low value questions in the dataset
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

# let's try 20 questions
total_prop = 20 / total_questions

# calculate expected high and low values for a word that appears in 20 questions
exp_high_count = total_prop * high_value_count
exp_low_count = total_prop * low_value_count

print(exp_high_count, exp_low_count)

5.662840547642096 14.337159452357902
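
Written out, the calculation above is as follows, where $n$ is the number of questions a word appears in, $N$ is the total number of questions, and $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high and low value questions (symbols introduced here for illustration):

$$
E_{\text{high}} = n \cdot \frac{N_{\text{high}}}{N}, \qquad E_{\text{low}} = n \cdot \frac{N_{\text{low}}}{N}
$$

With $n = 20$ the printed values give $E_{\text{high}} \approx 5.66$ and $E_{\text{low}} \approx 14.34$, which implies $N_{\text{high}} / N \approx 0.283$. Both expected counts are above 5, so a word that appears in at least 20 questions meets the rule of thumb.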


In [28]:
# filter the dataframe by words that appear in more than 20 questions
freq_df = freq_df[freq_df['question_count'] > 20]
freq_df

Unnamed: 0,word,question_count
39,called,5461
49,country,4868
78,became,3162
81,played,3011
82,president,3010
...,...,...
10858,tunisia,21
10859,pointing,21
10861,mausoleum,21
10862,sabrina,21


In [29]:
frequent_terms =  freq_df['word'].to_list()

Randomly pick 10 terms, because doing this for all of the terms would take a very long time.

In [30]:
import random

comparison_terms = random.sample(frequent_terms, 10)
comparison_terms

['missionary',
 'produces',
 'acquitted',
 'performed',
 'operetta',
 'target_blankthis',
 'mainly',
 'bottom',
 'islamic',
 'elsewhere']

In [31]:
# get the observed high value and low value counts for each word in comparison terms
observed_expected = []

for i in comparison_terms:
    h_l = high_low_count(i)
    observed_expected.append(h_l)

observed_expected

[(22, 33),
 (49, 119),
 (6, 17),
 (72, 215),
 (22, 27),
 (51, 64),
 (43, 124),
 (43, 157),
 (17, 42),
 (9, 20)]

In [32]:
from scipy.stats import chisquare
import numpy as np

In [33]:
chi_squared = []

for obs_high_count, obs_low_count in observed_expected:
    # total number of times the word appears
    total = obs_high_count + obs_low_count
    # total proportion of how often the word appears divided by the total number of questions
    total_prop = total / jeopardy.shape[0]

    # get expected values by multiplying the total proportion by the total number of high/low questions
    exp_high_count = total_prop * high_value_count
    exp_low_count = total_prop * low_value_count

    # create numpy arrays
    observed = np.array([obs_high_count, obs_low_count])
    expected = np.array([exp_high_count, exp_low_count])

    # add chivalue and pvalue to chi_squared list
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=3.700342529477874, pvalue=0.05440129897014992),
 Power_divergenceResult(statistic=0.06014836782359, pvalue=0.8062615926457806),
 Power_divergenceResult(statistic=0.05621171538502828, pvalue=0.8125868927130006),
 Power_divergenceResult(statistic=1.4725425958422673, pvalue=0.22494495745754595),
 Power_divergenceResult(statistic=6.6393328979036035, pvalue=0.009975128042275058),
 Power_divergenceResult(statistic=14.565444332456028, pvalue=0.00013537444924416217),
 Power_divergenceResult(statistic=0.5416149903282792, pvalue=0.46176419173876504),
 Power_divergenceResult(statistic=4.575332281103448, pvalue=0.0324354548070976),
 Power_divergenceResult(statistic=0.0072482992491946, pvalue=0.9321525197784184),
 Power_divergenceResult(statistic=0.10572745161307058, pvalue=0.745061812207719)]

Most of our p-values are larger than 0.05, so those words show no statistically significant difference in usage between high and low value questions. Let's take a look at which words did have statistically significant results.

In [40]:
for i in range(10):
    word = comparison_terms[i]
    observed_h_l = observed_expected[i]
    chivalue = chi_squared[i].statistic
    pvalue = chi_squared[i].pvalue
    if pvalue < 0.05:
        print(word, observed_h_l, chivalue, pvalue)

operetta (22, 27) 6.6393328979036035 0.009975128042275058
target_blankthis (51, 64) 14.565444332456028 0.00013537444924416217
bottom (43, 157) 4.575332281103448 0.0324354548070976


## Chi-squared results
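We planned on selecting the words with the biggest differences in usage between high and low value questions by picking the words with the highest chi-squared values. In raw counts, all three of our statistically significant words occur more often in low value questions than in high value questions, but that is expected, because about 72% of all questions are low value. Compared with the expected counts, 'operetta' (22 high value questions observed, about 14 expected) and 'target_blankthis' (51 observed, about 33 expected) are over-represented in high value questions, while 'bottom' (43 observed, about 57 expected) is under-represented. Of these, only 'operetta' is meaningful in our context; 'target_blankthis' looks like residue of HTML link markup (`target="_blank"`) left in the question text after punctuation was removed.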
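
One way to read the direction of an effect is to compare observed counts with expected counts rather than comparing the raw high and low counts with each other. The sketch below recomputes the test for 'operetta' with the standard library only. The share of high value questions is taken from the expected counts printed earlier, and the helper name `chi_square_two_cells` is hypothetical.

```python
import math

def chi_square_two_cells(observed, expected):
    """Pearson chi-squared statistic and p-value for two cells (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Share of high value questions, taken from the expected counts printed earlier.
high_share = 5.662840547642096 / 20

# Observed (high, low) counts for 'operetta' from the output above.
obs_high, obs_low = 22, 27
total = obs_high + obs_low
exp_high = total * high_share
exp_low = total - exp_high

statistic, p_value = chi_square_two_cells((obs_high, obs_low), (exp_high, exp_low))
print(round(exp_high, 2), round(statistic, 4), round(p_value, 4))
```

The statistic and p-value should agree with the `chisquare` output for 'operetta', and comparing `obs_high` with `exp_high` shows whether the word leans toward high or low value questions.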

If we think we might actually get to play on Jeopardy, we might further investigate what words occur in questions and see if we can find more words that we should add to our list of stop words. We would also need to look at a larger sample than 10 words.
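
Looking at more than 10 words is mainly a matter of speed, since the approach above rescans every row for each word. Here is a minimal sketch, assuming the rows are available as `(clean_question, high_value)` pairs, that counts every term of interest in a single pass. The function name `count_terms_by_value` and the toy rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows, terms):
    """Count, in one pass, how many high and low value questions contain each term."""
    terms = set(terms)
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # set() so a word repeated within one question is counted once
        present = terms & set(clean_question.split())
        if high_value == 1:
            high_counts.update(present)
        else:
            low_counts.update(present)
    return {term: (high_counts[term], low_counts[term]) for term in terms}

# Hypothetical rows of (clean_question, high_value), for illustration only.
toy_rows = [
    ("this operetta premiered in vienna", 1),
    ("an operetta by gilbert and sullivan", 0),
    ("the bottom of the ocean", 0),
    ("operetta operetta operetta", 1),
]
counts = count_terms_by_value(toy_rows, ["operetta", "bottom"])
print(counts)
```

With the real data, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`, so the dataset is scanned once regardless of how many terms are tested.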