# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for many years and is a major force in popular culture.

In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win if we had an opportunity to compete on Jeopardy.

In [1]:
import pandas as pd

In [2]:
url = "https://drive.google.com/u/0/uc?id=0BwT5wj_P7BKXUl9tOUJWYzVvUjA&export=download" # dataset with 200K records
# url = "jeopardy.csv" # sample dataset with 20K records
jeopardy = pd.read_csv(url)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# remove spaces from column names
jeopardy.columns = jeopardy.columns.str.replace(' ', '')

In [5]:
import re

def clean_column(s):
    '''Takes a value as input, converts it to a lowercase string, removes
    punctuation, and returns the cleaned string.
    '''
    s = str(s)
    s = s.lower()
    # remove anything that is not a word character or whitespace
    # (raw string so that \w and \s reach the regex engine unchanged)
    s = re.sub(r'[^\w\s]', '', s)
    return s

In [6]:
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_column)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_column)

In [7]:
jeopardy

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
...,...,...,...,...,...,...,...,...,...
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot,this puccini opera turns on the solution to 3 ...,turandot
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse,in north america this term is properly applied...,a titmouse
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker,in penny lane where this hellraiser grew up th...,clive barker
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo,from ft sill okla he made the plea arizona is ...,geronimo


In [8]:
def normalize_values(s):
    '''Takes a string as input, strips every character that is not a word
    character or a period, and converts the result to an integer. Returns 0
    if the conversion fails.
    '''
    s = re.sub(r'[^\w.]', '', s)
    try:
        val = int(s)
    except ValueError:
        # non-numeric leftovers cannot be parsed, so fall back to 0
        val = 0
    return val
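
As a quick check of the behavior described in the docstring, the sketch below restates the same logic in a standalone helper and runs it on a few sample strings. The helper name `to_int_value` is hypothetical, and the input `'None'` is only an assumed example of a non-numeric value.

```python
import re

def to_int_value(raw):
    """Hypothetical standalone version of normalize_values for experimenting."""
    # keep word characters and periods, drop everything else (such as $ and ,)
    stripped = re.sub(r'[^\w.]', '', str(raw))
    try:
        return int(stripped)
    except ValueError:
        # non-numeric leftovers (for example 'None') fall back to 0
        return 0

samples = ['$200', '$1,400', 'None']
converted = [to_int_value(s) for s in samples]
print(converted)  # [200, 1400, 0]
```

Because the pattern keeps periods, a string such as `'$1.50'` would not parse as an integer and would also come back as 0.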


In [9]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
# jeopardy[(jeopardy['clean_value'] == 0) & (jeopardy['Value'].notnull())]

In [10]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   ShowNumber      216930 non-null  int64 
 1   AirDate         216930 non-null  object
 2   Round           216930 non-null  object
 3   Category        216930 non-null  object
 4   Value           216930 non-null  object
 5   Question        216930 non-null  object
 6   Answer          216928 non-null  object
 7   clean_question  216930 non-null  object
 8   clean_answer    216930 non-null  object
 9   clean_value     216930 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 16.6+ MB


In [11]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [12]:
def word_match(df):
    '''Function takes row in jeopardy dataframe. Returns ratio of words in 
    answer that match words in question.
    '''
    # split strings into lists
    split_answer = df['clean_answer'].split()
    split_question = df['clean_question'].split()

    # remove the word 'the' since it occurs often & is meaningless for this exercise
    split_answer = [word for word in split_answer if word != 'the']

    # to avoid division by zero, return 0 if length of list is zero
    if len(split_answer) == 0:
        return 0

    # counts words in answer that exist in question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1

    # return ratio of words in answer that match words in question
    return match_count / len(split_answer)

In [13]:
jeopardy['answer_in_question'] = jeopardy.apply(word_match, axis=1)

In [14]:
jeopardy['answer_in_question'].mean()

0.05732619088707737

## Recycled questions
On average, only about 5.7% of the words in an answer (ignoring "the") also appear in its question. That is a small share, so just hearing a question will rarely be enough to work out the answer.
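
To make the ratio concrete, here is a minimal sketch of the same calculation on a single pair of strings. The function name `answer_word_ratio` and both sample strings are made up for illustration; they are not rows from the dataset.

```python
def answer_word_ratio(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        # nothing left to compare, so avoid dividing by zero
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# made-up strings, already lowercased and stripped of punctuation
question = 'this system of units is based on powers of ten'
answer = 'the metric system'
ratio = answer_word_ratio(question, answer)
print(ratio)  # 0.5
```

After dropping "the", the answer has two words and only "system" occurs in the question, which gives one half.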

In [15]:
jeopardy.sort_values('AirDate', inplace=True)

In [16]:
question_overlap = []
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    # keep only words longer than 5 characters to skip common filler words
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        # count the word if it has been seen before, then record it
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    # convert the count to a share of the question's long words
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [17]:
jeopardy['question_overlap'] = question_overlap

In [18]:
jeopardy['question_overlap'].mean()

0.8726690023368967

## Low value vs high value questions
On average, about 87% of the terms longer than five characters in a question have already appeared in an earlier question. Because this compares single terms rather than phrases, the figure on its own says little about whole questions being reused, but it does suggest that the recycling of questions is worth a closer look.
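
One possible way to look at phrases rather than single terms is to compare pairs of adjacent words (bigrams). The sketch below is only an illustration on made-up questions; the helper names `bigrams` and `phrase_overlap` are hypothetical, and unlike the loop above it records a question's bigrams only after scoring that question.

```python
def bigrams(text):
    """Return the adjacent word pairs of a cleaned question."""
    words = text.split()
    return list(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question (in order), share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        grams = bigrams(question)
        matches = sum(1 for gram in grams if gram in seen)
        seen.update(grams)
        overlaps.append(matches / len(grams) if grams else 0)
    return overlaps

# made-up questions, assumed to be sorted by air date
toy_questions = [
    'this river flows through egypt',
    'this river flows through paris',
    'capital city of france',
]
overlaps = phrase_overlap(toy_questions)
print(overlaps)  # [0.0, 0.75, 0.0]
```

The second question shares three of its four bigrams with the first, while the third shares none, which is the kind of distinction a single-term comparison cannot make.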

In [19]:
jeopardy

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
84523,1,1984-09-10,Jeopardy!,LAKES & RIVERS,$100,River mentioned most often in the Bible,the Jordan,river mentioned most often in the bible,the jordan,100,0.000000,0.000000
84565,1,1984-09-10,Double Jeopardy!,THE BIBLE,$1000,"According to 1st Timothy, it is the ""root of a...",the love of money,according to 1st timothy it is the root of all...,the love of money,1000,0.333333,0.000000
84566,1,1984-09-10,Double Jeopardy!,'50'S TV,$1000,Name under which experimenter Don Herbert taug...,Mr. Wizard,name under which experimenter don herbert taug...,mr wizard,1000,0.000000,0.000000
84567,1,1984-09-10,Double Jeopardy!,NATIONAL LANDMARKS,$1000,D.C. building shaken by November '83 bomb blast,the Capitol,dc building shaken by november 83 bomb blast,the capitol,1000,0.000000,0.000000
84568,1,1984-09-10,Double Jeopardy!,NOTORIOUS,$1000,"After the deed, he leaped to the stage shoutin...",John Wilkes Booth,after the deed he leaped to the stage shouting...,john wilkes booth,1000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
105947,6300,2012-01-27,Jeopardy!,VISITING THE CITY,$800,There's a great opera house on Bennelong Point...,Sydney,theres a great opera house on bennelong point ...,sydney,800,0.000000,1.000000
105948,6300,2012-01-27,Jeopardy!,PANTS,"$1,400",Tight-fitting pants patterned after those worn...,toreador pants,tightfitting pants patterned after those worn ...,toreador pants,1400,0.500000,1.000000
105949,6300,2012-01-27,Jeopardy!,CHILD ACTORS,$800,"This kid, with a familiar last name, is seen <...",Jaden Smith,this kid with a familiar last name is seen a h...,jaden smith,800,0.000000,0.500000
105951,6300,2012-01-27,Jeopardy!,LESSER-KNOWN SCIENTISTS,$800,Joseph Lagrange insisted on 10 as the basic un...,the metric system,joseph lagrange insisted on 10 as the basic un...,the metric system,800,0.500000,0.777778


In [20]:
def high_low(df):
    '''Takes a row of the jeopardy dataframe as input. Returns 1 if
    clean_value is greater than 800, otherwise 0.
    '''
    if df['clean_value'] > 800:
        return 1
    else:
        return 0

In [21]:
jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)

In [22]:
def high_low_count(word):
    '''Counts the questions that contain word, split by value. Returns a
    (high_count, low_count) tuple.
    '''
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
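
The function above scans every row of the dataframe once per term, which gets slow when many terms are compared. A possible alternative, sketched here on made-up rows, is to tally all terms in a single pass with `collections.Counter`. The function name `count_terms_by_value` and the toy data are assumptions for illustration.

```python
from collections import Counter

def count_terms_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs.

    Returns two Counters mapping each term to the number of high-value
    and low-value questions that contain it.
    """
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # a set counts each term once per question, like `word in split_question`
        terms = set(clean_question.split())
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

# made-up rows: (clean_question, high_value)
toy_rows = [
    ('water freezes at zero degrees celsius', 1),
    ('convert celsius to fahrenheit', 0),
    ('celsius proposed this scale', 0),
]
high_counts, low_counts = count_terms_by_value(toy_rows)
print(high_counts['celsius'], low_counts['celsius'])  # 1 2
```

On the real data, the rows could presumably be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`, after which any term's pair of counts is a dictionary lookup.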

In [23]:
import random

# random.sample needs a sequence; passing a set raises TypeError from Python 3.11 on
comparison_terms = random.sample(list(terms_used), 10)
comparison_terms

['weatherrelated',
 'pastures',
 'hypocrisy',
 'winterthur',
 'steptoe',
 'waterloving',
 'goodbyeing',
 'celsius',
 'fuguer',
 'target_blankinteriora']

In [24]:
observed_expected = []

for i in comparison_terms:
    observed_expected.append(high_low_count(i))

observed_expected

[(0, 1),
 (2, 5),
 (1, 3),
 (0, 1),
 (1, 2),
 (0, 1),
 (0, 1),
 (11, 19),
 (0, 1),
 (1, 0)]

In [25]:
jeopardy.size

2820090

In [26]:
jeopardy.shape

(216930, 13)

In [27]:
jeopardy[jeopardy['high_value'] == 1].shape[0]

61422

In [28]:
from scipy.stats import chisquare
import numpy as np

In [29]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

In [31]:
chi_squared = []

for obs_high_count, obs_low_count in observed_expected:
    total = obs_high_count + obs_low_count
    total_prop = total / jeopardy.shape[0]

    exp_high_count = total_prop * high_value_count
    exp_low_count = total_prop * low_value_count

    observed = np.array([obs_high_count, obs_low_count])
    expected = np.array([exp_high_count, exp_low_count])

    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.00022818639557405952, pvalue=0.9879477420204993),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.03723409388907139, pvalue=0.846989214486915),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=1.031129032701023, pvalue=0.30989363782327795),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]

## Chi-squared results
Since all of the p-values are greater than 0.05, none of the terms showed a significant difference in usage between high-value and low-value rows. Additionally, most of the frequencies were lower than 5, so the chi-squared test is less reliable here. It would be better to run this test only on terms that have higher frequencies.
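
The arithmetic behind one of these tests, and the concern about small frequencies, can be sketched with the standard library alone. The row totals below are the ones printed earlier in the notebook. The function name `chi_squared_check` is hypothetical, and the cutoff of 5 expected observations per cell is a common rule of thumb assumed here, not something the notebook computes. With two cells there is one degree of freedom, so the p-value can be written with `math.erfc`.

```python
import math

TOTAL_ROWS = 216930                # jeopardy.shape[0]
HIGH_ROWS = 61422                  # rows with high_value == 1
LOW_ROWS = TOTAL_ROWS - HIGH_ROWS  # rows with high_value == 0

def chi_squared_check(obs_high, obs_low, min_expected=5):
    """Return (statistic, p_value, reliable) for one term's observed counts."""
    total = obs_high + obs_low
    # expected counts if the term were spread evenly across all rows
    exp_high = total * HIGH_ROWS / TOTAL_ROWS
    exp_low = total * LOW_ROWS / TOTAL_ROWS
    stat = ((obs_high - exp_high) ** 2 / exp_high
            + (obs_low - exp_low) ** 2 / exp_low)
    # upper-tail probability of a chi-squared variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    # assumed rule of thumb: every expected count should be at least min_expected
    reliable = min(exp_high, exp_low) >= min_expected
    return stat, p_value, reliable

for counts in [(11, 19), (0, 1)]:
    stat, p_value, reliable = chi_squared_check(*counts)
    print(counts, round(stat, 4), round(p_value, 4), reliable)
```

For the counts (11, 19) and (0, 1) this should reproduce the statistics of roughly 1.03 and 0.39 listed in the output above. Only the first pair has expected counts above the assumed threshold.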