# Guided Project: Winning Jeopardy

## 1. Jeopardy questions
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. 
Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions. Each row represents a single question on a single episode of Jeopardy. Here are explanations of each column:

<li>Show Number -- the Jeopardy episode number of the show this question was in.
<li>Air Date -- the date the episode aired.
<li>Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
<li>Category -- the category of the question.
<li>Value -- the number of dollars answering the question correctly is worth.
<li>Question -- the text of the question.
<li>Answer -- the text of the answer.

In [1]:
#Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#Read the dataset into a Dataframe called jeopardy using Pandas.
jeopardy=pd.read_csv('jeopardy.csv')

#Print out the first 5 rows of jeopardy.
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
#Print out the columns of jeopardy using jeopardy.columns.
print(jeopardy.shape)
print(jeopardy.columns)
print(jeopardy.info())

(19999, 7)
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB
None


In [4]:
#Fix the column names: drop the leading spaces, use underscores between words, and assign the list back to jeopardy.columns.
jeopardy.columns=['Show_Number','Air_Date','Round','Category','Value','Question','Answer']
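
The same names can also be derived from the originals rather than typed by hand. A minimal sketch, assuming the only problems are leading spaces and spaces between words; `raw_columns` is a hypothetical stand-in for `list(jeopardy.columns)`:

```python
# Hypothetical stand-in for list(jeopardy.columns), copied from the output above.
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# Strip surrounding whitespace, then replace inner spaces with underscores.
clean_columns = [name.strip().replace(' ', '_') for name in raw_columns]
print(clean_columns)
```

The resulting list could then be assigned to `jeopardy.columns` in the same way as the hardcoded one.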

## 2. Normalizing Text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.
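
As a small illustration of why this matters, the sketch below normalizes two spellings of the same word using only the standard library. The helper name `strip_punctuation` is hypothetical, and it deletes punctuation outright, which is one of several reasonable choices (the cells that follow replace punctuation with spaces instead).

```python
import string

def strip_punctuation(text):
    # Lowercase, then delete every punctuation character.
    # The third argument of str.maketrans lists characters to remove.
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table)

# Both spellings normalize to the same token.
print(strip_punctuation("Don't") == strip_punctuation("don't"))
print(strip_punctuation("Don't"))
```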

In [5]:
import string
#replace_punctuation = str.maketrans(string.punctuation,' '*len(string.punctuation))
print(string.punctuation)
#print(replace_punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [6]:
#Write a function to normalize questions and answers: lowercase the string and replace punctuation with spaces.
import string
def normalize_text(text):
    #Convert the string to lowercase.
    text_lowered=text.lower()
    #Build a translation table that maps every punctuation character to a space.
    replace_punctuation = str.maketrans(string.punctuation,' '*len(string.punctuation))
    #Apply the table to get a copy of the string, so "mcdonald's" becomes "mcdonald s".
    text_final = text_lowered.translate(replace_punctuation)
    return text_final

#Normalize the Question column.
jeopardy['clean_question']=jeopardy['Question'].apply(normalize_text)

#Normalize the Answer column.
jeopardy['clean_answer']=jeopardy['Answer'].apply(normalize_text)

#Check Normalization
jeopardy.head()

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams


## 3. Normalizing Columns

Now that you've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.
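
One way to do the dollar conversion, sketched with the standard library only. The helper name `parse_dollars` is hypothetical, and it assumes values look like `$200` or `$2,000`, with anything else treated as 0:

```python
def parse_dollars(value):
    # Drop the dollar sign and any thousands separators before converting.
    digits = value.replace('$', '').replace(',', '').strip()
    try:
        return int(digits)
    except ValueError:
        # Assumption: unparseable values (for example 'None') count as 0.
        return 0

print(parse_dollars('$200'))
print(parse_dollars('$2,000'))
```

Deleting the comma, rather than replacing it with a space, is what lets `int` accept values of 1000 and above.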

In [7]:
#Write a function to normalize dollar values: replace the punctuation and convert the string to an integer.
def normalize_values(value):
    #Build a translation table that maps every punctuation character to a space.
    replace_punctuation = str.maketrans(string.punctuation,' '*len(string.punctuation))
    #Apply the table, so '$200' becomes ' 200'; int() ignores the surrounding whitespace.
    value = value.translate(replace_punctuation)
    try:
        value=int(value)
    except ValueError:
        #Anything int() cannot parse falls back to 0. This includes values with a
        #thousands separator: '$2,000' becomes ' 2 000', which int() rejects.
        value=0
    return value

#Normalize the Value column.
jeopardy['clean_value']=jeopardy['Value'].apply(normalize_values)
    
#Check conversion
jeopardy.head()

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200


In [8]:
#Use the pandas.to_datetime function to convert the Air Date column to a datetime column.
jeopardy['Air_Date']=pd.to_datetime(jeopardy['Air_Date'])

#Check results
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show_Number       19999 non-null int64
Air_Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## 4. Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

<li>How often the answer is deducible from the question.
<li>How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (6 or more characters) recur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.
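
To make the first measurement concrete, here is a toy version on a single, made-up question and answer pair. The function name `answer_overlap` and the example strings are hypothetical; it is a sketch of the idea, not the notebook's exact function.

```python
def answer_overlap(clean_answer, clean_question):
    # Fraction of answer words that also appear in the question.
    answer_words = [w for w in clean_answer.split() if w != 'the']
    question_words = set(clean_question.split())
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# 'adams' appears in the question, 'john' does not, so half the answer matches.
print(answer_overlap('john adams', 'this adams was the second president'))
```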

In [9]:
#Write a function that takes in a row in jeopardy, as a Series, and returns the fraction of answer words that also appear in the question.
def counter(row):
    split_answer=row['clean_answer'].split(" ")
    split_question=row['clean_question'].split(" ")
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    match_count=0
    for item in split_answer:
        if item in split_question:
            match_count +=1
    return match_count/len(split_answer)        

In [10]:
#Count how many times terms in clean_answer occur in clean_question.
jeopardy['answer_in_question']=jeopardy.apply(counter,axis=1)

In [11]:
#Find the mean of the answer_in_question column using the mean method on Series.
mean=jeopardy['answer_in_question'].mean()
mean

0.0954962720550924

Q. Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy:

A. On average, only about 9.5% of the words in an answer also appear in its question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.


## 5. Recycled questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

<li>Sort jeopardy in order of ascending air date.
<li>Maintain a set called terms_used that will be empty initially.
<li>Iterate through each row of jeopardy.
<li>Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
<li>If it does, increment a counter.
<li>Add each word to terms_used.

This will enable you to check if the terms in questions have been used previously or not. Only looking at words of 6 or more characters enables you to filter out words like the and than, which are commonly used, but don't tell you a lot about a question.
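
The steps above can be sketched on a handful of made-up questions using only the standard library. The function name `overlap_per_question` and the sample questions are hypothetical, and the list is assumed to already be in ascending air date order:

```python
def overlap_per_question(questions, min_length=6):
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only words with at least min_length characters.
        words = [w for w in question.split() if len(w) >= min_length]
        # Count the words that appeared in an earlier question.
        seen_before = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(seen_before / len(words) if words else 0)
    return overlaps

# Made-up questions, assumed to be sorted by air date.
questions = [
    'this planet is the largest',
    'the largest planet has this famous storm',
    'a cat',
]
print(overlap_per_question(questions))
```

The first question has no earlier terms to match, the second reuses two of its three long words, and the third has no long words at all.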

In [12]:
#Create an empty list called question_overlap.
question_overlap=[]

#Create an empty set called terms_used.
terms_used=set()

#Use the iterrows DataFrame method to loop through each row of jeopardy.
#Note: jeopardy is not sorted by Air_Date in this notebook, so rows are visited in file order.
for i, row in jeopardy.iterrows():
    split_question=row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count=0
    for item in split_question:
        if item in terms_used:
            match_count +=1
    for item in split_question:
        terms_used.add(item)
    if len(split_question)>0:
        match_count/=len(split_question)
    question_overlap.append(match_count)

In [13]:
#Assign question_overlap to the question_overlap column of jeopardy.
jeopardy['question_overlap']=question_overlap

In [14]:
#Find the mean of the question_overlap column and print it.
mean2=jeopardy['question_overlap'].mean()
mean2

0.7266867360208153

Q. Look at the value, and think about what this might mean for questions being recycled. Write up your thoughts in a markdown cell.

A. There is about 73% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it looks at single terms rather than phrases. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

## 6. Low value vs High value questions

Let's say you only want to study terms that appear in high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

<li>Low value -- Any row where Value is 800 or less.
<li>High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms from the last section, terms_used, and:

<li>Find the number of low value questions the word occurs in.
<li>Find the number of high value questions the word occurs in.
<li>Find the percentage of questions the word occurs in.
<li>Based on the percentage of questions the word occurs in, find expected counts.
<li>Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
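
The expected counts and the chi-squared value for one term can be worked out by hand. A minimal sketch in plain Python, where the function name `chi_squared_for_term` and all of the counts are made up for illustration:

```python
def chi_squared_for_term(high_count, low_count, high_total, low_total):
    # Share of all questions that contain the term.
    proportion = (high_count + low_count) / (high_total + low_total)
    # Counts expected if the term were used equally often in both groups.
    expected_high = proportion * high_total
    expected_low = proportion * low_total
    # Sum of (observed - expected)^2 / expected over the two groups.
    return ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)

# Made-up counts: the term appears in 30 of 1000 high value questions
# and in 20 of 3000 low value questions.
print(chi_squared_for_term(30, 20, 1000, 3000))
```

Here the expected counts are 12.5 and 37.5, so a term that shows up 30 times in high value questions produces a large statistic.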

In [15]:
#Create a function that takes in a row from a DataFrame and returns 1 if clean_value is above 800 (high value), 0 otherwise.
def determine_value(row):
    if row['clean_value']>800:
        value=1
    else:
        value=0
    return value

In [16]:
#Determine which questions are high and low value.
jeopardy['high_value']=jeopardy.apply(determine_value,axis=1)

In [17]:
#Create a function that takes in a word and returns (high_count, low_count): the number of high and low value questions it occurs in.
def count_usage(word):
    low_count=0
    high_count=0
    for i, row in jeopardy.iterrows():
        split_question=row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value']==1:
                high_count +=1
            else:
                low_count +=1
    return (high_count,low_count)

In [18]:
#Create an empty list called observed.
observed=[]

#Convert terms_used into a list using the list function, and assign the first 5 elements to comparison_terms.
comparison_terms=list(terms_used)[:5]
print(comparison_terms)

#Loop through each term in comparison_terms
for i in comparison_terms:
    value=count_usage(i)
    observed.append(value)
    
print(observed)

['combination', 'coriander', 'dominos', 'deceived', 'indian']
[(1, 5), (0, 1), (0, 1), (0, 1), (13, 31)]
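
Calling `count_usage` once per term scans the whole DataFrame each time. One possible alternative, sketched here on made-up data with the standard library, counts every term in a single pass. `count_all_terms` and `rows` are hypothetical names; each row is a `(clean_question, high_value)` pair.

```python
from collections import Counter

def count_all_terms(rows):
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # Use a set so a word repeated in one question is counted once.
        words = set(clean_question.split())
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Made-up rows for illustration.
rows = [
    ('this indian city is a capital', 1),
    ('an indian spice like coriander', 0),
    ('coriander is also called this', 0),
]
high_counts, low_counts = count_all_terms(rows)
print(high_counts['indian'], low_counts['indian'])
print(high_counts['coriander'], low_counts['coriander'])
```

A `Counter` returns 0 for words it has never seen, so looking up a term that is missing from one group needs no special handling.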


## 7. Applying the chi-squared test

In [19]:
#Find the number of rows in jeopardy where high_value is 1, and assign to high_value_total_count.
high_value_total_count=jeopardy[jeopardy['high_value']==1].shape[0]
print(high_value_total_count)

#Find the number of rows in jeopardy where high_value is 0, and assign to low_value_total_count.
low_value_total_count=jeopardy[jeopardy['high_value']==0].shape[0]
print(low_value_total_count)

4972
15027


In [20]:
#Create empty lists for the chi-squared statistics and p-values.
chi_squared=[]
p_values=[]

import numpy as np
from scipy.stats import chisquare

#Loop through each (high_count, low_count) tuple in observed.
for obs in observed:
    total=sum(obs)
    total_proportion=total/jeopardy.shape[0]
    expected_high_value=total_proportion*high_value_total_count
    expected_low_value=total_proportion*low_value_total_count
    
    #Use separate names so the observed list being looped over is not overwritten.
    observed_counts=np.array([obs[0],obs[1]])
    expected_counts=np.array([expected_high_value,expected_low_value])
    chisq_value,pvalue=chisquare(observed_counts,expected_counts)
    chi_squared.append(chisq_value)
    p_values.append(pvalue)
    
print('chi_squared')
print(chi_squared)
print('p_values')
print(p_values)

chi_squared
[0.21568374788971523, 0.3308710986890265, 0.3308710986890265, 0.3308710986890265, 0.516819414527022]
p_values
[0.6423485514836023, 0.565146603267378, 0.565146603267378, 0.565146603267378, 0.4722015916853084]


None of the terms had a significant difference in usage between high value and low value rows (every p-value is well above 0.05). Additionally, four of the five terms had frequencies lower than 5, so the chi-squared test isn't as valid for them; only 'indian', with counts of 13 and 31, is frequent enough. It would be better to run this test with only terms that have higher frequencies.
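
One way to restrict the test to frequent terms is to filter on expected counts first. A common rule of thumb asks for expected counts of at least 5 in every cell; the sketch below applies that threshold, as an assumption, to the five terms and totals printed above. The helper name `expected_counts` is hypothetical.

```python
# Observed (high_count, low_count) pairs and group totals from the cells above.
terms = ['combination', 'coriander', 'dominos', 'deceived', 'indian']
observed = [(1, 5), (0, 1), (0, 1), (0, 1), (13, 31)]
high_total, low_total = 4972, 15027

def expected_counts(high_count, low_count):
    # Expected counts if the term were spread evenly across both groups.
    proportion = (high_count + low_count) / (high_total + low_total)
    return proportion * high_total, proportion * low_total

# Assumption: keep a term only if both expected counts are at least 5.
frequent_terms = [
    term for term, obs in zip(terms, observed)
    if min(expected_counts(*obs)) >= 5
]
print(frequent_terms)
```

Only 'indian' passes, with expected counts of roughly 10.9 and 33.1.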

In [21]:
jeopardy.shape[0]

19999