# Guided Project: Winning Jeopardy

## Jeopardy questions

For this project, I will work with a dataset of Jeopardy questions to look for patterns that could give me an edge when I actually compete on Jeopardy. Let's read in the data and take a first look at it.

In [1]:
import pandas as pd
import re 

jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.columns)
# Several column names carry a leading space, so rename them all explicitly.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.head()

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


## Normalizing Text and Columns

Before any analysis can be done, I need to normalize the text columns (Question and Answer): I will lowercase every word and remove the punctuation. I will also normalize the 'Value' column by removing the dollar sign and converting it from text to a number, which makes it easier to manipulate. Lastly, I will convert the 'Air Date' column from a string to a datetime.

In [2]:
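def normalize_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    return text

def normalize_values(text):
    # Strip punctuation such as the dollar sign before converting to an integer.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        text = int(text)
    except Exception:
        # Anything that is not a clean integer is treated as 0.
        text = 0
    return text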

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text) 
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
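
As a quick sanity check on the two helpers above, the sketch below re-declares them so it runs on its own and applies them to a few sample strings. The sample inputs are hypothetical, chosen only to exercise each branch; they are not taken from specific rows of the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    return re.sub(r'[^A-Za-z0-9\s]', '', text.lower())

def normalize_values(text):
    # Strip punctuation such as "$" and ",", then fall back to 0 if no integer remains.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        return int(text)
    except ValueError:
        return 0

# Hypothetical sample inputs (assumptions for illustration only).
sample_question = 'In 1963, live on "The Art Linkletter Show"'
clean_sample = normalize_text(sample_question)
value_small = normalize_values('$200')
value_large = normalize_values('$2,000')
value_missing = normalize_values('None')

print(clean_sample)
print(value_small, value_large, value_missing)
```

Any value that is not a clean integer after stripping punctuation becomes 0, so it may be worth checking how many rows end up with a 'clean_value' of 0 before relying on the column.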

## Answers in Questions

Next, I want to decide whether to study past questions, study general knowledge, or not study at all. Two checks will help:

   1) See how often the answer is deducible from the question.

   2) See how often new questions are repeats of old questions.

For the first check, I will count how many words in the answer also occur in the question. For the second, I will see how often complex words recur.

In [6]:
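def jeopardy_row(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # 'the' is so common that matching on it says nothing about the answer.
    if 'the' in split_answer:
        split_answer.remove('the')
    # Guard against dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)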
        
jeopardy['answer_in_the_question'] = jeopardy.apply(jeopardy_row, axis = 1)
jeopardy['answer_in_the_question'].mean()

0.06049325706933587
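
To make the metric concrete, here is a minimal standalone sketch of the same calculation on a single question and answer pair. The function name `answer_overlap` and the sample strings are hypothetical; the logic is meant to mirror `jeopardy_row` above.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that appear in the question."""
    split_answer = clean_answer.split(' ')
    split_question = clean_question.split(' ')
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Made-up, already normalized example (an assumption, not a row from the dataset).
question = 'this sea between israel and jordan is the lowest point on earth'
answer = 'the dead sea'
overlap = answer_overlap(question, answer)
only_the = answer_overlap(question, 'the')

print(overlap, only_the)
```

Here one of the two remaining answer words ("sea") appears in the question, so the score is 0.5. The mean of about 0.06 reported above says that, on average, only around 6% of an answer's words show up in its question.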

## Recycled Questions

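Now I want to investigate how often new questions repeat old ones. I only have 10% of the full dataset, so I cannot answer this conclusively, but I can at least see what the data suggests. I will go through the questions in order of air date and, for each one, compute the share of its words longer than five characters that have already appeared in an earlier question.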

In [15]:
question_overlap = []
terms_used = set()
# Process questions chronologically so "already seen" means "aired earlier".
jeopardy = jeopardy.sort_values('Air Date')
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # Keep only words longer than five characters to skip filler such as "the".
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    # Convert the count to a fraction of the question's long words.
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6876947803264011
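
The loop above can be hard to follow on a full DataFrame, so here is a small sketch of the same idea on three made-up questions. The helper name `question_overlaps` and the toy questions are hypothetical; plain strings stand in for the 'clean_question' column.

```python
def question_overlaps(clean_questions):
    """For each question, in order, return the share of its long words seen earlier."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        # Same filter as the notebook: keep words longer than five characters.
        words = [w for w in question.split(' ') if len(w) > 5]
        match_count = sum(1 for w in words if w in terms_seen)
        terms_seen.update(words)
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps

# Toy questions assumed to be already normalized and sorted by air date.
toy_questions = [
    'this planet is closest to the sun',
    'the largest planet in our solar system',
    'it is the red one',
]
overlaps = question_overlaps(toy_questions)
print(overlaps)
```

The second question reuses "planet" out of its three long words, so its overlap is one third. Because the check works on single words rather than phrases, a high overlap does not by itself mean that whole questions are being recycled.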

## Low value vs high value questions

Suppose I only want to study questions that pertain to high-value clues rather than low-value ones. I can use a chi-squared test to check whether particular terms show up disproportionately in high-value questions. I will define high and low as follows:

   1) Low value: Any row where 'clean_value' is 800 or less

   2) High value: Any row where 'clean_value' is greater than 800

I will loop through terms in terms_used (only the first five here) and count how many high-value and low-value questions contain each one.

In [19]:
def question_classifier(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
jeopardy['high_value'] = jeopardy.apply(question_classifier, axis = 1)


def count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else: 
                low_count += 1
    return high_count, low_count

# Sets are unordered, so these five terms can differ from one run to the next.
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count(term))

observed_expected


[(0, 1), (1, 2), (0, 1), (5, 14), (1, 0)]
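
The `count` function rescans every row for each term, which gets slow once more than a handful of terms are compared. The sketch below shows one possible single-pass alternative on toy data. The name `count_terms` and the toy rows are hypothetical, and this is only an illustration of the idea, not the notebook's method.

```python
def count_terms(rows, terms):
    """rows: iterable of (clean_question, high_value) pairs.

    Returns {term: (high_count, low_count)}, counting each question once per term.
    """
    counts = {term: [0, 0] for term in terms}
    for clean_question, high_value in rows:
        words = set(clean_question.split(' '))
        # Only look at the terms that actually occur in this question.
        for term in words.intersection(counts):
            index = 0 if high_value == 1 else 1
            counts[term][index] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Toy rows (assumptions): a normalized question and its high_value flag.
toy_rows = [
    ('which planet has rings', 1),
    ('this planet is red', 0),
    ('name this river', 0),
]
term_counts = count_terms(toy_rows, ['planet', 'river'])
print(term_counts)
```

With the real data, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`, which makes one pass over the table regardless of how many terms are compared.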

## Applying the chi-squared test

In [32]:
import numpy as np
from scipy.stats import chisquare

# Split the rows by class; the two row counts are the totals behind the expected values.
high_value_count = jeopardy[jeopardy['high_value'] == 1]
low_value_count = jeopardy[jeopardy['high_value'] == 0]
print(high_value_count.shape[0], low_value_count.shape[0])

chi_squared = []
for row in observed_expected:
    # Share of all questions that contain this term.
    total = sum(row)
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term occurred in proportion to the size of each class.
    high = total_prop * high_value_count.shape[0]
    low = total_prop * low_value_count.shape[0]
    observed = np.array([row[0], row[1]])
    expected = np.array([high, low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

5734 14265


[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.05155372477690176, pvalue=0.8203814000640732),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
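
As a cross-check on the output above, the statistic can be recomputed by hand using only the standard library. The sketch below assumes the class totals printed earlier (5734 high-value and 14265 low-value rows) and uses the fact that, with one degree of freedom, the chi-squared p-value equals erfc(sqrt(x / 2)). The helper name `chi_squared_two_bins` is hypothetical.

```python
import math

def chi_squared_two_bins(observed, totals):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    n_observed = sum(observed)
    n_total = sum(totals)
    # Expected counts split the term's occurrences in proportion to the class sizes.
    expected = [n_observed * t / n_total for t in totals]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

totals = (5734, 14265)  # high-value and low-value row counts printed above
stat_first, p_first = chi_squared_two_bins((0, 1), totals)
stat_last, p_last = chi_squared_two_bins((1, 0), totals)

print(stat_first, p_first)
print(stat_last, p_last)
```

These should agree with the first and last results in the list above. None of the five p-values falls below 0.05, so none of these terms shows a significant difference between high-value and low-value questions. The counts are also very small, and the chi-squared approximation is generally considered unreliable with expected counts this low, so terms with higher frequencies would likely make for a more meaningful test.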