# Winning Jeopardy: applying a chi-squared test

In [14]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [15]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [23]:
jeopardy.columns = jeopardy.columns.str.strip()

In [24]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [17]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalising text

Before analysing the `Question` and `Answer` columns, normalise their text: lowercase the words and remove punctuation so that `Don't` and `don't` aren't considered to be different words when you compare them.

In [26]:
import re

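def normalize_text(text):
    # Lowercase first so that `Don't` and `don't` compare equal.
    text = text.lower()
    # Drop everything except letters, digits and whitespace (the raw string
    # avoids an invalid-escape warning for \s).
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text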

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
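
As a quick sanity check, the normalisation can be tried on a few strings in isolation. This is a minimal sketch that redefines the function in compact form so it runs on its own; the sample strings are chosen for illustration.

```python
import re

def normalize_text(text):
    """Lowercase text and drop everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

# Illustrative inputs: two spellings of the same word, plus two answers
# that contain punctuation.
samples = ["Don't", "don't", "McDonald's", "Bamm-Bamm"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)  # ['dont', 'dont', 'mcdonalds', 'bammbamm']
```

Note that removing a hyphen joins the two halves of a word (`Bamm-Bamm` becomes `bammbamm`), which can affect later word-level matching.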


## Normalising Columns

The `Value` column should also be numeric, to allow you to manipulate it more easily.

The `Air Date` column should also be a datetime, not a string, to enable you to work with it more easily.

In [35]:
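def normalize_int(text):
    # Strip punctuation such as `$` and `,` before converting.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Values that are not numeric after stripping become 0.
        text = 0
    return text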

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_int)


jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])


jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [37]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
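
The same conversion could probably also be done without a row-wise function. The following is a sketch only, using a small hypothetical `values` Series rather than the real `Value` column: it keeps the digits and lets `pd.to_numeric` turn anything unparseable into `NaN`, which is then replaced by 0.

```python
import pandas as pd

# Hypothetical sample; the real column may contain other formats.
values = pd.Series(["$200", "$1,000", "None"])

# Keep digits only, then convert; strings with no digits become NaN.
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean_value = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean_value.tolist())  # [200, 1000, 0]
```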

## Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

You can answer the first question by seeing how many times words in the answer also occur in the question.

In [89]:
def splittext(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    split_answer = [word for word in split_answer if word.lower() != 'the']
    if len(split_answer) == 0:
        return 0
    for text in split_answer:
        if text in split_question:
            match_count += 1
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(splittext, axis=1)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19279,10,1984-09-21,Jeopardy!,SPORTS,$100,What Gary Player plays professionailly,golf,what gary player plays professionailly,golf,100,0.0,0.0,0
19280,10,1984-09-21,Jeopardy!,GEOGRAPHY,$200,Dutch is still an official language in what is...,Dutch Guiana,dutch is still an official language in what is...,dutch guiana,200,0.5,0.0,0
19286,10,1984-09-21,Jeopardy!,DOUBLE TALK,$300,Adopted baby of Barney & Betty Rubble,Bamm-Bamm,adopted baby of barney betty rubble,bammbamm,300,0.0,0.0,0
19285,10,1984-09-21,Jeopardy!,GEOGRAPHY,$300,"8th most populous country in the world, this ""...",Bangladesh,8th most populous country in the world this be...,bangladesh,300,0.0,0.0,0


In [48]:
jeopardy['answer_in_question'].mean()

0.06049325706933587

### Answer terms in the question

On average, only about 6% of the words in an answer also appear in its question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
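
To make the metric concrete, here is a small standalone sketch of the same calculation on one question/answer pair. The function name `answer_overlap` and the question text are made up for illustration, and it uses `split()` rather than `split(" ")`, so repeated spaces are ignored.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical question: one of the two answer words appears in it.
score = answer_overlap("dutch is still an official language here", "dutch guiana")
print(score)  # 0.5
```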

## Recycled questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

* Sort jeopardy in order of ascending air date.
* Maintain a set called terms_used that will be empty initially.
* Iterate through each row of jeopardy.
* Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    * If it does, increment a counter.
    * Add each word to terms_used.

This will enable you to check whether the terms in questions have been used previously. Keeping only words of 6 or more characters filters out words like `the` and `than`, which are commonly used but don't tell you a lot about a question.

In [68]:
question_overlap = []

terms_used = set()

jeopardy = jeopardy.sort_values(by='Air Date')

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words that already appeared in earlier questions.
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Only then record this question's words. Adding a word to terms_used
    # before checking it makes every check succeed, which inflates the overlap.
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.6876709054165026

### Question overlap

There is about 69% overlap between the terms (of six or more characters) in new questions and the terms in earlier questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so the figure is weak evidence on its own. It does mean that it's worth looking more into the recycling of questions.
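
The order of the membership check and the set update matters in this calculation: if a word is added to `terms_used` before it is checked, every word counts as a repeat. The sketch below shows both orders on two invented questions; the function name `overlap_scores` and its `add_first` flag are illustrative only.

```python
def overlap_scores(questions, add_first=False):
    """Per-question share of long words (6+ characters) seen in earlier questions."""
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) > 5]
        if add_first:
            # Wrong order: the words are recorded before they are checked.
            terms_used.update(words)
        matches = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        scores.append(matches / len(words) if words else 0)
    return scores

# Hypothetical questions, already normalised and in air-date order.
questions = ["famous italian astronomer", "italian painter famous for frescoes"]

correct = overlap_scores(questions)
inflated = overlap_scores(questions, add_first=True)
print(correct)   # [0.0, 0.5]
print(inflated)  # [1.0, 1.0]
```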

## Low value vs high value questions

Suppose you only want to study terms that appear in high-value questions rather than low-value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

* Low value -- Any row where `clean_value` is `800` or less.
* High value -- Any row where `clean_value` is greater than `800`.

You'll then be able to loop through each of the terms collected in `terms_used` in the previous section, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
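
Written out, the calculation for a single word looks roughly as follows. The symbols are introduced here for illustration and are not part of the original notebook: $N$ is the total number of questions, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high- and low-value questions, $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts of questions containing the word, and $n_w = O_{\text{high}} + O_{\text{low}}$.

$$
E_{\text{high}} = n_w \cdot \frac{N_{\text{high}}}{N}, \qquad
E_{\text{low}} = n_w \cdot \frac{N_{\text{low}}}{N}, \qquad
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

With two categories, the statistic has one degree of freedom.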

In [70]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
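
The same flag could likely be computed without a row-wise `apply` by comparing the whole column at once. A minimal sketch on a hypothetical `clean_value` Series, which also shows that a value of exactly 800 lands in the low-value group:

```python
import pandas as pd

# Hypothetical sample of already-cleaned values.
clean_value = pd.Series([200, 800, 1000, 0])

# True/False comparison converted to 1/0, matching determine_value's output.
high_value = (clean_value > 800).astype(int)

print(high_value.tolist())  # [0, 0, 1, 0]
```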


In [78]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
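# Sets are unordered, so this picks an arbitrary five terms;
# the selection can differ between runs.
comparison_terms = list(terms_used)[:5]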

for item in comparison_terms:
    observed_expected.append(count_usage(item))
    
# observed counts of each term in high_val and low_val questions
observed_expected


[(0, 1), (0, 1), (1, 2), (0, 1), (2, 2)]

## Applying the chi-squared test

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.

In [85]:
# expected value = (col total * row total) / sum total 
# row total
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

print(high_value_count)
print(low_value_count)

5734
14265


In [91]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for obs in observed_expected:
    # column total
    total = sum(obs)
    
    # expected value = (col total * row total) / sum total 
    high_value_exp = (high_value_count * total)/jeopardy.shape[0]
    low_value_exp = (low_value_count * total)/jeopardy.shape[0] 

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared    

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468)]

### Chi-squared results

None of the five terms had a significant difference in usage between high-value and low-value rows: every p-value is well above 0.05. Additionally, the observed and expected frequencies were all lower than 5, so the chi-squared test isn't reliable here. It would be better to run this test with only terms that have higher frequencies.
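
To see where these numbers come from, the statistic for the term observed as `(2, 2)` can be recomputed by hand. This is a sketch using only the standard library; the function name `chi_squared_1df` is illustrative. It relies on the fact that, for one degree of freedom, the chi-squared p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

# Group sizes taken from the notebook output above.
HIGH_VALUE_COUNT = 5734
LOW_VALUE_COUNT = 14265
TOTAL = HIGH_VALUE_COUNT + LOW_VALUE_COUNT

def chi_squared_1df(high_obs, low_obs):
    """Chi-squared statistic, p-value (1 degree of freedom) and expected counts."""
    n = high_obs + low_obs
    expected = (n * HIGH_VALUE_COUNT / TOTAL, n * LOW_VALUE_COUNT / TOTAL)
    observed = (high_obs, low_obs)
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected

statistic, p_value, expected = chi_squared_1df(2, 2)
print(round(statistic, 4), round(p_value, 4))  # 0.8898 0.3455
print(min(expected) < 5)  # True: the expected counts are too small
```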

## Next steps

That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

* Find a better way to eliminate non-informative words than just removing words that are less than `6` characters long. Some ideas:
    * Manually create a list of words to remove, like `the`, `than`, etc.
    * Find a list of stopwords to remove.
    * Remove words that occur in more than a certain percentage (like `5%`) of questions.
* Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    * Use the apply method to make the code that calculates frequencies more efficient.
    * Only select terms that have high frequencies across the dataset, and ignore the others.
* Look more into the `Category` column and see if any interesting analysis can be done with it. Some ideas:
    * See which categories appear the most often.
    * Find the probability of each category appearing in each round.
* Use the whole Jeopardy dataset instead of the subset used in this analysis.
* Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
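
As one possible starting point for the speed-up and the frequency filter, the counts for every term can be gathered in a single pass instead of scanning all rows once per word. This is a sketch under assumptions: the function name `term_counts`, the sample rows and the threshold of 2 are all made up for illustration.

```python
from collections import Counter

def term_counts(rows, min_length=6):
    """Count, per term, how many high- and low-value questions contain it.

    rows: iterable of (clean_question, high_value) pairs.
    """
    high, low = Counter(), Counter()
    for question, high_value in rows:
        # A set counts each word once per question, like count_usage does.
        words = {w for w in question.split() if len(w) >= min_length}
        (high if high_value == 1 else low).update(words)
    return high, low

# Hypothetical rows standing in for the real data.
rows = [
    ("famous italian astronomer", 1),
    ("italian painter famous for frescoes", 0),
    ("famous french painter", 0),
]
high, low = term_counts(rows)

# Keep only terms that occur in at least 2 questions overall.
frequent = sorted(w for w in set(high) | set(low) if high[w] + low[w] >= 2)
print(frequent)                       # ['famous', 'italian', 'painter']
print(high["famous"], low["famous"])  # 1 2
```

On the real DataFrame, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`, with a much higher frequency threshold.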
