# Winning Jeopardy!

Jeopardy is a popular US TV quiz show in which contestants answer questions to win money. It has been running for decades and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named "jeopardy.csv" and contains about 20,000 rows (19,999 questions, according to the summary further down) taken from the beginning of a full dataset of Jeopardy questions.

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the dollar amount a correct answer to the question is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
print(jeopardy.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
jeopardy.columns=jeopardy.columns.str.replace(" ", "")
print(jeopardy.columns)

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [4]:
jeopardy.describe(include="all")

         ShowNumber     AirDate      Round    Category  Value      Question  Answer
count       19999.0       19999      19999       19999  19999         19999   19999
unique          NaN         336          4        3581     76         19988   14963
top             NaN  2007-11-13  Jeopardy!  TELEVISION   $400  [audio clue]   Japan
freq            NaN          62       9901          51   3892             5      22
mean    4312.730537         NaN        NaN         NaN    NaN           NaN     NaN
std     1374.121672         NaN        NaN         NaN    NaN           NaN     NaN
min            10.0         NaN        NaN         NaN    NaN           NaN     NaN
25%          3393.0         NaN        NaN         NaN    NaN           NaN     NaN
50%          4582.0         NaN        NaN         NaN    NaN           NaN     NaN
75%          5431.0         NaN        NaN         NaN    NaN           NaN     NaN


In [5]:
print(jeopardy.dtypes)

ShowNumber     int64
AirDate       object
Round         object
Category      object
Value         object
Question      object
Answer        object
dtype: object


## Normalizing Text

Before I start analyzing the Jeopardy questions, I need to normalize the text columns (Question and Answer) by lowercasing the words and removing the punctuation.

In [6]:
import string

def normalize(text):
    # Lowercase the text, then delete every ASCII punctuation character.
    lowered = text.lower()
    return lowered.translate(str.maketrans('', '', string.punctuation))

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)

In [7]:
print(jeopardy["clean_question"].head(10))

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
5    in the title of an aesop fable this insect sha...
6    built in 312 bc to link rome  the south of ita...
7    no 8 30 steals for the birmingham barons 2306 ...
8    in the winter of 197172 a record 1122 inches o...
9    this housewares store was named for the packag...
Name: clean_question, dtype: object


In [8]:
print(jeopardy["clean_answer"].head(10))

0        copernicus
1        jim thorpe
2           arizona
3         mcdonalds
4        john adams
5           the ant
6    the appian way
7    michael jordan
8        washington
9     crate  barrel
Name: clean_answer, dtype: object
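
Removing punctuation can leave doubled spaces behind, as in "crate  barrel" above. A minimal sketch of a variant that also collapses runs of whitespace follows; the name `normalize_collapsed` is hypothetical and not part of the original analysis.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_collapsed(text):
    """Lowercase, strip punctuation, and collapse repeated whitespace."""
    stripped = text.lower().translate(_PUNCT_TABLE)
    # Replace any run of whitespace with one space and trim the ends.
    return re.sub(r"\s+", " ", stripped).strip()

print(normalize_collapsed("Crate & Barrel"))
```

Whether the extra step matters depends on how the text is split later: `str.split()` with no argument already ignores repeated spaces, while `str.split(" ")` produces empty strings.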


## Normalizing Other Columns

The Value column should be numeric, so I need to remove the dollar sign from the beginning of each value and convert the column from text to numbers.

The Air Date column should be a datetime, not a string.

In [9]:
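import math

def norm_dollars(value):
    # Strip "$" and any other punctuation, then parse what is left as a number.
    digits = value.translate(str.maketrans('', '', string.punctuation))
    amount = pd.to_numeric(digits, errors='coerce', downcast='integer')
    # Entries that cannot be parsed come back as NaN; store them as 0.
    if math.isnan(amount):
        amount = 0
    return amount

jeopardy["clean_value"] = jeopardy["Value"].apply(norm_dollars)
print(jeopardy["clean_value"].head(10))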

0    200
1    200
2    200
3    200
4    200
5    200
6    400
7    400
8    400
9    400
Name: clean_value, dtype: int64
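
The same conversion can also be written without a row-by-row `apply`. The sketch below is one possible vectorized version, assuming that every non-digit character can be discarded; the function name `clean_values` and the sample values are made up for illustration.

```python
import pandas as pd

def clean_values(values):
    """Vectorized sketch: keep digits only, parse, and map failures to 0."""
    # Drop everything that is not a digit ("$", ",", letters, ...).
    digits = values.astype(str).str.replace(r"\D", "", regex=True)
    # Strings with no digits left cannot be parsed and become NaN, then 0.
    return pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

sample = pd.Series(["$200", "$1,000", "None"])  # made-up sample values
print(clean_values(sample).tolist())
```

On the real data this would be called as `clean_values(jeopardy["Value"])`.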


In [10]:
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])
print(jeopardy["AirDate"].head(10))

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
5   2004-12-31
6   2004-12-31
7   2004-12-31
8   2004-12-31
9   2004-12-31
Name: AirDate, dtype: datetime64[ns]


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (six or more characters) reoccur.

You can answer the first question by seeing how many times words in the answer also occur in the question. 

I'll work on the first question now, and come back to the second.

In [11]:
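def answer_from_question(row):
    # Compare the words of the answer with the words of the question.
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    # "the" is too common to be informative, so drop its first occurrence.
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left in the answer.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(answer_from_question, axis=1)
mean_answer_in_question = jeopardy["answer_in_question"].mean()
print(mean_answer_in_question)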

0.058861482035140716


The mean is about 0.06: on average, only about 6% of the words in an answer also appear in its question. The answer can therefore rarely be deduced from the question alone.
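
To make the metric concrete, here is a small stand-alone sketch of the same idea applied to two strings. The function name `answer_overlap` and the example strings are made up (they are not rows from the dataset), and unlike the cell above it drops every "the" from the answer rather than only the first.

```python
def answer_overlap(question, answer):
    """Fraction of answer words (ignoring "the") that also appear in the question."""
    answer_words = [w for w in answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized examples.
print(answer_overlap("this city hosts the new york marathon", "new york city"))
print(answer_overlap("this insect is known for hard work", "the ant"))
```

A score near 1 means the answer is spelled out in the question; a score of 0 means none of the answer's words appear there.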

## Recycled Questions

Let's see how often new questions are repeats of older ones. 

In [12]:
question_overlap=[]
terms_used = set()
jeopardy=jeopardy.sort_values(by="AirDate")
print(jeopardy["AirDate"].head(10))

19325   1984-09-21
19301   1984-09-21
19302   1984-09-21
19303   1984-09-21
19304   1984-09-21
19305   1984-09-21
19306   1984-09-21
19307   1984-09-21
19308   1984-09-21
19309   1984-09-21
Name: AirDate, dtype: datetime64[ns]


In [13]:
for index, row in jeopardy.iterrows():
    # Keep only words of six or more characters.
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) >= 6]
    match_count = 0
    for word in split_question:
        # A word counts as recycled if it has been seen before.
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    # Convert the count into a proportion of the question's long words.
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap
mean_question_overlap = jeopardy["question_overlap"].mean()
print(mean_question_overlap)

0.6889055316620328


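On average, about 70% of the longer words in a question have already appeared in an earlier question. This measures the overlap of individual words, not of whole questions, so it does not mean that entire questions are copied, but the point is worth investigating further.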
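
The overlap measure can be tried on a handful of strings to see how it behaves. This is a minimal sketch that mirrors the loop above; the name `overlap_ratios`, the `min_length` parameter, and the toy questions are assumptions made for illustration.

```python
def overlap_ratios(questions, min_length=6):
    """For each question, in order, return the share of its long words
    that were already seen."""
    seen = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        repeated = 0
        for word in words:
            if word in seen:
                repeated += 1
            seen.add(word)
        ratios.append(repeated / len(words) if words else 0)
    return ratios

# Made-up questions, assumed to be sorted from oldest to newest.
toy = ["the ancient capital of egypt",
       "this ancient empire ruled egypt",
       "short words only"]
print(overlap_ratios(toy))
```

The first question can never score above zero unless it repeats a word within itself, which is why sorting by air date before the loop matters.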

## Low Value vs High Value Questions

Let's say you only want to study terms that show up in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

I can actually figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

I'll then be able to loop through the terms collected in the previous section, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

I can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.
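
The steps above can be sketched by hand for a single word. The function name `chi_squared_for_word` and the counts in the example are hypothetical; the sketch assumes the word occurs at least once, since otherwise the expected counts are zero.

```python
def chi_squared_for_word(high_obs, low_obs, n_high, n_low):
    """Chi-squared statistic for one word, from observed and expected counts."""
    total_questions = n_high + n_low
    # Share of all questions that contain the word.
    word_share = (high_obs + low_obs) / total_questions
    # Counts expected if the word were equally likely in both groups.
    expected_high = word_share * n_high
    expected_low = word_share * n_low
    return ((high_obs - expected_high) ** 2 / expected_high
            + (low_obs - expected_low) ** 2 / expected_low)

# Hypothetical data: 30 high-value and 70 low-value questions,
# with the word found in 3 of the former and 1 of the latter.
print(chi_squared_for_word(3, 1, 30, 70))
```

A larger statistic means the word's split between the two groups is further from what its overall frequency would predict.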

In [14]:
def class_value(row):
    if row["clean_value"]>800:
        return 1
    else:
        return 0

jeopardy["high_value"]=jeopardy.apply(class_value, axis=1)
print(jeopardy[["clean_value", "high_value"]].tail())

      clean_value  high_value
1953          800           0
1954          800           0
1955          800           0
1945          400           0
1922          400           0


In [15]:
def high_low_count(word):
    low_count=0
    high_count=0
    for index, row in jeopardy.iterrows():
        split_question=row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"]==1:
                high_count+=1
            else:
                low_count+=1
    return high_count, low_count
high, low = high_low_count("housewares")
print(high, " ", low)

0   2


In [16]:
import random

comparison_terms=[]
for i in range(10):
    comparison_terms.append(random.choice(tuple(terms_used)))
observed_expected=[]
for word in comparison_terms:
    tuple_high_low = high_low_count(word)
    observed_expected.append(tuple_high_low)
print(observed_expected)
print(comparison_terms)

[(0, 2), (1, 0), (1, 1), (1, 0), (0, 1), (0, 1), (1, 3), (1, 0), (2, 8), (0, 1)]
['planetary', 'corridors', 'dreamt', 'ambroses', 'outgrossing', 'corcoran', 'heights', 'bosnian', 'honorary', 'freuds']
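
Because `high_low_count` rescans every row for each word, checking many terms this way is slow. One possible alternative, sketched below, counts all words in a single pass; the name `count_all_terms` and the toy inputs are assumptions made for illustration.

```python
from collections import defaultdict

def count_all_terms(questions, high_flags):
    """Return {word: [high_count, low_count]} in one pass over the questions."""
    counts = defaultdict(lambda: [0, 0])
    for question, is_high in zip(questions, high_flags):
        # set() counts a word once per question, like `word in split_question`.
        for word in set(question.split(" ")):
            counts[word][0 if is_high else 1] += 1
    return counts

# Made-up, already-normalized questions with their high_value flags.
toy_questions = ["ancient capital of egypt", "ancient empire", "capital city"]
toy_flags = [1, 0, 0]
toy_counts = count_all_terms(toy_questions, toy_flags)
print(toy_counts["ancient"], toy_counts["egypt"])
```

With the real data, a call such as `count_all_terms(jeopardy["clean_question"], jeopardy["high_value"])` should give the same high and low counts per word as the function above.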


## Applying the Chi-Squared Test

I've found the observed counts for a few terms. I can now compute the expected counts and the chi-squared value.

In [17]:
high_value_count=len(jeopardy[jeopardy["high_value"]==1])
print(high_value_count)

5734


In [18]:
low_value_count=len(jeopardy[jeopardy["high_value"]==0])
print(low_value_count)

14265


In [19]:
from scipy.stats import chisquare

chi_squared = []
for tuple_val in observed_expected:
    # Share of all questions that contain the word.
    total = tuple_val[0] + tuple_val[1]
    total_prop = total / len(jeopardy)
    # Counts expected if the word were equally likely in both groups.
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    result = chisquare(tuple_val, (expected_high, expected_low))
    chi_squared.append(result)
print(chi_squared)

[Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]


None of these results is statistically significant (every p-value is above 0.05), and in any case these words occur very rarely.
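
A common rule of thumb is that the chi-squared test is only reliable when each expected count is at least 5. One way to act on this is to test only terms frequent enough to meet that threshold. The sketch below is an assumption about how that filter could look; the name `frequent_enough` and the `min_expected` parameter are hypothetical.

```python
def frequent_enough(observed, n_high, n_low, min_expected=5):
    """Keep (high, low) pairs whose expected counts both reach min_expected."""
    total_questions = n_high + n_low
    kept = []
    for high_obs, low_obs in observed:
        occurrences = high_obs + low_obs
        expected_high = occurrences * n_high / total_questions
        expected_low = occurrences * n_low / total_questions
        if min(expected_high, expected_low) >= min_expected:
            kept.append((high_obs, low_obs))
    return kept

# (0, 2) and (2, 8) are pairs printed earlier; (10, 20) is a made-up frequent term.
print(frequent_enough([(0, 2), (2, 8), (10, 20)], 5734, 14265))
```

With the group sizes found above, a term needs to appear in roughly 18 or more questions before its expected high-value count reaches 5.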