# Winning Jeopardy

<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/5/51/Jeopardy%21_game_board_US.svg" alt="Jeopardy! game board" title="Jeopardy! game board" width="400"/>
</p>

This project is part of a guided project available on [Dataquest.io](https://dataquest.io). *Jeopardy!* is an American television game show created by Merv Griffin. The show features a quiz competition in which contestants are presented with general knowledge clues in the form of answers, and must phrase their responses in the form of questions.<sup>[^1]</sup>

In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win. The dataset contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

**Dataset Description**
Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer


[^1]: https://en.wikipedia.org/wiki/Jeopardy!

## 1. Exploring the Dataset

In [1]:
# importing packages
import pandas as pd
import re
from scipy.stats import chisquare
import numpy as np

In [2]:
# reading the csv as pandas DataFrame
jeopardy = pd.read_csv("jeopardy.csv", parse_dates=[" Air Date"]) # setting the air date as datetime object

# printing first 5 rows
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [3]:
# printing only the columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# function to clean and standardize column names
def clean_col(col):
    col = col.strip()
    col = col.replace(" ", "_")
    col = col.lower()
    return col

# cleaning the columns
jeopardy_cols = []

for c in jeopardy.columns:
    clean_header = clean_col(c)
    jeopardy_cols.append(clean_header)

jeopardy.columns = jeopardy_cols

jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
show_number    19999 non-null int64
air_date       19999 non-null datetime64[ns]
round          19999 non-null object
category       19999 non-null object
value          19999 non-null object
question       19999 non-null object
answer         19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


## 2. Normalizing the Data
In this step we normalize the text in the `question` and `answer` columns and convert the dollar amounts in the `value` column to integers.

### 2.1 Normalizing the Text

In [6]:
# function to normalize text: lowercase, strip punctuation, collapse whitespace
def normalize_text(text):
    text = text.lower()
    # raw strings avoid invalid-escape warnings for \s
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)
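
As a quick check of what the normalization does, the sketch below applies the same steps to two strings taken from the rows shown earlier. It redefines `normalize_text` only so that the snippet runs on its own.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Sample strings copied from the dataset preview above.
answer_sample = "McDonald's"
question_sample = 'In 1963, live on "The Art Linkletter Show"'

clean_answer_sample = normalize_text(answer_sample)
clean_question_sample = normalize_text(question_sample)

print(clean_answer_sample)    # mcdonalds
print(clean_question_sample)  # in 1963 live on the art linkletter show
```

Note that removing the apostrophe merges `McDonald's` into a single token, `mcdonalds`, which matches the `clean_answer` value in the output table.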

### 2.2 Normalizing the Values

In [7]:
# function to normalize dollar values: strip symbols and convert to int
def normalize_values(value):
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        # non-numeric entries are mapped to 0
        value = 0
    return value

jeopardy["clean_value"] = jeopardy["value"].apply(normalize_values)

jeopardy.head()
    

| | show_number | air_date | round | category | value | question | answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
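
The value conversion can be sanity-checked on a few raw strings. In this sketch only `"$200"` comes from the rows shown above; `"$2,000"` and `"None"` are hypothetical inputs chosen to illustrate the thousands separator and the fallback branch.

```python
import re

def normalize_values(value):
    """Strip non-alphanumeric characters and convert to int (0 if not numeric)."""
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        return int(value)
    except ValueError:
        return 0

# "$2,000" and "None" are assumed inputs, not values confirmed in the dataset.
samples = ["$200", "$2,000", "None"]
cleaned = [normalize_values(s) for s in samples]
print(cleaned)  # [200, 2000, 0]
```

One consequence of the fallback is that any non-numeric value becomes 0, so such rows would be counted as low value in any later comparison based on `clean_value`.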


## 3. Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1. How often the answer can be deduced from the question.
2. How often questions are repeated.

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many of the words in the answer also occur in the question.
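
To make the first measure concrete, here is a minimal sketch that computes the fraction of answer words that also appear in the question for a single pair of normalized strings. The helper name `answer_overlap` is hypothetical, and unlike the notebook's row-based function it drops every occurrence of "the" from the answer, which is an assumption made for simplicity.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Normalized versions of a question/answer pair from the dataset.
full_match = answer_overlap(
    "its the first verb in the pledge of allegiance", "pledge"
)
# Hypothetical pair where the answer does not appear in the question.
no_match = answer_overlap("the city of yuma in this state", "arizona")

print(full_match, no_match)  # 1.0 0.0
```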

### 3.1 How Often Can the Answer Be Deduced from the Question?

In [8]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [9]:
jeopardy["answer_in_question"].mean() * 100

5.900196524977774

In [10]:
jeopardy[["question","answer", "answer_in_question"]][jeopardy["answer_in_question"] > 0].sort_values("answer_in_question", ascending=False).head()

| | question | answer | answer_in_question |
|---|---|---|---|
| 19994 | Of 8, 12 or 18, the number of U.S. states that... | 18 | 1.0 |
| 7056 | Robin Leach, Robin Givens, Robin Cook | Robin Givens | 1.0 |
| 10745 | (Hi, I'm Richie McDonald) They say this is goo... | "No News" | 1.0 |
| 7382 | It's the only letter in "piano" that correspon... | A | 1.0 |
| 10951 | It's the first verb in the Pledge of Allegiance | pledge | 1.0 |


On average, only about 5.9% of the words in an answer also appear in its question, so the answer can rarely be read off the question itself. Trying to guess answers from the question text is therefore not a viable strategy. Let's check the other strategy: how often questions are repeated.

### 3.2 How Often Are Questions Repeated
Let's say you want to investigate how often new questions are repeats of older ones.
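
Before running this on the full dataset, the idea can be sketched on a toy list of questions processed in chronological order. The function name `term_overlap` and the three sample questions are hypothetical; the word-length threshold of 6 mirrors the `len(q) > 5` filter used in the cell below.

```python
def term_overlap(questions, min_length=6):
    """For each question, the share of its long words already seen earlier."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            overlaps.append(0.0)
        # Register this question's words only after scoring it.
        seen.update(words)
    return overlaps

# Hypothetical, already-normalized questions in air-date order.
toy_questions = [
    "this planet is closest to the sun",
    "the largest planet in the solar system",
    "this british author wrote hamlet",
]
toy_overlaps = term_overlap(toy_questions)
print(toy_overlaps)
```

Only the second question reuses an earlier long word (`planet`, one of its three long words), so it is the only one with a non-zero overlap.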

In [11]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("air_date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169776

There is about 69% overlap between terms in new questions and terms in older questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so the figure on its own is relatively insignificant. It does suggest, however, that the recycling of questions is worth a closer look.

## 4. Low Value vs High Value Questions

Let's say we want to study only questions with high value (above 800 USD). A first step is to check whether particular terms are used more often in high-value questions than in low-value ones.

In [12]:
def high_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

In [13]:
jeopardy["high_value"] = jeopardy.apply(high_value, axis=1)

In [14]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [15]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(1, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (1, 0)]

## 5. Applying Chi-squared Test
Now that we have the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
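
As a worked illustration of the calculation, the sketch below computes the expected counts and the chi-squared statistic by hand for one term. The function name `chi_squared_two_groups` and all counts in the example are hypothetical. The p-value line relies on the fact that, with two categories (one degree of freedom), the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_total, low_total):
    """Chi-squared statistic and p-value for one term across two groups."""
    n = high_total + low_total
    term_total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to group size.
    expected_high = term_total * high_total / n
    expected_low = term_total * low_total / n
    statistic = (
        (observed_high - expected_high) ** 2 / expected_high
        + (observed_low - expected_low) ** 2 / expected_low
    )
    # Valid for one degree of freedom only.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: 30 high-value and 70 low-value questions,
# with the term appearing in 8 high-value and 4 low-value ones.
statistic, p_value = chi_squared_two_groups(8, 4, high_total=30, low_total=70)
print(statistic, p_value)
```

Here the expected counts are 3.6 and 8.4, so a term seen 8 times in high-value questions is over-represented there. `scipy.stats.chisquare` computes the same statistic when given the observed and expected arrays.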

In [16]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    exp_high_value = total_prop * high_value_count
    exp_low_value = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high_value, exp_low_value])

    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

None of the terms showed a statistically significant difference in usage between high-value and low-value rows: every p-value is well above 0.05. In addition, all of the observed frequencies were lower than 5, and the chi-squared test is unreliable at such low counts. It would be better to rerun the test using only terms that have higher frequencies.
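
One possible way to follow up, sketched here under assumptions, is to count every term in a single pass and keep only those that occur often enough. The function name `frequent_term_counts`, the `min_total=5` threshold, and the toy rows are all hypothetical. Each row is a `(clean_question, high_value)` pair.

```python
from collections import Counter

def frequent_term_counts(rows, min_total=5, min_length=6):
    """Map each frequent term to its (high_count, low_count) pair."""
    high = Counter()
    low = Counter()
    for clean_question, is_high in rows:
        # A set counts each term once per question.
        terms = {w for w in clean_question.split() if len(w) >= min_length}
        (high if is_high else low).update(terms)
    return {
        term: (high[term], low[term])
        for term in set(high) | set(low)
        if high[term] + low[term] >= min_total
    }

# Hypothetical rows: (normalized question text, 1 if high value else 0).
toy_rows = (
    [("this planet orbits the sun", 1)] * 3
    + [("the largest planet", 0)] * 2
    + [("a famous author", 0)]
)
frequent = frequent_term_counts(toy_rows)
print(frequent)  # {'planet': (3, 2)}
```

The resulting pairs have the same `(high_count, low_count)` shape as the observed counts used above, so they could be passed through the same expected-count calculation.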