# Guided Project: Winning Jeopardy

## Introduction

The goal of this project is to look for patterns in past questions from the TV game show Jeopardy that could help us win.

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

## The Data

For this project, we will work with a dataset of past Jeopardy questions. The dataset is named `jeopardy.csv` and contains roughly 20,000 rows (19,999 entries, according to the `info()` output further down) taken from the beginning of a full dataset of Jeopardy questions.

Let's read this data set in!

In [1]:
import pandas as pd
import numpy as np

jeopardy = pd.read_csv("jeopardy.csv")

We start by printing the first 5 rows of the dataset to get familiar with it.

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


We also print some information about the dataframe to learn more about each column.

In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


We can also print the column names to check their formatting.

In [161]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

From this initial exploration of the data, we can see that:
* Most column names start with an extra space character that will need to be stripped out.
* The `Air Date` column values are stored as string objects. They will need to be converted to datetime objects.
* No column contains null values.
* The `Value` column values are stored as string objects and will probably need to be converted to a numeric type to make them easier to work with.
* Both `Question` and `Answer` columns will likely need to be re-formatted to be usable in our analysis.

In the next section, we address some of these cleaning tasks.

## Data Cleaning

### Cleaning the Column Names

We start by getting rid of the space character that is present in front of most column names.

In [163]:
jeopardy.rename(columns=lambda x: x.strip(), inplace=True)
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalization of the Text Columns

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). 
We are going to lowercase words and remove punctuation so `Don't` and `don't` aren't considered to be different words when we compare them.

In the following block of code, we write a function to perform this normalization.

In [164]:
import re

def normalize(text):
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # drop punctuation
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text

We then apply this function to the `Question` and `Answer` columns.

In [165]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)
print(jeopardy["clean_question"].head(3))
print('\n')
print(jeopardy["clean_answer"].head(3))

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
Name: clean_question, dtype: object


0    copernicus
1    jim thorpe
2       arizona
Name: clean_answer, dtype: object
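
As a quick check of what the normalization does, the sketch below repeats the `normalize` function (so that the snippet runs on its own) and applies it to a few short strings. The sample strings are illustrative and are not taken row for row from the dataset.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Illustrative inputs (assumed for this example)
samples = ["Don't", "don't", "McDonald's", "No. 2:  1912 Olympian"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

Both spellings of `Don't` collapse to the same token, which is what the later word comparisons rely on.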


### Normalization of the `Value` column

The `Value` column should also be numeric, to enable us to manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric type.

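In the following block of code, we reuse the `normalize` function to strip the dollar sign (and any commas) from each item in the `Value` column, then convert the result with `pd.to_numeric`. With `errors='coerce'`, any value that cannot be parsed as a number becomes `NaN`, which would explain why the resulting column is stored as `float64` rather than as an integer type.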

In [166]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize)
jeopardy["clean_value"] = pd.to_numeric(jeopardy["clean_value"], errors='coerce', downcast='integer')
print(jeopardy["clean_value"].head(3))

0    200.0
1    200.0
2    200.0
Name: clean_value, dtype: float64
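
To illustrate the conversion on a small scale, here is a minimal sketch that runs the same two steps on three made-up values. The `values` series is an assumption for the example: it includes one value with a comma and one non-numeric entry to show how `errors="coerce"` treats it.

```python
import re
import pandas as pd

def normalize(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

# Made-up sample values, not read from jeopardy.csv
values = pd.Series(["$200", "$2,000", "None"])

# Strip the punctuation first, then coerce anything non-numeric to NaN
clean = pd.to_numeric(values.apply(normalize), errors="coerce")
print(clean)
```

The first two entries become the numbers 200 and 2000, while the non-numeric entry becomes a missing value instead of raising an error.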


### Conversion of the `Air Date` column to a datetime column

The `Air Date` column should also be a datetime, not a string, to be able to work with it more easily.

In [167]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value              float64
dtype: object

Now that we have cleaned the data, we can start analyzing it. In the next section, we consider different strategies and analyze the past questions and answers to look for patterns that could help us win.

## Which strategy to adopt?

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (six or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

### How often is the answer deducible from the question?

In the next block of code, we write a function that takes in a row in `jeopardy` and calculates the proportion of words from the answer that are present in the question.

In [168]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

We then apply this function to the `jeopardy` dataset.

In [169]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].value_counts().sort_index(ascending=False)

1.000000      124
0.875000        1
0.800000        2
0.750000       17
0.666667      104
0.600000        9
0.571429        2
0.500000     1448
0.444444        1
0.428571        2
0.400000       26
0.350000        1
0.333333      494
0.300000        2
0.285714        7
0.250000      155
0.200000       68
0.181818        2
0.166667       27
0.142857       21
0.125000        9
0.111111        2
0.000000    17475
Name: answer_in_question, dtype: int64

As we can see, most of the rows (17,475 of them) feature answers that have no words in common with their associated question.
At the other extreme, we have 124 rows where all the words in the answer are present in the question.

Let's print one example of each case.

In [170]:
# Example of a case in which all the words in the answer are present in the question
print('Question: ',jeopardy[jeopardy["answer_in_question"] == 1].iloc[0]["clean_question"])
print('Answer: ',jeopardy[jeopardy["answer_in_question"] == 1].iloc[0]["clean_answer"])

Question:  ljubljana bratislava barcelona
Answer:  barcelona


In [171]:
# Example of a case in which no words from the answer are present in the question
print('Question: ',jeopardy[jeopardy["answer_in_question"] == 0].iloc[0]["clean_question"])
print('Answer: ',jeopardy[jeopardy["answer_in_question"] == 0].iloc[0]["clean_answer"])

Question:  for the last 8 years of his life galileo was under house arrest for espousing this mans theory
Answer:  copernicus


The question in the first case is not really a question but a list of choices that contains the answer.
In the second case, the question is better formulated and indeed does not contain the answer.
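
To make the scores concrete, the following standalone sketch reproduces the matching logic of `count_matches` on three question and answer pairs. The helper name `answer_overlap` is hypothetical. The first and last pairs are the two examples printed above, and the middle pair is made up to show a partial match.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of the answer's words that also appear in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if "the" in answer_words:
        # list.remove drops only the first occurrence, as in count_matches
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

examples = [
    ("ljubljana bratislava barcelona", "barcelona"),
    # Made-up pair: only "united" is shared once "the" is removed
    ("this united nations agency is based in geneva", "the united states"),
    ("for the last 8 years of his life galileo was under house arrest "
     "for espousing this mans theory", "copernicus"),
]
scores = [answer_overlap(q, a) for q, a in examples]
print(scores)
```

The three pairs score 1.0, 0.5 and 0.0 respectively, which covers the two extremes and one intermediate value from the table of counts.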

Let's now calculate the mean proportion of the new `answer_in_question` column.

In [172]:
jeopardy["answer_in_question"].mean()

0.05900196524977763

We can see that, on average, around 5.9% of an answer's words are present in the question. This is a small share, so we can't expect to deduce the answer from the question.

Let's now investigate the second question we had.

### How often are new questions repeats of older ones?

We now want to investigate how often new questions are repeats of older ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least explore it.

To try tackling that question, we can:
* Sort `jeopardy` in order of ascending air date.
* Maintain a set called `terms_used` that will be empty initially.
* Iterate through each row of `jeopardy`.
* Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    * If it does, increment a counter.
    * Add each word to `terms_used`.

This will enable us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.

In [173]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
            
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
print(jeopardy["question_overlap"].mean())           

0.6894031359073245


We can see that, on average, about 70% of the terms in a new question already appeared in previous questions. 
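
The procedure is easier to follow on a toy input. The sketch below wraps the same loop in a hypothetical helper, `term_overlap`, and runs it on three made-up questions that are assumed to be already sorted by air date.

```python
def term_overlap(clean_questions, min_length=6):
    """For each question, return the share of its long words that were
    already seen earlier in the sequence."""
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_length]
        seen = 0
        for word in words:
            if word in terms_used:
                seen += 1
            terms_used.add(word)
        # Questions without any long word get an overlap of 0
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

# Made-up questions, in chronological order
toy_questions = [
    "galileo espoused this theory",
    "this theory explains gravity",
    "galileo studied gravity",
]
overlaps = term_overlap(toy_questions)
print(overlaps)
```

The first question has no previously seen terms, the second reuses one of its three long words, and the third reuses two of three. This also shows why the measure tends to grow as more questions are processed: `terms_used` only ever gets larger.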

This analysis only covers a small share of all Jeopardy questions, and it compares single terms rather than phrases, so the result is far from conclusive. It does suggest, however, that the recycling of questions is worth a closer look.

In the next section, we try to figure out a way to identify the high value questions to focus on.

### What are the most promising topics to focus on?

As we look at previous Jeopardy questions, it would be smart to focus on high value questions/topics that will help us earn more money.

We can figure out which terms correspond to high-value questions by using a chi-squared test to check for statistical significance. We'll first need to split the questions into two categories:
* Low value - Any row where `clean_value` is `800` or less (rows without a numeric value also end up here)
* High value - Any row where `clean_value` is greater than `800`

We'll then be able to loop through each of the terms from `terms_used` and:
* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find the expected counts.
* Compute the chi squared value based on the expected and observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all the words would take a very long time, so we'll just do it for a small sample now.
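
As an illustration of the arithmetic behind these steps, the sketch below computes the chi-squared statistic for a single term by hand. The function name `chi_squared_for_term` and all the counts are made up for the example and do not come from the dataset.

```python
def chi_squared_for_term(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic comparing a term's observed high/low counts
    with the counts expected if the term were unrelated to question value."""
    n_questions = high_total + low_total
    # Share of all questions that contain the term
    term_share = (high_obs + low_obs) / n_questions
    expected = [term_share * high_total, term_share * low_total]
    observed = [high_obs, low_obs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts: 300 high value and 700 low value questions,
# with the term appearing in 10 questions of each group.
stat = chi_squared_for_term(10, 10, high_total=300, low_total=700)
print(round(stat, 4))
```

With these numbers the term occurs in 2% of questions, so the expected counts are 6 and 14, and the statistic is 16/6 + 16/14, or about 3.81.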

We start by creating a function that categorizes questions based on the value criteria we previously introduced (more than $800, or $800 and below).

In [174]:
def categorize_value(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

We then apply this function to each row in `jeopardy` and return the result in a new column.

In [175]:
jeopardy["high_value"] = jeopardy.apply(categorize_value, axis=1)
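
The same flag can also be computed without a row-wise `apply`. The sketch below shows one possible vectorized alternative on a made-up `toy` dataframe (an assumption for the example, not the `jeopardy` data).

```python
import numpy as np
import pandas as pd

# Made-up values; NaN stands in for a value that could not be parsed
toy = pd.DataFrame({"clean_value": [200.0, 800.0, 1000.0, np.nan]})

# A comparison with NaN evaluates to False, so unparsed values fall into
# the low value group, the same outcome as the row-wise function
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy)
```

Only the 1000 row is flagged as high value; 800 itself and the missing value both end up in the low value group.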

Next, we create another function that returns the number of high value questions and the number of low value questions in which a given word appears.

In [176]:
def count_high_low(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        clean_question = row["clean_question"].split(" ")
        if word in clean_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

To test this function, we randomly pick ten elements of `terms_used` and append them to a list called `comparison_terms`.

In [177]:
import random
comparison_terms = random.sample(list(terms_used), 10)  # random.sample requires a sequence; sets are rejected since Python 3.11
print(comparison_terms)

['politicalsounding', 'xlibris', 'hrefhttpwwwjarchivecommedia20100706dj28jpg', 'plenty', 'eddies', 'committees', 'nephrite', 'lesley', 'dynamite', 'formeda']


We then iterate through those terms and run the `count_high_low` function for each of them. The returned `high_count` and `low_count` values will be appended to a list called `observed_expected`.

In [178]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_high_low(term))

Let's check the `observed_expected` list!

In [179]:
print(observed_expected)

[(0, 1), (1, 0), (1, 0), (1, 6), (1, 0), (1, 0), (1, 0), (1, 1), (0, 3), (1, 0)]


We can see that among the 10 words randomly picked from `terms_used`, most appear in a single question, and only three appear in more than one.

Now that we've found the observed counts for a few terms, we can compute their expected counts and associated chi-squared values, to test for a statistically significant difference in usage.

In [180]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for element in observed_expected:
    total = sum(element)
    total_prop = total / len(jeopardy)
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    
    observed = np.array([element[0], element[1]])
    expected = np.array([expected_high_count, expected_low_count])
    chi_squared.append(chisquare(observed, expected))
    
    

In [181]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.7083506539662141, pvalue=0.39999189913636146),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

None of the terms shows a statistically significant difference in usage between high value and low value questions: every p-value is well above 0.05. In addition, the counts are very small (most terms appear in a single question), so the chi-squared test is not reliable here. It would be better to run this test only on terms that have higher frequencies.
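
One possible way to restrict the test to more frequent terms is sketched below. It counts high and low value occurrences for every term in a single pass, then keeps only the terms that appear in enough questions. The helper names `count_terms` and `frequent_terms`, the `min_total` threshold and the toy rows are all assumptions made for the example.

```python
def count_terms(rows, min_length=6):
    """One pass over (clean_question, high_value) pairs.
    Returns {term: [high_count, low_count]}, counting each question
    at most once per term."""
    counts = {}
    for clean_question, high_value in rows:
        for word in set(clean_question.split()):
            if len(word) >= min_length:
                pair = counts.setdefault(word, [0, 0])
                pair[0 if high_value == 1 else 1] += 1
    return counts

def frequent_terms(counts, min_total):
    """Keep only the terms that occur in at least min_total questions."""
    return {t: tuple(c) for t, c in counts.items() if sum(c) >= min_total}

# Made-up rows: (clean_question, high_value)
rows = [
    ("this element has atomic number 79", 1),
    ("this element is a noble gas", 0),
    ("the capital of this country is vienna", 0),
    ("this element was named for a country", 1),
]
frequent = frequent_terms(count_terms(rows), min_total=2)
print(frequent)
```

On real data the threshold would need to be much higher than 2. A common rule of thumb for the chi-squared test is to have expected counts of at least 5 in each category.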

## Conclusion

In this project, we analyzed a dataset of questions previously asked in the game Jeopardy to try and figure out some patterns that could help us win in future games.

We observed that answers can rarely be deduced from the wording of the question, and that about 69% of the longer terms in a question had already appeared in earlier questions, which suggests that focusing on recurring topics could be a good strategy. We also studied the link between words and the dollar value of the questions they appear in.

Finally, we used significance testing to check whether particular terms show a statistically significant difference in usage between high value and low value questions. None of the sampled terms did, and their counts were too small for the test to be reliable.