# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades and remains a successful show. Imagine you want to compete on Jeopardy and are looking for any edge you can get. In this project, we'll work with a dataset of Jeopardy questions to look for patterns in the questions that could help you win.

The dataset can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). It contains 20,000 rows, and each row represents a single question. These are the columns:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Each episode progresses through several rounds.
* Category -- the category of the question.
* Value -- the number of dollars a correct answer to the question is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

Let's start by getting familiar with the dataset.

In [143]:
#Importing the library
import pandas as pd

#Reading the dataset into a dataframe
jeopardy = pd.read_csv("jeopardy.csv")

#Showing the dataframe
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [144]:
#Showing the column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [145]:
#Removing the leading spaces from the column names
jeopardy.columns = ["Show Number", "Air Date", "Round", "Category", "Value", "Question", "Answer"]
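
Hard-coding the names works, but it has to be kept in sync with the file by hand. As a minimal sketch, assuming the only problem is stray whitespace around the names, each name can be stripped instead. The list `raw_columns` below is a stand-in copied from the output above, not a new data source.

```python
# Column names exactly as printed above, leading spaces included
raw_columns = ["Show Number", " Air Date", " Round", " Category",
               " Value", " Question", " Answer"]

# Strip surrounding whitespace from each name
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)

# On the real dataframe the same idea would read:
# jeopardy.columns = [name.strip() for name in jeopardy.columns]
```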

## Normalizing Text

Before we can analyze the dataframe, we need to normalize the text columns, *Question* and *Answer*. We'll lowercase the text and remove punctuation so that, for example, "Word", "word" and "word," are all considered the same.

In [146]:
#Importing the library
import re

#Writing a function to normalize a string
def normalizer(string):
    string = string.lower()
    #Using a raw string so that \s reaches the regex engine unchanged
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

#Normalizing the columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalizer)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalizer)
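
As a quick sanity check, the normalizer can be tried on a couple of strings before applying it to whole columns. This is only a sketch: the function is repeated so the snippet runs on its own, and the sample strings are loosely based on the rows shown earlier.

```python
import re

def normalizer(string):
    # Lowercase, then drop everything except letters, digits and whitespace
    string = string.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", string)

samples = ['In 1963, live on "The Art Linkletter Show"', "McDonald's"]
cleaned = [normalizer(s) for s in samples]
print(cleaned)
```

Note that removing the apostrophe joins "McDonald's" into a single token, "mcdonalds", rather than splitting it in two.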

## Normalizing Columns

We have normalized the text columns; now we'll normalize some other columns. The *Value* column contains a dollar sign, which we'll remove so that we can convert the column from text to numeric. The *Air Date* column should be a datetime rather than a string, which makes it easier to work with.

In [147]:
#Writing a function to normalize dollar values
def normalizer_dollars(string):
    #Dropping punctuation such as the dollar sign
    string = re.sub(r"[^a-zA-Z0-9\s]", "", string)
    try:
        value = int(string)
    except ValueError:
        #Treating anything that is not a number as 0
        value = 0
    return value
        
jeopardy["clean_value"] = jeopardy["Value"].apply(normalizer_dollars)
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

#Checking the changes
jeopardy.head(4)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |


In [148]:
#Checking the changes
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
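
To see what the dollar normalizer does with different inputs, here is a small sketch that runs it on a few hand-picked strings. The inputs are hypothetical examples chosen for illustration, not values taken from the dataset, and the function is repeated so the snippet is self-contained.

```python
import re

def normalizer_dollars(string):
    # Drop punctuation such as "$" and ",", then try to read an integer
    string = re.sub(r"[^a-zA-Z0-9\s]", "", string)
    try:
        return int(string)
    except ValueError:
        # Anything that is not a number falls back to 0
        return 0

# Hypothetical inputs, for illustration only
results = {raw: normalizer_dollars(raw) for raw in ["$200", "$2,000", "not a number"]}
print(results)
```

Because non-numeric values become 0, they end up in the low value group in the later analysis.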

## Answers in Questions

To figure out how to proceed, it helps to consider two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by counting how many words in the answer also occur in the question, and the second by checking how often complex words (6 or more characters) recur in later questions.

Let's work on the first question.

In [149]:
#Writing the match function
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
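
To make the score concrete, the sketch below runs the same logic on two made-up rows. Plain dictionaries stand in for dataframe rows, since the function only indexes by column name. The rows themselves are invented for illustration.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is too common to count as a meaningful match
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    # Fraction of answer words that also appear in the question
    return match_count / len(split_answer)

# Made-up rows, for illustration only
rows = [
    {"clean_question": "this great structure runs across northern china",
     "clean_answer": "the great wall"},
    {"clean_question": "the city of yuma is in this state",
     "clean_answer": "arizona"},
]
scores = [count_matches(row) for row in rows]
print(scores)
```

In the first row one of the two remaining answer words ("great") appears in the question, so the score is 0.5. In the second row nothing matches and the score is 0.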

In [150]:
#Finding the mean
jeopardy["answer_in_question"].mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question alone.

## Recycled Questions

Now we'll investigate how often new questions are repeats of older ones.

In [151]:
#Creating a list and a set to gather data
question_overlap = []
terms_used = set()

#Sorting the dataframe
jeopardy = jeopardy.sort_values("Air Date")

#Looking for questions which overlap
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [w for w in split_question if len(w)>5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

#Showing the mean 
jeopardy["question_overlap"].mean()

0.6876260592169802

On average, about 70% of the long words in a question have already appeared in an earlier question. This measures the overlap of single words rather than whole phrases, so it is not conclusive on its own, but 70% is high enough that the recycling of questions is worth investigating further.
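
To show how the overlap score behaves, here is a toy version of the same loop on three invented questions, already in chronological order. It is a sketch using plain lists rather than the dataframe.

```python
# Invented questions, assumed to be sorted from oldest to newest
questions = [
    "this planet is known as the red planet",
    "the largest planet in the solar system",
    "mountain ranges on this planet dwarf everest",
]

terms_used = set()
overlaps = []
for question in questions:
    # Keep only words of 6 or more characters
    words = [w for w in question.split(" ") if len(w) > 5]
    # Count words already seen in earlier questions
    seen = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(seen / len(words) if words else 0)

print(overlaps)
```

The first question scores 0 because nothing has been seen yet. The later ones score 1/3 and 1/4 purely because "planet" recurs, which shows how a single common word can raise the overlap without the questions being repeats.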

## Low Value vs High Value Questions

Suppose you only want to study high value questions rather than low value ones, since that would help you earn more money on Jeopardy. To find out which terms are worth studying, we'll divide the questions into two categories:

* Low value -- Any row where *Value* is 800 or less.
* High value -- Any row where *Value* is greater than 800.

Then we'll find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.
Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [152]:
#Determining high or low questions
def high_or_low(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value 

#Assigning the values to a column
jeopardy["high_value"] = jeopardy.apply(high_or_low, axis=1)

#Computing the high and low count
def count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

#Applying the function to 5 words
observed_expected = []
comparison_terms = list(terms_used)[:5]
for word in comparison_terms:
    observed_expected.append(count(word))
observed_expected    

[(1, 0), (0, 1), (0, 1), (0, 1), (1, 0)]
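
Calling `count` once per word rescans every row each time, which is why only five words are tested here. One possible alternative, sketched below on invented rows, is to tally all words in a single pass and then look counts up. The names `rows`, `high_counts`, `low_counts` and `count_from_tally` are hypothetical and not part of the notebook.

```python
from collections import Counter

# Invented (clean_question, high_value) pairs, for illustration only
rows = [
    ("this river flows through egypt", 1),
    ("this river is in brazil", 0),
    ("capital of egypt", 0),
]

high_counts = Counter()
low_counts = Counter()
for question, high_value in rows:
    # set() so a word is counted once per question, as in count()
    for word in set(question.split(" ")):
        if high_value == 1:
            high_counts[word] += 1
        else:
            low_counts[word] += 1

def count_from_tally(word):
    # Same (high_count, low_count) shape that count() returns
    return high_counts[word], low_counts[word]

print(count_from_tally("river"), count_from_tally("brazil"))
```

With the tallies built once, looking up any number of words is cheap, so the comparison would no longer need to be limited to a handful of terms.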

## Applying the Chi-Squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [153]:
#Importing the libraries
import numpy as np
from scipy.stats import chisquare

#Counting the high and low value questions
high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])

#Computing the expected values
chi_squared = []
for item in observed_expected:
    total = sum(item)
    total_prop = total / len(jeopardy)
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([item[0], item[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
#Showing the results    
chi_squared    

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
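
The statistic can also be worked out by hand to see where these numbers come from. The sketch below computes the Pearson statistic and, for the two-category case only, the p-value through the closed form `erfc(sqrt(statistic / 2))`. The group sizes are hypothetical and are not the real counts from the dataset.

```python
import math

def chi_squared_1dof(observed, expected):
    # Pearson statistic: sum of (O - E)^2 / E over the categories
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give 1 degree of freedom, where the p-value
    # has the closed form erfc(sqrt(statistic / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical group sizes, not the real dataset counts
high_value_count = 30
low_value_count = 70
total_rows = high_value_count + low_value_count

# A term seen exactly once, in a high value question
observed = (1, 0)
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_1dof(observed, expected)
print(statistic, p_value)
```

For a term that appears exactly once, the statistic reduces to the ratio of the two group sizes: low over high if the term was in a high value question, high over low otherwise. That is consistent with the output above, where only two distinct results occur and their statistics (about 2.488 and 0.402) multiply to roughly 1.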

## Conclusion

All of the p-values are high (above 0.05), so the differences we observed could easily be due to random chance. None of the terms showed a significant difference in usage between high value and low value rows. In addition, the frequencies were all lower than 5, and the chi-squared test is less reliable with counts that small.
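
One way to act on the last point would be to test only terms whose expected counts reach 5 in both groups. The helper below is a sketch of that check. The name `smallest_expected`, the group sizes and the term counts are all hypothetical.

```python
def smallest_expected(observed, high_value_count, low_value_count):
    # Expected counts if the term were spread evenly across both groups
    total_prop = sum(observed) / (high_value_count + low_value_count)
    return min(total_prop * high_value_count, total_prop * low_value_count)

# Hypothetical group sizes and (high_count, low_count) pairs
high_value_count, low_value_count = 30, 70
for observed in [(1, 0), (12, 18)]:
    usable = smallest_expected(observed, high_value_count, low_value_count) >= 5
    print(observed, "usable" if usable else "too rare")
```

A term seen only once can never pass this check, so a rerun of the analysis would need to start from more frequent terms.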