# Guided Project: Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

## Dataset
The dataset is named jeopardy.csv and contains the first 20,000 rows of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [1]:
import pandas
import csv

jeopardy = pandas.read_csv("jeopardy.csv")

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# strip the extra leading space from every column name
jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Each row in the dataset represents a single question on a single episode of Jeopardy. The columns are:
* <mark>Show Number</mark> - the Jeopardy episode number
* <mark>Air Date</mark> - the date the episode aired
* <mark>Round</mark> - the round of Jeopardy
* <mark>Category</mark> - the category of the question
* <mark>Value</mark> - the number of dollars the correct answer is worth
* <mark>Question</mark> - the text of the question
* <mark>Answer</mark> - the text of the answer

## Normalizing Text
Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns). The idea is to ensure that we put words in lowercase and remove punctuation so <mark>Don't</mark> and <mark>don't</mark> aren't considered to be different words when we compare them.

In [5]:
# regular expressions are used to strip punctuation
import re

# normalize text: lowercase, remove punctuation, collapse repeated whitespace
def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text
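As a quick sanity check, the sketch below applies the same normalization to a few made-up strings. The function is repeated so the snippet runs on its own, and the sample strings are illustrative rather than taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Illustrative inputs (assumptions, not rows from jeopardy.csv)
samples = ["Don't", "don't", "No. 2:  1912 Olympian; football star"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Both spellings of "don't" end up as the same token, which is the property the later word comparisons rely on.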

In [6]:

# normalizing Question and Answer columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Normalizing Other Columns
Now that we've normalized the text columns, there are also some other columns to normalize.

The <mark>Value</mark> column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The <mark>Air Date</mark> column should be a datetime, not a string, so that we can work with it more easily.

In [7]:
# function to normalize the Value column:
# strip the dollar sign and other punctuation, then return an int (0 if conversion fails)
def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text
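A vectorized alternative might look like the sketch below, which assumes the raw values are strings such as `$200` or `None`. The sample series is hypothetical. Unlike `normalize_values`, this version keeps only digits, so a mixed entry such as `2a` would become 2 rather than 0.

```python
import pandas as pd

# Hypothetical sample of raw Value strings, including a non-numeric entry.
values = pd.Series(["$200", "$1,000", "None", "$2,000"])

# Keep only digits, then coerce to numbers; entries left empty become NaN
# and are replaced with 0 before converting to integers.
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean_value = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean_value.tolist())
```

Working on the whole column at once avoids calling a Python function per row, which tends to matter more as the dataset grows.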

In [8]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [9]:
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Answers in Questions
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
* How often is the answer deducible from the question?
* How often are new questions repeats of older questions?

We can answer the first question by checking how often the words in the answer also occur in the question. We can answer the second by checking how often complex words (six or more characters) reoccur.

In [None]:
# fraction of the words in the answer that also occur in the question
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")  # don't count "the" as a meaningful match
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
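To make the score concrete, here is a small worked example on a single made-up row. The function is repeated so the snippet is self-contained, and a plain dictionary stands in for a DataFrame row.

```python
def count_matches(row):
    # Fraction of answer words (ignoring one "the") found in the question.
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Hypothetical row: one of the two remaining answer words ("canyon")
# appears in the question, so the score is 1 / 2.
row = {
    "clean_answer": "the grand canyon",
    "clean_question": "this canyon in arizona is a mile deep",
}
score = count_matches(row)
print(score)  # 0.5
```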

In [None]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].mean()*100

On average, only about 6% of the words in an answer also appear in its question. This isn't a huge number, and it means that we probably can't just hope that hearing a question will enable us to determine the answer. We'll probably have to study.

## Recycled Questions
Let's investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [None]:
# sort questions by air date so earlier questions are processed first
jeopardy = jeopardy.sort_values("Air Date")

# for each question, compute the share of its long words (six or more
# characters) that already appeared in an earlier question
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]  # keep words with six or more characters
    match_count = 0
    for word in split_question:  # count words already seen in earlier questions
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap
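The same bookkeeping can be traced on a toy list of questions. This is a minimal sketch: the helper name `overlap_scores` and the sample questions are assumptions made for illustration, and the list is assumed to be in chronological order already.

```python
def overlap_scores(questions, min_len=6):
    # For each question, the share of its long words seen in earlier questions.
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        matches = sum(1 for w in words if w in terms_used)
        terms_used.update(words)  # remember this question's words for later ones
        scores.append(matches / len(words) if words else 0)
    return scores

# Hypothetical questions, oldest first
toy = [
    "the president visited washington",
    "washington was the first president",
    "bananas",
]
scores = overlap_scores(toy)
print(scores)  # [0.0, 1.0, 0.0]
```

The second question scores 1.0 because both of its long words were already used by the first, even though the two questions ask different things. That is why single-term overlap is only a rough signal of recycling.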

In [None]:
jeopardy["question_overlap"].mean()*100

There is about a 69% overlap between terms in new questions and terms in old questions in our dataset. This figure is based on only about 10% of the full Jeopardy question set, and it looks at single terms rather than phrases, so it is far from conclusive. Still, it suggests that it may be worth looking more closely at the recycling of questions.

## Low Value vs High Value Questions
Let's say we only want to study terms that tend to appear in high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.
We will do this for a subset of words, but the same idea can be extended to every word in the Jeopardy questions.

In [None]:
# flag a question as high value (1) when it is worth more than $800, otherwise 0
def high_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(high_value, axis=1)
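The same flag can be computed without a row-wise `apply`. The sketch below uses a small hypothetical frame with a `clean_value` column.

```python
import pandas as pd

# Hypothetical values; 800 itself is not "high" because the rule is strictly > 800.
df = pd.DataFrame({"clean_value": [200, 800, 1000, 2000, 0]})

# The boolean comparison is evaluated for the whole column, then cast to 0/1.
df["high_value"] = (df["clean_value"] > 800).astype(int)
print(df["high_value"].tolist())  # [0, 0, 1, 1, 0]
```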

In [None]:
# count how many high-value and low-value questions a term occurs in
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
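Because `count_usage` re-splits every question each time it is called, it can be slow when many terms are checked. One possible speed-up, sketched here on a hypothetical three-row frame, is to split each question into a set of words once and reuse those sets. The name `count_usage_fast` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the jeopardy DataFrame
df = pd.DataFrame({
    "clean_question": [
        "this president signed the treaty",
        "the treaty ended this war",
        "this planet has rings",
    ],
    "high_value": [1, 0, 0],
})

# Split every question once; set membership tests are then cheap.
word_sets = df["clean_question"].str.split(" ").apply(set)

def count_usage_fast(term):
    # Boolean mask of the questions that contain the term as a whole word.
    has_term = word_sets.apply(lambda words: term in words)
    high_count = int((has_term & (df["high_value"] == 1)).sum())
    low_count = int((has_term & (df["high_value"] == 0)).sum())
    return high_count, low_count

print(count_usage_fast("treaty"))  # (1, 1)
```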

We will use a chi-squared test to see whether terms are used in high-value questions more often than expected.

In [None]:
# pick 10 distinct random terms from the terms used in the questions
from random import sample

terms_used_list = list(terms_used)
comparison_terms = sample(terms_used_list, 10)

comparison_terms

In [None]:
# observed counts of high-value and low-value questions for each term
observed_high_low = []

for term in comparison_terms:
    observed_high_low.append(count_usage(term))

observed_high_low

## Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
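The arithmetic behind the test can be sketched by hand using only the standard library. The counts below are hypothetical, and the helper name `chi_squared_two_groups` is an assumption. With two categories there is one degree of freedom, so the p-value can be taken from the complementary error function, and the result should agree with `scipy.stats.chisquare` on the same inputs.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_total, low_total):
    # Expected counts assume the term is spread across high-value and
    # low-value questions in proportion to how many of each exist.
    total_rows = high_total + low_total
    term_share = (observed_high + observed_low) / total_rows
    expected_high = term_share * high_total
    expected_low = term_share * low_total

    # Sum of (observed - expected)^2 / expected over both groups.
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)

    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical: a term seen in 12 high-value and 8 low-value questions,
# in a dataset with 6,000 high-value and 14,000 low-value rows.
statistic, p_value = chi_squared_two_groups(12, 8, 6000, 14000)
print(round(statistic, 2), p_value < 0.05)
```

Here the expected counts are 6 and 14, so the statistic is 36/6 + 36/14, roughly 8.57. The sketch assumes the term occurs at least once; otherwise the expected counts are zero and the division fails.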

In [None]:
# calculation of the chi-squared test 
# p-value threshold 0.05
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_high_low:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

None of the terms had a significant difference in usage between high-value and low-value questions. Additionally, the frequencies were all lower than 5, so the chi-squared test is less reliable here. It would be better to run this test only on terms that have higher frequencies.

## Next Steps
Some potential next steps for diving deeper into the analysis:
* Finding a better way to eliminate non-informative words than just removing words that are shorter than 6 characters. Some ideas:
    * Manually create a list of words to remove, like the, than, etc.
    * Find a list of stopwords to remove.
    * Remove words that occur in more than a certain percentage (like 5%) of questions.
* Looking more into the Category column to see whether any interesting analysis can be done with it. Some ideas:
    * See which categories appear the most often.
    * Find the probability of each category appearing in each round.
* Using the whole Jeopardy dataset [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) instead of the subset we used in this project.
* Using phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
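
As one illustration of the ideas above, the following sketch finds words that occur in more than a chosen share of questions so that they can be dropped before comparing questions. The helper name `frequent_terms`, the threshold, and the sample questions are all assumptions. On the real data a threshold near the 5% mentioned above would be a more natural starting point.

```python
from collections import Counter

def frequent_terms(questions, max_share=0.05):
    # Count, for each word, how many questions it appears in at least once.
    doc_freq = Counter()
    for question in questions:
        doc_freq.update(set(question.split()))
    limit = max_share * len(questions)
    return {term for term, count in doc_freq.items() if count > limit}

# Hypothetical normalized questions
toy_questions = [
    "this state borders canada",
    "this river flows north",
    "this author wrote emma",
    "a planet with rings",
]

# With only four questions, a 50% threshold is used so the effect is visible.
common = frequent_terms(toy_questions, max_share=0.5)
filtered = [[w for w in q.split() if w not in common] for q in toy_questions]
print(common)       # {'this'}
print(filtered[0])  # ['state', 'borders', 'canada']
```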