# 1.- Jeopardy. Introduction to the Data

Jeopardy is a popular TV Show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say that we are thinking about competing on Jeopardy, and that we want to look for an edge to help us win. This project examines Jeopardy questions to discover whether there are any patterns in them that could help us win.

Our dataset __jeopardy.csv__ contains 20000 rows of Jeopardy questions, with the following information for each one:

- __Show Number__: Jeopardy episode number where the question was asked.
- __Air Date__: The date the episode aired.
- __Round__: The round of Jeopardy the question was asked in.
- __Category__: The category of the question.
- __Value__: The prize a person could get by answering the question correctly.
- __Question__: Text of the question.
- __Answer__: Text of the answer.

The full dataset was originally published in __JSON__ format and can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

# 2.- Libraries

In [45]:
import pandas as pd
import numpy as np
import re
from scipy.stats import chisquare

# 3.- Importing Data and Normalizing Text

In [3]:
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


We are going to inspect the column names, as they seem to be formatted incorrectly.

In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Almost every column name begins with a space, so we assign a clean list of names to *jeopardy.columns* to rename the columns.

In [5]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 
                    'Category', 'Value', 'Question', 'Answer']
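An alternative that avoids retyping every name is to strip the whitespace from the headers with the pandas string accessor *Index.str.strip()*. The sketch below uses a hypothetical toy DataFrame (not the real dataset) whose headers mimic the leading spaces seen above.

```python
import pandas as pd

# Hypothetical toy frame; its headers imitate the leading spaces in jeopardy.csv.
toy = pd.DataFrame(columns=["Show Number", " Air Date", " Round"])

# Strip leading and trailing whitespace from every header in one step.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This approach keeps working if columns are added or reordered, whereas a hard-coded list has to match the file exactly.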

Before we start doing analysis on the Jeopardy questions of our dataset, we will need to normalize our text columns (__Question__ and __Answer__). To make our later analysis easier we will lowercase words and remove punctuation, so we don't consider similar words to be different.

To start, we will write a function *normalize_string()* that takes in a string, converts it to lowercase, removes all punctuation and returns the new string. Then, we will apply it to both of our columns to normalize them.

In [6]:
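def normalize_string(string):
    # Lowercase so that capitalization does not create distinct words.
    string = string.lower()
    # Drop every character that is not a letter, digit, or whitespace.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string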

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_string)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_string)
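To see what the normalization does, here is a small standalone check. The function body mirrors the one above (written with a raw string for the pattern), and the sample strings are taken from the preview of the data shown earlier.

```python
import re

def normalize_string(string):
    # Lowercase, then keep only letters, digits, and whitespace.
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

samples = ["McDonald's", "Signer of the Dec. of Indep."]
cleaned = [normalize_string(s) for s in samples]
print(cleaned)  # ['mcdonalds', 'signer of the dec of indep']
```

Note that the apostrophe is removed rather than replaced by a space, so *McDonald's* becomes the single token *mcdonalds*.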

# 4.- Normalizing Numeric Values

If we take a look at the __Value__ column, we will see that its values begin with a dollar sign ($). To manipulate this column more easily, we are going to remove the dollar sign and convert the text to a numeric format.

Following the same principle, converting the __Air Date__ column to a datetime can be beneficial later, especially if we are going to use this column in our analysis.

In this section, we are going to write a function that cleans our __Value__ column and normalizes it.

In [7]:
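def normalize_values(string):
    # Strip the dollar sign and any other punctuation.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        integer = int(string)
    except ValueError:
        # Values that are not numeric are recorded as 0.
        integer = 0
    return integer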
    
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
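The behaviour of this function can be checked on a few hand-written inputs. The inputs below are illustrative assumptions: *"$2,000"* and *"None"* are hypothetical values chosen to exercise the comma and the fallback branch, not rows quoted from the dataset.

```python
import re

def normalize_values(string):
    # Remove everything except letters, digits, and whitespace.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        integer = int(string)
    except ValueError:
        # Non-numeric text falls back to 0.
        integer = 0
    return integer

results = [normalize_values(v) for v in ["$200", "$2,000", "None"]]
print(results)  # [200, 2000, 0]
```

Mapping non-numeric text to 0 keeps the column an integer type, at the cost of making such rows indistinguishable from a true value of 0.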

Converting the __Air Date__ column to a datetime type is fairly easy with the pandas function *pandas.to_datetime*. Afterwards, we check the data type of every column with the *dtypes* attribute.

In [8]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# 5.- Questions

If we want a higher success rate, we need to figure out what is better to study: past questions or general knowledge. To that end, answering the following questions would be beneficial:

- How often is the answer deducible from the question?
- How often are new questions just an older one repeated?

We can answer the first question by checking how many of the words in an answer also appear in its question, and the second by looking at how often complex words (six or more characters) recur across questions.

Let's begin with the first question by writing a function that computes the fraction of terms in __clean_answer__ that also occur in __clean_question__.

In [14]:
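def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    # "the" is too common to be a meaningful match; drop its first occurrence.
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left in the answer.
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    # Fraction of answer terms that also appear in the question.
    return match_count/len(split_answer)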
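A quick worked example makes the returned fraction concrete. The row below is hypothetical (a plain dict stands in for a DataFrame row), and the function is repeated only so that the snippet runs on its own.

```python
def count_matches(row):
    # Same logic as the function above, repeated so this snippet is standalone.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical row, not taken from the dataset.
row = {
    "clean_answer": "the eiffel tower",
    "clean_question": "this tower in paris was completed in 1889",
}
score = count_matches(row)
print(score)  # 0.5: "tower" matches, "eiffel" does not, "the" is ignored
```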

In [15]:
answer_in_question = jeopardy.apply(count_matches, axis = 1)
answer_in_question.mean()

0.060493257069335872

Our results tell us that, on average, only about 6% of the terms in an answer also appear in its question. Relying on the wording of a question to guess its answer is therefore not a workable strategy, so we turn to the second question and analyze how often questions are repeated.

This question can't be fully answered in this study, because our dataset contains only about 10% of the full Jeopardy dataset, but it can at least be investigated, and the methodology for the full dataset would be identical to the one shown below.

The steps needed to solve this problem are:

- Sort our dataset in order of ascending air date.
- Create an empty set called terms_used.
- Iterate through each row of our dataset, jeopardy.
- Split __clean_question__ into words, remove any word shorter than 6 characters, and check whether each remaining word occurs in terms_used:
    - if it does, increment a counter.
    - add each word to terms_used.
  
This will help us check whether the terms in a question have been used previously. Filtering out short words lets us ignore common words such as *the* and *than*, which occur frequently but carry little information about the question.

In [18]:
question_overlap = []
terms_used = set()
for i,row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q)>5]
    match_count = 0 
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count/=len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.69087373156719623

This means that there is close to 70% overlap between terms in new questions and terms in older ones. Because of the size of our dataset, our loop looks only at single terms rather than whole phrases, so this result is not conclusive. Still, it suggests that studying old questions is more valuable than trying to deduce the answer from the question itself.
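The loop above visits rows in whatever order the DataFrame currently has, so the sorting step from the list of steps matters for the result. Below is a minimal sketch of sorting with *sort_values* before computing the overlap. It runs on a hypothetical three-row DataFrame (toy data, not rows from jeopardy.csv), and the names *toy* and *overlaps* are illustrative.

```python
import pandas as pd

# Toy data, deliberately out of chronological order.
toy = pd.DataFrame({
    "Air Date": pd.to_datetime(["2005-01-02", "2004-12-31", "2005-01-01"]),
    "clean_question": [
        "this planet orbits nearest the sun",
        "galileo studied this planet",
        "galileo was placed under house arrest",
    ],
})

# Oldest questions first, so "previously used" really means "aired earlier".
toy = toy.sort_values("Air Date")

terms_used = set()
overlaps = []
for _, row in toy.iterrows():
    # Keep only words with six or more characters.
    words = [w for w in row["clean_question"].split(" ") if len(w) > 5]
    seen = sum(1 for w in words if w in terms_used)
    terms_used.update(words)
    overlaps.append(seen / len(words) if words else 0)

print(overlaps)
```

The earliest question has no overlap by construction, and each later question shares one of its three long words with an earlier one.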

# 6.- Low Value Vs High Value Questions

Let's now imagine that we want to study those questions whose value is among the highest values to help us win more money.

We can model this question and find out which terms correspond to high-value questions using a chi-squared test if we split our questions into 2 categories: __Low value__ (<= 800) and __High value__ (> 800).

Once this has been accomplished, we can loop through __terms_used__ and implement a function that does the following:

- Find the number of low value questions a word occurs in.
- Find the number of high value questions a word occurs in.
- Find the percentage of questions a word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We could then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. 
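The steps above can be written compactly as follows. The symbols are introduced here for illustration and do not appear in the original text: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for a word, $n_w = O_{\text{high}} + O_{\text{low}}$, $N$ is the total number of questions, and $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high and low value questions.

```latex
% Expected counts if the word were used equally often in both groups
E_{\text{high}} = \frac{n_w}{N}\, N_{\text{high}}, \qquad
E_{\text{low}} = \frac{n_w}{N}\, N_{\text{low}}

% Chi-squared statistic over the two groups
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
```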

First, we will create a function that assigns 1 to all those rows with a __clean_value__ higher than 800 and 0 otherwise. 

In [20]:
def high_or_low_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

Now, we are going to determine which questions are high and low value using the *pandas.DataFrame.apply()* method and assign the result to the __high value__ column.

In [25]:
jeopardy["high value"] = jeopardy.apply(high_or_low_value, axis = 1)
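The same column can probably be built without a row-wise *apply*, by comparing the whole column at once. This is a sketch on a hypothetical toy DataFrame, not the cell used in this analysis.

```python
import pandas as pd

# Hypothetical values covering both sides of the 800 threshold.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})

# The comparison yields booleans; astype(int) turns them into 1 and 0.
toy["high value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high value"].tolist())  # [0, 0, 1, 0, 1]
```

A value of exactly 800 lands in the low group, which matches the strict *>* comparison in *high_or_low_value*.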

For our next step, we will define a new function that returns the number of high value and low value questions a word appears in.

In [42]:
def count_usage(word):
    low_count = 0 
    high_count = 0 
    for i,row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

We should be able to find the words with the biggest differences in usage between high and low value questions by selecting the words with the highest associated chi-squared values. Doing this for all the words would take a very long time, because *count_usage* scans every row of the dataset for each word, so we are going to experiment with a few terms for now.

In [44]:
comparison_terms = list(terms_used)[:5]
observed_expected = []

for i in comparison_terms:
    observed_expected.append(count_usage(i))

observed_expected

[(0, 1), (2, 2), (1, 1), (1, 3), (3, 0)]
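One way to avoid rescanning every row for each word is to tally all words in a single pass and then look the counts up. The sketch below is an assumption about how this could be done, not the approach used in this analysis. It uses hypothetical toy questions, and *count_usage_fast* is an illustrative name.

```python
from collections import Counter

# Hypothetical (clean_question, high value flag) pairs.
questions = [
    ("galileo studied this planet", 1),
    ("this planet has rings", 0),
    ("galileo was born in pisa", 0),
]

high_counts = Counter()
low_counts = Counter()
for text, is_high in questions:
    # set() counts each word once per question, as count_usage does.
    for word in set(text.split(" ")):
        if is_high == 1:
            high_counts[word] += 1
        else:
            low_counts[word] += 1

def count_usage_fast(word):
    # A Counter returns 0 for words it has never seen.
    return high_counts[word], low_counts[word]

print(count_usage_fast("galileo"))  # (1, 1)
print(count_usage_fast("rings"))    # (0, 1)
```

The tallies cost one pass over the questions, after which each lookup is a dictionary access.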

# 7.- Applying the Chi-Squared Test

We are going to count the rows whose __high value__ is 1 and those whose value is 0, and create a new list called chi_squared that will hold the chi-squared statistic and p-value computed for each term from its observed and expected counts.

In [51]:
high_value_count = jeopardy[jeopardy["high value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high value"] == 0].shape[0]

chi_squared = []
for lista in observed_expected:
    total = sum(lista)
    total_prop = total/jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([lista[0], lista[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.88975496332255899, pvalue=0.34554371914834681),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=0.026364433084407689, pvalue=0.87101348468892104),
 Power_divergenceResult(statistic=7.4633763515870246, pvalue=0.0062966796687489992)]
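To make the calculation behind each result concrete, here is the same arithmetic done by hand for a single term. The row counts are hypothetical round numbers, not the real totals from the dataset. With two categories the test has one degree of freedom, so the p-value can be obtained from the complementary error function instead of calling *chisquare*.

```python
import math

# Hypothetical totals, chosen only to keep the arithmetic simple.
total_rows = 100
high_value_count = 30
low_value_count = 70

# Observed counts for one term: 3 high value questions, 0 low value.
observed_high, observed_low = 3, 0

total_prop = (observed_high + observed_low) / total_rows
expected_high = total_prop * high_value_count  # about 0.9
expected_low = total_prop * low_value_count    # about 2.1

statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Survival function of the chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 6), p_value)
```

Expected counts this small make the chi-squared approximation rough (a common rule of thumb asks for expected counts of at least 5), so results for rare terms should be read with caution.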