**Project: Winning Jeopardy**

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). 

In [1]:
import re, numpy as np, pandas as pd
from random import choice
from scipy.stats import chisquare

In [2]:
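def normalize_text(text):
    # Lowercase, strip punctuation, and collapse runs of whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and ","; values that cannot be parsed as integers become 0
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text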

def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

def determine_value(row, col = 'clean_value'):
    value = 0
    if row[col] > 800:
        value = 1
    return value

def count_usage(term, df, col_1 = 'clean_question', col_2 = 'high_value'):
    low_count = 0
    high_count = 0
    
    for i, row in df.iterrows():
        if term in row[col_1].split(" "):
            if row[col_2] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count



In [3]:
# import data
df = pd.read_csv('jeopardy.csv')
df.head(3)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |


In [4]:
# remove spaces in col names
df.columns = df.columns.str.strip()

# clean question / answer / val data
df["clean_question"] = df["Question"].apply(normalize_text)
df["clean_answer"] = df["Answer"].apply(normalize_text)
df["clean_value"] = df["Value"].apply(normalize_values)

# format date
df['Air Date'] = pd.to_datetime(df['Air Date'])

To decide whether to study past questions, study general knowledge, or not study at all, it would help to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

You can answer the second question by seeing how often complex words (6 or more characters) recur. You can answer the first question by seeing how often words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [5]:
df["answer_in_question"] = df.apply(count_matches, axis=1)
df["answer_in_question"].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in the question.

This is a small share, so we probably can't count on the question itself to give away the answer.
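To make the metric concrete, here is a minimal standalone sketch of the same idea as `count_matches`, applied to two made-up rows. The function name `answer_overlap` and the example strings are hypothetical and not taken from jeopardy.csv. Unlike `count_matches`, this variant drops every occurrence of "the" from the answer rather than only the first.

```python
def answer_overlap(question, answer):
    """Fraction of answer words (ignoring "the") that also occur in the question."""
    answer_words = [w for w in answer.lower().split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(question.lower().split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Hypothetical cleaned rows, for illustration only
print(answer_overlap("this state is home to the city of yuma", "arizona"))  # no answer words match
print(answer_overlap("this new york borough is home to the yankees",
                     "the bronx new york"))  # 2 of 3 answer words match
```

A mean near 0.06 says that most rows look like the first example: the answer simply does not appear in the question text.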

In [6]:
question_overlap = []
terms_used = set()

df = df.sort_values("Air Date")

for i, row in df.iterrows():
    # keep only words of 6 or more characters
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]

    # count the words already seen in earlier questions
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1

    # record this question's words for later rows
    for word in split_question:
        terms_used.add(word)

    if len(split_question) > 0:
        match_count /= len(split_question)

    question_overlap.append(match_count)
        
df["question_overlap"] = question_overlap

df["question_overlap"].mean()

0.6876260592169802

On average, about 69% of the long terms in a question have already appeared in an earlier question.

This only covers a small subset of the questions, and it compares single terms rather than phrases.

The figure is therefore only a rough indicator, but it does suggest that the recycling of questions is worth investigating further.
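A toy trace helps show why single-term overlap can overstate recycling. The sketch below reimplements the loop above on three hypothetical questions; the function name `running_overlap` and the example strings are assumptions made for illustration, and it uses `split()` rather than `split(" ")`.

```python
def running_overlap(questions, min_len=6):
    """For each question, the share of its long words already seen in earlier questions."""
    seen = set()
    overlaps = []
    for q in questions:
        words = [w for w in q.split() if len(w) >= min_len]
        matched = sum(1 for w in words if w in seen)
        overlaps.append(matched / len(words) if words else 0)
        seen.update(words)
    return overlaps

# Hypothetical cleaned questions, ordered by air date
questions = [
    "this french capital hosts the louvre museum",
    "the louvre museum displays this da vinci portrait",
    "this capital is on the seine",
]
print(running_overlap(questions))  # [0.0, 0.5, 1.0]
```

The third question scores 1.0 because its only long word, "capital", was seen before, even though the question itself is new. Short questions with one common long word inflate the average in the same way.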

In [7]:
df["high_value"] = df.apply(determine_value, axis=1)

In [8]:
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term, df))

observed_expected

[(0, 1),
 (5, 4),
 (0, 1),
 (0, 2),
 (1, 0),
 (2, 1),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 1)]

In [9]:
high_value_count = df[df["high_value"] == 1].shape[0]
low_value_count = df[df["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / df.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=3.180689829769113, pvalue=0.07451326597343068),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

None of the terms showed a statistically significant difference in usage between high value and low value rows: every p-value is above 0.05.

Additionally, almost all of the observed frequencies were lower than 5, so the chi-squared test is not reliable here.

It would be better to run this test only on terms that have higher frequencies.
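The expected counts used above can be worked through by hand. The sketch below is a standalone illustration with hypothetical numbers (300 high value rows, 700 low value rows, and a term seen 40 and 60 times); none of these figures come from jeopardy.csv. For two categories the test has one degree of freedom, so the p-value can be written as `erfc(sqrt(statistic / 2))`, which should agree with `scipy.stats.chisquare` for the same inputs.

```python
from math import erfc, sqrt

def chi_squared_two_groups(observed_high, observed_low, high_rows, low_rows):
    """Chi-squared test of one term's usage against the high/low split of all rows."""
    total_rows = high_rows + low_rows
    term_total = observed_high + observed_low
    # If the term ignored value, its uses would split like the rows do
    expected_high = term_total * high_rows / total_rows
    expected_low = term_total * low_rows / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of chi-squared with 1 degree of freedom
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value, (expected_high, expected_low)

# Hypothetical counts, for illustration only
stat, p, expected = chi_squared_two_groups(40, 60, high_rows=300, low_rows=700)
print(expected)        # (30.0, 70.0)
print(round(stat, 2))  # 4.76
print(p < 0.05)        # True
```

With a term total of 100, both expected counts are well above 5. A term seen only once, as in most of the sampled terms above, gives expected counts below 1, which is why those results carry little weight.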


**Potential next steps:**

- Find a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long. Some ideas:
    - Manually create a list of words to remove, such as "the" and "than".
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset instead of the subset we used in this lesson.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
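
As a starting point for the first two ideas, here is a minimal sketch that counts every term's high and low value usage in a single pass, drops stopwords, and keeps only terms above a minimum frequency. The names `STOPWORDS`, `term_counts_by_value`, and `min_total` are hypothetical, the stopword list is a tiny hand-made placeholder, and the rows are made up.

```python
from collections import Counter

# Hypothetical hand-made stopword list; a real one would be much longer
STOPWORDS = {"the", "than", "this", "that", "with", "from"}

def term_counts_by_value(rows, min_total=5):
    """Map each frequent term to (high_count, low_count).

    rows: iterable of (clean_question, high_value) pairs.
    """
    high, low = Counter(), Counter()
    for question, high_value in rows:
        # set() counts a term once per question, as count_usage does
        terms = set(question.split()) - STOPWORDS
        (high if high_value == 1 else low).update(terms)
    return {
        term: (high[term], low[term])
        for term in high.keys() | low.keys()
        if high[term] + low[term] >= min_total
    }

# Hypothetical rows, for illustration only
rows = [
    ("the moon orbits earth", 1),
    ("earth orbits the sun", 0),
    ("the moon moon landing", 0),
]
print(term_counts_by_value(rows, min_total=2))
```

On the notebook's data, the pairs could presumably be built with `zip(df["clean_question"], df["high_value"])`. Because this walks the rows once instead of once per term, it avoids the repeated `iterrows` scan in `count_usage`.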