# Guided Project: Winning Jeopardy

Goal: Analyze winning patterns in a dataset of Jeopardy questions.

The dataset `jeopardy.csv` can be accessed from [this Reddit post by trexmatt](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

## 0. Introduction

We load the file `jeopardy.csv` and print its shape, columns, data types, and first five rows. Some column names have leading/trailing spaces, so we strip them.

In [1]:
import numpy as np
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")
print("Shape: " + str(jeopardy.shape))
print("Columns initially:")
print(jeopardy.columns.values)
print("Data format:")
print(jeopardy.dtypes)
print("First five rows:")
print(jeopardy.head())

jeopardy.columns = [s.strip() for s in jeopardy.columns.values]
print("Columns after:")
print(jeopardy.columns.values)

Shape: (19999, 7)
Columns initially:
['Show Number' ' Air Date' ' Round' ' Category' ' Value' ' Question'
 ' Answer']
Data format:
Show Number     int64
 Air Date      object
 Round         object
 Category      object
 Value         object
 Question      object
 Answer        object
dtype: object
First five rows:
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorp

## 1. Normalizing Functions

We want to normalize the strings in the question and answer columns by removing punctuation and lowercasing. We handle one edge case first: _No._ as in _No. 1_, _No. 2_, etc. is expanded to _number_, since after normalization it could not be distinguished from _no_ as in _yes or no_.

In [2]:
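# regex=True is needed because recent pandas versions treat the pattern as literal text by default
jeopardy[["Clean Question", "Clean Answer"]] = jeopardy[["Question", "Answer"]].apply(
    lambda col: col.str.replace(r"(N| n)o\.( [0-9]+)", r"\1umber\2", regex=True)  # "No. 2" -> "Number 2"
                   .str.lower()
                   .str.replace(r"[^a-z0-9 ]", "", regex=True)  # drop punctuation
)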
print(jeopardy[["Clean Question", "Clean Answer"]].head(10))

                                      Clean Question    Clean Answer
0  for the last 8 years of his life galileo was u...      copernicus
1  number 2 1912 olympian football star at carlis...      jim thorpe
2  the city of yuma in this state has a record av...         arizona
3  in 1963 live on the art linkletter show this c...       mcdonalds
4  signer of the dec of indep framer of the const...      john adams
5  in the title of an aesop fable this insect sha...         the ant
6  built in 312 bc to link rome  the south of ita...  the appian way
7  number 8 30 steals for the birmingham barons 2...  michael jordan
8  in the winter of 197172 a record 1122 inches o...      washington
9  this housewares store was named for the packag...   crate  barrel
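
The normalization rule can also be tried on individual strings. The following is a minimal sketch using only the standard `re` module; the helper name `normalize` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

def normalize(text):
    """Expand 'No. <digits>' to 'number <digits>', lowercase, and drop punctuation."""
    # Same pattern as the cell above: "No." or " no." followed by a space and digits
    text = re.sub(r"(N| n)o\.( [0-9]+)", r"\1umber\2", text)
    text = text.lower()
    # Keep only lowercase letters, digits, and spaces
    return re.sub(r"[^a-z0-9 ]", "", text)

# Illustrative inputs (assumed, not taken from the dataset)
samples = ["No. 2: 1912 Olympian; football star", "Crate & Barrel", "Just say no."]
for s in samples:
    print(repr(normalize(s)))
```

Removing a symbol that sits between two spaces leaves a double space behind, as in `crate  barrel` in the output above. This does not affect the later analysis, because `str.split()` without arguments splits on any run of whitespace.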


Next, the entries of `Value` and `Air Date` are of type `object` (strings) and are not suitable for numerical analysis. We first look at the unique values of `Value` to decide which cleaning steps are needed.

In [3]:
print(jeopardy["Value"].value_counts().index.values)

['$400' '$800' '$200' '$600' '$1000' '$2000' '$1200' '$1600' '$100' '$500'
 '$300' 'None' '$1,000' '$2,000' '$3,000' '$1,500' '$1,200' '$4,000'
 '$5,000' '$1,800' '$1,400' '$1,600' '$2,500' '$700' '$2,200' '$2,400'
 '$3,600' '$7,000' '$6,000' '$1,100' '$1,300' '$3,200' '$3,500' '$900'
 '$2,800' '$1,900' '$3,400' '$10,000' '$3,100' '$8,000' '$2,600' '$3,800'
 '$2,100' '$5,600' '$4,400' '$4,600' '$4,800' '$7,200' '$12,000' '$7,400'
 '$9,000' '$5,800' '$4,100' '$2,300' '$10,800' '$3,389' '$1,111' '$7,500'
 '$2,900' '$5,400' '$6,200' '$3,300' '$4,500' '$6,800' '$1,492' '$4,700'
 '$3,900' '$5,200' '$1,700' '$2,127' '$2,021' '$1,020' '$750' '$8,200'
 '$367' '$6,100']


From the above, we remove the symbols `"$"` and `","`, replace `"None"` with `0`, and convert the `Value` strings into integers. We also convert `Air Date` into `datetime` form.

In [4]:
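jeopardy["Clean Value"] = (
    jeopardy["Value"]
    .fillna("0")  # newer pandas versions may read "None" as a missing value
    .str.replace(r"[$,]", "", regex=True)  # strip "$" and ","
    .str.replace("None", "0")
    .astype(int)
)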
print(jeopardy["Clean Value"].value_counts().index.values)

[  400   800   200  1000   600  2000  1200  1600   100   500   300     0
  3000  1500  4000  5000  1800  1400  2500   700  2200  3600  2400  7000
  6000   900  1300  3500  3200  1100  3400  1900  2800  8000 10000  3100
  2600  2100 12000  4600  7200  5600  4400  4800  3800  5200  1492  6800
  7500  1700  7400   750  4700  2127  3900  2300  3389  1020  9000  8200
  6200 10800  5800  5400  1111   367  4500  4100  2021  3300  2900  6100]
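
The same cleaning rule can be written as a plain function and checked on a few values. This is a minimal sketch; `parse_value` is a hypothetical helper name, and mapping `"None"` to `0` simply mirrors the choice made in the cell above.

```python
def parse_value(raw):
    """Convert a string such as '$1,200' to an int; treat 'None' (or a missing value) as 0."""
    if raw is None or raw == "None":
        return 0
    # Drop the dollar sign and thousands separator before converting
    return int(raw.replace("$", "").replace(",", ""))

# A few of the raw strings listed in the output of In [3]
for raw in ["$200", "$1,200", "$12,000", "None"]:
    print(raw, "->", parse_value(raw))
```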


In [5]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

## 2. Analyzing Winning Strategies

### a. Recycled Questions

One winning strategy is to study past Jeopardy questions. To estimate how often questions are recycled, we sort the questions by air date and, for each one, compute the fraction of its complex words (6 or more characters) that already appeared in earlier questions.

In [6]:
jeopardy.sort_values(by = "Air Date", inplace = True)
question_overlapped = []
terms_repeated = set()
for index, row in jeopardy.iterrows():
    split_complex_question = [e for e in row["Clean Question"].split() if len(e) >= 6]
    tot = 0
    for word in split_complex_question:
        if word in terms_repeated:
            tot += 1
        else:
            terms_repeated.add(word)
    # If the question has no words, append 0 to prevent division by zero
    if (len(split_complex_question) == 0):
        question_overlapped.append(0)
    else:
        question_overlapped.append(tot/len(split_complex_question))

jeopardy["Question Overlapped"] = question_overlapped
print("The avg occurrence level of a complex word on past questions is", np.mean(jeopardy["Question Overlapped"]))

The avg occurrence level of a complex word on past questions is 0.6900454028691575


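On average, nearly $70 \%$ of a question's complex words have already appeared in an earlier question, which is a large share. Although the data we have now is only about $10 \%$ of Jeopardy's entire dataset, it might be worth preparing for some recycled questions. Note that this measures repeated words rather than repeated questions, so it is only a rough indicator.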
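
To make the metric concrete, here is a small sketch that applies the same counting logic to three made-up questions. The function name `overlap_ratios` and the toy questions are assumptions for illustration only.

```python
def overlap_ratios(questions, min_len=6):
    """For each question (in order), return the share of its long words seen earlier."""
    seen = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        hits = 0
        for w in words:
            if w in seen:
                hits += 1
            else:
                seen.add(w)
        # Questions without long words get 0 to avoid dividing by zero
        ratios.append(hits / len(words) if words else 0)
    return ratios

# Made-up, already normalized questions in air-date order
toy = [
    "this russian composer wrote swan lake",
    "this austrian composer wrote the trout quintet",
    "a lake",
]
print(overlap_ratios(toy))
```

In the second toy question only `composer` was seen before, so one of its three long words counts as repeated. A question can therefore score well above zero without being a recycled question, which is why the average is best read as word-level overlap.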

### b. Answers in Questions

Another strategy is to deduce the answer directly from the question. We analyze this by measuring how often an answer's words appear in its corresponding question.

In [7]:
def avg_answer_contained(row):
    split_question = row["Clean Question"].split()
    split_answer = row["Clean Answer"].split()
    # Remove the non-significant words such as "the", "a", "an"
    split_answer = [e for e in split_answer if e not in ["the", "a", "an"]]
    tot = 0
    for word in split_answer:
        if word in split_question:
            tot += 1
    # If the answer has no words, returns 0 to prevent division by zero
    if (len(split_answer) == 0):
        return 0
    else:
        return tot/len(split_answer)

jeopardy["Answer in Question"] = jeopardy.apply(avg_answer_contained, axis = 1)
print("The avg occurrence level of an answer word coinciding with its question is", np.mean(jeopardy["Answer in Question"]))

The avg occurrence level of an answer word coinciding with its question is 0.04261219013331619


The average is a small $4\%$, so deducing answers from their questions is not a good strategy here.
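
The per-row score can be illustrated on a few made-up question/answer pairs. This sketch restates the logic of `avg_answer_contained` as a standalone function; the name `answer_in_question` and the example strings are hypothetical.

```python
STOPWORDS = {"the", "a", "an"}

def answer_in_question(question, answer):
    """Share of the answer's significant words that also appear in the question."""
    question_words = set(question.split())
    answer_words = [w for w in answer.split() if w not in STOPWORDS]
    # An answer made only of stopwords scores 0 to avoid dividing by zero
    if not answer_words:
        return 0
    return sum(w in question_words for w in answer_words) / len(answer_words)

# Made-up, already normalized pairs
print(answer_in_question("this appian road linked rome to the south of italy", "the appian way"))
print(answer_in_question("this insect is famously hard working", "the ant"))
```

In the first pair one of the two significant answer words (`appian`) occurs in the question, and in the second pair none do.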

### c. High vs Low-Value Questions

Alternatively, let's say we study only the high-value questions (worth $\$800$ or more). We will use a chi-squared test to identify terms associated with high-value questions. We examine several words from `terms_repeated` and, as the null hypothesis, expect each word's occurrences to split between high-value and low-value questions in the same proportion as the questions in the whole dataset.

We ran `chosen_words = list(terms_repeated)[:5]` once and then hard-coded the result for consistency, since `terms_repeated` is a set and that line does not always produce the same result.

In [38]:
from scipy.stats import chisquare

# Sort jeopardy back
jeopardy.sort_index(inplace = True)
# Count the proportions of high/low values in the entire dataset
is_hi_val = jeopardy["Clean Value"] >= 800
hi_val_prop = np.mean(is_hi_val)
lo_val_prop = 1 - hi_val_prop

# Function to calculate the chi-squared statistic and p-value for one word
def process_cs(word):
    # Count the questions containing the word, split by high and low value
    is_word_in_question = jeopardy["Clean Question"] \
                            .str.split() \
                            .apply(lambda l: word in l)
    word_in_highs_count = (is_word_in_question & is_hi_val).sum()
    word_in_lows_count = (is_word_in_question & ~is_hi_val).sum()
    observed = (word_in_highs_count, word_in_lows_count)
    # Expected counts are proportional to the overall ratio of high and low values
    expected = (hi_val_prop * np.sum(observed), lo_val_prop * np.sum(observed))
    chi_sq_val, p_val = chisquare(observed, expected)
    return chi_sq_val, p_val

# Pick the words
chosen_words = ["measure", "schubert", "prevow", "alphatrack", "pawtucket"]
for word in chosen_words:
    print(word, process_cs(word))

measure (0.2688716582048211, 0.6040896796217152)
schubert (0.7721754541426672, 0.3795448984353682)
prevow (0.7721754541426672, 0.3795448984353682)
alphatrack (1.295042460408538, 0.25512076479610835)
pawtucket (1.295042460408538, 0.25512076479610835)


None of the chosen words has a high chi-squared value, and all p-values are well above 0.05, so none shows a significant difference between its appearances in high-value and low-value questions. The per-word counts are also small, which makes the chi-squared test unreliable, so we still need to work on that (testing every word in `terms_repeated` makes the code very slow).
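
To see why rare words cannot produce a convincing result, the two-category chi-squared statistic can be computed by hand. The sketch below uses only the standard library; `hi_prop = 0.4` is a made-up proportion rather than the value from the dataset, and `chi_square_2cat` is a hypothetical helper that relies on the identity that a chi-squared variable with one degree of freedom has survival function `erfc(sqrt(x / 2))`.

```python
import math

def chi_square_2cat(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

hi_prop = 0.4        # assumed share of high-value questions (illustrative only)
observed = (1, 0)    # a word that occurs once, in a high-value question
total = sum(observed)
expected = (hi_prop * total, (1 - hi_prop) * total)

stat, p_value = chi_square_2cat(observed, expected)
print(stat, p_value)
```

For a word that occurs exactly once, the statistic reduces to `(1 - hi_prop) / hi_prop` when the occurrence is in a high-value question and to `hi_prop / (1 - hi_prop)` otherwise, so it depends only on the overall proportion and not on the word. That would explain why `schubert` and `prevow`, and likewise `alphatrack` and `pawtucket`, share identical values in the output above.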

## 3. Conclusion

Soon ...

In [9]:
# Sketch ....
a = '''abbrevs = {}
import re
for el in jeopardy["Clean Question"]:
    for word in re.findall(r"([A-Za-z]+\.) [^A-Z]", el):
        if word not in abbrevs:
            abbrevs[word] = 1
        else:
            abbrevs[word] += 1
print(sorted(abbrevs.items(), key = lambda x: x[1], reverse = True))'''