# Searching for Patterns in Jeopardy Questions

The TV show Jeopardy is one of the longest-running game shows of all time, and to this day it is one of the most respected trivia-based games around. Its sheer volume makes it particularly interesting to data scientists: every episode has over fifty trivia questions and answers, and new episodes air almost every weekday.

We want to search for patterns in a dataset containing 20,000 Jeopardy questions. The dataset can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

print(jeopardy.head(5))
print(jeopardy.columns)
jeopardy.info()  # info() prints its summary directly and returns None

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype

We can see a few things from this basic overview of the dataset. The "Value" column is stored as strings rather than integers, and several of the column names have a leading space. None of the columns contain null values.

Let's fix the column names.

In [2]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


That's swell. The gimmick of Jeopardy is that the host reads contestants the answer and the contestant gives back the question. However, that is all a matter of framing. The dataset we're using uses the contestant response as the *answer*, so for the sake of our analysis, we will too.

## Normalizing Text

If we want to take in the text from the questions and use it to draw conclusions, we need it to be uniform throughout so we can decipher it with code. So we need to normalize all of the question text.

In [3]:
import re

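def txt_normalizer(q):
    # Lowercase so that "Galileo" and "galileo" compare as equal
    q = q.lower()
    # Drop everything except letters, digits, and whitespace
    # (raw strings keep "\s" from being read as a string escape)
    q = re.sub(r"[^A-Za-z0-9\s]", "", q)
    # Collapse runs of whitespace into a single space
    q = re.sub(r"\s+", " ", q)
    return q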

jeopardy['clean_question'] = jeopardy['Question'].apply(txt_normalizer)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(txt_normalizer)
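
As a quick sanity check, the normalizer can be tried on a single string. This is a minimal sketch: the sample text is made up for illustration (loosely modeled on one of the questions above), and the function is repeated here with raw-string patterns so the snippet runs on its own.

```python
import re

def txt_normalizer(q):
    q = q.lower()
    q = re.sub(r"[^a-z0-9\s]", "", q)  # strip punctuation
    q = re.sub(r"\s+", " ", q)         # collapse repeated whitespace
    return q

# Hypothetical input, not a row from the dataset
sample = 'In 1963,  live on "The Art Linkletter Show"!'
cleaned = txt_normalizer(sample)
print(cleaned)
```

Note that the function does not trim leading or trailing whitespace; a call to `.strip()` could be added if that matters for later comparisons.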

### Normalizing Other Categories

We have now cleaned our questions and answers: the text is lowercase, punctuation is removed, and runs of whitespace are collapsed into single spaces. However, other columns, namely "Value" and "Air Date", still have the wrong type. The former should be numeric (an integer), and the latter should be a datetime object.

In [4]:
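def dollar_normalizer(v):
    # Remove the dollar sign, commas, and any other punctuation
    v = re.sub(r"[^A-Za-z0-9\s]", "", v)
    try:
        v = int(v)
    except ValueError:
        # Non-numeric values such as "None" are treated as 0
        v = 0
    return v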

jeopardy['clean_value'] = jeopardy['Value'].apply(dollar_normalizer)
jeopardy['clean_date'] = pd.to_datetime(jeopardy['Air Date'])
print(jeopardy.head(5))
jeopardy.info()  # info() prints its summary directly and returns None

   Show Number    Air Date      Round                         Category Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  \
0  for the last 8 years of his life galil

Based on the first five rows and the info description, it seems the clean versions of all four columns work as intended.
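
To make the value conversion concrete, here is a small standalone sketch. The input strings are illustrative assumptions ("$2,000" in particular is a guess at how larger values are formatted), and the function is repeated so the snippet runs by itself.

```python
import re

def dollar_normalizer(v):
    # Strip "$", "," and other punctuation, then try to parse an integer
    v = re.sub(r"[^A-Za-z0-9\s]", "", v)
    try:
        return int(v)
    except ValueError:
        # e.g. the string "None" seen for Final Jeopardy! rows
        return 0

# Hypothetical sample inputs
samples = ["$200", "$2,000", "None"]
values = [dollar_normalizer(s) for s in samples]
print(values)
```

One consequence of this design is that questions with no dollar value end up with a value of 0, so they will later be grouped with the low value questions.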

## Is Jeopardy Repeating Old Questions?

We can't say for sure whether the show repeats questions, but we can check how much the questions in our dataset overlap. We can loop through the dataset in order of air date, compare the words in each new question against the words seen in all earlier questions, and then add the new question's words to the set of terms seen so far.

In [5]:
question_overlap = []
terms_used = set()
# Process questions chronologically so "previous" means "aired earlier"
jeopardy = jeopardy.sort_values('clean_date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    # Keep only words longer than five characters
    split_question = [w for w in split_question if len(w) > 5]

    match_count = 0
    match_rate = 0
    for w in split_question:
        # A word counts as a match if it has been seen before
        if w in terms_used:
            match_count += 1
        terms_used.add(w)

    # Guard against questions with no words longer than five characters
    if len(split_question) > 0:
        match_rate = match_count/len(split_question)
    question_overlap.append(match_rate)
jeopardy["question_overlap"] = question_overlap
print(jeopardy["question_overlap"].mean())

print(jeopardy.head(3))
print(jeopardy.tail(3))

0.6894031359073245
       Show Number    Air Date             Round         Category Value  \
19325           10  1984-09-21   Final Jeopardy!  U.S. PRESIDENTS  None   
19301           10  1984-09-21  Double Jeopardy!     LABOR UNIONS  $200   
19302           10  1984-09-21  Double Jeopardy!             1789  $200   

                                                Question              Answer  \
19325  Adventurous 26th president, he was 1st to ride...  Theodore Roosevelt   
19301           Notorious labor leader missing since '75         Jimmy Hoffa   
19302  Washington proclaimed Nov. 26, 1789 this first...        Thanksgiving   

                                          clean_question        clean_answer  \
19325  adventurous 26th president he was 1st to ride ...  theodore roosevelt   
19301            notorious labor leader missing since 75         jimmy hoffa   
19302  washington proclaimed nov 26 1789 this first n...        thanksgiving   

       clean_value clean_date  questio

This gives us, for each row, the fraction of the question's words (counting only words longer than five characters) that had already appeared in an earlier question. On average, 69% of each question's words had appeared previously.

We can see that at the beginning of the dataset, no words match. That makes sense, as the first questions aired had no previous questions to match. By the end, there are full matches.
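
The overlap calculation is easier to follow on a toy example. The sketch below applies the same logic to three made-up questions held in a plain list; the function name `overlap_rates` and the sample strings are hypothetical and not part of the original notebook.

```python
def overlap_rates(questions, min_len=6):
    """Return, per question, the share of long words already seen earlier."""
    seen = set()
    rates = []
    for q in questions:
        # Same filter as above: keep words longer than five characters
        words = [w for w in q.split(" ") if len(w) >= min_len]
        matches = 0
        for w in words:
            if w in seen:
                matches += 1
            seen.add(w)
        rates.append(matches / len(words) if words else 0)
    return rates

# Hypothetical questions, already normalized and in air-date order
toy = [
    "notorious labor leader missing",
    "labor leader elected president",
    "president elected notorious",
]
rates = overlap_rates(toy)
print(rates)
```

The first question has no earlier words to match, the second shares one of its three long words, and the third is made up entirely of words seen before. This also shows why a high overlap rate does not prove a repeated question: it only means the individual words have appeared earlier.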

## Focusing on High Value Questions

To be the most profitable on Jeopardy, it is most important to correctly answer the questions worth more money. We can test whether particular terms are associated with higher values using a chi-squared test. To start, let's sort the questions into two groups: high value and low value.

Questions go from 200 dollars to 2000 dollars, so we will designate values of 800 or lower as low, and values above 800 as high.

In [32]:
def high_low(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)

def h_l_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_q = row['clean_question'].split(" ")
        if word in split_q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return (high_count, low_count)

This function counts the number of times a word appears in high value questions and in low value questions, and returns both counts. We can see how it does with a random sample of terms from the dataset.
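
Because `h_l_count` scans every row for each term, it gets slow when many terms are checked. As an alternative, the counts for all terms could be collected in one pass. The sketch below is one possible approach using plain Python structures; the function name `count_terms_by_value` and the sample rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs."""
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() so a word repeated within one question is counted once,
        # matching the `word in split_q` membership check
        for word in set(question.split(" ")):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return high_counts, low_counts

# Hypothetical rows: (normalized question text, high_value flag)
rows = [
    ("notorious labor leader", 1),
    ("labor unions strike", 0),
    ("labor leader labor", 0),
]
high, low = count_terms_by_value(rows)
print(high["labor"], low["labor"])
```

With the DataFrame used here, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`, and the pair `(high[term], low[term])` would then play the role of `h_l_count(term)`.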

In [33]:
import random

# random.sample needs a sequence; sampling directly from a set fails on Python 3.11+
comparison_terms = random.sample(list(terms_used), 10)
observed_expected = []

for term in comparison_terms:
    x = h_l_count(term)
    observed_expected.append(x)
    
print(comparison_terms)
observed_expected

['protectors', 'heyday', 'moravia', 'coverage', 'examples', 'guerrero', 'hirofumi', 'indents', 'muriel', 'bullfrog']


[(0, 2),
 (1, 0),
 (0, 1),
 (2, 0),
 (1, 8),
 (1, 0),
 (0, 1),
 (0, 1),
 (1, 2),
 (0, 1)]

## Applying Chi-squared Test

We now can look to see if the values from our sample are expected or not through a chi-squared test. We can find our expected values by looking at the number of high_value rows and low_value rows. 

In [34]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

5734


Next we run our chi-squared test. We can loop through our observed values, and sum the low counts and high counts to get the total counts for each term in our sample.

Then we can divide this total by the number of rows to get the proportion of questions in the dataset that contain the term.

Then we get the expected counts by multiplying that proportion by the number of high value rows and by the number of low value rows.

Lastly, we make the observed and expected values into arrays to enter them into the chi-square function to get our results.
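
The arithmetic behind these steps can be worked through by hand for a single term. The sketch below uses made-up round numbers (1,000 rows, 300 of them high value) rather than the real dataset counts, and computes the statistic directly instead of calling a library. For two categories there is one degree of freedom, so the p-value can be obtained from the complementary error function.

```python
import math

# Hypothetical counts, chosen only to keep the arithmetic easy to follow
total_rows = 1000
high_value_rows = 300
low_value_rows = total_rows - high_value_rows

observed = (2, 0)  # term seen in 2 high value and 0 low value questions

total = sum(observed)
total_prop = total / total_rows
expected = (total_prop * high_value_rows, total_prop * low_value_rows)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(stat / 2))

print(expected, stat, p_value)
```

Here the expected counts are 0.6 and 1.4, which shows how small the numbers are when a term appears only twice.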

In [36]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=1.3570460299240277, pvalue=0.24405008712856013),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

For our random sample, only one word ("coverage", p ≈ 0.026) showed a significant divergence between high and low value questions, but since it appears only twice in the dataset, this is hardly meaningful. Individual words do not occur often enough to draw conclusions about whether they are more or less common in high or low value questions. To draw better conclusions, terms should be grouped into phrases or categories, and the frequency at which any word within a group appears should be measured. In addition, the dataset should be expanded beyond the 20,000 rows used here.

## Conclusion

While this analysis has not given us the key to winning on Jeopardy, it has given us a way to quickly sift through a dataset to see whether words occur frequently, and it has shown us that words do indeed repeat across Jeopardy questions. To take this analysis further, the words could be grouped by category. Looking at questions with a high rate of overlap could give a sense of which questions are repeats, and some of these should be manually investigated to see whether they truly repeat.