## Patterns in Jeopardy! Questions

Jeopardy the Tv Show has people come up to answer questions based on a variety of topics for cash prizes. This project looks into any patterns in the questions and the best methods to win.
Dataset gotten from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)

In [3]:
import pandas as pd

jeo_data = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])

In [4]:
jeo_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null datetime64[ns]
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


In [5]:
jeo_data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [6]:
jeo_data.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

## Data Cleaning

In [7]:
jeo_data.columns = jeo_data.columns.str.strip()

In [8]:
jeo_data.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [9]:
import re
def norm_string(text):
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    text = re.sub("\s+", " ", text)
    return text

In [10]:
jeo_data['clean_question'] = jeo_data['Question'].apply(norm_string)
jeo_data['clean_answer'] = jeo_data['Answer'].apply(norm_string)

In [11]:
j = ['Question', 'clean_question', 'Answer', 'clean_answer']
jeo_data[j].head(10)

Unnamed: 0,Question,clean_question,Answer,clean_answer
0,"For the last 8 years of his life, Galileo was ...",for the last 8 years of his life galileo was u...,Copernicus,copernicus
1,No. 2: 1912 Olympian; football star at Carlisl...,no 2 1912 olympian football star at carlisle i...,Jim Thorpe,jim thorpe
2,The city of Yuma in this state has a record av...,the city of yuma in this state has a record av...,Arizona,arizona
3,"In 1963, live on ""The Art Linkletter Show"", th...",in 1963 live on the art linkletter show this c...,McDonald's,mcdonalds
4,"Signer of the Dec. of Indep., framer of the Co...",signer of the dec of indep framer of the const...,John Adams,john adams
5,"In the title of an Aesop fable, this insect sh...",in the title of an aesop fable this insect sha...,the ant,the ant
6,Built in 312 B.C. to link Rome & the South of ...,built in 312 bc to link rome the south of ital...,the Appian Way,the appian way
7,"No. 8: 30 steals for the Birmingham Barons; 2,...",no 8 30 steals for the birmingham barons 2306 ...,Michael Jordan,michael jordan
8,"In the winter of 1971-72, a record 1,122 inche...",in the winter of 197172 a record 1122 inches o...,Washington,washington
9,This housewares store was named for the packag...,this housewares store was named for the packag...,Crate & Barrel,crate barrel


In [12]:
def norm_val(value):
    value = re.sub("[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
    return value

In [13]:
jeo_data['clean_value'] = jeo_data['Value'].apply(norm_val)

In [14]:
j = ['Value', 'clean_value']
jeo_data[j].head()

Unnamed: 0,Value,clean_value
0,$200,200
1,$200,200
2,$200,200
3,$200,200
4,$200,200


## How often the answer is deducible from the question.

In [18]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeo_data["answer_in_question"] = jeo_data.apply(count_matches, axis=1)

In [19]:
jeo_data['answer_in_question'].mean()

0.05900196524977763

In [26]:
mean_df = jeo_data[jeo_data["answer_in_question"] >= 0.6]
mean_df[['Category','Question','clean_answer','answer_in_question']]

Unnamed: 0,Category,Question,clean_answer,answer_in_question
232,ANGELS,"With an appropriate-sounding name, John Dye pl...",angel of death,0.666667
266,NOT A CURRENT NATIONAL CAPITAL,"Ljubljana, Bratislava, Barcelona",barcelona,1.000000
272,NOT A CURRENT NATIONAL CAPITAL,"Istanbul, Ottawa, Amman",istanbul,1.000000
278,NOT A CURRENT NATIONAL CAPITAL,"Sofia, Sarajevo, Saigon",saigon,1.000000
284,NOT A CURRENT NATIONAL CAPITAL,"Bucharest, Bonn, Bern",bonn,1.000000
290,NOT A CURRENT NATIONAL CAPITAL,"Belize City, Guatemala City, Panama City",belize city,1.000000
395,WORLD FACTS,"A humid city, Rio de Janeiro lies just north o...",the tropic of capricorn,0.666667
661,SHAKESPEAREAN LAST SCENES,"You could say this comedy ""ends well"" -- Helen...",alls well that ends well,0.600000
677,VERMONTERS,George Franklin Edmunds wrote most of this ant...,the sherman antitrust act,0.666667
815,THE 1890s,"Of ""Frankenstein"", ""The Invisible Man"" or ""Dra...",frankenstein created in 1818,0.750000


As we can see a number of questions that have their answers in them are questions that require choosing an option from a list of options.

## How often new questions are repeats of older questions?

In [31]:
question_overlap = []
terms_used = set()
jeo_data.sort_values(['Air Date'], ascending=True, inplace=True)
match_count = 0
for index, row in jeo_data.iterrows():
    split_question = row['clean_question'].split()
    for value in split_question:
        if len(value) < 6:
            split_question.remove(value)
            continue
        if value in terms_used:
            match_count += 1
        else:
            terms_used.add(value)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeo_data['question_overlap'] = question_overlap
print(jeo_data['question_overlap'].mean())

0.23690079733126823


About 23% percent of the questions in this dataset have been repeated previously. So the odds of winning can be increased a bit this way

## Investigating relationships between Questions and Value

We are doing this to investigate if there are some words in the answer who tend to be more related to high value questions rather than to low values ones - that is, a relationship between categorical variables.

In [33]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeo_data["high_value"] = jeo_data.apply(determine_value, axis=1)

def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeo_data.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [34]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (1, 1),
 (0, 1),
 (0, 2),
 (0, 1),
 (2, 2),
 (0, 2),
 (1, 3),
 (0, 1),
 (1, 2)]

In [35]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeo_data[jeo_data["high_value"] == 1].shape[0]
low_value_count = jeo_data[jeo_data["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeo_data.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]

## Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.