# Winning Jeopardy
Jeopardy is a popular TV show in the United States where contestants answer questions to win money. It has been running successfully for a few decades. <br><br>
Let's pretend we want to compete on Jeopardy and are looking for any edge that can help us win. To that end, we will work with a dataset of past Jeopardy questions that could help us find patterns in the questions. The dataset is called jeopardy.csv and contains 19,999 rows. It can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

## Exploring the Dataset

In [2]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky
19998,3582,2000-03-14,Jeopardy!,LLAMA-RAMA,$200,Llamas are the heftiest South American members...,Camels


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces that need to be removed. Because the cell below replaces every space, it also turns `Show Number` into `ShowNumber`.

In [5]:
jeopardy.columns = jeopardy.columns.str.replace(' ','')

In [6]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
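As a small illustration of the difference between removing every space and trimming only the leading and trailing ones, here is a minimal sketch on a plain list of the column names printed above. The variable names are made up for this example. In pandas, `jeopardy.columns.str.strip()` would be the trimming equivalent.

```python
# Column names as printed by jeopardy.columns before cleaning
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# Removing every space (what the notebook does): 'Show Number' -> 'ShowNumber'
no_spaces = [name.replace(' ', '') for name in raw_columns]

# Trimming only the outer whitespace keeps the inner space: ' Air Date' -> 'Air Date'
stripped = [name.strip() for name in raw_columns]

print(no_spaces)
print(stripped)
```

The rest of this notebook relies on the space-free names (`ShowNumber`, `AirDate`), so the trimming variant is shown only for comparison.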

In [7]:
jeopardy['Value'].describe()

count     19999
unique       76
top        $400
freq       3892
Name: Value, dtype: object

In [8]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalizing text

To analyze the question and answer columns, we have to normalize the text. That means converting the text to lowercase and removing all punctuation.

In [9]:
# Function to normalize a string
import re
def normalize(string):
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)  # raw string so \s is not treated as a string escape
    return string

# Normalizing the question column
jeopardy['Question'] = jeopardy['Question'].apply(normalize)

# Normalizing the answer column
jeopardy['Answer'] = jeopardy['Answer'].apply(normalize)

In [10]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,signer of the dec of indep framer of the const...,john adams


We can see that the question and answer columns have been normalized.
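To see what the normalization does to a single string, here is a minimal standalone sketch. The sample sentence is made up for this example, and the function is a compact restatement of `normalize`, not a replacement for it.

```python
import re

def normalize(text):
    # Lowercase first, then drop everything that is not a letter, digit, or whitespace
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

sample = 'In 1963, live on "The Art Linkletter Show"'
print(normalize(sample))
print(normalize("McDonald's"))
```

Note that punctuation is deleted rather than replaced by a space, so a string such as `a & b` keeps two consecutive spaces after normalization.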

Next, we want to normalize the value and air date columns to make later analysis easier. The value column should be an integer without the dollar sign, and the air date should be a datetime rather than a string.

In [11]:
def dollar_norm(string):
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)  # drops "$" and "," but keeps digits
    try:
        string = int(string)
    except Exception:
        string = 0
    return string
jeopardy['Value'] = jeopardy['Value'].apply(dollar_norm)
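The following sketch shows the same idea on individual strings. `parse_dollars` is a hypothetical variant that keeps only digits instead of relying on a failed `int()` conversion, and the input `"None"` is an assumed example of a non-numeric entry.

```python
import re

def parse_dollars(text):
    # Keep digits only, so "$2,000" becomes "2000"
    digits = re.sub(r"[^0-9]", "", text)
    # Anything without digits (assumed here: "None") falls back to 0
    return int(digits) if digits else 0

for raw in ["$200", "$2,000", "None"]:
    print(raw, "->", parse_dollars(raw))
```

For inputs made of digits, dollar signs, and commas this agrees with `dollar_norm`. The two can differ on strings that mix letters and digits, which `dollar_norm` maps to 0.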

In [12]:
# Using pd.to_datetime to parse the air date; the result is stored in a new 'Air Date' column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['AirDate'])

In [13]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Air Date
0,4680,2004-12-31,Jeopardy!,HISTORY,200,for the last 8 years of his life galileo was u...,copernicus,2004-12-31
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,no 2 1912 olympian football star at carlisle i...,jim thorpe,2004-12-31
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,the city of yuma in this state has a record av...,arizona,2004-12-31
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,in 1963 live on the art linkletter show this c...,mcdonalds,2004-12-31
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,signer of the dec of indep framer of the const...,john adams,2004-12-31


In [14]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 8 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null int64
Question      19999 non-null object
Answer        19999 non-null object
Air Date      19999 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 1.2+ MB


We can see that the dollar signs have been removed and that the new `Air Date` column holds the air date as a datetime type. The original `AirDate` column is kept and is still a string.
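Since the string column `AirDate` is kept alongside the datetime column, it is worth checking that sorting by either one gives the same order. The sketch below uses only the standard library and three dates taken from the rows shown above. It illustrates why ISO-formatted `YYYY-MM-DD` strings sort chronologically.

```python
from datetime import date

# Air dates as they appear in the string column
air_dates = ["2004-12-31", "1984-09-21", "2000-03-14"]

# Parse the ISO strings into date objects
parsed = [date.fromisoformat(d) for d in air_dates]

# Sorting the raw strings and sorting the parsed dates yields the same order
sorted_as_text = sorted(air_dates)
sorted_as_dates = [d.isoformat() for d in sorted(parsed)]

print(sorted_as_text == sorted_as_dates)
```

This holds only because the year comes first and every field is zero-padded. A format such as `12/31/2004` would not sort correctly as text.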

## Answers in questions

Next, we want to start analyzing past questions. <br>First, we want to know how often the answer can be deduced from the question. To measure this, we check what fraction of the words in each answer also occur in the corresponding question. <br> Second, we want to analyze how often new questions repeat older questions. We can estimate this by checking how often complex words (6 or more characters) reoccur.

In [15]:
def matches_count(row):
    split_answer = row['Answer'].split(' ')    
    split_question = row['Question'].split(' ')
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the') # 'the' often does not have a meaningful use in answer
    if len(split_answer) == 0: 
        return 0 # returning 0 so we do not get an error later
    for word in split_answer:
        if word in split_question:
            match_count+=1
    return match_count/len(split_answer)

In [16]:
# Counting how many times terms in the answer column occur in the question column
jeopardy['answer_in_questions'] = jeopardy.apply(matches_count, axis=1)

In [17]:
jeopardy.head(10)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Air Date,answer_in_questions
0,4680,2004-12-31,Jeopardy!,HISTORY,200,for the last 8 years of his life galileo was u...,copernicus,2004-12-31,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,no 2 1912 olympian football star at carlisle i...,jim thorpe,2004-12-31,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,the city of yuma in this state has a record av...,arizona,2004-12-31,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,in 1963 live on the art linkletter show this c...,mcdonalds,2004-12-31,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,signer of the dec of indep framer of the const...,john adams,2004-12-31,0.0
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,200,in the title of an aesop fable this insect sha...,the ant,2004-12-31,0.0
6,4680,2004-12-31,Jeopardy!,HISTORY,400,built in 312 bc to link rome the south of ita...,the appian way,2004-12-31,0.0
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,400,no 8 30 steals for the birmingham barons 2306 ...,michael jordan,2004-12-31,0.0
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,400,in the winter of 197172 a record 1122 inches o...,washington,2004-12-31,0.0
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,400,this housewares store was named for the packag...,crate barrel,2004-12-31,0.333333


In [18]:
# Calculating the mean of the new column
mean_a_in_q = jeopardy['answer_in_questions'].mean()
mean_a_in_q

0.06049325706933587

In [19]:
jeopardy.sort_values('answer_in_questions', ascending=False).head(10)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Air Date,answer_in_questions
18566,4974,2006-04-06,Jeopardy!,WHICH TV SHOW CAME FIRST?,400,the drew carey show the bob newhart show the b...,the bob newhart show,2006-04-06,1.0
7591,4808,2005-06-29,Jeopardy!,NOT A POPE,200,urban vii julius ii irving iii,irving iii,2005-06-29,1.0
18074,3227,1998-09-22,Double Jeopardy!,PUT 'EM IN ORDER,800,ramses ii nebuchadnezzar king solomon,ramses ii king solomon nebuchadnezzar,1998-09-22,1.0
11508,4461,2004-01-19,Jeopardy!,NOT A NATIONAL CAPITAL,800,katmandu kabul karachi,karachi,2004-01-19,1.0
12492,4586,2004-07-12,Jeopardy!,BETTER ASK FOR DIRECTIONS,200,of north south east or west the direction to t...,north,2004-07-12,1.0
15396,4239,2003-01-23,Jeopardy!,STUPID ANSWERS,200,florence nightingale was born in this italian ...,florence,2003-01-23,1.0
12032,4693,2005-01-19,Jeopardy!,CHERCHEZ LA FEMME,1000,pd james pg wodehouse pj orourke,pd james,2005-01-19,1.0
18213,2853,1997-01-15,Jeopardy!,SILLY SONGS,200,of de do do do de da da da goo goo nee nee na...,goo goo,1997-01-15,1.0
290,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,2000,belize city guatemala city panama city,belize city,2006-02-06,1.0
284,4931,2006-02-06,Double Jeopardy!,NOT A CURRENT NATIONAL CAPITAL,1600,bucharest bonn bern,bonn,2006-02-06,1.0


On average, only about 6% of the words in an answer also appear in its question. This suggests that hoping to find the answer's words inside the question is probably not a winning strategy.
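One detail of the matching is worth a closer look: row 9 above gets a score of 0.333333 for what displays as a two-word answer, which suggests that `split(' ')` produced an empty token where punctuation was removed. The sketch below uses a hypothetical helper, `answer_overlap`, that calls `split()` without arguments to avoid empty tokens. Unlike `matches_count`, it also drops every occurrence of `the`, not only the first.

```python
def answer_overlap(question, answer):
    # split() without arguments collapses repeated whitespace, so no '' tokens
    answer_words = [w for w in answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Two spaces in a row produce an empty token with split(' ') but not with split()
print("crate  barrel".split(" "))
print("crate  barrel".split())

print(answer_overlap("urban vii julius ii irving iii", "irving iii"))
```

Switching the notebook to this variant would change the reported mean slightly, so the numbers in this section are left as computed by `matches_count`.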

Next, we want to answer our second question: how often are questions recycled over the years? Even though our dataset contains only about 10% of all Jeopardy questions, we should still be able to see whether there is a pattern of recycling questions.

In [20]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values("AirDate")
# Iterating over every row in jeopardy
for i, row in jeopardy.iterrows():
    split_question = row['Question'].split(" ")
    # Deleting all words that are shorter than 6 characters
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Calculating the number of matching words
    for w in split_question:
        if w in terms_used:
            match_count+=1
    # Adding every word to the set terms_used
    for w in split_question:
        terms_used.add(w)
    if len(split_question)>0:
        match_count /= len(split_question)
    question_overlap.append(match_count)


# Adding the question_overlap list to the dataframe
jeopardy['question_overlap'] = question_overlap
# Calculating the mean fraction of overlapping words per question
jeopardy['question_overlap'].mean()



0.6876260592169802

On average, around 70% of the complex words in a question had already appeared in an earlier question. Keep in mind that the dataset covers only a fraction of all Jeopardy questions and that we compared single words rather than whole phrases. Still, the overlap is high enough to justify a deeper look at whether there really is a pattern of recycling questions.
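The overlap calculation can be sketched on a plain list of questions, which makes it easier to follow what each score means. The function name `overlap_scores` and the three sample questions are made up for this example. The questions are assumed to be normalized already and sorted from oldest to newest.

```python
def overlap_scores(questions, min_len=6):
    seen = set()    # long words that appeared in earlier questions
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        # Fraction of this question's long words that were seen before
        scores.append(matches / len(words) if words else 0)
    return scores

sample_questions = [
    "notorious labor leader missing since 75",
    "this labor leader went missing in 1975",
    "famous canyon",
]
print(overlap_scores(sample_questions))
```

The earliest questions necessarily score low because `seen` starts empty, so the mean over the whole dataset slightly understates how often words recur later on.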

Let's say we only want to study questions that are worth a lot of money. To do so, we split the questions into two categories:
- Low value questions: Any row where the value is 800 or less
- High value questions: Any row where the value is greater than 800
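A quick boundary check makes the split explicit. The sketch below works on a bare number instead of a DataFrame row, and the `threshold` parameter is an addition for illustration.

```python
def value_category(value, threshold=800):
    # 1 = high value (strictly above the threshold), 0 = low value
    return 1 if value > threshold else 0

for value in [0, 200, 800, 1000, 2000]:
    print(value, value_category(value))
```

A value of exactly 800 lands in the low category, as do rows whose value was set to 0 during normalization. On the DataFrame itself, `(jeopardy['Value'] > 800).astype(int)` should give the same column without a row-wise `apply`.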

In [21]:
# Function that categorizes values: 1 for high value (> 800), 0 otherwise
def value_category(row):
    if row['Value'] > 800:
        return 1
    return 0

In [22]:
# Applying the function to the dataset
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)

In [28]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Air Date,answer_in_questions,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,0,adventurous 26th president he was 1st to ride ...,theodore roosevelt,1984-09-21,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,200,notorious labor leader missing since 75,jimmy hoffa,1984-09-21,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,200,washington proclaimed nov 26 1789 this first n...,thanksgiving,1984-09-21,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,200,both ferde grofe the colorado river dug this ...,the grand canyon,1984-09-21,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,200,depending on the book he could be a jones a sa...,tom,1984-09-21,0.0,0.0,0


In [24]:
# Function that takes in a word and returns the number of times it appears in higher value questions and lower value questions
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row["Question"].split(" "):
            if row['high_value']==1:
                high_count+=1
            else:
                low_count+=1
    return high_count, low_count

In [25]:
# Taking 5 terms from the terms_used set (sets are unordered, so which 5 we get is arbitrary)
comparison_terms = list(terms_used)[:5]
comparison_terms

observed_expected = []

In [26]:
# Applying the count_usage function on the 5 elements of the comparison_terms list
for word in comparison_terms:
    observed_expected.append(count_usage(word))

In [27]:
observed_expected

[(0, 1), (0, 1), (0, 2), (1, 0), (5, 10)]
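Calling `count_usage` once per word scans the whole DataFrame each time, which becomes slow for more than a handful of terms. The sketch below counts several terms in a single pass. It is a hypothetical helper that works on plain `(question, high_value)` pairs, and the sample rows are made up for this example.

```python
from collections import Counter

def count_usage_many(rows, terms):
    terms = set(terms)
    high = Counter()
    low = Counter()
    for question, high_value in rows:
        # Each term is counted at most once per question, as in count_usage
        for word in set(question.split()) & terms:
            if high_value == 1:
                high[word] += 1
            else:
                low[word] += 1
    # Same (high_count, low_count) order that count_usage returns
    return {term: (high[term], low[term]) for term in terms}

sample_rows = [
    ("famous canyon in arizona", 1),
    ("canyon carved by a river", 0),
    ("longest river in egypt", 0),
]
print(count_usage_many(sample_rows, ["canyon", "river"]))
```

With the real data, the pairs could be produced with `zip(jeopardy['Question'], jeopardy['high_value'])`.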

## Applying the chi-squared test

Now that we found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
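Before relying on `scipy`, it can help to see the arithmetic by hand. The sketch below computes expected counts and the chi-squared statistic with plain Python. The row counts 300 and 700 are hypothetical numbers chosen for easy arithmetic, not the counts from this dataset.

```python
def expected_counts(observed, high_rows, low_rows):
    # Share of all rows that contain the term, spread over both groups
    total = sum(observed)
    share = total / (high_rows + low_rows)
    return [share * high_rows, share * low_rows]

def chi_squared_stat(observed, expected):
    # Sum of (observed - expected)^2 / expected over the categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = (5, 10)                             # (high_count, low_count) for one term
expected = expected_counts(observed, 300, 700)  # hypothetical row counts
print(expected)
print(chi_squared_stat(observed, expected))
```

Here the term appears in 15 of 1,000 rows, so we would expect about 4.5 occurrences in the high value group and 10.5 in the low value group if the term were used independently of value.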

In [31]:
# Finding the number of rows where high_value == 1 and == 0
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

# Importing chisquare
from scipy.stats import chisquare
import numpy as np

chi_squared = []

# Using the observed_expected list to compute the chi-square value
for obs in observed_expected:
    # Total occurrences of the term across high and low value questions
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    # Using chisquare to calculate the chi-square values
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.15940583617201806, pvalue=0.6897041389348488)]

None of the words had a statistically significant p-value. One reason could be that the dataset is relatively small, so most of these words occur only a handful of times, and the chi-squared test is not reliable at such low frequencies. With these chi-squared calculations, we cannot find a statistically significant difference in word usage between low value and high value questions.
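To make the link between the statistic and the p-value concrete: with two categories the test has one degree of freedom, and in that case the p-value can be written with the complementary error function. The sketch below is a standard-library check of the statistics printed above, not a replacement for `scipy.stats.chisquare`. The function name is made up for this example.

```python
import math

def chi2_pvalue_df1(statistic):
    # Survival function of the chi-squared distribution with 1 degree of freedom
    return math.erfc(math.sqrt(statistic / 2))

# Statistics copied from the chisquare output above
statistics = [0.401962846126884, 0.803925692253768,
              2.487792117195675, 0.15940583617201806]

for stat in statistics:
    p = chi2_pvalue_df1(stat)
    print(round(stat, 4), round(p, 4), "significant" if p < 0.05 else "not significant")
```

At the usual 0.05 level, a statistic of roughly 3.84 or more would be needed with one degree of freedom, and none of the five terms comes close.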

## Conclusion
In this project we analyzed a dataset of Jeopardy questions. Using the