# Winning Jeopardy

[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV show in the US where participants answer questions to win money. In this the participants are actually given answer clues and they are supposed to somi up with the questions.It's been running for many years now and is a major force in popular culture.

Imagine that we want to compete on Jeopardy and we are looking for a way to win.In this project we will be dealing with a dataset that contains Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset can be downloaded from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [1]:
#IRead the dataset
import pandas as pd
data=pd.read_csv("jeopardy.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [2]:
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Below is the explaination of the columns:

| Column Name 	| Description                                       	|
|-------------	|---------------------------------------------------	|
| Show Number 	| the Jeopardy episode number                       	|
| Air Date    	| the date the episode aired                        	|
| Round       	| the round of Jeopardy                             	|
| Category    	| the category of the question                      	|
| Value       	| the number of dollars the correct answer is worth 	|
| Question    	| the text of the question                          	|
| Answer      	| the text of the answer                            	|


Looking at the column names we can see the columns have spaces.Let's try dealing with those spaces first.

In [3]:
data.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Removing the space in columns
data.columns=[x.replace(' ','') for x in data.columns]

In [5]:
data.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalization of Text Columns

Before we start doing any analysis on our columns we will need to normalize the text columns i.e. remove any punctuation and convert all the tect into either lower or upper case so that ABC and abc are not treated differently.

In [6]:
# Normalization of question and answer column
import re
def normalize_text(word):
    word=word.lower()
    word = re.sub("[^A-Za-z0-9\s]", "", word)
    word = re.sub("\s+", " ", word)
    return word

data["clean_question"]=data["Question"].apply(normalize_text)
    

In [7]:
data["clean_answer"]=data["Answer"].apply(normalize_text)

# Normalization of Other Columns

Now that the text columns are normalized we can look at some other columns which can be normalized.

The Value column should be numeric.It has a dollar sign.hence we will need to remove the same.

The Air Date column should also be a datetime not a string

In [8]:
def normalize_values(word):
    word= word.lower()
    word = re.sub("[^A-Za-z0-9\s]", "", word)
    word = re.sub("\s+", " ", word)
    '''
    Convert the word to an integer.
    If it throws an error then make it 0
    '''
    try:
        word=int(word)
    except Exception:
        word=0
    return word 

In [9]:
# Normalize the value column
data["clean_value"]=data["Value"].apply(normalize_values)

In [10]:
# Convert Air Date to datetime column

data["AirDate"]=pd.to_datetime(data["AirDate"])

# Explore past questions
In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

* How often the answer can be used for a question.
* How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question. 

We can answer the second question by seeing how often complex words (> 6 characters) reoccur.

In [11]:
# Write a function to determine how many times words in the question also occur
# in the answer.

def count_matches(row):
    split_answer=row["clean_answer"].split( )
    split_question=row["clean_question"].split( )
    match_count=0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer)==0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count+=1
    return match_count/len(split_answer)                       

In [12]:
#Count how many times terms in clean_answer occur in clean_question.
answer_in_question=data.apply(count_matches,axis=1)

In [13]:
# Mean of answer_in_question column
answer_in_question.mean()

0.05900196524977763

From the above we can see that in 5% of the questions and answers have similar words in both.

This can be taken as a input while coming up with the question formation in jeopardy. We can look for some hints in the answers to come up with the questions.

# Repetition of Questions

We want to Investigate how oftern new questions are repeats of older ones.
This cannot be completely answered because we only have about 10% of the full Jeopardy question dataset but we can only investigate this.

To do this we can:

* Sort jeopardy in order of ascending air date.
* Maintain a set called terms_used that will be empty initially.
* Iterate through each row of jeopardy.
* Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
* If it does, increment a counter.
* Add each word to terms_used.

In [14]:
# Writing the function using the above steps

question_overlap=[]
terms_used=set()

#Sort jeopardy by ascending date
data.sort_values("AirDate",inplace=True)

In [15]:
'''
We have already sorted columns as per date above.
So now we will iterate over those rows directly to check if words are repeated.
'''

for i,row in data.iterrows():
    split_question=row["clean_question"].split(' ')
    split_question=[k for k in split_question if len(k)>5]
    match_count=0
    for word in split_question:
        if word in terms_used:
            match_count+=1
    for word in split_question:
        terms_used.add(word)
    if len(split_question)>0:
        match_count/=len(split_question)
    question_overlap.append(match_count)


In [16]:
data["question_overlap"] = question_overlap

data["question_overlap"].mean()

0.6876260592169802

As per the above result there is almost 70% overlap of words in new questions and words in old questions.But we are only considering words and not phrases. Hence this might be really insignificant. However we can surely look into the repetition of questions side.

# High and Low Value Questions

We can consider focusing on the questions as per value and focus only on high value questions which will help us win more money on Jeopardy.

We can analyze the terms corresponding to high value questions using a chi squared test.

We can categorize the questions as below:

* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.

We will then be able to loop through each of the terms from the last screen, terms_used, and:

1. Find the number of low value questions the word occurs in.
2. Find the number of high value questions the word occurs in.
3. Find the percentage of questions the word occurs in.
4. Based on the percentage of questions the word occurs in, find expected counts.
5. Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest difference in usage between high and low value questions by selecting the words with the highest associated chi-squared values. Doing this for all words would tale significant time so we will only do it for a sample now.

In [30]:
# Assign the value as high or low

data["high_value"]=[1 if i>800 else 0 for i in data["clean_value"]]


In [32]:
# Create a function that counts the words in high and low count

def word_count(word):
    low_count=0
    high_count=0
    for i,row in data.iterrows():
        split_question=row["clean_question"].split(' ')
        if word in split_question:
            if row["high_value"]==1:
                high_count+=1
            else:
                low_count=1
    return high_count,low_count
                

In [37]:
'''
Randomly pick ten elements of terms_used and append them to a list called 
comparison_terms
'''
import random
comparison_terms=list(random.sample(terms_used,10))

In [38]:
# Create a list observed_expected to get observed values
observed_expected=[]

for i in comparison_terms:
    a=word_count(i)
    observed_expected.append(a)

In [39]:
observed_expected

[(0, 1),
 (1, 0),
 (0, 1),
 (2, 1),
 (0, 1),
 (3, 1),
 (2, 1),
 (0, 1),
 (1, 0),
 (1, 0)]

In [41]:
# Compute the expected counts and chi squared values
high_value_count=len(data["high_value"]==1)
low_value_count=len(data["high_value"]==0)


In [47]:
from scipy.stats import chisquare
import numpy as np
chi_squared=[]
for i in observed_expected:
    '''
    Add up both items in the list '''
    total=sum(i)
    '''
    Divide total by no of rows in data'''
    
    total_prop=total/data.shape[0]
    exp_high_val=total_prop*high_value_count
    exp_low_val=total_prop*low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([exp_high_val, exp_low_val])
    chi_squared.append(chisquare(observed, expected)) 

In [48]:
chi_squared

[Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=1.6666666666666665, pvalue=0.19670560245894675),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=2.5, pvalue=0.11384629800665763),
 Power_divergenceResult(statistic=1.6666666666666665, pvalue=0.19670560245894675),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404),
 Power_divergenceResult(statistic=1.0, pvalue=0.31731050786291404)]

From the above we can see that none of the p values are less than he threshold of 0.05.Hence none of the chi squared values are significant. Therefore there is no significant difference of usgae of words between high or low values.Also when we checked the freq we could see that they are very low(less than 5).With such a low frequency we cannot gurantee the significance of the test.