# Analyzing Jeoperdy questions for patterns #

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

In [94]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head(5))
print(jeopardy.columns)

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                                                                                      Question  \
0             For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory   
1  No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves   
2                     The city of Yuma in this state has a record average of 4,055 hours of sunshine each year   
3                         In 1963, live on "The Art Linkl

In [95]:
#Removing spaces from columns names
columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
jeopardy.columns = columns
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [96]:
#normalizing text columns

def norm_text(row):
    lower = row.lower()
    no_stops = lower.replace('.','')
    no_comma = no_stops.replace(',','')
    no_ap = no_comma.replace('\'','')
    no_q = no_ap.replace('"','')
    return no_q

clean_question = jeopardy['Question'].apply(norm_text)
clean_question.head(5)

0               for the last 8 years of his life galileo was under house arrest for espousing this mans theory
1    no 2: 1912 olympian; football star at carlisle indian school; 6 mlb seasons with the reds giants & braves
2                      the city of yuma in this state has a record average of 4055 hours of sunshine each year
3                             in 1963 live on the art linkletter show this company served its billionth burger
4          signer of the dec of indep framer of the constitution of mass second president of the united states
Name: Question, dtype: object

In [97]:
clean_answer = jeopardy['Answer'].apply(norm_text)
clean_answer.head(5)

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: Answer, dtype: object

In [98]:
#normalizing dollar values

def norm_val(row):
    no_p = row.replace('$','')
    no_c = no_p.replace(',', '')
    try:
        num = int(no_c)
    except:
        num = 0
    return num

clean_value = jeopardy['Value'].apply(norm_val)
jeopardy['clean_value'] = clean_value

In [99]:
#Converting Air Date column to datetime
jeopardy['AirDate'] = pd.to_datetime(jeopardy['Air Date'])

# Analysing questions and answers to: #
Answer the following questions :- 
- How often the answer can be used for a question?
- How often questions are repeated?

### How often the answer can be used for a question? ###


In [100]:
clean_answer = clean_answer.str.replace('the ','')
clean_answer = clean_answer.str.replace(' & ','')
jeopardy['clean_question'] = clean_question
jeopardy['clean_answer'] = clean_answer
split_answer = clean_answer.str.split(' ')
split_question = clean_question.str.split(' ')

answer_in_q = []
jeopardy['split_answer'] = split_answer
jeopardy['split_question'] = split_question

def match(row):
    match_count = 0
    if len(row['split_answer'])==0:
        return 0
    for item in row['split_answer']:
         if item in row['split_question']:
                match_count += 1
    return (match_count/len(row['split_answer']))
        

jeopardy['answer_in_question'] = jeopardy.apply(match,axis=1)
mean_answer_in_q = jeopardy['answer_in_question'].mean()
print('mean of answers in question',mean_answer_in_q)

mean of answers in question 0.05358149761744942


In [101]:
#checking if the matches returned are correct
pd.options.display.max_colwidth = 200
jeopardy[jeopardy['answer_in_question'] >0].head(5)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_value,AirDate,clean_question,clean_answer,split_answer,split_question,answer_in_question
14,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$600,"On June 28, 1994 the nat'l weather service began issuing this index that rates the intensity of the sun's radiation",the UV index,600,2004-12-31,on june 28 1994 the natl weather service began issuing this index that rates the intensity of the suns radiation,uv index,"[uv, index]","[on, june, 28, 1994, the, natl, weather, service, began, issuing, this, index, that, rates, the, intensity, of, the, suns, radiation]",0.5
24,4680,2004-12-31,Jeopardy!,HISTORY,$1000,"This Asian political party was founded in 1885 with ""Indian National"" as part of its name",the Congress Party,1000,2004-12-31,this asian political party was founded in 1885 with indian national as part of its name,congress party,"[congress, party]","[this, asian, political, party, was, founded, in, 1885, with, indian, national, as, part, of, its, name]",0.5
31,4680,2004-12-31,Double Jeopardy!,AIRLINE TRAVEL,$400,"It can be a place to leave your puppy when you take a trip, or a carrier for him that fits under an airplane seat",a kennel,400,2004-12-31,it can be a place to leave your puppy when you take a trip or a carrier for him that fits under an airplane seat,a kennel,"[a, kennel]","[it, can, be, a, place, to, leave, your, puppy, when, you, take, a, trip, or, a, carrier, for, him, that, fits, under, an, airplane, seat]",0.5
38,4680,2004-12-31,Double Jeopardy!,MUSICAL TRAINS,$800,"During the 1954-1955 Sun sessions, Elvis climbed aboard this train ""sixteen coaches long""","the ""Mystery Train""",800,2004-12-31,during the 1954-1955 sun sessions elvis climbed aboard this train sixteen coaches long,mystery train,"[mystery, train]","[during, the, 1954-1955, sun, sessions, elvis, climbed, aboard, this, train, sixteen, coaches, long]",0.5
53,4680,2004-12-31,Double Jeopardy!,MUSICAL TRAINS,$2000,"In 1961 James Brown announced ""all aboard"" for this train","""Night Train""",2000,2004-12-31,in 1961 james brown announced all aboard for this train,night train,"[night, train]","[in, 1961, james, brown, announced, all, aboard, for, this, train]",0.5


## Are the questions likely to have the answers in them? ##

Around 5.3% of the answers in the dataset occur in the questions. This tells us that **we cannot reply on the probability that the questions themselves will contain the answers in them**. We will have to study.

## How often questions are repeated? ##

In [102]:
#sort the dataset in ascending order by date

jeopardy.sort_values(by='Air Date')
questions_overlap = []
terms_used = set()

def six_char(row):
    n_row = []
    match_count = 0
    for item in row:
        if len(item) > 5:
            n_row.append(item)
            if item in terms_used:
                match_count += 1
            terms_used.add(item)
    if len(n_row) > 0:
        questions_overlap.append(match_count/len(n_row))
    else:
        questions_overlap.append(0)
    return n_row

jeopardy['split_question_six'] = jeopardy['split_question'].apply(six_char)
jeopardy['question_overlap'] = questions_overlap 

In [103]:
mean_questions = jeopardy['question_overlap'].mean()
print('The mean of question overlap is:', mean_questions)

The mean of question overlap is: 0.6763215188342797


### So are questions Recycled?  ###

67% of the questions are recycled. Since we have only used large terms to check, this doesn't mean much but it does give us incentive to dig deeper.

In [104]:
def hlvalue(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value']= jeopardy.apply(hlvalue, axis = 1)
    

In [105]:
def hlcount(word):
    global high_count
    global low_count
    high_count = 0
    low_count= 0
    for i,row in jeopardy.iterrows():
        if word[0] in row['split_question']:
            if row['high_value']==1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count


In [106]:
import random
comparison_terms = []
for i in range(0,10):
     comparison_terms.append(random.sample(terms_used,1))
observed_expected = []

for word in comparison_terms:
    observed_expected.append(hlcount(word))
        

In [107]:
print(observed_expected)

[(0, 1), (0, 4), (2, 1), (0, 1), (2, 4), (0, 1), (1, 2), (0, 1), (0, 2), (0, 1)]


In [108]:
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

In [109]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []
for i in observed_expected:
    lis_i = list(i)
    total_prop = sum(lis_i)/jeopardy.shape[0]
    obs = np.array([lis_i[0], lis_i[1]])
    h_tp = total_prop * high_value_count
    l_tp = total_prop * low_value_count
    exp = np.array([h_tp,l_tp])
    chi_squared.append(chisquare(obs,exp))
    

In [110]:
print(chi_squared)

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948), Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]


# Conclusion #

None of the terms had a significant difference in usage between high value and low value rows.The p-values being higher than 0.05 tells us that the results are statistically insignificant and we can reject the hypothesis that some terms are used more in high value counts or low value counts.