## Guided Project: To analyze jeopardy TV show

Jeopardy is a popular TV show in US where participants asks questions to win money. For this project we will work with a data set of Jeopardy questions to analyse and find some patterns in the question that could help to win. The dataset can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Below is explanation of each column.

* Show Number - the Jeopardy episode number
* Air Date - the date the episode aired
* Round - the round of Jeopardy
* Category - the category of the question
* Value - the number of dollars the correct answer is worth
* Question - the text of the question
* Answer - the text of the answer

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
#remove the spaces in the columns
jeopardy.columns = jeopardy.columns.str.lstrip()

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [6]:
import re

#convert to lowercase and remove puntuations
def normalize_string(string):
    string = string.lower()
    string = re.sub("[^A-Za-z0-9\s]", "", string)
    string = re.sub("\s+", " ", string)
    return string

jeopardy["clean_question"] = jeopardy['Question'].apply(normalize_string)
jeopardy["clean_answer"]= jeopardy['Answer'].apply(normalize_string)

In [7]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona


In [8]:
#function to replace the dollar sign in value column
def normalize_dollar(string):
    string = string.replace('$','')
    try:
        text = int(string)
    except Exception:
        text = 0
    return text
    
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)  

In [9]:
#change the air_date column to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [10]:
def split_text(row):
    split_answer= row['clean_answer'].split()
    split_question= row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count/len(split_answer)

jeopardy['answer_in_question']=jeopardy.apply(split_text, axis = 1)

In [11]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

In the aboove cells the texts in the clean_question and clean_answer columns were split and word 'the' was removed from split_answer variable.
If the length of the split_answer is equal, it was set to zero. Then each word in split_answer was checked in split_question to check how many times the words occured. The answer then was devided by the length, and the mean value was calculated.

### How often the questions will repeat

In [12]:
question_overlap = []
terms_used = set()
jeopardy.sort_values("Air Date", ascending = True)

for i,row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    
    #remove any words that are less than 6 charactors long
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count +=1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)    
           
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6908737315671962

we will narrow the questions into two categories:
* Low value - any row where value is less than 800
* High value - any row where value is greater than 800

In [13]:
def function(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy["high_value"] = jeopardy.apply(function, axis = 1)

In [14]:
def function2(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count +=1
            else:
                low_count +=1
    return high_count, low_count        

In [15]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for i in comparison_terms:
    observed_expected.append(function2(word))

In [16]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334),
 Power_divergenceResult(statistic=0.1347260871920751, pvalue=0.7135813314275334)]