# Guided Project Winning Jeopardy 

 Jeopardy is a popular TV show in the US where participants answer questions to win money.It has been eunning for a few decades, and is a major force in popular culture.In this project we will the various ways anyone can have an edge to win. The dataset we willbe working with is called ```jeopardy.csv```, and contains 20000 rows from the beginning of a full dataset of jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)

In [1]:
import pandas as pd

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Each row in the dataset repesents a single question on a single episode of jeopardy. Here are explanations of each column:
- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question
- Value -- the number of dollars answwering the question correctly is worth


# Removing spaces from columns

In [4]:
jeopardy.columns = ['Show Number','Air Date','Round','Category','Value','Question','Answer']

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalizing the jeopardy questions and answers values

Before we can start doing analysis on the Jeopardy questions, we will need to normalize all of the text columns ( the Question and Answer columns).This is done to ensure that there are no discreprancies due to capitalization and punctuation. We will write a function that takes in a string, convert the string to lowercase, remove all punctuation in the string and return the string.We will use the apply function to apply the function to question column, answer column. We will write another function to help normalize the value column, we  write a function that takes in a string, remove anypuctuation in the string, convert the string to an integer,if the conversion has an error assign 0i 

In [6]:
# writing the function that will perform the operation
import re 
def normalize(string):
    string = string.lower()
    string = re.sub("[^A-Za-z0-9\s]","",string)
    return string
def normalize_values(string):
    string = re.sub("[^A-Za-z0-9\s]","",string)
    try:
        string =int(string)
    except Exception:
        string =0
    return string

In [7]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

In [8]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [9]:
jeopardy['clean_values'] = jeopardy['Value'].apply(normalize_values)

In [10]:
jeopardy['clean_question'].head(2)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
Name: clean_question, dtype: object

In [11]:
jeopardy['clean_answer'].head(2)

0    copernicus
1    jim thorpe
Name: clean_answer, dtype: object

In [12]:
jeopardy['clean_values'].head(2)

0    200
1    200
Name: clean_values, dtype: int64

In [13]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [14]:
jeopardy['Air Date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

In order to figure out whether to study past questions, study general knowledge, or not study it at all,it would be helpful to figure out two things:
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

The second answer can be answered by seeing how often complex words (>6 characters) reoccur. The first answer can be answered by seeing how many times words in the answer also occur in the question. We will work with the first question and come bak to the second later

Here we will write a function that takes in a row in jeopardy, as a series. It should:
- Split the clean_answer column around spaces and assign to the variable split_answer
- split the clean_question column around spaces and assign to the variable split_question.
- Create a variable called match_count and set it to 0
- If 'the' is in the split_answer remove i using the remove method on lits. The is commonly found in answers and questons, but dosen't have any meaningful use in finding the answer.
- If the lenght of split_answer is 0, return 0. This prevents a division by zero error later.
- Loop through each item in split_answer and see if it occurs in split_question. If it does add 1 to match_count
- Divide match_count by the legth of split_answer and return the result.

- Count how many times terms in clean_answer occur in clean_question
- Use the Pandas dataframe.apply method to apply the function to each row in jeopardy
- pass the axis = 1 argument to apply the function across each row.
- Assign the result to the answer_in_question column
- Find the mean of the answer_in_question column using the mean method on series

In [15]:
def match_rows(row):
    split_clean_answer = row['clean_answer'].split(" ")
    split_clean_question = row['clean_question'].split(" ")
    match_count = 0
    if 'the' in split_clean_answer:
        split_clean_answer.remove('the')
    if len(split_clean_answer) ==0:
        return 0
    if len(split_clean_answer)>0:
        for word in split_clean_answer:
            if word in split_clean_question:
                match_count += 1
    return match_count/len(split_clean_answer)
    

In [16]:
jeopardy['answer_in_question'] = jeopardy.apply(match_rows,axis=1)

In [17]:
jeopardy['answer_in_question'].mean()

0.06049325706933587

From the computed mean of words that occured in both clean_answer and cleaned question is about 0.06 that is about 6%. This is quite low and seems that it is rare to find words that repeats itself in the question 

Let us say that we want to invetigate how often new questions are repeats of older ones. We can not completely answer this, because we only have about 10% of the full Jeopardy question dataset,but we can investigate it at least. 
- To do this we will sort jeopardy in order of ascending air date
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy
- Split clean_question into words, remove any words shorter than 6 characters, check if word occurs in term_used
- If it does we increase a counter 
- add each word to terms_used.
This will enable us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like the and then, which are commonly used but do not tell us a lot about a question.



In [18]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values('Air Date')
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q)>5]
    match_count=0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question)>0:
        calculated_match = match_count/len(split_question)
    question_overlap.append(calculated_match)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.7019788296638052

From here we could see that there are about 71% of questions overlap i.e we had more question that were repeated
      

Let us say we only want to study questions that pertain to high value questions instead of low value questions. This will help  us to earn more money when we are on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We will first need to narrow down the questions into two categories:
- Low value -- Any row where value is less than 800
- High value-- Any row where value is greater than 800
- Find the perecentage of questions the word occur in,
- Based on the percentage of questions the word occurs in, find the expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
        
            

In [19]:
def high_value(row): # assign 1 of clean_values is >800 else 0
    if row['clean_values'] >800:
        value =1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(high_value,axis=1)


In [20]:
import numpy as np

In [21]:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for index,rows in jeopardy.iterrows():
        split_question = rows['clean_question'].split(" ")
        if word in split_question:
            if rows['high_value'] ==1:
                high_count +=1 
            else:
                low_count += 1
    return low_count,high_count
comparison_terms = [np.random.choice(list(terms_used)) for i in range(10)]
observed_expected = []
#print(comparison_terms)
for i in comparison_terms:
    observed_expected.append(high_low_count(i))

- above we created a function that takes in a row from a dataframe,
- if the clean_value column is greater than 800, assign 1 to value
- otherwise assign 0 to value
- return value
Determine which questions are high and low
- used the pandas dataframe.appply method to apply the fuction to each row in jeopardy
- pass the axis=1 argumant to apply the function across each row
- Assign the result to the high_value column
We created a function that takes in a word,
- Assigned 0 to low_count
- Assigned 0 to high_count
- Looped through each row in jeopardt using the itterows method
- split the clean_question column on the space character (" ")
- if the is in the split question:
= if the high_value column is 1add 1 to high_count
- else add 1 to low_count
- Returns high_count and low_count.
- Randomly pick ten elements of terms_used and append them to a list called comparison_terms
- created an empty list called observed_expected
- Looped through each term in comparison_term 
- ran the function on the term to get the igh value and low value counts
- Append the result of running the function( which will be a list) to observed_expected

In [22]:
observed_expected

[(1, 0),
 (1, 0),
 (2, 3),
 (1, 0),
 (1, 0),
 (1, 0),
 (2, 0),
 (31, 15),
 (3, 0),
 (1, 0)]

Now that we have found the observed counts fro a few terms,we can ccompute the expected counts and the chi-squared value

- We will find the number of rows in jeopardy where high_value is 1 and assign to high_value_count
- find the number of rows in jeopardy where high_value is 0 and assign to low_value_count
- create an empty list caled chi_squared
- Loop through each list called chi_squared
- Add up both items in the list(high and low counts) to get the total count, assign to total.
- Divide total by the number of rows in jeopardy to get the proportion across the dataset.Asign to total_prop
- Multiply total_prop by high_value_count to get the expected term count for high value rows
- Multiply total_prop by low_value_count to get the expected term count for low value rows
- use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts
- Append the result to 'chi_squared'

In [23]:
from scipy.stats import chisquare
high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])
chi_squared = []
for i in observed_expected:
    total = sum(list(i))
    total_prop = total/len(jeopardy)
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    observed = [i[0],i[1]]
    expected = [expected_high_count,expected_low_count]
    chisquare_hello = chisquare(observed,expected)
    chi_squared.append(chisquare_hello)
    

In [24]:
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=33.72195358703462, pvalue=6.357910423575818e-09),
 Power_divergenceResult(statistic=7.463376351587025, pvalue=0.006296679668748999),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

# Chisquared result 

From the chisquared result we could see that most of the p_values above the 0.05 threshold to reject any statistical significance,so it seems there is no statistical difference among the expected and observed high and low values on the premises of low frequency.

This shows that the high and low are different and they do not depend on each other to occur 