<h1>
<center>
Dataquest Guided Project 13:
Winning Jeopardy
</center>
</h1>

## Introduction

This is part of the Dataquest program.

- part of paths **Data Analyst in Python & Data Scientist in Python**
    - Step 5: **Probability and Statistics**
        - Course 2: **Probability and Statistics in Python: Intermediate**
            - Calculating probabilities
            - Probability distributions
            - Significance testing
            - Chi-squared tests
            - Multi-category chi-squared tests

As this is a guided project, we follow and extend the steps suggested by Dataquest. In this project, we will work with a dataset of Jeopardy questions to look for patterns in the questions.

## Use case: Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades and is a major force in popular culture. 

The dataset we'll be using contains the first 20,000 rows of a full dataset of Jeopardy questions, found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).
Each row in the dataset represents a single question on a single episode of Jeopardy. The columns are:

| Header | Definition   |
|------|------|
|   **Show Number**  | the Jeopardy episode number of the show this question was in|
|   **Air Date**  | the date the episode aired|
|   **Round**  | the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses|
|   **Category**  | the category of the question|
|   **Value**  | the number of dollars answering the question correctly is worth|
|   **Question**  | the text of the question|
|   **Answer**  | the text of the answer|

## Load the data

In [1]:
import pandas as pd
import re

In [2]:
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
jeopardy.columns = ['Show Number',
                    'Air Date',
                    'Round',
                    'Category',
                    'Value',
                    'Question',
                    'Answer']
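
The leading spaces in the raw headers could also be removed programmatically instead of retyping each name. A minimal sketch, using a plain list that mirrors the headers printed above (the name `raw_columns` is only illustrative; on the DataFrame, the same list comprehension over `jeopardy.columns` should give the same result):

```python
# Raw headers as printed by jeopardy.columns above (note the leading spaces).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# Strip surrounding whitespace from every header.
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```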

## Normalizing text columns

Before we can start analyzing the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns). The idea is to ensure that the words are all lowercase and the punctuation removed.

In [5]:
def normalize_text(text):
    # Lowercase the text, then drop every character that is not
    # a letter, a digit, or whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

In [6]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
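
To make the effect of the normalization concrete, here is a small standalone sketch that applies the same two steps to one sample string (the sample is adapted from the truncated question shown in the preview above):

```python
import re

def normalize_text(text):
    # Lowercase, then remove everything except letters, digits, and whitespace.
    text = text.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

sample = 'In 1963, live on "The Art Linkletter Show"'
normalized = normalize_text(sample)
print(normalized)  # in 1963 live on the art linkletter show
```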

## Normalizing the other columns

The "Value" column should be numeric so that we can manipulate it more easily; values that cannot be parsed as integers are set to 0. The "Air Date" column should be a datetime, not a string.

In [7]:
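def normalize_values(text):
    # Remove the dollar sign and thousands separators, e.g. "$2,000" -> "2000".
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Anything that is not a valid integer is treated as 0.
        text = 0
    return text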

In [8]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [9]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
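
As a quick check of both conversions, the sketch below runs the same logic on a few hand-written strings without pandas. The sample values (including `"None"` as a stand-in for an unparseable value) are assumptions for illustration, not rows taken from the dataset:

```python
import re
from datetime import datetime

def normalize_values(text):
    # Same logic as above: strip punctuation, then try to parse an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Hypothetical raw values, for illustration only.
samples = ["$200", "$2,000", "None"]
values = [normalize_values(s) for s in samples]
print(values)  # [200, 2000, 0]

# A date string in the same format as the "Air Date" column.
air_date = datetime.strptime("2004-12-31", "%Y-%m-%d")
print(air_date.year, air_date.month, air_date.day)  # 2004 12 31
```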

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often is the answer deducible from the question?
- How often are new questions repeats of older questions?

## How often is the answer deducible from the question?

To answer the first question, we'll measure what fraction of the words in each answer also occur in its question.

In [10]:
def question_match_answer(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    ratio = match_count/len(split_answer)
    return ratio

In [11]:
jeopardy["answer_in_question"] = jeopardy.apply(question_match_answer, axis=1)

In [12]:
jeopardy["answer_in_question"].mean()

0.060493257069335872

On average, only about 6% of the words in an answer also appear in its question. This means that hoping to guess the answer from the question is not a good strategy.
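
To illustrate how the ratio behaves, here is a standalone sketch of the same calculation on three made-up question/answer pairs. The function name `answer_overlap` and the sample strings are assumptions for illustration; the logic mirrors `question_match_answer` above:

```python
def answer_overlap(clean_question, clean_answer):
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    # "the" is too common to be meaningful, so drop it from the answer.
    if "the" in answer_words:
        answer_words.remove("the")
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Both remaining answer words appear in the question.
full = answer_overlap("this paris tower is named after engineer gustave eiffel",
                      "the eiffel tower")
# Only "john" appears in the question.
partial = answer_overlap("john quincy was the son of this president", "john adams")
# No answer word appears in the question.
none = answer_overlap("the city of yuma in this state has a record", "arizona")
print(full, partial, none)  # 1.0 0.5 0.0
```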

## How often are new questions repeats of older questions?

We would like to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset. As a proxy, we will look at how often long words (six or more characters) reoccur.

In [13]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only long words (six or more characters).
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words already seen in earlier questions.
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Only then record this question's words as seen.
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.69087373156719623

On average, about 69% of the long words in a question have already appeared in an earlier question. This covers only words of six or more characters, in a sample of about 10% of the full dataset, and it compares individual words rather than phrases, so two questions that share words can still mean very different things. The figure is therefore weak evidence on its own, but it does suggest that question recycling is worth investigating further.
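
The running-overlap idea can be traced on a tiny example. The sketch below applies the same logic to three made-up, already normalized questions; the function name `running_overlap` and the sample strings are assumptions for illustration:

```python
def running_overlap(clean_questions, min_length=6):
    seen_terms = set()
    overlaps = []
    for question in clean_questions:
        long_words = [w for w in question.split(" ") if len(w) >= min_length]
        # Share of this question's long words already seen in earlier questions.
        matches = sum(1 for w in long_words if w in seen_terms)
        overlaps.append(matches / len(long_words) if long_words else 0.0)
        # Update the vocabulary only after scoring the current question.
        seen_terms.update(long_words)
    return overlaps

toy_questions = [
    "the capital of france is famous for fashion",
    "this famous french capital hosts fashion week",
    "short words only here",
]
overlaps = running_overlap(toy_questions)
print(overlaps)  # [0.0, 0.75, 0.0]
```

The first question scores 0 because nothing has been seen yet, the second reuses three of its four long words, and the third has no long words at all.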

## Low value vs high value questions 

Now, we would like to focus on high-value questions to earn more money on Jeopardy.
We can figure out which terms correspond to high-value questions using a chi-squared test. We first need to split the questions into 2 categories:

- Low value: 800 or less
- High value: more than 800

The goal is to find the words whose usage differs most between high-value and low-value questions, by selecting the words with the highest chi-squared values. Doing this for every word would take a long time, so we'll do it for a small sample of terms.

In [14]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

In [15]:
jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [16]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Sets are unordered, so this picks an arbitrary sample of five terms.
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))
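
Calling `count_usage` once per term scans the whole dataset each time, which is why only a handful of terms is practical. One possible alternative, sketched here under the assumption that the rows are available as `(clean_question, high_value)` pairs, is to count all terms in a single pass. The function name and the toy rows are hypothetical:

```python
from collections import defaultdict

def count_usage_single_pass(rows, terms):
    """rows: iterable of (clean_question, high_value) pairs; terms: set of words to track."""
    counts = defaultdict(lambda: [0, 0])  # term -> [high_count, low_count]
    for clean_question, high_value in rows:
        # A set ensures each term is counted at most once per question.
        for term in set(clean_question.split(" ")) & terms:
            if high_value == 1:
                counts[term][0] += 1
            else:
                counts[term][1] += 1
    return {term: tuple(counts[term]) for term in terms}

# Made-up rows for illustration: (normalized question, high_value flag).
toy_rows = [
    ("this famous french capital hosts fashion week", 1),
    ("the capital of france is famous for fashion", 0),
    ("a famous painter from spain", 0),
]
usage = count_usage_single_pass(toy_rows, {"famous", "capital", "painter"})
print(usage["famous"], usage["capital"], usage["painter"])  # (1, 2) (1, 1) (0, 1)
```

With the DataFrame from this project, the rows could presumably be built with `zip(jeopardy["clean_question"], jeopardy["high_value"])`.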

In [17]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.026364433084407689, pvalue=0.87101348468892104),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]
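
To see what `chisquare` computes for each term, here is a hand-rolled sketch for the two-category case. All counts below are made up for illustration and do not come from the dataset. The statistic is the sum of (observed - expected)² / expected, and with two categories (one degree of freedom) the p-value can be written with the complementary error function:

```python
import math

def chi_squared_two_categories(observed, expected):
    # Pearson's chi-squared statistic.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical dataset: 100 questions, 25 high value and 75 low value.
total_rows = 100
high_value_count = 25
low_value_count = 75

# Hypothetical term seen in 4 high-value and 4 low-value questions.
observed = (4, 4)
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_two_categories(observed, expected)
print(round(statistic, 3))  # 2.667
print(p_value > 0.05)       # True
```

Here the term is expected in about 2 high-value and 6 low-value questions, so the statistic is 4/2 + 4/6, and the resulting p-value stays above 0.05.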

None of the sampled terms shows a statistically significant difference in usage between high-value and low-value questions: every p-value is above 0.05.
The chi-squared test is also less reliable when counts are low, so it would be better to run this test only on terms that occur more frequently.
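
One way to pick higher-frequency terms, sketched here as an assumption rather than as part of the guided steps, is to count in how many questions each long word appears and keep the most common ones. The function name and toy questions are hypothetical:

```python
from collections import Counter

def most_frequent_terms(clean_questions, n_terms=5, min_length=6):
    term_counts = Counter()
    for question in clean_questions:
        # Count each long word once per question.
        term_counts.update({w for w in question.split(" ") if len(w) >= min_length})
    return [term for term, _ in term_counts.most_common(n_terms)]

toy_questions = [
    "the capital of france is famous for fashion",
    "this famous french capital hosts fashion week",
    "a famous painter from spain",
]
top_terms = most_frequent_terms(toy_questions, n_terms=3)
print(top_terms[0])  # famous
```

The resulting list could then replace the arbitrary five-term sample taken from `terms_used`.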

We can extend the study with some potential additional steps:
- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. For instance:
    - Create a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see which terms have larger differences.
- Use phrases instead of single words when checking for overlap between questions, to capture more context.
- Use the entire Jeopardy dataset.
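
The first suggestion could look roughly like the sketch below, which combines a stopword list with a cap on how many questions a word may appear in. The function name, the stopword list, and the toy questions are all assumptions for illustration, and the threshold is raised to 50% because the toy sample is tiny:

```python
def informative_terms(clean_questions, stopwords, max_doc_fraction=0.05):
    n_questions = len(clean_questions)
    doc_counts = {}
    for question in clean_questions:
        # Count each word once per question (document frequency).
        for word in set(question.split()):
            doc_counts[word] = doc_counts.get(word, 0) + 1
    return {
        word
        for word, count in doc_counts.items()
        if word not in stopwords and count / n_questions <= max_doc_fraction
    }

toy_questions = [
    "this state has a famous canyon",
    "this river crosses egypt",
    "this author wrote hamlet",
    "a famous author from russia",
]
terms = informative_terms(toy_questions, stopwords={"a", "has", "from"},
                          max_doc_fraction=0.5)
print(sorted(terms))
```

In this toy sample, "this" is dropped because it appears in three of the four questions, while short but informative words such as "egypt" are kept.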