<h1 style="font-size:220%"> Analyzing Jeopardy Data Using Chi-Squared Test</h1>

***

Jeopardy is a popular US TV quiz show in which contestants answer questions to win money. It has been running for many years and is a major force in popular culture. In each game, contestants are presented with trivia clues phrased as answers, and they must respond in the form of a question that correctly identifies whatever the clue describes. This project looks for patterns that could improve a contestant's chances of winning. The dataset can be downloaded [here (last updated in October 2021)](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

<h1 style="font-size:160%"> TABLE OF CONTENTS </h1>

<a id="0"></a>

>- [Read CSV File](#1)
>- [Summary of Columns](#2)
>- [Normalizing text](#3)
>- [Study](#4)
>    - [A) Answers in Questions](#5)
>    - [B) Repeated Words in Questions](#6)
>- [Identifying High Value and Low Value](#7)
>- [Chi-Squared Test](#8)
>- [Identifying High Frequencies](#9)
>- [Chi-Squared Test with High Frequencies](#10)
>- [Conclusion](#11)

<a id="1"></a>
<h1 style="font-size:160%"> Read CSV File</h1>

***

In [1]:
import pandas as pd
import csv

jeopardy = pd.read_csv("JEOPARDY_CSV.csv")
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...,...,...,...,...,...
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo


<a id="2"></a>
<h1 style="font-size:160%"> Summary of Columns</h1>

***

|Column|Description|
|---|:------------------|
|index|Row number|
|Show Number|The code for each episode|
|Air Date|The date the episode aired|
|Round|The round of Jeopardy in which the question was asked|
|Category|Type of the question|
|Value|Amount of money the question is worth|
|Question|Text of the question|
|Answer|Text of the answer|

Remove the leading spaces from the column names and replace the remaining spaces with underscores:

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = ['Show_Number', 'Air_Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
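
The hard-coded list works for this file. As a minimal sketch, the same cleanup could also be derived from the existing names, which avoids retyping them if the columns ever change. The helper name `clean_column_name` is hypothetical and not part of the original notebook.

```python
def clean_column_name(name):
    """Strip surrounding whitespace and replace inner spaces with underscores."""
    return name.strip().replace(" ", "_")

# Column names as they appear in the raw CSV (note the leading spaces)
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

cleaned_columns = [clean_column_name(c) for c in raw_columns]
print(cleaned_columns)
```

With the DataFrame loaded, `jeopardy.columns = [clean_column_name(c) for c in jeopardy.columns]` would apply the same rule.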

<a id="3"></a>
<h1 style="font-size:160%"> Normalizing text</h1>


Before analyzing, the text needs to be normalized so that it is easier to work with. For example, all text is converted to lowercase, punctuation is removed, and the dollar sign is stripped from the Value column so the values can be treated as integers.

In [4]:
import re

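def norm_text(text):
    # Lowercase, drop punctuation, and collapse runs of whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def norm_values(text):
    # Strip "$" and "," and convert to int; non-numeric or missing values become 0
    try:
        return int(re.sub(r"[^A-Za-z0-9\s]", "", text))
    except (ValueError, TypeError):
        return 0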
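
As a quick sanity check, the two helpers can be tried on a few sample strings before applying them to the whole DataFrame. The functions are restated in compact form so the snippet runs on its own. The sample inputs are illustrative.

```python
import re

def norm_text(text):
    # Lowercase, drop punctuation, collapse whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

def norm_values(text):
    # Convert "$2,000"-style strings to int; anything non-numeric becomes 0
    try:
        return int(re.sub(r"[^A-Za-z0-9\s]", "", text))
    except (ValueError, TypeError):
        return 0

sample_answer = norm_text("McDonald's")
sample_question = norm_text('In 1963, live on "The Art Linkletter Show"')
sample_value = norm_values("$2,000")
missing_value = norm_values("None")
print(sample_answer, "|", sample_question, "|", sample_value, "|", missing_value)
```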

In [5]:
jeopardy["Answer"]=jeopardy['Answer'].astype(str)
jeopardy["clean_question"] = jeopardy["Question"].apply(norm_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(norm_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(norm_values)
jeopardy["Air_Date"]= pd.to_datetime(jeopardy["Air_Date"])
jeopardy.head()

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [6]:
jeopardy.dtypes

Show_Number                int64
Air_Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

<a id="4"></a>
<h1 style="font-size:160%"> Study</h1>

To decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:

>- How often the answer can be deduced from the question.
>- How often questions are repeated.

<a id="5"></a>
<h1 style="font-size: 135%">A) Answers in Questions</h1>

In [7]:
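def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if len(split_answer) == 0:
        return 0
    # Both branches return on the first iteration, so only the first
    # word of the answer is compared against the question.
    for item in split_answer:
        if item in split_question:
            return True
        else:
            return False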
a = jeopardy.apply(count_matches, axis=1)

In [8]:
a = a[a==True]

In [9]:
a

5         True
6         True
11        True
14        True
18        True
          ... 
216871    True
216877    True
216886    True
216898    True
216908    True
Length: 25782, dtype: object

There are 25,782 question-answer pairs that matched.<br>
Inspect a few of them to look for unwanted results.

In [10]:
jeopardy.loc[5,"Question"],jeopardy.loc[5,"Answer"]

('In the title of an Aesop fable, this insect shared billing with a grasshopper',
 'the ant')

In [11]:
jeopardy.loc[31,"Question"],jeopardy.loc[31,"Answer"]

('It can be a place to leave your puppy when you take a trip, or a carrier for him that fits under an airplane seat',
 'a kennel')

Many of these matches are due to common words such as 'the' and 'a'.<br>
#### Introducing stopwords to eliminate all the common words

In [12]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cyeri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\cyeri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Create a function that filters stopwords out of the answer and calculates the share of the remaining answer words that are found in the question.
#### Apply the function to every row of the data

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stopword set once rather than on every row
stop_words = set(stopwords.words('english'))

def count_matches(row):
    question_tokens = set(word_tokenize(row["clean_question"]))
    answer_tokens = word_tokenize(row["clean_answer"])
    filtered_answer = [w for w in answer_tokens if w not in stop_words]

    if len(filtered_answer) == 0:
        return 0

    # Share of non-stopword answer tokens that also appear in the question
    match_count = sum(1 for item in filtered_answer if item in question_tokens)
    return match_count / len(filtered_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [14]:
jeopardy["answer_in_question"].mean()

0.038456264050234445

On average, about 3.8% of the non-stopword words in an answer also appear in its question. The answer can rarely be read off the question, so the chance of winning without preparing for the game is rather low.
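
To make the ratio concrete, here is a small self-contained sketch of the same idea. It uses plain whitespace splitting and a tiny hand-written stopword set as stand-ins for NLTK (both are assumptions for illustration), and the function name `answer_overlap` is hypothetical. The first example is the Aesop row from the dataset; the other two are made up.

```python
# A tiny stand-in for NLTK's English stopword list (assumption for this sketch)
STOP_WORDS = {"the", "a", "an", "of", "in", "this", "with"}

def answer_overlap(clean_question, clean_answer):
    """Share of non-stopword answer words that also appear in the question."""
    question_tokens = set(clean_question.split())
    answer_tokens = [w for w in clean_answer.split() if w not in STOP_WORDS]
    if not answer_tokens:
        return 0
    matches = sum(1 for w in answer_tokens if w in question_tokens)
    return matches / len(answer_tokens)

# "the" is filtered out and "ant" does not appear in the question
aesop = answer_overlap(
    "in the title of an aesop fable this insect shared billing with a grasshopper",
    "the ant",
)
# Made-up rows: one of two answer words matches, then all three match
partial = answer_overlap("this minster stands in an english city", "york minster")
full = answer_overlap("this city in new york hosts the us open", "new york city")
print(aesop, partial, full)
```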

<a id="6"></a>
<h1 style="font-size: 135%">B) Repeated Words in Questions</h1>

Another approach is to check whether certain terms are repeated across many questions. Omitting words of five characters or fewer, in addition to stopwords, helps keep the results to more specific terminology.

In [15]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air_Date")

stop_words = set(stopwords.words('english'))

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only longer, non-stopword terms
    split_question = [w for w in split_question if len(w) > 5 and w not in stop_words]
    # Count terms already seen in earlier questions before adding this question's terms
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

In [16]:
jeopardy.question_overlap.mean()

0.8704485294555635

On average, about 87% of the longer, non-stopword terms in a question had already appeared in an earlier question in the dataset.
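
The overlap measure can be traced by hand on a toy example. The sketch below follows the same order-dependent logic (count matches against earlier questions, then add the current terms), but omits the stopword filter for brevity. The function name `overlap_scores` and the three questions are made up for illustration.

```python
def overlap_scores(questions, min_len=6):
    """For each question, in order, the share of its long words seen in earlier questions."""
    seen = set()
    scores = []
    for q in questions:
        words = [w for w in q.split() if len(w) >= min_len]
        # Compare against earlier questions only, then record this question's words
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        scores.append(matches / len(words) if words else 0)
    return scores

toy_questions = [
    "this ancient empire built roads across europe",  # nothing seen yet
    "the capital of this ancient kingdom",            # "ancient" was seen before
    "a river",                                        # no long words at all
]
scores = overlap_scores(toy_questions)
print(scores)
```

The second question has three long words ("capital", "ancient", "kingdom"), one of which appeared earlier, so its score is one third.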

<a id="7"></a>
<h1 style="font-size:160%"> Identifying High Value and Low Value</h1>

To improve the chances of winning, it helps to tell high-value questions from low-value ones. We can then use a chi-squared test to identify the terms associated with high-value questions. To do that:

First, split the rows into two categories:
> - Low value: rows where `clean_value` is 800 or less
> - High value: rows where `clean_value` is more than 800

Second, create a function that counts how many high-value and how many low-value questions contain a given term.

Third, apply the function to a random sample of 10 terms from `terms_used` and store the results.

In [17]:
def find_value(row):
    # 1 for high-value questions (more than $800), 0 otherwise
    return 1 if row["clean_value"] > 800 else 0

jeopardy["high_value"] = jeopardy.apply(find_value,axis =1)

In [18]:
def value_count(w):
    # Match the term as a whole word; re.escape guards against regex metacharacters
    pattern = r"\b{}\b".format(re.escape(w))
    # Evaluate the text match once and reuse it for both counts
    has_term = jeopardy['clean_question'].str.contains(pattern, regex=True)
    high_count = (has_term & (jeopardy['high_value'] == 1)).sum()
    low_count = (has_term & (jeopardy['high_value'] == 0)).sum()
    return w, high_count, low_count

In [19]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(value_count(term))

<a id="8"></a>
<h1 style="font-size:160%"> Chi-Squared Test</h1>

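Fourth, count the total number of high-value and low-value rows. For each term, the observed values are its high-value and low-value counts, and the expected values are the counts it would have if it were spread across the two groups in proportion to their sizes. These feed the chi-squared test.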
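
The calculation for a single term can be sketched with the standard library alone. The counts below are hypothetical, not taken from the dataset, and the function name is made up. The p-value uses the closed form for one degree of freedom, which applies here because there are only two categories (high and low).

```python
import math

def chi_squared_two_groups(obs_high, obs_low, total_high, total_low):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total_rows = total_high + total_low
    # Share of all questions that contain the term
    term_prop = (obs_high + obs_low) / total_rows
    # Counts expected if the term were spread proportionally across both groups
    exp_high = term_prop * total_high
    exp_low = term_prop * total_low
    chi2 = ((obs_high - exp_high) ** 2 / exp_high
            + (obs_low - exp_low) ** 2 / exp_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: a term seen 30 times in high-value and 70 times in
# low-value questions, in a dataset of 2,000 high-value and 8,000 low-value rows
chi2, p = chi_squared_two_groups(30, 70, 2000, 8000)
print(round(chi2, 3), round(p, 4))
```

Here the expected counts are 20 and 80, so the statistic is 100/20 + 100/80 = 6.25, with a p-value of about 0.012.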

In [20]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

observed_expected =list(observed_expected)
df = pd.DataFrame(observed_expected,columns = ['word',"high_value","low_value"])
df["total"] = df["high_value"]+df["low_value"]
df

Unnamed: 0,word,high_value,low_value,total
0,hangin,2,2,4
1,hrefhttpwwwjarchivecommedia20040528dj20jpg,0,1,1
2,hrefhttpwwwjarchivecommedia20060630dj28jpg,1,0,1
3,hrefhttpwwwjarchivecommedia20080602dj18mp3dag,1,0,1
4,icedancing,1,0,1
5,xiaoping,1,2,3
6,negron,0,1,1
7,hrefhttpwwwjarchivecommedia20091209dj26wmvalex,0,1,1
8,moranis,1,0,1
9,saluton,1,0,1


In [21]:
def cal_chi_square(row):
    chi_squared = []
    total_prop = row['total']/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([row['high_value'], row['low_value']])
    expected = np.array([high_value_exp, low_value_exp])
    
    chi_value, p_value = chisquare(observed, expected)
    
    chi_squared.append((row['word'], chi_value, p_value, row['high_value'], row['low_value']))
    return chi_squared
                         
    
chi_squared = df.apply(cal_chi_square, axis = 1)
chi_squared


0    [(hangin, 0.9267728889671603, 0.33570289422995...
1    [(hrefhttpwwwjarchivecommedia20040528dj20jpg, ...
2    [(hrefhttpwwwjarchivecommedia20060630dj28jpg, ...
3    [(hrefhttpwwwjarchivecommedia20080602dj18mp3da...
4    [(icedancing, 2.5317964247338085, 0.1115731283...
5    [(xiaoping, 0.03723409388907139, 0.84698921448...
6    [(negron, 0.3949764642333513, 0.52969509124866...
7    [(hrefhttpwwwjarchivecommedia20091209dj26wmval...
8    [(moranis, 2.5317964247338085, 0.1115731283816...
9    [(saluton, 2.5317964247338085, 0.1115731283816...
dtype: object

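None of these terms has a p-value below 0.05, so there is no significant difference in how often they appear in high-value and low-value questions. The result is not reliable in any case: each term occurs only 1 to 4 times, so the expected counts are far below the five or so per cell that the chi-squared approximation is usually taken to need. The test should be more informative for terms with higher frequencies.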

<a id="9"></a>
<h1 style="font-size:160%"> Identifying High Frequencies</h1>

Counting high-value and low-value occurrences for every term is slow, so a lighter counter that only records how many questions contain each term is used to find the frequent terms first.

In [23]:
terms_used_sr = pd.Series(list(terms_used))
terms_used_sr

0                                       nominations
1                                           ardenne
2        hrefhttpwwwjarchivecommedia20110112j10ajpg
3                                          petunias
4                                            lawful
                            ...                    
99893                              targetblankhawka
99894                               vixenoffroadcom
99895                                      redskins
99896                                       glasser
99897                                        deeper
Length: 99898, dtype: object

In [127]:
def freq_counter(w):
    pattern = r"\b{}\b".format(w)
    # Use the vectorized Series.sum() rather than Python's built-in sum()
    freq = jeopardy['clean_question'].str.contains(pattern, regex = True).sum()
    return w,freq

Running the counter over all 99,898 terms would take too long for the scope of this project, so we proceed with 5,000 randomly chosen terms.

In [139]:
from random import choice

freq_comparison_terms = pd.Series([choice(terms_used_list) for _ in range(5000)])

In [140]:
frequency = freq_comparison_terms.apply(freq_counter)
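
One possible way to avoid the sampling step, offered as a sketch rather than the method used here, is to count every term in a single pass over the questions. Because the cleaned questions contain only letters, digits, and whitespace, whole-word matching should agree with membership in the whitespace-split words. The function name `question_frequencies` and the toy questions are hypothetical.

```python
from collections import Counter

def question_frequencies(clean_questions):
    """Count, for every word, how many questions contain it at least once."""
    counts = Counter()
    for q in clean_questions:
        # set() so a word repeated inside one question is counted once
        counts.update(set(q.split()))
    return counts

toy_questions = [
    "this ancient city was the center of an ancient empire",
    "the center of the earth",
    "an ancient river",
]
freq = question_frequencies(toy_questions)
print(freq["ancient"], freq["center"], freq["river"])
```

On the real data, `question_frequencies(jeopardy["clean_question"])` would give the frequency of every term at once, and the result could then be filtered down to the terms in `terms_used`.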

The first test showed that the chi-squared test is not effective for low-frequency terms, so this test is limited to terms that occur more than 20 times.

In [196]:
high_freq_w = pd.DataFrame(list(frequency),columns = ['word','freq']).sort_values(by="freq",ascending=False)
high_freq_w = high_freq_w[high_freq_w["freq"]>20]
high_freq_w

Unnamed: 0,word,freq
1301,around,1777
4247,states,1359
385,ancient,912
3984,center,902
243,nation,854
...,...,...
1338,envelope,21
4286,highflying,21
1791,mideastern,21
4728,bridal,21


<a id="10"></a>
<h1 style="font-size:160%"> Chi-Squared Test with High Frequencies</h1>

In [202]:
high_freq_w_list = high_freq_w["word"]
high_low_value_table = high_freq_w_list.apply(value_count)
high_low_value_table = pd.DataFrame(list(high_low_value_table),columns=['word','high_value','low_value'])
high_low_value_table["total"] = high_low_value_table["high_value"]+high_low_value_table["low_value"]
high_low_value_table

Unnamed: 0,word,high_value,low_value,total
0,around,521,1256,1777
1,states,360,999,1359
2,ancient,292,620,912
3,center,291,611,902
4,nation,269,585,854
...,...,...,...,...
320,envelope,2,19,21
321,highflying,3,18,21
322,mideastern,8,13,21
323,bridal,4,17,21


In [205]:
high_freq_chi_squared = high_low_value_table.apply(cal_chi_square, axis = 1)
high_freq_chi_squared

0      [(around, 0.8840432228440693, 0.34709665951509...
1      [(states, 2.2279050234367053, 0.13553751336244...
2      [(ancient, 6.162328080925132, 0.01304993395438...
3      [(center, 6.924680318118202, 0.008501418791956...
4      [(nation, 4.267144686676575, 0.038856170473074...
                             ...                        
320    [(envelope, 3.653032823198812, 0.0559672105782...
321    [(highflying, 2.0361210587719096, 0.1536008974...
322    [(mideastern, 0.9898092208761973, 0.3197890070...
323    [(bridal, 0.8884257599609271, 0.34590426880223...
324    [(remedy, 2.0361210587719096, 0.15360089742564...
Length: 325, dtype: object

In [214]:
df = pd.DataFrame([c[0] for c in high_freq_chi_squared], 
                              columns = ['word', 'chi_squared', 'p_value', 'high_value', 'low_value'])
df

Unnamed: 0,word,chi_squared,p_value,high_value,low_value
0,around,0.884043,0.347097,521,1256
1,states,2.227905,0.135538,360,999
2,ancient,6.162328,0.013050,292,620
3,center,6.924680,0.008501,291,611
4,nation,4.267145,0.038856,269,585
...,...,...,...,...,...
320,envelope,3.653033,0.055967,2,19
321,highflying,2.036121,0.153601,3,18
322,mideastern,0.989809,0.319789,8,13
323,bridal,0.888426,0.345904,4,17


In [215]:
df["p_value"].mean()

0.418165706598223

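The overall results are still not ideal: the mean p-value is about 0.42. However, the test is clearly more informative than before, as several terms now have a p-value below 0.05 (for example 'ancient', 'center' and 'nation'). With 325 terms tested, roughly 16 would fall below 0.05 by chance alone, so individual results should be read with caution. The remaining weakness may be due to the 5,000-term limit set earlier, and the study may also need a better way to filter out generic words such as 'around' and 'center'.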

<a id="11"></a>
<h1 style="font-size:160%"> Conclusion</h1>

The project aimed to find patterns that would increase the chances of winning Jeopardy. After studying the data, we know that:
> - On average, about 3.8% of the non-stopword words in an answer also appear in the question, so the chance of winning without preparing for the game is rather low.
> - On average, about 87% of the longer, non-stopword terms in a question had already appeared in an earlier question.
> - To relate specific words to high-value questions, the study needs to be limited to high-frequency terms.

The next step after this project would be to extend the sample to the full list of terms, identify the words whose high-value occurrences exceed their low-value occurrences, and then find the categories or questions these terms fall into.