In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


To begin the analysis, I imported pandas, read jeopardy.csv into a DataFrame, and previewed the first five rows.

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [4]:
import re

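def normalize_text(text):
    # Lowercase, strip punctuation, and collapse repeated whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip symbols such as "$" and ",", then convert to an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Non-numeric values become 0.
        text = 0
    return text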

jeopardy['Question_Clean'] = jeopardy['Question'].apply(normalize_text)
jeopardy['Answer_Clean'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['Value_Clean'] = jeopardy['Value'].apply(normalize_values)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Question_Clean,Answer_Clean,Value_Clean
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


To work with regular expressions, I imported the re library. Because the dataset has both text columns and a dollar-value column, I wrote two separate cleaning functions. normalize_text lowercases the string, uses re.sub to remove every character that is not a letter, digit, or whitespace, and then collapses runs of whitespace into a single space. As a result, the cleaned text columns contain only lowercase letters, numbers, and single spaces between words; punctuation is removed. normalize_values strips the same characters (such as the dollar sign and commas) and then tries to convert what remains to an integer. If the conversion fails because the value is not numeric, the try/except block stores 0 instead.
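As a quick check of the two helpers described above, the sketch below re-declares them and runs them on a few made-up strings. The sample inputs are assumptions for illustration and are not rows taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip symbols such as "$" and ",", then convert to int.
    # Anything that is not a whole number falls back to 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Made-up inputs, for illustration only.
sample_text = normalize_text("McDonald's:  the   Company!")
sample_values = [normalize_values(v) for v in ["$200", "$1,000", "None"]]

print(sample_text)    # mcdonalds the company
print(sample_values)  # [200, 1000, 0]
```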

In [5]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

Above, I converted the Air Date column to datetime values so that every date has the same type and the rows can later be sorted chronologically.

In [6]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
Question_Clean            object
Answer_Clean              object
Value_Clean                int64
dtype: object

In [7]:
def count_matches(row):
    split_answer = row["Answer_Clean"].split()
    split_question = row["Question_Clean"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

I wanted to measure how often the words of an answer also appear in its question, to see how often the answer can be read off from the question itself rather than requiring study. The count_matches function measures this overlap. It first splits the cleaned answer and the cleaned question into lists of words (split_answer and split_question). If the word 'the' appears in the answer, its first occurrence is removed, since it is too common to count as a meaningful match. If no words remain in the answer, the function returns 0, which also avoids dividing by zero in the last step. Otherwise, match_count starts at 0 and a for loop checks each word in split_answer; whenever that word also appears in split_question, match_count += 1 adds one to the running count. Finally, the function divides match_count by the length of split_answer, which gives the fraction of answer words that also occur in the question. Applying the function to every row with axis=1 stores the result in a new answer_in_question column.
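To make the calculation concrete, here is a small sketch that runs the same function on three hand-written rows. Plain dictionaries stand in for DataFrame rows, and the toy questions and answers are assumptions made up for this example.

```python
def count_matches(row):
    split_answer = row["Answer_Clean"].split()
    split_question = row["Question_Clean"].split()
    # Ignore "the" because it is too common to be a meaningful match.
    if "the" in split_answer:
        split_answer.remove("the")
    # Nothing left to compare: report 0 and avoid dividing by zero.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Toy rows (hypothetical); dicts support the same row["column"] lookups.
toy_rows = [
    {"Question_Clean": "this city is the capital of new york state",
     "Answer_Clean": "albany new york"},   # 2 of 3 answer words match
    {"Question_Clean": "this article is the most common english word",
     "Answer_Clean": "the"},               # empty after removing "the"
    {"Question_Clean": "the city of yuma is in this state",
     "Answer_Clean": "arizona"},           # no overlap
]

results = [count_matches(row) for row in toy_rows]
print(results)
```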

In [8]:
jeopardy['answer_in_question'].mean()

0.059001965249777744

The mean shows that, on average, only about 6% of the words in an answer also appear in its question. That is a low percentage, so the answer can rarely be deduced from the question alone, and studying will be required to answer most questions.

In [9]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["Question_Clean"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169776

After seeing that few answers can be found in their questions, I wanted to see how often the terms in a question have already appeared in earlier questions. To do this, I created an empty list, question_overlap, to hold one score per row in order, and an empty set, terms_used, to hold every term seen so far. A set stores each term only once, which keeps the collection of terms clean and easy to check against. I also sorted the dataset by Air Date so that each question is compared against the questions that come before it in chronological order.

The for loop uses iterrows to go through the dataset one row at a time, yielding the index (i) and the row. For each row, split_question is the Question_Clean value split on spaces, then filtered to keep only words longer than 5 characters. match_count starts at 0 for every row. The first inner loop adds 1 to match_count for each word in split_question that is already in terms_used. The second inner loop then adds every word in split_question to terms_used so that later questions can match against it. Finally, if split_question still contains at least one word, match_count is divided by the number of words in split_question, which turns the count into the fraction of the question's terms that were seen before.

Each row's score is appended to question_overlap, which keeps the scores in the same order as the sorted rows, and the list is then assigned to a new question_overlap column. Last but not least, I took the mean of that column to find the average overlap. As seen above, on average about 69% of the longer terms in a question have already appeared in an earlier question.
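The loop described above can be tried on a handful of questions without a DataFrame. The sketch below wraps the same logic in a helper; the function name question_overlap_scores, the min_len parameter, and the three toy questions are assumptions introduced for this example.

```python
def question_overlap_scores(questions, min_len=6):
    """Return, for each question, the fraction of its longer words
    that appeared in any earlier question (questions must be in order)."""
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words with more than 5 characters, as in the notebook.
        words = [w for w in question.split(" ") if len(w) >= min_len]
        # Count words already seen in earlier questions.
        match_count = sum(1 for w in words if w in terms_used)
        # Remember this question's words for later questions.
        terms_used.update(words)
        if len(words) > 0:
            match_count /= len(words)
        scores.append(match_count)
    return scores

# Toy, already-cleaned questions in chronological order (hypothetical).
toy_questions = [
    "this planet is closest to the sun",       # planet, closest: none seen yet
    "the largest planet in our solar system",  # largest, planet, system: 1 of 3 seen
    "name it",                                 # no long words, score stays 0
]

scores = question_overlap_scores(toy_questions)
print(scores)
```

The first question always scores 0 because terms_used starts empty, which is one reason the order of the rows matters.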

In [10]:
def determine_value(row):
    value = 0
    if row['Value_Clean'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

I also wanted to identify which questions are high value. Above, I wrote a function and applied it to each row to create a new high_value column that is 1 when Value_Clean is above 800 and 0 when it is 800 or below.

In [11]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["Question_Clean"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

To take the analysis further, I wrote a function that, for a given term, returns a tuple with the number of high-value and low-value questions that contain it. The function iterates over every row of the jeopardy dataset with iterrows. For each row, it splits Question_Clean on spaces and checks whether the term is among the words. A nested if then decides which counter to update: if the row's high_value is 1 (value above 800), high_count gains 1; otherwise, low_count gains 1. The function returns high_count and low_count, in that order.
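Because count_usage scans the whole dataset once per term, it can get slow when many terms are checked. One possible alternative, sketched here under assumptions, is to build the counts for every term in a single pass. The helper name build_usage_counts and the toy rows are hypothetical, and a plain list of (question, high_value) pairs stands in for the DataFrame.

```python
from collections import defaultdict

def build_usage_counts(rows):
    """Map each term to a (high_count, low_count) tuple in one pass.

    rows is an iterable of (question_clean, high_value) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # term -> [high_count, low_count]
    for question_clean, high_value in rows:
        # set() counts each question at most once per term,
        # matching the "term in list" check used by count_usage.
        for term in set(question_clean.split(" ")):
            if high_value == 1:
                counts[term][0] += 1
            else:
                counts[term][1] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Toy data (hypothetical): cleaned question text and its high_value flag.
toy_rows = [
    ("this planet has rings", 1),
    ("the largest planet", 0),
    ("planet planet", 0),
]

usage = build_usage_counts(toy_rows)
print(usage["planet"])  # (1, 2)
print(usage["rings"])   # (1, 0)
```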

In [12]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (1, 1),
 (1, 1),
 (3, 6),
 (0, 1),
 (0, 1),
 (1, 0)]

In [13]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.09564350170321084, pvalue=0.75712159875701),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
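To show what chisquare computes for each pair above, here is a minimal hand calculation for the two-category case. The totals used for high_value_count and low_value_count are hypothetical placeholders, not the values from this dataset, and chi_squared_two_bins is a helper name introduced only for this sketch. With two categories there is one degree of freedom, so the p-value can be written with math.erfc.

```python
import math

def chi_squared_two_bins(observed, expected):
    # Statistic: sum of (observed - expected)^2 / expected over the categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-squared survival function
    # equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, chosen only to keep the arithmetic readable.
high_value_count = 6000
low_value_count = 14000
n_rows = high_value_count + low_value_count

# One term seen in 0 high-value and 1 low-value question.
obs = (0, 1)
total_prop = sum(obs) / n_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_two_bins(obs, expected)
print(statistic, p_value)
```

When a term appears in only one or two questions, the expected counts are well below 1. The chi-squared approximation is generally considered unreliable at such low expected frequencies, so results for rare terms should be read with caution.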