# Winning Jeopardy: Is There a Pattern?
## Introduction
Jeopardy is a popular TV show in the US in which contestants answer trivia questions to win money. If the concept is unfamiliar, a brief description can be found on the wiki: https://en.wikipedia.org/wiki/Jeopardy!

Let's say we've been asked to compete on the show. Is there an edge to be had? Are there really ways to "game" the game? Let's see if we can find any patterns within the questions to figure this out. This project uses a dataset (taken from reddit) that contains about 200000 Jeopardy questions, of which we take about 20000 in our own csv for faster analysis, still a decent sample size! Each row contains the following information:
* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

Let's get started!

## Exploration
First let's get familiar with what we're working with as usual:

In [1]:
# imports
import pandas as pd
import csv

# make our main df
jeopardy = pd.read_csv("jeopardy.csv")

# getting familiar
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# see the shape and columns
print(jeopardy.shape)
jeopardy.columns

(19999, 7)


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# fix the columns for easier analysis
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Normalizing Text
Before we can start on analysis, we need to normalize the text columns (questions and answers). The idea is to lowercase words and remove punctuation so that variants like "Don't" and "don't" are not treated as different words when analyzing.

In [4]:
# get re to convert
import re

# function to normalize the text
def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # raw string so \s is passed to re as-is
    return text

In [5]:
# clean up the questions and answers
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

In [6]:
# see our results
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
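
To make the effect of `normalize_text` concrete, here is a small standalone check. It redefines the function with a raw-string pattern and runs it on a few strings; the sample inputs are made up for illustration and are not rows from the dataset.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits, and whitespace."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

# Illustrative inputs only
samples = ["Don't", "don't", "McDonald's", "No. 2: 1912 Olympian"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Both spellings of "don't" collapse to the same token, which is exactly what we want before comparing words across questions.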


## Normalizing Values
Values are also important and should be normalized. The Value and Air Date columns are obvious candidates, and should be converted to numeric (with $ sign removed) and datetime, respectively. We can do that with the following:

In [7]:
# function to normalize the values
def normalize_values(text):
    # strip everything except letters, digits, and whitespace (this already removes "$" and ",")
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [8]:
# clean up the values
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [9]:
# clean up the air date column
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

# see the dtypes to confirm
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
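
One caveat with `normalize_values` is that anything that fails to parse silently becomes 0, which is indistinguishable from a real zero. A possible alternative, sketched below, returns `None` for unparseable values so they can be counted or dropped explicitly. The name `parse_dollars` and the sample strings (including the literal `"None"`) are assumptions for illustration, not taken from the notebook.

```python
import re

def parse_dollars(text):
    """Return the dollar amount as an int, or None if the text has no digits."""
    digits = re.sub(r"[^0-9]", "", str(text))
    return int(digits) if digits else None

# Hypothetical raw values
raw_values = ["$200", "$1,000", "None"]
parsed = [parse_dollars(v) for v in raw_values]
print(parsed)
```

Whether missing values should be treated as 0 or excluded depends on the analysis; for the high/low split used later, treating them as 0 puts them in the low-value group.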

## What to Study?
In order to figure out whether we should study past questions, general knowledge, not study at all, etc., it would be helpful to figure out:
* How often the answer is deducible from the question
* How often new questions are repeats of old questions

For the second question we will look at how often "complex" words (which we'll define as six or more characters) repeat across questions. For the first question, we can compute, for each row, the fraction of words in the answer that also occur in the question:

In [10]:
# function to count matches given a row
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    # remove common and 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

In [11]:
# apply and see results
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
jeopardy["answer_in_question"].mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in the question. That is too low to rely on: we can't expect to read the answer off the question, so we'll have to study rather than hope for giveaways. Let's move on to our second question and see whether it gives us anything more promising.
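
To see what the `answer_in_question` score measures, the sketch below applies the same logic to two invented question/answer pairs. `answer_overlap` is a hypothetical standalone version of `count_matches` that takes plain strings instead of a DataFrame row.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words found in the question, after dropping one 'the'."""
    answer_words = clean_answer.split(" ")
    question_words = set(clean_question.split(" "))
    if "the" in answer_words:
        answer_words.remove("the")  # mirrors count_matches: removes the first "the" only
    if not answer_words:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    return hits / len(answer_words)

# Invented examples: no overlap, then full overlap
no_overlap = answer_overlap("this state borders utah and nevada", "arizona")
full_overlap = answer_overlap("the grand canyon is found in this state", "the grand canyon state")
print(no_overlap, full_overlap)
```

A mean of about 0.06 over the whole dataset means that rows like the second example are rare.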

## Old/New Question Overlaps
We can now check how often new questions repeat terms from old ones. We only have a sample of all questions, but it can still serve as a representative subset. The idea is to go through the questions in air-date order while maintaining a set of terms seen so far: for each question, we count how many of its terms are already in the set and add its terms to the set. Note that the cell below iterates over the rows in their existing order and does not sort by air date first, so "old" here means "earlier in the file". As before, we only consider "complex" words of six or more characters.

In [12]:
# empty list and set
question_overlap = []
terms_used = set()

# iterate rows, splitting question
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # iterate words in question, setting matches
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    # set match count and append
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

# create new column 
jeopardy["question_overlap"] = question_overlap

# display the mean to see results
jeopardy["question_overlap"].mean()

0.6925960057338647

We can see that, on average, about 70% of the complex words in a question have already appeared in an earlier question. While promising, this only looks at individual words, not phrases, which makes the result less meaningful: a word like "Greece" can show up in many different contexts and categories.

Still, it is much better than the 6% question to answer rate we saw before, so this makes it worth studying old questions, as you probably can't get a 70% hit rate anywhere else.
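
Since the overlap measure is only meaningful if older questions are processed first, here is a minimal sketch that sorts by air date before building the set of seen terms. It uses invented `(air_date, clean_question)` tuples rather than the real DataFrame; with pandas, the equivalent first step would be `jeopardy.sort_values("Air Date")`. The helper name `overlap_by_date` is hypothetical.

```python
def overlap_by_date(questions, min_length=6):
    """Per-question share of long words already seen in earlier questions."""
    terms_seen = set()
    overlaps = []
    # ISO-formatted date strings sort chronologically
    for air_date, text in sorted(questions):
        words = [w for w in text.split(" ") if len(w) >= min_length]
        # Unlike the notebook loop, a repeat inside the same question is not a match
        matches = sum(1 for w in words if w in terms_seen)
        terms_seen.update(words)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

# Invented questions, deliberately listed out of date order
toy_questions = [
    ("2004-12-31", "greece hosted the olympic games"),
    ("1999-05-02", "ancient greece had many philosophers"),
]
result = overlap_by_date(toy_questions)
print(result)
```

In the toy data the 1999 question is processed first, so only the 2004 question can register "greece" as a repeat.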

## In Terms of Value
Let's say now we only want to focus on high valued questions, which makes sense. This can help us earn more money. We can figure out which terms correspond to high-valued questions with a chi-squared test. The categories we need are:
* Low value: less than or equal to \$800
* High value: more than \$800

We can use the terms_used from above and find the words with the biggest difference in usage between the two categories, and naturally focus on studying the high valued ones.

In [13]:
# create function to return value
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

In [14]:
# apply the function
jeopardy["high_value"] = jeopardy.apply(determine_value, axis = 1)

# see results
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0,0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.0,0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0,0.0,0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0,0.0,0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0,0.0,0


In [15]:
# create function to find high and low usage
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [16]:
# empty list
observed_expected = []
# convert terms_used to a list and get first 5 items
comparison_terms = list(terms_used)[:5]

# loop through the terms
for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
# display results
observed_expected

[(1, 2), (2, 1), (1, 0), (0, 1), (1, 1)]

Notice that we only used the first 5 items in the new terms_used list to save time as this step takes a while to run for all terms. The main gist is the same if we wanted to use all the terms.

## Calculating Chi-Squared
Now we can actually see the chi-squared results using everything we've written so far:

In [17]:
# get necessary modules
from scipy.stats import chisquare
import numpy as np

# get high and low counts
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

# empty chi-squared
chi_squared = []
# perform the analysis
for obs in observed_expected:
    total = sum(obs)  # total of the counts 
    total_prop = total / jeopardy.shape[0]  # get the proportion
    
    # expected values
    high_exp = total_prop * high_value_count
    low_exp = total_prop * low_value_count
    
    # make arrays
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_exp, low_exp])
    
    # chi-sq
    chi_squared.append(chisquare(observed, expected))
    
# display results
chi_squared

[Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996)]

In [18]:
# see the terms again
comparison_terms

['influential', 'chiricahua', 'assured', 'targetblankmaska', 'defines']

None of the p-values is below 0.05, so there is no statistically significant difference in usage between high- and low-value questions for these terms. However, we only tested five terms, and each appears fewer than 10 times in total. The chi-squared test is unreliable with counts this small, so these results shouldn't carry much weight, but it is still good practice.

It would be better to run this test with only terms that have higher frequencies. Doing so would not only produce better results and give us actual meaningful data on what to study more efficiently, but also eliminate some time in calculations as well so we don't have to keep using samples.
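
To make the expected-count arithmetic explicit, here is a hand-rolled version of the same test for a single term, using only the standard library. All counts are invented for illustration. With two categories there is one degree of freedom, so the p-value can be written with the complementary error function; `scipy.stats.chisquare` should agree for the same observed and expected counts.

```python
import math

def chi_squared_two_groups(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    n_questions = high_total + low_total
    term_share = (high_obs + low_obs) / n_questions
    # Expected counts if the term were spread evenly across both groups
    high_exp = term_share * high_total
    low_exp = term_share * low_total
    statistic = ((high_obs - high_exp) ** 2 / high_exp
                 + (low_obs - low_exp) ** 2 / low_exp)
    # Survival function of chi-squared with 1 dof: P(Z^2 > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical: 300 high-value and 700 low-value questions; term seen 10 times in each
stat, p = chi_squared_two_groups(10, 10, 300, 700)
print(round(stat, 4), round(p, 4))
```

Here the expected counts are 6 and 14, so an observed 10/10 split gives a statistic of about 3.81. With counts of only 1 or 2, as in the five terms above, the expected counts fall well below what the test needs to be trustworthy.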

## Further Analysis / Next Steps
## Frequencies
Here we do what we described above and run the chi-squared test only on terms with high frequencies. We rebuild our comparison terms and run the code again to see whether the differences are any larger.

In [19]:
# empty list
observed_expected = []
# convert terms_used to a list and get first 100 terms
comparison_terms = list(terms_used)[:100]

# loop through the terms
for term in comparison_terms:
    usage = count_usage(term)
    if sum(usage) > 20: 
        observed_expected.append(usage)
    
# display results
observed_expected

[(4, 24), (13, 47), (77, 212)]

This time we took 100 terms instead of just 5, and only 3 of them appear more than 20 times. We'll use these for now, but it suggests that most terms in the set are rare.

(Note: The previous cell takes a while to run.)

In [21]:
# empty chi-squared
chi_squared = []
# perform the analysis
for obs in observed_expected:
    total = sum(obs)  # total of the counts 
    total_prop = total / jeopardy.shape[0]  # get the proportion
    
    # expected values
    high_exp = total_prop * high_value_count
    low_exp = total_prop * low_value_count
    
    # make arrays
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_exp, low_exp])
    
    # chi-sq
    chi_squared.append(chisquare(observed, expected))
    
# display results
chi_squared

[Power_divergenceResult(statistic=0.9957444564014047, pvalue=0.3183424205069737),
 Power_divergenceResult(statistic=0.00027385366262769215, pvalue=0.9867967904367213),
 Power_divergenceResult(statistic=0.18455103159085381, pvalue=0.6674909797518206)]

It doesn't really matter what the terms are at this point, because none of these higher-frequency terms shows a statistically significant difference between high- and low-value questions. Again, there were only three such high-frequency terms, and even these weren't THAT high, so we'll probably need to use something other than a chi-squared test in the future.

However, it was easy to see what changing the range of our sample can do, and different ways to do so such as only choosing a certain number of comparison terms for observed/expected results.

Another way to do this would be to improve the speed of the calculation functions in order to include the entire term set instead of just a sample for more accurate numbers.
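
One way to get that speed-up, sketched here under the assumption that each question is available as a `(clean_question, high_value)` pair, is to count every term in a single pass instead of rescanning the whole dataset once per term as `count_usage` does. The helper name `count_all_terms` and the toy data are hypothetical.

```python
from collections import Counter

def count_all_terms(questions, min_length=6):
    """Return {term: (high_count, low_count)} from one pass over the questions."""
    high_counts, low_counts = Counter(), Counter()
    for text, high_value in questions:
        # A set, so a term is counted once per question, as in count_usage
        terms = {w for w in text.split(" ") if len(w) >= min_length}
        (high_counts if high_value == 1 else low_counts).update(terms)
    all_terms = set(high_counts) | set(low_counts)
    return {t: (high_counts[t], low_counts[t]) for t in all_terms}

# Invented questions with a high_value flag
toy = [
    ("ancient greece philosophers", 1),
    ("greece olympic games", 0),
    ("greece greece capital", 0),
]
usage = count_all_terms(toy)
print(usage["greece"])
```

Filtering the resulting dictionary for terms whose two counts sum to more than 20 would then play the role of the slow loop over `comparison_terms`.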

## Eliminating Words
For the question overlap column we kept only "complex" words of six or more characters and discarded the rest, but we can do better than this. There are a few alternatives to hard-coding a character limit, including:
* manually making a list of words to remove
* finding an existing list of words to remove
* removing words that occur in more than a certain percentage of questions (such as 5%), as they are just noise at that point

The first two options are as simple as making or finding a list of common words and then dropping any split terms that appear in it. The third is a little more involved, but it also catches words that aren't on typical common-word lists yet still carry little weight on their own, like "count".

There are also terms such as 'hrefhttpwwwjarchivecommedia20071106dj25jpg' which is probably the result of poor formatting in the question columns, which can easily be worked around by seeing if 'jpg' is in the term at all, for example. 

We work with the jpg elimination below.

In [31]:
# empty list and set
question_overlap = []
terms_used = set()

# iterate rows, splitting question
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    split_question = [q for q in split_question if q.find("jpg") == -1]
    match_count = 0
    # iterate words in question, setting matches
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    # set match count and append
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

# create new column 
jeopardy["question_overlap"] = question_overlap

# display the mean to see results
jeopardy["question_overlap"].mean()   

0.6975454160738909
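
The third option from the list above, dropping words that appear in more than some share of questions, can be sketched as follows. The 5% threshold from the text is exposed as a parameter; the function name `frequent_words` and the toy questions are assumptions for illustration.

```python
from collections import Counter

def frequent_words(clean_questions, max_share=0.05):
    """Return the set of words appearing in more than max_share of questions."""
    doc_counts = Counter()
    for text in clean_questions:
        doc_counts.update(set(text.split()))  # count each word once per question
    limit = max_share * len(clean_questions)
    return {word for word, n in doc_counts.items() if n > limit}

# Invented questions; a high threshold is used because the sample is tiny
toy_questions = [
    "this country borders greece",
    "this author wrote hamlet",
    "this river flows through egypt",
    "greece adopted the euro in 2001",
]
noise = frequent_words(toy_questions, max_share=0.5)
print(sorted(noise))
```

The returned set can be used the same way as the `jpg` check above: filter `split_question` to drop any word that is in it.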

## Category Column
The category column poses some interesting questions to answer, and perhaps we can look at things like which categories appear the most often and finding out the probability of each category appearing in each round.

In [32]:
# get value counts
jeopardy["Category"].value_counts()

TELEVISION                          51
U.S. GEOGRAPHY                      50
LITERATURE                          45
BEFORE & AFTER                      40
HISTORY                             40
AMERICAN HISTORY                    40
AUTHORS                             39
WORD ORIGINS                        38
WORLD CAPITALS                      37
BODIES OF WATER                     36
SPORTS                              36
SCIENCE & NATURE                    35
SCIENCE                             35
MAGAZINES                           35
RHYME TIME                          35
WORLD GEOGRAPHY                     33
HISTORIC NAMES                      32
WORLD HISTORY                       32
ANNUAL EVENTS                       32
IN THE DICTIONARY                   31
BIRDS                               31
FICTIONAL CHARACTERS                31
POTPOURRI                           30
ISLANDS                             30
OPERA                               30
TRAVEL & TOURISM         

There are a lot of categories, many of which appear only once, so let's take the top 20 as an example.

In [37]:
# get top 20 and display
cats = jeopardy["Category"].value_counts(ascending = False)[:20]
cats

TELEVISION           51
U.S. GEOGRAPHY       50
LITERATURE           45
BEFORE & AFTER       40
HISTORY              40
AMERICAN HISTORY     40
AUTHORS              39
WORD ORIGINS         38
WORLD CAPITALS       37
BODIES OF WATER      36
SPORTS               36
SCIENCE & NATURE     35
SCIENCE              35
MAGAZINES            35
RHYME TIME           35
WORLD GEOGRAPHY      33
HISTORIC NAMES       32
WORLD HISTORY        32
ANNUAL EVENTS        32
IN THE DICTIONARY    31
Name: Category, dtype: int64

In [46]:
# see different rounds
jeopardy["Round"].value_counts()

Jeopardy!           9901
Double Jeopardy!    9762
Final Jeopardy!      335
Tiebreaker             1
Name: Round, dtype: int64

In [49]:
# make lists (recomputing the top 20 here keeps the cell safe to re-run)
rounds = jeopardy["Round"].value_counts().index.tolist()
cats = jeopardy["Category"].value_counts()[:20].index.tolist()

In [50]:
# get probabilities
prob_final_jpy = {}
for cat in cats:
    jpy = jeopardy[jeopardy["Category"] == cat]
    count_total = len(jpy)
    jpy2 = jpy[jpy["Round"] == "Final Jeopardy!"]
    count = len(jpy2)
    prob = count / count_total
    prob_final_jpy[cat] = prob
    
# see final jeopardy probabilities
prob_final_jpy

{'AMERICAN HISTORY': 0.05,
 'ANNUAL EVENTS': 0.0,
 'AUTHORS': 0.10256410256410256,
 'BEFORE & AFTER': 0.0,
 'BODIES OF WATER': 0.027777777777777776,
 'HISTORIC NAMES': 0.0625,
 'HISTORY': 0.0,
 'IN THE DICTIONARY': 0.03225806451612903,
 'LITERATURE': 0.0,
 'MAGAZINES': 0.0,
 'RHYME TIME': 0.0,
 'SCIENCE': 0.0,
 'SCIENCE & NATURE': 0.0,
 'SPORTS': 0.0,
 'TELEVISION': 0.0196078431372549,
 'U.S. GEOGRAPHY': 0.0,
 'WORD ORIGINS': 0.21052631578947367,
 'WORLD CAPITALS': 0.05405405405405406,
 'WORLD GEOGRAPHY': 0.09090909090909091,
 'WORLD HISTORY': 0.0625}

In [52]:
# final jeopardy proportion
final_jpy_prop = 335/19999
final_jpy_prop

0.016750837541877093

Above is just one example of how probability analysis could work. As we can see, studying different things for Final Jeopardy vs. the other rounds could make a lot of sense. 'WORD ORIGINS' is among the ten most frequent categories, yet over 20% of its questions in our sample are Final Jeopardy questions.

Final Jeopardy accounts for less than 2% of all questions, so if we were focusing on the first rounds of Jeopardy to get going, we would probably not want to prioritize word origins in our study list.

Examples like this are very useful when doing analysis, and it is clear we probably couldn't see things like this just from looking at the table initially, or just from looking at value counts by themselves.
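
The same per-category calculation can be generalized to every round at once. The sketch below is a plain-Python version on invented `(category, round)` pairs; `round_shares` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def round_shares(pairs):
    """Return {category: {round: share of that category's questions}}."""
    counts = defaultdict(Counter)
    for category, round_name in pairs:
        counts[category][round_name] += 1
    return {
        category: {r: n / sum(rounds.values()) for r, n in rounds.items()}
        for category, rounds in counts.items()
    }

# Invented rows for illustration
toy_pairs = [
    ("WORD ORIGINS", "Final Jeopardy!"),
    ("WORD ORIGINS", "Jeopardy!"),
    ("WORD ORIGINS", "Double Jeopardy!"),
    ("WORD ORIGINS", "Jeopardy!"),
    ("SPORTS", "Jeopardy!"),
]
shares = round_shares(toy_pairs)
print(shares["WORD ORIGINS"]["Final Jeopardy!"])
```

Comparing each category's Final Jeopardy share against the overall share is what makes a category like word origins stand out.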

## More Questions/Data
We can download the complete dataset of all 200000+ questions from the link: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/.

The analysis process will be more or less the same with the complete dataset, which is why we used a sample of about 20000 in the first place. The full dataset should produce more accurate results, so it is worth using for serious research, but to avoid redundancy the process is not repeated here. It will take a lot longer to run because of the larger file, but for the most part we can just re-run the cells after swapping in the new data.

## Phrases for Overlapping
Lastly, we can use phrases instead of single words when seeing if there's overlap between questions. Single words, like we said, don't really capture the whole context, and phrases can work much better ("Greece soccer" or "Greece debt" provides a whole lot more context than just "Greece").

There isn't really a perfect way to do this, and finding the perfect way to split questions into meaningful phrases could take a bit of work. 

The best and easiest way would probably be to use a manual list again, of well-known phrases in jeopardy. We can match this list to a first half of the air date dataset, and then again to the second half, to see if they correlate at all. If so, studying past questions is the way to go! 

Another way to do this would be to split the questions on the " " space delimiter again, but then group the words in pairs (in effect splitting on every other space), resulting in two-word phrases. We can increase this to three, four, etc. words as needed, and then re-do the analysis from earlier to see their overlap percentage. The problem is that some of these 2-, 3-, or 4-word combinations can be utterly meaningless and more confusing than the individual words.

It's best to not create more problems by attempting to solve problems, so using a manual list with a few key phrases is best. If we wanted to study American History, for example, we could include a list of phrases like "electoral college" in a set to see how many times it gets triggered in the match_count variable. We don't have to split the question for this, but rather use the find substring method. 

If again we see the question overlapping being high, we can assume our list of phrases work and can begin studying American History by its category from old questions!
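
Both ideas can be tried out in a few lines. The sketch below builds overlapping two-word phrases (a sliding window, which is a slightly different choice from the non-overlapping split described above) and, separately, checks a question against a hand-made phrase list using substring matching. The function names, phrase list, and example question are all invented for illustration.

```python
def ngrams(clean_question, n=2):
    """Return all runs of n consecutive words as space-joined phrases."""
    words = clean_question.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def phrases_found(clean_question, phrases):
    """Return the phrases from a manual list that occur in the question."""
    return [p for p in phrases if p in clean_question]

# Hypothetical question and study list
question = "the electoral college chooses this official"
bigrams = ngrams(question)
found = phrases_found(question, ["electoral college", "bill of rights"])
print(bigrams[:2])
print(found)
```

Plain substring matching can also hit the inside of longer words, so a stricter version might compare against the n-gram list instead.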

## Conclusion
There were a lot of ideas presented in this small project about jeopardy, but it's easy to say that if there was a secret sauce, it would've already been exposed! We obviously saw that we can't just rely on the answers being within the questions themselves, and that the best way to win is to study questions (duh).

While there are numerous categories and questions, and therefore numerous ways to study, we found a few reasons to believe that some ways are probably more effective than others.

For one, studying past questions to get prepared for new ones seems to be a good idea, as questions seem like they are never completely unique in most cases. 

While it looked like there was not really a point in studying "high valued" questions based on our chi-squared analysis, the test doesn't work well with such low term counts, which leaves us with an inconclusive, if otherwise optimistic, result.

We also found that studying for particular parts of the game could be better suited to different categories. If you wanted to just bank on winning final jeopardy, for example, focusing a lot on studying word origins may be beneficial. We can probably use this same probability analysis on high and low value options too instead of using chi-square, if you wanted to focus on getting the most money (on a second read-through of this project, this was probably a good idea - and a cell following this conclusion will show an example of it).

All in all, depending on where your strengths and weaknesses are, the main conclusion here is that each area can be worked on differently and we can find different areas to study accordingly. This was all by hand though! In projects following this one, hopefully I will finally begin incorporating some automation to this process and work more with some machine learning.

That means that data science will be a focus from now on, instead of simply data analysis, as there are now enough projects to go about analysis. Thanks for sticking around! And good luck as well if you are following along.

In [55]:
# a probability example
# copied code below
# get probabilities
prob_high_jpy = {}
for cat in cats:
    jpy = jeopardy[jeopardy["Category"] == cat]
    count_total = len(jpy)
    jpy2 = jpy[jpy["high_value"] == 1]
    count = len(jpy2)
    prob = count / count_total
    prob_high_jpy[cat] = prob
    
# see high-value probabilities
prob_high_jpy

{'AMERICAN HISTORY': 0.3,
 'ANNUAL EVENTS': 0.21875,
 'AUTHORS': 0.15384615384615385,
 'BEFORE & AFTER': 0.475,
 'BODIES OF WATER': 0.1388888888888889,
 'HISTORIC NAMES': 0.3125,
 'HISTORY': 0.175,
 'IN THE DICTIONARY': 0.5161290322580645,
 'LITERATURE': 0.28888888888888886,
 'MAGAZINES': 0.17142857142857143,
 'RHYME TIME': 0.2,
 'SCIENCE': 0.42857142857142855,
 'SCIENCE & NATURE': 0.4,
 'SPORTS': 0.16666666666666666,
 'TELEVISION': 0.11764705882352941,
 'U.S. GEOGRAPHY': 0.28,
 'WORD ORIGINS': 0.3157894736842105,
 'WORLD CAPITALS': 0.21621621621621623,
 'WORLD GEOGRAPHY': 0.18181818181818182,
 'WORLD HISTORY': 0.21875}

Footnote: we can see of all the high count categories, categories like 'in the dictionary' have over half their questions as high valued ones. While this isn't a sure way to get all the high value questions, it's a start. If you like money, go here!