**Jeopardy** is a popular TV show in the US where participants answer questions to win money. It has been running for decades and is a fixture of popular culture.  
The dataset is named [jeopardy.csv](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) and contains roughly the first 20,000 rows (19,999 once loaded) of a full dataset of Jeopardy questions.  

Column | Description
 --- | ---
Show Number | The episode number of the show in which the question appeared.
Air Date | The date the episode aired.
Round | The round in which the question was asked. Each episode progresses through several rounds.
Category | The category of the question.
Value | The dollar amount awarded for answering the question correctly.
Question | The text of the question.
Answer | The text of the answer.

## Importing data and libraries

In [48]:
import numpy as np
import pandas as pd

import re

from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

%autosave 10

Autosaving every 10 seconds


In [49]:
data = pd.read_csv("jeopardy.csv")

In [50]:
data.columns = [x.strip() for x in data.columns]

In [51]:
data.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


In [52]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [53]:
data['Air Date'] = pd.to_datetime(data['Air Date'])

In [54]:
data['Value'].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389'], dtype=object)

In [55]:
def cleanValue(bad_price):
    # Strip the dollar sign and thousands separator, e.g. "$2,000" -> "2000".
    # Non-price strings such as "None" pass through unchanged.
    return bad_price.replace("$", "").replace(",", "")

In [56]:
# Clean each price string, then treat missing values ("None") as 0 before casting to int
data['Value'] = data['Value'].apply(cleanValue).replace("None", np.nan).fillna(0).astype(int)
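
The same cleaning could also be sketched without a row-wise `apply`, using pandas string methods. This is only a minimal sketch on a small made-up series (`sample` is a hypothetical stand-in for `data['Value']`, not the real column); like the cell above, it maps `None` to 0.

```python
import pandas as pd

# Hypothetical sample standing in for data['Value']
sample = pd.Series(["$200", "$2,000", "None", "$1,492"])

# Remove "$" and "," in one pass; inside a character class "$" is a literal
stripped = sample.str.replace(r"[$,]", "", regex=True)

# Non-numeric strings such as "None" become NaN, then 0, as in the cell above
clean = pd.to_numeric(stripped, errors="coerce").fillna(0).astype(int)

print(clean.tolist())  # [200, 2000, 0, 1492]
```

Note that `errors="coerce"` silently turns any unparseable string into NaN, so it may be worth checking `stripped` for unexpected values before relying on it.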

In [59]:
data['Value'].value_counts(bins = 15,dropna=False, sort = True)

(-12.001, 800.0]      14265
(800.0, 1600.0]        4228
(1600.0, 2400.0]       1275
(2400.0, 3200.0]        106
(3200.0, 4000.0]         56
(4800.0, 5600.0]         27
(6400.0, 7200.0]         10
(5600.0, 6400.0]         10
(4000.0, 4800.0]          9
(7200.0, 8000.0]          5
(9600.0, 10400.0]         3
(11200.0, 12000.0]        2
(10400.0, 11200.0]        1
(8800.0, 9600.0]          1
(8000.0, 8800.0]          1
Name: Value, dtype: int64

## To-do

* Write a function to normalize questions and answers.
* Investigate repeated questions.
* Perform the chi-squared test across more terms to see which terms show larger differences.
* Look more into the Category column and see if any interesting analysis can be done with it.
* Use the whole Jeopardy dataset (available [here]) instead of the subset we used in this mission.
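
As a starting point for the first item, a normalization function might look like the sketch below. The name `normalize_text` and the exact rules (lowercase, drop everything except letters, digits, and whitespace, collapse runs of whitespace) are assumptions for illustration, not something this notebook has settled on.

```python
import re

def normalize_text(text):
    """Lowercase, remove punctuation, and collapse whitespace (hypothetical helper)."""
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse repeated whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize_text("For the last 8 years of his life, Galileo was ..."))
# for the last 8 years of his life galileo was
```

It could then be applied with something like `data['Question'].apply(normalize_text)` to produce a cleaned column. Non-ASCII letters are dropped by this pattern, which may or may not be acceptable for the full dataset.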