# Winning Jeopardy!
This exploratory analysis looks for correlations and patterns in the Jeopardy! dataset since 2004 that could give an edge in winning the game.

For this we use a 10% sample of the full Jeopardy! dataset.

In [87]:
import pandas as pd
from scipy.stats import chisquare

In [17]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


## Data Preprocessing and cleaning
Normalize and clean up the data.

In [18]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Remove the unnecessary leading spaces from the column names.

In [19]:
jeopardy.columns = [column.strip() for column in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

### Normalize data
Lowercase the text columns and filter out punctuation.

In [20]:
pattern = r"[\,\.\:;!?\#\(\)\[\]\@\%\^|&\"\/\-\+\=]+"
repl = ' '

In [21]:
jeopardy['clean_question'] = jeopardy['Question'].str.lower()
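# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default.
jeopardy['clean_question'] = jeopardy['clean_question'].str.replace(pattern, repl, regex=True)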

In [22]:
jeopardy['clean_answer'] = jeopardy['Answer'].str.lower()
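# regex=True is needed in pandas >= 2.0, where str.replace matches literally by default.
jeopardy['clean_answer'] = jeopardy['clean_answer'].str.replace(pattern, repl, regex=True)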
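To see what this normalization does to a single string, here is a minimal plain-Python sketch. It reuses the same character class as `pattern`; the helper name `normalize` and the sample text are only illustrative and are not part of the notebook.

```python
import re

# Same character class as the notebook's `pattern`.
PUNCTUATION = re.compile(r"[\,\.\:;!?\#\(\)\[\]\@\%\^|&\"\/\-\+\=]+")

def normalize(text):
    """Lowercase the text and replace each run of listed punctuation with one space."""
    return PUNCTUATION.sub(" ", text.lower())

# Illustrative sample, modelled on one of the truncated questions shown by head().
sample = 'In 1963, live on "The Art Linkletter Show"...'
print(normalize(sample).split())
# ['in', '1963', 'live', 'on', 'the', 'art', 'linkletter', 'show']

# The apostrophe is not in the character class, so it is kept.
print(normalize("McDonald's"))
# mcdonald's
```

Note that runs of punctuation collapse into a single space, so splitting on whitespace afterwards gives clean tokens.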

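Convert the Value column to numeric.

In [25]:
# '$' is a regex anchor, so it has to be replaced as a literal character.
pattern_value = '$'
repl_value = ''
jeopardy['clean_value'] = jeopardy['Value'].str.replace(pattern_value, repl_value, regex=False)
# Anything that is still not a plain number becomes NaN, and then 0.
jeopardy['clean_value'] = pd.to_numeric(jeopardy['clean_value'], errors='coerce')
jeopardy['clean_value'] = jeopardy['clean_value'].fillna(0)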
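One thing to keep in mind: with `errors='coerce'`, every value that is not a plain number once the `$` is removed ends up as 0. If the column contains values with a thousands separator (an assumption here, the sample rows only show `$200`), they would be counted as 0 too. A stricter parser could look roughly like this sketch; `parse_value` is a hypothetical helper, not part of the notebook.

```python
def parse_value(raw):
    """Convert a string such as '$2,000' to an int; return 0 when it is not numeric."""
    digits = raw.replace("$", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else 0

# Hypothetical inputs: a plain value, a value with a separator, a placeholder.
print([parse_value(v) for v in ["$200", "$2,000", "None"]])
# [200, 2000, 0]
```

It could be applied with `jeopardy['Value'].apply(parse_value)`, assuming every entry in the column is a string.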

Convert date column to datetime format.

In [26]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

## Hacking into jeopardy question-answer data.

It would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the first question by measuring how many words in the answer also occur in the question.
We can answer the second question by checking how often complex words (6 or more characters) recur.

First things first: how often is the answer deducible from the question?

In [29]:
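# Also remove 'the' from the answers, as it carries no meaning.
# Caveat: this is a plain substring replacement, so 'the' inside longer
# words is removed as well (e.g. 'mother' becomes 'mor').
jeopardy['clean_answer'] = jeopardy['clean_answer'].str.replace('the', '', regex=False)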

In [42]:
# Return the fraction of answer words that also appear in the question.
def count_matches(row_series):
    split_answer = row_series.loc['clean_answer'].split()
    split_question = row_series.loc['clean_question'].split()
    # An empty answer has nothing to match.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [43]:
# Apply the function and calculate the mean
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
answer_occurence = jeopardy['answer_in_question'].mean()
print(answer_occurence)

0.0570043693937


### Explanation
On average, only about 5.7% of the words in an answer also appear in the question. That is too small to rely on.
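To make the metric concrete, here is a small plain-Python sketch of the same per-row calculation. The function name `match_fraction` and the two example rows are made up for illustration; the texts are assumed to be already lowercased and stripped of punctuation.

```python
def match_fraction(answer, question):
    """Share of the words in `answer` that also occur in `question`."""
    answer_words = answer.split()
    question_words = set(question.split())
    if not answer_words:
        return 0
    return sum(w in question_words for w in answer_words) / len(answer_words)

# Hypothetical rows, not taken from the dataset.
print(match_fraction("john adams", "this adams was the second president"))
# 0.5
print(match_fraction("arizona", "the city of yuma is in this state"))
# 0.0
```

The mean of these per-row fractions over the whole DataFrame is the number reported above.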

Now let's find out how often new questions are repeats of older questions.

In [58]:
jeopardy.sort_values('Air Date',inplace=True)

In [59]:
question_overlap = []
terms_used = set()

In [60]:
# For each question, compute the share of its terms (6 or more characters)
# that already appeared in earlier questions.
for _, row in jeopardy.iterrows():
    split_question = row.loc['clean_question'].split()
    # Drop short words so that only longer, more informative ones count.
    split_question = [word for word in split_question if len(word) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [62]:
jeopardy['question_overlap'] = question_overlap

q_overlap_mean = jeopardy['question_overlap'].mean()
print(q_overlap_mean)

0.71017680742


### Explanation
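On average, about 71% of the longer terms in a question have already appeared in an earlier question, which is a large overlap. It suggests that studying past questions can give an edge in winning the game, although the measure counts repeated words rather than repeated whole questions.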
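The overlap measure can be traced by hand on a few questions. The sketch below follows the same logic as the loop above in plain Python; `overlap_scores` and the three questions are hypothetical and are assumed to be sorted by air date already.

```python
def overlap_scores(questions, min_len=6):
    """For each question, the share of its long words seen in earlier text."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        hits = 0
        for word in words:
            if word in seen:
                hits += 1
            seen.add(word)
        scores.append(hits / len(words) if words else 0)
    return scores

# Hypothetical questions in chronological order.
questions = [
    "this italian astronomer defended copernicus",
    "copernicus placed the sun at the center",
    "this astronomer placed a telescope in orbit",
]
print(overlap_scores(questions))
```

The first question scores 0 because nothing has been seen yet, the second reuses one of its three long words, and the third reuses two of three.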

## Find words correlated with question value
The idea is to pay more attention to higher-valued questions. Let's check which of the terms used occur significantly more often in high-valued or in low-valued questions.


In [67]:
# Flag questions worth 800 or more as high-valued (1), all others as low-valued (0).
def is_high_valued(row_series):
    if row_series.loc['clean_value'] >= 800:
        return 1
    return 0

jeopardy['high_value'] = jeopardy.apply(is_high_valued, axis=1)

In [78]:
# Count how many high- and low-valued questions contain the given term.
def high_low_count_terms(word):
    low_count = 0
    high_count = 0
    for _, row in jeopardy.iterrows():
        if word in row.loc['clean_question'].split():
            if row.loc['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
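The function above scans the whole DataFrame once per term, which gets slow for many terms. One possible alternative, sketched here in plain Python, is to count all terms in a single pass. `count_terms_by_value` is a hypothetical helper and the sample data is made up.

```python
from collections import Counter

def count_terms_by_value(questions, high_flags):
    """Count, per term, how many high- and low-valued questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for text, is_high in zip(questions, high_flags):
        target = high_counts if is_high else low_counts
        # set() counts a term once per question, like `word in text.split()`.
        target.update(set(text.split()))
    return high_counts, low_counts

# Hypothetical cleaned questions and their high_value flags.
questions = ["this river flows north", "this famous river is long", "a famous painter"]
flags = [1, 0, 1]
high, low = count_terms_by_value(questions, flags)
print((high["river"], low["river"]))
# (1, 1)
print((high["painter"], low["painter"]))
# (1, 0)
```

With the notebook's data this would presumably be called as `count_terms_by_value(jeopardy['clean_question'], jeopardy['high_value'])`, after which `(high[term], low[term])` gives the same pair as `high_low_count_terms(term)`.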

Compute the observed counts in high- and low-valued questions for 5 terms from `terms_used`. Because `terms_used` is a set, which 5 terms are picked is arbitrary.

In [81]:
observed_expected = []
comparison_terms = list(terms_used)[:5]

In [82]:
for term in comparison_terms:
    observed_expected.append(high_low_count_terms(term))

In [83]:
print(observed_expected)

[(0, 1), (1, 0), (3, 3), (2, 1), (0, 1)]


Compute the expected counts and the chi-squared value for each term.

In [85]:
high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])

In [86]:
chi_squared = []

In [88]:
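for item in observed_expected:
    # Total number of questions that contain the term.
    total = sum(item)
    # Share of all questions that contain the term.
    total_prop = total/len(jeopardy)
    # Expected counts if the term were independent of the question value.
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    chisq, p = chisquare(item, (expected_high, expected_low))
    chi_squared.append(chisq)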

In [89]:
print(chi_squared)

[0.66008134805345742, 1.5149647887323945, 0.26256920517877735, 0.90664683432767801, 0.66008134805345742]
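For reference, the statistic computed here is the sum of (observed - expected)² / expected over the two cells. The sketch below works it out by hand in plain Python. The row totals are hypothetical, since the notebook does not print `high_value_count` and `low_value_count`, so the result is not meant to match the output above.

```python
def chi_squared_stat(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical totals, for illustration only.
high_value_count, low_value_count = 400, 600
total_rows = high_value_count + low_value_count

observed = (3, 3)  # (high, low) counts for one term
total = sum(observed)
expected = (
    total * high_value_count / total_rows,  # 2.4
    total * low_value_count / total_rows,   # 3.6
)
print(round(chi_squared_stat(observed, expected), 4))
# 0.25
```

The expected counts always add up to the observed total, which is what `chisquare` requires of its inputs.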


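### Conclusion on chi-squared
The 2nd term has the largest chi-squared value (about 1.51) and the 3rd term the smallest (about 0.26).

However, none of the values reaches 3.84, the critical value at the 5% significance level for one degree of freedom, so none of these five terms differs significantly between high- and low-valued questions. With observed counts this small, the test is not reliable anyway.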

## Additional exploration ideas:
- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset instead of the 10% sample used here.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
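As a starting point for the first idea, words that occur in a large share of questions could be detected automatically. This is only a rough plain-Python sketch: `frequent_words`, the threshold and the sample questions are all illustrative assumptions.

```python
from collections import Counter

def frequent_words(questions, max_share=0.05):
    """Words that occur in more than `max_share` of the questions."""
    doc_freq = Counter()
    for question in questions:
        # set() so a word counts once per question.
        doc_freq.update(set(question.split()))
    limit = max_share * len(questions)
    return {word for word, n in doc_freq.items() if n > limit}

# Hypothetical cleaned questions; a high threshold is used because the sample is tiny.
questions = ["this city is the capital", "this river is long", "the capital of france"]
print(sorted(frequent_words(questions, max_share=0.5)))
# ['capital', 'is', 'the', 'this']
```

On a small sample a content word such as `capital` is flagged too, so the threshold would need tuning on the real data.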