# Winning Jeopardy

## Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

We want to compete on Jeopardy and are looking for any edge that could help us win. In this project, we'll work with a [dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) of Jeopardy questions, which contains the first 19,999 rows of the full dataset, to look for patterns in the questions that could help us win.

## Jeopardy Questions

We'll start by looking at the dataset and getting some basic information about it.

In [1]:
# Importing libraries
import pandas as pd
import re

# Read dataset
jeopardy = pd.read_csv('jeopardy.csv')

In [2]:
# Quick look at the dataset
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


The original dataset has 19999 rows and 7 columns.

| column name | description |
| ----------- | ----------- |
| Show Number | the Jeopardy episode number |
| Air Date | the date the episode aired |
| Round | the round of Jeopardy |
| Category | the category of the question |
| Value | the number of dollars the correct answer is worth |
| Question | the text of the question |
| Answer | the text of the answer |

## Data Cleaning

Before we can start analyzing the Jeopardy questions, we need to clean the data. This includes:
- checking for duplicate rows
- removing the extra spaces before the column names
- checking and, if needed, converting the datatype of each column
- for the question and answer texts, in order to extract the words for analysis:
    - converting the words to lowercase
    - removing punctuation
- for the `Value` column:
    - removing the dollar sign from the beginning of each cell, then
    - converting the datatype from text to numeric.

### Check for duplicate rows in the dataset

In [3]:
# Duplicates check
duplicated = jeopardy.duplicated()
duplicated.sum()

0

### Fixing column names

In [4]:
# Printing dataset columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# Fixing column names (removing the leading spaces)
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
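
Hard-coding the names works for this file, but a more general option is to strip whitespace from whatever names are present. A minimal sketch on a toy frame (the `toy` data is made up for illustration and is not part of the dataset):

```python
import pandas as pd

# Toy frame with the same kind of leading-space column names (illustrative data)
toy = pd.DataFrame({"Show Number": [4680], " Air Date": ["2004-12-31"], " Value": ["$200"]})

# Index.str.strip() removes leading and trailing whitespace from every name
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

This avoids retyping the names and keeps working if the column order changes.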

### Datatype check

In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


As the table above shows, only the `Show Number` column has an integer datatype. The remaining columns are of type `object`, which here means they hold strings.

We need to convert the `Air Date` column from object to datetime.

In [7]:
# Changing datatype for `Air Date` column
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [8]:
# Checking columns datatype
jeopardy.dtypes

Show Number             int64
Air Date       datetime64[ns]
Round                  object
Category               object
Value                  object
Question               object
Answer                 object
dtype: object
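
If some dates might be malformed, one possible safeguard is to pass an explicit format and coerce failures to `NaT` instead of raising. A small sketch on made-up strings (the `dates` series is hypothetical, and the `%Y-%m-%d` format is an assumption based on the dates shown above):

```python
import pandas as pd

# Made-up sample: two valid dates and one malformed entry
dates = pd.Series(["2004-12-31", "1997-02-06", "not a date"])

# errors="coerce" turns unparseable strings into NaT rather than raising
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# Flag the rows that failed to parse
print(parsed.isna().tolist())
```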

### Normalizing Text and Values

We need to normalize all of the text columns.

We will write two functions: one to normalize the text columns and one to normalize the `Value` column.

In [9]:
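def normalized_text(text):
    # Lowercase so that matching is case-insensitive
    text = text.lower()
    # Drop punctuation: keep only letters, digits and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

def normalize_values(text):
    # Strip the dollar sign and other punctuation (raw string avoids an invalid-escape warning)
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        text = int(text)
    except ValueError:
        # Anything that is not a number after stripping becomes 0
        text = 0
    return text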
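
As an alternative to `apply`, the same kind of cleaning can be sketched with vectorized pandas string methods. This is not the approach used in this notebook, and it differs slightly: the value cleaning here keeps digits only. The `toy` frame and its contents are made up for illustration.

```python
import pandas as pd

# Made-up sample rows (not taken from the dataset)
toy = pd.DataFrame({
    "Question": ["No. 2: 1912 Olympian;  football star", "The city of Yuma"],
    "Value": ["$2,000", "None"],
})

# Text: lowercase, drop everything except letters, digits and whitespace,
# then collapse runs of whitespace into a single space
toy["clean_question"] = (
    toy["Question"]
    .str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)

# Values: keep digits only, then convert; anything unparseable becomes 0
digits = toy["Value"].str.replace(r"[^0-9]", "", regex=True)
toy["clean_value"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(toy[["clean_question", "clean_value"]])
```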

Now we will use our functions to normalize the `Question`, `Answer` and `Value` columns and assign the results to new columns. This way, we retain the original data and also have the cleaned data we need for our analysis.

In [10]:
# Generate 3 new columns by applying the appropriate function above to the Question, Answer and Value columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalized_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalized_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)

In [11]:
# Check cleaned dataset
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## Investigate the Trends of Jeopardy! Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out three things:
- How often do words from the answer appear in the question?
- How often are questions repeated?
- Which "key words" or specific terms can serve as proxies for comparing low-value and high-value questions?

### How often the answer appears in the question

Now we will:
- write a function that takes a row and splits `clean_answer` and `clean_question` into lists of words,
- remove the very common word "the" from the answer words,
- loop through each word in the split `clean_answer` and check whether it also occurs in the split `clean_question`,
- count the matches and return the proportion of answer words that matched,
- calculate the mean of that proportion over all rows.

In [14]:
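def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    match_count = 0

    # "the" is too common to be a meaningful match (only the first occurrence is removed)
    if 'the' in split_answer:
        split_answer.remove('the')

    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0

    # Count the answer words that also appear in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1

    # Proportion of answer words found in the question
    return match_count / len(split_answer)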
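
To make the returned proportion concrete, here is a compact sketch of the same logic applied to two hand-written strings. The helper name `answer_overlap` and both example strings are hypothetical and exist only for this illustration.

```python
def answer_overlap(question, answer):
    """Fraction of answer words (ignoring one 'the') that also occur in the question."""
    answer_words = answer.split()
    question_words = set(question.split())
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(word in question_words for word in answer_words)
    return matches / len(answer_words)

# "ant" does not occur in the question, so nothing matches
low = answer_overlap("this insect shames the grasshopper", "the ant")

# Both "appian" and "way" occur in the question, so everything matches
high = answer_overlap("the appian way linked rome and the south", "the appian way")

print(low, high)
```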

Using this function, we will create a new column that records what proportion of the terms in `clean_answer` also occur in `clean_question`.

In [15]:
# Apply the function above to each row of the dataset
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [16]:
# Get the mean for answer_in_question column
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question. Those odds are not great, so we should not count on finding the answer inside the question. Let's move on to our second question: how often are questions repeated in the dataset?

### Find out the frequency of recycled questions in the archive

In this step, we will:
- sort `jeopardy` by the `Air Date` column in ascending order,
- create an empty set called `term_used`,
- iterate through each row of `jeopardy` and
    - split `clean_question` into words and remove any word shorter than 6 characters,
    - count how many of the remaining words already occur in `term_used`,
    - add each of the remaining words to `term_used`,
    - divide the count by the number of remaining words, if there are any,
- create a new column `question_overlap` in `jeopardy` that holds this proportion for each question,
- find the mean of the `question_overlap` column.

In [19]:
question_overlap = []
term_used = set()

# Sorting jeopardy by ascending air date
jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [n for n in split_question if len(n) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in term_used:
            match_count += 1
            
    for word in split_question:
        term_used.add(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)

In [20]:
# Create a new column for repeated questions
jeopardy['question_overlap'] = question_overlap

In [21]:
# Find the mean of the new column
jeopardy['question_overlap'].mean()

0.6876924200174069

On average, about 69% of the words of six or more characters in a question have already appeared in an earlier question, which looks far more promising than the previous result. However, word repetition is not an ideal proxy for repeated questions, because the repetition may come from common structural words rather than the “key words” that really define the meaning of a question. We may need to sift through the collected terms to find the questions that are truly repeated. Still, studying repeated questions will at least give us a general familiarity with the format and style of the competition, which should help us handle the frequently asked questions and avoid embarrassing scores. Difficult questions tend to be rare and unique, so they may not be among the recycled ones; we will need to explore other means to stand out.
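
One stricter check, sketched here as an assumption rather than as part of the analysis above, is to look for questions whose cleaned text is repeated verbatim. The `toy` frame and its questions are made up for illustration.

```python
import pandas as pd

# Made-up cleaned questions; the first and third are identical
toy = pd.DataFrame({
    "clean_question": [
        "this state is home to the city of yuma",
        "this insect shames the grasshopper",
        "this state is home to the city of yuma",
    ]
})

# keep=False marks every member of a duplicated group, not only the later copies
is_repeat = toy["clean_question"].duplicated(keep=False)

# Share of questions that have at least one exact twin
print(is_repeat.mean())
```

Exact matching will miss reworded questions, so it gives a lower bound on repetition, whereas the word-overlap measure tends to overstate it.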

### Analysis of the terms associated with high-value vs low-value questions