## Winning Jeopardy!

![Image](https://media.comicbook.com/2020/12/jeopardy-logo-2019-1250897.jpeg?auto=webp)

### Introduction

[Jeopardy!](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV show in the US where participants answer questions to win money. In this project, we'll look for a strategy that improves our chances of winning.

We'll work with a data set shared on [Reddit](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). It contains **216,930** questions along with each question's show number, air date, round, category, value and answer. Let's take a look.

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | \$200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | \$200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | \$200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | \$200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | \$200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Show Number  216930 non-null  int64         
 1    Air Date    216930 non-null  datetime64[ns]
 2    Round       216930 non-null  object        
 3    Category    216930 non-null  object        
 4    Value       216930 non-null  object        
 5    Question    216930 non-null  object        
 6    Answer      216928 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 11.6+ MB


### Cleaning

Before we start, let's tidy up the data. Several column names have a leading space, so let's strip it.

In [3]:
cols_to_fix = jeopardy.columns
print('Old \n', cols_to_fix)

cols_to_fix = cols_to_fix.str.replace(r'^ ', '', regex=True)
print('Fixed \n',cols_to_fix)

jeopardy.columns = cols_to_fix

Old 
 Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Fixed 
 Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
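
The same cleanup can also be done with `str.strip()`, which removes whitespace from both ends of each name rather than a single leading space. Below is a minimal sketch on a stand-in frame; `demo` is a hypothetical empty DataFrame built only to mirror the padded headers shown above, not the real data set.

```python
import pandas as pd

# Hypothetical stand-in frame with the same padded headers as the data set
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Category',
                             ' Value', ' Question', ' Answer'])

# str.strip() removes leading and trailing whitespace from every column name
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

Unlike the `^ ` pattern, this also handles trailing spaces and several spaces in a row, which may or may not matter for a given file.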


Let's also normalize the strings in the `Question` and `Answer` columns. That means:
* Every word is in lowercase
* All punctuation is removed

The `re` module will help us with that. Two values are missing in the `Answer` column, and calling `.lower()` on a missing value would raise an error, so let's remove those rows first.

In [4]:
import re

jeopardy.dropna(subset=['Answer'], inplace=True)

def clean_qa(string):
    '''
    Take a string and return it in lowercase and
    without any punctuation 
    '''
    string = string.lower()
    string = re.sub(r'[^\w\s]', '', string)
    return string

clean_question = jeopardy['Question'].apply(clean_qa)
clean_answer = jeopardy['Answer'].apply(clean_qa)

print(clean_question[:5], clean_answer[:5])

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: Question, dtype: object 0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: Answer, dtype: object
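
To make the effect of the pattern `[^\w\s]` concrete, here is the same function applied to two short strings taken from the preview above. This is only an illustrative sketch. Note that `\w` also matches digits and the underscore, so those characters are kept.

```python
import re

def clean_qa(string):
    """Lowercase a string and drop everything except word characters and whitespace."""
    string = string.lower()
    return re.sub(r'[^\w\s]', '', string)

# Short samples based on the first rows of the data set
question = clean_qa('No. 2: 1912 Olympian; football star')
answer = clean_qa("McDonald's")

print(question)
print(answer)
```

The apostrophe in "McDonald's" is removed without leaving a gap, which is why the cleaned answer reads `mcdonalds`.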


We can also convert the `Value` column to a numeric type, which makes it easier to work with. To do that, we must first remove the **\$** and **,** characters.

Some entries don't hold a dollar amount and can't be converted, so we'll replace them with 0.

In [20]:
def value_to_int(string):
    '''
    Take a string, remove $ and , signs and try to convert it to int

    If it can't be converted, return 0
    '''
    string = re.sub(r'[$,]', '', string)

    try:
        return int(string)
    except ValueError:
        # Non-numeric entries fall back to 0
        return 0

# The function has its own name so the result below does not overwrite it
clean_value = jeopardy['Value'].apply(value_to_int)
clean_value[:5]

0    200
1    200
2    200
3    200
4    200
Name: Value, dtype: int64
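
As an alternative to a row-by-row function, the conversion can be sketched with vectorized pandas calls. The `values` Series below is made up to mirror the formats in the `Value` column, and the `'None'` placeholder in particular is an assumption about what the non-numeric entries look like.

```python
import pandas as pd

# Made-up sample: two dollar amounts and one non-numeric placeholder (assumed)
values = pd.Series(['$200', '$1,000', 'None'])

# Strip $ and , then convert; anything non-numeric becomes NaN, which is filled with 0
numeric = (
    pd.to_numeric(values.str.replace(r'[$,]', '', regex=True), errors='coerce')
    .fillna(0)
    .astype(int)
)
print(numeric.tolist())
```

One difference to keep in mind: `pd.to_numeric` also accepts decimal strings, which `astype(int)` then truncates, whereas `int()` rejects them and falls back to 0.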