In [1]:
# import libraries
from IPython.display import Image  # for displaying images in markdown cells
import pandas as pd  # Dataframe manipulation
import numpy as np  # Arrays manipulation


In [2]:
%%html
<style>
table {align:left;display:block}  /* align HTML tables to the left */
</style> 

# Dataquest | Hypothesis Testing: Fundamentals <br/> <br/> Project Title: Winning Jeopardy

## 1) Introduction / Jeopardy Questions

#### Background:

Provided by: [Dataquest.io](https://www.dataquest.io/)

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named **jeopardy.csv**, and contains **20000** rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

Columns | Description
--- | ---
Show Number | the Jeopardy episode number
Air Date | the date the episode aired
Round | the round of Jeopardy
Category | the category of the question
Value | the number of dollars the correct answer is worth
Question | the text of the question
Answer | the text of the answer


In [3]:
# read the file into dataframe
df = pd.read_csv('JEOPARDY_CSV.csv')

# review file
print(df.columns,'\n')
df.head()

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object') 



| | Show Number | Air Date | Round | Category | Value | Question | Answer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [4]:
# Strip leading/trailing whitespace from the column names
df.columns = df.columns.str.strip()

# review transformation
print(df.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


## 2) Normalising Text

Provided by: [Dataquest.io](https://www.dataquest.io/)

Before analysing the Jeopardy questions, we need to normalize the text columns (**Question** and **Answer**).

The idea is to lowercase every word and remove punctuation, so that variants such as `Don't` and `don't` are treated as the same word when compared.

In [5]:
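# Write a function to normalize questions and answers.

import re  # to work with regex

def normalise_string(sentence):
    """Lowercase the input and strip punctuation."""
    # str() converts non-string values (e.g. NaN) to text so .lower() does not fail
    sentence_new = str(sentence).lower()
    # drop every character that is not a letter, digit, underscore, or whitespace
    sentence_new = re.sub(r'[^A-Za-z0-9_\s]', '', sentence_new, flags=re.I)
    return sentence_new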

In [6]:
# Normalize the 'Question' column.
df['clean_question'] = df['Question'].apply(normalise_string)

# review transformation
print(df['clean_question'][0], '\n')
print(df['Question'][0], '\n')
df['clean_question'].head()


for the last 8 years of his life galileo was under house arrest for espousing this mans theory 

For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory 



0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [7]:
# Normalize the 'Answer' column.
df['clean_answer'] = df['Answer'].apply(normalise_string)

# review transformation
print(df['clean_answer'][0], '\n')
print(df['Answer'][0], '\n')
df['clean_answer'].head()


copernicus 

Copernicus 



0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
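
As a point of comparison, the same normalisation can probably be expressed with pandas' vectorised string methods instead of `apply`. The snippet below is only a sketch on a toy Series (the `sample` and `clean` names and the sample values are illustrative, not part of the project data):

```python
import pandas as pd

# Toy data standing in for the 'Question' column (illustrative values only).
sample = pd.Series([
    "For the last 8 years of his life, Galileo was under house arrest",
    "No. 2: 1912 Olympian; football star",
])

# Lowercase first, then drop everything that is not a lowercase letter,
# digit, underscore, or whitespace.
clean = sample.str.lower().str.replace(r'[^a-z0-9_\s]', '', regex=True)

print(clean.tolist())
```

One difference worth keeping in mind: the `.str` methods leave missing values as `NaN`, whereas `normalise_string` converts them to the text `nan` through `str()`.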

## 3) Normalising Columns

Provided by: [Dataquest.io](https://www.dataquest.io/)

Some other columns also need normalising.

The **Value** column should be numeric to make it easier to manipulate. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The **Air Date** column should be a datetime rather than a string, which makes it easier to work with.
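
To make the two conversions concrete, here is a minimal sketch on a toy DataFrame. It assumes that values may contain a thousands separator and that unparsable entries should become 0; the `toy` frame, its sample values, and the `clean_value` column name are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy frame with made-up rows; 'None' stands in for an unparsable value.
toy = pd.DataFrame({
    'Value': ['$200', '$1,000', 'None'],
    'Air Date': ['2004-12-31', '2004-12-31', '2010-07-06'],
})

# Remove '$' and ',' and convert to numbers; errors='coerce' turns anything
# unparsable into NaN, which is then replaced by 0.
toy['clean_value'] = (
    pd.to_numeric(
        toy['Value'].str.replace(r'[$,]', '', regex=True),
        errors='coerce',
    )
    .fillna(0)
    .astype(int)
)

# Parse the date strings into datetime values.
toy['Air Date'] = pd.to_datetime(toy['Air Date'])

print(toy.dtypes)
```

Once `Air Date` is a datetime column, accessors such as `toy['Air Date'].dt.year` become available for filtering or grouping by date.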

In [8]:
# Write a function to normalize dollar values.
# Assign 0 if the conversion fails.


