# This Is Jeopardy!
This project contains a series of open-ended requirements which describe the project you’ll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem that you cannot easily solve.

#### Project Goals
- You will work to write several functions that investigate a dataset of Jeopardy! questions and answers. 
- Filter the dataset for topics that you’re interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

In [2]:
df = pd.read_csv('jeopardy.csv')
df.head(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe


In [3]:
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
#rename columns
df.columns = ['show_num', 'show_date', 'round', 'category', 'value', 'question', 'answer']
df.columns

Index(['show_num', 'show_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

In [5]:
print("num answers : ",df['answer'].nunique())
print(df['answer'].unique())

num answers :  88268
['Copernicus' 'Jim Thorpe' 'Arizona' ... 'Anaïs Nin' 'a titmouse'
 'Grigori Alexandrovich Potemkin']


In [6]:
# convert the category names to lowercase
df['category'] = df['category'].apply(lambda x: x.lower())
print("num categories : ", df['category'].nunique())
print(df['category'].unique())

num categories :  27916
['history' "espn's top 10 all-time athletes" 'everybody talks about it...'
 ... 'off-broadway' 'riddle me this' 'authors in their youth']


In [7]:
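# filter the dataset down to questions that contain every word in a list
# (case-insensitive substring match, so 'king' also matches 'Viking')
def filter_data(data, words):
    contains_all = lambda x: all(word.lower() in x.lower() for word in words)
    return data.loc[data['question'].apply(contains_all)]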
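
Because `filter_data` tests for substrings, the word `king` is also found inside `Viking`. If whole-word matches are wanted, one possible variant uses regex word boundaries. This is only a sketch: `filter_whole_words` and the two toy questions are made up for illustration and are not part of the project.

```python
import re
import pandas as pd

def filter_whole_words(data, words, column="question"):
    """Keep rows whose `column` contains every word as a whole word (case-insensitive)."""
    patterns = [re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE) for w in words]
    mask = data[column].apply(lambda text: all(p.search(text) for p in patterns))
    return data.loc[mask]

# Toy rows for illustration only (not taken from jeopardy.csv).
sample = pd.DataFrame({
    "question": [
        'This "Unready" king of England retaliated against Viking raids',
        "The Vikings reached England long before this explorer",
    ]
})

strict = filter_whole_words(sample, ["king", "england"])
print(strict["question"].tolist())
```

The trade-off is that whole-word matching no longer catches plurals: `viking` would not match `Vikings` with this pattern.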

In [8]:
# testing the filter function
filtered = filter_data(df, ['king', 'england', 'viking'])
print('num of questions that contain those words : ', filtered['question'].nunique())
print('\n', filtered['question'])
print('\n',filtered[['question','value']])

num of questions that contain those words :  5

 6337               In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man
74611           In 1497 this Venetian sailing for England became the first European since the Vikings to reach N. America
148050                         In the late 800s, this king of Wessex prevented the Vikings from conquering all of England
183694    By 878 the Vikings had conquered all of England except for this southern kingdom controlled by Alfred the Great
208295                 In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England
Name: question, dtype: object

                                                                                                                question  \
6337             In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man   
74611         In 1497 this Venetian sailing for England became 

In [9]:
# check value
print("num values : ", df['value'].nunique())
print(df['value'].unique())

num values :  150
['$200' '$400' '$600' '$800' '$2,000' '$1000' '$1200' '$1600' '$2000'
 '$3,200' 'None' '$5,000' '$100' '$300' '$500' '$1,000' '$1,500' '$1,200'
 '$4,800' '$1,800' '$1,100' '$2,200' '$3,400' '$3,000' '$4,000' '$1,600'
 '$6,800' '$1,900' '$3,100' '$700' '$1,400' '$2,800' '$8,000' '$6,000'
 '$2,400' '$12,000' '$3,800' '$2,500' '$6,200' '$10,000' '$7,000' '$1,492'
 '$7,400' '$1,300' '$7,200' '$2,600' '$3,300' '$5,400' '$4,500' '$2,100'
 '$900' '$3,600' '$2,127' '$367' '$4,400' '$3,500' '$2,900' '$3,900'
 '$4,100' '$4,600' '$10,800' '$2,300' '$5,600' '$1,111' '$8,200' '$5,800'
 '$750' '$7,500' '$1,700' '$9,000' '$6,100' '$1,020' '$4,700' '$2,021'
 '$5,200' '$3,389' '$4,200' '$5' '$2,001' '$1,263' '$4,637' '$3,201'
 '$6,600' '$3,700' '$2,990' '$5,500' '$14,000' '$2,700' '$6,400' '$350'
 '$8,600' '$6,300' '$250' '$3,989' '$8,917' '$9,500' '$1,246' '$6,435'
 '$8,800' '$2,222' '$2,746' '$10,400' '$7,600' '$6,700' '$5,100' '$13,200'
 '$4,300' '$1,407' '$12,400' '$5,401' '$7,800

In [10]:
# adding a new column: strip '$' and ',' from value and map 'None' to 0
df['float_value'] = df['value'].apply(lambda x: float(x[1:].replace(",","")) if x != "None" else 0)
print(df['float_value'].unique())

[2.000e+02 4.000e+02 6.000e+02 8.000e+02 2.000e+03 1.000e+03 1.200e+03
 1.600e+03 3.200e+03 0.000e+00 5.000e+03 1.000e+02 3.000e+02 5.000e+02
 1.500e+03 4.800e+03 1.800e+03 1.100e+03 2.200e+03 3.400e+03 3.000e+03
 4.000e+03 6.800e+03 1.900e+03 3.100e+03 7.000e+02 1.400e+03 2.800e+03
 8.000e+03 6.000e+03 2.400e+03 1.200e+04 3.800e+03 2.500e+03 6.200e+03
 1.000e+04 7.000e+03 1.492e+03 7.400e+03 1.300e+03 7.200e+03 2.600e+03
 3.300e+03 5.400e+03 4.500e+03 2.100e+03 9.000e+02 3.600e+03 2.127e+03
 3.670e+02 4.400e+03 3.500e+03 2.900e+03 3.900e+03 4.100e+03 4.600e+03
 1.080e+04 2.300e+03 5.600e+03 1.111e+03 8.200e+03 5.800e+03 7.500e+02
 7.500e+03 1.700e+03 9.000e+03 6.100e+03 1.020e+03 4.700e+03 2.021e+03
 5.200e+03 3.389e+03 4.200e+03 5.000e+00 2.001e+03 1.263e+03 4.637e+03
 3.201e+03 6.600e+03 3.700e+03 2.990e+03 5.500e+03 1.400e+04 2.700e+03
 6.400e+03 3.500e+02 8.600e+03 6.300e+03 2.500e+02 3.989e+03 8.917e+03
 9.500e+03 1.246e+03 6.435e+03 8.800e+03 2.222e+03 2.746e+03 1.040e+04
 7.600

In [11]:
# mean float value
print('The average value is ', df['float_value'].mean())

The average value is  739.9884755451067
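
Note that mapping `'None'` to 0 pulls this average down, since those rows are counted as zero-value questions. A minimal sketch of the alternative, treating `'None'` as missing, is shown here on a few toy strings in the same format as the `value` column (the strings are invented, not rows from the dataset).

```python
import pandas as pd

# Toy values in the same string format as the `value` column (not real rows).
values = pd.Series(["$200", "$1,000", "None", "$400"])

# Strip "$" and ","; errors="coerce" turns "None" into NaN instead of 0.
numeric = pd.to_numeric(
    values.str.replace("$", "", regex=False).str.replace(",", "", regex=False),
    errors="coerce",
)

mean_with_zero = numeric.fillna(0).mean()  # "None" counted as 0
mean_skipping = numeric.mean()             # NaN is skipped by default
print(mean_with_zero, mean_skipping)
```

Which of the two averages is the right one depends on what a `'None'` value means for the question, so this is a design choice rather than a correction.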


In [12]:
# filtering the dataset and finding the average value of those questions
filtered = filter_data(df, ['king', 'england', 'viking'])
print('the average value of those questions is ',filtered['float_value'].mean())

the average value of those questions is  1080.0


In [13]:
# a function to count how often each answer appears in a set of data
def answer_count(data):
    return data['answer'].value_counts()

In [14]:
# testing the function
print(answer_count(filtered))

Wessex                   1
John Cabot               1
Alfred the Great         1
Ethelred                 1
William the Conqueror    1
Name: answer, dtype: int64


## Explore

- Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word <b>"Computer"</b> compared to questions from the 2000s?

In [15]:
from datetime import datetime

day = lambda x:datetime.strptime(x, "%Y-%m-%d")
df['date'] = df['show_date'].apply(day)

df['decade']= ((pd.DatetimeIndex(df['date']).year)//10)*10
df['decade'].value_counts().reset_index()

Unnamed: 0,index,decade
0,2000,123852
1,1990,56745
2,2010,28225
3,1980,8108
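
The same decade column can probably be built without `datetime.strptime`, using pandas' own date parsing. A short sketch on toy dates in the same `%Y-%m-%d` format (the `toy` frame is an assumption made for illustration):

```python
import pandas as pd

# Toy dates in the same "%Y-%m-%d" format as `show_date`.
toy = pd.DataFrame({"show_date": ["2004-12-31", "1997-03-05", "1989-11-20"]})

# Parse the whole column at once, then floor each year to its decade.
toy["date"] = pd.to_datetime(toy["show_date"], format="%Y-%m-%d")
toy["decade"] = toy["date"].dt.year // 10 * 10
print(toy["decade"].tolist())
```

The `.dt` accessor works on the parsed column directly, so no intermediate `DatetimeIndex` is needed.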


In [16]:
import re

# Write a function that filters the dataset for questions that contain all of the words in a list of words.
# For example, when the list ["King", "England"] is passed to our function, every row has the strings "King" and "England" somewhere in its question.
# Step 1: concatenate the words into a regex made of one lookahead per word
def concat(words):
    new_string = ''
    for word in words:
        # re.escape keeps characters such as '.' or '(' from being read as regex syntax
        new_string += '(?=.*' + re.escape(word) + ')'
    return new_string

# Step 2: filter
def filter_func(words, data):
    pattern = concat(words)
    new_df = data[data.question.str.contains(pattern, case=False, regex=True)]
    return new_df

filtered_df_2 = filter_func(['computer'], df)
result_2 = filtered_df_2.groupby('decade').show_num.count()
print(result_2)

decade
1980      6
1990     98
2000    268
2010     59
Name: show_num, dtype: int64


The word `computer` appears in 98 questions from the 1990s, compared with 268 questions from the 2000s.
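
Raw counts are hard to compare because the dataset holds far more questions from the 2000s than from the 1990s. One way to account for that is to divide by the number of questions per decade. The sketch below reuses the counts printed in the two outputs above; the variable names are made up for illustration.

```python
import pandas as pd

# Counts copied from the decade and 'computer' outputs above.
questions_per_decade = pd.Series({1980: 8108, 1990: 56745, 2000: 123852, 2010: 28225})
computer_per_decade = pd.Series({1980: 6, 1990: 98, 2000: 268, 2010: 59})

# Share of questions mentioning 'computer', in percent.
share_pct = (computer_per_decade / questions_per_decade * 100).round(3)
print(share_pct)
```

By this measure the word shows up in roughly 0.17% of 1990s questions and 0.22% of 2000s questions, so the increase is real but much smaller than the raw counts suggest.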

- Is there a connection between the round and the category? Are you more likely to find certain categories, like "Literature" in Single Jeopardy or Double Jeopardy?

In [17]:
category = df['category'].value_counts()
print(category)

before & after                                       547
science                                              519
literature                                           496
american history                                     418
potpourri                                            401
world history                                        377
word origins                                         371
colleges & universities                              351
history                                              349
sports                                               342
u.s. cities                                          339
world geography                                      338
bodies of water                                      327
animals                                              324
state capitals                                       314
business & industry                                  311
islands                                              301
world capitals                 

In [None]:
import numpy as np

# grouping to 'other'
# category.groupby(np.where(category>=100,category.index,'other')).sum()#.plot.pie()

In [None]:
# category.groupby(np.where(category>=100,category.index,'other')).sum().plot.pie()
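
The category counts above cover the whole dataset and do not yet separate the rounds. One possible way to compare them is a cross-tabulation of `category` against `round`. The sketch uses toy rows; the round label `Double Jeopardy!` is an assumption about how the dataset spells it.

```python
import pandas as pd

# Toy rows using the renamed `round` and `category` columns (not real data).
toy = pd.DataFrame({
    "round": ["Jeopardy!", "Double Jeopardy!", "Double Jeopardy!", "Jeopardy!"],
    "category": ["literature", "literature", "literature", "history"],
})

# Rows are categories, columns are rounds, cells are question counts.
round_counts = pd.crosstab(toy["category"], toy["round"])
print(round_counts)
```

On the real data, something like `pd.crosstab(df['category'], df['round']).loc['literature']` would presumably show how the literature questions split across rounds.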

- Build a system to quiz yourself. Grab random questions, and use the input function to get a response from the user. Check to see if that response was right or wrong.

In [17]:
import random 

def random_question(data):
    correct = 0
    incorrect = 0
    keep_playing = True
    while keep_playing:
        # pick a row by position: randrange never returns len(data),
        # and iloc also works when data is a filtered subset
        row = data.iloc[random.randrange(len(data))]
        print(row['question'])
        x = input()
        if (x.lower() == row['answer'].lower()):
            correct +=1
            print('Correct ! You have '+ str(correct) + ' correct answers and ' + str(incorrect) + ' incorrect answers')
        else:
            incorrect +=1
            print('Incorrect ! the right answer is '+ row['answer'].lower())
            print('You have '+ str(correct) + ' correct answers and ' + str(incorrect) + ' incorrect answers')
        print('Do you want to answer another question ?')
        # any reply other than 'yes' ends the quiz
        keep_playing = input().lower() == 'yes'

random_question(df)

Gerty Theresa Cori won a Nobel Prize for finding an enzyme that helps the body turn this into sugar
yes
Incorrect ! the right answer is starch
You have 0 correct answers and 1 incorrect answers
Do you want to answer another question ?
yes
With 200 victories, he's won more NASCAR races than any other driver
no
Incorrect ! the right answer is richard petty
You have 0 correct answers and 2 incorrect answers
Do you want to answer another question ?
yes
As you can gather from its name, Lake Texoma is shared by these 2 states
no ide
Incorrect ! the right answer is texas & oklahoma
You have 0 correct answers and 3 incorrect answers
Do you want to answer another question ?
no
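
The quiz compares the response and the answer as exact lowercase strings, so `Texas and Oklahoma` would be marked wrong against `texas & oklahoma`. A more forgiving comparison could look like the sketch below. `normalize_answer` and `is_correct` are hypothetical helpers, not part of the project, and the rules they apply are only one reasonable choice.

```python
import re

def normalize_answer(text):
    """Lowercase, spell out '&', drop punctuation and a leading article, collapse spaces."""
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", "", text)        # remove punctuation
    text = re.sub(r"^\s*(a|an|the)\s+", "", text)  # remove a leading article
    return " ".join(text.split())

def is_correct(response, answer):
    """Compare a user's response with the stored answer after normalizing both."""
    return normalize_answer(response) == normalize_answer(answer)

print(is_correct("Texas and Oklahoma", "texas & oklahoma"))
print(is_correct("titmouse", "a titmouse"))
print(is_correct("yes", "starch"))
```

Accented characters are simply dropped by this sketch, so an answer such as `Anaïs Nin` would still need extra handling.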
