# This Is Jeopardy!


## Project Goals

We will write several functions that investigate a dataset of Jeopardy! questions and answers: filter the dataset for topics that we’re interested in, compute the average difficulty of those questions, and train to become the next Jeopardy! champion!

We’ve been provided a CSV file named jeopardy.csv that contains data about the game show Jeopardy!. First, we need to load the data into a DataFrame and investigate its contents.

In [47]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

# Loading the data and investigating it
df = pd.read_csv('jeopardy.csv')
print(df.head())

print(df.columns)
print(df.info())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                                                                                      Question  \
0             For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory   
1  No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves   
2                     The city of Yuma in this state has a record average of 4,055 hours of sunshine each year   
3                         In 1963, live on "The Art Linkl

The column names all have a leading space. I'll rename them to make life easier for the rest of the project.

In [48]:
#renaming misformatted columns
df.columns = ['Show number', 'Air date', 'Round', 'Category', 'Value', 'Question', 'Answer']

print(df.columns)
print(df['Question'])

Index(['Show number', 'Air date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
0                               For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
1                    No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves
2                                       The city of Yuma in this state has a record average of 4,055 hours of sunshine each year
3                                           In 1963, live on "The Art Linkletter Show", this company served its billionth burger
4                       Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States
                                                                   ...                                                          
216925                                                This Puccini opera turns on the solution to 3 riddles po
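
As an alternative to typing out every column name, the leading spaces could be stripped programmatically. This is a minimal sketch on a made-up two-row DataFrame (the `sample` data is hypothetical, not taken from jeopardy.csv); unlike the cell above, it only removes whitespace and does not change capitalization.

```python
import pandas as pd

# Hypothetical sample whose column names mimic the leading-space problem
sample = pd.DataFrame(
    [[4680, "2004-12-31", "HISTORY"], [4680, "2004-12-31", "THE COMPANY LINE"]],
    columns=["Show Number", " Air Date", " Category"],
)

# str.strip is applied to each column name, removing surrounding whitespace
cleaned = sample.rename(columns=str.strip)
print(list(cleaned.columns))
```

This approach keeps working if the file gains or loses columns, whereas a hard-coded list must match the number of columns exactly.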

I'll write a function that filters the dataset for questions that contain all of the words in a list of words. For example, when the list ["King", "England"] is passed to my function, the function returns a DataFrame with far fewer rows. Every row will have the strings "King" and "England" somewhere in its "Question". The match ignores case and works on substrings, so "king" inside a longer word also counts.

In [49]:
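#filtering the dataset by a list of words
def filter_function(data, words):
    # True when every word appears (ignoring case) as a substring of the question.
    # Named contains_all to avoid shadowing the built-in filter().
    contains_all = lambda question: all(word.lower() in question.lower() for word in words)
    return data.loc[data['Question'].apply(contains_all)]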

filter_function(df, ['King', 'England'])

| | Show number | Air date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 4953 | 3003 | 1997-09-24 | Double Jeopardy! | "PH"UN WORDS | $200 | Both England's King George V & FDR put their stamp of approval on this "King of Hobbies" | Philately (stamp collecting) |
| 6337 | 3517 | 1999-12-14 | Double Jeopardy! | Y1K | $800 | In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man | Ethelred |
| 9191 | 3907 | 2001-09-04 | Double Jeopardy! | WON THE BATTLE | $800 | This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt | Henry V |
| 11710 | 2903 | 1997-03-26 | Double Jeopardy! | BRITISH MONARCHS | $600 | This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom" | James I |
| 13454 | 4726 | 2005-03-07 | Jeopardy! | A NUMBER FROM 1 TO 10 | $1000 | It's the number that followed the last king of England named William | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 208295 | 4621 | 2004-10-11 | Jeopardy! | THE VIKINGS | $600 | In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England | William the Conqueror |
| 208742 | 4863 | 2005-11-02 | Double Jeopardy! | BEFORE & AFTER | $3,000 | Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish | William of Orange roughy |
| 213870 | 5856 | 2010-02-15 | Double Jeopardy! | URANUS | $1600 | In 1781 William Herschel discovered Uranus & initially named it after this king of England | George III |
| 216021 | 1881 | 1992-11-09 | Double Jeopardy! | HISTORIC NAMES | $1000 | His nickname was "Bertie", but he used this name & number when he became king of England in 1901 | Edward VII |
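
Row 208295 in the output above matches only because "Viking" contains "king". If whole-word matches are wanted instead, one possible variant uses regular-expression word boundaries. This is a sketch, not part of the original project: `filter_whole_words` and the three-row `sample` are hypothetical names and data introduced for illustration.

```python
import re
import pandas as pd

def filter_whole_words(data, words, column="Question"):
    """Keep rows whose text contains every word as a whole word, ignoring case."""
    # \b marks a word boundary, so "king" no longer matches inside "Viking"
    patterns = [re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE) for word in words]
    mask = data[column].apply(lambda text: all(p.search(text) for p in patterns))
    return data.loc[mask]

# Hypothetical sample built from question fragments shown above
sample = pd.DataFrame({"Question": [
    "This king of England beat the odds to trounce the French",
    "The last Viking invasion of England",
    "England's King George V",
]})

whole_word_matches = filter_whole_words(sample, ["King", "England"])
print(whole_word_matches)
```

The trade-off is that plurals and other inflections, such as "kings", are treated as different words and would need their own entries in the list.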


Now that we can filter the dataset of questions, I'll add a new column that contains the value of each question as a float and use it to find the “difficulty” of certain topics. For example, what is the average value of questions that contain the word "King"?

In [50]:
#Adding a new column. If the value is 'None' I'll replace it with 0. If not, I'll remove the $ sign from the beginning of
#the value, remove the commas, and then cast that value to a float.
df["Float Value"] = df["Value"].apply(lambda x: float(x[1:].replace(',','')) if x != "None" else 0)

# Filtering the dataset and finding the average value of those questions
filtered = filter_function(df, ["King"])
print(filtered["Float Value"].mean())


771.8833850722094
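
The lambda above assumes every missing value is the literal string "None". Depending on how the file is read, missing values might instead arrive as NaN, in which case slicing with `x[1:]` would fail. The sketch below is one way to handle both cases; `parse_value` and the sample Series are hypothetical and only illustrate the idea.

```python
import pandas as pd

def parse_value(raw):
    """Convert a string such as '$2,000' to 2000.0; return 0.0 for missing values."""
    # Treat both real missing values (None/NaN) and the text "None" as missing
    if pd.isna(raw) or raw == "None":
        return 0.0
    return float(str(raw).replace("$", "").replace(",", ""))

# Hypothetical sample covering the formats seen in the Value column
values = pd.Series(["$200", "$3,000", "None", None])
parsed = values.apply(parse_value)
print(parsed.tolist())
```

Note that counting missing values as 0 pulls the mean down. Returning `float("nan")` instead would be an alternative design, since `mean()` skips NaN by default.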


Now I can write a function that returns how many times each unique answer appears across all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word "King", I could then count the unique answers to those questions.

In [51]:
# A function to find the unique answers of a set of data
def get_answer_counts(data):
    return data["Answer"].value_counts()

# Testing the answer count function
print(get_answer_counts(filtered))


Henry VIII            55
Solomon               35
Richard III           33
Louis XIV             31
David                 30
                      ..
Goneril and Regan      1
Lamentations           1
knee-jerk reaction     1
Mongolia               1
10 Downing Street      1
Name: Answer, Length: 5268, dtype: int64


## Let's explore.

Here are some ideas on ways to continue working with this data:

#### Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word "Computer" compared to questions from the 2000s?

In [52]:
#using the filter_function() to obtain a dataframe whose 'Question' column contains the word 'computer'.
computer_contain_df = filter_function(df, ['Computer'])
#print(computer_contain_df.head())
print('Number of rows in the df dataframe that contains the word "Computer" is: ' + str(len(computer_contain_df)))

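#filtering the computer_contain_df by the decades.
def computer_decades(data, decades):
    # '1990s'[:3] is '199', which matches the first three characters of dates such as '1997-09-24'
    filtered = data[data['Air date'].str[:3] == decades[:3]]
    print('Number of rows in the df dataframe that contains the word "Computer" in {} is: {}\n'.format(decades, len(filtered)))
    # return the filtered rows so the variables assigned below hold DataFrames instead of None
    return filtered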

computer_df_1980s = computer_decades(computer_contain_df, '1980s')

computer_df_1990s = computer_decades(computer_contain_df, '1990s')

computer_df_2000s = computer_decades(computer_contain_df, '2000s')

computer_df_2010s = computer_decades(computer_contain_df, '2010s')

Number of rows in the df dataframe that contains the word "Computer" is: 431
Number of rows in the df dataframe that contains the word "Computer" in 1980s is: 6

Number of rows in the df dataframe that contains the word "Computer" in 1990s is: 98

Number of rows in the df dataframe that contains the word "Computer" in 2000s is: 268

Number of rows in the df dataframe that contains the word "Computer" in 2010s is: 59
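
The per-decade counts could also be produced in a single pass by parsing the dates and grouping on the decade. This is a rough sketch under the assumption that the dates are in `YYYY-MM-DD` form as shown earlier; `count_by_decade` and the four-row `sample` are hypothetical.

```python
import pandas as pd

def count_by_decade(data, date_column="Air date"):
    """Return the number of rows per decade, indexed by the decade's first year."""
    years = pd.to_datetime(data[date_column]).dt.year
    decades = (years // 10) * 10  # e.g. 1997 -> 1990
    return decades.value_counts().sort_index()

# Hypothetical sample of air dates
sample = pd.DataFrame({"Air date": ["1989-05-01", "1994-02-11", "1997-09-24", "2004-12-31"]})

decade_counts = count_by_decade(sample)
print(decade_counts)
```

Raw counts may partly reflect how many questions the dataset holds for each decade, so dividing by the result of `count_by_decade(df)` on the full dataset would give a fairer comparison.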



#### Is there a connection between the round and the category? Are you more likely to find certain categories, like "Literature" in Single Jeopardy or Double Jeopardy?

In [53]:
#let's see how many unique categories are in the dataframe.
unique_categories_no = len(df.Category.unique())
print('There are {} unique categories of questions in the dataframe.'.format(unique_categories_no), '\n')

#creating a dataframe grouped by round and category.
round_category_df = df.groupby(['Round', 'Category']).Question.count().reset_index()
print(round_category_df)
#creating a dataframe for literature category only.
literature_round_df = round_category_df[round_category_df.Category == 'LITERATURE'].reset_index()
print(literature_round_df)

There are 27995 unique categories of questions in the dataframe. 

                  Round                 Category  Question
0      Double Jeopardy!                  "-ARES"         5
1      Double Jeopardy!            "...OD" WORDS         5
2      Double Jeopardy!            "1", "2", "3"         5
3      Double Jeopardy!           "20" QUESTIONS         5
4      Double Jeopardy!                  "A" + 4         5
...                 ...                      ...       ...
31681         Jeopardy!                  “NORTH”         5
31682         Jeopardy!                “STREETS”         5
31683        Tiebreaker             CHILD'S PLAY         1
31684        Tiebreaker      LITERARY CHARACTERS         1
31685        Tiebreaker  THE AMERICAN REVOLUTION         1

[31686 rows x 3 columns]
   index             Round    Category  Question
0   7495  Double Jeopardy!  LITERATURE       381
1  15633   Final Jeopardy!  LITERATURE        10
2  24351         Jeopardy!  LITERATURE       105
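
In the output above, 381 of the 496 LITERATURE questions (about 77%) come from Double Jeopardy!. To compare categories on the same footing, the counts can be turned into shares per category. The following sketch uses `pd.crosstab` on a small made-up DataFrame; the `sample` data is hypothetical and only demonstrates the technique.

```python
import pandas as pd

# Hypothetical sample with the same two columns as the real dataset
sample = pd.DataFrame({
    "Round": ["Jeopardy!", "Jeopardy!", "Double Jeopardy!",
              "Double Jeopardy!", "Double Jeopardy!", "Final Jeopardy!"],
    "Category": ["LITERATURE", "HISTORY", "LITERATURE",
                 "LITERATURE", "HISTORY", "LITERATURE"],
})

# normalize="index" divides each row by its total, so every category sums to 1
share_by_round = pd.crosstab(sample["Category"], sample["Round"], normalize="index")
print(share_by_round)
```

On the full dataset the same call would be `pd.crosstab(df["Category"], df["Round"], normalize="index")`. Categories with only a handful of questions will show extreme shares, so filtering to frequent categories first is probably sensible.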
