In [2]:
import pandas as pd

# display the full contents of a column
pd.set_option('display.max_colwidth', None)

__1\.__ Load the data into a DataFrame and investigate its contents. Try to print out specific columns.

In [3]:
# import CSV file
df = pd.read_csv('jeopardy.csv')

__2\.__ There’s something odd about the column names. After you figure out the problem with the column names, you may want to rename them to make your life easier for the rest of the project.

In [4]:
# inspect df
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams


In [5]:
# basic stats
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [6]:
# inspect col names
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

As the output above shows, almost all column names have __leading whitespace__.

One way to remove this whitespace is to just __rename the columns__.

In [7]:
df.rename(columns={' Air Date': 'Date',
                   ' Round': 'Round',
                   ' Category': 'Category',
                   ' Value': 'Value',
                   ' Question': 'Question',
                   ' Answer': 'Answer'},
          inplace=True)

df.columns

Index(['Show Number', 'Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
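
Renaming each column by hand works, but the whitespace can also be stripped from every name in one step. The sketch below is only an illustration: `demo` is a hypothetical empty DataFrame with the same kind of leading-space headers, not the Jeopardy data itself.

```python
import pandas as pd

# Hypothetical stand-in with the same leading-space headers as the CSV
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Question'])

# Strip leading and trailing whitespace from every column name at once
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

Note that this only removes the whitespace. Shortening `Air Date` to `Date`, as done above, would still need a `rename`.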

__3\.__ Write a function that filters the dataset for questions that contain all of the words in a list of words.

For example, when the list ["King", "England"] was passed to our function, the function returned a DataFrame of 152 rows. Every row had the strings "King" and "England" somewhere in its " Question".

Test your function by printing out the column containing the question of each row of the dataset.

See the `pandas.Series.str.contains()` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) and this GeeksforGeeks [article](https://www.geeksforgeeks.org/python-pandas-series-str-contains/).

In [8]:
def find_question(my_list):
    """Find questions that contain both of the first two words in my_list."""

    # keep only the rows whose question contains both words (case-sensitive)
    new_df = df[(df['Question'].str.contains(my_list[0], na=False)) & (df['Question'].str.contains(my_list[1], na=False))]
    
    print("There are {} questions.".format(len(new_df)))
    return new_df

find_question(['King', 'England'])

There are 49 questions.


Unnamed: 0,Show Number,Date,Round,Category,Value,Question,Answer
4953,3003,1997-09-24,Double Jeopardy!,"""PH""UN WORDS",$200,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",Philately (stamp collecting)
14912,2832,1996-12-17,Jeopardy!,WORLD HISTORY,$100,"This country's King Louis IV was nicknamed ""Louis From Overseas"" because he was raised in England",France
21511,4650,2004-11-19,Jeopardy!,"THE ""O.C.""",$1000,this man and his son ruled England following the execution of King Charles I,Oliver Cromwell
23810,4862,2005-11-01,Jeopardy!,NAME THE YEAR,$400,William the Conqueror was crowned King of England in Westminster Abbey on Christmas Day in this year,1066
27555,1799,1992-05-28,Double Jeopardy!,HISTORIC IN-LAWS,$600,This member of the Medici family was the mother-in-law of England's King Charles I,Marie de Medici
33294,5589,2008-12-18,Jeopardy!,THE BAYEUX TAPESTRY,$600,"(Sarah of the Clue Crew delivers the clue from the Bayeux Cathedral in France.) Despite taking an oath to assure another succession, Harold is crowned King of England; the tapestry indicated it was <a href=""http://www.j-archive.com/media/2008-12-18_J_08.jpg"" target=""_blank"">Stigant</a>, this archbishop, who performed the ceremony",the Archbishop of Canterbury
41148,5925,2010-05-21,Double Jeopardy!,TREATIES,$1600,"This French king recognized William of Orange as William II, King of England, under the terms of 1697's Treaty of Ryswick",Louis XIV
41357,2751,1996-07-15,Jeopardy!,HISTORIC NAMES,$400,"England's King Henry VIII had 3 wives named Catherine: Catherine Howard, Catherine of Aragon & this one",Catherine Parr
43122,3937,2001-10-16,Double Jeopardy!,"OH, HENRY!",$400,The father of England's King Edward VI,Henry VIII
47814,4365,2003-07-18,Double Jeopardy!,POTIONS,$2000,This steak sauce was created for King George IV of England,A1


In [9]:
# test function with other words
find_question(['last', 'fish'])

There are 6 questions.


Unnamed: 0,Show Number,Date,Round,Category,Value,Question,Answer
45769,4197,2002-11-26,Jeopardy!,"SOMETHING ""OLD"", SOMETHING ""NEW""",$200,This fish story of 1952 was one of the last works Hemingway published during his lifetime,"""The Old Man and the Sea"""
51428,4706,2005-02-07,Jeopardy!,THE LAST KING,$800,"Victor Emmanuel II didn't mind being the last king of this ""fishy"" Italian island; he got to be the first king of Italy",Sardinia
72188,5100,2006-11-10,Double Jeopardy!,SLAW & ORDER,$2000,"If you're having gravlax, the last 3 letters should tell you that you're eating this 6-letter fish",salmon
102887,2055,1993-07-09,Jeopardy!,FOOD,$300,"The last name of a nursery rhyme Jack, or a fish that's so high in fat he couldn't eat it",Sprat
128057,5403,2008-02-20,Jeopardy!,CAFETERIA,$600,"Caesar salad--great! As long as it doesn't include pieces of these small salty fish, like last time",anchovies
190289,3838,2001-04-18,Double Jeopardy!,LET'S GET MOVING,$200,"The Blueback, Bonefish & Barbel were the U.S. Navy's last new diesel-powered subs before the move to these",Nuclear power


__Codecademy's Solution__

In [13]:
# Filtering a dataset by a list of words
def filter_data(data, words):
  # Lowercases all words in the list of words as well as the questions. Returns True if all of the words in the list appear in the question.
  filter = lambda x: all(word.lower() in x.lower() for word in words)
  # Applies the lambda function to the Question column and returns the rows where the function returned True
  return data.loc[data["Question"].apply(filter)]

# Testing the filter function
filtered = filter_data(df, ["King", "England"])
pd.DataFrame(filtered["Question"])

Unnamed: 0,Question
4953,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies"""
6337,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man"
9191,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt
11710,"This Scotsman, the first Stuart king of England, was called ""The Wisest Fool in Christendom"""
13454,It's the number that followed the last king of England named William
...,...
208295,In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England
208742,Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish
213870,In 1781 William Herschel discovered Uranus & initially named it after this king of England
216021,"His nickname was ""Bertie"", but he used this name & number when he became king of England in 1901"


__4\.__ Test your original function with a few different sets of words to try to __find some ways your function breaks__. Edit your function so it is more robust.

For example, think about __capitalization__. We probably want to find questions that contain the word `"King"` or `"king"`.

You may also want to check to __make sure you don’t find rows that contain substrings of your given words__. 

For example, our function found a question that didn’t contain the word `"king"`. It did, however, contain the word `"viking"`, so the function found the `"king"` inside `"viking"`.

Note that this also comes with some drawbacks — you would no longer find questions that contained words like `"England's"`.

In [28]:
import re

def find_question(my_list):
    """Find questions that contain both of the first two words in my_list."""
    
    # pad each word with whitespace (\s) so it is not matched inside a longer word;
    # raw strings keep Python from treating '\s' as an invalid escape sequence
    pattern1 = r'\s' + my_list[0] + r'\s'
    pattern2 = r'\s' + my_list[1] + r'\s'
    # capitalization issue addressed using 'case=False'
    new_df = df[(df['Question'].str.contains(pat=pattern1, na=False, case=False, regex=True)) &
                (df['Question'].str.contains(pat=pattern2, na=False, case=False, regex=True))]
    
    print("There are {} questions.".format(len(new_df)))
    return new_df

find_question(['King', 'England'])

There are 51 questions.


Unnamed: 0,Show Number,Date,Round,Category,Value,Question,Answer,Float Value,Difficulty
6337,3517,1999-12-14,Double Jeopardy!,Y1K,$800,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man",Ethelred,800.0,800.0
9191,3907,2001-09-04,Double Jeopardy!,WON THE BATTLE,$800,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt,Henry V,800.0,800.0
13454,4726,2005-03-07,Jeopardy!,A NUMBER FROM 1 TO 10,$1000,It's the number that followed the last king of England named William,4,1000.0,1000.0
18076,3227,1998-09-22,Double Jeopardy!,WORLD HISTORY,$1000,In 1199 this crusader king of England was mortally wounded while besieging the castle of Chalus,Richard the Lionhearted,1000.0,1000.0
19168,3109,1998-02-19,Jeopardy!,HISTORIC WORLD LEADERS,$300,"He was the only king of England to have ""The Great"" tacked on to his name",Alfred,300.0,300.0
21511,4650,2004-11-19,Jeopardy!,"THE ""O.C.""",$1000,this man and his son ruled England following the execution of King Charles I,Oliver Cromwell,1000.0,1000.0
23810,4862,2005-11-01,Jeopardy!,NAME THE YEAR,$400,William the Conqueror was crowned King of England in Westminster Abbey on Christmas Day in this year,1066,400.0,400.0
23979,4664,2004-12-09,Double Jeopardy!,MEDIEVAL TIMES,$2000,"This ""unready"" king of England lost most of his country to Sven Forkbeard, the king of Denmark",Aethelred the Unready,2000.0,2000.0
26780,2118,1993-11-17,Double Jeopardy!,THE MIDDLE AGES,"$1,200",This king of England was killed by a Norman arrow at the Battle of Hastings,Harold II,1200.0,1200.0
33174,1333,1990-05-23,Jeopardy!,THE CRUSADES,$200,This king of England was a leader of the Third Crusade,Richard I (Richard the Lionhearted),200.0,200.0


In [54]:
find_question(['FISH', 'last'])

There are 2 questions.


Unnamed: 0,Show Number,Date,Round,Category,Value,Question,Answer,Difficulty
45769,4197,2002-11-26,Jeopardy!,"SOMETHING ""OLD"", SOMETHING ""NEW""",$200,This fish story of 1952 was one of the last works Hemingway published during his lifetime,"""The Old Man and the Sea""",200.0
102887,2055,1993-07-09,Jeopardy!,FOOD,$300,"The last name of a nursery rhyme Jack, or a fish that's so high in fat he couldn't eat it",Sprat,300.0
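
Padding each word with `\s` misses words that sit at the very start or end of a question or next to punctuation, which is why `"England's"` no longer matches. One possible alternative, sketched here under my own assumptions rather than taken from the project, is the regex word boundary `\b`. The helper name `find_whole_words` and the `demo` rows are hypothetical.

```python
import re
import pandas as pd

def find_whole_words(data, words, column='Question'):
    """Return rows of `data` whose `column` contains every word as a whole word."""
    mask = pd.Series(True, index=data.index)
    for word in words:
        # \b matches at a word boundary, so "king" is not found inside "Viking";
        # re.escape stops characters such as "." or "?" from acting as regex syntax
        pattern = r'\b' + re.escape(word) + r'\b'
        mask &= data[column].str.contains(pattern, case=False, na=False, regex=True)
    return data[mask]

# Hypothetical sample rows, including a missing question
demo = pd.DataFrame({'Question': [
    "England's King George V collected stamps",
    "Viking raids on England",
    "King of England in 1066",
    None,
]})

result = find_whole_words(demo, ['king', 'england'])
print(result)
```

With this pattern the first and third rows are kept (the apostrophe after `England` counts as a boundary), while the `Viking` row and the missing value are dropped. Row counts on the full dataset would differ from the ones shown above.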


__5\.__ We may want to eventually __compute aggregate statistics__, like `.mean()` on the `Value` column. But right now, the values in that column are strings. __Convert the `Value` column to floats__. If you’d like to, you can create a new column with the float values.

Now that you can filter the dataset of questions, __use your new column that contains the float values of each question to find the “difficulty” of certain topics__. For example, what is the average value of questions that contain the word `King`?

Make sure to use the dataset that contains the float values as the dataset you use in your filtering function.

In [15]:
# check data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Show Number  216930 non-null  int64  
 1   Date         216930 non-null  object 
 2   Round        216930 non-null  object 
 3   Category     216930 non-null  object 
 4   Value        216930 non-null  object 
 5   Question     216930 non-null  object 
 6   Answer       216928 non-null  object 
 7   Float Value  216930 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 13.2+ MB


In [16]:
df.Value.head()

0    $200
1    $200
2    $200
3    $200
4    $200
Name: Value, dtype: object

In [17]:
# strip '$' and ',' from each value and turn the string 'None' into '0'
difficulty = [value.replace('$', "").replace(',', "").replace('None', '0')  for value in df['Value']]

# store the cleaned strings in a new column
df['Difficulty'] = difficulty

# convert column to float (downcast="float" picks the smallest float dtype, here float32)
df['Difficulty'] = pd.to_numeric(df["Difficulty"], downcast="float")

# check dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Show Number  216930 non-null  int64  
 1   Date         216930 non-null  object 
 2   Round        216930 non-null  object 
 3   Category     216930 non-null  object 
 4   Value        216930 non-null  object 
 5   Question     216930 non-null  object 
 6   Answer       216928 non-null  object 
 7   Float Value  216930 non-null  float64
 8   Difficulty   216930 non-null  float32
dtypes: float32(1), float64(1), int64(1), object(6)
memory usage: 14.1+ MB
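
The same cleanup can be written with pandas string methods instead of a list comprehension. This is a minimal sketch on a hypothetical four-row `demo` frame, and it makes one different choice that should be read as an assumption: values that cannot be parsed (the string `'None'`) become `NaN` instead of `0`.

```python
import pandas as pd

# Hypothetical sample covering the formats seen in the Value column
demo = pd.DataFrame({'Value': ['$200', '$1,200', 'None', '$2,000']})

# Remove '$' and ',' with one regex; errors='coerce' turns anything
# non-numeric (here the string 'None') into NaN
cleaned = demo['Value'].str.replace(r'[$,]', '', regex=True)
demo['Float Value'] = pd.to_numeric(cleaned, errors='coerce')

print(demo['Float Value'].mean())            # NaN rows are skipped
print(demo['Float Value'].fillna(0).mean())  # missing values counted as 0
```

The two means differ because `.mean()` ignores `NaN`. Counting questions without a dollar value as `0`, as both solutions in this notebook do, pulls the average down, so which behavior is preferable depends on what "difficulty" is meant to capture.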


In [31]:
# create subset
new_df = find_question(['King', 'England'])

# calculate mean using subset
new_df['Difficulty'].mean()

There are 51 questions.


817.6470336914062

In [32]:
# create subset
new_dfA = find_question(['FISH', 'last'])

# calculate mean using subset
new_dfA['Difficulty'].mean()

There are 2 questions.


250.0

__Codecademy's Solution__

In [20]:
# Adding a new column. If the value of the float column is not "None",
# then we cut off the first character (which is a dollar sign),
# and replace all commas with nothing, and then cast that value to a float.
# If the answer was "None", then we just enter a 0.
df["Float Value"] = df["Value"].apply(lambda x: float(x[1:].replace(',','')) if x != "None" else 0)
df.head()

Unnamed: 0,Show Number,Date,Round,Category,Value,Question,Answer,Float Value,Difficulty
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus,200.0,200.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,200.0,200.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona,200.0,200.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's,200.0,200.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams,200.0,200.0


In [25]:
# Filtering the dataset and finding the average value of those questions
filtered = filter_data(df, ["King"])
print(filtered["Float Value"].mean())

771.8833850722094


__6\.__ Write a function that returns the __count of the unique answers to all of the questions in a dataset__. 

For example, after filtering the entire dataset to only questions containing the word `King`, we could then find all of the unique answers to those questions. The answer `Henry VIII` appeared 3 times and was the most common answer.

In [21]:
def find_question(df, words):
    """Find questions that contain every word in the list."""
    
    # start with every row selected, then narrow the selection one word at a time
    mask = pd.Series(True, index=df.index)
    for word in words:
        # adding whitespace before & after the word to ensure that it does not match substrings
        pattern = r'\s' + word + r'\s'
        # capitalization issue addressed using 'case=False'
        mask &= df['Question'].str.contains(pat=pattern, na=False, case=False, regex=True)
    new_df = df[mask]
                    
    print("There are {} questions.".format(len(new_df)))
    return new_df

In [22]:
# search questions that include 'King'
new_df = find_question(df, ['King'])

There are 1786 questions.


In [23]:
# find most common answer
from collections import Counter
 
def most_frequent(answers):
    # count how often each answer occurs and return the most common one
    occurrence_count = Counter(answers)
    return occurrence_count.most_common(1)[0][0]

print("The most common answer is: {}.".format(most_frequent(new_df['Answer'])))

The most common answer is: Henry VIII.


__Codecademy's Solution__

In [24]:
# A function to find the unique answers of a set of data
def get_answer_counts(data):
    return data["Answer"].value_counts()


# Testing the answer count function
print(get_answer_counts(filtered))

William the Conqueror       6
Wessex                      3
Richard the Lionhearted     3
Henry VIII                  3
George III                  3
                           ..
The Magna Carta             1
King Hussein                1
Charles                     1
(Sir Edward) Elgar          1
William of Orange roughy    1
Name: Answer, Length: 114, dtype: int64


__7\.__ Explore from here! This is an incredibly rich dataset, and there are so many interesting things to discover. There are a few columns that we haven’t even started looking at yet. Here are some ideas on ways to continue working with this data:

Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word "Computer" compared to questions from the 2000s?
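
One way to approach the decade comparison, shown as a rough sketch: parse the dates, label each row with its decade, and count the matches per decade. The `demo` rows are made up for illustration and only reuse the `Date` and `Question` column names from this notebook.

```python
import pandas as pd

# Made-up rows using the renamed 'Date' and 'Question' columns
demo = pd.DataFrame({
    'Date': ['1993-07-09', '1998-02-19', '2004-12-31', '2008-02-20'],
    'Question': [
        'A clue that mentions a computer',
        'A clue about a king of England',
        'Another Computer clue',
        'A clue about several computers',
    ],
})

# Parse the date strings so the year can be extracted
demo['Date'] = pd.to_datetime(demo['Date'])

# Case-insensitive substring match; this also counts "computers"
has_word = demo['Question'].str.contains('computer', case=False, na=False)

# Label each row with its decade (1993 -> 1990) and count matches per decade
decade = (demo['Date'].dt.year // 10) * 10
counts = has_word.groupby(decade).sum()
print(counts)
```

Because each decade in the real dataset holds a different number of questions, dividing by the number of questions per decade (for example with `.mean()` instead of `.sum()`) may give a fairer comparison.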

Is there a connection between the round and the category? Are you more likely to find certain categories, like "Literature" in Single Jeopardy or Double Jeopardy?
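
A cross-tabulation is one possible starting point for the round and category question. The sketch below uses a small made-up `demo` frame with the `Round` and `Category` column names from this notebook.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
demo = pd.DataFrame({
    'Round': ['Jeopardy!', 'Jeopardy!', 'Double Jeopardy!',
              'Double Jeopardy!', 'Double Jeopardy!'],
    'Category': ['LITERATURE', 'HISTORY', 'LITERATURE', 'LITERATURE', 'HISTORY'],
})

# Rows are categories, columns are rounds, cells are question counts
table = pd.crosstab(demo['Category'], demo['Round'])
print(table)
```

Passing `normalize='index'` to `pd.crosstab` would show each category's share per round instead of raw counts.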

Build a system to quiz yourself. Grab random questions, and use the input function to get a response from the user. Check to see if that response was right or wrong. Note that you can’t do this on the Codecademy platform — to do this, download the data, and write and run the code on your own computer!
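
A quiz loop could look roughly like the following. The function name `quiz` and its `ask` parameter are hypothetical; `ask` defaults to the built-in `input` and exists only so the example can run without a keyboard. The answer check is deliberately naive (exact match, ignoring case and surrounding spaces).

```python
import pandas as pd

def quiz(data, n=1, ask=input, seed=None):
    """Ask `n` random questions from `data` and return the number answered correctly."""
    score = 0
    # sample() picks random rows; passing a seed makes the choice repeatable
    for _, row in data.sample(n=n, random_state=seed).iterrows():
        reply = ask(row['Question'] + '\n> ')
        # Naive check: exact match after trimming spaces and ignoring case
        if reply.strip().lower() == str(row['Answer']).strip().lower():
            print('Correct!')
            score += 1
        else:
            print('Sorry, the answer was: {}'.format(row['Answer']))
    return score

# One row taken from the output shown earlier in this notebook
demo = pd.DataFrame({
    'Question': ['William the Conqueror was crowned King of England in '
                 'Westminster Abbey on Christmas Day in this year'],
    'Answer': ['1066'],
})

# quiz(demo) would prompt at the keyboard; a stand-in reply is passed here
print(quiz(demo, ask=lambda prompt: ' 1066 '))
```

Real answers often come with variations such as `Richard I (Richard the Lionhearted)`, so a more forgiving comparison would probably be needed in practice.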
