<h1>This is Jeopardy</h1>

<h2>Goal</h2>
<p>Work with several functions to investigate a dataset of Jeopardy! questions and answers. Filter the dataset for topics that you’re interested in, compute the average difficulty of those questions, and train to become the next Jeopardy! champion.</p>

<p>First, we import the random library (we will need it later to pick a random question) and pandas. We call pd.set_option('display.max_colwidth', None) so that the full contents of each column are displayed. Then, to load the data for analysis, we use pd.read_csv().</p>

In [1]:
import random
import pandas as pd
pd.set_option('display.max_colwidth', None)
jeopardy_df = pd.read_csv('jeopardy.csv')

<p>We use .head() to display the first five rows.</p>

In [2]:
print(jeopardy_df.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                                                                                      Question  \
0             For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory   
1  No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves   
2                     The city of Yuma in this state has a record average of 4,055 hours of sunshine each year   
3                         In 1963, live on "The Art Linkl

<p>We rename the DataFrame columns to clean the data: most column names in the CSV begin with a blank space, so we remove it. Then we write a function that filters the dataset for questions that contain all of the words in a given list. To make the match case-insensitive, we convert both the search words and the question text to uppercase with .upper() inside a lambda function. Note that this is a substring match, so 'King' also matches words such as 'Vikings'.</p>

In [3]:
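# Strip the leading space from the column names
jeopardy_df = jeopardy_df.rename(columns = {' Air Date': 'Air_Date', ' Round': 'Round', ' Category': 'Category', ' Value': 'Value', ' Question': 'Question', ' Answer': 'Answer'})
def filter_data(data, words):
  # True when every search word occurs in the question, ignoring case
  # (named contains_all so that the built-in filter() is not shadowed)
  contains_all = lambda x: all(word.upper() in x.upper() for word in words)
  return data.loc[data['Question'].apply(contains_all)]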

<p> We test the function. </p>

In [4]:
test = filter_data(jeopardy_df, ['King', 'England'])
print(test)

        Show Number    Air_Date             Round               Category  \
4953           3003  1997-09-24  Double Jeopardy!           "PH"UN WORDS   
6337           3517  1999-12-14  Double Jeopardy!                    Y1K   
9191           3907  2001-09-04  Double Jeopardy!         WON THE BATTLE   
11710          2903  1997-03-26  Double Jeopardy!       BRITISH MONARCHS   
13454          4726  2005-03-07         Jeopardy!  A NUMBER FROM 1 TO 10   
...             ...         ...               ...                    ...   
208295         4621  2004-10-11         Jeopardy!            THE VIKINGS   
208742         4863  2005-11-02  Double Jeopardy!         BEFORE & AFTER   
213870         5856  2010-02-15  Double Jeopardy!                 URANUS   
216021         1881  1992-11-09  Double Jeopardy!         HISTORIC NAMES   
216789         5070  2006-09-29  Double Jeopardy!        ANCIENT HISTORY   

         Value  \
4953      $200   
6337      $800   
9191      $800   
11710     $600 
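
<p>Because filter_data matches substrings, a search for 'King' also returns questions about Vikings. The sketch below shows one possible way to require whole-word matches with the standard re module. The helper name contains_all_words and the sample question are made up for illustration and are not part of the original notebook.</p>

```python
import re

def contains_all_words(text, words):
    """Return True if every word occurs in text as a whole word, ignoring case."""
    return all(
        re.search(r'\b' + re.escape(word) + r'\b', text, flags=re.IGNORECASE)
        for word in words
    )

# Made-up sample question, used only to contrast the two approaches
question = 'The Vikings raided the coast of England in 793'
words = ['King', 'England']

# Substring test, as in filter_data: 'KING' is found inside 'VIKINGS'
substring_match = all(w.upper() in question.upper() for w in words)

# Whole-word test: 'King' is not a separate word in this question
whole_word_match = contains_all_words(question, words)

print(substring_match, whole_word_match)
```

<p>To apply this to the DataFrame, the helper could be passed to .apply() on the Question column in place of the lambda, for example data['Question'].apply(lambda q: contains_all_words(q, words)).</p>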

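<p>We use .mean() to calculate the average of the Value column. First, with a lambda function, we convert the column to floats and store the result in a new Values2 column: the leading "$" and any thousands separators are removed, and entries without a value ("None") are counted as 0.</p>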

In [5]:
jeopardy_df["Values2"] = jeopardy_df.Value.apply(lambda x: float(x[1:].replace(',','')) if x != "None" else 0)
print(jeopardy_df.Values2.mean())

739.9884755451067
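
<p>Counting missing values as 0 pulls the average down. As a rough sketch of an alternative, the conversion can be done with pd.to_numeric and errors='coerce', which turns non-numeric entries into NaN so that .mean() skips them. The sample values below are made up and only illustrate the difference between the two conventions.</p>

```python
import pandas as pd

# Made-up sample values in the same format as the Value column
values = pd.Series(['$200', '$2,000', 'None'])

# Remove '$' and ',' and convert; anything non-numeric (such as 'None') becomes NaN
numeric = pd.to_numeric(values.str.replace(r'[$,]', '', regex=True), errors='coerce')

# .mean() ignores NaN, so questions without a value do not affect the average
mean_skipping_missing = numeric.mean()

# Filling NaN with 0 reproduces the convention used in the cell above
mean_missing_as_zero = numeric.fillna(0).mean()

print(mean_skipping_missing, mean_missing_as_zero)
```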


<p>We write a function that returns how many times each unique answer appears across all of the questions in a dataset.</p>

In [6]:
def get_values_count(data):
  # value_counts() returns the unique answers sorted by frequency
  counts = data['Answer'].value_counts()
  return counts

print(get_values_count(jeopardy_df))

China              216
Australia          215
Japan              196
Chicago            194
France             193
                  ... 
Sciatic nerve        1
a blood diamond      1
Roebling             1
Raincoat             1
Sconce               1
Name: Answer, Length: 88268, dtype: int64
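
<p>value_counts() treats answers that differ only in capitalization or surrounding spaces as different values. A minimal sketch of one way to merge such variants before counting, using a made-up four-row sample rather than the real dataset:</p>

```python
import pandas as pd

# Made-up sample: three spellings of the same answer plus one other answer
sample_df = pd.DataFrame({'Answer': ['China', 'china', ' China', 'Japan']})

# Counting the raw strings keeps every spelling separate
raw_counts = sample_df['Answer'].value_counts()

# Trimming whitespace and lowercasing first merges the variants
normalized_counts = sample_df['Answer'].str.strip().str.lower().value_counts()

print(raw_counts)
print(normalized_counts)
```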


<p>We create a function to investigate how the questions change over time by filtering on the air date. We build two DataFrames, one with the questions aired in the 1990s and one with everything aired from 2000 onward (the task does not specify whether "the 2000s" means the decade or the whole millennium, so we take the whole millennium). We then loop over each DataFrame and count the questions that contain a given word, converting both the word and the question text to uppercase with .upper() so that the comparison ignores case. Finally, the function returns a sentence comparing the two counts.</p>

In [7]:
def comparing_words(data, word):
    sum1 = 0
    sum2 = 0
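    # Air_Date is stored as text, so the bounds must be zero-padded (YYYY-MM-DD) to compare correctly
    words_count1 = data.loc[data['Air_Date'].between('1990-01-01', '1999-12-31')]
    words_count2 = data.loc[data['Air_Date'].between('2000-01-01', '2999-12-31')]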
    for row in words_count1['Question'].str.upper():
        if word.upper() in row:
            sum1 +=1
    for row in words_count2['Question'].str.upper():
        if word.upper() in row:
            sum2 +=1
    return '{} questions from the 90s use the word "{}" compared to {} questions from the 2000s.'.format(sum1, word.title(), sum2)

test2 = comparing_words(jeopardy_df, 'computer')
print(test2)


98 questions from the 90s use the word "Computer" compared to 326 questions from the 2000s.
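
<p>The same kind of comparison can be sketched without explicit loops by parsing the dates and grouping by decade. Unlike comparing_words, which treats everything from 2000 onward as one period, this version reports each decade separately. The function name count_word_by_decade and the sample rows are assumptions made for illustration.</p>

```python
import pandas as pd

def count_word_by_decade(data, word):
    """Count the questions containing word (ignoring case) in each decade."""
    dates = pd.to_datetime(data['Air_Date'])
    decade = (dates.dt.year // 10) * 10
    # regex=False treats word as plain text; case=False ignores capitalization
    has_word = data['Question'].str.contains(word, case=False, regex=False)
    # Summing booleans counts the True values in each group
    return has_word.groupby(decade).sum()

# Made-up sample rows with the same column names as the notebook
sample_df = pd.DataFrame({
    'Air_Date': ['1990-01-05', '1995-06-01', '2004-12-31', '2010-02-15'],
    'Question': ['An early COMPUTER', 'A famous river',
                 'This computer company', 'Computers in space'],
})

counts = count_word_by_decade(sample_df, 'computer')
print(counts)
```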


<p>We create a function that counts how often a category appears in each round, to investigate whether there is a connection between the round and the category, that is, whether certain categories are more likely to appear in Jeopardy! (the single round) or in Double Jeopardy!.</p>

In [8]:
def filter_category(data, word):
    x = 0
    y = 0
    for row in data[data['Round'] == 'Jeopardy!']['Category'].str.upper():
        if row == word.upper():
            x += 1
    for row in data[data['Round'] == 'Double Jeopardy!']['Category'].str.upper():
        if row == word.upper():
            y += 1
    if x == y:
        return 'The Category appears {} times in both types of rounds'.format(x)
    else:
        return 'The Category appears {} times in Jeopardy! and {} in Double Jeopardy!'.format(x,y)

<p>We test the function with the "Literature" category and find that it is more likely to appear in a Double Jeopardy! round than in a Jeopardy! round.</p>

In [9]:
test3 = filter_category(jeopardy_df, 'literature')
print(test3)

The Category appears 105 times in Jeopardy! and 381 in Double Jeopardy!
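
<p>A possible vectorized variant of the same count is sketched below: it selects the rows whose category matches and lets value_counts() tally the rounds, so any other round present in the data is reported as well. As in filter_category, each matching question is counted once. The function name category_by_round and the sample rows are made up for illustration.</p>

```python
import pandas as pd

def category_by_round(data, category):
    """Count the questions of a category in each round, ignoring case."""
    matches = data['Category'].str.upper() == category.upper()
    return data.loc[matches, 'Round'].value_counts()

# Made-up sample rows with the same column names as the notebook
sample_df = pd.DataFrame({
    'Round': ['Jeopardy!', 'Double Jeopardy!', 'Double Jeopardy!', 'Final Jeopardy!'],
    'Category': ['LITERATURE', 'Literature', 'LITERATURE', 'HISTORY'],
})

counts = category_by_round(sample_df, 'literature')
print(counts)
```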


<p>Finally, we play. First, we get a random question, then we answer it, and then we see if we were right.</p>

In [10]:
def the_question(data):
    # Pick a valid row position; randint includes both endpoints
    x = random.randint(0, len(data) - 1)
    y = data.iloc[x]['Question']
    z = data.iloc[x]['Value']
    if data.iloc[x]['Round'] == 'Jeopardy!':
        return '{}: Jeopardy! for {}, "{}":'.format(x,z,y)
    else:
        return '{}: Double Jeopardy! for {}, "{}":'.format(x,z,y)

play = the_question(jeopardy_df)
print(play)

93210: Jeopardy! for $600, "The lovely locks of a Lipizzaner":
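
<p>As an alternative sketch, DataFrame.sample() can pick the random row directly. Returning the whole row keeps the question, value, and answer together, so the row position does not have to be recovered from the printed string later. The function name pick_question and the sample rows are assumptions, not part of the original notebook.</p>

```python
import pandas as pd

def pick_question(data, seed=None):
    """Return one randomly chosen row of the DataFrame as a Series."""
    return data.sample(n=1, random_state=seed).iloc[0]

# Made-up sample rows with the same column names as the notebook
sample_df = pd.DataFrame({
    'Round': ['Jeopardy!', 'Double Jeopardy!'],
    'Value': ['$200', '$400'],
    'Question': ['Sample clue one', 'Sample clue two'],
    'Answer': ['sample answer one', 'sample answer two'],
})

row = pick_question(sample_df, seed=0)
prompt = '{} for {}, "{}":'.format(row['Round'], row['Value'], row['Question'])
print(prompt)
```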


In [11]:
answer = input('Your answer is: ')

Your answer is: know


In [12]:
answer1 = int(play.split(':')[0])

if answer == jeopardy_df.iloc[answer1]['Answer']:
    value = jeopardy_df.iloc[answer1]['Value']
    print('Correct! You win: {}'.format(value))
else:
    print('Incorrect!')

Incorrect!
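
<p>The check above only accepts an answer that matches the stored text exactly, including capitalization and spacing. A minimal sketch of a more forgiving comparison, assuming it is acceptable to ignore case, punctuation, and a leading article; the helper names normalize_answer and is_correct are hypothetical:</p>

```python
import re

def normalize_answer(text):
    """Lowercase, drop punctuation and a leading article, and collapse whitespace."""
    text = re.sub(r'[^a-z0-9\s]', '', text.lower())
    text = re.sub(r'^(a|an|the)\s+', '', text.strip())
    return ' '.join(text.split())

def is_correct(guess, expected):
    """Compare a typed guess with the stored answer after normalizing both."""
    return normalize_answer(guess) == normalize_answer(expected)

print(is_correct(' copernicus', 'Copernicus'))
print(is_correct('the Sciatic Nerve', 'Sciatic nerve'))
print(is_correct('know', 'Copernicus'))
```

<p>In the notebook, is_correct(answer, jeopardy_df.iloc[answer1]['Answer']) could replace the direct == comparison.</p>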
