## This is Jeopardy


In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
jeopardy_data = pd.read_csv('jeopardy_starting/jeopardy.csv')

print('*'*10 + ' ' + "Data Info" + ' ' + '*'*10)
print('\n')
print(jeopardy_data.info())  # info() prints its report itself and returns None, hence the trailing "None" in the output
print('\n')
print('*'*10 + ' ' + "Data Columns" + ' ' + '*'*10)
print('\n')
print(jeopardy_data.columns)
print('\n')
print('*'*10 + ' ' + "Data Head" + ' ' + '*'*10)
print('\n')
print(jeopardy_data.head())

********** Data Info **********


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
None


********** Data Columns **********


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


********** Data Head **********


   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME A

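The data contains 7 columns and 216930 rows. Each row has a Show Number, Air Date, Round, Category, Value, Question and Answer; only the Answer column has missing values (2 rows).
Every column name except Show Number starts with a space, so the columns should be renamed.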

In [2]:
# Renaming columns:
jeopardy_data.rename(columns={
    'Show Number': 'ShowNumber',
    ' Air Date': 'AirDate',
    ' Round': 'Round',
    ' Category': 'Category',
    ' Value': 'Value',
    ' Question': 'Question',
    ' Answer': 'Answer'}, 
    inplace=True)
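
Since the problem is mostly the leading space in the headers, a shorter alternative is to clean every column name at once instead of listing each one. A minimal sketch on a made-up frame (`toy` is hypothetical and only mimics the header problem; it is not the Jeopardy file):

```python
import pandas as pd

# Toy frame with the same header problem: leading spaces in column names.
toy = pd.DataFrame({
    'Show Number': [4680],
    ' Air Date': ['2004-12-31'],
    ' Round': ['Jeopardy!'],
})

# Strip surrounding whitespace, then remove the remaining inner spaces.
toy.columns = toy.columns.str.strip().str.replace(' ', '', regex=False)

print(list(toy.columns))
```

This yields the same names as the explicit mapping above, but it would also silently rename any column added to the file later, which may or may not be what you want.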

## Filter Questions

This function filters a DataFrame, keeping the rows where a given column contains the words in a list of words.

filter_data returns a DataFrame with the rows where the specified column matches the words contained in the list. Matching is case-insensitive and by substring, so "King" also matches "Viking".

filter_data has 4 parameters:
* data: the DataFrame to search
* column: the name of the column the filter is applied to
* words: the list of words to look for in that column
* f_type: the type of search; with the default "all", every word in the list must appear in the same record; with any other value, a record matches if it contains at least one of the listed words

In [3]:
def filter_data(data, column, words, f_type='all'):
    # 'all' requires every word to be present; any other value requires at least one.
    combine = all if f_type == 'all' else any
    # Case-insensitive substring test against each record of the column.
    matches = data[column].apply(lambda x: combine(word.lower() in x.lower() for word in words))
    return data.loc[matches]

search_words = ["King", "England"]
search_column = "Question"
filtered_questions = filter_data(jeopardy_data, search_column, search_words)[search_column]
filtered_questions.reset_index(drop=True, inplace=True)
print('*'*10 + ' ' + "Results" + ' ' + '*'*10)
print('\n')
print('Searched words: ' + str(search_words))
print('\n')
print('Number of results found: ' + str(len(filtered_questions)))
print('\n')
print('Results found: ' + '\n' + str(filtered_questions))

********** Results **********


Searched words: ['King', 'England']


Number of results found: 152


Results found: 
0                    Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"
1      In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man
2                    This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt
3                This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"
4                                        It's the number that followed the last king of England named William
                                                        ...                                                  
147        In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England
148                      Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish
149
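
The results above include questions about Vikings, because a substring test for "king" also succeeds inside "Viking". If whole-word matches are wanted, one option is a regular expression with word boundaries. The sketch below is a hypothetical variant (`filter_whole_words` and the `toy` frame are made up for illustration and are not part of the notebook):

```python
import re
import pandas as pd

def filter_whole_words(data, column, words, f_type='all'):
    """Hypothetical variant of filter_data that matches whole words only."""
    # \b marks a word boundary, so "king" no longer matches inside "Viking".
    patterns = [re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
                for word in words]
    combine = all if f_type == 'all' else any
    mask = data[column].apply(
        lambda text: combine(p.search(text) is not None for p in patterns))
    return data.loc[mask]

toy = pd.DataFrame({'Question': [
    'This king of England beat the odds',
    'The last Viking invasion of England',
    'A king of France',
]})

# Only the first row contains both "king" and "England" as whole words.
print(filter_whole_words(toy, 'Question', ['King', 'England']))
# With f_type='any', all three rows match at least one word.
print(filter_whole_words(toy, 'Question', ['King', 'England'], f_type='any'))
```

Whole-word matching is stricter in both directions: it drops "Viking", but it also drops plurals such as "kings", so the right choice depends on the question being asked.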

## Change Data Type
The Value column stores dollar amounts as strings such as "$200". To compute the average value of a selected group of questions, we first convert it to a numeric column.

In [4]:
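# Records whose Value is the string 'None' are counted as 0.
jeopardy_data['FloatValue'] = jeopardy_data.Value.apply(lambda x: '0' if x == 'None' else x)
# Strip dollar signs and thousands separators (raw string avoids an invalid-escape warning).
jeopardy_data.FloatValue = jeopardy_data.FloatValue.replace(r'[\$,]', '', regex=True)
jeopardy_data.FloatValue = pd.to_numeric(jeopardy_data.FloatValue)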
print('Average Value for all Questions: $' + str(format(jeopardy_data.FloatValue.mean(), '.2f')))

Average Value for all Questions: $739.99
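
Counting 'None' as 0 pulls the average down. An alternative is to treat those records as missing, so that `mean()` skips them. A minimal sketch on made-up values in the same format as the Value column (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy values in the same format as the Value column.
toy = pd.DataFrame({'Value': ['$200', '$1,000', 'None']})

# Remove dollar signs and thousands separators.
cleaned = toy['Value'].str.replace(r'[\$,]', '', regex=True)

# Approach used in the notebook: 'None' counts as 0.
as_zero = pd.to_numeric(cleaned.replace('None', '0'))

# Alternative: errors='coerce' turns 'None' into NaN, which mean() ignores.
as_nan = pd.to_numeric(cleaned, errors='coerce')

print(as_zero.mean())  # 400.0
print(as_nan.mean())   # 600.0
```

Which version is appropriate depends on what 'None' means in the data; the sketch only shows how much the choice can move the average.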


In [5]:
search_words = ["King", "England"]
search_column = "Question"
filtered_data = filter_data(jeopardy_data, search_column, search_words)
filtered_questions_mean = filtered_data.FloatValue.mean()
print('Average Value for filtered Questions including ' + str(search_words) + ': $' + str(format(filtered_questions_mean, '.2f')))

Average Value for filtered Questions including ['King', 'England']: $886.84


## Unique Answers
We want to know how many times each answer appears in a filtered set of questions.

In [6]:
search_words = ["King", "England"]
search_column = "Question"
filtered_data = filter_data(jeopardy_data, search_column, search_words)
unique_answers = filtered_data.Answer.value_counts()
print(unique_answers)

William the Conqueror                        6
James I                                      3
Edward                                       3
Oliver Cromwell                              3
Richard the Lionhearted                      3
                                            ..
King Hussein                                 1
Georgian                                     1
Battle of Hastings (which Harold II lost)    1
Edward the Confessor                         1
Le Mans                                      1
Name: Answer, Length: 114, dtype: int64
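
`value_counts` compares strings exactly, so answers that differ only in case or surrounding whitespace are counted separately, and missing answers are dropped by default. Whether the Jeopardy answers actually contain such variants is not checked here; the sketch below only illustrates the behavior on made-up values:

```python
import pandas as pd

# Made-up answers: two spellings of the same answer, plus a missing value.
answers = pd.Series(['William the Conqueror', 'william the conqueror ', 'James I', None])

# Without normalization, the two spellings are counted as different answers.
print(answers.value_counts())

# Normalize case and whitespace so spelling variants are counted together.
normalized = answers.str.strip().str.lower()
counts = normalized.value_counts()  # missing values are excluded by default
print(counts)

# dropna=False keeps a separate row for missing answers.
print(normalized.value_counts(dropna=False))
```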


## Analyze some data
How does the data change over time? We want to compare how many questions mention the word "Computer" in shows from the 90s versus the 2000s.

In [7]:
search_words = ["Computer"]
search_column = "Question"
filtered_data = filter_data(jeopardy_data, search_column, search_words)

shows_years = filtered_data.AirDate.str.split('-').str.get(0)
shows_years.reset_index(drop=True, inplace=True)
shows_years = pd.to_numeric(shows_years)

shows_90s = shows_years[(shows_years >= 1990) & (shows_years < 2000)]
shows_2000s = shows_years[(shows_years >= 2000) & (shows_years < 2010)]

# Each entry is one matching question, so the length of each selection is the question count.
print('Number of 90s questions that mention ' + str(search_words) + ': ' + str(len(shows_90s)))
print('Number of 2000s questions that mention ' + str(search_words) + ': ' + str(len(shows_2000s)))

Number of 90s questions that mention ['Computer']: 98
Number of 2000s questions that mention ['Computer']: 268
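
The same comparison can be generalized to every decade by parsing the air date and grouping on the decade, instead of writing one range filter per decade. A minimal sketch on made-up dates (the `toy` frame is hypothetical; it assumes dates in the `YYYY-MM-DD` format shown in the data head):

```python
import pandas as pd

# Made-up air dates in the same format as the AirDate column.
toy = pd.DataFrame({'AirDate': ['1994-05-02', '1999-11-30', '2004-12-31', '2011-01-15']})

# Parse the strings as dates and extract the year.
years = pd.to_datetime(toy['AirDate']).dt.year

# Integer division maps 1994 -> 1990, 2004 -> 2000, and so on.
decades = (years // 10) * 10

# Count rows per decade, ordered chronologically.
per_decade = decades.value_counts().sort_index()
print(per_decade)
```

Applied to the filtered questions, this would give one count per decade in a single Series. The raw counts are not directly comparable unless each decade contributes a similar number of questions to the dataset, so dividing by the total number of questions per decade may be a fairer measure.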
