**Project Goals**

You will write several functions that investigate a dataset of Jeopardy! questions and answers:
- filter the dataset for topics that you're interested in,
- compute the average difficulty of those questions,
- and train to become the next Jeopardy! champion!

In [50]:
import numpy as np
import pandas as pd
import re

pd.set_option('display.max_colwidth', None)

# read csv file
df = pd.read_csv('dataset_jeopardy.csv')

# remove white spaces from column names
df.columns = df.columns.str.strip()
df = df.dropna()

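# strip HTML tags from the Question column and store it as string dtype
df['Question'] = df['Question'].astype("string").apply(lambda x: re.sub(r"<[^<>]*>", "", x))

# Convert the "Value" column to floats: remove every non-digit character
# (regex=True is required for '\D' to be treated as a pattern);
# entries with no digits become empty strings and then NaN
df['Value'] = pd.to_numeric(df['Value'].str.replace(r'\D', '', regex=True))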

df.head(2)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,200.0,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200.0,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
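
The "Value" cleaning step can be checked in isolation on a few toy strings. This is a minimal sketch, not part of the original notebook: `raw_values` and `clean_values` are hypothetical names, and the sample strings are an assumption about what the raw column looks like (dollar amounts, plus text entries with no digits).

```python
import pandas as pd

# Toy values (assumed format): dollar strings and a text entry with no digits
raw_values = pd.Series(["$200", "$1,000", "None"])

# Remove every non-digit character, then convert to numbers;
# a string left empty by the replacement is converted to NaN
clean_values = pd.to_numeric(raw_values.str.replace(r"\D", "", regex=True))

print(clean_values)
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for `Series.str.replace` has changed over time.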


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Show Number  216928 non-null  int64  
 1   Air Date     216928 non-null  object 
 2   Round        216928 non-null  object 
 3   Category     216928 non-null  object 
 4   Value        213294 non-null  float64
 5   Question     216928 non-null  object 
 6   Answer       216928 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 13.2+ MB


In [53]:
# Provide insights on question difficulty and value
def difficulty_filter(dataset):
    """
    Summarize question values as a measure of difficulty.
        Inputs:
            dataset > DataFrame with at least a 'Value' column
        Output:
            tuple of three strings with the mean, max and min values
    """
    difficulty_mean = 'Average value: $' + str(round(dataset['Value'].mean()))
    difficulty_max = 'Maximum value: $' + str(round(dataset['Value'].max()))
    difficulty_min = 'Minimum value: $' + str(round(dataset['Value'].min()))
    return difficulty_mean, difficulty_max, difficulty_min
# difficulty_filter(function_test)

# Keep the questions that contain every word from a list of words
def jeopardy_filter(dataset, words):
    """
    Filter questions by keyword (case-insensitive substring match).
        Inputs:
            dataset > DataFrame with at least a 'Question' column
            words > list of words to look for in the 'Question' column
        Output:
            Filtered DataFrame whose questions contain all of the words
    """
    # named contains_all_words to avoid shadowing the built-in filter()
    contains_all_words = lambda x: all(word.lower() in x.lower() for word in words)
    new_df = dataset.loc[dataset["Question"].apply(contains_all_words)]
    return new_df

# prints the total number of answers and the number of unique answers in a dataset
def count_unique_answers(df_filtered):
    """
    Count total and unique answers.
        Inputs:
            df_filtered > df filtered with jeopardy_filter()
        Output:
            None; prints the size of the Answer column and the number of unique answers
    """
    count_answers = df_filtered['Answer'].size
    # named n_unique so the local variable does not shadow the function name
    n_unique = df_filtered['Answer'].unique().size
    print('For a total of {c_answer} answers, {cu_answer} are unique'.format(c_answer=count_answers, cu_answer=n_unique))

function_test = jeopardy_filter(df, ['king', 'england'])
count_unique_answers(function_test)

For a total of 152 answers, 114 are unique
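
Because `jeopardy_filter` uses substring matching, a search for 'king' also matches words such as 'Viking'. A possible refinement is to match whole words only. The sketch below is an assumption, not part of the original notebook: `jeopardy_filter_whole_words` is a hypothetical helper, and the `toy` DataFrame is made-up data used only to show the difference.

```python
import re
import pandas as pd

def jeopardy_filter_whole_words(dataset, words):
    """Hypothetical variant: keep rows whose 'Question' contains every word as a whole word."""
    # \b marks a word boundary, so 'king' no longer matches inside 'Viking'
    patterns = [re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE) for word in words]
    mask = dataset["Question"].apply(lambda q: all(p.search(q) for p in patterns))
    return dataset.loc[mask]

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Question": [
        "This king of England signed the Magna Carta",
        "Viking raiders attacked England in 793",
    ],
    "Answer": ["King John", "Lindisfarne"],
    "Value": [400.0, 800.0],
})

matches = jeopardy_filter_whole_words(toy, ["king", "england"])
print(matches["Answer"].tolist())
print(matches["Value"].mean())
```

On this toy data only the first row is kept, whereas a plain substring check would keep both rows.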
