# This is Jeopardy!

#### Overview

This project is slightly different than others you have encountered thus far. Instead of a step-by-step tutorial, this project contains a series of open-ended requirements which describe the project you'll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and/or other resources when you encounter a problem that you cannot easily solve.

#### Project Goals

You will work to write several functions that investigate a dataset of _Jeopardy!_ questions and answers. Filter the dataset for topics that you're interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

## Prerequisites

In order to complete this project, you should have completed the Pandas lessons in the <a href="https://www.codecademy.com/learn/paths/analyze-data-with-python">Analyze Data with Python Skill Path</a>. You can also find those lessons in the <a href="https://www.codecademy.com/learn/data-processing-pandas">Data Analysis with Pandas course</a> or the <a href="https://www.codecademy.com/learn/paths/data-science/">Data Scientist Career Path</a>.

Finally, the <a href="https://www.codecademy.com/learn/practical-data-cleaning">Practical Data Cleaning</a> course may also be helpful.

## Project Requirements

1. We've provided a CSV file named `jeopardy.csv` containing data about the game show _Jeopardy!_. Load the data into a DataFrame and investigate its contents. Try to print out specific columns.

   Note that in order to make this project as "real-world" as possible, we haven't modified the data at all - we're giving it to you exactly how we found it. As a result, this data isn't as "clean" as the datasets you normally find on Codecademy. More specifically, there's something odd about the column names. After you figure out the problem with the column names, you may want to rename them to make your life easier for the rest of the project.
   
   In order to display the full contents of a column, we've added this line of code for you:
   
   ```py
   pd.set_option('display.max_colwidth', None)
   ```

In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

# Create dataframe jeopardy
jeopardy = pd.read_csv("jeopardy.csv")
print(jeopardy.columns)

# Column names contain spaces (a leading space on most names, and a space between words). Remove the spaces.
jeopardy = jeopardy.rename(columns = {"Show Number": "Show_Number", " Air Date": "Air_Date", " Round" : "Round", " Category": "Category", " Value": "Value", " Question":"Question", " Answer": "Answer"})
print(jeopardy.columns)
print(jeopardy["Category"].head(2))
print(jeopardy["Question"].head(2))

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Index(['Show_Number', 'Air_Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
0                            HISTORY
1    ESPN's TOP 10 ALL-TIME ATHLETES
Name: Category, dtype: object
0               For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory
1    No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves
Name: Question, dtype: object
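
The explicit `rename` above works, but it has to list every column by hand. As an alternative, here is a minimal sketch that strips the padding from all column names in one pass. The helper name `clean_column_names` and the one-row `sample` frame are hypothetical stand-ins, not part of the project files; only the padded header names are taken from the output above.

```python
import pandas as pd

# Hypothetical one-row stand-in for jeopardy.csv; only the padded
# header names are copied from the printed Index above.
sample = pd.DataFrame(
    [[1, "date", "round", "category", "$200", "question", "answer"]],
    columns=["Show Number", " Air Date", " Round", " Category",
             " Value", " Question", " Answer"],
)

def clean_column_names(df):
    """Return a copy of df with stripped, underscore-joined column names."""
    cleaned = df.copy()
    # strip() removes leading/trailing spaces; replace() joins inner words.
    cleaned.columns = cleaned.columns.str.strip().str.replace(" ", "_", regex=False)
    return cleaned

cleaned_sample = clean_column_names(sample)
print(list(cleaned_sample.columns))
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the names before and after.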


2. Write a function that filters the dataset for questions that contain all of the words in a list of words. For example, when the list `["King", "England"]` was passed to our function, the function returned a DataFrame of 49 rows. Every row had the strings `"King"` and `"England"` somewhere in its `" Question"`.

   Test your function by printing out the column containing the question of each row of the dataset.

In [11]:
# Codecademy solution flawed. 
# Comment states:   # Lowercases all words in the list of words as well as the questions. Returns true if all of the words in the list appear in the question.
# However, the code does not use "lower()":  filter = lambda x: all(word in x for word in words)
# Yet again, their instructions have not been updated.



# Define a function: find_rows_true(<where to look>, <what to look for>)
# <where to look>: jeopardy["Question"], <what to look for>: words (in this case ["King", "England"])
# Return those rows of the Question column where all of the words are included.
#     The all() function returns True if all items in an iterable are true, otherwise it returns False.
#     lambda question: all(word in question for word in words) --> generator expression inside all().
# apply() calls the lambda once per row, so the lambda's parameter is a single question string, not the whole column.
# "word in question" is a case-sensitive substring test: True or False for one word.
# "for word in words" supplies each word from the list in turn.
# all() returns False as soon as one word is missing; it returns True only if every word was found.
# Inner loop (inside all()): word[0] to word[n]; outer loop (done by apply()): row[0] to row[n].
# apply() returns a boolean Series; the rows where it is True are returned as the result.

def find_rows_true(x, words):
    rows_true = lambda question: all(word in question for word in words)
    # x.apply(rows_true) builds a boolean mask with one True/False per row.
    # x.loc[<mask>] selects the rows for which the mask is True.
    return x.loc[x.apply(rows_true)]

questions = find_rows_true(jeopardy.Question, ['King', 'England']) # new Series that holds the matching questions
print(questions)
print(questions.info()) # 49 questions found
questions.to_csv('found_questions.csv')

4953                                                                                                                                                                                                                                                                      Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"
14912                                                                                                                                                                                                                                                            This country's King Louis IV was nicknamed "Louis From Overseas" because he was raised in England
21511                                                                                                                                                                                                                                                                                 this man and

3. Test your original function with a few different sets of words to try to find some ways your function breaks. Edit your function so it is more robust.

   For example, think about capitalization. We probably want to find questions that contain the word `"King"` or `"king"`.
   
   You may also want to check to make sure you don't find rows that contain substrings of your given words. For example, our function found a question that didn't contain the word `"king"`, however it did contain the word `"viking"` &mdash; it found the `"king"` inside `"viking"`. Note that this also comes with some drawbacks &mdash; you would no longer find questions that contained words like `"England's"`.

In [13]:
# In my results (questions.to_csv('found_questions.csv')) "viking" is not included.
# Out of 49 cells, 1 cell includes "taking" and 2 cells include "king".
# They were selected only because these cells also include both "England" and "King", which were the selection criteria.
# Rows where the word appears only as lowercase "king" are not selected, if such rows exist in jeopardy.csv.
# Converting everything to lowercase would also return rows where the word king appears only once and was originally written in lowercase.
# Adding a space before and/or after King would exclude "King's", "Kings." etc. There are compromises: more robust in one way, less robust in another.

# Try lowercase.
def find_rows_true_lowercase(x, words):
    rows_true = lambda question: all(word.lower() in question.lower() for word in words)
    # x.apply(rows_true) builds a boolean mask with one True/False per row.
    # x.loc[<mask>] selects the rows for which the mask is True.
    return x.loc[x.apply(rows_true)]

questions_lowercase = find_rows_true_lowercase(jeopardy.Question, ['King', 'England']) # new Series that holds the matching questions
print(questions_lowercase)
print(questions_lowercase.info()) # 152 questions found
questions_lowercase.to_csv('found_questions_lowercase.csv')

# Result grew from 49 to 152 cells. 
# This result includes instances where "king" appears only once, in lowercase.

4953                    Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"
6337      In retaliation for Viking raids, this "Unready" king of England attacks Norse areas of the Isle of Man
9191                    This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt
11710               This Scotsman, the first Stuart king of England, was called "The Wisest Fool in Christendom"
13454                                       It's the number that followed the last king of England named William
                                                           ...                                                  
208295        In 1066 this great-great grandson of Rollo made what some call the last Viking invasion of England
208742                      Dutch-born king who ruled England jointly with Mary II & is a tasty New Zealand fish
213870                In 1781 William Herschel discovered Uranus & initially named it after this
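
The lowercase version still matches `"king"` inside `"Viking"` or `"taking"`. One possible way to avoid that without losing `"England's"` is a regular-expression word boundary (`\b`). The sketch below is only an illustration: the function name `filter_whole_words` and the three sample questions are made up for this example and are not part of the project or of Codecademy's solution.

```python
import re
import pandas as pd

def filter_whole_words(df, words, column="Question"):
    """Keep rows whose text contains every word as a whole word, ignoring case."""
    # \b marks a boundary between a word character and a non-word character,
    # so "king" does not match inside "Viking" but "England" matches "England's".
    patterns = [re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
                for word in words]
    mask = df[column].apply(
        lambda text: all(pattern.search(text) for pattern in patterns)
    )
    return df.loc[mask]

# Hypothetical sample questions, written for this illustration only.
sample = pd.DataFrame({"Question": [
    "This king of England won a famous battle",
    "A Viking raid on England",
    "England's King George V collected stamps",
]})

result = filter_whole_words(sample, ["king", "england"])
print(result)
```

The second row is dropped because its only "king" sits inside "Viking", while the third row is kept because the apostrophe in "England's" counts as a word boundary.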

4. We may want to eventually compute aggregate statistics, like `.mean()` on the `" Value"` column. But right now, the values in that column are strings. Convert the `" Value"` column to floats. If you'd like to, you can create a new column with float values.

   Now that you can filter the dataset of questions, use your new column that contains the float values of each question to find the "difficulty" of certain topics. For example, what is the average value of questions that contain the word `"King"`?
   
   Make sure to use the dataset that contains the float values as the dataset you use in your filtering function.

In [22]:
# The strings in the "Value" column have leading $ sign and thousands separators, e.g., $2,800.
# Add new column "Float_Value". 
# From old column, remove $ sign, remove commas, convert to float.

# x: Each value in column "Value".
# x[1: ]: Leave index[0], which is the $ sign, and start with index[1] until and including last index.
# replace(",",""): Replace comma with nothing (not even space).
# float(x[1...): Convert this trimmed string into float.
jeopardy["Float_Value"] = jeopardy["Value"].apply(lambda x: float(x[1: ].replace(",","")) if x != "None" else 0)
print(jeopardy.Float_Value.head())

# Calling find_rows_true with jeopardy["Question"] and ["King", "England"] returns the same 49 questions as in Step 2,
# but only the Question column, so the newly added "Float_Value" column would be lost.
# The old function find_rows_true is designed to take a single column as input and return that column (after filtering) as output.
# The function needs to be modified to receive the entire dataframe (jeopardy) as x2 and to filter on jeopardy.Question.

def find_rows_true_value(x2, words):
    rows_true = lambda question: all(word in question for word in words)
    # x2.Question.apply(rows_true) builds a boolean mask with one True/False per row.
    # x2.loc[<mask>] selects the full rows (all columns) for which the mask is True.
    return x2.loc[x2.Question.apply(rows_true)]

questions_value = find_rows_true_value(jeopardy, ['King', 'England']) # new dataframe that holds the questions and float values
# print(questions_value)
print("questions_value.info(): ")
print(questions_value.info()) # 49 questions found
questions_value.to_csv('found_questions_value.csv')
print("questions_value.Float_Value.mean(): " + str(questions_value.Float_Value.mean()))

0    200.0
1    200.0
2    200.0
3    200.0
4    200.0
Name: Float_Value, dtype: float64
questions_value.info(): 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 4953 to 200369
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Show_Number  49 non-null     int64  
 1   Air_Date     49 non-null     object 
 2   Round        49 non-null     object 
 3   Category     49 non-null     object 
 4   Value        49 non-null     object 
 5   Question     49 non-null     object 
 6   Answer       49 non-null     object 
 7   Float_Value  49 non-null     float64
dtypes: float64(1), int64(1), object(6)
memory usage: 3.4+ KB
None
questions_value.Float_Value.mean(): 918.3673469387755
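
The conversion above works one value at a time through `apply`. As a hedged alternative, the same cleanup can be sketched with vectorized string methods and `pd.to_numeric`. Note that this sketch turns `"None"` into a missing value (`NaN`) instead of `0`, which is a different choice from the cell above and changes the mean, because `.mean()` skips missing values. The three sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the "Value" column, including the "None" placeholder.
sample = pd.DataFrame({"Value": ["$200", "$2,800", "None"]})

# Remove the dollar sign and the thousands separator as literal strings.
stripped = (sample["Value"]
            .str.replace("$", "", regex=False)
            .str.replace(",", "", regex=False))

# errors="coerce" turns anything that is not a number (here "None") into NaN.
sample["Float_Value"] = pd.to_numeric(stripped, errors="coerce")

mean_skipping_missing = sample["Float_Value"].mean()
mean_with_zero = sample["Float_Value"].fillna(0).mean()
print(sample)
print(mean_skipping_missing, mean_with_zero)
```

Which of the two means is the better "difficulty" measure depends on whether a question without a dollar value should count as worth nothing or be left out of the average.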


5. Write a function that returns the count of unique answers to all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word `"King"`, we could then find all of the unique answers to those questions. The answer "Henry VIII" appeared 55 times and was the most common answer.

In [26]:
# In order to compare my results with Codecademy's, run find_rows_true_value again, this time only with the keyword ["King"].

def find_rows_true_value(x2, words):
    rows_true = lambda question: all(word in question for word in words)
    # x2.Question.apply(rows_true) builds a boolean mask with one True/False per row.
    # x2.loc[<mask>] selects the full rows (all columns) for which the mask is True.
    return x2.loc[x2.Question.apply(rows_true)]

questions_value = find_rows_true_value(jeopardy, ['King']) # new dataframe that holds the questions and float values
# print(questions_value)
print("questions_value.info(): ")
print(questions_value.info()) # found 1604 rows
questions_value.to_csv('found_questions_value.csv')
print("questions_value.Float_Value.mean(): " + str(questions_value.Float_Value.mean()))

# DISCREPANCY: Codecademy's mean is 771.8833..., my mean is 773.4413...

# DISCREPANCY: Codecademy's results cannot be reproduced with the provided csv file and the instructions in this project.
# Their results may be based on functions that use lower(), as found in older versions of this project.
# My results below are confirmed in the found_questions_value.csv file.

# 1604 questions contain the word "King" in the Question column.
# To count unique values in the "Answer" column, use the pandas method value_counts() in a new function.
#         Internet: The value_counts() function is used to get a Series containing counts of unique values.
#                   The resulting object will be in descending order.
#
def count_unique_answers(input_df):
    return input_df["Answer"].value_counts()

print(count_unique_answers(questions_value))



questions_value.info(): 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1604 entries, 56 to 216787
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Show_Number  1604 non-null   int64  
 1   Air_Date     1604 non-null   object 
 2   Round        1604 non-null   object 
 3   Category     1604 non-null   object 
 4   Value        1604 non-null   object 
 5   Question     1604 non-null   object 
 6   Answer       1604 non-null   object 
 7   Float_Value  1604 non-null   float64
dtypes: float64(1), int64(1), object(6)
memory usage: 112.8+ KB
None
questions_value.Float_Value.mean(): 773.4413965087282
Sweden                               19
Scotland                             11
Norway                               11
Denmark                              10
Morocco                              10
                                     ..
blood poisoning                       1
the Hundred Years' War                1
War
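
`value_counts()` treats answers as different whenever their strings differ at all, including in capitalization or stray whitespace. Whether that affects this dataset is an assumption that has not been checked here. If it does, a minimal sketch of a more forgiving count could look like the following; the function name `count_normalized_answers` and the four sample answers are hypothetical.

```python
import pandas as pd

def count_normalized_answers(df, column="Answer"):
    """Count answers after trimming whitespace and lowercasing."""
    normalized = df[column].str.strip().str.lower()
    return normalized.value_counts()

# Hypothetical answers that differ only in case or trailing whitespace.
sample = pd.DataFrame({"Answer": ["Henry VIII", "henry viii", "Sweden", "Henry VIII "]})

counts = count_normalized_answers(sample)
print(counts)
```

The trade-off is that the reported answers lose their original capitalization, so the raw `value_counts()` output remains useful for display.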

6. Explore from here! This is an incredibly rich dataset, and there are so many interesting things to discover. There are a few columns that we haven't even started looking at yet. Here are some ideas on ways to continue working with this data:

 * Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word `"Computer"` compared to questions from the 2000s?
 * Is there a connection between the round and the category? Are you more likely to find certain categories, like `"Literature"` in Single Jeopardy or Double Jeopardy?
 * Build a system to quiz yourself. Grab random questions, and use the <a href="https://docs.python.org/3/library/functions.html#input">input</a> function to get a response from the user. Check to see if that response was right or wrong.
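
As a starting point for the first idea, here is a minimal sketch that counts questions containing `"Computer"` per decade. It assumes the renamed `Air_Date` column holds dates that `pd.to_datetime` can parse (for example `YYYY-MM-DD`); the four sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from the jeopardy DataFrame.
sample = pd.DataFrame({
    "Air_Date": ["1995-03-01", "1998-11-20", "2004-12-31", "2007-05-05"],
    "Question": [
        "This Computer company was founded in a garage",
        "A river in Egypt",
        "Computer pioneer born in 1912",
        "A popular home Computer of the 1980s",
    ],
})

# Parse the date strings, then round each year down to its decade.
sample["Air_Date"] = pd.to_datetime(sample["Air_Date"])
decade = (sample["Air_Date"].dt.year // 10) * 10

# Case-sensitive literal match, like the Step 2 filter.
has_word = sample["Question"].str.contains("Computer", regex=False)

# Summing booleans counts the True values within each decade.
counts = has_word.groupby(decade).sum()
print(counts)
```

Dividing each count by the number of questions aired in that decade would give a fairer comparison if the decades contain different numbers of questions.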

In [None]:
# Completed

## Solution

7. Compare your program to our <a href="https://content.codecademy.com/PRO/independent-practice-projects/jeopardy/jeopardy_solution.zip">sample solution code</a>. Remember that your program might look different from ours (and probably will), and that's okay!

8. Great work! Visit <a href="https://discuss.codecademy.com/t/this-is-jeopardy-challenge-project-python-pandas/462365">our forums</a> to compare your project to our sample solution code. You can also learn how to host your own solution on GitHub so you can share it with other learners! Your solution might look different from ours, and that's okay! There are multiple ways to solve these projects, and you'll learn more by seeing others' code.

In [None]:
# Completed