# Introduction
This assignment tests how well you can perform various data science tasks.

Each Problem Group below will center around a particular dataset that you have worked with before.

To receive full credit for a question, demonstrate the appropriate pandas, Altair, or other commands, as requested, in the provided code blocks.

You may find that some questions require multiple steps to fully answer. Others require some mental arithmetic in addition to pandas commands. Use your best judgment.

## Submission
Each problem group asks a series of questions. This assignment consists of two submissions:

1. After completing the questions below, open the Module 01 Assessment Quiz in Canvas and enter your answers to these questions there.

2. After completing and submitting the quiz, save this Colab notebook as a GitHub Gist by selecting `Save a copy as a GitHub Gist` from the `File` menu above. (You'll need a GitHub account for this.)

    In Canvas, open the Module 01 Assessment GitHub Gist assignment and paste the GitHub Gist URL for this notebook. Then submit that assignment.

## Problem Group 1

For the questions in this group, you'll work with the Netflix Movies Dataset found at this URL: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv)


### Question 1
Load the dataset into a pandas DataFrame and determine what data type is used to store the `release_year` feature.

In [2]:
import pandas as pd
netflix_df = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/netflix_titles.csv")
netflix_df.dtypes

| Column | dtype |
| --- | --- |
| show_id | int64 |
| type | object |
| title | object |
| director | object |
| cast | object |
| country | object |
| date_added | object |
| release_year | int64 |
| rating | object |
| duration | object |
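
The full `dtypes` listing answers the question, but a single column can also be inspected directly. The sketch below assumes a small hand-made frame (`sample_df` and its rows are hypothetical stand-ins, not data from `netflix_titles.csv`) so that it runs without a network connection.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself reads netflix_titles.csv.
sample_df = pd.DataFrame({
    "title": ["Show A", "Movie B", "Show C"],
    "release_year": [2005, 2019, 2011],
})

# Look up the dtype of one column instead of scanning the full listing.
release_year_dtype = sample_df["release_year"].dtype
print(release_year_dtype)

# pandas also provides helper predicates for checking a dtype family.
is_integer = pd.api.types.is_integer_dtype(sample_df["release_year"])
print(is_integer)
```

The same two calls should work unchanged on `netflix_df["release_year"]`.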


### Question 2
Filter your dataset so it contains only `TV Shows`. How many of those TV Shows were rated `TV-Y7`?

In [3]:
# Filter for TV Shows
tv_shows = netflix_df[netflix_df['type'] == 'TV Show']

# Count how many are rated TV-Y7
tv_y7_count = tv_shows[tv_shows['rating'] == 'TV-Y7'].shape[0]
tv_y7_count

100
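
As an alternative to filtering twice, both conditions can be combined into one boolean mask and summed. This is only a sketch on hypothetical rows (`sample_df` is invented for illustration), so the counts it prints are not the answer to the question.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from netflix_titles.csv.
sample_df = pd.DataFrame({
    "type": ["TV Show", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-Y7", "TV-Y7", "TV-MA", "TV-Y7", "PG"],
})

# Combine both conditions into one mask; True counts as 1 when summed.
is_tv_y7_show = (sample_df["type"] == "TV Show") & (sample_df["rating"] == "TV-Y7")
tv_y7_total = int(is_tv_y7_show.sum())
print(tv_y7_total)

# value_counts() gives the full rating breakdown for TV Shows in one call.
rating_counts = sample_df.loc[sample_df["type"] == "TV Show", "rating"].value_counts()
print(rating_counts)
```

The `value_counts()` view is handy for sanity-checking the single number against the other ratings.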

### Question 3
Further filter your dataset so it only contains TV Shows released between the years 2000 and 2009 inclusive. How many of *those* shows were rated `TV-Y7`?

In [4]:
# Filter TV Shows released between 2000 and 2009
tv_shows_2000s = tv_shows[(tv_shows['release_year'] >= 2000) & (tv_shows['release_year'] <= 2009)]

# Count how many are rated TV-Y7
tv_y7_2000s_count = tv_shows_2000s[tv_shows_2000s['rating'] == 'TV-Y7'].shape[0]
tv_y7_2000s_count


4
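
The inclusive year range can also be written with `Series.between`, which includes both endpoints by default. A minimal sketch, assuming a hypothetical `sample_shows` frame whose years sit right at the range boundaries:

```python
import pandas as pd

# Hypothetical shows placed just outside and exactly on the range boundaries.
sample_shows = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [1999, 2000, 2009, 2010],
    "rating": ["TV-Y7", "TV-Y7", "TV-Y7", "TV-Y7"],
})

# between() keeps both endpoints by default, matching "2000 to 2009 inclusive".
in_2000s = sample_shows["release_year"].between(2000, 2009)
shows_2000s = sample_shows[in_2000s]
print(shows_2000s["title"].tolist())
```

Testing with boundary years like these is a quick way to confirm that neither endpoint is dropped by accident.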

## Problem Group 2

For the questions in this group, you'll work with the Cereal Dataset found at this URL: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv)


### Question 4
After importing the dataset into a pandas DataFrame, determine the median amount of `protein` in cereal brands manufactured by Kellogg's (`mfr` code "K").

In [5]:
cereal_df = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/cereal.csv")

In [6]:
# Filter for Kelloggs cereals (mfr code "K")
kelloggs_cereals = cereal_df[cereal_df['mfr'] == 'K']

# Calculate median protein
median_protein = kelloggs_cereals['protein'].median()
median_protein


3.0
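
To compare manufacturers side by side rather than filtering for one, a `groupby` can compute every median at once. The frame below is hypothetical (`sample_cereal` and its values are made up), so only the technique carries over to `cereal_df`.

```python
import pandas as pd

# Hypothetical cereals; the values are invented for illustration.
sample_cereal = pd.DataFrame({
    "mfr": ["K", "K", "K", "G", "G"],
    "protein": [2, 3, 6, 1, 3],
})

# One median per manufacturer code, indexed by mfr.
median_by_mfr = sample_cereal.groupby("mfr")["protein"].median()
print(median_by_mfr)

# Pull out a single manufacturer's value by its code.
kelloggs_median = median_by_mfr.loc["K"]
print(kelloggs_median)
```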

### Question 5
In order to comply with new government regulations, all cereals must now come with a "Healthiness" rating. This rating is calculated based on this formula:

    healthiness = (protein + fiber) / sugar

Create a new `healthiness` column populated with values based on the above formula.

Then, determine the median healthiness value for only General Mills cereals (`mfr` = "G"), rounded to two decimal places.

In [7]:
# Create the healthiness column
cereal_df['healthiness'] = (cereal_df['protein'] + cereal_df['fiber']) / cereal_df['sugars']

# Filter for General Mills cereals (mfr = "G")
general_mills = cereal_df[cereal_df['mfr'] == 'G']

# Calculate median healthiness, rounded to 2 decimal places
median_healthiness = round(general_mills['healthiness'].median(), 2)
median_healthiness


0.47
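
One thing worth checking with this formula: if any row has `sugars` equal to 0, pandas returns `inf` rather than raising an error, and that value then takes part in the median. The sketch below uses a hypothetical three-row frame to show how to detect this and, if desired, treat such rows as missing. Whether the assignment expects that treatment is an assumption, not something the instructions state.

```python
import numpy as np
import pandas as pd

# Hypothetical cereals, one of which has zero sugar.
sample_cereal = pd.DataFrame({
    "mfr": ["G", "G", "G"],
    "protein": [2, 3, 1],
    "fiber": [1.0, 2.0, 0.0],
    "sugars": [6, 0, 4],
})

healthiness = (sample_cereal["protein"] + sample_cereal["fiber"]) / sample_cereal["sugars"]

# Division by zero yields inf here instead of raising an exception.
has_inf = bool(np.isinf(healthiness).any())
print(has_inf)

# Optionally treat infinite ratings as missing; median() skips NaN by default.
cleaned = healthiness.replace([np.inf, -np.inf], np.nan)
median_clean = cleaned.median()
print(median_clean)
```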

## Problem Group 3

For the questions in this group, you'll work with the Titanic Dataset found at this URL: [https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv](https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv)

### Question 6

After loading the dataset into a pandas DataFrame, create a new column called `NameGroup` that contains the first letter of the passenger's surname in lower case.

Note that in the dataset, passengers' names are provided in the `Name` column and are listed as:

    Surname, Given names

For example, if a passenger's `Name` is `Braund, Mr. Owen Harris`, the `NameGroup` column should contain the value `b`.

Then count how many passengers have a `NameGroup` value of `k`.

In [8]:
titanic_df = pd.read_csv("https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/titanic.csv")

In [9]:
# Create NameGroup column with first letter of surname in lowercase
titanic_df['NameGroup'] = titanic_df['Name'].apply(lambda x: x.split(',')[0][0].lower())

# Count passengers with NameGroup 'k'
k_count = titanic_df[titanic_df['NameGroup'] == 'k'].shape[0]
k_count


28
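
The same column can be built with pandas' vectorized `.str` accessor instead of `apply`, which also passes missing names through as `NaN` rather than raising. A minimal sketch, assuming a hypothetical `sample_passengers` frame (only the `Braund` name comes from the example above; the others are invented):

```python
import pandas as pd

# Hypothetical passengers; only the first name appears in the assignment text.
sample_passengers = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Kent, Mr. Example One",
        "King, Miss. Example Two",
        "Adams, Mrs. Example Three",
    ],
})

# Take the text before the comma (the surname), then its first letter, lowercased.
surname = sample_passengers["Name"].str.split(",").str[0].str.strip()
sample_passengers["NameGroup"] = surname.str[0].str.lower()

# Count one group directly, or all groups at once with value_counts().
k_total = int((sample_passengers["NameGroup"] == "k").sum())
group_counts = sample_passengers["NameGroup"].value_counts()
print(k_total)
print(group_counts)
```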