# Cereal Dataset
## NumPy and Pandas Notebook

In this exercise you will use a dataset containing information about different cereals in order to further your knowledge of NumPy and Pandas.  Before expanding our knowledge we will first review some data exploration basics.

First, download the notebook and the .csv file `cereal.csv` and place them in the same folder.

In this notebook, you will be asked to first practice some basic NumPy and Pandas skills. Practice is extremely important in building your manipulation skills so you should attempt to complete the task yourself before watching the solution.

Also note that there are sometimes different ways to do the same thing when using Python, NumPy, and Pandas for data manipulation. Some ways are more efficient than others, especially with larger data sets, but since these datasets are so small, the suggested solutions may not be the most efficient. Instead, our goal is to show you different options when it comes to completing the tasks.

The complete dataset can be found here: [Kaggle](https://www.kaggle.com/crawford/80-cereals)

|Variable | Description|
|:--------|:-----------|
|name| Name of cereal|
|mfr| Manufacturer of cereal|
|type| hot or cold|   
|calories| calories per serving|
|protein| grams of protein|
|fat| grams of fat|
|sodium| milligrams of sodium|
|fiber| grams of dietary fiber|
|carbo| grams of complex carbohydrates|
|sugars| grams of sugars|
|potass| milligrams of potassium|
|vitamins| vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended|
|shelf| display shelf (1, 2, or 3, counting from the floor)|
|weight| weight in ounces of one serving|
|cups| number of cups in one serving|
|rating| a rating of the cereals (Possibly from Consumer Reports?)|

# Exploration

### Initial Imports

In [1]:
# import common libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

In [4]:
# load data from csv

data = pd.read_csv('cereal.csv', index_col = 'name') #ensure file is in same location as notebook or add path

First look at `cereal`

In [5]:
# head

data.head()

|name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|rating|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|100% Bran|N|C|70|4|1|130|10.0|5.0|6|280|25|3|1.0|0.33|68.402973|
|100% Natural Bran|Q|C|120|3|5|15|2.0|8.0|8|135|0|3|1.0|1.0|33.983679|
|All-Bran|K|C|70|4|1|260|9.0|7.0|5|320|25|3|1.0|0.33|59.425505|
|All-Bran with Extra Fiber|K|C|50|4|0|140|14.0|8.0|0|330|25|3|1.0|0.5|93.704912|
|Almond Delight|R|C|110|2|2|200|1.0|14.0|8|-1|25|3|1.0|0.75|34.384843|


In [None]:
# shape

data.shape

77 observations, 15 variables

Now let's see what the columns are named:

In [None]:
# columns

for i in data.columns:
    print(i)

There are several ways to get a quick view of your data:

* the `head()` method (seen above) provides a quick look at the first several rows of your data
* the `info()` method provides a brief summary of your DataFrame including non-null count and data types
* the `describe()` method provides descriptive statistics of numerical data 

In [None]:
# info

data.info()

In [None]:
# describe

data.describe()

In [None]:
# number of observations

len(data)

In [None]:
# descriptives for calories

data.calories.describe()

We can view unique values using `unique()` and identify the counts using `value_counts()`

Let's look at manufacturer 

In [None]:
# unique manufacturers

data.mfr.unique()

In [None]:
# manufacturer value counts

data.mfr.value_counts()

In [None]:
# pivot table for calories by manufacturer

data.pivot_table('calories', 'mfr')

In [None]:
# pivot table for calories by manufacturer by type

data.pivot_table('calories', 'mfr', 'type')
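By default `pivot_table` aggregates with the mean, so the first pivot above should match a `groupby` followed by `mean()`. The sketch below illustrates this on a small made-up frame (`toy` and its values are assumptions for illustration, not the real cereal data).

```python
import pandas as pd

# Hypothetical stand-in for the cereal data; values are made up
toy = pd.DataFrame({
    'mfr': ['K', 'K', 'G', 'G'],
    'type': ['C', 'C', 'C', 'H'],
    'calories': [100, 120, 110, 90],
})

# pivot_table uses the mean as its default aggregation
by_pivot = toy.pivot_table(values='calories', index='mfr')

# the same numbers via groupby
by_group = toy.groupby('mfr')['calories'].mean()

print(by_pivot)
print(by_group)
```

Which form to use is largely a matter of taste; `pivot_table` tends to read more naturally once a `columns` argument is involved.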

# Selecting columns

We're interested in calories, manufacturer, cups, rating, and cost (which doesn't yet exist)

In [None]:
# select variables of interest - including cost

data2 = pd.DataFrame(data, columns = ['calories', 'mfr', 'cups', 'rating', 'cost'])
data2

Notice that `cost` is NaN for every row because the column did not exist in `data` when we selected it
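Passing a DataFrame and a `columns` list to `pd.DataFrame` behaves like a column reindex: columns that exist are kept and missing ones are filled with NaN. A minimal sketch on a two-row frame (`toy`, `with_cost`, and `same_thing` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Two rows copied from the head() output above
toy = pd.DataFrame({'calories': [70, 120], 'cups': [0.33, 1.0]},
                   index=['100% Bran', '100% Natural Bran'])

# Asking for a column that does not exist yields a column of NaN
with_cost = pd.DataFrame(toy, columns=['calories', 'cost'])

# reindex expresses the same idea more explicitly
same_thing = toy.reindex(columns=['calories', 'cost'])

print(with_cost)
print(same_thing)
```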

In [None]:
# this raises a KeyError (even without cost): a comma-separated list in single brackets is treated as one tuple key

data['calories', 'mfr', 'cups', 'rating']

In [None]:
# select multiple variables - but not cost

data[['calories', 'mfr', 'cups', 'rating']]

In [None]:
# this raises a KeyError, too, because `cost` is not a column of `data`

data[['calories', 'mfr', 'cups', 'rating', 'cost']]

In [None]:
# set cost to $3.50

data2['cost'] = 3.5
data2

In [None]:
# create random values for cost

np.random.seed(56)

costs = np.random.uniform(low = 2.5, high = 4.5, size = (len(data2))).round(2)

data2['cost'] = costs
data2

In [None]:
# when assigning values, they must be the correct size
# notice here we have 78 values, not 77, which will cause an error

np.random.seed(56)

costs2 = np.random.uniform(low = 2.5, high = 4.5, size = (78)).round(2)
data2['cost'] = costs2

data2
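To see the size mismatch without stopping the notebook, the failing assignment can be wrapped in `try`/`except`. This is a sketch on a hypothetical three-row frame (`toy`) rather than `data2`, and it uses NumPy's `default_rng` instead of `np.random.seed`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for data2
toy = pd.DataFrame({'calories': [70, 120, 70]})

rng = np.random.default_rng(56)
# one value too many: 4 values for 3 rows
too_many = rng.uniform(low=2.5, high=4.5, size=len(toy) + 1).round(2)

message = None
try:
    toy['cost'] = too_many
except ValueError as err:
    # pandas refuses the assignment and leaves the frame unchanged
    message = str(err)
    print('Assignment failed:', message)
```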

Now we want to identify only those cereals with a cost greater than 3.50

In [None]:
# create new boolean where costs > 3.5

data2['cost2'] = data2['cost'] > 3.5

data2

In [None]:
# let's use `del` to delete `cost2` - we don't need it

del data2['cost2']
data2

In [None]:
# we can also filter the DataFrame directly with the same boolean condition

data3 = data2[data2['cost']>3.5]

data3.head()

In [None]:
print(data2['cost']>3.5)

### Selecting rows

As with selecting columns, there are frequently multiple ways to accomplish tasks

To select observations we can use `.loc` and `.iloc`

`.loc` = selects by axis labels (index and column names)

`.iloc` = selects by integer position
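One difference that is easy to miss: label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position. A small sketch using three `fiber` values from the `head()` output above (`toy` is an illustrative name):

```python
import pandas as pd

toy = pd.DataFrame({'fiber': [10.0, 2.0, 9.0]},
                   index=['100% Bran', '100% Natural Bran', 'All-Bran'])

by_label = toy.loc['All-Bran', 'fiber']   # row label, column label
by_position = toy.iloc[2, 0]              # third row, first column

# label slice: both end points are included
label_slice = toy.loc['100% Bran':'100% Natural Bran']

# positional slice: the end position is excluded
position_slice = toy.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```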

In [None]:
data.head()

In [None]:
data.loc['All-Bran']

In [None]:
data.iloc[2]

We can select values by indicating the row(s) and column(s) we're interested in

In [None]:
data.loc['All-Bran', 'fiber']

In [None]:
# use brackets instead of parentheses as shown in module video to prevent later problems
data.loc[['All-Bran', 'Almond Delight'], 'fiber']

In [None]:
# use brackets instead of parentheses as shown in module video to prevent later problems
data.loc[['All-Bran', 'Almond Delight'], ['fiber', 'mfr']]

# Reinforcement

#### Student Practice
Pause the video and try to perform the following tasks on the `cereal` dataset. Then check your answers as I walk through the solutions. 

**Exercise:** Select only the `protein` column from the data.

In [6]:
# select only variable protein

data['protein']

name
100% Bran                    4
100% Natural Bran            3
All-Bran                     4
All-Bran with Extra Fiber    4
Almond Delight               2
                            ..
Triples                      2
Trix                         1
Wheat Chex                   3
Wheaties                     3
Wheaties Honey Gold          2
Name: protein, Length: 77, dtype: int64

**Exercise:** Select the `fat`, `calories`, and `sugars` columns from the data.

In [8]:
# select only variables fat, calories, sugars

data[['fat', 'calories','sugars']]

|name|fat|calories|sugars|
|:---|:---|:---|:---|
|100% Bran|1|70|6|
|100% Natural Bran|5|120|8|
|All-Bran|1|70|5|
|All-Bran with Extra Fiber|0|50|0|
|Almond Delight|2|110|8|
|...|...|...|...|
|Triples|1|110|3|
|Trix|1|110|12|
|Wheat Chex|1|100|3|
|Wheaties|1|100|3|


**Exercise:** Select the following observations from the data: `Cocoa Puffs`, `Frosted Flakes`, and `Fruity Pebbles`.

In [10]:
# select observations Cocoa Puffs, Frosted Flakes, and Fruity Pebbles

data.loc[['Cocoa Puffs', 'Frosted Flakes', 'Fruity Pebbles']]

|name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|rating|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|Cocoa Puffs|G|C|110|1|1|180|0.0|12.0|13|55|25|2|1.0|1.0|22.736446|
|Frosted Flakes|K|C|110|1|0|200|1.0|14.0|11|25|25|1|1.0|0.75|31.435973|
|Fruity Pebbles|P|C|110|1|1|135|0.0|13.0|12|25|25|2|1.0|0.75|28.025765|


**Exercise:** Select the cereals that have `fiber` greater than 10.

In [13]:
# select cereals where fiber is greater than 10

data.loc[data['fiber']>10]

|name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|rating|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|All-Bran with Extra Fiber|K|C|50|4|0|140|14.0|8.0|0|330|25|3|1.0|0.5|93.704912|


**Exercise:** Import the csv file called `myratings.csv` and save it as a DataFrame called `myratings`.

In [14]:
# import myratings.csv

myratings = pd.read_csv('myratings.csv')

**Exercise:** Output the `myratings` DataFrame.  How many rows of data do we have?  Can you figure out why we do not have 77 rows as we did in the original `cereal` data?

In [15]:
# what do we have?

myratings

| |myrating|
|:---|:---|
|0|1|
|1|2|
|2|3|
|3|3|
|4|3|
|5|3|
|6|5|
|7|3|
|8|3|
|9|2|


**Exercise:** Check [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to determine what is the default for handling blank lines of data.  How could you change this to interpret blank lines as NaN values instead of skipping them?

Import the `myratings.csv` file again, saving it as `myratings`, but this time making sure that blank lines are treated as missing values.

In [16]:
# import myratings.csv

myratings = pd.read_csv('myratings.csv', skip_blank_lines = False)
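The effect of `skip_blank_lines` can be checked without the real file by reading a small in-memory CSV. The text in `csv_text` below is a made-up stand-in for `myratings.csv`, with one blank line between two ratings.

```python
import io
import pandas as pd

# Made-up single-column CSV with one blank line in the middle
csv_text = "myrating\n1\n\n3"

# default behaviour: the blank line is dropped
skipped = pd.read_csv(io.StringIO(csv_text))

# blank line kept and read as a missing value
kept = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)

print(len(skipped), len(kept))
print(kept['myrating'].isna().sum())
```

Keeping the blank lines preserves the row positions, which matters later when the ratings are lined up with the cereal names.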

**Exercise:** Set the index of `myratings` as the same as the index of the `cereal` data (created as `data` above).

In [17]:
# set index to cereal name

myratings.set_index(data.index, inplace = True)

**Exercise:** How many missing values are represented in the `myratings` data?

In [None]:
# count missing values

myratings.isnull().sum()

**Exercise:** Merge the `data` DataFrame and the `myratings` DataFrame.

In [None]:
# merge myratings with original dataframe

data = pd.merge(data, myratings, on = 'name')
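Because both frames share the same index, the merge can also be expressed as an index-to-index merge or with `join`. The sketch below uses two hypothetical two-row frames (`left` and `right`) built from values shown earlier; it is one of several equivalent options, not necessarily the one used in the video.

```python
import pandas as pd

names = pd.Index(['100% Bran', '100% Natural Bran'], name='name')

# Hypothetical miniature versions of `data` and `myratings`
left = pd.DataFrame({'rating': [68.402973, 33.983679]}, index=names)
right = pd.DataFrame({'myrating': [1.0, 2.0]}, index=names)

# merge explicitly on the index of both frames
merged = pd.merge(left, right, left_index=True, right_index=True)

# join aligns on the index by default
joined = left.join(right)

print(merged)
print(joined)
```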

**Exercise:** The `myrating` attribute is on a scale of 1-5 while the rating attribute from the original data is on a scale of 0-100.  

Create a new column called `myrating20` that multiplies the `myrating` column by 20 so that it has the same scale as the original data.

In [None]:
# multiply myrating variable by 20

data['myrating20'] = data['myrating'] * 20

**Exercise:** What is the difference between the original `rating` column and the `myrating20` column?  Create a new column called `diff` that subtracts the `myrating20` column from the `rating` column.

In [None]:
# calculate difference 

data['diff'] = data['rating'] - data['myrating20']

**Exercise:** Find the cereals with the five largest difference values.

In [None]:
# find five largest differences

data.nlargest(5, ['diff'])
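`nlargest` is a shortcut for sorting in descending order and taking the first rows. The sketch below shows the two forms side by side on a hypothetical frame whose `diff` values and labels are made up.

```python
import pandas as pd

# Made-up differences for four hypothetical cereals
toy = pd.DataFrame({'diff': [48.4, -6.0, -0.6, 33.7]},
                   index=['A', 'B', 'C', 'D'])

top_two = toy.nlargest(2, 'diff')
sorted_two = toy.sort_values('diff', ascending=False).head(2)

print(top_two)
print(sorted_two)
```

Note that both forms rank by signed value; if large negative differences should count as well, one option is to rank on `toy['diff'].abs()` instead.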


## Create per cup amounts

#### Student Practice
Pause the video and try to perform the following tasks on the `cereal` dataset. Then check your answers as I walk through the solutions. 

**Exercise:** Create four new variables called:

* `calPerCup` - calories per cup
* `proPerCup` - protein per cup
* `fatPerCup` - fat per cup
* `sugPerCup` - sugar per cup

Each is created by dividing the variable by `cups` (e.g., calories/cups)

In [18]:
# create 'calPerCup', 'proPerCup', 'fatPerCup', 'sugPerCup' variables as ratio of variable to cups

data['calPerCup'] = data['calories']/data['cups']
data['proPerCup'] = data['protein']/data['cups']
data['fatPerCup'] = data['fat']/data['cups']
data['sugPerCup'] = data['sugars']/data['cups']

**Exercise:** Create a scatterplot of `calories` by `calPerCup`.  You can use Matplotlib or Seaborn.

In [None]:
# create scatterplot for calories by calPerCup

sns.scatterplot(data = data, x = 'calPerCup', y = 'calories')
plt.show()

**Exercise:** Which cereal has the highest calories per cup?

In [20]:
# which cereal had the highest calories per cup?

mostcalCup = data.nlargest(1, ['calPerCup'])
mostcalCup

|name|mfr|type|calories|protein|fat|sodium|fiber|carbo|sugars|potass|vitamins|shelf|weight|cups|rating|calPerCup|proPerCup|fatPerCup|sugPerCup|
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
|Grape-Nuts|P|C|110|3|0|170|3.0|17.0|3|90|25|3|1.0|0.25|53.371007|440.0|12.0|0.0|12.0|


## Sorting

**Exercise:** Sort `data` by index.

In [None]:
# sort index alphabetically

data.sort_index()

**Exercise:** Sort the columns alphabetically.

In [None]:
# sort columns

data.sort_index(axis = 1)

**Exercise:** Sort the columns in descending order alphabetically.

In [None]:
# sort in descending order
data.sort_index(axis = 1, ascending = False)

**Exercise:** Sort the data by the `calories` column.

In [None]:
# sort by different columns
data.sort_values(by = 'calories')

**Exercise:** Sort the data by `calories` and then `mfr`.

In [None]:
# sort by multiple columns

data.sort_values(by = ['calories', 'mfr'])

## Descriptive stats

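Many descriptive stats are available quickly, although you should be careful: methods applied to the whole DataFrame also touch non-numeric columns such as `mfr` and `type`, which can give meaningless results or, in recent pandas versions, raise an error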
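As an illustration of why care is needed, summing a text column concatenates its strings. The sketch below uses three rows copied from the `head()` output above (`toy` is an illustrative name) and shows `numeric_only=True` as one way to restrict the result to numeric columns.

```python
import pandas as pd

toy = pd.DataFrame({'mfr': ['N', 'Q', 'K'], 'calories': [70, 120, 70]})

# summing every column: the text column is concatenated
all_sums = toy.sum()

# restricting the sum to numeric columns
numeric_sums = toy.sum(numeric_only=True)

print(all_sums)
print(numeric_sums)
```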

**Exercise:** Sum the entire data by column.

In [None]:
# sum for all variables

data.sum()

**Exercise:** Sum the `calories` column only.

In [None]:
# sum of calories

data.calories.sum()

**Exercise:** Calculate the standard deviation by column.

In [None]:
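# std for all numeric variables
# numeric_only = True skips text columns such as mfr and type, which raise an error in recent pandas versions

data.std(numeric_only = True)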

**Exercise:** Calculate the standard deviation for the `calories` column.

In [None]:
# std of calories

data['calories'].std()

**Exercise:** Sum all variables by observation.  Note that the information here is for illustrative purposes only and the output is not useful.

In [None]:
# sum by observation
# the output here is not useful - only for illustrative purposes
# numeric_only = True avoids adding text columns to numbers

data.sum(axis = 'columns', numeric_only = True)

Some methods are not statistics themselves, but they build on them and are worth knowing

**Exercise:**  What is the index value of the cereal that has the maximum calories?

In [None]:
# identify index value of maximum for calories

data.calories.idxmax()

**Exercise:** What is the index value of the cereal that has the minimum calories?

In [None]:
# identify index value of minimum for calories

data.calories.idxmin()

Correlation and covariance work well with DataFrames

**Exercise:** What is the correlation for all numeric variables in the data?

In [None]:
# correlation for all numeric variables in the df

data.corr(numeric_only = True)

**Exercise:** What is the covariance for all numeric variables in the data?

In [None]:
# covariance for all numeric variables in the df

data.cov(numeric_only = True)
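Correlation and covariance are closely related: the correlation of two variables is their covariance divided by the product of their standard deviations. The sketch below checks this on the `calories` and `sugars` values from the first five cereals shown earlier (`toy` and `manual_corr` are illustrative names).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'calories': [70, 120, 70, 50, 110],
                    'sugars': [6, 8, 5, 0, 8]})

cov = toy.cov()
std = toy.std()

# corr(x, y) = cov(x, y) / (std(x) * std(y)), applied to every pair at once
manual_corr = cov / np.outer(std, std)

print(manual_corr)
print(toy.corr())
```

The two printed matrices should agree up to floating-point rounding, since both `cov()` and `std()` use the sample (n - 1) denominator by default.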