# Cleaning the data
## Exercise: Applying functions to dataframes

Solution
---

> **Exercises**: 
>
> 1) Create a new column called Manufacturer which takes the entries in the column `mfr` and maps them to the full name as follows:
>
> * A = American Home Food Products
> * G = General Mills
> * K = Kelloggs
> * N = Nabisco
> * P = Post
> * Q = Quaker Oats
> * R = Ralston Purina
>
> 2) Calories can be converted to kilojoules using the formula: 1 calorie = 4.184 kilojoules. Find the amount of kilojoules per serving for each cereal. Store the results in a new column.
>
> 3) For each numerical column, find which cereal has the maximum value in that column.

In [1]:
# Import libraries

import pandas as pd
import numpy as np

In [4]:
# Load the data from the 'c2_cereal.csv' file

df = pd.read_csv('c2_cereal.csv')

In [5]:
# Show top entries

df.head()

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


## Exercise 1 solution

In the first exercise, we take the column `mfr`, which holds a one-letter code for each manufacturer, and map each code to the full manufacturer name.

We will use the `map` function to do this and provide a dictionary; each entry in the column `mfr` will be substituted by the corresponding manufacturer's name. We store the result in a new column labeled `Manufacturer`.

In [6]:
# Map entries in 'mfr' to full manufacturer name
# Store result in new column 'Manufacturer'

df['Manufacturer'] = df['mfr'].map({
    'A': 'American Home Food Products',
    'G': 'General Mills',
    'K': 'Kelloggs',
    'N': 'Nabisco',
    'P': 'Post',
    'Q': 'Quaker Oats',
    'R': 'Ralston Purina'
})

# Show top entries

df.head()

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,Manufacturer
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,Nabisco
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,Quaker Oats
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,Kelloggs
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,Kelloggs
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,Ralston Purina
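
One detail worth knowing about `map` with a dictionary: an entry that has no matching key is replaced by `NaN` rather than raising an error. The sketch below illustrates this on a small made-up Series. The code `'X'` and the variable names `codes`, `full_names`, `mapped` and `n_unmapped` are illustrative assumptions and are not part of the cereal data.

```python
import pandas as pd

# Hypothetical toy data: 'X' is a code with no entry in the dictionary
codes = pd.Series(['K', 'N', 'X'])

full_names = {'K': 'Kelloggs', 'N': 'Nabisco'}

# Codes that are missing from the dictionary are mapped to NaN
mapped = codes.map(full_names)

# Count the entries that could not be mapped
n_unmapped = mapped.isna().sum()

print(mapped)
print('Unmapped entries:', n_unmapped)
```

Running a check like `df['Manufacturer'].isna().sum()` after the mapping is a quick way to confirm that every code in `mfr` was covered by the dictionary.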


## Exercise 2 solution

In the second exercise, we want to create a new column to store the energy in kilojoules for each cereal.

We will start by writing a function called `caltokj` to convert calories to kilojoules.

In [7]:
# Define function to convert calories to kilojoules

def caltokj(x):
    """Convert an energy value from calories to kilojoules."""
    return x * 4.184

Next, we apply the `caltokj` function to each entry of the column `calories` and store the result in a new column labeled `kilojoules`.

In [8]:
# Apply 'caltokj' function to each entry of column 'calories'
# Store result in new column 'kilojoules'

df['kilojoules'] = df['calories'].apply(caltokj)

# Show top entries
df.head()

,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,Manufacturer,kilojoules
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,Nabisco,292.88
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,Quaker Oats,502.08
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,Kelloggs,292.88
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,Kelloggs,209.2
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,Ralston Purina,460.24


The output confirms that each entry of the new `kilojoules` column is the corresponding `calories` value multiplied by 4.184. For example, 100% Bran has 70 calories, which gives 70 × 4.184 = 292.88 kilojoules.
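
Because the conversion is a single multiplication, the same column can also be computed without `apply` by multiplying the whole column at once. The sketch below compares the two approaches on a toy dataframe built from three calorie values in the table above. The names `toy`, `via_apply` and `via_multiply` are illustrative, and this is an alternative to the solution shown, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy dataframe with three calorie values taken from the table above
toy = pd.DataFrame({'calories': [70, 120, 50]})

def caltokj(x):
    """Convert an energy value from calories to kilojoules."""
    return x * 4.184

# Element-by-element conversion, as in the solution
via_apply = toy['calories'].apply(caltokj)

# Vectorized conversion: multiply the whole column in one operation
via_multiply = toy['calories'] * 4.184

# Check that both approaches agree (up to floating-point tolerance)
print(np.allclose(via_apply, via_multiply))
```

The vectorized form is usually the more idiomatic choice in pandas for simple arithmetic, while `apply` remains useful when the conversion logic cannot be written as a column-wide expression.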

## Exercise 3 solution

In the last exercise, we want to obtain the cereal with the maximum value in each numeric column.

We will first make the `name` column the index of the dataframe, so that each row is labeled by its cereal name. We will then create a new, smaller dataframe that contains only the numeric columns.

In [9]:
# Change the index to the 'name' column
df.set_index(keys = 'name', # specify which column for the index
             inplace=True   # apply changes to the original df
            )

# Show top entries
df.head()

name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,Manufacturer,kilojoules
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,Nabisco,292.88
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,Quaker Oats,502.08
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,Kelloggs,292.88
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,Kelloggs,209.2
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,Ralston Purina,460.24


The index of the dataframe is now given by the cereal name. Below, we use the `select_dtypes` method to create a new dataframe that contains only the numeric columns.

In [10]:
# Create new dataframe with numeric columns
df_num = df.select_dtypes(include = [np.number])

# Show top entries
df_num.head()

name,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating,kilojoules
100% Bran,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973,292.88
100% Natural Bran,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679,502.08
All-Bran,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505,292.88
All-Bran with Extra Fiber,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912,209.2
Almond Delight,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843,460.24
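
To make the effect of `select_dtypes` easier to see, here is a minimal sketch on a toy dataframe with one text column and two numeric columns. It uses the string `'number'`, which pandas accepts as an alternative to `np.number`. The names `toy`, `numeric_part` and `text_part` are illustrative.

```python
import pandas as pd

# Toy dataframe: one text column, one integer column, one float column
toy = pd.DataFrame({
    'mfr': ['N', 'Q'],
    'calories': [70, 120],
    'fiber': [10.0, 2.0],
})

# Keep only the numeric columns (integers and floats)
numeric_part = toy.select_dtypes(include='number')

# Keep everything except the numeric columns
text_part = toy.select_dtypes(exclude='number')

print(list(numeric_part.columns))
print(list(text_part.columns))
```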


Now, we will apply the pandas `Series.idxmax` function to each column of `df_num`. `Series.idxmax` returns the index label of the row that contains the maximum value of a column; if several rows share the maximum, it returns the first of them. Since we already set the index of the dataframe to the `name` column, this returns the cereal name, which is what we want. We don't use the `max` function because that would return the maximum value in each column instead of the index label.

In [12]:
df_num.apply(pd.Series.idxmax)

calories             Mueslix Crispy Blend
protein                          Cheerios
fat                     100% Natural Bran
sodium                         Product 19
fiber           All-Bran with Extra Fiber
carbo                           Rice Chex
sugars                       Golden Crisp
potass          All-Bran with Extra Fiber
vitamins      Just Right Crunchy  Nuggets
shelf                           100% Bran
weight               Mueslix Crispy Blend
cups                                  Kix
rating          All-Bran with Extra Fiber
kilojoules           Mueslix Crispy Blend
dtype: object
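
Two points about this result can be illustrated with a small sketch. First, `DataFrame.idxmax` applies the same logic to every column directly, so it can be used instead of `apply(pd.Series.idxmax)`. Second, when several rows tie for the maximum, only the first one is reported, which is why a column such as `shelf` needs extra care. The toy dataframe below reuses three rows from the table above; the names `toy`, `per_column`, `direct` and `tied` are illustrative.

```python
import pandas as pd

# Toy dataframe indexed by cereal name; all three cereals tie on 'shelf'
toy = pd.DataFrame(
    {'calories': [70, 120, 50], 'shelf': [3, 3, 3]},
    index=['100% Bran', '100% Natural Bran', 'All-Bran with Extra Fiber'],
)

# Approach used in the solution: apply Series.idxmax to each column
per_column = toy.apply(pd.Series.idxmax)

# Equivalent shortcut: DataFrame.idxmax works column by column
direct = toy.idxmax()

# On a tie, idxmax reports only the first row label;
# a boolean mask recovers every cereal that reaches the maximum
tied = toy.index[toy['shelf'] == toy['shelf'].max()].tolist()

print(direct)
print(tied)
```

In this toy example `shelf` reports `100% Bran` only because it is the first row, while the mask lists all three cereals.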