# Cleaning the data
## Exercise: Applying functions to dataframes

Solution
---

> **Exercises**: 
>
> 1) Create a new column called Manufacturer which takes the entries in the column `mfr` and maps them to the full name as follows:
>
> * A = American Home Food Products
> * G = General Mills
> * K = Kelloggs
> * N = Nabisco
> * P = Post
> * Q = Quaker Oats
> * R = Ralston Purina
>
> 2) Calories can be converted to kilojoules using the formula: 1 calorie = 4.184 kilojoules. Find the amount of kilojoules per serving for each cereal. Store the results in a new column.
>
> 3) For each numerical column, find which cereal has the maximum value in that column.

In [None]:
# Import libraries

import pandas as pd
import numpy as np

In [None]:
# Load the data from 'cereal.csv' file

df = pd.read_csv('cereal.csv')

In [None]:
# Show top entries

df.head()

## Exercise 1 solution

In the first exercise, we take the column `mfr`, which holds a one-letter code for each manufacturer, and map each code to the full manufacturer name.

We will use the `map` function with a dictionary to do this: each entry in the column `mfr` will be replaced by the corresponding manufacturer's name. We store the result in a new column labeled `Manufacturer`.

In [None]:
# Map entries in 'mfr' to full manufacturer name
# Store result in new column 'Manufacturer'

df['Manufacturer'] = df['mfr'].map({
    'A': 'American Home Food Products',
    'G': 'General Mills',
    'K': 'Kelloggs',
    'N': 'Nabisco',
    'P': 'Post',
    'Q': 'Quaker Oats',
    'R': 'Ralston Purina'
})

# Show top entries

df.head()
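One detail worth checking after a dictionary-based `map` is whether every code was covered, because `map` fills entries that have no matching key with `NaN` rather than raising an error. The sketch below illustrates this on a small made-up dataframe (the `toy` data and the code `'Z'` are hypothetical and not part of `cereal.csv`).

```python
import pandas as pd

# Hypothetical toy data; 'Z' is deliberately missing from the dictionary
toy = pd.DataFrame({'mfr': ['K', 'G', 'Z']})

names = {'K': 'Kelloggs', 'G': 'General Mills'}

# Codes found in the dictionary are replaced; the others become NaN
toy['Manufacturer'] = toy['mfr'].map(names)

# List the codes that were left unmapped
unmapped = toy.loc[toy['Manufacturer'].isna(), 'mfr'].tolist()
print(unmapped)
```

An empty list here would mean that every code in `mfr` had an entry in the dictionary.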

## Exercise 2 solution

In the second exercise, we want to create a new column to store the energy in kilojoules for each cereal.

We will start by writing a function called `caltokj` to convert calories to kilojoules.

In [None]:
# Define function to convert calories to kilojoules

def caltokj(x):
    # 1 calorie = 4.184 kilojoules
    return x * 4.184

Next, we apply the `caltokj` function to each entry of the column `calories` and store the result in a new column labeled `kilojoules`.

In [None]:
# Apply 'caltokj' function to each entry of column 'calories'
# Store result in new column 'kilojoules'

df['kilojoules'] = df['calories'].apply(caltokj)

# Show top entries
df.head()

We see that the `caltokj` function was applied to each entry of the column `calories`. The result was stored in a new column labeled `kilojoules`.
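Because the conversion is a single multiplication, the same column can also be computed without `apply` by multiplying the whole column at once. The sketch below compares the two approaches on a small made-up dataframe (the `toy` values are hypothetical); it is an alternative, not a replacement for the solution above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'calories' column
toy = pd.DataFrame({'calories': [70, 120, 50]})

def caltokj(x):
    # 1 calorie = 4.184 kilojoules
    return x * 4.184

# Element-by-element conversion, as in the solution
by_apply = toy['calories'].apply(caltokj)

# Vectorized conversion: multiply the whole column in one step
by_vector = toy['calories'] * 4.184

# Check that both approaches agree
print(np.allclose(by_apply, by_vector))
```

The vectorized form is usually faster on large columns, while `apply` remains useful when the conversion cannot be written as simple arithmetic.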

## Exercise 3 solution

In the last exercise, we want to obtain the cereal with the maximum value in each numeric column.

We will first create a new, smaller dataframe that contains only the numeric columns. Before that, however, we change the index of the dataframe to the `name` column, so that row labels are cereal names rather than integer positions.

In [None]:
# Change the index to the 'name' column
df.set_index(keys = 'name', # specify which column for the index
             inplace=True   # apply changes to the original df
            )

# Show top entries
df.head()

The index of the dataframe is now given by the cereal name. Below, we create a new dataframe to contain the numeric columns using the `select_dtypes` function.

In [None]:
# Create new dataframe with numeric columns
df_num = df.select_dtypes(include = [np.number])

# Show top entries
df_num.head()
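To see what `select_dtypes` keeps and drops, here is a minimal sketch on a made-up dataframe (the `toy` columns and values are hypothetical). Passing `np.number` selects both integer and floating-point columns and leaves out text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns
toy = pd.DataFrame({
    'mfr': ['K', 'G'],
    'calories': [70, 120],
    'rating': [68.4, 33.9],
})

# Keep only columns whose dtype is numeric (integers and floats)
toy_num = toy.select_dtypes(include=[np.number])

print(list(toy_num.columns))
```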

Now, we will apply the pandas `Series.idxmax` function to each column of `df_num`. For a given column, `Series.idxmax` returns the index label of the row that holds the maximum value. Since we already set the index of the dataframe to the `name` column, this returns the cereal name, which is what we want. We don't use the `max` function because that would return the maximum value in each column instead of its index label.

In [None]:
# Find the cereal with the maximum value in each numeric column
df_num.apply(pd.Series.idxmax)
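The difference between `idxmax` and `max` can be sketched on a small made-up dataframe (the cereal names and values below are hypothetical). The sketch also shows that calling `DataFrame.idxmax()` directly gives the same result as applying `pd.Series.idxmax` column by column.

```python
import pandas as pd

# Hypothetical toy data, indexed by a made-up cereal name
toy = pd.DataFrame(
    {'calories': [70, 120, 50], 'fiber': [10.0, 2.0, 14.0]},
    index=pd.Index(['Bran A', 'Granola B', 'Bran C'], name='name'),
)

# Index label (cereal name) of the maximum in each column
top_names = toy.apply(pd.Series.idxmax)

# Maximum value in each column, without the cereal name
top_values = toy.max()

# DataFrame.idxmax() is a shorter way to get the same labels
top_names_direct = toy.idxmax()

print(top_names)
print(top_values)
```

If several rows share the maximum value, `idxmax` reports only the first one, so ties are not visible in this output.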