# 🛠 IFQ718 Module 06 Exercises 02

## 🔍  Context: Aggregating data

In this notebook, the data within a DataFrame will be grouped, and functions will be applied to each group.

Recall the penguins dataset, where there are three penguin species. We could ask the question, what is the average bill length for each species? Pandas will answer this question for us.

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/penguins.csv')

In [None]:
df

### ✍ Activity 1: try the basic aggregation functions on the DataFrame `df`

They are:

|    Aggregation   |           Description           |
|:----------------:|:-------------------------------:|
| `count()`          | Total number of items           |
| `first()`, `last()`  | First and last item             |
| `mean()`, `median()` | Mean and median                 |
| `min()`, `max()`     | Minimum and maximum             |
| `std()`, `var()`     | Standard deviation and variance |
| `mad()`            | Mean absolute deviation (removed in pandas 2.0) |
| `prod()`           | Product of all items            |
| `sum()`            | Sum of all items                |

We will get you started with `count()`.

For each function, write a short comment with your observations, explaining what the function does.

In [None]:
df.count()

# counts the number of non-NaN values in each column

In [None]:
# Try `first()` here

In [None]:
# Try `last()` here

In [None]:
# Try `mean()` here

In [None]:
# Try `median()` here

In [None]:
# Try `min()` here

In [None]:
# Try `max()` here

In [None]:
# Try `std()` here

In [None]:
# Try `var()` here

In [None]:
# Try `mad()` here

In [None]:
# Try `prod()` here

In [None]:
# Try `sum()` here

### Using the `aggregate` function

The `.aggregate()` method (also available under the shorter name `.agg()`) performs the same operations as the methods in the previous activity.

However, it also lets you specify user-defined functions, and multiple functions per column.

In [None]:
df.aggregate('mean')

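Notice the warning or error? Depending on your pandas version, older releases emit a warning, while pandas 2.0 and later raise a `TypeError`. How would you calculate the average of a column of strings? Seems strange, doesn't it? Pandas thinks so too.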

Let's take a look at which columns are numerical types, then specify that we only want to aggregate those columns using the mathematical functions.
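
The numeric columns can also be picked out programmatically rather than listed by hand. The following is a minimal sketch using `select_dtypes`; the `toy` frame and its values are made up for illustration and only stand in for the penguins data.

```python
import pandas as pd

# Hypothetical stand-in for the penguins data (values are invented)
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.0, 47.0, 49.0],
    'body_mass_g': [3700, 5000, 3900],
})

# Keep only the columns with a numeric dtype; 'species' is dropped
numeric = toy.select_dtypes(include='number')

# Aggregating the numeric-only frame no longer trips over strings
numeric_means = numeric.aggregate('mean')
print(numeric_means)
```

Passing `numeric_only=True` to a method such as `mean()` is another option when only a single aggregation is needed.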

In [None]:
df.dtypes

In [None]:
df.aggregate({
    'bill_length_mm' : 'mean', 
    'bill_depth_mm' : 'mean', 
    'flipper_length_mm' : 'mean', 
    'body_mass_g' : 'mean'
})

In [None]:
df.aggregate({
    'bill_length_mm' : ['min', 'mean', 'max'],
    'bill_depth_mm' : ['min', 'mean', 'max'],
    'flipper_length_mm' : ['min', 'mean', 'max'],
    'body_mass_g' : ['min', 'mean', 'max']
})

In [None]:
# a user-defined function, `my_range`, named to avoid conflicting with the built-in `range` function
def my_range(series):
    return series.max() - series.min()

In [None]:
df.aggregate({
    'bill_length_mm' : [my_range, 'min', 'mean', 'max'],
    'bill_depth_mm' : [my_range, 'min', 'mean', 'max'],
    'flipper_length_mm' : [my_range, 'min', 'mean', 'max'],
    'body_mass_g' : [my_range, 'min', 'mean', 'max']
})

### ✍ Activity 2: write your own aggregate function to count the number of penguins with `body_mass_g` above the mean

Hint, use:
* `.mean()`, and
* `.sum()`
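
As a reminder of why `.sum()` helps with counting: comparing a Series against a value gives a Series of booleans, and summing booleans counts the `True` entries. The sketch below uses an invented Series and a fixed threshold of `4.0` (both assumptions for illustration), so adapting it to the mean is left to you.

```python
import pandas as pd

# Invented values, unrelated to the penguins data
values = pd.Series([2.0, 5.0, 7.0, 1.0, 9.0])

# Element-wise comparison produces a boolean Series
is_large = values > 4.0

# True counts as 1 and False as 0, so the sum is a count
n_large = is_large.sum()
print(n_large)
```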

In [None]:
def count_above_the_mean(series):
    return series.min() # replace this line with your code

df.aggregate({
    'body_mass_g' : [count_above_the_mean, 'min', 'mean', 'max']
})

### Grouping data

What if we want to compute the average of each bill dimension for each penguin species?

We can use `.groupby()`. 

Let's try it:

In [None]:
df.groupby('species')

This returns a `pandas.core.groupby.generic.DataFrameGroupBy` object, which is not useful on its own yet. Pandas has recorded how to split the rows into groups, but it computes nothing to display until we apply a function to those groups.
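
To see what such an object holds, it can be inspected directly. This is a small sketch on a made-up `toy` frame (the rows and values are assumptions, not the real penguins data), using `ngroups`, iteration, and `get_group`.

```python
import pandas as pd

# Hypothetical miniature of the penguins data
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Chinstrap'],
    'bill_length_mm': [39.0, 41.0, 47.0, 49.0, 50.0],
})

grouped = toy.groupby('species')

# Number of distinct groups that were found
n_groups = grouped.ngroups

# Iterating yields pairs of (group key, sub-DataFrame for that key)
group_sizes = {key: len(sub_df) for key, sub_df in grouped}

# A single group can be pulled out by its key
adelie = grouped.get_group('Adelie')
print(n_groups, group_sizes)
print(adelie)
```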

Now let's apply some aggregate functions to the groups:

In [None]:
df_bills_by_species = df.groupby('species').aggregate({
    'bill_length_mm' : ['min', 'mean', 'max'],
    'bill_depth_mm' : ['min', 'mean', 'max']
})

In [None]:
df_bills_by_species

Now, slice the frame to keep only the column `bill_length_mm`:

In [None]:
df_bills_by_species['bill_length_mm']

And again, keeping only the mean of that column:

In [None]:
df_bills_by_species['bill_length_mm']['mean']
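
Aggregating with several functions per column produces two-level column labels, which is why two rounds of indexing were needed. As a sketch on an invented `toy` frame (values are assumptions), the same column can also be reached in one step with a tuple, and a single cell with `.loc`.

```python
import pandas as pd

# Hypothetical miniature of the penguins data
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'bill_length_mm': [39.0, 41.0, 47.0, 49.0],
    'bill_depth_mm': [18.0, 19.0, 14.0, 15.0],
})

summary = toy.groupby('species').aggregate({
    'bill_length_mm': ['min', 'mean', 'max'],
    'bill_depth_mm': ['min', 'mean', 'max'],
})

# Two-step indexing: outer level first, then inner level
chained = summary['bill_length_mm']['mean']

# One-step indexing: a tuple names both levels at once
direct = summary[('bill_length_mm', 'mean')]

# A single value: row label plus the column tuple
adelie_mean = summary.loc['Adelie', ('bill_length_mm', 'mean')]
print(direct)
print(adelie_mean)
```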

What about bill dimensions by species and island?

In [None]:
df.groupby(['species', 'island']).aggregate({
    'bill_length_mm' : ['min', 'mean', 'max'],
    'bill_depth_mm' : ['min', 'mean', 'max']
})

In [None]:
df.groupby(['species', 'island'])[['bill_length_mm', 'bill_depth_mm']].mean()
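
Grouping by two columns gives a result whose rows are labelled by (species, island) pairs. The sketch below, on an invented `toy` frame (values are assumptions), shows one way to look up a single pair with `.loc`, and how `reset_index()` turns the group labels back into ordinary columns.

```python
import pandas as pd

# Hypothetical miniature of the penguins data
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo'],
    'island': ['Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [38.0, 40.0, 37.0, 47.0],
})

means = toy.groupby(['species', 'island'])[['bill_length_mm']].mean()

# Look up one (species, island) pair using a tuple as the row label
adelie_dream = means.loc[('Adelie', 'Dream'), 'bill_length_mm']

# Move the group labels out of the index and into regular columns
flat = means.reset_index()
print(adelie_dream)
print(flat)
```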

### ✍ Activity 3: find the min, mean, median and range of each bill dimension, per sex

In [None]:
# Write your code here