<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [1]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

# Aggregation

Aggregation means grouping data together by a particular grouping variable and producing a summary of one or more columns for that grouping variable.

## 6.1 Aggregation using the .groupby() method

We'll use the pandas `.groupby()` method.

This method is especially useful when your data are disaggregated, e.g. data about individual people or things.

`.groupby()` allows you to aggregate by a categorical variable and summarise numerical data into a new dataframe.

`.groupby()` works on a principle known as 'split-apply-combine':
* Split - a dataframe is divided into a set of smaller dataframes based on the grouping variable.
* Apply - an aggregation is applied to each group, producing a single summary row per group.
* Combine - bring together the aggregated dataframe rows into a final new dataframe.
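
The three steps above can be sketched by hand on a small, made-up dataframe. The `toy` data below is invented for illustration (only the column names mirror the `titanic` dataframe), and the manual version is just one way to picture what `.groupby()` does, not how pandas implements it internally.

```python
import pandas as pd

# Made-up toy data; the column names mirror the titanic dataframe
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Split: one smaller dataframe per value of the grouping variable
pieces = {
    value: toy[toy["pclass"] == value]
    for value in sorted(toy["pclass"].unique())
}

# Apply: reduce each piece to a single summary value
means = {value: piece["fare"].mean() for value, piece in pieces.items()}

# Combine: collect the per-group results into one object
manual = pd.Series(means, name="fare")
manual.index.name = "pclass"

# The same three steps in a single call
grouped = toy.groupby("pclass")["fare"].mean()

print(manual)
print(grouped)
```

Both objects hold one mean fare per passenger class; `.groupby()` simply saves you from writing the split and combine steps yourself.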

Let's walk through what that might look like for the `titanic` dataframe:
* Firstly, we decide to **split** the data by 'pclass'. This effectively divides the `titanic` dataframe into three separate dataframes: one for first class, one for second and one for third.
* Secondly, we **apply** an aggregation to each of those dataframes. You can either produce an aggregate statistic for every column, or you can select specific columns on which to do the aggregation. If we **apply** a `.mean()` aggregation to 'fare', then for each 'pclass' group we get the average fare.
* Finally, pandas returns a **combined** object that contains the new aggregate statistics: a Series when you aggregate a single column, as with 'fare', or a dataframe when you aggregate several columns.

Let's look at that in code:

In [None]:
titanic.groupby('pclass')['fare'].mean()

In [None]:
# Similarly, a count of children by class might look like this
# (summing gives a count because 'child' is assumed to be a True/False or 1/0 flag):
titanic.groupby('pclass')['child'].sum()
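
The walkthrough above mentions aggregating every column versus selected columns. Here is a minimal sketch of that difference, again on made-up data: the `toy` values and the `age` column are invented for illustration, and `child` is assumed to be a True/False flag.

```python
import pandas as pd

# Made-up toy data; 'child' is assumed to be a True/False flag
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    "age": [40, 10, 30, 5, 8, 25],
    "child": [False, True, False, True, True, False],
})

# No column selected: every remaining column is aggregated
all_columns = toy.groupby("pclass").mean()

# A list of columns selected: the result is a dataframe with just those columns
selected = toy.groupby("pclass")[["fare", "age"]].mean()

# A single column selected: the result is a Series
# Summing a True/False flag counts the True values in each group
children = toy.groupby("pclass")["child"].sum()

print(all_columns)
print(selected)
print(children)
```

Selecting columns before aggregating keeps the output focused and avoids summarising columns where the statistic is not meaningful.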

Hopefully this all sounds fairly straightforward! `.groupby()` is a powerful tool, particularly when you are working with any kind of hierarchical data and want an aggregate summary of the groups within it, for instance:
* individuals nested in households.
* employees nested in firms.
* patients nested in primary or secondary care trusts.
* small area geographies (e.g. wards, output areas, postcodes) nested in larger geographies (e.g. districts, counties).
* countries nested in supra-national entities.

or data grouped by demographic, cultural and socio-economic classes:
* individuals by age, sex, ethnicity, religion etc.
* employees by grade or occupational social class.
* households by neighbourhood deprivation rank or decile.
* experimental subjects in intervention and control arms of a trial.

We can also group by more than one variable at once by passing a list of column names:

In [None]:
# Group by passenger class, then city of embarkation.
titanic.groupby(['pclass','embarked_city'])['fare'].mean()

In [None]:
# NB order is important to the output.
titanic.groupby(['embarked_city','pclass'])['fare'].mean()

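The order of the grouping columns matters. The aggregate values are the same either way, but the order determines how the index of the result is nested: the first column listed becomes the outer level and the second becomes the inner level.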
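
To make the effect of ordering concrete, here is a small sketch on made-up data. The `toy` values, including the city names, are invented for illustration; only the column names follow the `titanic` dataframe.

```python
import pandas as pd

# Made-up toy data; the column names mirror the titanic dataframe
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "embarked_city": ["Cherbourg", "Southampton", "Southampton",
                      "Queenstown", "Southampton", "Southampton"],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

by_class_first = toy.groupby(["pclass", "embarked_city"])["fare"].mean()
by_city_first = toy.groupby(["embarked_city", "pclass"])["fare"].mean()

# The index levels follow the order of the grouping columns
print(list(by_class_first.index.names))
print(list(by_city_first.index.names))

# Looking up a single group uses a key in that same order
third_class_southampton = by_class_first.loc[(3, "Southampton")]
southampton_third_class = by_city_first.loc[("Southampton", 3)]
print(third_class_southampton, southampton_third_class)
```

Both lookups return the same mean fare; only the way you address the group changes.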

If you assign the `.groupby()` output to a variable, you can also pull out the dataframe for a particular group with `.get_group()`, just as if you had written a filter condition!

In [None]:
classes = titanic.groupby('pclass')
classes.get_group(3).head()
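
As a quick check of the comparison with filtering, the sketch below pulls out the same group both ways. The `toy` dataframe is made up for illustration; only its column names follow the `titanic` dataframe.

```python
import pandas as pd

# Made-up toy data; the column names mirror the titanic dataframe
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

classes = toy.groupby("pclass")

# Pull out the third-class rows from the grouped object
third_from_group = classes.get_group(3)

# The equivalent boolean filter on the original dataframe
third_from_filter = toy[toy["pclass"] == 3]

# Both keep the original row labels and the same fares
print(third_from_group)
print(third_from_filter)
```

`.get_group()` is handy when you have already built the grouped object and want to inspect one group without writing the filter again.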