# <font color=#14F278>Unit 8 - Aggregations</font>
---

## <font color=#14F278>1. The Power of Aggregation:</font>


<font color=#14F278>**Aggregation**</font> simply means <font color=#14F278>**putting things together**</font>. So far we have learnt how to construct and work with Pandas Objects, perform element-wise operations and combine datasets. However, in order for our data to deliver more robust and reliable insights and <font color=#14F278>**drive a business narrative**</font>, we need to step away from the detail and <font color=#14F278>**look at the big picture**</font>. Aggregating (summarising) datasets enables us to do exactly this!

<font color=#14F278>**Aggregation Function - Definition:**</font> 
- A function that takes a collection of values and returns a single value
- Examples of aggregation functions include __sum, count, mean, max, median, standard deviation, quantiles, etc__.


In [1]:
import pandas as pd
import numpy as np

---
### <font color=#14F278>1.1 Aggregation on Series:</font>

- <font color=#14F278>**Aggregation on Series**</font> simply takes all elements of the Series as input and returns a <font color=#14F278>**single-valued output**</font>
- certain aggregations require elements of a numeric data type
- others work both on numeric and categorical data types 

In [2]:
# Construct a simple Series Object
s = pd.Series([1,2,3,4,5])
display(s)

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [3]:
# Conduct basic aggregation on the Series
print('s.mean()  :', s.mean())
print('s.median():', s.median())
print('s.count() :', s.count())
print('s.std()   :', s.std())
print('s.max()   :', s.max())

s.mean()  : 3.0
s.median(): 3.0
s.count() : 5
s.std()   : 1.5811388300841898
s.max()   : 5
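
The bullets above note that some aggregations also work on categorical data. As a minimal sketch (the Series `labels` is a hypothetical example, not part of the original notebook), counting and frequency-based aggregations still reduce a Series of strings to a single value:

```python
import pandas as pd

# Hypothetical Series of product labels (string data, not numeric)
labels = pd.Series(['vase', 'vase', 'vase', 'plate', 'plate'])

n_values = labels.count()        # number of non-missing values
n_unique = labels.nunique()      # number of distinct values
most_common = labels.mode()[0]   # most frequent value (first one if tied)

print('count  :', n_values)
print('nunique:', n_unique)
print('mode   :', most_common)
```

Aggregations such as `mean()` or `std()`, by contrast, only make sense for numeric data.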


---
### <font color=#14F278>1.2 Aggregation on DataFrames:</font>

- Recall that a DataFrame is a <font color=#14F278>**collection of Series**</font> - each column can be viewed as a Series
- Since DataFrames are 2-dimensional objects, aggregations are performed <font color=#14F278>**along a given axis**</font> - either on the rows or the columns
- Depending on the instruction - `axis=0` for aggregating the columns, or `axis=1` for aggregating the rows, the function will produce a single value per column or per row

In [4]:
# Let's construct a product dataframe, containing product type, quantity, price and cost of production
product_data = {
    'product':['vase','vase', 'vase', 'plate', 'plate'],
    'quantity': [130,247,75,300,180],
    'price_per_unit':[15,8,25,np.nan,12],
    'cost_per_unit':[5,2,8,4,7]
}

product_df = pd.DataFrame(product_data)
display(product_df)

Unnamed: 0,product,quantity,price_per_unit,cost_per_unit
0,vase,130,15.0,5
1,vase,247,8.0,2
2,vase,75,25.0,8
3,plate,300,,4
4,plate,180,12.0,7


In [5]:
# sum(axis = 0) or simply .sum() aggregates each column
product_df.sum(axis = 0)

product           vasevasevaseplateplate
quantity                             932
price_per_unit                      60.0
cost_per_unit                         26
dtype: object

In [6]:
# sum(axis = 1) aggregates each row
# Note that each row may contain a number of data types, which may break the aggregation
# Rule of thumb - assess which values you want to aggregate and only select them for the operation
product_df[['quantity', 'price_per_unit', 'cost_per_unit']].sum(axis = 1)

0    150.0
1    257.0
2    108.0
3    304.0
4    199.0
dtype: float64

Here is a summary of the observations:
<center>
    <div>
        <img src="..\images\aggregation_001.png"/>
    </div>
</center>

<font color=#FF8181>**Important:**</font> By default, missing values (NaN) are <font color=#FF8181>**omitted from any aggregation calculation**</font>. The blanks are excluded from the population of values used to derive the statistics. For example, the sum for row 3 above is 304.0 (300 + 4), even though its price is missing. This means that aggregations run without any prior handling of missing data, although the result then describes only the non-missing values.
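
To make the default behaviour visible, here is a small sketch that compares the default with `skipna=False`. The Series `prices` is a stand-alone copy of the `price_per_unit` column, created only for this illustration:

```python
import numpy as np
import pandas as pd

# Stand-alone copy of the price_per_unit column, with one missing value
prices = pd.Series([15, 8, 25, np.nan, 12])

mean_default = prices.mean()              # NaN skipped: (15 + 8 + 25 + 12) / 4
mean_strict = prices.mean(skipna=False)   # NaN propagates into the result
n_missing = prices.isna().sum()           # how many values were left out

print('default mean     :', mean_default)
print('skipna=False mean:', mean_strict)
print('missing values   :', n_missing)
```

Checking `isna().sum()` alongside the statistic is one way to see how many observations the aggregate is actually based on.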


---
### <font color=#14F278>1.3 DataFrame Summary Statistics:</font>
- Pandas provides a very easy way of producing <font color=#14F278>**summary statistics**</font> on dataframes, which saves us a lot of time and effort in the initial exploratory phase
- The `describe()` method produces the most important aggregations on numeric data and returns it back in the form of a dataframe
- Again, all NaN values are excluded from the calculations

In [8]:
product_df.describe()

Unnamed: 0,quantity,price_per_unit,cost_per_unit
count,5.0,4.0,5.0
mean,186.4,15.0,5.2
std,89.734609,7.25718,2.387467
min,75.0,8.0,2.0
25%,130.0,11.0,4.0
50%,180.0,13.5,5.0
75%,247.0,17.5,7.0
max,300.0,25.0,8.0
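
`describe()` can also be called on a non-numeric column, in which case it reports counts and frequencies instead of means and quantiles. A minimal sketch, using a cut-down copy of `product_df` built just for this example:

```python
import pandas as pd

# Cut-down copy of product_df, kept self-contained for this illustration
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'quantity': [130, 247, 75, 300, 180],
})

# On a column of strings, describe() returns count, unique, top and freq
text_summary = product_df['product'].describe()
print(text_summary)
```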


---
## <font color=#14F278>2. Aggregation with GroupBy - Split-Apply-Combine:</font>

So far, we would take a DataFrame:
- <font color=#14F278>**apply**</font> a given aggregation function along an axis,
- then <font color=#14F278>**combine**</font> the results into a single summary object

Very often, we would like to <font color=#14F278>**involve a categorical column**</font> in the analysis, by <font color=#14F278>**splitting**</font> observations by category, and aggregating for each sub-set. In other words, categorical columns provide us with a natural way of <font color=#14F278>**grouping**</font> data into subsets, which we can analyse individually!

A <font color=#14F278>**GroupBy Operation**</font> is an operation that involves <font color=#14F278>**splitting**</font> the object, <font color=#14F278>**applying**</font> a function on each subset, then <font color=#14F278>**combining**</font> the results.

Let's explore the case of `product_df`:

<center>
    <div>
        <img src="..\images\aggregation_002.png"/>
    </div>
</center>


The <font color=#14F278>**Split-Apply-Combine**</font> process can be implemented in Pandas via the `groupby()` method:
- first, we apply the `groupby()` method on the dataframe, passing the categorical column name as a parameter - this **splits** the data
- next, we select the columns to aggregate on - this is optional - we can select a single column, multiple columns or the whole dataset
- finally, we apply an aggregation function such as `sum()` or `mean()` - this performs the **apply** and **combine** steps
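
To see the three steps separately, the sketch below spells them out by hand and compares the result with the one-line `groupby()` call. It uses a cut-down copy of `product_df`; the names `totals`, `manual` and `direct` are chosen here for illustration only:

```python
import pandas as pd

# Cut-down copy of product_df, kept self-contained for this illustration
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'quantity': [130, 247, 75, 300, 180],
})

# Split: iterating over a GroupBy yields (group label, sub-DataFrame) pairs
# Apply: aggregate each subset separately
totals = {}
for label, subset in product_df.groupby('product'):
    totals[label] = subset['quantity'].sum()

# Combine: collect the per-group results into a single Series
manual = pd.Series(totals, name='quantity')

# The equivalent one-liner
direct = product_df.groupby('product')['quantity'].sum()

print(manual)
print(direct)
```

In practice the one-liner is preferable; the loop is only there to show what happens behind the scenes.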

In [9]:
# Call the aggregation function sum() on the whole dataset
# Grouping by 'product'
# What happened to the column 'product'?
product_df.groupby('product').sum()

Unnamed: 0_level_0,quantity,price_per_unit,cost_per_unit
product,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
plate,480,12.0,11
vase,452,48.0,15
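
The question in the comment above has a short answer: the grouping column becomes the index of the result. As a sketch (again on a cut-down copy of `product_df`), here are two ways to keep `'product'` as a regular column instead:

```python
import pandas as pd

# Cut-down copy of product_df, kept self-contained for this illustration
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'quantity': [130, 247, 75, 300, 180],
})

# Default: 'product' is the index of the aggregated result
by_index = product_df.groupby('product')['quantity'].sum()

# Option 1: ask groupby() not to use the grouping column as the index
as_column = product_df.groupby('product', as_index=False)['quantity'].sum()

# Option 2: move the index back into a column after aggregating
flat = by_index.reset_index()

print(as_column)
print(flat)
```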


In [10]:
# We can select a single column for aggregation
# If passed inside a single [], the output is a Series Object
# If passed inside a double [[]], the output is a DataFrame object
product_df.groupby('product')['quantity'].sum()

product
plate    480
vase     452
Name: quantity, dtype: int64

In [11]:
# To aggregate on more than one column, we always need to pass their names in a double [[]] structure
product_df.groupby('product')[['quantity', 'price_per_unit']].sum()

Unnamed: 0_level_0,quantity,price_per_unit
product,Unnamed: 1_level_1,Unnamed: 2_level_1
plate,480,12.0
vase,452,48.0


---
### <font color=#14F278>2.1 `groupby()` and the `agg()` Method:</font>

In the above example we applied a **single aggregation function** to a column or a set of columns, splitting by a categorical variable.

Sometimes, we want to extend the **Split-Apply-Combine** process to performing:
- <font color=#14F278>**multiple aggregations**</font> on a column/columns
- <font color=#14F278>**different aggregations to different columns**</font> 

To do this, we chain the `agg()` method after the `groupby()` method:
- `df.groupby('column_name').agg([agg_fn1, agg_fn2, ...])`
- `df.groupby('column_name').agg({'column_name1':agg_fn1, 'column_name2':agg_fn2, ...})` 

In [12]:
# Using the .agg method to apply multiple aggregation functions across all DataFrame columns.
product_df.groupby('product').agg(['sum', 'mean'])


Unnamed: 0_level_0,quantity,quantity,price_per_unit,price_per_unit,cost_per_unit,cost_per_unit
Unnamed: 0_level_1,sum,mean,sum,mean,sum,mean
product,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
plate,480,240.0,12.0,12.0,11,5.5
vase,452,150.666667,48.0,16.0,15,5.0


In [14]:
# Using the .agg method to apply multiple aggregation functions across a specified column
product_df.groupby('product')[['quantity']].agg(['sum', 'mean'])

Unnamed: 0_level_0,quantity,quantity
Unnamed: 0_level_1,sum,mean
product,Unnamed: 1_level_2,Unnamed: 2_level_2
plate,480,240.0
vase,452,150.666667


In [17]:
# Using .agg method, specifying different aggregation functions for each column.
# for quantity, find the total sum for the product group
# for price_per_unit, find the average price for the product group
# for cost_per_unit, find the minimal cost for the product group
product_df.groupby('product').agg({'quantity':'sum', 'price_per_unit':'mean', 'cost_per_unit':'min'})

Unnamed: 0_level_0,quantity,price_per_unit,cost_per_unit
product,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
plate,480,12.0,4
vase,452,16.0,2
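
A related option, available in pandas 0.25 and later, is named aggregation: each keyword passed to `agg()` names an output column and maps it to a `(column, function)` pair. The output names below (`total_quantity`, `avg_price`, `min_cost`) are chosen for this sketch and do not appear in the original notebook:

```python
import numpy as np
import pandas as pd

# Same data as product_df, rebuilt here so the example is self-contained
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'quantity': [130, 247, 75, 300, 180],
    'price_per_unit': [15, 8, 25, np.nan, 12],
    'cost_per_unit': [5, 2, 8, 4, 7],
})

# Each keyword becomes a column name in the result
summary = product_df.groupby('product').agg(
    total_quantity=('quantity', 'sum'),
    avg_price=('price_per_unit', 'mean'),
    min_cost=('cost_per_unit', 'min'),
)
print(summary)
```

This yields the same numbers as the dictionary form, with self-describing column names.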


---
### <font color=#14F278>2.2 `groupby()` and the `filter()` Method:</font>
Another very useful combination in Pandas is the `filter()` method, chained after a `groupby()` method. It essentially allows us to:
- <font color=#14F278>**filter out groups**</font> that do not satisfy a given <font color=#14F278>**boolean condition**</font>
- the boolean condition revolves around some sort of <font color=#14F278>**aggregation**</font>
- the output contains only the rows for the groups, for which the aggregation result satisfies the boolean criterion

Syntax:
- `df.groupby('column_name').filter(boolean aggregate function)`


Let's explore an example in the case of `product_df`:
- you work in a manufacturing company, producing different types of products
- you are tasked with conducting further research on cost effectiveness for each product group
- in particular, you want to investigate only those products, for which the minimum cost per unit exceeds £3.

<center>
    <div>
        <img src="..\images\aggregation_003.png"/>
    </div>
</center>

In [18]:
# Recall what product_df looks like
display(product_df)

Unnamed: 0,product,quantity,price_per_unit,cost_per_unit
0,vase,130,15.0,5
1,vase,247,8.0,2
2,vase,75,25.0,8
3,plate,300,,4
4,plate,180,12.0,7


In [19]:
# the filter() method accepts a function as its argument
# we can either provide a function name or define a lambda function on the go
# importantly, the function should return a True or False value for each group of records
# Currently, the function returns True for group 'plate' and False for group 'vase'
# Only the rows of category 'plate' got returned
product_df.groupby('product').filter(lambda group: group['cost_per_unit'].min() > 3)

Unnamed: 0,product,quantity,price_per_unit,cost_per_unit
3,plate,300,,4
4,plate,180,12.0,7
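
The same selection can be written with a named function, or as a boolean mask. The sketch below (the helper `min_cost_above_3` is a name chosen for this example) shows both on a cut-down copy of `product_df`:

```python
import pandas as pd

# Cut-down copy of product_df, kept self-contained for this illustration
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'cost_per_unit': [5, 2, 8, 4, 7],
})

def min_cost_above_3(group):
    """Return True if every record in the group costs more than 3 per unit."""
    return group['cost_per_unit'].min() > 3

# Option 1: pass the named function to filter()
filtered = product_df.groupby('product').filter(min_cost_above_3)

# Option 2: compute the group minimum per row, then use it as a boolean mask
mask = product_df.groupby('product')['cost_per_unit'].transform('min') > 3
masked = product_df[mask]

print(filtered)
print(masked)
```

Both approaches keep only the rows of the `'plate'` group, whose minimum cost is 4.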


---
### <font color=#14F278>2.3 `groupby()` and the `transform()` Method:</font>
Lastly, we can use the `transform()` method, chained after a `groupby()` method to:
- perform an <font color=#14F278>**aggregation**</font> on a given column per category
- easily <font color=#14F278>**join the aggregated values**</font> back to the original dataframe

Syntax:
- `df.groupby('column_name')['column_name1'].transform(agg_function)`

This is extremely useful when you want to <font color=#14F278>**compare individual observations to the group-level statistic**</font>! Let's again explore the case of `product_df`:
- you work in a manufacturing company, producing different types of products
- you are tasked with conducting further research on cost effectiveness for each product group
- in particular, you want to check which product record has a **higher cost of production per unit** compared to the **average cost for the given product type**:

<center>
    <div>
        <img src="..\images\aggregation_004.png"/>
    </div>
</center>



In [20]:
# the transform() method allows us to produce a column with aggregate values for a given column
# importantly, the length of the column is identical to the length of the original dataframe columns
product_df.groupby('product')['cost_per_unit'].transform('mean')


0    5.0
1    5.0
2    5.0
3    5.5
4    5.5
Name: cost_per_unit, dtype: float64

In [21]:
# The above output can easily be added as a new column
product_df['avg_cost'] = product_df.groupby('product')['cost_per_unit'].transform('mean')
display(product_df)


Unnamed: 0,product,quantity,price_per_unit,cost_per_unit,avg_cost
0,vase,130,15.0,5,5.0
1,vase,247,8.0,2,5.0
2,vase,75,25.0,8,5.0
3,plate,300,,4,5.5
4,plate,180,12.0,7,5.5


In [22]:
# From here, we can easily proceed with any further analysis
# For instance, we can create a Yes/No column 'above_avg_price', flagging records whose cost per unit exceeds the group average
product_df['above_avg_price'] = product_df.apply(lambda row: 'No' if row['cost_per_unit'] <= row['avg_cost'] else 'Yes', axis =1)
display(product_df)

Unnamed: 0,product,quantity,price_per_unit,cost_per_unit,avg_cost,above_avg_price
0,vase,130,15.0,5,5.0,No
1,vase,247,8.0,2,5.0,No
2,vase,75,25.0,8,5.0,Yes
3,plate,300,,4,5.5,No
4,plate,180,12.0,7,5.5,Yes
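
The row-wise `apply()` above can also be expressed as a vectorised comparison of two columns. This is a sketch of that alternative on a cut-down copy of `product_df`; the column names `above_avg_cost` and `above_avg_label` are chosen for this example:

```python
import numpy as np
import pandas as pd

# Cut-down copy of product_df, kept self-contained for this illustration
product_df = pd.DataFrame({
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'cost_per_unit': [5, 2, 8, 4, 7],
})

# Group-level average, aligned with the original rows
product_df['avg_cost'] = product_df.groupby('product')['cost_per_unit'].transform('mean')

# Comparing two columns produces a genuine boolean column
product_df['above_avg_cost'] = product_df['cost_per_unit'] > product_df['avg_cost']

# np.where maps the boolean flags to 'Yes' / 'No' labels if strings are preferred
product_df['above_avg_label'] = np.where(product_df['above_avg_cost'], 'Yes', 'No')

print(product_df)
```

A boolean column has the advantage that it can be used directly as a filter, for example `product_df[product_df['above_avg_cost']]`.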


---
## <font color=#14F278>3. Pivot Tables:</font>

By definition, a <font color=#14F278>**Pivot**</font> is a <font color=#14F278>**central point on which a mechanism oscillates**</font> - this is a staple concept in data analysis and something we come across quite often when working with data.

A <font color=#14F278>**Pivot Table**</font> is a <font color=#14F278>**summary tool**</font> that allows us to summarize information from bigger tables by <font color=#14F278>**changing their central point**</font>.

---
### <font color=#14F278>3.1 Pivoting a Table - `pivot_table()`:</font>
Let's consider the following example:
- we are given access to the `product_df` table, which now has an additional column `'country'`
- this is an example of a dataset with **2 categorical columns**
- each observation (row) in the data falls in a given <font color=#14F278>**pair of categories - (country,product type)**</font>
- you want to find the **average production cost per unit for each product, in each country**:

<center>
    <div>
        <img src="..\images\aggregation_005.png"/>
    </div>
</center>


Syntax:
- `df.pivot_table(index = 'column_name1', columns = 'column_name2', values = 'column_name3', aggfunc = ...)`

In [23]:
# Let's construct a product dataframe, containing country, product type and cost of production
product_data = {
    'country':['UK', 'Germany', 'Germany', 'UK', 'UK'],
    'product':['vase','vase', 'vase', 'plate', 'plate'],
    'cost_per_unit':[5,2,8,4,7]
}

product_df = pd.DataFrame(product_data)
display(product_df)

Unnamed: 0,country,product,cost_per_unit
0,UK,vase,5
1,Germany,vase,2
2,Germany,vase,8
3,UK,plate,4
4,UK,plate,7


In [25]:
# pivoting the original table
# the index parameter is assigned to the column, whose values become the row labels (i.e. the index)
# the columns parameter is assigned to the column, whose values become the column labels
# the values parameter is assigned to the numeric column we want to aggregate
# the aggfunc is assigned to the aggregation function name we want to apply

pivot_df = product_df.pivot_table(index = 'country', columns = 'product', values = 'cost_per_unit', aggfunc = 'mean')
display(pivot_df)

product,plate,vase
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Germany,,5.0
UK,5.5,5.0
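
One way to read a pivot table is as a `groupby()` on two categorical columns, followed by moving one of the group levels into the columns. The sketch below rebuilds the same data and compares the two routes; the name `grouped` is chosen for this example:

```python
import pandas as pd

# Same data as the product_df defined above, rebuilt to keep the example self-contained
product_df = pd.DataFrame({
    'country': ['UK', 'Germany', 'Germany', 'UK', 'UK'],
    'product': ['vase', 'vase', 'vase', 'plate', 'plate'],
    'cost_per_unit': [5, 2, 8, 4, 7],
})

# Route 1: pivot_table()
pivot_df = product_df.pivot_table(index='country', columns='product',
                                  values='cost_per_unit', aggfunc='mean')

# Route 2: group by both categories, then unstack 'product' into the columns
grouped = product_df.groupby(['country', 'product'])['cost_per_unit'].mean().unstack()

print(pivot_df)
print(grouped)
```

The (Germany, plate) cell is NaN in both results because no observation falls into that pair of categories.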


---
### <font color=#14F278>3.2 Melting a Table - `melt()`:</font>

Sometimes we want to <font color=#14F278>**reverse**</font> a pivot table back into its original column structure. This is called <font color=#14F278>**Melting**</font>.
- While melting cannot restore the original level of granularity (we cannot reverse engineer the aggregation), it can restore the way categorical data is stored
- Recall that in a pivot table the column names are the possible values of a categorical variable
- To <font color=#14F278>**melt a table**</font> means to take its column names, and store them as the <font color=#14F278>**values in a single column**</font>:


Continuing with our example of a Pivot Table from above, here is how it looks when **melted**:
<center>
    <div>
        <img src="..\images\aggregation_006.png"/>
    </div>
</center>


Syntax:
- `df.melt(id_vars = 'column_name1', var_name = 'column_name2', value_name = 'column_name3')`

In [26]:
# First, we need to reset the index (which turns country into a stand-alone column in the dataframe):
pivot_df.reset_index(inplace = True)
display(pivot_df)

product,country,plate,vase
0,Germany,,5.0
1,UK,5.5,5.0


In [27]:
# Next, apply .melt() on the pivot_df
# the id_vars parameter is assigned to the categorical column(s) that remain stored as columns in the melted table
# the var_name parameter is assigned to the variable name you impose, where we will store the column names of the pivot table
# the value_name parameter is assigned to the variable name you impose, where we will store the values in the pivot table
pivot_df.melt(id_vars = ['country'], var_name = 'product', value_name = 'avg_cost')

Unnamed: 0,country,product,avg_cost
0,Germany,plate,
1,UK,plate,5.5
2,Germany,vase,5.0
3,UK,vase,5.0
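
The melted table keeps a row for the (Germany, plate) pair even though it holds no value. If only observed combinations are wanted, one possible follow-up is to drop those rows. In this sketch the pivot table is typed in by hand (after `reset_index()`) so that the example runs on its own:

```python
import numpy as np
import pandas as pd

# Hand-typed copy of pivot_df after reset_index(), for illustration only
pivot_df = pd.DataFrame({
    'country': ['Germany', 'UK'],
    'plate': [np.nan, 5.5],
    'vase': [5.0, 5.0],
})

# Melt: the column names 'plate' and 'vase' become values of 'product'
melted = pivot_df.melt(id_vars=['country'], var_name='product', value_name='avg_cost')

# Drop the combinations that had no observations in the pivot table
tidy = melted.dropna(subset=['avg_cost']).reset_index(drop=True)

print(melted)
print(tidy)
```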


---
## <font color=#14F278>4. Summary:</font>
- An __Aggregation Function__ is a function that takes a collection of values and returns a single value
- To obtain a summary of statistics on a DataFrame, use the `.describe()` method 
- To perform aggregations on a DataFrame across the values of a categorical column, use the `.groupby()` method
- Chain the `.agg()`, `.filter()` and `.transform()` methods after `.groupby()` to apply several aggregations at once, filter out groups, and join group-level statistics back to the initial DataFrame
- To build a __Pivot Table__, use the `.pivot_table()` method
- To __Melt__ a table, use the `.melt()` method

---
## <font color=#FF8181> 5. Concept Check: </font>

1. Suppose we have the following Series: `s = pd.Series([1,2,3, np.nan, np.nan, 6,7])`. Calculate:
- the output produced by `s.mean()`
- the output produced by `s.median()`
- What can we conclude about missing values and aggregations? Do we include NaNs in the population of values, used for the calculations?
2. Suppose we have the following DataFrame: `df = pd.DataFrame({'Uni':['Bath', 'Warwick', 'Bristol', 'Bristol', 'Warwick', 'Bath'], 'Subject':['Maths', 'Physics', 'English', 'Maths', 'Maths', 'English'], 'Score':[78, 68, 65, 75, 82, 62]})`
- create a new DataFrame with the number of universities and the average score per subject
- add a new column to the initial DataFrame with the average university score across all subjects, using `.transform()`
- pivot the initial dataframe so that the values in 'Subject' column become column names

In [48]:
s = pd.Series([1,2,3, np.nan, np.nan, 6,7])

print(s.mean())
print(s.median())

3.8
3.0


In [58]:
df = pd.DataFrame({'Uni':['Bath', 'Warwick', 'Bristol', 'Bristol', 'Warwick', 'Bath'], 'Subject':['Maths', 'Physics', 'English', 'Maths', 'Maths', 'English'], 'Score':[78, 68, 65, 75, 82, 62]})
display(df)
df_new = df.groupby('Subject').agg({'Uni':pd.Series.nunique, 'Score':'mean'})
display(df_new)
df['avg_uni_score'] = df.groupby('Uni')['Score'].transform('mean')
display(df)
pivot_df = df.pivot_table(index = 'Uni', columns = 'Subject', values = 'Score', aggfunc = 'max')
display(pivot_df)



Unnamed: 0,Uni,Subject,Score
0,Bath,Maths,78
1,Warwick,Physics,68
2,Bristol,English,65
3,Bristol,Maths,75
4,Warwick,Maths,82
5,Bath,English,62


Unnamed: 0_level_0,Uni,Score
Subject,Unnamed: 1_level_1,Unnamed: 2_level_1
English,2,63.5
Maths,3,78.333333
Physics,1,68.0


Unnamed: 0,Uni,Subject,Score,avg_uni_score
0,Bath,Maths,78,70.0
1,Warwick,Physics,68,75.0
2,Bristol,English,65,70.0
3,Bristol,Maths,75,70.0
4,Warwick,Maths,82,75.0
5,Bath,English,62,70.0


Subject,English,Maths,Physics
Uni,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bath,62.0,78.0,
Bristol,65.0,75.0,
Warwick,,82.0,68.0
