# Pandas - Grouping and Aggregation

An essential part of analyzing large data is efficient summarization: computing aggregations in which a single number gives insight into the nature of a potentially large dataset. I will explore aggregations in Pandas, from simple operations to more sophisticated ones based on the concept of a `groupby`.

The following table summarizes some of the built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``first()``, ``last()``  | First and last item             |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |
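
As a quick illustration of the table above, the sketch below applies a few of these aggregations to a small DataFrame. The values are the first five height and weight records of the sample data used later in this notebook; the variable names are only for illustration.

```python
import pandas as pd

# First five Height/Weight records from the sample data used later on
df = pd.DataFrame({
    'Height': [155, 145, 152, 167, 161],
    'Weight': [140, 122, 131, 148, 139],
})

# Aggregating a DataFrame returns one value per column
col_means = df.mean()

# Aggregating a single column (a Series) returns a scalar
height_range = df['Height'].max() - df['Height'].min()

# agg() accepts a list of names and computes several aggregations at once
summary = df.agg(['min', 'max', 'sum'])

print(col_means, height_range, summary, sep='\n\n')
```

Each call collapses a whole column into a single number, which is the behavior that `groupby` later repeats once per group.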

## 1. GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called `groupby` operation. Although the name "group by" comes from a command in the SQL database language, it is more helpful to think of the `groupby` operation as three steps: split, apply, combine.

![title](../Data/Notebook_Images/Groupby.png)


This makes clear what the groupby accomplishes:

* The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
* The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
* The combine step merges the results of these operations into a single output object (a Series or DataFrame).
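
The three steps can be spelled out by hand to see what `groupby` does in one call. This is only a sketch for intuition, using four rows from the sample data; pandas does not literally build a dictionary of sub-DataFrames internally.

```python
import pandas as pd

df = pd.DataFrame({
    'Hometown': ['Fremont', 'Los Angeles', 'Los Angeles', 'Fremont'],
    'Height': [145, 183, 172, 165],
})

# Split: one sub-DataFrame per distinct key value
pieces = {key: df[df['Hometown'] == key] for key in df['Hometown'].unique()}

# Apply: reduce each piece to a single number
means = {key: piece['Height'].mean() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series, sorted by key
manual = pd.Series(means, name='Height').sort_index()

# groupby performs the same three steps in a single expression
grouped = df.groupby('Hometown')['Height'].mean()

print(manual)
print(grouped)
```

Both results hold the same per-hometown means; the `groupby` version also labels the index with the key name.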

### 1.1. GroupBy object

The GroupBy object is a very flexible abstraction. The most basic split-apply-combine operation can be computed with the `groupby()` method of DataFrames, passing the name of the desired key column.

In [21]:
import pandas as pd
import numpy as np

df = pd.read_csv('../Data/Height_Weight.csv')
df.head()

,Name,Height,Weight,Hometown
0,Ashley,155,140,Palo Alto
1,Robin,145,122,Fremont
2,Priyanka,152,131,Santa Clara
3,Youngchul,167,148,Cupertino
4,Aziz,161,139,San Francisco


In [22]:
geo_gp = df.groupby('Hometown')
print('groupby object:', geo_gp, '\n')

groupby object: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026D109D5630> 



In [23]:
# iterate over the groups; each item is a (key, DataFrame) pair
for key, group in geo_gp:
    print('keys:', key, '\n', group, '\n')

keys: Cupertino 
         Name  Height  Weight   Hometown
3  Youngchul     167     148  Cupertino 

keys: Fremont 
      Name  Height  Weight Hometown
1   Robin     145     122  Fremont
10   Emma     165     120  Fremont 

keys: Hayward 
    Name  Height  Weight Hometown
5  Zoey     181     190  Hayward 

keys: Los Angeles 
       Name  Height  Weight     Hometown
6      Jay     183     180  Los Angeles
7  Frances     172     110  Los Angeles 

keys: Palo Alto 
       Name  Height  Weight   Hometown
0   Ashley     155     140  Palo Alto
11   Terry     185     220  Palo Alto 

keys: San Francisco 
    Name  Height  Weight       Hometown
4  Aziz     161     139  San Francisco
9   Xia     162     110  San Francisco 

keys: Santa Clara 
        Name  Height  Weight     Hometown
2  Priyanka     152     131  Santa Clara
8      Abby     158     120  Santa Clara 



In [24]:
# select a specific group from the groupby object
geo_gp.get_group('Los Angeles')

,Name,Height,Weight,Hometown
6,Jay,183,180,Los Angeles
7,Frances,172,110,Los Angeles


In [25]:
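# add a prefix to the column names of the aggregated result
# (the numeric columns are selected explicitly because recent pandas
# versions no longer drop non-numeric columns such as Name automatically)
geo_gp[['Height', 'Weight']].sum().add_prefix('sum_')

Hometown,sum_Height,sum_Weight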
Cupertino,167,148
Fremont,310,242
Hayward,181,190
Los Angeles,355,290
Palo Alto,340,360
San Francisco,323,249
Santa Clara,310,251
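
A GroupBy object is lazy: no computation happens until an aggregation is requested. The sketch below, built on six rows from the sample data, shows a few ways to inspect it and to select a single column before aggregating. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ashley', 'Robin', 'Jay', 'Frances', 'Emma', 'Terry'],
    'Height': [155, 145, 183, 172, 165, 185],
    'Hometown': ['Palo Alto', 'Fremont', 'Los Angeles',
                 'Los Angeles', 'Fremont', 'Palo Alto'],
})

gp = df.groupby('Hometown')      # nothing is computed yet

n_groups = gp.ngroups            # number of distinct keys
sizes = gp.size()                # number of rows in each group

# Indexing the GroupBy object by column name restricts the
# aggregation to that column and returns a Series
tallest = gp['Height'].max()

print(n_groups, sizes, tallest, sep='\n\n')
```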


## 2. Aggregate, filter, transform, apply

Beyond the simple aggregations shown so far, GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.

### 2.1. Aggregation

We're now familiar with GroupBy aggregations with `sum()`, `median()`, and the like, but the `aggregate()` method (also available under the shorter alias `agg()`) allows for even more flexibility. It can take a string, a function, a list thereof, or a dictionary mapping column names to operations, and compute all the aggregates at once.

In [27]:
# select the numeric columns first: the median of the string column Name is undefined
df.groupby('Hometown')[['Height', 'Weight']].agg(['min', 'median', 'max'])

,Height,Height,Height,Weight,Weight,Weight
Hometown,min,median,max,min,median,max
Cupertino,167,167.0,167,148,148.0,148
Fremont,145,155.0,165,120,121.0,122
Hayward,181,181.0,181,190,190.0,190
Los Angeles,172,177.5,183,110,145.0,180
Palo Alto,155,170.0,185,140,180.0,220
San Francisco,161,161.5,162,110,124.5,139
Santa Clara,152,155.0,158,120,125.5,131


In [34]:
# a dictionary maps each column to the aggregation applied to it
df.groupby('Hometown').agg({'Height': 'mean', 'Weight': 'min'})

Hometown,Height,Weight
Cupertino,167.0,148
Fremont,155.0,120
Hayward,181.0,190
Los Angeles,177.5,110
Palo Alto,170.0,140
San Francisco,161.5,110
Santa Clara,155.0,120
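
When the output columns need descriptive names, `agg()` also accepts keyword arguments of the form `new_name=(column, function)`, a feature pandas calls named aggregation. A minimal sketch on four rows of the sample data follows; the output column names are made up for this example.

```python
import pandas as pd

# Four rows copied from the sample data shown above
df = pd.DataFrame({
    'Name': ['Robin', 'Jay', 'Frances', 'Emma'],
    'Height': [145, 183, 172, 165],
    'Weight': [122, 180, 110, 120],
    'Hometown': ['Fremont', 'Los Angeles', 'Los Angeles', 'Fremont'],
})

# Each keyword becomes an output column: name=(input column, aggregation)
summary = df.groupby('Hometown').agg(
    mean_height=('Height', 'mean'),
    min_weight=('Weight', 'min'),
    members=('Name', 'count'),
)

print(summary)
```

This avoids the two-level column header that a list of functions produces, at the cost of naming every output explicitly.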


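### 2.2. Filtering

A filtering operation lets you drop data based on group properties, much like the `HAVING` clause in SQL. The function passed to `filter()` must return a single Boolean for each group; groups for which it returns `False` are dropped entirely.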

In [35]:
def filter_func(x):
    return x['Height'].mean() >= 170

df.groupby('Hometown').filter(filter_func)

,Name,Height,Weight,Hometown
0,Ashley,155,140,Palo Alto
5,Zoey,181,190,Hayward
6,Jay,183,180,Los Angeles
7,Frances,172,110,Los Angeles
11,Terry,185,220,Palo Alto
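
The same selection can be written as a Boolean mask built with `transform()`, which broadcasts each group's mean back to its rows. The sketch below uses seven rows from the sample data and compares the two approaches; it is one possible alternative, not a replacement for `filter()`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ashley', 'Robin', 'Zoey', 'Jay', 'Frances', 'Emma', 'Terry'],
    'Height': [155, 145, 181, 183, 172, 165, 185],
    'Hometown': ['Palo Alto', 'Fremont', 'Hayward', 'Los Angeles',
                 'Los Angeles', 'Fremont', 'Palo Alto'],
})

# filter(): the function sees one group at a time and returns one Boolean
via_filter = df.groupby('Hometown').filter(lambda g: g['Height'].mean() >= 170)

# Mask: every row carries its group's mean, so rows can be compared directly
group_mean = df.groupby('Hometown')['Height'].transform('mean')
via_mask = df[group_mean >= 170]

print(via_filter)
print(via_mask)
```

Both versions keep rows in their original order and drop the two Fremont rows, whose group mean is 155.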


### 2.3. Transformation

A transformation returns a transformed version of the full data, with the same index as the input, rather than one row per group. For example, the code below centers the data: it computes the mean height and weight for each hometown and subtracts those means from each member's record.

In [38]:
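# select the numeric columns first: subtracting a mean is undefined for the string column Name
df.groupby('Hometown')[['Height', 'Weight']].transform(lambda x: x - x.mean())

,Height,Weight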
0,-15.0,-40.0
1,-10.0,1.0
2,-3.0,5.5
3,0.0,0.0
4,-0.5,14.5
5,0.0,0.0
6,5.5,35.0
7,-5.5,-35.0
8,3.0,-5.5
9,0.5,-14.5
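
The defining property of `transform()` is that its result lines up row for row with the input, so it can be assigned straight back as a new column. A small sketch on four rows of the sample data; the column name `Height_vs_town` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ashley', 'Robin', 'Emma', 'Terry'],
    'Height': [155, 145, 165, 185],
    'Hometown': ['Palo Alto', 'Fremont', 'Fremont', 'Palo Alto'],
})

# Passing the name of an aggregation broadcasts the group result to every row
group_mean = df.groupby('Hometown')['Height'].transform('mean')

# Same length and index as df, so it can be stored alongside the original data
df['Height_vs_town'] = df['Height'] - group_mean

print(df)
```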


### 2.4. Apply

The `apply` method lets you apply an arbitrary function to each group. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.

In [45]:
def BMI(x):
    # x is a DataFrame holding the rows of one group
    # ratio of mean weight to squared mean height; the data does not state
    # its units, so treat the result as a BMI-style index only
    return x['Weight'].mean() / x['Height'].mean() ** 2

df.groupby('Hometown')[['Height', 'Weight']].apply(BMI)

Hometown
Cupertino        0.005307
Fremont          0.005036
Hayward          0.005800
Los Angeles      0.004602
Palo Alto        0.006228
San Francisco    0.004773
Santa Clara      0.005224
dtype: float64
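
To see how the combine step adapts to the return type, the sketch below applies two functions to five rows of the sample data: one returns a scalar per group, the other a Series per group. The function names and output labels are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Robin', 'Zoey', 'Jay', 'Frances', 'Emma'],
    'Height': [145, 181, 183, 172, 165],
    'Hometown': ['Fremont', 'Hayward', 'Los Angeles', 'Los Angeles', 'Fremont'],
})
groups = df.groupby('Hometown')[['Name', 'Height']]

def height_spread(group):
    # Scalar per group: the results are combined into a Series
    return group['Height'].max() - group['Height'].min()

def height_summary(group):
    # Series per group: the results are combined into a DataFrame,
    # with the Series labels as column names
    return pd.Series({
        'tallest': group.loc[group['Height'].idxmax(), 'Name'],
        'spread': group['Height'].max() - group['Height'].min(),
    })

spread = groups.apply(height_spread)
summary = groups.apply(height_summary)

print(spread)
print(summary)
```

Selecting the columns before calling `apply` keeps the grouping column out of the data passed to the function, which sidesteps behavior that differs between pandas versions.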
