# Pandas - Grouping and Aggregation

## 1. Aggregation and Grouping

An essential part of analyzing large data is efficient summarization: computing aggregations in which a single number gives insight into the nature of a potentially large dataset. I will explore aggregations in Pandas, from simple operations to more sophisticated operations based on the concept of a groupby.

The following table summarizes some of the built-in Pandas aggregations:

| Aggregation              | Description                                     |
|--------------------------|-------------------------------------------------|
| ``count()``              | Number of non-null items                        |
| ``first()``, ``last()``  | First and last item                             |
| ``mean()``, ``median()`` | Mean and median                                 |
| ``min()``, ``max()``     | Minimum and maximum                             |
| ``std()``, ``var()``     | Standard deviation and variance                 |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items                            |
| ``sum()``                | Sum of all items                                |
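
As a rough illustration of the table above, the sketch below applies a few of these aggregations to a small DataFrame. The `toy` frame and its values are made up for this example and are not the dataset used later in the notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'height': [150, 160, 170, 180],
    'weight': [50, 60, 70, 80],
})

# Each aggregation collapses a column to a single value
print(toy['height'].count())   # number of non-null items
print(toy['height'].mean())    # arithmetic mean
print(toy['height'].max())     # largest value

# Calling an aggregation on the DataFrame aggregates every column
col_sums = toy.sum()
print(col_sums)

# describe() computes several common aggregates in one call
summary = toy.describe()
print(summary)
```

On a Series an aggregation returns a scalar, while on a DataFrame it returns one value per column.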

### 1.1. GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often you will want to aggregate conditionally on some label or index; this is implemented in the `groupby` operation. Although the name "group by" comes from a command in the SQL database language, the operation is easier to think of as three steps: split, apply, combine.

![title](../Data/Notebook_Images/Groupby.png)


The figure makes clear what the groupby accomplishes:

* The split step involves breaking up and grouping a DataFrame depending on the value of the specified key.
* The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
* The combine step merges the results of these operations into an output array.
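
The three steps listed above can be sketched by hand and compared with a single `groupby` call. This is only an illustration on a made-up `toy` DataFrame (the column names `key` and `data` are assumptions for the example), not a description of how pandas implements the operation internally.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 2, 3, 4, 5, 6],
})

# Split: one sub-DataFrame per distinct key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece['data'].sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').sort_index()
manual.index.name = 'key'

# groupby performs the same three steps in one call
grouped = toy.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

Both Series hold the same per-key sums. The `groupby` version avoids building the intermediate pieces explicitly.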

#### GroupBy object

The GroupBy object is a very flexible abstraction. The most basic split-apply-combine operation can be computed with the `groupby()` method of a DataFrame, passing the name of the desired key column:

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv('../Data/Height_Weight.csv')
df.head()

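| Unnamed: 0 | Name      | Height | Weight | Hometown      |
|------------|-----------|--------|--------|---------------|
| 0          | Ashley    | 155    | 140    | Palo Alto     |
| 1          | Robin     | 145    | 122    | Fremont       |
| 2          | Priyanka  | 152    | 131    | Santa Clara   |
| 3          | Youngchul | 167    | 148    | Cupertino     |
| 4          | Aziz      | 161    | 139    | San Francisco |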


In [2]:
# split the DataFrame into groups, one per distinct Hometown value
size_lvl = df.groupby('Hometown')

print('groupby object:', size_lvl, '\n')

# iterating over a GroupBy yields (key, sub-DataFrame) pairs
for key, group in size_lvl:
    print('keys:', key, '\n', 'content:\n', group, '\n')

In [None]:
# select a specific group in the groupby object
size_lvl.get_group('Palo Alto')

#### Aggregate, filter, transform, apply

The preceding discussion focused on aggregation for the combine operation, but there are more options available. In particular, GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.
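
A minimal sketch of the four methods follows, again on a made-up `toy` DataFrame. The column names, the threshold of 6, and the lambda functions are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'data': [1, 2, 3, 4, 5, 6],
})
groups = toy.groupby('key')['data']

# aggregate(): several reductions at once, one row per group
agg = groups.aggregate(['min', 'max', 'mean'])

# filter(): keep only rows of groups whose sum exceeds 6 (drops group A)
kept = toy.groupby('key').filter(lambda g: g['data'].sum() > 6)

# transform(): output has the same shape as the input;
# here each value is centered on its group mean
centered = groups.transform(lambda x: x - x.mean())

# apply(): run an arbitrary function on each group; here, the range
spread = groups.apply(lambda x: x.max() - x.min())

print(agg, kept, centered, spread, sep='\n\n')
```

The main difference is the shape of the result: `aggregate()` and this use of `apply()` return one entry per group, `transform()` returns one entry per original row, and `filter()` returns a subset of the original rows.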