# Chapter 8
# Data Aggregation and Group Operations

Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

One reason for the popularity of relational databases and SQL (which stands for “structured query language”) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. As you will see, with the expressiveness of Python and pandas, we can perform quite complex group operations by utilizing any function that accepts a pandas object or NumPy array.

## 8.1 GroupBy Mechanics

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object.

For example, a DataFrame can be grouped on its <b>rows *(axis=0)*</b> or its <b>columns *(axis=1)*</b>. Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 8.1 for a mockup of a simple group aggregation.

<img src="Figure 8.1.png" width="600px">
<br>
<center>Figure 8.1: Illustration of a group aggregation</center>

Each grouping key can take many forms, and the keys do not have to be all of the same type:<br>
• A list or array of values that is the same length as the axis being grouped <br>
• A value indicating a column name in a DataFrame <br>
• A dict or Series giving a correspondence between the values on the axis being grouped and the group names <br>
• A function to be invoked on the axis index or the individual labels in the index <br>

Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object.
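
The four kinds of grouping key listed above can be compared side by side. The following is a minimal sketch on a hypothetical toy DataFrame (the names `toy`, `team`, `score`, and `label_to_team` are illustrative and do not appear elsewhere in this chapter); each key form should yield the same groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; labels and column names are illustrative only.
toy = pd.DataFrame(
    {"team": ["x", "y", "x", "y"], "score": [1, 2, 3, 4]},
    index=["r0", "r1", "r2", "r3"],
)

# 1. An array of values with the same length as the axis being grouped.
by_array = toy["score"].groupby(np.array(["x", "y", "x", "y"])).sum()

# 2. A column name in the DataFrame.
by_column = toy.groupby("team")["score"].sum()

# 3. A dict mapping index labels to group names.
label_to_team = {"r0": "x", "r1": "y", "r2": "x", "r3": "y"}
by_dict = toy["score"].groupby(label_to_team).sum()

# 4. A function that is called once per index label.
by_func = toy["score"].groupby(lambda label: label_to_team[label]).sum()

for result in (by_array, by_column, by_dict, by_func):
    print(result.to_dict())
```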

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'], 'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5), 'data2' : np.random.randn(5)})
df

Suppose you wanted to compute the mean of the data1 column using the labels from *key1*. There are a number of ways to do this. One is to access data1 and call `groupby` with the column (a Series) at key1:

In [None]:
df['key1']

In [None]:
grouped = df['data1'].groupby(df['key1'])
list(grouped)

This grouped variable is now a *GroupBy* object. It has not actually computed anything yet except for some intermediate data about the group key `df['key1']`. The idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy’s `mean` method:

<br>
<img src="Example1.png"/>

In [None]:
ans = grouped.mean()
# Equivalent one-liner: df['data1'].groupby(df['key1']).mean()
ans

The important thing here is that the data (a Series) has been aggregated according to the group key, producing a new Series that is now indexed by the unique values in the key1 column.

The result index has the name `'key1'` because the DataFrame column `df['key1']` did.

If instead we had passed multiple arrays as a list, we’d get something different:
<br>
<img src="Example2.png"/>

In [None]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:

In [None]:
means_2d = means.unstack()
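
Because the example `df` holds random numbers, the shape of the two-key result can be easier to follow with fixed values. This sketch assumes a hypothetical `toy` frame with a `value` column; it shows the two-level index and what `unstack` does to it.

```python
import pandas as pd

# Hypothetical data with fixed values so the group means are easy to check.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Grouping by two keys gives a Series with a two-level (hierarchical) index.
means = toy["value"].groupby([toy["key1"], toy["key2"]]).mean()
print(means)

# unstack moves the inner level (key2) into the columns.
wide = means.unstack()
print(wide)
```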


In this example, the group keys are all Series, though they could be any arrays of the right length:

In [None]:
states = np.array(['Ohio', 'Ohio', 'California', 'Ohio', 'Ohio'])
# years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states]).mean()

Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:


In [None]:
df.groupby('key1').mean()

In [None]:
df.groupby(['key1', 'key2']).mean()

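You may have noticed in the first case `df.groupby('key1').mean()` that there is no `key2` column in the result. Because `df['key2']` is not numeric data, it is said to be a *nuisance column*, which is therefore excluded from the result. <b>By default, all of the numeric columns are aggregated</b>, though it is possible to filter down to a subset. Note that this default depends on the pandas version: in pandas 2.0 and later, nonnumeric columns are no longer dropped silently, so `mean` raises an error unless you pass `numeric_only=True` or select the numeric columns first.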

Regardless of the objective in using groupby, a generally useful GroupBy method is `size`, which returns a Series containing group sizes:

<br>
<img src='Example3.png'/>

In [None]:
df.groupby(['key1', 'key2']).size()

Take note that any missing values in a group key will be **excluded** from the result.

In [None]:
df1 = pd.DataFrame({'k1' : ['a', 'a', 'b', 'b', 'a'], 'k2' : ['one', 'two', 'one', 'two', 'one'],
                   'd1' : np.random.randn(5), 'd2' : np.random.randn(5)})
df1.loc[0, 'k1'] = np.nan  # Make one group key missing; row 0 drops out of the result
df1.groupby(['k1','k2']).mean()

In [None]:
df1
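
The exclusion of missing keys can be checked by comparing group sizes with the number of rows. The sketch below uses a hypothetical `toy` frame; the `dropna=False` argument is, to our knowledge, available in pandas 1.1 and later and keeps the missing key as its own group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing group key.
toy = pd.DataFrame({
    "key": ["a", np.nan, "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Default behavior: the row whose key is NaN belongs to no group.
sizes = toy.groupby("key").size()
print(sizes.sum(), "of", len(toy), "rows were grouped")

# Assumption: pandas >= 1.1, where dropna=False keeps NaN as a group key.
sizes_with_na = toy.groupby("key", dropna=False).size()
print(sizes_with_na)
```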

### 8.1.1 Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:

In [None]:
grouped_1k = df.groupby('key1')
grouped_2k = df.groupby(['key1', 'key2'])
grouped_2k

In [None]:
for name, group in grouped_1k:
#     print(name)
    print(group)

In the case of multiple keys, the first element in the tuple will be a tuple of key values:

In [None]:
for (k1, k2), group in grouped_2k:
    print(k1, k2)
    print(group)

You can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dict of the data pieces as a one-liner:

In [None]:
pieces = dict(list(grouped_1k))
pieces['a']
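
The same iteration pattern works in comprehensions. This sketch, on a hypothetical `toy` frame, collects one summary value per group for a single key and for a pair of keys.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "value": [1, 2, 3, 4, 5],
})

# Single key: each item is (group name, sub-DataFrame).
row_counts = {name: len(group) for name, group in toy.groupby("key1")}

# Two keys: the group name is a tuple of key values.
pair_totals = {
    (k1, k2): int(group["value"].sum())
    for (k1, k2), group in toy.groupby(["key1", "key2"])
}

print(row_counts)
print(pair_totals)
```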

By default `groupby` groups on *axis=0*, but you can group on any of the other axes. For example, we could group the columns of our example `df` here by `dtype` like so:

In [None]:
df.dtypes

In [None]:
grouped_by_type = df.groupby(df.dtypes, axis=1)
list(grouped_by_type)

We can print out the groups like so:

In [None]:
for dtype, group in grouped_by_type:
    print(dtype)
    print(group)

### 8.1.2 Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:

    df.groupby('key1')['data1']
    df.groupby('key1')[['data2']]

are syntactic sugar for:

    df['data1'].groupby(df['key1'])
    df[['data2']].groupby(df['key1'])

Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the data2 column and get the result as a DataFrame, we could write:

In [None]:
df.groupby(['key1', 'key2'])[['data2']].mean()

The object returned by this indexing operation is a <b>grouped DataFrame if a list or array is passed</b> or a <b>grouped Series if only a single column name is passed</b> as a scalar:

In [None]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

In [None]:
s_grouped.mean()
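
To make the Series-versus-DataFrame distinction concrete, the sketch below (hypothetical `toy` data) selects the same column with a scalar name and with a one-element list, and compares both with the longhand form.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "data1": [1.0, 3.0, 5.0],
    "data2": [10.0, 20.0, 30.0],
})

as_series = toy.groupby("key1")["data2"].mean()    # scalar name -> Series
as_frame = toy.groupby("key1")[["data2"]].mean()   # list of names -> DataFrame
longhand = toy["data2"].groupby(toy["key1"]).mean()

print(type(as_series).__name__, type(as_frame).__name__)
print(as_series.to_dict() == longhand.to_dict())
```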

### 8.1.3 Grouping with Dicts and Series

Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:

In [None]:
people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan # Add a few NA values
people

Now, suppose we have a group correspondence for the columns and want to sum together the columns by group:

In [None]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f' : 'orange'}

Now, we could construct an array from this dict to pass to `groupby`, but instead we can just pass the `dict` (we included the key 'f' to highlight that unused grouping keys are OK):

In [None]:
by_column = people.groupby(mapping, axis=1)
# by_column.sum()
dict(list(by_column))
by_column.sum()

The same functionality holds for Series, which can be viewed as a fixed-size mapping:

In [None]:
map_series = pd.Series(mapping)
map_series

In [None]:
people.groupby(map_series, axis=1).sum()
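
Recent pandas releases (2.1 and later, to our knowledge) deprecate `axis=1` in `groupby`. One way to get the same column grouping without it is to transpose, group the rows, and transpose back. This is a sketch of that workaround on a hypothetical `people_demo` frame, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical small frame standing in for `people`.
people_demo = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["Joe", "Wes"],
    columns=["a", "b", "c"],
)
# The unused key 'f' is harmless, as noted in the text.
mapping = {"a": "red", "b": "red", "c": "blue", "f": "orange"}

# After transposing, the column labels are row labels, so the dict
# can group them along the default axis.
by_color = people_demo.T.groupby(mapping).sum().T
print(by_color)
```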

### 8.1.4 Grouping with Functions

Using Python functions is a more generic way of defining a group mapping compared with a `dict` or `Series`. Any function passed as a group key will be <b>called once per index value, with the return values being used as the group names.</b> More concretely, consider the example `DataFrame` from the previous section, which has people’s first names as index values. 

Suppose we wanted to group by the length of the names; while we could compute an array of string lengths, it’s simpler to just pass the `len` function:

In [None]:
people

In [None]:
people.groupby(len).mean()

In [None]:
dict(list(people.groupby(len)))

Mixing functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally:
<br>
<img src="Example4.png"/>

In [None]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).mean()
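
With fixed values it is easier to see which rows land in which group. The sketch below uses a hypothetical `ages` Series indexed by the same first names; `len` is called on each index label.

```python
import pandas as pd

# Hypothetical values; only the index labels match the `people` example.
ages = pd.Series(
    [30, 41, 25, 38, 52],
    index=["Joe", "Steve", "Wes", "Jim", "Travis"],
)

# Group by the length of each name.
by_len = ages.groupby(len).sum()
print(by_len)

# Mix a function with a list of labels; both become group keys.
key_list = ["one", "one", "one", "two", "two"]
by_len_and_key = ages.groupby([len, key_list]).min()
print(by_len_and_key)
```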

### 8.1.5 Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let’s look at an example:

In [None]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

To group by level, pass the level number or name using the `level` keyword:

In [None]:
hier_df.groupby(level='cty', axis=1).count()
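
The `level` keyword works the same way when the hierarchical index is on the rows. This sketch puts the same `cty` and `tenor` levels on the index of a hypothetical `rates` Series, which avoids `axis=1` entirely.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["cty", "tenor"],
)
# Hypothetical values; only the index structure mirrors hier_df.
rates = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=index)

counts = rates.groupby(level="cty").count()
max_by_tenor = rates.groupby(level="tenor").max()
print(counts)
print(max_by_tenor)
```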

## 8.2 Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including `mean`, `count`, `min`, and `sum`. We may wonder what is going on when we invoke `mean()` on a GroupBy object. Many common aggregations, such as those found in Table 8.1, have optimized implementations. However, we are not limited to only this set of methods.

<br><center>Table 8.1: Optimized groupby methods</center>
<img src="Table 8.1.png" width="500px">

We can use aggregations of our own devising and additionally call any method that is also defined on the grouped object. For example, we might recall that `quantile` computes sample quantiles of a Series or a DataFrame’s columns.

While `quantile` is not explicitly implemented for GroupBy, it is a Series method and thus available for use. Internally, GroupBy efficiently slices up the Series, calls `piece.quantile(0.9)` for each piece, and then assembles those results together into the result object:

In [None]:
df

In [None]:
grouped = df.groupby('key1')
# The 0.9 quantile of data1 is computed separately for each group
grouped['data1'].quantile(0.9)

To use your own aggregation functions, pass any function that aggregates an array to the `aggregate` or `agg` method:

In [None]:
def peak_to_peak(arr):
    return arr.max() - arr.min()

grouped.agg(peak_to_peak)
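
A custom aggregation is easy to verify on fixed values. This sketch applies the same `peak_to_peak` function to a hypothetical `toy` frame and checks it against the built-in `max` and `min` aggregations.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": [1.0, 4.0, 2.0, 10.0, 6.0],
})

def peak_to_peak(arr):
    # Receives one group's values and must return a single scalar.
    return arr.max() - arr.min()

grouped_value = toy.groupby("key")["value"]
spread = grouped_value.agg(peak_to_peak)
check = grouped_value.max() - grouped_value.min()

print(spread)
print(spread.equals(check))
```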

We may notice that some methods like describe also work, even though they are not aggregations, strictly speaking:

In [None]:
grouped['data1'].describe()

### 8.2.1 Column-Wise and Multiple Function Application

Let’s go to the tipping dataset. After loading it with `read_csv`, we add a tipping percentage column *tip_pct*:

In [None]:
tips = pd.read_csv('tips.csv')
tips.head()

In [None]:
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips[:10]

As we’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using `aggregate` with the desired function or calling a method like `mean` or `std`. However, we may want to aggregate using a different function depending on the column, or multiple functions at once. Fortunately, this is possible to do, which we’ll illustrate through a number of examples. First, we’ll group the *tips* by day and *smoker*:

In [None]:
grouped = tips.groupby(['day', 'smoker'])
dict(list(grouped))

In [None]:
grouped_pct = grouped['tip_pct']
dict(list(grouped_pct))

In [None]:
grouped_pct.agg(['sum','max','mean'])

If you pass a list of functions or function names, as in the preceding cell, you get back a DataFrame with column names taken from the functions:

In [None]:
grouped_pct.agg(['mean', 'std', peak_to_peak])

Here we passed a list of aggregation functions to `agg` to evaluate independently on the data groups.

We don’t need to accept the names that GroupBy gives to the columns; notably, `lambda` functions have the name `'<lambda>'`, which makes them hard to identify. Thus, if we pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names (we can think of a list of 2-tuples as an ordered mapping):

In [None]:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
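
The effect of the (name, function) tuples can be seen on a small example. The sketch below uses a hypothetical `toy` frame and also shows the keyword form of `agg` (often called named aggregation), which we believe is available in pandas 0.25 and later and produces the same columns.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})
grouped_value = toy.groupby("key")["value"]

# List of (name, function) tuples: the names become the column labels.
summary = grouped_value.agg([("average", "mean"), ("largest", "max")])
print(summary)

# Assumption: keyword-style named aggregation is supported by the
# installed pandas version.
named = grouped_value.agg(average="mean", largest="max")
print(named)
```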

With a DataFrame we have more options, as we can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the *tip_pct* and *total_bill* columns:

In [None]:
functions = ['count', 'mean', 'max']


In [None]:
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

As we can see, the resulting DataFrame has hierarchical columns, the same as we would get aggregating each column separately and using `concat` to glue the results together using the column names as the `keys` argument:

In [None]:
result['total_bill']

As before, a list of tuples with custom names can be passed:

In [None]:
ftuples = [('Average', 'mean'), ('Variance', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)

Now, suppose we wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to `agg` that contains a mapping of column names to any of the function specifications listed so far:

In [None]:
dict(list(grouped))

In [None]:
grouped.agg({'tip' : np.max, 'size' : 'sum'})

In [None]:
ans = grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'], 'size' : 'sum'})
ans

A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
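
The rule about hierarchical columns can be checked directly. In this sketch on a hypothetical `toy` frame, a dict with one function per column gives flat columns, while a dict in which one column gets a list of functions gives two column levels.

```python
import pandas as pd

# Hypothetical data loosely modeled on the tips columns.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "tip": [1.0, 2.0, 3.0, 5.0],
    "size": [2, 3, 4, 1],
})
grouped_toy = toy.groupby("key")

# One function per column: ordinary (flat) columns.
flat = grouped_toy.agg({"tip": "max", "size": "sum"})

# A list of functions for 'tip': the columns become hierarchical.
nested = grouped_toy.agg({"tip": ["min", "max"], "size": "sum"})

print(flat.columns.nlevels, nested.columns.nlevels)
print(list(nested.columns))
```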

### 8.2.2 Returning Aggregated Data Without Row Indexes

In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn’t always desirable, we can disable this behavior in most cases by passing `as_index=False` to `groupby`:

In [None]:
tips.groupby(['day', 'smoker'], as_index=False).mean()

Of course, it’s always possible to obtain the result in this format by calling `reset_index` on the result. Using the `as_index=False` option avoids some unnecessary computations.
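
The two routes to a flat result can be compared on a small example. This sketch uses a hypothetical `toy` frame and aggregates a single numeric column so that it does not depend on how the installed pandas version treats nonnumeric columns.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat"],
    "smoker": ["No", "Yes", "No"],
    "tip": [1.0, 3.0, 5.0],
})

# Group keys stay as ordinary columns.
flat = toy.groupby(["day", "smoker"], as_index=False)["tip"].mean()

# Same table, built by resetting the hierarchical index afterward.
via_reset = toy.groupby(["day", "smoker"])["tip"].mean().reset_index()

print(flat)
print(list(flat.columns) == list(via_reset.columns))
```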

## 8.3 Apply: General split-apply-combine

The most general-purpose GroupBy method is apply, which is the subject of the rest of this section. As illustrated in Figure 8.1 previously, apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.

<img src="Figure 8.1.png" width="600px">
<br>
<center>Figure 8.1: Illustration of a group aggregation</center>

Returning to the tipping dataset from before, suppose we wanted to select the top five *tip_pct* values by group. First, write a function that selects the rows with the largest values in a particular column:

In [None]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

Now, if we group by *smoker*, say, and call apply with this function, we get the following:

In [None]:
tips.groupby('smoker').apply(top)

What has happened here? The `top` function is called on each row group from the DataFrame, and then the results are glued together using `pandas.concat`, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.

If you pass a function to apply that takes other arguments or keywords, you can pass these after the function:

In [None]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

We may recall that we earlier called `describe` on a GroupBy object:

In [None]:
result = tips.groupby('smoker')['tip_pct'].describe()
result

In [None]:
result.unstack()

Inside GroupBy, when we invoke a method like `describe`, it is actually just a shortcut for:

In [None]:
f = lambda x: x.describe()
grouped.apply(f)

### 8.3.1 Suppressing the Group Keys

In the preceding examples, we see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. We can disable this by passing `group_keys=False` to `groupby`:

In [None]:
tips.groupby('smoker', group_keys=False).apply(top)
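
The effect of `group_keys` on the result index is easiest to see with fixed values. As a simplification, this sketch applies a variant of `top` to a single selected column of a hypothetical `toy` frame rather than to whole sub-DataFrames, and it uses `iloc` for the positional slice.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes"],
    "tip_pct": [0.10, 0.30, 0.20, 0.25, 0.15],
})

def top(series, n=2):
    # Keep the n largest values of one group, in ascending order.
    return series.sort_values().iloc[-n:]

# Default: the group key becomes the outer level of the result index.
with_keys = toy.groupby("smoker")["tip_pct"].apply(top)
print(with_keys)

# group_keys=False: only the original row labels remain in the index.
without_keys = toy.groupby("smoker", group_keys=False)["tip_pct"].apply(top, n=1)
print(without_keys)
```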

### 8.3.2 Quantile and Bucket Analysis

As we may recall from Chapter 6, pandas has some tools, in particular `cut` and `qcut`, for slicing data up into buckets with bins of your choosing or by sample quantiles. Combining these functions with `groupby` makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using `cut`:

In [None]:
frame = pd.DataFrame({'data1': np.random.randn(1000), 'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]

The categorical result returned by `cut` (a Series of categorical dtype when a Series is passed in) can be passed directly to `groupby`. So we could compute a set of statistics for the *data2* column like so:

In [None]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()

These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use `qcut`. We’ll pass `labels=False` to just get quantile numbers:

In [None]:
# Return quantile numbers
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
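
The difference between equal-length and equal-size buckets shows up clearly on skewed data. This sketch uses hypothetical `values` with one outlier; `observed=False` is passed explicitly because the default for categorical keys differs between pandas versions.

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100], dtype="float64")
other = pd.Series([10, 20, 30, 40, 50, 60, 70, 80], dtype="float64")

# Equal-length buckets: the outlier sits alone in the upper bucket.
equal_width = pd.cut(values, 2)
width_counts = other.groupby(equal_width, observed=False).count()

# Equal-size buckets from sample quantiles, labeled 0 and 1.
equal_count = pd.qcut(values, 2, labels=False)
quantile_counts = other.groupby(equal_count).count()
quantile_means = other.groupby(equal_count).mean()

print(width_counts.tolist(), quantile_counts.tolist())
print(quantile_means.tolist())
```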

### 8.3.3 Example: Filling Missing Values with Group-Specific Values

When cleaning up missing data, in some cases we will filter out data observations using `dropna`, but in others we may want to impute (fill in) the null (NA) values using a fixed value or some value derived from the data. `fillna` is the right tool to use; for example, here we fill in NA values with the mean:

In [None]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

In [None]:
s.fillna(s.mean())

Suppose we need the fill value to vary by group. One way to do this is to group the data and use `apply` with a function that calls `fillna` on each data chunk. Here is some sample data on US states divided into eastern and western regions:

In [None]:
states = ['Ohio', 'New York', 'Vermont', 'Florida', 'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data

Note that the syntax ['East'] * 4 produces a list containing four copies of the elements in ['East']. Adding lists together concatenates them. Let’s set some values in the data to be missing:

In [None]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

In [None]:
data.groupby(group_key).mean()

We can fill the NA values using the group means like so:

In [None]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

In another case, we might have predefined fill values in our code that vary by group. Since the groups have a name attribute set internally, we can use that:

In [None]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
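
An alternative to `apply` for the group-mean case is `transform`, which returns a result aligned with the original index. The sketch below is one possible variation on hypothetical fixed values (`data_demo` and `demo_key` are illustrative names), not the method used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical fixed values so the group means are easy to check.
data_demo = pd.Series(
    [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    index=["Ohio", "Vermont", "Florida", "Oregon", "Nevada", "Idaho"],
)
demo_key = ["East"] * 3 + ["West"] * 3

# transform('mean') broadcasts each group's mean back to its rows.
group_means = data_demo.groupby(demo_key).transform("mean")

# fillna aligns on the index, so each NA gets its own group's mean.
filled = data_demo.fillna(group_means)
print(filled)
```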

## 8.4 Pivot Tables and Cross-Tabulation

### 8.4.1 Pivot Tables

A *pivot table* is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the `groupby` facility described in this chapter combined with reshape operations utilizing hierarchical indexing. DataFrame has a `pivot_table` method, and there is also a top-level `pandas.pivot_table` function. In addition to providing a convenience interface to groupby, `pivot_table` can add partial totals, also known as *margins*.

Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default `pivot_table` aggregation type) arranged by *day* and *smoker* on the rows:

In [None]:
tips.pivot_table(index=['day', 'smoker'])

This could have been produced with `groupby` directly. Now, suppose we want to aggregate only *tip_pct* and *size*, and additionally group by *time*. We’ll put *smoker* in the table columns and *time* and *day* in the rows:

In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

We could augment this table to include partial totals by passing `margins=True`. This has the effect of adding All row and column labels, with corresponding values being the group statistics for all the data within a single tier:

In [None]:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)

Here, the *All* values are means without taking into account smoker versus nonsmoker (the All columns) or any of the two levels of grouping on the rows (the All row).
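
The relationship between `pivot_table`, `groupby`, and the margins can be checked on a table small enough to compute by hand. This sketch uses a hypothetical `toy` frame with one observation per cell.

```python
import pandas as pd

# Hypothetical data; names are illustrative only.
toy = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "tip": [1.0, 3.0, 2.0, 6.0],
})

# Group means with partial totals in the 'All' row and column.
table = toy.pivot_table("tip", index="day", columns="smoker",
                        aggfunc="mean", margins=True)
print(table)

# The interior of the table is what groupby plus unstack produces.
by_hand = toy.groupby(["day", "smoker"])["tip"].mean().unstack()
print(by_hand)
```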

To use a different aggregation function, pass it to `aggfunc`. For example, `count` or `len` will give you a cross-tabulation (count or frequency) of group sizes:

In [None]:
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', aggfunc=len, margins=True)

If some combinations are empty (or otherwise NA), we may wish to pass a `fill_value`:

In [None]:
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'], columns='day', aggfunc='mean', fill_value=0)

See Table 8.2 for a summary of `pivot_table` options.

<br><center>Table 8.2: *pivot_table* options</center>
<img src="Table 8.2.png" width="800px">

### 8.4.2 Cross-Tabulations: Crosstab

A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is an example:

In [None]:
data1 = pd.DataFrame({'Sample': range(10), 'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA',
                                                            'USA', 'Japan', 'USA'], 
                      'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-handed',
                                     'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed',]})
data1

As part of some survey analysis, we might want to summarize this data by nationality and handedness. We could use `pivot_table` to do this, but the `pandas.crosstab` function can be more convenient:

In [None]:
pd.crosstab(data1.Nationality, data1.Handedness, margins=True)

The first two arguments to `crosstab` can each be an array, a Series, or a list of arrays, as in the tips data:

In [None]:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
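
Since a crosstab is a table of group frequencies, the same counts can be obtained from `groupby` with `size`. This sketch compares the two on a hypothetical `toy` frame with shortened labels.

```python
import pandas as pd

# Hypothetical data; names and labels are illustrative only.
toy = pd.DataFrame({
    "nat": ["USA", "Japan", "USA", "Japan", "USA"],
    "hand": ["R", "L", "R", "R", "L"],
})

counts = pd.crosstab(toy["nat"], toy["hand"])
print(counts)

# Group sizes reshaped into the same layout; empty cells become 0.
same = toy.groupby(["nat", "hand"]).size().unstack(fill_value=0)
print((counts.values == same.values).all())
```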