# Table of Contents
* [Learning Objectives](#Learning-Objectives)
* [Pandas GroupBy Overview](#Pandas-GroupBy-Overview)
	* [Set-Up](#Set-Up)
	* [What is Groupby?](#What-is-Groupby?)
	* [Groupby Input Types](#Groupby-Input-Types)
* [Quick Demo](#Quick-Demo)
	* [Read Data](#Read-Data)
	* [Create a groupby object](#Create-a-groupby-object)
	* [Split](#Split)
	* [Apply](#Apply)
	* [Visualize](#Visualize)
* [GroupBy with Real Data](#GroupBy-with-Real-Data)
	* [Choose your Groups](#Choose-your-Groups)
	* [Create GroupBy Object](#Create-GroupBy-Object)
	* [Split](#Split)
	* [Apply](#Apply)
	* [Pure Operations](#Pure-Operations)
	* [Transform](#Transform)
* [GroupBy with Multiple Columns](#GroupBy-with-Multiple-Columns)
	* [Reshaping with Stack and Unstack](#Reshaping-with-Stack-and-Unstack)
	* [GroupBy with MultiIndex](#GroupBy-with-MultiIndex)
	* [Summary](#Summary)


# Learning Objectives

After completion of this module, learners will be able to:
* Use pandas to perform groupby operations on tabular data
* Apply groupby to several different "groupers" (e.g. Series, funcs, dicts)
* Use stack and unstack to mitigate difficulties with hierarchical indexing
* Perform aggregations of selected columns of data

# Pandas GroupBy Overview

## Set-Up

In [None]:
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
pd.options.display.max_rows    = 6
pd.options.display.max_columns = 6
pd.options.display.width       = 80

## What is Groupby?

Groupby is a fundamental operation in data analysis and is well supported in pandas.
> http://pandas.pydata.org/pandas-docs/stable/groupby.html


A “group by” involves one or more of the following:

1. **Split** a table into groups
2. **Apply** a function to each group
3. **Combine** the results

The split and combine steps are often straightforward. The apply step is more interesting.

In the apply step, perform one or more of the following:

* **Aggregation:** compute a summary statistic, e.g. count, sum, mean, or std.
* **Transformation:** perform group-specific computations and return a like-indexed object.
* **Filtration:** discard some groups based on a group-wise computation, e.g. filtering on the group mean.
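
The three kinds of apply step listed above can be sketched on a small made-up table. The `toy` DataFrame and its column names are assumptions for illustration only; they are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b", "c"],
    "val": [1.0, 3.0, 10.0, 20.0, 100.0],
})
by_key = toy.groupby("key")["val"]

# Aggregation: one summary value per group
agg_result = by_key.mean()

# Transformation: same index as the input, computed within each group
transform_result = by_key.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups whose mean is below 50
filter_result = toy.groupby("key").filter(lambda g: g["val"].mean() < 50)

print(agg_result)
print(transform_result)
print(filter_result)
```

Note how the aggregation has one entry per group, the transformation has one entry per input row, and the filtration returns a subset of the original rows.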


## Groupby Input Types

In pandas the first step in a groupby looks like this:

``df.groupby( grouper )``

`grouper` can be many things:

- **``Series``** e.g. `df['A']` used as `df.groupby(df['A'])`
- **`string`** e.g. `"A"` is a column label, used as `df.groupby('A')`, which is equivalent to `df.groupby(df['A'])`
- **function** (to be applied on the index)
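- **dict**: maps index labels to group names, so rows are grouped by the dict *values*
- **`level=`**: names (or positions) of levels in a MultiIndex, used as `df.groupby(level=...)`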
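
As a rough sketch of the grouper types listed above, the following uses a small made-up DataFrame. The names `toy`, `mapping`, and `multi`, and the index labels, are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with string index labels
toy = pd.DataFrame(
    {"A": ["x", "y", "x", "y"], "B": [1, 2, 3, 4]},
    index=["r1", "r2", "s1", "s2"],
)

# Series and column label: these two are equivalent
by_series = toy.groupby(toy["A"])["B"].sum()
by_label = toy.groupby("A")["B"].sum()

# Function: called on each index label (here, its first character)
by_func = toy.groupby(lambda label: label[0])["B"].sum()

# Dict: maps index labels to group names
mapping = {"r1": "first", "s1": "first", "r2": "second", "s2": "second"}
by_dict = toy.groupby(mapping)["B"].sum()

# Level: group by a named level of a MultiIndex
multi = toy.set_index("A", append=True)
multi.index.names = ["row", "A"]
by_level = multi.groupby(level="A")["B"].sum()

print(by_series, by_func, by_dict, by_level, sep="\n\n")
```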

# Quick Demo

A very brief example of the power of pandas groupby:

## Read Data

In [None]:
file = "data/exoplanets.csv"
data = pd.read_csv(file, parse_dates=True, index_col='DATE')
data.head(3)

## Create a groupby object

Creating a groupby object is cheap. It is a deferred object: little work is done at this point beyond setting up the grouping indexers.

## Split

In [None]:
grouped = data.groupby(lambda x: x.year)
grouped

We can get a specific group; this returns a DataFrame.

In [None]:
grouped.get_group(1997)

This is equivalent to selecting out that particular group with a boolean mask.

In [None]:
data[data.index.year==1997]

## Apply

Here, the apply step is an aggregation of the grouped object.

In [None]:
# What can we aggregate? How about distance away from Earth
data.columns

We are taking the ``mean`` of ALL of the numeric columns, THEN selecting out a particular one.

In [None]:
grouped.agg('mean')['DIST(PARSEC)']

# 14 Parsecs... Pfsss!

## Visualize

In [None]:
# Early detections were planets around nearby stars
# Later detections found around stars farther away

grouped.agg('mean')['DIST(PARSEC)'].plot()

# GroupBy with Real Data

## Choose your Groups

We'll use the same dataset of beer reviews seen in previous notebooks.

In [None]:
df = pd.read_csv('data/beer2.csv.gz',
                  index_col=0,
                  parse_dates=['time'])

The unique values in the column (in this case, `beer_style`) will make up the **keys** or "group names" in the groupby.

In [None]:
# Example values in the `beer_style` column

df['beer_style'].iloc[0:5]

In [None]:
# The unique values in the `beer_style` column will be the keys of our groupby
print( len( df['beer_style'] ) )
print( len( df['beer_style'].unique() ) )

## Create GroupBy Object

In [None]:
gr = df.groupby('beer_style')

print( type( gr ) )

In [None]:
# The number of groups matches the number of unique entries in the column

gr.ngroups

A GroupBy object is an ``iterable``, so you *can* iterate over it. Each iteration yields the group key and the DataFrame for that group.

In [None]:
for g, grp in gr:
    print("group: {g}, type of group: {t}".format(g=g, t=type(grp)))

In [None]:
# The first five group keys (unique values of the grouping column)

list(gr.groups)[0:5]

## Split

In [None]:
# Separating one group from many

df_dunkel = gr.get_group('Dunkelweizen')
df_dunkel.info()

# Note that the number of rows is much smaller!

In [None]:
# The ordering of the columns is not ideal for viewing results...

df_dunkel.head()

In [None]:
# For convenient viewing of groupby results
# Rearrange the columns such that the groupby keys are first

cols = ['beer_style'] + df.columns.difference(['beer_style']).tolist()
cols

In [None]:
# Another group: Separate the 'Tripel' group from the others

df_tripel = gr.get_group('Tripel')[cols]
df_tripel.head()

In [None]:
# Inspect the group

df_tripel.info()

In [None]:
# Alternatively, you could extract a "group" using .loc[] with a boolean mask

df.loc[df.beer_style=='Tripel',cols]

## Apply

In the **split-apply-combine** strategy, the apply step often aggregates across each group.

In [None]:
# Aggregation methods apply to each group separately.

gr.agg('mean')

This says apply the `mean` function to each column. 
Non-numeric columns (nuisance columns) are excluded.

We can also select a subset of columns to perform the aggregation on.

In [None]:
# Apply aggregation method only to a subset of columns

review_columns = ['abv','review_overall','review_appearance',
                  'review_palate','review_taste']
gr[review_columns].agg('mean')

The `.` attribute lookup works as well.

In [None]:
gr.abv.agg('mean')

## Pure Operations

As an aside, the examples below wrap method chains in grouping parentheses, with one step per line, so that pipelines line up nicely and read cleanly.

Pandas objects generally return a NEW object for each operation, WITHOUT mutating the existing object (the exceptions are indexing operations and ``inplace=True`` operations).

This permits a very READABLE and LOGICAL flow of operations, hence the pipeline idiom. You can quickly skim code and see that each step generates a new object, without having to worry about something being modified behind your back!
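
A minimal sketch of this "pure" behavior, using a made-up Series (the name `scores` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy Series
scores = pd.Series([3, 1, 2], index=["c", "a", "b"])

# sort_values returns a NEW Series; `scores` itself is left unchanged
ranked = scores.sort_values(ascending=False)

print(scores.tolist())  # original order is preserved
print(ranked.tolist())  # sorted copy
```

Because each step hands back a new object, steps can be chained without worrying about side effects on the original data.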


Find the `beer_style` with the greatest variance in `abv`.

In [None]:
# ex: groupby, then select abv, then aggregate with std, then sort

(df
   .groupby('beer_style')
   .abv
   .std()
   .sort_values(ascending=False)
 )

Multiple aggregations on one column:

In [None]:
gr['review_aroma'].agg(['mean', 'std', 'count'])

Single aggregation on multiple columns:

In [None]:
gr[review_columns].mean()

Multiple aggregations on multiple columns:

In [None]:
result = gr[review_columns].agg(['mean', 'count', 'std'])
result.columns.names=['characteristic','measure']
result

This generates a ``MultiIndex`` for the columns. This is a powerful concept and allows us to express multiple reducing/aggregating operations with a very small amount of code!
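
To see the shape of such a result without the beer data, here is a small sketch on a made-up table. The `toy` DataFrame, its columns, and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews
toy = pd.DataFrame({
    "style": ["ale", "ale", "stout", "stout"],
    "abv": [5.0, 7.0, 8.0, 10.0],
    "taste": [3.0, 4.0, 4.0, 5.0],
})

# Two aggregations on two columns gives 2 x 2 = 4 result columns
summary = toy.groupby("style")[["abv", "taste"]].agg(["mean", "count"])

# Each column label is a tuple: (original column, aggregation)
column_labels = summary.columns.tolist()
ale_mean_abv = summary.loc["ale", ("abv", "mean")]

print(summary)
print(column_labels)
```

A single value is addressed with a `(column, aggregation)` tuple, which is the same selection style used in the stacking examples later on.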

## Transform

The previous operations were ``reducing``, in that a Series was summarized into a scalar (for example, with ``.mean()``).

There is another class of operations that we would like to perform: ``transforms``. These return an object with the SAME shape as the input. When the function itself reduces (for example, ``'mean'``), the resulting value is broadcast, so it is the SAME for each row in that GROUP.

A classic example is to standardize the reviews by subtracting the mean.

In [None]:
# Define a function to perform the demeaning

def de_mean(reviews):
    s = reviews - reviews.mean()
    return s

In [None]:
# Demonstrate the demeaning function acting on a Series

de_mean(df.review_overall)

So we got back the same shape as our input. Now let's try the same for a ``.groupby()``.

In [None]:
# Use transform() to apply the demean function to grouped data

df.groupby('profile_name').transform(de_mean)

Again we get back the same shape as our input. As with ``.apply`` and ``.agg``, the function is applied to EACH numeric COLUMN (unless, of course, columns were sub-selected beforehand).

Performance: it is often faster to use the built-in groupby methods directly than to pass a custom Python function.

Here we have to 'manually' remove the non-numeric columns, as we are doing a 'regular' subtraction.

In [None]:
# Rather than applying the de_mean(), 
# use the built-in pandas method mean() and 
# move the arithmetic up to "df - df.groupby()"

(

    df - df.groupby('profile_name').transform('mean')

).select_dtypes(exclude=['object'])

Convince yourself that taking the original data and subtracting a groupby transform of the mean IS the same as taking, FOR EACH GROUP, the elements of the group and subtracting the group mean.

You can think of this operation as a ``broadcasting`` one!
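
One way to convince yourself is to check the equivalence on a tiny made-up table. The `toy` DataFrame and its column names are assumptions for illustration; `de_mean` is redefined here only so that the cell runs on its own.

```python
import pandas as pd

# Hypothetical toy data: two reviewers with a few scores each
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2"],
    "score": [4.0, 2.0, 5.0, 4.0, 3.0],
})

def de_mean(values):
    # Subtract the mean of whatever Series is passed in
    return values - values.mean()

# Route 1: apply the custom function to each group
via_function = toy.groupby("user")["score"].transform(de_mean)

# Route 2: broadcast each group's mean back to its rows, then subtract once
group_means = toy.groupby("user")["score"].transform("mean")
via_broadcast = toy["score"] - group_means

print(group_means.tolist())
print(via_function.equals(via_broadcast))
```

The `group_means` Series has one value per input row, repeated within each group, which is exactly the broadcasting described above.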

In [None]:
%timeit -n 1 -r 1 df.groupby('profile_name').transform(de_mean)

In [None]:
%timeit -n 1 -r 1 (df-df.groupby('profile_name').transform('mean')).select_dtypes(exclude=['object'])

In [None]:
df.groupby('profile_name').ngroups

***

# GroupBy with Multiple Columns

You can group by **levels** of a MultiIndex.

To understand this, we first need to talk about `stack()` and `unstack()`.

## Reshaping with Stack and Unstack

Why stack?
* Hierarchical Indexes in the columns can be awkward to work with.
* Hierarchical Indexes in rows can be easier to work with.
* stack: “pivot” a level of the column labels to the inner-most level of row labels.
* unstack: the inverse operation of stack: “pivot” a level of the row index to the inner-most level of column labels.

Documentation:
> http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
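
Before working with `result`, the two operations can be sketched on a tiny hand-built table. The `wide` DataFrame, its values, and its level names are assumptions that merely mimic the layout of `result`.

```python
import pandas as pd

# Hypothetical table with hierarchical COLUMNS, similar in layout to `result`
columns = pd.MultiIndex.from_product(
    [["abv", "taste"], ["mean", "std"]],
    names=["characteristic", "measure"],
)
wide = pd.DataFrame(
    [[5.0, 0.5, 3.5, 0.2], [8.0, 1.0, 4.0, 0.3]],
    index=pd.Index(["ale", "stout"], name="style"),
    columns=columns,
)

# stack: move the "measure" column level to the inner-most row level
tall = wide.stack(level="measure")

# unstack: move "measure" back from the rows to the columns
round_trip = tall.unstack(level="measure")

print(tall)
print(round_trip)
```

After stacking, the rows are indexed by `(style, measure)` pairs and the columns are a plain, single-level index.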

In [None]:
# Let's start with "result"
# Notice that the COLUMNS are hierarchical, not the rows

result

In [None]:
# Notice that the COLUMN selection is hierarchical using tuples
# pd.IndexSlice provides a convenient selection object
idx = pd.IndexSlice

result.loc[:,idx['abv','mean']]

Same as above, but also select several of the main columns:

In [None]:
result.loc[:,[idx['abv','mean'],idx['review_taste','mean']]]

In [None]:
# Ex: move the "characteristic" level from the columns to the rows

multi1 = result.stack(level='characteristic')
multi1

# Note that the row index is now hierarchical

In [None]:
# Now the ROW selection is hierarchical, with 'mean' as the column
# slice(None) is equivalent to : when used in the hierarchical tuple

multi1.loc[idx[:,'abv'],'mean']

In [None]:
# Ex: alternative stacking: 
# Move the "measure" level of the hierarchical column to the rows

multi2 = result.stack(level='measure')
multi2

In [None]:
# Now the ROW selection is hierarchical, with 'abv' as the column

multi2.loc[idx[:,'mean'], 'abv']

The stack and unstack methods are inverses:

In [None]:
result

In [None]:
# stack-unstack are inverses
(result
      .stack(level='measure')
      .unstack(level='measure')
 )

##  GroupBy with MultiIndex

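Now that we know how to `stack()`, we can group by a level of the resulting row MultiIndex. For example, for each beer style, compare the smallest and largest mean across all characteristics: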

In [None]:
(result.stack(level='characteristic')
       .groupby(level='beer_style')
       ['mean']
       .agg(['min', 'max' ])
 )

Group by **multiple** columns:

In [None]:
df.groupby(['brewer_id', 'beer_style'])[review_columns].mean()
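
As a small sketch of what grouping by two columns produces, the following uses a made-up table. The `toy` DataFrame and its column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: two brewers, two styles
toy = pd.DataFrame({
    "brewer": [1, 1, 1, 2, 2],
    "style": ["ale", "ale", "stout", "ale", "stout"],
    "rating": [4.0, 2.0, 5.0, 3.0, 1.0],
})

# Grouping by two columns yields a two-level row MultiIndex
by_two = toy.groupby(["brewer", "style"])["rating"].mean()

# Select one group with a tuple of keys
brewer1_ale = by_two.loc[(1, "ale")]

# unstack turns the inner level into columns: a brewer-by-style table
table = by_two.unstack(level="style")

print(by_two)
print(table)
```

Unstacking the inner level is often a convenient way to view a multi-key groupby result as a two-dimensional table.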

***

## Summary

We've seen many permutations of the number of groupers, the number of columns to aggregate, and the number of aggregators.
In fact, `.agg`, which returns one row per group, is just one way to combine the results. The three ways are:

- `agg`: one row per group
- `transform`: identically shaped output as input
- `apply`: anything goes
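
The difference in output shape can be sketched on a made-up table. The `toy` DataFrame is an assumption for illustration, and taking the two largest values per group is just one arbitrary example of what `apply` can return.

```python
import pandas as pd

# Hypothetical toy data: 5 rows in 2 groups
toy = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "val": [1, 5, 3, 2, 8],
})
grouped = toy.groupby("key")["val"]

# agg: one value per group (2 groups, so 2 rows)
one_per_group = grouped.agg("max")

# transform: same length as the input (5 rows in, 5 rows out)
same_shape = grouped.transform("max")

# apply: the function may return any shape; here, the 2 largest values per group
top_two = grouped.apply(lambda s: s.nlargest(2))

print(one_per_group)
print(same_shape)
print(top_two)
```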


***