# Aggregation and Grouping

An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset. In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a `groupby`.

For convenience, we'll use the same `display` magic function that we've seen in previous sections:

In [2]:
import numpy as np
import pandas as pd 

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""

    def __init__(self, *args):
        # args are the *names* of objects (strings), evaluated at display time
        self.args = args

    def _repr_html_(self):
        # Called by Jupyter to render rich HTML output
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        # Plain-text fallback when HTML rendering is unavailable
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

## Planets Data

Here we will use the Planets dataset, available via the seaborn package. It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets*, or *exoplanets* for short).

In [3]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [4]:
planets.head()

| | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009 |


This has some details on the more than 1,000 extrasolar planets discovered up to 2014.

## Simple Aggregation in Pandas

Earlier, we explored some of the data aggregations available for NumPy arrays. As with a one-dimensional NumPy array, for a Pandas `Series` the aggregates return a single value:

In [5]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

0    0.374540
1    0.950714
2    0.731994
3    0.598658
4    0.156019
dtype: float64

In [6]:
ser.sum()

2.811925491708157

In [None]:
ser.mean()

For a `DataFrame`, by default the aggregates return results within each column:

In [8]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df                   

| | A | B |
|---|---|---|
| 0 | 0.183405 | 0.611853 |
| 1 | 0.304242 | 0.139494 |
| 2 | 0.524756 | 0.292145 |
| 3 | 0.431945 | 0.366362 |
| 4 | 0.291229 | 0.45607 |


In [9]:
df.mean()

A    0.347115
B    0.373185
dtype: float64

By specifying the `axis` argument, you can instead aggregate within each row:

In [10]:
df.mean(axis=1)

0    0.397629
1    0.221868
2    0.408451
3    0.399153
4    0.373650
dtype: float64

Pandas `Series` and `DataFrame`s include all of the common aggregates mentioned earlier; in addition, there is a convenience method `describe()` that computes several common aggregates for each column and returns the result. Let's use this on the Planets data, for now dropping rows with missing values:

In [11]:
planets.dropna().describe()

| | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|
| count | 498.0 | 498.0 | 498.0 | 498.0 | 498.0 |
| mean | 1.73494 | 835.778671 | 2.50932 | 52.068213 | 2007.37751 |
| std | 1.17572 | 1469.128259 | 3.636274 | 46.596041 | 4.167284 |
| min | 1.0 | 1.3283 | 0.0036 | 1.35 | 1989.0 |
| 25% | 1.0 | 38.27225 | 0.2125 | 24.4975 | 2005.0 |
| 50% | 1.0 | 357.0 | 1.245 | 39.94 | 2009.0 |
| 75% | 2.0 | 999.6 | 2.8675 | 59.3325 | 2011.0 |
| max | 6.0 | 17337.5 | 25.0 | 354.0 | 2014.0 |


This can be a useful way to begin understanding the overall properties of a dataset. For example, we see in the `year` column that although exoplanets were discovered as far back as 1989, half of the planets summarized here were not discovered until 2009 or after. This is largely thanks to the *Kepler* mission, a space-based telescope specifically designed for finding eclipsing planets around other stars.

The following table summarizes other built-in Pandas aggregations:

| Aggregation | Description |
|---|---|
| `count()` | Total number of items |
| `first()`, `last()` | First and last item |
| `mean()`, `median()` | Mean and median |
| `min()`, `max()` | Minimum and maximum |
| `std()`, `var()` | Standard deviation and variance |
| `mad()` | Mean absolute deviation (removed in pandas 2.0) |
| `prod()` | Product of all items |
| `sum()` | Sum of all items |

These are all methods of `DataFrame` and `Series` objects.
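As a quick illustration, here is a minimal sketch that applies several of these aggregates to a small hypothetical `Series` (the name `values` and its contents are invented for this example). Because `mad()` was removed in pandas 2.0, the mean absolute deviation is computed by hand from its definition rather than by calling the method:

```python
import pandas as pd

# Hypothetical data chosen so each result is easy to verify by hand
values = pd.Series([2.0, 4.0, 4.0, 6.0])

n = values.count()        # number of non-missing items
total = values.sum()      # 2 + 4 + 4 + 6
product = values.prod()   # 2 * 4 * 4 * 6
middle = values.median()  # midpoint of the sorted values
variance = values.var()   # sample variance (ddof=1 is the pandas default)

# Mean absolute deviation from its definition: mean of |x - mean(x)|
mean_abs_dev = (values - values.mean()).abs().mean()

print(n, total, product, middle, variance, mean_abs_dev)
```

Note that `std()` and `var()` in Pandas default to the sample estimators (`ddof=1`), unlike the NumPy functions of the same name, which default to `ddof=0`.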

To go deeper into the data, however, simple aggregates are often not enough. The next level of data summarization is the `groupby` operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

## GroupBy: Split, Apply, Combine

Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called `groupby` operation. The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminating to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.

### Split, apply, combine

This name makes clear what the `groupby` accomplishes:

- The *split* step involves breaking up and grouping a `DataFrame` depending on the value of the specified key.
- The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The *combine* step merges the results of these operations into an output array.

While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that the *intermediate splits do not need to be explicitly instantiated*. Rather, the `GroupBy` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the `GroupBy` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.
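To make the contrast concrete, the following sketch performs the three steps by hand with boolean masks and then compares the result with a single `groupby` call. The names `df_demo`, `manual`, and `grouped` are hypothetical, and the manual loop is only an illustration of the idea, not a description of how Pandas implements `GroupBy` internally:

```python
import pandas as pd

# Hypothetical input: a key column and a numeric column
df_demo = pd.DataFrame({'key': list('ABCABC'),
                        'data': range(6)})

# Split and apply by hand: one boolean mask per distinct key,
# then a sum over the rows that the mask selects
manual = {}
for key in df_demo['key'].unique():
    mask = df_demo['key'] == key
    manual[key] = df_demo.loc[mask, 'data'].sum()

# Combine: gather the per-group results into a single Series
manual = pd.Series(manual, name='data')

# The same operation expressed as a whole with groupby
grouped = df_demo.groupby('key')['data'].sum()

print(manual)
print(grouped)
```

The manual version materializes every intermediate subset explicitly, whereas the `groupby` version only states which operation is wanted and leaves the mechanics to Pandas.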

As a concrete example, let's take a look at using Pandas for a small split-apply-combine computation. We'll start by creating the input `DataFrame`:

In [20]:
df = pd.DataFrame({'key': list('ABCABC'),
                   'data': range(6)}, columns=['key', 'data'])
df

| | key | data |
|---|---|---|
| 0 | A | 0 |
| 1 | B | 1 |
| 2 | C | 2 |
| 3 | A | 3 |
| 4 | B | 4 |
| 5 | C | 5 |


The most basic split-apply-combine operation can be computed with the `groupby()` method of `DataFrame`s, passing the name of the desired key column:

In [30]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000251AC84FBE0>

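Notice that what is returned is not a set of `DataFrame`s but a `DataFrameGroupBy` object. No computation is done until an aggregation is applied to it. An aggregate such as `sum()` is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid `DataFrame` operation, as we will see in the following discussion.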
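To produce an actual result, an aggregate has to be applied to the `DataFrameGroupBy` object. A minimal sketch using the same small `key`/`data` frame as above (the variable names `grouped`, `sums`, and `means` are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': list('ABCABC'),
                   'data': range(6)}, columns=['key', 'data'])

# Creating the GroupBy object is cheap: no aggregation has happened yet
grouped = df.groupby('key')

# Applying an aggregate triggers the computation;
# the result is a DataFrame indexed by the distinct keys
sums = grouped.sum()
means = grouped.mean()

print(sums)
print(means)
```

Because the `GroupBy` object itself holds no aggregated result, the same `grouped` object can be reused for several different aggregations.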

### The GroupBy object

The `GroupBy` object is a very flexible abstraction. In many ways, you can simply treat it as if it's a collection of `DataFrame`s, and it does the difficult things under the hood. Let's see some examples using the Planets data.

#### Column indexing

The `GroupBy` object supports column indexing in the same way as the `DataFrame`, and returns a modified `GroupBy` object.

In [32]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000251AC543BB0>

In [34]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x00000251AC882370>

Here we've selected a particular `Series` group from the original `DataFrame` group by reference to its column name. As with the `GroupBy` object, no computation is done until we call some aggregate on the object:

In [35]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

That gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
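Column indexing works with a single name or a list of names. The sketch below uses a hypothetical miniature stand-in for the Planets data (`mini`, with invented values) so that it runs without downloading anything and the results can be verified by hand:

```python
import pandas as pd

# Hypothetical stand-in for the Planets data; the values are invented
mini = pd.DataFrame({
    'method': ['Transit', 'Transit',
               'Radial Velocity', 'Radial Velocity', 'Radial Velocity'],
    'orbital_period': [3.0, 5.0, 100.0, 300.0, 500.0],
    'year': [2010, 2012, 2001, 2005, 2009],
})

by_method = mini.groupby('method')

# A single column name selects a SeriesGroupBy
one_col = by_method['orbital_period']

# A list of column names selects a DataFrameGroupBy
two_cols = by_method[['orbital_period', 'year']]

# Nothing is computed until an aggregate is called
medians = one_col.median()
maxima = two_cols.max()

print(medians)
print(maxima)
```

Selecting columns before aggregating also avoids computing aggregates for columns that are not needed.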

#### Iteration over groups

The `GroupBy` object supports direct iteration over the groups, returning each group as a `Series` or `DataFrame`:

In [38]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)


This can be useful for doing certain things manually, though it is often much faster to use the built-in `apply` functionality, which we will discuss momentarily.
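As a small comparison, the sketch below counts the rows in each group twice: once with an explicit loop over the groups, and once with the built-in `size()` method. The frame `mini` is a hypothetical stand-in with invented values, and the loop is shown only to illustrate what iteration yields:

```python
import pandas as pd

# Hypothetical stand-in for the Planets data; the values are invented
mini = pd.DataFrame({
    'method': ['Transit', 'Transit',
               'Radial Velocity', 'Radial Velocity', 'Radial Velocity'],
    'year': [2010, 2012, 2001, 2005, 2009],
})

# Manual route: each iteration yields the group key and the sub-DataFrame
manual_counts = {}
for method, group in mini.groupby('method'):
    manual_counts[method] = len(group)

# Built-in route: a single call that returns a Series indexed by method
builtin_counts = mini.groupby('method').size()

print(manual_counts)
print(builtin_counts)
```

Both routes give the same counts; the built-in call is shorter and avoids a Python-level loop over the groups.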

#### Dispatch methods

Through some Python class magic, any method not explicitly implemented by the `GroupBy` object will be passed through and called on the groups, whether they are `DataFrame` or `Series` objects. For example, you can use the `describe()` method of `DataFrame`s to perform a set of aggregations that describe each group in the data:

In [43]:
planets.groupby('method')['year'].describe().unstack()

       method                       
count  Astrometry                          2.0
       Eclipse Timing Variations           9.0
       Imaging                            38.0
       Microlensing                       23.0
       Orbital Brightness Modulation       3.0
                                         ...  
max    Pulsar Timing                    2011.0
       Pulsation Timing Variations      2007.0
       Radial Velocity                  2014.0
       Transit                          2014.0
       Transit Timing Variations        2014.0
Length: 80, dtype: float64

Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variations and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.

This is just one example of the utility of dispatch methods. Notice that they are applied to *each individual group*, and the results are then combined within `GroupBy` and returned. Again, any valid `DataFrame`/`Series` method can be used on the corresponding `GroupBy` object, which allows for some very flexible and powerful operations.
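The shape of a dispatched `describe()` result is easier to see on a tiny example. This sketch again uses a hypothetical two-group frame (`mini`, with invented values) and shows both the per-group table and the long form produced by `unstack()`:

```python
import pandas as pd

# Hypothetical stand-in for the Planets data; the values are invented
mini = pd.DataFrame({
    'method': ['Transit', 'Transit',
               'Radial Velocity', 'Radial Velocity', 'Radial Velocity'],
    'year': [2010, 2012, 2001, 2005, 2009],
})

# describe() is applied to each group and the results are combined:
# one row per method, one column per summary statistic
summary = mini.groupby('method')['year'].describe()

# unstack() pivots the table into a Series indexed by (statistic, method)
stacked = summary.unstack()

print(summary)
print(stacked)
```

With two groups and eight summary statistics, the unstacked result has sixteen entries, in the same way that the ten methods in the Planets data give the 80 entries shown above.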