In [21]:
import pandas as pd
import numpy as np
import seaborn as sns

## Planets Data from `seaborn`

In [22]:
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [23]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


## Simple Aggregation in Pandas

Pandas `Series` and `DataFrame` objects provide the same common aggregates as NumPy arrays, such as `sum()`, `mean()`, `min()`, and `max()`.

### `Series`

For a Pandas `Series` the aggregates return a single value:

In [24]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(5, size=5))
ser

0    3
1    4
2    2
3    4
4    4
dtype: int64

In [25]:
ser.sum()

17

In [26]:
ser.mean()

3.4

### `DataFrame`

For a `DataFrame`, by default the aggregates return results within each **column**:

In [27]:
df = pd.DataFrame({'A': rng.randint(5, size=5),
                   'B': rng.randint(10, size=5)})
df

Unnamed: 0,A,B
0,1,3
1,2,7
2,2,7
3,2,2
4,4,5


In [28]:
df.mean()

A    2.2
B    4.8
dtype: float64

In [29]:
df.sum()

A    11
B    24
dtype: int64

Aggregate within each row by specifying the `axis` argument:

In [30]:
# Mean of each row
df.mean(axis='columns')

0    2.0
1    4.5
2    4.5
3    2.0
4    4.5
dtype: float64

In [31]:
# Max of each row
df.max(axis=1)

0    3
1    7
2    7
3    2
4    5
dtype: int64
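
As a minimal sketch of what the `axis` argument does, the block below uses a small hypothetical frame called `toy` (not the random data above). The string names `'index'` and `'columns'` are the documented aliases for `0` and `1`, and the axis named is the one that gets collapsed.

```python
import pandas as pd

# Small hypothetical frame for illustration (not the notebook's random data)
toy = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# axis='index' (same as axis=0, the default): collapse the rows,
# leaving one result per column
col_means = toy.mean(axis='index')

# axis='columns' (same as axis=1): collapse the columns,
# leaving one result per row
row_means = toy.mean(axis='columns')

print(col_means)
print(row_means)
```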

`describe()` computes several common aggregates for each column and returns the result as a `DataFrame`:

In [32]:
planets.dropna().describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,498.0,498.0,498.0,498.0,498.0
mean,1.73494,835.778671,2.50932,52.068213,2007.37751
std,1.17572,1469.128259,3.636274,46.596041,4.167284
min,1.0,1.3283,0.0036,1.35,1989.0
25%,1.0,38.27225,0.2125,24.4975,2005.0
50%,1.0,357.0,1.245,39.94,2009.0
75%,2.0,999.6,2.8675,59.3325,2011.0
max,6.0,17337.5,25.0,354.0,2014.0


In [33]:
planets.describe()

Unnamed: 0,number,orbital_period,mass,distance,year
count,1035.0,992.0,513.0,808.0,1035.0
mean,1.785507,2002.917596,2.638161,264.069282,2009.070531
std,1.240976,26014.728304,3.818617,733.116493,3.972567
min,1.0,0.090706,0.0036,1.35,1989.0
25%,1.0,5.44254,0.229,32.56,2007.0
50%,1.0,39.9795,1.26,55.25,2010.0
75%,2.0,526.005,3.04,178.5,2012.0
max,7.0,730000.0,25.0,8500.0,2014.0


### Some built-in Pandas aggregations

| Aggregation          | Description                     |
| :------------------- | :------------------------------ |
| `count()`            | Total number of items           |
| `first()`, `last()`  | First and last item             |
| `mean()`, `median()` | Mean and median                 |
| `min()`, `max()`     | Minimum and maximum             |
| `std()`, `var()`     | Standard deviation and variance |
| `mad()`              | Mean absolute deviation (removed in pandas 2.0) |
| `prod()`             | Product of all items            |
| `sum()`              | Sum of all items                |

These are all methods of `DataFrame` and `Series` objects.
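
One caveat: `mad()` was deprecated in pandas 1.5 and is no longer available from pandas 2.0 onward. A minimal sketch of computing the mean absolute deviation directly, reusing the values of the `ser` example above:

```python
import pandas as pd

ser = pd.Series([3, 4, 2, 4, 4])  # same values as the `ser` example above

# Mean absolute deviation: the average distance of each value from the mean
mad = (ser - ser.mean()).abs().mean()
print(mad)
```

The same expression works column by column on a `DataFrame`.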

## GroupBy: Split, Apply, Combine

### Split, apply, combine

`groupby`:

- *Split* step: breaks up and groups a `DataFrame` depending on the value of the specified key

- *Apply* step: computes some function, usually an aggregate, transformation, or filtering, within the individual groups

- *Combine* step: merges the result of these operations into an output array.

Example: 

In [34]:
from IPython.display import display, Image

Image(url='https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png')

The power of the `GroupBy` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.

In [35]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

Unnamed: 0,key,data
0,A,0
1,B,1
2,C,2
3,A,3
4,B,4
5,C,5


In [36]:
df.groupby('key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a30f11710>

Notice that what is returned is not a set of `DataFrame`s, but a `DataFrameGroupBy` object. 

This object is where the magic is: **you can think of it as a special view of the `DataFrame`, which is poised to dig into the groups but does no actual computation until the aggregation is applied.** This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

Now apply an aggregate to this `DataFrameGroupBy` object, which will perform the appropriate apply/combine steps to produce the desired result:

In [37]:
df.groupby('key').sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,3
B,5
C,7
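
To make the three steps concrete, the same result can be built by hand. This is only an illustrative sketch of the idea, not how pandas implements `groupby` internally; the names `pieces`, `sums`, and `manual` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# Split: one sub-frame per distinct key
pieces = {key: df[df['key'] == key] for key in df['key'].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one labeled object
manual = pd.Series(sums, name='data').rename_axis('key').sort_index()
print(manual)
```

`df.groupby('key')['data'].sum()` gives the same values in one expression.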


### The `GroupBy` object

We can simply treat the `GroupBy` object as if it's a collection of `DataFrame`s, and it does the difficult things under the hood.

#### Column indexing

The `GroupBy` object supports column indexing in the same way as the `DataFrame`, and returns a modified `GroupBy` object.

In [42]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


In [39]:
planets.groupby('method')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a30f31c50>

In [40]:
planets.groupby('method')['orbital_period']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x1a30f37a20>

In [41]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

#### Iteration over groups

The `GroupBy` object supports direct iteration over the groups, yielding `(key, group)` pairs in which each group is a `Series` or `DataFrame`:

In [45]:
for (method, group) in planets.groupby('method'):
    print(f'{method}, shape={group.shape}')

Astrometry, shape=(2, 6)
Eclipse Timing Variations, shape=(9, 6)
Imaging, shape=(38, 6)
Microlensing, shape=(23, 6)
Orbital Brightness Modulation, shape=(3, 6)
Pulsar Timing, shape=(5, 6)
Pulsation Timing Variations, shape=(1, 6)
Radial Velocity, shape=(553, 6)
Transit, shape=(397, 6)
Transit Timing Variations, shape=(4, 6)


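This can be useful for doing certain things manually, such as inspecting or plotting each group in turn, though the built-in aggregation methods are usually faster.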
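
As a sketch of such manual work, the loop below records, for each group, the index label of the row with the largest value. The frame `toy` and its values are hypothetical stand-ins for the planets data. `get_group()` pulls out a single group by key without looping.

```python
import pandas as pd

# Hypothetical stand-in for the planets data
toy = pd.DataFrame({'method': ['Transit', 'Imaging', 'Transit', 'Imaging', 'Transit'],
                    'distance': [10.0, 50.0, 30.0, 70.0, 20.0]})

# Manual per-group work: index label of the most distant row in each group
farthest = {}
for method, group in toy.groupby('method'):
    farthest[method] = group['distance'].idxmax()
print(farthest)

# A single group can also be extracted by its key
transit = toy.groupby('method').get_group('Transit')
print(transit)
```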

#### Dispatch methods

Any method not explicitly implemented by the `GroupBy` object will be passed through and called on the groups, whether they are `DataFrame` or `Series` objects.

In [50]:
planets.groupby('method')['year'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Astrometry,2.0,2011.5,2.12132,2010.0,2010.75,2011.5,2012.25,2013.0
Eclipse Timing Variations,9.0,2010.0,1.414214,2008.0,2009.0,2010.0,2011.0,2012.0
Imaging,38.0,2009.131579,2.781901,2004.0,2008.0,2009.0,2011.0,2013.0
Microlensing,23.0,2009.782609,2.859697,2004.0,2008.0,2010.0,2012.0,2013.0
Orbital Brightness Modulation,3.0,2011.666667,1.154701,2011.0,2011.0,2011.0,2012.0,2013.0
Pulsar Timing,5.0,1998.4,8.38451,1992.0,1992.0,1994.0,2003.0,2011.0
Pulsation Timing Variations,1.0,2007.0,,2007.0,2007.0,2007.0,2007.0,2007.0
Radial Velocity,553.0,2007.518987,4.249052,1989.0,2005.0,2009.0,2011.0,2014.0
Transit,397.0,2011.236776,2.077867,2002.0,2010.0,2012.0,2013.0,2014.0
Transit Timing Variations,4.0,2012.5,1.290994,2011.0,2011.75,2012.5,2013.25,2014.0


Notice that methods are applied *to each individual group*, and the results are then combined within `GroupBy` and returned. 

### Aggregate, filter, transform, apply

`GroupBy` objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data.

In [51]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


#### Aggregation

The `aggregate()` method can take a string, a function, or a list thereof, and computes all of the aggregates at once.

In [52]:
df.groupby('key').aggregate(['min', np.median, max])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

In [53]:
df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9
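
A related pattern, available since pandas 0.25, is named aggregation, where each keyword names an output column and maps it to an `(input column, aggregation)` pair. The sketch below rebuilds the example frame with fixed values; the output names `data1_min`, `data2_max`, and `data2_range` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Named aggregation: output_column=(input_column, aggregation)
summary = df.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
    data2_range=('data2', lambda s: s.max() - s.min()),
)
print(summary)
```

Unlike the list form, this yields flat column names rather than a hierarchical column index.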


#### Filtering

A filtering operation allows you to drop data based on the group properties. 

For example, we might want to keep only the groups in which the standard deviation of `data2` is larger than some critical value:

In [58]:
def filter_func(x):
    return x['data2'].std() > 4

The filter function should return a Boolean value specifying whether the group passes the filtering.

In [57]:
df.groupby('key').std()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641


In [55]:
df.groupby('key').filter(filter_func)

Unnamed: 0,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9


Because the standard deviation of `data2` in group A (about 1.41) is not greater than 4, all of its rows are dropped from the result.
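
The same selection can be sketched without a filter function by broadcasting the group statistic back onto the rows with `transform()` and using it as a Boolean mask. This is an alternative formulation under the same threshold of 4, shown only for comparison; `group_std` and `kept` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Broadcast each group's standard deviation of data2 back onto its rows
group_std = df.groupby('key')['data2'].transform('std')

# Keep only the rows whose group passes the threshold
kept = df[group_std > 4]
print(kept)
```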

#### Transformation

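While aggregation returns a reduced version of the data, transformation can return a transformed version of the full data to recombine. For such a transformation, the output has the same shape as the input. A common example is centering the data by subtracting the group-wise mean: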

In [60]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [61]:
df.groupby('key').transform(lambda x: x - x.mean())

Unnamed: 0,data1,data2
0,-1.5,1.0
1,-1.5,-3.5
2,-1.5,-3.0
3,1.5,-1.0
4,1.5,3.5
5,1.5,3.0
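
Because the transformed result keeps the original index, it can be assigned straight back to the frame. A minimal sketch, rebuilding the example frame with fixed values; the new column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# The group mean is broadcast back to every row of its group,
# so the result aligns row for row with df
df['data2_group_mean'] = df.groupby('key')['data2'].transform('mean')
df['data2_centered'] = df['data2'] - df['data2_group_mean']
print(df)
```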


#### `apply()`

The `apply()` method lets you apply an arbitrary function to the group results. The function should take a `DataFrame`, and return either a Pandas object (e.g., `DataFrame`, `Series`) or a scalar; the combine operation will be tailored to the type of output returned.

In [62]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

In [63]:
df.groupby('key').apply(norm_by_data2)

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.142857,0
2,C,0.166667,3
3,A,0.375,3
4,B,0.571429,7
5,C,0.416667,9


`apply()` within a `GroupBy` is quite flexible: the only criterion is that the function takes a `DataFrame` and returns a Pandas object or a scalar.
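
The sketch below illustrates how the combine step adapts to the return type. It selects the data columns explicitly before calling `apply()`, which is a choice made here so that the function never touches the grouping column (recent pandas versions warn when it does). The names `ratio` and `extremes` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# Select the data columns so the function only sees data1 and data2
grouped = df.groupby('key')[['data1', 'data2']]

# Function returns a scalar: the result is a Series indexed by group key
ratio = grouped.apply(lambda g: g['data1'].sum() / g['data2'].sum())

# Function returns a Series: the result is a DataFrame, one row per group
extremes = grouped.apply(lambda g: pd.Series({'lo': g['data2'].min(),
                                              'hi': g['data2'].max()}))
print(ratio)
print(extremes)
```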

### Specify the split key

#### A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the `DataFrame`. 

In [66]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [75]:
L = [0, 1, 0, 1, 2, 0]
for el in df.groupby(L):
    print(el, '\n')

(0,   key  data1  data2
0   A      0      5
2   C      2      3
5   C      5      9) 

(1,   key  data1  data2
1   B      1      0
3   A      3      3) 

(2,   key  data1  data2
4   B      4      7) 



In [76]:
df.groupby(L).sum()

Unnamed: 0,data1,data2
0,7,17
1,4,3
2,4,7


In [68]:
df.groupby(df['key']).sum() # verbose way

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,3,8
B,5,7
C,7,12


In [71]:
df.groupby('key').sum()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,3,8
B,5,7
C,7,12


#### A dictionary or series mapping index to group

Provide a dictionary that maps index values to the group keys:

In [78]:
df2 = df.set_index('key')
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [80]:
mappings = {
    'A': 'vowel',
    'B': 'consonant',
    'C': 'consonant'
}

df2.groupby(mappings).sum()

Unnamed: 0,data1,data2
consonant,12,19
vowel,3,8


#### Any Python function

Pass any Python function that takes an index value and returns the group key:

In [81]:
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,0
C,2,3
A,3,3
B,4,7
C,5,9


In [82]:
df2.groupby(str.lower).mean()

Unnamed: 0,data1,data2
a,1.5,4.0
b,2.5,3.5
c,3.5,6.0
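
These kinds of key can also be combined in a list, in which case the result is indexed by both. A minimal sketch, rebuilding `df2` and `mappings` with the same values as above:

```python
import pandas as pd

df2 = pd.DataFrame({'data1': range(6), 'data2': [5, 0, 3, 3, 7, 9]},
                   index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'))
mappings = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}

# Group on two keys at once: a function of the index and a dictionary lookup
combined = df2.groupby([str.lower, mappings]).mean()
print(combined)
```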


## Grouping Example

Count the planets discovered by each method in each decade:

In [90]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

decade,1980s,1990s,2000s,2010s
method,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Astrometry,0.0,0.0,0.0,2.0
Eclipse Timing Variations,0.0,0.0,5.0,10.0
Imaging,0.0,0.0,29.0,21.0
Microlensing,0.0,0.0,12.0,15.0
Orbital Brightness Modulation,0.0,0.0,0.0,5.0
Pulsar Timing,0.0,9.0,1.0,1.0
Pulsation Timing Variations,0.0,0.0,1.0,0.0
Radial Velocity,1.0,52.0,475.0,424.0
Transit,0.0,0.0,64.0,712.0
Transit Timing Variations,0.0,0.0,0.0,9.0
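
The same kind of two-way table can also be sketched with `pivot_table()`. The block below uses a small hypothetical frame `toy` in place of the planets data, so the numbers are made up; it only shows that the two formulations agree.

```python
import pandas as pd

# Hypothetical stand-in for the planets data
toy = pd.DataFrame({
    'method': ['Transit', 'Transit', 'Radial Velocity', 'Radial Velocity', 'Imaging'],
    'number': [1, 2, 1, 3, 1],
    'year': [2009, 2012, 1995, 2011, 2008],
})
toy['decade'] = (10 * (toy['year'] // 10)).astype(str) + 's'

# groupby on two keys, then move the decade level into the columns
by_groupby = toy.groupby(['method', 'decade'])['number'].sum().unstack().fillna(0)

# pivot_table expresses the same split, aggregate, and reshape in one call
by_pivot = toy.pivot_table(index='method', columns='decade',
                           values='number', aggfunc='sum', fill_value=0)
print(by_pivot)
```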
