Title: Grouping and Aggregating Data
Slug: pandas/grouping-and-aggregating-data
Category: Pandas
Tags: head, describe, groupby, mean, nunique, agg
Date: 2017-09-24
Modified: 2017-09-25

#### Import libraries

In [1]:
import pandas as pd
from bokeh.sampledata.iris import flowers

#### Inspect data

In [2]:
flowers.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
flowers.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


#### Grouping data
We use the `groupby()` method with a column name to, you guessed it, group our data by the values in that column. This returns a `DataFrameGroupBy` object rather than a new DataFrame.

In [4]:
flowers.groupby('species')

<pandas.core.groupby.DataFrameGroupBy object at 0x10a325400>

To use this object, we need to chain another method afterwards. This is usually an aggregate of some kind, such as `mean()`.

In [5]:
flowers.groupby('species').mean()

species,sepal_length,sepal_width,petal_length,petal_width
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026
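
To see what the chained call is doing, it can help to carry out the split, apply, and combine steps by hand. The sketch below is only an illustration: `toy` is a made-up four-row stand-in for `flowers` (so it runs without bokeh), and `manual_means`, `manual`, and `builtin` are names chosen for this example.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are made up).
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 4.6, 6.3, 7.1],
    'petal_length': [1.4, 1.5, 6.0, 5.0],
})

grouped = toy.groupby('species')

# Split: iterating a GroupBy yields (group key, sub-DataFrame) pairs.
manual_means = {}
for name, group in grouped:
    # Apply: take the mean of every non-key column within this group.
    manual_means[name] = group.drop(columns='species').mean()

# Combine: one row per species, one column per measurement.
manual = pd.DataFrame(manual_means).T

# The chained aggregate produces the same numbers in a single call.
builtin = grouped.mean()
print(builtin)
```

The loop is only there to make the mechanics visible; in practice the chained form is shorter and usually faster.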


In [6]:
# Find the 36th percentile of each column, per species
flowers.groupby('species').quantile(0.36)

species,petal_length,petal_width,sepal_length,sepal_width
setosa,1.4,0.2,4.9,3.3
versicolor,4.1,1.3,5.7,2.7
virginica,5.264,1.9,6.3,2.8
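
A quantile does not have to be one of the observed values. By default `quantile()` interpolates linearly between the two nearest sorted observations, which is presumably why a figure like 5.264 can show up above. Here is a small sketch of that arithmetic on made-up data; the group labels `'a'` and `'b'` and the names `q36` and `by_hand` are illustrative only.

```python
import pandas as pd

# Made-up data: two groups of four evenly spaced values each.
toy = pd.DataFrame({
    'species': ['a'] * 4 + ['b'] * 4,
    'petal_length': [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

# 36th percentile of petal_length within each group.
q36 = toy.groupby('species')['petal_length'].quantile(0.36)

# The same figure by hand for group 'a':
# position = 0.36 * (n - 1) = 0.36 * 3 = 1.08, which lies between the
# sorted values at positions 1 and 2 (2.0 and 3.0).
by_hand = 2.0 + 0.08 * (3.0 - 2.0)

print(q36)
print(by_hand)
```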


In [7]:
# Use .agg to find multiple aggregates
flowers.groupby('species').agg(['var', 'std'])

,sepal_length,sepal_length,sepal_width,sepal_width,petal_length,petal_length,petal_width,petal_width
species,var,std,var,std,var,std,var,std
setosa,0.124249,0.35249,0.14369,0.379064,0.030159,0.173664,0.011106,0.105386
versicolor,0.266433,0.516171,0.098469,0.313798,0.220816,0.469911,0.039106,0.197753
virginica,0.404343,0.63588,0.104004,0.322497,0.304588,0.551895,0.075433,0.27465
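
Passing a list to `.agg` gives the result two levels of column labels: the original column on top and the aggregate underneath. The following sketch, again on a made-up `toy` frame rather than `flowers`, shows one way to pick values out of such a result and, if preferred, to flatten the labels. The `'_'.join` naming scheme is just one possible convention.

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are made up).
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 4.6, 6.3, 7.1],
    'sepal_width': [3.4, 3.0, 2.8, 3.0],
})

stats = toy.groupby('species').agg(['var', 'std'])

# Each column label is an (original column, aggregate) pair.
print(stats.columns.tolist())

# Select a single statistic with a tuple...
sepal_length_std = stats[('sepal_length', 'std')]

# ...or every statistic for one column with the outer label alone.
sepal_length_block = stats['sepal_length']

# Optionally flatten to single-level names such as 'sepal_length_std'.
flat = stats.copy()
flat.columns = ['_'.join(pair) for pair in stats.columns]
print(flat.columns.tolist())
```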


In [8]:
# Find different aggregates for different columns
flowers.groupby('species').agg({
    'sepal_length': ['min', 'max'],
    'sepal_width': ['count']
})


,sepal_length,sepal_length,sepal_width
species,min,max,count
setosa,4.3,5.8,50
versicolor,4.9,7.0,50
virginica,4.9,7.9,50
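
If the two-level column labels produced by the dictionary form are inconvenient, newer pandas releases (0.25 and later, which postdate this post) also accept keyword arguments of the form `new_name=(column, aggregate)`. A minimal sketch on made-up data, with output column names invented for this example:

```python
import pandas as pd

# Hypothetical miniature stand-in for the iris data (values are made up).
toy = pd.DataFrame({
    'species': ['setosa', 'setosa', 'virginica', 'virginica'],
    'sepal_length': [5.0, 4.6, 6.3, 7.1],
    'sepal_width': [3.4, 3.0, 2.8, 3.0],
})

# Named aggregation: each keyword becomes one flat output column.
summary = toy.groupby('species').agg(
    min_sepal_length=('sepal_length', 'min'),
    max_sepal_length=('sepal_length', 'max'),
    sepal_width_count=('sepal_width', 'count'),
)
print(summary)
```

The result has ordinary single-level columns, so `summary['min_sepal_length']` works without tuples.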
