In [1]:
import numpy as np
from numpy import random
import pandas as pd

### Group By Operations

Groupby operations *split* a DataFrame (or Series) into groups.  Once these groups have been formed, a function (usually a reduction) is *applied* to each group.  Lastly, the per-group results are *combined* into a new DataFrame (or Series).

Indeed, the term *split-apply-combine*, first coined by Hadley Wickham, describes the basic idea of groupby operations.
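
To make the three steps concrete, here is a minimal sketch that performs split, apply and combine by hand and then repeats the computation with a single groupby call. The DataFrame `df` and its columns `key` and `value` are hypothetical toy data, not part of this notebook.

```python
import pandas as pd

# Hypothetical toy data (not the pets DataFrame used in this notebook).
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                   'value': [1, 2, 3, 4, 5]})

# Split: one sub-Series of values per distinct key.
pieces = {k: df.loc[df['key'] == k, 'value'] for k in sorted(df['key'].unique())}

# Apply: reduce each piece to a single number.
reduced = {k: piece.sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(reduced, name='value')
manual.index.name = 'key'

# The same three steps in a single groupby call.
grouped = df.groupby('key')['value'].sum()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-key sums; the groupby version simply hides the bookkeeping.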

##### Example 1

In [2]:
pets = pd.DataFrame({'gender':np.random.choice(np.array(['M','F']), size = (10,)),
                    'type':np.random.choice(np.array(['dog','cat']), size = (10,)),
                    'weight':10*np.random.normal(size = (10,))+25,
                    'num_feet':np.random.choice(range(1,5), size = (10,), p=[0.05, 0.05, 0.3, 0.6])})
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


First, we group the animal types by gender.

In [3]:
pets['type'].groupby(pets['gender'])

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001BB905B8700>

The result is a groupby object, which is iterable.  Converting it to a list shows what it contains.

In [4]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


In [5]:
list(pets['type'].groupby(pets['gender']))

[('F',
  3    dog
  5    cat
  6    cat
  9    dog
  Name: type, dtype: object),
 ('M',
  0    dog
  1    dog
  2    cat
  4    dog
  7    cat
  8    dog
  Name: type, dtype: object)]

Now, we apply the *count* method and the results are combined into one Series.

In [6]:
pets['type'].groupby(pets['gender']).count()

gender
F    4
M    6
Name: type, dtype: int64

$\Box$
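
Because a groupby object is iterable, we can also loop over it directly instead of converting it to a list. The sketch below assumes two small hypothetical Series, `s` and `key`, and records the size of each group.

```python
import pandas as pd

# Hypothetical values and group labels, aligned by position.
s = pd.Series([10, 20, 30, 40], name='value')
key = pd.Series(['x', 'y', 'x', 'y'], name='key')

sizes = {}
for label, group in s.groupby(key):
    # label is the group key; group is the sub-Series belonging to that key.
    sizes[label] = len(group)

print(sizes)
```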

##### Example 2

As we saw above, converting a groupby object to a list produces tuples of the form (group, Series), or (group, DataFrame) when a DataFrame is grouped.  Since a list of (key, value) pairs defines a mapping, it makes sense that we can turn that list into a dictionary using the *dict* function.

In [7]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


In [8]:
groups = list(pets['weight'].groupby(pets['num_feet']))
groups

[(2,
  4    24.388946
  Name: weight, dtype: float64),
 (3,
  3    13.499080
  7    28.512175
  Name: weight, dtype: float64),
 (4,
  0    38.194264
  1    30.475912
  2    18.308914
  5     9.060232
  6    23.484639
  8    23.381202
  9    16.084521
  Name: weight, dtype: float64)]

In [9]:
grouped = dict(groups)

In [10]:
grouped[3]

3    13.499080
7    28.512175
Name: weight, dtype: float64

$\Box$
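
As an alternative to building the dictionary ourselves, a groupby object offers the *get_group* method and the *groups* attribute. The sketch below uses hypothetical Series named `weights` and `feet` and checks that both routes return the same group.

```python
import pandas as pd

# Hypothetical data: four weights and the number of feet of each animal.
weights = pd.Series([4.0, 5.5, 6.0, 7.5])
feet = pd.Series([4, 3, 4, 3])

gb = weights.groupby(feet)

# Route 1: build a dictionary from the (key, Series) pairs.
as_dict = dict(list(gb))

# Route 2: ask the groupby object for one group directly.
three = gb.get_group(3)

# groups maps each key to the index labels that belong to it.
labels_for_four = list(gb.groups[4])

print(three)
print(labels_for_four)
```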

We don't even have to group by a column in the DataFrame.  We just have to use a list or array of values that is the same length as the axis being grouped.

##### Example 3

In [11]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


In [12]:
colors = ['black', 'black', 'white', 'black', 'white', 'white', 'brown', 'white', 'brown', 'brown']
colors

['black',
 'black',
 'white',
 'black',
 'white',
 'white',
 'brown',
 'white',
 'brown',
 'brown']

First, we group.

In [13]:
list(pets['weight'].groupby(colors))

[('black',
  0    38.194264
  1    30.475912
  3    13.499080
  Name: weight, dtype: float64),
 ('brown',
  6    23.484639
  8    23.381202
  9    16.084521
  Name: weight, dtype: float64),
 ('white',
  2    18.308914
  4    24.388946
  5     9.060232
  7    28.512175
  Name: weight, dtype: float64)]

Now, we apply and combine using the mean method.

In [14]:
pets['weight'].groupby(colors).mean()

black    27.389752
brown    20.983454
white    20.067567
Name: weight, dtype: float64

$\Box$

We can group by multiple columns.

##### Example 4

In [15]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


First, we group by gender and type.

In [16]:
list(pets['weight'].groupby([pets['gender'], pets['type']]))

[(('F', 'cat'),
  5     9.060232
  6    23.484639
  Name: weight, dtype: float64),
 (('F', 'dog'),
  3    13.499080
  9    16.084521
  Name: weight, dtype: float64),
 (('M', 'cat'),
  2    18.308914
  7    28.512175
  Name: weight, dtype: float64),
 (('M', 'dog'),
  0    38.194264
  1    30.475912
  4    24.388946
  8    23.381202
  Name: weight, dtype: float64)]

Now, we apply the *max* method and combine the results.

In [17]:
pets['weight'].groupby([pets['gender'], pets['type']]).max()

gender  type
F       cat     23.484639
        dog     16.084521
M       cat     28.512175
        dog     38.194264
Name: weight, dtype: float64

$\Box$
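
Grouping by two keys gives a result whose index has two levels. The sketch below, built on a hypothetical `sales` DataFrame, shows one way to pick out a single group with a tuple and to reshape the result into a table with *unstack*.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the pets DataFrame.
sales = pd.DataFrame({
    'store':   ['north', 'north', 'south', 'south', 'north'],
    'product': ['tea', 'coffee', 'tea', 'coffee', 'tea'],
    'units':   [3, 5, 2, 7, 4],
})

# Group one column by two keys; the result has a two-level index.
by_pair = sales['units'].groupby([sales['store'], sales['product']]).max()

# Select a single group with a (store, product) tuple.
north_tea = by_pair.loc[('north', 'tea')]

# Move the inner index level (product) into the columns.
table = by_pair.unstack()

print(by_pair)
print(table)
```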

What happens when we group two columns by one column?

##### Example 6

In [18]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,dog,38.194264,4
1,M,dog,30.475912,4
2,M,cat,18.308914,4
3,F,dog,13.49908,3
4,M,dog,24.388946,2
5,F,cat,9.060232,4
6,F,cat,23.484639,4
7,M,cat,28.512175,3
8,M,dog,23.381202,4
9,F,dog,16.084521,4


In [19]:
list(pets[['gender', 'weight']].groupby(pets['type']))

[('cat',
    gender     weight
  2      M  18.308914
  5      F   9.060232
  6      F  23.484639
  7      M  28.512175),
 ('dog',
    gender     weight
  0      M  38.194264
  1      M  30.475912
  3      F  13.499080
  4      M  24.388946
  8      M  23.381202
  9      F  16.084521)]

In [20]:
pets[['gender', 'weight']].groupby(pets['type']).min()

type,gender,weight
cat,F,9.060232
dog,F,13.49908


$\Box$

The previous example should not be too surprising.  After all, all that really happened is that the new DataFrame defined by pets[['gender', 'weight']] had its rows grouped by type.

Why not group the entire DataFrame by one or more columns?

##### Example 7

In [23]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,cat,22.196734,3
1,M,cat,27.846224,3
2,F,cat,8.565548,4
3,M,cat,22.420509,2
4,M,cat,36.827468,4
5,F,dog,15.219761,3
6,M,cat,7.347463,3
7,M,cat,18.391027,4
8,F,dog,25.621848,4
9,F,dog,24.884675,2


In [24]:
list(pets[pets.columns].groupby(pets['num_feet']))

[(2,
    gender type     weight  num_feet
  3      M  cat  22.420509         2
  9      F  dog  24.884675         2),
 (3,
    gender type     weight  num_feet
  0      M  cat  22.196734         3
  1      M  cat  27.846224         3
  5      F  dog  15.219761         3
  6      M  cat   7.347463         3),
 (4,
    gender type     weight  num_feet
  2      F  cat   8.565548         4
  4      M  cat  36.827468         4
  7      M  cat  18.391027         4
  8      F  dog  25.621848         4)]

Pandas gives us some better syntax for this situation.  Instead of writing pets[pets.columns], we can just write pets.  This makes sense, since pets[pets.columns] == pets.

In [25]:
pets[pets.columns] == pets

Unnamed: 0,gender,type,weight,num_feet
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True
6,True,True,True,True
7,True,True,True,True
8,True,True,True,True
9,True,True,True,True


In [26]:
pets.groupby(pets['num_feet']).count()

num_feet,gender,type,weight
2,2,2,2
3,4,4,4
4,4,4,4


$\Box$

Since we often group by just one column, Pandas has a simpler syntax for that too: we can pass the column name instead of the column itself.

##### Example 8

In [29]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,cat,22.196734,3
1,M,cat,27.846224,3
2,F,cat,8.565548,4
3,M,cat,22.420509,2
4,M,cat,36.827468,4
5,F,dog,15.219761,3
6,M,cat,7.347463,3
7,M,cat,18.391027,4
8,F,dog,25.621848,4
9,F,dog,24.884675,2


In [27]:
pets.groupby('type').sum()

type,weight,num_feet
cat,143.594974,23
dog,65.726285,9


Where did gender go?

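Since a sum doesn't really make sense for strings, the pandas version used here treats such columns as *nuisance* columns and silently excludes them from the result.  In pandas 2.0 and later this no longer happens automatically: non-numeric columns are no longer dropped unless we pass numeric_only=True or select the numeric columns first.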
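
Rather than relying on pandas to drop non-numeric columns for us, we can be explicit about which columns to aggregate. The sketch below uses a hypothetical `animals` DataFrame and shows two options: the numeric_only argument of *sum*, and selecting the column before aggregating.

```python
import pandas as pd

# Hypothetical toy data with one string column besides the grouping key.
animals = pd.DataFrame({
    'type':   ['cat', 'dog', 'cat', 'dog'],
    'gender': ['F', 'M', 'M', 'F'],
    'weight': [4.0, 20.0, 5.0, 18.0],
})

# Option 1: ask sum to consider numeric columns only.
totals = animals.groupby('type').sum(numeric_only=True)

# Option 2: select the column of interest before aggregating.
weight_totals = animals.groupby('type')['weight'].sum()

print(totals)
print(weight_totals)
```

Option 2 returns a Series rather than a DataFrame, which is often all that is needed when only one column matters.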

So far, we have been grouping by a column or list (array).  We can also group by a dictionary.

##### Example 9

In [30]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,F,cat,20.226648,4
1,F,dog,22.828038,4
2,F,dog,35.356432,3
3,M,dog,13.143096,4
4,M,cat,22.251801,4
5,F,dog,11.906721,3
6,F,cat,26.13404,4
7,F,cat,18.395892,4
8,M,cat,18.884273,3
9,M,cat,42.136635,4


In [47]:
mapping = {'gender':'string', 'type':'string', 'weight':'numeric', 'num_feet':'numeric'}

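The default axis to group over is axis = 0.  However, we can also group along the columns by passing axis = 1; the dictionary then maps each column name to a group label.  (Pandas 2.1 and later deprecate the axis argument of groupby.)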

In [48]:
pets.groupby(mapping, axis = 1)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000297A27E7A90>

In [49]:
list(pets.groupby(mapping, axis = 1))

[('numeric',
        weight  num_feet
  0  20.226648         4
  1  22.828038         4
  2  35.356432         3
  3  13.143096         4
  4  22.251801         4
  5  11.906721         3
  6  26.134040         4
  7  18.395892         4
  8  18.884273         3
  9  42.136635         4),
 ('string',
    gender type
  0      F  cat
  1      F  dog
  2      F  dog
  3      M  dog
  4      M  cat
  5      F  dog
  6      F  cat
  7      F  cat
  8      M  cat
  9      M  cat)]

In [50]:
pets.groupby(mapping, axis = 1).sum()

Unnamed: 0,numeric,string
0,24.226648,Fcat
1,26.828038,Fdog
2,38.356432,Fdog
3,17.143096,Mdog
4,26.251801,Mcat
5,14.906721,Fdog
6,30.13404,Fcat
7,22.395892,Fcat
8,21.884273,Mcat
9,46.136635,Mcat


$\Box$
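
The axis argument of groupby is deprecated in recent pandas releases, so here is a sketch of one possible alternative: transpose, group the rows by the dictionary, and transpose back. The `scores` DataFrame and the `mapping` below are hypothetical, and all columns are numeric so that transposing does not change the dtype.

```python
import pandas as pd

# Hypothetical, all-numeric toy data.
scores = pd.DataFrame({'math': [90, 70], 'physics': [80, 60], 'art': [50, 100]})

# Map each column name to the group it belongs to.
mapping = {'math': 'science', 'physics': 'science', 'art': 'humanities'}

# After transposing, the column names are row labels, so the dictionary
# groups rows; transposing again restores the original orientation.
col_sums = scores.T.groupby(mapping).sum().T

print(col_sums)
```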

### Apply

We have used the apply method on a DataFrame to create new columns.  We can use the apply method with groupby as well. 

##### Example 10

In [51]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,cat,22.196734,3
1,M,cat,27.846224,3
2,F,cat,8.565548,4
3,M,cat,22.420509,2
4,M,cat,36.827468,4
5,F,dog,15.219761,3
6,M,cat,7.347463,3
7,M,cat,18.391027,4
8,F,dog,25.621848,4
9,F,dog,24.884675,2


In [28]:
def mean_plus_2(S):
    return S.mean()+2

In [29]:
list(pets['weight'].groupby(pets['gender']))

[('F',
  2     8.565548
  5    15.219761
  8    25.621848
  9    24.884675
  Name: weight, dtype: float64),
 ('M',
  0    22.196734
  1    27.846224
  3    22.420509
  4    36.827468
  6     7.347463
  7    18.391027
  Name: weight, dtype: float64)]

In [30]:
pets['weight'].groupby(pets['gender']).apply(mean_plus_2)

gender
F    20.572958
M    24.504904
Name: weight, dtype: float64

In [31]:
pets

Unnamed: 0,gender,type,weight,num_feet
0,M,cat,22.196734,3
1,M,cat,27.846224,3
2,F,cat,8.565548,4
3,M,cat,22.420509,2
4,M,cat,36.827468,4
5,F,dog,15.219761,3
6,M,cat,7.347463,3
7,M,cat,18.391027,4
8,F,dog,25.621848,4
9,F,dog,24.884675,2


$\Box$
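
Any function that takes a Series and returns a single value can be used with apply in this way. As a further sketch, the hypothetical function `value_range` below computes the spread (maximum minus minimum) within each group of a made-up `heights` Series.

```python
import pandas as pd

def value_range(s):
    # Spread of the values within one group.
    return s.max() - s.min()

# Hypothetical data: five heights and the team of each person.
heights = pd.Series([150.0, 160.0, 170.0, 155.0, 185.0])
team = ['red', 'blue', 'red', 'blue', 'red']

# apply calls value_range once per group and combines the results.
spread = heights.groupby(team).apply(value_range)

print(spread)
```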

##### Exercise 1

For this exercise, use the pets DataFrame.

Use a single groupby operation to display a single DataFrame that shows the average weight for the four different groups in the pets DataFrame:
+ Male Cats
+ Male Dogs
+ Female Cats
+ Female Dogs

##### Exercise 2

Using the crime dataset defined below, display a single Series indexed by 'primary_type' whose values give the percentage of all crimes that belong to each primary_type.

In [23]:
crime = pd.read_csv('ChicagoCrimeData.csv')