Following on from my last blog on Transforming data with Pandas, today in my series on Python pandas I am going to discuss one of the more powerful features of pandas: the groupby object.

It works best with categorical data, as you can effectively subset the data into groups.

GroupBy is a class with its own methods. Many of them use the same names as the series and dataframe methods, which makes it quicker to learn.

One way to think about a groupby object is that it is a dataframe of dataframes.

Once you have made a groupby object, you can then use aggregation to combine the results. This allows you to get statistics for each group.
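As a minimal sketch of that idea, the snippet below groups a small made-up dataframe and takes a mean per group. The `demo` dataframe and its values are assumptions for illustration only; they are not the cars data.

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset
demo = pd.DataFrame({
    'make': ['Fiat', 'Fiat', 'Mazda', 'Mazda', 'Volvo'],
    'mpg': [32.4, 27.3, 21.0, 21.0, 21.4],
})

# Split: one group per distinct value in 'make'
demo_group = demo.groupby('make')

# Apply and combine: one mean per group, indexed by the group name
mean_mpg = demo_group['mpg'].mean()
print(mean_mpg)
```

The result is a series with one entry per make, which is the general shape most aggregations on a single column will give you.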

![](2022-01-03-14-35-39.png)

### Import data - cars.csv

In [2]:
import pandas as pd
import numpy as np
cars = pd.read_csv('cars.csv')
cars.head()

,Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


Let's split up the make and model so we have two categories.

In [3]:

cars[['Unnamed: 0', 'model']] = cars['Unnamed: 0'].str.split(' ', expand=True, n=1)
cars = cars.rename(columns={'Unnamed: 0': 'make'})

cars.head(3)

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
0,Mazda,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,RX4
1,Mazda,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,RX4 Wag
2,Datsun,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,710


In [4]:
print(cars.shape)
cars.dtypes

(32, 13)


make      object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
model     object
dtype: object

Let's produce a groupby object using the make as the category and check it. Displaying it will not show any data, only the type of the pandas object and its memory address.

In [5]:

make_group = cars.groupby('make')
make_group


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016CEA058F70>

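Let's use the size() method to get a count of the rows in each group. Note that the result is ordered by group name rather than by count, so sort_values() can be used.

Note that size() is much the same thing as value_counts(), which we use on a series, except that value_counts() already sorts by count.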

In [6]:

make_group.size().sort_values(ascending=False).head(5)

make
Merc      7
Hornet    2
Toyota    2
Mazda     2
Fiat      2
dtype: int64
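To make the comparison between size() and value_counts() concrete, here is a small sketch on made-up data (the `demo` dataframe is an assumption, not the cars data). It only shows that the counts agree and that the ordering differs.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'make': ['Merc', 'Fiat', 'Merc', 'Mazda', 'Merc', 'Fiat']})

by_size = demo.groupby('make').size()      # ordered by group name
by_counts = demo['make'].value_counts()    # ordered by count, largest first

print(list(by_size.index))
print(by_counts.index[0])

# The counts themselves are the same
print(by_size.to_dict() == by_counts.to_dict())
```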

We can get the first row from each group by using head(1).

Note head(2) will show the first two rows from each group.

I am using a second head() to just return the top five.

In [7]:
make_group.head(1).head()

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
0,Mazda,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,RX4
2,Datsun,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,710
3,Hornet,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,4 Drive
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1,
6,Duster,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4,360


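Let's have a look at the first entry from each of the categories in the groupby object using first(). Note that first() returns the first non-null value in each column, so it only matches the first row when that row has no missing values.

This returns a dataframe with 'make' as the index, sorted by that index. Passing numeric_only=True keeps just the numerical columns.

In [8]:

make_group.first(numeric_only=True).head()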

make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
AMC,15.2,8,304.0,150,3.15,3.435,17.3,0,0,3,2
Cadillac,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4
Camaro,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4
Chrysler,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4
Datsun,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1


Another way to get the first from each group is the nth() method.

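The good thing about nth(0) is that it returns the actual first row of each group with all of its columns, including the text column 'model' that the first() call above left out. Note that in pandas 2.0 and later, nth() keeps the original row index instead of using 'make' as the index.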

In [9]:
make_group.nth(0).head()

make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
AMC,15.2,8,304.0,150,3.15,3.435,17.3,0,0,3,2,Javelin
Cadillac,10.4,8,472.0,205,2.93,5.25,17.98,0,0,3,4,Fleetwood
Camaro,13.3,8,350.0,245,3.73,3.84,15.41,0,0,3,4,Z28
Chrysler,14.7,8,440.0,230,3.23,5.345,17.42,0,0,3,4,Imperial
Datsun,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,710
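The difference between first() and nth(0) only shows up when there are missing values, so here is a small sketch with a made-up dataframe where the first row of one group has no model. The `demo` data, including the second 'Valiant' row, is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the first Valiant row has a missing model
demo = pd.DataFrame({
    'make': ['Valiant', 'Valiant', 'Fiat'],
    'model': [None, 'Custom', '128'],
    'mpg': [18.1, 19.0, 32.4],
})
grouped = demo.groupby('make')

first_values = grouped.first()   # first non-null value in each column
first_rows = grouped.nth(0)      # the actual first row of each group

# first() skips the missing model and picks the next one in the group
print(first_values.loc['Valiant', 'model'])

# nth(0) keeps the row as it is, missing value included
print(first_rows['model'].isna().sum())
```

So if you want "the first row", nth(0) or head(1) is the safer choice, while first() is more of a "first known value per column" summary.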


We can use the groups attribute to see the groups. It returns the index values for all rows in each group. Some makes only have one model in the dataset.

In [10]:

make_group.groups

{'AMC': [22], 'Cadillac': [14], 'Camaro': [23], 'Chrysler': [16], 'Datsun': [2], 'Dodge': [21], 'Duster': [6], 'Ferrari': [29], 'Fiat': [17, 25], 'Ford': [28], 'Honda': [18], 'Hornet': [3, 4], 'Lincoln': [15], 'Lotus': [27], 'Maserati': [30], 'Mazda': [0, 1], 'Merc': [7, 8, 9, 10, 11, 12, 13], 'Pontiac': [24], 'Porsche': [26], 'Toyota': [19, 20], 'Valiant': [5], 'Volvo': [31]}
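Because the values in groups are index labels, they can be passed straight to loc to pull out the matching rows. A minimal sketch on made-up data (the `demo` dataframe is an assumption):

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'make': ['Mazda', 'Mazda', 'Datsun', 'Hornet', 'Hornet'],
    'mpg': [21.0, 21.0, 22.8, 21.4, 18.7],
})

groups = demo.groupby('make').groups

# Index labels of the rows that belong to the 'Hornet' group
print(list(groups['Hornet']))

# Use those labels to look the rows up in the original dataframe
hornet_rows = demo.loc[groups['Hornet']]
print(hornet_rows)
```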

In a past post I discussed filtering. If you want to filter for a single 'make', it could be done like this:

In [11]:
filt = (cars['make'] == 'Merc')
cars[filt]

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
7,Merc,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2,240D
8,Merc,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2,230
9,Merc,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4,280
10,Merc,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4,280C
11,Merc,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3,450SE
12,Merc,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3,450SL
13,Merc,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3,450SLC


But if you want to do this for more than one make, writing a separate filter each time would be tedious.

So this is where groupby can come in handy.

Get all rows for the 'Merc' group by using the get_group() method.

Remember we made the groupby object by running:  **make_group = cars.groupby('make')**

In [12]:

make_group.get_group('Merc').head(3)

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
7,Merc,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2,240D
8,Merc,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2,230
9,Merc,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4,280


But once you have the groupby object, you can quickly get access to any or all of the groups.

In [13]:
make_group.get_group('Fiat')

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
17,Fiat,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1,128
25,Fiat,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1,X1-9
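As a quick check that get_group() and a boolean filter pick out the same rows, here is a sketch on made-up data (the `demo` dataframe is an assumption). The loop at the end shows how one groupby object can serve several makes without writing a new filter for each.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'make': ['Fiat', 'Merc', 'Fiat', 'Merc', 'Volvo'],
    'mpg': [32.4, 24.4, 27.3, 22.8, 21.4],
})
demo_group = demo.groupby('make')

via_filter = demo[demo['make'] == 'Fiat']
via_group = demo_group.get_group('Fiat')

# Both approaches select the same row labels
print(via_filter.index.tolist() == via_group.index.tolist())

# Reuse the same groupby object for several makes
rows_per_make = {name: len(demo_group.get_group(name)) for name in ['Fiat', 'Merc']}
print(rows_per_make)
```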


The groupby object can be sliced down to selected columns before calling get_group().

In [27]:
make_group[['make', 'mpg']].get_group('Fiat')

,make,mpg
17,Fiat,32.4
25,Fiat,27.3


Show a list of the names of all the groups. This is like calling unique() on a series.

In [15]:

make_group.groups.keys()

dict_keys(['AMC', 'Cadillac', 'Camaro', 'Chrysler', 'Datsun', 'Dodge', 'Duster', 'Ferrari', 'Fiat', 'Ford', 'Honda', 'Hornet', 'Lincoln', 'Lotus', 'Maserati', 'Mazda', 'Merc', 'Pontiac', 'Porsche', 'Toyota', 'Valiant', 'Volvo'])
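A small sketch of how the group names relate to the series methods, again on made-up data (the `demo` dataframe is an assumption): the names match unique(), while the number of groups matches nunique().

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'make': ['Mazda', 'Fiat', 'Mazda', 'Volvo', 'Fiat']})
demo_group = demo.groupby('make')

names_from_groups = sorted(demo_group.groups.keys())
names_from_unique = sorted(demo['make'].unique())
print(names_from_groups == names_from_unique)

# ngroups gives the number of groups, the same as nunique() on the column
print(demo_group.ngroups, demo['make'].nunique())
```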

## Aggregation

So in the Pandas documentation it says "A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."

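This is often called split-apply-combine, and aggregation is one way of doing the apply and combine steps.

Let's use max on the 'mpg' column. Selecting 'mpg' from make_group gives a series groupby object, and max() then returns a pandas series with the largest 'mpg' value for each make.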

In [16]:

make_group['mpg'].max().head()

make
AMC         15.2
Cadillac    10.4
Camaro      13.3
Chrysler    14.7
Datsun      22.8
Name: mpg, dtype: float64

Now let's make a groupby object that groups on two columns.

In [17]:
power = cars.groupby(['cyl', 'hp'])


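If we use the max() function now, it will bring back the max of every other column for each combination of 'cyl' and 'hp'. For text columns such as 'make' and 'model' that means the alphabetically last value. The result is a dataframe with a MultiIndex built from 'cyl' and 'hp'.

So this is giving us the column maximums for each cylinder and horsepower combination. Note that each column is maximised on its own, so a result row is not necessarily one car from the data.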

In [18]:

power.max().head(10)

cyl,hp,make,mpg,disp,drat,wt,qsec,vs,am,gear,carb,model
4,52,Honda,30.4,75.7,4.93,1.615,18.52,1,1,4,2,Civic
4,62,Merc,24.4,146.7,3.69,3.19,20.0,1,0,4,2,240D
4,65,Toyota,33.9,71.1,4.22,1.835,19.9,1,1,4,1,Corolla
4,66,Fiat,32.4,79.0,4.08,2.2,19.47,1,1,4,1,X1-9
4,91,Porsche,26.0,120.3,4.43,2.14,16.7,0,1,5,2,914-2
4,93,Datsun,22.8,108.0,3.85,2.32,18.61,1,1,4,1,710
4,95,Merc,22.8,140.8,3.92,3.15,22.9,1,0,4,2,230
4,97,Toyota,21.5,120.1,3.7,2.465,20.01,1,0,3,1,Corona
4,109,Volvo,21.4,121.0,4.11,2.78,18.6,1,1,4,2,142E
4,113,Lotus,30.4,95.1,3.77,1.513,16.9,1,1,5,2,Europa
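The (4, 66) row above is worth a closer look, because two cars share that group. The sketch below rebuilds just those rows by hand (the `demo` dataframe is a cut-down assumption using a few values from the tables in this post) to show that max() works column by column.

```python
import pandas as pd

# Cut-down toy data: two cars with cyl=4 and hp=66, one with cyl=4 and hp=52
demo = pd.DataFrame({
    'cyl': [4, 4, 4],
    'hp': [66, 66, 52],
    'mpg': [32.4, 27.3, 30.4],
    'disp': [78.7, 79.0, 75.7],
})

col_max = demo.groupby(['cyl', 'hp']).max()
print(col_max)

# For the (4, 66) group the max mpg comes from the first car
# and the max disp comes from the second car
print(col_max.loc[(4, 66), 'mpg'], col_max.loc[(4, 66), 'disp'])
```

So the (4, 66) result row mixes values from two different cars, which is fine for summary statistics but not if you wanted a single real row.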


## Agg Method

The agg() method gives more control, as it allows different aggregations on each column. You can even pass a list of aggregations for one column.

In [19]:

power.agg({'mpg': 'max',
            'hp': ['max', 'mean']}).head(3)

cyl,hp,mpg (max),hp (max),hp (mean)
4,52,30.4,52,52.0
4,62,24.4,62,62.0
4,65,33.9,65,65.0


In [20]:
power.agg(['sum', 'max']).head(3)

cyl,hp,make (sum),make (max),mpg (sum),mpg (max),disp (sum),disp (max),drat (sum),drat (max),wt (sum),wt (max),...,vs (sum),vs (max),am (sum),am (max),gear (sum),gear (max),carb (sum),carb (max),model (sum),model (max)
4,52,Honda,Honda,30.4,30.4,75.7,75.7,4.93,4.93,1.615,1.615,...,1,1,1,1,4,4,2,2,Civic,Civic
4,62,Merc,Merc,24.4,24.4,146.7,146.7,3.69,3.69,3.19,3.19,...,1,1,0,0,4,4,2,2,240D,240D
4,65,Toyota,Toyota,33.9,33.9,71.1,71.1,4.22,4.22,1.835,1.835,...,1,1,1,1,4,4,1,1,Corolla,Corolla
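The two-level column headers that agg() produces with dictionaries and lists can be awkward to work with. One alternative is pandas named aggregation, where each output column gets its own name. This is a sketch on made-up data; the `demo` dataframe and the output names `max_mpg` and `mean_hp` are my own choices for illustration.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'cyl': [4, 4, 6, 6],
    'mpg': [22.8, 24.4, 21.0, 21.4],
    'hp': [93, 62, 110, 110],
})

# Named aggregation: output_name=(column, aggregation)
summary = demo.groupby('cyl').agg(
    max_mpg=('mpg', 'max'),
    mean_hp=('hp', 'mean'),
)
print(summary)
```

The result has flat column names, so `summary['max_mpg']` works without needing a tuple.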


## Iterating over groups

Sometimes you find you have no choice but to iterate over groups. Let's see how to do this.

Let's make a new dataframe and give it the same column names as the original cleaned dataframe.

In [21]:
len(cars)

32

In [22]:
cars2 = pd.DataFrame(columns=cars.columns)
cars2

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model


Iterating allows you to perform any action group by group. Note this will be slow on large dataframes.

I will use a Python for loop.

Using the cars dataframe again to get the row with the highest 'mpg' from each 'make' group.

In [23]:
# make_group is the groupby object
# make is the group name (one make at a time)
# data is the dataframe holding the rows for that make

for make, data in make_group:
    highest_mpg_in_each_group = data.nlargest(1, 'mpg')
    # DataFrame.append() was removed in pandas 2.0, so pd.concat() is used instead
    cars2 = pd.concat([cars2, highest_mpg_in_each_group])
cars2

# lets sort by mpg
cars2.sort_values(by='mpg', ascending=False)

,make,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,model
19,Toyota,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1,Corolla
17,Fiat,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1,128
27,Lotus,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2,Europa
18,Honda,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2,Civic
26,Porsche,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2,914-2
7,Merc,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2,240D
2,Datsun,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,710
3,Hornet,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,4 Drive
31,Volvo,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2,142E
0,Mazda,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,RX4
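For this particular task the loop can usually be avoided. A possible alternative, sketched here on made-up data (the `demo` dataframe is an assumption), is to ask each group for the index label of its highest 'mpg' with idxmax() and then look those rows up with loc.

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({
    'make': ['Fiat', 'Fiat', 'Merc', 'Merc', 'Volvo'],
    'model': ['128', 'X1-9', '240D', '230', '142E'],
    'mpg': [32.4, 27.3, 24.4, 22.8, 21.4],
})

# Index label of the highest-mpg row within each make
best_index = demo.groupby('make')['mpg'].idxmax()

# Look those rows up in the original dataframe and sort by mpg
best = demo.loc[best_index].sort_values(by='mpg', ascending=False)
print(best)
```

When two rows in a group tie on 'mpg', idxmax() keeps the first one it finds, which is also what nlargest(1, 'mpg') does by default.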


Right, that is all for today. Hopefully this was interesting for you.