In [3]:
import polars as pl

csv_file = 'Titanic.csv'

df = pl.read_csv(csv_file)
df.head(2)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
1,0,3,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
2,1,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""


### The groupby object

Calling the groupby method with a column creates a GroupBy object.

In [4]:
df.groupby('Pclass')

<polars.dataframe.groupby.GroupBy at 0x7fac7c325a90>

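Looping over a GroupBy object yields a tuple for each group, with the group key first and the DataFrame for that group second. In the example below we take the DataFrame (the element at index 1) and print the column means for each group.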

In [5]:
for group in df.groupby('Pclass'):
  print(group[1].mean())

shape: (1, 12)
┌─────────────┬──────────┬────────┬──────┬───┬────────┬───────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name ┆ … ┆ Ticket ┆ Fare      ┆ Cabin ┆ Embarked │
│ ---         ┆ ---      ┆ ---    ┆ ---  ┆   ┆ ---    ┆ ---       ┆ ---   ┆ ---      │
│ f64         ┆ f64      ┆ f64    ┆ str  ┆   ┆ str    ┆ f64       ┆ str   ┆ str      │
╞═════════════╪══════════╪════════╪══════╪═══╪════════╪═══════════╪═══════╪══════════╡
│ 461.597222  ┆ 0.62963  ┆ 1.0    ┆ null ┆ … ┆ null   ┆ 84.154688 ┆ null  ┆ null     │
└─────────────┴──────────┴────────┴──────┴───┴────────┴───────────┴───────┴──────────┘
shape: (1, 12)
┌─────────────┬──────────┬────────┬──────┬───┬────────┬───────────┬───────┬──────────┐
│ PassengerId ┆ Survived ┆ Pclass ┆ Name ┆ … ┆ Ticket ┆ Fare      ┆ Cabin ┆ Embarked │
│ ---         ┆ ---      ┆ ---    ┆ ---  ┆   ┆ ---    ┆ ---       ┆ ---   ┆ ---      │
│ f64         ┆ f64      ┆ f64    ┆ str  ┆   ┆ str    ┆ f64       ┆ str   ┆ str      │
╞════════════

### Group indices and values

We get the group keys and the row indices for each group with _groups. Note, however, that because Polars computes the groups in parallel, the order of the groups can change from run to run. To get the same order every time, pass maintain_order=True to the groupby method.

The head method, when applied to the grouped data, returns the first n rows from each group. In the example below we return the first 2 rows from each group.

In [6]:
df.groupby('Pclass').head(2)

Pclass,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,i64,str,str,f64,i64,i64,str,f64,str,str
2,10,1,"""Nasser, Mrs. N…","""female""",14.0,1,0,"""237736""",30.0708,,"""C"""
2,16,1,"""Hewlett, Mrs. …","""female""",55.0,0,0,"""248706""",16.0,,"""S"""
1,2,1,"""Cumings, Mrs. …","""female""",38.0,1,0,"""PC 17599""",71.2833,"""C85""","""C"""
1,4,1,"""Futrelle, Mrs.…","""female""",35.0,1,0,"""113803""",53.1,"""C123""","""S"""
3,1,0,"""Braund, Mr. Ow…","""male""",22.0,1,0,"""A/5 21171""",7.25,,"""S"""
3,3,1,"""Heikkinen, Mis…","""female""",26.0,0,0,"""STON/O2. 31012…",7.925,,"""S"""


### Aggregations

In eager mode we can call aggregations directly on the GroupBy object. Some aggregations, such as count, work on the rows of each group, while others, such as mean, are applied to each column.

In [7]:
# Rows

df.groupby('Pclass').count()

Pclass,count
i64,u32
3,491
1,216
2,184


In [8]:
# Columns

df.groupby('Pclass').mean()

Pclass,PassengerId,Survived,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,f64,f64,str,str,f64,f64,f64,str,f64,str,str
2,445.956522,0.472826,,,29.87763,0.402174,0.380435,,20.662183,,
1,461.597222,0.62963,,,38.233441,0.416667,0.356481,,84.154687,,
3,439.154786,0.242363,,,25.14062,0.615071,0.393075,,13.67555,,


We can also group on multiple columns by passing a list of column names.

In [9]:
df.groupby(['Pclass', 'Survived']).mean()

Pclass,Survived,PassengerId,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
i64,i64,f64,str,str,f64,f64,f64,str,f64,str,str
1,1,491.772059,,,35.368197,0.492647,0.389706,,95.608029,,
2,1,439.08046,,,25.901566,0.494253,0.643678,,22.0557,,
2,0,452.123711,,,33.544444,0.319588,0.14433,,19.412328,,
3,1,394.058824,,,20.646118,0.436975,0.420168,,13.694887,,
1,0,410.3,,,43.695312,0.2875,0.3,,64.684008,,
3,0,453.580645,,,26.555556,0.672043,0.384409,,13.669364,,


### Agg method

The agg method allows for more flexibility. Some of the most common methods that can be called on groups include:

* first - get the first element of each group
* last - get the last element of each group
* n_unique - get the number of unique elements in each group
* count - get the number of elements in each group
* sum - sum the elements in each group
* min - get the smallest element in each group
* max - get the largest element in each group
* mean - get the average of elements in each group
* median - get the median in each group
* quantile - calculate quantiles in each group

We aggregate by calling agg after groupby. Notice that we must use expressions inside agg.

In [10]:
# Agg on a single column

df.groupby('Pclass').agg(pl.col('Age').mean())

Pclass,Age
i64,f64
2,29.87763
1,38.233441
3,25.14062


In [11]:
# Agg with sorting

df.groupby('Pclass').agg(pl.col('Age').mean()).sort('Pclass')

Pclass,Age
i64,f64
1,38.233441
2,29.87763
3,25.14062


### Groupby on an expression

One of the advantages of grouping by an expression instead of a column is that we can transform the column before grouping. For example, we may want to group by the Age column in decades instead of individual years. To do this we must:
* Convert the Age column from years to decades
* Cast the output to integer
* Group by the decades

In [12]:
# Newer Polars releases use descending instead of reverse in sort
df.groupby((pl.col('Age') / 10).round(0).cast(pl.Int64).alias('Decade')) \
  .agg(pl.col('Fare').mean()) \
  .sort('Decade', descending=True)

### Applying a filter on an aggregation

We may want to filter the results after the aggregation so that only some of the aggregates appear in the output. In SQL, for example, this is done with the HAVING clause. In Polars we apply a filter after calling agg.

In the example below we get the average fare by passenger class but only if the average fare is greater than 20.

In [13]:
df.groupby('Pclass') \
.agg(pl.col('Fare').mean()) \
.filter(pl.col('Fare') > 20) \
.sort('Fare')

Pclass,Fare
i64,f64
2,20.662183
1,84.154687


### Aggregations in a list

We can pass a list of expressions to agg to compute several aggregations at once. When there are multiple aggregations Polars calculates them in parallel.

In [14]:
df.groupby('Pclass').agg([pl.col('Age').mean(), pl.col('Fare').max()])

Pclass,Age,Fare
i64,f64,f64
1,38.233441,512.3292
2,29.87763,73.5
3,25.14062,69.55


### Multiple aggregations on a column

Calling multiple aggregations on the same column produces columns of the same name. We use alias to ensure the column names are unique.

In [16]:
df.groupby('Pclass').agg([pl.col('Age').min().alias('Age_min'), pl.col('Age').mean().alias('Age_mean'), pl.col('Age').max().alias('Age_max')])

Pclass,Age_min,Age_mean,Age_max
i64,f64,f64,f64
3,0.42,25.14062,74.0
1,0.92,38.233441,80.0
2,0.67,29.87763,70.0


### Same aggregation on multiple columns

There are more concise ways to write multiple columns and/or aggregations in agg. For example, to apply the same aggregation to multiple columns we can loop over the column names in a list comprehension.

In [18]:
df.groupby('Pclass').agg([pl.col(colName).mean() for colName in ['Age', 'Fare']])

Pclass,Age,Fare
i64,f64,f64
3,25.14062,13.67555
2,29.87763,20.662183
1,38.233441,84.154687


Alternatively, we can use the methods for selecting multiple columns that we met previously, including:
* Passing several column names to pl.col
* Passing a dtype to pl.col
* Passing a regex to pl.col

### Multiple aggregations on multiple columns

Using alias is tedious for multiple aggregations on multiple columns. Instead, we add a prefix or suffix to the column name. Here's an example with a suffix.

In [19]:
df.groupby('Pclass').agg([pl.col(pl.Float64).mean().suffix('_mean'), pl.col(pl.Float64).min().suffix('_min')])

Pclass,Age_mean,Fare_mean,Age_min,Fare_min
i64,f64,f64,f64,f64
3,25.14062,13.67555,0.42,0.0
2,29.87763,20.662183,0.67,0.0
1,38.233441,84.154687,0.92,0.0


### Creating a LazyGroupBy object

We create a LazyGroupBy object by calling groupby on a LazyFrame.

In [21]:
pl.scan_csv(csv_file).groupby('Pclass')

<polars.lazyframe.groupby.LazyGroupBy at 0x7fac5a2646a0>

The only way to do aggregations on a LazyGroupBy object is with the agg method. Calling agg converts the LazyGroupBy to a LazyFrame.

In [23]:
pl.scan_csv(csv_file).groupby('Pclass').agg(pl.col('Age').mean())

### Optimized plan example

In [25]:
print(pl.scan_csv(csv_file).groupby('Pclass').agg(pl.col('Age').mean()).describe_optimized_plan().replace('FROM ', 'FROM\n'))

AGGREGATE
	[col("Age").mean()] BY [col("Pclass")] FROM
	
  CSV SCAN Titanic.csv
  PROJECT 2/12 COLUMNS


