# Polars Notebook — Unsolved

Cheat sheet: https://franzdiebold.github.io/polars-cheat-sheet/Polars_cheat_sheet.pdf

## 0. Setup — Working at Scale

We are now working with **larger datasets**.

Goal:
- Understand lazy execution
- Optimize query plans
- Work with joins, windows and multi-step pipelines

We generate:
- `products`: small dimension table
- `sales`: large fact table

Both tables are written to Parquet so that later sections can read them lazily with `pl.scan_parquet`.


In [1]:
%pip install polars


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [2]:
import polars as pl
import polars.selectors as cs
from datetime import datetime

In [3]:
# ----------------------------
# Dataset sizes
# ----------------------------
N_SALES = 300_000
N_PRODUCTS = 2_000

cities = ["Valencia", "Madrid", "Sevilla", "Bilbao", "Barcelona"]
categories = ["Electronics", "Fashion", "Home", "Sports", "Toys"]


In [4]:
# ----------------------------
# Products dimension
# ----------------------------
products = pl.DataFrame({
    "product_id": pl.int_range(1, N_PRODUCTS + 1, eager=True),
    "category": [categories[i % len(categories)] for i in range(N_PRODUCTS)],
    "is_discontinued": (pl.int_range(1, N_PRODUCTS + 1, eager=True) % 40 == 0),
})

products.write_parquet("products.parquet")

In [5]:
print("Products — sample")
products.head(10)

Products — sample


product_id,category,is_discontinued
i64,str,bool
1,"""Electronics""",False
2,"""Fashion""",False
3,"""Home""",False
4,"""Sports""",False
5,"""Toys""",False
6,"""Electronics""",False
7,"""Fashion""",False
8,"""Home""",False
9,"""Sports""",False
10,"""Toys""",False


In [6]:
# ----------------------------
# Sales fact table
# ----------------------------
sales = pl.DataFrame({
    "sale_id": pl.int_range(1, N_SALES + 1, eager=True),
    "product_id": (pl.int_range(1, N_SALES + 1, eager=True) * 37 % (N_PRODUCTS + 200)) + 1,
    "user_id": (pl.int_range(1, N_SALES + 1, eager=True) * 13) % 50_000,
    "units": (pl.int_range(1, N_SALES + 1, eager=True) * 7 % 5) + 1,
    "city": [cities[i % len(cities)] for i in range(N_SALES)],
    "has_discount": (pl.int_range(1, N_SALES + 1, eager=True) % 10 == 0),
    "ts": pl.datetime_range(
        datetime(2025, 1, 1),
        datetime(2025, 3, 31),
        interval="1s",
        eager=True,
    )[:N_SALES],
})

sales = sales.with_columns(
    (pl.col("units") * ((pl.col("sale_id") % 200) + 5)).alias("gross_value")
)

sales.write_parquet("sales.parquet")


In [7]:
print("Sales — sample")
sales.head(10)

Sales — sample


sale_id,product_id,user_id,units,city,has_discount,ts,gross_value
i64,i64,i64,i64,str,bool,datetime[μs],i64
1,38,13,3,"""Valencia""",False,2025-01-01 00:00:00,18
2,75,26,5,"""Madrid""",False,2025-01-01 00:00:01,35
3,112,39,2,"""Sevilla""",False,2025-01-01 00:00:02,16
4,149,52,4,"""Bilbao""",False,2025-01-01 00:00:03,36
5,186,65,1,"""Barcelona""",False,2025-01-01 00:00:04,10
6,223,78,3,"""Valencia""",False,2025-01-01 00:00:05,33
7,260,91,5,"""Madrid""",False,2025-01-01 00:00:06,60
8,297,104,2,"""Sevilla""",False,2025-01-01 00:00:07,26
9,334,117,4,"""Bilbao""",False,2025-01-01 00:00:08,56
10,371,130,1,"""Barcelona""",True,2025-01-01 00:00:09,15


## 1. Lazy Pipeline Optimization

Build a lazy pipeline that:
- filters data
- creates derived columns
- aggregates results
- sorts output

Then inspect the execution plan with `explain()`.


In [8]:
# 1) scan sales.parquet lazily
# 2) filter rows with high gross_value (> 1000)
# 3) create net_value applying a 10% discount when needed
# 4) group by city
# 5) aggregate:
#    - number of orders
#    - total revenue
# 6) sort by revenue descending
# 7) inspect the execution plan

lazy_sales = pl.scan_parquet("sales.parquet")

pipeline = (
    lazy_sales
    .filter(pl.col("gross_value") > 1000)
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .group_by("city")
    .agg([
        pl.len().alias("num_orders"),  # pl.count() is deprecated since 0.20.5
        pl.col("net_value").sum().alias("total_revenue")
    ])
)

print(pipeline.explain())
print(pipeline.collect())

AGGREGATE[maintain_order: false]
  [len().alias("num_orders"), col("net_value").sum().alias("total_revenue")] BY [col("city")]
  FROM
   WITH_COLUMNS:
   [when(col("has_discount")).then([(col("gross_value").cast(Unknown(Float))) * (dyn float: 0.9)]).otherwise(col("gross_value").strict_cast(Unknown(Float))).alias("net_value")] 
    Parquet SCAN [sales.parquet]
    PROJECT 3/8 COLUMNS
    SELECTION: [(col("gross_value")) > (1000)]
    ESTIMATED ROWS: 300000
shape: (1, 3)
┌────────┬────────────┬───────────────┐
│ city   ┆ num_orders ┆ total_revenue │
│ ---    ┆ ---        ┆ ---           │
│ str    ┆ u32        ┆ f64           │
╞════════╪════════════╪═══════════════╡
│ Madrid ┆ 1500       ┆ 1.515e6       │
└────────┴────────────┴───────────────┘
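
The single-row result (only Madrid, 1500 orders) follows from the deterministic formulas used to generate `sales`. The check below is a pure-Python replay of those formulas from the setup cells, without Polars. It is a sanity check on the arithmetic, not part of the exercise solution.

```python
# Pure-Python replay of the generator formulas from the setup cells.
N_SALES = 300_000
cities = ["Valencia", "Madrid", "Sevilla", "Bilbao", "Barcelona"]

matches = []
for sale_id in range(1, N_SALES + 1):
    units = (sale_id * 7 % 5) + 1
    gross_value = units * ((sale_id % 200) + 5)
    if gross_value > 1000:
        city = cities[(sale_id - 1) % len(cities)]  # row index is sale_id - 1
        has_discount = sale_id % 10 == 0
        matches.append((city, gross_value, has_discount))

num_rows = len(matches)
match_cities = {city for city, _, _ in matches}
total_gross = sum(gross for _, gross, _ in matches)
any_discount = any(flag for _, _, flag in matches)

print(num_rows, match_cities, total_gross, any_discount)
```

Only sales with `sale_id % 200 == 197` reach a `gross_value` above 1000 (5 units times 202). Those rows all fall on Madrid and never carry a discount, which matches the 1500 orders and 1.515e6 revenue in the output.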


## 2. Advanced Joins

Join sales with products.

Goal:
- enrich data
- detect invalid product references


In [None]:
sales_l = pl.scan_parquet("sales.parquet")
products_l = pl.scan_parquet("products.parquet")

enriched_sales = sales_l.join(products_l, on="product_id", how="inner")

invalid_sales = sales_l.join(products_l, on="product_id", how="anti")

print("Inner join row count:", enriched_sales.collect().height)
print("Invalid product_id row count:", invalid_sales.collect().height)

Inner join row count: 272730
Invalid product_id row count: 27270



## 3. Window Functions

Compute metrics that depend on group context:
- running totals
- lag values


In [19]:
# Create net_value, sort by timestamp, compute running total and previous value per user
sales_l = pl.scan_parquet("sales.parquet")

cleaned = (
    sales_l
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .sort("ts")
)

windows = cleaned.with_columns([
    pl.col("net_value").cum_sum().over("user_id").alias("running_total"),
    pl.col("net_value").shift(1).over("user_id").alias("prev_value")
])

windows.select(["user_id", "net_value", "running_total", "prev_value"]).head().collect()

user_id,net_value,running_total,prev_value
i64,f64,f64,f64
13,18.0,18.0,
26,35.0,35.0,
39,16.0,16.0,
52,36.0,36.0,
65,10.0,10.0,



## 4. String & Date Processing

Clean and extract features from text and timestamps.


In [None]:
sales_l = pl.scan_parquet("sales.parquet")

result = sales_l.with_columns([
    pl.col("city").str.to_lowercase().alias("city_norm"),
    pl.col("ts").dt.year().alias("year"),
    pl.col("ts").dt.month().alias("month")
]).select(["city", "city_norm", "year", "month"]).head().collect()
result

city,city_norm,year,month
str,str,i32,i8
"""Valencia""","""valencia""",2025,1
"""Madrid""","""madrid""",2025,1
"""Sevilla""","""sevilla""",2025,1
"""Bilbao""","""bilbao""",2025,1
"""Barcelona""","""barcelona""",2025,1


## 5. End-to-End Pipeline

Build a single optimized pipeline that:
- cleans data
- joins products
- computes revenue
- aggregates KPIs


In [22]:
products_l = pl.scan_parquet("products.parquet")
sales_l = pl.scan_parquet("sales.parquet")

final = (
    sales_l
    .join(products_l, on="product_id", how="inner")
    .filter(~pl.col("is_discontinued"))
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .group_by("category")
    .agg([
        pl.len().alias("num_orders"),  # pl.count() is deprecated since 0.20.5
        pl.col("net_value").sum().alias("total_revenue")
    ])
    .sort("total_revenue", descending=True)
)

print(final.explain())
print(final.collect())

SORT BY [descending: [true]] [col("total_revenue")]
  AGGREGATE[maintain_order: false]
    [len().alias("num_orders"), col("net_value").sum().alias("total_revenue")] BY [col("category")]
    FROM
     WITH_COLUMNS:
     [when(col("has_discount")).then([(col("gross_value").cast(Unknown(Float))) * (dyn float: 0.9)]).otherwise(col("gross_value").strict_cast(Unknown(Float))).alias("net_value")] 
      simple π 4/4 ["category", "gross_value", ... 2 other columns]
        INNER JOIN:
        LEFT PLAN ON: [col("product_id")]
          Parquet SCAN [sales.parquet]
          PROJECT 3/8 COLUMNS
          ESTIMATED ROWS: 300000
        RIGHT PLAN ON: [col("product_id")]
          Parquet SCAN [products.parquet]
          PROJECT 3/3 COLUMNS
          SELECTION: col("is_discontinued").not()
          ESTIMATED ROWS: 2000
        END INNER JOIN
shape: (5, 3)
┌─────────────┬────────────┬───────────────┐
│ category    ┆ num_orders ┆ total_revenue │
│ ---         ┆ ---        ┆ ---           │
│ str


## Exercise 6 — Predicate & Projection Pushdown

Build a lazy pipeline that:

- reads sales data
- selects only the necessary columns
- filters rows as early as possible

Inspect the execution plan and identify:
- predicate pushdown
- projection pruning


In [None]:
pipeline = (
    pl.scan_parquet("sales.parquet")
    .select(["city", "gross_value", "has_discount"])
    .filter(pl.col("gross_value") > 5000)
)

print(pipeline.explain())
pipeline.collect()

Parquet SCAN [sales.parquet]
PROJECT 3/8 COLUMNS
SELECTION: [(col("gross_value")) > (5000)]
ESTIMATED ROWS: 300000


city,gross_value,has_discount
str,i64,bool
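
The empty result is consistent with how `gross_value` is generated: `units` is at most 5 and the multiplier `(sale_id % 200) + 5` is at most 204, so no row can come close to 5000. A quick pure-Python check of the actual maximum, replaying the setup formulas:

```python
# Replay the generator formulas to find the largest gross_value in the data.
N_SALES = 300_000

max_gross = max(
    ((sale_id * 7 % 5) + 1) * ((sale_id % 200) + 5)
    for sale_id in range(1, N_SALES + 1)
)
print(max_gross)  # → 1010
```

A lower threshold would be needed to see rows in the output. The plan itself is unaffected by the threshold, so the pushdown can still be read from it.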


## Exercise 7 — Join + Filter Pushdown

Join sales with products and filter out discontinued products.

Inspect the execution plan and verify:
- join reordering
- filter pushdown on the dimension table


In [None]:
sales_l = pl.scan_parquet("sales.parquet")
products_l = pl.scan_parquet("products.parquet")
pipeline = (
    sales_l
    .join(products_l, on="product_id", how="inner")
    .filter(~pl.col("is_discontinued"))
)

print(pipeline.explain())
pipeline.collect()

INNER JOIN:
LEFT PLAN ON: [col("product_id")]
  Parquet SCAN [sales.parquet]
  PROJECT 8/8 COLUMNS
  ESTIMATED ROWS: 300000
RIGHT PLAN ON: [col("product_id")]
  Parquet SCAN [products.parquet]
  PROJECT 3/3 COLUMNS
  SELECTION: col("is_discontinued").not()
  ESTIMATED ROWS: 2000
END INNER JOIN


sale_id,product_id,user_id,units,city,has_discount,ts,gross_value,category,is_discontinued
i64,i64,i64,i64,str,bool,datetime[μs],i64,str,bool
1,38,13,3,"""Valencia""",false,2025-01-01 00:00:00,18,"""Home""",false
2,75,26,5,"""Madrid""",false,2025-01-01 00:00:01,35,"""Toys""",false
3,112,39,2,"""Sevilla""",false,2025-01-01 00:00:02,16,"""Fashion""",false
4,149,52,4,"""Bilbao""",false,2025-01-01 00:00:03,36,"""Sports""",false
5,186,65,1,"""Barcelona""",false,2025-01-01 00:00:04,10,"""Electronics""",false
…,…,…,…,…,…,…,…,…,…
299996,853,49948,3,"""Valencia""",false,2025-01-04 11:19:55,603,"""Home""",false
299997,890,49961,5,"""Madrid""",false,2025-01-04 11:19:56,1010,"""Toys""",false
299998,927,49974,2,"""Sevilla""",false,2025-01-04 11:19:57,406,"""Fashion""",false
299999,964,49987,4,"""Bilbao""",false,2025-01-04 11:19:58,816,"""Sports""",false


## Exercise 8 — Window Aggregation vs GroupBy

Compute total revenue per city using:

- a group_by aggregation
- a window function

Compare the results conceptually.


In [30]:
# Compute net_value, revenue per city using group_by and window functions
sales_l = pl.scan_parquet("sales.parquet")

# group_by version
grouped = (
    sales_l
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .group_by("city")
    .agg([
        pl.col("net_value").sum().alias("revenue")
    ])
)

# window version
# Note: with_columns is split in two because "city_revenue" depends on "net_value",
# which cannot be referenced inside the same with_columns call that creates it
windowed = (
    sales_l
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .with_columns([
        pl.col("net_value").sum().over("city").alias("city_revenue")
    ])
)

print(grouped.collect())
print(windowed.select(["city", "city_revenue"]).head().collect())

shape: (5, 2)
┌───────────┬─────────┐
│ city      ┆ revenue │
│ ---       ┆ ---     │
│ str       ┆ f64     │
╞═══════════╪═════════╡
│ Madrid    ┆ 3.135e7 │
│ Sevilla   ┆ 1.266e7 │
│ Bilbao    ┆ 2.556e7 │
│ Barcelona ┆ 5.85e6  │
│ Valencia  ┆ 1.863e7 │
└───────────┴─────────┘
shape: (5, 2)
┌───────────┬──────────────┐
│ city      ┆ city_revenue │
│ ---       ┆ ---          │
│ str       ┆ f64          │
╞═══════════╪══════════════╡
│ Valencia  ┆ 1.863e7      │
│ Madrid    ┆ 3.135e7      │
│ Sevilla   ┆ 1.266e7      │
│ Bilbao    ┆ 2.556e7      │
│ Barcelona ┆ 5.85e6       │
└───────────┴──────────────┘


## Exercise 9 — Top-N per Group

For each city, return the top 3 users by total revenue.

Use window functions to rank users within each city.


In [35]:
# For each city, return the top 3 users by total revenue.
# Use window functions to rank users within each city.
sales_l = pl.scan_parquet("sales.parquet")

pipeline = (
    sales_l
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .group_by(["city", "user_id"])
    .agg(pl.col("net_value").sum().alias("total_user_revenue"))
    .with_columns([
        pl.col("total_user_revenue")
        .rank(method="dense", descending=True)
        .over("city")
        .alias("rank")
    ])
    .filter(pl.col("rank") <= 3)
    .sort(["city", "rank"])
)

print(pipeline.collect())

shape: (3_750, 4)
┌───────────┬─────────┬────────────────────┬──────┐
│ city      ┆ user_id ┆ total_user_revenue ┆ rank │
│ ---       ┆ ---     ┆ ---                ┆ ---  │
│ str       ┆ i64     ┆ f64                ┆ u32  │
╞═══════════╪═════════╪════════════════════╪══════╡
│ Barcelona ┆ 1135    ┆ 1200.0             ┆ 1    │
│ Barcelona ┆ 15735   ┆ 1200.0             ┆ 1    │
│ Barcelona ┆ 26935   ┆ 1200.0             ┆ 1    │
│ Barcelona ┆ 27135   ┆ 1200.0             ┆ 1    │
│ Barcelona ┆ 19935   ┆ 1200.0             ┆ 1    │
│ …         ┆ …       ┆ …                  ┆ …    │
│ Valencia  ┆ 6618    ┆ 3438.0             ┆ 3    │
│ Valencia  ┆ 28418   ┆ 3438.0             ┆ 3    │
│ Valencia  ┆ 49818   ┆ 3438.0             ┆ 3    │
│ Valencia  ┆ 46818   ┆ 3438.0             ┆ 3    │
│ Valencia  ┆ 39818   ┆ 3438.0             ┆ 3    │
└───────────┴─────────┴────────────────────┴──────┘


## Exercise 10 — Multi-step Optimization Challenge

Build a fully optimized lazy pipeline that:

- joins sales and products
- filters discontinued products
- computes net_value
- aggregates revenue per category
- sorts results

Inspect the execution plan and explain:
- which optimizations are applied


In [None]:
# Fully optimized lazy pipeline for sales analytics
products_l = pl.scan_parquet("products.parquet")
sales_l = pl.scan_parquet("sales.parquet")

pipeline = (
    sales_l
    .join(products_l, on="product_id", how="inner")
    .filter(~pl.col("is_discontinued"))
    .with_columns([
        pl.when(pl.col("has_discount"))
        .then(pl.col("gross_value") * 0.9)
        .otherwise(pl.col("gross_value"))
        .alias("net_value")
    ])
    .group_by("category")
    .agg([
        pl.len().alias("num_orders"),  # pl.count() is deprecated since 0.20.5
        pl.col("net_value").sum().alias("total_revenue")
    ])
    .sort("total_revenue", descending=True)
)

print(pipeline.explain())
print(pipeline.collect())


SORT BY [descending: [true]] [col("total_revenue")]
  AGGREGATE[maintain_order: false]
    [len().alias("num_orders"), col("net_value").sum().alias("total_revenue")] BY [col("category")]
    FROM
     WITH_COLUMNS:
     [when(col("has_discount")).then([(col("gross_value").cast(Unknown(Float))) * (dyn float: 0.9)]).otherwise(col("gross_value").strict_cast(Unknown(Float))).alias("net_value")] 
      simple π 4/4 ["category", "gross_value", ... 2 other columns]
        INNER JOIN:
        LEFT PLAN ON: [col("product_id")]
          Parquet SCAN [sales.parquet]
          PROJECT 3/8 COLUMNS
          ESTIMATED ROWS: 300000
        RIGHT PLAN ON: [col("product_id")]
          Parquet SCAN [products.parquet]
          PROJECT 3/3 COLUMNS
          SELECTION: col("is_discontinued").not()
          ESTIMATED ROWS: 2000
        END INNER JOIN
shape: (5, 3)
┌─────────────┬────────────┬───────────────┐
│ category    ┆ num_orders ┆ total_revenue │
│ ---         ┆ ---        ┆ ---           │
│ str
