In [48]:
import polars as pl
from warnings import filterwarnings

filterwarnings('ignore')

df = pl.read_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\business.parquet")

- The Data With Baraa YouTube channel explains this topic clearly; watch it before working through this notebook: https://youtu.be/GzRyOsQsugk?si=13-YEkyBC03ISlHI
- If the data is very large, do the rolling operation in the SQL query first.

In [49]:
business_lf = pl.scan_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\business.parquet")
checkin_lf = pl.scan_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\checkin.parquet")
review_lf = pl.scan_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\review.parquet")
tip_lf = pl.scan_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\tip.parquet")
user_lf = pl.scan_parquet(r"C:\Users\Rudra\Desktop\yelp\parquet-data\yelp_user.parquet")

In [50]:
user_lf = user_lf.with_columns(
    pl.col("yelping_since").str.to_datetime()
)

# <strong style="color:#5e17eb">  1. Introduction to Rolling Functions</strong>


## <strong style="color:#5e17eb"> Rolling, Cumulative & Window Functions </strong>

| Term / Operation              | What It Means                                                                               | How It Works                                                                                                 | Example Use Case                                                    |
| ----------------------------- | ------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------- |
| **Rolling Window**            | A calculation applied over a **fixed-size moving window** of data (rows or time-based).     | Moves forward one step at a time, dropping the oldest entry and adding the newest, then recalculates.        | 7-day moving average of temperature                                 |
| **Cumulative (Running)**      | A calculation applied **from the start of the data up to the current point**.               | No window size — keeps accumulating all past values until the current point.                                 | Running total sales till a given day                                |
| **Window Function** (Grouped) | An operation applied **within each group separately** while still keeping row-level detail. | Operates over partitions/groups (like SQL window functions). Can be rolling or cumulative but **per group**. | Rank sales within each region; rolling average per customer segment |


## <strong style="color:#5e17eb">Use Cases  </strong>

 **🔍 Concept of Rolling Windows**

* Imagine you have 10 days of stock prices, and you want the **average price of the last 3 days** —
  - Day 3 average = Avg(Day 1, Day 2, Day 3)
  - Day 4 average = Avg(Day 2, Day 3, Day 4)
  …and so on.
* The **window** moves forward (sliding) over the dataset.

 **🔍 Use Cases**

| Use Case                   | Example                                    |
| -------------------------- | ------------------------------------------ |
| **Smoothing noisy data**   | 7-day moving average of COVID cases        |
| **Detecting trends**       | Rolling sales growth over last 4 quarters  |
| **Volatility calculation** | Rolling standard deviation of stock prices |
| **Moving max/min**         | Highest temperature in last 30 days        |
| **Rolling sum/count**      | Total rainfall over last 7 days            |



## <strong style="color:#5e17eb"> Difference </strong>

**🔍 Rolling vs Cumulative vs Window**

| Feature         | Rolling Window               | Cumulative                     | Window Function               |
| --------------- | ---------------------------- | ------------------------------ | ----------------------------- |
| **Window size** | Fixed (e.g., 7 days, 5 rows) | Expands from start to current  | Varies by partition/group     |
| **Scope**       | Local subset of rows         | All rows from start to current | Per group/partition           |
| **Reset point** | Moves forward step-by-step   | Never resets until end         | Resets at each group boundary |
| **Example**     | Rolling 3-day average temp   | Running total sales            | Rank within region            |


💡 **Key Takeaway**:

* **Rolling** is like looking at data through a moving magnifying glass.
* **Cumulative** is like filling a bucket over time.
* **Window functions** are like running these calculations but **within groups**, without collapsing the dataset.

**Example**
1. **Rolling** → `.rolling_mean()` and `DataFrame.rolling()` (named `.groupby_rolling()` in older Polars releases)
2. **Cumulative** → `.cum_sum()`, `.cum_min()`, `.cum_max()`
3. **Window functions** → `.over("group_column")`


## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Understand what rolling operations are and when to use them.**

| Sub-Topic                                                            | Description                                                                          |
| -------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| **1.1 Concept of Rolling Windows**                                   | Calculations over a moving window of rows or time.                                   |
| **1.2 Use Cases**                                                    | Moving averages, rolling sums, volatility, moving max/min, smoothing noisy data.     |
| **1.3 Difference Between Rolling, Cumulative, and Window Functions** | Rolling = fixed-sized moving window; Cumulative = running total; Window = per-group. |


# <strong style="color:#5e17eb"> 2. Rolling by Row Count (Index-based Rolling) </strong>


In [51]:
df = pl.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5", "u6"],
    "name": ["A", "B", "C", "D", "E", "F"],
    "review_count": [10, 20, 30, 40, 50, 60],
    "yelping_since": pl.date_range(
        start=pl.datetime(2015, 1, 1),
        end=pl.datetime(2020, 1, 1),
        interval="1y",
        eager=True
    ),
    "fans": [1, 3, 5, 7, 9, 11]
})

df

user_id,name,review_count,yelping_since,fans
str,str,i64,date,i64
"""u1""","""A""",10,2015-01-01,1
"""u2""","""B""",20,2016-01-01,3
"""u3""","""C""",30,2017-01-01,5
"""u4""","""D""",40,2018-01-01,7
"""u5""","""E""",50,2019-01-01,9
"""u6""","""F""",60,2020-01-01,11


## <strong style="color:#5e17eb"> Rolling Mean & Median </strong>

In [52]:
df.with_columns([
    pl.col("review_count").rolling_mean(window_size=3).alias("roll_mean"),
    pl.col("review_count").rolling_median(window_size=3).alias("roll_median"),
])


user_id,name,review_count,yelping_since,fans,roll_mean,roll_median
str,str,i64,date,i64,f64,f64
"""u1""","""A""",10,2015-01-01,1,,
"""u2""","""B""",20,2016-01-01,3,,
"""u3""","""C""",30,2017-01-01,5,20.0,20.0
"""u4""","""D""",40,2018-01-01,7,30.0,30.0
"""u5""","""E""",50,2019-01-01,9,40.0,40.0
"""u6""","""F""",60,2020-01-01,11,50.0,50.0


## <strong style="color:#5e17eb"> Rolling Min / Max & sum </strong>

In [53]:
df.with_columns([
    pl.col("review_count").rolling_sum(window_size=3).alias("roll_sum"),
    pl.col("review_count").rolling_min(window_size=3).alias("roll_min"),
    pl.col("review_count").rolling_max(window_size=3).alias("roll_max"),
])


user_id,name,review_count,yelping_since,fans,roll_sum,roll_min,roll_max
str,str,i64,date,i64,i64,i64,i64
"""u1""","""A""",10,2015-01-01,1,,,
"""u2""","""B""",20,2016-01-01,3,,,
"""u3""","""C""",30,2017-01-01,5,60,10,30
"""u4""","""D""",40,2018-01-01,7,90,20,40
"""u5""","""E""",50,2019-01-01,9,120,30,50
"""u6""","""F""",60,2020-01-01,11,150,40,60


## <strong style="color:#5e17eb">Rolling std / var & Quantile </strong>

In [54]:
df.with_columns([
    pl.col("review_count").rolling_std(window_size=3).alias("roll_std"),
    pl.col("review_count").rolling_var(window_size=3).alias("roll_var"),
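    # rolling_quantile falls back to window_size=2 when none is given; it is stated
    # explicitly here because it differs from the window of 3 used for std and var.
    pl.col("review_count").rolling_quantile(quantile=0.9, window_size=2).alias("roll_quantile")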
])

user_id,name,review_count,yelping_since,fans,roll_std,roll_var,roll_quantile
str,str,i64,date,i64,f64,f64,f64
"""u1""","""A""",10,2015-01-01,1,,,
"""u2""","""B""",20,2016-01-01,3,,,20.0
"""u3""","""C""",30,2017-01-01,5,10.0,100.0,30.0
"""u4""","""D""",40,2018-01-01,7,10.0,100.0,40.0
"""u5""","""E""",50,2019-01-01,9,10.0,100.0,50.0
"""u6""","""F""",60,2020-01-01,11,10.0,100.0,60.0


## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Perform rolling calculations over N previous rows.**


| Sub-Topic                | Function                                      | Example                                 |
| ------------------------ | --------------------------------------------- | --------------------------------------- |
| **2.1 Rolling Mean**     | `.rolling_mean(window_size, min_samples=...)` | `pl.col("stars").rolling_mean(3)`       |
| **2.2 Rolling Sum**      | `.rolling_sum()`                              | `pl.col("review_count").rolling_sum(5)` |
| **2.3 Rolling Min/Max**  | `.rolling_min()`, `.rolling_max()`            | Find recent peak/min value              |
| **2.4 Rolling Std/Var**  | `.rolling_std()`, `.rolling_var()`            | Rolling standard deviation              |
| **2.5 Rolling Median**   | `.rolling_median()`                           | Smoothing with median filter            |
| **2.6 Rolling Quantile** | `.rolling_quantile(quantile=0.9)`             | Rolling 90th percentile                 |



> 💡 **Tip**: Use `min_samples` (called `min_periods` in older Polars releases) when your first few windows have fewer rows than `window_size`.

# <strong style="color:#5e17eb"> 3. Rolling by Time Period (Time-based Rolling) </strong>


## <strong style="color:#5e17eb"> Concept </strong>

- **Normal rolling** → window size is a *number of rows* (e.g., window_size=3 means last 3 rows).

- **Time-based rolling** → window size is a *time period* (e.g., "2y", "3mo", "7d") and depends on a datetime column.

*Needs:*

- A datetime column.

- Sorting by datetime before rolling.

In [55]:

df = pl.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5"],
    "yelping_since": pl.date_range(
        start=pl.datetime(2018, 1, 1),
        end=pl.datetime(2022, 1, 1),
        interval="1y",
        eager=True
    ),
    "review_count": [10, 20, 30, 40, 50]
})

print(df)


shape: (5, 3)
┌─────────┬───────────────┬──────────────┐
│ user_id ┆ yelping_since ┆ review_count │
│ ---     ┆ ---           ┆ ---          │
│ str     ┆ date          ┆ i64          │
╞═════════╪═══════════════╪══════════════╡
│ u1      ┆ 2018-01-01    ┆ 10           │
│ u2      ┆ 2019-01-01    ┆ 20           │
│ u3      ┆ 2020-01-01    ┆ 30           │
│ u4      ┆ 2021-01-01    ┆ 40           │
│ u5      ┆ 2022-01-01    ┆ 50           │
└─────────┴───────────────┴──────────────┘


## <strong style="color:#5e17eb"> Rolling window with Time </strong>

In [56]:
df = df.sort("yelping_since")  # Must be sorted

# Note: window_size=2 counts rows, not time. It matches a 2-year span here
# only because the data has exactly one row per year.
df_rolling_mean = df.with_columns(
    pl.col("review_count")
      .rolling_mean(window_size=2)
      .alias("rolling_mean")
)

print(df_rolling_mean)

shape: (5, 4)
┌─────────┬───────────────┬──────────────┬──────────────┐
│ user_id ┆ yelping_since ┆ review_count ┆ rolling_mean │
│ ---     ┆ ---           ┆ ---          ┆ ---          │
│ str     ┆ date          ┆ i64          ┆ f64          │
╞═════════╪═══════════════╪══════════════╪══════════════╡
│ u1      ┆ 2018-01-01    ┆ 10           ┆ null         │
│ u2      ┆ 2019-01-01    ┆ 20           ┆ 15.0         │
│ u3      ┆ 2020-01-01    ┆ 30           ┆ 25.0         │
│ u4      ┆ 2021-01-01    ┆ 40           ┆ 35.0         │
│ u5      ┆ 2022-01-01    ┆ 50           ┆ 45.0         │
└─────────┴───────────────┴──────────────┴──────────────┘


## <strong style="color:#5e17eb"> Summary  </strong>

**Key Difference**

| Function           | Works On      | Window Type     | Null Handling          |
| ------------------ | ------------- | --------------- | ---------------------- |
| `rolling_*`        | Numeric       | Fixed row count | Null until enough rows |
| `cum_*`            | Numeric       | From start      | Always fills           |
| `group_by_dynamic` | Time/datetime | Time-based      | Groups per time frame  |


**Goal: Perform rolling calculations over a time duration, not row count.**

| Sub-Topic                            | Function / Method                                                    | Example                                                      |
| ------------------------------------ | -------------------------------------------------------------------- | ------------------------------------------------------------ |
| **3.1 Setting Time Column**          | Cast the column to `pl.Datetime` (or `pl.Date`) for time operations  | `pl.col("date").cast(pl.Datetime)`                           |
| **3.2 `.rolling_mean_by()`**         | Rolling mean per time period                                         | `pl.col("sales").rolling_mean_by("date", window_size="7d")`  |
| **3.3 `.rolling_sum_by()`**          | Sum of values over last 30 days                                      | `pl.col("sales").rolling_sum_by("date", window_size="30d")`  |

> 💡 **Tip**: Requires sorted time column for correct results.

# <strong style="color:#5e17eb">  4. Parameters to Control Rolling</strong>


In [61]:
import polars as pl
from datetime import datetime

df_time = pl.DataFrame({
    "dt": [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3), datetime(2020, 1, 4)],
    "value": [10, 15, 12, 18]
}).with_columns(
    pl.col("dt").set_sorted()
)

df_time_rolling = df_time.rolling(
    index_column="dt",
    period="2d"
).agg(
    pl.col("value").mean().alias("rolling_mean_2d")
)
print(df_time_rolling)

shape: (4, 2)
┌─────────────────────┬─────────────────┐
│ dt                  ┆ rolling_mean_2d │
│ ---                 ┆ ---             │
│ datetime[μs]        ┆ f64             │
╞═════════════════════╪═════════════════╡
│ 2020-01-01 00:00:00 ┆ 10.0            │
│ 2020-01-02 00:00:00 ┆ 12.5            │
│ 2020-01-03 00:00:00 ┆ 13.5            │
│ 2020-01-04 00:00:00 ┆ 15.0            │
└─────────────────────┴─────────────────┘


In [62]:
import polars as pl

s = pl.Series([10, 20, 30, 40, 50])

print(s.rolling_mean(window_size=3))
print(s.rolling_mean(window_size=3, weights=[0.2, 0.3, 0.5]))
print(s.rolling_sum(window_size=3, center=True))
print(s.rolling_max(window_size=3, min_samples=2))


shape: (5,)
Series: '' [f64]
[
	null
	null
	20.0
	30.0
	40.0
]
shape: (5,)
Series: '' [f64]
[
	null
	null
	23.0
	33.0
	43.0
]
shape: (5,)
Series: '' [i64]
[
	null
	60
	90
	120
	null
]
shape: (5,)
Series: '' [i64]
[
	null
	20
	30
	40
	50
]


## <strong style="color:#5e17eb"> Window Size </strong>

In [64]:
df.collect_schema()

Schema([('user_id', String), ('yelping_since', Date), ('review_count', Int64)])

## <strong style="color:#5e17eb">min_samples  </strong>

## <strong style="color:#5e17eb">  center & weight </strong>

## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Fine-tune rolling behavior.**

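| Parameter         | Description                                                                                                                   |
| ----------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| **`window_size`** | Size of the rolling window: an `int` number of rows for `rolling_*`, or a duration string such as `"7d"` for `rolling_*_by`   |
| **`min_samples`** | Minimum number of non-null observations required to produce a result (named `min_periods` in older Polars releases)           |
| **`center`**      | If True, centers the window on the current row                                                                                |
| **`weights`**     | Optional list of multipliers, one per position in the window                                                                  |
| **`by`**          | Time column that defines the window in `rolling_*_by`; to roll per group (e.g., per `city` in your Yelp data) use `.over()`   |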


# <strong style="color:#5e17eb"> 5. Rolling with Groups </strong>


In [65]:
import polars as pl
import datetime as dt

df = pl.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "customer": [1, 1, 2, 1, 1, 2],
    "date": pl.date_range(
        start=dt.date(2025, 1, 1),
        end=dt.date(2025, 1, 6),
        interval="1d",
        eager=True
    ),
    "sales": [10, 20, 15, 5, 7, 9]
})

print(df)


shape: (6, 4)
┌──────────┬──────────┬────────────┬───────┐
│ category ┆ customer ┆ date       ┆ sales │
│ ---      ┆ ---      ┆ ---        ┆ ---   │
│ str      ┆ i64      ┆ date       ┆ i64   │
╞══════════╪══════════╪════════════╪═══════╡
│ A        ┆ 1        ┆ 2025-01-01 ┆ 10    │
│ A        ┆ 1        ┆ 2025-01-02 ┆ 20    │
│ A        ┆ 2        ┆ 2025-01-03 ┆ 15    │
│ B        ┆ 1        ┆ 2025-01-04 ┆ 5     │
│ B        ┆ 1        ┆ 2025-01-05 ┆ 7     │
│ B        ┆ 2        ┆ 2025-01-06 ┆ 9     │
└──────────┴──────────┴────────────┴───────┘


## <strong style="color:#5e17eb">  Grouped rolling mean</strong>

In [66]:
df_grouped_mean = (
    df.sort("date")
      .group_by("category", maintain_order=True)
      .agg([
          pl.col("sales")
          .rolling_mean(window_size=2)
          .alias("rolling_mean_sales")
      ])
)

print(df_grouped_mean)


shape: (2, 2)
┌──────────┬────────────────────┐
│ category ┆ rolling_mean_sales │
│ ---      ┆ ---                │
│ str      ┆ list[f64]          │
╞══════════╪════════════════════╡
│ A        ┆ [null, 15.0, 17.5] │
│ B        ┆ [null, 6.0, 8.0]   │
└──────────┴────────────────────┘


## <strong style="color:#5e17eb"> using over </strong>

In [67]:
df_over = df.sort("date").with_columns(
    pl.col("sales")
      .rolling_mean(window_size=2)
      .over("category")  # rolling mean separately for each category
      .alias("rolling_mean_sales")
)
print(df_over)

shape: (6, 5)
┌──────────┬──────────┬────────────┬───────┬────────────────────┐
│ category ┆ customer ┆ date       ┆ sales ┆ rolling_mean_sales │
│ ---      ┆ ---      ┆ ---        ┆ ---   ┆ ---                │
│ str      ┆ i64      ┆ date       ┆ i64   ┆ f64                │
╞══════════╪══════════╪════════════╪═══════╪════════════════════╡
│ A        ┆ 1        ┆ 2025-01-01 ┆ 10    ┆ null               │
│ A        ┆ 1        ┆ 2025-01-02 ┆ 20    ┆ 15.0               │
│ A        ┆ 2        ┆ 2025-01-03 ┆ 15    ┆ 17.5               │
│ B        ┆ 1        ┆ 2025-01-04 ┆ 5     ┆ null               │
│ B        ┆ 1        ┆ 2025-01-05 ┆ 7     ┆ 6.0                │
│ B        ┆ 2        ┆ 2025-01-06 ┆ 9     ┆ 8.0                │
└──────────┴──────────┴────────────┴───────┴────────────────────┘


## <strong style="color:#5e17eb">  Multiple Group by</strong>

In [68]:
df_multi_group = df.sort("date").with_columns(
    pl.col("sales")
      .rolling_mean(window_size=2)
      .over(["category", "customer"])
      .alias("rolling_mean_sales")
)

print(df_multi_group)


shape: (6, 5)
┌──────────┬──────────┬────────────┬───────┬────────────────────┐
│ category ┆ customer ┆ date       ┆ sales ┆ rolling_mean_sales │
│ ---      ┆ ---      ┆ ---        ┆ ---   ┆ ---                │
│ str      ┆ i64      ┆ date       ┆ i64   ┆ f64                │
╞══════════╪══════════╪════════════╪═══════╪════════════════════╡
│ A        ┆ 1        ┆ 2025-01-01 ┆ 10    ┆ null               │
│ A        ┆ 1        ┆ 2025-01-02 ┆ 20    ┆ 15.0               │
│ A        ┆ 2        ┆ 2025-01-03 ┆ 15    ┆ null               │
│ B        ┆ 1        ┆ 2025-01-04 ┆ 5     ┆ null               │
│ B        ┆ 1        ┆ 2025-01-05 ┆ 7     ┆ 6.0                │
│ B        ┆ 2        ┆ 2025-01-06 ┆ 9     ┆ null               │
└──────────┴──────────┴────────────┴───────┴────────────────────┘


## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Apply rolling metrics within groups (per category, per customer).**

| Sub-Topic                    | Example                                                    |
| ---------------------------- | ---------------------------------------------------------- |
| **5.1 Grouped Rolling Mean** | `df.group_by("state").agg(pl.col("stars").rolling_mean(3))` |
| **5.2 Using `.over()`**      | `pl.col("stars").rolling_mean(3).over("city")`             |
| **5.3 Multiple Group Keys**  | `over(["state", "city"])`                                  |


# <strong style="color:#5e17eb"> 6. Cumulative Functions (Running Totals) </strong>


In [69]:
# Sample Data
df = pl.DataFrame({
    "customer": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [10, 20, 15, 5, 7, 9]
})
df

customer,day,sales
str,i64,i64
"""A""",1,10
"""A""",2,20
"""A""",3,15
"""B""",1,5
"""B""",2,7
"""B""",3,9


## <strong style="color:#5e17eb"> cum_min / cum_max / cum_sum </strong>

In [70]:
result = df.with_columns([
    pl.col("sales").cum_sum().over("customer").alias("cum_sum"),
    pl.col("sales").cum_min().over("customer").alias("cum_min"),
    pl.col("sales").cum_max().over("customer").alias("cum_max"),
])

print(result)

shape: (6, 6)
┌──────────┬─────┬───────┬─────────┬─────────┬─────────┐
│ customer ┆ day ┆ sales ┆ cum_sum ┆ cum_min ┆ cum_max │
│ ---      ┆ --- ┆ ---   ┆ ---     ┆ ---     ┆ ---     │
│ str      ┆ i64 ┆ i64   ┆ i64     ┆ i64     ┆ i64     │
╞══════════╪═════╪═══════╪═════════╪═════════╪═════════╡
│ A        ┆ 1   ┆ 10    ┆ 10      ┆ 10      ┆ 10      │
│ A        ┆ 2   ┆ 20    ┆ 30      ┆ 10      ┆ 20      │
│ A        ┆ 3   ┆ 15    ┆ 45      ┆ 10      ┆ 20      │
│ B        ┆ 1   ┆ 5     ┆ 5       ┆ 5       ┆ 5       │
│ B        ┆ 2   ┆ 7     ┆ 12      ┆ 5       ┆ 7       │
│ B        ┆ 3   ┆ 9     ┆ 21      ┆ 5       ┆ 9       │
└──────────┴─────┴───────┴─────────┴─────────┴─────────┘


## <strong style="color:#5e17eb"> cum_prod / cum_count  </strong>

In [71]:
result = df.with_columns([
    pl.col("sales").cum_prod().over("customer").alias("cum_prod"),
    pl.col("sales").cum_count().over("customer").alias("cum_count"),
])

print(result)

shape: (6, 5)
┌──────────┬─────┬───────┬──────────┬───────────┐
│ customer ┆ day ┆ sales ┆ cum_prod ┆ cum_count │
│ ---      ┆ --- ┆ ---   ┆ ---      ┆ ---       │
│ str      ┆ i64 ┆ i64   ┆ i64      ┆ u32       │
╞══════════╪═════╪═══════╪══════════╪═══════════╡
│ A        ┆ 1   ┆ 10    ┆ 10       ┆ 1         │
│ A        ┆ 2   ┆ 20    ┆ 200      ┆ 2         │
│ A        ┆ 3   ┆ 15    ┆ 3000     ┆ 3         │
│ B        ┆ 1   ┆ 5     ┆ 5        ┆ 1         │
│ B        ┆ 2   ┆ 7     ┆ 35       ┆ 2         │
│ B        ┆ 3   ┆ 9     ┆ 315      ┆ 3         │
└──────────┴─────┴───────┴──────────┴───────────┘


## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Running calculations from start to current row.**

| Function                   | Example                     |
| -------------------------- | --------------------------- |
| `.cum_sum()`               | `pl.col("sales").cum_sum()` |
| `.cum_min()`, `.cum_max()` | Running min/max             |
| `.cum_prod()`              | Running product             |
| `.cum_count()`             | Running count of non-null values |
| `.cum_sum() / .cum_count()` | Running average (Polars has no built-in `cum_mean`) |


> 💡 **Tip**: When you only need running totals, use cumulative functions rather than a rolling window sized to cover the whole column; they are simpler and usually faster.

# <strong style="color:#5e17eb"> 7. Advanced Rolling Patterns </strong>


## <strong style="color:#5e17eb"> Custom Aggregations </strong>

In [72]:
import polars as pl
import datetime as dt
import numpy as np

# Sample data
df = pl.DataFrame({
    "day": pl.date_range(
        start=dt.date(2025, 1, 1),   
        end=dt.date(2025, 1, 10),    
        interval="1d",
        eager=True
    ),
    "sales": [5, 8, 10, 3, 7, 6, 4, 9, 12, 15]
})

df


day,sales
date,i64
2025-01-01,5
2025-01-02,8
2025-01-03,10
2025-01-04,3
2025-01-05,7
2025-01-06,6
2025-01-07,4
2025-01-08,9
2025-01-09,12
2025-01-10,15


## <strong style="color:#5e17eb"> Multiple Rolling Metrics in One place </strong>

In [73]:
df_multi = df.with_columns([
    pl.col("sales").rolling_mean(3).alias("roll_mean_3"),
    pl.col("sales").rolling_max(3).alias("roll_max_3"),
    pl.col("sales").rolling_min(3).alias("roll_min_3")
])

print(df_multi)


shape: (10, 5)
┌────────────┬───────┬─────────────┬────────────┬────────────┐
│ day        ┆ sales ┆ roll_mean_3 ┆ roll_max_3 ┆ roll_min_3 │
│ ---        ┆ ---   ┆ ---         ┆ ---        ┆ ---        │
│ date       ┆ i64   ┆ f64         ┆ i64        ┆ i64        │
╞════════════╪═══════╪═════════════╪════════════╪════════════╡
│ 2025-01-01 ┆ 5     ┆ null        ┆ null       ┆ null       │
│ 2025-01-02 ┆ 8     ┆ null        ┆ null       ┆ null       │
│ 2025-01-03 ┆ 10    ┆ 7.666667    ┆ 10         ┆ 5          │
│ 2025-01-04 ┆ 3     ┆ 7.0         ┆ 10         ┆ 3          │
│ 2025-01-05 ┆ 7     ┆ 6.666667    ┆ 10         ┆ 3          │
│ 2025-01-06 ┆ 6     ┆ 5.333333    ┆ 7          ┆ 3          │
│ 2025-01-07 ┆ 4     ┆ 5.666667    ┆ 7          ┆ 4          │
│ 2025-01-08 ┆ 9     ┆ 6.333333    ┆ 9          ┆ 4          │
│ 2025-01-09 ┆ 12    ┆ 8.333333    ┆ 12         ┆ 4          │
│ 2025-01-10 ┆ 15    ┆ 12.0        ┆ 15         ┆ 9          │
└────────────┴───────┴─────────────┴────────────┴────────────┘

## <strong style="color:#5e17eb"> Rolling over Expressions </strong>

In [74]:
df_expr = df.with_columns(
    (pl.col("sales") / (pl.arange(1, df.height + 1)))  # visits = day number
      .rolling_mean(3)
      .alias("sales_per_visit_roll_mean")
)

print(df_expr)

shape: (10, 3)
┌────────────┬───────┬───────────────────────────┐
│ day        ┆ sales ┆ sales_per_visit_roll_mean │
│ ---        ┆ ---   ┆ ---                       │
│ date       ┆ i64   ┆ f64                       │
╞════════════╪═══════╪═══════════════════════════╡
│ 2025-01-01 ┆ 5     ┆ null                      │
│ 2025-01-02 ┆ 8     ┆ null                      │
│ 2025-01-03 ┆ 10    ┆ 4.111111                  │
│ 2025-01-04 ┆ 3     ┆ 2.694444                  │
│ 2025-01-05 ┆ 7     ┆ 1.827778                  │
│ 2025-01-06 ┆ 6     ┆ 1.05                      │
│ 2025-01-07 ┆ 4     ┆ 0.990476                  │
│ 2025-01-08 ┆ 9     ┆ 0.89881                   │
│ 2025-01-09 ┆ 12    ┆ 1.009921                  │
│ 2025-01-10 ┆ 15    ┆ 1.319444                  │
└────────────┴───────┴───────────────────────────┘


## <strong style="color:#5e17eb"> Summary  </strong>

**Goal: Build custom rolling logic.**

| Sub-Topic                                    | Description                                                       |
| -------------------------------------------- | ----------------------------------------------------------------- |
| **7.1 Custom Aggregations**                  | `.rolling_map(func, window_size)` to run your own function (formerly `.rolling_apply`) |
| **7.2 Multiple Rolling Metrics in One Pass** | `df.with_columns(pl.col("x").rolling_mean(3), pl.col("x").rolling_max(3))` |
| **7.3 Rolling over Expressions**             | `(pl.col("sales") / pl.col("visits")).rolling_mean(7)`            |


# <strong style="color:#5e17eb"> 8. Performance Tips for Rolling </strong>

- **Sort before rolling:** Always sort by index or date for predictable results.

- **Use proper dtypes:** For time-based rolling, make sure the time column is `pl.Date` or `pl.Datetime`, not a string.

- **Lazy API:** For huge datasets, chain rolling in lazy mode and .collect() only once.

- **Avoid `rolling_map` (formerly `rolling_apply`) on large data:** It calls a Python function for every window, so it is slower; prefer the built-in `.rolling_*` functions.

- Use `over()` for grouped rolling instead of manual loops.

<div style="text-align: center;">
  <h4 style="
    display: inline-block;
    color: #5e17eb;
    font-family: 'Segoe UI';
    border-left: 5px solid #5e17eb;
    background-color: #F8F9F9;
    padding: 10px 20px;
    border-radius: 5px;
    text-align: left;
  ">
  <b>
    Thank You 💜
    </b>
  </h4>
</div>