# Cumulative Operations and Rolling Windows in Pandas

## What Are Cumulative Operations and Rolling Windows?

In data analysis and time series, **cumulative operations** and **rolling windows** help us observe running totals, averages, and other statistics over a sequence of values. These techniques are essential for detecting **trends**, **seasonality**, **anomalies**, and **local patterns** in our dataset.

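The Titanic dataset is not a time series, but we can treat its row order (by `PassengerId`) as a sequence. That lets us apply rolling windows to numerical columns such as `Age` and `Fare`, or compute cumulative statistics along passenger ID or a boarding date (if present or simulated).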

## Cumulative Functions in Pandas

Cumulative functions return a running result (sum, product, maximum, or minimum) computed from the first row through the current row.

In [1]:
import pandas as pd

df = pd.read_csv("data/train.csv")

df['CumulativeFare'] = df['Fare'].cumsum()
df['CumulativeSurvival'] = df['Survived'].cumsum()
df['CumulativeMaxAge'] = df['Age'].cummax()
print(df[['PassengerId', 'Fare', 'CumulativeFare', 'CumulativeSurvival', 'CumulativeMaxAge']].head())

   PassengerId     Fare  CumulativeFare  CumulativeSurvival  CumulativeMaxAge
0            1   7.2500          7.2500                   0              22.0
1            2  71.2833         78.5333                   1              38.0
2            3   7.9250         86.4583                   2              38.0
3            4  53.1000        139.5583                   3              38.0
4            5   8.0500        147.6083                   3              38.0


Available cumulative methods:

- `.cumsum()` → Cumulative sum
- `.cumprod()` → Cumulative product
- `.cummax()` → Cumulative max
- `.cummin()` → Cumulative min
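
To make the four methods concrete, here is a minimal sketch on a small hand-made Series. The values are hypothetical toy numbers, not Titanic data, chosen so each running result is easy to check by hand.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Titanic data
s = pd.Series([3, 1, 4, 1, 5])

running = pd.DataFrame({
    "value": s,
    "cumsum": s.cumsum(),    # running total
    "cumprod": s.cumprod(),  # running product
    "cummax": s.cummax(),    # largest value seen so far
    "cummin": s.cummin(),    # smallest value seen so far
})
print(running)
```

Each column has the same length as the input, and row `i` depends only on rows `0` through `i`.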

## Rolling Window Functions

Rolling functions slide a fixed-size window over the data and compute a statistic within each window. They are commonly used in time series for smoothing.

In [2]:
# Rolling mean on Fare
df['RollingFare'] = df['Fare'].rolling(window=5).mean()

# Rolling survival rate (average over last 5 passengers)
df['RollingSurvival'] = df['Survived'].rolling(window=5).mean()
print(df[['Fare', 'RollingFare', 'Survived', 'RollingSurvival']].head(10))

      Fare  RollingFare  Survived  RollingSurvival
0   7.2500          NaN         0              NaN
1  71.2833          NaN         1              NaN
2   7.9250          NaN         1              NaN
3  53.1000          NaN         1              NaN
4   8.0500     29.52166         0              0.6
5   8.4583     29.76332         0              0.6
6  51.8625     25.87916         0              0.4
7  21.0750     28.50916         0              0.2
8  11.1333     20.11582         1              0.2
9  30.0708     24.51998         1              0.4


Common functions with `.rolling()`:

- `.mean()`, `.sum()`, `.std()`, `.min()`, `.max()`
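
As a rough illustration, the sketch below applies each of these aggregations to the same window object. It uses the first six `Fare` values from the output above and a window of 3, which is an arbitrary choice for the example.

```python
import pandas as pd

# First six Fare values, copied from the output shown earlier
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583])

# One Rolling object can be reused for several aggregations
window = fare.rolling(window=3)
stats = pd.DataFrame({
    "fare": fare,
    "mean": window.mean(),
    "sum": window.sum(),
    "std": window.std(),  # sample standard deviation (ddof=1 by default)
    "min": window.min(),
    "max": window.max(),
})
print(stats.round(3))
```

The first two rows of every statistic are NaN because a window of 3 is not yet full there.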

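### Handling Missing Values

- With an integer window, rolling results are NaN for the first `window - 1` rows and for any window that contains a missing value (such as those in `Age`). Use `.fillna()` or `.ffill()` to fill these gaps, or pass `min_periods` to `.rolling()` to allow results from partially filled windows.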
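
The sketch below compares three ways of dealing with these NaN values. It uses the first ten `Age` values from the Titanic training data (one of which is missing); the column names `strict`, `relaxed`, and `filled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# First ten Age values from the training data, including one missing value
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0])

# Default: every window needs 3 valid values, otherwise the result is NaN
strict = age.rolling(window=3).mean()

# min_periods=1: a window with at least one valid value produces a result
relaxed = age.rolling(window=3, min_periods=1).mean()

# Forward fill: carry the last valid rolling result forward
filled = strict.ffill()

print(pd.DataFrame({"age": age, "strict": strict, "relaxed": relaxed, "filled": filled}))
```

Forward fill cannot fill the leading NaN values because there is no earlier result to carry forward, whereas `min_periods=1` produces a value from the very first row. Which behavior is appropriate depends on whether a partially filled window is meaningful for the analysis.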

## AI/ML Use Case: Time-Based Feature Engineering

Cumulative and rolling operations are heavily used in machine learning to create **derived features** that track user behavior, trends, or group statistics over time.

Examples:

- Cumulative sum of purchases
- Rolling average temperature
- Rolling survival rate or delay
- Customer engagement trends over time

These can be **predictive** features that capture long-term patterns or short-term spikes, which can improve model accuracy.
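
Two of the examples above, a cumulative sum of purchases and a rolling average temperature, can be sketched as follows. The data, the column names (`purchase`, `temperature`, `cum_purchase`, `temp_3d_mean`), and the dates are all hypothetical and exist only to show the pattern on a datetime index.

```python
import pandas as pd

# Hypothetical daily log; all names and values are made up for illustration
log = pd.DataFrame(
    {
        "purchase": [20.0, 0.0, 35.0, 10.0, 0.0, 50.0],
        "temperature": [11.0, 13.0, 12.0, 16.0, 18.0, 17.0],
    },
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Running total of purchases up to and including each day
log["cum_purchase"] = log["purchase"].cumsum()

# Time-based window: mean temperature over the trailing 3 calendar days
log["temp_3d_mean"] = log["temperature"].rolling("3D").mean()

print(log)
```

With an offset window such as `"3D"`, `min_periods` defaults to 1, so the first rows contain values rather than NaN. An integer window would instead count rows regardless of how far apart they are in time.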

## Exercises

**Q1.** Compute the cumulative sum of `Fare`.

In [3]:
df['CumFare'] = df['Fare'].cumsum()
print(df[['PassengerId', 'Fare', 'CumFare']].head())

   PassengerId     Fare   CumFare
0            1   7.2500    7.2500
1            2  71.2833   78.5333
2            3   7.9250   86.4583
3            4  53.1000  139.5583
4            5   8.0500  147.6083


**Q2.** Compute the rolling mean of `Age` with a window of 3.

In [4]:
df['RollingAge'] = df['Age'].rolling(window=3).mean()
print(df[['PassengerId', 'Age', 'RollingAge']].head(10))

   PassengerId   Age  RollingAge
0            1  22.0         NaN
1            2  38.0         NaN
2            3  26.0   28.666667
3            4  35.0   33.000000
4            5  35.0   32.000000
5            6   NaN         NaN
6            7  54.0         NaN
7            8   2.0         NaN
8            9  27.0   27.666667
9           10  14.0   14.333333


**Q3.** Add a column for the cumulative count of passengers.

In [5]:
df['CumulativeCount'] = range(1, len(df) + 1)

**Q4.** Create a rolling standard deviation of `Fare` with a window of 4.

In [6]:
df['RollingFareStd'] = df['Fare'].rolling(window=4).std()
print(df[['Fare', 'RollingFareStd']].head(10))

      Fare  RollingFareStd
0   7.2500             NaN
1  71.2833             NaN
2   7.9250             NaN
3  53.1000       32.389078
4   8.0500       32.163198
5   8.4583       22.478937
6  51.8625       25.540069
7  21.0750       20.575731
8  11.1333       19.907780
9  30.0708       17.368570


**Q5.** Fill missing values in the rolling results using forward fill.

In [None]:
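# .ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions.
# Leading NaN values stay NaN because there is no earlier value to carry forward.
df['RollingAge'] = df['RollingAge'].ffill()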

## Summary

Cumulative and rolling operations are key tools in Pandas for analyzing trends and smoothing out short-term fluctuations. Cumulative functions such as `.cumsum()` and `.cummax()` help track growth over time, while rolling windows provide local averages and volatility measures.

These operations are especially useful in time-series analysis, stock price forecasting, customer behavior modeling, and more. Rolling statistics can be combined with `.groupby()` and `.shift()` for even more powerful feature engineering.
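
One way to combine rolling statistics with `.groupby()` and `.shift()` is sketched below. The small frame stands in for the Titanic columns; the `Pclass` values and the new column names `PrevFareMean` and `ClassCumFare` are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the Titanic data; Pclass values are assumed
df_small = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075],
})

# Mean of the previous two fares within each class.
# shift(1) excludes the current row, so the feature only uses earlier rows.
df_small["PrevFareMean"] = df_small.groupby("Pclass")["Fare"].transform(
    lambda s: s.shift(1).rolling(window=2).mean()
)

# Running fare total within each class
df_small["ClassCumFare"] = df_small.groupby("Pclass")["Fare"].cumsum()

print(df_small)
```

Shifting before rolling keeps the current row's value out of its own feature, which matters when the feature feeds a model that predicts something about that row. The first two rows of each group are NaN because two earlier fares are not yet available.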

When used properly, these techniques lead to **better model inputs**, **smoother data visualization**, and **insightful analytics**. Always visualize and validate these derived columns before using them in production ML pipelines.