In [None]:
import pandas as pd
from pathlib import Path

In [None]:
path = Path("../../csv/")

In [None]:
data = pd.read_csv(path / "groupby_demo_data.csv")
data

In [None]:
data.groupby('country')['lifetime_value'].sum()

In [None]:
data.groupby('country')['lifetime_value'].agg(['sum', 'mean', 'max'])

In [None]:
data.groupby('country').agg(
    total_ltv=('lifetime_value', 'sum'),
    avg_ltv=('lifetime_value', 'mean'),
    max_ltv=('lifetime_value', 'max')
)


# Understanding pandas `groupby()`: The Split-Apply-Combine Strategy

## Split
### What happens
- The dataset is logically divided into smaller groups based on one or more columns.
- No rows are changed yet.
- Think of this as creating buckets.

### Example
- All rows with `region = "East"` go into one bucket
- All rows with `region = "West"` go into another

### Key point
**Split does not compute anything — it only organizes rows.**
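
To make the bucket idea concrete, here is a minimal sketch on a toy frame. The `sales` DataFrame and its `region` and `sales` columns are made up for illustration and are not part of the notebook's CSV.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the split step.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "sales": [10, 20, 30, 40, 50],
})

grouped = sales.groupby("region")

# `groups` maps each key to the row labels in its bucket; no math has run yet.
buckets = {key: list(labels) for key, labels in grouped.groups.items()}
print(buckets)

# Iterating yields (key, sub-DataFrame) pairs holding the original, unchanged rows.
for key, bucket in grouped:
    print(key, bucket["sales"].tolist())
```

Inspecting `grouped.groups` is a cheap way to confirm the grouping keys before applying any function.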

## Apply
### What happens
- A function is run **independently** on each bucket.
- This function can:
  - Reduce rows (aggregation)
  - Preserve rows (transformation)
  - Filter rows

### Examples
- Sum sales inside each region
- Compute average per customer
- Normalize values within each group

### Key point
**Apply defines what calculation or logic is executed per group.**
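
The three kinds of apply step can be sketched side by side. The toy `sales` frame and the threshold of 60 below are illustrative assumptions, not values from the notebook's data.

```python
import pandas as pd

# Hypothetical toy data, used only to contrast the three kinds of apply.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "sales": [10, 20, 30, 40, 50],
})
by_region = sales.groupby("region")["sales"]

# Aggregation: each bucket collapses to a single value.
totals = by_region.sum()

# Transformation: each row gets a value computed within its own bucket.
share = by_region.transform(lambda s: s / s.sum())

# Filtering: whole buckets are kept or dropped based on a per-group test.
big = sales.groupby("region").filter(lambda g: g["sales"].sum() > 60)

print(totals, share, big, sep="\n\n")
```

Here `totals` has one entry per region, `share` has one entry per original row, and `big` keeps only the rows of regions whose total exceeds the assumed threshold.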

## Combine
### What happens
- The outputs from each group are assembled back together.
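- The shape of the result follows from the kind of operation:
  - Fewer rows (one per group) → aggregation
  - Same rows, same index as the input → transformation
  - A subset of the original rows → filtering
- For aggregations, group keys become the index by default, or regular columns with `as_index=False`.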

### Key point
**Combine determines the structure of the result, not the math.**
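
A short sketch of how the combined result is shaped, again on a hypothetical toy frame (the `sales` data and the `total` column name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, used only to inspect result shapes.
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West", "East"],
    "sales": [10, 20, 30, 40, 50],
})

# Aggregation, default: group keys become the index.
agg_indexed = sales.groupby("region").agg(total=("sales", "sum"))

# Aggregation with as_index=False: group keys stay as a regular column.
agg_flat = sales.groupby("region", as_index=False).agg(total=("sales", "sum"))

# Transformation: the result is aligned to the original row index.
transformed = sales.groupby("region")["sales"].transform("sum")

print(agg_indexed.index.name, list(agg_flat.columns), transformed.index.equals(sales.index))
```

The math is identical in all three cases (a per-group sum); only the layout of the output differs.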

## One-Line Mental Model
> “Group rows → run logic per group → stitch results into a table.”

## Why This Matters
If you can answer these **before coding**:

- What defines the groups?
- Does my operation reduce rows or preserve them?
- Do I want group keys as index or columns?

**You will almost never be surprised by `groupby()` output.**

Checking whether `as_index=False` and `.reset_index()` produce the same result:

In [None]:
data.groupby('country')['lifetime_value'].agg(['sum']).reset_index()

In [None]:
data.groupby('country', as_index=False)['lifetime_value'].agg(['sum'])
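
Rather than comparing the two outputs by eye, the comparison can be done programmatically. This is a minimal sketch on a hypothetical `toy` frame using named aggregation; it does not read the notebook's CSV, and results for other aggregation styles or pandas versions may differ.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `data`.
toy = pd.DataFrame({
    "country": ["US", "US", "IN"],
    "lifetime_value": [100.0, 200.0, 50.0],
})

# Route 1: aggregate with keys as index, then move them back to a column.
via_reset = (
    toy.groupby("country")
       .agg(total_ltv=("lifetime_value", "sum"))
       .reset_index()
)

# Route 2: ask groupby to keep the keys as a column from the start.
via_flag = toy.groupby("country", as_index=False).agg(
    total_ltv=("lifetime_value", "sum")
)

# DataFrame.equals compares shape, values, and dtypes.
print(via_reset.equals(via_flag))
```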

## Production code

- Prefer `as_index=False`
- Prefer named aggregation
- Avoid unnecessary index mutation

Scalar vs. Series outputs: `agg` returns one value per group, while `transform` returns one value per original row.

In [None]:
df = pd.DataFrame({
    'country': ['US', 'US', 'IN', 'IN', 'IN'],
    'revenue': [100, 200, 50, 60, 90]
})

df

In [None]:
# agg: each country collapses to exactly one number
df.groupby('country', as_index=False).agg(
    total_revenue=('revenue', 'sum')
)


In [None]:
df['revenue_vs_avg'] = (
    df.groupby('country')['revenue']
      .transform(lambda x: x / x.mean())
)
df

In [None]:
df['above_avg'] = (
    df.groupby('country')['revenue']
      .transform(lambda x: x > x.mean())
)
df

In [None]:
df['revenue_mean'] = (
    df.groupby('country')['revenue']
      .transform('mean')
)

df