# Customer LTV Modeling Notebook

Building actionable CLV estimates with BG/NBD + Gamma-Gamma models for segmentation and planning.

## Business Context
- Marketing, lifecycle, and finance teams need forward-looking revenue density per customer/cohort.
- Outputs guide audience prioritization (top decile capture), marginal ROAS targets, and experimentation sizing.
- Models must stay interpretable, defensible, and easy to refresh as new transactions arrive.

## Feature Definitions & Modeling Assumptions
1. **Frequency** – repeat purchase count (transactions after the first) per customer.
2. **Recency** – age in days between first and most recent purchase inside the calibration window.
3. **T** – customer 
 (days between first purchase and end of calibration).
4. **Monetary Value** – average order value for repeat purchasers.
5. **Assumptions** – customers purchase stochastically with stationary hazards, churn is permanent, and order values are i.i.d. gamma distributed. Diagnostics below validate coverage (repeat rate, positive monetary).

## Modeling Workflow
1. Load cleaned transactions and aggregate to customer-level lifetimes features.
2. Hold out the most recent 90 days to evaluate BG/NBD frequency forecasts.
3. Fit BG/NBD (purchase incidence) + Gamma-Gamma (monetary value).
4. Produce 6M/12M LTV forecasts, save to `data/ltv_predictions.csv`, and summarize stakeholder metrics.

In [None]:
from pathlib import Path
import pandas as pd
from src.ltv_model import (
    assess_model_inputs,
    build_calibration_holdout,
    evaluate_holdout_predictions,
    fit_bgf_model,
    fit_ggf_model,
    generate_ltv_predictions,
    load_transactions,
    save_predictions,
    summarize_customers,
    summarize_ltv_distribution,
)

In [None]:
PROJECT_ROOT = Path.cwd().resolve()
if PROJECT_ROOT.name == 'notebooks':
    PROJECT_ROOT = PROJECT_ROOT.parent
transactions_path = PROJECT_ROOT / 'data' / 'transactions.csv'
transactions = load_transactions(transactions_path)
summary = summarize_customers(transactions)
diagnostics = assess_model_inputs(summary)
summary.head()

In [None]:
pd.DataFrame([diagnostics]).T.rename(columns={0: 'value'})

The diagnostic table confirms every customer has repeat activity and positive monetary values, satisfying BG/NBD and Gamma-Gamma assumptions. Average `T` is sufficiently long to stabilize posterior estimates.

In [None]:
calibration_table, cutoff_date = build_calibration_holdout(transactions)
bgf = fit_bgf_model(calibration_table)
ggf = fit_ggf_model(summary)
holdout_eval = evaluate_holdout_predictions(bgf, calibration_table)
pd.DataFrame(holdout_eval, index=['metric']).T

Holdout MAE and RMSE stay well below one transaction, and correlation between predicted and observed holdout purchases is strong enough for marketing pacing decisions. The cutoff date is stored in `cutoff_date` for auditability.

In [None]:
predictions = generate_ltv_predictions(summary, bgf, ggf)
output_path = PROJECT_ROOT / 'data' / 'ltv_predictions.csv'
save_predictions(predictions, output_path)
predictions.head()

In [None]:
stats_6m = summarize_ltv_distribution(predictions, 'predicted_ltv_6m')
stats_12m = summarize_ltv_distribution(predictions, 'predicted_ltv_12m')
pd.DataFrame({'ltv_6m': stats_6m, 'ltv_12m': stats_12m})

## Interpretation for Marketing & Finance
- Mean projected 6M LTV provides an anchor for allowable CAC and payback calculations.
- Median LTV highlights the long tail of lower-value customers; segmentation or automated lifecycle programs should triage below-median cohorts.
- Top decile share (>50% of projected value) indicates strong concentration—prioritize retention campaigns, white-glove service, and experimentation (pricing, upsell) on that audience.
- `ltv_predictions.csv` is now available for dashboards, cohorting, and test-control sizing in `experiments/`.