# Generalized Riesz Regression (GRR) — End-to-end examples

This notebook demonstrates how to use the `genriesz` package to estimate several common functionals:

- **ATE** (average treatment effect)
- **AME** (average marginal effect / average derivative)
- **Average policy effect**

## Installation

If you are running this notebook from the repository, an editable install is convenient:

```bash
pip install -e ".[sklearn,torch]"
```

(You can omit `sklearn` or `torch` if you are not running those sections.)


The notebook also shows how to plug in different **bases**:

- Polynomial basis
- RKHS-style RBF features (random Fourier features)
- Random-forest leaf indicator features (scikit-learn)
- Neural-network embedding features (PyTorch)

Finally, it shows how to report **DM / IPW / AIPW** simultaneously with **cross-fitting**, **confidence intervals**, and **p-values**.

> Notes  
> - The **link function is automatic**: it is induced by the chosen Bregman generator (e.g. `UKLGenerator`).  
> - To keep **exact ACB structure** for the chosen basis, the GRR solver stays **linear in parameters**.  
>   Random forests / neural nets are used as **fixed feature maps** (bases), not trained end-to-end inside the GRR objective.
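
To make the "linear in parameters" remark concrete, here is a minimal NumPy sketch of the squared-loss special case for the ATE, where the link is the identity and the Riesz representer is modeled as `alpha(x) = phi(x) @ beta`. It is an illustration under stated assumptions, not the `genriesz` solver: the hand-built basis `phi`, the ridge penalty `lam`, and the closed-form solve are all choices made for this example.

```python
import numpy as np

def phi(d, Z):
    """Hypothetical basis phi(D, Z) = [1, D, Z, D * Z] (illustration only)."""
    d = np.asarray(d, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones_like(d), d, Z, d * Z])

rng = np.random.default_rng(0)
n = 2000
Z = rng.normal(size=(n, 2))
e = 1.0 / (1.0 + np.exp(-0.6 * Z[:, 0]))          # true propensity score
D = rng.binomial(1, e).astype(float)

Phi = phi(D, Z)                                    # basis at the observed X = (D, Z)
m_phi = phi(np.ones(n), Z) - phi(np.zeros(n), Z)   # ATE functional applied to each basis function

# Squared-loss Riesz objective: mean(alpha(X)^2 - 2 * m(X, alpha)) + lam * ||beta||^2.
# With alpha = Phi @ beta this is quadratic in beta, so the minimizer is a ridge-type solve.
lam = 1e-3
G = Phi.T @ Phi / n
b = m_phi.mean(axis=0)
beta = np.linalg.solve(G + lam * np.eye(G.shape[1]), b)

alpha_hat = Phi @ beta                             # estimated Riesz weights
alpha_true = D / e - (1.0 - D) / (1.0 - e)         # known representer for the ATE
print("corr(alpha_hat, alpha_true):", np.corrcoef(alpha_hat, alpha_true)[0, 1])
print("balance gap:", np.abs(Phi.T @ alpha_hat / n - b).max())
```

The balance gap printed at the end is small because the first-order condition forces the weighted basis means to match the functional's basis means up to the ridge term. Other generators change the link but, per the note above, keep the model linear in `beta`.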


In [1]:
import sys
from pathlib import Path

# --- Optional: enable local "src/" imports when running from the repository ---
root = Path.cwd().resolve()
for _ in range(6):
    if (root / "src" / "genriesz").exists():
        sys.path.insert(0, str(root / "src"))
        break
    root = root.parent

import numpy as np

from genriesz import (
    grr_functional,
    grr_ate,
    grr_ame,
    grr_policy_effect,
    ATEFunctional,
    AverageDerivativeFunctional,
    PolicyEffectFunctional,
    PolynomialBasis,
    RBFRandomFourierBasis,
    TreatmentInteractionBasis,
    SquaredGenerator,
    UKLGenerator,
)

np.set_printoptions(precision=4, suppress=True)



## Synthetic data generators

We will use small synthetic datasets so the notebook runs quickly.


In [2]:
def make_ate_data(n=800, d_z=5, seed=0):
    """Binary-treatment DGP with constant ATE tau."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, d_z))

    logits = 0.6 * Z[:, 0] - 0.4 * Z[:, 1] + 0.2 * Z[:, 2]
    e = 1.0 / (1.0 + np.exp(-logits))
    D = rng.binomial(1, e, size=n).astype(float)

    tau = 1.0
    mu0 = np.sin(Z[:, 0]) + 0.25 * Z[:, 1] ** 2
    Y = mu0 + tau * D + rng.normal(scale=1.0, size=n)

    X = np.column_stack([D, Z])  # X = [D, Z]
    return X, Y, tau, Z, D


def make_ame_data(n=1200, d_z=3, seed=0):
    """Continuous-treatment DGP with known AME."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, d_z))

    D = 0.7 * Z[:, 0] - 0.3 * Z[:, 1] + rng.normal(scale=1.0, size=n)
    mu = np.sin(Z[:, 0]) + 0.25 * Z[:, 1] ** 2 + (1.0 + 0.5 * Z[:, 1]) * D
    Y = mu + rng.normal(scale=1.0, size=n)

    # True AME = E[1 + 0.5 Z1] = 1 (because E[Z1]=0)
    true_ame = 1.0

    X = np.column_stack([D, Z])  # X = [D, Z]
    return X, Y, true_ame


def make_policy_data(n=1200, d_z=3, seed=0):
    """Binary-treatment DGP with heterogeneous effects for a policy-effect demo."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, d_z))

    logits = 0.6 * Z[:, 0] - 0.4 * Z[:, 1]
    e = 1.0 / (1.0 + np.exp(-logits))
    D = rng.binomial(1, e, size=n).astype(float)

    tau = 1.0 + 0.5 * Z[:, 0]  # heterogeneous
    mu0 = Z[:, 0] + 0.25 * Z[:, 1] ** 2
    Y = mu0 + tau * D + rng.normal(scale=1.0, size=n)

    X = np.column_stack([D, Z])
    return X, Y, Z
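
As a quick sanity check on why these designs need adjustment at all, the sketch below re-simulates the structural equations of `make_ate_data` with a larger sample and compares the naive difference in means with an oracle IPW estimate that uses the true propensity score. The sample size and the oracle weights are assumptions made for this illustration; nothing here calls `genriesz`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large n so the confounding bias is visible above sampling noise
Z = rng.normal(size=(n, 5))

# Same structural equations as make_ate_data.
logits = 0.6 * Z[:, 0] - 0.4 * Z[:, 1] + 0.2 * Z[:, 2]
e = 1.0 / (1.0 + np.exp(-logits))
D = rng.binomial(1, e).astype(float)
tau = 1.0
Y = np.sin(Z[:, 0]) + 0.25 * Z[:, 1] ** 2 + tau * D + rng.normal(size=n)

naive = Y[D == 1].mean() - Y[D == 0].mean()              # ignores confounding through Z
oracle_ipw = np.mean((D / e - (1.0 - D) / (1.0 - e)) * Y)  # uses the true propensity score

print(f"true ATE: {tau:.3f}  naive: {naive:.3f}  oracle IPW: {oracle_ipw:.3f}")
```

The naive contrast is biased upward because treated units tend to have larger `Z[:, 0]`, which also raises the baseline outcome through `sin(Z[:, 0])`.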

## Example 1 — ATE with a polynomial basis (and DM/IPW/AIPW)

Here we use:

- `m = ATEFunctional(...)`
- a polynomial base basis on `Z`
- `TreatmentInteractionBasis` to build `φ(D,Z) = [1, D, ψ(Z), D·ψ(Z)]`
- `UKLGenerator` (automatic ACB link)
- `grr_functional` to compute **DM / IPW / AIPW** with cross-fitting

We also demonstrate `outcome_models="both"` so that the output includes:

- **shared** outcome model: the default penalized linear model using the same basis as GRR
- **separate** outcome model: a user-supplied model (here: a random forest)
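
The interaction structure `φ(D,Z) = [1, D, ψ(Z), D·ψ(Z)]` can be sketched directly in NumPy. The helper names `psi` and `treatment_interaction` below are hypothetical, and the base basis is a simplified stand-in (linear and squared terms only), so this shows the layout of the columns rather than the exact output of `TreatmentInteractionBasis`.

```python
import numpy as np

def psi(Z):
    """Hypothetical base basis on Z: linear and squared terms, no bias column."""
    return np.hstack([Z, Z ** 2])

def treatment_interaction(X, treatment_index=0):
    """Build phi(D, Z) = [1, D, psi(Z), D * psi(Z)] from X = [D, Z]."""
    D = X[:, [treatment_index]]                  # keep as a column vector
    Z = np.delete(X, treatment_index, axis=1)
    P = psi(Z)
    return np.hstack([np.ones_like(D), D, P, D * P])

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 2))
D = np.array([0.0, 1.0, 1.0, 0.0])
X = np.column_stack([D, Z])

Phi = treatment_interaction(X)
print(Phi.shape)  # (4, 10): 1 + 1 + 4 + 4 columns
```

For control rows the interaction block is identically zero, which is what lets the same coefficient vector describe both treatment arms.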


In [3]:
from sklearn.ensemble import RandomForestRegressor

X, Y, tau, Z, D = make_ate_data(n=800, d_z=5, seed=0)

m = ATEFunctional(treatment_index=0)

psi = PolynomialBasis(degree=2, include_bias=False)
phi = TreatmentInteractionBasis(base_basis=psi)

gen = UKLGenerator(C=1.0, branch_fn=lambda x: int(x[0] == 1.0)).as_generator()

rf_outcome = RandomForestRegressor(
    n_estimators=200,
    max_depth=6,
    min_samples_leaf=10,
    random_state=0,
)

res_poly = grr_functional(
    X=X,
    Y=Y,
    m=m,
    basis=phi,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="both",     # fit shared + separate outcome models
    outcome_model=rf_outcome,  # used for the separate DM/AIPW estimates
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=300,
    tol=1e-9,
)

print("True ATE (by construction):", tau)
print(res_poly.summary_text())

True ATE (by construction): 1.0
n=800, cross_fit=True, folds=3, alpha=0.05, null=0.0
AIPW (separate):  0.889124  (se=0.148253)  CI[95%]=0.598554,1.179694  p=2.006e-09
 AIPW (shared):  0.972232  (se=0.138349)  CI[95%]=0.701072,1.243391  p=2.105e-12
 DM (separate):  0.919782  (se=0.013116)  CI[95%]=0.894075,0.945489  p=0
   DM (shared):  0.913845  (se=0.014611)  CI[95%]=0.885209,0.942482  p=0
           IPW:  0.925385  (se=0.148894)  CI[95%]=0.633559,1.217212  p=5.13e-10
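
To see how the three rows relate, the sketch below computes DM, IPW, and AIPW scores for the ATE from an outcome regression `gamma` and a Riesz representer `alpha`, then forms a Wald interval and a two-sided normal p-value from the score mean and standard deviation. It uses oracle nuisances on a toy DGP and a hypothetical helper `wald_summary`; it is the textbook construction and is not taken from the `genriesz` source, which in addition estimates both nuisances with cross-fitting.

```python
import math
import numpy as np

def gamma(d, z):
    """Oracle outcome regression E[Y | D=d, Z=z] for this toy DGP."""
    return np.sin(z) + (1.0 + 0.5 * z) * d

rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-0.6 * Z))           # true propensity score
D = rng.binomial(1, e).astype(float)
Y = gamma(D, Z) + rng.normal(size=n)         # true ATE = E[1 + 0.5 Z] = 1

alpha = D / e - (1.0 - D) / (1.0 - e)        # oracle Riesz representer for the ATE
m_gamma = gamma(1.0, Z) - gamma(0.0, Z)      # m(X, gamma) for the ATE

scores = {
    "dm": m_gamma,                                   # plug-in
    "ipw": alpha * Y,                                # weighting only
    "aipw": m_gamma + alpha * (Y - gamma(D, Z)),     # plug-in plus weighted residual
}

def wald_summary(s, null=0.0, z_crit=1.959964):
    """Estimate, standard error, 95% Wald interval, two-sided normal p-value."""
    est = float(np.mean(s))
    se = float(np.std(s, ddof=1) / math.sqrt(len(s)))
    p = math.erfc(abs(est - null) / (se * math.sqrt(2.0)))
    return est, se, (est - z_crit * se, est + z_crit * se), p

for name, s in scores.items():
    est, se, ci, p = wald_summary(s)
    print(f"{name:>5}: {est:.4f} (se={se:.4f}) CI=({ci[0]:.4f}, {ci[1]:.4f}) p={p:.3g}")
```

Note that the DM standard error here reflects only the spread of the plug-in scores, not the uncertainty from estimating `gamma`, which is one reason DM intervals can look much narrower than AIPW intervals.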


## Example 2 — ATE with an RKHS-style RBF basis (random Fourier features)

This approximates an RBF kernel feature map using random Fourier features.
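
For intuition, here is a minimal sketch of the random Fourier feature construction itself, checked against the exact RBF kernel `exp(-||x - y||^2 / (2 * sigma^2))`. The function `rff_features` is hypothetical, and the way `sigma` enters is an assumption; `RBFRandomFourierBasis` may parameterize the bandwidth or standardize inputs differently.

```python
import numpy as np

def rff_features(X, n_features, sigma, rng):
    """Random Fourier features z(x) with E[z(x) @ z(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # frequencies ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)       # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sigma = 1.0

# Many features so the Monte Carlo error of the kernel approximation is small.
Phi = rff_features(X, n_features=20_000, sigma=sigma, rng=rng)
K_approx = Phi @ Phi.T

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))

print("max abs error:", np.abs(K_approx - K_exact).max())
```

The approximation error shrinks at roughly the rate `1 / sqrt(n_features)`, so a basis with 150 features is a fairly coarse approximation of the kernel.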


In [4]:
X, Y, tau, Z, D = make_ate_data(n=800, d_z=5, seed=1)

psi_rff = RBFRandomFourierBasis(
    n_features=150,
    sigma=1.0,
    standardize=True,
    random_state=0,
)
phi_rff = TreatmentInteractionBasis(base_basis=psi_rff)

gen = UKLGenerator(C=1.0, branch_fn=lambda x: int(x[0] == 1.0)).as_generator()

res_rff = grr_ate(
    X=X,
    Y=Y,
    basis=phi_rff,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="shared",
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=250,
    tol=1e-9,
)

print("True ATE (by construction):", tau)
print(res_rff.summary_text())

True ATE (by construction): 1.0
n=800, cross_fit=True, folds=3, alpha=0.05, null=0.0
 AIPW (shared): -7.701880  (se=15.568555)  CI[95%]=-38.215686,22.811927  p=0.6208
   DM (shared):  1.024989  (se=0.020527)  CI[95%]=0.984757,1.065222  p=0
           IPW:  32.225247  (se=16.051944)  CI[95%]=0.764014,63.686480  p=0.04469

In this run the IPW and AIPW estimates are far from the true ATE and have very large standard errors, while DM stays close to 1. That pattern usually points to a few extreme Riesz weights; with 150 random features and `riesz_lam=1e-3`, stronger regularization or fewer features may be needed before the weighted estimators can be trusted.


## Example 3 — Random forest leaf basis for the Riesz model (scikit-learn)

You can use a tree ensemble as a **feature map** by encoding **leaf indices** as one-hot features.

A simple pattern is:

1. Fit a random forest **propensity model**: predict `D` from `Z`.
2. Convert leaf indices into one-hot features `ψ(Z)`.
3. Build `φ(D,Z)` with treatment interactions.
4. Run GRR with the same (automatic) generator.

This keeps GRR convex in `β` while giving a flexible, data-adaptive basis.
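
Steps 1 and 2 can be sketched with plain scikit-learn, using `apply` to read off leaf indices and `OneHotEncoder` to turn them into indicator features. This is an illustration of the idea only; how `RandomForestLeafBasis` encodes leaves internally is not shown in this notebook, and the toy data below are an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 5))
D = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.6 * Z[:, 0])))

# 1) Propensity-style forest: predict D from Z.
rf = RandomForestClassifier(
    n_estimators=5, max_depth=3, min_samples_leaf=10, random_state=0
)
rf.fit(Z, D)

# 2) Leaf indices (one column per tree), then one-hot encode each column.
leaves = rf.apply(Z)                                    # shape (n_samples, n_estimators)
encoder = OneHotEncoder(handle_unknown="ignore").fit(leaves)
Psi = encoder.transform(leaves).toarray()               # psi(Z): leaf-indicator features

print(leaves.shape, Psi.shape)
print("active features per row:", np.unique(Psi.sum(axis=1)))
```

Each row activates exactly one leaf per tree, so the number of columns grows with both the number of trees and the tree depth; this is why the example below uses a deliberately small forest.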


In [20]:
from sklearn.ensemble import RandomForestClassifier
from genriesz.sklearn_basis import RandomForestLeafBasis

X, Y, tau, Z, D = make_ate_data(n=800, d_z=5, seed=2)

# Fit a modest forest to avoid an excessively large one-hot basis.
rf = RandomForestClassifier(
    n_estimators=5,
    max_depth=3,
    min_samples_leaf=10,
    random_state=0,
)

leaf_basis = RandomForestLeafBasis(rf, include_bias=False)
leaf_basis.fit(Z, D)

phi_leaf = TreatmentInteractionBasis(base_basis=leaf_basis)

gen = UKLGenerator(C=1.0, branch_fn=lambda x: int(x[0] == 1.0)).as_generator()

res_leaf = grr_ate(
    X=X,
    Y=Y,
    basis=phi_leaf,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="shared",
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=250,
    tol=1e-9,
)

print("True ATE (by construction):", tau)
print(res_leaf.summary_text())

True ATE (by construction): 1.0
n=800, cross_fit=True, folds=3, alpha=0.05, null=0.0
 AIPW (shared):  0.924534  (se=0.151375)  CI[95%]=0.627846,1.221223  p=1.011e-09
   DM (shared):  0.901676  (se=0.020098)  CI[95%]=0.862284,0.941068  p=0
           IPW:  0.827351  (se=0.152154)  CI[95%]=0.529135,1.125566  p=5.4e-08


## Example 4 — Neural network embedding basis (PyTorch)

A recommended way to use neural networks **without breaking the GLM-style GRR structure** is:

1. Train an embedding network `ψ(Z)` on an auxiliary task (e.g., predict `D` from `Z`).
2. Freeze the embedding network.
3. Use its output as a fixed basis inside GRR.

Below we train a small MLP embedding on the treatment prediction task.


In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

from genriesz.torch_basis import MLPEmbeddingNet, TorchEmbeddingBasis

X, Y, tau, Z, D = make_ate_data(n=800, d_z=5, seed=3)

torch.manual_seed(0)

# --- 1) Train an embedding network to predict D from Z ---
embed = MLPEmbeddingNet(input_dim=Z.shape[1], hidden_dim=64, output_dim=16)
head = nn.Linear(16, 1)
model = nn.Sequential(embed, head)

xt = torch.tensor(Z, dtype=torch.float32)
yt = torch.tensor(D.reshape(-1, 1), dtype=torch.float32)

loader = DataLoader(TensorDataset(xt, yt), batch_size=256, shuffle=True)
criterion = nn.BCEWithLogitsLoss()
opt = optim.Adam(model.parameters(), lr=1e-3)

model.train()
for _epoch in range(4):
    for xb, yb in loader:
        opt.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        opt.step()

# Freeze the embedding
embed.eval()
for p in embed.parameters():
    p.requires_grad_(False)

# --- 2) Wrap as a NumPy-returning basis ψ(Z) ---
psi_nn = TorchEmbeddingBasis(embed, device="cpu", include_bias=False)

# --- 3) Build φ(D,Z) and run GRR ---
phi_nn = TreatmentInteractionBasis(base_basis=psi_nn)
gen = UKLGenerator(C=1.0, branch_fn=lambda x: int(x[0] == 1.0)).as_generator()

res_nn = grr_ate(
    X=X,
    Y=Y,
    basis=phi_nn,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="shared",
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=250,
    tol=1e-9,
)

print("True ATE (by construction):", tau)
print(res_nn.summary_text())

True ATE (by construction): 1.0
n=800, cross_fit=True, folds=3, alpha=0.05, null=0.0
 AIPW (shared):  1.065794  (se=0.099129)  CI[95%]=0.871505,1.260084  p=0
   DM (shared):  1.207156  (se=0.006935)  CI[95%]=1.193564,1.220748  p=0
           IPW:  1.161960  (se=0.129423)  CI[95%]=0.908295,1.415625  p=0


## Example 5 — AME (average marginal effect / average derivative)

We estimate the average derivative of the regression function with respect to the treatment coordinate.

Here we use `SquaredGenerator`, which induces a linear link.
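
The target can be checked numerically on the true regression function of `make_ame_data`. The sketch below approximates the average derivative with a central difference of step `eps`; that the package uses this particular finite-difference scheme for its `eps` argument is an assumption, and `average_derivative` is a hypothetical helper.

```python
import numpy as np

def mu(X):
    """Regression function of make_ame_data, with X = [D, Z]."""
    D, Z = X[:, 0], X[:, 1:]
    return np.sin(Z[:, 0]) + 0.25 * Z[:, 1] ** 2 + (1.0 + 0.5 * Z[:, 1]) * D

def average_derivative(gamma, X, coordinate=0, eps=1e-4):
    """Central-difference approximation of the sample mean of d gamma / d x_coordinate."""
    X_plus, X_minus = X.copy(), X.copy()
    X_plus[:, coordinate] += eps
    X_minus[:, coordinate] -= eps
    return float(np.mean((gamma(X_plus) - gamma(X_minus)) / (2.0 * eps)))

rng = np.random.default_rng(0)
Z = rng.normal(size=(1200, 3))
D = 0.7 * Z[:, 0] - 0.3 * Z[:, 1] + rng.normal(size=1200)
X = np.column_stack([D, Z])

ame_fd = average_derivative(mu, X)
ame_exact = float(np.mean(1.0 + 0.5 * Z[:, 1]))  # analytic derivative averaged over the sample
print(ame_fd, ame_exact)
```

Because `mu` is linear in the treatment, the central difference agrees with the analytic derivative up to floating-point error; for a nonlinear regression function the step size would trade truncation error against rounding error.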


In [7]:
X, Y, true_ame = make_ame_data(n=1200, d_z=3, seed=0)

phi = PolynomialBasis(degree=2, include_bias=True)
gen = SquaredGenerator(C=0.0).as_generator()

res_ame = grr_ame(
    X=X,
    Y=Y,
    coordinate=0,
    eps=1e-4,
    basis=phi,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="shared",
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=250,
    tol=1e-9,
)

print("True AME (by construction):", true_ame)
print(res_ame.summary_text())

True AME (by construction): 1.0
n=1200, cross_fit=True, folds=3, alpha=0.05, null=0.0
 AIPW (shared):  0.996180  (se=0.034650)  CI[95%]=0.928266,1.064093  p=0
   DM (shared):  1.003363  (se=0.015097)  CI[95%]=0.973773,1.032952  p=0
           IPW:  1.026086  (se=0.073244)  CI[95%]=0.882531,1.169642  p=0


## Example 6 — Average policy effect

We compare two policies:

- `π1(z)`: treat if `z0 > 0`
- `π0(z)`: never treat

This reduces to the ATE when `π1(z)=1` and `π0(z)=0`.
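
The notebook does not print a true value for this example, so the sketch below computes one for these two policies under the `make_policy_data` design. It assumes the policy effect is the difference in policy values, `E[(π1(Z) - π0(Z)) · (γ(1,Z) - γ(0,Z))]`, which matches the ATE reduction stated above; whether `PolicyEffectFunctional` is defined exactly this way is not confirmed here.

```python
import math
import numpy as np

def pi1(Z):
    return (Z[:, 0] > 0.0).astype(float)  # treat if z0 > 0

def pi0(Z):
    return np.zeros(len(Z))               # never treat

def gamma(d, Z):
    """True regression function of make_policy_data: mu0(Z) + tau(Z) * d."""
    return Z[:, 0] + 0.25 * Z[:, 1] ** 2 + (1.0 + 0.5 * Z[:, 0]) * d

rng = np.random.default_rng(0)
Z = rng.normal(size=(200_000, 3))

# Policy value V(pi) = E[pi(Z) * gamma(1, Z) + (1 - pi(Z)) * gamma(0, Z)], hence
# V(pi1) - V(pi0) = E[(pi1(Z) - pi0(Z)) * (gamma(1, Z) - gamma(0, Z))].
contrast = gamma(1.0, Z) - gamma(0.0, Z)
effect_mc = float(np.mean((pi1(Z) - pi0(Z)) * contrast))

# Closed form for this DGP: E[1{Z0 > 0} * (1 + 0.5 * Z0)] = 0.5 + 0.5 / sqrt(2 * pi).
effect_exact = 0.5 + 0.5 / math.sqrt(2.0 * math.pi)
print(effect_mc, effect_exact)
```

Under these assumptions the target is roughly 0.70, which gives a reference point for reading the estimates below.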


In [8]:
X, Y, Z = make_policy_data(n=1200, d_z=3, seed=1)

def pi1(z: np.ndarray) -> float:
    return float(z[0] > 0.0)

def pi0(_z: np.ndarray) -> float:
    return 0.0

# ATE-style basis on Z + treatment interactions.
psi = PolynomialBasis(degree=2, include_bias=False)
phi = TreatmentInteractionBasis(base_basis=psi)

gen = UKLGenerator(C=1.0, branch_fn=lambda x: int(x[0] == 1.0)).as_generator()

res_policy = grr_policy_effect(
    X=X,
    Y=Y,
    policy_1=pi1,
    policy_0=pi0,
    treatment_index=0,
    basis=phi,
    generator=gen,
    cross_fit=True,
    folds=3,
    estimators=("dm", "ipw", "aipw"),
    outcome_models="shared",
    riesz_penalty="l2",
    riesz_lam=1e-3,
    max_iter=250,
    tol=1e-9,
)

print(res_policy.summary_text())

n=1200, cross_fit=True, folds=3, alpha=0.05, null=0.0
 AIPW (shared): -8914.019643  (se=8301.731779)  CI[95%]=-25185.114940,7357.075654  p=0.2829
   DM (shared):  0.760382  (se=0.023229)  CI[95%]=0.714854,0.805910  p=0
           IPW:  117278.585540  (se=105582.285974)  CI[95%]=-89658.892374,324216.063454  p=0.2667

The IPW and AIPW rows in this run are numerically unstable (estimates and standard errors in the thousands), so only the DM row is informative here. As in Example 2, this suggests extreme Riesz weights under the current basis and `riesz_lam=1e-3`, and the settings would need tuning before these two estimators are usable.


## Example 7 — Nearest-neighbor matching as a Riesz/LSIF special case

The paper shows that classical NN matching weights can be interpreted through a
squared-loss Riesz / LSIF construction using a *catchment-area* indicator basis.

Here we compute the matching-style IPW estimate

$$
\hat\tau = \frac{1}{n}\sum_i (2D_i-1)\,\hat w_i\,Y_i,
\qquad \hat w_i = 1 + \frac{K_M(i)}{M},
$$

where $K_M(i)$ is the matched-times count (how often unit $i$ is selected among the
$M$ nearest neighbors from the opposite treatment arm).

We obtain $K_M$ using `genriesz.KNNCatchmentBasis`.
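
The matched-times count can also be computed by brute force, which makes the definition explicit. The sketch below assumes Euclidean distance and breaks ties by index order; `matched_counts` is a hypothetical helper, and `KNNCatchmentBasis` may handle distances or ties differently.

```python
import numpy as np

def matched_counts(Z_ref, Z_query, M=1):
    """K_M(i): how often reference unit i is among the M nearest neighbors of a query unit."""
    dists = ((Z_query[:, None, :] - Z_ref[None, :, :]) ** 2).sum(axis=-1)  # (n_query, n_ref)
    nearest = np.argsort(dists, axis=1)[:, :M]                             # indices into Z_ref
    return np.bincount(nearest.ravel(), minlength=len(Z_ref)).astype(float)

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))
D = rng.binomial(1, 0.5, size=200)
M = 1

K_t = matched_counts(Z[D == 1], Z[D == 0], M)  # controls matched to each treated unit
K_c = matched_counts(Z[D == 0], Z[D == 1], M)  # treated matched to each control unit

w = np.empty(len(Z))
w[D == 1] = 1.0 + K_t / M
w[D == 0] = 1.0 + K_c / M
print("sum of weights per arm:", w[D == 1].sum(), w[D == 0].sum())
```

Each query unit contributes exactly `M` matches, so the weights in each arm sum to the full sample size `n`, mirroring the normalization of inverse-propensity weights.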


In [9]:
from genriesz import KNNCatchmentBasis

X, Y, tau, Z, D = make_ate_data(n=2000, d_z=5, seed=0)
D_int = D.astype(int)

M = 1
idx_t = np.flatnonzero(D_int == 1)
idx_c = np.flatnonzero(D_int == 0)

# Treated weights: how many controls match to each treated unit?
basis_t = KNNCatchmentBasis(n_neighbors=M).fit(Z[idx_t])
Phi_ct = basis_t(Z[idx_c])
K_t = Phi_ct.sum(axis=0)
w_t = 1.0 + K_t / float(M)

# Control weights: how many treated match to each control unit?
basis_c = KNNCatchmentBasis(n_neighbors=M).fit(Z[idx_c])
Phi_tc = basis_c(Z[idx_t])
K_c = Phi_tc.sum(axis=0)
w_c = 1.0 + K_c / float(M)

w = np.empty(len(Z), dtype=float)
w[idx_t] = w_t
w[idx_c] = w_c

scores = (2.0 * D_int - 1.0) * w * Y
ate_hat = float(np.mean(scores))
se = float(np.std(scores, ddof=1) / np.sqrt(len(scores)))

print("True ATE:", tau)
print(f"NN matching (M={M}) ATE:", ate_hat)
print("Naive SE:", se)
print("Naive 95% CI:", (ate_hat - 1.96 * se, ate_hat + 1.96 * se))


True ATE: 1.0
NN matching (M=1) ATE: 1.115378298501986
Naive SE: 0.07261994454906012
Naive 95% CI: (0.9730432071858282, 1.2577133898181438)


## Notes on using non-linear learners

- **Random forest / neural network as a Riesz model**:  
  In this package the core GRR solver is **linear-in-parameters**, so the Riesz model is specified by the **basis**.  
  To use forests / neural nets, plug them in as **fixed feature maps** (Examples 3 and 4).

- **Random forest / neural network as an outcome model**:  
  Pass any `sklearn`-style regressor as `outcome_model=...`.  
  Use `outcome_models="both"` to report AIPW/DM with both the shared linear outcome model and your custom outcome model.
