# Simulation Example: Bias Correction in Regression with Generated Binary Labels

This simulation example demonstrates the use of the `ValidMLInference` package for correcting bias and performing valid inference in regression models with generated binary labels.

The example is based on the simulation design in [Battaglia, Christensen, Hansen & Sacher (2024)](https://arxiv.org/abs/2402.15585). Data are generated according to the model `Y = β0 + β1 * X + (σ1 X + σ0 (1 - X)) * u`, where `u` is a standard normal random variable. Parameter values are set to match the empirical example in the paper.

In the main sample, the true variable `X` is latent and only a predicted label `Xhat` is observed. `Xhat` is generated with false positive rate `fpr`, which in this design is the probability that `Xhat = 1` while `X = 0`.

We also generate a smaller validation sample in which both `X` and `Xhat` are observed. This sample is used to estimate `fpr`.
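
As a minimal sketch of that estimation step, the snippet below computes the share of validation observations with `Xhat = 1` and `X = 0`, which is the same quantity the simulation loop later passes to the bias-corrected estimators. The arrays `X_val` and `Xhat_val` and the flip probability of 0.01 are hypothetical values chosen only for illustration; they are not the simulation design used in this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only so the sketch is reproducible

m = 1000  # validation sample size

# Hypothetical true labels with P(X = 1) = 0.05
X_val = (rng.random(m) < 0.05).astype(float)

# Hypothetical predicted labels: copy X, then flip a few true zeros to ones
flip = rng.random(m) < 0.01
Xhat_val = np.where((X_val == 0.0) & flip, 1.0, X_val)

# Estimated false positive rate: share of observations with Xhat = 1 and X = 0
fpr_hat = np.mean(Xhat_val * (1.0 - X_val))

# The same number, obtained by counting false positives directly
n_false_pos = np.sum((Xhat_val == 1.0) & (X_val == 0.0))
print(fpr_hat, n_false_pos / m)
```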

We generate `nsim` data sets, each with `n` observations in the main sample and `m` observations from which to estimate `fpr`. 

In [None]:
from ValidMLInference import ols, ols_bca, ols_bcm, one_step
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from math import sqrt

Set parameter values and pre-allocate storage for simulation results:

In [10]:
nsim    = 1000
n       = 16000      # main sample size
m       = 1000       # validation sample size for estimating false positive rate
p       = 0.05       # P(X=1)
kappa   = 1.0        # relative strength of measurement error
fpr     = kappa / sqrt(n)

β0, β1       = 10.0, 1.0
σ0, σ1       = 0.3, 0.5

# pre-allocate storage: (sim × 4 methods × 2 coefficients)
B = np.zeros((nsim, 4, 2))
S = np.zeros((nsim, 4, 2))

Function to generate data:

In [11]:
def generate_data(n, m, p, fpr, β0, β1, σ0, σ1):
    """
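    Generate one simulated data set.

    Parameters:
      n, m: ints (sizes of the main and validation samples)
      p: float, P(X = 1)
      fpr: float, P(Xhat = 1, X = 0)
      β0, β1: floats (intercept and slope)
      σ0, σ1: floats (error standard deviations when X = 0 and X = 1)

    Returns:
      A tuple: ((train_Y, train_X), (test_Xhat, test_X))
      where train_X holds the generated labels Xhat for the main sample,
      and test_Xhat, test_X hold Xhat and X for the validation sample.
      Each label array has shape (k, 1); no constant column is included.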
    """
    N = n + m
    X    = np.zeros(N)
    Xhat = np.zeros(N)
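    u    = np.random.rand(N)

    # partition [0, 1] so that P(X=1) = P(Xhat=1) = p
    # and each type of labelling error has probability fpr
    for j in range(N):
        if   u[j] <= fpr:          # false negative: X = 1, Xhat = 0
            X[j] = 1.0
        elif u[j] <= 2*fpr:        # false positive: X = 0, Xhat = 1
            Xhat[j] = 1.0
        elif u[j] <= p + fpr:      # correctly labelled positive
            X[j] = 1.0
            Xhat[j] = 1.0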

    eps = np.random.randn(N)
    Y   = β0 + β1*X + (σ1*X + σ0*(1.0 - X))*eps

    # split into train vs test
    train_Y   = Y[:n]

    train_X   = Xhat[:n].reshape(-1, 1)
    test_Xhat = Xhat[n:].reshape(-1, 1)
    test_X    = X[n:].reshape(-1, 1)

    return (train_Y, train_X), (test_Xhat, test_X)
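
The label-generating loop in `generate_data` can be sanity-checked with a vectorized sketch of the same partition of the unit interval. This check is not part of the original example; the sample size `N` and the seed are arbitrary choices, and the empirical shares should come out close to `p` and `fpr` rather than exactly equal to them.

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(12345)  # arbitrary seed for reproducibility

N = 1_000_000             # large sample so empirical shares are precise
p = 0.05                  # P(X = 1)
fpr = 1.0 / sqrt(16000)   # kappa / sqrt(n) with kappa = 1, n = 16000

u = rng.random(N)

# Same partition as the loop:
#   u <= fpr              -> X = 1, Xhat = 0 (false negative)
#   fpr < u <= 2*fpr      -> X = 0, Xhat = 1 (false positive)
#   2*fpr < u <= p + fpr  -> X = 1, Xhat = 1
X = ((u <= fpr) | ((u > 2 * fpr) & (u <= p + fpr))).astype(float)
Xhat = ((u > fpr) & (u <= p + fpr)).astype(float)

share_X = X.mean()                    # should be close to p
share_Xhat = Xhat.mean()              # should be close to p
share_fp = np.mean(Xhat * (1.0 - X))  # should be close to fpr
share_fn = np.mean(X * (1.0 - Xhat))  # should be close to fpr

print(share_X, share_Xhat, share_fp, share_fn)
```

Under this partition the true and generated labels have the same marginal frequency, and false positives and false negatives are equally likely.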

Generate data, implement methods, and store results:

In [12]:
def update_results(B, S, b, V, i, method_idx):
    """
    Store coefficient estimates and their SEs into B and S.
    B,S have shape (nsim, nmethods, max_n_coefs).
    b is length d <= max_n_coefs.  V is d×d.
    """
    d = b.shape[0]
    for j in range(d):
        B[i, method_idx, j] = b[j]
        S[i, method_idx, j] = np.sqrt(max(V[j, j], 0.0))

for i in range(nsim):
    (tY, tX), (eXhat, eX) = generate_data(
        n, m, p, fpr, β0, β1, σ0, σ1
    )

    # Method 1: run OLS on generated labels in the main sample (biased)
    res = ols(Y = tY, X = tX, intercept = True)
    update_results(B, S, res.coef, res.vcov, i, 0)

    # Method 2: Additive bias correction
    fpr_hat = np.mean(eXhat[:,0] * (1.0 - eX[:,0]))
    res = ols_bca(Y = tY, Xhat =  tX, fpr = fpr_hat, m = m)
    update_results(B, S, res.coef, res.vcov, i, 1)
    
    # Method 3: Multiplicative bias correction
    res = ols_bcm(Y = tY, Xhat = tX, fpr = fpr_hat, m = m)
    update_results(B, S, res.coef, res.vcov, i, 2)

    # Method 4: One-step estimator
    res = one_step(Y = tY, Xhat = tX)
    update_results(B, S, res.coef, res.vcov, i, 3)

    if (i+1) % 100 == 0:
        print(f"Done {i+1}/{nsim} sims")


Done 100/1000 sims
Done 200/1000 sims
Done 300/1000 sims
Done 400/1000 sims
Done 500/1000 sims
Done 600/1000 sims
Done 700/1000 sims
Done 800/1000 sims
Done 900/1000 sims
Done 1000/1000 sims


Compute coverage probabilities of 95% confidence intervals for the slope coefficient across methods:

In [15]:
methods = {
    "OLS     ": 0,
    "ols_bca ": 1,
    "ols_bcm ": 2,
    "one_step": 3
}

cov_dict = {}
for name, col in methods.items():
    slopes = B[:, col, 1]
    ses   = S[:, col, 1]
    # fraction of sims whose 95% CI covers β1
    cov_dict[name] = np.mean(np.abs(slopes - β1) <= 1.96 * ses)

cov_series = pd.Series(cov_dict, name=f"Coverage @ β1={β1}")
cov_series

OLS         0.000
ols_bca     0.885
ols_bcm     0.879
one_step    0.929
Name: Coverage @ β1=1.0, dtype: float64

Standard OLS confidence intervals for the slope coefficient never cover the true value in these simulations. Both `ols_bca` and `ols_bcm` yield confidence intervals with coverage somewhat below the nominal 95% level (about 88%), though their coverage approaches 95% in larger samples. The `one_step` confidence intervals have coverage of about 93%, close to the nominal level.
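
When comparing these coverage rates with the nominal level, it can help to keep in mind that each one is itself a simulation estimate. A rough sketch, assuming independent simulation draws, uses the binomial formula `sqrt(c * (1 - c) / nsim)` for the Monte Carlo standard error of an estimated coverage rate `c`. The coverage values below are copied from the output above; the calculation is an added illustration, not part of the package.

```python
from math import sqrt

nsim = 1000

# Coverage rates copied from the simulation output above
coverage = {"ols_bca": 0.885, "ols_bcm": 0.879, "one_step": 0.929}

# Binomial standard error of each estimated coverage rate
mc_se = {name: sqrt(c * (1.0 - c) / nsim) for name, c in coverage.items()}

for name, c in coverage.items():
    print(f"{name}: coverage = {c:.3f}, Monte Carlo SE = {mc_se[name]:.4f}")
```

With `nsim = 1000`, these standard errors are on the order of 0.01, so differences of a percentage point between methods are within simulation noise.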

Finally, we tabulate results, presenting:

* the average estimate and average standard error across simulations for each method;
* the interval between the 2.5% and 97.5% quantiles of the estimates across simulations for each method.

In [19]:
nsim, nmethods, ncoeff = B.shape

method_names = [
    "OLS     ",
    "ols_bca ",
    "ols_bcm ",
    "one_step"
]

results = []

for i in range(nmethods):
    row = {"Method": method_names[i]}
    
    for j, coef in enumerate(["β1", "β0"]):
        estimates = B[:, i, 1-j]
        ses = S[:, i, 1-j]
        mean_est = np.nanmean(estimates)
        mean_se = np.nanmean(ses)
        # use the NaN-aware quantile, consistent with nanmean above
        lower = np.nanpercentile(estimates, 2.5)
        upper = np.nanpercentile(estimates, 97.5)
        
        row[f"Avg_{coef}"] = f"{mean_est:.3f}"
        row[f"Avg_SE_{coef}"] = f"{mean_se:.3f}"
        row[f"Quantiles_{coef}"] = f"[{lower:.3f}, {upper:.3f}]"
    
    results.append(row)

df_results = pd.DataFrame(results).set_index("Method")
print(df_results)

         Avg_β1 Avg_SE_β1    Quantiles_β1  Avg_β0 Avg_SE_β0      Quantiles_β0
Method                                                                       
OLS       0.833     0.021  [0.794, 0.875]  10.008     0.003  [10.003, 10.014]
ols_bca   0.971     0.062  [0.874, 1.091]  10.002     0.004   [9.995, 10.008]
ols_bcm   1.003     0.064  [0.880, 1.183]  10.000     0.004   [9.991, 10.007]
one_step  0.999     0.031  [0.926, 1.058]  10.000     0.002   [9.995, 10.005]


We see that the OLS estimator of the slope coefficient is biased (it underestimates the true effect size by about 17% on average), while `ols_bca`, `ols_bcm`, and `one_step` yield estimates close to the true value of the slope coefficient.
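
The size of the OLS bias can be cross-checked with a back-of-the-envelope calculation. This is an added sketch based on the design implemented in `generate_data`, not a result taken from the paper. Because the error term has mean zero and is independent of the labels, the large-sample OLS slope from regressing `Y` on `Xhat` is approximately `β1 * Cov(X, Xhat) / Var(Xhat)`. In this design `P(Xhat = 1) = p` and `P(X = 1, Xhat = 1) = p - fpr`, which gives:

```python
from math import sqrt

# Design values copied from the parameter cell of this example
n, p, kappa, beta1 = 16000, 0.05, 1.0, 1.0
fpr = kappa / sqrt(n)

# Moments implied by generate_data:
#   P(Xhat = 1) = p and P(X = 1, Xhat = 1) = p - fpr
cov_x_xhat = (p - fpr) - p * p   # Cov(X, Xhat) = E[X * Xhat] - E[X] * E[Xhat]
var_xhat = p * (1.0 - p)         # Var(Xhat) for a Bernoulli(p) label

attenuation = cov_x_xhat / var_xhat
plim_slope = beta1 * attenuation

print(f"attenuation factor: {attenuation:.3f}")
print(f"approximate large-sample OLS slope: {plim_slope:.3f}")
```

The implied slope of roughly 0.83 is in line with the average OLS estimate in the table, which suggests that the shortfall of about 17% is driven by label misclassification rather than by sampling noise.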