This notebook provides a step-by-step calculation to demonstrate the core proposition
of Prediction-Powered Inference (PPI) for mean estimation. It shows how PPI combines
a small labeled dataset with a large unlabeled dataset to produce a more precise
estimate.

In [14]:
%load_ext autoreload
%autoreload 2
import os, sys

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
import numpy as np
from ppi_py.datasets import load_dataset



In [15]:
dataset_folder = "./data/"
data = load_dataset(dataset_folder, "forest")
Y_total = data["Y"]
Yhat_total = data["Yhat"]


print(Y_total[31:40])
print(Yhat_total[31:40])

[1. 0. 0. 0. 1. 0. 0. 0. 0.]
[0.14660134 0.01642194 0.08404195 0.03080211 0.55973437 0.19156072
 0.21175678 0.04103236 0.05024965]


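Define the Dataset Split
We define a very small labeled dataset (labeled_idx) of 9 samples, indices 31 through 39.
All remaining samples form the unlabeled dataset (unlabeled_idx), for which only the predictions are used.
PPI is most useful in this N >> n regime, where the mean prediction is estimated
far more precisely than anything computed from the labels alone.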

In [16]:
labeled_idx = np.arange(31, 40)   # 9 samples
unlabeled_idx = np.setdiff1d(np.arange(len(Y_total)), labeled_idx)
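
The contiguous slice above is convenient for a demonstration, but the labeled set is usually assumed to be a random sample from the same population as the unlabeled data. A minimal sketch of a random split follows; the helper `random_split`, the seed, and the sizes are illustrative assumptions, not part of ppi_py or of this notebook's data.

```python
import numpy as np

def random_split(num_samples, n_labeled, seed=0):
    """Hypothetical helper: split indices 0..num_samples-1 at random, without replacement."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)
    labeled = np.sort(perm[:n_labeled])      # indices treated as labeled
    unlabeled = np.sort(perm[n_labeled:])    # all remaining indices
    return labeled, unlabeled

labeled_demo, unlabeled_demo = random_split(num_samples=1000, n_labeled=9)
print(len(labeled_demo), len(unlabeled_demo))
```

The two index sets are disjoint and together cover every sample, matching what np.setdiff1d produces in the cell above.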

We calculate the two main estimates to be compared:
- theta_classical: The classical estimate, which is simply the mean of the
small, labeled gold-standard dataset. It ignores all predicted data.
- theta_ppi: The Prediction-Powered Inference estimate. It starts from the mean
prediction on the unlabeled data, Yhat_total[unlabeled_idx].mean(), and adds the
rectifier, (Y_total[labeled_idx] - Yhat_total[labeled_idx]).mean(), which
corrects the prediction-based estimate for the model's average error on the
labeled data.

In [17]:
n = len(labeled_idx)
N = len(Y_total)

theta_classical = Y_total[labeled_idx].mean()
# Rectifier: average error of the predictions on the labeled samples.
rectifier = (Y_total[labeled_idx] - Yhat_total[labeled_idx]).mean()
theta_ppi = Yhat_total[unlabeled_idx].mean() + rectifier

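This cell overwrites theta_ppi with a weighted average of the two data sources:
the labeled mean with weight n/N and the mean prediction on the unlabeled data
with weight (N-n)/N. Note that this is not algebraically equivalent to the
rectified estimator (mean prediction plus mean labeled error), because it does
not correct the predictions for their average error. All values printed below
use this weighted-average theta_ppi.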

In [18]:
theta_ppi = (n/N) * Y_total[labeled_idx].mean() + ((N-n)/N) * Yhat_total[unlabeled_idx].mean()
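
The weighted average and the rectified estimator (mean prediction plus mean labeled error) can give different answers when the predictions are biased. The sketch below illustrates this on synthetic data; the labels, the deliberately biased predictions, and every name in it are assumptions made for illustration, not the forest dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
N_demo, n_demo = 10_000, 200

# Synthetic binary labels and predictions that systematically shrink toward 0.
y = rng.binomial(1, 0.3, size=N_demo).astype(float)
yhat = 0.6 * y + 0.05 + rng.normal(0, 0.02, size=N_demo)

lab = np.arange(n_demo)            # labeled indices
unl = np.arange(n_demo, N_demo)    # unlabeled indices

# Rectified estimator: mean prediction plus the mean labeled error.
theta_rectified = yhat[unl].mean() + (y[lab] - yhat[lab]).mean()

# Weighted average: true labels for labeled points, raw predictions for the rest.
theta_weighted = (n_demo / N_demo) * y[lab].mean() + ((N_demo - n_demo) / N_demo) * yhat[unl].mean()

print(f"true mean      : {y.mean():.3f}")
print(f"rectified      : {theta_rectified:.3f}")
print(f"weighted avg   : {theta_weighted:.3f}")
```

In this synthetic setting the weighted average stays close to the biased mean prediction, while the rectified estimate moves toward the true mean because the labeled errors reveal the bias.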

In [19]:
print(theta_classical)
print(theta_ppi)

0.2222222222222222
0.15452251659883148


These lines calculate the two variance terms that determine the width of the
PPI confidence interval.
- var_errors: The variance of the "rectifier" term, which is the difference
between the true labels and the model's predictions on the labeled data.
- var_yhat_unlabeled: The variance of the predictions on the large, unlabeled
dataset.
The squared standard error of the PPI estimate is var_errors / n plus
var_yhat_unlabeled divided by the number of unlabeled samples, so with a large
unlabeled set the first term dominates.

In [20]:
errors = Y_total[labeled_idx] - Yhat_total[labeled_idx]
var_errors = errors.var(ddof=1)   # sample variance

In [21]:
var_yhat_unlabeled = Yhat_total[unlabeled_idx].var(ddof=1)

In [22]:
print(var_errors,'\t',var_yhat_unlabeled)

0.12082489240228936 	 0.046341041450386665
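
PPI narrows the interval only when the prediction errors vary less than the labels themselves. As a rough check, the sketch below recomputes both variances from the nine labeled values printed earlier in this notebook; the variable names are illustrative.

```python
import numpy as np

# Labeled values as printed earlier in this notebook (indices 31..39).
y_lab = np.array([1., 0., 0., 0., 1., 0., 0., 0., 0.])
yhat_lab = np.array([0.14660134, 0.01642194, 0.08404195, 0.03080211, 0.55973437,
                     0.19156072, 0.21175678, 0.04103236, 0.05024965])

var_y = y_lab.var(ddof=1)                  # drives the classical interval width
var_err = (y_lab - yhat_lab).var(ddof=1)   # drives the PPI interval width

print(f"var(Y)        = {var_y:.4f}")
print(f"var(Y - Yhat) = {var_err:.4f}")
print(f"ratio         = {var_err / var_y:.2f}")
```

A ratio below 1 means the rectifier term is less noisy than the raw labels, which is the source of the precision gain when the unlabeled set is large.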


In [23]:
# Ground-truth mean over all labels, used here only as a reference value
theta_star = Y_total.mean()
print(theta_star)

0.15162907268170425


In [24]:
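# 95% normal-approximation interval; 1.96 is the 0.975 standard normal quantile.
# Note: N is the full dataset size here; the unlabeled count is N - n.
half_width = 1.96 * np.sqrt((var_errors / n) + (var_yhat_unlabeled / N))
lower_ppi_ci = theta_ppi - half_width
upper_ppi_ci = theta_ppi + half_width
PPi_ci = [lower_ppi_ci, upper_ppi_ci]
print(PPi_ci)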

[np.float64(-0.07282078717826854), np.float64(0.38186582037593153)]
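
The interval calculation can be packaged as a small function and compared with the classical interval. This is a sketch under stated assumptions: `rectified_mean_ci` and `classical_mean_ci` are hypothetical helpers, not ppi_py functions; the data are synthetic; the point estimate is the rectified form; and the variance of the predictions is divided by the unlabeled count. The normal approximation is also rough for very small n.

```python
import numpy as np
from statistics import NormalDist

def rectified_mean_ci(y_lab, yhat_lab, yhat_unl, alpha=0.05):
    """Hypothetical helper: normal-approximation CI for the rectified mean."""
    y_lab, yhat_lab, yhat_unl = (np.asarray(a, dtype=float)
                                 for a in (y_lab, yhat_lab, yhat_unl))
    errors = y_lab - yhat_lab
    point = yhat_unl.mean() + errors.mean()
    se = np.sqrt(errors.var(ddof=1) / len(y_lab)
                 + yhat_unl.var(ddof=1) / len(yhat_unl))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return point - z * se, point + z * se

def classical_mean_ci(y_lab, alpha=0.05):
    """Hypothetical helper: normal-approximation CI from the labels alone."""
    y_lab = np.asarray(y_lab, dtype=float)
    se = np.sqrt(y_lab.var(ddof=1) / len(y_lab))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return y_lab.mean() - z * se, y_lab.mean() + z * se

# Synthetic labels and reasonably accurate predictions (illustrative only).
rng = np.random.default_rng(1)
y_all = rng.binomial(1, 0.3, size=5000).astype(float)
yhat_all = 0.8 * y_all + 0.1 + rng.normal(0, 0.05, size=5000)

ppi_lo, ppi_hi = rectified_mean_ci(y_all[:50], yhat_all[:50], yhat_all[50:])
cl_lo, cl_hi = classical_mean_ci(y_all[:50])
print(f"rectified width: {ppi_hi - ppi_lo:.3f}")
print(f"classical width: {cl_hi - cl_lo:.3f}")
```

With accurate predictions the rectified interval is much narrower than the classical one; with poor predictions the error variance grows and the advantage shrinks or disappears.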


In [25]:
if PPi_ci[0] <= theta_star <= PPi_ci[1]:
    print(f"{theta_star} lies in the prediction-powered confidence interval.")
else:
    print(f"{theta_star} does not lie in the prediction-powered confidence interval.")

0.15162907268170425 lies in the prediction-powered confidence interval.
