A key limitation of Prediction-Powered Inference (PPI) is that its effectiveness
hinges on two main factors: the quality of the predictive model and the
availability of a sufficiently large unlabeled dataset. This notebook
demonstrates the second point by violating the assumption that the labeled
dataset size (n) is much smaller than the unlabeled dataset size (N).

In [43]:
%load_ext autoreload
%autoreload 2
import os, sys

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))
import numpy as np
from ppi_py.datasets import load_dataset


Load the forest dataset
We load the forest dataset, which contains:
- Y_total: The complete set of gold-standard (true) deforestation labels.
- Yhat_total: The complete set of model-predicted deforestation labels.
The goal is to see how PPI performs when we have a large labeled sample.

In [44]:
dataset_folder = "./data/"
data = load_dataset(dataset_folder, "forest")
Y_total = data["Y"]
Yhat_total = data["Yhat"]

Define the dataset split
In this experiment, we intentionally create a large labeled dataset.
We are selecting a large range of indices to ensure n is large,
which violates the core PPI assumption of N >> n.
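
A quick sanity check on a split like this is to count both index sets and confirm they are disjoint. The sketch below is illustrative only: `split_indices` is a hypothetical helper, and `total_size = 1600` is an assumed size chosen for the example, not the size of the forest dataset.

```python
import numpy as np

def split_indices(total_size, labeled_idx):
    """Return (labeled, unlabeled) index arrays that partition range(total_size)."""
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(total_size), labeled_idx)
    return labeled_idx, unlabeled_idx

# Hypothetical total size, chosen only for illustration.
total_size = 1600
labeled, unlabeled = split_indices(total_size, np.arange(1, 1580))

n_labeled = len(labeled)      # indices 1, 2, ..., 1579
n_unlabeled = len(unlabeled)  # index 0 plus indices 1580, ..., 1599
overlap = np.intersect1d(labeled, unlabeled)

print(n_labeled, n_unlabeled, len(overlap))
```

Because `np.arange(1, 1580)` starts at 1, index 0 is not labeled and ends up in the unlabeled set.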

In [None]:
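# Indices 1..1579 are labeled (n = 1579); index 0 falls into the unlabeled set.
# This makes n nearly as large as N, instead of the usual N >> n.
labeled_idx = np.arange(1, 1580)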
unlabeled_idx = np.setdiff1d(np.arange(len(Y_total)), labeled_idx)

Calculate the dataset sizes and the classical and PPI estimates
We first record the size of the labeled sample (n) and of the full dataset (N).
Note that N here is len(Y_total), the total number of points, not the number of
unlabeled points.

The classical estimate is simply the average of the labeled gold-standard data.
It does not use any of the predicted data.
theta_classical = Y_total[labeled_idx].mean()

The standard PPI estimator for the mean is the mean of the predictions on the
unlabeled data plus a rectifier, the average prediction error (label minus
prediction) on the labeled data:
theta_ppi = Yhat_total[unlabeled_idx].mean() + (Y_total[labeled_idx] - Yhat_total[labeled_idx]).mean()

This notebook does not use that form. The first assignment to theta_ppi in the
cell below is a difference of two means, not an estimate of the mean, and it is
overwritten by the following cell:
theta_ppi = (Y_total[labeled_idx].mean()) - (Yhat_total[unlabeled_idx].mean())

The value that is actually reported comes from a weighted average of the labeled
sample mean and the mean of the predictions on the unlabeled data. It is not the
standard PPI estimator and may be less stable, but it serves the purpose of this
example: the weight on the labeled mean is n/N, so the estimate relies more and
more on the labeled data as n approaches N.
theta_ppi = (n/N) * Y_total[labeled_idx].mean() + ((N-n)/N) * Yhat_total[unlabeled_idx].mean()
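
To make the difference between these formulations concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the arrays `Y` and `Yhat` are simulated rather than taken from the forest dataset, the 10% label-flip predictor is hypothetical, and `estimates` is not a `ppi_py` function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the notebook's arrays (not the forest data).
total = 2000
Y = rng.binomial(1, 0.15, size=total).astype(float)
# Hypothetical noisy predictor: flips each label with probability 0.1.
flip = rng.random(total) < 0.1
Yhat = np.where(flip, 1.0 - Y, Y)

def estimates(Y, Yhat, n_labeled):
    """Classical, rectifier-form PPI, and weighted-average estimates of the mean."""
    lab = np.arange(n_labeled)
    unl = np.arange(n_labeled, len(Y))
    classical = Y[lab].mean()
    # Rectifier form: prediction mean on unlabeled data plus mean labeled error.
    ppi = Yhat[unl].mean() + (Y[lab] - Yhat[lab]).mean()
    # Weighted-average form: true labels where available, predictions elsewhere.
    w = n_labeled / len(Y)
    weighted = w * Y[lab].mean() + (1 - w) * Yhat[unl].mean()
    return classical, ppi, weighted

for n_labeled in (100, 1000, 1980):
    c, p, w = estimates(Y, Yhat, n_labeled)
    print(f"n={n_labeled:5d}  classical={c:.4f}  ppi={p:.4f}  weighted={w:.4f}")
```

In the weighted-average form the gap to the classical estimate is (1 - n/N) times the difference between the two means, so it shrinks toward zero as the labeled share grows.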

In [46]:
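n = len(labeled_idx)   # number of labeled points
N = len(Y_total)       # total number of points (labeled + unlabeled)

theta_classical = Y_total[labeled_idx].mean()
# Not an estimate of the mean; theta_ppi is overwritten in the next cell.
theta_ppi = (Y_total[labeled_idx].mean()) - (Yhat_total[unlabeled_idx].mean())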

In [47]:
theta_ppi = (n/N) * Y_total[labeled_idx].mean() + ((N-n)/N) * Yhat_total[unlabeled_idx].mean()

In [48]:
print(theta_classical)
print(theta_ppi)

0.1513616212792907
0.1525698519252756


Calculate variance terms
These lines calculate the variance of the prediction errors (the difference
between the true label and the model's prediction) and the variance of the
predictions on the unlabeled data. These are the components used to calculate
the final confidence interval width.
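
For reference, the usual normal-approximation interval for the PPI mean has half-width z * sqrt(var_yhat_unlabeled / N_unlabeled + var_errors / n), where N_unlabeled is the number of unlabeled points. The sketch below shows how that half-width reacts to the unlabeled sample size. `ppi_halfwidth` is a hypothetical helper, the variances are rounded illustrative inputs, and the unlabeled counts are assumed values, not results from this notebook.

```python
from math import sqrt
from statistics import NormalDist

def ppi_halfwidth(var_errors, var_yhat_unlabeled, n_labeled, n_unlabeled, alpha=0.05):
    """Half-width of a two-sided (1 - alpha) normal-approximation interval."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * sqrt(var_errors / n_labeled + var_yhat_unlabeled / n_unlabeled)

# Illustrative inputs (rounded, assumed values).
var_errors = 0.0915
var_yhat_unlabeled = 0.0793
n_labeled = 1579

widths = {}
for n_unlabeled in (20, 1_000, 100_000):  # hypothetical unlabeled sample sizes
    widths[n_unlabeled] = ppi_halfwidth(var_errors, var_yhat_unlabeled, n_labeled, n_unlabeled)
    print(f"N_unlabeled={n_unlabeled:7d}  half-width={widths[n_unlabeled]:.4f}")
```

With very few unlabeled points the prediction-variance term dominates, which is why a large unlabeled set matters for PPI.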

In [49]:
errors = Y_total[labeled_idx] - Yhat_total[labeled_idx]
var_errors = errors.var(ddof=1)   # sample variance

In [50]:
var_yhat_unlabeled = Yhat_total[unlabeled_idx].var(ddof=1)

In [51]:
print(var_errors,'\t',var_yhat_unlabeled)

0.09149123081823825 	 0.0792621405879498


In [52]:
z = 1.96   # two-sided 95% standard normal critical value
half_width = z * np.sqrt((var_errors/n) + (var_yhat_unlabeled/N))
lower_ppi_ci = theta_ppi - half_width
upper_ppi_ci = theta_ppi + half_width
PPi_ci = [lower_ppi_ci, upper_ppi_ci]
print(f"[{lower_ppi_ci:.4f}, {upper_ppi_ci:.4f}]")

[0.1322, 0.1729]


Print the true mean
This line prints the true mean of the entire Y_total dataset.
The goal of both the classical and PPI estimators is to get as close to this
true value as possible.

In [53]:
theta_star = Y_total.mean()
print(theta_star)

0.15162907268170425


In [54]:
# Check whether the true mean falls between the interval endpoints.
if lower_ppi_ci <= theta_star <= upper_ppi_ci:
    print(f"{theta_star} lies in the prediction-powered confidence interval.")
else:
    print(f"{theta_star} does not lie in the prediction-powered confidence interval.")

0.15162907268170425 lies in the prediction-powered confidence interval.
