Instead of training the model on the full original dataset, I train several models on random subsets. What are these samples then?

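When you train several models on random subsets of your full dataset, you're performing a form of **ensemble learning** closely related to **bagging** (bootstrap aggregating). Strictly speaking, bagging draws bootstrap samples with replacement; drawing subsets without replacement is sometimes called subagging. Let's analyze the distribution of the predictions generated by this ensemble.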

Let's define some notation:

* $D$: The full dataset.
* $D_b$: The $b$-th random subset of $D$, where $b = 1, \dots, B$ and $B$ is the number of subsets/models.
* $h$: The hyperparameters (assumed fixed for now for simplicity).
* $w_b$: The weights learned by the model trained on subset $D_b$.  So, $w_b = S(h, D_b)$.
* $f(x; w_b)$: The prediction of the $b$-th model for input $x$.
* $y$: The output we want to predict.

When you sample a subset $D_b$, you're effectively sampling from a distribution over datasets. Let's denote this distribution as $p(D_b | D)$. The exact form of $p(D_b|D)$ depends on how you create the subsets (e.g., bootstrapping, subsampling).

Now, consider the distribution of predictions $y$ from a single model trained on a random subset $D_b$.  This can be written as:

$$ p(y | x, D_b, h) = \delta(y - f(x; S(h, D_b))) $$

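This is similar to before, but now conditioned on the subset $D_b$. The delta function reflects the assumption that training is deterministic, so $h$ and $D_b$ fully determine the prediction.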

To get the distribution of predictions across all models (the ensemble), we need to marginalize over the distribution of subsets:

$$ p(y | x, D, h) = \int p(y | x, D_b, h) p(D_b | D) dD_b $$

Substituting the delta function:

$$ p(y | x, D, h) = \int \delta(y - f(x; S(h, D_b))) p(D_b | D) dD_b $$

This equation describes the distribution of predictions $y$ given the input $x$, the full dataset $D$, and the hyperparameters $h$. This is *not* the same as the posterior predictive distribution we discussed before. It represents the predictive distribution of an ensemble of models trained on different subsets of the data.


**Key Differences from Previous Method:**

* **No explicit prior over h:** This method doesn't involve a prior over hyperparameters directly (although the choice of a fixed $h$ can be considered an implicit prior).  Instead, it samples different subsets of the data, implicitly exploring different parts of the weight space.
* **Ensemble Prediction:** The resulting distribution is not the prediction of a single model but the aggregated prediction of an ensemble of models.
* **Focus on Data Variation:**  The focus here is on handling the variation in the data itself, rather than uncertainty in hyperparameters.

**Practical Implications:**

Sampling from $p(y | x, D, h)$ is done by training $B$ models on different subsets $D_b$ and then, for a given $x$, computing $y_b = f(x; S(h, D_b))$ for each model. The collection $\{y_b\}_{b=1}^B$ represents samples from $p(y | x, D, h)$.

Aggregating these predictions, for example by averaging them, can reduce the variance of the final prediction and make it more robust to noisy data points.
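
As a sketch of the aggregation step, assuming the $y_b$ are combined by a plain average (the notation $\bar{y}(x)$ is introduced here for illustration and is not part of the discussion above), the ensemble prediction is a Monte Carlo estimate of the mean of $p(y | x, D, h)$:

$$ \bar{y}(x) = \frac{1}{B} \sum_{b=1}^{B} y_b \approx \mathbb{E}_{D_b \sim p(D_b | D)}\left[ f(x; S(h, D_b)) \right] = \int y \, p(y | x, D, h) \, dy $$

The last equality follows from integrating $y$ against the delta function in the expression for $p(y | x, D, h)$.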


Writing $p(D_b|D)$ explicitly into the equation makes precise which distribution the ensemble's predictions are drawn from.


To get $D_b$, I remove every data point in $D$ with probability $\alpha$.

Okay, if you're creating subsets $D_b$ by removing each data point from $D$ independently with probability $\alpha$, you're performing a form of subsampling without replacement (often called Bernoulli sampling). This is different from bootstrapping, where data points can be repeated in the subsets.

Let's formalize $p(D_b|D)$ for this subsampling procedure.

Let $N$ be the total number of data points in the full dataset $D = \{z_1, z_2, \dots, z_N\}$, where $z_i = (x_i, y_i)$ represents a single data point.  Let $N_b$ be the number of data points in the subset $D_b$.

For a given subset $D_b$, we can define an indicator variable $I_i$ for each data point $z_i$ in $D$:

$$ I_i = \begin{cases} 1, & \text{if } z_i \in D_b \\ 0, & \text{if } z_i \notin D_b \end{cases} $$

Since each data point is removed with probability $\alpha$, the probability of a data point being *included* in $D_b$ is $(1-\alpha)$.  Therefore, the probability of a specific subset $D_b$ being generated is:

$$ p(D_b | D) = \prod_{i=1}^N (1-\alpha)^{I_i} \alpha^{(1-I_i)} = (1-\alpha)^{N_b} \alpha^{N - N_b} $$

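This expression tells us the probability of obtaining a *specific* subset $D_b$ with $N_b = \sum_{i=1}^N I_i$ data points. Notice that subsets with the same size $N_b$ have the same probability. Since there are $\binom{N}{N_b}$ subsets of each size, the subset size itself follows a binomial distribution, $N_b \sim \text{Binomial}(N, 1-\alpha)$.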
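
As a quick sanity check of this formula, the following minimal sketch enumerates every inclusion pattern $(I_1, \dots, I_N)$ for a small $N$ and confirms that the probabilities sum to one and that the subset sizes follow a Binomial$(N, 1-\alpha)$ distribution. The function name `subset_probability` and the values of `N` and `alpha` are illustrative assumptions, not something fixed by the discussion above.

```python
from itertools import product
from math import comb, isclose


def subset_probability(mask, alpha):
    """p(D_b | D) for an inclusion mask (I_1, ..., I_N).

    Each point is dropped independently with probability alpha,
    so it is kept with probability 1 - alpha.
    """
    n_kept = sum(mask)
    n_dropped = len(mask) - n_kept
    return (1 - alpha) ** n_kept * alpha ** n_dropped


N, alpha = 4, 0.3  # illustrative values only

# All 2^N possible inclusion masks, i.e. all possible subsets D_b.
masks = list(product([0, 1], repeat=N))

# The probabilities of all subsets should sum to one.
total = sum(subset_probability(m, alpha) for m in masks)

# Probability that D_b has exactly k points, by summing over masks of size k ...
size_probs = [
    sum(subset_probability(m, alpha) for m in masks if sum(m) == k)
    for k in range(N + 1)
]
# ... compared with the Binomial(N, 1 - alpha) probability mass function.
binomial = [comb(N, k) * (1 - alpha) ** k * alpha ** (N - k) for k in range(N + 1)]

print(isclose(total, 1.0))
print(all(isclose(a, b) for a, b in zip(size_probs, binomial)))
```

Exhaustive enumeration is only feasible for small $N$, since the number of subsets grows as $2^N$.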

Now, substituting this into our previous equation for the predictive distribution:

$$ p(y | x, D, h) = \sum_{D_b \subseteq D}  \delta(y - f(x; S(h, D_b)))  (1-\alpha)^{N_b} \alpha^{N - N_b} $$

Here, the integral over $D_b$ becomes a sum over the $2^N$ possible subsets of $D$ because the space of possible subsets is discrete.

In practice, you would approximate this by sampling several subsets $\{D_b\}_{b=1}^B$ according to the subsampling procedure and computing the corresponding predictions $y_b = f(x; S(h, D_b))$.
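
The sampling procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `train` is a hypothetical stand-in for $S(h, D_b)$ (a 1-D ridge regression through the origin, with $h$ as the penalty), `f` is the corresponding linear prediction, and the synthetic data and the values of `alpha` and `B` are made up for the example.

```python
import random
import statistics


def train(h, data):
    """Hypothetical stand-in for S(h, D_b): 1-D ridge regression through the origin.

    With h > 0 this also returns a weight (0.0) for an empty subset.
    """
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + h)


def f(x, w):
    """Prediction f(x; w) of the linear model."""
    return w * x


def subsample(data, alpha, rng):
    """Draw D_b by dropping each point independently with probability alpha."""
    return [z for z in data if rng.random() >= alpha]


def ensemble_predictions(data, x_query, h, alpha, B, rng):
    """Return the B predictions y_b = f(x; S(h, D_b))."""
    return [f(x_query, train(h, subsample(data, alpha, rng))) for _ in range(B)]


# Synthetic data (assumption for illustration): y = 2x + small Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

y_b = ensemble_predictions(data, x_query=0.5, h=1e-3, alpha=0.3, B=50, rng=rng)
print("ensemble mean:", statistics.mean(y_b))
print("ensemble std: ", statistics.stdev(y_b))
```

The list `y_b` plays the role of the samples $\{y_b\}_{b=1}^B$; its spread gives a rough picture of how sensitive the prediction is to which data points were dropped.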


This clarifies the distribution $p(D_b|D)$ for your specific subsampling strategy. The resulting $p(y | x, D, h)$ is the predictive distribution for an ensemble of models trained on these randomly subsampled datasets.
