In [3]:
import numpy as np

# Integrating out the model
Much of machine learning concerns generating predictions from past data: we want to know $P(X_\text{new}|X_\text{old})$. <br>
Good predictions, however, require a model with parameters $\theta$. Introducing $\theta$ and marginalizing over it, the above can be rewritten as
$$P(X_\text{new}|X_\text{old})=\int P(X_\text{new},\theta|X_\text{old}) d\theta$$
This can in turn be expanded with the product rule: <br>
$$P(X_\text{new}|X_\text{old})=\int P(X_\text{new}|\theta,X_\text{old})P(\theta|X_\text{old}) d\theta$$
Assuming that the model encodes the previous data, so that $X_\text{new}$ is conditionally independent of $X_\text{old}$ given $\theta$, this becomes:
$$P(X_\text{new}|X_\text{old})=\int P(X_\text{new}|\theta)P(\theta|X_\text{old}) d\theta$$
This formula, the posterior predictive distribution, is the full Bayesian way to predict data with a parameter-based model. The parameters are integrated out over all possible values, and the prediction made under each parameter value is weighted by the probability of that value given the previous data.

For instance, suppose there are two coins ($C_1$ and $C_2$), one of which is selected and flipped once. $C_1$ has heads on both sides and, based on previous data, has a $10\%$ chance of being selected. $C_2$ is fair and has a $90\%$ chance of being selected. The correct way to predict new data is to apply the above formula, with the integral reducing to a sum over the two coins. The probability of heads ($H$) is then $P(H)=1\times0.1 + 0.5\times0.9=0.55$. This is easy to verify with sampling:

In [9]:
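n_samples = 10000
total_heads = 0
for sample in range(n_samples):
    # Select a coin according to its probability given the previous data
    coin = np.random.choice(["C1", "C2"], p=[0.1, 0.9])
    if coin == "C1":
        # C1 has heads on both sides, so every flip is heads
        total_heads += 1
    elif np.random.rand() >= 0.5:
        # C2 is fair: heads with probability 0.5
        total_heads += 1
prob_heads_estimates = total_heads / n_samples
print("estimate", prob_heads_estimates)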

estimate 0.5524
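
For comparison, the exact value can be computed by writing the integral as a weighted sum over the two coins. The following is a minimal sketch, not part of the original analysis: the names `likelihood_heads`, `coin_prob` and `predictive_prob` are hypothetical, and the vectorized sampler with a fixed seed is an assumed alternative to a Python loop.

```python
import numpy as np

# Hypothetical names: P(H | coin) for C1 and C2, and P(coin | X_old)
likelihood_heads = np.array([1.0, 0.5])
coin_prob = np.array([0.1, 0.9])

def predictive_prob(likelihood, weights):
    """Discrete analogue of the integral: sum_i P(X_new | theta_i) P(theta_i | X_old)."""
    return float(np.dot(likelihood, weights))

prob_heads_exact = predictive_prob(likelihood_heads, coin_prob)
print("exact", prob_heads_exact)

# Vectorized Monte Carlo check (the seed is an arbitrary choice for reproducibility)
rng = np.random.default_rng(0)
n_samples = 100_000
coins = rng.choice(2, size=n_samples, p=coin_prob)      # 0 -> C1, 1 -> C2
heads = rng.random(n_samples) < likelihood_heads[coins]  # C1 always gives heads
prob_heads_sampled = heads.mean()
print("sampled", prob_heads_sampled)
```

The weighted sum is the same calculation as the hand-worked $1\times0.1 + 0.5\times0.9$, and the sampled estimate should approach it as `n_samples` grows.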


The issue with this approach is that the integral is intractable in many cases. However, for a binomial likelihood with a beta prior the integral has an analytical solution, known as the beta-binomial model. And 