# Missing Information and Selection Effects

Goals:
* Incorporate models for data selection into our toolkit
* Understand when selection effects are ignorable, and when they must be accounted for

## References

* Gelman chapters 7 and 21

## What does "missing information" mean?

In physics, we're used to the idea that we never have complete information about a system.

Trivial example: non-zero measurement errors mean that we're missing some information, namely the true value of whatever we've measured. We deal with this by incorporating that fact into our model, via the sampling distribution.

### What does "missing information" mean?

We've already seen a more complex example, when introducing [hierarchical models](hierarchical.ipynb):

*Future time-domain photometric surveys will discover too many supernovae to follow up and type spectroscopically. This means that if we want to do our SNIa experiment from the "intrinsic scatter" example, our data set will be contaminated by other supernova types.*

We added "group membership" to the model, a latent variable that **would not be measured** for most SN.

## Today's key message

1. No data set is perfectly complete (especially in astronomy!)
2. It's our job to know whether that incompleteness can be ignored for the purpose of our inference
3. If not, we need to model it appropriately and marginalize over our ignorance

## More missingness mechanisms

Two more ways that data can be missing are extremely common in astrophysics, and especially in surveys. In statistics, these are called **censoring** and **truncation**.

**Censoring**: a given data point (astronomical source) is known to exist, but a relevant measurement for it is not available.

This refers both to completely absent measurements and upper limits/non-detections, although in principle the latter case still provides us with a sampling distribution.

### More missingness mechanisms

**Truncation**: not only are measurements missing, but the total number of sources that *should* be in the data set is unknown.

In other words, the lack of a measurement means that we don't even know about a particular source's existence.
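The difference between the two mechanisms is easy to see in a toy catalog. The following is a minimal sketch, assuming a hypothetical population of 1000 sources with unit-normal log-fluxes and an arbitrary detection limit; none of these numbers come from a real survey.

```python
import random

random.seed(1)

# Hypothetical population: 1000 sources with log-flux drawn from a unit normal.
true_logflux = [random.gauss(0.0, 1.0) for _ in range(1000)]
limit = 0.5  # assumed detection limit, in the same arbitrary units

# Censored catalog: every source keeps its row, but a value below the limit
# is replaced by None (all we know is "fainter than the limit").
censored = [x if x >= limit else None for x in true_logflux]

# Truncated catalog: sources below the limit never enter the catalog at all.
truncated = [x for x in true_logflux if x >= limit]

n_total_censored = len(censored)               # N is still known
n_missing = sum(x is None for x in censored)   # number of non-detections
n_truncated = len(truncated)                   # only N_obs is known; N is not

print(n_total_censored, n_missing, n_truncated)
```

In the censored catalog the total number of sources is part of the data; in the truncated catalog it would have to become a model parameter.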

### Malmquist Bias

Censoring and truncation are closely related to two astronomical terms that you may come across: Malmquist bias and Eddington bias.

**Malmquist bias** refers to the fact that flux-limited surveys have an effective *luminosity* limit for detection that rises with distance (redshift). Thus, the sample of measured luminosities is not representative of the whole population.
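A quick simulation can make this concrete. This is only a sketch under stated assumptions: sources are spread uniformly in a Euclidean volume, log-luminosities are normal, and the flux limit and all other numbers are arbitrary choices for illustration.

```python
import math
import random

random.seed(2)

n = 20000
d_max = 10.0         # assumed survey depth, arbitrary units
flux_limit = 1.0e-3  # assumed flux limit, arbitrary units

sources = []
for _ in range(n):
    # Uniform in volume: the cumulative fraction within d scales as d^3.
    d = d_max * (1.0 - random.random()) ** (1.0 / 3.0)
    log_lum = random.gauss(0.0, 0.5)  # assumed population of log10 luminosity
    flux = 10.0 ** log_lum / (4.0 * math.pi * d * d)
    sources.append((d, log_lum, flux))

# Apply the flux limit: this is the selection.
detected = [(d, log_lum) for d, log_lum, flux in sources if flux >= flux_limit]


def mean(values):
    return sum(values) / len(values)


mean_all = mean([log_lum for _, log_lum, _ in sources])
mean_detected = mean([log_lum for _, log_lum in detected])
mean_near = mean([log_lum for d, log_lum in detected if d < 0.5 * d_max])
mean_far = mean([log_lum for d, log_lum in detected if d >= 0.5 * d_max])

print(mean_all, mean_detected, mean_near, mean_far)
```

The detected sample is more luminous on average than the population, and the effect grows with distance because the luminosity needed to exceed the flux limit scales as $d^2$.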

### Malmquist Bias

<table>
    <tr>
        <td><a href="https://commons.wikimedia.org/wiki/File:Bias2.png"><img src="../graphics/missing_malquist.png" width=100%></a></td>
    </tr>
</table>

Image credit: Wikimedia Commons user Galaxy1F10 (public domain)

### Eddington Bias

**Eddington bias** refers to the effect of noise or scatter on a luminosity function, $N(>L)$, the number of sources in some population more luminous than $L$.

Because the true $N(>L)$ is usually steeply decreasing in practice, and extends below the survey flux limit, scatter in measurements of $L$ can have a big impact on the measured luminosity function: many more faint sources scatter upward than bright sources scatter downward.

### Eddington Bias

This is a histogram rather than $N(>L)$, but you get the idea.

<table>
    <tr>
        <td><img src="../graphics/missing_eddington.png" width=100%></td>
    </tr>
</table>
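The same effect can be reproduced with a few lines of simulation. This is a sketch, assuming a population whose log-luminosity follows an exponential distribution (a power law in luminosity) and Gaussian measurement scatter in the log; the slope, scatter and threshold are arbitrary illustrative values.

```python
import random

random.seed(3)

n = 50000
slope = 3.0      # assumed steepness of the distribution of log-luminosity
scatter = 0.3    # assumed Gaussian measurement scatter in log-luminosity
threshold = 1.0  # count sources above this log-luminosity

# True log-luminosities: density proportional to exp(-slope * x) for x >= 0.
true_x = [random.expovariate(slope) for _ in range(n)]

# Measured values: truth plus symmetric, zero-mean noise.
measured_x = [x + random.gauss(0.0, scatter) for x in true_x]

n_true_above = sum(x > threshold for x in true_x)
n_measured_above = sum(x > threshold for x in measured_x)

print(n_true_above, n_measured_above)
```

Even though the noise is symmetric, the measured count above the threshold exceeds the true count, because there are many more sources just below the threshold than just above it.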

### General Selection Effects

The terms Malmquist and Eddington bias were coined in relatively specific contexts. Usually, it's more accurate to say that a given data set is impacted by the selection procedure.

Consider the (real) case of a flux-limited galaxy cluster survey. Cluster luminosities scale with mass, and the mass function (hence also the luminosity function) is steeply decreasing. The number as a function of mass and redshift, and the luminosity-mass relation, are both of interest.

### Selection Effects
<table>
    <tr>
        <td><img src="../graphics/missing_RASS_zL.png" width=50%></td>
    </tr>
</table>

Compilation of ROSAT All-Sky Survey cluster detections

### Selection Effects
<table>
    <tr>
        <td><img src="../graphics/missing_expn_full.png" width=90%></td>
        <td></td>
        <td><img src="../graphics/missing_expn_trun.png" width=90%></td>
    </tr>
</table>

Fictional luminosity-mass data, applying a threshold for detection

Above 2 slides' image credits: A. Mantz ([MNRAS, 406:1773, 2010](http://adsabs.harvard.edu/abs/2010MNRAS.406.1773M))

## How do we deal with missing information?

There are some ad hoc approaches ... that we won't cover, because missing information fits straightforwardly within the inference framework used throughout this course.

In short, we simply need to include the selection process for which data are observed and which are not in our generative model. This may involve expanding the model to include things like undetected sources.

### Modelling Missing Information

Let's adopt the notation from Gelman (2004):
* $y_\mathrm{obs}$ and $y_\mathrm{mis}$ are the observed and unobserved data, and $y=y_\mathrm{obs}\cup y_\mathrm{mis}$
* $I$ are indicator variables (0 or 1) telling us whether a given $y$ is observed or not
* $\theta$ is the set of parameters needed to model a completely observed data set
* $\phi$ are any additional parameters needed to model the selection process

We'll assume that $\theta$ and $\phi$ can always be separated.

### Modelling Missing Information

The likelihood associated with a complete data set would be just

$P(y|\theta)$

For our partially missing data set, the likelihood must also account for the inclusion indicators, $I$:

$P(y,I|\theta,\phi) = P(y|\theta)\,P(I|\phi,y)$

### Modelling Missing Information

Expanding out the $y$s,

$P(y_\mathrm{obs},y_\mathrm{mis},I|\theta,\phi) = P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$

This isn't yet a likelihood for the *observed* data, however. For that we need to marginalize over the $y_\mathrm{mis}$:

$P(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$
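As a concrete instance of this marginalization, here is a minimal sketch for a deliberately simple case, all of which is assumed for illustration: each $y$ is normal with unknown mean `mu` and unit variance, and a value is observed only if it exceeds a known `limit` (so $P(I=0|y_\mathrm{mis})$ is 1 below the limit and 0 above). The integral over each $y_\mathrm{mis}$ then reduces to a normal CDF, which the code checks against direct numerical integration.

```python
import math


def normal_pdf(y, mu, sigma=1.0):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(y, mu, sigma=1.0):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def p_missing_numeric(mu, limit, sigma=1.0, n_grid=20000, width=10.0):
    """Midpoint-rule integral of P(y_mis | mu) * P(I=0 | y_mis) over y_mis.

    With a hard threshold, P(I=0 | y_mis) is 1 below the limit and 0 above,
    so the integral runs from (effectively) minus infinity up to the limit.
    """
    lo = mu - width * sigma
    if limit <= lo:
        return 0.0
    h = (limit - lo) / n_grid
    return h * sum(normal_pdf(lo + (k + 0.5) * h, mu, sigma) for k in range(n_grid))


def log_like_observed(mu, y_obs, n_mis, limit, sigma=1.0):
    """log P(y_obs, I | mu), dropping any factor that does not depend on mu.

    Detected points contribute their density (P(I=1 | y_obs) = 1 here);
    each censored point contributes the marginalized term, a normal CDF.
    """
    total = sum(math.log(normal_pdf(y, mu, sigma)) for y in y_obs)
    total += n_mis * math.log(normal_cdf(limit, mu, sigma))
    return total


# Hypothetical data: three detections and two non-detections.
print(p_missing_numeric(0.3, 1.0), normal_cdf(1.0, 0.3))
print(log_like_observed(1.0, [1.2, 1.5, 2.0], 2, 1.0))
```

More realistic selection models rarely give a closed form, in which case the integral has to be done numerically or by sampling the $y_\mathrm{mis}$ as additional latent variables.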

### When can we ignore selection?

Consider the likelihood in this form

$P(y_\mathrm{obs},I|\theta,\phi) = \int dy_\mathrm{mis} \, P(y_\mathrm{obs},y_\mathrm{mis}|\theta)\,P(I|\phi,y_\mathrm{obs},y_\mathrm{mis})$

We can get away with ignoring the selection process if the posterior for the parameters of interest $P(\theta|y_\mathrm{obs},I)$ is equivalent to simply $P(\theta|y_\mathrm{obs})$.

### When can we ignore selection?

This requires two things to be true:

1. Selection doesn't depend on unobserved values
2. Priors for the interesting ($\theta$) and selection-related ($\phi$) parameters are independent
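The first condition can be illustrated numerically. The sketch below assumes a toy population of unit-normal values and compares two hypothetical selection mechanisms: one that keeps each value with a fixed probability regardless of the value, and one that keeps only values above a cut. It uses a naive sample mean that ignores selection in both cases.

```python
import random

random.seed(5)

n = 20000
y = [random.gauss(0.0, 1.0) for _ in range(n)]  # assumed complete data, true mean 0

# Mechanism A: inclusion does not depend on y at all (fixed probability 0.3).
observed_random = [v for v in y if random.random() < 0.3]

# Mechanism B: inclusion depends on the value itself, so it depends on
# exactly the values that go unobserved.
observed_cut = [v for v in y if v > 0.5]

mean_random = sum(observed_random) / len(observed_random)
mean_cut = sum(observed_cut) / len(observed_cut)

print(mean_random, mean_cut)
```

Ignoring selection is harmless for mechanism A but badly biases the estimate of the mean for mechanism B.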

## Example: black hole $M$-$\sigma$ relation

Imagine we're fitting the relation between the central black hole mass and bulge stellar velocity dispersion for galaxies.

<table>
    <tr>
        <td><img src="../graphics/missing_Msigma.jpg" width=60%></td>
    </tr>
</table>

Image credit: Msigma at the English Language Wikipedia (Creative Commons Attribution-Share Alike 3.0 Unported)

### Example: black hole $M$-$\sigma$ relation

To start with, we'll assume a complete data set. Then the generative model needs
* true values of $\sigma$ for the $N$ galaxies
* true values of $M$ for each galaxy, determined by a mean relation and scatter, parametrized by $\theta$
* sampling distributions for $M$ and $\sigma$, which we'll assume are independent
* prior distributions for $\sigma$ and $\theta$

Go ahead and sketch the PGM yourselves.

### Example: black hole $M$-$\sigma$ relation

The likelihood for this model is just

$P(\sigma_\mathrm{obs},M_\mathrm{obs}|\sigma,M,\theta) = \prod_{i=1}^N P(M_i|\sigma_i,\theta)\,P(M_{\mathrm{obs},i}|M_i)\,P(\sigma_{\mathrm{obs},i}|\sigma_i)$
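To make the three factors concrete, here is a sketch of the generative model and the corresponding log-likelihood. Everything specific is an assumption made for illustration: a linear mean relation in log space, $\log M = a + b\,\log\sigma$, with Gaussian intrinsic scatter, Gaussian measurement errors, and arbitrary parameter values that are not fitted to any data.

```python
import math
import random

random.seed(6)


def log_normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def simulate(n, a, b, scatter, err_m, err_s):
    """Generate (true log sigma, true log M, observed log sigma, observed log M)."""
    data = []
    for _ in range(n):
        s_true = random.gauss(2.3, 0.2)                # assumed prior on true log sigma
        m_true = random.gauss(a + b * s_true, scatter) # mean relation plus scatter
        s_obs = random.gauss(s_true, err_s)            # sampling distribution for sigma
        m_obs = random.gauss(m_true, err_m)            # sampling distribution for M
        data.append((s_true, m_true, s_obs, m_obs))
    return data


def log_likelihood(data, a, b, scatter, err_m, err_s):
    """Sum over galaxies of the three log factors in the product above."""
    total = 0.0
    for s_true, m_true, s_obs, m_obs in data:
        total += log_normal_pdf(m_true, a + b * s_true, scatter)  # P(M_i | sigma_i, theta)
        total += log_normal_pdf(m_obs, m_true, err_m)             # P(M_obs,i | M_i)
        total += log_normal_pdf(s_obs, s_true, err_s)             # P(sigma_obs,i | sigma_i)
    return total


# Arbitrary illustrative parameter values.
a_true, b_true, scatter_true = -1.5, 4.0, 0.3
data = simulate(50, a_true, b_true, scatter_true, err_m=0.2, err_s=0.05)

ll_true = log_likelihood(data, a_true, b_true, scatter_true, 0.2, 0.05)
ll_shifted = log_likelihood(data, a_true + 1.0, b_true, scatter_true, 0.2, 0.05)
print(ll_true, ll_shifted)
```

In a real analysis the true values of $\sigma_i$ and $M_i$ are latent and would be sampled or marginalized over; here they are held at their simulated values only so that the function can be evaluated directly.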

### Example: black hole $M$-$\sigma$ relation

Now imagine (realistically) that we don't have measurements for all the black hole masses.

* The data need to be augmented by the inclusion vector, $I$, which implicitly includes the number of observed $M$s, $N_\mathrm{obs}$
* Instead of having an $M_\mathrm{obs}$ for each galaxy, we have either an $M_\mathrm{obs}$ or an $M_\mathrm{mis}$

Update the PGM.

### Example: black hole $M$-$\sigma$ relation

Following the notes above, the likelihood needs to become

${N \choose N_\mathrm{obs}} \prod_{i}^{N_\mathrm{obs}} P(M_i|\sigma_i,\theta)\,P(M_{\mathrm{obs},i}|M_i)\,P(\sigma_{\mathrm{obs},i}|\sigma_i)\,P(I_i|\bullet,\phi)$

$\times\prod_{i}^{N_\mathrm{mis}} \int dM_{\mathrm{mis},i} P(M_i|\sigma_i,\theta)\,P(M_{\mathrm{mis},i}|M_i)\,P(\sigma_{\mathrm{obs},i}|\sigma_i)\,P(I_i|\bullet,\phi)$

where $\phi$ are additional parameters related to selection, and $\bullet$ can in principle include any of $M_i$, $\sigma_i$, $M_{\mathrm{obs/mis},i}$ and $\sigma_{\mathrm{obs/mis},i}$.

### Example: black hole $M$-$\sigma$ relation

**Note well** that a binomial term, ${N \choose N_\mathrm{obs}}$, has sneakily appeared.

The reason for this is subtle, and has to do with the statistical concept of *exchangeability* (a priori equivalence of data points).

As we've set things up, the fully observed data are exchangeable with one another, as are the partially observed data, but the full data set is not, by virtue of containing these two classes.

### Example: black hole $M$-$\sigma$ relation

It helps to think in terms of the generative model here. Namely, because the order of data points holds no meaning for us, the binomial term is there to reflect the number of ways we might generate completely equivalent (except for the ordering) data sets.

In other words, $P(I|\ldots)$ shouldn't actually give us the likelihood of a specifically ordered inclusion vector, but instead the likelihood that $I$ has the observed number of fully observed data points in it ($N_\mathrm{obs}$), along with any dependence on $(\bullet,\phi)$.

## Exercise: $M$-$\sigma$ data missing at random

Using the above example, consider this simple inclusion model: we tried to obtain measurements of $M$ for every galaxy, but these attempts were only successful with known probability $\phi$ (thanks to, e.g., clouds).

1. Write down the corresponding expression for $P(I_i|M_i,\sigma_i,M_{\mathrm{obs/mis},i},\sigma_{\mathrm{obs/mis},i},\phi)$

2. Simplify the likelihood given in the example as much as possible. Take particular note of what happens with the terms involving selection (i.e. $N_\mathrm{obs}$ and $\phi$). Do you recognize the function that these terms form?

3. Are selection effects ignorable in this problem?

## Exercise: more general missing $M$-$\sigma$ data

Now let's suppose that, somehow, the inclusion probability depends on $\sigma_i$ (only).

1. Again, simplify the general likelihood for this example as much as possible.

2. Are selection effects ignorable in this problem? Are there specific forms of $P(I_i|\ldots)$ for which they are ignorable, and some for which they are not?

## Bonus exercise: math

Earlier, we claimed that selection can be ignored provided

1. Selection doesn't depend on unobserved values
2. Priors for the interesting ($\theta$) and selection-related ($\phi$) parameters are independent

Show that this is sufficient.

## Bonus exercise: galaxy cluster scaling relations

This is a more general version of a regression where selection plays a role.

Our generative model will include
* true values of mass ($M$) for the $N$ clusters in some search volume
* true values of luminosity ($L$) and temperature ($T$) for each cluster, determined by mean relations and joint intrinsic scatter, all parametrized by $\theta$ (note that in this example, $M$ is the independent variable of the scaling relation, not the dependent one)
* sampling distributions for $L$, $T$ and $M$, which may not be independent
* prior distributions for $M$ and $\theta$

If it helps to visualize things, you can assume the sampling distributions and the intrinsic scatter are multivariate normal.

In this exercise, assume that only clusters with luminosity over some threshold, $L\geq L_\mathrm{th}$, are detected in a survey. We have complete information for $N_\mathrm{obs}$ detected clusters, but no information for the remaining clusters, nor do we know their total number, $N$.

1. Sketch a cartoon of the model and a PGM, and write down the likelihood.
2. In general, are selection effects ignorable for inferences about the $T$-$M$ relation (the mean scaling and the marginal intrinsic scatter in $T$)? If not, are there conditions on the sampling distributions and/or intrinsic scatter that would make selection effects ignorable?
3. In general, are selection effects ignorable for inferences about the $L$-$M$ relation? If not, are there conditions on the sampling distributions and/or intrinsic scatter that would make selection effects ignorable?