# Clustering

See:
* [Alexie's 214 notes](https://github.com/alexieleauthaud/ASTR214_2017/blob/master/class9and10_clustering.ipynb)
* [Corrfunc, a fast implementation of these algos](https://github.com/manodeep/Corrfunc)
* [Appendix A in Zehavi+ 2011](https://iopscience.iop.org/article/10.1088/0004-637X/736/1/59/meta#apj391584app1) talks about the information content of auto vs cross correlation.
* [Appendix in Zu+ 2008](https://arxiv.org/abs/0712.3570)

Most broadly, what is the probability of finding an object in a volume $dV$ some distance $r$ away from another object? 

# Two Point Correlation Function (2pcf)

What is the excess probability (compared to a random distribution) of finding two galaxies separated by a distance $r$? **Excess** is important, else you are also measuring the mean number density $n$. Mathematically,

$$
dP(r) = n_1 n_2(1 + \xi(r))dV_1 dV_2
$$

where $\xi(r)$ is the correlation function, which is unitless and must be $\geq -1$ (else the probability goes negative and we can't have that!). $r$ is the vector pointing from $V_1$ to $V_2$. However, as we assume things are isotropic, only the length of that vector matters.

We know that on average the probability of finding a galaxy in a volume element is $dP = n dV$ and so, for unclustered (random) positions, the probability of finding galaxies in both volume elements is $dP = n_1 dV_1 n_2 dV_2$. Therefore $\xi(r) > 0$ implies clustering: galaxies are more likely to be separated by that distance than random. $\xi(r) < 0$ implies anti-clustering.

To compute this in practice we set up bins, e.g. $10 < r < 11$ Mpc. We then count the number of pairs whose separation falls in that bin. If we are working in a simulation we can then use the known $n$ and volume to compare this count to the expected one. However, in observations with complex selection effects (redshift dependences, masks, collisions, etc.) a carefully constructed random catalog should be used to find the expected number of pairs with this separation.

## Auto vs Cross correlation

Which objects are we measuring the clustering of? There are two options:
* Auto correlation: Select a sample of objects (e.g. galaxies with stellar mass between $10^{11}$ and $10^{11.1}$) and find the excess probability of finding members of this sample at a distance $r$ away from other members of it.
* Cross correlation: Select two samples of objects (e.g. galaxies with stellar mass between $10^{11}$ and $10^{11.1}$ (sample 1) and ones between $10^{11.1}$ and $10^{11.2}$ (sample 2)) and find the excess probability of finding members of sample 2 at a distance $r$ from sample 1. Note that the order of these doesn't matter.

## 2d vs 3d correlation

Imagine you don't have redshifts, you only have angular (RA, Dec) positions on the sky. You can still define a 2d angular correlation function,

$$
dP = n_1 n_2 (1 + \omega(\theta))dA_1 dA_2
$$

where $n$ is now the projected mean number density and $\omega(\theta)$ is the angular correlation function.

There are many other correlation functions we can build. If we define:
* $r$: Separation in real space
* $s$: Separation in redshift space
* $r_p$: Redshift space separation perpendicular to the LOS 
* $r_{\pi}$: Redshift space separation along the LOS 
* $\theta$: Angular separation

We can have:
* $\xi(r)$: Real space 3d correlation function
* $\xi(s)$: Redshift space 3d correlation function. Sometimes called $\xi(r_p, r_{\pi})$
* $\omega_p(r_p)$: Correlation as a function of perpendicular separation (projected)
* $\omega(\theta)$: The angular version of $\omega_p(r_p)$, not using redshift to get distances
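
As a rough illustration of how $r_p$ and $r_{\pi}$ can be obtained for a single pair, here is a minimal sketch. It assumes the observer sits at the origin and takes the line of sight to be the direction to the midpoint of the pair, which is one common convention but not the only one. The function name `rp_rpi` is made up for this example.

```python
import math


def rp_rpi(s1, s2):
    """Split the separation of two redshift-space positions into (r_p, r_pi).

    Assumption: observer at the origin, line of sight = direction to the
    midpoint of the pair.
    """
    sep = [b - a for a, b in zip(s1, s2)]             # separation vector s
    los = [0.5 * (a + b) for a, b in zip(s1, s2)]     # midpoint vector
    los_norm = math.sqrt(sum(x * x for x in los))
    # Component of the separation along the line of sight.
    r_pi = abs(sum(s * l for s, l in zip(sep, los))) / los_norm
    # Whatever is left over is perpendicular to the line of sight.
    s_sq = sum(s * s for s in sep)
    r_p = math.sqrt(max(s_sq - r_pi * r_pi, 0.0))
    return r_p, r_pi
```

A pair lined up behind one another, e.g. `rp_rpi((0, 0, 100), (0, 0, 110))`, is purely line-of-sight, while a pair side by side at the same distance is purely perpendicular.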

## Uncertainties

Let's say we have a simulation. We can compute the correlation function (let's say a $\xi(r)$ autocorrelation) in various $r$ bins and with various samples (let's say $n$ mass bins). What are the uncertainties on these estimates? What are the covariances between them?

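A common way to estimate this is with a jackknife. Divide the simulation into, say, 100 roughly equal-sized regions and compute the correlation function 100 times, each time leaving out one of these regions. Do this for all $r$ bins and mass bins using the same regions. You can now build a covariance matrix from the scatter of these 100 measurements about their mean (remembering the jackknife prefactor $(N - 1)/N$ on the sum of products, since the leave-one-out samples overlap heavily).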
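
A minimal sketch of the last step, assuming the leave-one-out correlation functions have already been measured and stored one row per removed region. The name `jackknife_covariance` and the input layout are illustrative, not taken from any particular library.

```python
def jackknife_covariance(estimates):
    """Covariance matrix from leave-one-out estimates.

    estimates[k][i] is the measurement in bin i with region k removed.
    """
    n_regions = len(estimates)
    n_bins = len(estimates[0])
    # Mean of each bin over the leave-one-out realizations.
    mean = [sum(row[i] for row in estimates) / n_regions for i in range(n_bins)]
    # Jackknife prefactor: the realizations share most of their data, so the
    # raw scatter between them underestimates the true scatter.
    norm = (n_regions - 1) / n_regions
    cov = [[0.0] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for j in range(n_bins):
            cov[i][j] = norm * sum(
                (row[i] - mean[i]) * (row[j] - mean[j]) for row in estimates
            )
    return cov
```

The bins can be flattened in any order (e.g. all $r$ bins for the first mass bin, then the second, and so on) as long as every row uses the same order.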

# Estimators

So far, we've talked about the correlation function being an excess in probability. But how do we actually compute this?

We can count the number of galaxy pairs whose separation falls in a certain distance bin ($r_1 < r < r_2$). Let's say we are doing a cross correlation between two samples, $s_1$ and $s_2$. We call the number of data pairs in our bin DD.
If we are in a simulation, which has a very simple geometry, we can easily compute $\xi(r)$ from this. The expected number of pairs is just a function of $V_{bin} = 4/3 \pi (r_2^3 - r_1^3)$, $V_{sim}$, and the number of items in the two samples $ls_1$, $ls_2$.

$$
n_{exp} = \frac{V_{bin}  ls_1  ls_2}{V_{sim}}
$$
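
A brute-force sketch of this for a single bin, assuming a periodic cubic box of side `box` with positions in $[0, \mathrm{box})$ and a bin that stays below half the box size (otherwise the spherical shell volume is no longer right). The function names are made up for this example; a real measurement would use a fast pair counter such as Corrfunc.

```python
import math


def periodic_separation(a, b, box):
    """Distance between points a and b in a periodic cube of side `box`."""
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        d = min(d, box - d)  # take the shorter way around the box
        total += d * d
    return math.sqrt(total)


def xi_periodic_box(sample1, sample2, r_lo, r_hi, box):
    """Cross-correlation xi in one bin r_lo <= r < r_hi of a periodic box."""
    # DD: number of (sample1, sample2) pairs whose separation is in the bin.
    dd = sum(
        1
        for a in sample1
        for b in sample2
        if r_lo <= periodic_separation(a, b, box) < r_hi
    )
    # Expected count for unclustered points: shell volume times both sample
    # sizes, divided by the simulation volume.
    v_bin = 4.0 / 3.0 * math.pi * (r_hi**3 - r_lo**3)
    n_exp = v_bin * len(sample1) * len(sample2) / box**3
    return dd / n_exp - 1.0
```

This is $O(ls_1 \, ls_2)$ per bin, so it is only useful for checking small cases by hand.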

However, in observations we typically cannot easily calculate the volume. There are masks (e.g. bright star) and the depth might not be uniform. To get around this, we use a random catalog. Assuming we have two of these ($R_1$, $R_2$) with lengths ($lr_1$, $lr_2$; these should usually be much larger than $ls_{1,2}$ to minimize errors), we count the number of random-random pairs in the bin, RR, and can compute,

$$
n_{exp} = RR \frac{ls_1 ls_2}{lr_1 lr_2}
$$

Note that $n_{exp}$ is just $n_1 n_2 dV_1 dV_2$ and so once we have this it is easy to get to $\xi$:

$$
1 + \xi(r) = \frac{DD}{n_{exp}}
$$

Some examples (taking the normalization factor to be 1, so that $n_{exp} = RR$),
* $DD = 0 \rightarrow \xi(r) = -1$. Completely anti-clustered: no pairs at this separation
* $DD = RR \rightarrow \xi(r) = 0$. Same clustering as the randoms
* $DD = 2RR \rightarrow \xi(r) = 1$. A bit more clustered
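
The same estimator as a small helper, with the normalization by catalog sizes included. This is only a sketch: the function name and argument names are made up, and the pair counts are assumed to come from some external pair counter.

```python
def xi_natural(dd, rr, n_d1, n_d2, n_r1, n_r2):
    """Natural estimator: 1 + xi = DD / n_exp.

    dd, rr     : data-data and random-random pair counts in the bin
    n_d1, n_d2 : sizes of the two data samples
    n_r1, n_r2 : sizes of the two random catalogs
    """
    # Rescale the random pair count to the size of the data samples.
    n_exp = rr * (n_d1 * n_d2) / (n_r1 * n_r2)
    return dd / n_exp - 1.0
```

With equal catalog sizes this reproduces the three cases above; with randoms ten times larger than the data, `rr` is simply scaled down by a factor of 100 before the comparison.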

It turns out, for reasons I won't go into, that this is not the best possible estimator. We can improve on this slightly by instead using (properly normalized):

$$
\xi(r) = \frac{DD - 2DR + RR}{RR}
$$

Note that the left-hand side is $\xi(r)$, not $1 + \xi(r)$. I don't think this is clearly spelled out in the LS paper, but see equation 47 and note the `+ 2` at the end of the final estimator: there 1 has been added to both sides, so it is written for $1 + \xi(r)$.

We can persuade ourselves that the 1 doesn't need to be there by considering unclustered data. In that case all three (normalized) counts equal the same expected value $n_{exp}$, so $DD - 2DR + RR = n_{exp} - 2n_{exp} + n_{exp} = 0$, which is what we expect for $\xi(r)$.

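In full, with the normalization for a cross-correlation, where $DR$ counts pairs between the first data sample and the second random catalog, $RD$ counts pairs between the first random catalog and the second data sample, and we define $f_1 = \frac{lr_1}{ls_1}$ and similarly $f_2$ for $s_2$,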

$$
\xi(r) = \frac{f_1 f_2 DD - f_1 DR - f_2 RD + RR}{RR}
$$

This is the [Landy-Szalay estimator](http://adsabs.harvard.edu/doi/10.1086/172900).
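
A minimal sketch of turning cross-correlation pair counts into $\xi$ with this estimator. The function and argument names are made up for this example, and it assumes `d1r2` pairs data sample 1 with random catalog 2 while `r1d2` pairs random catalog 1 with data sample 2.

```python
def xi_landy_szalay_cross(d1d2, d1r2, r1d2, r1r2, n_d1, n_d2, n_r1, n_r2):
    """Landy-Szalay estimator for a cross-correlation in one bin.

    d1d2, d1r2, r1d2, r1r2 : raw pair counts in the bin
    n_d1, n_d2             : sizes of the two data samples
    n_r1, n_r2             : sizes of the two random catalogs
    """
    f1 = n_r1 / n_d1  # random-to-data size ratio for sample 1
    f2 = n_r2 / n_d2  # random-to-data size ratio for sample 2
    return (f1 * f2 * d1d2 - f1 * d1r2 - f2 * r1d2 + r1r2) / r1r2
```

If every possible pair lands in the bin with the same probability regardless of catalog (unclustered data), the four terms cancel and the function returns zero.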

We need to be a bit careful with the normalization. When doing a cross-correlation, it is exactly as I have shown above. However, when doing an autocorrelation, things are slightly different. Because self matches are not possible in an autocorrelation, the number of possible DD pairs is $ls (ls - 1)$ rather than $ls^2$ (and similarly for RR), while all $ls \, lr$ data-random pairs are still possible. Counting each DD and RR pair twice (once in each order),

$$
\xi(r) = \frac{\frac{lr (lr - 1)}{ls (ls - 1)} DD - 2 DR \frac{lr - 1}{ls} + RR}{RR}
$$

See Equation 3 in the LS paper.
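
A sketch of the autocorrelation case, written by dividing each raw count by the number of possible pairs of that type. It assumes `dd` and `rr` count every pair twice (once in each order) with self matches excluded, which is an assumption about the pair counter rather than something fixed by the estimator. The function name is made up.

```python
def xi_landy_szalay_auto(dd, dr, rr, n_d, n_r):
    """Landy-Szalay estimator for an autocorrelation in one bin.

    dd, rr : data-data and random-random counts, each pair counted twice
    dr     : data-random count (every data point against every random)
    n_d    : number of data points
    n_r    : number of random points
    """
    dd_norm = dd / (n_d * (n_d - 1))  # no self pairs, hence n - 1
    dr_norm = dr / (n_d * n_r)        # all data-random pairs are allowed
    rr_norm = rr / (n_r * (n_r - 1))
    return (dd_norm - 2.0 * dr_norm + rr_norm) / rr_norm
```

If the pair counter returns each pair only once, `dd` and `rr` should be doubled before being passed in.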

# Total information content 

Let's say we have a population of galaxies A(ll) that we divide into two populations, B(lue) and R(ed). We can compute 5 correlation functions from these populations: AA, BB, RR, BR, RB (in this section RR means red-red, not random-random). Do all of these contain independent information?

What is the probability that we find *any* pair at a given separation? Somewhat trivially, it must just be the sum of the probabilities of finding each of the 4 types of pairs that it could be.

$$
dP_{AA} = dP_{BB} + dP_{RR} + dP_{BR} + dP_{RB}
$$

Of course, $dP_{BR} = dP_{RB}$ and so,

$$
dP_{AA} = dP_{BB} + dP_{RR} + 2 dP_{BR}
$$

Therefore in terms of $\xi$,

$$
n_A^2(1 + \xi_{AA}) = n_B^2 (1 + \xi_{BB}) + n_R^2 (1 + \xi_{RR}) + 2 n_B n_R (1 + \xi_{BR})
$$

but, $n_A^2 = (n_B + n_R)^2 = n_B^2 + n_R^2 + 2 n_B n_R$ and so subtracting this from both sides,

$$
n_A^2 \xi_{AA} = n_B^2 \xi_{BB} + n_R^2 \xi_{RR} + 2 n_B n_R \xi_{BR}
$$

And therefore, given the number densities and any 3 of $\xi_{AA}$, $\xi_{BB}$, $\xi_{RR}$, and $\xi_{BR}$, we can determine the fourth.
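
As a quick numerical check of this relation, here is a small sketch that reconstructs $\xi_{AA}$ from the blue and red pieces in a single bin. The function name is made up, and the inputs are assumed to be the mean number densities (or equivalently the sample sizes in the same volume) and the three measured correlation values.

```python
def xi_all_from_parts(n_b, n_r, xi_bb, xi_rr, xi_br):
    """xi_AA in one bin from the blue/red auto- and cross-correlations."""
    n_a = n_b + n_r  # the full sample is the union of blue and red
    return (
        n_b**2 * xi_bb + n_r**2 * xi_rr + 2 * n_b * n_r * xi_br
    ) / n_a**2
```

For example, with equal densities and $\xi_{BB} = 1$, $\xi_{RR} = 3$, $\xi_{BR} = 2$, the full-sample value is $(1 + 3 + 2 \cdot 2)/4 = 2$.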
