# Error propagation and confidence intervals for correlated data points 

You have the data file "data_cov.dat" which contains 10 pairs of measurements $(a_i, b_i)$. You consider reporting the mean and standard deviation on $a$ and $b$. Is this information sufficient to calculate a 99% confidence interval on $\bar{c} = \bar{a} + \bar{b}$? Explain why this is not sufficient and show it through a numerical experiment.   

Bonus: Calculate a 99\% CI on $\bar{c}$. 

Tip: Instead of calculating the sample covariance manually, you can use `np.cov(a, b, ddof=1)`. 

Guideline for solving this exercise: 

- 1) Visualise the pairs of data points $(a_i, b_i)$.
- 2) Evaluate the covariance matrix of your sample and the Pearson correlation coefficient.
- 3) Derive a p-value for your observed correlation coefficient. Is the correlation between $a$ and $b$ meaningful?
- 4) Compare the variance of $c = a+b$ to the variance you would derive from the error propagation formula. Compare the results when you include or drop the covariance between $a$ and $b$ in the error propagation formula. 
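
The comparison in step 4 can be sketched on synthetic data. For a sum, the error propagation formula reads $\sigma_c^2 = \sigma_a^2 + \sigma_b^2 + 2\sigma_{ab}$. The sample below is a hypothetical stand-in for `data_cov.dat` (the mean, covariance and seed are arbitrary assumptions), and the 99% interval uses a Student's $t$ quantile with $N-1$ degrees of freedom, which is one common choice rather than the only possible one.

```python
import numpy as np
from scipy import stats

# Synthetic, positively correlated pairs standing in for the real data file
# (mean, covariance and seed are arbitrary assumptions for this sketch).
rng = np.random.default_rng(42)
true_cov = [[1.0, 0.8], [0.8, 1.0]]
a, b = rng.multivariate_normal([5.0, 3.0], true_cov, size=10).T
N = len(a)

cov = np.cov(a, b, ddof=1)                    # 2x2 sample covariance matrix
var_a, var_b, cov_ab = cov[0, 0], cov[1, 1], cov[0, 1]

var_c_direct = np.var(a + b, ddof=1)          # variance measured directly on c_i = a_i + b_i
var_c_with_cov = var_a + var_b + 2 * cov_ab   # propagation including the covariance
var_c_no_cov = var_a + var_b                  # propagation ignoring the covariance

print(f"direct on c:        {var_c_direct:.3f}")
print(f"propagated, cov:    {var_c_with_cov:.3f}")
print(f"propagated, no cov: {var_c_no_cov:.3f}")

# 99% confidence interval on the mean of c, using the variance that
# includes the covariance term and a Student's t quantile (N-1 dof).
c_bar = np.mean(a + b)
sem_c = np.sqrt(var_c_with_cov / N)           # standard error on the mean
t_crit = stats.t.ppf(0.995, df=N - 1)         # two-sided 99% -> 0.995 quantile
ci_low, ci_high = c_bar - t_crit * sem_c, c_bar + t_crit * sem_c
print(f"99% CI on c_bar: [{ci_low:.3f}, {ci_high:.3f}]")
```

The direct variance and the propagated variance with the covariance term agree to rounding error, because the identity also holds for sample moments computed with the same `ddof`. Dropping the covariance term breaks that agreement, which is why the means and standard deviations of $a$ and $b$ alone are not enough.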

## About correlation coefficients

Given two length-$N$ samples of data $\{x_i\}$ and $\{y_i\}$, Pearson's correlation coefficient is defined as

$$ r = \frac{\sum_{i=1}^N (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^N (x_i-\bar{x})^2}\sqrt{\sum_{i=1}^N (y_i-\bar{y})^2}} $$

where $-1\leq r\leq 1$, and $r=0$ for uncorrelated variables.

If the pairs $(x_i,y_i)$ are drawn from two uncorrelated Gaussian distributions, then the statistic $t = r\sqrt{(N-2)/(1-r^2)}$ follows a Student's $t$ distribution with $k=N-2$ degrees of freedom. In other words, $t$ is $r$ divided by its standard error, $\sigma_r = \sqrt{\frac{1-r^2}{N-2}}$.

Because of this, a measured $r$ can be interpreted in terms of the significance with which we can reject the hypothesis that the variables are uncorrelated. 
E.g., for $N=10$ you can estimate the probability that a value of $r$ at least as extreme as the observed one arises purely from chance fluctuations, using a Student's $t$ distribution. 
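
A small numerical experiment can illustrate this. The sketch below draws many uncorrelated Gaussian samples of size $N=10$, computes Pearson's $r$ for each, and counts how often chance alone produces a coefficient at least as extreme as a given one. The value `r_obs = 0.6`, the number of trials and the seed are hypothetical choices made for illustration only.

```python
import numpy as np
from scipy import stats

N = 10            # sample size, as in the exercise
r_obs = 0.6       # hypothetical observed coefficient (assumption for illustration)
n_trials = 20000  # number of simulated uncorrelated samples

rng = np.random.default_rng(0)
x = rng.standard_normal((n_trials, N))
y = rng.standard_normal((n_trials, N))   # drawn independently of x

# Pearson's r for each simulated sample, following the definition above
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
r_sim = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

# Fraction of chance samples at least as extreme as r_obs (both signs)
p_mc = np.mean(np.abs(r_sim) >= abs(r_obs))

# Analytic counterpart from the t statistic with N-2 degrees of freedom
t_obs = r_obs * np.sqrt((N - 2) / (1 - r_obs**2))
p_t = 2 * stats.t.sf(abs(t_obs), df=N - 2)

print(f"Monte Carlo: {p_mc:.4f}, Student t: {p_t:.4f}")
```

The two numbers should agree within the Monte Carlo noise, which shrinks as `n_trials` grows.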

### About p-value 

The p-value is the probability, assuming the null hypothesis is true, of obtaining a value at least as extreme as the one observed. A small p-value (conventionally smaller than 0.05) is therefore used as an indication that the obtained value is not a statistical fluke. One should however not over-interpret it or put too much trust in it (but it is sometimes the only objective way we have to quantify a "visual" statement). 

Here, we are not testing whether the correlation is strong, but whether the coefficient we find could arise from statistical fluctuations alone. A large positive $r$ means strong positive correlation and a large negative $r$ means anti-correlation. We care about extreme values in both directions, so: 
$p = 2 \times P(T \geq |t|)$, where $T$ follows a $t$ distribution with $N-2$ degrees of freedom. 

Here, a low p-value does not mean that the correlation is weak. It means: if there were truly no correlation, the chance of observing such an extreme $r$ (positive or negative) would be very small.
So, a low p-value gives us confidence that the observed correlation (positive or negative) is real, not random noise.
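
As a minimal sketch, the two-sided p-value can be computed from the $t$ statistic with `scipy.stats.t.sf` and cross-checked against `scipy.stats.pearsonr`. The sample below is synthetic (slope, noise level and seed are arbitrary assumptions) and is not the exercise data.

```python
import numpy as np
from scipy import stats

# Synthetic correlated sample (arbitrary assumption, not the exercise data)
rng = np.random.default_rng(1)
a = rng.normal(size=10)
b = 0.7 * a + rng.normal(scale=0.5, size=10)
N = len(a)

r = np.corrcoef(a, b)[0, 1]                  # Pearson's r
stderr_r = np.sqrt((1 - r**2) / (N - 2))     # standard error on r
t = r / stderr_r                             # t statistic, N-2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t), df=N - 2)   # two-sided: 2 * P(T >= |t|)

# Cross-check against SciPy's built-in test
r_scipy, p_scipy = stats.pearsonr(a, b)

print(f"r = {r:.3f} +/- {stderr_r:.3f}, t = {t:.2f}, p = {p_value:.4f}")
print(f"scipy.stats.pearsonr: r = {r_scipy:.3f}, p = {p_scipy:.4f}")
```

`stats.t.sf` is the survival function $P(T \geq t)$, so the factor 2 accounts for both tails.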

### More about correlation coefficients

Pearson's coefficient can be calculated in Python using `numpy.corrcoef()`.  
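
As a short sketch on synthetic data (the slope and seed are arbitrary assumptions), `numpy.corrcoef()` returns the full $2\times2$ correlation matrix, so $r$ is its off-diagonal element. The same value follows from the definition and from the covariance matrix.

```python
import numpy as np

# Synthetic sample (arbitrary assumption for illustration)
rng = np.random.default_rng(2)
x = rng.normal(size=10)
y = 0.5 * x + rng.normal(size=10)

corr = np.corrcoef(x, y)   # 2x2 matrix with ones on the diagonal
r = corr[0, 1]             # Pearson's r is the off-diagonal element

# Same quantity from the definition of r
dx, dy = x - x.mean(), y - y.mean()
r_def = np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))

# Same quantity from the sample covariance matrix
cov = np.cov(x, y, ddof=1)
r_cov = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

print(r, r_def, r_cov)
```

The normalisation by $N-1$ cancels in the ratio, so the choice of `ddof` does not affect $r$.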

Note that there are two problems with Pearson's coefficient:
- It does not incorporate measurement uncertainties on the data.
- It is highly susceptible to outliers.

Alternative coefficients have been introduced, such as Spearman's rank correlation coefficient (less sensitive to outliers but biased) and Kendall's $\tau$ coefficient. 
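
The sensitivity to outliers can be illustrated with `scipy.stats.spearmanr` and `scipy.stats.kendalltau`. The toy data below are an assumption chosen for clarity: a perfectly linear relation in which one point is replaced by an outlier that keeps the rank order, which is the most favourable case for rank-based coefficients.

```python
import numpy as np
from scipy import stats

x = np.arange(10, dtype=float)
y_clean = x.copy()          # perfectly linear toy data
y_outlier = x.copy()
y_outlier[-1] = 100.0       # single outlier that preserves the rank order

results = {}
for label, y in [("clean", y_clean), ("outlier", y_outlier)]:
    r_pearson = np.corrcoef(x, y)[0, 1]
    r_spearman, _ = stats.spearmanr(x, y)   # rank-based
    tau, _ = stats.kendalltau(x, y)         # rank-based
    results[label] = (r_pearson, r_spearman, tau)
    print(f"{label:8s} Pearson={r_pearson:.3f} "
          f"Spearman={r_spearman:.3f} Kendall={tau:.3f}")
```

Here Pearson's $r$ drops well below 1 because of one point, while the rank-based coefficients are unchanged. An outlier that also changes the rank order would affect all three, though typically to different degrees.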

See **Chapter 3** of the book <a class="anchor" id="book"></a> *Statistics, Data Mining, and Machine Learning in Astronomy* by Ž. Ivezić et al. in the Princeton Series in Modern Astronomy for a more exhaustive discussion of correlation coefficients (and alternatives to Pearson's $r$ correlation coefficient). 

Why use $N-1$ (Bessel's correction) when calculating the covariance $\sigma_{ab}$ (i.e. `np.cov(a, b, ddof=1)`)? See https://stats.stackexchange.com/questions/142456/why-shouldnt-the-denominator-of-the-covariance-estimator-be-n-2-rather-than-n-1/142472#142472 . 
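
The effect of the $N-1$ denominator can also be checked numerically. The sketch below is a Monte Carlo illustration under assumed values (true covariance 0.5, unit variances, arbitrary seed): averaged over many samples of size 10, the estimator with $N-1$ recovers the true covariance, while the one with $N$ is low by a factor $(N-1)/N$.

```python
import numpy as np

N = 10                 # sample size, as in the exercise
n_trials = 50000       # number of simulated samples
true_cov_ab = 0.5      # assumed true covariance (illustrative value)
cov_matrix = [[1.0, true_cov_ab], [true_cov_ab, 1.0]]

rng = np.random.default_rng(3)
samples = rng.multivariate_normal([0.0, 0.0], cov_matrix, size=(n_trials, N))
a, b = samples[..., 0], samples[..., 1]   # each of shape (n_trials, N)

# Sum of cross-deviations from the sample means, one value per simulated sample
da = a - a.mean(axis=1, keepdims=True)
db = b - b.mean(axis=1, keepdims=True)
cross = (da * db).sum(axis=1)

mean_ddof1 = np.mean(cross / (N - 1))   # average of the N-1 estimator (ddof=1)
mean_ddof0 = np.mean(cross / N)         # average of the N estimator (ddof=0)

print(f"true: {true_cov_ab}, ddof=1: {mean_ddof1:.4f}, ddof=0: {mean_ddof0:.4f}")
```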

In [None]:
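# Visualising the data
import numpy as np
import matplotlib.pyplot as plt

# Assumption: "data_cov.dat" is a plain-text file with two whitespace-separated columns (a, b)
data_sample = np.loadtxt("data_cov.dat")      # load the data
a, b = data_sample[:, 0], data_sample[:, 1]   # save a, b columns into variables for legibility

# Create an axis object to visualise the data points
fig, ax = plt.subplots()
ax.scatter(a, b)   # one marker per (a_i, b_i) pair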

ax.set_aspect('equal')
ax.set_xlabel('a')
ax.set_ylabel('b')

In [None]:
# calculate the covariance matrix to quantify the visual evidence for a correlation

In [None]:
# Calculate a correlation coefficient and its uncertainty (as a complement to assess the covariance)

In [None]:
# Calculate the p value to see if our value of r could have been as large just from statistical fluctuations

In [None]:
# Compare the uncertainty on c using the covariance and ignoring it 