# Errors, Correlations and Hypothesis Testing

Please ensure you have watched the Chapter 4 video(s).

## You will learn the following things in this Chapter

- Weighted errors.
- Covariance and correlation.
- Null hypothesis testing.
- How to use Python programming to do the above.
- After completing this notebook you will be able to attempt CA 1 questions 2 and 3.

***

## Introduction to Errors

If the measurement of a particular quantity is subject to many independent, random errors, then the *central limit theorem* allows us to use the normal distribution to model the quantity’s errors.


Note that there are two types of errors:
- **statistical errors** - these arise from the random nature of the measurement process, and can be reduced by increasing the number of measurements and averaging over them.
- **systematic errors** - these arise from flawed measurements (e.g. a rogue voltmeter adding +2V to every measurement because it's not properly calibrated). They are not reduced by averaging: the offset remains even after repeating the measurement many times.
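The difference between the two types of error can be sketched with a simulated voltmeter. The true voltage, the noise level and the seed below are assumptions chosen purely for illustration; only the +2V offset comes from the example above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

true_voltage = 5.0   # assumed true value (V)
noise_sigma = 0.5    # assumed statistical error on each reading (V)
offset = 2.0         # systematic offset of the rogue voltmeter (V)

for n in (10, 1000, 100000):
    # every reading carries the same offset plus independent random noise
    readings = true_voltage + offset + noise_sigma * rng.standard_normal(n)
    mean = readings.mean()
    sem = readings.std(ddof=1) / np.sqrt(n)  # standard error on the mean
    print(f"N = {n:6d}: mean = {mean:.3f} V, standard error = {sem:.4f} V")
```

The standard error shrinks roughly as $1/\sqrt{N}$, but the mean settles near 7V rather than 5V: averaging does nothing to the systematic offset.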

## Combining measurements with different errors

Suppose we have two students, let's call them $A$ and $B$, who make a measurement of the length of our snake, $x$. Student $A$ finds the length to be $x = x_A \pm \sigma_A$, while student $B$ finds that $x = x_B \pm \sigma_B$. Given that both sets of data are valid estimates of the snake's length, we'd like to combine the results from the two experiments, to get a new, and hopefully improved result $x_{AB}$, with an associated uncertainty $\sigma_{AB}$.

How to proceed? It is tempting to simply average the two results, i.e. $x_{AB} = \dfrac{x_A + x_B}{2}$, but this feels a bit fishy if the two uncertainties $\sigma_A$ and $\sigma_B$ are not equal. Why should they have equal weighting, if one is less precise (higher uncertainty) than the other? The answer is to weight the values according to their uncertainties, to produce a **weighted average**.

$\hat{x_0} = \dfrac{w_A x_A + w_B x_B} {w_A + w_B}$

where $\hat{x_0}$ denotes the weighted average, and $\hat{\sigma}_{x_0}$ is its standard deviation:

$\hat{\sigma}_{x_0} = \dfrac{1}{\sqrt{\sum w_i}},$

where $w_i$ denotes the individual weights of each component in the average, with $w_A = 1/\sigma_A^2$ and $w_B = 1/\sigma_B^2$.

This type of weighting (also called optimal weighting) is extremely important in data analysis. Optimal weighting allows you to take account of all data points, with each point contributing to the final result in a way that depends on how much you trust the data (i.e. the variance of the point). The catch is that you need to know something about the error in each point, which is not always the case.
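As a minimal sketch of the weighted average, the function below applies the two formulas above to any number of measurements. The function name `weighted_mean` and the snake lengths are made up for illustration.

```python
import numpy as np

def weighted_mean(values, sigmas):
    """Return the inverse-variance weighted mean and its uncertainty."""
    values = np.asarray(values, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    weights = 1.0 / sigmas**2                      # w_i = 1 / sigma_i^2
    mean = np.sum(weights * values) / np.sum(weights)
    sigma = 1.0 / np.sqrt(np.sum(weights))         # 1 / sqrt(sum of weights)
    return mean, sigma

# hypothetical measurements (in metres): A = 1.52 +/- 0.02, B = 1.46 +/- 0.04
x_ab, sigma_ab = weighted_mean([1.52, 1.46], [0.02, 0.04])
print(f"x_AB = {x_ab:.3f} +/- {sigma_ab:.3f} m")
```

With these numbers the result is pulled towards student A's more precise value, and the combined uncertainty is smaller than either individual uncertainty.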

#### Derivation

To work this out, we are going to assume once again that the errors in the snake length are normally distributed, and the two experiments performed by students $A$ and $B$ were completely independent (e.g. the snake was not stretched by A during her attempt at measurement).   The probability that the students would obtain their resulting lengths for the snake is given by, 

$P_{x_0}(x_A) \propto \dfrac{1}{\sigma_A} e^{-(x_A -x_0)^2 / 2\sigma_A^2}$ for student $A$

and

$P_{x_0}(x_B) \propto \dfrac{1}{\sigma_B} e^{-(x_B -x_0)^2 / 2\sigma_B^2}$ for student $B$.

Note that the probabilities depend on the unknown, but true value of the snake's length $x_0$.
So the probability that *both* students found the lengths $x_A$ and $x_B$ is then simply the product of the two:

$P_{x_0}(x_A \cap x_B) = P_{x_0}(x_A , x_B) = P_{x_0}(x_A) \times P_{x_0}(x_B) \propto \dfrac{1}{\sigma_A \sigma_B} e^{-\chi^2/2}$

where we have introduced the notation $\chi^2$ (chi-squared) as a shorthand for,

$\chi^2 = \left( \dfrac{x_A - x_0}{\sigma_A} \right)^2 + \left( \dfrac{x_B- x_0}{\sigma_B} \right)^2.$

Using the principle of *maximum likelihood*, we want the value of $x_0$ that maximises the chance of $A$ finding $x_A$ *and* $B$ finding $x_B$. Since $P_{x_0}(x_A , x_B)$ has a maximum when $\chi^2$ has a minimum, we differentiate $\chi^2$ with respect to $x_0$ and set the derivative equal to zero,

$\dfrac{d\chi^2}{dx_0} = -2 \dfrac{x_A - x_0}{\sigma_A^2} - 2 \dfrac{x_B- x_0}{\sigma_B^2} = 0$

The solution for $x_0$ is then simply,

$ {\rm best~ estimate~for~} x_0 = \left( \dfrac{x_A}{\sigma_A^2} + \dfrac{x_B}{\sigma_B^2}  \right) \Big/ \left( \dfrac{1}{\sigma_A^2} + \dfrac{1}{\sigma_B^2}  \right)$

If we define weights to have the form $w_A = \dfrac{1}{\sigma_A^2}$ and $w_B = \dfrac{1}{\sigma_B^2}$, then we can tidy this up to obtain,

$\hat{x_0} = \dfrac{w_A x_A + w_B x_B} {w_A + w_B}$

where $\hat{x_0}$ denotes the weighted average.

Using the standard error propagation formula (see the **Errors** section below), with $\partial \hat{x_0} / \partial x_i = w_i / \sum w_j$ and $\sigma_i^2 = 1/w_i$, we can then derive the uncertainty in $\hat{x_0}$ as,

$\hat{\sigma}_{x_0} = \dfrac{1}{\sqrt{\sum w_i}}$
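The result of the derivation can be checked numerically. The sketch below evaluates $\chi^2$ on a grid of trial values of $x_0$ and compares the location of the minimum with the weighted average. The measurements and the grid are assumptions chosen for illustration.

```python
import numpy as np

# hypothetical measurements (in metres)
x_a, sigma_a = 1.52, 0.02
x_b, sigma_b = 1.46, 0.04

def chi2(x0):
    """Chi-squared of the two measurements for a trial true length x0."""
    return ((x_a - x0) / sigma_a) ** 2 + ((x_b - x0) / sigma_b) ** 2

# brute-force search for the minimum on a fine grid (step of 1e-5 m)
x0_grid = np.linspace(1.40, 1.60, 20001)
x0_numeric = x0_grid[np.argmin(chi2(x0_grid))]

# analytic result: the weighted average
w_a, w_b = 1 / sigma_a**2, 1 / sigma_b**2
x0_analytic = (w_a * x_a + w_b * x_b) / (w_a + w_b)

print(f"grid minimum of chi^2 at x0 = {x0_numeric:.4f} m")
print(f"weighted average          = {x0_analytic:.4f} m")
```

The two values should agree to within the grid spacing, which is a quick way to convince yourself that the weighted average really is the maximum likelihood estimate.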

**Errors**

Let's first assume that we have a function $f$ that is dependent on some measured quantity $x$, and yields a value $y$ that we are interested in knowing, such that $y = f(x)$. Now the measurements of $x$ are associated with some random error, $\sigma_x$, and so the final value of $y$ will also have an error $\sigma_y$. How do we calculate $\sigma_y$?

Assuming the errors in $x$ are small, so that the measured values are close to the true value $\hat{x}$, we can expand $f(x)$ around the point $\hat x$,

$f(x) = f(\hat x) + (x - \hat x) \left( \dfrac{df} {dx} \right)_{\hat x}  + \dotsb$

If we now identify $\hat y = f(\hat x)$, then we can see that,

$y - \hat y = f(x) -  f(\hat x) \approx  (x - \hat x) \left( \dfrac{df} {dx} \right)_{\hat x}.$

which gives us an expression for how the value of $y$ derived from our measured value of $x$ relates to the true values of both $y$ and $x$, which are given by $\hat y$ and $\hat x$. If we then take many measurements of $x$, we can square the expression above and average over the measurements to write the variance (the mean squared deviation) as

$\dfrac {1}{N}\sum_i^N (y_i - \hat y)^2 = \left( \dfrac{df} {dx} \right)^2_{\hat x} \dfrac {1}{N}\sum_i^N (x_i - \hat x)^2$

or simply,

$\sigma_y^2 = \left( \dfrac{df} {dx} \right)^2_{\hat x} \, \sigma_x^2$

which is the result you are probably familiar with from your first year labs!  
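As a rough check of this formula, the sketch below propagates an error through $y = x^2$ and compares the result with the scatter of simulated measurements. The function, the values of $\hat x$ and $\sigma_x$, and the seed are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed so the sketch is repeatable

x_hat, sigma_x = 10.0, 0.1       # assumed true value and (small) error on x

def f(x):
    return x**2                  # illustrative choice of function

# error propagation: sigma_y = |df/dx| * sigma_x, with df/dx = 2x at x_hat
dfdx = 2 * x_hat
sigma_y_formula = abs(dfdx) * sigma_x

# Monte Carlo check: simulate many measurements of x and look at the scatter in y
samples = x_hat + sigma_x * rng.standard_normal(200_000)
sigma_y_mc = f(samples).std()

print(f"propagation formula: sigma_y = {sigma_y_formula:.3f}")
print(f"simulated scatter:   sigma_y = {sigma_y_mc:.3f}")
```

The agreement relies on the errors being small compared with the scale over which $f$ curves. Increasing `sigma_x` makes the first-order expansion progressively worse.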

For a function of two variables, $z = f(x, y)$, we get the following:

$\sigma_z^2 = \left( \dfrac{\partial f} {\partial x} \right)^2 \sigma_x^2 + \left( \dfrac{\partial f} {\partial y} \right)^2 \sigma_y^2 + 2 \dfrac{\partial f} {\partial x} \dfrac{\partial f} {\partial y} \sigma_{xy}.$

Ignoring the last term on the RHS for a moment, we see that the expression is the normal error propagation formula that you learnt during your lab work (for independent errors). If the errors in $x$ and $y$ are not independent, then we need the last term! It contains $\sigma_{xy}$, which is called the *covariance*:

$\sigma_{xy} = \dfrac{1}{N}\sum (x - \hat x) (y - \hat y).$

The variance of a variable describes how much the values are spread. The covariance is a measure of the amount of dependency between two variables. A positive covariance means that the values of the first variable are large when the values of the second variable are also large. A negative covariance means the opposite: large values of one variable are associated with small values of the other.
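To see why the covariance term matters, the sketch below uses the simplest possible case, $z = x + y$, where both partial derivatives are 1. The data are simulated, and the dependence of $y$ on $x$ is an assumption made to force a positive covariance.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed so the sketch is repeatable

# simulated correlated data: y depends partly on x
x = 2.0 * rng.standard_normal(5000) + 10.0
y = 0.5 * x + rng.standard_normal(5000)

# bias=True uses the 1/N normalisation of the formula above
cov_matrix = np.cov(x, y, bias=True)
var_x, var_y = cov_matrix[0, 0], cov_matrix[1, 1]
cov_xy = cov_matrix[0, 1]

# z = x + y, so df/dx = df/dy = 1
var_z_full = var_x + var_y + 2 * cov_xy   # with the covariance term
var_z_no_cov = var_x + var_y              # independent-errors formula only
var_z_direct = np.var(x + y)              # variance measured directly from z

print(f"with covariance term:    {var_z_full:.3f}")
print(f"without covariance term: {var_z_no_cov:.3f}")
print(f"direct variance of z:    {var_z_direct:.3f}")
```

For a sum the full formula is exact, so the first and third numbers agree, while dropping the covariance term underestimates the variance of $z$ for positively correlated data.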

*Problems with covariance:*

The problem with covariance is that it keeps the scale of the variables $X$ and $Y$, and therefore can take on any value. This makes interpretation difficult and comparing covariances to each other impossible. For example, $\sigma_{XY}  = 5.2$ and $\sigma_{ZQ}= 3.1$ tell us that these pairs are positively associated, but it is difficult to tell whether the relationship between $X$ and $Y$ is stronger than $Z$ and $Q$ without looking at the means and distributions of these variables.  We can normalise the covariance to give us both direction and strength of the correlation between these parameters. 

### Correlation

Two variables may have a positive association, so that as the values for one variable increase, so do the values of the other variable. Alternatively, the association could be negative or neutral. Correlation quantifies this association, often as a measure between the values -1 to 1 for perfectly negatively correlated and perfectly positively correlated. The calculated correlation is referred to as the “correlation coefficient.” This correlation coefficient can then be interpreted to describe the measures.

For a linear function, the extent to which data points $(x_1, y_1)... (x_N, y_N)$ support a linear correlation is given by the *linear correlation coefficient* sometimes called the Pearson correlation coefficient,

$r = \dfrac{\sigma_{xy}} {\sigma_x\,\sigma_y} = \dfrac{\sum(x - \hat x)(y - \hat y)} { \sqrt{\sum (x - \hat x)^2 \sum ( y- \hat y)^2} }.$

If $r$ is close to $\pm 1$, then we would say that the points are correlated. Completely uncorrelated points would have $r=0$.
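The formula for $r$ translates almost directly into code. The sketch below is one possible from-scratch version (the name `pearson_r` and the simulated data are assumptions), compared against `np.corrcoef` as a cross-check.

```python
import numpy as np

def pearson_r(x, y):
    """Linear correlation coefficient computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()            # deviations of x from its mean
    dy = y - y.mean()            # deviations of y from its mean
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# simulated data with a built-in linear relationship plus noise
rng = np.random.default_rng(3)
x = rng.standard_normal(200)
y = 2.0 * x + rng.standard_normal(200)

r_scratch = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal element of the 2x2 matrix

print(f"from scratch: r = {r_scratch:.4f}")
print(f"np.corrcoef:  r = {r_numpy:.4f}")
```

Note that the normalisation ($1/N$ or $1/(N-1)$) cancels between the numerator and denominator, so it does not affect $r$.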

But, if we look at standard probability tables (see table below), the probability of getting $|r| \ge 0.7$ is 51% for $N=3$ *even if the 2 variables are uncorrelated*. Therefore we should combine our $r$ correlation value with some measure of the probability of getting that value purely by chance from uncorrelated data. See the next workbook.

![alt text](ro.png "Title")
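The 51% figure for $N=3$ can be checked with a quick simulation. The sketch below draws many sets of three uncorrelated points and counts how often $|r| \ge 0.7$. Drawing the points from a normal distribution, the number of trials and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)   # fixed seed so the sketch is repeatable
n_points = 3                     # N, the number of data points per experiment
n_trials = 20_000                # number of simulated experiments

# each row is one experiment with N independent (uncorrelated) x and y values
x = rng.standard_normal((n_trials, n_points))
y = rng.standard_normal((n_trials, n_points))

# correlation coefficient for every experiment at once
dx = x - x.mean(axis=1, keepdims=True)
dy = y - y.mean(axis=1, keepdims=True)
r = np.sum(dx * dy, axis=1) / np.sqrt(np.sum(dx**2, axis=1) * np.sum(dy**2, axis=1))

fraction = np.mean(np.abs(r) >= 0.7)
print(f"fraction of uncorrelated N=3 datasets with |r| >= 0.7: {fraction:.2f}")
```

Raising `n_points` makes large values of $|r|$ from uncorrelated data rapidly less likely, which is why a given $r$ is far more convincing for a large dataset than for a small one.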

**Covariance Matrix**

The covariance matrix is a matrix that summarises the variances and covariances of a set of parameters. Typical python data fitting routines will return this matrix. The diagonal elements of the matrix are the variances of the individual parameters, and the off-diagonal elements are the covariances between pairs of parameters. The sample variance is given by:

$\sigma^2 = \dfrac{1}{n-1} \sum_{i=1}^n (x_i-\hat{x})^2$

with $n$ the number of data points, and $\hat{x}$ the mean. The covariance is given by

${\rm cov}(x,y) = \dfrac{1}{n-1} \sum_{i=1}^n  (x_i-\hat{x})(y_i-\hat{y})$.

The covariance matrix for a set of data denoted by matrix **X** can also be created using the following method: 

${\bf x} = {\bf X} - {\bf 11^T}{\bf X} ( 1 / n )$

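where $n$ is the number of rows in the $n \times k$ data matrix **X** (one column per variable), **1** is an $n \times 1$ column vector of ones, and ${\bf 1^T}$ is its transpose. Each column of **x** then holds the deviations of that variable from its mean.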

Then we compute ${\bf x^Tx}$, the $k \times k$ deviation sums of squares and cross products matrix for **x**. Finally we divide each term in this matrix by $n$ to create the variance-covariance matrix (dividing by $n-1$ instead gives the sample form used above):

${\bf V} = {\bf x^Tx} ( 1 / n ).$
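The matrix recipe above can be sketched in a few lines of `numpy`. The data matrix here is simulated (its size and the built-in dependence between the first two columns are assumptions), and `np.cov` with `bias=True` is used as a cross-check because it also divides by $n$.

```python
import numpy as np

rng = np.random.default_rng(5)   # fixed seed so the sketch is repeatable

# simulated data matrix X: n rows (measurements), k columns (variables)
n, k = 500, 3
X = rng.standard_normal((n, k))
X[:, 1] += 0.8 * X[:, 0]         # make the second variable depend on the first

ones = np.ones((n, 1))           # the n x 1 column vector of ones

# deviation matrix: subtract the column means from every row
x = X - ones @ ones.T @ X / n

# k x k variance-covariance matrix
V = x.T @ x / n

# cross-check against numpy (rowvar=False: columns are the variables)
V_numpy = np.cov(X, rowvar=False, bias=True)

print(np.round(V, 3))
print("matches np.cov:", np.allclose(V, V_numpy))
```

In practice `X - X.mean(axis=0)` gives the same deviation matrix without building the $n \times n$ matrix ${\bf 11^T}$. The explicit form is used here only to mirror the equations.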

## Null hypothesis testing

The general framework we have looked at in lectures is referred to as *Null Hypothesis Significance Testing*, which we will abbreviate as *NHST*.  Hypothesis testing is the bread and butter of inferential statistics and a critical skill in the repertoire of a data scientist. 

Given an unknown parameter $\theta$, and a dataset $X=\{x_1,x_2,\dots\}$ with probability of getting the data given by $p(X,\theta)$, does $X$ support the idea that $\theta$ is within a set of possible values $\Theta$? Classical hypothesis testing is based around two concepts:

\begin{align}
H_0 &:&~\theta \in \Theta_0 &~~ \text{the null hypothesis} \\
H_1 &:&~\theta \in \Theta_1 &~~ \text{the alternative hypothesis}
\end{align}

The null hypothesis assumes that nothing interesting happens/happened. The alternative hypothesis is where the action is, i.e. some observation/phenomenon is real (not a fluke) and statistical analysis will give us more insight into it.

Statisticians take a pessimistic sort of view and start with the null hypothesis, i.e. the *null* hypothesis is what we are going to assume is true, though we are normally trying to show that it is not! We then compute a test statistic and ask "What is the chance of observing a test statistic at least this extreme for this sample (considering its size and the probability governing the system), purely randomly (i.e. if the null hypothesis were true)?"

This chance (the probability of observing such a test statistic if the null hypothesis is true) is the so-called $p$-value.

Remember, you cannot prove that something is correct in classical hypothesis testing, only show that it is unlikely to be correct. This is why the test focuses on $H_0$: either we reject $H_0$ in favour of $H_1$, or we fail to reject it, in which case the data give no support to our hypothesis that $\theta \in \Theta_1$.
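As a minimal sketch of how a $p$-value is computed, the example below uses a two-sided $z$-test on a sample mean. The choice of test, the function name `two_sided_p_value` and all of the numbers are assumptions for illustration; other test statistics follow the same logic.

```python
import math

def two_sided_p_value(sample_mean, mu_0, sigma, n):
    """z statistic and two-sided p-value for H0: true mean = mu_0.

    Assumes normally distributed errors with known standard deviation sigma.
    """
    # test statistic: distance of the sample mean from mu_0 in standard errors
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
    # probability of |z| at least this large if H0 is true
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# hypothetical case: H0 says the snake is 1.50 m long; 16 measurements,
# each with an error of 0.05 m, give a mean of 1.53 m
z, p = two_sided_p_value(sample_mean=1.53, mu_0=1.50, sigma=0.05, n=16)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small $p$-value means the observed mean would be unlikely if $H_0$ were true. It is not the probability that $H_0$ itself is true.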

***
## Worked example of Correlation:


```python
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randn

# the line below is a Jupyter magic that makes the plot appear in the notebook
%matplotlib inline

def cov(x, y, n):
    """Sample covariance of x and y, each containing n points."""
    x_hat = np.mean(x)
    y_hat = np.mean(y)
    return np.sum((x - x_hat) * (y - y_hat)) / (n - 1)

# let's generate some random data: data2 is data1 plus extra scatter
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
n = len(data1)

# let's work out the mean of the data
xhat = np.mean(data1)
yhat = np.mean(data2)

print('the mean of x is {:.2f}'.format(xhat))
print('the mean of y is {:.2f}'.format(yhat))

# covariance between the datasets
covar = cov(data1, data2, n)
print('the covariance between x and y is {:.2f}'.format(covar))

# scatter plot of one dataset against the other
plt.scatter(data1, data2)
plt.xlabel('data1')
plt.ylabel('data2')
plt.show()
```

As expected (we created the fake data ourselves), the data look to be highly correlated. Now it's not too much more work to calculate the linear correlation coefficient $r$. Here we will see how to do this using the inbuilt python function from the `scipy.stats` package. Many of the things we'll do in the course have inbuilt routines in python but part of the coursework will see you doing it from scratch to check understanding.

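```python
from scipy.stats import pearsonr

# pearsonr returns the correlation coefficient and a p-value (ignored here);
# data1 and data2 are the arrays created in the previous cell
corr, _ = pearsonr(data1, data2)
print("Pearson's correlation is: {:.3f}".format(corr))
```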

We know from our notes that this value of $r$ indicates that the data is strongly correlated. 

***

Now you are ready to tackle the **Chapter 4 quiz** on Learning Central and the [Chapter 4 yourturn notebook](https://github.com/haleygomez/Data-Analysis-2021/blob/master/blended_exercises/Chapter%204/Chapter4_yourturn.ipynb).