In [None]:
# In Python it is standard practice to import the modules we need at the very top of our scripts
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Calculating Means, Standard Deviations and Standard Errors

N.B. If you don't remember the definitions of 'sample', 'population', 'mean', 'standard deviation' and 'standard error', you can remind yourself by looking at your Mathematical Techniques notes.

An important aspect of data handling is to evaluate statistical functions, including the mean, standard deviation and standard error. 
Use the NumPy documentation to find functions to calculate the mean and standard deviation.
Remember that for a *sample* distribution the definition of the standard deviation is 
\begin{equation}
 \sigma_{N-1}=\frac{\sqrt{\sum_i (x_i-\bar{x})^2}}{\sqrt{N-1}},
\end{equation}
whilst the standard deviation for the hypothetical *population* distribution is
\begin{equation}
  \sigma_N = \frac{\sqrt{\sum_i (x_i-\bar{x})^2}}{\sqrt{N}}.
\end{equation}
Fitting $N$ data points to the hypothetical population distribution with a known value for the mean gives us $N$ degrees of freedom.
Evaluating the mean from the sample reduces the degrees of freedom to $N-1$.

Which of these two does NumPy use by default?
Can you pass an optional parameter to the function to switch between these two cases?
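One way to check this empirically, as a minimal sketch: evaluate both formulas above by hand on a small made-up array and compare the results with what `np.std` returns. The array `x` and the variable names are illustrative only; `ddof` ("delta degrees of freedom") is the optional parameter documented for `np.std`, which makes the function divide by $N - \mathrm{ddof}$.

```python
import numpy as np

# Small made-up sample, purely for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

# The two definitions from the text, written out explicitly
sum_sq = np.sum((x - np.mean(x)) ** 2)
sigma_N = np.sqrt(sum_sq / N)                # divides by N
sigma_N_minus_1 = np.sqrt(sum_sq / (N - 1))  # divides by N - 1

print(f"by hand: sigma_N = {sigma_N:.4f}, sigma_N-1 = {sigma_N_minus_1:.4f}")
print(f"np.std(x)         = {np.std(x):.4f}")
print(f"np.std(x, ddof=1) = {np.std(x, ddof=1):.4f}")
```

Comparing the printed lines shows which of the two definitions each call corresponds to.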

The formula for the standard error is
\begin{equation}
 \alpha=\frac{\sigma_{N-1}}{\sqrt{N}},
\end{equation}
where $N$ is the number of data points.
Use the SciPy documentation to find a function to calculate the standard error.
Do you need to bear in mind the difference between sample distributions and population distributions?
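As a hedged sketch of how to test this, the formula for $\alpha$ can be evaluated by hand and compared with `scipy.stats.sem`, which also accepts a `ddof` parameter. The array `x` is made up for illustration and the variable names are not part of the course material.

```python
import numpy as np
from scipy import stats

# Small made-up sample, purely for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

# Standard error from the formula in the text: sigma_{N-1} / sqrt(N)
alpha_by_hand = np.std(x, ddof=1) / np.sqrt(N)

# SciPy's standard error of the mean, with and without an explicit ddof
alpha_default = stats.sem(x)
alpha_ddof0 = stats.sem(x, ddof=0)

print(f"by hand (sigma_N-1 / sqrt(N)): {alpha_by_hand:.4f}")
print(f"stats.sem(x):                  {alpha_default:.4f}")
print(f"stats.sem(x, ddof=0):          {alpha_ddof0:.4f}")
```

Note which SciPy call reproduces the hand calculation, and compare its default with the default of `np.std`.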

In the Markdown cell below, add some comments to yourself on how to use these three functions and how to ensure you get the sample and not population statistics.

**Write your notes here**



# Exercise 9: Calculating Statistical Metrics (5 Marks)

In this exercise you will practise importing tables of experimental data into Python and analysing them.
There are two datasets available in your Jupyter Hub folder: `poisson-data.csv` and `gaussian-data.csv`.
Your first task is to import them into your notebook.

These files are both 'comma separated values' files: each contains an array of data in which rows are separated by new lines and columns are (usually) separated by commas.

There are several approaches and functions you could use to read and write `.csv` files; for this course we're going to use `np.loadtxt()` and `np.savetxt()`.
Read the documentation for `np.loadtxt()`.
Which parameters will you need to set to read a file where values are separated by commas?
Load `poisson-data.csv` into your notebook and assign it to a variable; repeat for `gaussian-data.csv`.
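The sketch below illustrates the general pattern on a stand-in for a `.csv` file rather than on the course data. The text in `csv_text` is made up, and `io.StringIO` is used only so that the example runs without a file on disk; `np.loadtxt` accepts a filename in the same position.

```python
import io
import numpy as np

# Stand-in for a small .csv file; the values are made up
csv_text = "1.0,10.0\n2.0,20.0\n3.0,30.0\n"

# The delimiter parameter tells np.loadtxt how columns are separated
data = np.loadtxt(io.StringIO(csv_text), delimiter=",")

print(data.shape)   # (number of rows, number of columns)
print(data[:, 0])   # first column
print(data[:, 1])   # second column
```

If a file begins with a header line, the `skiprows` parameter can be used to skip it; whether the course files have a header is something to check by opening them.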

In [None]:
# Load data here



## Poisson Distribution

The file `poisson-data.csv` contains two datasets, each following a Poisson distribution.

1) Plot a histogram of each of the two Poisson datasets using the `plt.hist()` function, with each histogram in its own subplot of a single figure. Consult the Matplotlib documentation and examples on `plt.hist()`. Vary the number of bins and choose a number that neither oversamples the data (giving noisy, scattered histograms) nor undersamples it (losing detail).

2) For both datasets, determine the mean, standard deviation and standard error, and also calculate the square root of the mean. Display your results using the `print()` function; be sure to format your results. What do you notice about the value of the square root of the mean (comment in the Markdown cell)?
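As a rough sketch of the plotting pattern in question 1, the example below draws two histograms side by side on synthetic data. The arrays `set_a` and `set_b`, the seed and the Poisson means are all made up for illustration. It uses `ax.hist()`, the per-axes equivalent of `plt.hist()`, and the choice of one bin per integer is only one possible starting point.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic integer-valued data standing in for the two datasets
rng = np.random.default_rng(seed=0)
set_a = rng.poisson(lam=4.0, size=500)
set_b = rng.poisson(lam=20.0, size=500)

# One figure with two subplots arranged in a single row
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, counts, label in zip(axes, (set_a, set_b), ("set A", "set B")):
    # Bin edges at half-integers, so each integer value gets its own bin
    bin_edges = np.arange(counts.min(), counts.max() + 2) - 0.5
    ax.hist(counts, bins=bin_edges)
    ax.set_xlabel("Counts")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Synthetic Poisson data, {label}")
fig.tight_layout()
```

The `bins` argument also accepts a plain integer, which is the simplest way to experiment with coarser or finer binning.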

In [None]:
# Put your answer to question 1 here



In [None]:
# Put your answer to question 2 here



**What do you notice about sqrt(mean)?**



## Gaussian Distribution

The file `gaussian-data.csv` contains five datasets, each assumed to be a (small) subset of normally distributed (Gaussian) data.

3) Use a `for` loop to calculate and print the mean, standard deviation and standard error for all 5 sets of data. For loops are a simple way to do repetitive calculations without writing them out for every iteration; you should read the Python documentation [here](https://docs.python.org/3/tutorial/controlflow.html#for-statements) to familiarise yourself with these loops. Are the values obtained consistent between the datasets (comment in the Markdown cell)? Remember, data are usually considered consistent if they agree to within 3 standard deviations.

An example for loop, which just prints each dataset, could be:

```python
for dataset in np.arange(5):
    print(gaussian[:,dataset])
```
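A slightly extended sketch of the same loop pattern shows how a result can be stored and printed with formatting on each pass. The array `demo` is synthetic and merely stands in for loaded data. The median is used only as a placeholder quantity; the statistics asked for in question 3 would take its place.

```python
import numpy as np

# Synthetic stand-in: 20 rows, 5 columns (one column per dataset)
rng = np.random.default_rng(seed=1)
demo = rng.normal(loc=10.0, scale=2.0, size=(20, 5))

n_sets = demo.shape[1]        # number of columns
medians = np.zeros(n_sets)    # one stored result per dataset

for i in range(n_sets):
    column = demo[:, i]       # all rows of column i
    medians[i] = np.median(column)
    # f-string format specifiers: d for integers, .3f for three decimals
    print(f"Dataset {i + 1}: N = {column.size:d}, median = {medians[i]:.3f}")
```

Storing the results in an array, rather than only printing them, makes it easier to compare the datasets afterwards.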

In [None]:
# Put your answer to question 3 here



**Are the datasets consistent with each other?**

