# Statistics
This notebook is part of a collection of supplementary material designed to bring students up to speed on the mathematics required for COMP47750 Mathematics with Python.  
This notebook introduces covariance matrices and data normalisation.  
This material is covered in the lecture **M1 Statistics**.   
  
You may need to install `seaborn` to run this notebook.  
To install it, run `pip install seaborn` from the command line.

In [1]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
# Load the Penguins dataset
penguins_all = pd.read_csv('penguins_af.csv')

Reduce the dataset to just four descriptive features and the class label `species`.

In [None]:
label_4f = ['bill_length_mm', 'bill_depth_mm','flipper_length_mm', 'body_mass_g', 'species']
penguins_4f = penguins_all[label_4f]

In [None]:
penguins_4f['species'].value_counts()

### Seaborn Pairplot
This provides a good overview of the relationships between the features.  
The plots on the diagonal show the feature distributions.  
The off-diagonal plots give insight into the correlations between pairs of features.

In [None]:
sns.pairplot(penguins_4f, hue="species")

Extract a dataframe with just the `Gentoo` samples. 

In [None]:
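# .copy() gives an independent dataframe, so later in-place changes do not trigger a SettingWithCopyWarning
Gentoo_df = penguins_4f[penguins_4f['species'] == 'Gentoo'].copy()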

In [None]:
Gentoo_df.head()

In [None]:
sns.pairplot(Gentoo_df, diag_kind = 'kde')

Compute the covariance matrix for the `Gentoo` class.

In [None]:
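# numeric_only=True skips the non-numeric 'species' column (pandas 2.x raises an error otherwise)
Gentoo_df.cov(numeric_only=True).round(decimals=2)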
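
To make the covariance calculation concrete, here is a minimal sketch that computes a sample covariance matrix by hand and compares it with `DataFrame.cov()`. It uses synthetic stand-in data rather than the penguins file, and the names `demo_df`, `centred` and `manual_cov` are illustrative only. Each entry is the sum of products of deviations from the column means, divided by n - 1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the penguins measurements): two correlated columns.
rng = np.random.default_rng(0)
x = rng.normal(loc=47.0, scale=3.0, size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)
demo_df = pd.DataFrame({'x': x, 'y': y})

# Centre each column on its mean.
centred = demo_df - demo_df.mean()
n = len(demo_df)

# Sample covariance: products of deviations, summed over rows, divided by n - 1.
manual_cov = centred.T @ centred / (n - 1)

# pandas uses the same n - 1 denominator by default.
pandas_cov = demo_df.cov()

print(manual_cov.round(2))
print(pandas_cov.round(2))
```

The diagonal entries are the variances of the individual columns, which is why features measured on large scales (such as body mass in grams) dominate an unnormalised covariance matrix.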

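Normalise the data to get a clearer picture.  
The four features are standardised to mean = 0 and standard deviation = 1, so the covariance matrix of the normalised data is the correlation matrix, with every entry between -1 and 1.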

In [None]:
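# Drop the class label without modifying Gentoo_df, so this cell can be re-run safely
Gentoo_features = Gentoo_df.drop(columns='species')
# z-score each feature: subtract the column mean, divide by the sample standard deviation
Gentoo_dfN = (Gentoo_features - Gentoo_features.mean()) / Gentoo_features.std()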

In [None]:
Gentoo_dfN.cov().round(decimals = 2)
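
As a quick check on what normalisation does to the covariance matrix, the sketch below standardises a small synthetic dataset and compares the result with the correlation matrix of the raw data. The data and the names `demo_df` and `demo_dfN` are assumptions made for illustration, not the penguins measurements.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the penguins measurements), on very different scales.
rng = np.random.default_rng(1)
a = rng.normal(loc=200.0, scale=15.0, size=300)
b = 20.0 * a + rng.normal(scale=400.0, size=300)
demo_df = pd.DataFrame({'a': a, 'b': b})

# Standardise each column to mean 0 and (sample) standard deviation 1.
demo_dfN = (demo_df - demo_df.mean()) / demo_df.std()

# Covariance of the standardised data versus correlation of the raw data.
cov_of_standardised = demo_dfN.cov()
corr_of_raw = demo_df.corr()

print(cov_of_standardised.round(2))
print(corr_of_raw.round(2))
```

The two matrices agree, with ones on the diagonal, so the off-diagonal entries can be compared directly regardless of the original units.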

### Law of Large Numbers 
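Estimating means: draw many random samples of 5 normalised `body_mass_g` values and record the mean of each.  
The sample means should cluster around 0, the mean of the normalised feature.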

In [None]:
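n_trials = 100000   # number of repeated samples
sample_size = 5     # observations per sample
body_mass = Gentoo_dfN['body_mass_g']   # select the column once, outside the loop
sample_means = []
for trial in range(n_trials):
    # Mean of one random sample, drawn without replacement
    sample_means.append(body_mass.sample(sample_size).values.mean())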

In [None]:
sns.displot(sample_means, kind = 'kde')
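
The effect of sample size can be sketched by repeating the experiment for several sample sizes. The example below is an illustration under stated assumptions: it uses a synthetic standard normal population instead of the penguins data, and it samples with replacement using NumPy, whereas the loop above uses `sample` without replacement. The names `population` and `spread` are hypothetical.

```python
import numpy as np

# Synthetic population with mean 0 and standard deviation close to 1
# (an assumption for illustration, NOT the penguins data).
rng = np.random.default_rng(2)
population = rng.normal(loc=0.0, scale=1.0, size=10_000)

n_trials = 20_000
spread = {}
for sample_size in (5, 20, 80):
    # Each row is one sample of `sample_size` values, drawn with replacement.
    samples = rng.choice(population, size=(n_trials, sample_size), replace=True)
    sample_means = samples.mean(axis=1)
    # Standard deviation of the sample means across all trials.
    spread[sample_size] = sample_means.std()
    expected = population.std() / np.sqrt(sample_size)
    print(f"n = {sample_size:3d}  spread of sample means = {spread[sample_size]:.3f}  "
          f"sigma / sqrt(n) = {expected:.3f}")
```

Larger samples give sample means that sit closer to the population mean, with a spread that shrinks roughly as the population standard deviation divided by the square root of the sample size.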