<img src="images/Project_logos.png" width="500" height="300" align="center">

## Descriptive Statistics

Descriptive statistics describe and summarise datasets and can be:

- quantitative (statistical analysis of the data)
- qualitative (visualisations of the data)

Prior knowledge of Python, NumPy, Pandas, Iris, and Matplotlib is assumed for this course.

## Aims

This course will teach you about commonly used statistical terminology and Python libraries for the following measures of descriptive statistics:

  - central tendency, the 'average' values for the data
  - variability, the spread of the data
  - data visualisation

## Table of Contents

* [Glossary of Statistical Terminology](#glossary)
* [Central Tendency](#central_tendency)
* [Exercise 1](#exercise_1)
* [Variability](#variability)
* [Exercise 2](#exercise_2)
* [Visualisation](#visualisation)

## Glossary of Statistical Terminology<a class="anchor" id="glossary"></a>

- **Population:** the set of all elements in the dataset
- **Sample:** a subset of a population
- **Outlier:** a data point that is significantly different from the rest of the sample or population
- **Mean:** the arithmetic average of all the elements in the dataset (the sum of all the elements divided by the number of elements)
- **Median:** the middle element in a sorted (ascending or descending) dataset
- **Mode:** the most frequently occurring value in the dataset
- **Weighted Mean:** the arithmetic average of all the elements in the dataset multiplied by a weighting factor for each element (the sum of all the elements multiplied by their respective weights, divided by the sum of all the weights)
- **Variance:** a measure showing how far the dataset elements are from the mean
- **Standard Deviation:** a measure showing how far the dataset elements are from the mean, in the same units as the dataset elements
- **Skewness:** a measure of the asymmetry in the dataset
- **Kurtosis:** a measure of tailedness; how much of the data lies in the tails of the distribution, how often outliers occur.
- **Percentile:** the value in the dataset below which a certain percentage of the data falls
- **Interquartile Range:** the difference between the 75th and 25th percentiles

## Central Tendency<a class="anchor" id="central_tendency"></a>

Measures of central tendency estimate the most typical value in a dataset. There are several definitions; here we cover the most common.

The sample **mean** is the arithmetic average of all the elements in the dataset (the sum of all the elements divided by the number of elements).

The sample **median** is the middle element in a sorted (ascending or descending) dataset. The mean is heavily affected by outliers, but the median is affected by outliers only slightly or not at all.
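
As a rough illustration of this difference, the sketch below adds one extreme value to a small made-up sample and compares how far the mean and the median move. The sample and the value 100 are hypothetical and chosen only for illustration.

```python
import statistics

sample = [2, 5, 4, 4, 5, 1]
with_outlier = sample + [100]  # hypothetical extreme value added to the sample

mean_before = statistics.mean(sample)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(sample)
median_after = statistics.median(with_outlier)

# The mean shifts substantially, while the median stays where it was
print(f'Mean: {mean_before} -> {mean_after}')
print(f'Median: {median_before} -> {median_after}')
```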

The sample **mode** is the most frequently occurring value in the dataset. Where more than one value occurs with the same highest frequency, the dataset is multimodal; depending on the function used, either all of the modes or only one of them will be returned.

The sample **weighted mean** is the arithmetic average of all the elements in the dataset multiplied by a weighting factor for each element (the sum of all the elements multiplied by their respective weights, divided by the sum of all the weights). The weights could represent the frequency of the elements, the area covered by each data element, or a user-defined weighting according to importance of different factors in the data (e.g. the average uncertainty for temperature measurements could be based on the number of observations at a given location, the length of time of the total record, and the age of the observing instrumentation and each could be assigned a different weight).
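
For the case where the weights are frequencies, the following minimal sketch shows that the weighted mean matches the ordinary mean of the expanded dataset. The values and counts are made up for illustration.

```python
import statistics

values = [0, 1, 2, 3]
counts = [2, 4, 3, 1]  # hypothetical frequencies, used here as the weights

# Sum of (value * weight) divided by the sum of the weights
weighted_mean = sum(v * w for v, w in zip(values, counts)) / sum(counts)

# Repeating each value 'count' times and taking the plain mean gives the same result
expanded = [v for v, w in zip(values, counts) for _ in range(w)]
plain_mean = statistics.mean(expanded)

print(f'Weighted mean = {weighted_mean}')
print(f'Mean of expanded data = {plain_mean}')
```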

The measures of central tendency can all be calculated using many Python libraries, including the built-in library `statistics`, and the libraries `NumPy`, `Pandas` and `Iris`, depending on the type of data being analysed. Note that for calculating the median or the mode for some datatypes, a dedicated statistical library will need to be used, such as `SciPy`.

Note that you will need to be careful which method is chosen if there are NaNs in the dataset because, for some libraries, NaN will be returned if the sample contains NaNs.
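
The short sketch below illustrates this behaviour with a made-up sample containing one NaN. It assumes NumPy and Pandas are installed; the NumPy functions propagate the NaN unless the `nan`-prefixed variants are used, whereas Pandas skips NaNs by default.

```python
import numpy as np
import pandas as pd

sample = [2.0, 5.0, np.nan, 4.0]  # hypothetical sample with a missing value

mean_with_nan = np.mean(sample)      # NaN propagates into the result
mean_ignoring_nan = np.nanmean(sample)
median_with_nan = np.median(sample)  # NaN propagates here as well
median_ignoring_nan = np.nanmedian(sample)

# Pandas excludes NaNs by default (skipna=True)
pandas_mean = pd.Series(sample).mean()

print(f'np.mean = {mean_with_nan}, np.nanmean = {mean_ignoring_nan}')
print(f'np.median = {median_with_nan}, np.nanmedian = {median_ignoring_nan}')
print(f'Pandas mean = {pandas_mean}')
```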

### Examples using the statistics library ###

In [None]:
import statistics

'''Mean'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
mean = statistics.mean(sample)
print(f'Mean = {mean}')

'''Median'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
median = statistics.median(sample)
print(f'Median = {median}')

'''Mode'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
multimode = statistics.multimode(sample)
print(f'Mode = {multimode}')
mode = statistics.mode(sample)
print(f'Mode = {mode}. Note that statistics.mode returns only the first mode encountered in the data')

### Examples using the NumPy library ###

In [None]:
import numpy as np

'''Mean'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
mean = np.mean(sample)
print(f'Mean = {mean}')

# if the sample is a NumPy array you can also use:
sample = np.array([2, 5, 4, 23, 4, 5, 15, 1])
mean = sample.mean()
print(f'Mean using arrays = {mean}')

# To ignore NaNs use np.nanmean()

'''Median'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
median = np.median(sample)
print(f'Median = {median}')

'''Mode'''
import scipy.stats
sample = np.array([2, 5, 4, 23, 4, 5, 15, 1])
mode = scipy.stats.mode(sample)
print(f'Mode = {mode}. Note that here only the smallest value for the mode is returned')


'''Weighted Mean'''
sample = np.array([2, 5, 4, 23, 4, 5, 15, 1])
weights = np.array([10, 20, 50, 1, 4, 5, 8, 2])
mean = np.average(sample, weights=weights)
print(f'Weighted Mean = {mean}')


### Examples using the Pandas library ###

In [None]:
import pandas as pd

'''Mean'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
sample_df = pd.DataFrame(sample)
mean = sample_df.mean().values
print(f'Mean = {mean}')


'''Median'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
sample_df = pd.DataFrame(sample)
median = sample_df.median().values
print(f'Median = {median}')


'''Mode'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
sample_df = pd.DataFrame(sample)
mode = sample_df.mode().values
print(f'Mode = {mode[0]} and {mode[1]}. Note that two values are returned because 4 and 5 occur with the same frequency')


'''Weighted Mean'''
sample_df = pd.DataFrame({'Value':[2, 5, 4, 23, 4, 5, 15, 1],
                          'Weight':[10, 20, 50, 1, 4, 5, 8, 2]})
mean = (sample_df.Value * sample_df.Weight).sum() / sample_df.Weight.sum()
print(f'Weighted Mean = {mean}')

### Examples using the Iris library ###

In [None]:
import numpy as np
import iris.analysis
import iris.coords
import iris.cube

'''Mean'''
sample = [2, 5, 4, 23, 4, 5, 15, 1]
cube = iris.cube.Cube(np.zeros((8), np.int8))
cube.data = sample
mean = cube.data.mean()
print(f'Mean = {mean}')


'''Median'''
latitude = iris.coords.DimCoord(np.arange(-90, 90,90), standard_name='latitude', units='degrees')
longitude = iris.coords.DimCoord(np.arange(0, 360,90), standard_name='longitude', units='degrees')
cube = iris.cube.Cube(np.zeros((2, 4), np.float32), dim_coords_and_dims=[(latitude, 0), (longitude, 1)])
cube.coord('latitude').guess_bounds()
cube.coord('longitude').guess_bounds()
sample = [2, 5, 4, 23], [4, 5, 15, 1]
cube.data = sample
median_cube = cube.collapsed(['longitude', 'latitude'], iris.analysis.MEDIAN)
median = median_cube.data
print(f'Median = {median}')


'''Weighted Mean'''
# This example is using the area of the grid cell as the weight
import iris.analysis.cartography

latitude = iris.coords.DimCoord(np.arange(-90, 90,90), standard_name='latitude', units='degrees')
longitude = iris.coords.DimCoord(np.arange(0, 360,90), standard_name='longitude', units='degrees')
cube = iris.cube.Cube(np.zeros((2, 4), np.float32), dim_coords_and_dims=[(latitude, 0), (longitude, 1)])
cube.coord('latitude').guess_bounds()
cube.coord('longitude').guess_bounds()
sample = [2, 5, 4, 23], [4, 5, 15, 1]
cube.data = sample
grid_areas = iris.analysis.cartography.area_weights(cube)
mean_cube = cube.collapsed(['longitude', 'latitude'], iris.analysis.MEAN, weights=grid_areas)
mean = mean_cube.data
print(f'Weighted Mean = {mean}')

## Exercise 1<a class="anchor" id="exercise_1"></a>

Using any Python method, calculate the mean, median and mode for the following dataset containing a timeseries of the number of raindays per summer:

1, 1, 0, 1, 2, 2, 0, 0, 0, 3, 3, 0, 3, 3, 0, 2, 2, 2, 1, 1, 4, 1, 1, 0, 3, 0, 0, 0, 1, 1, 2, 2, 2, 2, 1, 1, 1, 
1, 4, 4, 4, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 3, 3, 0, 3, 3, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 3, 
3, 3, 2, 3, 3, 1, 1, 1, 2, 2, 2, 4, 5, 5, 4, 4, 1, 1, 1, 4, 1, 1, 1, 3, 3, 5, 3, 3, 3, 2, 3, 3, 0, 0, 0, 0, 3, 
3, 3, 3, 3, 3, 0, 2, 2, 2, 2, 1, 1, 1, 3, 1, 0, 0, 0, 1, 1, 3, 1, 1, 1, 2, 2, 2, 4, 2, 2, 2, 1, 1, 1, 1, 0, 0, 
2, 2, 3, 3, 2, 2, 3, 2, 0, 0, 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 0, 1, 1, 1, 3, 1, 1, 1, 2, 
2, 2, 1, 1, 1, 2, 1, 1, 1, 3, 3, 5, 3, 3, 1, 1, 1, 3, 3, 3, 3, 1, 1, 1, 4, 1, 1, 4, 4, 4, 4, 4, 4, 1, 1, 1, 2,
2, 5, 5, 2, 3, 3, 4, 4, 3, 2, 2, 2, 1, 5, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 0, 1, 1, 1, 3, 3, 3, 3, 3

In [None]:
import 

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
**Solution**

<font color='red'>**NOTE**</font>: Your methods can include any Python library

Mean = 1.84,
Median = 2,
Mode = 1,

## Variability<a class="anchor" id="variability"></a>
Measures of variability quantify the amount of spread in a dataset.

The sample **variance** is a measure showing how far the dataset elements are from the mean. Note that two datasets with a small and large variance, respectively, can have the same mean and median even though they can look quite different.
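
A small sketch of this point, using two made-up datasets that share the same mean and median but have very different spreads:

```python
import statistics

narrow = [4, 5, 5, 5, 6]  # hypothetical data clustered around 5
wide = [1, 3, 5, 7, 9]    # hypothetical data spread further from 5

for name, data in [('narrow', narrow), ('wide', wide)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    variance = statistics.variance(data)  # sample variance (divides by n - 1)
    print(f'{name}: mean = {mean}, median = {median}, variance = {variance}')
```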

The sample **standard deviation** is another measure showing how far the dataset elements are from the mean but, unlike the variance, the standard deviation has the same units as the dataset elements, making it more intuitive.

The sample **skewness** is a measure of the asymmetry in the dataset. Negative skewness values indicate that the dataset is skewed (has a longer tail) to the left of the mean and positive skewness values indicate that the dataset is skewed to the right of the mean. If the skewness value is close to 0, the dataset is considered to be symmetrical.
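
The sign convention can be checked with a minimal sketch using `scipy.stats.skew` on three made-up datasets: one with a long right tail, its mirror image, and a symmetric one.

```python
import scipy.stats

right_skewed = [1, 1, 2, 2, 3, 10]              # hypothetical data with a long right tail
left_skewed = [11 - x for x in right_skewed]    # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

skew_right = scipy.stats.skew(right_skewed, bias=False)
skew_left = scipy.stats.skew(left_skewed, bias=False)
skew_symmetric = scipy.stats.skew(symmetric, bias=False)

print(f'Right-skewed data: skewness = {skew_right}')   # positive
print(f'Left-skewed data: skewness = {skew_left}')     # negative
print(f'Symmetric data: skewness = {skew_symmetric}')  # zero
```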

The sample **kurtosis** is a measure of tailedness; how much of the data lies in the tails of the distribution, which indicates how often extreme values occur.
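
As a rough sketch, the example below compares two made-up datasets using `scipy.stats.kurtosis`. By default SciPy returns the excess kurtosis (Fisher's definition), for which a normal distribution scores 0, so flat data give negative values and data with occasional extreme values give positive values.

```python
import scipy.stats

flat = list(range(1, 11))            # hypothetical evenly spread data, no extreme values
heavy_tailed = [5] * 8 + [0, 10]     # hypothetical data: mostly identical, two extremes

kurtosis_flat = scipy.stats.kurtosis(flat)
kurtosis_heavy = scipy.stats.kurtosis(heavy_tailed)

print(f'Flat data: excess kurtosis = {kurtosis_flat}')
print(f'Heavy-tailed data: excess kurtosis = {kurtosis_heavy}')
```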

The 10th **percentile** is the value in the dataset below which 10% of the data falls, the 20th percentile is the value in the dataset below which 20% of the data falls. The median is actually the 50th percentile because it is in the middle and so 50% of the data is below it. 

The difference between the 75th and 25th percentiles is called the **interquartile range**, which is a measure of the dispersion in the data and is often used to identify outliers (elements in the data that fall below the 25th percentile minus 1.5 times the interquartile range or above the 75th percentile plus 1.5 times the interquartile range).
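
The outlier rule described above can be sketched with NumPy as follows. The variable names are illustrative, and the quartile values depend on the interpolation method (NumPy's default linear interpolation is assumed here), so other libraries may place the limits slightly differently.

```python
import numpy as np

sample = np.array([2, 5, 4, 23, 4, 5, 15, 1])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

# Limits beyond which an element is treated as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = sample[(sample < lower_limit) | (sample > upper_limit)]
print(f'Limits = ({lower_limit}, {upper_limit})')
print(f'Outliers = {sorted(outliers.tolist())}')  # [15, 23]
```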

The measures of variability can all be calculated using many Python libraries, including the built-in library `statistics`, and the libraries `NumPy`, `Pandas` and `Iris`, depending on the type of data being analysed. Note that for calculating the skewness for some datatypes, a dedicated statistical library will need to be used, such as `SciPy`.

Note that you will need to be careful which method is chosen if there are NaNs in the dataset because, for some libraries, NaN will be returned if the sample contains NaNs.
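
For the measures of variability, a brief sketch with a made-up sample containing one NaN shows the difference between the standard NumPy functions and their NaN-aware counterparts:

```python
import numpy as np

sample = [2.0, 5.0, np.nan, 4.0, 23.0]  # hypothetical sample with a missing value

variance_with_nan = np.var(sample, ddof=1)         # NaN propagates into the result
variance_ignoring_nan = np.nanvar(sample, ddof=1)  # computed from the 4 valid elements
percentile75_ignoring_nan = np.nanpercentile(sample, 75)

print(f'np.var = {variance_with_nan}, np.nanvar = {variance_ignoring_nan}')
print(f'75th percentile ignoring NaNs = {percentile75_ignoring_nan}')
```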

### Examples using the statistics library ###

In [None]:
import statistics
import scipy.stats

sample = [2, 5, 4, 23, 4, 5, 15, 1]

'''Variance'''
variance = statistics.variance(sample)
print(f'Variance = {variance}')

'''Standard Deviation'''
stdev = statistics.stdev(sample)
print(f'Standard Deviation = {stdev}')

'''Skewness'''
skewness = scipy.stats.skew(sample, bias=False)
print(f'Skewness = {skewness}')

'''Kurtosis'''
kurtosis = scipy.stats.kurtosis(sample, bias=False)
print(f'Kurtosis = {kurtosis}')

'''Percentiles'''
percentile50th = statistics.quantiles(sample, n=2)[0] # Note this is the same as the median
print(f'The 50th percentile = {percentile50th}')
percentile25th = statistics.quantiles(sample, n=4, method='inclusive')[0]
print(f'The 25th percentile = {percentile25th}')
percentile75th = statistics.quantiles(sample, n=4, method='inclusive')[-1]
print(f'The 75th percentile = {percentile75th}')

quartiles = statistics.quantiles(sample, n=4, method='inclusive')
print(f'The quartiles = {quartiles}')

'''Interquartile range'''
iqr = percentile75th - percentile25th
print(f'The interquartile range = {iqr}')

### Examples using the NumPy library ###

In [None]:
import numpy as np
import scipy.stats

# NOTE: set ddof (delta degrees of freedom) to 1 to get the sample variance, which divides by n - 1
# because the mean is itself estimated from the sample. NumPy's default, ddof=0, gives the population variance.

sample = [2, 5, 4, 23, 4, 5, 15, 1]

'''Variance'''
variance = np.var(sample, ddof=1)
print(f'Variance = {variance}')

# if the sample is a NumPy array you can also use:
sample = np.array(sample)
variance = sample.var(ddof=1)
print(f'Variance using arrays = {variance}')

# To ignore NaNs use np.nanvar()


'''Standard Deviation'''
stdev = sample.std(ddof=1)
print(f'Standard Deviation = {stdev}')

'''Skewness'''
skewness = scipy.stats.skew(sample, bias=False)
print(f'Skewness = {skewness}')

'''Kurtosis'''
kurtosis = scipy.stats.kurtosis(sample, bias=False)
print(f'Kurtosis = {kurtosis}')

'''Percentiles'''
percentile50th = np.percentile(sample, 50) # Note this is the same as the median
print(f'The 50th percentile = {percentile50th}')
percentile25th = np.percentile(sample, 25)
print(f'The 25th percentile = {percentile25th}')
percentile75th = np.percentile(sample, 75)
print(f'The 75th percentile = {percentile75th}')

# or

percentile25th = np.quantile(sample, 0.25)
print(f'The 25th percentile = {percentile25th}')
percentile75th = np.quantile(sample, 0.75)
print(f'The 75th percentile = {percentile75th}')

# To ignore NaNs use np.nanpercentile()

quartiles = np.quantile(sample, [0.25, 0.5, 0.75])
print(f'The quartiles = {quartiles}')

'''Interquartile range'''
iqr = percentile75th - percentile25th
print(f'The interquartile range = {iqr}')
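
To see what the `ddof=1` setting used in the cell above changes, the sketch below compares NumPy's results with the built-in `statistics` library, which provides separate functions for the population variance (`pvariance`) and the sample variance (`variance`).

```python
import statistics
import numpy as np

sample = [2, 5, 4, 23, 4, 5, 15, 1]

population_variance = np.var(sample)      # NumPy default ddof=0: divides by n
sample_variance = np.var(sample, ddof=1)  # divides by n - 1

print(f'ddof=0: {population_variance} (statistics.pvariance = {statistics.pvariance(sample)})')
print(f'ddof=1: {sample_variance} (statistics.variance = {statistics.variance(sample)})')
```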

### Examples using the Pandas library ###

In [None]:
import pandas as pd

sample = [2, 5, 4, 23, 4, 5, 15, 1]
sample_df = pd.DataFrame(sample)

'''Variance'''
variance = sample_df.var().values
print(f'Variance = {variance}')

'''Standard Deviation'''
stdev = sample_df.std().values
print(f'Standard Deviation = {stdev}')

'''Skewness'''
skewness = sample_df.skew().values
print(f'Skewness = {skewness}')

'''Kurtosis'''
kurtosis = sample_df.kurtosis().values
print(f'Kurtosis = {kurtosis}')

'''Percentiles'''
percentile50th = sample_df.quantile(0.5).values # Note this is the same as the median
print(f'The 50th percentile = {percentile50th}')
percentile25th = sample_df.quantile(0.25).values
print(f'The 25th percentile = {percentile25th}')
percentile75th = sample_df.quantile(0.75).values
print(f'The 75th percentile = {percentile75th}')
quartiles = sample_df.quantile([0.25, 0.5, 0.75]) # in this case a new series is returned
print(f'The quartiles = {quartiles}')

'''Interquartile range'''
iqr = percentile75th - percentile25th
print(f'The interquartile range = {iqr}')

### Examples using the Iris library ###

In [None]:
import numpy as np
import iris
import iris.analysis
import iris.coords
import iris.cube
import scipy.stats

latitude = iris.coords.DimCoord(np.arange(-90, 90,90), standard_name='latitude', units='degrees')
longitude = iris.coords.DimCoord(np.arange(0, 360,90), standard_name='longitude', units='degrees')
cube = iris.cube.Cube(np.zeros((2, 4), np.float32), dim_coords_and_dims=[(latitude, 0), (longitude, 1)])
cube.coord('latitude').guess_bounds()
cube.coord('longitude').guess_bounds()
sample = [2, 5, 4, 23], [4, 5, 15, 1]
cube.data = sample

'''Variance'''
variance = cube.data.var(ddof=1)
print(f'Variance = {variance}')

'''Standard Deviation'''
# Using this method, we must set ddof (delta degrees of freedom) to 1.
stdev = cube.data.std(ddof=1)
print(f'Standard Deviation = {stdev}')

# Using this method, ddof (delta degrees of freedom) is 1 by default.
stdev_cube = cube.collapsed(['longitude', 'latitude'], iris.analysis.STD_DEV)
stdev = stdev_cube.data
print(f'Standard Deviation = {stdev}')

'''Skewness'''
sample = cube.data.ravel() # We must first flatten the data in the cube
skewness = scipy.stats.skew(sample, bias=False)
print(f'Skewness = {skewness}')

'''Kurtosis'''
sample = cube.data.ravel() # We must first flatten the data in the cube
kurtosis = scipy.stats.kurtosis(sample, bias=False)
print(f'Kurtosis = {kurtosis}')

'''Percentiles'''
percentile50th = np.quantile(cube.data, 0.5) # Note this is the same as the median
print(f'The 50th percentile = {percentile50th}')
percentile25th = np.percentile(cube.data, 25)
print(f'The 25th percentile = {percentile25th}')
percentile75th = np.percentile(cube.data, 75)
print(f'The 75th percentile = {percentile75th}')

quartiles = np.quantile(cube.data, [0.25, 0.5, 0.75])
print(f'The quartiles = {quartiles}')

'''Interquartile range'''
iqr = percentile75th - percentile25th
print(f'The interquartile range = {iqr}')

**SciPy and Pandas can also be used to get a quick description of the statistics in a dataset:**

In [None]:
import scipy.stats
import pandas as pd

print('Using SciPy:')
sample = [2, 5, 4, 23, 4, 5, 15, 1]
description = scipy.stats.describe(sample, bias=False)
print(f'{description}\n')

print('Using Pandas:')
sample_df = pd.DataFrame(sample)
description = sample_df.describe()
print(description)

## Exercise 2<a class="anchor" id="exercise_2"></a>

Using any Python method, identify the outliers for the following dataset containing a timeseries of the number of raindays in July per year:

1, 9, 0, 1, 2, 2, 8, 0, 0, 3, 3, 0, 3, 3, 0, 2, 2, 2, 8, 15, 4, 8, 1, 0, 3, 0, 18, 0, 12, 9, 6, 4, 2, 6, 9, 1, 1, 
9, 4, 4, 4, 1, 9, 1, 7, 2, 8, 2, 8, 2, 2, 12, 2, 1, 12, 5, 8, 1, 3, 3, 0, 3, 3, 1, 17, 6, 1, 0, 0, 1, 1, 8, 1, 15, 
3, 13, 2, 3, 3, 11, 4, 1, 9, 2, 2, 4, 5, 5, 4, 4, 1, 8, 1, 4, 1, 9, 1, 3, 3, 5, 3, 3, 3, 2, 3, 3, 0, 10, 0, 0, 13, 
3, 8, 15, 3, 3, 0, 2, 2, 2, 2, 9, 12, 1, 3, 8, 0, 0, 0, 1, 1, 3, 13, 1, 7, 2, 2, 2, 4, 2, 2, 2, 1, 8, 1, 1, 0, 10, 
8, 2, 9, 3, 2, 8, 3, 2, 0, 0, 1, 8, 8, 3, 3, 1, 9, 1, 5, 1, 2, 9, 2, 2, 1, 1, 1, 8, 0, 1, 9, 1, 3, 1, 1, 8, 2, 
2, 12, 9, 11, 9, 2, 1, 9, 1, 3, 3, 5, 3, 3, 14, 5, 1, 3, 3, 3, 3, 1, 1, 1, 4, 1, 1, 4, 4, 4, 4, 4, 4, 1, 8, 1, 2,
2, 5, 5, 2, 3, 3, 4, 4, 3, 2, 12, 2, 1, 5, 1, 20, 2, 1, 1, 1, 2, 2, 2, 2, 2, 8, 1, 0, 1, 1, 1, 3, 3, 3, 3, 3

In [None]:
import


<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
**Solution**

<font color='red'>**NOTE**</font>: Your methods can include any Python library

Outliers = [12, 13, 14, 15, 17, 18, 20]

## Visualisation<a class="anchor" id="visualisation"></a>

We can use Matplotlib to visualise the data in a statistical way by creating a histogram or a boxplot to show the distribution of the elements:

In [None]:
import matplotlib.pyplot as plt

data = [1, 9, 0, 1, 2, 2, 8, 0, 0, 3, 3, 0, 3, 3, 0, 2, 2, 2, 8, 15, 4, 8, 1, 0, 3, 0, 18, 0, 12, 9, 6, 4, 2, 6, 9, 1, 1, 
9, 4, 4, 4, 1, 9, 1, 7, 2, 8, 2, 8, 2, 2, 12, 2, 1, 12, 5, 8, 1, 3, 3, 0, 3, 3, 1, 17, 6, 1, 0, 0, 1, 1, 8, 1, 15, 
3, 13, 2, 3, 3, 11, 4, 1, 9, 2, 2, 4, 5, 5, 4, 4, 1, 8, 1, 4, 1, 9, 1, 3, 3, 5, 3, 3, 3, 2, 3, 3, 0, 10, 0, 0, 13, 
3, 8, 15, 3, 3, 0, 2, 2, 2, 2, 9, 12, 1, 3, 8, 0, 0, 0, 1, 1, 3, 13, 1, 7, 2, 2, 2, 4, 2, 2, 2, 1, 8, 1, 1, 0, 10, 
8, 2, 9, 3, 2, 8, 3, 2, 0, 0, 1, 8, 8, 3, 3, 1, 9, 1, 5, 1, 2, 9, 2, 2, 1, 1, 1, 8, 0, 1, 9, 1, 3, 1, 1, 8, 2, 
2, 12, 9, 11, 9, 2, 1, 9, 1, 3, 3, 5, 3, 3, 14, 5, 1, 3, 3, 3, 3, 1, 1, 1, 4, 1, 1, 4, 4, 4, 4, 4, 4, 1, 8, 1, 2,
2, 5, 5, 2, 3, 3, 4, 4, 3, 2, 12, 2, 1, 5, 1, 20, 2, 1, 1, 1, 2, 2, 2, 2, 2, 8, 1, 0, 1, 1, 1, 3, 3, 3, 3, 3]

plt.hist(data)

In [None]:
fig = plt.figure()
plt.boxplot(data)
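
By default, the whiskers of a Matplotlib boxplot extend to the furthest data points within 1.5 times the interquartile range of the box, and anything beyond them is drawn as an individual point. The sketch below, which uses the small sample from the earlier examples rather than the exercise data, reads those points back from the dictionary returned by `boxplot`.

```python
import matplotlib.pyplot as plt

sample = [2, 5, 4, 23, 4, 5, 15, 1]

fig, ax = plt.subplots()
box = ax.boxplot(sample)  # default whiskers: 1.5 times the interquartile range

# 'fliers' holds the points drawn beyond the whiskers, one line object per box
fliers = sorted(box['fliers'][0].get_ydata().tolist())
print(f'Points drawn as outliers = {fliers}')  # [15, 23]
```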