# Lesson 4: Descriptive Statistics and Parameters

**Python learning objectives**

1. Learn how to combine two logic statements within `.loc[]` to filter data.
2. Practice mathematical functions on *DataFrame* columns.

**What you will be able to do with these skills**

1. Calculate the percentage change, variance and standard deviation from scratch.
2. Describe a dataset with descriptive statistics and parameters.
3. Calculate the interquartile range and use it to find outliers within a dataset.


In this lesson you will learn how to calculate a range of different data descriptors, and even more importantly, you will learn how to interpret these measurements. 

Once again, in this lesson we need to import the `pandas` library.

In [None]:
import pandas as pd


**Averages**

The dataset used in this section is the number of days between British mine accidents with at least 10 fatalities, covering 06/12/1875 to 29/05/1951. [1]

Below is a histogram plot of this dataset which shows its distribution.

In [None]:
MineAccidentDF = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/TimeBetweenBritishMineAccident.csv")
MineAccidentDF.plot.hist(bins=10)

*1. Mean*

The mean of a collection of numbers is the sum of all the elements of the collection, divided by the number of elements in the collection. 

The mean can be calculated with the `.mean()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) which we use on our selected *DataFrame* column. In this case we want to find the mean of the `Time interval (days)` column in the `MineAccidentDF` *DataFrame*. Therefore, we use the following code.

In [None]:
MineAccidentDF["Time interval (days)"].mean()

Properties of the Mean:

- It does not need to be an element of the collection
- It need not be an integer even if all the elements of the collection are integers
- It is between the largest and smallest elements within the collection
- It is not necessarily halfway between the two extremes in the data. 

Note that the mean is not the "halfway point" of the data. If you look back at the histogram distribution you can prove this to yourself. 
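
A minimal sketch of the properties listed above, using a small made-up dataset (the numbers and the names `values`, `mean_value` and `midpoint` are illustrative and are not part of the lesson data):

```python
import pandas as pd

# Hypothetical toy data: whole numbers with one large value.
values = pd.Series([1, 2, 2, 3, 20])

mean_value = values.mean()                    # 28 / 5 = 5.6
midpoint = (values.min() + values.max()) / 2  # halfway between the extremes

print(mean_value)                              # not an integer
print(mean_value in values.values)             # the mean is not one of the data points
print(values.min() < mean_value < values.max())  # it lies between the extremes
print(midpoint)                                # the halfway point differs from the mean
```

Here the single large value pulls the mean upwards, but nowhere near as far as the halfway point between the smallest and largest values.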

*2. Median*

The median is the number that is in the middle of a sorted list of numbers. 
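
As a rough sketch of that definition, the median can be found by sorting the numbers and picking the middle one. The helper name `median_from_scratch` is made up for this example. When there is an even number of values there is no single middle element, so the sketch averages the two middle values, which is also the convention that pandas follows.

```python
def median_from_scratch(numbers):
    """Return the middle value of a sorted copy of `numbers`."""
    ordered = sorted(numbers)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: exactly one middle element.
        return ordered[middle]
    # Even count: average the two elements either side of the middle.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_result = median_from_scratch([7, 1, 3])      # sorted: 1, 3, 7
even_result = median_from_scratch([4, 1, 3, 2])  # sorted: 1, 2, 3, 4

print(odd_result)
print(even_result)
```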

It can be calculated with the `.median()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html) used on a *DataFrame* column.


In [None]:
MineAccidentDF["Time interval (days)"].median()

If a distribution is symmetric, its mean equals its median. If the mean does not equal the median, the distribution is skewed. Equal values are a good sign of symmetry, but not a guarantee of it.

For example, below we produce and plot a symmetrical distribution.

In [None]:
Symmetrical = pd.DataFrame(
[1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,7,7,7,8,8,9]
)
Symmetrical.plot.hist(bins=9)

In [None]:
Symmetrical.describe()  # summary statistics: count, mean, std, min, quartiles and max

In [None]:
Symmetrical[0].mean()

In [None]:
Symmetrical[0].median()

In [None]:
Symmetrical[0].median() == Symmetrical[0].mean()

As the median and the mean are equal, this distribution is symmetrical. 
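
For contrast, here is a small sketch with made-up data (the names `skewed` and `lopsided` are illustrative). A long right-hand tail pulls the mean above the median, and the `.skew()` function gives a positive value. The last lines are a reminder that equal mean and median are a useful hint of symmetry rather than a proof.

```python
import pandas as pd

# Hypothetical right-skewed data: most values are small, one is large.
skewed = pd.Series([1, 1, 2, 2, 3, 10])

print(skewed.mean())    # pulled upwards by the value 10
print(skewed.median())  # stays with the bulk of the data
print(skewed.skew())    # positive for a right-skewed sample

# Mean and median are both 3 here, yet the values are not
# mirrored around 3, so the data are not symmetric.
lopsided = pd.Series([0, 0, 3, 4, 8])
print(lopsided.mean() == lopsided.median())
```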

**Exercise 1:** *Produce a logic expression that gives either a `True` or `False` result by comparing - with the `==` operator - the mean and median of `MineAccidentDF["Time interval (days)"]`, just like the above cell. What does the result of this expression tell us about the symmetry of the distribution?*

In [None]:
#Answer
MineAccidentDF["Time interval (days)"].median() == MineAccidentDF["Time interval (days)"].mean()

#False - therefore an asymmetrical (skewed) distribution

*3. Mode*

The mode, or the modal value, is the most common datum in a set. 

It can be calculated with the `.mode()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mode.html).

In [None]:
Symmetrical[0].mode()
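
Note that `.mode()` returns a collection rather than a single number, because a dataset can have more than one mode. A small sketch with made-up data (the name `two_modes` is illustrative):

```python
import pandas as pd

# Hypothetical data in which 2 and 3 both appear twice.
two_modes = pd.Series([1, 2, 2, 3, 3, 4])

modes = two_modes.mode()  # a Series holding every modal value
print(modes.tolist())

# Pick out the first mode when a single number is needed.
first_mode = modes.iloc[0]
print(first_mode)
```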

**Measures of spread**

*1. Variability and Variance*

A simple method to measure the variability is to measure the deviation from the mean. This can be achieved by subtracting the mean from the data.

$ Mean \space Deviation = Data - Mean $

The code below does the following:

1. Calculates the mean of the `Symmetrical` *DataFrame* with the `.mean()` function and saves the result in a variable called `SymmetricalMean`.
```Python
SymmetricalMean = Symmetrical.mean()
```
2. Calculates the mean deviation by subtracting the mean from the data and saves it as a new column in `Symmetrical` called `"Deviation from Mean"`. 
```Python
Symmetrical["Deviation from Mean"] = Symmetrical - SymmetricalMean
```
3. Then the `Symmetrical` *DataFrame* is outputted by calling its name.
```Python
Symmetrical
```

In [None]:
SymmetricalMean = Symmetrical.mean() # Finds the mean of our dataset
Symmetrical["Deviation from Mean"] = Symmetrical - SymmetricalMean # Finds the mean deviation of each piece of data
Symmetrical # outputs the DataFrame

The deviations are negative when the data is less than the mean and positive when the data is greater than the mean. 

These deviations from the mean can give us an idea about the spread or variability of the data. However, they do not give us a single figure that describes the data as a whole.

To find a single figure to describe the data you might want to find the average of these deviations - however, we do not get a useful answer...

In [None]:
Symmetrical["Deviation from Mean"].mean()

The mean value of the deviations from the mean is always equal to zero. This is because the negative and positive values cancel each other.

The mean of the deviations is, therefore, not a useful measure of the variability of the sample. We want to find the variability regardless of whether the deviation away from the mean is positive or negative. So we need to eliminate the signs in the deviations. 

This can be done either by squaring the deviations or by taking their absolute values. We are going to use the squaring method.

The code below squares the `Deviation from Mean` column and saves the data in a new column in the `Symmetrical` *DataFrame* called `Squared Deviation`.

In [None]:
Symmetrical["Squared Deviation"] = Symmetrical["Deviation from Mean"] ** 2
Symmetrical

Now if we take the mean of `Symmetrical["Squared Deviation"]` we will have the *mean squared deviation* - otherwise known as the *population variance*.

In [None]:
Symmetrical["Squared Deviation"].mean()
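
For comparison, the absolute-value route mentioned earlier gives a different single figure, usually called the mean absolute deviation. The sketch below uses a small made-up dataset and illustrative variable names to show both routes side by side.

```python
import pandas as pd

# Hypothetical data with a mean of exactly 5.
data = pd.Series([2, 4, 4, 6, 9])

deviations = data - data.mean()  # -3, -1, -1, 1, 4

mean_deviation = deviations.mean()                # signs cancel out
mean_absolute_deviation = deviations.abs().mean() # signs removed with abs()
mean_squared_deviation = (deviations ** 2).mean() # signs removed by squaring

print(mean_deviation)
print(mean_absolute_deviation)
print(mean_squared_deviation)
```

Squaring gives extra weight to values far from the mean, which is one reason the two figures differ.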

The overall process of this calculation can be represented with the following equation.


${\large \text{Population Variance}=\sigma^2=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\overline{x})^2}$



$\overline{x} = \text{Mean}$

$N = \text{Population size}$

$x_{i} = \text{Each piece of data}$



$(x_{i}-\overline{x})^2 = \text{Squared deviation from the mean}$



We can compare this *population variance* that we have calculated to the *sample variance* which can be calculated with the `.var()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html). 

For the same data, the *sample variance* is always greater than the *population variance* (unless every value is identical, in which case both are zero). This is because the *sample variance* divides by the smaller $n-1$, whereas the *population variance* divides by the greater $N$.

${\large \text{Sample Variance}=s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})^2}$

$n = \text{Sample size}$

In [None]:
Symmetrical[0].var() # Sample Variance
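
The two formulas can be checked against each other. In the sketch below the symmetrical data is typed in again as a plain column so that the example runs on its own (the variable names are illustrative). The `ddof` argument of `.var()` sets how much is subtracted from the count in the denominator: the default `ddof=1` gives the sample variance and `ddof=0` gives the population variance.

```python
import pandas as pd

data = pd.Series(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]
)

n = len(data)
squared_deviations = (data - data.mean()) ** 2

population_variance = squared_deviations.sum() / n    # divide by N
sample_variance = squared_deviations.sum() / (n - 1)  # divide by n - 1

print(population_variance, data.var(ddof=0))  # these two should agree
print(sample_variance, data.var())            # and so should these two
```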

**Exercise 2:** *Calculate the variance of `MineAccidentDF["Time interval (days)"]`*

In [None]:
#answer
MineAccidentDF["Time interval (days)"].var()

*2. Standard Deviation*

The problem with variance is that it is not on the same scale as the original variable because we have squared it. However, if we take the positive square root of the variance it leads to a parameter that is on the same scale as our initial dataset and as a consequence it is more comparable. This is the *standard deviation*.

The *standard deviation* is the root mean squared deviation from the mean value. 

We can calculate the square root by raising our value to the power of a half (`** 0.5`).

In [None]:
variance = Symmetrical["Squared Deviation"].mean()
standarddeviation = variance ** 0.5 

standarddeviation # Population Standard Deviation

This result is the *population standard deviation*.

$\large\text{Population Standard Deviation}=\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\overline{x})^2}$

Again, we can calculate the slightly greater *sample standard deviation* with the `.std()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html).

$\large\text{Sample Standard Deviation}=s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\overline{x})^2}$

In [None]:
Symmetrical[0].std() # Sample Standard Deviation
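
As with the variance, `.std()` accepts a `ddof` argument, so the from-scratch population figure can be reproduced with `ddof=0`. The sketch below re-creates the symmetrical data as a plain column so that it runs on its own; the variable names are illustrative.

```python
import math

import pandas as pd

data = pd.Series(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]
)

# Root of the mean squared deviation: the population standard deviation.
population_std = math.sqrt(((data - data.mean()) ** 2).mean())

print(population_std, data.std(ddof=0))  # these two should agree
print(data.std())                        # sample version, slightly larger
```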

**Exercise 3:** *Calculate the standard deviation of `MineAccidentDF["Time interval (days)"]`*

In [None]:
#answer
MineAccidentDF["Time interval (days)"].std()

*3. Interquartile Range*

The standard deviation suffers from being sensitive to outliers. If we look at the distribution of the British mine accidents again (below) you will notice the vast majority of the data is to the left hand side of the histogram. Therefore, our standard deviation might be skewed by the few outliers that lie to the far right hand side of the distribution.

In [None]:
MineAccidentDF.plot.hist(bins=10)

If we use a measurement of spread that is less sensitive to outliers, we might end up with a more accurate description of our data. The *interquartile range* (*IQR*) is exactly that: it is robust because the outliers are not directly used in its calculation. The *standard deviation*, *variance* and *mean*, by contrast, use *all* the data in the calculation, including the outliers, and are therefore sensitive to them.

If we were to organise the data from least to greatest and then split the data at the median, we would have an upper and a lower half of the data. If we then found the medians of both these halves and took the difference between them, this would be the *IQR*.

The word 'quartile' comes from the idea that we are separating the data into four segments. The boundaries of these segments give us our quartile statistics.

The first quartile ($Q1$) is the median of the lower half of the data - about 25% of the way through the list of the organised data.

The median of the total population is the second quartile ($Q2$) - 50% through the list of organised data.

The third quartile ($Q3$) is the median of the upper half of the data - 75% through the list of organised data.

Finally, the fourth quartile ($Q4$) - which isn't of much use - is the very last piece of data in the organised list. 
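
The "median of each half" description can be sketched directly. The data and variable names below are made up for illustration. The sketch also computes the quartiles with the pandas `.quantile()` function, which interpolates between neighbouring values by default, so the two approaches can give slightly different answers on small datasets.

```python
import pandas as pd

# Hypothetical data with an even number of values.
data = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

ordered = data.sort_values().reset_index(drop=True)
half = len(ordered) // 2

lower_half = ordered.iloc[:half]   # values below the median
upper_half = ordered.iloc[-half:]  # values above the median

q1_halves = lower_half.median()
q3_halves = upper_half.median()

q1_pandas = data.quantile(0.25)
q3_pandas = data.quantile(0.75)

print(q1_halves, q3_halves)  # medians of the two halves
print(q1_pandas, q3_pandas)  # interpolated quartiles from pandas
```

Both conventions are in common use; the rest of this lesson relies on the pandas result.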

We calculate the *IQR* with the following:

$\large IQR = Q3 - Q1$

To find the quartiles we use the `.quantile()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.quantile.html) on our *DataFrame* (Note the function is **quant**ile rather than **quart**ile).

As $Q1$ is 25% through the list of organised data we need to have `0.25` as a parameter in the function. Likewise, $Q3$ needs to have `0.75` as a parameter as it is 75% through the data. As this argument is a number we do not need to surround it in `""`.

In [None]:
Q1 = MineAccidentDF["Time interval (days)"].quantile(0.25)
Q1

In [None]:
Q3 = MineAccidentDF["Time interval (days)"].quantile(0.75)
Q3

With these statistics we can work out the IQR with the formula above.

In [None]:
IQR = Q3 - Q1
IQR
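
As a side note, `.quantile()` also accepts a list of fractions, which returns all the requested quartiles in one go. A small sketch, with the symmetrical data typed in again so that it runs on its own (the variable names are illustrative):

```python
import pandas as pd

data = pd.Series(
    [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]
)

# The result is indexed by the fractions that were asked for.
quartiles = data.quantile([0.25, 0.5, 0.75])
print(quartiles)

iqr = quartiles.loc[0.75] - quartiles.loc[0.25]
print(iqr)
```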

**Exercise 4:** *Calculate Q1 with `.quantile(0.25)` and Q3 with `.quantile(0.75)` of the `Symmetrical[0]` column, printing the results.*

In [None]:
#Answer
print(Symmetrical[0].quantile(0.25))
print(Symmetrical[0].quantile(0.75))

**Box plots and Outliers**

Several statistical parameters are particularly sensitive to outliers, in particular the *mean*, *variance* and *standard deviation*. To limit how much these parameters are skewed, it is common to remove outliers from datasets.

There are several ways to decide what counts as an outlier; we are going to use the $1.5 \cdot IQR$ method.

With this method we create bounds where the vast majority of the data resides. We expand $Q1$ and $Q3$ by $1.5 \cdot IQR$ on both sides. The bounds are calculated with the formulas below.

$LowerBound = Q1 - 1.5*IQR$ 

$UpperBound = Q3 + 1.5*IQR$

The data which is greater than the *lower bound* and smaller than the *upper bound* is the data we keep. Therefore, if a piece of data is outside of these bounds we class it as an outlier and we drop it from our dataset. 

In [None]:
LowerBound = Q1 - 1.5*IQR
LowerBound

In [None]:
HigherBound = Q3 + 1.5*IQR
HigherBound
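
If the bounds are needed for more than one column, the steps above could be wrapped in a small helper. This is only a sketch: the function name `iqr_bounds`, its `k` parameter and the example data are assumptions made for illustration.

```python
import pandas as pd


def iqr_bounds(column, k=1.5):
    """Return (lower, upper) outlier bounds for a numeric pandas column."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical time intervals with one unusually long gap.
times = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])

lower, upper = iqr_bounds(times)
print(lower, upper)
```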

To remove outliers from a *DataFrame* we firstly need to form a logic statement that describes the data we want to keep. This can be done in the following way:

1. We want all the data from `MineAccidentDF["Time interval (days)"]` less than the `HigherBound`. This can be described with the following.
    ```Python
    MineAccidentDF["Time interval (days)"] < HigherBound
    ```
2. And we want all the data from `MineAccidentDF["Time interval (days)"]` greater than the `LowerBound`. This can be described with the following.
    ```Python
    MineAccidentDF["Time interval (days)"] > LowerBound
    ```
3. We can combine both of these logic statements by using the `&` symbol. Note, we need to use brackets around each of the conjoining statements.
    ```Python
    (MineAccidentDF["Time interval (days)"] < HigherBound) & (MineAccidentDF["Time interval (days)"] > LowerBound)
    ```
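    Unfortunately we cannot use a chained comparison such as `LowerBound < MineAccidentDF["Time interval (days)"] < HigherBound` on a pandas column. Python tries to reduce the first comparison to a single `True` or `False`, which is ambiguous for a whole column, so pandas raises a `ValueError`. We need to use the `&` symbol instead.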
    
    
Below is an example of the conjoined statement. The `True` results mark the data we keep, and the `False` results mark the outliers which will be excluded later on.

In [None]:
(MineAccidentDF["Time interval (days)"] < HigherBound) & (MineAccidentDF["Time interval (days)"] > LowerBound)
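
The sketch below, on made-up data, shows the chained comparison failing and the `&` version working. It also shows the pandas `.between()` function as an alternative way to write the same range check. Note that `.between()` includes the end points by default, whereas `<` and `>` exclude them, so the two only differ when a value sits exactly on a bound. The data, bounds and variable names are illustrative.

```python
import pandas as pd

# Hypothetical time intervals and bounds, chosen only for illustration.
times = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])
lower, upper = 4.125, 27.125

# A chained comparison needs a single True/False from `lower < times`,
# but that comparison yields one boolean per row, so pandas raises an error.
chained_failed = False
try:
    chained = lower < times < upper
except ValueError as error:
    chained_failed = True
    print("Chained comparison failed:", error)

# Element-wise AND of two boolean columns. The brackets are required
# because & binds more tightly than < and >.
mask_and = (times > lower) & (times < upper)

# The same range check written with .between().
mask_between = times.between(lower, upper)

print(mask_and.tolist())
print(mask_between.equals(mask_and))
```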

To exclude all the outliers, we simply need to place the conjoined logic statement into a `.loc[]` function which operates on our `MineAccidentDF`. 

The code below does just that, and saves the resultant *DataFrame* as `MineAccident_ExOutliers`.

In [None]:
MineAccident_ExOutliers = MineAccidentDF.loc[(MineAccidentDF["Time interval (days)"] < HigherBound) & (MineAccidentDF["Time interval (days)"] > LowerBound)]
MineAccident_ExOutliers

Therefore, we now have a new *DataFrame*, `MineAccident_ExOutliers`, from which all the outliers defined by our bounds have been removed.

Plotting `MineAccident_ExOutliers` in a histogram will demonstrate how the distribution has changed as a consequence. 

In [None]:
MineAccident_ExOutliers.plot.hist()
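
It can also be useful to look at the rows that were dropped. A minimal sketch on made-up data: the `~` operator flips every `True` to `False` and vice versa, so the same logic statement can select the outliers instead of the data we keep. The data, bounds and variable names are illustrative.

```python
import pandas as pd

# Hypothetical time intervals and bounds, chosen only for illustration.
times = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])
lower, upper = 4.125, 27.125

keep = (times > lower) & (times < upper)

kept = times.loc[keep]      # data inside the bounds
removed = times.loc[~keep]  # ~ inverts the mask, leaving the outliers

print(len(times), len(kept), len(removed))
print(removed.tolist())
```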

If we compare the mean averages of our *DataFrame* before and after removing outliers it will also demonstrate how sensitive the mean is to outliers. 

In [None]:
MineAccident_ExOutliers["Time interval (days)"].mean()  # Excluding Outliers

In [None]:
MineAccidentDF["Time interval (days)"].mean()  # Including Outliers
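
The same comparison can be made for the median, which barely moves when an outlier is removed. The sketch below uses made-up numbers rather than the mine accident data, and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: seven typical values and one extreme value.
with_outlier = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])
without_outlier = with_outlier[with_outlier < 95]

mean_shift = with_outlier.mean() - without_outlier.mean()
median_shift = with_outlier.median() - without_outlier.median()

print(mean_shift)    # the mean moves a long way
print(median_shift)  # the median hardly moves
```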

Box plots - or box-and-whisker plots - are another way to plot the distribution of data. In particular, this plot uses the $1.5*IQR$ outlier method to find outliers. This plot is a very quick way to assess a dataset to find any particular outliers.

To plot a box plot we need to use the `.plot.box()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.box.html). We do not need to pass any arguments in this function.


In [None]:
MineAccidentDF.plot.box()

The line inside the box is the median of the *DataFrame* column.

The box surrounding the median spans from $Q1$ to $Q3$.

The whiskers extend from the box to the most extreme data points that lie within $1.5 * IQR$ of $Q1$ and $Q3$.

Individual pieces of data outside of the whiskers are outliers and are depicted with circles.
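
The numbers behind a box plot can be worked out by hand. The sketch below assumes the default whisker rule of the underlying matplotlib box plot, in which each whisker stops at the last data point that is still inside the $1.5 * IQR$ bounds. The data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical time intervals with one unusually long gap.
times = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])

q1 = times.quantile(0.25)
q3 = times.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Whiskers end at actual data points inside the bounds.
inside = times[(times >= lower_bound) & (times <= upper_bound)]
whisker_low = inside.min()
whisker_high = inside.max()

# Anything beyond the bounds would be drawn as an individual point.
outlier_points = times[(times < lower_bound) | (times > upper_bound)]

print(whisker_low, whisker_high)
print(outlier_points.tolist())
```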

**Exercise 5:** *Draw a box plot of `Symmetrical[0]`.*

In [None]:
#answer
Symmetrical[0].plot.box()

**Percentage Change and Growth Rates**

Below is data from the Covid-19 outbreak in the EU, covering 01/01/2020 to 30/06/2020. [2]

In [None]:
covid_df = pd.read_csv("https://raw.githubusercontent.com/ThomasJewson/datasets/master/EUCOVID19CasesDeaths.csv")
covid_df

In Jan there were 17 confirmed cases of Covid-19 in the EU; by Feb there were 1126 cases. To compute the percentage change between these two we need to use the following:

$\large \text{Percentage Change}=100*\frac{Final - Initial}{Initial}$

Alternatively we can also use this equation: 

$\large \text{Percentage Change}=100*\left(\frac{Final}{Initial}-1\right)$

We calculate the percentage change between Jan and Feb in the code below.

In [None]:
initial = 17
final = 1126
pctchange = 100*(final - initial) / initial
pctchange

In [None]:
pctchange = 100*((final / initial) - 1)
pctchange
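
The first formula can be wrapped in a small reusable function. This is only a sketch: the name `percentage_change` and the zero check are additions made for illustration.

```python
def percentage_change(initial, final):
    """Return the change from `initial` to `final` as a percentage."""
    if initial == 0:
        # The formula divides by the initial value, so zero has no answer.
        raise ValueError("initial value must not be zero")
    return 100 * (final - initial) / initial


jan_to_feb = percentage_change(17, 1126)  # the Jan and Feb figures from above
doubling = percentage_change(50, 100)     # a doubling is a 100% increase
decrease = percentage_change(200, 150)    # a fall gives a negative result

print(jan_to_feb)
print(doubling)
print(decrease)
```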

As you can see there was a very large percentage increase in COVID-19 cases in the EU between Jan and Feb. 

Generally, percentage changes are not whole numbers so it is useful for us to round the result. This can be done with the `round()` [function](https://docs.python.org/3/library/functions.html#round).

The `round()` function takes up to two positional arguments. The first argument is the data we want to round, in this case the `pctchange` variable. The second argument, which is optional, is the number of decimal places we want to round to, expressed as an integer.

In [None]:
round(
    pctchange, #This is our data
    1 #This is the number of decimal places to round to
)
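
A short sketch of how `round()` behaves with and without its second argument. The value of `pctchange` is recomputed here so that the example runs on its own; the other variable names are illustrative.

```python
pctchange = 100 * (1126 - 17) / 17

one_decimal = round(pctchange, 1)  # one decimal place, still a float
whole_number = round(pctchange)    # no second argument: nearest integer

print(one_decimal)
print(whole_number)

# Values exactly halfway between two integers go to the even neighbour.
halfway_down = round(2.5)
halfway_up = round(3.5)
print(halfway_down, halfway_up)
```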

To calculate the percentage change for a whole column of data in a DataFrame we use the `.pct_change()` [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html). However, we still need to multiply by 100 to turn it into a percentage. 

In [None]:
pctdata = covid_df["Confirmed Cases"].pct_change()*100
pctdata
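
To see what `.pct_change()` does, the same calculation can be written by hand with the pandas `.shift()` function, which moves every value down one row so that each row lines up with the one before it. The figures and names below are made up for illustration and are not the Covid-19 data.

```python
import pandas as pd

# Hypothetical monthly counts.
cases = pd.Series([100, 150, 120, 240], index=["Jan", "Feb", "Mar", "Apr"])

previous = cases.shift(1)             # each row now holds the previous month
manual = (cases / previous - 1) * 100 # Final / Initial - 1, as a percentage
built_in = cases.pct_change() * 100

print(manual.round(2))
print(built_in.round(2))
```

The first row has no previous month to compare against, so both versions show `NaN` there.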

We can also round every number in a column with the `round()` [function](https://docs.python.org/3/library/functions.html#round) by placing our `pctdata` column as the first argument of the function.

In [None]:
round(pctdata,2)

From this we can tell that Covid-19 cases in the EU were growing at the fastest rate in March 2020, when the percentage increase in cases was 38,043.43%.

**Conclusions:**

*You should now be able to do the following:*

1. Calculate the following averages:
    1. Mean with `.mean()`
    2. Median with `.median()`
    3. Mode with `.mode()`
2. Calculate the following measures of spread:
    1. Variance from scratch and with `.var()`
    2. Standard deviation from scratch and with `.std()`
3. Calculate the interquartile range (IQR) with the use of the `.quantile()` function
4. Use the ${1.5*IQR}$ method to remove outliers by using `&` between two logic statements
5. Produce a box plot with `.plot.box()`.
6. Find the percentage change manually and also with the `.pct_change()` function

**Sources:**

[1] B.A. Maguire, E.S. Pearson, A.H.A. Wynn (1952).
"The Time Intervals Between Industrial Accidents", Biometrika Vol. 39, #1/2 pp. 168-180

[2] European Centre for Disease Prevention and Control