# The normal distribution

```{attention}
Enrolled students using DILNET may use the CS JupyterHub.<br/>
<a href="http://jhub.science.upd.edu.ph/hub/user-redirect/git-pull?repo=https://github.com/GeoPython-UPD/notebooks&urlpath=lab/tree/notebooks/L8/normal-distribution.ipynb+&branch=main"><img src="https://img.shields.io/badge/Launch-CS_Hub-blue" alt="Launch - CS Hub"></a>

Follow the lesson and fill in your notebooks using Binder.<br/>
<a href="https://mybinder.org/v2/gh/GeoPython-UPD/notebooks/main?labpath=L8/normal-distribution.ipynb"><img alt="Binder badge" src="https://img.shields.io/badge/launch-binder-red.svg" style="vertical-align:text-bottom"></a>
```

The {term}`normal distribution`, also known as the *Gaussian* distribution, is the mathematical function that describes the bell-shaped curve of expected values for a measurement subject to many small random errors. Most of you have probably seen this distribution before; here we briefly outline its definition and use.
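For reference, the bell-shaped curve can be written as the standard Gaussian probability density function below. The symbols are the conventional ones rather than anything specific to this lesson: $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.

$$
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
$$

The curve peaks at $x = \mu$, and $\sigma$ controls how wide the bell is.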

![The normal distribution](img/Gaussian.png)

_**Figure 1.5**. The normal distribution of cinder cone slope angles. Source: Figure 7.1 from [McKillup and Dyar, 2010](http://www.cambridge.org/fi/academic/subjects/earth-and-environmental-science/earth-science-general-interest/geostatistics-explained-introductory-guide-earth-scientists?format=HB&isbn=9780521763226)._


Many geological values can be described using the *normal distribution*: the most commonly measured values lie close to the *mean*, while a small number of measurements may lie quite far from it. The example in Figure 1.5 shows the slope angles of cinder cones, which may be expected to be similar in many cases. This might relate to the angle of repose of the cinders that form the cone.

![The normal distribution and standard deviations](img/Gaussian-sd.png)

_**Figure 1.6**. The relationships between the normal distribution and standard deviation. Source: Figure 7.3 from [McKillup and Dyar, 2010](http://www.cambridge.org/fi/academic/subjects/earth-and-environmental-science/earth-science-general-interest/geostatistics-explained-introductory-guide-earth-scientists?format=HB&isbn=9780521763226)._


Assuming we have a representative sample, the *standard deviation* can be related to the percentage of measurements expected to fall within some distance of the *mean*. If the sample is representative, the sample mean $\bar{x}$ should be close to the population mean $\mu$. For normally distributed data, ~68% of measurements should then fall within plus or minus one standard deviation of the mean, and ~95% within plus or minus two standard deviations.
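The ~68% and ~95% figures can be checked against an ideal normal distribution. The snippet below is a minimal sketch using `NormalDist` from Python's standard library; the helper name `prob_within` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

Because any normal distribution can be rescaled to the standard one, the same percentages hold for any mean and standard deviation.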

You will learn more about the *normal distribution* in the exercise for this week.


## Checking the distribution of data

Let's load some data to see the temperature distribution at Mactan airport.

In [None]:
import pandas as pd
import numpy as np

fp = r"data/ph_tempdata.csv"

data = pd.read_csv(
    fp,
    na_values=["***"],
    usecols=["DATE", "STATION", "TEMP", "MAX", "MIN"],
    parse_dates=["DATE"],
    index_col="DATE",
)

# convert temperatures from Fahrenheit to Celsius
data["TEMP"] = (data["TEMP"] - 32)/1.8
data["MAX"] = (data["MAX"] - 32)/1.8
data["MIN"] = (data["MIN"] - 32)/1.8

# get only the data from the Mactan airport using the station code
data = data.loc[data["STATION"] == "RPM00098646"]
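The conversion used above, $(T_F - 32)/1.8$, can be sanity-checked against two well-known reference points. This is a small sketch; the function name `fahrenheit_to_celsius` is an illustrative helper and is not used elsewhere in the lesson.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) / 1.8


# Reference points: water freezes at 32 °F (0 °C) and boils at 212 °F (100 °C)
freezing_c = fahrenheit_to_celsius(32)
boiling_c = fahrenheit_to_celsius(212)
print(freezing_c, round(boiling_c, 6))
```

Since the function only uses arithmetic, it should also work element-wise on a pandas column that is still in Fahrenheit.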

How much data are we dealing with? Let's check!

In [None]:
len(data)

That's a lot of data to look at one by one. Fortunately, we can get an idea of the range of values with `data.describe()`.

In [None]:
data.describe()

A graph is often a more informative way to understand the distribution of temperatures. We can plot [histograms](https://en.wikipedia.org/wiki/Histogram) with [matplotlib.pyplot.hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).

In [None]:
import matplotlib.pyplot as plt

# plot a histogram of the temperatures at Mactan airport
plt.hist(data["TEMP"])
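By default, `plt.hist` groups the values into 10 equal-width bins, and the number of bins changes how detailed the histogram looks. The sketch below uses `np.histogram` to show the effect without drawing anything. The temperatures are synthetic (a hypothetical mean of 28 °C and standard deviation of 1.5 °C), not the Mactan data.

```python
import numpy as np

# Synthetic, hypothetical temperatures; NOT the Mactan airport data
rng = np.random.default_rng(seed=0)
temps = rng.normal(loc=28.0, scale=1.5, size=1000)

for n_bins in (10, 30):
    # counts: number of values per bin; edges: the n_bins + 1 bin boundaries
    counts, edges = np.histogram(temps, bins=n_bins)
    bin_width = edges[1] - edges[0]
    print(f"{n_bins} bins: width {bin_width:.2f} °C, tallest bin holds {counts.max()} values")
```

The same `bins` argument can be passed to `plt.hist`, for example `plt.hist(data["TEMP"], bins=30)`.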

## A taste of Seaborn

[Seaborn](https://seaborn.pydata.org/) is a popular statistical visualization package that makes it quick to produce attractive figures. Here we use it to plot a histogram.

In [None]:
import seaborn as sns

sns.histplot(data["TEMP"])

Let's annotate the mean and mode to see if they are the same or different.

In [None]:
data_Tmean = data["TEMP"].mean(skipna=True)
data_Tmode = data["TEMP"].mode()

sns.histplot(data["TEMP"])
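# solid black line: mean; dashed red line: first mode
# (Series.mode() returns a Series, so we take its first element)
plt.axvline(data_Tmean, color="k", linestyle="-")
plt.axvline(data_Tmode[0], color="r", linestyle="--")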
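The `[0]` index is needed because `Series.mode()` always returns a Series: a dataset can have several equally common values. The sketch below illustrates this with a handful of made-up temperatures (hypothetical values, not the Mactan data).

```python
import pandas as pd

# Made-up temperatures in which 28.0 and 29.0 both occur twice
temps = pd.Series([27.5, 28.0, 28.0, 29.0, 29.0, 30.5])

modes = temps.mode()        # Series containing every mode, sorted ascending
print(modes.tolist())       # [28.0, 29.0]

first_mode = modes.iloc[0]  # the smallest of the modes
print(first_mode)           # 28.0
print(temps.mean())         # the mean need not coincide with any mode
```

For a perfectly symmetric, single-peaked distribution such as the normal distribution, the mean and mode coincide; a clear gap between them in real data suggests skew or multiple peaks.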

To help visualize the temperature range expected about 95% of the time, let's also annotate one and two standard deviations ($\sigma$) on either side of the mean.

In [None]:
data_Tmean = data["TEMP"].mean(skipna=True)
data_Tmode = data["TEMP"].mode()

sns.histplot(data["TEMP"])
# solid black line: mean; dashed red line: first mode
plt.axvline(data_Tmean, color="k", linestyle="-")
plt.axvline(data_Tmode[0], color="r", linestyle="--")

# mark one (dashed) and two (dotted) standard deviations on either side of the mean
data_Tstd = data["TEMP"].std()

plt.axvline(data_Tmean - data_Tstd, color="k", linestyle="--")
plt.axvline(data_Tmean + data_Tstd, color="k", linestyle="--")
plt.axvline(data_Tmean - (2*data_Tstd), color="k", linestyle=":")
plt.axvline(data_Tmean + (2*data_Tstd), color="k", linestyle=":")
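Beyond drawing the lines, one could also count how many measurements actually fall inside them and compare the result with the ~68% and ~95% expected for a normal distribution. The following is a minimal sketch: `fraction_within` is a hypothetical helper written for this illustration, and the values are synthetic (drawn from a normal distribution with an assumed mean of 28 °C and standard deviation of 1.5 °C), not the Mactan data.

```python
import numpy as np


def fraction_within(values, k):
    """Fraction of non-missing values within k sample standard deviations of the mean."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing values
    mean = values.mean()
    std = values.std(ddof=1)  # ddof=1 matches the default of pandas Series.std()
    return np.mean(np.abs(values - mean) <= k * std)


# Synthetic, hypothetical temperatures; NOT the Mactan airport data
rng = np.random.default_rng(seed=42)
synthetic_temps = rng.normal(loc=28.0, scale=1.5, size=10_000)

frac_1sd = fraction_within(synthetic_temps, 1)
frac_2sd = fraction_within(synthetic_temps, 2)
print(f"within 1 standard deviation: {frac_1sd:.3f}")
print(f"within 2 standard deviations: {frac_2sd:.3f}")
```

With the lesson's data, the equivalent call would be `fraction_within(data["TEMP"], 2)`. A result far from 0.95 would hint that the temperatures are not well described by a normal distribution.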