<img src="images/Project_logos.png" width="500" height="300" align="center">

## Aims

This course will teach you some methods for quantifying the probability of events from the statistics of your data. Knowing the probability of an event makes it possible to predict how frequently it might occur or how big it might be.


Prior knowledge of Python, NumPy, Pandas, Iris, and Matplotlib is assumed for this course.

## Table of Contents

* [Normal Distribution](#normal_distribution)
* [Binomial Distribution](#binomial_distribution)
* [Poisson Distribution](#poisson)
* [Exercise 1](#exercise_1)

## Normal Distribution<a class="anchor" id="normal_distribution"></a>

The normal distribution (also known as the Gaussian Distribution) is the most commonly used distribution to be fit to data. It is symmetrical about the mean with a bell-shaped curve. 

The **Central Limit Theorem** states that, as the number of independent observations increases, the distribution of their sample mean approaches a normal distribution, whatever the shape of the underlying distribution (provided it has a finite variance). At the same time, the sample mean converges on the true mean.
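
The behaviour described by the Central Limit Theorem can be illustrated with a small simulation. This is a minimal sketch, assuming an exponential distribution with a true mean of 1 as the skewed starting point; the helper `skewness`, the seed, and the sample sizes are illustrative choices rather than part of the course data.

```python
import numpy as np

# Seeded generator so that the sketch is reproducible
rng = np.random.default_rng(42)


def skewness(values):
    """Sample skewness: close to 0 for a symmetric distribution such as the normal."""
    centred = values - values.mean()
    return (centred**3).mean() / values.std()**3


# For each sample size n, draw 10,000 samples from a skewed (exponential)
# distribution and record the skewness of the 10,000 sample means
skew_by_n = {}
for n in (1, 5, 50, 500):
    sample_means = rng.exponential(scale=1.0, size=(10000, n)).mean(axis=1)
    skew_by_n[n] = skewness(sample_means)
    print(f'n={n:>3}: mean of sample means = {sample_means.mean():.3f}, '
          f'skewness = {skew_by_n[n]:.2f}')
```

As `n` grows, the skewness of the sample means should shrink towards 0 (the value for a normal distribution) while their average stays close to the true mean of 1.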

The **Three Sigma Rule** describes the proportion of the observations that fall within a given distance of the mean. For a normal distribution, the Three Sigma Rule states that approximately:
- 68% of the data will fall within one standard deviation of the mean
- 95% of the data will fall within two standard deviations of the mean
- 99.7% of the data will fall within three standard deviations of the mean

Any observation that is more than three standard deviations from the mean should be treated with caution.
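
The percentages in the Three Sigma Rule can be checked against the cumulative distribution function of the standard normal distribution. The sketch below uses `scipy.stats.norm.cdf`; the variable name `coverage` is only illustrative.

```python
from scipy.stats import norm

# Fraction of a normal distribution lying within k standard deviations of the mean:
# P(-k < Z < k) = cdf(k) - cdf(-k)
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f'Within {k} standard deviation(s) of the mean: {fraction*100:.1f}%')
```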

The **z-score** is a measure of how many standard deviations away from the mean a given data point is. A z-table of the cumulative probability of a standard normal distribution is used to place the z-score in context.
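
As a short illustration of the z-score, the sketch below computes $z = (x - \mu)/\sigma$ for a single value and looks up its cumulative probability with `scipy.stats.norm.cdf`, which plays the role of the z-table. The mean, standard deviation, and observation are hypothetical numbers chosen for the example, not values from the weather station data.

```python
from scipy.stats import norm

# Hypothetical values, chosen only for illustration
mean_tmax = 22.0    # mean maximum temperature
std_tmax = 3.0      # standard deviation of the maximum temperature
observation = 28.0  # a single observed maximum temperature

# Number of standard deviations between the observation and the mean
z = (observation - mean_tmax) / std_tmax

# Cumulative probability of the standard normal distribution at z
# (the value that would be read from a z-table)
cumulative_prob = norm.cdf(z)

print(f'z-score = {z:.2f}')
print(f'Probability of a value at or below the observation = {cumulative_prob:.4f}')
```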

Using an example of weather station data, we may wish to substitute data from one station for missing data at another. To do this, it is necessary to confirm that the weather at the replacement station is representative of the weather at the station with missing data. For this example, we want to know whether we can use maximum temperature data from Heathrow to replace missing data from Oxford in July 2012.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import numpy as np

# Load in the data
df_heathrow = pd.read_csv('data/heathrow_weather_station_data.csv')
df_oxford = pd.read_csv('data/oxford_weather_station_data.csv')

# select the maximum temperature data for July
df_july_heathrow = df_heathrow[df_heathrow['Month']==7]
df_july_oxford = df_oxford[df_oxford['Month']==7]
df_tmax_jul_heathrow = df_july_heathrow['Max_temperature']
df_tmax_jul_oxford = df_july_oxford['Max_temperature']

# combine the two datasets
df_tmax_jul = pd.DataFrame(zip(df_tmax_jul_heathrow.values, df_tmax_jul_oxford.values),
                           columns=['Heathrow_tmax_jul', 'Oxford_tmax_jul'])

# plot the distributions of the two datasets
sns.kdeplot(data=df_tmax_jul)
plt.show()

# test the data for normality with the Shapiro-Wilk test
# (a p-value > 0.05 means that normality is not rejected)
test_normal_heathrow = scipy.stats.shapiro(df_tmax_jul_heathrow)
print(f'Heathrow: Shapiro-Wilk p-value = {test_normal_heathrow[1]}')
test_normal_oxford = scipy.stats.shapiro(df_tmax_jul_oxford)
print(f'Oxford: Shapiro-Wilk p-value = {test_normal_oxford[1]}')

The data are normally distributed, therefore we can use the **z-distribution** (the standard normal distribution), a normal distribution with a mean of 0 and a standard deviation of 1, to examine the data further. Any normal distribution can be standardised by converting its values into z-scores. Z-scores indicate how many standard deviations away from the mean each value is. Z-distributions allow us to calculate the probability of values occurring and to compare different datasets. Here we are using it to compare two different datasets.

In [None]:
# calculate the z-score
oxford_mean = df_tmax_jul['Oxford_tmax_jul'].mean()
heathrow_mean = df_tmax_jul['Heathrow_tmax_jul'].mean()
oxford_std = df_tmax_jul['Oxford_tmax_jul'].std()

z_score = (heathrow_mean - oxford_mean)/oxford_std
print(f'z-score = {z_score}')

The z-score indicates that the mean of the Heathrow maximum July temperatures is within one standard deviation of the mean of the Oxford maximum July temperatures. However, we can see from the plot that the mean of the July temperatures at Heathrow is larger than that at Oxford, so we need to test whether the two distributions are significantly different.

In [None]:
#find p-value for two-tailed test
print(f'p-value = {scipy.stats.norm.sf(abs(z_score))*2}')

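This p-value is greater than 0.05, so the difference between the two means is NOT significant at the 5% significance level (95% confidence level). A non-significant result does not, however, demonstrate that the two stations are interchangeable, so the data from the Heathrow weather station should not simply be used to replace the missing data from the Oxford weather station.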
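
To put the two-tailed p-value in context, the sketch below evaluates `2 * norm.sf(|z|)` for a few illustrative z-scores (these are not the values from the weather station data). It shows that any z-score within one standard deviation of the mean gives a p-value well above 0.05, and that the 0.05 threshold corresponds to a z-score of about 1.96.

```python
from scipy.stats import norm

# Two-tailed p-value: probability of a value at least |z| standard deviations
# from the mean, in either direction
p_values = {z: 2 * norm.sf(abs(z)) for z in (0.5, 1.0, 1.96, 3.0)}

for z, p in p_values.items():
    verdict = 'significant' if p < 0.05 else 'not significant'
    print(f'z = {z:>4}: p-value = {p:.4f} ({verdict} at the 5% level)')
```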

In this instance, since the Oxford data are normally distributed, using the mean to infill the missing data values is a better choice.
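
A minimal sketch of infilling with the mean is shown below, using the pandas method `fillna`. The series `tmax` is a small hypothetical example rather than the Oxford record.

```python
import numpy as np
import pandas as pd

# Hypothetical maximum temperatures with two missing values
tmax = pd.Series([21.0, 23.5, np.nan, 19.5, np.nan, 24.0], name='Max_temperature')

# Series.mean() ignores missing values by default, so this is the mean
# of the four available observations
tmax_mean = tmax.mean()

# Replace each missing value with the mean of the available observations
tmax_filled = tmax.fillna(tmax_mean)

print(f'Mean of the available observations = {tmax_mean}')
print(tmax_filled)
```

Infilling with the mean leaves the mean of the series unchanged but reduces its variance, which is worth bearing in mind if the infilled series is used to estimate the spread of the data.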

To sample from a normal distribution, use the `normal` function from NumPy's `random` module:

In [None]:
from numpy import random

# For sampling a single value from a normal distribution:
x = random.normal()
print(f'Random sample for a normal distribution with the default setting of \nmean=0, \nstandard deviation=1\n{x}\n\n')
      
# For sampling a single value from a normal distribution with a mean (loc) of 2 and a 
# standard deviation (scale) of 1:
x = random.normal(loc=2, scale=1)
print(f'Random sample for a normal distribution with\nmean=2\nstandard deviation=1\n{x}\n\n')

# To return more than one value, use the 'size' keyword to specify the number and shape of the sample:
x = random.normal(loc=2, scale=1, size=3)
print(f'3 Random samples for a normal distribution with\nmean=2\nstandard deviation=1\n{x}\n\n')
x = random.normal(loc=2, scale=1, size=(2,3))
print(f'6 Random samples for a normal distribution with\nmean=2\nstandard deviation=1\n{x}\n\n')


## Binomial Distribution<a class="anchor" id="binomial_distribution"></a>
The binomial distribution is the discrete probability distribution of the number of times an outcome occurs in a fixed number of independent trials, where each trial can only take one of two values, e.g. raining or not raining. The probability of each outcome remains fixed from trial to trial.
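
As an illustration, the sketch below uses `scipy.stats.binom.pmf` to compute the probability of each possible number of wet days in a week. The wet-day probability of 0.3 is a hypothetical value, and treating each day as an independent trial is an assumption of the binomial model.

```python
import numpy as np
from scipy.stats import binom

n_days = 7   # number of trials (days in a week)
p_wet = 0.3  # hypothetical probability of a wet day

# Probability of exactly k wet days in the week, for k = 0..7
k = np.arange(n_days + 1)
pmf = binom.pmf(k, n_days, p_wet)

for num_wet, prob in zip(k, pmf):
    print(f'P({num_wet} wet days in {n_days}) = {prob:.4f}')
print(f'Sum of probabilities = {pmf.sum():.4f}')
```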

Using the weather station data from Heathrow, we can calculate the probability of a wet day as the number of wet days divided by the total number of days.

In [None]:
import pandas as pd

# Load in the data
df_heathrow = pd.read_csv('data/heathrow_weather_station_data_daily.csv')

# Remove the days with missing data
df_heathrow.dropna(inplace=True)
print(df_heathrow)

# Calculate the wet days (assume a threshold of 0.01 mm/day constitutes a wet day)
df_heathrow['wetday'] = 'Wet'
df_heathrow.loc[df_heathrow["PRCP"] <= 0.01, "wetday"] = 'Dry'

# Calculate the probabilities
rainfall_probs = df_heathrow.groupby('wetday').size().div(len(df_heathrow))
print(f'The probability of a wet day at Heathrow is {rainfall_probs["Wet"]:.2f}')

We could also calculate a conditional probability, e.g. the probability of a wet day given a maximum temperature of at least 15°C.

In [None]:
import pandas as pd

# Load in the data
df_heathrow = pd.read_csv('data/heathrow_weather_station_data_daily.csv')

# Remove the days with missing data
df_heathrow.dropna(inplace=True)

# Convert the maximum temperature from Fahrenheit to Celsius
df_heathrow["TMAX"] = (df_heathrow["TMAX"]-32) * 5/9

# Calculate the wet days (assume a threshold of 0.01 mm/day constitutes a wet day)
df_heathrow['wetday'] = 'Wet'
df_heathrow.loc[df_heathrow["PRCP"] <= 0.01, "wetday"] = 'Dry'

# Flag the days with a maximum temperature of at least 15°C
df_heathrow['tmax_gt_15'] = 'TMAX < 15C'
df_heathrow.loc[df_heathrow["TMAX"] >= 15, "tmax_gt_15"] = 'TMAX >= 15C'

# Calculate the probabilities (selected by label rather than by position)
rainfall_probs = df_heathrow.groupby('wetday').size().div(len(df_heathrow))
print(f'The probability of a wet day at Heathrow is {rainfall_probs["Wet"]:.2f}')
temperature_probs = df_heathrow.groupby('tmax_gt_15').size().div(len(df_heathrow))
print(f'The probability of a day with tmax >= 15C at Heathrow is {temperature_probs["TMAX >= 15C"]:.2f}\n\n')

# Calculate the proportion of wet and dry days within each temperature category
conditional_probs = df_heathrow.groupby('tmax_gt_15')['wetday'].value_counts(normalize=True)
print(f'The conditional probabilities are: \n{conditional_probs}\n')

conditional_prob = conditional_probs.loc[('TMAX >= 15C', 'Wet')]
print(f'Therefore, the probability of a wet day at Heathrow given a maximum temperature >= 15C is {conditional_prob*100:.0f}%')
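
The conditional probability can also be written as P(Wet | Warm) = P(Wet and Warm) / P(Warm). The sketch below checks this definition against a normalised contingency table from `pd.crosstab`, using a small hypothetical table of eight days; the column `warm` and its labels are illustrative names, not columns in the Heathrow data.

```python
import pandas as pd

# Hypothetical record of eight days
df = pd.DataFrame({
    'wetday': ['Wet', 'Dry', 'Wet', 'Dry', 'Dry', 'Wet', 'Dry', 'Wet'],
    'warm':   ['Warm', 'Warm', 'Cold', 'Warm', 'Cold', 'Warm', 'Warm', 'Cold'],
})

# From the definition: P(Wet | Warm) = P(Wet and Warm) / P(Warm)
p_warm = (df['warm'] == 'Warm').mean()
p_wet_and_warm = ((df['wetday'] == 'Wet') & (df['warm'] == 'Warm')).mean()
p_wet_given_warm = p_wet_and_warm / p_warm

# The same result from a contingency table in which each row sums to 1
table = pd.crosstab(df['warm'], df['wetday'], normalize='index')

print(f'P(Wet | Warm) from the definition = {p_wet_given_warm:.2f}')
print(table)
```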

To sample from a binomial distribution, use the `binomial` function from NumPy's `random` module:

In [None]:
from numpy import random
     
# For sampling a single value from a binomial distribution with 10 trials and a probability of success of 0.5:
x = random.binomial(10, 0.5)
print(f'Random sample for a binomial distribution with 10 trials and a probability of success of 0.5:\n{x}\n\n')

# To return more than one value, use the 'size' keyword to specify the number and shape of the sample:
x = random.binomial(10, 0.5, size=3)
print(f'Three Random samples for a binomial distribution with 10 trials and a probability of success of 0.5:\n{x}\n\n')
x = random.binomial(10, 0.5, size=(2,3))
print(f'Six Random samples for a binomial distribution with 10 trials and a probability of success of 0.5:\n{x}\n\n')


Using our wet day example again, we know the number of days with observations and the probability of a wet day occurring. We could therefore work out the probability of at least 5% more wet days occurring than were observed.

In [None]:
import pandas as pd
import numpy as np

# Load in the data
df_heathrow = pd.read_csv('data/heathrow_weather_station_data_daily.csv')

# Remove the days with missing data
df_heathrow.dropna(inplace=True)

# Calculate the wet days (assume a threshold of 0.01 mm/day constitutes a wet day)
df_heathrow['wetday'] = 'Wet'
df_heathrow.loc[df_heathrow["PRCP"] <= 0.01, "wetday"] = 'Dry'

# Calculate the probabilities
rainfall_probs = df_heathrow.groupby('wetday').size().div(len(df_heathrow))
print(f'The probability of a wet day at Heathrow is {rainfall_probs["Wet"]:.2f}')

# Calculate the number of observations
num_obs = len(df_heathrow)
print(f'There are {num_obs} observations')
num_wetdays = len(df_heathrow[df_heathrow['wetday']=='Wet'])
print(f'There are {num_wetdays} wet days in the record')

# Calculate the threshold
threshold = num_wetdays*1.05
print(f'5% more wet days = {threshold}')

# Draw 10,000 random samples from the binomial distribution for the dataset
random_sample = np.random.binomial(num_obs, rainfall_probs['Wet'], 10000)
# print(f'Random sample of 10,000 drawn from the binomial distribution = {random_sample}')
prob = sum(random_sample >= threshold)/10000
print(f'The probability of at least 5% more wet days than were observed at Heathrow is {prob*100:.2f}%')
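
The sampling estimate above can be compared with the exact tail probability of the binomial distribution. The sketch below uses `scipy.stats.binom.sf` with hypothetical values for the number of observations and the wet-day probability (the real values come from the Heathrow file), alongside a seeded sampling estimate for comparison.

```python
import math

import numpy as np
from scipy.stats import binom

# Hypothetical values, standing in for those calculated from the Heathrow data
num_obs = 1000     # number of days with observations
num_wetdays = 300  # number of wet days observed
p_wet = num_wetdays / num_obs
threshold = num_wetdays * 1.05

# Exact probability: P(X >= threshold) = P(X > ceil(threshold) - 1)
exact_prob = binom.sf(math.ceil(threshold) - 1, num_obs, p_wet)

# Sampling estimate of the same probability, for comparison
rng = np.random.default_rng(0)
samples = rng.binomial(num_obs, p_wet, size=100000)
sampled_prob = np.mean(samples >= threshold)

print(f'Exact probability    = {exact_prob*100:.2f}%')
print(f'Sampling estimate    = {sampled_prob*100:.2f}%')
```

The exact calculation has no sampling noise, whereas the sampling approach extends more easily to cases where no closed-form distribution is available.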

## Poisson Distribution<a class="anchor" id="poisson"></a>
The Poisson distribution gives the probability of an event occurring a given number of times within a given time frame, provided that the mean rate of occurrence is known and each event is independent. The Poisson distribution is therefore useful for determining how often an event should be expected.
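
For a mean rate $\lambda$, the Poisson probability of exactly $k$ events is $P(k) = \lambda^k e^{-\lambda} / k!$. The sketch below evaluates this formula directly and compares it with `scipy.stats.poisson.pmf`; the mean rate of 3 events per year and the helper `poisson_pmf` are hypothetical, chosen only for illustration.

```python
import math

import numpy as np
from scipy.stats import poisson

mean_rate = 3.0  # hypothetical mean number of events per year


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean rate is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)


k_values = np.arange(8)
manual = np.array([poisson_pmf(int(k), mean_rate) for k in k_values])
library = poisson.pmf(k_values, mu=mean_rate)

for k, p_manual, p_library in zip(k_values, manual, library):
    print(f'k = {k}: formula = {p_manual:.4f}, scipy = {p_library:.4f}')
```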

Using the weather station data from Heathrow, we can calculate the 99th percentile rainfall amount and define any day with at least that much rainfall as a storm day.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson

# Load in the data from Heathrow which covers the 30 year period 1991 to 2010
df_heathrow = pd.read_csv('data/heathrow_weather_station_data_daily.csv')


# Remove the days with missing data
df_heathrow.dropna(inplace=True)

# Calculate the 99th percentile rainfall amount
rain99 = df_heathrow["PRCP"].quantile(0.99)
print(f'The 99th percentile rainfall amount is {rain99} mm/day')
      
# Calculate the number of storm days
df_heathrow['storm'] = 0
df_heathrow.loc[df_heathrow["PRCP"] >= rain99, "storm"] = 1
df_heathrow['DATE'] = pd.to_datetime(df_heathrow['DATE'], format='%d/%m/%Y')
df_heathrow['Year'] = df_heathrow['DATE'].dt.year

# Calculate the mean number of storm events per year in the timeseries
mean_storms = df_heathrow.groupby(['Year'])['storm'].sum().mean()
print(f'The mean number of storm events per year is {mean_storms:.2f}')

# Plot the probability of different numbers of storms per year occurring
num_occurrences = np.arange(0, step=1, stop=df_heathrow.groupby(['Year'])['storm'].sum().max()+1)
pmf = poisson.pmf(k=num_occurrences, mu=mean_storms)
fig, ax = plt.subplots(1, 1, figsize=(12, 9))
plt.bar(num_occurrences, pmf * 100)
plt.xlabel('Number of storms')
plt.ylabel('Probability (%)')
plt.show()

prob = poisson.pmf(k=5, mu=mean_storms)*100
print(f'The probability of Heathrow having 5 storms a year is {prob:.2f}%')
print(f'This is equivalent to a 1-in-{100/prob:.0f} year event')
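
The calculation above gives the probability of exactly 5 storms in a year. If the question is instead how likely a year with 5 or more storms is, the survival function `poisson.sf` can be used. This is a sketch with a hypothetical mean of 3.65 storms per year (roughly 1% of the days in a year), not the value calculated from the Heathrow data.

```python
from scipy.stats import poisson

mean_storms = 3.65  # hypothetical mean number of storms per year

# Probability of exactly 5 storms in a year
p_exactly_5 = poisson.pmf(5, mu=mean_storms)

# Probability of 5 or more storms in a year: P(X >= 5) = P(X > 4)
p_at_least_5 = poisson.sf(4, mu=mean_storms)

# Return period in years: the reciprocal of the annual probability
return_period = 1 / p_at_least_5

print(f'P(exactly 5 storms)  = {p_exactly_5*100:.2f}%')
print(f'P(at least 5 storms) = {p_at_least_5*100:.2f}%')
print(f'A year with at least 5 storms is a 1-in-{return_period:.0f} year event')
```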

## Exercise 1<a class="anchor" id="exercise_1"></a>

Assume that climate change increases the daily rainfall at Heathrow by 20%, but that the definition of a storm remains the 99th percentile rainfall amount for the current climate. Using any Python methods, calculate the number of storm days, the mean number of storm events per year, and the probability of Heathrow having 5 storms a year.


<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
**Solution**

<font color='red'>**NOTE**</font>: Your methods can include any Python library



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson

# Load in the data from Heathrow which covers the 30 year period 1991 to 2010
df_heathrow = pd.read_csv('data/heathrow_weather_station_data_daily.csv')


# Remove the days with missing data
df_heathrow.dropna(inplace=True)

# Calculate the 99th percentile rainfall amount
rain99 = df_heathrow["PRCP"].quantile(0.99)
print(f'The 99th percentile rainfall amount is {rain99} mm/day')

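# Apply the assumed 20% increase in daily rainfall (after calculating the current-climate threshold)
df_heathrow["PRCP"] = df_heathrow["PRCP"]*1.2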
      
# Calculate the number of storm days
df_heathrow['storm'] = 0
df_heathrow.loc[df_heathrow["PRCP"] >= rain99, "storm"] = 1
df_heathrow['DATE'] = pd.to_datetime(df_heathrow['DATE'], format='%d/%m/%Y')
df_heathrow['Year'] = df_heathrow['DATE'].dt.year

# Calculate the mean number of storm events per year in the timeseries
mean_storms = df_heathrow.groupby(['Year'])['storm'].sum().mean()
print(f'The mean number of storm events per year is now {mean_storms:.2f}')

# Plot the probability of different numbers of storms per year occurring
num_occurrences = np.arange(0, step=1, stop=df_heathrow.groupby(['Year'])['storm'].sum().max()+1)
pmf = poisson.pmf(k=num_occurrences, mu=mean_storms)
fig, ax = plt.subplots(1, 1, figsize=(12, 9))
plt.bar(num_occurrences, pmf * 100)
plt.xlabel('Number of storms')
plt.ylabel('Probability (%)')
plt.show()

prob = poisson.pmf(k=5, mu=mean_storms)*100
print(f'The probability of Heathrow having 5 storms a year is now {prob:.2f}%')
print(f'This is equivalent to a 1-in-{100/prob:.0f} year event')