# W2D1 Tutorial 2: What is an extreme event? Empirical return levels

#### __Week 2, Day 4, Extremes & Vulnerability__
##### __Content creators:__ Matthias Aengenheyster, Joeri Reinders
##### __Content reviewers:__ TBD
##### __Content editors:__ TBD
##### __Production editors:__ TBD
##### __Our 2023 Sponsors:__ TBD

## Tutorial Objectives:

In this second tutorial we will compute the precipitation levels associated with the 50-, 100- and 500-year events. The 100-year event is a precipitation level that we expect to be exceeded on average only once every 100 years; in other words, a storm event with a 1% chance of happening in any given year. Similarly, the 2-year event has a 50% chance of happening in any given year. These return periods, as we call them, are often used by policymakers to design policy and infrastructure. For example, a bridge should be able to withstand a 100-year flood event, an evacuation plan is designed for a 50-year earthquake, and a nuclear power plant should not collapse during a 10,000-year storm. There are two ways to compute the return level associated with a specific return period: 1) empirically and 2) through the probability density function (PDF) of a distribution. Here we will first compute them empirically.

By the end of the tutorial you will be able to:
- Compute empirical return levels
- Visualize a data record in a return-level plot
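
The link between a return period and an annual probability described in the introduction can be sketched in a few lines. This is a minimal illustration rather than part of the tutorial's workflow: the function names `annual_probability` and `prob_at_least_once` are hypothetical, and the second function assumes that the years are independent of each other.

```python
def annual_probability(return_period):
    """Chance that a T-year event occurs in any given year: P = 1 / T."""
    return 1.0 / return_period


def prob_at_least_once(return_period, years):
    """Chance of seeing at least one T-year event within `years` years.

    Assumption: each year is independent, with the same annual probability.
    """
    p = annual_probability(return_period)
    return 1.0 - (1.0 - p) ** years


for T in (2, 50, 100, 500):
    print(f"{T:>3}-year event: annual chance {annual_probability(T):.3f}, "
          f"chance within 100 years {prob_at_least_once(T, 100):.3f}")
```

Under these assumptions a 100-year event is far from guaranteed to occur exactly once per century: there is roughly a 63% chance of seeing at least one in any 100-year window.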

## Setup

In [None]:
# Installs

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns
import pandas as pd
from scipy import stats

In [None]:
import os, pooch

fname = 'precipitationGermany_1920-2022.csv'
if not os.path.exists(fname):
    url = "https://osf.io/xs7h6/download"
    fname = pooch.retrieve(url, known_hash=None)

data = pd.read_csv(fname, index_col=0).set_index('years')

data.columns=['precipitation']
precipitation = data.precipitation

## Data investigation

First open the precipitation record and plot it over time: 

In [None]:
precipitation.plot.line(style='.-')
plt.ylabel('annual maximum daily precipitation (mm/day)')

In this tutorial we will compute the return period for each event in the dataset (the maximum precipitation value of each year). To do this we first need to rank the precipitation levels from
high to low. Save the sorted data in a column called `sorted` of a DataFrame called
`precip_df`, which has as many rows as there are data entries.

In a second column, `ranks`, store the rank of each value (highest = 1; lowest = 103).


In [None]:
precip_df = pd.DataFrame(index=np.arange(data.precipitation.size))

In [None]:
precip_df['sorted'] = np.sort(data.precipitation)[::-1]

In [None]:
# ranks run from 1 (highest value) to n (lowest value)
precip_df['ranks'] = np.arange(1, data.precipitation.size + 1)

As you might have noticed, some precipitation values appear more than once. These should share
the same rank, so we fix that first: we can use the SciPy function `rankdata` to find the
rank of each value. This function gives equal values the same rank (by default, the average of the ranks they would otherwise occupy). Because `rankdata` returns the ranks in the original data order, they also
have to be sorted before we put them in our DataFrame.


In [None]:
precip_df['ranks_sp'] = np.sort(stats.rankdata(-data.precipitation))
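
To see how `rankdata` treats ties, here is a small sketch on a toy array. The values are made up for illustration and are not taken from the precipitation record. Negating the data makes the largest value receive rank 1.

```python
import numpy as np
from scipy import stats

# Toy values (illustrative only); 50.0 appears twice
toy = np.array([30.0, 50.0, 50.0, 20.0])

# Default method='average': tied values share the mean of their ranks
ranks_average = stats.rankdata(-toy)

# method='min': tied values all receive the lowest rank of the group
ranks_min = stats.rankdata(-toy, method='min')

print(ranks_average)  # ranks in the original data order
print(ranks_min)
print(np.sort(ranks_average))  # sorted, to align with values sorted high to low
```

With the default method, the two tied maxima both get rank 1.5, which is the behavior used in the cell above. Whether `'average'` or another tie-breaking method is preferable depends on the application.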

We can compute the empirical probability of exceedance (P) by dividing the rank (r) by the total
number of values (n) plus 1, that is, P = r / (n + 1).


In [None]:
n = data.precipitation.size
P = precip_df['ranks_sp']/(n+1)
precip_df['exceedance'] = P

The return period (T) and the probability of exceedance (P) are related through T = 1/P:

In [None]:
precip_df['period'] = 1 / precip_df['exceedance']
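
As a quick sanity check of these two formulas, the sketch below evaluates them for a handful of ranks. It assumes a record of n = 103 annual maxima (1920 to 2022, as the file name suggests); the selected ranks are arbitrary and ties are ignored.

```python
import numpy as np

n = 103  # assumed record length: one annual maximum per year, 1920-2022
ranks = np.array([1, 2, 10, 52, 103])  # a few illustrative ranks (1 = largest)

exceedance = ranks / (n + 1)  # P = r / (n + 1)
period = 1 / exceedance       # T = 1 / P = (n + 1) / r

for r, p, t in zip(ranks, exceedance, period):
    print(f"rank {r:3d}: P = {p:.3f}, T = {t:.1f} years")
```

The largest value is assigned a return period of n + 1 = 104 years, so an empirical estimate from this record cannot reach return periods much longer than the record itself, such as the 500-year event.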

In [None]:
precip_df

Now that we know the return periods of each annual maximum precipitation level we can create
a return level plot - a plot of return levels against return periods:

In [None]:
plt.plot(precip_df['period'],precip_df['sorted'],'o')
plt.xlabel('Return period (years)')
plt.ylabel('Return level (mm/day)')
plt.gca().set_xscale('linear')

## Exercise:
1. Often we think about return periods on a logarithmic scale, that is, we talk about 1-year, 10-year and 100-year events. Modify the plot above so that the x-axis uses a logarithmic rather than a linear scale. How does the perception of the plot change?
2. This is only one data record. Therefore, the estimate of 'extreme values' relies on few data points. How confident do you think we can be about return periods / levels of extreme values based on this plot? What could be done to make the estimate more robust?
3. Optional: Feel free to try one of your ideas from (2)

### Solution part 1:

In [None]:
plt.plot(precip_df['period'],precip_df['sorted'],'o')
plt.xlabel('Return period (years)')
plt.ylabel('Return level (mm/day)')
plt.gca().set_xscale('log')

### Solution part 3 (optional):

In [None]:
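def empirical_period(data):
    """Return the values of `data` sorted high to low, indexed by empirical return period."""
    n = data.size
    df = pd.DataFrame(index=np.arange(n))
    # values from high to low
    df['sorted'] = np.sort(data)[::-1]
    # simple ranks (highest = 1), kept for reference
    df['ranks'] = np.arange(1, n + 1)
    # tie-aware ranks: equal values share their average rank
    df['ranks_sp'] = np.sort(stats.rankdata(-data))
    # exceedance probability P = r / (n + 1) and return period T = 1 / P
    df['exceedance'] = df['ranks_sp'] / (n + 1)
    df['period'] = 1 / df['exceedance']

    return df[['period','sorted']].set_index('period')['sorted']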

In [None]:
# Bootstrap: resample the record with replacement 1000 times and
# overlay the return-level curve of each resample
for i in range(1000):
    resampled = np.random.choice(
        data.precipitation.values, size=data.precipitation.size, replace=True
    )
    empirical_period(resampled).plot(style='C0-', alpha=0.1)
# original record on top, in black
plt.plot(precip_df['period'],precip_df['sorted'],'ko')
plt.xlabel('Return period (years)')
plt.ylabel('Return level (mm/day)')
plt.gca().set_xscale('log')
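
The spread of the resampled curves can also be summarized as a percentile band. The sketch below is one possible way to do this, not the tutorial's method: it uses a synthetic Gumbel-distributed sample as a stand-in for the precipitation record (the parameters are made up), ignores ties by using simple ranks, and takes the 5th and 95th percentiles of the k-th largest value across resamples. The names `sample`, `boot`, `lower` and `upper` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the annual maxima (NOT the German record)
sample = rng.gumbel(loc=30, scale=8, size=103)

n = sample.size
n_boot = 1000

# Each row is one resample with replacement, sorted from high to low
boot = np.sort(rng.choice(sample, size=(n_boot, n), replace=True), axis=1)[:, ::-1]

# 5th and 95th percentile of the k-th largest value across all resamples
lower, upper = np.percentile(boot, [5, 95], axis=0)

# Return period for simple ranks 1..n (ties ignored in this sketch)
period = (n + 1) / np.arange(1, n + 1)

print(f"Largest value: {np.sort(sample)[-1]:.1f}, "
      f"bootstrap 5-95% range: {lower[0]:.1f} to {upper[0]:.1f}")
```

The band could then be drawn with `plt.fill_between(period, lower, upper, alpha=0.3)` on a logarithmic x-axis. Note that a bootstrap can only resample values that were observed, so it cannot produce levels above the record maximum and tends to understate the uncertainty at the longest return periods.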