<img src='./img/LogoWekeo_Copernicus_RGB_0.png' align='right' width='20%'></img>

# Tutorial: analysis of climate extremes
In this tutorial we will use the WEkEO JupyterHub to access and analyse data from the Copernicus Climate Change Service (C3S) through the WEkEO HDA API client. We will analyse climate extremes, focusing on the area surrounding the city of Lille in northern France.

The tutorial comprises the following steps:

1. [Search and download data](#search_download) using the HDA API: We will focus on ERA5 reanalysis data of 2 metre (near-surface) temperature.
2. [Read data](#read_data): Once downloaded, we will read and understand the data, including its variables and coordinates.
3. [View and plot](#view_plot) maximum temperatures in September 2020.
4. [Calculate averages](#calculate_averages) of the maximum daily temperatures in September over the period from 1979 to 2019, and compare these with our findings for 2020.

<img src='./img/climate_extremes.png' align='center' width='100%'></img>

## <a id='search_download'></a>1. Search and download data

Before we begin we must prepare our environment. This includes installing the Harmonised Data Access (HDA) API client and importing the various Python libraries that we will need.

#### Install HDA API

To install the HDA API, run the following command. We use an exclamation mark to pass the command to the shell (not to the Python interpreter).

In [None]:
!pip install -U hda

Please verify that the following requirements are installed before moving on to the next step:
   - Python 3
   - requests
   - tqdm

#### Import libraries

We will be working with data in NetCDF format. To best handle this data we need a number of libraries for working with multidimensional arrays, in particular Xarray. We will also need libraries for plotting and viewing data, in particular Matplotlib and Cartopy.

In [None]:
# Libraries for working with multidimensional arrays
import numpy as np
import xarray as xr

# Libraries for plotting and visualising data
import matplotlib.path as mpath
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER
import cartopy.feature as cfeature

The `hda` package provides a fully compliant Python 3 client that can be used to search and download products using the Harmonised Data Access (HDA) WEkEO API.
HDA is a RESTful interface that allows users to search and download WEkEO datasets.
Documentation about its usage can be found at https://www.wekeo.eu/.

In [None]:
from hda import Client

#### Search for data

Go to [WEkEO DATA](https://wekeo.eu/data?view=catalogue). Clicking the + to add a layer opens a catalogue search. Here you can use free text, or you can use the filter options on the left to refine your search by satellite platform, sensor, Copernicus service, area (region of interest), general time period (past or future), as well as through a variety of flags.

You can click on the dataset you are interested in and you will be guided to a range of details including the dataset temporal and spatial extent, collection ID, and metadata.

Now search for the product `ERA5 hourly data on single levels from 1979 to present`. You can find it more easily by selecting 'ERA5 hourly data on single' in the 'COPERNICUS SERVICE' filter group. 

Once you have found it, select 'Details' to read the dataset description.

<br>

<div style='text-align:center;'>
<figure><img src='./img/WEKEO_ERA5_hourly_data.png' width='70%' />
    <figcaption><i>WEkEO interface to search for datasets</i></figcaption>
</figure>
</div>

The dataset description provides the following information:
- **Abstract**, containing a general description of the dataset,
- **Classification**, including the Dataset ID,
- **Resources**, such as a link to the Product Data Format Specification guide, and JSON metadata,
- **Contacts**, where you can find further information about the data source from its provider.

You need the `Dataset ID` to request data from the Harmonised Data Access API. 

<br>

<div style='text-align:center;'>
<figure><img src='./img/ERA5_hourly_info.png' width='50%' />
    <figcaption><i>Dataset information on WEkEO</i></figcaption>
</figure>
</div>
<br>

Let's store the Dataset ID as a variable called `dataset_id` to be used later.

In [None]:
dataset_id = "EO:ECMWF:DAT:REANALYSIS_ERA5_SINGLE_LEVELS"

Now select `Add to map` in the data description to add the selected dataset to the list of layers in your map view. Once the dataset appears as a layer, select the `subset and download` icon. This will enable you to specify the variables, temporal and in some cases geographic extent of the data you would like to download. Select the dataset information and then select `NetCDF` as format.

Now select `Show API request`. This will show the details of your selection in `JSON` format. If you now select `Copy`, you can copy these details to the clipboard and then either paste them into a text file to create a `JSON` file (see example [here](./SeaLevel_data_descriptor.json)), or paste them directly into the cell below.

The Harmonised Data Access API can read this information, which is in the form of a dictionary.

<br>

<div style='text-align:center;'>
<figure><img src='./img/ERA5_hourly_params_json.png' width='60%' />
    <figcaption><i>Displaying a JSON query from a request made to the Harmonised Data Access API through the data portal</i></figcaption>
</figure>
</div>
<br>

#### Configure the WEkEO API Authentication

In order to interact with WEkEO's Harmonised Data Access API, you first need to make sure that the file `$HOME/.hdarc` exists and contains the URL of the API endpoint together with your user name and password.

For example, to check whether the file `.hdarc` exists in the `$HOME` directory, open a terminal and list the hidden files in that directory.

Then create or edit the file `$HOME/.hdarc` (in your Unix/Linux environment) so that it holds the API URL and the credentials of your WEkEO account.

If you don't have a WEkEO account, please register at the WEkEO registration page https://my.wekeo.eu/web/guest/user-registration.
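
The check described above can also be done from Python. The following is a minimal sketch, not part of the `hda` package: the helper names `read_hdarc` and `missing_keys` are hypothetical, and the `key: value` layout with the keys `url`, `user` and `password` is an assumption based on the description above, so please confirm the exact format in the client documentation.

```python
from pathlib import Path


def read_hdarc(path):
    """Parse simple 'key: value' lines from a credentials file into a dict."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and lines without a key/value separator
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so URLs keep their 'https://' part
        key, value = line.split(":", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings, required=("url", "user", "password")):
    """Return the required keys (assumed names) that are absent or empty."""
    return [key for key in required if not settings.get(key)]


hdarc = Path.home() / ".hdarc"
if hdarc.exists():
    print("Missing entries:", missing_keys(read_hdarc(hdarc)) or "none")
else:
    print(f"{hdarc} not found; create it before using the client.")
```

The sketch only reports which entries are missing; it never prints the stored password.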

#### Load data descriptor file and request data

The Harmonised Data Access API can read your data request from a dictionary. In this dictionary, you can describe the dataset you are interested in downloading.

In [None]:
data = {
  "datasetId": "EO:ECMWF:DAT:REANALYSIS_ERA5_SINGLE_LEVELS",
  "boundingBoxValues": [
    {
      "name": "area",
      "bbox": [
        51,
        3,
        50,
        4
      ]
    }
  ],
  "stringChoiceValues": [
    {
      "name": "format",
      "value": "netcdf"
    }
  ],
  "multiStringSelectValues": [
    {
      "name": "product_type",
      "value": [
        "reanalysis"
      ]
    },
    {
      "name": "year",
      "value": [
        "1979",
        "1980",
        "1981",
        "1982",
        "1983",
        "1984",
        "1985",
        "1986",
        "1987",
        "1988",
        "1989",
        "1990",
        "1991",
        "1992",
        "1993",
        "1994",
        "1995",
        "1996",
        "1997",
        "1998",
        "1999",
        "2000",
        "2001",
        "2002",
        "2003",
        "2004",
        "2005",
        "2006",
        "2007",
        "2008",
        "2009",
        "2010",
        "2011",
        "2012",
        "2013",
        "2014",
        "2015",
        "2016",
        "2017",
        "2018",
        "2019",
        "2020"
      ]
    },
    {
      "name": "variable",
      "value": [
        "2m_temperature"
      ]
    },
    {
      "name": "month",
      "value": [
        "09"
      ]
    },
    {
      "name": "day",
      "value": [
        "10",
        "11",
        "12",
        "13",
        "14",
        "15",
        "16",
        "17",
        "18",
        "19",
        "20",
        "21",
        "22",
        "23",
        "24",
        "25",
        "26",
        "27",
        "28",
        "29",
        "30",
        "01",
        "02",
        "03",
        "04",
        "05",
        "06",
        "07",
        "08",
        "09"
      ]
    },
    {
      "name": "time",
      "value": [
        "00:00",
        "04:00",
        "08:00",
        "12:00",
        "16:00",
        "20:00",
        "01:00",
        "05:00",
        "09:00",
        "13:00",
        "17:00",
        "21:00",
        "02:00",
        "06:00",
        "10:00",
        "14:00",
        "18:00",
        "22:00",
        "03:00",
        "07:00",
        "11:00",
        "15:00",
        "19:00",
        "23:00"
      ]
    }
  ]
}
data
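
The long `year`, `day` and `time` lists in the request above do not have to be typed by hand. The sketch below generates them and shows one way to place a list into a request dictionary of the same shape. The helper `set_selection` and the small `example` dictionary are hypothetical, and the sketch assumes that the order of values within each list does not matter to the API.

```python
# Zero-padded strings in the same style as the request above
years = [str(year) for year in range(1979, 2021)]   # "1979" to "2020"
days = [f"{day:02d}" for day in range(1, 31)]       # "01" to "30"
times = [f"{hour:02d}:00" for hour in range(24)]    # "00:00" to "23:00"


def set_selection(request, name, values):
    """Return a copy of the request with one multi-select list replaced."""
    updated = dict(request)
    updated["multiStringSelectValues"] = [
        {**entry, "value": list(values)} if entry["name"] == name else entry
        for entry in request["multiStringSelectValues"]
    ]
    return updated


# Small stand-in for the full request, for illustration only
example = {
    "datasetId": "EO:ECMWF:DAT:REANALYSIS_ERA5_SINGLE_LEVELS",
    "multiStringSelectValues": [
        {"name": "year", "value": []},
        {"name": "month", "value": ["09"]},
    ],
}
example = set_selection(example, "year", years)
print(len(years), len(days), len(times))
```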

#### Download requested data

As a final step, you can use the client directly to search for and download the data, as in the following example.

In [None]:
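# Create the client; it reads the API URL and credentials from $HOME/.hdarc
c = Client(debug=True)

# Search for the products matching our request, then download them
matches = c.search(data)
print(matches)
matches.download()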

## <a id='read_data'></a>2. Read Data

Now that we have downloaded the data, we can start to play ...

We have requested the data in NetCDF format. This is a commonly used format for array-oriented scientific data. 

To read and process this data we will make use of the Xarray library. Xarray is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun! We will read the data from our NetCDF file into an Xarray **"dataset"**.

In [None]:
filename = r'./adaptor.mars.internal-1639129639.6692817-20965-2-bc1c1491-3311-42aa-b3de-5ba948fac15d.nc'
# Create Xarray Dataset
ds = xr.open_dataset(filename)

Now we can query our newly created Xarray dataset ...

In [None]:
ds

We see that the dataset has one variable called **"t2m"**, which stands for "2 metre temperature", and four coordinates: **longitude**, **latitude**, **expver** and **time**. Expver stands for 'experiment version'. Data up until the end of 2019 has an expver value of 1 and is referred to as "operational data", while more recent data from 2020 has an expver value of 5 and is near-real time data. After a period of time, near-real time data passes to the operational dataset.

Select the icons to the right of the table above to expand the attributes of the coordinates and data variables. What are the units of the temperature data?

While an Xarray **dataset** may contain multiple variables, an Xarray **data array** holds a single variable (which may still be multi-dimensional) and its coordinates. To make the processing of the **t2m** data easier, we convert it into an Xarray data array:

In [None]:
da = ds['t2m']

Let's convert the units of the 2m temperature data from Kelvin to degrees Celsius. The formula for this is simple: degrees Celsius = Kelvin - 273.15

In [None]:
t2m_C = da - 273.15

## <a id='view_plot'></a>3. View daily maximum 2m temperature for September 2020
We will plot the maximum values of 2m temperature over the subset area of Northern France.

First we average over the subset area:

In [None]:
Lille_t2m = t2m_C.mean(["longitude", "latitude"])

Now we select only the data for 2020, and only experiment version 5 (near-real time version of ERA5):

In [None]:
Lille_2020 = Lille_t2m.sel(expver=5)
Lille_2020 = Lille_2020.sel(time='2020')

We can now calculate the max daily 2m temperature for each day in September 2020:

In [None]:
Lille_2020_max = Lille_2020.groupby('time.day').max('time')
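
To make the grouping step concrete, here is a minimal pure-Python sketch of the same idea: hourly values are grouped by day of month and the maximum of each group is kept. The temperatures are made up for illustration, and `daily_max` is a hypothetical helper, not an Xarray function. Note that grouping by day of month only works here because the selection contains a single month of a single year.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up hourly temperatures (degrees Celsius) for 14 and 15 September 2020
start = datetime(2020, 9, 14)
hourly = [(start + timedelta(hours=h), 15.0 + (h % 24) * 0.5) for h in range(48)]
# Make the second day 3 degrees warmer
hourly = [(t, v + 3.0) if t.day == 15 else (t, v) for t, v in hourly]


def daily_max(samples):
    """Group (timestamp, value) pairs by day of month and keep each maximum."""
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.day].append(value)
    return {day: max(values) for day, values in sorted(by_day.items())}


maxima = daily_max(hourly)
# Day of month with the highest daily maximum
hottest_day = max(maxima, key=maxima.get)
print(maxima, hottest_day)
```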

Let's plot the results in a chart:

In [None]:
x = Lille_2020_max.day
y = (np.around(Lille_2020_max.values, 0)).astype(int)

fig = plt.figure(figsize=(10,5))
ax = plt.subplot()
ax.set_ylabel('t2m (Celsius)')
ax.set_xlabel('day')
ax.plot(x, y)
ax.grid(linestyle='--')
for i,j in zip(x,y):
    ax.annotate(str(j),xy=(i,j))
ax.set_title('Max daily t2m for Sep 2020')

In [None]:
print('The maximum temperature in September 2020 in this area was', 
      np.around(Lille_2020_max.max().values, 1), 'degrees Celsius.')

Which day in September had the highest maximum temperature?

Is this typical for Northern France? How does this compare with the long term average? We will seek to answer these questions in the next section.

## <a id='calculate_averages'></a>4. Calculate long term average of 2m temperature for September over Northern France
We will now seek to discover just how high the temperature for Lille in mid September 2020 was when compared to typical values expected in this region at this time of year. To do that we will calculate the mean and standard deviation of maximum daily 2m temperature for each day in September for the period of 1979 to 2019, and compare these with our values for 2020.

First we select all data prior to 2020. This data has experiment version 1 (consolidated version of ERA5).

In [None]:
Lille_past = Lille_t2m.sel(expver=1)

Now we calculate the climatology for this data, i.e. the mean and standard deviation of maximum daily values for each of the days in September for a period of several decades (from 1979 to 2019).

To do this, we first have to extract the maximum daily value for each day in the time series:

In [None]:
Lille_max = Lille_past.resample(time='D').max()

Then we can calculate the mean and standard deviation of this for the 41-year time series (1979 to 2019) for each day in September:

In [None]:
Lille_m = Lille_max.groupby('time.day').mean('time')
Lille_sd = Lille_max.groupby('time.day').std('time')
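
As a rough illustration of what this climatology step computes, the sketch below groups made-up daily maxima by day of month across several years and returns the mean and standard deviation for each day. The data and the helper `climatology` are hypothetical. The population standard deviation (`pstdev`) is used because Xarray's `std` defaults to `ddof=0`.

```python
from statistics import mean, pstdev

# Made-up daily maxima (degrees Celsius), keyed by (year, day of September)
daily_max = {
    (1979, 15): 18.0, (1980, 15): 22.0, (1981, 15): 20.0,
    (1979, 16): 17.0, (1980, 16): 19.0, (1981, 16): 21.0,
}


def climatology(values_by_year_day):
    """Return {day: (mean, standard deviation)} across all years."""
    by_day = {}
    for (_year, day), value in values_by_year_day.items():
        by_day.setdefault(day, []).append(value)
    return {day: (mean(vals), pstdev(vals)) for day, vals in sorted(by_day.items())}


for day, (m, sd) in climatology(daily_max).items():
    # Expected range: mean plus and minus one standard deviation
    print(f"{day} Sep: {m - sd:.1f} to {m + sd:.1f}")
```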

Let's plot this data. We will plot the mean plus and minus one standard deviation to have an idea of the expected range of maximum daily temperatures in this part of France in September:

In [None]:
y1 = Lille_m
y2 = Lille_m + Lille_sd
y2 = np.squeeze(y2.values)
y3 = Lille_m - Lille_sd
y3 = np.squeeze(y3.values)

fig = plt.figure(figsize=(10,5))
ax = plt.subplot()
ax.set_ylabel('t2m (Celsius)')
ax.set_xlabel('day')
ax.plot(Lille_m.day, y1, color='green', label='t2m mean, shading: +/- SD')
ax.plot(Lille_m.day, y2, color='white')
ax.plot(Lille_m.day, y3, color='white')
ax.fill_between(Lille_m.day, y2, y3, alpha=0.1)
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels)
ax.set_title('t2m climatology for Sep from 1979 to 2019')

What is the typical range of maximum 2m temperature values for September 15?

We will now look more closely at the probability distribution of maximum temperatures for 15 September in this time period. To do this, we will first select only the max daily temperature for 15 September, for each year in the time series:

In [None]:
Lille_max = Lille_max.dropna('time', how='all')
Lille_15 = Lille_max[14::30]
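
The slice `[14::30]` relies on position rather than on dates, so it is worth checking what it picks. The sketch below rebuilds the sequence of September dates that remains after dropping the empty days and applies the same slice. It assumes that every year contributes exactly 30 September days in chronological order; if any day were missing, the positions would shift.

```python
from datetime import date

first_year, last_year = 1979, 2019

# One entry per September day, in chronological order
september_days = [
    date(year, 9, day)
    for year in range(first_year, last_year + 1)
    for day in range(1, 31)
]

# Zero-based position 14 is 15 September; a step of 30 moves to the next year
picked = september_days[14::30]
print(picked[0], picked[-1], len(picked))
```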

We will then plot the histogram of this:

In [None]:
Lille_15.plot.hist()

Look at the range of maximum temperatures for 15 September in the period from 1979 to 2019. Has the temperature in this period ever exceeded that of 15 September 2020?

The histogram shows the distribution of maximum temperature of one day in each year of the time series, which corresponds to 41 samples. In order to increase the number of samples, let's plot the histogram of maximum temperatures on 15 September, plus or minus three days. This would increase our number of samples by a factor of seven.

To do this, we first need to produce an index that takes 15 Sep, plus or minus three days, from every year in the time series:

In [None]:
years = np.arange(41)
days_in_sep = np.arange(11,18)
index = np.zeros(287)
for i in years:
    index[i*7:(i*7)+7] = days_in_sep + (i*30)
index = index.astype(int)
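
The same index can be written as a single comprehension, which also makes it easy to verify which dates it selects. This is a sketch under the same assumption as before, namely 30 consecutive September days per year; the names `window` and `selected` are illustrative only.

```python
from datetime import date

n_years = 41            # 1979 to 2019 inclusive
window = range(11, 18)  # zero-based positions 11..17, i.e. 12 to 18 September

# Equivalent of the loop above: seven positions per year, offset by 30 per year
index = [year * 30 + offset for year in range(n_years) for offset in window]

# Map the positions back to calendar dates to check the selection
september_days = [date(1979 + y, 9, d) for y in range(n_years) for d in range(1, 31)]
selected = [september_days[i] for i in index]
print(len(index), selected[0], selected[-1])
```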

Then we apply this index to filter the array of max daily temperature from 1979 to 2019: 

In [None]:
Lille_7days = Lille_max.values[index]

Now we can plot the histogram of maximum daily temperatures in days 12-18 September from 1979-2019:

In [None]:
plt.figure(figsize=(10,5))
plt.hist(Lille_7days, bins = np.arange(10,32,1)) 
plt.title("histogram of max temperature in days 12-18 Sep from 1979-2019")
plt.xticks(np.arange(10,32,1))
plt.show()

Even in this increased temporal range, the maximum daily temperature still never reached that of 15 September 2020!
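
A statement like this can be checked numerically as well as visually. The sketch below counts how many past values reach a reference value. All numbers are made up for illustration and `count_exceedances` is a hypothetical helper; with the tutorial data, the samples would be the values in `Lille_7days` and the reference would be the 2020 maximum.

```python
def count_exceedances(samples, threshold):
    """Count the samples that are greater than or equal to the threshold."""
    return sum(1 for value in samples if value >= threshold)


# Made-up past daily maxima and reference value (degrees Celsius)
past_maxima = [18.2, 21.5, 24.9, 19.7, 27.3]
reference = 30.0

n_exceed = count_exceedances(past_maxima, reference)
fraction = n_exceed / len(past_maxima)
print(f"{n_exceed} of {len(past_maxima)} samples reach {reference} degrees Celsius")
```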

<hr>

<p><img src='./img/all_partners_wekeo.png' align='left' alt='Logo EU Copernicus' width='100%'></img></p>