In [1]:
import os
import pandas as pd
import subprocess

# Download and process System 5 seasonal data

This Python notebook calls various R scripts to download and process the System 5 data.

## 1. Download S5 data from Unican

In [2]:
# Define geographical domain. Note that you need to include at least 4 cells 
# in the S5 1 deg data, otherwise bilinear interpolation will fail
lats = '59.25,59.75' 
lons = '10.25,11.25' 
lake = '10.895,59.542'

# Define members
members = '1:25'

# Variables of interest
variables = ['psl',
             'tcc',    
             'uas',
             'vas',
             'tas',
             'tdps',
             'rsds',
             'rlds',
             'tp']

# Define daily aggregation function for each variable
agg_func = ['mean',
            'mean',
            'mean',
            'mean',
            'mean',
            'mean',
            'mean',
            'mean',
            'sum']

# Path to ERA5 tidied netCDF (from 05_download_era5.ipynb)
era5_nc = '/home/jovyan/shared/WATExR/ERA5/morsa_era5_merged.nc'
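
The settings above are passed to R as plain strings, so a mismatch between `variables` and `agg_func`, or a lake point outside the domain, would only surface partway through a long download. The following is a minimal sketch of some optional sanity checks, not part of the original workflow. It repeats the values from the cell above, and the helper `parse_pair` is hypothetical. It assumes `lake` is ordered `lon,lat`, which is inferred from the values falling inside `lons` and `lats` respectively.

```python
# Illustrative consistency checks; values mirror the settings cell above
lats = '59.25,59.75'
lons = '10.25,11.25'
lake = '10.895,59.542'
variables = ['psl', 'tcc', 'uas', 'vas', 'tas', 'tdps', 'rsds', 'rlds', 'tp']
agg_func = ['mean'] * 8 + ['sum']


def parse_pair(text):
    """Split an 'a,b' string into two floats (hypothetical helper)."""
    first, second = (float(part) for part in text.split(','))
    return first, second


# Pair each variable with its aggregation function; fail on a length mismatch
assert len(variables) == len(agg_func), 'variables and agg_func must align'
agg_by_var = dict(zip(variables, agg_func))

# Assumption: the lake string is ordered 'lon,lat'
lat_min, lat_max = parse_pair(lats)
lon_min, lon_max = parse_pair(lons)
lake_lon, lake_lat = parse_pair(lake)
lake_in_domain = (lon_min <= lake_lon <= lon_max) and (lat_min <= lake_lat <= lat_max)

print(agg_by_var)
print('Lake inside domain:', lake_in_domain)
```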

#### Added 10.03.2020

The original S5 data download for the Bayesian network (rows 10 to 17 in `07_s5_download_combos.xlsx`) used seasons defined explicitly for water quality, rather than weather forecasting. However, since weather variables are no longer part of the network, it makes sense to modify the S5 download to focus on the weather component.

The data downloaded originally are available here:

    shared/WATExR/orig_s5_bayes_net_datasets
    
The S5 data for the Bayesian network stored in the WATExR repository now correspond to rows 18 to 25 in `07_s5_download_combos.xlsx`.

In [3]:
# Read list of S5 dataset combinations for Norwegian case study
df = pd.read_excel('07_s5_download_combos.xlsx')
df = df.query('comment == "Added 10.03.2020"')

df

| | period | model_type | c4r_url | season | years | months | lead_month | comment |
|---|---|---|---|---|---|---|---|---|
| 16 | hindcast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | early_summer | 1993:2016 | 5,6,7 | 1 | Added 10.03.2020 |
| 17 | hindcast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | late_summer | 1993:2016 | 8,9,10 | 1 | Added 10.03.2020 |
| 18 | hindcast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | winter | 1993:2016 | 11,12,1 | 1 | Added 10.03.2020 |
| 19 | hindcast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | spring | 1993:2016 | 2,3,4 | 1 | Added 10.03.2020 |
| 20 | forecast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | early_summer | 2017:2019 | 5,6,7 | 1 | Added 10.03.2020 |
| 21 | forecast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | late_summer | 2017:2019 | 8,9,10 | 1 | Added 10.03.2020 |
| 22 | forecast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | winter | 2017:2019 | 11,12,1 | 1 | Added 10.03.2020 |
| 23 | forecast | bayes_net | http://meteo.unican.es/tds5/dodsC/Copernicus/S... | spring | 2017:2019 | 2,3,4 | 1 | Added 10.03.2020 |


For each row in the dataframe above, the script below calls `07_s5_download.R`, which performs the following operations:

 * Downloads the specified variables and members
 
 * Aggregates results to daily frequency using the functions provided
 
 * Converts units. **Note:** The unit conversions are handled by settings in `SYSTEM5_ecmwf_Seasonal_25Members_SFC.dic` which are not immediately obvious. See Sixto's response to the question [here](https://82.223.43.150/watexr/pl/7c3scxic5fyqjjpzkewi3xdwac) for full details
 
 * Performs bilinear interpolation of the daily data to the co-ordinates of the lake
 
 * Saves the raw R data object to `WATExR/Norway_Morsa/Data/Meteorological/RData`
 
 * Saves the raw data in CSV format to `WATExR/Norway_Morsa/Data/Meteorological/07_s5_seasonal`
 
Note that the code below will take **several hours** to run.
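
To illustrate why the domain must cover at least 4 cells, here is a minimal NumPy sketch of bilinear interpolation over a 2 x 2 patch of grid values. It is not the routine used by `07_s5_download.R`, whose internals are not shown here. The function `bilinear`, the grid coordinates and the patch values are all hypothetical.

```python
import numpy as np


def bilinear(lon, lat, lons, lats, values):
    """Bilinearly interpolate a 2 x 2 patch to the point (lon, lat).

    `lons` and `lats` hold the two ascending grid coordinates that enclose
    the point; `values` has shape (2, 2), indexed as [lat, lon].
    """
    (x0, x1), (y0, y1) = lons, lats
    # Fractional position of the point inside the patch, between 0 and 1
    tx = (lon - x0) / (x1 - x0)
    ty = (lat - y0) / (y1 - y0)
    # Interpolate along longitude on each row, then along latitude
    south = (1 - tx) * values[0, 0] + tx * values[0, 1]
    north = (1 - tx) * values[1, 0] + tx * values[1, 1]
    return (1 - ty) * south + ty * north


# Hypothetical 2 x 2 patch surrounding the lake (made-up values)
grid_lons = (10.0, 11.0)
grid_lats = (59.0, 60.0)
patch = np.array([[10.0, 12.0],
                  [14.0, 16.0]])

lake_value = bilinear(10.895, 59.542, grid_lons, grid_lats, patch)
print(lake_value)
```

All four surrounding values contribute to the result, so the calculation is undefined if fewer than four cells are available.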

In [4]:
# Loop over rows in df
for idx, row in df.iterrows():
    # Build command
    cmd = ['Rscript', 
           '--vanilla', 
           '07_s5_download.R',
           row['c4r_url'],
           ','.join(variables),
           members,
           lons,
           lats,
           row['months'],
           row['years'],
           str(row['lead_month']),
           ','.join(agg_func),
           lake,
           row['period'],
           row['model_type'],
           row['season'],
          ]
    
    subprocess.check_call(cmd)
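
Because each call can run for a long time, it may be worth printing the commands as a dry run before launching them. `subprocess` requires every element of the argument list to be a string, so the sketch below coerces each argument with `str`. The helper `build_download_cmd` is hypothetical, and the example row uses a placeholder URL and illustrative values rather than a real entry from `07_s5_download_combos.xlsx`.

```python
import shlex


def build_download_cmd(row, variables, agg_func, members, lons, lats, lake):
    """Assemble the Rscript call for one row, coercing every argument to str."""
    args = [
        'Rscript', '--vanilla', '07_s5_download.R',
        row['c4r_url'], ','.join(variables), members, lons, lats,
        row['months'], row['years'], row['lead_month'],
        ','.join(agg_func), lake,
        row['period'], row['model_type'], row['season'],
    ]
    return [str(arg) for arg in args]


# Placeholder row for illustration only (the URL and values are made up)
example_row = {
    'c4r_url': 'http://example.invalid/dataset',
    'months': '5,6,7',
    'years': '1993:2016',
    'lead_month': 1,
    'period': 'hindcast',
    'model_type': 'bayes_net',
    'season': 'early_summer',
}

cmd = build_download_cmd(
    example_row,
    variables=['tas', 'tp'],
    agg_func=['mean', 'sum'],
    members='1:25',
    lons='10.25,11.25',
    lats='59.25,59.75',
    lake='10.895,59.542',
)

# Dry run: show the command instead of executing it
print(shlex.join(cmd))
```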

## 2. Merge "hindcast" and "forecast" datasets

The "hindcast" (1993 - 2016) and "forecast" (2017 - 2019) datasets have different URLs on the Unican server and therefore must be downloaded separately. However, for the "common papers" protocol, the 2016-17 boundary has no special significance, as we're planning to calibrate and evaluate the models over multiple periods. The code below calls `07_merge_hindcast_forecast.R`, which merges the data for each `model_type` and `season` in the dataframe above and saves the result (both as an R object and as a CSV).

In [5]:
# Loop over data combos
for idx, row in df[['model_type', 'season']].drop_duplicates().iterrows():
    # Build command
    cmd = ['Rscript', 
           '--vanilla', 
           '07_merge_hindcast_forecast.R',
           ','.join(variables),
           row['model_type'],
           row['season'],
           members,
          ]
    
    subprocess.check_call(cmd)    
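
Conceptually, the merge concatenates the two periods along the time axis. The pandas sketch below shows the idea on made-up data. It is not a translation of `07_merge_hindcast_forecast.R`, and the column layout (`date`, `member`, `tas`) is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the per-period outputs (made-up values)
hindcast = pd.DataFrame({
    'date': pd.to_datetime(['2016-05-01', '2016-05-02']),
    'member': [1, 1],
    'tas': [9.8, 10.4],
})
forecast = pd.DataFrame({
    'date': pd.to_datetime(['2017-05-01', '2017-05-02']),
    'member': [1, 1],
    'tas': [11.1, 10.9],
})

# Stack the two periods and order the result by member and date
merged = (
    pd.concat([hindcast, forecast], ignore_index=True)
    .sort_values(['member', 'date'])
    .reset_index(drop=True)
)

# The two periods should not overlap in time for any member
assert not merged.duplicated(['member', 'date']).any()
print(merged)
```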

## 3. Processing of ERA5 data

The code below takes the ERA5 netCDF generated by `05_download_era5.ipynb` and converts it to `.rda` format for convenient use with the other climate4R datasets. The script `07_process_era5.R` performs the following operations:

 * Aggregates variables to daily frequency
 
 * Converts units for temperature, dewpoint temperature, radiation and precipitation. **Note:** The aggregation functions differ from those in Section 1 because the `.dic` file used there does not apply to the ERA5 data (which is stored differently), so the aggregations are performed directly

 * Performs bilinear interpolation of the daily data to the co-ordinates of the lake
 
 * Saves the raw R data object to `WATExR/Norway_Morsa/Data/Meteorological/RData`
 
 * Saves the raw data in CSV format to `WATExR/Norway_Morsa/Data/Meteorological/06_era5`

In [6]:
# Define daily aggregation function for each variable
era5_agg_func = ['mean',
                 'mean',
                 'mean',
                 'mean',
                 'mean',
                 'mean',
                 'sum',
                 'sum',
                 'sum']

# Build command
cmd = ['Rscript', 
       '--vanilla', 
       '07_process_era5.R',
       era5_nc,       
       ','.join(variables),
       ','.join(era5_agg_func),
       lake,
      ]

res = subprocess.check_call(cmd)  
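
The per-variable aggregation can be pictured with a small pandas example: each variable is grouped by calendar day and reduced with its own function. This is a sketch on made-up 6-hourly values, not the code in `07_process_era5.R`, and it only covers two of the variables.

```python
import pandas as pd

# Made-up 6-hourly values covering two days
times = pd.date_range('2019-05-01', periods=8, freq=pd.Timedelta(hours=6))
sub_daily = pd.DataFrame(
    {
        'tas': [2.0, 4.0, 8.0, 6.0, 3.0, 5.0, 9.0, 7.0],
        'tp': [0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 2.0, 0.0],
    },
    index=times,
)

# One aggregation function per variable: mean for states, sum for accumulations
agg_by_var = {'tas': 'mean', 'tp': 'sum'}

# Group by calendar day (timestamps truncated to midnight) and aggregate
daily = sub_daily.groupby(sub_daily.index.normalize()).agg(agg_by_var)
print(daily)
```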

## 4. Bias correct System 5

`07_bias_correct_s5.R` takes the merged S5 data for each `season` and `model_type` (i.e. all relevant months in 1993 to 2019), aligns it with the processed ERA5 data for the same period (from step 3, above) and applies bias correction **with "leave-one-out" cross validation**. Results are saved to the `RData` and `07_s5_seasonal` folders with the suffix `_bc`.

In [7]:
# Loop over data combos
for idx, row in df[['model_type', 'season']].drop_duplicates().iterrows():
    # Build command
    cmd = ['Rscript', 
           '--vanilla', 
           '07_bias_correct_s5.R',
           ','.join(variables),
           row['model_type'],
           row['season'],
           members,
          ]
    
    subprocess.check_call(cmd)  
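
"Leave-one-out" cross validation means that the correction applied to a given year is estimated without using that year, so the corrected values are not fitted to the observations they are later compared against. The sketch below illustrates this with a simple mean-bias shift on made-up data. The correction method actually used by `07_bias_correct_s5.R` is not specified here, so the mean shift is a stand-in, and `loo_mean_bias_correct` is a hypothetical name.

```python
import numpy as np


def loo_mean_bias_correct(model, obs):
    """Shift each year's model value by the mean bias of all other years."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    corrected = np.empty_like(model)
    for i in range(model.size):
        # Training set: every year except the one being corrected
        train = np.arange(model.size) != i
        bias = model[train].mean() - obs[train].mean()
        corrected[i] = model[i] - bias
    return corrected


# Made-up seasonal means, one value per year
model = [12.0, 13.0, 11.0, 14.0]
obs = [10.0, 11.5, 9.5, 12.0]

corrected = loo_mean_bias_correct(model, obs)
print(corrected)
```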

## 5. Units reference

All datasets saved by the workflow above should have the following units:

    psl     Surface pressure (Pa)
    tcc     Total cloud cover (fraction between 0 and 1; dimensionless)
    uas     10 metre U wind component (m.s-1)
    vas     10 metre V wind component (m.s-1)
    tas     2 metre temperature (°C)
    tdps    2 metre dewpoint temperature (°C)
    rsds    Surface solar radiation downwards (W.m-2)
    rlds    Surface thermal radiation downwards (W.m-2)
    tp      Total precipitation (mm)
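
For reference, the conversions needed to reach these units from the native ERA5 units (temperatures in K, precipitation in m, radiation accumulated in J m-2) can be sketched as below. The function names are hypothetical, and the assumption that radiation is summed over a full day before conversion is inferred from the `sum` aggregation used in Section 3 rather than taken from the R scripts.

```python
SECONDS_PER_DAY = 86400.0


def kelvin_to_celsius(kelvin):
    """Convert a temperature from K to degrees Celsius."""
    return kelvin - 273.15


def metres_to_mm(metres):
    """Convert a precipitation depth from m to mm."""
    return metres * 1000.0


def daily_joules_to_watts(joules_per_m2):
    """Convert energy accumulated over one day (J m-2) to a mean flux (W m-2)."""
    return joules_per_m2 / SECONDS_PER_DAY


print(kelvin_to_celsius(283.15))
print(metres_to_mm(0.0025))
print(daily_joules_to_watts(8.64e6))
```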