# Working with the AgERA5 dataset

In this notebook we will:
- Download several parameters from the AgERA5 dataset for the period 2010-2021
- Select one point representing the city of Graz
- Save the result in monthly netCDF files

Besides the imported libraries, this notebook needs the **dask** and **netcdf4** Python libraries to be installed. **xarray** uses them internally to work with netCDF files.

Before starting, make sure you have an account on the [Climate Data Store](https://cds.climate.copernicus.eu/cdsapp#!/home) and follow the instructions on [how to install the CDS API key](https://cds.climate.copernicus.eu/api-how-to).
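
As a quick sanity check before creating the client, the sketch below looks for a `.cdsapirc` file in the home directory, which is where `cdsapi` looks for the URL and key by default. The helper name `has_cds_credentials` is made up for illustration and is not part of `cdsapi`.

```python
from pathlib import Path

def has_cds_credentials(home=None):
    """Return True if a .cdsapirc file exists in the given (or the user's) home directory."""
    # `home` is a hypothetical override, mainly useful for testing
    home = Path(home) if home is not None else Path.home()
    return (home / '.cdsapirc').is_file()

if not has_cds_credentials():
    print('No ~/.cdsapirc found; follow the CDS API key instructions first.')
```

This only checks that the file exists; it does not verify that the key inside is valid.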

In [1]:
# pip install --upgrade metpy
# pip install cdsapi
# conda install -c conda-forge xarray dask netCDF4 bottleneck

In [49]:
import xarray as xr
import cdsapi
import zipfile
from pathlib import Path
import os

c = cdsapi.Client()

The first function downloads one month of data as daily netCDF files, delivered in a zip archive.  
Its arguments are: the path to the directory where the data should be saved, the year, the month, the variable, and the statistic computed over the 24-hour period (minimum, maximum, mean, etc.).  

You can check which statistics are available for which parameters in the web download application for this dataset:  
https://cds.climate.copernicus.eu/terms#!/dataset/sis-agrometeorological-indicators?tab=form  
On that page you can select parameters and see an example API request.

In [48]:
def retrieve_a_month(path, year, month, variable, statistic):
    # the zip is named by year and month only, e.g. '2010_01.zip'
    filename = path + year + '_' + month + '.zip'
    c.retrieve(
        'sis-agrometeorological-indicators',
        {
            'format': 'zip',
            'variable': variable,
            'statistic': statistic,
            'year': year,
            'month': month,
            'day': [
                '01', '02', '03',
                '04', '05', '06',
                '07', '08', '09',
                '10', '11', '12',
                '13', '14', '15',
                '16', '17', '18',
                '19', '20', '21',
                '22', '23', '24',
                '25', '26', '27',
                '28', '29', '30',
                '31',
            ],
        },
        filename)
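
The request above always lists days 01 to 31, whatever the month. If you would rather send only dates that exist, a minimal sketch using the standard library could build the list per month; `days_in_month` is a hypothetical helper, and its result would be passed as the `'day'` value.

```python
import calendar

def days_in_month(year, month):
    """Return zero-padded day strings for the given year and month (both given as strings)."""
    # monthrange returns (weekday of first day, number of days in the month)
    n_days = calendar.monthrange(int(year), int(month))[1]
    return [f'{day:02d}' for day in range(1, n_days + 1)]

# February of a leap year has 29 days
print(len(days_in_month('2012', '02')))
```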

This function unzips the file into a directory named after the year. This was for my convenience, because I had other netCDF files in the parent directory.

In [50]:
def unzip_file(path,year,month):
    filename = path + year + '_' + month + '.zip'
    new_dir = path + year
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall(new_dir)
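
To see how the extraction behaves without downloading anything, the sketch below builds a tiny archive in a temporary directory and unpacks only its `.nc` members. The file names and the helper `extract_nc_files` are invented for illustration; real AgERA5 archives contain differently named files.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_nc_files(zip_path, target_dir):
    """Extract only the .nc members of an archive and return their sorted names."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        members = [m for m in zip_ref.namelist() if m.endswith('.nc')]
        zip_ref.extractall(target_dir, members=members)
    return sorted(members)

with tempfile.TemporaryDirectory() as tmp:
    # build a placeholder archive: one fake netCDF file and one unrelated file
    zip_path = Path(tmp) / '2010_01.zip'
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('day_01.nc', b'placeholder')
        zf.writestr('readme.txt', b'not a netCDF file')
    extracted = extract_nc_files(zip_path, Path(tmp) / '2010')
    print(extracted)
```

Filtering by extension keeps unrelated files in the archive out of the year directory, which matters because the workflow later opens everything matching `*.nc`.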

This function deletes all the netCDF files in the 'year' directory as well as the zip file. The zip file name does not contain the variable name, so the next download would otherwise reuse the same name.  
Again, this was just for my convenience.

In [22]:
def delete_month_files(path,year,month):
    path_to_year = path + year
    for f in Path(path_to_year).glob('*.nc'):
        try:
            f.unlink()
        except OSError as e:
            print("Error: %s : %s" % (f, e.strerror))
    zip_to_delete = path + year + '_' + month + '.zip'
    os.remove(zip_to_delete)
    


Here we define the years and months we want to download, the path where the data will be saved, and the parameters with their statistics.

In [53]:
years = [str(i) for i in range(2010, 2022)]
months = [
    '01', '02', '03',
    '04', '05', '06',
    '07', '08', '09',
    '10', '11', '12',
]

path = '../../data/'


# parameter = [('cloud_cover','24_hour_mean'),
#              ('2m_temperature','day_time_maximum'),
#              ('2m_temperature','night_time_minimum'),
#              ('10m_wind_speed','24_hour_mean'),
             
#              ('liquid_precipitation_duration_fraction',''),
#              ('solar_radiation_flux',''),             
#              ('precipitation_flux',''),
#              ('solid_precipitation_duration_fraction', ''),
#              ('2m_dewpoint_temperature','24_hour_mean'),
#              ('vapour_pressure', '24_hour_mean'),

# ]

parameter = [
             ('2m_dewpoint_temperature','24_hour_mean'),
             ('vapour_pressure', '24_hour_mean'),
#              (2m_relative_humidity)
]

We create an output directory where the final data will be saved.

In [54]:
graz_dir = path + 'graz_data/'
Path(graz_dir).mkdir(parents=True, exist_ok=True)

The parameters and statistics to download are those defined in the `parameter` list above.  
**Note** that not all combinations of parameters and statistics are allowed. You can see which statistics are available for each parameter in the CDS web download application.

And finally we run the whole workflow:

In [None]:
for year in years:
    new_dir = path + year
    Path(new_dir).mkdir(parents=True, exist_ok=True)
    for variable, statistic in parameter:
        for month in months:
            retrieve_a_month(path, year, month, variable, statistic)
            unzip_file(path, year, month)
            nc_directory = path + year
            # open all daily files of this month as one dataset
            data = xr.open_mfdataset(nc_directory + '/*.nc')
            # select the grid point nearest to Graz
            graz_data = data.sel(lat=47.0, lon=15.4, method='nearest')
            # build the output file name
            nc_name = graz_dir + 'graz_' + variable + '_' + statistic + '_' + year + '_' + month + '.nc'
            # save as netCDF
            graz_data.to_netcdf(nc_name)
            # close the source files before deleting them
            data.close()
            # delete the downloaded data, because keeping all of it takes a lot of space
            delete_month_files(path, year, month)
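
What `method = 'nearest'` does can be illustrated without xarray: for each coordinate axis, pick the grid value closest to the requested one. The sketch below is plain Python and assumes a regular 0.1° grid; the grid extents and the helper `nearest_index` are made up for illustration, and this is not how xarray implements the lookup internally.

```python
def nearest_index(coords, target):
    """Return the index of the coordinate value closest to target."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - target))

# hypothetical grid axes around Graz with 0.1 degree spacing
lats = [round(46.5 + 0.1 * i, 1) for i in range(11)]  # 46.5 ... 47.5
lons = [round(15.0 + 0.1 * i, 1) for i in range(11)]  # 15.0 ... 16.0

# a requested point that does not fall exactly on the grid
i = nearest_index(lats, 47.03)
j = nearest_index(lons, 15.44)
print(lats[i], lons[j])
```

The selected point is therefore a grid cell near Graz, not the exact requested coordinates; the coordinates actually used are stored in the `lat` and `lon` values of the saved files.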

Now we can open the new, small netCDF files with xarray again and convert them to a pandas DataFrame.

In [None]:
import os
import xarray as xr
import pandas as pd

# graz_dir = '../../output_intermediate/95_graz_data'
graz_dir = '../../data/graz_data'

df = pd.DataFrame()
for variable, statistic in parameter:
    tmp2 = pd.DataFrame()
#     for year in range(2010, 2022):
    for year in years:
        tmp1 = pd.DataFrame()
        for month in months:
            # use a separate name so the download `path` defined earlier is not overwritten
            nc_path = os.path.join(graz_dir,
                                   'graz_' + variable + '_' + statistic + f'_{year}_' + month + '.nc')
            try:
                tmp = xr.open_mfdataset(nc_path).to_dataframe().drop(['lon', 'lat'], axis=1)
                # drop repeated timestamps only; drop_duplicates() compares values
                # and would also remove different days that share the same value
                tmp = tmp[~tmp.index.duplicated()]
                tmp1 = pd.concat([tmp1, tmp], axis=0)
            except Exception as e:
                print(f'failed: {nc_path} ({e})')

        tmp2 = pd.concat([tmp2, tmp1], axis=0)
        print(f'Done year: {year}')

    print(f'Done: {variable}')
    # one column per variable, aligned on the time index
    df = df.join(tmp2, how='outer')

df = df[~df.index.duplicated()]
df

df.to_csv('AgERA5_4params_graz.csv')