# Data Wrangling

This notebook documents the data wrangling process for all the datasets. It records the commands run with NCO and other tools, but only the Python code is executed here.

---

 - Author:          
                    Luis F Patino Velasquez - MA
 - Date:            
                    Jun 2020
 - Version:         
                    1.0
 - Notes:            
                    files used in this notebook are in netCDF format
 - Jupyter version: 
                    jupyter core     : 4.7.1
                    jupyter-notebook : 6.4.0
                    qtconsole        : 5.1.1
                    ipython          : 7.25.0
                    ipykernel        : 6.0.3
                    jupyter client   : 6.1.12
                    jupyter lab      : 3.0.16
                    nbconvert        : 6.1.0
                    ipywidgets       : 7.6.3
                    nbformat         : 5.1.3
                    traitlets        : 5.0.5
 - Python version:  
                    3.8.5 

---

## Main considerations

* The main time period considered for all datasets is 2001 - 2019; this is based on the GPM-IMERG data.

## Setting Python Modules

In [None]:
from glob import glob
import xarray as xr
import matplotlib.pyplot as plt
from pathlib import Path
import datetime
import numpy as np

# Global vars
sep = ('''------------\n------------''')
print(sep)

# ERA5

ERA5 data was downloaded as 24 hourly steps (0, 1, 2, ..., 23) for each calendar day, from Jan 1 to Dec 31 of each year considered.

According to ECMWF (https://confluence.ecmwf.int/display/CKB/ERA5%3A+How+to+calculate+daily+total+precipitation), the daily total precipitation for e.g. Jan 1, 1979 must be calculated by summing steps 1, 2, ..., 23 of Jan 1 and step 0 of Jan 2. Step 0 of Jan 1, 1979 is therefore not included in the total for that day. Likewise, the total for Jan 2, 1979 uses steps 1, 2, ..., 23 of that day plus step 0 of Jan 3, and so on.
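
The accumulation rule can be checked on a toy series. The sketch below is plain Python rather than the command-line call used for the real files; the function name `daily_totals` and the sample values are made up for illustration. Shifting each timestamp back by one hour before grouping by date is what moves step 0 of Jan 2 into the total for Jan 1.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_totals(hourly):
    """Sum hourly accumulations into daily totals using the ECMWF convention.

    `hourly` is an iterable of (timestamp, value) pairs, where each value is
    the precipitation accumulated over the hour ending at that timestamp.
    """
    totals = defaultdict(float)
    for stamp, value in hourly:
        # Shift back one hour so that 00:00 of day D+1 is counted in day D
        shifted = stamp - timedelta(hours=1)
        totals[shifted.date()] += value
    return dict(totals)


# Toy data (hypothetical): 0.001 m every hour from 1979-01-01 00:00 to 1979-01-03 00:00
start = datetime(1979, 1, 1, 0)
hourly = [(start + timedelta(hours=h), 0.001) for h in range(49)]

totals_m = daily_totals(hourly)
# Convert from metres to millimetres (multiply by 1000)
totals_mm = {day: round(total * 1000, 6) for day, total in totals_m.items()}
for day, total in sorted(totals_mm.items()):
    print(day, total)
```

In this toy series, step 0 of Jan 1 lands in a partial total for the previous day, so the first day of a shifted series is incomplete and should not be used.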

* The following CDO command was used to create daily total precipitation: `cdo -b F64 daysum -shifttime,-1hour era5_fileTwo.nc era5_prcp_daysum_nco_2000-2010.nc`
* The following CDO command was used to convert from m to mm: `cdo mulc,1000 era5_prcp_daysum_nco_2000-2010.nc era5_prcp_daysum_mm_nco_2000-2010.nc`
* The following NCO command was used to change the units attribute of the tp variable: `ncatted -O -a units,tp,m,c,"mm d-1" in.nc`

*After creating the daily precipitation files, the following command was used to add the standard_name attribute, needed by xclim, to the tp variable*

* `ncatted -O -a standard_name,tp,c,c,"total_precipitation" in_DAY.nc`

## Creating the daily total precipitation files

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/era_copernicus_uk/')

# Create list of daily total precipitation files 
annual_fls = fldr_src.glob('**/era5_copernicus_prcp_daysum_mm*')
# annual_fls = fldr_src.glob('**/era5_prcp_daysum_nco*')

# Create final file
for era_fl in annual_fls:
    # The sixth token of the file name holds the period, e.g. '2000-2010.nc'
    period = era_fl.name.split('_')[5]
    if period[:4] != '2000':
        yr_fst = period[:4] + '-01-01'
        yr_lst = period[5:9] + '-12-31'
        filename = 'era5_copernicus_DAY_prcp_' + period
    else:
        # 2000 is outside the 2001 - 2019 period, so start from 2001
        yr_fst = '2001-01-01'
        yr_lst = period[5:9] + '-12-31'
        filename = 'era5_copernicus_DAY_prcp_' + '2001' + '-' + period[5:9] + '.nc'

    # Open data and keep only yr_fst to yr_lst; this drops the first timestep
    # of the input data as per the ECMWF statement
    dataset = xr.open_dataset(era_fl, decode_timedelta=False)
    dataset_annual = dataset.sel(time=slice(yr_fst, yr_lst))

    # Saving file with annual precipitations
    annual_prcp = fldr_src / filename
    print('saving to', annual_prcp)
    dataset_annual.to_netcdf(path=annual_prcp)
    print('finished saving')
    dataset.close()
    
print(sep)
print('All done!! check files in {}'.format(fldr_src))
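
The loop above relies on the year range being the sixth underscore-separated token of each file name. A minimal sketch of that parsing as a standalone function follows; `split_period` is a hypothetical helper, not part of the notebook, and it assumes names of the form `era5_copernicus_prcp_daysum_mm_YYYY-YYYY.nc`.

```python
from pathlib import Path


def split_period(file_name, first_valid_year=2001):
    """Return (start_date, end_date, output_name) for a daysum file name.

    Assumes the name ends in `_YYYY-YYYY.nc`. A start year earlier than
    `first_valid_year` is raised to it, mirroring the special case for 2000.
    """
    period = Path(file_name).stem.split('_')[-1]  # e.g. '2000-2010'
    first, last = (int(year) for year in period.split('-'))
    first = max(first, first_valid_year)
    output_name = 'era5_copernicus_DAY_prcp_{}-{}.nc'.format(first, last)
    return '{}-01-01'.format(first), '{}-12-31'.format(last), output_name


print(split_period('era5_copernicus_prcp_daysum_mm_2000-2010.nc'))
print(split_period('era5_copernicus_prcp_daysum_mm_2011-2019.nc'))
```

Keeping the parsing in one place makes it easier to spot a file whose name does not follow the expected pattern, since the `int` conversion fails immediately instead of producing a wrong date string.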

# HadUK-Grid

HadUK-Grid data was downloaded as monthly files of daily values, covering Jan 1 to Dec 31 of each year considered.

* The following NCO command was used to change the units attribute of the rainfall variable: `ncatted -O -a units,rainfall,m,c,"mm d-1" in.nc`

This command was run as part of the following bash script:

```
#! /bin/bash

FILES='folder path with files/*'
for f in $FILES
do
    echo "Processing $f"
    ncatted -O -a units,rainfall,m,c,"mm d-1" "$f"
done

```
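
Since the rest of this notebook is Python, the same loop could also be driven from a cell. This is a sketch only: `ncatted_units_cmd` and `change_units` are hypothetical helpers, the `ncatted` arguments are the ones quoted above, and the `*.nc` pattern is an assumption. With `dry_run=True` the commands are only printed, so nothing is modified until NCO is confirmed to be available.

```python
import subprocess
from pathlib import Path


def ncatted_units_cmd(nc_file, variable='rainfall', units='mm d-1'):
    """Build the ncatted call that modifies the units attribute in place."""
    attribute = 'units,{},m,c,{}'.format(variable, units)
    return ['ncatted', '-O', '-a', attribute, str(nc_file)]


def change_units(folder, pattern='*.nc', dry_run=True):
    """Apply the units change to every file in `folder` matching `pattern`."""
    commands = [ncatted_units_cmd(f) for f in sorted(Path(folder).glob(pattern))]
    for cmd in commands:
        print('Processing', cmd[-1])
        if not dry_run:
            # check=True raises CalledProcessError if ncatted fails
            subprocess.run(cmd, check=True)
    return commands


print(ncatted_units_cmd('in.nc'))
```

Passing the command as a list means the units string needs no shell quoting, which avoids problems with the space in `mm d-1`.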

***HadUK-Grid files did not require any data wrangling using Python***

## Checking that the files downloaded correctly


In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/haduk_cedac_uk/')

# Create list with files
fls_lst = fldr_src.glob('**/*.nc')

# Check that the file structure is correct, i.e. each file opens using xarray
arr_err = []
for fl in fls_lst:
    try:
        dataset = xr.open_dataset(fl, decode_timedelta=False)
        dataset.close()
    except Exception:
        # Record the file and report it; only failures reach this branch
        arr_err.append(fl)
        print('Could not open file: {}'.format(fl))

if len(arr_err) == 0:
    print('The files were downloaded correctly')
else:
    print ('The following files could not be opened: {}'.format(arr_err))
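
The same open-and-close check is repeated for each dataset, so it could be factored into a small function. The sketch below is hypothetical (`find_unreadable` is not in the notebook); it takes the opener as an argument so that `xr.open_dataset` can be passed in, and uses the built-in `open` here only to keep the example runnable without any netCDF files.

```python
import tempfile
from pathlib import Path


def find_unreadable(paths, opener):
    """Return the paths that `opener` fails to open.

    `opener` must return an object with a `close()` method,
    for example `xr.open_dataset` or the built-in `open`.
    """
    failed = []
    for path in paths:
        try:
            opener(path).close()
        except Exception as err:
            print('Could not open {}: {}'.format(path, err))
            failed.append(path)
    return failed


# Demonstration with ordinary files: one exists, one does not
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / 'good.nc'
    good.write_bytes(b'placeholder')
    missing = Path(tmp) / 'missing.nc'
    failed = find_unreadable([good, missing], opener=open)

print('Unreadable files:', [p.name for p in failed])
```

With xarray, the call would look like `find_unreadable(fldr_src.glob('**/*.nc'), xr.open_dataset)`.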

## Checking output after units change

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/haduk_cedac_uk/')
# !ls {fldr_src}

dataset = xr.open_dataset(Path(fldr_src / 'rainfall_hadukgrid_uk_5km_day_20020601-20020630.nc'), decode_timedelta=False)
dataset

# GPM-IMERG

GPM-IMERG data was downloaded as daily files for each calendar day, from Jan 1 to Dec 31 of each year considered.

* The following NCO command was used to change the units attribute of the rainfall variable: `ncatted -O -a units,rainfall,m,c,"mm d-1" in.nc`

The units change was run as part of the following bash script:

```
#! /bin/bash

FILES='folder path with files/*'
for f in $FILES
do
    echo "Processing $f"
    ncatted -O -a units,rainfall,m,c,"mm d-1" "$f"
done

```

## Checking that the files downloaded correctly

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/gpm_imerg_nasa_uk/')

# Create list with files
fls_lst = fldr_src.glob('**/*')

# Check that the file structure is correct, i.e. each file opens using xarray
arr_err = []
for fl in fls_lst:
    try:
        dataset = xr.open_dataset(fl, decode_timedelta=False)
        dataset.close()
    except Exception:
        # Record the file and report it; only failures reach this branch
        arr_err.append(fl)
        print('Could not open file: {}'.format(fl))

if len(arr_err) == 0:
    print('The files were downloaded correctly')
else:
    print ('The following files could not be opened: {}'.format(arr_err))

## Checking output after units change

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/gpm_imerg_nasa_uk/')
# !ls {fldr_src}

dataset = xr.open_dataset(Path(fldr_src / '3B-DAY.MS.MRG.3IMERG.20001002-S000000-E235959.V06.nc4.nc4'), decode_timedelta=False)
dataset

# Modify precipitation values for all files

* All precipitation values under 1 mm are set to zero.
* This step is carried out using a batch file, over a copy of all the data modified in the previous steps.
* The following NCO command was run on a copy (to avoid modifying the main datasets) of all the files for HadUK-Grid, IMERG and ERA5, with the variable name changed to match each dataset: `ncap2 -s 'where(precipitationCal < 1.0) precipitationCal=0.0;' -O in.nc out.nc`

```
#! /bin/bash
FILES='folder path with files/*'
for f in $FILES
do
    echo "Processing $f"
    # change variable name to match the one in the files
    ncap2 -s 'where(precipitationCal < 1.0) precipitationCal=0.0;' -O "$f" "$f"
done
```
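
To make the thresholding rule concrete, here is the same `where` logic in plain Python on a few made-up values. It only illustrates the rule and is not a replacement for the `ncap2` call; `zero_below` is a hypothetical name. With the strict `<` used in the script above, a value of exactly 1.0 mm is kept.

```python
def zero_below(values, threshold=1.0):
    """Set every value strictly below `threshold` to 0.0 and keep the rest."""
    return [0.0 if value < threshold else value for value in values]


# Hypothetical daily precipitation values in mm
sample_mm = [0.0, 0.4, 0.99, 1.0, 2.5]
print(zero_below(sample_mm))
```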

# Converting HadUK-Grid from planar to geographic coordinates
<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>Caution: This should only be run BEFORE creating the indices datasets and AFTER making sure there are no values under 1 mm in the rainfall variable</b></p>
</div>

This process is as follows:

* All HadUK-Grid files are merged into one nc file: `rainfall_hadukgrid_uk_5km_day_2001-2019.nc`
* The following variables need to be removed using NCO:
    - `ncks -C -O -x -v projection_y_coordinate_bnds,projection_x_coordinate_bnds rainfall_hadukgrid_uk_5km_day_2001-2019.nc rainfall_hadukgrid_uk_5km_day_2001-2019_step1.nc`
    - `ncks -C -x -v latitude,longitude,time_bnds rainfall_hadukgrid_uk_5km_day_2001-2019_step1.nc rainfall_hadukgrid_uk_5km_day_OSGB36_2001-2019.nc`
    
## Reproject data using rioxarray

In [None]:
import rioxarray
from pyproj import CRS

# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/haduk_cedac_uk_xclim/')
fldr_out = Path('/mnt/d/MRes_dataset/active_data/106_precp')

# Create list with files
fls_lst = fldr_src.glob('**/*.nc')

# Read HadUK-Grid data and merge all into one file
dataset_HADUK = xr.open_mfdataset(paths=fls_lst, combine='by_coords', parallel=True)
dataset_HADUK.to_netcdf(Path(fldr_out / 'rainfall_hadukgrid_uk_5km_day_2001-2019.nc'))

print('---------')
print('Use NCO to modify the rainfall_hadukgrid_uk_5km_day_2001-2019.nc file before proceeding to the next step')
print('---------')
print('File saved')

## Preparing the transformed dataset for xclim

<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>Caution: Before running xclim, the attributes of the transformed data need changing so that the coordinates show as latitude and longitude</b></p>
</div>

In [None]:
from netCDF4 import Dataset

# Read transformed file
dsin = Dataset(Path(fldr_out / 'rainfall_hadukgrid_uk_5km_day_WGS84_2001-2019.nc'))

# Output file
dsout = Dataset(Path(fldr_out / 'rainfall_hadukgrid_uk_5km_day_WGS84LatLon_2001-2019.nc'), mode='w', format='NETCDF3_CLASSIC')

# x and y are renamed so the coordinates show as longitude and latitude
rename = {'x': 'longitude', 'y': 'latitude'}

# Create dimensions, applying the same renaming
for dimname, dim in dsin.dimensions.items():
    dsout.createDimension(rename.get(dimname, dimname), len(dim))

# Copy variables
# Only x, y, time and rainfall are copied; transform_mercator is dropped as it is not needed.
for v_name, varin in dsin.variables.items():
    if v_name not in ('x', 'y', 'time', 'rainfall'):
        continue
    # Assumes rainfall is stored as (time, y, x) in the transformed file
    dims = tuple(rename.get(d, d) for d in varin.dimensions)
    # time is written as float64; the other variables keep their datatype
    datatype = np.float64 if v_name == 'time' else varin.datatype
    # _FillValue can only be set when the variable is created
    attrs = {k: varin.getncattr(k) for k in varin.ncattrs()}
    fill_value = attrs.pop('_FillValue', None)
    outVar = dsout.createVariable(rename.get(v_name, v_name), datatype, dims, fill_value=fill_value)
    # Copy the remaining variable attributes
    outVar.setncatts(attrs)
    outVar[:] = varin[:]

# Close the input and output files
dsin.close()
dsout.close()

print('---------')
print('The transformed data is ready to go')
print('---------')