# Data Wrangling

This notebook documents the data wrangling process for all the datasets. It records the commands run with NCO, CDO and other command-line tools, but only the code written in Python is executed here.

---

 - Author:          
                    Luis F Patino Velasquez - MA
 - Date:            
                    Jun 2020
 - Version:         
                    1.0
 - Notes:            
                    files used in this notebook are in format netCDF
 - Jupyter version: 
                    jupyter core     : 4.7.1
                    jupyter-notebook : 6.4.0
                    qtconsole        : 5.1.1
                    ipython          : 7.25.0
                    ipykernel        : 6.0.3
                    jupyter client   : 6.1.12
                    jupyter lab      : 3.0.16
                    nbconvert        : 6.1.0
                    ipywidgets       : 7.6.3
                    nbformat         : 5.1.3
                    traitlets        : 5.0.5
 - Python version:  
                    3.8.5 

---

## Main considerations

* The main time period considered for all datasets is 2001-2019, which is based on the GPM-IMERG data

## Setting Python Modules

In [None]:
from glob import glob
import xarray as xr
import matplotlib.pyplot as plt
from pathlib import Path
import datetime
import numpy as np

# Global vars
sep = ('''------------\n------------''')
print(sep)

# ERA5

ERA5 data was downloaded as 24 hourly steps (0, 1, 2, 3, 4, ..., 23) for each calendar day from Jan 1 to Dec 31 of each considered year.

ECMWF states here https://confluence.ecmwf.int/display/CKB/ERA5%3A+How+to+calculate+daily+total+precipitation that the daily total precipitation for a given day, e.g. Jan 1, 1979, must be calculated by summing steps 1, 2, ..., 23 of Jan 1 AND step 0 of Jan 2. Step 0 of Jan 1, 1979 is therefore not included in the total precipitation for that day. Likewise, the total precipitation for Jan 2, 1979 uses steps 1, 2, 3, ..., 23 of that day plus step 0 of Jan 3, and so on.

* The following CDO command was used to create the daily total precipitation: `cdo -b F64 daysum -shifttime,-1hour era5_fileTwo.nc era5_prcp_daysum_nco_2000-2010.nc`
* The following CDO command was used to convert from m to mm: `cdo mulc,1000 era5_prcp_daysum_nco_2000-2010.nc era5_prcp_daysum_mm_nco_2000-2010.nc`
* The following NCO command was used to change the units attribute of the tp variable: `ncatted -O -a units,tp,m,c,"mm d-1" in.nc`
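
The accumulation rule described above (shift every timestamp back by one hour, then sum per calendar day and convert m to mm) can be sketched in plain Python. This is only an illustration of the logic, not the code used to produce the files; the function name `daily_totals_mm` and the hourly values are made up for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_totals_mm(hourly_m):
    """Sum hourly accumulations in metres into daily totals in millimetres.

    `hourly_m` maps each timestamp to the precipitation accumulated over
    the hour ending at that timestamp.
    """
    totals = defaultdict(float)
    for stamp, value_m in hourly_m.items():
        # Shift by -1 hour so that step 0 is counted with the previous day
        day = (stamp - timedelta(hours=1)).date()
        totals[day] += value_m * 1000.0  # m -> mm
    return dict(totals)


# Made-up data: 0.001 m every hour from 1 Jan 1979 00:00 to 3 Jan 1979 00:00
start = datetime(1979, 1, 1, 0)
hourly = {start + timedelta(hours=h): 0.001 for h in range(49)}
totals = daily_totals_mm(hourly)
print(totals[datetime(1979, 1, 1).date()])
```

With this shift, step 0 of Jan 1 lands on Dec 31 of the previous year, while step 0 of Jan 2 is counted in the Jan 1 total.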

*After creating the daily precipitation files, the following line was used to add the standard_name attribute, which xclim needs, to the tp variable*

* `ncatted -O -a standard_name,tp,c,c,"total_precipitation" in_DAY.nc`

## Creating the daily total precipitation files

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/era_copernicus_uk/')

# Create list of daily total precipitation files 
annual_fls = fldr_src.glob('**/era5_copernicus_prcp_daysum_mm*')
# annual_fls = fldr_src.glob('**/era5_prcp_daysum_nco*')

# Create final file
for era_fl in annual_fls:
    # The sixth field of the file name holds the period, e.g. '2000-2010.nc'
    period = era_fl.name.split('_')[5]
    yr_lst = period[5:9] + '-12-31'
    if period[:4] != '2000':
        yr_fst = period[:4] + '-01-01'
        filename = 'era5_copernicus_DAY_prcp_' + period
    else:
        # The study period starts in 2001, so the year 2000 is left out
        yr_fst = '2001-01-01'
        filename = 'era5_copernicus_DAY_prcp_' + '2001' + '-' + period[5:9] + '.nc'
    
    # Open data and drop the first timestep of the input data as per the ECMWF statement
    dataset = xr.open_dataset(era_fl, decode_timedelta=False)
    dataset_annual = dataset.sel(time=slice(yr_fst, yr_lst))

    # Saving file with annual precipitations
    annual_prcp = Path(fldr_src / filename)
    print ('saving to ', annual_prcp)
    dataset_annual.to_netcdf(path=annual_prcp)
    print ('finished saving')
    dataset.close()
    
print(sep)
print('All done!! check files in {}'.format(fldr_src))
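
The file-name handling in the loop above can also be isolated in a small helper, which makes it easy to check against a few names before touching any data. This is a sketch: the helper name `output_period` is hypothetical, it assumes names shaped like `era5_copernicus_prcp_daysum_mm_2000-2010.nc`, and it clips any start year before 2001 rather than only the year 2000.

```python
def output_period(name, first_year=2001):
    """Return (start_date, end_date, output_name) for one daysum file name.

    Assumes the sixth underscore-separated field of `name` looks like
    '<start year>-<end year>.nc', e.g. '2000-2010.nc'.
    """
    period = name.split('_')[5]
    start_year, end_year = period[:4], period[5:9]
    # Files that start before the study period are clipped to its first year
    start_year = str(max(int(start_year), first_year))
    output_name = 'era5_copernicus_DAY_prcp_{}-{}.nc'.format(start_year, end_year)
    return start_year + '-01-01', end_year + '-12-31', output_name


print(output_period('era5_copernicus_prcp_daysum_mm_2000-2010.nc'))
```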

# HadUK-Grid

HadUK-Grid data was downloaded as monthly files of daily values, covering Jan 1 to Dec 31 of each considered year.

* The following NCO code was used to change the units in the rainfall variable `ncatted -O -a units,rainfall,m,c,"mm d-1" in.nc`

The line of code was run as part of the following bash code:

```
#! /bin/bash

FILES='folder path with files/*'
for f in $FILES
do
    echo "Processing $f"
    ncatted -O -a units,rainfall,m,c,"mm d-1" "$f"
done

```

***HadUK-Grid files did not require any data wrangling using Python***

## Checking files downloaded correctly


In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/haduk_cedac_uk/')

# Create list with files
fls_lst = fldr_src.glob('**/*.nc')

# Check to see if the file structure is correct - ie it opens using xarray
arr_err = []
for fl in fls_lst:
    try:
        dataset = xr.open_dataset(fl, decode_timedelta=False)
        dataset.close()
    except Exception:
        arr_err.append(fl)
        print('Could not open file: {}'.format(fl))

if len(arr_err) == 0:
    print('The files were downloaded correctly')
else:
    print ('The following files could not be opened: {}'.format(arr_err))
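
The same check can be wrapped in a reusable function that also reports why a file failed. This is a minimal sketch: the name `find_unreadable` and the `opener` parameter are introduced here for illustration. In the notebook the opener would be something like `lambda p: xr.open_dataset(p, decode_timedelta=False)`; the built-in `open` stands in for it below so that the block runs without any data.

```python
def find_unreadable(paths, opener):
    """Return the paths that `opener` fails to open.

    `opener` is any callable that returns an object with a `close()` method.
    """
    failed = []
    for path in paths:
        try:
            opener(path).close()
        except Exception as err:
            # Keep going, but record which file failed and why
            print('Could not open {}: {}'.format(path, err))
            failed.append(path)
    return failed


# Stand-in usage: a path that does not exist is reported as unreadable
print(find_unreadable(['this_file_does_not_exist.nc'], open))
```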

## Checking output after units change

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/haduk_cedac_uk/')
# !ls {fldr_src}

dataset = xr.open_dataset(Path(fldr_src / 'rainfall_hadukgrid_uk_5km_day_20020601-20020630.nc'), decode_timedelta=False)
dataset

# GPM-IMERG

GPM-IMERG data was downloaded as daily files, one for each calendar day from Jan 1 to Dec 31 of each considered year.

* The following NCO code was used to change the units in the rainfall variable `ncatted -O -a units,rainfall,m,c,"mm d-1" in.nc`

The change of units NCO code was run as part of the following bash code:

```
#! /bin/bash

FILES='folder path with files/*'
for f in $FILES
do
    echo "Processing $f"
    ncatted -O -a units,rainfall,m,c,"mm d-1" "$f"
done

```
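
If the same loop were driven from Python rather than bash, it might look like the sketch below. The function names and the `dry_run` switch are hypothetical and were not part of the original workflow; the `ncatted` arguments are copied from the bash loop above. With `dry_run=True` the commands are only built and listed, so NCO is not needed to try it.

```python
import subprocess
from pathlib import Path


def ncatted_units_command(nc_file, variable='rainfall', units='mm d-1'):
    """Build the argument list for the ncatted call used in the bash loop."""
    # One -a argument: attribute name, variable, mode (m = modify), type (c = char), value
    attribute = 'units,{},m,c,{}'.format(variable, units)
    return ['ncatted', '-O', '-a', attribute, str(nc_file)]


def change_units(folder, dry_run=True):
    """Build one command per file in `folder`; run them only if dry_run is False."""
    files = sorted(p for p in Path(folder).glob('*') if p.is_file())
    commands = [ncatted_units_command(p) for p in files]
    for command in commands:
        print('Processing', command[-1])
        if not dry_run:
            # Requires the NCO tools to be installed and on the PATH
            subprocess.run(command, check=True)
    return commands


print(ncatted_units_command('in.nc'))
```

Passing the command as a list keeps `mm d-1` inside a single argument, which is what the double quotes do in the shell version.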

## Checking files downloaded correctly

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/gpm_imerg_nasa_uk/')

# Create list with files
fls_lst = fldr_src.glob('**/*')

# Check to see if the file structure is correct - ie it opens using xarray
arr_err = []
for fl in fls_lst:
    try:
        dataset = xr.open_dataset(fl, decode_timedelta=False)
        dataset.close()
    except Exception:
        arr_err.append(fl)
        print('Could not open file: {}'.format(fl))

if len(arr_err) == 0:
    print('The files were downloaded correctly')
else:
    print ('The following files could not be opened: {}'.format(arr_err))

## Checking output after units change

In [None]:
# Set directory to read and for outputs
fldr_src = Path('/mnt/d/MRes_dataset/search_data/gpm_imerg_nasa_uk/')
# !ls {fldr_src}

dataset = xr.open_dataset(Path(fldr_src / '3B-DAY.MS.MRG.3IMERG.20001002-S000000-E235959.V06.nc4.nc4'), decode_timedelta=False)

# Modifying HadUK-Grid from planar to geodetic coordinates
<div class="alert alert-block alert-warning">
 <p style="color:black"> <b>Caution: This should only be run after creating the indices datasets</b></p>
</div>

 * This process creates a new HadUK-Grid file using lat, lon, time as the main dimensions

In [None]:
import netCDF4
from netCDF4 import Dataset

# Set directory
folder = Path('/mnt/c/Users/C0060017/Documents/Taught_Material/MRes_Dissertation/Dissertation/MRes_dataset/active_data/102_prcp/')
# Uncomment the line below to check that this is the right path
# !ls {folder}

In [None]:
# Original dataset
dataset = xr.open_dataset(Path(folder / 'haduk_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'), decode_timedelta=False)
# dataset['prcptot'].values

## Creating the modified HadUK-Grid nc file

* This process creates an nc file with the transformed coordinates. However, an existing output file does not get overwritten, so before re-running the user will need to either:
    - Modify the variable *dsout* to use a different name for the output file, or
    - Restart the Jupyter server and re-run the process
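
One way to follow the first option without editing the name by hand is to derive an unused output path automatically. This is only a sketch: the helper `unused_path` and the `_vN` suffix are hypothetical and are not part of the original notebook.

```python
from pathlib import Path


def unused_path(path):
    """Return `path` if it does not exist yet, else the first free '<stem>_vN<suffix>'."""
    path = Path(path)
    candidate, version = path, 1
    while candidate.exists():
        version += 1
        candidate = path.with_name('{}_v{}{}'.format(path.stem, version, path.suffix))
    return candidate


# The returned path can then be passed to Dataset(..., mode='w')
print(unused_path('hadukWGS84_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'))
```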

In [None]:
# This produces information with regards to the CRS
from pyproj import CRS
crs_27700 = CRS.from_epsg(27700)
crs_27700

In [None]:
''' This code is a modification of the code found here: https://gist.github.com/guziy/8543562'''
from pyproj import Transformer

# Reading the data
dsin = Dataset(Path(folder / 'haduk_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'))

#output file
dsout = Dataset(Path(folder / 'hadukWGS84_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'), mode='w',format='NETCDF3_CLASSIC')
# dsout = Dataset(Path(folder / 'haduk_metoffice_xclimSeason_QSDEC_prcp_2001-2019_WGS84.nc'), mode='w',format='NETCDF3_CLASSIC')
# dsout = Dataset(Path(folder / 'test2.nc'), mode='w',format='NETCDF4_CLASSIC')
    
#Create dimensions            
for v_name, varin in dsin.variables.items():
    if v_name in ['latitude', 'longitude', 'time', 'percentiles']:
        if v_name == 'latitude':
            dsout.createDimension(v_name, varin.shape[0])
        elif v_name == 'longitude':
            dsout.createDimension(v_name, varin.shape[1])
        else:
            dsout.createDimension(v_name, len(varin))

##########################
# This copies all the index variables and time (an r99ptot-only variant is commented out below)
##########################

# Copy variables
for v_name, varin in dsin.variables.items():
    if v_name in ['r10mm', 'r20mm', 'cdd', 'cwd', 'sdii', 'rx1day', 'rx5day', 'prcptot', 'r95ptot', 'r99ptot', 'time']:      
        if v_name in ['r95ptot', 'r99ptot']:
            outVar = dsout.createVariable(v_name, varin.datatype, ('time', 'latitude', 'longitude', 'percentiles'))
            # Copy variable attributes
            outVar.setncatts({k: varin.getncattr(k) for k in varin.ncattrs()})
            outVar[:] = varin[:] 
        elif v_name in ['time']:
            outVar = dsout.createVariable(v_name, np.float64, varin.dimensions)
            # Copy variable attributes - This currently does not work for the time variable
            outVar.setncatts({k: varin.getncattr(k) for k in varin.ncattrs()})
            outVar[:] = varin[:] 
        else:
            outVar = dsout.createVariable(v_name, varin.datatype, ('time', 'latitude', 'longitude'))
            # Copy variable attributes
            outVar.setncatts({k: varin.getncattr(k) for k in varin.ncattrs()})
            outVar[:] = varin[:]

# # # ##########################
# # # # This is for r99ptot data
# # # ##########################
# # # # Copy variables
# # # for v_name, varin in dsin.variables.items():
# # #     if v_name in ['r99ptot', 'time']:      
# # #         if v_name in ['r99ptot']:
# # #             outVar = dsout.createVariable(v_name, varin.datatype, ('time', 'latitude', 'longitude', 'percentiles'))
# # #             # Copy variable attributes
# # #             outVar.setncatts({k: varin.getncattr(k) for k in varin.ncattrs()})
# # #             outVar[:] = varin[:] 
# # #         else:
# # #             outVar = dsout.createVariable(v_name, np.float64, varin.dimensions)
# # #             # Copy variable attributes - This currently does not work for the time variable
# # #             outVar.setncatts({k: varin.getncattr(k) for k in varin.ncattrs()})
# # #             outVar[:] = varin[:] 


# Create coordinates variables 
y_coord = dsout.createVariable('latitude',np.float64,('latitude'))
x_coord = dsout.createVariable('longitude',np.float64,('longitude'))


# Read the x, y values ready to transform to WGS84
values_lat = dsin['projection_y_coordinate'][:]
values_lon = dsin['projection_x_coordinate'][:]


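# Placeholder coordinates for the transformation from OSGB36 (EPSG:27700) to WGS84 (EPSG:4326).
# Each axis is transformed on its own with the other coordinate held at 0, so the
# resulting latitude and longitude vectors approximate the true 2-D positions.
fake_lat = np.zeros(len(values_lat)) # stands in for x when transforming y to latitude
fake_lon = np.zeros(len(values_lon)) # stands in for y when transforming x to longitude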

# # Transform coordinate to WGS84
# Method 1
transformer = Transformer.from_crs("epsg:27700", "epsg:4326", always_xy=True)
lonx_fake, laty = transformer.transform(fake_lat, np.ma.getdata(values_lat))
lonx, laty_fake = transformer.transform(np.ma.getdata(values_lon), fake_lon)

# Add transformed coordinates (WGS84) to nc file
y_coord[:] = laty
x_coord[:] = lonx

# Close both files so that the output is flushed to disk
dsout.close()
dsin.close()

# Checking output
dataset = xr.open_dataset(Path(folder / 'hadukWGS84_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'), decode_timedelta=False)
dataset

## Saving final file with attributes in coordinate variables

In [None]:
'''
Encoding Example compliant with netcdf files

encoding = {'lat': {'zlib': False},
            'lon': {'zlib': False},
            'any_variable': {'_FillValue': -999.0,
                  'chunksizes': (1, 8, 10),
                  'complevel': 1,
                  'zlib': True}
            }
'''
dataset = xr.open_dataset(Path(folder / 'hadukWGS84_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc'), decode_timedelta=False)

dataset.latitude.attrs['units'] = 'degrees_north'
dataset.latitude.attrs['long_name'] = 'latitude'
dataset.latitude.attrs['origname'] = 'latitude'

dataset.longitude.attrs['units'] = 'degrees_east'
dataset.longitude.attrs['long_name'] = 'longitude'
dataset.longitude.attrs['origname'] = 'longitude'

WGS84_file = Path(folder / 'hadukWGS84Attr_metoffice_xclimSeason_QSDEC_prcp_2001-2019.nc')
print ('saving to ', WGS84_file)
dataset.to_netcdf(WGS84_file)
print ('finished saving')
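
The encoding example in the docstring above is not applied when saving. If compression were wanted, a dictionary of that shape could be built per variable and passed as `dataset.to_netcdf(WGS84_file, encoding=encoding)`. The sketch below only builds the dictionary: the helper name `build_encoding`, the variable lists and the default values are assumptions for illustration, and `chunksizes` is left out because it depends on the shape of each variable.

```python
def build_encoding(data_vars, coords, fill_value=-999.0, complevel=1):
    """Return a per-variable encoding dict in the shape shown in the docstring."""
    # Coordinates are stored uncompressed
    encoding = {name: {'zlib': False} for name in coords}
    # Data variables get a fill value and zlib compression
    for name in data_vars:
        encoding[name] = {'_FillValue': fill_value,
                          'complevel': complevel,
                          'zlib': True}
    return encoding


encoding = build_encoding(['prcptot', 'rx1day'], ['latitude', 'longitude'])
print(encoding)
```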