# 05_Extracting_external_data.ipynb

## Extracting meteorological data for a selected watershed
Using a GeoJSON file, either extracted from the HydroSHEDS database or supplied by the user, we can extract meteorological data within the watershed's boundaries from the PAVICS-Hydro ERA5 database.

In [None]:
import datetime as dt
import xarray as xr
import s3fs
import fsspec
import intake
from clisops.core import subset


In [None]:
# This is our input section, where we control what we want to extract. The watershed of interest is described by the
# input.geojson file that we generated previously.

basin_contour = 'input.geojson' # Can be generated using notebook "04_Delineating watersheds"

# We can also specify the time period to extract. Here, let's focus on a 10-year period.
# Note that we start one day before and stop one day after the desired period (1981-01-01 to 1990-12-31).
# This accounts for any UTC shifts that might require data from an earlier or later time.
reference_start_day = dt.datetime(1980, 12, 31)
reference_stop_day = dt.datetime(1991, 1, 1)
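
As a small illustration of the one-day padding described above, the sketch below derives the padded bounds from the desired period instead of typing them by hand. The helper `pad_period` and its `pad_days` parameter are hypothetical names introduced here; they are not part of the notebook or of any library.

```python
import datetime as dt


def pad_period(first_day, last_day, pad_days=1):
    """Widen a period by `pad_days` on each side (hypothetical helper)."""
    pad = dt.timedelta(days=pad_days)
    return first_day - pad, last_day + pad


# Desired period: 1981-01-01 to 1990-12-31.
padded_start, padded_stop = pad_period(dt.datetime(1981, 1, 1), dt.datetime(1990, 12, 31))
print(padded_start.date(), padded_stop.date())
```

With the default padding of one day, this gives the same bounds as the hard-coded `reference_start_day` and `reference_stop_day`.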


In [None]:
# Get the ERA5 data from the Wasabi/Amazon S3 server. This will eventually be replaced by a more efficient direct call with auto-updating timesteps.
# Future code:
'''
catalog_name = 'https://raw.githubusercontent.com/hydrocloudservices/catalogs/main/catalogs/atmosphere.yaml'
cat=intake.open_catalog(catalog_name)
ds=cat.era5_hourly_reanalysis_single_levels_ts.to_dask()
'''

# For now, let's use this method:
'''
Configuration keys. Boilerplate, should not be changed.
'''
CLIENT_KWARGS = {'endpoint_url': 'https://s3.wasabisys.com', 'region_name': 'us-east-1'}
CONFIG_KWARGS = {'max_pool_connections': 100}
STORAGE_OPTIONS = {'anon': True, 'client_kwargs': CLIENT_KWARGS, 'config_kwargs': CONFIG_KWARGS}

'''
Prepare the filesystem and the mapper that points to the data on the S3 storage
'''
fsERA5 = fsspec.filesystem('s3', **STORAGE_OPTIONS)
mapper = fsERA5.get_mapper('s3://era5/world/reanalysis/single-levels/zarr-temporal/2021-06-30')

'''
Get the ERA5 data. We will rechunk it into a single chunk to make it compatible with other code on the platform, especially bias correction.
We also take the daily minimum and maximum temperatures as well as the daily total precipitation.
'''
# Open the hourly ERA5 store, keep only the requested period, then clip it to the watershed contour.
ERA5_hourly = xr.open_zarr(mapper, consolidated=True).sel(time=slice(reference_start_day, reference_stop_day))
ERA5_reference = subset.subset_shape(ERA5_hourly, basin_contour)

# chunk(-1) stores each array as a single chunk spanning all of its dimensions.
ERA5_tmin = ERA5_reference['t2m'].resample(time='1D').min().chunk(-1)
ERA5_tmax = ERA5_reference['t2m'].resample(time='1D').max().chunk(-1)
ERA5_pr = ERA5_reference['tp'].resample(time='1D').sum().chunk(-1)
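
To see what the daily aggregation does without downloading ERA5, here is a minimal sketch on synthetic hourly data. The grid, the coordinate values and the variable names (`hourly`, `daily_min`, `daily_max`, `daily_sum`) are made up for illustration. Only the `resample(time='1D')` call followed by `min()`, `max()` or `sum()` mirrors the cell above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three days of synthetic hourly values on a 2 x 2 grid (made-up data, not ERA5).
# Each grid cell simply holds the hour index 0, 1, ..., 71.
time = pd.date_range("1981-01-01", periods=72, freq="60min")
values = np.arange(72, dtype=float)[:, None, None] * np.ones((1, 2, 2))
hourly = xr.DataArray(
    values,
    dims=("time", "latitude", "longitude"),
    coords={"time": time, "latitude": [46.0, 46.25], "longitude": [-72.0, -71.75]},
    name="t2m",
)

# Collapse the 24 hourly steps of each calendar day into one daily value.
daily_min = hourly.resample(time="1D").min()
daily_max = hourly.resample(time="1D").max()
daily_sum = hourly.resample(time="1D").sum()

# Inspect the daily series of a single grid cell.
print(daily_min.isel(latitude=0, longitude=0).values)
print(daily_max.isel(latitude=0, longitude=0).values)
print(daily_sum.isel(latitude=0, longitude=0).values)
```

Each output keeps the spatial dimensions and has one time step per calendar day, which is the shape the rest of the workflow expects.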


### We can now save these datasets as NetCDF files for later use (in a future notebook!)

In [None]:
ERA5_tmin.to_netcdf('ERA5_tmin.nc')
ERA5_tmax.to_netcdf('ERA5_tmax.nc')
ERA5_pr.to_netcdf('ERA5_precip.nc')
