# Interfacing to Your Own Data

Assumptions:

1. You have already downloaded a full or partial copy of ERA5 at 5.625 degree resolution
2. You have checked out the EDIT monorepo and have a functional EDIT environment into which you can install new packages

This notebook works through creating a new EDIT package that interfaces to this ERA5 dataset, referred to hereafter as "ERA5lowres" for convenience.

This notebook will present two things:

1. The quick install-and-use demo
2. A slower, more careful account of how it was done, so you can repeat it for new data


## Review of What's on Disk

If you don't already have a copy of the dataset, the zipped data file is 270 GB and can be downloaded as follows:

> wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg&files=all_5.625deg.zip" -O all_5.625deg.zip

Unzipping the downloaded file is left as an exercise for the reader. The unzipped data is 471 GB. Once it is downloaded and unpacked, proceed.
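
For readers who would rather stay in Python for the unpacking step, here is a minimal sketch using the standard-library `zipfile` module. It assumes the download is a plain zip archive; the helper name `unpack_archive` is made up for this example, and if the archive turns out to contain nested per-variable archives, those would need the same treatment.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper for illustration; not part of EDIT.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    return names
```

Calling `unpack_archive("all_5.625deg.zip", "5.625deg")` would extract into a `5.625deg` directory next to the download. Make sure the target file system has room for the unpacked data first.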

In [1]:
# wbench_data_dir = '/g/data/wb00/NCI-Weatherbench/5.625deg'  # on NCI
wbench_data_dir = "/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg/"  # on NIWA HPC
!ls $wbench_data_dir

10m_u_component_of_wind  potential_vorticity	       total_cloud_cover
10m_v_component_of_wind  relative_humidity	       total_precipitation
2m_temperature		 specific_humidity	       u_component_of_wind
constants		 temperature		       v_component_of_wind
geopotential		 temperature_850	       vorticity
geopotential_500	 toa_incident_solar_radiation


In [2]:
!ls $wbench_data_dir/total_cloud_cover/*198*

/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1980_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1981_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1982_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1983_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1984_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1985_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1986_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1987_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg//total_cloud_cover/total_cloud_cover_1988_5.625deg.nc
/nesi/nobackup/niwa00004/riom/weatherbench/5.6

Here we see that the data is stored in subdirectories with human-readable names, and within those subdirectories, the files are named with the variable, the year and the resolution of the data.
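
The naming pattern visible in the listing above can be captured with a regular expression. This is only a sketch inferred from the file names shown; `FILENAME_PATTERN` and `parse_filename` are names invented here, and the pattern has not been checked against every file in the archive.

```python
import re

# Pattern inferred from the listing above: <variable>_<year>_<resolution>.nc
FILENAME_PATTERN = re.compile(
    r"^(?P<variable>.+)_(?P<year>\d{4})_(?P<resolution>[\d.]+deg)\.nc$"
)


def parse_filename(filename):
    """Return (variable, year, resolution), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match["variable"], int(match["year"]), match["resolution"]


print(parse_filename("total_cloud_cover_1980_5.625deg.nc"))
# prints: ('total_cloud_cover', 1980, '5.625deg')
```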

This layout is somewhat different to the layout of the full ERA5 dataset as taken directly from CDS, so we will need to tell EDIT how to deal with this difference.

The challenge for EDIT is to understand this directory structure, figure out the shorthand variable names that are actually present in the files (e.g. total_cloud_cover is called "tcc" inside the netCDF file), and work out how to index the whole thing by shorthand variable name and date. That includes interpreting things like the unit of each variable and handling any variable renaming needed to standardise naming conventions.
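
To make the indexing problem concrete, here is a rough sketch of the lookup involved: given a shorthand variable name and a date, work out which file should hold the data. This is not how EDIT implements it. The `tcc` entry comes from the text above, the `u` and `v` entries are assumptions, and `SHORT_TO_DIRECTORY`, `file_for` and the `/data/5.625deg` root are all made up for illustration.

```python
from datetime import date
from pathlib import Path

# Short name (inside the netCDF file) -> directory name on disk.
# "tcc" is taken from the text; the "u" and "v" entries are assumptions.
SHORT_TO_DIRECTORY = {
    "tcc": "total_cloud_cover",
    "u": "u_component_of_wind",
    "v": "v_component_of_wind",
}


def file_for(root, short_name, when):
    """Return the yearly file expected to hold short_name on the given date."""
    try:
        directory = SHORT_TO_DIRECTORY[short_name]
    except KeyError:
        known = ", ".join(sorted(SHORT_TO_DIRECTORY))
        raise KeyError(
            f"Unknown variable {short_name!r}; known variables: {known}"
        ) from None
    return Path(root) / directory / f"{directory}_{when.year}_5.625deg.nc"


# "/data/5.625deg" is a hypothetical root directory.
print(file_for("/data/5.625deg", "tcc", date(1984, 1, 1)).as_posix())
# prints: /data/5.625deg/total_cloud_cover/total_cloud_cover_1984_5.625deg.nc
```

A real indexer also has to deal with units, renaming and pressure levels, which is where the configuration effort goes.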

This requires some configuration, and a lot of EDIT's code is about this kind of dataset comprehension.

Let's summarise the easy way.

1. Install the ERA5-lowres Python module
2. Set an environment variable called ERA5LOWRES
3. Import the ERA5-lowres Python module

Assuming you have already installed the ERA5-lowres module, let's do steps 2 and 3.
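
Outside a notebook, where the `%env` magic is not available, step 2 can be done from plain Python. The sketch below also checks that the directory exists before exporting it; `set_data_root` is a name invented for this example, and setting the variable before the import simply follows the step order listed above.

```python
import os
from pathlib import Path


def set_data_root(path, variable="ERA5LOWRES"):
    """Check that path is a directory, then export it as an environment variable.

    Hypothetical helper for illustration; not part of EDIT.
    """
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    os.environ[variable] = str(root)
    return root
```

In a script you would call `set_data_root(wbench_data_dir)` first and then run `import pyearthtools.data`.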

In [3]:
%env ERA5LOWRES=$wbench_data_dir

env: ERA5LOWRES=/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg/


In [4]:
import pyearthtools.data
import pyearthtools.tutorial



In [5]:
pyearthtools.data.archive.era5lowres?

Init signature:
pyearthtools.data.archive.era5lowres(
    variables: 'list[str] | str',
    *,
    level_value: 'int | float | list[int | float] | tuple[list | int, ...] | None' = None,
    transforms: 'Transform | TransformCollection | None' = None,
)
Docstring:      ECWMF ReAnalysis v5
Init docstring:
Setup ERA5 Low-Res Indexer

Args:
    variables (list[str] | str):
        Data variables to retrieve
    resolution (Literal[ERA_RES], optional):
        Resolution of data, must be one of 'monthly-averaged','monthly-averaged-by-hour', 'reanalysis'.

In [6]:
var = ['u', 'v']  # Note: there is no straightforward way to list the variables in the archive.
                  # However, a mismatch will cause EDIT to list what's available with a "did you mean" prompt.
                  # A specific listing function should be added in future.
UandV = pyearthtools.data.archive.era5lowres(var)
UandV
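
Until a listing function exists, one workaround is to look at the directory names on disk. The sketch below uses only the standard library and does not touch EDIT at all; `list_variable_directories` is a name invented here. Note that it returns the long directory names (e.g. `u_component_of_wind`), not the shorthand names the indexer expects.

```python
from pathlib import Path


def list_variable_directories(root):
    """Return the sorted names of the subdirectories directly under root."""
    return sorted(entry.name for entry in Path(root).iterdir() if entry.is_dir())
```

Calling `list_variable_directories(wbench_data_dir)` should give the same names as the `ls` output near the top of this notebook.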

In [7]:
data = UandV['1984-01-01'] 
data

| | Array | Chunk |
|---|---|---|
| Bytes | 2.44 MiB | 555.75 kiB |
| Shape | (24, 13, 32, 64) | (24, 8, 19, 39) |
| Dask graph | 8 chunks in 4 graph layers | |
| Data type | float32 numpy.ndarray | |
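
The figures in the Dask summary above can be checked by hand: a float32 value takes 4 bytes, so the array and chunk sizes follow from the shapes, and the chunk count follows from how many chunks are needed along each axis. The short calculation below is just that arithmetic, using only the numbers reported in the output.

```python
from math import ceil, prod

shape = (24, 13, 32, 64)   # full array shape from the summary
chunks = (24, 8, 19, 39)   # chunk shape from the summary
itemsize = 4               # bytes per float32 value

array_mib = prod(shape) * itemsize / 2**20
chunk_kib = prod(chunks) * itemsize / 2**10
# Chunks needed along each axis, multiplied together
n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))

print(f"{array_mib:.2f} MiB, {chunk_kib:.2f} kiB, {n_chunks} chunks")
# prints: 2.44 MiB, 555.75 kiB, 8 chunks
```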


In [8]:
data.u

| | Array | Chunk |
|---|---|---|
| Bytes | 2.44 MiB | 555.75 kiB |
| Shape | (24, 13, 32, 64) | (24, 8, 19, 39) |
| Dask graph | 8 chunks in 4 graph layers | |
| Data type | float32 numpy.ndarray | |


In [9]:
data.u.time