# Step One - Getting Data on Disk

This demo begins with the familiar and accessible WeatherBench dataset (see https://github.com/pangeo-data/WeatherBench) and follows the examples provided at https://github.com/nci/NCI_WeatherBench, but uses EDIT along the way.

The examples use low-resolution (5.625 degree) data, but the dataset is still large: 270GB zipped and 471GB unpacked. It will be easier to work on a midrange server or an HPC system than on a typical workstation, although a high-end workstation is sufficient for getting the general idea.

Readers are strongly recommended to start by working through the non-EDIT version of the tutorials in order to have a strong grasp of what is going on with the ML **before** tackling the concepts in EDIT. The final EDIT examples are well-structured and simpler, but it is difficult to learn what is going on in EDITland without first seeing what a more bare-bones approach looks like.

WeatherBench 2 provides some updates and has moved to a cloud-native architecture. This may suit those working in a cloud environment, but it is more complex for those working on a workstation or HPC system, where it is easier to have the files replicated to disk first.



## Data Download and layout on disk

If you don't already have a replication of the dataset, the zipped data file is 270GB and can be downloaded thusly:

> wget "https://dataserv.ub.tum.de/s/m1524895/download?path=%2F5.625deg&files=all_5.625deg.zip" -O all_5.625deg.zip

The process of unzipping the downloaded file is left as an exercise to the reader. The unzipped data is 471GB. Once downloaded and unpacked, proceed.
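For readers who would rather not treat unpacking as an exercise, the following is a minimal sketch using Python's standard `zipfile` module. The function name `unpack_archive` and the paths in the example call are placeholders chosen for illustration, and the sketch assumes the download is a plain zip archive and that the destination has enough free space for the unpacked data.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    return names


# Example call (placeholder paths; adjust to your own system):
# unpack_archive("all_5.625deg.zip", "5.625deg")
```

If the archive turns out to contain further zip files, the same function can be applied to each of them in turn.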

If you are working at NCI, the data is already available on disk as a data collection.


In [8]:
import os
wbench_data_dir = '/g/data/wb00/NCI-Weatherbench/5.625deg'
!ls $wbench_data_dir

10m_u_component_of_wind  potential_vorticity	       total_cloud_cover
10m_v_component_of_wind  relative_humidity	       total_precipitation
2m_temperature		 specific_humidity	       u_component_of_wind
constants		 temperature		       v_component_of_wind
geopotential		 toa_incident_solar_radiation  vorticity


# Loading Data the Old Fashioned Way

https://github.com/pangeo-data/WeatherBench/blob/master/notebooks/3-cnn-example.ipynb shows some examples of loading data the old-fashioned way. In short, you need to write the looping code to walk the filesystem, open things file-by-file, and manually do any reprocessing.
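As a rough illustration of what "walking the filesystem" involves, the sketch below collects files per variable using only the standard library. It is not taken from the WeatherBench notebook: the helper name `find_netcdf_files` is hypothetical, and it assumes one sub-directory per variable (as in the directory listing above) with files ending in `.nc`.

```python
from pathlib import Path


def find_netcdf_files(root):
    """Map each variable sub-directory under root to its sorted list of .nc files."""
    files_by_variable = {}
    for var_dir in sorted(Path(root).iterdir()):
        if not var_dir.is_dir():
            continue  # skip stray files at the top level
        nc_files = sorted(var_dir.glob("*.nc"))
        if nc_files:
            files_by_variable[var_dir.name] = nc_files
    return files_by_variable


# Each file would then be opened one at a time, for example with
# xarray.open_dataset(path) (assuming xarray is installed), and any
# renaming or reprocessing applied by hand.
```

Every project that takes this route ends up maintaining its own variant of this loop, which is the duplication a standardised interface is meant to remove.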

# Loading Data with EDIT

See https://bureau-sig.nci.org.au/edit/EDIT_101/Data/ for more documentation.

EDIT provides a standardised interface to data. Code already exists to interface with the full ERA5 dataset on disk at NCI (not just the 5.625 degree data from WeatherBench). We'll start by taking a look at the EDIT ERA5 interface, then walk through connecting to the low-res dataset to see how a new data source is added.

In [2]:
import edit_archive_NCI
import edit.data
import edit.pipeline
import edit
edit.data.archive.NCI?

Type:        module
String form: <module 'edit_archive_NCI' from '/scratch/kd24/tjl548/testeditableedit/nci/src/edit_archive_NCI/__init__.py'>
File:        /scratch/kd24/tjl548/testeditableedit/nci/src/edit_archive_NCI/__init__.py
Docstring:
National Computational Infrastructure specific Indexes

| Name        | Description |
| :---        |       ----: |
| [ERA5][edit_archive_NCI.ERA5]                | ECMWF ReAnalysis v5       |
| [ACCESS][edit_archive_NCI.ACCESS]            | Australian Community Climate and Earth-System Simulator       |
| [AGCD][edit_archive_NCI.AGCD]                | Australian Gridded Climate Data        |
| [BRAN][edit_archive_NCI.BRAN]                | Bluelink ReANalysis        |
| [OceanMaps][edit_archive_NCI.OceanMaps]      | Ocean Modelling and Analysis Prediction System        |
| [MODIS][edit_archive_NCI.MODIS]              | MODerate resolution Imaging Spectroradiometer       |
| [Himiwari][edit_archive_NCI.Himi


In [3]:
edit.data.archive?

Type:        module
String form: <module 'edit.data.archive' from '/scratch/kd24/tjl548/testeditableedit/data/src/edit/data/archive/__init__.py'>
File:        /scratch/kd24/tjl548/testeditableedit/data/src/edit/data/archive/__init__.py
Docstring:
Provide [Index][edit.data.ArchiveIndex] for known and widely used archived data sources.

These [Indexes][edit.data.ArchiveIndex] allow a user to retrieve data with only a date after being initialised.

More archives can be added by wrapping a class with [register_archive][edit.data.archive.register_archive]

    `edit.data` contains no archives itself, and requires additional modules to define them.

    Currently the following exist,
    ```
     - NCI
     - UKMO
    ```

!!! Note
    If setup correctly, any registered archive will be automatically imported if detected to be on the appropriate system.
    So, there may be no need to explicitly import it.

In [16]:
var=['u', 'v']  # Note - there is no really straightforward way to just list the variables in the archive
                # However, mismatches will cause edit to list what's available with a "did you mean" prompt
                # A specific listing function should be added in future.
UandV = edit.data.archive.ERA5(var)
UandV

In [19]:
UandV['1984-01-01']  # ERA5 is an analysis product, so is indexed by its analysis time. This request
                     # fetches all of the data from a particular date

Output (the same Dask array summary is displayed for each variable):

|            | Array                        | Chunk            |
| :---       |                         ---: |             ---: |
| Bytes      | 6.87 GiB                     | 55.62 MiB        |
| Shape      | (24, 37, 721, 1440)          | (4, 5, 405, 900) |
| Dask graph | 192 chunks in 4 graph layers |                  |
| Data type  | float64 numpy.ndarray        |                  |


# What this buys you

EDIT has abstracted the relationship between data loading and the data source, in this case the filesystem. Once a specific connector to the filesystem has been written, all data sources can be treated in a compatible and generic fashion. The code that deals with filesystem connections, such as handling different directory layouts (e.g. YYYY, YYMM, YYYYMMDD or others), is also kept modular.
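To make the idea of a modular filesystem connector concrete, here is a minimal sketch of how a date could be mapped to a directory under different layouts. This is not EDIT's implementation: the `LAYOUTS` table and the `directory_for` function are hypothetical names introduced only for illustration, and the `strftime` patterns are one plausible reading of each layout.

```python
from datetime import date
from pathlib import Path

# Hypothetical table: layout name -> strftime pattern for the sub-directory.
LAYOUTS = {
    "YYYY": "%Y",
    "YYMM": "%y%m",
    "YYYYMMDD": "%Y%m%d",
}


def directory_for(root, when, layout):
    """Return the directory expected to hold data for the date `when`."""
    return Path(root) / when.strftime(LAYOUTS[layout])


# The calling code stays the same whichever layout the archive uses.
print(directory_for("archive", date(1984, 1, 1), "YYYYMMDD"))
```

Keeping the layout knowledge in one place means that supporting a new archive is a matter of adding one entry, rather than editing every loading loop.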

It also applies standard coordinate and variable naming, so that data sources can be more easily combined. It is common, for example, for one dataset to use 'lat' and 'lon' instead of 'latitude' and 'longitude', or 't2m' instead of '2t'.
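The naming step can be pictured as a simple alias lookup. The sketch below is an illustration only, not how EDIT does it internally: the `STANDARD_NAMES` table and the `standardise` function are hypothetical, and the aliases are just the ones mentioned above.

```python
# Hypothetical alias table: non-standard name -> standard name.
STANDARD_NAMES = {
    "lat": "latitude",
    "lon": "longitude",
    "t2m": "2t",
}


def standardise(names):
    """Return names with known aliases replaced by their standard form."""
    return [STANDARD_NAMES.get(name, name) for name in names]


print(standardise(["time", "lat", "lon", "t2m"]))
```

With xarray, a mapping like this (filtered to the names actually present in a dataset) could be passed to `Dataset.rename` to achieve the same effect on real data.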

At this point, it would be possible to write a simple loop such as:

for every date in the training period:
    load the data for that date
    call a training update function with that data
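In Python, that loop might look something like the sketch below. The names `daily_dates`, `train_over_period` and `update` are hypothetical, and `index` stands for any object that can be looked up by a date string in the same way as `UandV['1984-01-01']` earlier; the sketch has not been run against EDIT itself.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """Yield ISO date strings from start to end inclusive, one per day."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


def train_over_period(index, start, end, update):
    """Load each day's data from `index` and pass it to `update`."""
    steps = 0
    for day in daily_dates(start, end):
        sample = index[day]  # same style of lookup as UandV['1984-01-01']
        update(sample)       # hypothetical training update function
        steps += 1
    return steps


# Example call, assuming UandV from the cell above and a user-supplied update():
# train_over_period(UandV, date(1984, 1, 1), date(1984, 1, 31), update)
```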

This is a valid use case. The next notebook, "Data Pipelines with EDIT", will explore how to make further use of these data objects for data processing and options for presentation to the ML training frameworks. 

The rest of this notebook will explore how to connect a novel data source.

