# Why bother with `edit`?

Without a proper framework or tools, loading data means reimplementing the same boilerplate code in every project.

Rather than wasting scientists' time on setting this up, EDIT aims to provide a unified interface to any dataset the user has access to.

Currently, EDIT uses `xarray` as its base-level data loader, so `netcdf`, `zarr`, and any other `xarray`-supported formats are included by default.

## Data Loaders

### Location Specific 
`edit` is designed with the knowledge that each computing infrastructure holds its data in a slightly different way.

Accordingly, an archive must be set up for each location to allow the data at said location to be loaded.

This example notebook uses the `NCI` (National Computational Infrastructure) specific archive, as NCI is the main development environment.

### Other Data 
In addition to these location-specific data loaders, `edit` contains an `Intake` and a `CDS` data loader, allowing data to be downloaded or accessed via an Intake catalog.

Also, various `patterns` are implemented that allow data to be saved out and loaded as if it were an archive.

## Example

### ERA5

In [1]:
import edit.data



`edit` includes an auto-import step, which seeks to identify the current computing platform and load the associated archive. Therefore, when running on NCI, importing `edit.data` loads the `NCI`-specific archive by default.
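To illustrate the idea of platform detection, here is a minimal sketch that maps a hostname to an archive name. It is not `edit`'s actual implementation: the `PLATFORM_PATTERNS` table, the `detect_platform` function, and the example hostnames are all hypothetical names introduced for illustration.

```python
import re
import socket

# Hypothetical hostname patterns; the real detection logic in `edit` may differ.
PLATFORM_PATTERNS = {
    "NCI": re.compile(r"gadi"),
}


def detect_platform(hostname=None):
    """Return the name of the first platform whose pattern matches the hostname."""
    hostname = socket.gethostname() if hostname is None else hostname
    for name, pattern in PLATFORM_PATTERNS.items():
        if pattern.search(hostname):
            return name
    # Unknown platform: no location-specific archive would be loaded.
    return None


# Assumed example hostnames, used only to exercise the sketch.
nci_result = detect_platform("gadi-login-01")
unknown_result = detect_platform("my-laptop")
```

A lookup table like this keeps the detection rules in one place, so supporting a new location would only need a new entry plus its archive.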

In [4]:
print(f"NCI archive loaded: {hasattr(edit.data.archive, 'NCI')}")

NCI archive loaded: True


In [10]:
print(edit.data.archive.NCI.ERA5.__init__.__doc__)


        Setup ERA5 Indexer

        Args:
            variables (list[str] | str):
                Data variables to retrieve
            level (Literal[ERA5_LEVELS]):
                Model level of data, must be either "single", "pressure"
            resolution (Literal[ERA_RES], optional):
                Resolution of data, must be one of 'monthly-averaged','monthly-averaged-by-hour', 'reanalysis'. Defaults to 'reanalysis'.
            level_value: (int, optional):
                Level value to select if data contains levels. Defaults to None.
            transforms (Transform | TransformCollection, optional): 
                Base Transforms to apply.
                Defaults to TransformCollection().
        


In [15]:
ERA5_index = edit.data.archive.NCI.ERA5('t', level = 'pressure')
ERA5_index

Now that we have an index with `t` specified as the variable to retrieve, the object can be indexed with a time.

In [19]:
ERA5_index['2023-01-01T00']

| | Array | Chunk |
|---|---|---|
| Bytes | 146.54 MiB | 12.01 MiB |
| Shape | (1, 37, 721, 1440) | (1, 6, 486, 1080) |
| Dask graph | 28 chunks in 3 graph layers | 28 chunks in 3 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


Notice that the `repr` for the index lists a couple of `Transforms`. These are added by default and can be specified by the `archive`. Their role is to 'transform' the data into a more usable and universal format.

So:
- Coordinates are renamed to a standard
- Some ERA5 variables are aligned to the variable name given in the path
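The coordinate renaming described above can be sketched as a chain of small functions applied in order. This is an illustration only: `rename_coords` and `ToyTransformCollection` are hypothetical names, a plain dictionary stands in for an `xarray` dataset, and the coordinate names are assumed rather than taken from `edit`.

```python
def rename_coords(mapping):
    """Build a transform that renames keys according to `mapping`."""

    def transform(data):
        # Keys without an entry in `mapping` are kept unchanged.
        return {mapping.get(key, key): value for key, value in data.items()}

    return transform


class ToyTransformCollection:
    """Apply a sequence of transforms one after another."""

    def __init__(self, *transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        for transform in self.transforms:
            data = transform(data)
        return data


# Assumed source-specific names mapped onto an assumed standard.
standardise = ToyTransformCollection(
    rename_coords({"lat": "latitude", "lon": "longitude"}),
)

raw = {"lat": [-90.0, 90.0], "lon": [0.0, 359.75], "t": [250.0]}
standardised = standardise(raw)
```

Because each transform takes data and returns data, an archive can supply its own defaults and a user can append further steps without changing the loader.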

### ACCESS-G
Now that we have data from ERA5, let's show how to get data from ACCESS-G.

In [17]:
ACCESS_index = edit.data.archive.NCI.ACCESS.analysis('pl/air_temp', region = 'G')
ACCESS_index

In [18]:
ACCESS_index['2023-01-01T00']

| | Array | Chunk |
|---|---|---|
| Bytes | 324.00 MiB | 65.03 MiB |
| Shape | (1, 27, 1536, 2048) | (1, 15, 924, 1230) |
| Dask graph | 8 chunks in 2 graph layers | 8 chunks in 2 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


## Comparison
There is no meaningful difference between how the two indexes are used.

Therefore it is quite easy for a user to change their data source, and all the complex data finding is abstracted away.
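One way to picture this abstraction is a shared base class that handles the time indexing, with each data source supplying only its own file layout. The sketch below is not how `edit` is implemented: the class names, root directories, and directory layouts are all hypothetical, and it returns file paths rather than loaded data.

```python
from abc import ABC, abstractmethod
from datetime import datetime
from pathlib import Path


class ToyIndex(ABC):
    """Shared interface: index with a time string, get a file path back."""

    def __init__(self, root, variable):
        self.root = Path(root)
        self.variable = variable

    def __getitem__(self, time):
        # Every source parses the time in the same way.
        return self.path_for(datetime.strptime(time, "%Y-%m-%dT%H"))

    @abstractmethod
    def path_for(self, when):
        """Return the file expected to hold data for `when`."""


class ToyReanalysis(ToyIndex):
    # Hypothetical layout: <root>/<variable>/<year>/<variable>_<YYYYMM>.nc
    def path_for(self, when):
        return self.root / self.variable / f"{when:%Y}" / f"{self.variable}_{when:%Y%m}.nc"


class ToyForecast(ToyIndex):
    # Hypothetical layout: <root>/<YYYYMMDD>/<HHMM>/<variable>.nc
    def path_for(self, when):
        return self.root / f"{when:%Y%m%d}" / f"{when:%H%M}" / f"{self.variable}.nc"


sources = [
    ToyReanalysis("/toy/reanalysis", "t"),
    ToyForecast("/toy/forecast", "air_temp"),
]
# The calling code is identical regardless of the source.
paths = [source["2023-01-01T00"] for source in sources]
```

Swapping the data source changes only which class is constructed; the indexing expression stays the same.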

## Patterns
A pattern index can be used to save and access data generated by the user.

Several are already implemented and can be easily configured.

In [29]:
edit.data.patterns.__dir__()

['__name__',
 '__doc__',
 '__package__',
 '__loader__',
 '__spec__',
 '__path__',
 '__file__',
 '__cached__',
 '__builtins__',
 'utils',
 'default',
 'PatternIndex',
 'PatternTimeIndex',
 'PatternForecastIndex',
 'PatternVariableAware',
 'argument',
 'Argument',
 'ArgumentExpansion',
 'ArgumentExpansionVariable',
 'direct',
 'Direct',
 'TemporalDirect',
 'ForecastDirect',
 'DirectVariable',
 'ForecastDirectVariable',
 'TemporalDirectVariable',
 'expanded_date',
 'ExpandedDate',
 'TemporalExpandedDate',
 'ForecastExpandedDate',
 'ExpandedDateVariable',
 'ForecastExpandedDateVariable',
 'TemporalExpandedDateVariable',
 'static',
 'Static']

In [28]:
edit.data.patterns.ExpandedDate('temp', prefix = 'data_').search('2020-01-01T00')

PosixPath('/jobfs/108073067.gadi-pbs/tmp1cb8bhae/2020/01/01/data_20200101T0000.nc')

These can be set up to automatically split out variables from an `xarray` dataset, or to allow temporal indexing.
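The path returned by `ExpandedDate` above suggests a layout of nested year, month, and day directories with a timestamped filename. A minimal sketch that reproduces that layout follows. The `expanded_date_path` function is hypothetical, not part of `edit`, and it treats the root directory literally, whereas in the example above `'temp'` resolved to a temporary directory.

```python
from datetime import datetime
from pathlib import Path


def expanded_date_path(root, time, prefix="", extension=".nc"):
    """Build <root>/<YYYY>/<MM>/<DD>/<prefix><YYYYMMDD>T<HHMM><extension>."""
    when = datetime.strptime(time, "%Y-%m-%dT%H")
    directory = Path(root) / f"{when:%Y}" / f"{when:%m}" / f"{when:%d}"
    return directory / f"{prefix}{when:%Y%m%dT%H%M}{extension}"


# "toy_root" is an assumed directory name used only for illustration.
path = expanded_date_path("toy_root", "2020-01-01T00", prefix="data_")
```

Deriving the path purely from the timestamp means the same function can serve both saving and searching, which is what lets a pattern behave like an archive.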