# Loading Data <img align="right" src="../resources/csiro_easi_logo.png">

This notebook will show you how to load data from the Open Data Cube (ODC) using the Python API. 

#### Index
- [Setup](#Setup)
- [Create a connection to the datacube](#Create-a-connection-to-the-datacube)
- [The datacube load() function](#The-datacube-load()-function)
- [Xarray Dataset and DataArray objects](#Xarray-Dataset-and-DataArray-objects)
- [Specify measurements](#Specify-measurements)
   - [Measurement aliases](#Measurement-aliases)
- [Specify time range](#Specify-time-range)
- [Specify spatial extent](#Specify-spatial-extent)
- [Specify output CRS and resolution](#Specify-output-CRS-and-resolution)
   - [No default CRS](#No-default-CRS)
   - [Reproject](#Reproject)
   - [Resample](#Resample)
   - [Align](#Align)
- [Reuse load parameters](#Reuse-load-parameters)
   - [Load parameters that match an existing dataset](#Load-parameters-that-match-an-existing-dataset)
- [No available data](#No-available-data)
- [Load large datasets with Dask](#Load-large-datasets-with-Dask)
   - [Data type](#Data-type)
   - [Rechunking](#Rechunking)
   - [Dask persist and compute](#Dask-persist-and-compute)

## Setup 

### Dask
Create or reuse a dask cluster.

In [None]:
# Optional EASI tools
import sys, os
os.environ['USE_PYGEOS'] = '0'
sys.path.append(os.path.expanduser('../scripts'))
from easi_tools import EasiDefaults
import notebook_utils
easi = EasiDefaults()
cluster, client = notebook_utils.initialize_dask(use_gateway=False, wait=True)
display(cluster if cluster else client)
print(notebook_utils.localcluster_dashboard(client, server=easi.hub))

#### AWS configuration
To use data in public requester-pays buckets, run the following code (once per dask cluster):

In [None]:
from datacube.utils.aws import configure_s3_access
configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)

## Create a connection to the datacube

The `Datacube` object created below (which we have named `dc`) will be used to interact with the ODC database.

The EASI Jupyter environment is configured with default credentials to access the ODC database so there is no need to pass connection details to the `Datacube()` function.

In [None]:
import datacube
dc = datacube.Datacube()

## The datacube `load()` function

The ODC documentation describes each of the options available for loading data: https://datacube-core.readthedocs.io/en/latest/dev/api/generate/datacube.Datacube.load.html

At a minimum specify the `product name` and `location extents` of the data you wish to load.

Below we define `latitude`/`longitude` extents for a small area and then load the [Landsat 8 surface reflectance](https://explorer.csiro.easi-eo.solutions/landsat8_c2l2_sr) product.

> The examples below use **Dask** via the `dask_chunks` option. If you use a local Dask cluster (`use_gateway=False`) or no Dask at all, the data is loaded into the memory of your Jupyter Lab instance. If you exceed the available memory, your Jupyter Lab instance will crash.

> Dask loading is quick because no data is loaded yet; only the data load request has been added to the Dask scheduler.

In [None]:
# This configuration is read from the defaults for this system. 
# Examples are provided below each line to show how to set these manually.

study_area_lat = easi.latitude
# study_area_lat = (39.2, 39.3)

study_area_lon = easi.longitude
# study_area_lon = (-76.7, -76.6)

product = easi.product('landsat')
# product = 'landsat8_c2l2_sr'

set_time = easi.time
# set_time = ('2020-08-01', '2020-12-01')

set_crs = easi.crs('landsat')
# set_crs = 'EPSG:32618'

set_resolution = easi.resolution('landsat')
# set_resolution = (-30, 30)

In [None]:
data = dc.load(
    product = product, 
    x = study_area_lon,
    y = study_area_lat,
    output_crs=set_crs,
    resolution=set_resolution,
    dask_chunks = {"time":1}
)
display(data)

## Xarray `Dataset` and `DataArray` objects

The load function returns an `xarray.Dataset` object as a three-dimensional gridded dataset containing the requested product measurements.  

Calling `display(data)` in a Jupyter notebook (as in the cell above) prints a helpful overview of the dataset, showing the following:

- `Dimensions` - The labels and sizes of each of the dataset's dimensions.
- `Coordinates` - The corresponding values along each dimension. 
- `Data variables` - An `xarray.DataArray` for each of the product measurements.
- `Attributes` - Additional metadata including the coordinate reference system (`crs`)

Click on the <html><i class="fa fa-file-text-o" style="font-size:18px;color:gray"></i></html> icon to the right of one of the `Coordinates` or `Data variables` above to see its attributes (e.g. `units`, `resolution`, and `nodata` value where applicable). Click on the <html><i class="fa fa-database" style="font-size:18px;color:gray"></i></html> icon to reveal the underlying data array.

For more details on the `xarray.Dataset` structure, see http://xarray.pydata.org/en/stable/data-structures.html#dataset.

Accessing a single variable by name returns an `xarray.DataArray` object. The actual data array can then be accessed with:
- `xarray.DataArray.data` - The array’s data as a dask or numpy array
- `xarray.DataArray.values` - The array’s data as a numpy.ndarray

In [None]:
from IPython.display import HTML  # public import path for the HTML display helper
# Return a DataArray
display(HTML('<h3>A <em>DataArray</em> object:</h3>'))
blue = data["blue"]  # or data.blue
display(blue)

# Return a Dataset
display(HTML('<h3>A <em>Dataset</em> object:</h3>'))
blue_ds = data[["blue"]]  # List of measurements
display(blue_ds)

## Specify measurements

By default, `dc.load()` will return all measurements available for the specified product. Use the `measurements` parameter to specify a subset of measurements.

> With Dask it can be just as easy to request all measurements (the default) and then operate on only the variables required.

In [None]:
measurements = ("red", "green", "blue")
data = dc.load(
    product = product,
    measurements=measurements,
    x = study_area_lon,
    y = study_area_lat,
    output_crs=set_crs,
    resolution=set_resolution,
    dask_chunks = {"time":1}
)
display(data)

### Measurement aliases

Product measurements often have aliases that can also be used when loading. We can see the aliases by using the `Datacube.list_measurements()` function.

For example, let's look at the `landsat8_c2l2_sr` product.

Load the `"qa_pixel"` measurement using the `"pixel_quality"` alias instead. The resulting xarray variable will be named by the alias in this case.

In [None]:
all_meas = dc.list_measurements()
ls8_meas = all_meas.loc[product]
disp_columns = ["name", "aliases"]
ls8_meas[disp_columns]

In [None]:
data = dc.load(
    product = product,
    measurements = ["pixel_quality"],
    x = study_area_lon,
    y = study_area_lat,
    output_crs=set_crs,
    resolution=set_resolution,
    dask_chunks = {"time":1}
)
display(data)

# Note that the Xarray variable is named as the alias

## Specify time range

For the query above all available timesteps for the dataset were returned. To specify a time range (or a single time slice) you can pass a `time` option to `load()`, as below.

Time formats are parsed according to standard Python date-time parsing (https://dateutil.readthedocs.io/en/stable/parser.html).

The `time` option accepts a single value or a range.
- `time = ('2013-06-22 16:44.33', '2015-01-10')`
- `time = '2016-03'`
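
As a rough illustration of how a partial date such as `'2016-03'` can stand for a whole period, the sketch below expands a year or year-month string into its first and last instants using only the standard library. The helper `expand_partial_date` is hypothetical and is not part of the datacube API. It assumes that a partial date is treated as the full period it names; datacube's own parsing may differ in detail.

```python
import calendar
from datetime import datetime


def expand_partial_date(text):
    """Return (start, end) datetimes covered by a 'YYYY' or 'YYYY-MM' string.

    Hypothetical helper for illustration only; not part of datacube.
    """
    parts = [int(p) for p in text.split("-")]
    if len(parts) == 1:  # 'YYYY' covers the whole year
        year = parts[0]
        return datetime(year, 1, 1), datetime(year, 12, 31, 23, 59, 59)
    if len(parts) == 2:  # 'YYYY-MM' covers the whole month
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return datetime(year, month, 1), datetime(year, month, last_day, 23, 59, 59)
    raise ValueError(f"Expected 'YYYY' or 'YYYY-MM', got {text!r}")


start, end = expand_partial_date("2016-03")
print(start, "to", end)

# Under the same assumption, a range such as ('2020-02', '2020-04') spans from
# the start of the first period to the end of the last one.
range_start = expand_partial_date("2020-02")[0]
range_end = expand_partial_date("2020-04")[1]
print(range_start, "to", range_end)
```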

In [None]:
time = ('2020-02', '2020-04')
data = dc.load(
    product = product,
    measurements = measurements,
    x = study_area_lon,
    y = study_area_lat,
    time = time,
    output_crs=set_crs,
    resolution=set_resolution,
    dask_chunks = {"time":1}
)
display(data)

## Specify spatial extent

The following keys specify the ***search*** spatial extent in `datacube.load()`.

- `x`, `lon`, `long`, `longitude` - Single or a range of values
- `y`, `lat`, `latitude` - Single or a range of values
- `crs` - Coordinate reference system (CRS) in which to interpret the `x` and `y` values

In the above examples we have used `x = longitude range` and `y = latitude range`. 

These are combined into a Polygon (or Line or Point) with the attached CRS. The default CRS is `EPSG:4326`. Use a different CRS to, for example, specify UTM coordinates.
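
The choice between Polygon, Line and Point can be sketched as follows. This is a minimal illustration of the rule described above, not the datacube implementation: `search_geometry_kind` is a hypothetical helper, and it assumes that a tuple or list means a `(min, max)` range while any other value is a single coordinate.

```python
def search_geometry_kind(x, y):
    """Name the geometry implied by the x and y search values.

    Hypothetical helper for illustration; not part of the datacube API.
    A tuple or list is treated as a (min, max) range, anything else as a
    single value.
    """
    x_is_range = isinstance(x, (tuple, list))
    y_is_range = isinstance(y, (tuple, list))
    if x_is_range and y_is_range:
        return "Polygon"  # a bounding box
    if x_is_range or y_is_range:
        return "Line"  # one fixed coordinate, one range
    return "Point"  # both coordinates fixed


print(search_geometry_kind((-76.7, -76.6), (39.2, 39.3)))  # Polygon
print(search_geometry_kind(-76.65, (39.2, 39.3)))          # Line
print(search_geometry_kind(-76.65, 39.25))                 # Point
```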

The examples above request the `x`/`y` coordinates as longitude and latitude values (i.e., `EPSG:4326`). Here we load the same data by defining the search coordinates in the product's UTM CRS (`set_crs`, for example `EPSG:32618`), and show that the two results are equal.

> The `landsat8_c2l2_sr` native CRS varies, but in this part of the United States, it is WGS 84 / UTM zone 18N (EPSG:32618) and the `x`/`y` coordinates are defined in metres with 30m resolution.

In [None]:
# Get the UTM coords from data

xs = data.x.data
ys = data.y.data
x = (xs.min(), xs.max())
y = (ys.min(), ys.max())

print(f'UTM x-coords: {x}')
print(f'UTM y-coords: {y}')

In [None]:
data_xy = dc.load(
    product = product,
    x = x,
    y = y,
    crs = set_crs,
    measurements = measurements,
    time = time,
    output_crs = set_crs,
    resolution = set_resolution,
    dask_chunks = {"time":1}
)
print(f"Datasets are equal: {data.equals(data_xy)}")
display(data_xy)

## Specify output CRS and resolution 

The following keys specify the ***result*** CRS, resolution and resampling in `datacube.load()`.
- `output_crs` - Target CRS (default is the native CRS, if defined)
- `resolution` - Target pixel resolution _(y-pixel, x-pixel)_ (default is the native resolution)
- `resampling` - Resampling method when resolution is different to source resolution (default is nearest neighbor)
- `align` - Shift coordinate values to align with pixel edges _(y-pixel, x-pixel)_

The next sections show examples of each.

### No default CRS

Some products do not have a default CRS, in which case you will need to provide `output_crs` and `resolution` parameters when calling `load()`, otherwise you will receive the following error:

```
ValueError: Product has no default CRS. Must specify 'output_crs' and 'resolution'
```

> This can be the case for products in a UTM projection where the spatial extent crosses multiple UTM zones or CRSs.

### Reproject

As an example, load our product into the WGS84 (EPSG:4326) CRS.

Note that:
- The spatial dimensions of the result are now labelled `longitude` and `latitude`, because the result is in a geographic coordinate system.
- The `resolution` is provided in `output_crs` units.
- The `resolution` includes the direction of the coordinates in the result, indicated by a positive or negative number. Positive resolution values result in coordinates in the range `minimum..maximum`. A negative resolution will reverse the coordinate range to be `maximum..minimum`.
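
The effect of the resolution sign on coordinate order can be sketched with plain Python. The helper `pixel_centres` is hypothetical (not part of the datacube API) and assumes the extent is a whole multiple of the resolution; it only illustrates the ordering rule described above.

```python
def pixel_centres(lower, upper, resolution):
    """Pixel-centre coordinates covering [lower, upper] at the given resolution.

    Hypothetical helper for illustration; not part of the datacube API.
    Assumes the extent is a whole multiple of the resolution.
    """
    step = abs(resolution)
    count = round((upper - lower) / step)
    if resolution > 0:
        # Positive resolution: start at the lower edge and count up.
        return [lower + step * (i + 0.5) for i in range(count)]
    # Negative resolution: start at the upper edge and count down.
    return [upper - step * (i + 0.5) for i in range(count)]


print(pixel_centres(0, 90, 30))   # [15.0, 45.0, 75.0]
print(pixel_centres(0, 90, -30))  # [75.0, 45.0, 15.0]
```

With a resolution such as `(-30, 30)`, the `y` coordinate therefore runs from its maximum to its minimum, while `x` runs from its minimum to its maximum.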

In [None]:
data = dc.load(
    product = product,
    x = study_area_lon,
    y = study_area_lat,
    time = time,
    output_crs = "EPSG:4326",
    resolution = (-0.01, 0.01),  # Approximately 1 km
    dask_chunks = {"time":1}
)
display(data)

### Resample

The `resolution` parameter essentially defines a resampling operation. The `resampling` parameter can be used to specify which method to use. Default is `resampling="nearest"`.

See https://datacube-core.readthedocs.io/en/latest/api/indexed-data/generate/datacube.Datacube.load.html for the full range of options, including allowing different resampling methods for different measurements (with a dict).

Here we downsample the product from its native 30 m resolution to 100 m, using `average` resampling.

In [None]:
data = dc.load(
    product = product,
    x = study_area_lon,
    y = study_area_lat,
    time = time,
    output_crs = "EPSG:3857",
    resolution = (-100, 100),
    resampling = "average",
    dask_chunks = {"time":1}
)
display(data)

### Align

The `align` parameter shifts the coordinates to align with the pixel boundary. It is applied as an offset after the rounding of grid coordinates to the resolution. Values are in the range `0 <= align <= absolute(resolution)`.
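
As a simplified sketch of the idea, the function below snaps the lower edge of an extent onto a pixel grid whose edges sit at `align + k * resolution`. `snapped_edge` is a hypothetical helper that assumes a positive resolution; it is not the datacube implementation, which also handles negative resolutions and the grid width.

```python
import math


def snapped_edge(lower, resolution, align=0):
    """Snap a lower extent edge down to the pixel grid.

    Hypothetical helper for illustration; a simplified sketch, not the
    datacube implementation. Assumes a positive resolution.
    Grid edges sit at `align + k * resolution` for integer k.
    """
    return math.floor((lower - align) / resolution) * resolution + align


print(snapped_edge(1234, 100))            # 1200
print(snapped_edge(1234, 100, align=25))  # 1225
print(snapped_edge(1210, 100, align=25))  # 1125
```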

> While most scenarios should be handled correctly, there are a few places in the GDAL, Rasterio, Datacube stack that assume the top-left corner is the origin of the array. For most products, pixel boundary grid alignment will have been defined and checked in a preprocessing stage, but in case it has not, these options are available to adjust the grids in datacube.

In [None]:
# This example demonstrates the use of the `align` parameter but it is not necessarily an appropriate operation for this product
# TODO: Find an example where `align` is required

data = dc.load(
    product = product,
    x = study_area_lon,
    y = study_area_lat,
    time = time,
    output_crs = "EPSG:3857",
    resolution = (-100, 100),
    align = (25, 25),
    dask_chunks = {"time":1}
)
display(data)

## Reuse load parameters

For ease of use, it is common to store the load parameters in a Python dictionary, which can be passed to the `load()` function using Python's keyword expansion operator `**`.
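
The keyword expansion itself can be tried without a datacube connection. In the sketch below, `describe_load` is a hypothetical stand-in for `dc.load()` that only reports the keyword arguments it receives; the dictionary values are examples. It also shows one way to derive a variation of a query without modifying the original dictionary.

```python
def describe_load(**kwargs):
    """Stand-in for dc.load() that reports the keyword arguments it received."""
    return ", ".join(f"{key}={value!r}" for key, value in sorted(kwargs.items()))


base_query = {
    "product": "landsat8_c2l2_sr",
    "time": ("2013", "2015"),
    "resolution": (-0.01, 0.01),
}

# Passing **base_query is the same as writing each key as a keyword argument.
print(describe_load(**base_query))

# Build a variation without modifying the original dictionary.
query_2020 = {**base_query, "time": ("2020-02", "2020-04")}
print(describe_load(**query_2020))
```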

In [None]:
query = {
    "product": product,
    "latitude": study_area_lat,
    "longitude": study_area_lon,
    "time": ("2013", "2015"),
    "output_crs": "EPSG:4326",
    "resolution": (-0.01, 0.01),
    "dask_chunks": {"time": 1}
}
data = dc.load(**query)
display(data)

### Load parameters that match an existing dataset 

Use the `like` parameter to load a new product with the same extent, CRS, and resolution as an existing dataset.

In [None]:
data2 = dc.load(
    product = "landsat7_c2l2_sr",
    like = data,
    dask_chunks = {"time": 1}
)
display(data2)

## No available data

For some product/location combinations there may be no data available, or it may not be indexed in the datacube yet. In this case an empty dataset will be returned.

You can view available products along with the data coverage at the EAIL Datacube Explorer: https://explorer.eail.easi-eo.solutions/products/landsat8_c2l2_sr

In [None]:
data3 = dc.load(
    product = product,
    x = (39.2, 39.3), 
    y = (-73.5, -73.4), # Note that the latitude has changed to be outside the available data area 
    dask_chunks = {"time": 1}
)
display(data3)

## Load large datasets with Dask

It is possible to load datasets with and without Dask. The examples above have used Dask, but they are also small enough queries that fit within our environment's memory. If we look at the size of our `xarray.Dataset` in MB, we see that it is fairly small:

In [None]:
data_mb = data.nbytes / (1024 ** 2) 
print(f"Dataset size: {data_mb:.2f} MB")

The Python Dask library allows us to load a dataset much larger than a single environment's memory by distributing the data across its dask "workers". Each worker is a dedicated compute node. Dask datasets are processed "lazily", which means data is only retrieved from storage and processed when it is needed.

Datacube and xarray work well with Dask. Use the `dask_chunks` parameter to `load()` to specify the size of dask chunks (across dimensions). Data are not loaded and processed until an output is requested, for example: convert to a numpy array, create a plot, or write data to a file.

**See the [Dask appendix notebook](A2%20-%20Dask.ipynb) for getting started and more information. See also the Dask arrays introduction https://docs.dask.org/en/latest/array.html.**

Let's increase the extent of our query and use the `dask_chunks` parameter. Setting the chunks for the `x` and `y` dimensions to `2048` (and using the default chunk size of `1` for `time`) means that behind the scenes our data will be split into many `1 x 2048 x 2048` pixel arrays.

In [None]:
latitude=easi.latitude_big
longitude=easi.longitude_big
chunks = {"x": 2048, "y": 2048}  # "x" and "y" match the dimension names from the datacube.load() result, so they may depend on `output_crs`
big_data = dc.load(
    product = product,
    output_crs = "EPSG:3857",
    resolution = (-300, 300),
    x = longitude,
    y = latitude,
    group_by = 'solar_day',
    dask_chunks = chunks,
)
display(big_data)  # This is a "lazy" request so it's fast to show the structure but no data has been loaded yet

Selecting the <html><i class="fa fa-database" style="font-size:18px;color:gray"></i></html> icon for one of the measurements shows the details of the `dask.Array`. Each variable is much larger than in the previous example and has many more chunks. With data of this size, extra thought needs to be given to managing memory on your Jupyter node or in your Dask cluster. You will not be able to load this dataset into RAM, and there could be significant overhead in managing so many calculation tasks. With a good understanding of Dask, it is possible to process very large quantities of data.

> Dask strategies (in [Dask appendix notebook](A2%20-%20Dask.ipynb)) suggests the chunk size should be no more than 100 MB.

We can see from the `Dimensions` of the dataset above that it is significantly larger than before.

In [None]:
# Get the size of `big_data` in bytes and convert to GB
big_data_gb = big_data.nbytes / (1024 ** 3) 
print(f"Dataset size: {big_data_gb:.2f} GB")

### Data type

In the example above our data are `uint16`, and chunks of `(1, 2048, 2048)` are `8 MB` each. Data processing, such as scaling and calculating indices, will result in `float64` data, which will increase the chunk memory size.
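
The arithmetic behind these numbers is shown below. The item sizes are the standard byte widths of the two data types; `chunk_mib` is a small helper written for this illustration.

```python
# Bytes per element for the two data types discussed above.
ITEMSIZE = {"uint16": 2, "float64": 8}


def chunk_mib(shape, dtype):
    """Memory of one chunk in MiB (1 MiB = 1024**2 bytes)."""
    elements = 1
    for size in shape:
        elements *= size
    return elements * ITEMSIZE[dtype] / 1024 ** 2


chunk_shape = (1, 2048, 2048)  # (time, y, x)
print(chunk_mib(chunk_shape, "uint16"))   # 8.0
print(chunk_mib(chunk_shape, "float64"))  # 32.0
```

Converting a chunk from `uint16` to `float64` therefore multiplies its memory footprint by four.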

### Rechunking

Datacube loads data in spatial layers and places these into an xarray `time, y, x` cube. If further processing (valid-data checks, masking, scaling, etc.) consists of per-pixel operations, the Dask chunks can remain as they are (spatially oriented). For time series processing, consider rechunking the xarray Dask dataset along the time dimension.

**Be aware that rechunking can result in very high memory usage when it comes to the point of running the calculations.**

A Dask chunk size of `-1` indicates we want all data along that dimension chunked together. Below we have reduced the `x`/`y` chunk sizes to even out the memory required for each chunk.
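
To see why the spatial chunk sizes are reduced, the sketch below compares the number of chunks and the memory per chunk before and after rechunking. The array shape is hypothetical and chosen only to make the arithmetic easy to follow; `chunk_summary` is a helper written for this illustration and does not use Dask itself.

```python
import math


def chunk_summary(shape, chunks, itemsize=2):
    """Return (number of chunks, MiB per full chunk) for a (time, y, x) array.

    A chunk size of -1 means "one chunk spanning the whole dimension".
    itemsize=2 corresponds to uint16 data.
    """
    sizes = [dim if c == -1 else c for dim, c in zip(shape, chunks)]
    count = math.prod(math.ceil(dim / size) for dim, size in zip(shape, sizes))
    mib = math.prod(sizes) * itemsize / 1024 ** 2
    return count, mib


# Hypothetical array shape, chosen only to make the arithmetic easy to follow.
shape = (200, 10000, 10000)  # (time, y, x)

print(chunk_summary(shape, (1, 2048, 2048)))  # (5000, 8.0)
print(chunk_summary(shape, (-1, 512, 512)))   # (400, 100.0)
```

In this hypothetical case, placing all 200 time steps in one chunk gives 100 MiB per chunk even with `512 x 512` spatial tiles, so keeping the original `2048 x 2048` tiles would have produced much larger chunks.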

> Click the <html><i class="fa fa-database" style="font-size:18px;color:gray"></i></html> icon to the right of any of the data variables to view the new chunking arrangement.

In [None]:
rechunked = big_data.chunk({'time':-1, 'x':512, 'y':512})  # Example only
display(rechunked)

### Dask persist and compute

The Dask `persist()` and `compute()` functions control how Dask manages its chunks. See https://distributed.dask.org/en/latest/memory.html for a full discussion and additional options.

> Use `persist()` to maintain the data on the cluster and use `compute()` to bring a small result back to the notebook environment's memory. **These should be used sparingly and with good knowledge of Dask operations.**

<div class="alert alert-info">
    <p>Please review the other Dask notebooks in this repository to learn more about how to use Dask, particularly when using large datasets.</p>
</div>

***More examples to come...***