## Introduction to loading data from Digital Earth Australia
* **Compatibility:** Notebook currently compatible with both the `NCI` and `DEA Sandbox` environments
* **Products used:** 
[`ls7_nbart_geomedian_annual`](https://explorer.sandbox.dea.ga.gov.au/ls7_nbart_geomedian_annual) 
* **Prerequisites:** Users of this notebook should have a basic understanding of:
    * How to run a [Jupyter notebook](Introduction_to_Jupyter.ipynb)
    * The basic structure of the DEA [satellite datasets](Introduction_to_DEA.ipynb)
    * Inspecting available [DEA products and measurements](Introduction_to_products_and_measurements.ipynb)

## Background
All Digital Earth Australia (DEA) analyses require the basic construction of a data query that specifies the what, where, and when of the data request.
Each query returns a multi-dimensional `xarray.Dataset` containing the contents of your request.
It is essential to understand the `xarray.Dataset` as it is fundamental to the structure of data loaded from the datacube.
Manipulations, transformations and visualisation of the `xarray.Dataset` contents provide datacube users with the ability to explore and analyse DEA datasets, as well as pose and answer scientific questions.

## Description
This notebook will introduce how to load data from the datacube through the construction of a query and use of the `dc.load()` function.
Topics covered include:
* Loading data using `dc.load`
* Reading the resulting `xarray.Dataset` object
* Customising parameters passed to the `dc.load` function
  * Measurements
  * Coordinate reference system (CRS) reprojection 

## Getting started
To run this introduction to querying, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages
First we need to load the `datacube` package.
This will allow us to query the datacube database and load some data.

In [1]:
import datacube

### Connect to the datacube
We then need to connect to the datacube database.
We will then be able to use the `dc` datacube object to load data.

In [2]:
dc = datacube.Datacube(app="04_Loading_data")

## Loading data using `dc.load`

Loading data from the datacube uses the [`dc.load()`](https://datacube-core.readthedocs.io/en/latest/dev/api/generate/datacube.Datacube.load.html) function.

The function requires the following minimum arguments:

* **product**: A specific product to load (to revise DEA products, see the [Introduction_to_products_and_measurements](Intro_to_products_and_measurements.ipynb) notebook).
* **x**: Defines the spatial region in the *x* dimension. By default, the *x* and *y* arguments accept queries in the WGS84 geographic coordinate system, identified by the EPSG code *4326*.
* **y**: Defines the spatial region in the *y* dimension. The dimensions ``longitude``/``latitude`` and ``x``/``y`` can be used interchangeably.
* **time**: Defines the temporal extent. The time dimension can be specified using a tuple of datetime objects or strings with "YYYY", "YYYY-MM" or "YYYY-MM-DD" format. 
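
As a rough illustration of how partial date strings can be read as ranges, the sketch below expands "YYYY", "YYYY-MM" and "YYYY-MM-DD" strings into explicit first and last dates. The helper `expand_time_string` is hypothetical and uses only the Python standard library. It assumes a partial date stands for the whole year or month it names, and it is not the parsing code used by `datacube`.

```python
import calendar
from datetime import date


def expand_time_string(value):
    """Return the (first_day, last_day) covered by a date string.

    Hypothetical helper for illustration only: accepts "YYYY", "YYYY-MM"
    or "YYYY-MM-DD" and assumes a partial date means the whole period.
    """
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:                      # "YYYY": the whole year
        year = parts[0]
        return date(year, 1, 1), date(year, 12, 31)
    if len(parts) == 2:                      # "YYYY-MM": the whole month
        year, month = parts
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    if len(parts) == 3:                      # "YYYY-MM-DD": a single day
        day = date(*parts)
        return day, day
    raise ValueError(f"Unsupported time string: {value!r}")


for text in ("2015", "2015-02", "2015-01-01"):
    start, end = expand_time_string(text)
    print(f"{text!r:>14} covers {start} to {end}")
```

Under this reading, a coarser string simply describes a longer period, which is why a single year string can stand in for a full start and end date pair.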

Let's run a query to load data from the Landsat 7 NBAR-T annual geomedian product for Moreton Bay in southern Queensland. 
For this example, we can use the following parameters:

* **product**: `ls7_nbart_geomedian_annual`
* **x**: `(153.3, 153.4)`
* **y**: `(-27.5, -27.6)`
* **time**: `("2015-01-01", "2016-01-01")`

Run the following cell to load all datasets from the `ls7_nbart_geomedian_annual` product that match this spatial and temporal extent:

In [3]:
ds = dc.load(product="ls7_nbart_geomedian_annual",
             x=(153.3, 153.4),
             y=(-27.5, -27.6),
             time=("2015-01-01", "2016-01-01"))

print(ds)

### Understanding the resulting `xarray.Dataset`
The variable `ds` now holds an `xarray.Dataset` containing all data that matched the spatial and temporal query parameters passed to `dc.load`.

*Dimensions* 
* Identifies the number of temporal datasets returned in the search as well as the number of pixels in the `x` and `y` directions of the data query.

*Coordinates* 
* `time` identifies the date attributed to each returned dataset
* `x` and `y` are the coordinates for the pixels within the spatial bounds of your query

*Data variables*
* These are the measurements available for the nominated product. 
For every date (`time`) returned by the query, the measured value at each pixel (`y`, `x`) is returned as an array for each measurement.

*Attributes*
* `crs` identifies the coordinate reference system of the loaded data. 

## Customising the `dc.load` function

The `dc.load()` function can be tailored to refine a query.

Customisation options include:

* **measurements:** The measurements argument is a list of measurement names, as listed in `dc.list_measurements()`. 
For satellite datasets, measurements contain data for each individual satellite band (e.g. near infrared). 
If not provided, all measurements for the product will be returned.
* **crs:** The coordinate reference system (CRS) of the query coordinates (`x`, `y`) is assumed to be `WGS84`/`EPSG:4326` unless the `crs` field is supplied, even if the stored data is in another projection or the `output_crs` is specified. 
* **group_by:** EO-specific datasets based around scenes can have multiple observations per day with slightly different time stamps as the satellite collects data along its path.
These observations can be combined by reducing the `time` dimension to the day level using `group_by="solar_day"`.
* **output_crs** and **resolution**: To reproject the data or change its resolution, supply the `output_crs` and `resolution` fields. For example, to reproject data to the Australian Albers (`EPSG:3577`) projection system at 25 m resolution:
        
```
dc.load(product='ls5_nbar_albers', 
        x=(148.15, 148.2), 
        y=(-35.15, -35.2), 
        time=('1990', '1991'), 
        output_crs='EPSG:3577', 
        resolution=(-25, 25))
```
             

For help or more customisation options, run `help(dc.load)` in an empty cell. 
Example syntax on the use of these options follows in the cells below.
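
To illustrate the idea behind grouping by solar day, the sketch below shifts UTC timestamps by an approximate local solar offset (four minutes per degree of longitude) and groups them by the resulting calendar date. The function names, timestamps and longitude are made up for the example, and this is a simplified illustration rather than the code `datacube` runs.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def approx_solar_date(utc_time, longitude):
    """Shift a UTC time by roughly 4 minutes per degree of longitude."""
    offset = timedelta(seconds=int(longitude * 240))
    return (utc_time + offset).date()


def group_by_solar_date(utc_times, longitude):
    """Group timestamps that fall on the same approximate solar day."""
    groups = defaultdict(list)
    for t in utc_times:
        groups[approx_solar_date(t, longitude)].append(t)
    return dict(groups)


# Hypothetical acquisition times (UTC): two from one pass, one from a later pass
observations = [
    datetime(2015, 1, 1, 23, 50, 10),
    datetime(2015, 1, 1, 23, 50, 34),
    datetime(2015, 1, 17, 23, 50, 12),
]

grouped = group_by_solar_date(observations, longitude=153.35)
for solar_date, times in grouped.items():
    print(solar_date, "->", len(times), "observation(s)")
```

The two timestamps taken seconds apart end up in the same group, which is the effect that lets adjacent scenes from one satellite pass be combined into a single time step.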

### Specifying measurements
To load data from only the "red", "blue" and "green" satellite bands, we can add `measurements=["red", "blue", "green"]` to our query:

In [5]:
# Note the optional inclusion of the measurements list
dataset_rgb = dc.load(product="ls7_nbart_geomedian_annual",
                      measurements=["red", "blue", "green"],
                      x=(153.3, 153.4),
                      y=(-27.5, -27.6),
                      time=("2015-01-01", "2016-01-01"))

print(dataset_rgb)

<xarray.Dataset>
Dimensions:  (time: 2, x: 461, y: 508)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2016-01-01
  * y        (y) float64 -3.156e+06 -3.156e+06 ... -3.168e+06 -3.168e+06
  * x        (x) float64 2.067e+06 2.067e+06 2.067e+06 ... 2.079e+06 2.079e+06
Data variables:
    red      (time, y, x) int16 308 306 312 314 307 308 ... 490 419 400 365 390
    blue     (time, y, x) int16 519 496 480 499 503 506 ... 366 316 287 289 300
    green    (time, y, x) int16 563 555 545 558 552 553 ... 565 487 456 415 460
Attributes:
    crs:      EPSG:3577


Note that the *Data variables* component of the `xarray.Dataset` now includes only the measurements specified in the query.

### CRS reprojection
Certain applications may require that you output your data into a specific `crs`.
You can reproject your output data by specifying the new `output_crs` and identifying the `resolution` required.

In this example, we will reproject our data to a new `crs` (UTM Zone 56S, `EPSG:32756`) and resolution (250 x 250 m):

In [9]:
# This is the same query as the first example in the notebook,
# with output_crs and resolution added.
# Note the different crs attribute in the xarray.Dataset
dataset_crs_reprojected = dc.load(product="ls7_nbart_geomedian_annual",
                                  x=(153.3, 153.4),
                                  y=(-27.5, -27.6),
                                  time=("2015-01-01", "2016-01-01"),
                                  output_crs="EPSG:32756",
                                  resolution=(-250, 250))

print(dataset_crs_reprojected)

<xarray.Dataset>
Dimensions:  (time: 2, x: 40, y: 45)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2016-01-01
  * y        (y) float64 6.958e+06 6.958e+06 6.958e+06 ... 6.947e+06 6.947e+06
  * x        (x) float64 5.296e+05 5.299e+05 5.301e+05 ... 5.391e+05 5.394e+05
Data variables:
    blue     (time, y, x) int16 492 447 420 419 417 421 ... 432 451 466 482 469
    green    (time, y, x) int16 513 462 426 419 414 417 ... 446 474 494 519 491
    red      (time, y, x) int16 269 236 211 209 206 204 ... 289 301 283 304 276
    nir      (time, y, x) int16 168 152 138 148 145 142 ... 183 203 191 213 211
    swir1    (time, y, x) int16 90 84 81 85 86 86 84 ... 109 107 125 119 142 136
    swir2    (time, y, x) int16 79 76 72 79 78 77 74 ... 96 97 113 110 126 123
Attributes:
    crs:      EPSG:32756


Note that because of the coarser 250 m resolution, there are now fewer pixels along the `x` and `y` dimensions (e.g. `x: 40, y: 45` compared to `x: 461, y: 508` in earlier examples).
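
The relationship between pixel size and array size can be sketched with simple arithmetic. The extent below (10,000 m by 11,250 m) is a hypothetical round figure chosen for illustration, not the exact footprint of the query above, and `pixels_along_axis` is likewise a made-up helper.

```python
import math


def pixels_along_axis(extent_m, pixel_size_m):
    """Approximate number of pixels needed to cover an extent."""
    # abs() allows negative pixel sizes such as the -250 used for the y axis
    return math.ceil(extent_m / abs(pixel_size_m))


# Hypothetical extent in metres (width, height); illustrative values only
width_m, height_m = 10_000, 11_250

for pixel_size in (25, 250):
    nx = pixels_along_axis(width_m, pixel_size)
    ny = pixels_along_axis(height_m, pixel_size)
    print(f"{pixel_size:>3} m pixels: x={nx}, y={ny}, total={nx * ny}")
```

Increasing the pixel size tenfold cuts the pixel count along each axis by about a factor of ten, and the total number of pixels by about a factor of one hundred.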

### Loading data using the 'query' syntax
It is often useful to re-use a set of query parameters to load data from multiple products.
To achieve this, we can call `dc.load` with the "query" dictionary syntax.
This involves placing the query parameters we used to load data above inside a Python dictionary object which we can re-use for multiple data loads.

In [13]:
query = {'x': (153.3, 153.4),
         'y': (-27.5, -27.6),
         'time': ('2015-01-01', '2016-01-01')}

print(query)

{'x': (153.3, 153.4), 'y': (-27.5, -27.6), 'time': ('2015-01-01', '2016-01-01')}


We can then use this query dictionary object as an input to `dc.load`:

In [15]:
ds = dc.load(product="ls7_nbart_geomedian_annual",
             **query)

print(ds)

<xarray.Dataset>
Dimensions:  (time: 2, x: 461, y: 508)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2016-01-01
  * y        (y) float64 -3.156e+06 -3.156e+06 ... -3.168e+06 -3.168e+06
  * x        (x) float64 2.067e+06 2.067e+06 2.067e+06 ... 2.079e+06 2.079e+06
Data variables:
    blue     (time, y, x) int16 519 496 480 499 503 506 ... 366 316 287 289 300
    green    (time, y, x) int16 563 555 545 558 552 553 ... 565 487 456 415 460
    red      (time, y, x) int16 308 306 312 314 307 308 ... 490 419 400 365 390
    nir      (time, y, x) int16 207 183 183 189 187 ... 2866 2650 2505 2440 2538
    swir1    (time, y, x) int16 89 88 88 99 112 117 ... 1752 1368 1127 1120 1229
    swir2    (time, y, x) int16 75 98 87 82 91 94 96 ... 894 898 657 573 495 553
Attributes:
    crs:      EPSG:3577


Query dictionaries can contain any set of parameters that would usually be provided to `dc.load`:

In [17]:
query = {'x': (153.3, 153.4),
         'y': (-27.5, -27.6),
         'time': ('2015-01-01', '2016-01-01'),
         'output_crs': 'EPSG:32756',
         'resolution': (-250, 250)}

ds = dc.load(product="ls7_nbart_geomedian_annual",
             **query)

print(ds)


<xarray.Dataset>
Dimensions:  (time: 2, x: 40, y: 45)
Coordinates:
  * time     (time) datetime64[ns] 2015-01-01 2016-01-01
  * y        (y) float64 6.958e+06 6.958e+06 6.958e+06 ... 6.947e+06 6.947e+06
  * x        (x) float64 5.296e+05 5.299e+05 5.301e+05 ... 5.391e+05 5.394e+05
Data variables:
    blue     (time, y, x) int16 492 447 420 419 417 421 ... 432 451 466 482 469
    green    (time, y, x) int16 513 462 426 419 414 417 ... 446 474 494 519 491
    red      (time, y, x) int16 269 236 211 209 206 204 ... 289 301 283 304 276
    nir      (time, y, x) int16 168 152 138 148 145 142 ... 183 203 191 213 211
    swir1    (time, y, x) int16 90 84 81 85 86 86 84 ... 109 107 125 119 142 136
    swir2    (time, y, x) int16 79 76 72 79 78 77 74 ... 96 97 113 110 126 123
Attributes:
    crs:      EPSG:32756
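
Because a query is an ordinary dictionary, variations can be derived without editing the original. The sketch below uses standard Python dictionary unpacking; the 100 m resolution is an arbitrary value picked for the example.

```python
base_query = {"x": (153.3, 153.4),
              "y": (-27.5, -27.6),
              "time": ("2015-01-01", "2016-01-01")}

# Add reprojection settings on top of the base query
reprojected_query = {**base_query,
                     "output_crs": "EPSG:32756",
                     "resolution": (-250, 250)}

# Later keys override earlier ones, so only the resolution changes here
finer_query = {**reprojected_query, "resolution": (-100, 100)}

print(sorted(finer_query))
print(finer_query["resolution"])
print("output_crs" in base_query)  # the base query is left unchanged
```

Each of these dictionaries could then be passed to `dc.load` with `**`, in the same way as the query dictionaries in the cells above.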


## Recommended next steps

To continue following the introductory notebooks in the beginners guide, users are recommended to continue with:

- [Introduction_to_plotting](link to notebook)
- [Run_a_basic_analysis](link to notebook)

Advanced users are recommended to explore:
- [Using load_ard](https://github.com/GeoscienceAustralia/dea-notebooks/blob/Intro_to_products/Frequently_used_code/Using_load_ard.ipynb). This function allows the importing of cloud-free observations from multiple sensors into an xarray dataset. Furthermore, you can query for observations with a user-specified minimum proportion of good-quality pixels (i.e. pixels free of cloud and cloud shadow).
- The [DEA_datasets](https://github.com/GeoscienceAustralia/dea-notebooks/tree/develop/DEA_datasets) part of the repository, where you can explore DEA products in depth.
- Continue exploring with some [real world applications](https://github.com/GeoscienceAustralia/dea-notebooks/tree/develop/Real_world_examples).

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/GeoscienceAustralia/dea-notebooks).

**Last modified:** October 2019

**Compatible `datacube` version:** 

In [None]:
print(datacube.__version__)

## Tags
Browse all available tags on the DEA User Guide's [Tags Index](https://docs.dea.ga.gov.au/genindex.html)