## Introduction to querying

**Notebook currently compatible with the `NCI`|`DEA Sandbox` environment only**

## Background
Every DEA analysis starts with a data query that specifies the what, where and when of the data request.
Each query returns an xarray dataset containing the contents of your request.
Understanding the xarray dataset is essential because it is fundamental to the structure of the datacube.
Manipulating, transforming and visualising the xarray contents lets datacube users explore DEA datasets and pose and answer scientific questions.
This notebook shows how to construct and customise datacube queries and introduces the xarray dataset.

## Prerequisites
Users of this notebook should have a basic understanding of how to run a [Jupyter notebook](future link to Intro_to_Jupyter) and understand the basic structure of the [satellite datasets](future link to Intro_to_DEA) that are held within the DEA.

## Description
This notebook introduces loading data from the datacube by constructing a query and using the *load* function.
Topics covered include:
* Loading data
* Reading the resulting xarray dataset
* Customising the load function
  * crs
  * multi-sensor queries
  * loading cloud-masked data


## Technical details
* **Products used:** `product_name`, `product_name`, `product_name`
* **Analyses used:** NDWI water index, geomedian compositing, pixel drill
* **Special requirements:** An _optional_ description of any special requirements, e.g. If running on the [NCI](https://nci.org.au/), ensure that `module load otps` is run prior to launching this notebook

## Getting started
To run this introduction to querying, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages
Use standard import commands; some are shown below. 
Begin with any `IPython` magic commands, followed by standard Python packages, then any additional functionality you need from the `Scripts` directory.

In [1]:
# %matplotlib inline

# import datacube
# import matplotlib.pyplot as plt
# import numpy as np
# import pandas as pd
# import sys
# import xarray as xr

# sys.path.append("../Scripts")

### Connect to the datacube
Give your datacube app a unique name that is consistent with the purpose of the notebook.

In [5]:
import datacube
# Temporary solution to account for Collection 3 data being in a different
# database on the NCI
try:
    dc = datacube.Datacube(app='Introduction_to_querying', env='c3-samples')
except Exception:
    dc = datacube.Datacube(app='Introduction_to_querying')


## Loading data

Loading data from the datacube uses the *load* function.

The function takes several arguments:

* *product*: A specific product to load.
* *x*: Defines the spatial region in the *x* dimension.
* *y*: Defines the spatial region in the *y* dimension.
* *time*: Defines the temporal extent.

**Note**: DEA products are discussed in more detail in Introduction_to_products(future link to Intro_to_products)

Let's run a query to load all datasets within the Landsat 7 NBART annual geomedian product for Moreton Bay in QLD.
At a minimum, the *load* function requires the following criteria:

* product: ls7_nbart_geomedian_annual
* location: x=(153.3, 153.4), y=(-27.5, -27.6)
* time period: 2015-01-01 to 2016-01-01

Run the following cell to load all matching datasets:

In [6]:
# This runs as a minimum viable query using ls7_nbart_geomedian_annual.
# It doesn't work using ga_ls5t_ard_3.
data = dc.load(product='ls7_nbart_geomedian_annual',
               x=(153.3, 153.4), y=(-27.5, -27.6),
               time=('2015-01-01', '2016-01-01'))

# Whereas this does work with the ls5 product but requires more info in the query.
# Note that it overwrites `data` from the query above.
data = dc.load(product='ga_ls5t_ard_3',
               x=(2067437.5, 2078937.5), y=(-3168487.5, -3155812.5), crs='EPSG:3577',
               time=('2008-01-01', '2008-02-01'),
               output_crs='EPSG:3577',
               resolution=(25, 25))

# This is an issue because I want to run the example using a basic product that has a range of measurements from which to search.
# LS5 ard offers the measurements but I also want to show a bare bones query before tailoring it.

In [7]:
print(data)

<xarray.Dataset>
Dimensions:                     (time: 1, x: 461, y: 508)
Coordinates:
  * time                        (time) datetime64[ns] 2008-01-24T23:33:03.370408
  * y                           (y) float64 -3.168e+06 -3.168e+06 ... -3.156e+06
  * x                           (x) float64 2.067e+06 2.067e+06 ... 2.079e+06
Data variables:
    nbar_blue                   (time, y, x) int16 883 736 1448 ... 342 342 359
    nbar_green                  (time, y, x) int16 990 692 1711 ... 326 292 359
    nbar_red                    (time, y, x) int16 919 733 1634 ... 198 225 225
    nbar_nir                    (time, y, x) int16 1655 1519 2538 ... 181 216
    nbar_swir_1                 (time, y, x) int16 1047 1002 1859 ... 76 76 76
    nbar_swir_2                 (time, y, x) int16 764 731 1283 ... 82 82 50
    nbart_blue                  (time, y, x) int16 878 732 1440 ... 342 342 359
    nbart_green                 (time, y, x) int16 985 688 1702 ... 326 292 359
    nbart_red         

### Reading the resulting xarray.Dataset
The variable *data* now holds an xarray Dataset containing all matching datasets.

*Dimensions* 
* identifies the number of time steps returned by the search, along with the number of pixels in the *x* and *y* directions. 
In this case, `time: 1` shows that one dataset fits the criteria of our query.

*Coordinates* 
* *time* identifies the date attributed to each returned dataset
* *x* and *y* are the coordinates for the pixels within the spatial bounds of your query

*Data variables*
* For every date (time) returned by the query, the spectral response for each pixel (y, x) is returned as an array for each band.

*Attributes*
* *crs* identifies the coordinate reference system of the returned data. By default, the *x* and *y* arguments of a query are interpreted as longitude and latitude in the WGS84 geographic coordinate system (EPSG code *4326*), the same system used by Google Earth.
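
To get a feel for these four parts without connecting to a datacube, the sketch below builds a small synthetic `xarray.Dataset` that mimics the layout of a loaded product. The band names, coordinate values and random pixel values are made up for illustration; only the structure mirrors what *load* returns.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a loaded dataset: 1 time step, 3 rows (y), 4 columns (x).
# All values are invented for illustration; they are not real DEA data.
rng = np.random.default_rng(seed=0)
shape = (1, 3, 4)
ds = xr.Dataset(
    data_vars={
        "nbart_red": (("time", "y", "x"), rng.integers(0, 3000, shape).astype("int16")),
        "nbart_nir": (("time", "y", "x"), rng.integers(0, 3000, shape).astype("int16")),
    },
    coords={
        "time": np.array(["2008-01-24"], dtype="datetime64[ns]"),
        "y": np.array([-3168475.0, -3168450.0, -3168425.0]),
        "x": np.array([2067450.0, 2067475.0, 2067500.0, 2067525.0]),
    },
    attrs={"crs": "EPSG:3577"},
)

# Dimensions: how many time steps and pixels were returned
n_times = ds.sizes["time"]
n_rows, n_cols = ds.sizes["y"], ds.sizes["x"]

# Coordinates: the labels along each dimension
first_date = ds.time.values[0]

# Data variables: one (time, y, x) array per band
band_names = list(ds.data_vars)
red_first_scene = ds.nbart_red.isel(time=0)

# Attributes: metadata such as the coordinate reference system
crs = ds.attrs["crs"]

print(n_times, n_rows, n_cols, band_names, crs)
```

The same attribute and method names (`sizes`, `data_vars`, `attrs`, `isel`) can be used on the `data` object returned by a real query.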

### Customising the *load* function

The *load* function can be tailored to refine a query.

Common customisation options include:
* measurements
* crs
* resolution
* group_by
* output_crs
* products

For help or more customisation options, run `help(dc.load)` in an empty cell.

Example syntax on the use of these options follows in the cells below.
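
One convenient pattern, also used later in this notebook, is to hold the query in a dictionary and unpack it into *load* with `**`. The sketch below only builds and merges the dictionaries, so it runs without a datacube connection; the final `dc.load` call is left commented out and assumes the `dc` object created earlier. The option values are borrowed from queries elsewhere in this notebook and are illustrative rather than recommended settings.

```python
# Bare-bones query: the what, where and when
base_query = {
    "product": "ga_ls5t_ard_3",
    "x": (153.3, 153.4),
    "y": (-27.5, -27.6),
    "time": ("2008-01-01", "2008-02-01"),
}

# Customisations layered on top of the base query (illustrative values)
custom_options = {
    "measurements": ["nbart_red", "nbart_green", "nbart_blue"],
    "output_crs": "EPSG:3577",
    "resolution": (-30, 30),
}

# Merge into a new dictionary; base_query itself is left unchanged
custom_query = {**base_query, **custom_options}

for key, value in custom_query.items():
    print(f"{key}: {value}")

# With a datacube connection, the query would be passed like this:
# data = dc.load(**custom_query)
```

Keeping the base query separate makes it easy to rerun the same area and time period with different options.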

#### crs
Users can query in the coordinate system that the product is natively stored in by supplying the *crs* argument together with *x* and *y* values expressed in that system.

Run the cell below. Note that the result is identical to the initial query you ran in this notebook.

In [83]:
data_native_crs = dc.load(product='ga_ls5t_ard_3',
                          x=(2067437.5, 2078937.5), y=(-3168487.5, -3155812.5), crs='EPSG:3577',
                          time=('2008-01-01', '2008-02-01'),
                          output_crs='EPSG:3577',
                          resolution=(25, 25))
print(data_native_crs)

<xarray.Dataset>
Dimensions:                     (time: 1, x: 461, y: 508)
Coordinates:
  * time                        (time) datetime64[ns] 2008-01-24T23:33:03.370408
  * y                           (y) float64 -3.168e+06 -3.168e+06 ... -3.156e+06
  * x                           (x) float64 2.067e+06 2.067e+06 ... 2.079e+06
Data variables:
    nbar_blue                   (time, y, x) int16 883 736 1448 ... 342 342 359
    nbar_green                  (time, y, x) int16 990 692 1711 ... 326 292 359
    nbar_red                    (time, y, x) int16 919 733 1634 ... 198 225 225
    nbar_nir                    (time, y, x) int16 1655 1519 2538 ... 181 216
    nbar_swir_1                 (time, y, x) int16 1047 1002 1859 ... 76 76 76
    nbar_swir_2                 (time, y, x) int16 764 731 1283 ... 82 82 50
    nbart_blue                  (time, y, x) int16 878 732 1440 ... 342 342 359
    nbart_green                 (time, y, x) int16 985 688 1702 ... 326 292 359
    nbart_red         

#### measurements

In [76]:
dc.list_measurements()

| product | measurement | aliases | dtype | flags_definition | name | nodata | spectral_definition | units |
|---|---|---|---|---|---|---|---|---|
| fc_percentile_albers_annual | BS_PC_10 | | int16 | | BS_PC_10 | -1 | | percent |
| fc_percentile_albers_annual | PV_PC_10 | | int16 | | PV_PC_10 | -1 | | percent |
| fc_percentile_albers_annual | NPV_PC_10 | | int16 | | NPV_PC_10 | -1 | | percent |
| fc_percentile_albers_annual | BS_PC_50 | | int16 | | BS_PC_50 | -1 | | percent |
| fc_percentile_albers_annual | PV_PC_50 | | int16 | | PV_PC_50 | -1 | | percent |
| fc_percentile_albers_annual | NPV_PC_50 | | int16 | | NPV_PC_50 | -1 | | percent |
| fc_percentile_albers_annual | BS_PC_90 | | int16 | | BS_PC_90 | -1 | | percent |
| fc_percentile_albers_annual | PV_PC_90 | | int16 | | PV_PC_90 | -1 | | percent |
| fc_percentile_albers_annual | NPV_PC_90 | | int16 | | NPV_PC_90 | -1 | | percent |
| fc_percentile_albers_seasonal | BS_PC_10 | | int16 | | BS_PC_10 | -1 | | percent |
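
`dc.list_measurements()` returns a pandas DataFrame indexed by product and measurement, so the names that can be passed to the *measurements* argument can be looked up per product. The sketch below rebuilds a few of the rows from that output by hand, so it runs without a datacube connection; with a live connection you would start from `dc.list_measurements()` instead. The `PV_` filter is just an example of narrowing the list.

```python
import pandas as pd

# Hand-built stand-in for dc.list_measurements(), using rows from its output
rows = [
    ("fc_percentile_albers_annual", "BS_PC_10", "int16", -1, "percent"),
    ("fc_percentile_albers_annual", "PV_PC_10", "int16", -1, "percent"),
    ("fc_percentile_albers_annual", "NPV_PC_10", "int16", -1, "percent"),
    ("fc_percentile_albers_seasonal", "BS_PC_10", "int16", -1, "percent"),
]
measurements = pd.DataFrame(
    rows, columns=["product", "measurement", "dtype", "nodata", "units"]
).set_index(["product", "measurement"])

# Select every measurement belonging to one product
product = "fc_percentile_albers_annual"
product_measurements = measurements.loc[product]
available = list(product_measurements.index)

# Keep only the measurements of interest, here those whose name starts with "PV_"
wanted = [name for name in available if name.startswith("PV_")]

print(available)
print(wanted)

# The resulting list can then be passed to dc.load via measurements=wanted
```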


## Running a query across multiple sensors/products

In [None]:
lat_range = (-27.715, -27.755)
lon_range = (153.42, 153.46)
time_range = ('1988', '2018')
time_step = '2Y'
tide_range = (0.50, 1.00)


## Load cloud-masked Landsat data
The first step in this analysis is to load Landsat data for the `lat_range`, `lon_range` and `time_range` provided above. 
The code below builds a query dictionary, uses the `mostcommon_crs` function to identify the most common projection for that query, and then uses the `load_ard` function to load data from the Landsat 5, 7 and 8 satellites for that area and time. 
The function also automatically masks out clouds from the dataset, allowing us to focus on pixels that contain useful data:

In [None]:
# Assumption: load_ard and mostcommon_crs are provided by dea_datahandling.py
# in the Scripts directory of this repository
import sys
sys.path.append('../Scripts')
from dea_datahandling import load_ard, mostcommon_crs

# Create the 'query' dictionary object, which contains the longitudes, 
# latitudes and time provided above
query = {
    'y': lat_range,
    'x': lon_range,
    'time': time_range,
    'measurements': ['nbart_red', 'nbart_green', 'nbart_blue', 'nbart_swir_1'],
    'resolution': (-30, 30),
}

# Identify the most common projection system in the input query 
output_crs = mostcommon_crs(dc=dc, product='ga_ls5t_ard_3', query=query)

# Load available data from all three Landsat satellites
landsat_ds = load_ard(dc=dc, 
                      products=['ga_ls5t_ard_3', 
                                'ga_ls7e_ard_3', 
                                'ga_ls8c_ard_3'], 
                      lazy_load=True,
                      output_crs=output_crs,
                      align=(15, 15),
                      group_by='solar_day',
                      **query)
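
To illustrate what cloud masking does to the pixel values, the sketch below applies a mask to a small synthetic band with NumPy. It is not the implementation inside `load_ard`; the reflectance values and the boolean `is_cloud` array are invented, standing in for the pixel-quality information that the real function reads from the product.

```python
import numpy as np

# Invented reflectance values for a 3 x 3 patch of one band
red = np.array([[420, 455, 2900],
                [430, 3100, 3050],
                [410, 440, 460]], dtype="int16")

# Invented cloud flags: True where a pixel is considered cloudy
is_cloud = np.array([[False, False, True],
                     [False, True, True],
                     [False, False, False]])

# Replace cloudy pixels with NaN; integers cannot hold NaN, so cast to float first
red_masked = np.where(is_cloud, np.nan, red.astype("float32"))

# Statistics can then skip the masked pixels
clear_mean = np.nanmean(red_masked)
clear_count = int(np.count_nonzero(~np.isnan(red_masked)))

print(red_masked)
print(clear_count, clear_mean)
```

Because masked pixels become NaN, the masked array is floating point even though the source band is `int16`.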


### Recommended next steps

Recommend notebooks to follow on from this one: list products, list measurements, run a basic analysis

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/GeoscienceAustralia/dea-notebooks).

**Last modified:** September 2019

**Compatible `datacube` version:** 

In [7]:
print(datacube.__version__)

1.7+43.gc873f3ea


## Tags
Browse all available tags on the DEA User Guide's [Tags Index](https://docs.dea.ga.gov.au/genindex.html)