# Loading data from the KSA Datacube

* **Products used:** landsat_sr_kenya

* **Prerequisites:** Users of this notebook should have a basic understanding of:
    * How to run a [Jupyter notebook](01_Jupyter_notebooks.ipynb)
    * Inspecting available [KSA products and measurements](02_Products_and_measurements.ipynb)

## Background

To retrieve data from the KSA instance of the Open Data Cube, you formulate a query that specifies the what, where, and when of the data you need. Each query returns a multi-dimensional `xarray` object containing the requested data. A basic understanding of `xarray` data structures is essential, because all data loaded from the datacube is built on them. Manipulating, transforming, and visualising `xarray` objects lets datacube users explore and analyse KSA datasets and address scientific questions.

## Description
This notebook will introduce how to load data from the Kenya Space Agency datacube through the construction of a query and use of the `dc.load()` function.
Topics covered include:

* Loading data using `dc.load()`
* Interpreting the resulting `xarray.Dataset` object
    * Inspecting an individual `xarray.DataArray`
* Customising parameters passed to the `dc.load()` function
    * Loading specific measurements
    * Loading data for coordinates in a custom coordinate reference system (CRS)
    * Projecting data to a new CRS and spatial resolution 
    * Specifying a specific spatial resampling method
* Loading data using a reusable dictionary query
* Loading matching data from multiple products using `like`
* Adding a progress bar to the data load

***

## Getting started
To run this introduction to loading data from the KSA Datacube, run all the cells in the notebook starting with the "Load packages" cell. For help with running notebook cells, refer back to the Introduction to Datacube notebook.

### Load packages
First we need to load the `datacube` package.
This will allow us to query the datacube database and load some data. 
The `with_ui_cbk` function from `odc.ui` will allow us to show a progress bar when loading large amounts of data.

In [2]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:
import datacube
from odc.ui import with_ui_cbk

### Connect to the datacube
We then need to connect to the datacube database.
We will then be able to use the `dc` datacube object to load data.
The `app` parameter is a unique name used to identify the notebook that does not have any effect on the analysis.

In [4]:
dc = datacube.Datacube(app="Loading_data", config="/etc/datacube.conf")

## Loading data using `dc.load()`

Data is loaded from the datacube with the `dc.load()` function.
The function requires the following minimum arguments:

* `product`: A specific product to load (to revise KSA products, see the [Products and measurements notebook](02_Products_and_measurements.ipynb)).
* `x`: Defines the spatial region in the *x* dimension. By default, the *x* and *y* arguments accept queries in the WGS84 geographic coordinate system, identified by the EPSG code *4326*.
* `y`: Defines the spatial region in the *y* dimension. The dimensions `longitude`/`latitude` and `x`/`y` can be used interchangeably.
* `time`: Defines the temporal extent. The time dimension can be specified using a tuple of datetime objects or strings in the "YYYY", "YYYY-MM" or "YYYY-MM-DD" format.

An optional argument that makes it easier to select and identify the measurements to load is:

* `measurements`: A list of measurement names to load, as listed in `dc.list_measurements()`.
For satellite datasets, measurements contain data for each individual satellite band (e.g. red).
If not provided, all measurements for the product will be returned, and they will have the default names from the satellite data.

Let's run a query to load 2020 data from the KSA Datacube for an area of Kenya.
For this example, we can use the following parameters:

* `product`: `landsat_sr_kenya`
* `output_crs`: `"EPSG:4326"`
* `resolution`: `(0.00027, 0.00027)`
* `x`: `(36.2581, 35.982)`
* `y`: `(-0.5239, -0.2657)`
* `time`: `("2020-01-01", "2020-12-31")`
* `measurements`: `['blue', 'green', 'red', 'nir']`

Run the following cell to load all datasets from the `landsat_sr_kenya` product that match this spatial and temporal extent:

In [13]:
ds = dc.load(product="landsat_sr_kenya",
             output_crs = "EPSG:4326",
             resolution = (0.00027,0.00027),
             x = (36.2581, 35.982),
             y = (-0.5239, -0.2657),
             time = ("2020-01-01", "2020-12-31"),
             measurements = ['blue', 'green', 'red', 'nir'])
print(ds)

<xarray.Dataset>
Dimensions:      (time: 37, latitude: 957, longitude: 1024)
Coordinates:
  * time         (time) datetime64[ns] 2020-03-14T07:48:49.984619 ... 2020-12...
  * latitude     (latitude) float64 -0.5239 -0.5237 -0.5234 ... -0.2661 -0.2658
  * longitude    (longitude) float64 35.98 35.98 35.98 ... 36.26 36.26 36.26
    spatial_ref  int32 4326
Data variables:
    blue         (time, latitude, longitude) uint16 8128 8210 8318 ... 0 0 0
    green        (time, latitude, longitude) uint16 9196 9096 9282 ... 0 0 0
    red          (time, latitude, longitude) uint16 8754 9175 9387 ... 0 0 0
    nir          (time, latitude, longitude) uint16 19176 14957 15336 ... 0 0 0
Attributes:
    crs:           EPSG:4326
    grid_mapping:  spatial_ref


### Interpreting the resulting `xarray.Dataset`
The variable `ds` holds an `xarray.Dataset` containing all data that matched the spatial and temporal query parameters passed to `dc.load()`.

*Dimensions* 

* Identifies the number of timesteps returned in the search (`time: 37`) as well as the number of pixels in the `latitude` and `longitude` directions of the data query.

*Coordinates* 

* `time` identifies the date attributed to each returned timestep.
* `latitude` and `longitude` are the coordinates for each pixel within the spatial bounds of your query.

*Data variables*

* These are the measurements loaded for the nominated product.
For every date (`time`) returned by the query, the measured value at each pixel (`latitude`, `longitude`) is returned as an array for each measurement.
Each data variable is itself an `xarray.DataArray`.

*Attributes*

* `crs` identifies the coordinate reference system (CRS) of the loaded data.
* `grid_mapping` names the coordinate (`spatial_ref`) that stores the CRS definition.
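
The four parts of the printout can also be read programmatically. The sketch below does not query the datacube: it builds a tiny synthetic `xarray.Dataset` (the name `demo` and all of its values are made up for illustration) with the same layout as `ds`, then prints each part in turn.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a loaded dataset: 2 timesteps, 3 x 4 pixels.
times = pd.to_datetime(["2020-03-14", "2020-03-30"])
lats = np.array([-0.2658, -0.2661, -0.2664])
lons = np.array([35.9820, 35.9823, 35.9826, 35.9829])

rng = np.random.default_rng(0)
shape = (len(times), len(lats), len(lons))

demo = xr.Dataset(
    data_vars={
        band: (
            ("time", "latitude", "longitude"),
            rng.integers(7000, 20000, size=shape).astype("uint16"),
        )
        for band in ["blue", "green", "red", "nir"]
    },
    coords={"time": times, "latitude": lats, "longitude": lons},
    attrs={"crs": "EPSG:4326", "grid_mapping": "spatial_ref"},
)

print(dict(demo.sizes))      # Dimensions: length of each dimension
print(list(demo.coords))     # Coordinates: labels along each dimension
print(list(demo.data_vars))  # Data variables: one entry per measurement
print(demo.attrs)            # Attributes: dataset-level metadata
```

The same four calls should work unchanged on the `ds` object returned by `dc.load()`.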

### Inspecting an individual `xarray.DataArray`
The `xarray.Dataset` we loaded above is itself a collection of individual `xarray.DataArray` objects that hold the actual data for each data variable/measurement. 
For example, all measurements listed under _Data variables_ above (`blue`, `green`, `red` and `nir`) are `xarray.DataArray` objects.

We can inspect the data in these `xarray.DataArray` objects using either of the following syntaxes:
```
ds["measurement_name"]
```
or:
```
ds.measurement_name
```
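
As a minimal sketch of these two syntaxes, the example below uses a small synthetic dataset (the `demo` object and its values are invented for illustration, not loaded from the datacube). It checks that both forms return the same `xarray.DataArray`, then selects a single timestep and a single pixel.

```python
import numpy as np
import xarray as xr

# Synthetic dataset with one measurement: 2 timesteps, 3 x 4 pixels.
demo = xr.Dataset(
    {
        "green": (
            ("time", "latitude", "longitude"),
            np.arange(24, dtype="uint16").reshape(2, 3, 4),
        )
    },
    coords={
        "time": np.array(["2020-03-14", "2020-03-30"], dtype="datetime64[ns]"),
        "latitude": [-0.2658, -0.2661, -0.2664],
        "longitude": [35.9820, 35.9823, 35.9826, 35.9829],
    },
)

by_key = demo["green"]  # dictionary-style access
by_attr = demo.green    # attribute-style access
print(by_key.identical(by_attr))

# Select the first timestep by position.
first = by_key.isel(time=0)
print(first.shape)

# Select the time series of the pixel nearest to a given location.
pixel = by_key.sel(latitude=-0.2661, longitude=35.9823, method="nearest")
print(pixel.values)
```

The dictionary-style form is the safer default: attribute-style access only works when the measurement name does not clash with an existing `xarray.Dataset` attribute or method.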

Being able to access data from individual data variables/measurements allows us to manipulate and analyse data from individual satellite bands or specific layers in a dataset. 
For example, we can access data from the green satellite band (i.e. `green`):

In [6]:
print(ds.green)

<xarray.DataArray 'green' (time: 37, latitude: 957, longitude: 1024)>
array([[[ 9196,  9096,  9282, ...,  8978,  9046,  9038],
        [ 9326,  9099,  9215, ...,  8978,  9046,  9038],
        [ 9273,  9196,  9278, ...,  8984,  9083,  9081],
        ...,
        [ 9037,  8971,  9084, ...,  9000,  9444, 11433],
        [ 8968,  9272,  9610, ...,  8201,  8443,  9720],
        [ 9539,  9877, 10005, ...,  8050,  8035,  8336]],

       [[    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        ...,
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0]],

       [[    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        [    0,     0,     0, ...,     0,     0,     0],
        ...,
...

## Customising the `dc.load()` function

The `dc.load()` function can be tailored to refine a query.

Customisation options include:

* `measurements`: This argument is used to provide a list of measurement names to load, as listed in `dc.list_measurements()`.
For satellite datasets, measurements contain data for each individual satellite band (e.g. near infrared).
If not provided, all measurements for the product will be returned.
* `crs`: The coordinate reference system (CRS) of the query's `x` and `y` coordinates is assumed to be `WGS84`/`EPSG:4326` unless the `crs` field is supplied, even if the stored data is in another projection or the `output_crs` is specified.
The `crs` parameter is required if your query's coordinates are in any other CRS.
* `group_by`: Satellite datasets based around scenes can have multiple observations per day with slightly different time stamps as the satellite collects data along its path.
These observations can be combined by reducing the `time` dimension to the day level using `group_by="solar_day"`.
* `output_crs` and `resolution`: To reproject the data or change its resolution, supply the `output_crs` and `resolution` fields.
* `resampling`: This argument allows you to specify a custom spatial resampling method to use when data is reprojected into a different CRS.
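
The options above can be combined in a single query. The sketch below is illustrative only: the target CRS (`EPSG:32737`, UTM zone 37S), the 30 m resolution, the per-band resampling choices and the helper name `load_custom` are all assumptions made for this example, not settings taken from the KSA Datacube. The dictionary itself can be built and inspected without a datacube connection; the helper needs a connected `dc` object and the `odc.ui` package.

```python
# Illustrative query; every value here is an assumption chosen for the sketch.
custom_query = {
    "product": "landsat_sr_kenya",
    "measurements": ["red", "green", "blue", "qa_pixel"],
    # x/y are given in longitude/latitude, so state the query CRS explicitly.
    "x": (35.982, 36.2581),
    "y": (-0.5239, -0.2657),
    "crs": "EPSG:4326",
    "time": ("2020-01-01", "2020-12-31"),
    # Reproject to a metric grid (UTM zone 37S) with 30 m pixels.
    "output_crs": "EPSG:32737",
    "resolution": (-30, 30),
    # Nearest neighbour for the categorical QA band, bilinear for the rest.
    "resampling": {"qa_pixel": "nearest", "*": "bilinear"},
    # Merge observations acquired on the same solar day.
    "group_by": "solar_day",
}


def load_custom(dc, query):
    """Load `query` with a progress bar; `dc` is a connected datacube.Datacube."""
    from odc.ui import with_ui_cbk  # imported here so the sketch runs without odc.ui

    return dc.load(progress_cbk=with_ui_cbk(), **query)


for name, value in custom_query.items():
    print(f"{name}: {value}")
```

With the `dc` object created earlier in this notebook, the load would be run as `ds_utm = load_custom(dc, custom_query)`.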


### Specifying measurements
By default, `dc.load()` will load *all* measurements in a product.

To load data from the `red`, `green` and `blue` satellite bands only, we can add `measurements=["red", "green", "blue"]` to our query:

In [8]:
ds_rgb = dc.load(product="landsat_sr_kenya",
                 output_crs="EPSG:4326",
                 resolution=(-0.00027, 0.00027),
                 x=(36.2581, 35.982),
                 y=(-0.5239, -0.2657),
                 time=("2020-01-01", "2020-12-31"),
                 measurements=['blue', 'green', 'red'])

print(ds_rgb)

<xarray.Dataset>
Dimensions:      (time: 37, latitude: 957, longitude: 1024)
Coordinates:
  * time         (time) datetime64[ns] 2020-03-14T07:48:49.984619 ... 2020-12...
  * latitude     (latitude) float64 -0.2658 -0.2661 -0.2664 ... -0.5237 -0.5239
  * longitude    (longitude) float64 35.98 35.98 35.98 ... 36.26 36.26 36.26
    spatial_ref  int32 4326
Data variables:
    blue         (time, latitude, longitude) uint16 8159 8296 8510 ... 0 0 0
    green        (time, latitude, longitude) uint16 9539 9877 10005 ... 0 0 0
    red          (time, latitude, longitude) uint16 9052 9400 9530 ... 0 0 0
Attributes:
    crs:           EPSG:4326
    grid_mapping:  spatial_ref


## Loading data using the query dictionary syntax
It is often useful to re-use a set of query parameters to load data from multiple products.
To achieve this, we can load data using the "query dictionary" syntax.
This involves placing the query parameters we used to load data above inside a Python dictionary object which we can re-use for multiple data loads:

In [11]:
query = {"x" : (36.2581, 35.982),
         "y" : (-0.5239, -0.2657),
         "output_crs": 'EPSG:4326',
         "resolution": (-0.00027,0.00027),
         "time": ("2020-01-01", "2020-12-31")}

We can then use this query dictionary object as an input to `dc.load()`. 

> The `**` syntax below is Python's "keyword argument unpacking" operator.
This operator takes the named query parameters listed in the dictionary we created, and "unpacks" them into the `dc.load()` function as new arguments. 
For more information about unpacking operators, refer to the [Python documentation](https://docs.python.org/3/tutorial/controlflow.html#unpacking-argument-lists).

In [12]:
ds = dc.load(product="landsat_sr_kenya",
             **query)

print(ds)

<xarray.Dataset>
Dimensions:        (time: 37, latitude: 957, longitude: 1024)
Coordinates:
  * time           (time) datetime64[ns] 2020-03-14T07:48:49.984619 ... 2020-...
  * latitude       (latitude) float64 -0.2658 -0.2661 ... -0.5237 -0.5239
  * longitude      (longitude) float64 35.98 35.98 35.98 ... 36.26 36.26 36.26
    spatial_ref    int32 4326
Data variables:
    sr_b1          (time, latitude, longitude) uint16 7891 7936 8175 ... 0 0 0
    sr_b2          (time, latitude, longitude) uint16 8159 8296 8510 ... 0 0 0
    sr_b3          (time, latitude, longitude) uint16 9539 9877 10005 ... 0 0 0
    sr_b4          (time, latitude, longitude) uint16 9052 9400 9530 ... 0 0 0
    sr_b5          (time, latitude, longitude) uint16 20854 21311 20983 ... 0 0
    sr_b6          (time, latitude, longitude) uint16 14245 15061 15405 ... 0 0
    sr_b7          (time, latitude, longitude) uint16 10746 11163 11385 ... 0 0
    qa_pixel       (time, latitude, longitude) uint16 21824 21824 21824
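
A query dictionary can also be reused with one parameter changed. The sketch below illustrates the unpacking mechanics only: `describe_load` is a hypothetical stand-in for `dc.load()` that simply returns the keyword arguments it receives, so the example runs without a datacube connection.

```python
def describe_load(product, **query):
    """Hypothetical stand-in for dc.load(): return the arguments it received."""
    return {"product": product, **query}


query = {
    "x": (36.2581, 35.982),
    "y": (-0.5239, -0.2657),
    "output_crs": "EPSG:4326",
    "resolution": (-0.00027, 0.00027),
    "time": ("2020-01-01", "2020-12-31"),
}

# Reuse the query unchanged.
full_year = describe_load(product="landsat_sr_kenya", **query)

# Reuse the query with a different time range by merging into a new
# dictionary; the original `query` is left untouched.
q1_query = {**query, "time": ("2020-01-01", "2020-03-31")}
first_quarter = describe_load(product="landsat_sr_kenya", **q1_query)

print(full_year["time"])
print(first_quarter["time"])
print(query["time"])
```

Merging into a new dictionary is needed because passing the same keyword twice, for example `**query` together with a separate `time=...` argument, raises a `TypeError` in Python.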

## Recommended next steps

For more advanced information about working with Jupyter Notebooks or JupyterLab, see the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/user/notebook.html).

To continue with this beginner's guide, work through the notebooks in the following order:

1. Introduction to Datacube
2. Products and Measurements
3. **Loading data (this notebook)**
4. Plotting
5. Performing a basic analysis
6. Introduction to numpy
7. Introduction to xarray
8. Parallel processing with Dask
