# Working with datasets and xarray in EASI <img align="right" src="../resources/csiro_easi_logo.png">

**Contents**

  - [Overview](#Overview)
  - [Setup](#Setup)
  - [Load some data](#Load-some-data)
    - [Start a local dask cluster](#Start-a-local-dask-cluster)
    - [Get default query parameters](#Get-default-query-parameters)
    - [Explore available datasets](#Explore-available-datasets)
    - [Load data](#Load-data)
    - [Plot data](#Plot-data)
    - [Mask data](#Mask-data)
  - [Working with xarray](#Working-with-xarray)
    - [Data structure](#Data-structure)
    - [Indexing and selecting](#Indexing-and-selecting)
    - [Xarray calculations (reduction)](#Xarray-calculations-(reduction))
    - [Timeseries](#Timeseries)
    - [Xarray and Pandas](#Xarray-and-Pandas)
  - [Other things to try](#Other-things-to-try)
    
# Overview

This notebook provides a general introduction to working with datasets and xarray, including how to work with, interrogate, filter and visualise xarray data objects.

This notebook was adapted from https://github.com/csiro-easi/eocsi-hackathon-2022/blob/main/tutorials/Datasets_and_xarray.ipynb

## Setup
Let's start with some basic imports to set up our environment.

In [None]:
%matplotlib inline

import datacube
from datacube.utils import masking
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

import warnings
warnings.filterwarnings("ignore")

from dea_tools.plotting import rgb, display_map
from dea_tools.bandindices import calculate_indices

### EASI tools
import sys, os
sys.path.append(os.path.expanduser('../scripts'))
os.environ['USE_PYGEOS'] = '0'
from easi_tools import EasiDefaults
import notebook_utils
easi = EasiDefaults()

## Load some data

<div class="alert alert-info">To start with, we are going to load some data using some common methods that are in other tutorial notebooks. We will do some basic cloud masking and filtering to get a clean dataset to use in this notebook. For more information on loading, masking or visualising data, please refer to the relevant tutorial notebooks.</div>

You can view available products and data coverage at the EASI Explorer. You can get the relevant explorer URL by retrieving the value for `easi.explorer`

In [None]:
print(easi.explorer)

### Start a local dask cluster

In [None]:
cluster, client = notebook_utils.initialize_dask(workers=(1,2), use_gateway=False, wait=True)
display(cluster if cluster else client)
print(notebook_utils.localcluster_dashboard(client, server=easi.hub))

### Get default query parameters

In [None]:
# This configuration is read from the defaults for this system. 
# Examples are provided in a commented line to show how to set these manually.

study_area_lat = easi.latitude
# study_area_lat = (39.2, 39.3)

study_area_lon = easi.longitude
# study_area_lon = (-76.7, -76.6)

product = easi.product('sentinel-2')
# product = 's2_l2a'

# set_time = easi.time
set_time = ('2022-01-01', '2022-06-01') # We will set a specific time, not just the defaults to make it easier for the examples below

# set_crs = easi.crs('sentinel-2')
set_crs = 'EPSG:6933' # For compatibility, we will use a global projection 

set_resolution = easi.resolution('sentinel-2')
# set_resolution = (-30, 30)

### Explore available datasets

A good example for Sentinel-2: https://github.com/csiro-easi/eocsi-hackathon-2022/blob/main/case-studies/Chlorophyll_monitoring.ipynb

In [None]:
#import datacube

dc = datacube.Datacube(app="data_avail")
dc.list_products()

In [None]:
#dc.list_measurements().loc["nasa_aqua_l2_oc"]
dc.list_measurements().loc[product]

Useful figure for Sentinel-2 spectral bands: https://www.usgs.gov/faqs/how-does-data-sentinel-2as-multispectral-instrument-compare-landsat-data

In [None]:
dc.list_measurements().loc[product].loc["scl"]["flags_definition"]

### Load data

In [None]:
display_map(x=study_area_lon, y=study_area_lat)

In [None]:
# Pass the parameters to the load query
query = {
    "x": study_area_lon,
    "y": study_area_lat,
    "time": set_time,
    "group_by": "solar_day", 
    "measurements": ["red", "green", "blue", "scl"], # Note that here we have used some band aliases (e.g. "red" instead of "B04")
    "output_crs": set_crs, 
    "resolution": set_resolution, 
    "dask_chunks": {"time": 1} 
}

# Load the data
ds_s2 = dc.load(product=product, **query)

In [None]:
ds_s2

### Plot data

Some more plotting examples: https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Beginners_guide/05_Plotting.ipynb

In [None]:
target_date='2022-02-14' # Change this if required to find some data that has some cloud, but not too much

In [None]:
# In this line, we use the `sel()` function to select a single time slice. There is more information on this function below.
ds_s2[["red","green","blue"]].sel(time=target_date,method='nearest').to_array().plot.imshow(vmin=500,vmax=5000,size=6,aspect=1);

### Mask data

Let's investigate the Sen2Cor Scene Classification (SCL) classes.

We can retrieve the available flag options using the `masking.describe_variable_flags()` function

In [None]:
masking.describe_variable_flags(ds_s2.scl)

We can get more information on individual columns by using `loc` as shown below

In [None]:
masking.describe_variable_flags(ds_s2.scl).loc["qa", "values"]

Using these flag labels, we can specify one or more flags that we want to keep in our data.

In [None]:
# Multiple flags are combined as logical OR using the | symbol
cloud_free_mask = (
    masking.make_mask(ds_s2.scl, qa="vegetation") | 
    masking.make_mask(ds_s2.scl, qa="bare soils") |
    masking.make_mask(ds_s2.scl, qa="water") |
    masking.make_mask(ds_s2.scl, qa="snow or ice") |
    masking.make_mask(ds_s2.scl, qa="unclassified")
)

We now have a mask that keeps only the pixels classified as one of the classes listed above. Clouds, cloud shadows and every other class will be masked out.

This is visualised below. Purple areas have a value of 0 (which will be masked out) and yellow have a value of 1 (which will be kept).
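
The OR-combination of class masks can be sketched on a small synthetic array. This is only an illustration: the `scl` values below are made up, and the class codes (4 = vegetation, 5 = bare soils, 6 = water, 9 = high-probability cloud) are assumed from the usual Sen2Cor convention, so check them against the `flags_definition` of your own product.

```python
import numpy as np
import xarray as xr

# Hypothetical 3x3 scene classification layer (assumed Sen2Cor codes)
scl = xr.DataArray(
    np.array([[4, 4, 9],
              [5, 6, 9],
              [3, 8, 4]], dtype="uint8"),
    dims=("y", "x"),
)

# One boolean mask per class, combined with logical OR
keep = (scl == 4) | (scl == 5) | (scl == 6)

# The same mask written with isin()
keep_isin = scl.isin([4, 5, 6])

print(keep.values.astype(int))
print(int(keep.sum()), "of", keep.size, "pixels kept")
```

Pixels with a value of 1 in the printed array are kept; the rest would be replaced by NaN when the mask is passed to `where()`.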

In [None]:
cloud_free_mask.sel(time=target_date,method='nearest').plot()

In [None]:
# Calculate proportion of good pixels
valid_pixel_proportion = cloud_free_mask.sum(dim=("x", "y"))/(cloud_free_mask.shape[1] * cloud_free_mask.shape[2])

valid_threshold = 0.5
observations_to_keep = (valid_pixel_proportion >= valid_threshold)

In [None]:
# only keep observations above the good pixel proportion threshold
ds_s2 = ds_s2.sel(time=observations_to_keep)
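
The filtering step above can be sketched on a tiny synthetic mask. The array and dates are invented for illustration. The sketch uses `mean()` on the boolean mask, which should give the same fraction as dividing the sum by the number of pixels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical boolean mask: 3 time slices of 2x2 pixels
times = pd.date_range("2022-01-01", periods=3, freq="5D")
mask = xr.DataArray(
    np.array([
        [[True, True], [True, False]],    # 75% valid
        [[False, False], [True, False]],  # 25% valid
        [[True, True], [False, False]],   # 50% valid
    ]),
    dims=("time", "y", "x"),
    coords={"time": times},
)

# The mean of a boolean array is the fraction of True values
valid_fraction = mask.mean(dim=("x", "y"))

# Keep only the time slices that meet the threshold
keep = valid_fraction >= 0.5
filtered = mask.sel(time=keep)

print(valid_fraction.values)
print(filtered.sizes["time"], "of", mask.sizes["time"], "time slices kept")
```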

In [None]:
# Mask the data
ds_s2 = ds_s2.where(cloud_free_mask)
ds_s2.sel(time=target_date,method='nearest')[["red","green","blue"]].to_array().plot.imshow(robust=True,size=6,aspect=1);

---
We now have a masked and filtered dataset

In [None]:
ds_s2 = ds_s2.compute() # This loads the dask-backed arrays into memory as NumPy arrays. See the Dask tutorials for more information.
ds_s2

In [None]:
# Plot each of the remaining dates as RGB thumbnails
ds_s2[["red","green","blue"]].to_array().plot.imshow(col="time",col_wrap=4,robust=True);

## Working with xarray

The section below shows various examples of working with xarray, but more information is available online, including at:

- Blog article on Xarray: https://towardsdatascience.com/basic-data-structures-of-xarray-80bab8094efa
- Xarray documentation: http://xarray.pydata.org/en/stable/user-guide/data-structures.html

### Data structure

>Xarray allows us to work with **labeled multi-dimensional arrays**


A `Dataset` can be seen as a dictionary structure packing up the data, dimensions and attributes. Variables in a `Dataset` object are called `DataArrays` and they share dimensions with the higher level `Dataset`. 

See also the terminology: https://docs.xarray.dev/en/stable/user-guide/terminology.html
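
As a minimal sketch of this structure, a small `Dataset` can be built by hand. The variable names, coordinate values and attribute below are invented for illustration and do not come from the data loaded in this notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2022-01-01", periods=2, freq="D")

# Each data variable is given as (dimension names, array)
ds = xr.Dataset(
    data_vars={
        "red": (("time", "y", "x"), np.ones((2, 3, 4))),
        "green": (("time", "y", "x"), np.zeros((2, 3, 4))),
    },
    coords={"time": times, "y": [30.0, 20.0, 10.0], "x": [0.0, 10.0, 20.0, 30.0]},
    attrs={"title": "toy example"},
)

print(list(ds.data_vars))         # names of the data variables
print(dict(ds.sizes))             # length of each shared dimension
print(type(ds["red"]).__name__)   # each variable is a DataArray
print(ds["red"].dims)
```

Both variables share the `time`, `y` and `x` dimensions of the parent `Dataset`, which is what lets operations such as `sel()` apply to every variable at once.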

<div>
    <span style="border:solid 1px #888;float:left;padding:10px;">
        <img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" alt="Figure 1. Overview of xarray concepts."/>
        <figcaption><em>Figure 1. Overview of xarray concepts.</em></figcaption>
    </span>
</div>



<div class="alert alert-info">
    <p><strong>Note</strong> that the data loaded via the Open Data Cube will return as an <code>xarray.Dataset</code> which will contain a series of <code>xarray.DataArray</code> objects.</p>
    <p>It is important to know whether you are working with a Dataset or a DataArray. This can be seen at the top of your dataset object as shown in the images below.</p>
</div>

<div>
    <span style="border:solid 1px #888;float:left;padding:10px;margin-right:25px;width:550px">
        <img src="../resources/xarray-dataset.png" alt="Figure 2. An example of an xarray Dataset.">
        <figcaption><em>Figure 2. An example of an xarray Dataset.</em></figcaption>
    </span>
    <span style="border:solid 1px #888;float:left;padding:10px;margin-left:25px;width:550px">
        <img src="../resources/xarray-dataarray.png" alt="Figure 3. An example of a single xarray DataArray from inside the Dataset in Figure 2.">
        <figcaption><em>Figure 3. An example of a single xarray DataArray from inside the Dataset in Figure 2.</em></figcaption>
    </span>
</div>

Note that:
* Data variables are stored as numpy or dask arrays
* Labels are in the forms of dimensions, coordinates and attributes
* xarray uses matplotlib for plotting
* ODC API (`datacube.Datacube.load()`) loads data into a customized xarray dataset

See also an intro notebook (including how to construct an xarray dataset): https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Beginners_guide/08_Intro_to_xarray.ipynb

And a more advanced notebook: https://rabernat.github.io/research_computing/xarray.html

In [None]:
# Show an xarray.Dataset
ds_s2

In [None]:
# Show a single Data variable - note that this is a DataArray
ds_s2.red

In [None]:
# Return the data type of the data inside a DataArray
print(type(ds_s2.red.data))

In [None]:
# Show a Coordinate of the Dataset - this is also a DataArray
ds_s2.time

In [None]:
# Get the attributes of the entire Dataset
ds_s2.attrs

In [None]:
# Get the attributes of one of the Data variables
ds_s2.red.attrs

In [None]:
# Get one attribute of the Dataset directly
print(ds_s2.crs)

In [None]:
# Get the GeoBox - the description of the dataset's pixel grid (CRS, affine transform and shape)
print(ds_s2.geobox)

### Indexing and selecting

Selections and filtering can be achieved in a number of ways using xarray. The `isel()` and `sel()` functions provide easy ways to select data via an index (`isel()`) or a label (`sel()`).

See the xarray documentation for more information:
- https://docs.xarray.dev/en/stable/generated/xarray.DataArray.isel.html
- https://docs.xarray.dev/en/stable/generated/xarray.DataArray.sel.html

__Note__ that you may need to adjust some of the values below based on the data in your default configuration
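
The difference between index-based and label-based selection can be sketched on a small synthetic time series. The timestamps and values below are invented, so they will not match your data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical acquisition times and one value per time slice
times = pd.to_datetime(
    ["2022-01-10T16:02:54", "2022-01-20T16:02:50", "2022-02-04T16:02:51"]
)
da = xr.DataArray(np.arange(3.0), dims="time", coords={"time": times}, name="green")

first = da.isel(time=0)                     # by position: time dimension dropped
january = da.sel(time="2022-01")            # partial date string: all matches, dimension kept
one_day = da.sel(time="2022-01-10")         # one match, but the dimension is still kept
exact = da.sel(time="2022-01-10T16:02:54")  # full timestamp: time dimension dropped

for name, result in [("first", first), ("january", january),
                     ("one_day", one_day), ("exact", exact)]:
    print(name, result.dims, result.shape)
```

A partial date string is treated as a range of times, so the result keeps a `time` dimension even when only one slice falls inside that range.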

In [None]:
# Get the first time slice by index
ds_s2.isel(time=0)

In [None]:
# Find the time slices for one month - note that more than one time is returned
ds_s2.sel(time='2022-01')

In [None]:
# Find the time slices for one day - note that only one time slice is returned, but the time dimension still exists. Compare this to the isel() result above.
ds_s2.sel(time='2022-01-10')

In [None]:
# Get one specific time slice - by providing the full datetime value, the time dimension disappears. This is the same result as returned by isel(time=0) above.
# If you need to adjust this time to your local example, just copy one datetime value from your data above.
ds_s2.sel(time='2022-01-10T16:02:54')

In [None]:
# Return a single time slice using the `method` argument - here 'pad' selects the last time slice at or before the requested date. Note that the time dimension disappears.
ds_s2.sel(time='2022-01-11',method='pad')

<div class="alert alert-info">
    <p>Options for the optional <code>method</code> variable include:</p>
    <ul>
        <li><strong>None</strong> (default): only exact matches</li>
        <li><strong>pad / ffill</strong>: propagate last valid index value forward</li>
        <li><strong>backfill / bfill</strong>: propagate next valid index value backward</li>
        <li><strong>nearest</strong>: use nearest valid index value</li>
    </ul>
    <p>Note that <strong>pad</strong> and <strong>backfill</strong> will fail if you try to pad or backfill beyond the extent of the dates in the dataset, e.g. <strong>pad</strong> fails if the date you are looking for is earlier than the earliest date in the dataset, and <strong>backfill</strong> fails if it is later than the latest date.</p>
</div>
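
The options above can be compared on a small synthetic series. This is a sketch with invented dates and values. The `try_sel` helper is hypothetical and only exists to turn the `KeyError` raised for an out-of-range request into `None`.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(["2022-01-10", "2022-01-20", "2022-01-30"])
da = xr.DataArray([10, 20, 30], dims="time", coords={"time": times})


def try_sel(label, method):
    """Hypothetical helper: return the selected value, or None if nothing matches."""
    try:
        return da.sel(time=np.datetime64(label), method=method).item()
    except KeyError:
        return None


# A date between two time slices
print(try_sel("2022-01-18", "pad"))       # previous slice (2022-01-10)
print(try_sel("2022-01-18", "backfill"))  # next slice (2022-01-20)
print(try_sel("2022-01-18", "nearest"))   # closest slice (2022-01-20)

# Requests outside the range of dates
print(try_sel("2022-01-01", "pad"))       # nothing at or before this date
print(try_sel("2022-02-15", "backfill"))  # nothing at or after this date
print(try_sel("2022-02-15", "nearest"))   # nearest still finds the last slice
```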

In [None]:
# Return a collection of time slices based on a list of indexes
ds_s2.isel(time=[0,2,5,6])

In [None]:
# Get some time slices from across the date range of your data
# This can be useful if you just want to show a range of time slices from across your data instead of all of it
num_slices = 3
time_ind = np.linspace(1, ds_s2.sizes['time'], num_slices, dtype='int') - 1 # This will try to select some index values which are evenly spaced across your data
ds_s2.isel(time=time_ind)
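
To see what the index calculation above produces, here is the same expression with a made-up number of time slices (`n_times` is a stand-in for `ds_s2.sizes['time']`).

```python
import numpy as np

n_times = 10     # hypothetical number of time slices
num_slices = 3

# Evenly spaced positions from 1..n_times, shifted to zero-based indexes
time_ind = np.linspace(1, n_times, num_slices, dtype="int") - 1
print(time_ind)  # first, middle and last index
```

The first and last time slices are always included, with the remaining indexes spread between them.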

In [None]:
# Find time slices since a specific day
ds_s2.isel(time=(ds_s2.time > np.datetime64('2022-03-01')))

### Xarray calculations (reduction)

Using xarray functions, you can also apply various mathematical reductions (e.g. mean, median, minimum, maximum, standard deviation, etc) to your data.

In [None]:
# Calculate the mean of each pixel through time and plot the resulting image
# This is now a composite image of all of your data
ds_s2.mean(dim="time")[["red","green","blue"]].to_array().plot.imshow(robust=True, size=6, aspect=1);

In [None]:
# Do the same as above but calculate the median (note that this is not a geomedian)
ds_s2.median(dim="time")[["red","green","blue"]].to_array().plot.imshow(robust=True, size=6, aspect=1);

In [None]:
# Calculate a mean value for each time slice for each variable. This collapses the x and y dimensions and returns a time series of values for each variable at each point in time. 
# Note that now you only have a single dimension - time
ds_s2.mean(dim=["x","y"])

In [None]:
# Plot this timeseries for a single data variable
ds_s2.mean(dim=["x","y"]).green.plot();

In [None]:
# Plot the time series for each data variable
ds_s2[['red','green','blue']].mean(dim=["x","y"]).to_array().plot.line(x='time');
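
The effect of the `dim` argument on the shape of the result can be sketched with a small synthetic array. The values are invented, and one pixel is set to NaN to mimic a masked observation. By default, xarray reductions on floating point data skip NaN values.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2022-01-01", periods=3, freq="D")
data = np.arange(12, dtype="float64").reshape(3, 2, 2)
data[0, 0, 0] = np.nan  # a masked pixel in the first time slice
da = xr.DataArray(data, dims=("time", "y", "x"), coords={"time": times}, name="green")

composite = da.mean(dim="time")    # one value per pixel: dims (y, x)
series = da.mean(dim=["x", "y"])   # one value per time slice: dims (time,)

print(composite.dims, composite.values.tolist())
print(series.dims, series.values.tolist())
```

Reducing over `time` gives an image, while reducing over `x` and `y` gives a time series. The masked pixel is simply left out of each mean.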

### Timeseries

With xarray, it is possible to resample or re-map your data to other time intervals. This can be useful to upsample or downsample your data to remove noise or align with other data.

In [None]:
# First, let's look at what happens when we resample to monthly timesteps
# As you can see, the 'time' dimension values have now changed. There are fewer time slices, the time values are all "00:00:00" and the dates are the last date of each month
# When doing this, we have to provide some sort of aggregation function. Here we are using "nearest()", which returns the one closest time slice to each date, but we could also use .mean(), .median(), .min(), .max(), .count(), etc
ds_s2.resample(time='1M').nearest().time

In [None]:
# Calculate the mean reflectances per time slice and plot them
# Note that there are now fewer time steps in this graph compared to the similar plot above which wasn't resampled
ds_s2[['red','green','blue']].resample(time='1M').nearest().mean(dim=["x","y"]).to_array().plot.line(x='time');

In [None]:
# You can also calculate rolling aggregates. In this example, we generate a rolling window of 2 timesteps, calculate the mean and then calculate a second mean to convert into a timeseries as above
ds_s2[['red','green','blue']].rolling(time=2, min_periods=1).mean().mean(dim=["x","y"]).to_array().plot.line(x='time');
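
Resampling and rolling windows can be compared on a small synthetic series with invented dates and values. As an assumption for portability, this sketch uses the `'MS'` (month start) frequency rather than `'1M'`, so the resampled labels fall on the first day of each month. Recent pandas releases prefer `'ME'` over `'M'` for month-end labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.to_datetime(
    ["2022-01-05", "2022-01-25", "2022-02-10", "2022-03-01", "2022-03-21"]
)
da = xr.DataArray([1.0, 3.0, 5.0, 7.0, 9.0], dims="time",
                  coords={"time": times}, name="green")

# Monthly mean: one value per calendar month
monthly = da.resample(time="MS").mean()

# Rolling mean over 2 timesteps; min_periods=1 keeps the first value
rolled = da.rolling(time=2, min_periods=1).mean()

print(monthly.time.values.astype("datetime64[D]"), monthly.values)
print(rolled.values)
```

Resampling changes the number and labels of the time steps, whereas a rolling window keeps the original time steps and smooths the values.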

### Xarray and Pandas

If you are familiar with pandas, you can also convert an xarray Dataset or DataArray to a pandas DataFrame. This can be useful when working with tabular data, and some Python libraries work better with pandas than with xarray.

In [None]:
# Convert one of the timeseries examples to a Pandas dataframe
df = ds_s2.mean(dim=["x","y"]).green.to_dataframe()
df

In [None]:
# Check that it is really a dataframe
type(df)

In [None]:
# Export the dataframe to a csv - this will be exported to the same folder where this notebook is located. 
df.to_csv('test.csv')
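
The conversion can be sketched end to end on a small synthetic series (the values are invented). Note that a `DataArray` needs a name to be converted, because the name becomes the column name. Here `to_csv()` is called without a path so that it returns the CSV text instead of writing a file.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2022-01-01", periods=3, freq="D")
da = xr.DataArray([0.1, 0.2, 0.3], dims="time", coords={"time": times}, name="green")

df = da.to_dataframe()   # index: time, one column: green
csv_text = df.to_csv()   # no path given, so the CSV is returned as a string
back = df.to_xarray()    # and back to an xarray Dataset

print(df)
print(csv_text.splitlines()[0])
print(list(back.data_vars), back["green"].dims)
```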

#### Learn more about pandas and geopandas

pandas: https://pandas.pydata.org/docs/user_guide/10min.html

geopandas: https://geopandas.org/en/stable/docs/user_guide.html

## Other things to try

### Pick another dataset you are interested in. 

If unsure, try Sentinel-2 for where you live or recently visited. If you have used Sentinel-2 through EASI or ODC, try another dataset.

### Explore loading the data and plotting.

### Try xarray operations

e.g.
* Select a timestamp to plot. Try using `.isel()` and `.sel()`.
* Calculate mean values over time for each pixel and plot the result.
* Try a different calculation (e.g. sum, median) or try to apply the calculation on a different dimension and plot the results
* Resample the data to a monthly (or daily, quarterly) frequency and plot monthly mean values as a line plot
* Save the result


### Think about

* What did you try to achieve, and what have you accomplished or learned?
* What type of data did you access? Why? E.g. what does this data measure?
* What else would you like to do with this dataset?