# <u>Using a shapefile to mask raster data</u> <img align="right" src="../resources/csiro_easi_logo.png">

**Contents**

  - [Overview](#Overview)
  - [Notebook setup](#Notebook-setup)
  - [Loading up the Sentinel-2 data](#Loading-up-the-Sentinel-2-data)
    - [DataCube query](#DataCube-query)
    - [Data load](#Data-load)
    - [Cleaning up the data](#Cleaning-up-the-data)
    - [Time series display](#Time-series-display)
  - [Masking the raster data with a shapefile](#Masking-the-raster-data-with-a-shapefile)
    - [Loading up the shapefile](#Loading-up-the-shapefile)
    - [Creating a raster mask](#Creating-a-raster-mask)
    - [Applying the mask to the data time series](#Applying-the-mask-to-the-data-time-series)
    

# Overview

This notebook demonstrates how to use a shapefile to mask raster data extracted from the EAIL. For demonstration purposes, the notebook makes use of Sentinel-2 data and uses an existing shapefile containing the polygon(s) to apply as a mask. This shapefile is available in the `ancillary_data` folder in this CSIRO EASI case studies repository.

This notebook is adapted from a [Digital Earth Australia](https://github.com/GeoscienceAustralia/dea-notebooks) example by Claire Krause.

**A note regarding compatibility**

As stated above, this notebook makes use of a specific dataset of remote sensing data, and is applied over a given region of interest. The code below will thus not run properly if the EASI deployment used to run this notebook does not also contain a similar dataset.

Users can nonetheless investigate the outputs provided in this demonstration notebook, and also modify certain variables in the code below to allow the notebook to run with a different EASI deployment (e.g. different remote sensing data, region of interest, etc.).


# Notebook setup

Here, we import the relevant Python modules and functions needed in the rest of this notebook. Subsequently, we open a connection to the EASI datacube.

In [None]:
### System
import os, sys

### Datacube 
from datacube import Datacube
from datacube.utils import masking
from odc.algo import enum_to_bool

### Data tools
os.environ['USE_PYGEOS'] = '0'
import numpy as np
import xarray as xr
import rasterio
import rasterio.features
import geopandas as gpd

### Plotting
import matplotlib.pyplot as plt

### EASI tools
sys.path.append(os.path.expanduser('../scripts'))
import notebook_utils
from app_utils import display_map
from easi_tools import EasiDefaults
easi = EasiDefaults()

In [None]:
dc = Datacube(app='Raster mask from shapefile')

#### Start a local dask cluster

In [None]:
cluster, client = notebook_utils.initialize_dask(use_gateway=False)
display(cluster if cluster else client)
print(notebook_utils.localcluster_dashboard(client, server=easi.hub))

#### AWS configuration
We will be using data in public requester-pays buckets, so the following configuration is required:

In [None]:
"""This function obtains credentials for S3 access and passes them on to
   processing threads, either local or on dask cluster.
   Note that AWS credentials may need to be renewed between sessions or
   after a period of time."""

from datacube.utils.aws import configure_s3_access
configure_s3_access(aws_unsigned=False, requester_pays=True, client=client)

# Loading up the Sentinel-2 data

## DataCube query

In this notebook, we're interested in a dataset of Sentinel-2 imagery. We set up the corresponding spatial extent in the `query` dictionary below, together with a (small) temporal window using the defaults for this deployment.

We will use the Sentinel-2 data for false-colour plots of the SWIR, NIR and green bands (in the corresponding RGB channels). We thus add these bands to the list of `measurements` to load up, together with the `SCL` band of pixel quality data, which we will use to remove bad-quality pixels.

We will also plot the dataset in its original (native) projection, so in the code below, we use the function `mostcommon_crs()` to work out the dataset's native coordinate reference system (CRS) and set the query parameter `output_crs` to that value.

In [None]:
# This configuration is read from the defaults for this system. 
# Examples are provided in a commented line to show how to set these manually.

study_area_lat = easi.latitude
# study_area_lat = (39.2, 39.3)

study_area_lon = easi.longitude
# study_area_lon = (-76.7, -76.6)

product = easi.product('sentinel-2')
# product = 'landsat8_c2l2_sr'

set_time = easi.time
# set_time = ('2020-08-01', '2020-12-01')

# set_crs = easi.crs('sentinel-2')
# set_crs = 'EPSG:32618'

set_resolution = easi.resolution('sentinel-2')
# set_resolution = (-30, 30)

In [None]:
query = { 'product': product,
          'lat': study_area_lat,
          'lon': study_area_lon,
          'measurements': ['green', 'nir_1', 'swir_2', 'SCL'],
          'time': set_time,
          'group_by': 'solar_day' }        # group scenes acquired on the same solar day

### Dataset's native projection
native_crs = notebook_utils.mostcommon_crs(dc, query)
print(f"The dataset's native CRS is: {native_crs}")

query.update({ 'output_crs': native_crs,     # EPSG code
               'resolution': set_resolution, # target resolution
               'dask_chunks': {'time': 1}})  # Dask config
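
The implementation of `notebook_utils.mostcommon_crs()` is not shown in this notebook. As an assumption only, a helper of this kind could count the CRS of every dataset matched by the query and return the most frequent one. The sketch below illustrates that idea on a made-up list of CRS strings; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical CRS strings, one per dataset matched by a query
dataset_crs = ["EPSG:32618", "EPSG:32618", "EPSG:32617", "EPSG:32618"]

# most_common(1) returns a list holding the single most frequent (value, count) pair
native_crs_example, n_datasets = Counter(dataset_crs).most_common(1)[0]
print(native_crs_example, n_datasets)
```

Loading in the most common CRS keeps resampling to a minimum when a region of interest straddles two projection zones.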

## Data load

In [None]:
data = dc.load(**query)
data

<div class="alert alert-info">
    <p><strong>NOTE:</strong> Because the query sets `dask_chunks`, `dc.load()` returns lazily evaluated `Dask` arrays (distributed `Xarray` processing on compute clusters). The pixel values are only read when a computation is triggered, such as the `compute()` call further below in this notebook.</p>
    <p>As usual, we can check the size the data object would occupy in system memory once loaded &ndash; bringing a very large dataset into memory in one go could lead to technical and/or computational issues in the code further below in this notebook.</p>
</div>

In [None]:
print(f"The size of 'data' in memory is {data.nbytes/(1024**2):.2f} MB.")
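
The size reported by `nbytes` is simply the number of array elements multiplied by the number of bytes per element. As a rough sanity check, the same figure can be worked out by hand. The dimensions and data type below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical stack: 11 time slices of 1000 x 1000 pixels, stored as 16-bit integers
n_time, n_y, n_x = 11, 1000, 1000
itemsize = np.dtype("int16").itemsize   # 2 bytes per value

# Size of one band, converted from bytes to MB
size_mb = n_time * n_y * n_x * itemsize / 1024**2
print(f"One band: {size_mb:.2f} MB")
```

A dataset holding several bands of the same shape and type would need this amount once per band.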

## Cleaning up the data

Once loaded, we typically want to (pre-)process the data to remove pixels flagged as `nodata`, as well as those pixels affected by specific data issues such as cloud, cloud shadow, saturation, etc.

In [None]:
### Valid mask (i.e. not 'nodata'), for each data layer
valid_mask = masking.valid_data_mask(data)

### Mask of valid pixels
good_pixel_flags = {'water', 'vegetation', 'bare soils', 'unclassified', 'dark area pixels'}   # pixels to retain (i.e. remove pixels flagged as 'nodata', 'cloud', 'shadow', 'snow')
good_pixel_mask = enum_to_bool(data['SCL'], good_pixel_flags)  # -> DataArray

### Scaling factors for Sentinel-2 data (GA S2 products)
scale = 0.0001  # divide by 10000
offset = 0.0

### Apply valid mask, good pixel mask, and scaling to each layer
data['swir_2'] = data['swir_2'].where(valid_mask['swir_2'] & good_pixel_mask) * scale + offset
data['nir_1'] = data['nir_1'].where(valid_mask['nir_1'] & good_pixel_mask) * scale + offset
data['green'] = data['green'].where(valid_mask['green'] & good_pixel_mask) * scale + offset
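
The masking and scaling steps above can be tried out on a small synthetic array. This is a minimal sketch under stated assumptions: the `nodata` value and pixel values are made up, and `DataArray.isin()` on the numeric codes of the standard Sentinel-2 SCL legend stands in for `enum_to_bool()`, which works from the class names instead.

```python
import numpy as np
import xarray as xr

# Synthetic (time, y, x) band; all values are made up for illustration
nodata = -999   # hypothetical nodata value
green = xr.DataArray(
    np.array([[[1200, 3400, nodata], [500, 8000, 2500]]], dtype="int16"),
    dims=("time", "y", "x"),
)

# SCL codes used here: 4 = vegetation, 5 = not vegetated, 6 = water,
# 9 = cloud (high probability)
scl = xr.DataArray(
    np.array([[[4, 9, 4], [6, 5, 9]]], dtype="uint8"),
    dims=("time", "y", "x"),
)

valid = green != nodata             # True where the band holds real data
good = scl.isin([2, 4, 5, 6, 7])    # classes to retain
scale, offset = 0.0001, 0.0

# Pixels failing either test become NaN; the rest are scaled to reflectance
green_clean = green.where(valid & good) * scale + offset
print(green_clean.values)
```

Note that `where()` converts the integer band to floating point, since `NaN` cannot be stored in an integer array.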

## Time series display

For demonstration purposes, we can also select a small number of time slices to display from the original time series of 11 images. Below, we identify the four time indices that contain the largest numbers of valid pixels, and extract these time slices from the dataset.

In [None]:
num_clear_pix = good_pixel_mask.sum(['x','y'])   # number of valid pixels, in each time slice
pix_thr = np.sort( num_clear_pix )[-4]     # sort the time series and select a threshold (4th largest number)
clearest_ind = (num_clear_pix>=pix_thr)    # indices of time slices with nr of pixels above threshold
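
The thresholding logic above can be checked on a plain NumPy array. The pixel counts below are hypothetical.

```python
import numpy as np

# Hypothetical counts of clear pixels in each of 7 time slices
num_clear = np.array([120, 950, 300, 870, 990, 40, 910])

k = 4
threshold = np.sort(num_clear)[-k]   # k-th largest count
keep = num_clear >= threshold        # boolean selector, one entry per time slice

print(threshold)
print(np.flatnonzero(keep))          # indices of the selected time slices
```

Two caveats apply to this approach: if several slices tie at the threshold, more than `k` slices are selected, and a time series shorter than `k` raises an `IndexError`.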

In [None]:
### Extract time slices from the time series
data = data.isel(time=clearest_ind)
data = data.compute() # Use the Dask compute() function to bring the data to local memory

In [None]:
### Plot the resulting time series (false-colour display)
image_array = data[['swir_2', 'nir_1', 'green']].to_array()
image_array.plot.imshow(robust=True, col='time', col_wrap=2, size=6, aspect=data.x.shape[0]/data.y.shape[0]);

# Masking the raster data with a shapefile

## Loading up the shapefile

For this example, we are using a training shapefile which has been provided in the default EASI configuration. This shapefile is also provided in the `ancillary_data` folder in this repository. You can upload and use your own data in a similar way.

In [None]:
shape_file = easi.training_shapefile
print(f'Using shapefile {shape_file}')
### Load the shapefile
shp = gpd.read_file(shape_file)
display(shp)
shp.crs

---
We can see here that the vector data within that shapefile uses the WGS84 geographic coordinate reference system (`EPSG:4326`), which is different from that of our main Sentinel-2 dataset. For compatibility, we re-project the shapefile data to the CRS of the Sentinel-2 dataset.

In [None]:
### Reproject
shp = shp.to_crs(native_crs)
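
To see what the re-projection does to the coordinates, we can apply `to_crs()` to a single synthetic polygon. The box and the target CRS below reuse the example values from the commented lines in the query configuration; they are illustrative and need not match this deployment's defaults.

```python
import geopandas as gpd
from shapely.geometry import box

# 0.1 x 0.1 degree box in geographic coordinates (longitude, latitude)
gdf = gpd.GeoDataFrame(geometry=[box(-76.7, 39.2, -76.6, 39.3)], crs="EPSG:4326")

# Re-project to UTM zone 18N, where coordinates are expressed in metres
gdf_utm = gdf.to_crs("EPSG:32618")

print(gdf.total_bounds)       # bounds in degrees
print(gdf_utm.total_bounds)   # bounds in metres
```

After the re-projection, the polygon vertices are expressed in the same units and grid as the raster pixels, which is what the rasterisation step further below relies on.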

In [None]:
### Plot
shp.boundary.plot(figsize=(8,8))
plt.xlabel("x [metre]"); plt.ylabel("y [metre]")
plt.title("Training shapefile");

## Creating a raster mask

We can now create a raster mask from the vector data. The code below iterates over the polygons in the shapefile (in case multiple polygons are available), setting the raster mask values to `1` for all the pixels corresponding to the footprint of each polygon.

In [None]:
mask = rasterio.features.rasterize( ((feature['geometry'], 1) for feature in shp.iterfeatures()),
                                    out_shape = (data.sizes['y'],data.sizes['x']),
                                    transform = data.affine )

In [None]:
### Convert the mask (numpy array) to an Xarray DataArray
mask = xr.DataArray(mask, coords=(data.y, data.x))
mask

In [None]:
mask.plot(figsize=(8,8));

## Applying the mask to the data time series

Finally, we can use the mask we just created, apply it to the time series of Sentinel-2 data, and plot the result.

In [None]:
### Masking
data = data.where(mask)

### Plotting
image_array = data[['swir_2', 'nir_1', 'green']].to_array()
image_array.plot.imshow(robust=True, col='time', col_wrap=2, size=6, aspect=data.x.shape[0]/data.y.shape[0]);
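
The masking step works because `where()` broadcasts the two-dimensional mask along the `time` dimension. The following sketch shows this on synthetic data; the array values and coordinates are made up.

```python
import numpy as np
import xarray as xr

# Synthetic time series: 2 time slices on a 3 x 3 grid
band = xr.DataArray(
    np.arange(18, dtype="float32").reshape(2, 3, 3),
    dims=("time", "y", "x"),
    coords={"y": [2, 1, 0], "x": [0, 1, 2]},
)

# 2-D mask on the same grid: 1 inside the polygon footprint, 0 outside
footprint = xr.DataArray(
    np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype="uint8"),
    dims=("y", "x"),
    coords={"y": band.y, "x": band.x},
)

# The 2-D mask is applied to every time slice; pixels outside become NaN
masked = band.where(footprint == 1)
print(masked.isel(time=0).values)
```

Because the mask is matched to the data by its `y` and `x` coordinates, it must be built on exactly the same grid as the raster, as done when creating the mask `DataArray` earlier.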

In [None]:
### End notebook