# Water Classification Case Study

### Background
Over 40% of the world’s population lives within 100 km of the coastline. However, coastal environments are constantly changing, with erosion and coastal change presenting a major challenge to valuable coastal infrastructure and important ecological habitats. Up-to-date data on coastal change and erosion is essential for coastal managers to be able to identify and minimise the impacts of coastal change and erosion.

### The Problem
While coastlines can be detected using optical data (demonstrated in the [Coastal Change Notebook](Coastal_Erosion.ipynb)), these images can be strongly affected by the weather, especially through the presence of clouds, which obscure the land and water below.

### Digital Earth Australia use case
Radar observations are largely unaffected by cloud cover, so they can reliably measure an area in any weather. Radar data is readily available from the ESA/EC Copernicus program's Sentinel 1 satellites. The two satellites provide all-weather observations, with a revisit time of 6 days. By developing a process to classify the observed pixels as either water or land, it is possible to identify the shoreline from radar data.

In this example, we use data from the Sentinel 1 satellites to build a classifier that can determine whether a pixel is water or land in radar data. The worked example takes users through the code required to:

1. Pick a study area along the coast.
1. Explore available data products and load Sentinel 1 data.
1. Visualise the returned data.
1. Perform pre-processing steps on the Sentinel 1 bands.
1. Design a classifier to distinguish land and water.
1. Apply the classifier to the study area and interpret the results.
1. Investigate how to identify change in the coastline.

### Technical details

* Products used: `s1_gamma0_geotif_scene`
* Instrument used: `C-SAR` (`VV` and `VH` polarisation). You can read more about the instrument [here](https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-1-sar/sar-instrument).
* Analyses used: compositing, speckle filtering, classification, change detection

**To run this analysis, run all the cells in the notebook. When you have finished the analysis, you can return to the start, modify some values (e.g. choose a different location) and re-run the analysis. Throughout the notebook, you will need to add some code of your own to complete the analysis.**

## Picking the study area

If you're running this notebook for the first time, we recommend you keep the default settings below. This will allow you to understand how the analysis works.

The example we've selected looks at part of the coastline of Melville Island, which sits off the coast of the Northern Territory, Australia. The study area also contains an additional small island, which will be useful for assessing how well radar data distinguishes between land and water.

Run the following two cells to set the latitude and longitude range, and then view the area.

In [None]:
latitude = (-11.287611, -11.085876)
longitude = (130.324262, 130.452652)

In [None]:
from utils.display import display_map
display_map(latitude=latitude, longitude=longitude)

## Loading available data

Before loading the data, we'll need to import the Open Data Cube library and load the `Datacube` class.

In [None]:
import datacube
dc = datacube.Datacube(app='sentinel-1-water-classifier')

### Specify product information

You'll also need to specify the data product we want to load. We'll be working with the `s1_gamma0_geotif_scene` product. The relevant information can be stored in a Python dictionary, which we'll pass to the `dc.load()` function later.

In [None]:
product_information = {
    'product': "s1_gamma0_geotif_scene",
    'output_crs': "EPSG:4326",
    'resolution': (0.00013557119,0.00013557119)
}
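
As a rough sanity check on the `resolution` value, the sketch below converts it from degrees to metres and estimates the size of the grid that will be loaded. The figure of about 111.32 km per degree of latitude is an approximation assumed here, and the variable names are illustrative rather than part of the notebook.

```python
# Approximate length of one degree of latitude in metres (an assumed constant).
METRES_PER_DEGREE = 111_320

resolution_deg = 0.00013557119
latitude = (-11.287611, -11.085876)
longitude = (130.324262, 130.452652)

# North-south size of one pixel in metres.
pixel_size_m = resolution_deg * METRES_PER_DEGREE

# Approximate number of rows and columns covering the study area.
n_rows = round((latitude[1] - latitude[0]) / resolution_deg)
n_cols = round((longitude[1] - longitude[0]) / resolution_deg)

print(f"Pixel size: about {pixel_size_m:.1f} m")
print(f"Grid: about {n_rows} rows by {n_cols} columns")
```

The east-west pixel size is slightly smaller because lines of longitude converge away from the equator, and the grid that `dc.load()` actually returns may differ by a pixel or so.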

### Specify latitude and longitude information

We can specify the latitude and longitude bounds of our area using the variables we defined earlier in the notebook.

In [None]:
area_information = {
    'latitude': latitude,
    'longitude': longitude
}

### Load Data

Above, we stored the information in two dictionaries. The `dc.load()` function can read them if we place `**` before the name of each dictionary, as demonstrated in the next cell.

In [None]:
dataset = dc.load(**product_information, **area_information)
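
If the `**` syntax is unfamiliar, the following minimal sketch shows that unpacking dictionaries is equivalent to writing out each keyword argument. The function `describe_query` is a hypothetical stand-in used only for illustration; it is not part of the Open Data Cube library.

```python
def describe_query(product, latitude, longitude):
    """Hypothetical stand-in for any function that accepts keyword arguments."""
    return f"{product}: lat {latitude}, lon {longitude}"

product_info = {'product': "s1_gamma0_geotif_scene"}
area_info = {
    'latitude': (-11.287611, -11.085876),
    'longitude': (130.324262, 130.452652),
}

# Each key becomes a keyword argument name and each value its argument.
unpacked = describe_query(**product_info, **area_info)

# The same call with every keyword written out by hand.
explicit = describe_query(
    product="s1_gamma0_geotif_scene",
    latitude=(-11.287611, -11.085876),
    longitude=(130.324262, 130.452652),
)

print(unpacked == explicit)
```

Keeping the query in dictionaries makes it easy to reuse the same area with a different product, or the same product with a different area.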

If the load was successful, running the next cell should return the `xarray` summary of the dataset. Make a note of the dimensions and data variables, as you'll need these during the data preparation and analysis.

In [None]:
print(dataset)

## Visualise loaded data

Sentinel 1 data has two observations, *VV* and *VH*, which correspond to the polarisation of the light sent and received by the satellite. *VV* refers to the satellite sending out vertically-polarised light and receiving vertically-polarised light back, whereas *VH* refers to the satellite sending out vertically-polarised light and receiving horizontally-polarised light back. These two bands can tell us different information about the area we're studying. 

Before running any plotting commands, we'll load the *matplotlib* library in the cell below, along with the *numpy* library. We'll also make use of the in-built plotting functions from *xarray*.

*Note that we take the base-10 logarithm of the bands and multiply by 10 before plotting them such that we work in units of decibels (dB) rather than digital number (DN).*

In [None]:
import matplotlib.pyplot as plt
import numpy as np
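
To get a feel for the decibel conversion described in the note above, here is a small sketch on made-up values (the numbers are illustrative, not taken from the Sentinel 1 data). Each factor of 10 in the original value corresponds to a step of 10 dB.

```python
import numpy as np

# Illustrative backscatter values, each ten times smaller than the last.
linear = np.array([1.0, 0.1, 0.01, 0.001])

# Same conversion as used for plotting: 10 * log10(value).
decibels = 10 * np.log10(linear)

print(decibels)
```

Because the logarithm of zero is undefined, a pixel with a value of exactly `0` is converted to negative infinity, so it is worth keeping in mind when interpreting plots and histograms.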

### Visualise VH bands

In [None]:
# Plot all VH observations for the year 

converted_vh = 10*np.log10(dataset.vh)  # Scale to plot data in decibels

converted_vh.plot(cmap="Greys_r", col="time", col_wrap=5)
plt.show()

In [None]:
# Plot the average of all VH observations

mean_converted_vh = converted_vh.mean(dim="time")

fig = plt.figure(figsize=(7,9))
mean_converted_vh.plot(cmap="Greys_r")
plt.title("Average VH")
plt.show()

What key differences do you notice between each individual observation and the mean?

### Visualise VV bands  

We've provided two empty cells for you to perform the same analysis as above, but now for the *VV* band. Try and type the code out -- it will help you get better at using the Open Data Cube library!

*Hint: You'll want to perform the same steps, but change the data variable. We've already used the `vh` variable, so you can go back and check the* `xarray` *summary to find the variable name for the VV observation.*

In [None]:
# Plot all VV observations for the year



In [None]:
# Plot the average of all VV observations



What key differences do you notice between each individual observation and the mean? What about differences between the average *VH* and *VV* bands?

Take a look back at the map image to remind yourself of the shape of the land and water of our study area. In both bands, what distinguishes the land and the water?

## Preprocessing the data through filtering

### Speckle Filtering using Lee Filter

You may have noticed that the water in the individual *VV* and *VH* images isn't a consistent colour. The distortion you're seeing is a type of noise known as speckle, which gives the images a grainy appearance. If we want to be able to easily decide whether any particular pixel is water or land, we need to reduce the chance of misinterpreting a water pixel as a land pixel due to the noise.

Speckle can be removed through filtering. If interested, you can find a technical introduction to speckle filtering [here](https://earth.esa.int/documents/653194/656796/Speckle_Filtering.pdf). For now, it is enough to know that we can filter the data using the python function defined in the next cell.

In [None]:
# Adapted from https://stackoverflow.com/questions/39785970/speckle-lee-filter-in-python

# Import from scipy.ndimage directly; the `filters` and `measurements`
# submodules are deprecated in recent SciPy releases.
from scipy.ndimage import uniform_filter, variance

def lee_filter(da, size):
    img = da.values
    img_mean = uniform_filter(img, (size, size))
    img_sqr_mean = uniform_filter(img**2, (size, size))
    img_variance = img_sqr_mean - img_mean**2

    overall_variance = variance(img)

    img_weights = img_variance / (img_variance + overall_variance)
    img_output = img_mean + img_weights * (img - img_mean)
    return img_output

Now that we've defined the filter, we can run it on the *VV* and *VH* data. You might have noticed that the function takes a `size` argument. This will change how blurred the image becomes after smoothing. We've picked a default value for this analysis, but you can experiment with this if you're interested.

In [None]:
# Set any null values to 0 before applying the filter to prevent issues
dataset_zero_filled = dataset.where(~dataset.isnull(), 0)

# Create a new entry in dataset corresponding to filtered VV and VH data
dataset["filtered_vv"] = dataset_zero_filled.vv.groupby('time').apply(lee_filter, size=7)
dataset["filtered_vh"] = dataset_zero_filled.vh.groupby('time').apply(lee_filter, size=7)
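
To see how the `size` argument affects the result without waiting on real data, the sketch below applies the same filter logic to a synthetic image. The function `lee_filter_array` is a hypothetical NumPy-only variant of `lee_filter`, and the scene (a dark "water" half, a bright "land" half, and gamma-distributed multiplicative noise) is an assumption chosen purely for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter, variance

def lee_filter_array(img, size):
    """Same steps as lee_filter, but taking a plain 2D NumPy array."""
    img_mean = uniform_filter(img, (size, size))
    img_sqr_mean = uniform_filter(img**2, (size, size))
    img_variance = img_sqr_mean - img_mean**2
    overall_variance = variance(img)
    img_weights = img_variance / (img_variance + overall_variance)
    return img_mean + img_weights * (img - img_mean)

rng = np.random.default_rng(0)

# Synthetic scene: left half dark ("water"), right half bright ("land").
truth = np.full((100, 100), 0.05)
truth[:, :50] = 0.005

# Multiplicative noise with a mean of 1, standing in for speckle.
noisy = truth * rng.gamma(shape=4.0, scale=0.25, size=truth.shape)

# Measure the remaining variation in a patch of "water" away from the edge.
water = (slice(None), slice(0, 40))
std_noisy = noisy[water].std()
std_size3 = lee_filter_array(noisy, size=3)[water].std()
std_size7 = lee_filter_array(noisy, size=7)[water].std()

print(f"unfiltered: {std_noisy:.5f}")
print(f"size=3:     {std_size3:.5f}")
print(f"size=7:     {std_size7:.5f}")
```

In this synthetic case a larger window smooths the water more strongly, at the cost of more blurring near the land-water boundary.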

### Visualise Filtered VH bands

In [None]:
# Plot all filtered VH observations for the year 

converted_filtered_vh = 10*np.log10(dataset.filtered_vh)  # Scale to plot data in decibels

converted_filtered_vh.plot(cmap="Greys_r", col="time", col_wrap=5)
plt.show()

In [None]:
# Plot the average of all filtered VH observations

mean_converted_filtered_vh = converted_filtered_vh.mean(dim="time")

fig = plt.figure(figsize=(7,9))
mean_converted_filtered_vh.plot(cmap="Greys_r")
plt.title("Average filtered VH")
plt.show()

### Visualise Filtered VV bands

In [None]:
# Plot all filtered VV observations for the year



In [None]:
# Plot the average of all filtered VV observations



Now that you've finished filtering the data, compare the plots before and after and you should be able to notice the impact of the filtering. If you're having trouble spotting it, it's more noticeable in the VH band.

### Plotting VH and VV histograms

Another way to observe the impact of filtering is to view histograms of the pixel values before and after filtering. Try running the next two cells to view the histograms for *VH* and *VV*.

In [None]:
original_vh_db = 10*np.log10(dataset.vh)
filtered_vh_db = 10*np.log10(dataset.filtered_vh)

fig = plt.figure(figsize=(15,3))
filtered_vh_db.plot.hist(bins = 1000, label="VH filtered")
original_vh_db.plot.hist(bins=1000, label="VH", alpha=.5)
plt.legend()
plt.title("Comparison of filtered VH bands to original")
plt.show()

In [None]:
original_vv_db = 10*np.log10(dataset.vv)
filtered_vv_db = 10*np.log10(dataset.filtered_vv)

fig = plt.figure(figsize=(15,3))
filtered_vv_db.plot.hist(bins=1000, label="VV filtered")
original_vv_db.plot.hist(bins=1000, label="VV", alpha=.5)
plt.legend()
plt.title("Comparison of filtered VV bands to original") 
plt.show()

You may have noticed that both the original and filtered bands show two peaks in the histogram, which we can classify as a bimodal distribution. Looking back at the band images, it's clear that the water pixels generally have lower *VH* and *VV* values than the land pixels. This lets us conclude that the lower distribution corresponds to water pixels and the higher distribution corresponds to land pixels. Importantly, the act of filtering has made it clear that the two distributions can be separated, which is especially obvious in the *VH* histogram. This allows us to confidently say that pixel values below a certain threshold are water, and pixel values above it are land. This will form the basis for our classifier in the next section.

## Designing a threshold-based water classifier

Given that the distinction between the `land` and `water` pixel value distributions is strongest in the *VH* band, we'll base our classifier on this distribution. To separate them, we can choose a threshold: pixels with values below the threshold are water, and pixels with values above the threshold are not water (land).

There are a number of ways to determine the threshold; one is to estimate it by looking at the *VH* histogram. From this, we might guess that $\text{threshold} = -20.0$ is a reasonable value. Run the cell below to set the threshold.

In [None]:
threshold = -20.0
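
Reading the threshold off the histogram is the approach used in this notebook. As one example of the other ways to determine it, the sketch below implements Otsu's method, which picks the value that best separates a histogram into two groups. This is a swapped-in technique, not part of the original analysis; the function `otsu_threshold` and the synthetic two-peak data are assumptions made for illustration.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the bin edge that maximises the between-class variance."""
    values = values[np.isfinite(values)]  # ignore NaN and infinite values
    counts, edges = np.histogram(values, bins=bins)
    centres = (edges[:-1] + edges[1:]) / 2

    # Pixel counts and means of the "low" and "high" groups for every split.
    weight_low = np.cumsum(counts)
    weight_high = weight_low[-1] - weight_low
    sum_low = np.cumsum(counts * centres)
    mean_low = sum_low / np.maximum(weight_low, 1)
    mean_high = (sum_low[-1] - sum_low) / np.maximum(weight_high, 1)

    between_class = weight_low * weight_high * (mean_low - mean_high) ** 2
    # The low group at split i ends at the upper edge of bin i.
    return edges[1:][np.argmax(between_class)]

# Synthetic bimodal data in dB: a "water" peak and a "land" peak (made up).
rng = np.random.default_rng(0)
water_db = rng.normal(loc=-25.0, scale=1.5, size=5000)
land_db = rng.normal(loc=-14.0, scale=2.0, size=5000)

estimated_threshold = otsu_threshold(np.concatenate([water_db, land_db]))
print(f"Estimated threshold: {estimated_threshold:.1f} dB")
```

On the real data, you could try passing `filtered_vh_db.values.ravel()` to a function like this and comparing the result with the value chosen by eye.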

The classifier separates data into two classes: data above the threshold and data below the threshold. In doing this, we assume that values of both segments correspond to the same `water` and `not water` distinctions we make visually. This can be represented with a step function:

$$  \text{water}(VH) = \left\{
     \begin{array}{lr}
       \text{True} & :   VH < \text{threshold}\\
       \text{False} & :  VH \geq \text{threshold}
     \end{array}
   \right. $$

<br>
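
The step function can be written as a single comparison. The sketch below uses three made-up dB values to show how pixels below, at, and above the threshold are classified.

```python
import numpy as np

threshold = -20.0

# Illustrative VH values in dB: below, exactly at, and above the threshold.
vh_db = np.array([-25.0, -20.0, -15.0])

# Strictly less than the threshold means water.
is_water = vh_db < threshold

print(is_water)  # [ True False False]
```

Note that a value exactly equal to the threshold is classified as not water, matching the $\geq$ in the second case of the step function.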


### Visualise threshold

To check if our chosen threshold reasonably divides the two distributions, we can add the threshold to the histogram plots we made earlier. Run the next two cells to view two different visualisations of this.

In [None]:
fig = plt.figure(figsize=(15,3))
plt.axvline(x=threshold, label='Threshold at {}'.format(threshold), color="red")
filtered_vh_db.plot.hist(bins=1000, label="VH filtered")
original_vh_db.plot.hist(bins=1000, label="VH", alpha=.5)
plt.legend()
plt.title("Histogram Comparison of filtered VH bands to original") 
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(15,3))
filtered_vh_db.plot.hist(bins=1000, label="VH filtered")
ax.axvspan(xmin=threshold, xmax=-.5, alpha=0.25, color='red', label="Not Water")
ax.axvspan(xmin=-40.0, xmax=threshold, alpha=0.25, color='green', label="Water")
plt.legend()
plt.title("Effect of the classifier") 
plt.show()

If you're curious about how changing the threshold impacts the classifier, try changing the threshold value and running the previous two cells again.

## Build and apply the classifier

Now that we know the threshold, we can write a function to only return the pixels that are classified as water. The basic steps that the function will perform are:
1. Check that the data set has a *VH* band to classify.
1. Clean the data by applying the speckle filter.
1. Convert the *VH* band measurements from digital number (DN) to decibels (dB) by taking the base-10 logarithm and multiplying by 10.
1. Find all pixels that have filtered dB values lower than the threshold; these are the `water` pixels.
1. Return a data set containing the `water` pixels.

These steps correspond to the actions taken in the function below. See if you can determine which parts of the function map to each step before you continue.

In [None]:
import numpy as np
import xarray as xr 

def s1_water_classifier(ds:xr.Dataset, threshold=-20.0) -> xr.Dataset:
    assert "vh" in ds.data_vars, "This classifier is expecting a variable named `vh` expressed in DN, not dB values"
    filtered = ds.vh.groupby('time').apply(lee_filter, size=7)
    water_data_array = 10*np.log10(filtered) < threshold
    return water_data_array.to_dataset(name="s1_water")

Now that we have defined the classifier function, we can apply it to the data. After you run the classifier, you'll be able to view the classified data product by running `print(dataset.s1_water)`. Try adding this line to the cell below, or add a new cell and run it there.

In [None]:
dataset["s1_water"] = s1_water_classifier(dataset).s1_water

### Validation with mean

We can now view the image with our classification. The classifier returns either `True` or `False` for each pixel. To detect the shoreline, we want to check which pixels are always water and which are always land. Conveniently, Python encodes `True = 1` and `False = 0`. If we plot the average classified pixel value, pixels that are always water will have an average value of `1` and pixels that are always land will have an average of `0`. Pixels that are sometimes water and sometimes land will have an average between these values.

The following cell plots the average classified pixel value over time. How might you classify the shoreline from the average classification value?

In [None]:
# Plot the mean of each classified pixel value

plt.figure(figsize=(15,12))
dataset.s1_water.mean(dim="time").plot(cmap="RdBu")
plt.title("Average classified pixel value")
plt.show()

#### Interpreting the mean classification 

From the image above, you should be able to see that the shoreline takes on a mix of values between `0` and `1`. You can also see that our threshold has done a good job of separating the water pixels (in blue) and land pixels (in red). 
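
To make the averaging concrete, here is a tiny sketch with a made-up stack of four classified observations of three pixels. The array and its values are illustrative only.

```python
import numpy as np

# Four observations (time) of a 1 x 3 strip of pixels (y, x).
# Pixel 0 is always water, pixel 1 alternates, pixel 2 is always land.
classified = np.array([
    [[True, True, False]],
    [[True, False, False]],
    [[True, True, False]],
    [[True, False, False]],
])

# True counts as 1 and False as 0, so the mean over time is the
# fraction of observations in which each pixel was classified as water.
water_fraction = classified.mean(axis=0)

print(water_fraction)  # [[1.  0.5 0. ]]
```

The middle pixel, with a value strictly between `0` and `1`, behaves like the shoreline pixels in the plot above.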

### Validation with standard deviation

Given that we've identified the shoreline as the pixels that are classified sometimes as land and sometimes as water, we can also see if the standard deviation of each pixel in time is a reasonable way to determine if a pixel is shoreline or not. Similar to how we calculated and plotted the mean above, you can calculate and plot the standard deviation by using the `std` function in place of the `mean` function. Try writing the code in the next cell.

*Hint: the only things you'll need to change from the code above are the function you use and the title of the plot.*

If you'd like to see the results using a different colour-scheme, you can also try substituting `cmap = "Greys"` or `cmap = "Blues"` in place of `cmap = "RdBu"` from the previous plot.

In [None]:
# Plot the standard deviation of each classified pixel value


#### Interpreting the standard deviation of the classification

From the image above, you should be able to see that the land and water pixels almost always have a standard deviation of `0`, meaning they didn't change over the time we sampled. With further investigation, you could potentially turn this statistic into a new classifier to extract shoreline pixels. If you're after a challenge, have a think about how you might approach this.

An important thing to recognise is that the standard deviation might not be able to detect the difference between noise and ongoing change, since a pixel that frequently alternates between land and water (noise) could have the same standard deviation as a pixel that is land for some time, then becomes water for the remaining time (ongoing change). Consider how you might distinguish between these two different cases with the data and tools you have.
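
The sketch below illustrates the limitation with two made-up pixel histories that share the same standard deviation. Counting how often the classification flips is one possible way to tell them apart; the helper `count_transitions` is hypothetical and only one of several approaches you might take.

```python
import numpy as np

# Two illustrative pixel histories (1 = water, 0 = land).
noisy_pixel = np.array([0, 1, 0, 1, 0, 1, 0, 1])    # flips every observation
changed_pixel = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # land, then water

# Both are water half of the time, so the standard deviations match.
print(noisy_pixel.std(), changed_pixel.std())

def count_transitions(series):
    """Count how many times consecutive observations differ."""
    return int(np.count_nonzero(np.diff(series)))

noisy_transitions = count_transitions(noisy_pixel)
changed_transitions = count_transitions(changed_pixel)

print(noisy_transitions, changed_transitions)
```

A pixel with many transitions is more likely to be noise, whereas a single transition is more consistent with a lasting change.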

## Detecting coastal change

The standard deviation we calculated before gives us an idea of how much the pixel has changed over the entire period of time that we looked at. It might also be interesting to look at which pixels have changed between any two particular times in our sample.

In the next cell, we choose the images to compare. Printing the dataset should show you that there are 27 time-steps, so the first has an index value of `0`, and the last has an index value of `26`. You can change these to be any numbers in between, as long as the start is earlier than the end.

In [None]:
start_time_index = 0
end_time_index = 26

Next, we can define the change as the difference in the classified pixel value at each point. Land becoming water will have a value of `-1` and water becoming land will have a value of `1`.

In [None]:
change = np.subtract(dataset.s1_water.isel(time=start_time_index), dataset.s1_water.isel(time=end_time_index), dtype=np.float32)
change = change.where(change != 0)  # set all '0' entries to NaN, which prevents them from displaying in the plot.
dataset["change"] = change
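
The sign convention can be checked on a small made-up example that covers all four possible cases. The arrays below are illustrative and use plain NumPy rather than the loaded dataset.

```python
import numpy as np

# Classification at the start and end times (True = water, False = land).
#                 land->water  water->land  water->water  land->land
start = np.array([False,       True,        True,         False])
end = np.array([  True,        False,       True,         False])

# Start minus end, converting the booleans to floats first.
change = start.astype(np.float32) - end.astype(np.float32)

print(change)  # [-1.  1.  0.  0.]
```

Pixels with the same class at both times give `0`, which is why those entries are replaced with NaN before plotting.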

Now that we've added change to the data set, you should be able to plot it below to look at which pixels changed. You can also plot the original mean *VH* composite to see how well the change matches our understanding of the shoreline location.

In [None]:
plt.figure(figsize=(15,12))
dataset.filtered_vh.mean(dim="time").plot(cmap="Blues")
dataset.change.plot(cmap="RdBu", levels=2)
plt.title('Change in pixel value between time={} and time={}'.format(start_time_index, end_time_index))
plt.show()

## Drawing conclusions
Here are some questions to think about:

* What are the benefits and drawbacks of the possible classification options we explored?
* How could you extend the analysis to extract a shape for the coastline?
* How reliable is our classifier?
* Is there anything you can think of that would improve it?

## Next steps
When you are done, you can return to the start of the notebook and change the latitude and longitude if you're interested in rerunning the analysis for a new location. If you're going to change the location, you'll need to make sure Sentinel 1 data is available for the new location, which you can check at the [DEA Explorer](https://explorer.sandbox.dea.ga.gov.au/s1_gamma0_geotif_scene). Once you've found an area on the dashboard, navigate to it on the interactive map at the start of the notebook, and click on the area to get the latitude and longitude. You can then define a new area from these values and re-run the map to check that you're covering the area you're interested in.