# Matching satellite images to field samples 

* **Compatibility:** Notebook currently compatible with the `NCI`|`DEA Sandbox` environment only
* **Products used:** 
[s2a_ard_granule](https://explorer.sandbox.dea.ga.gov.au/s2a_ard_granule), 
[s2b_ard_granule](https://explorer.sandbox.dea.ga.gov.au/s2b_ard_granule)

## Background

An important aspect of working with satellite data is linking it to physical processes and features of the Earth. 
A key technical step is identifying the satellite data that is closest in time and space to each measurement made on the ground.

## Description
This notebook covers how to return a measurement of the Normalised Difference Chlorophyll Index (NDCI) for the pixel closest to a sampling location that measured the concentration of Chlorophyll-*a*.
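
As a minimal sketch of the index itself, the NDCI is the normalised difference between the red edge and red reflectance values. The function name `ndci`, the zero-denominator guard, and the reflectance values below are illustrative assumptions, not part of the notebook's analysis.

```python
def ndci(red_edge, red):
    """Normalised Difference Chlorophyll Index for one pixel.

    NDCI = (red_edge - red) / (red_edge + red)
    """
    total = red_edge + red
    if total == 0:
        # Avoid dividing by zero when both bands are zero
        return float("nan")
    return (red_edge - red) / total


# Hypothetical reflectance values for a single pixel
print(round(ndci(1200, 800), 4))
```

The index is bounded between -1 and 1 for non-negative reflectance values, with higher values indicating a stronger red edge response relative to red.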

***


## Getting started
To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell. 

### Load packages

In [None]:
%matplotlib inline

import datacube
from datacube.utils.geometry import CRS, point
import dask
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sys
import xarray as xr
from sklearn.linear_model import LinearRegression
from scipy.stats import spearmanr

from utils.dea_datahandling import load_ard

### Connect to the datacube
Give your datacube app a unique name. 
Ideally, this will be the same as the notebook file name.

In [None]:
dc = datacube.Datacube(app="Site_matching")

### Analysis parameters

* `sample_loc_file`: Name and location of the CSV file containing sampling locations with corresponding latitudes and longitudes (e.g. `../Supplementary_data/Site_matching/locations.csv`).
* `sample_date_file`: Name and location of the CSV file containing sampling dates for each location and sample measurements (e.g. `../Supplementary_data/Site_matching/dates.csv`).
* `obs_window`: Number of days from the ground observation to search for a matching satellite image (e.g. `7`). The window is applied both before and after the ground observation date.

In [None]:
sample_loc_file = "utils/chlorophyll_sampling_coorong_locations.csv"
sample_date_file = "utils/chlorophyll_sampling_coorong_dates.csv"
obs_window = 7

## Load the input data


In [None]:
# Load the location csv
locations = pd.read_csv(sample_loc_file)

# View the first 5 entries
locations.head()

In [None]:
# Load observations csv
observations = pd.read_csv(sample_date_file)

# View the first 5 entries
observations.head()

## Identify the closest satellite data in time and space
* For each site:
    * Loop over each sampling date and extract key satellite band values and the computed NDCI at the pixel closest to the site's latitude and longitude
    * Discard the match if the closest satellite observation is more than `obs_window` days before or after the sampling date
    * Record the valid results
* Save all valid results to a single file
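
The date-matching step can be sketched with the standard library alone, without any datacube calls. The helper `closest_within_window` and the acquisition dates are hypothetical and only illustrate the logic; the notebook itself relies on xarray's nearest-neighbour selection.

```python
from datetime import datetime


def closest_within_window(target, candidates, window_days):
    """Return (closest_date, gap_in_days), or None if the gap exceeds the window."""
    # Pick the candidate with the smallest absolute time difference
    closest = min(candidates, key=lambda d: abs(d - target))
    gap = abs(closest - target).days
    return (closest, gap) if gap <= window_days else None


# Hypothetical satellite acquisition dates
acquisitions = [datetime(2019, 1, 3), datetime(2019, 1, 13), datetime(2019, 1, 23)]

# Sampling date parsed from a day/month/year string
sample = datetime.strptime("10/01/2019", "%d/%m/%Y")
print(closest_within_window(sample, acquisitions, window_days=7))
```

Because the absolute difference is used, the window applies symmetrically before and after the sampling date.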

In [None]:
# Get a list of all site dates to loop over
site_dates = observations["Date"].values

# Define the satellite products to search and measurements to return
products = ["s2a_ard_granule", "s2b_ard_granule"]
measurements = ["nbart_red_edge_1", "nbart_red"]

# Construct empty lists to store data
red_values = []
rededge_values = []
ndci_values = []
time_deltas = []
site_names = []
match_values = []
match_dates = []

# Loop over all sites
for site in locations.itertuples():
    # Extract information from locations table
    site_latitude = site.Lat
    site_longitude = site.Long
    site_name = site.SiteName
    site_count = site.Index + 1

    print(f"Processing site: {site_name}")

    # Extract the ground observations for the current site
    site_values = observations.iloc[:, site_count].to_list()

    # Convert from (long, lat) to (x, y) for finding nearest pixel to site
    site_point_ll = point(site_longitude, site_latitude, crs=CRS("EPSG:4326"))
    site_xy = site_point_ll.to_crs(CRS("EPSG:3577")).points[0]
    site_x = site_xy[0]
    site_y = site_xy[1]

    # Generate area to search over by adding buffer in degrees
    buffer = 0.001
    search_lon = (site_longitude - buffer, site_longitude + buffer)
    search_lat = (site_latitude - buffer, site_latitude + buffer)

    # Find all Sentinel-2 timesteps for selected area
    # Load with dask, meaning that data won't be loaded until
    # an exact value needs to be returned
    ds_s2 = load_ard(
        dc,
        products=products,
        measurements=measurements,
        x=search_lon,
        y=search_lat,
        output_crs="EPSG:3577",
        resolution=(-10, 10),
        lazy_load=True,
    )
    
    print("\n")

    # Loop over all dates to identify closest data within `obs_window` days
    for count, date_string in enumerate(site_dates):
        # Get the date to compare to
        target_date = datetime.strptime(date_string, "%d/%m/%Y")

        # Isolate the timestep that is closest to the sample date
        ds_closest = ds_s2.sel(time=target_date, method="nearest")

        # Calculate the time-difference between sample date and closest date
        time_delta = np.abs(pd.to_datetime(target_date) - ds_closest.time.values)

        if time_delta.days <= obs_window:
            # Isolate the pixel that is closest to the site location
            ds_closest = ds_closest.sel(x=site_x, y=site_y, method="nearest")

            # Calculate the NDCI from the Red Edge and Red bands
            ds_closest["ndci"] = (ds_closest.nbart_red_edge_1 - ds_closest.nbart_red) / (ds_closest.nbart_red_edge_1 + ds_closest.nbart_red)

            # Record the entry only if both the NDCI value and the site value are not NaN
            if not np.isnan(ds_closest.ndci.values.item()) and not np.isnan(site_values[count]):
                # Append appropriate values to list
                site_names.append(site_name)
                match_dates.append(target_date)
                match_values.append(site_values[count])
                ndci_values.append(round(ds_closest.ndci.values.item(), 4))
                red_values.append(ds_closest.nbart_red.values.item())
                rededge_values.append(ds_closest.nbart_red_edge_1.values.item())
                time_deltas.append(int(time_delta.days))

# Compile all valid results into a Pandas table
location_results = pd.DataFrame(
    {
        "SiteName": site_names,
        "ObservationDate": match_dates,
        "TimeDelta": time_deltas,
        "ObservationValue": match_values,
        "Red": red_values,
        "RedEdge": rededge_values,
        "NDCI": ndci_values,
    }
)

# Save valid results to a csv
file_path = "MatchedData.csv"
location_results.to_csv(file_path, na_rep="NaN", index=False)

## Perform linear regression and calculate correlation

In [None]:
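# Reshape NDCI into the 2D (n_samples, 1) array that scikit-learn expects
X = np.asarray(location_results["NDCI"]).reshape(-1, 1)
y = location_results["ObservationValue"]

# Evenly spaced NDCI values for drawing the fitted line
x_lin = np.linspace(np.min(X), np.max(X), 100).reshape(-1, 1)

# Fit the linear model and predict along the NDCI range
lm = LinearRegression()
model = lm.fit(X, y)

predictions = lm.predict(x_lin)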

In [None]:
location_results.plot.scatter(x="NDCI", y="ObservationValue")
plt.plot(x_lin, predictions, "k")
plt.title("Linear fit")
plt.xlabel("NDCI")
plt.ylabel("Chlorophyll-a Concentration (mg/L)")
plt.show()

In [None]:
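# lm.score returns the coefficient of determination (R^2) of the fit
pearson_rsq = lm.score(X, y)
# Spearman's rho measures the rank (monotonic) correlation between NDCI and the observations
spearman_rho = spearmanr(X, y).correlation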

print("Linear fit results")
print(f"Pearson's R^2 = {pearson_rsq}")
print(f"Spearman's rho = {spearman_rho}")
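
To illustrate how the two statistics can differ, the following sketch computes both by hand on synthetic data that is monotonic but not linear. The helper functions and data are hypothetical, and the ranking assumes no tied values; the notebook itself uses scikit-learn and SciPy for these calculations.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def ranks(values):
    """Rank values from 1 to n (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(a, b):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson_r(ranks(a), ranks(b))


# Synthetic data: y increases with x, but not along a straight line
x = [0.0, 0.1, 0.2, 0.3, 0.4]
y = [v ** 3 for v in x]

print(f"Squared Pearson r = {pearson_r(x, y) ** 2}")
print(f"Spearman's rho    = {spearman_rho(x, y)}")
```

For a simple linear regression, R^2 equals the squared Pearson correlation, so a rank correlation that is noticeably stronger than R^2 suggests a monotonic relationship that a straight line does not fully capture.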

## Perform polynomial fit

In [None]:
x = location_results["NDCI"]

# Fit a second-degree polynomial; full=True also returns the sum of squared residuals
poly_coeff, poly_ssqres, _, _, _ = np.polyfit(x, y, deg=2, full=True)
poly_func = np.poly1d(poly_coeff)

In [None]:
location_results.plot.scatter(x="NDCI", y="ObservationValue")
plt.plot(x_lin, poly_func(x_lin), "k")
plt.title("Polynomial Fit")
plt.xlabel("NDCI")
plt.ylabel("Chlorophyll-a Concentration (mg/L)")
plt.show()

In [None]:
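# R^2 = 1 - (residual sum of squares / total sum of squares)
poly_ssqtot = np.sum((y - np.mean(y))**2)
poly_pearson_rsq = 1 - poly_ssqres/poly_ssqtot

print("Polynomial fit results")
print(f"Pearson's R^2 = {poly_pearson_rsq[0]}")
# Spearman's rho is rank-based and does not depend on the fitted model,
# so it is the same value reported for the linear fit
print(f"Spearman's rho = {spearman_rho}")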
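
The R^2 calculation used for the polynomial fit can be sketched independently of any particular model: it compares the residual sum of squares with the total sum of squares around the mean. The function `r_squared` and the values below are hypothetical and serve only to show the two limiting cases.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observations
observed = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual
print(r_squared(observed, observed))

# Always predicting the mean explains none of the variance
print(r_squared(observed, [2.5] * 4))
```

A model that fits worse than simply predicting the mean would give a negative value under this definition.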

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Australia data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [GitHub](https://github.com/GeoscienceAustralia/dea-notebooks).

**Last modified:** October 2019

**Compatible `datacube` version:** 

In [None]:
print(datacube.__version__)

## Tags
Browse all available tags on the DEA User Guide's [Tags Index](https://docs.dea.ga.gov.au/genindex.html)