<center><img src="https://raw.githubusercontent.com/EO-College/cubes-and-clouds/main/icons/cnc_3icons_process_circle.svg"
     alt="Cubes & Clouds logo"
     style="float: center; margin-right: 10px; margin-left: 10px; max-height: 250px;" /></center>

# 3.2 Validation of the results with Pangeo

<img src="https://raw.githubusercontent.com/pangeo-data/pangeo.io/refs/heads/main/public/Pangeo-assets/pangeo_logo.png"
     alt="Pangeo logo"
     style="float: center; margin-right: 10px; max-height: 80px;"/>
     
In this exercise, we validate the results we have produced with the Pangeo ecosystem. In general, the accuracy of a satellite-derived product is assessed by comparing it to in-situ measurements. In addition, we will compare the resulting snow cover time series to the runoff of the catchment to check the plausibility of the observed relationship.

The steps involved in this analysis:
- Generate the data cube time series of the snow map,
- Load the _in-situ_ datasets: snow depth station measurements,
- Pre-process and filter the _in-situ_ datasets to match the area of interest,
- Validate the snow map against the snow depth measurements,
- Check plausibility against the runoff of the catchment.

Start by creating the folders and data files needed to complete the exercise.

In [None]:
!cp -r ${DATA_PATH%/*/*}/notebooks/cubes-and-clouds/lectures/3.2_validation/exercises/32_data $HOME/
!cp -r ${DATA_PATH%/*/*}/notebooks/cubes-and-clouds/lectures/3.2_validation/exercises/_32_pangeo_utilities.py $HOME/
!mkdir -p $HOME/32_results

## Libraries

In [None]:
import json
from datetime import date
import numpy as np
import pandas as pd

import xarray as xr
import rioxarray as rio

import matplotlib.pyplot as plt
import rasterio
from rasterio.plot import show

import geopandas as gpd
import folium

from _32_pangeo_utilities import ( calculate_sca,
                                 station_temporal_filter,
                                 station_spatial_filter,
                                 binarize_snow,
                                 assign_site_snow,
                                 validation_metrics)
import os
import warnings
warnings.filterwarnings('ignore')

## Region of Interest

Load the Val Passiria catchment, our region of interest, and plot it.

In [None]:
catchment_outline = gpd.read_file('32_data/catchment_outline.geojson', crs="EPSG:4326")
catchment_outline

In [None]:
center_loc = catchment_outline.to_crs('+proj=cea').centroid.to_crs(epsg="4326")

In [None]:
# OpenStreetMap
map = folium.Map(location=[float(center_loc.y.iloc[0]), float(center_loc.x.iloc[0])], tiles="OpenStreetMap", zoom_start=9)
geo_j = catchment_outline["geometry"].to_json()
geo_j = folium.GeoJson(data=geo_j, style_function=lambda x: {"fillColor": "orange"})
geo_j.add_to(map)
map

## Generate Datacube of Snowmap

We have prepared the workflow to generate the snow map as a Python function, `calculate_sca()`. It is imported from `_32_pangeo_utilities` and reproduces the snow map process graph using the Pangeo software stack.

In [None]:
start_date = "2018-02-01"
end_date = "2018-06-30"
bbox = tuple(catchment_outline.bounds.iloc[0])
temporal_extent = [start_date, end_date]
snow_map_cloud_free = calculate_sca(bbox, temporal_extent)
snow_map_cloud_free

## Load snow-station in-situ data
Load the _in-situ_ datasets, snow depth station measurements. They have been compiled in the ClirSnow project and are available here: [Snow Cover in the European Alps](https://zenodo.org/record/5109574) with stations in our area of interest. 

We have made the data available for you already. We can load it directly.

In [None]:
# load snow station datasets from Zenodo: https://zenodo.org/record/5109574
station_df = pd.read_csv("32_data/data_daily_IT_BZ.csv", parse_dates=["Date"], date_format="%d.%m.%y")
station_df.head()

In [None]:
# load additional metadata for accessing the station geometries
station_df_meta = pd.read_csv("32_data/meta_all.csv")
station_df_meta.head()

## Pre-process and filter _in-situ_ snow station measurements

### Filter Temporally
Filter the in-situ datasets to match the snow map time series using the function `station_temporal_filter()` from `_32_pangeo_utilities.py`, which merges the station dataframe with the additional metadata needed for the latitude/longitude information and converts the coordinates to geometries.

In [None]:
snow_stations = station_temporal_filter(station_daily_df = station_df, 
                                        station_meta_df = station_df_meta,
                                        start_date = start_date,
                                        end_date = end_date)
snow_stations.head()

### Filter Spatially
Filter the in-situ datasets to the catchment area of interest using `station_spatial_filter()` from `_32_pangeo_utilities.py`.

In [None]:
catchment_stations = station_spatial_filter(snow_stations, catchment_outline)
catchment_stations.head()

### Plot the filtered stations
Count the unique snow stations within the catchment; their locations are plotted on a map later in this exercise.

In [None]:
print("There are", len(np.unique(catchment_stations.Name)), "unique stations within our catchment area of interest")

**_Quick Hint: Remember the number of stations within the catchment for the final quiz exercise_**

### Convert snow depth to snow presence
The stations measure snow depth. We only need the binary information on the presence of snow (yes, no). We use the `binarize_snow()` function from `_32_pangeo_utilities.py` to assign 0 for no snow and 1 for snow in the **snow_presence** column.

In [None]:
catchment_stations = catchment_stations.assign(snow_presence=catchment_stations.apply(binarize_snow, axis=1))
catchment_stations.head()
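
The thresholding idea can be sketched as follows. This is a minimal illustration, not the implementation in `_32_pangeo_utilities.py`: the function name `binarize_snow_sketch`, the threshold of 0.4 (taken from the red line drawn in the station plot later in this exercise), and the toy values are assumptions. Only the column name `HS_after_gapfill` comes from the station data.

```python
import numpy as np
import pandas as pd

def binarize_snow_sketch(row, threshold=0.4):
    """Return 1 if snow depth exceeds the threshold, 0 if not, NaN if missing.

    Hypothetical stand-in for binarize_snow(); the threshold is an assumption.
    """
    depth = row["HS_after_gapfill"]
    if pd.isna(depth):
        return np.nan
    return 1 if depth > threshold else 0

# Toy snow depths, not real station data
toy = pd.DataFrame({"HS_after_gapfill": [0.0, 0.3, 1.2, np.nan]})
toy = toy.assign(snow_presence=toy.apply(binarize_snow_sketch, axis=1))
print(toy)
```

Keeping missing depths as NaN, rather than forcing them to 0, prevents gaps in the station record from being counted as snow-free observations.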

### Save the pre-processed snow station measurements
Save snow stations within catchment as GeoJSON

In [None]:
with open("32_results/catchment_stations_pangeo.geojson", "w") as file:
    file.write(catchment_stations.to_json())

## Extract SCA from the data cube per station

### Prepare snow station data for usage in Pangeo
Create a buffer of approximately 80 meters (0.00075 degrees) around snow stations and visualize them.

In [None]:
catchment_stations_gpd = gpd.read_file("32_results/catchment_stations_pangeo.geojson")

# OpenStreetMap
map = folium.Map(location=[float(center_loc.y.iloc[0]), float(center_loc.x.iloc[0])], tiles="OpenStreetMap", zoom_start=10)

# catchment
catchment_layer = folium.FeatureGroup(name="catchment", show=True).add_to(map)
folium.GeoJson(data=catchment_outline["geometry"].to_json(), style_function=lambda x: {"fillColor": "orange"}).add_to(catchment_layer)

# catchment stations
stations_layer = folium.FeatureGroup(name="catchment stations", show=True).add_to(map)

for _, r in catchment_stations_gpd[["Longitude", "Latitude"]].drop_duplicates().iterrows():
    # Place the markers with the popup labels and data
    folium.Marker(location=[r["Latitude"], r["Longitude"]],
                  popup="Latitude: " + str(r["Latitude"]) 
                  + "<br>" 
                  + "Longitude: " + str(r["Longitude"])
                 ).add_to(stations_layer)
    
# catchment buffer
buffer_layer = folium.FeatureGroup(name="catchment station buffer", show=True).add_to(map)
catchment_stations_gpd["geometry"] = catchment_stations_gpd.geometry.buffer(0.00075)

# Draw each unique station buffer once
folium.GeoJson(data=catchment_stations_gpd["geometry"].drop_duplicates().to_json(), style_function=lambda x: {"color": "#0F7229", "fillOpacity": 0}).add_to(buffer_layer)


folium.LayerControl().add_to(map)
map

Get the unique geometries for each catchment station buffer 

In [None]:
catchment_stations_gpd.geometry.drop_duplicates()
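
As a rough check of the "approximately 80 meters" figure, a buffer given in degrees can be converted to meters. The sketch below assumes a spherical Earth with a mean radius of 6,371 km and a latitude of about 46.8° N for the catchment; both are assumptions for illustration, not values taken from the exercise data.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius (spherical approximation)
buffer_deg = 0.00075         # buffer distance used in the exercise
latitude_deg = 46.8          # assumed approximate latitude of the catchment

# One degree of latitude spans the same distance everywhere on a sphere
meters_per_deg_lat = math.pi * EARTH_RADIUS_M / 180
# One degree of longitude shrinks with the cosine of the latitude
meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(latitude_deg))

buffer_north_south_m = buffer_deg * meters_per_deg_lat
buffer_east_west_m = buffer_deg * meters_per_deg_lon
print(f"north-south: {buffer_north_south_m:.0f} m, east-west: {buffer_east_west_m:.0f} m")
```

Because a degree of longitude is shorter than a degree of latitude at this latitude, a buffer drawn in degrees is not circular on the ground. Buffering in a projected CRS such as EPSG:32632 would avoid this.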

### Set the CRS and nodata value of the data cube
Before we can extract SCA values at the buffered station locations, `rioxarray` needs to know the coordinate reference system of the data cube. We record it as EPSG:32632 (UTM zone 32N) and set NaN as the nodata value.

In [None]:
snow_map_cloud_free.rio.write_crs("EPSG:32632", inplace=True)
snow_map_cloud_free.rio.set_nodata(np.nan, inplace=True)

### Reduce the amount of data by selecting a small Area Of Interest (AOI)

In [None]:
catchment_stations_gpd_utm32 = catchment_stations_gpd.to_crs(epsg=32632)
minx, miny, maxx, maxy = catchment_stations_gpd_utm32[["geometry"]].drop_duplicates().total_bounds

In [None]:
snowmap_clipped = snow_map_cloud_free.sel(x=slice(minx,maxx), y = slice(maxy, miny))
snowmap_clipped

### Aggregate to daily values
Data aggregation is a very important step in the analysis. It reduces the amount of data and makes the analysis more efficient. Moreover, because in this case we aggregate the data to daily values, it will allow us to compute statistics on the data at the basin scale later on.

The `groupby` method groups the data along the time dimension: each timestamp is floored to its date, which removes the time-of-day information. Once the groups are obtained, we aggregate each one by taking the maximum value.

In [None]:
geoms = []
for _, r in catchment_stations_gpd_utm32[["geometry"]].drop_duplicates().iterrows():
    geoms.append(r["geometry"])

snowmap_clipped = snow_map_cloud_free.rio.clip(geoms).groupby(snow_map_cloud_free.time.dt.floor('D')).max(dim="time")
snowmap_clipped = snowmap_clipped.rename({"floor":"time"})
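
The same floor-then-aggregate pattern can be tried on a small table. The sketch below is a pandas analogue of the xarray call above, using made-up timestamps and snow values purely for illustration.

```python
import pandas as pd

# Toy acquisitions: two on 1 Feb (for example overlapping tiles), one on 6 Feb
times = pd.to_datetime(["2018-02-01 10:17", "2018-02-01 10:27", "2018-02-06 10:17"])
snow = pd.Series([0, 1, 0], index=times, name="snow")

# Floor each timestamp to its day, then keep the maximum value per day
daily = snow.groupby(snow.index.floor("D")).max()
print(daily)
```

Taking the maximum means a day counts as snow covered if any acquisition on that day detected snow.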

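It's time to persist the data in memory. The `persist()` method triggers the computation and returns a new object whose data is held in memory, so we assign the result back to `snowmap_clipped` to keep it until the end of the analysis.

In [None]:
%%time
snowmap_clipped = snowmap_clipped.persist()
snowmap_clipped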

### Extract SCA from the data cube per station
We extract the SCA value of our data cube at the buffered station locations using the aggregation method `median()`. For binary values, the median gives the most common value in the buffer (snow or snow free).

**Please note: this step may take around 5 minutes!**

In [None]:
%%time
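x = []  # one time series of median snow values per station buffer
for _, r in catchment_stations_gpd_utm32[["geometry"]].drop_duplicates().iterrows():
    # clip the daily snow map to a single station buffer
    snowmap_station = snowmap_clipped.rio.clip([r["geometry"]])
    # persist() returns a new object, so keep the result
    snowmap_station = snowmap_station.persist()
    # median over the spatial dimensions: majority class of the binary snow map
    median = snowmap_station.median(["x","y"])
    x.append(median.to_pandas())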
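
To see why the median acts as a majority vote on binary values, consider the toy example below. It uses NumPy rather than xarray and invented pixel values; `np.nanmedian` is used here to mimic xarray's default behaviour of skipping NaN values in float data.

```python
import numpy as np

# Toy pixel values inside one station buffer: 1 = snow, 0 = snow free, NaN = no data
mostly_snow = np.array([1, 1, 0, 1, np.nan])
tie = np.array([1, 0, 1, 0])

# nanmedian ignores the missing pixel and returns the majority class
majority = np.nanmedian(mostly_snow)
# with equally many snow and snow-free pixels the median falls between the classes
ambiguous = np.nanmedian(tie)
print(majority, ambiguous)
```

A median that is neither 0 nor 1 signals a tie between snow and snow-free pixels, so it is worth checking for such values before the validation.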

Save the values to CSV files, one per station

In [None]:
%%time

os.makedirs("32_results/snowmap_pangeo", exist_ok=True)
# enumerate() gives the position in x; the DataFrame index label is not guaranteed to match it
for idx, name in enumerate(catchment_stations_gpd_utm32["Name"].drop_duplicates()):
    print(idx, name)
    x[idx].to_csv("32_results/snowmap_pangeo/" + name + ".csv")

## Combine station measurements and the extracted SCA from our data cube
The **station measurements** are **daily** and all of the stations are combined in **one CSV file**.
The **extracted SCA values** are nominally **five-daily** (Sentinel-2 revisit time with two satellites) and are stored in **one CSV file per station**.
We need to join the extracted SCA values with the station measurements by station and by time, selecting the corresponding time steps.

### Extract snow values from SCA extracted at the station location
Let's have a look at the data structure first

Open the snow covered area time series extracted at the stations. We'll have a look at it in a second.

In [None]:
x = []
for idx,r in catchment_stations_gpd[["Name"]].drop_duplicates().iterrows():
    print(idx, r["Name"])
    x.append(pd.read_csv("32_results/snowmap_pangeo/" + r["Name"] + ".csv", parse_dates=["time"], index_col="time"))

In [None]:
dates = x[0].index.tolist()
snow_val_smartino = [y[0] for y in x[0].values]
snow_val_rifiano = [y[0] for y in x[1].values]
snow_val_plata = [y[0] for y in x[2].values]
snow_val_sleonardo = [y[0] for y in x[3].values]
snow_val_scena = [y[0] for y in x[4].values]

### Match in-situ measurements to dates in SCA 
Let's have a look at the in-situ measurement data set.

In [None]:
catchment_stations_gpd = gpd.read_file("32_results/catchment_stations_pangeo.geojson")

Convert column "id" from strings to dates to enable selection by dates

In [None]:
catchment_stations_gpd["id"] = pd.to_datetime(catchment_stations_gpd["id"])

We are going to extract each station and keep only the dates that are available in the SCA results.

In [None]:
catchment_stations_gpd_smartino = catchment_stations_gpd.query("Name == 'S_Martino_in_Passiria_Osservatore'")
catchment_stations_gpd_smartino = catchment_stations_gpd_smartino[
    catchment_stations_gpd_smartino.id.isin(dates)
]

catchment_stations_gpd_rifiano = catchment_stations_gpd.query("Name == 'Rifiano_Beobachter'")
catchment_stations_gpd_rifiano = catchment_stations_gpd_rifiano[
    catchment_stations_gpd_rifiano.id.isin(dates)
]

catchment_stations_gpd_plata = catchment_stations_gpd.query("Name == 'Plata_Osservatore'")
catchment_stations_gpd_plata = catchment_stations_gpd_plata[
    catchment_stations_gpd_plata.id.isin(dates)
]

catchment_stations_gpd_sleonardo = catchment_stations_gpd.query("Name == 'S_Leonardo_in_Passiria_Osservatore'")
catchment_stations_gpd_sleonardo = catchment_stations_gpd_sleonardo[
    catchment_stations_gpd_sleonardo.id.isin(dates)
]

catchment_stations_gpd_scena = catchment_stations_gpd.query("Name == 'Scena_Osservatore'")
catchment_stations_gpd_scena = catchment_stations_gpd_scena[
    catchment_stations_gpd_scena.id.isin(dates)
]
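
The date matching used for each station above can be tried in isolation. The sketch below uses a made-up daily record and a hypothetical list of SCA dates; only the column names `id` and `snow_presence` follow the exercise data.

```python
import pandas as pd

# Toy daily station records (values are illustrative only)
station = pd.DataFrame({
    "id": pd.date_range("2018-02-01", periods=6, freq="D"),
    "snow_presence": [1, 1, 1, 0, 0, 0],
})
# Dates for which a satellite-derived value exists (hypothetical)
sca_dates = pd.to_datetime(["2018-02-01", "2018-02-06"]).tolist()

# Keep only the station rows whose date also appears in the SCA series
matched = station[station["id"].isin(sca_dates)]
print(matched)
```

Since the filter is identical for every station, it could also be applied in a loop over the station names instead of being written out once per station.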

### Combine in-situ measurements with SCA results at the stations 
The in situ measurements and the SCA are combined into one data set per station. This will be the basis for the validation.

In [None]:
smartino_snow = assign_site_snow(catchment_stations_gpd_smartino, snow_val_smartino)
rifiano_snow = assign_site_snow(catchment_stations_gpd_rifiano, snow_val_rifiano)
plata_snow = assign_site_snow(catchment_stations_gpd_plata, snow_val_plata)
sleonardo_snow = assign_site_snow(catchment_stations_gpd_sleonardo, snow_val_sleonardo)
scena_snow = assign_site_snow(catchment_stations_gpd_scena, snow_val_scena)   

Let's have a look at the SCA extracted at the station Plata Osservatore and its in-situ measurements.

In [None]:
catchment_stations_gpd_plata.sample(5)

Display snow presence threshold in in-situ data for Plata Osservatore

In [None]:
catchment_stations_gpd_plata.plot(x="id", y="HS_after_gapfill",rot=45,kind="line",marker='o')
plt.axhline(y = 0.4, color = "r", linestyle = "-")
plt.show()

## Validate the SCA results with the snow station measurements 
Now that we have combined the SCA results with the snow station measurements we can start the actual validation. A **confusion matrix** compares the classes of the station data to the classes of the SCA result. The numbers can be used to calculate the accuracy (correctly classified cases / all cases).

|             | no_snow | snow    |
|-------------|---------|---------|
| **no_snow** | correct | error   |
| **snow**    | error   | correct |

In [None]:
import seaborn as sns

In [None]:
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 6))

fig.suptitle("Error matrices for snow stations within our selected Catchment")
sns.heatmap(validation_metrics(smartino_snow)[1], annot=True, xticklabels=["No Snow", "Snow"], yticklabels=["No Snow", "Snow"], ax=ax1)
ax1.set_title("San Martino in Passiria Osservatore")
ax1.set(xlabel="Predicted label", ylabel="True label")


sns.heatmap(validation_metrics(rifiano_snow)[1], annot=True, xticklabels=["No Snow", "Snow"], yticklabels=["No Snow", "Snow"], ax=ax2)
ax2.set_title("Rifiano Beobachter")
ax2.set(xlabel="Predicted label", ylabel="True label")


sns.heatmap(validation_metrics(plata_snow)[1], annot=True, xticklabels=["No Snow", "Snow"], yticklabels=["No Snow", "Snow"], ax=ax3)
ax3.set_title("Plata Osservatore")
ax3.set(xlabel="Predicted label", ylabel="True label")


sns.heatmap(validation_metrics(scena_snow)[1], annot=True, xticklabels=["No Snow", "Snow"], yticklabels=["No Snow", "Snow"], ax=ax4)
ax4.set_title("Scena Osservatore")
ax4.set(xlabel="Predicted label", ylabel="True label")

fig.tight_layout()
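
The confusion matrix and the accuracy described at the start of this section can also be computed by hand. The sketch below uses invented labels, not the exercise results, and does not claim to reproduce the internals of `validation_metrics()`.

```python
import numpy as np

# Toy labels: 1 = snow, 0 = no snow (not the exercise results)
station_obs = np.array([1, 1, 0, 0, 1, 0, 1, 0])   # in-situ "truth"
sca_pred = np.array([1, 1, 0, 1, 1, 0, 1, 0])      # satellite-derived class

# Rows are the true class, columns the predicted class
confusion = np.zeros((2, 2), dtype=int)
for truth, pred in zip(station_obs, sca_pred):
    confusion[truth, pred] += 1

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print(f"accuracy: {accuracy:.3f}")
```

In this toy case the single off-diagonal entry sits in the "no snow" row and the "snow" column, which is a false positive: the satellite product reports snow where the station does not.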

The **accuracy** of the snow estimate from the satellite image computation for each station is shown below: 


| **On-site snow station**             | **Accuracy**|
|--------------------------------------|-------------|
| San Martino in Passiria Osservatore  | **100.00%** |
| Rifiano Beobachter                   | **100.00%** |
| Plata Osservatore                    |    92.3%   |
| San Leonardo in Passiria Osservatore |    NaN      |
| Scena Osservatore                    | **100.00%** |

The fourth station, **San Leonardo in Passiria Osservatore**, recorded only **_NaNs_** for snow depth on our selected dates, possibly because of malfunctioning on-site equipment. Hence, we are not able to validate the results at this station. Overall, the validation shows 100% accuracy for the stations **San Martino in Passiria Osservatore**, **Rifiano Beobachter** and **Scena Osservatore**, while false positives at **Plata Osservatore** reduce its accuracy to 92.3%. This shows a good match between the snow presence estimated from satellite data and the on-the-ground measurements.

## Compare to discharge data
In addition to computing metrics for validating the data, we also check the plausibility of our results. We compare our results with another measure with a known relationship. In this case, we compare the **snow cover area** time series with the **discharge** time-series at the main outlet of the catchment. We suspect that after snow melting starts, with a temporal lag, the runoff will increase. Let's see if this holds true.

Load the discharge data at Meran, the main outlet of the catchment. We have prepared this data set for you; it is extracted from Eurac's [Environmental Data Platform Alpine Drought Observatory Discharge Hydrological Datasets](https://edp-portal.eurac.edu/discovery/9e195271-02ae-40be-b3a7-525f57f53c80).

In [None]:
discharge_ds = pd.read_csv('32_data/ADO_DSC_ITH1_0025.csv', 
                           sep=',', index_col='Time', parse_dates=True)
discharge_ds.head()

Load the SCA time series we have generated in a previous exercise. It's the time series of the aggregated snow cover area percentage for the whole catchment. **Please note: you need to complete the 3.1 exercise before proceeding!**

In [None]:
snow_perc_df = pd.read_csv("./31_results/filtered_snow_fraction_pangeo.csv", 
                          sep=',', index_col='date', parse_dates=True)
snow_perc_df.head()

Let's plot the relationship between the snow covered area and the discharge in the catchment.

In [None]:
start_date = date(2018, 2, 1)
end_date = date(2018, 6, 30)
# filter discharge data to start and end dates
discharge_ds = discharge_ds.loc[start_date:end_date]

ax1 = discharge_ds.discharge_m3_s.plot(label='Discharge', xlabel='', ylabel='Discharge (m$^3$/s)')
ax2 = snow_perc_df["SCA"].plot(marker='o', secondary_y=True, label='SCA', xlabel='', ylabel='Snow cover area (%)')
ax1.legend(loc='center left', bbox_to_anchor=(0, 0.6))
ax2.legend(loc='center left', bbox_to_anchor=(0, 0.5))
plt.show()

The relationship looks as expected! Once the snow cover decreases the runoff increases!
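
One way to put a number on the temporal lag is a lagged correlation between the two series. The sketch below runs on synthetic data built with a known 5-day delay, so the series, the delay, and the helper `lagged_corr` are assumptions for illustration and say nothing about the real catchment.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2018-02-01", periods=60, freq="D")
t = np.arange(60)

# Synthetic snow cover (%) declining smoothly from about 90 to about 30
sca = pd.Series(60 - 30 * np.tanh((t - 30) / 8), index=days)
# Synthetic discharge that mirrors the snow decline 5 days later
discharge = (100 - sca).shift(5)

def lagged_corr(lag_days):
    """Correlate SCA with the discharge observed lag_days later."""
    return sca.corr(discharge.shift(-lag_days))

# The most negative correlation marks the lag at which the series mirror each other best
best_lag = min(range(0, 11), key=lagged_corr)
print(best_lag, round(lagged_corr(best_lag), 3))
```

With real data, the SCA series would first have to be brought onto the same daily time steps as the discharge, for example by interpolation, before such a comparison is meaningful.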