# TMS-OS (Tiered Multi-Sensor: Optical & SAR)

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

_section for landsat-8 data reading and visualization_

# 1. Exploring Sentinel-2 time-series

In [None]:
# read in sentinel-2 data
s2 = pd.read_csv("../data/s2-sirindhorn.csv", parse_dates=['date']).set_index('date')
s2.head()

In [None]:
# calculate cloud cover in % over the ROI
total_area = s2['water_area_raw (km2)'] + s2['non_water_area_raw (km2)'] + s2['cloud_area_raw (km2)']
s2['cloud_percentage'] = s2['cloud_area_raw (km2)'] * 100 / total_area
s2.head()

Plot the data

In [None]:
fig = make_subplots(
    rows=2,
    row_heights=[0.8, 0.2],
    shared_xaxes=True,
    vertical_spacing = 0.05
)

# Raw Surface Area
fig.add_trace(go.Scatter(
    x = s2.index,
    y = s2['water_area_raw (km2)'],
    name = 'Raw Reservoir Surface Area',
    mode = 'lines+markers'
), row=1, col=1)

# Cloud Corrected Surface Area
fig.add_trace(go.Scatter(
    x = s2.index,
    y = s2['water_area_cloud_corrected (km2)'],
    name = 'Cloud-Corrected Reservoir Surface Area',
    mode = 'lines+markers'
), row=1, col=1)

# Cloud
fig.add_trace(go.Bar(
    x = s2.index,
    y = s2['cloud_percentage'],
    name = 'Cloud Cover (%)',
    marker = dict(
        color = 'red',
    )
), row=2, col=1)

fig.update_layout(
    title=dict(                                         # title
        text='Reservoir Surface Areas - Sentinel-2',
        xanchor='center',
        x=0.5
    ),
    legend=dict(                                        # legend
        orientation = 'h',
        yanchor='top',
        y=-0.08,
        xanchor='right',
        x=1.0,
        bordercolor="grey",
        borderwidth=1
    ),
    margin=dict(l=20, r=20, t=60, b=20)                 # margins
)

fig

_There are a lot of `-1` values. What are those?_
- When the cloud cover exceeds 90%, the script returns `-1` as a fill value, so there is effectively no data point for that date.

Let's remove these values

In [None]:
# rows with >90% cloud cover carry the -1 fill value; set their area columns to np.nan (Not-a-Number)
s2.loc[s2['cloud_percentage']>90, ['water_area_raw (km2)', 'non_water_area_raw (km2)', 'cloud_area_raw (km2)', 'water_area_cloud_corrected (km2)']] = np.nan

# drop every row that contains an np.nan value
s2.dropna(inplace=True)

s2.head()
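
As a cross-check on the masking step, the fill value can also be targeted directly instead of going through the cloud percentage. The sketch below uses a small made-up frame (the numbers are illustrative, not taken from `s2-sirindhorn.csv`) and assumes the `-1` fill value shows up in the cloud-corrected column.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "water_area_raw (km2)": [210.0, 5.0, 198.0],
        "water_area_cloud_corrected (km2)": [215.0, -1.0, 205.0],
    },
    index=pd.to_datetime(["2020-01-01", "2020-01-06", "2020-01-11"]),
)

col = "water_area_cloud_corrected (km2)"

cleaned = toy.copy()
# Keep values that are not the -1 fill value; everything else becomes NaN
cleaned[col] = cleaned[col].where(cleaned[col] != -1)
# Drop only the rows that are missing in this column
cleaned = cleaned.dropna(subset=[col])

print(cleaned)
```

Passing `subset` to `dropna` limits the check to the one column, so rows with gaps in unrelated columns are kept.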

Plot the data again

In [None]:
fig = make_subplots(
    rows=2,
    row_heights=[0.8, 0.2],
    shared_xaxes=True,
    vertical_spacing = 0.05
)

# Raw Surface Area
fig.add_trace(go.Scatter(
    x = s2.index,
    y = s2['water_area_raw (km2)'],
    name = 'Raw Reservoir Surface Area',
    mode = 'lines+markers'
), row=1, col=1)

# Cloud Corrected Surface Area
fig.add_trace(go.Scatter(
    x = s2.index,
    y = s2['water_area_cloud_corrected (km2)'],
    name = 'Cloud-Corrected Reservoir Surface Area',
    mode = 'lines+markers'
), row=1, col=1)

# Cloud
fig.add_trace(go.Bar(
    x = s2.index,
    y = s2['cloud_percentage'],
    name = 'Cloud Cover (%)',
    marker = dict(
        color = 'red',
    )
), row=2, col=1)

fig.update_layout(
    title=dict(                                         # title
        text='Reservoir Surface Areas - Sentinel-2',
        xanchor='center',
        x=0.5
    ),
    legend=dict(                                        # legend
        orientation = 'h',
        yanchor='top',
        y=-0.08,
        xanchor='right',
        x=1.0,
        bordercolor="grey",
        borderwidth=1
    ),
    margin=dict(l=20, r=20, t=60, b=20)                 # margins
)

fig

Notice the sudden drops in surface area. They usually occur under high cloud cover, but can also happen on cloud-free days. A sudden drop (> 100 sq. km) followed by a rise of similar magnitude within 5-10 days does not represent the true behavior of the reservoir; these are artifacts, not actual signals.
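
To make the drop-then-rise pattern concrete, here is a minimal sketch that flags such points in a time series. It is only an illustration, not the correction used by TMS-OS: the function name `flag_sudden_drops`, its parameters, and the toy values are all hypothetical, with the defaults chosen to mirror the 100 sq. km and 10-day figures mentioned above.

```python
import pandas as pd

def flag_sudden_drops(area, threshold_km2=100.0, max_span_days=10):
    """Flag observations that sit more than `threshold_km2` below both
    neighbours, where the neighbours are at most `max_span_days` apart."""
    prev_area = area.shift(1)
    next_area = area.shift(-1)
    times = area.index.to_series()
    # Time between the previous and the next observation
    span = times.shift(-1) - times.shift(1)
    return (
        (prev_area - area > threshold_km2)
        & (next_area - area > threshold_km2)
        & (span <= pd.Timedelta(days=max_span_days))
    )

# Toy series, invented for illustration only
dates = pd.to_datetime(["2020-01-01", "2020-01-06", "2020-01-11", "2020-01-16"])
area = pd.Series([250.0, 120.0, 245.0, 240.0], index=dates)

flags = flag_sudden_drops(area)
print(area[flags])
```

The first and last observations are never flagged because they lack a neighbour on one side.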

These artifacts can arise from the automatic nature of the clustering algorithm ([Cascade simple K-Means clustering](https://developers.google.com/earth-engine/apidocs/ee-clusterer-wekacascadekmeans)). In K-Means clustering, the number of clusters ($K$) has to be specified before the clustering is performed. Hard-coding a value for $K$ can (1) be difficult, and (2) create additional issues, especially when several reservoirs are to be mapped at once (as in RAT-Mekong). Moreover, for a highly dynamic reservoir, where the area may change drastically between dry and wet seasons, the number of distinct features that can appear as a "cluster" may vary as well. For these reasons, it is recommended to choose the value of $K$ automatically, which, in this case, is done using [the Calinski-Harabasz criterion](https://www.tandfonline.com/doi/abs/10.1080/03610927408827101).
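
The Calinski-Harabasz criterion scores a clustering by the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom ($K-1$ and $n-K$); the candidate $K$ with the highest score is preferred. The sketch below computes that score with NumPy for hand-assigned labels on made-up 1-D values. It only illustrates the criterion and is not the Earth Engine clusterer; the function name `calinski_harabasz` and the toy data are assumptions.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz score for samples X (n x d) and cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall_mean = X.mean(axis=0)

    between = 0.0  # dispersion of cluster centroids around the overall mean
    within = 0.0   # dispersion of samples around their own centroid
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    return (between / (k - 1)) / (within / (n - k))

# Toy 1-D feature with two well-separated groups, invented for illustration
X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]])

score_k2 = calinski_harabasz(X, [0, 0, 0, 1, 1, 1])  # two clusters
score_k3 = calinski_harabasz(X, [0, 0, 0, 1, 1, 2])  # second group split in two

best_k = 2 if score_k2 > score_k3 else 3
print(score_k2, score_k3, best_k)
```

On this toy data the two-cluster labelling scores higher, so it would be selected over the labelling that splits one group.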

The limitation of this method, however, is that in challenging scenarios the Calinski-Harabasz criterion can sometimes choose a value of $K$ at which the cluster representing water pixels gets divided into multiple clusters. Such scenarios can arise from unmasked cloud cover, sediment-laden water, intermittent vegetation, and other artifacts of the processing done by the satellite data provider.

**These erroneous values get corrected in the subsequent steps of TMS-OS.**