# Waterbody clustering

This notebook investigates the clustering of waterbodies based on their surface area time series and other features.

## Setup

In [111]:
%config IPython.use_jedi = False

### Load modules

In [1]:
%matplotlib widget

from pathlib import Path

import fiona
import numpy as np
import matplotlib.cm
import matplotlib.colors  # used for BoundaryNorm in the t-SNE plot
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import scipy.spatial.distance
import scipy.ndimage
import sklearn.cluster
import sklearn.decomposition
from tqdm.notebook import tqdm

from fastdtw import fastdtw

### Load data

In [2]:
waterbody_shp_path = Path('/g/data/r78/cek156/dea-notebooks/Scientific_workflows/DEAWaterbodies/AusAllTime01-005HybridWaterbodies/AusWaterBodiesFINAL.shp')
waterbody_csv_path = Path('/g/data/r78/cek156/dea-notebooks/Scientific_workflows/DEAWaterbodies/timeseries_aus_uid/')
surface_area_threshold = 50

In [3]:
waterbody_shapes = gpd.read_file(waterbody_shp_path)

In [4]:
waterbody_shapes.iloc[0]

area                                                     11875
perimeter                                              549.791
UID                                                  q9cusmx7n
FID                                                          0
geometry     POLYGON ((-1538525.000000001 -3849499.99999999...
Name: 0, dtype: object

Choose an area of interest to focus on.

In [5]:
bbox = gpd.GeoDataFrame(geometry=gpd.points_from_xy((142.1246, 149.1300), (-37.0161, -34.2801)))  # Mildura -> Canberra, Seymour -> Griffith

In [6]:
bbox.crs = 'EPSG:4326'

In [7]:
x_min, y_min, x_max, y_max = bbox.to_crs('EPSG:3577').total_bounds

In [8]:
waterbody_shapes_ = waterbody_shapes.cx[x_min:x_max, y_min:y_max]

In [9]:
print(len(waterbody_shapes), 'waterbodies total')
print(len(waterbody_shapes_), 'in Mildura/Seymour/Canberra/Griffith area')

295906 waterbodies total
12535 in Mildura/Seymour/Canberra/Griffith area


In [10]:
waterbody_shapes = waterbody_shapes_

Join these with the BOM river regions. I grabbed these from the v2.1.1 Geofabric Reporting Regions and converted them from gdb + WGS84 to GeoJSON + Australian Albers in QGIS.

In [11]:
riverregions = gpd.read_file('bom_riverregions_v2p1p1.geojson')

In [12]:
waterbody_shapes = gpd.sjoin(waterbody_shapes, riverregions, how='left', op='within')

In [13]:
all_time_series = []
for i, shape in tqdm(waterbody_shapes.iterrows(), total=len(waterbody_shapes)):
    uid = shape.UID
    csv_path = waterbody_csv_path / uid[:4] / f'{uid}.csv'
    time_series = pd.read_csv(csv_path)
    # Relabel the third column to something consistent, and rename all columns to something
    # easier to access.
    time_series.rename(columns={
        'Observation Date': 'date',
        'Wet pixel percentage': 'pc_wet',
        time_series.columns[2]: 'px_wet',
        }, inplace=True)
    # Convert time strings into datetimes.
    time_series.date = pd.to_datetime(time_series.date)
    # Store the actual number of pixels too.
    n_pixels = shape.geometry.area // (25 ** 2)
    time_series.attrs['px_tot'] = n_pixels  # attrs is experimental.
    all_time_series.append(time_series)

In [14]:
waterbodies = waterbody_shapes.set_index('UID')

In [15]:
ax = plt.figure(figsize=(10, 5)).add_subplot(1, 1, 1)
waterbodies.plot(ax=ax, column='RivRegNum', cmap='rainbow')

<matplotlib.axes._subplots.AxesSubplot at 0x7f9c172aae48>

In [16]:
assert len(all_time_series) == len(waterbody_shapes)

It would be useful to remove entries with NaN water levels (presumably cloud or similar).

In [17]:
all_time_series_ = []
for t in tqdm(all_time_series):
    nans = t.px_wet.isnull()
    t = t[~nans].reset_index(drop=True)
    all_time_series_.append(t)

In [18]:
all_time_series = all_time_series_

In [19]:
waterbodies['water_history'] = all_time_series

## Focusing the dataset

I think that the rivers are throwing a spanner in the works a bit, and while the big lakes take up a *lot* of area we don't really care about them. We want to see dams, small lakes, and ponds! Let's use the Surface Hydrology Network to remove rivers. Claire has previously used this to remove major rivers but this led to inconsistent results where some large lakes were removed because they were part of the water network. However, in this case I don't actually care about those either: if they are rivers then they are gone, and lakes like Lake Hume should go too. We can always add them back in later (e.g. using an area threshold).

In [20]:
fiona.listlayers('SurfaceHydrologyLinesNational.gdb')

['HydroLines']

In [21]:
lines = gpd.read_file('SurfaceHydrologyLinesNational.gdb', layer='HydroLines')

In [22]:
lines = lines.to_crs('EPSG:3577')

In [23]:
lines = lines.cx[x_min:x_max, y_min:y_max]

In [26]:
watercourses = lines['FEATURETYPE'] == 'Watercourse'

If we strip everything that intersects with a watercourse, how much of our data does that remove?

In [33]:
joined = gpd.sjoin(waterbodies.drop(columns='index_right'), lines[watercourses], how='inner', op='intersects')

In [45]:
print('{:.02%} of waterbodies intersect watercourses'.format(joined.index.unique().shape[0] / waterbodies.shape[0]))

27.55% of waterbodies intersect watercourses


That's a lot! But it's also consistent with the idea that the non-river waterbodies are outliers.

In [41]:
joined.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b7e730f28>

It does a pretty good job of pulling out rivers (and lakes that are made from dammed rivers). What's left?

In [46]:
yes_river = waterbodies.index.isin(joined.index)

In [47]:
yes_river.mean()

0.2755484642999601

In [48]:
waterbodies_not_river = waterbodies[~yes_river]

In [49]:
waterbodies_not_river.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b7d6349e8>

Lots of remaining waterbodies resemble rivers, but I'm fairly sure that these are billabongs and similar, which are particularly prevalent along the Murray River.

In [50]:
waterbodies_including_rivers = waterbodies[yes_river]
waterbodies = waterbodies[~yes_river]

In [77]:
waterbodies = waterbodies.reset_index().set_index('UID')
waterbodies_including_rivers = waterbodies_including_rivers.reset_index().set_index('UID')

In [98]:
waterbodies.drop(columns='water_history').to_file('waterbodies_murray_norivers.geojson', driver='GeoJSON')

In [99]:
waterbodies_including_rivers.drop(columns='water_history').to_file('waterbodies_murray_onlyrivers.geojson', driver='GeoJSON')

## Distances and clustering

We need to define some kind of distance between two water level time series (henceforth "water histories"). These have different x values and lengths. A dilemma! One option is to interpolate so everything is the same length. Another is to use a distance function that doesn't require the same x values. The former is simpler and lets us use all our favourite vector distances (e.g. cosine, Euclidean, Pearson correlation...), but it requires assumptions about how water behaves between observations. It also requires resampling the data to the same time steps, which will at minimum greatly increase the memory usage. The latter runs the risk of being slower. One candidate is dynamic time warping (DTW) distance, but this requires a quadratic-time dynamic programme for each pair and can be pretty slow as a result, especially when there are many data points in each time series.
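To make the quadratic cost concrete, here is a minimal sketch of plain DTW on two sequences of different lengths. This is the exact textbook recurrence rather than the approximation provided by the `fastdtw` package imported above, and the function name `dtw_distance` and the two example histories are made up for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences.

    Fills a (len(a) + 1) x (len(b) + 1) cost table, so both time and
    memory grow with the product of the two sequence lengths.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[len(a), len(b)]


# Hypothetical water histories: same fill-and-drain shape, different lengths.
short_history = [0.0, 50.0, 100.0, 50.0, 0.0]
long_history = [0.0, 0.0, 50.0, 100.0, 100.0, 50.0, 0.0]
dtw_short_long = dtw_distance(short_history, long_history)
```

Because the longer history only repeats values of the shorter one, the warping path can match them exactly and the distance comes out as zero, even though the two sequences cannot be compared element by element.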

Let's start by interpolating to a common grid. How many elements should that grid have?

In [51]:
dates = set()
for history in tqdm(waterbodies.water_history):
    dates |= set(history.date.dt.round('1d').values.astype('datetime64[D]'))

In [52]:
print('average number of observations per waterbody:', waterbodies.water_history.map(lambda a: len(a)).mean())

average number of observations per waterbody: 666.9769849135557


In [53]:
print('unique dates:', len(dates))

unique dates: 5253


In [54]:
min(dates), max(dates)

(numpy.datetime64('1986-08-18'), numpy.datetime64('2020-07-19'))

For each water history we can add in all the dates between the first and most recent observation in the dataset.

In [55]:
dates = np.arange(min(dates), max(dates), 1)

In [56]:
len(dates)

12389
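One thing worth keeping in mind here: `np.arange` excludes its stop value, so a grid built this way ends the day before the most recent observation. The sketch below uses the two dates printed above to show the difference; the variable names are only for illustration.

```python
import numpy as np

first = np.datetime64('1986-08-18')
last = np.datetime64('2020-07-19')
one_day = np.timedelta64(1, 'D')

# Half-open interval [first, last): the final day is left out.
half_open = np.arange(first, last, one_day)

# Closed interval [first, last]: push the stop value one day later.
closed = np.arange(first, last + one_day, one_day)

print(len(half_open), half_open[-1])
print(len(closed), closed[-1])
```

Dropping a single day out of more than twelve thousand is unlikely to matter for clustering, but it does mean any observation on the last date is not represented in the grid.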

In [57]:
# First round every date and set date to be the index.
# Note that we also have to drop the timezone, which pandas assumes is UTC.
# If pandas did not assume it was UTC - maybe it assumed UTC+11 for example - then this would also do
# a conversion into UTC, which is probably not what we want.
for history in tqdm(waterbodies.water_history):
    history.date = history.date.dt.round('1d')
    history.set_index('date', drop=True, inplace=True)
    history.index = history.index.tz_convert(None)

In [58]:
dt_index = pd.DatetimeIndex(dates)

In [59]:
histories = []  # Storing reindexed dataframes back directly in waterbodies leads to some super bizarre behaviour where they are replaced entirely by nans.
# So, storing them in a list instead.
for i in tqdm(range(len(waterbodies))):
    # Merge duplicate dates into one.
    history = waterbodies.water_history.iloc[i].groupby('date').mean()
    # Then reindex with the full list of dates.
    histories.append(history.reindex(dt_index))

In [78]:
waterbodies.water_history = histories

With all the water histories now sharing the same time index, they are all aligned. We now need to handle the lack of measurements at some times, and we will do this by linear interpolation, as it makes the fewest assumptions of the options available (besides holding the last observed value, which feels unphysical).
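As a toy illustration of the reindex-then-interpolate step, here is a sketch with a made-up history of three observations placed on a seven-day grid. The values and dates are hypothetical; only `reindex` and `interpolate(limit_direction='both')` mirror what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical sparse history: three observations inside a seven-day window.
observed = pd.Series(
    [10.0, 40.0, 20.0],
    index=pd.to_datetime(['2020-01-02', '2020-01-05', '2020-01-06']),
)
grid = pd.date_range('2020-01-01', '2020-01-07', freq='D')

# Reindexing inserts NaN for every grid date without an observation.
aligned = observed.reindex(grid)

# Linear interpolation fills interior gaps; limit_direction='both' also
# fills the leading and trailing gaps, using the nearest observed value.
filled = aligned.interpolate(limit_direction='both')
print(filled)
```

Note that the days before the first observation and after the last are held at the nearest observed value rather than extrapolated along a line.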

In [61]:
for history in tqdm(waterbodies.water_history):
    history.interpolate(limit_direction='both', inplace=True)

Now everything is aligned! Put everything into a matrix, treating every time observation as an independent feature.
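Once each water history is a fixed-length row, ordinary vector distances apply directly. The sketch below compares two made-up rows, one of which is a scaled copy of the other, to show how the choice of distance changes what counts as "similar". The helper `cosine_distance` and the example values are hypothetical.

```python
import numpy as np


def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Two hypothetical histories on the same grid: same shape, half the amplitude.
history_a = np.array([0.0, 50.0, 100.0, 50.0])
history_b = np.array([0.0, 25.0, 50.0, 25.0])

euclidean = np.linalg.norm(history_a - history_b)
cosine = cosine_distance(history_a, history_b)
print(euclidean, cosine)
```

Euclidean distance treats these two rows as far apart because the amplitudes differ, while cosine distance treats them as essentially identical because only the shape matters.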

In [62]:
history_matrix = np.zeros((len(waterbodies), len(dt_index)))

In [63]:
for i, history in tqdm(enumerate(waterbodies.water_history)):
    history_matrix[i] = history.pc_wet

In [64]:
history_matrix = np.nan_to_num(history_matrix)

In [65]:
dt_index.max() - dt_index.min()

Timedelta('12388 days 00:00:00')

Finally, let's downsample this because we really don't need 12,000+ time entries. We have about 12,400 days of data, which is about 1,770 weeks, so let's downsample by a factor of 7.
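For comparison, a simpler way to think about weekly downsampling is to average each block of seven consecutive days. This is a different technique from the spline-based `scipy.ndimage.zoom` call used in this notebook, shown only as a sketch; the function name `weekly_block_mean` and the toy matrix are made up.

```python
import numpy as np


def weekly_block_mean(matrix, block=7):
    """Average consecutive groups of `block` columns, dropping any remainder."""
    n_rows, n_cols = matrix.shape
    n_blocks = n_cols // block
    trimmed = matrix[:, :n_blocks * block]
    return trimmed.reshape(n_rows, n_blocks, block).mean(axis=2)


# Toy matrix with the same number of daily columns as the grid above.
daily = np.arange(2 * 12389, dtype=float).reshape(2, 12389)
weekly = weekly_block_mean(daily)
print(daily.shape, '->', weekly.shape)
```

Block averaging discards the incomplete final week, so its column count may differ slightly from that of the zoomed matrix.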

In [66]:
history_matrix_original = history_matrix
history_matrix_zoomed = scipy.ndimage.zoom(history_matrix, (1, 1 / 7))

In [67]:
dt_zoomed = scipy.ndimage.zoom(dt_index.values.astype('datetime64[D]').astype(int), 1 / 7).astype('datetime64[D]')

In [68]:
history_df_original = gpd.GeoDataFrame(history_matrix_original, columns=dt_index, index=waterbodies.index, geometry=waterbodies.geometry)

In [69]:
history_df_zoomed = gpd.GeoDataFrame(history_matrix_zoomed, columns=dt_zoomed, index=waterbodies.index, geometry=waterbodies.geometry)

In [70]:
np.save('time_axis_murray_zoomed_norivers.npy', dt_zoomed)
np.save('time_axis_murray_full_norivers.npy', dt_index)
np.save('history_murray_full_norivers.npy', history_matrix_original)
np.save('history_murray_zoomed_norivers.npy', history_matrix_zoomed)

When exploring a dataset, it's always good to start with PCA! PCA subtracts the mean before finding any components, and the mean is worth looking at regardless:

In [72]:
plt.figure()
mean = np.mean(history_matrix, axis=0)
std = np.std(history_matrix, axis=0)
plt.plot(dt_index, mean, c='black')
plt.fill_between(dt_index, mean - std, mean + std, color='black', alpha=0.2)
# for d in dt_index[dt_index.month == 1]:
#     plt.axvline(d, alpha=0.01, c='black')

def plot_la_nina_el_nino():
    for la_nina_from, la_nina_to in [('2010-04', '2012-03'), ('2008-08', '2009-04'), ('2007-06', '2008-02'), ('1998-05', '2001-03'), ('1988-04', '1989-07')]:
        plt.axvspan(np.datetime64(la_nina_from), np.datetime64(la_nina_to), color='blue', alpha=0.2)
    for el_nino_from, el_nino_to in [('2015-04', '2016-04'), ('2009-05', '2010-03'), ('2006-05', '2007-01'), ('2002-03', '2003-01'), ('1997-04', '1998-03'),
                                     ('1994-03', '1995-01'), ('1993-04', '1994-02'), ('1991-03', '1991-11'), ('1987-05', '1988-03')]:
        plt.axvspan(np.datetime64(el_nino_from), np.datetime64(el_nino_to), color='red', alpha=0.2)
plot_la_nina_el_nino()
    
plt.xlabel('Date')
plt.ylabel('Mean percentage of maximum extent')
# plt.xlim(np.datetime64('2008-01'), np.datetime64('2012-01'))

Text(0, 0.5, 'Mean percentage of maximum extent')

The commented-out lines in the cell above highlight the dates in January in the dataset: these have lower water levels on average, which makes sense for the middle of summer in NSW and Victoria. La Niña and El Niño events are shaded in blue and red respectively. They are weakly correlated with large increases and decreases in mean water extent. In particular, the very strong 2010-2012 La Niña corresponds with a particularly large increase in average water extent.

Next we'll do PCA.
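As a reminder of what the projection involves, here is a minimal NumPy sketch: centre each column, take a singular value decomposition, and keep the leading components. It is not the scikit-learn implementation, the signs of the components may differ from it, and `pca_project` and the random toy data are hypothetical.

```python
import numpy as np


def pca_project(matrix, n_components):
    """Project rows onto the leading principal components via SVD."""
    # Centre each feature (column) so components describe variation about the mean.
    centred = matrix - matrix.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    # Fraction of total variance captured by each retained component.
    explained = (s ** 2) / (s ** 2).sum()
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 5)) + 100.0  # large offset to show centring matters
scores, explained = pca_project(toy, 2)
print(scores.shape, explained)
```

Because the data are centred first, the large constant offset in the toy matrix has no effect on the components or on the scores.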

In [73]:
pca = sklearn.decomposition.PCA(n_components=50)
pca_f = pca.fit_transform(history_matrix)

In [79]:
waterbodies.loc[waterbodies.RivRegNum.isnull(), 'RivRegNum'] = -1

In [80]:
plt.figure()
plt.scatter(pca_f[:, 0], pca_f[:, 1], s=2, edgecolor='None', c=waterbodies.RivRegNum.astype(int), cmap='rainbow')

<matplotlib.collections.PathCollection at 0x7f9b83f814e0>

There is no obvious structure by river region in the first two principal components. Let's try t-SNE.

In [121]:
import sklearn.manifold

tsne = sklearn.manifold.TSNE(verbose=True, perplexity=50)

tsne_f = tsne.fit_transform(pca_f)

In [107]:
names = dict(zip(waterbodies.RivRegNum.astype(int), waterbodies.RivRegName))

In [None]:
plt.figure(figsize=(8, 8))
xs = np.arange(min(names), max(names))
plt.scatter(tsne_f[:, 0], tsne_f[:, 1], s=(waterbodies.area / 0.5e3) ** 0.5,
            edgecolor='None', c=waterbodies.RivRegNum.astype(int), cmap='tab20', norm=matplotlib.colors.BoundaryNorm(xs, len(xs) + 1))
cb = plt.colorbar()
cb.set_ticks(xs + 0.5)
cb.set_ticklabels([names.get(i, '') for i in xs])

This does have some obvious substructure, particularly when we colour it by river region, which is a proxy for position. Clustering it, however, produces mostly useless clusters.

The data have been exported already, so we are good to try and cluster in other notebooks.