Initial CT-LSTM evaluation #55

AndyMcAliley · 2022-09-16T23:49:10Z

AndyMcAliley
Sep 16, 2022

Evaluate initial CT-LSTM

Executive summary (tl;dr)

The overal RMSE of the model for all observations in the validation set is 2.89 °C.
- The mean RMSE by lake is 2.55 °C.
- The median RMSE by lake is 2.14 °C.
RMSE varies with depth (higher RMSE deeper), day of year (higher RMSE in summer), and lake surface area (higher RMSE smaller).
In winter, predicted temperatures don't correlate strongly to observed temperatures.
Most training data are at shallow depths, in summer, in medium-sized (~100 hectare) lakes, which likely leads the model to deemphasize prediction accuracy at greater depths, during colder months, and in small or large lakes.

Here we evaluate the results of the initial CT-LSTM trained on data from the lake-temperature-model-prep pipeline. "CT" indicates concatenating static and dynamic features at every time step to form inputs for a standard LSTM, as opposed to using the modified architecture of an entity-aware LSTM (see Li et al., 2022). "Initial" means that we haven't trained many models, sweeping through hyperparameters to choose the optimal ones. We have done a bit of hyperparameter testing, and these results are for the best performing LSTM so far.

First, let's look at some performance metrics evaluated over the validation set. We applied a 60%-20%-20% train-validation-test split, splitting by lake such that the data in the training set are from different lakes than the data in the validation set or the test set. The test data will not be used for model selection, only for final evaluation.

The overal RMSE of the model for all observations in the validation set is 2.89 °C. The mean RMSE by lake is 2.55 °C, and the median RMSE by lake is 2.14 °C.

The Snakemake pipeline produces plots of RMSE and bias (as observed minus predicted) in different ways:

By lake

Here's RMSE by lake for every lake in the validation set:

To my eye, there's not much spatial coherence in RMSE. RMSE might increase a bit as you go south. Lake temperatures vary less during cold months, so fewer cold months results in more data variance and probably higher RMSE. In other words, higher RMSE as you go south could say more about the variance in the data than it says about the model.

Here's bias (observed - predicted) by lake:

Blue dots are lakes where the predictions are colder than observed temperatures on average, and red dots mean that predictions are warmer than observations. Again, I don't see notable spatial patterns in the bias.

Why I include the RMSE of the mean of the training data

I'm finding that including the RMSE of the mean of the training data helps to explain many trends in these RMSE and bias plots. Often, the trends have more to do with the variance in the data than they do with the model, and these comparisons help tease out the difference. They're in the spirit of streamflow's Nash-Sutcliffe Efficiency in that they help to take data variance into account during model evaluation.

Here's how I read these comparisons of the LSTM's RMSE and training mean RMSE. Training mean RMSE represents a very simple "model" - just take the mean of the training data and use it for your predictions. So, when the LSTM RMSE is smaller than the training mean RMSE, then the LSTM is outperforming that very simple model. When the model and the training mean RMSEs are similar, then the model really isn't doing any better than a very simple model.

One caveat: the means of the training data are taken without regard to observation depth. So, the gap between the LSTM RMSE and the training mean RMSE may be larger when the lake is stratified.

By day-of-year

Here's RMSE by day-of-year, compared with the RMSE that results from using the mean of the training data by day-of-year.

LSTM RMSE is highest during the summer, peaking around 3 °C. RMSE falls off during autumn and stays low during winter.

The LSTM RMSE is relatively low during days 1 to 31 (aka January). However, it's not much different than the training mean RMSE. Therefore, I conclude that the LSTM's low RMSE during January is an artifact of the distribution of temperatures in the data. It's not an indication that the model is especially performant at that time of year. This makes sense, too, because lake temperatures are basically close to zero during January and therefore have low variance.

The training mean RMSE also provides context for the spikes in RMSE around days 0, 60, and 75. We don't know why exactly the spikes occur from this plot alone, but it looks like they are more a result of the data than aberrant LSTM behavior.

Now, bias (observed - predicted) by day-of-year.

For most seasons, the LSTM's bias is near zero.
I see two notable exceptions:

During days 300 to 350 (November and December), lakes are colder than the LSTM predicts. It coincides with a similar bias in the training mean RMSE during part of the fall. However, we'd still hope to see that LSTM corrects for that bias, and I'm not sure why it doesn't.
At year start, lakes are warmer than the LSTM predicts. The spike at the beginning of the year is strange, and might be an indication of incorrect dates (just hypothesizing: observations with no date default to Jan 1, so some summer observations are mixed in?). I think this warrants a closer look.

For reference, here's how many observations there are in the training set, by day-of-year.

Most observations are during the spring and summer. That will encourage the model to perform best during those times of year.

Here's a more in-depth (ha ha) breakdown: observations in the training set by both day-of-year and lake depth.

Note the logarithmic scale for the colors (yellow is ten times greater than bright green). It's clear that not only are observations less common in winter, but that observations below 40 m are almost never taken during the winter. One compelling reason to pretrain using process model outputs is to train the model on a dataset that offers more temperatures during colder months and at depth.

By area

Here's the RMSE by lake surface area.

We can see that RMSE is high for lakes smaller than 100,000 square meters (10 hectares).

Here's the bias by area.

Observed lake temperatures are colder than predicted in small lakes, and warmer than predicted in large lakes.

Here's the prevalence of training data by lake area.

Many observations are in lakes of roughly 1,000,000 square meters (100 hectares). It's no surprise, then, that $10^6$ square meters is the area with the lowest RMSE and the bias closest to zero. Note that there are many observations in lakes of $10^9$ square meters, but those are all in three (well-observed) lakes: Lake Sakakawea, Lake Michigan, and Lake Champlain.

By depth

Here's RMSE by lake depth.

Okay, that huge spike around 90 meters is an attention grabber, but that corresponds to a tiny fraction of the total observations (see histogram below). There's a corresponding spike in the RMSE of the training data mean, so whatever is going on there is due to something in the data - it's not that the LSTM is predicting something wild. More importantly, what I see is that the LSTM model performs best near the surface and performance diminishes until 10 m depth, after which point it only slightly outperforms the mean of the training data. That's not too surprising because the model is trained on observations that are mostly shallow (again, see histogram below). Also, shallow lake temperatures are more directly driven by air temperatures and are easier to predict.

Here's the bias by depth.

Ignoring the 90 m spike, the model's predictions aren't as warm as they should be in the depth range of 10 to 20 m.

Here's the prevalence of observations by depth.

From this plot it's clear that the training set is dominated by observations in the top 10 m of the lake, and that there are very few observations at 90 m depth.

For good measure, here's that breakdown by both day-of-year and lake depth again.

By elevation

Here's RMSE by lake elevation above sea level.

Looks like the RMSE is fairly constant among lake elevations.

And here's bias by elevation.

Apart from a bit of noise at very low and very high elevations, the bias doesn't seem to correlate much to lake elevation. There aren't many lakes at those extremal elevations, especially in the validation set (see second histogram below), so I wouldn't put too much stock into the behaviors of metrics at elevations below 100 m or above 600 m.

Here's the number of observations in the training set by elevation.

And here's the number of observations in the validation set.

There are very few observations in the validation set below 100 m or above 600 m, so the metrics at those elevations are evaluated on a small number of residuals.

By season

Now, let's take a closer look at model predictions during the summer and winter. These plots aren't made automatically by the pipeline, so they'll require custom code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  
import sys
sys.path.insert(1, '5_visualize/src')
from plot_metrics import load_predictions, rms

# Plot settings
sns.set(rc={'figure.figsize':(11,8)})

# Choose a model to plot
data_source = 'model_prep'
run_id = 'initial'
model_id = 'local_a'
dataset = 'valid'

predictions_filepath = f"4_evaluate/out/{data_source}/{run_id}/{model_id}/interpolated_predictions_{dataset}.csv"
predictions_train_filepath = f"4_evaluate/out/{data_source}/{run_id}/{model_id}/interpolated_predictions_train.csv"

# Load predictions
predictions_valid_filepath = '4_evaluate/out/model_prep/initial/local_a/interpolated_predictions_valid.csv'
pv = load_predictions(predictions_filepath, doy_bin_width=5)

Predicted vs actual scatter plots

There are too many observations for one scatterplot. Let's plot a random sample of them instead.

import random

def scatter_random(df, n_samples, x='temp', y='predicted_temperature_obs_depth', alpha=0.05):
    plot_idx = random.sample(range(df.shape[0]), n_samples)
    df_plot = df.iloc[plot_idx]
    min_val = df_plot.loc[:, [x, y]].min().min()
    max_val = df_plot.loc[:, [x, y]].max().max()
    sns.scatterplot(df.iloc[plot_idx], x=x, y=y, alpha=alpha)
    plt.plot([min_val, max_val], [min_val, max_val], color='black')
    plt.show()

# Of the 498,892 samples in the validation set, plot 10,000 of them
scatter_random(pv, 10000)

On these plots, observed temperatures are on the x axis and temperatures predicted by the LSTM are on the y axis. Two things stand out to me. First, predictions look closest to observations at high temperatures (>25 °C), and furthest at intermediate temperatures (~15 °C). Second, temperatures below 4 °C are rarely confused with temperatures above 4 °C. This is probably a result of the distribution of the data: water is most dense at 4 °C, so either the temperatures in a lake are all less than 4 or greater than 4 at any given time.

Plot by time period

We can focus in on predictions made during summer and winter only.

pv_summer = pv.query("doy >= 172 and doy < 265")
pv_winter = pv.query("doy >= 355 or doy < 80")

# Plot 10,000 summer samples
scatter_random(pv_summer, 10000, alpha=0.1)

During summer, the hottest temperatures, which probably occur at shallow depths are predicted more accurately than intermediate temperatures.

# Plot 10,000 winter samples
scatter_random(pv_winter, 10000, alpha=0.1)

During winter, temperatures tend to stay between 0 and 5 degrees. Predictions also tend to be between 0 and 5 degrees, although the predicted temperatures don't agree well with observed temperatures within that range. Instead, it looks like predicted temperatures cluster around 3 °C when observed temperature are between 0 and 5 °C. This behavior could be a result of the relative lack of observations during winter.

Plot RMSE by depth and by season

We can examine how RMSE varies with depth during summer and during winter.

from plot_metrics import rms

summer_depth_rmse = (
    pv_summer.groupby('interpolated_depth')['residual']
    .aggregate(rms)
    .rename(f'LSTM_rmse')
)
sns.lineplot(summer_depth_rmse)
plt.xlim(0,40)
plt.ylim(0,5.5)
plt.show()

In the summer, RMSE is about 1.5 °C in the top 2 meters, then climbs quickly to a maximum above 5 °C between 5 and 10 m.

winter_depth_rmse = (
    pv_winter.groupby('interpolated_depth')['residual']
    .aggregate(rms)
    .rename(f'LSTM_rmse')
)
sns.lineplot(winter_depth_rmse)
plt.ylim(0,5.5)
plt.show()

There's a big difference from summer RMSE! Winter RMSE is not nearly as dependent on depth, and the highest RMSE values tend to be at shallower depths.

Surface RMSE by day-of-year

How does the RMSE for temperatures in the top 2 m of lakes vary over the year?

pv_surface = pv.query("depth < 2")

surface_doy_rmse = (
    pv_surface.groupby('doy_bin')['residual']
    .aggregate(rms)
    .rename(f'LSTM_rmse')
)

sns.lineplot(surface_doy_rmse)
plt.show()

Looks like surface temperature RMSE is highest during March, April and May, with another peak in November and December. Those times seem to coincide with lake melt and freeze dates.

Plot RMSE by DOY at bottom

# TODO once maximum depth is added to the lake metadata

jdiaz4302 · 2022-09-20T13:58:18Z

jdiaz4302
Sep 20, 2022

Very extensive/impressive write-up, Andy!

Contextualizing performance metrics

Could you provide context on how some of these error metrics compare to some target and/or existing process-based predictions? E.g., In standup yesterday, I recall Julie giving below 2C as a goal for her work - is that similar here? I would assume this task is harder / less constrained because it's dealing with lots of varying depths, but I'm not sure. Likewise, similar to the training set mean, it'd be nice to know what our more competitive baseline is.

Regarding deep predictions and winter predictions (or overall prediction scope)

With known properties such as predictions correlating less with observations in the winter and most observations being at shallow, summery waters, do you anticipate any changes in scope or is the hope to continue tackling prediction all predictions at all depths? I believe you said you would be depth as an input, so I suppose that can be expected to help quite a bit.

Future improvements to plots

More of a note/to-remember - some of the important takeaways are hidden in a small portion of the plot, so when it comes to a more final product time it would be good to highlight those better (example attached) whether that be through plotting primarily the difference between candidate models and baselines or including the histogram as an inset plot.

Comments regarding specific points/quotes

"CT" indicates concatenating static and dynamic features at every time step to form inputs for a standard LSTM, as opposed to using the modified architecture of an entity-aware LSTM

Oh interesting, I assumed this was the EA LSTM. I look forward to that comparison.

I don't see notable spatial patterns in the bias (and RMSE)

Agreed. Definitely more homogenous with noise rather than any widespread systematic pattern

higher RMSE as you go south

Would it be pretty easy to whip up a performance vs latitude plot and provide something like the Spearman or Pearson correlation form scipy?

During days 300 to 350 (November and December), lakes are colder than the LSTM predicts. It coincides with a similar bias in the training mean RMSE during part of the fall. However, we'd still hope to see that LSTM corrects for that bias, and I'm not sure why it doesn't.

Do you think there is some missing covariate that could explain this or is it possible that observations are less reliable, more noisy, or more dynamic (dynamic doesn't seem likely) in winter?

At year start, lakes are warmer than the LSTM predicts. The spike at the beginning of the year is strange, and might be an indication of incorrect dates (just hypothesizing: observations with no date default to Jan 1, so some summer observations are mixed in?). I think this warrants a closer look.

Maybe it's worth removing observations with no date from this step of the analysis since they can't speak for the (lack of) trend that we're interested in.

One compelling reason to pretrain using process model outputs is to train the model on a dataset that offers more temperatures during colder months and at depth.

Another thing I was wondering is if it would be worthwhile to use them to fill in the gaps for the final tuning on observations; they could be weighted very small to maintain emphasis on real observations, but it might help maintain realistic predictions in an otherwise less constrained training scenario - this is assuming that sparse training on the observations leads to unrealistic predictions in the unobserved areas though (perhaps the pretraining would be sufficient defense).

First, predictions look closest to observations at high temperatures (>25 °C)

It may be worth quantifying error and bias in that bin of the data. I feel like that is pretty uncommon to perform better on extremes, and that could actually be pretty desirable for some use cases; I would assume it, again, has something to do with the abundance of shallow, summery observations.

Second, temperatures below 4 °C are rarely confused with temperatures above 4 °C.

To highlight this, I almost want to see that plot in the style of a topographic map for data density to show two islands.

The thin, box-like silloute here is a little wild, but I am interested to see if including depth and trying out the EA LSTM approach helps resolve this.

During winter, temperatures tend to stay between 0 and 5 degrees. Predictions also tend to be between 0 and 5 degrees, although the predicted temperatures don't agree well with observed temperatures within that range. Instead, it looks like predicted temperatures cluster around 3 °C when observed temperature are between 0 and 5 °C. This behavior could be a result of the relative lack of observations during winter.

Yeah, this essentially resembles a regression to the mean when the model hasn't really learned this scenario. It is interesting that a good fraction of warm (deep?) observations are modeled well during winter though (the thinner cloud of points around the 1:1 line)

In the summer, RMSE is about 1.5 °C in the top 2 meters, then climbs quickly to a maximum above 5 °C between 5 and 10 m.

This is a really cool plot when contextualized with the depth histogram

Winter RMSE is not nearly as dependent on depth, and the highest RMSE values tend to be at shallower depths.

That's interesting too because I feel like deeper depths are likely the winter observations with higher temperatures (feel free to fact check) so they have the potential for higher residuals in the same way that summer observations do (relative to winter observations)

Looks like surface temperature RMSE is highest during March, April and May, with another peak in November and December. Those times seem to coincide with lake melt and freeze dates.

😮

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initial CT-LSTM evaluation #55

{{title}}

Predicted vs actual scatter plots

Plot by time period

Plot RMSE by depth and by season

Plot RMSE by DOY at bottom

Replies: 1 comment

{{title}}

Select a reply

Initial CT-LSTM evaluation #55

AndyMcAliley Sep 16, 2022

Evaluate initial CT-LSTM

Executive summary (tl;dr)

Predicted vs actual scatter plots

Plot by time period

Plot RMSE by depth and by season

Plot RMSE by DOY at bottom

Replies: 1 comment

jdiaz4302 Sep 20, 2022

Contextualizing performance metrics

Regarding deep predictions and winter predictions (or overall prediction scope)

Future improvements to plots

Comments regarding specific points/quotes

AndyMcAliley
Sep 16, 2022

jdiaz4302
Sep 20, 2022