# Selecting A Sea Level Model

In [5]:
import Pkg
Pkg.activate("."; io=devnull);
Pkg.instantiate(; io=devnull);


## Candidate Models

A key consideration in any modeling exercise is the mathematical representation of the modeled process. This choice can allow or prohibit certain relationships between variables, or may result in too many or too few parameters to appropriately capture dynamics. For example, a simple linear model in time, 
```{math}
H_\text{lin}(t) = a + b(t - t_0),
```
where $H$ is global sea level in $m$, $t$ is time in years and $t_0$ is a baseline year, does not allow for accelerated sea-level rise. A quadratic model in time, 
```{math}
H_\text{quad}(t) = a + b(t - t_0) + c(t - t_0)^2,
```
does allow for this acceleration. 
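
To see why, we can differentiate each model twice with respect to time:
```{math}
\frac{d^2 H_\text{lin}}{dt^2} = 0, \qquad \frac{d^2 H_\text{quad}}{dt^2} = 2c.
```
The linear model has a constant rate of rise $b$, while the rate of rise in the quadratic model, $b + 2c(t - t_0)$, increases over time whenever $c > 0$.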

However, both of these models assume that time is the key variable explaining changes in sea levels from one year to another. In [](#introduction-to-coastal-flood-risk), we discussed the relationship between CO$_2$ concentrations, atmospheric temperature, and sea levels. {cite:t}`rahmstorfSemiEmpiricalApproachProjecting2007` proposed the following semi-empirical model relating sea-level rise (SLR) to changes in global mean temperatures:
```{math}
\Delta H_\text{emp}(t) = \alpha \left(T(t) - T_{\text{eq}}\right),
```
where $\alpha$ is the sensitivity of SLR to temperature changes, and $T_\text{eq}$ is the temperature at which sea level is in equilibrium, that is, $H_\text{emp}(t) = H_\text{emp}(t-1)$ when $T(t) = T_\text{eq}$. In the next section, we will look at how we can find best-fit parameter values for these models to sea-level data and analyze the dynamics of these fitted models.
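
It can help to unroll the semi-empirical model before fitting it. Reading $\Delta H_\text{emp}(t)$ as the annual change $H_\text{emp}(t) - H_\text{emp}(t-1)$ (an interpretation consistent with the equilibrium condition above), and writing $H_0$ for the sea level just before the first year $t_1$ (a symbol introduced here for illustration), the model becomes a cumulative sum of temperature departures:
```{math}
H_\text{emp}(t) = H_0 + \alpha \sum_{s=t_1}^{t} \left(T(s) - T_\text{eq}\right).
```
Sea level under this model therefore depends on the whole temperature history up to year $t$, not only on the temperature in that year.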

## Loading and Plotting The Data

To fit the three models described in [](#candidate-models), we need to load historical SLR data (for all three models) and global mean temperature (GMT) data (for $H_\text{emp}$). We have provided two data sets in `contents/flood/data`.

### Reading Data Files

The SLR data file, `CSIRO_Recons_gmsl_yr_2015.csv`, 
```{margin} SLR Data Information
The SLR dataset was taken from Australia's [Commonwealth Scientific and Industrial Research Organization (CSIRO)](https://research.csiro.au/slrwavescoast/sea-level/measurements-and-data/sea-level-data/) and described in {cite:t}`churchSealevelRiseLate2011`. This reconstruction of global mean sea levels spans 1880--2013. 
```
has three columns:
1. Time in years (the fractions correspond to months);
2. Global mean sea-level in $mm$ (relative to the 1961-1990 mean);
3. Standard deviation of the observational error in $mm$.

The GMT data file has a similar structure: it is comma-delimited with a header row. 
```{margin} GMT Data Information
The GMT data was obtained from the [HadCRUT5 website](https://www.metoffice.gov.uk/hadobs/hadcrut5/) {cite:p}`moriceUpdatedAssessmentSurface2021`.
``` 
This file, `HadCRUT.5.0.1.0.analysis.summary_series.global.annual.csv`, has four columns:
1. Time in years;
2. Annual mean temperature anomaly (relative to the 1961-1990 mean);
3. The lower end of the 95% confidence interval;
4. The upper end of the 95% confidence interval.

In Julia, we can read in these files using the [`DelimitedFiles` package](https://docs.julialang.org/en/v1/stdlib/DelimitedFiles/). The `DelimitedFiles.readdlm` function takes in a filename, and can also take in optional parameters like a delimiter and an output type. We can also convert the output of `DelimitedFiles.readdlm` to a [`DataFrame`](https://dataframes.juliadata.org/stable/). Since we want to read multiple files, let's write a function which takes in a filename corresponding to a CSV file with a header and returns a `DataFrame`.

In [3]:

# load packages
using DataFrames
using DelimitedFiles

function read_csv(fname)
    dat, head = readdlm(fname, ',', header=true) # read in the data and header
    return DataFrame(dat, vec(head)) # return the DataFrame
end

# read data; first read in the data and the header, then convert to a DataFrame
slr_data = read_csv("data/CSIRO_Recons_gmsl_yr_2015.csv")
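
To see what `readdlm` returns when `header=true`, here is a minimal sketch that reads a tiny in-memory CSV instead of a file. The column names and values are made up for illustration and are not taken from the actual data sets.

```julia
using DelimitedFiles

# a tiny in-memory CSV standing in for a data file (values are made up)
csv_text = """
Time,Level
1880.5,-10.0
1881.5,-8.5
"""

# readdlm accepts an IO stream as well as a filename;
# with header=true it returns a (data, header) tuple
dat, head = readdlm(IOBuffer(csv_text), ',', header=true)

println(size(dat))   # number of data rows and columns
println(vec(head))   # header flattened from a 1-row matrix to a vector
```

The header comes back as a one-row matrix, which is why `read_csv` calls `vec(head)` before passing the names to `DataFrame`.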


Similarly, we'll use `read_csv` to read in the GMT data.

In [None]:
gmt_data = read_csv("data/HadCRUT.5.0.1.0.analysis.summary_series.global.annual.csv")

### Merging The Data

Now let's merge (or ["join"](https://dataframes.juliadata.org/stable/man/joins/)) the two DataFrames on common years. Then we can plot the two data series. 
```{margin} Database-style joins
Database-style joins allow a flexible way to merge two DataFrames (and other tabular data structures) on a common column (the "key"). There are several different styles of joins:
* *Inner joins* only return rows with key values which occur in all of the merged databases;
* *Outer joins* return all rows with key values which occur in any merged database;
* *Left joins* return rows with key values occurring in the first (left) database;
* *Right joins* return rows with key values occurring in the second (right) database.

The [`DataFrames` documentation](https://dataframes.juliadata.org/stable/man/joins/) provides information on these, as well as some other, more specialized, joins.
```

There are many types of joins. Our goal is to focus on the SLR data, so let's use a "left join" (which keeps all entries from the first listed `DataFrame` and adds entries from the second `DataFrame` with matching key values). First, we need to correct the "Time" column in `slr_data`, as those values end with a non-zero decimal (each annual value carries a trailing 0.5) and so would not match the whole-number years in `gmt_data`.

In [1]:
slr_data[:, :Time] = slr_data[:, :Time] .- 0.5; # subtract 0.5 from each Time so years are whole numbers
all_data = leftjoin(slr_data, gmt_data, on="Time") # left join the data frames on Time
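
The logic of this step can be sketched in plain Julia without `DataFrames`. This is only an illustration of left-join semantics, not how `leftjoin` is implemented, and the toy values and variable names are hypothetical.

```julia
# hypothetical toy series (values are made up for illustration)
slr_time = [1880.5, 1881.5, 1882.5]            # time stamps with a trailing 0.5
slr_level = [-10.0, -8.5, -9.0]
gmt = Dict(1880.0 => -0.30, 1881.0 => -0.25)   # no entry for 1882

# shift the time stamps onto whole years so the keys can match
years = slr_time .- 0.5

# left-join semantics: keep every left row, fill unmatched right values with missing
joined = [(year=y, level=h, temp=get(gmt, y, missing)) for (y, h) in zip(years, slr_level)]

foreach(println, joined)
```

Every row of the left series survives, and the year with no temperature entry is filled with `missing` rather than dropped.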


### Plotting The Data

Now we can plot the observations. Since we merged the DataFrames, we can pass two of the columns to `Plots.plot` along with a layout, and it will automatically plot them as subplots (though we do have to convert the y-values to a `Matrix`).

In [None]:
using Plots

# plot reconstructed anomalies as black points
scatter(
    all_data[:, 1], Matrix(all_data[:, [2, 4]]);
    layout=(2, 1), grid=false, markersize=3, legend=:none,
    xlabel="Year", ylabel=["mm" "°C"],
    title=["Global Mean Sea Level" "Global Mean Temperature"],
    plot_title="Anomaly Relative to 1961-1990 Mean",
)

## Fitting the Models to the Data

Now we can write functions to predict the SLR using the three models defined in [](#candidate-models).

In [None]:
# H_lin
function H_lin(t, a, b)
    slr_predict = a .+ (b .* (t .- t[1]))
    return slr_predict
end

# H_quad
function H_quad(t, a, b, c)
    slr_predict = a .+ (b .* (t .- t[1])) .+ (c .* (t .- t[1]).^2)
    return slr_predict
end

# H_emp
function H_emp(temp, α, T₀, H₀)
    temp_effect = α .* (temp .- T₀)
    slr_predict = cumsum(temp_effect) .+ H₀
    return slr_predict
end
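
A quick sanity check on synthetic inputs can confirm that the cumulative-sum form behaves as the equilibrium condition suggests. The sketch below re-declares the model under a hypothetical name, `h_emp_sketch`, so that it runs on its own. The temperatures and parameter values are made up.

```julia
# re-declared sketch of the semi-empirical model so this block is self-contained
h_emp_sketch(temp, α, T₀, H₀) = cumsum(α .* (temp .- T₀)) .+ H₀

# temperatures held at T₀: sea level should not change
flat = h_emp_sketch([0.5, 0.5, 0.5], 2.0, 0.5, 10.0)

# temperatures one degree above T₀: sea level should rise by α each step
rising = h_emp_sketch([1.5, 1.5, 1.5], 2.0, 0.5, 10.0)

println(flat)
println(rising)
```

Note that in this form the first prediction already includes the first year's temperature effect, so `H₀` acts as the sea level just before the first observation.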

To find parameter values for the models, we will minimize the root-mean-square error (RMSE):
```{math}
\text{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^N\left(\text{pred}_t - \text{obs}_t\right)^2},
```
where $t$ is the time index, $N$ is the number of observations, $\text{pred}_t$ is the model prediction, and $\text{obs}_t$ is the data value.
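
As a small worked example of this calculation, using the mean of the squared errors and made-up numbers rather than the SLR data:

```julia
using Statistics

pred = [1.0, 2.0, 3.0]   # made-up predictions
obs = [1.0, 2.0, 5.0]    # made-up observations

squared_errors = (pred .- obs) .^ 2       # only the last point is off, by 2
rmse_value = sqrt(mean(squared_errors))   # square root of the mean squared error

println(rmse_value)
```

Because the errors are squared before averaging, one large miss contributes more to the RMSE than several small ones of the same total size.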

We can find parameter values which minimize the RMSE using [Optim.jl](https://julianlsolvers.github.io/Optim.jl/stable/#). The `Optim.optimize` function minimizes a function value using one of several numerical solvers. `Optim.optimize` wants its input function to accept a vector of proposed parameter values, so we need to construct a wrapper function which accepts a parameter vector and passes the relevant values to the target function, before calculating the RMSE. 

Even though we have three different functions to optimize, we can take advantage of how we defined those functions --- they all take in an auxiliary vector (time `t` for `H_lin` and `H_quad` and temperature `temp` for `H_emp`) followed by unknown parameter values.
```{margin} Using "splat" to unpack argument vectors
If all of our `H` functions had the same number of uncertain parameters, we could hard-code the unpacking of a parameter vector `v` by referencing `v[1]`, `v[2]`, etc. However, we want to optimize `H_lin` with respect to two parameters, and the others with respect to three, so we can't manually unpack the parameter vectors in this fashion without writing a lot of additional code, which would reduce readability and increase the chance for bugs. Fortunately, Julia has a "splat" operator (`...`), which is used to automatically unpack vectors. For example, if `v` has three elements, `fn(v...)` is the same as `fn(v[1], v[2], v[3])`, except that the splat is more flexible and doesn't require additional assumptions about the length of `v`.
``` 
As these functions all have a common structure to their arguments, we can pass that auxiliary vector, then use the splat operator to unpack the parameter vector, and the length of the parameter vector doesn't matter! We do this in the `rmse` function defined below.

In [None]:
using Optim
using Statistics # need to load Statistics for a mean function

# define a function for the root-mean-squared error
# in this function, we're using the known structure of our three functions,
# which accept an input data vector followed by several scalar parameters
# This input data is passed as the "aux" parameter
# all of the variable parameters are passed in as a single parameter vector for optimize()
function rmse(fn, params, aux, obs)
    predict = fn(aux, params...) # here we use the splat operator to unpack arguments
    # return the value directly rather than assigning to a local that shadows the function name
    return sqrt(mean((predict .- obs).^2))
end
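
To make the role of the splat concrete, here is a minimal sketch with two hypothetical models, `line` and `parabola`, which take different numbers of parameters but share one wrapper. None of these names come from the analysis above.

```julia
# hypothetical models with different numbers of parameters
line(x, a, b) = a .+ b .* x
parabola(x, a, b, c) = a .+ b .* x .+ c .* x .^ 2

# one wrapper serves both: params... expands the vector into positional arguments
evaluate(fn, params, x) = fn(x, params...)

x = [0.0, 1.0, 2.0]
println(evaluate(line, [1.0, 2.0], x))            # same as line(x, 1.0, 2.0)
println(evaluate(parabola, [1.0, 2.0, 3.0], x))   # same as parabola(x, 1.0, 2.0, 3.0)
```

If the parameter vector has the wrong length for the model, Julia raises a `MethodError`, which is a useful early warning when setting up initial values.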

The only thing left to do is to call `Optim.optimize` on RMSE with varying functions `fn` corresponding to our three SLR models. We can deal with the constant values for each call (`fn`, `aux`, and `obs`) by using anonymous functions to map the parameter vector proposed in a given solver iteration to an `rmse` call:

In [None]:
# optimize each of the SLR models

function minimize_rmse(fn, aux, obs, init_params)
    optimize_out = optimize(params -> rmse(fn, params, aux, obs), init_params)
    params_optim = Optim.minimizer(optimize_out)
    return params_optim
end
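
The anonymous function is what turns a four-argument error function into the one-argument objective that a solver expects. The sketch below shows this without calling `Optim`, using a hypothetical `line` model, a re-declared `rmse_sketch`, and synthetic data generated from known parameters.

```julia
using Statistics

# hypothetical model and synthetic data generated from known parameters (a=1, b=2)
line(x, a, b) = a .+ b .* x
x = collect(0.0:4.0)
obs = line(x, 1.0, 2.0)

# re-declared error function so this block is self-contained
rmse_sketch(fn, params, aux, obs) = sqrt(mean((fn(aux, params...) .- obs) .^ 2))

# the closure captures fn, aux, and obs, leaving only the parameter vector free
objective = params -> rmse_sketch(line, params, x, obs)

println(objective([1.0, 2.0]))   # the parameters that generated the data
println(objective([0.0, 2.0]))   # intercept off by one
```

The objective is zero at the generating parameters and positive elsewhere, which is the kind of surface a solver searches when given a starting vector.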

In [None]:

# H_lin
# this has two uncertain parameters (a, b), so we use a 2d initial vector [0, 1]
# we use a decimal after each initial value to make Julia interpret them as Floats, rather than Ints
params_lin = minimize_rmse(H_lin, all_data[:, 1], all_data[:, 2], [0., 1.])

In [None]:
# H_quad
# this has three uncertain parameters (a, b, c), so we use a 3d initial vector [0, 1, 0]
params_quad = minimize_rmse(H_quad, all_data[:, 1], all_data[:, 2], [0., 1., 0.])

In [None]:
# H_emp
# unlike the other two models, we pass in temperatures as the auxiliary
# this has three uncertain parameters (α, T₀, H₀), in the order H_emp expects them, so we use a 3d initial vector [1, 0, 0]
params_emp = minimize_rmse(H_emp, all_data[:, 4], all_data[:, 2], [1., 0., 0.])

Finally, let's plot the SLR hindcasts from all three fitted models to see how they perform.

In [None]:
scatter(all_data[:, 1], all_data[:, 2], color="black", alpha=0.7, label="Reconstructed Data", legend=:topleft, grid=false, xaxis="Year", yaxis="Sea Level Anomaly (mm)", title="Global Mean Sea Level Anomaly Model Predictions")
plot!(all_data[:, 1], H_lin(all_data[:, 1], params_lin...), color="#66c2a5", linewidth=3, label="H_lin")
plot!(all_data[:, 1], H_quad(all_data[:, 1], params_quad...), color="#fc8d62", linewidth=3, label="H_quad")
plot!(all_data[:, 1], H_emp(all_data[:, 4], params_emp...), color="#8da0cb", linewidth=3, label="H_emp")

We can see that the linear model $H_\text{lin}$ (green) fails to capture the trend at the start and end of the period, including the acceleration at the end of the data set, which is expected since a linear model cannot represent acceleration. The quadratic and semi-empirical models, $H_\text{quad}$ (orange) and $H_\text{emp}$ (blue), produce similar hindcasts. Comparing the RMSEs of these two models:

In [None]:
(rmse(H_quad, params_quad, all_data[:, 1], all_data[:, 2]), rmse(H_emp, params_emp, all_data[:, 4], all_data[:, 2]))

```{margin} Choosing $H_\text{quad}$ vs. $H_\text{emp}$
Since both of these models perform similarly, the choice between the two depends on what question you are asking. If you were primarily interested in inferring the historical temporal trend, $H_\text{quad}$ might be more useful, since it makes the time-dependence explicit, while $H_\text{emp}$ focuses on the dependence on GMT, which would be more useful for making projections given the causal link between warming and SLR.
```
The RMSE is slightly lower for $H_\text{emp}$, but we want to be careful to not over-interpret small differences. Visually, the main difference is that $H_\text{emp}$ shows more of an increase in the last few years, which matches what appears to be an acceleration in the data. As a result, for the rest of this section, we will use $H_\text{emp}$ for our SLR modeling, but you could justify using $H_\text{quad}$. 

```{bibliography}
:filter: docname in docnames
```