Copyright 2021-2023 Lawrence Livermore National Security, LLC and other MuyGPyS
Project Developers. See the top-level COPYRIGHT file for details.

SPDX-License-Identifier: MIT

# One-Line Regression Workflow

This notebook walks through the same regression workflow as 
[the univariate regression tutorial](univariate_regression_tutorial.ipynb).

It differs from the 
[tutorial](univariate_regression_tutorial.ipynb)
by using a 
[high-level API](../MuyGPyS/examples/regress.rst)
that automates all of the steps shown there.
`MuyGPyS.examples` automates a small number of such workflows.
We recommend the lower-level API for most work, but the high-level APIs are convenient for the simple applications that they cover.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from MuyGPyS._test.gp import benchmark_sample, benchmark_sample_full, BenchmarkGP

We will set a random seed here for consistency when building docs.
In practice we would not fix a seed.

In [None]:
np.random.seed(0)

We perform the same operations to sample a curve from a conventional GP as described in the 
[tutorial notebook](univariate_regression_tutorial.ipynb).

In [None]:
lb = -10.0
ub = 10.0
data_count = 5001
train_step = 10
x = np.linspace(lb, ub, data_count).reshape(data_count, 1)
test_features = x[np.mod(np.arange(data_count), train_step) != 0, :]
train_features = x[::train_step, :]
test_count, _ = test_features.shape
train_count, _ = train_features.shape
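
The split above keeps every `train_step`-th point for training and holds out the rest for testing. The following is a minimal sketch, using only NumPy, that reproduces the split with a single boolean mask and checks the resulting sizes. The name `is_train` is introduced here for illustration and does not appear elsewhere in the notebook.

```python
import numpy as np

data_count = 5001
train_step = 10
x = np.linspace(-10.0, 10.0, data_count).reshape(data_count, 1)

# Boolean mask (illustrative name) that is True for every train_step-th row.
is_train = np.mod(np.arange(data_count), train_step) == 0

# The mask and its complement partition the rows of x.
train_features = x[is_train, :]
test_features = x[~is_train, :]

train_count = train_features.shape[0]
test_count = test_features.shape[0]
print(train_count, test_count)
```

With these settings, roughly one point in ten is used for training, and the two sets are disjoint by construction.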

In [None]:
from MuyGPyS.gp.distortion import IsotropicDistortion, NullDistortion
from MuyGPyS.gp.kernels import Hyperparameter, Matern
from MuyGPyS.gp.noise import HomoscedasticNoise
nugget_var = 1e-14
fixed_length_scale = 1.0
gp = BenchmarkGP(
    Matern(
        nu=Hyperparameter(2.0),
        length_scale=Hyperparameter(fixed_length_scale),
        metric=NullDistortion("l2"),
    ),
    eps=HomoscedasticNoise(nugget_var),
)

In [None]:
y = benchmark_sample(gp, x)

In [None]:
test_responses = y[np.mod(np.arange(data_count), train_step) != 0, :]
measurement_eps = 1e-5
train_responses = y[::train_step, :] + np.random.normal(0, measurement_eps, size=(train_count,1))

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 11))

axes[0].set_title("Sampled Curve", fontsize=24)
axes[0].set_xlabel("Feature Domain", fontsize=20)
axes[0].set_ylabel("Response Range", fontsize=20)
axes[0].plot(train_features, train_responses, "k*", label="perturbed train response")
axes[0].plot(test_features, test_responses, "g-", label="test response")
axes[0].legend(fontsize=20) 

vis_subset_size = 10
mid = int(train_count / 2)

axes[1].set_title("Sampled Curve (subset)", fontsize=24)
axes[1].set_xlabel("Feature Domain", fontsize=20)
axes[1].set_ylabel("Response Range", fontsize=20)
axes[1].plot(
    train_features[mid:mid + vis_subset_size], 
    train_responses[mid:mid + vis_subset_size], 
    "k*", label="perturbed train response"
)
axes[1].plot(
    test_features[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))], 
    test_responses[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))], 
    "g-", label="test response"
)

plt.tight_layout()

plt.show()
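
The subset plot relies on an index conversion: each gap between consecutive training points contains `train_step - 1` test points, so training index `mid` corresponds to test index `mid * (train_step - 1)`. The sketch below checks that alignment with NumPy alone. The names `start` and `stop` are illustrative and are not used in the cells of this notebook.

```python
import numpy as np

data_count = 5001
train_step = 10
x = np.linspace(-10.0, 10.0, data_count).reshape(data_count, 1)
train_features = x[::train_step, :]
test_features = x[np.mod(np.arange(data_count), train_step) != 0, :]

vis_subset_size = 10
mid = train_features.shape[0] // 2

# Each training point is followed by (train_step - 1) test points, so the
# test index that directly follows train index `mid` is mid * (train_step - 1).
start = mid * (train_step - 1)
stop = start + vis_subset_size * (train_step - 1)

train_window = train_features[mid:mid + vis_subset_size]
test_window = test_features[start:stop]

# The first test point in the window lies just to the right of train point `mid`.
print(train_window[0, 0], test_window[0, 0])
```

Computing the bounds once in this way would also allow the repeated slice expressions in the plotting cells to be written as `test_features[start:stop]`.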

We now set our nearest neighbor, kernel, and optimization parameters.

In [None]:
nn_kwargs = {"nn_method": "exact", "algorithm": "ball_tree"}
k_kwargs = {
    "kernel": Matern(
        nu=Hyperparameter("log_sample", (0.1, 5.0)),
        length_scale=Hyperparameter(fixed_length_scale),
        metric=IsotropicDistortion("l2")
    ),
    "eps": HomoscedasticNoise(measurement_eps),
}
opt_kwargs = {"random_state": 1, "init_points": 5, "n_iter": 20}

Finally, we run [do_regress()](../MuyGPyS/examples/regress.rst).
This function runs a simple regression workflow from end to end and exposes several tunable options.
Most of the keyword arguments in this example are set to their default values, so in practice the call need not be this verbose.

The kwarg `opt_method` selects the optimization method.
In this example we use `"bayesian"`, which consumes the additional kwargs given in `opt_kwargs`.
The other currently supported option, `"scipy"`, expects no additional kwargs, so the user can safely omit `opt_kwargs`.
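
One way to keep the two cases apart is to assemble the optimizer-related keyword arguments in a small helper and unpack them into the call. The sketch below is plain Python and does not import MuyGPyS. The helper `build_opt_options` is hypothetical and is not part of the library; it only encodes the rule stated above, that `opt_kwargs` accompanies `"bayesian"` and is omitted for `"scipy"`.

```python
def build_opt_options(opt_method, opt_kwargs=None):
    """Hypothetical helper: collect the optimizer-related keyword arguments."""
    if opt_method not in ("bayesian", "scipy"):
        raise ValueError(f"unsupported opt_method: {opt_method!r}")
    options = {"opt_method": opt_method}
    if opt_method == "bayesian":
        # Only the Bayesian optimizer takes additional settings.
        options["opt_kwargs"] = dict(opt_kwargs or {})
    return options


bayes_options = build_opt_options(
    "bayesian", {"random_state": 1, "init_points": 5, "n_iter": 20}
)
scipy_options = build_opt_options("scipy")
print(bayes_options)
print(scipy_options)
```

Under this assumption, either dictionary could be unpacked into the call with `**bayes_options` or `**scipy_options` in place of the explicit `opt_method` and `opt_kwargs` arguments.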

In [None]:
from MuyGPyS.examples.regress import do_regress

muygps, nbrs_lookup, predictions, variances = do_regress(
    test_features,
    train_features,
    train_responses,
    nn_count=30,
    batch_count=train_count,
    loss_method="mse",
    obj_method="loo_crossval",
    opt_method="bayesian",
    sigma_method="analytic",
    k_kwargs=k_kwargs,
    nn_kwargs=nn_kwargs,
    opt_kwargs=opt_kwargs,
    verbose=True,
)

We evaluate our prediction performance in the same manner as in the 
[tutorial](univariate_regression_tutorial.ipynb).
We report the RMSE, the mean diagonal posterior variance, the mean 95% confidence interval size, and the coverage, which ideally should be near 95%.
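
To make these metrics concrete, the following is a minimal sketch that computes them with NumPy on synthetic data. The arrays `truth`, `preds`, and `variances` are made up for illustration and are not outputs of `do_regress`; the noise level `sigma` is likewise an assumption. The sketch also shows why the shapes must agree before comparing: an `(n, 1)` array compared against an `(n,)` array broadcasts to `(n, n)`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
sigma = 0.1  # assumed predictive standard deviation (illustrative)

# Synthetic stand-ins for the test responses, predictions, and variances.
truth = np.sin(np.linspace(0.0, 6.0, n)).reshape(n, 1)
preds = truth + rng.normal(0.0, sigma, size=(n, 1))
variances = np.full(n, sigma**2)

# Half-width of the 95% interval under a Gaussian posterior.
half_width = 1.96 * np.sqrt(variances)

# Flatten the residuals so that both sides of the comparison have shape (n,).
residual = np.abs(truth - preds).reshape(n)

rmse = np.sqrt(np.mean(residual**2))
coverage = np.count_nonzero(residual < half_width) / n
mean_ci_size = np.mean(2 * half_width)

# Without flattening, (n, 1) against (n,) broadcasts to an (n, n) matrix.
mismatched_shape = (np.abs(truth - preds) < half_width).shape

print(f"RMSE: {rmse}")
print(f"mean confidence interval size: {mean_ci_size}")
print(f"coverage: {coverage}")
```

Because the synthetic errors are drawn with the same standard deviation that defines the intervals, the coverage here should land close to 95%.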

In [None]:
from MuyGPyS.optimize.loss import mse_fn

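# Half-width of the 95% confidence interval, flattened to shape (test_count,).
confidence_intervals = (np.sqrt(variances) * 1.96).reshape((test_count,))
# Fraction of test responses that fall inside their confidence interval.
# Both sides are flattened so that the comparison is elementwise.
coverage = (
    np.count_nonzero(
        np.abs(test_responses - predictions).reshape((test_count,))
        < confidence_intervals
    )
    / test_count
)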
print(f"RMSE: {np.sqrt(mse_fn(predictions, test_responses))}")
print(f"mean diagonal variance: {np.mean(variances)}")
print(f"mean confidence interval size: {np.mean(confidence_intervals * 2)}")
print(f"coverage: {coverage}")

We also produce the same plots.

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 11))

axes[0].set_title("Sampled Curve", fontsize=24)
axes[0].set_xlabel("Feature Domain", fontsize=20)
axes[0].set_ylabel("Response Range", fontsize=20)
axes[0].plot(train_features, train_responses, "k*", label="perturbed train response")
axes[0].plot(test_features, test_responses, "g-", label="test response")
axes[0].plot(test_features, predictions, "r--", label="test predictions")
axes[0].fill_between(
    test_features[:, 0], 
    (predictions[:, 0] - confidence_intervals),
    (predictions[:, 0] + confidence_intervals),
    facecolor="red",
    alpha=0.25,
    label="95% Confidence Interval",
)
axes[0].legend(fontsize=20)

axes[1].set_title("Sampled Curve (subset)", fontsize=24)
axes[1].set_xlabel("Feature Domain", fontsize=20)
axes[1].set_ylabel("Response Range", fontsize=20)
axes[1].plot(
    train_features[mid:mid + vis_subset_size], 
    train_responses[mid:mid + vis_subset_size], 
    "k*", label="perturbed train response"
)
axes[1].plot(
    test_features[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))], 
    test_responses[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))], 
    "g-", label="test response"
)
axes[1].plot(
    test_features[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))], 
    predictions[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))],
    "r--", label="test predictions")
axes[1].fill_between(
    test_features[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))][:, 0], 
    (predictions[:, 0] - confidence_intervals)[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))],
    (predictions[:, 0] + confidence_intervals)[mid * (train_step - 1):mid * (train_step - 1) + (vis_subset_size * (train_step - 1))],
    facecolor="red",
    alpha=0.25,
    label="95% Confidence Interval",
)
axes[1].legend(fontsize=20)

plt.tight_layout()

plt.show()