# Spatial diagnostics

This notebook creates the checkerboard test shown in Fig. 3C.

## Settings

Here are the settings you can adjust when running this notebook:
- ``num_threads``: If running on a multi-core machine, change this from ``None`` to an ``int`` in order to set the max number of threads to use
- ``feattype``: If you want to run this using RGB features rather than RCF, change this from "random" to "rgb"
- ``subset_[n,feat]``: If you want to subset the training set data for quick tests/debugging, specify that here using the `slice` object. `slice(None)` implies no subsetting of the ~80k observations for each label that are in the training set. `subset_n` slices observations; `subset_feat` subsets features.
- ``overwrite``: If `False`, this code raises an error when the file you are saving already exists. Set `overwrite` to `True` to overwrite existing data files, or to `None` to read the setting from the `MOSAIKS_OVERWRITE` environment variable.
- ``fixed_lambda``: If `True`, only run the hyperparameters (the ridge lambda and the RBF sigma) that were previously chosen. This raises an error if you haven't already generated a results file.
- ``labels_to_run``: By default, this notebook will loop through all the labels. If you would like, you can reduce this list to only loop through a subset of them, by changing ``"all"`` to a list of task names, e.g. ``["housing", "treecover"]``
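
As a quick illustration of how the `subset_[n,feat]` settings behave, the sketch below applies `slice` objects to a small toy array. The names `X_demo`, `first_five`, and `first_two_feats` are hypothetical and exist only for this example; the real feature matrix is loaded further down.

```python
import numpy as np

# Toy stand-in for the feature matrix: 10 observations x 4 features.
X_demo = np.arange(40).reshape(10, 4)

no_subset = slice(None)        # same as X_demo[:], keeps everything
first_five = slice(0, 5)       # same as X_demo[:5], first five observations
first_two_feats = slice(0, 2)  # first two feature columns

full = X_demo[no_subset, no_subset]
small = X_demo[first_five, first_two_feats]

print(full.shape, small.shape)
```

Setting `subset_n = slice(0, 500)`, for example, would keep only the first 500 training observations for each label.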

In [None]:
num_threads = None

feattype = "random"
# feattype = "rgb"

subset_n = slice(None)
subset_feat = slice(None)

overwrite = True

fixed_lambda = False

labels_to_run = "all"

### Imports

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os

# Import necessary packages
from mosaiks import transforms
from mosaiks.utils.imports import *
from threadpoolctl import threadpool_limits

if num_threads is not None:
    threadpool_limits(num_threads)
    os.environ["NUMBA_NUM_THREADS"] = str(num_threads)

if overwrite is None:
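    # Environment variables are strings, so parse the value explicitly
    # (a plain truthiness check would treat the string "False" as True)
    overwrite = os.getenv("MOSAIKS_OVERWRITE", "False").lower() in ("1", "true")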
if labels_to_run == "all":
    labels_to_run = c.app_order

## Load Random Feature Data

This loads our feature matrix `X` for both the population-weighted (POP) and uniform-at-random (UAR) samples.

In [None]:
X = {}
latlons = {}

X["UAR"], latlons["UAR"] = io.get_X_latlon(c, "UAR")
X["POP"], latlons["POP"] = io.get_X_latlon(c, "POP")

## Run regressions

The following loop will:
1. Load the appropriate labels
2. Merge them with the feature matrix
3. Remove test set observations
4. Split data into 50/50 train/validation sets, using a checkerboard pattern with squares of various sizes. Changing the square size places training data geographically closer to or farther from validation data. Each checkerboard pattern is also offset several times to estimate the uncertainty in the performance estimate induced by the choice of checkerboard origin.
5. Run both RBF smoothing and ridge regression (using RCF features) on each train/validation split.
6. Save the out-of-sample predictions for both approaches for use in Figure 3.
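
The checkerboard split in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in `spatial_experiments.checkered_predictions_by_radius`: the helper names `checkerboard_split` and `jittered_origins` are hypothetical, and the assumption that jitter offsets are spaced evenly within one square width may differ from what the library does.

```python
import numpy as np


def checkerboard_split(latlons, delta, origin=(0.0, 0.0)):
    """Return a boolean mask that is True for points in 'training' squares.

    latlons: (n, 2) array of (lat, lon) in degrees.
    delta: side length of each checkerboard square, in degrees.
    origin: (lat, lon) offset of the checkerboard grid.
    """
    lat_idx = np.floor((latlons[:, 0] - origin[0]) / delta).astype(int)
    lon_idx = np.floor((latlons[:, 1] - origin[1]) / delta).astype(int)
    # Squares whose row and column indices sum to an even number are training.
    return (lat_idx + lon_idx) % 2 == 0


def jittered_origins(delta, num_jitter_positions_sqrt):
    """Assumed jitter scheme: evenly spaced offsets within one square width."""
    offsets = np.arange(num_jitter_positions_sqrt) * delta / num_jitter_positions_sqrt
    return [(dlat, dlon) for dlat in offsets for dlon in offsets]


# Toy points scattered roughly over the continental US (illustrative only).
rng = np.random.default_rng(0)
latlons_demo = np.column_stack(
    [rng.uniform(25, 50, 1000), rng.uniform(-125, -66, 1000)]
)

delta = 4.0
masks = [
    checkerboard_split(latlons_demo, delta, origin)
    for origin in jittered_origins(delta, 2)
]
for mask in masks:
    print(f"training fraction: {mask.mean():.2f}")
```

Larger values of `delta` produce larger squares, so a typical validation point sits farther from its nearest training point. Repeating the split over several origins gives a spread of scores for each `delta`.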

In [None]:
jitter_pos = c.checkerboard["num_jitter_positions_sqrt"]
prefix_str = "checkerboardJitterInterpolation"
if not (subset_n == slice(None) and subset_feat == slice(None)):
    prefix_str += "_SUBSAMPLE"
    if overwrite:
        print(
            "Setting overwrite = False because you are working with a subset of the full feature matrix."
        )
    overwrite = False

In [None]:
for label in labels_to_run:

    print("*** Running regressions for: {}".format(label))

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label, feattype=feattype)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP

    if c_app["logged"]:
        bounds = np.array([c_app["us_bounds_log_pred"]])
    else:
        bounds = np.array([c_app["us_bounds_pred"]])

    # Set solver arguments for image-based features
    solver_kwargs_base = {"return_preds": True, "clip_bounds": bounds}
    if fixed_lambda:
        best_lambda_fpath = join(
            this_cfg.fig_dir_sec,
            f"checkerboardJitterInterpolation_{label}_{c_app['variable']}_{this_cfg.full_suffix_image}.data",
        )
        sigmas_rbf = io.get_lambdas(
            c,
            label,
            best_lambda_name="best_sigma_interp",
            best_lambda_fpath=best_lambda_fpath,
        )
    else:
        best_lambda_fpath = None
        sigmas_rbf = this_cfg.checkerboard["sigmas"]

    solver_kwargs_image_CB = {
        **solver_kwargs_base,
        "lambdas": io.get_lambdas(
            c,
            label,
            lambda_name="lambdas_checkerboard",
            best_lambda_name="best_lambda_rcf",
            best_lambda_fpath=best_lambda_fpath,
        ),
        "svd_solve": False,
        "allow_linalg_warning_instances": False,
        "solve_function": solve.ridge_regression,
    }

    # and for spatial interpolation
    solver_kwargs_interpolation_CB = {
        **solver_kwargs_base,
        "sigmas": sigmas_rbf,
        "solve_function": rbf_interpolate.rbf_interpolate_solve,
    }

    ## get X, Y, latlon values of training data
    (
        this_X,
        _,
        this_Y,
        _,
        this_latlons,
        _,
    ) = parse.merge_dropna_transform_split_train_test(
        this_cfg, label, X[sampling_type], latlons[sampling_type]
    )

    ## cast latlons to float32 to reduce footprint of RBF smoothing
    this_latlons = this_latlons.astype(np.float32)

    ## subset
    this_latlons = this_latlons[subset_n]
    this_X = this_X[subset_n, subset_feat]
    this_Y = this_Y[subset_n]

    ## -----------------------------------------------------
    # Checkerboard r2 analysis (jitter, sigma-sweep)
    ## Lat-lon features are done with density estimation (interpolation)
    ## -----------------------------------------------------

    # Run regressions with spatial interpolation and RCF features
    metrics_checkered = {}
    best_hps = {}
    for features, task, solver_kwargs, hp_name, best_hp_name in [
        (
            this_latlons,
            "latlon features sigma tuned",
            solver_kwargs_interpolation_CB,
            "sigmas",
            "best_sigma_interp",
        ),
        (
            this_X,
            "image_features",
            solver_kwargs_image_CB,
            "lambdas",
            "best_lambda_rcf",
        ),
    ]:
        print(f"Running {task}...")
        results_checkered = spatial_experiments.checkered_predictions_by_radius(
            features,
            this_Y,
            this_latlons,
            this_cfg.checkerboard["deltas"],
            this_cfg.plotting["extent"],
            crit=["r2_score"],
            return_hp_idxs=True,
            num_jitter_positions_sqrt=jitter_pos,
            **solver_kwargs,
        )

        metrics_checkered[task] = spatial_experiments.results_to_metrics(
            results_checkered
        )
        best_hps[best_hp_name] = np.array(
            [solver_kwargs[hp_name][r["hp_idxs_chosen"][0]] for r in results_checkered]
        )

    # Plot and save
    print(f"Plotting and saving output to {this_cfg.fig_dir_sec}...")
    spatial_plotter.checkerboard_vs_delta_with_jitter(
        metrics_checkered,
        best_hps,
        this_cfg.checkerboard["deltas"],
        "r2_score",
        val_name=c_app["variable"],
        app_name=label,
        prefix=prefix_str,
        suffix=this_cfg.full_suffix_image,
        save_dir=this_cfg.fig_dir_sec,
        overwrite=overwrite,
    )