# Sensitivity to number of training samples and features

This notebook plots R^2 and MSE as a function of:

1. training sample size
2. number of features
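
For reference, the two metrics can be computed as in the minimal sketch below. It uses plain NumPy rather than the `mosaiks` helpers, and the arrays `y_true` and `y_pred` are made-up values for illustration only.

```python
import numpy as np

# Made-up observations and predictions (illustrative only).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Mean squared error: the average squared residual.
mse = np.mean((y_true - y_pred) ** 2)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")  # MSE = 0.025, R^2 = 0.980
```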

## Settings

Here are the settings you can adjust when running this notebook:
- ``num_threads``: If running on a multi-core machine, change this from ``None`` to an ``int`` to cap the number of threads used.
- ``feattype``: To run with RGB features rather than random convolutional features (RCF), change this from ``"random"`` to ``"rgb"``.
- ``subset_[n,feat]``: If you want to subset the training set data for quick tests/debugging, specify that here using the `slice` object. `slice(None)` implies no subsetting of the ~80k observations for each label that are in the training set. `subset_n` slices observations; `subset_feat` subsets features.
- ``overwrite``: By default, this code will raise an error if the file you are saving already exists. If you would like to disable that and overwrite existing data files, change `overwrite` to `True`.
- ``fixed_lambda``: If ``True``, run only the lambda that was previously chosen. This raises an error if a results file has not already been generated.
- ``labels_to_run``: By default, this notebook will loop through all the labels. If you would like, you can reduce this list to only loop through a subset of them, by changing ``"all"`` to a list of task names, e.g. ``["housing", "treecover"]``
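
As a quick illustration of how the two `slice` settings behave, the sketch below applies them to a small stand-in array. The array shape and the names `X_demo`, `debug_n`, and `debug_feat` are invented for this example and are not part of the notebook.

```python
import numpy as np

# Stand-in feature matrix: 10 observations x 6 features (hypothetical sizes).
X_demo = np.arange(60).reshape(10, 6)

# slice(None) keeps everything along an axis, so this is the full matrix.
full = X_demo[slice(None), slice(None)]

# A debugging subset: first 4 observations, first 3 features.
debug_n = slice(0, 4)
debug_feat = slice(0, 3)
small = X_demo[debug_n, debug_feat]

print(full.shape)   # (10, 6)
print(small.shape)  # (4, 3)

# The notebook checks `.stop` to decide whether a subset was requested.
print(slice(None).stop, debug_n.stop)  # None 4
```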

In [None]:
num_threads = None

feattype = "random"
# feattype = "rgb"

subset_n = slice(None)
subset_feat = slice(None)

overwrite = False

fixed_lambda = False

labels_to_run = "all"

### Imports

In [None]:
%load_ext autoreload
%autoreload 2

import os

# Import necessary packages
from mosaiks import transforms
from mosaiks.utils.imports import *
from threadpoolctl import threadpool_limits

if num_threads is not None:
    threadpool_limits(num_threads)
    os.environ["NUMBA_NUM_THREADS"] = str(num_threads)

if labels_to_run == "all":
    labels_to_run = c.app_order

## Load Random Feature Data

This loads our feature matrix `X` for both the POP (population-weighted) and UAR (uniform-at-random) samples.

In [None]:
X = {}

X["UAR"], _ = io.get_X_latlon(c, "UAR")
X["POP"], _ = io.get_X_latlon(c, "POP")

## Run regressions

The following loop will:
1. Load the appropriate labels
2. Merge them with the feature matrix
3. Remove test set observations
4. Run ridge regression on the training/validation set, sweeping over a range of regularization parameters with 5-fold cross-validation and clipping predictions to pre-specified bounds. This is repeated for multiple training set sizes and feature vector lengths.
5. Save the out-of-sample predictions and the observations for use in Figure 3.
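
Step 4 can be sketched in plain NumPy as follows. This is a simplified stand-in, not the `mosaiks` solver: `ridge_fit` and `cv_ridge` are hypothetical helpers, the data are synthetic, and the lambda grid and clipping bounds are arbitrary values chosen for the demo.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights (no intercept): (X'X + lam*I)^-1 X'y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

def cv_ridge(X, y, lambdas, num_folds=5, clip_bounds=(-np.inf, np.inf), seed=0):
    """Return the mean out-of-fold R^2 for each lambda, with clipped predictions."""
    rng = np.random.default_rng(seed)
    # Balanced random assignment of each observation to one fold.
    fold_ids = rng.permutation(len(y)) % num_folds
    r2 = np.zeros((num_folds, len(lambdas)))
    for k in range(num_folds):
        train, val = fold_ids != k, fold_ids == k
        for j, lam in enumerate(lambdas):
            w = ridge_fit(X[train], y[train], lam)
            # Clip predictions to the allowed range before scoring.
            pred = np.clip(X[val] @ w, *clip_bounds)
            ss_res = np.sum((y[val] - pred) ** 2)
            ss_tot = np.sum((y[val] - y[val].mean()) ** 2)
            r2[k, j] = 1 - ss_res / ss_tot
    return r2.mean(axis=0)

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

lambdas = [1e-3, 1e-1, 1e1, 1e3]  # arbitrary grid
mean_r2 = cv_ridge(X_demo, y_demo, lambdas, clip_bounds=(-10, 10))
best_idx = int(np.argmax(mean_r2))
print("best lambda:", lambdas[best_idx])
```

Repeating this call on growing subsets of rows, or of columns, gives one curve per metric, which is the kind of output the loop below produces for each label.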

In [None]:
num_samp_vector = c.performance["num_samp_vector"]
if subset_n.stop is not None:
    num_samp_vector = [s for s in num_samp_vector if s <= subset_n.stop]
    if len(num_samp_vector) < 2:
        num_samp_vector = [int(subset_n.stop * 0.8 / 2), int(subset_n.stop * 0.8)]

num_feat_vector = c.performance[feattype]["num_feat_vector"]
if subset_feat.stop is not None:
    num_feat_vector = [s for s in num_feat_vector if s <= subset_feat.stop]
    if len(num_feat_vector) < 2:
        num_feat_vector = [int(subset_feat.stop / 2), subset_feat.stop]

folds = c.performance["folds"]

if not (subset_n == slice(None) and subset_feat == slice(None)):
    suffix_str = "_SUBSAMPLE"
else:
    suffix_str = ""
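
The trimming logic in the cell above can be seen in isolation in the sketch below. `trim_sizes` is a hypothetical helper written for this illustration (the notebook inlines the logic), and the size values are invented rather than taken from the config. The notebook scales the fallback by 0.8 for sample sizes and leaves it unscaled for feature counts, which the `fallback_scale` argument mimics.

```python
def trim_sizes(size_vector, stop, fallback_scale=1.0):
    """Keep sizes <= stop; if fewer than two remain, fall back to two sizes."""
    if stop is None:
        # No subsetting requested: use the configured sizes unchanged.
        return list(size_vector)
    kept = [s for s in size_vector if s <= stop]
    if len(kept) < 2:
        kept = [int(stop * fallback_scale / 2), int(stop * fallback_scale)]
    return kept

sizes = [500, 1000, 5000, 20000, 64000]  # hypothetical configured sizes

print(trim_sizes(sizes, None))                      # all five sizes
print(trim_sizes(sizes, 2000))                      # [500, 1000]
print(trim_sizes(sizes, 600, fallback_scale=0.8))   # [240, 480]
```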

In [None]:
for label in labels_to_run:

    print("*** Running regressions for: {}".format(label))

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label, feattype=feattype)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP

    if c_app["logged"]:
        bounds = np.array([c_app["us_bounds_log_pred"]])
    else:
        bounds = np.array([c_app["us_bounds_pred"]])

    if fixed_lambda:
        best_lambda_fpath_base = join(
            this_cfg.fig_dir_sec,
            f"r2_score_vs_{{type}}size_{label}_{c_app['variable']}_{this_cfg.full_suffix}.data",
        )
        best_lambda_fpath_numSamp = best_lambda_fpath_base.format(type="train")
        best_lambda_fpath_numFeat = best_lambda_fpath_base.format(type="feat")
    else:
        best_lambda_fpath_numSamp = best_lambda_fpath_numFeat = None

    ## solver settings shared by both experiments
    solver_kwargs_base = {
        "return_preds": True,
        "svd_solve": False,
        "clip_bounds": bounds,
    }

    solver_kwargs_numSamp = {
        **solver_kwargs_base,
        "lambdas": io.get_lambdas(
            c, label, best_lambda_fpath=best_lambda_fpath_numSamp
        ),
    }
    solver_kwargs_numFeat = {
        **solver_kwargs_base,
        "lambdas": io.get_lambdas(
            c, label, best_lambda_fpath=best_lambda_fpath_numFeat
        ),
    }

    ## get X and Y for the training data (the other return values are discarded)
    this_X, _, this_Y, _, _, _ = parse.merge_dropna_transform_split_train_test(
        this_cfg, label, X[sampling_type], X[sampling_type].iloc[:, :2]
    )

    ## subset
    this_X = this_X[subset_n, subset_feat]
    this_Y = this_Y[subset_n]

    ## --------------------------------------------------
    ## Performance vs. train sample size
    ## --------------------------------------------------
    (
        idxs_by_num_samples,
        results_by_num_samples,
        predictions_by_num_samples,
        num_samples_taken,
    ) = model_experiments.performance_by_num_train_samples(
        this_X, this_Y, num_samp_vector, folds, **solver_kwargs_numSamp
    )
    best_lambdas_by_num_samples = np.asarray(solver_kwargs_numSamp["lambdas"])[
        idxs_by_num_samples.squeeze()
    ]

    ## --------------------------------------------------
    ## Performance vs. number of features
    ## --------------------------------------------------
    (
        idxs_by_num_feat,
        results_by_num_feat,
        preds_by_num_feat,
    ) = model_experiments.performance_by_num_features(
        this_X, this_Y, num_feat_vector, num_folds=folds, **solver_kwargs_numFeat
    )
    best_lambdas_by_num_feat = np.asarray(solver_kwargs_numFeat["lambdas"])[
        idxs_by_num_feat.squeeze()
    ]

    ## --------------------------------------------------
    ## Plotting and saving
    ## --------------------------------------------------
    print(f"Plotting and saving output to {this_cfg.fig_dir_sec}...")
    for crit in ["mse", "r2_score"]:
        if len(this_Y.shape) == 1:
            crits = [crit]
        else:
            crits = [crit] * this_Y.shape[1]
        for results, best_lambdas, size_tag, num_vector, xtitle in (
            (
                results_by_num_samples,
                best_lambdas_by_num_samples,
                "_vs_trainsize",
                num_samples_taken,
                "train set size",
            ),
            (
                results_by_num_feat,
                best_lambdas_by_num_feat,
                "_vs_featsize",
                num_feat_vector,
                "number of features",
            ),
        ):
            prefix_str = crit + size_tag + suffix_str
            plots.metrics_vs_size(
                results,
                best_lambdas,
                num_folds=folds,
                val_names=c_app["variable"],
                num_vector=num_vector,
                xtitle=xtitle,
                crits=crits,
                app_name=label,
                prefix=prefix_str,
                suffix=this_cfg.full_suffix,
                save_dir=this_cfg.fig_dir_sec,
                overwrite=overwrite,
            )
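
Inside the loop, the index arrays returned by the experiment functions are turned into lambda values by indexing into the lambda grid. The sketch below shows that step on its own. The grid values and the shape of `best_idxs` (one index per training size, with a trailing singleton dimension) are assumptions made for this example; the real arrays may be shaped differently.

```python
import numpy as np

# Hypothetical lambda grid and best-index array.
lambdas = np.asarray([1e-2, 1e0, 1e2, 1e4])
best_idxs = np.array([[3], [2], [2], [1]])

# squeeze() drops the singleton dimension so the result is a flat
# array with one selected lambda per training size.
best_lambdas = lambdas[best_idxs.squeeze()]
print(best_lambdas.shape)  # (4,)
```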