# Run Regressions

This short notebook runs ridge regressions on the pre-featurized data matrix, using 5-fold cross-validation, for all 12 ACS labels. It saves `.data` files that the notebook `2_make_fig.ipynb`, in this same folder, then opens to make the plot.

## Settings

Here are the settings you can adjust when running this notebook:
- ``num_threads``: If running on a multi-core machine, change this from ``None`` to an ``int`` to set the maximum number of threads to use.
- ``subset_[n,feat]``: If you want to subset the training set data for quick tests/debugging, specify that here using the `slice` object. `slice(None)` implies no subsetting of the ~80k observations for each label that are in the training set. `subset_n` slices observations; `subset_feat` subsets features.
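- ``overwrite``: By default, this code will raise an error if the file you are saving already exists. If you would like to disable that and overwrite existing data files, change `overwrite` to `True`. If `overwrite` is left as `None`, the `MOSAIKS_OVERWRITE` environment variable is consulted instead.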
- ``labels_to_run``: By default, this notebook will loop through all the ACS labels. If you would like, you can reduce this list to only loop through a subset of them.
- ``intercept``: Whether to add an intercept to (a.k.a. center) the data. We used this for testing and confirmed that, with a large enough dataset and high enough dimensionality, the intercept does not help.
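As a quick illustration of how the `subset_n` and `subset_feat` settings behave, here is a minimal sketch on toy arrays. The array sizes are made up for the example; only the `slice` semantics carry over to the real training data.

```python
import numpy as np

# Toy stand-ins for the real arrays: 6 observations, 4 features.
X = np.arange(24).reshape(6, 4)
Y = np.arange(6)

# slice(None) is equivalent to ":" and keeps everything.
subset_n = slice(None)
subset_feat = slice(None)
X_full = X[subset_n, subset_feat]

# Keep the first 3 observations and the first 2 features for a quick test.
subset_n = slice(0, 3)
subset_feat = slice(0, 2)
X_small = X[subset_n, subset_feat]
Y_small = Y[subset_n]
print(X_full.shape, X_small.shape, Y_small.shape)
```

Note that the same `subset_n` must be applied to the features, the labels, and the lat/lon array so that rows stay aligned.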

In [None]:
num_threads = None

subset_n = slice(None)
subset_feat = slice(None)

overwrite = None

labels_to_run = [
    "B08303",
    "B15003",
    "B19013",
    "B19301",
    "C17002",
    "B22010",
    "B25071",
    "B25001",
    "B25002",
    "B25035",
    "B25017",
    "B25077",
]

intercept = False

### Imports

In [None]:
%load_ext autoreload
%autoreload 2

import os
import pickle
from os.path import isfile

# Import necessary packages
from mosaiks import transforms
from mosaiks.utils.imports import *
from mosaiks.utils import OVERWRITE_EXCEPTION
from threadpoolctl import threadpool_limits

if num_threads is not None:
    threadpool_limits(num_threads)
    os.environ["NUMBA_NUM_THREADS"] = str(num_threads)

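if overwrite is None:
    # Environment variables are strings, so parse the value explicitly;
    # otherwise a value such as "False" or "0" would be treated as truthy.
    env_val = os.getenv("MOSAIKS_OVERWRITE", "")
    overwrite = env_val.strip().lower() in ("1", "true", "yes")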
if labels_to_run == "all":
    labels_to_run = c.app_order

solver = solve.ridge_regression

The following pattern will be used as the filename to save the out-of-sample predictions for each label. If you wish to try your own analysis and not overwrite the data as it exists, **you must change the name**. Regardless, the `2_make_fig.ipynb` notebook will look for model prediction `.data` files matching the default pattern below.

In [None]:
save_patt = (
    "outcomes_scatter_obsAndPred_{{app}}_{{variable}}_CONTUS_16_640_{{sampling_type}}_{sampling_num}_"
    "{sampling_seed}_random_features_{patch_size}_{feature_seed}{{subset}}.data"
).format(
    sampling_num=c.sampling["n_samples"],
    sampling_seed=c.sampling["seed"],
    patch_size=c.features["random"]["patch_size"],
    feature_seed=c.features["random"]["seed"],
    intercept=intercept,
)
save_patt
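The pattern above is filled in two stages: the single-brace fields are resolved immediately from the config, while the double-brace fields are left as placeholders to be filled per label inside the regression loop. A minimal sketch of that mechanism, using a shortened pattern and made-up values (`100000` and `"income"` are hypothetical, not the real config settings):

```python
# Stage 1: fill the values known up front. Doubled braces are escapes,
# so "{{app}}" becomes the literal placeholder "{app}" after this call.
pattern = "obsAndPred_{{app}}_{{variable}}_{sampling_num}{{subset}}.data".format(
    sampling_num=100000,  # hypothetical value for illustration
)
print(pattern)

# Stage 2: fill the per-label values, as the regression loop does.
filename = pattern.format(app="B19013", variable="income", subset="_subset")
print(filename)
```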

## Load X Matrix

This loads our feature matrix `X` for both the population-weighted (POP) and uniform-at-random (UAR) samples.

In [None]:
X = {}
latlons = {}

X["UAR"], latlons["UAR"] = io.get_X_latlon(c, "UAR")
X["POP"], latlons["POP"] = io.get_X_latlon(c, "POP")

## Run regressions

The following loop will:
1. Load the appropriate labels
2. Merge them with the feature matrix
3. Remove test set observations
4. Run a ridge regression on the training/validation set, sweeping over a range of possible regularization parameters using 5-fold cross-validation, and clip predictions to pre-specified bounds.
5. Save the out-of-sample predictions and the observations for use in the ACS figure.
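To make step 4 concrete, the following is a rough, NumPy-only sketch of a ridge sweep with k-fold cross-validation and clipped predictions. It is not the `mosaiks` implementation: the helper names (`ridge_fit`, `r2`, `kfold_ridge`) are hypothetical, the model has no intercept, and the score is a pooled R² over all out-of-fold predictions, which may differ from how the library aggregates metrics across folds.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge weights: solve (X'X + lam*I) w = X'y (no intercept)."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kfold_ridge(X, y, lambdas, clip_bounds, n_folds=5, seed=0):
    """Return (best lambda index, out-of-fold predictions, test-index order)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    order = np.concatenate(folds)  # order in which predictions are stacked
    scores, all_preds = [], []
    for lam in lambdas:
        preds = []
        for test_idx in folds:
            train_idx = np.setdiff1d(order, test_idx)
            w = ridge_fit(X[train_idx], y[train_idx], lam)
            # Clip out-of-sample predictions to the allowed range.
            preds.append(np.clip(X[test_idx] @ w, clip_bounds[0], clip_bounds[1]))
        preds = np.concatenate(preds)
        all_preds.append(preds)
        scores.append(r2(y[order], preds))
    best = int(np.argmax(scores))
    return best, all_preds[best], order

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 10))
y_toy = X_toy @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
lambdas = [1e-3, 1e-1, 1e1, 1e3]
best_idx, preds, order = kfold_ridge(X_toy, y_toy, lambdas, clip_bounds=(-10.0, 10.0))
print(best_idx, preds.shape)
```

Because predictions are stacked fold by fold, any per-observation metadata (such as lat/lon) has to be reordered with the same `order` index before being paired with them.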


In [None]:
# need to run each label separately b/c of different missingness patterns across labels
for label in labels_to_run:
    print("*** Running regressions for: {}".format(label))

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label, "random", True)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP
    print(sampling_type)

    ## Get save path
    if (subset_n != slice(None)) or (subset_feat != slice(None)):
        subset_str = "_subset"
    else:
        subset_str = ""
    save_path = join(
        c.fig_dir,
        "primary_analysis",
        save_patt.format(
            app=label,
            variable=c_app["variable"],
            sampling_type=sampling_type,
            subset=subset_str,
        ),
    )

    if (not overwrite) and isfile(save_path):
        raise OVERWRITE_EXCEPTION

    ## get X, Y, latlon values of training data
    (
        this_X,
        _,
        this_Y,
        _,
        this_latlons,
        _,
    ) = parse.merge_dropna_transform_split_train_test(
        this_cfg, label, X[sampling_type], latlons[sampling_type], ACS=True
    )

    ## subset
    this_latlons = this_latlons[subset_n]
    this_X = this_X[subset_n, subset_feat]
    this_Y = this_Y[subset_n]

    ## Set solver arguments
    bounds_this = [np.array(c_app["us_bounds_pred"])]
    print("lambdas are ", c_app["lambdas"])
    solver_kwargs = {
        "lambdas": c_app["lambdas"],
        "return_preds": True,
        "svd_solve": False,
        "clip_bounds": bounds_this,
        "intercept": intercept,
    }

    ## Train model using ridge regression and 5-fold cross-validation
    ## (the number of folds is read from n_folds in the ml_model config)
    print("Training model...")
    kfold_results = solve.kfold_solve(
        this_X,
        this_Y,
        solve_function=solver,
        num_folds=this_cfg.ml_model["n_folds"],
        return_model=True,
        **solver_kwargs
    )
    print("")

    ## Store the metrics and the predictions from the best performing model
    best_lambda_idx, best_metrics, best_preds = ir.interpret_kfold_results(
        kfold_results, "r2_score", hps=[("lambdas", solver_kwargs["lambdas"])]
    )

    ## combine out-of-sample predictions over folds
    preds = np.vstack([solve.y_to_matrix(i) for i in best_preds.squeeze()]).squeeze()
    truth = np.vstack(
        [solve.y_to_matrix(i) for i in kfold_results["y_true_test"].squeeze()]
    ).squeeze()

    # get latlons in same shuffled, cross-validated order
    ll = this_latlons[
        np.hstack([test for train, test in kfold_results["cv"].split(this_latlons)])
    ]

    data = {
        "truth": truth,
        "preds": preds,
        "lon": ll[:, 1],
        "lat": ll[:, 0],
        "bounds": bounds_this,
        "best_lambda_idx": best_lambda_idx[0][0],
    }

    print("Saving predictions to {}".format(save_path))
    with open(save_path, "wb") as f:
        pickle.dump(data, f)
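Each saved `.data` file is a pickled dictionary with the keys written by the loop above. As a small sketch of the round trip, the example below writes and reloads a toy payload with the same keys; the values and the `example.data` filename are made up for illustration, and plain lists stand in for the NumPy arrays.

```python
import os
import pickle
import tempfile

# Toy payload with the same keys the loop writes; values are made up.
data_toy = {
    "truth": [1.0, 2.0, 3.0],
    "preds": [1.1, 1.9, 3.2],
    "lon": [-122.3, -97.7, -71.1],
    "lat": [47.6, 30.3, 42.4],
    "bounds": [[0.0, 10.0]],
    "best_lambda_idx": 2,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.data")  # hypothetical filename
    with open(path, "wb") as f:
        pickle.dump(data_toy, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
```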

The cell below checks whether we hit the bounds of the search space for the regularization parameter. For each label, it prints the index of the chosen parameter value alongside the index of the highest value. If the chosen index is either 0 or equal to the highest index, you should extend the search space in the `config.py` file.

In [None]:
for label in labels_to_run:
    print(label)

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label, "random", True)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP

    ## Get save path
    if (subset_n != slice(None)) or (subset_feat != slice(None)):
        subset_str = "_subset"
    else:
        subset_str = ""
    save_path = join(
        c.fig_dir,
        "primary_analysis",
        save_patt.format(
            app=label,
            variable=c_app["variable"],
            sampling_type=sampling_type,
            subset=subset_str,
        ),
    )

    with open(save_path, "rb") as f:
        x = pickle.load(f)
        try:
            print(x["best_lambda_idx"], len(c_app["lambdas"]) - 1)
        except KeyError:
            print("not found")
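If you would rather flag boundary hits automatically than inspect the printed indices by eye, a small helper along these lines could be used. `hits_search_boundary` is a hypothetical name, not part of `mosaiks`, and the grid size of 9 is made up for the example.

```python
def hits_search_boundary(best_idx, n_lambdas):
    """True if the selected index is the first or last entry of the grid."""
    return best_idx == 0 or best_idx == n_lambdas - 1

# Hypothetical chosen indices for a grid of 9 regularization values.
results = {}
for best_idx in (0, 3, 8):
    results[best_idx] = hits_search_boundary(best_idx, n_lambdas=9)
    print(best_idx, "extend grid" if results[best_idx] else "ok")
```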