# Run Regressions

This notebook makes and evaluates predictions on the holdout test set. It trains a ridge regression on features from a pre-trained ResNet-152 (shown as one of the sets of bars in Fig. 3A) and saves the resulting test-set R2 values to `output/cnn_comparison/TestSetR2_resnet152_1e5_pretrained.csv` under the configured data directory (`c.data_dir`).

## Settings

Here are the settings you can adjust when running this notebook:
- ``num_threads``: If running on a multi-core machine, change this from ``None`` to an ``int`` in order to set the max number of threads to use
- ``subset_[n,feat]``: If you want to subset the training set data for quick tests/debugging, specify that here using the `slice` object. `slice(None)` implies no subsetting of the ~80k observations for each label that are in the training set. `subset_n` slices observations; `subset_feat` subsets features.
- ``overwrite``: When `overwrite` is `False`, this code raises an error if the file you are saving already exists. Set it to `True` to overwrite existing data files, or to `None` to read the value from the `MOSAIKS_OVERWRITE` environment variable.
- ``fixed_lambda``: If `True`, only run the lambda that was previously chosen. This raises an error if you have not already generated a results file.
- ``labels_to_run``: By default, this notebook will loop through all the labels. If you would like, you can reduce this list to only loop through a subset of them, by changing ``"all"`` to a list of task names, e.g. ``["housing", "treecover"]``
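
As a rough illustration of how the two `slice` settings behave, the sketch below applies them to a tiny made-up table stored as nested lists. The helper `apply_subset` and the data are hypothetical and exist only for this example; the notebook itself indexes its arrays directly as `this_X[subset_n, subset_feat]`.

```python
def apply_subset(rows, subset_n, subset_feat):
    """Keep the observations selected by subset_n and the features selected by subset_feat."""
    return [row[subset_feat] for row in rows[subset_n]]


# Hypothetical table: 4 observations x 3 features
toy_X = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]

# slice(None) is equivalent to [:] and keeps everything
full = apply_subset(toy_X, slice(None), slice(None))

# First two observations, first feature only: a quick debugging subset
small = apply_subset(toy_X, slice(0, 2), slice(0, 1))

print(full == toy_X)  # True
print(small)          # [[1], [4]]
```

Note that a slice of width one, such as `slice(0, 1)`, keeps the feature dimension, whereas indexing with a plain integer would drop it.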

In [1]:
num_threads = None

subset_n = slice(None)
subset_feat = slice(None)

overwrite = True

fixed_lambda = False

labels_to_run = "all"

### Imports

In [2]:
%load_ext autoreload
%autoreload 2

import io as b_io
import os
from pathlib import Path

import dill

# Import necessary packages
from mosaiks import transforms
from mosaiks.utils import OVERWRITE_EXCEPTION
from mosaiks.utils.imports import *
from threadpoolctl import threadpool_limits

if num_threads is not None:
    threadpool_limits(num_threads)
    os.environ["NUMBA_NUM_THREADS"] = str(num_threads)

if overwrite is None:
    overwrite = os.getenv("MOSAIKS_OVERWRITE", False)
if labels_to_run == "all":
    labels_to_run = c.app_order

solver = solve.ridge_regression

## Load Random Feature Data

This loads our features from a pre-trained ResNet-152 for both the POP (population-weighted) and UAR (uniform-at-random) samples.

In [3]:
X = {}
latlons = {}

for sample in ["UAR", "POP"]:
    # path to features X for this sample (UAR or POP)
    features_path = (
        Path(c.features_dir)
        / f"{c.features['pretrained']['model_type']}_pretrained_{c.grid['area']}_{c.images['zoom_level']}_{c.images['n_pixels']}_{sample}.pkl"
    )
    with open(features_path, "rb") as f:
        data = dill.load(f)
    features = data["X"].astype(np.float64)
    latlons_samp = data["latlon"]
    ids_x = data["ids_X"]

    X[sample] = pd.DataFrame(features, index=ids_x)
    latlons[sample] = pd.DataFrame(latlons_samp, index=ids_x, columns=["lat", "lon"])

## Run regressions

The following loop will:
1. Load the appropriate labels
2. Merge them with the feature matrix
3. Remove test set observations
4. Run a ridge regression on the training/validation set, sweeping over a range of possible regularization parameters using 5-fold Cross-Validation and clipping predictions to pre-specified bounds.
5. Retrain the model on the entire training set using the optimal hyperparameters found in step 4, then make predictions on the test set.
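
Steps 3 to 5 can be sketched in a few lines of plain Python. This is a simplified illustration, not the `solve.kfold_solve` or `solve.ridge_regression` implementation: it assumes a single feature with no intercept, uses interleaved folds, and runs on made-up data. All function names here are hypothetical.

```python
def ridge_fit_1d(x, y, lam):
    """Closed-form ridge weight for a single feature with no intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def predict(w, x, bounds):
    """Linear prediction clipped to (lower, upper) bounds."""
    lo, hi = bounds
    return [min(max(w * v, lo), hi) for v in x]


def kfold_select_lambda(x, y, lambdas, num_folds, bounds):
    """Return the lambda with the highest mean validation R2, plus all mean scores."""
    n = len(x)
    # Interleaved folds: fold k holds out indices k, k + num_folds, ...
    folds = [set(range(k, n, num_folds)) for k in range(num_folds)]
    mean_r2 = []
    for lam in lambdas:
        scores = []
        for held_out in folds:
            train = [i for i in range(n) if i not in held_out]
            val = sorted(held_out)
            w = ridge_fit_1d([x[i] for i in train], [y[i] for i in train], lam)
            preds = predict(w, [x[i] for i in val], bounds)
            scores.append(r2_score([y[i] for i in val], preds))
        mean_r2.append(sum(scores) / num_folds)
    best_idx = max(range(len(lambdas)), key=lambda i: mean_r2[i])
    return lambdas[best_idx], mean_r2


# Made-up data: y is roughly 2 * x with a small alternating perturbation
x_all = [i / 10 for i in range(1, 51)]
y_all = [2 * v + (0.1 if i % 2 else -0.1) for i, v in enumerate(x_all)]

# Step 3: remove test set observations (here, every fifth point)
test_idx = set(range(4, 50, 5))
x_train = [v for i, v in enumerate(x_all) if i not in test_idx]
y_train = [v for i, v in enumerate(y_all) if i not in test_idx]
x_test = [v for i, v in enumerate(x_all) if i in test_idx]
y_test = [v for i, v in enumerate(y_all) if i in test_idx]

# Step 4: sweep lambdas with 5-fold cross-validation and clipped predictions
lambdas = [1e-3, 1.0, 1e3]
bounds = (0.0, 12.0)
best_lam, mean_r2 = kfold_select_lambda(x_train, y_train, lambdas, 5, bounds)

# Step 5: retrain on the full training set and score on the test set
w_final = ridge_fit_1d(x_train, y_train, best_lam)
test_r2 = r2_score(y_test, predict(w_final, x_test, bounds))

print("best lambda:", best_lam)
print("test R2:", round(test_r2, 3))
```

The test set is split off before any lambda is chosen, so the reported test R2 is not influenced by the hyperparameter search.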

**Note**: We drop observations for which our labels are missing. In one label (nightlights) we drop an outlier that was not dropped during dataset pre-processing. For all variables that we model in log-space, we first add 1 to deal with 0-valued observations (except for housing, for which we do not have any 0's).
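
The add-one step for log-space variables can be illustrated as follows. This is a minimal sketch that assumes the natural logarithm (the base is not stated here); the helper names and label values are made up for the example.

```python
import math


def to_log_space(values):
    """Model log(1 + y) so that zero-valued observations stay finite."""
    return [math.log1p(v) for v in values]


def from_log_space(log_values):
    """Invert the transform: exp(z) - 1."""
    return [math.expm1(z) for z in log_values]


y = [0.0, 9.0, 99.0]  # made-up label values, including a zero
z = to_log_space(y)
y_back = from_log_space(z)

print(z[0])  # 0.0, whereas math.log(0.0) would raise ValueError
```

A label with no zeros, such as housing, can be logged directly without the offset.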

In [7]:
resultsDict = {}
for label in labels_to_run:

    print("*** Running regressions for: {}".format(label))

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP

    best_lambda_fpath = join(c.fig_dir_sec, "best_lambda_tl.npz")
    if fixed_lambda:
        lambdas = io.get_lambdas(c, label, best_lambda_fpath=best_lambda_fpath)
    else:
        lambdas = io.get_lambdas(c, label, best_lambda_fpath=None)

    if c_app["logged"]:
        bounds = np.array([c_app["us_bounds_log_pred"]])
    else:
        bounds = np.array([c_app["us_bounds_pred"]])

    # Set solver arguments
    solver_kwargs = {
        "lambdas": lambdas,
        "return_preds": True,
        "svd_solve": False,
        "clip_bounds": bounds,
    }

    # Expand possible lambdas for this transfer learning feature set as needed so that
    # the optimal lambda selected is not hitting the bounds of possible lambdas.
    if (not fixed_lambda) and (label in ["income", "roads"]):
        solver_kwargs["lambdas"] = np.logspace(-3, 4, 9)

    ## get X, Y, latlon values of training data
    (
        this_X,
        this_X_test,
        this_Y,
        this_Y_test,
        _,
        _,
    ) = parse.merge_dropna_transform_split_train_test(
        this_cfg, label, X[sampling_type], latlons[sampling_type]
    )

    ## subset
    this_X = this_X[subset_n, subset_feat]
    this_X_test = this_X_test[:, subset_feat]
    this_Y = this_Y[subset_n]

    ## Train model using ridge regression and 5-fold cross-validation
    ## (the number of folds is passed as num_folds, read here from this_cfg.ml_model["n_folds"])
    print("Training model...")
    kfold_results = solve.kfold_solve(
        this_X,
        this_Y,
        solve_function=solver,
        num_folds=this_cfg.ml_model["n_folds"],
        return_model=True,
        **solver_kwargs
    )
    print("")

    ## Store the metrics and the predictions from the best performing model
    best_lambda_idx, best_metrics, best_preds = ir.interpret_kfold_results(
        kfold_results, "r2_score", hps=[("lambdas", solver_kwargs["lambdas"])]
    )
    best_lambdas = np.array(
        [solver_kwargs["lambdas"][np.asarray(best_lambda_idx).squeeze()]]
    )

    # save best lambdas
    if subset_n == slice(None) and subset_feat == slice(None):
        np.savez(best_lambda_fpath, best_lambda=best_lambdas)

    ## Retrain a model using this best lambda:
    holdout_results = solve.single_solve(
        this_X,
        this_X_test,
        this_Y,
        this_Y_test,
        lambdas=best_lambdas,
        return_preds=True,
        return_model=False,
        clip_bounds=bounds,
        svd_solve=False,
    )

    ## Store the R2
    resultsDict[label] = holdout_results["metrics_test"][0][0][0]["r2_score"]

*** Running regressions for: treecover
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: elevation
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: population
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: nightlights
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: income
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: roads
Loading labels...
Merging labels and features...
Splitting training/test...
Training model...
on fold (of 5): 1 2 3 4 5 
*** Running regressions for: housing
Loading

## Print and save a table of the test set R2s

In [9]:
## Get save path
if (subset_n != slice(None)) or (subset_feat != slice(None)):
    subset_str = "_subset"
else:
    subset_str = ""

fn = Path(
    c.data_dir
    + "/output/cnn_comparison/TestSetR2_resnet152_1e5_pretrained"
    + subset_str
    + ".csv"
)

if (not overwrite) and subset_str == "" and fn.is_file():
    raise OVERWRITE_EXCEPTION

# save
with open(fn, "w") as f:
    for key in resultsDict.keys():
        f.write("%s,%s\n" % (key, resultsDict[key]))
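
The saved file is a headerless two-column CSV of label and R2. As a hedged sketch of the same save pattern, the example below writes such a file with an overwrite guard and reads it back. It uses the standard-library `csv` module, `FileExistsError` as a stand-in for `OVERWRITE_EXCEPTION`, and placeholder numbers; the helper names and the file name are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def save_r2_table(results, path, overwrite=False):
    """Write label,r2 rows; refuse to clobber an existing file unless overwrite is set."""
    path = Path(path)
    if path.is_file() and not overwrite:
        raise FileExistsError(f"{path} exists and overwrite is False")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(results.items())


def load_r2_table(path):
    """Read the two-column file back into a {label: r2} dict."""
    with open(path, newline="") as f:
        return {label: float(r2) for label, r2 in csv.reader(f)}


# Placeholder numbers for illustration only, not results from this notebook
placeholder = {"treecover": 0.5, "housing": 0.25}

with tempfile.TemporaryDirectory() as tmp:
    fn = Path(tmp) / "TestSetR2_example.csv"
    save_r2_table(placeholder, fn)
    loaded = load_r2_table(fn)
    try:
        save_r2_table(placeholder, fn)  # second write without overwrite=True
        blocked = False
    except FileExistsError:
        blocked = True

print(loaded == placeholder)  # True
print(blocked)                # True
```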