# Model selection: Logs and levels

This notebook makes and evaluates predictions for each task under two models: the first log-transforms the outcome variable, while the second keeps the outcome variable in levels.
The notebook creates the results shown in Supplementary Table S3 and saves them to `root/results/tables/TableS3/LogsVsLevels.csv`.

## Settings

Here are the settings you can adjust when running this notebook:
- ``num_threads``: If running on a multi-core machine, change this from ``None`` to an ``int`` in order to set the max number of threads to use
- ``subset_[n,feat]``: If you want to subset the training set data for quick tests/debugging, specify that here using the `slice` object. `slice(None)` implies no subsetting of the ~80k observations for each label that are in the training set. `subset_n` slices observations; `subset_feat` subsets features.
- ``overwrite``: By default (``None``), this code reads the `MOSAIKS_OVERWRITE` environment variable; if that variable is unset, it will raise an error if the file you are saving already exists. If you would like to disable that and overwrite existing data files, change `overwrite` to `True`.
- ``labels_to_run``: By default, this notebook will loop through all the labels. If you would like, you can reduce this list to only loop through a subset of them, by changing ``"all"`` to a list of task names, e.g. ``["housing", "treecover"]``
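
As a rough illustration of how the `subset_n` and `subset_feat` slices behave, the sketch below indexes a small stand-in array. The array `X_demo`, the vector `y_demo`, and their sizes are hypothetical and chosen only for readability; the real feature matrix is far larger.

```python
import numpy as np

# Hypothetical stand-in data: 5 observations, 4 features
X_demo = np.arange(20).reshape(5, 4)
y_demo = np.arange(5)

# slice(None) is equivalent to ":" and keeps everything
full = X_demo[slice(None), slice(None)]

# Keep the first 3 observations and the first 2 features for a quick test
subset_n_demo = slice(0, 3)
subset_feat_demo = slice(0, 2)
X_small = X_demo[subset_n_demo, subset_feat_demo]
y_small = y_demo[subset_n_demo]  # labels are sliced along observations only
```

Because the same `subset_n` slice is applied to both features and labels, the rows stay aligned.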

In [None]:
num_threads = None

subset_n = slice(None)
subset_feat = slice(None)

overwrite = None

labels_to_run = "all"

### Imports

In [None]:
%load_ext autoreload
%autoreload 2

import os
from pathlib import Path

from mosaiks import transforms
from mosaiks.utils import OVERWRITE_EXCEPTION
from mosaiks.utils.imports import *
from sklearn import metrics
from threadpoolctl import threadpool_limits

if num_threads is not None:
    threadpool_limits(num_threads)
    os.environ["NUMBA_NUM_THREADS"] = str(num_threads)

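if overwrite is None:
    # Environment variables are strings, so parse the value explicitly:
    # a bare os.getenv(...) would treat "False" or "0" as truthy.
    overwrite = os.getenv("MOSAIKS_OVERWRITE", "").lower() in ("1", "true", "yes")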
if labels_to_run == "all":
    labels_to_run = c.app_order

solver = solve.ridge_regression

out_dir = Path(c.res_dir, "tables", "TableS3")
out_dir.mkdir(exist_ok=True, parents=True)

## Load Random Feature Data

This loads our feature matrix `X` and the matching lat/lon coordinates for both the population-weighted (POP) and uniform-at-random (UAR) samples.

In [None]:
X = {}
latlons = {}

X["UAR"], latlons["UAR"] = io.get_X_latlon(c, "UAR")
X["POP"], latlons["POP"] = io.get_X_latlon(c, "POP")

## Run regressions

The following loop will:
1. Load the appropriate labels
2. Merge them with the feature matrix
3. Remove test set observations
4. Run two sets of ridge regressions on the training/validation set, sweeping over a range of possible regularization parameters using 5-fold cross-validation and clipping predictions to pre-specified bounds. The first set uses the outcome variable in levels; the second set uses the outcome variable in logs.
5. Store two R2 values per label: one from the levels model and one from the logs model.
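
Step 4 can be sketched with scikit-learn stand-ins. This is not the implementation behind `solve.kfold_solve` or `solve.ridge_regression`, whose internals are not shown in this notebook; the synthetic data, the `lambdas` grid, the `clip_bounds` values, and the helper `cv_r2` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not MOSAIKS features or labels)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ rng.normal(size=10) + rng.normal(scale=0.1, size=200)

lambdas = [1e-3, 1e-1, 1e1]  # hypothetical regularization grid
clip_bounds = (y_demo.min(), y_demo.max())  # hypothetical prediction bounds


def cv_r2(X, y, lam, n_folds=5):
    """Pool clipped out-of-fold predictions and score them with one R2."""
    preds = np.empty_like(y)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        preds[val_idx] = np.clip(model.predict(X[val_idx]), *clip_bounds)
    return r2_score(y, preds)


# Sweep the grid and keep the penalty with the highest cross-validated R2
scores = {lam: cv_r2(X_demo, y_demo, lam) for lam in lambdas}
best_lambda = max(scores, key=scores.get)
```

Running the same sweep a second time on a log-transformed outcome, with its own clipping bounds, would give the logs-model counterpart.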

**Note**: We drop observations for which our labels are missing. For one label (nightlights) we drop an outlier that was not dropped during dataset pre-processing. For another (elevation), we convert a small number of negative values to zero so that the log transformation is defined. For all variables that we model in log-space, we first add 1 to deal with 0-valued observations (except for housing, which has no zeros).
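
The clipping and add-one steps described in the note can be sketched as follows. This is only an illustration with made-up values; the actual transformation is performed by `transforms.log_all`, whose internals are not shown here and may differ.

```python
import numpy as np

# Made-up outcome values, including a negative and a zero
y_demo = np.array([-2.0, 0.0, 1.0, 9.0])

# Clip negative values to zero (as done for elevation)
y_clipped = np.where(y_demo < 0, 0.0, y_demo)

# Add 1 before taking logs so that zero-valued observations map to 0
log_y = np.log(y_clipped + 1)

# For an outcome with no zeros (as stated for housing), the plain log suffices
y_positive = np.array([1.0, 10.0, 100.0])
log_y_positive = np.log(y_positive)
```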

In [None]:
resultsDict = {}
for label in labels_to_run:

    print("*** Running regressions for: {}".format(label))

    ## Set some label-specific variables
    this_cfg = io.get_filepaths(c, label)
    c_app = getattr(this_cfg, label)
    sampling_type = c_app["sampling"]  # UAR or POP

    # Set solver arguments: levels model
    solver_kwargs = {
        "lambdas": c_app["lambdas"],
        "return_preds": True,
        "svd_solve": False,
        "clip_bounds": np.array([c_app["us_bounds_pred"]]),
    }

    # Set solver arguments: logs model
    solver_kwargs_log = {
        "lambdas": c_app["lambdas"],
        "return_preds": True,
        "svd_solve": False,
        "clip_bounds": np.array([c_app["us_bounds_log_pred"]]),
    }

    ## get labels
    print("Loading labels...")
    this_Y = io.get_Y(this_cfg, c_app["colname"])

    # For elevation, clip below-sea-level observations to zero so that logs are
    # defined (we only apply this clipping for this single test)
    if label == "elevation":
        this_Y[this_Y < 0] = 0

    ## merge X and Y accounting for different ordering
    ## and the sampling type
    print("Merging labels and features...")
    this_Y, this_X, this_latlons = parse.merge(
        this_Y, X[sampling_type], latlons[sampling_type]
    )

    ## Drop missing observations as needed
    this_X, this_Y, this_latlons = transforms.dropna(
        this_X, this_Y, this_latlons, c_app
    )

    ## Split the data into the training/validation sample vs. test sample
    ## (discarding test set for now to keep memory low)
    print("Splitting training/test...")
    this_X, this_X_test, this_Y, this_Y_test = parse.split_data_train_test(
        this_X, this_Y, frac_test=this_cfg.ml_model["test_set_frac"], return_idxs=False
    )

    ## Create a logged set of outcomes for training/validation sample
    this_logY = transforms.log_all(this_Y, c_app)

    ## LEVELS model: Train model using ridge regression and 5-fold cross-validation
    ## (number of folds can be adjusted using the argument num_folds)
    print("Training levels model...")
    kfold_results = solve.kfold_solve(
        this_X[subset_n, subset_feat],
        this_Y[subset_n],
        solve_function=solver,
        num_folds=this_cfg.ml_model["n_folds"],
        return_model=True,
        **solver_kwargs
    )
    print("")

    ## LOGS model: Train model using ridge regression and 5-fold cross-validation
    ## (number of folds can be adjusted using the argument num_folds)
    print("Training logs model...")
    kfold_results_log = solve.kfold_solve(
        this_X[subset_n, subset_feat],
        this_logY[subset_n],
        solve_function=solver,
        num_folds=this_cfg.ml_model["n_folds"],
        return_model=True,
        **solver_kwargs_log
    )
    print("")

    # Get best predictions from levels model, flatten and calculate R2
    (
        best_lambda_idx_level,
        best_metrics_levels,
        best_preds_levels,
    ) = ir.interpret_kfold_results(kfold_results, crits="r2_score")
    preds_levels = np.vstack(
        [solve.y_to_matrix(i) for i in best_preds_levels.squeeze()]
    ).squeeze()
    truth_levels = np.vstack(
        [solve.y_to_matrix(i) for i in kfold_results["y_true_test"].squeeze()]
    ).squeeze()
    r2_level = metrics.r2_score(truth_levels, preds_levels)

    # Get best predictions from logs model, flatten and calculate R2
    best_lambda_idx_log, best_metrics_log, best_preds_log = ir.interpret_kfold_results(
        kfold_results_log, crits="r2_score"
    )
    preds_log = np.vstack(
        [solve.y_to_matrix(i) for i in best_preds_log.squeeze()]
    ).squeeze()
    truth_log = np.vstack(
        [solve.y_to_matrix(i) for i in kfold_results_log["y_true_test"].squeeze()]
    ).squeeze()
    r2_log = metrics.r2_score(truth_log, preds_log)

    ## Store the R2s
    resultsDict[label] = [r2_log, r2_level]
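
The loop computes each R2 by stacking the held-out predictions from all folds and scoring them once, rather than averaging per-fold scores. A minimal sketch of that pooling, using made-up per-fold arrays and `np.concatenate` in place of the `solve.y_to_matrix` helper (an assumption), might look like this:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up held-out truth and predictions for two folds of unequal size
truth_by_fold = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
preds_by_fold = [np.array([1.1, 1.9, 3.2]), np.array([3.8, 5.1])]

# Pool all folds into one vector each, then compute a single R2
truth_pooled = np.concatenate(truth_by_fold)
preds_pooled = np.concatenate(preds_by_fold)
r2_pooled = r2_score(truth_pooled, preds_pooled)
```

Note that the logs-model R2 is measured against the logged outcome and the levels-model R2 against the raw outcome, so the two values describe fit on different scales.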

## Print and save a table of the logs and levels R2s

In [None]:
# Format output as a table with one row per label, then display it
out = pd.DataFrame.from_dict(
    resultsDict, orient="index", columns=["r2_log", "r2_level"]
)
out

In [None]:
## Get save path
if (subset_n != slice(None)) or (subset_feat != slice(None)):
    subset_str = "_subset"
else:
    subset_str = ""

fn = out_dir / f"LogsVsLevels{subset_str}.csv"

if (not overwrite) and os.path.isfile(fn):
    raise OVERWRITE_EXCEPTION

# save
out.to_csv(fn, index=True)
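
The overwrite guard used above can also be written as a small reusable helper. The sketch below is hypothetical: `save_table` is not part of `mosaiks`, it raises the built-in `FileExistsError` rather than `OVERWRITE_EXCEPTION`, and the table holds placeholder values rather than real results.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_table(df, path, overwrite=False):
    """Write df to CSV, refusing to replace an existing file unless asked."""
    path = Path(path)
    if path.is_file() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True")
    df.to_csv(path, index=True)


# Placeholder values only, not real results
placeholder = {"task_a": [0.0, 1.0], "task_b": [1.0, 0.0]}
table = pd.DataFrame.from_dict(
    placeholder, orient="index", columns=["r2_log", "r2_level"]
)

with tempfile.TemporaryDirectory() as tmp:
    demo_fn = Path(tmp) / "LogsVsLevels_subset.csv"
    save_table(table, demo_fn)  # first save succeeds
    try:
        save_table(table, demo_fn)  # second save is blocked by default
        second_save_blocked = False
    except FileExistsError:
        second_save_blocked = True
    save_table(table, demo_fn, overwrite=True)  # explicit overwrite succeeds
    reloaded = pd.read_csv(demo_fn, index_col=0)
```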