# Prediction Analysis for WSL

Lukas Graz  
February 13, 2025

In [None]:
source("R/data_prep.R")


Number of matches per filter criteria (not disjoint)
  Headphone  PRS_all_NA    Distance Activity_NA    Duration  HMNoise_NA 
        303         226         221         102          96          96 
JourneyTime 
         20 
Keep  1494 of 2206 observations

Imputing PRS_orig_vars

TODO: Remove PCA?

Imputing mediators & GIS_vars for MLR

## Setup

In [None]:
library(mlr3verse, quietly = TRUE)
library(GGally, quietly = TRUE, warn.conflicts = FALSE)



We test how well each link in the chain GIS_vars -\> Mediators -\> PRS_vars can be predicted, using:

-   Linear models
-   Random forests (default parameters)
-   XGBoost (with parameter tuning)
-   LASSO (not shown because it performed worse)
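The helper `get_benchi_table()` used in the sections below is defined outside this document, so its exact behaviour is not shown here. As a rough illustration of how such a comparison could be set up in mlr3, here is a minimal sketch. The function name `benchmark_table_sketch`, the 3-fold cross-validation, and the choice of R² (`regr.rsq`) as the score are all assumptions, not the author's implementation; the demo uses `mtcars` rather than the study data.

```r
library(mlr3verse)  # loads mlr3 and mlr3learners, as in the Setup cell

# Hypothetical stand-in for get_benchi_table(): benchmark several learners on a
# list of regression tasks and return a task-by-learner matrix of scores.
benchmark_table_sketch <- function(tasks,
                                   learner_ids = c("regr.lm", "regr.ranger"),
                                   folds = 3) {
  # One resampling run per (task, learner) combination
  design <- benchmark_grid(
    tasks = tasks,
    learners = lrns(learner_ids),
    resamplings = rsmp("cv", folds = folds)
  )
  bmr <- benchmark(design)

  # Assumption: score each run with R^2, averaged over the CV folds
  scores <- bmr$aggregate(msr("regr.rsq"))

  # Reshape to one row per task and one column per learner
  tab <- tapply(scores$regr.rsq, list(scores$task_id, scores$learner_id), mean)
  round(tab, 2)
}

# Demo on a built-in data set (not the study data)
set.seed(1)
demo_tasks <- list(
  as_task_regr(mtcars, target = "mpg", id = "mpg"),
  as_task_regr(mtcars, target = "hp", id = "hp")
)
res <- benchmark_table_sketch(demo_tasks)
res
```

Adding `"regr.xgboost"` to `learner_ids` would include XGBoost with default parameters; tuning it would need an extra step.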

**GIS Variables:**

In [None]:
GIS_vars


 [1] "LCARTIF_sqrt"   "LCFOREST_sqrt"  "HETER"          "OVDIST_sqrt"   
 [5] "VIS5K_sqrt"     "RL_NDVI"        "RL_NOISE"       "DISTKM_sqrt"   
 [9] "JNYTIME_sqrt"   "STRIMP123_sqrt" "STRIMP999_sqrt"

**Mediators:**

In [None]:
Mediator_vars


[1] "FEELNAT"  "LNOISE"   "LOC_SENS" "LOC_SOUN" "LOC_SCEN" "LOC_VISE" "LOC_VEGE"
[8] "LOC_FAUN"

### PRS ~ GIS

In [None]:
tasks_GIS <- lapply(PRS_vars, \(y) 
  as_task_regr(
    subset(Dmlr, select = c(y, GIS_vars)),
    target = y,
    id = y
  ))
get_benchi_table(tasks_GIS) 


       lm xgboost ranger
LA   0.00    0.02   0.00
BA   0.00   -0.02  -0.03
EC  -0.01   -0.05  -0.05
ES   0.05    0.04   0.03
PC1 -0.01   -0.01  -0.03
PC2  0.04    0.01   0.00
PC3  0.03    0.04   0.02
PC4 -0.01   -0.01  -0.01

GIS variables alone show poor predictive performance: every score lies between -0.05 and 0.05, regardless of learner or target.
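The negative entries in the table above are easier to read with a definition in hand. The metric is not stated in this document, so the following assumes the scores are out-of-sample R² values. Under that assumption, a score of 0 corresponds to always predicting the mean, and anything below 0 is worse than that. A minimal base-R sketch, where `rsq_oos` is a hypothetical helper name:

```r
# Out-of-sample R^2 under one common definition (an assumption here):
# 1 - SSE / SST, with SST taken around the mean of the observed values.
rsq_oos <- function(truth, response) {
  1 - sum((truth - response)^2) / sum((truth - mean(truth))^2)
}

truth <- c(1, 2, 3)
r_perfect  <- rsq_oos(truth, truth)                # perfect predictions: 1
r_mean     <- rsq_oos(truth, rep(mean(truth), 3))  # mean-only predictions: 0
r_reversed <- rsq_oos(truth, c(3, 2, 1))           # 1 - 8 / 2 = -3
c(perfect = r_perfect, mean_only = r_mean, reversed = r_reversed)
```

Read this way, scores close to zero indicate that the learners barely improve on a constant prediction.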

### PRS ~ GIS + Mediators

In [None]:
tasks_GIS_MED <- lapply(PRS_vars, \(y) 
  as_task_regr(
    subset(Dmlr, select = c(y, Mediator_vars, GIS_vars)),
    target = y,
    id = y
  ))
get_benchi_table(tasks_GIS_MED) 


      lm xgboost ranger
LA  0.22    0.25   0.24
BA  0.13    0.14   0.12
EC  0.02    0.00   0.01
ES  0.15    0.16   0.15
PC1 0.22    0.23   0.23
PC2 0.07    0.04   0.06
PC3 0.02    0.04   0.04
PC4 0.01    0.01   0.01

### PRS ~ Mediators

In [None]:
tasks_MED <- lapply(PRS_vars, \(y) 
  as_task_regr(
    subset(Dmlr, select = c(y, Mediator_vars)),
    target = y,
    id = y
  ))
get_benchi_table(tasks_MED) 


      lm xgboost ranger
LA  0.21    0.24   0.20
BA  0.14    0.13   0.09
EC  0.03    0.02  -0.03
ES  0.13    0.11   0.06
PC1 0.22    0.23   0.20
PC2 0.06    0.05   0.02
PC3 0.00   -0.02  -0.06
PC4 0.00   -0.01  -0.03

### Mediators ~ GIS

In [None]:
tasks_MED_by_GIS <- lapply(Mediator_vars, \(y) 
  as_task_regr(
    subset(Dmlr, select = c(y, GIS_vars)),
    target = y,
    id = y
  ))
get_benchi_table(tasks_MED_by_GIS)


           lm xgboost ranger
FEELNAT  0.13    0.12   0.09
LNOISE   0.09    0.07   0.07
LOC_SENS 0.01   -0.01  -0.02
LOC_SOUN 0.05    0.00   0.00
LOC_SCEN 0.04    0.04   0.02
LOC_VISE 0.00   -0.03  -0.04
LOC_VEGE 0.06    0.05   0.04
LOC_FAUN 0.06    0.07   0.04

### Legacy Code

In [None]:

# Tune XGBoost hyperparameters on one example task (FEELNAT ~ GIS)
task_feelnat <- as_task_regr(
  subset(Dmlr, select = c("FEELNAT", GIS_vars)),
  target = "FEELNAT"
)

learner_xgb <- lrn("regr.xgboost",
  nrounds = 500  # More iterations due to lower learning rate
)

# Create search space (named so that it does not mask paradox::ps())
search_space <- ps(
  max_depth = p_int(2, 3),
  eta = p_dbl(0.001, 0.3, logscale = TRUE)  # search eta on the log scale
)

# Setup tuning
instance <- ti(
  task = task_feelnat,
  learner = learner_xgb,
  resampling = rsmp("cv", folds = 3),
  measures = msr("regr.mse"),
  terminator = trm("none"),
  search_space = search_space
)

# Grid search (stops once the whole grid has been evaluated)
tuner <- mlr3tuning::tnr("grid_search")
tuner$optimize(instance)
instance$result


In [None]:

library(randomForest)

fit <- lm(as.formula(paste0(
    "cbind(", paste(PRS_vars, collapse = ", "), ")",
    " ~ ",
    paste(Mediator_vars, collapse = " + ")
  )), 
  D)
coef(fit) |> round(2)

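# In-sample R^2 per PRS variable from the multivariate linear model
rsq.lm <- sapply(summary(fit), \(x) x$r.squared)
# Out-of-bag pseudo-R^2 of each random forest after its last tree
rsq.rf <- sapply(PRS_vars, \(x) {
  rf <- randomForest(as.formula(paste0(
    x, " ~ ", paste(Mediator_vars, collapse = " + ")
  )), 
  D, na.action = na.omit
  ) 
  tail(rf$rsq, 1)
})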

cbind(lm = rsq.lm, rf = rsq.rf) |> round(2)


In [None]:

# Measure and resampling used below
mse <- msr("regr.mse")
cv3 <- rsmp("cv", folds = 3)

# Define the task before plotting it; the subset() call already selects the features
mytsk1 <- as_task_regr(
  subset(Dmlr, select = c("LA", Mediator_vars, GIS_vars)),
  target = "LA",
  id = "bla"
)
autoplot(mytsk1, type = "pairs")

lrn_xgb <- lrn("regr.xgboost")
lrn_avg <- lrn("regr.featureless")
splits <- partition(mytsk1)
# Holdout MSE: XGBoost vs. the featureless (mean-only) baseline
lrn_xgb$train(mytsk1, splits$train)$predict(mytsk1, splits$test)$score(mse)
lrn_avg$train(mytsk1, splits$train)$predict(mytsk1, splits$test)$score(mse)
# 3-fold cross-validated MSE for XGBoost
rr <- resample(mytsk1, lrn_xgb, cv3)
rr$aggregate(mse)

learners <- lrns(c("regr.featureless", "regr.lm", "regr.xgboost", "regr.ranger"))
learners$regr.xgboost$param_set$set_values(eta = 0.03, nrounds = 300, max_depth = 2)
# NOTE: the next line overwrites the learner list above with a plain character vector
learners <- c("regr.featureless", "regr.lm")
