
# `OutR2`: **Predictive Metrics in Psychological Research**

Tutorial for estimating the prediction error with `OutR2` using the data from Klein et al. (2014): *Investigating Variation in Replicability: A "Many Labs" Replication Project*, accessible here: <https://osf.io/wx7ck/>.

### **Installation**

Using the `devtools` package, `OutR2` can be installed from its GitHub page:


In [1]:
# devtools::install_github("MagikT/OutR2")

Once `OutR2` has been installed, it is ready to be used.

### **Estimating the prediction error**

In [2]:
# Load required packages 
library(OutR2) # Predictive metrics, OutR2_0.1.0
library(haven) # Read .sav data, haven_2.5.2

**1. Load data set**

In [3]:
data <- as.data.frame(read_sav("CleanedDataset.sav")) # Data from the Many Labs Replication Project

**2.  Data pre-processing**  

In the original study, Klein et al. (2014) analyzed 13 classical and contemporary effects documented in the psychological literature (16 when the four anchoring effects are counted separately). Of the 13, 11 can be approached through regression analysis. For this tutorial we will use the data from the Anchoring Effect (Babies).

The anchoring effect is a cognitive bias that highlights our propensity to give excessive weight to the first piece of information received (known as the "anchor") when making a judgment in certain situations (see, e.g., Jacowitz & Kahneman, 1995). In the experimental settings in which this phenomenon has been investigated, participants are asked to provide, for example, an estimate of the number of babies born in the U.S. each day after being exposed to an unreasonably low (e.g., 100) or high (e.g., 50,000) anchor value.

In [4]:
# 2. 1. Store in a new data frame the data from the Anchoring Effect - Babies
# Effect
effect <- "Anchoring Babies"

# Experimental condition - group, Independent Variable IV
group <- "anch4group"

# Dependent Variable, DV
DV <- "anchoring4"

# Location of data collection (36 laboratories)
location <- c("sample", "referrer") 

# New data frame  
data_babies <- data[, c(location, group, DV)] # Get Anchor Effect columns from the large data set containing all effects
data_babies <- na.omit(data_babies) # Omit NA values
data_babies <- data.frame(id = 1:nrow(data_babies), # Add ID
                          location = data_babies[, 1],
                          location_label = data_babies[, 2],
                          group = data_babies[, 3],
                          y = data_babies[, 4]) # Recode variable names

Emulating a real-world scenario, we will estimate the prediction error from a single sample (i.e., 1 of the 36 laboratories), using the data from the Ohio State University (OSU) laboratory.

In [5]:
# 2. 2. Get data from OSU laboratory
data_babies <- data_babies[data_babies$location_label == "osu", ]

# View data
head(data_babies)

| | id | location | location_label | group | y |
|---|---|---|---|---|---|
| | `<int>` | `<dbl+lbl>` | `<chr>` | `<dbl+lbl>` | `<dbl>` |
| 2386 | 2386 | 16 | osu | 1 | 3000 |
| 2387 | 2387 | 16 | osu | 0 | 145 |
| 2388 | 2388 | 16 | osu | 1 | 2500 |
| 2389 | 2389 | 16 | osu | 1 | 38560 |
| 2390 | 2390 | 16 | osu | 0 | 150 |
| 2391 | 2391 | 16 | osu | 1 | 35000 |


For analyzing the Anchoring Effect from a predictive approach, we fit the following linear model:

$$\hat{Y}_i = \beta_0 + \beta_1 \cdot X_{i1}$$

where $\beta_0$ is the predicted (i.e., expected) response for potential, yet unobserved, participants in the low anchor condition, and $\beta_0 + \beta_1$ is the predicted response for those exposed to the high anchor condition. The predictor is coded as $X_{i1} = 0$ for the low anchor condition and $X_{i1} = 1$ for the high anchor condition (see the `group` column in `data_babies` above). The predicted response for the $i$-th participant in the low anchor condition is therefore $\hat{Y}_i = \beta_0 + \beta_1 \cdot 0 = \beta_0$; that is, we expect participants exposed to the low anchor to estimate a number of babies close to $\beta_0$. Similarly, for the high anchor condition, $\hat{Y}_i = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$.
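As a quick check of this interpretation, the sketch below fits the same kind of model to the six OSU rows shown in the `head(data_babies)` output and confirms that, with a 0/1 predictor, the least-squares estimates of $\beta_0$ and $\beta_0 + \beta_1$ are simply the two group means. The six rows are used only for illustration, so the numbers differ from the full-sample fit; the object names `toy` and `toy_fit` are assumptions introduced here.

```r
# Six rows copied from the head(data_babies) output; illustration only
toy <- data.frame(group = c(1, 0, 1, 1, 0, 1),
                  y     = c(3000, 145, 2500, 38560, 150, 35000))

# Ordinary least squares; same mean structure as glm() with its default family
toy_fit <- lm(y ~ group, data = toy)
b <- unname(coef(toy_fit))

# Compare the coefficient-based predictions with the raw group means
low_mean  <- mean(toy$y[toy$group == 0])
high_mean <- mean(toy$y[toy$group == 1])
cat(sprintf("beta0 = %.1f (low anchor mean = %.1f)\nbeta0 + beta1 = %.1f (high anchor mean = %.1f)\n",
            b[1], low_mean, b[1] + b[2], high_mean))
```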

**3.  Fit Predictive Model**  

We fit the model using the `glm` function from the `stats` package, which is part of base R.

In [6]:
# 3. 1. Fit predictive model
fit <- glm(formula = y ~ group, data = data_babies)

# 3. 2. Model predictions
cat(sprintf("Low anchor = %.0f\nHigh anchor = %.0f", 
            fit$coefficients[1], 
            fit$coefficients[1] + fit$coefficients[2]))

Low anchor = 2571
High anchor = 26292

The fitted model predicts that future (i.e., yet unobserved) participants exposed to the low and high anchor conditions will estimate a number of babies close to 2,571 and 26,292, respectively. To evaluate its out-of-sample predictive performance, we use `OutR2`.
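The same kind of prediction can also be obtained with `predict()` from base R instead of adding coefficients by hand. The sketch below is self-contained and uses only the six OSU rows shown in the `head()` output, so its numbers will not match the full-sample values; `toy`, `toy_fit`, and `new_participants` are hypothetical names.

```r
# Six rows copied from the head(data_babies) output; illustration only
toy <- data.frame(group = c(1, 0, 1, 1, 0, 1),
                  y     = c(3000, 145, 2500, 38560, 150, 35000))
toy_fit <- glm(y ~ group, data = toy) # Gaussian family by default

# One hypothetical future participant per condition (0 = low, 1 = high anchor)
new_participants <- data.frame(group = c(0, 1))
pred <- predict(toy_fit, newdata = new_participants)

cat(sprintf("Low anchor = %.1f\nHigh anchor = %.1f\n", pred[1], pred[2]))
```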

**4. Predictive Performance Metrics**

In its current version (`OutR2 0.1.0`), the main function `rsq_cv` estimates the prediction error of a fitted model in the $R^2$ and $MSE$ metrics (also $RMSE$). In this example we will estimate the prediction error by Cross-Validation (CV) in the $R^2$ and $RMSE$ metrics.

`rsq_cv` takes as input:
- `data`: data set or sample on which the model was fitted.
- `glmfit`: the fitted model object (here, the `glm` fit).
- `k`: number of folds or validation sets (`k = n` for Leave-One-Out CV).
- `balanced`: when working with categorical predictors and `k != n` (i.e., for K-fold CV), `balanced = TRUE` generates balanced folds.
- `J`: number of groups when working with categorical predictors (version `0.1.0` is set up for `J = 2`).
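To make the idea of balanced folds concrete, here is a minimal sketch of one way to assign observations to `k` folds so that each fold holds a similar number of observations from each group. This is an assumption about what "balanced" means in this context, not the code `rsq_cv` uses internally; `make_balanced_folds` is a hypothetical helper and the design below is toy data.

```r
# Hypothetical helper: assign fold labels separately within each group
make_balanced_folds <- function(group, k) {
  folds <- integer(length(group))
  for (g in unique(group)) {
    idx <- which(group == g)
    # Fold labels 1..k recycled to the group size, then shuffled
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

set.seed(1) # Reproducible shuffling
group <- rep(c(0, 1), times = c(10, 14)) # Toy design with unequal group sizes
folds <- make_balanced_folds(group, k = 4)

# Within each group, fold sizes differ by at most one observation
print(table(fold = folds, group = group))
```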

In [7]:
# 4. 1. Predictive performance metrics
n <- nrow(data_babies) # Sample size, n, for Leave-One-Out Cross-Validation
predictive_metrics <- rsq_cv(data = data_babies, glmfit = fit, k = n) # Get prediction error estimation

# 4. 1. 1. Extract the metrics
Pe_R2 <- predictive_metrics$R2_Cv # Prediction error estimation by CV in R2 metric (i.e., out-of-sample R2)
Pe_RMSE <- predictive_metrics$RMSE_Cv # Prediction error estimation by CV in RMSE metric (i.e., Root of the Mean Squared Error of Prediction)

`$R2_Cv` and `$RMSE_Cv` contain the prediction error estimated by CV in the $R^2$ and $RMSE$ metrics, respectively. In this example, we interpret the results as follows.

In [8]:
# 4. 2. Results
cat(sprintf("Estimated prediction error by Leave-One-Out CV:\nR2 = %.2f\nRMSE = %.0f",
            Pe_R2, # R2 metric
            Pe_RMSE)) # RMSE metric

Estimated prediction error by Leave-One-Out CV:
R2 = 0.52
RMSE = 11513

For the Anchoring Effect (Babies), the predictive performance of the fitted model was estimated by Leave-One-Out (LOO) CV. According to the LOO CV estimate, the model with the experimental condition (i.e., low vs. high anchor) as predictor deviates, on average, by 11,513 babies from the observed responses in out-of-sample observations ($RMSE_{LOO} = 11{,}513$) and accounts for 52% of the observed variance in out-of-sample responses ($R^2_{LOO} = 0.52$).
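For readers who want to see what these two numbers summarize, the following sketch computes a leave-one-out RMSE and an out-of-sample $R^2$ by hand. It assumes one common definition, $R^2 = 1 - \sum_i (y_i - \hat{y}_{-i})^2 / \sum_i (y_i - \bar{y}_{-i})^2$, where the baseline $\bar{y}_{-i}$ is the mean of the training observations. `rsq_cv` may use a different formula, so this is not a reimplementation of `OutR2`. The data are the six OSU rows from the `head()` output, so the results will not match the values reported above.

```r
# Six rows copied from the head(data_babies) output; illustration only
toy <- data.frame(group = c(1, 0, 1, 1, 0, 1),
                  y     = c(3000, 145, 2500, 38560, 150, 35000))
n <- nrow(toy)

pred_model <- numeric(n) # Held-out predictions from y ~ group
pred_null  <- numeric(n) # Held-out predictions from the training mean only

for (i in seq_len(n)) {
  train <- toy[-i, ]                       # Drop observation i
  fit_i <- lm(y ~ group, data = train)     # Refit without it
  pred_model[i] <- predict(fit_i, newdata = toy[i, , drop = FALSE])
  pred_null[i]  <- mean(train$y)
}

rmse_cv <- sqrt(mean((toy$y - pred_model)^2))
r2_cv   <- 1 - sum((toy$y - pred_model)^2) / sum((toy$y - pred_null)^2)

cat(sprintf("LOO RMSE = %.0f\nLOO R2 = %.2f\n", rmse_cv, r2_cv))
```

Unlike the in-sample $R^2$, a quantity defined this way can be negative when the model predicts held-out responses worse than the training mean does.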

**References**

- Jacowitz, K. E., & Kahneman, D. (1995). Measures of anchoring in estimation tasks. *Personality and Social Psychology Bulletin*, *21*(11), 1161-1166. <https://doi.org/10.1177/01461672952111004>
- Klein, R. A., Ratliff, K. A., Vianello, M., Adams Jr, R. B., Bahník, Š., Bernstein, M. J., ..., & Nosek, B. A. (2014). Investigating variation in replicability: A "Many Labs" Replication Project. *Social Psychology*, *45*(3), 142-152. <https://doi.org/10.1027/1864-9335/a000178>