<a href="https://colab.research.google.com/github/Jlokkerbol/masterclass/blob/main/The_more_the_better.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
install.packages('glmnet')
install.packages('data.table')
install.packages('caret')

library(glmnet)
library(data.table)
library(caret)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘shape’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Loading required package: Matrix

Loaded glmnet 4.0-2



# The more the better?

A common assumption about machine learning is that more data is always better. This notebook examines how adding irrelevant variables to a dataset affects model performance, i.e. how well a machine learning model can distinguish relevant from irrelevant information.

The following steps are taken:
- step 1: define the data-generating process and simulate data
- step 2: split the data into a training and a test set
- step 3: train a model using all predictors
- step 4: evaluate the performance of the full model
- step 5: train a model using only the relevant predictors
- step 6: compare the performance of both models

In [23]:
# step 1: define data-generating process and simulate data
set.seed(12345)
X1 <- rnorm(n = 180, mean = 20, sd = 5)
X2 <- rnorm(n = 180, mean = 0, sd = 5)
error <- rnorm(n = 180, mean = 0, sd = 100)
Y <- 10+40*X1+20*X2+error
df <- cbind(Y, X1, X2)
df <- as.data.frame(df)
head(df)

# create irrelevant variables X3 to X100 (pure noise) and add them to the data frame
for (i in 3:100) {
        df <- cbind(df,rnorm(n=180, mean=0, sd=1))
        colnames(df)[i+1] <- paste("X",i,sep="")
}

head(df)

| | Y `<dbl>` | X1 `<dbl>` | X2 `<dbl>` |
|---|---|---|---|
| 1 | 798.3495 | 22.92764 | -4.5586971 |
| 2 | 765.7054 | 23.54733 | -0.2452235 |
| 3 | 776.4606 | 19.45348 | -2.0269374 |
| 4 | 813.3765 | 17.73251 | 5.651909 |
| 5 | 1014.51 | 23.02944 | 4.0773237 |
| 6 | 519.0936 | 10.91022 | 0.3820876 |


| | Y `<dbl>` | X1 `<dbl>` | X2 `<dbl>` | X3 `<dbl>` | X4 `<dbl>` | X5 `<dbl>` | X6 `<dbl>` | X7 `<dbl>` | X8 `<dbl>` | X9 `<dbl>` | ⋯ | X91 `<dbl>` | X92 `<dbl>` | X93 `<dbl>` | X94 `<dbl>` | X95 `<dbl>` | X96 `<dbl>` | X97 `<dbl>` | X98 `<dbl>` | X99 `<dbl>` | X100 `<dbl>` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 798.3495 | 22.92764 | -4.5586971 | 2.12746303 | -0.2915456 | -0.78486098 | -0.8257089 | 0.209283 | 0.7548681 | -0.2624089 | ⋯ | 0.2170089 | 0.9811519 | -0.1497657 | 2.7536749 | -0.0712445 | -0.3998977 | 0.9764391 | -0.3016068 | -1.4666577 | -0.28744876 |
| 2 | 765.7054 | 23.54733 | -0.2452235 | -0.18770648 | 0.9726278 | -2.56005244 | 0.4902205 | 0.5836894 | 0.2102209 | -1.2662017 | ⋯ | 0.5958067 | 1.3969812 | 0.5189769 | 0.8732472 | 0.8300113 | -1.3213831 | -0.9214097 | -1.2180848 | -0.9668004 | 0.03717712 |
| 3 | 776.4606 | 19.45348 | -2.0269374 | 0.09875784 | -0.2397067 | 0.07280078 | -0.9265319 | -0.6860822 | -0.6535798 | -0.9027523 | ⋯ | 0.8350547 | -0.2559876 | 0.8028454 | -0.5068545 | -0.31125485 | -0.4995762 | 0.5872777 | 0.4780813 | -0.8942023 | -1.34888323 |
| 4 | 813.3765 | 17.73251 | 5.651909 | 1.91037815 | -0.9017177 | 0.75024358 | 1.6025499 | -1.0184904 | 0.2518509 | 0.1460915 | ⋯ | -0.751907 | 1.2994639 | 0.5629971 | 1.4382039 | -0.51459724 | -0.6148111 | 1.5750744 | 0.0106562 | -0.9325735 | -0.7640613 |
| 5 | 1014.51 | 23.02944 | 4.0773237 | 1.62145572 | -0.6311743 | -0.12824888 | -0.860891 | -1.1347534 | -0.4156504 | 1.3384244 | ⋯ | 0.8735983 | 0.7651778 | -1.7634525 | 0.4803649 | 0.31325164 | -0.9264642 | -0.8271999 | 2.6599423 | -1.0749029 | 0.44507673 |
| 6 | 519.0936 | 10.91022 | 0.3820876 | 2.09306799 | -1.6259751 | -0.48786673 | -0.3655133 | -0.4582505 | -0.1920236 | 0.6668734 | ⋯ | 0.2263574 | -0.4766437 | -2.8152937 | -0.9024205 | 0.02830745 | 1.32532 | 0.3081907 | 0.5242 | -0.4868474 | 1.78749391 |


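We defined the 'true' relation as Y = 10 + 40·X1 + 20·X2 + error, where the error is normally distributed noise with a standard deviation of 100. We then added the variables X3 to X100, which are pure noise and have nothing to do with the outcome Y.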

Now that we have this data, we can explore to what extent machine learning is able to distinguish between the relevant predictors X1 and X2, and the irrelevant predictors X3 - X100.
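
Before fitting any model, it can help to see how strongly each predictor is associated with Y. The sketch below mirrors the simulation from step 1 in base R and compares the absolute correlations of X1 and X2 with the largest correlations found among the noise variables. The names `noise`, `X` and `abs_cor` are introduced here for illustration and are not part of the notebook; building the noise columns with a single `matrix()` call instead of a loop is also an assumption of this sketch.

```r
# Simulate the outcome and the two relevant predictors as in step 1
set.seed(12345)
n  <- 180
X1 <- rnorm(n, mean = 20, sd = 5)
X2 <- rnorm(n, mean = 0, sd = 5)
error <- rnorm(n, mean = 0, sd = 100)
Y  <- 10 + 40 * X1 + 20 * X2 + error

# 98 pure-noise predictors X3..X100, one per column
noise <- matrix(rnorm(n * 98), nrow = n,
                dimnames = list(NULL, paste0("X", 3:100)))

# Absolute correlation of every predictor with the outcome
X <- cbind(X1, X2, noise)
abs_cor <- abs(cor(X, Y))[, 1]

# Relevant predictors versus the five strongest spurious correlations
print(round(abs_cor[c("X1", "X2")], 3))
print(round(head(sort(abs_cor[-(1:2)], decreasing = TRUE), 5), 3))
```

With 98 noise variables and only 180 observations, some of them will correlate with Y by chance alone, which is what makes the selection problem non-trivial.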

In [24]:
##### step 2: split the data into training and test set
set.seed(1)
train <- createDataPartition(df$Y,p = 0.7, list = FALSE)
df_train <- df[train,]
df_test <- df[-train,]

In [25]:
##### step 3: train model using all data

# define cross-validation strategy
fitControl <- trainControl(method = "repeatedcv",
                        number = 10,    ## 10-fold CV
                        repeats = 10)   ## repeated ten times

# train model (LASSO)
lambda <- 10^seq(-3,3,length=100)
set.seed(825)
LassoFit <- train(Y ~ ., data = df_train, 
                 method = "glmnet", 
                 trControl = fitControl,
                 tuneGrid = expand.grid(alpha = 1, lambda = lambda))
LassoFit
LassoFit$finalModel$tuneValue

coef(LassoFit$finalModel, LassoFit$bestTune$lambda)

“There were missing values in resampled performance measures.”


glmnet 

128 samples
100 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 115, 116, 115, 116, 115, 115, ... 
Resampling results across tuning parameters:

  lambda        RMSE       Rsquared   MAE      
  1.000000e-03  261.24390  0.3790358  214.96968
  1.149757e-03  261.24390  0.3790358  214.96968
  1.321941e-03  261.24390  0.3790358  214.96968
  1.519911e-03  261.24390  0.3790358  214.96968
  1.747528e-03  261.24390  0.3790358  214.96968
  2.009233e-03  261.24390  0.3790358  214.96968
  2.310130e-03  261.24390  0.3790358  214.96968
  2.656088e-03  261.24390  0.3790358  214.96968
  3.053856e-03  261.24390  0.3790358  214.96968
  3.511192e-03  261.24390  0.3790358  214.96968
  4.037017e-03  261.24390  0.3790358  214.96968
  4.641589e-03  261.24390  0.3790358  214.96968
  5.336699e-03  261.24390  0.3790358  214.96968
  6.135907e-03  261.24390  0.3790358  214.96968
  7.054802e-03  261.24390  0.3790358  214.96968
  8.111308e-0

Unnamed: 0_level_0,alpha,lambda
Unnamed: 0_level_1,<dbl>,<dbl>
71,1,17.47528


101 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept) 94.10991091
X1          36.38231342
X2          12.50758018
X3           .         
X4           .         
X5           .         
X6           .         
X7           .         
X8           .         
X9           .         
X10          .         
X11          .         
X12          .         
X13          .         
X14          .         
X15          .         
X16          .         
X17          .         
X18          .         
X19          .         
X20          .         
X21          .         
X22          .         
X23          .         
X24          .         
X25          .         
X26          .         
X27          .         
X28          .         
X29          .         
X30          .         
X31          .         
X32          .         
X33          .         
X34          .         
X35          .         
X36          .         
X37          .         
X38          .       

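The LASSO, a method designed to separate relevant from irrelevant predictors by shrinking coefficients to exactly zero, retains the relevant predictors X1 and X2. However, it also retains 4 of the irrelevant predictors, while excluding the other 94 of the 98. Note also that the retained coefficients are shrunk towards zero: 36.4 for X1 and 12.5 for X2, against true values of 40 and 20.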
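
To give some intuition for why the LASSO can set coefficients to exactly zero, here is a minimal sketch of the soft-thresholding operator. In the special case of orthonormal predictors, the LASSO solution equals the unpenalised estimates passed through this operator (up to how the penalty is scaled); glmnet applies the same idea coordinate by coordinate. The function `soft_threshold` and the values in `ols_coef` are hypothetical and chosen only for illustration; they are not estimates from this notebook.

```r
# Soft-thresholding: shrink z towards zero by lambda,
# and return exactly zero whenever |z| <= lambda
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalised estimates: two strong signals and three weak noise terms
ols_coef <- c(X1 = 40, X2 = 20, X3 = 1.5, X4 = -0.8, X5 = 0.3)

# Apply a penalty of 2 to every coefficient
lasso_coef <- soft_threshold(ols_coef, lambda = 2)
print(lasso_coef)
```

The two large coefficients are each reduced by the penalty, while the three small ones are set to zero. A noise variable whose estimate happens to exceed the threshold survives, which is one way irrelevant predictors can end up in the model.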

In [26]:
##### step 4: evaluate performance of full model
error_full <- df_test[,1] - predict(LassoFit, df_test)
rmse_full <- sqrt(mean(error_full^2))
mae_full <- mean(abs(error_full))
print(paste('RMSE for model using all predictors:', rmse_full))
print(paste('MAE for model using all predictors: ', mae_full))

[1] "RMSE for model using all predictors: 110.207866557162"
[1] "MAE for model using all predictors:  89.0088336830163"


In [27]:
##### step 5: train model using only the relevant data
model_causal <- lm(Y~X1+X2, data = df_train)
summary(model_causal)


Call:
lm(formula = Y ~ X1 + X2, data = df_train)

Residuals:
     Min       1Q   Median       3Q      Max 
-244.825  -51.130    0.789   61.179  298.566 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.104     32.650   0.279    0.781    
X1            40.462      1.544  26.211  < 2e-16 ***
X2            17.610      1.884   9.348 4.49e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 94.46 on 125 degrees of freedom
Multiple R-squared:  0.8492,	Adjusted R-squared:  0.8468 
F-statistic:   352 on 2 and 125 DF,  p-value: < 2.2e-16


In [28]:
##### step 6: compare the performance of both models
error_causal <- df_test[,1] - predict(model_causal, df_test)
rmse_causal <- sqrt(mean(error_causal^2))
mae_causal <- mean(abs(error_causal))

print(paste('RMSE for model using only relevant predictors:', rmse_causal))
print(paste('MAE for model using only relevant predictors: ', mae_causal))

print(paste('% improvement (RMSE) compared to using all predictors:', (rmse_full - rmse_causal) / rmse_full * 100, '%'))
print(paste('% improvement (MAE) compared to using all predictors: ', (mae_full - mae_causal) / mae_full * 100, '%'))



[1] "RMSE for model using only relevant predictors: 99.6668873023072"
[1] "MAE for model using only relevant predictors:  80.4775732788778"
[1] "% improvement (RMSE) compared to using all predictors: 9.5646341628321 %"
[1] "% improvement (MAE) compared to using all predictors:  9.58473451581291 %"


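# Takeaways
- Machine learning does not perfectly detect which predictors truly matter, though it does a fairly good job: here the LASSO kept both relevant predictors and discarded 94 of the 98 irrelevant ones.
- In this simulation, the model restricted to the relevant predictors had a roughly 9.6% lower test RMSE than the model trained on all predictors (99.7 versus 110.2).
- It is important to approach prediction from a causal understanding of the problem at hand.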
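
The comparison above rests on a single simulated dataset and a single train/test split. As a rough robustness check, the sketch below repeats the simulation over 20 seeds and compares the average test RMSE of a model using all 100 predictors with one using only X1 and X2. Note the assumptions: both models are fitted with ordinary least squares (`lm`) rather than LASSO to keep the example free of package dependencies, the split is a plain random sample of 128 training rows rather than `createDataPartition`, and `simulate_once` is a helper introduced here for illustration.

```r
# One simulation run: generate data, split, fit both models, return test RMSEs
simulate_once <- function(seed, n = 180, p_noise = 98, n_train = 128) {
  set.seed(seed)
  X1 <- rnorm(n, mean = 20, sd = 5)
  X2 <- rnorm(n, mean = 0, sd = 5)
  Y  <- 10 + 40 * X1 + 20 * X2 + rnorm(n, mean = 0, sd = 100)
  noise <- matrix(rnorm(n * p_noise), nrow = n,
                  dimnames = list(NULL, paste0("X", 2 + seq_len(p_noise))))
  df <- data.frame(Y, X1, X2, noise)

  idx   <- sample(n, n_train)   # simple random train/test split
  train <- df[idx, ]
  test  <- df[-idx, ]

  rmse <- function(model) sqrt(mean((test$Y - predict(model, test))^2))
  c(full     = rmse(lm(Y ~ ., data = train)),        # all 100 predictors (OLS, no penalty)
    relevant = rmse(lm(Y ~ X1 + X2, data = train)))  # only the true predictors
}

# Repeat over 20 seeds and average the test RMSE per model
results <- t(sapply(1:20, simulate_once))
print(colMeans(results))
```

Because unpenalised least squares has no mechanism for discarding noise variables, the gap between the two models in this sketch should be considerably larger than the gap reported for the LASSO; the penalty removes much of the damage from irrelevant predictors, but not all of it.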