In [1]:
import stata_setup
stata_setup.config("C:/Program Files/Stata17/", "mp")


  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       MP—Parallel Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Single-user 4-core  perpetual
Serial number: 501706303466
  Licensed to: David Tomas Jacho-Chavez
               Emory University

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. More than 2 billion observations are allowed; see help obs_advice.
      3. Maximum number of variables is set to 5,000; see help set_maxvar.


## Preparing the data

In [2]:
%%stata
use https://www.stata-press.com/data/r17/breathe, clear
quietly do https://www.stata-press.com/data/r17/no2
display "$cc"
display "$fc"


. use https://www.stata-press.com/data/r17/breathe, clear
(Nitrogen dioxide and attention)

. quietly do https://www.stata-press.com/data/r17/no2

. display "$cc"
no2_home age age0 sev_home green_home noise_school sev_school precip siblings_o
> ld siblings_young

. display "$fc"
sex grade overweight lbweight breastfeed msmoke meducation feducation

. 


We use ```splitsample``` with the option ```split(.75 .25)``` to generate the variable ```sample```, which equals 1 for 75% of the observations and 2 for the remaining 25%. The assignment of each observation to group 1 or 2 is random, but the ```rseed``` option makes that assignment reproducible.
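To make the idea of a seeded random split concrete, here is a minimal Python sketch. It is not Stata's algorithm: the function name `split_sample` and its shuffle-then-cut approach are assumptions for illustration, so the group sizes are comparable to those from ```splitsample``` but the observations assigned to each group will differ.

```python
import random

def split_sample(n_obs, proportions=(0.75, 0.25), seed=52):
    """Randomly assign n_obs observations to groups 1, 2, ... (hypothetical helper).

    Using the same seed always reproduces the same assignment.
    """
    rng = random.Random(seed)      # private generator, so the seed fully determines the split
    order = list(range(n_obs))
    rng.shuffle(order)             # random ordering of the observation indices

    labels = [0] * n_obs
    total = sum(proportions)
    start = 0
    cumulative = 0.0
    for group, share in enumerate(proportions, start=1):
        cumulative += share
        end = round(n_obs * cumulative / total)   # cut point for this group
        for idx in order[start:end]:
            labels[idx] = group
        start = end
    return labels

labels = split_sample(1089)        # 1,089 observations, as in the data above
counts = {group: labels.count(group) for group in sorted(set(labels))}
print(counts)
```

Calling `split_sample(1089)` again with the same seed returns the identical list of labels, which is the role ```rseed``` plays in the Stata command.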

In [3]:
%%stata
splitsample , generate(sample) split(.75 .25) rseed(52)
label define slabel 1 "Training" 2 "Validation"
label values sample slabel
tabulate sample


. splitsample , generate(sample) split(.75 .25) rseed(52)

. label define slabel 1 "Training" 2 "Validation"

. label values sample slabel

. tabulate sample

     sample |      Freq.     Percent        Cum.
------------+-----------------------------------
   Training |        817       75.02       75.02
 Validation |        272       24.98      100.00
------------+-----------------------------------
      Total |      1,089      100.00

. 


## OLS

In [4]:
%%stata
quietly regress react no2_class $cc i.($fc) if sample==1
estimate store ols


. quietly regress react no2_class $cc i.($fc) if sample==1

. estimate store ols

. 


## Ridge

In [5]:
%%stata
quietly elasticnet linear react no2_class $cc i.($fc) if sample==1, alpha(0) lambda(0.1(.005)0.3) folds(781) nolog
estimate store ridge


. quietly elasticnet linear react no2_class $cc i.($fc) if sample==1, alpha(0) 
> lambda(0.1(.005)0.3) folds(781) nolog

. estimate store ridge

. 


## Lasso

In [6]:
%%stata
quietly lasso linear react no2_class $cc i.($fc) if sample==1, folds(20) rseed(52) nolog
estimate store lasso


. quietly lasso linear react no2_class $cc i.($fc) if sample==1, folds(20) rsee
> d(52) nolog

. estimate store lasso

. 


## Elastic Net

In [7]:
%%stata
quietly elasticnet linear react no2_class $cc i.($fc) if sample==1, alpha(.02 (0.02) .1) nolog folds(20) rseed(52)
estimate store elasticnet


. quietly elasticnet linear react no2_class $cc i.($fc) if sample==1, alpha(.02
>  (0.02) .1) nolog folds(20) rseed(52)

. estimate store elasticnet

. 


## In- & Out-of-Sample Prediction

In [8]:
%%stata
lassogof ols ridge lasso elasticnet, over(sample)


Penalized coefficients
-------------------------------------------------------------
Name             sample |         MSE    R-squared        Obs
------------------------+------------------------------------
ols                     |
               Training |    14749.69       0.2884        813
             Validation |    16974.57       0.2386        271
------------------------+------------------------------------
ridge                   |
               Training |    15490.39       0.2379        781
             Validation |    18025.61       0.1916        255
------------------------+------------------------------------
lasso                   |
               Training |    15863.91       0.2243        801
             Validation |    18136.28       0.1936        266
------------------------+------------------------------------
elasticnet              |
               Training |    15544.73       0.2352        781
             Validation |    18055.63       0.1903        255
-------------------------------------------------------------

**Postselection** coefficients should not be used with ```elasticnet``` and, in particular, with *ridge regression*, which is why the ```lassogof``` table above reports results for penalized coefficients. Ridge works by shrinking the coefficient estimates, and these shrunken estimates are the ones that should be used for prediction. Postselection coefficients are OLS regression coefficients for the selected variables. Because ridge always selects all variables, postselection coefficients after ridge are simply OLS regression coefficients for all potential variables, which is not what we want to use for prediction.
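The difference between penalized and postselection coefficients can be seen in a toy example with one centered predictor and no intercept. This is only a sketch under simplifying assumptions: the data are made up, the helper names are hypothetical, and the penalty is not on the same scale as Stata's ```lambda()```, since Stata standardizes variables and scales its objective differently.

```python
def cross_sums(x, y):
    """Return (sum of x*y, sum of x*x) for two equal-length lists."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy, sxx

def ols_slope(x, y):
    # Minimizes sum((y - b*x)^2)
    sxy, sxx = cross_sums(x, y)
    return sxy / sxx

def ridge_slope(x, y, penalty):
    # Minimizes sum((y - b*x)^2) + penalty * b^2
    sxy, sxx = cross_sums(x, y)
    return sxy / (sxx + penalty)

# Made-up, mean-centered data (illustrative only)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.0, 2.1, 3.9]

b_ols = ols_slope(x, y)
b_ridge = ridge_slope(x, y, penalty=10.0)   # penalized (shrunken) coefficient

# Ridge never drops a predictor, so the "postselection" refit is
# just OLS on the same predictor: the shrinkage is lost.
b_post = ols_slope(x, y)

print(round(b_ols, 3), round(b_ridge, 3), round(b_post, 3))
```

Here the OLS slope is about 2 and the ridge slope about 1. The postselection refit returns the OLS value again, so predicting with it would discard the shrinkage that ridge was fit to provide.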
