In [1]:
import stata_setup
stata_setup.config("C:/Program Files/Stata17/", "mp")


  ___  ____  ____  ____  ____ ®
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       MP—Parallel Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Single-user 4-core  perpetual
Serial number: 501706303466
  Licensed to: David Tomas Jacho-Chavez
               Emory University

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. More than 2 billion observations are allowed; see help obs_advice.
      3. Maximum number of variables is set to 5,000; see help set_maxvar.


## Resampling Methods

In [2]:
%%stata
use https://www.stata-press.com/data/r17/breathe, clear
quietly do https://www.stata-press.com/data/r17/no2


. use https://www.stata-press.com/data/r17/breathe, clear
(Nitrogen dioxide and attention)

. quietly do https://www.stata-press.com/data/r17/no2

. 


### Cross-Validation
#### Validation Set Approach

In [3]:
%%stata
splitsample , generate(sample) split(.80 .20) rseed(52)
label define slabel 1 "Training" 2 "Validation"
label values sample slabel
tabulate sample


. splitsample , generate(sample) split(.80 .20) rseed(52)

. label define slabel 1 "Training" 2 "Validation"

. label values sample slabel

. tabulate sample

     sample |      Freq.     Percent        Cum.
------------+-----------------------------------
   Training |        871       79.98       79.98
 Validation |        218       20.02      100.00
------------+-----------------------------------
      Total |      1,089      100.00

. 


In [4]:
%%stata
quietly regress react no2_class $cc i.$fc if sample==1
estimates store ols
lassogof ols, over(sample)


. quietly regress react no2_class $cc i.$fc if sample==1

. estimates store ols

. lassogof ols, over(sample)

Penalized coefficients
-------------------------------------------------------------
Name             sample |         MSE    R-squared        Obs
------------------------+------------------------------------
ols                     |
               Training |     15416.1       0.2710        866
             Validation |    16086.88       0.2368        218
-------------------------------------------------------------

. 
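The validation set approach used above can also be sketched in plain Python. This is only an illustration under stated assumptions: the data are synthetic (not the `breathe` dataset), there is a single predictor, and `fit_line` and `mse` are hypothetical helpers written for this example rather than anything provided by Stata or `pystata`.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(xs, ys, intercept, slope):
    """Mean squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(52)

# Synthetic data (assumption): y = 2 + 3x + standard normal noise.
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

# Random 80/20 split into training and validation indices.
idx = list(range(len(x)))
rng.shuffle(idx)
train, valid = idx[:160], idx[160:]

# Fit on the training sample only, then evaluate on both samples.
a, b = fit_line([x[i] for i in train], [y[i] for i in train])
mse_train = mse([x[i] for i in train], [y[i] for i in train], a, b)
mse_valid = mse([x[i] for i in valid], [y[i] for i in valid], a, b)

print("training MSE:  ", mse_train)
print("validation MSE:", mse_valid)
```

As in the `lassogof` table, the validation MSE is the figure to read as an estimate of out-of-sample error, since the training MSE is computed on the same observations used for fitting.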


#### Leave-One-Out Cross-Validation

The following code requires the user-written package ```loocv```, which can be installed with ```ssc install loocv```:

In [5]:
%%stata
loocv regress react no2_class $cc i.$fc



 Leave-One-Out Cross-Validation Results 
-----------------------------------------
         Method          |    Value
-------------------------+---------------
Root Mean Squared Errors |   129.89159
Mean Absolute Errors     |   103.40771
Pseudo-R2                |   .19156592
-----------------------------------------


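Given the original sample $\{Y_1,\ldots,Y_n\}$ and the leave-one-out predictions $\{\widehat{Y}_1,\ldots,\widehat{Y}_n\}$, where $\widehat{Y}_i$ comes from the model fitted without observation $i$, the reported statistics are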
$$
\begin{align}
\text{Root Mean Squared Errors}&=\sqrt{n^{-1}\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2}\\
\text{Mean Absolute Errors}&=n^{-1}\sum_{i=1}^n|Y_i-\widehat{Y}_i|\\
\text{Pseudo-R2}&=\widehat{\text{corr}}(Y_i,\widehat{Y}_i)^2
\end{align}
$$
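The three statistics can be reproduced by hand once the leave-one-out predictions are available. The sketch below is a minimal pure-Python illustration, not the `loocv` implementation: it assumes a single-predictor linear model, and `fit_line`, `loocv_predictions` and `loocv_metrics` are hypothetical names introduced here.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def loocv_predictions(xs, ys):
    """Predict each Y_i from a model fitted on all observations except i."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

def loocv_metrics(ys, preds):
    """Return (RMSE, MAE, pseudo-R2) as defined in the formulas above."""
    n = len(ys)
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / n)
    mae = sum(abs(y - p) for y, p in zip(ys, preds)) / n
    my, mp = sum(ys) / n, sum(preds) / n
    cov = sum((y - my) * (p - mp) for y, p in zip(ys, preds))
    var_y = sum((y - my) ** 2 for y in ys)
    var_p = sum((p - mp) ** 2 for p in preds)
    pseudo_r2 = cov ** 2 / (var_y * var_p)  # squared sample correlation
    return rmse, mae, pseudo_r2

# Small made-up example (assumption: illustrative numbers only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]
print(loocv_metrics(ys, loocv_predictions(xs, ys)))
```

Refitting the model $n$ times is the brute-force route; it is shown here because it mirrors the definition directly.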

#### _k_-Fold Cross-Validation

The following code requires the user-written package ```crossfold```, which can be installed with ```ssc install crossfold```:

In [6]:
%%stata
crossfold regress react no2_class $cc i.$fc, k(5) stub(fold)


             |      RMSE 
-------------+-----------
       fold1 |  131.3796 
       fold2 |  124.6447 
       fold3 |   133.332 
       fold4 |  130.5541 
       fold5 |  130.4793 
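For intuition, the per-fold RMSE calculation can be sketched in plain Python. This is not the `crossfold` implementation: it assumes synthetic data with one predictor, and `fit_line` and `kfold_rmse` are hypothetical helpers. Fold membership is random in this sketch, so the per-fold values change from run to run unless the seed is fixed.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def kfold_rmse(xs, ys, k, seed):
    """Return the held-out RMSE of each of k randomly assigned folds."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[j::k] for j in range(k)]  # k nearly equal-sized folds
    rmses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the other k-1 folds, predict the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - a - b * xs[i]) ** 2 for i in held_out]
        rmses.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return rmses

# Synthetic data (assumption): y = 2 + 3x + standard normal noise.
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

fold_rmses = kfold_rmse(x, y, k=5, seed=52)
for j, r in enumerate(fold_rmses, start=1):
    print(f"fold{j}: {r:.4f}")
```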


Displaying the OLS estimates from the third fold:

In [7]:
%%stata -eret steret
estimates restore fold3

(results fold3 are active now)


In [8]:
steret['e(b)']

array([[ 2.37222280e+00,  7.41469848e-02, -3.47806437e+01,
         5.90589708e+00,  1.62783388e+01,  1.40383825e+02,
        -7.14551905e-01, -2.24453765e+01,  2.74418460e+00,
         8.09833618e-01,  1.15731112e+01,  0.00000000e+00,
         5.32365515e+01, -3.92517016e+01,  4.04275675e+00,
        -1.43916693e+01,  1.96385164e+00, -6.68714662e+00,
        -8.07707088e+00, -2.31291040e+01,  1.11343328e+03]])

In [9]:
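import pandas as pd
from pystata import stata
from sfi import Matrix

# crossfold stores the per-fold statistic in the k x 1 matrix r(fold).
# Matrix.get returns a list of rows, so sum(..., []) flattens it to a plain list.
# Note: the per-fold RMSEs printed below differ from those in the earlier run,
# which suggests crossfold redraws the folds on each call; if so, the three
# metrics collected here come from different random partitions.

# Default statistic: RMSE
stata.run('qui crossfold regress react no2_class $cc i.$fc, k(5) stub(fold)')
df_rmse = pd.DataFrame(sum(Matrix.get('r(fold)'),[]))
rows = Matrix.getRowNames('r(fold)')

# Mean absolute error
stata.run('qui crossfold regress react no2_class $cc i.$fc, k(5) stub(fold) mae')
df_mae = pd.DataFrame(sum(Matrix.get('r(fold)'),[]))

# Pseudo R-squared
stata.run('qui crossfold regress react no2_class $cc i.$fc, k(5) stub(fold) r2')
df_r2 = pd.DataFrame(sum(Matrix.get('r(fold)'),[]))

# Combine the three metrics into a single DataFrame indexed by fold
result = pd.concat([df_rmse,df_mae,df_r2],axis=1)
result.columns = ['RMSE','MAE','pseudo R2']
result.index = rows
print(result)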

             RMSE         MAE  pseudo R2
fold1  131.971042  103.827229   0.173063
fold2  131.089282  104.633933   0.219005
fold3  131.865644  107.469575   0.174983
fold4  135.504024  100.342675   0.230675
fold5  121.367081  100.383902   0.191130


Here $CV_{(5)}$ is the average of the five fold-level mean squared errors, so $\sqrt{CV_{(5)}}$ equals:

In [10]:
import math
import statistics as st
# Square each fold RMSE to get the fold MSE, average, then take the root
print(math.sqrt(st.mean(result['RMSE']**2)))

130.4458605035774
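The same number can be checked without Stata by plugging in the fold RMSEs from the table printed above. The sketch below assumes those six-decimal rounded values and equal weight for each fold, so it agrees with the result above only up to rounding.

```python
import math

# Fold RMSEs copied from the printed table (rounded to six decimals).
fold_rmse = [131.971042, 131.089282, 131.865644, 135.504024, 121.367081]

# CV_(5) is the average of the fold MSEs, i.e. of the squared RMSEs.
fold_mse = [r ** 2 for r in fold_rmse]
cv_5 = sum(fold_mse) / len(fold_mse)
root_cv_5 = math.sqrt(cv_5)

# For comparison: the simple average of the RMSEs, which is a different quantity.
mean_rmse = sum(fold_rmse) / len(fold_rmse)

print(root_cv_5, mean_rmse)
```

Averaging the RMSEs directly gives a slightly smaller value than $\sqrt{CV_{(5)}}$, because the square root of a mean is at least as large as the mean of the square roots.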
