# Resampling (Julia)

In [31]:
using RDatasets
using DataFrames
using CategoricalArrays
using GLM
using Gadfly
using Statistics
using Lathe.preprocess: TrainTestSplit
using MLBase

In [5]:
auto = dataset("ISLR","Auto")

Row,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Year,Origin
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0
2,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0
3,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0
4,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0
5,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0
6,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0
7,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0
8,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0
9,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0
10,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0


In [20]:
# Use Lathe to split the data into train and test sets (about half the rows each)
train, test = TrainTestSplit(auto, 0.5)

(191×9 DataFrame
 Row │ MPG      Cylinders  Displacement  Horsepower  Weight   Acceleration  Ye ⋯
     │ Float64  Float64    Float64       Float64     Float64  Float64       Fl ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │    15.0        8.0         350.0       165.0   3693.0          11.5     ⋯
   2 │    18.0        8.0         318.0       150.0   3436.0          11.0
   3 │    16.0        8.0         304.0       150.0   3433.0          12.0
   4 │    17.0        8.0         302.0       140.0   3449.0          10.5
   5 │    14.0        8.0         455.0       225.0   4425.0          10.0     ⋯
   6 │    15.0        8.0         400.0       150.0   3761.0           9.5
   7 │    21.0        6.0         200.0        85.0   2587.0          16.0
   8 │    27.0        4.0          97.0        88.0   2130.0
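
The split above appears to be random (191 and 201 rows rather than an exact half), so the error estimates that follow will change from run to run. A minimal sketch of a reproducible split in plain Julia, assuming a seeded generator is acceptable; `split_indices` is a hypothetical helper written for illustration and is not part of Lathe:

```julia
using Random

# Hypothetical helper (not part of Lathe): split the row indices 1:n into two
# disjoint sets, with a fraction `at` of the rows going to the first set.
function split_indices(n::Integer, at::Real; rng = Random.default_rng())
    perm = randperm(rng, n)        # random ordering of the row indices
    ntrain = round(Int, at * n)    # number of rows in the first set
    return perm[1:ntrain], perm[ntrain+1:end]
end

# 392 is the combined size of the two sets shown above (191 + 201)
train_idx, test_idx = split_indices(392, 0.5; rng = MersenneTwister(1))
```

The index vectors can then be used to select rows, for example `auto[train_idx, :]` and `auto[test_idx, :]`.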

In [22]:
fm = @formula(MPG ~ Horsepower)
lR = lm(fm,train)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

MPG ~ 1 + Horsepower

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  40.0946    1.00658      39.83    <1e-93  38.109     42.0801
Horsepower   -0.161853  0.00913813  -17.71    <1e-41  -0.179879  -0.143828
──────────────────────────────────────────────────────────────────────────

In [23]:
resp = predict(lR,test)

201-element Vector{Union{Missing, Float64}}:
 19.053630107544933
  8.0475989745293
  4.486824196200715
  5.296091191275393
  9.342426166648787
 12.579494146947502
 14.198028137096859
  3.677557201126036
 24.718499073067683
 24.718499073067683
 24.394792275037812
 26.01332626518717
 25.527766068142363
  ⋮
 28.764834048441077
 25.851472866172234
 27.955567053366398
 28.764834048441077
 29.250394245485886
 22.290698087843648
 26.337033063217042
 25.20405927011249
 24.55664567405275
 26.175179664202105
 31.67819523070992
 26.498886462231976

In [27]:
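# Squared prediction error for each row of the test set
MSE = Float64[]

for i in 1:nrow(test)
    val = (test[i,:MPG]-resp[i])^2
    push!(MSE,val)
end

# Mean squared error on the held-out data
mean(MSE)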

26.302338433708133

In [36]:
deviance(lR)/nrow(train) # For a linear model the deviance is the residual sum of squares, so this is the training MSE

21.649050282546565
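
The two numbers above are a test MSE and a training MSE. A minimal sketch of the same comparison in plain Julia may make the relationship explicit. It assumes synthetic stand-in data rather than the Auto data set, fits the line with the `\` least-squares solve instead of GLM, and uses a hypothetical helper `mse`:

```julia
using Statistics
using Random
using LinearAlgebra

# Synthetic stand-in data (an assumption, not the Auto data set)
rng = MersenneTwister(42)
x = 50 .+ 150 .* rand(rng, 200)               # predictor
y = 40 .- 0.16 .* x .+ 4 .* randn(rng, 200)   # response with noise

train_idx, test_idx = 1:100, 101:200

# Design matrix with an intercept column; `\` solves the least-squares problem
X = hcat(ones(length(x)), x)
beta = X[train_idx, :] \ y[train_idx]

# Hypothetical helper: mean squared error of the fitted line on the given rows
mse(idx) = mean((y[idx] .- X[idx, :] * beta) .^ 2)

train_mse = mse(train_idx)   # residual sum of squares divided by n, on the training rows
test_mse  = mse(test_idx)    # validation-set estimate of the test error
```

The training figure is computed on the same rows the line was fitted to, so it tends to be the more optimistic of the two.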

## Leave one out cross validation

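In R, cross-validation is easier: a single command produces the results directly. In Julia, one option is the `kfolds` function from MLUtils, which partitions the data into folds; you then iterate over the train/test pairs and evaluate the error on each. The cell below instead tries `cross_validate` from MLBase.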

In [35]:
fm = @formula(MPG ~ Horsepower)

(c, v, inds) = cross_validate(
    lm(fm,train),
    compMSE(lR,test),
    nrow(test),
    LOOCV(nrow(test)), # cross validation plan: leave one out
    Reverse) # smaller score indicates better model

# display results


LoadError: MethodError: no method matching cross_validate(::Float64, ::Int64, ::LOOCV, ::Base.Order.ReverseOrdering{Base.Order.ForwardOrdering}; lR=StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

MPG ~ 1 + Horsepower

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  40.0946    1.00658      39.83    <1e-93  38.109     42.0801
Horsepower   -0.161853  0.00913813  -17.71    <1e-41  -0.179879  -0.143828
──────────────────────────────────────────────────────────────────────────)
Closest candidates are:
  cross_validate(::Function, ::Function, ::Int64, ::Any) at ~/.julia/packages/MLBase/SAM9e/src/crossval.jl:177 got unsupported keyword argument "lR"
  cross_validate(::Function, ::Function, ::Integer, ::Any) at ~/.julia/packages/MLBase/SAM9e/src/crossval.jl:193 got unsupported keyword argument "lR"
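
The candidates listed in the error suggest that `cross_validate` expects two functions, a sample count, and a cross-validation plan, rather than a fitted model and a precomputed score. A minimal sketch of a call in that shape, assuming the first function receives the training indices and the second receives the fitted model plus the held-out indices; `fit_fold` and `score_fold` are hypothetical names introduced here:

```julia
using RDatasets, DataFrames, GLM, MLBase, Statistics

auto = dataset("ISLR", "Auto")
fm = @formula(MPG ~ Horsepower)
n = nrow(auto)

# First function: fit the model on the training rows of one split
fit_fold(train_inds) = lm(fm, auto[train_inds, :])

# Second function: mean squared error of that model on the held-out rows
function score_fold(model, test_inds)
    held_out = auto[test_inds, :]
    return mean((held_out.MPG .- predict(model, held_out)) .^ 2)
end

# One score per split; with LOOCV(n) each split holds out a single row
scores = cross_validate(fit_fold, score_fold, n, LOOCV(n))

mean(scores)   # LOOCV estimate of the test MSE
```

This sketch runs the cross-validation over the full data set rather than over `test`, since leave-one-out already holds out each row in turn.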

## Bootstrap

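In Julia, bootstrapping is available through a dedicated package such as Bootstrap.jl. Use it to evaluate how variable a model statistic is, for example the standard error of a regression coefficient, by refitting on resampled data.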
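
As a rough illustration of the idea, the following sketch estimates the standard error of a regression slope by resampling rows with replacement. It uses only the Julia standard library and synthetic stand-in data (both assumptions, not the package or data set named in this notebook), and `slope` is a hypothetical helper:

```julia
using Random, Statistics, LinearAlgebra

# Synthetic stand-in data (an assumption, not the Auto data set)
rng = MersenneTwister(7)
n = 200
x = 50 .+ 150 .* rand(rng, n)
y = 40 .- 0.16 .* x .+ 4 .* randn(rng, n)

# Hypothetical helper: slope of the least-squares line fitted to the rows in `idx`
slope(idx) = (hcat(ones(length(idx)), x[idx]) \ y[idx])[2]

B = 1000   # number of bootstrap replicates
# Each replicate draws n row indices with replacement and refits the line
boot_slopes = [slope(rand(rng, 1:n, n)) for _ in 1:B]

se_slope = std(boot_slopes)   # bootstrap estimate of the slope's standard error
```

The spread of `boot_slopes` can be compared with the `Std. Error` column that `lm` reports for the same coefficient.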