## Load RCall
- starts an R instance in the background
- establishes the connection between Julia and R
- locates the R binary through the `R_HOME` environment variable or, on some operating systems, the default R install location
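
If RCall picks up the wrong R installation, one common remedy is to set `R_HOME` and rebuild the package. The fragment below is a sketch only: the path is hypothetical and should be replaced with the directory that `R.home()` reports in your own R session.

```julia
using Pkg

# Hypothetical path: replace with the output of R.home() on your machine
ENV["R_HOME"] = "/usr/lib/R"

# Rebuild RCall so that it records the R installation given above
Pkg.build("RCall")
```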


In [None]:
using RCall

## R"" string macro
- the easiest way to interface with R
- takes exact R statements
- returns an R object, which can be converted to a Julia value with `rcopy`
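
The points above can be seen on a small vector before moving to a full data set. This is a minimal sketch, assuming RCall is installed and can find R; the variable names `x_r`, `x_j`, `n`, and `total` are illustrative only.

```julia
using RCall

# R"" evaluates the R statement and returns an RObject
x_r = R"c(1.5, 2.5, 3.5)"

# rcopy converts the RObject into a native Julia value
x_j = rcopy(x_r)

# A Julia variable can be interpolated into the R statement with `$`
n = 10
total = rcopy(R"sum(seq_len($n))")
```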

In [None]:
aq_j=R"airquality" |> rcopy # get an R dataset and convert it to a Julia DataFrame

In [None]:
first(aq_j,3)  # show the first three rows of the Julia DataFrame

In [None]:
using DataVoyager
aq_j |> Voyager

## Let's load ggplot2 and plot airquality

In [None]:
R"library(ggplot2)"

In [None]:
p1=R"ggplot(data=airquality)+geom_point(aes(x=Wind,y=Solar.R,color=Temp))"

In [None]:
p1=R"ggplot(data=airquality)+geom_point(aes(x=Wind,y=Ozone,color=Temp))+facet_grid(Month ~ .)"

In [None]:
rcopy(p1) # translate the R object to Julia; the result is a dictionary type

In [None]:
p1=R"ggplot(data=airquality)+geom_point(aes(x=Ozone,y=Solar.R,color=Temp))"

In [None]:
R"plot(airquality)"

In [None]:
R"library(randomForest)"

In [None]:
R"rfmodel=randomForest(Temp ~ .,data=airquality,na.action=na.omit)" # regression

In [None]:
R"varImpPlot(rfmodel)"

In [None]:
R"rfmodel"

In [None]:
R"randomForest(as.factor(Month) ~ .,data=airquality,na.action=na.omit)" # classification

In [None]:
R"library(caret)"

## Train model using Caret
The `train` function sets up a grid of tuning parameters for a number of classification and regression routines, fits each model, and calculates a resampling-based performance measure.

For a particular model, a grid of parameters (if any) is created, and the model is trained on slightly different data for each candidate combination of tuning parameters. For each resampled data set, the performance on the held-out samples is calculated, and the mean and standard deviation are summarized for each combination. The combination with the optimal resampling statistic is chosen, and the entire training set is used to fit the final model.

For the random forest model (`method='rf'` below), the hyperparameter tuned is `mtry`, the number of variables randomly sampled as candidates at each split.
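
The resample-then-select procedure described above can be sketched in plain Julia. This is an illustration under stated assumptions, not caret's implementation: the model is a toy k-nearest-neighbour regression, the data are simulated, bootstrap resampling is used, and every name (`knn_predict`, `resampled_rmse`, `grid`, `best_k`) is hypothetical.

```julia
using Random, Statistics

# Toy stand-in for a tunable model: k-nearest-neighbour regression on one predictor
function knn_predict(xtrain, ytrain, x0, k)
    nearest = partialsortperm(abs.(xtrain .- x0), 1:k)
    return mean(ytrain[nearest])
end

rmse(pred, obs) = sqrt(mean((pred .- obs) .^ 2))

# Mean and standard deviation of the held-out RMSE for one candidate k
function resampled_rmse(rng, x, y, k; nresamples = 25)
    n = length(x)
    scores = Float64[]
    for _ in 1:nresamples
        train = rand(rng, 1:n, n)        # bootstrap sample: slightly different data each time
        heldout = setdiff(1:n, train)    # rows not drawn into the sample
        isempty(heldout) && continue
        pred = [knn_predict(x[train], y[train], x0, k) for x0 in x[heldout]]
        push!(scores, rmse(pred, y[heldout]))
    end
    return mean(scores), std(scores)
end

rng = MersenneTwister(42)
x = rand(rng, 200)
y = sin.(2π .* x) .+ 0.2 .* randn(rng, 200)

grid = [1, 5, 15, 50]                                    # candidate tuning values
results = [resampled_rmse(rng, x, y, k) for k in grid]   # (mean, sd) per candidate
best_k = grid[argmin(first.(results))]                   # lowest mean held-out RMSE

# The final model uses the chosen value and the entire training set
final_predict(x0) = knn_predict(x, y, x0, best_k)
```

Here `results` plays the role of the per-combination summary, and `final_predict` stands in for the model refit on all of the training data.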

In [None]:
R"crf=train(Temp ~ .,data=airquality,method='rf',na.action=na.omit)"

The `train` function can be used to

- evaluate, using resampling, the effect of model tuning parameters on performance
- choose the “optimal” model across these parameters
- estimate model performance from a training set


In [None]:
R"ctreebag=train(Temp ~ .,data=airquality,method='treebag',na.action=na.omit)"

In [None]:
# copy the fitted caret models from the R session into Julia
crf_j=rcopy(R"crf");
ctreebag_j=rcopy(R"ctreebag")

## Grid search for parameter optimization

The tuning parameter grid can be specified by the user. The `tuneGrid` argument takes a data frame with one column per tuning parameter. The column names should match the fitting function’s arguments. For the random forest (`rf`) example, the only column is `mtry` (the number of variables randomly sampled as candidates at each split).
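
When the grid is easier to build on the Julia side, a data frame can be created in Julia and copied into the R session. This is a minimal sketch, assuming the DataFrames package is installed; `tunegrid_j` is a hypothetical name, and the column is called `mtry` without a leading dot, on the assumption that current caret releases accept that spelling.

```julia
using RCall, DataFrames

# One column per tuning parameter; airquality has five predictors besides Temp
tunegrid_j = DataFrame(mtry = 1:5)

# Copy the Julia data frame into the R session under the same name
@rput tunegrid_j

# Inspect the copy from the R side
R"str(tunegrid_j)"
```

The R-side copy could then be passed as `tuneGrid=tunegrid_j` in a `train` call.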

In [30]:
R"tunegrid <- expand.grid(.mtry=c(1:10))"

RObject{VecSxp}
   .mtry
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10


## trainControl
The `trainControl` function generates parameters that further control how models are created. It accepts a large number of arguments, including:
- `method`: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none", or "oob".
- `number` and `repeats`: `number` sets the number of folds in K-fold cross-validation, or the number of resampling iterations for bootstrapping and leave-group-out cross-validation. `repeats` applies only to repeated K-fold cross-validation. For example, with `method = "repeatedcv"`, `number = 10`, and `repeats = 3`, three separate 10-fold cross-validations are used as the resampling scheme.
- `search`: how the hyperparameter space is searched, either a `random` search or an exhaustive `grid` search.
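
To make the `number` and `repeats` arguments concrete, the following plain-Julia sketch generates the held-out row indices for repeated K-fold cross-validation. It is an illustration only and not caret's code: the folds are drawn by a simple shuffle with no stratification on the outcome, the row count of 100 is arbitrary, and `kfold_indices` and `repeated_kfold` are hypothetical helpers.

```julia
using Random

# Split the row indices 1:n into k folds after a random shuffle
function kfold_indices(rng::AbstractRNG, n::Integer, k::Integer)
    perm = shuffle(rng, 1:n)
    # fold i takes every k-th element of the permutation, starting at position i
    return [perm[i:k:n] for i in 1:k]
end

# Repeat the K-fold split `repeats` times, each time with a fresh shuffle
function repeated_kfold(rng::AbstractRNG, n::Integer; number::Integer, repeats::Integer)
    return [kfold_indices(rng, n, number) for _ in 1:repeats]
end

rng = MersenneTwister(1)
splits = repeated_kfold(rng, 100; number = 10, repeats = 3)

length(splits)                       # 3 repetitions
length(splits[1])                    # 10 folds per repetition
sort(vcat(splits[1]...)) == 1:100    # every row is held out exactly once per repetition
```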

In [31]:
R"control <- trainControl(method='repeatedcv', number=3, repeats=3, search='grid')"

RObject{VecSxp}
$method
[1] "repeatedcv"

$number
[1] 3

$repeats
[1] 3

$search
[1] "grid"

$p
[1] 0.75

$initialWindow
NULL

$horizon
[1] 1

$fixedWindow
[1] TRUE

$skip
[1] 0

$verboseIter
[1] FALSE

$returnData
[1] TRUE

$returnResamp
[1] "final"

$savePredictions
[1] FALSE

$classProbs
[1] FALSE

$summaryFunction
function (data, lev = NULL, model = NULL) 
{
    if (is.character(data$obs)) 
        data$obs <- factor(data$obs, levels = lev)
    postResample(data[, "pred"], data[, "obs"])
}
<bytecode: 0x7fb1162f6258>
<environment: namespace:caret>

$selectionFunction
[1] "best"

$preProcOptions
$preProcOptions$thresh
[1] 0.95

$preProcOptions$ICAcomp
[1] 3

$preProcOptions$k
[1] 5

$preProcOptions$freqCut
[1] 19

$preProcOptions$uniqueCut
[1] 10

$preProcOptions$cutoff
[1] 0.9


$sampling
NULL

$index
NULL

$indexOut
NULL

$indexFinal
NULL

$timingSamps
[1] 0

$predictionBounds
[1] FALSE FALSE

$seeds
[1] NA

$adaptive
$adaptive$min
[1] 5

$adaptive$alpha
[1] 0.05

$adaptive$method


In [None]:
R"crf=train(Temp ~ .,data=airquality,method='rf',na.action=na.omit,tuneGrid=tunegrid, trControl=control)"

In [None]:
R"plot(crf)"

In [None]:
R"dcomp=airquality[complete.cases(airquality),]" # keep only rows without missing values
R"bestmtry <- tuneRF(dcomp[,-4],dcomp[,4], ntree=500)" # column 4 is Temp, the response

In [None]:
iris_j=rcopy(R"iris")

In [None]:
model=R"train(Species ~ .,data=$iris_j,method='rf')"

In [None]:
rcopy(model)