<hr style="border-top: 5px solid black;">

<h2 style="color:blue;">Keywords: R programming, Model Tuning, Data Splitting, AppliedPredictiveModeling</h2>

<hr style="border-top: 5px solid black;">

<div class="alert alert-block alert-info" style="font-size:20px; border:1px solid black; padding:10px">
<center><h1>Post Goals:</h1></center><br>
    <hr style="border-top: 2px dashed black;"><br>
    <p style="font-size:22px;">
        Discuss model tuning and data splitting using the caret and AppliedPredictiveModeling R packages. 
    </p><br><br>
</div>

<hr style="border-top: 5px solid black;">

<div class="alert alert-block alert-success" style="font-size:20px; border:1px solid black; padding:10px">
<center><h1>Post Outline:</h1></center>
    <hr style="border-top: 2px dashed black;">
    <ol>
        <li><a href="#objective1">Introduction to Model Tuning.</a></li><br>
        <li><a href="#objective2">Data Splitting.</a></li><br>
        <li><a href="#objective3">Demonstrate how to perform model tuning and data splitting using the <code>caret</code> and <code>AppliedPredictiveModeling</code> R packages.</a></li><br>
<!--         <li><a href="#objective4">z-test on a population proportion.</a></li><br> -->
    </ol>
</div>

<hr style="border-top: 5px solid black;">

<div id="objective1" class="alert alert-block alert-warning" style="font-size:16px; border:1px solid black; padding:10px"><center><h1><br><font color="blue">Introduction to Model Tuning.</font></h1></center><br>
</div>
<div style="font-size:16px; border:1px solid black; padding:10px">
    <ul><strong><u>The basics</u></strong>
        <li>Many predictive models have important parameters that must be specified during the training procedure.</li><br>
        <li>Parameters that cannot be obtained using an analytic formula are called tuning parameters.</li><br>
        <li>Tuning parameters control model complexity and how well the model generalizes to new data.</li><br>
        <li>Model tuning is the process of searching for the optimal parameter values; it generally involves evaluating model performance across a range of candidate values.</li><br>
        <li>There are several methods for model tuning.</li><br>
    </ul>
</div>
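
As a rough illustration of the tuning idea above, the sketch below searches a small grid of values for `k`, the tuning parameter of a k-nearest-neighbours classifier, and scores each candidate with 10-fold cross-validation. The choice of k-NN, the grid `knnGrid`, and the use of `twoClassData` are assumptions made for illustration only; they are not the method this post prescribes.

```r
library(caret)
library(AppliedPredictiveModeling)

data(twoClassData)  # loads `predictors` (data frame) and `classes` (factor)

set.seed(1)
# Candidate values of the tuning parameter k (illustrative grid, an assumption)
knnGrid <- data.frame(k = seq(1, 21, by = 4))

# Estimate accuracy for each candidate with 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
knnFit <- train(x = predictors, y = classes,
                method = "knn",
                tuneGrid = knnGrid,
                trControl = ctrl)

knnFit$results[, c("k", "Accuracy")]  # resampled accuracy for every candidate
knnFit$bestTune                       # the candidate with the best accuracy
```

The value in `bestTune` is only the best among the candidates tried, so a wider or finer grid may give a different answer.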

<hr style="border-top: 5px solid black;">

<div id="objective2" class="alert alert-block alert-warning" style="font-size:16px; border:1px solid black; padding:10px"><center><h1><br><font color="blue">Data Splitting.</font></h1></center><br>
</div>
<div style="font-size:16px; border:1px solid black; padding:10px">
    <ul><strong><u>The basics</u></strong>
        <li>Training a predictive model usually involves splitting the data into two groups: a training set and a test set.</li><br>
        <li>The training set is used to fit the model, while the test (or validation) set is used to estimate the performance of the model.</li><br>
        <li>When the number of instances is small, resampling methods such as cross-validation can be used.</li><br>
        <li>The data can be split in several ways, including nonrandom, random, and stratified random sampling.</li><br>
    </ul>
</div>
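
The difference between simple random and stratified random sampling can be sketched in base R. The outcome vector `y` below is made up for illustration (90 cases of one class and 10 of another) and does not come from the post's data.

```r
set.seed(1)
# Hypothetical imbalanced outcome: 90 "A" and 10 "B"
y <- factor(rep(c("A", "B"), times = c(90, 10)))

# Simple random sampling: draw 80% of all rows at once
simpleIdx <- sample(seq_along(y), size = 0.8 * length(y))

# Stratified random sampling: draw 80% of the rows within each class
stratIdx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

table(y[simpleIdx])  # class counts depend on the random draw
table(y[stratIdx])   # class counts are fixed at 80% of each class
```

Stratifying matters most when one class is rare, because a simple random draw can leave very few of its cases in the test set.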

<hr style="border-top: 5px solid black;">

<div id="objective3" class="alert alert-block alert-warning" style="font-size:16px; border:1px solid black; padding:10px"><center><h1><br><font color="blue">Demonstrate how to perform model tuning and data splitting using the <code>caret</code> and <code>AppliedPredictiveModeling</code> R packages.</font></h1></center><br>
</div>
<div style="font-size:16px; border:1px solid black; padding:10px">
    <ul><strong><u>AppliedPredictiveModeling package</u></strong>
        <li>The data consist of measurements taken from individual cells using a high-content screen for cancer.</li><br>
        <li>The data consist of 2019 rows and 116 columns. Each row is an instance, and the 116 columns are features/measurements taken for each cell.</li><br>
        <li>The objective is to use predictive modeling to examine cell characteristics in order to understand the effects of disease on the size, shape and characteristics of cells.</li><br>
        <li>In the example below, a classification predictive model using caret will be used to classify cells as either poorly segmented or well-segmented.</li><br>
    </ul>
<hr style="border-top: 2px solid black;">
    <ul><strong><u>caret Package</u></strong>
        <li>The caret package contains a set of functions that streamline model training for complex regression and classification problems.</li><br>
        <li>caret stands for Classification And REgression Training.</li><br>
        <li>The caret package draws on up to 32 other packages but loads them only as needed, prompting the user to install any that are missing.</li><br>
        <li>caret is a powerful package because almost every step of the process can be customized.</li><br>                
    </ul>  
</div>

# Install and Load Data

In [1]:
# install packages
install.packages(c("caret", "corrplot", "e1071", "lattice", "AppliedPredictiveModeling"))


The downloaded binary packages are in
	/var/folders/0d/xqmptt6x035b25m88r2ffc9m0000gn/T//RtmpxU5u9j/downloaded_packages


In [2]:
# Load the library
library(AppliedPredictiveModeling)
set.seed(1)
library(caret)

Loading required package: ggplot2

Loading required package: lattice



# Inspect the Dataset Dimensions

In [3]:
# Load and inspect the data and print out the predictors, classes
data(twoClassData)
str(predictors)
str(classes)
nrow(predictors)

'data.frame':	208 obs. of  2 variables:
 $ PredictorA: num  0.158 0.655 0.706 0.199 0.395 ...
 $ PredictorB: num  0.1609 0.4918 0.6333 0.0881 0.4152 ...
 Factor w/ 2 levels "Class1","Class2": 2 2 2 2 2 2 2 2 2 2 ...


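# Use the createDataPartition function to split the data 

 - training (80% of data) and test (20%)
 - the split is stratified on the outcome classes (Class1 and Class2), so both sets keep roughly the same class proportions; PredictorA and PredictorB are the predictors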

In [4]:
trainingRows <- createDataPartition(classes, p = .80, list= FALSE)
head(trainingRows)
nrow(trainingRows)

Resample1
1
2
3
7
8
9
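
One way to check that the split preserved the class balance is to compare class proportions in the full data, the training rows, and the held-out rows. This is a small sketch that repeats the split so it can run on its own; the rounding to three decimals is an arbitrary choice.

```r
library(caret)
library(AppliedPredictiveModeling)
data(twoClassData)  # loads `predictors` and `classes`

set.seed(1)
trainingRows <- createDataPartition(classes, p = .80, list = FALSE)

# Class proportions in the full data, the training rows, and the held-out rows
fullProp  <- prop.table(table(classes))
trainProp <- prop.table(table(classes[trainingRows]))
testProp  <- prop.table(table(classes[-trainingRows]))

round(rbind(full = fullProp, train = trainProp, test = testProp), 3)
```

If the three rows of the table are close to each other, the stratification worked as intended.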


# Subset the training data into predictor and class vectors

In [5]:
trainPredictors <- predictors[trainingRows, ]
trainClasses <- classes[trainingRows]

# Subset the test data into predictor and class vectors

 - The negative sign will return the test set instances

In [6]:
# Do the same for the test set using negative integers.
testPredictors <- predictors[-trainingRows, ]
testClasses <- classes[-trainingRows]
str(trainPredictors)
str(testPredictors)

'data.frame':	167 obs. of  2 variables:
 $ PredictorA: num  0.1582 0.6552 0.706 0.0658 0.3086 ...
 $ PredictorB: num  0.161 0.492 0.633 0.179 0.28 ...
'data.frame':	41 obs. of  2 variables:
 $ PredictorA: num  0.1992 0.3952 0.425 0.0847 0.2909 ...
 $ PredictorB: num  0.0881 0.4152 0.2988 0.0548 0.3021 ...


<hr style="border-top: 5px solid black;">

# Resampling Procedure with createDataPartition

 - The caret package has functions for data splitting
 - The function createDataPartition can be used to generate multiple splits using the times parameter

In [7]:
set.seed(1)
repeatedSplits <- createDataPartition(trainClasses, p = .80, times = 3)
str(repeatedSplits)
repeatedSplits$Resample1

List of 3
 $ Resample1: int [1:135] 1 2 3 4 6 7 9 10 11 12 ...
 $ Resample2: int [1:135] 1 2 3 4 5 6 7 9 10 11 ...
 $ Resample3: int [1:135] 1 2 3 4 5 7 8 9 11 12 ...
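
A possible use of these repeated splits is to fit a model on the selected rows of each split and score it on the rows left out, then average the results. The sketch below does this with caret's `knn3` and `k = 5`; both the model and the value of `k` are assumptions chosen for illustration.

```r
library(caret)
library(AppliedPredictiveModeling)
data(twoClassData)  # loads `predictors` and `classes`

set.seed(1)
trainingRows <- createDataPartition(classes, p = .80, list = FALSE)
trainPredictors <- as.matrix(predictors[trainingRows, ])
trainClasses <- classes[trainingRows]

repeatedSplits <- createDataPartition(trainClasses, p = .80, times = 3)

# For each split: fit 5-nearest neighbours on the selected rows,
# then measure accuracy on the rows left out of that split
splitAccuracy <- sapply(repeatedSplits, function(rows) {
  fit <- knn3(x = trainPredictors[rows, ], y = trainClasses[rows], k = 5)
  pred <- predict(fit, newdata = trainPredictors[-rows, ], type = "class")
  mean(pred == trainClasses[-rows])
})

splitAccuracy        # one accuracy estimate per split
mean(splitAccuracy)  # averaged estimate across the three splits
```

Averaging over several splits gives a steadier estimate than relying on any single split.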


# Resampling Procedure: Create Indices for 10-Fold Cross-Validation Using createFolds and createMultiFolds

 - createFolds (for k-fold cross-validation)
 - createMultiFolds (for repeated cross-validation)

In [8]:
set.seed(1)
cvSplits <- createFolds(trainClasses, k = 10, returnTrain = TRUE)
str(cvSplits)
cvSplits

List of 10
 $ Fold01: int [1:150] 1 2 4 5 6 7 8 10 11 13 ...
 $ Fold02: int [1:150] 1 2 3 4 6 7 8 9 10 11 ...
 $ Fold03: int [1:150] 1 3 4 5 6 7 8 9 10 11 ...
 $ Fold04: int [1:150] 1 2 3 4 5 6 7 8 9 10 ...
 $ Fold05: int [1:150] 2 3 4 5 6 7 8 9 10 11 ...
 $ Fold06: int [1:150] 1 2 3 4 5 6 7 8 9 11 ...
 $ Fold07: int [1:150] 1 2 3 4 5 6 7 9 10 12 ...
 $ Fold08: int [1:151] 1 2 3 4 5 6 8 9 10 11 ...
 $ Fold09: int [1:151] 1 2 3 5 6 7 8 9 10 11 ...
 $ Fold10: int [1:151] 1 2 3 4 5 7 8 9 10 11 ...
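
For repeated cross-validation, `createMultiFolds` can be called in much the same way. The sketch below uses the full `classes` vector and three repeats purely for illustration; both choices are assumptions rather than settings taken from this post.

```r
library(caret)
library(AppliedPredictiveModeling)
data(twoClassData)  # loads `predictors` and `classes`

set.seed(1)
# 10-fold cross-validation repeated 3 times; each element holds training rows
multiSplits <- createMultiFolds(classes, k = 10, times = 3)

length(multiSplits)       # one index vector per fold and repeat
head(names(multiSplits))  # names identify the fold and the repeat
range(lengths(multiSplits))  # each vector covers roughly 90% of the rows
```

Like `createFolds` with `returnTrain = TRUE`, the returned indices are the rows to train on, so the held-out rows are the ones missing from each vector.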


In [9]:
# Get the first set of row numbers from the list.
fold1 <- cvSplits[[1]]
length(fold1)
# or
fold11 <- cvSplits$Fold01
length(fold11)

In [10]:
## fold1 holds the rows used for training in the first fold (about 90% of the training set):
cvPredictors1 <- trainPredictors[fold1,]
cvClasses1 <- trainClasses[fold1]
nrow(trainPredictors)
nrow(cvPredictors1)
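
The remaining rows of the first fold (roughly 10% of the training set) can be recovered by excluding the `fold1` indices. This sketch rebuilds the objects so it runs on its own; the names `heldOut1`, `holdoutPredictors1`, and `holdoutClasses1` are introduced here for illustration.

```r
library(caret)
library(AppliedPredictiveModeling)
data(twoClassData)  # loads `predictors` and `classes`

set.seed(1)
trainingRows <- createDataPartition(classes, p = .80, list = FALSE)
trainPredictors <- predictors[trainingRows, ]
trainClasses <- classes[trainingRows]

set.seed(1)
cvSplits <- createFolds(trainClasses, k = 10, returnTrain = TRUE)
fold1 <- cvSplits[[1]]

# Rows of the training set that are NOT in fold1 form the held-out part
heldOut1 <- setdiff(seq_along(trainClasses), fold1)
holdoutPredictors1 <- trainPredictors[heldOut1, ]
holdoutClasses1 <- trainClasses[heldOut1]

# The two parts together account for every training row
length(fold1) + length(heldOut1) == length(trainClasses)
```

A model fitted on `trainPredictors[fold1, ]` would then be evaluated on `holdoutPredictors1` and `holdoutClasses1`.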

<hr style="border-top: 5px solid black;">