In [1]:
#set the wd to file location
setwd(getSrcDirectory(function(){})[1])

# dataSplit Notebook
This notebook is to be run second.\
Its purpose is to split the cleaned data into two sets: one for wrangling and training, and a final holdout set that is kept out of all analysis and tuning to prevent leakage.

In [2]:
#Import Necessary libraries
suppressWarnings({
library('tidyverse')
library('dplyr')
library('forcats')})

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
#Read Data
cleanVehicles = read.csv('02-vehicles-clean.csv')

## dataSplitFunction
The data split function takes the following inputs:
<ul> 
    <li> df: the dataframe to split (should be pre-cleaned), type dataframe </li>
    <li> alpha: the proportion of the data to be held out, type numeric (between 0 and 1) </li>
    <li> seed: a seed to set randomness, type int (defaults to 447) </li>
</ul>

The data split function returns the following output:
<ul>
    <li> iholdout: the indexes of df which belong to the holdout set </li>
</ul>
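The data is split randomly: the row indices are shuffled and the first round(alpha * N) of them form the holdout set, where N is the number of rows in df.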

In [4]:
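#define the splitting function (read above for documentation)
dataSplit <- function(df,alpha,seed = 447){
    set.seed(seed)               #use the supplied seed so the split is reproducible
    N = nrow(df)                 #total number of rows
    n = round(alpha*N)           #number of rows to hold out
    index = sample(N)            #random permutation of 1..N
    iholdout = index[seq_len(n)] #first n shuffled indices (empty when n is 0)
    return(iholdout)
}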
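
As a quick sanity check, the cell below sketches how `dataSplit` behaves on a small toy data frame. The names `toyDf`, `iToy`, `toyHoldout`, and `toySample` are hypothetical and exist only for this illustration. The fallback definition is an assumption added so the cell can run on its own; it mirrors the function above.

```r
# Fallback copy of dataSplit so this cell also runs on its own;
# it is skipped when the function is already defined above.
if (!exists("dataSplit")) {
    dataSplit <- function(df, alpha, seed = 447) {
        set.seed(seed)
        index <- sample(nrow(df))
        index[seq_len(round(alpha * nrow(df)))]
    }
}

# Hypothetical toy data frame (not the vehicles data)
toyDf <- data.frame(id = 1:50, price = seq(1000, 50000, length.out = 50))

iToy <- dataSplit(toyDf, alpha = 0.1)
toyHoldout <- toyDf[iToy, ]
toySample <- toyDf[-iToy, ]

# 10% of 50 rows is 5 holdout rows, leaving 45 sample rows
nrow(toyHoldout)
nrow(toySample)

# The two sets share no rows and together cover the whole data frame
length(intersect(toyHoldout$id, toySample$id)) == 0
setequal(c(toyHoldout$id, toySample$id), toyDf$id)

# Calling again with the same seed reproduces the same indices
identical(iToy, dataSplit(toyDf, alpha = 0.1))
```

Because the function returns indices rather than data frames, the same indices can be reused to select the holdout rows and, with a negative index, the remaining rows.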

## Applying dataSplit and saving the files

The function above is called and the resulting files are saved. To prevent data leakage, we do not access the holdout file until the final model evaluations. The second split (into sample and tuning sets) uses alpha = 0.22 because it is applied to the remaining 0.9 of the data, and 0.22 of 0.9 is approximately 0.2. This leaves roughly 0.7 of the data in the sample set, 0.2 in the tuning set, and 0.1 in the holdout set.
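
The arithmetic behind the 0.7 / 0.2 / 0.1 proportions can be checked with a minimal sketch on toy data. This is not the project's actual second split, which is not part of this notebook; the names `toy`, `remaining`, `ituning`, `tuning`, and `training` are hypothetical, and base R's `sample` is used directly so the cell runs on its own.

```r
# Hypothetical toy data with 1000 rows (not the vehicles file)
set.seed(447)
N <- 1000
toy <- data.frame(id = seq_len(N))

# First split: hold out 10% of all rows
iholdout <- sample(N)[seq_len(round(0.1 * N))]
remaining <- toy[-iholdout, , drop = FALSE]

# Second split: 22% of the remaining 90% goes to the tuning set
M <- nrow(remaining)
ituning <- sample(M)[seq_len(round(0.22 * M))]
tuning <- remaining[ituning, , drop = FALSE]
training <- remaining[-ituning, , drop = FALSE]

# Share of the original rows that ends up in each set
fractions <- c(training = nrow(training),
               tuning = nrow(tuning),
               holdout = length(iholdout)) / N
print(fractions)
```

With 1000 rows the counts are 702, 198, and 100, so the shares are 0.702, 0.198, and 0.100, which round to the stated 0.7, 0.2, and 0.1.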

In [7]:
#generate the holdout indices (10% of the rows)
iholdout = dataSplit(cleanVehicles,alpha = 0.1)
#split cleanVehicles into the holdout and sample sets
holdoutVehicles = cleanVehicles[iholdout,]
sampleVehicles = cleanVehicles[-iholdout,]

In [8]:
#save the files
write.csv(sampleVehicles, '03a-vehicles-sample.csv',row.names=FALSE)
write.csv(holdoutVehicles, '03b-vehicles-holdout.csv',row.names=FALSE)