In [17]:
#set the wd to file location
setwd(getSrcDirectory(function(){})[1])

# dataSplit Notebook
This notebook is to be run 2nd.\
The purpose of this file is to split the cleaned data into two sets: one for wrangling and training, and a final holdout set that is kept out of all analysis and tuning to prevent leakage.

In [18]:
#Import necessary libraries (dplyr and forcats are also attached by tidyverse)
suppressWarnings({
    library('tidyverse')
    library('dplyr')
    library('forcats')
})

In [19]:
#Read Data
cleanVehicles = read.csv('02-vehicles-clean.csv')

## dataSplitFunction
The data split function takes the following inputs:
<ul>
    <li> df: the dataframe to split (should be pre-cleaned), type dataframe </li>
    <li> alpha: the proportion of the data to be held out, type numeric between 0 and 1 </li>
    <li> seed: the random seed used to make the split reproducible, type int (default 447) </li>
</ul>

The data split function returns the following output:
<ul>
    <li> iholdout: the indexes of df which belong to the holdout set </li>
</ul>
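The data is split randomly: the row indices are shuffled and the first round(alpha * N) of them form the holdout set, where N is the number of rows in df.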

In [20]:
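#define the splitting function (read above for documentation)
dataSplit <- function(df, alpha, seed = 447){
    set.seed(seed)                 # use the seed argument rather than a hard-coded value
    N = nrow(df)                   # total number of rows
    n = round(alpha*N)             # number of rows to hold out
    index = sample(N)              # random permutation of 1..N
    iholdout = index[seq_len(n)]   # first n shuffled indices (empty when n is 0)
    return(iholdout)
}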
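
As a hedged illustration of the shuffle-and-slice approach used in `dataSplit()`, the sketch below applies the same steps to a small made-up data frame. The names `toy` and `splitSketch` are hypothetical and exist only for this example; they are not part of the notebook.

```r
# Toy data frame standing in for the cleaned vehicle data (made-up values)
toy <- data.frame(id = 1:10, price = seq(1000, 10000, by = 1000))

# Hypothetical stand-in that mirrors the steps of dataSplit()
splitSketch <- function(df, alpha, seed = 447) {
    set.seed(seed)            # fix the random state for reproducibility
    N <- nrow(df)             # total number of rows
    n <- round(alpha * N)     # number of rows to hold out
    sample(N)[seq_len(n)]     # first n indices of a random permutation
}

iToy <- splitSketch(toy, alpha = 0.2)
holdoutToy <- toy[iToy, ]     # rows selected for the holdout set
sampleToy <- toy[-iToy, ]     # all remaining rows

# Sizes of the two sets
print(nrow(holdoutToy))
print(nrow(sampleToy))

# Calling again with the same seed reproduces the same indices
print(identical(iToy, splitSketch(toy, alpha = 0.2)))
```

Because the seed is set inside the function, repeated calls with the same arguments return the same holdout indices, which is what keeps the holdout set stable across notebook runs.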

## Applying dataSplit and saving the files

The function above is called and the resulting sets are saved to file. To prevent data leakage, the holdout file is not accessed until the final model evaluations.

In [21]:
#generate the holdout indices
iholdout = dataSplit(cleanVehicles,alpha = 0.2)
#split cleanVehicles into the holdout set and the remaining sample set
holdoutVehicles = cleanVehicles[iholdout,]
sampleVehicles = cleanVehicles[-iholdout,]

Check the number of rows in each set to verify that `dataSplit()` works correctly.


In [23]:
#get number of rows of sampleVehicles
print(nrow(sampleVehicles))

[1] 195064


In [24]:
#get number of rows of holdoutVehicles
print(nrow(holdoutVehicles))

[1] 48766
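
The two printed counts can be cross-checked with a little arithmetic. The sketch below is not part of the original notebook; it only reuses the row counts printed above and the `alpha = 0.2` passed to `dataSplit()`, and it assumes that every row of `cleanVehicles` lands in exactly one of the two sets.

```r
# Row counts as printed by the two cells above
nSample <- 195064
nHoldout <- 48766

# Total number of rows, assuming the two sets partition cleanVehicles
N <- nSample + nHoldout

# The holdout size should equal round(alpha * N) with alpha = 0.2
stopifnot(nHoldout == round(0.2 * N))

# Share of rows that were held out
print(nHoldout / N)
```

If the `stopifnot()` call passes silently, the holdout size matches what `dataSplit()` is expected to produce for this proportion.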


In [25]:
#save the files
write.csv(sampleVehicles, '03a-vehicles-sample.csv',row.names=FALSE)
write.csv(holdoutVehicles, '03b-vehicles-holdout.csv',row.names=FALSE)
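
One possible way to confirm that a file was written as expected is to read it back and compare it with the data frame in memory. The sketch below uses a temporary file and a made-up data frame (`toy` and `tmpPath` are hypothetical names), so it does not touch the notebook's actual output files, in particular the holdout file.

```r
# Made-up data frame standing in for one of the split sets
toy <- data.frame(id = 1:5, model = c('a', 'b', 'c', 'd', 'e'))

# Write to a temporary location; row.names = FALSE avoids an extra index column
tmpPath <- tempfile(fileext = '.csv')
write.csv(toy, tmpPath, row.names = FALSE)

# Read the file back and compare shape and column names
roundTrip <- read.csv(tmpPath)
print(nrow(roundTrip) == nrow(toy))
print(identical(names(roundTrip), names(toy)))

# Remove the temporary file
unlink(tmpPath)
```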
     