# Machine Learning in R: part 2

## This week focuses on improvement. We will make our data cleaning code more efficient and then use some tuning methods to improve the predictive ability of the models we build.


## Step 1a. Cleaning and formatting the data (from last week)

Below is the code from the loading, cleaning and train/test split sections we went over last week. This is all we require to get the data into the format needed to begin training some machine learning models.

In [5]:
library(tidyverse)

housing = read.csv('./housing.csv')

housing$total_bedrooms[is.na(housing$total_bedrooms)] = median(housing$total_bedrooms , na.rm = TRUE)

housing$mean_bedrooms = housing$total_bedrooms/housing$households
housing$mean_rooms = housing$total_rooms/housing$households

drops = c('total_bedrooms', 'total_rooms')

housing = housing[ , !(names(housing) %in% drops)]

categories = unique(housing$ocean_proximity)
#split the categories off
cat_housing = data.frame(ocean_proximity = housing$ocean_proximity)

for(cat in categories){
    cat_housing[,cat] = rep(0, times= nrow(cat_housing))
}

for(i in 1:length(cat_housing$ocean_proximity)){
    cat = as.character(cat_housing$ocean_proximity[i])
    cat_housing[,cat][i] = 1
}

cat_columns = names(cat_housing)
keep_columns = cat_columns[cat_columns != 'ocean_proximity']
cat_housing = select(cat_housing,one_of(keep_columns))
drops = c('ocean_proximity','median_house_value')
housing_num =  housing[ , !(names(housing) %in% drops)]


scaled_housing_num = scale(housing_num)

cleaned_housing = cbind(cat_housing, scaled_housing_num, median_house_value=housing$median_house_value)


set.seed(19) # Set a random seed so that same sample can be reproduced in future runs

sample = sample.int(n = nrow(cleaned_housing), size = floor(.8*nrow(cleaned_housing)), replace = F)
train = cleaned_housing[sample, ] #just the samples
test  = cleaned_housing[-sample, ] #everything but the samples


train_y = train[,'median_house_value']
train_x = train[, names(train) !='median_house_value']

test_y = test[,'median_house_value']
test_x = test[, names(test) !='median_house_value']

head(train)


Unnamed: 0,NEAR BAY,<1H OCEAN,INLAND,NEAR OCEAN,ISLAND,longitude,latitude,housing_median_age,population,households,median_income,mean_bedrooms,mean_rooms,median_house_value
2418,0,0,1,0,0,0.06473791,0.4485767,-0.05081113,-0.08342596,-0.50882695,-1.2394168,-0.0364878,-0.4145713,56700
9990,0,0,1,0,0,-0.74882545,1.6471053,-1.08374113,1.39212008,2.14071836,-0.7358959,-0.19291092,-0.1004065,143400
13440,0,0,1,0,0,1.07295753,-0.7218613,-0.05081113,0.28656434,0.06136148,0.1404495,-0.18700644,0.2732884,128300
1412,1,0,0,0,0,-1.25293526,1.0759315,0.50538194,0.35897294,0.42492199,0.6344959,-0.11581168,0.2741324,233200
7539,0,1,0,0,0,0.67865382,-0.8061329,-0.20972344,1.03802435,0.21829408,-1.0991931,-0.03247975,-0.5151724,110200
4621,0,1,0,0,0,0.62874196,-0.7265431,1.6177681,0.10024464,0.24706505,-0.6573622,-0.07763347,-0.4598522,350900
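
The two loops above build one 0/1 indicator column per category by hand. As an aside, base R's `model.matrix()` can produce the same kind of columns in a single call. The sketch below is only an illustration on made-up data (`toy_housing` is hypothetical and is not part of last week's script), so treat it as one possible shortcut rather than the method used here.

```r
# Toy stand-in for the housing data (hypothetical values)
toy_housing = data.frame(
  ocean_proximity = c("INLAND", "NEAR BAY", "INLAND", "ISLAND"),
  stringsAsFactors = TRUE
)

# "- 1" drops the intercept so every category gets its own 0/1 column
one_hot = model.matrix(~ ocean_proximity - 1, data = toy_housing)

# model.matrix prefixes each column with the variable name; strip it off
colnames(one_hot) = sub("^ocean_proximity", "", colnames(one_hot))

one_hot
```

Each row should contain exactly one 1, which is the same property the loop version relies on.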


## Step 1b. Cleaning - The tidyverse way! 

The code below does the same thing as the code above, but employs the tidyverse. I've pulled out the bare-bones parts of Karl's 'Housing_R_tidy.r' script needed to get the data to where we want it (i.e. removed all the graphs, head commands, etc.). Go to his original script to see the notes and visual blandishments.

I like this code more than my original version because:
1. It is easy to follow the workflow. magrittr makes it easy to see when one cleaning task ends and the next begins.
2. It is more concise.
3. The use of comments(#) after the pipes(%>%) looks professional and also makes the code more readable. Being able to share your code with others and have them understand it is very important!
4. It runs faster.

Note: the tibble docs list the function as as_tibble(), not as.tibble() as first written; as.tibble() is a deprecated alias. Oddly, as.tibble() worked in my normal R deployment but threw an error in the Jupyter notebook :\ This confused me and I don't know what to make of it.

In [15]:
library(tidyverse)

housing.tidy = read_csv('housing.csv')

housing.tidy = housing.tidy %>% 
  mutate(total_bedrooms = ifelse(is.na(total_bedrooms), 
                                 median(total_bedrooms, na.rm = T),
                                 total_bedrooms),
         mean_bedrooms = total_bedrooms/households,
         mean_rooms = total_rooms/households) %>%
  select(-c(total_rooms, total_bedrooms))


categories = unique(housing.tidy$ocean_proximity) # all categories

cat_housing.tidy = categories %>% # compare the full vector against each category consecutively
  lapply(function(x) as.numeric(housing.tidy$ocean_proximity == x)) %>% # convert to numeric
  do.call("cbind", .) %>% as_tibble() # clean up
colnames(cat_housing.tidy) = categories # make nice column names

cleaned_housing.tidy = housing.tidy %>% 
  select(-c(ocean_proximity, median_house_value)) %>%
  scale() %>% as_tibble() %>%
  bind_cols(cat_housing.tidy) %>%
  add_column(median_house_value = housing.tidy$median_house_value)

set.seed(19) # Set a random seed so that same sample can be reproduced in future runs

sample = sample.int(n = nrow(cleaned_housing.tidy), size = floor(.8*nrow(cleaned_housing.tidy)), replace = F)
train = cleaned_housing.tidy[sample, ] #just the samples
test  = cleaned_housing.tidy[-sample, ] #everything but the samples
      
head(train)


Parsed with column specification:
cols(
  longitude = col_double(),
  latitude = col_double(),
  housing_median_age = col_double(),
  total_rooms = col_double(),
  total_bedrooms = col_double(),
  population = col_double(),
  households = col_double(),
  median_income = col_double(),
  median_house_value = col_double(),
  ocean_proximity = col_character()
)


longitude,latitude,housing_median_age,population,households,median_income,mean_bedrooms,mean_rooms,NEAR BAY,<1H OCEAN,INLAND,NEAR OCEAN,ISLAND,median_house_value
0.06473791,0.4485767,-0.05081113,-0.08342596,-0.50882695,-1.2394168,-0.0364878,-0.4145713,0,0,1,0,0,56700
-0.74882545,1.6471053,-1.08374113,1.39212008,2.14071836,-0.7358959,-0.19291092,-0.1004065,0,0,1,0,0,143400
1.07295753,-0.7218613,-0.05081113,0.28656434,0.06136148,0.1404495,-0.18700644,0.2732884,0,0,1,0,0,128300
-1.25293526,1.0759315,0.50538194,0.35897294,0.42492199,0.6344959,-0.11581168,0.2741324,1,0,0,0,0,233200
0.67865382,-0.8061329,-0.20972344,1.03802435,0.21829408,-1.0991931,-0.03247975,-0.5151724,0,1,0,0,0,110200
0.62874196,-0.7265431,1.6177681,0.10024464,0.24706505,-0.6573622,-0.07763347,-0.4598522,0,1,0,0,0,350900
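
Both cleaning routes lean on `scale()` to put the numeric columns on a common footing. With its default arguments it subtracts each column's mean and divides by its standard deviation. A minimal sketch on a made-up vector (the numbers are arbitrary) to show the arithmetic:

```r
x = c(2, 4, 6, 8)  # arbitrary example values

# standardize by hand: subtract the mean, divide by the standard deviation
manual = (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for comparison
scaled = as.numeric(scale(x))

all.equal(manual, scaled)
```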


In [None]:

########
# Random Forest Model
########
library(randomForest) # provides randomForest(); must be loaded before the call below
rf_model = randomForest(train_x, y = train_y , ntree = 500, importance = TRUE)

names(rf_model) #these are all the different things you can call from the model.

importance_dat = rf_model$importance
importance_dat

sorted_predictors = sort(importance_dat[,1], decreasing=TRUE)
sorted_predictors

oob_prediction = predict(rf_model) # with no new data supplied, predict() returns the out-of-bag (OOB) predictions

# you may have noticed that the OOB error is also available via $mse in the model object,
# but this way we learn stuff!
train_mse = mean(as.numeric((oob_prediction - train_y)^2))
oob_rmse = sqrt(train_mse)
oob_rmse


y_pred_rf = predict(rf_model , test_x)
test_mse = mean(((y_pred_rf - test_y)^2))
test_rmse = sqrt(test_mse)
test_rmse # ~48620
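
The same mse-then-square-root arithmetic comes up for every model in this notebook. One way to avoid retyping it is a tiny helper; the function name `rmse` is my own choice here and is not defined anywhere in the original script.

```r
# hypothetical helper: root mean squared error of predictions vs. actual values
rmse = function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# small made-up example: only the third prediction is off (by 2)
rmse(c(1, 2, 3), c(1, 2, 5))
```

With this in place, a line such as `rmse(y_pred_rf, test_y)` would stand in for the three-line pattern used above.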



In [None]:

######
# XGBoost
######
# http://cran.fhcrc.org/web/packages/xgboost/vignettes/xgboost.pdf
library(xgboost)

#put into the xgb matrix format
dtrain = xgb.DMatrix(data =  as.matrix(train_x), label = train_y )
dtest = xgb.DMatrix(data =  as.matrix(test_x), label = test_y)

# these are the datasets the rmse is evaluated for at each iteration
watchlist = list(train=dtrain, test=dtest)

# try 1 - a set of parameters I know work pretty well generally

bst = xgb.train(data = dtrain, 
                max.depth = 8, 
                eta = 0.3, 
                nthread = 2, 
                nround = 1000, 
                watchlist = watchlist, 
                objective = "reg:linear", # recent xgboost releases rename this objective to "reg:squarederror"
                early_stopping_rounds = 50)



In [None]:
# try a 'slower learning' model. The up and down weights for each iteration are smaller
# we also use more iterations

bst_slow = xgb.train(data = dtrain, 
                        max.depth=6, 
                        eta = 0.01, 
                        nthread = 2, 
                        nround = 10000, 
                        watchlist = watchlist, 
                        objective = "reg:linear", 
                        early_stopping_rounds = 50)

# note the best iteration is not the last iteration. 
XGBoost_importance = xgb.importance(feature_names = names(train_x), model = bst_slow)
XGBoost_importance[1:10]

# rmse: 45225.968750 # max.depth=5
# last week: $48620 with random forest
# an improvement of ~$3400 in average error. Wait! The test set was in the watchlist, so early stopping
# picked the number of rounds that scores best on the test set itself (a form of overfitting).
# We need to work with a validation set, then only at the end evaluate the model performance against the test set.
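
To make the "best iteration is not the last iteration" remark concrete, here is a simplified sketch of the idea behind `early_stopping_rounds`: keep going until the watched error has failed to improve for a set number of rounds, then report the best round. The function `early_stop` and the error values are made up for illustration; this is not xgboost's internal code.

```r
# hypothetical helper: given the per-round error on the watched dataset,
# return the round where training would stop and the best round seen
early_stop = function(valid_errors, patience) {
  best_round = 1
  for (i in seq_along(valid_errors)) {
    if (valid_errors[i] < valid_errors[best_round]) {
      best_round = i              # new best score, reset the wait
    }
    if (i - best_round >= patience) {
      return(list(stopped_at = i, best_round = best_round))
    }
  }
  # ran out of rounds before the patience was exhausted
  list(stopped_at = length(valid_errors), best_round = best_round)
}

# made-up errors: improve until round 4, then drift upwards
errors = c(10, 8, 7, 6.5, 6.6, 6.7, 6.8, 6.9)
early_stop(errors, patience = 3)
```

Whichever dataset supplies those errors is effectively being used to tune the model, which is why it matters whether it is the test set or a separate validation set.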




In [None]:

####
# Proper use - validation set
####

#validation set - another subset of our data that is withheld from the training algorithm, but compared against at each iteration to see how well the model predicts data it was not trained on

#make validation set

set.seed(19) # Set a random seed so that same sample can be reproduced in future runs

sample = sample.int(n = nrow(train), size = floor(.8*nrow(train)), replace = F)

train_t = train[sample, ] #just the samples
valid  = train[-sample, ] #everything but the samples

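# [[ ]] returns a plain vector whether the data is a data.frame or a tibble
train_y = train_t[['median_house_value']]
train_x = train_t[, names(train_t) != 'median_house_value']

valid_y = valid[['median_house_value']]
valid_x = valid[, names(valid) != 'median_house_value']

# rebuild the test features from the same table so their column order matches train_x
test_y = test[['median_house_value']]
test_x = test[, names(test) != 'median_house_value']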

gb_train = xgb.DMatrix(data = as.matrix(train_x), label = train_y )
gb_valid = xgb.DMatrix(data = as.matrix(valid_x), label = valid_y)

# train xgb, evaluating against the validation
watchlist = list(train = gb_train, valid = gb_valid)

bst_slow = xgb.train(data= gb_train, 
                        max.depth = 10, 
                        eta = 0.01, 
                        nthread = 2, 
                        nround = 10000, 
                        watchlist = watchlist, 
                        objective = "reg:linear", 
                        early_stopping_rounds = 50)

# this errors, predict() needs the matrix format (uncomment to see it fail)
# y_hat = predict(bst_slow, test_x)

# recall we ran the following to get the test data in the right format:
# dtest = xgb.DMatrix(data =  as.matrix(test_x), label = test_y)
# here I have it with the label taken off, just to remind us it's external data.
# xgb would ignore the label during predictions anyway
dtest = xgb.DMatrix(data =  as.matrix(test_x))

#test the model on truly external data

y_hat_valid = predict(bst_slow, dtest)

test_mse = mean(((y_hat_valid - test_y)^2))
test_rmse = sqrt(test_mse)
test_rmse 
# ~47507.09 This is higher than on the first run through, but we can be confident that the score
# is not inflated by overfitting, thanks to our use of a validation set! This is evidence that a lower rmse
# isn't necessarily better: we now have more confidence in external predictions.
# A 2.3% improvement over a basic random forest... is it worth the effort?
# The answer to this question always depends on the purpose of the model.



In [None]:

###
# Grid search first principles 
###

max.depths = c(3, 5, 7, 9)
etas = c(0.01, 0.001, 0.0001)

best_params = 0
best_score = 0

count = 1
for( depth in max.depths ){
    for( num in etas){

        bst_grid = xgb.train(data = gb_train, 
                                max.depth = depth, 
                                eta=num, 
                                nthread = 2, 
                                nround = 10000, 
                                watchlist = watchlist, 
                                objective = "reg:linear", 
                                early_stopping_rounds = 50, 
                                verbose=0)

        if(count == 1){
            best_params = bst_grid$params
            best_score = bst_grid$best_score
            count = count + 1
            }
        else if( bst_grid$best_score < best_score){
            best_params = bst_grid$params
            best_score = bst_grid$best_score
        }
    }
}

best_params
best_score
#valid-rmse: 47033.28

# max_depth of 9, eta of 0.01
bst_tuned = xgb.train( data = gb_train, 
                        max.depth = 9, 
                        eta = 0.01, 
                        nthread = 2, 
                        nround = 10000, 
                        watchlist = watchlist, 
                        objective = "reg:linear", 
                        early_stopping_rounds = 50)

y_hat_xgb_grid = predict(bst_tuned, dtest)

test_mse = mean(((y_hat_xgb_grid - test_y)^2))
test_rmse = sqrt(test_mse)
test_rmse # test-rmse: 46675
# By tuning the hyperparameters we have moved to a 4% improvement over random forest
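
The nested loops above are the grid search in its rawest form. The same bookkeeping can be sketched with `expand.grid()` and `which.min()`. In the sketch below, `toy_score` is a made-up stand-in for "train a model and return its validation rmse", so the numbers mean nothing; only the pattern carries over.

```r
# made-up stand-in for training a model and reading off its validation score
toy_score = function(depth, eta) {
  (depth - 7)^2 + (log10(eta) + 2)^2
}

# every pairwise combination of the candidate values
grid = expand.grid(max_depth = c(3, 5, 7, 9), eta = c(0.01, 0.001, 0.0001))

# score each row of the grid
grid$score = mapply(toy_score, grid$max_depth, grid$eta)

# the row with the lowest score wins
best = grid[which.min(grid$score), ]
best
```

Keeping the scores for the whole grid, rather than only the running best, also makes it easy to see how sensitive the result is to each hyperparameter.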




In [None]:

#######
# tweak the hyperparameters using a grid search
# The caret package (short for classification and regression training)
######

library(caret) 

# look up the model we are running to see the parameters
modelLookup("xgbLinear")
 
# set up all the pairwise combinations

xgb_grid_1 = expand.grid(nrounds = c(1000,2000,3000,4000) ,
                            eta = c(0.01, 0.001, 0.0001),
                            lambda = 1,
                            alpha = 0)
xgb_grid_1


#here we do one better than a validation set, we use cross validation to 
#expand the amount of info we have!
xgb_trcontrol_1 = trainControl(method = "cv",
                                number = 5,
                                verboseIter = TRUE,
                                returnData = FALSE,
                                returnResamp = "all", 
                                allowParallel = TRUE)


######
#below a grid-search, cross-validation xgboost model in caret
######
# train the model for each parameter combination in the grid, using CV to evaluate on multiple folds. Make sure your laptop is plugged in or RIP battery.

# note how this is now the caret train function
?train

xgb_train_1 = train(x = as.matrix(train_x),
					y = train_y,
					trControl = xgb_trcontrol_1,
					tuneGrid = xgb_grid_1,
					method = "xgbLinear",
					max.depth = 5)

names(xgb_train_1)
xgb_train_1$bestTune
xgb_train_1$method
summary(xgb_train_1)


#alternatively, you can 'narrow in' on the best parameters by taking a range of options around the best values found and seeing if high resolution tweaks can provide even further improvements.

xgb_cv_yhat = predict(xgb_train_1 , as.matrix(test_x))


test_mse = mean(((xgb_cv_yhat - test_y)^2))
test_rmse = sqrt(test_mse)
test_rmse # 46641... pretty close to the 'by hand' grid search!

#Cam's hypothesis - we are not using 'early stopping rounds' here, so the model isn't cutting out at the exact best point.
#Re-running this with a validation setup as opposed to a cv setup would allow us to implement a grid search efficiently
#and wind up with the best hyperparameters. I shall leave this as a follow-up exercise for the curious.
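
For anyone wondering what `method = "cv"` with `number = 5` is doing behind the scenes, here is a first-principles sketch of 5-fold cross-validation. It uses a plain `lm()` on simulated data (everything named `toy` is made up) so that it runs without caret or xgboost; it illustrates the idea and is not caret's implementation.

```r
set.seed(19)
n = 100
toy = data.frame(x = runif(n))
toy$y = 3 * toy$x + rnorm(n, sd = 0.5)  # simulated linear relationship plus noise

k = 5
# randomly assign each row to one of k equally sized folds
fold = sample(rep(1:k, length.out = n))

fold_rmse = sapply(1:k, function(f) {
  fit = lm(y ~ x, data = toy[fold != f, ])          # train on the other k-1 folds
  pred = predict(fit, newdata = toy[fold == f, ])   # predict the held-out fold
  sqrt(mean((pred - toy$y[fold == f])^2))
})

fold_rmse        # one rmse per held-out fold
mean(fold_rmse)  # the cross-validated estimate
```

Every row gets used for validation exactly once, which is where the extra information relative to a single validation split comes from.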



In [None]:

########
# Ensemble the models together, 
# strategy for when accuracy is more important than knowing the best predictors
########


y_pred_rf #random forest
y_hat_valid #xgBoost with validation
y_hat_xgb_grid #xgBoost grid search
xgb_cv_yhat #xgBoost caret cross validation

length(y_hat_xgb_grid)


blend_pred = (y_hat_valid * .25) + (y_pred_rf * .25) + (xgb_cv_yhat * .25) + (y_hat_xgb_grid * .25)
length(blend_pred)

length(blend_pred) == length(y_hat_xgb_grid)

blend_test_mse = mean(((blend_pred - test_y)^2))
blend_test_rmse = sqrt(blend_test_mse)
blend_test_rmse # 45205 by averaging just 4 predictors we have dropped the rmse a few percent lower than the best scoring of the 4 models.
# This does come at a cost though, we now can't make accurate inferences about the best predictors!

#next step - you can grid search the weights of the ensemble to try and drop the rmse further!
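
A rough sketch of what that weight search could look like, using two simulated "models" instead of the four real ones (all of the data below is made up). In a real run the weights should be picked on a validation set and only then scored on the test set, for the same reason discussed earlier.

```r
set.seed(19)
n = 200
truth = rnorm(n, mean = 100, sd = 20)

# two made-up sets of predictions with independent errors
pred_a = truth + rnorm(n, sd = 10)
pred_b = truth + rnorm(n, sd = 10)

rmse = function(p, a) sqrt(mean((p - a)^2))

# w is the weight on model a; model b gets the remainder
weights = seq(0, 1, by = 0.05)
blend_rmse = sapply(weights, function(w) rmse(w * pred_a + (1 - w) * pred_b, truth))

best_w = weights[which.min(blend_rmse)]
best_w
min(blend_rmse)
```

With more than two models the grid grows quickly, so a coarse grid followed by a finer one around the best weights is a reasonable way to keep the search manageable.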

