### Model 4: XGBoost

For Part 3, we use XGBoost as the learning procedure.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM), which solves many data science problems quickly and accurately.

In [None]:
install.packages("xgboost")
install.packages("Matrix")
install.packages("MatrixModels")
install.packages("data.table")

library(xgboost)
library(Matrix)
library(MatrixModels)
library(data.table)

In [None]:
# Drop the identifier and the target, then coerce every column to numeric
train_xgb <- subset(data, select = -c(id, status_group))
train_xgb <- as.matrix(as.data.frame(lapply(train_xgb, as.numeric)))
# Labels as numeric codes (1-based if status_group is a factor)
label_xgb <- as.numeric(data$status_group)
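
XGBoost's multiclass objectives expect integer class labels that start at 0. The sketch below uses a made-up three-level factor rather than the project data, and assumes that `status_group` is stored as a factor. It shows that `as.numeric()` returns 1-based codes, which is presumably why `num_class = 4` is used further down; shifting the codes by one is a common alternative.

```r
# Hypothetical labels standing in for data$status_group (assumed to be a factor)
status <- factor(c("functional", "non functional",
                   "functional needs repair", "functional"))

# as.numeric() returns the 1-based level codes (levels are sorted alphabetically)
codes_one_based <- as.numeric(status)
print(codes_one_based)

# Shifting to 0-based codes lets num_class equal the true number of classes
codes_zero_based <- codes_one_based - 1
print(range(codes_zero_based))
```

With 0-based codes, `num_class` would be 3 and the mapping from predictions back to labels would need to be shifted accordingly.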

xgb.DMatrix

XGBoost can group the data and its metadata (such as the labels) in an xgb.DMatrix object. This is useful for the more advanced features.

In [None]:
train.DMatrix <- xgb.DMatrix(data = train_xgb,label = label_xgb, missing = NA)

objective = "multi:softmax": sets XGBoost to do multiclass classification using the softmax objective.

num_class: the number of classes.

nrounds: the maximum number of boosting iterations.

nfold: the dataset is randomly partitioned into nfold equal-sized subsamples.

early_stopping_rounds: an integer k; training stops if performance on the validation set does not improve for k rounds.

booster: which booster to use (here gbtree, i.e. tree-based boosting).
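
To make nfold and early_stopping_rounds more concrete, here is a minimal base R sketch of the two ideas. It is not XGBoost's internal implementation: the row count, the error values, and the helper `early_stop_round()` are all made up for illustration.

```r
set.seed(42)   # reproducible illustration
n <- 20        # hypothetical number of rows
nfold <- 4

# Randomly assign each row to one of nfold equal-sized folds
fold_id <- sample(rep(seq_len(nfold), length.out = n))
print(table(fold_id))

# Hypothetical helper: stop once the error has not improved for k rounds
early_stop_round <- function(errors, k) {
  best <- Inf
  best_iter <- 0
  for (i in seq_along(errors)) {
    if (errors[i] < best) {
      best <- errors[i]
      best_iter <- i
    } else if (i - best_iter >= k) {
      return(list(stopped_at = i, best_iter = best_iter))
    }
  }
  list(stopped_at = length(errors), best_iter = best_iter)
}

# Made-up validation errors, one per boosting round
res <- early_stop_round(c(0.30, 0.25, 0.24, 0.26, 0.25, 0.27, 0.23), k = 3)
print(res)
```

In this toy run the last value (0.23) is never reached, because three rounds pass without improving on round 3. This is the trade-off behind choosing k: a small k saves time but may stop before a later improvement.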

In [None]:
# 4-fold cross-validation; the metric is passed as eval_metric (merror = multiclass error rate)
xgb.tab <- xgb.cv(data = train.DMatrix, objective = "multi:softmax", booster = "gbtree",
                  nrounds = 500, nfold = 4, early_stopping_rounds = 10, num_class = 4, maximize = FALSE,
                  eval_metric = "merror", eta = .2, max_depth = 12, colsample_bytree = .4)
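
One way to choose nrounds for a final model is to take the round with the lowest mean test error from the cross-validation log. The sketch below works on a made-up data frame that stands in for `xgb.tab$evaluation_log`; the assumption that the log has `iter` and `test_merror_mean` columns should be checked against the installed version of the package.

```r
# Stand-in for xgb.tab$evaluation_log; the numbers are made up
eval_log <- data.frame(
  iter = 1:6,
  test_merror_mean = c(0.31, 0.27, 0.25, 0.24, 0.245, 0.25)
)

# Round with the lowest mean test error across folds
best_row <- eval_log[which.min(eval_log$test_merror_mean), ]
best_nrounds <- best_row$iter
print(best_nrounds)
```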

XGBoost has several features for inspecting the learning progress. They help in choosing good parameters, which is key to model quality. One of the simplest ways to see the training progress is to set the verbose option.

verbose = 1: print the evaluation metric at each round.

In [None]:
model4 <- xgboost(data = train.DMatrix, objective = "multi:softmax", booster = "gbtree",
                  eval_metric = "merror", nrounds = 33,
                  num_class = 4, eta = .2, max_depth = 14, colsample_bytree = .4, verbose = 1)

In [None]:
testing <- test[ , -which(names(test) %in% c("id"))]
test_xgb <- as.matrix(as.data.frame(lapply(testing, as.numeric)))

In [None]:
# Predict numeric class codes, then map them back to the status labels
predict4 <- predict(model4, test_xgb)
predict4[predict4 == 1] <- "functional"
predict4[predict4 == 2] <- "functional needs repair"
predict4[predict4 == 3] <- "non functional"
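
The same mapping can also be written as a single lookup, which avoids comparing a vector that has already been partly converted to text. This is a sketch with made-up prediction codes; `class_levels` is a hypothetical name, and the order assumes the 1-based coding used above.

```r
# Class labels in the order of their 1-based numeric codes (assumed)
class_levels <- c("functional", "functional needs repair", "non functional")

# Made-up predicted codes standing in for the output of predict()
pred_codes <- c(1, 3, 2, 1)

# Index the label vector by the predicted codes
pred_labels <- class_levels[pred_codes]
print(pred_labels)
```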


Check the prediction.

In [None]:
table(predict4)
predict4

Before interpreting feature importance, it is worth checking whether the dataset contains highly correlated features, because importance can be split among them. XGBoost can compute importance in several different ways. In this project, the gain type is used: in the R package, xgb.importance reports Gain as each feature's fractional contribution to the model, based on the total gain of the splits that use that feature.

The importance ranking shows that the features date_recorded and quantity_group have high importance values for our model.
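
As a rough illustration of gain-based importance, the sketch below sums the gain of every split that uses a feature and expresses it as a share of the total. It assumes that share-of-total definition; the split table and its numbers are made up, and `other_feature` is a hypothetical column name.

```r
# Made-up split table: one row per tree split, with the feature used and its gain
splits <- data.frame(
  feature = c("quantity_group", "date_recorded", "quantity_group", "other_feature"),
  gain    = c(40, 30, 20, 10)
)

# Total gain per feature, as a named numeric vector
total_gain <- vapply(split(splits$gain, splits$feature), sum, numeric(1))

# Each feature's share of the overall gain, largest first
gain_share <- sort(total_gain / sum(total_gain), decreasing = TRUE)
print(gain_share)
```

A feature that is used in many splits with moderate gain can therefore rank above a feature used once with a large gain.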

In [None]:
xgb_importance <- xgb.importance(feature_names = colnames(train_xgb), model = model4)
xgb.plot.importance(importance_matrix = xgb_importance)