# Predicting Forest Cover Types from Cartographic Variables

[Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype) (aka "forest cover") is a classic dataset for benchmarking multi-class, non-linear classifiers. The data consist of 54 variables and 581,012 observations. There are 7 classes, some of which are small minority classes.

A [1999 paper](http://gis.fs.fed.us/rm/ogden/research/publications/downloads/journals/1999_compag_blackard.pdf) applies ANNs to achieve about 70% accuracy (while burning through 45 hours of compute for each run). More recently, Wise.io [benchmarked](http://www.wise.io/blog/benchmarking-random-forest-part-1) their Random Forest implementation and achieved 97.4% accuracy with a 50-tree forest.

In [15]:
using DataFrames, XGBoost, Resampling

In [16]:
import DecisionTree.confusion_matrix

In [19]:
df = readtable("/users/arshakn/Dropbox/data/covtype.txt");

In [20]:
# shift class labels to start from zero (XGBoost expects labels in 0..num_class-1)
for i in 1:nrow(df)
    df[i,:cover_type] -= 1
end

In [21]:
# randomly split the data 80/20 for train and test
train, test = splitrandom(df,0.8);
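
The random split can be pictured as shuffling the row indices and cutting them at the requested fraction. The following is only a minimal sketch in plain Julia 1.x: `split_indices` is a hypothetical helper written for illustration, and whether `splitrandom` from Resampling works exactly this way is an assumption.

```julia
using Random

# Hypothetical helper (not part of Resampling): shuffle the row indices,
# then cut the shuffled list at the requested fraction.
function split_indices(n::Integer, frac::Real; rng = Random.default_rng())
    perm = randperm(rng, n)        # random permutation of 1:n
    cut  = round(Int, frac * n)    # number of rows that go to the training set
    return perm[1:cut], perm[cut+1:end]
end

# 581,012 is the number of observations quoted in the introduction.
train_idx, test_idx = split_indices(581_012, 0.8)
println(length(train_idx), " training rows, ", length(test_idx), " test rows")
```

With rounding, an 80/20 cut of 581,012 rows leaves 116,202 rows for testing, which agrees with the length of the prediction vector shown in the Prediction Time section.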

In [22]:
dtrain = DMatrix(array(train[1:end-1]), label = array(train[:cover_type]))
dtest = DMatrix(array(test[1:end-1]), label = array(test[:cover_type]))

DMatrix(Ptr{Void} @0x00007ff03f18c930,_setinfo)

In [30]:
watchlist = [(dtest,"eval"), (dtrain,"train")]

2-element Array{(DMatrix,ASCIIString),1}:
 (DMatrix(Ptr{Void} @0x00007ff03f18c930,_setinfo),"eval") 
 (DMatrix(Ptr{Void} @0x00007ff03f157d60,_setinfo),"train")

In [37]:
num_round = 5
bst = xgboost(dtrain, num_round, eta=0.5, max_depth=20, colsample_bytree=0.5, watchlist=watchlist, num_class=7, objective="multi:softmax")

[1]	eval-merror:0.143982	train-merror:0.098154
[2]	eval-merror:0.130161	train-merror:0.079047
[3]	eval-merror:0.096169	train-merror:0.045038
[4]	eval-merror:0.079740	train-merror:0.027575
[5]	eval-merror:0.071625	train-merror:0.018935


Booster(Ptr{Void} @0x00007ff001e9cfb0)
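
The `merror` values in the log are multi-class error rates, that is, the fraction of rows whose predicted class differs from the true class. A minimal sketch in plain Julia 1.x follows; the `merror` function here is a hypothetical re-implementation for illustration, and the labels are made up rather than taken from the dataset.

```julia
# Multi-class error rate: share of predictions that differ from the true label.
# This is an illustrative helper, not the XGBoost implementation.
merror(labels, preds) = count(labels .!= preds) / length(labels)

# Toy labels and predictions (invented for this example).
labels = [0, 1, 1, 2, 6, 4]
preds  = [0, 1, 2, 2, 6, 1]

println(merror(labels, preds))
```

Read this way, the final `eval-merror` of 0.071625 corresponds to a test accuracy of about 92.8%.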

In [38]:
labels = get_info(dtest, "label")
preds = predict(bst, dtest);

In [39]:
cm = confusion_matrix(int(labels),int(preds))

7x7 Array{Int64,2}:
 38872   3497     0    0    16     4    64
  2009  54283   168    1    69    81    12
     2    243  6592   32     7   181     0
     0      2    78  465     0    26     0
    26    688    15    0  1129     2     1
    10    252   456   14     4  2778     0
   328     34     0    0     1     0  3760

Classes:  [0,1,2,3,4,5,6]
Matrix:   
Accuracy: 0.9283747267689024
Kappa:    0.8839009035267643
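
The accuracy and kappa figures can be re-derived from the matrix itself. The sketch below, in plain Julia 1.x, copies the matrix from the output above and uses the standard Cohen's kappa formula. The per-class recall at the end assumes that rows hold the true classes and columns the predicted ones, which is an assumption about the `confusion_matrix` layout.

```julia
# Confusion matrix copied from the output above.
cm = [38872  3497    0   0   16    4   64;
       2009 54283  168   1   69   81   12;
          2   243 6592  32    7  181    0;
          0     2   78 465    0   26    0;
         26   688   15   0 1129    2    1;
         10   252  456  14    4 2778    0;
        328    34    0   0    1    0 3760]

n       = sum(cm)                              # total number of test rows
row_tot = vec(sum(cm, dims = 2))               # row totals
col_tot = vec(sum(cm, dims = 1))               # column totals
correct = [cm[i, i] for i in 1:size(cm, 1)]    # diagonal: correctly classified rows

accuracy = sum(correct) / n                    # observed agreement
expected = sum(row_tot .* col_tot) / n^2       # agreement expected by chance
kappa    = (accuracy - expected) / (1 - expected)

# Per-class recall, assuming rows are the true classes.
recall = correct ./ row_tot

println("accuracy = ", accuracy)
println("kappa    = ", kappa)
println("recall   = ", round.(recall, digits = 3))
```

Under that row-as-truth assumption, the smaller classes are recovered less reliably than the two dominant ones, which the overall accuracy hides.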

In [40]:
# write the trained trees to a human-readable text file
dump_model(bst, "boost-5colsample5.raw.txt")

## Accuracy
The 5-round run above reaches a test error of 7.16% (92.8% accuracy). After 500 rounds of boosting, the test error drops to 2.50%, slightly better than the benchmark's 2.6% (97.4% accuracy).

## Prediction Time

In [35]:
@time predict(bst, array(test[1:end-1]))

elapsed time: 0.48896137 seconds (206807720 bytes allocated, 17.70% gc time)


116202-element Array{Float32,1}:
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 ⋮  
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 4.0

In [36]:
# time a prediction for a single row (the first test observation)
@time predict(bst, array(test[1, 1:end-1]))

elapsed time: 0.000553814 seconds (45512 bytes allocated)


1-element Array{Float32,1}:
 0.0
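
The two timings can be put on a common per-row footing. The sketch below, in plain Julia 1.x, only does arithmetic on the numbers printed above. They are single `@time` measurements (the batch figure includes garbage-collection time), so the result is a rough indication rather than a benchmark.

```julia
# Timings and row count copied from the two @time outputs above.
batch_seconds  = 0.48896137     # predicting all test rows in one call
batch_rows     = 116_202
single_seconds = 0.000553814    # predicting one row in one call

per_row_batch  = batch_seconds / batch_rows      # amortized cost per row
overhead_ratio = single_seconds / per_row_batch  # single call vs amortized cost

println("batch: ", round(per_row_batch * 1e6, digits = 2), " microseconds per row")
println("a single-row call costs about ", round(Int, overhead_ratio), " times that")
```

The batch call works out to a few microseconds per row, while a single-row call takes about half a millisecond, so most of the single-row cost appears to be fixed per-call overhead rather than tree evaluation.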