# Predicting Forest Cover Types from Cartographic Variables

[Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype) (aka "forest cover") is a classic dataset for benchmarking multi-class, non-linear classifiers. The data consists of 54 variables and 581,012 observations. There are 7 classes, some of which are minority classes.

This [1999](http://gis.fs.fed.us/rm/ogden/research/publications/downloads/journals/1999_compag_blackard.pdf) paper applies ANNs to achieve about 70% accuracy (while burning through 45 hours of compute for each run). More recently, Wise.io [benchmarked](http://www.wise.io/blog/benchmarking-random-forest-part-1) their Random Forest implementation and achieved 97.4% accuracy with a 50-tree forest.

In [2]:
using DataFrames, XGBoost, Resampling

In [3]:
import DecisionTree.confusion_matrix

In [4]:
df = readtable("data/covtype.txt");

In [5]:
# shift class labels to start from zero (XGBoost expects labels in 0:num_class-1)
for i in 1:nrow(df)
    df[i,:cover_type] -= 1
end
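
The shift is needed because XGBoost's multi-class objectives expect labels in the range `0` to `num_class - 1`, while the Covertype classes are coded 1 through 7. As a minimal sketch (Julia 1.x syntax, on a made-up vector `raw_labels` rather than the notebook's data frame), the same shift can be written with broadcasting:

```julia
# Hypothetical label vector using the dataset's 1-based class codes (1 through 7).
raw_labels = [5, 2, 1, 7, 2, 3]

# Broadcasting subtracts 1 from every element, giving 0-based labels.
zero_based = raw_labels .- 1

# Sanity check: every label must now fall in 0:(num_class - 1).
num_class = 7
@assert all(l -> 0 <= l < num_class, zero_based)
```

Remember to add 1 back when mapping predictions to the original class codes.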

In [9]:
# randomly split the data 80/20 for train and test
train, test = splitrandom(df,0.8);
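
For readers without the `Resampling` package, a random 80/20 split can be sketched with the standard library alone. The helper `split_indices` below is hypothetical and is not how `splitrandom` is necessarily implemented; it shuffles the row indices and cuts them at the requested fraction (Julia 1.x syntax):

```julia
using Random

# Hypothetical helper, not the Resampling package's implementation:
# shuffle the row indices 1:n and cut them at the requested fraction.
function split_indices(n::Integer, frac::Real; rng = Random.default_rng())
    perm = randperm(rng, n)            # random permutation of 1:n
    n_train = round(Int, frac * n)     # number of rows assigned to the training set
    return perm[1:n_train], perm[n_train+1:end]
end

# 581,012 rows as in Covertype; the seed is arbitrary and only makes the split repeatable.
train_idx, test_idx = split_indices(581_012, 0.8; rng = MersenneTwister(42))
```

With rounding as above, the test part has 116,202 indices, which is consistent with the length of the prediction vector shown in the Prediction Time section.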

In [10]:
# features are all columns except the last one; cover_type (the last column) is the label
dtrain = DMatrix(array(train[1:end-1]), label = array(train[:cover_type]))
dtest = DMatrix(array(test[1:end-1]), label = array(test[:cover_type]))

DMatrix(Ptr{Void} @0x00007f8b184a1fa0,_setinfo)

In [11]:
watchlist = [(dtest,"eval"), (dtrain,"train")]

2-element Array{(DMatrix,ASCIIString),1}:
 (DMatrix(Ptr{Void} @0x00007f8b184a1fa0,_setinfo),"eval") 
 (DMatrix(Ptr{Void} @0x00007f8b18449060,_setinfo),"train")

In [79]:
num_round = 500
bst = xgboost(dtrain, num_round, eta=0.1, max_depth=20, watchlist=watchlist, num_class=7, objective="multi:softmax")

[1]	eval-merror:0.075790	train-merror:0.047157
[2]	eval-merror:0.065119	train-merror:0.036305
[3]	eval-merror:0.060825	train-merror:0.031850
[4]	eval-merror:0.058553	train-merror:0.029119
[5]	eval-merror:0.056574	train-merror:0.026628
[6]	eval-merror:0.054922	train-merror:0.024345
[7]	eval-merror:0.053467	train-merror:0.022349
[8]	eval-merror:0.052013	train-merror:0.020664
[9]	eval-merror:0.050472	train-merror:0.019199
[10]	eval-merror:0.049078	train-merror:0.017745
[11]	eval-merror:0.047917	train-merror:0.016273
[12]	eval-merror:0.046841	train-merror:0.014993
[13]	eval-merror:0.045782	train-merror:0.013801
[14]	eval-merror:0.044991	train-merror:0.012758
[15]	eval-merror:0.043863	train-merror:0.011811
[16]	eval-merror:0.042512	train-merror:0.010766
[17]	eval-merror:0.041807	train-merror:0.010008
[18]	eval-merror:0.041006	train-merror:0.009286
[19]	eval-merror:0.040541	train-merror:0.008606
[20]	eval-merror:0.040042	train-merror:0.007840
[21]	eval-merror:0.039242	train-merror:0.007255
⋮

Booster(Ptr{Void} @0x00007f8b18fdab10)
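
The `merror` values in the log are multi-class error rates: the fraction of rows whose predicted class differs from the true class. A minimal sketch of that metric, using made-up vectors `y_true` and `y_pred` rather than the notebook's data (Julia 1.x syntax):

```julia
# Multi-class error rate: share of rows where the prediction differs from the label.
merror(y_true, y_pred) = count(y_true .!= y_pred) / length(y_true)

y_true = [0, 1, 1, 2, 6, 4, 1, 0]   # hypothetical true classes
y_pred = [0, 1, 2, 2, 6, 4, 1, 1]   # hypothetical predictions (rows 3 and 8 are wrong)

err = merror(y_true, y_pred)        # 2 of 8 rows differ
```

Because the watchlist contains both sets, the gap between `train-merror` and `eval-merror` gives a rough view of how much the model is overfitting as rounds accumulate.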

In [80]:
labels = get_info(dtest, "label")
preds = predict(bst, dtest);

In [81]:
cm = confusion_matrix(int(labels),int(preds))

7x7 Array{Int64,2}:
 41346   1042     0    0    17     2    70
   753  55479    79    0    96    42    12
     0     71  6891   27     9    92     0
     0      0    52  503     0    21     0
    13    179    21    0  1700     5     0
     1     49   114   16     4  3361     0
   101     22     0    0     0     0  4012

Classes:  [0,1,2,3,4,5,6]
Matrix:   
Accuracy: 0.9749574017658904
Kappa:    0.9598229725429582
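
The accuracy and kappa figures can be checked by hand from the matrix. The sketch below (Julia 1.x syntax) copies the counts from the output above and assumes the usual definition of Cohen's kappa, (observed agreement - chance agreement) / (1 - chance agreement); the per-class recall additionally assumes that rows hold the true classes, which is not stated in the output:

```julia
using LinearAlgebra  # provides diag

# Counts copied from the confusion matrix printed above.
cm = [41346  1042     0    0    17     2    70
        753 55479    79    0    96    42    12
          0    71  6891   27     9    92     0
          0     0    52  503     0    21     0
         13   179    21    0  1700     5     0
          1    49   114   16     4  3361     0
        101    22     0    0     0     0  4012]

n = sum(cm)                                   # total number of test rows
row_totals = vec(sum(cm, dims = 2))           # rows per true class (assumed orientation)
col_totals = vec(sum(cm, dims = 1))           # rows per predicted class

observed = sum(diag(cm)) / n                  # overall accuracy
expected = sum(row_totals .* col_totals) / n^2  # agreement expected by chance
kappa = (observed - expected) / (1 - expected)

# Per-class recall: correct predictions divided by rows of that class.
recall = diag(cm) ./ row_totals
```

Under these assumptions `observed` and `kappa` should agree with the reported values, and `recall` shows that the smallest class (label 3, 576 rows) is recognized noticeably less often than the two dominant classes.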

In [76]:
save(bst,"test.model")

In [78]:
dump_model(bst, "test.raw.txt")

## Accuracy
After 500 rounds of boosting we get a 2.50% error rate, slightly better than the benchmark's 2.6% (97.4% accuracy).
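
The arithmetic behind that comparison, as a small sketch (Julia 1.x syntax). The accuracy values are the ones quoted in this notebook; the variable names are made up for illustration:

```julia
xgb_accuracy       = 0.9749574017658904   # from the confusion-matrix summary
benchmark_accuracy = 0.974                # Wise.io figure quoted in the introduction

# Error rate is the complement of accuracy.
xgb_error       = 1 - xgb_accuracy
benchmark_error = 1 - benchmark_accuracy

# Relative reduction in error with respect to the benchmark.
relative_drop = (benchmark_error - xgb_error) / benchmark_error

println("XGBoost error:   ", round(100 * xgb_error, digits = 2), "%")
println("Benchmark error: ", round(100 * benchmark_error, digits = 1), "%")
```

The two figures were not necessarily measured on the same train/test split, so the gap is best read as indicative rather than as a controlled comparison.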

## Prediction Time

In [93]:
@time predict(bst, array(test[1:end-1]))

elapsed time: 18.251000367 seconds (206804664 bytes allocated, 0.66% gc time)


116202-element Array{Float32,1}:
 1.0
 1.0
 0.0
 0.0
 1.0
 1.0
 1.0
 6.0
 1.0
 2.0
 0.0
 1.0
 1.0
 ⋮  
 1.0
 0.0
 1.0
 5.0
 0.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
 4.0

In [85]:
# time a prediction for a single row (row 1, all feature columns)
@time predict(bst, array(test[1, 1:end-1]))

elapsed time: 0.022366966 seconds (305464 bytes allocated)


1-element Array{Float32,1}:
 1.0
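
To put the two timings side by side, the sketch below (Julia 1.x syntax) converts them into throughput and per-row latency. The numbers are copied from the `@time` output above; the variable names are made up for illustration:

```julia
batch_rows     = 116_202        # rows in the full test-set prediction
batch_seconds  = 18.251000367   # elapsed time for the full test set
single_seconds = 0.022366966    # elapsed time for the single-row call

rows_per_second    = batch_rows / batch_seconds          # batched throughput
ms_per_row_batched = 1000 * batch_seconds / batch_rows   # amortized cost per row
ms_single_row      = 1000 * single_seconds               # cost of one single-row call

# How much more expensive a single-row call is than one row inside a batch.
overhead_ratio = ms_single_row / ms_per_row_batched
```

The single-row call costs far more than the amortized per-row time, which suggests a fixed per-call overhead. These are single measurements, so they should be treated as rough figures.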