# Decision Trees and Ensembles

# Forest Cover Prediction
In this assignment we are going to predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). Cover_Type (7 types, integer 1 to 7). The seven types are:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

"Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types)." [https://archive.ics.uci.edu/ml/datasets/covertype] 

In order to classify the forest cover, we will use several different classifiers and compare their results. The classifiers we will use are Decision Trees, Bagging, Boosting, and Random Forest. In this assignment you are suppose to use built-in classifiers from `sklearn`. The training, validation, and test partitions are provided. You may need to do some preprocessing, and of course hyper-parameter tuning for each classifier.

In [9]:
from sklearn import datasets
import sklearn
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
covtype = datasets.fetch_covtype()
X = covtype.data
Y = covtype.target

In [3]:
X.shape, Y.shape

((581012, 54), (581012,))

In [4]:
np.random.seed(0)
perm = np.random.permutation(581012)
trainx = X[perm[0:49500],:]
trainy = Y[perm[0:49500]]
valx = X[perm[49500:55000],:]
valy = Y[perm[49500:55000]]
testx = X[perm[55000:581012],:]
testy = Y[perm[55000:581012]]

In [5]:
sum(trainy==1), sum(trainy==2), sum(trainy==3), sum(trainy==4), sum(trainy==5), sum(trainy==6), sum(trainy==7)

(17945, 24251, 3023, 254, 786, 1481, 1760)

# 1. Decision tree

In [14]:
from sklearn.tree import DecisionTreeClassifier
# training and hyper-parameter tuning
criterions = ["gini","entropy"]
splitters = ['best', 'random']
min_samples_split = np.array(range(2, 20))
print(min_samples_split)


# trainx
# trainy
# valx 
# valy 
# testx 
# testy 

best_min_samples_split = 0
best_acc = 0 

for criterion in criterions:
    for splitter in splitters:
        for i in min_samples_split:
            clf = DecisionTreeClassifier(criterion = criterion, splitter= splitter, min_samples_split = i )
            clf = clf.fit(trainx, trainy)

            
            # evaluate on trainx, trainy and valx, valy
            y_pred_on_train = clf.predict(trainx)
            train_acc = sklearn.metrics.accuracy_score(y_true=trainy, y_pred=y_pred_on_train)
            
            y_pred_on_val = clf.predict(valx)
            val_acc = sklearn.metrics.accuracy_score(y_true=valy, y_pred=y_pred_on_val)
            
            print("Setting criterion={}, splitter={}, min_samples_split={}, train_acc={:.2f}, val_acc={:.2f}".format(criterion, splitter, i, train_acc, val_acc))
            if val_acc > best_acc:
                best_acc = val_acc
                best_min_samples_split = i
                best_criterion = criterion
                best_splitter = splitter

        
print("Best Setting criterion={}, splitter={}, min_samples_split={}, val_acc={:.2f}".format(best_criterion, best_splitter, best_min_samples_split, best_acc))

[ 2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
Setting criterion=gini, splitter=best, min_samples_split=2, train_acc=1.00, val_acc=0.84
Setting criterion=gini, splitter=best, min_samples_split=3, train_acc=0.99, val_acc=0.84
Setting criterion=gini, splitter=best, min_samples_split=4, train_acc=0.99, val_acc=0.84
Setting criterion=gini, splitter=best, min_samples_split=5, train_acc=0.98, val_acc=0.84
Setting criterion=gini, splitter=best, min_samples_split=6, train_acc=0.98, val_acc=0.83
Setting criterion=gini, splitter=best, min_samples_split=7, train_acc=0.97, val_acc=0.83
Setting criterion=gini, splitter=best, min_samples_split=8, train_acc=0.97, val_acc=0.83
Setting criterion=gini, splitter=best, min_samples_split=9, train_acc=0.96, val_acc=0.83
Setting criterion=gini, splitter=best, min_samples_split=10, train_acc=0.96, val_acc=0.82
Setting criterion=gini, splitter=best, min_samples_split=11, train_acc=0.95, val_acc=0.82
Setting criterion=gini, splitter=best, min_samples_s

In [26]:
#test
X_train_val_merge = np.vstack([trainx, valx]) 
y_train_val_merge = np.hstack([trainy, valy])

clf = DecisionTreeClassifier(criterion = best_criterion, splitter= best_splitter, min_samples_split = best_min_samples_split )
clf = clf.fit(X_train_val_merge, y_train_val_merge)


# evaluate on testx, testy
y_pred_on_test = clf.predict(testx)
test_acc = sklearn.metrics.accuracy_score(y_true=testy, y_pred=y_pred_on_test)


print("Setting criterion={}, splitter={}, min_samples_split={}, test_acc={:.2f}".format(best_criterion, best_splitter, best_min_samples_split, test_acc))

Setting criterion=gini, splitter=best, min_samples_split=4, test_acc=0.83


# 2. Bagging

In [24]:
from sklearn.ensemble import BaggingClassifier
# training and hyper-parameter tuning

n_estimators = np.array(range(10, 50, 5))

# trainx
# trainy
# valx
# valy
# testx
# testy


for n in n_estimators:
    bagging = BaggingClassifier(DecisionTreeClassifier(criterion=best_criterion, splitter=best_splitter, min_samples_split=best_min_samples_split),
                                n_estimators=n)
    bagging = bagging.fit(trainx, trainy)

    # evaluate on trainx, trainy and valx, valy
    y_pred_on_train = bagging.predict(trainx)
    train_acc = sklearn.metrics.accuracy_score(
        y_true=trainy, y_pred=y_pred_on_train)

    y_pred_on_val = bagging.predict(valx)
    val_acc = sklearn.metrics.accuracy_score(y_true=valy, y_pred=y_pred_on_val)

    print("Setting criterion={}, splitter={}, min_samples_split={}, n_estimators={}, train_acc={:.2f}, val_acc={:.2f}".format(
        best_criterion, best_splitter, best_min_samples_split, n, train_acc, val_acc))
    if val_acc > best_acc:
        best_acc = val_acc
        best_n_estimators = n


print("Best Setting criterion={}, splitter={}, min_samples_split={}, n_estimators={}, val_acc={:.2f}".format(
    best_criterion, best_splitter, best_min_samples_split, best_n_estimators, best_acc))


Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=10, train_acc=0.99, val_acc=0.88
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=15, train_acc=0.99, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=20, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=25, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=30, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=35, train_acc=1.00, val_acc=0.90
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=40, train_acc=1.00, val_acc=0.90
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=45, train_acc=1.00, val_acc=0.90
Best Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=45, val_acc=0.90


In [25]:
#test
X_train_val_merge = np.vstack([trainx, valx]) 
y_train_val_merge = np.hstack([trainy, valy])

bagging = BaggingClassifier(DecisionTreeClassifier(criterion = best_criterion, splitter= best_splitter, min_samples_split = best_min_samples_split ),
                                n_estimators=best_n_estimators)
bagging = bagging.fit(X_train_val_merge, y_train_val_merge)




# evaluate on testx, testy
y_pred_on_test = bagging.predict(testx)
test_acc = sklearn.metrics.accuracy_score(y_true=testy, y_pred=y_pred_on_test)


print("Setting criterion={}, splitter={}, min_samples_split={}, test_acc={:.2f}".format(best_criterion, best_splitter, best_min_samples_split, test_acc))

Setting criterion=gini, splitter=best, min_samples_split=4, test_acc=0.89


# 3. AdaBoost

In [27]:
from sklearn.ensemble import AdaBoostClassifier
# training and hyper-parameter tuning

n_estimators = np.array(range(10, 50, 5))

# trainx
# trainy
# valx
# valy
# testx
# testy


for n in n_estimators:
    ada_boost = AdaBoostClassifier(DecisionTreeClassifier(criterion=best_criterion, splitter=best_splitter, min_samples_split=best_min_samples_split),
                                n_estimators=n)
    ada_boost = ada_boost.fit(trainx, trainy)

    # evaluate on trainx, trainy and valx, valy
    y_pred_on_train = ada_boost.predict(trainx)
    train_acc = sklearn.metrics.accuracy_score(
        y_true=trainy, y_pred=y_pred_on_train)

    y_pred_on_val = ada_boost.predict(valx)
    val_acc = sklearn.metrics.accuracy_score(y_true=valy, y_pred=y_pred_on_val)

    print("Setting criterion={}, splitter={}, min_samples_split={}, n_estimators={}, train_acc={:.2f}, val_acc={:.2f}".format(
        best_criterion, best_splitter, best_min_samples_split, n, train_acc, val_acc))
    if val_acc > best_acc:
        best_acc = val_acc
        best_n_estimators = n


print("Best Setting criterion={}, splitter={}, min_samples_split={}, n_estimators={}, val_acc={:.2f}".format(
    best_criterion, best_splitter, best_min_samples_split, best_n_estimators, best_acc))

Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=10, train_acc=1.00, val_acc=0.88
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=15, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=20, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=25, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=30, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=35, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=40, train_acc=1.00, val_acc=0.89
Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=45, train_acc=1.00, val_acc=0.89
Best Setting criterion=gini, splitter=best, min_samples_split=4, n_estimators=45, val_acc=0.90


In [28]:
#test
X_train_val_merge = np.vstack([trainx, valx]) 
y_train_val_merge = np.hstack([trainy, valy])

ada_boost = AdaBoostClassifier(DecisionTreeClassifier(criterion = best_criterion, splitter= best_splitter, min_samples_split = best_min_samples_split ),
                                n_estimators=best_n_estimators)
ada_boost = ada_boost.fit(X_train_val_merge, y_train_val_merge)




# evaluate on testx, testy
y_pred_on_test = ada_boost.predict(testx)
test_acc = sklearn.metrics.accuracy_score(y_true=testy, y_pred=y_pred_on_test)


print("Setting criterion={}, splitter={}, min_samples_split={}, test_acc={:.2f}".format(best_criterion, best_splitter, best_min_samples_split, test_acc))

Setting criterion=gini, splitter=best, min_samples_split=4, test_acc=0.89


# 4. Random Forest

In [35]:
from sklearn.ensemble import RandomForestClassifier
# training and hyper-parameter tuning
criterions = ["gini", "entropy"]
splitters = ['best', 'random']
min_samples_split = np.array(range(2, 20, 4))
n_estimators = np.array(range(30, 50, 5))


# trainx
# trainy
# valx
# valy
# testx
# testy

best_min_samples_split = 0
best_acc = 0
best_n_estimators = 0

for criterion in criterions:
    for i in min_samples_split:
        for n in n_estimators:
            rf = RandomForestClassifier(
                criterion=criterion, min_samples_split=i, n_estimators=n)
            rf = rf.fit(trainx, trainy)

            # evaluate on trainx, trainy and valx, valy
            y_pred_on_train = rf.predict(trainx)
            train_acc = sklearn.metrics.accuracy_score(
                y_true=trainy, y_pred=y_pred_on_train)

            y_pred_on_val = rf.predict(valx)
            val_acc = sklearn.metrics.accuracy_score(
                y_true=valy, y_pred=y_pred_on_val)

            print("Setting criterion={}, min_samples_split={}, n_estimators={}, train_acc={:.2f}, val_acc={:.2f}".format(
                criterion, i, n, train_acc, val_acc))
            if val_acc > best_acc:
                best_acc = val_acc
                best_min_samples_split = i
                best_criterion = criterion
                best_n_estimators = n
                


print("Best Setting criterion={}, min_samples_split={}, n_estimators={}, val_acc={:.2f}".format(
    best_criterion, best_min_samples_split,best_n_estimators, best_acc))


Setting criterion=gini, min_samples_split=2, n_estimators=30, train_acc=1.00, val_acc=0.88
Setting criterion=gini, min_samples_split=2, n_estimators=35, train_acc=1.00, val_acc=0.88
Setting criterion=gini, min_samples_split=2, n_estimators=40, train_acc=1.00, val_acc=0.88
Setting criterion=gini, min_samples_split=2, n_estimators=45, train_acc=1.00, val_acc=0.88
Setting criterion=gini, min_samples_split=6, n_estimators=30, train_acc=0.98, val_acc=0.87
Setting criterion=gini, min_samples_split=6, n_estimators=35, train_acc=0.99, val_acc=0.87
Setting criterion=gini, min_samples_split=6, n_estimators=40, train_acc=0.99, val_acc=0.87
Setting criterion=gini, min_samples_split=6, n_estimators=45, train_acc=0.99, val_acc=0.87
Setting criterion=gini, min_samples_split=10, n_estimators=30, train_acc=0.96, val_acc=0.87
Setting criterion=gini, min_samples_split=10, n_estimators=35, train_acc=0.96, val_acc=0.87
Setting criterion=gini, min_samples_split=10, n_estimators=40, train_acc=0.97, val_acc=0

In [None]:
#test
X_train_val_merge = np.vstack([trainx, valx]) 
y_train_val_merge = np.hstack([trainy, valy])

rf = RandomForestClassifier(
                criterion=criterion, min_samples_split=i, n_estimators=best_n_estimators)
rf = rf.fit(X_train_val_merge, y_train_val_merge)




# evaluate on testx, testy
y_pred_on_test = rf.predict(testx)
test_acc = sklearn.metrics.accuracy_score(y_true=testy, y_pred=y_pred_on_test)


print("Setting criterion={}, min_samples_split={}, n_estimators={}, test_acc={:.2f}".format(best_criterion, best_min_samples_split,best_n_estimators, test_acc))

Setting criterion=gini, splitter=best, min_samples_split=4, test_acc=0.88


## Questions:

Please report in you submission the following for each classifier:
1. The best result on the validaton set
2. Hyperparameter values for the classifier
3. The result on the test set.

Apart from the above, please provide your comments and observations on the results of the different classifiers. 

## Ansewers: 
1. Decesion Tree:
    - The best result in validation is 0.84
    - The hyperparameters are criterion=gini, splitter=best, min_samples_split=3
    - The best result in test-set is 0.83

2. Bagging:
    - The best result in validation is 0.90 
    - The hyperparameters are criterion=gini, splitter=best, min_samples_split=4, n_estimators=45
    - The best result in test-set is 0.89

2. AdaBoost:
    - The best result in validation is 0.90
    - The hyperparameters are criterion=gini, splitter=best, min_samples_split=4, n_estimators=45
    - The best result in test-set is 0.89

2. Random Forest:
    - The best result in validation is 0.88
    - The hyperparameters are criterion=entropy, min_samples_split=2, n_estimators=40
    - The best result in test-set is 0.88


## EXTRA

Download the Statlog (Vehicle Silhouettes) Data Set (https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29)

It has 18 features and following four classes:
**OPEL, SAAB, BUS, VAN**

The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. 


Use Decision Tree classifer to train the classifier with train/val/test partition as 70/15/15 (random seed=777)
Fine-tune the tree on the validation set.

1. Report the performance on the test set
2. Extract the decision rules for classifiecation.
3. Plot the decision tree as an image using some library.

**Note:** You have library support in Python for visualizing the learnt decision trees.