# Decision Trees and Ensembles

# Forest Cover Prediction
In this assignment we predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The target, `Cover_Type`, is an integer from 1 to 7 denoting one of seven cover types:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

"Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types)." [https://archive.ics.uci.edu/ml/datasets/covertype] 

To classify the forest cover, we will use several classifiers and compare their results: Decision Trees, Bagging, Boosting, and Random Forest. In this assignment you are supposed to use the built-in classifiers from `sklearn`. The training, validation, and test partitions are provided. You may need to do some preprocessing and, of course, hyper-parameter tuning for each classifier.

In [24]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import numpy as np

In [2]:
covtype = datasets.fetch_covtype()
X = covtype.data
Y = covtype.target

In [3]:
X.shape, Y.shape

((581012, 54), (581012,))

In [4]:
np.random.seed(0)
perm = np.random.permutation(581012)
trainx = X[perm[0:49500],:]
trainy = Y[perm[0:49500]]
valx = X[perm[49500:55000],:]
valy = Y[perm[49500:55000]]
testx = X[perm[55000:581012],:]
testy = Y[perm[55000:581012]]

In [5]:
sum(trainy==1), sum(trainy==2), sum(trainy==3), sum(trainy==4), sum(trainy==5), sum(trainy==6), sum(trainy==7)

(17945, 24251, 3023, 254, 786, 1481, 1760)
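The counts above show a strongly imbalanced training set: classes 1 and 2 together account for about 85% of the samples, while class 4 has only 254. As a minimal sketch, the same tally and the class proportions can be obtained in one call with `np.unique`. The `labels` vector below is rebuilt from the reported counts as a stand-in for `trainy`, so the snippet runs without downloading the data.

```python
import numpy as np

# Class counts reported above for labels 1..7 of the training partition.
reported_counts = [17945, 24251, 3023, 254, 786, 1481, 1760]
# Rebuild a label vector with the same distribution (stand-in for trainy).
labels = np.repeat(np.arange(1, 8), reported_counts)

# One pass over the labels gives every class and its frequency.
classes, counts = np.unique(labels, return_counts=True)
proportions = counts / counts.sum()
for c, n, p in zip(classes, counts, proportions):
    print(f"class {c}: {n:6d} samples ({p:.1%})")
```

With this much imbalance, overall accuracy is driven mostly by classes 1 and 2, so per-class metrics may be worth checking alongside it.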

# 1. Decision tree

In [11]:
from sklearn.tree import DecisionTreeClassifier
# training and hyper-parameter tuning
tree_parameters = {'criterion':['gini','entropy'],'max_depth':np.arange(1,10)}

grid_tree = GridSearchCV(estimator= DecisionTreeClassifier(),
                        param_grid = tree_parameters)
grid_tree.fit(trainx,trainy)

GridSearchCV(estimator=DecisionTreeClassifier(),
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})

In [14]:
print("Best hyper-parameters",grid_tree.best_params_)
print("Best accuracy :",grid_tree.best_score_)

Best hyper-parameters {'criterion': 'gini', 'max_depth': 9}
Best accuracy : 0.7526666666666666


In [26]:

y_pred = grid_tree.predict(valx)
print("Accuracy:",accuracy_score(valy, y_pred))

Accuracy: 0.7572727272727273


In [28]:
#test
y_pred = grid_tree.predict(testx)
print("Accuracy:",accuracy_score(testy, y_pred))

Accuracy: 0.7537774803616647
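Note that `GridSearchCV` as called above uses its default cross-validation on the training partition, so `best_score_` is a mean cross-validation score rather than accuracy on the provided validation partition. If the intent is to select hyper-parameters against that validation partition directly, one option is `PredefinedSplit`. The sketch below assumes synthetic data from `make_classification` as a stand-in for `trainx`/`trainy` and `valx`/`valy`; the names `X_tr`, `X_va`, and `grid_demo` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train/validation partitions (assumption, not the covertype data).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)
X_tr, y_tr = X_demo[:450], y_demo[:450]
X_va, y_va = X_demo[450:], y_demo[450:]

# -1 marks rows that are only ever used for training; 0 marks the single validation fold.
test_fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold)

grid_demo = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid={'max_depth': np.arange(1, 10)},
                         cv=split)
# The search sees train and validation rows together; the split decides which is which.
grid_demo.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))
print(grid_demo.best_params_, grid_demo.best_score_)
```

Here `best_score_` is the validation accuracy of the selected setting. With the default `refit=True`, the final estimator is refit on training and validation rows combined, which may or may not be what the assignment intends.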


# 2. Bagging

In [30]:
from sklearn.ensemble import BaggingClassifier
# training and hyper-parameter tuning
param_grid = {
    'base_estimator__max_depth' : np.arange(1,10),
    'max_samples' : [0.05, 0.1, 0.2, 0.5]
}

grid_bag = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(),
                                     n_estimators = 100, max_features = 0.5),
                                     param_grid)
grid_bag.fit(trainx, trainy)

GridSearchCV(estimator=BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                         max_features=0.5, n_estimators=100),
             param_grid={'base_estimator__max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
                         'max_samples': [0.05, 0.1, 0.2, 0.5]})

In [31]:
print("Best hyper-parameters",grid_bag.best_params_)
print("Best accuracy :",grid_bag.best_score_)

Best hyper-parameters {'base_estimator__max_depth': 9, 'max_samples': 0.5}
Best accuracy : 0.7667070707070707


In [32]:
#val
y_pred = grid_bag.predict(valx)
print("Accuracy:",accuracy_score(valy, y_pred))

Accuracy: 0.7665454545454545


In [33]:
#test
y_pred = grid_bag.predict(testx)
print("Accuracy:",accuracy_score(testy, y_pred))

Accuracy: 0.7706135981688631


# 3. AdaBoost

In [None]:
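from sklearn.ensemble import AdaBoostClassifier
# training and hyper-parameter tuning
# Note: newer scikit-learn releases rename `base_estimator` to `estimator`,
# so the prefix there would be 'estimator__' instead of 'base_estimator__'
param_grid = {
    'base_estimator__max_depth' : np.arange(1,10),
    'learning_rate': [0.1, 0.2, 0.01, 0.001, 0.0001],
    'n_estimators': [100, 200, 50]
}
# 'max_samples' is a BaggingClassifier parameter, not an AdaBoostClassifier one, so it is dropped here

grid_boost = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier()), param_grid)
grid_boost.fit(trainx, trainy)


# Alternative: a single AdaBoost model with fixed hyper-parameters
# ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm='SAMME.R', learning_rate=0.1)
# ada.fit(trainx, trainy)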

In [None]:
print("Best hyper-parameters",grid_boost.best_params_)
print("Best accuracy :",grid_boost.best_score_)

In [None]:
#val
y_pred = grid_boost.predict(valx)
print("Accuracy:",accuracy_score(valy, y_pred))

In [None]:
#test
y_pred = grid_boost.predict(testx)
print("Accuracy:",accuracy_score(testy, y_pred))

# 4. Random Forest

In [None]:
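from sklearn.ensemble import RandomForestClassifier
# training and hyper-parameter tuning
param_grid = {
    'n_estimators': np.arange(100,1000,100),
    # 'auto' is omitted: for classifiers it was an alias of 'sqrt',
    # and newer scikit-learn releases no longer accept it
    'max_features': ['sqrt', 'log2']
}

# tune a RandomForestClassifier (not a single tree) on the training partition
grid_rand = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
grid_rand.fit(trainx, trainy)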

In [None]:
print("Best hyper-parameters",grid_rand.best_params_)
print("Best accuracy :",grid_rand.best_score_)

In [None]:
#val
y_pred = grid_rand.predict(valx)
print("Accuracy:",accuracy_score(valy, y_pred))

In [None]:
#test
y_pred = grid_rand.predict(testx)
print("Accuracy:",accuracy_score(testy, y_pred))

## Questions:

Please report in your submission the following for each classifier:
1. The best result on the validation set
2. Hyperparameter values for the classifier
3. The result on the test set.

Apart from the above, please provide your comments and observations on the results of the different classifiers. 
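One way to gather the requested numbers is a small helper that evaluates every fitted model on the validation and test partitions and prints them side by side. This is only a sketch: `summarize` is a hypothetical helper, and the data and models below are synthetic stand-ins. In the notebook, the fitted `grid_tree`, `grid_bag`, `grid_boost`, and `grid_rand` objects could be passed instead, with hyper-parameters read from each object's `best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def summarize(models, X_val, y_val, X_test, y_test):
    """Return (name, validation accuracy, test accuracy) for each fitted model."""
    rows = []
    for name, model in models.items():
        val_acc = accuracy_score(y_val, model.predict(X_val))
        test_acc = accuracy_score(y_test, model.predict(X_test))
        rows.append((name, val_acc, test_acc))
    return rows


# Synthetic stand-in data (assumption, not the covertype partitions).
X_s, y_s = make_classification(n_samples=900, n_features=12, n_informative=6,
                               n_classes=3, random_state=0)
X_tr, y_tr = X_s[:600], y_s[:600]
X_va, y_va = X_s[600:750], y_s[600:750]
X_te, y_te = X_s[750:], y_s[750:]

fitted = {
    "Decision tree": DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr),
}

results = summarize(fitted, X_va, y_va, X_te, y_te)
print(f"{'classifier':<15}{'val acc':>10}{'test acc':>10}")
for name, val_acc, test_acc in results:
    print(f"{name:<15}{val_acc:>10.4f}{test_acc:>10.4f}")
```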

## EXTRA

Download the Statlog (Vehicle Silhouettes) Data Set (https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29)

It has 18 features and the following four classes:
**OPEL, SAAB, BUS, VAN**

The purpose is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. 


Use a Decision Tree classifier with a 70/15/15 train/val/test partition (random seed = 777).
Fine-tune the tree on the validation set.

1. Report the performance on the test set
2. Extract the decision rules for classification.
3. Plot the decision tree as an image using some library.

**Note:** You have library support in Python for visualizing the learnt decision trees.
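A rough outline of the workflow, not a solution: the sketch below assumes a synthetic 18-feature, 4-class dataset from `make_classification` in place of the downloaded Statlog file, placeholder feature names `f0`..`f17`, and `max_depth` as the only tuned hyper-parameter. It shows a 70/15/15 split with seed 777, depth selection on the validation set, and rule extraction with `export_text`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

SEED = 777

# Synthetic stand-in with 18 features and 4 classes (assumption, not the Statlog data).
X_v, y_v = make_classification(n_samples=800, n_features=18, n_informative=8,
                               n_classes=4, random_state=SEED)

# 70% train, then split the remaining 30% evenly into validation and test.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_v, y_v, test_size=0.30, random_state=SEED, stratify=y_v)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=SEED, stratify=y_rest)

# Pick the depth with the best validation accuracy.
best_depth, best_val = None, -1.0
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=SEED).fit(X_tr, y_tr)
    val_acc = accuracy_score(y_va, candidate.predict(X_va))
    if val_acc > best_val:
        best_depth, best_val = depth, val_acc

final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=SEED).fit(X_tr, y_tr)
print("best max_depth:", best_depth, "validation accuracy:", best_val)
print("test accuracy:", accuracy_score(y_te, final_tree.predict(X_te)))

# Text form of the decision rules; the feature names here are placeholders.
feature_names = [f"f{i}" for i in range(X_v.shape[1])]
rules = export_text(final_tree, feature_names=feature_names)
print(rules[:500])
```

For the image, `sklearn.tree.plot_tree(final_tree, feature_names=feature_names, filled=True)` draws the same tree onto a matplotlib figure, which can then be saved with `plt.savefig`.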