# Forest Cover Prediction

In this assignment we will predict the forest cover type (the predominant kind of tree cover) from strictly cartographic variables (as opposed to remotely sensed data). The target, `Cover_Type`, is an integer from 1 to 7, one value per cover type:
1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

"Predicting forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS data. Data is in raw form (not scaled) and contains binary (0 or 1) columns of data for qualitative independent variables (wilderness areas and soil types)." [https://archive.ics.uci.edu/ml/datasets/covertype] 

To classify the forest cover, we will use several different classifiers and compare their results: SVM, Decision Trees, Bagging, Boosting (AdaBoost), and Random Forest. In this assignment you are supposed to use the built-in classifiers from `sklearn`. The training, validation, and test partitions are provided. You may need to do some preprocessing and, of course, hyperparameter tuning for each classifier.
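
Because the validation partition is supplied, one way to tune with `GridSearchCV` is to hand it a `PredefinedSplit`, so that every candidate is scored on that fixed partition instead of on k-fold cross-validation. The sketch below is a minimal illustration on synthetic data; the array names, the decision-tree grid, and the split sizes are placeholders, not part of the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the provided partitions (assumption: sizes are arbitrary).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, y_tr = X_demo[:500], y_demo[:500]
X_va, y_va = X_demo[500:], y_demo[500:]

# -1 marks rows that are always used for fitting; 0 marks the single validation fold.
fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])
split = PredefinedSplit(test_fold=fold)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None]},
    cv=split,
    refit=False,  # skip the final refit, which would train on train + validation rows
)
search.fit(np.vstack([X_tr, X_va]), np.concatenate([y_tr, y_va]))

print(search.best_params_, search.best_score_)
```

With `refit=False` the search exposes `best_params_` and `best_score_` but no `best_estimator_`, so the chosen setting has to be refitted on the training partition by hand before it is scored on the test partition.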

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
covtype = datasets.fetch_covtype()
X = covtype.data
Y = covtype.target

#trim
# percent_keep = 0.1
# X = X[0:int(X.shape[0]*percent_keep),:]
# Y = Y[0:int(Y.shape[0]*percent_keep)]


In [3]:
X.shape, Y.shape

((581012, 54), (581012,))

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [11]:
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)


In [12]:
# fraction of all 581,012 rows in the train / validation / test partitions
round(49500/581012, 4), round(5500/581012, 4), round(526012/581012, 4)

(0.0852, 0.0095, 0.9053)

In [13]:
np.random.seed(0)
perm = np.random.permutation(581012)

trainx = X[perm[0:49500],:]
trainy = Y[perm[0:49500]]
valx = X[perm[49500:55000],:]
valy = Y[perm[49500:55000]]
testx = X[perm[55000:581012],:]
testy = Y[perm[55000:581012]]

# # TRAIN_PERC, VAL_PERC, TEST_PERC = (0.08519617494991498, 0.0757299332888133, 0.8106751667779667)
# TRAIN_PERC, VAL_PERC, TEST_PERC = (0.001, 0.0001, 0.8)


# TRAIN_SIZE = int(TRAIN_PERC*X.shape[0])
# VAL_SIZE = int(VAL_PERC*X.shape[0])
# TEST_SIZE = int(TEST_PERC*X.shape[0])

# trainx = X[perm[0:TRAIN_SIZE],:]
# trainy = Y[perm[0:TRAIN_SIZE]]
# valx =   X[perm[TRAIN_SIZE:          TRAIN_SIZE+VAL_SIZE],:]
# valy =   Y[perm[TRAIN_SIZE:          TRAIN_SIZE+VAL_SIZE]]
# testx =  X[perm[TRAIN_SIZE+VAL_SIZE: TRAIN_SIZE+VAL_SIZE+TEST_SIZE],:]
# testy =  Y[perm[TRAIN_SIZE+VAL_SIZE: TRAIN_SIZE+VAL_SIZE+TEST_SIZE]]

In [14]:
sum(trainy==1), sum(trainy==2), sum(trainy==3), sum(trainy==4), sum(trainy==5), sum(trainy==6), sum(trainy==7)

(17945, 24251, 3023, 254, 786, 1481, 1760)
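
The counts above are heavily imbalanced (class 4 has 254 training rows, class 2 has 24,251). One option, offered here as an assumption and not as a requirement of the assignment, is to weight classes inversely to their frequency. The sketch rebuilds a label vector from the printed counts and computes scikit-learn's "balanced" weights, `n_samples / (n_classes * count)`; the resulting dictionary could be passed as `class_weight` to classifiers that accept it, such as `SVC`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-set class counts as printed above, for classes 1..7.
counts = np.array([17945, 24251, 3023, 254, 786, 1481, 1760])
classes = np.arange(1, 8)

# Rebuild a label vector with the same class frequencies (order does not matter here).
labels = np.repeat(classes, counts)

# "balanced" weight for class c is n_samples / (n_classes * count_c).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

for c, w in class_weight.items():
    print(f"class {c}: weight {w:.2f}")
```

Whether reweighting helps depends on the metric: it tends to trade overall accuracy for better recall on the rare classes.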

In [15]:
trainx.shape, trainy.shape, valx.shape, valy.shape, testx.shape, testy.shape

((49500, 54), (49500,), (5500, 54), (5500,), (526012, 54), (526012,))
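
The scaling cell above fits `StandardScaler` on all 581,012 rows before the split, so validation and test statistics influence the transform. A leakage-free variant fits the scaler on the training rows only and reuses it for the other partitions. The sketch below shows the pattern on random stand-in data; `raw`, the partition sizes, and the `demo_*` and `*_s` names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # stand-in for the unscaled X

demo_perm = rng.permutation(raw.shape[0])
train_idx, val_idx, test_idx = demo_perm[:100], demo_perm[100:150], demo_perm[150:]

# Fit on the training rows only, then apply the same transform everywhere.
demo_scaler = StandardScaler().fit(raw[train_idx])
train_s = demo_scaler.transform(raw[train_idx])
val_s = demo_scaler.transform(raw[val_idx])
test_s = demo_scaler.transform(raw[test_idx])

# Training columns are centred up to rounding; the other partitions only approximately.
print(train_s.mean(axis=0).round(3), val_s.mean(axis=0).round(3))
```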

# 1. SVM

In [17]:
from sklearn.svm import SVC
# training and hyper-parameter tuning
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

parameters = {
    'kernel':['rbf'],
    'C':[1, 10]
}

svc = SVC(
#     C=1.0,
#     kernel='rbf',
    degree=3,
    gamma='auto',  # 'auto_deprecated' was removed in scikit-learn 0.22; 'auto' keeps its 1/n_features behaviour
    coef0=0.0,
    shrinking=True,
    probability=False,
    tol=0.001,
    cache_size=200,
    class_weight=None,
    verbose=False,
    max_iter=-1,
    decision_function_shape='ovr',
    random_state=None,
)

clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(trainx, trainy)

# hyperparameters of the best cross-validated candidate
best_params = clf.best_params_



In [19]:
import pickle as pkl

with open('svm.pkl', 'wb') as f:
    pkl.dump(clf, f)

In [None]:
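#test
# GridSearchCV fits clones of `svc`, so `svc` itself is never fitted; predict with the refitted `clf`
clf.predict(valx)
clf.predict(testx)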

# 2. Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
# training and hyper-parameter tuning


In [None]:
#test


# 3. Bagging

In [None]:
from sklearn.ensemble import BaggingClassifier
# training and hyper-parameter tuning


In [None]:
#test


# 4. AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
# training and hyper-parameter tuning


In [None]:
#test


# 5. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
# training and hyper-parameter tuning


In [None]:
#test


## Questions:

Please report in your submission the following for each classifier:
1. The best result on the validation set
2. Hyperparameter values for the classifier
3. The result on the test set.

Apart from the above, please provide your comments and observations on the results of the different classifiers. 
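
One way to collect the three requested numbers per classifier is a small loop that tunes each model on the validation partition and then scores the chosen setting once on the test partition. The sketch below runs on synthetic data; the `candidates` dictionary, the `tune_and_report` helper, and the grid values are hypothetical and chosen only so the example runs quickly. They are not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier


def tune_and_report(make_model, grid, train, val, test):
    """Pick the grid point with the best validation accuracy, then score it on test."""
    best = None
    for params in ParameterGrid(grid):
        model = make_model(**params).fit(*train)
        val_acc = model.score(*val)
        if best is None or val_acc > best["val_acc"]:
            best = {"params": params, "val_acc": val_acc, "model": model}
    return {
        "val_acc": best["val_acc"],
        "params": best["params"],
        "test_acc": best["model"].score(*test),
    }


# Synthetic stand-in for (trainx, trainy), (valx, valy), (testx, testy).
X_demo, y_demo = make_classification(
    n_samples=900, n_features=10, n_informative=5, n_classes=3, random_state=0
)
train = (X_demo[:500], y_demo[:500])
val = (X_demo[500:700], y_demo[500:700])
test = (X_demo[700:], y_demo[700:])

# Hypothetical grids, kept tiny so the sketch runs quickly.
candidates = {
    "decision_tree": (lambda **p: DecisionTreeClassifier(random_state=0, **p),
                      {"max_depth": [5, 10]}),
    "bagging": (lambda **p: BaggingClassifier(random_state=0, **p),
                {"n_estimators": [10, 25]}),
    "adaboost": (lambda **p: AdaBoostClassifier(random_state=0, **p),
                 {"n_estimators": [25, 50]}),
    "random_forest": (lambda **p: RandomForestClassifier(random_state=0, **p),
                      {"n_estimators": [25, 50]}),
}

report = {name: tune_and_report(make, grid, train, val, test)
          for name, (make, grid) in candidates.items()}

for name, row in report.items():
    print(f"{name:14s} val={row['val_acc']:.3f} test={row['test_acc']:.3f} {row['params']}")
```

Scoring on the test partition only once, after the hyperparameters are fixed, keeps the test result an honest estimate.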