# Course project

#### (F) Nonlinear classifiers

#### Instructions

Try nonlinear classifiers. Can you do better than the models above?

1. Try a **random forest**. Does increasing the number of trees help?
2. Try **SVMs**. Does the **RBF kernel** perform better than the **linear** one?

In [1]:
import numpy as np
import pandas as pd

In [2]:
with np.load('swissroads_highlevel_features.npz', allow_pickle=False) as npz_file:
    # Load the arrays
    features_tr = npz_file['features_train']
    labels_tr = npz_file['labels_train']
    features_va = npz_file['features_valid']
    labels_va = npz_file['labels_valid']
    features_te = npz_file['features_test']
    labels_te = npz_file['labels_test']
    imgs_tr = npz_file['imgs_train']
    imgs_va = npz_file['imgs_valid']
    imgs_te = npz_file['imgs_test']

print('features_tr:', features_tr.shape)
print('labels_tr:', labels_tr.shape)
print('features_va:', features_va.shape)
print('labels_va:', labels_va.shape)
print('features_te:', features_te.shape)
print('labels_te:', labels_te.shape)
print('imgs_tr:', imgs_tr.shape)
print('imgs_va:', imgs_va.shape)
print('imgs_te:', imgs_te.shape)

features_tr: (280, 2048)
labels_tr: (280,)
features_va: (139, 2048)
labels_va: (139,)
features_te: (50, 2048)
labels_te: (50,)
imgs_tr: (280, 299, 299, 3)
imgs_va: (139, 299, 299, 3)
imgs_te: (50, 299, 299, 3)


In [3]:
X_tr = features_tr
X_va = features_va
X_te = features_te
y_tr = labels_tr
y_va = labels_va
y_te = labels_te
labels = ['bike','car','motorcycle','other','truck','van']

### Task F

#### i) Random Forest

In [4]:
from sklearn.ensemble import RandomForestClassifier

In [5]:
forest = RandomForestClassifier()
forest.fit(X_tr, y_tr)
forest.score(X_va, y_va)



0.8920863309352518

In [6]:
forest.get_params

<bound method BaseEstimator.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)>

In [7]:
# Grid of hyperparameters: number of trees and maximum tree depth
estimators = [1,3,5,7,9,11,13,15]
depths = [2,4,6,8,10,12,14,16,18,20]
forest_results = []

# Fit one forest per (n_estimators, max_depth) pair and record its accuracy
for e in estimators:
    for d in depths:
        forest2 = RandomForestClassifier(n_estimators=e, criterion='gini', max_depth=d, random_state=1)
        forest2.fit(X_tr, y_tr)
        forest_results.append({
            'estimators':e,
            'depth':d,
            'train_accuracy':forest2.score(X_tr, y_tr),
            'valid_accuracy':forest2.score(X_va, y_va)
        })

In [8]:
pd.DataFrame(forest_results).sort_values(by='valid_accuracy', ascending=False).head(20)

| | depth | estimators | train_accuracy | valid_accuracy |
|---|---|---|---|---|
| 72 | 6 | 15 | 0.996429 | 0.928058 |
| 79 | 20 | 15 | 1.0 | 0.920863 |
| 52 | 6 | 11 | 0.992857 | 0.920863 |
| 78 | 18 | 15 | 1.0 | 0.920863 |
| 77 | 16 | 15 | 1.0 | 0.920863 |
| 76 | 14 | 15 | 1.0 | 0.920863 |
| 75 | 12 | 15 | 1.0 | 0.920863 |
| 74 | 10 | 15 | 1.0 | 0.920863 |
| 62 | 6 | 13 | 1.0 | 0.920863 |
| 45 | 12 | 9 | 0.996429 | 0.920863 |


**Based on the grid search above, I chose `n_estimators=9` and `max_depth=6`, which seems to reach strong accuracy while keeping the depth and the number of estimators low to limit the risk of over-fitting.**
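To read the effect of the number of trees from a grid like the one above more directly, one option is to aggregate validation accuracy per `n_estimators` value. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's features (an assumption, so the numbers will not match the table above). The column names mirror those in `forest_results`, and the grid values are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 classes, like the notebook's label list.
X, y = make_classification(n_samples=420, n_features=50, n_informative=20,
                           n_classes=6, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

rows = []
for n_trees in [1, 5, 15, 50]:
    for depth in [2, 6, 12]:
        model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth,
                                       random_state=1)
        model.fit(X_train, y_train)
        rows.append({'estimators': n_trees, 'depth': depth,
                     'valid_accuracy': model.score(X_valid, y_valid)})

# Mean and best validation accuracy for each number of trees, across all depths
summary = (pd.DataFrame(rows)
           .groupby('estimators')['valid_accuracy']
           .agg(['mean', 'max']))
print(summary)
```

A summary like this makes it easier to see whether validation accuracy keeps improving as trees are added or levels off.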

In [9]:
forest_final = RandomForestClassifier(n_estimators=9, criterion='gini', max_depth=6)
forest_final.fit(X_tr, y_tr)

randomforest_train_score = forest_final.score(X_tr, y_tr)
randomforest_valid_score = forest_final.score(X_va, y_va)
randomforest_test_score = forest_final.score(X_te, y_te)

print('Accuracy RandomForest train set:',randomforest_train_score)
print('Accuracy RandomForest validation set:',randomforest_valid_score)
print('Accuracy RandomForest test set:',randomforest_test_score)

Accuracy RandomForest train set: 0.9892857142857143
Accuracy RandomForest validation set: 0.8705035971223022
Accuracy RandomForest test set: 0.88
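
As a side note on validating a random forest without a separate split: scikit-learn can report an out-of-bag (OOB) estimate, computed for each sample from the trees that did not see it during bootstrapping. It complements a validation set or cross-validation rather than replacing them. A minimal sketch on synthetic data (an assumption, not the notebook's features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 classes, 50 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=50, n_informative=20,
                                     n_classes=6, random_state=0)

# oob_score=True requires bootstrap=True (the default). 100 trees are used so
# that every sample is left out of the bootstrap for at least some trees.
forest_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
forest_oob.fit(X_demo, y_demo)

print('OOB accuracy estimate:', forest_oob.oob_score_)
```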


In [10]:
%store randomforest_test_score

Stored 'randomforest_test_score' (float64)


#### ii) SVMs (support vector machines)

In [11]:
from sklearn.svm import LinearSVC

In [12]:
linear = LinearSVC(C=1)
linear.fit(X_tr, y_tr)
linear.score(X_va, y_va)

0.920863309352518

In [13]:
svm_linear_train = linear.score(X_tr, y_tr)
svm_linear_valid = linear.score(X_va, y_va)
svm_linear_test = linear.score(X_te, y_te)

print('Accuracy Linear SVM train',svm_linear_train)
print('Accuracy Linear SVM validation',svm_linear_valid)
print('Accuracy Linear SVM test',svm_linear_test)

Accuracy Linear SVM train 1.0
Accuracy Linear SVM validation 0.920863309352518
Accuracy Linear SVM test 0.92


In [14]:
%store svm_linear_test

Stored 'svm_linear_test' (float64)


In [15]:
from sklearn.svm import SVC

In [16]:
svc = SVC(C=1,kernel='rbf', gamma=1)
svc.fit(X_tr, y_tr)
svc.score(X_va, y_va)

0.23741007194244604
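
A likely explanation for the low score with `gamma=1` (not verified on the notebook's data): with 2048-dimensional features, squared distances between distinct samples are typically much larger than 1, so the RBF kernel value `exp(-gamma * d^2)` is close to zero for every pair. The kernel matrix is then nearly the identity and the model cannot generalize beyond the training points. The sketch below illustrates the effect with random non-negative vectors as a stand-in for the features (an assumption).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
# Stand-in for high-level features (assumption): 20 samples, 2048 values in [0, 1).
X_demo = rng.random((20, 2048))

mean_off_diag = {}
for gamma in [1, 0.001]:
    K = rbf_kernel(X_demo, gamma=gamma)
    # Keep only similarities between *different* samples
    off_diag = K[~np.eye(len(K), dtype=bool)]
    mean_off_diag[gamma] = off_diag.mean()
    print(f'gamma={gamma}: mean off-diagonal kernel value = {mean_off_diag[gamma]:.3g}')
```

With `gamma=1` the off-diagonal values are vanishingly small, whereas with `gamma=0.001` they stay well away from zero, so samples can actually be compared with each other.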

In [17]:
Cs  = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
gammas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

svc_results = []
for c in Cs:
    for g in gammas:
        svc2 = SVC(C=c, kernel='rbf', gamma=g)
        svc2.fit(X_tr, y_tr)
        svc_results.append({
            'C':c,
            'Gamma':g,
            'train_accuracy':svc2.score(X_tr, y_tr),
            'valid_accuracy':svc2.score(X_va, y_va)
        })

In [18]:
pd.DataFrame(svc_results).sort_values(by='valid_accuracy', ascending=False).head(10)

| | C | Gamma | train_accuracy | valid_accuracy |
|---|---|---|---|---|
| 28 | 10.0 | 0.001 | 1.0 | 0.920863 |
| 22 | 1.0 | 0.01 | 1.0 | 0.920863 |
| 21 | 1.0 | 0.001 | 0.953571 | 0.920863 |
| 42 | 1000.0 | 0.001 | 1.0 | 0.913669 |
| 29 | 10.0 | 0.01 | 1.0 | 0.913669 |
| 35 | 100.0 | 0.001 | 1.0 | 0.913669 |
| 36 | 100.0 | 0.01 | 1.0 | 0.913669 |
| 43 | 1000.0 | 0.01 | 1.0 | 0.913669 |
| 15 | 0.1 | 0.01 | 0.75 | 0.76259 |
| 14 | 0.1 | 0.001 | 0.614286 | 0.640288 |


In [19]:
svc_final = SVC(C=1,kernel='rbf',gamma=0.001)
svc_final.fit(X_tr, y_tr)
svm_rbf_train = svc_final.score(X_tr, y_tr)
svm_rbf_valid = svc_final.score(X_va, y_va)
svm_rbf_test = svc_final.score(X_te, y_te)

print('Accuracy SVM RBF train:',svm_rbf_train)
print('Accuracy SVM RBF validation:',svm_rbf_valid)
print('Accuracy SVM RBF test:',svm_rbf_test)

Accuracy SVM RBF train: 0.9535714285714286
Accuracy SVM RBF validation: 0.920863309352518
Accuracy SVM RBF test: 0.94


In [20]:
%store svm_rbf_test

Stored 'svm_rbf_test' (float64)


#### Notes/Questions Task F (Greg)
* Should I standardize the data before feeding it to RandomForest (I would think yes)? *(cell 4)*
* Is it correct to say we don't need cross-validation with RandomForest because the multiple estimators already do the same job? *(cell 7)*
* Why do we reach 100% train accuracy here and with the previous estimator (logistic regression)? Is it over-fitting because there are not enough data points? *(cell 8)*
* Why is accuracy so bad for the RBF kernel without tuning the parameters? *(cell 16)*
* Could I use GridSearchCV for RandomForest or SVMs?
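
On the last question: `GridSearchCV` accepts any scikit-learn estimator, including `RandomForestClassifier` and `SVC`. The sketch below, on synthetic stand-in data (an assumption), tunes an RBF SVM inside a `Pipeline` with `StandardScaler`, so the scaling is fitted on the training folds only. The grid values are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 6 classes, 50 features.
X, y = make_classification(n_samples=300, n_features=50, n_informative=20,
                           n_classes=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled independently
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC(kernel='rbf'))])
param_grid = {'svc__C': [0.1, 1, 10, 100],
              'svc__gamma': [0.0001, 0.001, 0.01, 0.1]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Mean CV accuracy:', search.best_score_)
print('Test accuracy:', search.score(X_test, y_test))
```

For a random forest, the same pattern should work with `RandomForestClassifier` in the pipeline and grid keys such as `n_estimators` and `max_depth`. Tree-based models split on feature thresholds, so the scaler can usually be dropped there.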