# Applied Machine Learning 2
## Course project          
                                                 Author: Diego Rodriguez
## Nonlinear classifiers
Try nonlinear classifiers: can you do better than the baseline models from above?
- Try a random forest: does increasing the number of trees help?
- Try SVMs: does the RBF kernel perform better than the linear one?


In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

# Load the npz file
base_dir = '/Users/rodriguezmod/Downloads/swissroads/'

with np.load(base_dir+'features.npz', allow_pickle=False) as npz_file: 
    # It's a dictionary-like object 
    print(list(npz_file.keys()))
    
    # Load the arrays    
    # Merge the train and validation data so that model fitting can use cross-validation.
    X_tr = np.concatenate((npz_file['train_features'], npz_file['validation_features']))
    X_tr_pixels = np.concatenate((npz_file['train_pixels'], npz_file['validation_pixels']))
    y_tr = np.concatenate((npz_file['train_labels'], npz_file['validation_labels']))
    # Reduce to 1-dim
    y_tr = np.argmax(y_tr, axis=1)

    X_te = npz_file['test_features']
    X_te_pixels = npz_file['test_pixels']
    y_te = npz_file['test_labels']
    # Reduce to 1-dim
    y_te = np.argmax(y_te, axis=1)

['train_features', 'validation_features', 'test_features', 'train_labels', 'validation_labels', 'test_labels', 'train_pixels', 'validation_pixels', 'test_pixels']
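
The `np.argmax(..., axis=1)` step in the cell above reduces each label row to a single class index. A minimal sketch of that step, assuming the stored labels are one-hot encoded (the `one_hot` array and the `one_hot_to_index` helper are made up for illustration and are not taken from the dataset):

```python
import numpy as np

def one_hot_to_index(one_hot):
    """Return the column index of the largest entry in each row."""
    return np.argmax(one_hot, axis=1)

# Hypothetical labels for three samples and three classes
one_hot = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
])

labels = one_hot_to_index(one_hot)
print(labels)  # [0 2 1]
```

The resulting 1-dim array is the label format that the scikit-learn classifiers below expect in `fit` and `score`.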


## RFs performance
Try with a random forest, does increasing the number of trees help? Yes, it does. Let's first look at the result of a random forest with n_estimators = 1 and then at much higher values.

In [2]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest object (a single tree here, since n_estimators=1)
rfc = RandomForestClassifier(
         n_estimators=1, max_depth=12, random_state=0)

# Fitting on train set
rfc.fit(X_tr, y_tr)

# Evaluate on train set
accuracy_tr = rfc.score(X_tr, y_tr)

# Evaluate on test set
accuracy_te = rfc.score(X_te, y_te)

# Print train and test accuracy
print('Train accuracy: {:.1f}%'.format(100*accuracy_tr))
print('Test accuracy: {:.1f}%'.format(100*accuracy_te))

Train accuracy: 87.4%
Test accuracy: 68.0%


In [3]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'n_estimators'      : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 30, 100, 200],
    'max_depth'         : [8, 9, 10, 11, 12],
    #'random_state'      : [0],
    #'max_features': ['auto'],
    #'criterion' :['gini']
}
rfc_gscv = GridSearchCV(rfc, parameters, cv=10, n_jobs=10)

# Fitting on train set
rfc_gscv.fit(X_tr, y_tr)

# Evaluate on train set
accuracy_tr = rfc_gscv.score(X_tr, y_tr)

# Evaluate on test set
accuracy_te = rfc_gscv.score(X_te, y_te)

# Print train and test accuracy
print('Train accuracy: {:.1f}%'.format(100*accuracy_tr))
print('Test accuracy: {:.1f}%'.format(100*accuracy_te))

Train accuracy: 100.0%
Test accuracy: 88.0%


With n_estimators = 1, the random forest classifier reaches a test accuracy of **68.0%**, which is acceptable but leaves room for improvement. With the grid search, which also considers much higher n_estimators values (up to 200) and several max_depth values, the test accuracy rises to **88.0%**. This suggests that increasing the number of trees helps to get better results.
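
To look at the effect of the number of trees on its own, one option is to hold `max_depth` fixed and compare cross-validated accuracy for a few forest sizes. The sketch below is illustrative only: it runs on synthetic data from `make_classification` rather than on the project features, and the helper name `cv_accuracy_by_forest_size` and the chosen sizes are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the swissroads features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=0)

def cv_accuracy_by_forest_size(X, y, sizes, max_depth=12, cv=5):
    """Mean cross-validated accuracy for each value of n_estimators."""
    scores = {}
    for n in sizes:
        model = RandomForestClassifier(
            n_estimators=n, max_depth=max_depth, random_state=0)
        scores[n] = cross_val_score(model, X, y, cv=cv).mean()
    return scores

scores = cv_accuracy_by_forest_size(X_demo, y_demo, sizes=[1, 10, 100])
for n, acc in scores.items():
    print('n_estimators={:>3}: mean CV accuracy {:.1f}%'.format(n, 100*acc))
```

On the project data, the same function could be called with `X_tr` and `y_tr`. The fitted grid search also exposes `best_params_` and `cv_results_`, which show which combination of n_estimators and max_depth was actually selected.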

## SVMs performance
Try with SVMs - does the RBF kernel perform better than the linear one? Yes, slightly: on the test set the RBF kernel reaches 94.0% against 92.0% for the linear kernel. Both values are very good.

In [4]:
from sklearn.svm import LinearSVC

# Create SVM with linear kernel
linear_svc = LinearSVC()

# Fitting on train set
linear_svc.fit(X_tr, y_tr)

# Evaluate on train set
accuracy_tr = linear_svc.score(X_tr, y_tr)

# Evaluate on test set
accuracy_te = linear_svc.score(X_te, y_te)

# Print train and test accuracy
print('Train accuracy: {:.1f}%'.format(100*accuracy_tr))
print('Test accuracy: {:.1f}%'.format(100*accuracy_te))

Train accuracy: 100.0%
Test accuracy: 92.0%


In [5]:
from sklearn.svm import SVC

# Create SVM with RBF kernel
rbf_svc_c1 = SVC(kernel='rbf', C=1)

# Fitting on train set
rbf_svc_c1.fit(X_tr, y_tr)

# Evaluate on train set
accuracy_tr = rbf_svc_c1.score(X_tr, y_tr)

# Evaluate on test set
accuracy_te = rbf_svc_c1.score(X_te, y_te)

# Print train and test accuracy
print('Train accuracy: {:.1f}%'.format(100*accuracy_tr))
print('Test accuracy: {:.1f}%'.format(100*accuracy_te))

Train accuracy: 99.3%
Test accuracy: 94.0%
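
The linear and RBF test scores above come from a single test split, so the gap between them may not be stable. A cross-validated comparison is one way to check this. The sketch below is a hedged illustration on synthetic data, not a rerun of the project experiment: the `StandardScaler` step, the `max_iter` value and the 5-fold setting are assumptions added for the example and are not part of the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data (NOT the swissroads features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8,
    n_classes=3, random_state=0)

# Scaling is an assumption here; SVMs are often sensitive to feature scale
models = {
    'linear': make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    'rbf': make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1)),
}

# Mean 5-fold accuracy per kernel
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

for name, acc in cv_scores.items():
    print('{:>6} SVM: mean CV accuracy {:.1f}%'.format(name, 100*acc))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the preprocessing.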


## Summary

A random forest classifier with a single tree (n_estimators = 1) gives a test accuracy of **68.0%**, which is not a good value. A random forest classifier tuned with a cross-validated grid search gives a test accuracy of **88.0%**, which is a really good value and clearly better than the previous one.

Finally, a support vector machine (SVM) with a linear kernel gives a test accuracy of **92.0%**, and an SVM with an RBF kernel gives **94.0%**. Both values improve on the best random forest result (88.0%), so the RBF SVM is the best of the models tried here.