# Machine Learning 6: Support Vector Machines (SVM Multiclassification)

## Submission by: Mark Preston

This week, I'll be classifying a high-resolution aerial image into nine types of urban land cover using SVMs. The set contains multi-scale spectral, size, shape, and texture information. The nine cover types are: trees, grass, soil, concrete, asphalt, buildings, cars, pools, and shadows.

### Data Processing

Beginning the analysis, the data is already split into train and test. The dimensions of each set are worth noting: with 168 rows and 148 columns, the test set is almost a square matrix, which is unusual to see in prediction tasks. Below that, I've checked for missing values. Neither set has any null values, so no further cleaning is needed.

In [2]:
import pandas as pd

land_cover_train = pd.read_csv("train_data.csv")

land_cover_test = pd.read_csv("test_data.csv")

print("Land cover training set has", 
      land_cover_train.shape[0], "rows and", 
      land_cover_train.shape[1],  "columns")

print("Land cover test set has", 
      land_cover_test.shape[0], "rows and", 
      land_cover_test.shape[1],  "columns")

Land cover training set has 507 rows and 148 columns
Land cover test set has 168 rows and 148 columns


In [3]:
import numpy as np

train_null_count = land_cover_train.isnull()
test_null_count = land_cover_test.isnull()

print("Train null values:", np.count_nonzero(train_null_count))
print("Test null values:", np.count_nonzero(test_null_count))

Train null values: 0
Test null values: 0


The class distribution seems fairly consistent across the two sets. The outcome variable is not perfectly balanced, though, given the class proportions vary from around 8% to 17% in test and 3% to 19% in train.

In [4]:
class_null_count = pd.DataFrame({
    "train_class_count": land_cover_train["class"].value_counts() / land_cover_train.shape[0],
    "test_class_count": land_cover_test["class"].value_counts() / land_cover_test.shape[0]
})

class_null_count

Unnamed: 0,test_class_count,train_class_count
asphalt,0.083333,0.088757
building,0.14881,0.191321
car,0.089286,0.04142
concrete,0.136905,0.183432
grass,0.172619,0.163708
pool,0.089286,0.027613
shadow,0.095238,0.088757
soil,0.083333,0.039448
tree,0.10119,0.175542
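As a rough sketch of how the imbalance can be summarized in a single number, the snippet below recomputes the training proportions from the per-class counts (the same counts reported as support in the training classification reports later in this notebook) and takes the ratio of the largest to the smallest class. The `imbalance_ratio` name is only an illustrative choice, not part of the original analysis.

```python
from collections import Counter

# Per-class training counts, as listed in the "support" column of the train reports.
train_counts = Counter({
    "asphalt": 45, "building": 97, "car": 21, "concrete": 93, "grass": 83,
    "pool": 14, "shadow": 45, "soil": 20, "tree": 89,
})

total = sum(train_counts.values())
proportions = {label: count / total for label, count in train_counts.items()}

rarest = min(proportions, key=proportions.get)
most_common = max(proportions, key=proportions.get)

# Ratio between the largest and smallest class: a rough imbalance indicator.
imbalance_ratio = train_counts[most_common] / train_counts[rarest]

print(f"{rarest}: {proportions[rarest]:.3f}, {most_common}: {proportions[most_common]:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio well above 1 suggests that per-class metrics deserve as much attention as the averages.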


Before moving into the prediction work, I've separated the outcome variable from the predictors and scaled the training and test predictors. The scaler is fit on the training set only and then applied to both sets.

In [5]:
from sklearn.preprocessing import StandardScaler

X_train = land_cover_train.drop(columns=["class"])
y_train = land_cover_train[["class"]]

X_test = land_cover_test.drop(columns=["class"])
y_test = land_cover_test[["class"]]

scaler = StandardScaler()
#fit on the training predictors only, then reuse the same transform for test
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

In [6]:
X_train_scaled[["Area", "Round", "Bright", "Compact"]].head()

Unnamed: 0,Area,Round,Bright,Compact
0,-0.618232,-0.761575,0.904361,-0.603625
1,0.431962,-0.530025,-1.86886,-0.889931
2,-0.219932,-0.423156,-1.808387,-0.93574
3,-0.537999,1.197695,-1.512353,1.400516
4,-0.639723,1.447056,-1.203813,0.999688


In [7]:
X_test_scaled[["Area", "Round", "Bright", "Compact"]].head()

Unnamed: 0,Area,Round,Bright,Compact
0,-0.675542,-0.476591,1.041138,-0.912836
1,-0.460631,0.574291,0.800037,0.312553
2,-0.424813,0.413987,1.053803,-0.134084
3,-0.234259,0.075567,1.025624,0.34691
4,0.546581,0.877087,0.436406,0.106413
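To make the scaling step concrete, here is a minimal pure-Python sketch of standardization on a single feature. The numbers are hypothetical and only illustrate the idea that the mean and standard deviation come from the training values and are then reused for the test values. `StandardScaler` uses the population standard deviation, which `statistics.pstdev` mirrors here.

```python
from statistics import fmean, pstdev

# Hypothetical values for one area-like feature (not taken from the data set).
train_values = [120.0, 340.0, 560.0, 980.0, 1500.0]
test_values = [200.0, 1500.0, 3000.0]

# Statistics are estimated from the training values only.
train_mean = fmean(train_values)
train_std = pstdev(train_values)  # population standard deviation (ddof = 0)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using externally supplied statistics."""
    return [(x - mean) / std for x in values]


train_scaled = standardize(train_values, train_mean, train_std)
test_scaled = standardize(test_values, train_mean, train_std)

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

The scaled training values have mean 0 and unit variance by construction, while the scaled test values generally do not, which is expected and matches the non-zero test values shown above.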


### Random Forest Classifier - Base Model

I've started the prediction task by developing a function to construct the baseline models. From there, I've used it to develop the random forest.

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

classes = list(class_null_count.index)

rf=RandomForestClassifier(random_state=1017)

def base_model_dev(classifier):
    #set model
    classifier.fit(X_train_scaled, y_train.values.ravel())
    
    #confusion matrices
    train_preds = pd.DataFrame({
        "actual": y_train["class"],
        "class_pred": classifier.predict(X_train_scaled)
    })
    
    test_preds = pd.DataFrame({
        "actual": y_test["class"],
        "class_pred": classifier.predict(X_test_scaled)
    })
    
    train_cm = pd.DataFrame(confusion_matrix(y_true=train_preds["actual"], 
                                             y_pred= train_preds["class_pred"]), 
                                             columns=classes,
                                             index=classes)
    
    test_cm = pd.DataFrame(confusion_matrix(y_true=test_preds["actual"], 
                                            y_pred= test_preds["class_pred"]),
                                            columns=classes,
                                            index=classes)
                            
    
    #classification reports
    train_class_report = classification_report(train_preds["actual"], train_preds["class_pred"])
    test_class_report = classification_report(test_preds["actual"], test_preds["class_pred"])
    
    output = {"test_cm": test_cm,
              "train_cm": train_cm,
              "test_report": test_class_report,
              "train_report": train_class_report
              }

    #feature importances only exist for tree-based models such as the random forest
    if hasattr(classifier, "feature_importances_"):
        output["feature_importance"] = pd.DataFrame({
            "name": list(X_train_scaled.columns),
            "var_importance": classifier.feature_importances_
        })

    return output

rf_baseline = base_model_dev(classifier = rf)



The test confusion matrix and classification report show the initial model provides good classification results. Soil seems to be the hardest class given it has the lowest F1 score (0.53, driven by a recall of only 0.36), while shadow has the highest (0.97), indicating the random forest does well classifying it.

In [9]:
rf_baseline["test_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,14,0,0,0,0,0,0,0,0
building,1,22,0,2,0,0,0,0,0
car,0,0,13,1,0,1,0,0,0
concrete,0,5,0,18,0,0,0,0,0
grass,0,0,0,0,26,0,0,0,3
pool,0,0,1,0,0,14,0,0,0
shadow,1,0,0,0,0,0,15,0,0
soil,0,1,0,6,2,0,0,5,0
tree,0,0,0,1,4,0,0,0,12


In [10]:
print(rf_baseline["test_report"])

             precision    recall  f1-score   support

   asphalt        0.88      1.00      0.93        14
  building        0.79      0.88      0.83        25
       car        0.93      0.87      0.90        15
  concrete        0.64      0.78      0.71        23
     grass        0.81      0.90      0.85        29
      pool        0.93      0.93      0.93        15
    shadow        1.00      0.94      0.97        16
      soil        1.00      0.36      0.53        14
      tree        0.80      0.71      0.75        17

avg / total       0.84      0.83      0.82       168
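The per-class figures in the report can be recomputed directly from the test confusion matrix, which helps when checking which class is weakest. The sketch below copies the random forest test matrix from above (rows are actual classes, columns are predictions); the `per_class_metrics` helper is a hypothetical name introduced for illustration.

```python
labels = ["asphalt", "building", "car", "concrete", "grass",
          "pool", "shadow", "soil", "tree"]

# Random forest test confusion matrix (rows = actual, columns = predicted).
test_cm = [
    [14, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 22, 0, 2, 0, 0, 0, 0, 0],
    [0, 0, 13, 1, 0, 1, 0, 0, 0],
    [0, 5, 0, 18, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 26, 0, 0, 0, 3],
    [0, 0, 1, 0, 0, 14, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 15, 0, 0],
    [0, 1, 0, 6, 2, 0, 0, 5, 0],
    [0, 0, 0, 1, 4, 0, 0, 0, 12],
]


def per_class_metrics(cm, names):
    """Return precision, recall and F1 for each class of a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(names):
        true_pos = cm[i][i]
        actual = sum(cm[i])                     # row sum: samples that truly belong to the class
        predicted = sum(row[i] for row in cm)   # column sum: samples predicted as the class
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


metrics = per_class_metrics(test_cm, labels)
for name in sorted(metrics, key=lambda n: metrics[n]["f1"]):
    m = metrics[name]
    print(f"{name:>9}  precision={m['precision']:.2f}  recall={m['recall']:.2f}  f1={m['f1']:.2f}")
```

Sorting by F1 puts the weakest class first, which makes it easy to see where recall or precision is dragging the score down.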



Naturally, the baseline model is overfit given no hyperparameters were tuned. The forest fits the training data almost perfectly, which leads to a gulf between train and test evaluation metrics: the averages are 0.99 on train versus 0.82 to 0.84 on test. The training confusion matrix and classification report below show these large differences, which supports this intuition.

In [11]:
rf_baseline["train_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,45,0,0,0,0,0,0,0,0
building,0,96,0,1,0,0,0,0,0
car,0,0,21,0,0,0,0,0,0
concrete,0,2,0,91,0,0,0,0,0
grass,0,0,0,0,83,0,0,0,0
pool,0,0,0,0,0,14,0,0,0
shadow,0,0,0,0,0,0,45,0,0
soil,0,2,0,0,0,0,0,18,0
tree,0,0,0,0,0,0,0,0,89


In [12]:
print(rf_baseline["train_report"])

             precision    recall  f1-score   support

   asphalt        1.00      1.00      1.00        45
  building        0.96      0.99      0.97        97
       car        1.00      1.00      1.00        21
  concrete        0.99      0.98      0.98        93
     grass        1.00      1.00      1.00        83
      pool        1.00      1.00      1.00        14
    shadow        1.00      1.00      1.00        45
      soil        1.00      0.90      0.95        20
      tree        1.00      1.00      1.00        89

avg / total       0.99      0.99      0.99       507
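One simple way to put a number on the overfitting is the difference between training and test accuracy. The sketch below uses the diagonals of the two random forest confusion matrices above; treating this difference as "the gap" is an illustrative convention rather than a formal measure.

```python
# Correct predictions per class (confusion matrix diagonals) for the random forest.
train_correct = [45, 96, 21, 91, 83, 14, 45, 18, 89]
test_correct = [14, 22, 13, 18, 26, 14, 15, 5, 12]

train_rows = 507
test_rows = 168

train_accuracy = sum(train_correct) / train_rows
test_accuracy = sum(test_correct) / test_rows
gap = train_accuracy - test_accuracy

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
print(f"gap:            {gap:.3f}")
```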



The top 5 features are shown below. Of these, NDVI is the highest, though they are all fairly close in value. As a methodology note, NDVI is the Normalized Difference Vegetation Index. Its importance makes sense given cover like asphalt and concrete would have little vegetation, while vegetated cover like grass and trees would score higher.

In [13]:
rf_baseline["feature_importance"].sort_values(by=["var_importance"], ascending=False).head()

Unnamed: 0,name,var_importance
18,NDVI,0.058519
27,Mean_G_40,0.045158
28,Mean_R_40,0.040262
102,NDVI_100,0.036004
70,Mean_R_80,0.034834
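For reference, NDVI is conventionally computed from near-infrared and red reflectance as (NIR - Red) / (NIR + Red), giving values between -1 and 1. The snippet below is a small sketch of that formula; the reflectance values are hypothetical, and how the data set's own NDVI columns were derived is an assumption here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return (nir - red) / total


# Hypothetical reflectances: vegetation reflects strongly in NIR, pavement does not.
vegetation = ndvi(nir=0.50, red=0.08)
pavement = ndvi(nir=0.30, red=0.28)

print(f"vegetation NDVI: {vegetation:.3f}")
print(f"pavement NDVI:   {pavement:.3f}")
```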


### Linear SVM Classifier - Base Model

The second baseline model is a linear support vector classifier. It predicts worse than the random forest baseline given the average precision, recall, and F1-score are all lower (0.80, 0.78, 0.77 versus 0.84, 0.83, 0.82). Much like the random forest, the SVC also shows overfitting given the difference between train and test evaluation metrics. Future iterations will tune the model to get better test results while also narrowing the gap between train and test accuracy.

In [14]:
from sklearn.svm import LinearSVC

lin_svc = LinearSVC(random_state=1017)

svm_baseline = base_model_dev(classifier = lin_svc)

In [15]:
svm_baseline["test_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,13,0,0,0,0,0,1,0,0
building,0,22,1,1,1,0,0,0,0
car,0,2,12,0,0,0,0,0,1
concrete,1,6,1,15,0,0,0,0,0
grass,0,0,0,1,26,0,0,0,2
pool,1,0,1,0,0,13,0,0,0
shadow,2,0,0,0,0,0,14,0,0
soil,0,4,0,1,3,0,0,6,0
tree,0,0,0,1,6,0,0,0,10


In [16]:
print(svm_baseline["test_report"])

             precision    recall  f1-score   support

   asphalt        0.76      0.93      0.84        14
  building        0.65      0.88      0.75        25
       car        0.80      0.80      0.80        15
  concrete        0.79      0.65      0.71        23
     grass        0.72      0.90      0.80        29
      pool        1.00      0.87      0.93        15
    shadow        0.93      0.88      0.90        16
      soil        1.00      0.43      0.60        14
      tree        0.77      0.59      0.67        17

avg / total       0.80      0.78      0.77       168



In [17]:
svm_baseline["train_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,45,0,0,0,0,0,0,0,0
building,0,97,0,0,0,0,0,0,0
car,0,0,21,0,0,0,0,0,0
concrete,0,0,0,93,0,0,0,0,0
grass,0,1,0,0,80,0,0,0,2
pool,0,0,0,0,0,14,0,0,0
shadow,0,0,0,0,0,0,45,0,0
soil,0,0,0,0,0,0,0,20,0
tree,0,0,0,0,0,0,0,0,89


In [18]:
print(svm_baseline["train_report"])

             precision    recall  f1-score   support

   asphalt        1.00      1.00      1.00        45
  building        0.99      1.00      0.99        97
       car        1.00      1.00      1.00        21
  concrete        1.00      1.00      1.00        93
     grass        1.00      0.96      0.98        83
      pool        1.00      1.00      1.00        14
    shadow        1.00      1.00      1.00        45
      soil        1.00      1.00      1.00        20
      tree        0.98      1.00      0.99        89

avg / total       0.99      0.99      0.99       507
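Since this is a nine-class problem, it is worth noting how the binary SVM is extended. In scikit-learn, `LinearSVC` uses a one-vs-rest strategy, while `SVC` (used in the following sections) uses one-vs-one. The sketch below only counts the binary problems each strategy implies for the nine cover types; it does not train anything.

```python
from itertools import combinations

labels = ["asphalt", "building", "car", "concrete", "grass",
          "pool", "shadow", "soil", "tree"]

# One-vs-rest: one binary problem per class (that class against all others).
ovr_problems = [(label, "rest") for label in labels]

# One-vs-one: one binary problem per unordered pair of classes.
ovo_problems = list(combinations(labels, 2))

print("one-vs-rest problems:", len(ovr_problems))
print("one-vs-one problems: ", len(ovo_problems))
```

This difference is one reason `LinearSVC` and `SVC(kernel="linear")` can give different results on the same data, alongside their different default loss functions.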



### Support Vector Machine Classifier + Linear Kernel + Grid Search

I've developed the initial SVM below using a linear kernel.

In [19]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def tuned_sv(kernel_choice, tune_params):
    svm_model = SVC(kernel = kernel_choice)
    
    grid_search_model = GridSearchCV(svm_model, tune_params, cv = 5, refit = True, n_jobs = -1, verbose = 5)
    grid_search_model.fit(X_train_scaled, y_train.values.ravel())
    
    #confusion matrices
    train_preds = pd.DataFrame({
        "actual": y_train["class"],
        "class_pred": grid_search_model.predict(X_train_scaled)
    })
    
    test_preds = pd.DataFrame({
        "actual": y_test["class"],
        "class_pred": grid_search_model.predict(X_test_scaled)
    })
    
    train_cm = pd.DataFrame(confusion_matrix(y_true=train_preds["actual"], 
                                             y_pred= train_preds["class_pred"]), 
                                             columns=classes,
                                             index=classes)
    
    test_cm = pd.DataFrame(confusion_matrix(y_true=test_preds["actual"], 
                                            y_pred= test_preds["class_pred"]),
                                            columns=classes,
                                            index=classes)
                            
    
    #classification reports
    train_class_report = classification_report(train_preds["actual"], train_preds["class_pred"])
    test_class_report = classification_report(test_preds["actual"], test_preds["class_pred"])
    
    output = {"test_cm": test_cm,
              "train_cm": train_cm,
              "test_report": test_class_report,
              "train_report": train_class_report
              }
    
    return(output)

#np.linspace needs an integer number of points; 10/.2 is a float and fails on newer NumPy
lin_params = {"C": np.linspace(.01, 10, 50).round(2)}

svm_lin_kernel = tuned_sv(kernel_choice = "linear", tune_params = lin_params) 



Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done  91 tasks      | elapsed:    5.0s
[Parallel(n_jobs=-1)]: Done 243 out of 250 | elapsed:    8.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 250 out of 250 | elapsed:    8.3s finished


The test results show an improvement over the initial linear support vector classifier, but not over the random forest. There are still signs of overfitting given the train and test report metrics are not congruent. That said, the training metric averages were at .89 and each of the test metrics was between .81 and .83, so the gap narrowed substantially compared with the baseline model. I think this further indicates that the classes are broadly separable by a linear decision boundary too, although the polynomial and RBF kernels will provide a comparison for this.

In [20]:
svm_lin_kernel["test_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,13,0,0,0,0,0,1,0,0
building,0,22,0,2,1,0,0,0,0
car,0,1,14,0,0,0,0,0,0
concrete,0,5,0,17,0,0,0,1,0
grass,0,0,0,1,25,0,0,0,3
pool,0,0,0,0,0,14,1,0,0
shadow,1,0,0,0,0,0,15,0,0
soil,0,3,0,5,2,0,0,4,0
tree,0,0,0,1,2,0,0,0,14


In [21]:
print(svm_lin_kernel["test_report"])

             precision    recall  f1-score   support

   asphalt        0.93      0.93      0.93        14
  building        0.71      0.88      0.79        25
       car        1.00      0.93      0.97        15
  concrete        0.65      0.74      0.69        23
     grass        0.83      0.86      0.85        29
      pool        1.00      0.93      0.97        15
    shadow        0.88      0.94      0.91        16
      soil        0.80      0.29      0.42        14
      tree        0.82      0.82      0.82        17

avg / total       0.83      0.82      0.81       168



In [22]:
svm_lin_kernel["train_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,40,0,0,0,0,0,5,0,0
building,2,87,0,7,0,0,1,0,0
car,0,1,19,1,0,0,0,0,0
concrete,0,9,0,83,1,0,0,0,0
grass,0,1,0,0,70,0,0,0,12
pool,0,1,0,0,1,12,0,0,0
shadow,1,0,0,0,0,0,43,0,1
soil,0,3,0,4,2,0,0,11,0
tree,0,0,0,0,3,0,1,0,85


In [23]:
print(svm_lin_kernel["train_report"])

             precision    recall  f1-score   support

   asphalt        0.93      0.89      0.91        45
  building        0.85      0.90      0.87        97
       car        1.00      0.90      0.95        21
  concrete        0.87      0.89      0.88        93
     grass        0.91      0.84      0.88        83
      pool        1.00      0.86      0.92        14
    shadow        0.86      0.96      0.91        45
      soil        1.00      0.55      0.71        20
      tree        0.87      0.96      0.91        89

avg / total       0.89      0.89      0.89       507



### Support Vector Machine Classifier + Poly Kernel + Grid Search

In [24]:
poly_params = {"C": np.linspace(.01, 10, 50).round(2),  #number of points must be an integer
               "degree": [2, 3, 4, 5, 6]
              }

svm_poly_kernel = tuned_sv(kernel_choice = "poly", tune_params = poly_params)


Fitting 5 folds for each of 250 candidates, totalling 1250 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    5.8s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:    8.7s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   19.1s
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:   25.8s
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:   33.7s
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed:   43.1s
[Parallel(n_jobs=-1)]: Done 1250 out of 1250 | elapsed:   46.5s finished
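The fit counts in the logs follow directly from the grid sizes: every combination of parameter values is a candidate, and each candidate is fit once per cross-validation fold. The sketch below reproduces that arithmetic in plain Python; the `linspace` helper is a stand-in written for this example that mimics `np.linspace(...).round(2)`.

```python
from itertools import product


def linspace(start, stop, num):
    """Evenly spaced values including both endpoints, rounded to 2 decimals."""
    step = (stop - start) / (num - 1)
    return [round(start + i * step, 2) for i in range(num)]


c_values = linspace(0.01, 10, 50)
degrees = [2, 3, 4, 5, 6]
cv_folds = 5

# Linear kernel grid: C only. Poly kernel grid: every (C, degree) pair.
linear_grid = [{"C": c} for c in c_values]
poly_grid = [{"C": c, "degree": d} for c, d in product(c_values, degrees)]

linear_fits = len(linear_grid) * cv_folds
poly_fits = len(poly_grid) * cv_folds

print(len(linear_grid), "candidates,", linear_fits, "fits")
print(len(poly_grid), "candidates,", poly_fits, "fits")
```

With `refit = True`, the grid search also fits the best candidate once more on the full training set, which is not included in these counts.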


Interestingly, the poly kernel shows worse evaluation metrics than the linear kernel (.76, .78, .76 versus .83, .82, .81 on test), with soil in particular dropping to an F1 of .11. I think this further highlights that a linear option provides reasonable class separation. The poly version is also very overfit, with the train/test evaluation split more in line with the un-tuned SVC.

In [25]:
svm_poly_kernel["test_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,13,0,0,0,0,0,1,0,0
building,0,22,0,2,1,0,0,0,0
car,0,2,11,0,0,1,0,1,0
concrete,0,5,0,17,0,0,0,1,0
grass,0,0,0,0,26,0,0,1,2
pool,0,0,0,0,0,14,1,0,0
shadow,2,0,0,0,0,0,14,0,0
soil,0,3,0,4,6,0,0,1,0
tree,0,0,0,1,3,0,0,0,13


In [26]:
print(svm_poly_kernel["test_report"])

             precision    recall  f1-score   support

   asphalt        0.87      0.93      0.90        14
  building        0.69      0.88      0.77        25
       car        1.00      0.73      0.85        15
  concrete        0.71      0.74      0.72        23
     grass        0.72      0.90      0.80        29
      pool        0.93      0.93      0.93        15
    shadow        0.88      0.88      0.88        16
      soil        0.25      0.07      0.11        14
      tree        0.87      0.76      0.81        17

avg / total       0.76      0.78      0.76       168



In [27]:
svm_poly_kernel["train_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,45,0,0,0,0,0,0,0,0
building,0,97,0,0,0,0,0,0,0
car,0,0,20,0,1,0,0,0,0
concrete,0,1,0,91,1,0,0,0,0
grass,0,1,0,0,82,0,0,0,0
pool,0,0,0,0,1,13,0,0,0
shadow,0,0,0,0,0,0,45,0,0
soil,0,0,0,0,5,0,0,15,0
tree,0,0,0,0,1,0,0,0,88


In [28]:
print(svm_poly_kernel["train_report"])

             precision    recall  f1-score   support

   asphalt        1.00      1.00      1.00        45
  building        0.98      1.00      0.99        97
       car        1.00      0.95      0.98        21
  concrete        1.00      0.98      0.99        93
     grass        0.90      0.99      0.94        83
      pool        1.00      0.93      0.96        14
    shadow        1.00      1.00      1.00        45
      soil        1.00      0.75      0.86        20
      tree        1.00      0.99      0.99        89

avg / total       0.98      0.98      0.98       507



### Support Vector Machine Classifier + RBF Kernel + Grid Search

The final model here is an SVM with an RBF kernel. The test report highlights this being the best model here given it has the highest average evaluation metrics. This is the only SVM model I developed that surpassed the random forest baseline, and only narrowly, which leads me to believe a tuned RF would likely be a better modelling choice. Nonetheless, the RBF kernel has the best result here. That said, despite having a .86, .84, .83 split for the test evaluation metrics, the training values show noticeable overfitting. This is evident given all the training metrics are in the high 90s, which points to a large generalization gap.

In [29]:
rbf_params = {"C": np.linspace(.01, 10, 50).round(2),  #number of points must be an integer
              "gamma": [.01, .1, 1, 10, 100]
             }

svm_rbf_kernel = tuned_sv(kernel_choice = "rbf", tune_params = rbf_params)


Fitting 5 folds for each of 250 candidates, totalling 1250 fits


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:   17.9s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:   27.1s
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed:   37.4s
[Parallel(n_jobs=-1)]: Done 874 tasks      | elapsed:   49.8s
[Parallel(n_jobs=-1)]: Done 1144 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1250 out of 1250 | elapsed:  1.2min finished


In [30]:
svm_rbf_kernel["test_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,13,0,0,0,0,0,1,0,0
building,0,21,0,3,1,0,0,0,0
car,0,1,13,1,0,0,0,0,0
concrete,0,4,0,19,0,0,0,0,0
grass,0,1,0,0,26,0,0,0,2
pool,0,0,0,0,0,14,1,0,0
shadow,1,0,0,0,0,0,15,0,0
soil,0,2,0,4,3,0,0,5,0
tree,0,0,0,1,1,0,0,0,15


In [31]:
print(svm_rbf_kernel["test_report"])

             precision    recall  f1-score   support

   asphalt        0.93      0.93      0.93        14
  building        0.72      0.84      0.78        25
       car        1.00      0.87      0.93        15
  concrete        0.68      0.83      0.75        23
     grass        0.84      0.90      0.87        29
      pool        1.00      0.93      0.97        15
    shadow        0.88      0.94      0.91        16
      soil        1.00      0.36      0.53        14
      tree        0.88      0.88      0.88        17

avg / total       0.86      0.84      0.83       168



In [32]:
svm_rbf_kernel["train_cm"]

Unnamed: 0,asphalt,building,car,concrete,grass,pool,shadow,soil,tree
asphalt,45,0,0,0,0,0,0,0,0
building,0,97,0,0,0,0,0,0,0
car,0,0,21,0,0,0,0,0,0
concrete,0,1,0,92,0,0,0,0,0
grass,0,1,0,0,82,0,0,0,0
pool,0,0,0,0,0,14,0,0,0
shadow,0,0,0,0,0,0,45,0,0
soil,0,0,0,0,0,0,0,20,0
tree,0,0,0,0,1,0,0,0,88


In [34]:
print(svm_rbf_kernel["train_report"])

             precision    recall  f1-score   support

   asphalt        1.00      1.00      1.00        45
  building        0.98      1.00      0.99        97
       car        1.00      1.00      1.00        21
  concrete        1.00      0.99      0.99        93
     grass        0.99      0.99      0.99        83
      pool        1.00      1.00      1.00        14
    shadow        1.00      1.00      1.00        45
      soil        1.00      1.00      1.00        20
      tree        1.00      0.99      0.99        89

avg / total       0.99      0.99      0.99       507



### Conceptual Questions

#### From the models run in steps 2-6, which performs the best based on the Classification Report? Support your reasoning with evidence around your test data. 

I touched on this above, but the SVM with an RBF kernel showed the best results. This model had the highest average precision, recall, and F1 metrics when compared to the other four models, which means it's the preferable choice here. That said, it shows only marginally higher metrics when compared to the baseline random forest (.86, .84, .83 vs .84, .83, .82). This leads me to think that trying a more thoroughly tuned RF might show better results than the RBF SVM. However, this is still the best choice here given the test classification report metrics.

#### Compare models run for steps 4-6 where different kernels were used. What is the benefit of using a polynomial or rbf kernel over a linear kernel? What could be a downside of using a polynomial or rbf kernel? 

A linear classifier is appropriate when the classes can be divided by a straight line or hyperplane (linearly separable). If an EDA shows such a boundary, then a linear classifier is usually the best choice. However, a large number of problems need more complexity for separating classes, which is where polynomial and RBF kernels help. A polynomial kernel adds higher-order terms, such as $x^2$, $x^3$, and so on, which adds curvature to the decision boundary and makes it more flexible to non-linearity. The downside is adding a lot of features, which can slow down the model development process (though the kernel trick helps avoid this to an extent). The Gaussian RBF, or radial basis function, measures the similarity between an input and some landmark point, again adding complexity; its gamma parameter controls how quickly that similarity decays with distance. Overall, both kernels are useful for taking non-linear separations and mapping them into a higher-dimensional feature space where they become linearly separable. However, the downside is this model complexity: both add hyperparameters to tune, both can overfit (as the training reports above show), and neither scales well to larger sets. As a rule of thumb, a simpler kernel, like linear, is a good baseline; following this, added kernel complexity can be compared against the result.
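As a rough illustration of the three kernels discussed here, the sketch below writes each one as a plain Python function. The polynomial form with a `coef0` offset is one common parameterization and is an assumption of this example (scikit-learn's version also includes a gamma factor); the input vectors are hypothetical, and the gamma values are a subset of those in the RBF grid above.

```python
import math


def dot(x, z):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return (dot(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.1):
    """Gaussian similarity: 1 for identical points, decaying towards 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x = [1.0, 2.0]
near = [1.5, 2.0]
far = [4.0, 6.0]

print("linear:", linear_kernel(x, far))
print("poly (degree 2):", polynomial_kernel(x, far, degree=2))
for gamma in (0.01, 1, 100):
    print(f"gamma={gamma}: near={rbf_kernel(x, near, gamma):.4f}, far={rbf_kernel(x, far, gamma):.4f}")
```

Larger gamma values make the RBF similarity drop off faster, so each training point influences a smaller neighbourhood and the boundary becomes more flexible.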

#### Explain the 'C' parameter used in steps 4-6. What does a small C mean versus a large C in sklearn? Why is it important to use the 'C' parameter when fitting a model? 

C is a regularization parameter that influences how much misclassification is permitted in an SVM. Small C values penalize margin violations lightly, which produces a larger-margin separating hyperplane at the cost of more misclassified training points. Larger C values penalize violations heavily, which produces a smaller-margin hyperplane that does a better job of classifying all the training points correctly but is more prone to overfitting. It's important to experiment with different C values because it's one of the essential hyperparameters influencing classification results. For example, the grid searches I did here use 50 C values to find the best classification result. Searching over a variety of C values lends itself to finding a better classifier since C directly shapes the margin and which observations end up as support vectors.
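The trade-off can be seen in the soft-margin objective, which adds a margin term and C times the total hinge loss. The sketch below evaluates that objective for two fixed candidate boundaries on hypothetical one-dimensional data; it is only a comparison of two hand-picked candidates under a binary setup, not an SVM solver.

```python
# Hypothetical 1-D data: (feature value, label in {-1, +1}).
points = [(-2.0, -1), (-1.5, -1), (0.5, -1), (1.5, 1), (2.0, 1), (3.0, 1)]


def soft_margin_objective(w, b, C, data):
    """0.5 * w^2 + C * (sum of hinge losses) for a 1-D linear classifier."""
    hinge_total = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * hinge_total


# Wide margin (half-width 1/|w| = 2.0) that misclassifies the point at x = 0.5.
wide = {"w": 0.5, "b": 0.0}
# Narrow margin (half-width 1/|w| = 0.5) that separates every point.
narrow = {"w": 2.0, "b": -2.0}

for C in (0.1, 10):
    wide_obj = soft_margin_objective(C=C, data=points, **wide)
    narrow_obj = soft_margin_objective(C=C, data=points, **narrow)
    preferred = "wide" if wide_obj < narrow_obj else "narrow"
    print(f"C={C}: wide={wide_obj:.3f}, narrow={narrow_obj:.3f}, preferred={preferred}")
```

With the small C the wide-margin candidate has the lower objective despite its misclassification, while the large C flips the preference to the narrow-margin candidate.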


#### Scaling our input data does not matter much for Random Forest, but it is a critical step for Support Vector Machines. Explain why this is such a critical step. Also, provide an example of a feature from this data set that could cause issues with our SVMs if not scaled.

SVMs are sensitive to feature scale. This is because the modelling approach sets up a separating hyperplane and support vectors by maximizing the margin, which is measured as a distance in feature space. This means that a single feature with very high values becomes dominant when calculating the distances. Using the data set here, I've picked out the max value from each column and sorted the top 5, all of which are area features. As seen, the values are as high as 51,578. In contrast, the lower scale features, like NDVI, sit within a small decimal range, with minimums of only about -0.38. With this in mind, the features have to be put on the same scale so features like area do not dominate the hyperplane and support vector definition process.

In [182]:
pd.DataFrame({"column_max": land_cover_train.max(numeric_only=True)}).sort_values(by=["column_max"], ascending=False).head()

Unnamed: 0,column_max
Area_120,51578.0
Area_140,51578.0
Area_80,42018.0
Area_100,42018.0
Area_60,24295.0


In [36]:
pd.DataFrame({"column_min": land_cover_train.min(numeric_only=True)}).sort_values(by=["column_min"], ascending=True).head()

Unnamed: 0,column_min
NDVI,-0.38
NDVI_40,-0.37
NDVI_140,-0.36
NDVI_60,-0.36
NDVI_80,-0.36
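A quick sketch of why this matters: the snippet below measures how much of the squared Euclidean distance between two samples comes from the area feature, before and after dividing each feature by a spread. The first sample uses the column extremes shown above; the second sample and the `assumed_std` values are hypothetical and only illustrate the effect.

```python
import math

# First sample uses the extremes from the tables above; the rest is hypothetical.
a = {"Area_120": 51578.0, "NDVI": -0.38}
b = {"Area_120": 1000.0, "NDVI": 0.40}
assumed_std = {"Area_120": 10000.0, "NDVI": 0.20}


def squared_contributions(p, q, scale=None):
    """Per-feature squared differences, optionally divided by a per-feature scale."""
    contributions = {}
    for name in p:
        diff = p[name] - q[name]
        if scale is not None:
            diff /= scale[name]
        contributions[name] = diff ** 2
    return contributions


def area_share(contributions):
    """Fraction of the squared distance that comes from the area feature."""
    return contributions["Area_120"] / sum(contributions.values())


raw = squared_contributions(a, b)
scaled = squared_contributions(a, b, assumed_std)

print(f"raw distance:    {math.sqrt(sum(raw.values())):.2f}, area share: {area_share(raw):.6f}")
print(f"scaled distance: {math.sqrt(sum(scaled.values())):.2f}, area share: {area_share(scaled):.3f}")
```

On the raw values the NDVI difference is effectively invisible to the distance, whereas after scaling both features contribute meaningfully.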


#### Describe conceptually what the purpose of a kernel is for Support Vector Machines

A kernel takes a more simplistic feature space and implicitly maps it into a higher-dimensional space using some function. In this new space, the class separation can become linear, which is the ideal situation for developing a classifier. This linear solution can then be mapped back into the original feature space, yielding a non-linear boundary for classes that are not linearly separable there. An additional benefit is that the kernel trick constructs the classification boundary without explicitly adding these high-dimensional terms, like polynomials, which provides some computational efficiency as well.
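The computational point can be checked on a small case. For two-dimensional inputs, the degree-2 kernel $(x \cdot z)^2$ gives the same number as an inner product in an explicit three-dimensional feature space, without ever building that space. The sketch below verifies this for one pair of hypothetical vectors; `phi` is the standard explicit map for this particular kernel.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))   # map to 3-D first, then take the inner product
implicit = poly2_kernel(x, z)    # never leaves 2-D

print("explicit feature map:", explicit)
print("kernel shortcut:     ", implicit)
```

For higher degrees and more features the explicit map grows very quickly, while the kernel still costs a single inner product in the original space.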
