# COMP3115: Exploratory Data Analysis and Visualization

# Lab 8: Support Vector Machine and KNN

## 1.1 Data Set

In this lab exercise, we use a dataset called "iris". It contains four input attributes (features) describing each iris flower: sepal length, sepal width, petal length, and petal width. It also contains a label specifying the species of each flower. Note that this dataset is commonly used by the machine learning community for benchmarking.

### Importing data from file

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('iris.csv')
print(df.head())
print('\n Different species: ',df.species.unique())

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

 Different species:  ['setosa' 'versicolor' 'virginica']


## 1.2 Support Vector Machine (SVM)

### Data preparation for k-fold cross validation

The following code adopts k-fold cross-validation for training and testing a support vector machine classifier. The data is divided into k folds; one fold is held out for testing and the remaining k-1 folds are used for training. This is repeated k times, so that each fold serves as the test set exactly once. If k=5, each round essentially splits the data set into 80% for training and 20% for testing. The k evaluation results can then be averaged to reduce the bias due to a particular data split.
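As a rough illustration of the split sizes described above, the sketch below runs scikit-learn's `KFold` on a placeholder array with 150 rows, the size of the iris data. The zero-filled array `X_demo` is an assumption for illustration, not the lab's data. Note also that when an integer such as `cv = 5` is passed to `cross_val_score` for a classifier, scikit-learn uses stratified folds by default, so the rows in each fold may differ from those of plain `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data with the same number of rows as the iris data (assumption for illustration)
X_demo = np.zeros((150, 4))

kf = KFold(n_splits=5)
split_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Each round trains on 4 folds and tests on the remaining fold
    split_sizes.append((len(train_idx), len(test_idx)))
    print('fold', fold, ': train =', len(train_idx), ', test =', len(test_idx))
```

With 150 rows and k=5, every round trains on 120 rows (80%) and tests on 30 rows (20%).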

In [2]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# Suppose we want to classify whether an iris is `versicolor` or not. Prepare the labels in the right format
# by labeling `versicolor` as +1 and the other two species (`setosa` and `virginica`) as -1.
X = df.drop(['species'], axis=1).values # drop the species column and keep only the remaining columns as input
y = np.ones(df.shape[0]) # set all to 1
y[df['species']!='versicolor'] = -1 # set every sample that is not versicolor to -1

# Construct SVM model - as kernel is not specified, the default is RBF
svm_classifier = SVC() #C-Support Vector Classification.

# Obtain accuracy based on 5-fold cross-validation
cv_accuracy = cross_val_score(svm_classifier, X, y, scoring='accuracy', cv = 5)
y_pred = cross_val_predict(svm_classifier, X, y, cv = 5)

np.set_printoptions(precision=3)

print('accuracy (per fold)= ', cv_accuracy)
print('accuracy (average)= ', round(cv_accuracy.mean(),3),'(',round(cv_accuracy.std(),3),')')

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print('\nConfusion Matrix:')
print('=================')
print('TN=',tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
print('Recall/Sensitivity= ',round(tp/(tp+fn),3))
print('Specificity= ', round(tn/(tn+fp),3))
print('Precision= ', round(tp/(tp+fp),3))

accuracy (per fold)=  [1.    1.    0.933 0.8   0.833]
accuracy (average)=  0.913 ( 0.083 )

Confusion Matrix:
TN= 88 FP= 12 FN= 1 TP= 49
Recall/Sensitivity=  0.98
Specificity=  0.88
Precision=  0.803
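The ratios printed above follow directly from the four confusion-matrix counts. As a small worked check, the sketch below recomputes them from the counts reported in the output above. The helper name `summarize` is hypothetical.

```python
def summarize(tn, fp, fn, tp):
    """Return recall, specificity, precision and accuracy from confusion-matrix counts."""
    return {
        'recall': tp / (tp + fn),            # share of actual positives that were found
        'specificity': tn / (tn + fp),       # share of actual negatives that were found
        'precision': tp / (tp + fp),         # share of predicted positives that are correct
        'accuracy': (tp + tn) / (tn + fp + fn + tp),
    }

metrics = summarize(tn=88, fp=12, fn=1, tp=49)
for name, value in metrics.items():
    print(name, '=', round(value, 3))
```

The pooled accuracy (137 correct out of 150) agrees with the reported average per-fold accuracy, which is expected when all folds have the same size.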


### SVMs with different parameter settings

In [3]:
# Construct SVM with a different parameter setting; C controls regularization and gamma is the kernel coefficient
svm_classifier = SVC(C = 100, gamma = 'auto')

# C: float, default=1.0. Regularization parameter.
# The strength of the regularization is inversely proportional to C.
# Must be strictly positive. The penalty is a squared l2 penalty.

# gamma: {'scale', 'auto'} or float, default='scale'. Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
# gamma controls how far the influence of a single training sample reaches: a large gamma gives a more
# flexible decision boundary but may introduce overfitting.
# If gamma='scale' (default) is passed, it uses 1 / (n_features * X.var()) as the value of gamma;
# if 'auto', it uses 1 / n_features;
# if float, it must be non-negative.

# Obtain accuracy based on 5-fold cross-validation
cv_accuracy = cross_val_score(svm_classifier, X, y, scoring='accuracy', cv = 5)
y_pred = cross_val_predict(svm_classifier, X, y, cv = 5)

print('accuracy (per fold)= ', cv_accuracy)
print('accuracy (average)= ', round(cv_accuracy.mean(),3),'(',round(cv_accuracy.std(),3),')')

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print('\nConfusion Matrix:')
print('=================')
print('TN=',tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
print('Recall/Sensitivity= ',round(tp/(tp+fn),3))
print('Specificity= ', round(tn/(tn+fp),3))
print('Precision= ', round(tp/(tp+fp),3))

accuracy (per fold)=  [1.    1.    0.867 0.9   0.933]
accuracy (average)=  0.94 ( 0.053 )

Confusion Matrix:
TN= 95 FP= 5 FN= 4 TP= 46
Recall/Sensitivity=  0.92
Specificity=  0.95
Precision=  0.902
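To make the `gamma` settings more concrete, here is a minimal sketch of how the two string options translate into numbers and how `gamma` enters the RBF kernel, k(x, z) = exp(-gamma * ||x - z||^2). It uses only the five rows shown by `df.head()` in Section 1.1 as a tiny stand-in (an assumption for illustration), so the `'scale'` value here is not the one obtained on the full dataset. The names `X_small` and `rbf_kernel_value` are hypothetical.

```python
import numpy as np

# The five rows printed by df.head(), used as a tiny stand-in for the full data
X_small = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])
n_features = X_small.shape[1]

gamma_auto = 1 / n_features                     # what gamma='auto' resolves to
gamma_scale = 1 / (n_features * X_small.var())  # what gamma='scale' resolves to

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = X_small[0], X_small[1]
print('gamma (auto)  =', gamma_auto)
print('gamma (scale) =', round(gamma_scale, 3))
print('k(x, z) with auto  =', round(rbf_kernel_value(x, z, gamma_auto), 3))
print('k(x, z) with scale =', round(rbf_kernel_value(x, z, gamma_scale), 3))
```

A larger `gamma` makes the kernel value fall off faster with distance, so each training sample influences a smaller neighborhood around it.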


In [4]:
# Construct SVM with another parameter setting - linear kernel and C=5
svm_classifier = SVC(C=5, kernel = 'linear')

# Obtain accuracy based on 5-fold cross-validation
cv_accuracy = cross_val_score(svm_classifier, X, y, scoring='accuracy', cv = 5)
y_pred = cross_val_predict(svm_classifier, X, y, cv = 5)

print('accuracy (per fold)= ', cv_accuracy)
print('accuracy (average)= ', round(cv_accuracy.mean(),3),'(',round(cv_accuracy.std(),3),')')

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print('\nConfusion Matrix:')
print('=================')
print('TN=',tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
print('Recall/Sensitivity= ',round(tp/(tp+fn),3))
print('Specificity= ', round(tn/(tn+fp),3))
print('Precision= ', round(tp/(tp+fp),3))

accuracy (per fold)=  [0.767 0.833 0.567 0.533 0.7  ]
accuracy (average)=  0.68 ( 0.115 )

Confusion Matrix:
TN= 79 FP= 21 FN= 27 TP= 23
Recall/Sensitivity=  0.46
Specificity=  0.79
Precision=  0.523
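Rather than trying parameter settings one at a time, a grid search can compare several combinations under the same cross-validation. The sketch below is one possible way to do this with scikit-learn's `GridSearchCV`. It loads the copy of the iris data bundled with scikit-learn (assumed to be equivalent to `iris.csv`) so that it runs on its own, and the grid values are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Bundled iris data; in load_iris, class 1 is versicolor
iris = load_iris()
X_demo = iris.data
y_demo = np.where(iris.target == 1, 1, -1)  # versicolor = +1, others = -1

# Candidate settings (arbitrary illustrative values)
param_grid = [
    {'kernel': ['rbf'], 'C': [1, 10, 100], 'gamma': ['scale', 'auto']},
    {'kernel': ['linear'], 'C': [1, 5]},
]

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=5)
search.fit(X_demo, y_demo)

print('best parameters:', search.best_params_)
print('best mean accuracy:', round(search.best_score_, 3))
```

Because the best setting is selected on the same folds that produce `best_score_`, that score tends to be somewhat optimistic; a separate held-out set would give a fairer estimate.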


## 1.3 K-Nearest Neighbors (KNN)

KNN determines the class of a data item from the classes of its k nearest neighbors in the dataset, instead of learning a parametric model. It is therefore considered a non-parametric classification method.
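To illustrate the idea, here is a minimal from-scratch sketch of the KNN decision rule, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are made up for illustration; the lab code itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                          # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())                   # count labels among the neighbors
    return votes.most_common(1)[0][0]

# Toy 2-D data: two small clusters labeled -1 and +1 (made-up values)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.15, 0.15]))
pred_b = knn_predict(X_train, y_train, np.array([0.95, 1.0]))
print(pred_a, pred_b)
```

No model parameters are fitted here: the training data itself is kept and consulted at prediction time.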

In [5]:
from sklearn.neighbors import KNeighborsClassifier

# Construct KNN model
KNN_classifier = KNeighborsClassifier(n_neighbors=3)

# Obtain accuracy based on 5-fold cross-validation
cv_accuracy = cross_val_score(KNN_classifier, X, y, scoring='accuracy', cv = 5)

# Perform 5-fold cross-validation and put the prediction results in y_pred
y_pred = cross_val_predict(KNN_classifier, X, y, cv = 5)

print('accuracy (per fold)= ', cv_accuracy)
print('accuracy (average)= ', round(cv_accuracy.mean(),3),'(',round(cv_accuracy.std(),3),')')

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print('\nConfusion Matrix:')
print('=================')
print('TN=',tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
print('Recall/Sensitivity= ',round(tp/(tp+fn),3))
print('Specificity= ', round(tn/(tn+fp),3))
print('Precision= ', round(tp/(tp+fp),3))

accuracy (per fold)=  [1.    1.    0.9   0.933 0.967]
accuracy (average)=  0.96 ( 0.039 )

Confusion Matrix:
TN= 97 FP= 3 FN= 3 TP= 47
Recall/Sensitivity=  0.94
Specificity=  0.97
Precision=  0.94


In [6]:
from sklearn import decomposition

# Apply principal component analysis for dimension reduction first
pca = decomposition.PCA(n_components=2)
pca.fit(X)
XX = pca.transform(X)

# Construct KNN model
KNN1_classifier = KNeighborsClassifier(n_neighbors=3)

# Perform 5-fold cross-validation and put the prediction results in y_pred
y_pred = cross_val_predict(KNN1_classifier, XX, y, cv = 5)

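# Note: cv_accuracy is not recomputed in this cell, so the two lines below repeat the accuracy from In [5]
# (computed on the original features X). To score the PCA features instead, first run
# cv_accuracy = cross_val_score(KNN1_classifier, XX, y, scoring='accuracy', cv = 5)
print('accuracy (per fold)= ', cv_accuracy)
print('accuracy (average)= ', round(cv_accuracy.mean(),3),'(',round(cv_accuracy.std(),3),')')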

tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
print('\nConfusion Matrix:')
print('=================')
print('TN=',tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
print('Recall/Sensitivity= ',round(tp/(tp+fn),3))
print('Specificity= ', round(tn/(tn+fp),3))
print('Precision= ', round(tp/(tp+fp),3))

accuracy (per fold)=  [1.    1.    0.9   0.933 0.967]
accuracy (average)=  0.96 ( 0.039 )

Confusion Matrix:
TN= 98 FP= 2 FN= 3 TP= 47
Recall/Sensitivity=  0.94
Specificity=  0.98
Precision=  0.959
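In the cell above, PCA is fitted on the whole dataset before cross-validation, so each test fold has already influenced the projection. One way to avoid this, sketched below under the assumption that scikit-learn's bundled iris data matches `iris.csv`, is to wrap PCA and KNN in a pipeline so that PCA is refitted on the training folds only. The per-fold accuracy is also computed on the PCA features here. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Bundled iris data; in load_iris, class 1 is versicolor
iris = load_iris()
X_demo = iris.data
y_demo = np.where(iris.target == 1, 1, -1)  # versicolor = +1, others = -1

# PCA and KNN are fitted together inside each training fold
pca_knn = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3))

cv_accuracy_pca = cross_val_score(pca_knn, X_demo, y_demo, scoring='accuracy', cv=5)
y_pred_pca = cross_val_predict(pca_knn, X_demo, y_demo, cv=5)

print('accuracy (per fold)= ', cv_accuracy_pca)
print('accuracy (average)= ', round(cv_accuracy_pca.mean(), 3))

tn, fp, fn, tp = confusion_matrix(y_demo, y_pred_pca).ravel()
print('TN=', tn, 'FP=', fp, 'FN=', fn, 'TP=', tp)
```

On a dataset this small the difference from fitting PCA up front may be minor, but the pipeline keeps the evaluation free of information from the test folds.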
