

# Support Vector Classifier 

Many libraries and frameworks implement the SVM (Support Vector Machine) algorithm without requiring much code. Here, however, the code is written with as few high-level libraries as possible so that you and I can get a good grasp of the important components involved in training an SVM model (one that reaches roughly 99% accuracy and 0.98 recall on the test set used below). If you are looking for a quick implementation of SVM, you are better off using a package such as scikit-learn or cvxopt.


In [7]:
import numpy as np
import pandas as pd
import itertools
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.utils import shuffle  # used by SVM_classifier._sgd
np.random.seed(1234)

## Fitting the data
First, we implement the `fit` function, which learns the model parameters. The soft-margin SVM solves the following convex optimization problem:

$$\begin{array}{l}
\min _{\beta, \beta_{0}} \frac{1}{2}\|\beta\|^{2}+C \sum_{i=1}^{N} \xi_{i} \\
\text { subject to } \xi_{i} \geq 0, y_{i}\left(x_{i}^{T} \beta+\beta_{0}\right) \geq 1-\xi_{i} \forall i,
\end{array}$$

During training, a larger $C$ penalizes margin violations more heavily and yields a narrower margin (as $C \to \infty$ the SVM becomes a hard-margin classifier), while a smaller $C$ yields a wider margin.
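
The step from this constrained problem to an unconstrained one is short, and the following sketch spells it out. For fixed $(\beta, \beta_0)$, each slack variable appears only in its own two constraints, so the smallest feasible value is

$$\xi_{i}^{*} = \max\left(0,\; 1 - y_{i}\left(x_{i}^{T} \beta+\beta_{0}\right)\right)$$

Because the objective increases with every $\xi_i$, the minimum is attained at $\xi_i = \xi_i^{*}$. Substituting this back removes the constraints:

$$\min _{\beta, \beta_{0}} \frac{1}{2}\|\beta\|^{2}+C \sum_{i=1}^{N} \max\left(0,\; 1 - y_{i}\left(x_{i}^{T} \beta+\beta_{0}\right)\right)$$

The summand is the hinge loss. Averaging the hinge terms instead of summing them only rescales $C$ by a factor of $1/N$.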

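We minimize the following objective function. In the code, the intercept $\beta_0$ is folded into the weight vector `W` by appending a constant feature of 1 to every sample, and `regularization_strength` plays the role of $C$: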

$$L(\beta) =\frac{1}{2}\|\beta\|^{2} + \frac{C}{N} \sum_{i=1}^{N} \operatorname{max}(0, 1- y_{i}\left(x_{i}^{T} \beta+\beta_{0}\right) )$$
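
As a quick sanity check of this formula, here is a minimal sketch that evaluates the objective on a tiny made-up dataset. The function name `hinge_objective` and all the toy values are assumptions chosen for illustration; they are not part of the notebook's classifier.

```python
import numpy as np

def hinge_objective(beta, beta0, X, y, C):
    """0.5 * ||beta||^2 + (C / N) * sum of hinge losses."""
    margins = y * (X @ beta + beta0)          # y_i * (x_i^T beta + beta0)
    hinge = np.maximum(0.0, 1.0 - margins)    # max(0, 1 - margin) per sample
    return 0.5 * beta @ beta + C * hinge.mean()

# Toy data, made up for illustration only
X_toy = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])
beta_toy = np.array([1.0, 0.0])
beta0_toy = 0.0

# Margins are [2, 0, 2, -0.5], so the hinge terms are [0, 1, 0, 1.5]
loss_small_C = hinge_objective(beta_toy, beta0_toy, X_toy, y_toy, C=1.0)
loss_large_C = hinge_objective(beta_toy, beta0_toy, X_toy, y_toy, C=100.0)
print(loss_small_C, loss_large_C)
```

With the same weights, a larger `C` makes the two margin violations dominate the cost, which is why the optimizer then accepts a larger $\|\beta\|$ (a narrower margin) to reduce them.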

In [8]:
# hyper-parameters (module-level globals read by the class below)
regularization_strength = 10000
learning_rate = 0.000001
class SVM_classifier:
    
    def __init__(self):
        return
    
    def fit(self, x, y):
        self.W = self._sgd(x.to_numpy(), y.to_numpy())
        print("training finished.")
        print("weights are: {}".format(self.W))
        return self
    def _compute_cost(self, W, X, Y):
        # calculate hinge loss
        N = X.shape[0]
        distances = 1 - Y * (np.dot(X, W))
        distances[distances < 0] = 0  # equivalent to max(0, distance)
        hinge_loss = regularization_strength * (np.sum(distances) / N)

        # calculate cost
        cost = 1 / 2 * np.dot(W, W) + hinge_loss
        return cost
    def _calculate_cost_gradient(self, W, X_batch, Y_batch):
        # if only one example is passed (eg. in case of SGD)
        if np.isscalar(Y_batch):
            Y_batch = np.array([Y_batch])
            X_batch = np.array([X_batch])  # gives multidimensional array

        distance = 1 - (Y_batch * np.dot(X_batch, W))
        dw = np.zeros(len(W))

        for ind, d in enumerate(distance):
            if max(0, d) == 0:
                di = W
            else:
                di = W - (regularization_strength * Y_batch[ind] * X_batch[ind])
            dw += di

        dw = dw/len(Y_batch)  # average
        return dw
    def _sgd(self, features, outputs):
        """features and outputs need to be numpy array"""
        max_epochs = 5000
        weights = np.zeros(features.shape[1])
        nth = 0
        prev_cost = float("inf")
        cost_threshold = 0.01  # relative change in cost (0.01 = 1%)
        # stochastic gradient descent
        for epoch in range(1, max_epochs):
            # shuffle to prevent repeating update cycles
            X, Y = shuffle(features, outputs)
            for ind, x in enumerate(X):
                ascent = self._calculate_cost_gradient(weights, x, Y[ind])
                weights = weights - (learning_rate * ascent)

            # convergence check on 2^nth epoch
            if epoch == 2 ** nth or epoch == max_epochs - 1:
                cost = self._compute_cost(weights, features, outputs)
                print("Epoch is: {} and Cost is: {}".format(epoch, cost))
                # stoppage criterion
                if abs(prev_cost - cost) < cost_threshold * prev_cost:
                    return weights
                prev_cost = cost
                nth += 1
        return weights
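
The per-sample rule in `_calculate_cost_gradient` (use `W` when the margin is satisfied, `W - C * y_i * x_i` otherwise, then average) is a subgradient of the cost computed in `_compute_cost`. The sketch below, which assumes standalone helper names (`cost`, `subgradient`) and random toy data that are not part of the notebook, writes the same rule in vectorized form and compares it with a central finite-difference estimate.

```python
import numpy as np

def cost(W, X, Y, C):
    """Same cost as _compute_cost: 0.5 * ||W||^2 + C * mean hinge loss."""
    hinge = np.maximum(0.0, 1.0 - Y * (X @ W))
    return 0.5 * W @ W + C * hinge.mean()

def subgradient(W, X, Y, C):
    """Vectorized form of the averaged per-sample rule."""
    active = (1.0 - Y * (X @ W)) > 0          # samples that violate the margin
    return W - C * ((active * Y) @ X) / len(Y)

# Random toy problem (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
Y_demo = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
W_demo = rng.normal(size=3)
C_demo = 10.0

analytic = subgradient(W_demo, X_demo, Y_demo, C_demo)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(W_demo)
for j in range(len(W_demo)):
    step = np.zeros_like(W_demo)
    step[j] = eps
    numeric[j] = (cost(W_demo + step, X_demo, Y_demo, C_demo)
                  - cost(W_demo - step, X_demo, Y_demo, C_demo)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The two agree wherever the cost is differentiable. Exactly on a kink (a sample with margin equal to 1) the hinge loss has no unique gradient, and the rule above simply picks one valid subgradient.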

## Prediction for new data
So far we have learned the model parameters. Predicting the class of new data with the trained model does not update the parameters, so we add a separate method, `predict(...)`, to the class.


In [9]:
def predict(self, x):
    y_predicted = np.array([])
    for i in range(x.shape[0]):
        yp = np.sign(np.dot(x.to_numpy()[i], self.W))
        y_predicted = np.append(y_predicted, yp)
    return y_predicted

SVM_classifier.predict = predict
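
The row-by-row loop in `predict` can also be written as a single matrix-vector product. The following is a small sketch of that equivalence; `predict_vectorized` and the demo arrays are hypothetical names and values used only for illustration.

```python
import numpy as np

def predict_vectorized(X, W):
    """Return the sign of X @ W for every row of a 2-D NumPy array X."""
    return np.sign(X @ W)

# Illustrative weights; the last entry plays the role of the intercept
W_demo = np.array([1.0, -2.0, 0.5])
X_demo = np.array([[3.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0],
                   [1.0, 0.0, 1.0]])

# Row-by-row version, mirroring the loop in predict
loop_result = np.array([np.sign(np.dot(row, W_demo)) for row in X_demo])
vec_result = predict_vectorized(X_demo, W_demo)
print(loop_result, vec_result)
```

Note that `np.sign` returns 0 for a score of exactly 0, so a sample lying exactly on the decision boundary is assigned to neither class by this rule.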

## Experiment
Next, we work with the breast cancer dataset available on [Kaggle](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data). All features that survive the feature-selection step below are used for training.

In [10]:
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.utils import shuffle


In [11]:

# >> FEATURE SELECTION << #
def remove_correlated_features(X):
    corr_threshold = 0.9
    corr = X.corr()
    drop_columns = np.full(corr.shape[0], False, dtype=bool)
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[0]):
            if corr.iloc[i, j] >= corr_threshold:
                drop_columns[j] = True
    columns_dropped = X.columns[drop_columns]
    X.drop(columns_dropped, axis=1, inplace=True)
    return columns_dropped


def remove_less_significant_features(X, Y):
    sl = 0.05
    regression_ols = None
    columns_dropped = np.array([])
    for itr in range(0, len(X.columns)):
        regression_ols = sm.OLS(Y, X).fit()
        max_col = regression_ols.pvalues.idxmax()
        max_val = regression_ols.pvalues.max()
        if max_val > sl:
            X.drop(max_col, axis='columns', inplace=True)
            columns_dropped = np.append(columns_dropped, [max_col])
        else:
            break
    regression_ols.summary()
    return columns_dropped

In [12]:
print("reading dataset...")
# read data in pandas (pd) data frame
data = pd.read_csv('./data.csv')

# drop last column (extra column added by pd)
# and unnecessary first column (id)
data.drop(data.columns[[-1, 0]], axis=1, inplace=True)

print("applying feature engineering...")
# convert categorical labels to numbers
diag_map = {'M': 1.0, 'B': -1.0}
data['diagnosis'] = data['diagnosis'].map(diag_map)

# put features & outputs in different data frames
Y = data.loc[:, 'diagnosis']
X = data.iloc[:, 1:].copy()  # copy so the in-place drops below do not touch `data`

# filter features
remove_correlated_features(X)
remove_less_significant_features(X, Y)

# normalize data for better convergence and to prevent overflow
X_normalized = MinMaxScaler().fit_transform(X.values)
X = pd.DataFrame(X_normalized)

# insert 1 in every row for intercept b
X.insert(loc=len(X.columns), column='intercept', value=1)

# split data into train and test set
print("splitting dataset into train and test sets...")
X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2, random_state=42)

reading dataset...
applying feature engineering...
splitting dataset into train and test sets...
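
Two of the preprocessing steps above are easy to reproduce by hand, which may help to see what they do. The sketch below uses a made-up 3x2 array: it applies the min-max formula `(x - min) / (max - min)` per column (what `MinMaxScaler` computes with its default range of 0 to 1) and shows that appending a column of ones lets the last weight act as the intercept. All names and values here are illustrative assumptions.

```python
import numpy as np

# Toy feature matrix (illustrative values only)
X_raw = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [3.0, 20.0]])

# Min-max scaling: map every column to the range [0, 1]
col_min = X_raw.min(axis=0)
col_max = X_raw.max(axis=0)
X_scaled = (X_raw - col_min) / (col_max - col_min)

# Append a constant feature so the intercept becomes one more weight
X_aug = np.hstack([X_scaled, np.ones((X_scaled.shape[0], 1))])

beta = np.array([2.0, -1.0])    # hypothetical weights
beta0 = 0.5                     # hypothetical intercept
W = np.append(beta, beta0)      # intercept folded in as the last weight

scores_explicit = X_scaled @ beta + beta0
scores_folded = X_aug @ W
print(X_scaled)
print(scores_explicit, scores_folded)
```

In this sketch, as in the cell above, the minimum and maximum are computed over all rows. A common variation is to compute them on the training split only and reuse them for the test split.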


In [13]:
model = SVM_classifier()
model.fit(X_train, y_train)

Epoch is: 1 and Cost is: 7226.631781243718
Epoch is: 2 and Cost is: 6718.745279355554
Epoch is: 4 and Cost is: 5543.138445903449
Epoch is: 8 and Cost is: 3850.3110958451207
Epoch is: 16 and Cost is: 2630.836909563661
Epoch is: 32 and Cost is: 1959.463652259652
Epoch is: 64 and Cost is: 1593.6849558404429
Epoch is: 128 and Cost is: 1324.782791125777
Epoch is: 256 and Cost is: 1158.6017735548542
Epoch is: 512 and Cost is: 1080.3310531551897
Epoch is: 1024 and Cost is: 1049.5975862744087
Epoch is: 2048 and Cost is: 1041.6398680490345
training finished.
weights are: [ 3.55394667 11.03426046 -2.30419309 -7.89882843 10.15727646 -1.27468922
 -6.4398698   2.25059307 -3.88065271  3.22811997  4.94187004  4.8266097
 -4.70381503]


<__main__.SVM_classifier at 0x7f13f4550e90>

In [14]:
# prediction 
y_test_predicted = model.predict(X_test)

In [15]:
print("accuracy on test dataset: {}".format(accuracy_score(y_test, y_test_predicted)))
print("recall on test dataset: {}".format(recall_score(y_test, y_test_predicted)))
print("precision on test dataset: {}".format(precision_score(y_test, y_test_predicted)))


accuracy on test dataset: 0.9912280701754386
recall on test dataset: 0.9767441860465116
precision on test dataset: 1.0
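
All three metrics follow from the counts of true positives (TP), false positives (FP), and false negatives (FN): recall is TP / (TP + FN) and precision is TP / (TP + FP). Here is a minimal sketch with made-up labels; `binary_metrics` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1.0):
    """Accuracy, recall, and precision for labels in {+1, -1}.

    Assumes at least one actual and one predicted positive,
    so neither denominator is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = np.mean(y_true == y_pred)
    recall = tp / (tp + fn)        # share of actual positives that were found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    return accuracy, recall, precision

# Made-up labels: 3 actual positives, 4 predicted positives
y_true_demo = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred_demo = [1, 1, -1, -1, -1, -1, 1, 1]

acc_demo, rec_demo, prec_demo = binary_metrics(y_true_demo, y_pred_demo)
print(acc_demo, rec_demo, prec_demo)
```

Here TP = 2, FN = 1, and FP = 2, so recall and precision differ. They coincide only when the number of false negatives equals the number of false positives.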


# Bonus Question: 
Why do we remove one of the correlated features? There are multiple reasons, but the simplest is that highly correlated features carry almost the same information about the dependent variable. Keeping both rarely improves the model and may well worsen it, so we are better off keeping only one of them. Fewer features also mean faster training and a simpler model.
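
To make this concrete, here is a small sketch on synthetic data. It builds a column, a nearly identical copy of it, and an unrelated column, then applies the same pairwise threshold rule as `remove_correlated_features`. The column names, the noise level, and the seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + 0.01 * rng.normal(size=200),  # nearly identical to "a"
    "b": rng.normal(size=200),                  # generated independently of "a"
})

corr = toy.corr()
threshold = 0.9

# For each pair (i, j) with i < j, mark column j when the correlation is too high
to_drop = [corr.columns[j]
           for i in range(corr.shape[0])
           for j in range(i + 1, corr.shape[0])
           if corr.iloc[i, j] >= threshold]

reduced = toy.drop(columns=to_drop)
print(corr.round(3))
print(list(reduced.columns))
```

The near-duplicate column adds almost no new information, so dropping it leaves the data the model sees essentially unchanged while removing one weight to learn.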

