## Hard Margin Support Vector Machine

Like the perceptron, the hard-margin support vector machine (SVM) seeks a hyperplane that separates the data into different classes. When the training data is linearly separable, there are generally infinitely many hyperplanes that classify the data correctly. The hard-margin SVM selects the one that maximizes the margin, which makes the optimal separating hyperplane unique. It can be formulated as a convex quadratic programming problem:

$$
\begin{aligned}
&\min _{w, b} \frac{1}{2}\|w\|^{2} \\
&\text { s.t. } \enspace y_{i}\left(w \cdot x_{i}+b\right)-1 \geq 0, i=1,2, \cdots, N
\end{aligned}
$$
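
As a minimal sketch of what the constraints and the objective mean, the snippet below checks whether a candidate hyperplane satisfies $y_{i}(w \cdot x_{i}+b) \geq 1$ for every sample and reports the margin width $2/\|w\|$. The two-point dataset and the helper name `check_hyperplane` are assumptions made for illustration only.

```python
import numpy as np

def check_hyperplane(w, b, X, y):
    """Return (feasible, margin) for the candidate hyperplane w.x + b = 0."""
    functional = y * (X @ w + b)              # y_i (w . x_i + b) for every sample
    feasible = bool(np.all(functional >= 1 - 1e-12))
    margin = 2.0 / np.linalg.norm(w)          # distance between the two margin planes
    return feasible, margin

# Toy data (illustrative assumption): one point per class
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0])

# Both candidates are feasible, but the one with the smaller norm has the wider margin
feasible_a, margin_a = check_hyperplane(np.array([1.0, 0.0]), -1.0, X, y)
feasible_b, margin_b = check_hyperplane(np.array([2.0, 0.0]), -2.0, X, y)
print(feasible_a, margin_a)
print(feasible_b, margin_b)
```

Minimizing $\frac{1}{2}\|w\|^{2}$ over the feasible candidates therefore picks the hyperplane with the widest margin.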

In principle, this convex quadratic program can be solved directly, but it is often more convenient to solve its Lagrangian dual instead. To introduce Lagrangian duality, consider the general form of a constrained optimization problem (the original, or primal, problem):
$$
\begin{aligned}
&\min _{x \in R^{n}} f(x)\\
&\text { s.t. } \enspace c_{i}(x) \leq 0, \quad i=1,2, \cdots, k\\
&\phantom{\text { s.t. }} \enspace h_{j}(x)=0, \quad j=1,2, \cdots, l
\end{aligned}
$$

Introduce the generalized Lagrangian function, where $\alpha_{i} \geq 0$ and $\beta_{j}$ are the Lagrange multipliers:
$$
L(x, \alpha, \beta)=f(x)+\sum_{i=1}^{k} \alpha_{i} c_{i}(x)+\sum_{j=1}^{l} \beta_{j} h_{j}(x)
$$

Define the following function of $x$ by maximizing the Lagrangian over the multipliers:
$$
\theta_{p}(x)=\max _{\alpha, \beta, \alpha_{i} \geq 0}\left(f(x)+\sum_{i=1}^{k} \alpha_{i} c_{i}(x)+\sum_{j=1}^{l} \beta_{j} h_{j}(x)\right)
$$

The original problem is then equivalent to the min-max problem of the Lagrangian function:
$$
\min _{x} \max _{\alpha, \beta, \alpha_{i} \geq 0}\left(f(x)+\sum_{i=1}^{k} \alpha_{i} c_{i}(x)+\sum_{j=1}^{l} \beta_{j} h_{j}(x)\right)=\min _{x} \theta_{p}(x)
$$
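
The equivalence can be made explicit with a short worked step. If some constraint is violated, say $c_{i}(x)>0$ or $h_{j}(x) \neq 0$, the corresponding multiplier can be pushed so that the Lagrangian grows without bound. If $x$ is feasible, the best choice is $\alpha_{i} c_{i}(x)=0$ for every $i$, and the equality terms vanish. Hence:

$$
\theta_{p}(x)= \begin{cases}f(x), & x \text { satisfies all constraints } \\ +\infty, & \text { otherwise }\end{cases}
$$

Minimizing $\theta_{p}(x)$ over $x$ is therefore the same as minimizing $f(x)$ over the feasible set.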

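For the hard-margin SVM, the variables are $w$ and $b$, there are no equality constraints, and the inequality constraints are $1-y_{i}\left(w \cdot x_{i}+b\right) \leq 0$. With multipliers $\alpha_{i} \geq 0$, the Lagrangian is:
$$
L(w, b, \alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i=1}^{N} \alpha_{i} y_{i}\left(w \cdot x_{i}+b\right)+\sum_{i=1}^{N} \alpha_{i}
$$

According to Lagrangian duality, the dual problem swaps the order of the two operations and becomes a max-min problem: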
$$
\max _{\alpha} \min _{w, b}L(w, b, \alpha)
$$

To solve this dual problem, first minimize $L(w, b, \alpha)$ over $w$ and $b$, then maximize the result over $\alpha$. Setting the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to zero gives:
$$
\begin{aligned}
&\nabla_{w} L(w, b, \alpha)=w-\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i}=0 \\
&\nabla_{b} L(w, b, \alpha)=-\sum_{i=1}^{N} \alpha_{i} y_{i}=0
\end{aligned}
$$

We get:
$$
\begin{aligned}
w &=\sum_{i=1}^{N} \alpha_{i} y_{i} x_{i} \\
\sum_{i=1}^{N} \alpha_{i} y_{i} &=0
\end{aligned}
$$
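
The stationarity conditions can be sanity-checked numerically. The sketch below assumes the standard hard-margin Lagrangian $L(w, b, \alpha)=\frac{1}{2}\|w\|^{2}-\sum_{i} \alpha_{i}\left(y_{i}(w \cdot x_{i}+b)-1\right)$ and compares its analytic gradients with central finite differences. The sample values of `w`, `b`, and `alpha` are arbitrary and chosen only for illustration.

```python
import numpy as np

def lagrangian(w, b, alpha, X, y):
    # L(w, b, alpha) = 1/2 ||w||^2 - sum_i alpha_i * (y_i (w . x_i + b) - 1)
    return 0.5 * w @ w - alpha @ (y * (X @ w + b) - 1.0)

def analytic_gradients(w, alpha, X, y):
    grad_w = w - (alpha * y) @ X      # w - sum_i alpha_i y_i x_i
    grad_b = -(alpha @ y)             # -sum_i alpha_i y_i
    return grad_w, grad_b

def numeric_gradients(w, b, alpha, X, y, eps=1e-6):
    # Central finite differences, one coordinate of w at a time
    grad_w = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = eps
        grad_w[k] = (lagrangian(w + e, b, alpha, X, y)
                     - lagrangian(w - e, b, alpha, X, y)) / (2 * eps)
    grad_b = (lagrangian(w, b + eps, alpha, X, y)
              - lagrangian(w, b - eps, alpha, X, y)) / (2 * eps)
    return grad_w, grad_b

# Arbitrary illustrative values (not a solution of the SVM problem)
X = np.array([[1.0, 7.0], [2.0, 8.0], [5.0, 1.0], [6.0, -1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.3, 0.1, 0.2, 0.4])
w = np.array([0.5, -0.25])
b = 0.7

gw_a, gb_a = analytic_gradients(w, alpha, X, y)
gw_n, gb_n = numeric_gradients(w, b, alpha, X, y)
print(np.allclose(gw_a, gw_n, atol=1e-6), abs(gb_a - gb_n) < 1e-6)
```

Both gradients vanish exactly when $w=\sum_{i} \alpha_{i} y_{i} x_{i}$ and $\sum_{i} \alpha_{i} y_{i}=0$.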
Substitute these results into $L(w, b, \alpha)$. The term containing $b$ vanishes because $\sum_{i=1}^{N} \alpha_{i} y_{i}=0$, which gives:
$$
\begin{aligned}
L(w, b, \alpha) &=\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)-\sum_{i=1}^{N} \alpha_{i} y_{i}\left(\left(\sum_{j=1}^{N} \alpha_{j} y_{j} x_{j}\right) \cdot x_{i}+b\right)+\sum_{i=1}^{N} \alpha_{i} \\
&=-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)+\sum_{i=1}^{N} \alpha_{i}
\end{aligned}
$$
That is:
$$
\min _{w, b}L(w, b, \alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)+\sum_{i=1}^{N} \alpha_{i}
$$

Maximizing $\min _{w, b} L(w, b, \alpha)$ over $\alpha$ then gives the dual problem:
$$
\begin{aligned}
&\max _{\alpha} -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j}\left(x_{i} \cdot x_{j}\right)+\sum_{i=1}^{N} \alpha_{i} \\
&\text { s.t. } \sum_{i=1}^{N} \alpha_{i} y_{i}=0, \enspace \alpha_{i} \geqslant 0, \enspace i=1,2, \cdots, N
\end{aligned}
$$
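
To make the dual objective concrete, the following sketch evaluates it with a Gram matrix on a two-point toy problem. For exactly two points, one per class, the constraint $\sum_{i} \alpha_{i} y_{i}=0$ forces $\alpha_{1}=\alpha_{2}$, and maximizing the resulting one-dimensional quadratic gives $\alpha=2 /\|x_{+}-x_{-}\|^{2}$. That closed form, the toy data, and the function name `dual_objective` are assumptions introduced here for illustration.

```python
import numpy as np

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    gram = X @ X.T                    # matrix of inner products x_i . x_j
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ gram @ ay

# Toy data (illustrative assumption): one point per class
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0])

# Closed-form maximizer for the two-point case
d2 = np.sum((X[1] - X[0]) ** 2)
alpha_star = np.full(2, 2.0 / d2)

w_star = (alpha_star * y) @ X         # w = sum_i alpha_i y_i x_i
primal_value = 0.5 * w_star @ w_star
dual_value = dual_objective(alpha_star, X, y)

# Nearby feasible multipliers (still alpha_1 == alpha_2) give a smaller dual value
lower = dual_objective(np.full(2, 0.4), X, y)
upper = dual_objective(np.full(2, 0.6), X, y)
print(primal_value, dual_value, lower, upper)
```

In this example the dual optimum equals the primal optimum, which is the strong duality that justifies solving the dual in place of the primal.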

According to the Karush-Kuhn-Tucker (KKT) conditions, the optimal solution of the general problem satisfies:
$$
\begin{gathered}
\nabla_{x} L\left(x^{*}, \alpha^{*}, \beta^{*}\right)=0 \\
\alpha_{i}^{*} c_{i}\left(x^{*}\right)=0, \enspace i=1,2, \cdots, k \\
c_{i}\left(x^{*}\right) \leq 0, \enspace i=1,2, \cdots, k \\
\alpha_{i}^{*} \geq 0, \enspace i=1,2, \cdots, k \\
h_{j}\left(x^{*}\right)=0, \enspace j=1,2, \cdots, l
\end{gathered}
$$
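
For the hard-margin SVM, these conditions specialize to stationarity in $w$ and $b$, feasibility of the margin constraints, non-negative multipliers, and complementary slackness $\alpha_{i}\left(y_{i}(w \cdot x_{i}+b)-1\right)=0$. The sketch below checks each of them for a candidate solution. The function name `kkt_report`, the tolerance, and the three-point dataset are hypothetical choices for illustration.

```python
import numpy as np

def kkt_report(alpha, w, b, X, y, tol=1e-9):
    """Check the hard-margin SVM KKT conditions for a candidate (alpha, w, b)."""
    slack = y * (X @ w + b) - 1.0     # must be >= 0 for every sample
    return {
        "stationarity_w": bool(np.allclose(w, (alpha * y) @ X, atol=tol)),
        "stationarity_b": bool(abs(alpha @ y) <= tol),
        "primal_feasible": bool(np.all(slack >= -tol)),
        "dual_feasible": bool(np.all(alpha >= -tol)),
        "complementary_slackness": bool(np.all(np.abs(alpha * slack) <= tol)),
    }

# Toy data (illustrative assumption): the third point lies well outside the margin
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w = np.array([1.0, 0.0])
b = -1.0

good = kkt_report(np.array([0.5, 0.5, 0.0]), w, b, X, y)
bad = kkt_report(np.array([0.5, 0.5, 0.1]), w, b, X, y)
print(good)
print(bad)
```

In the first call the third point has $\alpha_{3}=0$ because its constraint is inactive. Giving it a positive multiplier, as in the second call, breaks stationarity and complementary slackness.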

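Finally, the solution of the original problem can be recovered from the dual solution $\alpha^{*}$. Choose any index $j$ with $\alpha_{j}^{*}>0$ (a support vector, for which $y_{j}\left(w^{*} \cdot x_{j}+b^{*}\right)=1$ by complementary slackness); then: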
$$
\begin{aligned}
w^{*} &=\sum_{i=1}^{N} \alpha_{i}^{*} y_{i} x_{i} \\
b^{*} &=y_{j}-\sum_{i=1}^{N} \alpha_{i}^{*} y_{i}\left(x_{i} \cdot x_{j}\right)
\end{aligned}
$$
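
The recovery step can be sketched in a few lines. The helper name `recover_primal`, the tolerance used to decide which multipliers count as positive, and the toy multipliers are assumptions for illustration. The multipliers are taken as given rather than computed by a solver.

```python
import numpy as np

def recover_primal(alpha, X, y, tol=1e-9):
    """Recover (w, b) and the support-vector indices from dual multipliers."""
    w = (alpha * y) @ X                       # w* = sum_i alpha_i* y_i x_i
    support = np.flatnonzero(alpha > tol)     # indices with alpha_i* > 0
    j = support[0]                            # any support vector works
    b = y[j] - w @ X[j]                       # b* = y_j - sum_i alpha_i* y_i (x_i . x_j)
    return w, b, support

# Toy data and multipliers (illustrative assumption)
X = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 0.0])

w, b, support = recover_primal(alpha, X, y)
print(w, b, support)
```

Only the support vectors contribute to $w^{*}$, so samples with $\alpha_{i}^{*}=0$ could be removed without changing the solution.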

In [1]:
import numpy as np
import matplotlib.pyplot as plt

data_dict = {-1:np.array([[1,7], [2,8], [3,8],]), 1:np.array([[5,1], [6,-1], [7,3],])}

colors = {1:'r',-1:'g'}
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
# plot the samples of each class in that class's color
for label, points in data_dict.items():
    ax.scatter(points[:, 0], points[:, 1], s=100, color=colors[label])

[Figure: scatter plot of the six training points, class 1 in red and class -1 in green]

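The implementation below does not solve the dual problem. Instead, it runs a coarse-to-fine grid search directly on the primal problem: it scans candidate values of $w$ and $b$, keeps those that satisfy $y_{i}(w \cdot x_{i}+b) \geq 1$ for every sample, and selects the candidate with the smallest $\|w\|$. Because it only tries vectors $w$ whose two components have equal magnitude, the result is an approximation of the optimal hyperplane.

In [3]: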
def train(data):
    # parameter dictionary
    opt_dict = {}
    
    # sign patterns applied to w so that every sign combination of its components is tried
    transforms = [[1,1], [-1,1], [-1,-1], [1,-1]]
    
    # obtain data from the dictionary
    all_data = []
    for yi in data:
        for featureset in data[yi]: 
            for feature in featureset:
                all_data.append(feature)
    
    # obtain the max and min value of data
    max_feature_value = max(all_data) 
    min_feature_value = min(all_data) 
    all_data = None
    
    # define a list of step sizes
    step_sizes = [max_feature_value * 0.1, max_feature_value * 0.01, max_feature_value * 0.001 ]
    
    # set up the range of parameter b
    b_range_multiple = 2
    b_multiple = 5
    latest_optimum = max_feature_value * 10

    # optimization based on different step size training
    for step in step_sizes:
        w = np.array([latest_optimum, latest_optimum])
        # the problem is convex, so shrink w step by step until it passes zero
        optimized = False
        while not optimized:
            for b in np.arange(-1*(max_feature_value*b_range_multiple), max_feature_value*b_range_multiple, step*b_multiple):
                for transformation in transforms:
                    w_t = w*transformation 
                    found_option = True

                    for i in data:
                        for xi in data[i]:
                            yi=i
                            if not yi*(np.dot(w_t,xi)+b) >= 1:
                                found_option = False

                    if found_option:
                        opt_dict[np.linalg.norm(w_t)] = [w_t,b]

            if w[0] < 0:
                optimized = True
                print('Optimized a step!') 
            else:
                w = w - step

        norms = sorted([n for n in opt_dict])
        opt_choice = opt_dict[norms[0]]
        w = opt_choice[0]
        b = opt_choice[1]
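        # restart the next, finer search just above the best magnitude found so far;
        # abs() guards against a sign flip introduced by the transformation
        latest_optimum = abs(opt_choice[0][0])+step*2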

    for i in data:
        for xi in data[i]:
            yi=i
            print(xi,':',yi*(np.dot(w,xi)+b)) 
    return w, b


In [4]:
w, b = train(data_dict)

Optimized a step!
Optimized a step!
Optimized a step!
[1 7] : 1.271999999999435
[2 8] : 1.271999999999435
[3 8] : 1.0399999999995864
[5 1] : 1.0479999999990506
[ 6 -1] : 1.7439999999985962
[7 3] : 1.0479999999990506


In [17]:
# define prediction function
def predict(features):
    classification = np.sign(np.dot(np.array(features),w)+b)
    if classification != 0:
        ax.scatter(features[0], features[1], s=200, marker='^', c=colors[classification])
    print(classification)
    return classification

In [18]:
predict_us = [[0, 10], [1, 3], [3, 4], [5, 6], [6, -5], [2, 5], [8, -3]]

for p in predict_us:
    predict(p)

-1.0
-1.0
-1.0
-1.0
1.0
-1.0
1.0
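
The `predict` function above reads `w`, `b`, and `ax` from the notebook's global state and plots as a side effect. A minimal sketch of a side-effect-free, vectorized variant is shown below. The name `predict_batch` and the demo values of `w_demo` and `b_demo` are assumptions for illustration and are not the parameters learned by `train`.

```python
import numpy as np

def predict_batch(points, w, b):
    """Return sign(w . x + b) for each row of points as integers in {-1, 0, 1}."""
    scores = np.asarray(points, dtype=float) @ w + b
    return np.sign(scores).astype(int)

# Hypothetical hyperplane x_1 = 1, used only to demonstrate the call
w_demo = np.array([1.0, 0.0])
b_demo = -1.0

labels = predict_batch([[0, 5], [3, -2], [1, 7]], w_demo, b_demo)
print(labels)
```

A label of 0 marks a point that lies exactly on the separating hyperplane, which is the case the original function skips when plotting.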
