## Soft margin SVM
Soft margin SVM is a variant of the SVM (Support Vector Machine) that tolerates some misclassifications, which makes the decision boundary softer.

Specifically, it aims to solve the following dual problem 

$$
\max_{\alpha} \; \sum_{i}\alpha_i - \frac{1}{2}\sum_i \sum_j y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle \\
\text{s.t.} \; 0 \le \alpha_i \le C , \quad \sum_i y^{(i)}\alpha_i = 0
$$
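To make the dual problem concrete, the sketch below evaluates the objective and checks the two constraints for a given vector of multipliers. The helper names `dual_objective` and `is_feasible`, the `eps` parameter, and the tiny two-point dataset with a linear kernel are illustrative assumptions, not part of the notebook's implementation.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Dual value: sum_i alpha_i - 1/2 * sum_ij y_i y_j alpha_i alpha_j K_ij."""
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ K @ ay

def is_feasible(alpha, y, C, eps=1e-8):
    """Check 0 <= alpha_i <= C and sum_i y_i alpha_i = 0, up to eps."""
    in_box = np.all((alpha >= -eps) & (alpha <= C + eps))
    return bool(in_box and abs(alpha @ y) <= eps)

# Hypothetical toy data: two orthogonal points with opposite labels
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
K = X @ X.T  # linear kernel <x_i, x_j>
alpha = np.array([0.5, 0.5])

value = dual_objective(alpha, y, K)
feasible = is_feasible(alpha, y, C=1.0)
```

SMO only ever moves between feasible points, so a check like `is_feasible` can serve as a sanity test while debugging.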

With the following KKT conditions

$$
\alpha_i = 0 \Rightarrow y^{(i)}(w^Tx^{(i)}+b) \ge 1 \\ 
\alpha_i = C \Rightarrow y^{(i)}(w^Tx^{(i)}+b) \le 1 \\ 
0 < \alpha_i < C \Rightarrow y^{(i)}(w^Tx^{(i)}+b) = 1
$$
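In practice these conditions are only checked up to a tolerance. A minimal sketch of such a check for one example is shown below, assuming labels in $\{-1, +1\}$ and a decision value `f_i` that already includes the bias. The function name `violates_kkt` and its arguments are hypothetical.

```python
def violates_kkt(alpha_i, y_i, f_i, C, tol=1e-3):
    """Return True if example i breaks the KKT conditions by more than tol.

    r = y_i * f_i - 1 is the signed distance from the margin:
    r < 0 means the example is inside the margin, r > 0 means outside.
    """
    r = y_i * f_i - 1.0
    inside_but_can_grow = r < -tol and alpha_i < C      # alpha should rise toward C
    outside_but_nonzero = r > tol and alpha_i > 0       # alpha should drop toward 0
    return inside_but_can_grow or outside_but_nonzero
```

For example, a point with $\alpha_i = 0$ and margin $0.5$ is a violator, while the same point with $\alpha_i = C$ is not.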

Together with the kernel trick, SMO (Sequential Minimal Optimization) is one of the powerful tools for solving this problem.


## Implementation

### Required Values
- **point** corresponds to the training data $x^{(i)}$
- **target** corresponds to the training labels $y^{(i)}$
- **C** bounds each multiplier and sets how heavily margin violations are penalized: a smaller C tolerates more mistakes. A suitable value depends on the scale of the problem; it is typically set between $0.01$ and $100$.
- **tol** is the tolerance used when checking the KKT conditions.
- **prog_margin** is the padding used when computing $L$ and $H$ so that they are not treated as equal. It also serves as the margin for deciding whether the two Lagrange multipliers have made any positive progress.
- **clip_padding** is the padding applied to the constraint $0 \le \alpha_i \le C$: $\alpha_i$ is clipped to $C$ or $0$ if it falls within that padding of either bound.

☝🏻 tol, prog_margin and clip_padding are typically set between $10^{-3}$ and $10^{-5}$
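As an illustration of what the clipping padding is for, a minimal sketch could snap a multiplier to the nearest bound when it lands within the padding. The helper name `clip_alpha` is hypothetical and not part of the notebook.

```python
def clip_alpha(a, C, clip_padding=1e-3):
    """Snap a multiplier to 0 or C when it is within clip_padding of that bound."""
    if a < clip_padding:
        return 0.0
    if a > C - clip_padding:
        return C
    return a
```

Snapping avoids multipliers that sit a rounding error away from a bound and are then misclassified as non-bound.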

In [32]:
import numpy as np

np.random.seed(69420)

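# M training examples with D features each
M, D = 50, 10
point = np.random.normal(size=(M, D), loc=0, scale=1).astype(np.float32)
# SVM labels must be -1 or +1; a flat vector makes target[i] a scalar
target = np.random.choice([-1, 1], size=M)
c = .1
tol, prog_margin, clip_padding = .001, .001, .001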

### Kernel Function
This function implements the kernel trick.

For Gaussian kernels we use the following:
$$
K(x, z) = \exp\left(-\frac{\|x-z\|^2}{2\sigma^2}\right)
$$

This can be sped up using the identity $\|x-z\|^2 = x \cdot x - 2\,x \cdot z + z \cdot z$.

We can cache the dot product of each vector with itself. We could also store the dot product of every pair of vectors, but that takes $O(M^2)$ memory.
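The same identity also allows the whole kernel matrix to be computed at once when memory permits. The sketch below is one possible vectorized version, assuming the common $2\sigma^2$ scaling in the exponent. The function name `gaussian_gram` is an assumption for illustration.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gaussian kernel matrix for the rows of X, built from cached self dot products."""
    sq = np.einsum("ij,ij->i", X, X)                  # x_i . x_i for every row
    d2 = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]  # squared pairwise distances
    np.maximum(d2, 0.0, out=d2)                       # remove tiny negatives from rounding
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Usage on random data (hypothetical shapes)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
K_demo = gaussian_gram(X_demo, sigma=1.0)
```

The result is an $M \times M$ array, so this trades memory for speed, which is the trade-off mentioned above.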

In [33]:
self_dot_cache = np.zeros(shape=(M), dtype=np.float32)
sigma = 1

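# initialize self_dot_cache with x_i . x_i for every training point
def initialize_sdc(): 
    for i in range(M):
        self_dot_cache[i] = np.dot(point[i], point[i])

# x and z are indices into `point`, not the vectors themselves
def kernel_gaussian(x, z):
    sq_dist = self_dot_cache[x] - 2 * np.dot(point[x], point[z]) + self_dot_cache[z]
    return np.exp(-sq_dist / (2 * sigma ** 2))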

initialize_sdc()

### Train Function
The train function picks the first Lagrange multiplier from the set of Lagrange multipliers. It also initializes the important variables.

In [36]:
def smo_train():
    # flat vectors so that alphs[i] and err_cache[i] are scalars
    alphs = np.zeros(shape=(M), dtype=np.float32)
    w = np.zeros(shape=(1, D), dtype=np.float32)
    err_cache = np.zeros(shape=(M), dtype=np.float32)
    b = 0

    examine_all = True
    num_changed = 0

    while (num_changed > 0 or examine_all):
        # count the multipliers changed during this pass only
        num_changed = 0
        if (examine_all):
            # pass over every example
            for i in range(M):
                num_changed += examine_a(i, alphs, err_cache, b)
        
        else:
            # pass over the non-bound examples only (0 < alpha < c)
            for i in range(M):
                if (alphs[i] != 0 and alphs[i] != c):
                    num_changed += examine_a(i, alphs, err_cache, b)

        if (examine_all):
            examine_all = False
        
        elif (num_changed == 0):
            examine_all = True

    return alphs, b

### Error Cache Functions

### Examine Alpha Function
The second function checks the first Lagrange multiplier that was chosen and picks the next Lagrange multiplier.

In [35]:
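def examine_a(i2, alphs, err_cache, b):
    # non-bound means 0 < alpha < c; only then is the cached error reused
    non_bound = alphs[i2] != 0 and alphs[i2] != c
    E2 = err_cache[i2] if non_bound else compute_svm_err(i2, alphs, err_cache, b)
    r2 = E2 * target[i2]
    alph2 = alphs[i2]

    # proceed only if example i2 violates the KKT conditions by more than tol
    if ((r2 < -tol and alph2 < c) or (r2 > tol and alph2 > 0)):
        i1 = choose_second(i2, err_cache)
        if (take_step(i1, i2, alphs, err_cache, b)):
            return 1

    return 0
    
def choose_second(first, err_cache):
    # maximize |E1 - E2|: pick the smallest error if E2 >= 0, else the largest
    pos = err_cache[first] >= 0
    best = first

    for i in range(M):
        if (i == first):
            continue

        if (best == first):
            best = i
        elif (pos and err_cache[i] < err_cache[best]):
            best = i
        elif (not pos and err_cache[i] > err_cache[best]):
            best = i

    return best

def compute_svm_err(x, alphs, err_cache, b):
    # error = SVM output - label; the threshold b is subtracted from the kernel expansion
    fx = obj_func(x, alphs, b)
    err_cache[x] = fx - b - target[x]
    return err_cache[x]

def obj_func(x, alphs, b):
    # kernel expansion sum_i alpha_i * y_i * K(x_i, x), without the threshold
    fx = 0
    for i in range(M):
        fx += alphs[i] * target[i] * kernel_gaussian(i, x)

    return fx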


### Step Function
Lastly, the function that takes a coordinate ascent step on the two chosen multipliers.
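The core of that step can be sketched in isolation, following the standard SMO update rule: compute the feasible segment $[L, H]$, move the second multiplier to its unconstrained optimum, clip it, and adjust the first multiplier so that $\sum_i y^{(i)}\alpha_i$ stays unchanged. The function name `pair_update` and its argument list are assumptions for illustration, and the degenerate cases ($L = H$ or a non-negative `eta`) are simply skipped here.

```python
def pair_update(alph1, alph2, y1, y2, E1, E2, K11, K12, K22, C):
    """Return updated (alph1, alph2) after one clipped two-variable step."""
    # Feasible segment for alph2 under 0 <= alpha <= C and the equality constraint
    if y1 == y2:
        L, H = max(0.0, alph1 + alph2 - C), min(C, alph1 + alph2)
    else:
        L, H = max(0.0, alph2 - alph1), min(C, C + alph2 - alph1)

    # Second derivative of the objective along the constraint line
    eta = 2.0 * K12 - K11 - K22
    if L == H or eta >= 0:
        return alph1, alph2  # sketch only: degenerate cases are not handled

    # Unconstrained optimum for alph2, then clip to [L, H]
    a2 = alph2 - y2 * (E1 - E2) / eta
    a2 = min(max(a2, L), H)

    # Move alph1 by the opposite amount so y1*alph1 + y2*alph2 is preserved
    a1 = alph1 + y1 * y2 * (alph2 - a2)
    return a1, a2
```

Starting from two zero multipliers with opposite labels, errors $E_1 = -1$ and $E_2 = 1$, and $K_{11} = K_{22} = 1$, $K_{12} = 0$, the step moves both multipliers up to the same value, capped by $C$.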

In [None]:
def take_step(i1, i2, alphs, err_cache, b):
    if (i1 == i2):
        return 0
    
    # reuse the cached error only for non-bound multipliers (0 < alpha < c)
    non_bound = alphs[i1] != 0 and alphs[i1] != c
    E1 = err_cache[i1] if non_bound else compute_svm_err(i1, alphs, err_cache, b)
    E2 = err_cache[i2]
    y2 = target[i2]
    y1 = target[i1]
    alph1, alph2 = alphs[i1], alphs[i2]
    s = y1 * y2

    # Computation for L and H
    if (y1 == y2):
        L, H = max(0, alph1 + alph2 - c), min(c, alph1 + alph2)
    else:
        L, H = max(0, alph2 - alph1), min(c, c + alph2 - alph1)

    if (L == H):
        return 0
    
    # kernel values; self_dot_cache holds plain dot products, not K(x, x)
    K11 = kernel_gaussian(i1, i1)
    K22 = kernel_gaussian(i2, i2)
    K12 = kernel_gaussian(i1, i2)
    
    # second derivative of the objective along the constraint line
    eta = 2*K12 - K11 - K22

    if (eta < 0):
        # unconstrained optimum for alph2, then clip to [L, H]
        alph2_new = alph2 - y2*(E1 - E2)/eta
        if (alph2_new < L):
            alph2_new = L
        elif (alph2_new > H):
            alph2_new = H
    else:
        Lobj = obj_func(L, obj)