### SVM

If a hyperplane exists that separates the data into two classes, there are usually many such hyperplanes. SVM picks the one with the largest margin, that is, the one that maximizes the distance between the hyperplane and the nearest points (support vectors) of both classes. This is known as **Maximizing the Margin**.

The idea is that a larger margin tends to generalize better to unseen data. The goal of SVM is to maximize this margin distance (D) while still classifying the training points correctly. This is the concept of **Margin Maximization**.

### Margin Distance

The margin is the distance between the positive hyperplane and the negative hyperplane that touch the support vectors of each respective region. The larger the margin, the more robust the classifier is to noise and errors.
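
To see where the width of the margin comes from, here is a short derivation. It assumes the common convention (not stated above) that \(\mathbf{w}\) and \(b\) are scaled so the positive and negative hyperplanes are \(\mathbf{w} \cdot \mathbf{x} + b = \pm 1\), with \(\mathbf{x}_+\) and \(\mathbf{x}_-\) denoting one point on each:

\[
\mathbf{w} \cdot \mathbf{x}_+ + b = +1, \qquad \mathbf{w} \cdot \mathbf{x}_- + b = -1
\quad \Longrightarrow \quad
\mathbf{w} \cdot (\mathbf{x}_+ - \mathbf{x}_-) = 2
\]

Projecting \(\mathbf{x}_+ - \mathbf{x}_-\) onto the unit normal \(\mathbf{w} / \|\mathbf{w}\|\) gives the distance between the two hyperplanes:

\[
D = \frac{\mathbf{w}}{\|\mathbf{w}\|} \cdot (\mathbf{x}_+ - \mathbf{x}_-) = \frac{2}{\|\mathbf{w}\|}
\]

So under this scaling, a smaller \(\|\mathbf{w}\|\) means a wider margin.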

### Support Vector

Support vectors are the data points that lie closest to the decision boundary (hyperplane). These points play a crucial role in defining the position of the hyperplane. If the hyperplane is moved, keeping the same slope, the first points it meets in both positive and negative regions are the support vectors. These vectors are critical as they directly influence the margin and the decision boundary of the SVM.
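
As a small illustration of this definition, the sketch below picks out the support vectors for a given hyperplane. The toy data and the values of `w` and `b` are made up for the example (they are not learned here), and the hyperplane is assumed to be scaled so that the closest points satisfy \(|\mathbf{w} \cdot \mathbf{x} + b| = 1\).

```python
import numpy as np

# Hypothetical toy data: two points per class in 2D.
X = np.array([[2.0, 2.0], [4.0, 4.0], [0.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Assumed hyperplane w . x + b = 0 (chosen by hand, not fitted).
w = np.array([0.5, 0.5])
b = -1.0

# y_i * (w . x_i + b) is smallest for the points closest to the boundary.
margins = y * (X @ w + b)
support_mask = np.isclose(margins, margins.min())
support_vectors = X[support_mask]

print(support_vectors)  # the points (2, 2) and (0, 0)
```

Only these two points constrain the boundary here; moving the other two further away from it would leave the hyperplane unchanged.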

### 1- Hard Margin SVM

This applies to linearly separable data where a clear hyperplane divides the data into two regions with no misclassifications. The goal is to maximize the margin between the two classes.

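**Loss Function:**

Maximizing the margin \(\frac{2}{\|\mathbf{w}\|}\) is equivalent to minimizing \(\frac{1}{2}\|\mathbf{w}\|^2\), which gives:
\[
\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2
\]

Subject to:
\[
y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \quad \text{for all } i
\]

Where \(\mathbf{w}\) is the weight vector, \(b\) is the bias, \(\mathbf{x}_i\) are the training points, and \(y_i \in \{-1, +1\}\) are the class labels.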
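
The constraint and the margin can be checked numerically. The sketch below is only an illustration: the helper names `is_feasible` and `margin_width`, the toy data, and the three candidate hyperplanes are all hypothetical. It tests which candidates satisfy \(y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1\) for every point and compares the margin \(2 / \|\mathbf{w}\|\) of the feasible ones.

```python
import numpy as np

def is_feasible(w, b, X, y):
    """True if every sample satisfies y_i * (w . x_i + b) >= 1."""
    # Small tolerance so points exactly on the margin are not rejected by rounding.
    return bool(np.all(y * (X @ w + b) >= 1.0 - 1e-12))

def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [4.0, 4.0], [0.0, 0.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Three hand-picked candidates with the same direction but different scale.
candidates = {
    "a": (np.array([0.5, 0.5]), -1.0),
    "b": (np.array([1.0, 1.0]), -2.0),
    "c": (np.array([0.25, 0.25]), -0.5),
}

for name, (w, b) in candidates.items():
    print(name, is_feasible(w, b, X, y), round(margin_width(w), 3))
```

Candidates `a` and `b` both satisfy the constraints, but `a` has the smaller \(\|\mathbf{w}\|\) and therefore the wider margin. Candidate `c` would have an even wider margin but violates the constraint, which is why the minimization has to be carried out subject to it.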

### 2- Soft Margin SVM

This is used when the data is not perfectly linearly separable, meaning some misclassifications are inevitable. There is a tradeoff between maximizing the margin and minimizing classification errors. Here, we introduce slack variables \(\xi_i \geq 0\), one per data point, that allow a point to fall inside the margin or on the wrong side of the hyperplane. Each \(\xi_i\) measures how far the point is from its correct margin, and the model is penalized in proportion to the total slack.

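**Loss Function:**
\[
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i} \xi_i
\]

Subject to:
\[
y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0
\]

Where \(C\) is the regularization parameter, controlling the tradeoff between maximizing the margin and minimizing the classification error, and \(\xi_i\) represents the slack variables. A large \(C\) penalizes margin violations heavily and tends toward a narrower margin; a small \(C\) tolerates more violations in exchange for a wider margin.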
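
To make the slack variables concrete, the sketch below evaluates a soft-margin objective for a fixed hyperplane. It assumes the standard form \(\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i\) and uses the fact that, for fixed \(\mathbf{w}\) and \(b\), the smallest admissible slack is \(\xi_i = \max(0,\, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))\). The function name `soft_margin_objective` and the data are hypothetical.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) for a fixed hyperplane w . x + b = 0."""
    # Slack is zero for points on or outside their margin, positive otherwise.
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack

# Toy data: the last point is labeled -1 but lies on the positive side.
X = np.array([[2.0, 2.0], [4.0, 4.0], [0.0, 0.0], [1.5, 1.5]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
b = -1.0

objective, slack = soft_margin_objective(w, b, X, y, C=1.0)
print(slack)      # only the last point has non-zero slack
print(objective)
```

Raising `C` in this call increases the cost of the one violating point without changing the margin term, which is the tradeoff the parameter controls.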

### Kernel Trick

Some datasets cannot be separated well by any hyperplane in the original feature space, even with a soft margin. In such cases, the **Kernel Trick** is used. The data is implicitly mapped into a higher-dimensional space where a separating hyperplane may exist. Instead of explicitly transforming the data, a kernel function computes the dot product of two points in that higher-dimensional space directly from the original inputs, allowing SVM to work in the transformed space without the cost of computing the mapping.
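
A quick numerical check makes this concrete. The sketch below uses the degree-2 kernel \((\mathbf{x} \cdot \mathbf{z})^2\) in two dimensions, chosen for the example because its explicit feature map is short enough to write out. The function names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product equals (x . z)^2 for 2D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Same quantity computed in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # map to 3D first, then take the dot product
implicit = poly2_kernel(x, z)      # never leaves 2D

print(explicit, implicit)
```

Both routes give the same number, but the kernel never constructs the 3D vectors. For kernels such as RBF, where the feature space is infinite-dimensional, the explicit route is not available at all.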

### Types of Kernel Functions

1. **RBF (Radial Basis Function):** 
   \[
   K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x'}\|^2\right)
   \]
   
   This kernel is widely used and corresponds to a mapping into an infinite-dimensional space. It works well when the decision boundary is non-linear; \(\gamma > 0\) controls how quickly the similarity decays with distance.
   
   

2. **Polynomial Kernel:** 
   \[
   K(\mathbf{x}, \mathbf{x'}) = (\mathbf{x} \cdot \mathbf{x'} + c)^d
   \]
   
   This kernel allows learning of polynomial decision boundaries of degree \(d\).


3. **Sigmoid Kernel:** 
   \[
   K(\mathbf{x}, \mathbf{x'}) = \tanh(\alpha \mathbf{x} \cdot \mathbf{x'} + c)
   \]

   This kernel resembles the tanh activation function of a neural network. It is not positive semi-definite for all values of \(\alpha\) and \(c\), so it does not always satisfy the conditions for being a proper kernel.


4. **Linear Kernel:** 
   \[
   K(\mathbf{x}, \mathbf{x'}) = \mathbf{x} \cdot \mathbf{x'}
   \]
       
   This is the simplest kernel function: it is the plain dot product, with no feature mapping. It is used when the data is (approximately) linearly separable.


5. **Laplacian Kernel:** 
   \[
   K(\mathbf{x}, \mathbf{x'}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x'}\|_1}{\sigma}\right)
   \]
   
   This kernel is similar to the RBF kernel but uses the \(L_1\) distance in place of the squared \(L_2\) distance. Because the distance is not squared, the kernel value falls off exponentially in the absolute distance rather than in its square.
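
The five formulas above translate almost directly into NumPy. The functions below are a plain sketch for two 1-D vectors, not a library API; the function names and the default parameter values (`gamma`, `c`, `d`, `alpha`, `sigma`) are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2), squared Euclidean distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

def polynomial_kernel(x, z, c=1.0, d=3):
    # (x . z + c)^d
    return (np.dot(x, z) + c) ** d

def sigmoid_kernel(x, z, alpha=1.0, c=0.0):
    # tanh(alpha * x . z + c)
    return np.tanh(alpha * np.dot(x, z) + c)

def linear_kernel(x, z):
    # plain dot product
    return np.dot(x, z)

def laplacian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||_1 / sigma), L1 (Manhattan) distance
    return np.exp(-np.sum(np.abs(x - z)) / sigma)

x = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])

for kernel in (rbf_kernel, polynomial_kernel, sigmoid_kernel,
               linear_kernel, laplacian_kernel):
    print(kernel.__name__, kernel(x, z))
```

For these two orthogonal unit vectors the linear and sigmoid kernels return 0, while the RBF and Laplacian kernels return the same value because the squared \(L_2\) distance and the \(L_1\) distance both equal 2 here.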


In [1]:
import pandas as pd
import numpy as np

In [3]:
class SVM:
    
    def __init__(self, learning_rate = 0.01, lamda = 0.01, n_iters = 1000):
        
        self.learning_rate = learning_rate
        self.lamda = lamda
        self.n_iters = n_iters
        self.w = None
        self.b = None
    
    # fit() updates w and b one sample at a time, depending on which case the sample falls in:
    ### 1-- On the correct side of the margin -- (xi) term will be zero
    ### 2-- Inside the margin or misclassified -- (xi) term will not be zero
    
    ## the gradient used for the update differs between these two cases
    def fit(self, X_train, y_train):
        
        n_samples, n_features = X_train.shape
        
        # map the labels to {-1, +1} (negative and positive side of the hyperplane)
        y_train = np.where(y_train <= 0, -1, 1)
        
        self.w = np.ones(n_features)
        self.b = 0
        
        for _ in range(self.n_iters):
            
            for idx, x in enumerate(X_train):
                
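                condition = y_train[idx]*(np.dot(x, self.w) + self.b) >= 1
                
                ### 1-- Margin satisfied: only the regularization term contributes
                if condition:
                    
                    self.w -= self.learning_rate*(2 * self.lamda * self.w)
                
                else:    ### 2-- Margin violated: the hinge term contributes as well
                    self.w -= self.learning_rate*(2*self.lamda*self.w - y_train[idx]*x)
                    # the hinge gradient w.r.t. b is -y_i for the decision function w.x + b
                    self.b += self.learning_rate * y_train[idx]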
    
    
    #inference function ---> determine on which side of the hyperplane each sample lies
    
    #1-- if the value is positive then  1 -- positive side of the hyperplane
    #2-- if the value is negative then -1 -- negative side of the hyperplane
    
    def predict(self, X_test):
        val = np.dot(X_test, self.w) + self.b
        
        return np.sign(val)
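
The two branches of the training loop can be read as per-sample subgradient steps. As a sketch, assume the per-sample objective is the L2-regularized hinge loss below. The notebook does not state this objective explicitly; here \(\lambda\) stands for `lamda` and \(\eta\) for `learning_rate`.

\[
J_i(\mathbf{w}, b) = \lambda \|\mathbf{w}\|^2 + \max\left(0,\; 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right)
\]

Its subgradients depend on whether the sample satisfies the margin condition:

\[
\nabla_{\mathbf{w}} J_i =
\begin{cases}
2\lambda \mathbf{w} & \text{if } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \\
2\lambda \mathbf{w} - y_i \mathbf{x}_i & \text{otherwise}
\end{cases}
\qquad
\frac{\partial J_i}{\partial b} =
\begin{cases}
0 & \text{if } y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 \\
-y_i & \text{otherwise}
\end{cases}
\]

A gradient step is then \(\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_{\mathbf{w}} J_i\) and \(b \leftarrow b - \eta \, \partial J_i / \partial b\). For a sample that violates the margin, this moves \(b\) by \(+\eta y_i\), so the bias shifts toward the label of the violating sample.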
        
        