# Q1. What is the mathematical formula for a linear SVM?

A linear Support Vector Machine (SVM) aims to find a hyperplane that best separates classes in a feature space. The mathematical formula for a linear SVM can be expressed as follows:

Given a set of training data points \((x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\), where \(x_i\) is a feature vector and \(y_i\) is the class label (\(y_i = \pm 1\) for binary classification), a linear SVM tries to find a hyperplane represented by:

\[w \cdot x + b = 0\]

Where:
- \(w\) is the weight vector, which is perpendicular to the hyperplane.
- \(x\) is a feature vector.
- \(b\) is the bias term (also known as the intercept).

The decision function for a linear SVM is:

\[f(x) = w \cdot x + b\]

The SVM classifies a data point \(x\) by the sign of \(f(x)\): if \(f(x) > 0\), the point is assigned to the class \(y = +1\), and if \(f(x) < 0\), to the class \(y = -1\).
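As a minimal sketch of this decision rule, the snippet below evaluates \(f(x)\) for two points. The weight vector, bias, and points are made-up values chosen for illustration, not parameters learned from data.

```python
import numpy as np

# Hypothetical parameters of an already-trained linear SVM (illustrative values only).
w = np.array([2.0, -1.0])
b = -1.0


def decision_function(x):
    """Return f(x) = w . x + b."""
    return np.dot(w, x) + b


def classify(x):
    """Return +1 or -1 according to the sign of f(x)."""
    return 1 if decision_function(x) > 0 else -1


x_a = np.array([2.0, 1.0])  # f = 2*2 - 1*1 - 1 = 2
x_b = np.array([0.0, 1.0])  # f = 2*0 - 1*1 - 1 = -2

print(decision_function(x_a), classify(x_a))
print(decision_function(x_b), classify(x_b))
```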

The objective of a linear SVM is to find the optimal \(w\) and \(b\) such that the margin (the distance between the hyperplane and the nearest data points, known as support vectors) is maximized while all training points are classified correctly.

Mathematically, this is formulated as a constrained optimization problem:

\[
\text{Minimize } \frac{1}{2} ||w||^2
\]

Subject to the constraints:

\[
y_i(w \cdot x_i + b) \geq 1 \text{ for } i = 1, 2, ..., n
\]

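These constraints ensure that every data point is correctly classified and lies at a distance of at least \(1 / ||w||\) from the hyperplane. Minimizing \(||w||\) therefore maximizes the margin, whose total width is \(2 / ||w||\).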

The problem is typically solved through its Lagrangian dual, a quadratic program, using techniques like the Sequential Minimal Optimization (SMO) algorithm to obtain the optimal \(w\) and \(b\).

This formulation of a linear SVM is for binary classification. It can be extended to multiclass classification using techniques like one-vs-rest or one-vs-one.

# Q2. What is the objective function of a linear SVM?

The objective of a linear Support Vector Machine (SVM) is to find the parameters \(w\) and \(b\) that define the separating hyperplane with the maximum margin. Mathematically, this can be formulated as a constrained optimization problem.

The objective function of a linear SVM can be written as:

\[ \text{Minimize} \quad \frac{1}{2} ||w||^2 \]

subject to the constraints:

\[ y_i(w \cdot x_i + b) \geq 1 \text{ for } i = 1, 2, ..., n \]

Here's what these terms represent:

- \(w\) is the weight vector that defines the orientation of the hyperplane.
- \(b\) is the bias term (or intercept) that shifts the hyperplane.
- \(x_i\) represents the feature vectors.
- \(y_i\) represents the class labels (\(y_i = \pm 1\) for binary classification).
- \(n\) is the number of training samples.

The objective function \(\frac{1}{2} ||w||^2\) acts as a regularization term. Because the distance from the hyperplane to the nearest data points (the support vectors) is \(1 / ||w||\), minimizing \(||w||\) maximizes the margin.

The constraints \(y_i(w \cdot x_i + b) \geq 1\) ensure that all training samples are correctly classified and lie on or outside the margin boundary of their class. This form assumes the data are linearly separable; otherwise no \(w\) and \(b\) satisfy all the constraints.

The goal of this optimization problem is to find the optimal \(w\) and \(b\) that satisfy these constraints while minimizing the regularization term. This is typically done using techniques like the Sequential Minimal Optimization (SMO) algorithm or quadratic programming solvers.

By solving this optimization problem, the SVM finds the hyperplane that best separates the classes with the maximum margin, leading to a robust and accurate classification model.
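One way to see the constraints and the margin in action is to fit a linear SVM on a tiny separable dataset. The sketch below uses scikit-learn's `SVC` with a very large `C` as an approximation to the hard-margin problem (an assumption of this example, since `SVC` always solves the soft-margin form). The four points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (illustrative): two points per class, separated by a gap along the first axis.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])

# A very large C makes margin violations so costly that the fit approximates a hard margin.
hard_svm = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)
w = hard_svm.coef_[0]
b = hard_svm.intercept_[0]

# Left-hand side of the constraints y_i (w . x_i + b) >= 1 for every training point.
constraints = y_toy * (X_toy @ w + b)
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("y_i (w . x_i + b) =", constraints)
print("margin width 2 / ||w|| =", margin_width)
```

For this data the two classes are 2 units apart along the first feature, so the reported margin width should come out close to 2, with every constraint value close to 1 (up to solver tolerance).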

# Q3. What is the kernel trick in SVM?

The kernel trick is a fundamental concept in Support Vector Machines (SVMs) that allows them to efficiently handle non-linearly separable data by implicitly mapping it into a higher-dimensional feature space.

Here's how it works:

1. **Linear Separability in a Higher Dimension:**
   - Sometimes, data that is not linearly separable in a lower-dimensional space becomes linearly separable in a higher-dimensional space. For example, consider a 2D dataset in which one class forms a ring around the other. No straight line separates the two classes, but after mapping each point \((x_1, x_2)\) to \((x_1, x_2, x_1^2 + x_2^2)\) in 3D, a plane can separate them.

2. **Avoiding Explicit Transformation:**
   - Instead of explicitly transforming the data into a higher-dimensional space (which can be computationally expensive or even impractical for very high dimensions), the kernel trick allows SVMs to perform computations as if the transformation had been applied, without actually computing the transformed feature vectors.

3. **Kernel Functions:**
   - A kernel is a function \(K(x, x') = \phi(x) \cdot \phi(x')\) that returns the dot product of two vectors after a mapping \(\phi\) into the higher-dimensional space, computed directly from the original vectors. Each valid kernel therefore implicitly corresponds to some such mapping.

   - Common kernel functions include:
     - Linear Kernel: \(K(x, x') = x \cdot x'\)
     - Polynomial Kernel: \(K(x, x') = (x \cdot x' + c)^d\)
     - Radial Basis Function (RBF) Kernel (or Gaussian Kernel): \(K(x, x') = \exp\left(-\frac{||x - x'||^2}{2\sigma^2}\right)\)

4. **Efficient Computation:**
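   - The SVM dual problem and its decision function depend on the training data only through dot products between pairs of points. Replacing each dot product with \(K(x, x')\) gives the same result as working in the higher-dimensional space, at the cost of one kernel evaluation per pair rather than an explicit transformation.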

5. **Flexibility for Non-Linear Separation:**
   - The kernel trick enables SVMs to find non-linear decision boundaries in the original input space.

Overall, the kernel trick allows SVMs to handle complex, non-linearly separable data efficiently, making them a powerful tool for classification tasks in a wide range of applications.
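To make the idea concrete, the following sketch checks numerically that a degree-2 polynomial kernel on 2D inputs equals an ordinary dot product after an explicit mapping into 6 dimensions. The helper names `poly_kernel` and `phi` are hypothetical, and the feature map shown is the standard one for \(c = 1\), \(d = 2\).

```python
import numpy as np


def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel K(x, z) = (x . z + c)^d, computed in the original space."""
    return (np.dot(x, z) + c) ** d


def phi(x):
    """Explicit feature map for 2D inputs that matches poly_kernel with c=1, d=2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 ** 2, x2 ** 2, s * x1 * x2, s * x1, s * x2, 1.0])


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = poly_kernel(x, z)        # one dot product in 2D, then a square
k_explicit = np.dot(phi(x), phi(z))   # dot product in the 6D feature space

print(k_implicit, k_explicit)
```

Both values agree (here \(x \cdot z = 1\), so the kernel is \((1 + 1)^2 = 4\)), even though `poly_kernel` never builds the 6D vectors.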

# Q4. What is the role of support vectors in SVM? Explain with an example.

Support vectors play a crucial role in Support Vector Machines (SVMs). They are the training points closest to the decision boundary (hyperplane), and they alone determine its position and orientation. They also define the margin, which is the distance between the hyperplane and the nearest data points.

In a linear SVM, the other data points don't influence the position of the hyperplane as long as they are correctly classified and don't cross the margin boundaries. They are not part of the support vectors.

If any of the support vectors were removed or moved, it would affect the position and orientation of the hyperplane. The SVM aims to maximize the margin while correctly classifying the data, and the support vectors are critical in achieving this.

In summary, support vectors are the data points that are crucial in defining the hyperplane and the margin in an SVM. They play a central role in the algorithm's decision-making process.
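As an example, the sketch below fits a linear SVM on six made-up points, lists which of them are support vectors, and then refits after dropping the two points that lie far from the boundary. It assumes scikit-learn's `SVC` with a large `C` as a stand-in for a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (illustrative): index 2 and index 5 lie far away from the class boundary.
X_toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [-3.0, 0.5],   # class -1
    [2.0, 0.0], [2.0, 1.0], [5.0, 0.5],    # class +1
])
y_toy = np.array([-1, -1, -1, 1, 1, 1])

full = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)
print("support vector indices:", full.support_)
print("support vectors:\n", full.support_vectors_)

# Drop the two far-away points and refit on the remaining four.
far_idx = [2, 5]
keep = np.setdiff1d(np.arange(len(X_toy)), far_idx)
reduced = SVC(kernel="linear", C=1e6).fit(X_toy[keep], y_toy[keep])

print("full    w, b:", full.coef_[0], full.intercept_[0])
print("reduced w, b:", reduced.coef_[0], reduced.intercept_[0])
```

The far-away points are not among the support vectors, and removing them leaves \(w\) and \(b\) unchanged up to solver tolerance. Removing one of the points on the margin could change the hyperplane.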

# Q5. Illustrate the hyperplane, marginal plane, soft margin and hard margin in SVM with examples and graphs.


### Hyperplane:

A hyperplane is a decision boundary that separates two classes in a feature space. In a binary classification problem, it's a flat affine subspace of one dimension less than the input space.

**Example:**
Consider a 2D feature space with two classes, blue (class 0) and red (class 1). The hyperplane is represented by the equation \(w \cdot x + b = 0\), which separates the classes.

![Hyperplane Example](https://i.imgur.com/JbObxjw.png)

### Marginal Plane:

The marginal planes are the two hyperplanes \(w \cdot x + b = +1\) and \(w \cdot x + b = -1\), which run parallel to the decision hyperplane and pass through the support vectors of each class. The region between them is the margin.

**Example:**
In the same 2D feature space, the marginal planes are the dashed lines in the graph below, and the margin is the area between them:

![Marginal Plane Example](https://i.imgur.com/3eHVoyU.png)

### Soft Margin:

In a soft-margin SVM, some margin violations are tolerated in exchange for a wider margin and a more robust, generalizable model. The margin is "softened": some points may fall inside the margin or even on the wrong side of the hyperplane, and the regularisation parameter \(C\) sets how heavily such violations are penalized.

**Example:**
Consider a scenario where it's not possible to perfectly separate the classes. A soft-margin SVM allows some points to violate the margin (the points inside the margin in the graph below); only those that end up on the wrong side of the hyperplane are actually misclassified.

![Soft Margin Example](https://i.imgur.com/zC3Fc1H.png)

### Hard Margin:

In a hard-margin SVM, no misclassifications are allowed. The data must be perfectly separable by a hyperplane. This is a stricter condition compared to the soft-margin SVM.

**Example:**
In the case where classes are perfectly separable, a hard-margin SVM finds the optimal hyperplane with the maximum margin, ensuring all points are correctly classified.

![Hard Margin Example](https://i.imgur.com/ZsbHfL2.png)

Keep in mind that in real-world scenarios, it's common for data to not be perfectly separable. In such cases, soft-margin SVMs are more appropriate, as they provide a balance between maximizing the margin and allowing for some misclassifications.
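The trade-off between margin width and violations is commonly written as the soft-margin objective \(\frac{1}{2} ||w||^2 + C \sum_i \max(0, 1 - y_i(w \cdot x_i + b))\). The sketch below evaluates it for a fixed hyperplane; the function name `soft_margin_objective`, the data, and the parameter values are all illustrative assumptions.

```python
import numpy as np


def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) for the soft-margin SVM with hinge-loss slack."""
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)   # 0 for points on or outside the margin
    objective = 0.5 * np.dot(w, w) + C * slack.sum()
    return objective, slack


# Illustrative values: the third point sits on the wrong side of the hyperplane.
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [1.5, 0.0]])
y_demo = np.array([-1, 1, -1])
w_demo = np.array([1.0, 0.0])
b_demo = -1.0

obj_small_c, slack = soft_margin_objective(w_demo, b_demo, X_demo, y_demo, C=1.0)
obj_large_c, _ = soft_margin_objective(w_demo, b_demo, X_demo, y_demo, C=10.0)

print("slack per point:", slack)
print("objective with C=1: ", obj_small_c)
print("objective with C=10:", obj_large_c)
```

A larger `C` makes the same violation more expensive, which pushes the optimizer toward hard-margin behaviour; a smaller `C` tolerates violations in exchange for a wider margin.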

# Q6. SVM Implementation through Iris dataset.

- Load the iris dataset from the scikit-learn library and split it into a training set and a testing set.
- Train a linear SVM classifier on the training set and predict the labels for the testing set.
- Compute the accuracy of the model on the testing set.
- Plot the decision boundaries of the trained model using two of the features.
- Try different values of the regularisation parameter C and see how it affects the performance of the model.

Bonus task: Implement a linear SVM classifier from scratch using Python and compare its performance with the scikit-learn implementation.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn
datasets = load_iris()

In [2]:
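# Put the four features into a DataFrame and append the class labels
df = pd.DataFrame(datasets.data, columns=datasets.feature_names)
df['target'] = datasets.target
df.head()

# Separate the feature matrix from the target vector
X = df.drop('target', axis=1)
y = df.target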

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size=0.20,
                                                   random_state=42)

In [4]:
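from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Note: SVC() defaults to the RBF kernel; use SVC(kernel='linear') for a strictly linear SVM
model = SVC()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_test, y_pred))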

1.0
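The task list also asks for a plot of the decision boundaries using two of the features. A minimal sketch follows, assuming the two petal measurements as the chosen pair and a linear kernel with `C=1.0`. It retrains a separate 2D model, so the variable names here (`X2`, `linear_2d`, and so on) are specific to this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X2 = iris.data[:, 2:4]          # petal length and petal width only
y2 = iris.target

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.20, random_state=42
)

linear_2d = SVC(kernel="linear", C=1.0).fit(X2_train, y2_train)

# Predict the class at every point of a dense grid covering the feature range.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 300),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 300),
)
Z = linear_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig, ax = plt.subplots(figsize=(6, 4))
ax.contourf(xx, yy, Z, alpha=0.3, cmap="viridis")           # decision regions
ax.scatter(X2_train[:, 0], X2_train[:, 1], c=y2_train,
           cmap="viridis", edgecolor="k", s=25)             # training points
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.set_title("Linear SVM decision regions (two petal features)")
```

In a notebook the figure renders inline; when running this as a script, call `plt.show()` at the end.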


In [5]:
from sklearn.model_selection import GridSearchCV

# Search over kernel type and regularisation strength C with 5-fold cross-validation
param_grid = {
    'kernel': ['poly', 'rbf', 'sigmoid'],
    'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

clf = GridSearchCV(model, param_grid=param_grid, cv=5)
clf.fit(X_train, y_train)
clf.best_params_

{'C': 2, 'kernel': 'poly'}
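To look at the effect of the regularisation parameter on a strictly linear SVM, one option is a simple sweep over `C`. The sketch below is self-contained and uses its own variable names; the list of `C` values is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
Xi_train, Xi_test, yi_train, yi_test = train_test_split(
    X_iris, y_iris, test_size=0.20, random_state=42
)

results = []
for C in [0.01, 0.1, 1, 10, 100]:
    svm = SVC(kernel="linear", C=C).fit(Xi_train, yi_train)
    results.append({
        "C": C,
        "train_acc": accuracy_score(yi_train, svm.predict(Xi_train)),
        "test_acc": accuracy_score(yi_test, svm.predict(Xi_test)),
        # n_support_ holds the number of support vectors per class
        "n_support_vectors": int(svm.n_support_.sum()),
    })

for row in results:
    print(row)
```

Comparing train accuracy, test accuracy, and the number of support vectors across rows shows how strongly the fit depends on `C` for this split.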

# Bonus task: Implement a linear SVM classifier from scratch using Python and compare its performance with the scikit-learn implementation.

In [6]:
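from sklearn.datasets import load_diabetes

# Note: load_diabetes is a regression dataset; its target is a quantitative
# measure of disease progression, not a class label
data = load_diabetes()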

In [7]:
dff = pd.DataFrame(data.data , columns=data.feature_names)
dff['target'] = data.target
dff.head()

|   | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019907 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068332 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.00567 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002861 | -0.02593 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022688 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031988 | -0.046641 | 135.0 |


In [11]:
X = dff.drop('target', axis=1)
y = dff.target

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=42)

param = {
    'kernel': ['linear', 'poly', 'sigmoid', 'rbf'],
    'C': [1, 2, 3, 4, 5],
    'gamma': ['auto', 'scale'],
}

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings
from sklearn.svm import SVC

# Caution: SVC is a classifier, so every distinct target value is treated as a separate class
svc_model = SVC()

clf = GridSearchCV(svc_model, param_grid=param, cv=5)
clf.fit(X_train, y_train)

In [12]:
clf.best_params_

{'C': 1, 'gamma': 'scale', 'kernel': 'poly'}

In [13]:
model_clf = clf.best_estimator_

model_clf.predict(X_test)

array([230., 200., 109., 248., 200., 200., 233., 142., 104., 200., 200.,
       178.,  85., 200., 200., 200., 178., 180., 178., 200., 200., 200.,
       200., 200., 200., 200., 200., 200.,  59., 200., 200., 200., 200.,
       200., 200., 200., 200., 200., 200.,  59., 200., 200., 200.,  89.,
       200.,  75.,  55.,  64.,  59., 200., 182., 134., 200., 200., 229.,
       200., 200., 200., 200.,  93., 200., 200., 200., 200., 200., 200.,
       200., 200., 142., 200., 200., 200., 189., 200., 200., 138., 200.,
       200., 200., 200., 200., 200.,  51., 200., 200., 200., 114.,  65.,
       200., 200., 200., 200.,  65.,  59.,  65., 200., 275., 200.,  51.,
       200., 200.,  55., 258.,  71., 200., 143., 200., 200., 200., 200.,
       200.])

In [14]:
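from sklearn.metrics import accuracy_score

y_pred = model_clf.predict(X_test)

# Accuracy counts exact matches only; with a numeric regression target it is expected to be near zero.
# A regression model (e.g. sklearn.svm.SVR) scored with a metric such as R^2 would suit this dataset better.
print(accuracy_score(y_test, y_pred))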

0.018018018018018018
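For the bonus task itself, a from-scratch linear SVM can be sketched as full-batch subgradient descent on the hinge loss. This is one possible approach, not the method scikit-learn uses internally. The class name `LinearSVMScratch`, the learning rate, the epoch count, and the reduction of iris to a binary problem (setosa versus the other two species) are all assumptions made to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class LinearSVMScratch:
    """Hypothetical helper: soft-margin linear SVM for labels in {-1, +1}."""

    def __init__(self, C=1.0, lr=0.01, n_epochs=2000):
        self.C = C
        self.lr = lr
        self.n_epochs = n_epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        # Same minimizer as 0.5*||w||^2 + C*sum(hinge), rescaled by 1 / (C * n_samples).
        lam = 1.0 / (self.C * n_samples)
        for _ in range(self.n_epochs):
            margins = y * (X @ self.w + self.b)
            viol = margins < 1                      # points with non-zero hinge loss
            grad_w = lam * self.w - (y[viol][:, None] * X[viol]).sum(axis=0) / n_samples
            grad_b = -y[viol].sum() / n_samples
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def decision_function(self, X):
        return X @ self.w + self.b

    def predict(self, X):
        return np.where(self.decision_function(X) >= 0, 1, -1)


# Binary version of iris: setosa (+1) versus the other two species (-1).
iris = load_iris()
X_bin = iris.data
y_bin = np.where(iris.target == 0, 1, -1)

Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bin, y_bin, test_size=0.20, random_state=42
)

# Standardize features so a single learning rate works for all of them.
scaler = StandardScaler().fit(Xb_train)
Xb_train_s = scaler.transform(Xb_train)
Xb_test_s = scaler.transform(Xb_test)

scratch = LinearSVMScratch(C=1.0).fit(Xb_train_s, yb_train)
reference = SVC(kernel="linear", C=1.0).fit(Xb_train_s, yb_train)

scratch_acc = accuracy_score(yb_test, scratch.predict(Xb_test_s))
sklearn_acc = accuracy_score(yb_test, reference.predict(Xb_test_s))

print("from-scratch accuracy:", scratch_acc)
print("scikit-learn accuracy:", sklearn_acc)
```

Extending the sketch to all three iris classes would need a one-vs-rest or one-vs-one wrapper around the binary classifier.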
