In [None]:
ASSIGNMENT: 1

Q1. What is the mathematical formula for a linear SVM?

The linear Support Vector Machine (SVM) is based on the idea of finding the optimal hyperplane that separates data points belonging to different classes in a way that maximizes the margin between them. Here's the mathematical formulation:

1. Decision Function
A linear SVM makes predictions using the following decision function:

$$ f(x) = w \cdot x + b $$

Where:

$w$: Weight vector (defines the orientation of the hyperplane).

$x$: Input feature vector.

$b$: Bias term (defines the position of the hyperplane).

$\cdot$: Dot product.

The decision boundary is defined as $f(x) = 0$, which corresponds to the separating hyperplane.

2. Optimization Problem
The linear SVM solves the following convex optimization problem:

Minimize: $$ \frac{1}{2} \|w\|^2 $$

Subject to: $$ y_i (w \cdot x_i + b) \geq 1, \quad \forall i $$

Where:

$\|w\|^2$: The squared magnitude of the weight vector, minimizing which maximizes the margin.

$y_i$: Label for the $i$-th sample ($+1$ or $-1$).

$x_i$: Input feature vector for the $i$-th sample.

The constraint ensures that each sample is correctly classified and lies on or outside the margin boundary of its class.
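
As a short worked step (a sketch under the scaling used in the constraint, where the closest points satisfy $y_i (w \cdot x_i + b) = 1$), the link between $\|w\|$ and the margin can be made explicit. The distance from a point $x$ to the hyperplane $w \cdot x + b = 0$ is $|w \cdot x + b| / \|w\|$, so a point on either margin boundary is at distance $1 / \|w\|$, and the full margin width is:

$$ \text{margin} = \frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|} $$

Maximizing $2 / \|w\|$ is therefore equivalent to minimizing $\|w\|$, and the squared form $\frac{1}{2} \|w\|^2$ is used because it has the same minimizer and is easier to optimize.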

Soft-Margin SVM (Handling Misclassifications)
For datasets that are not perfectly separable, the soft-margin SVM introduces slack variables $\xi_i$ to allow misclassifications. The optimization problem becomes:

Minimize: $$ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i $$

Subject to: $$ y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i $$

Where:

$\xi_i$: Slack variable for the $i$-th sample.

$C$: Regularization parameter that controls the trade-off between maximizing the margin and minimizing misclassifications.
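
To make the formulas above concrete, here is a minimal NumPy sketch that evaluates the decision function, the smallest slack values that satisfy the soft-margin constraints, and the resulting objective. The data points and the values of `w`, `b`, and `C` are made up for illustration; they are not a fitted solution.

```python
import numpy as np

# Toy 2D samples with labels in {+1, -1} (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [2.0, 1.5]])
y = np.array([1, 1, -1, -1])

# Hypothetical hyperplane parameters, chosen by hand rather than learned.
w = np.array([1.0, 1.0])
b = -3.0
C = 1.0

f = X @ w + b                        # decision values f(x) = w . x + b
pred = np.where(f >= 0, 1, -1)       # predicted class is the sign of f(x)
xi = np.maximum(0.0, 1.0 - y * f)    # smallest slack meeting y_i f(x_i) >= 1 - xi_i
objective = 0.5 * (w @ w) + C * xi.sum()

print("decision values:", f)
print("predictions:    ", pred)
print("slack values:   ", xi)
print("objective:      ", objective)
```

The last sample sits on the wrong side of the hyperplane, so it is the only one with a nonzero slack; the other three satisfy the hard-margin constraint and contribute nothing to the penalty term.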

Q2. What is the objective function of a linear SVM?

The objective function of a linear Support Vector Machine (SVM) aims to find the optimal hyperplane that maximizes the margin between classes while minimizing classification errors. It can be expressed as an optimization problem.

Objective Function (Primal Form)
For a hard-margin SVM (when the data is perfectly linearly separable):

Minimize: $$ \frac{1}{2} \|w\|^2 $$

Subject to: $$ y_i (w \cdot x_i + b) \geq 1, \quad \forall i $$

Where:

$\|w\|^2$: The squared magnitude of the weight vector, which controls the margin width. Minimizing this maximizes the margin.

$y_i$: Class label for the $i$-th training sample ($+1$ or $-1$).

$x_i$: Feature vector for the $i$-th sample.

$w$: Weight vector defining the orientation of the hyperplane.

$b$: Bias term defining the position of the hyperplane.

For a soft-margin SVM (when data is not perfectly separable), slack variables $\xi_i$ are introduced to allow some misclassification:

Soft-Margin Objective Function:
Minimize: $$ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i $$

Subject to: $$ y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i $$

Where:

$\xi_i$: Slack variable representing the degree of misclassification for the $i$-th sample.

$C$: Regularization parameter balancing the trade-off between maximizing the margin (minimizing $\|w\|^2$) and minimizing misclassification errors ($\xi_i$).
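
As a worked step, the soft-margin problem can also be written without constraints. For fixed $w$ and $b$, the smallest slack that satisfies both constraints is $\xi_i = \max(0,\ 1 - y_i (w \cdot x_i + b))$, so substituting it into the objective gives the equivalent hinge-loss form:

$$ \min_{w,\, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i (w \cdot x_i + b)\bigr) $$

Samples with $y_i (w \cdot x_i + b) \geq 1$ contribute zero loss, which is why only points on or inside the margin influence the solution.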

Q3. What is the kernel trick in SVM?

The kernel trick is a fundamental concept in Support Vector Machines (SVMs) that allows the algorithm to handle non-linearly separable data by mapping it into a higher-dimensional space where it becomes linearly separable. Here's a detailed explanation:

What is the Kernel Trick?
Linear vs. Nonlinear Data:

In its original form, SVM finds a hyperplane to separate data points. However, for nonlinear data, no linear hyperplane can achieve this separation.

Feature Mapping:

To deal with nonlinear separability, the data is mapped by a feature map $\phi$ into a higher-dimensional feature space where a linear hyperplane may be able to separate the points.

Example: Mapping data from 2D space to 3D space to make it separable.

Avoid Explicit Transformation:

Instead of explicitly computing the transformation $\phi(x)$ into the higher-dimensional space, the kernel trick calculates the dot products in the transformed space directly using a kernel function $K(x, x') = \phi(x) \cdot \phi(x')$.

This avoids the computational cost of working in high dimensions.
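
A small numerical check can illustrate the idea. The sketch below uses the homogeneous degree-2 polynomial kernel as an example; the helper names `phi` and `poly_kernel` are hypothetical and chosen only for this illustration. It compares the dot product computed after an explicit 2D-to-3D mapping with the kernel value computed directly in the original 2D space.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: K(u, v) = (u . v)^2."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = np.dot(phi(u), phi(v))   # dot product taken in the 3D feature space
implicit = poly_kernel(u, v)        # same quantity, computed without leaving 2D

print(explicit, implicit)
print(np.isclose(explicit, implicit))
```

Both values agree up to floating-point rounding, even though `poly_kernel` never builds the 3D vectors. For kernels such as the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all.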

Q4. What is the role of support vectors in SVM? Explain with an example.

Role of Support Vectors in SVM
Define the Margin:

Support vectors are the data points that are closest to the separating hyperplane. In a hard-margin SVM they lie exactly on the margin boundaries; in a soft-margin SVM they also include points inside the margin or on the wrong side of the hyperplane.

These points are crucial for determining the width of the margin, which the SVM maximizes.

Determine the Decision Boundary:

The position and orientation of the hyperplane are directly influenced by the support vectors. If you remove or change a support vector, the hyperplane will adjust accordingly.

Efficiency:

Only the support vectors appear in the decision function of the trained model, so the cost of a prediction depends on the number of support vectors rather than on the size of the training set.

Robustness:

Correctly classified points far away from the decision boundary do not affect the hyperplane. This makes SVM insensitive to such points, although outliers that fall near or across the margin still influence the solution.

Example
Imagine a binary classification problem with two classes (e.g., circles and squares) in a 2D feature space.

Support Vectors:

Suppose the SVM finds a hyperplane that separates the two classes.

The data points from each class that are closest to the hyperplane are the support vectors.

These points essentially "support" the hyperplane by marking the edges of the margin.

Impact of Support Vectors:

If you were to remove a point far from the decision boundary, it would not affect the hyperplane.

However, removing or modifying a support vector would change the decision boundary significantly because it directly influences the margin.
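
The example above can be sketched with scikit-learn. The eight points below are made up for illustration, and a large `C` is used to approximate a hard margin. The code fits a linear `SVC`, lists the support vectors, then removes one point that is not a support vector and refits to check that the hyperplane stays (numerically) the same.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (illustrative values only).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [0.0, 0.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [6.0, 6.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e3).fit(X, y)
print("support vector indices:", clf.support_)
print("support vectors:\n", clf.support_vectors_)

# Remove the first training point that is NOT a support vector and refit.
non_sv = next(i for i in range(len(X)) if i not in clf.support_)
mask = np.arange(len(X)) != non_sv
clf2 = SVC(kernel="linear", C=1e3).fit(X[mask], y[mask])

# The weight vector and bias should be unchanged up to solver tolerance.
same = (np.allclose(clf.coef_, clf2.coef_, atol=1e-2)
        and np.allclose(clf.intercept_, clf2.intercept_, atol=1e-2))
print("hyperplane unchanged after removing a non-support vector:", same)
```

Repeating the experiment with an index taken from `clf.support_` instead would generally move the boundary, since those points define the margin.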



Q5. Illustrate with examples and graphs of Hyperplane, Marginal plane, Soft margin and Hard margin in SVM?

1. Hyperplane
The hyperplane is the decision boundary that separates data points of different classes.

Example:

For 2D data: A straight line dividing two classes.

For 3D data: A flat plane separating points.

Formula: $w \cdot x + b = 0$, where $w$ is the weight vector and $b$ is the bias.

Representation:

In 2D, imagine a line splitting two clusters, such as red points on one side and blue points on the other.

2. Marginal Plane
The margin refers to the space between the hyperplane and the nearest data points of each class (support vectors).

The marginal planes are defined by the equations $w \cdot x + b = +1$ and $w \cdot x + b = -1$.

Example:

For 2D data: Two parallel lines on either side of the hyperplane, forming the margin.

Representation:

Imagine two dashed lines parallel to the decision boundary, marking the closest data points.

3. Hard Margin
Hard margin SVM assumes perfectly separable data and requires all points to lie on or outside the margin boundaries, with no misclassification allowed.

Constraints:

$y_i (w \cdot x_i + b) \geq 1$, where $y_i$ is the class label of $x_i$.

Example:

In a dataset without noise, all points of one class are clearly separated from the other class by the margin.

Representation:

Perfect clusters of points with no overlap, separated by the hyperplane.

4. Soft Margin
Soft margin SVM allows some misclassification or overlap of points near the margin, accommodating noisy or non-separable data.

Constraints:

$y_i (w \cdot x_i + b) \geq 1 - \xi_i$, where $\xi_i \geq 0$ are slack variables for misclassification.

Example:

In a dataset with noise, some points may cross into the margin or be misclassified.

Representation:

Points that violate the margin boundaries are shown within or across the dashed lines.
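
The four concepts above can be drawn with a short plotting sketch. It uses synthetic data from `make_blobs` (not data from this assignment) and approximates the hard margin with a large `C`, since scikit-learn's `SVC` always solves the soft-margin problem. The solid line is the hyperplane $w \cdot x + b = 0$, the dashed lines are the marginal planes $w \cdot x + b = \pm 1$, and circled points are support vectors.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data, used only for illustration.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.2, random_state=0)

# Grid on which the decision function is evaluated.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

settings = [(1000.0, "Large C (close to hard margin)"),
            (0.05, "Small C (soft margin)")]
margins = []

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, (C, title) in zip(axes, settings):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins.append(2 / np.linalg.norm(clf.coef_))  # margin width 2 / ||w||

    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolors="k")

    # Levels -1, 0, +1 of f(x): marginal plane, hyperplane, marginal plane.
    zz = clf.decision_function(grid).reshape(xx.shape)
    ax.contour(xx, yy, zz, levels=[-1, 0, 1], colors="k",
               linestyles=["--", "-", "--"])

    # Circle the support vectors.
    sv = clf.support_vectors_
    ax.scatter(sv[:, 0], sv[:, 1], s=150, facecolors="none", edgecolors="g")
    ax.set_title(f"{title}\nmargin width = {margins[-1]:.2f}, "
                 f"{len(sv)} support vectors")

plt.tight_layout()
plt.show()
```

With the small `C`, the dashed lines move apart and more points fall inside the margin, which is the soft-margin behaviour described above.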

Q6. SVM Implementation through Iris dataset.
~ Load the iris dataset from the scikit-learn library and split it into a training set and a testing set
~ Train a linear SVM classifier on the training set and predict the labels for the testing set
~ Compute the accuracy of the model on the testing set
~ Plot the decision boundaries of the trained model using two of the features
~ Try different values of the regularisation parameter C and see how it affects the performance of the model.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


~ Load the iris dataset from the scikit-learn library and split it into a training set and a testing set

#df = sns.load_dataset("iris")

In [60]:
from sklearn.datasets import load_iris

In [62]:
dataset = load_iris()

In [64]:
print(dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [66]:
dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [80]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)

Dependent and independent variables

In [95]:
x = df
y = dataset.target

In [97]:
from sklearn.model_selection import train_test_split  # needed for the split below
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10)

~ Train a linear SVM classifier on the training set and predict the labels for the testing set

In [107]:
from sklearn.svm import SVC


In [111]:
svc = SVC(kernel = 'linear')

In [113]:
svc.fit(x_train,y_train)

In [145]:
svc.coef_

array([[-0.04619348,  0.52133922, -1.00307924, -0.4641441 ],
       [ 0.04016251,  0.17403771, -0.55692011, -0.24365261],
       [ 0.81534546,  0.82638331, -1.93619552, -2.00712353]])

In [147]:
# prediction

y_pred = svc.predict(x_test)

In [149]:
y_pred

array([1, 2, 0, 1, 0, 1, 1, 1, 0, 1, 1, 2, 1, 0, 0, 2, 1, 0, 0, 0, 2, 2,
       2, 0, 1, 0, 1, 1, 1, 2, 1, 1, 2, 2, 2, 0, 2, 2])

In [151]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score


~ Compute the accuracy of the model on the testing set

In [154]:
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00        15
           2       1.00      1.00      1.00        12

    accuracy                           1.00        38
   macro avg       1.00      1.00      1.00        38
weighted avg       1.00      1.00      1.00        38

[[11  0  0]
 [ 0 15  0]
 [ 0  0 12]]
1.0
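
~ Plot the decision boundaries of the trained model using two of the features

The classifier above was trained on all four features, so its boundary cannot be drawn directly in 2D. One possible sketch, assuming petal length and petal width as the two features (an arbitrary choice) and reusing the same split settings, is to fit a separate linear `SVC` on those two columns and colour a grid by its predictions. The names `X2`, `svc2`, and the grid resolution are introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X2 = iris.data[:, 2:4]   # petal length and petal width (assumed choice)
y = iris.target

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y, test_size=0.25, random_state=10)

# Separate two-feature model; not the four-feature classifier trained earlier.
svc2 = SVC(kernel="linear").fit(X2_train, y2_train)

# Predict the class at every point of a grid covering the feature range.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 0.5, X2[:, 0].max() + 0.5, 300),
    np.linspace(X2[:, 1].min() - 0.5, X2[:, 1].max() + 0.5, 300))
zz = svc2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap="viridis")
plt.scatter(X2_train[:, 0], X2_train[:, 1], c=y2_train,
            cmap="viridis", edgecolors="k")
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.title("Linear SVM decision regions on two Iris features")
plt.show()

print("test accuracy (two features):", svc2.score(X2_test, y2_test))
```

The straight edges between the coloured regions are the pairwise linear boundaries that `SVC` combines for the three classes.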


Hyperparameter tuning with SVC (default RBF kernel, since gamma has no effect on a linear kernel)

In [167]:
from sklearn.model_selection import GridSearchCV

param_grid = {  'C' : [0.1,1,10,100,1000],
                'gamma': [1,0.1,0.01,0.001,0.0001]
}

In [169]:
grid = GridSearchCV(SVC(),param_grid = param_grid,refit = True,cv = 5,verbose = 3)

In [171]:
grid.fit(x_train,y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits
[CV 1/5] END ....................C=0.1, gamma=1;, score=0.870 total time=   0.0s
[CV 2/5] END ....................C=0.1, gamma=1;, score=0.957 total time=   0.0s
[CV 3/5] END ....................C=0.1, gamma=1;, score=1.000 total time=   0.0s
[CV 4/5] END ....................C=0.1, gamma=1;, score=0.955 total time=   0.0s
[CV 5/5] END ....................C=0.1, gamma=1;, score=0.955 total time=   0.0s
[CV 1/5] END ..................C=0.1, gamma=0.1;, score=0.826 total time=   0.0s
[CV 2/5] END ..................C=0.1, gamma=0.1;, score=0.957 total time=   0.0s
[CV 3/5] END ..................C=0.1, gamma=0.1;, score=0.955 total time=   0.0s
[CV 4/5] END ..................C=0.1, gamma=0.1;, score=0.955 total time=   0.0s
[CV 5/5] END ..................C=0.1, gamma=0.1;, score=0.955 total time=   0.0s
[CV 1/5] END .................C=0.1, gamma=0.01;, score=0.652 total time=   0.0s
[CV 2/5] END .................C=0.1, gamma=0.01

In [175]:
grid.best_params_

{'C': 10, 'gamma': 0.1}

In [191]:
y_pred = grid.predict(x_test)
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.93      0.97        15
           2       0.92      1.00      0.96        12

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.98        38
weighted avg       0.98      0.97      0.97        38

[[11  0  0]
 [ 0 14  1]
 [ 0  0 12]]
0.9736842105263158
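
~ Try different values of the regularisation parameter C and see how it affects the performance of the model.

The grid search above tunes `SVC()` with its default RBF kernel. For the linear classifier specifically, a minimal sketch of the C comparison could look like the following. It reloads the data and repeats the same split so that it runs on its own; the list of C values is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=10)

results = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = SVC(kernel="linear", C=C).fit(X_train, y_train)
    results[C] = {
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
        "n_support": int(model.n_support_.sum()),  # total support vectors
    }
    print(f"C={C:<6} train={results[C]['train_acc']:.3f} "
          f"test={results[C]['test_acc']:.3f} "
          f"support vectors={results[C]['n_support']}")
```

Small values of C favour a wide margin and usually keep more support vectors, while large values penalize margin violations more heavily. On a dataset as small as Iris, the test accuracy from a single 38-sample split is a coarse measure, so cross-validated scores would give a steadier comparison.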
