**AUTHOR: RAIHAN SALMAN BAEHAQI (1103220180)**

**PART I**  

**The Fundamentals of Machine Learning**  

---

**CHAPTER 5 - Support Vector Machines**  

---

Chapter 5 explores Support Vector Machines (SVMs), powerful and versatile models capable of performing linear or nonlinear classification, regression, and outlier detection. SVMs are particularly well suited for classification of complex small- or medium-sized datasets.

---

**Linear SVM Classification**  
The fundamental idea behind SVMs is to fit the widest possible street between the classes. This is called large margin classification.

![Figure5-1.jpg](./05.Chapter-05/Figure5-1.jpg)

Adding more training instances "off the street" won't affect the decision boundary; it is fully determined by the instances located on the edge of the street, called support vectors (circled in Figure 5-1).

![Figure5-2.jpg](./05.Chapter-05/Figure5-2.jpg)

**Soft Margin Classification**  
**Hard margin classification** strictly imposes that all instances must be off the street and on the right side. It has two main issues: it only works if the data is linearly separable, and it is sensitive to outliers.

![Figure5-3.jpg](./05.Chapter-05/Figure5-3.jpg)

**Soft margin classification** finds a good balance between keeping the street as large as possible and limiting margin violations (instances that end up in the middle of the street or on the wrong side).

![Figure5-4.jpg](./05.Chapter-05/Figure5-4.jpg)

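**Important**: In Scikit-Learn, this balance is controlled by the C hyperparameter: a low C allows more margin violations in exchange for a wider street, while a high C allows fewer violations but yields a narrower street. If your SVM model is overfitting, try regularizing it by reducing C.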

Train a linear SVM model:

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=1, loss="hinge")),
    ])

svm_clf.fit(X, y)

Make predictions:

In [None]:
>>> svm_clf.predict([[5.5, 1.7]])
array([1.])

**Important**: Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
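
Although there are no class probabilities, a fitted classifier still exposes a signed score for each instance through `decision_function`. The following is a minimal sketch that rebuilds the iris pipeline from above so that it runs on its own; `random_state=42`, the second test point, and the names `X_new`, `scores`, and `predictions` are additions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
svm_clf.fit(X, y)

# Illustrative inputs: the flower from the text plus a typical setosa
X_new = [[5.5, 1.7], [1.4, 0.2]]
scores = svm_clf.decision_function(X_new)  # signed scores, not probabilities
predictions = svm_clf.predict(X_new)       # class 1 when the score is positive
print(scores, predictions)
```

The sign of each score gives the predicted class, and its magnitude grows as the instance moves away from the decision boundary.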

**Alternative Implementations**  
Alternatively, use the SVC class with a linear kernel:

In [None]:
from sklearn.svm import SVC
SVC(kernel="linear", C=1)

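Use SGDClassifier for online learning or huge datasets, where m is the number of training instances and C is the same regularization hyperparameter as above:

In [None]:
from sklearn.linear_model import SGDClassifier
SGDClassifier(loss="hinge", alpha=1/(m*C))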

This applies Stochastic Gradient Descent to train a linear SVM classifier. It does not converge as fast as LinearSVC, but it is useful for online classification tasks or for datasets that don't fit in memory (out-of-core training).
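
To make the three alternatives concrete, here is a hedged sketch that trains all of them on the same scaled iris data and compares their training accuracy. The `models` and `accuracies` dictionaries, `random_state=42`, and the explicit `max_iter` and `tol` values are choices made for this illustration, not settings taken from the text.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica
X_scaled = StandardScaler().fit_transform(X)

C = 1
m = len(X_scaled)  # number of training instances

models = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", random_state=42),
    "SVC": SVC(kernel="linear", C=C),
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C),
                                   max_iter=1000, tol=1e-3, random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_scaled, y)
    accuracies[name] = model.score(X_scaled, y)
    print(name, model.coef_, model.intercept_, accuracies[name])
```

The learned weights should be in the same ballpark for all three models, although they will not match exactly because the solvers and stopping criteria differ.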

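**Important**: The LinearSVC class regularizes the bias term, so center the training set first by subtracting its mean (this is automatic if you scale the data with StandardScaler). Also set the loss hyperparameter to "hinge", since it is not the default value. Finally, for better performance, set the dual hyperparameter to False unless there are more features than training instances; note that Scikit-Learn rejects dual=False in combination with loss="hinge", so this last tip applies to the default squared hinge loss.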

---

**Nonlinear SVM Classification**  
Although linear SVM classifiers are efficient and work well in many cases, many datasets aren't close to being linearly separable. One approach is adding more features, such as polynomial features, which can result in a linearly separable dataset.

![Figure5-5.jpg](./05.Chapter-05/Figure5-5.jpg)

Test on the moons dataset:

In [None]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15)

polynomial_svm_clf = Pipeline([
        ("poly_features", PolynomialFeatures(degree=3)),
        ("scaler", StandardScaler()),
        ("svm_clf", LinearSVC(C=10, loss="hinge"))
    ])

polynomial_svm_clf.fit(X, y)

![Figure5-6.jpg](./05.Chapter-05/Figure5-6.jpg)

**Polynomial Kernel**  
Adding polynomial features is simple to implement, but a low polynomial degree cannot deal with very complex datasets, while a high polynomial degree creates a huge number of features, making the model too slow.

The kernel trick makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them. Since no features are added, there is no combinatorial explosion of the number of features.
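
The size of that explosion is easy to quantify. The sketch below uses the standard counting result that a full polynomial expansion of n features up to degree d has C(n + d, d) columns when the bias column is included (which is also what PolynomialFeatures produces by default); the helper name `n_polynomial_features` is made up for this example.

```python
from math import comb

def n_polynomial_features(n_features, degree):
    """Number of columns in a full polynomial expansion, bias column included."""
    return comb(n_features + degree, degree)

for degree in (2, 3, 5, 10):
    print(degree,
          n_polynomial_features(2, degree),
          n_polynomial_features(100, degree))
```

With 2 input features and degree 3 this gives 10 columns, but with 100 input features the same degree already gives 176,851 columns.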

Train using polynomial kernel:

In [None]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
    ])

poly_kernel_svm_clf.fit(X, y)

![Figure5-7.jpg](./05.Chapter-05/Figure5-7.jpg)

**Important**: A common approach to finding the right hyperparameter values is grid search. It's often faster to first do a very coarse grid search, then a finer grid search around the best values found.
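
A coarse-then-fine search can be sketched as follows. The grids, `cv=3`, `random_state=42`, and the factor of 3 used to refine C are arbitrary choices for illustration, not recommended values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly")),
])

# Step 1: coarse grid over widely spaced values
coarse_grid = {
    "svm_clf__degree": [2, 3],
    "svm_clf__coef0": [0, 1],
    "svm_clf__C": [0.1, 1, 10],
}
coarse_search = GridSearchCV(pipeline, coarse_grid, cv=3)
coarse_search.fit(X, y)
best = coarse_search.best_params_

# Step 2: finer grid for C around the best coarse value
fine_grid = {
    "svm_clf__degree": [best["svm_clf__degree"]],
    "svm_clf__coef0": [best["svm_clf__coef0"]],
    "svm_clf__C": [best["svm_clf__C"] / 3, best["svm_clf__C"], best["svm_clf__C"] * 3],
}
fine_search = GridSearchCV(pipeline, fine_grid, cv=3)
fine_search.fit(X, y)
print(fine_search.best_params_, fine_search.best_score_)
```

Hyperparameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys start with `svm_clf__`.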

**Similarity Features**  
Another technique for nonlinear problems is adding features computed using a similarity function that measures how much each instance resembles a particular landmark.  

Equation 5-1. Gaussian RBF  
![Eq5-1.jpg](./05.Chapter-05/Eq5-1.jpg)  

This is a bell-shaped function varying from 0 (very far from the landmark) to 1 (at the landmark).  


![Figure5-8.jpg](./05.Chapter-05/Figure5-8.jpg)  

The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features), which becomes very large when the training set is large.
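
As a small worked example, the sketch below computes similarity features with NumPy only. It assumes Equation 5-1 has the usual Gaussian RBF form exp(−γ‖x − ℓ‖²); the value γ = 0.3, the landmarks, and the toy dataset are made up for illustration.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity exp(-gamma * ||x - landmark||^2), between 0 and 1."""
    return np.exp(-gamma * np.sum((x - landmark) ** 2, axis=-1))

gamma = 0.3
x = np.array([-1.0])
sim_to_minus_2 = gaussian_rbf(x, np.array([-2.0]), gamma)  # distance 1
sim_to_plus_1 = gaussian_rbf(x, np.array([1.0]), gamma)    # distance 2
print(round(float(sim_to_minus_2), 2), round(float(sim_to_plus_1), 2))  # 0.74 0.3

# One landmark per training instance: (m, n) becomes (m, m)
X = np.array([[-4.0], [-2.0], [0.0], [1.0], [3.0]])
diffs = X[:, np.newaxis, :] - X[np.newaxis, :, :]
X_similarity = np.exp(-gamma * np.sum(diffs ** 2, axis=-1))
print(X.shape, "->", X_similarity.shape)  # (5, 1) -> (5, 5)
```

Each row of `X_similarity` holds one instance's similarity to every landmark, so the diagonal is all ones (every instance is at distance 0 from its own landmark).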

**Gaussian RBF Kernel**  
Just like polynomial features, similarity features can be useful with any ML algorithm, but they may be computationally expensive to compute on large training sets. Once again, the kernel trick makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them.  

Train with Gaussian RBF kernel:

In [None]:
rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
    ])

rbf_kernel_svm_clf.fit(X, y)


![Figure5-9.jpg](./05.Chapter-05/Figure5-9.jpg)  

**Other Kernels**  
Other kernels exist but are used much more rarely. Some are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string subsequence kernel or kernels based on the Levenshtein distance).

Rule of thumb for kernel selection:
* Always try the linear kernel first (LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or has plenty of features
* If the training set is not too large, try the Gaussian RBF kernel (works well in most cases)
* With spare time and computing power, experiment with other kernels using cross-validation and grid search, especially if there are kernels specialized for your training set's data structure

**Computational Complexity**  
**LinearSVC** is based on the liblinear library, which implements an optimized algorithm for linear SVMs. It doesn't support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n). The algorithm takes longer if you require very high precision, which is controlled by the tolerance hyperparameter ϵ (tol in Scikit-Learn).

**SVC** is based on the libsvm library, which implements an algorithm that supports the kernel trick. Training time complexity is usually between O(m<sup>2</sup> × n) and O(m<sup>3</sup> × n), so it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands). This algorithm is perfect for complex small or medium-sized training sets. It scales well with the number of features, especially with sparse features (when each instance has few nonzero features).  

Table 5-1. Comparison of Scikit-Learn classes for SVM classification  
![Table5-1.jpg](./05.Chapter-05/Table5-1.jpg)  

---

**SVM Regression**  
SVMs support linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (instances off the street). The street width is controlled by hyperparameter ϵ.  

![Figure5-10.jpg](./05.Chapter-05/Figure5-10.jpg)  

Perform linear SVM Regression:

In [None]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

**Nonlinear Regression**  
For nonlinear regression tasks, use a kernelized SVM model.  

![Figure5-11.jpg](./05.Chapter-05/Figure5-11.jpg)  

Use the SVR class:

In [None]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class. LinearSVR scales linearly with the size of the training set (like LinearSVC), while SVR gets much too slow when the training set grows large (like SVC).
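
The two regression snippets above reuse whatever X and y are in memory. For a version that runs on its own, the sketch below generates small synthetic regression datasets first; the data-generating formulas, the seed, and the names `y_linear`, `y_quadratic`, and `inside_street` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

rng = np.random.default_rng(42)
m = 50
X = 2 * rng.random((m, 1))
# Synthetic targets (made up): one roughly linear, one roughly quadratic
y_linear = 4 + 3 * X[:, 0] + rng.normal(size=m)
y_quadratic = 0.2 + 0.1 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(size=m) / 10

# Linear SVM Regression: epsilon sets the width of the street
svm_reg = LinearSVR(epsilon=1.5, random_state=42)
svm_reg.fit(X, y_linear)
inside_street = np.abs(y_linear - svm_reg.predict(X)) <= svm_reg.epsilon
print("fraction of instances on the street:", inside_street.mean())

# Kernelized SVM Regression with a second-degree polynomial kernel
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y_quadratic)
print("R^2 of the kernelized model:", svm_poly_reg.score(X, y_quadratic))
```

Instances that fall inside the street do not contribute to the loss, so adding more of them would not change the fitted model.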

**Note**: SVMs can also be used for outlier detection; see Scikit-Learn's documentation for details.  

---

**Under the Hood**  
This section explains how SVMs make predictions and how their training algorithms work.

Notation convention: The bias term is called b, and the feature weight vector is called w. No bias feature is added to the input feature vectors.

**Decision Function and Predictions**  
The linear SVM classifier model predicts the class of a new instance x by computing the decision function w<sup>T</sup> x + b = w<sub>1</sub>x<sub>1</sub> + ⋯ + w<sub>n</sub>x<sub>n</sub> + b.  

Equation 5-2. Linear SVM classifier prediction  
![Eq5-2.jpg](./05.Chapter-05/Eq5-2.jpg)  

![Figure5-12.jpg](./05.Chapter-05/Figure5-12.jpg)  

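The points where the decision function equals 1 or −1 form a margin around the decision boundary. Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).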
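
The decision function can be reproduced by hand from a trained model. In this sketch, a LinearSVC is fitted on the scaled iris petal features, and its `coef_` and `intercept_` attributes play the roles of w and b; the variable names and `random_state=42` are illustrative choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svm_clf = LinearSVC(C=1, loss="hinge", random_state=42).fit(X, y)

w = svm_clf.coef_[0]       # feature weight vector
b = svm_clf.intercept_[0]  # bias term
scores = X @ w + b         # w^T x + b for every instance
y_pred = (scores >= 0).astype(np.float64)  # positive class when the score is >= 0

print(np.allclose(scores, svm_clf.decision_function(X)))
```

Instances whose score lies strictly between −1 and 1 are inside the street.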

**Training Objective**  
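The slope of the decision function equals the norm of the weight vector, ∥w∥. Dividing this slope by 2 moves the points where the decision function equals ±1 twice as far from the decision boundary, so it multiplies the margin by 2. In other words, the smaller the weight vector w, the larger the margin.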

![Figure5-13.jpg](./05.Chapter-05/Figure5-13.jpg)

So we want to minimize ∥w∥ to get a large margin. To avoid any margin violations (hard margin), the decision function must also be greater than or equal to 1 for all positive training instances and lower than or equal to −1 for all negative training instances. If we define t<sup>(i)</sup> = −1 for negative instances and t<sup>(i)</sup> = 1 for positive instances, this constraint can be written as t<sup>(i)</sup>(w<sup>T</sup>x<sup>(i)</sup> + b) ≥ 1 for all instances.

Equation 5-3. Hard margin linear SVM classifier objective  
![Eq5-3.jpg](./05.Chapter-05/Eq5-3.jpg)  

**Important**: We minimize ½ w<sup>T</sup>w, which equals ½ ∥w∥<sup>2</sup>, rather than minimizing ∥w∥. It has a nice, simple derivative (just w), while ∥w∥ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions.

For soft margin, introduce a slack variable ζ<sup>(i)</sup> ≥ 0 for each instance measuring how much the ith instance is allowed to violate the margin. The C hyperparameter defines the trade-off between keeping slack variables small (reducing margin violations) and making 1/2 w<sup>T</sup>w small (increasing the margin).  

Equation 5-4. Soft margin linear SVM classifier objective  
![Eq5-4.jpg](./05.Chapter-05/Eq5-4.jpg)

**Quadratic Programming**  
The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints, known as Quadratic Programming (QP) problems. Many off-the-shelf solvers are available.

Equation 5-5. Quadratic Programming problem  
![Eq5-5.jpg](./05.Chapter-05/Eq5-5.jpg)

Where:
* p is an n<sub>p</sub>-dimensional vector (n<sub>p</sub> = number of parameters)
* H is an n<sub>p</sub> × n<sub>p</sub> matrix
* f is an n<sub>p</sub>-dimensional vector
* A is an n<sub>c</sub> × n<sub>p</sub> matrix (n<sub>c</sub> = number of constraints)
* b is an n<sub>c</sub>-dimensional vector

One way to train a hard margin linear SVM classifier is to use an off-the-shelf QP solver and pass it the appropriate parameters. The resulting vector p will contain the bias term b = p<sub>0</sub> and the feature weights w<sub>i</sub> = p<sub>i</sub> for i = 1, 2, …, n.
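
To show what "appropriate parameters" could look like, the sketch below builds H, f, A, and the constraint vector for the hard margin problem on a tiny separable toy dataset, assuming the QP is stated as minimizing ½ pᵀHp + fᵀp subject to Ap ≤ b. It uses SciPy's general-purpose SLSQP optimizer as a stand-in for a dedicated QP solver. The dataset and the names `b_qp`, `bias`, and `result` are made up for this illustration (`b_qp` avoids a clash with the bias term b).

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy dataset (made up for this illustration)
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
t = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
m, n = X.shape

# QP parameters, with p = [b, w_1, ..., w_n]
H = np.eye(n + 1)
H[0, 0] = 0.0                                  # the bias term is not penalized
f = np.zeros(n + 1)
A = -t[:, np.newaxis] * np.c_[np.ones(m), X]   # row i is -t_i * [1, x_i]
b_qp = -np.ones(m)                             # A p <= b_qp  <=>  t_i (w.x_i + b) >= 1

result = minimize(
    fun=lambda p: 0.5 * p @ H @ p + f @ p,
    x0=np.zeros(n + 1),
    jac=lambda p: H @ p + f,
    constraints=[{"type": "ineq",
                  "fun": lambda p: b_qp - A @ p,   # must stay >= 0
                  "jac": lambda p: -A}],
    method="SLSQP",
    options={"ftol": 1e-12, "maxiter": 200},
)
bias, w = result.x[0], result.x[1:]
print("b =", bias, "w =", w)
```

For this toy dataset the maximum-margin solution can be worked out by hand as w = (2, 2) and b = −3, so the solver should land close to those values. A dedicated QP solver would be the better choice on anything larger.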

**The Dual Problem**  
Given a constrained optimization problem (the primal problem), it's possible to express a different but closely related problem called its dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can have the same solution. Luckily, the SVM problem meets these conditions, so you can choose to solve either problem; both will have the same solution.

Equation 5-6. Dual form of the linear SVM objective  
![Eq5-6.jpg](./05.Chapter-05/Eq5-6.jpg)  

Once you find the vector α that minimizes this equation (using a QP solver), use Equation 5-7 to compute the w and b that minimize the primal problem.

Equation 5-7. From the dual solution to the primal solution  
![Eq5-7.jpg](./05.Chapter-05/Eq5-7.jpg)  

The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. More importantly, the dual problem makes the kernel trick possible, while the primal does not.
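
The link between the dual and primal solutions can be checked on a fitted model. According to Scikit-Learn's documentation, SVC stores the dual coefficients of the support vectors multiplied by their targets in `dual_coef_`, so summing them against `support_vectors_` should recover the weight vector of a linear-kernel model. This is a sketch under that assumption; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris["data"][:, (2, 3)])
y = (iris["target"] == 2).astype(np.float64)  # Iris virginica

svc = SVC(kernel="linear", C=1).fit(X, y)

alpha_times_t = svc.dual_coef_[0]        # alpha_i * t_i, support vectors only
w = alpha_times_t @ svc.support_vectors_ # weighted sum of the support vectors
b = svc.intercept_[0]                    # bias term as reported by the model

print(np.allclose(w, svc.coef_[0]))
print(len(svc.support_vectors_), "support vectors out of", len(X), "instances")
```

Only the support vectors appear in the sum because α<sup>(i)</sup> is zero for every other training instance.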

**Kernelized SVMs**  
Suppose you want to apply a second-degree polynomial transformation to a two-dimensional training set, then train a linear SVM classifier on the transformed training set.

Equation 5-8. Second-degree polynomial mapping  
![Eq5-8.jpg](./05.Chapter-05/Eq5-8.jpg)  
The transformed vector is 3D instead of 2D.  

Equation 5-9. Kernel trick for a second-degree polynomial mapping  
![Eq5-9.jpg](./05.Chapter-05/Eq5-9.jpg)  

The dot product of the transformed vectors equals the square of the dot product of the original vectors.

**Key insight**: If you apply the transformation ϕ to all training instances, the dual problem will contain the dot product ϕ(x<sup>(i)</sup>)<sup>T</sup> ϕ(x<sup>(j)</sup>). But if ϕ is the second-degree polynomial transformation, you can replace this dot product simply by ((x<sup>(i)</sup>)<sup>T</sup> x<sup>(j)</sup>)<sup>2</sup>. You don't need to transform the training instances at all; just replace the dot product by its square. This makes the whole process much more computationally efficient.  

The function K(a, b) = (a<sup>T</sup>b)<sup>2</sup> is a second-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product ϕ(a)<sup>T</sup>ϕ(b) based only on the original vectors a and b, without having to compute (or even know about) the transformation ϕ.
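
This identity is easy to verify numerically. The sketch below assumes the mapping in Equation 5-8 is the usual ϕ(x) = (x₁², √2·x₁x₂, x₂²); the two test vectors and the function names are arbitrary.

```python
import numpy as np

def phi(x):
    """Second-degree polynomial mapping of a 2D vector (assumed form of Equation 5-8)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Second-degree polynomial kernel K(a, b) = (a^T b)^2."""
    return (a @ b) ** 2

a = np.array([2.0, 3.0])
b = np.array([1.0, -1.0])

explicit = phi(a) @ phi(b)       # transform first, then take the dot product
kernelized = poly2_kernel(a, b)  # stay in the original 2D space
print(explicit, kernelized)
```

Both values agree (up to floating-point rounding), but the kernel never builds the 3D vectors.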

Equation 5-10. Common kernels  
![Eq5-10.jpg](./05.Chapter-05/Eq5-10.jpg)  

**Mercer's Theorem**  
According to Mercer's theorem, if a function K(a,b) respects Mercer's conditions (e.g., K must be continuous and symmetric so K(a,b)=K(b,a)), then there exists a function ϕ that maps a and b into another space (possibly with much higher dimensions) such that K(a,b)=ϕ(a)<sup>T</sup> ϕ(b). You can use K as a kernel because you know ϕ exists, even if you don't know what ϕ is. For the Gaussian RBF kernel, ϕ maps each training instance to an infinite-dimensional space.  

**Note**: Some frequently used kernels (such as the sigmoid kernel) don't respect all of Mercer's conditions, yet they generally work well in practice.  

Equation 5-7 shows how to go from the dual solution to the primal solution for a linear SVM classifier, but if you apply the kernel trick, you end up with equations that include ϕ(x<sup>(i)</sup>). In fact, w must have the same number of dimensions as ϕ(x<sup>(i)</sup>), which may be huge or even infinite, so you can't compute it. But you can make predictions without knowing w by plugging the formula for w from Equation 5-7 into the decision function.  

Equation 5-11. Making predictions with a kernelized SVM  
![Eq5-11.jpg](./05.Chapter-05/Eq5-11.jpg)  

Since α<sup>(i)</sup>≠0 only for support vectors, making predictions involves computing the dot product of the new input vector x<sup>(n)</sup> with only the support vectors, not all training instances.  
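
The same computation can be reproduced from a fitted Scikit-Learn model. This sketch assumes, as the documentation states, that `dual_coef_` holds α<sup>(i)</sup>t<sup>(i)</sup> for the support vectors, and it evaluates the Gaussian RBF kernel with `rbf_kernel`; the names `K`, `X_new`, and `manual_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

gamma = 5
svc = SVC(kernel="rbf", gamma=gamma, C=0.001).fit(X_scaled, y)

X_new = X_scaled[:5]  # pretend these are new instances
# Kernel values between each support vector and each new instance
K = rbf_kernel(svc.support_vectors_, X_new, gamma=gamma)
# Weighted sum over the support vectors only, plus the bias term
manual_scores = svc.dual_coef_[0] @ K + svc.intercept_[0]

print(np.allclose(manual_scores, svc.decision_function(X_new)))
```

The weight vector w is never formed; only kernel evaluations against the support vectors are needed.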

Equation 5-12. Using the kernel trick to compute the bias term  
![Eq5-12.jpg](./05.Chapter-05/Eq5-12.jpg)


**Online SVMs**  
Online learning means learning incrementally, typically as new instances arrive.

For linear SVM classifiers, one method for implementing an online SVM is to use Gradient Descent (e.g., using SGDClassifier) to minimize the cost function derived from the primal problem. Unfortunately, Gradient Descent converges much more slowly than the methods based on QP.

Equation 5-13. Linear SVM classifier cost function  
![Eq5-13.jpg](./05.Chapter-05/Eq5-13.jpg)  

The first sum pushes the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations. An instance's margin violation equals 0 if it is located off the street and on the correct side; otherwise it is proportional to the distance to the correct side of the street. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.  
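
A minimal NumPy sketch of this idea follows. It assumes the cost function has the usual form J(w, b) = ½ wᵀw + C Σ max(0, 1 − t<sup>(i)</sup>(wᵀx<sup>(i)</sup> + b)) and minimizes it with plain batch subgradient descent. The synthetic data, learning rate, and number of epochs are arbitrary choices, and this is not how SGDClassifier is implemented.

```python
import numpy as np

def svm_cost(w, b, X, t, C):
    """Margin term plus C times the total of all margin violations."""
    margins = t * (X @ w + b)
    return 0.5 * w @ w + C * np.sum(np.maximum(0, 1 - margins))

def svm_subgradient(w, b, X, t, C):
    """Subgradient of the cost; only instances violating the margin contribute."""
    margins = t * (X @ w + b)
    violating = margins < 1
    grad_w = w - C * (t[violating, np.newaxis] * X[violating]).sum(axis=0)
    grad_b = -C * t[violating].sum()
    return grad_w, grad_b

# Two well-separated synthetic blobs (made up for this illustration)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[2, 2], scale=0.5, size=(20, 2)),
               rng.normal(loc=[-2, -2], scale=0.5, size=(20, 2))])
t = np.array([1.0] * 20 + [-1.0] * 20)

w, b, C, eta = np.zeros(2), 0.0, 1.0, 0.01
initial_cost = svm_cost(w, b, X, t, C)
for epoch in range(500):
    grad_w, grad_b = svm_subgradient(w, b, X, t, C)
    w = w - eta * grad_w
    b = b - eta * grad_b
final_cost = svm_cost(w, b, X, t, C)
accuracy = np.mean(np.sign(X @ w + b) == t)
print(initial_cost, final_cost, accuracy)
```

An online variant would apply the same update to one instance (or one mini-batch) at a time as the data arrives.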

**Hinge Loss**  
The function max(0, 1 − t) is called the hinge loss function. It equals 0 when t ≥ 1. Its derivative (slope) equals −1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but just like for Lasso Regression, you can still use Gradient Descent by using any subderivative at t = 1 (any value between −1 and 0).
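
The hinge loss and one valid choice of subderivative can be written in a few lines of NumPy. The function names and the `at_kink` parameter are made up for this sketch; any value between −1 and 0 would be an acceptable choice at t = 1.

```python
import numpy as np

def hinge_loss(t):
    """max(0, 1 - t), applied elementwise."""
    return np.maximum(0, 1 - t)

def hinge_subderivative(t, at_kink=0.0):
    """-1 for t < 1, 0 for t > 1, and a chosen value in [-1, 0] at t = 1."""
    return np.where(t < 1, -1.0, np.where(t > 1, 0.0, at_kink))

t = np.array([-1.0, 0.0, 1.0, 2.0])
print(hinge_loss(t))           # [2. 1. 0. 0.]
print(hinge_subderivative(t))  # [-1. -1.  0.  0.]
```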
  
It's also possible to implement online kernelized SVMs, as described in the papers "Incremental and Decremental Support Vector Machine Learning" and "Fast Kernel Classifiers with Online and Active Learning". These kernelized SVMs are implemented in Matlab and C++. For large-scale nonlinear problems, you may want to consider using neural networks instead.

