# Chapter 5: Support Vector Machines

---

## Introduction

Support Vector Machines (SVMs) are powerful ML models for linear and nonlinear classification, regression, and outlier detection. They are particularly well suited to complex small- to medium-sized datasets, and they are widely used enough to belong in any ML toolkit. This chapter covers SVM concepts, usage, and inner workings.

In [10]:
def warn(*args, **kwargs):
  pass
import warnings
warnings.warn = warn

---

## 5.1 Linear Classification

The intuition behind SVMs is large margin classification: an SVM aims to find the widest possible margin (or “street”) that separates the classes. Instead of accepting any boundary that happens to separate the training data, which may overfit, it chooses the decision boundary that stays as far as possible from the closest training instances, which tends to generalize better.

<img src='Images\image1.png' length=800 width=800>

**Note**: SVMs are sensitive to feature scales. Without scaling, features with large ranges dominate the margin and the decision boundary is distorted. Applying feature scaling (e.g., `StandardScaler`) gives a more balanced margin.
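To make the effect of scaling concrete, here is a minimal sketch on a hypothetical four-point dataset (the values are made up for illustration) whose second feature has a much larger range than the first. It fits a linear `SVC` before and after `StandardScaler` and prints the learned weights so the two can be compared.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: feature 0 spans 1-5, feature 1 spans 20-80
X_toy = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
y_toy = np.array([0, 0, 1, 1])

# Linear SVM on the raw features
raw_clf = SVC(kernel="linear", C=100).fit(X_toy, y_toy)

# Same model after standardizing each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_clf = SVC(kernel="linear", C=100).fit(X_scaled, y_toy)

print("raw weights:   ", raw_clf.coef_[0])
print("scaled weights:", scaled_clf.coef_[0])
print("scaled means:", X_scaled.mean(axis=0), "stds:", X_scaled.std(axis=0))
```

Both models separate the four points, but the weights are expressed in very different units, so the margin the raw model maximizes is measured mostly along the large-range feature.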

### Soft Margin Classification

**Hard margin classification** requires all instances to be correctly separated with no violations. It works only with linearly separable data and is highly sensitive to outliers: a single outlier can distort the margin and harm generalization.

<img src='Images\image2.png' length=800 width=800>

**Soft margin classification** allows some violations so it can handle outliers and data that is not linearly separable. It balances keeping the margin as wide as possible against limiting margin violations.

In Scikit-Learn, the **`C`** hyperparameter controls this:

- **Low `C`** → more violations, wider margin, better generalization  
- **High `C`** → fewer violations, narrower margin, risk of overfitting
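The trade-off controlled by `C` can be checked numerically. The sketch below is an illustration under some assumptions: it uses `SVC(kernel="linear")` rather than `LinearSVC` only because `SVC` exposes its support vectors, and the two `C` values are arbitrary. It reports the margin width, computed as `2 / ||w||`, and the number of support vectors for each setting.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, (2, 3)])  # petal length, width
y_iris = (iris["target"] == 2).astype(int)                        # Iris virginica vs. rest

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X_iris, y_iris)
    margin_width = 2 / np.linalg.norm(clf.coef_)   # distance between the two margin lines
    n_support = int(clf.n_support_.sum())          # instances on or inside the margin
    results[C] = (margin_width, n_support)
    print(f"C={C}: margin width={margin_width:.2f}, support vectors={n_support}")
```

The low-`C` model should show the wider street and more instances on or inside it.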

<img src='Images\image3.png' length=800 width=800>

In [1]:
# Using Scikit-learn Implementation
import numpy as np
from sklearn import datasets
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [4]:
df = datasets.load_iris()   # Loading Iris dataset
X = df['data'][:, (2, 3)]   # Working with petal length and width
y = (df['target']==2).astype(np.float64)    # Converting target to binary form

# Building a pipeline (scaler + model) so the same preprocessing is applied at fit and predict time
svm_clf = Pipeline([
    ('scalar', StandardScaler()),
    ('Linear_SVC', LinearSVC(C=1, loss='hinge'))
])

svm_clf.fit(X, y)

In [5]:
# Reviewing performance
svm_clf.predict([[5.5, 1.7]])

array([1.])

### Linear SVM Alternatives

- Use `SVC(kernel="linear", C=1)` for a linear SVM with the `SVC` class.
- Or use `SGDClassifier(loss="hinge", alpha=1/(m*C))` for training with SGD.

`SGDClassifier` is slower to converge than `LinearSVC` but better suited for **online learning** or **large-scale datasets** (out-of-core training).
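As a rough sketch of the SGD alternative, the snippet below trains `SGDClassifier` on the same Iris features with `alpha = 1/(m*C)`. The `random_state` value and the step names are assumptions added for reproducibility and readability.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]             # petal length and width
y_iris = (iris["target"] == 2).astype(int)   # Iris virginica vs. rest

m = len(X_iris)   # number of training instances
C = 1             # same regularization strength as the LinearSVC example

sgd_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDClassifier(loss="hinge", alpha=1 / (m * C),
                          max_iter=1000, tol=1e-3, random_state=42)),
])
sgd_clf.fit(X_iris, y_iris)

sgd_accuracy = sgd_clf.score(X_iris, y_iris)
print(sgd_clf.predict([[5.5, 1.7]]))
print("training accuracy:", sgd_accuracy)
```

For data that does not fit in memory, `SGDClassifier` also offers `partial_fit`, which is what makes out-of-core training possible.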

In [45]:
# Trying kernelized form of SVM
from sklearn.svm import SVC

svm_clf = Pipeline([
    ('scalar', StandardScaler()),
    ('kernel_SVC', SVC(kernel='linear', C=1))
])

svm_clf.fit(X, y)

---

## 5.2 Nonlinear SVM Classification

Linear SVMs may struggle with nonlinear data. A common solution is to **add more features** (e.g., polynomial features).

Example:  
- Original 1D data (feature `x`) is not linearly separable.  
- Adding a second feature (`x²`) transforms it into a **linearly separable 2D dataset**.
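The 1D example can be reproduced in a few lines. This is a sketch on a hypothetical nine-point dataset (the points and labels are assumptions chosen so the positive class sits in the middle of the range); it compares a linear SVM trained on `x` alone with one trained on `x` and `x²`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1D dataset: the positive class sits in the middle of the range
x = np.linspace(-4, 4, 9)
y_1d = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

X_1d = x.reshape(-1, 1)            # original single feature
X_2d = np.column_stack([x, x**2])  # add the squared feature

acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y_1d).score(X_1d, y_1d)
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y_1d).score(X_2d, y_1d)

print(f"accuracy with x only:     {acc_1d:.2f}")
print(f"accuracy with x and x**2: {acc_2d:.2f}")
```

A single threshold on `x` cannot isolate the middle interval, while a threshold on `x²` can, which is why only the second model classifies every point correctly.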

<img src='Images\image4.png' length=800 width=800>

In [44]:
# Adding nonlinearity with the PolynomialFeatures class in the pipeline
from sklearn.preprocessing import PolynomialFeatures

# Working with a slightly harder dataset
X, y = datasets.make_moons(n_samples=100, noise=0.15)

svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scalar', StandardScaler()),
    ('linear_svc', LinearSVC(C=10, loss='hinge'))
])

svm_clf.fit(X, y)

### Polynomial Features vs. Kernel Trick

- **Polynomial features** help handle nonlinearity but:
  - Low degree → can’t capture complexity  
  - High degree → too many features, slow model

- **Kernel trick** (used in `SVC`) simulates adding high-degree polynomial features **without explicitly creating them**, avoiding feature explosion. Efficient and powerful for complex data.
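A small numeric check shows what "without explicitly creating them" means. The degree-2 mapping `phi` below is a standard textbook illustration, not something taken from this chapter's code: the dot product of two mapped vectors equals the square of the dot product of the original vectors, so the expanded features never need to be built.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial mapping of a 2D vector."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product in the expanded 3D feature space
kernel = (a @ b) ** 2        # same value computed in the original 2D space

print(explicit, kernel)
```

Both expressions evaluate to the same number, but the second one never leaves the original 2D space.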

### Polynomial Kernel & Hyperparameters

- `SVC(kernel="poly", degree=3)` trains an SVM with a **polynomial kernel**.
- **Higher degrees** → more complex decision boundaries (risk of overfitting).
- **Lower degrees** → simpler model (risk of underfitting).
- **`coef0`** controls the balance between high- and low-degree polynomial influence.

These hyperparameters can be tuned with a grid search or a randomized search (e.g., Scikit-Learn's `GridSearchCV` or `RandomizedSearchCV`).
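A possible tuning setup is sketched below. The search ranges, the 3-fold split, and the `random_state` are illustrative assumptions rather than recommended values; the `svc__` prefix refers to the step name chosen in this sketch's pipeline.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = datasets.make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="poly")),
])

# Illustrative search space; the ranges are assumptions, not recommendations
param_grid = {
    "svc__degree": [2, 3, 4],
    "svc__coef0": [0, 1, 10],
    "svc__C": [0.1, 1, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_moons, y_moons)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.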

In [43]:
# Or we may use the 'poly' kernel with the SVC

svm_clf = Pipeline([
    ('scalar', StandardScaler()),
    ('kernel_svc', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

svm_clf.fit(X, y)


### Similarity Features

Another way to handle nonlinearity is using **similarity functions** based on **landmarks**.

- A popular one is the **Gaussian RBF**: $\phi_\gamma(\mathbf{x}, \ell) = \exp\left(-\gamma \lVert \mathbf{x} - \ell \rVert^2\right)$
- Outputs values close to **1 near the landmark**, and **0 far away**.
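- ⚠️ Downside: if every training instance is used as a landmark, a training set with m instances and n features becomes one with m instances and m features, which means **high memory and computation cost** on large datasets.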

Example:  
- Add landmarks (e.g., at `x = -2` and `x = 1`)  
- Compute similarity-based features for each instance (e.g., with γ = 0.3, `x = -1` → features ≈ `[0.74, 0.30]`)

➡️ The transformed dataset becomes **linearly separable**.
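The example features can be recomputed directly from the Gaussian RBF definition. In the sketch below, `gamma = 0.3` is an assumed value, chosen because it reproduces the numbers quoted in the example; the helper name `gaussian_rbf` is hypothetical.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity between 1D points x and a landmark under the Gaussian RBF."""
    return np.exp(-gamma * (x - landmark) ** 2)

gamma = 0.3                           # assumed value; it reproduces the features quoted above
landmarks = np.array([-2.0, 1.0])     # the two landmarks from the example
x = -1.0                              # the instance being transformed

features = gaussian_rbf(x, landmarks, gamma)
print(np.round(features, 2))
```

The instance is 1 unit from the first landmark and 2 units from the second, giving `exp(-0.3)` and `exp(-1.2)`, which round to 0.74 and 0.30.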


<img src='Images\image5.png' length=800 width=800>

### Gaussian RBF Kernel

- Like polynomial/similarity features, RBF kernel **adds features implicitly** via the **kernel trick**.
- Avoids computational cost of explicitly computing similarity features.


In [46]:
# Avoiding the cost of explicit similarity features with the kernel trick

svm_clf = Pipeline([
    ('scalar', StandardScaler()),
    ('kernel_svm', SVC(kernel='rbf', C=0.001, gamma=5))
])

svm_clf.fit(X, y)
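With the RBF kernel, `gamma` and `C` both act as regularization knobs: a larger `gamma` narrows each instance's bell-shaped range of influence and makes the boundary more irregular, while a larger `C` tolerates fewer violations. The sketch below compares four combinations on the moons data; the particular values and the `random_state` are assumptions for illustration, and training accuracy is used only as a quick proxy for how tightly each model fits.

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = datasets.make_moons(n_samples=100, noise=0.15, random_state=42)

train_acc = {}
for gamma, C in [(0.1, 0.001), (0.1, 1000), (5, 0.001), (5, 1000)]:
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="rbf", gamma=gamma, C=C)),
    ])
    clf.fit(X_moons, y_moons)
    train_acc[(gamma, C)] = clf.score(X_moons, y_moons)   # fit quality on the training set
    print(f"gamma={gamma}, C={C}: training accuracy={train_acc[(gamma, C)]:.2f}")
```

If a model overfits, reducing `gamma` or `C` is a reasonable first step; if it underfits, increasing them is worth trying.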

### Rare & Specialized Kernels

- Some kernels handle **specific data types**, like:
  - **String kernels** for text or DNA (e.g., subsequence kernel, Levenshtein kernel)

> **TIP: Choosing a Kernel**
> - Try **linear kernel** first (`LinearSVC`) — fast and good for large/feature-rich data
> - If data is smaller, try **RBF kernel** — works well in most cases
> - Experiment with other kernels if time/resources allow, especially for structured data


### Computational Complexity of SVMs

- **LinearSVC (liblinear):**
  - **No kernel trick**
  - Scales almost linearly: **O(m × n)**
  - Faster and suitable for large datasets
  - Precision controlled by `tol` (default is usually fine)

- **SVC (libsvm):**
  - **Supports kernel trick**
  - Complexity: usually between **O(m² × n)** and **O(m³ × n)**
  - Slower for large datasets, better for **complex small/medium sets**
  - Scales well with the number of features, especially **sparse features** (few non-zero features per instance)

> 📊 Use **LinearSVC** for large/linear tasks; use **SVC** for nonlinear or smaller datasets.
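One informal way to see the difference in scaling is to time both classes as the training set grows. The snippet below is only a sketch: the dataset sizes, feature count, and `max_iter` are arbitrary assumptions, and the measured times depend heavily on the machine and library version.

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

timings = {}
for m in (500, 2000):
    # Synthetic classification data with m instances and 20 features
    X_big, y_big = make_classification(n_samples=m, n_features=20, random_state=42)
    for name, model in [("LinearSVC", LinearSVC(C=1, max_iter=5000)),
                        ("SVC(rbf)", SVC(kernel="rbf", C=1))]:
        start = time.perf_counter()
        model.fit(X_big, y_big)
        timings[(name, m)] = time.perf_counter() - start   # wall-clock training time
        print(f"{name:10s} m={m:5d}: {timings[(name, m)]:.4f} s")
```

Extending the loop to larger values of `m` makes the gap between the two growth rates easier to see.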

---

## 5.3 SVM Regression

SVM regression reuses the margin idea with a different objective:

- SVMs can handle both **classification and regression**.
- For regression, the goal is reversed:
  - Try to fit **as many instances as possible within a margin (ε)**.
  - Limit margin violations (instances outside the margin).
- The **ε (epsilon)** hyperparameter controls the **width of the margin (or street)**.

> ✅ Larger ε → wider margin, fewer support vectors.  
> ✅ Smaller ε → narrower margin, more support vectors, tighter fit.

- The model is **ε-insensitive**: adding more training instances inside the margin does not change its predictions.
- Use `LinearSVR` for **linear** SVM regression, or `SVR` with a kernel for **nonlinear** regression.


In [47]:
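from sklearn.svm import LinearSVR

# Synthetic noisy linear data: the X, y above hold moons class labels, not a regression target
rng = np.random.default_rng(42)
X_reg = 2 * rng.random((50, 1))
y_reg = 4 + 3 * X_reg.ravel() + rng.normal(size=50)
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X_reg, y_reg)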
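The relationship between ε and the number of support vectors can be checked on synthetic data. This sketch makes two assumptions: the data is a made-up noisy line, and it uses `SVR(kernel="linear")` instead of `LinearSVR` because `SVR` exposes the indices of its support vectors through `support_`.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy linear data (an assumption for illustration)
rng = np.random.default_rng(42)
X_line = 2 * rng.random((50, 1))
y_line = 4 + 3 * X_line.ravel() + rng.normal(size=50)

n_sv = {}
for epsilon in (0.1, 1.5):
    reg = SVR(kernel="linear", C=1, epsilon=epsilon).fit(X_line, y_line)
    n_sv[epsilon] = len(reg.support_)   # instances on or outside the epsilon-tube
    print(f"epsilon={epsilon}: {n_sv[epsilon]} support vectors out of {len(X_line)}")
```

With the wider tube, most instances fall inside the margin and stop being support vectors.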

### Outlier Detection with SVMs

- **SVMs can also be used for outlier/anomaly detection.**
- Useful for identifying rare events or data points that deviate significantly from the norm.
- Check out Scikit-Learn’s [`OneClassSVM`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html) for implementation.
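A minimal sketch of outlier detection with `OneClassSVM` follows. The training data is synthetic and the `nu` value is an assumption; `nu` is an upper bound on the fraction of training instances treated as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic "normal" data clustered around the origin (an assumption for illustration)
rng = np.random.default_rng(42)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit on normal data only; no labels are needed
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

X_new = np.array([[0.0, 0.0], [8.0, 8.0]])   # one typical point, one far from the cluster
labels = detector.predict(X_new)             # +1 = inlier, -1 = outlier
print(labels)
```

The point at the center of the cluster is labeled an inlier, while the distant point is flagged as an outlier.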


---

## Conclusion

Support Vector Machines (SVMs) are powerful tools for both classification and regression. They work well on small to medium-sized datasets, especially when decision boundaries are complex.

- **Linear SVMs** are fast and work well with high-dimensional data.
- **Nonlinear SVMs** use the **kernel trick** (e.g., RBF, polynomial) to handle complex patterns.
- Key hyperparameters: `C`, `gamma`, `degree` (for poly kernels).
- SVMs also support **regression (SVR)** and **outlier detection**.

SVMs are a solid choice for many ML tasks—just be mindful of their computational cost on large datasets.