Video: [SVM's by StatQuest with visuals](https://www.youtube.com/watch?v=efR1C6CvhmE&list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF&index=72&ab_channel=StatQuestwithJoshStarmer)
---

# **Understanding Support Vector Machines (SVMs)**  

## **1. Introduction: Classifying Mice by Mass**  
We measured the mass of several mice:  
- **Red dots** = Non-obese mice  
- **Green dots** = Obese mice  

We set a **threshold**:  
- **Less than the threshold** → **Not obese**  
- **More than the threshold** → **Obese**  

### **Issue with a Simple Threshold**  
What if a new observation falls just above the threshold? It would be classified as obese even though it sits much **closer** to the non-obese group, which doesn't make sense.  

### **Improving the Threshold**  
Instead of using an arbitrary threshold, we:  
1. **Identify edge observations** of each cluster.  
2. **Use the midpoint** between these edge observations as the threshold.  

Now, new observations are classified based on which group they are **closer to**.  
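
The midpoint rule above can be sketched in a few lines of Python. The mouse masses below are made-up values for illustration, and the function names `midpoint_threshold` and `classify` are hypothetical helpers, not part of any library.

```python
# Hypothetical mouse masses in grams; these values are made up for illustration.
not_obese = [12.0, 14.5, 15.0, 17.0]
obese = [26.0, 28.5, 30.0, 33.0]

def midpoint_threshold(low_group, high_group):
    """Return the midpoint between the two edge observations."""
    low_edge = max(low_group)    # heaviest non-obese mouse
    high_edge = min(high_group)  # lightest obese mouse
    return (low_edge + high_edge) / 2

def classify(mass, threshold):
    """Label a new mass by the side of the threshold it falls on."""
    return "obese" if mass > threshold else "not obese"

threshold = midpoint_threshold(not_obese, obese)
print(threshold)                  # 21.5
print(classify(19.0, threshold))  # not obese
print(classify(24.0, threshold))  # obese
```

With the threshold at the midpoint, a new mouse is labeled by whichever edge observation it is nearer to.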

---

## **2. Understanding Margins**  
### **What is a Margin?**  
- The **shortest distance** between the observations and the threshold.  
- When the threshold sits exactly halfway between the two edge observations, the **margin is as large as it can be**.  

### **Why Maximize the Margin?**  
- Moving the threshold **left** or **right** reduces the margin.  
- The best threshold is the one with the **largest margin**.  
- This is called a **Maximal Margin Classifier**.  

**🔔 Terminology Alert:**  
A **Maximal Margin Classifier** is a classifier that **maximizes the margin** between the two groups.  
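
To make the idea concrete, here is a minimal sketch that computes the margin for several candidate thresholds and picks the largest. The masses and the candidate thresholds are made-up values, and `margin` is a hypothetical helper.

```python
not_obese = [12.0, 14.5, 15.0, 17.0]  # hypothetical masses (g)
obese = [26.0, 28.5, 30.0, 33.0]      # hypothetical masses (g)

def margin(threshold, group_a, group_b):
    """Shortest distance from the threshold to any observation."""
    return min(abs(x - threshold) for x in group_a + group_b)

candidates = [18.0, 20.0, 21.5, 23.0, 25.0]
margins = {t: margin(t, not_obese, obese) for t in candidates}
best = max(margins, key=margins.get)

print(margins)  # {18.0: 1.0, 20.0: 3.0, 21.5: 4.5, 23.0: 3.0, 25.0: 1.0}
print(best)     # 21.5
```

The margin peaks at 21.5, the midpoint between the edge observations (17.0 and 26.0), and shrinks as the threshold moves to either side.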

---

## **3. The Problem with Maximal Margin Classifiers**  
What if there’s an **outlier** in the training data?  
- The margin will shrink drastically.  
- The classifier will be **too sensitive** to outliers.  
- **New observations** may be misclassified.  
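
A small sketch of this sensitivity, using made-up masses and a single hypothetical outlier (a non-obese mouse at 25 g):

```python
def maximal_margin_threshold(low_group, high_group):
    """Midpoint between the edge observations of the two groups."""
    return (max(low_group) + min(high_group)) / 2

not_obese = [12.0, 14.5, 15.0, 17.0]  # hypothetical masses (g)
obese = [26.0, 28.5, 30.0, 33.0]      # hypothetical masses (g)

clean = maximal_margin_threshold(not_obese, obese)

# Add one non-obese outlier whose mass sits right next to the obese group.
with_outlier = maximal_margin_threshold(not_obese + [25.0], obese)

print(clean)                      # 21.5
print(with_outlier)               # 25.5
print(min(obese) - with_outlier)  # 0.5 (the margin was 4.5 without the outlier)

# A new 24 g mouse is far closer to the bulk of the obese group, yet it falls
# below the outlier-driven threshold and would be labeled not obese.
print(24.0 > clean, 24.0 > with_outlier)  # True False
```

One unusual training observation is enough to drag the threshold from 21.5 to 25.5.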

### **Can We Do Better? Yes!**  
We allow **some misclassifications** to create a **more robust threshold**.  

---

## **4. Soft Margins and the Bias-Variance Tradeoff**  
### **Allowing Misclassifications**  
Instead of forcing a strict margin, we allow:  
- **Some misclassified points**.  
- **Some correctly classified points** to be inside the margin.  

This helps generalize the model and reduces overfitting.  

### **Bias-Variance Tradeoff**  
- **Strict classifiers** (no misclassifications) → **Low bias, high variance**  
- **Soft margin classifiers** (allow misclassifications) → **Higher bias, lower variance**  

🔔 **Terminology Alert:**  
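- The **Soft Margin** is the distance between the observations and the threshold when misclassifications are allowed.  
- A **Soft Margin Classifier** is also called a **Support Vector Classifier (SVC)**.  
- The observations on the edge of the soft margin and inside it are called **Support Vectors**.  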
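
For reference, the soft margin is commonly written as the optimization problem below. The notation is the standard textbook formulation rather than anything taken from the video: $w$ and $b$ define the threshold, $y_i \in \{-1, +1\}$ are the class labels, $\xi_i$ are slack variables, and $C$ is a penalty chosen by the user (typically via cross-validation).

$$
\min_{w, b, \xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
$$

An observation with $\xi_i = 0$ lies outside the margin, $0 < \xi_i \le 1$ means it is inside the margin but still on the correct side, and $\xi_i > 1$ means it is misclassified. A larger $C$ tolerates fewer violations, which pushes the classifier toward the strict (low bias, high variance) end of the tradeoff.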

---

## **5. Support Vector Classifiers in Higher Dimensions**  
### **What if We Have More Features?**  
- **1D Data** → Classifier is a **point**.  
- **2D Data** → Classifier is a **line**.  
- **3D Data** → Classifier is a **plane**.  
- **4D+ Data** → Classifier is a **hyperplane**.  

🔔 **Terminology Alert:**  
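A **Hyperplane** is a flat boundary with one dimension fewer than the space it divides. Technically, the point, line, and plane above are all hyperplanes, but the term is usually reserved for cases we cannot draw.  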
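
Whatever the number of features, classifying with a hyperplane comes down to checking the sign of $w \cdot x + b$. The sketch below assumes hand-picked weights and offsets purely for illustration; `side_of_hyperplane` is a hypothetical helper.

```python
def side_of_hyperplane(weights, bias, point):
    """Return +1 or -1 depending on the sign of w . x + b."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1

# 1D: the "hyperplane" is the point x = 21.5 (hypothetical threshold).
print(side_of_hyperplane([1.0], -21.5, [24.0]))           # 1
# 2D: the hyperplane is the line x1 + x2 = 10.
print(side_of_hyperplane([1.0, 1.0], -10.0, [3.0, 4.0]))  # -1
# 4D: the same rule applies, even though we cannot draw it.
print(side_of_hyperplane([1.0, -2.0, 0.5, 3.0], 1.0, [2.0, 1.0, 4.0, 0.0]))  # 1
```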

---

## **6. The Problem with Overlapping Data**  
What if the **two groups overlap** heavily?  
Example:  
- **Drug dosages** and patient responses.  
- The drug **only works in a certain range**.  

No single threshold separates the two groups, so a **Support Vector Classifier** misclassifies many observations wherever it is placed. We need a **better solution**.  

---

## **7. Introduction to Support Vector Machines (SVMs)**  
### **How Do We Improve?**  
We introduce **Support Vector Machines (SVMs)**, which work in **higher dimensions**.  

### **How Do SVMs Work?**  
1. **Start with low-dimensional data.**  
2. **Transform it into a higher-dimensional space.**  
3. **Find a Support Vector Classifier in the higher dimension.**  

---

## **8. The Role of Kernel Functions**  
### **Why Do We Need a Kernel Function?**  
- We need a way to **transform data into higher dimensions**.  
- Different transformations work for different problems.  

### **Polynomial Kernel**  
- Uses **polynomial functions** to add dimensions.  
- The **degree (D)** controls the transformation.  
  - **D = 1** → No transformation.  
  - **D = 2** → Squared features.  
  - **D = 3** → Cubed features, etc.  
- **Cross-validation** helps choose the best **D**.  

### **Radial Basis Function (RBF) Kernel**  
- Maps data into **infinite dimensions**.  
- Works similarly to **weighted nearest neighbors**.  
- **Close points** have more influence on classification.  

🔔 **Terminology Alert:**  
The **Kernel Trick** allows us to compute high-dimensional relationships **without explicitly transforming the data**.  

---

## **9. Summary and Final Thoughts**  
### **Key Takeaways:**  
1. **Maximal Margin Classifier** maximizes the margin but is **too sensitive to outliers**.  
2. **Soft Margin Classifiers (Support Vector Classifiers)** allow misclassifications for better generalization.  
3. **Support Vector Machines (SVMs)** solve non-linear classification problems by mapping data into **higher dimensions** using **kernels**.  
4. **Polynomial and Radial Kernels** systematically transform data to improve classification.  

### **Final BAM!**  
Support Vector Machines are **powerful tools** in machine learning when dealing with **complex, overlapping, and non-linearly separable data**.  

**Quest on!** 🚀  

---

# **Understanding the Polynomial Kernel in Support Vector Machines (SVMs)**  

The **polynomial kernel** is a method used in **Support Vector Machines (SVMs)** to classify data that is **not linearly separable** in its original form. Instead of trying to separate the data in its current **low-dimensional space**, we use a **polynomial function** to **map** the data into a **higher-dimensional space**, where it becomes easier to separate using a hyperplane.  

---

## **1. Polynomial Kernel Formula**  
The polynomial kernel function is given by:  
$$
K(A, B) = (A \cdot B + R)^D
$$
where:  
- **A, B** → Two feature vectors (observations) in the dataset.  
- **A · B** → The **dot product** of A and B (measuring similarity).  
- **R** → A constant (the coefficient of the polynomial) that sets how much weight the lower-order terms get relative to the degree-**D** term.  
- **D** → The degree of the polynomial (controls complexity and flexibility).  

This function computes the relationship between every pair of points in the dataset as if they were mapped to a higher-dimensional space.  
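
As a minimal sketch, the formula translates directly into code. The two observations below are made-up feature vectors, and `polynomial_kernel` is a hypothetical helper rather than a library function.

```python
def polynomial_kernel(a, b, r, d):
    """Compute (a . b + r) ** d for two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return (dot + r) ** d

# Two hypothetical observations with two features each.
a = [1.0, 2.0]
b = [3.0, 0.5]

print(polynomial_kernel(a, b, r=1.0, d=1))  # 5.0  (dot product 4.0, plus 1)
print(polynomial_kernel(a, b, r=1.0, d=2))  # 25.0
print(polynomial_kernel(a, b, r=0.0, d=3))  # 64.0
```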

---

## **2. Why Use the Polynomial Kernel?**  
When data is **not linearly separable**, a simple linear classifier **(a straight line or a hyperplane in higher dimensions)** will not work. The polynomial kernel enables us to:  
- **Map data into a higher-dimensional space** where separation is possible.  
- **Avoid explicitly computing higher dimensions** using the **Kernel Trick** (discussed later).  

For example, in a **one-dimensional dataset**, if we add a squared term (degree = 2), we **lift the data into a parabola shape** in **two dimensions**, making it easier to separate using a straight line.  

---

## **3. How the Polynomial Kernel Defines a Support Vector Classifier**  
### **Step 1: Compute Kernel Values (Pairwise Relationships)**  
- Instead of working directly in a higher dimension, we calculate the **kernel values** between all pairs of points.  
- This means each data point is **implicitly transformed** into a higher-dimensional space **without actually computing the transformation explicitly**.  

### **Step 2: Construct the Decision Boundary (Hyperplane)**  
- Once the kernel values are calculated, we train an **SVM model** as usual.  
- The SVM finds the **optimal hyperplane** in the high-dimensional space that **maximizes the margin** between classes.  

### **Step 3: Classify New Observations**  
- When a new observation arrives, its **kernel value** is computed with respect to the support vectors.  
- The SVM then determines which **side of the hyperplane** the observation falls on and classifies it accordingly.  
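
In standard textbook notation (not taken from the video), the classification step can be written as a decision function. Here $x_i$ are the support vectors, $y_i \in \{-1, +1\}$ their labels, and $\alpha_i$ and $b$ are values learned during training:

$$
f(x) = \sum_{i \in \text{SV}} \alpha_i \, y_i \, K(x_i, x) + b,
\qquad \text{predicted class} = \operatorname{sign}\big(f(x)\big)
$$

Only kernel values between the new observation and the support vectors are needed; the high-dimensional coordinates never appear.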

---

## **4. Example: Using a Polynomial Kernel for Classification**
Let's say we have **one-dimensional data** representing a drug dosage and whether it cured patients:  

| Dosage | Cured (1) / Not Cured (0) |
|--------|--------------------------|
| 0.5    | 0                        |
| 1.0    | 1                        |
| 1.5    | 1                        |
| 2.0    | 0                        |

The data is **not linearly separable**: the cured dosages sit between the two not-cured dosages, so no single dosage threshold can separate cured from not cured patients.  

### **Transforming the Data Using a Polynomial Kernel**
If we apply a **polynomial kernel with $D = 2$** (and, for example, $R = \frac{1}{2}$), each dosage is implicitly mapped to the coordinates $(X, X^2)$, so our new feature space becomes:  

| Dosage (X) | Squared Dosage (X²) | Transformed Space |
|------------|----------------------|-------------------|
| 0.5        | 0.25                 | (0.5, 0.25)      |
| 1.0        | 1.00                 | (1.0, 1.00)      |
| 1.5        | 2.25                 | (1.5, 2.25)      |
| 2.0        | 4.00                 | (2.0, 4.00)      |

Now, the data exists in a **higher-dimensional space**, where an SVM can find a **linear boundary (a straight line in 2D)** to separate cured vs. not cured patients.  
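
A quick check that a straight line does separate the transformed points. The line $y = 2.5x - 1.25$ used here is hand-picked for illustration; it is an assumption, not the boundary an SVM would actually fit.

```python
# Dosage data from the table above: (dosage, cured).
data = [(0.5, 0), (1.0, 1), (1.5, 1), (2.0, 0)]

def transform(x):
    """Map a dosage to the 2D point (x, x**2)."""
    return (x, x ** 2)

def predict(x):
    """Cured (1) if the point lies below the line y = 2.5x - 1.25."""
    x1, x2 = transform(x)
    return 1 if x2 < 2.5 * x1 - 1.25 else 0

for dosage, cured in data:
    print(dosage, transform(dosage), predict(dosage), cured)

print(all(predict(x) == y for x, y in data))  # True
```

The two cured dosages fall below the line and the two not-cured dosages fall above it, which is impossible to achieve with a single threshold on the original axis.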

---

## **5. Kernel Trick: Computing High-Dimensional Relationships Efficiently**  
The **kernel trick** is what makes polynomial kernels (and other kernel methods) computationally feasible.  

### **What the Kernel Trick Does:**  
- **Instead of explicitly transforming data into a higher dimension**, we compute the **pairwise kernel values** directly.  
- This saves computational resources and allows SVMs to efficiently operate in **very high-dimensional spaces**.  

For example, if we had **10,000 features**, explicitly computing all polynomial transformations would be computationally expensive. The kernel trick **avoids this problem** by computing the relationships **as if the transformation had been performed**.  
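
The sketch below checks the kernel trick numerically for one-dimensional dosages, assuming $R = \frac{1}{2}$ and $D = 2$. Expanding $(ab + \frac{1}{2})^2 = ab + a^2b^2 + \frac{1}{4}$ shows it equals the dot product of $(a, a^2, \frac{1}{2})$ and $(b, b^2, \frac{1}{2})$. The helper names are hypothetical.

```python
import math

R, D = 0.5, 2  # assumed kernel parameters for this illustration

def kernel(a, b):
    """Polynomial kernel on raw 1D dosages: (a*b + R) ** D."""
    return (a * b + R) ** D

def explicit_features(x):
    """Explicit coordinates whose dot product equals the kernel above."""
    return (x, x ** 2, R)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

dosages = [0.5, 1.0, 1.5, 2.0]
for a in dosages:
    for b in dosages:
        # Same number either way; the kernel never builds the coordinates.
        assert math.isclose(kernel(a, b), dot(explicit_features(a), explicit_features(b)))

print(kernel(1.0, 2.0))                                     # 6.25
print(dot(explicit_features(1.0), explicit_features(2.0)))  # 6.25
```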

---

## **6. Choosing the Degree (D) and Coefficient (R)**
### **Impact of D (Degree) on Model Complexity**
- **Low degree (D = 1, 2):**  
  - Less complex, fewer parameters.  
  - Good for simple patterns.  
- **Higher degree (D ≥ 3):**  
  - More complex, flexible decision boundaries.  
  - Can capture intricate relationships but may lead to **overfitting**.  

### **Impact of R (Coefficient)**
- Controls how much weight the **lower-order** polynomial terms get relative to the degree-**D** term.  
- **R = 0** → Only the degree-**D** term remains.  
- Large **R** → Lower-order terms carry more weight relative to the degree-**D** term.  
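
The role of **R** is easiest to see by expanding the kernel with the binomial theorem:

$$
(A \cdot B + R)^D = \sum_{k=0}^{D} \binom{D}{k} R^{\,D-k} (A \cdot B)^k
$$

For $D = 2$ this gives $(A \cdot B)^2 + 2R\,(A \cdot B) + R^2$. Setting $R = 0$ leaves only the degree-$D$ term, while increasing $R$ multiplies the lower-order terms by larger factors ($R^{D-k}$ grows fastest for small $k$).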

### **How to Choose D and R?**
- We use **cross-validation** to test different values of **D** and **R** and select the combination that provides the best generalization on unseen data.  

---

## **7. Comparison with Other Kernels**
| **Kernel**         | **Mathematical Form**              | **Use Case** |
|--------------------|----------------------------------|-------------|
| **Linear Kernel**  | $A \cdot B  $               | Simple, linearly separable data |
| **Polynomial Kernel**  | $(A \cdot B + R)^D $ | Non-linear patterns, curved boundaries |
| **Radial Basis Function (RBF) Kernel**  | $\exp(-\gamma \lVert A - B \rVert^2)$ | Highly complex, captures intricate relationships |
| **Sigmoid Kernel**  | $\tanh(A \cdot B + R) $ | Similar to neural networks |

---

## **8. Summary**
### **Key Takeaways:**
- The **polynomial kernel** maps data into a **higher-dimensional space** to make classification easier.  
- Instead of explicitly computing new feature spaces, we use the **kernel trick** to compute **pairwise relationships** efficiently.  
- The **degree (D) and coefficient (R)** determine the complexity of the transformation.  
- **Cross-validation** helps find the best values for D and R.  
- The polynomial kernel is great for **moderate non-linearity**, but for highly complex data, **RBF kernels** may perform better.  

### **Final BAM! 🚀**  
The polynomial kernel is a powerful tool in **Support Vector Machines**, enabling them to classify **non-linearly separable data** efficiently!  

---



# **Understanding the Radial Basis Function (RBF) Kernel in Support Vector Machines (SVMs)**  

## **Introduction**  
We had a training dataset based on drug dosages measured in a group of patients.  
- **Red dots** represented patients who were **not cured**.  
- **Green dots** represented patients who were **cured**.  
- The drug only worked when the dosage was **just right**—not too small or too large.  

Because of the overlap in the data, we were **unable to find a satisfying support vector classifier (SVC)** to separate the cured from the non-cured patients.  

## **Using the Radial Basis Function (RBF) Kernel**  
One way to handle overlapping data is by using a **Support Vector Machine (SVM) with an RBF kernel**.  
- The RBF kernel **finds support vector classifiers in infinite dimensions**, making it impossible to visualize.  
- However, in practice, it **behaves like a weighted nearest neighbor model**:
  - **Closest observations (nearest neighbors)** influence classification the most.  
  - **Farther observations** have relatively little influence.  

For example, if a new observation appears, the **nearest points** will dictate how it is classified.  

## **How the RBF Kernel Determines Influence**  
The RBF kernel measures how much influence one observation has on another.  

### **Mathematical Representation of the RBF Kernel**  
The RBF kernel is given by:  

$$
K(A, B) = \exp(-\gamma \| A - B \|^2)
$$

where:  
- $A, B$ are different dosage measurements.  
- $\| A - B \|^2$ is the **squared distance** between two points.  
- $\gamma$ (gamma) **scales the squared distance**, adjusting influence.  

### **Impact of Gamma ($\gamma$)**  
- **When $\gamma = 1$:**  
  - Plugging in two close observations gives **0.11**.  
- **When $\gamma = 2$:**  
  - Plugging in the same values results in **0.01** (less influence than when $\gamma = 1$).  
- **When two observations are far apart**, their influence approaches **0**.  

Thus, **higher gamma values shrink influence to closer neighbors**, while **lower gamma values allow broader influence**.  
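
These numbers can be reproduced with a few lines of Python. The dosages below (2.5, 4.0, and 16.0) are assumed values chosen so that the two close observations are 1.5 units apart, which yields the 0.11 and 0.01 quoted above; `rbf_kernel` is a hypothetical helper.

```python
import math

def rbf_kernel(a, b, gamma):
    """Compute exp(-gamma * (a - b) ** 2) for two 1D observations."""
    return math.exp(-gamma * (a - b) ** 2)

# Hypothetical dosages: two close observations and one far away.
close_a, close_b, far = 2.5, 4.0, 16.0

print(round(rbf_kernel(close_a, close_b, gamma=1), 2))  # 0.11
print(round(rbf_kernel(close_a, close_b, gamma=2), 2))  # 0.01
print(rbf_kernel(close_a, far, gamma=1))                # effectively 0
```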

## **How the RBF Kernel Works in Infinite Dimensions**  
The key idea behind the RBF kernel is that it **works in infinite dimensions**, which we can understand by building on **polynomial kernels**.  

### **Polynomial Kernel Intuition**  
A polynomial kernel with $R = 0$ does not add dimensions; it only moves the data along the original axis.  
- Example: with **$R = 0$ and $D = 2$**, the kernel is $(A \cdot B)^2 = A^2 B^2$, a dot product with a single coordinate:  
  $$
  \text{New X-coordinate} = \text{Dosage}^2
  $$
  - The data stays in **one dimension**; each point just shifts along the original axis.  
- If we increase **$D$ to 3**, the single coordinate becomes:  
  $$
  \text{New X-coordinate} = \text{Dosage}^3
  $$
  - Again, the data only shifts along the same axis.  

Now, what if we **add together** polynomial kernels with $R = 0$ and $D = 1, 2, 3, \dots$ all the way to infinity?  
- Each kernel contributes one coordinate, so the sum is a dot product with **infinitely many coordinates**, which is the idea behind the RBF kernel!  

## **Deriving the RBF Kernel Using Taylor Series Expansion**  
To see why the **RBF kernel corresponds to infinite dimensions**, we use a **Taylor series expansion** of the exponential function.  

### **Taylor Series Expansion of $e^x$**  
$$
e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots
$$
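- If we substitute $x = A \cdot B$ (which corresponds to $\gamma = \frac{1}{2}$), we get an **infinite sum of polynomial kernels** with $R = 0$, each scaled by $\frac{1}{n!}$.  
- That sum can be rewritten as a **dot product in infinite dimensions**; the RBF kernel is this series multiplied by a factor that depends only on $A^2 + B^2$.  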
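
Written out for one-dimensional $A$ and $B$, and assuming $\gamma = \frac{1}{2}$ to keep the algebra simple, the steps are:

$$
e^{-\frac{1}{2}(A - B)^2} = e^{-\frac{1}{2}(A^2 + B^2)} \, e^{AB}
$$

$$
e^{AB} = \sum_{n=0}^{\infty} \frac{(AB)^n}{n!}
= \left(1, \; A, \; \frac{A^2}{\sqrt{2!}}, \; \frac{A^3}{\sqrt{3!}}, \; \dots\right) \cdot
\left(1, \; B, \; \frac{B^2}{\sqrt{2!}}, \; \frac{B^3}{\sqrt{3!}}, \; \dots\right)
$$

Multiplying each vector by $e^{-\frac{1}{2}A^2}$ and $e^{-\frac{1}{2}B^2}$ respectively absorbs the leading factor, so the RBF kernel is a dot product between two vectors with infinitely many coordinates.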

Thus, when we compute the RBF kernel, the value we get is the **high-dimensional relationship between two points** in an **infinite-dimensional space**!  
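
As a numerical sanity check, the sketch below truncates the infinite feature vector after a finite number of coordinates and compares the resulting dot product with the RBF kernel value. It assumes $\gamma = \frac{1}{2}$ and two made-up dosages; `truncated_features` is a hypothetical helper.

```python
import math

def truncated_features(x, n_terms):
    """First n_terms coordinates of the infinite feature map for gamma = 1/2."""
    scale = math.exp(-0.5 * x ** 2)
    return [scale * x ** n / math.sqrt(math.factorial(n)) for n in range(n_terms)]

def rbf_kernel(a, b, gamma=0.5):
    """Compute exp(-gamma * (a - b) ** 2) directly."""
    return math.exp(-gamma * (a - b) ** 2)

a, b = 1.0, 1.5  # hypothetical dosages
exact = rbf_kernel(a, b)
for n_terms in (1, 2, 4, 8, 16):
    approx = sum(p * q for p, q in zip(truncated_features(a, n_terms),
                                       truncated_features(b, n_terms)))
    # The truncated dot product approaches the kernel value as terms are added.
    print(n_terms, approx, exact)
```

The kernel itself needs only one subtraction, one multiplication, and one exponential, while the explicit dot product needs ever more coordinates to match it.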

## **Conclusion**  
- The **RBF kernel transforms data into infinite dimensions**, allowing **complex decision boundaries**.  
- It behaves like a **weighted nearest neighbor classifier**, where **gamma ($\gamma$) controls the influence of points**.  
- **Mathematically, the Taylor series of the exponential turns the kernel into an infinite sum of polynomial kernels**, which is why it corresponds to infinite dimensions.  
- **Final takeaway:** The RBF kernel is **powerful for handling non-linear data** that is not linearly separable in lower dimensions.  

---
