# **SVM & Naïve Bayes**



### **1. What is a Support Vector Machine (SVM)?**

**SVM** is a supervised machine learning algorithm used for **classification** and **regression**.
Its goal is to find the **best hyperplane** that separates different classes with the **maximum margin**.

📌 Example: Separating spam vs non-spam emails.

---

### **2. What is the difference between Hard Margin and Soft Margin SVM?**

| Feature     | **Hard Margin**                           | **Soft Margin**                      |
| ----------- | ----------------------------------------- | ------------------------------------ |
| Definition  | No misclassifications allowed             | Allows some misclassifications       |
| Use Case    | Only when data is **perfectly separable** | When data has **overlap or noise**   |
| Flexibility | Rigid                                     | More flexible (uses **C parameter**) |

---

### **3. What is the mathematical intuition behind SVM?**

SVM tries to **maximize the margin** between two classes by finding a hyperplane (a line in 2D, a plane in 3D, etc.).

🔸 A larger margin generally leads to better generalization on unseen data.

Mathematically, for linearly separable data (the hard-margin case), it solves:

$$
\min_{w,\, b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1 \ \text{for all } i
$$

Where:

* $w$ is the weight vector
* $b$ is the bias
* $y_i$ are the class labels ($+1$ or $-1$)
* $x_i$ are the input feature vectors
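
The constraint and the margin can be checked numerically. The following is a minimal sketch, assuming a hypothetical six-point toy dataset and a very large `C` (here `1e6`) to approximate the hard-margin problem; neither comes from the text above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Assumption: a very large C approximates the hard-margin formulation.
model = SVC(kernel="linear", C=1e6).fit(X, y)

w = model.coef_[0]        # weight vector w
b = model.intercept_[0]   # bias b

# y_i (w . x_i + b) for every training point; each value should be >= 1
# up to solver tolerance.
constraint_values = y * (X @ w + b)

# Distance between the two margin hyperplanes is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

print("constraint values:", np.round(constraint_values, 3))
print("margin width:", round(margin_width, 3))
```

Minimizing $\|w\|$ is what makes `margin_width` as large as possible, since the two quantities are inversely related.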

---

### **4. What is the role of Lagrange Multipliers in SVM?**

They help convert the **constrained optimization** problem of SVM into a **dual problem**, which is easier to solve using optimization techniques.

* Only points **on the margin boundary, inside the margin, or misclassified** get **non-zero Lagrange multipliers**
* These points are the **support vectors**
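
For reference, a sketch of the dual form that the multipliers lead to, written for the hard-margin problem with the conventional $\frac{1}{2}\|w\|^2$ scaling of the objective (which has the same minimizer). The symbols $\alpha_i$ (one multiplier per training point) are introduced here for illustration:

$$
\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to } \alpha_i \geq 0, \ \sum_i \alpha_i y_i = 0
$$

The weight vector is recovered as $w = \sum_i \alpha_i y_i x_i$, so points with $\alpha_i = 0$ drop out of the solution. In the soft-margin case the constraint becomes $0 \leq \alpha_i \leq C$.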

---

### **5. What are Support Vectors in SVM?**

**Support Vectors** are the **data points closest to the decision boundary (hyperplane)**.

📌 These points:

* Directly affect the position of the hyperplane
* Are **critical** to the final model
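
A minimal sketch of how this can be observed in scikit-learn, assuming a hypothetical toy dataset and a large `C` to approximate a hard margin. It refits the model on the support vectors alone to illustrate that the other points do not influence the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = SVC(kernel="linear", C=1e6).fit(X, y)

sv_idx = model.support_                      # indices of support vectors in X
print("support vector indices:", sv_idx)
print("support vectors:\n", model.support_vectors_)
print("support vectors per class:", model.n_support_)

# Refit using only the support vectors; the learned hyperplane should match.
refit = SVC(kernel="linear", C=1e6).fit(X[sv_idx], y[sv_idx])
print("original w:", np.round(model.coef_[0], 3))
print("refit w:   ", np.round(refit.coef_[0], 3))
```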

---

### **6. What is a Support Vector Classifier (SVC)?**

`SVC` is the classification version of SVM in **Scikit-learn**.

```python
from sklearn.svm import SVC

model = SVC(kernel='linear')
```

It uses SVM to **classify** data into categories.
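
A short end-to-end sketch of the usual fit/predict workflow, assuming the Iris dataset bundled with scikit-learn as example data (the dataset choice and split settings are illustrative, not part of the notes above).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = SVC(kernel="linear")
model.fit(X_train, y_train)            # learn the decision boundaries

accuracy = model.score(X_test, y_test) # mean accuracy on held-out data
print(f"Test accuracy: {accuracy:.2f}")
```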

---

### **7. What is a Support Vector Regressor (SVR)?**

`SVR` is the **regression version** of SVM.
Instead of separating classes, it tries to fit a line or curve **within a certain margin of error (epsilon tube)**.

```python
from sklearn.svm import SVR
model = SVR(kernel='rbf')
```
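
To make the epsilon tube concrete, here is a minimal sketch that fits `SVR` to a noisy sine curve. The synthetic data and the values of `C` and `epsilon` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical data: a sine curve with Gaussian noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
y_pred = model.predict(X)

# Points whose residual is smaller than epsilon lie inside the tube
# and incur no loss.
inside_tube = np.abs(y - y_pred) < model.epsilon
r2 = model.score(X, y)

print(f"{inside_tube.sum()} of {len(X)} points lie inside the tube")
print(f"{len(model.support_)} points are support vectors")
print(f"R^2 on training data: {r2:.3f}")
```

A wider tube (larger `epsilon`) typically leaves fewer support vectors and gives a flatter fit.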

---

### **8. What is the Kernel Trick in SVM?**

The **Kernel Trick** lets SVM learn **non-linear decision boundaries** without explicitly transforming the data.

📌 Instead of mapping the data into a high-dimensional space and computing dot products there, we compute the same **dot product** directly from the original inputs using a **kernel function**.

✅ Examples of kernels:

* Linear
* Polynomial
* RBF (Radial Basis Function)
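
The equivalence can be verified by hand for a small case. This sketch uses the degree-2 polynomial kernel $(x \cdot y)^2$ in 2D, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$; the function names `phi` and `poly_kernel` are made up for this example.

```python
import numpy as np

def phi(v):
    """Explicit 3D feature map for the kernel (x . y)^2 on 2D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, y):
    """The same quantity computed directly in the original 2D space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(y))  # dot product after the explicit mapping
implicit = poly_kernel(x, y)       # kernel trick: no mapping needed

print(explicit, implicit)          # both are approximately 121
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and only the kernel can be evaluated.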

---

### **9. Compare Linear Kernel, Polynomial Kernel, and RBF Kernel**

| **Kernel**     | **Use Case**                    | **Equation**                          | **Strength**               |
| -------------- | ------------------------------- | ------------------------------------- | -------------------------- |
| Linear         | When data is linearly separable | $K(x, y) = x \cdot y$                 | Simple, fast               |
| Polynomial     | For curved decision boundaries  | $K(x, y) = (x \cdot y + c)^d$         | Can model complex patterns |
| RBF (Gaussian) | When data is highly non-linear  | $K(x, y) = \exp(-\gamma \|x - y\|^2)$ | Powerful and flexible      |
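
A rough way to see the difference is to train all three kernels on data that is not linearly separable. The sketch below assumes scikit-learn's `make_circles` generator (two concentric rings) as a stand-in for "highly non-linear" data; exact scores depend on the random seed.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data: an inner ring (class 1) surrounded by an outer ring (class 0).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # `degree` is only used by the polynomial kernel and ignored otherwise.
    model = SVC(kernel=kernel, degree=2).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel:>6}: test accuracy = {scores[kernel]:.2f}")
```

On this kind of data a straight line cannot separate the rings, so the linear kernel is expected to trail the RBF kernel by a wide margin.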

---

### **10. What is the effect of the C parameter in SVM?**

The **C parameter** controls the **trade-off between margin size and classification error**.

* **Low C** → Larger margin, more tolerance for misclassification (often generalizes better, but can underfit if too low)
* **High C** → Smaller margin, less tolerance (tries to classify every training point correctly, which can overfit noisy data)

📌 In Scikit-learn:

```python
SVC(C=1.0, kernel='linear')
```
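
The trade-off can be observed by fitting the same data with different `C` values. This is a minimal sketch, assuming two overlapping synthetic clusters from `make_blobs`; the specific `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical data: two overlapping clusters, so some errors are unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(model.coef_[0])
    results[C] = (margin_width, len(model.support_))
    print(f"C={C:<6} margin width={margin_width:.3f} "
          f"support vectors={len(model.support_)}")
```

As `C` grows, the margin width should shrink, because the optimizer gives up margin to reduce training errors.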

---

### **11. What is the role of the Gamma parameter in RBF Kernel SVM?**

**Gamma (γ)** controls **how far the influence of a single training example reaches** in the **RBF (Radial Basis Function) kernel**.

* **High gamma** → Each point has **short-range influence**, leading to a **complex model that is prone to overfitting**
* **Low gamma** → Each point influences a **wider area**, leading to a **smoother, simpler model** (which can underfit if gamma is too low)

📌 Example in Scikit-learn:

```python
SVC(kernel='rbf', gamma=0.1)
```
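
Comparing training and test accuracy across gamma values makes the overfitting effect visible. A minimal sketch, assuming the `make_moons` synthetic dataset and three arbitrary gamma values:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data: two noisy interleaving half-circles.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for gamma in (0.01, 1.0, 1000.0):
    model = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    scores[gamma] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))
    print(f"gamma={gamma:<7} train={scores[gamma][0]:.2f} "
          f"test={scores[gamma][1]:.2f}")
```

A large gap between training and test accuracy at very high gamma is the typical signature of overfitting.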

---

### **12. What is the Naïve Bayes classifier, and why is it called "Naïve"?**

**Naïve Bayes** is a probabilistic classifier based on **Bayes’ Theorem**.

It's called "**Naïve**" because it **assumes all features are conditionally independent given the class**. This is rarely true in real-world data, but the classifier often still works well in practice.

---

### **13. What is Bayes’ Theorem?**

Bayes' Theorem allows us to update probabilities based on new evidence:

$$
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
$$

Where:

* $P(A|B)$: Posterior probability (after seeing evidence)
* $P(B|A)$: Likelihood
* $P(A)$: Prior probability
* $P(B)$: Evidence (normalizing factor)
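
A worked example helps here. The sketch below applies the formula to a medical test, with made-up numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) and a hypothetical helper `posterior`. The evidence $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# A = "has the disease", B = "test is positive" (hypothetical numbers).
p_disease_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_disease_given_positive, 3))  # 0.161
```

Even with an accurate test, the posterior stays low because the prior is small.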

---

### **14. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**

| Variant            | Use Case                                         | Assumes...                                           |
| ------------------ | ------------------------------------------------ | ---------------------------------------------------- |
| **Gaussian NB**    | Continuous data (e.g., age, height)              | Features follow a **normal (Gaussian)** distribution |
| **Multinomial NB** | Count-based features (e.g., word counts in text) | Features are **frequencies/counts**                  |
| **Bernoulli NB**   | Binary features (e.g., presence/absence of word) | Features are **binary (0 or 1)**                     |
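
In scikit-learn the three variants are separate classes with the same interface. The sketch below fits each one on a tiny hypothetical dataset shaped for that variant; all numbers are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features (hypothetical: height in cm, weight in kg).
X_continuous = np.array([[160.0, 55.0], [165.0, 60.0],
                         [180.0, 85.0], [185.0, 90.0]])
gnb = GaussianNB().fit(X_continuous, y)

# Count features (hypothetical word counts over a 3-word vocabulary).
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Binary features: whether each word is present at all.
X_binary = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_binary, y)

print(gnb.predict([[162.0, 57.0]]))  # continuous input
print(mnb.predict([[2, 0, 1]]))      # count input
print(bnb.predict([[1, 0, 1]]))      # binary input
```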

---

### **15. When should you use Gaussian Naïve Bayes over other variants?**

Use **Gaussian Naïve Bayes** when your features are **continuous and approximately normally distributed within each class**.

📌 Example: Predicting if a person has a disease based on age, weight, and temperature.

---

### **16. What are the key assumptions made by Naïve Bayes?**

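1. **Class-conditional independence** – Features are independent of each other **given the class** (this is the actual "naïve" assumption; unconditional independence is not required)
2. **Equal importance** – Every feature is treated as equally important (no feature weighting is learned)
3. **Distributional form** – Each variant assumes a specific form for the per-class feature distribution (Gaussian, multinomial, or Bernoulli)

⚠ These assumptions rarely hold exactly, but the algorithm often still works well in practice.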
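
Written out, class-conditional independence lets the joint likelihood factor into per-feature terms. A sketch, where $y$ denotes the class and $x_1, \dots, x_n$ the features (notation assumed here for illustration):

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the $y$ that maximizes the right-hand side.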

---

### **17. What are the advantages and disadvantages of Naïve Bayes?**

✅ **Advantages:**

* Very **fast** and **simple**
* Works well with **high-dimensional** data
* Requires **relatively little training data** to estimate its parameters
* Great for **text classification**

❌ **Disadvantages:**

* Assumes **conditional independence** between features, which is rarely realistic
* Not suitable when features are **highly correlated**

---

### **18. Why is Naïve Bayes a good choice for text classification?**

* Text data is **high-dimensional** but sparse (many 0s), which Naïve Bayes handles well
* Word counts/frequencies are a natural fit for **Multinomial Naïve Bayes**
* It’s **fast**, even on large datasets, for tasks such as spam detection and sentiment analysis
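
A minimal sketch of such a text classifier, assuming a four-document toy corpus invented for this example. `CountVectorizer` turns each text into word counts, and `MultinomialNB` models those counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; real tasks need far more data.
texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed directly into Multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))
print(model.predict(["project meeting on monday"]))
```

Words that never occurred in training (such as "on" above) are ignored by the vectorizer at prediction time.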

---

### **19. Compare SVM and Naïve Bayes for classification tasks**

| Feature          | **SVM**               | **Naïve Bayes**                       |
| ---------------- | --------------------- | ------------------------------------- |
| Type             | Discriminative        | Generative                            |
| Speed            | Slower to train       | Very fast                             |
| Assumptions      | No strict assumptions | Assumes feature independence          |
| Small datasets   | Performs very well    | Performs well if independence holds   |
| Text data        | Good, but slower      | Excellent (especially Multinomial NB) |
| Interpretability | Lower                 | Higher                                |

---

### **20. How does Laplace Smoothing help in Naïve Bayes?**

**Laplace Smoothing** fixes the problem of **zero probability** when a word/category **never appears with a given class** in the training data.

🔸 It adds a small constant (usually 1) to **every count**:

$$
P(\text{word} \mid \text{class}) = \frac{\text{count(word in class)} + 1}{\text{total words in class} + V}
$$

Where $V$ is the total number of unique words (vocabulary size).

📌 Prevents multiplying by zero and improves generalization.
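
The formula can be checked with a few lines of arithmetic. This sketch uses invented word counts for a single class and a hypothetical helper `smoothed_prob`; the `alpha` argument generalizes the "+1" (with `alpha=1.0` it matches the formula above).

```python
def smoothed_prob(word_count, total_words, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Hypothetical counts for the class "spam"; "meeting" was never seen in spam.
spam_counts = {"free": 3, "prize": 2, "meeting": 0}
total = sum(spam_counts.values())  # 5 words observed in the class
V = len(spam_counts)               # vocabulary size of 3

smoothed = {}
for word, count in spam_counts.items():
    smoothed[word] = smoothed_prob(count, total, V)
    print(f"{word}: unsmoothed={count / total:.3f}, "
          f"smoothed={smoothed[word]:.3f}")
```

The unseen word moves from probability 0 to 1/8, while the smoothed probabilities still sum to 1.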
