## **SVM & Naive Bayes | Assignment**


### **Question 1**: What is a Support Vector Machine (SVM), and how does it work?

 A **Support Vector Machine (SVM)** is a supervised machine learning algorithm used for **classification** and **regression** tasks. Its main goal is to find the optimal decision boundary (called a *hyperplane*) that separates different classes in the feature space.

#### How SVM Works:

1. **Hyperplane**:
   SVM finds the hyperplane that best separates the data into classes. In 2D, this is just a line; in higher dimensions, it becomes a plane or hyperplane.

2. **Maximum Margin**:
   SVM chooses the hyperplane with the **maximum margin**, that is, the largest distance between the hyperplane and the nearest data points of each class. These nearest points are called **support vectors**.

3. **Support Vectors**:
   These are the critical elements of the training set. They lie closest to the decision boundary and determine its position: removing any other training point leaves the hyperplane unchanged.

4. **Linear vs. Non-Linear SVM**:

   * If data is **linearly separable**, a straight hyperplane is enough.
   * For **non-linearly separable** data, SVM uses a **kernel trick** to map data into a higher-dimensional space where a linear separator can be found.

5. **Kernel Trick**:
   A mathematical technique that lets SVM work in a higher-dimensional feature space without computing the mapping explicitly, which makes it possible to find a separating hyperplane there. Common kernels include:

   * Linear
   * Polynomial
   * Radial Basis Function (RBF)

6. **Soft Margin**:
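   SVM allows some misclassifications for better generalization. The **C parameter** controls the trade-off between a wide margin and few training errors: a large C penalizes margin violations heavily (narrower margin), while a small C tolerates more of them (wider margin).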
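
The ideas above can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy are installed and uses a made-up six-point toy dataset (not part of the assignment). A very large `C` is used only to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2D
X = np.array([[0, 0], [1, 0], [0, 1],
              [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points nearest the boundary appear in `support_vectors_`; the remaining points could be removed without changing the fitted hyperplane.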


### **Question 2**: Explain the difference between Hard Margin and Soft Margin SVM.


#### Difference Between Hard Margin and Soft Margin SVM:

| Feature                          | **Hard Margin SVM**                                     | **Soft Margin SVM**                                                               |
| -------------------------------- | ------------------------------------------------------- | --------------------------------------------------------------------------------- |
| **Definition**                   | Assumes data is perfectly separable with no errors.     | Allows some misclassifications or margin violations.                              |
| **Tolerance to Errors**          | No tolerance for misclassified points.                  | Allows some points to be within the margin or misclassified.                      |
| **Use Case**                     | Suitable when data is linearly separable without noise. | Suitable for noisy or overlapping data.                                           |
| **Generalization**               | Has no solution when classes overlap; where one exists, a single outlier can shrink the margin and cause **overfitting**. | Better **generalization** on unseen data. |
| **Margin Type**                  | Rigid and strict margin – no data points inside it.     | Flexible margin controlled by a regularization parameter.                         |
| **Regularization Parameter (C)** | Not used (behaves like the limit of a very large **C**). | Uses **C** to control the trade-off between margin size and classification error. |
| **Robustness**                   | Less robust to outliers.                                | More robust to outliers due to flexibility.                                       |
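
The effect of **C** in the table can be checked with a small experiment. This is a sketch under stated assumptions: the overlapping clusters come from `make_blobs` with arbitrarily chosen centers, and the two `C` values are picked only to contrast a very soft margin with a nearly hard one.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a hard margin is not achievable
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=2.0, random_state=0)

results = {}
for C in (0.001, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())
    results[C] = (margin_width, n_sv)
    print(f"C={C}: margin width={margin_width:.3f}, support vectors={n_sv}")
```

A small `C` yields a wider margin and typically more support vectors, because many points are allowed to sit inside the margin.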



### **Question 3**: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.


#### What is the Kernel Trick in SVM?

The **Kernel Trick** is a mathematical technique used in SVM to handle **non-linearly separable data**. Instead of explicitly transforming data into a higher-dimensional space, the kernel function computes the **inner product** of two data points in that space **without actually performing the transformation**. This makes computations efficient and allows SVM to find a separating hyperplane in a more complex space.

---

#### Why Use the Kernel Trick?

* Real-world data is often **not linearly separable**.
* The kernel trick helps **map** data to a higher dimension where a linear separator **can** be found.
* It avoids the **computational cost** of actual transformation.
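
The claim that a kernel returns the inner product in the higher-dimensional space without performing the mapping can be verified numerically. The sketch below uses the degree-2 polynomial kernel as an illustration; the feature map `phi` and the helper `poly_kernel` are hypothetical names written for this example.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (illustrative)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product after mapping to 3D
implicit = poly_kernel(x, z)       # same value, computed in the original 2D space

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both computations give the same number, but the kernel never builds the 3D vectors. For kernels such as RBF the implicit space is infinite-dimensional, so the explicit route is not available at all.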



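#### Use Case of RBF Kernel:

* The Radial Basis Function (RBF) kernel is defined as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, where $\gamma > 0$ controls how far the influence of a single training point reaches.
* Best suited for **non-linear classification problems** where the decision boundary is **not a straight line**.
* Example: Classifying images of animals where features like size, shape, and texture do not follow a linear pattern.
* The RBF kernel can create **curved decision boundaries** to capture complex patterns.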
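
A minimal sketch of this use case, assuming a synthetic dataset of two concentric rings from `make_circles` (chosen here because no straight line can separate the classes):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner ring = class 1, outer ring = class 0
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same data, two kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```

The linear kernel can do little better than guessing on this data, while the RBF kernel can draw a closed, curved boundary around the inner ring.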



### **Question 4**: What is a Naïve Bayes Classifier, and why is it called “naïve”?




A **Naïve Bayes Classifier** is a **probabilistic machine learning algorithm** based on **Bayes’ Theorem**. It is mainly used for **classification tasks**, such as text classification, spam detection, and sentiment analysis.

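It calculates the **probability of a class given a set of features** and predicts the class with the highest probability.

It is called **“naïve”** because it assumes that all features are **conditionally independent** of one another given the class. For example, it treats the words in an email as if each occurred independently of the others. This assumption rarely holds in real data, but it reduces the likelihood to a simple product of per-feature probabilities, and the classifier often works well in practice anyway.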



#### Bayes’ Theorem Recap:

$$
P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}
$$

Where:

* $P(C \mid X)$ = Probability of class **C** given features **X**
* $P(X \mid C)$ = Likelihood of features **X** given class **C**
* $P(C)$ = Prior probability of class **C**
* $P(X)$ = Probability of features **X** (the evidence); it is the same for every class, so it can be ignored when comparing classes
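
To make the formula concrete, the sketch below applies Bayes' theorem with the naïve independence assumption, $P(C \mid x_1, \dots, x_n) \propto P(C) \prod_i P(x_i \mid C)$, to a single email. All probabilities are made-up numbers chosen for illustration, and the names `priors`, `likelihood`, and `email` are hypothetical.

```python
# Made-up probabilities, for illustration only
priors = {"spam": 0.4, "ham": 0.6}

# P(word is present | class)
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.5},
}

# Observed email: contains "free", does not contain "meeting"
email = {"free": True, "meeting": False}

def unnormalized_posterior(cls):
    """P(C) times the product of per-feature likelihoods (naive assumption)."""
    score = priors[cls]
    for word, present in email.items():
        p = likelihood[cls][word]
        score *= p if present else (1.0 - p)
    return score

scores = {cls: unnormalized_posterior(cls) for cls in priors}
evidence = sum(scores.values())  # P(X), identical for both classes
posterior = {cls: s / evidence for cls, s in scores.items()}
prediction = max(posterior, key=posterior.get)

print(posterior)
print("Predicted class:", prediction)
```

Dividing by the evidence only rescales the scores so they sum to 1; the predicted class is already determined by the unnormalized products.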



#### Example Use Cases:

* Email spam filtering
* Sentiment analysis of reviews
* Document classification




### **Question 5**: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?



#### **1. Gaussian Naïve Bayes**

* **Used for**: **Continuous (real-valued)** features
* **Assumption**: Within each class, every feature follows a **normal (Gaussian) distribution**
* **How it works**: Calculates the likelihood of features using the **probability density function** of the normal distribution
* **Use Case Examples**:

  * Iris flower classification (based on petal/sepal measurements)
  * Medical data with continuous measurements (e.g., blood pressure, age, weight)



#### **2. Multinomial Naïve Bayes**

* **Used for**: **Discrete count** features (typically **word counts** in text)
* **Assumption**: Features represent the **number of times** an event (like a word) occurs
* **How it works**: Computes probabilities based on **term frequencies** in each class
* **Use Case Examples**:

  * Text classification (e.g., spam detection, news categorization)
  * Document classification where features are word frequencies



#### **3. Bernoulli Naïve Bayes**

* **Used for**: **Binary (0/1)** features (whether a word is **present or absent**)
* **Assumption**: Features are boolean-valued
* **How it works**: Models each feature as **present (1)** or **absent (0)** and, unlike Multinomial Naïve Bayes, explicitly penalizes the absence of a feature that is typical of a class
* **Use Case Examples**:

  * Binary text classification with word presence/absence (e.g., spam detection using bag-of-words with binary features)
  * Any classification where input features are binary (e.g., medical diagnosis with symptom present/absent)


#### Choosing the Right Variant:

* Use **Gaussian** for real-valued inputs.
* Use **Multinomial** when working with count data like term frequency.
* Use **Bernoulli** for binary or boolean input features.
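
The pairing of variant and feature type can be sketched as follows. The four-document corpus and its labels are invented for this example; `CountVectorizer(binary=True)` is used to produce the 0/1 features that Bernoulli Naïve Bayes expects.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements (Iris)
X_iris, y_iris = load_iris(return_X_y=True)
gnb_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)
print("GaussianNB training accuracy on Iris:", gnb_acc)

# Hypothetical toy corpus (1 = spam, 0 = not spam)
docs = [
    "free prize money",
    "win free money now",
    "meeting agenda attached",
    "project meeting tomorrow",
]
labels = [1, 1, 0, 0]
new_doc = ["free money"]

# Multinomial NB: word counts
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
mnb_pred = mnb.predict(count_vec.transform(new_doc))

# Bernoulli NB: word presence/absence
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bnb_pred = bnb.predict(binary_vec.transform(new_doc))

print("MultinomialNB prediction:", mnb_pred[0])
print("BernoulliNB prediction:  ", bnb_pred[0])
```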


## Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from sklearn.datasets or a CSV file you have.

### **Question 6**: Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

(Include your Python code and output in the code box below.)

In [1]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict the labels for test data
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Accuracy of the SVM model:", accuracy)
print("Support vectors:\n", svm_model.support_vectors_)
print("Number of support vectors for each class:", svm_model.n_support_)


Accuracy of the SVM model: 1.0
Support vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
Number of support vectors for each class: [ 3 11 10]


### **Question 7**: Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

(Include your Python code and output in the code box below.)

In [2]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



### **Question 8**: Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

(Include your Python code and output in the code box below.)

In [3]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']  # Using RBF kernel
}

# Create and fit the GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)

# Make predictions
y_pred = grid.predict(X_test)

# Print best parameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Accuracy on test set:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Accuracy on test set: 0.7777777777777778
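
One likely reason for the modest accuracy above is that the RBF kernel depends on Euclidean distances, while the Wine features have very different ranges (for example, `proline` is in the hundreds and `hue` is close to 1). The sketch below is a possible variation, not part of the original answer: it adds a `StandardScaler` step inside a pipeline and reuses the same grid and split. No output is shown because it was not produced in the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling is part of the pipeline, so each CV fold is scaled
# using statistics from its own training portion only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
scaled_test_acc = grid.score(X_test, y_test)

print("Best Hyperparameters:", grid.best_params_)
print("Accuracy on test set:", scaled_test_acc)
```

The prefix `svc__` in the grid keys is the step name that `make_pipeline` derives from the class name `SVC`.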


### **Question 9**: Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

(Include your Python code and output in the code box below.)

In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load a binary classification subset (e.g., 'sci.space' vs 'rec.sport.hockey')
categories = ['sci.space', 'rec.sport.hockey']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train a Multinomial Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

# Predict probabilities
y_proba = nb_model.predict_proba(X_test_vec)[:, 1]  # Probability for class 1

# Calculate and print ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.9931531531531532


### **Question 10**: Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

(Include your Python code and output in the code box below.)


**ANSWER:**
####  1. **Preprocessing the Data**

* **Handle missing data**: Drop emails whose text is missing, or replace the missing text with an empty string so that the vectorizer does not fail.
* **Text cleaning**: Lowercase, remove punctuation, stopwords (optional).
* **Vectorization**: Use `TfidfVectorizer` to handle diverse vocabulary and reduce the impact of common words.



####  2. **Model Choice**

* **Naïve Bayes** is preferred for **text classification** due to:

  * Simplicity
  * Speed
  * Strong performance on high-dimensional, sparse data (like emails)
* **SVM** (especially with a linear kernel) is also a strong choice for text, but it trains more slowly on large datasets and does not produce class probabilities by default.



####  3. **Handling Class Imbalance**

* Use **class weighting** (in SVM) or **resampling techniques** (like SMOTE).
* In Naïve Bayes, class imbalance can be handled by adjusting the decision threshold, setting the class priors explicitly, or training on a resampled (balanced) dataset.
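
The class-weighting option can be sketched as follows. This assumes a synthetic numeric dataset from `make_classification` as a stand-in for vectorized emails (about 5% positives); the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for an imbalanced spam problem: roughly 5% positives
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

recalls = {}
for weighting in (None, "balanced"):
    # "balanced" scales the penalty C inversely to class frequency
    clf = SVC(kernel="linear", class_weight=weighting).fit(X_train, y_train)
    recalls[weighting] = recall_score(y_test, clf.predict(X_test))
    print(f"class_weight={weighting}: minority recall = {recalls[weighting]:.2f}")
```

Weighting usually raises recall on the minority class at the cost of some precision. For Naïve Bayes on text, scikit-learn also provides `ComplementNB`, which its documentation describes as particularly suited to imbalanced datasets.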



####  4. **Evaluation Metrics**

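Use metrics that reflect performance on **imbalanced data**, where plain accuracy can look high even if most spam is missed:

* **Precision**: of the emails flagged as spam, the fraction that really are spam (low precision means legitimate mail is lost)
* **Recall**: of the actual spam emails, the fraction that is caught
* **F1-score**: harmonic mean of precision and recall
* **ROC-AUC Score**: how well the predicted probabilities rank spam above legitimate mail across all thresholds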
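
A minimal sketch of why accuracy alone is misleading here, using hypothetical labels (95 legitimate emails, 5 spam) and a trivial "classifier" that never flags anything:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 95 legitimate emails (0) and 5 spam emails (1)
y_true = [0] * 95 + [1] * 5
# A useless model that predicts "not spam" for every email
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no email is predicted as spam
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {acc:.2f}")
print(f"Precision: {prec:.2f}")
print(f"Recall:    {rec:.2f}")
print(f"F1-score:  {f1:.2f}")
```

The model reaches 95% accuracy while catching no spam at all, which precision, recall, and F1 expose immediately.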



####  5. **Business Impact**

* Automating spam detection reduces time and manual filtering
* Improves employee productivity
* Protects against phishing/malware
* Enhances email infrastructure efficiency



In [5]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.pipeline import make_pipeline
import numpy as np
import pandas as pd

# Simulate: Use 'talk.politics.misc' as 'spam' and 'sci.med' as 'not spam'
categories = ['talk.politics.misc', 'sci.med']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Convert to DataFrame for preprocessing
df = pd.DataFrame({'text': data.data, 'target': data.target})

# Introduce some missing values (simulating incomplete data)
np.random.seed(42)
missing_indices = np.random.choice(df.index, size=10, replace=False)
df.loc[missing_indices, 'text'] = None

# Handle missing data
df.dropna(subset=['text'], inplace=True)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], test_size=0.3, random_state=42)

# Create pipeline: TF-IDF + Naive Bayes
pipeline = make_pipeline(
    TfidfVectorizer(),
    MultinomialNB()
)

# Train the model
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

# Evaluation
print("Classification Report:\n")
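# target_names must follow the label order (0 = 'sci.med', 1 = 'talk.politics.misc');
# fetch_20newsgroups orders labels alphabetically regardless of the order in `categories`
print(classification_report(y_test, y_pred, target_names=data.target_names))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))


Classification Report:

                    precision    recall  f1-score   support

           sci.med       0.89      0.98      0.93       302
talk.politics.misc       0.97      0.84      0.90       225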

          accuracy                           0.92       527
         macro avg       0.93      0.91      0.92       527
      weighted avg       0.92      0.92      0.92       527

ROC-AUC Score: 0.9854598969830758
