# SVM & Naive Bayes | Assignment

**Question 1: What is a Support Vector Machine (SVM), and how does it work?**


**Ans-**

Support Vector Machine (SVM):

 - A Support Vector Machine (SVM) is a supervised machine learning algorithm used mainly for classification and also for regression problems. It is particularly effective in high-dimensional spaces and remains usable when the number of dimensions exceeds the number of samples.

How SVM Works:

Separating Classes with a Hyperplane

 - SVM tries to find the best decision boundary (called a hyperplane) that separates data points of different classes.

 - For example, in 2D space, this boundary is a line; in 3D, it’s a plane; in higher dimensions, it’s a hyperplane.

Maximizing the Margin

 - SVM doesn’t just find any boundary — it finds the one that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points from each class.

 - These nearest points are called support vectors (hence the name).
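
The margin idea can be sketched on a tiny, made-up 2D dataset. The snippet below is only an illustration under stated assumptions: the six points are invented, and a very large `C` is used to approximate a hard margin. For a linear SVM the margin width equals 2 / ||w||, where `w` is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (invented for illustration).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]                 # normal vector of the hyperplane
b = clf.intercept_[0]            # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

Only the points returned in `support_vectors_` determine the boundary; moving any other point slightly, without it entering the margin, leaves the hyperplane unchanged.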

Handling Non-Linearly Separable Data

 - When data is not linearly separable, SVM uses the kernel trick to work implicitly in a higher-dimensional space where the classes may become linearly separable.

 - Common kernels:

   - Linear kernel (for linearly separable data)

   - Polynomial kernel

   - Radial Basis Function (RBF) kernel (very popular)

Soft Margin (for noisy data)

 - In real-world datasets, perfect separation is not always possible.

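 - SVM allows some misclassification using a regularization parameter (C) that controls the trade-off between maximizing the margin and minimizing classification errors. A larger C penalizes violations more heavily (narrower margin); a smaller C tolerates more of them (wider margin).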

**Question 2: Explain the difference between Hard Margin and Soft Margin SVM.**

**Ans-**

**1. Hard Margin SVM**

Definition:
 - Hard Margin SVM requires that all training data points be classified perfectly by the hyperplane, with no misclassifications.

Conditions:

 - Works only if the data is linearly separable.

 - All data points must lie on or outside the margin boundaries, on the correct side of the hyperplane.

Advantages:

 - Produces the maximum-margin boundary with zero training error when the data is separable.

Disadvantages:

 - Not suitable for noisy datasets (outliers can drastically affect the boundary).

 - Rarely works in real-world scenarios since perfect separation is uncommon.

**2. Soft Margin SVM**

Definition:
 - Soft Margin SVM allows some misclassifications or violations of the margin, controlled by a parameter C.

Conditions:

 - Works when classes overlap or contain outliers, i.e., when the data is not perfectly linearly separable.

 - Strikes a balance between maximizing margin and minimizing classification error.

Advantages:

 - More flexible and robust against outliers and noise.

 - Works better in real-world datasets.

Disadvantages:

 - May not achieve perfect classification on training data.
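
The effect of C can be sketched by fitting the same linear SVM with a very small and a very large value. This is a rough illustration, not a benchmark: the overlapping blobs are synthetic and the two C values are arbitrary choices. scikit-learn has no separate hard-margin mode, so a large C is used here as an approximation of one.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (synthetic data, chosen for illustration).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.001, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy={clf.score(X, y):.3f}")
```

A small C gives a wide margin, so many points fall inside it and become support vectors; a large C narrows the margin and leaves fewer of them.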

**Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.**


**Ans-**

**Kernel Trick in SVM**

 - The Kernel Trick is a mathematical technique that allows SVM to solve problems where the data is not linearly separable.

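 - Instead of working only in the original input space, it implicitly maps the data into a higher-dimensional space where it may become separable by a hyperplane. The kernel function computes inner products in that space directly from the original inputs, so the mapping itself is never computed.

 - This keeps computation efficient even when the feature space is very high-dimensional (or infinite-dimensional, as with the RBF kernel).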
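
A small numeric check can make the trick concrete. The sketch below uses a degree-2 polynomial kernel on 2D vectors (chosen because its explicit feature map is short enough to write out); the helper names `phi` and `poly2_kernel` and the two example vectors are illustrative assumptions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product in the 3D feature space
implicit = poly2_kernel(x, z)       # same value, computed in the 2D input space

print(explicit, implicit)
```

Both numbers agree, which is the point: the kernel gives the inner product of the mapped vectors without ever building them.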

Example Kernel: Radial Basis Function (RBF) Kernel

Formula:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

where

 - $x, x'$ = input vectors

 - $\gamma$ = parameter controlling the influence of each data point
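
The formula can be evaluated by hand and compared with scikit-learn's `rbf_kernel`. This is a minimal sketch: the helper `rbf`, the two vectors, and the gamma values are illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, gamma):
    """RBF kernel value for two vectors, following the formula above."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
x_prime = np.array([2.0, 4.0])   # squared distance to x is 1 + 4 = 5

for gamma in (0.01, 0.1, 1.0):
    manual = rbf(x, x_prime, gamma)
    library = rbf_kernel(x.reshape(1, -1), x_prime.reshape(1, -1), gamma=gamma)[0, 0]
    print(f"gamma={gamma}: manual={manual:.4f}, sklearn={library:.4f}")
```

As gamma grows, the kernel value for the same pair of points drops toward zero, so each training point influences a smaller neighbourhood.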

**Use Case:**

 - RBF kernel is very powerful for non-linear decision boundaries.

Example:

 - In image classification, spam detection, or handwriting recognition, the data is often not linearly separable. The RBF kernel implicitly maps the data into a higher-dimensional space where a hyperplane can separate the classes more effectively.
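
A quick way to see this is a dataset of two concentric circles, which no straight line can separate. The sketch below is an illustration on synthetic data (`make_circles` with arbitrarily chosen settings), comparing a linear and an RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```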



**Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?**

**Ans-**

**Naïve Bayes Classifier**

**Definition:**

 - Naïve Bayes is a probabilistic machine learning algorithm based on Bayes’ Theorem.
 - It is mainly used for classification tasks (like spam detection, sentiment analysis, text categorization).

Core Idea:

 - It predicts the class with the highest posterior probability, computed from the class prior and the likelihood of each feature given that class.

Why is it called “Naïve”?

 - Because it makes a naïve assumption: all features are conditionally independent given the class.

 - In reality, features are often correlated (e.g., in text, the words “money” and “bank” often occur together). Naïve Bayes still works surprisingly well despite this unrealistic assumption.
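
The independence assumption is what makes the computation simple: the score for each class is the prior multiplied by one likelihood per feature. The sketch below works through this by hand; every probability in it is made up purely for illustration.

```python
# Assumed (made-up) probabilities for a toy spam filter.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"money": 0.5, "bank": 0.3},   # P(word | spam)
    "ham": {"money": 0.1, "bank": 0.2},    # P(word | ham)
}
email_words = ["money", "bank"]

# Naive assumption: multiply per-word likelihoods as if words were independent.
scores = {}
for label, prior in priors.items():
    score = prior
    for word in email_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize the scores so they sum to 1 (posterior probabilities).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```

With these numbers the spam score is 0.4 × 0.5 × 0.3 = 0.06 and the ham score is 0.6 × 0.1 × 0.2 = 0.012, so the email is classified as spam with posterior 0.06 / 0.072 ≈ 0.83.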

**Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?**

Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.



**Ans-**

**1. Gaussian Naïve Bayes**

Assumption:
 - Each continuous feature is assumed to follow a Gaussian (Normal) distribution within each class.

Use Case:

 - When features are continuous (not categorical).

Example:

 - Predicting whether a patient has a disease based on height, weight, blood pressure, and cholesterol levels (all continuous values).

**2. Multinomial Naïve Bayes**

Assumption:
 - Features represent counts or frequencies.

 - Example: number of times a word appears in a document.

Use Case:

 - Best for text classification problems (spam detection, sentiment analysis, document categorization).

Example:

 - Classifying news articles into topics based on word frequencies.

**3. Bernoulli Naïve Bayes**

Assumption:
 - Features are binary (0 or 1) — presence/absence indicators.

Use Case:

 - When data is boolean in nature.

Example:

 - Text classification based on whether a word appears or not (not how many times).

 - Document classification using "word present = 1, word absent = 0".
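
The three variants differ mainly in the kind of feature matrix they expect. The sketch below pairs each one with a matching toy input; all arrays are invented for illustration and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 4.2]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

pred_gaussian = GaussianNB().fit(X_cont, y).predict(X_cont)
pred_multinomial = MultinomialNB().fit(X_counts, y).predict(X_counts)
pred_bernoulli = BernoulliNB().fit(X_binary, y).predict(X_binary)

print("GaussianNB:   ", pred_gaussian)
print("MultinomialNB:", pred_multinomial)
print("BernoulliNB:  ", pred_bernoulli)
```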



**Question 6: Write a Python program to:**
    
      ● Load the Iris dataset
      ● Train an SVM Classifier with a linear kernel
      ● Print the model's accuracy and support vectors.
      (Include your Python code and output in the code box below.)

In [2]:
# **Ans-** SVM Classifier on Iris Dataset (Linear Kernel)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train an SVM Classifier with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# 4. Predict on test data
y_pred = svm_model.predict(X_test)

# 5. Print accuracy and support vectors
accuracy = accuracy_score(y_test, y_pred)
print("SVM Classifier (Linear Kernel) Accuracy:", accuracy)
print("\nSupport Vectors:\n", svm_model.support_vectors_)
print("\nNumber of Support Vectors for each class:", svm_model.n_support_)


SVM Classifier (Linear Kernel) Accuracy: 1.0

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]

Number of Support Vectors for each class: [ 3 11 10]


**Question 7: Write a Python program to:**

    ● Load the Breast Cancer dataset
    ● Train a Gaussian Naïve Bayes model
    ● Print its classification report including precision, recall, and F1-score.
    (Include your Python code and output in the code box below.)

In [3]:
# **Ans-** Gaussian Naïve Bayes on Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Train Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Predict on test data
y_pred = gnb.predict(X_test)

# 5. Print classification report
print("Classification Report for Gaussian Naïve Bayes:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report for Gaussian Naïve Bayes:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



**Question 8: Write a Python program to:**

    ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
    C and gamma.
    ● Print the best hyperparameters and accuracy.
    (Include your Python code and output in the code box below.)

In [5]:
# **Ans-** SVM with GridSearchCV on Wine Dataset

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Define SVM model and parameter grid
svm_model = SVC(kernel='rbf')
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]
}

# 4. GridSearchCV to tune hyperparameters
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# 5. Best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# 6. Print results
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 10, 'gamma': 0.001}
Best Cross-Validation Accuracy: 0.6946666666666667
Test Accuracy: 0.7777777777777778
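
The cross-validation accuracy of about 0.69 above is low for this dataset. One likely cause is that the Wine features are on very different scales, and the RBF kernel is distance-based. The sketch below is a possible variant, not the run shown above: it adds a `StandardScaler` inside a pipeline so that scaling is re-fitted on each cross-validation fold. The step names `svc__C` and `svc__gamma` come from `make_pipeline`'s automatic naming; no output is shown because this variant was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled independently.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best Hyperparameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)
print("Test Accuracy:", grid.score(X_test, y_test))
```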


**Question 9: Write a Python program to:**

    ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
    sklearn.datasets.fetch_20newsgroups).
    ● Print the model's ROC-AUC score for its predictions.
    (Include your Python code and output in the code box below.)

In [4]:
# **Ans**:- Naïve Bayes on Text Dataset with ROC-AUC

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# 1. Load subset of 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X, y = newsgroups.data, newsgroups.target  # y = 0/1

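# 2. Convert text to TF-IDF features
# (fitted on all documents here; fitting on the training split only would avoid leaking test-set statistics)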
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(X)

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42
)

# 4. Train Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# 5. Predict probabilities for ROC-AUC
y_probs = nb_model.predict_proba(X_test)[:, 1]

# 6. Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)

print("Naïve Bayes ROC-AUC Score:", roc_auc)


Naïve Bayes ROC-AUC Score: 1.0
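
The cell above fits the TF-IDF vectorizer before splitting, so the test documents influence the vocabulary and IDF weights. A leakage-free variant can be sketched with a pipeline that is fitted on the training text only. To stay self-contained, the sketch uses a tiny hand-written corpus (an assumption made for illustration) instead of downloading 20 Newsgroups, so its score says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written corpus: label 0 = space, label 1 = baseball.
docs = [
    "the rocket launched into orbit",
    "nasa plans a new moon mission",
    "the satellite orbits the earth",
    "astronauts aboard the space station",
    "the telescope observed a distant galaxy",
    "the mars rover landed on the planet",
    "the pitcher threw a fast ball",
    "he hit a home run in the ninth inning",
    "the team won the baseball game",
    "the batter struck out twice",
    "the catcher dropped the ball",
    "fans cheered at the baseball stadium",
]
labels = [0] * 6 + [1] * 6

# Split the raw text first, so the vectorizer never sees the test documents.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=4, stratify=labels, random_state=0
)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs_train, y_train)

# Probability of the positive class (label 1) for each test document.
probs = model.predict_proba(docs_test)[:, 1]
roc_auc = roc_auc_score(y_test, probs)
print("ROC-AUC:", roc_auc)
```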


**Question 10: Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:**

    ● Text with diverse vocabulary
    ● Potential class imbalance (far more legitimate emails than spam)
    ● Some incomplete or missing data

Explain the approach you would take to:

    ● Preprocess the data (e.g. text vectorization, handling missing data)
    ● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
    ● Address class imbalance
    ● Evaluate the performance of your solution with suitable metrics
    
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

**Ans-**

Approach

1. Preprocessing

- Handling Missing Data:

 - Drop emails with no text, or replace missing values with an empty string.

- Text Vectorization:

 - Use TF-IDF Vectorizer (better than raw counts because it downweights common words like "the", "and").

- Feature Engineering (optional):

 - Add metadata features (e.g., number of links, special characters, email length).

2. Model Choice: SVM vs. Naïve Bayes

- Naïve Bayes (MultinomialNB):

 - Works well for text classification, fast, interpretable.

 - Assumes word independence (naïve) but performs surprisingly well.

- SVM:

 - Strong classifier that handles high-dimensional text data; a linear kernel is usually sufficient for sparse TF-IDF features, with RBF as an alternative.

 - Slower for very large datasets.

- Choice:

 - Start with Multinomial Naïve Bayes (fast, scalable).

 - Compare with Linear SVM to see if accuracy improves.

3. Handling Class Imbalance

- Options:

 - Use class weights (e.g., class_weight='balanced' in SVM).

 - Use resampling techniques (SMOTE oversampling spam, or undersampling ham).

 - Use threshold tuning on predicted probabilities.
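
The first option, class weights, can be sketched on synthetic imbalanced data. The dataset below (about 5% positives from `make_classification`) is an assumption standing in for real email features; the comparison looks only at recall on the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic imbalanced data: roughly 95% class 0 ("ham"), 5% class 1 ("spam").
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

recalls = {}
for weighting in (None, "balanced"):
    clf = LinearSVC(class_weight=weighting, max_iter=10000, random_state=0)
    clf.fit(X_train, y_train)
    # Recall on the minority class (label 1).
    recalls[weighting] = recall_score(y_test, clf.predict(X_test))
    print(f"class_weight={weighting}: minority recall={recalls[weighting]:.3f}")
```

Balanced weights typically raise minority recall at the cost of some precision, so the two should be checked together.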

4. Evaluation Metrics

- Since class imbalance exists, accuracy alone is misleading. Use:

 - Precision & Recall for the spam class (recall → avoid missing spam emails; precision → avoid flagging legitimate emails).

 - F1-Score (balance between precision & recall).

 - ROC-AUC (overall discrimination ability).

5. Business Impact

 - Reduced risk of missing spam → protects users from phishing/fraud.

 - Fewer false positives → ensures important legitimate emails aren’t marked spam.

 - Customer trust increases, fewer complaints about lost emails.

 - Operational efficiency → automated filtering saves human effort.

In [6]:
# Spam Classification with Preprocessing & Evaluation

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load a stand-in dataset (two 20 Newsgroups categories used as a ham vs spam example)
categories = ['sci.space', 'rec.sport.baseball']  # simulate ham vs spam
data = fetch_20newsgroups(subset='all', categories=categories)

X, y = data.data, data.target  # y=0/1

# Handle missing data: replace empty documents with a placeholder token
X = [doc if doc.strip() != "" else "missing" for doc in X]

# 2. Vectorization (fitted on all documents here; fitting on the training split only would avoid leakage)
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42, stratify=y
)

# 4a. Train Multinomial Naïve Bayes
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
y_probs_nb = nb_model.predict_proba(X_test)[:, 1]

# 4b. Train Linear SVM with class balancing
svm_model = LinearSVC(class_weight='balanced', random_state=42)
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

# 5. Evaluation
print("=== Naïve Bayes Report ===")
print(classification_report(y_test, y_pred_nb, target_names=data.target_names))
print("Naïve Bayes ROC-AUC:", roc_auc_score(y_test, y_probs_nb))

print("\n=== Linear SVM Report ===")
print(classification_report(y_test, y_pred_svm, target_names=data.target_names))
# LinearSVC has no predict_proba; ROC-AUC would need decision_function scores, so it is skipped here


=== Naïve Bayes Report ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.98      0.99      0.99       299
         sci.space       0.99      0.98      0.99       296

          accuracy                           0.99       595
         macro avg       0.99      0.99      0.99       595
      weighted avg       0.99      0.99      0.99       595

Naïve Bayes ROC-AUC: 0.9997966193618367

=== Linear SVM Report ===
                    precision    recall  f1-score   support

rec.sport.baseball       0.99      0.99      0.99       299
         sci.space       0.99      0.99      0.99       296

          accuracy                           0.99       595
         macro avg       0.99      0.99      0.99       595
      weighted avg       0.99      0.99      0.99       595
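
ROC-AUC only needs a ranking score, not a calibrated probability, so the signed distances from `decision_function` can be passed to `roc_auc_score` for a `LinearSVC`. The sketch below shows this on synthetic data (an assumption made to keep it self-contained); it is not a re-run of the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary data standing in for vectorized emails.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

svm = LinearSVC(max_iter=10000, random_state=0).fit(X_train, y_train)

# Signed distance to the hyperplane; larger means more confidently class 1.
scores = svm.decision_function(X_test)
svm_auc = roc_auc_score(y_test, scores)
print("Linear SVM ROC-AUC:", svm_auc)
```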

