# SVM & Naive Bayes | Assignment

# Question 1: What is a Support Vector Machine (SVM), and how does it work?

**Definition:**  
Support Vector Machine (SVM) is a **supervised machine learning algorithm** mainly used for classification and regression problems. It works by finding the **optimal hyperplane** that separates data points of different classes with the **maximum margin**. The points that lie closest to the hyperplane and influence its position are called **support vectors**.

**Working Principle:**  
- If the classes are not linearly separable in the original feature space, a **kernel function** implicitly maps the data into a **higher-dimensional space** where a separating hyperplane may exist.
- It identifies the **hyperplane** that **maximizes the margin**, i.e. the distance between the hyperplane and the nearest points of each class.
- **Support vectors** are the training points closest to the hyperplane; they alone determine its position.
- The **kernel trick** lets SVM handle non-linear data effectively (common kernels: Linear, Polynomial, RBF).
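
The margin and support vectors described above can be inspected on a fitted model. The following is a minimal sketch on made-up 2-D points, assuming scikit-learn is available; the very large `C` is only an approximation of a hard margin, and the data is chosen so the geometry is easy to verify by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated clusters (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
# Support vectors sit on the margin, where |w.x + b| is approximately 1.
print("decision values:", clf.decision_function(clf.support_vectors_))
```

For this data the closest points of the two classes are 5/√2 ≈ 3.54 apart along the direction (1, 1), so the reported margin width should be close to that value.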

**Example/Use-case:**  
- Email spam detection (spam vs non-spam)  
- Image classification  
- Handwriting recognition

# Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

**Hard Margin SVM:**  
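- Requires **perfectly linearly separable data**.
- Allows no misclassification and no points inside the margin; it **maximizes the margin** subject to that strict constraint.
- Sensitive to **outliers**: a single outlier can shift the hyperplane drastically, or make the problem unsolvable if it lands among points of the other class.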

**Soft Margin SVM:**  
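- Allows **some misclassification** (and points inside the margin) for better generalization.
- Introduces a **penalty parameter (C)** to balance margin size against violations: a large C penalizes violations heavily (narrower margin), while a small C tolerates more of them (wider margin).
- Works well with **noisy or overlapping data**.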

**Example:**  
- Hard Margin → Clean dataset with well-separated classes  
- Soft Margin → Real-world dataset with overlapping classes
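
The role of `C` can be observed directly. This is a small sketch on synthetic overlapping clusters (the centers, spread, and `C` values are arbitrary choices for illustration); it counts how many training points end up as support vectors for each setting.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters, so no hard-margin solution exists.
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=2.0, random_state=0)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:>6}: support vectors={n_support[C]}, "
          f"train accuracy={clf.score(X, y):.3f}")
```

A small `C` tends to give a wide margin with many support vectors, while a large `C` tends to narrow the margin so that fewer points fall inside it.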

# Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

**Kernel Trick:**  
- A technique that lets SVM handle **non-linear data** by working as if the data had been mapped into a **higher-dimensional space** where it becomes linearly separable.
- The SVM only needs inner products between points, so a kernel function k(x, z) can return the inner product in that space directly, **without explicitly computing the transformation**.
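
A worked example may make the "no explicit transformation" point concrete. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D vectors; `phi` and `poly2_kernel` are illustrative helper names, not library functions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)      # same value, computed in the original 2-D space

print(explicit, implicit)  # both equal 121 up to floating-point rounding
```

The kernel reaches the same number with one dot product in two dimensions, which is why kernels stay cheap even when the implied feature space is very large.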

**Example – RBF (Radial Basis Function) Kernel:**  
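- Measures similarity between points using a **Gaussian function** of their distance: K(x, z) = exp(-γ ‖x - z‖²). Nearby points score close to 1, distant points close to 0, and γ controls how quickly the similarity decays.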
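
As a quick check of the Gaussian similarity, the sketch below computes the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. The three points and the value of `gamma` are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(a, b, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2)"""
    return np.exp(-gamma * np.sum((a - b) ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5  # arbitrary value for illustration

K_manual = np.array([[rbf(a, b, gamma) for b in X] for a in X])
K_sklearn = rbf_kernel(X, gamma=gamma)

print(np.round(K_manual, 4))
print("matches sklearn:", np.allclose(K_manual, K_sklearn))
```

Each point has similarity 1 with itself, and the far-away third point has similarity near 0 with the other two.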

**Use Case:**  
- Works well in **image classification** where data is complex and non-linearly separable.  

**Summary:**  
- Kernel trick → handles non-linear data efficiently  
- RBF kernel → popular for complex datasets

# Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

**Definition:**
- Naïve Bayes is a **supervised machine learning algorithm** used for classification.  
- It is based on **Bayes’ Theorem**, which calculates the probability of a class given the input features.

**Why “Naïve”?**
- Called **naïve** because it assumes that **all features are conditionally independent of each other given the class**,
  which is rarely true in real-world data (words in an email are correlated, for example), but it greatly simplifies computation.
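
The calculation can be followed by hand. In this sketch every probability is a made-up number chosen for illustration; the point is only to show how the independence assumption turns the class likelihood into a simple product.

```python
# All probabilities below are made-up numbers for illustration.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.5, "meeting": 0.1},   # P(word | spam)
    "ham": {"free": 0.1, "meeting": 0.4},    # P(word | ham)
}
words = ["free", "meeting"]  # words observed in the email

scores = {}
for label in prior:
    score = prior[label]
    for word in words:
        # Naive step: multiply per-word likelihoods as if words were independent.
        score *= likelihood[label][word]
    scores[label] = score

evidence = sum(scores.values())  # P(words), the normalizing constant
posterior = {label: s / evidence for label, s in scores.items()}
print(posterior)  # spam is about 0.455, ham about 0.545
```

Here the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.02 for spam and 0.6 × 0.1 × 0.4 = 0.024 for ham, so the email is classified as ham.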

**Example:**
- Email spam detection (predicting spam vs non-spam)

# Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

* **Gaussian Naïve Bayes:**  
  - Assumes that, within each class, **features follow a normal (Gaussian) distribution**.
  - **Use case:** Continuous data, e.g., **Iris dataset** (flower measurements).

* **Multinomial Naïve Bayes:**  
  - Works with **discrete count data**, like word frequencies.  
  - **Use case:** Text classification, e.g., **spam detection** or **news categorization**.

* **Bernoulli Naïve Bayes:**  
  - Works with **binary/Boolean features** (0 or 1).  
  - **Use case:** Presence/absence of a feature, e.g., **email word occurrence** in spam detection.

* **Dataset Info:**  
  - You can use any dataset from `sklearn.datasets`, such as **Iris**, **Breast Cancer**, or **Wine**, or your own CSV file.
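
The three variants differ mainly in the kind of features they expect. The sketch below pairs each one with a matching input type; the four-document corpus and its labels are made up, and the printed predictions are not meant as a benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous measurements -> GaussianNB
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)
print("GaussianNB training accuracy:", gnb.score(X_iris, y_iris))

# Tiny made-up corpus (1 = spam, 0 = not spam)
docs = ["win free prize now", "free free offer win",
        "meeting at noon", "project meeting notes"]
labels = [1, 1, 0, 0]

# Word counts -> MultinomialNB
counts = CountVectorizer().fit(docs)
mnb = MultinomialNB().fit(counts.transform(docs), labels)

# Word presence/absence (0 or 1) -> BernoulliNB
binary = CountVectorizer(binary=True).fit(docs)
bnb = BernoulliNB().fit(binary.transform(docs), labels)

new_doc = ["free prize meeting"]
pred_mnb = mnb.predict(counts.transform(new_doc))
pred_bnb = bnb.predict(binary.transform(new_doc))
print("MultinomialNB:", pred_mnb, "BernoulliNB:", pred_bnb)
```

Note that `BernoulliNB` also penalizes the absence of words it has seen, whereas `MultinomialNB` only uses the words that occur.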


# Practical Questions with Answers

# Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)

```python
# Question 6: SVM on Iris dataset

# Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM Classifier with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on test set
y_pred = svm_model.predict(X_test)

# Print accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

# Print support vectors
print("Support Vectors:\n", svm_model.support_vectors_)
```

```text
Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
```


# Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)

```python
# Question 7: Gaussian Naïve Bayes on Breast Cancer dataset

# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naïve Bayes model
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Predict on test set
y_pred = gnb_model.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))
```

```text
              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
```



# Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

```python
# Question 8: SVM on Wine dataset with GridSearchCV

# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define SVM and parameter grid
# Note: the features are not standardized here. RBF SVMs are sensitive to
# feature scale, which likely limits the accuracy reported below.
svm = SVC()
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# GridSearchCV (5-fold cross-validation on the training set) to find best hyperparameters
grid = GridSearchCV(svm, param_grid, cv=5)
grid.fit(X_train, y_train)

# Predict on test set with the best estimator found
y_pred = grid.predict(X_test)

# Print best hyperparameters and test accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

```text
Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Accuracy: 0.7777777777777778
```
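
The Wine features span very different ranges, and an RBF kernel is distance-based, so the unscaled search above is probably held back by feature scale. The following is a hedged variant, not part of the original answer, that standardizes the features inside a pipeline so the scaler is refit on each cross-validation fold. The variable names `pipe` and `grid_scaled` are illustrative, and no output is quoted because this variant was not run for this notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# make_pipeline names each step after its lowercased class, hence the "svc__" prefix.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_scaled = GridSearchCV(pipe, param_grid, cv=5)
grid_scaled.fit(X_train, y_train)

test_accuracy = grid_scaled.score(X_test, y_test)
print("Best Hyperparameters:", grid_scaled.best_params_)
print("Accuracy:", test_accuracy)
```

Placing the scaler inside the pipeline, rather than scaling `X` up front, keeps validation folds from influencing the scaling statistics.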


# Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

```python
# Question 9: Naïve Bayes on 20 Newsgroups text dataset with ROC-AUC

# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

# Load dataset (subset of three categories for faster execution)
categories = ['rec.autos', 'sci.space', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Vectorize text data
# Note: the vectorizer is fitted on the full corpus before splitting, so test
# documents influence the vocabulary and IDF weights. Fit on the training
# split only for a stricter evaluation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Binarize labels (one indicator column per class) for ROC-AUC
y_bin = label_binarize(y, classes=np.unique(y))

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size=0.3, random_state=42)

# Train Multinomial Naïve Bayes with OneVsRest for multi-class ROC-AUC
nb_model = OneVsRestClassifier(MultinomialNB())
nb_model.fit(X_train, y_train)

# Predict class probabilities (one column per class)
y_prob = nb_model.predict_proba(X_test)

# Compute ROC-AUC score (macro average over the three classes)
roc_score = roc_auc_score(y_test, y_prob, average='macro')
print("ROC-AUC Score:", roc_score)
```

```text
ROC-AUC Score: 0.9940406924865924
```
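
An alternative arrangement, sketched here under stated assumptions, puts the vectorizer and classifier in one pipeline so TF-IDF is fitted on the training split only, and asks `roc_auc_score` for a one-vs-rest average directly instead of binarizing the labels. To keep the sketch runnable without a download, it uses a tiny made-up corpus in place of the newsgroup posts, so its score says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-in corpus: 0 = autos, 1 = space, 2 = graphics.
autos = ["car engine oil change", "new tires and brakes", "dealer sold the car",
         "engine noise on the highway", "car brakes need repair",
         "oil leak under the engine"]
space = ["rocket launch to orbit", "nasa shuttle mission", "satellite in low orbit",
         "moon mission launch date", "shuttle docking with station",
         "rocket engine test for nasa"]
graphics = ["render the image in 3d", "graphics card and pixel shader",
            "image file format conversion", "pixel colors in the image",
            "3d polygon rendering software", "shader code for graphics"]
docs = autos + space + graphics
labels = [0] * 6 + [1] * 6 + [2] * 6

# Stratify so every class appears in the test split (needed for multi-class AUC).
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=6, random_state=42, stratify=labels)

# The vectorizer is fitted inside the pipeline, on training documents only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

y_prob = model.predict_proba(X_test)  # one probability column per class
auc = roc_auc_score(y_test, y_prob, multi_class="ovr", average="macro")
print("ROC-AUC Score:", auc)
```

Swapping `docs` and `labels` for `newsgroups.data` and `newsgroups.target` would apply the same pipeline to the real data.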


# Question 10: Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

# Question 10: Email Spam Classification

## 1. Data Preprocessing
* Handle missing data: **drop rows with a missing label**, and **fill missing text fields** with an empty string (or drop the row if no text remains).
* Convert text into numerical features using **TF-IDF vectorization**.  
* Optionally, remove stopwords, punctuation, and perform **lowercasing/stemming**.
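
These preprocessing steps might look as follows. This is a minimal sketch on three made-up emails; the column names `subject`, `body`, and `label` are assumptions, not taken from any real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up emails with gaps; the column names are assumptions.
emails = pd.DataFrame({
    "subject": ["Win a FREE prize", None, "Meeting agenda"],
    "body": ["Click now to claim!!!", "See you at lunch", None],
    "label": ["spam", "ham", None],
})

# A row without a label cannot be used for supervised training, so drop it.
emails = emails.dropna(subset=["label"]).copy()

# Missing text becomes an empty string instead of discarding the whole email.
emails["text"] = emails["subject"].fillna("") + " " + emails["body"].fillna("")

# TF-IDF with lowercasing and English stop-word removal; the default
# tokenizer also discards punctuation.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(emails["text"])

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Stemming is not built into `TfidfVectorizer`; it would need a custom tokenizer, which is omitted here.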

## 2. Model Selection
* **Naïve Bayes** is suitable for **text classification**. Multinomial Naïve Bayes is designed for discrete word counts, though it also works in practice with fractional TF-IDF weights.
* **SVM** can also be used, and a linear SVM often performs well on **high-dimensional sparse text data**.
* Given the **text nature** of the task and the need for **simplicity**, Naïve Bayes is preferred here for its training speed and interpretability.

## 3. Handling Class Imbalance
* Use **class weights** (for SVM) or **oversampling/undersampling**.  
* In Naïve Bayes, imbalance can be addressed by **adjusting prior probabilities**.
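
Both options can be expressed with standard scikit-learn parameters. The sketch below uses random synthetic count data with a 90/10 class split purely to show the API; `ComplementNB` is included as a further Naïve Bayes variant intended for imbalanced text data, which is an addition to the options listed above.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic count features: 90 "ham" rows (0) and 10 "spam" rows (1).
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(100, 20))
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # ham about 0.556, spam 5.0

# SVM: errors on the rare class are penalized more heavily.
svm = SVC(kernel="linear", class_weight="balanced").fit(X, y)

# Naïve Bayes: override the priors learned from the skewed class counts.
nb_uniform = MultinomialNB(class_prior=[0.5, 0.5]).fit(X, y)
print(np.exp(nb_uniform.class_log_prior_))

# ComplementNB estimates each class's parameters from all other classes.
cnb = ComplementNB().fit(X, y)
```

Resampling is a separate route and usually relies on an extra package, so it is not shown here.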

## 4. Performance Evaluation
* Metrics: **Accuracy, Precision, Recall, F1-Score, ROC-AUC**.  
* For imbalanced data, **Precision and Recall** on the spam class matter more than Accuracy, which can look high even when much of the spam is missed.
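
A quick baseline shows why accuracy alone is misleading. The sketch below uses the class counts of the test split reported in this notebook's Question 10 output (1448 legitimate, 224 spam) and scores a trivial classifier that labels every message as legitimate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Class counts from the test split in this notebook: 1448 ham (0), 224 spam (1).
y_true = np.array([0] * 1448 + [1] * 224)
y_pred = np.zeros_like(y_true)  # baseline: label everything as ham

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)                        # recall on the spam class
prec = precision_score(y_true, y_pred, zero_division=0)   # no spam predicted at all

print(f"accuracy={acc:.3f}, spam recall={rec:.3f}, spam precision={prec:.3f}")
# prints: accuracy=0.866, spam recall=0.000, spam precision=0.000
```

A model that never catches spam still scores about 87% accuracy, so spam recall and precision are the numbers to watch.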

## 5. Business Impact
* Automatic spam detection improves **email productivity**, reduces **security risks**, and enhances **user satisfaction**.  
* Helps the company **filter malicious emails** before they reach users.


```python
# Question 10: Email Spam Classification with Naïve Bayes

# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import LabelBinarizer

# Load dataset (SMS Spam dataset, originally from UCI; loaded via URL for demo)
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep='\t', header=None, names=['class','text'])

# Handle missing data
df.dropna(inplace=True)

# Features and target
X = df['text']
y = df['class']

# Convert target to binary (ham=0, spam=1)
# ravel() flattens the (n, 1) column returned by LabelBinarizer to 1-D,
# which avoids scikit-learn's column-vector DataConversionWarning.
lb = LabelBinarizer()
y_bin = lb.fit_transform(y).ravel()

# Text vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_vect = vectorizer.fit_transform(X)

# Split dataset (stratify to preserve the class proportions in both splits)
X_train, X_test, y_train, y_test = train_test_split(X_vect, y_bin, test_size=0.3, random_state=42, stratify=y_bin)

# Train Naïve Bayes (Multinomial)
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict labels and class probabilities
y_pred = nb_model.predict(X_test)
y_prob = nb_model.predict_proba(X_test)

# Evaluation (ROC-AUC uses the predicted probability of the spam class)
print("Classification Report:\n", classification_report(y_test, y_pred))
roc_score = roc_auc_score(y_test, y_prob[:,1])
print("ROC-AUC Score:", roc_score)
```

```text
Classification Report:
               precision    recall  f1-score   support

           0       0.96      1.00      0.98      1448
           1       1.00      0.75      0.86       224

    accuracy                           0.97      1672
   macro avg       0.98      0.88      0.92      1672
weighted avg       0.97      0.97      0.97      1672

ROC-AUC Score: 0.9877170481452249
```
