Question 1:  What is a Support Vector Machine (SVM), and how does it work?

-	SVM is a supervised, discriminative classifier that works by finding the optimal hyperplane that separates data points of different classes with the maximum margin (i.e., the largest possible distance between the hyperplane and the nearest data points of each class).
1. Linear SVM (for linearly separable data):
•	In 2D space, a hyperplane is a straight line.
•	SVM finds the best line (hyperplane) that separates the two classes.
•	The goal is to maximize the margin, which is the distance between the hyperplane and the closest data points from each class (called support vectors).
Example:
Class 1 ● ● ●
           |
--------   ← Optimal Hyperplane
           |
Class 2 ○ ○ ○

2. Support Vectors:
•	These are the critical data points that lie closest to the hyperplane.
•	They influence the position and orientation of the hyperplane.
•	Removing a support vector would change the model.
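The margin and the support vectors described above can be inspected on a toy problem. The sketch below is only an illustration: the four 2D points are made up, and a very large `C` in scikit-learn's `SVC` is assumed to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable classes in 2D
X = np.array([[0.0, 0.0], [-1.0, 1.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane
b = clf.intercept_[0]                    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin boundaries

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

For this data the closest points of the two classes are (0, 0) and (2, 0), so the margin width should come out close to 2.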

3. Non-linear SVM:
•	For data that isn't linearly separable, SVM uses a technique called the kernel trick.
Kernel Trick:
•	Implicitly maps data from a low-dimensional space to a higher-dimensional space where it may become linearly separable.
•	Common kernels:
o	Linear Kernel
o	Polynomial Kernel
o	RBF (Radial Basis Function) or Gaussian Kernel
o	Sigmoid Kernel
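As a rough illustration of why the kernel choice matters, the sketch below compares a linear kernel with an RBF kernel on ring-shaped data. The dataset is synthetic (`make_circles`) and its parameters are assumptions chosen for the example, not part of the original answer.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

No straight line separates the two rings, so the linear kernel is expected to do little better than chance, while the RBF kernel can fit the circular boundary.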



Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Difference Between Hard Margin and Soft Margin SVM
Support Vector Machines (SVM) aim to find the optimal hyperplane that separates data points of different classes. However, depending on the nature of the data (whether it's perfectly separable or has noise/overlap), two approaches are used:

1. Hard Margin SVM
Definition:
Hard Margin SVM tries to find a hyperplane that perfectly separates the data without any misclassifications.
Conditions:
•	Works only when data is linearly separable.
•	No data points are allowed inside the margin or on the wrong side of the hyperplane.
Objective:
•	Maximize the margin
•	No tolerance for misclassification.
Disadvantages:
•	Very sensitive to noise or outliers.
•	Not suitable for real-world datasets that are rarely perfectly separable.
2. Soft Margin SVM
Definition:
Soft Margin SVM allows some misclassification or margin violations to improve generalization on noisy or overlapping data.
Conditions:
•	Works when the data is not perfectly linearly separable (for example, because of noise or overlapping classes).
•	Allows trade-off between maximizing margin and minimizing classification error.
Objective:
•	Introduces a regularization parameter (C) to balance:
o	Large margin (simplicity) and
o	Small error (accuracy)
C parameter:
•	Small C: More tolerance for misclassification (larger margin).
•	Large C: Less tolerance for error (narrow margin, stricter separation).
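The effect of C can be checked empirically. This is a minimal sketch on synthetic overlapping data (the `make_blobs` settings are assumptions for illustration); it measures the margin width as 2 / ||w|| for a small and a large C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (parameters chosen for illustration)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

margins = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width of a linear SVM is 2 / ||w||
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```

The small-C model should report the wider margin, matching the trade-off described above.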


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The Kernel Trick is a technique used in Support Vector Machines (SVM) to handle non-linearly separable data by implicitly mapping the original features into a higher-dimensional space — without ever computing that mapping explicitly.

This allows the SVM to find a linear hyperplane in the transformed space, which corresponds to a non-linear decision boundary in the original space.

Why this is useful:
Many real-world datasets cannot be separated by a straight line. The kernel trick lets an SVM learn a non-linear decision boundary while still solving a linear problem in the higher-dimensional space.

Without Kernel Trick:
You would need to manually transform data using feature engineering.

This could be computationally expensive and complex.

With Kernel Trick:
You just use a kernel function like K(x, y) instead of computing the transformed feature space.

The function computes the dot product in the higher-dimensional space implicitly.
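As one example, the polynomial kernel of degree 2, K(x, y) = (x · y)^2, captures pairwise feature interactions. The sketch below checks numerically that this kernel equals the dot product under an explicit feature map; the helper names `phi` and `poly_kernel` are hypothetical and exist only for this illustration.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (hypothetical helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(y))  # dot product in the 3D feature space
implicit = poly_kernel(x, y)       # same value, computed in the original 2D space

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both values agree (up to floating-point rounding), but the kernel never builds the 3D vectors. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.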

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

The Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, used mainly for classification tasks.
It predicts the class of a data point by calculating the probability of each class given the feature values and selecting the one with the highest probability.
Bayes' Theorem:
P(C | X) = [P(X | C) * P(C)] / P(X)
Where:
P(C | X): Posterior probability of class C given features X  
P(X | C): Likelihood of features X given class C  
P(C): Prior probability of class C  
P(X): Evidence (normalizing constant)
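It is called “naïve” because it assumes that all features are conditionally independent of one another given the class, so the likelihood factorizes as P(X | C) = P(x_1 | C) * P(x_2 | C) * ... * P(x_n | C). This assumption rarely holds exactly in real data, but the classifier often performs well anyway.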
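A small worked calculation shows how Bayes' Theorem and the per-feature factorization combine. All probabilities below are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
# Hypothetical probabilities, chosen only to illustrate the calculation
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "offer": 0.5},
    "ham": {"free": 0.1, "offer": 0.1},
}
words = ["free", "offer"]  # features observed in the new email

# Naive assumption: P(X | C) is the product of the per-feature likelihoods
joint = {}
for c in priors:
    p = priors[c]
    for w in words:
        p *= likelihoods[c][w]
    joint[c] = p  # P(X | C) * P(C)

evidence = sum(joint.values())  # P(X), the normalizing constant
posterior = {c: joint[c] / evidence for c in joint}

print(posterior)
```

Here the numerators are 0.4 * 0.8 * 0.5 = 0.16 for spam and 0.6 * 0.1 * 0.1 = 0.006 for ham, so the posterior for spam is 0.16 / 0.166, roughly 0.96.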


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

1. Gaussian Naïve Bayes
Description:
Assumes that the features are continuous and follow a normal (Gaussian) distribution.
Formula:
Each feature is modeled using:
$$
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
$$

Where:
•	μ: mean of the feature for class y
•	σ²: variance of the feature for class y
Use Case:
•	When input features are real-valued (continuous).
•	E.g., Iris classification, medical data, sensor data.
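The Gaussian likelihood formula can be evaluated directly. The sketch below is a plain-Python illustration; the function name `gaussian_likelihood` and the per-class means and variances are hypothetical.

```python
import math

def gaussian_likelihood(x, mu, var):
    """Gaussian density of x for a feature with class mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical per-class statistics for one feature (mean, variance)
stats = {"class_a": (1.5, 0.03), "class_b": (4.3, 0.22)}

x_new = 1.7
for label, (mu, var) in stats.items():
    print(label, gaussian_likelihood(x_new, mu, var))
```

The value 1.7 lies much closer to the mean of `class_a`, so its likelihood under that class is far larger. Gaussian Naïve Bayes multiplies such per-feature likelihoods with the class prior.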
2. Multinomial Naïve Bayes
 Description:
Assumes features represent discrete counts (e.g., number of times a word appears in a document). Often used in text classification problems.
 Formula:
Uses the multinomial distribution to model the likelihood of a set of features given a class.
$$
P(X \mid y) = \frac{(\sum x_i)!}{x_1! \cdot x_2! \cdots x_n!} \prod_{i=1}^{n} P(x_i \mid y)^{x_i}
$$
Where x_i is the count of feature i in the sample, and P(x_i | y) denotes the probability of feature i occurring in class y.

Use Case:
•	When features are word counts, term frequencies, or other frequency features.
•	E.g., Spam detection, news classification, sentiment analysis.
3. Bernoulli Naïve Bayes
Description:
Assumes features are binary (0 or 1), indicating presence or absence of a feature.
Example:
In text, instead of using word counts, Bernoulli NB uses whether a word exists in the document or not.
Use Case:
•	When features are binary.
•	E.g., Short text classification, binary feature datasets, event occurrence prediction.
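The three variants map onto three kinds of input. The following sketch pairs each scikit-learn estimator with a matching feature type; the four-document corpus and its labels are hypothetical and far too small for real use.

```python
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous features -> Gaussian NB (training accuracy, for illustration only)
X_iris, y_iris = load_iris(return_X_y=True)
gnb_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)

# Hypothetical toy corpus (1 = spam, 0 = not spam)
docs = [
    "free prize money now",
    "win free money",
    "meeting schedule tomorrow",
    "project meeting notes",
]
labels = [1, 1, 0, 0]
new_doc = ["free money prize"]

# Word counts -> Multinomial NB
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
mnb_pred = mnb.predict(count_vec.transform(new_doc))

# Word presence/absence -> Bernoulli NB
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bnb_pred = bnb.predict(binary_vec.transform(new_doc))

print("GaussianNB training accuracy on Iris:", gnb_acc)
print("MultinomialNB prediction:", mnb_pred[0])
print("BernoulliNB prediction:", bnb_pred[0])
```

Unlike the multinomial model, Bernoulli NB also penalizes the absence of words that are typical for a class, which is one reason it can suit short texts.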


Question 6:   Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = datasets.load_iris()
X = iris.data       # Features
y = iris.target     # Labels

# Step 2: Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train an SVM classifier with a linear kernel
model = SVC(kernel='linear')
model.fit(X_train, y_train)

# Step 4: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 5: Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Step 6: Print results
print("Model Accuracy:", accuracy)
print("Support Vectors:\n", model.support_vectors_)
print("Support Vector Indices:", model.support_)
print("Number of Support Vectors for each class:", model.n_support_)


Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
Support Vector Indices: [ 31  33  91  22  45  54  59  60  62  73  79  80 105 110   5  16  30  42
  68  81  87 101 112 113 116]
Number of Support Vectors for each class: [ 3 11 11]
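To read the printed attributes, it helps to know how they relate. The sketch below re-runs the same split and model and checks, under the assumption of dense input as used here, that `support_` holds row indices into the training set and that `n_support_` sums to the total number of support vectors.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = SVC(kernel="linear").fit(X_train, y_train)

# support_ indexes rows of X_train; support_vectors_ are those rows
same_rows = np.array_equal(model.support_vectors_, X_train[model.support_])

# n_support_ gives the per-class counts
total = int(model.n_support_.sum())

print("support_vectors_ equals X_train[support_]:", same_rows)
print("Total support vectors:", total)
```

Only these rows determine the decision boundary; the remaining training samples could be removed without changing the fitted model.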


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

In [2]:
# Step 1: Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Step 2: Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Step 3: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Create and train the Gaussian Naïve Bayes model
model = GaussianNB()
model.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 6: Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
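The numbers in the report come from the confusion matrix. As a minimal sketch, the code below recomputes precision, recall, and F1 by hand for the positive class (label 1, benign) and compares them with scikit-learn's own functions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Binary confusion matrix, with label 1 (benign) as the positive class
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("Manual :", precision, recall, f1)
print("sklearn:", precision_score(y_test, y_pred),
      recall_score(y_test, y_pred), f1_score(y_test, y_pred))
```

In a medical setting, recall on the malignant class is usually the number to watch, since a missed malignant case is costlier than a false alarm.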



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy.

In [3]:
# Step 1: Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Step 2: Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Step 3: Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Define the SVM model and parameter grid
model = SVC()
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # using RBF kernel
}

# Step 5: Perform Grid Search with cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Step 6: Make predictions with the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Step 7: Print the results
print("Best Hyperparameters:", grid_search.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.8333333333333334
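RBF kernels depend on distances between samples, and the Wine features have very different scales (proline is in the hundreds, hue is around 1), which may explain the modest accuracy above. The sketch below is one possible variation, not part of the original answer: it adds a `StandardScaler` inside a pipeline so that scaling is fit only on the training folds during the grid search.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling and the classifier are tuned together as one estimator
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step names are derived from the class names, hence the "svc__" prefix
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
scaled_acc = grid.score(X_test, y_test)

print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy (scaled features):", scaled_acc)
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training.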


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

In [4]:
# Step 1: Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Step 2: Load a subset of the 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['sci.space', 'comp.graphics']  # binary classification
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data
y = newsgroups.target  # 0 or 1

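# Step 3: Convert text data to TF-IDF features
# Note: the vectorizer is fit on the full corpus before splitting, so test-set
# vocabulary and IDF statistics influence the features; fit it on the training
# split only for a stricter evaluation.
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X)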

# Step 4: Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Step 5: Train the Naïve Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Step 6: Predict probabilities
y_probs = model.predict_proba(X_test)[:, 1]  # Probabilities for class 1

# Step 7: Calculate and print ROC-AUC score
auc = roc_auc_score(y_test, y_probs)
print("ROC-AUC Score:", auc)


ROC-AUC Score: 0.9853776041666666
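The same workflow can be written as a pipeline so that the TF-IDF vocabulary is learned from the training documents only. The sketch below uses a tiny hypothetical corpus in place of the 20 Newsgroups download, so the score it prints says nothing about real-world performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = space, 0 = graphics)
train_docs = [
    "the rocket launch reached orbit",
    "nasa satellite orbit mission",
    "shuttle launch to the moon",
    "render the image with opengl",
    "3d graphics rendering pipeline",
    "image pixel shader texture",
]
train_labels = [1, 1, 1, 0, 0, 0]

test_docs = [
    "satellite launch mission",
    "moon orbit rocket",
    "opengl shader image",
    "graphics texture rendering",
]
test_labels = [1, 1, 0, 0]

# The vectorizer is fit inside the pipeline, on training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# ROC-AUC needs scores, so use the probability of class 1
y_probs = model.predict_proba(test_docs)[:, 1]
auc = roc_auc_score(test_labels, y_probs)
print("ROC-AUC Score:", auc)
```

Because the two toy topics share no distinguishing vocabulary, the ranking of the test documents is perfect here; real data will be noisier.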


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


## Problem Statement

You are a data scientist tasked with building a classifier to label emails as **Spam** or **Not Spam**. The dataset contains:

- Text with diverse vocabulary  
- Class imbalance (more legitimate emails than spam)  
- Incomplete or missing data in some records  

---

## 1.  Data Preprocessing

### a. Handling Missing Data

- Drop emails where both **subject** and **body** are missing.
- Fill missing values with empty strings for individual fields.

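```python
import pandas as pd

# Simulated loading
df = pd.read_csv("emails.csv")

# Drop emails where both subject and body are missing
df = df.dropna(subset=['subject', 'body'], how='all')

# Fill remaining missing text fields with empty strings
df['subject'] = df['subject'].fillna("")
df['body'] = df['body'].fillna("")

# Combine subject and body into one feature
df['text'] = df['subject'] + " " + df['body']
```
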

### b. Text Cleaning and Vectorization

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(df['text'])

# Target variable
y = df['label']  # 1 for spam, 0 for not spam
```


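## 2. Model Choice: SVM vs Naïve Bayes

| Model       | Pros                                                                                  | Cons                                             |
| ----------- | ------------------------------------------------------------------------------------- | ------------------------------------------------ |
| Naïve Bayes | Fast to train, works well on sparse text features, easy to interpret                  | Assumes features are independent given the class |
| SVM         | Often more accurate on high-dimensional text; can offset imbalance via `class_weight` | Slower to train, needs hyperparameter tuning     |

Multinomial Naïve Bayes is a sensible fast baseline, while a linear SVM with `class_weight='balanced'` is the stronger candidate when training time allows. Both are trained and compared in the evaluation section.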


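## 3. Handling Class Imbalance

Use a stratified split so the training and test sets keep the same spam ratio, and weight the minority class more heavily during training (for example, `class_weight='balanced'` in `SVC`).

```python
from sklearn.model_selection import train_test_split

# Stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
```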
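To see what balanced class weighting does numerically, the sketch below computes the weights with scikit-learn's `compute_class_weight`. The 90/10 label split is a hypothetical example, not taken from the email data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 legitimate emails (0) and 10 spam emails (1)
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

With these counts the spam class gets a weight of 5.0 and the legitimate class about 0.56, so each misclassified spam email costs roughly nine times as much during training.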


## 4. Evaluation Metrics

Accuracy is misleading when one class dominates, so report per-class precision, recall, and F1-score together with ROC-AUC.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score

# Naïve Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
y_prob_nb = nb.predict_proba(X_test)[:, 1]

print("Naive Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb))
print("Naive Bayes ROC-AUC:", roc_auc_score(y_test, y_prob_nb))
```


```python
# SVM with probability calibration
from sklearn.calibration import CalibratedClassifierCV

svm = SVC(kernel='linear', class_weight='balanced')
svm_calibrated = CalibratedClassifierCV(svm)
svm_calibrated.fit(X_train, y_train)
y_pred_svm = svm_calibrated.predict(X_test)
y_prob_svm = svm_calibrated.predict_proba(X_test)[:, 1]

print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))
print("SVM ROC-AUC:", roc_auc_score(y_test, y_prob_svm))
```
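With heavy imbalance, precision-recall metrics can be more informative than ROC-AUC because they focus on the spam class. The sketch below illustrates this on hand-made labels and scores (all values hypothetical); `average_precision_score` summarizes the precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, confusion_matrix

# Hypothetical ground truth (1 = spam) and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.1, 0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.7, 0.9])

# Area under the precision-recall curve (average precision)
ap = average_precision_score(y_true, y_score)

# Confusion matrix at a 0.5 decision threshold
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Average precision:", ap)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", tp / (tp + fp), "Recall:", tp / (tp + fn))
```

In this toy case one legitimate email is flagged as spam (a false positive). Raising the threshold would trade recall for precision, which is the lever to tune when blocking legitimate mail is the costlier mistake.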


## 5. Business Impact

| Business Benefit | Explanation                                                |
| ---------------- | ---------------------------------------------------------- |
| ✅ User Trust     | Prevents false spam blocks, improves satisfaction          |
| ✅ Security       | Blocks phishing, scams, and malicious links                |
| ✅ Efficiency     | Automates email filtering, reduces human moderation effort |
| ✅ Compliance     | Helps meet data protection and anti-spam laws              |
