### SVM & Naive Bayes

1. What is a Support Vector Machine (SVM), and how does it work?
- A Support Vector Machine (SVM) is a supervised learning model that finds the decision boundary (hyperplane) with the largest margin, i.e., the greatest distance to the nearest training points (the support vectors); a wider margin tends to improve generalization. For linearly separable data, it learns a maximal-margin hyperplane; with soft margins, it tolerates some margin violations and misclassifications, trading margin width against training errors. With kernels, SVMs implicitly map inputs into higher-dimensional spaces to separate data that isn’t linearly separable in the original space. SVMs also extend to regression (SVR) and handle multiclass problems via one-vs-rest or one-vs-one schemes.

2. Explain the difference between Hard Margin and Soft Margin SVM.
- Hard margin SVM assumes perfectly linearly separable data and finds the maximum-margin hyperplane while allowing zero misclassifications; it’s simple but highly sensitive to outliers and infeasible when classes overlap. Soft margin SVM relaxes this by introducing slack variables and a regularization parameter C that trades margin width against violations: a large C penalizes violations heavily and approaches hard-margin behavior, while a small C tolerates more violations in exchange for a wider margin. This makes it more robust to noise and applicable to non-separable data. In practice, hard margin is rare; soft margin (with C tuned) is the standard choice.
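
The effect of C can be illustrated with a small sketch. It assumes two overlapping synthetic clusters from `make_blobs` (the centers, spread, and C values are illustrative choices, not part of the question) and simply counts support vectors at each setting.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (illustrative data, not from the assignment).
X, y = make_blobs(
    n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0
)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Total number of support vectors across both classes.
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={n_support[C]:3d} "
          f"train accuracy={clf.score(X, y):.3f}")
```

A small C usually leaves many points inside a wide margin (so many support vectors), while a large C narrows the margin and tends to keep only the points near or across the boundary.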

3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
- The kernel trick lets an SVM learn nonlinear decision boundaries by implicitly mapping data into a higher-dimensional feature space without computing that mapping explicitly; instead, it uses a kernel function K(x, x′) that equals the dot product in the mapped space, keeping computation efficient. Example: the RBF (Gaussian) kernel K(x, x′) = exp(−γ‖x − x′‖²) creates flexible, localized decision regions; it’s well-suited when class boundaries are complex and not linearly separable, with γ controlling how tightly the influence of each training point decays with distance.
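
As a minimal sketch of the idea, the snippet below uses a degree-2 polynomial kernel on 2-D inputs, where the explicit feature map is small enough to write out, and checks that the kernel value matches the dot product in the mapped space. The helper names (`phi`, `poly2_kernel`, `rbf_kernel`) and the sample points are illustrative assumptions. For the RBF kernel the mapped space is infinite-dimensional, so only the kernel function itself is shown.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """K(a, b) = (a . b)^2, computed entirely in the original 2-D space."""
    return np.dot(a, b) ** 2

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3-D mapped space
implicit = poly2_kernel(x, z)      # same quantity without ever building phi
print("explicit:", explicit, "implicit:", implicit)

# RBF similarity is 1 for identical points and decays with distance.
print("rbf(x, x):", rbf_kernel(x, x), "rbf(x, z):", rbf_kernel(x, z))
```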

4. What is a Naïve Bayes Classifier, and why is it called “naïve”?
- A Naïve Bayes classifier is a simple probabilistic model that applies Bayes’ theorem to compute the posterior probability of each class given an input’s features and predicts the class with the highest posterior. It’s called “naïve” because it assumes the features are conditionally independent given the class, so each feature contributes to the class probability independently of the others. This assumption is unrealistic in many real datasets, but it greatly simplifies computation and parameter estimation, and the model often works well anyway in high-dimensional tasks such as text classification.
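
The calculation can be sketched by hand for a toy case. The priors and per-class word probabilities below are made-up numbers chosen for illustration; the point is only to show how the independence assumption turns the likelihood into a product of per-feature terms.

```python
import numpy as np

classes = ["ham", "spam"]
prior = np.array([0.7, 0.3])  # hypothetical P(class)

# Hypothetical P(word present | class); columns are the words ["free", "meeting"].
p_present = np.array([
    [0.10, 0.40],  # ham
    [0.80, 0.05],  # spam
])

x = np.array([1, 0])  # the email contains "free" but not "meeting"

# Naive assumption: multiply per-feature probabilities within each class.
likelihood = np.prod(np.where(x == 1, p_present, 1.0 - p_present), axis=1)

# Bayes' theorem: posterior is proportional to prior * likelihood, then normalized.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

for name, p in zip(classes, posterior):
    print(f"P({name} | x) = {p:.4f}")
print("Predicted class:", classes[int(np.argmax(posterior))])
```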

5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
- Gaussian NB: assumes each continuous feature follows a class-conditional normal distribution; use for real-valued, approximately continuous data (e.g., sensor readings, measurements).

- Multinomial NB: models counts/frequencies per feature with a multinomial likelihood; use for discrete count data like word counts or TF (or TF–IDF as an approximation) in document classification.

- Bernoulli NB: models binary feature presence/absence; use when features are boolean indicators (e.g., whether a word appears at least once, on/off flags) or when sparsity and presence information matter more than counts.
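
A rough sketch of matching each variant to its feature type follows. The data is synthetic (random draws with illustrative parameters), and the accuracies are computed on the training data only, so they show the API usage rather than real-world performance.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)  # 50 samples per class

# Continuous features: class 0 centered at 0, class 1 centered at 3.
X_cont = rng.normal(loc=y[:, None] * 3.0, scale=1.0, size=(100, 2))

# Count features: each class favors a different "word".
rates = np.where(y[:, None] == 0, [5.0, 1.0, 1.0], [1.0, 1.0, 5.0])
X_count = rng.poisson(rates)

# Binary features: presence/absence derived from the counts.
X_bin = (X_count > 0).astype(int)

acc = {
    "GaussianNB": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB": MultinomialNB().fit(X_count, y).score(X_count, y),
    "BernoulliNB": BernoulliNB().fit(X_bin, y).score(X_bin, y),
}
for name, value in acc.items():
    print(f"{name}: training accuracy = {value:.3f}")
```

In this setup the binary features discard the count information, so BernoulliNB would be expected to do worse than MultinomialNB here; on real data the better choice depends on whether counts or mere presence carry the signal.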

In [1]:
""" 6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors."""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = SVC(kernel="linear", C=1.0, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)

support_vectors = clf.support_vectors_
n_support = clf.n_support_
support_idx = clf.support_

print(f"Accuracy: {acc:.4f}")
print(f"Number of support vectors: {support_vectors.shape[0]} (per class: {n_support.tolist()})")
print("First 5 support vectors:\n", np.array2string(support_vectors[:5], precision=3))
print("Support indices (first 20):", support_idx[:20].tolist())

Accuracy: 1.0000
Number of support vectors: 23 (per class: [3, 11, 9])
First 5 support vectors:
 [[4.5 2.3 1.3 0.3]
 [4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [6.8 2.8 4.8 1.4]
 [6.  2.9 4.5 1.5]]
Support indices (first 20): [48, 63, 71, 2, 11, 20, 39, 53, 64, 67, 68, 82, 87, 118, 1, 5, 7, 55, 73, 75]


In [2]:
""" 7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score."""

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

data = load_breast_cancer()
X, y = data.data, data.target
target_names = data.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names, digits=3))

              precision    recall  f1-score   support

   malignant      0.927     0.905     0.916        42
      benign      0.945     0.958     0.952        72

    accuracy                          0.939       114
   macro avg      0.936     0.932     0.934       114
weighted avg      0.938     0.939     0.938       114



In [3]:
""" 8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy. """

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

svc = SVC(kernel="rbf")
param_grid = {
    "C": [0.1, 1, 3, 10, 30, 100],
    "gamma": ["scale", "auto", 0.001, 0.003, 0.01, 0.03, 0.1]
}

grid = GridSearchCV(
    estimator=svc,
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
    refit=True
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print("Best hyperparameters:", grid.best_params_)
print(f"CV Best Score (mean accuracy): {grid.best_score_:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")

Best hyperparameters: {'C': 3, 'gamma': 0.001}
CV Best Score (mean accuracy): 0.7542
Test Accuracy: 0.6944
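
The relatively low scores above are plausibly due to unscaled features: the Wine columns span very different ranges, and the RBF kernel depends on Euclidean distances. The following is a sketch of the same search with a `StandardScaler` placed inside a pipeline, so scaling is refit within each CV fold. The reduced grid is an illustrative assumption, and no results are quoted because this variant was not part of the original run.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# make_pipeline names the steps "standardscaler" and "svc",
# so SVC parameters are addressed with the "svc__" prefix.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1],
}

grid = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

print("Best hyperparameters:", grid.best_params_)
print(f"CV Best Score (mean accuracy): {grid.best_score_:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
```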


In [4]:
""" 9. Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions."""

# Python 3.x
# Task: Train Naive Bayes on 20 Newsgroups text and report ROC-AUC

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

categories = ['sci.space', 'rec.autos']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))
X_text, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42, stratify=y
)

vect = TfidfVectorizer(
    max_features=50000,
    ngram_range=(1,2),
    stop_words='english'
)
X_train_tfidf = vect.fit_transform(X_train)
X_test_tfidf = vect.transform(X_test)

clf = MultinomialNB(alpha=1.0)
clf.fit(X_train_tfidf, y_train)

y_prob = clf.predict_proba(X_test_tfidf)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC: {auc:.4f}")

ROC-AUC: 0.9914


10. Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

- Preprocess: join subject and body, lowercase, replace URLs, email addresses, and numbers with placeholder tokens, fill missing text with an empty string and add missing-value indicator flags, then vectorize with TF‑IDF using word (1–2) and character (3–5) n‑grams.
- Model: start with Multinomial Naïve Bayes as a fast baseline; prefer a linear SVM (LinearSVC), which often gives stronger precision/recall on sparse text; calibrate the SVM's scores (e.g., with CalibratedClassifierCV) if decisions will be thresholded on a risk score.
- Imbalance: use stratified splits and class_weight="balanced" (for SVM), tune the decision threshold on a validation set to hit the target precision/recall, and avoid SMOTE on raw text; consider modest undersampling of ham if needed.
- Tuning: cross-validate TF‑IDF parameters (ngram_range, max_features) and model parameters (NB alpha, SVM C), optimizing PR‑AUC (average precision).
- Evaluate: report precision, recall, F1, and PR‑AUC; select an operating threshold and show the confusion matrix at that threshold; monitor performance over time and by segment (domain, language).
- Business impact: fewer spam/phishing emails reaching inboxes, fewer legitimate emails blocked by mistake, improved security/compliance, and lower manual review effort thanks to calibrated risk scores that can be thresholded.
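
The approach above can be sketched as a small pipeline. The emails and labels are a hypothetical toy corpus, the helper `fill_missing` is an illustrative name, and only word n-grams are used for brevity (character n-grams would need a second vectorizer combined via `FeatureUnion`). The score is computed on the training data purely to show the calls, not as an evaluation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = spam, 0 = not spam; None simulates a missing body.
emails = [
    "Win a FREE prize now, click here",
    "Meeting moved to 3pm, agenda attached",
    None,
    "Cheap loans, limited offer, click now",
    "Can you review the quarterly report?",
    "Lunch tomorrow?",
    "Project update and next steps",
    "Free offer: claim your prize today",
]
labels = [1, 0, 0, 1, 0, 0, 0, 1]

def fill_missing(texts):
    """Replace missing email text with an empty string."""
    return [t if isinstance(t, str) else "" for t in texts]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),           # word uni- and bigrams
    LinearSVC(class_weight="balanced", C=1.0),     # reweight the rarer class
)
model.fit(fill_missing(emails), labels)

# decision_function returns signed distances; higher means more spam-like.
scores = model.decision_function(fill_missing(emails))
ap = average_precision_score(labels, scores)
print(f"Average precision on the toy training data: {ap:.3f}")
```

In a real project the scores would come from a held-out validation set, and the operating threshold would be chosen there to meet the precision or recall target.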