**SVM & Naive Bayes | Assignment**


Question 1: What is a Support Vector Machine (SVM), and how does it work?

Answer- A Support Vector Machine (SVM) is a supervised machine learning model used primarily for classification, and also for regression (known as Support Vector Regression, SVR). For a binary problem, it finds an optimal decision boundary, called a hyperplane, that best separates the two classes. The hyperplane is chosen to maximize the margin, the distance between it and the closest data points of each class (the support vectors), which tends to improve generalization to unseen data.


*  In 2D, the hyperplane is a line; in higher dimensions, it’s a plane or an
   (n – 1)-dimensional separator.
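
As a small illustration of this geometry, the sketch below evaluates a 2D hyperplane w·x + b = 0 on a few points. The weights `w`, the bias `b`, and the sample points are hand-picked values assumed for illustration; they are not the result of training an SVM.

```python
import math

# Hypothetical 2D hyperplane w.x + b = 0 (values chosen by hand, not learned).
w = [1.0, 1.0]
b = -3.0

def decision_value(x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(x):
    """Geometric distance from x to the hyperplane: |w.x + b| / ||w||."""
    return abs(decision_value(x)) / math.sqrt(sum(wi * wi for wi in w))

# In the canonical SVM form (closest points satisfy |w.x + b| = 1),
# the margin width is 2 / ||w||.
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))

for point in [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]:
    side = 1 if decision_value(point) >= 0 else -1
    print(point, "class", side, "distance", round(distance_to_hyperplane(point), 3))
print("margin width:", round(margin_width, 3))
```

Because the margin width is 2 / ||w||, maximizing the margin is equivalent to minimizing ||w||, which is how the optimization problem is usually stated.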




Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer- **Hard Margin SVM:**

**Principle:**

Aims to find the widest possible margin between classes, ensuring that all data points are correctly classified (no misclassifications).

**Requirements**:

Assumes the data is linearly separable, meaning a straight line (or hyperplane in higher dimensions) can perfectly divide the classes.

**Limitations:**

Highly sensitive to outliers and noise in the data. If the classes are not perfectly separable, no hard margin solution exists at all. Even when one exists, a single outlier near the boundary can shrink the margin drastically, so the resulting model may generalize poorly.

**Soft Margin SVM:**

**Principle**:

Allows for some data points to be misclassified, introducing a "slack" or tolerance for errors. This is achieved by using slack variables in the optimization process.

**Requirements**:

Does not require perfect separation of data. It aims to find a balance between maximizing the margin and minimizing the number of misclassifications.

**Advantages:**

More robust to outliers and noisy data than hard margin SVMs. Can handle data that is not perfectly linearly separable. Generally provides better generalization performance on real-world datasets.

**Trade-off:**

Introduces a parameter (often denoted as 'C') that controls the trade-off between maximizing the margin and minimizing the number of misclassifications. A higher 'C' value means less tolerance for misclassifications (closer to hard margin), while a lower 'C' value means more tolerance (allowing for more misclassifications).
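
The role of C can be sketched with the standard soft margin objective, 0.5·||w||² + C·Σ max(0, 1 − yᵢ(w·xᵢ + b)), where each max(…) term is the slack of one point. The toy 1D data, the fixed `w` and `b`, and the function name `soft_margin_objective` are assumptions made for this illustration; nothing is being optimized here.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack list) for 0.5*||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack = [
        max(0.0, 1.0 - yi * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, yi in zip(X, y)
    ]
    return margin_term + C * sum(slack), slack

# Toy 1D data (made up): the last positive point lies on the wrong side.
X = [(-2.0,), (-1.0,), (1.0,), (2.0,), (-0.5,)]
y = [-1, -1, 1, 1, 1]
w, b = (1.0,), 0.0  # fixed by hand, not learned

for C in (0.1, 1.0, 10.0):
    objective, slack = soft_margin_objective(w, b, X, y, C)
    print(f"C={C}: total slack={sum(slack)}, objective={objective:.2f}")
```

The same violation costs far more at C = 10 than at C = 0.1, which is why a large C pushes the optimizer toward hard margin behaviour.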


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

Answer- The Kernel Trick is a method used in Support Vector Machines (SVMs) to handle non-linearly separable data. Instead of explicitly mapping data points into a higher-dimensional feature space where they become linearly separable, the kernel trick employs a kernel function that directly computes the dot product of the transformed data points in that higher-dimensional space. This avoids the computationally expensive process of explicitly transforming the data, making it efficient for high-dimensional spaces, even infinite-dimensional ones.
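
To make the idea concrete, here is a minimal sketch using a degree-2 polynomial kernel, chosen only because its feature map is small enough to write out by hand. The function names and the sample vectors are assumptions for illustration. The kernel value computed in the original 2D space matches the dot product taken after an explicit mapping to 3D.

```python
import math

def explicit_feature_map(x):
    """Map (x1, x2) to the 3D space (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel (x.y)^2, computed in the original space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(explicit_feature_map(x), explicit_feature_map(y))  # map first, then dot product
implicit = poly2_kernel(x, y)                                     # kernel only, no mapping
print(explicit, implicit)
```

Both routes give the same number up to floating-point rounding, but the kernel never constructs the higher-dimensional vectors.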

One example of a kernel is the Radial Basis Function (RBF) Kernel, also known as the Gaussian Kernel.

K(x, y) = exp(-γ * ||x - y||^2)

where:

* x and y are data points in the original feature space.

* ||x - y||^2 is the squared Euclidean distance between x and y.

* γ (gamma) is a hyperparameter that controls the influence of individual training samples. A larger γ leads to a more complex decision boundary, while a smaller γ results in a smoother boundary.
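
A direct transcription of this formula, with made-up points and γ values, shows how γ changes the similarity between the same pair of points. The function name `rbf_kernel` here is a local helper written for this sketch.

```python
import math

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b = (0.0, 0.0), (1.0, 1.0)  # squared Euclidean distance between a and b is 2
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(a, b) = {rbf_kernel(a, b, gamma):.6f}")
print("K(a, a) =", rbf_kernel(a, a, 1.0))
```

With a large γ the similarity drops to nearly zero even for nearby points, so each training sample influences only a small neighbourhood, which is what produces the more complex boundary.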

Use Case:

The RBF kernel is widely used when the relationship between data points is non-linear and complex, and there is no clear prior knowledge about the structure of the data. For instance, in image classification, where pixel values can have intricate non-linear relationships, the RBF kernel can effectively capture these complexities to classify images. It's also frequently used in bioinformatics for tasks like protein classification or gene expression analysis, where data often exhibits non-linear patterns. Its ability to create complex decision boundaries makes it suitable for problems where a simple linear separation is insufficient.



Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer- A Naive Bayes classifier is a probabilistic machine learning algorithm that uses Bayes' Theorem to classify data. It's called "naïve" because it makes a simplifying assumption that all features are independent of each other, given the class label. This independence assumption is often unrealistic in real-world data, but despite this, Naive Bayes classifiers often perform surprisingly well.

**Why is it called "naïve"?**

The "naïve" part of the name refers to the core assumption of the algorithm: that all features are independent of each other, given the class label. In other words, the presence or absence of one feature doesn't affect the probability of another feature, given the class. This is a strong assumption and rarely holds true in real-world scenarios where features often interact with each other.
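
Under this assumption the class posterior factorizes as P(class | features) ∝ P(class) · Π P(featureᵢ | class). The sketch below applies that rule to two binary word features. All probabilities are invented for illustration and `posterior` is a hypothetical helper, not a library function.

```python
# Hypothetical probabilities, invented for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
# P(word present | class) for two binary features.
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(features):
    """Posterior over classes under the naive (conditional independence) assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word, present in features.items():
            p = likelihoods[label][word]
            score *= p if present else (1.0 - p)  # per-feature factors simply multiply
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

print(posterior({"free": True, "meeting": False}))
```

Each feature contributes one independent factor, so no joint probability table over feature combinations is ever needed.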

**Why is it used if it's "naïve"?**

Despite its simplicity and the unrealistic independence assumption, Naive Bayes classifiers are widely used for several reasons:

**Computational efficiency:** The independence assumption simplifies the calculations, making the algorithm very fast, especially for large datasets.

**Good performance:** In many practical situations, particularly in text classification (like spam filtering or sentiment analysis) and other areas, Naive Bayes classifiers can achieve surprisingly good accuracy, even with the simplifying assumption.

**Simple implementation:** The algorithm is relatively easy to understand and implement.

In essence, the "naïve" label highlights the simplifying assumption, while the algorithm's effectiveness in various applications demonstrates its practical value despite this simplification.

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Answer- Naive Bayes classifiers come in several variants, each suited for different types of data. Gaussian Naive Bayes is used for continuous data that follows a Gaussian (normal) distribution. Multinomial Naive Bayes is ideal for discrete data, particularly when features represent counts (like word frequencies in text). Bernoulli Naive Bayes is best for binary data, where features indicate presence or absence of an attribute.

**Gaussian Naive Bayes:**

Assumption:

 Features are continuous and follow a Gaussian (normal) distribution.

Use Case:

 Suitable for datasets where feature values are numerical and can be reasonably approximated by a normal distribution. For example:
 * Classifying flowers based on petal length and width, where these measurements are assumed to be normally distributed.
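 * Classifying houses into price bands (e.g. low, medium, high) based on features like area and number of bedrooms, which might follow a normal distribution within each band. (Predicting the price itself is a regression task, which Naive Bayes classifiers do not address.)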

**Multinomial Naive Bayes:**

Assumption:

Features are discrete counts, often representing frequencies of items (e.g., words in a text document).

Use Case:

 Commonly used for text classification tasks where features are word counts or term frequencies. For example:
 * Classifying emails as spam or not spam based on the frequency of certain words.
 * Categorizing news articles based on the frequency of keywords.
 * Analyzing customer reviews to determine sentiment (positive, negative, neutral) based on word frequencies.

**Bernoulli Naive Bayes:**

Assumption:

Features are binary (0 or 1), representing presence or absence of an attribute.

Use Case:

Appropriate for datasets where features are binary. For example:
* Text classification using a "bag-of-words" model where features indicate whether a word is present in a document (1) or not (0).
* Document classification where features indicate presence or absence of specific keywords.
* Classifying news articles based on the presence or absence of certain keywords.
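
The three variants differ only in the per-feature likelihood they plug into Bayes' rule. The sketch below writes out each log-likelihood for a single class. The parameter values are invented, and the function names are local helpers rather than scikit-learn APIs.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of one continuous feature value under N(mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def multinomial_log_likelihood(counts, word_probs):
    """Log-likelihood of word counts, omitting the class-independent multinomial coefficient."""
    return sum(c * math.log(p) for c, p in zip(counts, word_probs))

def bernoulli_log_likelihood(presence, word_probs):
    """Log-likelihood of binary presence flags; absent words contribute log(1 - p)."""
    return sum(
        math.log(p) if present else math.log(1.0 - p)
        for present, p in zip(presence, word_probs)
    )

print(gaussian_log_likelihood(5.0, mean=5.0, var=1.0))
print(multinomial_log_likelihood([3, 0, 1], [0.5, 0.3, 0.2]))
print(bernoulli_log_likelihood([1, 0, 1], [0.5, 0.3, 0.2]))
```

In the multinomial case a word with count 0 contributes nothing, whereas the Bernoulli model explicitly penalizes its absence. That difference is the main practical reason to prefer one over the other for text.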

Question 6: Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.


In [1]:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
# 1. Load Iris dataset
iris = datasets.load_iris()
X = iris.data        # all 4 features
y = iris.target

In [3]:
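# 2. (Optional but recommended) Standardize features.
# Note: the scaler is fitted on the full dataset before the split, so test-set
# statistics leak into training; fitting it on X_train only gives a stricter evaluation.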
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [4]:
# 3. Split into train/test (e.g. 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42)

In [5]:
# 4. Train SVM with linear kernel
clf = SVC(kernel='linear', random_state=42)
clf.fit(X_train, y_train)


In [6]:
# 5. Compute accuracy on test set
accuracy = clf.score(X_test, y_test)
print(f"Test set accuracy: {accuracy:.3f}")

Test set accuracy: 0.974


In [7]:
# 6. Access support vectors
support = clf.support_vectors_
print(f"Number of support vectors: {len(support)}")
print("Support vectors (first 5 shown):")
print(support[:5])

Number of support vectors: 26
Support vectors (first 5 shown):
[[-0.90068117  0.55861082 -1.16971425 -0.92054774]
 [-1.62768839 -1.74335684 -1.39706395 -1.18381211]
 [-0.29484182 -0.13197948  0.42173371  0.3957741 ]
 [-0.53717756 -0.13197948  0.42173371  0.3957741 ]
 [-0.41600969 -1.74335684  0.13754657  0.13250973]]


Question 7: Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

In [9]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [10]:
# 1. Load the dataset
data = load_breast_cancer()
X = data.data      # 569 samples × 30 features
y = data.target

In [11]:
# 2. Split into training and test sets (e.g. 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

In [12]:
# 3. Initialize and train Gaussian Naive Bayes
clf = GaussianNB()
clf.fit(X_train, y_train)

In [13]:
# 4. Predict on test set
y_pred = clf.predict(X_test)

In [14]:
# 5. Print classification report
report = classification_report(y_test, y_pred,
                               target_names=data.target_names)
print("Classification report:\n", report)

Classification report:
               precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Question 8: Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.


In [15]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [16]:
# 1. Load Wine dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

In [17]:
# 2. Split data (e.g. 80% train / 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

In [18]:
# 3. Define parameter grid to search
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # tuning C and gamma only for RBF kernel
}

In [19]:
# 4. Set up GridSearchCV
grid = GridSearchCV(
    estimator=SVC(),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    refit=True
)

In [20]:
# 5. Fit grid search
grid.fit(X_train, y_train)

In [21]:
# 6. Extract results
best_params = grid.best_params_
best_score_cv = grid.best_score_

In [22]:
# 7. Evaluate on test set using best estimator
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)


In [23]:
# 8. Print results
print("Best hyperparameters found (via CV):", best_params)
print(f"Best cross‑validation accuracy: {best_score_cv:.3f}")
print(f"Test‑set accuracy using best model: {test_accuracy:.3f}")

Best hyperparameters found (via CV): {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best cross‑validation accuracy: 0.747
Test‑set accuracy using best model: 0.778
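
The features of the Wine dataset span very different ranges (proline is in the hundreds to thousands, while hue is close to 1), and an RBF kernel depends on Euclidean distances, so unscaled large-range features dominate. A possible variant, sketched below under the assumption that standardization is acceptable for this task, places a `StandardScaler` inside a `Pipeline` so that each cross-validation fold fits the scaler on its own training portion only. The step names `scaler` and `svc` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling lives inside the pipeline, so no test or validation statistics leak into training.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print(f"Best cross-validation accuracy: {grid.best_score_:.3f}")
print(f"Test-set accuracy: {grid.score(X_test, y_test):.3f}")
```

The grid values are the same as in the run above, so any difference in the scores can be attributed to the scaling step.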


Question 9: Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
 sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [24]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score


In [25]:
# 1. Load a binary subset of the 20 newsgroups dataset
categories = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='train', categories=categories,
                          shuffle=True, random_state=42, remove=('headers','footers','quotes'))
X_train_raw, y_train = data.data, data.target

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42, remove=('headers','footers','quotes'))
X_test_raw, y_test = data_test.data, data_test.target

In [26]:
# 2. Text → numeric pipeline
vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
X_train_counts = vectorizer.fit_transform(X_train_raw)
X_train_tfidf = tfidf.fit_transform(X_train_counts)

X_test_counts = vectorizer.transform(X_test_raw)
X_test_tfidf = tfidf.transform(X_test_counts)

In [27]:
# 3. Train a Multinomial Naïve Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)


In [28]:
# 4. Get predicted probabilities and compute ROC‑AUC
y_proba = clf.predict_proba(X_test_tfidf)[:, 1]  # probability of class "sci.space" (label 1)
roc_auc = roc_auc_score(y_test, y_proba)

print(f"ROC‑AUC score on test set: {roc_auc:.3f}")

ROC‑AUC score on test set: 0.993


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

Answer-


**1. Preprocessing the Data**

* Clean and normalize text: convert to lowercase, remove HTML tags, punctuation, digits, URLs, email headers, and stopwords; expand contractions; apply stemming or lemmatization
* Vectorize text: use sparse representations like Bag‑of‑Words or TF‑IDF, optionally with feature pruning (e.g. remove rare terms or use top‑k)

* Handle missing/incomplete data:

 * If the email body is missing: drop the record when such cases are very rare; otherwise replace the text with a placeholder token such as “unknown”.

 * For metadata features (e.g. sender, subject fields): fill missing values with a special null token or the most-common category.
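
The cleaning and missing-text steps above could look roughly like the following sketch. The regular expressions, the tiny stopword list, and the `unknown` placeholder are illustrative assumptions; stemming or lemmatization is omitted to keep the example dependency-free.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_PATTERN = re.compile(r"<[^>]+>")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")
# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of"}

def clean_email(text):
    """Normalize one email body; missing or empty text becomes the placeholder 'unknown'."""
    if text is None or not text.strip():
        return "unknown"
    text = text.lower()
    text = HTML_TAG_PATTERN.sub(" ", text)
    text = URL_PATTERN.sub(" ", text)
    text = NON_LETTER_PATTERN.sub(" ", text)  # drops digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens) if tokens else "unknown"

print(clean_email("<p>WIN a FREE prize!!! Visit http://example.com now</p>"))
print(clean_email(None))
```

The cleaned strings can then be passed to a Bag-of-Words or TF-IDF vectorizer.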

**2. Model Choice: Naïve Bayes vs. SVM**

* Naïve Bayes (typically Multinomial or Bernoulli):

  * Fast, scalable, good for high‑dimensional sparse text.

  * Works well even with relatively small datasets due to independence assumption and Laplace smoothing

* SVM:

  * Often achieves higher accuracy on high-dimensional text features. A linear kernel is usually sufficient for sparse TF-IDF vectors, while an RBF kernel can capture non-linear structure at a higher computational cost.

  * More computationally expensive, requires parameter tuning (e.g. C, kernel).


* Recommendation:

 * If you have large enough labeled data and compute resources: linear‑kernel SVM with TF‑IDF or even word embeddings tends to yield higher precision/recall overall.

 * If you need interpretability, speed, or limited data: Naïve Bayes is a strong baseline and may perform nearly as well with proper smoothing and feature engineering.

**3. Addressing Class Imbalance**

* Resampling approaches:

 * Oversample the minority class (spam) using SMOTE or random oversampling.

 * Undersample the majority class to reduce skew.

 * A combined SMOTE + undersample (Tomek links) approach can help


* Cost‑sensitive learning:

 * For SVM: assign higher misclassification penalty (class weight) for spam.
 * For Naïve Bayes: adjust class priors to up‑weight spam.
* Ensemble approaches: stacking multiple classifiers (e.g. NB + LR + SVM or boosting) can boost minority recall while maintaining accuracy
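
For the cost-sensitive option, one common heuristic is to weight each class inversely to its frequency: weight = n_samples / (n_classes × class count). This is the formula scikit-learn documents for `class_weight='balanced'`. The label distribution below is made up, and `balanced_class_weights` is a helper written for this sketch.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c); rarer classes get larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up label distribution: 90 legitimate emails, 10 spam.
labels = ["ham"] * 90 + ["spam"] * 10
print(balanced_class_weights(labels))
```

With these weights, misclassifying one spam email costs the model as much as misclassifying nine legitimate ones.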

**4. Performance Evaluation – Metrics and Protocol**

* Split strategy: stratified train/validation/test or cross‑validation to maintain spam/ham ratio.

* Key metrics:

 * Precision and recall for the minority class (spam).

 * F1-score balancing them.

 * ROC‑AUC / PR‑AUC for overall discrimination, particularly PR‑AUC as spam is rare.

 * Confusion matrix: to monitor false positives (legit flagged as spam) and false negatives.

* Threshold tuning: adjust decision threshold to balance false positive vs false negative tradeoffs as per business tolerance.
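
The interaction between the decision threshold and the spam-class metrics can be sketched as follows. The ground-truth labels and model scores are invented for illustration, and `precision_recall_f1` is a local helper.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Spam-class precision, recall, and F1 when scores >= threshold are flagged as spam (label 1)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up ground truth (1 = spam) and model scores for eight emails.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.30, 0.55, 0.20, 0.45, 0.80, 0.95]

for threshold in (0.3, 0.5, 0.7):
    p, r, f1 = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Raising the threshold trades recall for precision, which suits a filter that must rarely flag legitimate email.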

**5. Business Impact**

* Reduced spam exposure: lower end‑user annoyance, phishing threats, and malware risk.

* Improved productivity: less time wasted sorting spam or recovering legitimate email incorrectly filtered (i.e., low false positives).

* Cost savings: decreased overhead on manual review or security incidents.

* Better user goodwill: users trust your product more if spam filter is accurate yet conservative about false positives.

* Additionally, model monitoring for concept drift (spam patterns evolve) ensures sustained performance.

**Example Pipeline Summary**

| Step                    | Action                                                              |
| ----------------------- | ------------------------------------------------------------------- |
| Text cleaning           | Lowercase, remove HTML, stopwords, stemming/lemmatization           |
| Vectorization           | TF‑IDF or BoW; maybe embeddings if compute allows                   |
| Missing data treatment  | Null tokens or imputation for missing fields                        |
| Resample/weighting      | SMOTE/undersample + class‑weighting                                 |
| Model training          | Baseline Naïve Bayes, then SVM (linear kernel + grid search on C)   |
| Threshold tuning        | Tune spam detection threshold based on recall/precision trade‑off   |
| Evaluation              | Precision, recall, F1, PR‑AUC, confusion matrix                     |
| Deployment & monitoring | Periodic performance checks, retraining on new data as spam evolves |
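
A compressed version of the vectorization, weighting, and model-training rows might look like the sketch below. The eight-email corpus is made up and far too small for real use, and the step names `tfidf` and `svm` are arbitrary. Resampling, grid search, and threshold tuning are left out for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus (1 = spam, 0 = not spam); real training data would be far larger.
emails = [
    "win a free prize now", "cheap loans click here", "free money guaranteed",
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow with the team", "project status update", "minutes from the call",
]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

spam_filter = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("svm", LinearSVC(C=1.0, class_weight="balanced")),  # up-weights the rarer spam class
])
spam_filter.fit(emails, labels)

new_emails = ["claim your free prize", "agenda for the project meeting"]
print(spam_filter.predict(new_emails))
# Raw decision scores, to which a tuned threshold could be applied.
print(spam_filter.decision_function(new_emails))
```

Swapping the `svm` step for `MultinomialNB()` would give the Naïve Bayes baseline with the same preprocessing.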

**Why this approach works**

* Robust to missing data: an empty or missing text field becomes an all-zero sparse vector (or a placeholder token) after vectorization, which both Naïve Bayes and a linear SVM accept without special handling.

* Handles class imbalance: through resampling and weighting or ensemble methods targeting spam recall.

* Scalable: both NB and linear SVM scale well to large email datasets.

* Interpretable: feature tokens and weights allow analysis, audits, compliance.



























