## SVM & Naive Bayes

1. What is a Support Vector Machine (SVM), and how does it work?

-> A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. Its main goal is to find the optimal decision boundary (hyperplane) that best separates the data points of different classes in a feature space.

How it works:

Finding the Hyperplane:

In a two-dimensional space, the hyperplane is a line that separates the data points of different classes.

In three dimensions it is a plane, and in an n-dimensional feature space it is a flat surface of dimension n − 1.

SVM tries to find the hyperplane that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points from each class.

Support Vectors:

The data points that are closest to the hyperplane and influence its position are called support vectors.

These points are critical, as removing them would change the decision boundary.

Maximizing the Margin:

The optimal hyperplane is the one that gives the largest margin (the widest gap between classes).

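Mathematically, this becomes a quadratic optimization problem: minimize $\frac{1}{2}\lVert w \rVert^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for every training point, which is equivalent to maximizing the margin width $2 / \lVert w \rVert$ while classifying all points correctly.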

Non-linear Separation (Kernel Trick):

If data is not linearly separable, SVM uses a kernel function (e.g., polynomial, RBF) to transform data into a higher-dimensional space where it can be linearly separated.
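The ideas above (hyperplane, support vectors, margin) can be inspected directly on a fitted model. The following is a minimal sketch on made-up toy data, not part of the original answer: the six points are hypothetical, and a very large `C` is used only to approximate a hard margin. For a linear kernel the margin width is 2 / ‖w‖, where `w` is read from `coef_`.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two small, linearly separable clusters
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points nearest the boundary show up in `support_vectors_`; moving any of the other points slightly would leave `w` and `b` unchanged.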

2. Explain the difference between Hard Margin and Soft Margin SVM

-> 1. Hard Margin SVM:
In a hard margin SVM, the algorithm assumes that the data is perfectly linearly separable. It tries to find a hyperplane that separates the two classes without allowing any misclassification or overlap. Every data point must be correctly classified and must lie outside the margin boundaries.
This approach works well only when the data is clean and there is a clear gap between classes. However, it’s very sensitive to noise and outliers, because even one misclassified or noisy point can make it impossible to find a valid hyperplane.

2. Soft Margin SVM:
The soft margin approach is more practical for real-world data, which often has noise or overlapping points. It allows some points to violate the margin or be misclassified, introducing slack variables to measure how much each point deviates from the ideal margin.
A penalty parameter (C) controls this trade-off:

A large C tries to minimize misclassification (behaves more like a hard margin).

A small C allows more violations to get a smoother, more generalizable boundary.

In simple terms, soft margin SVM balances between maximizing the margin and allowing small classification errors to avoid overfitting and handle complex data more effectively.
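To make the role of C concrete, here is a small sketch that fits the same overlapping data with a small and a large C and counts the support vectors. The dataset (two overlapping Gaussian blobs from `make_blobs`) and the two C values are illustrative assumptions, not taken from the assignment.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

n_sv = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points on or inside the margin become support vectors
    n_sv[C] = int(clf.n_support_.sum())
    print(f"C={C}: {n_sv[C]} support vectors, "
          f"training accuracy={clf.score(X, y):.3f}")
```

A small C tolerates many violations, so the margin is wide and many points end up as support vectors; a large C penalizes violations heavily and the margin shrinks around the boundary.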

3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

-> The Kernel Trick is a mathematical technique used in Support Vector Machines (SVM) to handle non-linearly separable data.
Instead of transforming the data into a higher-dimensional space manually, the kernel trick allows SVM to implicitly map the data into a higher-dimensional feature space, where a linear separation becomes possible — all without explicitly computing the coordinates in that space.

This saves a lot of computation and makes it possible to separate data that cannot be divided by a straight line in its original form.
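A small numerical check can make the "implicit mapping" idea tangible. The sketch below uses a degree-2 polynomial kernel on 2-D vectors, chosen here only because its feature map is short enough to write out. The function names `phi` and `poly_kernel` and the two sample vectors are illustrative, not from any library.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (written out by hand)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Inner product after mapping both points into 3-D explicitly
explicit = np.dot(phi(x), phi(z))
# Same quantity computed in the original 2-D space
implicit = poly_kernel(x, z)

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

Both numbers agree, which is the point of the trick: the SVM only ever needs inner products, so it can work in the mapped space without constructing `phi(x)`.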

Example – Radial Basis Function (RBF) Kernel:
The RBF (Gaussian) kernel is one of the most commonly used kernels. It measures similarity between two data points using the distance between them. The formula is:

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

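Here, γ (gamma) determines how far the influence of a single training point reaches: a small γ gives each point a far-reaching, smooth influence, while a large γ keeps the influence local and can lead to overfitting.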

Use case:
RBF kernels are widely used when data is non-linear — for example, in classifying medical records, image recognition, or financial fraud detection — where the relationship between features and classes is complex and cannot be captured by a simple straight line.
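As an illustration of data that a straight line cannot separate, the sketch below compares a linear and an RBF kernel on two concentric rings. The dataset (`make_circles`) and the value `gamma=1.0` are assumptions chosen for the demonstration, not part of the original answer.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner ring = one class, outer ring = the other
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Same data, two kernels
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma=1.0).fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```

The linear model has no way to draw a closed boundary around the inner ring, whereas the RBF kernel can, because similarity is based on distance between points.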

4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

-> A Naïve Bayes Classifier is a probabilistic machine learning model based on Bayes’ Theorem, used primarily for classification tasks.
It predicts the class of a data point by calculating the posterior probability of each class given the input features and selecting the class with the highest probability.

Bayes’ Theorem is given as:

$$P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}$$


Where:

$P(C \mid X)$: Posterior probability of class $C$ given features $X$

$P(X \mid C)$: Likelihood of features $X$ given class $C$

$P(C)$: Prior probability of class $C$

$P(X)$: Probability of the features

It is called “naïve” because it assumes that all features are independent of each other given the class label, so the likelihood factorizes as $P(X \mid C) = \prod_i P(x_i \mid C)$.
In real-world data, this independence assumption is rarely true — hence the name “naïve” — but despite this simplification, the model performs remarkably well in many applications such as spam detection, sentiment analysis, and document classification.
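A hand calculation shows how the independence assumption turns Bayes' Theorem into a simple product. All numbers below (the priors and per-word likelihoods) are made up for illustration; the email is assumed to contain the two words "free" and "meeting".

```python
# Hypothetical priors and per-word likelihoods (illustrative numbers only)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.05, "meeting": 0.4},
}
words = ["free", "meeting"]

# Naive assumption: P(X | C) is the product of the per-word likelihoods
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for word in words:
        score *= likelihoods[label][word]
    unnormalized[label] = score

# P(X) is the sum over classes, which normalizes the posteriors
evidence = sum(unnormalized.values())
posterior = {label: s / evidence for label, s in unnormalized.items()}

print(posterior)
predicted = max(posterior, key=posterior.get)
print("Predicted class:", predicted)
```

Here the unnormalized scores are 0.4 × 0.5 × 0.1 = 0.020 for spam and 0.6 × 0.05 × 0.4 = 0.012 for ham, so the posterior for spam is 0.020 / 0.032 = 0.625.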

5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

-> Gaussian Naïve Bayes:

Assumes that the features follow a normal (Gaussian) distribution.

Suitable for continuous data, such as height, weight, temperature, or sensor readings.

Use case: Predicting whether a patient has a disease based on continuous medical test results.

Multinomial Naïve Bayes:

Used when features represent counts or frequencies (non-negative values, typically integer counts).

Commonly applied in text classification, where features are word counts or term frequencies.

Use case: Classifying emails as spam or not spam using word occurrence counts.

Bernoulli Naïve Bayes:

Used for binary/boolean features, where each feature can take only two values (e.g., 0 or 1).

It models the presence or absence of a particular feature.

Use case: Document classification where we only care whether a word appears in a document (not how many times).
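The three variants map onto three scikit-learn classes. The sketch below fits each one on a tiny, hypothetical dataset whose feature type matches the variant; the arrays are invented for illustration and far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.2, 0.7], [1.0, 0.9], [3.1, 2.8], [2.9, 3.2]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word present / absent -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

gauss_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 0.8]])
multi_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])
bern_pred = BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]])

print("GaussianNB:   ", gauss_pred)
print("MultinomialNB:", multi_pred)
print("BernoulliNB:  ", bern_pred)
```

The choice of variant follows from how the features are encoded, not from the task: the same documents can feed `MultinomialNB` as counts or `BernoulliNB` as presence flags.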

6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model’s accuracy and support vectors


In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM Classifier with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Print model accuracy and support vectors
print("Model Accuracy:", accuracy_score(y_test, y_pred))
print("Number of Support Vectors for each class:", svm_model.n_support_)
print("Support Vectors:\n", svm_model.support_vectors_)


Model Accuracy: 1.0
Number of Support Vectors for each class: [ 3 11 10]
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
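A perfect score on a single 45-sample test split can be partly luck of the split. As a hedged cross-check, the sketch below estimates accuracy with 5-fold cross-validation on the full Iris data; the choice of `cv=5` is an assumption, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Each fold is held out once while the model trains on the other four
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread across folds gives a rough sense of how much the reported accuracy depends on which samples land in the test set.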


7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.


In [2]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test set
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))


Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy.

In [3]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Perform GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=0)
grid.fit(X_train, y_train)

# Predict on the test set
y_pred = grid.predict(X_test)

# Print best parameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Best Model Accuracy:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Best Model Accuracy: 0.7777777777777778
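The RBF kernel is distance-based, and the Wine features span very different ranges (proline is in the hundreds to thousands while several others are below 10), so unscaled features can dominate the kernel. A possible variation, sketched below under that assumption, puts a `StandardScaler` in a `Pipeline` so scaling is refit inside every cross-validation fold. The step names `scaler` and `svc` are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so it never sees the validation fold
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step>__<param>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy (scaled features):", scaled_accuracy)
```

Comparing this number with the unscaled result above is a quick way to see how sensitive the RBF SVM is to feature scale on this dataset.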


9. Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.


In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize
from sklearn.model_selection import train_test_split

# Load text dataset
data = fetch_20newsgroups(subset='all', categories=['rec.sport.baseball', 'sci.space', 'comp.graphics'], remove=('headers', 'footers', 'quotes'))

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Convert text data into TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Multinomial Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# Predict probabilities for ROC-AUC
y_pred_prob = nb_model.predict_proba(X_test_tfidf)

# Binarize labels (for multi-class ROC-AUC)
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

# Calculate ROC-AUC score (macro average)
roc_auc = roc_auc_score(y_test_bin, y_pred_prob, average='macro')

# Print the ROC-AUC score
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.9895453018243434
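The multi-class score above is a one-vs-rest macro average. `roc_auc_score` can also compute it directly from integer labels through its `multi_class="ovr"` argument, without binarizing by hand. The sketch below checks that both routes agree; to stay self-contained it uses a synthetic three-class dataset and `GaussianNB` as stand-ins (both are assumptions, not the newsgroups setup).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for a three-class problem
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)

# Route 1: binarize labels by hand, then macro-average per-class AUC
auc_binarized = roc_auc_score(
    label_binarize(y_test, classes=[0, 1, 2]), proba, average="macro"
)

# Route 2: let roc_auc_score do one-vs-rest itself
auc_ovr = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")

print("Binarized labels:", auc_binarized)
print("multi_class='ovr':", auc_ovr)
```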


10. Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.


In [5]:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Load dataset (using categories similar to spam vs. not spam)
categories = ['sci.electronics', 'talk.politics.misc']  # assume 'sci.electronics' as 'not spam' and 'talk.politics.misc' as 'spam'
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create DataFrame for easier handling
df = pd.DataFrame({'text': data.data, 'target': data.target})

# Handle missing data (replace missing text with 'unknown')
# Assign the result back instead of using inplace=True on a column selection
df['text'] = df['text'].fillna('unknown')

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], test_size=0.3, random_state=42, stratify=df['target'])

# Convert text data into numerical form using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Handle class imbalance using oversampling
oversampler = RandomOverSampler(random_state=42)
X_res, y_res = oversampler.fit_resample(X_train_tfidf, y_train)

# Train Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_res, y_res)

# Predict on test data
y_pred = nb_model.predict(X_test_tfidf)
y_prob = nb_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate performance
print("=== Classification Report ===")
print(classification_report(y_test, y_pred))

print("\n=== Confusion Matrix ===")
print(confusion_matrix(y_test, y_pred))

print("\n=== ROC-AUC Score ===")
print(roc_auc_score(y_test, y_prob))



=== Classification Report ===
              precision    recall  f1-score   support

           0       0.94      0.94      0.94       295
           1       0.93      0.93      0.93       233

    accuracy                           0.94       528
   macro avg       0.93      0.93      0.93       528
weighted avg       0.94      0.94      0.94       528


=== Confusion Matrix ===
[[278  17]
 [ 17 216]]

=== ROC-AUC Score ===
0.9891612715501564
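The headline metrics in the report can be recomputed from the confusion matrix, which helps when deciding which errors matter for a spam filter. The sketch below does this for class 1 (treated as "spam" in this setup), using the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted
cm = np.array([[278, 17],
               [17, 216]])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"Accuracy:  {accuracy:.4f}")
```

For spam filtering the two error types have different costs: the 17 false positives are legitimate emails sent to the spam folder, while the 17 false negatives are spam that reached the inbox, so precision and recall are usually more informative than accuracy alone.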
