Question 1: What is a Support Vector Machine (SVM), and how does it work?

Ans: A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks.

It works by finding the best hyperplane that separates data points of different classes with the maximum margin (the distance between the hyperplane and the nearest data points, called support vectors).

For linearly separable data, SVM finds a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions).

For non-linear data, it uses the kernel trick (e.g., polynomial, RBF) to implicitly map data into a higher-dimensional space where it becomes linearly separable.
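
To make the idea of a maximum margin concrete, here is a minimal sketch on a hypothetical 2D toy dataset (the six points are made up for illustration). It assumes that a very large C is an acceptable stand-in for a hard margin on separable data, and it reads the margin width from the fitted weights as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two small clusters in 2D
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin when the data is separable
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", clf.support_vectors_)
```

Only the points printed as support vectors determine the hyperplane; moving any other point (without crossing the margin) would leave the fitted line unchanged.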

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Ans: Hard Margin SVM:

Assumes data is perfectly linearly separable.

Finds a hyperplane that separates classes with no misclassification allowed.

Works well only when there is no noise or overlap in data.

Soft Margin SVM:

Allows some misclassifications by introducing a penalty term.

Balances between maximizing margin and minimizing classification errors.

Controlled by the regularization parameter C (high C → less tolerance for errors and a narrower margin; low C → more tolerance and a wider margin).
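
The effect of C can be checked with a short sketch. The dataset below is hypothetical (two overlapping blobs whose centers and spread are assumptions chosen so that no perfect separation exists); the loop fits a linear SVM for three values of C and reports the margin width and the number of support vectors.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical overlapping two-class data (centers and spread are assumptions)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])  # 2 / ||w||
    n_support = int(clf.n_support_.sum())              # support vectors over both classes
    results[C] = (margin_width, n_support)
    print(f"C={C:<6} margin width={margin_width:.3f} support vectors={n_support}")
```

A smaller C should give a wider margin, and typically more support vectors, because more points are allowed to sit inside or on the wrong side of the margin.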


Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Ans: The Kernel Trick in SVM is a method that allows the algorithm to handle non-linearly separable data by mapping it into a higher-dimensional space without explicitly computing the transformation. This makes it possible to find a separating hyperplane in complex datasets.

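Example: The RBF (Radial Basis Function) kernel, K(x, z) = exp(-gamma * ||x - z||^2), which scores two points by how close they are to each other.

Use case: Commonly used when data is not linearly separable. For instance, in image classification, RBF can separate data points that form circular or irregular boundaries.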
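
The phrase "without explicitly computing the transformation" can be illustrated with a small numeric check. This sketch uses a degree-2 polynomial kernel rather than RBF, because its feature map is small enough to write out by hand (the RBF kernel corresponds to an infinite-dimensional map). The function names `phi` and `poly_kernel` are hypothetical helpers, not library functions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (x1, x2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # inner product computed in the 3D feature space
implicit = poly_kernel(x, z)       # same quantity computed in the original 2D space

print("Explicit mapping:", explicit)
print("Kernel shortcut: ", implicit)
```

Both numbers agree, so the SVM can work with inner products in the mapped space while only ever evaluating the kernel on the original vectors.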

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Ans: A Naïve Bayes Classifier is a supervised machine learning algorithm based on Bayes’ Theorem, used mainly for classification tasks such as spam filtering, sentiment analysis, and text categorization.

It is called “naïve” because it makes the strong assumption that all features are conditionally independent of each other given the class, which is rarely true in real-world data. Despite this simplification, it often works surprisingly well, especially for high-dimensional datasets like text.
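
As a worked illustration of the independence assumption, the sketch below applies Bayes' Theorem by hand. All probabilities are hypothetical numbers invented for the example; the point is only that the per-word likelihoods are multiplied together as if the words were independent given the class.

```python
# Hypothetical class priors and per-word likelihoods P(word | class)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # multiply as if words were independent
        scores[label] = score
    total = sum(scores.values())               # normalizing constant P(words)
    return {label: score / total for label, score in scores.items()}

post_free = posterior(["free"])
post_both = posterior(["free", "meeting"])
print("P(class | 'free'):", post_free)
print("P(class | 'free', 'meeting'):", post_both)
```

Seeing "free" alone pushes the posterior toward spam, while adding "meeting" pulls it back toward ham, since each word contributes its own factor.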

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Ans: 1. Gaussian Naïve Bayes

Assumes features follow a normal (Gaussian) distribution.

Suitable for continuous data (e.g., height, weight, exam scores).

Example: Classifying patients based on continuous medical measurements (blood pressure, cholesterol).

2. Multinomial Naïve Bayes

Assumes features are counts or frequencies.

Suitable for discrete data (word counts in text, event occurrences).

Example: Text classification or spam detection, where features are word frequencies.

3. Bernoulli Naïve Bayes

Assumes features are binary (0/1).

Suitable when data represents presence/absence of a feature.

Example: Document classification where a word is either present (1) or absent (0).
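
The three variants map directly onto three scikit-learn classes. The sketch below fits each one on a tiny hypothetical dataset whose feature type matches the variant's assumption; the numbers are invented for illustration and are far too few for a real model.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # same hypothetical labels for all three examples

# Continuous measurements (e.g. blood pressure, cholesterol) -> GaussianNB
X_cont = np.array([[120.0, 180.0], [125.0, 190.0],
                   [150.0, 260.0], [155.0, 250.0]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word presence/absence (1/0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

pred_gauss = GaussianNB().fit(X_cont, y).predict([[152.0, 255.0]])
pred_multi = MultinomialNB().fit(X_counts, y).predict([[0, 2, 1]])
pred_bern = BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]])

print("GaussianNB:", pred_gauss)
print("MultinomialNB:", pred_multi)
print("BernoulliNB:", pred_bern)
```

The binary matrix is derived from the count matrix here only to show that the same documents can feed either the Multinomial or the Bernoulli variant, depending on whether counts or mere presence are kept.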

Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.




In [1]:
# Import libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predictions
y_pred = svm_model.predict(X_test)

# Model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Support vectors
print("Support Vectors:\n", svm_model.support_vectors_)


Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)


In [2]:
# Import libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

In [3]:
# Import libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = load_wine()
X, y = wine.data, wine.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# GridSearchCV with cross-validation
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5, verbose=0)
grid.fit(X_train, y_train)

# Best model
best_model = grid.best_estimator_

# Predictions
y_pred = best_model.predict(X_test)

# Print best parameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Accuracy on Test Data:", accuracy_score(y_test, y_pred))


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Accuracy on Test Data: 0.7777777777777778
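
The test accuracy above may be held back by unscaled features: the RBF kernel is distance-based, and the Wine features span very different ranges (proline is in the hundreds, hue is close to 1). The sketch below is one possible variation, not part of the original answer. It assumes that standardizing inside a pipeline is acceptable for this task and reuses the same split and grid, with the parameter names prefixed by the pipeline step name `svc`.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled on its own training part
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)  # accuracy of the refit best pipeline
print("Best Hyperparameters (scaled):", grid.best_params_)
print("Accuracy on Test Data (scaled):", scaled_accuracy)
```

Putting the scaler in the pipeline, rather than scaling X up front, keeps the test set and the validation folds out of the scaler's statistics.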


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

In [4]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load a subset of the 20 newsgroups dataset (for binary classification)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Naïve Bayes Classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities
y_prob = nb_model.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 1.0


Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)

Ans: Approach Explanation

Preprocessing the Data

Handle missing data: Replace missing text with an empty string or drop rows if excessive.

Text vectorization: Use TfidfVectorizer to convert emails into numerical features while reducing the effect of frequent words.

Normalization: Optional, depending on model.

Choosing the Model

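Naïve Bayes: Fast to train and works well for text (especially with word counts).

SVM: Often more accurate, but kernel SVMs train slowly on large text data; a linear SVM is the usual choice for sparse text features.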

👉 For large-scale email classification, Multinomial Naïve Bayes is a good first choice due to its efficiency and performance with text data.

Handling Class Imbalance

Use class weights (for SVM) or resampling techniques (SMOTE/undersampling), applied to the training data only so that the test set keeps the real class ratio.

Alternatively, tune the decision threshold to improve recall for the minority class (spam).

Evaluation Metrics

Accuracy alone is misleading with imbalance.

Use Precision, Recall, F1-score, and ROC-AUC.

High recall for spam is important to catch unwanted emails, while high precision keeps legitimate emails out of the spam folder.
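
A small sketch shows why accuracy alone is misleading here. The labels are hypothetical: 95 legitimate emails and 5 spam, scored against a trivial "model" that marks everything as legitimate.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced labels: 0 = not spam, 1 = spam
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a trivial classifier that never predicts spam

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)

print("Accuracy:", acc)
print("Spam recall:", rec)
print("Spam precision:", prec)
```

The accuracy looks high even though not a single spam email is caught, which is why recall, precision, and F1 for the spam class are reported alongside it.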

Business Impact

Reduces spam exposure, improving productivity and security.

Saves employees’ time, prevents phishing/malware attacks.

Builds trust with customers by ensuring legitimate communications aren’t marked as spam.

In [5]:
# Import libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import RandomOverSampler

# Load a text dataset (simulate spam detection: spam categories vs ham categories)
categories = ['rec.autos', 'talk.politics.misc']  # simulating ham vs spam
data = fetch_20newsgroups(subset='all', categories=categories)

# Handle missing values: replace None with empty string
texts = [doc if doc is not None else "" for doc in data.data]
y = data.target

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(texts)

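# Handle class imbalance with oversampling
# Caution: resampling before the train-test split lets duplicated rows land in
# both sets, which can inflate the scores; in practice, resample the training split only.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)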

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=42
)

# Train Naïve Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predictions
y_pred = nb_model.predict(X_test)
y_prob = nb_model.predict_proba(X_test)[:, 1]

# Evaluation
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


Classification Report:

                    precision    recall  f1-score   support

         rec.autos       1.00      0.99      0.99       289
talk.politics.misc       0.99      1.00      0.99       305

          accuracy                           0.99       594
         macro avg       0.99      0.99      0.99       594
      weighted avg       0.99      0.99      0.99       594

ROC-AUC Score: 0.9997050314822168
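
The cell above fits TF-IDF and oversamples before splitting, so the test set is not fully independent of the training data. The following is a minimal sketch of a stricter ordering (split, then oversample the training part, then fit). It runs on a hypothetical toy corpus generated in code, with made-up vocabularies and two emails with missing text, and it uses plain NumPy sampling in place of `RandomOverSampler` so that it needs only scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 90 legitimate emails, 2 with missing text, 10 spam
shared = ["please", "today", "reply"]
ham_words = ["meeting", "report", "schedule", "project", "invoice", "team"]
spam_words = ["free", "winner", "prize", "click", "offer", "cash"]

def make_email(i, words):
    return f"{shared[i % 3]} {words[i % 6]} {words[(i + 1) % 6]} {shared[(i + 1) % 3]}"

raw = [make_email(i, ham_words) for i in range(90)] + [None, None]
raw += [make_email(i, spam_words) for i in range(10)]
labels = np.array([0] * 92 + [1] * 10)  # 0 = ham, 1 = spam

# Missing text becomes an empty string
texts = np.array([t if t is not None else "" for t in raw], dtype=object)

# 1. Split first (stratified), so the test set keeps the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

# 2. Oversample spam in the training portion only
rng = np.random.default_rng(42)
spam_idx = np.flatnonzero(y_train == 1)
ham_idx = np.flatnonzero(y_train == 0)
extra = rng.choice(spam_idx, size=len(ham_idx) - len(spam_idx), replace=True)
train_idx = np.concatenate([np.arange(len(y_train)), extra])

# 3. The vectorizer and classifier see training text only
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train[train_idx], y_train[train_idx])

y_pred = model.predict(X_test)
spam_recall = recall_score(y_test, y_pred, pos_label=1)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
print("Spam recall:", spam_recall)
```

Because the toy vocabularies barely overlap, the scores on this corpus say nothing about real email; the sketch is meant to show the order of operations, which carries over to the 20 Newsgroups setup above.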
