Question 1: What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It is especially well known for its performance in binary classification problems.

How It Works (Intuitively):

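SVM finds the decision boundary (called a hyperplane) that separates the classes in the feature space with the largest possible margin, that is, the greatest distance to the closest training points of each class. Those closest points are the support vectors, and they alone determine where the boundary lies.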
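
The margin idea can be illustrated with a minimal sketch, assuming a tiny hand-made 2D dataset (the four points below are hypothetical, chosen so the result is easy to check by eye). A very large C is used here to approximate a hard margin, and for a linear kernel the margin width is 2 / ||w||. By symmetry the boundary should sit midway between the two groups, at x1 = 1.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two classes separated along the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

w = model.coef_[0]        # normal vector of the separating hyperplane
b = model.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("Margin width:", margin_width)
print("Support vectors:\n", model.support_vectors_)
```

Points on either side of x1 = 1 are assigned to different classes, for example `model.predict([[0.5, 0.5]])` and `model.predict([[1.5, 0.5]])`.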

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Support Vector Machines (SVMs) aim to find the optimal separating hyperplane between classes. The key difference between Hard Margin and Soft Margin lies in how strictly they separate the data.

Hard Margin SVM

Assumes that the data is perfectly linearly separable.
No misclassifications are allowed — all points must be on the correct side of the margin.

Soft Margin SVM

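Allows some misclassifications or margin violations.
Introduces a penalty parameter C to control the trade-off between maximizing the margin and minimizing classification error. A large C penalizes violations heavily and approaches hard-margin behavior; a small C tolerates more violations in exchange for a wider margin.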
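
The effect of C can be seen in a short sketch, assuming two overlapping synthetic clusters from `make_blobs` (the dataset and the two C values are illustrative choices, not part of the original answer). It compares the margin width 2 / ||w|| and the number of support vectors for a small and a large C.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some margin violations are unavoidable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

def fit_and_describe(C):
    """Fit a linear soft-margin SVM and return (margin width, support vector count)."""
    model = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(model.coef_[0])
    return margin_width, int(model.n_support_.sum())

soft_margin, soft_sv = fit_and_describe(C=0.01)     # tolerant of violations
strict_margin, strict_sv = fit_and_describe(C=100)  # penalizes violations heavily

print(f"C=0.01: margin width {soft_margin:.3f}, {soft_sv} support vectors")
print(f"C=100:  margin width {strict_margin:.3f}, {strict_sv} support vectors")
```

The smaller C should give the wider margin; it typically also leaves more points inside the margin, and those points all count as support vectors.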

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.

The Kernel Trick is a technique used in Support Vector Machines (SVMs) that allows them to implicitly map input data into a higher-dimensional feature space where a linear decision boundary can be found, even if the data is not linearly separable in its original space. This is achieved without explicitly calculating the coordinates of the data in the higher-dimensional space, which saves significant computational cost. Instead, a kernel function computes the dot product of the transformed data points in the higher-dimensional space directly from the original lower-dimensional data.
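
The claim that a kernel returns a dot product in the higher-dimensional space can be checked numerically. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x . z)^2 rather than the RBF kernel, because its feature map is small enough to write out explicitly; the helper names `phi` and `poly_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 on 2D inputs."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """The same quantity computed directly in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3D feature space
implicit = poly_kernel(x, z)       # kernel evaluated on the 2D inputs

print(explicit, implicit)  # both equal 121 up to floating-point rounding
```

The kernel route never builds the 3D vectors, which is what makes very large (or infinite) feature spaces affordable.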

One example of a kernel is the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel: K(x, x') = exp(-gamma * ||x - x'||^2), where gamma controls how quickly similarity decays with distance.
RBF Kernel Use Case:
The RBF kernel is widely used for non-linear classification problems where the decision boundary is complex and cannot be represented by a straight line or plane in the original feature space. For instance, consider a dataset where two classes form concentric circles in a 2D plane. A linear SVM would fail to separate these classes. The RBF kernel corresponds to an implicit mapping into a very high-dimensional (in fact infinite-dimensional) feature space, where a hyperplane can separate the classes. Mapped back to the original 2D space, that hyperplane becomes a non-linear decision boundary, such as a circle or an ellipse. The kernel is particularly effective when the data exhibits complex, non-linear relationships between features.
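
The concentric-circles case can be reproduced with a minimal sketch, assuming synthetic data from `make_circles` (the sample size, noise level, and split are illustrative choices). It fits a linear and an RBF SVM on the same split and compares test accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

No straight line can isolate the inner circle, so the linear kernel should stay well below the RBF kernel on this data.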

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

A Naïve Bayes classifier is a probabilistic machine learning algorithm that uses Bayes' theorem to estimate the probability that a data point belongs to each class. It is called "naïve" because it assumes that all features are conditionally independent of each other given the class, which is often not true in real-world data. Despite this simplification, Naïve Bayes can be surprisingly effective, especially in text classification tasks.
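
The calculation behind the classifier can be sketched by hand. All probabilities below are hypothetical, picked only to illustrate the arithmetic: under the independence assumption, the score for each class is the prior multiplied by one likelihood per observed word, and the scores are then normalized.

```python
# Hypothetical probabilities, chosen only to illustrate the calculation.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Return P(class | words), treating words as independent given the class."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]  # the "naive" product of likelihoods
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "meeting"])
print(result)
```

For an email containing both words, the unnormalized scores are 0.4 * 0.8 * 0.1 = 0.032 for spam and 0.6 * 0.1 * 0.5 = 0.030 for ham, so the posterior for spam is 0.032 / 0.062, roughly 0.52.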

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

Naïve Bayes classifiers come in several variants, each suited to a different type of feature. Gaussian Naïve Bayes is used for continuous features and assumes each feature is normally distributed within each class (for example, physical measurements). Multinomial Naïve Bayes is used for discrete counts, such as word counts in text. Bernoulli Naïve Bayes is used for binary features, such as whether a word is present in a document or not.
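
A minimal sketch of matching each variant to its feature type follows. The three small arrays are hypothetical toy data, built so that the two classes are easy to tell apart.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 0, 1, 1, 1])

# Gaussian NB: continuous measurements.
X_continuous = np.array([[1.0], [1.2], [0.9], [3.0], [3.2], [2.9]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[1.1], [3.1]])

# Multinomial NB: non-negative counts, e.g. how often two words occur.
X_counts = np.array([[3, 0], [2, 0], [4, 1], [0, 3], [1, 4], [0, 2]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[5, 0], [0, 5]])

# Bernoulli NB: binary presence/absence indicators.
X_binary = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [1, 1]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0], [0, 1]])

print("Gaussian:   ", gaussian_pred)
print("Multinomial:", multinomial_pred)
print("Bernoulli:  ", bernoulli_pred)
```

In each case the first query point resembles class 0 and the second resembles class 1, so all three models should predict `[0 1]`.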



In [1]:
'''You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)'''

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM Classifier with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f" Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("\n Support Vectors:")
print(svm_model.support_vectors_)

# Print the indices (into X_train) of the support vectors, ordered by class
print("\n Support Vector Indices per Class:")
print(svm_model.support_)

# Print number of support vectors per class
print("\nNumber of Support Vectors per Class:")
print(svm_model.n_support_)


 Model Accuracy: 1.00

 Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]

 Support Vector Indices per Class:
[ 31  33  91  22  45  54  59  60  62  73  79  80 105 110   5  16  30  42
  68  81  87 101 112 113 116]

 Number of Support Vectors per Class:
[ 3 11 11]


In [2]:
'''Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
'''
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Print accuracy
print(f" Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Print classification report
print("\n Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))



 Accuracy: 0.97

 Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



In [5]:
'''Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy'''

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define SVM and parameter grid for GridSearch
svm = SVC(kernel='rbf')

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.001, 0.01, 0.1, 1]
}

# Setup GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best estimator and predictions
best_svm = grid_search.best_estimator_
y_pred = best_svm.predict(X_test)

# Output results
print(" Best Hyperparameters:", grid_search.best_params_)
print(f"Accuracy on test set: {accuracy_score(y_test, y_pred):.2f}")


 Best Hyperparameters: {'C': 100, 'gamma': 'scale'}
Accuracy on test set: 0.83
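
The Wine features are on very different scales, and RBF SVMs are sensitive to feature scale. A possible variant, sketched below under the assumption that standardizing the features is acceptable for this task, wraps a `StandardScaler` and the SVC in a pipeline so that scaling is re-fitted inside each cross-validation fold. The `svc__` prefixes are the step names that `make_pipeline` generates.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold scales on its own training part.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1, 1],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

scaled_accuracy = grid_search.score(X_test, y_test)
print("Best hyperparameters:", grid_search.best_params_)
print(f"Accuracy on test set: {scaled_accuracy:.2f}")
```

Same grid and split as the cell above; only the scaling step is new, so any difference in accuracy can be attributed to it.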


In [6]:
'''Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions'''

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load subset of 20 newsgroups (binary classification)
categories = ['comp.graphics', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data
y = newsgroups.target  # 0 or 1

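# Vectorize text data using TF-IDF
# Note: the vectorizer is fitted on all documents before the split, so the
# vocabulary and IDF weights also reflect the test documents. Fitting it on
# the training split only would give a stricter estimate.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)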

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# Train Multinomial Naive Bayes
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict probabilities for positive class
y_probs = model.predict_proba(X_test)[:, 1]

# Calculate ROC-AUC score
roc_auc = roc_auc_score(y_test, y_probs)

print(f" ROC-AUC score: {roc_auc:.3f}")



 ROC-AUC score: 0.989
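
One way to keep the vectorizer from seeing test documents is to put it in a pipeline with the classifier and fit the pipeline on training text only. The sketch below assumes a hypothetical miniature corpus in place of 20 Newsgroups so that it runs without a download; the texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus: 0 = graphics, 1 = medicine.
train_texts = [
    "render the image with more pixels",
    "the graphics card draws each pixel",
    "image rendering uses shaders",
    "the doctor examined the patient",
    "the patient received treatment",
    "doctors prescribe medicine for treatment",
]
train_labels = [0, 0, 0, 1, 1, 1]

test_texts = [
    "pixels in the rendered image",
    "treatment helped the patient",
    "graphics and shaders",
    "the doctor and the medicine",
]
test_labels = [0, 1, 0, 1]

# The vectorizer is fitted inside the pipeline, on training text only.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

# Probability of the positive class (label 1) for each test document.
y_probs = model.predict_proba(test_texts)[:, 1]
roc_auc = roc_auc_score(test_labels, y_probs)
print(f"ROC-AUC score: {roc_auc:.3f}")
```

The same pipeline object can be passed to `train_test_split` output or to cross-validation with raw text, which removes the need for a separate `fit_transform` step.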


In [7]:
'''Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
'''
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.utils import compute_sample_weight

# Load a subset of 20 newsgroups as proxy for spam (sci.crypt) and ham (talk.politics.misc)
categories = ['sci.crypt', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X_raw = data.data
y = data.target  # 0 or 1

# Simulate some missing data: replace 5% of samples with empty strings
rng = np.random.default_rng(42)
missing_indices = rng.choice(len(X_raw), size=int(0.05 * len(X_raw)), replace=False)
for idx in missing_indices:
    X_raw[idx] = ""

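# Text vectorization with TF-IDF (empty strings become all-zero rows)
# Note: the vectorizer is fitted before the split, so its vocabulary and IDF
# weights also reflect the test documents. In production, fit it on the
# training emails only.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(X_raw)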

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Handle class imbalance with sample weights (Naive Bayes doesn't have class_weight param)
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Train Multinomial Naive Bayes
model = MultinomialNB()
model.fit(X_train, y_train, sample_weight=sample_weights)

# Predict
y_pred = model.predict(X_test)
y_probs = model.predict_proba(X_test)[:, 1]

# Evaluate
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
roc_auc = roc_auc_score(y_test, y_probs)
print(f"ROC-AUC Score: {roc_auc:.3f}")


Classification Report:
                     precision    recall  f1-score   support

         sci.crypt       0.97      0.83      0.89       199
talk.politics.misc       0.82      0.97      0.88       155

          accuracy                           0.89       354
         macro avg       0.89      0.90      0.89       354
      weighted avg       0.90      0.89      0.89       354

ROC-AUC Score: 0.976
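
Real inboxes usually mark missing bodies as `None` or NaN rather than empty strings, and `TfidfVectorizer` cannot process those values directly. The sketch below is one possible way to combine the preprocessing, class weighting, and evaluation steps, assuming a hypothetical toy inbox (1 = spam, 0 = legitimate) and a hypothetical helper `clean_texts`. Average precision is used as the metric because it focuses on the minority (spam) class.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.utils.class_weight import compute_sample_weight

def clean_texts(texts):
    """Replace missing entries (None or NaN) with empty strings."""
    return [
        "" if t is None or (isinstance(t, float) and np.isnan(t)) else str(t)
        for t in texts
    ]

# Hypothetical toy inbox: spam (1) is the minority class, one body is missing.
train_texts = [
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow with the team", None, "quarterly budget review notes",
    "project deadline moved to friday",
    "win a free prize now", "free money click now",
]
train_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Balanced sample weights give both classes the same total weight.
weights = compute_sample_weight(class_weight="balanced", y=train_labels)

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(
    clean_texts(train_texts), train_labels,
    multinomialnb__sample_weight=weights,
)

test_texts = ["free prize click now", "agenda for the budget meeting", float("nan")]
test_labels = np.array([1, 0, 0])
spam_probs = pipeline.predict_proba(clean_texts(test_texts))[:, 1]

print("Spam probabilities:", spam_probs)
print("Average precision:", average_precision_score(test_labels, spam_probs))
```

An email with no text carries no evidence, so the model falls back on the class prior for it; whether such emails should instead be routed to a separate rule is a design decision outside this sketch.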
