Question 1: What is a Support Vector Machine (SVM), and how does it work?

**Support Vector Machine (SVM):-**
A Support Vector Machine (SVM) is a supervised machine learning algorithm primarily used for classification and regression tasks. It works by finding an optimal hyperplane that best separates data points of different classes in a feature space. The goal of the SVM is to maximize the margin, which is the distance between the hyperplane and the closest data points from each class. These closest points are called support vectors, and they are crucial in defining the position and orientation of the hyperplane.

How SVM Works:

a. Hyperplane: This is the decision boundary that separates different classes. In 2D space, it is a line; in higher dimensions, it becomes a plane or hyperplane.

b. Support Vectors: These are the data points nearest to the hyperplane, and they determine its position and orientation. The model is trained on all points, but only the support vectors affect the final decision boundary; removing any other point leaves the boundary unchanged.

c. Margin Maximization: SVM chooses the hyperplane that maximizes the margin, providing the largest separation between classes, which generally improves the model’s generalization on new data.

d. Linearly Separable Data: When data can be separated by a straight line or flat hyperplane, SVM finds the best linear boundary.

e. Non-linear Data and Kernel Trick: If data is not linearly separable, SVM uses kernel functions (e.g., polynomial, radial basis function (RBF), sigmoid) to map data into a higher-dimensional space where a linear separator can be found. This approach is called the "kernel trick."

f. Soft Margin: To handle noisy data and allow some misclassifications, SVM uses a soft margin, balancing margin maximization and classification errors via a regularization parameter.
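
The points above can be illustrated with a small sketch. The six 2D points below are made up for illustration, and a very large C is used as an assumption to approximate a hard margin on separable data. The margin width is computed as 2 / ||w||, where w is the learned weight vector.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (hypothetical values for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin when the data is separable
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("Support vectors:\n", clf.support_vectors_)
print("Margin width:", margin_width)
```

Only the points listed in `support_vectors_` lie on the margin boundaries; the remaining points could be moved further away from the boundary without changing the fitted hyperplane.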

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

**Hard Margin SVM:-**

a. Aims to find a hyperplane that strictly separates the classes with no misclassification allowed.

b. Maximizes the margin between the two classes, with data points lying exactly on the margin boundary or outside it.

c. Requires data to be perfectly linearly separable without any noise or outliers.

d. Is highly sensitive to outliers: a single point on the wrong side of the boundary makes the problem infeasible (no separating hyperplane exists), and a single outlier close to the other class can shrink the margin drastically.

e. Does not involve a regularization parameter for balancing margin and errors since misclassification is not permitted.

f. Optimization objective: minimize the norm of the weight vector subject to every point being correctly classified with a margin of at least 1.
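
Written out, the hard margin objective in point f takes the following standard form. The notation is an assumption, not taken from the text above: labels are coded as y_i in {-1, +1}, w is the weight vector, and b is the bias.

```latex
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Minimizing the norm of w is equivalent to maximizing the margin, because the margin width equals 2 / ||w||.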

**Soft Margin SVM:-**

a. Allows some misclassifications or margin violations by introducing slack variables.

b. Balances between maximizing margin and minimizing classification errors using a regularization parameter C.

c. Suitable for non-linearly separable or noisy datasets where perfect separation is not possible.

d. The parameter C controls the trade-off:

A high C emphasizes fewer misclassifications (closer to hard margin).

A low C allows more errors but leads to a wider margin and potentially better generalization.

e. More flexible and robust for real-world data.

f. Optimization involves minimizing a combination of the margin (weight norm) and the sum of slack variables weighted by C.
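
Written out, the soft margin objective in point f takes the following standard form. The notation is an assumption, not taken from the text above: labels are coded as y_i in {-1, +1}, and each slack variable xi_i measures how far point i violates the margin.

```latex
\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

A point with xi_i = 0 respects the margin, 0 < xi_i <= 1 means it lies inside the margin but on the correct side, and xi_i > 1 means it is misclassified. As C grows, violations become more expensive and the solution approaches the hard margin case.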

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

**Kernel Trick in SVM:-**
The Kernel Trick in Support Vector Machines (SVM) is a technique that allows SVMs to efficiently perform classification on non-linearly separable data by implicitly mapping the input data into a higher-dimensional feature space. Instead of explicitly transforming the data into this higher space (which could be computationally expensive or infeasible), the kernel trick uses a kernel function to compute the inner products of data points in this high-dimensional space directly, enabling the SVM to find a linear separating hyperplane there.

How the Kernel Trick Works:

a. It replaces the dot product in the input feature space with a kernel function that corresponds to the dot product in a transformed, higher-dimensional space.

b. This implicit mapping helps convert a non-linear classification problem into a linear one in the new space.

c. The kernel function effectively measures similarity between pairs of data points in this higher-dimensional space without performing the expensive transformation explicitly.
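
A small numeric check can make this concrete. The sketch below uses the degree-2 polynomial kernel K(x, y) = (x · y)^2 in two dimensions, chosen here only because its explicit feature map is short enough to write down; the function names `phi` and `poly2_kernel` are hypothetical helpers.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, y):
    """The same inner product, computed without ever building phi(x)."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(y))  # inner product in the 3D feature space
implicit = poly2_kernel(x, y)      # kernel evaluated in the original 2D space

print(explicit, implicit)  # both are 16 (up to floating-point rounding)
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors. For kernels such as RBF, whose feature space is infinite-dimensional, the implicit route is the only practical one.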

Example Kernel: Radial Basis Function (RBF) Kernel

a. The RBF kernel, also known as the Gaussian kernel, is defined as:

$$K(x, y) = \exp\left(-\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}}\right)$$

b. It measures similarity based on the distance between two points x and y and can handle very complex, localized decision boundaries.

c. Use case: The RBF kernel is widely used for datasets where the relationship between class labels and features is non-linear and complicated. It works well in scenarios like image classification, bioinformatics, and speech recognition because it can create flexible, smooth boundaries that separate classes in a non-linear manner.
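
As a rough check of the definition, the sketch below evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. scikit-learn parameterizes the kernel with `gamma` rather than sigma; the conversion gamma = 1 / (2 * sigma^2) follows from the formula above. The sample points and the value of sigma are arbitrary.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, y, sigma=1.0):
    """RBF (Gaussian) kernel written in terms of sigma."""
    sq_dist = np.sum((x - y) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])
sigma = 1.5

manual = rbf(x, y, sigma)

# scikit-learn computes exp(-gamma * ||x - y||^2), so gamma = 1 / (2 * sigma^2)
gamma = 1.0 / (2.0 * sigma ** 2)
library = rbf_kernel(x.reshape(1, -1), y.reshape(1, -1), gamma=gamma)[0, 0]

print(manual, library)
```

The kernel equals 1 when the two points coincide and decays toward 0 as they move apart; a smaller sigma (larger gamma) makes the decay faster and the decision boundary more localized.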

Question 4: What is a Naive Bayes Classifier, and why is it called “naive”?

**Naive Bayes Classifier:-**
A Naive Bayes Classifier is a supervised machine learning classification algorithm based on Bayes' Theorem. It predicts the class of a data point by calculating the probabilities of each class given the features of the data and selects the class with the highest posterior probability.

It is called “naive” because:

a. It assumes that all features are conditionally independent of one another given the class label. This assumption is "naïve" because features in real data are often correlated or dependent.

b. Despite this simplification, Naïve Bayes classifiers perform surprisingly well in many practical applications, especially in text classification, spam filtering, and sentiment analysis.

c. The algorithm applies Bayes' Theorem and models the likelihood by multiplying the conditional probabilities of each feature independently.
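
In symbols, the prediction rule described above can be written as follows. The notation is an assumption: c ranges over the classes and x_1, ..., x_n are the feature values of one data point.

```latex
P(c \mid x_1, \dots, x_n)
  = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\qquad\Longrightarrow\qquad
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The first step is Bayes' Theorem; the product form is where the independence assumption enters. The denominator is the same for every class, so it can be dropped when picking the most probable class.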

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

**Gaussian Naïve Bayes:-**

a. Assumes that the features follow a Gaussian (normal) distribution.

b. Suitable for continuous-valued features (e.g., height, weight, temperature).

c. The model uses the mean and variance of each feature per class to calculate probabilities.

d. Use case: When the data features are continuous and approximately normally distributed, such as in sensor data, health metrics, or many classical classification datasets like Iris.

**Multinomial Naïve Bayes:-**

a. Assumes that feature vectors represent discrete counts or frequencies (e.g., word counts in documents).

b. Applies the multinomial distribution to model the likelihood of feature occurrence.

c. Commonly applied in text classification tasks using bag-of-words or frequency-based features.

d. Use case: When features represent counts or frequencies, such as the number of times a word appears in a document for spam filtering or sentiment analysis.

**Bernoulli Naïve Bayes:-**

a. Assumes binary (boolean) features, each indicating the presence or absence of an attribute (1 if present, 0 if absent).

b. Models feature occurrence using the Bernoulli distribution.

c. Suitable for binary/boolean features; unlike Multinomial Naïve Bayes, it also explicitly penalizes the absence of a feature.

d. Use case: When features are binary, such as classifying emails as spam based on presence/absence of particular keywords or detecting user clicks (yes/no) in web applications.
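
The pairing of variant and feature type can be sketched as follows. All data here is synthetic and generated only for illustration: normally distributed values for the Gaussian variant, Poisson counts for the Multinomial variant, and thresholded counts for the Bernoulli variant.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: the two classes have different means
X_cont = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)),
                    rng.normal(3.0, 1.0, size=(50, 3))])

# Count features: the two classes favor different "words"
X_counts = np.vstack([rng.poisson([5, 1, 1], size=(50, 3)),
                      rng.poisson([1, 1, 5], size=(50, 3))])

# Binary features: presence/absence derived from the counts
X_bin = (X_counts > 2).astype(int)

scores = {}
for name, model, X in [("Gaussian", GaussianNB(), X_cont),
                       ("Multinomial", MultinomialNB(), X_counts),
                       ("Bernoulli", BernoulliNB(), X_bin)]:
    model.fit(X, y)
    scores[name] = model.score(X, y)  # training accuracy, for illustration only
    print(f"{name} Naive Bayes training accuracy: {scores[name]:.2f}")
```

The scores are training accuracies on easy synthetic data, so they say nothing about real-world performance; the point is only which input type each variant expects.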

Question 6: Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.




In [1]:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

# Predict on test set
y_pred = svm_clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")

# Print number of support vectors
print(f"Number of support vectors: {svm_clf.support_vectors_.shape[0]}")


Model accuracy: 1.00
Number of support vectors: 24
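
The question also asks for the support vectors themselves, while the cell above prints only their count. A minimal sketch that prints them, assuming the same split and settings as above, could look like this:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data, split, and model settings as in the cell above
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
svm_clf = SVC(kernel="linear").fit(X_train, y_train)

# Number of support vectors contributed by each class
per_class = {name: int(n) for name, n in zip(iris.target_names, svm_clf.n_support_)}
print("Support vectors per class:", per_class)

# Row indices of the support vectors within X_train
print("Indices in training set:", svm_clf.support_)

# The support vectors themselves (one row per vector, one column per feature)
print("Support vectors:\n", svm_clf.support_vectors_)
```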


Question 7: Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.


In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test data
y_pred = gnb.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred, target_names=cancer.target_names))


              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



Question 8: Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.


In [3]:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define SVM classifier
svm = SVC()

# Hyperparameter grid for C and gamma (using RBF kernel)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto', 0.1, 1, 10],
    'kernel': ['rbf']
}

# Setup GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')

# Train with grid search
grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Prediction and accuracy on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Best hyperparameters: {best_params}")
print(f"Test set accuracy: {accuracy:.2f}")


Best hyperparameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}
Test set accuracy: 0.78
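
A likely reason for the modest test accuracy above is that the Wine features are on very different scales, and the RBF kernel depends on distances between points. The sketch below is one possible variation, not part of the original answer: it adds a `StandardScaler` inside a `Pipeline` so that scaling is fitted only on the training folds. The grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling inside the pipeline avoids leaking test-fold statistics into training
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

scaled_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best pipeline
print(f"Best hyperparameters: {search.best_params_}")
print(f"Test set accuracy with scaling: {scaled_accuracy:.2f}")
```

With standardized features, the best C and gamma usually differ from those found on the raw data, so the two grids are not directly comparable.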


Question 9: Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [4]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Load text dataset (subset for speed)
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# Feature extraction using TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict probabilities on test set
y_prob = model.predict_proba(X_test)

# Binarize the output labels for ROC-AUC computation
y_test_binarized = label_binarize(y_test, classes=[0, 1, 2, 3])

# Calculate ROC-AUC score (one-vs-rest)
roc_auc = roc_auc_score(y_test_binarized, y_prob, average='macro', multi_class='ovr')

print(f"ROC-AUC score: {roc_auc:.4f}")


ROC-AUC score: 0.9961


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

Data Preprocessing

a. Text Vectorization: Use techniques like TF-IDF vectorization or word embeddings to convert email text into numeric feature vectors capturing important words and their relevancy.

b. Handling Missing Data: Since emails may be incomplete, handle missing or empty fields by imputing with placeholders or ignoring them during vectorization. Also clean text by removing stopwords, punctuation, and normalizing text.

c. Feature Engineering: Extract additional features like presence of links, sender reputation, punctuation patterns, etc., if available.

Model Choice: SVM vs Naïve Bayes

a. Naïve Bayes: Efficient for high-dimensional sparse data such as text (bag-of-words), and works well with smaller datasets. Assumes feature independence which may not always hold but handles diversity in vocabulary.

b. SVM: Often more powerful with complex boundaries and better able to handle overlapping classes, especially with the kernel trick. However, SVMs can take longer to train on large corpora and are sensitive to hyperparameter tuning, especially in imbalanced settings.

c. Justification: Start with Multinomial Naïve Bayes for text because it is efficient and performs reasonably well. If performance needs to improve, consider an SVM with a linear or RBF kernel as an alternative, or combine the two models in an ensemble.

Addressing Class Imbalance

a. Use techniques like:

Resampling: Oversample minority class (spam) or undersample majority class (non-spam).

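Class Weights: Assign a higher misclassification cost to the minority class. In scikit-learn, SVMs support this through the class_weight parameter; the Naïve Bayes classifiers have no such parameter, so adjust the class priors (class_prior) or pass sample_weight to fit instead.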

Synthetic Data: Generate synthetic spam samples using techniques like SMOTE.

b. These methods help prevent the model from being biased towards the majority legitimate emails.
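
The class-weight option can be sketched as follows. The ten short emails are invented for illustration, and `LinearSVC` with `class_weight="balanced"` is one possible choice rather than the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced corpus: 8 legitimate emails (0) and 2 spam emails (1)
texts = [
    "meeting agenda for monday", "please review the attached report",
    "lunch tomorrow at noon", "project status update",
    "invoice for last month", "notes from the team call",
    "can you send the slides", "schedule for next week",
    "win a free prize now", "free money click here now",
]
labels = np.array([0] * 8 + [1] * 2)

# "balanced" weights are n_samples / (n_classes * class_count):
# 10 / (2 * 8) = 0.625 for legitimate mail, 10 / (2 * 2) = 2.5 for spam
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=labels)
print("Class weights:", weights)

# Errors on the rare spam class are penalized more heavily during training
model = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
model.fit(texts, labels)
print(model.predict(["claim your free prize"]))
```

Reweighting changes the training objective only; the evaluation set should keep the original class proportions so that reported metrics reflect real traffic.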

Evaluation Metrics

a. Accuracy alone is misleading in imbalanced scenarios.

b. Use metrics capturing class-wise performance:

c. Precision: Proportion of detected spam that is actually spam (important to minimize false alarms).

d. Recall: Proportion of actual spam emails that were detected (important to catch as much spam as possible).

e. F1-score: Harmonic mean of precision and recall.

f. ROC-AUC: Overall capability of distinguishing spam from non-spam across thresholds.

g. Also monitor the confusion matrix to understand the types of errors (false positives vs. false negatives).
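
A small worked example of these metrics is shown below. The labels are invented: 13 emails, of which 5 are spam (1), with one false alarm and two missed spam emails.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions (0 = not spam, 1 = spam)
y_true = [0] * 8 + [1] * 5
y_pred = [0] * 7 + [1] + [1] * 3 + [0] * 2

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 5
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 10 / 13

print(cm)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, "
      f"F1: {f1:.2f}, Accuracy: {accuracy:.2f}")
```

Here accuracy (about 0.77) looks better than recall (0.60), which shows how accuracy can hide missed spam when legitimate emails dominate.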

Business Impact

a. Accurate spam classification reduces user exposure to unwanted emails, improving user experience and productivity.

b. Minimizing false positives avoids legitimate emails being flagged as spam, preserving customer trust.

c. Automated spam filtering saves manual moderation costs and enables scalable email services.

d. Enhances security by blocking phishing and malicious emails, reducing company risks.

e. Overall, robust spam classification directly supports customer satisfaction, cost savings, and security compliance.