1] What is a Support Vector Machine (SVM), and how does it work?
- A Support Vector Machine (SVM) is a supervised learning algorithm used mainly for classification, and also for regression. It works by finding the hyperplane that separates the classes with the maximum margin; the boundary is determined only by the training points closest to it, called support vectors. If the data is not linearly separable, SVM uses kernel functions such as polynomial or RBF to implicitly map it into a higher-dimensional space where separation is possible. Its strengths are accuracy and effectiveness on high-dimensional data, though training can be computationally heavy on large datasets and it requires careful tuning of parameters such as C and the kernel settings.

2] Explain the difference between Hard Margin and Soft Margin SVM.
- In a Hard Margin SVM, the algorithm tries to find a hyperplane that perfectly separates the data into two classes without allowing any misclassification. This approach works only when the data is linearly separable and there is a clear gap between the classes. While it ensures a strict separation, it is very sensitive to noise and outliers, because even a single wrongly placed point can make perfect separation impossible.

- A Soft Margin SVM, on the other hand, allows some misclassifications in order to find a balance between maximizing the margin and minimizing classification errors. This flexibility is controlled by a parameter (usually called C), which determines how heavily margin violations and misclassified points are penalized. A higher C penalizes errors more strongly, which narrows the margin and can lead to overfitting, while a lower C tolerates more errors in exchange for a wider margin and often generalizes better. Because real-world data is rarely perfectly separable, Soft Margin SVM is more commonly used than Hard Margin SVM.
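The effect of C can be illustrated with a minimal sketch. The dataset below is synthetic (two overlapping clusters from `make_blobs`), and the three C values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping 2-D clusters, so no hyperplane separates them perfectly.
X, y = make_blobs(n_samples=200, centers=[[-1.0, 0.0], [1.0, 0.0]],
                  cluster_std=1.0, random_state=0)

sv_counts = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # Points on or inside the margin (or misclassified) become support vectors.
    sv_counts[C] = int(model.n_support_.sum())
    print(f"C={C:<6} support vectors={sv_counts[C]:3d} "
          f"train accuracy={model.score(X, y):.2f}")
```

A small C widens the margin, so more points fall inside it and count as support vectors; a large C narrows the margin and typically leaves fewer.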

3] What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
- The kernel trick in SVM is a mathematical technique that allows the algorithm to handle data that is not linearly separable in its original space. Instead of explicitly transforming the data into a higher-dimensional space, the kernel trick uses a kernel function to compute inner products (a measure of similarity) between pairs of data points as if they had been mapped to that higher-dimensional space. This makes it possible for SVM to construct a non-linear decision boundary without the heavy computation that comes with directly performing the transformation.

One common example is the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel. The RBF kernel maps data into an infinite-dimensional space, making it particularly powerful for handling non-linear relationships. Its use case is in situations where class boundaries are curved or irregular rather than straight lines. For example, if you have data points arranged in concentric circles belonging to different classes, a linear hyperplane cannot separate them, but the RBF kernel can transform the data into a space where they become separable by a simple hyperplane. This makes it one of the most widely used kernels in real-world SVM applications.
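The concentric-circles case can be sketched as follows. The data comes from `make_circles`, and the sample size, noise level, and split are assumptions picked for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer circle = the other.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel:>6} kernel test accuracy: {scores[kernel]:.2f}")
```

The linear kernel has no straight line that splits the rings, while the RBF kernel can separate them.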

4] What is a Naïve Bayes Classifier, and why is it called “naïve”?
- A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ theorem, which calculates the probability of a class given a set of features. It is commonly used for classification tasks such as spam detection, sentiment analysis, and text categorization. The model works by estimating the likelihood of each class based on the individual features of the data, then predicting the class with the highest probability.

It is called “naïve” because it makes a strong simplifying assumption: that all features are conditionally independent of each other given the class label. In reality, features often have correlations, but the algorithm ignores them for the sake of simplicity. Despite this unrealistic assumption, Naïve Bayes performs surprisingly well in many practical applications, especially with high-dimensional data such as text, where the independence assumption is “good enough” to provide fast and accurate classification.
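The independence assumption can be made concrete with a small hand calculation. The priors and per-word likelihoods below are hypothetical numbers invented for illustration; they are not estimated from any dataset.

```python
# Hypothetical toy probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        # Independence given the class: multiply the per-word likelihoods.
        for word in words:
            score *= likelihoods[label][word]
        scores[label] = score
    total = sum(scores.values())  # normalizing constant P(words)
    return {label: score / total for label, score in scores.items()}


print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

Each word contributes one multiplicative factor per class, which is why training and prediction stay cheap even with very large vocabularies.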

5] Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?
- Gaussian Naïve Bayes is used when the features are continuous and are assumed to follow a normal (Gaussian) distribution within each class. It models each feature by estimating its mean and variance for every class and then uses these to calculate probabilities. A common use case is in medical data analysis, where measurements like blood pressure, weight, or height are continuous values that can reasonably be assumed to follow a bell-shaped distribution.

Multinomial Naïve Bayes is designed for discrete features, particularly those representing counts. Instead of modeling data with a continuous distribution, it estimates the probability of each feature from its relative frequency within each class. This makes it very popular in text classification tasks such as spam detection or topic classification, where features often represent the number of times a word appears in a document.

Bernoulli Naïve Bayes, on the other hand, is used when features are binary, meaning they take only two values such as 0 or 1. It is especially useful when the presence or absence of a feature matters more than its frequency. For instance, in document classification, Bernoulli Naïve Bayes would consider whether a word appears in a text at all, rather than how many times it appears, which is useful when word occurrence itself is more informative than word count.
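The three variants differ mainly in the kind of feature matrix they expect, which the following sketch tries to show. The tiny corpus, the labels, and the height/weight measurements are all hypothetical toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy corpus and labels (1 = spam, 0 = not spam).
docs = [
    "free prize free prize",
    "win free money now",
    "meeting agenda attached",
    "project meeting notes",
]
labels = np.array([1, 1, 0, 0])

# Word counts feed Multinomial NB; presence/absence flags feed Bernoulli NB.
counts = CountVectorizer().fit_transform(docs)
binary = CountVectorizer(binary=True).fit_transform(docs)
multinomial = MultinomialNB().fit(counts, labels)
bernoulli = BernoulliNB().fit(binary, labels)

# Continuous measurements feed Gaussian NB (hypothetical height in cm, weight in kg).
measurements = np.array([[150.0, 50.0], [155.0, 55.0], [180.0, 80.0], [185.0, 90.0]])
groups = np.array([0, 0, 1, 1])
gaussian = GaussianNB().fit(measurements, groups)

print("Multinomial:", multinomial.predict(counts))
print("Bernoulli:  ", bernoulli.predict(binary))
print("Gaussian:   ", gaussian.predict([[152.0, 52.0]]))
```

Note that `CountVectorizer(binary=True)` caps every count at 1, so the repeated word "free" in the first document matters to the multinomial model but not to the Bernoulli one.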

In [1]:
''' 6] Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
'''
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM Classifier with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print support vectors
print("Support Vectors:\n", svm_model.support_vectors_)


Model Accuracy: 1.0
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
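
The raw support-vector array does not say which class each row belongs to. As a small follow-up sketch, assuming the same split and model settings as the cell above, the fitted `SVC` also exposes `n_support_` (count per class) and `support_` (row indices into the training set):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)
svm_model = SVC(kernel="linear").fit(X_train, y_train)

# n_support_ holds the number of support vectors for each class, in class order.
for name, count in zip(iris.target_names, svm_model.n_support_):
    print(f"{name}: {count} support vectors")

# support_ holds the positions of the support vectors within X_train.
print("Total:", svm_model.support_vectors_.shape[0])
print("First training indices:", svm_model.support_[:5])
```

Classes that sit close to each other in feature space tend to contribute more support vectors than a class that is well separated from the rest.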


In [2]:
''' 7] Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
'''
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
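
The precision and recall figures in a report like this come straight from the confusion matrix. The sketch below, assuming the same split and model as the cell above, recomputes them for the benign class (label 1) by hand and compares against scikit-learn's own functions.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # "positive" here means label 1 (benign)

precision_benign = tp / (tp + fp)  # of everything predicted benign, how much was benign
recall_benign = tp / (tp + fn)     # of everything truly benign, how much was found

print(cm)
print("Precision (benign):", round(precision_benign, 2))
print("Recall (benign):   ", round(recall_benign, 2))
```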



In [3]:
''' 8] Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
● Print the best hyperparameters and accuracy
'''
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Train SVM using GridSearchCV
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Best hyperparameters
print("Best Hyperparameters:", grid.best_params_)

# Predict with best model
y_pred = grid.best_estimator_.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.7777777777777778
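
One plausible reason for the modest accuracy is that the Wine features are on very different scales (proline runs from the hundreds into the thousands, while hue stays near 1), and the RBF kernel is sensitive to feature scale. The sketch below is a possible variant, not the original solution: it adds a `StandardScaler` inside a pipeline so that each cross-validation fold is scaled using only its own training portion. The grid values are the same as above.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# make_pipeline names each step after its lowercased class, hence the "svc__" prefix.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy (scaled features):", scaled_accuracy)
```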


In [4]:
'''9] Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
'''
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the dataset (binary classification: 'sci.space' vs 'rec.autos')
categories = ['sci.space', 'rec.autos']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

X = newsgroups.data
y = newsgroups.target

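# Convert text to TF-IDF features (fitted on the full corpus before the split,
# so test documents also influence the vocabulary and IDF weights)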
vectorizer = TfidfVectorizer(stop_words='english')
X_tfidf = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42
)

# Train Naïve Bayes Classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities
y_prob = nb_model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC Score:", roc_auc)


ROC-AUC Score: 0.9993878175696358
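
A common way to keep test documents out of the vectorizer's vocabulary and IDF statistics is to wrap the vectorizer and classifier in a single pipeline and fit it on training text only. The sketch below uses a tiny hand-written corpus (hypothetical, and split by hand for brevity) instead of 20 Newsgroups so that it runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = space, 0 = autos).
train_texts = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle crew returned from the space station",
    "my car engine needs a new oil filter",
    "the dealer replaced the brakes and tires",
    "this sedan has a manual transmission and good mileage",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = [
    "a telescope on the satellite photographed the moon",
    "the engine and transmission were rebuilt by the dealer",
    "the crew docked with the station in orbit",
    "new tires improved the mileage of my car",
]
test_labels = [1, 0, 1, 0]

# Fitting the pipeline learns the vocabulary and IDF weights from training text only.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

test_prob = model.predict_proba(test_texts)[:, 1]
auc = roc_auc_score(test_labels, test_prob)
print("ROC-AUC Score:", auc)
```

Words that appear only in the test documents (here, "telescope") are simply ignored at prediction time, which mirrors how the model would treat unseen vocabulary in production.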


In [6]:
''' 10] Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
'''


'''Approach to Spam Classification

1. Preprocessing the data

Clean text (lowercasing, removing HTML tags, special characters, and stopwords).

Use TF-IDF vectorization to represent text, since it emphasizes important words over frequent ones.

Handle missing or incomplete data by dropping rows with no usable content or replacing with placeholders like "missing".

2. Model Choice

Naïve Bayes: Fast, efficient, and works well with high-dimensional text data.

SVM (with linear kernel): More powerful for complex vocabulary and subtle word patterns, often providing higher accuracy.

Decision: Start with SVM for accuracy, benchmark against Naïve Bayes for speed.

3. Handling Class Imbalance

Use class weights in SVM to penalize misclassification of minority class (spam).

Optionally apply resampling techniques like SMOTE (oversampling) or undersampling.

4. Evaluation Metrics

Go beyond accuracy since imbalance can make accuracy misleading.

Use Precision, Recall, F1-score, ROC-AUC.

Recall is especially critical, since missing spam (false negatives) is worse for business impact.

5. Business Impact

Improves user trust by reducing the chance of spam reaching inboxes.

Saves employee time by filtering junk emails automatically.

Protects against phishing and malware, reducing financial and reputational risks. '''

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, roc_auc_score

# 1. Load a stand-in dataset: two 20 Newsgroups categories act as a proxy for spam vs. not spam
# Targets follow alphabetical order: rec.autos -> 0, sci.space -> 1 (treated here as the spam-like class)
categories = ['sci.space', 'rec.autos']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = data.data
y = data.target

# 2. Handle missing values by replacing empty strings
X = ["missing" if text.strip() == "" else text for text in X]

# 3. Vectorize text with TF-IDF (fitted before the split, so test documents influence the features)
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# 4. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42, stratify=y)

# 5. Train SVM with class imbalance handling
svm_model = SVC(kernel='linear', class_weight='balanced', probability=True, random_state=42)
svm_model.fit(X_train, y_train)

# 6. Predictions
y_pred = svm_model.predict(X_test)
y_prob = svm_model.predict_proba(X_test)[:, 1]

# 7. Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=data.target_names))
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


Classification Report:
               precision    recall  f1-score   support

   rec.autos       0.92      0.93      0.92       297
   sci.space       0.93      0.92      0.92       297

    accuracy                           0.92       594
   macro avg       0.92      0.92      0.92       594
weighted avg       0.92      0.92      0.92       594

ROC-AUC Score: 0.965525059801154
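
The two newsgroup classes above are evenly balanced, so the run does not really exercise `class_weight='balanced'`. As a rough sketch of the imbalance step, the example below uses synthetic data from `make_classification` with about 5% positives standing in for rare spam; the sample size, feature counts, and class separation are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with roughly 5% positives (label 1), standing in for rare spam.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.95, 0.05], class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

recalls = {}
for weighting in (None, "balanced"):
    # "balanced" scales the penalty for each class inversely to its frequency.
    model = SVC(kernel="linear", class_weight=weighting).fit(X_train, y_train)
    recalls[weighting] = recall_score(y_test, model.predict(X_test))
    print(f"class_weight={weighting}: minority recall={recalls[weighting]:.2f}")
```

Higher minority recall usually comes at the cost of more false positives, so precision should be checked alongside it before settling on a weighting.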
