# **Question 1:** What is a Support Vector Machine (SVM), and how does it work?

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used mainly for classification and also for regression tasks. It works by finding the best possible decision boundary (hyperplane) that separates data points of different classes.

In two-dimensional space, this hyperplane is simply a line. In three dimensions it is a plane, and in higher dimensions it is called a hyperplane.

**How it works:**

*Identify support vectors* – the data points that lie closest to the decision boundary. These are the most important points because they influence the position and orientation of the hyperplane.

*Maximize the margin* – SVM maximizes the distance between the hyperplane and the closest points of each class (the support vectors). A larger margin generally means better generalization to unseen data.

*Handle non-linear data* – If the data is not linearly separable, SVM uses the Kernel Trick to implicitly map the data into a higher-dimensional space where a linear separation becomes possible.

**Key Ideas:**

*Support Vectors:* The critical data points that determine the decision boundary.

*Margin:* The gap between the boundary and support vectors; SVM maximizes this margin.

*Kernels:* Functions (like linear, polynomial, RBF) that allow SVM to handle complex, non-linear patterns.

**In short:**

SVM works by finding an optimal hyperplane that separates classes with the widest possible margin, which tends to give good classification accuracy and robustness, even on complex datasets.
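The ideas above can be sketched on a tiny example. The six 2-D points below are made up for illustration, and a very large `C` is used as an assumption to approximate a hard margin. For a linear SVM, the width of the margin is 2 / ‖w‖, where w is the normal vector of the learned hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two linearly separable clusters (values chosen for illustration)
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("Support vectors:\n", clf.support_vectors_)
print(f"w = {w}, b = {b:.3f}")
print(f"Margin width: {margin_width:.3f}")
```

Only the points closest to the other class should appear as support vectors; moving any of the remaining points slightly would leave the boundary unchanged.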

# **Question 2:** Explain the difference between Hard Margin and Soft Margin SVM.

A Support Vector Machine (SVM) can be formulated in two main ways, depending on whether the dataset is perfectly separable or not: Hard Margin and Soft Margin.

1. **Hard Margin SVM**

* Assumes the dataset is linearly separable (i.e., classes can be separated by a straight line/hyperplane without errors).

* The decision boundary is chosen such that no data points fall inside the margin and no misclassification is allowed.

* It finds the hyperplane with the maximum margin while strictly separating classes.

**Limitations:** Very sensitive to noise and outliers – a single outlier can shift the boundary drastically, and if the classes overlap at all, no hard-margin solution exists.

2. **Soft Margin SVM**

* Used when the dataset is not perfectly separable (common in real-world data).

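* Introduces slack variables that allow some points to violate the margin or be misclassified, with a penalty parameter C controlling how heavily such violations are penalized. A large C behaves more like a hard margin; a small C tolerates more violations in exchange for a wider margin.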

*Balances between:*

* Maximizing the margin, and

* Minimizing classification errors.

**Advantage:** More robust to noise and works better with overlapping classes.
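For reference, the trade-off can be written as an optimization problem. This is the standard textbook formulation, not something specific to this document; the symbols w (weights), b (bias), ξ_i (slack variables), and labels y_i ∈ {−1, +1} are notation introduced here.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

Setting every ξ_i to zero and dropping the penalty term recovers the hard-margin problem, which has a solution only when the classes are linearly separable.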

**In short:**

* Hard Margin SVM → strict, no errors, works only for perfectly separable data.

* Soft Margin SVM → flexible, allows errors, suitable for real-world noisy data.
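The effect of `C` can be seen in a minimal sketch on synthetic data. The cluster centers, spread, and the two `C` values below are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two heavily overlapping clusters (centers and spread chosen for illustration)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=2.0, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_sv = int(clf.n_support_.sum())
    results[C] = (margin_width, n_sv)
    print(f"C={C:>6}: margin width={margin_width:.2f}, support vectors={n_sv}")
```

A smaller `C` should give a wider margin, usually with more support vectors, because margin violations are penalized less.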

# **Question 3:** What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

The Kernel Trick is a mathematical technique used in Support Vector Machines (SVM) to handle data that is not linearly separable in its original feature space.

Instead of explicitly transforming the data into a higher-dimensional space (which can be computationally expensive), the kernel trick uses a kernel function to compute the similarity (dot product) between data points in that higher-dimensional space without ever performing the actual transformation.

This allows SVM to create complex, non-linear decision boundaries while keeping the computations efficient.

**Example of a Kernel** – Radial Basis Function (RBF) Kernel

**Use Case:**

The RBF kernel is widely used when data is not linearly separable.
It implicitly corresponds to an infinite-dimensional feature space, making it possible to separate complex shapes and clusters.

Example: Image classification or handwriting recognition (digits that cannot be separated by a straight line).

**Other Common Kernels:**

Linear Kernel: Best when data is linearly separable.

Polynomial Kernel: Useful when relationships between features are polynomial in nature.

Sigmoid Kernel: Based on the tanh function, which resembles an activation function in a neural network.

**In short:**

The Kernel Trick allows SVM to efficiently create non-linear decision boundaries by using kernel functions like RBF, making it powerful for tasks where data is not linearly separable.
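A small numerical check can make the trick concrete. The sketch below uses a degree-2 polynomial kernel because its explicit feature map is short enough to write out; the helper `phi`, the two sample vectors, and the `gamma` value are illustrative assumptions. It also computes the RBF kernel, K(x, z) = exp(−γ‖x − z‖²), by hand and compares it with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = phi(x) @ phi(z)   # dot product computed in the 3-D feature space
implicit = (x @ z) ** 2      # same value from the kernel, using only the 2-D inputs
print(explicit, implicit)

# RBF kernel computed by hand and with scikit-learn (gamma chosen arbitrarily)
gamma = 0.5
manual_rbf = np.exp(-gamma * np.sum((x - z) ** 2))
sklearn_rbf = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
print(manual_rbf, sklearn_rbf)
```

The two polynomial values agree, which is the point of the trick: the kernel returns the feature-space dot product without ever building the feature vectors.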

# **Question 4:** What is a Naïve Bayes Classifier, and why is it called “naïve”?

**What is a Naïve Bayes Classifier?**

A Naïve Bayes Classifier is a simple but powerful machine learning algorithm used for classification tasks such as spam filtering, sentiment analysis, and document categorization.

It is called a probabilistic classifier because it predicts the class with the highest posterior probability given the features, computed using Bayes’ Theorem.

**Why is it called “naïve”?**

It is called “naïve” because it makes a strong assumption of conditional independence among features: given the class, each feature is assumed to contribute independently to the outcome, even though features in real-world data are often correlated.

Example: In email classification, the words “free” and “win” often appear together in spam. Naïve Bayes treats them as independent, which is not strictly true — hence “naïve.”

**Key Advantages:**

* Simple and fast to train.

* Works well with high-dimensional data (e.g., text classification).

* Performs surprisingly well despite the naïve independence assumption.

**In short:**

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It is called naïve because it assumes all features are conditionally independent given the class. This assumption is rarely true in practice, but it makes the algorithm computationally efficient and often effective.
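The calculation behind the classifier can be sketched by hand. Under the naïve assumption, P(class | words) is proportional to P(class) multiplied by the product of P(word | class) over the words. All probabilities below are made-up numbers for illustration, not estimates from real data.

```python
# Hypothetical probabilities, made up for illustration only
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"free": 0.20, "win": 0.10},
    "ham": {"free": 0.02, "win": 0.01},
}
words = ["free", "win"]  # words observed in the email to classify

# Naive assumption: P(words | class) is the product of per-word probabilities
unnormalized = {}
for label in prior:
    score = prior[label]
    for word in words:
        score *= likelihood[label][word]
    unnormalized[label] = score

# Bayes' Theorem: divide by the evidence so the posteriors sum to 1
evidence = sum(unnormalized.values())
posterior = {label: score / evidence for label, score in unnormalized.items()}
print(posterior)
```

With these numbers the email is assigned to spam, because both words are far more likely under the spam class than under the ham class.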

# **Question 5:** Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Naïve Bayes has several variants, each designed to work best with different types of data. The three most common are Gaussian, Multinomial, and Bernoulli Naïve Bayes.

1. **Gaussian Naïve Bayes**

Description: Assumes that, within each class, each feature follows a normal (Gaussian) distribution.

Best for: Continuous (real-valued) data.

**Example use case:**

Iris dataset (flower measurements like petal length, sepal width).

Medical diagnosis where measurements (e.g., blood pressure, temperature) are continuous.

2. **Multinomial Naïve Bayes**

Description: Works with discrete counts (frequency of events).

Best for: Features that represent counts or term frequencies.

**Example use case:**

Text classification (spam detection, topic categorization) using word counts.

Sentiment analysis with bag-of-words representation.

3. **Bernoulli Naïve Bayes**

Description: Designed for binary/boolean features (0 or 1, presence or absence).

Best for: Data where features indicate whether something occurs or not.

**Example use case:**

Document classification based on whether a word appears in the document (not how many times).

Recommendation systems where features are yes/no indicators.

**In short:**

* Use Gaussian NB for continuous features,

* Multinomial NB for word counts or frequency-based text features, and

* Bernoulli NB for binary features (word presence/absence).
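A minimal sketch of matching each variant to its feature type is shown below. The four-document corpus, the labels, and the continuous measurements are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Tiny made-up corpus: label 1 = spam, 0 = not spam
docs = ["win free prize free", "free win win money",
        "meeting agenda attached", "project meeting notes"]
labels = np.array([1, 1, 0, 0])

# Multinomial NB: features are word counts
count_vec = CountVectorizer().fit(docs)
X_counts = count_vec.transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB: features are word presence/absence (binary=True caps counts at 1)
binary_vec = CountVectorizer(binary=True).fit(docs)
X_binary = binary_vec.transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

# Gaussian NB: continuous features (made-up measurements, two per sample)
X_cont = np.array([[5.1, 0.2], [4.9, 0.3], [6.8, 2.1], [7.0, 2.3]])
gnb = GaussianNB().fit(X_cont, labels)

new_doc = ["free prize meeting"]
mnb_pred = mnb.predict(count_vec.transform(new_doc))
bnb_pred = bnb.predict(binary_vec.transform(new_doc))
gnb_pred = gnb.predict([[5.0, 0.25]])
print("Multinomial:", mnb_pred, "Bernoulli:", bnb_pred, "Gaussian:", gnb_pred)
```

The only difference between the two text models here is the feature encoding: the same vocabulary is used, but Bernoulli NB ignores how often a word occurs.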

# **Question 6:** Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize SVM classifier with linear kernel
svm_model = SVC(kernel='linear')

# Train the model
svm_model.fit(X_train, y_train)

# Make predictions
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("Support Vectors:")
print(svm_model.support_vectors_)

Model Accuracy: 1.00
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
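The 24 rows printed above are the support vectors across all three classes. As a supplementary sketch, the fitted model also exposes how many support vectors belong to each class (`n_support_`) and which training rows they are (`support_`). The block repeats the same split and model so that it runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

svm_model = SVC(kernel="linear").fit(X_train, y_train)

# n_support_ holds the number of support vectors for each class
for name, count in zip(iris.target_names, svm_model.n_support_):
    print(f"{name}: {count} support vectors")

# support_ holds the row indices (into X_train) of the support vectors
print("Indices in X_train:", svm_model.support_)
print("Share of training points that are support vectors:",
      round(len(svm_model.support_) / len(X_train), 2))
```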


# **Question 7:** Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

In [2]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Gaussian Naive Bayes model
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Print classification report
report = classification_report(y_test, y_pred, target_names=cancer.target_names)
print("Classification Report:\n")
print(report)

Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171
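The report above summarizes per-class rates. A confusion matrix shows the underlying counts, which matters here because a missed malignant case and a false alarm have very different costs. The sketch below repeats the same split and model so that it runs on its own.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

cancer = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=42)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Rows are true classes, columns are predicted classes,
# in the order of cancer.target_names (malignant, benign)
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class recall is the diagonal divided by the row totals
recall = cm.diagonal() / cm.sum(axis=1)
print({str(name): round(float(r), 2) for name, r in zip(cancer.target_names, recall)})
```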



# **Question 8:** Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

In [4]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # Using RBF kernel for tuning C and gamma
}

# Initialize SVM classifier
svm_model = SVC()

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=5, scoring='accuracy')

# Train the model with GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Evaluate accuracy on test set
y_pred = grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {accuracy:.2f}")

Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.78
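The 0.78 test accuracy above is probably held back by unscaled features: the RBF kernel depends on distances, and the Wine features span very different ranges. A possible variant, sketched below, standardizes the features inside a `Pipeline` before the SVM. The pipeline step names `scaler` and `svc` are arbitrary labels chosen here, and the grid is the same as above.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Same grid as above; the "svc__" prefix routes parameters to the SVC step
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

scaled_accuracy = grid_search.score(X_test, y_test)
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Test Set Accuracy (scaled features): {scaled_accuracy:.2f}")
```

Keeping the scaler inside the pipeline also prevents the validation folds from influencing the scaling statistics during the grid search.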


# **Question 9:** Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [5]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Load the 20 Newsgroups dataset (subset for simplicity)
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers','footers','quotes'))

X = newsgroups.data
y = newsgroups.target

# Convert text data into TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)

# Initialize Multinomial Naive Bayes classifier
nb_model = MultinomialNB()

# Train the model
nb_model.fit(X_train, y_train)

# Predict probabilities for ROC-AUC
y_prob = nb_model.predict_proba(X_test)

# Binarize the labels for multi-class ROC-AUC
y_test_bin = label_binarize(y_test, classes=[0, 1, 2, 3])

# Compute ROC-AUC score (macro-average)
roc_auc = roc_auc_score(y_test_bin, y_prob, average='macro', multi_class='ovr')
print(f"ROC-AUC Score: {roc_auc:.2f}")

ROC-AUC Score: 0.98
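As a side note, `roc_auc_score` also accepts integer class labels directly when `multi_class='ovr'` is set, so the `label_binarize` step is optional. The sketch below checks that both routes give the same macro-averaged score. It uses the bundled digits dataset as a stand-in so it runs without a download; this is an assumption made for convenience, not the dataset used above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import label_binarize

# Stand-in multi-class data with non-negative features (pixel intensities)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = MultinomialNB().fit(X_train, y_train)
y_prob = model.predict_proba(X_test)

# Option 1: binarize the labels first (the approach used above)
y_test_bin = label_binarize(y_test, classes=model.classes_)
auc_binarized = roc_auc_score(y_test_bin, y_prob, average="macro")

# Option 2: pass integer labels and let roc_auc_score handle one-vs-rest
auc_direct = roc_auc_score(y_test, y_prob, average="macro", multi_class="ovr")

print(f"Binarized: {auc_binarized:.4f}, direct: {auc_direct:.4f}")
```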


# **Question 10:** Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data.

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

1. **Preprocessing the Data**

**Handling missing data:**

* For missing text entries, either remove those emails or replace missing content with an empty string to avoid errors in vectorization.

* For other features (like metadata), impute missing numerical values with mean/median and categorical values with mode.

**Text cleaning and normalization:**

* Convert all text to lowercase, remove punctuation, numbers, and stopwords.

* Apply stemming or lemmatization to reduce words to their base forms.

**Text vectorization:**

* Use TF-IDF (Term Frequency-Inverse Document Frequency) or Count Vectorization to convert emails into numerical feature vectors.

* TF-IDF is preferred for handling diverse vocabulary as it downweights common words and highlights distinctive terms.
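The preprocessing steps above can be sketched as follows. The four emails are made up for illustration, with `None` standing in for a message whose body is missing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up emails; None stands for a message whose body is missing
emails = [
    "WIN a FREE prize now!!!",
    None,
    "Meeting moved to 3pm, agenda attached.",
    "Free entry: win cash prizes today",
]

# Replace missing bodies with an empty string so vectorization does not fail
cleaned = [text if text is not None else "" for text in emails]

# TfidfVectorizer lowercases text and tokenizes on word characters, which
# discards punctuation; stop_words="english" also drops common English words
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(cleaned)

print("Matrix shape:", X.shape)
print("Vocabulary:", sorted(vectorizer.vocabulary_))
```

The empty email becomes an all-zero row rather than raising an error. Stemming or lemmatization is not built into the vectorizer and would need a custom tokenizer.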

2. **Model Selection**

**Naïve Bayes (Multinomial):**

* Performs very well on text classification tasks.

* Efficient on high-dimensional, sparse data like emails.

* Assumes feature independence, which is a reasonable approximation for words in an email.

**SVM (Support Vector Machine):**

* Can also perform well on text; a linear kernel is usually sufficient for high-dimensional, sparse features.

* Kernel SVMs are often more computationally intensive to train on large datasets.

**Justification:**

* For spam detection, Multinomial Naïve Bayes is typically preferred due to simplicity, speed, and strong performance on text data, especially with bag-of-words features.

3. **Addressing Class Imbalance**

* Spam is often underrepresented compared to legitimate emails. To handle this:

  * Use resampling techniques:

    * Oversampling the minority class (e.g., SMOTE).

    * Undersampling the majority class.

  * Alternatively, apply class weighting in models (e.g., class_weight='balanced' in SVM).

  * Ensure evaluation metrics are chosen to account for imbalance.
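To make the class-weighting option concrete, the sketch below shows the weights that the `'balanced'` setting produces. The 90/10 label split is a made-up example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up label distribution: 90 legitimate emails (0) and 10 spam emails (1)
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights follow n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print({label: round(float(w), 3) for label, w in zip([0, 1], weights)})
```

Errors on the rare class are weighted more heavily, which is what `class_weight='balanced'` does inside an SVM. `MultinomialNB` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`, which can serve a similar purpose.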

4. **Evaluation Metrics**

* Accuracy alone is not sufficient due to class imbalance. Use:

  * Precision: Measures how many predicted spam emails are truly spam.

  * Recall (Sensitivity): Measures how many actual spam emails are correctly detected.

  * F1-score: Harmonic mean of precision and recall; balances false positives and false negatives.

  * ROC-AUC: Measures overall discrimination ability of the model between spam and non-spam.

* A confusion matrix can also provide detailed insights into true positives, false positives, etc.
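The relationship between the confusion matrix and these metrics can be sketched on a small example. The ten labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 1 = spam, 0 = not spam
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")

# The hand-computed values should agree with scikit-learn's helpers
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

In a spam filter, a false positive (a legitimate email sent to spam) is often costlier than a false negative, so precision on the spam class usually deserves particular attention.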

5. **Business Impact**

* Automating spam detection:

  * Reduces workload for employees manually sorting emails.

  * Protects users from phishing, scams, and malicious content.

  * Improves user satisfaction by keeping inboxes clean.

  * Helps maintain organizational security and trust.

* A robust and well-tuned model ensures fewer legitimate emails are incorrectly flagged while effectively filtering spam, enhancing operational efficiency and safety.

In [6]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
import numpy as np

# Load a subset of 20 Newsgroups dataset (simulating spam vs non-spam)
categories = ['rec.sport.hockey', 'sci.med', 'talk.politics.misc', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers','footers','quotes'))

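# Simulate binary labels for demonstration. Target indices follow alphabetical category order:
# 0 = Not Spam (comp.graphics, rec.sport.hockey), 1 = Spam (sci.med, talk.politics.misc)
y = np.array([0 if cat in [0, 1] else 1 for cat in newsgroups.target])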
X = newsgroups.data

# Handle missing data by replacing None with empty string
X = [text if text is not None else '' for text in X]

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.3, random_state=42, stratify=y)

# Handle class imbalance using SMOTE (applied to the training set only;
# the simulated classes here are already close to balanced)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

# Train Multinomial Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_res, y_train_res)

# Make predictions
y_pred = nb_model.predict(X_test)
y_prob = nb_model.predict_proba(X_test)[:, 1]

# Evaluate performance
print("Classification Report:\n")
print(classification_report(y_test, y_pred))
roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")

Classification Report:

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       592
           1       0.94      0.95      0.94       530

    accuracy                           0.95      1122
   macro avg       0.95      0.95      0.95      1122
weighted avg       0.95      0.95      0.95      1122

ROC-AUC Score: 0.99
