Question 1: What is a Support Vector Machine (SVM), and how does it work?

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The main goal of an SVM classifier is to find the decision boundary that maximizes the margin between the two classes. A larger margin generally helps the model perform better on new and unseen data.

Key Concepts of Support Vector Machine
Hyperplane: The decision boundary that separates the classes in feature space. In linear classification it is given by the equation w·x + b = 0.

Support Vectors: The closest data points to the hyperplane, crucial for determining the hyperplane and margin in SVM.

Margin: The distance between the hyperplane and the support vectors. SVM aims to maximize this margin for better classification performance.

Kernel: A function that measures the similarity of two points as if they had been mapped to a higher-dimensional space, enabling SVM to handle non-linearly separable data.

Hard Margin: A maximum-margin hyperplane that perfectly separates the data without misclassifications.
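
The concepts above can be sketched on a small example. The snippet below is a minimal illustration, not part of the original answer: the six toy points are made-up values, and a very large C is used as an assumption to approximate a hard margin. It reads the hyperplane parameters w and b from a fitted linear SVC and computes the margin width as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical values chosen for illustration).
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

Only the points listed in `support_vectors_` determine w and b; moving any other point without crossing the margin leaves the boundary unchanged.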

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Ans:- Hard Margin SVM requires strictly linearly separable data and cannot tolerate any misclassifications, which makes it sensitive to outliers. Soft Margin SVM relaxes this constraint: slack variables allow some points to be misclassified or to violate the margin, and a hyperparameter controls how heavily those violations are penalized, giving a more flexible decision boundary.

Hard Margin SVM
Strict Separation: Aims to find a hyperplane that perfectly separates the data points of different classes without any errors.

Maximizes Margin: Seeks the largest possible margin between the separating hyperplane and the nearest data points (support vectors).
Requires Linear Separability: Fails if the data is not perfectly linearly separable.

Sensitivity to Outliers: A single outlier can drastically influence the decision boundary, leading to a sensitive and potentially poorly generalizing model.

Soft Margin SVM

Tolerates Errors: Allows for some misclassifications or points to fall within the margin by introducing slack variables.
Flexibility: Offers a flexible decision boundary by balancing the trade-off between margin maximization and the number of misclassified points.
Robust to Noise: Handles datasets that are not perfectly linearly separable or contain outliers more effectively.
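Parameter Control (C): The degree of allowable error is controlled by a regularization parameter, commonly denoted as C, which determines how much weight is given to margin maximization versus error minimization. A large C penalizes violations heavily and tends toward a narrower margin, while a small C tolerates more violations in exchange for a wider margin.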
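
The role of C can be seen by fitting the same data with several values. This is a minimal sketch under stated assumptions: the cluster centers, spread, and the three C values are hypothetical choices for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two heavily overlapping clusters (hypothetical centers and spread).
X_demo, y_demo = make_blobs(
    n_samples=200, centers=[[0, 0], [2, 2]], cluster_std=1.5, random_state=0
)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X_demo, y_demo)
    # Store (number of support vectors, training accuracy) for each C.
    results[C] = (clf.support_vectors_.shape[0], clf.score(X_demo, y_demo))
    print(f"C={C}: support vectors={results[C][0]}, "
          f"training accuracy={results[C][1]:.3f}")
```

A smaller C typically leaves more points inside the margin, so more of them end up as support vectors; the exact counts depend on the data.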

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

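Ans:- The Kernel Trick in SVM is a computational method that lets a Support Vector Machine classify non-linearly separable data as if it had been mapped into a higher-dimensional feature space where it becomes linearly separable, without explicitly transforming the data points, thus avoiding high computational costs. One example is the Radial Basis Function (RBF, or Gaussian) kernel, K(x, z) = exp(-γ‖x - z‖²), a common choice when the boundary between classes is curved, such as one class forming a ring around another.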

**The Kernel Trick:**

**Implicit Mapping:** The trick is to implicitly map the data into a higher-dimensional space using a kernel function, rather than directly transforming each data point into that space.

**Computational Efficiency:** This approach avoids the computationally expensive process of transforming data into a higher dimension, because the kernel function returns the dot product the two points would have in that space while operating only on the original inputs.

**Handling Non-Linearity:** It enables SVMs to find a linear decision boundary in the higher-dimensional space to separate complex, non-linearly separable data in the original lower-dimensional space.
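
The implicit mapping can be checked numerically. The sketch below uses the homogeneous polynomial kernel of degree 2 as an illustrative choice (it is not named in the answer above); the helper names `phi` and `poly_kernel` are hypothetical. It shows that the kernel value computed from the 2-D inputs equals the dot product of the explicitly mapped 3-D vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (x1, x2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: K(u, v) = (u . v)^2."""
    return np.dot(u, v) ** 2

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

explicit = np.dot(phi(u), phi(v))  # dot product computed in the 3-D feature space
implicit = poly_kernel(u, v)       # same quantity computed from the 2-D inputs only

print(explicit, implicit)
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and only the kernel evaluation is practical.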

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Ans:- A Naïve Bayes Classifier is a supervised machine learning algorithm that uses Bayes' theorem to predict the probability of a class, such as a spam email or a particular sentiment. It is called "naïve" because it makes a simplifying, unrealistic assumption that all features used for classification are independent of each other, even though in reality, they often have dependencies.

Naïve Bayes Classifier:-

Probabilistic Classifier: It is a statistical classifier that predicts the probability of an event based on prior probabilities and observed evidence, following Bayes' Theorem.

Supervised Learning: It learns from labeled data (data where the correct outcome is known) to classify new, unseen data.

Generative Model: It models how the input features are distributed within each class, and uses those per-class distributions to score new inputs.

Text Classification: Naïve Bayes classifiers are particularly effective and widely used for text classification tasks, such as identifying spam emails or determining the sentiment of text.

Why is it called "naïve"?

The name "naïve" comes from the core assumption that all the features are conditionally independent, given the class label.

Feature Independence: The classifier assumes that, once the class is known, the presence or value of one feature does not affect the presence or value of another feature.

Unrealistic Assumption: In most real-world scenarios, this assumption does not hold true. For example, in classifying a fruit, the color (red) and shape (round) might be correlated (apples are often red and round), but a Naïve Bayes classifier treats them as if they are completely independent.


Simplified Calculation: Despite being unrealistic, this "naïve" assumption simplifies the complex calculations of Bayes' theorem, making the algorithm much easier and faster to compute.
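
Written out, the conditional independence assumption turns the joint likelihood into a product of per-feature terms. Here $y$ is the class label and $x_1, \dots, x_n$ are the feature values; these symbols are introduced for this illustration and do not appear in the text above.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Each factor $P(x_i \mid y)$ can be estimated separately from the training data, which is why training reduces to simple per-class, per-feature statistics.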


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?


Ans:- Gaussian Naïve Bayes is for continuous data that follows a normal distribution, Multinomial Naïve Bayes is for discrete data with multiple outcomes like word counts in text, and Bernoulli Naïve Bayes is for binary (0/1) data, such as the presence or absence of a word in a document.

**1. Gaussian Naïve Bayes**
Description: This variant assumes that continuous features in your dataset are drawn from a Gaussian (normal) distribution. It calculates the mean and standard deviation for each feature within each class.

When to use: Use it when your dataset contains continuous data that approximates a normal distribution. For example, classifying samples based on measurements such as height or weight.

**2. Multinomial Naïve Bayes**

Description: It is designed for discrete counts. Instead of binary presence/absence, it considers the frequency or count of a feature.
When to use: This variant is ideal for data where features represent counts or frequencies, making it excellent for text classification tasks where you might use term frequencies (word counts) to classify documents.

**3. Bernoulli Naïve Bayes**

Description: This algorithm handles binary or boolean features (0 or 1, True or False). It only cares about whether a feature is present or absent, not its frequency.
When to use: Use this for binary features, such as in document classification where you want to know if a specific word appears in a document, but not how many times.
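
The three variants can be compared side by side on matching toy inputs. This is a minimal sketch: all values below are made up for illustration, and each model is only evaluated on the data it was trained on, so the output says nothing about generalization.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y_small = np.array([0, 0, 1, 1])

# Continuous measurements (hypothetical height in cm, weight in kg).
X_cont = np.array([[150.0, 50.0], [155.0, 55.0], [180.0, 80.0], [185.0, 90.0]])
# Word counts per document for a hypothetical 3-word vocabulary.
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Presence/absence of the same words.
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_cont),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X_in) in models.items():
    model.fit(X_in, y_small)
    predictions[name] = model.predict(X_in)  # predictions on the training rows
    print(name, predictions[name])
```

The choice of variant follows the feature type: the same documents feed `MultinomialNB` as counts and `BernoulliNB` as 0/1 indicators.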

In [4]:
# ● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
# sklearn.datasets or a CSV file you have.
# Question 6: Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [5]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [6]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [8]:
svm_clf = SVC(kernel='linear', C=1.0, random_state=42)
svm_clf.fit(X_train, y_train)

In [9]:
y_pred = svm_clf.predict(X_test)


In [10]:
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

Model Accuracy: 1.0


In [11]:
print("\nSupport Vectors (first 5 shown):")
print(svm_clf.support_vectors_[:5])


Support Vectors (first 5 shown):
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]]


In [12]:
# Question 7: Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')



In [13]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [14]:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [16]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [17]:
y_pred = gnb.predict(X_test)

In [18]:
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))

Classification Report:

              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [25]:
# Question 8: Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
# C and gamma.
# ● Print the best hyperparameters and accuracy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn import datasets  # needed for datasets.load_wine() below
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [26]:
wine = datasets.load_wine()
X = wine.data
y = wine.target

In [27]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [28]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

In [29]:
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5, verbose=0)
grid.fit(X_train, y_train)

In [30]:
print("Best Hyperparameters:", grid.best_params_)


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}


In [31]:
y_pred = grid.predict(X_test)


In [32]:
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)

Test Accuracy: 0.7777777777777778
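
RBF kernels depend on distances between samples, and the Wine features sit on very different scales, so feature scaling is often worth trying here. The following is a sketch of one possible variant, not the approach used in the cells above: it wraps a `StandardScaler` and the SVC in a `Pipeline` (step names `scale` and `svc` are arbitrary) and searches the same C and gamma values. No result is claimed; the accuracy would need to be checked by running it.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_wine, y_wine = load_wine(return_X_y=True)
Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled using
# statistics from its own training portion only.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# The "svc__" prefix routes each parameter to the SVC step of the pipeline.
scaled_param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

scaled_grid = GridSearchCV(pipe, scaled_param_grid, cv=5)
scaled_grid.fit(Xw_train, yw_train)

scaled_accuracy = scaled_grid.score(Xw_test, yw_test)
print("Best hyperparameters:", scaled_grid.best_params_)
print("Test accuracy with scaling:", scaled_accuracy)
```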


In [33]:
# Question 9: Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
# sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions.


In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [35]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

In [36]:
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

In [37]:

X = newsgroups.data
y = newsgroups.target

In [38]:
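# Note: the vectorizer is fitted on all documents before the train/test split,
# so vocabulary and IDF statistics from the test set are visible during training.
# Fitting on the training split only is the stricter approach.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_tfidf = vectorizer.fit_transform(X)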

In [39]:
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.3, random_state=42
)

In [40]:
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)

In [41]:
y_prob = nb_clf.predict_proba(X_test)


In [42]:
y_test_bin = label_binarize(y_test, classes=range(len(categories)))


In [43]:
roc_auc = roc_auc_score(y_test_bin, y_prob, average="macro")
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9988458828980952
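
Two details of the cells above can be illustrated without downloading the newsgroups data. The sketch below uses a tiny made-up corpus as a stand-in (an assumption for illustration only). It fits the vectorizer on the training documents alone by placing it in a pipeline, and it shows that passing `multi_class="ovr"` to `roc_auc_score` gives the same macro-averaged value as binarizing the labels first.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import label_binarize

# Tiny hypothetical corpus standing in for the newsgroup posts.
docs = [
    "the pitcher threw a fast ball", "home run in the ninth inning",
    "the team won the baseball game", "batting average and stolen bases",
    "the doctor prescribed a new medicine", "patients reported fewer symptoms",
    "clinical trial of the vaccine", "treatment for chronic pain",
    "rendering polygons on the gpu", "image compression and pixel formats",
    "ray tracing a 3d scene", "shader code for texture mapping",
]
labels = np.array([0] * 4 + [1] * 4 + [2] * 4)

docs_train, docs_test, y_tr, y_te = train_test_split(
    docs, labels, test_size=0.5, stratify=labels, random_state=0
)

# The vectorizer is fitted on the training documents only.
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(docs_train, y_tr)
proba = text_clf.predict_proba(docs_test)

# One-vs-rest macro AUC computed two equivalent ways.
auc_direct = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
auc_binarized = roc_auc_score(
    label_binarize(y_te, classes=[0, 1, 2]), proba, average="macro"
)
print(auc_direct, auc_binarized)
```

With so few documents the score itself is not meaningful; the point is only the structure of the pipeline and the equivalence of the two calls.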


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

Given a dataset of emails with potentially diverse vocabulary, class imbalance, and missing data, here's a structured approach to pre-processing, model selection, addressing class imbalance, evaluating performance, and understanding the business impact of a spam email classification system:

1. Data Preprocessing:
Text Cleaning:
Lowercase conversion: Normalize text to lowercase for consistent comparison.

Removal of stop words: Eliminate common words like "the", "and", "a" that don't contribute much meaning.

Punctuation removal: Remove punctuation marks that might not be relevant to classification.

Stemming/Lemmatization: Reduce words to their root form for better generalization.

**Text Vectorization:**

Bag-of-Words (BoW): Represent each email as a vector where each word's frequency is counted.

Term Frequency-Inverse Document Frequency (TF-IDF): Weight each word by how often it appears in an email, scaled down by how many emails in the corpus contain it, so that words appearing in fewer documents receive more importance.

N-grams: Consider sequences of n words (bigrams, trigrams) to capture phrases and context.

3. Addressing Class Imbalance:
Oversampling: Replicate minority class examples to balance the dataset.
Undersampling: Randomly remove samples from the majority class.
SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic instances of the minority class.
Cost-sensitive learning: Assign higher penalties to misclassifying minority class instances.
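
Cost-sensitive learning is the simplest of these to show in code. The sketch below computes the weights that scikit-learn's "balanced" setting assigns, using a made-up label vector with a 90/10 split as an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 legitimate emails (0) and 10 spam emails (1).
y_imbalanced = np.array([0] * 90 + [1] * 10)

# "balanced" uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_imbalanced
)
print(dict(zip([0, 1], weights)))
```

Passing `class_weight="balanced"` to `SVC` applies the same weighting during training. SMOTE is not part of scikit-learn itself; it is provided by the separate imbalanced-learn package.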


4. Evaluation Metrics:
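Accuracy: Overall proportion of correctly classified instances. It can be misleading when classes are imbalanced, because always predicting the majority class already scores highly.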
Precision: Proportion of predicted positive instances that are actually positive.
Recall: Proportion of actual positive instances that are correctly predicted.
F1-Score: Harmonic mean of precision and recall, a good metric for imbalanced datasets.
ROC-AUC (Receiver Operating Characteristic-Area Under Curve): Measures the model's ability to distinguish between classes, useful for evaluating different thresholds.
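
The difference between these metrics on imbalanced data can be seen with a deliberately useless classifier. In this sketch the labels are made up (95 legitimate emails, 5 spam) and the "model" simply predicts Not Spam for everything.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth: 95 legitimate emails (0) and 5 spam emails (1).
y_true_demo = [0] * 95 + [1] * 5
# A classifier that labels every email as legitimate.
y_pred_demo = [0] * 100

acc = accuracy_score(y_true_demo, y_pred_demo)
# zero_division=0 returns 0 instead of warning when no positives are predicted.
prec = precision_score(y_true_demo, y_pred_demo, zero_division=0)
rec = recall_score(y_true_demo, y_pred_demo, zero_division=0)
f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)

print("accuracy:", acc)
print("precision:", prec, "recall:", rec, "f1:", f1)
```

Accuracy looks strong even though no spam is caught, which is why recall, precision, and F1 on the spam class are the more informative numbers for this task.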


