#### Assignment Code: DA-AG-013
### SVM & Naive Bayes | Assignment

Question 1: What is a Support Vector Machine (SVM), and how does it work?
Answer:
-
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks, but it's mostly used for classification.

The main idea of SVM is to find the best boundary (hyperplane) that separates data points of different classes.
- In a 2D space, this boundary is a line.
- In a 3D space, it is a plane.
- In higher dimensions, it is a hyperplane.

-> Key Concepts:

 1.Hyperplane:
 - A decision boundary that separates different classes.
 - The goal is to choose the one that maximizes the margin between the classes.

 2.Margin:
 - The distance between the hyperplane and the nearest data points from each class.
 - SVM tries to maximize this margin to ensure good separation.

 3.Support Vectors:
 - The data points that are closest to the hyperplane.
 - These points are critical because they "support" the hyperplane — if you remove them, the position of the hyperplane could change.

 4.Kernel Trick:
 - If the data is not linearly separable, SVM uses a kernel function to implicitly map the data into a higher-dimensional space where it can become linearly separable.
 - Common kernels:
   - Linear
   - Polynomial
   - RBF (Radial Basis Function or Gaussian)

-> Advantages of SVM:
 - Effective in high-dimensional spaces
 - Works well even when number of features > number of samples
 - Memory efficient (uses only support vectors)

-> Disadvantages:
 - Not ideal for large datasets (training time is high)
 - Less effective when classes are heavily overlapping
 - Choosing the right kernel and tuning parameters can be tricky
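
To make the margin idea concrete, here is a minimal sketch in plain Python. The toy points and the two candidate hyperplanes are hypothetical and picked by hand (nothing is learned here); the code only measures each candidate's margin as the smallest distance from any point to the boundary.

```python
import math

# Hypothetical 2D toy data: ((x1, x2), label) with labels in {-1, +1}
points = [((1.0, 1.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1),
          ((3.0, 3.0), +1), ((4.0, 3.0), +1), ((3.0, 4.0), +1)]

def margin_of(w, b):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    The distance is positive when a point lies on the side matching its label,
    so a positive result means the hyperplane separates the data.
    """
    norm = math.hypot(*w)
    return min(label * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, label in points)

# Two hand-picked candidate boundaries (assumptions for illustration)
margin_diagonal = margin_of(w=(1.0, 1.0), b=-4.0)  # the line x1 + x2 = 4
margin_vertical = margin_of(w=(1.0, 0.0), b=-2.0)  # the line x1 = 2

print("diagonal margin:", margin_diagonal)
print("vertical margin:", margin_vertical)
```

Both lines separate the classes, but the diagonal one leaves the larger margin, so it is the one an SVM would prefer. The points sitting exactly at that minimum distance play the role of support vectors.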

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.
Answer:
-
 1.Hard Margin SVM:

-> Definition:
- A Hard Margin SVM assumes that the data is linearly separable — i.e., you can draw a straight line (or hyperplane) that perfectly separates the two classes without any errors.

-> Characteristics:
 - No misclassifications allowed.
 - Maximizes the margin strictly between the two classes.
 - Only works well when data is clean and clearly separable.

-> Limitation:
 - Very sensitive to outliers and noisy data.
 - Has no valid solution if the real-world data is not perfectly separable (which is often the case), and may overfit when a few noisy points force a very narrow margin.


 2.Soft Margin SVM:

-> Definition:
 - A Soft Margin SVM allows some misclassification of data points to create a more flexible and generalizable model — ideal for real-world, noisy, or overlapping data.

-> Characteristics:
 - Introduces a penalty for misclassified points using a regularization parameter (C).
 - Balances margin maximization and classification error.
 - More robust and adaptable to complex datasets.

-> Role of Parameter C:
 - Large C → Less tolerance for errors (closer to hard margin).
 - Small C → More tolerance for errors (larger margin, often better generalization, though a very small C can underfit).
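
The trade-off controlled by C can be sketched with the usual soft-margin objective, 0.5·‖w‖² + C·Σ max(0, 1 − y(w·x + b)). The 1D data and the two candidate boundaries below are hypothetical and chosen by hand, so this compares two fixed options rather than running an actual SVM solver.

```python
# Hypothetical 1D data: (x, label). The point (1.0, -1) is a negative-class
# outlier sitting close to the positive class.
data = [(-3.0, -1), (-2.0, -1), (1.0, -1), (2.0, +1), (3.0, +1)]

def soft_margin_objective(w, b, C):
    """0.5 * w^2 + C * sum of hinge losses max(0, 1 - y * (w * x + b))."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)
    return 0.5 * w * w + C * hinge

# Two hand-picked candidates (not the output of a solver):
wide = (1.0, 0.0)     # boundary at x = 0: wide margin, misclassifies the outlier
narrow = (2.0, -3.0)  # boundary at x = 1.5: narrow margin, no violations

for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(*wide, C)
    obj_narrow = soft_margin_objective(*narrow, C)
    preferred = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide:.2f}, narrow={obj_narrow:.2f}, preferred={preferred}")
```

With the small C the wide-margin boundary has the lower objective even though it misclassifies the outlier; with the large C the penalty dominates and the narrow, violation-free boundary wins.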


#### Key Differences:

| Feature                     | Hard Margin SVM                         | Soft Margin SVM             |
| --------------------------- | --------------------------------------- | --------------------------- |
| Assumes perfect separation? | Yes                                     | No                          |
| Allows misclassifications?  | No                                      | Yes                         |
| Works well on noisy data?   | Poorly                                  | Better                      |
| Flexibility                 | Rigid (strict margin)                   | Flexible (controlled by C)  |
| Risk of Overfitting         | High when noise or outliers force a narrow margin | Lower due to regularization |


-> Example:
 - Imagine trying to draw a line between two groups of fruits (say apples and oranges):
   - Hard Margin: Requires all apples on one side and all oranges on the other, so even a few weird-shaped apples can force the line into an awkward position (or make it impossible to draw at all).
   - Soft Margin: Allows a few fruits to be on the wrong side if that helps draw a better, more stable line overall.

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
Answer:
-
The Kernel Trick is a mathematical technique used in Support Vector Machines (SVMs) to handle data that is not linearly separable. It lets the SVM behave as if the data had been mapped into a higher-dimensional space where a linear hyperplane can separate it.

Instead of actually computing the coordinates in the higher dimension (which is computationally expensive), the kernel trick uses a kernel function to compute the inner product between the images of each pair of data points in that feature space, directly from the original features.
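
A small check of this idea, using the degree-2 polynomial kernel as an assumed example (it is simpler to verify by hand than RBF): the kernel value computed in the original 2D space equals the inner product of an explicit 3D feature map.

```python
import math

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (x1, x2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: (u . v)^2."""
    return dot(u, v) ** 2

u, v = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(u), phi(v))  # inner product after mapping both points to 3D
implicit = poly_kernel(u, v)    # same value, computed without leaving 2D

print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, which is the whole point: the SVM only ever needs the kernel value, never the mapped coordinates.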

-> Why is it useful?
 - Real-world data is often not linearly separable.
 - The kernel trick allows SVM to find complex boundaries between classes without explicitly transforming the data.
 - Enables SVM to work with non-linear decision boundaries.

-> How it works (Simple View):

Imagine trying to separate a set of points shaped like concentric circles. In 2D, it's impossible to draw a straight line between the classes. But if we map the data into 3D (like lifting the inner circle upwards), a linear plane can separate them — this is what the kernel trick helps SVM do, mathematically.
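
The concentric-circles picture can be reproduced in a few lines. This sketch performs the lift explicitly (which a kernel would avoid) purely to show that a flat plane separates the classes in 3D; the radii, the number of points, and the plane z = 5 are arbitrary choices for illustration.

```python
import math

def circle(radius, n=12):
    """n evenly spaced points on a circle centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = circle(1.0)  # class 0
outer = circle(3.0)  # class 1

def lift(p):
    """Add a third coordinate z = x^2 + y^2 (squared distance from the origin)."""
    x, y = p
    return (x, y, x * x + y * y)

inner_z = [lift(p)[2] for p in inner]  # all close to 1
outer_z = [lift(p)[2] for p in outer]  # all close to 9

# In 3D, the horizontal plane z = 5 lies between the two classes
separable = max(inner_z) < 5.0 < min(outer_z)
print("separable by the plane z = 5:", separable)
```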

- Example of a Kernel: RBF (Radial Basis Function) / Gaussian Kernel
Formula:
$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$

 - x, x': two feature vectors
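 - γ (gamma): controls how far the influence of a single training example reaches. A small γ gives a far-reaching, smoother influence; a large γ gives a short-range influence and a more complex boundary.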
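
A direct transcription of the RBF formula in plain Python, as a sketch. The three points and the gamma values are hypothetical; the printout shows how similarity falls with distance and how gamma changes the reach.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)

a = (0.0, 0.0)
near = (0.5, 0.0)
far = (3.0, 0.0)

for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(a, near)={rbf_kernel(a, near, gamma):.4f}, "
          f"K(a, far)={rbf_kernel(a, far, gamma):.4f}")
```

A point compared with itself always scores 1, and the score decays towards 0 as points move apart; the larger gamma is, the faster that decay.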


-> Use Case:
 - Used when the relationship between features and labels is highly non-linear.
 - A common default for problems such as image classification or medical data, where patterns are complex. For high-dimensional sparse text, a linear kernel is often sufficient.

-> Real-life Example:
 - Handwritten digit recognition (like MNIST dataset): The digits are not linearly separable, but using the RBF kernel, SVM can learn complex boundaries between digit classes like '3' and '8'.

#### Other Common Kernels (for reference):
| Kernel Type    | Use Case                                                                                               |
| -------------- | ------------------------------------------------------------------------------------------------------ |
| **Linear**     | When data is linearly separable or has many features (e.g., text data)                                 |
| **Polynomial** | When interaction between features matters (e.g., image classification with curved decision boundaries) |
| **Sigmoid**    | Behaves similarly to a two-layer perceptron (a neural-network-style decision function); less commonly used in practice |


Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?
Answer:
-
The Naïve Bayes Classifier is a supervised machine learning algorithm based on Bayes’ Theorem, used primarily for classification tasks.

It’s called "naïve" because it assumes that all features are conditionally independent of each other given the class, which is rarely true in real-world data.

-> Bayes’ Theorem (Simplified):

$P(A \mid B) = \dfrac{P(B \mid A) \cdot P(A)}{P(B)}$

-> In classification:
 - P(A∣B): Probability of class A given features B
 - P(B∣A): Probability of features B given class A
 - P(A): Prior probability of class A
 - P(B): Probability of features B
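
A worked numeric example of the theorem, with made-up probabilities for a one-word spam filter (all three input numbers are assumptions chosen to keep the arithmetic easy):

```python
# Hypothetical probabilities for a one-word spam filter
p_spam = 0.2              # P(A): prior probability that an email is spam
p_free_given_spam = 0.6   # P(B|A): "free" appears in a spam email
p_free_given_ham = 0.05   # "free" appears in a legitimate email

# P(B): total probability of seeing "free" in any email
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given that "free" appears
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 4))  # 0.75
```

Seeing the word raises the spam probability from the 20% prior to 75%, because "free" is far more likely under the spam class than under the legitimate class.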

 How Naïve Bayes Works (Step-by-step):
 - Calculate Prior Probability for each class (e.g., spam vs. not spam).
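 - Calculate Likelihood: For each feature, compute the probability of that feature occurring in a given class, then multiply these per-feature probabilities together (this is where the independence assumption is used).
 - Apply Bayes’ Theorem: multiply the combined likelihood by the prior to get the posterior probability for each class (the shared denominator P(B) can be ignored when only comparing classes).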
 - Choose the class with the highest posterior probability as the prediction.
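
The four steps above can be sketched from scratch in plain Python. The tiny training set is hypothetical, and the add-one (Laplace) smoothing and log-probabilities are common practical choices that the steps above do not mention; treat this as an illustration rather than a reference implementation.

```python
import math
from collections import Counter

# Hypothetical training emails: (text, label)
train = [("free offer now", "spam"), ("win free prize", "spam"),
         ("meeting schedule now", "ham"), ("project offer review", "ham"),
         ("lunch meeting today", "ham")]

def fit(docs):
    """Count documents per class and word occurrences per class."""
    class_counts, word_counts, vocab = Counter(), {}, set()
    for text, label in docs:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def posterior(text, class_counts, word_counts, vocab):
    """Posterior probability of each class for a new text."""
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        log_score = math.log(n_docs / total_docs)          # step 1: prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue                                   # ignore unseen words
            # step 2: per-word likelihood with add-one smoothing
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            log_score += math.log(likelihood)
        log_scores[label] = log_score
    # step 3: normalise the scores so they sum to 1
    max_log = max(log_scores.values())
    unnormalised = {k: math.exp(v - max_log) for k, v in log_scores.items()}
    z = sum(unnormalised.values())
    return {k: v / z for k, v in unnormalised.items()}

model = fit(train)
probs = posterior("free prize offer", *model)
prediction = max(probs, key=probs.get)                     # step 4: pick the best class
print(prediction, probs)
```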

Why is it called “Naïve”?
 - Because it naïvely assumes that all features are independent of each other once the class is known.
 - For example, if you're classifying emails, it assumes that within each class the occurrence of the word "free" is independent of the word "offer", even though they often appear together.
 - Despite this simplifying assumption, Naïve Bayes often performs surprisingly well, especially in text classification.


Example Use Cases:
 - Spam filtering
 - Sentiment analysis
 - Document classification
 - Medical diagnosis (with categorical features)


#### Types of Naïve Bayes:
| Type                        | Used When                                                 |
| --------------------------- | --------------------------------------------------------- |
| **Gaussian Naïve Bayes**    | Features are continuous and follow a normal distribution  |
| **Multinomial Naïve Bayes** | Features are counts (e.g., word frequencies in text data) |
| **Bernoulli Naïve Bayes**   | Features are binary (e.g., word present or not)           |


Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
Answer:
-
Naïve Bayes has three main variants, each designed for different types of input features.

 1.Gaussian Naïve Bayes

Description:
 - Assumes that continuous features follow a normal (Gaussian) distribution.
 - For each feature, it calculates the mean and standard deviation for each class.
 - Then, it uses the Gaussian probability density function to compute the likelihood of features.

- Formula:

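$P(x_i \mid y) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\dfrac{(x_i - \mu)^2}{2\sigma^2}\right)$

where μ and σ are the mean and standard deviation of feature xᵢ within class y.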
 
Use Case:
 - When features are real-valued and continuous, e.g.:
   - Medical data (e.g., height, weight, blood pressure)
   - Sensor readings
   - Iris flower dataset
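
A minimal sketch of the Gaussian likelihood step for a single feature, in plain Python. The petal-length values are invented for illustration, and the population standard deviation is used as a simple estimate; a full classifier would multiply the likelihoods of all features and the class prior.

```python
import math
import statistics

# Hypothetical petal lengths (cm) for two classes
petal_length = {
    "setosa": [1.4, 1.5, 1.3, 1.6, 1.4],
    "versicolor": [4.5, 4.7, 4.2, 4.0, 4.6],
}

def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Per-class mean and (population) standard deviation of the feature
params = {label: (statistics.fmean(values), statistics.pstdev(values))
          for label, values in petal_length.items()}

x_new = 1.5
likelihoods = {label: gaussian_pdf(x_new, mu, sigma)
               for label, (mu, sigma) in params.items()}
best = max(likelihoods, key=likelihoods.get)
print(best, likelihoods)
```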

 2.Multinomial Naïve Bayes

Description:
 - Assumes that features represent discrete frequency counts (e.g., how often a word appears).
 - Commonly used for document classification or text analysis.

Use Case:
 - When features are counts or frequencies, e.g.:
   - Text classification (e.g., spam detection, sentiment analysis)
   - Bag-of-Words count vectors (TF-IDF weights are fractional rather than true counts, but often work in practice)

Example:
 - Email classification based on the number of times each word appears.


 3.Bernoulli Naïve Bayes

Description:
 - Designed for binary/boolean features (0 or 1).
 - Assumes features are either present or absent (not counts).
 - Models feature presence/absence independently for each class.

Use Case:
 - When features are binary, e.g.:
   - Whether a specific word appears or not in a document
   - Text classification with binary features
   - Market basket analysis (item purchased: yes/no)

Example:
 - Spam filtering based on whether words like “free” or “win” are present.

#### Comparison Table:

| Variant         | Feature Type    | Best For                                  |
| --------------- | --------------- | ----------------------------------------- |
| **Gaussian**    | Continuous      | Sensor data, numerical features           |
| **Multinomial** | Discrete counts | Text classification with word frequencies |
| **Bernoulli**   | Binary (0 or 1) | Binary text features or presence/absence  |
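
The difference between count features (Multinomial) and presence/absence features (Bernoulli) is easiest to see on one message. The vocabulary and the email text below are hypothetical.

```python
from collections import Counter

vocabulary = ["free", "win", "meeting", "offer"]  # hypothetical vocabulary
email = "free free offer win free"

counts = Counter(email.split())

# Multinomial-style features: how many times each word occurs
count_vector = [counts[word] for word in vocabulary]

# Bernoulli-style features: whether each word occurs at all
binary_vector = [int(counts[word] > 0) for word in vocabulary]

print(count_vector)   # [3, 1, 0, 1]
print(binary_vector)  # [1, 1, 0, 1]
```

The Multinomial variant would see "free" three times, while the Bernoulli variant only records that it is present.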


Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.

Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)
Answer:
-
Python Program using SVM on Iris Dataset

Here is a Python program that:
 - Loads the Iris dataset
 - Trains a Support Vector Machine (SVM) classifier with a linear kernel
 - Prints the accuracy and the support vectors

In [10]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and support vectors
print("Model Accuracy:", accuracy)
print("Number of Support Vectors for each class:", svm_model.n_support_)
print("Support Vectors:\n", svm_model.support_vectors_)

Model Accuracy: 1.0
Number of Support Vectors for each class: [ 3 11 11]
Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]
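
As a small follow-up sketch, the support vectors printed above can be traced back to rows of the training set through the fitted model's `support_` attribute. This refits the same model under the same split; variable names other than the scikit-learn attributes are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC(kernel="linear").fit(X_train, y_train)

# support_ holds row indices into X_train; support_vectors_ holds the rows themselves
same_rows = np.allclose(model.support_vectors_, X_train[model.support_])

# n_support_ gives the count per class and sums to the total number of support vectors
total = int(model.n_support_.sum())
print(same_rows, total == len(model.support_vectors_))
```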


Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)
Answer:
-
Python Program using Gaussian Naïve Bayes on Breast Cancer Dataset

This program:
 - Loads the Breast Cancer dataset from sklearn.datasets
 - Trains a Gaussian Naïve Bayes classifier
 - Prints the classification report including precision, recall, and F1-score

In [13]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on the test set
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Classification Report:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
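
To show how the numbers in the report are derived, the sketch below recomputes precision, recall, and F1 from a confusion matrix. The counts are not printed by the notebook; they are inferred here as the counts consistent with the rounded report and the support values (43 and 71), so treat them as an assumption.

```python
# Counts inferred from the rounded report above (an assumption, not notebook output).
# Outer key = true class, inner key = predicted class.
confusion = {"malignant": {"malignant": 40, "benign": 3},
             "benign":    {"malignant": 0,  "benign": 71}}

def precision(label):
    """Of everything predicted as `label`, the fraction that truly is `label`."""
    predicted = sum(row[label] for row in confusion.values())
    return confusion[label][label] / predicted

def recall(label):
    """Of everything that truly is `label`, the fraction predicted as `label`."""
    actual = sum(confusion[label].values())
    return confusion[label][label] / actual

def f1(label):
    """Harmonic mean of precision and recall."""
    p, r = precision(label), recall(label)
    return 2 * p * r / (p + r)

for label in confusion:
    print(label, round(precision(label), 2), round(recall(label), 2), round(f1(label), 2))
```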



Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)
Answer:
-
This program:
 - Loads the Wine dataset
 - Trains an SVM classifier
 - Uses GridSearchCV to find the best combination of C and gamma
 - Prints the best hyperparameters and accuracy

In [16]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the SVM model
svm = SVC()

# Set up the parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']
}

# Initialize GridSearchCV
grid = GridSearchCV(svm, param_grid, cv=5)
grid.fit(X_train, y_train)

# Make predictions on the test set
y_pred = grid.predict(X_test)

# Print best hyperparameters and accuracy
print("Best Hyperparameters:", grid.best_params_)
print("Test Set Accuracy:", accuracy_score(y_test, y_pred))

Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Set Accuracy: 0.8333333333333334
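
The RBF kernel depends on distances between samples, and the Wine features are measured on very different scales, so standardizing the features before the SVM is a commonly recommended step. The sketch below is a variation on the program above, not part of the original run: it wraps a `StandardScaler` and the SVC in a `Pipeline` and searches the same grid. No output is quoted because it was not part of the recorded notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training part only
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same grid as above; the "svc__" prefix routes each parameter to the SVC step
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [1, 0.1, 0.01, 0.001]}

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best Hyperparameters:", grid.best_params_)
print("Test Set Accuracy:", scaled_accuracy)
```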


Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)
Answer:
-
This program:
 - Loads a text dataset (two categories from fetch_20newsgroups, which is a real newsgroup corpus rather than a synthetic one)
 - Trains a Naïve Bayes classifier on text data
 - Calculates and prints the ROC-AUC score

In [25]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load a subset of the 20 newsgroups dataset (binary classification for ROC-AUC)
categories = ['sci.med', 'rec.autos']  # 2 classes for binary classification
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Vectorize the text data
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data.data)
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Naive Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_proba = nb_model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)

# Print ROC-AUC score
print("ROC-AUC Score:", roc_auc)

ROC-AUC Score: 0.9929591836734695
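
For intuition about what this score measures: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on four hypothetical predictions; it is meant as an explanation of the metric, not a replacement for `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75; a score near 1.0, like the one above, means almost every pair is ordered correctly.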



Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)
Answer:
-
Spam Detection Using Machine Learning

Imagine you're tasked with building an email spam classifier using real-world email data that has:
 - Text content
 - Missing values
 - Class imbalance (more "Not Spam" than "Spam")

In [None]:
# 1. Preprocessing the data

## Handling missing data:

import pandas as pd

# `data` is assumed to be a pandas DataFrame with an 'email_text' column
# Replace missing values with empty strings
data['email_text'] = data['email_text'].fillna('')

In [None]:
#Text Vectorization (Convert text to numeric):

from sklearn.feature_extraction.text import TfidfVectorizer

# Convert text to TF-IDF vectors
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95)
X = vectorizer.fit_transform(data['email_text'])

In [None]:
#Target Variable:

y = data['label']  # 0 = Not Spam, 1 = Spam

#### 2.Choosing the Model: SVM vs Naïve Bayes

| Model           | Pros                                            | Cons                                     |
| --------------- | ----------------------------------------------- | ---------------------------------------- |
| **Naïve Bayes** | Fast, handles high-dimensional sparse data well | Assumes feature independence             |
| **SVM**         | Accurate and robust with kernels                | Slower with large datasets, needs tuning |

 Choice: Naïve Bayes (Multinomial). It is well suited to text classification, fast to train, and handles word frequency vectors effectively.

#### 3.Handling Class Imbalance
 - Legitimate emails far outnumber spam, so the two classes are imbalanced.
 - Use class_weight='balanced' (for SVM) or resampling techniques such as oversampling the minority class.

Example: Use SMOTE to oversample the minority class:

In [None]:
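from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training data so the test set keeps its real class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)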


#### 4.Evaluation Metrics
Because of imbalance, accuracy is not enough. Use:
 - Precision (how many predicted spams are actually spam)
 - Recall (how many actual spams were correctly identified)
 - F1-Score (balance between precision & recall)
 - ROC-AUC Score
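
A quick illustration of why accuracy alone is misleading here, using a hypothetical test set of 95 legitimate emails and 5 spam emails and a "classifier" that never predicts spam:

```python
# Hypothetical imbalanced test set: 95 legitimate emails (0) and 5 spam emails (1)
y_true = [0] * 95 + [1] * 5

# A useless "classifier" that labels every email as Not Spam
y_pred = [0] * 100

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
spam_recall = true_pos / (true_pos + false_neg)

print("accuracy:", accuracy)        # 0.95
print("spam recall:", spam_recall)  # 0.0
```

The model looks 95% accurate while catching no spam at all, which is why recall, precision, and F1 on the spam class matter more than accuracy.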

In [36]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# Simulate spam dataset using 2 categories
categories = ['talk.politics.misc', 'rec.sport.baseball']  # Assume one is spam-like
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Create a DataFrame
import pandas as pd
df = pd.DataFrame({'email_text': data.data, 'label': data.target})

# Fill missing text
df['email_text'] = df['email_text'].fillna('')

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95)
X = vectorizer.fit_transform(df['email_text'])
y = df['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model: Multinomial Naïve Bayes
model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions and Probabilities
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.99      0.96       193
           1       0.99      0.91      0.95       161

    accuracy                           0.96       354
   macro avg       0.96      0.95      0.96       354
weighted avg       0.96      0.96      0.96       354

ROC-AUC Score: 0.9909889614778105
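
One possible refinement, sketched under assumptions: the program above fits the TF-IDF vectorizer on all documents before splitting, so vocabulary statistics from the test set leak into training. Wrapping the vectorizer and the classifier in a `Pipeline` fits both on training text only. The mini-corpus below is invented (with `None` standing in for a missing email) and is far too small to say anything about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus; None stands in for an email with missing text
train_texts = ["win a free prize now", "free offer click now", "claim your free reward",
               "meeting moved to monday", None, "lunch with the team today"]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = Spam, 0 = Not Spam

def clean(texts):
    """Replace missing emails with empty strings."""
    return [t if isinstance(t, str) else "" for t in texts]

# The vectorizer is fitted inside the pipeline, i.e. on training text only
spam_filter = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                        ("nb", MultinomialNB())])
spam_filter.fit(clean(train_texts), train_labels)

new_emails = ["free prize offer", None, "project meeting today"]
predictions = spam_filter.predict(clean(new_emails))
probabilities = spam_filter.predict_proba(clean(new_emails))[:, 1]
print(predictions, probabilities)
```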

