# SVM & Naive Bayes


Q1. What is Support Vector Machine (SVM)?
- Support Vector Machine (SVM) is a powerful supervised machine learning algorithm that can be used for both classification and regression tasks, although it is most commonly applied in classification problems.

  The main concept behind SVM is to find an optimal hyperplane that best separates the data points of different classes in the feature space. This separation is done in such a way that the margin (the distance between the hyperplane and the closest data points from each class, known as support vectors) is maximized.

  If the data is not linearly separable in its original space, SVM uses kernel functions to project the data into a higher-dimensional space where a linear separation is possible.
- How SVM Works
  1. Linear Separation
     - If the data is linearly separable, SVM finds a straight line (in 2D), a plane (in 3D), or a hyperplane (in higher dimensions) that separates the classes with the maximum margin.
  2. Maximizing the Margin
     - Among all separating hyperplanes, SVM picks the one with the largest margin. A larger margin leaves more room for unseen points to fall on the correct side, which tends to improve generalization to new data.
  3. Handling Non-linear Data
     - If the data is not linearly separable, SVM uses the kernel trick to implicitly map the data into a higher-dimensional space where a linear separator is more likely to exist.
     - Common kernels include:
       - Linear
       - Polynomial
       - Radial Basis Function (RBF)
       - Sigmoid
  4. Soft Margin for Noisy Data
     - In real-world data, perfect separation may not be possible. SVM introduces a soft margin controlled by a parameter C that allows some misclassifications in exchange for better generalization.
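
To make the idea of a maximum margin concrete, here is a minimal sketch that fits a linear SVM on a handful of made-up 2D points and reads the margin width off the learned weights as 2 / ||w||. The data points and the very large C (used to approximate a hard margin) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable 2D points (two classes)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane
b = clf.intercept_[0]                    # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", clf.support_vectors_)
```

For these points the closest approach between the two classes is the distance from (4, 4) to the line x₁ + x₂ = 3, so the margin width should come out close to 5/√2 ≈ 3.54.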

Q2. Explain the difference between Hard Margin and Soft Margin SVM.
1. Hard Margin SVM
   - Definition: A Hard Margin SVM finds a hyperplane that separates the training data perfectly, with no point inside the margin or on the wrong side of the boundary.
   - When Used:
     - Data is linearly separable (no overlap between classes).
     - No noise or outliers in the dataset.
   - Advantages:
     - Clear and strict separation between classes.
     - Simpler decision boundary when perfect separation is possible.
   - Disadvantages:
     - Very sensitive to noise and outliers: a single point lying among the other class makes perfect separation impossible, and a single outlier near the boundary can shrink the margin drastically.
   - Example: If we have two completely distinct groups of points with no overlap, Hard Margin SVM will create a boundary with zero tolerance for misclassification.
2. Soft Margin SVM
   - Definition: A Soft Margin SVM allows some misclassification or overlap between classes to achieve better generalization on unseen data. This is controlled by a regularization parameter C: a small C tolerates more margin violations (wider margin), while a large C penalizes them heavily and approaches hard-margin behavior.
   - When Used:
     - Data is not perfectly separable.
     - There may be noise or overlapping class distributions.
   - Advantages:
     - More robust to noise and outliers.
     - Works with real-world datasets where perfect separation is rare.
   - Disadvantages:
     - May misclassify some training points to improve generalization.
     - C must be tuned, typically by cross-validation.
   - Example: In a spam email classification problem, some borderline emails might be misclassified to ensure the model performs better on new, unseen emails.
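
The role of C can be explored with a small experiment. The sketch below assumes a synthetic, overlapping two-class dataset from `make_blobs` (not part of the original assignment) and compares how many support vectors a linear SVM keeps for several values of C.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so no hard-margin solution exists
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

n_support = {}
for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(clf.n_support_.sum())   # total support vectors over both classes
    print(f"C={C:<6} support vectors={n_support[C]:3d} "
          f"train accuracy={clf.score(X, y):.3f}")
```

A small C usually yields a wider margin and therefore more support vectors; a large C narrows the margin and keeps fewer of them.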



Q3.  What is the Kernel Trick in SVM? Give one example of a kernel and
explain its use case.
-  Kernel Trick in SVM

  Definition:

  The Kernel Trick is a mathematical technique used in Support Vector Machines (SVM) to enable the algorithm to separate data that is not linearly separable in its original space.
  
   It works by implicitly mapping the input data into a higher-dimensional feature space without explicitly computing the transformation. In this higher-dimensional space, the data can often be separated by a linear hyperplane.

   The key advantage is that the kernel trick allows complex, non-linear boundaries to be learned without the high computational cost of explicitly transforming the data.

-  How It Works
  
 1. Instead of computing a transformation φ(x) explicitly, the kernel trick computes the dot product in the higher-dimensional space using a kernel function K(xᵢ, xⱼ) directly in the original space.

  2. This avoids handling high-dimensional vectors explicitly and makes the computation more efficient.

  Mathematically:

   K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ)
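
The identity K(xᵢ, xⱼ) = φ(xᵢ) · φ(xⱼ) can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel in 2D as an assumed example, because its feature map φ is small enough to write out by hand; the function names `phi` and `poly2_kernel` are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the homogeneous degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """K(x, z) = (x · z)^2, computed entirely in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))   # dot product in the 3D feature space
implicit = poly2_kernel(x, z)       # same value without ever building phi

print(explicit, implicit)           # both are 121 up to floating-point rounding
```

Here x · z = 11, so both routes give 11² = 121; the kernel route never constructs the 3D vectors.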

- Example Kernel: Radial Basis Function (RBF) Kernel

  Formula:
  
  K(x, x′) = exp(−γ ||x − x′||²)
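- Where:

 - γ (gamma) controls how far the influence of a single training example reaches: a small γ gives each point a wide reach (smoother boundary), while a large γ makes the influence local (more complex boundary, higher risk of overfitting).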

-  Use Case:

 - RBF kernel is widely used when the decision boundary between classes is non-linear and complex.

 - For example, in handwritten digit recognition (digits 0–9), an RBF kernel can help separate similar-looking digits such as “4” and “9”, because it implicitly maps the pixel data into a higher-dimensional space where the classes are easier to separate.
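
As a small illustration of the non-linear use case, the sketch below compares a linear and an RBF kernel on concentric circles generated with `make_circles`. The dataset and the value `gamma=1.0` are assumptions picked for the demo, not tuned settings.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle = one class, outer ring = the other: not linearly separable in 2D
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma=1.0).fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:   ", rbf_acc)
```

The linear kernel cannot draw a closed boundary around the inner circle, so it should score far lower than the RBF kernel on this data.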

Q4.   What is a Naïve Bayes Classifier, and why is it called “naïve”?

  i. Naïve Bayes Classifier

  - Definition:

   The Naïve Bayes Classifier is a family of probabilistic machine learning algorithms based on Bayes’ Theorem.
  
  It is primarily used for classification tasks, especially in text classification (e.g., spam detection, sentiment analysis).

 - Bayes’ Theorem:

     P(A|B) = [ P(B|A) × P(A) ] / P(B)

    Where:

 - P(A|B): Posterior probability — probability of class A given evidence B.

 - P(B|A): Likelihood — probability of evidence B given class A.

 - P(A): Prior probability of class A.

 - P(B): Evidence, or marginal probability, i.e. the overall probability of observing evidence B across all classes.
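
A short worked example may help here. The probabilities below are made-up numbers for a hypothetical spam filter that looks at a single word; they only illustrate how the terms of Bayes' Theorem combine.

```python
# Hypothetical probabilities for a one-word spam filter
p_spam = 0.2                 # P(A): prior probability that an email is spam
p_word_given_spam = 0.6      # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05      # the word appears in a legitimate email

# P(B): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given that the word appears
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(word) =", round(p_word, 4))                  # 0.16
print("P(spam | word) =", round(p_spam_given_word, 4))  # 0.75
```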

ii.   Why is it Called “Naïve”?

   - It is called “naïve” because it assumes that all features (predictors) are conditionally independent of each other given the class label. In real-world datasets this assumption rarely holds exactly, since features are often correlated, but the classifier still works surprisingly well in many situations.
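
Under the independence assumption, the class score is simply the prior multiplied by one likelihood per feature. The sketch below shows this for two hypothetical words; all probabilities and names are invented for illustration.

```python
# Hypothetical class priors and per-word likelihoods P(word | class)
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {
    "spam": {"free": 0.6, "winner": 0.5},
    "ham": {"free": 0.05, "winner": 0.1},
}
words = ["free", "winner"]   # words observed in the email

# Naive assumption: multiply the per-word likelihoods as if they were independent
scores = {}
for label in prior:
    score = prior[label]
    for word in words:
        score *= likelihood[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

print(posterior)   # spam is about 0.9375, ham about 0.0625
```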

Q5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants.
When would you use each one?

   1. Gaussian Naïve Bayes
  
  Definition:

 - Assumes that the continuous features follow a Gaussian (Normal) distribution within each class.

 - Probability density function is calculated using the Gaussian formula.

 Formula:
      
           P(xᵢ | y) = (1 / √(2πσ²)) × exp(−(xᵢ − μ)² / (2σ²))

  Where μ and σ² are the mean and variance of feature xᵢ, estimated from the training samples of class y.
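
The formula can be written directly in NumPy. The sketch below defines a helper `gaussian_pdf` (an illustrative name), estimates μ and σ² for one Iris feature and one class, and checks that the per-class mean matches what scikit-learn's `GaussianNB` stores in `theta_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

def gaussian_pdf(x, mu, var):
    """Gaussian likelihood P(x | y) for a feature with class mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

X, y = load_iris(return_X_y=True)

# Mean and variance of feature 0 (sepal length) within class 0
mu = X[y == 0, 0].mean()
var = X[y == 0, 0].var()
print("P(x=5.0 | class 0) density:", gaussian_pdf(5.0, mu, var))

# GaussianNB stores the same per-class feature means in theta_
gnb = GaussianNB().fit(X, y)
means_match = bool(np.isclose(gnb.theta_[0, 0], mu))
print("Mean matches GaussianNB.theta_:", means_match)
```

Note that the value returned is a probability density, so it can be larger than 1 when the variance is small.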

 When to Use:

 - When the features are continuous and can be modeled with a bell-shaped curve.

 - Example: Predicting a person’s health risk based on continuous measurements like blood pressure, height, or weight.

  2. Multinomial Naïve Bayes

 Definition:

 - Designed for discrete features, particularly count data.

 - Often used for document classification where features represent the frequency of terms in a document.

 When to Use:

 - When features represent counts or frequencies.

 - Example: Spam email detection using the number of times a specific word appears in an email.

  3. Bernoulli Naïve Bayes

   Definition:

 - Assumes that features are binary (0 or 1), indicating the presence or absence of a particular feature.

 - Unlike Multinomial NB, it only considers whether a feature is present, not how many times it appears.

 When to Use:

 - When features are binary indicators.

 - Example: Classifying documents based on whether certain keywords appear or not (regardless of frequency).
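
The difference between count features (Multinomial) and presence features (Bernoulli) is easy to see with `CountVectorizer`. The two example documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free free free money", "meeting tomorrow"]

count_vec = CountVectorizer()               # term counts, suited to MultinomialNB
binary_vec = CountVectorizer(binary=True)   # presence/absence, suited to BernoulliNB

counts = count_vec.fit_transform(docs).toarray()
binary = binary_vec.fit_transform(docs).toarray()

free_idx = count_vec.vocabulary_["free"]    # column index of the word "free"
print("count feature for 'free': ", counts[0, free_idx])   # 3
print("binary feature for 'free':", binary[0, free_idx])   # 1
```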

| Variant     | Data Type  | Example                            |
| ----------- | ---------- | ---------------------------------- |
| Gaussian    | Continuous | Predicting disease from lab values |
| Multinomial | Count data | Spam filtering                     |
| Bernoulli   | Binary     | Keyword presence classification    |


- Gaussian Naïve Bayes: Continuous data (e.g., Iris, Breast Cancer dataset).

- Multinomial Naïve Bayes: Discrete count data (e.g., word frequencies).

- Bernoulli Naïve Bayes: Binary features (e.g., keyword presence).


In [21]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

# Load a dataset (Iris)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 1. Gaussian Naïve Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
print("Gaussian NB Accuracy:", accuracy_score(y_test, y_pred_gnb))

# 2. Multinomial Naïve Bayes (requires non-negative features; designed for counts)
# For demo only: truncate the measurements to integers to mimic count data (this discards information)
import numpy as np
X_train_counts = np.abs(X_train.astype(int))
X_test_counts = np.abs(X_test.astype(int))
mnb = MultinomialNB()
mnb.fit(X_train_counts, y_train)
y_pred_mnb = mnb.predict(X_test_counts)
print("Multinomial NB Accuracy:", accuracy_score(y_test, y_pred_mnb))

# 3. Bernoulli Naïve Bayes (binary features): 1 if the truncated value is at least 1, else 0
X_train_bin = (X_train_counts > 0).astype(int)
X_test_bin = (X_test_counts > 0).astype(int)
bnb = BernoulliNB()
bnb.fit(X_train_bin, y_train)
y_pred_bnb = bnb.predict(X_test_bin)
print("Bernoulli NB Accuracy:", accuracy_score(y_test, y_pred_bnb))


Gaussian NB Accuracy: 0.9111111111111111
Multinomial NB Accuracy: 0.8222222222222222
Bernoulli NB Accuracy: 0.6666666666666666


Q6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
(Include your Python code and output in the code box below.)

In [22]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train SVM classifier with a linear kernel
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train, y_train)

# Make predictions
y_pred = svm_clf.predict(X_test)

# Print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print support vectors
print("\nSupport Vectors:\n", svm_clf.support_vectors_)


Model Accuracy: 1.0

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [4.5 2.3 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.1 2.5 3.  1.1]
 [6.2 2.2 4.5 1.5]
 [6.  2.9 4.5 1.5]
 [5.9 3.2 4.8 1.8]
 [6.9 3.1 4.9 1.5]
 [6.7 3.1 4.7 1.5]
 [6.8 2.8 4.8 1.4]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.5 3.2 5.1 2. ]
 [6.3 2.7 4.9 1.8]
 [6.3 2.5 5.  1.9]
 [6.  2.2 5.  1.5]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [6.2 2.8 4.8 1.8]
 [7.2 3.  5.8 1.6]]
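
Besides the support vectors themselves, a fitted `SVC` also exposes how many support vectors belong to each class (`n_support_`) and their row indices in the training data (`support_`). The sketch below refits the same model as in the cell above so that it runs on its own.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

svm_clf = SVC(kernel="linear").fit(X_train, y_train)

# Number of support vectors per class, in the order of svm_clf.classes_
print("Support vectors per class:", svm_clf.n_support_)
# Row indices of the support vectors within X_train
print("Indices in X_train:", svm_clf.support_)
```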


Q7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
(Include your Python code and output in the code box below.)

In [23]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load Breast Cancer dataset
cancer = datasets.load_breast_cancer()
X, y = cancer.data, cancer.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))


Classification Report:

              precision    recall  f1-score   support

   malignant       0.97      0.89      0.93        64
      benign       0.94      0.98      0.96       107

    accuracy                           0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171
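
The precision and recall values in the report come straight from the confusion matrix. As a sketch, the code below refits the same model and recomputes both metrics for the benign class (label 1) by hand.

```python
from sklearn import datasets
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are true labels, columns are predictions (0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()   # benign (1) is treated as the positive class

precision_benign = tp / (tp + fp)   # of predicted benign, how many are benign
recall_benign = tp / (tp + fn)      # of actual benign, how many were found

print(cm)
print("Precision (benign):", round(precision_benign, 2))
print("Recall (benign):   ", round(recall_benign, 2))
```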



Q8.  Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
(Include your Python code and output in the code box below.)

In [24]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load Wine dataset
wine = datasets.load_wine()
X, y = wine.data, wine.target

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Define parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# Create SVM model
svm_clf = SVC()

# Perform GridSearchCV
grid_search = GridSearchCV(svm_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Hyperparameters:", grid_search.best_params_)

# Best model prediction
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Model Accuracy: 0.7777777777777778
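
The run above feeds the raw Wine features to the SVM. RBF kernels depend on distances between samples, and the Wine features have very different ranges, so scaling is a common preprocessing step. The following is a possible variant, not the run recorded above: it wraps a `StandardScaler` and the SVM in a `Pipeline` so that scaling is refit inside each cross-validation fold. The step names `scaler` and `svc` are arbitrary choices.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling inside the pipeline avoids leaking test-fold statistics into training
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, grid_search.predict(X_test))
print("Best Hyperparameters:", grid_search.best_params_)
print("Model Accuracy (with scaling):", scaled_accuracy)
```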


Q9.  Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

In [25]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import numpy as np

# 1. Load dataset (two categories for binary ROC-AUC)
categories = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# 3. Convert text to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4. Train Naive Bayes Classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# 5. Predict probabilities
y_prob = clf.predict_proba(X_test_tfidf)[:, 1]

# 6. Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)

print("ROC-AUC Score:", round(roc_auc, 4))


ROC-AUC Score: 0.9955
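
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this interpretation against `roc_auc_score` on four made-up scores (with no ties).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs that are ranked correctly
pairwise = np.mean([p > n for p in pos for n in neg])
auc = roc_auc_score(y_true, y_score)

print(pairwise, auc)   # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives 0.75.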


Q10.  Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
(Include your Python code and output in the code box below.)
  1. Approach
  a. Preprocessing

- Handling Missing Data:

 - Remove completely empty emails or replace missing text with an empty string ("").

 - Ensure there are no NaN values before vectorization.

- Text Vectorization:

 - Use TF-IDF Vectorization (TfidfVectorizer) to convert text into numerical features.

 - Lowercasing, removing stopwords, and possibly stemming/lemmatization can help.

  b. Model Choice
- Naïve Bayes (MultinomialNB) works well for text classification, especially when word frequency is important and vocabulary is large. It is fast and interpretable.

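- SVM (LinearSVC) can handle high-dimensional sparse data well and often yields slightly better accuracy, but it is typically slower to train than Naïve Bayes and does not produce probability estimates directly (a calibration step such as CalibratedClassifierCV is needed for that).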

- Choice: Start with MultinomialNB for speed and good baseline performance. Move to SVM if higher precision is needed and computational cost is acceptable.

  c. Addressing Class Imbalance
- Spam datasets often have more "Not Spam" than "Spam".

- Solutions:

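 - Use class_weight='balanced' (for SVM) or adjust the class priors (for Naïve Bayes, e.g. via the class_prior parameter of MultinomialNB).

 - Apply oversampling (e.g. SMOTE) or undersampling, on the training split only so the test set keeps the real class distribution.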

 - Use precision-recall metrics instead of accuracy.
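
The first of these options can be sketched as follows. The tiny imbalanced dataset (8 legitimate emails, 2 spam) is invented for illustration, and the choice of equal priors for Naïve Bayes is an assumption, not a recommendation for real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical imbalanced data: 1 = Spam, 0 = Not Spam
texts = [
    "meeting at 10am", "lunch plans today", "invoice attached",
    "project status update", "see you tomorrow", "quarterly report draft",
    "team offsite agenda", "notes from the call",
    "win a free prize now", "cheap meds lowest price",
]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# SVM: reweight errors inversely to class frequency
svm_pipe = make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced"))
svm_pipe.fit(texts, labels)

# Naive Bayes: override the learned (imbalanced) priors with equal ones
nb_pipe = make_pipeline(TfidfVectorizer(), MultinomialNB(class_prior=[0.5, 0.5]))
nb_pipe.fit(texts, labels)

new_email = ["claim your free prize"]
svm_pred = svm_pipe.predict(new_email)
nb_pred = nb_pipe.predict(new_email)
print("SVM prediction:", svm_pred, "NB prediction:", nb_pred)

# The priors actually used by the Naive Bayes model
nb_priors = np.exp(nb_pipe[-1].class_log_prior_)
print("NB class priors:", nb_priors)
```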

  d. Evaluation Metrics
- Since the goal is to catch as much spam as possible without flagging too many legitimate emails, use:

 - Precision: share of emails predicted as spam that are actually spam.

 - Recall: share of actual spam that is correctly detected.

 - F1-score: Balance of precision & recall.

 - ROC-AUC: Overall separability.

  e. Business Impact
- Positive Impact:

 - Reduce spam reaching customers, improving trust.

 - Save employees' time by reducing inbox clutter.

 - Lower phishing risks → protect company data.

- Risks:

 - Too many false positives → important emails lost.

 - Mitigation: tune the decision threshold to balance missed spam against blocked legitimate emails.
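
One way to pick such a threshold is to scan the precision-recall curve and take the lowest threshold that meets a target precision. In the sketch below the labels, the predicted spam probabilities and the 0.95 precision target are all hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels (1 = Spam) and predicted spam probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; drop the last point
target_precision = 0.95
meets_target = precision[:-1] >= target_precision

# Lowest threshold that meets the target keeps recall as high as possible
chosen = thresholds[meets_target][0] if meets_target.any() else None
print("Chosen threshold:", chosen)

if chosen is not None:
    y_pred = (y_prob >= chosen).astype(int)
    print("Emails flagged as spam:", int(y_pred.sum()))
```

With these numbers the lowest threshold reaching the target is 0.7: two emails are flagged, both are spam, and one spam email with score 0.45 is missed. That is the precision/recall trade-off the business has to accept or adjust.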

 2. Python Implementation

In [26]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# Example synthetic dataset
data = {
    'text': [
        'Win money now!!!', 'Lowest price on meds', 'Meeting at 10am', 'Lunch plans today?',
        'You have won a lottery', 'Your invoice is attached', 'Buy cheap products now',
        'Let’s catch up tomorrow', None, 'Congratulations, you won!'
    ],
    'label': [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]  # 1 = Spam, 0 = Not Spam
}
df = pd.DataFrame(data)

# Handle missing values
df['text'] = df['text'].fillna("")

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.3, random_state=42, stratify=df['label']
)

# Vectorization
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Predictions
y_pred = clf.predict(X_test_tfidf)
y_prob = clf.predict_proba(X_test_tfidf)[:, 1]

# Evaluation
# zero_division=0 reports an undefined precision as 0 without emitting a warning
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))
print("ROC-AUC Score:", round(roc_auc_score(y_test, y_prob), 4))


Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.33      1.00      0.50         1

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3

ROC-AUC Score: 1.0


