#SVM & Naive Bayes

#1. What is a Support Vector Machine (SVM), and how does it work?

- A Support Vector Machine (SVM) is a supervised machine learning algorithm used primarily for classification, but it can also be used for regression. It is particularly powerful for binary classification problems.

- How Does SVM Work?
1. Separating Hyperplane
SVM finds the hyperplane that maximizes the margin between the two classes. The margin is the distance between the hyperplane and the nearest data points from each class. These nearest points are called support vectors.

2. Maximizing the Margin
The goal is to maximize this margin so that the classifier is more robust and generalizes better.

3. Support Vectors
Support vectors are the critical elements of the dataset—they are the data points that lie closest to the decision boundary. The position of these vectors determines the hyperplane.

4. Mathematical Objective
Given training data $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$, the goal is to solve:

$$
\text{Minimize } \frac{1}{2}\|w\|^2
$$

$$
\text{Subject to } y_i (w \cdot x_i + b) \ge 1 \quad \text{for all } i
$$
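
To make the constraint concrete, here is a minimal sketch that checks a candidate hyperplane against a toy dataset. The data points and the values of `w` and `b` are hypothetical and chosen by hand; nothing is learned here, the code only evaluates the objective and the constraints.

```python
import math

# Toy 2-D data (hypothetical values chosen for illustration).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# A candidate hyperplane w.x + b = 0 (assumed by hand, not learned).
w = (0.5, 0.5)
b = -1.0


def functional_margin(x_i, y_i):
    """Return y_i * (w . x_i + b); the constraint requires this to be >= 1."""
    return y_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)


margins = [functional_margin(x_i, y_i) for x_i, y_i in zip(X, y)]

# The hyperplane is feasible if every point satisfies the constraint.
feasible = all(m >= 1 for m in margins)

# Objective value 0.5 * ||w||^2 and the geometric margin width 2 / ||w||.
norm_w = math.sqrt(sum(w_j ** 2 for w_j in w))
objective = 0.5 * norm_w ** 2
margin_width = 2 / norm_w

# Points that meet the constraint with equality lie on the margin boundary.
on_margin = [x_i for x_i, m in zip(X, margins) if m == 1]

print(margins, feasible, round(margin_width, 3), on_margin)
```

Points whose value is exactly 1 sit on the margin boundary, which is where support vectors lie. Making $\|w\|$ smaller widens the margin, which is why the objective minimizes $\frac{1}{2}\|w\|^2$.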




#2. Explain the difference between Hard Margin and Soft Margin SVM.

- Hard Margin SVM
Hard Margin SVM is used when the data is perfectly linearly separable. It tries to find a hyperplane that separates the classes with the maximum margin and does not allow any misclassification. All data points must lie outside or exactly on the margin boundaries. This approach works well only when the data is clean and has no overlap or outliers.

- Soft Margin SVM
Soft Margin SVM is a more flexible version used when the data is not perfectly separable. It allows some misclassifications or violations of the margin by introducing slack variables. A penalty parameter $C$ controls the trade-off between maximizing the margin and minimizing classification error. Soft Margin SVM is more robust and suitable for real-world, noisy datasets where perfect separation is not possible.
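
The role of the slack variables and of $C$ can be sketched numerically. The code below assumes the standard textbook soft-margin objective, $\frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ with $\xi_i = \max(0,\ 1 - y_i(w \cdot x_i + b))$, which the answer above describes only in words. The data and the hyperplane are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 0.5*||w||^2 + C * sum(slack_i)."""
    slacks = []
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # Slack is zero when the point is on the correct side of its margin.
        slacks.append(max(0.0, 1.0 - y_i * score))
    objective = 0.5 * sum(w_j ** 2 for w_j in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical data: the last point is labelled -1 but sits on the +1 side.
X = [(2.0, 2.0), (0.0, 0.0), (1.5, 1.5)]
y = [1, -1, -1]
w, b = (0.5, 0.5), -1.0

obj_small_c, slacks = soft_margin_objective(w, b, X, y, C=0.1)
obj_large_c, _ = soft_margin_objective(w, b, X, y, C=10.0)

print(slacks)        # only the outlier has non-zero slack
print(obj_small_c)   # the violation is cheap when C is small
print(obj_large_c)   # the same violation dominates when C is large
```

With a small `C` the optimizer can tolerate the outlier and keep a wide margin; with a large `C` the same violation becomes expensive, pushing the solution toward hard-margin behavior.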



#3.  What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

- The Kernel Trick is a technique used in Support Vector Machines (SVMs) to handle non-linearly separable data. It allows the SVM to find a hyperplane in a higher-dimensional space without explicitly transforming the data into that space.

Instead of mapping the data points manually, the kernel trick uses a kernel function to compute the inner product of data points in the transformed feature space, making the computation much more efficient.

- Example of a Kernel: Radial Basis Function (RBF) / Gaussian Kernel
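Formula:

$$
K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)
$$

Where:

$x$ and $x'$ are two input vectors

$\gamma$ is a parameter that controls the width of the Gaussian: a larger $\gamma$ gives a narrower Gaussian, so each training point influences a smaller neighborhood

Use case: the RBF kernel is a common choice when the classes cannot be separated by a straight line (or flat hyperplane) in the original feature space, for example when one class forms a cluster surrounded by the other.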
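
As a small illustration of the formula, the following sketch evaluates the RBF kernel directly. The function name `rbf_kernel` and the sample vectors are assumptions made for this example; it is not the routine scikit-learn uses internally.

```python
import math


def rbf_kernel(x, x_prime, gamma):
    """Return exp(-gamma * ||x - x'||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


x = (1.0, 2.0)
x_prime = (2.0, 4.0)  # squared distance to x is 1 + 4 = 5

k_same = rbf_kernel(x, x, gamma=0.5)           # identical points give 1.0
k_wide = rbf_kernel(x, x_prime, gamma=0.1)     # small gamma: wide Gaussian
k_narrow = rbf_kernel(x, x_prime, gamma=1.0)   # large gamma: narrow Gaussian

print(k_same, round(k_wide, 4), round(k_narrow, 4))
```

The kernel value behaves like a similarity score between 0 and 1. Raising `gamma` makes the similarity fall off faster with distance, which lets the decision boundary bend more tightly around individual points.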

#4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

- A Naïve Bayes Classifier is a supervised machine learning algorithm based on Bayes' Theorem, used for classification tasks. It predicts the class of a given data point based on probabilities of feature values.

The core idea is to calculate the posterior probability of each class given the input features, and choose the class with the highest probability.

- Why is it Called “Naïve”?
It is called “naïve” because it assumes that all features are independent of each other given the class label — which is rarely true in real-world data.

Example of the Naïve Assumption:
If we are classifying emails as spam or not spam based on words used:

Naïve Bayes assumes that the presence of the word “money” is independent of the word “free”, even though they may often occur together in spam emails.
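
The independence assumption can be seen in a few lines of arithmetic. The priors and per-word likelihoods below are made-up numbers for illustration only; a real model would estimate them from training data.

```python
# Hypothetical probabilities, invented for this example.
prior = {"spam": 0.4, "not_spam": 0.6}
likelihood = {
    "spam": {"free": 0.5, "money": 0.4},
    "not_spam": {"free": 0.1, "money": 0.05},
}


def posterior(words):
    """Apply Bayes' theorem with the naive independence assumption."""
    joint = {}
    for label in prior:
        p = prior[label]
        # Naive step: P(words | class) is the product of P(word | class).
        for word in words:
            p *= likelihood[label][word]
        joint[label] = p
    total = sum(joint.values())
    return {label: p / total for label, p in joint.items()}


result = posterior(["free", "money"])
predicted = max(result, key=result.get)

print(result, predicted)
```

The product inside the loop is exactly where "naïve" comes in: the likelihood of seeing "free" and "money" together is treated as the product of their separate likelihoods, ignoring any correlation between them.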

#5. Naïve Bayes variants: Gaussian, Multinomial, and Bernoulli

Gaussian Naïve Bayes

This variant assumes that the features are continuous and follow a normal (Gaussian) distribution within each class. It is used when the data includes real-valued numerical inputs like age, temperature, or income. For each class, the model estimates the mean and variance of each feature and uses them to compute class-conditional likelihoods.

When to use it:

Use Gaussian Naïve Bayes when your features are continuous numerical values, such as in medical or sensor data.

Multinomial Naïve Bayes

This version is suitable for discrete data, especially word counts or frequencies. It is commonly used in text classification tasks, where the features represent how often a word appears in a document. It works well with high-dimensional data like text, where the frequency of terms matters.

When to use it:

Use Multinomial Naïve Bayes when your features represent counts, like how many times a word appears in text data.

Bernoulli Naïve Bayes

This variant is designed for binary (0 or 1) features, indicating the presence or absence of a feature. It doesn’t care how many times a word appears, only whether it appears at all. It is useful for tasks where binary features make more sense than counts.

When to use it:

Use Bernoulli Naïve Bayes when your features are binary, like in spam detection where it matters whether certain keywords are present in an email, not how many times they appear.
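
The difference between the Multinomial and Bernoulli variants comes down to how the same text is turned into features. The sketch below builds both representations by hand; the three-word vocabulary and the sample email are hypothetical.

```python
from collections import Counter

vocabulary = ["free", "money", "meeting"]  # hypothetical tiny vocabulary
email = "free money free free"

counts = Counter(email.split())

# Multinomial-style input: how many times each vocabulary word occurs.
count_features = [counts[word] for word in vocabulary]

# Bernoulli-style input: only whether each vocabulary word occurs at all.
binary_features = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(count_features)   # word frequencies
print(binary_features)  # presence / absence
```

Gaussian Naïve Bayes would not use either vector as-is; it expects continuous measurements and summarizes each feature per class with a mean and a variance.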



#6. Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data       # Features
y = iris.target     # Target labels

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_model.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print support vectors
print("\nSupport Vectors:")
print(svm_model.support_vectors_)


Model Accuracy: 1.00

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


#7. Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.

In [2]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data       # Features
y = data.target     # Target labels

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



#8. Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy

In [3]:
# Import necessary libraries
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
data = load_wine()
X = data.data
y = data.target

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]
}

# Create an SVM model with RBF kernel
svm_model = SVC(kernel='rbf')

# Set up GridSearchCV
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')

# Train using GridSearchCV
grid_search.fit(X_train, y_train)

# Get the best estimator and make predictions
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Print best hyperparameters and accuracy
print("Best Hyperparameters:")
print(grid_search.best_params_)

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.2f}")


Best Hyperparameters:
{'C': 100, 'gamma': 0.001}

Test Accuracy: 0.83
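
One possible reason for the modest test accuracy is that the Wine features are on very different scales, and the RBF kernel depends on distances between samples. The sketch below is a variant of the search above, not a replacement for it: it adds a `StandardScaler` step through `make_pipeline`, which is an assumption on my part rather than something the exercise asks for. Whether it improves the score should be checked by running it.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each cross-validation fold is
# scaled using statistics from its own training portion only.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)

print("Best Hyperparameters:")
print(best_params)
print(f"\nTest Accuracy (with scaling): {test_accuracy:.2f}")
```

Putting the scaler inside the pipeline, rather than scaling `X_train` once up front, keeps information from the validation folds out of the scaling statistics during the grid search.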


#9.  Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.

In [4]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load two categories from the 20 Newsgroups dataset for binary classification
categories = ['sci.med', 'rec.sport.baseball']
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split into train and test sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)

# Convert text data to TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)

# Train a Multinomial Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate and print the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC Score: {roc_auc:.2f}")


ROC-AUC Score: 1.00
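
To make the score easier to interpret: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following sketch computes it by brute force over all pairs. The helper name `roc_auc_by_pairs` and the four-sample input are hypothetical, and this quadratic loop is for illustration only, not how `roc_auc_score` is implemented.

```python
def roc_auc_by_pairs(y_true, y_score):
    """Return AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_by_pairs(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A score of 1.00, as reported above, means every document from the positive class was scored higher than every document from the negative class on this test split.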


#10.  Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.






Here's an approach to classifying emails as Spam or Not Spam, considering the characteristics mentioned:

**1. Data Preprocessing:**

*   **Handling Missing Data:** Identify and handle any missing data points. Depending on the nature of the missing data (e.g., missing subject line, missing body), you could consider:
    *   **Imputation:** Replace missing values with a placeholder (e.g., "missing subject", "empty body").
    *   **Dropping:** If a significant portion of an email is missing, you might consider dropping it, although this should be done cautiously to avoid losing valuable data.
*   **Text Vectorization:** Convert the raw text data into a numerical format that machine learning models can understand. Given the diverse vocabulary, TF-IDF (Term Frequency-Inverse Document Frequency) is a suitable choice. TF-IDF captures the importance of words in a document relative to the entire corpus. You could also explore techniques like word embeddings (e.g., Word2Vec, GloVe) for more semantic representations, but TF-IDF is a good starting point for this problem.
*   **Cleaning Text:** Before vectorization, perform text cleaning steps like:
    *   Removing punctuation and special characters.
    *   Converting text to lowercase.
    *   Removing stop words (common words like "the", "a", "is").
    *   Stemming or lemmatization to reduce words to their root form.
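
The cleaning and missing-data steps above can be sketched with the standard library alone. The stop-word set is a tiny illustrative list, the function name `clean_email` is hypothetical, and stemming or lemmatization is left out because it would need an external library.

```python
import re

STOP_WORDS = {"the", "a", "is", "to", "and"}  # tiny illustrative list


def clean_email(text):
    """Lowercase, strip punctuation, and drop stop words from an email body."""
    # Missing or blank bodies get a placeholder instead of being dropped.
    if text is None or not text.strip():
        return "empty body"
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_email("WIN a FREE prize!!! Click to claim.")
missing = clean_email(None)

print(cleaned)
print(missing)
```

The cleaned strings can then be passed to a vectorizer such as `TfidfVectorizer`, which can also handle lowercasing and stop-word removal itself if you prefer to keep preprocessing in one place.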

**2. Model Choice and Justification (SVM vs. Naïve Bayes):**

Both SVM and Naïve Bayes are candidates for this task, but each has its pros and cons:

*   **Naïve Bayes (Specifically Multinomial Naïve Bayes):**
    *   **Pros:**
        *   Simple and fast to train, even on large datasets.
        *   Works well with high-dimensional data like text.
        *   Performs well in many text classification tasks with features such as word counts; it is also commonly applied to TF-IDF weights, although those are fractional values rather than discrete counts.
    *   **Cons:**
        *   The "naïve" independence assumption may not hold true in real-world email data (word occurrences are often dependent).
        *   May not capture complex relationships between features as effectively as SVM.

*   **Support Vector Machine (SVM):**
    *   **Pros:**
        *   Effective in high-dimensional spaces.
        *   Can handle non-linearly separable data using kernels.
        *   Often performs well on text classification tasks.
    *   **Cons:**
        *   Training can be computationally expensive, especially on very large datasets.
        *   Choosing the right kernel and hyperparameters can be challenging.

**Justification:**

Given the potential for complex relationships between words in spam emails, and the fact that SVM can handle high-dimensional data and non-linear relationships, **SVM is generally a stronger choice than Naïve Bayes for spam classification.** While Naïve Bayes can provide a good baseline, SVM's ability to find optimal separating hyperplanes often leads to better performance on this type of problem. You could start with a linear SVM and then explore RBF or other kernels if needed.

**3. Addressing Class Imbalance:**

Class imbalance is a common issue in spam detection, as legitimate emails far outnumber spam emails. Ignoring this can lead to a model that is biased towards the majority class (not spam) and performs poorly on the minority class (spam). To address this:

*   **Resampling Techniques:**
    *   **Oversampling:** Duplicate or synthesize samples of the minority class (spam) to increase its representation in the training data. SMOTE (Synthetic Minority Over-sampling Technique) is a common way to generate synthetic samples. Resample only the training split, so the test set keeps the real class ratio.
    *   **Undersampling:** Randomly remove samples from the majority class (not spam) to balance the dataset. Be cautious with undersampling, as it can lead to loss of valuable information.
*   **Using Class Weights:** Many machine learning libraries let you assign higher weights to the minority class during training; in scikit-learn, for example, `SVC` accepts `class_weight='balanced'`. This penalizes misclassifications of the minority class more heavily, encouraging the model to pay more attention to spam emails.
*   **Choosing Appropriate Metrics:** Relying solely on accuracy can be misleading with imbalanced data. Use metrics that are more sensitive to the performance on the minority class (see next point).
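
As a rough illustration of class weighting, the sketch below computes weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The 90/10 label split and the helper name `balanced_class_weights` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: n_samples / (n_classes * count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical imbalanced training labels: 90 legitimate, 10 spam.
labels = ["not_spam"] * 90 + ["spam"] * 10
weights = balanced_class_weights(labels)

print(weights)  # the rare class receives the larger weight
```

With these weights, an error on a spam email costs nine times as much as an error on a legitimate one, which offsets the 9:1 imbalance in this example.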

**4. Evaluating Performance with Suitable Metrics:**

For imbalanced datasets, it's crucial to use metrics beyond simple accuracy:

*   **Confusion Matrix:** Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives.
*   **Precision:** The proportion of correctly classified spam emails out of all emails predicted as spam. High precision is important to minimize false positives (legitimate emails classified as spam).
*   **Recall (Sensitivity):** The proportion of correctly classified spam emails out of all actual spam emails. High recall is important to minimize false negatives (spam emails classified as legitimate).
*   **F1-Score:** The harmonic mean of precision and recall, providing a balanced measure of the model's performance.
*   **ROC Curve and AUC (Area Under the Curve):** The ROC curve plots the true positive rate against the false positive rate at various threshold settings. AUC provides a single measure of the model's ability to distinguish between the two classes. A higher AUC indicates better performance.
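
The gap between accuracy and the minority-class metrics is easy to show with a confusion matrix. The counts below are hypothetical (1000 emails, 50 of them spam), and the helper name `precision_recall_f1` is made up for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical classifier results on 1000 emails, 50 of which are spam.
tp, fp, fn, tn = 40, 10, 10, 940
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Baseline that labels every email "not spam": no spam is ever caught.
baseline_accuracy = (0 + 950) / 1000
_, baseline_recall, _ = precision_recall_f1(tp=0, fp=0, fn=50)

print(precision, recall, f1, accuracy)
print(baseline_accuracy, baseline_recall)
```

The do-nothing baseline still reaches 95% accuracy while catching no spam at all, which is why precision, recall, and F1 on the spam class are the numbers to watch.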

**5. Business Impact of the Solution:**

Implementing an effective spam classification solution has significant business impact:

*   **Increased Productivity:** Users spend less time sifting through spam, allowing them to focus on important emails.
*   **Improved Security:** Spam emails often contain phishing attempts, malware, or other security threats. Classifying and filtering them reduces the risk of security breaches.
*   **Reduced Storage Costs:** Filtering out spam reduces the amount of data that needs to be stored on email servers.
*   **Enhanced User Experience:** Users have a cleaner and more organized inbox, leading to a better experience with the email system.
*   **Compliance and Legal Benefits:** In some industries, effective spam filtering is required for compliance with regulations.

By taking this comprehensive approach, you can build a robust and effective spam classification system that provides significant value to the company.