# SVM & Naive Bayes | Assignment

1. What is a Support Vector Machine (SVM), and how does it work?
   - A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks.
   
   - How it works:

   - Hyperplane: In a 2D space, a hyperplane is a line; in n dimensions, it is a flat (n-1)-dimensional subspace. The SVM algorithm looks for the hyperplane that maximizes the margin between the classes.
   - Support Vectors: These are the data points closest to the hyperplane. They alone determine its position and orientation; removing any other training point leaves the solution unchanged.
   - Margin: The margin is the distance between the hyperplane and the nearest data points (support vectors) from each class. A wider margin generally gives better generalization.
   - Kernel Trick: SVM can handle non-linearly separable data by using a kernel function, which implicitly maps the data into a higher-dimensional space where a linear separator may exist.
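
The ideas above can be made concrete with a small sketch. It fits a linear SVM on a hypothetical, hand-made 2D dataset (the points and the very large `C`, used here to approximate a hard margin, are assumptions for illustration) and reads the hyperplane, support vectors, and margin from the fitted scikit-learn model.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (two classes in 2D)
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations
model = SVC(kernel="linear", C=1e6)
model.fit(X, y)

# The hyperplane is w . x + b = 0
w = model.coef_[0]
b = model.intercept_[0]

# Distance from the hyperplane to the nearest point of each class
margin = 1.0 / np.linalg.norm(w)

print("Support vectors:\n", model.support_vectors_)
print("w =", w, "b =", b)
print(f"Margin: {margin:.3f}")
```

Only the points nearest to the boundary appear in `support_vectors_`; the remaining points could be moved further away from the boundary without changing `w` or `b`.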

2. Explain the difference between Hard Margin and Soft Margin SVM.
   - Hard Margin SVM:
     - Strict Separation: Hard Margin SVM aims to find a hyperplane that perfectly separates the data points of different classes.
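     - Sensitivity to Outliers: Hard Margin SVM is highly sensitive to outliers. A single mislabeled or extreme point can shift the hyperplane drastically, or make the problem infeasible if the classes are no longer separable.
     - Mathematical Formulation: Every data point must lie on the correct side of the margin, i.e. $y_i(w \cdot x_i + b) \ge 1$ for all $i$, with no exceptions allowed.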
     - Use Case: Hard Margin SVM is suitable for data that is known to be linearly separable and free of noise or outliers.

   - Soft Margin SVM:
     - Tolerance for Misclassification: Soft Margin SVM allows some data points to lie within the margin or even on the wrong side of the hyperplane.
     - Robustness to Outliers: Soft Margin SVM is more robust to outliers and noise compared to Hard Margin SVM.
     - Mathematical Formulation: The formulation adds slack variables, which measure how far each point violates the margin, and a regularization parameter C that sets the trade-off: a large C penalizes violations heavily (approaching a hard margin), while a small C tolerates more violations in exchange for a wider margin.
     - Use Case: Soft Margin SVM is more widely used in practice because real-world data is often noisy and not perfectly linearly separable.
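
The two formulations can be written side by side. This is the standard textbook primal form; the symbols $w$ (weight vector), $b$ (bias), $\xi_i$ (slack variables), and labels $y_i \in \{-1, +1\}$ are notation assumed here for illustration.

```latex
% Hard margin: no point may violate the margin
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 \quad \forall i

% Soft margin: violations are allowed but penalized through C
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall i
```

Setting every $\xi_i = 0$ in the second problem recovers the first, which is why a very large C behaves like a hard margin.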

3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
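   - The Kernel Trick is a fundamental concept in SVM that lets it handle non-linearly separable data without explicitly transforming the data into a higher-dimensional space. Instead of computing the mapping, the algorithm only evaluates inner products between pairs of points through a kernel function.
   - Mathematically: The kernel function satisfies $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$, where $\phi$ is the mapping function that transforms the data from the original space to the higher-dimensional space.
   - Example: The RBF (Gaussian) kernel, $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, measures similarity by distance; $\gamma$ controls how quickly the similarity decays.
   - Use Case: The RBF kernel is particularly useful when the relationship between the features and the class labels is non-linear and complex, and no specific functional form is known in advance.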
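
The kernel identity can be checked numerically. Because the feature map of the RBF kernel is infinite-dimensional, the sketch below swaps in a degree-2 polynomial kernel, whose explicit map is small enough to write out; the function names `phi` and `poly_kernel` and the sample points are hypothetical. It then evaluates the RBF kernel with scikit-learn's `rbf_kernel` against the closed-form expression.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def phi(x):
    """Explicit feature map for the kernel (x . z)^2 in two dimensions."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Same quantity computed in the original space, without calling phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the mapped space
implicit = poly_kernel(x, z)       # kernel evaluated in the original space
print(explicit, implicit)

# RBF kernel: library call versus the closed-form expression
gamma = 0.5
rbf_library = rbf_kernel(x.reshape(1, -1), z.reshape(1, -1), gamma=gamma)[0, 0]
rbf_formula = np.exp(-gamma * np.sum((x - z) ** 2))
print(rbf_library, rbf_formula)
```

Both pairs of values agree, which is the point of the trick: the classifier never needs the mapped vectors, only the kernel values.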

4. What is a Naïve Bayes Classifier, and why is it called “naïve”?
   - A Naïve Bayes Classifier is a probabilistic machine learning algorithm used for classification tasks. It applies Bayes' theorem to compute the probability of each class given the observed features and predicts the most probable class.
   - It's called "naïve" because it assumes that all features are conditionally independent of each other given the class. This assumption is strong and often false in practice (for example, words in a sentence are correlated), but it makes the model simple and fast to train.
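
A short worked example shows how the independence assumption is used: the class score is the prior multiplied by one likelihood per feature. The priors, the per-word likelihoods, and the function name `posterior` below are made-up values for illustration, not estimates from real data.

```python
from math import prod

# Hypothetical class priors P(class)
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical per-word likelihoods P(word | class)
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Return P(class | words), treating words as independent given the class."""
    scores = {
        label: priors[label] * prod(likelihoods[label][w] for w in words)
        for label in priors
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

print(posterior(["free"]))
print(posterior(["free", "meeting"]))
```

Multiplying the per-word likelihoods is only valid under the independence assumption; without it, the joint probability of all words given the class would have to be estimated directly.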

5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
   - Gaussian Naïve Bayes:
     - Assumption: The continuous features within each class are distributed according to a Gaussian (normal) distribution.
     - How it Works: It estimates the mean and standard deviation of each feature for each class. To classify a new data point, it evaluates the likelihood of that point's feature values under each class's Gaussian distributions.
     - Use Case: Gaussian Naïve Bayes is typically used when features are continuous and roughly normally distributed, for example height, weight, or temperature measurements.
   - Multinomial Naïve Bayes:
     - Assumption: This variant is suitable for features that represent counts or frequencies. It assumes that the features are generated from a multinomial distribution.
     - How it Works: It estimates, for each class, the probability of each feature from the observed counts, and scores a new sample by how well its counts match each class.
     - Use Case: Multinomial Naïve Bayes is widely used for text classification problems, such as spam filtering, document categorization, and sentiment analysis. It works well with discrete features representing counts, like the number of times a word appears in a document.
   - Bernoulli Naïve Bayes:
     - Assumption: This variant is designed for binary or boolean features (features that are either present or absent). It assumes that features are generated from a Bernoulli distribution.
     - How it Works: It estimates the probability of a feature being present (value 1) or absent (value 0) given a class. Unlike the multinomial variant, the absence of a feature also contributes to the score.
     - Use Case: Bernoulli Naïve Bayes is suitable for classification tasks where features are binary. It is also often used in text classification, particularly when the presence or absence of a word matters more than its frequency (e.g., for short texts or when dealing with a very large vocabulary).
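
As a rough illustration of matching each variant to its feature type, the sketch below trains the three scikit-learn classes on tiny hand-made arrays. All values are hypothetical: heights in centimetres for the Gaussian case, a word-count matrix for the multinomial case, and the same matrix reduced to presence/absence for the Bernoulli case.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Continuous feature (hypothetical heights in cm) -> Gaussian variant
X_cont = np.array([[150.0], [155.0], [160.0], [180.0], [185.0], [190.0]])
y_cont = np.array([0, 0, 0, 1, 1, 1])
gaussian_pred = GaussianNB().fit(X_cont, y_cont).predict([[152.0], [188.0]])

# Count features (hypothetical word counts per document) -> Multinomial variant
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]])
y_docs = np.array([1, 1, 0, 0])
multinomial_pred = MultinomialNB().fit(X_counts, y_docs).predict([[2, 0, 0]])

# Binary features (word present or absent) -> Bernoulli variant
X_binary = (X_counts > 0).astype(int)
bernoulli_pred = BernoulliNB().fit(X_binary, y_docs).predict([[1, 0, 0]])

print("Gaussian:", gaussian_pred)
print("Multinomial:", multinomial_pred)
print("Bernoulli:", bernoulli_pred)
```

The three estimators share the same `fit`/`predict` interface, so switching variants is mostly a matter of preparing the features in the form each one expects.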

6. Write a Python program to:

● Load the Iris dataset

● Train an SVM Classifier with a linear kernel

● Print the model's accuracy and support vectors.

In [4]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

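# Load the Iris dataset (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = svm_classifier.predict(X_test)

# Report the test accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print the training points that define the decision boundaries
print("\nSupport Vectors:")
print(svm_classifier.support_vectors_)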

Model Accuracy: 1.00

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


7. Write a Python program to:

● Load the Breast Cancer dataset

● Train a Gaussian Naïve Bayes model

● Print its classification report including precision, recall, and F1-score.

In [5]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

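# Load the Breast Cancer dataset (binary target: malignant vs. benign)
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Gaussian Naïve Bayes model on the continuous features
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = gnb_classifier.predict(X_test)

# Report precision, recall, and F1-score per class
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))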

Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



8. Write a Python program to:

● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.

● Print the best hyperparameters and accuracy.

In [8]:
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['rbf']} # Using RBF kernel, which is common for C and gamma tuning

# Create an SVM classifier
svm = SVC()

# Create GridSearchCV object
grid_search = GridSearchCV(svm, param_grid, cv=5) # 5-fold cross-validation

# Fit the GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters:", grid_search.best_params_)

# Get the best model
best_svm_model = grid_search.best_estimator_

# Predict on the test set using the best model
y_pred = best_svm_model.predict(X_test)

# Calculate and print the accuracy of the best model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Hyperparameters: {accuracy:.2f}")

Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Model Accuracy with Best Hyperparameters: 0.78
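
One likely reason for the modest accuracy above is that the features were not scaled. The RBF kernel depends on distances between samples, and the Wine features have very different ranges (proline is in the hundreds to thousands while hue is close to 1), so a few features can dominate. A minimal sketch, assuming standardization is acceptable for this task, wraps `StandardScaler` and `SVC` in a `Pipeline` so that scaling is fitted only on the training folds. The step names `scaler` and `svm` are illustrative choices.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same data and split as in the cell above
wine = datasets.load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=42
)

# Scale the features, then fit the RBF SVM
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC(kernel="rbf")),
])

# Same grid as above; the "svm__" prefix routes parameters to the SVC step
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__gamma": [1, 0.1, 0.01, 0.001],
}

scaled_search = GridSearchCV(pipeline, param_grid, cv=5)
scaled_search.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, scaled_search.predict(X_test))
print("Best Hyperparameters (scaled):", scaled_search.best_params_)
print(f"Test Accuracy (scaled): {scaled_accuracy:.2f}")
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training during cross-validation.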


9. Write a Python program to:

● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).

● Print the model's ROC-AUC score for its predictions.

In [10]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

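# Load two newsgroups so the task is binary (required for a single ROC-AUC score)
categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_data = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)

X = newsgroups_data.data
y = newsgroups_data.target

# The targets are already 0/1 for two categories; LabelEncoder leaves them unchanged here
label_encoder = LabelEncoder()
y_binary = label_encoder.fit_transform(y)

# Hold out 30% of the documents for testing
X_train, X_test, y_train_binary, y_test_binary = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Fit TF-IDF on the training text only, then apply it to the test text
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train Multinomial Naïve Bayes on the TF-IDF features
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train_binary)

# ROC-AUC needs scores, so use the predicted probability of class 1
y_pred_proba = nb_classifier.predict_proba(X_test_tfidf)[:, 1]

roc_auc = roc_auc_score(y_test_binary, y_pred_proba)

print(f"Model ROC-AUC Score: {roc_auc:.2f}")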

Model ROC-AUC Score: 0.99


In [14]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Wine dataset (13 continuous features, 3 classes)
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Same split as in the grid search cell, so the results are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate Gaussian Naïve Bayes
gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, y_train)

y_pred_gnb = gnb_classifier.predict(X_test)

accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print("Gaussian Naïve Bayes Performance on Wine Dataset:")
print(f"Model Accuracy: {accuracy_gnb:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_gnb, target_names=wine.target_names))

print("-" * 50) # Separator

# Train and evaluate an SVM with the hyperparameters reported by the grid search above
# (the features are used unscaled, as in that cell)
best_svm_params = {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'} # Replace with actual best params if different
svm_classifier = SVC(**best_svm_params)
svm_classifier.fit(X_train, y_train)

y_pred_svm = svm_classifier.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Performance on Wine Dataset (with Best Hyperparameters):")
print(f"Model Accuracy: {accuracy_svm:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm, target_names=wine.target_names))

Gaussian Naïve Bayes Performance on Wine Dataset:
Model Accuracy: 1.00

Classification Report:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        19
     class_1       1.00      1.00      1.00        21
     class_2       1.00      1.00      1.00        14

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

--------------------------------------------------
SVM Performance on Wine Dataset (with Best Hyperparameters):
Model Accuracy: 0.78

Classification Report:
              precision    recall  f1-score   support

     class_0       0.85      0.89      0.87        19
     class_1       0.83      0.71      0.77        21
     class_2       0.62      0.71      0.67        14

    accuracy                           0.78        54
   macro avg       0.77      0.77      0.77        54
weighted avg       0.79      0.78      0.

10. Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

## Approach to Email Spam Classification

Here's a comprehensive approach to classifying emails as Spam or Not Spam, considering the characteristics of the data and the goal:

**1. Data Preprocessing:**

Given the nature of email data, several preprocessing steps are crucial:

*   **Handling Missing Data:** Emails can have missing subjects, body text, or sender information. Strategies include:
    *   **Imputation:** Filling missing values with a placeholder (e.g., an empty string for text fields).
    *   **Removal:** If the amount of missing data is small and doesn't significantly impact the dataset size, rows with missing values can be removed.
*   **Text Cleaning:** Raw email text contains noise that needs to be removed or standardized:
    *   **Lowercasing:** Convert all text to lowercase to treat words like "Spam" and "spam" as the same.
    *   **Removing Punctuation and Special Characters:** These often don't contribute to the meaning of the text for classification.
    *   **Removing Stop Words:** Words like "the," "a," "is," etc., are common and usually not indicative of spam.
    *   **Stemming or Lemmatization:** Reducing words to their root form (e.g., "running," "runs," "ran" to "run") can help reduce the vocabulary size and improve model generalization.
*   **Text Vectorization:** Machine learning models require numerical input. Text data needs to be converted into numerical representations:
    *   **Bag-of-Words (BoW):** Represents the text as a collection of words, where the order doesn't matter. The value for each word is its frequency in the document or a binary presence/absence indicator.
    *   **TF-IDF (Term Frequency-Inverse Document Frequency):** Weights words based on their frequency in a document and their rarity across all documents. This helps highlight words that are more specific to spam or non-spam emails. TF-IDF is often preferred over simple BoW for text classification.
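
The preprocessing steps above can be sketched as follows. The email records, the field names `subject` and `body`, and the helper `combine_fields` are hypothetical; missing fields are replaced with empty strings, and `TfidfVectorizer` handles lowercasing, punctuation removal (through its default tokenizer), and English stop words. Stemming and lemmatization are left out because they need an additional library.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None marks a missing field
raw_emails = [
    {"subject": "WIN a FREE prize!!!", "body": "Click here to claim the money now."},
    {"subject": None, "body": "The meeting is moved to Monday."},
    {"subject": "Project update", "body": None},
]

def combine_fields(email):
    """Replace missing fields with empty strings and join subject and body."""
    subject = email.get("subject") or ""
    body = email.get("body") or ""
    return f"{subject} {body}".strip()

documents = [combine_fields(email) for email in raw_emails]

# Lowercase the text, drop English stop words, and weight terms by TF-IDF
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_tfidf = vectorizer.fit_transform(documents)

print(X_tfidf.shape)
print(sorted(vectorizer.vocabulary_))
```

In a real pipeline the vectorizer would be fitted on the training emails only and then applied to new emails with `transform`.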

**2. Model Choice and Justification (SVM vs. Naïve Bayes):**

Both SVM and Naïve Bayes are suitable candidates for text classification, but they have different strengths:

*   **Naïve Bayes (specifically Multinomial or Bernoulli):**
    *   **Justification:** Naïve Bayes is a classic choice for text classification due to its simplicity, efficiency, and good performance, especially with high-dimensional data like text features (word counts or presence). The "naïve" independence assumption, while often violated in reality, doesn't always negatively impact performance in practice.
    *   **Variants:** Multinomial Naïve Bayes is suitable if using word counts or frequencies (BoW), while Bernoulli Naïve Bayes is better if using binary features (word presence/absence).
*   **Support Vector Machine (SVM):**
    *   **Justification:** SVMs are powerful margin-based classifiers that cope well with high-dimensional spaces. For sparse TF-IDF features a linear kernel is usually sufficient and fast; a non-linear kernel (like RBF) can capture more complex relationships between features at a higher computational cost.
    *   **Considerations:** SVM can be computationally more expensive to train than Naïve Bayes, especially on very large datasets, and it requires tuning C (and gamma for the RBF kernel).

**Choice:**

For this task, both models are viable. **Multinomial Naïve Bayes** is a good starting point due to its speed and effectiveness on text data. **SVM** could potentially achieve higher accuracy if the relationship between features and classes is complex, but it requires more computational resources and hyperparameter tuning. Given the diverse vocabulary and the complex patterns in spam, **SVM with TF-IDF features** might offer better performance, with Naïve Bayes kept as a strong baseline to compare against.
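
A rough way to compare the two candidates is to put each behind the same TF-IDF step and train both on the same data. The sketch below uses a tiny hypothetical corpus, and `LinearSVC` is swapped in as the SVM on the assumption that a linear kernel is adequate for sparse text features; on real data the comparison should use cross-validation on a proper training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus
train_texts = [
    "win free money now",
    "claim your free prize today",
    "cheap loans win cash prize",
    "team meeting moved to monday",
    "please review the project report",
    "lunch with the project team",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Identical vectorization, different classifiers
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
svm_model = make_pipeline(TfidfVectorizer(), LinearSVC())

new_emails = ["free cash prize", "project meeting on monday"]
for name, model in [("Naive Bayes", nb_model), ("Linear SVM", svm_model)]:
    model.fit(train_texts, train_labels)
    print(name, model.predict(new_emails))
```

Wrapping the vectorizer and classifier in one pipeline keeps the comparison fair, since both models see exactly the same features.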

**3. Addressing Class Imbalance:**

The scenario mentions potential class imbalance (far more legitimate emails than spam). This is a common problem in spam detection and can lead to models that are biased towards the majority class (not spam). Strategies to address this include:

*   **Resampling Techniques:**
    *   **Oversampling Minority Class:** Creating synthetic samples of the minority class (spam) to balance the dataset. Techniques like SMOTE (Synthetic Minority Over-sampling Technique) are popular.
    *   **Undersampling Majority Class:** Randomly removing samples from the majority class (not spam) to reduce its size. This can lead to loss of information.
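*   **Using Evaluation Metrics Sensitive to Imbalance:** Accuracy can be misleading with imbalanced data: if 95% of emails are legitimate, a model that labels everything as not spam reaches 95% accuracy while catching no spam. Precision, recall, and F1-score on the spam class are more informative.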
*   **Using Class Weights:** Some algorithms (including SVM and some Naïve Bayes implementations) allow you to assign higher weights to the minority class during training, making the model penalize misclassifications of the minority class more heavily.
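*   **Collecting More Data:** If possible, gathering more real examples of the minority class is often the most reliable remedy, because it adds genuine information rather than reweighting or duplicating existing samples.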
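
Class weighting is the simplest of these strategies to try in scikit-learn. In the sketch below, a synthetic dataset from `make_classification` stands in for vectorized emails (an assumption for illustration, with roughly 10% of samples in the spam class); `compute_class_weight` shows the weights that `class_weight="balanced"` applies inside the SVM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in for vectorized emails: about 90% not spam (0), 10% spam (1)
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42
)

# "balanced" weights are n_samples / (n_classes * class_count)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# The same weighting applied during training: errors on spam cost more
weighted_svm = SVC(kernel="linear", class_weight="balanced")
weighted_svm.fit(X, y)
```

The rarer class receives the larger weight, so misclassifying a spam email contributes more to the training loss than misclassifying a legitimate one.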

**4. Evaluating Performance with Suitable Metrics:**

With class imbalance, accuracy is not sufficient. Here are suitable metrics:

*   **Confusion Matrix:** A table summarizing the counts of true positives, true negatives, false positives, and false negatives.
*   **Precision:** Of all emails classified as spam, what proportion were actually spam? (TP / (TP + FP)) - Important for minimizing legitimate emails being marked as spam.
*   **Recall (Sensitivity):** Of all actual spam emails, what proportion were correctly identified as spam? (TP / (TP + FN)) - Important for catching as much spam as possible.
*   **F1-Score:** The harmonic mean of precision and recall, providing a single metric that balances both.
*   **ROC-AUC (Receiver Operating Characteristic - Area Under the Curve):** Measures the ability of the classifier to distinguish between classes. A higher AUC indicates better performance. Because it summarizes the trade-off between true positive rate and false positive rate across all probability thresholds, it does not depend on a single cut-off. Under heavy imbalance, the precision-recall curve is a useful complement.
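
These metrics can all be computed with `sklearn.metrics`. The labels and scores below are made up for illustration (1 = spam, 0 = not spam), and predictions are obtained with an assumed threshold of 0.5.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical ground truth and predicted spam probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1, 0.1, 0.05]

# Turn scores into hard predictions with a 0.5 threshold
y_pred = [1 if score >= 0.5 else 0 for score in y_score]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses scores, not thresholded predictions

print(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} roc_auc={auc:.3f}")
```

In this example the accuracy looks acceptable even though half of the spam is missed, which is exactly the gap that recall exposes.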

**Business Impact of the Solution:**

Implementing an effective spam classification solution has significant business impact:

*   **Increased User Productivity:** Employees spend less time sifting through spam, allowing them to focus on important tasks.
*   **Reduced Security Risks:** Spam emails can contain phishing attempts, malware, or other security threats. Effective filtering reduces the likelihood of employees falling victim to these attacks.
*   **Improved System Performance:** Less spam reduces the load on email servers and network resources.
*   **Enhanced Customer Satisfaction:** For businesses that handle customer communications via email, ensuring legitimate emails are delivered and spam is filtered improves the customer experience.
*   **Cost Savings:** Reduced security incidents and improved productivity can lead to significant cost savings for the company.

By carefully considering these steps, you can build a robust and effective email spam classification system.