# SVM & NAIVE BAYES

Question 1: What is a Support Vector Machine (SVM), and how does it work?
  - Answer: A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. It aims to find the best hyperplane that separates the data into different classes with the maximum margin.
  
  How SVM Works
    1. Hyperplane Selection: SVM selects the hyperplane that maximizes the margin between classes. The margin is the distance between the hyperplane and the nearest data points from each class.
    2. Support Vectors: The data points that lie closest to the hyperplane are called support vectors. These points are crucial in defining the hyperplane and determining the classification boundary.
    3. Kernel Trick: SVM uses the kernel trick to handle non-linearly separable data. A kernel function implicitly maps the data into a higher-dimensional space where it may become linearly separable, allowing SVM to find a hyperplane that separates the classes.
    4. Classification: Once the hyperplane is determined, SVM classifies new data points based on which side of the hyperplane they lie on.
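
  The four steps above can be sketched on a tiny dataset. This is a minimal illustration, assuming scikit-learn is installed; the six points are hypothetical, and a very large `C` is used only to approximate a maximum-margin fit on separable data. With the usual scaling, each class's nearest point lies 1/||w|| from the hyperplane, so the full gap between the classes is 2/||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # gap between the two margin lines

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)

# Classification: the sign of w.x + b decides the predicted class.
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
scores = clf.decision_function(new_points)
print("scores =", scores, "-> classes", clf.predict(new_points))
```

  Only the points closest to the boundary appear in `support_vectors_`; removing any other training point would leave the hyperplane unchanged.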

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.
  - Answer: The main difference between Hard Margin SVM and Soft Margin SVM is whether any training point is allowed to violate the margin.
  
  Hard Margin SVM
  
  •	Strict Separation: Hard Margin SVM requires that all data points be classified correctly and lie on the correct side of the hyperplane.
  
  •	No Misclassifications: It does not allow for any misclassifications, meaning every data point must be on the right side of the decision boundary.
  
  •	Sensitive to Outliers: Hard Margin SVM is sensitive to outliers and noise: a single outlier can shift the hyperplane substantially, and if the classes overlap at all, no valid solution exists.
  
  •	Hard Margin SVM focuses solely on maximizing the margin.
    
  Soft Margin SVM
    
  •	Allows Misclassifications: Soft Margin SVM allows for some misclassifications by introducing slack variables that permit data points to lie on the wrong side of the hyperplane.
    
  •	Trade-off between Margin and Misclassifications: It balances a wide margin against the total amount of margin violation (the sum of the slack variables), with the regularization parameter C setting the trade-off.
     
  •	More Robust: Soft Margin SVM is more robust to outliers and noise, as it can tolerate some errors in classification.
    
  •	Soft Margin SVM balances margin maximization with minimizing misclassifications.
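
  The trade-off can be observed by varying `C` in the soft-margin objective (1/2)||w||^2 + C * (sum of slacks). The sketch below assumes scikit-learn and uses hypothetical overlapping Gaussian clusters; the specific values of `C` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping classes: a hard-margin fit would be infeasible here.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-1.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=+1.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

margins = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Width of the margin band for this value of C.
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin width={margins[C]:.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```

  A small `C` tolerates more violations and yields a wider margin; a large `C` penalizes violations heavily and moves toward hard-margin behavior.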

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
  - Answer: The Kernel Trick is a technique used in Support Vector Machines (SVMs) to handle non-linearly separable data. Instead of explicitly mapping the data into a higher-dimensional space, the Kernel Trick uses a kernel function to compute the dot product of the data points in the higher-dimensional space. This allows SVMs to operate in the higher-dimensional space without explicitly transforming the data.
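
  A small numeric check makes the idea concrete. The sketch below uses a homogeneous degree-2 polynomial kernel (chosen here because its feature map is short enough to write out; it is not the RBF kernel discussed next) and hypothetical vectors `a` and `b`. It shows that the kernel value computed in 2D equals the dot product after an explicit mapping to 3D.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (3D output)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2D space."""
    return float(np.dot(a, b)) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = float(np.dot(phi(a), phi(b)))  # map first, then take the dot product
implicit = poly_kernel(a, b)              # kernel shortcut, no mapping needed
print(explicit, implicit)
```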

  Example: Radial Basis Function (RBF) Kernel

  One popular kernel function is the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel. It is defined as:

  K(x, y) = exp(-γ ||x - y||^2)

  where γ > 0 is a hyperparameter that controls the width of the kernel: a larger γ makes the kernel narrower, so each training point influences a smaller neighborhood.
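
  The formula can be evaluated directly. This sketch uses NumPy only; the helper name `rbf_kernel` and the sample points are hypothetical and chosen for illustration.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

x = [0.0, 0.0]
near = [0.1, 0.0]
far = [2.0, 0.0]

for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma}: K(x, x)={rbf_kernel(x, x, gamma):.4f}, "
          f"K(x, near)={rbf_kernel(x, near, gamma):.4f}, "
          f"K(x, far)={rbf_kernel(x, far, gamma):.4f}")
```

  The kernel value is 1 for identical points and decays toward 0 with distance, faster when γ is large.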

  Use Case: Non-Linear Classification

  The RBF kernel is particularly useful when the data is not linearly separable in the original feature space. It implicitly corresponds to a higher-dimensional feature space in which a separating hyperplane can exist.

  For example, consider a dataset with two features (x1, x2) and two classes (blue and red). The blue points sit near the origin and the red points surround them, so no straight line can separate the classes.

        | x1 | x2 | Class |
        | -- | -- | ----- |
        | 1 | 1 | Red |
        | 1 | -1 | Red |
        | -1 | 1 | Red |
        | -1 | -1 | Red |
        | 0.5 | 0.5 | Blue |
        | 0.5 | -0.5 | Blue |

  Using the RBF kernel, SVM can map the data into a higher-dimensional space where the classes become linearly separable. The SVM model can then learn a decision boundary that separates the classes effectively.
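
  The same effect can be reproduced on a larger circular dataset. This is a sketch assuming scikit-learn; `make_circles` generates synthetic concentric rings, and the sample size, noise level, and split are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner ring (one class) surrounded by an outer ring (the other).
X, y = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel} kernel test accuracy: {scores[kernel]:.3f}")
```

  The linear kernel cannot separate the rings, while the RBF kernel typically separates them almost perfectly.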

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?
  - Answer: A Naïve Bayes Classifier is a type of supervised learning algorithm based on Bayes' theorem. It is used for classification tasks, where the goal is to predict the class label of a new instance based on its features.
  
  The Naïve Bayes Classifier is called "naïve" because it makes a strong assumption about the independence of features. Specifically, it assumes that:
    
    - Features are conditionally independent: Given the class label, the features are independent of each other.
    
    - No interaction within a class: Once the class is known, the presence or absence of one feature carries no additional information about any other feature. (Features may still be correlated across the dataset as a whole.)

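  This assumption is often violated in real-world data, where features can be correlated or dependent on each other. Despite this, Naïve Bayes Classifiers can still perform well in many cases, such as text classification with many features, because classification only requires the correct class to receive the highest score, not accurate probability estimates.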
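
  Under the independence assumption, the score for each class is simply the prior multiplied by one likelihood per feature. The sketch below works through this by hand; all probabilities are hypothetical values chosen only for illustration.

```python
# Hypothetical probabilities, not estimated from real data.
priors = {"spam": 0.4, "ham": 0.6}
p_word_given_class = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

# The email contains "free" but not "meeting".
email = {"free": True, "meeting": False}

scores = {}
for label, prior in priors.items():
    score = prior
    for word, present in email.items():
        p = p_word_given_class[label][word]
        # Naive assumption: multiply per-feature likelihoods independently.
        score *= p if present else (1.0 - p)
    scores[label] = score

# Normalize the scores so they sum to 1 (Bayes' theorem denominator).
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(posterior)
```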

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
  - Answer: The Naïve Bayes algorithm has several variants, each suited for different types of data distributions. The three main variants are Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.
  
  Gaussian Naïve Bayes\
      •	Assumes Continuous Data: Gaussian Naïve Bayes assumes that the features follow a Gaussian (normal) distribution.\
      •	Used for Continuous Features: It is suitable for datasets with continuous features, such as numeric values.
      
      Example Use Cases:\
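      •	Classifying samples from numeric measurements, such as assigning a house to a price band (low, medium, high) from features like area and number of rooms. Naïve Bayes is a classifier, so it predicts categories rather than continuous values.\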
      •	Classifying data with continuous features, such as image classification based on pixel values.

  Multinomial Naïve Bayes\
      •	Assumes Discrete Counts: Multinomial Naïve Bayes is suitable for features that represent discrete counts, such as word frequencies in text data.\
      •	Used for Text Classification: It is commonly used in text classification tasks, such as spam detection and sentiment analysis.
    
      Example Use Cases:\
      •	Classifying text documents into categories like spam or not spam based on word frequencies.\
      •	Sentiment analysis of customer reviews to determine positive or negative sentiment.

  Bernoulli Naïve Bayes\
      •	Assumes Binary Features: Bernoulli Naïve Bayes is suitable for binary features, where each feature is either present or absent.\
      •	Used for Binary Features: It is commonly used in applications where features are represented as binary vectors.
      
      Example Use Cases:\
      •	Text classification with binary features, such as presence or absence of specific words.\
      •	Document classification based on binary features like keyword presence.

  The choice of Naïve Bayes variant depends on the nature of the data and the problem at hand:\
      •	Gaussian Naïve Bayes: Use for continuous features that are roughly Gaussian within each class.\
      •	Multinomial Naïve Bayes: Use for discrete count data, such as word frequencies in text.\
      •	Bernoulli Naïve Bayes: Use for binary features, where each feature is either present or absent.
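
  The three variants share the same scikit-learn interface and differ only in the feature type they expect. The sketch below assumes scikit-learn and uses tiny hypothetical arrays (continuous measurements, word counts, and binary indicators) purely to show which estimator matches which input.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # the same hypothetical labels for all three examples

# Continuous features -> Gaussian Naive Bayes
X_cont = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.2], [4.1, 3.9]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Non-negative counts (e.g. word frequencies) -> Multinomial Naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])

# Binary presence/absence indicators -> Bernoulli Naive Bayes
X_bin = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
bnb_pred = BernoulliNB().fit(X_bin, y).predict([[1, 0, 0]])

print(gnb_pred, mnb_pred, bnb_pred)
```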



In [1]:
# Question 6: Write a Python program to:
# Load the Iris dataset
# Train an SVM Classifier with a linear kernel
# Print the model's accuracy and support vectors.
# (Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_classifier = svm.SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = svm_classifier.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.3f}")

# Print the support vectors
support_vectors = svm_classifier.support_vectors_
print(f"Number of Support Vectors: {len(support_vectors)}")
print("Support Vectors:")
print(support_vectors)


Model Accuracy: 1.000
Number of Support Vectors: 25
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [2]:
# Question 7: Write a Python program to:
# Load the Breast Cancer dataset
# Train a Gaussian Naïve Bayes model
# Print its classification report including precision, recall, and F1-score.
# (Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

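# Standardize the data (optional here: Gaussian NB fits a separate mean and variance per feature, so rescaling has little effect)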
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Gaussian Naïve Bayes model
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb_model.predict(X_test)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



In [3]:
# Question 8: Write a Python program to:
# Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
# Print the best hyperparameters and accuracy.
# (Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameter grid (the kernel is searched too; gamma has no effect on the linear kernel)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear', 'poly']
}

# Initialize the SVM model and GridSearchCV
svm_model = svm.SVC()
grid_search = GridSearchCV(estimator=svm_model, param_grid=param_grid, cv=5)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and estimator
best_params = grid_search.best_params_
best_estimator = grid_search.best_estimator_

# Print the best hyperparameters
print("Best Hyperparameters:", best_params)

# Evaluate the best model on the test set
y_pred = best_estimator.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print("Accuracy:", accuracy)



Best Hyperparameters: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
Accuracy: 1.0
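
The grid above only tries the two preset `gamma` options, and the Wine features are on very different scales, which tends to favor the linear kernel. A variant that searches numeric `gamma` values for an RBF kernel on standardized features could look like the sketch below. It is an alternative under stated assumptions, not the run that produced the output above: the grid values and the stratified split are arbitrary choices, and the step name `svc` is the one `make_pipeline` assigns automatically.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scaling inside the pipeline is refit on each CV fold, avoiding leakage.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
```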


In [4]:
# Question 9: Write a Python program to:
# Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
# Print the model's ROC-AUC score for its predictions.
# (Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelBinarizer

# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize the text data
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train a Multinomial Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_vectorized, y_train)

# Predict class probabilities on the test set (ROC-AUC needs scores, not hard labels)
y_proba = model.predict_proba(X_test_vectorized)

# Calculate the one-vs-rest ROC-AUC score, macro-averaged over the 20 classes
lb = LabelBinarizer()
y_test_binarized = lb.fit_transform(y_test)
roc_auc = roc_auc_score(y_test_binarized, y_proba, multi_class='ovr')

# Print the ROC-AUC score
print(f"ROC-AUC Score: {roc_auc:.3f}")


ROC-AUC Score: 0.958


In [5]:
# Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
# Text with diverse vocabulary
# Potential class imbalance (far more legitimate emails than spam)
# Some incomplete or missing data
# Explain the approach you would take to:
# Preprocess the data (e.g. text vectorization, handling missing data)
# Choose and justify an appropriate model (SVM vs. Naïve Bayes)
# Address class imbalance
# Evaluate the performance of your solution with suitable metrics
# Explain the business impact of your solution.
# (Include your Python code and output in the code box below.)

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score
from imblearn.over_sampling import SMOTE
from collections import Counter

# Load dataset (using 20 Newsgroups as a proxy for email classification)
categories = ['alt.atheism', 'talk.religion.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize text data
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Address class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_vectorized, y_train)

# Print class distribution before and after resampling
print("Original class distribution:", Counter(y_train))
print("Resampled class distribution:", Counter(y_train_resampled))

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred = model.predict(X_test_vectorized)

# Evaluate model performance
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred, average='weighted'))



Original class distribution: Counter({np.int64(0): 639, np.int64(1): 502})
Resampled class distribution: Counter({np.int64(1): 639, np.int64(0): 639})
Classification Report:
              precision    recall  f1-score   support

           0       0.72      0.89      0.80       160
           1       0.80      0.57      0.67       126

    accuracy                           0.75       286
   macro avg       0.76      0.73      0.73       286
weighted avg       0.76      0.75      0.74       286

F1-score: 0.7400015714622455


Approach to Email Classification

For Preprocessing the Data
1. Text Vectorization: Use techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings (e.g., Word2Vec, GloVe) to convert text into numerical vectors that can be processed by machine learning algorithms.
2. Handling Missing Data: For missing subject lines or body content, consider imputing with a placeholder or using the available content for classification. For other features, imputation or removal might be necessary based on the extent of missingness.
3. Data Cleaning: Remove stop words, punctuation, and special characters to reduce noise and improve model performance.
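
The preprocessing steps above can be sketched as follows. This assumes scikit-learn; the three emails, the `subject`/`body` field names, and the helper `to_text` are hypothetical and only illustrate one way to handle missing fields before vectorization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw emails; None marks a missing subject or body.
raw_emails = [
    {"subject": "Win a FREE prize now", "body": "Click here to claim your reward!!!"},
    {"subject": None, "body": "Agenda for the project meeting on Monday."},
    {"subject": "Lunch?", "body": None},
]

def to_text(email):
    """Join subject and body, treating missing fields as empty strings."""
    parts = [email.get("subject") or "", email.get("body") or ""]
    return " ".join(parts).strip()

texts = [to_text(e) for e in raw_emails]

# The default tokenizer lowercases text and drops punctuation;
# stop_words="english" removes common English stop words.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(texts)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```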

Appropriate Model
1. Naïve Bayes: Given the nature of text data and the potential for a large number of features (words), Naïve Bayes can be a strong baseline. It's particularly effective for text classification tasks due to its simplicity and efficiency.
2. SVM: Support Vector Machines can also be effective, especially with the right kernel (e.g., linear or RBF). SVMs are robust to high-dimensional data and can find complex decision boundaries.

Justification:
- Naïve Bayes is often preferred for text classification due to its simplicity, interpretability, and efficiency. It handles high-dimensional data well and is less prone to overfitting.
- SVM can be a good alternative if the dataset is not extremely large, and there's a need for more complex decision boundaries. However, it might require more tuning and computational resources.

Addressing Class Imbalance
1. Resampling Techniques: Use oversampling the minority class (spam), undersampling the majority class (not spam), or synthetic sampling methods like SMOTE (Synthetic Minority Over-sampling Technique).
2. Class Weights: Adjust class weights in the model to give more importance to the minority class. In scikit-learn, SVM classifiers accept a class_weight parameter; Naïve Bayes classifiers such as MultinomialNB do not, but they accept a class_prior parameter and per-sample weights in fit.
3. Evaluation Metrics: Focus on metrics that are robust to class imbalance, such as F1-score, precision, recall, and AUC-ROC, rather than accuracy alone.
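
As an alternative to resampling, class weighting can be sketched like this. It assumes scikit-learn; the 90/10 label split is hypothetical, and `LinearSVC` is one possible SVM choice for sparse text features rather than the model used in the code above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 legitimate emails (0) and 10 spam emails (1).
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_per_class).
classes = np.array([0, 1])
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))

# Passing class_weight="balanced" applies the same weighting during training,
# so errors on the rare spam class cost more than errors on the common class.
weighted_svm = LinearSVC(class_weight="balanced")
```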

Evaluating the Performance
1. Precision: Measures the proportion of true positives among all positive predictions made by the model.
2. Recall: Measures the proportion of true positives among all actual positive instances.
3. F1-score: The harmonic mean of precision and recall, providing a balanced measure of both.
4. AUC-ROC: Measures the model's ability to distinguish between classes, with higher values indicating better performance.
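
A short worked example shows why accuracy alone can mislead on imbalanced data. The confusion-matrix counts below are hypothetical, with spam treated as the positive class.

```python
# Hypothetical counts for 1,000 emails (60 spam, 940 legitimate).
tp, fp, fn, tn = 40, 10, 20, 930

precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Accuracy is 0.97 even though a third of the spam is missed, which is why precision, recall, and F1 give a clearer picture here.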

Business Impact
1. Reduced Manual Effort: Automating email classification reduces the need for manual sorting, saving time and resources.
2. Improved User Experience: By accurately filtering out spam, users receive fewer unwanted emails, enhancing their experience and productivity.
3. Security: Effective spam filtering can reduce the risk of phishing attacks and malware distribution through emails.
4. Cost Savings: Reducing the volume of spam can decrease the costs associated with storage, bandwidth, and support.
