# Question and Answer

# 1. What is a Support Vector Machine (SVM), and how does it work?

  >A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. For classification, it finds the hyperplane that separates the classes with the largest margin, that is, the largest distance to the closest training points of each class (the support vectors). Only these support vectors determine the position of the hyperplane, and a wide margin tends to make predictions more robust on unseen data.
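
To make the hyperplane and margin concrete, here is a minimal sketch on a made-up, linearly separable 2-D dataset (the points are an assumption chosen for illustration, not data from this notebook). It reads the weight vector `w` and intercept `b` from a fitted linear `SVC` and computes the margin width as `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D points (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# For a linear kernel the separating hyperplane is w . x + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# The margin is the distance between the two supporting hyperplanes
# w . x + b = +1 and w . x + b = -1, which equals 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)
```

Only the points listed under `support_vectors_` touch the margin; moving any other point slightly should leave the hyperplane unchanged.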

# 2. Explain the difference between Hard Margin and Soft Margin SVM.

  >Hard Margin SVM is a type of Support Vector Machine that assumes the data is linearly separable and aims to find a hyperplane that perfectly separates the classes without any misclassification, enforcing strict boundaries.

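  >Soft Margin SVM allows some training points to fall inside the margin or on the wrong side of the hyperplane by adding a penalty for each violation. A regularization parameter (commonly written C) controls the trade-off: a large C penalizes violations heavily and approaches hard-margin behavior, while a small C tolerates more violations in exchange for a wider margin. This makes it suitable for noisy or overlapping classes and usually improves generalization.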
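
The trade-off between a wide margin and few violations can be seen in a small sketch. The data below is synthetic (two overlapping Gaussian blobs, an assumption for illustration), and the two values of the penalty parameter `C` are arbitrary. A small `C` tolerates many margin violations, so more points end up as support vectors; a large `C` behaves closer to a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (synthetic data, made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
               rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

n_support = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```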

# 3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

  >The Kernel Trick in Support Vector Machines (SVM) is a mathematical technique that enables the algorithm to operate in a high-dimensional feature space without explicitly computing the coordinates of the data in that space. Instead, a kernel function computes the inner product between two points as if they had already been mapped into that space, which lets SVM learn non-linear decision boundaries.

  >An example is the Radial Basis Function (RBF) kernel, which is commonly used when the relationship between class labels and features is non-linear. It maps input features into an infinite-dimensional space, making it effective for complex decision boundaries in tasks like image classification or bioinformatics.
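
As a rough illustration of what the RBF kernel computes, the sketch below evaluates `K(x, z) = exp(-gamma * ||x - z||^2)` by hand and compares it with scikit-learn's `rbf_kernel`. The helper name `rbf_kernel_manual`, the three points, and `gamma = 0.5` are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), computed directly."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Three made-up points in 2-D
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5

K_manual = rbf_kernel_manual(X, X, gamma)
K_sklearn = rbf_kernel(X, X, gamma=gamma)

print(np.round(K_manual, 4))
print("matches scikit-learn:", np.allclose(K_manual, K_sklearn))
```

Each entry is a similarity between 0 and 1 that decays with distance. The SVM only ever needs this matrix, never the coordinates in the implicit feature space.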

# 4. What is a Naïve Bayes Classifier, and why is it called “naïve”?

  >A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem, used for classification tasks. It assumes that the features in a dataset are conditionally independent given the class label, which simplifies computation and enables efficient prediction.

  >It is called “naïve” because of this strong and often unrealistic assumption of feature independence, which rarely holds true in real-world data but still yields surprisingly effective results.
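
The independence assumption reduces the posterior to a prior multiplied by one likelihood per feature. The sketch below works through that arithmetic in plain Python; every probability in it is a hypothetical number chosen only to illustrate the calculation.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.4},
}
observed_words = ["free", "meeting"]

# Unnormalized score: P(class) * product of P(word | class).
# Multiplying per-word likelihoods is exactly the "naive" independence assumption.
scores = {}
for label, prior in priors.items():
    score = prior
    for word in observed_words:
        score *= likelihoods[label][word]
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```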

# 5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

  >Gaussian Naïve Bayes is a variant of the Naïve Bayes classifier that assumes continuous features follow a normal (Gaussian) distribution. It is best used when the input features are real-valued and continuous, such as in medical data or sensor readings.

  >Multinomial Naïve Bayes is designed for discrete count data and assumes features represent the frequency of events. It is commonly used in text classification tasks like spam detection or document categorization, where features are word counts or term frequencies.

  >Bernoulli Naïve Bayes models binary/boolean features, assuming each feature is either present or absent. It is suitable for tasks like sentiment analysis or email classification where features indicate the presence or absence of specific terms.
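
The three variants differ mainly in the feature type they expect. The sketch below fits each one on a tiny made-up dataset of the matching type (continuous values, counts, and binary indicators); the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.0, 2.1], [1.2, 1.9], [3.9, 5.0], [4.1, 5.2]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Word present / absent -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

examples = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

train_predictions = {}
for name, (model, X) in examples.items():
    model.fit(X, y)
    train_predictions[name] = model.predict(X)
    print(name, "->", train_predictions[name])
```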

  

# 6. Write a Python program to:
# Load the Iris dataset
# Train an SVM Classifier with a linear kernel
# Print the model's accuracy and support vectors.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict and calculate accuracy
y_pred = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print accuracy and support vectors
print(f"Model Accuracy: {accuracy:.2f}")
print("Support Vectors:")
print(svm_model.support_vectors_)


Model Accuracy: 1.00
Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]



# 7. Write a Python program to:
# Load the Breast Cancer dataset
# Train a Gaussian Naïve Bayes model
# Print its classification report including precision, recall, and F1-score.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict on test data
y_pred = gnb.predict(X_test)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



# 8. Write a Python program to:
# Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
# Print the best hyperparameters and accuracy.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for C and gamma
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']  # Using RBF kernel for non-linear classification
}

# Perform GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best parameters and evaluate accuracy
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Hyperparameters:", best_params)
print(f"Model Accuracy: {accuracy:.2f}")


Best Hyperparameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
Model Accuracy: 0.78
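
One possible reason for the modest accuracy above is that the Wine features are on very different scales, and the RBF kernel is distance-based, so large-valued features can dominate. A common remedy, sketched below as an assumption rather than as part of the original answer, is to standardize features inside a `Pipeline` so that scaling is re-fit within each cross-validation fold. The step names `scaler` and `svc` are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Scaling lives inside the pipeline, so it is fit on training folds only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

scaled_accuracy = grid_search.score(X_test, y_test)
print("Best Hyperparameters:", grid_search.best_params_)
print(f"Model Accuracy (with scaling): {scaled_accuracy:.2f}")
```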


# 9. Write a Python program to:
# Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
# Print the model's ROC-AUC score for its predictions.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# Load a subset of the 20 Newsgroups dataset (binary classification for ROC-AUC)
categories = ['sci.space', 'rec.sport.baseball']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

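# Feature extraction using TF-IDF
# Note: fitting the vectorizer on the full corpus before splitting lets test-set
# vocabulary statistics influence the features; fit on the training split only
# for a stricter evaluation.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups.data)
y = newsgroups.target

# With two categories the targets are already 0/1; label_binarize only reshapes
# them into a column vector, which is why .ravel() is used when fitting below
y_binary = label_binarize(y, classes=[0, 1])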

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Train Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train.ravel())

# Predict probabilities
y_proba = nb_model.predict_proba(X_test)[:, 1]

# Compute ROC-AUC score
roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.2f}")


ROC-AUC Score: 1.00


# 10. Imagine you’re working as a data scientist for a company that handles email communications.
# Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
# Text with diverse vocabulary
# Potential class imbalance (far more legitimate emails than spam)
# Some incomplete or missing data
# Explain the approach you would take to:
# Preprocess the data (e.g. text vectorization, handling missing data)
# Choose and justify an appropriate model (SVM vs. Naïve Bayes)
# Address class imbalance
# Evaluate the performance of your solution with suitable metrics
# And explain the business impact of your solution.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import label_binarize
import numpy as np

# Load a proxy dataset: two newsgroups stand in for spam vs. legitimate email
categories = ['rec.sport.hockey', 'talk.politics.misc']  # Simulating spam vs. legit
data = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

# Handle missing data
emails = [text if text else "no content" for text in data.data]

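# Vectorize text using TF-IDF
# Note: the vectorizer is fit on all emails before the split; for a stricter
# evaluation, fit it on the training split only.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = data.target

# With two categories the targets are already 0/1; label_binarize only reshapes
# them into a column vector, which is why .ravel() is used when fitting below
y_binary = label_binarize(y, classes=[0, 1])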

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train, y_train.ravel())

# Predict and evaluate
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))

roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC Score: {roc_auc:.2f}")


Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.99      0.95       309
           1       0.98      0.88      0.93       224

    accuracy                           0.94       533
   macro avg       0.95      0.93      0.94       533
weighted avg       0.95      0.94      0.94       533

ROC-AUC Score: 1.00
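
The cell above uses two newsgroups of similar size, so it does not exercise the class-imbalance part of the question. A minimal sketch of one way to handle it follows, under stated assumptions: the toy corpus and labels are hypothetical, `ComplementNB` is swapped in for `MultinomialNB` because it was designed for imbalanced text data, and average precision (PR-AUC) is reported because it focuses on the rare spam class. For an SVM, `class_weight="balanced"` would be a comparable option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = spam, 0 = legitimate; None marks a missing body
train_emails = [
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow?",
    None,
    "quarterly budget update",
    "project deadline moved to friday",
    "win a free prize now",
    "free money click here",
]
train_labels = [0, 0, 0, 0, 0, 0, 1, 1]

test_emails = [
    "free prize inside",
    "agenda for the budget meeting",
    "click here for free money",
    "report deadline friday",
]
test_labels = [1, 0, 1, 0]

def clean(texts):
    """Replace missing or empty bodies with an empty string."""
    return [text if text else "" for text in texts]

# Vectorizer and classifier are fit together on the training emails only
spam_filter = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", ComplementNB()),
])
spam_filter.fit(clean(train_emails), train_labels)

predictions = spam_filter.predict(clean(test_emails))
spam_scores = spam_filter.predict_proba(clean(test_emails))[:, 1]
pr_auc = average_precision_score(test_labels, spam_scores)

print("predictions:", predictions.tolist())
print("PR-AUC (average precision):", pr_auc)
```

In a real deployment, the decision threshold on `spam_scores` would be tuned to the business cost of each error type, since a legitimate email sent to the spam folder is usually more costly than a spam message reaching the inbox.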
