###Question 1: What is a Support Vector Machine (SVM), and how does it work?

Ans: A Support Vector Machine (SVM) is a supervised machine learning algorithm used primarily for classification and, in some cases, regression. It is particularly effective when the data is high-dimensional or not linearly separable.

How it works:

1.	Identify the Optimal Hyperplane:
o	SVM searches for the hyperplane that maximizes the margin between classes.
o	This ensures better separation and generalization to unseen data.
2.	Use of Support Vectors:
o	Support vectors are the training points that lie closest to the hyperplane (on or inside the margin).
o	Only these points influence the position of the hyperplane; they define the margin and the decision boundary.
3.	Handling Non-Linearly Separable Data:
o	SVM uses kernel functions (e.g., polynomial, RBF) to implicitly map data into a higher-dimensional space where a linear separator is possible.
4.	Regularization (C Parameter):
o	Controls the trade-off between maximizing the margin and minimizing classification error.
o	A higher C penalizes misclassifications more strictly, which tends to give a narrower margin; a lower C tolerates more violations in exchange for a wider margin.
5.	Loss Function (Hinge Loss):
o	Penalizes misclassified points and points that fall inside the margin.
o	Combined with the regularization term, it forms the optimization objective.
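The objective described in points 4 and 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not how SVM libraries are implemented: the toy points, the hand-picked hyperplane (w, b), and the function names hinge_loss and svm_objective are all assumptions made for this example.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss for one sample; the label y must be +1 or -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)


def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * (sum of hinge losses)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_loss = sum(hinge_loss(w, b, xi, yi) for xi, yi in zip(X, y))
    return regularizer + C * total_loss


# Hypothetical 2-D toy data and a hand-picked (not optimized) hyperplane
X = [[2.0, 2.0], [0.5, 0.0], [-2.0, -1.0], [0.5, 0.5]]
y = [1, 1, -1, -1]
w, b = [1.0, 1.0], 0.0

losses = [hinge_loss(w, b, xi, yi) for xi, yi in zip(X, y)]
print("Per-sample hinge losses:", losses)
print("Objective with C=1: ", svm_objective(w, b, X, y, C=1.0))
print("Objective with C=10:", svm_objective(w, b, X, y, C=10.0))
```

With this toy data, the second point is correctly classified but lies inside the margin (loss 0.5), and the fourth point is misclassified (loss 2.0). Raising C from 1 to 10 multiplies the weight of those violations in the objective, which is why a higher C pushes the optimizer toward fewer violations.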


###Question 2: Explain the difference between Hard Margin and Soft Margin SVM.
Ans:
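Hard Margin
Objective: maximize the margin.
Data requirement: sensitive to noise and outliers; requires perfectly linearly separable data.
Regularization: not applicable; there is no regularization parameter.
Computation: simple and computationally efficient.

Soft Margin
Objective: maximize the margin while minimizing margin violations.
Data requirement: robust; handles noisy data by allowing some margin violations.
Regularization: controlled by the regularization parameter C.
Computation: may require more computational resources.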
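
For reference, the two variants are usually written as the optimization problems below. The notation is the standard textbook one and is an assumption here, since the answer above does not define symbols: w is the weight vector, b the bias, y_i the label (+1 or -1) of sample x_i, and ξ_i the slack variable that measures how far sample i violates the margin.

```latex
\begin{align*}
\text{Hard margin:}\quad & \min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2
  && \text{s.t.}\ \ y_i\,(w^\top x_i + b) \ge 1 \quad \forall i \\
\text{Soft margin:}\quad & \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
  && \text{s.t.}\ \ y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \quad \forall i
\end{align*}
```

The hard-margin problem has no feasible solution when the classes overlap, which is why it requires perfectly separable data; the slack variables in the soft-margin problem remove that restriction at a cost weighted by C.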




###Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
Ans:
The Kernel Trick is a mathematical technique used in Support Vector Machines (SVMs) to handle non-linearly separable data by implicitly mapping it into a higher-dimensional space where it becomes linearly separable, without ever computing that transformation explicitly. Instead, a kernel function K(x, z) returns the inner product of the two mapped points directly from the original inputs.

Example kernel: the RBF (Gaussian) kernel, K(x, z) = exp(-gamma * ||x - z||^2).
Use case: classifying handwritten digits (such as the MNIST dataset). The digit images are not linearly separable in pixel space, but with an RBF kernel the SVM can find a nonlinear decision boundary that separates the digits effectively.
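
The "implicit mapping" idea can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel, because its feature map is small enough to write out by hand, and also defines an RBF kernel for comparison. The function names, the sample points, and the gamma value are illustrative assumptions.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x . z)^2, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def explicit_map(x):
    """Explicit 3-D feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) that matches poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


x, z = (1.0, 2.0), (3.0, 0.5)

# Same inner product, obtained with and without building the 3-D vectors
via_kernel = poly2_kernel(x, z)
via_mapping = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
print("Kernel value:      ", via_kernel)
print("Explicit mapping:  ", via_mapping)

# RBF similarity is 1 for identical points and decays with distance
print("RBF(x, x):", rbf_kernel(x, x))
print("RBF(x, z):", rbf_kernel(x, z))
```

The two polynomial results agree up to floating-point rounding, which is the point of the trick: the kernel gives the higher-dimensional inner product without constructing the mapped vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all.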

###Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Ans:

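A Naïve Bayes Classifier is a probabilistic machine learning algorithm based on Bayes’ Theorem.
It is commonly used for classification tasks such as spam detection, sentiment analysis, and text categorization.

It predicts the class with the highest posterior probability, computed from the class priors and the per-feature likelihoods estimated from the training data.

It is called “naïve” because it assumes that all features are conditionally independent of one another given the class. This assumption rarely holds exactly in real data, but it keeps the model simple and fast, and it often works well in practice.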
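
As a small worked illustration of this calculation, the sketch below applies Bayes’ Theorem with the independence assumption to a two-word vocabulary. All probabilities are made-up values standing in for estimates from a training set, and the names priors, likelihoods, and posterior are hypothetical.

```python
# Hypothetical probabilities, as if estimated from a labeled training set
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.05, "meeting": 0.20},
}


def posterior(words):
    """Return P(class | words), treating the words as independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # multiply the per-word likelihoods
        scores[label] = score
    total = sum(scores.values())  # the evidence P(words), used to normalize
    return {label: score / total for label, score in scores.items()}


for message in (["free"], ["free", "meeting"]):
    result = posterior(message)
    predicted = max(result, key=result.get)
    print(message, "->", predicted, result)
```

With these numbers, a message containing only "free" is classified as spam, while adding "meeting" tips the posterior toward ham, because each word contributes its own independent factor to the product.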


###Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Ans:
Gaussian Naïve Bayes
Assumes that features follow a normal (Gaussian) distribution within each class.
It models the likelihood of continuous features using the probability density function (PDF) of the normal distribution.

Multinomial Naïve Bayes
Assumes that features represent discrete counts (such as the number of times a word appears).
The likelihood is modeled using a multinomial distribution.

Bernoulli Naïve Bayes
Assumes that features are binary (0 or 1), indicating the presence or absence of a feature.
The likelihood of each binary feature is modeled using a Bernoulli distribution.

When to use each:
Gaussian Naïve Bayes → for continuous data,
Multinomial Naïve Bayes → for count-based data,
Bernoulli Naïve Bayes → for binary presence/absence data.
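
The three variants differ only in how the per-class feature likelihood is computed. The sketch below writes out one likelihood function for each variant in plain Python; the function names and all parameter values are assumptions chosen for illustration, and real implementations also add smoothing, which is omitted here.

```python
import math


def gaussian_likelihood(x, mean, var):
    """Gaussian NB: normal density of a continuous feature value, given the class mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def multinomial_log_likelihood(counts, probs):
    """Multinomial NB: sum of count * log(p) over features (the constant multinomial coefficient is dropped)."""
    return sum(c * math.log(p) for c, p in zip(counts, probs))


def bernoulli_likelihood(x, probs):
    """Bernoulli NB: product of p when the feature is present (1) and (1 - p) when it is absent (0)."""
    result = 1.0
    for xi, p in zip(x, probs):
        result *= p if xi == 1 else (1.0 - p)
    return result


# Continuous measurement, e.g. a length of 5.0 in a class with mean 5.0 and variance 1.0
print(gaussian_likelihood(5.0, mean=5.0, var=1.0))

# Word counts for a 3-word vocabulary and that class's word probabilities
print(multinomial_log_likelihood([2, 0, 1], [0.5, 0.3, 0.2]))

# Presence/absence of 2 words and that class's presence probabilities
print(bernoulli_likelihood([1, 0], [0.8, 0.3]))
```

Note that the Bernoulli variant explicitly penalizes the absence of a feature through the (1 - p) factor, whereas the multinomial variant simply ignores features whose count is zero.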

###Question 6:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
 Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.

In [1]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = datasets.load_iris()
X = iris.data      # Features
y = iris.target    # Target labels

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and train the SVM classifier with a linear kernel
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# 4. Make predictions
y_pred = svm_model.predict(X_test)

# 5. Calculate and print model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", round(accuracy * 100, 2), "%")

# 6. Print the support vectors
print("\nSupport Vectors:\n", svm_model.support_vectors_)
print("\nNumber of Support Vectors for Each Class:", svm_model.n_support_)


Model Accuracy: 100.0 %

Support Vectors:
 [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]

Number of Support Vectors for Each Class: [ 3 11 11]


###Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.


In [2]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# 1. Load the Breast Cancer dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data      # Features
y = breast_cancer.target    # Target labels

# 2. Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and train the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Make predictions
y_pred = gnb.predict(X_test)

# 5. Print classification report (precision, recall, F1-score)
print("Classification Report for Gaussian Naïve Bayes:\n")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Classification Report for Gaussian Naïve Bayes:

              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



###Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.


In [3]:
# Import required libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load the Wine dataset
wine = datasets.load_wine()
X = wine.data      # Features
y = wine.target    # Labels

# 2. Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Define the SVM model
svm_model = SVC()

# 4. Define the hyperparameter grid for GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf']  # RBF kernel, whose width is controlled by gamma
}

# 5. Initialize GridSearchCV
grid_search = GridSearchCV(svm_model, param_grid, refit=True, cv=5, verbose=1)

# 6. Fit the model using GridSearchCV
grid_search.fit(X_train, y_train)

# 7. Make predictions on the test set
y_pred = grid_search.predict(X_test)

# 8. Print the best hyperparameters and accuracy
print("\nBest Hyperparameters found by GridSearchCV:")
print(grid_search.best_params_)

accuracy = accuracy_score(y_test, y_pred)
print("\nTest Set Accuracy:", round(accuracy * 100, 2), "%")


Fitting 5 folds for each of 16 candidates, totalling 80 fits

Best Hyperparameters found by GridSearchCV:
{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}

Test Set Accuracy: 83.33 %


###Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
(Include your Python code and output in the code box below.)

In [4]:

# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# 1. Load the 20 Newsgroups dataset (using a few categories for simplicity)
categories = ['sci.space', 'rec.sport.hockey', 'comp.graphics', 'talk.politics.misc']
newsgroups = fetch_20newsgroups(subset='all', categories=categories, remove=('headers', 'footers', 'quotes'))

X = newsgroups.data   # Text data
y = newsgroups.target # Target labels

# 2. Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Convert text to numerical feature vectors using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# 4. Train a Multinomial Naïve Bayes classifier
nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)

# 5. Predict probabilities for ROC-AUC calculation
y_prob = nb_model.predict_proba(X_test_tfidf)

# 6. Compute ROC-AUC score (for multiclass classification)
# We binarize the labels for multi-class ROC-AUC
y_test_binarized = label_binarize(y_test, classes=range(len(categories)))
roc_auc = roc_auc_score(y_test_binarized, y_prob, average='macro', multi_class='ovr')

# 7. Print the ROC-AUC score
print("ROC-AUC Score for Multinomial Naïve Bayes Model:", round(roc_auc, 4))


ROC-AUC Score for Multinomial Naïve Bayes Model: 0.9826


###Question 10: Imagine you’re working as a data scientist for a company that handles
email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may
contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

Ans:

Approach to Spam Email Classification
1. Data Preprocessing:
Handle missing text → remove the email or replace the missing field with a placeholder such as "unknown".
Clean text: lowercase, remove punctuation, stopwords, and URLs.
Apply stemming/lemmatization to normalize words.
Use TF-IDF vectorization (with n-grams) to convert the cleaned text into numerical form.
________________________________________
2. Model Choice:
Use Multinomial Naïve Bayes as the baseline → fast, works well for text and word frequencies.
An SVM (for example with a linear kernel) can be tested afterwards for potentially higher accuracy, at the cost of slower training.
________________________________________
3. Handling Class Imbalance:
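Use SMOTE (oversampling) on the training set only, or class_weight='balanced' if an SVM is used (Multinomial Naïve Bayes has no class_weight option, but its class priors can be adjusted instead).
Adjust the decision threshold to improve spam recall.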
________________________________________
4. Evaluation Metrics:
Use Precision, Recall, F1-score, and ROC-AUC (not just accuracy).
Aim for high recall (catch more spam) while maintaining good precision.
________________________________________
5. Business Impact:
Filters spam automatically → saves time and boosts productivity.
Reduces phishing risks and builds user trust.
Keeps communication efficient and secure.
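
The threshold adjustment from step 3 and the metrics from step 4 can be illustrated together. The sketch below computes precision, recall, and F1 for the spam class at two thresholds; the labels and predicted probabilities are made-up values (deliberately imbalanced), and precision_recall_f1 is a hypothetical helper written for this example rather than a library function.

```python
def precision_recall_f1(y_true, y_score, threshold):
    """Precision, recall and F1 for the positive (spam = 1) class at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels (1 = spam) and predicted spam probabilities: 7 legitimate, 3 spam
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.70, 0.90]

for threshold in (0.5, 0.4):
    p, r, f1 = precision_recall_f1(y_true, y_score, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.4 catches the one spam email that was missed, raising recall, but it also flags one more legitimate email, lowering precision. That trade-off is the reason to report precision and recall separately instead of relying on accuracy alone.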

