# Theory Questions

Question 1:  What is a Support Vector Machine (SVM), and how does it work?

Answer: A Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. It works by finding the boundary, called a hyperplane, that best separates the classes in the data. The best hyperplane is the one that maximizes the margin, which is the distance between the boundary and the nearest training points of each class (the support vectors). A larger margin generally helps the model generalize to new data points.
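
As a rough illustration of the margin idea, the sketch below fits a linear SVM on a small made-up 2D dataset and computes the margin width as 2 / ||w||. The data points and the choice of C=1000 are assumptions made only for this example.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A large C leaves almost no room for margin violations on separable data.
model = SVC(kernel="linear", C=1000.0)
model.fit(X, y)

w = model.coef_[0]        # normal vector of the separating hyperplane
b = model.intercept_[0]   # offset of the hyperplane

# For a linear SVM, the distance between the two margin boundaries is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

print("w:", w, "b:", b)
print("Margin width:", margin_width)
print("Support vectors:\n", model.support_vectors_)
```

The support vectors printed here are the points closest to the boundary; the remaining points could be removed without changing the fitted hyperplane.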

Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

Answer: In Hard Margin SVM, we assume the data is perfectly linearly separable and require every training point to lie on the correct side of the margin, so no misclassification is allowed. In Soft Margin SVM, we allow some points to violate the margin or be misclassified, with a penalty controlled by the regularization parameter C. This lets the model handle noisy or overlapping data, which makes Soft Margin more practical in real-world scenarios.
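
The effect of C can be sketched with scikit-learn, which has no separate hard-margin mode; a very large C is used here as an approximation of one. The toy data and the two C values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two separable clusters in 2D.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])


def count_support_vectors(C):
    """Fit a linear SVM with the given C and return how many support vectors it uses."""
    model = SVC(kernel="linear", C=C).fit(X, y)
    return len(model.support_)


# Small C: margin violations are cheap, so the margin is wide and soft.
n_soft = count_support_vectors(0.01)
# Large C: violations are expensive, which approximates a hard margin.
n_hard = count_support_vectors(1000.0)

print("Support vectors with C=0.01:", n_soft)
print("Support vectors with C=1000:", n_hard)
```

A smaller C typically pulls more points inside the margin, so more of them become support vectors.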

Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

Answer: The Kernel Trick lets an SVM separate data as if it had been mapped into a higher-dimensional space, where a hyperplane can separate it more easily. Instead of computing this transformation explicitly, a kernel function returns the inner product of two points in that space directly, which is all the SVM needs. For example, the RBF (Radial Basis Function) kernel is useful when data is not linearly separable: it measures similarity by distance, so it can produce smooth, curved decision boundaries around clusters of points.
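
To make the idea concrete, the sketch below uses a degree-2 polynomial kernel in 2D, where the explicit feature map is small enough to write out, and checks that the kernel value equals the inner product in the mapped space. The function names and the gamma value are hypothetical choices for this example. The RBF kernel is included for comparison; its feature space is infinite-dimensional, so only the kernel form is practical.

```python
import numpy as np


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel k(x, z) = (x . z)^2, computed in the input space."""
    return float(np.dot(x, z) ** 2)


def poly2_feature_map(x):
    """Explicit feature map for the same kernel in 2D: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


def rbf_kernel_value(x, z, gamma=0.5):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

kernel_value = poly2_kernel(x, z)
explicit_value = float(np.dot(poly2_feature_map(x), poly2_feature_map(z)))
rbf_value = rbf_kernel_value(x, z)

# Both values equal 121 up to floating-point rounding.
print("Kernel in input space:", kernel_value)
print("Inner product after explicit mapping:", explicit_value)
print("RBF kernel value:", rbf_value)
```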

Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

Answer: A Naïve Bayes Classifier is a probabilistic algorithm based on Bayes’ theorem that is mainly used for classification tasks. It is called “naïve” because it assumes all features are conditionally independent of each other given the class, which is rarely true in real life. Still, this simple assumption makes the model fast and effective, especially for text classification problems.
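
The independence assumption means the posterior is proportional to the class prior multiplied by one likelihood per feature. A minimal sketch of that calculation follows; every probability in it is a made-up number for illustration, not an estimate from real data.

```python
# All probabilities below are made-up numbers for illustration only.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}


def naive_bayes_posterior(words):
    """Return P(class | words), assuming words are independent given the class."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            # The "naive" step: multiply per-word likelihoods as if independent.
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes_posterior(["free", "meeting"])
print(posterior)
```

Here the unnormalized scores are 0.4 × 0.8 × 0.1 = 0.032 for spam and 0.6 × 0.1 × 0.5 = 0.030 for ham, so spam is slightly more probable.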

Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Answer:
* Gaussian Naïve Bayes is used when the features are continuous and roughly normally distributed within each class, like height or weight data.
* Multinomial Naïve Bayes works well with discrete counts, such as word frequencies in text classification.
* Bernoulli Naïve Bayes is suitable when features are binary, like whether a word is present or absent in an email for spam detection.
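
The three variants map to three scikit-learn classes. The sketch below pairs each one with the kind of feature it expects; the tiny datasets are hypothetical and exist only to show the input types.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features (e.g. height in cm, weight in kg) -> GaussianNB
X_continuous = np.array([[150.0, 50.0], [160.0, 55.0], [180.0, 80.0], [190.0, 90.0]])
gaussian_pred = GaussianNB().fit(X_continuous, y).predict([[155.0, 52.0]])

# Count features (e.g. how often each of three words occurs) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 3, 1]])
multinomial_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])

# Binary features (e.g. whether each of three words is present) -> BernoulliNB
X_binary = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
bernoulli_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0, 0]])

print("GaussianNB prediction:", gaussian_pred[0])
print("MultinomialNB prediction:", multinomial_pred[0])
print("BernoulliNB prediction:", bernoulli_pred[0])
```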

Dataset Info:
* You can use any suitable dataset, such as Iris, Breast Cancer, or Wine from sklearn.datasets, or a CSV file of your own.

# Practical Questions

Question 6:   Write a Python program to:
* Load the Iris dataset
* Train an SVM Classifier with a linear kernel
* Print the model's accuracy and support vectors.
* (Include your Python code and output in the code box below.)

In [11]:
# Answer
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the Iris data into a DataFrame with named feature columns
iris = datasets.load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Split features and labels, holding out 30% of the rows for testing
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a linear-kernel SVM and predict on the held-out set
model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Support Vectors:", model.support_vectors_)


Accuracy: 1.0
Support Vectors: [[5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [4.8 3.4 1.9 0.2]
 [6.  3.4 4.5 1.6]
 [5.7 2.8 4.5 1.3]
 [6.  2.7 5.1 1.6]
 [6.9 3.1 4.9 1.5]
 [5.9 3.2 4.8 1.8]
 [4.9 2.4 3.3 1. ]
 [6.1 2.9 4.7 1.4]
 [6.7 3.1 4.7 1.5]
 [6.2 2.2 4.5 1.5]
 [6.3 2.5 4.9 1.5]
 [6.2 2.8 4.8 1.8]
 [6.3 2.7 4.9 1.8]
 [6.1 3.  4.9 1.8]
 [6.5 3.2 5.1 2. ]
 [6.  3.  4.8 1.8]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]
 [7.2 3.  5.8 1.6]
 [6.3 2.8 5.1 1.5]]


Question 7:  Write a Python program to:
* Load the Breast Cancer dataset
* Train a Gaussian Naïve Bayes model
* Print its classification report including precision, recall, and F1-score.
* (Include your Python code and output in the code box below.)

In [12]:
# Answer
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
import pandas as pd

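# Load the Breast Cancer data into a DataFrame with named feature columns
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

# Split features and labels, holding out 30% of the rows for testing
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Train Gaussian Naive Bayes and predict on the held-out set
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Report per-class precision, recall, and F1-score
print("Classification Report:\n")
print(classification_report(y_test, y_pred, target_names=cancer.target_names))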


Classification Report:

              precision    recall  f1-score   support

   malignant       0.94      0.92      0.93        63
      benign       0.95      0.96      0.96       108

    accuracy                           0.95       171
   macro avg       0.94      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



Question 8: Write a Python program to:
* Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
* Print the best hyperparameters and accuracy.
* (Include your Python code and output in the code box below.)

In [13]:
# Answer
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd


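# Load the Wine data into a DataFrame with named feature columns
wine = load_wine()
df = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df['target'] = wine.target

# Split features and labels, holding out 30% of the rows for testing
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Candidate values for C and gamma (RBF kernel; features are left unscaled)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}

# 5-fold cross-validated grid search on the training set
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

# Evaluate the best model (refit on the full training set) on the test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))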


Best Hyperparameters: {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
Test Accuracy: 0.7592592592592593
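
The Wine features sit on very different scales (proline is in the hundreds, hue is around 1), and RBF SVMs are sensitive to feature scale, which likely contributes to the modest test accuracy above. The sketch below is one possible variant, not part of the original answer: it assumes standardizing the features is acceptable for this exercise and puts a StandardScaler in a Pipeline so scaling is fit inside each cross-validation fold. The step names "scaler" and "svc" are arbitrary labels chosen here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# grid.score applies the refit best pipeline to the held-out test set.
scaled_accuracy = grid.score(X_test, y_test)

print("Best Hyperparameters:", grid.best_params_)
print("Test Accuracy (scaled features):", scaled_accuracy)
```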


Question 9: Write a Python program to:
* Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using sklearn.datasets.fetch_20newsgroups).
* Print the model's ROC-AUC score for its predictions.
* (Include your Python code and output in the code box below.)

In [14]:
# Answer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

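# Keep two newsgroups so the task is binary and ROC-AUC is directly applicable
categories = ['rec.sport.hockey', 'sci.space']
data = fetch_20newsgroups(subset='all', categories=categories)

X = data.data
y = data.target

# Convert raw text to TF-IDF features.
# Note: fitting the vectorizer before splitting lets test documents
# influence the vocabulary and IDF weights.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = MultinomialNB()
model.fit(X_train, y_train)

# ROC-AUC needs scores, so use the predicted probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]

print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))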


ROC-AUC Score: 0.9999887177751452
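
In the answer above, the TF-IDF vectorizer is fit on all documents before the train/test split. A stricter variant splits the raw text first and fits the vectorizer on the training part only. The sketch below shows one way to do that with a Pipeline; the helper name `nb_roc_auc` and the tiny corpus are hypothetical and are used so the example runs without downloading anything.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def nb_roc_auc(texts, labels, test_size=0.3, random_state=1):
    """Split raw texts first, then fit TF-IDF + MultinomialNB on the training part only."""
    train_texts, test_texts, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words="english")),
        ("nb", MultinomialNB()),
    ])
    pipeline.fit(train_texts, y_train)
    # Probability of the positive class (label 1) for each test document
    y_prob = pipeline.predict_proba(test_texts)[:, 1]
    return roc_auc_score(y_test, y_prob)


# Made-up miniature corpus: 0 = hockey, 1 = space.
texts = [
    "the goalie made a great save in the hockey game",
    "hockey playoffs start tonight with two games",
    "the team scored three goals on the ice",
    "the referee called a penalty in the third period",
    "fans cheered as the puck hit the net",
    "the coach praised the hockey team defense",
    "the rocket launch was delayed by weather",
    "the satellite reached orbit around the moon",
    "astronauts repaired the space station module",
    "the shuttle mission studied the orbit of mars",
    "a new telescope will observe distant planets",
    "the space probe sent images of the moon",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

auc = nb_roc_auc(texts, labels, test_size=0.5)
print("ROC-AUC Score:", auc)
```

To apply this to the newsgroups data, `data.data` and `data.target` from `fetch_20newsgroups` can be passed in place of the toy lists.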


Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
* Text with diverse vocabulary
* Potential class imbalance (far more legitimate emails than spam)
* Some incomplete or missing data

Explain the approach you would take to:
* Preprocess the data (e.g. text vectorization, handling missing data)
* Choose and justify an appropriate model (SVM vs. Naïve Bayes)
* Address class imbalance
* Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.


Answer: To build an email spam classifier, I would first preprocess the data by converting the email text into numerical features with TF-IDF vectorization, which weights words by how informative they are. Missing data would be handled by dropping emails with no usable text and filling other missing fields with an empty string or placeholder.

For the model, I would prefer Naïve Bayes because it works very well with text data, trains quickly, and copes with high-dimensional sparse features. A linear SVM is a strong alternative and can be more accurate, but it is slower to train on very large datasets and does not output probabilities by default.

Since email datasets usually have far more legitimate emails than spam, I would address the class imbalance with resampling techniques such as SMOTE or by applying class weights in the model. For evaluation, accuracy alone is misleading on imbalanced data, so I would use precision, recall, F1-score, and ROC-AUC. Finally, the business impact would be significant:

* Accurate spam detection protects users from malicious emails.

* Reduces the risk of missing important legitimate emails.

* Improves user trust and overall experience.

* Saves time by filtering junk emails automatically.
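
The approach described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a production design: the miniature inbox is made up, missing bodies are filled with an empty string, and class imbalance is handled with balanced sample weights passed to `MultinomialNB.fit` rather than with SMOTE.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical miniature inbox: 1 = spam, 0 = not spam; None marks a missing body.
emails = pd.DataFrame({
    "text": [
        "Meeting moved to 3pm, see agenda attached",
        "Win a free prize now, click here",
        None,
        "Quarterly report draft for your review",
        "Lunch on Friday?",
        "Free money, claim your prize today",
        "Project update and next steps",
        "Please review the attached invoice",
    ],
    "label": [0, 1, 0, 0, 0, 1, 0, 0],
})

# Handle missing data: replace missing bodies with an empty string.
emails["text"] = emails["text"].fillna("")

# Text vectorization: TF-IDF over the email bodies.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails["text"])
y = emails["label"].to_numpy()

# Address imbalance: weight each email inversely to its class frequency.
weights = compute_sample_weight(class_weight="balanced", y=y)

model = MultinomialNB()
model.fit(X, y, sample_weight=weights)

# For brevity this evaluates on the training data; a real project would
# fit the vectorizer and model on a training split and report on held-out emails.
report = classification_report(
    y, model.predict(X), target_names=["not spam", "spam"], zero_division=0
)
print(report)
```

With balanced weights, the two spam emails together carry the same total weight as the six legitimate ones, so the class prior learned by the model is no longer dominated by the majority class.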