1. What is a Support Vector Machine (SVM), and how does it work?
  - Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression. It works by finding the optimal hyperplane that best separates different classes in the feature space. The goal is to maximize the margin (distance) between the hyperplane and the nearest data points (called support vectors).
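
For a linearly separable training set with labels $y_i \in \{-1, +1\}$, the margin maximization described above is commonly written as the optimization problem below. The symbols $w$ (normal vector of the hyperplane) and $b$ (offset) are standard textbook notation assumed here, not taken from the answer itself.

$$
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i
$$

Under this scaling the distance from the hyperplane to the closest points is $1/\lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. The support vectors are the points for which the constraint holds with equality.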

2. Explain the difference between Hard Margin and Soft Margin SVM.
  - Hard Margin SVM: Assumes data is perfectly linearly separable. It requires all points to be correctly classified without error. Not practical for noisy data.

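  - Soft Margin SVM: Allows some margin violations and misclassifications. The regularization parameter C controls the trade-off: a large C penalizes violations heavily (narrower margin, closer fit to the training data), while a small C tolerates more of them (wider margin). This balance between maximizing the margin and minimizing classification error makes it more robust to noise.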
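
The effect of C can be seen by counting support vectors on overlapping data. This is a minimal sketch: the two synthetic clusters, their centers, and the C values are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters, so no hard margin exists (illustrative data)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]], cluster_std=1.5, random_state=0)

support_counts = {}
for C in (0.01, 1, 100):
    model = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class
    support_counts[C] = int(model.n_support_.sum())
    print(f"C={C}: {support_counts[C]} support vectors")
```

A small C widens the margin, so more points fall inside it and become support vectors; a large C narrows the margin and typically leaves fewer.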

3. What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
  - The Kernel Trick allows SVM to classify data that is not linearly separable by implicitly mapping it into a higher-dimensional space without explicitly computing the transformation.

  - Example: Radial Basis Function (RBF) kernel.
  - Use case: Data whose classes are separated by a non-linear decision boundary (e.g., concentric circular clusters).
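
A short sketch of the RBF use case, assuming a synthetic concentric-circles dataset from `make_circles` (the sample size, noise level, and split are illustrative choices):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# One class forms a ring around the other, so no straight line separates them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel accuracy: {scores[kernel]:.2f}")
```

The linear kernel cannot do much better than guessing on this layout, while the RBF kernel separates the two rings without any explicit feature transformation.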

4. What is a Naïve Bayes Classifier, and why is it called “naïve”?
  - Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem that assumes all features are independent given the class label.
  - It’s called “naïve” because in reality, features are rarely completely independent, but the assumption simplifies computation and often works well in practice.
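
Written out, the independence assumption turns Bayes' Theorem into a product of per-feature terms. The notation below ($y$ for the class, $x_1, \dots, x_n$ for the features) is assumed for illustration:

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Each factor $P(x_i \mid y)$ can be estimated separately from the training data, which is what keeps the computation cheap.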

5. Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?
  - Gaussian NB: Assumes features follow a normal distribution. Used for continuous data (e.g., Iris dataset).

  - Multinomial NB: Used for discrete counts (e.g., word counts in text classification).

  - Bernoulli NB: Assumes binary features (0/1 presence). Useful for text classification with binary indicators of word presence.
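
The three variants can be matched to their feature types as in the sketch below. All arrays are tiny hypothetical examples made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])
# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 2, 2]])
# Word presence/absence (0/1) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

predictions = {
    "gaussian": int(GaussianNB().fit(X_cont, y).predict([[5.0, 3.4]])[0]),
    "multinomial": int(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]])[0]),
    "bernoulli": int(BernoulliNB().fit(X_binary, y).predict([[1, 0, 1]])[0]),
}
print(predictions)
```

Each query point resembles the class-0 training rows, so all three models are expected to assign it to class 0.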


In [1]:
"""
Dataset Info:
● You can use any suitable datasets like Iris, Breast Cancer, or Wine from
sklearn.datasets or a CSV file you have.
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
"""

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predictions and accuracy
y_pred = svm_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Support Vectors:", svm_model.support_vectors_)


Accuracy: 1.0
Support Vectors: [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [2]:
"""
Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
"""
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian NB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predictions
y_pred = nb_model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.93      0.90      0.92        63
           1       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [3]:
"""
Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
"""
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
wine = load_wine()
X, y = wine.data, wine.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# GridSearch for best params
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(X_train, y_train)

# Results
print("Best Parameters:", grid.best_params_)
y_pred = grid.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'C': 10, 'gamma': 0.01}
Accuracy: 0.6666666666666666
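
The RBF kernel is distance based, and the Wine features span very different numeric ranges, so the modest accuracy above may partly reflect unscaled inputs. The sketch below assumes a `StandardScaler` step inside a pipeline, which is an addition to the assignment code rather than part of it. The `svc__` prefix comes from the step name that `make_pipeline` generates.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling happens inside the pipeline, so each CV fold is scaled using its own training part
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}

scaled_grid = GridSearchCV(pipeline, param_grid, cv=5)
scaled_grid.fit(X_train, y_train)

scaled_accuracy = scaled_grid.score(X_test, y_test)
print("Best Parameters:", scaled_grid.best_params_)
print("Accuracy with scaling:", scaled_accuracy)
```

Keeping the scaler inside the pipeline also prevents the test split from influencing the scaling statistics.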


In [4]:
"""
Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
"""
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Load dataset
data = fetch_20newsgroups(subset='all')
X, y = data.data, data.target

# Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_vec = vectorizer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.3, random_state=42)

# Train NB
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# ROC-AUC score
y_prob = nb_model.predict_proba(X_test)
print("ROC-AUC Score:", roc_auc_score(y_test, y_prob, multi_class='ovr'))


ROC-AUC Score: 0.9934558707675476
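
One thing to watch in the cell above: the TF-IDF vectorizer is fitted on the whole corpus before the split, so the IDF weights have already seen the test documents. A sketch of one way to avoid that, assuming a scikit-learn pipeline and a hypothetical toy corpus in place of 20 Newsgroups (to skip the download):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (1 = space, 0 = cars), standing in for the real dataset
train_docs = ["rocket launch orbit", "nasa shuttle orbit",
              "engine tyres brakes", "car engine dealer"]
train_labels = [1, 1, 0, 0]
test_docs = ["shuttle launch delayed", "new car brakes",
             "orbit of the moon", "dealer sold the engine"]
test_labels = [1, 0, 1, 0]

# The vectorizer is fitted only on the training documents when the pipeline is fitted
text_nb = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
text_nb.fit(train_docs, train_labels)

# For a binary problem, ROC-AUC uses the predicted probability of the positive class
positive_scores = text_nb.predict_proba(test_docs)[:, 1]
auc = roc_auc_score(test_labels, positive_scores)
print("ROC-AUC Score:", auc)
```

The same pipeline pattern carries over to the multi-class case by passing the full probability matrix with `multi_class='ovr'`, as in the cell above.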


10. Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.

### Preprocessing
- Handle missing data by replacing empty emails with placeholders (e.g., `"no_content"`).  
- Use **TF-IDF vectorization** to convert text into numerical features.  
- Normalize and clean text:  
  - Remove stopwords  
  - Remove punctuation and special characters  
  - Convert to lowercase  
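
The preprocessing steps above could look like the following sketch. The helper name `clean_email` and the sample emails are hypothetical and used only for illustration.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def clean_email(text):
    """Replace missing bodies with a placeholder, lowercase, and strip punctuation."""
    if text is None or not str(text).strip():
        return "no_content"
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


raw_emails = ["Win MONEY now!!!", None, "   ", "Meeting at 10am?"]
cleaned = [clean_email(email) for email in raw_emails]
print(cleaned)

# Stopword removal is delegated to the vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(cleaned)
print(X.shape)
```

The placeholder survives as a single token because the default token pattern treats the underscore as a word character, so the model can learn from the fact that an email was empty.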


### Model Choice
- **Naïve Bayes** (Multinomial NB) is fast to train and effective for text classification, since it models each class through its word frequencies.  
- **SVM** can work better with high-dimensional sparse text features but is computationally more expensive.  
- For a starting point, I’d use **Multinomial Naïve Bayes**.  



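### Handling Class Imbalance
- Because spam emails are fewer than legitimate ones, compensate for the imbalance by:  
  - Oversampling the minority class, e.g., with **SMOTE (Synthetic Minority Oversampling Technique)**, applied to the training split only  
  - Applying **class weights** to penalize misclassification of the minority class (SVM supports this through `class_weight`; Multinomial NB has no such parameter, so adjust its class priors or use Complement NB instead)  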
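
A minimal sketch of the class-weight option, assuming synthetic data from `make_classification` as a stand-in for vectorized emails and an SVM as the weighted model (SMOTE is left out because it needs the separate imbalanced-learn package):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in: roughly 90% legitimate (0) and 10% spam (1)
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# "balanced" weights are n_samples / (n_classes * class_count), so the rarer class weighs more
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
print("Class weights (legitimate, spam):", weights)

recalls = {}
for class_weight in (None, "balanced"):
    model = SVC(kernel="linear", class_weight=class_weight).fit(X_train, y_train)
    recalls[class_weight] = recall_score(y_test, model.predict(X_test))
    print(f"class_weight={class_weight}: spam recall = {recalls[class_weight]:.2f}")
```

Comparing the two recall values shows how much the weighting shifts the model toward catching the minority class on this particular dataset; the effect on real emails would need to be measured separately.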



### Evaluation Metrics
- Use metrics beyond just accuracy:  
  - **Precision:** How many predicted spams are actually spam  
  - **Recall:** How many actual spams are correctly identified  
  - **F1-score:** Balance between precision and recall  
  - **ROC-AUC:** Measures overall discriminative ability of the classifier  
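
These metrics can be computed directly with scikit-learn. The labels and scores below are hypothetical values chosen for illustration, with a 0.5 decision threshold assumed.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = spam) and predicted spam probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.7, 0.4, 0.8]

# Threshold the scores to get hard predictions: 3 true positives, 1 false positive, 1 false negative
y_pred = [int(score >= 0.5) for score in y_score]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses the raw scores, not the thresholded labels

print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}  ROC-AUC: {auc:.4f}")
```

Precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC summarizes ranking quality across all thresholds.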



### Business Impact
- A robust spam detection system will:  
  - Reduce time wasted on unwanted emails  
  - Protect users from phishing and malicious attacks  
  - Improve overall trust in the company’s communication system  
  - Enhance productivity and security for both employees and customers  


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import numpy as np

# Synthetic email dataset (you’d replace with real emails)
emails = ["Win money now!!!", "Meeting at 10am", "Lowest price on meds", "Project deadline tomorrow", "Earn $$$ fast"]
labels = [1, 0, 1, 0, 1]  # 1=Spam, 0=Not Spam

# Vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(emails)
y = np.array(labels)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train NB
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



