In [None]:
## Question 1: What is a Support Vector Machine (SVM), and how does it work?

"""
Answer: A Support Vector Machine (SVM) is a supervised machine learning algorithm that finds the "hyperplane" separating the classes with the largest possible margin.
The margin is set by the "support vectors", the training points nearest to the hyperplane, and a "kernel trick" handles non-linear data by implicitly mapping it to a
higher-dimensional space where it may be linearly separable.
SVM works in the following manner:
Finding the Optimal Hyperplane: The primary purpose of an SVM is to identify the hyperplane that gives the greatest separation between distinct data classes.
Defining the Margin: This separation is measured by the "margin," which is the distance between the hyperplane and the nearest data points from each class.
Identifying Support Vectors: The data points on the edge of this margin are referred to as support vectors. These are the most important points, since removing or moving
them may change the hyperplane, whereas points far from the margin have no effect on it.
Classification: Once the optimal hyperplane has been identified (in either the original or a higher-dimensional space), it is used to classify new, previously unseen data
points according to which side of the hyperplane they fall on.
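The four steps above can be sketched on a toy problem. This is a minimal illustration, assuming scikit-learn's SVC with a linear kernel; the six data points are made up, and the very large C is only an approximation of a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)

# Support vectors sit on the margin, so |w.x + b| is approximately 1 for them
sv_scores = clf.decision_function(clf.support_vectors_)
print("decision values at support vectors:", sv_scores)

# New points are classified by the side of the hyperplane they fall on
new_points = np.array([[1.5, 1.5], [4.5, 4.5]])
predictions = clf.predict(new_points)
print("predictions:", predictions)
```

Only the points nearest the boundary appear in support_vectors_; the remaining training points could be moved (without crossing the margin) and the fitted hyperplane would stay the same.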

"""

In [None]:
## Question 2: Explain the difference between Hard Margin and Soft Margin SVM

"""
Answer: Hard Margin SVM requires perfectly linearly separable data and insists that no data points fall within the margin or on the incorrect side of the decision boundary,
making it sensitive to outliers. In contrast, Soft Margin SVM allows for some misclassifications and margin violations via slack variables, resulting in a more flexible
and robust model capable of handling imperfect real-world data containing outliers or overlapping classes.
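Hard Margin SVMs aim for complete separation, which has no solution when the classes overlap and can overfit noisy data, whereas Soft Margin SVMs accept an imperfect
boundary to improve generalisation on complicated, real-world data. The regularisation parameter C controls this trade-off: a large C penalises violations heavily
(approaching a hard margin), while a small C tolerates more violations in exchange for a wider margin.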
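The effect of softening the margin can be seen by varying C on overlapping data. This is a rough sketch, assuming scikit-learn's SVC and a synthetic make_blobs dataset; the cluster spread and the three C values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a perfect (hard-margin) separation does not exist
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Every point on or inside the margin becomes a support vector
    n_support[C] = int(clf.n_support_.sum())
    print(f"C={C:<6} support vectors={n_support[C]:3d} train accuracy={clf.score(X, y):.3f}")
```

A smaller C widens the margin, so more points fall inside it and become support vectors; a larger C narrows the margin and leans harder on the individual points near the boundary.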

"""

In [None]:
## Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

"""
Answer: The Kernel Trick is a technique that lets an SVM behave as if non-linearly separable data had been mapped into a higher-dimensional space, where a linear
classifier may be able to find a separating boundary. It avoids explicitly computing the high-dimensional coordinates: a kernel function takes two original data points and
returns the dot product their images would have in the higher-dimensional space.
Problem: In many cases, data cannot be divided into distinct classes using a simple straight line (linear boundary) in its original low-dimensional space.
Solution: Instead of executing the computationally intensive transformation (mapping each data point into the new space), the SVM operates only on the kernel values. It
thereby locates a linear separator in the higher-dimensional space, which corresponds to a nonlinear decision boundary in the original data space.
Example: The Radial Basis Function (RBF) kernel, k(x, z) = exp(-gamma * ||x - z||^2), corresponds to an infinite-dimensional feature space. It is a common default when
the classes form complicated, curved patterns (for example, one class surrounding another) and no linear boundary in the original features is adequate.
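The claim that a kernel returns a dot product in a higher-dimensional space can be checked numerically for a kernel whose feature map is small enough to write down. A minimal sketch, assuming the homogeneous degree-2 polynomial kernel on 2-D inputs; the helper names phi, poly_kernel and rbf_kernel are hypothetical, and the gamma value is arbitrary.

```python
import numpy as np

def phi(v):
    # Explicit degree-2 feature map for a 2-D vector (maps into 3-D)
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    # Homogeneous polynomial kernel of degree 2, evaluated in the original 2-D space
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=0.5):
    # RBF kernel: its feature space is infinite-dimensional, so only this form is usable
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))  # map both points first, then take the dot product
implicit = poly_kernel(x, z)       # same value, without ever building phi(x) or phi(z)

print("explicit mapping:", explicit)
print("kernel shortcut: ", implicit)
print("RBF similarity:  ", rbf_kernel(x, z))
```

For the polynomial kernel the two routes agree exactly; for the RBF kernel the explicit route is not available at all, which is why the shortcut matters.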

"""

In [None]:
## Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

"""
Answer: A Naïve Bayes classifier is a simple probabilistic classification technique based on Bayes' Theorem. It assumes that all features used for classification are
conditionally independent of each other given the class. It is called "naïve" because this independence assumption is rarely true in real-world data, yet the classifier
nonetheless performs well in tasks such as text classification and spam detection.
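The "naïve" step is simply multiplying per-feature likelihoods together. A small sketch, assuming binary word-presence features and Laplace smoothing, computes the posterior by hand and compares it with scikit-learn's BernoulliNB; the six training rows are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features: columns = [contains "free", contains "meeting", contains "win"]
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0, 0, 1])  # 1 = spam, 0 = not spam
x_new = np.array([1, 0, 1])       # email containing "free" and "win"

alpha = 1.0  # Laplace smoothing
log_scores = []
for c in (0, 1):
    X_c = X[y == c]
    prior = len(X_c) / len(X)
    # Smoothed estimate of P(feature = 1 | class)
    p = (X_c.sum(axis=0) + alpha) / (len(X_c) + 2 * alpha)
    # Naive step: per-feature likelihoods are multiplied (summed in log space)
    log_lik = np.sum(x_new * np.log(p) + (1 - x_new) * np.log(1 - p))
    log_scores.append(np.log(prior) + log_lik)

# Normalise the two class scores into posterior probabilities
log_scores = np.array(log_scores)
manual_posterior = np.exp(log_scores - log_scores.max())
manual_posterior /= manual_posterior.sum()

sk_posterior = BernoulliNB(alpha=alpha).fit(X, y).predict_proba(x_new.reshape(1, -1))[0]
print("manual posterior [not spam, spam]: ", manual_posterior)
print("sklearn posterior [not spam, spam]:", sk_posterior)
```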

"""

In [None]:
## Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

"""
Answer: There are three common variants of Naïve Bayes, distinguished by the distribution they assume for each feature: Gaussian Naïve Bayes for continuous features assumed
to be normally distributed within each class, Multinomial Naïve Bayes for discrete count features such as word frequencies, and Bernoulli Naïve Bayes for binary features
that record presence or absence.
Gaussian Naïve Bayes is used for classification problems with continuous numerical features, such as assigning properties to price bands based on variables like size or
local income, or classifying patients based on measurements like age or weight.
Multinomial Naïve Bayes is ideal for text classification tasks like spam detection and document categorisation, where features represent the frequency of words
in a document.
Bernoulli Naïve Bayes is suitable for text classification when a word's presence, rather than its frequency, indicates the category. For example, deciding
if a document is positive or negative based on the presence or absence of specific terms.
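The pairing of variant and feature type can be sketched as follows. This is an illustration only, assuming scikit-learn's three estimators, the Iris data for the continuous case, and a tiny made-up corpus for the two text cases.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Gaussian NB: continuous measurements
X_iris, y_iris = load_iris(return_X_y=True)
gnb = GaussianNB().fit(X_iris, y_iris)

# Tiny made-up corpus for the two text variants
docs = ["free free prize win", "win money free",
        "meeting agenda notes", "project meeting notes notes"]
labels = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# Multinomial NB: word counts (how often each word occurs)
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB: binary indicators (whether each word occurs at all)
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

test_doc = ["free prize money"]
mnb_pred = mnb.predict(count_vec.transform(test_doc))
bnb_pred = bnb.predict(binary_vec.transform(test_doc))
print("GaussianNB training accuracy on Iris:", gnb.score(X_iris, y_iris))
print("MultinomialNB prediction:", mnb_pred, "BernoulliNB prediction:", bnb_pred)
```

The only difference between the two text models here is the input: repeated words raise the counts seen by MultinomialNB, while BernoulliNB sees each word as present or absent and also penalises the absence of words typical of a class.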

"""

In [None]:
"""
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM Classifier with a linear kernel
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

y_pred = svm_model.predict(X_test)

# Calculating and printing the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print the support vectors
print("\nSupport Vectors:")
print(svm_model.support_vectors_)


Model Accuracy: 1.0000

Support Vectors:
[[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [None]:
"""
Question 7: Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
"""

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb_model.predict(X_test)

# Printing the classification report including precision, recall, and F1-score
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))


Classification Report:
              precision    recall  f1-score   support

   malignant       0.93      0.90      0.92        63
      benign       0.95      0.96      0.95       108

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.94       171
weighted avg       0.94      0.94      0.94       171



In [None]:
"""
Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy
"""

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = load_wine()
X = wine.data
y = wine.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1]
}

svm = SVC(kernel='rbf')

grid = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Printing the best hyperparameters and accuracy
print("Best Hyperparameters:")
print(f"C: {grid.best_params_['C']}")
print(f"gamma: {grid.best_params_['gamma']}")
print(f"\nBest Model Accuracy: {accuracy:.4f}")


Best Hyperparameters:
C: 100
gamma: 0.001

Best Model Accuracy: 0.8333
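The Wine features span very different ranges (proline is in the hundreds while hue is around 1), and the RBF kernel is distance-based, so feature scaling often changes which C and gamma are selected. The variant below is a sketch of that alternative, assuming it is acceptable to standardise inside a Pipeline; it is not the method used in the cell above, and its results are not shown here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so it is re-fitted on each CV training fold
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# make_pipeline names the SVC step "svc", hence the "svc__" prefix
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

scaled_accuracy = grid.score(X_test, y_test)
print("Best hyperparameters:", grid.best_params_)
print(f"Test accuracy with scaling: {scaled_accuracy:.4f}")
```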


In [None]:
"""
Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
"""

# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
import numpy as np

# Loading the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(
    subset='train',
    remove=('headers', 'footers', 'quotes')
)
X_text = newsgroups.data
y = newsgroups.target

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, random_state=42
)

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

y_prob = nb_model.predict_proba(X_test)

# Computing and printing the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_prob, multi_class='ovr', average='macro')
print(f"ROC-AUC Score: {roc_auc:.4f}")


ROC-AUC Score: 0.9578


In [None]:
"""
Question 10: Imagine you’re working as a data scientist for a company that handles email communications.
Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naïve Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics And explain the business impact of your solution.
"""
"""
Answer: As a data scientist for an email communications company, my goal would be to build a reliable spam detection system that minimises user inconvenience while
efficiently filtering out malicious emails. Spam classification is a classic binary text classification problem in which emails are labelled as "Spam" (positive class)
or "Not Spam" (negative class, i.e. legitimate emails). Given the challenges (diverse vocabulary, class imbalance with e.g. 90%+ legitimate emails, and incomplete or
missing data), I would use a structured machine learning pipeline. The approach is outlined below, covering preprocessing, model selection, handling imbalance,
evaluation, and business impact.
Preprocessing Data
Text data requires preprocessing before it can be used in machine learning models. Emails frequently include noisy, unstructured language, missing fields, or incomplete entries.
Handling missing or incomplete data:
Scan the dataset for missing values in important fields such as subject, body, sender, or metadata. Incomplete emails may have empty bodies or garbled text.
Strategies:
Removal: Remove rows whose body and subject are both missing. For partially missing data, assign placeholder values such as "Unknown" to subjects, or use mean imputation
for numerical metadata.
Filling gaps: To avoid errors downstream, replace missing bodies in text fields with a placeholder string. If metadata such as timestamps is missing, discard only those
rows that are not critical.
Outlier handling: Remove extremely short or long emails, as they may be artefacts.
Text vectorisation:
Models cannot process raw text, so vectorisation converts the cleaned text into numerical features.
The preferred method is the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer from sklearn.feature_extraction.text.TfidfVectorizer because it handles
diverse vocabulary by downweighting terms that appear in most emails (e.g., "the" or "regards", which do not discriminate between classes) and emphasising rarer, more
informative terms (e.g., "Viagra"). Set options such as max_features between 5000 and 10000 for efficiency and ngram_range=(1, 2) to capture phrases (for example, "free money").
Split the data into train/validation/test sets (e.g., 70/15/15) using stratified sampling to maintain the class distribution, and fit the vectoriser on the training set
only so that no information from the validation or test emails leaks into the features.
Choosing and Justifying a Model
Given the high dimensionality of the text features (thousands of TF-IDF terms), I would first investigate simple, interpretable models for spam detection. There are two options: Support Vector Machine (SVM) and Naïve Bayes (NB).
Comparison:
Naïve Bayes: Assumes features are conditionally independent given the class, a simplification that copes well with high-dimensional text because only one parameter per
word and class has to be estimated. It is probabilistic, trains in a single pass over the data (time linear in the number of emails and features), and handles sparse
data well. However, it may underperform if the features are highly correlated.
SVM: Finds the hyperplane that maximises the margin between classes in a high-dimensional space. It is resilient to irrelevant features and does not assume independence,
making it well suited to text categorisation. Linear kernels are favoured for speed on large vocabularies; RBF kernels could be used if non-linear boundaries are
required, although they are computationally expensive.
I recommend starting with Naïve Bayes as the primary model, for the following reasons:
Efficiency: Spam datasets can be huge (millions of emails), and NB trains quickly and needs little hyperparameter tuning (mainly the smoothing parameter alpha), unlike
SVM's regularisation parameter C.
Text performance: NB has a long track record in spam filtering, where Bayesian filters are a classic approach. Although words are not truly conditionally independent
given the class, the resulting class decisions are often still accurate.
Interpretability: Probabilities are simple to explain (for example, "This email is 95% spam due to words like 'lottery'").
SVM as a strong alternative or benchmark: If NB's accuracy is insufficient (for example, because of correlated features), move to a linear SVM, which can produce somewhat
higher F1-scores on text benchmarks such as the Enron-Spam dataset. To tune the C parameter of the SVM, I would use GridSearchCV.
Addressing Class Imbalance
Class imbalance (for example, 95% Not Spam) can bias models towards the majority class, resulting in high accuracy but poor spam detection (many false negatives).
Strategies:
Resampling techniques (applied to the training set only, never to the validation or test sets):
Oversampling the minority class: Use imbalanced-learn's SMOTE (Synthetic Minority Over-sampling Technique) to create synthetic spam samples by interpolating between
existing ones. Prefer this to pure random oversampling, which duplicates emails and encourages overfitting.
Undersampling the majority class: Randomly downsample Not Spam emails, but take care not to discard valuable data; for example, combine it with oversampling for balance.
Class weighting: Alternatively, weight errors on the minority class more heavily, for example with class_weight='balanced' in LinearSVC.
Evaluating Performance with Appropriate Metrics
Accuracy is misleading under imbalance: a filter that never flags anything would still be 95% accurate on data that is 95% Not Spam. I would use a multi-metric
evaluation on the test set.
Key metrics:
Precision: The proportion of predicted Spam that is actually Spam (TP / (TP + FP)). Users dislike legitimate emails being marked as spam, so minimising false positives
is critical.
Recall (Sensitivity): The proportion of actual Spam that is caught (TP / (TP + FN)). High recall matters because missed spam can lead to phishing attacks.
F1-Score: The harmonic mean of Precision and Recall (2 * (Precision * Recall) / (Precision + Recall)). A balanced metric for imbalanced data; aim for >0.90 macro-averaged.
ROC-AUC: Summarises the trade-off between true positive rate and false positive rate across all decision thresholds, with a target of >0.95. Because it can look
optimistic under heavy imbalance, it should be read alongside precision and recall for the Spam class.
Business Impact
Implementing this spam classifier would improve the company's operations and customer satisfaction:
Improved user experience: By capturing over 95% of spam (high recall) with few false positives (<1% of legitimate emails flagged), users get cleaner inboxes and less
frustration. This could plausibly improve client retention; a figure such as 5-10% should be treated as an assumption to validate, not a measured result.
Security and compliance: Early identification of phishing/spam reduces risks such as data breaches and malware, potentially saving substantial remediation costs. It also
supports compliance with legislation such as GDPR/CAN-SPAM by automating filtering.
Operational efficiency: Automated classification scales to large volumes of email (millions per day) without manual inspection, freeing up support staff.
Training/inference is inexpensive (NB works on commodity hardware) and can be deployed via APIs (for example, in email servers using Flask/Docker).
ROI quantification: If the organisation processes 1 million emails per day with 5% spam, the model could keep up to 50K spam emails per day from reaching users (about
47.5K at 95% recall). As a rough illustrative estimate, this might translate into ~$100K+ annual savings (e.g., reduced server storage and user complaints). A/B testing
after deployment would measure improvements in metrics such as customer satisfaction scores.
Scalability and iteration: Begin with batch processing and progress to real-time (e.g., using Kafka streams). Monitor drift (e.g., evolving spam tactics) using tools
such as Evidently AI, and retrain quarterly.
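The metric definitions and the ROI arithmetic above can be checked with a few lines of plain Python. This is a sketch with hypothetical confusion-matrix counts (1000 emails, 50 of them spam); spam_metrics is an illustrative helper, not part of any library.

```python
def spam_metrics(tp, fp, fn, tn):
    # Precision, recall, F1 and accuracy for the Spam (positive) class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical filter: catches 40 of 50 spam emails and wrongly flags 5 legitimate ones
metrics = spam_metrics(tp=40, fp=5, fn=10, tn=945)

# A useless filter that never flags anything still looks accurate under imbalance
always_ham = spam_metrics(tp=0, fp=0, fn=50, tn=950)

# ROI arithmetic from the text: 1 million emails per day, 5% spam, 95% recall
daily_spam = round(1_000_000 * 0.05)
daily_spam_blocked = round(daily_spam * 0.95)

print("filter:    ", metrics)
print("always ham:", always_ham)
print("spam per day:", daily_spam, "blocked at 95% recall:", daily_spam_blocked)
```

The second call shows why accuracy alone is not enough: the filter that blocks nothing has high accuracy but zero recall and zero F1.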

"""

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import numpy as np

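# Download NLTK resources (run once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases look for this tokenizer resource instead of 'punkt'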

# Synthetic dataset (replace with pd.read_csv('emails.csv') for real data)
data = {
    'text': [
        'Buy cheap viagra now! Click here for discount.',  # Spam
        'Meeting tomorrow at 10 AM in conference room.',  # Not Spam
        'Win free lottery tickets today! Urgent offer.',  # Spam
        'Your invoice is attached. Please review.',  # Not Spam
        'Hello, how are you? Missing subject.',  # Not Spam (incomplete)
        'Free money transfer to your account. Act fast!',  # Spam
        '',  # Missing/incomplete
        'Project update: Sales increased by 20%.',  # Not Spam
        'Dear user, your account is suspended. Login now.',  # Spam
        'Lunch invitation from team lead.'  # Not Spam
    ],
    'label': [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # 1=Spam, 0=Not Spam
}
df = pd.DataFrame(data)

# Handle missing/incomplete data
df = df.dropna(subset=['label'])  # Drop rows with missing labels
df['text'] = df['text'].fillna('[No Content]')  # Impute missing (NaN) text
df = df[df['text'].str.len() > 10].copy()  # Remove empty/very short emails; copy so later column assignment is safe

# Text cleaning function
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation/numbers
    tokens = nltk.word_tokenize(text)
    tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

df['clean_text'] = df['text'].apply(clean_text)

# Split the cleaned text first (stratified to preserve the class ratio)
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df['clean_text'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)

# Vectorization with TF-IDF, fitted on the training split only so no test vocabulary leaks in
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))  # Limit for demo; handles diverse vocab
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
print(f"Training set shape: {X_train.shape}, Class counts: {np.bincount(y_train)}")

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE
from sklearn.pipeline import Pipeline

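# Address imbalance on the training split only
# SMOTE needs k_neighbors < number of minority samples, so shrink it for this tiny demo dataset
k_neighbors = min(5, np.bincount(y_train).min() - 1)
smote = SMOTE(random_state=42, k_neighbors=k_neighbors)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)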

# Train Naïve Bayes (primary model)
nb_model = MultinomialNB()
nb_model.fit(X_train_res, y_train_res)

# Train SVM for comparison
svm_model = LinearSVC(class_weight='balanced', random_state=42)  # Handles imbalance
svm_model.fit(X_train_res, y_train_res)

# Predictions
y_pred_nb = nb_model.predict(X_test)
y_prob_nb = nb_model.predict_proba(X_test)[:, 1]  # Prob for positive class
y_pred_svm = svm_model.predict(X_test)
y_prob_svm = svm_model.decision_function(X_test)  # Use decision for ROC

from sklearn.model_selection import cross_val_score

# Evaluate Naïve Bayes
print("Naïve Bayes Classification Report:")
print(classification_report(y_test, y_pred_nb, target_names=['Not Spam', 'Spam']))

roc_auc_nb = roc_auc_score(y_test, y_prob_nb)
print(f"Naïve Bayes ROC-AUC: {roc_auc_nb:.4f}")

# Cross-validation F1 (macro for balance); cap the folds at the smallest class size
# Note: scoring on SMOTE-resampled data is optimistic, since synthetic points are derived from real ones in other folds
n_folds = min(5, np.bincount(y_train_res).min())
cv_f1_nb = cross_val_score(nb_model, X_train_res, y_train_res, cv=n_folds, scoring='f1_macro')
print(f"Naïve Bayes CV F1-Macro: {cv_f1_nb.mean():.4f} (+/- {cv_f1_nb.std() * 2:.4f})")

# Evaluate SVM for comparison
print("\nSVM Classification Report:")
print(classification_report(y_test, y_pred_svm, target_names=['Not Spam', 'Spam']))

roc_auc_svm = roc_auc_score(y_test, y_prob_svm)
print(f"SVM ROC-AUC: {roc_auc_svm:.4f}")

cv_f1_svm = cross_val_score(svm_model, X_train_res, y_train_res, cv=n_folds, scoring='f1_macro')
print(f"SVM CV F1-Macro: {cv_f1_svm.mean():.4f} (+/- {cv_f1_svm.std() * 2:.4f})")
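One way to keep cross-validation free of leakage is to put the vectorizer and the classifier in a single Pipeline and cross-validate on the raw text, so TF-IDF is re-fitted on each training fold. The sketch below assumes scikit-learn only (no resampling step) and uses a made-up corpus; the step names "tfidf" and "nb" are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up emails: 6 spam (1) and 9 legitimate (0)
texts = [
    "win free lottery tickets today", "free money transfer act fast",
    "cheap pills click here for discount", "urgent offer claim your prize now",
    "your account is suspended login now", "free prize winner click to claim",
    "meeting tomorrow at ten in the conference room", "your invoice is attached please review",
    "project update sales increased this quarter", "lunch invitation from the team lead",
    "please find the meeting notes attached", "can we move the review to friday",
    "quarterly report draft for your comments", "reminder team meeting on monday",
    "thanks for the project update see you at lunch",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Vectorizer and model travel together, so each fold fits TF-IDF on its own training part
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(text_clf, texts, labels, cv=cv, scoring="f1_macro")
print("F1-macro per fold:", scores)
print(f"Mean F1-macro: {scores.mean():.4f}")
```

If SMOTE is also wanted inside the folds, imbalanced-learn provides its own Pipeline class that accepts samplers as steps; that variant is not shown here.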

