#Question 1:  What is a Support Vector Machine (SVM), and how does it work?
#Answer:
A **Support Vector Machine (SVM)** is a supervised machine learning algorithm widely used for classification and regression tasks. Its primary aim is to find the optimal way to separate data points belonging to different classes by constructing a boundary known as a **hyperplane**.

### How SVM Works

- **Hyperplane and Margin**: In a feature space (which can be two- or multidimensional), SVM seeks a hyperplane that separates data points into two classes. Out of many possible hyperplanes, SVM chooses the one that **maximizes the margin**—that is, the distance between the hyperplane and the nearest data points from either class. These nearby data points are called **support vectors** because they directly influence the position and orientation of the hyperplane.

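- **Support Vectors**: Only the data points closest to the decision boundary (the support vectors) affect the hyperplane's position. Removing or moving a support vector can shift the hyperplane, whereas points far from the boundary can be removed without changing it. Because the solution depends on only a small subset of the data, the trained model is compact and efficient at prediction time.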

- **Linear vs. Nonlinear Separation**:
  - **Linear SVM**: Used when data can be separated cleanly with a straight line (hyperplane in higher dimensions).
  - **Nonlinear SVM**: For more complex or non-separable data, SVMs use a mathematical tool called the **kernel trick**. A kernel function implicitly maps the data into a higher-dimensional space, where a linear separator may exist even if none exists in the original space.

- **Soft Margin vs. Hard Margin**:
  - **Hard Margin**: No data points are allowed inside the margin—useful for data that can be perfectly separated.
  - **Soft Margin**: Allows some misclassifications or points inside the margin to handle overlapping or non-perfectly separable data. This trade-off is controlled by a regularization parameter, often denoted as "C".

- **Algorithm Steps**:
  1. Preprocess and clean data.
  2. Choose a suitable kernel function.
  3. Train the SVM to find the hyperplane that maximizes the margin.
  4. Use the trained SVM to classify new, unseen examples based on which side of the hyperplane they fall.
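
The four steps above can be sketched with scikit-learn. This is a minimal illustration, not a recommended configuration: the synthetic dataset from `make_classification`, the RBF kernel, the split ratio, and the `random_state` values are all assumptions chosen only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step 1: prepare data (synthetic here; real data would need cleaning first).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Steps 2-3: choose a kernel and train; scaling keeps features comparable.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)

# Step 4: classify unseen points by the side of the boundary they fall on.
y_pred = model.predict(X_test)
scores = model.decision_function(X_test)  # signed score relative to the boundary
n_support = model.named_steps["svc"].n_support_  # support vectors per class

print("test accuracy:", model.score(X_test, y_test))
print("support vectors per class:", n_support)
```

For a two-class problem, the sign of `decision_function` indicates which side of the boundary a point falls on, which is what `predict` reports as a class label.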

### Important Concepts
- **Hyperplane**: The boundary SVM creates to separate classes.
- **Support Vectors**: Data points closest to the hyperplane, determining its position.
- **Margin**: The distance between the hyperplane and the support vectors; SVM maximizes this for better generalization.
- **Kernel Trick**: A technique that lets SVM handle nonlinear patterns by using a kernel function to compute inner products in a higher-dimensional space without explicitly calculating the new coordinates.

### Applications
SVMs are used in many fields, from image and speech recognition to natural language processing and bioinformatics, because of their versatility and solid theoretical foundations.

In summary, SVMs are powerful tools for building robust classification models, especially when the distinction between classes is subtle or multidimensional.

---

#Question 2: Explain the difference between Hard Margin and Soft Margin SVM.
#Answer:
A Support Vector Machine (SVM) is a machine learning algorithm, most often used for classification, that looks for the best separating boundary (a hyperplane) between classes. How strictly the SVM enforces this boundary depends on how it treats the margin, which leads to two variants: Hard Margin SVM and Soft Margin SVM. The distinction matters for anyone applying SVMs to real-world data.

**Hard Margin SVM: Strict Separation**

Hard Margin SVM operates under a very strict assumption: the dataset is perfectly linearly separable, with no noise or overlap between classes. The goal is to find a hyperplane that divides the two classes with the largest possible margin—that is, the maximum distance between the hyperplane and the closest data points of each class, which are termed "support vectors".

- **Mathematical Formulation**:
  - For every data point $(x_i, y_i)$, where $y_i$ is the class label ($+1$ or $-1$), Hard Margin SVM enforces:
    
    $$y_i (w \cdot x_i + b) \geq 1$$
    
    Here, $w$ is the weight vector and $b$ is the bias term.

- **Optimization Objective**:
  - The SVM maximizes the margin, which can be written as:
    
    $$\text{Margin} = \frac{2}{\|w\|}$$
    
    So, the optimization problem becomes minimizing $\frac{1}{2}\|w\|^2$ subject to the above constraint.

- **Characteristics**:
  - **No Misclassification**: Every data point must be on the correct side of the margin.
  - **Sensitive to Noise/Outliers**: Even a single mis-labeled or noisy data point can make finding a hard margin impossible.
  - **Use Case**: Works only when data is perfectly linearly separable—rare for real-world datasets.

- **Intuitive Example**:
  - Imagine two classes of points on a plane. If they’re completely separated (no overlaps or outliers), the Hard Margin SVM draws the widest possible strip (margin) between them, such that all points lie outside this strip on the correct side.
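
The constraint $y_i (w \cdot x_i + b) \geq 1$ and the margin $2/\|w\|$ can be checked numerically. The sketch below is an approximation under stated assumptions: scikit-learn's `SVC` only implements the soft-margin formulation, so a very large `C` stands in for a hard margin, and the two clusters are hypothetical toy data generated to be clearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters (hypothetical toy data), labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-3.0, 0.5, size=(20, 2)),
    rng.normal(+3.0, 0.5, size=(20, 2)),
])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
functional_margins = y * (X @ w + b)      # y_i (w . x_i + b) for every point
margin_width = 2.0 / np.linalg.norm(w)    # width of the strip between classes

print("smallest y_i (w.x_i + b):", functional_margins.min())
print("margin width 2/||w||:", margin_width)
print("number of support vectors:", len(clf.support_vectors_))
```

The smallest value of $y_i (w \cdot x_i + b)$ should sit close to 1 (up to solver tolerance), and it is attained by the support vectors.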

**Soft Margin SVM: Allowing Flexibility**

Real-world data rarely allows perfect separation. There may be overlaps, noise, or outliers. Soft Margin SVM addresses this by relaxing the no-misclassification rule, permitting some points to be misclassified or within the margin. This increases the model’s flexibility and generalization to unseen data.

- **Mathematical Formulation**:
  - Introduces "slack variables" ($\xi_i$) for each data point, quantifying how much the constraint is violated:
    
    $$y_i (w \cdot x_i + b) \geq 1 - \xi_i,\qquad \xi_i \geq 0$$
    
    If $\xi_i = 0$, the point is correctly classified and on or outside the margin; $0 < \xi_i \leq 1$ means it lies inside the margin but is not on the wrong side of the hyperplane; $\xi_i > 1$ means it is misclassified.

- **Optimization Objective**:
  - The objective now balances maximizing the margin and minimizing the sum of violations (slack):
    
    $$\text{Minimize: } \frac{1}{2}\|w\|^2 + C\sum_{i=1}^N \xi_i$$
    
    - $C$ is a regularization parameter controlling the trade-off:
      - High $C$: Stricter penalty for misclassification (closer to hard margin).
      - Low $C$: More tolerance for errors, resulting in a wider margin.

- **Characteristics**:
  - **Tolerance for Misclassification**: Some points can violate the margin constraints.
  - **Handles Noisy Data**: More robust to outliers and label errors.
  - **Use Case**: Suitable for nearly all practical scenarios with imperfect, noisy, or overlapping data.

- **Example**:
  - With soft margin SVM, the model might let a few points fall within the margin, or even on the wrong side, in order to achieve a generally better separation of the main bulk of the data. This trade-off leads to a model that is far less likely to overfit on noise.
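
To make the role of $C$ concrete, the sketch below fits a linear SVM at three values of $C$ on overlapping data and reports the margin width $2/\|w\|$ and the total slack $\sum_i \xi_i$. The cluster centers, spread, and the three `C` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so a hard-margin solution is not available.
X, y01 = make_blobs(
    n_samples=200, centers=[[-1.0, -1.0], [1.0, 1.0]],
    cluster_std=1.5, random_state=0,
)
y = np.where(y01 == 0, -1, 1)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    # Slack of each point: xi_i = max(0, 1 - y_i (w . x_i + b)).
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    results[C] = {
        "margin_width": 2.0 / np.linalg.norm(w),
        "total_slack": slack.sum(),
        "n_support": int(clf.n_support_.sum()),
    }
    print(C, results[C])
```

Lowering `C` makes violations cheaper, so the optimizer accepts more total slack in exchange for a wider margin; raising `C` does the opposite.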

**Key Differences and Their Implications**

| Aspect                   | Hard Margin SVM                                 | Soft Margin SVM                                                |
|--------------------------|------------------------------------------------|---------------------------------------------------------------|
| Misclassification        | Not allowed                                    | Allowed (regulated by C parameter)                         |
| Data Requirements        | Perfect linear separability                    | Works with overlapping, noisy, or imperfect data         |
| Robustness to Outliers   | Poor (very sensitive)                          | Good (tolerates outliers and mislabeled points)         |
| Regularization           | None                                           | Controlled by parameter C                                     |
| Use Case                 | Theoretical, clean synthetic data              | Real-world data, almost all practical problems                |

**Practical Impacts**

1. **Hard Margin SVM** is rarely used in practice, as most datasets are not perfectly separable. It's mainly discussed as an idealized case or for understanding the underlying math of SVMs.
2. **Soft Margin SVM** is the default in most SVM implementations, as it offers a key practical advantage. By balancing between maximizing the margin and controlling misclassification, it can generalize much better to new, unseen data.

**Mathematical Insights**

- The parameter $C$ in the Soft Margin SVM is crucial: setting $C$ very high simulates hard margin behavior; setting it low increases tolerance for violations and allows for a smoother, more general decision boundary.
- The optimization problem for Soft Margin SVM is a form of convex quadratic programming, ensuring a single global optimum can be found.

**Visual Understanding**

- Imagine plotting two classes on a 2D plot. A Hard Margin SVM fits the widest possible straight boundary between them, provided there’s zero overlap. If a single point crosses this boundary, no solution is possible.
- A Soft Margin SVM, however, will “bend the rules,” letting some points exist closer or even over the boundary if that results in a better overall separation.

**Outlier Sensitivity and Generalization**

Hard Margin SVM is extremely brittle—one noisy data point can completely alter or prevent a solution. Soft Margin SVM is designed with flexibility, using slack variables to absorb irregularities, allowing it to build a model that generalizes better, rather than memorizing every single training point.

**Summary**

- Hard Margin SVM strictly separates classes with no errors, suitable only for ideal datasets.
- Soft Margin SVM tolerates misclassifications, introducing slack variables and a regularization parameter to manage the trade-off between margin width and error.
- In practice, Soft Margin is almost always preferred: it is robust, generalizes well, and can mold itself to the messiness of real data.

This nuanced balancing act—maximizing the margin, yet remaining lenient enough to handle errors—is what gives the Soft Margin SVM its power and practicality in the field of machine learning.

---

#Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.
#Answer:
The **Kernel Trick** is a fundamental concept that transforms Support Vector Machines (SVMs) from simple linear classifiers into powerful tools for solving complex, non-linear problems.

## What is the Kernel Trick in SVM?

An SVM constructs a decision boundary (a hyperplane) to separate different classes in the data. For linear problems—where the classes can be separated with a straight line (or hyperplane in higher dimensions)—this is straightforward. However, **many real-world datasets are not linearly separable**: their data points form intricate, non-linear patterns that a simple straight line can’t crack.

**The Kernel Trick addresses this challenge by allowing SVMs to classify data that isn't linearly separable in its original space.** Here’s how:

### 1. Feature Mapping for Nonlinear Problems

Imagine a dataset in two dimensions (2D) shaped like concentric circles—say, the inner circle is one class and the outer ring another. No straight line can separate these classes in 2D. If you **map** these data points to a higher-dimensional space (for example, from 2D to 3D), you might find that they become linearly separable: a plane can now distinguish between them.

Formally, this transformation is accomplished using a **feature map**: a function φ(x) that transforms the input vector x into a new, higher-dimensional space. In this new space, the previously non-linear problem may become linear, making it possible for an SVM to draw a separating hyperplane.

### 2. The Heart of the Kernel Trick

But here’s the issue: **explicitly mapping data to higher dimensions is computationally expensive, sometimes even infeasible, especially as the number of dimensions grows**. The innovation of the kernel trick is that it **computes the inner products of the data in the higher-dimensional space—without ever explicitly calculating the transformation**. It leverages the insight that the SVM algorithm fundamentally relies on inner products (dot products) of sample vectors.

So, instead of mapping to higher dimensions and then calculating φ(x)·φ(y), you use a kernel function K(x, y) that computes this value directly from the original, lower-dimensional data:

> **K(x, y) = φ(x)·φ(y)**

As a result, you can efficiently learn very complex boundaries—sometimes in infinite-dimensional spaces—without heavy computation.
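
The identity K(x, y) = φ(x)·φ(y) can be verified by hand for a small case. The sketch below assumes the homogeneous degree-2 polynomial kernel K(x, y) = (x·y)² on 2-D inputs, whose explicit feature map is φ(x) = (x₁², √2·x₁x₂, x₂²); the function names and the two sample vectors are illustrative only.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the kernel K(x, y) = (x . y)^2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(x, y):
    """The same quantity computed directly in the original 2-D space."""
    return float(np.dot(x, y)) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(y)))  # map to 3-D, then take the dot product
implicit = poly2_kernel(x, y)             # never leaves 2-D

print(explicit, implicit)
```

Both values agree up to floating-point rounding. For this tiny case the saving is negligible, but the explicit map grows quickly with input dimension and polynomial degree, while the kernel evaluation stays a single dot product.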

## Types of Kernel Functions

There are many kinds of kernels, each inducing a different type of mapping. Some of the most widely used include:

- **Linear Kernel**: For linearly separable data. *K(x, y) = x·y*
- **Polynomial Kernel**: Maps to higher-degree polynomial feature spaces. *K(x, y) = (x·y + c)^d*
- **Radial Basis Function (RBF) Kernel / Gaussian Kernel**: Popular for non-linear data. *K(x, y) = exp(-γ||x - y||²)*
- **Sigmoid Kernel**: Mimics neural networks. *K(x, y) = tanh(α x·y + c)*

Each kernel brings its own bias and is suitable for different types of data.

## Example: The Radial Basis Function (RBF) Kernel

### Formula
The **RBF kernel** (also called the Gaussian kernel) is defined as:

$$
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right)
$$

where:

- $\gamma$ is a positive hyperparameter that defines how far the influence of a single training example reaches (a larger $\gamma$ means a shorter reach),
- $\|\mathbf{x} - \mathbf{x}'\|$ is the Euclidean distance between vectors.
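
As a quick check of the formula, the sketch below evaluates the RBF kernel by hand and compares it with scikit-learn's `rbf_kernel`. The two points and the three `gamma` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])  # squared Euclidean distance to a is 2

manual_values = {}
library_values = {}
for gamma in (0.1, 1.0, 10.0):
    manual_values[gamma] = rbf(a, b, gamma)
    library_values[gamma] = rbf_kernel(
        a.reshape(1, -1), b.reshape(1, -1), gamma=gamma
    )[0, 0]
    print(f"gamma={gamma}: manual={manual_values[gamma]:.6f}, "
          f"sklearn={library_values[gamma]:.6f}")
```

The kernel value for the same pair of points shrinks as `gamma` grows, which is the sense in which a larger `gamma` shortens the reach of each training example.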

### Why Use the RBF Kernel?

- **Handles Nonlinear Data**: The RBF kernel can project data into an *infinite-dimensional* space, so it’s exceptionally adept at dealing with data where classes are separated by highly complex, curved boundaries.
- **General-Purpose**: It works well when there’s little prior knowledge about the data’s structure. When in doubt, many practitioners start with the RBF kernel due to its flexibility and reliability.

### Practical Use Case

Suppose you want to classify data where classes form clusters or blobs in the feature space, and the boundary between them is not a straight line. An SVM with an RBF kernel can draw a non-linear, flexible boundary that wraps around these clusters, separating them accurately.

For example, image classification tasks (such as distinguishing digits or handwritten characters), bioinformatics (gene classification), and real-world sensor data commonly use the RBF kernel.
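
A small experiment illustrates the point on the concentric-circles case mentioned earlier. This is a sketch on synthetic data from `make_circles`; the noise level, ring ratio, and split are assumptions, and results on real data will differ.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Inner circle is one class, outer ring the other.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

No straight line separates the two rings, so the linear kernel has little to work with, while the RBF kernel can draw a closed boundary around the inner circle.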

#### Key Hyperparameters

1. **C**: The regularization parameter controls the trade-off between maximizing margin and minimizing misclassification error.
2. **γ (gamma)**: Controls how far the influence of a single training example reaches. Higher values shrink that reach and lead to more complex models (risk of overfitting), while lower values extend it and yield smoother decision boundaries.

## Why Is the Kernel Trick So Important?

- **Efficiency**: Avoids the computational burden of explicit feature mapping.
- **Flexibility**: Enables SVMs to fit almost any complex boundary.
- **Theoretical Robustness**: The approach is grounded in rigorous mathematics. Mercer's Theorem characterizes which functions are valid kernels: a symmetric, positive semi-definite function corresponds to an inner product in some feature space.

## Intuitive Analogy

Imagine drawing a squiggly line to separate red and blue dots on a 2D sheet of paper. With the kernel trick, you metaphorically “lift” parts of the paper into 3D (without ever doing so physically) so a straight cut in 3D does what would be a contorted, impossible curve in 2D. The trick is that you perform this mathematical “lifting” implicitly and efficiently.

## Conclusion

The **kernel trick is the secret weapon that empowers SVMs to solve complex, nonlinear problems by operating in higher-dimensional spaces—efficiently and without explicit computation.** The RBF kernel is a prime example, allowing SVMs to capture subtle patterns, making these models highly effective for a range of real-world applications, from computer vision to bioinformatics.

By understanding and correctly tuning the kernel and its parameters, you unlock the true versatility and power of SVMs for modern machine learning.

---

#Question 4: What is a Naive Bayes Classifier, and why is it called “naive”?
#Answer:
A **Naive Bayes Classifier** is a simple, yet powerful supervised machine learning algorithm used for classification tasks. Its strength lies in its foundation on **Bayes’ Theorem**, which allows it to calculate the probability that a data point belongs to a particular class, based on the values of its features.

### How the Naive Bayes Classifier Works

- **Probabilistic Model:** It assigns class labels to cases using probability, specifically computing the probability of each class given the input features and selecting the class with the highest probability.
- **Bayes’ Theorem:** The core principle is Bayes’ Theorem, which relates the conditional and marginal probability of random events:

  $$
  P(y|X) = \frac{P(X|y) \cdot P(y)}{P(X)}
  $$

  - $P(y|X)$: Posterior probability of class $y$ given features $X$
  - $P(y)$: Prior probability of class $y$
  - $P(X|y)$: Likelihood of features $X$ given class $y$
  - $P(X)$: Marginal probability of features (acts as a normalizing constant)

- **Naive Assumption:** The “naïve” aspect refers to a simplifying assumption: **all features used for classification are considered independent of each other, given the class**.

  In formal terms, for features $x_1, x_2, ..., x_n$:

  $$
  P(x_1, x_2, ..., x_n|y) = \prod_{i=1}^n P(x_i|y)
  $$

  This means that, within a given class, the value of one feature is assumed to carry no information about the value of any other feature, even if the features are actually correlated in reality.

- **Classification Rule:** The classifier computes the posterior probability for each class and assigns the class with the highest posterior probability.
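
The arithmetic behind the classification rule can be followed on a two-feature toy case. All probabilities below are hypothetical numbers chosen for illustration; they are not estimated from any dataset.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
priors = {"spam": 0.4, "ham": 0.6}

# P(word present | class) for two binary features.
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

email = {"free": 1, "meeting": 0}  # "free" present, "meeting" absent

joint = {}
for label, prior in priors.items():
    p = prior
    for word, present in email.items():
        p_word = likelihood[label][word]
        # Naive assumption: multiply the per-feature terms independently.
        p *= p_word if present else (1.0 - p_word)
    joint[label] = p  # P(X | y) * P(y)

evidence = sum(joint.values())  # P(X), the normalizing constant
posterior = {label: p / evidence for label, p in joint.items()}
prediction = max(posterior, key=posterior.get)

print(posterior)
print("predicted class:", prediction)
```

Here the unnormalized scores are 0.4 × 0.8 × 0.9 = 0.288 for spam and 0.6 × 0.1 × 0.5 = 0.03 for ham, so the posterior for spam is 0.288 / 0.318, roughly 0.91. Because $P(X)$ is the same for every class, the predicted class can be read off the unnormalized scores alone.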

### Why “Naive”?

The algorithm is termed **“naive”** because it assumes **feature independence**—that is, it treats each feature as if it provides completely independent information about the outcome, which is rarely the case in real-world data. In other words, it ignores any possible correlation between features.

Despite this often unrealistic assumption, Naive Bayes Classifiers tend to perform well in practice, particularly for tasks like spam filtering, text classification, and sentiment analysis—especially when features are numerous and the independence assumption is not overly violated.

### Key Points and Real-World Implications

- **Advantages:**
  - **Fast and scalable:** It is simple to implement and can efficiently process large datasets.
  - **Works with small amounts of data:** Requires less training data for parameter estimation compared to more complex models.
  - **Performs well on high-dimensional data:** Often used in document and email classification because word occurrences (as features) are considered approximately independent.

- **Disadvantages:**
  - **Feature independence is rare:** If features are correlated or not truly independent, prediction accuracy may suffer.
  - **May be less accurate:** For complex tasks with highly dependent features, more advanced algorithms might outperform Naive Bayes.

- **Popular Applications:**
  - Spam filtering in emails.
  - Sentiment analysis in text.
  - Document categorization.

**In summary:**  
A Naive Bayes Classifier is called “naive” because it assumes that all input features are conditionally independent of one another given the class label. This assumption greatly simplifies computation and often provides surprisingly effective results, even though it is strong and usually unrealistic.

---

#Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naive Bayes variants. When would you use each one?
#Answer:
The three main variants of Naive Bayes classifiers—**Gaussian, Multinomial, and Bernoulli**—differ in how they model feature distributions, which determines their suitability for different data types and applications.

## Gaussian Naive Bayes

**What it is:**  
This variant assumes that the input features are **continuous** and that their values, for each class, follow a **Gaussian (normal) distribution**. For each feature within each class, the classifier calculates the mean and variance from the training data, then uses the probability density function of the normal distribution to estimate the likelihood of the feature values given the class.

**Mathematical model:**  
$$
P(x_i|y) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)
$$
where $x_i$ is the observed value, $\mu$ is the mean, and $\sigma$ is the standard deviation for feature $x_i$ under class $y$.

**Use case:**  
Use **Gaussian Naive Bayes** when your features are **real-valued, continuous, and roughly follow a bell-curve distribution**. Typical applications include:
- Medical data (e.g., predicting disease based on lab measurements)
- Weather prediction (temperature, humidity)
- Any problem where input attributes are numeric and may approximate a normal distribution.

## Multinomial Naive Bayes

**What it is:**  
This variant is designed for **discrete (count) data**. Most commonly, it’s applied when features represent the number of times an event occurs, such as word counts in text classification. It models the likelihood of the features using a **multinomial distribution**.

**Mathematical model:**  
Estimates the probabilities of observing specific counts of each “feature” (e.g., word) given the class, leveraging maximum likelihood with smoothing (e.g., Laplace smoothing for unseen words).
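
The smoothed estimate can be written out explicitly: for word $i$ and class $y$, it is (count of word $i$ in class $y$ + α) divided by (total word count in class $y$ + α × vocabulary size). The sketch below checks this against `MultinomialNB`; the count matrix is hypothetical toy data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are 3 vocabulary words.
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 4],
    [0, 2, 3],
])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing

clf = MultinomialNB(alpha=alpha).fit(X, y)

# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual = np.vstack([
    (X[y == k].sum(axis=0) + alpha) / (X[y == k].sum() + alpha * X.shape[1])
    for k in clf.classes_
])

print(manual)
print(np.allclose(manual, np.exp(clf.feature_log_prob_)))
```

The third word never occurs in class 0, yet its smoothed probability is 1/9 rather than 0, so a single unseen word cannot zero out the whole product of likelihoods.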

**Use case:**  
Use **Multinomial Naive Bayes** when your data consists of **discrete counts or frequencies**, especially:
- Text classification (spam detection, topic categorization, sentiment analysis)
- Document categorization using term frequencies or tf-idf scores (tf-idf values are fractional rather than true counts, but they often work in practice)
- Problems involving count data (e.g., number of product purchases).

## Bernoulli Naive Bayes

**What it is:**  
The Bernoulli variant is suitable for **binary (yes/no, 0/1, present/absent) data**. It models each feature with a **Bernoulli distribution** (a binomial distribution with a single trial), considering only the presence or absence of each feature, regardless of how many times it occurs.

**Mathematical model:**  
For every feature $x_i$, the probability is given by:
$$
P(x_i|y) = p(i|y)^{x_i} (1-p(i|y))^{1-x_i}
$$
where $p(i|y)$ is the probability that the $i$-th feature appears in a sample belonging to class $y$, and $x_i$ is 0 or 1.
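
The Bernoulli likelihood can be evaluated directly from the fitted probabilities $p(i|y)$ and compared with `BernoulliNB`. The binary matrix and the new sample below are hypothetical toy data.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary features (1 = word present, 0 = word absent).
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([0, 0, 1, 1])

clf = BernoulliNB(alpha=1.0).fit(X, y)
p = np.exp(clf.feature_log_prob_)  # p(i|y), shape (n_classes, n_features)

x_new = np.array([1, 0, 1])

# log P(y) + sum_i [ x_i log p(i|y) + (1 - x_i) log(1 - p(i|y)) ]
log_joint = clf.class_log_prior_ + (
    x_new * np.log(p) + (1 - x_new) * np.log(1.0 - p)
).sum(axis=1)

manual_proba = np.exp(log_joint - log_joint.max())
manual_proba /= manual_proba.sum()

print(manual_proba)
print(clf.predict_proba(x_new.reshape(1, -1))[0])
```

Unlike the multinomial model, an absent feature contributes a factor of $1 - p(i|y)$, so the absence of a word counts as evidence too.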

**Use case:**  
Use **Bernoulli Naive Bayes** when:
- **Features are binary** (e.g., presence/absence of a word in a document)
- The focus is on the **occurrence** of features rather than their count
- Example applications include binary text features (bag-of-words with binary indicators), simple document classification, real-time or streaming binary event detection, and basic sentiment analysis or spam detection.

## Comparison Table

| Variant        | Assumes Features Are        | Suitable Data                           | Typical Use Cases                                    |
|----------------|----------------------------|-----------------------------------------|------------------------------------------------------|
| Gaussian       | Continuous, real-valued     | Numeric features; normal distribution   | Medical data, sensor readings, numeric attributes    |
| Multinomial    | Discrete, count-based      | Count or frequency vectors              | Text classification (word counts, tf-idf), NLP tasks |
| Bernoulli      | Binary (0/1)               | Presence/absence (Boolean)              | Binary text features, spam detection, sentiment      |

## Choosing the Right Variant

- **Choose Gaussian Naive Bayes** when your features are continuous and approximately normally distributed within each class.
- **Choose Multinomial Naive Bayes** for discrete counts (e.g., term frequency in document classification).
- **Choose Bernoulli Naive Bayes** for binary/Boolean features, where you care about the presence or absence of features, not their counts.

---




In [3]:
'''
Question 6: Write a Python program to:
● Load the Iris dataset
● Train an SVM Classifier with a linear kernel
● Print the model's accuracy and support vectors.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#load the iris dataset
iris= load_iris()
df= pd.DataFrame(iris.data, columns= iris.feature_names)
df["Target"]= iris.target

#split the data into train and test
X= df.drop("Target", axis= 1)
y= df["Target"]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)
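#train an SVM with a linear kernel
classifier= SVC(kernel= "linear")
classifier.fit(X_train, y_train)

#evaluate on the held-out test set
y_pred= classifier.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))

#support_vectors_ holds the training points that define the decision boundaries
print("Support Vectors: ", classifier.support_vectors_)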

Accuracy:  1.0
Support Vectors:  [[4.8 3.4 1.9 0.2]
 [5.1 3.3 1.7 0.5]
 [4.5 2.3 1.3 0.3]
 [5.6 3.  4.5 1.5]
 [5.4 3.  4.5 1.5]
 [6.7 3.  5.  1.7]
 [5.9 3.2 4.8 1.8]
 [5.1 2.5 3.  1.1]
 [6.  2.7 5.1 1.6]
 [6.3 2.5 4.9 1.5]
 [6.1 2.9 4.7 1.4]
 [6.5 2.8 4.6 1.5]
 [6.9 3.1 4.9 1.5]
 [6.3 2.3 4.4 1.3]
 [6.3 2.5 5.  1.9]
 [6.3 2.8 5.1 1.5]
 [6.3 2.7 4.9 1.8]
 [6.  3.  4.8 1.8]
 [6.  2.2 5.  1.5]
 [6.2 2.8 4.8 1.8]
 [6.5 3.  5.2 2. ]
 [7.2 3.  5.8 1.6]
 [5.6 2.8 4.9 2. ]
 [5.9 3.  5.1 1.8]
 [4.9 2.5 4.5 1.7]]


In [11]:
'''
Question 7:  Write a Python program to:
● Load the Breast Cancer dataset
● Train a Gaussian Naïve Bayes model
● Print its classification report including precision, recall, and F1-score.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

#load the breast cancer dataset
breast_cancer= load_breast_cancer()
df= pd.DataFrame(breast_cancer.data, columns= breast_cancer.feature_names)
df["Target"]= breast_cancer.target
df.head()

#train-test split
X= df.drop("Target", axis= 1)
y= df["Target"]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)

classifier= GaussianNB()
classifier.fit(X_train, y_train)
y_pred= classifier.predict(X_test)

#per-class report, followed by the metrics for the positive class (label 1 = benign)
from sklearn.metrics import precision_score, recall_score, f1_score
print(classification_report(y_test, y_pred))
print("Precision Score:", precision_score(y_test, y_pred))
print("Recall Score:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        43
           1       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Precision Score: 0.9594594594594594
Recall Score: 1.0
F1 Score: 0.9793103448275862


In [16]:
'''
Question 8: Write a Python program to:
● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best
C and gamma.
● Print the best hyperparameters and accuracy.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

wine = load_wine()
wine.keys()
df= pd.DataFrame(wine.data, columns= wine.feature_names)
df["Target"]= wine.target
df.head()

#train-test split
X= df.drop("Target", axis= 1)
y= df["Target"]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)

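#grid over C and gamma; SVC uses the RBF kernel by default
param_grid= {"C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]}
classifier= SVC()
#5-fold cross-validation on the training set only
grid_search= GridSearchCV(classifier, param_grid, cv= 5)
grid_search.fit(X_train, y_train)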
print("Best Hyperparameters: ", grid_search.best_params_)
y_pred= grid_search.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))


Best Hyperparameters:  {'C': 100, 'gamma': 0.001}
Accuracy:  0.8333333333333334
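
The Wine features sit on very different scales (for example, `proline` is in the hundreds while `hue` is close to 1), and RBF SVMs are sensitive to feature scale. One possible variation, which is not part of the answer above, is to put a `StandardScaler` in a pipeline so that the grid search tunes `C` and `gamma` on standardized features. The grid and split mirror the cell above; the pipeline step names are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold is scaled
# using statistics from its own training portion only.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [1, 0.1, 0.01, 0.001],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5)
grid_search.fit(X_train, y_train)

scaled_accuracy = accuracy_score(y_test, grid_search.predict(X_test))
print("Best Hyperparameters: ", grid_search.best_params_)
print("Accuracy: ", scaled_accuracy)
```

Keeping the scaler inside the pipeline, rather than scaling the whole training set up front, avoids leaking validation-fold statistics into the cross-validation scores.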


In [19]:
'''
Question 9: Write a Python program to:
● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
sklearn.datasets.fetch_20newsgroups).
● Print the model's ROC-AUC score for its predictions.
'''
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

categories= ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]
newsgroups= fetch_20newsgroups(subset= "train", categories= categories)
newsgroups.keys()
df= pd.DataFrame(newsgroups.data, columns= ["Text"])
df["Target"]= newsgroups.target
df.head()

X= df["Text"]
y= df["Target"]
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state= 42)
#bag-of-words counts; fit the vocabulary on the training split only
vectorizer= CountVectorizer()
X_train_vec= vectorizer.fit_transform(X_train)
X_test_vec= vectorizer.transform(X_test)
classifier= MultinomialNB()
classifier.fit(X_train_vec, y_train)
#ROC-AUC needs class probabilities; with four classes, average the one-vs-rest AUCs
y_probs = classifier.predict_proba(X_test_vec)
roc_auc = roc_auc_score(y_test, y_probs, multi_class='ovr')
print("ROC-AUC Score: ", roc_auc)




ROC-AUC Score:  0.9913138207609082


#Question 10: Imagine you’re working as a data scientist for a company that handles email communications. Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:
● Text with diverse vocabulary
● Potential class imbalance (far more legitimate emails than spam)
● Some incomplete or missing data
Explain the approach you would take to:
● Preprocess the data (e.g. text vectorization, handling missing data)
● Choose and justify an appropriate model (SVM vs. Naive Bayes)
● Address class imbalance
● Evaluate the performance of your solution with suitable metrics
And explain the business impact of your solution.
#Answer:



## 1. Preprocessing the Data

### a) Handling Text Data with Diverse Vocabulary
- **Text Cleaning**: Remove or normalize email metadata (headers, signatures), HTML tags, stop words, punctuation, and perform lowercasing. Remove extra whitespace.
- **Tokenization**: Split emails into words or meaningful subunits.
- **Vectorization**:
  - Use **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorization to represent emails numerically, which helps emphasize important but less frequent words.
  - Alternatively, or complementarily, use **n-grams** (e.g., unigrams + bigrams) for capturing context or phrases common in spam.
- **Controlling Vocabulary Size**: Since email vocabulary can be large and diverse, apply **feature selection** (e.g., keeping only the most frequent or most informative terms) or dimensionality reduction to reduce noise.

### b) Handling Missing/Incomplete Data
- For completely empty emails or corrupted records:
  - Clean or discard them if they offer no useful information.
- For emails with partially missing portions (missing subject or body):
  - Consider imputing missing parts as empty strings, or add flags indicating missingness.
- Naïve Bayes and SVM typically handle sparse data well, so missing tokens manifest as zero entries.
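
The preprocessing steps above could be sketched as follows. The three emails, the column names (`subject`, `body`), and the vectorizer settings are hypothetical and serve only to show the mechanics of imputing, flagging, and vectorizing.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy emails; None marks a missing subject or body.
emails = pd.DataFrame({
    "subject": ["Win a FREE prize now", None, "Meeting agenda"],
    "body": [
        "Click here to claim your free prize!",
        "See you at lunch tomorrow",
        None,
    ],
})

# Flag missingness before imputing, then treat missing parts as empty strings.
emails["subject_missing"] = emails["subject"].isna().astype(int)
emails["body_missing"] = emails["body"].isna().astype(int)
text = emails["subject"].fillna("") + " " + emails["body"].fillna("")

# Drop records that contain no text at all.
text = text[text.str.strip() != ""]

# Lowercasing, stop-word removal, unigrams + bigrams, capped vocabulary size.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1, 2),
    max_features=5000,
)
X_tfidf = vectorizer.fit_transform(text)

print(X_tfidf.shape)
print(sorted(vectorizer.vocabulary_)[:10])
```

The missingness flags are kept as separate columns; if they prove useful, they can be stacked next to the TF-IDF matrix as extra features.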

## 2. Choosing and Justifying the Model – SVM vs. Naive Bayes

### Naive Bayes
- **Pros**:
  - Fast to train and predict, especially with large text datasets.
  - Performs well with **high-dimensional sparse datasets** like text.
  - Naturally handles probabilistic modeling and provides interpretability.
- **Cons**:
  - Assumes **feature independence**, which is often violated in text, although the classifier is frequently robust to such violations.
  - May perform poorly when contextual word interdependencies are important.

### Support Vector Machine (SVM)
- **Pros**:
  - Effective in **high-dimensional spaces** and can find decision boundaries that maximize margins.
  - With a linear kernel it works well for text data, which is often close to linearly separable in a high-dimensional TF-IDF space; nonlinear kernels (e.g., RBF) are available when classes are not linearly separable.
  - Often leads to better accuracy than Naïve Bayes if well-tuned.
- **Cons**:
  - Slower to train on very large datasets.
  - More hyperparameters to tune (e.g., regularization C), requiring more computational resources.
  
### Recommendation:
- Start with **Multinomial Naive Bayes** due to its speed and solid baseline performance for spam classification.
- If computational resources allow and performance needs improvement, train a **linear SVM** on TF-IDF features.
- If the dataset is very large and the feature space is huge, a linear SVM trained with stochastic gradient descent (SGD) can scale.
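
A minimal sketch of this progression with scikit-learn, using a tiny made-up dataset (1 = spam, 0 = not spam). The hyperparameters are library defaults or arbitrary assumptions, not recommendations, and a real comparison would use a held-out set or cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data, for illustration only.
texts = [
    "win a free prize now",
    "claim your free reward today",
    "cheap loans click here",
    "free money guaranteed winner",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch tomorrow at noon",
    "project deadline moved to friday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = {
    # Fast baseline.
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    # Linear SVM on TF-IDF features.
    "linear_svm": make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0)),
    # Linear SVM objective (hinge loss) optimized with SGD for large corpora.
    "sgd_svm": make_pipeline(
        TfidfVectorizer(), SGDClassifier(loss="hinge", random_state=0)
    ),
}

for model in models.values():
    model.fit(texts, labels)

new_emails = ["free prize winner", "report for monday meeting"]
predictions = {
    name: model.predict(new_emails).tolist() for name, model in models.items()
}
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on training data only, which avoids leakage when the same object is later cross-validated.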
  
---

## 3. Addressing Class Imbalance

Since legitimate emails typically far outnumber spam emails:

- **Resampling techniques**:
  - **Oversampling** the minority class (Spam) — e.g., SMOTE (Synthetic Minority Oversampling Technique).
  - **Undersampling** the majority class (Not Spam) carefully to avoid losing important data.
- **Class Weighting**:
  - For models like SVM, use class weights inversely proportional to class frequencies to penalize misclassification of the minority class more.
  - For Naïve Bayes, class prior probabilities can be adjusted to reflect real class proportions.
- **Threshold Tuning**:
  - Adjust the classification decision threshold to balance precision and recall according to business needs.
- **Anomaly or Outlier Detection**:
  - When spam is extremely rare relative to legitimate email, it can sometimes be treated as an anomaly detection problem instead.
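
As an illustration of class weighting, the sketch below computes weights inversely proportional to class frequency using the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. The 90/10 split and the `balanced_class_weights` helper are assumptions made for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Assumed imbalance: 90 legitimate emails (0) and 10 spam emails (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# The minority class receives the larger weight, so each misclassified
# spam email costs more during training.
print(weights)
```

In practice the same effect is usually obtained by passing `class_weight="balanced"` to a scikit-learn estimator such as `LinearSVC`, rather than computing the weights by hand.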

## 4. Evaluating Model Performance

### Appropriate Metrics for Imbalanced Classification
- **Precision**: Proportion of predicted spam emails that are actually spam.
- **Recall (Sensitivity)**: Proportion of actual spam emails correctly detected (important to catch spam).
- **F1 Score**: Harmonic mean of Precision and Recall — good overall performance metric.
- **ROC-AUC**: Measures model’s ability to distinguish between classes across thresholds.
- **PR-AUC (Precision-Recall Area Under Curve)**: More informative than ROC when classes are highly imbalanced.
- **Confusion Matrix**: To visualize True Positives, False Positives, False Negatives, and True Negatives.
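
To make the definitions concrete, here is a small worked example that derives the metrics from confusion-matrix counts. The counts are invented for illustration, with spam as the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Invented counts: 120 spam emails and 880 legitimate emails in total.
m = classification_metrics(tp=80, fp=20, fn=40, tn=860)

# precision = 80 / 100  = 0.80
# recall    = 80 / 120  ≈ 0.667
# f1        = 160 / 220 ≈ 0.727
# accuracy  = 940 / 1000 = 0.94
print(m)
```

Accuracy looks strong at 0.94 even though a third of the spam is missed, which is why precision, recall, and F1 are the more informative numbers on imbalanced data.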
  
### Business Priorities Affect Metric Choice
- If **false negatives** (spam not detected) cost more (e.g., user annoyance, security risks), prioritize **high recall**.
- If **false positives** (legitimate emails marked as spam) cost more (e.g., lost important emails), prioritize **high precision**.
- Adjust the model's decision threshold accordingly to optimize for the business objective.
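
One simple way to turn such a priority into a threshold is to pick the cutoff that maximizes recall while keeping precision above a required floor. The sketch below assumes the model outputs a spam score per email; the scores, labels, and helper names are made up for illustration.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when emails scoring >= threshold are flagged as spam."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(preds, labels))
    fp = sum(p and y == 0 for p, y in zip(preds, labels))
    fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Return (threshold, precision, recall) with the best recall that meets the floor."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall_at(scores, labels, t)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best


# Made-up validation scores (higher = more spam-like) and true labels.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

# False positives are costly here, so require at least 75% precision.
best = pick_threshold(scores, labels, min_precision=0.75)
print(best)
```

Raising `min_precision` pushes the chosen threshold up and recall down, which makes the trade-off between lost legitimate mail and missed spam explicit.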

## 5. Business Impact

### Benefits
- **Improved User Experience:** Automated spam filtering reduces clutter, improves email relevance, and reduces risk from phishing or malware.
- **Operational Efficiency:** Reduces manual email triaging and support/resolution costs.
- **Security:** Early detection blocks suspicious emails, protecting company infrastructure and end users.
- **Cost Savings:** Minimizes losses from spam-induced workflow disruption or data breaches.
- **Customer Trust:** Reliable filtering supports brand reputation and customer satisfaction.

### Risks of Poor Classification
- Legitimate emails marked as spam (false positives) might lead to missed opportunities or client dissatisfaction.
- Spam emails wrongly classified as legitimate (false negatives) can lead to fraud, phishing attacks, or malware infection.

### Final Recommendation
- Deploy the solution incrementally, starting with a robust baseline (e.g. Multinomial Naïve Bayes with TF-IDF).
- Continuously monitor metrics, user feedback, and periodically retrain the model with updated data.
- Consider ensemble approaches or hybrid filtering systems combining machine learning with rule-based filters for maximum efficacy.

# Summary

| Step | Approach |
|------|----------|
| Preprocessing | Clean text, TF-IDF vectorization, handle missing data by imputation or discarding empty samples |
| Model Choice | Start with Multinomial Naïve Bayes; optionally deploy linear SVM for improved accuracy |
| Address Class Imbalance | Use class weighting, oversampling/undersampling, and tune classification threshold |
| Evaluation Metrics | Precision, Recall, F1, ROC-AUC, Precision-Recall AUC; choose metric as per business impact |
| Business Impact | Improves user experience, security, operational efficiency; reduces risks with ongoing monitoring |

---




