# Question 1: What is a Support Vector Machine (SVM), and how does it work?

A **Support Vector Machine (SVM)** is a powerful supervised machine learning algorithm used for both **classification** and **regression** tasks, but it is most commonly applied in classification problems.

## Working:

1. **Decision Boundary (Hyperplane):**
   - The main idea of SVM is to find a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional surface) that best separates the data points of different classes.
   - The goal is to maximize the margin, i.e., the distance between the hyperplane and the nearest data points from each class.

2. **Support Vectors:**
   - The data points that are closest to the hyperplane are called **support vectors**.
   - These points are critical because they "support" the position and orientation of the hyperplane.

3. **Margin:**
   - The margin is the distance between the hyperplane and the nearest support vector.
   - SVM tries to maximize this margin to improve the generalization ability of the model.

4. **Linear vs. Non-linear Separation:**
   - If the data is **linearly separable**, SVM finds a separating hyperplane directly in the original feature space.
   - If the data is **not linearly separable**, SVM uses the **kernel trick** to work implicitly in a higher-dimensional space where a linear separation may be possible.

5. **Kernel Trick:**
   - A kernel function computes the inner product of two data points in a higher-dimensional feature space without explicitly computing their coordinates in that space.
   - Common kernels:
     - Linear Kernel
     - Polynomial Kernel
     - Radial Basis Function (RBF) Kernel
     - Sigmoid Kernel
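
The hyperplane, support vectors, and margin described above can be made concrete with a small sketch. It assumes scikit-learn's `SVC` and a hypothetical six-point 2D dataset. A very large `C` is used to approximate a hard margin, and the full width between the two margin boundaries is computed as $2 / \lVert w \rVert$ (twice the distance from the hyperplane to a support vector).

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups in 2D
X = np.array([[0, 0], [1, 0], [0, 1],    # class 0
              [3, 3], [4, 3], [3, 4]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]
margin_width = 2 / np.linalg.norm(w)    # distance between the two margin boundaries

print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the points nearest the boundary should show up in `support_vectors_`; the remaining points could be moved further away without changing the fitted hyperplane.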


## Intuition:
- The decision boundary is determined only by the **most critical data points** (support vectors); points far from the margin do not affect it.
- By maximizing the margin, SVM reduces the risk of overfitting.

## Example:
- Imagine we have data of patients with two features (e.g., blood pressure and cholesterol level).
- We want to classify whether they are at **high risk** or **low risk** of disease.
- SVM will draw the best hyperplane between these two groups so that:
  - Patients on one side are classified as **low risk**.
  - Patients on the other side are classified as **high risk**.


## Advantages of SVM:
- Effective in high-dimensional spaces.
- Works well when there is a clear margin of separation between classes.
- Flexible with different kernel functions.

## Limitations of SVM:
- Can be slow with very large datasets.
- Choosing the right kernel and parameters can be tricky.
- Less effective if classes are heavily overlapping.

---


# Question 2: Explain the difference between Hard Margin and Soft Margin SVM.

In Support Vector Machines (SVM), the concept of **margins** is central to how the model separates data points into classes.  
The margin is the distance between the separating hyperplane and the closest data points (support vectors).  


##  Hard Margin SVM:
- Assumes that the data is **perfectly linearly separable** (no misclassifications).
- The algorithm finds a hyperplane that separates the classes **without any errors**.
- Every data point must lie **outside or exactly on the margin** boundaries.

**Advantages:**
- Provides a very strict boundary.
- Works well if data is noise-free and perfectly separable.

**Limitations:**
- Very sensitive to outliers or noise.
- Not suitable for most real-world datasets.


##  Soft Margin SVM:
- Introduces flexibility by allowing **some misclassifications**.
- Adds a parameter **C (regularization parameter)**:
  - Large **C** → Less tolerance for misclassification (model tries to classify all points correctly, may overfit).
  - Small **C** → More tolerance for misclassification (wider margin, better generalization).

**Advantages:**
- More robust to noise and outliers.
- Works well with real-world data that is not perfectly separable.

**Limitations:**
- Choosing the right **C** value is important.
- May still struggle with very noisy datasets.


##  Key Difference:
- **Hard Margin SVM**: No misclassification allowed, only works when data is perfectly separable.  
- **Soft Margin SVM**: Allows misclassification with a tradeoff controlled by parameter **C**, making it practical for real-world scenarios.
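
The effect of **C** can be checked with a short sketch. It assumes synthetic, overlapping data from `make_blobs` (the centers and spread are arbitrary choices) and compares the margin width $2 / \lVert w \rVert$ and the number of support vectors for a small and a large `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin_width = 2 / np.linalg.norm(clf.coef_[0])
    n_support = len(clf.support_vectors_)
    results[C] = (margin_width, n_support)
    print(f"C={C}: margin width={margin_width:.3f}, support vectors={n_support}")
```

A smaller `C` should give the wider margin, since margin violations are penalized less.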


## Example
- **Hard Margin**: A strict teacher who doesn’t allow any mistakes.  
- **Soft Margin**: A practical teacher who allows a few mistakes but focuses on overall learning and generalization.  

---

# Question 3: What is the Kernel Trick in SVM? Give one example of a kernel and explain its use case.

## Kernel Trick:
The **Kernel Trick** is a mathematical technique used in Support Vector Machines (SVM) to handle **non-linearly separable data**.  
Instead of explicitly transforming data into a higher-dimensional space, the kernel function computes the **inner product in that space directly**.  

This allows SVM to find a **linear decision boundary in a higher-dimensional feature space**, which corresponds to a **non-linear boundary in the original space**.


## Importance
- Many real-world datasets are not linearly separable.
- The kernel trick enables SVM to handle complex boundaries without the heavy computational cost of explicitly mapping data into higher dimensions.
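
As a small check of the idea, the sketch below uses a degree-2 polynomial kernel on 2D points (chosen here because its feature map is easy to write out; `phi` and `poly2_kernel` are hypothetical helper names). The kernel value computed in the original space should equal the inner product after the explicit mapping.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D point (hypothetical helper)."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product after mapping to 3D
implicit = poly2_kernel(x, z)       # same quantity, computed in the original 2D space

print(explicit, implicit)
```

The kernel never builds the 3D vectors, which is what keeps the computation cheap when the implicit space is very large.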


## Example of a Kernel: Radial Basis Function (RBF) Kernel
The **RBF kernel** (also called Gaussian kernel) is one of the most commonly used kernels in SVM.  

**Formula:**

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

Where:
- $x, x'$ = two data points
- $\gamma$ = parameter controlling the influence of each training point
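
The formula can be evaluated directly. The sketch below writes it out with NumPy and compares it against scikit-learn's `rbf_kernel`; the two points and the gamma values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), written out directly."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
x_prime = np.array([2.0, 4.0])   # squared distance to x is 1 + 4 = 5

values = {}
for gamma in (0.1, 1.0):
    manual = rbf(x, x_prime, gamma)
    library = rbf_kernel(x.reshape(1, -1), x_prime.reshape(1, -1), gamma=gamma)[0, 0]
    values[gamma] = (manual, library)
    print(f"gamma={gamma}: manual={manual:.6f}, sklearn={library:.6f}")
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighborhood.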


## Use Case of RBF Kernel:
- Suppose we want to classify whether a patient has a disease based on **blood pressure** and **cholesterol level**.
- The data points may form **circular clusters** that cannot be separated with a straight line.
- The RBF kernel implicitly maps the data into a higher-dimensional space, where SVM can separate the classes using a linear boundary.
- In the original space, this appears as a **curved decision boundary**.

- The **Kernel Trick** allows SVM to solve non-linear problems efficiently.
- **RBF Kernel** is widely used for problems where data is not linearly separable, such as image recognition, medical diagnosis, and text classification.
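
The circular-cluster situation can be reproduced with a rough sketch. It assumes synthetic data from `make_circles` as a stand-in for the patient example (no real medical data is used) and compares a linear kernel with an RBF kernel.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic concentric circles: not separable by a straight line
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(f"{kernel} kernel test accuracy: {scores[kernel]:.2f}")
```

The RBF model is expected to score well above the linear one here, because no single straight line can separate an inner circle from an outer ring.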

---



# Question 4: What is a Naïve Bayes Classifier, and why is it called “naïve”?

## What is a Naïve Bayes Classifier?
The **Naïve Bayes Classifier** is a **probabilistic machine learning algorithm** based on **Bayes’ Theorem**.  
It is mainly used for **classification tasks** such as spam detection, sentiment analysis, and medical diagnosis.

**Bayes’ Theorem:**

$$P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}$$

Where:
- $P(Y \mid X)$: Posterior probability → probability of class $Y$ given data $X$  
- $P(X \mid Y)$: Likelihood → probability of data $X$ given class $Y$  
- $P(Y)$: Prior probability of class $Y$  
- $P(X)$: Probability of data $X$

The classifier predicts the class (Y) that has the **highest posterior probability**.


##  Why "Naïve"?
It is called **“naïve”** because the algorithm assumes that:
- **All features are independent of each other**, given the class label.  

For example, in text classification:
- The model assumes that the occurrence of the word *“free”* is independent of the occurrence of the word *“money”*, even though they often appear together in spam emails.  
- This assumption is rarely true in real-world data, but the algorithm still performs surprisingly well.
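
Under the independence assumption, the likelihood factorizes into a product of per-word probabilities. The sketch below works through the *“free”* / *“money”* example by hand; all probabilities are hypothetical numbers chosen only for illustration, and `naive_bayes_posterior` is a helper name introduced here.

```python
# Hypothetical probabilities, chosen only for illustration
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "money": 0.20},
    "ham":  {"free": 0.02, "money": 0.01},
}

def naive_bayes_posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        # Unnormalized score: P(class) * product of P(word | class)
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    evidence = sum(scores.values())   # P(words), the normalizing constant
    return {label: s / evidence for label, s in scores.items()}

posterior = naive_bayes_posterior(["free", "money"])
print(posterior)
```

The predicted class is the one with the larger posterior. Because the evidence term is the same for every class, comparing the unnormalized scores would give the same prediction.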


##  Advantages:
- Very fast and efficient, especially with large datasets.
- Works well for text classification (spam filters, sentiment analysis).
- Requires relatively little training data, since it only estimates per-feature parameters for each class.


##  Limitations:
- The independence assumption is unrealistic in many cases.
- Cannot capture relationships between features.

---

# Question 5: Describe the Gaussian, Multinomial, and Bernoulli Naïve Bayes variants. When would you use each one?

Naïve Bayes has different **variants** depending on the type of data we are working with.  
The main ones are **GaussianNB, MultinomialNB, and BernoulliNB**.

## 1. Gaussian Naïve Bayes
- Assumes that the **features follow a normal (Gaussian) distribution**.
- Used for **continuous numerical data** (e.g., height, weight, temperature).
- Example: Predicting whether a patient has a disease based on continuous features like **blood pressure** and **cholesterol level**.


**Use Case:** Medical diagnosis, sensor readings, continuous features.


## 2. Multinomial Naïve Bayes
- Assumes that features represent **discrete counts** (non-negative integers).
- Commonly used for **text classification** where features are **word counts or frequencies**.
- Example: Classifying emails as spam or not spam using the frequency of words.

**Use Case:** Document classification, spam detection, sentiment analysis.


## 3. Bernoulli Naïve Bayes
- Assumes **binary features** (0/1 values).
- Each feature indicates whether a particular attribute is present (1) or absent (0).
- Example: In text classification, a feature can indicate whether a specific word appears in the document or not.

**Use Case:** Text classification with binary features, recommendation systems.


| Variant              | Data Type                | Example Use Case                        |
|----------------------|--------------------------|-----------------------------------------|
| **GaussianNB**       | Continuous numerical     | Medical diagnosis, sensor data          |
| **MultinomialNB**    | Discrete counts/frequencies | Text classification, spam detection     |
| **BernoulliNB**      | Binary (0/1) features   | Document classification (word presence) |
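
A minimal sketch of how each variant pairs with its data type, assuming scikit-learn. The Iris measurements stand in for continuous features, and the four-document corpus is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# GaussianNB: continuous measurements
X_iris, y_iris = load_iris(return_X_y=True)
gnb_acc = GaussianNB().fit(X_iris, y_iris).score(X_iris, y_iris)  # training accuracy

# Hypothetical mini corpus: 1 = spam, 0 = not spam
docs = ["free money now", "win free prize", "meeting at noon", "project report attached"]
labels = np.array([1, 1, 0, 0])
new_doc = ["free prize money"]

# MultinomialNB: word counts
count_vec = CountVectorizer()
mnb = MultinomialNB().fit(count_vec.fit_transform(docs), labels)
mnb_pred = mnb.predict(count_vec.transform(new_doc))

# BernoulliNB: word presence/absence (binary=True caps every count at 1)
binary_vec = CountVectorizer(binary=True)
bnb = BernoulliNB().fit(binary_vec.fit_transform(docs), labels)
bnb_pred = bnb.predict(binary_vec.transform(new_doc))

print(gnb_acc, mnb_pred, bnb_pred)
```

The two text models see the same documents; the only difference is whether a repeated word counts more than once.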

---


In [10]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')

In [11]:
# Question 6: Write a Python program to:
# ● Load the Iris dataset
# ● Train an SVM Classifier with a linear kernel
# ● Print the model's accuracy and support vectors.

In [12]:
from sklearn.datasets import load_iris
iris_data=load_iris()
df=pd.DataFrame(iris_data.data,columns=iris_data.feature_names)
df['target']=iris_data.target
df.sample(10)

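|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 70  | 5.9 | 3.2 | 4.8 | 1.8 | 1 |
| 61  | 5.9 | 3.0 | 4.2 | 1.5 | 1 |
| 129 | 7.2 | 3.0 | 5.8 | 1.6 | 2 |
| 114 | 5.8 | 2.8 | 5.1 | 2.4 | 2 |
| 87  | 6.3 | 2.3 | 4.4 | 1.3 | 1 |
| 86  | 6.7 | 3.1 | 4.7 | 1.5 | 1 |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 | 2 |
| 40  | 5.0 | 3.5 | 1.3 | 0.3 | 0 |
| 91  | 6.1 | 3.0 | 4.6 | 1.4 | 1 |
| 130 | 7.4 | 2.8 | 6.1 | 1.9 | 2 |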


In [13]:
x=df.drop('target',axis=1)
y=df["target"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,stratify=y,random_state=1)

In [14]:
from sklearn.svm import SVC
classifier=SVC(kernel='linear')
classifier.fit(x_train,y_train)

In [15]:
y_pred=classifier.predict(x_test)

In [16]:
y_pred

array([2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2,
       2, 1, 0, 0, 0, 1, 2, 0, 0, 2, 1, 0, 0, 1, 2, 2])

In [17]:
from sklearn.metrics import accuracy_score
print(f"Accuracy of the model is : {accuracy_score(y_test,y_pred)}")

Accuracy of the model is : 1.0


In [18]:
classifier.support_vectors_

array([[4.5, 2.3, 1.3, 0.3],
       [5.1, 3.3, 1.7, 0.5],
       [5.1, 3.8, 1.9, 0.4],
       [6.9, 3.1, 4.9, 1.5],
       [5.9, 3.2, 4.8, 1.8],
       [6. , 2.9, 4.5, 1.5],
       [6.7, 3. , 5. , 1.7],
       [6.3, 2.3, 4.4, 1.3],
       [6.1, 2.9, 4.7, 1.4],
       [6. , 2.7, 5.1, 1.6],
       [6.3, 2.5, 4.9, 1.5],
       [6.3, 3.3, 4.7, 1.6],
       [5.1, 2.5, 3. , 1.1],
       [6.4, 3.1, 5.5, 1.8],
       [6.3, 2.8, 5.1, 1.5],
       [6.2, 2.8, 4.8, 1.8],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6. , 2.2, 5. , 1.5],
       [6. , 3. , 4.8, 1.8],
       [6.5, 3.2, 5.1, 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.1, 2.6, 5.6, 1.4]])

In [19]:
# Question 7: Write a Python program to:
# ● Load the Breast Cancer dataset
# ● Train a Gaussian Naïve Bayes model
# ● Print its classification report including precision, recall, and F1-score.

In [20]:
from sklearn.datasets import load_breast_cancer
cancer_data=load_breast_cancer()
df=pd.DataFrame(cancer_data.data,columns=cancer_data.feature_names)
df["target"]=cancer_data.target
df.sample(8)

|     | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 145 | 11.9 | 14.65 | 78.11 | 432.8 | 0.1152 | 0.1296 | 0.0371 | 0.03003 | 0.1995 | 0.07839 | ... | 16.51 | 86.26 | 509.6 | 0.1424 | 0.2517 | 0.0942 | 0.06042 | 0.2727 | 0.1036 | 1 |
| 420 | 11.57 | 19.04 | 74.2 | 409.7 | 0.08546 | 0.07722 | 0.05485 | 0.01428 | 0.2031 | 0.06267 | ... | 26.98 | 86.43 | 520.5 | 0.1249 | 0.1937 | 0.256 | 0.06664 | 0.3035 | 0.08284 | 1 |
| 502 | 12.54 | 16.32 | 81.25 | 476.3 | 0.1158 | 0.1085 | 0.05928 | 0.03279 | 0.1943 | 0.06612 | ... | 21.4 | 86.67 | 552.0 | 0.158 | 0.1751 | 0.1889 | 0.08411 | 0.3155 | 0.07538 | 1 |
| 277 | 18.81 | 19.98 | 120.9 | 1102.0 | 0.08923 | 0.05884 | 0.0802 | 0.05843 | 0.155 | 0.04996 | ... | 24.3 | 129.0 | 1236.0 | 0.1243 | 0.116 | 0.221 | 0.1294 | 0.2567 | 0.05737 | 0 |
| 271 | 11.29 | 13.04 | 72.23 | 388.0 | 0.09834 | 0.07608 | 0.03265 | 0.02755 | 0.1769 | 0.0627 | ... | 16.18 | 78.27 | 457.5 | 0.1358 | 0.1507 | 0.1275 | 0.0875 | 0.2733 | 0.08022 | 1 |
| 499 | 20.59 | 21.24 | 137.8 | 1320.0 | 0.1085 | 0.1644 | 0.2188 | 0.1121 | 0.1848 | 0.06222 | ... | 30.76 | 163.2 | 1760.0 | 0.1464 | 0.3597 | 0.5179 | 0.2113 | 0.248 | 0.08999 | 0 |
| 427 | 10.8 | 21.98 | 68.79 | 359.9 | 0.08801 | 0.05743 | 0.03614 | 0.01404 | 0.2016 | 0.05977 | ... | 32.04 | 83.69 | 489.5 | 0.1303 | 0.1696 | 0.1927 | 0.07485 | 0.2965 | 0.07662 | 1 |
| 462 | 14.4 | 26.99 | 92.25 | 646.1 | 0.06995 | 0.05223 | 0.03476 | 0.01737 | 0.1707 | 0.05433 | ... | 31.98 | 100.4 | 734.6 | 0.1017 | 0.146 | 0.1472 | 0.05563 | 0.2345 | 0.06464 | 1 |


In [21]:
x=df.drop('target',axis=1)
y=df["target"]

In [22]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=1,stratify=y)

In [23]:
from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
gnb.fit(x_train,y_train)

In [24]:
y_pred=gnb.predict(x_test)

In [25]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.91      0.91      0.91        53
           1       0.94      0.94      0.94        90

    accuracy                           0.93       143
   macro avg       0.93      0.93      0.93       143
weighted avg       0.93      0.93      0.93       143



In [26]:
# Question 8: Write a Python program to:
# ● Train an SVM Classifier on the Wine dataset using GridSearchCV to find the best C and gamma.
# ● Print the best hyperparameters and accuracy.

In [27]:
from sklearn.datasets import load_wine
wine_data=load_wine()
df=pd.DataFrame(wine_data.data,columns=wine_data.feature_names)
df['target']=wine_data.target
df.sample(10)

|     | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
|-----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 72  | 13.49 | 1.66 | 2.24 | 24.0 | 87.0 | 1.88 | 1.84 | 0.27 | 1.03 | 3.74 | 0.98 | 2.78 | 472.0 | 1 |
| 157 | 12.45 | 3.03 | 2.64 | 27.0 | 97.0 | 1.9 | 0.58 | 0.63 | 1.14 | 7.5 | 0.67 | 1.73 | 880.0 | 2 |
| 130 | 12.86 | 1.35 | 2.32 | 18.0 | 122.0 | 1.51 | 1.25 | 0.21 | 0.94 | 4.1 | 0.76 | 1.29 | 630.0 | 2 |
| 103 | 11.82 | 1.72 | 1.88 | 19.5 | 86.0 | 2.5 | 1.64 | 0.37 | 1.42 | 2.06 | 0.94 | 2.44 | 415.0 | 1 |
| 19  | 13.64 | 3.1 | 2.56 | 15.2 | 116.0 | 2.7 | 3.03 | 0.17 | 1.66 | 5.1 | 0.96 | 3.36 | 845.0 | 0 |
| 44  | 13.05 | 1.77 | 2.1 | 17.0 | 107.0 | 3.0 | 3.0 | 0.28 | 2.03 | 5.04 | 0.88 | 3.35 | 885.0 | 0 |
| 171 | 12.77 | 2.39 | 2.28 | 19.5 | 86.0 | 1.39 | 0.51 | 0.48 | 0.64 | 9.899999 | 0.57 | 1.63 | 470.0 | 2 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.3 | 0.6 | 1.62 | 840.0 | 2 |
| 52  | 13.82 | 1.75 | 2.42 | 14.0 | 111.0 | 3.88 | 3.74 | 0.32 | 1.87 | 7.05 | 1.01 | 3.26 | 1190.0 | 0 |
| 95  | 12.47 | 1.52 | 2.2 | 19.0 | 162.0 | 2.5 | 2.27 | 0.32 | 3.28 | 2.6 | 1.16 | 2.63 | 937.0 | 1 |


In [28]:
x=df.drop('target',axis=1)
y=df["target"]

In [29]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=1,stratify=y,test_size=0.2)

In [30]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [31]:
param_grid = {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'gamma': [0.0001, 0.001, 0.01, 0.1]
    }

In [32]:
validator=GridSearchCV(SVC(),param_grid=param_grid,cv=5)
validator.fit(x_train,y_train)

In [33]:
validator.best_params_

{'C': 100, 'gamma': 0.0001}

In [34]:
validator.best_score_   # mean cross-validated accuracy of the best parameter combination

np.float64(0.844334975369458)
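
The search above runs on unscaled features and reports the mean cross-validated accuracy; the held-out test split is not scored. Since SVMs with an RBF kernel are sensitive to feature scale, one possible variation is sketched below. It assumes a `StandardScaler` placed in a pipeline with `SVC` (an addition, not part of the cells above), keeps the same grid and split settings, and also scores the test set.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Scaling sits inside the pipeline so each CV fold is scaled on its own training part
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.001, 0.01, 0.1, 1, 10, 100],
    "svc__gamma": [0.0001, 0.001, 0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)   # accuracy of the refit best model
print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", test_accuracy)
```

The `svc__` prefix is how `GridSearchCV` addresses parameters of a named pipeline step; `make_pipeline` names each step after its lowercased class name.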

In [35]:
# Question 9: Write a Python program to:
# ● Train a Naïve Bayes Classifier on a synthetic text dataset (e.g. using
# sklearn.datasets.fetch_20newsgroups).
# ● Print the model's ROC-AUC score for its predictions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the dataset
newsgroups = fetch_20newsgroups(subset='all', categories=['rec.sport.baseball', 'rec.sport.hockey'], remove=('headers', 'footers', 'quotes'))

In [36]:
# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# Vectorize the text data
vectorizer = TfidfVectorizer()
x_train_vec = vectorizer.fit_transform(x_train)
x_test_vec = vectorizer.transform(x_test)

# Train a Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(x_train_vec, y_train)

# Predict probabilities for the positive class
y_pred_proba = model.predict_proba(x_test_vec)[:, 1]

# Calculate and print the ROC-AUC score
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc}")

ROC-AUC Score: 0.9844464545957083


# Question 10: Spam Email Classification Project

Imagine you are working as a data scientist for a company that handles email communications.  
Your task is to automatically classify emails as **Spam** or **Not Spam**.

The dataset contains:
- **Text with diverse vocabulary**
- **Potential class imbalance** (more legitimate emails than spam)
- **Some incomplete or missing data**

## Step 1: Data Preprocessing

1. **Handling Missing Data**
   - Drop emails where the text body is completely missing.
   - If only small parts (like subject) are missing, fill with `"unknown"`.

2. **Text Vectorization**
   - Convert email text into numerical form using **TF-IDF Vectorizer**.
   - This captures both the frequency of words and their importance.

3. **Class Imbalance**
   - Use **SMOTE (Synthetic Minority Oversampling Technique)** or **class weights** to balance spam vs. ham (not spam).
   - This prevents the model from always predicting "Not Spam".
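
The missing-data rules in item 1 could look like the following sketch. The DataFrame and its column names (`subject`, `body`, `label`) are hypothetical; a real dataset would need its own column mapping.

```python
import pandas as pd

# Hypothetical raw emails; column names are assumptions for this sketch
emails = pd.DataFrame({
    "subject": ["You won!", None, "Team lunch", "Free gift"],
    "body": ["Claim your prize now", "See attached report", None, "Click this link"],
    "label": ["spam", "ham", "ham", "spam"],
})

# 1. Drop emails whose body is missing entirely
emails = emails.dropna(subset=["body"]).reset_index(drop=True)

# 2. Fill a missing subject with a placeholder token
emails["subject"] = emails["subject"].fillna("unknown")

# Combine subject and body into one text field for vectorization
emails["text"] = emails["subject"] + " " + emails["body"]
print(emails[["text", "label"]])
```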

---

## Step 2: Choosing the Model

- **Naïve Bayes**:
  - Works very well for **text classification** (bag of words, TF-IDF).
  - Assumes word occurrences are independent (the “naïve” assumption).
  - Fast and scalable.

- **SVM (Support Vector Machine)**:
  - Powerful classifier for high-dimensional sparse data (like TF-IDF).
  - Can model complex boundaries but is computationally heavier.

For this problem, **Naïve Bayes** is a common first choice because:
- It performs very well on **text + bag of words/TF-IDF**.
- It’s faster and easier to train on large datasets.

---

## Step 3: Model Training & Evaluation

- Use **train-test split**.
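- Handle class imbalance with `class_weight='balanced'` (available for SVM) or resampling; `MultinomialNB` has no `class_weight` parameter, but its class priors can be set through `class_prior`.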
- Evaluate with:
  - **Precision & Recall** (important for spam detection).
  - **F1-score** (balance between precision & recall).
  - **Confusion Matrix** (to see false positives & negatives).
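
The evaluation steps above can be sketched as follows. The corpus is a hypothetical, imbalanced toy set built by repeating templates, so train and test share texts and the scores only demonstrate the mechanics. `LinearSVC` is used here as a linear SVM variant that accepts `class_weight='balanced'`; it is an illustrative choice, not the only option.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 40 ham and 8 spam messages built from templates
ham = ["meeting at 10 am", "project report attached", "lunch with the team",
       "please review the document", "schedule for next week",
       "notes from the call", "invoice for last month", "see you tomorrow"] * 5
spam = ["win a free prize now", "claim your free gift card",
        "you won the lottery", "click this link for free money"] * 2
texts = ham + spam
labels = ["ham"] * len(ham) + ["spam"] * len(spam)

# Stratify so the test split keeps the same spam/ham ratio
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

models = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LinearSVC (balanced)": make_pipeline(
        TfidfVectorizer(), LinearSVC(class_weight="balanced")
    ),
}
matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    matrices[name] = confusion_matrix(y_test, y_pred, labels=["ham", "spam"])
    print(name)
    print(classification_report(y_test, y_pred, zero_division=0))
    print(matrices[name])
```

Keeping the vectorizer inside each pipeline means TF-IDF statistics are learned from the training split only.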

---

## Step 4: Business Impact

- **Reduces manual effort**: No need for employees to manually filter spam.
- **Improves productivity**: Employees can focus only on genuine emails.
- **Enhances security**: Helps prevent phishing/spam attacks from reaching users.
- **Saves cost**: Automated filtering is cheaper than human supervision.

---

## Code




In [37]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

data = {
    "text": [
        "Congratulations! You won a lottery of $1000",
        "Reminder: Meeting scheduled at 10 AM",
        "Limited offer, claim your free gift card now",
        "Please find attached the project report",
        "Win a free iPhone by clicking this link",
        "Lunch at 1 PM with team"
    ],
    "label": ["spam", "ham", "spam", "ham", "spam", "ham"]
}

df = pd.DataFrame(data)
df

|   | text | label |
|---|------|-------|
| 0 | Congratulations! You won a lottery of $1000 | spam |
| 1 | Reminder: Meeting scheduled at 10 AM | ham |
| 2 | Limited offer, claim your free gift card now | spam |
| 3 | Please find attached the project report | ham |
| 4 | Win a free iPhone by clicking this link | spam |
| 5 | Lunch at 1 PM with team | ham |


In [38]:
# Handle Missing Data
df["text"] = df["text"].fillna("unknown")

# Convert text to numerical features
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(df["text"])
y = df["label"]

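# Handle Class Imbalance using SMOTE
# Note: this toy dataset is already balanced (3 spam, 3 ham), so SMOTE has nothing to add here.
# On real data, fit SMOTE on the training split only, so synthetic samples never leak into the test set.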
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)


In [39]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.3, random_state=42)

# Train Naive Bayes Model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Classification Report:
               precision    recall  f1-score   support

         ham       0.50      1.00      0.67         1
        spam       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

Confusion Matrix:
 [[1 0]
 [1 0]]
