# Supervised Classification: Decision Trees, SVM, and Naive Bayes

---

## Question 1: What is Information Gain, and how is it used in Decision Trees?

**Information Gain** is a metric used in decision tree algorithms to measure the reduction in uncertainty or randomness (entropy) in a dataset after it is split on a particular feature. 🌳

### How It's Used in Decision Trees

The primary use of Information Gain is to decide the best feature to split the data on at each node of the tree. The process is as follows:

1.  **Calculate Initial Entropy**: At a given node, the entropy of the dataset is calculated to measure its initial impurity.
2.  **Calculate Conditional Entropy**: For each feature, the dataset is hypothetically split. The weighted average entropy of the resulting child nodes (the conditional entropy) is calculated.
3.  **Calculate Information Gain**: The Information Gain for that feature is the initial entropy minus the conditional entropy.
    
    `Information Gain = Entropy(parent) - Weighted Average Entropy(children)`

4.  **Select Best Feature**: The algorithm repeats this for all features. The feature that results in the **highest Information Gain** is chosen as the splitting criterion for that node because it creates the most homogeneous (purest) child nodes.
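
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 5 "yes" and 5 "no" labels, split by some feature into two children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

gain = information_gain(parent, [left, right])
print(f"Parent entropy:   {entropy(parent):.4f}")
print(f"Information gain: {gain:.4f}")
```

A tree builder would compute this gain for every candidate feature at the node and split on the one with the largest value.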

---

## Question 2: What is the difference between Gini Impurity and Entropy?

**Gini Impurity** and **Entropy** are both metrics used to measure the level of impurity or disorder within a node of a decision tree. The goal of a split is to reduce this impurity. While they often produce very similar results, they differ in their calculation and sensitivity.

| Feature | Gini Impurity | Entropy |
|---|---|---|
| **Concept** | Measures the probability of misclassifying a randomly chosen element from the node. | Measures the amount of uncertainty or randomness in the node's data. |
| **Calculation** | Faster to compute as it avoids logarithmic calculations. `Gini = 1 - Σ(p_i)²` | Slower to compute due to the logarithm. `Entropy = -Σ(p_i * log₂(p_i))` |
| **Range** | 0 to 0.5 (for binary classification) | 0 to 1 (for binary classification) |
| **Behavior** | Often said to favor isolating the most frequent class in its own branch, although in practice both criteria usually choose very similar splits. | Often said to produce slightly more balanced trees; penalizes mixed nodes a little more heavily. |
| **Use Case** | The default criterion in scikit-learn's `DecisionTreeClassifier`; slightly cheaper to compute. | The basis of Information Gain, used in ID3 and (through the gain ratio) C4.5. May occasionally give a small improvement, so it is worth comparing both on a validation set. |
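
To make the formulas and ranges in the table concrete, here is a small sketch that evaluates both measures for a binary node at a few class proportions. The function names are assumptions for illustration.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in probs)

def entropy(probs):
    """Entropy in bits; terms with p == 0 are skipped because p*log2(p) -> 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# p is the proportion of the positive class in a binary node.
for p in (0.0, 0.1, 0.3, 0.5):
    probs = (p, 1 - p)
    print(f"p={p:.1f}  gini={gini(probs):.4f}  entropy={entropy(probs):.4f}")
```

Both measures are zero for a pure node and peak at a 50/50 split, where Gini reaches 0.5 and entropy reaches 1.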

---

## Question 3: What is Pre-Pruning in Decision Trees?

**Pre-pruning** is a technique used to prevent a decision tree from overfitting the training data by **stopping its growth early**. 🛑

Instead of building a complete tree and then trimming it back (post-pruning), pre-pruning sets conditions or thresholds that stop a branch from splitting further. If a proposed split does not meet these conditions, it is rejected, and the current node becomes a leaf node.

Common pre-pruning strategies include:

- **`max_depth`**: Limiting the maximum depth the tree can grow to.
- **`min_samples_split`**: Setting the minimum number of data points required in a node before it can be split.
- **`min_samples_leaf`**: Defining the minimum number of data points that must exist in a leaf node after a split.
- **`max_leaf_nodes`**: Limiting the total number of leaf nodes in the tree.
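
As a rough sketch, these parameters can be passed straight to scikit-learn's `DecisionTreeClassifier`. The threshold values below are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Unconstrained tree: keeps splitting until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Pre-pruned tree: growth stops as soon as any limit would be violated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    max_leaf_nodes=8,
    random_state=42,
).fit(X, y)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

In practice these limits are usually chosen by cross-validation rather than fixed by hand.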

---

## Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

```py
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a sample dataset
iris = load_iris()
X, y = iris.data, iris.target

# 1. Initialize the Decision Tree Classifier with Gini Impurity
# 'gini' is the default criterion, but we specify it for clarity.
gini_tree = DecisionTreeClassifier(criterion='gini', random_state=42)

# 2. Train the model
gini_tree.fit(X, y)

# 3. Get and print the feature importances
feature_importances = gini_tree.feature_importances_

print("Feature Importances for the Iris Dataset (using Gini Impurity):")
print("----------------------------------------------------------------")
for i, feature_name in enumerate(iris.feature_names):
    print(f"{feature_name}: {feature_importances[i]:.4f}")
```

Output:

```
Feature Importances for the Iris Dataset (using Gini Impurity):
----------------------------------------------------------------
sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226
```

---

## Question 5: What is a Support Vector Machine (SVM)?

A **Support Vector Machine (SVM)** is a powerful and versatile supervised machine learning algorithm used for both **classification** and **regression** tasks. 🤖

The core idea of an SVM, especially for classification, is to find the **optimal hyperplane** that best separates a dataset into classes.

### Key Concepts:

- **Hyperplane**: This is the decision boundary. In a 2D space, it's a line; in a 3D space, it's a plane; in higher dimensions, it's a hyperplane.
- **Margin**: The margin is the distance between the hyperplane and the nearest data points from each class. SVM aims to **maximize this margin**.
- **Support Vectors**: These are the data points that lie closest to the hyperplane and define the margin. They are the critical elements of the dataset, as removing them would alter the position of the optimal hyperplane.

By maximizing the margin, SVM finds the decision boundary that sits as far as possible from the nearest points of each class, which tends to make it more robust and helps it generalize to new, unseen data.
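
The concepts above can be inspected on a fitted model. The sketch below uses a tiny, made-up 2D dataset and a large `C` to approximate a hard margin; the data points and the `margin_width` name are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2D data (hypothetical points chosen for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
# Total gap between the two classes; each side is half of this.
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print(f"Hyperplane: {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")
print(f"Margin width: {margin_width:.3f}")
```

Only the points listed in `support_vectors_` determine the hyperplane; the remaining points could be moved farther from the boundary without changing the fit.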



---

## Question 6: What is the Kernel Trick in SVM?

The **Kernel Trick** is a clever mathematical technique that allows SVMs to solve **non-linear classification problems**. 🧙‍♂️

### The Problem
Many real-world datasets are not linearly separable, meaning a straight line (or plane) cannot effectively separate the classes.

### The Solution
The Kernel Trick enables the SVM to work in a higher-dimensional space **without actually transforming the data**. It does this by using a **kernel function**.

Instead of:
1.  Explicitly transforming the data into a much higher dimension (which is computationally expensive), and
2.  Finding a hyperplane there,

the kernel function directly computes the dot product between pairs of data points as if they had been mapped into that higher-dimensional space. Because the SVM optimization only needs these pairwise dot products, it can fit a linear hyperplane in the implicit space, which corresponds to a non-linear decision boundary in the original, lower-dimensional space.

Common kernel functions include:
- **Linear**
- **Polynomial**
- **Radial Basis Function (RBF)**
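
A small numerical check can make the trick concrete. For the homogeneous degree-2 polynomial kernel in 2D, the dot product after an explicit 3D mapping equals a simple function of the original 2D dot product. The map `phi` and the function `poly_kernel` below are written for this illustration only.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the original 2D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # map both points to 3D, then take the dot product
implicit = poly_kernel(x, z)       # same value without ever leaving 2D

print(f"explicit mapping: {explicit:.1f}")
print(f"kernel function:  {implicit:.1f}")
```

For the RBF kernel the implicit space is infinite-dimensional, so the explicit route is not even available; the kernel function is the only practical way to compute the dot products.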

---

## Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.

```py
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. Load and split the dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 2. Train and evaluate the SVM with a Linear Kernel
linear_svm = SVC(kernel='linear', random_state=42)
linear_svm.fit(X_train, y_train)
linear_predictions = linear_svm.predict(X_test)
linear_accuracy = accuracy_score(y_test, linear_predictions)

# 3. Train and evaluate the SVM with an RBF Kernel
rbf_svm = SVC(kernel='rbf', random_state=42)
rbf_svm.fit(X_train, y_train)
rbf_predictions = rbf_svm.predict(X_test)
rbf_accuracy = accuracy_score(y_test, rbf_predictions)

# 4. Compare the accuracies
print("SVM Accuracy Comparison on the Wine Dataset:")
print("---------------------------------------------")
print(f"Accuracy with Linear Kernel: {linear_accuracy:.4f}")
print(f"Accuracy with RBF Kernel:    {rbf_accuracy:.4f}")
```

Output:

```
SVM Accuracy Comparison on the Wine Dataset:
---------------------------------------------
Accuracy with Linear Kernel: 0.9444
Accuracy with RBF Kernel:    0.6667
```
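
The low RBF score above is most likely a side effect of unscaled features rather than a weakness of the kernel itself: the RBF kernel depends on Euclidean distances, and the Wine features span very different ranges. The sketch below repeats the comparison with a `StandardScaler` in a pipeline. It is an optional extension to the exercise, and no output is shown because the exact scores depend on the library version.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The scaler is fit on the training split only (inside the pipeline) to avoid leakage.
scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)
    print(f"Accuracy with {kernel} kernel (scaled): {scores[kernel]:.4f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training and makes the same preprocessing apply automatically at prediction time.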

---

## Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

The **Naïve Bayes classifier** is a simple yet effective probabilistic classification algorithm based on **Bayes' Theorem**. It calculates the probability of an observation belonging to a particular class based on the values of its features.

### Why is it called "Naïve"?

The classifier is called "naïve" because it makes a very strong, and often unrealistic, assumption about the data: **the assumption of class-conditional independence**. 🤔

This means that the algorithm assumes that all the features are **independent of one another**, given the class.

**Example:** When classifying an email as `spam` or `not spam`:
- A Naïve Bayes classifier would assume that, once the class is known, the presence of the word "viagra" tells you nothing further about the presence of the word "money".
- In reality, these words tend to occur together even within spam emails, so they are not truly independent given the class.

Despite this "naïve" assumption, the classifier works surprisingly well in many real-world applications, particularly for text classification and medical diagnosis.
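
The independence assumption is what makes the computation so simple: the class score is just the prior multiplied by one term per feature. The sketch below works through the spam example by hand. All probabilities are hypothetical values invented for illustration, and `posterior` is a helper name chosen for this example.

```python
# Hypothetical probabilities, invented purely for illustration.
prior = {"spam": 0.4, "not spam": 0.6}
likelihood = {
    "spam":     {"viagra": 0.30, "money": 0.50},
    "not spam": {"viagra": 0.01, "money": 0.10},
}

def posterior(words):
    """P(class | words), assuming words are independent given the class."""
    scores = {}
    for cls in prior:
        score = prior[cls]
        for word in words:
            score *= likelihood[cls][word]  # independence: multiply per-word terms
        scores[cls] = score
    total = sum(scores.values())  # normalizing constant from Bayes' Theorem
    return {cls: s / total for cls, s in scores.items()}

result = posterior(["viagra", "money"])
for cls, p in result.items():
    print(f"P({cls} | viagra, money) = {p:.4f}")
```

Without the assumption, the model would need a joint probability for every combination of words, which quickly becomes impossible to estimate from data.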

---

## Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes

The main difference between the various types of Naïve Bayes classifiers lies in the assumptions they make about the distribution of the feature data.

### 1. Gaussian Naïve Bayes
- **Use Case**: Used for **continuous features**.
- **Assumption**: It assumes that the features for each class follow a **Gaussian (normal) distribution**.
- **Example**: It's suitable for data like patient height, weight, temperature, or the feature values in the Iris or Breast Cancer datasets.

### 2. Multinomial Naïve Bayes
- **Use Case**: Used for **discrete features**, typically representing counts or frequencies.
- **Assumption**: It models the feature counts for each class with a **multinomial distribution**.
- **Example**: Its most common application is in **text classification**, where the features are the frequency of each word appearing in a document (e.g., word counts).

### 3. Bernoulli Naïve Bayes
- **Use Case**: Used for **binary/boolean features** (i.e., features that are either present (1) or absent (0)).
- **Assumption**: It models data based on the Bernoulli distribution.
- **Example**: Also used in text classification, but instead of word counts, it only considers whether a word appears in a document or not.
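
The sketch below pairs each variant with the kind of feature matrix it expects, using scikit-learn's `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The tiny datasets are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements (e.g. height in cm, weight in kg) -> GaussianNB
X_continuous = np.array([[150.0, 50.0], [155.0, 53.0], [180.0, 80.0], [185.0, 85.0]])

# Word counts per document (one column per vocabulary word) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Word present (1) or absent (0) -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    predictions[name] = model.fit(X, y).predict(X)
    print(f"{name:>13}: {predictions[name]}")
```

The same word-count matrix feeds both text-oriented variants; the only difference is whether the counts are kept or collapsed to presence/absence.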

---

## Question 10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

```py
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# 1. Load the Breast Cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize and train the Gaussian Naïve Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = gnb.predict(X_test)

# 5. Evaluate the model's accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naïve Bayes Classifier on Breast Cancer Dataset")
print("--------------------------------------------------------")
print(f"Model Accuracy: {accuracy:.4f}")
```

Output:

```
Gaussian Naïve Bayes Classifier on Breast Cancer Dataset
--------------------------------------------------------
Model Accuracy: 0.9415
```