# Supervised Classification: Decision Trees, SVM, and Naive Bayes Assignment

Q.1 What is Information Gain, and how is it used in Decision Trees?


--> Information Gain (IG) is a key concept used in Decision Trees to decide which feature to split on at each step while building the tree.


Information Gain measures how much uncertainty (entropy) in the target variable is reduced after splitting the data based on a particular feature.
It tells us how informative a feature is in predicting the target.

Formula:

$$\text{Information Gain} = \text{Entropy}(\text{Parent}) - \sum_{i} \frac{N_i}{N} \, \text{Entropy}(\text{Child}_i)$$

where:

 - $\text{Entropy}(\text{Parent})$: entropy before the split
 - $\text{Entropy}(\text{Child}_i)$: entropy of each subset after the split
 - $N_i$: number of samples in child node $i$
 - $N$: total number of samples before the split

Entropy Formula:

Entropy measures impurity (randomness) in the dataset.

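$$\text{Entropy} = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

where $p_k$ is the proportion of samples belonging to class $k$ and $K$ is the number of classes.

If entropy = 0, the node is pure (all samples belong to the same class).

If entropy = 1 (with two classes), the samples are evenly split, which is the maximum impurity. With $K$ classes the maximum is $\log_2 K$.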

How It's Used in Decision Trees:

 - Calculate Entropy of the target before any split.

 - For each feature, calculate the entropy of the subsets created by splitting on that feature.

 - Compute Information Gain for each feature.

 - Select the feature with the highest Information Gain — it gives the most “information” about the target.

 - Repeat this process recursively for each node.
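
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made up for this example, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy data: the same 10 labels split in two different ways.
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"] * 1, ["yes"] * 1 + ["no"] * 4]  # fairly clean split
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # weak split

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Parent entropy:   {entropy(parent):.3f}")
print(f"Gain for split A: {gain_a:.3f}")
print(f"Gain for split B: {gain_b:.3f}")
```

A tree builder would pick split A here, because its children are closer to pure and so the gain is larger.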

Q.2 What is the difference between Gini Impurity and Entropy?


--> Difference between Gini Impurity and Entropy

Gini Impurity
-------------

 - Measures the probability of incorrectly classifying a randomly chosen element if it were labeled at random according to the distribution of labels in the node.

 - $\text{Gini} = 1 - \sum_i p_i^2$

 - Range: 0 (pure) to 0.5 (two classes in equal proportion).
 - Lower Gini means higher purity.
 - Simpler and faster to compute (no logarithm).
 - Often said to favor larger partitions, isolating the most frequent class in its own branch.
 - Often preferred in practice for speed (e.g., used in CART).

Entropy
-------

 - Measures the amount of uncertainty (information disorder) in the data; based on information theory.

 - $\text{Entropy} = -\sum_i p_i \log_2(p_i)$

 - Range: 0 (pure) to 1 (two classes in equal proportion).
 - Lower Entropy means higher purity.
 - Slightly slower due to logarithmic computation.
 - More sensitive to changes in class probabilities.
 - Preferred when you want a more information-theoretic interpretation.
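
To see how the two measures behave side by side, here is a small sketch that evaluates both on a two-class node as the share `p` of one class changes. The function names `gini` and `entropy` are chosen for this example only.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Two-class node where one class has proportion p and the other 1 - p.
for p in (0.0, 0.1, 0.3, 0.5):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```

Both measures are 0 for a pure node and peak at `p = 0.5`, where Gini reaches 0.5 and entropy reaches 1.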


Q.3 What is Pre-Pruning in Decision Trees?

--> Pre-Pruning in Decision Trees (also called Early Stopping) is a technique used to stop the tree from growing too large during its construction — to avoid overfitting and improve generalization.

Definition:

Pre-pruning stops the splitting process early, before the tree becomes fully grown, based on certain stopping conditions.

Common Stopping Criteria:

A node will not be split further if any of the following conditions are met:

 - The maximum tree depth is reached (max_depth).
 - The number of samples in a node is below a minimum threshold (min_samples_split or min_samples_leaf).
 - The impurity decrease (Information Gain or Gini reduction) from a split is too small (min_impurity_decrease).
 - The node is already pure (all samples belong to one class).

Advantages:

 - Prevents overfitting (simpler tree, better generalization).

 - Reduces training time.

 - Easier to interpret due to smaller tree size.

Disadvantages:

 - Risk of underfitting if the pruning is too aggressive.

 - May miss useful splits: a split that looks weak on its own can enable strong splits further down, and early stopping never gets to see them.
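
As a rough illustration, the stopping criteria can be passed to scikit-learn's `DecisionTreeClassifier` when the tree is created. The threshold values below (depth 3, 10 samples to split, and so on) are arbitrary assumptions picked for the sketch, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fully grown tree: splitting continues until the leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: the thresholds are illustrative assumptions.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,
    min_samples_split=10,
    min_samples_leaf=5,
    min_impurity_decrease=0.01,
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

Comparing depth, leaf count, and test accuracy of the two trees shows the trade-off between a simpler model and the risk of underfitting.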



Q.4 Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).

In [1]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load sample dataset (Iris)
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier using Gini Impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Print Feature Importances
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")

# Evaluate model performance (optional)
accuracy = clf.score(X_test, y_test)
print(f"\nModel Accuracy: {accuracy:.2f}")


Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876

Model Accuracy: 1.00


Q.5 What is a Support Vector Machine (SVM)?

--> Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks, but it is most commonly used for classification.

Definition:

SVM aims to find the decision boundary (hyperplane) that separates the classes in the feature space with the maximum margin, i.e., the largest possible distance between the boundary and the nearest data points of each class.

Key Concepts:

 - Hyperplane: The decision boundary that separates different classes. In 2D, it’s a line; in 3D, a plane.
 - Margin: The distance between the hyperplane and the nearest data points from each class. SVM tries to maximize this margin.
 - Support Vectors: The data points closest to the hyperplane that directly influence its position and orientation.
 - Kernel Trick: A mathematical technique that lets SVM learn non-linear decision boundaries by implicitly mapping data into a higher-dimensional space (e.g., using polynomial or RBF kernels).

Types of SVM:

Linear SVM: Used when data is linearly separable, or nearly so (a soft margin tolerates a few points on the wrong side).

Non-linear SVM: Uses kernel functions (e.g., RBF, polynomial) when data is not linearly separable.
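
The key concepts can be inspected on a fitted model. The sketch below uses six made-up 2D points and a linear `SVC`; the large `C` value is an assumption used to approximate a hard margin on separable data. It reads the hyperplane from `coef_` and `intercept_` and computes the full width between the two margin boundaries as 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2D points.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]  # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("Support vectors:\n", clf.support_vectors_)
print("w =", w, " b =", b)
print(f"Margin width: {margin_width:.3f}")
```

Only the points listed in `support_vectors_` determine the boundary; moving any of the other points slightly would not change it.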

Q.6 What is the Kernel Trick in SVM?

--> Kernel Trick in SVM is a mathematical technique that allows the Support Vector Machine to perform non-linear classification by implicitly mapping the input data into a higher-dimensional feature space — without actually computing the transformation.

The Idea:
--------

Many datasets are not linearly separable in their original space.

The kernel trick corresponds to mapping the data into a higher-dimensional space where it may become linearly separable.

Instead of explicitly computing the new coordinates (which is computationally expensive), the kernel trick computes the dot product between pairs of data points in that higher-dimensional space using a kernel function.

Mathematical Explanation:

If $\phi(x)$ is the transformation function to a higher dimension,
then instead of computing $\phi(x_i)$ and $\phi(x_j)$ explicitly,
SVM uses a kernel function

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

This saves computation and enables handling complex non-linear boundaries efficiently.
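
A small numeric check makes the identity concrete. The sketch assumes a homogeneous degree-2 polynomial kernel on 2D inputs, for which an explicit feature map `phi` is known; the names `phi`, `dot`, and `poly_kernel` are defined here for illustration only.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2D input (maps to 3D)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    """Plain dot product of two equal-length tuples."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))  # map both points to 3D, then take the dot product
implicit = poly_kernel(x, z)    # stay in 2D and never build phi

print(f"phi(x) . phi(z) = {explicit:.6f}")
print(f"K(x, z)         = {implicit:.6f}")
```

Both routes give the same number, but the kernel only needs a 2D dot product. For kernels such as RBF the implied feature space is infinite-dimensional, so the explicit route is not even available.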

Q.7 Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

In [2]:
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
acc_linear = accuracy_score(y_test, y_pred_linear)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_test)
acc_rbf = accuracy_score(y_test, y_pred_rbf)

# Print the accuracy scores
print("Accuracy Comparison:")
print(f"Linear Kernel Accuracy: {acc_linear:.2f}")
print(f"RBF Kernel Accuracy:    {acc_rbf:.2f}")

# Optional: Compare which kernel performed better
if acc_linear > acc_rbf:
    print("\nLinear kernel performed better.")
elif acc_rbf > acc_linear:
    print("\nRBF kernel performed better.")
else:
    print("\nBoth kernels performed equally well.")


Accuracy Comparison:
Linear Kernel Accuracy: 0.98
RBF Kernel Accuracy:    0.76

Linear kernel performed better.
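
A likely reason for the gap is feature scale: the Wine features span very different ranges, and the RBF kernel depends on distances between points, so large-valued features dominate. The sketch below is an optional follow-up, not part of the original question. It standardizes the features inside a pipeline before fitting the RBF SVM on the same split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, inside the pipeline,
# so no information from the test split leaks into preprocessing.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_rbf.fit(X_train, y_train)
acc_scaled_rbf = scaled_rbf.score(X_test, y_test)

print(f"RBF kernel with standardized features: {acc_scaled_rbf:.2f}")
```

Comparing this number with the unscaled RBF result above indicates how much of the difference comes from preprocessing rather than from the kernel itself.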


Q.8 What is the Naïve Bayes classifier, and why is it called "Naïve"?

--> Naïve Bayes Classifier is a probabilistic supervised learning algorithm based on Bayes' Theorem, primarily used for classification tasks.

It predicts the probability that a given data point belongs to a particular class based on the likelihoods of its features.

Bayes' Theorem:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

where $C$ is the class and $X$ is the feature vector.


It's called "Naïve" because it assumes that all features are independent of each other given the class label — which is rarely true in real-world data.

For example, in text classification, words often depend on each other (e.g., "not good"), but Naïve Bayes treats them as independent.

Despite this naïve assumption, the algorithm works surprisingly well in practice, especially for high-dimensional data like text.
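
Under the independence assumption, P(X | C) becomes a simple product of per-feature likelihoods. The sketch below works through that by hand for a two-word message. All priors and likelihoods are made-up numbers for illustration, and the `posterior` helper is hypothetical.

```python
# Hypothetical class priors and per-word likelihoods P(word | class).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

def posterior(words):
    """Posterior P(class | words) under the conditional-independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[cls][word]  # product of per-feature likelihoods
        scores[cls] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {cls: s / evidence for cls, s in scores.items()}

result = posterior(["free", "meeting"])
for cls, prob in result.items():
    print(f"P({cls} | 'free', 'meeting') = {prob:.3f}")
```

Each word contributes its own factor regardless of which other words appear, which is exactly the "naïve" part.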

Types of Naïve Bayes Classifiers:
 - GaussianNB
 - MultinomialNB
 - BernoulliNB

	​


Q.9 Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes.

--> Comparison of the three main types of Naïve Bayes classifiers

1. Gaussian Naïve Bayes (GNB)
 - Feature type: continuous (numeric) features.
 - Assumption: within each class, the features follow a normal (Gaussian) distribution.
 - Typical uses: Iris dataset (flower measurements), medical data, sensor readings.

2. Multinomial Naïve Bayes (MNB)
 - Feature type: discrete or count-based data (non-negative integers).
 - Assumption: features represent counts or frequencies (e.g., word counts).
 - Typical uses: text classification (spam detection, topic categorization).

3. Bernoulli Naïve Bayes (BNB)
 - Feature type: binary features (0 or 1).
 - Assumption: each feature is a binary variable indicating presence/absence.
 - Typical uses: document classification with binary features, sentiment analysis.
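
In scikit-learn the three variants share the same `fit`/`predict` interface, so the choice mostly comes down to matching the class to the feature type. The sketch below pairs each variant with a tiny hand-made array of the matching kind. All values and labels are hypothetical and far too small for real training.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # hypothetical labels for four samples

# Continuous measurements -> GaussianNB
X_continuous = np.array([[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.5]])
# Word counts (non-negative integers) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 2, 2]])
# Presence/absence indicators -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)
    print(f"{name}: predictions on training data = {predictions[name]}")
```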


Q.10 Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.


In [1]:
# Import necessary libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naïve Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Gaussian Naïve Bayes Classifier Accuracy: {:.2f}".format(accuracy))


Gaussian Naïve Bayes Classifier Accuracy: 0.94
