Question 1: What is Information Gain, and how is it used in Decision Trees?

Answer:
Information Gain measures how much uncertainty (entropy) in the data is reduced after splitting it based on a particular feature.

In simple terms:

It tells us how well a given attribute separates the training examples according to their target classification.

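Information Gain (IG) is the criterion that Decision Tree algorithms such as ID3 use to decide which feature to split on at each step of building the tree: the feature with the highest IG is chosen. C4.5 uses a normalized variant (gain ratio), and CART typically uses Gini impurity instead.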

Information Gain = Entropy(parent) - summation over subsets of (proportion of samples in subset × entropy of subset):

$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{i} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$$

Where:

$S$: the original dataset

$S_i$: the subsets after splitting on attribute $A$

$|S_i| / |S|$: the proportion of samples in subset $S_i$

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2(p_i)$$

where $p_i$ is the probability of class $i$.
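
As a rough illustration of the two formulas, the sketch below computes entropy and information gain in plain Python. The helper names `entropy` and `information_gain` and the toy labels (9 "yes" and 5 "no", split into two subsets) are assumptions made for this example; they do not come from any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of the subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical toy data: 14 samples, split into two subsets by some attribute.
parent = ["yes"] * 9 + ["no"] * 5
left = ["yes"] * 6 + ["no"] * 2
right = ["yes"] * 3 + ["no"] * 3

ig = information_gain(parent, [left, right])
print(f"Entropy(parent): {entropy(parent):.3f}")
print(f"Information gain: {ig:.3f}")
```

A tree builder would evaluate `information_gain` for every candidate feature and split on the one with the largest value.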


Question 2: What is the difference between Gini Impurity and Entropy?
Hint: Directly compares the two main impurity measures, highlighting strengths,
weaknesses, and appropriate use cases

Answer:
Entropy comes from information theory — it tells how much information (or surprise) there is in the data.

Gini Impurity is a simpler metric that tells how often a randomly chosen sample would be misclassified if it were labeled randomly according to class distribution.

Range:
Entropy - From 0 (pure node) to 1 for two classes (maximum uncertainty); in general up to log2(k) for k classes.
Gini Impurity - From 0 (pure node) to 0.5 for two classes (maximum impurity); in general up to 1 - 1/k for k classes.

Split criterion:
Entropy - Selects the feature that provides the highest "Information Gain," or the greatest reduction in uncertainty.
Gini Impurity - Selects the feature that results in the largest drop in Gini impurity after a split.

Strengths, weaknesses, and use cases:
Entropy - Can sometimes produce slightly better results, though the difference is often negligible. It is slower to compute because of the logarithm, can be more sensitive to imbalanced class distributions, and tends to produce slightly more balanced trees.
Gini Impurity - Faster to compute because it does not involve logarithmic calculations, so it is preferred for very large datasets where computational efficiency is a key concern. It tends to isolate the most frequent class in its own branch.
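
To make the comparison concrete, here is a minimal sketch that evaluates both measures on a few two-class distributions. The function names `gini` and `entropy` are chosen for this example only.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Pure node, skewed node, and perfectly mixed node.
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, f"gini={gini(probs):.3f}", f"entropy={entropy(probs):.3f}")
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 for entropy) at the 50/50 split.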

Question 3: What is Pre-Pruning in Decision Trees?

Answer:
Pre-pruning, or early stopping, is a technique in decision trees that halts the growth of a tree before it becomes fully developed to prevent overfitting. It stops the tree-building process based on certain criteria, such as reaching a maximum depth, having too few samples in a node, or not achieving a minimum impurity decrease from a split. This approach restricts the complexity of the model from the beginning, leading to a smaller, simpler tree that may generalize better to new data.

Common stopping criteria:

Maximum depth: The tree has reached a predefined maximum number of levels.

Minimum samples: A node has fewer than a specified minimum number of samples, so it is not split further.

Minimum impurity decrease: The best split at a node does not reduce impurity (e.g., Gini impurity or entropy) by at least a specified threshold.

Benefits:

Faster training: It is computationally more efficient than allowing the tree to grow completely and then trimming it afterwards (post-pruning).

Prevents overfitting: By limiting complexity early on, it reduces the risk of overfitting the training data.
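
The stopping criteria above map onto constructor parameters of scikit-learn's `DecisionTreeClassifier`. The following is a minimal sketch on the Iris dataset; the threshold values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fully grown tree: no stopping criteria beyond the defaults.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops early when any criterion is hit.
# The threshold values below are illustrative assumptions.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples needed to split a node
    min_impurity_decrease=0.01,  # minimum impurity decrease for a split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.2f}"
    )
```

In practice these thresholds would usually be chosen by cross-validation rather than fixed by hand.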


Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).
Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.

Answer:

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Decision tree that uses Gini impurity as the split criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

clf.fit(X_train, y_train)

# Accuracy on the held-out test set
accuracy = clf.score(X_test, y_test)
print(f"Model Accuracy: {accuracy:.2f}")

# Importance of each feature (the values sum to 1)
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")


Model Accuracy: 1.00

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


Question 5: What is a Support Vector Machine (SVM)?


Answer:
A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks — most commonly for binary classification.

SVM tries to draw a line (in 2D) or a hyperplane (in higher dimensions) that divides the data into classes as clearly as possible — while keeping the widest possible gap between them.

That gap is called the margin.

The data points closest to the boundary are called Support Vectors — they define the position and orientation of the hyperplane.

Types of SVM:

Linear SVM:
Used when the data is linearly separable (can be divided by a straight line or plane).

Non-Linear SVM:
Uses a Kernel Function (like polynomial or RBF) to map data into a higher-dimensional space where a linear separator can be found.
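
As a small illustration of the margin and the support vectors, the sketch below fits a linear SVM to a hypothetical, linearly separable toy dataset. The data points are invented for this example; for a linear SVM the margin width is 2 / ||w||, where w is the normal vector of the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: two clusters in 2D.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_[0]                      # normal vector of the separating hyperplane
margin_width = 2 / np.linalg.norm(w)  # distance between the two margin boundaries

print("Support vectors:\n", clf.support_vectors_)
print(f"Margin width: {margin_width:.3f}")
print("Predictions:", clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

Only the points listed in `support_vectors_` determine the boundary; moving any other training point (without crossing the margin) would leave the hyperplane unchanged.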

Question 6: What is the Kernel Trick in SVM?

Answer:
The Kernel Trick is one of the most powerful concepts in Support Vector Machines (SVMs) — it allows SVMs to handle non-linear data efficiently without explicitly transforming it into a higher-dimensional space.

Instead of manually converting data to a higher dimension, the Kernel Trick uses a mathematical function (kernel) to compute the similarity between data points in that higher-dimensional space — without ever performing the transformation explicitly.

Normally, to separate non-linear data, we transform data points $x$ using a mapping function $\phi(x)$:

$$\phi : X \to H$$

where $H$ is a higher-dimensional space.

But instead of computing $\phi(x)$ directly, SVM only needs dot products between pairs of points:

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$

Here, $K(x_i, x_j)$ is the Kernel Function. It computes the dot product in the high-dimensional space without explicitly computing $\phi(x)$.

That’s the Kernel Trick.
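
A quick numerical check can make this tangible. The sketch below uses the degree-2 homogeneous polynomial kernel on 2D points as an example; the feature map `phi` and the function name `poly_kernel` are chosen for this illustration.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2D point (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # dot product in the 3D feature space
implicit = poly_kernel(x, z)       # same value, computed in the original 2D space

print(explicit, implicit)
```

Both values agree, but the kernel never builds the 3D vectors. For kernels such as RBF the implicit feature space is infinite-dimensional, so the explicit route is not available at all.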

Question 7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

Answer:

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

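# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Standardize features; SVMs are sensitive to feature scale.
# Note: the scaler is fit on the full dataset here. Fitting it on the
# training split only would avoid leaking test-set statistics.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# One classifier per kernel, otherwise default hyperparameters
svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

# Predict on the held-out test set
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

acc_linear = accuracy_score(y_test, y_pred_linear)
acc_rbf = accuracy_score(y_test, y_pred_rbf)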

print(" SVM Classifier Comparison on Wine Dataset")
print("--------------------------------------------")
print(f"Linear Kernel Accuracy: {acc_linear:.4f}")
print(f"RBF Kernel Accuracy:    {acc_rbf:.4f}")

if acc_linear > acc_rbf:
    print("\n Linear kernel performed better.")
elif acc_linear < acc_rbf:
    print("\n RBF kernel performed better.")
else:
    print("\n  Both kernels performed equally well.")


 SVM Classifier Comparison on Wine Dataset
--------------------------------------------
Linear Kernel Accuracy: 0.9722
RBF Kernel Accuracy:    1.0000

 RBF kernel performed better.


Question 8: What is the Naïve Bayes classifier, and why is it called "Naïve"?

Answer:
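The Naïve Bayes classifier is a supervised machine learning algorithm based on Bayes’ Theorem. It is used mainly for classification tasks, especially text classification (such as spam detection and sentiment analysis).

It is called "naïve" because it assumes that all features are conditionally independent of one another given the class, so the likelihood of a feature set is simply the product of the per-feature likelihoods. This assumption rarely holds exactly in real data, but the classifier often works well in practice anyway.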

Bayes’ Theorem:

$$P(C \mid X) = \frac{P(X \mid C) \times P(C)}{P(X)}$$

$P(C \mid X)$: Probability of class C given feature set X (posterior)

$P(X \mid C)$: Probability of features given class C (likelihood)

$P(C)$: Prior probability of class C

$P(X)$: Probability of the feature set (evidence)

How it works:

Calculate prior probabilities $P(C)$ for each class (e.g., spam or not spam).

Calculate likelihoods $P(X \mid C)$ for each feature given the class.

Use Bayes’ theorem to compute posterior probabilities $P(C \mid X)$.

Choose the class with the highest posterior probability.
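
The steps above can be sketched by hand for a tiny spam filter. All probabilities below are hypothetical values made up for the illustration, and the helper name `posterior` is not from any library.

```python
# Hypothetical probabilities for a two-word vocabulary (illustrative values only).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[cls][word]  # multiply per-feature likelihoods
        scores[cls] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {cls: score / evidence for cls, score in scores.items()}


probs = posterior(["free"])
prediction = max(probs, key=probs.get)
print(probs, "->", prediction)
```

Real implementations usually sum log-probabilities instead of multiplying raw probabilities, which avoids numerical underflow when there are many features.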

Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes


Answer:

Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Bayes are all variants of the Naive Bayes classification algorithm. They differ in the distribution they assume for the features, which determines the type of data each one suits: continuous values, discrete counts, or binary values.

Gaussian Naive Bayes: Assumes features are continuous and distributed normally (like a bell curve). It's often used for numerical data like age or weight.

Multinomial Naive Bayes: Assumes features are discrete and represented as counts. It's suitable for data where features represent the number of times something occurs within a given category, like the frequency of words in a document.

Bernoulli Naive Bayes: Assumes features are binary (either present or absent). It's useful for situations where features are classified as on/off, like whether a word appears in a text.
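
As a minimal sketch of how the three variants pair with different feature types, the example below fits each scikit-learn class on a tiny, made-up dataset. The arrays are hypothetical and chosen only so that the two classes are easy to tell apart.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_cont = np.array([[1.0], [1.2], [3.8], [4.0]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1], [3.9]])

# Word counts per document -> MultinomialNB
X_counts = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[4, 0], [0, 4]])

# Word present / absent -> BernoulliNB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bnb_pred = BernoulliNB().fit(X_binary, y).predict([[1, 0], [0, 1]])

print("GaussianNB:   ", gnb_pred)
print("MultinomialNB:", mnb_pred)
print("BernoulliNB:  ", bnb_pred)
```

The API is identical across the three classes; only the assumed per-feature distribution changes.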

Question 10: Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.


Answer:


In [4]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

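# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Gaussian Naive Bayes suits these continuous features
gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

# Overall accuracy on the test set
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Per-class precision, recall, and F1-score
print("\n Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))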


Model Accuracy: 0.9737

 Classification Report:
              precision    recall  f1-score   support

   malignant       1.00      0.93      0.96        43
      benign       0.96      1.00      0.98        71

    accuracy                           0.97       114
   macro avg       0.98      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

