**Question 1: What is Information Gain, and how is it used in Decision Trees?**

ANSWER

In simple terms, Information Gain (IG) measures how much “information” or “purity” we gain when we split data based on a particular feature in a decision tree. Decision trees are built by repeatedly splitting data into smaller parts to make the classes as pure as possible. Information Gain helps us decide which feature is the best to split on.

Concept

Information Gain is based on entropy, which measures impurity or randomness in the data.
When we split the dataset, we expect to reduce the randomness.
The more reduction in entropy after the split, the higher the Information Gain.

The formula is:

$$IG(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \times \mathrm{Entropy}(S_v)$$

Where:

S = the whole dataset

A = the attribute we split on

Sᵥ = the subset of S in which A takes the value v


Example

Suppose we are predicting whether students pass or fail based on whether they study or not.

Study	Result
Yes	Pass
Yes	Pass
No	Fail
No	Fail
Yes	Pass

Before splitting, there are 3 Pass and 2 Fail, so entropy = −(0.6 log₂ 0.6 + 0.4 log₂ 0.4) ≈ 0.97.
If we split on “Study,” the “Yes” group is pure (all Pass) and the “No” group is pure (all Fail), so each child has entropy 0.
Entropy drops from 0.97 to 0, so the Information Gain of this split is about 0.97.
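The arithmetic for this example can be checked with a few lines of Python. This is a minimal sketch: the helper names `entropy` and `information_gain` are illustrative, not taken from any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of each child subset."""
    total = len(labels)
    weighted_child_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == value]
        weighted_child_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted_child_entropy

# The five rows of the Study/Result table above
study = ["Yes", "Yes", "No", "No", "Yes"]
result = ["Pass", "Pass", "Fail", "Fail", "Pass"]

parent_entropy = entropy(result)
gain = information_gain(study, result)
print(f"Entropy before split: {parent_entropy:.3f}")
print(f"Information Gain:     {gain:.3f}")
```

Because both child groups are pure, the gain equals the full parent entropy.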

Diagram

 Root Node
              [Pass/Fail]
                /     \
           Study=Yes  Study=No
           [Pure]      [Pure]

Because both child nodes are pure, this split achieves the highest Information Gain possible for this dataset.

In Decision Trees

At every node, the tree checks all features, calculates their Information Gain, and picks the one with the highest gain. This continues until all nodes are pure or other stopping criteria are met.
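To see which feature a tree actually picks, one option is to fit a small tree with the entropy criterion and inspect it. The sketch below assumes scikit-learn and the Iris dataset (neither is required by the explanation above); `max_depth=2` is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="entropy" makes the tree choose splits by Information Gain
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# tree_.feature[0] holds the index of the feature used at the root node
root_feature = iris.feature_names[tree.tree_.feature[0]]
print("Feature chosen at the root:", root_feature)

# Text view of the splits selected at each node
print(export_text(tree, feature_names=iris.feature_names))
```

On Iris the root split lands on one of the two petal measurements, since either one separates the setosa class perfectly.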

Conclusion

Information Gain helps the decision tree identify which feature best separates the data. Choosing the highest-gain feature at each node produces purer subsets after every split, which lets the tree separate the classes with a small number of readable rules.

**Question 2: What is the difference between Gini Impurity and Entropy?**

Hint: Directly compares the two main impurity measures, highlighting strengths, weaknesses, and appropriate use cases.

ANSWER:

Both Gini Impurity and Entropy are measures of impurity used in decision trees, but they are slightly different in how they calculate impurity and how sensitive they are to class changes.

| Measure | Formula | Range (two classes) | Computation |
| --- | --- | --- | --- |
| Entropy | −Σ pᵢ log₂(pᵢ) | 0 to 1 | Slightly slower (needs logarithms) |
| Gini Impurity | 1 − Σ pᵢ² | 0 to 0.5 | Slightly faster (no logarithms) |

**1. Entropy**

Entropy comes from information theory and measures the level of disorder or randomness in data.
If all samples in a node belong to one class, entropy = 0.
If they are evenly split between two classes, entropy = 1.

Example:
If a dataset has 50% Pass and 50% Fail:
Entropy = −(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1

**2. Gini Impurity**

Gini Impurity measures how often a randomly chosen element from the dataset would be incorrectly labeled if we randomly labeled it according to the distribution of labels.

Example:

If the same dataset has 50% Pass and 50% Fail:
Gini = 1 − (0.5² + 0.5²) = 0.5

Impurity versus class probability p (lower = purer): both curves are 0 at p = 0 and p = 1 and peak at p = 0.5, where Entropy reaches 1 and Gini reaches 0.5.
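The two measures can also be compared numerically for a two-class node. This is a small sketch; the function names `binary_entropy` and `binary_gini` are illustrative.

```python
from math import log2

def binary_entropy(p):
    """Entropy of a two-class node where one class has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node has no uncertainty (and log2(0) is undefined)
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def binary_gini(p):
    """Gini impurity of a two-class node where one class has probability p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"p={p:.1f}  entropy={binary_entropy(p):.3f}  gini={binary_gini(p):.3f}")
```

Both measures are zero for pure nodes and largest at p = 0.5, which is why they usually rank candidate splits similarly.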

**Comparison**

Entropy uses logarithms → more computationally heavy.

Gini is simpler and faster to calculate.

Both usually lead to similar splits, but Gini tends to isolate the most frequent class first.

Conclusion

Gini Impurity and Entropy serve the same purpose: measuring impurity in nodes. Gini is simpler and slightly faster to compute, while Entropy has a direct information-theoretic interpretation. In practice, both usually give very similar trees, so the choice rarely changes the results much.

**Question 3: What is Pre-Pruning in Decision Trees?**

ANSWER

When we build a decision tree, it can easily become too complex and start memorizing the training data instead of learning patterns. This problem is called overfitting. To avoid it, we use pruning, which means cutting the tree shorter.
Pre-pruning means stopping the tree from growing too large while it’s being built.

Concept

Instead of allowing the tree to grow fully and then cutting it, pre-pruning stops splitting early based on certain conditions such as:

Maximum tree depth

Minimum number of samples per leaf

Minimum information gain required for a split

Example

Suppose we have a tree that perfectly classifies all training examples. However, when we test it on new data, the accuracy drops sharply. That means it overfitted.
If we had used pre-pruning, we could have stopped it at a smaller, more general level.

Diagram

Without Pre-Pruning          With Pre-Pruning
      Root                        Root
     /   \                       /   \
   Split Split                 Decision
 Too Deep Tree             Simpler General Tree

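**Pre-Pruning Parameters in Python (scikit-learn)**

max_depth: limits how deep the tree grows

min_samples_split: minimum samples required to split a node

min_samples_leaf: minimum samples required in each leaf node

max_leaf_nodes: maximum number of leaves allowed

min_impurity_decrease: minimum impurity reduction required for a split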
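The effect of these parameters can be seen by comparing a fully grown tree with a pre-pruned one. The sketch below is only illustrative: the Breast Cancer dataset, the 70/30 split, and the values `max_depth=3` and `min_samples_leaf=5` are assumptions chosen for the example, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# No limits: the tree keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: growth stops early because of the depth and leaf-size limits
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

for name, tree in [("fully grown", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the fully grown tree is the usual sign of overfitting; the pre-pruned tree trades some training accuracy for a much smaller model.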

Conclusion

Pre-pruning prevents decision trees from growing unnecessarily complex, improving generalization. It’s a proactive approach that controls tree size during training rather than fixing overfitting afterward.

**Question 4: Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).**

Hint: Use criterion='gini' in DecisionTreeClassifier and access .feature_importances_.


ANSWER

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Train Decision Tree using Gini
model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X, y)

# Display Feature Importances
for name, importance in zip(data.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.4f}")

sepal length (cm): 0.0133
sepal width (cm): 0.0000
petal length (cm): 0.5641
petal width (cm): 0.4226


Explanation

The criterion='gini' tells the model to use Gini Impurity for splitting.

After fitting, .feature_importances_ shows which features contributed most.

For the Iris dataset, usually petal length and petal width are most important.

Conclusion

Using Gini Impurity, the decision tree identifies the most informative features efficiently. This method helps prioritize variables that best separate the classes.

**Question 5: What is a Support Vector Machine (SVM)?**

ANSWER

A Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It works by finding the best boundary (hyperplane) that separates different classes in the data.

Concept

SVM tries to find a line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) that maximizes the margin, which is the distance between the boundary and the nearest data points from each class (called support vectors).

Diagram

Class A (●)      | Margin |       Class B (○)
● ● ● ● ● ●  -------|-------- ○ ○ ○ ○ ○
      Support Vectors

Example

Suppose we want to classify emails as spam or not spam. Each email is represented by numerical features (like word count or keyword frequency).
SVM finds the best line or surface that divides spam emails from non-spam emails with maximum margin.

Mathematical Idea

The goal is to maximize:

$$\text{Margin} = \frac{2}{\lVert w \rVert}$$

where w is the weight vector.
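The margin formula can be checked on a tiny, linearly separable dataset. In this sketch the four points are made up for illustration, and a very large `C` is used as an assumption to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes separated by a gap of width 2 along the first axis
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C leaves almost no room for margin violations
svm = SVC(kernel="linear", C=1e6)
svm.fit(X, y)

w = svm.coef_[0]                    # weight vector of the separating hyperplane
margin = 2 / np.linalg.norm(w)      # margin width from the formula above

print("w =", w)
print("margin width =", round(margin, 3))
print("support vectors:\n", svm.support_vectors_)
```

The computed margin should be close to 2, the width of the gap between the two groups of points.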

Advantages

Works well for high-dimensional data
Effective even when the number of features > samples
Uses different kernels to handle non-linear data

Conclusion

SVM is a powerful algorithm that aims to find the most optimal decision boundary between classes. Its ability to handle both linear and non-linear data makes it one of the most reliable classifiers in data analytics.

**Question 6: What is the Kernel Trick in SVM?**

ANSWER

Not all data can be separated with a straight line. The Kernel Trick allows SVM to classify data that is not linearly separable by transforming it into a higher-dimensional space where it becomes separable.

Example

Imagine data shaped like two concentric circles.
In 2D, it’s impossible to draw a straight line to separate them.
But if we project it into 3D space, the circles become separable by a plane.

Diagram:

2D:            3D (after kernel)
 ○ ○ ○          ↑
○     ○         | Plane separates easily
 ○ ○ ○
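The concentric-circles idea can be tried out by doing the lift to 3D by hand. This is a sketch under stated assumptions: the data comes from scikit-learn's `make_circles`, and the third feature (squared distance from the origin) is one convenient choice of mapping, not the one a kernel SVM uses internally.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two noisy concentric circles in 2D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A linear SVM on the raw 2D points cannot separate the circles
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Lift to 3D: add the squared distance from the origin as a third feature
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_3d = SVC(kernel="linear").fit(X_3d, y).score(X_3d, y)

print(f"Linear SVM, original 2D features: {acc_2d:.2f}")
print(f"Linear SVM, lifted 3D features:   {acc_3d:.2f}")
```

In the lifted space the inner circle sits low and the outer circle sits high on the new axis, so a flat plane can separate them.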


How Kernel Trick Works

SVM doesn’t actually compute the transformation directly. It uses a mathematical function (kernel) to compute inner products in higher dimensions without explicitly transforming data.

Common Kernels

Linear Kernel → works for linearly separable data

$$K(x, y) = x^T y$$

Polynomial Kernel → good for curved boundaries

$$K(x, y) = (x^T y + c)^d$$

RBF (Radial Basis Function) → best for complex, non-linear boundaries

$$K(x, y) = e^{-\gamma \lVert x - y \rVert^2}$$
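The claim that a kernel computes an inner product in a higher-dimensional space can be verified directly for a simple case. The sketch below uses the polynomial kernel with d = 2 and c = 0 on 2D inputs; the function names and the sample vectors are illustrative.

```python
import numpy as np

def poly_kernel(x, y, c=0.0, d=2):
    """Polynomial kernel (x^T y + c)^d, computed in the original space."""
    return (x @ y + c) ** d

def explicit_map(x):
    """Explicit 3D feature map matching the degree-2 kernel with c = 0 in 2D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

kernel_value = poly_kernel(x, y)                    # no transformation needed
mapped_value = explicit_map(x) @ explicit_map(y)    # inner product after mapping

print("Kernel value:           ", kernel_value)
print("Explicit inner product: ", mapped_value)
```

Both routes give the same number, but the kernel never builds the 3D vectors. For higher degrees or the RBF kernel, the explicit space becomes very large or infinite-dimensional, which is where this shortcut matters.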

Conclusion

The Kernel Trick makes SVM extremely flexible. It allows a linear algorithm to solve non-linear problems without paying the cost of explicitly computing the high-dimensional transformation.

**Question 7: Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

Hint: Use SVC(kernel='linear') and SVC(kernel='rbf'), then compare accuracy scores after fitting on the same dataset.


from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Linear Kernel
linear_svm = SVC(kernel='linear')
linear_svm.fit(X_train, y_train)
linear_acc = accuracy_score(y_test, linear_svm.predict(X_test))

# RBF Kernel
rbf_svm = SVC(kernel='rbf')
rbf_svm.fit(X_train, y_train)
rbf_acc = accuracy_score(y_test, rbf_svm.predict(X_test))

print("Linear Kernel Accuracy:", linear_acc)
print("RBF Kernel Accuracy:", rbf_acc)


Linear Kernel Accuracy: 0.9814814814814815
RBF Kernel Accuracy: 0.7592592592592593


**Explanation**

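- The linear kernel works well when the classes are close to linearly separable, as they are here: it reaches about 98% test accuracy.
- The RBF kernel is more flexible, but it is sensitive to feature scale. The Wine features are on very different scales and were not standardized, which explains its lower accuracy (about 76%) in this run.
- After standardizing the features, the two kernels typically perform much more closely, so compare them on scaled data before drawing conclusions.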
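The lower RBF accuracy in the output above is likely tied to feature scaling rather than to the kernel itself. One way to check is to standardize the features before fitting; the sketch below uses `StandardScaler` in a pipeline, which is an addition to the original program, and keeps the same split.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# RBF kernel on raw features (as in the program above)
unscaled_rbf = SVC(kernel="rbf").fit(X_train, y_train)

# RBF kernel on standardized features; the scaler is fit on training data only
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

unscaled_acc = unscaled_rbf.score(X_test, y_test)
scaled_acc = scaled_rbf.score(X_test, y_test)

print("RBF accuracy without scaling:", unscaled_acc)
print("RBF accuracy with scaling:   ", scaled_acc)
```

Putting the scaler inside the pipeline keeps test data out of the scaling statistics, so the comparison stays fair.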

**Question 8: What is the Naïve Bayes Classifier, and Why is it Called “Naïve”?**

ANSWER

Naïve Bayes is a classification algorithm based on Bayes’ Theorem. It assumes that all features are independent of each other given the class, which is rarely true in real life, and that is why it is called “naïve.”

Bayes’ Theorem
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

In classification terms:

P(Class∣Features)∝P(Features∣Class)×P(Class)

Example

Suppose we want to predict if an email is spam based on two words: “offer” and “discount.”

Naïve Bayes assumes the occurrence of these words is independent, even though in reality they might co-occur often.
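The independence assumption turns the calculation into a simple product. The sketch below works through the two-word example by hand; every probability in it is a made-up value for illustration, not an estimate from real data.

```python
# Hypothetical probabilities, chosen only to illustrate the calculation
prior = {"spam": 0.4, "not spam": 0.6}
likelihood = {
    "spam": {"offer": 0.6, "discount": 0.5},
    "not spam": {"offer": 0.1, "discount": 0.05},
}

words = ["offer", "discount"]  # words observed in the email

# Naive assumption: multiply the per-word likelihoods as if independent
scores = {}
for label in prior:
    score = prior[label]
    for word in words:
        score *= likelihood[label][word]
    scores[label] = score

# Normalize so the two posteriors sum to 1
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

for label, p in posterior.items():
    print(f"P({label} | offer, discount) = {p:.3f}")
```

With these made-up numbers the email is classified as spam, because the product of prior and word likelihoods is far larger for the spam class.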

Diagram

Word Features → [offer, buy, click]

          ↓

   Naïve Bayes Model

          ↓

     Predict: Spam / Not Spam


Advantages

Fast and simple

Works well with small datasets

Great for text and email classification

Conclusion

Naïve Bayes is simple but effective, especially in text-related tasks. The “naïve” independence assumption simplifies computation, making it both efficient and practical.

**Question 9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.**


ANSWER

| Type | Data Type | Use Case | Distribution |
| --- | --- | --- | --- |
| Gaussian NB | Continuous | Sensor data, real values | Normal Distribution |
| Multinomial NB | Discrete counts | Word frequencies, text | Multinomial |
| Bernoulli NB | Binary features | Presence/absence | Bernoulli |

1. Gaussian Naïve Bayes

Used when features are continuous (e.g., height, temperature).
It assumes each feature follows a normal distribution within each class, with class-specific mean μ_y and variance σ_y².

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$


2. Multinomial Naïve Bayes

Used for discrete counts, such as word frequencies in text.
Common in document classification.

3. Bernoulli Naïve Bayes

Used when features are binary (0 or 1).
Example: if a word is present (1) or absent (0) in a document.
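The three variants map directly onto three scikit-learn classes. The sketch below pairs each with the kind of data it expects; the tiny arrays are invented for illustration and are far too small for real use.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # class labels shared by all three toy datasets

# Continuous measurements -> Gaussian NB
X_continuous = np.array([[1.2, 0.5], [1.0, 0.7], [3.1, 2.2], [2.9, 2.5]])
gnb = GaussianNB().fit(X_continuous, y)

# Word counts -> Multinomial NB
X_counts = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
mnb = MultinomialNB().fit(X_counts, y)

# Word present (1) or absent (0) -> Bernoulli NB
X_binary = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
bnb = BernoulliNB().fit(X_binary, y)

print("GaussianNB:   ", gnb.predict([[1.1, 0.6]]))
print("MultinomialNB:", mnb.predict([[3, 0]]))
print("BernoulliNB:  ", bnb.predict([[0, 1]]))
```

The fitting and prediction calls are identical across the three; only the assumed distribution of each feature changes.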

Conclusion

Each Naïve Bayes variant is suited to specific data types. Choosing the right one ensures better model performance and interpretability.

**Question 10: Breast Cancer Dataset**

Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from sklearn.datasets.


from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Train Gaussian Naive Bayes
model = GaussianNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9415204678362573


Explanation

The Breast Cancer dataset contains real-valued medical measurements.

GaussianNB is a natural fit because the features are continuous; it models each feature as normally distributed within each class.

On this split the model reaches about 94% accuracy, a strong result for such a simple model.
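A single train/test split gives only one accuracy estimate. To see how stable the result is, one option is k-fold cross-validation; the sketch below uses 5 folds, which is an arbitrary choice made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Accuracy on each of 5 held-out folds
scores = cross_val_score(GaussianNB(), X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:  ", round(scores.mean(), 3))
```

The spread across folds shows how much the accuracy depends on which samples end up in the test set.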

Conclusion

Gaussian Naïve Bayes is efficient for continuous medical datasets like Breast Cancer. Its speed and accuracy make it an excellent baseline model in data analysis projects.