In [3]:
#  What is Entropy in Decision Trees?

# Entropy measures the impurity (uncertainty) in a dataset.

# A decision tree uses entropy to decide where to split the data.

# Goal: split the data in such a way that each branch
# becomes more pure (less entropy).

#   purity vs impurity

# 🔹 Purity

# A node (or dataset) is called pure if all the samples
# inside it belong to the same class.

# Example:

# If a node contains 100 animals and all are Dogs, then the node is pure.

# In this case, there is no uncertainty.

# Entropy = 0, Gini = 0.

# Pure means perfectly classified (no mix of classes).

#  Impurity

# A node is impure if it contains a mix of different classes.

# Example:

# If a node has 50 Dogs and 50 Cats → maximum impurity (most uncertain).

# If a node has 80 Dogs and 20 Cats → still impure,
#  but less than the 50–50 case.

#  Impure means uncertain / mixed classes.

In [4]:
# How Decision Trees Use This


# Decision trees split data so that impurity decreases after each split.

# Measures of impurity:

# Entropy → 0 (pure) to 1 (most impure in binary case).

# Gini Index → 0 (pure) to 0.5 (most impure in binary case).

In [5]:
# gini

# 🔹 Properties

# Pure node → only one class present → Gini = 0

# Most impure node (equal distribution of classes) → Gini is maximum

In [6]:
# observation

#  1. The greater the uncertainty, the higher the entropy.

# Definition

#  Entropy and Uncertainty

# Entropy is a measure of uncertainty (or impurity).

# If the outcome is certain → entropy is low (0).

# If the outcome is uncertain / random → entropy is high.


# example

#  Binary Classification Example

# 100% Dog (no Cat)

# No uncertainty, you already know the class.

# Entropy = 0 (pure).

# 50% Dog, 50% Cat

# Maximum uncertainty → you can’t predict better than a random guess.

# Entropy = 1 (maximum).

# 80% Dog, 20% Cat

# Some uncertainty (but less than 50–50).

# Entropy ≈ 0.72.
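
# A minimal sketch to check the three cases above. The function name
# binary_entropy is only an illustrative choice, and 0 * log2(0) is
# treated as 0 so that a pure node gives exactly 0.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class node where one class has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # pure node: no uncertainty (0 * log2(0) is treated as 0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# The three Dog/Cat cases from the notes: 100% Dog, 50-50, and 80-20.
for p_dog in (1.0, 0.5, 0.8):
    print(f"P(Dog) = {p_dog:.1f} -> entropy = {binary_entropy(p_dog):.2f}")
```

# The printed values should match the notes: 0 for 100% Dog,
# 1 for the 50-50 case, and about 0.72 for the 80-20 case.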

# Conclusion:
# Decision trees work by splitting the data in a way that
# reduces uncertainty (entropy) step by step.

#  2. For a two-class (binary) problem, the minimum entropy is 0
#     and the maximum is 1.

# Entropy in classification problems:

# Minimum Entropy = 0
# → This happens when the node is pure (all samples belong to the same class).
# Example: 100% Dog, 0% Cat → Entropy = 0

# Maximum Entropy = log₂(c), where c = number of classes
# → For binary classification (c = 2): max entropy = log₂(2) = 1
# This happens when the classes are evenly split (50% Dog, 50% Cat).

# → For 3 classes (c = 3): max entropy = log₂(3) ≈ 1.585
# This happens when all three classes are equally distributed (1/3 each).

# Summary:

# For binary classification → min entropy = 0, max entropy = 1



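#  For multi-class classification →
#   min entropy = 0, max entropy = log₂(c)

# Either log₂ or the natural log (ln) can be used to calculate entropy.
# The base only changes the unit: log₂ gives bits, ln gives nats,
# so with ln the binary maximum is ln(2) ≈ 0.693 instead of 1.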
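
# A small sketch of the max-entropy rule, assuming evenly split classes.
# The helper names entropy and uniform are made up for this illustration.
# The base argument switches between log2 (bits) and ln (nats).

```python
import math


def entropy(probs, base=2):
    """Entropy of a class distribution; base 2 gives bits, base e gives nats."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def uniform(c):
    """Evenly split distribution over c classes (the most impure case)."""
    return [1 / c] * c


for c in (2, 3, 4):
    print(f"c = {c}: max entropy = {entropy(uniform(c)):.3f} bits "
          f"(log2(c) = {math.log2(c):.3f})")

# Same 50-50 node measured with the natural log instead of log2.
print(f"binary max in nats: {entropy(uniform(2), base=math.e):.3f}")
```

# Whichever base is used, the ranking of candidate splits stays the same,
# because changing the base only rescales every entropy by a constant.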



In [7]:
#  Entropy and KDE

# If the KDE curve is very flat (low peakedness) → data is spread out
#  → high entropy (more uncertainty).

# If the KDE curve is very sharp (high peakedness) →
#  data is concentrated in a small region → low entropy (less uncertainty).

#   Think of What Entropy Measures

# Entropy = average uncertainty / unpredictability.

# If the probability distribution is sharp (high peakedness) →
#  most of the probability mass is concentrated in a small region →
#  outcomes are more predictable → entropy is low.

# If the probability distribution is flat (low peakedness) →
# probability mass is spread over a wide region →
#  outcomes can occur in many places → more uncertainty → entropy is high.
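
# A rough sketch of the sharp-vs-flat idea. As an assumption, a normal
# (Gaussian) curve stands in for the KDE curve, because its entropy has a
# simple closed form: 0.5 * ln(2 * pi * e * sigma^2). The sigma values
# below are arbitrary illustration values.

```python
import math


def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a normal density with std. dev. sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


sharp = gaussian_entropy(0.5)  # narrow, sharply peaked curve
flat = gaussian_entropy(2.0)   # wide, flat curve

print(f"sharp curve (sigma = 0.5): {sharp:.3f} nats")
print(f"flat curve  (sigma = 2.0): {flat:.3f} nats")
```

# The flat curve gets the larger value. Note that this is the entropy of
# a continuous density, which (unlike class entropy in a tree node)
# can even drop below 0 when the curve is extremely sharp.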

In [8]:
# entropy

#    Meaning of Entropy

# Entropy is a measure of uncertainty, impurity, or randomness in the data.

# When the classes are mixed, entropy is high.

# When the data is pure (all samples belong to one class),
#  entropy is low (zero).

#  Parent Entropy = 1 (What it Means)

# In your case, the parent node has:

# Yes = 4 (50%)

# No = 4 (50%)

# This is a perfectly balanced distribution (50–50).

#  It means we are completely uncertain about the next sample →
#  it could be Yes or No with equal probability.

# Therefore, entropy takes its maximum value = 1 bit
#  (for binary classification).




# Comparison Cases

# All Yes (8 Yes, 0 No):

# H = 0

# → No uncertainty (we are fully sure outcome = Yes).

# All No (0 Yes, 8 No):


# H = 0

# → No uncertainty (we are fully sure outcome = No).

# 50–50 split (4 Yes, 4 No):

# H = 1

# → Maximum uncertainty (highest confusion).


In [9]:
#  So, Parent Entropy = 1 means the data is maximally impure
#            (50% Yes, 50% No).
#  It’s the highest possible uncertainty for binary classification.

In [10]:
# Information Gain (IG)

# Definition: Information Gain is a metric used in Decision Trees
# to measure how much a feature reduces the uncertainty (entropy) in the data.

# It basically tells:

#  “How good is this feature at splitting the data into pure groups?”



#  In One Line

# Information Gain = Reduction in uncertainty (entropy)
# after splitting on a feature.

In [11]:
# Example: Information Gain Calculation


# We have 8 samples in the parent node:

# Yes = 4

# No = 4

# Step 1: Parent Entropy
# H(Parent) = -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1

# Step 2: Split on the feature "Outlook"
# After splitting, the data becomes:

# Sunny group (4 samples): 3 Yes, 1 No

# Rainy group (4 samples): 1 Yes, 3 No

# Step 3: Child Entropies


# Sunny group (3 Yes, 1 No):
# H(Sunny) = -(3/4 * log2(3/4) + 1/4 * log2(1/4)) = 0.811

# Rainy group (1 Yes, 3 No):
# H(Rainy) = -(1/4 * log2(1/4) + 3/4 * log2(3/4)) = 0.811

# Step 4: Weighted Average Entropy of Children
# H(Children) = (4/8 * 0.811) + (4/8 * 0.811) = 0.811

# Step 5: Information Gain
# IG = H(Parent) - H(Children)
# IG = 1 - 0.811 = 0.189

# Final Result:
# The Information Gain for splitting on "Outlook" = 0.189 bits.

# Interpretation:

# This feature reduces uncertainty a little, but not perfectly.
# If one branch had all Yes and the other had all No,
# then IG would be 1 (maximum).
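
# The five steps above can be sketched in a few lines. The function names
# entropy and information_gain are illustrative; each node is written as
# a list of class counts [Yes, No].

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its class counts, e.g. [3, 1]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [4, 4]              # 4 Yes, 4 No
outlook = [[3, 1], [1, 3]]   # Sunny: 3 Yes, 1 No / Rainy: 1 Yes, 3 No
perfect = [[4, 0], [0, 4]]   # one branch all Yes, the other all No

print(round(information_gain(parent, outlook), 3))  # 0.189
print(round(information_gain(parent, perfect), 3))  # 1.0
```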

In [12]:
# Parent Entropy = Entropy of a node's data before it is split
# (for the root node, this is the entire dataset)

# Examples:

# All samples same (Yes, Yes, Yes, …) → Parent Entropy = 0

# Half Yes, half No → Parent Entropy = 1 (maximum)

# 6 Yes, 2 No → Parent Entropy = 0.811 (some uncertainty)



In [13]:
# Gini Impurity:

# Gini Impurity measures the probability of incorrectly classifying
#  a randomly chosen element if it was labeled
#  according to the distribution of classes in the dataset.

# Formula: Gini = 1 - Σ (pi²)
# (where pi is the probability of class i)

# If Gini = 0 → dataset is pure (all samples belong to one class).

# Higher Gini → more mixed classes, more impurity.
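
# A minimal sketch of the formula above, with a node given as a list of
# class counts. The function name gini is only an illustrative choice.

```python
def gini(counts):
    """Gini impurity 1 - sum(p_i^2) of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


# Dog/Cat nodes: pure, evenly mixed, and 80-20.
for counts in ([100, 0], [50, 50], [80, 20]):
    print(f"{counts} -> Gini = {gini(counts):.2f}")
```

# For the 80-20 node this gives 1 - (0.8² + 0.2²) = 0.32, which lies
# between the pure case (0) and the 50-50 case (0.5).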

In [14]:
# gini vs entropy

# Difference between Gini Impurity and Entropy

# Definition

# Entropy measures the amount of information (or uncertainty) in the dataset.

# Gini Impurity measures the probability of misclassifying
# a randomly chosen sample.

# Formula

# Entropy = – Σ (pi * log₂ pi)

# Gini = 1 – Σ (pi²)

# Range

# Entropy: 0 to 1 (for binary classification).

# Gini: 0 to 0.5 (for binary classification).

# Interpretation

# Entropy is based on information theory (information gain).

# Gini is based on probability of misclassification.

# Speed

# Entropy is slightly slower to compute (because of the log).

# Gini is faster (no log).

# Tree Splitting

# Both often give similar splits.

# But Gini tends to isolate the most frequent class in its own branch,
#  while Entropy tends to produce slightly more balanced trees.


In [15]:
# Relation between Information Gain and Impurity


# A useful split lowers the weighted impurity (Entropy or Gini)
# of the child nodes compared with the parent node.

# The greater the decrease in impurity, the higher the Information Gain.

# Formula:
# Information Gain = Parent Impurity – Weighted Average of Child Impurities

# So:

# Higher Information Gain → Better split (less impurity in child nodes)

# Lower Information Gain → Poor split (impurity is still high)

# Conclusion:
# More Information Gain = Less impurity left in the child nodes
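
# The formula above works with either impurity measure. Below is a
# sketch, assuming nodes are lists of class counts; impurity_decrease is
# a hypothetical helper name, and the 4 Yes / 4 No parent with a
# 3-1 / 1-3 split is reused from the Outlook example.

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted average child impurity."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = [4, 4]
good_split = [[3, 1], [1, 3]]     # children are purer than the parent
useless_split = [[2, 2], [2, 2]]  # children are as mixed as the parent

for name, measure in (("entropy", entropy), ("gini", gini)):
    print(name,
          round(impurity_decrease(parent, good_split, measure), 3),
          round(impurity_decrease(parent, useless_split, measure), 3))
```

# Both measures agree on the ranking here: the split that leaves purer
# children has the larger decrease, and the useless split has a decrease of 0.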



In [18]:
#  How a Decision Tree works

# A Decision Tree keeps splitting data into left and right
#  branches until it reaches small groups (leaf nodes).

# At the root, you have all rows (200 in your example).

# After each split, the rows are divided into left and right branches.

# At each node, the tree decides whether to stop and make a decision
# or to keep splitting further.

# Your teacher’s example

# Root Node (200 rows)

# First split → Left = 140 rows, Right = 60 rows

# Left Node (140 rows)

# Next split → Left = 100 rows, Right = 40 rows

# Left-Left Node (100 rows)

# Next split → Left = 2 rows, Right = 98 rows

# What “Decision depends here” means

# The final prediction always depends on the leaf node
# where the sample ends up.

# If a leaf node has only 2 rows and both belong to the same class →
# the decision is fixed for that path.

# So if a new sample satisfies all conditions to reach that leaf,
# the tree will predict the class based on those 2 rows only.

# That’s why your teacher said: “Decision depends here.” →
# It means the final prediction depends on the distribution of data
# inside that small node.

# Example to make it clearer

# Suppose we’re predicting “Buy Product (Yes/No)”

# Root (200 people)

# Split: Age < 40 → Left = 140, Right = 60

# Left branch (140): Split Salary < 50k → Left = 100, Right = 40

# Left-left node (100): Split Gender = Female → Left = 2 (both “Yes”), Right = 98 (mixed)

# If a new sample is Female + Salary < 50k + Age < 40 →
# It will fall into the 2-row leaf, where both are “Yes”.
# So the Decision Tree will always predict Yes for this case.
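
# The path through this example tree can be sketched as plain if-checks.
# The function name route is hypothetical, "50k" is assumed to mean
# 50,000, and only the 2-row leaf has a known label ("Yes") in the notes,
# so the sketch returns the row counts of the visited nodes, not a prediction.

```python
def route(age, salary, gender):
    """Follow the example splits and return the row counts of the nodes visited."""
    path = [200]                  # root: all 200 people
    if age >= 40:
        return path + [60]        # right branch of the root
    path.append(140)              # Age < 40
    if salary >= 50_000:
        return path + [40]        # right branch of the 140-row node
    path.append(100)              # Salary < 50k
    if gender == "Female":
        return path + [2]         # tiny leaf: both training rows are "Yes"
    return path + [98]            # mixed node with 98 rows


print(route(age=30, salary=40_000, gender="Female"))  # [200, 140, 100, 2]
```

# Any new sample that lands in the last node of that path gets its
# prediction from just 2 training rows.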

**Here we discuss the case from page 7.**

In [19]:
# Decision Tree Explanation

# At the root node, the split is based on Height ≤ 174.9.

# If Height ≤ 174.9 → go to the left branch.

# Then check Weight ≤ 66.5:

# If Weight ≤ 66.5 → Prediction = No

# If Weight > 66.5 → Prediction = Yes

# If Height > 174.9 (e.g., heights of 175–180 and beyond) → go to the right branch.

# Then check Weight ≤ 76.5:

# If Weight ≤ 76.5 → Prediction = Yes

# If Weight > 76.5 → Prediction = Yes

# So basically, in the right branch (Height > 174.9),
#  regardless of weight, the result is always Yes.
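
# The same rules written as nested if-statements, as a sketch. The
# function name predict is illustrative, and no units are assumed
# beyond the thresholds given above.

```python
def predict(height, weight):
    """Apply the Height/Weight rules of the tree described above."""
    if height <= 174.9:
        # Left branch: the weight check decides the prediction.
        if weight <= 66.5:
            return "No"
        return "Yes"
    # Right branch: both weight leaves give the same prediction.
    if weight <= 76.5:
        return "Yes"
    return "Yes"


print(predict(170, 60))  # No
print(predict(170, 70))  # Yes
print(predict(180, 90))  # Yes
```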

# Overfitting Case

# If the dataset had very complex splits
#  (e.g., separate conditions for Height = 175, 176, 177, …, 180,
#   or very fine-grained weight intervals like 66.5, 67.2, 67.8, etc.),
#   the tree would become unnecessarily large
#   and memorize the training data instead of learning general rules.

# This is called overfitting.

# A simple tree like the one above is not overfitted,
#  because it uses only a few coarse thresholds.
#  On the right side (Height > 174.9) both weight leaves predict Yes,
#  so that last split changes nothing and the branch
#  could be collapsed into a single Yes leaf.

In [None]:
# Refined version of Nitin sir's example (page 1)

# Example tree (the one you gave):

#               (200)
#              /     \
#          (140)     (60)
#          /   \
#      (100)   (40)
#      /   \
#    (2)   (98)


# Explanation of Overfitting in This Case:

# At the root (200), the split into 140 vs 60 seems okay
# because both groups are reasonably large.

# At 140 → split into 100 vs 40, this is also fine
#  because both are still meaningful groups.

# But then at 100 → split into 2 vs 98 (the small "(2)" node),
# this is where the problem comes:

# The tree is trying too hard to separate even the tiny number
# of samples (just 2).

# This small branch does not generalize well;
# it is only capturing noise or outliers in training data.

# In testing (new/unseen data),
# such tiny splits rarely help and usually hurt performance.

# Why is this Overfitting?

# The model is fitting the training data perfectly,
#         even for very small groups.

# Instead of learning the general pattern
#  (big splits like 200→140/60 or 140→100/40),
#   it goes deep to create rules for just 2 samples.

# These small branches are not representative of the real distribution.

#  So, overfitting happens when your tree keeps splitting
# until very small nodes (like 2 samples) are created.
#  A better approach is to prune the tree or to require a minimum
#  number of samples in every leaf (e.g., at least 10).
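
# A minimal sketch of the minimum-samples rule applied to the three
# splits of the example tree. The helper split_allowed is hypothetical;
# scikit-learn's DecisionTreeClassifier offers a min_samples_leaf
# parameter with the same meaning, but it is not used here.

```python
def split_allowed(left_rows, right_rows, min_samples_leaf=10):
    """A split is kept only if both children have at least min_samples_leaf rows."""
    return left_rows >= min_samples_leaf and right_rows >= min_samples_leaf


# (parent rows, (left rows, right rows)) for each split in the example tree.
splits = [(200, (140, 60)), (140, (100, 40)), (100, (2, 98))]

for parent_rows, (left, right) in splits:
    verdict = "keep" if split_allowed(left, right) else "reject (leaf too small)"
    print(f"{parent_rows} -> {left} / {right}: {verdict}")
```

# With a minimum of 10 rows per leaf, the first two splits are kept and
# the 2 / 98 split is rejected, so the 100-row node stays a leaf.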


In [21]:
# If Study Hours ≤ 2 and Pen = Red → Fail

# If Study Hours ≤ 2 and Pen = Blue → Pass


# This means that when test data arrives for the "Study Hours ≤ 2" case,
# the tree is likely to make more mistakes: it has not understood the
# real pattern, it has only memorized Pass/Fail for these two conditions
# (pen color is unlikely to be what decides the result).

# Hence this is a case of overfitting.

