### **Gini Index Calculation for Decision Trees in Python**


#### **1. Gini Index**


The Gini Index for a dataset $S$ is calculated as:

$$
\text{Gini Index}(S) = 1 - \sum_{i=1}^n (p_i)^2
$$

Where:

- $p_i$ is the proportion of instances in class $i$.
- $n$ is the number of different classes.
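
As a quick check of the formula, take a dataset with 9 instances of one class and 5 of another (14 in total), so $p_1 = 9/14$ and $p_2 = 5/14$:

$$
\text{Gini Index}(S) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 1 - \frac{106}{196} = \frac{90}{196} \approx 0.4592
$$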


In [48]:
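def calculate_gini_index(class_counts: list[int]) -> float:
    """Return the Gini Index for a list of per-class instance counts."""
    total_instances = sum(class_counts)
    if total_instances == 0:
        # Convention chosen here: an empty node is treated as pure
        return 0.0
    # Accumulate the sum of squared class proportions (p_i ** 2)
    sum_of_squares = 0.0
    for count in class_counts:
        sum_of_squares += (count / total_instances) ** 2
    gini_index = 1 - sum_of_squares
    return gini_index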

In [49]:
class_counts = [9, 5]  # 9 instances of class 'Yes', 5 instances of class 'No'
gini_index = calculate_gini_index(class_counts)
print(f"Gini Index: {gini_index:.4f}")

Gini Index: 0.4592
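
To put a value like 0.4592 in context, it may help to look at the extremes of the measure. The sketch below is only illustrative: `gini_from_counts` is a hypothetical standalone helper (not one of the functions defined in this notebook), reimplemented so the block runs on its own, and the class counts are made up.

```python
def gini_from_counts(class_counts: list[int]) -> float:
    """Gini Index from per-class counts (hypothetical standalone helper)."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


# Made-up class counts chosen to show the extremes
pure = gini_from_counts([14, 0])          # every instance in one class
balanced = gini_from_counts([7, 7])       # two equally frequent classes
three_way = gini_from_counts([5, 5, 5])   # three equally frequent classes

print(f"Pure node:      {pure:.4f}")
print(f"Balanced (k=2): {balanced:.4f}")
print(f"Balanced (k=3): {three_way:.4f}")
```

A pure node scores 0, and a node with $k$ equally frequent classes reaches the maximum of $1 - 1/k$, which is 0.5 for two classes.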


---


#### **2. Weighted Gini Index**

The Weighted Gini Index of a split of $S$ on attribute $A$ is calculated as:

$$
\text{Weighted Gini Index}(S, A) = \sum_{j=1}^m \left( \frac{|S_j|}{|S|} \times \text{Gini Index}(S_j) \right)
$$

Where:

- $A$ is the attribute used to split $S$.
- $m$ is the number of subsets after the split.
- $|S_j|$ is the number of instances in subset $j$.
- $|S|$ is the total number of instances before the split.
- $\text{Gini Index}(S_j)$ is the Gini Index of subset $j$.


In [50]:
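def calculate_weighted_gini_index(subsets: list[list[int]]) -> float:
    """Return the size-weighted average Gini Index of the subsets produced by a split."""
    total_instances = sum(sum(subset) for subset in subsets)
    weighted_gini = 0.0
    for subset in subsets:
        subset_gini = calculate_gini_index(subset)
        # Weight each subset's impurity by its share of all instances
        weighted_gini += (sum(subset) / total_instances) * subset_gini
    return weighted_gini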

In [51]:
# After split, we have two subsets:
# Subset 1: 4 'Yes', 2 'No'
# Subset 2: 5 'Yes', 3 'No'
subsets = [[4, 2], [5, 3]]
weighted_gini_index = calculate_weighted_gini_index(subsets)
print(f"Weighted Gini Index: {weighted_gini_index:.4f}")

Weighted Gini Index: 0.4583
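
The same result can be worked out by hand for the two subsets $S_1$ (4 'Yes', 2 'No') and $S_2$ (5 'Yes', 3 'No'):

$$
\text{Gini Index}(S_1) = 1 - \left(\frac{4}{6}\right)^2 - \left(\frac{2}{6}\right)^2 = \frac{16}{36}, \qquad
\text{Gini Index}(S_2) = 1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2 = \frac{30}{64}
$$

$$
\text{Weighted Gini Index}(S, A) = \frac{6}{14} \times \frac{16}{36} + \frac{8}{14} \times \frac{30}{64} = \frac{4}{21} + \frac{15}{56} = \frac{11}{24} \approx 0.4583
$$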


---


#### **3. Gini Gain**

The Gini Gain of a split of $S$ on attribute $A$ is calculated as:


$$
\text{Gini Gain}(S, A) = \text{Gini Index}(S) - \text{Weighted Gini Index}(S, A)
$$

Where:

- $\text{Gini Index}(S)$ is the Gini Index before the split.
- $\text{Weighted Gini Index}(S, A)$ is the weighted Gini Index after the split.

> Prefer the split with the largest Gini Gain: a larger gain means a greater reduction in impurity.

In [52]:
def calculate_gini_gain(gini_before_split: float, subsets: list[list[int]]) -> float:
    """Return the reduction in Gini Index achieved by splitting into the given subsets."""
    weighted_gini_index = calculate_weighted_gini_index(subsets)
    gini_gain = gini_before_split - weighted_gini_index
    return gini_gain

In [53]:
# Before split: 9 instances of 'Yes', 5 instances of 'No'
original_class_counts = [9, 5]
gini_before_split = calculate_gini_index(original_class_counts)

# After split, we have two subsets:
# Subset 1: 4 'Yes', 2 'No'
# Subset 2: 5 'Yes', 3 'No'
subsets = [[4, 2], [5, 3]]

In [54]:
gini_gain = calculate_gini_gain(gini_before_split, subsets)
print(f"Gini Gain: {gini_gain:.4f}")

Gini Gain: 0.0009
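
A gain of 0.0009 means this split barely reduces impurity. When several candidate splits are available, a tree builder would typically compare their gains and keep the largest. The sketch below illustrates that comparison under stated assumptions: the helper names (`gini`, `weighted_gini`, `gini_gain`) are hypothetical standalone versions of the functions above, and the counts for `split_b` are made up for illustration.

```python
def gini(class_counts: list[int]) -> float:
    """Gini Index from per-class counts."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)


def weighted_gini(subsets: list[list[int]]) -> float:
    """Size-weighted average Gini Index over the subsets of a split."""
    total = sum(sum(subset) for subset in subsets)
    return sum((sum(subset) / total) * gini(subset) for subset in subsets)


def gini_gain(parent_counts: list[int], subsets: list[list[int]]) -> float:
    """Impurity of the parent minus the weighted impurity of its subsets."""
    return gini(parent_counts) - weighted_gini(subsets)


parent_counts = [9, 5]  # 9 'Yes', 5 'No' before any split

candidate_splits = {
    "split_a": [[4, 2], [5, 3]],  # the split used in the cells above
    "split_b": [[6, 1], [3, 4]],  # made-up counts, for illustration only
}

# Compute the gain of every candidate, then keep the largest
gains = {
    name: gini_gain(parent_counts, subsets)
    for name, subsets in candidate_splits.items()
}
best_name = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: Gini Gain = {gain:.4f}")
print(f"Best split: {best_name}")
```

Here `split_b` wins because its subsets are noticeably purer than the parent, while `split_a` leaves the class proportions almost unchanged.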


#### **Summary**


1. **Gini Index**: Measures the impurity of a dataset based on class proportions.
2. **Weighted Gini Index**: Computes the weighted average of Gini Indices for subsets after a split.
3. **Gini Gain**: Measures the reduction in impurity (Gini Index) due to a split; the split with the largest gain is preferred.
