### Functions for calculating the following:
- Gini impurity
- Weighted Gini impurity
- Information gain using Gini impurity
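
For reference, the three quantities can be written as follows. The notation is an assumption introduced here, not taken from the notebook: $p_k$ is the proportion of class $k$ in a node, $S$ is the set of samples at the parent node, and $S_v$ is the subset of $S$ where feature $A$ takes the value $v$.

$$G(S) = 1 - \sum_{k} p_k^2$$

$$G_{\text{weighted}}(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} \, G(S_v)$$

$$IG(S, A) = G(S) - G_{\text{weighted}}(S, A)$$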

In [1]:
!gdown  1l53Fgkg1G1ekCxxgaDQ00EXrnSMTeJj-

Downloading...
From: https://drive.google.com/uc?id=1l53Fgkg1G1ekCxxgaDQ00EXrnSMTeJj-
To: E:\Scaler-Notes-Git\DSML-Notes\M-15 ML-Supervised Algorithms\04 ML Decision Trees-2\sample_data.csv


In [2]:
import pandas as pd
import numpy as np

In [3]:
sample_data = pd.read_csv('sample_data.csv')
sample_data.head()

Unnamed: 0,Gender,Age_less_35,JobRole,Attrition
0,Male,True,Laboratory Technician,0
1,Male,False,Sales Executive,1
2,Male,True,Sales Representative,1
3,Female,False,Healthcare Representative,0
4,Male,True,Sales Executive,0


In [4]:
sample_data.Attrition.value_counts()

Attrition
0    831
1    169
Name: count, dtype: int64

In [5]:
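def gini_impurity(y):
    # Class proportions: count of each label divided by the number of samples
    p = y.value_counts() / y.shape[0]
    # Gini impurity = 1 - sum of squared class proportions
    gini = 1 - np.sum(p**2)

    return gini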

In [6]:
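def calculate_weighted_gini(feature, y):
    # Each distinct value of the feature defines one child node
    categories = feature.unique()

    weighted_gini_impurity = 0

    for category in categories:
        # Target labels of the rows that fall into this child node
        y_category = y[feature == category]
        gini_impurity_category = gini_impurity(y_category)
        # Weight the child's impurity by its share of the parent's samples
        weight = y_category.shape[0] / y.shape[0]
        weighted_gini_impurity += weight * gini_impurity_category

    return weighted_gini_impurity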

In [7]:
def information_gain_gini(feature, y):

    parent_gini = gini_impurity(y)
    child_gini = calculate_weighted_gini(feature,y)

    ig = parent_gini - child_gini

    return ig
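
To see how the three functions fit together, here is a minimal sketch on a toy split that is small enough to check by hand. The data (`toy_feature`, `toy_target`) and the helper `gini` are hypothetical and introduced only for illustration. The sketch uses `groupby` instead of the explicit loop above as a cross-check, so it is an alternative formulation rather than the notebook's own implementation.

```python
import numpy as np
import pandas as pd


def gini(y):
    # 1 - sum of squared class proportions
    p = y.value_counts(normalize=True)
    return 1 - np.sum(p ** 2)


# Hypothetical toy data: 4 rows, one categorical feature, binary target
toy_feature = pd.Series(["a", "a", "b", "b"])
toy_target = pd.Series([0, 1, 1, 1])

# Parent node: proportions 0.25 and 0.75 -> 1 - (0.0625 + 0.5625) = 0.375
parent = gini(toy_target)

# Child nodes: "a" holds labels [0, 1] (Gini 0.5), "b" holds [1, 1] (Gini 0.0)
weights = toy_target.groupby(toy_feature).size() / len(toy_target)
child_ginis = toy_target.groupby(toy_feature).apply(gini)

# Weighted child impurity: 0.5 * 0.5 + 0.5 * 0.0 = 0.25
weighted = (weights * child_ginis).sum()

# Information gain: 0.375 - 0.25 = 0.125
gain = parent - weighted
print(parent, weighted, gain)
```

A pure child node such as `"b"` contributes zero impurity, so the gain here comes entirely from isolating those two rows.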

Gini impurity at the root node:

In [8]:
gini_impurity(sample_data.Attrition)

0.28087799999999996
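
This value can be checked by hand from the class counts printed by `value_counts()` earlier (831 rows with `Attrition` 0 and 169 with `Attrition` 1). A minimal sketch of the arithmetic, with the counts copied into a hypothetical `counts` dict:

```python
# Class counts copied from the value_counts() output above
counts = {0: 831, 1: 169}
total = sum(counts.values())

# Class proportions: 0.831 and 0.169
proportions = [c / total for c in counts.values()]

# Gini impurity = 1 - (0.831**2 + 0.169**2) = 1 - 0.719122
root_gini = 1 - sum(p ** 2 for p in proportions)
print(round(root_gini, 6))  # 0.280878
```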

Weighted Gini impurity of the child nodes after splitting on `JobRole`:

In [9]:
calculate_weighted_gini(sample_data.JobRole, sample_data.Attrition)

0.26022396036321827

Information gain using Gini impurity for `JobRole`:

In [10]:
information_gain_gini(sample_data.JobRole, sample_data.Attrition)

0.020654039636781696
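
As a quick sanity check, this gain should equal the root Gini impurity minus the weighted child impurity for `JobRole`. A minimal sketch using the two values printed earlier, copied into hypothetical variable names:

```python
# Values copied from the outputs of In [8] and In [9]
parent_gini = 0.28087799999999996
child_gini = 0.26022396036321827

# Information gain is the reduction in impurity from parent to children
ig = parent_gini - child_gini
print(round(ig, 6))  # 0.020654
```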

Comparing information gain (using Gini impurity) across features:

In [11]:
for feature in sample_data.columns[:-1]:
    print(f'Information Gain for {feature} is {information_gain_gini(sample_data[feature],sample_data.Attrition)}')

Information Gain for Gender is 1.2832567979348397e-06
Information Gain for Age_less_35 is 0.008400808101418078
Information Gain for JobRole is 0.020654039636781696
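
`JobRole` has the largest information gain of the three features, so a greedy decision tree would pick it for the first split, while `Gender` leaves the impurity almost unchanged. A minimal sketch of that selection step, with the printed gains copied into a hypothetical `gains` dict:

```python
# Information gains copied from the output above
gains = {
    "Gender": 1.2832567979348397e-06,
    "Age_less_35": 0.008400808101418078,
    "JobRole": 0.020654039636781696,
}

# The best split is the feature with the highest information gain
best_feature = max(gains, key=gains.get)
print(best_feature)  # JobRole

# Rank all features from most to least informative
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)  # ['JobRole', 'Age_less_35', 'Gender']
```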
