# Gini Impurity

### For a two-class problem, Gini impurity ranges from 0 to 0.5. 0 means every sample belongs to one class (a pure dataset), and 0.5 means the two classes are evenly split.
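The standard definition behind that range is 1 minus the sum of the squared class proportions. The snippet below is a minimal sketch of that formula. The helper name `gini_from_counts` is an assumption made for illustration and is not part of this notebook.

```python
def gini_from_counts(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of per-class sample counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one sample")
    return 1.0 - sum((c / total) ** 2 for c in counts)


pure = gini_from_counts([4, 0])           # one class only -> 0.0
even_two = gini_from_counts([5, 5])       # two classes, evenly split -> 0.5
even_three = gini_from_counts([3, 3, 3])  # three classes, evenly split -> about 0.667
print(pure, even_two, even_three)
```

With more than two classes the upper bound rises above 0.5 (it is 1 - 1/k for k evenly split classes), which is why the 0.5 limit only applies to a binary target such as Yes/No.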

![](Gini_Impurity_1.png)

### The dataset comes from the article linked below, and I uploaded it as Golf_dataset.csv. More information is available here: https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/#google_vignette

## Import Libraries

In [1]:
import pandas as pd
import numpy as np

## Load the dataset

In [2]:
data = pd.read_csv("./Golf_dataset.csv")
df = pd.DataFrame(data)

In [3]:
data

Unnamed: 0,Day,Outlook,Temp.,Humidity,Wind,Decision
0,1,Sunny,Hot,High,Weak,No
1,2,Sunny,Hot,High,Strong,No
2,3,Overcast,Hot,High,Weak,Yes
3,4,Rain,Mild,High,Weak,Yes
4,5,Rain,Cool,Normal,Weak,Yes
5,6,Rain,Cool,Normal,Strong,No
6,7,Overcast,Cool,Normal,Strong,Yes
7,8,Sunny,Mild,High,Weak,No
8,9,Sunny,Cool,Normal,Weak,Yes
9,10,Rain,Mild,Normal,Weak,Yes


In [4]:
# drop Day column
# x = dataframe
# y = Decision?
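The notes in this cell could be turned into code roughly as follows. This is only a sketch: the names `features` and `target` are hypothetical, and the three rows are copied from the table above just to keep the example runnable on its own.

```python
import pandas as pd

# Three rows copied from the displayed table, only to make the sketch runnable
df = pd.DataFrame({
    "Day": [1, 2, 3],
    "Outlook": ["Sunny", "Sunny", "Overcast"],
    "Temp.": ["Hot", "Hot", "Hot"],
    "Humidity": ["High", "High", "High"],
    "Wind": ["Weak", "Strong", "Weak"],
    "Decision": ["No", "No", "Yes"],
})

# Day is only a row identifier, so it is dropped along with the label column
features = df.drop(columns=["Day", "Decision"])  # hypothetical name: candidate split columns
target = df["Decision"]                          # hypothetical name: label column
print(list(features.columns))
```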

### Check which values exist in the Decision column

In [5]:
sample_num = df.shape[0]    # number of rows in the dataset
decision_values = df['Decision'].unique().tolist()    # unique labels, in order of first appearance
print(decision_values)

['No', 'Yes']


## Functions to compute Gini impurity

### Table-building function

In [6]:
def MakeTable(df, feature):
    table = pd.DataFrame(columns=['value', 'Yes', 'No', 'Total_num', 'Gini_Impurity'])

    for i in range(len(df)):    # one pass per row; assumes the default 0..n-1 index
        # Add a row to the table the first time a feature value is seen
        if df.loc[i, feature] not in table['value'].values:    # .values converts the series into a numpy array
            table.loc[len(table), 'value'] = df.loc[i, feature]
            
            # Initialize the other columns for the new value
            table.loc[len(table)-1, 'No'] = 0
            table.loc[len(table)-1, 'Yes'] = 0
            table.loc[len(table)-1, 'Total_num'] = 0
            table.loc[len(table)-1, 'Gini_Impurity'] = 0.0

        # Calculate numbers for each value
        for j in range(len(table)):        
            if df.loc[i, feature] == table.loc[j, 'value']:
                table.loc[j, 'Total_num'] += 1
                if df.loc[i, 'Decision'] == 'Yes':
                    table.loc[j, 'Yes'] += 1
                else:
                    table.loc[j, 'No'] += 1
                    
    return table

### Gini impurity function

In [7]:
def Gini_Impurity(table):
    gini_impurity = 0
    total_samples = table['Total_num'].sum()    # all samples across the feature's values
    for i in range(table.shape[0]):
        # Gini impurity of one feature value: 1 - p_yes^2 - p_no^2
        p_yes = table.loc[i, 'Yes'] / table.loc[i, 'Total_num']
        p_no = table.loc[i, 'No'] / table.loc[i, 'Total_num']
        table.loc[i, 'Gini_Impurity'] = 1 - p_yes**2 - p_no**2
        # weight by the share of samples that take this value
        gini_impurity += table.loc[i, 'Gini_Impurity'] * (table.loc[i, 'Total_num'] / total_samples)
        
    return gini_impurity, table

## Test functions

In [8]:
feature = 'Outlook'

# Make table first
table = MakeTable(df, feature)
print('\n========', feature, 'Table ========')
print(table)

# Calculate gini impurity
gini_impurity, table = Gini_Impurity(table)
print('\n======== Final', feature, 'Table ========')
print(table)
print('Gini Impurity of [', feature, '] =', gini_impurity)


======== Outlook Table ========
      value Yes No Total_num Gini_Impurity
0     Sunny   2  3         5           0.0
1  Overcast   4  0         4           0.0
2      Rain   3  2         5           0.0

======== Final Outlook Table ========
      value Yes No Total_num Gini_Impurity
0     Sunny   2  3         5          0.48
1  Overcast   4  0         4           0.0
2      Rain   3  2         5          0.48
Gini Impurity of [ Outlook ] = 0.34285714285714286
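The weighted result can be checked by hand from the Yes/No counts in the table above. The sketch below redoes the arithmetic with exact fractions. It is a verification aid only and does not replace the functions in this notebook.

```python
from fractions import Fraction

# (Yes, No) counts per Outlook value, copied from the table above
counts = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}
n_samples = sum(yes + no for yes, no in counts.values())  # 14 samples in total

weighted = Fraction(0)
for value, (yes, no) in counts.items():
    total = yes + no
    # Gini impurity of this value: 1 - p_yes^2 - p_no^2
    gini = 1 - Fraction(yes, total) ** 2 - Fraction(no, total) ** 2
    # weight by the share of samples that take this value
    weighted += gini * Fraction(total, n_samples)

print(weighted, float(weighted))
```

Sunny and Rain each contribute (5/14) × 0.48 and Overcast contributes nothing, so the total is 12/35, which matches the printed value.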


## Things to do
1. Generalize the hard-coded `Decision` column and `Yes`/`No` labels so the functions work with other target columns
2. Print the Gini impurity of every feature column
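
One possible direction for both items is sketched below, assuming `pd.crosstab` is acceptable for counting. The function name `weighted_gini` and the variable names are hypothetical. The data is only the ten rows displayed earlier in this notebook, so the numbers differ from the 14-sample output above.

```python
import pandas as pd


def weighted_gini(df, feature, target):
    """Weighted Gini impurity of `target` after splitting on `feature`."""
    counts = pd.crosstab(df[feature], df[target])  # class counts per feature value
    totals = counts.sum(axis=1)                    # samples per feature value
    proportions = counts.div(totals, axis=0)       # class shares per feature value
    gini_per_value = 1.0 - (proportions ** 2).sum(axis=1)
    weights = totals / totals.sum()                # share of all samples per value
    return float((gini_per_value * weights).sum())


# The ten rows shown in the notebook output (not the full dataset)
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
]
golf = pd.DataFrame(rows, columns=["Outlook", "Temp.", "Humidity", "Wind", "Decision"])

# Score every column except the target
scores = {
    feature: weighted_gini(golf, feature, "Decision")
    for feature in golf.columns.drop("Decision")
}
for feature, score in scores.items():
    print(f"{feature}: {score:.4f}")
```

Because the target column is passed as an argument and the class labels are read from the data, nothing in this version depends on the labels being `Yes` and `No`.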