# C4.5 Classifier
C4.5 is an improvement over ID3 (Iterative Dichotomiser 3).
- Target feature: ```Job Offer```
- We split the dataset on the feature that has the maximum **gain ratio**.
- We repeat this process on each resulting subset until every subset contains a single value of the target feature.
# Theory
## Entropy of a dataset
$$\text{Entropy}(T, key) = -\sum_{i=1}^n p_i\log_2(p_i)$$
- $T$ represents the dataset.
- $key$ represents the column for which we find the entropy.
- $p_i$ is the fraction of rows in $T$ whose $key$ value is the $i$-th of its $n$ unique values. In our case, the key feature is ```Job Offer```.
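
As a quick check of the formula, the sketch below computes the entropy of the ```Job Offer``` column of the dataset used in this notebook (7 `Yes`, 3 `No`). The helper name `entropy` is an assumption made for this illustration; it is not the notebook's `calc_entropy`.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# Job Offer column of the dataset, in row order: 7 "Yes" and 3 "No".
job_offer = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

print(round(entropy(job_offer), 4))  # 0.8813
```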
## Information Gain of a feature
$$\text{Information Gain}(f) = \text{Entropy}(\text{dataset}, \text{target feature})-\sum_{i=1}^n \text{Probability of}\ f_i\times\text{Entropy}( \sigma_{f=f_i}\text{dataset}, \text{target feature})$$
- $f_i$ is the $i$-th of the $n$ unique values of feature $f$.
- $\text{Probability of}\ f_i$ is the fraction of rows in the dataset for which $f = f_i$.
- $\sigma_{f=f_i}\text{dataset}$ is the subset of the dataset whose rows have $f = f_i$; its entropy is taken w.r.t. the target feature.
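
The following minimal sketch applies this definition to the ```CGPA``` column of the dataset used in this notebook. The helpers `entropy` and `information_gain` are hypothetical names introduced for the illustration, working on plain Python lists rather than a DataFrame.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy of each feature-value subset."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    weighted = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - weighted

# CGPA and Job Offer columns of the dataset, in row order.
cgpa = [">=9", ">=8", ">=9", "<8", ">=8", ">=9", "<8", ">=9", ">=8", ">=8"]
job_offer = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

print(round(information_gain(cgpa, job_offer), 4))  # 0.5568
```

Only the `>=9` subset is impure (3 `Yes`, 1 `No`), so it is the only subset that contributes to the weighted sum.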
## Split Info
Split info is the entropy of the dataset w.r.t. the feature being split on, rather than the target feature.
$$\text{Split info of feature } f = \text{Entropy}(\text{entire dataset}, \text{feature } f)$$
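
As a worked example, ```CGPA``` in the dataset used in this notebook takes the value `>=9` in 4 rows, `>=8` in 4 rows and `<8` in 2 rows out of 10, so:

$$\text{Split info of CGPA} = -\left(\frac{4}{10}\log_2\frac{4}{10} + \frac{4}{10}\log_2\frac{4}{10} + \frac{2}{10}\log_2\frac{2}{10}\right) \approx 1.5219$$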
## Gain ratio
$$\text{Gain ratio of feature } f = \frac{\text{Information Gain of feature }f}{\text{Split info of feature }f}$$
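
To make the ratio concrete, the short sketch below divides information gain by split info for each feature, using the values this notebook reports for the first split, and picks the feature with the largest gain ratio. The dictionary name `reported` is an assumption made for the illustration.

```python
# (information gain, split info) pairs as reported in the first-split output of this notebook.
reported = {
    "CGPA": (0.5567796494470396, 1.5219280948873626),
    "Interactive": (0.09127744624168022, 0.9709505944546688),
    "Practical Knowledge": (0.2448381015706647, 1.4854752972273346),
    "Communication Skills": (0.5203268517870115, 1.4854752972273346),
}

# Gain ratio = information gain / split info, computed per feature.
gain_ratios = {name: gain / split for name, (gain, split) in reported.items()}
best = max(gain_ratios, key=gain_ratios.get)

print(best, round(gain_ratios[best], 4))  # CGPA 0.3658
```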
# Importing libraries and the datasets

In [45]:
import numpy as np
import pandas as pd
data = pd.read_csv("./dataset.csv")
data

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
0,>=9,Yes,Very Good,Good,Yes
1,>=8,No,Good,Moderate,Yes
2,>=9,No,Average,Poor,No
3,<8,No,Average,Good,No
4,>=8,Yes,Good,Moderate,Yes
5,>=9,Yes,Good,Moderate,Yes
6,<8,Yes,Good,Poor,No
7,>=9,No,Very Good,Good,Yes
8,>=8,Yes,Good,Good,Yes
9,>=8,Yes,Average,Good,Yes


# Function to calculate Entropy of a dataset

In [46]:
from math import log
log2 = log(2)

# function to calculate entropy of a dataset
def calc_entropy(dataset, key = "Job Offer"):
    categories = dataset[key].unique()
    total_no_obs = dataset.shape[0]
    
    entropy_dataset = 0.0
    for category in categories:
        number = dataset[dataset[key] == category].shape[0]
        prob = float(number) / float(total_no_obs)
        entropy_dataset -= (prob * log(prob))

    return entropy_dataset / log2

# Function to calculate the Information Gain and Gain Ratio of each feature

In [78]:
def get_root_node(dataset, key = "Job Offer"):
    inf_gain_map = {}
    
    for col in dataset.columns:
        if col != key:
            features_array = dataset[col].unique()
            # start from the entropy of the target feature over the whole dataset
            inf_gain = calc_entropy(dataset, key)
            
            # subtract the weighted entropy of each subset produced by splitting on `col`
            for category in features_array:
                tmp_dataset = dataset[dataset[col] == category]
                entropy  = calc_entropy(tmp_dataset, key)
                prob = float(tmp_dataset.shape[0]) / float(dataset.shape[0])
                inf_gain -= (entropy * prob)

            # split info is the entropy of the feature column itself
            split_info = calc_entropy(dataset, col)
            inf_gain_map[col] = {
                "information gain": inf_gain,
                "split info": split_info,
                "gain ratio": inf_gain / split_info
            }

    root_features = []
    max_gain_ratio = max(
        inf_gain_map[col]['gain ratio'] for col in inf_gain_map.keys()
    )
    
    # it is possible that 2 features have the same gain ratio and they're both maximum
    for col in inf_gain_map.keys():
        if inf_gain_map[col]['gain ratio'] == max_gain_ratio:
            root_features.append(col)
        
    return root_features, inf_gain_map
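
One edge case is worth noting: if a column holds a single value in the current subset, its split info is 0 and the division in `get_root_node` raises `ZeroDivisionError`. This does not happen in this notebook, because the column used for a split is dropped before the next one. A possible guard is sketched below; `safe_gain_ratio` is a hypothetical helper, and returning 0.0 for such a column is an assumption, not part of the original code.

```python
def safe_gain_ratio(information_gain, split_info, eps=1e-12):
    """Return information_gain / split_info, or 0.0 when split_info is (near) zero.

    A split info of zero means the feature has one unique value in this subset,
    so splitting on it cannot separate the rows and it should never be chosen.
    """
    if split_info < eps:
        return 0.0
    return information_gain / split_info

print(safe_gain_ratio(0.0, 0.0))  # 0.0
print(safe_gain_ratio(0.8112781244591328, 1.5))
```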

# First split

In [87]:
import json

root1, if_gain_map = get_root_node(data)
print(f"First split feature(s): {', '.join(root1)}")
print(json.dumps(if_gain_map, indent = 4))

First split feature(s): CGPA
{
    "CGPA": {
        "information gain": 0.5567796494470396,
        "split info": 1.5219280948873626,
        "gain ratio": 0.36583834106055235
    },
    "Interactive": {
        "information gain": 0.09127744624168022,
        "split info": 0.9709505944546688,
        "gain ratio": 0.0940083324146332
    },
    "Practical Knowledge": {
        "information gain": 0.2448381015706647,
        "split info": 1.4854752972273346,
        "gain ratio": 0.16482138883605757
    },
    "Communication Skills": {
        "information gain": 0.5203268517870115,
        "split info": 1.4854752972273346,
        "gain ratio": 0.35027634101907357
    }
}


In [88]:
features_array = data.CGPA.unique()

split_datasets_1 = []
for category in features_array:
    split_datasets_1.append(data[data.CGPA == category])

split_datasets_1[0]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
0,>=9,Yes,Very Good,Good,Yes
2,>=9,No,Average,Poor,No
5,>=9,Yes,Good,Moderate,Yes
7,>=9,No,Very Good,Good,Yes


In [89]:
split_datasets_1[1]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
1,>=8,No,Good,Moderate,Yes
4,>=8,Yes,Good,Moderate,Yes
8,>=8,Yes,Good,Good,Yes
9,>=8,Yes,Average,Good,Yes


In [90]:
split_datasets_1[2]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
3,<8,No,Average,Good,No
6,<8,Yes,Good,Poor,No


# Second split

In [94]:
base_data_set = split_datasets_1[0].drop('CGPA', axis = 'columns')

root2, if_gain_map = get_root_node(base_data_set)
print(f"Second split feature(s): {', '.join(root2)}")
print(json.dumps(if_gain_map, indent = 4))

Second split feature(s): Practical Knowledge, Communication Skills
{
    "Interactive": {
        "information gain": 0.31127812445913283,
        "split info": 1.0,
        "gain ratio": 0.31127812445913283
    },
    "Practical Knowledge": {
        "information gain": 0.8112781244591328,
        "split info": 1.5,
        "gain ratio": 0.5408520829727552
    },
    "Communication Skills": {
        "information gain": 0.8112781244591328,
        "split info": 1.5,
        "gain ratio": 0.5408520829727552
    }
}


In [96]:
features_array = data["Practical Knowledge"].unique()

split_datasets_2 = []
for category in features_array:
    dataset = split_datasets_1[0][split_datasets_1[0]["Practical Knowledge"] == category]
    split_datasets_2.append(dataset)

split_datasets_2[0]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
0,>=9,Yes,Very Good,Good,Yes
7,>=9,No,Very Good,Good,Yes


In [97]:
split_datasets_2[1]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
5,>=9,Yes,Good,Moderate,Yes


In [98]:
split_datasets_2[2]

Unnamed: 0,CGPA,Interactive,Practical Knowledge,Communication Skills,Job Offer
2,>=9,No,Average,Poor,No
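
The manual splits above can be folded into one recursive routine. The sketch below is a minimal illustration on plain Python dictionaries rather than a DataFrame. The names `entropy`, `gain_ratio` and `build_tree` are hypothetical and not part of this notebook. It also makes two assumptions: a tie in gain ratio goes to the feature listed first, and a feature with zero split info gets a gain ratio of 0.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, feature, target):
    """Information gain of `feature` w.r.t. `target`, divided by its split info."""
    gain = entropy([r[target] for r in rows])
    for value in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    split_info = entropy([r[feature] for r in rows])
    return gain / split_info if split_info > 0 else 0.0  # assumption: 0 for single-valued features

def build_tree(rows, features, target="Job Offer"):
    """Return a nested dict {feature: {value: subtree}} with class labels as leaves."""
    labels = [r[target] for r in rows]
    # Stop when the subset is pure or no features remain (majority label in that case).
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # max() keeps the first feature on ties (assumed tie-breaking rule).
    best = max(features, key=lambda f: gain_ratio(rows, f, target))
    remaining = [f for f in features if f != best]
    values = dict.fromkeys(r[best] for r in rows)  # unique values in first-seen order
    return {best: {v: build_tree([r for r in rows if r[best] == v], remaining, target)
                   for v in values}}

columns = ["CGPA", "Interactive", "Practical Knowledge", "Communication Skills", "Job Offer"]
table = [
    (">=9", "Yes", "Very Good", "Good", "Yes"),
    (">=8", "No", "Good", "Moderate", "Yes"),
    (">=9", "No", "Average", "Poor", "No"),
    ("<8", "No", "Average", "Good", "No"),
    (">=8", "Yes", "Good", "Moderate", "Yes"),
    (">=9", "Yes", "Good", "Moderate", "Yes"),
    ("<8", "Yes", "Good", "Poor", "No"),
    (">=9", "No", "Very Good", "Good", "Yes"),
    (">=8", "Yes", "Good", "Good", "Yes"),
    (">=8", "Yes", "Average", "Good", "Yes"),
]
rows = [dict(zip(columns, values)) for values in table]

tree = build_tree(rows, columns[:-1])
print(tree)
```

With this tie-breaking rule the sketch splits on ```CGPA``` first and then on ```Practical Knowledge``` inside the `>=9` branch, matching the two splits carried out by hand above.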


***