### Decision Tree

In this section, we implement a Decision Tree classifier from scratch: we build the full `DecisionTree` class and test it on a dataset.

A Decision Tree chooses its splits by measuring how much each candidate split reduces the impurity of the data. Three metrics are involved:
1. Entropy
2. Gini Impurity
3. Information Gain

#### Entropy
Entropy measures the impurity (disorder) of a dataset: the more evenly the class labels are mixed, the higher the entropy. It is computed over the class labels of the data. The formula for entropy \( E(D) \) is:

\[
E(D) = -\sum_{i=1}^{c} p_i \log_2(p_i)
\]

Where:
- \( E(D) \) is the entropy of the dataset \( D \).
- \( c \) is the number of classes.
- \( p_i \) is the probability of class \( i \) in the dataset.

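Entropy is 0 when the dataset is pure (all instances belong to one class) and reaches its maximum, \( \log_2(c) \), when all \( c \) classes are equally likely. For two classes it therefore ranges from 0 to 1.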
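The entropy formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used in the class further down; the function name `entropy` and the toy label arrays are assumptions made for this example.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    # Count how often each distinct label occurs
    _, counts = np.unique(labels, return_counts=True)
    # Convert counts to class proportions p_i
    probs = counts / counts.sum()
    # E(D) = -sum(p_i * log2(p_i)); every p_i here is > 0, so log2 is safe
    return float(-np.sum(probs * np.log2(probs)))

# Hypothetical two-class label sets used only for illustration
pure = np.array(["a", "a", "a", "a"])      # one class only
balanced = np.array(["a", "a", "b", "b"])  # 50/50 mix
skewed = np.array(["a", "a", "a", "b"])    # 75/25 mix

for name, labels in [("pure", pure), ("balanced", balanced), ("skewed", skewed)]:
    print(name, entropy(labels))
```

The pure set has entropy 0, the balanced set has entropy 1, and the 75/25 set falls in between at roughly 0.811.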

#### Gini Impurity
Gini Impurity is another measure of the impurity of a dataset. It is the probability that a randomly chosen instance is misclassified when it is labeled at random according to the class distribution of the dataset. The formula for Gini Impurity \( G(D) \) is:

\[
G(D) = \sum_{i=1}^{c} p_i (1 - p_i) = 1 - \sum_{i=1}^{c} p_i^2
\]


Where:
- \( G(D) \) is the Gini Impurity of the dataset \( D \).
- \( c \) is the number of classes.
- \( p_i \) is the probability of class \( i \) in the dataset.

A dataset is pure if the Gini Impurity is 0.
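As a quick check of the Gini formula, the sketch below evaluates it on a few toy label sets. The function name `gini_impurity` and the label arrays are assumptions made for this example. It also confirms numerically that \( \sum_i p_i (1 - p_i) \) equals \( 1 - \sum_i p_i^2 \), since the proportions sum to 1.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a 1-D array of class labels."""
    # Class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    # G(D) = sum(p_i * (1 - p_i))
    return float(np.sum(probs * (1.0 - probs)))

def gini_impurity_alt(labels):
    """Equivalent form: 1 - sum(p_i ** 2)."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - float(np.sum(probs ** 2))

# Hypothetical label sets used only for illustration
pure = np.array([0, 0, 0, 0])      # one class only
balanced = np.array([0, 0, 1, 1])  # 50/50 mix
skewed = np.array([0, 0, 0, 1])    # 75/25 mix

for name, labels in [("pure", pure), ("balanced", balanced), ("skewed", skewed)]:
    print(name, gini_impurity(labels), gini_impurity_alt(labels))
```

For two classes the Gini Impurity peaks at 0.5 (the balanced case), while the 75/25 mix gives 0.375.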

#### Information Gain
Information Gain (IG) is used to choose the best feature on which to split the dataset. It measures the reduction in entropy achieved by splitting the dataset on that feature. The formula for Information Gain is:

\[
IG(D, A) = E(D) - \sum_{v \in Values(A)} \frac{|D_v|}{|D|} E(D_v)
\]

Where:
- \( IG(D, A) \) is the Information Gain of attribute \( A \) for dataset \( D \).
- \( E(D) \) is the entropy of the dataset \( D \).
- \( Values(A) \) represents the possible values of attribute \( A \).
- \( D_v \) is the subset of \( D \) for which attribute \( A \) takes value \( v \).
- \( |D_v| \) is the number of instances in \( D_v \), and \( |D| \) is the total number of instances in \( D \).

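The attribute with the highest information gain is selected for the split. The same formula can be used with Gini Impurity \( G \) in place of entropy \( E \); the implementation below supports both and uses Gini Impurity when searching for the best split. Because the features there are numeric, each candidate split is binary: rows whose feature value is \( \le \) the threshold go left, and the rest go right.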
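As a small worked example of the Information Gain formula, the sketch below compares a split that separates the classes perfectly with one that does not separate them at all. The helper names `entropy` and `information_gain` and the toy labels are assumptions made for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent, subsets):
    """IG = E(parent) - sum(|D_v| / |D| * E(D_v)) over the subsets D_v."""
    n = len(parent)
    weighted_child_entropy = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted_child_entropy

# Hypothetical parent node with two classes in equal proportion
parent = np.array(["a", "a", "b", "b"])

# A split that separates the classes completely
perfect_split = [np.array(["a", "a"]), np.array(["b", "b"])]

# A split that leaves both children as mixed as the parent
useless_split = [np.array(["a", "b"]), np.array(["a", "b"])]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The perfect split recovers the full parent entropy (a gain of 1), while the useless split yields a gain of 0, which is why such a split would never be preferred.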


In [2]:
import numpy as np
import pandas as pd

In [52]:
class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, info_gain=None, value=None):
        self.feature_index = feature_index  # Index of the feature to split on
        self.threshold = threshold          # Threshold value to split at
        self.left = left                    # Left subtree
        self.right = right                  # Right subtree
        self.info_gain = info_gain          # Information gain from the split
        self.value = value                  # Predicted class label (set only for leaf nodes)

In [58]:
class DecisionTree:
    
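    # Constructor. min_samples_left is the minimum number of samples a node
    # must hold to be split; max_depth limits how deep splits may occur.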
    def __init__(self, min_samples_left=2, max_depth=2):
        self.root = None
        self.min_samples_left = min_samples_left
        self.max_depth = max_depth

    
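    # Recursively build the decision tree
    def BuildTree(self, dataset, curr_depth=0):
        X, y = dataset[:, :-1], dataset[:, -1]
        num_samples = X.shape[0]
        num_features = X.shape[1]

        # Split only if there are enough samples and the depth limit allows it.
        # Note: because of `<=`, nodes at depth max_depth can still split,
        # so leaves may sit at depth max_depth + 1.
        if num_samples >= self.min_samples_left and curr_depth <= self.max_depth:
            best_split = self.get_best_split(dataset, num_samples, num_features)
            # get_best_split returns an empty dict when no valid split exists
            if best_split.get("info_gain", 0) > 0: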
                left_subtree = self.BuildTree(best_split["dataset_left"], curr_depth + 1)
                right_subtree = self.BuildTree(best_split["dataset_right"], curr_depth + 1)
                return Node(best_split["feature_index"], best_split["threshold"], left_subtree, right_subtree, best_split["info_gain"])

        leaf_value = self.calculate_leaf_value(y)
        return Node(value=leaf_value)
    
    # function to get the best split
    def get_best_split(self, dataset, num_samples, num_features):
        best_split = {}
        max_info_gain = -float("inf")

        for feature_index in range(num_features):
            feature_values = dataset[:, feature_index]
            possible_threshold = np.unique(feature_values)

            for threshold in possible_threshold:
                dataset_left, dataset_right = self.split(dataset, feature_index, threshold)
                if len(dataset_left) > 0 and len(dataset_right) > 0:
                    y, left_y, right_y = dataset[:, -1], dataset_left[:, -1], dataset_right[:, -1]
                    # The gain is computed with Gini impurity here; pass "entropy" to use entropy instead
                    curr_info_gain = self.information_gain(y, left_y, right_y, "gini")
                    if curr_info_gain > max_info_gain:
                        best_split["feature_index"] = feature_index
                        best_split["threshold"] = threshold
                        best_split["dataset_left"] = dataset_left
                        best_split["dataset_right"] = dataset_right
                        best_split["info_gain"] = curr_info_gain
                        max_info_gain = curr_info_gain

        return best_split

    # Split the dataset on one feature: rows with value <= threshold go left, rows with value > threshold go right
    def split(self, dataset, feature_index, threshold):
        feature_values = dataset[:, feature_index]
        dataset_left = dataset[feature_values <= threshold]
        dataset_right = dataset[feature_values > threshold]
        return dataset_left, dataset_right

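    # Gain of a split: parent impurity minus the weighted impurity of the children.
    # Supported impurity measures (selected via `mode`):
    #   "entropy" - entropy
    #   "gini"    - Gini impurity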
    
    def information_gain(self, y, left_y, right_y, mode="entropy"):
        weight_l = len(left_y) / len(y)
        weight_r = len(right_y) / len(y)

        if mode == "gini":
            gain = self.gini_impurity(y) - (weight_l * self.gini_impurity(left_y) + weight_r * self.gini_impurity(right_y))
        else:
            gain = self.entropy(y) - (weight_l * self.entropy(left_y) + weight_r * self.entropy(right_y))

        return gain

    def entropy(self, y):
        labels = np.unique(y)
        entropy = 0
        for cls in labels:
            p_class = len(y[y == cls]) / len(y)
            entropy += -p_class * np.log2(p_class)
        return entropy

    def gini_impurity(self, y):
        labels = np.unique(y)
        gini = 0
        for cls in labels:
            p_class = len(y[y == cls]) / len(y)
            gini += p_class ** 2
        return 1 - gini

    # Leaf prediction: the most frequent class label among the samples in the node
    def calculate_leaf_value(self, y):
        y = list(y)
        return max(y, key=y.count)
    
    # Model training: append the labels as the last column, then build the tree
    def fit(self, X, y):
        data = np.concatenate((X, y.reshape(-1, 1)), axis=1)
        self.root = self.BuildTree(data)

    def predict(self, X):
        predictions = [self.make_prediction(x, self.root) for x in X]
        return predictions

    # Predict a single sample by walking the tree from the given node down to a leaf
    def make_prediction(self, x, tree):
        if tree.value is not None:
            return tree.value
        feature_val = x[tree.feature_index]
        if feature_val <= tree.threshold:
            return self.make_prediction(x, tree.left)
        else:
            return self.make_prediction(x, tree.right)
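
The threshold search performed by `get_best_split` can be illustrated on a single feature. The sketch below is a simplified, standalone version under stated assumptions: the helper names `gini` and `best_threshold` and the toy data are hypothetical, and it scans only one feature rather than all of them.

```python
import numpy as np

def gini(labels):
    """Gini impurity, 1 - sum(p_i ** 2), of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - float(np.sum(probs ** 2))

def best_threshold(feature, labels):
    """Return (threshold, gain) with the largest Gini gain for one numeric feature."""
    best_t, best_gain = None, 0.0
    parent_impurity = gini(labels)
    # Every distinct feature value is a candidate threshold
    for t in np.unique(feature):
        left = feature <= t
        n_left = int(left.sum())
        # Skip candidates that leave one side empty
        if n_left == 0 or n_left == len(labels):
            continue
        w_left = n_left / len(labels)
        children = w_left * gini(labels[left]) + (1 - w_left) * gini(labels[~left])
        gain = parent_impurity - children
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical toy data: small values belong to class 0, large values to class 1
feature = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
labels = np.array([0, 0, 0, 1, 1, 1])

threshold, gain = best_threshold(feature, labels)
print(threshold, gain)
```

On this toy data the best threshold is 3.0, which separates the two classes completely and removes all of the parent's impurity (a gain of 0.5).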


In [59]:
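# Assumes the last column of Iris.csv holds the class label
# and every other column is a numeric feature
data = pd.read_csv("Iris.csv")
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values.reshape(-1, 1)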

In [60]:
from sklearn.model_selection import train_test_split

# No random_state is set, so the split (and the resulting accuracy) varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [61]:
classifier = DecisionTree(min_samples_left=3, max_depth=3)
classifier.fit(X_train, y_train)

In [57]:
from sklearn.metrics import accuracy_score

Y_pred = classifier.predict(X_test)
accuracy_score(y_test, Y_pred)

0.9666666666666667