# 4. Classification

This Jupyter notebook is part of an exercise series titled *Classification*, based on the lecture of the same title.

This exercise series is divided into three parts. There will be one exercise session per part (= one part per week):

- **4.1.** Decision Tree (*this notebook*)
    - **4.1.1.** [Dataset](#4.1.1.-Dataset) 
    - **4.1.2.** [Tree Helper Objects](#4.1.2.-Tree-Helper-Objects)
    - **4.1.3.** [Train Your Decision Tree](#4.1.3.-Train-Your-Decision-Tree)
    - **4.1.4.** [Obtain Predictions with Your Decision Tree](#4.1.4.-Obtain-Predictions-with-Your-Decision-Tree)
    - **4.1.5.** [Evaluate Your Decision Tree](#4.1.5.-Evaluate-Your-Decision-Tree)
    - **4.1.6.** [Use Another Dataset to Test your Decision Tree Implementation](#4.1.6.-Use-Another-Dataset-to-Test-your-Decision-Tree-Implementation)
    - **4.1.7.** [Another Attribute Selection Method](#4.1.7.-Another-Attribute-Selection-Method)
        - **4.1.7.1.** [Gain Ratio](#4.1.7.1.-Gain-Ratio)
        - **4.1.7.2.** [Gini Index](#4.1.7.2.-Gini-Index)
- **4.2.** [Naive Bayes](./4.2.-Naive-Bayes.ipynb) (*next week's notebook*)
- **4.3.** AdaBoost (*notebook of the week after next*) - *Will be uploaded at a later date as a separate zip-file*

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself **BEFORE** each exercise session. The exercise session is **NOT** intended to take a first look at the exercise sheet, but to solve problems students had while preparing the exercise sheet beforehand.
    
</div>

**Importing Libraries**

Feel free to import more libraries here.

In [None]:
import pandas as pd
from typing import List, Any, Callable, Tuple

from math import log

## 4.1. Decision Tree

In this first exercise, you will implement a basic decision tree algorithm from scratch. Before you implement the decision tree algorithm itself, however, you need an attribute selection measure. Recall that we discussed three in our lecture: information gain, gain ratio, and Gini index. All three have their advantages and disadvantages.

<div class="alert alert-info" role="alert">

**Task 1:**
    
What Are the Key Differences Between the Three Discussed Attribute Selection Measures? Bullet Points are Sufficient.

</div>

Information gain:

Gain ratio:

Gini index:

Information gain:
- supports multiway split
- used by ID3
- guarantees a simple (but not the simplest) tree
- select attribute with highest information gain
- favours attributes with a large number of distinct values

Gain ratio:
- extension to information gain
- used by C4.5 (which is an improved version of ID3)
- select attribute with highest gain ratio
- becomes unstable when SplitInfo approaches zero (possible solution: add a constraint that the information gain of the selected attribute must be at least as large as the average gain over all attributes examined)
- prefers unbalanced splits

Gini index:
- enforces binary split
- used by CART
- select attribute with lowest Gini index
- biased towards multivalued attributes
- has difficulties when the number of classes is large
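
For reference, the three measures can be written compactly as follows. This is a summary in the notation of the reference book, assuming the lecture uses the same symbols: $D$ is the dataset, $p_i$ the relative frequency of class $i$ among $m$ classes, and $D_1, \dots, D_v$ the partitions created by attribute $A$.

$$
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
$$

$$
\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}, \qquad
\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)}
$$

$$
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^2, \qquad
\mathrm{Gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
$$

Information gain and gain ratio are maximized, whereas the Gini index of a (binary) split is minimized.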

### 4.1.1. Dataset 
We will use the following dataset in this Jupyter notebook:

In [None]:
from datasets.buys_computer import train_buys_computer

# view dataset
train_buys_computer

<div class="alert alert-info" role="alert">

**Task 2:**
    
Implement Information Gain

</div>

In [None]:
def information(dataset: pd.DataFrame, target_attribute: str) -> float:
    """Calculate encoded information in a dataset based on its target label distribution."""
    raise NotImplementedError("Implement this function.")


def information_partitioned(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculate encoded information in a dataset partitioned by a
    specific attribute and based on its target label distribution."""
    raise NotImplementedError("Implement this function.")


def information_gain(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculating information gain of a given dataset and its partitioning attribute."""
    raise NotImplementedError("Implement this function.")

In [None]:
def information(dataset: pd.DataFrame, target_attribute: str) -> float:
    """Calculate encoded information in a dataset based on its target label distribution."""
    class_probability = dataset[target_attribute].value_counts() / dataset.shape[0]
    return sum([p * log(p, 2) for p in class_probability]) * -1


def information_partitioned(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculate encoded information in a dataset partitioned by a
    specific attribute and based on its target label distribution."""
    weights = dataset[partition_attribute].value_counts() / dataset.shape[0]
    return sum(
        [
            weight
            * information(
                dataset=dataset[dataset[partition_attribute] == index],
                target_attribute=target_attribute,
            )
            for index, weight in weights.items()
        ]
    )


def information_gain(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculating information gain of a given dataset and its partitioned version."""
    return information(dataset, target_attribute) - information_partitioned(
        dataset, target_attribute, partition_attribute
    )

Test your functions:

In [None]:
# target we want to predict
target_attribute = "buys_computer"

info = information(train_buys_computer, target_attribute)
print(info)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(info - 0.9402859586706309) < 1e-9

In [None]:
info_partitioned = information_partitioned(train_buys_computer, target_attribute, "age")
print(info_partitioned)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(info_partitioned - 0.6935361388961918) < 1e-9

In [None]:
info_gain = information_gain(
    dataset=train_buys_computer,
    target_attribute=target_attribute,
    partition_attribute="age",
)
print(info_gain)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(info_gain - 0.2467498197744391) < 1e-9
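
The expected values in the tests can also be reproduced by hand from class counts alone. The following is a minimal sketch, not part of the exercise solution. It assumes that the dataset contains 9 positive and 5 negative tuples and that the partitioning attribute creates three partitions with the (yes, no) counts shown; both are assumptions that are consistent with the asserted values. The helper `entropy` is a hypothetical name introduced only for this illustration.

```python
from math import log2


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as absolute counts."""
    total = sum(class_counts)
    # Skip empty classes: p * log2(p) tends to 0 as p approaches 0.
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)


# Assumed counts: 9 "yes" and 5 "no" tuples out of 14.
info_d = entropy([9, 5])

# Assumed (yes, no) counts within each partition of a three-valued attribute.
partitions = [(2, 3), (4, 0), (3, 2)]
n = sum(yes + no for yes, no in partitions)
info_a = sum((yes + no) / n * entropy((yes, no)) for yes, no in partitions)

print(round(info_d, 4))           # 0.9403
print(round(info_a, 4))           # 0.6935
print(round(info_d - info_a, 4))  # 0.2467
```

Note how the pure partition `(4, 0)` contributes nothing to the weighted sum, which is why an attribute that creates pure partitions yields a high gain.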

### 4.1.2. Tree Helper Objects

Implementing a decision tree from scratch requires storing information in a specific structure. For this, we provide you with two classes, namely `Node` and `Branch`.

Recall the components of a decision tree on slide 8:
![Components of a decision tree.](decision-tree-components.png) 


A `Node` object represents the root node, an internal node, or a leaf node. The difference between a (root/internal) node and a leaf node is the existence or absence of branches and children: a node whose branches list is empty (or `None`) is a leaf node, and its label holds the class label.

As depicted in the figure above, a node can have branches that lead either to internal nodes or to leaf nodes. Additionally, a (decision) tree may support multiway splits (as pictured above) or only binary splits. To support multiway splits and to let each branch carry a label with the attribute's corresponding value, we created a `Branch` object. Such an object essentially contains a label and a node. A `Node` object then holds several branches in a list.

In [None]:
class Node:
    """Node of a tree. Can consist of multiple branches,
    if no branches exist then node is not an internal node but a leaf node."""

    def __init__(self, label: str, branches: List = None) -> None:
        self.label = label
        # Our decision trees may support multiway splits.
        # Therefore, we store our branches or children as a list.
        # Should be of type List[Branch], but in this Jupyter notebook cell
        # Branch is not yet defined at this point, so we cannot reference it directly.
        # We refrained from creating a package for these two objects.
        self.branches = branches

    def __repr__(self) -> str:
        """Special method to return a string containing a printable
        representation of this custom object. This representation can
        be used to create this very same object when the value is passed
        to eval()."""
        return "Node(%r, %r)" % (self.label, self.branches)


class Branch:
    """Branch of a tree containing a label and a (internal/leaf) node."""

    def __init__(self, label: str, node: Node = None) -> None:
        self.label = label
        # Actual child of a tree Node.
        self.node = node

    def __repr__(self) -> str:
        """Special method to return a string containing a printable
        representation of this custom object. This representation can
        be used to create this very same object when the value is passed
        to eval()."""
        return "Branch(%r, %r)" % (self.label, self.node)
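
To get a feeling for how the two helper objects fit together, a tiny tree can be built by hand. The sketch below repeats compact versions of the two classes so that it runs on its own. The attribute name `student`, the leaf labels, and the use of lists as branch labels are illustrative assumptions, not requirements of the exercise.

```python
class Node:
    """Compact copy of the helper class: a label plus an optional branch list."""

    def __init__(self, label, branches=None):
        self.label = label
        self.branches = branches

    def __repr__(self):
        return "Node(%r, %r)" % (self.label, self.branches)


class Branch:
    """Compact copy of the helper class: a label plus the child node."""

    def __init__(self, label, node=None):
        self.label = label
        self.node = node

    def __repr__(self):
        return "Branch(%r, %r)" % (self.label, self.node)


# A hand-built stump: the root tests the hypothetical attribute "student",
# and each branch leads to a leaf node holding a class label.
stump = Node(
    "student",
    [
        Branch(["yes"], Node("buys")),
        Branch(["no"], Node("does not buy")),
    ],
)
print(stump)

# The repr is valid Python, so eval() rebuilds an equivalent tree.
rebuilt = eval(repr(stump))
print(rebuilt.branches[0].node.label)  # buys
```

Leaf nodes are simply nodes created without branches, so checking `not node.branches` is enough to detect them.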

<div class="alert alert-info" role="alert">

**Task 3:**
    
Implement the Basic Algorithm for Decision Trees That Uses Your Implemented Information Gain Function.
    
</div>    

So far, you have implemented one attribute selection measure, namely information gain. Additionally, you have two Python objects to aid your decision tree implementation. In this task you will implement a decision tree from scratch. 

For implementation details refer to lecture slide 9 for the algorithm sketch and to slide 10 for the stopping criteria. The full decision tree algorithm is also in the appendix of this lecture.

Our reference book, on which the lecture is based, contains a detailed explanation of each step (pp. 332) and pseudocode on p. 333, Figure 8.3. Note that this book is available as a hard copy in our library and also online (just google it and it will be among the first results).

**Your task is to implement some functions in the following `DecisionTree` object: `_build_tree`, `_find_best_splitting_attribute`, and `predict`.**

- `_build_tree` is the heart of this object, as it is responsible for constructing the decision tree. This function is called in `fit`.
- `_find_best_splitting_attribute` should be called in `_build_tree` to determine which attribute is the best one to split on and grow a subtree from. This function will use your implemented `information_gain` function. How will it use it? By simply instantiating a `DecisionTree` object with a reference to your function. This is the reason why the `__init__` method has one parameter `attribute_selection_method` that is of type `Callable`.
- `predict` will use your constructed decision tree. Your task here is to implement this function so that it walks down your constructed tree to retrieve a class label for each tuple of the given test data.
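
Passing a function reference as a `Callable` can be illustrated in isolation. The following is a minimal sketch under stated assumptions: `pick_best` and `toy_scores` are hypothetical names, and the scores are made-up numbers standing in for the result of an attribute selection measure.

```python
from typing import Callable, Dict, List


def pick_best(attributes: List[str], score: Callable[[str], float]) -> str:
    """Return the attribute with the highest score.

    The scoring function is passed in, so the caller decides which
    measure is used without changing this function.
    """
    return max(attributes, key=score)


# Hypothetical, precomputed scores standing in for an attribute selection measure.
toy_scores: Dict[str, float] = {"age": 0.25, "income": 0.03, "student": 0.15}

# toy_scores.get is a function reference, just like information_gain would be.
best = pick_best(list(toy_scores), toy_scores.get)
print(best)  # age
```

For a measure that should be minimized instead, the same idea works with `min` in place of `max`.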


<div class="alert alert-danger" role="alert">

**Note these additional requirements:**
- For the time being, we restrict our decision tree to work with **categorical data** only.
- We allow multiway splits.
    
</div>

<div class="alert alert-warning" role="alert">

*Short Excursus:*

You may wonder why some methods begin with a single underscore, and why they are referred to as "private" even though no such keyword exists in Python.

Names that begin with two leading underscores and do not end with two (unlike special methods such as `__init__`) come closest to being private. Python applies name mangling to them: it internally prefixes the name with the class name, so the attribute cannot be accessed under its original name from outside the class, although it remains reachable as `_ClassName__name`. Inheriting from a class with such names comes with its own quirks (you are invited to play around with this yourself). For easier use, it is common to create so-called "private" methods with only one leading underscore. The single underscore signals that you should not call these methods from outside the class.
    
</div>    
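
The difference between one and two leading underscores can be tried out in a few lines. This is a small illustration with a hypothetical class `Example`; it is not needed for the task itself.

```python
class Example:
    def __init__(self):
        # Single underscore: a convention only, nothing is enforced.
        self._internal = "single underscore"
        # Double underscore: stored under the mangled name _Example__mangled.
        self.__mangled = "double underscore"


obj = Example()
print(obj._internal)  # accessible, but discouraged by convention

try:
    obj.__mangled
except AttributeError:
    print("no attribute named __mangled from outside the class")

# The mangled name is still reachable, so this is not true privacy.
print(obj._Example__mangled)
```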

**Now back to your task at hand: Implementing a basic decision tree from scratch.**

In [None]:
class DecisionTree:
    """Basic Decision Tree algorithm."""

    def __init__(
        self,
        attribute_selection_method: Callable,
    ) -> None:
        self.attribute_selection_method = attribute_selection_method

        # Function fit will later populate this variable
        self.target_attribute = None

        # Function fit will later produce a decision tree
        self.tree: Node = None

    def fit(
        self,
        dataset: pd.DataFrame,
        target_attribute: str,
    ) -> None:
        """Fit decision tree on a given dataset and target attribute."""
        # Store target_attribute in this object
        self.target_attribute = target_attribute
        # Get the attribute list
        attribute_list = [col for col in dataset.columns if col != target_attribute]
        # Construct the actual decision tree
        self.tree = self._build_tree(dataset, attribute_list)

    def _build_tree(self, data: pd.DataFrame, attribute_list: List[str]) -> Node:
        """'Private' method to build decision tree recursively. Returns current (sub-)tree at point."""
        raise NotImplementedError("Implement this function.")

    def _find_best_splitting_attribute(
        self, data: pd.DataFrame, attribute_list: List[str]
    ) -> Tuple[str, Any]:
        """'Private' method to find the best splitting attribute in a list of all available attributes."""
        # This function should be used in _build_tree. Of course, you can implement
        # this functionality directly in _build_tree if you prefer.
        raise NotImplementedError("Implement this function.")

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Returns predicted values for a given dataset."""
        raise NotImplementedError("Implement this function.")

In [None]:
class DecisionTree:
    """Basic Decision Tree algorithm."""

    def __init__(
        self,
        attribute_selection_method: Callable,
    ) -> None:
        self.attribute_selection_method = attribute_selection_method

        # Function fit will later populate this variable
        self.target_attribute = None

        # Function fit will later produce a decision tree
        self.tree: Node = None

    def fit(
        self,
        dataset: pd.DataFrame,
        target_attribute: str,
    ) -> None:
        """Fit decision tree on a given dataset and target attribute."""
        # Store target_attribute in this object
        self.target_attribute = target_attribute
        # Get the attribute list
        attribute_list = [col for col in dataset.columns if col != target_attribute]
        # Construct the actual decision tree
        self.tree = self._build_tree(dataset, attribute_list)

    def _build_tree(self, data: pd.DataFrame, attribute_list: List[str]) -> Node:
        """'Private' method to build decision tree recursively. Returns current (sub-)tree at point."""
        if len(data[self.target_attribute].unique()) == 1:
            # All tuples have same class, thus return node as leaf node labeled with this class
            return Node(label=data[self.target_attribute].unique()[0])

        if not attribute_list:
            # List is empty, return leaf node with majority class
            majority_class = (
                data[self.target_attribute]
                .value_counts()
                .sort_values(ascending=False)
                .index[0]
            )
            return Node(label=majority_class)

        # Determine splitting attribute
        splitting_attribute, labels = self._find_best_splitting_attribute(
            data, attribute_list
        )

        # Typically, we have to determine if the splitting attribute is discrete valued,
        # but we restrict ourselves here only to discrete-valued data.
        # Yet, we need to check if the attribute_selection_method allows multiway splits.
        # For instance, Gini index only allows binary trees, thus, we can only remove the
        # splitting attribute from the attribute list when we do not have Gini index as the
        # attribute selection method.
        if self.attribute_selection_method.__name__ != "gini_index" or (
            labels and len(labels) == 1
        ):
            # Remove the splitting_attribute from attribute_list
            attribute_list = [
                attr for attr in attribute_list if attr != splitting_attribute
            ]

        # Create a node with an empty list as branches
        node = Node(splitting_attribute, [])

        if self.attribute_selection_method.__name__ == "gini_index":
            attribute_values = labels
        else:
            attribute_values = [[value] for value in data[splitting_attribute].unique()]

        # For each unique value of this splitting_attribute
        for value in attribute_values:
            # Partition the tuples and grow subtrees for each partition
            partition: pd.DataFrame = data[data[splitting_attribute].isin(value)]
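            if partition.empty:
                # Attach a leaf labeled with the majority class of the current data
                majority_class = data[self.target_attribute].value_counts().index[0]
                node.branches.append(
                    Branch(label=value, node=Node(label=majority_class))
                )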
            else:
                # Append the node returned by _build_tree.
                # Note that we need to copy the list of attributes, otherwise we would perform the following
                # operations on the very same attribute list. This can be done by slicing, but
                # also by using the list method copy().
                node.branches.append(
                    Branch(
                        label=value,
                        node=self._build_tree(
                            data=partition, attribute_list=attribute_list[:]
                        ),
                    )
                )
        return node

    def _find_best_splitting_attribute(
        self, data: pd.DataFrame, attribute_list: List[str]
    ) -> Tuple[str, Any]:
        """'Private' method to find the best splitting attribute in a list of all available attributes."""
        # For each attribute in the given attribute_list calculate a scalar value that
        # is later then used to determine the best splitting attribute.
        # Here, we build a list of tuples. One such tuple contains the attribute name as
        # well the calculated scalar value.
        # Note that in the case of Gini index as the attribute selection method, a list
        # of attribute values and a scalar value such as
        # [[['high'], ['medium', 'low']], 0.4428571428571429] is returned.
        all_split_information = [
            (
                attribute,
                self.attribute_selection_method(
                    dataset=data,
                    target_attribute=self.target_attribute,
                    partition_attribute=attribute,
                ),
            )
            for attribute in attribute_list
        ]
        # Above list comprehension is the same as:
        # all_split_information = []
        # for attribute in attribute_list:
        #     all_split_information.append(
        #         (
        #             attribute,
        #             self.attribute_selection_method(
        #                 dataset=data,
        #                 target_attribute=self.target_attribute,
        #                 partition_attribute=attribute,
        #             ),
        #         )
        #     )

        # Test if our attribute_selection_method is the Gini index,
        # otherwise it must be one of the other measures.
        if self.attribute_selection_method.__name__ == "gini_index":
            # Sort this list of tuples based on the scalar value
            sorted_information = sorted(all_split_information, key=lambda x: x[1][-1])
            # When using Gini index, we want to minimize the impurity of the resulting
            # partitions and thus need to select the minimum value. This is the first
            # element and it may look like
            # ('income', ([['high'], ['medium', 'low']], 0.375)).
            # We therefore want to return the attribute name and the labels.
            return sorted_information[0][0], sorted_information[0][1][0]
        # "Else": Another measure has been used.
        # It is not wrong to explicitly write else here. Yet it is not really needed in this particular case.
        # Sort this list of tuples based on the scalar value
        sorted_information = sorted(all_split_information, key=lambda x: x[1])
        # When using information gain or gain ratio we want to minimize the information needed
        # to classify a tuple/row, meaning we have to select the element in sorted_information with
        # the highest value. In our variable sorted_information it is the last Python tuple element ([-1]).
        return sorted_information[-1]

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Returns predicted values for a given dataset."""
        if self.tree is None:
            raise ValueError(
                "DecisionTree not trained on data. Call function fit() first."
            )
        return [self._dfs(self.tree, row) for _, row in dataset.iterrows()]

    def _dfs(self, node: Node, data_row: pd.Series):
        """Private method to recursively walk down our decision tree to obtain a single class label."""
        # If the node has no branches (empty list or None), it is a leaf: return its label.
        if not node.branches:
            return node.label

        # Obtain the corresponding value of our tuple/sample of the column
        # that is specified in our node label.
        value = data_row[node.label]

        # Iterate over each branch of the current node.
        for branch in node.branches:
            if value in branch.label:
                # If the current branch label contains the dataset's corresponding
                # column value, then go one level down in the tree.
                return self._dfs(branch.node, data_row)
        # No branch matches (value not seen during training): no prediction possible.
        return None

### 4.1.3. Train Your Decision Tree

In [None]:
dt = DecisionTree(attribute_selection_method=information_gain)
dt.fit(dataset=train_buys_computer, target_attribute=target_attribute)

In [None]:
print(dt.tree)

### 4.1.4. Obtain Predictions with Your Decision Tree

Test your decision tree. In the following cell, we use the training dataset, this one time only, to make sure our decision tree works as intended. Note, however, that **you normally do not use your training data to test your model!**

In [None]:
# Use all columns except the last one to obtain predictions.
# Last column contains the true class labels.
# Note that typically you do not use your training data to test your model!
# Here we only use it to make sure our implementation works as intended.
predictions = dt.predict(train_buys_computer.iloc[:, :-1])

Let's take a simple look at the true and predicted values:

In [None]:
for true, predict in zip(train_buys_computer.iloc[:, -1], predictions):
    print("True", true, "prediction", predict)

### 4.1.5. Evaluate Your Decision Tree
- calculate confusion matrix
- calculate other metrics (sensitivity, specificity)
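
One possible way to approach both points is to count the four outcomes of a binary problem and derive the metrics from them. The sketch below uses only the standard library. The helper `confusion_counts` is a hypothetical name, and the label lists are made up for illustration; they are not output of the decision tree.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive):
    """Count TP, FN, FP, TN for a binary problem with the given positive label."""
    pairs = Counter((t == positive, p == positive) for t, p in zip(y_true, y_pred))
    return {
        "TP": pairs[(True, True)],    # positive tuples predicted as positive
        "FN": pairs[(True, False)],   # positive tuples predicted as negative
        "FP": pairs[(False, True)],   # negative tuples predicted as positive
        "TN": pairs[(False, False)],  # negative tuples predicted as negative
    }


# Made-up labels for illustration only.
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "yes", "no"]

cm = confusion_counts(y_true, y_pred, positive="yes")
sensitivity = cm["TP"] / (cm["TP"] + cm["FN"])  # true positive rate
specificity = cm["TN"] / (cm["TN"] + cm["FP"])  # true negative rate
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)

print(cm)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
```

With the predictions of a trained tree, `y_true` would be the last column of the dataset and `y_pred` the list returned by `predict`.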

### 4.1.6. Use Another Dataset to Test your Decision Tree Implementation
The following dataset may help a tennis player to determine whether to go play tennis or not.

In [None]:
from datasets.play_tennis import train_play_tennis


train_play_tennis

Train your decision tree:

In [None]:
# train your decision tree here

In [None]:
# train your decision tree here
dt_tennis = DecisionTree(attribute_selection_method=information_gain)
dt_tennis.fit(dataset=train_play_tennis, target_attribute="Play Tennis")

Test your newly trained decision tree with the following test dataset:

In [None]:
from datasets.play_tennis import test_play_tennis


test_play_tennis

Make predictions with your decision tree:

In [None]:
# get predictions here

In [None]:
# get predictions here
predictions = dt_tennis.predict(test_play_tennis.iloc[:, :-1])

Evaluate your decision tree:

In [None]:
# evaluate your decision tree here

In [None]:
# evaluate your decision tree here
for true, predict in zip(test_play_tennis.iloc[:, -1], predictions):
    print("True", true, "prediction", predict)

### 4.1.7. Another Attribute Selection Method

<div class="alert alert-info" role="alert">

**Task 4:**
    
Implement Another Attribute Selection Method and Incorporate it in your Decision Tree Implementation. 
For instance, implement Gain Ratio or Gini Index. Keep in mind that some splitting criteria methods minimize whereas others seek to maximize some value. You may need to update your decision tree accordingly.
    
</div>    

#### 4.1.7.1. Gain Ratio

In [None]:
def split_info(dataset: pd.DataFrame, partition_attribute: str) -> float:
    """Calculates and returns SplitInfo given a dataset and a partitioning attribute."""
    raise NotImplementedError


def gain_ratio(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculates gain ratio given a dataset, a target attribute, and a partitioning attribute."""
    raise NotImplementedError

In [None]:
def split_info(dataset: pd.DataFrame, partition_attribute: str) -> float:
    """Calculates and returns SplitInfo given a dataset and a partitioning attribute."""
    weights = dataset[partition_attribute].value_counts() / dataset.shape[0]
    return sum([weight * log(weight, 2) for _, weight in weights.items()]) * -1


def gain_ratio(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> float:
    """Calculates gain ratio given a dataset, a target attribute, and a partitioning attribute."""
    gain = information_gain(
        dataset=dataset,
        target_attribute=target_attribute,
        partition_attribute=partition_attribute,
    )
    split_info_ = split_info(
        dataset=dataset,
        partition_attribute=partition_attribute,
    )
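    # SplitInfo is 0 when the attribute takes a single value in this dataset.
    # Such an attribute cannot split the data, so we score it with 0
    # instead of dividing by zero.
    if split_info_ == 0:
        return 0.0
    return gain / split_info_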

In [None]:
split_information = split_info(
    dataset=train_buys_computer, partition_attribute="income"
)
print(split_information)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(split_information - 1.5566567074628228) < 1e-9

In [None]:
info_gain = information_gain(
    dataset=train_buys_computer,
    target_attribute=target_attribute,
    partition_attribute="income",
)
print(info_gain)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(info_gain - 0.029222565658954647) < 1e-9

In [None]:
gain_ratio_value = gain_ratio(
    dataset=train_buys_computer,
    target_attribute=target_attribute,
    partition_attribute="income",
)
print(gain_ratio_value)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(gain_ratio_value - 0.01877264622241867) < 1e-9
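
SplitInfo depends only on the sizes of the partitions, which makes it easy to see why gain ratio becomes unstable for very uneven splits. The following sketch uses a hypothetical helper `split_info_from_counts`. The partition sizes 4, 6, and 4 are an assumption that is consistent with the expected SplitInfo value for `income`; the other two size lists are made up.

```python
from math import log2


def split_info_from_counts(partition_sizes):
    """SplitInfo (in bits) from the sizes of the partitions an attribute creates."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes if s > 0)


# Assumed partition sizes for a three-valued attribute on 14 tuples.
balanced = split_info_from_counts([4, 6, 4])
# A very uneven split: SplitInfo shrinks, which inflates Gain / SplitInfo.
skewed = split_info_from_counts([13, 1])
# A single value only: SplitInfo is 0 and the ratio is undefined.
degenerate = split_info_from_counts([14])

print(round(balanced, 4))  # 1.5567
print(round(skewed, 4))    # 0.3712
print(degenerate == 0)     # True
```

Because the same gain divided by a smaller SplitInfo gives a larger ratio, gain ratio tends to prefer unbalanced splits, and the degenerate case needs special handling.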

Use your decision tree implementation with gain ratio as the attribute selection method:

In [None]:
dt_gain_ratio = DecisionTree(attribute_selection_method=gain_ratio)
dt_gain_ratio.fit(dataset=train_buys_computer, target_attribute=target_attribute)

print("Decision tree:", dt_gain_ratio.tree)

predictions = dt_gain_ratio.predict(train_buys_computer.iloc[:, :-1])

print("Predictions:")
for true, predict in zip(train_buys_computer.iloc[:, -1], predictions):
    print("True", true, "prediction", predict)

#### 4.1.7.2. Gini Index

In [None]:
def gini(dataset: pd.DataFrame, target_attribute: str) -> float:
    raise NotImplementedError("Implement this function.")


def gini_index(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> List:
    raise NotImplementedError("Implement this function.")

In [None]:
import itertools


def gini(dataset: pd.DataFrame, target_attribute: str) -> float:
    """Calculate the impurity of the dataset based on its target_attribute."""
    weights = dataset[target_attribute].value_counts() / dataset.shape[0]
    return 1 - sum([weight**2 for _, weight in weights.items()])


def partition_dataset(
    dataset: pd.DataFrame, partition_attribute: str, values: List[str]
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Partition a dataset based on a partitioning attribute and a corresponding list of values.
    Returns the dataset partitioned by these values as well as the inverse."""
    condition = dataset[partition_attribute].isin(values)
    return dataset[condition], dataset[~condition]


def gini_index(
    dataset: pd.DataFrame, target_attribute: str, partition_attribute: str
) -> List:
    """Calculating Gini index of a given dataset, target attribute, and its partitioning attribute."""
    # Get number of tuples/rows
    number_tuples = dataset.shape[0]
    # Get unique values of the partitioning attribute
    unique_values = dataset[partition_attribute].unique()

    # If only one unique value exists, no binary split is possible.
    if len(unique_values) == 1:
        # Return this single unique value together with 1,
        # which is larger than any Gini index of a real split.
        return ([list(unique_values)], 1)

    # Determine unique value combinations to build a binary tree.
    # Suppose attribute A has v possible values, then we have to take a look at
    # 2^v - 2 subsets (leaving out the empty set and the full set).
    subset_combinations = [
        list(subset)
        for size in range(1, len(unique_values))
        for subset in itertools.combinations(unique_values, size)
    ]
    # We later want to build a binary tree. For this to work, we need a
    # pair of label lists (one per branch) for each candidate split.
    binary_subset_splits = [
        [a, b] for a, b in zip(subset_combinations, subset_combinations[::-1])
    ]
    # Remove duplicates by only taking the half of all elements
    binary_subset_splits = binary_subset_splits[: int(len(binary_subset_splits) / 2)]

    # Calculate the Gini index for each branch label combination
    gini_indices = []
    # For each label combination. We only need one element of each tuple
    for label_values_left, label_values_right in binary_subset_splits:
        # Partition dataset according to the label values and the partitioning attribute
        dataset_1, dataset_2 = partition_dataset(
            dataset=dataset,
            partition_attribute=partition_attribute,
            values=label_values_left,
        )
        # Calculate the Gini index for the partitioned datasets and add a Python tuple
        # consisting of the current label combinations and the calculated Gini index
        # to the gini_indices list.
        gini_indices.append(
            (
                [label_values_left, label_values_right],
                dataset_1.shape[0] / number_tuples * gini(dataset_1, target_attribute)
                + dataset_2.shape[0]
                / number_tuples
                * gini(dataset_2, target_attribute),
            )
        )
    # Sort the Gini indices by their scalar value and return the element with the smallest value.
    sorted_indices = sorted(gini_indices, key=lambda x: x[1])
    return sorted_indices[0]

In [None]:
buys_computer_gini_index = gini(
    dataset=train_buys_computer, target_attribute=target_attribute
)
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(buys_computer_gini_index - 0.4591836734693877) < 1e-9

In [None]:
partition_attribute = "income"
all_split_information, index = gini_index(
    train_buys_computer, target_attribute, partition_attribute
)
print(all_split_information, index)

assert all_split_information == [["high"], ["medium", "low"]]
# Compare with a small tolerance: exact float equality depends on the summation order.
assert abs(index - 0.4428571428571429) < 1e-9
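
The two ingredients of the Gini index, enumerating the binary splits and weighting the impurity of both sides, can be checked by hand. This is a minimal sketch with hypothetical helpers `gini_from_counts` and `binary_splits`; it enumerates the splits in a different way than the solution above. The (yes, no) counts are assumptions that are consistent with the expected values in the tests.

```python
from itertools import combinations


def gini_from_counts(class_counts):
    """Gini impurity of a class distribution given as absolute counts."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)


def binary_splits(values):
    """All distinct ways to split `values` into two non-empty groups."""
    values = list(values)
    first, rest = values[0], values[1:]
    splits = []
    # Fixing `first` on the left side avoids listing each split twice.
    for size in range(len(rest)):
        for chosen in combinations(rest, size):
            left = [first, *chosen]
            right = [v for v in rest if v not in chosen]
            splits.append((left, right))
    return splits


splits = binary_splits(["high", "medium", "low"])
print(len(splits))  # 3, i.e. 2**(3 - 1) - 1

# Assumed (yes, no) counts: 2/2 for "high", 7/3 for "medium" and "low" combined.
weighted = 4 / 14 * gini_from_counts([2, 2]) + 10 / 14 * gini_from_counts([7, 3])
print(round(gini_from_counts([9, 5]), 4))  # 0.4592
print(round(weighted, 4))                  # 0.4429
```

Of the $2^v - 2$ proper subsets of an attribute with $v$ values, each split appears twice (once per side), which leaves $2^{v-1} - 1$ distinct binary splits.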

In [None]:
dt_gini_index = DecisionTree(attribute_selection_method=gini_index)
dt_gini_index.fit(dataset=train_buys_computer, target_attribute=target_attribute)

print(dt_gini_index.tree)

predictions = dt_gini_index.predict(train_buys_computer.iloc[:, :-1])

for true, predict in zip(train_buys_computer.iloc[:, -1], predictions):
    print("True", true, "prediction", predict)