### This notebook builds the Decision Tree algorithm from scratch. The dataset used is the scikit-learn diabetes dataset, which is a regression problem

In [97]:
from sklearn import datasets
import pandas as pd
import numpy as np

In [98]:
ds = datasets.load_diabetes()

In [99]:
X = np.array(ds.data)
y = np.array(ds.target)

Now that we have the data, we need to build the decision tree. Let's look at the first step.

The first step is to decide how we are going to split the data. Since we don't know the best split criterion in advance, we have to search for it: we iterate through each column and, for each column, test every candidate split value to see which one is best. However, if a column can take any positive real value, there are infinitely many values at which we could try to split the data. To circumvent this, for each column we only try the values of that column that appear in our training data. But how do we determine which split is best? There are different metrics we could use; in this case, we calculate the variance of the labels in each child node and sum them. The best split is the one with the lowest summed variance.
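
To make the criterion concrete, here is a minimal sketch on made-up toy data (the arrays `column` and `labels` are illustrative and not part of the diabetes dataset). It tries each observed value of one column as a split point and scores it by the sum of the two child variances. Candidates that leave one side empty are skipped here; the functions in this notebook reach the same effect by giving an empty child an infinite variance.

```python
import numpy as np

# Toy data (made up for illustration): one feature column and its labels.
column = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([10.0, 12.0, 30.0, 32.0])

scores = {}
for candidate in column:
    left = labels[column < candidate]    # labels of rows below the candidate
    right = labels[column >= candidate]  # labels of rows at or above it
    if len(left) == 0 or len(right) == 0:
        continue  # a split that leaves one child empty is not useful
    # score = variance of the left child + variance of the right child
    scores[float(candidate)] = float(np.var(left) + np.var(right))

best = min(scores, key=scores.get)
print(scores)
print(best)  # 3.0: it separates {10, 12} from {30, 32}, each with variance 1
```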

So, let's build 5 functions:
 1) Calculates the variance of a list of values, returning a float.
 2) For a list of label groups (one group per child node), sums the variances of the groups, returning a float.
 3) For a given column and the labels, loops through all the values of the column and calculates the summed variance of the split produced by each value, returning a tuple with (best split value, lowest sum of variances).
 4) For a given value v and a column, returns the indexes of the entries that are less than v and the indexes of the entries that are greater than or equal to v, as a list of two lists.
 5) For a matrix X and labels y, loops through all the columns and runs function 3) on each one, finding the best column and value to split on, and returns a dictionary with the index of the column, the split value and the variance.

In [100]:
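def calculate_variance(number_list:np.ndarray[float])->float:
    n = len(number_list)
    if n == 0:
        # an empty child gets infinite variance, so a split producing it is never chosen
        return float('inf')
    sum_list = 0
    for number in number_list:
        sum_list += number
    average = sum_list/n
    # avoid naming this variable `sum`, which would shadow the Python builtin
    squared_diff_sum = 0
    for number in number_list:
        squared_diff_sum += (number-average)**2
    variance = squared_diff_sum/n  # population variance (divide by n)
    return variance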
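
The loop above divides by `n`, so it computes the population variance. As a quick cross-check, NumPy's `np.var` computes the same quantity by default (`ddof=0`); the array below is made up for illustration. One difference to keep in mind: on an empty array `np.var` returns `nan` with a runtime warning, whereas the function above returns infinity.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Mean is 2.5; squared deviations are 2.25, 0.25, 0.25, 2.25 (sum 5.0).
population = np.var(values)      # divides by n = 4
sample = np.var(values, ddof=1)  # divides by n - 1 = 3

print(population)  # 1.25
print(sample)      # 5/3, about 1.667
```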

In [101]:
def calculate_lists_variances(list_of_lists:list[list[float]])->float:
    variance_sum = 0
    for number_list in list_of_lists:
        variance_sum+=calculate_variance(number_list)
    return variance_sum
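
The function above adds the child variances with equal weight. A common alternative, shown here only as a hedged sketch and not as this notebook's method, weights each child's variance by its share of the samples, so that a tiny child with near-zero variance does not dominate the score. The name `weighted_variance_sum` and the toy groups are hypothetical.

```python
import numpy as np

def weighted_variance_sum(groups):
    """Hypothetical alternative: weight each group's variance by its sample share."""
    total = sum(len(group) for group in groups)
    if total == 0 or any(len(group) == 0 for group in groups):
        return float("inf")  # mirror the notebook: empty children are never chosen
    return sum(len(group) / total * float(np.var(group)) for group in groups)

# Toy label groups (made up): a lopsided split and a balanced one.
lopsided = [np.array([10.0]), np.array([12.0, 30.0, 32.0])]
balanced = [np.array([10.0, 12.0]), np.array([30.0, 32.0])]

lopsided_score = weighted_variance_sum(lopsided)
balanced_score = weighted_variance_sum(balanced)
print(lopsided_score)  # about 60.67
print(balanced_score)  # 1.0
```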

In [102]:
def calculate_split_indexes(split_value:float, list_column_values:list[float])->list[list[int]]:
    indexes_smaller = []
    indexes_eq_bigger = []
    for index, value in enumerate(list_column_values):
        if value<split_value:
            indexes_smaller.append(index)
        else:
            indexes_eq_bigger.append(index)
    return [indexes_smaller, indexes_eq_bigger]
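
The same partition can also be computed without an explicit loop. The sketch below uses NumPy boolean comparisons and `np.flatnonzero`; the function name `split_indexes_vectorized` is hypothetical. It assumes the column contains no NaN values: a NaN fails both comparisons and would land in neither group, while the loop above would place it in the second group.

```python
import numpy as np

def split_indexes_vectorized(split_value, column_values):
    """Hypothetical vectorized counterpart of the loop-based split."""
    column_values = np.asarray(column_values)
    # flatnonzero returns the positions where the condition is True
    indexes_smaller = np.flatnonzero(column_values < split_value)
    indexes_eq_bigger = np.flatnonzero(column_values >= split_value)
    return [indexes_smaller, indexes_eq_bigger]

left, right = split_indexes_vectorized(0.5, [0.1, 0.9, 0.5, 0.3])
print(left)   # [0 3]
print(right)  # [1 2]
```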


In [103]:
def calculate_best_column_variance(float_list:np.ndarray[float], labels:np.ndarray[float])->tuple[float, float]:
    min_variance = float('inf')
    best_split = float('inf')
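    for value in float_list:
        # try every value observed in this column as a candidate split point
        list_of_lists_of_indexes = calculate_split_indexes(value, float_list)
        list_of_lists = []
        for index_list in list_of_lists_of_indexes:
            # gather the labels that fall on each side of the split
            list_of_lists.append(labels[index_list])
        value_variance = calculate_lists_variances(list_of_lists)
        if value_variance < min_variance:
            # keep the candidate with the lowest summed variance so far
            min_variance = value_variance
            best_split = value
    return (best_split, min_variance)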


In [104]:
def calculate_matrix_best_split(matrix: np.ndarray[float], labels: np.ndarray[float])->dict[str, float]:
    result_dictionary = {
        "column_index":-1,
        "split":-1,
        "variance":float('inf')
    }
    for column_index in range(matrix.shape[1]):
        column_values = matrix[:, column_index] #select all values from the column of index column_index
        split_value, variance = calculate_best_column_variance(column_values, labels)
        if variance<result_dictionary['variance']:
            result_dictionary['column_index'] = column_index
            result_dictionary['split'] = split_value
            result_dictionary['variance'] = variance
    return result_dictionary

### Now let's try to determine the best split for our data!

In [105]:
calculate_matrix_best_split(X, y)

{'column_index': 6,
 'split': -0.09862541271332903,
 'variance': 5862.262308400307}
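
A result like this is easier to interpret alongside the number of samples on each side of the split. An unweighted sum of variances can be small simply because one child holds very few samples (a child with a single sample has variance 0). The helper below is a hypothetical sketch, run on made-up toy data; in this notebook one could pass `X[:, 6]`, `y` and the returned split value instead.

```python
import numpy as np

def describe_split(column, labels, split_value):
    """Hypothetical helper: sample count and variance of each child for one split."""
    column = np.asarray(column)
    labels = np.asarray(labels)
    mask = column < split_value  # True for rows that go to the "smaller" child
    summary = {}
    for name, side in (("smaller", labels[mask]), ("eq_bigger", labels[~mask])):
        summary[name] = {
            "n_samples": int(side.size),
            # same convention as the notebook: an empty child has infinite variance
            "variance": float(np.var(side)) if side.size else float("inf"),
        }
    return summary

# Toy data (made up): the first row is an outlier that the split isolates.
toy_column = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
toy_labels = np.array([5.0, 50.0, 52.0, 48.0, 50.0])

summary = describe_split(toy_column, toy_labels, 0.2)
print(summary)
```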