![title](https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png)

As humans, a lot of times we make decisions based on some circumstances. For example, let's try to imagine the act of eating. We first ask ourselves:

    Am I hungry? If no don't eat anything else continue.
    Is there food in the kitchen? If yes eat that food, else continue.
    Are there enough ingredients to prepare some food? If yes prepare the food and eat it, else continue
    Do I have money to order food? If yes order food else die of hunger.
    
So you may see them as a chain of ifs, which model the whole path towards a final decision, by answering a lot of yes or no questions. This is called a **decision tree**.

Decision Trees are a versatile, powerful and simple  prediction method, that are still widely used today, or in conjunction with other ML algorithms. The process and end result of the model is fairly easy to understand, even for those that aren't very technical or knowledgable of the problem or the method itself. And they are used in conjunction to other ML algorithms that we'll see later on.

On this noteboook, we'll focus on a type of decision tree called **Classification and Regression Tree** or CART trees. These trees perform a classification or regression task based on given input data. This tree is similar to binary trees from data structures, where each node of a tree has zero,one or two "children" or decision results. Let's first load the Human Resources dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
HR = pd.read_csv("HR_comma_sep.csv")
HR.head()

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,promotion_last_5years,dept,salary
0,0.38,0.53,2,157,3,0,1,0,sales,low
1,0.8,0.86,5,262,6,0,1,0,sales,medium
2,0.11,0.88,7,272,4,0,1,0,sales,medium
3,0.72,0.87,5,223,5,0,1,0,sales,low
4,0.37,0.52,2,159,3,0,1,0,sales,low


This dataset contains information of employees for some random company. It contains information like the last evaluation score, satisfaction level, promotions, average monthly hours, salary, etc; and the target variable, if the employee left the company or not. Right off the bat we can make some questions and assumptions about determining if  an employee left or not.

    Where they overworked and/or underpaid?
    Are they underperforming? 
    Are they unsatisfied with the company?
    Have they spent a long time with the company without getting promoted?
    

Maybe a combination of these factors can give us a better probability of determining if an employee left or not.


## 1. Gini

So how do CART models work?

Well first, an input variable and a target variable are given, for example the time spent on the company and if the employee left or not. Then, a split is made based on the input variable, dividing them into two groups. After this, the splits are evaluated to determine which one optimized the model best. To determine this we calculate the **Gini score**, which gives an idea of how good a split is. A perfect Gini score is one which completely separates both target classes ( if there are only two target classes like in binary classification) is 0, while one that separates them by 50/50 of each class in each split and has no impact on actually separing the data has a gini score of 1. 

To calculate it we first have to calculate the proportion of the target classes on both splits. 

$Proportion = \dfrac{count(class)}{(count(grouprows))}$ 

Let's get the proportion for those that left the company, based on number of years they have spent on the company.

In [4]:
median_years = HR.time_spend_company.median()
median_years

3.0

In [5]:
more_than_3 = HR.loc[HR.time_spend_company > median_years]
less_than_3 = HR.loc[HR.time_spend_company <= median_years]
print(more_than_3.shape)
print(less_than_3.shape)

(5312, 10)
(9687, 10)


So there are a lot more employees with less than three years than the ones with more than 3 years. Let's now calculate the proportions for both the ones who left and the ones who didn't, for those that have more than 3 years or 3 years or less.

In [6]:
def calc_Prop(class_count,count_rows):
    return (class_count/count_rows)[-1]
more_than_3_left = calc_Prop(more_than_3.loc[more_than_3.left == 1].count(),more_than_3.count())
more_than_3_stayed = calc_Prop(more_than_3.loc[more_than_3.left == 0].count(),more_than_3.count())
less_than_3_left = calc_Prop(less_than_3.loc[less_than_3.left == 1].count(),less_than_3.count())
less_than_3_stayed = calc_Prop(less_than_3.loc[less_than_3.left == 0].count(),less_than_3.count())

print("More than 3 years and left: " + str(more_than_3_left))
print("More than 3 years and stayed: " + str(more_than_3_stayed))
print("3 years or less and left: " + str(less_than_3_left))
print("3 years or more and stayed: " + str(less_than_3_stayed))

More than 3 years and left: 0.363704819277
More than 3 years and stayed: 0.636295180723
3 years or less and left: 0.169195829462
3 years or more and stayed: 0.830804170538


The gini index is then calculated as follows:

$Gini = \sum Proportion * (1 - Proportion)$

In our example, the gini index for the years on company is:

In [7]:
def ComputeGini(proportions):
    return sum([(prop * (1 - prop)) for prop in proportions])

print(ComputeGini([more_than_3_left,more_than_3_stayed,less_than_3_left,less_than_3_stayed]))

0.743984448933


So in the end, splitting by the median number of years wasn't that good of a split. Let's create a function that computes all these steps more easily.

In [9]:
def Gini(groups,class_values=[1,0],target_name = "left"):
    gini = 0.0
    for value in class_values:
        for group in groups:
            group_size = group.shape[0]
            split = group.loc[group[target_name] == value]
            split_size = split.shape[0]
            if split_size > 0:
                proportion = split_size / group_size
                gini += proportion * (1 - proportion)
    return gini

print(Gini([HR.loc[HR.time_spend_company > median_years],HR.loc[HR.time_spend_company <= median_years]],[1,0],"left"))

0.7439844489329797


Let's experiment with different values of years.

In [10]:
print("Split 1 years " + str(Gini([HR.loc[HR.time_spend_company > 1],HR.loc[HR.time_spend_company <= 1]],[1,0],"left")))
print("Split 2 years " + str(Gini([HR.loc[HR.time_spend_company > 2],HR.loc[HR.time_spend_company <= 2]],[1,0],"left")))
print("Split 4 years " + str(Gini([HR.loc[HR.time_spend_company > 4],HR.loc[HR.time_spend_company <= 4]],[1,0],"left")))
print("Split 5 years " + str(Gini([HR.loc[HR.time_spend_company > 5],HR.loc[HR.time_spend_company <= 5]],[1,0],"left")))

Split 1 years 0.3627984870747277
Split 2 years 0.4515623350408856
Split 4 years 0.7981143203281653
Split 5 years 0.6429470010167895


Let's also experiment with another variable.

In [11]:
print("Split low salary " + str(Gini([HR.loc[HR.salary == "low"],HR.loc[HR.salary != "low"]],[1,0],"left")))
print("Split medium salary " + str(Gini([HR.loc[HR.salary == "medium"],HR.loc[HR.salary != "medium"]],[1,0],"left")))
print("Split high salary " + str(Gini([HR.loc[HR.salary == "high"],HR.loc[HR.salary != "high"]],[1,0],"left")))

Split low salary 0.7153542922751005
Split medium salary 0.7133050676097072
Split high salary 0.5022896051397686


And a compound experimentation.

In [12]:
low_salary = HR.loc[HR.salary == "low"]
medium_salary = HR.loc[HR.salary == "medium"]
high_salary = HR.loc[HR.salary == "high"]

print("Split low salary 3 years " + str(Gini([low_salary.loc[low_salary.time_spend_company > 3],
                                             low_salary.loc[low_salary.time_spend_company <= 3]],[1,0],"left")))
print("Split low salary 2 year " + str(Gini([low_salary.loc[low_salary.time_spend_company > 2],
                                             low_salary.loc[low_salary.time_spend_company <= 2]],[1,0],"left")))

print("Split medium salary 3 years " + str(Gini([medium_salary.loc[medium_salary.time_spend_company > 3],
                                             medium_salary.loc[medium_salary.time_spend_company <= 3]],[1,0],"left")))

print("Split medium salary 2 year " + str(Gini([medium_salary.loc[medium_salary.time_spend_company > 2],
                                             medium_salary.loc[medium_salary.time_spend_company <= 2]],[1,0],"left")))

print("Split high salary 3 years " + str(Gini([high_salary.loc[high_salary.time_spend_company > 3],
                                             high_salary.loc[high_salary.time_spend_company <= 3]],[1,0],"left")))

print("Split high salary 2 year " + str(Gini([high_salary.loc[high_salary.time_spend_company > 2],
                                             high_salary.loc[high_salary.time_spend_company <= 2]],[1,0],"left")))

Split low salary 3 years 0.830374825997142
Split low salary 2 year 0.4953205321386966
Split medium salary 3 years 0.675601829840803
Split medium salary 2 year 0.42338841464891047
Split high salary 3 years 0.2643282469015171
Split high salary 2 year 0.16017314032344593


So we found a very good split for those that have a high salary and splitted at 2 years on the company. However, manually creating these splits is a tedious process. So let's create a function that iteratively creates a list of splits.

## 2 Splitting

In [13]:
def Create_Split(data,split_column):
    splits = []
    if data[split_column].dtype == object or len(data[split_column].unique()) == 2:
        #It's either a categorical column or a binary column.
        values = data[split_column].unique()
        if len(values == 2):
            values = [values[0]]
        for value in values:
            left_split = data.loc[data[split_column] == value]
            right_split = data.loc[data[split_column]!= value]
            split_name = split_column + " = " + str(value)
            splits.append([left_split,right_split,split_name])
    else:
        #It's a numerical column.
        summary = data[split_column].describe()
        values = np.asarray([summary["25%"],summary["50%"],summary["75%"]])
        #Make sure they are unique splits.
        for value in np.unique(values):
            left_split = data.loc[data[split_column] <= value]
            right_split = data.loc[data[split_column] > value]
            split_name = split_column + " <= " + str(value)
            splits.append([left_split,right_split,split_name])
    
    return splits

In [14]:
year_splits = Create_Split(HR,"time_spend_company")
#Print all the split names.
for split in year_splits:
    print(split[2])

time_spend_company <= 3.0
time_spend_company <= 4.0


Now, let's see which column gives us the best singular split.

In [15]:
def Get_Best_Column_Split(data,target="left",verbose = False):
    temp = data.drop(target, axis = 1).copy()
    best_gini = 1
    best_split_condition = ""
    best_split = {}
    for column in temp.columns:
        column_splits = Create_Split(data,column)
        for split in column_splits:
            
            split_gini = Gini(split[0:2])
            if verbose:
                print("Splitting : " + split[2] + " Gini : " + str(split_gini))
            if split_gini < best_gini:
                best_gini = split_gini
                best_split_condition = split[2]
                best_split["left"] = split[0].drop(column,axis = 1)
                best_split["right"] = split[1].drop(column, axis = 1)
                best_split["condition"] = best_split_condition
                
    if verbose:
        print("Best split: " + best_split_condition + " best Gini: " + str(best_gini))
    return best_split

In [16]:
Best_Split = Get_Best_Column_Split(HR,"left",verbose = True)
Best_Split["right"].head()

Splitting : satisfaction_level <= 0.44 Gini : 0.6842218273867912
Splitting : satisfaction_level <= 0.64 Gini : 0.6767195791231189
Splitting : satisfaction_level <= 0.82 Gini : 0.6148138363301998
Splitting : last_evaluation <= 0.56 Gini : 0.7758772040261261
Splitting : last_evaluation <= 0.72 Gini : 0.7249095845147965
Splitting : last_evaluation <= 0.87 Gini : 0.7678089292997319
Splitting : number_project <= 3.0 Gini : 0.7289819499966351
Splitting : number_project <= 4.0 Gini : 0.7698458241123651
Splitting : number_project <= 5.0 Gini : 0.777638157339555
Splitting : average_montly_hours <= 156.0 Gini : 0.7765419804049951
Splitting : average_montly_hours <= 200.0 Gini : 0.7244178845497011
Splitting : average_montly_hours <= 245.0 Gini : 0.7805322047899803
Splitting : time_spend_company <= 3.0 Gini : 0.7439844489329797
Splitting : time_spend_company <= 4.0 Gini : 0.7981143203281653
Splitting : Work_accident = 0 Gini : 0.5333904902485388
Splitting : promotion_last_5years = 0 Gini : 0.47885

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,Work_accident,left,dept,salary
18,0.45,0.51,2,160,3,1,1,sales,low
82,0.79,0.59,4,139,3,0,1,management,low
171,0.41,0.46,2,160,3,0,1,sales,low
287,0.11,0.79,6,292,4,0,1,technical,low
584,0.41,0.56,2,154,3,0,1,support,medium


So the column that helps us distinguish best if an employee has left the company or not is if they had a promotion in the last 5 years, based on its lowest gini score. This also makes sense that if an employee hasn't received a promotion they may feel disgruntled and end up leaving the office. However, this singular value is not enough to give us a good accuracy for prediction. If we combine a sequence of predictions, we can have a way better shot at dividing those who left and those who didn't. With splitting and calculating gini out of the way, we can proceed to actually building the **decision tree**.

## 3. Creating a Decision Tree

To build a tree we must make sure of the following things:

- The tree must have final nodes which don't split anymore.
- The tree must recursively find splits on itself.
- A split must have a minimum number of examples per split and a maximum number of splits. 

Let's go step by step, and first establish when a node isn't allowed to split anymore. To do this we define to variables:

- Maximum depth: If a route has reached the maximum depth, the tree isn't allowed to split anymore.
- Minimum split size: The minimum number of rows that a split must contain.

Let's first define how the predictions are made in the terminal node. This is done by selecting the most common value of all the possible values.

In [17]:
def terminal_node(group,target="left"):
    outcomes = group[target].value_counts()
    return outcomes.index[0]

Now we proceed to creating the tree. This is done by calling getting the best column split over and over again while making sure that we don't surpass the max depth threshold, and that we are only splitting on nodes with enough values. This is done using a **recursive** function, a type of function that calls itself until it is stopped by a certain condition. In our case, it's when we have reached the max depth, or are unable to split anymore.

In [18]:
def Split_Tree(split,cur_depth,min_size,max_depth=3,target = "left",verbose = False):
    left,right = split["left"],split["right"]
    #In case our depth is too large.
    del split["left"]
    del split["right"]
    
    
    if  not left.any or  not right.any:
        split["left"] = split["right"] = terminal_node((left + right))
        return
    if cur_depth == max_depth:
        split["left"], split["right"] = terminal_node(left), terminal_node(right)
        return
    #Calculate left
    if len(left) <= min_size:
        split["left"] = terminal_node(left)
    else:
        split["left"] = Get_Best_Column_Split(left,verbose = verbose)
        Split_Tree(split["left"],cur_depth + 1,min_size=min_size,verbose = verbose)
    
    #Calculate right
    if len(right) <= min_size:
        split["right"] = terminal_node(right)
    else:
        split["right"] = Get_Best_Column_Split(right,verbose = verbose)
        Split_Tree(split["right"],cur_depth + 1,min_size = min_size,verbose= verbose)
        

Finally, we can build the the whole tree by combining finding the first best split, and then splitting it until we can't do it anymore. 

In [40]:
def Build_Tree(data,max_depth= 6,ratio = 10,target= "left",verbose = False):
    root = Get_Best_Column_Split(data, verbose = verbose)
    min_size = len(data) // ratio
    Split_Tree(root,1,3,max_depth,verbose = verbose)
    return root


def Print_Tree(Node, depth=0):
    if isinstance(Node, dict):
        print('{} : {} {}'.format(depth, depth* ' ',Node['condition']))
        Print_Tree(Node['left'], depth+1)
        Print_Tree(Node['right'], depth+1)
    else:
        print('{} :{} {}'.format(depth, depth * ' ',Node))
    

Let's now get the best decision tree for the HR dataset. 

In [41]:
tree = Build_Tree(HR,verbose = False)
Print_Tree(tree)

0 :  promotion_last_5years = 0
1 :   Work_accident = 0
2 :    satisfaction_level <= 0.81
3 :    0
3 :    0
2 :    time_spend_company <= 2.0
3 :    0
3 :    0
1 :   time_spend_company <= 5.0
2 :    satisfaction_level <= 0.82
3 :    0
3 :    0
2 :    satisfaction_level <= 0.5
3 :    0
3 :    0


## 4. Prediction

Finally, to predict using the resulting tree, we can traverse it just like we did with print, but this time we evaluate the conditions that have been established by the tree. Since we saved the conditions as a string, this means parsing the string to check the column and the comparison we must make.

In [42]:
def Parse_Value(value):
    if "." in value:
        return float(value)
    try:
        return int(value)
    except:
        return value

def Predict(Node,row):
    if isinstance(Node,dict):
        condition = Node['condition'].split()
        column = condition[0]
        comparison = condition[1]
        value = Parse_Value(condition[2])
        
        if comparison == "=":
            if row[column] == value:
                return Predict(Node['left'],row)
            else:
                return Predict(Node['right'],row)
        else:
            if row[column] <= value:
                return Predict(Node['left'],row)
            else:
                return Predict(Node['right'],row)
        
    else:
        return Node

Let's now create a subsample of 10 rows from the dataset, and try to predict their values.

In [43]:
def Test(sample,tree,target = "left", verbose = True):
    total = 0
    for index,row in sample.iterrows():
        real = row[target]
        prediction = Predict(tree,row)
        if verbose:
            print("Expected: " + str(real) + " Predicted: " + str(prediction))
        total += real == prediction
    accuracy = total/len(sample)
    if verbose:
        print("Accuracy = {}%".format(accuracy*100))
    return accuracy
    
              
np.random.seed(777)
random_indexes = np.random.choice(len(HR), replace=False, size=50)
sample = HR.iloc[random_indexes]
Test(sample,tree)

Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0


0.76000000000000001

As you have probably taken notice now, our prediction are imbalanced. This is because the proportion of values of "left" in the HR dataset is imbalanced. Let's try to balance it.

In [44]:
HR.left.value_counts()

0    11428
1     3571
Name: left, dtype: int64

In [45]:
left_dataset = HR.loc[HR.left == 0]
random_left  = np.random.choice(len(left_dataset), replace=False, size=3571)
sample_left = left_dataset.iloc[random_left]
balanced = sample_left.append(HR.loc[HR.left == 1])
balanced.shape

(7142, 10)

And now let's try retraining the tree. 

In [46]:
btree = Build_Tree(balanced,max_depth = 5,ratio= 7,verbose = False)
Print_Tree(btree)

0 :  number_project <= 2.0
1 :   average_montly_hours <= 136.0
2 :    last_evaluation <= 0.48
3 :    1
3 :    1
2 :    satisfaction_level <= 0.38
3 :    1
3 :    1
1 :   time_spend_company <= 3.0
2 :    Work_accident = 0
3 :    0
3 :    0
2 :    satisfaction_level <= 0.11
3 :    1
3 :    1


And once again, predict!

In [47]:
random_indexes = np.random.choice(len(balanced), replace=False, size=50)
sample = balanced.iloc[random_indexes]
Test(sample,btree)

Expected: 1 Predicted: 1
Expected: 0 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 1
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 1
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1
Expected: 1 Predicted: 1
Expected: 0 Predicted: 0
Expected: 0 Predicted: 0
Expected: 1 Predicted: 1


0.90000000000000002

Now we have a much better accuracy. You have probably guessed that largely imbalanced datasets can throw off decision trees, as it will just throw whatever is most common. Do take care to not employ decision trees in problems like that, or at least change the way of predicting. In case we want to optimize the parameters, we can just create a function that does just that.

In [48]:
def Optimize_tree(train,test):
    depths = list(range(1,10))
    ratios= list(range(2,20))
    best_accuracy = 0.0
    best_params = {}
    
    for depth in depths:
        for ratio in ratios:
            tree = Build_Tree(train,max_depth = depth,ratio= ratio,verbose = False)
            accuracy = Test(test,tree,verbose = False)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_params["max_depth"] = depth
                best_params["ratio"] = ratio
                
    print(best_accuracy)
    return best_params

shuffle  = balanced.sample(frac=1).reset_index(drop=True)
train = shuffle[0:int(len(shuffle) * 0.2)]
test = shuffle[len(train):len(shuffle)]
best_params = Optimize_tree(train,test)
best_params

KeyboardInterrupt: 

With a large sample (about 600 samples instead of the 50 we were working with in the first test) the model gets an 81% accuracy. 

## Decision Tree Regressor.

Decision trees can also be used to predict numerical values. It follows almost the exact same procedure, but instead of using gini, we use the sum of squared errors. First, we define how to split the trees. This is done by using the **Standard Deviation Reduction** method. What this method does is that it compares the standard deviation of the target variable (in this case satisfaction level), against the standard deviation of both splits that are being made. This is done by calculating the difference between the whole standard deviation and the standard deviation of the split. Then, the split with the highest standard deviation difference is chosen.

$SDR(y,x) = SD(y) - SD(y,x)$ Where $SD$ = Standard Deviation

Example:

$SDR(Satisfaction level,last evaluation) = SD(Satisfaction level) - SD(Satisfaction level,last evaluation) $

**Why?** Because this means that this split is the most homogeneous one, or in more simple terms, there's enough difference between the whole and the split ( the standard deviation of the split is small, so the difference is a large number), that the split contains information that separates it well from the whole, and thus is good for predictions. This is similar to calculating the Gini index, where we wanted splits with imbalanced proportions as they tell more information. 

In [49]:
def SdevReduction(groups,sdev,target_name = "satisfaction_level"):
    best_sdev = 0
    best_group = None
    for group in groups:
        groupsdev = group[target_name].std()
        diff = sdev - groupsdev
        if diff > best_sdev:
            best_sdev = diff
            best_group = group
            
    return best_sdev


Next, we create a modification of getting the best column split. The only difference here is that instead of calculating Gini, we calculate the SDR.

In [50]:
def Get_Best_Column_Reg(data,target="satisfaction_level",verbose = False):
    temp = data.drop(target, axis = 1).copy()
    best_sdev = 0
    best_split_condition = ""
    best_split = {}
    orig_sdev = data[target].std()
    for column in temp.columns:
        column_splits = Create_Split(data,column)
        for split in column_splits:
            split_sdev = SdevReduction(split[0:2],orig_sdev)
            if verbose:
                print("Splitting : " + split[2] + " SDEV : " + str(split_sdev))
            if split_sdev > best_sdev:
                best_sdev = split_sdev
                best_split_condition = split[2]
                best_split["left"] = split[0].drop(column,axis = 1)
                best_split["right"] = split[1].drop(column, axis = 1)
                best_split["condition"] = best_split_condition
                
    if verbose:
        print("Best split: " + best_split_condition + " best SDEV: " + str(best_sdev))
    return best_split

In [51]:
Best_Split = Get_Best_Column_Reg(HR,"satisfaction_level",verbose = True)
Best_Split["right"].head()

Splitting : last_evaluation <= 0.56 SDEV : 0.039633242794188644
Splitting : last_evaluation <= 0.72 SDEV : 0.02816914778615326
Splitting : last_evaluation <= 0.87 SDEV : 0.008339809840993945
Splitting : number_project <= 3.0 SDEV : 0.040077096467931506
Splitting : number_project <= 4.0 SDEV : 0.038811722043194496
Splitting : number_project <= 5.0 SDEV : 0.033203599557505015
Splitting : average_montly_hours <= 156.0 SDEV : 0.037954194101917665
Splitting : average_montly_hours <= 200.0 SDEV : 0.02867664792861066
Splitting : average_montly_hours <= 245.0 SDEV : 0.026187545625358938
Splitting : time_spend_company <= 3.0 SDEV : 0.04547947471900213
Splitting : time_spend_company <= 4.0 SDEV : 0.005282887851238843
Splitting : Work_accident = 0 SDEV : 0.017442455075225904
Splitting : left = 1 SDEV : 0.03152689654142121
Splitting : promotion_last_5years = 0 SDEV : 0.036141332889753935
Splitting : dept = sales SDEV : 0.0006116703413255042
Splitting : salary = low SDEV : 0.006357049903122081
Best

Unnamed: 0,satisfaction_level,last_evaluation,number_project,average_montly_hours,Work_accident,left,promotion_last_5years,dept,salary
1,0.8,0.86,5,262,0,1,0,sales,medium
2,0.11,0.88,7,272,0,1,0,sales,medium
3,0.72,0.87,5,223,0,1,0,sales,low
6,0.1,0.77,6,247,0,1,0,sales,low
7,0.92,0.85,5,259,0,1,0,sales,low


Next, we also modify the terminal nodes and recursive splitting. For the terminal nodes, instead of returning the most common value in the in the split, we return the mean of the target value in the split. This is  how decision trees do predictions for regression. For the recursive splitting, we just replace the appropriate functions.

In [52]:
def terminal_reg(group,target="satisfaction_level"):
    return group[target].mean()

def Reg_Split(split,cur_depth,min_size,max_depth=3,target = "satisfaction_level",verbose = False):
    left,right = split["left"],split["right"]
    #In case our depth is too large.
    del split["left"]
    del split["right"]
    
    
    if  not left.any or  not right.any:
        split["left"] = split["right"] = terminal_reg((left + right))
        return
    if cur_depth == max_depth:
        split["left"], split["right"] = terminal_reg(left), terminal_reg(right)
        return
    #Calculate left
    if len(left) <= min_size:
        split["left"] = terminal_reg(left)
    else:
        split["left"] = Get_Best_Column_Reg(left,verbose = verbose)
        Reg_Split(split["left"],cur_depth + 1,min_size=min_size,verbose = verbose)
    
    #Calculate right
    if len(right) <= min_size:
        split["right"] = terminal_reg(right)
    else:
        split["right"] = Get_Best_Column_Reg(right,verbose = verbose)
        Reg_Split(split["right"],cur_depth + 1,min_size = min_size,verbose= verbose)

For the regression tree building, the same thing is done.

In [65]:
def Build_Reg_Tree(data,max_depth= 3,ratio = 10,target= "satisfaction_level",verbose = False):
    root = Get_Best_Column_Reg(data, verbose = verbose)
    min_size = len(data) // ratio
    Reg_Split(root,1,min_size,max_depth,verbose = verbose)
    return root

In [69]:
reg_tree = Build_Reg_Tree(HR,max_depth = 5,verbose = False)
Print_Tree(reg_tree)

0 :  time_spend_company <= 3.0
1 :   left = 1
2 :    last_evaluation <= 0.51
3 :    0.40774038461538414
3 :    0.4230855018587355
2 :    number_project <= 4.0
3 :    0.7016313264200547
3 :    0.6822621298046635
1 :   number_project <= 5.0
2 :    left = 1
3 :    0.7525047801147224
3 :    0.60905449770191
2 :   0.20731967213114882


In [74]:
def Reg_Test(sample,tree,target = "satisfaction_level", verbose = True):
    error = 0
    for index,row in sample.iterrows():
        real = row[target]
        prediction = Predict(tree,row)
        if verbose:
            print("Expected: " + str(real) + " Predicted: " + str(prediction))
        error += (real - prediction) ** 2
    MSE  = error/len(sample)
    RMSE = pow(MSE,0.5)
    return RMSE
    
              
random_indexes = np.random.choice(len(HR), replace=False, size=50)
sample = HR.iloc[random_indexes]
Reg_Test(sample,reg_tree)

Expected: 0.74 Predicted: 0.6822621298046635
Expected: 0.37 Predicted: 0.6822621298046635
Expected: 0.71 Predicted: 0.7016313264200547
Expected: 0.91 Predicted: 0.7016313264200547
Expected: 0.44 Predicted: 0.6822621298046635
Expected: 0.54 Predicted: 0.7016313264200547
Expected: 0.59 Predicted: 0.60905449770191
Expected: 0.39 Predicted: 0.40774038461538414
Expected: 0.67 Predicted: 0.6822621298046635
Expected: 0.11 Predicted: 0.20731967213114882
Expected: 0.74 Predicted: 0.7016313264200547
Expected: 0.73 Predicted: 0.60905449770191
Expected: 0.36 Predicted: 0.40774038461538414
Expected: 0.5 Predicted: 0.7016313264200547
Expected: 0.66 Predicted: 0.7016313264200547
Expected: 0.62 Predicted: 0.60905449770191
Expected: 0.99 Predicted: 0.6822621298046635
Expected: 0.5 Predicted: 0.6822621298046635
Expected: 0.1 Predicted: 0.20731967213114882
Expected: 0.11 Predicted: 0.20731967213114882
Expected: 0.71 Predicted: 0.7016313264200547
Expected: 0.15 Predicted: 0.60905449770191
Expected: 0.89 P

0.17661582966442524

In [75]:
sample = HR.sample(frac = 0.2)
Reg_Test(sample,reg_tree,verbose = False)

0.1947695750872891

And there we go. We now know how to create Decision trees for Regression and Classification problems. Still, Decision trees are hard to apply to some datasets, as they are easily swayed. On the next notebook, we'll see how to combine several trees into a more robust model.