# Decision trees

Let us consider a real world scenario in which we might use decision trees.

In emergency rooms, an important decision-maker must perform triage for every newly admitted patient

Triage simply means assigning priority: someone who is minutes away from death but can be saved by a timely operation will be admitted to treatment immediately, whereas someone who has a paper cut or something of sorts will have to wait until a more urgent situation is cleared up

We will be building a decision tree about a person to predict how happy they are based on their traits

we will be using the dataset called European social survey

# Looking at the data

In [1]:
import pandas as pd
ess = pd.read_csv('ess.csv')

  ess = pd.read_csv('ess.csv')


In [2]:
print(ess.shape)

(44387, 534)


In [3]:
print(ess.loc[:,'happy'].head())

0    5
1    5
2    8
3    8
4    5
Name: happy, dtype: int64


here we wanted to look only at the happy column, and not the other 533. That's why we used the loc() function. Here the loc() function has sliced the variable called happy from pandas dataframe. In other words, it takes out only that column and ignores the other 533. Then, the head() function shows us the first 5 rows of that column

In [4]:
print(ess.loc[:,'sclmeet'].head())

0    6
1    4
2    4
3    4
4    6
Name: sclmeet, dtype: int64


there are some cases where the responses to some survey questions are missing, sometimes the participant might just refuse to answer or dont know how to answer

For our analysis, we will only consider the responses which are complete.

we can restrict the ess data so that it only contains full responses for the variables we care about


In [5]:
ess= ess.loc[ess['sclmeet'] <= 10, :].copy()
ess= ess.loc[ess['rlgdgr'] <= 10, :].copy()
ess= ess.loc[ess['hhmmb'] <= 50, :].copy()
ess= ess.loc[ess['netusoft'] <= 5, :].copy()
ess= ess.loc[ess['agea'] <= 200, :].copy()
ess= ess.loc[ess['health'] <= 5, :].copy()
ess= ess.loc[ess['happy'] <= 10, :].copy()
ess= ess.loc[ess['eduyrs'] <= 100, :].copy().reset_index(drop = True)

# Splitting our data

There are many ways we can use this datatto explre the relationship between someone's social life and their happiness

on of the simplest approaches is binary split: we compare the happiness levels of people with highly active social lives to those of people with less active social lives

In [6]:
import numpy as np
social = list(ess.loc[:,'sclmeet'])
happy = list(ess.loc[:,'happy'])
low_social_happiness = [hap for soc, hap in zip(social, happy) if soc<=5]
high_social_happiness= [hap for soc, hap in zip(social, happy) if soc>5]

meanlower = np.mean(low_social_happiness)
meanhigher = np.mean(high_social_happiness)

we define 2 new variables, social and happy, by slicing them from the ess dataframe. 

Then we used list comprehensions to find the happiness levels of all people with higher ratings of social activity (which we saved in the high_social_happiness).


we successfully managed to make 2 branches, now we will create more smarterly

# Smarter Splitting

When we compared the happiness levels of people with active versus inactive social lives, we used 5 as our split point, saying that those who were rated higher than 5 had an active social life and viceversa. We chose 5 because it is a natural middle point for rating 1-10. However remember that our goal is to build an acurate predictor of happiness. Rather than splitting based on intuitions about a natural midpoint is or what seems like an active social life, it would be best to makeour binary split in some place that leads to the best possible accuracy

in machine learning prblms, there are a few different ways to measure accuracy

The most natural way is to find the suuum of our errors. In our case, the error that interests us is the difference between our prediction of someone's happiness rating and their actual happiness rating 

If our decision tree predicts that your happiness is 6 but its actually 8, then that tree's error for your rating is 2. If we add up the prediction errors for every respondent in some group, we can get an error sum that measures the decision tree's accuracy for predicting the happiness of members of that group

The closer we can get our error sum to 0, the better our tree is

In [7]:
lowerrors = [abs(lowhappy - meanlower) for lowhappy in low_social_happiness]
higherrors = [abs(highhappy - meanhigher) for highhappy in high_social_happiness]

total_error = sum(lowerrors) + sum(higherrors)

In [8]:
total_error

60223.8860903523

In order to get the highest possible split, we should check all the splits possible and not just 5 so for ease we will create a function for it again

In [9]:
def get_splitpoint(allvalues, predictedvalues):
    lowest_error = float('inf')
    best_split = None
    best_lowermean = np.mean(predictedvalues)
    best_highermean = np.mean(predictedvalues)
    for pctl in range(0,100):
        split_candidate = np.percentile(allvalues, pctl)
        
        loweroutcomes = [outcome for value, outcome in zip(allvalues, predictedvalues) if value<=split_candidate]
        higheroutcomes = [outcome for value, outcome in zip(allvalues, predictedvalues) if value>split_candidate]
        
        if np.min([len(loweroutcomes), len(higheroutcomes)])>0:
            meanlower = np.mean(loweroutcomes)
            meanhigher = np.mean(higheroutcomes)
            
            lowererrors = [abs(outcome - meanlower) for outcome in loweroutcomes]
            highererrors = [abs(outcome - meanhigher) for outcome in higheroutcomes]
            
            total_error = sum(lowererrors) + sum(highererrors)
            
            if total_error<lowest_error:
                best_split = split_candidate
                lowest_error = total_error
                best_lowermean = meanlower
                best_highermean = meanhigher
    return(best_split, lowest_error, best_lowermean, best_highermean)

In this function we use a variable called pctl(short for percentile) to loop through every number from 0 to 100

In the first line of the loop, we define a new split_candidate variable, which is the pctl-th percentile of the dara. After that, we go through the same process

We create a list of the happiness level of people wholse sclmeet values are less than or equal to split candidate and the happiness levels of people whose sclmeet values are greater than the split candidate, and we check the errors that come from using that split candidate 

If the error sum from using the split candidate is smaller than any of the error sums from using the previous split candidates, then we redefine the best_split variable to be equal to the split point that led to the highest accuracy.

We can run this function for any variable

In [10]:
allvalues = list(ess.loc[:,'hhmmb'])
predictedvalues = list(ess.loc[:,"happy"])
print(get_splitpoint(allvalues, predictedvalues))

(1.0, 60860.029867951016, 6.839403436723225, 7.620055170794695)


# Choosing Splitting variables

How to know which variable to split at each branch node

We will use the same principle we used to get optimal split points to decide the best split variable

the best way to split is the one that leads to smallest error

In order to determine that we need to iterate over each available variable and check whether splitting on that variable leads to the smallest error

In [11]:
def getsplit(data, variables, outcome_variable):
    best_var = ''
    lowest_error = float('inf')
    best_split = None
    predictedvalues= list(data.loc[:,outcome_variable])
    best_lowermean = -1
    best_highermean = -1
    for var in variables:
        allvalues = list(data.loc[:,var])
        splitted = get_splitpoint(allvalues, predictedvalues)
        if(splitted[1]<lowest_error):
            best_split = splitted[0]
            lowest_error = splitted[1]
            best_var = var
            best_lowermean = splitted[2]
            best_highermean = splitted[3]
            
    generated_tree = [[best_var, float('-inf'), best_split, best_lowermean], [best_var, best_split, float('inf'), best_highermean]]

    return generated_tree

we defined a function with a for loop that iterates over all the variables in a list of variables.

Fpr each of those variables, it finds the best split by calling its function

Each variable split at its best splitpoint, will lead to certain error sum for our predictions,

If a particular variable has a lower error sum than any previous variable we considered, we will store that variable name as best_var

In [12]:
variables = ['rlgdgr','hhmmb','netusoft','agea','eduyrs']
outcome_variable = 'happy'
print(getsplit(ess, variables, outcome_variable))

[['netusoft', -inf, 4.0, 7.041597337770383], ['netusoft', 4.0, inf, 7.73042471042471]]


# Adding depth

we have completed everything to make the best possible split at each branchpoint and generate a tree with 2 branches.

Next we need to grow out tree beyond just one branching node and 2 terminal nodes.

The final step of our decision tree generation is to specify a depth that we want to reach, and build new branches until we reach that dept

In [13]:
maxdepth = 3
def getsplit(depth, data, variables, outcome_variable):
    best_var = ''
    lowest_error = float('inf')
    best_split = None
    predictedvalues= list(data.loc[:,outcome_variable])
    best_lowermean = -1
    best_highermean = -1
    for var in variables:
        allvalues = list(data.loc[:,var])
        splitted = get_splitpoint(allvalues, predictedvalues)
        if(splitted[1]<lowest_error):
            best_split = splitted[0]
            lowest_error = splitted[1]
            best_var = var
            best_lowermean = splitted[2]
            best_highermean = splitted[3]
            
    generated_tree = [[best_var, float('-inf'), best_split, []], [best_var, best_split, float('inf'), []]]
    
    if depth<maxdepth:
        splitdata1 = data.loc[data[best_var] <= best_split,:]
        splitdata2 = data.loc[data[best_var]>best_split,:]
        if len(splitdata1.index)>10 and len(splitdata2.index)>10:
            generated_tree[0][3] = getsplit(depth+1, splitdata1, variables, outcome_variable)
            generated_tree[1][3] = getsplit(depth+1, splitdata2, variables, outcome_variable)
        else:
            depth = maxdepth + 1
            generated_tree[0][3] = best_lowermean
            generated_tree[1][3] = best_highermean
    else:
        generated_tree[0][3] = best_lowermean
        generated_tree[1][3] = best_highermean
    return generated_tree

In this updated function, we define the generated_tree variable, we now add empty lists to it, instead of means.

We insert means only in terminal nodes, but if we want a tree that has a greater depth, we need to insert other branches within each branch(that is what the empty lists will contain). We also added an if statement with a long chunk of code at the end of the function. If the depth of the current branch is less than the maximum depth we want in a tree, this section will recursively call the get_split() function again to fill in another branch inside it. This process continues until the maximum depth is reached

In [14]:
variables = ['rlgdgr','hhmmb','netusoft','agea','eduyrs']
outcome_variable = 'happy'
maxdepth = 2
print(getsplit(0,ess, variables, outcome_variable))

[['netusoft', -inf, 4.0, [['hhmmb', -inf, 4.0, [['agea', -inf, 15.0, 8.035714285714286], ['agea', 15.0, inf, 6.997666564322997]]], ['hhmmb', 4.0, inf, [['eduyrs', -inf, 11.0, 7.263969171483622], ['eduyrs', 11.0, inf, 8.0]]]]], ['netusoft', 4.0, inf, [['hhmmb', -inf, 1.0, [['agea', -inf, 66.0, 7.135361428970136], ['agea', 66.0, inf, 7.621993127147766]]], ['hhmmb', 1.0, inf, [['rlgdgr', -inf, 5.0, 7.743893678160919], ['rlgdgr', 5.0, inf, 7.9873320537428025]]]]]]


What you see here is a collection of lists nested within each other. These
nested lists represent our full decision tree, though it’s not as easy to read

We have successfully created a decision tree that enables us to predict happiness levels with relatively low error. We can examine the output to see the relative determinants of happiness, and the happiness levels associated with each branch

There is also more exploring we can do with decision trees and our dataset

for example we can try to run the same code but with a different or larger set of variables. we can also create a tree with a different maximum depth

In [15]:
variables = ['sclmeet','rlgdgr','hhmmb','netusoft','agea','eduyrs','health']
outcome_variable = 'happy'
maxdepth = 3
print(getsplit(0,ess,variables,outcome_variable))

[['health', -inf, 2.0, [['sclmeet', -inf, 4.0, [['health', -inf, 1.0, [['rlgdgr', -inf, 9.0, 7.9919636617749825], ['rlgdgr', 9.0, inf, 8.713414634146341]]], ['health', 1.0, inf, [['netusoft', -inf, 4.0, 7.195121951219512], ['netusoft', 4.0, inf, 7.565659008464329]]]]], ['sclmeet', 4.0, inf, [['eduyrs', -inf, 25.0, [['eduyrs', -inf, 8.0, 7.9411764705882355], ['eduyrs', 8.0, inf, 7.999169779991698]]], ['eduyrs', 25.0, inf, [['hhmmb', -inf, 1.0, 7.297872340425532], ['hhmmb', 1.0, inf, 7.9603174603174605]]]]]]], ['health', 2.0, inf, [['sclmeet', -inf, 3.0, [['health', -inf, 3.0, [['sclmeet', -inf, 2.0, 6.049427365883062], ['sclmeet', 2.0, inf, 6.70435393258427]]], ['health', 3.0, inf, [['sclmeet', -inf, 1.0, 4.135036496350365], ['sclmeet', 1.0, inf, 5.407051282051282]]]]], ['sclmeet', 3.0, inf, [['health', -inf, 4.0, [['rlgdgr', -inf, 9.0, 6.992227707173616], ['rlgdgr', 9.0, inf, 7.434662998624484]]], ['health', 4.0, inf, [['hhmmb', -inf, 1.0, 4.948717948717949], ['hhmmb', 1.0, inf, 6.1320

# Evaluating Our decision tree

In order to generate our decision tree, we compared error rates for each potential split point and each potential splitting variable

Now that we have successfully generated a decision tree, it makes sense to do some error calculation not just for a particular branch but for the whole tree

First we will make a function that outputs prediction of the decision tree

In [17]:
def get_prediction(observation, tree):
    j=0
    keepgoing = True
    prediction = -1
    while keepgoing:
        j = j+1
        variable_tocheck = tree[0][0]
        bound1 = tree[0][1]
        bound2 = tree[0][2]
        bound3 = tree[1][2]
        if observation.loc[variable_tocheck]<bound2:
            tree = tree[0][3]
        else:
            tree = tree[1][3]
        if isinstance(tree,float):
            keepgoing = False
            prediction = tree
    return prediction

we create a loop that goes through any proportion of our dataset and gets any tree's happiness for the portion

In [22]:
predictions = []
outcome_variable = 'happy'
maxdepth = 4
thetree = getsplit(0,ess,variables, outcome_variable)
for k in range(0,len(ess.index)):
    observation = ess.loc[k,:]
    predictions.append(get_prediction(observation, thetree))
print(predictions)

[6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 8.0, 7.853383458646617, 6.9998894049988944, 6.9998894049988944, 8.0, 6.9998894049988944, 7.193058568329718, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.847599164926931, 8.0, 8.0, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 7.647353760445682, 8.0, 6.9998894049988944, 6.9998894049988944, 8.0, 6.9998894049988944, 6.9998894049988944, 8.0, 7.098360655737705, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 8.0, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 7.098360655737705, 6.9998894049988944, 8.0, 7.193058568329718, 6.9998894049988944, 8.0, 6.895985401459854, 6.9998894049988944, 6.9998894049988944, 8.0, 8.0, 8.0, 8.0, 6.9998894049988944, 6.9998894049988944, 8.0, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 6.9998894049988944, 5.6853741496598635, 6.9998894049988944, 8.0, 8.0, 6.

In [23]:
ess.loc[:,'predicted'] = predictions
errors = abs(ess.loc[:,'predicted'] - ess.loc[:,'happy'])
print(np.mean(errors))

1.3692897210991413


# The problem of overfitting

to avoid the problem we will evaluate our model by splitting the data into train and test

In [25]:
import numpy as np
np.random.seed(518)
ess_shuffled = ess.reindex(np.random.permutation(ess.index)).reset_index(drop = True)
training_data = ess_shuffled.loc[0:37000,:]
test_data = ess_shuffled.loc[37001:,:].reset_index(drop = True)

we used numpy to shuffle the data. We accomplished this with reindex() method of the pandas module. the reindexing is done with a random shuffling of the row numbers, which we get by using the numpy module's permutation capability

In [26]:
thetree = getsplit(0,training_data, variables, outcome_variable)

In [27]:
predictions = []
for k in range(0,len(test_data.index)):
    observation = test_data.loc[k,:]
    predictions.append(get_prediction(observation, thetree))
    
test_data.loc[:,'predicted'] = predictions
errors = abs(test_data.loc[:,'predicted'] - test_data.loc[:,'happy'])
print(np.mean(errors))

1.3711954784008125


This indicates that our model doesn't suffer from overfitting: it is good at predicting the past and almost exactly as good at predicting the future. Quite often, instead of getting good news, we get bad news - that our model is worse than we thought it was - but it is good to get this news because we can still make improvements before we start using our model in a real scenario. In such cases, before our model is ready to be deployed in real life, we will need to make improvements to it so that its error rate on the test set is minimized 

# Improvements and refinements

Many of the strategies for dealing with overfitting issues boil down to some kind of simplyfying, since simple machine learning models are less likely to suffer from overfitting than are complex models

The first and easiest wayis to limit the depth of the decision tree

another way to simplify our decision tree model is to do what is called <b>pruning</b>. In this method we grow a decision tree to its full depth and then find the branches that we can remove from the tree without increasing the error rate by much

Another refinement can be to use different measures to choose the right split point and the right splitting variable

There are actually other ways to choose the right variables and ways to decide on the right split point including gini impurity, entropy, information gain, and variance reduction

