# Decision Trees
* A simple tree-like structure in which the model makes a decision at every node
* Useful for simple tasks
* One of the most popular algorithms
* Easy to explain: the path from root to leaf shows how a decision was reached

# Why are decision trees popular?
* Easy to interpret and present
* Well-defined logic that mimics human decision-making
* Random forests, which are ensembles of decision trees, are more powerful classifiers
* Categorical feature values are the classic setting (ID3 assumes them). Continuous values are either discretized before building the model or split on a threshold, as this notebook does with the feature mean.
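
As a rough illustration of discretizing a continuous feature before building a tree, the sketch below maps ages to labelled ranges. The bin edges, the labels, and the `discretize` helper are hypothetical, chosen only for this example; they are not part of the notebook.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels, chosen only for illustration.
AGE_EDGES = [12, 18, 60]                          # boundaries between bins
AGE_LABELS = ["child", "teen", "adult", "senior"]  # one more label than edges


def discretize(value, edges=AGE_EDGES, labels=AGE_LABELS):
    """Map a continuous value to the label of the bin it falls in."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return labels[bisect_right(edges, value)]


ages = [4.0, 15.0, 22.0, 38.0, 71.0]
print([discretize(a) for a in ages])
```

A value equal to an edge falls into the bin above it, so an age of exactly 12 is labelled "teen" under these assumed edges.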

# Build Decision Trees
Two common algorithms:

* CART (Classification and Regression Trees) → uses the Gini index as its splitting metric for classification.
* ID3 (Iterative Dichotomiser 3) → uses entropy and information gain as its splitting metrics.
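
To make the two metrics concrete, here is a minimal sketch that computes Gini impurity and entropy for a list of class labels. The function names `gini_impurity` and `entropy_bits` are assumptions made for this example and are separate from the `entropy` function defined later in the notebook.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each distinct label."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the classes k."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy_bits(labels):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)) over the classes k."""
    return sum(-p * math.log2(p) for p in class_probabilities(labels))


balanced = [0, 0, 1, 1]   # evenly mixed: the most impure two-class case
pure = [1, 1, 1, 1]       # a single class: no impurity at all
print(gini_impurity(balanced), entropy_bits(balanced))
print(gini_impurity(pure), entropy_bits(pure))
```

Both metrics are zero for a pure node and largest for an evenly mixed one, which is why either can be used to score candidate splits.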

In [33]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [34]:
data = pd.read_csv('titanic.csv')
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [35]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [36]:
columns_to_drop = ["PassengerId","Name","Ticket","Cabin","Embarked"]

In [37]:
data_clean  = data.drop(columns=columns_to_drop)
data_clean.head()

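|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |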


In [38]:
le = LabelEncoder()
data_clean["Sex"] = le.fit_transform(data_clean["Sex"])
data_clean.head()

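|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 |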


In [39]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int32(1), int64(4)
memory usage: 45.4 KB


In [40]:
# Fill missing ages with the mean age; only the Age column has missing values here
data_clean["Age"] = data_clean["Age"].fillna(data_clean["Age"].mean())
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int32  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int32(1), int64(4)
memory usage: 45.4 KB


In [41]:
input_cols = ["Pclass","Sex","Age","SibSp","Parch","Fare"]
out_cols = ["Survived"]

X = data_clean[input_cols]
y = data_clean[out_cols]

In [46]:
def entropy(col):
    # Shannon entropy (in bits) of the label distribution in col
    _, counts = np.unique(col, return_counts=True)
    N = float(col.shape[0])
    ent = 0.0
    for count in counts:
        p = count / N
        ent += -1.0 * p * np.log2(p)
    return ent


def divide_data(X, fkey, fval):
    # Split rows on a threshold: values <= fval go left, values > fval go right.
    # Boolean masks replace the row-by-row DataFrame.append calls, which were
    # slow and were removed in pandas 2.0.
    mask = X[fkey] > fval
    X_left = X[~mask]
    X_right = X[mask]
    return X_left, X_right


def information_gain(X, fkey, fval):
    left, right = divide_data(X, fkey, fval)

    # A split that leaves one side empty is useless; return a large negative
    # sentinel so it never wins the argmax.
    if left.shape[0] == 0 or right.shape[0] == 0:
        return -1000000

    # Fraction of samples on each side, used to weight the child entropies
    l = float(left.shape[0]) / X.shape[0]
    r = float(right.shape[0]) / X.shape[0]

    inf_gain = entropy(X.Survived) - (l * entropy(left.Survived) + r * entropy(right.Survived))
    return inf_gain

In [47]:
for fx in X.columns:
    print(fx,end=" ---> ")
    print(information_gain(data_clean,fx,data_clean[fx].mean()))

Pclass ---> 0.07579362743608165
Sex ---> 0.2176601066606142
Age ---> 0.001158644038169343
SibSp ---> 0.009584541813400071
Parch ---> 0.015380754493137694
Fare ---> 0.042140692838995464
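
The cell above scores each feature at a single threshold, its mean. As a hedged sketch of one common alternative, the code below scans the midpoints between adjacent distinct values and keeps the threshold with the highest information gain. The helpers `gain_at` and `best_threshold` and the toy fare data are made up for illustration; the code uses plain Python lists rather than the notebook's DataFrame.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_at(values, labels, threshold):
    """Information gain of splitting into values <= threshold and values > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return float("-inf")  # degenerate split, never preferred
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not candidates:
        return None  # a constant feature cannot be split
    return max(candidates, key=lambda t: gain_at(values, labels, t))


# Toy data, made up for illustration only.
fares = [7.0, 8.0, 9.0, 30.0, 50.0, 80.0]
survived = [0, 0, 0, 1, 1, 1]

t = best_threshold(fares, survived)
mean_fare = sum(fares) / len(fares)
print("best threshold:", t, "gain:", gain_at(fares, survived, t))
print("mean threshold:", mean_fare, "gain:", gain_at(fares, survived, mean_fare))
```

On this toy data the scanned threshold separates the classes perfectly, while the mean does not. The trade-off is cost: scanning evaluates one split per distinct value instead of one per feature.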


In [55]:
class DecisionTree:
    def __init__(self, depth=0, max_depth=5):
        self.left = None
        self.right = None
        self.fkey = None      # feature this node splits on
        self.fval = None      # threshold: mean of that feature at this node
        self.max_depth = max_depth
        self.depth = depth
        self.target = None    # majority label of the samples at this node

    def train(self, X):
        features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

        # Pick the feature whose mean-threshold split gives the highest information gain
        info_gain = [information_gain(X, fx, X[fx].mean()) for fx in features]
        self.fkey = features[np.argmax(info_gain)]
        self.fval = X[self.fkey].mean()

        print("Making Tree Features is ", self.fkey)

        # Label the node with the majority class of its samples
        if X.Survived.mean() >= 0.5:
            self.target = "Survive"
        else:
            self.target = "Dead"

        X_left, X_right = divide_data(X, self.fkey, self.fval)
        X_left = X_left.reset_index(drop=True)
        X_right = X_right.reset_index(drop=True)

        # Stop if the split is degenerate or the depth limit has been reached
        if X_left.shape[0] == 0 or X_right.shape[0] == 0:
            return
        if self.depth >= self.max_depth:
            return

        # Otherwise grow both subtrees recursively
        self.left = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.left.train(X_left)
        self.right = DecisionTree(depth=self.depth + 1, max_depth=self.max_depth)
        self.right.train(X_right)

In [None]:
d =  DecisionTree()
d.train(data_clean)

Making Tree Features is  Sex
Making Tree Features is  Pclass
Making Tree Features is  Pclass
Making Tree Features is  Parch
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Parch
Making Tree Features is  Age
Making Tree Features is  Fare
Making Tree Features is  Parch
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Fare
Making Tree Features is  SibSp
Making Tree Features is  Fare
Making Tree Features is  Fare
Making Tree Features is  Age
Making Tree Features is  SibSp
Making Tree Features is  Parch
Making Tree Features is  Age
Making Tree Features is  SibSp
Making Tree Features is  Fare
Making Tree Features is  Parch
Making Tree Features is  Parch
Making Tree Features is  Age
Making Tree Features is  Age
Making Tree Features is  Age
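
The class above only trains; it has no prediction step. A minimal sketch of how prediction could work on a tree with the same fields (`fkey`, `fval`, `target`, `left`, `right`) follows. The `Node` dataclass, the `predict` function, and the hand-built toy tree are all hypothetical stand-ins for illustration, not the tree produced by the training run above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in with the same fields as DecisionTree."""
    fkey: Optional[str] = None
    fval: Optional[float] = None
    target: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def predict(node, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while node.left is not None and node.right is not None:
        # Same convention as divide_data: values > fval go right.
        node = node.right if row[node.fkey] > node.fval else node.left
    return node.target


# Hand-built toy tree (not the trained one): split on Sex, then on Pclass.
tree = Node(
    fkey="Sex", fval=0.5, target="Dead",
    left=Node(
        fkey="Pclass", fval=2.5, target="Survive",
        left=Node(target="Survive"),
        right=Node(target="Dead"),
    ),
    right=Node(target="Dead"),
)

print(predict(tree, {"Sex": 0, "Pclass": 1}))
print(predict(tree, {"Sex": 0, "Pclass": 3}))
print(predict(tree, {"Sex": 1, "Pclass": 1}))
```

Because `row[node.fkey]` is the only access used, the same function should also accept a pandas Series such as a single DataFrame row.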
