# Decision Tree


For this problem, you will be implementing a Decision Tree classifier that works on discrete (categorical) features. Although a relatively simple learning algorithm, the Decision Tree is often used as a fundamental building block for more powerful (and popular) models such as Random Forest and Gradient Boosted ensembles. 

You should base your solution on the [ID3](https://en.wikipedia.org/wiki/ID3_algorithm) algorithm. This is a basic tree-learning algorithm that greedily grows a tree based on _information gain_ (reduction in entropy). Please refer to Chapter 3 of _Machine Learning_ by Tom M. Mitchell for more details.
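
To make the information-gain criterion concrete, here is a minimal sketch of how entropy and the gain of a single categorical attribute could be computed. The helper names `entropy` and `information_gain` and the toy data are illustrative assumptions; they are not part of the provided skeleton.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    # H = sum_i p_i * log2(1 / p_i); every p_i is > 0 because it comes from a count
    return float(np.sum(probs * np.log2(1.0 / probs)))

def information_gain(values, labels):
    """Entropy reduction obtained by partitioning `labels` on the attribute `values`."""
    values, labels = np.asarray(values), np.asarray(labels)
    remainder = 0.0
    for v in np.unique(values):
        subset = labels[values == v]
        # weight each partition's entropy by its share of the samples
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: the attribute separates the two classes perfectly
wind = ['Weak', 'Weak', 'Strong', 'Strong']
play = ['Yes', 'Yes', 'No', 'No']
print(information_gain(wind, play))
```

At each node, ID3 evaluates this quantity for every remaining attribute and splits on the one with the largest gain.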


We have provided some skeleton code for the classifier, along with a couple of utility functions in the [decision_tree.py](./decision_tree.py) module. Please fill out the functions marked with `TODO` and feel free to add extra constructor arguments as you see fit (just make sure the default constructor solves the first dataset).


In [1]:
%load_ext autoreload

We begin by loading necessary packages. Below follows a short description of the imported modules:

- `numpy` is the de facto Python package for numerical computation. Most other numerical libraries (including pandas) are built on numpy.
- `pandas` is a widely used package for manipulating (mostly) tabular data
- `decision_tree` refers to the module in this folder that should be further implemented by you

Note: The `%autoreload` statement is an [IPython magic command](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that automatically reloads the newest version of all imported modules within the cell. This means that you can edit the `decision_tree.py` file and just rerun this cell to get the updated version.

In [2]:
%autoreload

import numpy as np 
import pandas as pd 
import decision_tree as dt  # <-- Your implementation

## [1] First Dataset

The first dataset is a toy problem lifted from Table 3.2 in the Machine Learning textbook. The objective is to predict whether a given day is suitable for playing tennis based on several weather conditions. 

### [1.1] Load Data

We begin by loading data from the .csv file located in the same folder as this notebook.

In [3]:
data_1 = pd.read_csv('data_1.csv')
data_1

Unnamed: 0,Outlook,Temperature,Humidity,Wind,Play Tennis
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes


### [1.2] Fit and Evaluate Model

Next we fit and evaluate a Decision Tree over the dataset. We first partition the data into the dependent (`y` = Play Tennis) and independent (`X` = everything else) variables. We then initialize a Decision Tree learner and fit it to all the data. Finally, we evaluate the model over the same data by calculating its accuracy, i.e. the fraction of correctly classified samples.

Note that `.fit` and `.predict` will crash until you implement these two methods in [decision_tree.py](./decision_tree.py).

Assuming that you've correctly implemented the ID3 algorithm as described in the course textbook, you should expect the model to perfectly fit the training data. That is, you should get a classification accuracy of 100%.

In [4]:
# Separate independent (X) and dependent (y) variables
X = data_1.drop(columns=['Play Tennis'])
y = data_1['Play Tennis']

# Create and fit a Decision Tree classifier
model_1 = dt.DecisionTree()  # <-- Should work with default constructor
model_1.fit(X, y)

# Verify that it perfectly fits the training set
print(f'Accuracy: {dt.accuracy(y_true=y, y_pred=model_1.predict(X)) * 100 :.1f}%')


### [1.3] Inspect Classification Rules

A big advantage of Decision Trees is that they are relatively transparent learners. By this we mean that it is easy for an outside observer to analyse and understand how the model makes its decisions. The problem of being able to reason about how a machine learning model reasons is known as _Explainable AI_ and is often a desirable property of machine learning systems.

Every time a Decision Tree is evaluated, the datapoint is compared against a set of nodes starting at the root of the tree and (typically) ending at one of the leaf nodes. An equivalent way to view this reasoning is as an implication rule ($A \rightarrow B$) where the antecedent ($A$) is a conjunction of attribute values and the consequent ($B$) is the predicted label. For instance, if a path down the tree first checks if Outlook=Rain, then checks if Wind=Strong, and then predicts Play Tennis=No, this line of reasoning can be represented as:

- If $Outlook=Rain \cap Wind=Strong$, then predict $Play\ Tennis=No$

We will leverage this property to export the decision tree you just created as a set of rules. For the subsequent cell to work, you must also have implemented the `.get_rules()` method in the provided boilerplate code.
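
One way to produce such rules is to walk the tree depth-first and record the attribute tests along each root-to-leaf path. The sketch below assumes a hypothetical nested-dict tree representation (`{'attr': ..., 'children': {value: subtree}}`, with plain labels as leaves) and a hypothetical helper name `extract_rules`. The node class in the skeleton will likely look different, so treat this as an illustration of the traversal rather than the expected implementation.

```python
def extract_rules(node, path=()):
    """Yield (conditions, label) pairs, one for every root-to-leaf path."""
    if not isinstance(node, dict):
        # a leaf simply holds the predicted label
        yield list(path), node
        return
    for value, child in node['children'].items():
        # extend the path with the test "attr == value" and recurse
        yield from extract_rules(child, path + ((node['attr'], value),))

# Hypothetical toy tree, not the tree learned from data_1
toy_tree = {
    'attr': 'Outlook',
    'children': {
        'Overcast': 'Yes',
        'Rain': {'attr': 'Wind', 'children': {'Strong': 'No', 'Weak': 'Yes'}},
    },
}

for conditions, label in extract_rules(toy_tree):
    print(' ∩ '.join(f'{attr}={value}' for attr, value in conditions), '=>', label)
```

Each yielded pair has the same shape that the cell below iterates over: a sequence of `(attribute, value)` tuples and a label.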

In [5]:
for rules, label in model_1.get_rules():
    conjunction = ' ∩ '.join(f'{attr}={value}' for attr, value in rules)
    print(f'{"✅" if label == "Yes" else "❌"} {conjunction} => {label}')

❌ Humidity=High ∩ Outlook=Sunny => No
✅ Humidity=High ∩ Outlook=Rain ∩ Wind=Weak => Yes
✅ Humidity=Normal ∩ Wind=Weak => Yes
❌ Humidity=Normal ∩ Wind=Strong ∩ Outlook=Rain => No
✅ Humidity=Normal ∩ Wind=Strong ∩ Outlook=Sunny => Yes
❌ Humidity=High ∩ Outlook=Rain ∩ Wind=Strong => No


## [2] Second Dataset

The second dataset involves predicting whether an investment opportunity will result in a successful `Outcome` or not. To make this prediction, you are given a dataset of 200 historical$^1$ business ventures and their outcome, along with the following observed features:

- Whether the business opportunity is in a lucrative market or not
- Whether the presented business idea has a competitive advantage
- Whether the second opinion from another investor is positive or not 
- The founder's previous experience with startups
- The founder's [Zodiac Sign](https://en.wikipedia.org/wiki/Astrology)

---
[1] Disclaimer: The dataset is not based on real-world business ventures. It is synthetic and generated by us. Also, it should not be considered financial advice.

### [2.1] Load Data

This dataset can also be found in a .csv file in the same folder as this notebook.

In [6]:
data_2 = pd.read_csv('data_2.csv')
data_2

Unnamed: 0,Founder Zodiac,Founder Experience,Second Opinion,Competitive Advantage,Lucurative Market,Outcome,Split
0,cancer,moderate,negative,yes,no,success,train
1,cancer,high,positive,yes,no,failure,train
2,scorpio,low,negative,no,no,failure,train
3,cancer,low,negative,no,no,failure,train
4,aquarius,low,positive,yes,yes,success,train
...,...,...,...,...,...,...,...
195,capricorn,moderate,positive,no,yes,failure,test
196,aquarius,low,negative,no,yes,failure,test
197,cancer,moderate,negative,no,yes,failure,test
198,virgo,moderate,negative,no,no,failure,test


### [2.2] Split Data

We've also taken the liberty to pre-split the dataset into three different sets:

- `train` contains 50 samples that you should use to generate the tree
- `valid` contains 50 samples that you can use to evaluate different preprocessing methods and variations to the tree-learning algorithm.
- `test` contains 100 samples and should only be used to evaluate the final model once you're done experimenting.

In [7]:
data_2_train = data_2.query('Split == "train"')
data_2_valid = data_2.query('Split == "valid"')
data_2_test = data_2.query('Split == "test"')
# 'Founder Zodiac' is dropped along with the label and split columns:
# we do not expect zodiac signs to influence the success of a business venture.
drop_cols = ['Founder Zodiac', 'Outcome', 'Split']
X_train, y_train = data_2_train.drop(columns=drop_cols), data_2_train.Outcome
X_valid, y_valid = data_2_valid.drop(columns=drop_cols), data_2_valid.Outcome
X_test, y_test = data_2_test.drop(columns=drop_cols), data_2_test.Outcome

# To keep the zodiac feature instead, use drop_cols = ['Outcome', 'Split'].



data_2.Split.value_counts()

test     100
valid     50
train     50
Name: Split, dtype: int64

### [2.3] Fit and Evaluate Model

You may notice that the basic ID3 algorithm you developed for the first dataset does not generalize well when applied directly to this problem. Feel free to extend the algorithm and/or the data preprocessing pipeline in ways that might improve performance on the validation set (and, ultimately, the test set). As a debugging reference, it is possible to obtain accuracies on the validation and test sets ranging from the mid 80s to the low 90s (in percent).
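
When comparing variations, one simple pattern is to fit each candidate on the training set and keep the one that scores best on the validation set. The sketch below shows only that selection loop. `select_by_validation` is a hypothetical helper, and the one-dimensional threshold rule with its toy arrays is a stand-in for a real learner so that the example runs on its own.

```python
import numpy as np

def select_by_validation(candidates, fit, score):
    """Return (best_candidate, best_score) according to the validation score."""
    best, best_score = None, float('-inf')
    for candidate in candidates:
        model = fit(candidate)      # train with this hyperparameter value
        current = score(model)      # evaluate on held-out validation data
        if current > best_score:
            best, best_score = candidate, current
    return best, best_score

# Hypothetical stand-in learner: the rule "predict 1 if x >= t"
x_valid = np.array([0.1, 0.4, 0.6, 0.9])
y_valid = np.array([0, 0, 1, 1])

def fit(t):
    return t  # the "model" is just the threshold itself

def score(t):
    return float(((x_valid >= t).astype(int) == y_valid).mean())

print(select_by_validation([0.2, 0.5, 0.8], fit, score))
```

With a decision tree, the candidates could be maximum depths, `fit` would train on the `train` split, and `score` would compute accuracy on the `valid` split. The `test` split stays untouched until the final evaluation.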

In [13]:
# Fit model (TO TRAIN SET ONLY)
%autoreload
model_2 = dt.DecisionTree(max_tree_depth=200)  # <-- Feel free to add hyperparameters 
model_2.fit(X_train, y_train)

print(f'Train: {dt.accuracy(y_train, model_2.predict(X_train)) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_2.predict(X_valid)) * 100 :.1f}%')
print(f'Test: {dt.accuracy(y_test, model_2.predict(X_test)) * 100 :.1f}%')



Train: 86.0%
Valid: 72.0%
Test: 62.0%


In [248]:
import numpy as np
import pandas as pd  # used by build_tree (pd.concat)
from math import log

# Define calculate_entropy function to make things easier
def calculate_entropy(labels):
    n_labels = len(labels)
    if n_labels <= 1:
        return 0
    value, counts = np.unique(labels, return_counts=True)
    probs = counts/n_labels
    n_classes = len(value)
    if n_classes <= 1:
        return 0
    entropy = 0
    for i in probs:
        entropy -= i*log(i,2)
    return entropy

def split_by_feature(X, feature_idx, threshold):
    # Boolean mask selecting the rows that satisfy the split condition
    column = X.iloc[:, feature_idx]
    if isinstance(threshold, (int, float)):
        # numerical feature: split on value >= threshold
        mask = column >= threshold
    else:
        # categorical feature: split on equality with the threshold value
        mask = column == threshold
    # negate the whole mask (the original negated the column before comparing)
    return X[mask], X[~mask]

class DecisionNode():
    def __init__(self, feature_idx=None, threshold=None, value=None, true_branch=None, false_branch=None):
        self.feature_idx = feature_idx # index of the feature that is used
        self.threshold = threshold # threshold value for feature when making the decision
        self.value = value # value if the node is a leaf in the tree
        self.true_branch = true_branch # the node we go to if decision returns True
        self.false_branch = false_branch # the node we go to if decision returns False
        
class DecisionTree():
    def __init__(self, min_info_gain=1e-7, max_depth=float("inf")):
        self.root = None # root of this tree
        self.min_info_gain = min_info_gain # minimum information gain to allow splitting
        self.max_depth = max_depth # maximum depth the tree grows to
    def fit(self, X, y):
        self.root = self.build_tree(X, y)
    def build_tree(self, X, y, current_depth=0):
        decision = None
        subtrees = None
        largest_info_gain = 0
        # add y as last column of X
        df = pd.concat((X, y), axis=1)
        n_rows, n_features = X.shape
        if current_depth <= self.max_depth:
            # iterate through every feature
            for feature_idx in range(n_features):
                # values of that column
                feature_values = X.iloc[:, feature_idx]
                unique_values = feature_values.unique()
                for threshold in unique_values:
                    X_true, X_false = split_by_feature(df, feature_idx, threshold)
                    if len(X_true) > 0 and len(X_false) > 0:
                        y_true = X_true.iloc[:,-1]
                        y_false = X_false.iloc[:,-1]
                        # Calculate impurity
                        info_gain = self.calculate_information_gain(y, y_true, y_false)
                        # Keep track of which feature gave the largest information gain
                        if info_gain > largest_info_gain:
                            largest_info_gain = info_gain
                            decision = {"feature_idx":feature_idx, "threshold":threshold}
                            subtrees = {"X_true":X_true.iloc[:,:-1],
                                        "y_true":y_true,
                                        "X_false":X_false.iloc[:,:-1],
                                        "y_false":y_false}
                            
                  
        # we will construct new branch if the information gain is larger than minimum information gain that we've defined
        
        if largest_info_gain > self.min_info_gain:
            true_branch = self.build_tree(subtrees["X_true"], subtrees["y_true"], current_depth+1)
            false_branch = self.build_tree(subtrees["X_false"], subtrees["y_false"], current_depth+1)
            return DecisionNode(feature_idx=decision["feature_idx"], threshold=decision["threshold"], true_branch=true_branch, false_branch=false_branch)

        # at leaf
        leaf_value = self.majority_vote(y)
        return DecisionNode(value=leaf_value)
                        
    def calculate_information_gain(self, y, y_true, y_false):
        # fraction of the samples that fall into the true branch
        p = len(y_true) / len(y)
        entropy = calculate_entropy(y)
        info_gain = entropy - p*calculate_entropy(y_true) - (1-p)*calculate_entropy(y_false)
        return info_gain
                
    def majority_vote(self, y):
        # this is for calculating values for the leaf nodes
        return y.value_counts().idxmax()
                
    def predict_value(self, x, tree=None):
        # recursive method to find the leaf node that corresponds to prediction
        if tree is None:
            tree = self.root
        if tree.value is not None:
            return tree.value
        feature_value = x[tree.feature_idx]
        branch = tree.false_branch
        if isinstance(feature_value, int) or isinstance(feature_value, float):
            if feature_value >= tree.threshold:
                branch = tree.true_branch
        elif feature_value == tree.threshold:
            branch = tree.true_branch
        return self.predict_value(x, branch)
    
    def predict(self, X):
        y_pred = []
        for idx, row in X.iterrows():
            y_pred.append(self.predict_value(row.values))
        
        return y_pred
    

    
def accuracy(y_true, y_pred):
    """
    Computes discrete classification accuracy
    
    Args:
        y_true (array<m>): a length m vector of ground truth labels
        y_pred (array<m>): a length m vector of predicted labels
        
    Returns:
        The fraction of correct predictions
    """
    return (y_true == y_pred).mean()

tree = DecisionTree(max_depth=5)
tree.fit(X_train, y_train)


print(f'Train: {accuracy(y_train, tree.predict(X_train)) * 100 :.1f}%')
print(f'Valid: {accuracy(y_valid, tree.predict(X_valid)) * 100 :.1f}%')
print(f'Test: {accuracy(y_test, tree.predict(X_test)) * 100 :.1f}%')



Train: 92.0%
Valid: 86.0%
Test: 81.0%


## [3] Further steps (optional)

If you're done with the assignment but want some more challenges, consider the following:

- Make a Decision Tree learner that can handle numerical attributes
- Make a Decision Tree learner that can handle numerical targets (regression tree)
- Try implementing [Random Forest](https://en.wikipedia.org/wiki/Random_forest) on top of your Decision Tree algorithm

If you need more data for experimenting, UC Irvine hosts a [large repository](https://archive.ics.uci.edu/ml/datasets.php) of machine learning datasets.
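
As a starting point for the Random Forest idea, the two ingredients that sit on top of a single tree are bootstrap sampling of the training rows and majority voting over the individual trees' predictions. The sketch below illustrates only these two pieces. The helper names and the per-tree predictions are hypothetical, and the random feature subsampling that Random Forest also performs at each split is left out.

```python
import numpy as np
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Sample row indices with replacement (one bootstrap replicate)."""
    return rng.integers(0, n_samples, size=n_samples)

def majority_vote(predictions):
    """Combine per-tree predictions (one sequence per tree) sample by sample."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

rng = np.random.default_rng(0)
idx = bootstrap_indices(5, rng)  # rows one tree would be trained on

# Hypothetical predictions from three trees for four samples
per_tree = [
    ['success', 'failure', 'failure', 'success'],
    ['success', 'success', 'failure', 'failure'],
    ['failure', 'failure', 'failure', 'success'],
]
print(majority_vote(per_tree))
```

In a full implementation, each tree would be fitted on `X.iloc[idx]` and `y.iloc[idx]` for its own bootstrap replicate, and `majority_vote` would be applied to the list of their `.predict` outputs.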


In [249]:
# Alternative approach: scikit-learn's built-in decision tree

data_2_train = data_2.query('Split == "train"')
data_2_valid = data_2.query('Split == "valid"')
data_2_test = data_2.query('Split == "test"')

data_2_train = np.array(data_2_train)
data_2_valid = np.array(data_2_valid)
data_2_test = np.array(data_2_test)

# scikit-learn trees require numeric inputs, so the categorical columns are label-encoded here.
# The encoder is fitted on the full column so that train, valid and test share one mapping.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in range(6):
    le.fit(data_2.iloc[:, i])
    data_2_train[:,i] = le.transform(data_2_train[:,i])
    data_2_valid[:,i] = le.transform(data_2_valid[:,i])
    data_2_test[:,i] = le.transform(data_2_test[:,i])


data_2_train = pd.DataFrame(data_2_train)
data_2_valid = pd.DataFrame(data_2_valid)
data_2_test = pd.DataFrame(data_2_test)


X_train, y_train = data_2_train.drop(columns=[0,5,6]), data_2_train[5]
X_valid, y_valid = data_2_valid.drop(columns=[0,5,6]), data_2_valid[5]
X_test, y_test = data_2_test.drop(columns=[0,5,6]), data_2_test[5]



#GridSearch of Tree Classifier Hyper Parameters
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

def dtree_grid_search(X,y,nfolds):
    #create a dictionary of all values we want to test
    param_grid = { 'criterion':['gini','entropy'],'max_depth': [10,20,30,40,50,100]}
    # decision tree model
    dtree_model=DecisionTreeClassifier()
    #use gridsearch to test all values
    dtree_gscv = GridSearchCV(dtree_model, param_grid, cv=nfolds)
    #fit model to data
    dtree_gscv.fit(X, y)
    return dtree_gscv.best_params_

best_params = dtree_grid_search(X_train,y_train.astype("int"),5)
print("Best Params:", best_params)




# Fit model (TO TRAIN SET ONLY)
model_3 = DecisionTreeClassifier(**best_params)  # <-- Feel free to add hyperparameters 

model_3.fit(X_train, y_train.astype('int'))

print(f'Train: {dt.accuracy(y_train, model_3.predict(np.array(X_train))) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_3.predict(np.array(X_valid))) * 100 :.1f}%')
print(f'Test: {dt.accuracy(y_test, model_3.predict(np.array(X_test))) * 100 :.1f}%')



Best Params: {'criterion': 'gini', 'max_depth': 10}
Train: 92.0%
Valid: 86.0%
Test: 81.0%


In [250]:
# Trying a Pipeline with GridSearchCV
from sklearn import decomposition, datasets
from sklearn import tree
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

#Using pipeline and GridSearch to find best solution
std_slc = StandardScaler()

pca = decomposition.PCA()

dec_tree = tree.DecisionTreeClassifier()


pipe = Pipeline(steps=[('std_slc', std_slc),
                           ('pca', pca),
                           ('dec_tree', dec_tree)])


# Use the second dataset's training features (X refers to the first dataset)
n_components = list(range(1, X_train.shape[1] + 1))


criterion = ['gini', 'entropy']
max_depth = [3,4,5,6,7,8,9,10]



parameters = dict(pca__n_components=n_components,
                      dec_tree__criterion=criterion,
                      dec_tree__max_depth=max_depth)



clf_GS = GridSearchCV(pipe, parameters)
clf_GS.fit(X_train, y_train.astype('int'))

model_3 = DecisionTreeClassifier(criterion="gini", max_depth=8)  # <-- Feel free to add hyperparameters 

model_3.fit(X_train, y_train.astype('int'))

print(f'Train: {dt.accuracy(y_train, model_3.predict(np.array(X_train))) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_3.predict(np.array(X_valid))) * 100 :.1f}%')
print(f'Test: {dt.accuracy(y_test, model_3.predict(np.array(X_test))) * 100 :.1f}%')



print('Best Criterion:', clf_GS.best_estimator_.get_params()['dec_tree__criterion'])
print('Best max_depth:', clf_GS.best_estimator_.get_params()['dec_tree__max_depth'])
print('Best Number Of Components:', clf_GS.best_estimator_.get_params()['pca__n_components'])
print(); print(clf_GS.best_estimator_.get_params()['dec_tree'])

Train: 92.0%
Valid: 86.0%
Test: 81.0%
Best Criterion: gini
Best max_depth: 3
Best Number Of Components: 1

DecisionTreeClassifier(max_depth=3)


In [256]:
from sklearn.ensemble import RandomForestClassifier #Random forest did a little worse?

    
# Fit model (TO TRAIN SET ONLY)
model_3 = RandomForestClassifier(criterion="gini")  # <-- Feel free to add hyperparameters 

model_3.fit(np.array(X_train), np.array(y_train).astype("int"))

print(f'Train: {dt.accuracy(y_train, model_3.predict(np.array(X_train))) * 100 :.1f}%')
print(f'Valid: {dt.accuracy(y_valid, model_3.predict(np.array(X_valid))) * 100 :.1f}%')
print(f'Test: {dt.accuracy(y_test, model_3.predict(np.array(X_test))) * 100 :.1f}%')

Train: 92.0%
Valid: 82.0%
Test: 76.0%
