
To study decision trees for binary classification, we consider an example based on one of Ross Quinlan's datasets. The idea is to predict whether to play tennis from weather information.

In [3]:
import pandas as pd
filename = 'https://github.com/lmassaron/datasets/'
filename += 'releases/download/1.0/tennis.feather'
tennis = pd.read_feather(filename)

With the dataset loaded, we separate the features from the target. The features are categorical, so we one-hot encode them with pd.get_dummies before fitting a model.

In [None]:
print(tennis.keys())  # column names of the DataFrame

In [7]:
X = tennis[['outlook','temperature','humidity','wind']]
X = pd.get_dummies(X)
y = tennis.play
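
As a rough illustration of what the encoding step does, here is a minimal sketch on a hypothetical toy frame (the rows below are made up and are not the tennis data). Each category value becomes its own indicator column.

```python
import pandas as pd

# Hypothetical toy rows (not the real tennis data) with one categorical feature.
toy = pd.DataFrame({"outlook": ["sunny", "overcast", "rain", "sunny"]})

# get_dummies creates one indicator column per category value.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded.astype(int))  # cast to 0/1 for readability
```

Each row has exactly one indicator set, which is why the tree later splits these columns at a threshold between 0 and 1.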

We now import the decision tree classifier, fit it on all the data, and inspect the rules it learns to predict the target.

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X,y)

Next we visualize the fitted tree.

In [None]:
!pip install dtreeviz


In [None]:
!apt-get install graphviz

In [None]:
# Legacy dtreeviz 1.x-style import; newer releases favor dtreeviz.model(...) instead.
from dtreeviz.trees import dtreeviz
viz = dtreeviz(dt, X, y, target_name='play_tennis', feature_names=X.columns,
               class_names=["No", "Yes"], scale=2.0)

In [None]:
viz

The bars show the distribution of the target at each node. The first node, the root, splits on the one-hot feature outlook overcast. Because this feature only takes the values 0 and 1, the threshold of 0.5 is not a probability: it simply separates overcast days (value above 0.5) from all other days (value below 0.5). If the outlook is overcast, the answer is always yes, so that branch ends in a leaf. If the outlook is not overcast, the branch still contains both yes and no examples, so we move on to the next node.

The goal of each split is to separate as cleanly as possible the conditions that make play possible from those that don't. The next node splits on humidity and distributes the target (play) accordingly. Splitting continues until the leaves are reached, where the target values are all of one class. At that point we can read off the rules that determine whether playing tennis is possible given these features.
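
To make "only one type in the leaves" concrete, the sketch below computes Gini impurity, which is the default splitting criterion of scikit-learn's DecisionTreeClassifier. The helper name `gini` and the node contents are hypothetical and chosen only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Hypothetical node contents, chosen only for illustration.
mixed = ["yes", "yes", "no", "no"]
pure = ["yes", "yes", "yes"]
print(gini(mixed))  # 0.5
print(gini(pure))   # 0.0
```

A node with impurity 0 holds a single class, so there is nothing left to split and it becomes a leaf.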

The rule would be...

if outlook overcast > ... then play tennis
else
if outlook overcast <= ... and humidity > ... and wind strong <= ... then play tennis
else
if outlook overcast <= ... and humidity > ... and wind strong > ... and outlook rain is <= ... then play tennis
else
if outlook overcast <= ... and humidity <= ... and outlook sunny <= ... and wind weak > ... then play tennis
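
Rules of this kind do not have to be read off the picture by hand. A minimal sketch, assuming scikit-learn's `export_text` helper and a hypothetical toy dataset (not the tennis file), prints the learned splits with their thresholds:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data (not the tennis file): play only when it is overcast.
toy = pd.DataFrame({
    "outlook": ["overcast", "sunny", "rain", "overcast", "sunny", "rain"],
    "play": ["yes", "no", "no", "yes", "no", "no"],
})
X_toy = pd.get_dummies(toy[["outlook"]])
y_toy = toy["play"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# export_text prints one line per split or leaf, with thresholds filled in.
rules = export_text(toy_tree, feature_names=list(X_toy.columns))
print(rules)
```

The same call on the fitted tennis tree, `export_text(dt, feature_names=list(X.columns))`, should list the actual thresholds for the rules above.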

To take decision trees further, we consider a noisier dataset that contains outliers: the RMS Titanic dataset. We split the data into 75% for training and 25% for testing so that we can evaluate how well the model fits unseen data.

In [19]:
filename = 'https://github.com/lmassaron/datasets/'
filename += 'releases/download/1.0/titanic.feather'
titanic = pd.read_feather(filename)

In [None]:
print(titanic.keys())  # column names of the DataFrame

In [21]:
from sklearn.model_selection import train_test_split
X = titanic.iloc[:,:-1]
y = titanic.iloc[:,-1]
(X_train, X_test,
y_train, y_test) = train_test_split(X, y, random_state=0, shuffle=True)
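
The 75/25 proportion comes from the default behaviour of `train_test_split` when no `test_size` is given. A small sketch on hypothetical stand-in arrays (not the Titanic data) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# With no test_size given, train_test_split holds out 25% of the rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_te))
```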

Decision trees, like logistic regression models, can overfit the data. To limit this, we set min_samples_split=5, so a node is split only if it contains at least 5 samples.

In [22]:
dt = DecisionTreeClassifier(min_samples_split=5)
dt.fit(X_train, y_train)
accuracy = dt.score(X_test, y_test)
print(f"test accuracy: {accuracy:0.3}")


test accuracy: 0.774
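
To see how `min_samples_split` limits tree growth, here is a rough sketch on synthetic noisy data. The data comes from `make_classification` and is an assumption for illustration only, not the Titanic file.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (an assumption; this is not the Titanic file).
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   flip_y=0.2, random_state=0)

leaves = {}
for min_split in (2, 5, 50):
    tree = DecisionTreeClassifier(min_samples_split=min_split, random_state=0)
    tree.fit(X_syn, y_syn)
    leaves[min_split] = tree.get_n_leaves()
    print(min_split, tree.get_n_leaves(), tree.get_depth())
```

Larger values stop the tree from splitting small nodes, so it ends up with fewer leaves and fewer rules tailored to individual samples.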


Unless a stopping criterion is specified, a decision tree keeps splitting until every leaf contains a single target value, which tends to fit noise in the training data. Pruning cuts the tree back to obtain a simpler model with a more compact set of rules, which often generalizes better.
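
For reference, the pruning used in the next cells is minimal cost-complexity pruning. As described in the scikit-learn documentation, it trades total leaf impurity against tree size. The symbols below follow that description and are not defined elsewhere in this notebook:

```latex
R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|
```

Here $R(T)$ is the total impurity of the leaves of tree $T$, $|\tilde{T}|$ is the number of leaves, and $\alpha \ge 0$ is the complexity parameter passed as `ccp_alpha`. A larger $\alpha$ penalizes leaves more heavily and yields a smaller tree.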

In [23]:
path = dt.cost_complexity_pruning_path(X_train, y_train)
# ccp_alphas: the effective alpha values at which subtrees get pruned (a larger alpha prunes more)
# impurities: the total leaf impurity of the pruned tree at each of those alphas
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [None]:
best_pruning = list()
for ccp_alpha in ccp_alphas:
    if ccp_alpha > 0:  # alpha = 0 means no pruning
        dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
        dt.fit(X_train, y_train)
        best_pruning.append([ccp_alpha, dt.score(X_test, y_test)])
# sort the candidates by test accuracy, best first
best_pruning = sorted(best_pruning, key=lambda x: x[1], reverse=True)
best_ccp_alpha = best_pruning[0][0]
# refit with the best alpha found
dt = DecisionTreeClassifier(random_state=0, ccp_alpha=best_ccp_alpha)
dt.fit(X_train, y_train)
accuracy = dt.score(X_test, y_test)
# accuracy has improved, but alpha was tuned on the test set, so this estimate is optimistic
print(f"test accuracy: {accuracy:0.3}")
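
The loop above picks alpha by its score on the test set, which lets information from the test data leak into the model choice. One common alternative, sketched here with `GridSearchCV` on synthetic stand-in data (an assumption; this is not the notebook's original method or the Titanic file), is to choose alpha by cross-validation on the training set only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic file).
X_syn, y_syn = make_classification(n_samples=400, n_features=8,
                                   flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Candidate alphas come from the pruning path of a tree grown on the training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
alphas = sorted(set(a for a in path.ccp_alphas if a > 0))

# 5-fold cross-validation on the training set picks alpha; the test set stays untouched.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("chosen ccp_alpha:", search.best_params_["ccp_alpha"])
print(f"test accuracy: {test_accuracy:0.3}")
```

Because the test set plays no part in choosing alpha, the final score is a fairer estimate of how the pruned tree behaves on new data.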