# Tree Based Methods: Decision Trees and Random Forests

In [None]:
!pip install -r requirements.txt

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix, log_loss  # plot_confusion_matrix needs scikit-learn < 1.2

## 1 - Load and Clean the Data

We'll be working with the Adult census dataset, which can be found [here](https://archive.ics.uci.edu/ml/datasets/adult). It contains US census information from 1994. The task is to predict whether or not an individual in the dataset earns more than $50k a year.

In [None]:
if not Path('adult-census.csv').exists():
    !wget https://s3-eu-west-1.amazonaws.com/faculty-client-teaching-materials/tree-based-methods/adult-census.csv

In [None]:
df = pd.read_csv("adult-census.csv")

In [None]:
df.head()

The first thing you'll notice is that the target column `salary` is not in a suitable state for prediction. Let's fix that. First of all, let's check the different values it has.

In [None]:
list(df["salary"].unique())

There are two issues with the above: (1) some entries have a trailing full stop and some do not, although both forms belong to the same class, and (2) we need to convert these into a binary 0/1 variable. Let's do that now.

### Ex 1.1: Write a function that converts an entry from the salary column into a binary variable.

N.B. Be careful of the whitespace!
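One way to see why the whitespace matters is to look at the `repr` of a label. The sketch below uses made-up labels (`" yes"` and `" yes."`, not values from the dataset) and a hypothetical helper `normalise_label` to show how stray whitespace and a trailing full stop could be removed before comparing. Matching the raw strings exactly, as in the skeleton below, works just as well.

```python
# Toy labels, NOT taken from the census data: note the leading space
# and the optional trailing full stop.
raw_labels = [" yes", " yes.", " no", " no."]


def normalise_label(label):
    """Hypothetical helper: strip surrounding whitespace, then any trailing full stop."""
    return label.strip().rstrip(".")


# repr() makes the leading whitespace visible
print([repr(label) for label in raw_labels])

# A naive comparison fails because of the leading space
naive_match = raw_labels[0] == "yes"
print(naive_match)

cleaned = [normalise_label(label) for label in raw_labels]
print(cleaned)
```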

In [None]:
def convert_salary(salary):
    """
    salary: str
        This should be an entry from df["salary"]
    """
    
    if salary == " <=50K" or salary == " <=50K.":
        output = 0
    elif # your code goes here
        # and here
    else:
        raise ValueError(f"Invalid input {salary}")
    return output

### Ex 1.2: Convert the `salary` column to a binary variable using your function above and `.map`

In [None]:
df["salary"] = # your code here

Now we've converted the `salary` column to 1s and 0s we can check how imbalanced the classes are.

In [None]:
df["salary"].value_counts() / len(df)

So about 24% of people have a salary above $50k.

The final thing to notice is that there are quite a few categorical columns. An easy way to see this is via the following line of code.

In [None]:
df.dtypes

The `object` type columns are the categorical ones. We'll have to do something about these at some point, but let's leave them for now. We can also easily count how many distinct values each column has.

In [None]:
df.nunique()

As a final step for this section, we'll split the target off from the rest of the dataframe.

In [None]:
X = df.drop("salary", axis=1).copy()  # this stops X being a reference to df
y = df["salary"]

## 2 - Decision Trees

We'll now try fitting some simple models to this data.

We need to convert our categorical columns into something a model will understand. We'll do this using one-hot encoding. The idea here is that a categorical column with k different values will become k (or k-1) different columns with binary values. For example:

In [None]:
pd.get_dummies(X[["sex"]]).head()

You might notice that the two columns above are redundant: one column is enough to describe this feature. To get that, we pass the optional argument `drop_first=True`. Let's now convert the whole dataframe.

In [None]:
X = pd.get_dummies(X, drop_first=True)

In [None]:
X.head()

Numerical columns have been kept the same but the categorical columns have been converted. Note the naming convention `<original-column-name>_<original-column-value>`.
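To make the k versus k-1 columns concrete, here is a small sketch on a made-up dataframe (the `colour` column and its values are invented for illustration). With three distinct values, `get_dummies` produces three columns by default and two with `drop_first=True`. The dropped category is the first one in sorted order, and it is represented by zeros in all of the remaining columns.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

full = pd.get_dummies(toy)                      # k = 3 columns
reduced = pd.get_dummies(toy, drop_first=True)  # k - 1 = 2 columns

print(list(full.columns))
print(list(reduced.columns))

# The "blue" row (index 2) has no column of its own in `reduced`
print(reduced.iloc[2])
```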

We'll compare a decision tree to logistic regression. First of all we should split the data into training, validation and test sets.

In [None]:
# we set a random seed here to end up with the same train, validation and test sets
X_train, X_no_train, y_train, y_no_train = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_no_train, y_no_train, test_size=0.5, random_state=42
)

In [None]:
print(len(y_train), len(y_val), len(y_test))
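The two calls above give roughly a 60/20/20 split: the first holds out 40% of the rows, and the second divides that held-out portion in half. A minimal sketch on 100 made-up rows (the arrays `X_toy` and `y_toy` are placeholders, not the census data) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows with 2 features each
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# Step 1: keep 60% for training, hold out 40%
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=42
)
# Step 2: split the held-out 40% evenly into validation and test
X_v, X_te, y_v, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42
)

sizes = (len(y_tr), len(y_v), len(y_te))
print(sizes)
```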

Let's now fit Logistic Regression to the training set.

In [None]:
logistic = LogisticRegression()
logistic.fit(X_train, y_train)

Let's compare the accuracies across the training and testing sets.

In [None]:
print(logistic.score(X_train, y_train))
print(logistic.score(X_test, y_test))

The accuracy is similar across the two sets, but isn't particularly great; the baseline would be about 76% if we labelled everyone as earning < $50k.
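The baseline mentioned above is the accuracy of always predicting the majority class. A small sketch with made-up labels (24 positives out of 100, chosen to mirror the class balance seen earlier) shows how it can be computed.

```python
import numpy as np

# Made-up labels: 24% ones, 76% zeros
y_toy = np.array([1] * 24 + [0] * 76)

# np.bincount counts the occurrences of 0 and 1
class_fractions = np.bincount(y_toy) / len(y_toy)

# Always predicting the most common class achieves this accuracy
baseline_accuracy = class_fractions.max()
print(baseline_accuracy)
```

Any model worth keeping should beat this number on the test set.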

### Ex 2.1: Fit a Decision Tree to the same training sets and compute the training and testing accuracies.

In [None]:
# your code here

In [None]:
# you can use multiple cells if you want!

Your accuracy should have got a bit better, but your model is clearly overfitting a lot. Let's try and stop that.

A useful parameter for Decision Trees is `min_samples_leaf`, which sets the minimum number of training points that must end up in each leaf: a split is only considered if it leaves at least that many points in both branches. By default it is set to 1, so let's try increasing it a bit.
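As a rough illustration of the effect, the sketch below fits trees with different `min_samples_leaf` values on synthetic data from `make_classification` (not the census data; the sample size and the values tried are arbitrary) and records the number of leaves. Larger values force each leaf to cover more points, so the tree typically ends up much smaller.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

leaf_counts = {}
for min_leaf in [1, 10, 50]:
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X_toy, y_toy)
    leaf_counts[min_leaf] = tree.get_n_leaves()

# Maps each min_samples_leaf value to the number of leaves in the fitted tree
print(leaf_counts)
```

With 500 points and `min_samples_leaf=50`, the tree can have at most 10 leaves, which limits how closely it can fit noise in the training set.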

### Ex 2.2: Fit a Decision Tree with a range of leaf sizes to see if you can get a better accuracy.

You should fit this hyperparameter using the **validation set**, and then check your final performance on the test set.

In [None]:
# your code here

One of the reasons the Decision Tree is so much better is the categorical features. To see this, let's compare performance using a simpler set of features.

In [None]:
simple_features = ["age", "education-num", "hours-per-week", "sex_ Male"]

### Ex 2.3: Compare a Logistic Regression and a Decision Tree trained only on the `simple_features`.

You probably don't want to use the default `min_samples_leaf` for the Decision Tree here either.

In [None]:
# your simple logistic regression goes here

In [None]:
# your simple decision tree goes here

The accuracies are now quite similar. The issue Logistic Regression has is that it doesn't seem to be able to make use of the more complicated features.

Back to using all of the features - whilst we've improved our accuracy, that isn't everything. Let's try plotting a confusion matrix for logistic regression.

In [None]:
plot_confusion_matrix(
    logistic, X_test, y_test, cmap=plt.cm.Blues, normalize="true"
)
plt.show()

### Ex 2.4: Plot the confusion matrix for your best Decision Tree Classifier

In [None]:
# your code goes here

Confusion matrices are a much more nuanced way of measuring the performance of a classifier. Another metric that can be useful is the cross-entropy, defined as
$$C = - \sum_{i=1}^n [y_i\log(p_i) + (1-y_i)\log(1-p_i)]$$

Here $y_i$ is the true label (0 or 1) of point $i$ and $p_i$ is the predicted probability that it belongs to class 1. Often we'll take the mean over the points rather than the sum. Cross-entropy is useful when we really care about evaluating the accuracy of the *probabilities* output by the model, as opposed to just its class predictions. An sklearn classifier can output probabilities via its `predict_proba` method, shown below.

In [None]:
logistic.predict_proba(X_test)

This is a 2-D array, with each column giving the predicted probabilities for one class (so each row sums to 1). The order in this case is 0, 1, which can be seen by looking at the `classes_` attribute.

In [None]:
logistic.classes_

We can use the `log_loss` function from sklearn to calculate the (mean) cross-entropy.

In [None]:
log_loss(y_test, logistic.predict_proba(X_test)[:, 1])
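To connect `log_loss` back to the formula above, here is a minimal sketch that computes the mean cross-entropy by hand with NumPy. The labels and probabilities are made up, and `mean_cross_entropy` is a hypothetical helper name. The clipping constant `eps` is an assumption added here to avoid taking `log(0)`.

```python
import numpy as np


def mean_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds the probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never sees exactly 0 or 1 (assumed safeguard)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    per_point = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return per_point.mean()


# Made-up labels and predicted probabilities of class 1
y_toy = [1, 0, 1, 0]
p_toy = [0.9, 0.1, 0.5, 0.5]

toy_loss = mean_cross_entropy(y_toy, p_toy)
print(toy_loss)
```

Confident, correct probabilities are rewarded with a low value, while confident mistakes are penalised heavily.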

### Ex 2.5: Write a function that, given a trained sklearn model and test data, will output the model's confusion matrix and print its cross-entropy. Use it on your best tree model.

In [None]:
def evaluate_model(model, X, y):
    """
    model: sklearn classifier
        A trained sklearn classifier.
    X: array-like
        The features for the model to use to make predictions.
    y: array-like
        The target the model is trying to predict.
    """

    # your code goes here

In [None]:
evaluate_model(
    # your best tree model,
    X_test,
    y_test,
)

## 3 - Random Forests

We could also use a random forest for this problem. Let's have a go at training one.

### Ex 3.1: Train a random forest on your data (`RandomForestClassifier`) and calculate its accuracy on the test set.

Note that since Random Forests take random samples of the data and the features they can give slightly different results if you run them multiple times. Try setting the `random_state` argument to prevent this from happening.
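A quick sketch of what `random_state` buys you, using synthetic data from `make_classification` rather than the census data (the sizes and seed values are arbitrary): two forests built with the same seed give identical predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Two forests with the same seed see the same bootstrap samples and feature subsets
forest_a = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_toy, y_toy)
forest_b = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_toy, y_toy)

same = np.array_equal(
    forest_a.predict_proba(X_toy), forest_b.predict_proba(X_toy)
)
print(same)
```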

In [None]:
# your code here

This is probably better than the default decision tree, but maybe not better than your tuned tree. Let's now try tuning it.
There are two parameters you might want to play with: `n_estimators` and `min_samples_leaf`. The latter does the same as before. The former is the number of trees in the forest. You generally want this to be as high as possible, but the higher it is, the longer you'll have to wait, and the benefit of each extra tree becomes more and more marginal.

### Ex 3.2: Tune your Random Forest using the two hyperparameters above to see what accuracy you can get.

Again, remember to use the validation set to tune.

In [None]:
# your code goes here

It probably shows a marginal improvement over the decision tree, though not that much.

### Ex 3.3: Finally, use the `evaluate_model` function to compare your best Random Forest to your best Decision Tree.

In [None]:
evaluate_model(
    # your best random forest,
    X_test,
    y_test,
)

In [None]:
evaluate_model(
    # your best decision tree,
    X_test,
    y_test,
)

You may find that the random forest gives much better probability estimates.

## Summary

* Decision trees can be a powerful approach to use in ML, particularly when dealing with lots of categorical features.
* However they do require tuning.
* Random Forests can offer further improvement, though the gain is often marginal.

## 4 - Further Work: LightGBM [Optional]

If you want, try seeing how LightGBM performs compared to the random forest on this problem. The documentation can be found [here](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier).

Below we use the default `LGBMClassifier` to try and tackle our problem.

In [None]:
!pip install lightgbm

In [None]:
from lightgbm import LGBMClassifier

In [None]:
lgbm = LGBMClassifier()
# like neural networks, Gradient Boosted Trees can actually
# use their validation set during training, in this case
# by passing it to the `eval_set` parameter as a list of (X, y) pairs
lgbm.fit(X_train, y_train, eval_set=[(X_val, y_val)])
lgbm.score(X_val, y_val)

In [None]:
lgbm.score(X_test, y_test)

In [None]:
evaluate_model(lgbm, X_test, y_test)

It seems even better than the Random Forest! Try tuning it using the validation set and its hyperparameters (see the documentation) to see how much better you can make it.