# Tree models

So far all the models we have looked at have been variations of a linear model. Specifically they have mostly been variations of a logistic regression model, but the theory is very similar for using general linear models for predicting continuous values. 

Linear models are great - they have well established methods for finding the best parameter values and it's really easy to see from the results which variables are contributing to the prediction. With methods like Lasso, Ridge and ElasticNet, we also have ways of trying to combat overfitting. But they suffer from a major problem - they can only find linear decision boundaries! 

What do I mean by that? Well, consider a logistic regression where $\beta_1=0.5, \beta_2=0.25$, 

so that $z=0.5 \times x_1 + 0.25 \times x_2$ and $y=\frac{1}{1 + e^{-z}}$. 

Here $x_1$ might be tumour volume and $x_2$ tumour curvature, and $y$ might be benign or malignant. 

The plot below shows this, where the background colour corresponds to the value of $y$ predicted for the corresponding values of $x_1$ and $x_2$. Red and blue points show the true classes of the points we are attempting to classify:



![](linear_data_linear_fit.png)

You can see a straight line where the prediction goes from blue (y=0) to red (y=1). This is the decision boundary. But what if the decision boundary is not straight? What if it looks like:

![](curved_data_linear_boundary.png)

Here a logistic model does very badly at classifying things as benign or malignant, because it can only use straight lines! This leads to a large collection of red points falling on the blue side of the line. 
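
To see why the boundary of the logistic model above has to be straight, note that $y = 0.5$ exactly when $z = 0$, and $z = 0$ describes a line in $x_1$ and $x_2$. The sketch below checks this numerically for the toy coefficients from the text; the function name `predict_proba` and the sample points are assumptions made for illustration.

```python
import numpy as np

# Toy coefficients from the text (no intercept term)
beta_1, beta_2 = 0.5, 0.25

def predict_proba(x1, x2):
    """Logistic model from the text: z is linear in x1 and x2."""
    z = beta_1 * x1 + beta_2 * x2
    return 1.0 / (1.0 + np.exp(-z))

# Every point on the straight line x2 = -2 * x1 gives z = 0, so y = 0.5
x1 = np.linspace(-3, 3, 7)
on_boundary = predict_proba(x1, -2 * x1)

# Shifting off the line moves y to one side of 0.5 or the other
above = predict_proba(x1, -2 * x1 + 1)  # z = 0.25 everywhere
below = predict_proba(x1, -2 * x1 - 1)  # z = -0.25 everywhere

print(on_boundary)
print(above)
print(below)
```

Whatever values the coefficients take, the set of points where the prediction flips is always a line of this kind, which is why no choice of $\beta_1$ and $\beta_2$ can follow a curved boundary.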

There are many different ways of dealing with this, but one very powerful, popular, and crucially easy way to do it is with decision tree-based methods.

Tree models not only solve the problem of non-linear decision boundaries: ensembles of trees are also fairly resistant to over-fitting, and some implementations (XGBoost, for example) can handle missing values without you having to do any imputation!

## Decision trees

A decision tree is basically a flow chart that says how to classify something. For example a decision tree might look like:

    if x["number_of_legs"] > 2:
        if x["has_whiskers"]:
            y = "cat"
        else:
            y = "dog"
    else:
        if x["is_black"]:
            y = "crow"
        else:
            y = "robin"

With enough levels in your tree you can build an approximation of the curved boundary above. You can also see how the meaning of one feature can be made to be dependent on the meaning of another (which is really the same thing as a decision boundary that isn't a straight line).
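
As a rough illustration of this, the sketch below fits a logistic regression and a shallow decision tree to synthetic data where one class forms a ring around the other. The data set (`make_circles`), the depth of 4 and the feature names `x1` and `x2` are assumptions chosen for the example, not part of the tumour data above.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data (an assumption for illustration): an inner blob surrounded by a ring
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# A linear model can only draw one straight line through the plane
logit = LogisticRegression().fit(X_train, y_train)

# A tree of depth 4 can box in the inner class with a few axis-aligned cuts
tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)

logit_acc = logit.score(X_test, y_test)
tree_acc = tree.score(X_test, y_test)
print(f"Logistic regression test accuracy: {logit_acc:.3f}")
print(f"Decision tree test accuracy: {tree_acc:.3f}")

# The fitted tree printed as nested if/else style rules
print(export_text(tree, feature_names=["x1", "x2"]))
```

The printed rules have the same shape as the hand-written flow chart above, except that the thresholds were learnt from the data.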

We'll look at two tree models: Random Forests and extreme gradient boosted trees (XGBoost).

### Random Forests

In a random forest you take a random subset of your features, and a random subset of your training examples, and make the best tree you can out of them. You then repeat this for many different subsets of training examples and features (hence a forest). When you classify an example, each tree gets a vote, and the class with the most votes wins. 
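
The idea can be sketched by hand with ordinary decision trees. This is a simplified illustration under assumptions, not how `sklearn` implements it: the number of trees (25), the bootstrap sampling and the hard majority vote are choices made for the example (`RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

rng = np.random.default_rng(1)
n_trees = 25  # odd, so a two-class vote can never tie
trees = []
for i in range(n_trees):
    # Random subset of training examples: sample rows with replacement
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random subset of features: only sqrt(n_features) are tried at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[rows], y_tr[rows])
    trees.append(tree)

# Each tree votes 0 or 1 for every test example; the majority wins
votes = np.array([t.predict(X_te) for t in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
forest_accuracy = (majority == y_te).mean()
print(f"Hand-made forest test accuracy: {forest_accuracy:.3f}")
```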

Training a random forest is easy:

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier

breast_cancer = load_breast_cancer()
X, Y = breast_cancer["data"], breast_cancer["target"]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

RF_model = RandomForestClassifier(random_state=1)
RF_model = RF_model.fit(X_train, Y_train)

y_train_pred = RF_model.predict(X_train)
y_test_pred = RF_model.predict(X_test)

print(f"Training accuracy: {accuracy_score(Y_train, y_train_pred)}, F1: {f1_score(Y_train, y_train_pred):.3f}")
print(f"Test accuracy: {accuracy_score(Y_test, y_test_pred):.3f}, F1: {f1_score(Y_test, y_test_pred):.3f}")

Training accuracy: 1.0, F1: 1.000
Test accuracy: 0.956, F1: 0.966


So with no tuning at all, we have already gotten perfect scores on the training data, and 95.6% accuracy on the test data. 

Like Lasso and Ridge Regression, there are parameters that need to be tuned for a Random Forest. In fact, Random Forests have many parameters to tune, but two are particularly important. The first is `n_estimators` and the second is `max_features`. 

`n_estimators` says how many trees to use. Generally the more the better, but the more trees you use, the slower and more memory intensive your model will be. Adding trees rarely causes over-fitting in a random forest, but the improvement in accuracy tends to level off after a while. 

`max_features` says how many features are considered when choosing each split in a tree. By default this is $\sqrt{\text{total features}} \approx 5$ in the case of the breast cancer data set, which has 30 features. Using more features lets each tree fit the training data more closely, but it also makes the trees more alike and increases the probability of over-fitting. 
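
The short sketch below shows where the figure of 5 comes from and how both parameters are passed to the model. The particular values (200 trees, 5 features) are picked only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]                   # 30 columns in this data set
sqrt_features = int(np.sqrt(n_features))  # sqrt(30) is about 5.48, truncated to 5

# Set both tuning parameters explicitly instead of relying on the defaults
model = RandomForestClassifier(
    n_estimators=200,
    max_features=sqrt_features,
    random_state=1,
)
model.fit(X, y)

# The fitted forest keeps its individual trees in `estimators_`
print(n_features, sqrt_features, len(model.estimators_))
```

`max_features` also accepts a fraction between 0 and 1, in which case it is read as a proportion of the total number of features.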

Again, we can use cross validation to help us pick the best values. 

Unlike `LogisticRegression`, which has `LogisticRegressionCV`, there is no `RandomForestCV`. However, `sklearn` does provide the general purpose `GridSearchCV` and `RandomizedSearchCV`, which can be used with any model!

In [2]:
from sklearn.model_selection import GridSearchCV
import pandas as pd

RF_model = RandomForestClassifier(random_state=1)
params = {
    "n_estimators": [50, 100, 150, 200, 400],
    "max_features": [0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
}

rf_cv = GridSearchCV(RF_model, params, return_train_score=True)
rf_cv = rf_cv.fit(X_train, Y_train)


The `cv_results_` attribute of the grid search contains the results for every combination of parameter values, but it is a very large table with more columns than we need. We can make it more understandable by selecting just a few columns:

In [3]:
pd.DataFrame(rf_cv.cv_results_)[["param_max_features",
                                 "param_n_estimators",
                                 "mean_train_score",
                                 "mean_test_score"]]

| | param_max_features | param_n_estimators | mean_train_score | mean_test_score |
|---|---|---|---|---|
| 0 | 0.05 | 50 | 0.999451 | 0.942857 |
| 1 | 0.05 | 100 | 1.0 | 0.947253 |
| 2 | 0.05 | 150 | 1.0 | 0.945055 |
| 3 | 0.05 | 200 | 1.0 | 0.947253 |
| 4 | 0.05 | 400 | 1.0 | 0.953846 |
| 5 | 0.1 | 50 | 0.998901 | 0.945055 |
| 6 | 0.1 | 100 | 1.0 | 0.947253 |
| 7 | 0.1 | 150 | 1.0 | 0.956044 |
| 8 | 0.1 | 200 | 1.0 | 0.951648 |
| 9 | 0.1 | 400 | 1.0 | 0.951648 |


You'll note that in general the more estimators the better, and actually in this case, the fewer features used per tree the better! We can look at what the best parameters are:

In [4]:
rf_cv.best_params_

{'max_features': 0.1, 'n_estimators': 150}

The absolute best score doesn't come from the most estimators, but if you look back at the table, you'll see that the scores for `max_features = 0.1` with 200 and 400 estimators are not much worse, so the difference is probably just random noise. Still, using a smaller number of trees saves us time, and I think we can at least say that using more doesn't give us better results. 
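
`RandomizedSearchCV`, mentioned earlier, works the same way but samples a fixed number of parameter combinations instead of trying every one. The sketch below is one possible setup: the distributions, `n_iter=5` and `cv=3` are assumptions chosen to keep the example quick, not tuned recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Distributions to sample from, rather than fixed lists of values
param_distributions = {
    "n_estimators": randint(50, 201),     # integers from 50 to 200 inclusive
    "max_features": uniform(0.05, 0.45),  # fractions between 0.05 and 0.5
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions,
    n_iter=5,        # number of random combinations to try
    cv=3,            # 3-fold cross validation on the training data
    random_state=1,
)
search.fit(X_train, y_train)

print(search.best_params_, f"{search.best_score_:.3f}")

# By default the best combination is refitted on all the training data,
# so the search object can be scored on the held-out test set directly
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```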

### (eXtreme) Gradient Boosting Trees

An alternative type of tree model to Random Forests is a gradient boosting tree. In a gradient boosting tree model, instead of learning a bunch of independent random trees, a single tree is learnt, and the difference between its predictions and the truth is calculated (the loss, see last week). A second tree is then learnt specifically to correct the errors of the first. A third tree is then learnt to correct the errors that remain after combining the first two, and so on. The final prediction combines the contributions of all the trees.
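
The loop below is a minimal sketch of that idea for a regression problem with squared-error loss, where "correcting the errors" simply means fitting each new tree to the current residuals. The synthetic data, tree depth, learning rate of 0.5 and 20 rounds are all assumptions for illustration; real implementations add many refinements on top of this.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())  # start from a constant guess
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # What the model so far still gets wrong
    residual = y - prediction
    # Learn a small tree that predicts those remaining errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Add a fraction of the correction to the running prediction
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {errors[0]:.4f}")
print(f"Training MSE after 20 trees: {errors[-1]:.4f}")
```

The training error shrinks as trees are added, which is also why boosting can over-fit if it is allowed to run for too many rounds.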

A very popular implementation of boosting trees is eXtreme gradient boosting trees (XGBoost). This adds in regularisation (like the Ridge and Lasso regularisation we saw last week), and a whole bunch of computer science tricks to make the thing run efficiently on large datasets. 

XGBoost is pretty much at the forefront of models designed to work with what is called "structured" data - that is, data that comes in tables, rather than things like images, free text or sound files. 

However, thanks to its sklearn-style interface, using it is not much more difficult than any other model:

In [7]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier()
xgb_model.fit(X_train, Y_train)

y_train_pred1 = xgb_model.predict(X_train)
y_test_pred1 = xgb_model.predict(X_test)

print(f"Training accuracy: {accuracy_score(Y_train, y_train_pred1)}, F1: {f1_score(Y_train, y_train_pred1):.3f}")
print(f"Test accuracy: {accuracy_score(Y_test, y_test_pred1):.3f}, F1: {f1_score(Y_test, y_test_pred1):.3f}")


Training accuracy: 1.0, F1: 1.000
Test accuracy: 0.956, F1: 0.966


XGBoost has [pages and pages](https://xgboost.readthedocs.io/en/release_3.2.0/parameter.html#parameters-for-tree-booster) of parameters that you can use for tuning. The manual recommends having a look at `n_estimators`, `max_depth`, `min_child_weight`, `gamma` and `eta`. 

One interesting thing you can do with XGBoost is called "early stopping". Instead of only fixing the number of rounds to run for, you also provide an evaluation set (here the test data), and training stops once the performance on that set has not improved for `early_stopping_rounds` rounds in a row. In particular, this means training will stop if the performance on the evaluation set starts getting worse (i.e. you start over-fitting). Because the evaluation set influences when training stops, this is best done with a validation set separate from your test set. 

In [11]:
xgb_model = XGBClassifier(early_stopping_rounds=10)
xgb_model.fit(X_train, Y_train, eval_set=[(X_test, Y_test)])

y_train_pred1 = xgb_model.predict(X_train)
y_test_pred1 = xgb_model.predict(X_test)
print(f"Training accuracy: {accuracy_score(Y_train, y_train_pred1)}, F1: {f1_score(Y_train, y_train_pred1):.3f}")
print(f"Test accuracy: {accuracy_score(Y_test, y_test_pred1):.3f}, F1: {f1_score(Y_test, y_test_pred1):.3f}")


[0]	validation_0-logloss:0.45819
[1]	validation_0-logloss:0.36825
[2]	validation_0-logloss:0.29957
[3]	validation_0-logloss:0.26435
[4]	validation_0-logloss:0.23181
[5]	validation_0-logloss:0.20999
[6]	validation_0-logloss:0.19081
[7]	validation_0-logloss:0.17827
[8]	validation_0-logloss:0.16918
[9]	validation_0-logloss:0.16826
[10]	validation_0-logloss:0.16249
[11]	validation_0-logloss:0.15697
[12]	validation_0-logloss:0.15583
[13]	validation_0-logloss:0.15824
[14]	validation_0-logloss:0.16045
[15]	validation_0-logloss:0.15394
[16]	validation_0-logloss:0.15391
[17]	validation_0-logloss:0.15644
[18]	validation_0-logloss:0.15714
[19]	validation_0-logloss:0.15831
[20]	validation_0-logloss:0.15838
[21]	validation_0-logloss:0.15724
[22]	validation_0-logloss:0.15654
[23]	validation_0-logloss:0.15829
[24]	validation_0-logloss:0.16084
[25]	validation_0-logloss:0.15986
[26]	validation_0-logloss:0.16073
Training accuracy: 0.9978021978021978, F1: 0.998
Test accuracy: 0.956, F1: 0.966


## Summary

We have now covered how to use LogisticRegression, Ridge, Lasso, ElasticNet, RandomForest and XGBoost. We've covered the ideas of test, train and validation sets, and cross-validation. We've talked about data preprocessing, hyperparameter tuning and grid search. You are now pretty much ready to go out there and build some models!

The final thing to talk about is model interpretation, and that will come next week.
