In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from utils import plot_tree_boundaries

features = ['age','acutephysiologyscore']
outcome = 'actualhospitalmortality'

data = pd.read_csv('eicu_processed.csv')

x = data[features]
y = data[outcome]

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)

# LightGBM and xgboost

In all of our previous workbooks, we've focused on creating decision trees that use only two features - `age` and `acutephysiologyscore`. This made it easier to see how each technique produces a different model but, as we saw in Workbook 08, it doesn't necessarily give us the best-performing model.

Furthermore, so far we've focused on using the `sklearn` library to train our models. Whilst this library is extremely useful - and in many cases is able to create high-performing models - it is always helpful to have other tools in our toolbox. [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html) is a framework for creating gradient boosted decision trees with a Python API. [xgboost](https://xgboost.readthedocs.io/en/stable/) is a similar library. Although they can both be used for the same end goal, they each have their own pros and cons.

**Question:** What are the differences between LightGBM and xgboost? When might you want to use one over the other?

This is an open-ended workbook - the purpose is for you to choose either LightGBM or xgboost and use it to create the best-performing mortality prediction model you can on our dataset. Be sure to use your chosen library's developer documentation to help you, and feel free to use any other resources that you wish. If you want to go the extra mile, look at methods for [data imputation](https://scikit-learn.org/stable/modules/impute.html), or techniques to handle [imbalanced data](https://imbalanced-learn.org/stable/). Use the techniques from Workbook 08 to evaluate your final models, and remember what we learned in Workbook 03 - we don't want our models to overfit!

**Tip:** If you want to install a new Python package, you can do so in a Jupyter Notebook code cell with the following command: `!pip install package-name`.

In [9]:
pip install xgboost

Defaulting to user installation because normal site-packages is not writeable
Collecting xgboost
  Downloading xgboost-2.0.3-py3-none-macosx_12_0_arm64.whl.metadata (2.0 kB)
Downloading xgboost-2.0.3-py3-none-macosx_12_0_arm64.whl (1.9 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.3
Note: you may need to restart the kernel to use updated packages.


In [13]:
from xgboost import XGBClassifier

# create model instance
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='binary:logistic')
# fit model
bst.fit(x_train, y_train)
# make predictions
preds = bst.predict(x_test)
preds

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1])

## Extra Ideas

As you may have noticed, decision trees have a lot of hyperparameters (e.g. `max_depth`) that we can change. One of the advantages of decision trees is that, quite often, these don't need any tuning and work well straight out of the box. However, it is possible to perform hyperparameter tuning to find the best model possible. You could take a look at libraries such as [optuna](https://optuna.readthedocs.io/en/stable/) that do this for you. Just be careful you don't overfit!

Another advantage of decision trees is that they're highly interpretable - i.e., we can see which features are contributing most to the model's output. You could use a library such as [shap](https://github.com/shap/shap) to investigate which features are most important for your models.