# NCAR EdEc Bootcamp, Lesson 4

Supervised learning using decision trees

## Python Imports

In [2]:
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt

# Machine Learning
import sklearn
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import catboost as cat

In [3]:
print('scikit-learn version:', sklearn.__version__)
print('XGBoost version:', xgb.__version__)
print('Catboost version:', cat.__version__)

scikit-learn version: 1.3.0
XGBoost version: 2.0.0
Catboost version: 1.2.1.1.16484


## Dataset Loading

We will be using this open source dataset to go through some ML workflow tasks:

https://essd.copernicus.org/articles/13/3013/2021/

In [4]:
df = pd.read_csv('http://hdl.handle.net/11304/89dd440e-4e10-496e-b476-1ccf0ebeb4f3')
df.head(3)

Unnamed: 0,id,country,htap_region,climatic_zone,lon,lat,alt,relative_alt,type,type_of_area,...,o3_perc90,o3_perc98,o3_dma8eu,o3_avgdma8epax,o3_drmdmax1h,o3_w90,o3_aot40,o3_nvgt070,o3_nvgt100,dataset
0,3336,Germany,EUR,cool_moist,8.30821,54.92497,12.0,3,background,rural,...,46.4399,54.8468,53.5738,38.8078,50.7704,86.1266,10197.4742,2.0,0.0,test
1,3338,Germany,EUR,cool_moist,12.72528,54.43667,1.0,1,background,rural,...,44.0575,53.7778,51.3996,35.8313,48.3935,69.0987,7573.2222,1.0,0.0,train
2,3339,Germany,EUR,cool_moist,6.093923,50.754704,205.0,66,background,urban,...,41.1803,58.4009,54.903,32.6169,49.8276,154.1263,8655.473,5.4,1.0,train


In [5]:
missing_values = df.isna().sum()
print(missing_values)

id                                         0
country                                    0
htap_region                                0
climatic_zone                              0
lon                                        0
lat                                        0
alt                                        0
relative_alt                               0
type                                       0
type_of_area                               0
water_25km                                 0
evergreen_needleleaf_forest_25km           0
evergreen_broadleaf_forest_25km            0
deciduous_needleleaf_forest_25km           0
deciduous_broadleaf_forest_25km            0
mixed_forest_25km                          0
closed_shrublands_25km                     0
open_shrublands_25km                       0
woody_savannas_25km                        0
savannas_25km                              0
grasslands_25km                            0
permanent_wetlands_25km                    0
croplands_

In [6]:
df.describe()

Unnamed: 0,id,lon,lat,alt,relative_alt,water_25km,evergreen_needleleaf_forest_25km,evergreen_broadleaf_forest_25km,deciduous_needleleaf_forest_25km,deciduous_broadleaf_forest_25km,...,o3_perc75,o3_perc90,o3_perc98,o3_dma8eu,o3_avgdma8epax,o3_drmdmax1h,o3_w90,o3_aot40,o3_nvgt070,o3_nvgt100
count,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,...,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0
mean,11481.604447,13.47765,39.302449,264.63632,49.784113,12.667886,2.884884,0.511655,0.002241,2.676529,...,-172.774709,-164.79812,-153.907786,-133.845728,-152.661457,48.191788,-63.851527,12874.239386,-176.564118,-193.487111
std,4041.144957,88.07972,13.233924,466.298427,107.169033,19.386709,9.199527,4.170122,0.080375,9.027644,...,416.565376,420.595277,426.124266,407.712741,401.688016,79.985568,471.595949,11506.274694,390.45106,396.3882
min,3336.0,-170.564,-89.996,-4.0,-136.0,0.0,0.0,0.0,0.0,0.0,...,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
25%,8252.0,-76.003333,35.4111,20.0,8.0,0.0,0.0,0.0,0.0,0.0,...,27.5682,36.0893,45.4875,45.3622,29.1433,49.3815,30.0541,2509.5722,0.0,0.0
50%,11732.0,7.478586,39.834461,90.0,21.0,0.0,0.0,0.0,0.0,0.0,...,35.3649,45.3622,58.3561,56.2694,36.3412,54.2553,105.0152,11957.588,2.5,0.0
75%,15431.0,127.11514,45.836945,287.0,47.0,23.4,1.6,0.0,0.0,0.0,...,40.0,50.75,65.12,62.6889,41.2555,58.6417,194.5853,20378.6112,9.3333,0.5
max,17722.0,174.87,82.45083,5500.0,1826.0,100.0,96.4,99.5,4.8,94.5,...,61.64,74.872,115.286,108.0775,71.1861,102.1522,738.4784,72430.8713,179.5,93.0


In [7]:
df.replace(-999.0, np.nan, inplace=True)

In [8]:
df.describe()

Unnamed: 0,id,lon,lat,alt,relative_alt,water_25km,evergreen_needleleaf_forest_25km,evergreen_broadleaf_forest_25km,deciduous_needleleaf_forest_25km,deciduous_broadleaf_forest_25km,...,o3_perc75,o3_perc90,o3_perc98,o3_dma8eu,o3_avgdma8epax,o3_drmdmax1h,o3_w90,o3_aot40,o3_nvgt070,o3_nvgt100
count,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,5577.0,...,4447.0,4447.0,4447.0,4564.0,4552.0,5545.0,4490.0,4461.0,4552.0,4490.0
mean,11481.604447,13.47765,39.302449,264.63632,49.784113,12.667886,2.884884,0.511655,0.002241,2.676529,...,37.172352,47.175823,60.833433,58.179091,37.913457,54.235095,162.541878,16344.881654,8.628496,1.522356
std,4041.144957,88.07972,13.233924,466.298427,107.169033,19.386709,9.199527,4.170122,0.080375,9.027644,...,6.338292,7.048452,9.582222,9.168651,6.401068,8.273298,115.034771,10262.296344,11.904406,3.986506
min,3336.0,-170.564,-89.996,-4.0,-136.0,0.0,0.0,0.0,0.0,0.0,...,11.5285,16.3264,22.1391,20.3672,10.7048,16.5876,1.3458,0.0,0.0,0.0
25%,8252.0,-76.003333,35.4111,20.0,8.0,0.0,0.0,0.0,0.0,0.0,...,33.3972,42.8046,55.54815,53.021775,33.684925,49.465,77.992025,8943.1948,1.0,0.0
50%,11732.0,7.478586,39.834461,90.0,21.0,0.0,0.0,0.0,0.0,0.0,...,37.2588,47.6,60.8,58.425,38.279,54.2892,131.5306,14984.7905,4.25,0.0
75%,15431.0,127.11514,45.836945,287.0,47.0,23.4,1.6,0.0,0.0,0.0,...,41.0,52.0,66.8,64.0758,42.13205,58.6675,224.705125,22369.8842,12.0,1.0
max,17722.0,174.87,82.45083,5500.0,1826.0,100.0,96.4,99.5,4.8,94.5,...,61.64,74.872,115.286,108.0775,71.1861,102.1522,738.4784,72430.8713,179.5,93.0


In [9]:
missing_values2 = df.isna().sum()
print(missing_values2)

id                                            0
country                                       0
htap_region                                   0
climatic_zone                                 0
lon                                           0
lat                                           0
alt                                           0
relative_alt                                  0
type                                          0
type_of_area                                  0
water_25km                                    0
evergreen_needleleaf_forest_25km              0
evergreen_broadleaf_forest_25km               0
deciduous_needleleaf_forest_25km              0
deciduous_broadleaf_forest_25km               0
mixed_forest_25km                             0
closed_shrublands_25km                        0
open_shrublands_25km                          0
woody_savannas_25km                           0
savannas_25km                                 0
grasslands_25km                         

In [10]:
df.dropna(axis=1, inplace=True)
df.shape

(5577, 39)
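
The cleaning steps above can be sketched on a toy frame to show their effect. This is a minimal illustration, not the notebook's data: the frame `toy` and the column names `o3_a` and `o3_b` are made up. It shows how the `-999.0` sentinel distorts a mean, and that `dropna(axis=1)` removes every column containing at least one missing value, whereas `axis=0` would remove rows instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: -999.0 marks a missing measurement, as in the dataset above.
toy = pd.DataFrame({
    "alt": [12.0, 1.0, 205.0, 90.0],
    "o3_a": [46.4, -999.0, 41.2, 44.0],
    "o3_b": [54.8, 53.8, 58.4, 50.0],
})

# The sentinel drags the column mean far below any real value.
mean_with_sentinel = toy["o3_a"].mean()

# Replace the sentinel with NaN; pandas skips NaN in summary statistics by default.
cleaned = toy.replace(-999.0, np.nan)
mean_without_sentinel = cleaned["o3_a"].mean()

# axis=1 drops every COLUMN that contains at least one NaN; axis=0 drops ROWS instead.
cols_dropped = cleaned.dropna(axis=1)
rows_dropped = cleaned.dropna(axis=0)

print(mean_with_sentinel, mean_without_sentinel)
print(cols_dropped.shape, rows_dropped.shape)
```

Dropping columns keeps every station but discards any variable with a gap, so it is worth checking which columns survive before choosing a prediction target.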

## Splitting Data

Note: For this exercise, we will just have a testing and a training dataset, not a validation dataset.

In [8]:
from sklearn.model_selection import train_test_split

#### Regression split

Let's predict body mass from bill size and flipper size

In [9]:
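# `std_scaled` is assumed to be a 2-D array of standardized values defined earlier (not shown here).
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    std_scaled[:, 0:3],  # features: columns 0-2
    std_scaled[:, 3],    # target: column 3
    test_size=0.33, random_state=42,  # hold out 33% of rows; fixed seed for reproducibility
)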
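
Because `std_scaled` is not defined in the cells shown, here is a self-contained sketch of the same split on a synthetic stand-in. The array `std_scaled_demo` is an assumption made for illustration only: random numbers with three feature columns and one target column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for `std_scaled`: 100 rows, 4 columns (an assumption, not the notebook's data).
rng = np.random.default_rng(0)
std_scaled_demo = rng.normal(size=(100, 4))

X_train, X_test, y_train, y_test = train_test_split(
    std_scaled_demo[:, 0:3],  # features: first three columns
    std_scaled_demo[:, 3],    # target: fourth column
    test_size=0.33,           # fraction of rows held out for testing
    random_state=42,          # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Fixing `random_state` means the same rows land in the test set on every run, which keeps later model comparisons fair.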

# Theory

## Decision Trees

A decision tree is a machine learning model that uses a tree-like structure to make predictions. The tree is constructed by recursively partitioning the data into smaller and smaller subsets, based on the values of the features. Each node in the tree represents a question about a feature, and the branches represent the different possible answers. The leaf nodes of the tree represent the predictions.

To make a prediction, the model starts at the root node of the tree and asks the question associated with that node. Based on the answer, the model follows the corresponding branch to the next node. The process continues until the model reaches a leaf node, which contains the prediction.
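
The traversal described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data, where the target is a simple step function of the first feature. The helper `walk` and the feature names `x0` and `x1` are hypothetical, introduced here only to mirror the root-to-leaf procedure by hand.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (assumption): the target is a step function of the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5.0, 3.0, 1.0)

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Text view of the learned questions (internal nodes) and predictions (leaves).
print(export_text(model, feature_names=["x0", "x1"]))

def walk(tree, sample):
    """Follow one sample from the root to a leaf and return the leaf's prediction."""
    node = 0  # the root node has index 0
    while tree.children_left[node] != -1:  # -1 marks a leaf
        # Each internal node asks: is this feature value <= the threshold?
        if sample[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]
        else:
            node = tree.children_right[node]
    return tree.value[node][0, 0]  # mean target of the training rows in this leaf

sample = np.array([7.5, 2.0])
manual = walk(model.tree_, sample)
library = model.predict(sample.reshape(1, -1))[0]
print(manual, library)
```

The manual walk and `model.predict` should agree, since `predict` performs the same sequence of threshold comparisons internally.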

## Boosting vs Bagging

Boosting and bagging are both ensemble learning methods, which means that they combine multiple machine learning models to improve performance. However, they work in different ways.

__Boosting__ trains models sequentially: each new model is fitted to the errors (residuals, or more generally the gradients of the loss) left by the models before it, and the final prediction is the sum of all their contributions. This mainly reduces bias. On tabular data, a well-tuned boosting model often matches or beats bagging, but it has more hyperparameters to tune, can overfit if run for too many rounds, and its models cannot be trained independently of one another.

__Bagging__ (bootstrap aggregating) trains multiple models independently, each on a different bootstrap sample of the data (rows drawn with replacement). Their predictions are then averaged for regression, or combined by majority vote for classification. This mainly reduces variance. Bagging is often somewhat less accurate than a well-tuned boosting model, but it is simpler to tune and easy to parallelize because the models do not depend on each other.

Examples of boosting algorithms:

- XGBoost
- CatBoost

Examples of bagging algorithms:

- Random Forest
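
To make the difference concrete, here is a hand-rolled sketch of both ideas built from plain scikit-learn decision trees. It is a simplified illustration on synthetic data, not how XGBoost, CatBoost, or Random Forest are implemented: those libraries add regularization, feature subsampling, and many optimizations. The settings `n_trees` and `learning_rate` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumption): noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

n_trees = 50

# Bagging: each tree sees a bootstrap sample (rows drawn with replacement);
# the trees are independent and their predictions are averaged.
bagged_trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))
    bagged_trees.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))
bagging_pred = np.mean([t.predict(X) for t in bagged_trees], axis=0)

# Boosting (squared-error case): trees are built one after another,
# each fitted to the residuals left by the ensemble so far.
learning_rate = 0.1
boosting_pred = np.full_like(y, y.mean())
boosting_mse_history = []
for _ in range(n_trees):
    residuals = y - boosting_pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    boosting_pred += learning_rate * tree.predict(X)
    boosting_mse_history.append(np.mean((y - boosting_pred) ** 2))

print("bagging training MSE: ", np.mean((y - bagging_pred) ** 2))
print("boosting training MSE:", boosting_mse_history[-1])
```

Note the structural difference: the bagging loop could run in any order or in parallel, while each boosting iteration needs the predictions of all previous iterations. The numbers printed are training errors, so they say nothing about which method generalizes better.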


## Scaling your data

You do not need to scale your data when using XGBoost or other decision tree methods like CatBoost. Scaling will not make a tree-based model worse, but it generally will not help either.

Decision tree methods work by splitting the feature space into smaller and smaller regions, based on the values of the features. Each split is a threshold on a single feature, chosen to make the resulting regions as pure as possible: for classification, each region should contain mostly examples of one class; for regression, the target values within each region should vary as little as possible.

Standardization and min-max scaling are monotonic transformations: they change the numerical values of a feature but not the ordering of the samples. Because a split depends only on which samples fall on each side of the threshold, the tree finds the same partitions before and after scaling, just with rescaled thresholds. Features on very different ranges are not a problem either, because each split considers one feature at a time.

For example, if you have a feature that represents the price of a house, and you scale it to the range 0 to 1, a split such as "price <= 300,000" simply becomes a split at the corresponding scaled value. The same houses end up on each side, so the predictions do not change.

There are a few exceptions to this rule. Scaling does matter when the trees are combined with scale-sensitive components, such as XGBoost's linear booster (`gblinear`), regularized linear models, k-nearest neighbors, or PCA in the same pipeline. An imbalanced target variable in classification (many more examples of one class than the other) is not one of these exceptions: feature scaling does not address it, and class weights or resampling are the usual remedies.
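
As a quick check of how rescaling interacts with tree splits, the sketch below fits the same scikit-learn decision tree on raw and on standardized features and compares the predictions. The data are synthetic and the feature names `price` and `rooms` are assumptions chosen for illustration. A single tree is used for simplicity; the same reasoning carries over to tree ensembles.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
price = rng.integers(50, 900, size=300) * 1000.0    # e.g. a house price
rooms = rng.integers(1, 8, size=300).astype(float)  # e.g. a room count
X_raw = np.column_stack([price, rooms])
y = 0.001 * price + 20.0 * rooms + rng.normal(scale=5.0, size=300)

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_raw)

# Same model settings, once on raw and once on scaled features.
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

pred_raw = tree_raw.predict(X_raw)
pred_scaled = tree_scaled.predict(X_scaled)

same = np.allclose(pred_raw, pred_scaled)
print("identical predictions:", same)
```

The two trees split on the same features in the same order; only the stored threshold values differ, because they are expressed in the rescaled units.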