
# 6-5 Machine Learning with Tree-Based Models in Python - Model Tuning.

## Imports

In [1]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

## Data

In [2]:
base_url = 'https://drive.google.com/uc?id='

### Wisconsin Breast Cancer Dataset

In [3]:
id = '1oqwkLiOXsHomv_Nhm4JhEUf0GQE8h1rp'
breast = pd.read_csv(base_url + id)
breast = breast.drop(['id', 'Unnamed: 32'], axis=1)
breast['diagnosis'] = (breast['diagnosis'] == 'M').astype(int)
breast.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   diagnosis                569 non-null    int64  
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  5

### Indian Liver Patient Dataset

In [4]:
id = '1ZIKZwQV88fV7RFUSkhrTbGWGBxYxp9Rh'
liver = pd.read_csv(base_url + id)
liver.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


### Auto-MPG Dataset

In [5]:
id = '14qqT73DvmgD0dx9zkcs3pxRLMCwSANii'
auto = pd.read_csv(base_url + id)
auto.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     392 non-null    float64
 1   displ   392 non-null    float64
 2   hp      392 non-null    int64  
 3   weight  392 non-null    int64  
 4   accel   392 non-null    float64
 5   origin  392 non-null    object 
 6   size    392 non-null    float64
dtypes: float64(4), int64(2), object(1)
memory usage: 21.6+ KB


## Tuning a CART's Hyperparameters

### Hyperparameters

- **Parameters** are learned from data through training
- **Hyperparameters** are not learned from data; they must be set before training
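
A minimal sketch of the distinction. It assumes scikit-learn's bundled copy of the breast cancer data (`load_breast_cancer`) as a stand-in for the CSV used in this notebook; the names `X_demo`, `y_demo` and `dt_demo` are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled dataset stands in for the notebook's CSV
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hyperparameters: chosen by us before training
dt_demo = DecisionTreeClassifier(max_depth=2, random_state=1)
hyperparams = dt_demo.get_params()
print(hyperparams['max_depth'])

# Parameters: learned during fit (here, the split feature and threshold at the root)
dt_demo.fit(X_demo, y_demo)
root_feature = dt_demo.tree_.feature[0]
root_threshold = dt_demo.tree_.threshold[0]
print(root_feature, root_threshold)

# The learned tree respects the hyperparameter we set
fitted_depth = dt_demo.get_depth()
print(fitted_depth)
```

`get_params()` is available before `fit`, while `tree_` only exists afterwards.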

### What is hyperparameter tuning?

- **Problem**: search for a set of optimal hyperparameters for a learning algorithm
- **Solution**: find a set of optimal hyperparameters that results in an optimal model
- **Optimal model**: yields an optimal score
- **Score**: In `sklearn` the default score for classification is **accuracy** and the default score for regression is $R^2$.
- Cross validation is used to estimate the generalization performance
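
A short check of the default scores, as a sketch on scikit-learn's bundled datasets (an assumption; they are not the files loaded in this notebook). With no `scoring` argument, `cross_val_score` falls back to the estimator's own `score` method.

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: .score() is accuracy
Xc, yc = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(Xc, yc)
clf_default = clf.score(Xc, yc)
clf_accuracy = accuracy_score(yc, clf.predict(Xc))

# Regression: .score() is R^2
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=1).fit(Xr, yr)
reg_default = reg.score(Xr, yr)
reg_r2 = r2_score(yr, reg.predict(Xr))

# 5-fold CV estimate of generalization accuracy (default scoring)
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=1), Xc, yc, cv=5
)
print(clf_default, reg_default, cv_scores.mean())
```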

### Why tune hyperparameters?

- In `sklearn`, a model's default hyperparameters are not optimal for all problems
- Hyperparameters should be tuned to obtain the best model performance

### Approaches to hyperparameter tuning

- Grid Search
- Random Search
- Bayesian Optimization
- Genetic Algorithms
- etc...
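
Only grid search is used in this notebook. For comparison, a random search could look roughly like this, using `RandomizedSearchCV`. The sampling ranges and `n_iter` are illustrative assumptions, and the data is scikit-learn's bundled breast cancer set rather than the CSV above.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Distributions to sample from (ranges are assumptions for illustration)
param_dist = {
    'max_depth': randint(2, 8),               # integers 2..7
    'min_samples_leaf': uniform(0.02, 0.1),   # floats in [0.02, 0.12]
}

rand_dt = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=10,            # number of sampled candidates, not a full grid
    scoring='accuracy',
    cv=5,
    random_state=1,
)
rand_dt.fit(X_demo, y_demo)
print(rand_dt.best_params_)
```

The cost is fixed by `n_iter` instead of growing with the size of the grid.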

### Grid search cross validation

- Manually set a grid of discrete hyperparameter values
- Set a metric for scoring model performance
- Search exhaustively through the grid
- For each set of hyperparameters, evaluate the model's CV score
- The optimal hyperparameters are those of the model achieving the best CV score
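
The steps above can be written out by hand. This is a sketch of the idea, not how `GridSearchCV` is implemented internally; the small grid and the bundled dataset are assumptions chosen to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# 1. A grid of discrete hyperparameter values (illustrative)
grid = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.05, 0.1]}

best_score, best_params = -1.0, None

# 3. Search exhaustively through every combination
for params in ParameterGrid(grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    # 2. + 4. Score each candidate with 5-fold CV accuracy
    score = cross_val_score(model, X_demo, y_demo,
                            scoring='accuracy', cv=5).mean()
    # 5. Keep the combination with the best mean CV score
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```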


### Inspecting the hyperparameters of a CART in `sklearn`

In [6]:
X = breast.drop('diagnosis', axis=1)
y = breast['diagnosis']

In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   radius_mean              569 non-null    float64
 1   texture_mean             569 non-null    float64
 2   perimeter_mean           569 non-null    float64
 3   area_mean                569 non-null    float64
 4   smoothness_mean          569 non-null    float64
 5   compactness_mean         569 non-null    float64
 6   concavity_mean           569 non-null    float64
 7   concave points_mean      569 non-null    float64
 8   symmetry_mean            569 non-null    float64
 9   fractal_dimension_mean   569 non-null    float64
 10  radius_se                569 non-null    float64
 11  texture_se               569 non-null    float64
 12  perimeter_se             569 non-null    float64
 13  area_se                  569 non-null    float64
 14  smoothness_se            5

In [8]:
y.value_counts()

diagnosis,count
0,357
1,212


In [9]:
SEED = 1

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=SEED)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((398, 30), (171, 30), (398,), (171,))

In [10]:
dt = DecisionTreeClassifier(random_state=SEED)
dt

In [11]:
# dict of hyperparameters
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 1,
 'splitter': 'best'}

**max_depth** : int, default=None
- The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

**min_samples_leaf** : int or float, default=1
- The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least ``min_samples_leaf`` training samples in each of the left and right branches.  This may have the effect of smoothing the model, especially in regression.

**max_features** : int, float or {"sqrt", "log2"}, default=None
- The number of features to consider when looking for the best split:
    - If int, then consider `max_features` features at each split.
    - If float, then `max_features` is a fraction and `max(1, int(max_features * n_features_in_))` features are considered at each split.
    - If "sqrt", then `max_features=sqrt(n_features)`.
    - If "log2", then `max_features=log2(n_features)`.
    - If None, then `max_features=n_features`.
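
A quick worked example of what fractional values mean for this data (30 features, 398 training rows). The `max_features` formula is the one quoted above; the `ceil(min_samples_leaf * n_samples)` rule for a float `min_samples_leaf` comes from the scikit-learn docs and is not quoted in this notebook. The fractions are the ones used in the grid searched in this section.

```python
from math import ceil

n_features = 30   # columns in X
n_samples = 398   # rows in X_train (each CV fold trains on fewer rows)

# Features considered at each split for a float max_features
features_per_split = [max(1, int(f * n_features)) for f in [0.2, 0.4, 0.6, 0.8]]
print(features_per_split)

# Minimum leaf size for a float min_samples_leaf
leaf_sizes = [ceil(f * n_samples) for f in [0.04, 0.06, 0.08]]
print(leaf_sizes)
```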


In [12]:
# define the grid of hyperparameters
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8]
}

In [13]:
# instantiate a 10-fold CV grid search object
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=params_dt,
    scoring='accuracy',
    cv=10,
    n_jobs=-1
)

grid_dt

In [14]:
grid_dt.fit(X_train, y_train)

In [15]:
# NOTE: this model is fitted on the entire training set
# because the `refit` parameter is set to True by default
grid_dt.best_estimator_
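
To see what `refit=True` amounts to, the sketch below rebuilds the best model by hand and compares predictions. It assumes scikit-learn's bundled breast cancer data and a deliberately small grid; `search` and `manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

# refit=True is the default
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {'max_depth': [2, 3, 4]},
    scoring='accuracy',
    cv=5,
)
search.fit(X_tr, y_tr)

# Same recipe by hand: best hyperparameters, fitted on all training rows
manual = DecisionTreeClassifier(random_state=1, **search.best_params_)
manual.fit(X_tr, y_tr)

same_predictions = np.array_equal(
    search.best_estimator_.predict(X_te), manual.predict(X_te)
)
print(same_predictions)
```

Because of the refit, `search.predict(...)` and `search.score(...)` can also be called directly; they delegate to `best_estimator_`.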

In [16]:
grid_dt.best_params_

{'max_depth': 4, 'max_features': 0.2, 'min_samples_leaf': 0.06}

In [17]:
pd.DataFrame(grid_dt.cv_results_).sort_values(by='mean_test_score', ascending=False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,param_min_samples_leaf,params,split0_test_score,split1_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
13,0.007907,0.003546,0.003699,0.00083,4,0.2,0.06,"{'max_depth': 4, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.9,0.975,0.95,1.0,0.897436,0.923077,0.934551,0.032367,1
25,0.011242,0.003764,0.007693,0.002091,5,0.2,0.06,"{'max_depth': 5, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.9,0.975,0.95,1.0,0.897436,0.923077,0.934551,0.032367,1
37,0.008694,0.004171,0.007497,0.004148,6,0.2,0.06,"{'max_depth': 6, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.9,0.975,0.95,1.0,0.897436,0.923077,0.934551,0.032367,1
1,0.010731,0.002589,0.011774,0.008499,3,0.2,0.06,"{'max_depth': 3, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.85,0.975,0.95,1.0,0.897436,0.923077,0.929551,0.040226,4
16,0.013946,0.003547,0.00905,0.003906,4,0.4,0.06,"{'max_depth': 4, 'max_features': 0.4, 'min_sam...",0.975,0.95,...,0.95,0.85,0.925,0.975,0.925,0.923077,0.923077,0.927115,0.037865,5
24,0.011033,0.002597,0.005281,0.002108,5,0.2,0.04,"{'max_depth': 5, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.875,0.975,0.95,0.95,0.923077,0.923077,0.927115,0.030558,5
12,0.006833,0.003721,0.004526,0.00395,4,0.2,0.04,"{'max_depth': 4, 'max_features': 0.2, 'min_sam...",0.925,0.95,...,0.925,0.875,0.975,0.95,0.95,0.923077,0.923077,0.927115,0.030558,5
4,0.012299,0.004652,0.009139,0.004183,3,0.4,0.06,"{'max_depth': 3, 'max_features': 0.4, 'min_sam...",0.975,0.95,...,0.95,0.85,0.925,0.975,0.925,0.923077,0.923077,0.927115,0.037865,5
40,0.012188,0.00435,0.00667,0.003474,6,0.4,0.06,"{'max_depth': 6, 'max_features': 0.4, 'min_sam...",0.975,0.95,...,0.95,0.85,0.925,0.975,0.925,0.923077,0.923077,0.927115,0.037865,5
28,0.017164,0.005595,0.005071,0.002965,5,0.4,0.06,"{'max_depth': 5, 'max_features': 0.4, 'min_sam...",0.975,0.95,...,0.95,0.85,0.925,0.975,0.925,0.923077,0.923077,0.927115,0.037865,5


In [18]:
grid_dt.best_score_

np.float64(0.9345512820512821)

In [19]:
grid_dt.best_estimator_.score(X_test, y_test)

0.9064327485380117

In [20]:
# manual calculation of accuracy
sum(grid_dt.best_estimator_.predict(X_test) == y_test) / len(y_test)

0.9064327485380117

### Tree hyperparameters

In [21]:
dt = DecisionTreeClassifier(random_state=1)
dt.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': 1,
 'splitter': 'best'}

### Set the tree's hyperparameter grid

In [22]:
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}

### Search for the optimal tree

In [23]:
# Instantiate grid_dt
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=params_dt,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1)

grid_dt

### Evaluate the optimal tree

In [24]:
# preprocessing
liver = liver.dropna()
liver = liver.copy()
liver['Is_male'] = (liver['Gender'] == 'Male').astype(int)
liver['Dataset'] = (liver['Dataset'] == 1).astype(int)
X = liver.drop(['Gender', 'Dataset'], axis=1)
y = liver['Dataset']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((463, 10), (116, 10), (463,), (116,))

In [25]:
X_train

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Is_male
420,55,10.9,5.1,1350,48,57,6.4,2.3,0.50,0
115,50,7.3,3.6,1580,88,64,5.6,2.3,0.60,1
558,51,4.0,2.5,275,382,330,7.5,4.0,1.10,1
30,57,4.0,1.9,190,45,111,5.2,1.5,0.40,1
109,36,0.9,0.1,486,25,34,5.9,2.8,0.90,1
...,...,...,...,...,...,...,...,...,...,...
41,62,0.6,0.1,160,42,110,4.9,2.6,1.10,1
130,45,3.2,1.4,512,50,58,6.0,2.7,0.80,1
247,55,0.9,0.2,190,25,28,5.9,2.7,0.80,1
376,33,0.7,0.1,168,35,33,7.0,3.7,1.10,1


In [26]:
grid_dt.fit(X_train, y_train)

In [27]:
best_model = grid_dt.best_estimator_
best_model

In [28]:
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
y_pred_proba

array([0.95238095, 0.61797753, 0.61797753, 0.95238095, 0.95238095,
       0.95238095, 0.83050847, 0.75238095, 0.95238095, 0.45714286,
       0.83050847, 0.75238095, 0.95238095, 0.83050847, 0.83050847,
       0.75238095, 0.45714286, 0.95238095, 0.45714286, 0.75238095,
       0.45714286, 0.95238095, 0.75238095, 0.75238095, 0.61797753,
       0.95238095, 0.75238095, 0.45714286, 0.95238095, 0.75238095,
       0.61797753, 0.61797753, 0.95238095, 0.45714286, 0.83050847,
       0.75238095, 0.95238095, 0.75238095, 0.83050847, 0.45714286,
       0.83050847, 0.75238095, 0.75238095, 0.95238095, 0.45714286,
       0.95238095, 0.75238095, 0.83050847, 0.95238095, 0.83050847,
       0.83050847, 0.45714286, 0.75238095, 0.75238095, 0.83050847,
       0.75238095, 0.61797753, 0.61797753, 0.75238095, 0.45714286,
       0.45714286, 0.75238095, 0.75238095, 0.95238095, 0.75238095,
       0.95238095, 0.95238095, 0.75238095, 0.95238095, 0.75238095,
       0.75238095, 0.61797753, 0.45714286, 0.45714286, 0.95238

In [29]:
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
test_roc_auc

np.float64(0.737860533041256)
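
ROC AUC is computed from the predicted probabilities of the positive class (hence `[:, 1]`), not from hard labels. On a toy example (made-up labels and scores), it equals the fraction of positive/negative pairs in which the positive gets the higher score:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (hypothetical, for illustration only)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Same quantity by hand: share of (positive, negative) pairs ranked correctly,
# counting ties as half
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
manual_auc = sum(pairs) / len(pairs)

print(auc, manual_auc)  # 0.75 0.75
```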

## Tuning an RF's Hyperparameters

- CART hyperparameters
- number of estimators
- bootstrap
- etc...

### The Data

In [30]:
preprocessor = make_column_transformer(
    (OneHotEncoder(sparse_output=False, drop='first'), ['origin']),
    remainder='passthrough',
    force_int_remainder_cols=False
)

In [31]:
X = auto.drop('mpg', axis=1)
y = auto['mpg']

index = X.index

X = preprocessor.fit_transform(X)
X = pd.DataFrame(X,
                 columns=preprocessor.get_feature_names_out(),
                 index = index)

X.head()

Unnamed: 0,onehotencoder__origin_Europe,onehotencoder__origin_US,remainder__displ,remainder__hp,remainder__weight,remainder__accel,remainder__size
0,0.0,1.0,250.0,88.0,3139.0,14.5,15.0
1,0.0,1.0,304.0,193.0,4732.0,18.5,20.0
2,0.0,0.0,91.0,60.0,1800.0,16.4,10.0
3,0.0,1.0,250.0,98.0,3525.0,19.0,15.0
4,1.0,0.0,97.0,78.0,2188.0,15.8,10.0


In [32]:
SEED = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=SEED
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((313, 7), (79, 7), (313,), (79,))

### Tuning is expensive

- computationally expensive
- sometimes leads to very slight improvement
- weigh the impact of tuning against its cost to the whole project
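
A rough way to gauge the cost before fitting is to count the model fits. The sketch below copies the two grids used in this notebook; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from sklearn.model_selection import ParameterGrid

# Copies of the grids used in this notebook
params_dt_demo = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}
params_rf_demo = {
    'n_estimators': [300, 400, 500],
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [0.1, 0.2],
    'max_features': ['log2', 'sqrt'],
}

def count_fits(param_grid, cv):
    """Return (candidates, total fits): candidates x folds, plus one final refit."""
    n_candidates = len(ParameterGrid(param_grid))
    return n_candidates, n_candidates * cv + 1

dt_cost = count_fits(params_dt_demo, cv=10)
rf_cost = count_fits(params_rf_demo, cv=3)
print(dt_cost, rf_cost)  # (48, 481) (36, 109)
```

For the random forest, each of those fits trains 300 to 500 trees, so the grid grows expensive quickly.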

### Inspecting RF Hyperparameters in `sklearn`

In [33]:
SEED = 1

rf = RandomForestRegressor(random_state=SEED)
rf

In [34]:
rf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

In [35]:
# define a grid of hyperparameters
params_rf = {
    'n_estimators': [300, 400, 500],
    'max_depth': [4, 6, 8],
    'min_samples_leaf': [0.1, 0.2],
    'max_features': ['log2', 'sqrt']
}

In [36]:
# instantiate GridSearchCV object
grid_rf = GridSearchCV(
    estimator=rf,
    param_grid=params_rf,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

grid_rf

In [None]:
grid_rf.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


In [None]:
grid_rf.best_params_

### Evaluating the best model performance

In [None]:
best_model = grid_rf.best_estimator_
best_model

In [None]:
y_pred = best_model.predict(X_test)
y_pred

In [None]:
rmse_test = MSE(y_test, y_pred)**(1/2)
rmse_test
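
A small worked example of the RMSE calculation, with made-up numbers. It also shows the sign flip needed when reading `best_score_` from a search that used `scoring='neg_mean_squared_error'`: scikit-learn reports the negated MSE so that higher is always better. The `example_best_score` value is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions (hypothetical)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_hat)   # (0.25 + 0.25 + 0 + 1) / 4
rmse = mse ** 0.5
print(mse, rmse)

# A neg_mean_squared_error score is -MSE: negate it before taking the root
example_best_score = -16.0                # hypothetical value
cv_rmse = (-example_best_score) ** 0.5
print(cv_rmse)
```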

### Random Forest Hyperparameters

In [None]:
rf = RandomForestRegressor(n_jobs=-1, random_state=2)
rf

In [None]:
rf.get_params()

In [None]:
# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators': [100, 350, 500],
    # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
    'max_features': ['log2', 1.0, 'sqrt'],
    'min_samples_leaf': [2, 10, 30]
}

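The data for this exercise is not provided, so the code below is shown but not run. It assumes `grid_rf` has been re-instantiated with the `rf` and `params_rf` defined above.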

```python
grid_rf.fit(X_train, y_train)

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
```