# Boosting

The methodology was originally created to solve classification problems. The main idea is to find weak hypotheses, learn them repeatedly, and combine these weak hypotheses into a single hypothesis.

Is it an ensemble method? Yes.  
**Ensemble methods** aim to **combine the predictions of several simpler estimators** to produce a **more robust final prediction**.


## Ensemble Methods


There is a class of machine learning algorithms, the so-called **ensemble methods**, whose goal is to **combine the predictions of several simpler estimators** to produce a **more robust final prediction**.

Ensemble methods are usually divided into two classes:

- **Averaging methods**: the general procedure is to build several independent estimators and take the average of their predictions as the final prediction. The main goal is to reduce **variance**, so that the final model is usually better than any of the individual models. Example: **random forest**.
<br>

- **Boosting methods**: the general procedure is to build estimators sequentially, so that each later estimator tries to reduce the **bias** of the combined estimator, which takes the earlier estimators into account. Example: **AdaBoost**.
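
To make the variance claim for averaging methods concrete, here is a minimal simulation sketch using only the standard library. It assumes idealized estimators that are unbiased and fully independent (`TRUE_VALUE`, `NOISE_STD` and the other constants are made up for illustration). Real bagged trees are correlated, so the reduction in practice is smaller than what this shows.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 10.0    # hypothetical quantity every estimator tries to predict
NOISE_STD = 2.0      # each individual estimator is unbiased but noisy
N_ESTIMATORS = 25    # size of the ensemble
N_TRIALS = 2000      # number of repeated experiments


def single_prediction():
    """Prediction of one noisy, unbiased estimator."""
    return random.gauss(TRUE_VALUE, NOISE_STD)


def averaged_prediction(n):
    """Average of n independent estimators (the averaging idea)."""
    return sum(single_prediction() for _ in range(n)) / n


single = [single_prediction() for _ in range(N_TRIALS)]
averaged = [averaged_prediction(N_ESTIMATORS) for _ in range(N_TRIALS)]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)

print(f"variance of one estimator: {var_single:.3f}")
print(f"variance of the average:   {var_averaged:.3f}")
```

Under these assumptions the variance of the average should be close to the single-estimator variance divided by `N_ESTIMATORS`. The bias is untouched, which is why boosting takes a different, sequential route.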

There is also a third class of ensemble methods, the so-called [stacking ensemble](https://machinelearningmastery.com/stacking-ensemble-machine-learning-with-python/), which consists of "stacking" models to produce the blend. We will not cover it in detail, but I leave it as a suggestion for further study! :)

For more details on ensemble methods in the context of sklearn, [click here!](https://scikit-learn.org/stable/modules/ensemble.html)

In today's class, we will look in detail at the boosting procedure, illustrated by the AdaBoost and Gradient Boosting methods. Let's go!

______

### Bagging vs Boosting

As a reminder, here are the main differences between the two ensemble approaches we have studied:

<img src=https://pluralsight2.imgix.net/guides/81232a78-2e99-4ccc-ba8e-8cd873625fdf_2.jpg width=600>

_________
_______
_________

## Boosting & AdaBoost

AdaBoost stands for **Adaptive Boosting**. Its general procedure is **the successive creation of so-called weak learners**, which are very weak learning models, usually **trees with a single split (stumps)**.

<img src="https://miro.medium.com/max/1744/1*nJ5VrsiS1yaOR77d4h8gyw.png" width=300>

AdaBoost uses the **errors of the previous tree to improve the next tree**. The final predictions are made based on **the weight of each stump**, and determining these weights is part of the algorithm!

<img src="https://static.packt-cdn.com/products/9781788295758/graphics/image_04_046-1.png" width=700>

Let's understand this a bit better...

Here, bootstrapping is not used: the method starts by training a weak classifier **on the original dataset**, and then trains several additional copies of the classifier **on the same dataset**, but giving **a higher weight to the observations that were misclassified** (or, in the case of regression, to the observations **with the largest error**).

Thus, after several iterations, the classifiers/regressors sequentially "focus on the hardest cases", building a chained classifier that is strong even though it uses several weak classifiers as its building blocks.

<img src="https://www.researchgate.net/profile/Zhuo_Wang8/publication/288699540/figure/fig9/AS:668373486686246@1536364065786/Illustration-of-AdaBoost-algorithm-for-creating-a-strong-classifier-based-on-multiple.png" width=500>


In short, the main ideas behind this algorithm are:

- The algorithm creates and combines a set of **weak models** (usually stumps);
- Each stump is created **taking into account the errors of the previous stump**;
- Some stumps have **more say** than others in the final prediction.
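
The three ideas above can be sketched in a few lines of plain Python. This is a minimal illustration of the classic binary AdaBoost update (labels in {-1, +1}) on a made-up one-dimensional dataset, not sklearn's implementation, which uses multi-class variants and differs in its details. All names (`fit_stump`, `adaboost`, and so on) are hypothetical helpers written for this sketch.

```python
import math

# Toy 1-D dataset (made up): no single threshold separates the two classes.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]


def stump_predict(threshold, polarity, x):
    """A stump: predicts `polarity` left of the threshold, `-polarity` otherwise."""
    return polarity if x < threshold else -polarity


def fit_stump(X, y, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in [x + 0.5 for x in X[:-1]]:
        for polarity in (1, -1):
            error = sum(w for xi, yi, w in zip(X, y, weights)
                        if stump_predict(threshold, polarity, xi) != yi)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(X, y, n_rounds):
    """Train n_rounds stumps sequentially; return a list of (alpha, threshold, polarity)."""
    n = len(X)
    weights = [1.0 / n] * n              # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(X, y, weights)
        error = max(error, 1e-10)        # avoid division by zero
        alpha = 0.5 * math.log((1.0 - error) / error)   # the stump's say
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the others.
        weights = [w * math.exp(-alpha * yi * stump_predict(threshold, polarity, xi))
                   for xi, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble):
    return sum(predict(ensemble, xi) == yi for xi, yi in zip(X, y)) / len(X)


one_stump = adaboost(X, y, n_rounds=1)
three_stumps = adaboost(X, y, n_rounds=3)
print("accuracy with 1 stump: ", accuracy(one_stump))
print("accuracy with 3 stumps:", accuracy(three_stumps))
print("stump weights (alpha): ", [round(a, 3) for a, _, _ in three_stumps])
```

On this toy data a single stump cannot classify every point, while the weighted vote of three stumps can. Each stump also receives a different weight `alpha`, which is the "more say" mentioned in the list.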

The sklearn classes are:

- [AdaBoostClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)

- [AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor)

Note that there are not many hyperparameters. The most important one, which should be tuned with grid/random search, is:

- `n_estimators`: the number of chained weak learners.

In addition, it can also be worthwhile to tune the hyperparameters of the weak learners themselves. This is possible, as we will see below!


An animation to help us understand it better...  
- The project: https://periodicos.uff.br/anaisdoser/article/download/29032/16865/100072
- The result: https://mateusmaia.shinyapps.io/adaboosting/

First, let's start with our baseline:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('./datasets/german_credit_data.csv', index_col=0)
df.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad


In [3]:
df.dtypes

Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object

In [4]:
y_colum = 'Risk'

X = df.drop(columns=[y_colum])
y = df[y_colum]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [58]:
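def pipe_preprocessor(path_dataset, y_colum):
    """Build the (unfitted) preprocessing ColumnTransformer for a CSV dataset."""
    df = pd.read_csv(path_dataset, index_col=0)

    X = df.drop(columns=[y_colum])
    y = df[y_colum]

    # The split is only used here to read feature names and dtypes from the training portion
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # Numeric features: impute missing values with the mean, then standardize
    pipe_features_num = Pipeline([
        ('input_num', SimpleImputer(strategy='mean')),
        ('std', StandardScaler())
    ])

    features_num = X_train.select_dtypes(include=np.number).columns.tolist()

    # Categorical features: fill missing values with 'unknown', then one-hot encode
    # (handle_unknown='ignore' avoids errors on categories not seen during fit)
    pipe_features_cat = Pipeline([
        ('input_cat', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    features_cat = X_train.select_dtypes(exclude=np.number).columns.tolist()

    pre_processor = ColumnTransformer([
        ('transf_num', pipe_features_num, features_num),
        ('transf_cat', pipe_features_cat, features_cat)
    ])

    return pre_processor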

In [6]:
pre_processor = pipe_preprocessor('./datasets/german_credit_data.csv', 'Risk')

In [7]:
pipe_ab = Pipeline([
    ('pre_processor', pre_processor),
    ('ab', AdaBoostClassifier(random_state=42))
])

In [8]:
pipe_ab.fit(X_train, y_train)

Pipeline(steps=[('pre_processor',
                 ColumnTransformer(transformers=[('transf_num',
                                                  Pipeline(steps=[('input_num',
                                                                   SimpleImputer()),
                                                                  ('std',
                                                                   StandardScaler())]),
                                                  ['Age', 'Job',
                                                   'Credit amount',
                                                   'Duration']),
                                                 ('transf_cat',
                                                  Pipeline(steps=[('input_cat',
                                                                   SimpleImputer(fill_value='unknown',
                                                                                 strategy='constant')),
                          

In [9]:
def metricas_classificacao(estimador, X, y):
    y_pred = estimador.predict(X)
    print(classification_report(y, y_pred))

In [10]:
metricas_classificacao(pipe_ab, X_train, y_train)

              precision    recall  f1-score   support

         bad       0.68      0.51      0.58       240
        good       0.81      0.89      0.85       560

    accuracy                           0.78       800
   macro avg       0.74      0.70      0.72       800
weighted avg       0.77      0.78      0.77       800



In [11]:
metricas_classificacao(pipe_ab, X_test, y_test)

              precision    recall  f1-score   support

         bad       0.62      0.48      0.54        60
        good       0.80      0.87      0.83       140

    accuracy                           0.76       200
   macro avg       0.71      0.68      0.69       200
weighted avg       0.74      0.76      0.75       200



In [12]:
print(len(pipe_ab['ab'].estimators_))
print(pipe_ab['ab'].estimator_weights_)
pipe_ab['ab'].estimators_

50
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1.]


[DecisionTreeClassifier(max_depth=1, random_state=1608637542),
 DecisionTreeClassifier(max_depth=1, random_state=1273642419),
 DecisionTreeClassifier(max_depth=1, random_state=1935803228),
 DecisionTreeClassifier(max_depth=1, random_state=787846414),
 DecisionTreeClassifier(max_depth=1, random_state=996406378),
 DecisionTreeClassifier(max_depth=1, random_state=1201263687),
 DecisionTreeClassifier(max_depth=1, random_state=423734972),
 DecisionTreeClassifier(max_depth=1, random_state=415968276),
 DecisionTreeClassifier(max_depth=1, random_state=670094950),
 DecisionTreeClassifier(max_depth=1, random_state=1914837113),
 DecisionTreeClassifier(max_depth=1, random_state=669991378),
 DecisionTreeClassifier(max_depth=1, random_state=429389014),
 DecisionTreeClassifier(max_depth=1, random_state=249467210),
 DecisionTreeClassifier(max_depth=1, random_state=1972458954),
 DecisionTreeClassifier(max_depth=1, random_state=1572714583),
 DecisionTreeClassifier(max_depth=1, random_state=1433267572),


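Let's make the `base_estimator` explicit. (Note: in scikit-learn 1.2 and later this parameter is called `estimator`, and `base_estimator` was removed in version 1.4. The cells below use the older name.)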

In [13]:
from sklearn.tree import DecisionTreeClassifier

In [14]:
pre_processor = pipe_preprocessor('./datasets/german_credit_data.csv', 'Risk')

In [15]:
basal = DecisionTreeClassifier(max_depth=1)

In [16]:
pipe_ab = Pipeline([
    ('pre_processor', pre_processor),
    ('ab', AdaBoostClassifier(base_estimator=basal, random_state=42))
])

In [17]:
pipe_ab.fit(X_train, y_train)

Pipeline(steps=[('pre_processor',
                 ColumnTransformer(transformers=[('transf_num',
                                                  Pipeline(steps=[('input_num',
                                                                   SimpleImputer()),
                                                                  ('std',
                                                                   StandardScaler())]),
                                                  ['Age', 'Job',
                                                   'Credit amount',
                                                   'Duration']),
                                                 ('transf_cat',
                                                  Pipeline(steps=[('input_cat',
                                                                   SimpleImputer(fill_value='unknown',
                                                                                 strategy='constant')),
                          

In [18]:
metricas_classificacao(pipe_ab, X_train, y_train)

              precision    recall  f1-score   support

         bad       0.68      0.51      0.58       240
        good       0.81      0.89      0.85       560

    accuracy                           0.78       800
   macro avg       0.74      0.70      0.72       800
weighted avg       0.77      0.78      0.77       800



In [19]:
metricas_classificacao(pipe_ab, X_test, y_test)

              precision    recall  f1-score   support

         bad       0.62      0.48      0.54        60
        good       0.80      0.87      0.83       140

    accuracy                           0.76       200
   macro avg       0.71      0.68      0.69       200
weighted avg       0.74      0.76      0.75       200



We can also change the base estimator. For example, we can use a strongly regularized logistic regression.

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
pre_processor = pipe_preprocessor('./datasets/german_credit_data.csv', 'Risk')

In [22]:
basal = LogisticRegression(C=0.1, random_state=42)

In [23]:
pipe_ab = Pipeline([
    ('pre_processor', pre_processor),
    ('ab', AdaBoostClassifier(base_estimator=basal, random_state=42))
])

In [24]:
pipe_ab.fit(X_train, y_train)

Pipeline(steps=[('pre_processor',
                 ColumnTransformer(transformers=[('transf_num',
                                                  Pipeline(steps=[('input_num',
                                                                   SimpleImputer()),
                                                                  ('std',
                                                                   StandardScaler())]),
                                                  ['Age', 'Job',
                                                   'Credit amount',
                                                   'Duration']),
                                                 ('transf_cat',
                                                  Pipeline(steps=[('input_cat',
                                                                   SimpleImputer(fill_value='unknown',
                                                                                 strategy='constant')),
                          

In [25]:
metricas_classificacao(pipe_ab, X_train, y_train)

              precision    recall  f1-score   support

         bad       0.70      0.13      0.22       240
        good       0.72      0.97      0.83       560

    accuracy                           0.72       800
   macro avg       0.71      0.55      0.53       800
weighted avg       0.72      0.72      0.65       800



In [26]:
metricas_classificacao(pipe_ab, X_test, y_test)

              precision    recall  f1-score   support

         bad       0.75      0.15      0.25        60
        good       0.73      0.98      0.84       140

    accuracy                           0.73       200
   macro avg       0.74      0.56      0.54       200
weighted avg       0.74      0.73      0.66       200



In [26]:
print(len(pipe_ab['ab'].estimators_))
print(pipe_ab['ab'].estimator_weights_)
pipe_ab['ab'].estimators_

50
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1.]


[LogisticRegression(C=0.1, random_state=1608637542),
 LogisticRegression(C=0.1, random_state=1273642419),
 LogisticRegression(C=0.1, random_state=1935803228),
 LogisticRegression(C=0.1, random_state=787846414),
 LogisticRegression(C=0.1, random_state=996406378),
 LogisticRegression(C=0.1, random_state=1201263687),
 LogisticRegression(C=0.1, random_state=423734972),
 LogisticRegression(C=0.1, random_state=415968276),
 LogisticRegression(C=0.1, random_state=670094950),
 LogisticRegression(C=0.1, random_state=1914837113),
 LogisticRegression(C=0.1, random_state=669991378),
 LogisticRegression(C=0.1, random_state=429389014),
 LogisticRegression(C=0.1, random_state=249467210),
 LogisticRegression(C=0.1, random_state=1972458954),
 LogisticRegression(C=0.1, random_state=1572714583),
 LogisticRegression(C=0.1, random_state=1433267572),
 LogisticRegression(C=0.1, random_state=434285667),
 LogisticRegression(C=0.1, random_state=613608295),
 LogisticRegression(C=0.1, random_state=893664919),
 Log

That did not turn out very well. This is why, although it is possible to use other base estimators, it is common to stick with stumps (trees with a single split).

Now let's run the grid search!

In [27]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [28]:
pre_processor = pipe_preprocessor('./datasets/german_credit_data.csv', 'Risk')

In [29]:
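# Note: the default solver (lbfgs) supports only the 'l2' penalty, so the 'l1' and
# 'elasticnet' candidates in the grid below fail to fit unless a solver such as
# 'saga' is set here (e.g. solver='saga').
basal = LogisticRegression(l1_ratio=0.5, random_state=42)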

In [30]:
pipe_ab = Pipeline([
    ('pre_processor', pre_processor),
    ('ab', AdaBoostClassifier(base_estimator=basal, random_state=42))
])

In [31]:
params_grid_ab = {
    'ab__base_estimator__C': [0.1, 0.01],
    'ab__base_estimator__penalty': ['l1', 'l2', 'elasticnet'],
    'ab__n_estimators': [50, 100, 150]
}

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_ab = GridSearchCV(
    estimator=pipe_ab,
    param_grid=params_grid_ab,
    scoring='f1_weighted',
    cv=splitter,
    verbose=10,
    n_jobs=-1
)

grid_ab.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
             estimator=Pipeline(steps=[('pre_processor',
                                        ColumnTransformer(transformers=[('transf_num',
                                                                         Pipeline(steps=[('input_num',
                                                                                          SimpleImputer()),
                                                                                         ('std',
                                                                                          StandardScaler())]),
                                                                         ['Age',
                                                                          'Job',
                                                                          'Credit '
                                                                          'amount',
                               

In [32]:
grid_ab.best_params_

{'ab__base_estimator__C': 0.1,
 'ab__base_estimator__penalty': 'l2',
 'ab__n_estimators': 150}

In [33]:
metricas_classificacao(grid_ab, X_train, y_train)

              precision    recall  f1-score   support

         bad       0.61      0.30      0.41       240
        good       0.75      0.92      0.83       560

    accuracy                           0.73       800
   macro avg       0.68      0.61      0.62       800
weighted avg       0.71      0.73      0.70       800



_________
_______
_________

### Exercise
Using the breast cancer dataset, build a model to predict the tumor type.  
This time, use AdaBoost.

In [35]:
from sklearn.datasets import load_breast_cancer

dados = load_breast_cancer(as_frame=True)
print(dados['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [36]:
df = dados['frame']
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


## Gradient boosting

Besides the methods we have studied, there are other classes of ensemble methods!

In particular, there is the class of models that use the **gradient boosting** procedure.

Gradient boosting is also based on the boosting principle (weak learners are added sequentially so as to **sequentially minimize the errors made**).

<img src=https://miro.medium.com/max/788/1*pEu2LNmxf9ttXHIALPcEBw.png width=600>

But this method implements boosting through an explicit **gradient**.

The idea is to move toward the **minimum error** iteratively, **step by step**.

This path is given by the **gradient** of the **cost/loss function**, which measures the errors made.

<img src=https://upload.wikimedia.org/wikipedia/commons/a/a3/Gradient_descent.gif width=400>

This method is known as:

### Gradient descent

I emphasized it because this will be a method of **enormous importance** in the study of neural networks (and it is, in general, a widely used optimization method).

The overall goal of the method is quite simple: to determine which **parameters** of the hypothesis minimize the cost/loss function. To do so, the method "walks" along the error function toward its minimum. This "path" is given precisely by the **iterative determination of the parameters**: **at each step, we get closer to the final parameters of the hypothesis**, as they are fitted to the data.
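
As a minimal sketch of this idea, the code below fits the two parameters of a straight line with plain gradient descent on the mean squared error. The data, the starting point and the `learning_rate` value are assumptions chosen for illustration (the data follow `y = 2x + 1` exactly, so the expected parameters are known).

```python
# Made-up data generated by the line y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def mse_loss(a, b):
    """Mean squared error of the hypothesis y_hat = a * x + b."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(a, b):
    """Partial derivatives of the MSE with respect to a and b."""
    n = len(xs)
    grad_a = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_a, grad_b


a, b = 0.0, 0.0          # arbitrary starting parameters
learning_rate = 0.05     # step size (assumed value)

for step in range(2000):
    grad_a, grad_b = gradients(a, b)
    # Step against the gradient, i.e. toward lower loss.
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

print(f"a = {a:.4f}, b = {b:.4f}, loss = {mse_loss(a, b):.2e}")
```

Each iteration moves `a` and `b` a little closer to the values that minimize the loss, which is exactly the "path" described above.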

> **A short mathematical interlude:** the gradient descent implemented by gradient boosting is actually a **functional gradient descent**. That is, we are not looking for a set of parameters that minimizes the error, but rather want to **sequentially introduce weak learners (simple hypotheses) that minimize the error**. In this way, gradient boosting minimizes the cost function by iteratively choosing simple hypotheses that point toward the minimum in this function space.

Despite the interlude above, we do not need to worry much about the mathematical details. What matters is understanding that, in the case of gradient boosting, there are a few important points:

- A **cost/loss function** is explicitly minimized by a gradient procedure;

- The gradient is related to the procedure of **progressively chaining weak learners**, following the boosting idea.
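
A minimal sketch of these two points for regression with squared loss: in that case the negative gradient of the loss with respect to the current prediction is simply the residual, so each new weak learner is fitted to the residuals of the ensemble built so far. This is plain Python on made-up data, with hypothetical helper names, and it is not how sklearn implements it internally.

```python
# Toy 1-D regression data (made-up values), with X already sorted.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0, 4.9, 5.1]


def fit_regression_stump(X, residuals):
    """Least-squares stump: one threshold, one constant value on each side."""
    best = None
    for threshold in [(a + b) / 2 for a, b in zip(X[:-1], X[1:])]:
        left = [r for x, r in zip(X, residuals) if x < threshold]
        right = [r for x, r in zip(X, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def gradient_boosting(X, y, n_estimators=50, learning_rate=0.1):
    """Return (initial constant, list of stumps, training MSE after each round)."""
    initial = sum(y) / len(y)                 # first hypothesis: the mean
    predictions = [initial] * len(y)
    stumps = []
    history = [mse(y, predictions)]
    for _ in range(n_estimators):
        # For squared loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_regression_stump(X, residuals)
        stumps.append(stump)
        # Take a small step in the direction given by the new weak learner.
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(X, predictions)]
        history.append(mse(y, predictions))
    return initial, stumps, history


initial, stumps, history = gradient_boosting(X, y)
print(f"training MSE before boosting: {history[0]:.4f}")
print(f"training MSE after {len(stumps)} stumps: {history[-1]:.4f}")
```

The training loss shrinks round by round as stumps are chained, each one correcting part of what the previous ones missed.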

For those who want more detail (and want to venture into the mathematics), I suggest [this post](https://www.gormanalysis.com/blog/gradient-boosting-explained/) or [this site](https://explained.ai/gradient-boosting/), which has several excellent resources for understanding the method with all the mathematical details.

The [StatQuest videos](https://www.youtube.com/playlist?list=PLblh5JKOoLUJjeXUvUE0maghNuY2_5fY6) are also a good reference!

The sklearn classes are:

- [GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

- [GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)

The main hyperparameters to tune are:

- `n_estimators`: again, the number of chained weak learners.

- `learning_rate`: the constant that multiplies the gradient in gradient descent. Essentially, it controls the "step size" taken toward the minimum.

According to the [User Guide](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) itself: "*Empirical evidence suggests that small values of `learning_rate` favor better test error. The literature recommends to set the learning rate to a small constant (e.g. `learning_rate <= 0.1`) and choose `n_estimators` by early stopping.*"
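
To get a feel for the "step size" interpretation, here is a small sketch of plain gradient descent on the one-parameter function `f(w) = w**2`, run with three assumed learning rates. It illustrates the step-size effect in ordinary gradient descent only. In gradient boosting, `learning_rate` scales the contribution of each new tree, so the analogy is qualitative.

```python
def minimize(learning_rate, n_steps=20, w0=5.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * 2.0 * w
    return w


# Too small: slow progress. Moderate: fast convergence. Too large: divergence.
for lr in (0.01, 0.4, 1.1):
    print(f"learning_rate={lr}: w after 20 steps = {minimize(lr):.6g}")
```

With a very small rate the parameter is still far from the minimum at `w = 0` after 20 steps, while a rate that is too large overshoots at every step and moves away from it.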

Still on the learning rate, the following illustrations help to show why it matters:

<img src=https://www.jeremyjordan.me/content/images/2018/02/Screen-Shot-2018-02-24-at-11.47.09-AM.png width=700>

<img src=https://cdn-images-1.medium.com/max/1440/0*A351v9EkS6Ps2zIg.gif width=500>

In [34]:
from sklearn.datasets import load_diabetes

df = load_diabetes(as_frame=True)['frame']
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


In [35]:
df.shape

(442, 11)

In [36]:
df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-3.634285e-16,1.308343e-16,-8.045349e-16,1.281655e-16,-8.835316000000001e-17,1.327024e-16,-4.574646e-16,3.777301e-16,-3.830854e-16,-3.412882e-16,152.133484
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,77.093005
min,-0.1072256,-0.04464164,-0.0902753,-0.1123996,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260974,-0.1377672,25.0
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665645,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324879,-0.03317903,87.0
50%,0.00538306,-0.04464164,-0.007283766,-0.005670611,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947634,-0.001077698,140.5
75%,0.03807591,0.05068012,0.03124802,0.03564384,0.02835801,0.02984439,0.0293115,0.03430886,0.03243323,0.02791705,211.5
max,0.1107267,0.05068012,0.1705552,0.1320442,0.1539137,0.198788,0.1811791,0.1852344,0.133599,0.1356118,346.0


Let's train our baseline gradient boosting classifier:

In [37]:
from sklearn.ensemble import GradientBoostingClassifier

In [38]:
pre_processor = pipe_preprocessor('./datasets/german_credit_data.csv', 'Risk')

In [39]:
pipe_gb = Pipeline([
    ('pre_processor', pre_processor),
    ('gb', GradientBoostingClassifier(random_state=42))
])

pipe_gb.fit(X_train, y_train)

Pipeline(steps=[('pre_processor',
                 ColumnTransformer(transformers=[('transf_num',
                                                  Pipeline(steps=[('input_num',
                                                                   SimpleImputer()),
                                                                  ('std',
                                                                   StandardScaler())]),
                                                  ['Age', 'Job',
                                                   'Credit amount',
                                                   'Duration']),
                                                 ('transf_cat',
                                                  Pipeline(steps=[('input_cat',
                                                                   SimpleImputer(fill_value='unknown',
                                                                                 strategy='constant')),
                          

In [40]:
metricas_classificacao(pipe_gb, X_train, y_train)

              precision    recall  f1-score   support

         bad       0.93      0.67      0.78       240
        good       0.87      0.98      0.92       560

    accuracy                           0.89       800
   macro avg       0.90      0.82      0.85       800
weighted avg       0.89      0.89      0.88       800



In [41]:
metricas_classificacao(pipe_gb, X_test, y_test)

              precision    recall  f1-score   support

         bad       0.68      0.47      0.55        60
        good       0.80      0.91      0.85       140

    accuracy                           0.78       200
   macro avg       0.74      0.69      0.70       200
weighted avg       0.76      0.78      0.76       200



In [42]:
print(len(pipe_gb['gb'].estimators_))
pipe_gb['gb'].estimators_

100


array([[DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                              random_state=RandomState(MT19937) at 0x27C180D6D40)],
       [DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
             

Homework: use grid search to optimize the hyperparameters!

### Exercise
Using the breast cancer dataset, build a model to predict the tumor type.  
This time, use Gradient Boosting.

In [53]:
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame = True)["frame"]

In [59]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [57]:
df.isna().sum().sum()

0

In [60]:
y_colum = 'target'

X = df.drop(columns=[y_colum])
y = df[y_colum]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [61]:
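def build_preprocessor(df, y_colum):
    # Builds (but does not fit) the ColumnTransformer.
    # Column types are read from the training split only.
    X = df.drop(columns=[y_colum])
    y = df[y_colum]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # Numeric features: fill missing values with the mean, then standardize
    pipe_features_num = Pipeline([
        ('input_num', SimpleImputer(strategy='mean')),
        ('std', StandardScaler())
    ])
    
    features_num = X_train.select_dtypes(include=np.number).columns.tolist()
    
    # Categorical features: fill missing values with a constant label, then one-hot encode
    pipe_features_cat = Pipeline([
        ('input_cat', SimpleImputer(strategy='constant', fill_value='unknown')),
        ('onehot', OneHotEncoder())
    ])
    
    features_cat = X_train.select_dtypes(exclude=np.number).columns.tolist()
    
    # Route each group of columns to its own sub-pipeline
    pre_processor = ColumnTransformer([
        ('transf_num', pipe_features_num, features_num),
        ('transf_cat', pipe_features_cat, features_cat)
    ])
    
    return pre_processor

In [62]:
# The variable name differs from the function name, so this cell can be re-run safely
preprocessor = build_preprocessor(df, "target")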

In [63]:
pipe_gb = Pipeline([
    ("preprocessor", preprocessor),
    ('gb', GradientBoostingClassifier(random_state=42))
])

pipe_gb.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('transf_num',
                                                  Pipeline(steps=[('input_num',
                                                                   SimpleImputer()),
                                                                  ('std',
                                                                   StandardScaler())]),
                                                  ['mean radius',
                                                   'mean texture',
                                                   'mean perimeter',
                                                   'mean area',
                                                   'mean smoothness',
                                                   'mean compactness',
                                                   'mean concavity',
                                                   'mean concave points',
                          

In [64]:
# Training set
metricas_classificacao(pipe_gb, X_train, y_train)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       170
           1       1.00      1.00      1.00       285

    accuracy                           1.00       455
   macro avg       1.00      1.00      1.00       455
weighted avg       1.00      1.00      1.00       455



In [66]:
# Test set
metricas_classificacao(pipe_gb, X_test, y_test)

              precision    recall  f1-score   support

           0       0.97      0.90      0.94        42
           1       0.95      0.99      0.97        72

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



In [75]:
y_pred = pipe_gb.predict(X_test)

In [76]:
y_pred

array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 1])

In [74]:
y_test

256    0
428    1
501    0
363    1
564    0
      ..
95     0
128    1
257    0
228    1
488    1
Name: target, Length: 114, dtype: int32

In [80]:
target_compare = pd.DataFrame({"y_pred": y_pred, "y_test": y_test})
target_compare["var"] = abs(y_pred - y_test)

In [83]:
# Comparing results
target_compare

Unnamed: 0,y_pred,y_test,var
256,0,0,0
428,1,1,0
501,0,0,0
363,0,1,1
564,0,0,0
...,...,...,...
95,0,0,0
128,1,1,0
257,0,0,0
228,1,1,0
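The row-by-row comparison above can be summarized by counting the mismatches and building a confusion matrix. The sketch below uses only the first five rows shown in the outputs above (indices 256, 428, 501, 363, 564) so that it runs on its own. The names `y_true_demo` and `y_pred_demo` are illustrative; in the notebook, `y_test` and `y_pred` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# First five test rows and their predictions, copied from the outputs above
y_true_demo = pd.Series([0, 1, 0, 1, 0], index=[256, 428, 501, 363, 564], name="target")
y_pred_demo = np.array([0, 1, 0, 0, 0])

# Number of misclassified rows (equivalent to summing the "var" column)
n_errors = int((y_pred_demo != y_true_demo).sum())

# Indices of the misclassified rows
wrong_idx = y_true_demo.index[y_pred_demo != y_true_demo.to_numpy()].tolist()

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])

print(n_errors)   # 1
print(wrong_idx)  # [363]
print(cm)
```

In this small sample the only error is row 363, a true class 1 predicted as class 0.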


In [72]:
pipe_gb["gb"].feature_importances_

array([2.37623303e-04, 3.60430649e-03, 6.14536269e-04, 2.23141410e-03,
       2.60141856e-04, 3.87397541e-04, 1.55248512e-03, 3.02698005e-02,
       5.46301208e-05, 3.58586889e-05, 1.91966125e-03, 2.69268180e-02,
       4.17666155e-03, 5.34802268e-03, 1.28324998e-03, 5.50085920e-04,
       1.68637520e-03, 7.56875054e-04, 1.25443008e-03, 9.98218546e-04,
       4.34453323e-01, 5.26449690e-02, 2.71483236e-01, 2.23291265e-02,
       1.06907316e-02, 4.57425629e-03, 1.12801977e-02, 1.06548576e-01,
       1.24773714e-04, 1.72221822e-03])
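The raw array is easier to read when each importance is paired with its feature name. The sketch below is self-contained: it refits a `GradientBoostingClassifier` directly on the unscaled features, which is an assumption made for brevity, so its numbers may differ slightly from those of `pipe_gb`. The names `df_demo`, `gb_demo` and `importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df_demo = load_breast_cancer(as_frame=True)["frame"]
X_demo = df_demo.drop(columns=["target"])
y_demo = df_demo["target"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit without the preprocessing pipeline (assumption for this sketch)
gb_demo = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(
    gb_demo.feature_importances_, index=X_tr.columns
).sort_values(ascending=False)

print(importances.head(5))
```

For the pipeline in the notebook, every feature is numeric and the `ColumnTransformer` keeps the columns in their original order, so `pd.Series(pipe_gb["gb"].feature_importances_, index=X_train.columns)` should line up the same way.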