## Required Items

Selection of Machine Learning Methods and Hyperparameters
- Incorporate the XGBoost and LightGBM methods;
- Use the 5 best methods from the previous stage;
- Use a hyperparameter grid and cross-validation to improve results on the selected metrics;
- Identify the 2 best models obtained.

In [None]:
import pandas as pd
df = pd.read_csv('dados_sem_anomalias.csv')
df.head()

Unnamed: 0,dispositivo_1,dispositivo_2,dispositivo_3,dispositivo_4,dispositivo_5,dispositivo_6,dispositivo_7,dispositivo_8,dispositivo_9,dispositivo_10,...,dispositivo_42,dispositivo_43,dispositivo_44,dispositivo_45,dispositivo_46,dispositivo_47,dispositivo_48,dispositivo_49,dispositivo_50,falha
0,73.18,61.7,44.79,34.7,64.35,31.37,71.95,46.84,45.4,57.63,...,57.5,49.11,35.51,49.83,35.35,56.37,56.21,50.41,42.17,0
1,48.7,36.58,42.64,51.02,66.17,43.68,51.84,57.06,40.92,33.1,...,42.58,45.03,55.41,56.54,34.13,50.11,49.88,49.82,69.11,0
2,45.65,69.17,48.58,34.39,42.41,41.61,59.15,55.03,59.03,59.72,...,74.03,48.05,39.78,58.47,63.05,54.8,68.53,45.07,71.07,0
3,63.11,49.81,38.17,59.98,61.59,59.39,48.5,55.62,52.2,30.47,...,43.08,47.89,32.3,66.46,54.78,60.01,21.4,53.12,50.01,0
4,28.41,38.22,43.15,39.12,58.32,71.58,36.61,45.84,35.68,45.38,...,58.2,55.04,36.48,52.88,54.85,66.86,50.58,58.64,53.66,0


## Incorporate the XGBoost and LightGBM methods

In [None]:
!pip install xgboost lightgbm



In [None]:
from xgboost import XGBRFClassifier
from lightgbm import LGBMClassifier

## Use the 5 best methods from the previous stage

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier, RandomForestClassifier
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import ExtraTreeClassifier

## Use a hyperparameter grid and cross-validation to improve results on the selected metrics

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('falha', axis=1).values
y = df['falha'].values

# Note: test_size=0.95 leaves only 5% of the rows (8,623) for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8623, 50), (163854, 50), (8623,), (163854,))
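
The shapes above follow from how `train_test_split` turns a fractional `test_size` into row counts: the test share is rounded up and the remainder goes to training. The sketch below reproduces that arithmetic with the sizes reported above, then shows on a small synthetic array (hypothetical data, not the sensor dataset) how the optional `stratify` argument keeps the class ratio in both parts. The names `X_demo` and `y_demo` are illustrative only.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Sizes reported above: 8,623 training rows + 163,854 test rows.
n_total = 8623 + 163854
n_test = math.ceil(0.95 * n_total)  # the test share is rounded up
n_train = n_total - n_test          # the remainder is used for training
print(n_train, n_test)              # 8623 163854

# Hypothetical balanced toy data: 200 rows per class, 5 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 5))
y_demo = np.repeat([0, 1], 200)

# stratify=y_demo asks for the same class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.95, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```

The split in this notebook is not stratified; the LightGBM log further down reports 4,309 positive and 4,314 negative training rows, so the classes came out close to balanced anyway.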

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.model_selection import GridSearchCV
import numpy as np
from tqdm import tqdm

# Each entry is (transformer or estimator, hyperparameter grid); (None, None) skips the step.
prepros = [
    (None, None),
    (StandardScaler(), {'with_mean': [True, False], 'with_std': [True, False]}),
    (MinMaxScaler(), {'feature_range': [(0, 1), (-1, 1)]}),
]

redutores = [
    (None, None),
    (PCA(random_state=42), {'n_components': [16, 32, None]}),
    (TruncatedSVD(random_state=42), {'n_components': [16, 32]}),
]

aprendizados = [
    (XGBRFClassifier(random_state=42), {}),
    (LGBMClassifier(random_state=42), {}),
    (HistGradientBoostingClassifier(random_state=42), {'max_iter': [100, 200], 'max_depth': [5, 10]}),
    (RandomForestClassifier(random_state=42), {'criterion': ['gini', 'log_loss'], 'max_depth': [5, 10]}),
    # 'log_loss' replaces 'deviance', which was deprecated in scikit-learn 1.1 and removed in 1.3
    (GradientBoostingClassifier(random_state=42), {'loss': ['log_loss', 'exponential'], 'max_depth': [5, 10]}),
    (ExtraTreeClassifier(random_state=42), {'criterion': ['gini', 'log_loss'], 'max_depth': [5, 10]}),
    (BaggingClassifier(random_state=42), {'n_estimators': [10, 20], 'max_samples': [0.5, 1.0]}),
]
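
To get a feel for the cost of the search defined by these lists, the sketch below counts the work for one combination (StandardScaler, then PCA, then HistGradientBoostingClassifier) using scikit-learn's `ParameterGrid`. The step-name prefixes follow the `<step name>__<parameter>` convention that `Pipeline` expects, with class names as step names, and 5-fold cross-validation is assumed, as in the grid search of this notebook. The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid for one of the pipelines: StandardScaler -> PCA -> HistGradientBoostingClassifier.
example_grid = {
    'StandardScaler__with_mean': [True, False],
    'StandardScaler__with_std': [True, False],
    'PCA__n_components': [16, 32, None],
    'HistGradientBoostingClassifier__max_iter': [100, 200],
    'HistGradientBoostingClassifier__max_depth': [5, 10],
}

n_candidates = len(ParameterGrid(example_grid))  # 2 * 2 * 3 * 2 * 2 = 48
n_fits = n_candidates * 5                        # 5 folds per candidate (final refit not counted)
n_pipelines = 3 * 3 * 7                          # preprocessors x reducers x learners

print(n_candidates, n_fits, n_pipelines)         # 48 240 63
```

Because every option in every list multiplies the number of fits, the full run over all 63 pipelines is expensive even with only 5% of the rows used for training.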

### Running the training

In [None]:
from itertools import product

resultados = []
# Try every (preprocessor, reducer, learner) combination: 3 x 3 x 7 = 63 pipelines
for (pp, ppp), (red, redp), (ap, app) in tqdm(list(product(prepros, redutores, aprendizados))):

    param_grid = {}
    steps = []

    # Step names are the class names ('NoneType' when the step is skipped)
    pre_nome = pp.__class__.__name__
    red_nome = red.__class__.__name__
    ap_nome = ap.__class__.__name__

    if pp is not None:
        steps.append((pre_nome, pp))
        # preprocessing parameters, prefixed with '<step name>__' as Pipeline expects
        for key in ppp.keys():
            param_grid[pre_nome + '__' + key] = ppp[key]

    if red is not None:
        steps.append((red_nome, red))
        # dimensionality-reduction parameters
        for key in redp.keys():
            param_grid[red_nome + '__' + key] = redp[key]

    steps.append((ap_nome, ap))
    # learner parameters
    for key in app.keys():
        param_grid[ap_nome + '__' + key] = app[key]

    pipe = Pipeline(steps)

    grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5, scoring='f1', n_jobs=-1)
    grid.fit(X_train, y_train)
    cv = grid.cv_results_
    res = {
        'preprocessamento': pre_nome,
        'reducao': red_nome,
        'aprendizado': ap_nome,
        # one value per hyperparameter candidate; note that this is the standard
        # deviation of the fit time across folds ('mean_fit_time' holds the average)
        'tempo': cv['std_fit_time'],
        # mean F1 across the 5 folds, one value per hyperparameter candidate
        'f1': cv['mean_test_score'],
    }
    resultados.append(res)

  2%|▏         | 1/63 [00:20<21:05, 20.41s/it]

[LightGBM] [Info] Number of positive: 4309, number of negative: 4314
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005057 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12750
[LightGBM] [Info] Number of data points in the train set: 8623, number of used features: 50
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499710 -> initscore=-0.001160
[LightGBM] [Info] Start training from score -0.001160


100%|██████████| 63/63 [5:20:51<00:00, 305.58s/it]
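
The loop above stores `cv['std_fit_time']` and the full `cv['mean_test_score']` array for each pipeline, so every row holds one value per hyperparameter candidate. A minimal sketch of an alternative summary, on synthetic data and assuming the goal is to report the best candidate of each grid, reads the winner from `best_index_`, `best_score_`, and `best_params_`. The dataset, the reduced grid, and the `_demo` names are hypothetical; this is not the configuration that produced the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sensor data (assumption: numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

pipe_demo = Pipeline([
    ('StandardScaler', StandardScaler()),
    ('RandomForestClassifier', RandomForestClassifier(n_estimators=20, random_state=42)),
])
grid_params_demo = {
    'StandardScaler__with_mean': [True, False],
    'RandomForestClassifier__max_depth': [5, 10],
}

grid_demo = GridSearchCV(pipe_demo, param_grid=grid_params_demo, cv=5, scoring='f1')
grid_demo.fit(X_demo, y_demo)

cv_demo = grid_demo.cv_results_
best = grid_demo.best_index_  # position of the best candidate in cv_results_
summary = {
    'best_params': grid_demo.best_params_,
    'best_f1': grid_demo.best_score_,                # mean F1 of the best candidate
    'mean_fit_time': cv_demo['mean_fit_time'][best], # average fit time of that candidate
    'mean_f1_all_candidates': cv_demo['mean_test_score'].mean(),
}
print(summary)
```

Averaging `mean_test_score` over all candidates, as the notebook does later, mixes good and bad hyperparameter settings into a single number, whereas `best_score_` reports only the winning setting. Which summary is preferable depends on whether the comparison is meant to reward robustness across the grid or peak performance.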


## Identify the 2 best models obtained

In [None]:
import pandas as pd

df_res = pd.DataFrame(resultados)

print(df_res)

   preprocessamento       reducao                     aprendizado  \
0          NoneType      NoneType                 XGBRFClassifier   
1          NoneType      NoneType                  LGBMClassifier   
2          NoneType      NoneType  HistGradientBoostingClassifier   
3          NoneType      NoneType          RandomForestClassifier   
4          NoneType      NoneType      GradientBoostingClassifier   
..              ...           ...                             ...   
58     MinMaxScaler  TruncatedSVD  HistGradientBoostingClassifier   
59     MinMaxScaler  TruncatedSVD          RandomForestClassifier   
60     MinMaxScaler  TruncatedSVD      GradientBoostingClassifier   
61     MinMaxScaler  TruncatedSVD             ExtraTreeClassifier   
62     MinMaxScaler  TruncatedSVD               BaggingClassifier   

                                                tempo  \
0                                [0.7526432594542181]   
1                               [0.20546152706473478]   


In [None]:
# saving the results because the run took a long time
df_res.to_csv('resultadosEtapa4.csv')

In [None]:
import pandas as pd
import numpy as np

# copy so that df_res keeps the raw per-candidate arrays
df_resTeste = df_res.copy()

# replace each array with its mean over the hyperparameter candidates
df_resTeste['tempo'] = df_resTeste['tempo'].apply(np.mean)
df_resTeste['f1'] = df_resTeste['f1'].apply(np.mean)

df_resTeste.head()

Unnamed: 0,preprocessamento,reducao,aprendizado,tempo,f1
0,NoneType,NoneType,XGBRFClassifier,0.752643,0.824508
1,NoneType,NoneType,LGBMClassifier,0.205462,0.888806
2,NoneType,NoneType,HistGradientBoostingClassifier,0.378585,0.882475
3,NoneType,NoneType,RandomForestClassifier,0.501572,0.822239
4,NoneType,NoneType,GradientBoostingClassifier,2.727062,0.886526


In [None]:
df_resSort = df_resTeste.sort_values('f1', ascending=False)

df_resSort.head()

Unnamed: 0,preprocessamento,reducao,aprendizado,tempo,f1
1,NoneType,NoneType,LGBMClassifier,0.205462,0.888806
22,StandardScaler,NoneType,LGBMClassifier,0.200616,0.887354
46,MinMaxScaler,NoneType,GradientBoostingClassifier,0.809248,0.886529
4,NoneType,NoneType,GradientBoostingClassifier,2.727062,0.886526
25,StandardScaler,NoneType,GradientBoostingClassifier,0.817779,0.886464


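As the sorted table shows, the two best models were LGBMClassifier and GradientBoostingClassifier. Both reached an F1 score of about 0.89 (0.8888 and 0.8865, averaged over the cross-validation folds and the grid candidates), and all five top rows use no dimensionality reduction.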
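
Since the same learner appears several times among the top rows, picking the two best models means keeping only the best row per learner. The sketch below does this on a small frame whose values are copied from the sorted table above; the name `best_two` is illustrative.

```python
import pandas as pd

# Top five rows copied from the sorted results table.
top = pd.DataFrame(
    {
        'preprocessamento': ['NoneType', 'StandardScaler', 'MinMaxScaler', 'NoneType', 'StandardScaler'],
        'reducao': ['NoneType'] * 5,
        'aprendizado': [
            'LGBMClassifier', 'LGBMClassifier', 'GradientBoostingClassifier',
            'GradientBoostingClassifier', 'GradientBoostingClassifier',
        ],
        'f1': [0.888806, 0.887354, 0.886529, 0.886526, 0.886464],
    },
    index=[1, 22, 46, 4, 25],
)

# Keep the highest-scoring row of each learner, then take the first two learners.
best_two = (
    top.sort_values('f1', ascending=False)
       .drop_duplicates(subset='aprendizado', keep='first')
       .head(2)
)
print(best_two[['preprocessamento', 'aprendizado', 'f1']])
```

On these rows the selection keeps LGBMClassifier without preprocessing (row 1) and GradientBoostingClassifier with MinMaxScaler (row 46).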

In [None]:
# saving the sorted results
df_resSort.to_csv('resultadosEtapa4Sort.csv')