#1) Introdução


Autoria: Jackson Corrêa

Linkedin: https://www.linkedin.com/in/jackson-corr%C3%AAa/
<br><br>
Este é um projeto de Ciência de Dados que tem como objetivo desenvolver um modelo de máquina preditiva para classificar companhias com potencial de falência, de acordo com os Dados de falência do Taiwan Economic Journal para os anos de 1999-2009. Pensando na otimização do tempo, a etapa de análise exploratória dos dados foi automatizada com a utilização do pacote 'dataprep'. Sendo assim, o foco da Análise é mais generalista, com foco na distribuição das variáveis e suas correlações. Também foi utilizada a biblioteca'feature engine', com o objetivo de automatizar a tarefa de seleção de variáveis e tornar o modelo computacionalmente menos oneroso.
<br>
<br>
O projeto consiste na seguintes etapas:

* Análise exploratória dos dados, de forma manual, mais simples e com menos insights;
* Pré-processamento dos dados com split em dados de treino e teste, seleção de variáveis para o modelo e padronização dos dados;
* Modelagem de máquinas preditivas, com análise de desempenho de vários modelos testados, tunagem de hiperparâmetros e elaboração de um modelo Ensemble de Classificador de Votação.
<br><br>

A principal métrica para avaliação do desempenho dos modelos é o Recall Score (revocação). Dessa forma, os melhores modelos serão aqueles com menores indicações de falsos negativos.
<br><br>

##1.1) Fonte de dados

Os dados utilizados neste projeto foram extraídos da plataforma Kaggle.

Link de acesso: https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction
<br><br>

##1.2) Convenções

A seguir, estão listadas algumas conveções de sufixos e siglas utilizadas no nome de variáveis, com o intuito de deixar o código mais intuitivo à leitura:

* df - dataframe
* aux - auxiliar
* X_train - dados de treino ('X' maiúsculo)
* X_test - dados de teste ('X' maiúsculo)
* y_train - rótulos de treino ('y' minúsculo)
* y_test - rótulos de teste ('y' minúsculo)
* std - standard / padronização
* norm - normalized / normalizado / normalização
* transf - transformed / transformado
* base - baseline / base
* ori - original
* ens - ensemble
* opt - optimized / otimizado / ótimo / tunado
* eda - Exploratory Data Analytics
* os - oversampling
* us - undersampling
<br><br>

##1.3) Importação das bibliotecas

In [2]:
# Instalando a biblioteca 'category_encoders'
!pip install category_encoders
!pip install feature_engine
!pip install dataprep

Collecting category_encoders
  Downloading category_encoders-2.6.1-py2.py3-none-any.whl (81 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/81.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.9/81.9 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.1
Collecting feature_engine
  Downloading feature_engine-1.6.1-py2.py3-none-any.whl (326 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m326.6/326.6 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: feature_engine
Successfully installed feature_engine-1.6.1
Collecting dataprep
  Downloading dataprep-0.4.5-py3-none-any.whl (9.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.9/9.9 MB[0m [31m59.9 MB/s[0m eta [36m0:00:00[0m
Collecting flask_cors<4.0.0,>=3.0.10 (from dataprep)
  Downloading Flask_Cors-

In [3]:
# importando as bibliotecas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, recall_score, precision_score, f1_score, classification_report

from sklearn.preprocessing import StandardScaler

from feature_engine.selection import DropConstantFeatures, SmartCorrelatedSelection, RecursiveFeatureAddition, RecursiveFeatureElimination

from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours

from dataprep.eda import create_report

import warnings
warnings.filterwarnings("ignore")

##1.5) Importação do dataset

In [4]:
# Importando o dataset em formato csv
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projetos pessoais/01. Projetos Machine Learning/Falência de empresas Tailandesas/data.csv')

# criando um backup
df=data.copy()

#2) Análise exploratória

NOTA: a análise exploratória automatiza será realizada após a etapa de Feature Selection, utilizando o pacte Dataprep, e o conjunto de dados será aquele com as Features removidas no processo de Feature-Selection.
Abaixo, serão exibidos apenas algumas informações sobre o dataset, de forma mais simplória.


In [5]:
# Informações do dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64
 2    ROA(A) before interest and % after tax                   6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64
 4    Operating Gross Margin                                   6819 non-null   float64
 5    Realized Sales Gross Margin                              6819 non-null   float64
 6    Operating Profit Rate                                    6819 non-null   float64
 7    Pre-tax net Interest Rate                                6819 non-null   float64
 8    After-tax net Int

In [6]:
# Visualizando as primeiras linhas do dataset - parte 01
df.iloc[:,0:20].describe()

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),Operating Expense Rate,Research and development expense rate,Cash flow rate,Interest-bearing debt interest rate,Tax rate (A),Net Value Per Share (B),Net Value Per Share (A),Net Value Per Share (C),Persistent EPS in the Last Four Seasons
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,0.781381,1995347000.0,1950427000.0,0.467431,16448010.0,0.115001,0.190661,0.190633,0.190672,0.228813
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,0.012679,3237684000.0,2598292000.0,0.017036,108275000.0,0.138667,0.03339,0.033474,0.03348,0.033263
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,0.781567,0.0001566874,0.000128188,0.461558,0.0002030203,0.0,0.173613,0.173613,0.173676,0.214711
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,0.781635,0.0002777589,509000000.0,0.46508,0.0003210321,0.073489,0.1844,0.1844,0.1844,0.224544
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,0.781735,4145000000.0,3450000000.0,0.471004,0.0005325533,0.205841,0.19957,0.19957,0.199612,0.23882
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9980000000.0,1.0,990000000.0,1.0,1.0,1.0,1.0,1.0


In [7]:
# Visualizando as primeiras linhas do dataset - parte 02
df.iloc[:,20:40].describe()

Unnamed: 0,Cash Flow Per Share,Revenue Per Share (Yuan ¥),Operating Profit Per Share (Yuan ¥),Per Share Net profit before tax (Yuan ¥),Realized Sales Gross Profit Growth Rate,Operating Profit Growth Rate,After-tax Net Profit Growth Rate,Regular Net Profit Growth Rate,Continuous Net Profit Growth Rate,Total Asset Growth Rate,Net Value Growth Rate,Total Asset Return Growth Rate Ratio,Cash Reinvestment %,Current Ratio,Quick Ratio,Interest Expense Ratio,Total debt/Total net worth,Debt ratio %,Net worth/Assets,Long-term fund suitability ratio (A)
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.323482,1328641.0,0.109091,0.184361,0.022408,0.84798,0.689146,0.68915,0.217639,5508097000.0,1566212.0,0.264248,0.379677,403285.0,8376595.0,0.630991,4416337.0,0.113177,0.886823,0.008783
std,0.017611,51707090.0,0.027942,0.03318,0.012079,0.010752,0.013853,0.01391,0.010063,2897718000.0,114159400.0,0.009634,0.020737,33302160.0,244684700.0,0.011238,168406900.0,0.05392,0.05392,0.028153
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.317748,0.01563138,0.096083,0.17037,0.022065,0.847984,0.68927,0.68927,0.21758,4860000000.0,0.0004409689,0.263759,0.374749,0.007555047,0.004725903,0.630612,0.003007049,0.072891,0.851196,0.005244
50%,0.322487,0.02737571,0.104226,0.179709,0.022102,0.848044,0.689439,0.689439,0.217598,6400000000.0,0.0004619555,0.26405,0.380425,0.01058717,0.007412472,0.630698,0.005546284,0.111407,0.888593,0.005665
75%,0.328623,0.04635722,0.116155,0.193493,0.022153,0.848123,0.689647,0.689647,0.217622,7390000000.0,0.0004993621,0.264388,0.386731,0.01626953,0.01224911,0.631125,0.009273293,0.148804,0.927109,0.006847
max,1.0,3020000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9330000000.0,1.0,1.0,2750000000.0,9230000000.0,1.0,9940000000.0,1.0,1.0,1.0


In [8]:
# Visualizando as primeiras linhas do dataset - parte 03
df.iloc[:,40:60].describe()

Unnamed: 0,Borrowing dependency,Contingent liabilities/Net worth,Operating profit/Paid-in capital,Net profit before tax/Paid-in capital,Inventory and accounts receivable/Net value,Total Asset Turnover,Accounts Receivable Turnover,Average Collection Days,Inventory Turnover Rate (times),Fixed Assets Turnover Frequency,Net Worth Turnover Rate (times),Revenue per person,Operating profit per person,Allocation rate per person,Working Capital to Total Assets,Quick Assets/Total Assets,Current Assets/Total Assets,Cash/Total Assets,Quick Assets/Current Liability,Cash/Current Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.374654,0.005968,0.108977,0.182715,0.402459,0.141606,12789710.0,9826221.0,2149106000.0,1008596000.0,0.038595,2325854.0,0.400671,11255790.0,0.814125,0.400132,0.522273,0.124095,3592902.0,37159990.0
std,0.016286,0.012188,0.027782,0.030785,0.013324,0.101145,278259800.0,256358900.0,3247967000.0,2477557000.0,0.03668,136632700.0,0.03272,294506300.0,0.059054,0.201998,0.218112,0.139251,171620900.0,510350900.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.370168,0.005366,0.096105,0.169376,0.397403,0.076462,0.0007101336,0.00438653,0.0001728256,0.0002330013,0.021774,0.01043285,0.392438,0.004120529,0.774309,0.241973,0.352845,0.033543,0.005239776,0.001973008
50%,0.372624,0.005366,0.104133,0.178456,0.400131,0.118441,0.0009678107,0.006572537,0.0007646743,0.0005930942,0.029516,0.01861551,0.395898,0.007844373,0.810275,0.386451,0.51483,0.074887,0.007908898,0.004903886
75%,0.376271,0.005764,0.115927,0.191607,0.404551,0.176912,0.001454759,0.008972876,4620000000.0,0.003652371,0.042903,0.03585477,0.401851,0.01502031,0.850383,0.540594,0.689051,0.161073,0.01295091,0.01280557
max,1.0,1.0,1.0,1.0,1.0,1.0,9740000000.0,9730000000.0,9990000000.0,9990000000.0,1.0,8810000000.0,1.0,9570000000.0,1.0,1.0,1.0,1.0,8820000000.0,9650000000.0


In [9]:
# Visualizando as primeiras linhas do dataset - parte 04
df.iloc[:,60:80].describe()

Unnamed: 0,Current Liability to Assets,Operating Funds to Liability,Inventory/Working Capital,Inventory/Current Liability,Current Liabilities/Liability,Working Capital/Equity,Current Liabilities/Equity,Long-term Liability to Current Assets,Retained Earnings to Total Assets,Total income/Total expense,Total expense/Assets,Current Asset Turnover Rate,Quick Asset Turnover Rate,Working capitcal Turnover Rate,Cash Turnover Rate,Cash Flow to Sales,Fixed Assets to Assets,Current Liability to Liability,Current Liability to Equity,Equity to Long-term Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.090673,0.353828,0.277395,55806800.0,0.761599,0.735817,0.33141,54160040.0,0.934733,0.002549,0.029184,1195856000.0,2163735000.0,0.594006,2471977000.0,0.671531,1220121.0,0.761599,0.33141,0.115645
std,0.05029,0.035147,0.010469,582051600.0,0.206677,0.011678,0.013488,570270600.0,0.025564,0.012093,0.027149,2821161000.0,3374944000.0,0.008959,2938623000.0,0.009341,100754200.0,0.206677,0.013488,0.019529
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.053301,0.341023,0.277034,0.003163148,0.626981,0.733612,0.328096,0.0,0.931097,0.002236,0.014567,0.0001456236,0.0001417149,0.593934,0.0002735337,0.671565,0.08536037,0.626981,0.328096,0.110933
50%,0.082705,0.348597,0.277178,0.006497335,0.806881,0.736013,0.329685,0.001974619,0.937672,0.002336,0.022674,0.0001987816,0.0002247728,0.593963,1080000000.0,0.671574,0.196881,0.806881,0.329685,0.11234
75%,0.119523,0.360915,0.277429,0.01114677,0.942027,0.73856,0.332322,0.009005946,0.944811,0.002492,0.03593,0.0004525945,4900000000.0,0.594002,4510000000.0,0.671587,0.3722,0.942027,0.332322,0.117106
max,1.0,1.0,1.0,9910000000.0,1.0,1.0,1.0,9540000000.0,1.0,1.0,1.0,10000000000.0,10000000000.0,1.0,10000000000.0,1.0,8320000000.0,1.0,1.0,1.0


In [10]:
# Visualizando as primeiras linhas do dataset - parte 05
df.iloc[:,80:].describe()

Unnamed: 0,Cash Flow to Total Assets,Cash Flow to Liability,CFO to Assets,Cash Flow to Equity,Current Liability to Current Assets,Liability-Assets Flag,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.649731,0.461849,0.593415,0.315582,0.031506,0.001173,0.80776,18629420.0,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,1.0,0.047578
std,0.047372,0.029943,0.058561,0.012961,0.030845,0.034234,0.040332,376450100.0,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.0,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.633265,0.457116,0.565987,0.312995,0.018034,0.0,0.79675,0.0009036205,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,1.0,0.024477
50%,0.645366,0.45975,0.593266,0.314953,0.027597,0.0,0.810619,0.002085213,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,1.0,0.033798
75%,0.663062,0.464236,0.624769,0.317707,0.038375,0.0,0.826455,0.005269777,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,1.0,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [13]:
# Verificando a distriubuição das classes
df['Bankrupt?'].value_counts(1)

0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64

As classes estão altamente desbalanceadas.
<br><br>

#3) Pré-processamento

In [14]:
# Dividindo dados em X e y

X=df.drop('Bankrupt?',axis=1)
y=df['Bankrupt?']

# Split nos dados (antes de qualquer manipulação - evitar dataleakage)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=1,stratify=y)

# Criando conjunto X para StratifiedKFold para validação cruzada estratificada
X_train_skf = X_train
X_test_skf = X_test

# Criando conjunto X de backup das features originais
X_train_ori = X_train
X_test_ori = X_train

In [15]:
# Verificando a proporção das classes nos dados totais e no split
print(y.value_counts(1))
print('\n')
print(y_train.value_counts(1))

0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64


0    0.967736
1    0.032264
Name: Bankrupt?, dtype: float64


<br><br>
Na Etapa de Feature Selection, os Features constantes e altamente correlacionadas serão excluídas permanentemente, por isso as alterações já são feitas nos dados X_train, X_teste.

Nas etapas recursivas de adição e eliminação, as features não serão removidas do conjunto X_train e X_test. Serão criados novos conjuntos (X_train_transf e X_test_transf) onde haverá a remoção dessas features. Assim, serão mantidos dois tipos de conjuntos para comparação de desempenho:

* Conjunto sem as features constantes e features correlacionadas (X_train / X_test)
* Conjunto sem as features constantes, correlacionadas e removidas recursivamente (X_train_transf / X_test_transf)

In [16]:
# Feature selection

# Instanciando validação cruzada estratificada
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=None) # shuffle=False --> Não embaralha
                                                                    # random_state=None --> sem semente para embaralhamento

# Dropagem de Features constantes
drop_const = DropConstantFeatures()
# Features constantes
X_train_transf = drop_const.fit_transform(X_train)
X_test_transf = drop_const.transform(X_test)
print(f'Features excluídas na etapa "dropar features constantes"\n:{drop_const.features_to_drop_}\n')


# Features com alta correlação
smart_corr_selection = SmartCorrelatedSelection(method='pearson',threshold=0.75,selection_method='model_performance',estimator=RandomForestClassifier())
X_train = smart_corr_selection.fit_transform(X_train,y_train)
X_test = smart_corr_selection.transform(X_test)
print(f'Features excluídas na etapa "dropar features correlacionadas" (para corr. Pearson > 0.75):\n{smart_corr_selection.features_to_drop_}\n')


# Adição recursiva
recursive_feat_add = RecursiveFeatureAddition(RandomForestClassifier(),scoring='recall',cv=skf)
X_train_transf_add = recursive_feat_add.fit_transform(X_train,y_train)
X_test_transf_add = recursive_feat_add.transform(X_test)
print(f'Features excluídas na etapa "adição recursiva de features":\n{recursive_feat_add.features_to_drop_}\n')


# Eliminação recursiva
recursive_feat_drop = RecursiveFeatureElimination(RandomForestClassifier(),scoring='recall',cv=skf)
X_train_transf_drop = recursive_feat_drop.fit_transform(X_train,y_train)
X_test_transf_drop = recursive_feat_drop.transform(X_test)
print(f'Features excluídas na etapa "eliminação recursiva de features":\n{recursive_feat_drop.features_to_drop_}\n')


# Features finais, agrupando as remanescentes da adição e da dropagem recursiva
feat=list(set(list(X_train_transf_add.columns)+list(X_train_transf_drop.columns)))
# nota: 'set' converte a lista em um conjunto que não permite duplicatas
print(f'As features finais são:\n{feat}')


Features excluídas na etapa "dropar features constantes"
:[' Net Income Flag']

Features excluídas na etapa "dropar features correlacionadas" (para corr. Pearson > 0.75):
[' ROA(C) before interest and depreciation before interest', ' ROA(A) before interest and % after tax', ' ROA(B) before interest and depreciation after tax', ' Realized Sales Gross Margin', ' Operating Profit Rate', ' Pre-tax net Interest Rate', ' After-tax net Interest Rate', ' Cash flow rate', ' Net Value Per Share (B)', ' Net Value Per Share (A)', ' Cash Flow Per Share', ' Operating Profit Per Share (Yuan ¥)', ' Per Share Net profit before tax (Yuan ¥)', ' Regular Net Profit Growth Rate', ' Net worth/Assets', ' Borrowing dependency', ' Contingent liabilities/Net worth', ' Net profit before tax/Paid-in capital', ' Net Worth Turnover Rate (times)', ' Current Assets/Total Assets', ' Current Liability to Assets', ' Current Liabilities/Equity', ' Working capitcal Turnover Rate', ' Cash Flow to Sales', ' Current Liabilit

In [17]:
# Quantidade de Features Eliminadas
ini=int(len(list(X_train_ori.columns)))
fim=int(len(feat))
tot=ini-fim
print(f'No total, foram eliminadas {tot} features, reduzindo de {ini} para {fim}.\n')

No total, foram eliminadas 76 features, reduzindo de 95 para 19.



In [18]:
# Exportando um relatório de features excluídas
var_adic_rec=list(recursive_feat_add.features_to_drop_)
var_elim_rec=list(recursive_feat_drop.features_to_drop_)
var_drop = list(set(var_adic_rec) | set(var_elim_rec))
with open('Features excluídas.txt', 'a') as arquivo:
    arquivo.write('Features dropadas na etapa de adição recursiva:' + '\n\n')
    for item in var_adic_rec:
        arquivo.write(str(item) + ' \n')
    arquivo.write('\n\n')

    arquivo.write('Features dropadas na etapa de eliminação recursiva:' + '\n\n')
    for item in var_elim_rec:
        arquivo.write(str(item) + ' \n')
    arquivo.write('\n\n')

    arquivo.write('Features totais dropadas:' + '\n\n')
    for item in var_drop:
        arquivo.write(str(item) + ' \n')
    arquivo.write('\n')

In [19]:
# Conjunto de dados com as features finais
X_train_transf = X_train[feat]
X_test_transf = X_test[feat]


# Backup para EDA somente com as variáveis seleciondas
df_eda=df[feat]
df_eda['Target']=df['Bankrupt?']

In [20]:
# Análise exploratória dos dados já com features selecionadas
eda = create_report(df_eda)
eda.save('EDA - Features Selecionadas.html')



Report has been saved to EDA - Features Selecionadas.html!


In [21]:
# Padronização

# Instanciando
std_scaler = StandardScaler()         #Para dados X, sem feature selection
std_scaler_transf = StandardScaler()  #Para dados X, tansformados com feature selection

# Aplicando fit e transform
X_train = std_scaler.fit_transform(X_train) #fit e transform
X_test = std_scaler.transform(X_test)       #somente transform

# Aplicando fit e transform
X_train_transf = std_scaler_transf.fit_transform(X_train_transf) #fit e transform
X_test_transf = std_scaler_transf.transform(X_test_transf)       #somente transform

#4) Máquina preditiva

In [22]:
# Instanciando os modelos:
lr=LogisticRegression()

dt=DecisionTreeClassifier()

rf=RandomForestClassifier()

gbm=GradientBoostingClassifier()

knn=KNeighborsClassifier()

nb=GaussianNB()

svm=SVC()

# Lista de modelos
modelos = [lr,dt,rf,gbm,knn,nb,svm]

###4.1) Modelo Baseline SEM Feature Selection

In [23]:
# Modelo baseline com dados *SEM* Feature Selection

with open('Baseline SEM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas
  print('Modelo baseline, *SEM* Feature Selection')
  for model in modelos:
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    print('='*60)
    print(model)
    print(classification_report(y_pred,y_test))
    arquivo.write('='*60 + '\n' + str(model) +'\n' + str(classification_report(y_pred,y_test)) + '\n')

Modelo baseline, *SEM* Feature Selection
LogisticRegression()
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1350
           1       0.14      0.43      0.21        14

    accuracy                           0.97      1364
   macro avg       0.57      0.70      0.59      1364
weighted avg       0.99      0.97      0.97      1364

DecisionTreeClassifier()
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1321
           1       0.34      0.35      0.34        43

    accuracy                           0.96      1364
   macro avg       0.66      0.66      0.66      1364
weighted avg       0.96      0.96      0.96      1364

RandomForestClassifier()
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1354
           1       0.11      0.50      0.19        10

    accuracy                           0.97      1364
   macro avg    

###4.2) Modelo Baseline COM Feature Selection

In [24]:
# Modelo baseline com dados *COM* Feature Selection

with open('Baseline COM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas
  print('Modelo baseline, *COM* Feature Selection')
  for model in modelos:
    model.fit(X_train_transf,y_train)
    y_pred_transf = model.predict(X_test_transf)
    print('='*60)
    print(model)
    print(classification_report(y_pred_transf,y_test))
    arquivo.write('='*60 + '\n' + str(model) +'\n' + str(classification_report(y_pred_transf,y_test)) + '\n')


Modelo baseline, *COM* Feature Selection
LogisticRegression()
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1352
           1       0.11      0.42      0.18        12

    accuracy                           0.97      1364
   macro avg       0.55      0.69      0.58      1364
weighted avg       0.99      0.97      0.98      1364

DecisionTreeClassifier()
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1324
           1       0.25      0.28      0.26        40

    accuracy                           0.95      1364
   macro avg       0.61      0.63      0.62      1364
weighted avg       0.96      0.95      0.96      1364

RandomForestClassifier()
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1358
           1       0.09      0.67      0.16         6

    accuracy                           0.97      1364
   macro avg    

###4.3) Validação cruzada estratificada SEM Feature Selection

In [25]:
# Validação cruzada com split estratificado,  com dados *SEM* Feature Selection

# Instanciando novamente o kfold
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

# Tranformando os dados y em array
y_train = np.array(y_train)

with open('StratifiedKFold - SEM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas

  print('Validação cruzada estratificada, *SEM* Feature Selection')
  # Criandoas iterações
  for model in modelos:
    arquivo.write('='*60  + '\n' + str(model) + '\n')
    print('='*60)
    print(model)
    soma_recal=0    #Zera o valor da soma do recall
    num_iter=1      #Reinicia o valor de i

    for fold, (i_train, i_test) in enumerate(skf.split(X_train, y_train)):

      X_train_skf  ,  X_test_skf = X_train[i_train]  ,  X_train[i_test]   #Definindo X_train e X_test no fold
      y_train_skf  ,  y_test_skf = y_train[i_train]  ,  y_train[i_test]   #Definindo y_train e y_test no fold

      model.fit(X_train_skf, y_train_skf)                               #Treinando o modelo no fold
      y_pred_skf = model.predict(X_test_skf)                            #Predição no fold
      score = round(recall_score(y_pred_skf,y_test_skf),1)              #Calcula o recall no fold

      print(f'Recall fold {num_iter} de {skf.n_splits}: {round(100*recall_score(y_pred_skf,y_test_skf),1)}%')
      arquivo.write('Recall fold ' + str(num_iter) + ' de ' + str(skf.n_splits)  + ': ' + str(round(100*recall_score(y_pred_skf,y_test_skf),1) ) + '%\n' )

      soma_recal = soma_recal + score   #Faz o recall acumulado
      num_iter=num_iter+1               #Acrescenta mais 1 no texto printado (mais um Fold)

    print(f'Recall médio: {round(100 * soma_recal / skf.n_splits , 1)}%')  #Faz a média do recall final
    arquivo.write('Recall médio:'  + str(round(100 * soma_recal / skf.n_splits , 1)) + '%\n')

Validação cruzada estratificada, *SEM* Feature Selection
LogisticRegression()
Recall fold 1 de 5: 41.2%
Recall fold 2 de 5: 33.3%
Recall fold 3 de 5: 37.5%
Recall fold 4 de 5: 55.6%
Recall fold 5 de 5: 42.9%
Recall médio: 42.0%
DecisionTreeClassifier()
Recall fold 1 de 5: 26.8%
Recall fold 2 de 5: 28.6%
Recall fold 3 de 5: 25.6%
Recall fold 4 de 5: 34.5%
Recall fold 5 de 5: 23.5%
Recall médio: 28.0%
RandomForestClassifier()
Recall fold 1 de 5: 50.0%
Recall fold 2 de 5: 58.3%
Recall fold 3 de 5: 100.0%
Recall fold 4 de 5: 75.0%
Recall fold 5 de 5: 40.0%
Recall médio: 66.0%
GradientBoostingClassifier()
Recall fold 1 de 5: 26.3%
Recall fold 2 de 5: 40.0%
Recall fold 3 de 5: 57.9%
Recall fold 4 de 5: 75.0%
Recall fold 5 de 5: 21.4%
Recall médio: 46.0%
KNeighborsClassifier()
Recall fold 1 de 5: 33.3%
Recall fold 2 de 5: 44.4%
Recall fold 3 de 5: 85.7%
Recall fold 4 de 5: 85.7%
Recall fold 5 de 5: 40.0%
Recall médio: 58.0%
GaussianNB()
Recall fold 1 de 5: 3.0%
Recall fold 2 de 5: 3.2%
Recall

###4.4) Validação cruzada estratificada COM Feature Selection

In [26]:
# Validação cruzada com split estratificado,  com dados *COM* Feature Selection

# Instanciando novamente o kfold
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=None)

# y_train já transformado em array anteriormente

with open('StratifiedKFold - COM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas
  print('Validação cruzada estratificada, *COM* Feature Selection')

  # Criandoas iterações
  for model in modelos:
    arquivo.write('='*60  + '\n' + str(model) + '\n')
    print('='*60)
    print(model)
    soma_recal=0    #Zera o valor da soma do recall
    num_iter=1      #Reinicia o valor de i

    for fold, (i_train, i_test) in enumerate(skf.split(X_train_transf, y_train)):

      X_train_skf  ,  X_test_skf = X_train_transf[i_train]  ,  X_train_transf[i_test]   #Definindo X_train e X_test no fold
      y_train_skf  ,  y_test_skf = y_train[i_train]  ,  y_train[i_test]                 #Definindo y_train e y_test no fold

      model.fit(X_train_skf, y_train_skf)                               #Treinando o modelo no fold
      y_pred_skf = model.predict(X_test_skf)                            #Predição no fold
      score = round(recall_score(y_pred_skf,y_test_skf),1)              #Calcula o recall no fold

      print(f'Recall fold {num_iter} de {skf.n_splits}: {round(100*recall_score(y_pred_skf,y_test_skf),1)}%')
      arquivo.write('Recall fold ' + str(num_iter) + ' de ' + str(skf.n_splits)  + ': ' + str(round(100*recall_score(y_pred_skf,y_test_skf),1) ) + '%\n' )

      soma_recal = soma_recal + score   #Faz o recall acumulado
      num_iter=num_iter+1               #Acrescenta mais 1 no texto printado (mais um Fold)

    print(f'Recall médio: {round(100 * soma_recal / skf.n_splits , 1)}%')  #Faz a média do recall final
    arquivo.write('Recall médio:'  + str(round(100 * soma_recal / skf.n_splits , 1)) + '%\n')

Validação cruzada estratificada, *COM* Feature Selection
LogisticRegression()
Recall fold 1 de 5: 25.0%
Recall fold 2 de 5: 41.7%
Recall fold 3 de 5: 64.7%
Recall fold 4 de 5: 40.0%
Recall fold 5 de 5: 50.0%
Recall médio: 42.0%
DecisionTreeClassifier()
Recall fold 1 de 5: 38.5%
Recall fold 2 de 5: 35.6%
Recall fold 3 de 5: 28.6%
Recall fold 4 de 5: 28.2%
Recall fold 5 de 5: 20.6%
Recall médio: 32.0%
RandomForestClassifier()
Recall fold 1 de 5: 45.5%
Recall fold 2 de 5: 54.5%
Recall fold 3 de 5: 77.8%
Recall fold 4 de 5: 70.0%
Recall fold 5 de 5: 40.0%
Recall médio: 58.0%
GradientBoostingClassifier()
Recall fold 1 de 5: 35.0%
Recall fold 2 de 5: 45.0%
Recall fold 3 de 5: 50.0%
Recall fold 4 de 5: 57.1%
Recall fold 5 de 5: 30.8%
Recall médio: 44.0%
KNeighborsClassifier()
Recall fold 1 de 5: 23.1%
Recall fold 2 de 5: 35.7%
Recall fold 3 de 5: 46.2%
Recall fold 4 de 5: 50.0%
Recall fold 5 de 5: 37.5%
Recall médio: 40.0%
GaussianNB()
Recall fold 1 de 5: 3.2%
Recall fold 2 de 5: 3.1%
Recall 

###4.5) Tunagem de hiperparâmetros COM Feature Selection

Os modelos passarão por otimização de hiperparâmetros e, posteriormente, um modelo Ensamble do tipo Stacking será criado (Classificador de Votação).

Par isso, serão selecionados algoritmos diferntes, que comentem erros que não estão relacionados entre si.

Sendo assim, dentre os três modelos baseados em árvores (DecisionTree, RandomForest e GradienteBoosting), será selecionado o RandomForest, por apresentar melhor desempenho nos testes anteriores.

Os Modelos NaiveBayes e SuportVectorMachine (SVM), mesmo sendo um algoritmos que cometem erros não relacionados com os erros dos demais algoritmos, serão descartado por apresentarem desempenho muito abaixo em todos os testes realizados.

Os algoritmos serão otimizados com base no conjunto resultante da Feature Selection pois, no geral, houve melhora no desempenho.
<br><br>

In [27]:
# Predições com modelos Tunados

with open('Modelos tunados - COM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas

  print('Modelos tunados, com dados *COM* Feature Selection')

  # LogisticRegressor
  grid_lr = {'class_weight':[None,'balanced'],  'C':[0.01,0.1,1],  'solver':['lbfgs','sag'],  'penalty':['l1','l2']}
  lr_opt=GridSearchCV(lr, grid_lr, cv=skf)
  lr_opt.fit(X_train_transf,y_train)
  y_pred = lr_opt.predict(X_test_transf)
  print('='*60)
  print('Logistic Regressor:')
  print(classification_report(y_pred,y_test))
  arquivo.write('='*60  + '\nLogisticRegressor:\n' + str(classification_report(y_pred,y_test)) + '\n')

  # RandomForest
  grid_rf = {'max_depth':[None,3,5,7,10], 'criterion':['gini','entropy'],  'n_estimators':[100,300,500], 'class_weight':[None,'balanced']}
  rf_opt = GridSearchCV(rf, grid_rf, cv=skf)
  rf_opt.fit(X_train_transf,y_train)
  y_pred = rf_opt.predict(X_test_transf)
  print('='*60)
  print('Random Forest:')
  print(classification_report(y_pred,y_test))
  arquivo.write('='*60  + '\nRandomForest:\n' + str(classification_report(y_pred,y_test)) + '\n')


  # KNN
  grid_knn = {'weights':['uniform','distance'],  'n_neighbors':[1,2,3]}
  knn_opt = GridSearchCV(knn, grid_knn, cv=skf)
  knn_opt.fit(X_train_transf,y_train)
  y_pred = knn_opt.predict(X_test_transf)
  print('='*60)
  print('KNN:')
  print(classification_report(y_pred,y_test))
  arquivo.write('='*60  + '\nKNN:\n' + str(classification_report(y_pred,y_test)) + '\n')


Modelos tunados, com dados *COM* Feature Selection
Logistic Regressor:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1359
           1       0.05      0.40      0.08         5

    accuracy                           0.97      1364
   macro avg       0.52      0.68      0.53      1364
weighted avg       0.99      0.97      0.98      1364

Random Forest:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1359
           1       0.09      0.80      0.16         5

    accuracy                           0.97      1364
   macro avg       0.55      0.89      0.57      1364
weighted avg       1.00      0.97      0.98      1364

KNN:
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1354
           1       0.07      0.30      0.11        10

    accuracy                           0.96      1364
   macro avg       0.53      0.63    

In [44]:
# Visualizando os melhores hiperparâmetros de cada modelo

for i in (lr_opt,rf_opt,knn_opt):
  print('='*80)
  print(f'{i}:\n{i.best_params_}')

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(),
             param_grid={'C': [0.01, 0.1, 1],
                         'class_weight': [None, 'balanced'],
                         'penalty': ['l1', 'l2'], 'solver': ['lbfgs', 'sag']}):
{'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'lbfgs'}
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=RandomForestClassifier(),
             param_grid={'class_weight': [None, 'balanced'],
                         'criterion': ['gini', 'entropy'],
                         'max_depth': [None, 3, 5, 7, 10],
                         'n_estimators': [100, 300, 500]}):
{'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'n_estimators': 300}
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': [1, 2

###4.6) Criação de modelo Ensemble

In [47]:
from sklearn.ensemble import VotingClassifier

with open('Modelo ensamble - COM FeatSelect.txt', 'a') as arquivo:  #cria arquivo txt com as métricas
  arquivo.write('\nModelos:\n' + 'LogisticRegressor\nRandomForest\nKNN\n' + '='*60 + '\n')

  # Criando o ensemble         #voting='hard' --> votos majoritários |  voting='soft' --> média das probabilidades
  ensemble = VotingClassifier(estimators=[ ('lr', lr_opt)  ,  ('rf', rf_opt)  ,  ('knn', knn_opt)  ],  voting='hard')

  # Treinando o modelo ensemble
  ensemble.fit(X_train_transf, y_train)

  # y_pred_ens = ensemble.predict(X_test_transf)
  y_pred_ens = ensemble.predict(X_test_transf)

  print('='*60)
  print('Modelo ensemble:')
  print(classification_report(y_pred_ens,y_test))
  print('='*60)
  arquivo.write( str(classification_report(y_pred,y_test)) + '\n')

Modelo ensemble:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1360
           1       0.05      0.50      0.08         4

    accuracy                           0.97      1364
   macro avg       0.52      0.73      0.53      1364
weighted avg       1.00      0.97      0.98      1364



###4.7) Exportando o modelo final

O modelo RandmoForest, com ajuste de hiperparâmetros foi o que melhor performou, logo, este será o modelo final que poderá ser utilizado em novas predições.
<br><br>

In [88]:
# Exportando o modelo RandomForest

model = rf_opt.best_estimator_

import pickle

# Salvar o modelo
with open('Modelo RandomForest.pkl', 'wb') as arquivo:
    pickle.dump(model, arquivo)