#1) Introduction


Author: Jackson Corrêa

LinkedIn: https://www.linkedin.com/in/jackson-corr%C3%AAa/
<br><br>
This is a Data Science project whose goal is to develop a predictive machine learning model that classifies companies at risk of bankruptcy, using the bankruptcy data collected by the Taiwan Economic Journal for the years 1999-2009. To save time, the exploratory data analysis step was automated with the 'dataprep' package. As a result, the analysis is more general and focuses on the distribution of the variables and their correlations. The 'feature_engine' library was also used to automate feature selection and make the model computationally cheaper.
<br>
<br>
The project consists of the following steps:

* Exploratory data analysis, done manually, simpler and with fewer insights;
* Data preprocessing, with a train/test split, feature selection for the model, and standardization;
* Predictive modeling, with a performance comparison of several candidate models, hyperparameter tuning, and a Voting Classifier ensemble model.
<br><br>
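
As a rough illustration of what a hard-voting ensemble does, the sketch below combines the class predictions of several models by majority vote. The `hard_vote` helper and the toy predictions are assumptions made for this example only; they are not the project's code nor the scikit-learn implementation.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label per sample.

    predictions_per_model: list of equal-length label lists, one list per model.
    """
    voted = []
    # zip(*...) regroups the predictions so that each tuple holds one sample's votes
    for sample_votes in zip(*predictions_per_model):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Toy predictions from three hypothetical models for four companies
preds_a = [0, 1, 1, 0]
preds_b = [0, 1, 0, 0]
preds_c = [1, 1, 0, 0]

ensemble = hard_vote([preds_a, preds_b, preds_c])
print(ensemble)
```

With an odd number of binary classifiers there are no ties, which keeps the vote unambiguous.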

The main metric for evaluating model performance is the Recall Score. The best models will therefore be those that produce the fewest false negatives.
<br><br>
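
To make the metric concrete: recall is TP / (TP + FN), the share of truly bankrupt companies that the model flags. The sketch below computes it in plain Python on made-up labels (the `recall` and `precision` helpers and the toy data are illustrative assumptions, not part of the project code). It also shows that argument order matters: swapping the true and predicted labels yields precision instead of recall.

```python
def recall(y_true, y_pred, positive=1):
    """TP / (TP + FN): share of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred, positive=1):
    """TP / (TP + FP): share of predicted positives that are actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy data: 4 bankrupt companies (1), 6 healthy ones (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

rec = recall(y_true, y_pred)      # 1 true positive, 3 false negatives
prec = precision(y_true, y_pred)  # 1 true positive, 1 false positive
swapped = recall(y_pred, y_true)  # arguments in the wrong order

print(rec, prec, swapped)
```

Here three of the four bankrupt companies are missed, so recall is low even though most predictions are correct.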

##1.1) Data source

The data used in this project was extracted from the Kaggle platform.

Access link: https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction
<br><br>

##1.2) Conventions

The suffixes and abbreviations below are used in variable names to make the code easier to read:

* df - dataframe
* aux - auxiliary
* X_train - training data (uppercase 'X')
* X_test - test data (uppercase 'X')
* y_train - training labels (lowercase 'y')
* y_test - test labels (lowercase 'y')
* std - standard / standardization
* norm - normalized / normalization
* transf - transformed
* base - baseline
* ori - original
* ens - ensemble
* opt - optimized / optimal / tuned
* eda - Exploratory Data Analysis
* os - oversampling
* us - undersampling
* new - new / unseen
* prod - production
* raw - raw / unprocessed
<br><br>

##1.3) Library imports

In [None]:
# Installing the required libraries
!pip install category_encoders
!pip install feature_engine
!pip install dataprep

In [2]:
# Importing the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split, StratifiedKFold
from sklearn.metrics import make_scorer, accuracy_score, recall_score, precision_score, f1_score, classification_report

from sklearn.ensemble import VotingClassifier

from sklearn.preprocessing import StandardScaler

from feature_engine.selection import DropConstantFeatures, SmartCorrelatedSelection, RecursiveFeatureAddition, RecursiveFeatureElimination

from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours

from dataprep.eda import create_report

import pickle

import shutil

import warnings
warnings.filterwarnings("ignore")

##1.4) Dataset import

In [3]:
# Importing the dataset in csv format
data=pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projetos pessoais/01. Projetos Machine Learning/Falência de empresas Tailandesas/data.csv')

# Creating a backup
df=data.copy()

#2) Exploratory analysis

NOTE: the automated exploratory analysis is performed after the Feature Selection step, using the Dataprep package, on the dataset that remains once features have been removed by Feature Selection.
Only some basic information about the dataset is shown below.


In [4]:
# Dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64
 2    ROA(A) before interest and % after tax                   6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64
 4    Operating Gross Margin                                   6819 non-null   float64
 5    Realized Sales Gross Margin                              6819 non-null   float64
 6    Operating Profit Rate                                    6819 non-null   float64
 7    Pre-tax net Interest Rate                                6819 non-null   float64
 8    After-tax net Int

In [5]:
# Descriptive statistics of the dataset - part 01
df.iloc[:,0:20].describe()

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),Operating Expense Rate,Research and development expense rate,Cash flow rate,Interest-bearing debt interest rate,Tax rate (A),Net Value Per Share (B),Net Value Per Share (A),Net Value Per Share (C),Persistent EPS in the Last Four Seasons
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,0.781381,1995347000.0,1950427000.0,0.467431,16448010.0,0.115001,0.190661,0.190633,0.190672,0.228813
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,0.012679,3237684000.0,2598292000.0,0.017036,108275000.0,0.138667,0.03339,0.033474,0.03348,0.033263
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,0.781567,0.0001566874,0.000128188,0.461558,0.0002030203,0.0,0.173613,0.173613,0.173676,0.214711
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,0.781635,0.0002777589,509000000.0,0.46508,0.0003210321,0.073489,0.1844,0.1844,0.1844,0.224544
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,0.781735,4145000000.0,3450000000.0,0.471004,0.0005325533,0.205841,0.19957,0.19957,0.199612,0.23882
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9980000000.0,1.0,990000000.0,1.0,1.0,1.0,1.0,1.0


In [6]:
# Descriptive statistics of the dataset - part 02
df.iloc[:,20:40].describe()

Unnamed: 0,Cash Flow Per Share,Revenue Per Share (Yuan ¥),Operating Profit Per Share (Yuan ¥),Per Share Net profit before tax (Yuan ¥),Realized Sales Gross Profit Growth Rate,Operating Profit Growth Rate,After-tax Net Profit Growth Rate,Regular Net Profit Growth Rate,Continuous Net Profit Growth Rate,Total Asset Growth Rate,Net Value Growth Rate,Total Asset Return Growth Rate Ratio,Cash Reinvestment %,Current Ratio,Quick Ratio,Interest Expense Ratio,Total debt/Total net worth,Debt ratio %,Net worth/Assets,Long-term fund suitability ratio (A)
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.323482,1328641.0,0.109091,0.184361,0.022408,0.84798,0.689146,0.68915,0.217639,5508097000.0,1566212.0,0.264248,0.379677,403285.0,8376595.0,0.630991,4416337.0,0.113177,0.886823,0.008783
std,0.017611,51707090.0,0.027942,0.03318,0.012079,0.010752,0.013853,0.01391,0.010063,2897718000.0,114159400.0,0.009634,0.020737,33302160.0,244684700.0,0.011238,168406900.0,0.05392,0.05392,0.028153
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.317748,0.01563138,0.096083,0.17037,0.022065,0.847984,0.68927,0.68927,0.21758,4860000000.0,0.0004409689,0.263759,0.374749,0.007555047,0.004725903,0.630612,0.003007049,0.072891,0.851196,0.005244
50%,0.322487,0.02737571,0.104226,0.179709,0.022102,0.848044,0.689439,0.689439,0.217598,6400000000.0,0.0004619555,0.26405,0.380425,0.01058717,0.007412472,0.630698,0.005546284,0.111407,0.888593,0.005665
75%,0.328623,0.04635722,0.116155,0.193493,0.022153,0.848123,0.689647,0.689647,0.217622,7390000000.0,0.0004993621,0.264388,0.386731,0.01626953,0.01224911,0.631125,0.009273293,0.148804,0.927109,0.006847
max,1.0,3020000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9990000000.0,9330000000.0,1.0,1.0,2750000000.0,9230000000.0,1.0,9940000000.0,1.0,1.0,1.0


In [7]:
# Descriptive statistics of the dataset - part 03
df.iloc[:,40:60].describe()

Unnamed: 0,Borrowing dependency,Contingent liabilities/Net worth,Operating profit/Paid-in capital,Net profit before tax/Paid-in capital,Inventory and accounts receivable/Net value,Total Asset Turnover,Accounts Receivable Turnover,Average Collection Days,Inventory Turnover Rate (times),Fixed Assets Turnover Frequency,Net Worth Turnover Rate (times),Revenue per person,Operating profit per person,Allocation rate per person,Working Capital to Total Assets,Quick Assets/Total Assets,Current Assets/Total Assets,Cash/Total Assets,Quick Assets/Current Liability,Cash/Current Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.374654,0.005968,0.108977,0.182715,0.402459,0.141606,12789710.0,9826221.0,2149106000.0,1008596000.0,0.038595,2325854.0,0.400671,11255790.0,0.814125,0.400132,0.522273,0.124095,3592902.0,37159990.0
std,0.016286,0.012188,0.027782,0.030785,0.013324,0.101145,278259800.0,256358900.0,3247967000.0,2477557000.0,0.03668,136632700.0,0.03272,294506300.0,0.059054,0.201998,0.218112,0.139251,171620900.0,510350900.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.370168,0.005366,0.096105,0.169376,0.397403,0.076462,0.0007101336,0.00438653,0.0001728256,0.0002330013,0.021774,0.01043285,0.392438,0.004120529,0.774309,0.241973,0.352845,0.033543,0.005239776,0.001973008
50%,0.372624,0.005366,0.104133,0.178456,0.400131,0.118441,0.0009678107,0.006572537,0.0007646743,0.0005930942,0.029516,0.01861551,0.395898,0.007844373,0.810275,0.386451,0.51483,0.074887,0.007908898,0.004903886
75%,0.376271,0.005764,0.115927,0.191607,0.404551,0.176912,0.001454759,0.008972876,4620000000.0,0.003652371,0.042903,0.03585477,0.401851,0.01502031,0.850383,0.540594,0.689051,0.161073,0.01295091,0.01280557
max,1.0,1.0,1.0,1.0,1.0,1.0,9740000000.0,9730000000.0,9990000000.0,9990000000.0,1.0,8810000000.0,1.0,9570000000.0,1.0,1.0,1.0,1.0,8820000000.0,9650000000.0


In [8]:
# Descriptive statistics of the dataset - part 04
df.iloc[:,60:80].describe()

Unnamed: 0,Current Liability to Assets,Operating Funds to Liability,Inventory/Working Capital,Inventory/Current Liability,Current Liabilities/Liability,Working Capital/Equity,Current Liabilities/Equity,Long-term Liability to Current Assets,Retained Earnings to Total Assets,Total income/Total expense,Total expense/Assets,Current Asset Turnover Rate,Quick Asset Turnover Rate,Working capitcal Turnover Rate,Cash Turnover Rate,Cash Flow to Sales,Fixed Assets to Assets,Current Liability to Liability,Current Liability to Equity,Equity to Long-term Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.090673,0.353828,0.277395,55806800.0,0.761599,0.735817,0.33141,54160040.0,0.934733,0.002549,0.029184,1195856000.0,2163735000.0,0.594006,2471977000.0,0.671531,1220121.0,0.761599,0.33141,0.115645
std,0.05029,0.035147,0.010469,582051600.0,0.206677,0.011678,0.013488,570270600.0,0.025564,0.012093,0.027149,2821161000.0,3374944000.0,0.008959,2938623000.0,0.009341,100754200.0,0.206677,0.013488,0.019529
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.053301,0.341023,0.277034,0.003163148,0.626981,0.733612,0.328096,0.0,0.931097,0.002236,0.014567,0.0001456236,0.0001417149,0.593934,0.0002735337,0.671565,0.08536037,0.626981,0.328096,0.110933
50%,0.082705,0.348597,0.277178,0.006497335,0.806881,0.736013,0.329685,0.001974619,0.937672,0.002336,0.022674,0.0001987816,0.0002247728,0.593963,1080000000.0,0.671574,0.196881,0.806881,0.329685,0.11234
75%,0.119523,0.360915,0.277429,0.01114677,0.942027,0.73856,0.332322,0.009005946,0.944811,0.002492,0.03593,0.0004525945,4900000000.0,0.594002,4510000000.0,0.671587,0.3722,0.942027,0.332322,0.117106
max,1.0,1.0,1.0,9910000000.0,1.0,1.0,1.0,9540000000.0,1.0,1.0,1.0,10000000000.0,10000000000.0,1.0,10000000000.0,1.0,8320000000.0,1.0,1.0,1.0


In [9]:
# Descriptive statistics of the dataset - part 05
df.iloc[:,80:].describe()

Unnamed: 0,Cash Flow to Total Assets,Cash Flow to Liability,CFO to Assets,Cash Flow to Equity,Current Liability to Current Assets,Liability-Assets Flag,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.649731,0.461849,0.593415,0.315582,0.031506,0.001173,0.80776,18629420.0,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,1.0,0.047578
std,0.047372,0.029943,0.058561,0.012961,0.030845,0.034234,0.040332,376450100.0,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.0,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.633265,0.457116,0.565987,0.312995,0.018034,0.0,0.79675,0.0009036205,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,1.0,0.024477
50%,0.645366,0.45975,0.593266,0.314953,0.027597,0.0,0.810619,0.002085213,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,1.0,0.033798
75%,0.663062,0.464236,0.624769,0.317707,0.038375,0.0,0.826455,0.005269777,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,1.0,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [79]:
# Checking the class distribution (as proportions)
df['Bankrupt?'].value_counts(normalize=True)

0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64

The classes are highly imbalanced.
<br><br>
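
A quick calculation shows why this imbalance makes accuracy misleading. The sketch below takes the row count and class proportion printed above and evaluates a hypothetical "model" that always predicts class 0 (not bankrupt); the variable names are assumptions made for this example.

```python
n_total = 6819            # rows in the dataset (from df.info())
positive_rate = 0.032263  # share of class 1 (from value_counts)

n_pos = round(n_total * positive_rate)  # bankrupt companies
n_neg = n_total - n_pos                 # non-bankrupt companies

# Hypothetical baseline that predicts 0 for every company
tp, fn = 0, n_pos   # every bankrupt company is missed
tn, fp = n_neg, 0   # every healthy company is classified correctly

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f'Bankrupt companies: {n_pos}')
print(f'Accuracy: {accuracy:.4f} | Recall: {recall:.4f}')
```

The baseline reaches high accuracy while detecting no bankruptcies at all, which is why recall is the metric tracked throughout this notebook.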

#3) Preprocessing

In [11]:
# Splitting the data into X and y
X = df.drop('Bankrupt?',axis=1)
y=df['Bankrupt?']

# Holding out 5% of the data to test the model in production
X_note, X_prod, y_note, y_prod = train_test_split(X, y , test_size=0.05 , random_state=42 , stratify=y)

# Train/test split (before any data manipulation, to avoid data leakage)
X_train, X_test, y_train, y_test = train_test_split(X_note, y_note, test_size=0.2, random_state=42, stratify = y_note)

# Backup copies of X with the original features
X_train_ori = X_train.copy()
X_test_ori = X_test.copy()

In [12]:
# Checking the class proportions in the full data and in each split
print(f'Proportions in the original dataset:\n{y.value_counts(normalize=True)}')
print('\n')
print(f'Proportions in the dataset held out to test the model in production:\n{y_prod.value_counts(normalize=True)}')
print('\n')
print(f'Proportions in the notebook modeling dataset:\n{y_note.value_counts(normalize=True)}')
print('\n')
print(f'Proportions in the notebook training data:\n{y_train.value_counts(normalize=True)}')
print('\n')
print(f'Proportions in the notebook test data:\n{y_test.value_counts(normalize=True)}')

Proportions in the original dataset:
0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64


Proportions in the dataset held out to test the model in production:
0    0.967742
1    0.032258
Name: Bankrupt?, dtype: float64


Proportions in the notebook modeling dataset:
0    0.967737
1    0.032263
Name: Bankrupt?, dtype: float64


Proportions in the notebook training data:
0    0.967773
1    0.032227
Name: Bankrupt?, dtype: float64


Proportions in the notebook test data:
0    0.967593
1    0.032407
Name: Bankrupt?, dtype: float64
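
The near-identical proportions above are what stratification is meant to guarantee. As a minimal sketch of the idea (not scikit-learn's actual algorithm), the hypothetical helper below splits each class separately so that both parts keep the same class ratio; the function name and toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size, seed=42):
    """Split indices into train/test while keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                     # shuffle within the class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])        # same fraction taken from every class
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 3% positives, roughly like the bankruptcy data
labels = [1] * 30 + [0] * 970
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Without stratification, a random 20% sample of such rare positives could easily end up with noticeably more or fewer bankrupt companies than the training set.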


<br><br>
In the Feature Selection step, constant and highly correlated features are dropped permanently, so those changes are applied directly to X_train and X_test.

In the recursive addition and elimination steps, features are not removed from X_train and X_test. Instead, new sets (X_train_transf and X_test_transf) are created with those features removed. Two kinds of sets are therefore kept for performance comparison:

* Set without constant features and without correlated features (X_train / X_test)
* Set without constant features, without correlated features, and without the recursively removed features (X_train_transf / X_test_transf)

In [13]:
# Feature selection - constant features

# Instantiating
drop_const = DropConstantFeatures()

# Fitting on the training data and transforming both sets
X_train = drop_const.fit_transform(X_train)
X_test = drop_const.transform(X_test)

# Extracting the remaining features
remaining_features_constant = list(X_train.columns)

# Extracting the removed features
droped_features_constant = list(drop_const.features_to_drop_)

# Summary report
print(f'Constant feature removal\n- Removed: {len(droped_features_constant)}\n- Remaining: {len(remaining_features_constant)}')

# Full report:
print('\nThe removed features are:\n')
for i in droped_features_constant:
  print(f'-{i}')

Constant feature removal
- Removed: 1
- Remaining: 94

The removed features are:

- Net Income Flag


In [14]:
# Feature selection - highly correlated features

# Instantiating the stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # shuffle=True --> shuffles the samples before splitting // random_state=42 --> fixes the shuffling seed

# Instantiating
smart_corr_selection = SmartCorrelatedSelection(method='pearson',threshold=0.75,selection_method='model_performance',estimator=RandomForestClassifier(random_state=42),cv=skf)

# Fitting on the training data and transforming both sets
X_train = smart_corr_selection.fit_transform(X_train,y_train)
X_test = smart_corr_selection.transform(X_test)

# Extracting the remaining features
remaining_features_corr = list(X_train.columns)

# Extracting the removed features
droped_features_corr = list(smart_corr_selection.features_to_drop_)

# Summary report
print(f'Highly correlated feature removal\n- Removed: {len(droped_features_corr)}\n- Remaining: {len(remaining_features_corr)}')

# Full report:
print('\nThe removed features are:\n')
for i in droped_features_corr:
  print(f'-{i}')

Highly correlated feature removal
- Removed: 27
- Remaining: 67

The removed features are:

- ROA(C) before interest and depreciation before interest
- ROA(A) before interest and % after tax
- ROA(B) before interest and depreciation after tax
- Operating Gross Margin
- Operating Profit Rate
- Pre-tax net Interest Rate
- After-tax net Interest Rate
- Net Value Per Share (A)
- Net Value Per Share (C)
- Operating Profit Per Share (Yuan ¥)
- Per Share Net profit before tax (Yuan ¥)
- Regular Net Profit Growth Rate
- Net worth/Assets
- Borrowing dependency
- Contingent liabilities/Net worth
- Operating profit/Paid-in capital
- Quick Assets/Total Assets
- Current Liability to Assets
- Operating Funds to Liability
- Current Liabilities/Liability
- Current Liabilities/Equity
- Working capitcal Turnover Rate
- Current Liability to Equity
- Equity to Long-term Liability
- Net Income to Total Assets
- Gross Profit to Sales
- Liability to Equity


In [15]:
# Feature selection - recursive feature addition

# Instantiating the stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # shuffle=True --> shuffles the samples before splitting // random_state=42 --> fixes the shuffling seed

# Instantiating
recursive_feat_add = RecursiveFeatureAddition(RandomForestClassifier(random_state=42),scoring='recall',cv=skf)

# Fitting on the training data and transforming both sets
X_train_add = recursive_feat_add.fit_transform(X_train,y_train)
X_test_add = recursive_feat_add.transform(X_test)

# Extracting the added features (this method adds features rather than eliminating them)
added_features = list(X_train_add.columns)

# Summary report
print(f'Recursive feature addition\n- Added: {len(added_features)}')

# Full report:
print('\nThe added features are:\n')
for i in added_features:
  print(f'-{i}')


Recursive feature addition
- Added: 4

The added features are:

- Continuous interest rate (after tax)
- Interest Expense Ratio
- Net Income to Stockholder's Equity
- Interest Coverage Ratio (Interest expense to EBIT)


In [17]:
# Feature selection - recursive elimination

# Instantiating the stratified cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) # shuffle=True --> shuffles the samples before splitting // random_state=42 --> fixes the shuffling seed

# Instantiating
recursive_feat_drop = RecursiveFeatureElimination(RandomForestClassifier(random_state=42),scoring='recall',cv=skf)

# Fitting on the training data and transforming both sets
X_train_drop = recursive_feat_drop.fit_transform(X_train,y_train)
X_test_drop = recursive_feat_drop.transform(X_test)

# Extracting the remaining features
remaining_features_eliminated = list(X_train_drop.columns)

# Extracting the eliminated features
eliminated_features = list(recursive_feat_drop.features_to_drop_)

# Summary report
print(f'Recursive feature elimination\n- Removed: {len(eliminated_features)}\n- Remaining: {len(remaining_features_eliminated)}')

# Full report:
print('\nThe removed features are:\n')
for i in eliminated_features:
  print(f'-{i}')

Recursive feature elimination
- Removed: 14
- Remaining: 53

The removed features are:

- Research and development expense rate
- Continuous Net Profit Growth Rate
- Total Asset Return Growth Rate Ratio
- Current Ratio
- Accounts Receivable Turnover
- Allocation rate per person
- Working Capital to Total Assets
- Quick Assets/Current Liability
- Working Capital/Equity
- Cash Turnover Rate
- Cash Flow to Sales
- Cash Flow to Liability
- Current Liability to Current Assets
- Equity to Liability


In [18]:
# Final features: union of the features kept by recursive addition and by recursive elimination
remaining_features_recursive=list(dict.fromkeys(  added_features  +  remaining_features_eliminated  ))
# note: dict.fromkeys removes duplicates and keeps a stable column order (a plain 'set' does not guarantee the order between runs)

In [19]:
# Number of features eliminated

# Total number of features in the original dataset
a = len(list(X_train_ori.columns))

# Total number of features dropped as constant or correlated
b = len(droped_features_constant) + len(droped_features_corr)

# Total number of features eliminated in the recursive step
c = ( a - b ) - len(remaining_features_recursive)


print(f'Total features eliminated across all steps: {b+c}\n')

print(f'''Datasets X_train / X_test:
  - {a-b} remaining features
  - {b} constant and correlated features removed\n''')

print(f'''Datasets X_train_transf / X_test_transf:
  - {a-b-c} remaining features
  - {b} constant and correlated features removed
  - {c} features eliminated recursively\n''')


Total features eliminated across all steps: 42

Datasets X_train / X_test:
  - 67 remaining features
  - 28 constant and correlated features removed

Datasets X_train_transf / X_test_transf:
  - 53 remaining features
  - 28 constant and correlated features removed
  - 14 features eliminated recursively
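
The totals above can be cross-checked with simple arithmetic from the counts reported by each selection step. The variable names below are assumptions for this check only.

```python
n_columns = 96               # columns reported by df.info(), target included
n_features = n_columns - 1   # predictors only (without 'Bankrupt?')

n_constant = 1     # dropped as constant
n_correlated = 27  # dropped as highly correlated
n_recursive = 14   # dropped in the recursive step

after_filters = n_features - n_constant - n_correlated   # X_train / X_test
after_recursive = after_filters - n_recursive            # X_train_transf / X_test_transf
total_removed = n_constant + n_correlated + n_recursive

print(after_filters, after_recursive, total_removed)
```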



In [20]:
# Datasets with the final features
X_train_transf = X_train[remaining_features_recursive]
X_test_transf = X_test[remaining_features_recursive]

# Copy for the EDA with only the selected features (.copy() avoids pandas' SettingWithCopyWarning)
df_eda=df[remaining_features_recursive].copy()
df_eda['Target']=df['Bankrupt?']

In [21]:
# Exploratory data analysis on the selected features
eda = create_report(df_eda)
eda.save('EDA - Features Selecionadas.html')



Report has been saved to EDA - Features Selecionadas.html!


In [22]:
# Standardization

# Instantiating
std_scaler = StandardScaler()         # For the X data, without feature selection
std_scaler_transf = StandardScaler()  # For the X data, transformed with feature selection

# Applying fit and transform
X_train = std_scaler.fit_transform(X_train) # fit and transform
X_test = std_scaler.transform(X_test)       # transform only

# Applying fit and transform
X_train_transf = std_scaler_transf.fit_transform(X_train_transf) # fit and transform
X_test_transf = std_scaler_transf.transform(X_test_transf)       # transform only

NOTE:
Although tree-based models do not require standardized data, the data was standardized so that the same dataset can be used with every algorithm.
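
The fit/transform split used above matters: the mean and standard deviation are learned from the training data only and then reused on the test data, so no test information leaks into preprocessing. A minimal plain-Python sketch of that idea for a single column follows; the `fit_scaler` and `transform` helpers and the toy values are assumptions for illustration.

```python
import statistics


def fit_scaler(train_column):
    """Learn mean and population standard deviation from the training data only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    return mean, (std or 1.0)  # avoid dividing by zero for constant columns


def transform(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(value - mean) / std for value in column]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_scaler(train)              # statistics come from train only
train_std = transform(train, mean, std)
test_std = transform(test, mean, std)      # test reuses the train statistics

print(mean, std)
print(train_std)
print(test_std)
```

After the transformation the training column has zero mean and unit variance, while the test column generally does not, which is expected.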

#4) Predictive model

In [23]:
# Instantiating the models:
lr=LogisticRegression()

dt=DecisionTreeClassifier(random_state=42)

rf=RandomForestClassifier(random_state=42)

gbm=GradientBoostingClassifier(random_state=42)

knn=KNeighborsClassifier()

nb=GaussianNB()

svm=SVC()

# Note: keep the seed 'random_state=42' in the algorithms to ensure the code is reproducible

# List of models
modelos = [lr,dt,rf,gbm,knn,nb,svm]

###4.1) Baseline model WITHOUT Feature Selection

In [24]:
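# Baseline model on the data *WITHOUT* Feature Selection
# NOTE: scikit-learn's signature is classification_report(y_true, y_pred). The call below passes
# (y_pred, y_test), so in the printed report precision and recall are swapped relative to the
# true labels, and 'support' counts predictions rather than actual cases.

print('Baseline model, *WITHOUT* Feature Selection')
for model in modelos:
  model.fit(X_train,y_train)
  y_pred = model.predict(X_test)
  print('='*60)
  print(model)
  print(classification_report(y_pred,y_test))

Baseline model, *WITHOUT* Feature Selection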
LogisticRegression()
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1285
           1       0.12      0.45      0.19        11

    accuracy                           0.97      1296
   macro avg       0.56      0.71      0.59      1296
weighted avg       0.99      0.97      0.98      1296

DecisionTreeClassifier(random_state=42)
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      1247
           1       0.29      0.24      0.26        49

    accuracy                           0.95      1296
   macro avg       0.63      0.61      0.62      1296
weighted avg       0.94      0.95      0.95      1296

RandomForestClassifier(random_state=42)
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1288
           1       0.12      0.62      0.20         8

    accuracy                           0

###4.2) Baseline model WITH Feature Selection

In [25]:
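# Baseline model on the data *WITH* Feature Selection
# NOTE: scikit-learn's signature is classification_report(y_true, y_pred). The call below passes
# (y_pred_transf, y_test), so in the printed report precision and recall are swapped relative to
# the true labels, and 'support' counts predictions rather than actual cases.

print('Baseline model, *WITH* Feature Selection')
for model in modelos:
  model.fit(X_train_transf,y_train)
  y_pred_transf = model.predict(X_test_transf)
  print('='*60)
  print(model)
  print(classification_report(y_pred_transf,y_test))


Baseline model, *WITH* Feature Selection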
LogisticRegression()
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1283
           1       0.10      0.31      0.15        13

    accuracy                           0.96      1296
   macro avg       0.54      0.64      0.56      1296
weighted avg       0.98      0.96      0.97      1296

DecisionTreeClassifier(random_state=42)
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1251
           1       0.36      0.33      0.34        45

    accuracy                           0.96      1296
   macro avg       0.67      0.66      0.66      1296
weighted avg       0.95      0.96      0.96      1296

RandomForestClassifier(random_state=42)
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1288
           1       0.12      0.62      0.20         8

    accuracy                           0

###4.3) Stratified cross-validation WITHOUT Feature Selection

In [26]:
# Cross-validation with stratified splits, on the data *WITHOUT* Feature Selection

# Instantiating the k-fold splitter again
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Converting y to an array
y_train = np.array(y_train)

print('Stratified cross-validation, *WITHOUT* Feature Selection')
# Iterating over the models
# NOTE: scikit-learn's signature is recall_score(y_true, y_pred); the calls below pass
# (y_pred_skf, y_test_skf), so each printed value is the precision relative to the true labels.
# NOTE: each fold score is rounded to one decimal place before it is accumulated,
# so the final value is the mean of the rounded fold scores.
for model in modelos:
  print('='*60)
  print(model)
  soma_recal=0    # Resets the running recall sum
  num_iter=1      # Resets the fold counter

  for fold, (i_train, i_test) in enumerate(skf.split(X_train, y_train)):

    X_train_skf  ,  X_test_skf = X_train[i_train]  ,  X_train[i_test]   # X_train and X_test for this fold
    y_train_skf  ,  y_test_skf = y_train[i_train]  ,  y_train[i_test]   # y_train and y_test for this fold

    model.fit(X_train_skf, y_train_skf)                               # Training the model on the fold
    y_pred_skf = model.predict(X_test_skf)                            # Predicting on the fold
    score = round(recall_score(y_pred_skf,y_test_skf),1)              # Computes the fold score

    print(f'Recall fold {num_iter} of {skf.n_splits}: {round(100*recall_score(y_pred_skf,y_test_skf),1)}%')

    soma_recal = soma_recal + score   # Accumulates the fold score
    num_iter=num_iter+1               # Advances the fold number shown in the printed text

  print(f'Mean recall: {round(100 * soma_recal / skf.n_splits , 1)}%')  # Mean of the fold scores

Stratified cross-validation, *WITHOUT* Feature Selection
LogisticRegression()
Recall fold 1 of 5: 42.9%
Recall fold 2 of 5: 65.0%
Recall fold 3 of 5: 46.2%
Recall fold 4 of 5: 37.5%
Recall fold 5 of 5: 46.7%
Mean recall: 48.0%
DecisionTreeClassifier(random_state=42)
Recall fold 1 of 5: 28.1%
Recall fold 2 of 5: 41.0%
Recall fold 3 of 5: 32.3%
Recall fold 4 of 5: 28.6%
Recall fold 5 of 5: 26.3%
Mean recall: 32.0%
RandomForestClassifier(random_state=42)
Recall fold 1 of 5: 50.0%
Recall fold 2 of 5: 91.7%
Recall fold 3 of 5: 80.0%
Recall fold 4 of 5: 66.7%
Recall fold 5 of 5: 66.7%
Mean recall: 72.0%
GradientBoostingClassifier(random_state=42)
Recall fold 1 of 5: 52.6%
Recall fold 2 of 5: 60.0%
Recall fold 3 of 5: 45.0%
Recall fold 4 of 5: 36.8%
Recall fold 5 of 5: 44.4%
Mean recall: 46.0%
KNeighborsClassifier()
Recall fold 1 of 5: 27.3%
Recall fold 2 of 5: 45.5%
Recall fold 3 of 5: 66.7%
Recall fold 4 of 5: 46.2%
Recall fold 5 of 5: 66.7%
Mean recall: 54.0%
GaussianNB()
Recall fold 
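
The loop above rounds each fold score to one decimal place before accumulating it, which shifts the reported mean. The sketch below illustrates the effect with the Random Forest fold values printed above, written as fractions (they are themselves already rounded to 0.1%); the variable names are assumptions for this example.

```python
# Random Forest fold scores from the output above, as fractions
fold_scores = [0.500, 0.917, 0.800, 0.667, 0.667]

# What the loop computes: each fold is rounded to one decimal place first
mean_of_rounded = sum(round(score, 1) for score in fold_scores) / len(fold_scores)

# Averaging the fold scores without intermediate rounding
mean_exact = sum(fold_scores) / len(fold_scores)

print(f'Mean of rounded folds: {100 * mean_of_rounded:.1f}%')
print(f'Mean of exact folds:   {100 * mean_exact:.1f}%')
```

Rounding only once, when the final mean is printed, avoids this drift of about one percentage point.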

###4.4) Stratified cross-validation WITH Feature Selection

In [27]:
# Stratified cross-validation on the data *WITH* feature selection

# Re-instantiate the k-fold splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# y_train was already converted to an array earlier

print('Stratified cross-validation, *WITH* feature selection')

# Loop over the models
for model in modelos:
  print('='*60)
  print(model)
  soma_recal=0    # Reset the accumulated recall
  num_iter=1      # Reset the fold counter

  for fold, (i_train, i_test) in enumerate(skf.split(X_train_transf, y_train)):

    X_train_skf  ,  X_test_skf = X_train_transf[i_train]  ,  X_train_transf[i_test]   # X_train and X_test for this fold
    y_train_skf  ,  y_test_skf = y_train[i_train]  ,  y_train[i_test]                 # y_train and y_test for this fold

    model.fit(X_train_skf, y_train_skf)                               # Train the model on the fold
    y_pred_skf = model.predict(X_test_skf)                            # Predict on the fold
    # Note: scikit-learn declares recall_score(y_true, y_pred); here the predictions are passed first
    score = round(recall_score(y_pred_skf,y_test_skf),1)              # Fold score, rounded to one decimal place

    print(f'Recall fold {num_iter} of {skf.n_splits}: {round(100*recall_score(y_pred_skf,y_test_skf),1)}%')

    soma_recal = soma_recal + score   # Accumulate the fold scores
    num_iter=num_iter+1               # Advance the fold counter shown in the printed text

  print(f'Mean recall: {round(100 * soma_recal / skf.n_splits , 1)}%')  # Average of the rounded fold scores

Stratified cross-validation, *WITH* feature selection
LogisticRegression()
Recall fold 1 of 5: 40.0%
Recall fold 2 of 5: 66.7%
Recall fold 3 of 5: 60.0%
Recall fold 4 of 5: 41.7%
Recall fold 5 of 5: 37.5%
Mean recall: 50.0%
DecisionTreeClassifier(random_state=42)
Recall fold 1 of 5: 34.4%
Recall fold 2 of 5: 42.5%
Recall fold 3 of 5: 39.3%
Recall fold 4 of 5: 31.6%
Recall fold 5 of 5: 39.4%
Mean recall: 36.0%
RandomForestClassifier(random_state=42)
Recall fold 1 of 5: 50.0%
Recall fold 2 of 5: 100.0%
Recall fold 3 of 5: 66.7%
Recall fold 4 of 5: 88.9%
Recall fold 5 of 5: 75.0%
Mean recall: 78.0%
GradientBoostingClassifier(random_state=42)
Recall fold 1 of 5: 42.9%
Recall fold 2 of 5: 69.2%
Recall fold 3 of 5: 41.2%
Recall fold 4 of 5: 42.1%
Recall fold 5 of 5: 47.6%
Mean recall: 48.0%
KNeighborsClassifier()
Recall fold 1 of 5: 30.0%
Recall fold 2 of 5: 44.4%
Recall fold 3 of 5: 77.8%
Recall fold 4 of 5: 50.0%
Recall fold 5 of 5: 62.5%
Mean recall: 52.0%
GaussianNB()
Recall fold
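
The manual loop above rounds each fold score to one decimal place before summing, so the printed mean can only move in steps of 2 percentage points. As a hedged alternative, the same stratified evaluation can be sketched with `cross_val_score` and `scoring='recall'`, which averages the unrounded per-fold values. The synthetic data and the `*_demo` names below are assumptions standing in for `X_train_transf` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# scoring='recall' evaluates recall_score(y_true, y_pred) on each held-out fold
fold_recalls = cross_val_score(model_demo, X_demo, y_demo, cv=skf_demo, scoring='recall')

for k, r in enumerate(fold_recalls, start=1):
    print(f'Recall fold {k} of {skf_demo.get_n_splits()}: {100 * r:.1f}%')

# Mean of the unrounded fold scores
print(f'Mean recall: {100 * fold_recalls.mean():.1f}%')
```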

###4.5) Hyperparameter tuning WITHOUT feature selection

In [41]:
# Predictions with tuned models, on data *WITHOUT* feature selection

# Re-instantiate the k-fold splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


print('Tuned models, on data *WITHOUT* feature selection')

# Logistic Regression
# Note: 'lbfgs' and 'sag' support only the l2 penalty, so the l1 candidates cannot be fitted
grid_lr = {'class_weight':[None,'balanced'],  'C':[0.01,0.1,1],  'solver':['lbfgs','sag'],  'penalty':['l1','l2']}
lr_opt=GridSearchCV(lr, grid_lr, cv=skf)
lr_opt.fit(X_train,y_train)
y_pred = lr_opt.predict(X_test)
print('='*60)
print('Logistic Regression:')
# Note: classification_report is declared as (y_true, y_pred); with the predictions passed first,
# the support column counts predicted labels
print(classification_report(y_pred,y_test))


# RandomForest
grid_rf = {'max_depth':[3,5], 'criterion':['gini','entropy'],  'n_estimators':[100,300,500], 'class_weight':[None,'balanced']}
rf_opt = GridSearchCV(rf, grid_rf, cv=skf)
rf_opt.fit(X_train,y_train)
y_pred = rf_opt.predict(X_test)
print('='*60)
print('Random Forest:')
print(classification_report(y_pred,y_test))


# KNN
grid_knn = {'weights':['uniform','distance'],  'n_neighbors':[1,2,3]}
knn_opt = GridSearchCV(knn, grid_knn, cv=skf)
knn_opt.fit(X_train,y_train)
y_pred = knn_opt.predict(X_test)
print('='*60)
print('KNN:')
print(classification_report(y_pred,y_test))


# SVM
grid_svm = {'kernel': ['linear', 'sigmoid'],  'class_weight':[None,'balanced'] }
svm_opt = GridSearchCV(svm, grid_svm, cv=skf)
svm_opt.fit(X_train,y_train)
y_pred = svm_opt.predict(X_test)
print('='*60)
print('SVM:')
print(classification_report(y_pred,y_test))


Tuned models, on data *WITHOUT* feature selection
Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1289
           1       0.02      0.14      0.04         7

    accuracy                           0.96      1296
   macro avg       0.51      0.56      0.51      1296
weighted avg       0.99      0.96      0.98      1296

Random Forest:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1292
           1       0.05      0.50      0.09         4

    accuracy                           0.97      1296
   macro avg       0.52      0.73      0.54      1296
weighted avg       1.00      0.97      0.98      1296

KNN:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1286
           1       0.12      0.50      0.19        10

    accuracy                           0.97      1296
   macro avg       0.56      0.74    

In [49]:
# Show the best hyperparameters of each model tuned *WITHOUT* feature selection
print('Best parameters for algorithms trained on data *WITHOUT* feature selection:\n')
for i in (lr_opt,rf_opt,knn_opt, svm_opt):
  print('='*80)
  print(f'{i.estimator}:\n{i.best_params_}')

Best parameters for algorithms trained on data *WITHOUT* feature selection:

LogisticRegression():
{'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'sag'}
RandomForestClassifier(random_state=42):
{'class_weight': None, 'criterion': 'gini', 'max_depth': 5, 'n_estimators': 500}
KNeighborsClassifier():
{'n_neighbors': 2, 'weights': 'uniform'}
SVC():
{'class_weight': None, 'kernel': 'linear'}
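
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's default `score` method, which is accuracy for classifiers. Since recall is the target metric of this project, a hedged sketch of a recall-driven search is shown below. It also splits the logistic regression grid so that each penalty is paired only with solvers that support it. The synthetic data, the `liblinear`/`saga` solver choices and the `*_demo` names are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

# A list of dicts lets each penalty be combined only with compatible solvers
grid_lr_demo = [
    {'penalty': ['l2'], 'solver': ['lbfgs', 'sag'],
     'C': [0.01, 0.1, 1], 'class_weight': [None, 'balanced']},
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'],
     'C': [0.01, 0.1, 1], 'class_weight': [None, 'balanced']},
]

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring='recall' makes the search select the candidate with the best mean recall
search = GridSearchCV(
    LogisticRegression(max_iter=5000), grid_lr_demo, cv=skf_demo, scoring='recall'
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f'Best mean CV recall: {search.best_score_:.2f}')
```

On imbalanced data, a recall-driven search will often favour `class_weight='balanced'`, at the cost of more false positives. Whether that trade-off is acceptable depends on the use case.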


###4.6) Hyperparameter tuning WITH feature selection

In [39]:
# Predictions with tuned models, on data *WITH* feature selection

# (the suffix '2' is added to distinguish these from the previous models, e.g. rf_opt2)

# Re-instantiate the k-fold splitter
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print('Tuned models, on data *WITH* feature selection')

# Logistic Regression
grid_lr = {'class_weight':[None,'balanced'],  'C':[0.01,0.1,1],  'solver':['lbfgs','sag'],  'penalty':['l1','l2']}
lr_opt2=GridSearchCV(lr, grid_lr, cv=skf)
lr_opt2.fit(X_train_transf,y_train)
y_pred = lr_opt2.predict(X_test_transf)
print('='*60)
print('Logistic Regression:')
print(classification_report(y_pred,y_test))

# RandomForest
grid_rf = {'max_depth':[3,5], 'criterion':['gini','entropy'],  'n_estimators':[100,300,500], 'class_weight':[None,'balanced']}
rf_opt2 = GridSearchCV(rf, grid_rf, cv=skf)
rf_opt2.fit(X_train_transf,y_train)
y_pred = rf_opt2.predict(X_test_transf)
print('='*60)
print('Random Forest:')
print(classification_report(y_pred,y_test))


# KNN
grid_knn = {'weights':['uniform','distance'],  'n_neighbors':[1,2,3]}
knn_opt2 = GridSearchCV(knn, grid_knn, cv=skf)
knn_opt2.fit(X_train_transf,y_train)
y_pred = knn_opt2.predict(X_test_transf)
print('='*60)
print('KNN:')
print(classification_report(y_pred,y_test))


# SVM
grid_svm = {'kernel': ['linear','sigmoid'],  'class_weight':[None,'balanced'] }
svm_opt2 = GridSearchCV(svm, grid_svm, cv=skf)
svm_opt2.fit(X_train_transf,y_train)
y_pred = svm_opt2.predict(X_test_transf)
print('='*60)
print('SVM:')
print(classification_report(y_pred,y_test))


Tuned models, on data *WITH* feature selection
Logistic Regression:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1289
           1       0.05      0.29      0.08         7

    accuracy                           0.97      1296
   macro avg       0.52      0.63      0.53      1296
weighted avg       0.99      0.97      0.98      1296

Random Forest:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1293
           1       0.05      0.67      0.09         3

    accuracy                           0.97      1296
   macro avg       0.52      0.82      0.54      1296
weighted avg       1.00      0.97      0.98      1296

KNN:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1288
           1       0.07      0.38      0.12         8

    accuracy                           0.97      1296
   macro avg       0.53      0.67    

In [44]:
# Show the best hyperparameters of each model

print('Best parameters for algorithms trained on data *WITH* feature selection\n')
for i in (lr_opt2,rf_opt2,knn_opt2, svm_opt2):
  print('='*80)
  print(f'{i.estimator}:\n{i.best_params_}')

Best parameters for algorithms trained on data *WITH* feature selection

LogisticRegression():
{'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'sag'}
RandomForestClassifier(random_state=42):
{'class_weight': None, 'criterion': 'gini', 'max_depth': 5, 'n_estimators': 100}
KNeighborsClassifier():
{'n_neighbors': 2, 'weights': 'uniform'}
SVC():
{'class_weight': None, 'kernel': 'linear'}


###4.7) Building an ensemble model

In [57]:
# Building the ensemble model
# Trained on the data *WITH* feature selection

# voting='hard' --> majority vote  |  voting='soft' --> average of the predicted probabilities
ensemble = VotingClassifier(
    estimators=[
        ('lr', lr_opt2.best_estimator_),
        ('rf', rf_opt2.best_estimator_),
        ('knn', knn_opt2.best_estimator_),
        ('svm', svm_opt2.best_estimator_),
    ],
    voting='hard',
)

# Train the model
ensemble.fit(X_train_transf,y_train)

# New predictions with the model
y_pred = ensemble.predict(X_test_transf)

# Check the metrics
print(classification_report(y_pred, y_test))


              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1294
           1       0.02      0.50      0.05         2

    accuracy                           0.97      1296
   macro avg       0.51      0.73      0.51      1296
weighted avg       1.00      0.97      0.98      1296



The ensemble model did not achieve better recall than the models run previously.
<br><br>
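
The `voting` comment in the ensemble cell distinguishes majority voting from probability averaging. A minimal sketch comparing the two on synthetic data follows. The data, the three member models and the `*_demo` names are assumptions chosen for illustration. Soft voting needs every member to expose `predict_proba`, which is why the sketch uses logistic regression, random forest and KNN rather than a default `SVC()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for the project data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Three members, all of which provide predict_proba
estimators_demo = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
]

recalls = {}
for voting in ('hard', 'soft'):
    ens_demo = VotingClassifier(estimators=estimators_demo, voting=voting)
    ens_demo.fit(X_tr, y_tr)
    # Documented argument order: true labels first, predictions second
    recalls[voting] = recall_score(y_te, ens_demo.predict(X_te))
    print(f'{voting} voting recall: {recalls[voting]:.2f}')
```

With an even number of members, as in the four-model ensemble above, hard voting can tie. scikit-learn then picks the class that comes first in ascending sort order, which here is class 0.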

#5) Exporting the final model

The Random Forest model with tuned hyperparameters, trained on the data **WITH** feature selection, was the best performer.

It will therefore be the final model, available for new predictions.


Since this model does not require standardized data, we will test it on non-standardized data before exporting it. If the metrics are at least as good, we will adopt this version, since skipping the standardization step saves computation.
<br><br>
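
Tree splits compare one feature against a threshold, so a monotonic rescaling such as standardization should leave the resulting partitions essentially unchanged. A minimal sketch of that check on synthetic data is shown below. The data and the `*_demo` names are assumptions, and small floating-point differences can still flip an occasional prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only
scaler_demo = StandardScaler().fit(X_tr)

# Same hyperparameters and seed, one forest per version of the data
rf_raw_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
rf_std_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(
    scaler_demo.transform(X_tr), y_tr
)

# Fraction of test rows on which both forests predict the same class
agreement = np.mean(
    rf_raw_demo.predict(X_te) == rf_std_demo.predict(scaler_demo.transform(X_te))
)
print(f'Prediction agreement: {100 * agreement:.1f}%')
```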

In [80]:
# Recovering the non-standardized data
# Training data
X_train_raw = X_train_ori[remaining_features_recursive]    # Keep only the features that survived feature selection

# Test data
X_test_raw = X_test_ori[remaining_features_recursive]       # Keep only the features that survived feature selection


# Instantiate RandomForest
rf_raw = RandomForestClassifier(random_state=42)

# RandomForest grid
grid_rf= {'max_depth':[3,5], 'criterion':['gini','entropy'],  'n_estimators':[100,300,500], 'class_weight':[None,'balanced']}

rf_raw = GridSearchCV(rf_raw, grid_rf, cv=skf)
rf_raw.fit(X_train_raw,y_train)
y_pred = rf_raw.predict(X_test_raw)
print('='*60)
print('Random Forest on non-standardized data, *WITH* feature selection:')
print(classification_report(y_pred,y_test))

Random Forest on non-standardized data, *WITH* feature selection:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1293
           1       0.05      0.67      0.09         3

    accuracy                           0.97      1296
   macro avg       0.52      0.82      0.54      1296
weighted avg       1.00      0.97      0.98      1296



Performance was the same as that of the model trained on standardized data.

We will therefore skip standardization, which removes one step from the pipeline.

The following will also be exported:
- The dataset set aside for testing the model in production;
- The list of features remaining after all feature selection steps.

In [81]:
# Exporting the RandomForest model
import os
import pickle
import shutil

# Keep the tuned estimator
model_export = rf_raw.best_estimator_

# File name
nome_arquivo = 'Modelo_producao.pkl'

# Save the model together with the production data and the selected features
with open(nome_arquivo, 'wb') as zip_model:
    pickle.dump((model_export, X_prod, y_prod, remaining_features_recursive), zip_model)

# Path of the destination Drive folder
pasta = '/content/drive/MyDrive/Colab Notebooks/Projetos pessoais/01. Projetos Machine Learning/Falência de empresas Tailandesas'

# If the file already exists at the destination, remove it
arquivo_destino = os.path.join(pasta, nome_arquivo)
if os.path.exists(arquivo_destino):
    os.remove(arquivo_destino)

# Move the file to the destination folder
shutil.move(nome_arquivo, pasta)

'/content/drive/MyDrive/Colab Notebooks/Projetos pessoais/01. Projetos Machine Learning/Falência de empresas Tailandesas/Modelo_producao.pkl'
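
The save, remove and move sequence above can also be collapsed into a single write to the destination path, because opening a file in `'wb'` mode overwrites any existing copy. A minimal sketch follows, assuming a temporary folder in place of the Drive folder and placeholder objects in place of the model and data. The dictionary payload is an illustrative choice, not the tuple format used in the notebook.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical placeholders for the model and the selected feature list
payload = {
    'model': {'kind': 'placeholder'},
    'features': ['feature_a', 'feature_b'],
}

# A temporary folder stands in for the Drive folder (assumption)
dest_dir = Path(tempfile.mkdtemp())
dest_file = dest_dir / 'Modelo_producao.pkl'

# 'wb' truncates an existing file, so no separate remove/move step is needed
with open(dest_file, 'wb') as f:
    pickle.dump(payload, f)

# Round trip: load the payload back and access its parts by name
with open(dest_file, 'rb') as f:
    restored = pickle.load(f)

print(restored['features'])
```

Named keys make the loading side less dependent on the order in which the objects were packed.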

#6) Importing the model

Note: all the code below could live in another notebook, written by another developer. For convenience, and to keep everything in a single file, the code that imports the model and predicts on unseen data is included here.

In [1]:
# Importing libraries
import pandas as pd
import pickle
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


# Load the model (directly from the folder where the notebooks are stored)
with open('/content/drive/MyDrive/Colab Notebooks/Projetos pessoais/01. Projetos Machine Learning/Falência de empresas Tailandesas/Modelo_producao.pkl', 'rb') as zip_model:
    model, X_new, y_new, features = pickle.load(zip_model)

X_new = X_new[features]

In [2]:
# Check the class distribution (relative frequencies)
y_new.value_counts(1)

0    0.967742
1    0.032258
Name: Bankrupt?, dtype: float64

In [3]:
# Making predictions
y_pred_new = model.predict(X_new)

# Metrics
print(classification_report(y_pred_new,y_new, target_names=['class 0', 'class 1']))

              precision    recall  f1-score   support

     class 0       1.00      0.97      0.99       338
     class 1       0.18      0.67      0.29         3

    accuracy                           0.97       341
   macro avg       0.59      0.82      0.64       341
weighted avg       0.99      0.97      0.98       341



The model maintained its performance: the report again shows 0.67 in the recall column for class 1.
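
Because the project's goal is to minimise false negatives, it can help to read them directly from a confusion matrix. `confusion_matrix` is already imported above. The sketch below uses made-up labels, not the production set, and passes the true labels first, as scikit-learn's signature expects. The `*_demo` names are assumptions.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (1 = bankrupt, 0 = not bankrupt)
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall penalises false negatives; precision penalises false positives
recall_1 = tp / (tp + fn)
precision_1 = tp / (tp + fp)

print(f'TN={tn} FP={fp} FN={fn} TP={tp}')
print(f'Recall (class 1):    {recall_1:.2f}')
print(f'Precision (class 1): {precision_1:.2f}')
```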

<br>
Now, let us assume that the index of the original dataset is the company ID. We can then rank the companies by probability of bankruptcy and identify them by ID.

In [4]:
# Predicted class probabilities

prob = model.predict_proba(X_new)

val_0 = prob[:,0]     # Class 0: no bankruptcy
val_1 = prob[:,1]     # Class 1: bankruptcy

# Extract the index of each instance (assumed to be the company ID)
index=X_new.index

In [7]:
# Build the ranking dataframe

df_rank = pd.DataFrame()
df_rank['Id empresa'] = index
df_rank['Prob. falência'] = val_1 *100
df_rank['Prob. falência']=round(df_rank['Prob. falência'],2)
df_rank['Prob. não falência'] = val_0 *100
df_rank['Prob. não falência'] = round(df_rank['Prob. não falência'],2)

df_rank['Falência?'] = df_rank['Prob. falência'].apply(lambda x: "Sim" if x >= 50 else "Não")

df_rank['Risco'] = df_rank['Prob. falência'].apply(lambda x: "Alto" if x >= 50 else "Médio" if (x > 30 and x < 50) else "Baixo")
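
The columns returned by `predict_proba` follow the order of `model.classes_`, so the bankruptcy column can be looked up rather than hard-coded. The sketch below also expresses the same three risk tiers with `np.select`, using a small synthetic model. The data and the `*_demo` names are assumptions. The thresholds mirror the lambda above: `>= 50` is "Alto", strictly between 30 and 50 is "Médio", and everything else is "Baixo".

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the production data (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42
)
model_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Locate the probability column that corresponds to class 1 (bankruptcy)
col_bankrupt = list(model_demo.classes_).index(1)
p_bankrupt = 100 * model_demo.predict_proba(X_demo)[:, col_bankrupt]

df_demo = pd.DataFrame({'Prob. falência': p_bankrupt.round(2)})

# Conditions are evaluated in order; the first match wins
conditions = [df_demo['Prob. falência'] >= 50, df_demo['Prob. falência'] > 30]
df_demo['Risco'] = np.select(conditions, ['Alto', 'Médio'], default='Baixo')

print(df_demo['Risco'].value_counts(normalize=True))
```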

In [15]:
# Show the distribution of risk levels (relative frequencies)
df_rank['Risco'].value_counts(1)

Baixo    0.970674
Médio    0.020528
Alto     0.008798
Name: Risco, dtype: float64

In [22]:
# Select the 5 LOW-risk companies with the lowest probability of bankruptcy
df_rank[df_rank['Risco']=='Baixo'].sort_values(by=['Prob. falência'], ascending=True).head(5)

| | Id empresa | Prob. falência | Prob. não falência | Falência? | Risco |
|---|---|---|---|---|---|
| 162 | 3475 | 0.29 | 99.71 | Não | Baixo |
| 145 | 5939 | 0.29 | 99.71 | Não | Baixo |
| 277 | 280 | 0.29 | 99.71 | Não | Baixo |
| 278 | 6651 | 0.29 | 99.71 | Não | Baixo |
| 39 | 6732 | 0.29 | 99.71 | Não | Baixo |


In [21]:
# Select the 5 MEDIUM-risk companies with the highest probability of bankruptcy
df_rank[df_rank['Risco']=='Médio'].sort_values(by=['Prob. falência'], ascending=False).head(5)

| | Id empresa | Prob. falência | Prob. não falência | Falência? | Risco |
|---|---|---|---|---|---|
| 193 | 4071 | 47.59 | 52.41 | Não | Médio |
| 2 | 54 | 45.62 | 54.38 | Não | Médio |
| 10 | 4990 | 40.98 | 59.02 | Não | Médio |
| 279 | 2589 | 40.86 | 59.14 | Não | Médio |
| 90 | 379 | 39.02 | 60.98 | Não | Médio |


In [23]:
# Select the 5 HIGH-risk companies with the highest probability of bankruptcy
df_rank[df_rank['Risco']=='Alto'].sort_values(by=['Prob. falência'], ascending=False).head(5)

| | Id empresa | Prob. falência | Prob. não falência | Falência? | Risco |
|---|---|---|---|---|---|
| 319 | 2470 | 57.12 | 42.88 | Sim | Alto |
| 78 | 1640 | 56.35 | 43.65 | Sim | Alto |
| 56 | 3540 | 50.53 | 49.47 | Sim | Alto |


#7) Results achieved

The model makes it possible to assess companies' financial and accounting performance and to estimate their bankruptcy risk, supporting well-founded decisions in situations such as:

* Granting credit
* Investing in stocks
* Leasing property to the company
* Pricing insurance
* Doing business with the company that may involve long-term payments, financing, etc.
* Offering accounting consultancy and financial restructuring services