<h1 style="text-align: center;">
  <b>Machine Learning Model Selection on Multivariate, Anonymized Data</b>
</h1>

<p style='text-align: justify;'>
  _______________________________________________________________________________________________________
</p>

<h2 style='text-align: justify;'>
  Machine Learning for a Supervised Classification Problem.
</h2>

<h3 style="text-align: justify;">
  Script developed to explore different Machine Learning model selection techniques and identify the most effective and efficient algorithm for multivariate, anonymized data.
</h3>

<p style='text-align: justify;'>
  Our task is to build a Machine Learning model that predicts whether a customer of an insurance company will renew their car insurance. With this predictive model, the insurer can plan its budget better by anticipating, at any time, whether a given customer will renew.
</p>

<p style='text-align: justify;'>
  For privacy reasons, no information is provided about the data source or about the names of the 178 descriptive variables that represent the customers' characteristics. It is not possible to identify what each feature represents, and the data used in this project is already anonymized. Variable 179 is the LABEL_TARGET variable, which indicates whether or not the customer renewed the insurance in the previous 2 years.
</p>

<p style='text-align: justify;'>
  In this project, the analysis is split into two parts, each in its own Python script: Part 1 covers data preprocessing and Part 2 covers modeling.
</p>

<p style='text-align: justify;'>
  _______________________________________________________________________________________________________
</p>

<h2 style='text-align: justify;'>
  Part 1: Data Preprocessing
</h2>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [1]:
# Imports
import pandas as pd
import numpy as np
import pickle

In [2]:
# Load the watermark extension and record the notebook author
%reload_ext watermark
%watermark -a "Michela Camboim" 

Author: Michela Camboim



<p style='text-align: justify;'>
  ________________________________________
</p>

<h4 style='text-align: justify;'>
Loading the Data
</h4>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [3]:
# Loading the data
df = pd.read_csv("dados/dataset.csv")

print('Shape: ', df.shape)

df.head()

Shape:  (11500, 179)


Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,135,190,229,223,192,125,55,-9,-33,-38,...,-17,-15,-31,-77,-103,-127,-116,-83,-51,False
1,386,382,356,331,320,315,307,272,244,232,...,164,150,146,152,157,156,154,143,129,True
2,-32,-39,-47,-37,-32,-36,-57,-73,-85,-94,...,57,64,48,19,-12,-30,-35,-35,-36,False
3,-105,-101,-96,-92,-89,-95,-102,-100,-87,-79,...,-82,-81,-80,-77,-85,-77,-72,-69,-65,False
4,-9,-65,-98,-102,-78,-48,-16,0,-21,-59,...,4,2,-12,-32,-41,-65,-83,-89,-73,False


<p style='text-align: justify;'>
  ________________________________________
</p>

<h4 style='text-align: justify;'>
Exploratory Analysis and Data Cleaning
</h4>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [4]:
# Class counts of the target variable
df.LABEL_TARGET.value_counts()

False    9200
True     2300
Name: LABEL_TARGET, dtype: int64

In [5]:
# Converts the boolean target (False/True) to an integer (0/1)
df["LABEL_TARGET"] = df["LABEL_TARGET"].astype(int)

In [6]:
# Viewing a few records
df.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,135,190,229,223,192,125,55,-9,-33,-38,...,-17,-15,-31,-77,-103,-127,-116,-83,-51,0
1,386,382,356,331,320,315,307,272,244,232,...,164,150,146,152,157,156,154,143,129,1
2,-32,-39,-47,-37,-32,-36,-57,-73,-85,-94,...,57,64,48,19,-12,-30,-35,-35,-36,0
3,-105,-101,-96,-92,-89,-95,-102,-100,-87,-79,...,-82,-81,-80,-77,-85,-77,-72,-69,-65,0
4,-9,-65,-98,-102,-78,-48,-16,0,-21,-59,...,4,2,-12,-32,-41,-65,-83,-89,-73,0


In [7]:
# Statistical summary
df.describe()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
count,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,...,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0,11500.0
mean,-11.581391,-10.911565,-10.18713,-9.143043,-8.009739,-7.003478,-6.502087,-6.68713,-6.558,-6.168435,...,-10.145739,-11.630348,-12.943478,-13.66887,-13.363304,-13.045043,-12.70513,-12.426,-12.195652,0.2
std,165.626284,166.059609,163.524317,161.269041,160.998007,161.328725,161.467837,162.11912,162.03336,160.436352,...,164.652883,166.14979,168.554058,168.556486,167.25729,164.241019,162.895832,162.886311,164.852015,0.400017
min,-1839.0,-1838.0,-1835.0,-1845.0,-1791.0,-1757.0,-1832.0,-1778.0,-1840.0,-1867.0,...,-1867.0,-1865.0,-1642.0,-1723.0,-1866.0,-1863.0,-1781.0,-1727.0,-1829.0,0.0
25%,-54.0,-55.0,-54.0,-54.0,-54.0,-54.0,-54.0,-55.0,-55.0,-54.0,...,-55.0,-56.0,-56.0,-56.0,-55.0,-56.0,-55.0,-55.0,-55.0,0.0
50%,-8.0,-8.0,-7.0,-8.0,-8.0,-8.0,-8.0,-8.0,-7.0,-7.0,...,-9.0,-10.0,-10.0,-10.0,-10.0,-9.0,-9.0,-9.0,-9.0,0.0
75%,34.0,35.0,36.0,36.0,35.0,36.0,35.0,36.0,36.0,35.25,...,34.0,34.0,33.0,33.0,34.0,34.0,34.0,34.0,34.0,0.0
max,1726.0,1713.0,1697.0,1612.0,1518.0,1816.0,2047.0,2047.0,2047.0,2047.0,...,1777.0,1472.0,1319.0,1436.0,1733.0,1958.0,2047.0,2047.0,1915.0,1.0


In [8]:
# Checking for missing values
df.isnull().values.any()

False

In [9]:
# Extracts the list of columns
lista_de_colunas = df.columns.tolist()


In [10]:
# Input variable columns (the first 178 columns, X1 to X178)
colunas_entrada = lista_de_colunas[0:178]

In [11]:
print(colunas_entrada)

['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'X83', 'X84', 'X85', 'X86', 'X87', 'X88', 'X89', 'X90', 'X91', 'X92', 'X93', 'X94', 'X95', 'X96', 'X97', 'X98', 'X99', 'X100', 'X101', 'X102', 'X103', 'X104', 'X105', 'X106', 'X107', 'X108', 'X109', 'X110', 'X111', 'X112', 'X113', 'X114', 'X115', 'X116', 'X117', 'X118', 'X119', 'X120', 'X121', 'X122', 'X123', 'X124', 'X125', 'X126', 'X127', 'X128', 'X129', 'X130', 'X131', 'X132', 'X133', 'X134', 'X135', 'X136', 'X137', 'X138', 'X1

In [12]:
# Checking for duplicated columns among the input columns
dup_cols = set([x for x in colunas_entrada if colunas_entrada.count(x) > 1])
print(dup_cols)
assert len(dup_cols) == 0, "there are duplicated columns in colunas_entrada"

set()


In [13]:
# Checking for duplicated columns in the full dataset
dup_cols = set([x for x in lista_de_colunas if lista_de_colunas.count(x) > 1])
print(dup_cols)
assert len(dup_cols) == 0, "there are duplicated columns in lista_de_colunas"

set()
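
<p style='text-align: justify;'>
The same check can probably be written with pandas itself. The sketch below assumes a small illustrative frame (<code>df_example</code> is a hypothetical name, not part of this project) and uses <code>Index.duplicated()</code>, which flags every repeated column name after its first occurrence.
</p>

```python
import pandas as pd

# Small illustrative frame (hypothetical); "X2" appears twice on purpose
df_example = pd.DataFrame([[1, 2, 3, 0]],
                          columns=["X1", "X2", "X2", "LABEL_TARGET"])

# Index.duplicated() marks each repeated name after its first occurrence
duplicated_columns = df_example.columns[df_example.columns.duplicated()].tolist()
print(duplicated_columns)
```

<p style='text-align: justify;'>
On the real dataset, the equivalent expression would be <code>df.columns[df.columns.duplicated()]</code>, which should be empty given the result above.
</p>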


In [14]:
# Class proportions of the target variable
df.LABEL_TARGET.value_counts(normalize=True)

0    0.8
1    0.2
Name: LABEL_TARGET, dtype: float64

<p style='text-align: justify;'>
  ________________________________________
</p>

<p style='text-align: justify;'>
Prevalence is the percentage of samples that have the characteristic we are trying to predict. In our case, customers who renewed the insurance belong to the positive class (event occurred = 1), and those who did not renew belong to the negative class (event did not occur = 0).
</p>

<p style='text-align: justify;'>
The terms positive and negative carry no connotation of good or bad. They are simply the nomenclature used to indicate whether or not the event occurred.
</p>

<p style='text-align: justify;'>
The rate is computed as (number of positive samples / total number of samples). A prevalence of 0.2 therefore means that 20% of our sample renewed their car insurance.
</p>

<p style='text-align: justify;'>
Class imbalance is a problem that will have to be addressed during data preprocessing.
</p>
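
<p style='text-align: justify;'>
To illustrate why the imbalance matters, the sketch below uses only the class counts reported earlier (9200 negative, 2300 positive) and a hypothetical "model" that always predicts the negative class. It is not part of the project pipeline; it only shows that accuracy alone can look good while the positive class is never detected.
</p>

```python
import numpy as np

# Class counts taken from the value_counts() output of the target variable
n_neg, n_pos = 9200, 2300
y_true = np.array([0] * n_neg + [1] * n_pos)

# Hypothetical baseline that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

# Accuracy: share of predictions that match the true label
accuracy = (y_pred == y_true).mean()

# Recall: share of actual positives that were predicted as positive
recall = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall (positive class): {recall:.3f}")
```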


<p style='text-align: justify;'>
  ________________________________________
</p>

In [15]:
# This function computes the prevalence of the positive class (label = 1)
def calcula_prevalencia(y_actual):
    return sum(y_actual) / len(y_actual)

In [16]:
print("Prevalence of the positive class: %.3f"% calcula_prevalencia(df["LABEL_TARGET"].values))

Prevalence of the positive class: 0.200


<p style='text-align: justify;'>
  ________________________________________
</p>

<h4 style='text-align: justify;'>
Splitting the Data into Training, Test, and Validation Sets While Maintaining Class Prevalence
</h4>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [17]:
# Shuffling the data by sampling all rows in random order
df_data = df.sample(n = len(df))

df_data

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
1696,-94,-106,-116,-114,-94,-65,-39,-15,1,10,...,-30,-26,-25,-36,-59,-78,-100,-118,-124,0
27,-340,-381,-376,-336,-275,-204,-131,-70,-16,20,...,114,-39,-185,-293,-351,-379,-380,-350,-308,1
11329,-98,-89,-73,-62,-46,-33,-51,-66,-78,-59,...,60,18,3,-6,-20,-50,-70,-65,-48,0
11356,13,25,19,18,8,12,14,10,16,43,...,-61,-65,-57,-47,-38,-40,-37,-35,-44,0
8551,103,465,689,697,655,642,618,566,533,522,...,475,443,390,314,173,-258,-720,-1094,-1168,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6618,17,-24,-61,-81,-84,-72,-69,-60,-55,-40,...,-4,20,38,40,6,-55,-99,-99,-47,0
9009,-2,2,13,51,76,97,70,40,17,1,...,34,25,22,13,11,8,15,24,9,0
10871,-4,-12,-10,-21,-12,-3,0,-15,-18,-32,...,-103,-97,-85,-78,-64,-51,-38,-51,-62,0
6769,2,5,7,3,-3,-3,-2,7,8,-1,...,123,111,120,138,149,144,143,151,157,0


In [18]:
# Resetting the dataset index
df_data = df_data.reset_index(drop = True)

In [19]:
df_data

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,-94,-106,-116,-114,-94,-65,-39,-15,1,10,...,-30,-26,-25,-36,-59,-78,-100,-118,-124,0
1,-340,-381,-376,-336,-275,-204,-131,-70,-16,20,...,114,-39,-185,-293,-351,-379,-380,-350,-308,1
2,-98,-89,-73,-62,-46,-33,-51,-66,-78,-59,...,60,18,3,-6,-20,-50,-70,-65,-48,0
3,13,25,19,18,8,12,14,10,16,43,...,-61,-65,-57,-47,-38,-40,-37,-35,-44,0
4,103,465,689,697,655,642,618,566,533,522,...,475,443,390,314,173,-258,-720,-1094,-1168,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11495,17,-24,-61,-81,-84,-72,-69,-60,-55,-40,...,-4,20,38,40,6,-55,-99,-99,-47,0
11496,-2,2,13,51,76,97,70,40,17,1,...,34,25,22,13,11,8,15,24,9,0
11497,-4,-12,-10,-21,-12,-3,0,-15,-18,-32,...,-103,-97,-85,-78,-64,-51,-38,-51,-62,0
11498,2,5,7,3,-3,-3,-2,7,8,-1,...,123,111,120,138,149,144,143,151,157,0


In [20]:
# Draws a random sample of 30% of the data
df_amostra_30 = df_data.sample(frac = 0.3)

df_amostra_30

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
2692,451,522,563,575,568,505,373,142,-197,-547,...,-832,-664,-509,-363,-238,-138,-50,25,110,1
2023,-193,-225,-258,-297,-353,-303,-133,153,343,290,...,346,221,151,122,133,163,193,228,253,1
10860,55,38,26,20,24,25,28,14,4,1,...,45,35,30,28,27,24,5,-14,-35,0
4655,224,219,198,177,152,126,107,92,82,77,...,-95,-92,-82,-63,-51,-47,-33,-12,1,0
4687,-61,-103,-130,-130,-107,-83,-56,-19,19,49,...,122,27,-76,-177,-217,-202,-137,-74,-23,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2766,24,18,8,-11,-40,-46,-50,-56,-46,-39,...,27,33,23,34,29,10,-10,-16,-20,0
8893,-61,-77,-80,-69,-62,-42,-30,-34,-22,-17,...,37,48,65,65,67,66,68,70,58,0
5077,72,93,101,94,93,144,111,13,-3,5,...,56,54,87,81,38,42,25,-16,-31,0
7780,-847,-703,-529,-378,-269,-182,-110,-35,40,113,...,-288,-744,-1062,-1170,-1106,-963,-786,-615,-478,1


In [21]:
# Performing the split

# Test data
df_teste = df_amostra_30.sample(frac = 0.5) # 15% test data

# Validation data
df_valid = df_amostra_30.drop(df_teste.index) # 15% validation data

# Training data
df_treino = df_data.drop(df_amostra_30.index) # 70% training data

print(f'Shape df_teste: {df_teste.shape}')
print(f'Shape df_valid: {df_valid.shape}')
print(f'Shape df_treino: {df_treino.shape}')

Shape df_teste: (1725, 179)
Shape df_valid: (1725, 179)
Shape df_treino: (8050, 179)
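
<p style='text-align: justify;'>
Because the split relies on <code>sample</code> followed by <code>drop</code> on the index, it can be worth confirming that the three subsets do not overlap and together cover every row. The sketch below reproduces the same logic on a small synthetic frame (all <code>demo_*</code> names and the <code>random_state</code> value are assumptions made for illustration).
</p>

```python
import pandas as pd

# Synthetic stand-in for the shuffled dataset (100 rows), for illustration only
demo = pd.DataFrame({"X1": range(100), "LABEL_TARGET": [0, 0, 0, 0, 1] * 20})

# Same sample/drop logic as the split above, with a fixed seed
demo_holdout = demo.sample(frac=0.3, random_state=0)
demo_test = demo_holdout.sample(frac=0.5, random_state=0)
demo_valid = demo_holdout.drop(demo_test.index)
demo_train = demo.drop(demo_holdout.index)

# No row index may appear in more than one subset
assert demo_test.index.intersection(demo_valid.index).empty
assert demo_test.index.intersection(demo_train.index).empty
assert demo_valid.index.intersection(demo_train.index).empty

# Together the three subsets must cover every row exactly once
assert len(demo_test) + len(demo_valid) + len(demo_train) == len(demo)
print(len(demo_test), len(demo_valid), len(demo_train))
```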


In [22]:
df_teste

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
4639,-17,-19,-25,-33,-43,-56,-64,-69,-70,-75,...,-74,-86,-100,-115,-124,-135,-141,-145,-144,0
3170,87,71,34,-9,-50,-70,-66,-44,-23,24,...,55,23,-20,-68,-107,-125,-96,-52,7,0
2575,-20,-35,-36,-21,-10,-10,-9,-5,-11,-16,...,-32,-34,-32,-27,-24,-20,-14,10,39,0
10718,3,3,6,12,15,23,25,26,22,23,...,27,19,14,10,13,19,20,27,26,0
4881,-33,-37,-48,-56,-69,-88,-97,-97,-75,-51,...,13,3,-12,-19,-15,-1,14,36,47,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4249,-22,-27,-47,-59,-59,-52,-36,-28,-31,-36,...,-21,-38,-56,-70,-69,-43,-20,17,46,0
9400,512,525,525,422,322,191,89,-13,-102,-163,...,302,270,209,163,131,121,109,78,50,1
8202,48,38,18,13,18,29,30,20,9,0,...,109,125,136,135,122,104,71,47,36,0
512,-12,-36,-49,-21,4,-4,20,55,69,74,...,32,39,27,16,9,9,-12,-12,-1,0


In [23]:
df_valid

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
2692,451,522,563,575,568,505,373,142,-197,-547,...,-832,-664,-509,-363,-238,-138,-50,25,110,1
2023,-193,-225,-258,-297,-353,-303,-133,153,343,290,...,346,221,151,122,133,163,193,228,253,1
10860,55,38,26,20,24,25,28,14,4,1,...,45,35,30,28,27,24,5,-14,-35,0
4687,-61,-103,-130,-130,-107,-83,-56,-19,19,49,...,122,27,-76,-177,-217,-202,-137,-74,-23,0
11290,36,28,23,15,18,18,20,19,13,10,...,39,31,18,13,6,6,3,-1,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1053,-38,-17,9,39,62,42,-12,-33,23,120,...,83,68,55,28,3,-8,-26,-35,-35,1
8893,-61,-77,-80,-69,-62,-42,-30,-34,-22,-17,...,37,48,65,65,67,66,68,70,58,0
5077,72,93,101,94,93,144,111,13,-3,5,...,56,54,87,81,38,42,25,-16,-31,0
7780,-847,-703,-529,-378,-269,-182,-110,-35,40,113,...,-288,-744,-1062,-1170,-1106,-963,-786,-615,-478,1


In [24]:
df_treino

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,-94,-106,-116,-114,-94,-65,-39,-15,1,10,...,-30,-26,-25,-36,-59,-78,-100,-118,-124,0
1,-340,-381,-376,-336,-275,-204,-131,-70,-16,20,...,114,-39,-185,-293,-351,-379,-380,-350,-308,1
2,-98,-89,-73,-62,-46,-33,-51,-66,-78,-59,...,60,18,3,-6,-20,-50,-70,-65,-48,0
5,-50,-55,-50,-49,-46,-48,-41,-45,-52,-60,...,-32,-32,-36,-33,-27,-19,-11,-8,-6,0
6,11,-30,-66,-99,-108,-100,-89,-88,-94,-95,...,-98,-90,-68,-52,-45,-50,-59,-57,-35,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11492,110,83,75,79,87,89,53,31,-13,-32,...,33,50,66,63,54,46,42,46,52,0
11494,2,-15,-11,19,30,53,25,-4,-28,-47,...,30,26,11,16,-5,-11,-23,-17,-11,0
11495,17,-24,-61,-81,-84,-72,-69,-60,-55,-40,...,-4,20,38,40,6,-55,-99,-99,-47,0
11496,-2,2,13,51,76,97,70,40,17,1,...,34,25,22,13,11,8,15,24,9,0


In [25]:
# Checking the class prevalence in each generated subset

print(f'Test, {df_teste.LABEL_TARGET.value_counts(normalize=True)}')
print(f'Validation, {df_valid.LABEL_TARGET.value_counts(normalize=True)}')
print(f'Training, {df_treino.LABEL_TARGET.value_counts(normalize=True)}')

Test, 0    0.805217
1    0.194783
Name: LABEL_TARGET, dtype: float64
Validation, 0    0.792464
1    0.207536
Name: LABEL_TARGET, dtype: float64
Training, 0    0.800497
1    0.199503
Name: LABEL_TARGET, dtype: float64
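
<p style='text-align: justify;'>
The purely random split keeps the prevalence only approximately (0.195, 0.208 and 0.200 above). If exact proportions were required, one possible alternative, not used in this notebook, is to sample within each class using <code>groupby(...).sample(...)</code>. The sketch below runs on a synthetic frame; the <code>strat_*</code> names and the seed are assumptions.
</p>

```python
import pandas as pd

# Synthetic stand-in with the same 80/20 class ratio as the real data
demo = pd.DataFrame({"X1": range(1000), "LABEL_TARGET": [0, 0, 0, 0, 1] * 200})

# Sample 30% within each class, so the hold-out keeps the 0.2 prevalence
strat_holdout = demo.groupby("LABEL_TARGET").sample(frac=0.3, random_state=42)

# Split the hold-out in half, again class by class
strat_test = strat_holdout.groupby("LABEL_TARGET").sample(frac=0.5, random_state=42)
strat_valid = strat_holdout.drop(strat_test.index)
strat_train = demo.drop(strat_holdout.index)

# Report size and positive-class prevalence of each subset
for name, part in [("test", strat_test), ("valid", strat_valid), ("train", strat_train)]:
    print(name, len(part), round(part["LABEL_TARGET"].mean(), 3))
```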


<p style='text-align: justify;'>
  ________________________________________
</p>

<h4 style='text-align: justify;'>
Class Balancing on the Training Data: Undersampling Strategy
</h4>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [26]:
df_treino.shape

(8050, 179)

In [27]:
df_treino.LABEL_TARGET.value_counts()

0    6444
1    1606
Name: LABEL_TARGET, dtype: int64

In [28]:
# Creates a True/False mask that flags the positive-class rows
indice = df_treino.LABEL_TARGET == 1
indice

0        False
1         True
2        False
5        False
6        False
         ...  
11492    False
11494    False
11495    False
11496    False
11497    False
Name: LABEL_TARGET, Length: 8050, dtype: bool

In [29]:
# Splits the training data into positive and negative rows using the mask
df_train_pos = df_treino.loc[indice]
df_train_neg = df_treino.loc[~indice]
df_train_pos

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
1,-340,-381,-376,-336,-275,-204,-131,-70,-16,20,...,114,-39,-185,-293,-351,-379,-380,-350,-308,1
8,-214,-269,-319,-343,-349,-345,-339,-356,-380,-466,...,-644,-522,-321,-26,292,547,710,774,749,1
10,-370,-326,-38,341,730,959,969,905,789,654,...,-343,-343,-347,-349,-357,-365,-379,-396,-403,1
17,-264,-189,-117,-45,20,70,111,143,161,179,...,-231,-221,-248,-321,-444,-530,-548,-536,-486,1
19,-260,-257,-254,-243,-227,-211,-210,-213,-220,-233,...,250,322,247,-13,-4,14,333,537,602,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11448,-26,-451,-414,-403,-41,104,156,304,432,485,...,222,236,196,148,158,259,400,530,567,1
11457,-311,162,506,699,620,443,217,21,-169,-208,...,404,351,63,-376,-819,-866,-658,-346,-64,1
11466,-400,-609,-694,-662,-580,-442,-299,-180,-56,78,...,256,269,216,128,85,98,198,323,411,1
11467,16,-115,-206,-260,-278,-269,-257,-240,-218,-179,...,-25,-30,-29,-29,-10,19,51,76,87,1


In [30]:
df_train_neg

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,-94,-106,-116,-114,-94,-65,-39,-15,1,10,...,-30,-26,-25,-36,-59,-78,-100,-118,-124,0
2,-98,-89,-73,-62,-46,-33,-51,-66,-78,-59,...,60,18,3,-6,-20,-50,-70,-65,-48,0
5,-50,-55,-50,-49,-46,-48,-41,-45,-52,-60,...,-32,-32,-36,-33,-27,-19,-11,-8,-6,0
6,11,-30,-66,-99,-108,-100,-89,-88,-94,-95,...,-98,-90,-68,-52,-45,-50,-59,-57,-35,0
11,29,12,-15,-42,-50,-36,-38,-14,-4,6,...,-26,-36,-63,-101,-114,-111,-92,-77,-50,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11492,110,83,75,79,87,89,53,31,-13,-32,...,33,50,66,63,54,46,42,46,52,0
11494,2,-15,-11,19,30,53,25,-4,-28,-47,...,30,26,11,16,-5,-11,-23,-17,-11,0
11495,17,-24,-61,-81,-84,-72,-69,-60,-55,-40,...,-4,20,38,40,6,-55,-99,-99,-47,0
11496,-2,2,13,51,76,97,70,40,17,1,...,34,25,22,13,11,8,15,24,9,0


In [31]:
# Size of the smaller class (positive vs. negative)
valor_minimo = np.min([len(df_train_pos), len(df_train_neg)])

valor_minimo

1606

In [32]:
# Randomly draws the same number of rows from each class for the training dataset
df_treino_final = pd.concat([df_train_pos.sample(n = valor_minimo, random_state = 69), 
                             df_train_neg.sample(n = valor_minimo, random_state = 69)], 
                            axis = 0, 
                            ignore_index = True)

df_treino_final

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,-131,-150,-190,-223,-201,-141,-44,58,139,196,...,-38,-166,-262,-196,-87,-81,-118,-175,-261,1
1,-595,-534,-443,-335,-230,-120,-14,65,102,111,...,-35,-130,-249,-395,-486,-470,-377,-276,-167,1
2,-658,-259,93,369,547,630,627,564,470,370,...,-1860,-1820,-1495,-972,-504,-130,170,382,500,1
3,-89,-93,-95,-103,-109,-109,-78,-6,86,179,...,-137,-131,-72,-10,19,29,62,108,142,1
4,413,301,222,173,156,153,153,155,160,170,...,211,181,154,45,-127,-373,-647,-874,-962,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3207,-53,-64,-59,-49,-45,-64,-78,-72,-54,-41,...,56,101,146,146,108,58,22,4,-8,0
3208,97,99,99,95,78,51,65,84,93,91,...,104,97,85,57,43,34,38,51,48,0
3209,-28,-8,8,10,8,6,-1,9,23,38,...,-38,-37,-26,-21,-4,9,14,3,-17,0
3210,-78,-93,-117,-130,-126,-66,9,77,143,133,...,-20,-60,-69,-74,-65,-40,-12,0,-1,0


In [33]:
# Shuffling the balanced dataset
df_treino_final = df_treino_final.sample(n = len(df_treino_final), random_state = 69).reset_index(drop = True)
df_treino_final

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,...,X170,X171,X172,X173,X174,X175,X176,X177,X178,LABEL_TARGET
0,23,31,33,27,22,5,-2,-6,-12,-17,...,-71,-82,-97,-105,-111,-101,-86,-66,-41,0
1,-109,-113,-119,-123,-124,-123,-120,-103,-92,-74,...,86,96,89,59,31,21,17,7,-5,0
2,-75,-80,-83,-83,-81,-74,-63,-44,-42,-54,...,-32,-40,-52,-71,-97,-127,-153,-172,-178,0
3,-70,-238,-370,-408,-360,-278,-196,-125,-55,5,...,247,161,113,93,108,44,-75,-225,-371,1
4,-126,-189,-223,-250,-273,-299,-331,-349,-367,-380,...,-351,-438,-792,-1126,-1185,-781,14,689,1122,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3207,-89,-76,-49,-32,-17,-12,-11,-13,-12,-20,...,-19,-15,-19,-28,-46,-58,-67,-71,-75,0
3208,-36,-26,-23,-22,-27,-31,-25,-23,-16,-12,...,182,145,143,156,194,242,282,321,338,1
3209,-173,-162,-140,-118,-91,-61,-37,-24,-17,-13,...,-253,-265,-278,-287,-295,-312,-327,-357,-427,0
3210,-16,-8,-1,0,14,12,-10,-39,-51,-64,...,-100,-106,-100,-94,-101,-95,-84,-66,-56,0


In [34]:
df_treino_final.shape

(3212, 179)

In [35]:
# Class counts of the balanced training data
df_treino_final.LABEL_TARGET.value_counts()

0    1606
1    1606
Name: LABEL_TARGET, dtype: int64

In [36]:
# Class proportions of the balanced training data

df_treino_final.LABEL_TARGET.value_counts(normalize=True)

0    0.5
1    0.5
Name: LABEL_TARGET, dtype: float64
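
<p style='text-align: justify;'>
The undersampling steps above (split by class, sample the minimum count from each, concatenate, shuffle) could be condensed into a single helper. The function below is a hypothetical sketch, not the code used in this notebook, and it will not necessarily select the same rows even with the same seed.
</p>

```python
import pandas as pd

def undersample(df, target="LABEL_TARGET", random_state=69):
    """Downsample every class to the size of the smallest one, then shuffle."""
    n_min = df[target].value_counts().min()
    # Draw n_min rows from each class
    balanced = df.groupby(target).sample(n=n_min, random_state=random_state)
    # Shuffle the rows and rebuild a clean 0..n-1 index
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Synthetic frame with 40 negative and 10 positive rows, for illustration only
demo = pd.DataFrame({"X1": range(50), "LABEL_TARGET": [0, 0, 0, 0, 1] * 10})
demo_balanced = undersample(demo)
print(demo_balanced["LABEL_TARGET"].value_counts().to_dict())
```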

<p style='text-align: justify;'>
  ________________________________________
</p>

<h4 style='text-align: justify;'>
Saving the Preprocessed Datasets
</h4>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [37]:
# Save all datasets to disk in CSV format
df_treino.to_csv('dados/dados_treino.csv', index = False)
df_treino_final.to_csv('dados/dados_treino_final.csv', index = False)
df_valid.to_csv('dados/dados_valid.csv', index = False)
df_teste.to_csv('dados/dados_teste.csv', index = False)

In [38]:
# Save the names of the input (predictor) columns for later use
with open('dados/colunas_entrada.sav', 'wb') as arquivo:
    pickle.dump(colunas_entrada, arquivo)
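
<p style='text-align: justify;'>
As a rough idea of how a later script might read these artifacts back, the sketch below does a save/load round trip in a temporary folder that stands in for <code>dados/</code>. The tiny frame and the names <code>folder</code>, <code>loaded_columns</code> and <code>loaded_df</code> are assumptions; only the file names mirror the cells above.
</p>

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Temporary folder standing in for the "dados/" directory
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    columns = ["X1", "X2"]
    demo = pd.DataFrame({"X1": [1, 2], "X2": [3, 4], "LABEL_TARGET": [0, 1]})

    # Save a CSV and the pickled column list, as in the cells above
    demo.to_csv(folder / "dados_treino_final.csv", index=False)
    with open(folder / "colunas_entrada.sav", "wb") as f:
        pickle.dump(columns, f)

    # Reload both artifacts
    with open(folder / "colunas_entrada.sav", "rb") as f:
        loaded_columns = pickle.load(f)
    loaded_df = pd.read_csv(folder / "dados_treino_final.csv")

# Separate the feature matrix from the target vector
X = loaded_df[loaded_columns].values
y = loaded_df["LABEL_TARGET"].values
print(X.shape, y.shape)
```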

<p style='text-align: justify;'>
  ________________________________________
</p>

<p style='text-align: justify;'>
Modeling is covered in the next script.
</p>

<p style='text-align: justify;'>
  ________________________________________
</p>

In [39]:
%reload_ext watermark
%watermark -a "Michela Camboim"

Author: Michela Camboim



In [40]:
%watermark -v -m

Python implementation: CPython
Python version       : 3.10.13
IPython version      : 8.15.0

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
CPU cores   : 8
Architecture: 64bit



In [41]:
%watermark --iversions

pandas: 1.5.3
numpy : 1.25.2

