# Project - Credit for Real Estate Financing

## General Guidelines

The end-of-course project is divided into three parts and serves as the assessment for the Data Engineering, Data Science, and AWS modules.

## Context

PyCoders Ltda., increasingly specialized in Artificial Intelligence and Data Science, was approached by a fintech to develop a credit-granting project for real estate. The project is expected to produce a score that separates good payers from bad payers as sharply as possible. To that end, a dataset was provided with thousands of past loan cases and a range of customer characteristics. A model that performs this classification must be delivered. For contractual reasons, payment will be based on performance (ROC AUC).
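Since payment depends on ROC AUC, it helps to see what the metric measures. The sketch below is a plain-Python illustration, not the evaluation code used for the contract. It computes ROC AUC through the rank-sum formulation: the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is an assumption made for this example.

```python
def roc_auc(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    y_true holds 0/1 labels and y_score the predicted scores.
    """
    pairs = sorted(zip(y_score, y_true))
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    # Give tied scores the average of the 1-based ranks they occupy.
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        n_pos_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * n_pos_in_group
        i = j

    # U statistic of the positive class, normalized by the number of pairs.
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.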


## Dataset

The datasets contain registration information, credit history, and financial balances for a range of customers. The data is split into train and test sets, all in CSV format. All modeling, validation, and evaluation must be done on the train set, subdividing it as you see fit. A file describing the explanatory variables is also provided to help with the project.
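One possible way to subdivide the train set is a stratified hold-out split, which keeps the proportion of each class roughly the same in both parts. The sketch below uses only the standard library. The function name, the 20% validation fraction, and the seed are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx), preserving the class proportions."""
    rng = random.Random(seed)

    # Group the row positions by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * valid_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Usage with a toy imbalanced target: 80 zeros and 20 ones.
train_idx, valid_idx = stratified_split([0] * 80 + [1] * 20)
print(len(train_idx), len(valid_idx))  # 80 20
```

The returned positions can then be used to select rows, for example with `df_train.iloc[train_idx]`.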

[Download here](https://s3-sa-east-1.amazonaws.com/lcpi/0694c90a-7782-47f7-8bbc-e611d31f9f21.zip)



# Part 1: Data Engineering

**Preparation:** Save the `.csv` files to `/FileStore/tables/projeto_credito/` without changing their names.

![image.png](attachment:image.png)

## Before modeling

1. Create a `data flow` in Databricks for the datasets that will be used.
    1. Load the raw data into the first layer.
    1. Save the transformations / cleaning steps in the second layer.
    1. Create a pipeline for the process.
        1. In this case, the pipeline is a single script that generates all the tables above.
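The layered flow above can be sketched as a single function that rebuilds both layers in one run. This is a plain-Python illustration, not Databricks code: local folders stand in for the DBFS layers, and the folder names `layer1_raw` and `layer2_clean`, the function names, and the cleaning rule (strip whitespace, drop exact duplicate rows) are all assumptions made for the example.

```python
import csv
import shutil
from pathlib import Path


def clean_rows(rows):
    """Example cleaning step: strip whitespace and drop exact duplicate rows."""
    seen = set()
    for row in rows:
        cleaned = tuple(cell.strip() for cell in row)
        if cleaned not in seen:
            seen.add(cleaned)
            yield cleaned


def run_pipeline(source_dir, lake_dir):
    """Rebuild every table of layers 1 and 2 from the source CSV files."""
    source_dir, lake_dir = Path(source_dir), Path(lake_dir)
    raw_dir = lake_dir / "layer1_raw"      # hypothetical folder name
    clean_dir = lake_dir / "layer2_clean"  # hypothetical folder name
    raw_dir.mkdir(parents=True, exist_ok=True)
    clean_dir.mkdir(parents=True, exist_ok=True)

    for src in sorted(source_dir.glob("*.csv")):
        # Layer 1: raw data, copied without any change.
        shutil.copy(src, raw_dir / src.name)

        # Layer 2: transformed / cleaned data, derived from layer 1.
        with open(raw_dir / src.name, newline="") as fin, \
                open(clean_dir / src.name, "w", newline="") as fout:
            csv.writer(fout).writerows(clean_rows(csv.reader(fin)))
    return raw_dir, clean_dir
```

Because every table is derived from the source files inside one function, running the script again reproduces all layers from scratch, which is the property the delivery rules ask for.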

## During modeling

1. Select the relevant columns and the engineered features, and save them in the third layer.

1. Save the model versions in `pickle` format on DBFS.

1. Keep track of the versions by creating a table in the format below.

**ATTENTION:** Since we are on the Community edition, remember to export the table below to DBFS before the session ends.

|id_modelo|nome_modelo|data_treino|método|roc_auc|tempo_de_treino(s)|hyperparametros|path_to_pickle|
|---|---|---|---|---|---|---|---|
|1|RandomForestRapida|2022-02-22|Random Forest Simples|0.76|124|\[max_depth=4, ...\]|/FileStore/tables/modelos/...|
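A minimal sketch of how the version table could be maintained follows. It pickles the model and appends one row to a CSV registry that uses the column names from the table above. The function name `register_model`, the file naming scheme, and the local paths are assumptions. On Databricks the paths would point to DBFS instead.

```python
import csv
import pickle
from datetime import date
from pathlib import Path

REGISTRY_COLUMNS = [
    "id_modelo", "nome_modelo", "data_treino", "método", "roc_auc",
    "tempo_de_treino(s)", "hyperparametros", "path_to_pickle",
]


def register_model(model, name, method, roc_auc, train_seconds, params,
                   models_dir, registry_csv):
    """Pickle `model` and append its metadata to the registry CSV."""
    models_dir, registry_csv = Path(models_dir), Path(registry_csv)
    models_dir.mkdir(parents=True, exist_ok=True)

    # Load the rows registered so far; the next id continues the sequence.
    rows = []
    if registry_csv.exists():
        with open(registry_csv, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
    model_id = len(rows) + 1

    pickle_path = models_dir / f"{model_id}_{name}.pkl"
    with open(pickle_path, "wb") as f:
        pickle.dump(model, f)

    rows.append({
        "id_modelo": model_id,
        "nome_modelo": name,
        "data_treino": date.today().isoformat(),
        "método": method,
        "roc_auc": roc_auc,
        "tempo_de_treino(s)": train_seconds,
        "hyperparametros": str(params),
        "path_to_pickle": str(pickle_path),
    })
    with open(registry_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REGISTRY_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return model_id
```

Writing the registry to a file after every training run also covers the warning above about exporting the table before the session ends.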

## After modeling

1. Save the train and validation data in a table (so that results can be reproduced in the future).
    1. Add a column with the id of the chosen model for reference.
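The step above could look like the following pandas sketch, which stacks the two splits into one table, marks which split each row came from, and records the id of the chosen model. The column name `split` and the function name are assumptions. `id_modelo` follows the naming of the version table.

```python
import pandas as pd


def tag_splits(train_df, valid_df, model_id):
    """Combine train and validation rows and tag them with the model id."""
    combined = pd.concat(
        [train_df.assign(split="train"), valid_df.assign(split="validation")],
        ignore_index=True,
    )
    combined["id_modelo"] = model_id
    return combined


# Usage with toy data.
train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
valid = pd.DataFrame({"SK_ID_CURR": [4], "TARGET": [0]})
print(tag_splits(train, valid, model_id=1))
```

The resulting frame can be saved as a single table, so the exact rows behind a given model can be recovered later.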

## Delivery Rules

1. A (Databricks) notebook with the pipeline that generates the tables of the *Before modeling* phase. The notebook must run end to end in a single pass!

1. A `.csv` file with the contents of the table generated in the *During modeling* phase.

1. A notebook that generates the train and validation tables of the *After modeling* step. The notebook must run end to end in a single pass!

> **IMPORTANT:** Since there will be no project presentation (and we don't want to ask you to record a video explaining the notebook, haha), it is essential that the notebook be organized and commented.

# Part 2: Data Science

## Mandatory Project Requirements

1. **Exploratory Data Analysis:** descriptive analysis of the numerical and categorical data, plus plots (of your choice).
2. **Data Cleaning:** the dataset contains missing values. You must therefore clean the data, either removing the missing values or filling them with sensible values.
3. **Conversion of categorical variables**
4. **Sample balancing:** since the dataset has many samples, you can use NearMiss to perform *under-sampling*.
5. **Machine Learning:** apply an ML algorithm of your choice, splitting your dataset into train and test, and obtain the `roc_auc_score` for both (train and test).
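To make the balancing step concrete, the sketch below performs plain random under-sampling with pandas: every class is reduced to the size of the smallest one. This is a simpler stand-in, not NearMiss itself, which picks majority samples by their distance to minority samples and is available in the imbalanced-learn package. The function name and the default seed are assumptions.

```python
import pandas as pd


def random_undersample(df, target="TARGET", random_state=42):
    """Randomly down-sample every class to the size of the smallest class."""
    n_minority = df[target].value_counts().min()
    balanced = df.groupby(target).sample(n=n_minority, random_state=random_state)
    # Shuffle so that the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Usage with a toy imbalanced frame: 8 zeros and 2 ones.
toy = pd.DataFrame({"TARGET": [0] * 8 + [1] * 2, "x": range(10)})
print(random_undersample(toy)["TARGET"].value_counts())
```

Balancing should be applied only to the rows used for fitting. The validation rows keep the original class proportions so that the measured ROC AUC reflects the real distribution.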

## Delivery Rules

1. A file with the predictions for the test set must be delivered.
    - It must be a DataFrame with two columns: the first is SK_ID_CURR and the second is the probability of default.
    - ⚠️ Deliver the predictions as the probability that default occurs.
2. The notebook with the steps applied to build the model (specified in the previous subsection) must be delivered.
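A small sketch of how the prediction file could be assembled is shown below. The second column name, `PROB_DEFAULT`, is an assumption, since the rules only fix the first column as `SK_ID_CURR`. With a scikit-learn style binary classifier whose classes are `[0, 1]`, the probability of default would typically come from `predict_proba(X)[:, 1]`.

```python
import numpy as np
import pandas as pd


def build_submission(ids, proba_default):
    """Build the two-column prediction frame: id and probability of default."""
    proba = np.asarray(proba_default, dtype=float)
    if len(ids) != len(proba):
        raise ValueError("ids and probabilities must have the same length")
    if ((proba < 0) | (proba > 1)).any():
        raise ValueError("probabilities must lie in [0, 1]")
    return pd.DataFrame({"SK_ID_CURR": list(ids), "PROB_DEFAULT": proba})


# Usage with made-up ids and probabilities.
submission = build_submission([100001, 100005], [0.08, 0.61])
print(submission)
```

The range check guards against delivering hard class labels or raw decision scores instead of probabilities.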

> **IMPORTANT:** Since there will be no project presentation (and we don't want to ask you to record a video explaining the notebook, haha), it is essential that the notebook be organized and commented.

<a href="https://s3-sa-east-1.amazonaws.com/lcpi/94acac51-8ce4-465b-a06d-a1cf19ec5d93.ipynb" style="display: block; background-color: #222; padding: 20px; text-align: center; font-weight: 600;">
Click here to download the notebook with the instructions.
</a>

### Importing libraries:

In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio

### Loading and exploring the dataset description:

In [2]:
# Load the dataset description
df_description = pd.read_csv('HomeCredit_columns_description.csv', encoding='latin1')

In [3]:
df_description.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   219 non-null    int64 
 1   Table        219 non-null    object
 2   Row          219 non-null    object
 3   Description  219 non-null    object
 4   Special      86 non-null     object
dtypes: int64(1), object(4)
memory usage: 8.7+ KB


In [4]:
df_description.head(10)

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,
5,8,application_{train|test}.csv,FLAG_OWN_REALTY,Flag if client owns a house or flat,
6,9,application_{train|test}.csv,CNT_CHILDREN,Number of children the client has,
7,10,application_{train|test}.csv,AMT_INCOME_TOTAL,Income of the client,
8,11,application_{train|test}.csv,AMT_CREDIT,Credit amount of the loan,
9,12,application_{train|test}.csv,AMT_ANNUITY,Loan annuity,


In [5]:
df_description.tail(10)

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
209,212,previous_application.csv,DAYS_TERMINATION,Relative to application date of current applic...,time only relative to the application
210,213,previous_application.csv,NFLAG_INSURED_ON_APPROVAL,Did the client requested insurance during the ...,
211,214,installments_payments.csv,SK_ID_PREV,ID of previous credit in Home credit related t...,hashed
212,215,installments_payments.csv,SK_ID_CURR,ID of loan in our sample,hashed
213,216,installments_payments.csv,NUM_INSTALMENT_VERSION,Version of installment calendar (0 is for cred...,
214,217,installments_payments.csv,NUM_INSTALMENT_NUMBER,On which installment we observe payment,
215,218,installments_payments.csv,DAYS_INSTALMENT,When the installment of previous credit was su...,time only relative to the application
216,219,installments_payments.csv,DAYS_ENTRY_PAYMENT,When was the installments of previous credit p...,time only relative to the application
217,220,installments_payments.csv,AMT_INSTALMENT,What was the prescribed installment amount of ...,
218,221,installments_payments.csv,AMT_PAYMENT,What the client actually paid on previous cred...,


In [6]:
df_description.Table.unique()

array(['application_{train|test}.csv', 'bureau.csv', 'bureau_balance.csv',
       'POS_CASH_balance.csv', 'credit_card_balance.csv',
       'previous_application.csv', 'installments_payments.csv'],
      dtype=object)

### Copying the dataset description for cleaning:

In [7]:
df_description_cuted = df_description.copy()

In [8]:
df_description_cuted.head()

Unnamed: 0.1,Unnamed: 0,Table,Row,Description,Special
0,1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
1,2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
2,5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
3,6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
4,7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


In [9]:
# Boolean mask: rows that describe application_{train|test}.csv
filter_0 = df_description_cuted['Table'] == 'application_{train|test}.csv'

In [10]:
df_description_cuted = df_description_cuted[filter_0]

In [11]:
df_description_cuted.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122 entries, 0 to 121
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   122 non-null    int64 
 1   Table        122 non-null    object
 2   Row          122 non-null    object
 3   Description  122 non-null    object
 4   Special      56 non-null     object
dtypes: int64(1), object(4)
memory usage: 5.7+ KB


In [12]:
df_description_cuted.Table.unique()

array(['application_{train|test}.csv'], dtype=object)

### Loading and exploring the train dataset:

In [13]:
# Load the train dataset
df_train = pd.read_csv('application_train.csv')
df_train

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,456162,0,Cash loans,F,N,N,0,112500.0,700830.0,22738.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,134978,0,Cash loans,F,N,N,0,90000.0,375322.5,14422.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,3.0
2,318952,0,Cash loans,M,Y,N,0,180000.0,544491.0,16047.0,...,0,0,0,0,0.0,0.0,0.0,1.0,1.0,3.0
3,361264,0,Cash loans,F,N,Y,0,270000.0,814041.0,28971.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
4,260639,0,Cash loans,F,N,Y,0,144000.0,675000.0,21906.0,...,0,0,0,0,0.0,0.0,0.0,10.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
246003,242114,0,Cash loans,F,N,Y,1,270000.0,1172470.5,34411.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,8.0
246004,452374,0,Cash loans,F,N,Y,0,180000.0,654498.0,27859.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
246005,276545,1,Revolving loans,M,N,N,1,112500.0,270000.0,13500.0,...,0,0,0,0,,,,,,
246006,236776,1,Cash loans,M,Y,N,3,202500.0,204858.0,17653.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246008 entries, 0 to 246007
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 229.0+ MB


In [15]:
# Set to see all columns of dataframe.
pd.set_option('display.max_columns', None)

In [16]:
# Set to see all rows of dataframe.
pd.set_option('display.max_rows', None)

In [17]:
df_train.head(8)

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,456162,0,Cash loans,F,N,N,0,112500.0,700830.0,22738.5,585000.0,Unaccompanied,Working,Incomplete higher,Single / not married,House / apartment,0.019689,-8676,-813,-4163.0,-1363,,1,1,1,1,0,0,Core staff,1.0,2,2,FRIDAY,17,0,0,0,1,1,0,Trade: type 2,,0.699373,0.171468,0.0619,0.0302,0.9762,0.6736,0.0055,0.0,0.1034,0.1667,0.0417,0.0,0.0504,0.0507,0.0,0.0,0.063,0.0313,0.9762,0.6864,0.0055,0.0,0.1034,0.1667,0.0417,0.0,0.0551,0.0528,0.0,0.0,0.0625,0.0302,0.9762,0.678,0.0055,0.0,0.1034,0.1667,0.0417,0.0,0.0513,0.0516,0.0,0.0,reg oper account,block of flats,0.0399,Block,No,0.0,0.0,0.0,0.0,-589.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,134978,0,Cash loans,F,N,N,0,90000.0,375322.5,14422.5,324000.0,Unaccompanied,Commercial associate,Secondary / secondary special,Married,House / apartment,0.025164,-13583,-223,-3554.0,-3287,,1,1,0,1,0,0,High skill tech staff,2.0,2,2,MONDAY,11,0,0,0,0,0,0,Business Entity Type 3,0.541385,0.199651,0.768808,0.0227,0.0566,0.9806,0.7348,0.0161,0.0,0.1034,0.0417,0.0833,0.0133,0.0185,0.0184,0.0,0.0,0.0231,0.0587,0.9806,0.7452,0.0162,0.0,0.1034,0.0417,0.0833,0.0136,0.0202,0.0192,0.0,0.0,0.0229,0.0566,0.9806,0.7383,0.0162,0.0,0.1034,0.0417,0.0833,0.0135,0.0188,0.0187,0.0,0.0,reg oper account,block of flats,0.0158,Block,No,0.0,0.0,0.0,0.0,-1409.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,3.0
2,318952,0,Cash loans,M,Y,N,0,180000.0,544491.0,16047.0,454500.0,Unaccompanied,Working,Secondary / secondary special,Married,House / apartment,0.035792,-13993,-6202,-7971.0,-4175,9.0,1,1,1,1,0,0,Managers,2.0,2,2,THURSDAY,15,0,0,0,0,0,0,Business Entity Type 1,,0.70488,0.626304,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,-675.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,1.0,1.0,3.0
3,361264,0,Cash loans,F,N,Y,0,270000.0,814041.0,28971.0,679500.0,Unaccompanied,Pensioner,Secondary / secondary special,Married,House / apartment,0.04622,-22425,365243,-11805.0,-1732,,1,0,0,1,1,0,,2.0,1,1,TUESDAY,9,0,0,0,0,0,0,XNA,,0.724576,0.810618,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0121,,No,2.0,0.0,2.0,0.0,-1588.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
4,260639,0,Cash loans,F,N,Y,0,144000.0,675000.0,21906.0,675000.0,Unaccompanied,Working,Secondary / secondary special,Separated,House / apartment,0.026392,-18839,-2763,-5069.0,-2381,,1,1,0,1,1,0,Laborers,1.0,2,2,FRIDAY,16,0,0,0,0,0,0,Transport: type 4,0.592466,0.70631,0.331251,0.1907,0.1802,0.9891,0.8504,0.0344,0.0,0.4483,0.1667,0.2083,0.2751,0.1555,0.206,0.0,0.0,0.1943,0.187,0.9891,0.8563,0.0348,0.0,0.4483,0.1667,0.2083,0.2814,0.1699,0.2146,0.0,0.0,0.1926,0.1802,0.9891,0.8524,0.0347,0.0,0.4483,0.1667,0.2083,0.2799,0.1582,0.2097,0.0,0.0,reg oper account,block of flats,0.162,Panel,No,0.0,0.0,0.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,10.0,0.0,0.0
5,270576,0,Cash loans,M,Y,Y,0,157500.0,630000.0,30307.5,630000.0,Family,Working,Higher education,Married,House / apartment,0.025164,-11057,-1690,-5538.0,-3274,3.0,1,1,1,1,1,0,Laborers,2.0,2,2,TUESDAY,10,0,0,0,0,1,1,Self-employed,0.27727,0.192316,0.251239,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,-1158.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
6,378296,1,Cash loans,F,N,Y,0,72000.0,197820.0,15534.0,180000.0,Unaccompanied,Working,Secondary / secondary special,Widow,House / apartment,0.028663,-18011,-5353,-10822.0,-834,,1,1,1,1,1,1,,1.0,2,2,THURSDAY,11,0,0,0,0,1,1,Business Entity Type 3,0.512102,0.472204,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,0.0,4.0,0.0,0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,
7,110095,0,Cash loans,M,Y,Y,0,90000.0,271957.5,18927.0,252000.0,Unaccompanied,Working,Secondary / secondary special,Married,Rented apartment,0.00712,-8980,-385,-2409.0,-1453,22.0,1,1,0,1,0,0,Laborers,2.0,2,2,FRIDAY,14,0,0,0,0,0,0,Business Entity Type 3,0.105328,0.166232,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,1.0,2.0,1.0,-149.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,


In [62]:
df_train.dtypes

SK_ID_CURR                        int64
TARGET                            int64
NAME_CONTRACT_TYPE                int64
CODE_GENDER                       int64
FLAG_OWN_CAR                      int64
FLAG_OWN_REALTY                   int64
CNT_CHILDREN                      int64
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                 float64
NAME_INCOME_TYPE                  int64
NAME_EDUCATION_TYPE               int64
NAME_FAMILY_STATUS                int64
NAME_HOUSING_TYPE                 int64
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                        int64
DAYS_EMPLOYED                     int64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH                   int64
OWN_CAR_AGE                     float64
FLAG_MOBIL                        int64
FLAG_EMP_PHONE                    int64
FLAG_WORK_PHONE                   int64


In [19]:
# Check for duplicated rows and count them.
df_train.duplicated().sum()

0
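Beyond duplicates, the data-cleaning requirement also involves missing values, which are visible as empty cells in the `head` output earlier. A possible way to quantify them per column is sketched below. The function name `missing_report` and the output column names are assumptions.

```python
import pandas as pd


def missing_report(df):
    """Count and percentage of missing values for columns that have any."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    report = report[report["n_missing"] > 0]
    return report.sort_values("pct_missing", ascending=False)


# Usage with toy data.
toy = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None, 1, 1, 1]})
print(missing_report(toy))
```

A report like this can guide the choice between dropping a column, dropping rows, or filling values, depending on how much of each column is missing.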

### Converting categorical variables

In [20]:
# Check the unique values of the column:
df_train['CODE_GENDER'].unique()

array(['F', 'M', 'XNA'], dtype=object)

In [21]:
# Count the unique values of the column:
df_train['CODE_GENDER'].nunique()

3

In [22]:
# Map the categorical variables:
# Replace "F" with 0, "M" with 1, and "XNA" with 2
df_train['CODE_GENDER'] = df_train['CODE_GENDER'].map({'F': 0, 'M': 1, 'XNA': 2})

In [23]:
# Check the unique values of the column:
df_train['NAME_CONTRACT_TYPE'].unique()

array(['Cash loans', 'Revolving loans'], dtype=object)

In [24]:
# Replace "Cash loans" with 0 and "Revolving loans" with 1
df_train['NAME_CONTRACT_TYPE'] = df_train['NAME_CONTRACT_TYPE'].map({'Cash loans': 0, 'Revolving loans': 1})

In [25]:
# Check the unique values of the column:
df_train['FLAG_OWN_CAR'].unique()

array(['N', 'Y'], dtype=object)

In [26]:
# Replace "N" with 0 and "Y" with 1
df_train['FLAG_OWN_CAR'] = df_train['FLAG_OWN_CAR'].map({'N': 0, 'Y': 1})

In [27]:
# Check the unique values of the column:
df_train['FLAG_OWN_REALTY'].unique()

array(['N', 'Y'], dtype=object)

In [28]:
# Replace "N" with 0 and "Y" with 1
df_train['FLAG_OWN_REALTY'] = df_train['FLAG_OWN_REALTY'].map({'N': 0, 'Y': 1})

In [29]:
# Check the unique values of the column:
df_train['NAME_TYPE_SUITE'].unique()

array(['Unaccompanied', 'Family', 'Spouse, partner', 'Other_B',
       'Children', 'Other_A', nan, 'Group of people'], dtype=object)

In [30]:
# Replace the values with numbers (nan has no entry in the map, so it stays missing)
df_train['NAME_TYPE_SUITE'] = df_train['NAME_TYPE_SUITE'].map({
    'Unaccompanied': 0, 
    'Family': 1,
    'Spouse, partner': 2,
    'Other_B': 3,
    'Children': 4,
    'Other_A': 5,
    'Group of people': 6
})

In [31]:
# Check the unique values of the column:
df_train['NAME_INCOME_TYPE'].unique()

array(['Working', 'Commercial associate', 'Pensioner', 'State servant',
       'Businessman', 'Unemployed', 'Student', 'Maternity leave'],
      dtype=object)

In [32]:
# Replace the values with numbers
df_train['NAME_INCOME_TYPE'] = df_train['NAME_INCOME_TYPE'].map({
    'Working': 0, 
    'Commercial associate': 1,
    'Pensioner': 2,
    'State servant': 3,
    'Businessman': 4,
    'Unemployed': 5,
    'Student': 6,
    'Maternity leave': 7
})

In [33]:
# Check the unique values of the column:
df_train['NAME_EDUCATION_TYPE'].unique()

array(['Incomplete higher', 'Secondary / secondary special',
       'Higher education', 'Lower secondary', 'Academic degree'],
      dtype=object)

In [34]:
# Replace the values with numbers
df_train['NAME_EDUCATION_TYPE'] = df_train['NAME_EDUCATION_TYPE'].map({
    'Incomplete higher': 0, 
    'Secondary / secondary special': 1,
    'Higher education': 2,
    'Lower secondary': 3,
    'Academic degree': 4
})

In [35]:
# Check the unique values of the column:
df_train['NAME_FAMILY_STATUS'].unique()

array(['Single / not married', 'Married', 'Separated', 'Widow',
       'Civil marriage', 'Unknown'], dtype=object)

In [36]:
# Replace the values with numbers
df_train['NAME_FAMILY_STATUS'] = df_train['NAME_FAMILY_STATUS'].map({
    'Single / not married': 0, 
    'Married': 1,
    'Separated': 2,
    'Widow': 3,
    'Civil marriage': 4,
    'Unknown': 5
})

In [37]:
# Check the unique values of the column:
df_train['NAME_HOUSING_TYPE'].unique()

array(['House / apartment', 'Rented apartment', 'With parents',
       'Municipal apartment', 'Co-op apartment', 'Office apartment'],
      dtype=object)

In [38]:
# Replace the values with numbers
df_train['NAME_HOUSING_TYPE'] = df_train['NAME_HOUSING_TYPE'].map({
    'House / apartment': 0, 
    'Rented apartment': 1,
    'With parents': 2,
    'Municipal apartment': 3,
    'Co-op apartment': 4,
    'Office apartment': 5
})

In [39]:
# Check the unique values of the column:
df_train['OCCUPATION_TYPE'].unique()

array(['Core staff', 'High skill tech staff', 'Managers', nan, 'Laborers',
       'Drivers', 'Sales staff', 'Cleaning staff', 'Cooking staff',
       'Accountants', 'Low-skill Laborers', 'Security staff',
       'Realty agents', 'Private service staff', 'Medicine staff',
       'Secretaries', 'HR staff', 'Waiters/barmen staff', 'IT staff'],
      dtype=object)

In [59]:
# Count the unique values of the column (nan is not counted):
df_train['OCCUPATION_TYPE'].nunique()

18

In [61]:
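# Replace each of the 19 values returned by unique() (nan included) with an integer code from 0 to 18
df_train['OCCUPATION_TYPE'] = df_train['OCCUPATION_TYPE'].replace(df_train['OCCUPATION_TYPE'].unique(), np.arange(19))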
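In the cell above, `nan` is one of the 19 values returned by `unique()`, so missing occupations also receive an integer code and stop being recognizable as missing. If the intention is to keep them missing for the data-cleaning step, `pd.factorize` is one alternative: it numbers categories in order of appearance and marks missing entries with `-1`. The sketch below is an assumption about intent, not the notebook's method, and the helper name is hypothetical.

```python
import numpy as np
import pandas as pd


def encode_keep_missing(series):
    """Integer-encode a categorical column while leaving missing values as NaN."""
    codes, categories = pd.factorize(series)  # missing entries get code -1
    encoded = pd.Series(codes, index=series.index, name=series.name).astype("float64")
    encoded[codes == -1] = np.nan
    return encoded, list(categories)


# Usage with toy data.
toy = pd.Series(["Core staff", np.nan, "Managers", "Core staff"], name="OCCUPATION_TYPE")
encoded, categories = encode_keep_missing(toy)
print(encoded.tolist())
print(categories)
```

Returning the category list alongside the codes makes it possible to apply the same mapping to the test set later.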

In [40]:
# Check the unique values of the column:
df_train['WEEKDAY_APPR_PROCESS_START'].unique()

array(['FRIDAY', 'MONDAY', 'THURSDAY', 'TUESDAY', 'SATURDAY', 'WEDNESDAY',
       'SUNDAY'], dtype=object)

In [41]:
# Replace the values with numbers
df_train['WEEKDAY_APPR_PROCESS_START'] = df_train['WEEKDAY_APPR_PROCESS_START'].map({
    'FRIDAY': 0, 
    'MONDAY': 1,
    'THURSDAY': 2,
    'TUESDAY': 3,
    'SATURDAY': 4,
    'WEDNESDAY': 5,
    'SUNDAY': 6
})

In [42]:
# Check the unique values of the column:
df_train['ORGANIZATION_TYPE'].unique()

array(['Trade: type 2', 'Business Entity Type 3',
       'Business Entity Type 1', 'XNA', 'Transport: type 4',
       'Self-employed', 'Industry: type 9', 'Industry: type 3',
       'Trade: type 7', 'Police', 'School', 'Mobile', 'Housing',
       'Government', 'Construction', 'Bank', 'Other', 'Industry: type 11',
       'Trade: type 1', 'Medicine', 'Industry: type 7', 'Kindergarten',
       'Business Entity Type 2', 'Security Ministries', 'Electricity',
       'Industry: type 4', 'Trade: type 3', 'Agriculture', 'Military',
       'Trade: type 6', 'Hotel', 'Security', 'Legal Services',
       'Industry: type 1', 'Restaurant', 'Industry: type 12', 'Services',
       'Realtor', 'University', 'Industry: type 5', 'Transport: type 2',
       'Industry: type 2', 'Advertising', 'Transport: type 3',
       'Emergency', 'Culture', 'Postal', 'Telecom', 'Insurance',
       'Transport: type 1', 'Cleaning', 'Industry: type 10',
       'Trade: type 4', 'Industry: type 6', 'Religion',
       'Industry

In [43]:
# Count the unique values of the column:
df_train['ORGANIZATION_TYPE'].nunique()

58

In [57]:
# Replace each of the 58 unique values with an integer code from 0 to 57
df_train['ORGANIZATION_TYPE'] = df_train['ORGANIZATION_TYPE'].replace(df_train['ORGANIZATION_TYPE'].unique(), np.arange(58))

In [45]:
# Check the unique values of the column:
df_train['FONDKAPREMONT_MODE'].unique()

array(['reg oper account', nan, 'reg oper spec account',
       'org spec account', 'not specified'], dtype=object)

In [46]:
# Replace the values with numbers
df_train['FONDKAPREMONT_MODE'] = df_train['FONDKAPREMONT_MODE'].map({
    'reg oper account': 0, 
    'reg oper spec account': 1,
    'org spec account': 2,
    'not specified': 3
})

In [47]:
# Check the unique values of the column:
df_train['HOUSETYPE_MODE'].unique()

array(['block of flats', nan, 'specific housing', 'terraced house'],
      dtype=object)

In [48]:
# Replace the values with numbers
df_train['HOUSETYPE_MODE'] = df_train['HOUSETYPE_MODE'].map({
    'block of flats': 0, 
    'specific housing': 1,
    'terraced house': 2
})

In [49]:
# Check the unique values of the column:
df_train['WALLSMATERIAL_MODE'].unique()

array(['Block', nan, 'Panel', 'Stone, brick', 'Monolithic', 'Others',
       'Wooden', 'Mixed'], dtype=object)

In [50]:
# Replace the values with numbers
df_train['WALLSMATERIAL_MODE'] = df_train['WALLSMATERIAL_MODE'].map({
    'Block': 0, 
    'Panel': 1,
    'Stone, brick': 2,
    'Monolithic': 3,
    'Others': 4,
    'Wooden': 5,
    'Mixed': 6
})

In [51]:
# Check the unique values of the column:
df_train['EMERGENCYSTATE_MODE'].unique()

array(['No', nan, 'Yes'], dtype=object)

In [52]:
# Replace the values with numbers
df_train['EMERGENCYSTATE_MODE'] = df_train['EMERGENCYSTATE_MODE'].map({
    'No': 0, 
    'Yes': 1
})
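The integer codes assigned above follow the order in which values happen to appear, so they impose an arbitrary ordering on nominal categories such as `ORGANIZATION_TYPE`. Tree-based models usually tolerate this, but linear models may read the codes as magnitudes. One-hot encoding is a common alternative. The sketch below is only an illustration applied to the original string columns, and the helper name `one_hot` is an assumption.

```python
import pandas as pd


def one_hot(df, columns):
    """One-hot encode the given columns, with an extra indicator for missing values."""
    return pd.get_dummies(df, columns=columns, dummy_na=True, dtype=int)


# Usage with toy data: one numeric column and one categorical column.
toy = pd.DataFrame({
    "AMT_CREDIT": [700830.0, 375322.5, 544491.0],
    "HOUSETYPE_MODE": ["block of flats", "terraced house", None],
})
print(one_hot(toy, ["HOUSETYPE_MODE"]))
```

For columns with many distinct values this widens the table considerably, so it is a trade-off rather than a strict improvement.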

In [65]:
# Correlation matrix
df_train.corr()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,ORGANIZATION_TYPE,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,FONDKAPREMONT_MODE,HOUSETYPE_MODE,TOTALAREA_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR,1.0,-0.002557,0.002782,9.5e-05,0.001462,0.001826,-7.7e-05,-0.002651,0.000156,5.7e-05,0.000315,2.4e-05,0.004073,0.001284,-0.000502,-0.001402,-5e-06,-0.002144,0.001836,-0.00256,-0.000174,0.002083,0.003137,-0.001796,-7.3e-05,0.004524,0.003005,-0.000822,-0.00171,-0.00141,0.001137,0.000899,0.003371,-0.001359,0.000358,0.001389,0.002699,-0.001356,0.000401,0.001454,-0.001803,0.001319,0.00131,0.002684,0.003921,-0.000606,0.001998,0.008445,0.000348,0.006202,-0.001272,0.006101,0.002384,0.003364,0.005889,0.003987,-0.003799,0.006509,0.003675,-0.000165,0.002206,0.007934,0.000408,0.005658,-0.001925,0.005666,0.001214,0.003264,0.005839,0.003733,-0.003059,0.005261,0.004252,-0.000244,0.001856,0.008285,0.000684,0.00619,-0.001183,0.005798,0.002092,0.003641,0.006094,0.004139,-0.003972,0.00624,0.001367,-0.000402,0.004859,-0.002485,0.000428,-0.001942,-0.00201,-0.001981,-0.000751,0.000739,0.001499,-0.00393,-0.004449,-0.00081,0.001486,-0.002993,0.002007,0.001068,-0.000351,-0.002229,-0.001172,-0.001036,-0.002298,0.003168,-0.000251,0.001622,0.000213,0.001389,0.000877,0.000714,-0.00366,-0.003445,0.003948,-0.000783,0.001091,0.004165
TARGET,-0.002557,1.0,-0.031127,0.055201,-0.022705,-0.006532,0.019246,-0.002286,-0.031103,-0.0141,-0.04006,-0.00332,-0.059575,-0.043475,0.002038,0.018293,-0.036342,0.079414,-0.044862,0.041581,0.051572,0.036632,0.000599,0.045894,0.027436,-0.000284,-0.02361,-0.002106,0.024632,0.009386,0.05917,0.061518,0.002338,-0.02327,0.006782,0.006463,0.001755,0.045404,0.050255,0.031159,-0.001537,-0.158619,-0.160978,-0.179246,-0.029823,-0.0238,-0.010012,-0.022979,-0.017455,-0.033079,-0.020724,-0.04372,-0.031946,-0.011024,-0.026637,-0.033427,-0.004459,-0.014609,-0.027802,-0.021355,-0.009605,-0.022861,-0.015412,-0.031233,-0.018986,-0.042958,-0.03099,-0.010985,-0.025055,-0.031198,-0.002836,-0.013898,-0.029485,-0.023274,-0.009954,-0.02315,-0.017454,-0.032884,-0.020667,-0.043524,-0.031776,-0.011621,-0.026423,-0.033295,-0.004048,-0.014522,-0.001613,0.0102,-0.032268,0.022934,0.013178,0.007895,0.029969,0.007809,0.029613,0.054471,0.005109,0.044072,-0.002746,-0.00129,-0.02775,-0.001955,-0.007613,-0.003451,-0.001038,-0.005252,-0.000848,-0.011839,-0.009037,-0.007152,-0.010565,-0.003001,-0.008746,-0.002312,-0.000758,0.003712,0.000833,0.003549,-0.000118,-0.013337,-0.001611,0.02128
NAME_CONTRACT_TYPE,0.002782,-0.031127,1.0,-0.009287,0.002919,0.067938,0.029502,-0.003067,-0.220958,-0.240553,-0.185186,-0.007311,-0.036099,0.037618,-0.024542,0.01114,0.024816,0.086415,-0.054676,0.018844,0.053509,0.010789,0.000653,0.055189,-0.032418,-0.095635,-0.021522,-0.010695,0.001057,0.00955,-0.022004,-0.023538,0.002215,0.03483,0.020495,0.018037,0.009625,0.014086,0.002618,-0.006749,0.011212,-0.013955,0.015357,-0.004351,0.013038,0.005322,-0.001974,0.004647,0.008615,0.016782,-8.1e-05,0.025164,0.026734,0.003148,0.014226,0.016785,0.00026,0.016066,0.011662,0.004271,-0.001045,0.004245,0.007746,0.015835,-0.000744,0.023939,0.025541,0.002463,0.013202,0.015551,-0.001704,0.013732,0.012371,0.004744,-0.001589,0.004577,0.008378,0.01649,-0.000129,0.024913,0.02638,0.002979,0.0139,0.016464,-0.000545,0.015731,0.001624,-0.004165,0.016472,0.002149,-0.005898,-0.018118,-0.008997,-0.018274,-0.00692,0.060338,-0.002066,-0.479896,0.004506,0.030438,-0.09863,0.04265,-0.085476,-0.00682,0.010774,0.027471,-0.000924,-0.005974,0.000252,-0.007474,-0.023233,-0.000153,-0.007091,-0.005016,-0.007308,0.051607,0.000281,-0.004206,-0.016042,-0.012824,-0.027359,-0.051759
CODE_GENDER,9.5e-05,0.055201,-0.009287,1.0,0.345904,-0.045044,0.047867,0.068043,0.021541,0.076718,0.022385,-0.010765,-0.138914,-0.014204,-0.103523,0.028855,0.013274,0.147637,-0.155432,0.076948,0.00037,-0.003721,-0.002796,0.156643,0.032751,-0.004839,-0.019985,0.018529,-0.059491,0.080752,-0.015758,-0.014706,0.004756,0.006496,0.024042,0.102795,0.105416,0.049017,0.137711,0.132446,0.007103,-0.309074,-0.014666,-0.024243,0.018236,0.01044,0.002325,0.014142,0.017147,0.02583,0.00994,0.026432,0.022685,0.007869,0.019031,0.023028,0.007258,0.008623,0.015607,0.007506,0.002124,0.012923,0.013725,0.023231,0.007097,0.024907,0.021548,0.006373,0.017561,0.020494,0.006266,0.006923,0.017985,0.009536,0.002355,0.014125,0.016608,0.025748,0.009629,0.026045,0.022457,0.007518,0.019065,0.022958,0.007309,0.008131,0.002747,-0.00208,0.024517,-0.005708,-0.002339,-0.006876,-0.019365,-0.006804,-0.015937,0.025173,-0.000565,-0.086358,-0.002024,0.000937,-0.099448,-0.000667,0.248976,-0.009515,-6.4e-05,0.004574,0.000949,0.040401,0.000423,0.020299,-6e-06,0.002513,0.0234,0.005438,0.002375,0.022556,0.003562,0.000808,-0.000321,0.009282,-0.00809,-0.017664
FLAG_OWN_CAR,0.001462,-0.022705,0.002919,0.345904,1.0,-0.003238,0.101919,0.07584,0.117474,0.141329,0.121457,0.001155,-0.094904,0.062122,-0.062787,-0.016593,0.0434,0.128466,-0.153989,0.087096,0.014924,,-0.002809,0.154159,0.010789,-0.004829,-0.007606,0.032128,-0.032151,0.150033,-0.02339,-0.022096,-0.000691,0.013106,-0.000379,0.038588,0.045582,0.002008,0.075412,0.088225,0.02246,-0.056459,0.05484,-0.015335,0.022956,0.017099,0.008021,0.040918,0.02815,0.035539,0.009351,0.047579,0.034104,0.008769,0.019814,0.035158,0.004866,0.018522,0.020794,0.014323,0.007219,0.040468,0.024879,0.033485,0.007796,0.046496,0.032326,0.008459,0.017806,0.032883,0.004025,0.01566,0.022849,0.016143,0.007918,0.040872,0.027654,0.035302,0.009038,0.047479,0.03381,0.009088,0.01968,0.034883,0.004776,0.017759,0.005052,-0.007489,0.038419,-0.00528,-0.007348,0.003572,-0.014577,0.003411,-0.015098,-0.040096,-0.001885,-0.06913,-0.002917,-0.01416,-0.106315,0.001874,0.228244,-0.010659,-4.9e-05,-0.000245,0.003972,0.081364,0.002393,0.045022,0.002315,-0.002166,0.000853,0.002367,0.007423,0.006495,0.003004,0.000631,0.001181,0.020017,-0.011916,-0.036655
FLAG_OWN_REALTY,0.001826,-0.006532,0.067938,-0.045044,-0.003238,1.0,-0.001395,0.002667,-0.039594,-0.005769,-0.045744,0.043767,0.035568,-0.008168,0.024651,-0.193215,0.014627,-0.12079,0.070883,-0.027463,0.007436,0.001844,-0.001341,-0.071188,-0.114931,0.007666,-0.041912,0.028769,-0.007846,0.008535,0.000511,0.001056,-0.015137,-0.104339,-0.036674,-0.032242,-0.017906,-0.060979,-0.063503,-0.038135,-0.017308,0.081971,0.003067,0.03966,0.009816,0.009462,-0.00133,0.000742,0.001186,-9.6e-05,0.017712,0.005533,-0.002677,0.013505,0.005451,0.010375,-0.000789,0.004033,0.010632,0.010046,-0.001779,0.001953,0.003548,0.001034,0.018046,0.007219,-0.002356,0.014115,0.005565,0.012293,-0.000933,0.005478,0.009597,0.009702,-0.001217,0.00045,0.002158,-0.000104,0.017407,0.005724,-0.00322,0.013607,0.005386,0.010205,-0.000423,0.004097,-0.012055,-0.015795,0.01309,-0.01149,-0.012945,0.017738,0.008422,0.017567,0.009211,0.028304,0.002859,-0.036502,0.003284,-0.01209,0.041536,0.002818,-0.035966,-0.005374,0.002323,-0.034079,-0.001195,-0.056916,-0.054318,-0.034568,-0.093003,-0.014133,-0.087907,-0.019435,-0.02607,-0.000113,-0.003971,-0.008755,0.008028,-0.004382,0.018759,0.067363
CNT_CHILDREN,-7.7e-05,0.019246,0.029502,0.047867,0.101919,-0.001395,1.0,0.012578,0.001899,0.021743,-0.002142,0.027459,-0.140677,0.022276,-0.01297,0.018417,-0.022819,0.332396,-0.239866,0.183855,-0.026649,0.009654,0.001164,0.240789,0.055483,-0.001662,-0.030148,0.024111,0.026477,0.87865,0.023448,0.022913,0.002773,-0.007004,-0.013362,0.008033,0.014468,0.021459,0.071218,0.069586,0.081885,-0.140029,-0.017523,-0.042886,-0.012428,-0.009076,0.007272,0.028848,0.001188,-0.006529,-0.008178,-0.008682,-0.00916,-0.001092,-0.008123,-0.010139,0.003551,-0.000676,-0.011861,-0.009482,0.006631,0.028328,0.001168,-0.006076,-0.006687,-0.008778,-0.008359,-0.00052,-0.007498,-0.010396,0.003331,-0.001026,-0.012201,-0.009212,0.006789,0.028914,0.001574,-0.006138,-0.008242,-0.008329,-0.008538,-0.000792,-0.007213,-0.010171,0.003551,-0.00072,0.015144,0.000394,-0.008106,0.014738,0.00999,0.013514,-0.001896,0.013064,-0.00281,-0.006417,0.002519,0.056936,-0.004722,-0.016977,-0.157104,-0.001034,0.052212,-0.001885,-0.002016,-0.004736,0.000335,0.004719,-0.005795,0.00526,0.009435,0.000344,0.003919,0.001337,0.00252,-0.001497,-0.000244,-3.4e-05,-0.002542,-0.009664,-0.012027,-0.04199
AMT_INCOME_TOTAL,-0.002651,-0.002286,-0.003067,0.068043,0.07584,0.002667,0.012578,1.0,0.1428,0.175158,0.145293,-0.015985,-0.005681,0.061832,-0.010039,-0.00135,0.06852,0.025419,-0.058849,0.025251,0.007498,-0.113479,0.000332,0.058646,-0.015705,-0.007251,0.00016,0.034824,-0.025217,0.015526,-0.078316,-0.083806,-0.001011,0.033583,0.028894,0.05704,0.053165,0.003105,0.005292,0.006988,-0.00065,0.023339,0.054826,-0.028683,0.031519,0.015602,0.00493,0.038872,0.092662,0.040579,0.005238,0.054816,0.138555,-0.001716,0.109399,0.036229,0.027449,0.073345,0.027221,0.011432,0.004608,0.033611,0.077978,0.036778,0.002105,0.052553,0.13029,-0.003482,0.09443,0.031497,0.022574,0.05973,0.030897,0.014779,0.004942,0.038751,0.090854,0.039775,0.004702,0.054358,0.137312,-0.001986,0.107161,0.035571,0.026174,0.06941,0.00639,-0.004695,0.038393,-0.006799,-0.00691,-0.012208,-0.012253,-0.012147,-0.011859,-0.016981,-0.001224,-0.015013,-2e-06,0.000865,-0.041791,0.004658,0.065718,0.016698,-0.000153,0.002324,0.002585,0.020092,0.019517,0.009531,0.00648,0.001635,0.002815,0.00227,0.000302,-0.000856,0.000839,0.001767,0.002106,0.022914,0.004162,0.010191
AMT_CREDIT,0.000156,-0.031103,-0.220958,0.021541,0.117474,-0.039594,0.001899,0.1428,1.0,0.769821,0.987024,0.015011,0.029892,0.099088,-0.039748,-0.026919,0.101197,-0.055686,-0.064905,0.009959,-0.006013,-0.093821,0.001606,0.063568,-0.019602,0.023672,0.028638,0.017578,-0.029363,0.063104,-0.101041,-0.110395,0.002822,0.053171,0.02401,0.050633,0.051302,-0.027533,-0.019538,-6.9e-05,0.010184,0.168391,0.131836,0.043012,0.061772,0.039109,0.007264,0.036783,0.050831,0.080411,0.014253,0.103929,0.079694,0.007359,0.062324,0.072344,0.01844,0.040474,0.053898,0.030751,0.005673,0.034104,0.044146,0.074504,0.008629,0.100795,0.076367,0.003846,0.054035,0.064195,0.014811,0.034352,0.059974,0.037311,0.006999,0.036448,0.05008,0.078837,0.013279,0.103188,0.079087,0.006344,0.060476,0.071033,0.017429,0.038151,0.008643,-0.008362,0.072984,-0.019462,-0.011003,0.001019,-0.020874,0.000964,-0.023172,-0.074416,0.009392,0.095867,-0.001592,-0.010593,-0.046133,-0.002636,0.082207,0.022547,-0.001946,0.02872,0.004302,0.054264,0.050899,0.031969,0.064496,0.01068,0.034726,0.02119,0.032827,-0.015101,-0.003954,0.004002,-0.002601,0.054316,0.017864,-0.049354
AMT_ANNUITY,5.7e-05,-0.0141,-0.240553,0.076718,0.141329,-0.005769,0.021743,0.175158,0.769821,1.0,0.7749,0.010337,0.000163,0.105623,-0.054291,-0.015399,0.119601,0.009139,-0.102962,0.039346,0.012427,-0.096567,0.000169,0.102174,-0.024975,0.022136,0.012581,0.071659,-0.030107,0.076123,-0.128411,-0.141656,0.00159,0.052607,0.04216,0.078562,0.073153,-0.006124,-0.000347,0.008643,0.001291,0.119544,0.126699,0.029117,0.07769,0.045405,0.014859,0.03552,0.060095,0.101399,0.014961,0.130204,0.099406,0.011536,0.078228,0.090262,0.025125,0.051249,0.067412,0.03514,0.014084,0.03194,0.051743,0.093136,0.007291,0.12597,0.094026,0.006599,0.068035,0.080104,0.020021,0.042726,0.075383,0.043426,0.014567,0.034991,0.059435,0.099441,0.013729,0.129063,0.098185,0.010325,0.075857,0.088635,0.024474,0.048433,0.004859,-0.007689,0.091318,-0.02025,-0.011578,-0.01108,-0.021595,-0.010907,-0.022787,-0.063573,0.003691,0.102711,-0.002396,-0.005344,-0.073169,-0.000944,0.128826,0.032918,-0.002017,-0.004267,0.000942,0.026726,0.038437,0.014403,0.00878,0.003119,-0.009436,0.004279,0.013219,-0.01598,0.003297,0.001617,0.012913,0.039495,0.012006,-0.013103


### Machine Learning:

#### Applying logistic regression to the training dataset:

In [63]:
from sklearn.linear_model import LogisticRegression

In [64]:
clf = LogisticRegression()

In [None]:
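# 'Fare' and 'Survived' are Titanic columns and do not exist in this dataset.
# Assumption: AMT_CREDIT is used here as a single-feature baseline; the label is TARGET.
X = df[['AMT_CREDIT']]
y = df['TARGET']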
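
Since the project is evaluated on ROC AUC, a single-feature baseline like the one above can be fit and scored on a held-out slice of the training data. The following is a minimal sketch under stated assumptions, not the project's actual pipeline: `df_demo` and `FEATURE_DEMO` are hypothetical names, the data is synthetic, and the 75/25 stratified split is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training set (hypothetical data, not the real CSV):
# one numeric feature whose mean is shifted by 1.0 for the positive class.
rng = np.random.default_rng(42)
n_rows = 2000
target = rng.integers(0, 2, size=n_rows)
feature = rng.normal(loc=target * 1.0, scale=1.0)
df_demo = pd.DataFrame({"FEATURE_DEMO": feature, "TARGET": target})

X_demo = df_demo[["FEATURE_DEMO"]]
y_demo = df_demo["TARGET"]

# Hold out part of the training data for validation, keeping the class ratio.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf_demo = LogisticRegression()
clf_demo.fit(X_train, y_train)

# ROC AUC needs scores, not hard labels: use the probability of class 1.
valid_scores = clf_demo.predict_proba(X_valid)[:, 1]
valid_auc = roc_auc_score(y_valid, valid_scores)
print(f"Validation ROC AUC: {valid_auc:.3f}")
```

Passing `predict_proba` scores rather than `predict` labels matters here, because ROC AUC measures how well the model ranks positive cases above negative ones.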

### Loading and exploring the application_test_student dataset:

In [54]:
# Load the application_test_student dataset
#df_test = pd.read_csv('application_test_student.csv')
#df_test.head()

In [55]:
#df_test.dtypes
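
The loading and inspection steps that are commented out above could look roughly like the sketch below. It reads a tiny in-memory CSV instead of the real file so that it runs anywhere; the rows, the `ID_DEMO` column, and the values are made up for illustration, and `df_test_demo` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for application_test_student.csv.
csv_text = io.StringIO(
    "ID_DEMO,NAME_CONTRACT_TYPE,AMT_CREDIT\n"
    "1,A,100000.0\n"
    "2,B,\n"
)

df_test_demo = pd.read_csv(csv_text)

# First look at the data: sample rows, column types, and missing values.
print(df_test_demo.head())
dtypes = df_test_demo.dtypes
missing_counts = df_test_demo.isna().sum()
print(dtypes)
print(missing_counts)
```

For the real file, replacing `csv_text` with the CSV path in `pd.read_csv` should be the only change needed.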

# Part 3: AWS

**TBD**