## Module: Analytics Engineering

## Lesson 2 - Exercise 1


#### Build a pipeline with "bronze", "silver", and "gold" layers for a dataset of bank customer information. The dataset is available for download at: "https://www.kaggle.com/datasets/parisrohan/credit-score-classification".

#### Perform the necessary data cleaning and ensure data quality using all the concepts presented so far.

#### Store all layers in a PostgreSQL database.
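
The three layers can be persisted with pandas' `DataFrame.to_sql`. The sketch below is only an illustration under stated assumptions, not the exercise's reference solution: the helper `save_layer`, the `<layer>_credit_score` table naming scheme, and the sample rows are all hypothetical, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a server. Against PostgreSQL, a SQLAlchemy engine would be passed as `con` instead.

```python
import sqlite3

import pandas as pd

LAYERS = ("bronze", "silver", "gold")


def save_layer(df: pd.DataFrame, layer: str, con) -> str:
    """Write df to a table named '<layer>_credit_score' and return the table name."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    table = f"{layer}_credit_score"  # hypothetical naming scheme
    # Replace the table on each run so the pipeline can be re-executed safely.
    df.to_sql(table, con, if_exists="replace", index=False)
    return table


# SQLite stand-in (assumption); with PostgreSQL, `con` would be a SQLAlchemy engine.
con = sqlite3.connect(":memory:")

# Made-up rows that mimic the raw file, including a malformed age.
bronze = pd.DataFrame({"Customer_ID": ["CUS_0xd40", "CUS_0x21b1"], "Age": ["23", "28_"]})
# Silver: cleaned and typed.
silver = bronze.assign(Age=pd.to_numeric(bronze["Age"].str.strip("_")))
# Gold: an aggregate ready for consumption.
gold = pd.DataFrame({"mean_age": [silver["Age"].mean()]})

for name, frame in zip(LAYERS, (bronze, silver, gold)):
    save_layer(frame, name, con)

# Read the row count of each layer back from the database.
row_counts = {
    name: int(
        pd.read_sql_query(f"SELECT COUNT(*) AS n FROM {name}_credit_score", con)["n"].iloc[0]
    )
    for name in LAYERS
}
print(row_counts)
```

Keeping the raw values as text in the bronze table and typing them only in silver means the original file can always be reconstructed from the database.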


#### Data Dictionary

- **Age:** The person's age.
- **Annual_Income:** The person's annual income.
- **Monthly_Inhand_Salary:** The person's monthly base salary (net, in hand).
- **Num_Bank_Accounts:** The number of bank accounts the person holds.
- **Num_Credit_Card:** The number of other credit cards the person holds.
- **Interest_Rate:** The interest rate on the credit card (percent).
- **Num_of_Loan:** The number of loans taken from the bank.
- **Delay_from_due_date:** The average number of days of delay past the payment due date (days).
- **Num_of_Delayed_Payment:** The average number of payments the person has delayed.
- **Changed_Credit_Limit:** The percentage change in the credit card limit (percent).
- **Num_Credit_Inquiries:** The number of credit card inquiries.
- **Credit_Mix:** The classification of the person's mix of credit (Bad, Standard, Good).
- **Outstanding_Debt:** The remaining debt still to be paid.
- **Credit_Utilization_Ratio:** The credit card utilization ratio (percent).
- **Credit_History_Age:** The age of the person's credit history (stored in the raw file as text such as `22 Years and 9 Months`).
- **Payment_of_Min_Amount:** Whether the person paid only the minimum amount.
- **Total_EMI_per_month:** The monthly EMI (equated monthly installment) payments.
- **Amount_invested_monthly:** The amount the customer invests each month.
- **Monthly_Balance:** The customer's monthly balance.


In [None]:
!pip install ydata_profiling

In [6]:
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw file and preview the first three rows.
df = pd.read_csv("data/test.csv")
df.head(3)

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,0x160a,CUS_0xd40,September,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,2022.0,Good,809.98,35.030402,22 Years and 9 Months,No,49.574949,236.64268203272132,Low_spent_Small_value_payments,186.26670208571767
1,0x160b,CUS_0xd40,October,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.053114,22 Years and 10 Months,No,49.574949,21.465380264657146,High_spent_Medium_value_payments,361.444003853782
2,0x160c,CUS_0xd40,November,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.811894,,No,49.574949,148.23393788500923,Low_spent_Medium_value_payments,264.67544623343
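
The preview shows `Credit_History_Age` stored as free text (`22 Years and 9 Months`) with some missing values. One possible way to make it numeric is to extract the two numbers and express the result in months. This is a minimal sketch; the function name `credit_history_to_months` and the choice of months as the unit are assumptions, not part of the exercise.

```python
import pandas as pd


def credit_history_to_months(series: pd.Series) -> pd.Series:
    """Convert text like '22 Years and 9 Months' to a nullable integer month count."""
    parts = series.str.extract(r"(?P<years>\d+)\s+Years?\s+and\s+(?P<months>\d+)\s+Months?")
    years = pd.to_numeric(parts["years"])
    months = pd.to_numeric(parts["months"])
    # Rows that are missing or do not match the pattern stay missing (<NA>).
    return (years * 12 + months).astype("Int64")


# Values copied from the preview above, plus one missing entry.
sample = pd.Series(
    ["22 Years and 9 Months", "22 Years and 10 Months", None, "23 Years and 0 Months"]
)
history_months = credit_history_to_months(sample)
print(history_months.tolist())
```

The nullable `Int64` dtype keeps the column integer-typed while still allowing missing entries, which a plain `int64` column cannot hold.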


In [7]:
df.describe()

| | Monthly_Inhand_Salary | Num_Bank_Accounts | Num_Credit_Card | Interest_Rate | Delay_from_due_date | Num_Credit_Inquiries | Credit_Utilization_Ratio | Total_EMI_per_month |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 42502.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 48965.0 | 50000.0 | 50000.0 |
| mean | 4182.004291 | 16.83826 | 22.92148 | 68.77264 | 21.05264 | 30.0802 | 32.279581 | 1491.304305 |
| std | 3174.109304 | 116.396848 | 129.314804 | 451.602363 | 14.860397 | 196.984121 | 5.106238 | 8595.647887 |
| min | 303.645417 | -1.0 | 0.0 | 1.0 | -5.0 | 0.0 | 20.509652 | 0.0 |
| 25% | 1625.188333 | 3.0 | 4.0 | 8.0 | 10.0 | 4.0 | 28.06104 | 32.222388 |
| 50% | 3086.305 | 6.0 | 5.0 | 13.0 | 18.0 | 7.0 | 32.28039 | 74.733349 |
| 75% | 5934.189094 | 7.0 | 7.0 | 20.0 | 28.0 | 10.0 | 36.468591 | 176.157491 |
| max | 15204.633333 | 1798.0 | 1499.0 | 5799.0 | 67.0 | 2593.0 | 48.540663 | 82398.0 |
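
The summary exposes implausible extremes: `Num_Bank_Accounts` reaches 1798 although its 75th percentile is 7, `Interest_Rate` reaches 5799, and some minimums are negative. One generic way to flag such values is the interquartile-range (IQR) rule. The sketch below is an assumption about how that could be applied here, not the exercise's prescribed method; the helper `mask_outliers_iqr` and the sample values are hypothetical, and domain rules (for example, "an account count cannot be negative") may be a better fit for some columns.

```python
import pandas as pd


def mask_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a copy where values outside [Q1 - k*IQR, Q3 + k*IQR] become NaN."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # `where` keeps values that satisfy the condition and blanks the rest.
    return series.where(series.between(lower, upper))


# Made-up values echoing the summary: mostly small counts plus one extreme.
accounts = pd.Series([3, 6, 7, 5, 4, 1798, 6, 3])
cleaned = mask_outliers_iqr(accounts)
print(cleaned.tolist())
```

Masking to NaN instead of dropping rows keeps the other columns of the record available and leaves the imputation decision for a later step.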


## Initial report


In [8]:
# Profile the raw data before any cleaning.
profile = ProfileReport(df, title="Pandas Profiling Report")

# Save the report as a standalone HTML file.
profile.to_file("resultados_inicial.html")

Summarize dataset: 100%|██████████| 101/101 [00:24<00:00,  4.05it/s, Completed]                                               
Generate report structure: 100%|██████████| 1/1 [00:28<00:00, 28.04s/it]
Render HTML: 100%|██████████| 1/1 [00:03<00:00,  3.76s/it]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 45.77it/s]
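
As a lightweight complement to the HTML report, the same headline figures (type, missing values, distinct values) can be collected in a small table with plain pandas. This is a sketch under assumptions: `quality_summary` is a hypothetical helper and the sample frame is made up to resemble the raw data.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing share (%), and number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
            "n_unique": df.nunique(dropna=True),
        }
    )


# Made-up sample with a malformed age and missing salaries.
sample = pd.DataFrame(
    {
        "Age": ["23", "24_", "24", None],
        "Monthly_Inhand_Salary": [1824.84, None, None, 3037.99],
    }
)
summary = quality_summary(sample)
print(summary)
```

Running the same helper again on the cleaned frame gives a before/after comparison for each layer.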


In [13]:
# Columns that should be numeric but appear as `object` contain malformed values.
print(df.dtypes)
display(df)

ID                           object
Customer_ID                  object
Month                        object
Name                         object
Age                          object
SSN                          object
Occupation                   object
Annual_Income                object
Monthly_Inhand_Salary       float64
Num_Bank_Accounts             int64
Num_Credit_Card               int64
Interest_Rate                 int64
Num_of_Loan                  object
Type_of_Loan                 object
Delay_from_due_date           int64
Num_of_Delayed_Payment       object
Changed_Credit_Limit         object
Num_Credit_Inquiries        float64
Credit_Mix                   object
Outstanding_Debt             object
Credit_Utilization_Ratio    float64
Credit_History_Age           object
Payment_of_Min_Amount        object
Total_EMI_per_month         float64
Amount_invested_monthly      object
Payment_Behaviour            object
Monthly_Balance              object
dtype: object


Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,0x160a,CUS_0xd40,September,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,...,2022.0,Good,809.98,35.030402,22 Years and 9 Months,No,49.574949,236.64268203272135,Low_spent_Small_value_payments,186.26670208571772
1,0x160b,CUS_0xd40,October,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.053114,22 Years and 10 Months,No,49.574949,21.465380264657146,High_spent_Medium_value_payments,361.44400385378196
2,0x160c,CUS_0xd40,November,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,...,4.0,Good,809.98,33.811894,,No,49.574949,148.23393788500925,Low_spent_Medium_value_payments,264.67544623342997
3,0x160d,CUS_0xd40,December,Aaron Maashoh,24_,821-00-0265,Scientist,19114.12,,3,...,4.0,Good,809.98,32.430559,23 Years and 0 Months,No,49.574949,39.08251089460281,High_spent_Medium_value_payments,343.82687322383634
4,0x1616,CUS_0x21b1,September,Rick Rothackerj,28,004-07-5839,_______,34847.84,3037.986667,2,...,5.0,Good,605.03,25.926822,27 Years and 3 Months,No,18.816215,39.684018417945296,High_spent_Large_value_payments,485.2984336755923
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0x25fe5,CUS_0x8600,December,Sarah McBridec,4975,031-35-0942,Architect,20002.88,1929.906667,10,...,12.0,_,3571.7,34.780553,,Yes,60.964772,146.48632477751087,Low_spent_Small_value_payments,275.53956951573343
49996,0x25fee,CUS_0x942c,September,Nicks,25,078-73-5990,Mechanic,39628.99,,4,...,7.0,Good,502.38,27.758522,31 Years and 11 Months,NM,35.104023,181.44299902757518,Low_spent_Small_value_payments,409.39456169535066
49997,0x25fef,CUS_0x942c,October,Nicks,25,078-73-5990,Mechanic,39628.99,3359.415833,4,...,7.0,Good,502.38,36.858542,32 Years and 0 Months,No,35.104023,__10000__,Low_spent_Large_value_payments,349.7263321025098
49998,0x25ff0,CUS_0x942c,November,Nicks,25,078-73-5990,Mechanic,39628.99,,4,...,7.0,Good,502.38,39.139840,32 Years and 1 Months,No,35.104023,97.59857973344877,High_spent_Small_value_payments,463.23898098947717


In [10]:
# Work on a copy so the raw DataFrame stays untouched.
df_cln = df.copy()
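
The listing above shows why several numeric columns were read as `object`: values such as `24_` and `__10000__` carry stray underscores, and text columns use underscore-only placeholders (`_______`, `_`). A minimal cleaning sketch follows. The helpers `to_clean_numeric` and `clean_placeholder_text` and the demo frame are hypothetical, and whether `__10000__` is a real amount or a sentinel that should become missing is a decision left to the analyst; this sketch simply strips the underscores.

```python
import pandas as pd


def to_clean_numeric(series: pd.Series) -> pd.Series:
    """Strip stray underscores from a text column and convert it to numbers.

    Values that still cannot be parsed become NaN.
    """
    stripped = series.str.strip("_ ")
    return pd.to_numeric(stripped, errors="coerce")


def clean_placeholder_text(series: pd.Series) -> pd.Series:
    """Turn values made only of underscores (e.g. '_______', '_') into missing values."""
    is_placeholder = series.str.fullmatch(r"_+", na=False)
    return series.mask(is_placeholder)


# Made-up rows reproducing the patterns seen in the raw data.
df_cln_demo = pd.DataFrame(
    {
        "Age": ["23", "24_", "4975", None],
        "Amount_invested_monthly": ["236.64", "__10000__", "97.59", "39.08"],
        "Occupation": ["Scientist", "_______", "Mechanic", "Architect"],
        "Credit_Mix": ["Good", "_", "Good", "Good"],
    }
)

for col in ["Age", "Amount_invested_monthly"]:
    df_cln_demo[col] = to_clean_numeric(df_cln_demo[col])
for col in ["Occupation", "Credit_Mix"]:
    df_cln_demo[col] = clean_placeholder_text(df_cln_demo[col])

print(df_cln_demo.dtypes)
```

Note that an age of 4975 parses as a valid number, so type conversion alone is not enough; a separate range check is still needed for such values.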