# Data Extraction: Bronze Layer

This notebook performs the **raw data extraction** for the credit risk dataset
and stores the result in the **Bronze layer** of the Data Lake.

The goal of this step is to collect and persist the data
without structural changes, preserving its original form for traceability
and auditing.

In [0]:
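# Location of the uploaded CSV inside the Unity Catalog Volume
bronze_path = "/Volumes/mvp_bank/landing/bank_mkt_volume/bronze/credit_risk_dataset.csv"

# Read the raw CSV: the first line is the header and column types are inferred by Spark
df_bronze = (
    spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv(f"dbfs:{bronze_path}")
)

# Preview the first rows
df_bronze.limit(20).display()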


person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4
21,9900,OWN,2.0,VENTURE,A,2500,7.14,1,0.25,N,2
26,77100,RENT,8.0,EDUCATION,B,35000,12.42,1,0.45,N,3
24,78956,RENT,5.0,MEDICAL,B,35000,11.11,1,0.44,N,4
24,83000,RENT,8.0,PERSONAL,A,35000,8.9,1,0.42,N,2
21,10000,OWN,6.0,VENTURE,D,1600,14.74,1,0.16,N,3
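
The read above relies on `inferSchema`, so the column types depend on what Spark infers from the file. A minimal sketch of an alternative that declares the schema explicitly follows. The column names are taken from the header in the output above; the types are assumptions read off the sample rows, and `BRONZE_COLUMNS`, `build_schema_ddl` and `read_bronze_csv` are hypothetical names that are not part of the original notebook.

```python
# Column names come from the CSV header shown in the cell output.
# The types are ASSUMPTIONS based on the sample rows, not a confirmed spec.
BRONZE_COLUMNS = [
    ("person_age", "INT"),
    ("person_income", "INT"),
    ("person_home_ownership", "STRING"),
    ("person_emp_length", "DOUBLE"),
    ("loan_intent", "STRING"),
    ("loan_grade", "STRING"),
    ("loan_amnt", "INT"),
    ("loan_int_rate", "DOUBLE"),
    ("loan_status", "INT"),
    ("loan_percent_income", "DOUBLE"),
    ("cb_person_default_on_file", "STRING"),
    ("cb_person_cred_hist_length", "INT"),
]


def build_schema_ddl(columns):
    """Join (name, type) pairs into a Spark DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


BRONZE_SCHEMA_DDL = build_schema_ddl(BRONZE_COLUMNS)


def read_bronze_csv(spark, path, schema_ddl=BRONZE_SCHEMA_DDL):
    """Read the CSV with a fixed schema instead of inferSchema.

    `spark` is an active SparkSession, such as the one Databricks
    provides in notebooks.
    """
    return (
        spark.read
        .option("header", "true")
        .schema(schema_ddl)  # DataFrameReader.schema accepts a DDL string
        .csv(path)
    )
```

In this notebook the helper could be called as `read_bronze_csv(spark, f"dbfs:{bronze_path}")`. Declaring the schema keeps the Bronze column types stable across runs, at the cost of having to update the list if the source file changes.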


## Data Source

The dataset used is the **Credit Risk Dataset**, publicly available
on Kaggle.

Source link:
https://www.kaggle.com/datasets/laotse/credit-risk-dataset

The dataset contains demographic and financial information about borrowers, along with loan characteristics,
and is widely used in credit risk analysis studies.


In [0]:
from pyspark.sql.functions import current_timestamp

# Audit column: timestamp of the ingestion run
df_bronze = df_bronze.withColumn("ingestion_ts", current_timestamp())

# Persist as a Delta table, replacing any previous contents
(
    df_bronze.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable("mvp_bank.landing.bronze_credit_risk")
)


## Extraction Process

The extraction consists of:
1. Manually uploading the CSV file to Databricks (Unity Catalog Volume)
2. Reading the file with Apache Spark
3. Persisting the data in Delta Lake format in the Bronze table

An audit column (`ingestion_ts`) was added to record the moment of ingestion,
which supports versioning and traceability of the pipeline.
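
Steps 2 and 3, together with the audit column, can be sketched as a single reusable function. This is only an illustration under stated assumptions: `ingest_csv_to_bronze` is a hypothetical name, `spark` is assumed to be the active SparkSession of the notebook, and the audit column is added through a SQL expression in `selectExpr` rather than `withColumn`.

```python
def ingest_csv_to_bronze(spark, source_path, table_name):
    """Read a CSV file and persist it as a Delta table with an audit column.

    Hypothetical helper: `spark` is an active SparkSession, `source_path`
    is the CSV location and `table_name` is the target Bronze table.
    """
    # Step 2: read the raw file, using the header line for column names
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(source_path)
    )

    # Audit column: keep every original column and append the ingestion time
    df = df.selectExpr("*", "current_timestamp() AS ingestion_ts")

    # Step 3: persist as Delta, replacing any previous contents of the table
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return df
```

With the paths used in this notebook, the call would look like `ingest_csv_to_bronze(spark, f"dbfs:{bronze_path}", "mvp_bank.landing.bronze_credit_risk")`. Step 1, the manual upload, stays outside the function.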


In [0]:
%sql
-- Preview the persisted Bronze table
SELECT * FROM mvp_bank.landing.bronze_credit_risk LIMIT 20;


person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,ingestion_ts
22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3,2025-12-11T23:22:24.118Z
21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2,2025-12-11T23:22:24.118Z
25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3,2025-12-11T23:22:24.118Z
23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2,2025-12-11T23:22:24.118Z
24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4,2025-12-11T23:22:24.118Z
21,9900,OWN,2.0,VENTURE,A,2500,7.14,1,0.25,N,2,2025-12-11T23:22:24.118Z
26,77100,RENT,8.0,EDUCATION,B,35000,12.42,1,0.45,N,3,2025-12-11T23:22:24.118Z
24,78956,RENT,5.0,MEDICAL,B,35000,11.11,1,0.44,N,4,2025-12-11T23:22:24.118Z
24,83000,RENT,8.0,PERSONAL,A,35000,8.9,1,0.42,N,2,2025-12-11T23:22:24.118Z
21,10000,OWN,6.0,VENTURE,D,1600,14.74,1,0.16,N,3,2025-12-11T23:22:24.118Z
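
Every row in the output above carries the same `ingestion_ts`, which is consistent with the table being written in a single run with `mode("overwrite")`. A small sketch of how the audit column could be used to check this, by counting rows per ingestion timestamp, is given below; `rows_per_ingestion` is a hypothetical helper name and `spark` is assumed to be the notebook's active SparkSession.

```python
def rows_per_ingestion(spark, table_name):
    """Return a DataFrame with one row per ingestion_ts and its row count."""
    return (
        spark.table(table_name)
        .groupBy("ingestion_ts")
        .count()
        .orderBy("ingestion_ts")
    )
```

In the notebook this could be displayed with `rows_per_ingestion(spark, "mvp_bank.landing.bronze_credit_risk").display()`. With the overwrite write used here, a single row would be expected; more than one distinct timestamp would suggest that data from different loads ended up in the table.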
