# Objective

- Extract the data through the Kaggle API, listing and downloading the dataset used in the credit score study.
- The data will be saved in Parquet format in a first "layer", treated as the bronze layer.

## Packages and functions

In [3]:
# !pip install kaggle
# !pip install fastparquet

In [2]:
from kaggle.api.kaggle_api_extended import KaggleApi
import zipfile
import os
import pandas as pd

In [3]:
pd.set_option('display.max_columns', None)

In [5]:
api = KaggleApi()
api.authenticate()
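
`api.authenticate()` only succeeds when Kaggle credentials are available. The Kaggle client looks for the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables or for a `kaggle.json` file, by default under `~/.kaggle/`. The sketch below is a pre-flight check that can run before authenticating. `has_kaggle_credentials` is a hypothetical helper, not part of the `kaggle` package, and the handling of `KAGGLE_CONFIG_DIR` is an assumption about how the config folder is overridden.

```python
import os
from pathlib import Path


def has_kaggle_credentials(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper (not part of the kaggle package).
    `env` and `home` exist only to make the check easy to test.
    """
    env = os.environ if env is None else env

    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json file exists in the config directory.
    # Assumption: KAGGLE_CONFIG_DIR, when set, replaces ~/.kaggle.
    config_dir = env.get("KAGGLE_CONFIG_DIR")
    if config_dir:
        base = Path(config_dir)
    else:
        base = (Path(home) if home is not None else Path.home()) / ".kaggle"
    return (base / "kaggle.json").is_file()
```

Calling `has_kaggle_credentials()` before `api.authenticate()` gives a clearer signal of what is missing than the exception raised by the client.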

## Listing datasets related to credit score

In [6]:
!kaggle datasets list -s credit-score-classification

ref                                                              title                                                     size  lastUpdated                 downloadCount  voteCount  usabilityRating  
---------------------------------------------------------------  --------------------------------------------------  ----------  --------------------------  -------------  ---------  ---------------  
parisrohan/credit-score-classification                           Credit score classification                            9973595  2022-06-22 14:30:55.590000          49673        330  0.9411765        
sujithmandala/credit-score-classification-dataset                Credit Score Classification Dataset                       1054  2023-05-22 16:17:38.083000           5898         93  1.0              
iremnurtokuroglu/credit-score-classification-cleaned-dataset     Credit Score Classification Cleaned Dataset            4159334  2024-11-26 01:41:22.027000           1872         31  0.7647059    

## Downloading the data

In [7]:
!kaggle datasets download parisrohan/credit-score-classification

Dataset URL: https://www.kaggle.com/datasets/parisrohan/credit-score-classification
License(s): CC0-1.0
Downloading credit-score-classification.zip to /home/hugo/Documents/Git_GitHub/Estudo_Score_Credito/vScore_Credito/1.Base_de_Dados
  0%|                                               | 0.00/9.51M [00:00<?, ?B/s]
100%|███████████████████████████████████████| 9.51M/9.51M [00:00<00:00, 915MB/s]
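
The same download can also be triggered from Python through the authenticated `api` object instead of the shell command. A minimal sketch, assuming the archive is named after the dataset slug (as in the output above); `download_dataset_zip` is a hypothetical helper, while `dataset_download_files` is the client method it wraps.

```python
from pathlib import Path


def download_dataset_zip(api, dataset_ref, dest_dir="."):
    """Download a Kaggle dataset as a zip and return the expected zip path.

    `api` is an authenticated KaggleApi instance.
    Hypothetical helper: the returned path assumes the archive is named
    "<dataset slug>.zip", which matches the CLI output shown above.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # unzip=False keeps the archive so it can be opened with zipfile later.
    api.dataset_download_files(dataset_ref, path=str(dest), unzip=False)

    slug = dataset_ref.split("/")[-1]
    return dest / f"{slug}.zip"
```

For example, `download_dataset_zip(api, 'parisrohan/credit-score-classification')` would return a path ending in `credit-score-classification.zip`.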


## Reading the data and viewing the first rows

In [7]:
zf = zipfile.ZipFile('credit-score-classification.zip') 

In [None]:
# Names of the files in the zip

zf.namelist()

['test.csv', 'train.csv']
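
Since the archive holds only CSV files, they can be read in one pass without extracting anything to disk. The sketch below is one possible way to do it; `read_csvs_from_zip` is a hypothetical helper, and using a context manager also ensures the archive is closed after reading.

```python
import zipfile

import pandas as pd


def read_csvs_from_zip(zip_source):
    """Read every .csv member of a zip archive into a dict of DataFrames.

    `zip_source` can be a path or a binary file-like object.
    Keys are the member names, e.g. 'train.csv'.
    """
    frames = {}
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".csv"):
                # zf.open returns a file-like object that pandas can parse.
                with zf.open(name) as fh:
                    frames[name] = pd.read_csv(fh)
    return frames
```

With the archive above, `read_csvs_from_zip('credit-score-classification.zip')` would return a dict with the keys `'test.csv'` and `'train.csv'`.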

In [9]:
treino = pd.read_csv(zf.open('train.csv'))
treino.head()



Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,0x1602,CUS_0xd40,January,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,11.27,4.0,_,809.98,26.82262,22 Years and 1 Months,No,49.574949,80.41529543900253,High_spent_Small_value_payments,312.49408867943663,Good
1,0x1603,CUS_0xd40,February,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",-1,,11.27,4.0,Good,809.98,31.94496,,No,49.574949,118.28022162236736,Low_spent_Large_value_payments,284.62916249607184,Good
2,0x1604,CUS_0xd40,March,Aaron Maashoh,-500,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7.0,_,4.0,Good,809.98,28.609352,22 Years and 3 Months,No,49.574949,81.699521264648,Low_spent_Medium_value_payments,331.2098628537912,Good
3,0x1605,CUS_0xd40,April,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",5,4.0,6.27,4.0,Good,809.98,31.377862,22 Years and 4 Months,No,49.574949,199.4580743910713,Low_spent_Small_value_payments,223.45130972736783,Good
4,0x1606,CUS_0xd40,May,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",6,,11.27,4.0,Good,809.98,24.797347,22 Years and 5 Months,No,49.574949,41.420153086217326,High_spent_Medium_value_payments,341.48923103222177,Good


In [10]:
treino.shape

(100000, 28)

In [15]:
treino.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 28 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   ID                        100000 non-null  object 
 1   Customer_ID               100000 non-null  object 
 2   Month                     100000 non-null  object 
 3   Name                      90015 non-null   object 
 4   Age                       100000 non-null  object 
 5   SSN                       100000 non-null  object 
 6   Occupation                100000 non-null  object 
 7   Annual_Income             100000 non-null  object 
 8   Monthly_Inhand_Salary     84998 non-null   float64
 9   Num_Bank_Accounts         100000 non-null  int64  
 10  Num_Credit_Card           100000 non-null  int64  
 11  Interest_Rate             100000 non-null  int64  
 12  Num_of_Loan               100000 non-null  object 
 13  Type_of_Loan              88592 non-null   ob

In [None]:
# Convert the Monthly_Balance column to string

treino['Monthly_Balance'] = treino['Monthly_Balance'].astype('str') 
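
The cast above suggests that `Monthly_Balance` mixes numeric values with text placeholders, which is a problem for Parquet because each column needs a single type. The sketch below shows one way to spot such columns, plus a variant of the cast that keeps missing values missing, since `astype('str')` can turn them into the literal text `'nan'`. Both helpers are hypothetical, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(df):
    """Return object columns whose non-null values have more than one Python type."""
    mixed = []
    for col in df.select_dtypes(include="object").columns:
        n_types = df[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


def to_str_keep_missing(series):
    """Cast non-null values to str while leaving missing values as NaN."""
    # where() keeps the original value where the condition is True (missing)
    # and takes the string version everywhere else.
    return series.where(series.isna(), series.astype(str))


# Toy data standing in for a column like Monthly_Balance (values are made up).
demo = pd.DataFrame({
    "balance": [312.49, "__placeholder__", np.nan],
    "name": ["a", "b", None],
})
demo_mixed = mixed_type_columns(demo)
demo_clean = to_str_keep_missing(demo["balance"])
```

Here `demo_mixed` lists only `balance`, and `demo_clean` holds two strings followed by a missing value.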

In [23]:
# treino[treino['Monthly_Balance'] == '__-333333333333333333333333333__']

In [11]:
teste = pd.read_csv(zf.open('test.csv'))
teste.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,Interest_Rate,Num_of_Loan,Type_of_Loan,Delay_from_due_date,Num_of_Delayed_Payment,Changed_Credit_Limit,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,0x160a,CUS_0xd40,September,Aaron Maashoh,23,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,7,11.27,2022.0,Good,809.98,35.030402,22 Years and 9 Months,No,49.574949,236.64268203272132,Low_spent_Small_value_payments,186.26670208571767
1,0x160b,CUS_0xd40,October,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",3,9,13.27,4.0,Good,809.98,33.053114,22 Years and 10 Months,No,49.574949,21.465380264657146,High_spent_Medium_value_payments,361.444003853782
2,0x160c,CUS_0xd40,November,Aaron Maashoh,24,821-00-0265,Scientist,19114.12,1824.843333,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",-1,4,12.27,4.0,Good,809.98,33.811894,,No,49.574949,148.23393788500923,Low_spent_Medium_value_payments,264.67544623343
3,0x160d,CUS_0xd40,December,Aaron Maashoh,24_,821-00-0265,Scientist,19114.12,,3,4,3,4,"Auto Loan, Credit-Builder Loan, Personal Loan,...",4,5,11.27,4.0,Good,809.98,32.430559,23 Years and 0 Months,No,49.574949,39.08251089460281,High_spent_Medium_value_payments,343.82687322383634
4,0x1616,CUS_0x21b1,September,Rick Rothackerj,28,004-07-5839,_______,34847.84,3037.986667,2,4,6,1,Credit-Builder Loan,3,1,5.42,5.0,Good,605.03,25.926822,27 Years and 3 Months,No,18.816215,39.684018417945296,High_spent_Large_value_payments,485.2984336755923


In [12]:
teste.shape

(50000, 27)

In [25]:
# Convert the Monthly_Balance column to string

teste['Monthly_Balance'] = teste['Monthly_Balance'].astype('str') 

# Saving the data - bronze layer

The data is saved in Parquet format so that cleaning and feature engineering can be done on it later.

In [26]:
treino.to_parquet('Dados_Bronze/treino.parquet', engine='fastparquet')
teste.to_parquet('Dados_Bronze/teste.parquet', engine='fastparquet')
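
The cell above expects the `Dados_Bronze` folder to exist already. A minimal sketch that creates the folder first and writes each DataFrame in a loop; `save_bronze` is a hypothetical helper and the `fastparquet` default simply mirrors the engine used above.

```python
from pathlib import Path


def save_bronze(frames, out_dir="Dados_Bronze", engine="fastparquet"):
    """Write each DataFrame to <out_dir>/<name>.parquet and return the paths.

    Hypothetical helper: `frames` maps a file stem (e.g. "treino")
    to a DataFrame or any object exposing a compatible to_parquet method.
    """
    out = Path(out_dir)
    # Create the bronze folder up front so the write does not fail on a missing path.
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for name, df in frames.items():
        path = out / f"{name}.parquet"
        df.to_parquet(path, engine=engine)
        paths.append(path)
    return paths
```

In this notebook the call would be `save_bronze({'treino': treino, 'teste': teste})`.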