# Purpose

- Create a notebook to open and handle basic aspects of the data, which will later be used to feed the RAG architecture

In [1]:
import pandas as pd

# Define Constant Variables

- SOURCE_PATH -> Path of the file to process
- TARGET_PATH -> Path of the file where the result is saved

In [2]:
SOURCE_PATH = "../../data/raw/train.gz"
TARGET_PATH = "../../data/processed/train.csv"
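
`DataFrame.to_csv` does not create missing directories, so saving to `TARGET_PATH` fails if the `processed` folder does not exist yet. A minimal sketch of a guard is shown here; `ensure_parent_dir` is a hypothetical helper that is not part of the original notebook, and the demo uses a temporary directory rather than the real project paths.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path_str: str) -> Path:
    """Create the parent directory of path_str if it is missing (hypothetical helper)."""
    path = Path(path_str)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo on a throwaway directory, standing in for ../../data/processed
with tempfile.TemporaryDirectory() as tmp:
    demo_target = ensure_parent_dir(str(Path(tmp) / "data" / "processed" / "train.csv"))
    parent_exists = demo_target.parent.is_dir()
```

In the notebook this could be called as `ensure_parent_dir(TARGET_PATH)` before saving.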

# Key functions

In [3]:
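def gz_to_dataframe(path_gz: str) -> pd.DataFrame:
    """Read a gzip-compressed CSV file into a DataFrame."""
    return pd.read_csv(path_gz, compression="gzip")

def save_dataframe_csv(df: pd.DataFrame, path_csv: str) -> None:
    """Write the DataFrame to a CSV file without the index column."""
    df.to_csv(path_csv, index=False)
    # The message is in Portuguese: "CSV saved at <path>"
    print(f"[OK] CSV salvo em {path_csv}")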

In [4]:
df = gz_to_dataframe(SOURCE_PATH)

## Quick Snippet of Data

### Simple Visualization

In [5]:
df.head()

,REF_DATE,TARGET,VAR2,IDADE,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,...,VAR141,VAR142,VAR143,VAR144,VAR145,VAR146,VAR147,VAR148,VAR149,ID
0,2017-06-01 00:00:00+00:00,0,M,34.137,,RO,-8.808779,-63.87847,D,E,...,2680.289259,D,,,,,102,EMAIL INEXISTENTE#@#NOME INEXISTENTE#@#CEP INE...,2.6.1,181755
1,2017-08-18 00:00:00+00:00,0,M,40.447,,PB,-7.146537,-34.92608,E,E,...,1777.725469,E,,,,,102,EMAIL INEXISTENTE#@#NOME INEXISTENTE#@#CEP INE...,2.6.1,287633
2,2017-06-30 00:00:00+00:00,0,F,33.515,,RS,-27.900178,-53.314035,,E,...,1695.494979,E,,,,,102,EMAIL INEXISTENTE#@#NOME INEXISTENTE#@#CEP INE...,2.6.1,88015
3,2017-08-05 00:00:00+00:00,1,F,25.797,,BA,-12.948874,-38.451863,E,E,...,1399.037809,E,,,,,102,EMAIL INEXISTENTE#@#NOME INEXISTENTE#@#CEP INE...,2.6.1,122576
4,2017-07-29 00:00:00+00:00,0,F,54.074,,RS,-30.05181,-51.213277,B,E,...,7868.793296,C,,,,,102,EMAIL INEXISTENTE,2.6.1,1272


In [6]:
df.shape

(120750, 151)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120750 entries, 0 to 120749
Columns: 151 entries, REF_DATE to ID
dtypes: float64(34), int64(3), object(114)
memory usage: 139.1+ MB


In [8]:
df.columns

Index(['REF_DATE', 'TARGET', 'VAR2', 'IDADE', 'VAR4', 'VAR5', 'VAR6', 'VAR7',
       'VAR8', 'VAR9',
       ...
       'VAR141', 'VAR142', 'VAR143', 'VAR144', 'VAR145', 'VAR146', 'VAR147',
       'VAR148', 'VAR149', 'ID'],
      dtype='object', length=151)

In [9]:
df.dtypes

REF_DATE     object
TARGET        int64
VAR2         object
IDADE       float64
VAR4         object
             ...   
VAR146      float64
VAR147        int64
VAR148       object
VAR149       object
ID            int64
Length: 151, dtype: object

In [10]:
df.describe(include="all")

,REF_DATE,TARGET,VAR2,IDADE,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,...,VAR141,VAR142,VAR143,VAR144,VAR145,VAR146,VAR147,VAR148,VAR149,ID
count,120750,120750.0,106131,107040.0,202,117394,117394.0,117394.0,67640,117447,...,120750.0,120585,1576,1576,679.0,168.0,120750.0,120750,120750,120750.0
unique,242,,2,,1,27,,,5,5,...,,5,2,2,,,,8,1,
top,2017-03-31 00:00:00+00:00,,F,,S,SP,,,E,E,...,,E,N,N,,,,EMAIL INEXISTENTE,2.6.1,
freq,1061,,60131,,202,19079,,,54928,116130,...,,94058,897,1408,,,,52379,120750,
mean,,0.245027,,42.125255,,,-14.411389,-45.90348,,,...,1854.833006,,,,4018.743785,1942.649762,101.841656,,,165324.864199
std,,0.430105,,15.198476,,,8.995077,7.529788,,,...,893.999792,,,,3700.836248,3143.75785,0.540016,,,95488.44232
min,,0.0,,18.014,,,-33.521563,-72.900276,,,...,0.0,,,,0.0,0.0,100.0,,,3.0
25%,,0.0,,30.05725,,,-22.842778,-49.903564,,,...,1513.2274,,,,1633.195,0.0,102.0,,,82727.25
50%,,0.0,,39.867,,,-13.01059,-46.574908,,,...,1627.157652,,,,3024.48,935.12,102.0,,,165298.0
75%,,0.0,,52.997,,,-6.357067,-39.023621,,,...,1820.670284,,,,5217.67,2260.125,102.0,,,248248.0


# Saving the "processed" version

What was done...

- Quick look into the data
- Mapped possible edge cases
- Quick check to see if all the data is good to go
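
The `count` row of `df.describe(include="all")` already hints at missing values, since several columns have far fewer than 120750 non-null entries. A minimal sketch of a more direct check is given below; `missing_report` is a hypothetical helper that is not in the original notebook, and the toy frame only stands in for `df`.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and shares per column, worst first (hypothetical helper)."""
    report = pd.DataFrame(
        {
            "missing_count": df.isna().sum(),
            "missing_share": df.isna().mean(),
        }
    )
    return report.sort_values("missing_share", ascending=False)


# Toy data for illustration; in the notebook this would be the loaded df
toy = pd.DataFrame(
    {
        "VAR2": ["M", None, "F", None],
        "IDADE": [34.1, 40.4, None, 25.8],
        "ID": [1, 2, 3, 4],
    }
)
report = missing_report(toy)
```

Sorting by share puts the sparsest columns at the top, which makes it easier to decide which ones need treatment before they feed a later step.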


The data turned out not to be perfectly clean. The main features of the project are:

- REF_DATE  (Reference date of the record)
- TARGET    (Binary default target (1: bad payer, i.e. more than 60 days overdue within 2 months))
- VAR2      (Sex)
- IDADE     (Age of the individual)
- VAR4      (Death flag)
- VAR5      (Brazilian federative unit (UF))
- VAR8      (Estimated social class)
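
The notebook saves the full DataFrame as it is. If a later step only needed the main features listed above, one possible sketch is the following; `MAIN_COLUMNS` and `select_main_features` are hypothetical names, and parsing `REF_DATE` as a UTC timestamp is an assumption based on the `+00:00` suffix seen in `df.head()`.

```python
import pandas as pd

# Column names taken from the list of main features; the constant itself is hypothetical
MAIN_COLUMNS = ["REF_DATE", "TARGET", "VAR2", "IDADE", "VAR4", "VAR5", "VAR8"]


def select_main_features(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the main feature columns and parse REF_DATE (hypothetical helper)."""
    subset = df[MAIN_COLUMNS].copy()
    # Assumption: REF_DATE strings carry a UTC offset, as in the head() output
    subset["REF_DATE"] = pd.to_datetime(subset["REF_DATE"], utc=True)
    return subset


# Two rows copied from the head() output, limited to the columns needed here
toy = pd.DataFrame(
    {
        "REF_DATE": ["2017-06-01 00:00:00+00:00", "2017-08-18 00:00:00+00:00"],
        "TARGET": [0, 0],
        "VAR2": ["M", "M"],
        "IDADE": [34.137, 40.447],
        "VAR4": [None, None],
        "VAR5": ["RO", "PB"],
        "VAR8": ["D", "E"],
        "ID": [181755, 287633],
    }
)
main = select_main_features(toy)
```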

In [11]:
save_dataframe_csv(df, TARGET_PATH)

[OK] CSV salvo em ../../data/processed/train.csv
