#### **Transformation of Raw Data Files** ####

In this short notebook, we convert the raw CSV data files into Parquet, a columnar format that is more efficient to store and more convenient to analyze.

In [2]:
import os
from pathlib import Path

import duckdb
import pandas as pd

In [None]:
# Project root: two levels above the notebook's current working directory.
BASE_DIR = Path(os.getcwd()).parent.parent

RAW_DIR = BASE_DIR / "home-credit-default-risk"

PARQUET_DIR = BASE_DIR / "ginbulbu-DS.v2.5.3.4.1" / "data"

PARQUET_DIR.mkdir(parents=True, exist_ok=True)

tables = [
    "application_train",
    "application_test",
    "bureau",
    "bureau_balance",
    "credit_card_balance",
    "installments_payments",
    "POS_CASH_balance",
    "previous_application",
]

for t in tables:
    csv_file = RAW_DIR / f"{t}.csv"
    parquet_file = PARQUET_DIR / f"{t}.parquet"

    if csv_file.exists():
        print(f"Converting {csv_file} to {parquet_file}...")
        # Let DuckDB infer the CSV schema and write the result as Parquet.
        duckdb.sql(
            f"""COPY (SELECT * FROM read_csv_auto('{csv_file}'))
                   TO '{parquet_file}' (FORMAT 'parquet');"""
        )
    else:
        print(f"CSV file {csv_file} does not exist; skipping.")

Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/application_train.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/application_train.parquet...
Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/application_test.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/application_test.parquet...
Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/bureau.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/bureau.parquet...
Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/bureau_balance.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/bureau_balance.parquet...
Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/credit_card_balance.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/credit_card_balan

Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/POS_CASH_balance.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/POS_CASH_balance.parquet...
Converting /Users/gbulbuke/Documents/DS_Course/capstone/home-credit-default-risk/previous_application.csv to /Users/gbulbuke/Documents/DS_Course/capstone/ginbulbu-DS.v2.5.3.4.1/data/previous_application.parquet...
