# **Loading, Exploring, and Converting Dataset to Parquet Format**

**Important Note**: The original `HomeCredit_columns_description.csv` file had a corrupted character at position 59, which prevented it from being read using UTF-8 encoding. **This character was manually corrected (removed and rewritten).**
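
For readers who hit the same problem, here is a minimal sketch of how such a byte can be located before fixing it by hand. The function name `find_invalid_utf8` is hypothetical and not part of the project; it relies only on the `start` attribute of `UnicodeDecodeError`.

```python
from pathlib import Path


def find_invalid_utf8(path):
    """Return (offset, byte) for the first byte that is not valid UTF-8, or None."""
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the byte offset at which decoding failed
        return err.start, raw[err.start:err.start + 1]
    return None
```

The returned offset is counted in bytes from the start of the file, which may differ from the character position shown in a text editor.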

This notebook aims to speed up the rest of the project, especially the exploratory data analysis (EDA) and feature engineering phases, by converting the dataset from CSV to Parquet. Parquet files take less disk space and load faster. To choose a configuration, we benchmark several Parquet variants against CSV on write time, read time, and file size.

We begin by loading the dataset for the first time and giving an overview of its dimensions. We then run the benchmark to determine the best persistence method. Finally, the dataset is converted to the Parquet configuration that strikes the best balance between disk footprint and loading time.

# Reading CSV files

Alright, let's dive into the code! We load every CSV file found in a given directory into a dataframe and store the dataframes in a dictionary keyed by table name (the file name without its `.csv` extension), which makes each table easy to access and manipulate later. While loading, we print each table's name, shape, and load time, followed by the total time spent loading all the tables.

In [5]:
from pepper.persist import _get_filenames_glob  # Importing function to get CSV filenames
from pepper.utils import pretty_timedelta_str, bold  # Importing utility functions
import pandas as pd  # Importing pandas for data manipulation
import time  # Importing time for measuring execution time

csv_dir = "../../dataset/csv/"  # Directory path where CSV files are located
filenames = _get_filenames_glob(csv_dir, "csv")  # Get list of CSV filenames

data_dict = {}  # Dictionary to store dataframes
read_times = []  # List to store loading times of each table

for filename in filenames:
    t = -time.time()  # Start measuring loading time
    data_key = filename[:-4]  # Extract data key from filename
    data = pd.read_csv(csv_dir + filename, encoding='utf-8')  # Read CSV file into a dataframe
    t += time.time()  # Calculate elapsed time for loading
    read_times.append(t)  # Store loading time in the list
    data_dict[data_key] = data  # Add dataframe to the dictionary
    # Display table name, shape, and loading time
    print(f"{bold(data_key)}: {data.shape} - {pretty_timedelta_str(t, 2)}")

# Display total read time for all tables
print(f">> {bold('total read time')}: {pretty_timedelta_str(sum(read_times), 2)}")

application_test: (48744, 121) - 523 ms, 602 mus
application_train: (307511, 122) - 3 s, 262 ms
bureau: (1716428, 17) - 2 s, 385 ms
bureau_balance: (27299925, 3) - 4 s, 593 ms
credit_card_balance: (3840312, 23) - 7 s, 570 ms
HomeCredit_columns_description: (219, 5) - 6 ms, 14 mus
installments_payments: (13605401, 8) - 10 s, 382 ms
POS_CASH_balance: (10001358, 8) - 5 s, 342 ms
previous_application: (1670214, 37) - 7 s, 247 ms
sample_submission: (48744, 2) - 15 ms, 990 mus
>> total read time: 41 s, 329 ms
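
The helper `_get_filenames_glob` comes from the project's private `pepper` package and its implementation is not shown here. As a rough stand-in, assuming it simply lists the files with a given extension, the same table keys can be obtained with `pathlib` alone. The name `list_tables` is hypothetical.

```python
from pathlib import Path


def list_tables(csv_dir, ext="csv"):
    """Map each table name (file stem) to its path, sorted case-insensitively."""
    paths = sorted(Path(csv_dir).glob(f"*.{ext}"), key=lambda p: p.name.lower())
    # Path.stem drops the extension, so no hard-coded filename[:-4] slice is needed
    return {p.stem: p for p in paths}


# Possible use in the loading loop (pandas assumed to be imported as pd):
# for name, path in list_tables("../../dataset/csv/").items():
#     data_dict[name] = pd.read_csv(path, encoding="utf-8")
```

Using `Path.stem` keeps the key extraction correct even if the extension length changes.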


# Dataset Overview

In this section, we create a `metadata` dataframe that provides an overview of the dimensions of our dataset, which consists of multiple tables. This information can be useful for gaining insights into the structure and size of the dataset. We calculate the number of samples, number of features, and the total number of cells in each table. Additionally, we retrieve the size of each corresponding CSV file and add the CSV read times to the `metadata`. Finally, we sort the `metadata` dataframe by the number of cells in descending order and display the results.

In [6]:
# Import necessary modules
from pepper.utils import bold
from pepper.utils import get_file_size
import pandas as pd

# Create metadata dataframe to provide an overview of the dataset dimensions
metadata = pd.DataFrame(
    [(key, *data.shape) for key, data in data_dict.items()],
    columns=["table_name", "n_samples", "n_features"]
)

# Calculate the total number of cells in each table
metadata["n_cells"] = metadata.n_samples * metadata.n_features

# Retrieve the size of each CSV file
metadata["csv_size"] = metadata.table_name.apply(
    lambda x: get_file_size(csv_dir + x + ".csv")
)

# Add CSV read times to the metadata dataframe
metadata["csv_read_time"] = read_times

# Sort the metadata dataframe by the number of cells in descending order
metadata = metadata.sort_values(by="n_cells", ascending=False)

# Print the total number of cells in the dataset using bold formatting
print(f"{bold('n_cells')}: {metadata.n_cells.sum()}")

# Display the metadata dataframe
display(metadata)

n_cells: 493571166


Unnamed: 0,table_name,n_samples,n_features,n_cells,csv_size,csv_read_time
6,installments_payments,13605401,8,108843208,723118349,10.38217
4,credit_card_balance,3840312,23,88327176,424582605,7.57021
3,bureau_balance,27299925,3,81899775,375592889,4.59367
7,POS_CASH_balance,10001358,8,80010864,392703158,5.342954
8,previous_application,1670214,37,61797918,404973293,7.247273
1,application_train,307511,122,37516342,166133370,3.262308
2,bureau,1716428,17,29179276,170016717,2.385545
0,application_test,48744,121,5898024,26567651,0.523602
9,sample_submission,48744,2,97488,536202,0.01599
5,HomeCredit_columns_description,219,5,1095,37391,0.006014
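
The `get_file_size` helper also belongs to the private `pepper` package. Assuming it returns the size in bytes (which is consistent with the `csv_size` values above), a standard-library equivalent could look like the sketch below. The name `file_size_bytes` is hypothetical.

```python
from pathlib import Path


def file_size_bytes(path):
    """Return the size of a file in bytes, as reported by the file system."""
    return Path(path).stat().st_size
```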


# Benchmark: Saving Dataset to Parquet with Various Configurations

When a dataframe is saved with `pandas.DataFrame.to_parquet`, the default configuration uses the `pyarrow` engine and falls back to `fastparquet` if `pyarrow` is not available. By default, the data is compressed with the `snappy` algorithm.

Our goal is to compare disk usage and loading time across seven configurations. Crossing the two engines (`pyarrow` or `fastparquet`) with the four compression options (`snappy`, `gzip`, `brotli`, or no compression) gives eight combinations, one of which is excluded for the reason given below.

For each configuration, we create a dedicated subdirectory to store the Parquet files. The dataset is then saved to Parquet using the specified engine and compression. The progress of the operation is displayed, indicating the configuration and the time taken.

Please note that we have excluded the combination of `fastparquet` engine with `brotli` compression due to an unidentified issue that causes the execution to enter an infinite loop.
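
As a quick check of the benchmark grid, the sketch below enumerates the engine and compression combinations and drops the excluded one. The `excluded` set is an illustrative name; the configuration names follow the same pattern as the subdirectory names used in this notebook.

```python
import itertools

engines = ["pyarrow", "fastparquet"]
compressions = ["snappy", "gzip", "brotli", None]
excluded = {("fastparquet", "brotli")}  # combination skipped in this notebook

config_names = [
    f"{engine}_{str(compression).lower()}"
    for engine, compression in itertools.product(engines, compressions)
    if (engine, compression) not in excluded
]
print(len(config_names))
print(config_names)
```

Eight combinations minus the excluded one leaves seven configurations to benchmark.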

**Warning**: If you plan to rerun this benchmark, please allocate approximately 10 minutes for the execution.

**Further readings**:
* [**`pandas.DataFrame.to_parquet`** documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_parquet.html)
* [**Stack Overflow**: Python - Save pandas data frame to Parquet file](https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file)

In [7]:
# Import necessary modules
from pepper.persist import all_to_parquet
import itertools
import time

# Define CSV and Parquet directories
csv_dir = "../../dataset/csv/"
pqt_dir = "../../dataset/pqt/"

# Define the list of engines and compressions to test
engines = ["pyarrow", "fastparquet"]
compressions = ["snappy", "gzip", "brotli", None]

# Iterate over all combinations of engines and compressions
for engine, compression in itertools.product(engines, compressions):
    # Skip the combination of fastparquet and brotli due to an unidentified issue
    if engine == "fastparquet" and compression == "brotli":
        continue
    
    # Create a configuration name based on the engine and compression
    config_name = f"{engine}_{str(compression).lower()}"
    
    # Define the subdirectory for the Parquet files
    pqt_subdir = pqt_dir + config_name + "/"
    
    # Measure the time taken to save the dataset to Parquet
    t = -time.time()
    print(f"Saving dataset to {pqt_subdir}", end="")
    all_to_parquet(data_dict, pqt_subdir, engine, compression)
    t += time.time()
    print(f" in {pretty_timedelta_str(t, 2)}")

Saving dataset to ../../dataset/pqt/pyarrow_snappy/.......... in 25 s, 799 ms
Saving dataset to ../../dataset/pqt/pyarrow_gzip/.......... in 2 m, 18 s
Saving dataset to ../../dataset/pqt/pyarrow_brotli/.......... in 2 m, 16 s
Saving dataset to ../../dataset/pqt/pyarrow_none/.......... in 24 s, 126 ms
Saving dataset to ../../dataset/pqt/fastparquet_snappy/.......... in 26 s, 533 ms
Saving dataset to ../../dataset/pqt/fastparquet_gzip/.......... in 3 m, 6 s
Saving dataset to ../../dataset/pqt/fastparquet_none/.......... in 24 s, 195 ms


# Read Comparison: Parquet vs CSV Formats

The objective here is to compare the performance of the different configurations (engines and compressions) when reading Parquet files.

For each combination of engine and compression, we measure the _file size_ and _read time_ for each table in the `metadata` dataframe. The measurements are stored in the `metadata` dataframe for further analysis and comparison.

In [8]:
from pepper.utils import get_file_size

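# Function to measure the time taken to read a Parquet file.
# Note: no `engine` argument is passed, so every file is read with pandas'
# default engine, whichever engine wrote it.
def pqt_read_time(pqt_dir, table_name):
    t = time.time()
    pd.read_parquet(pqt_dir + table_name + ".pqt")
    return time.time() - t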

# Function to get the file size of a Parquet file
def pqt_file_size(pqt_dir, table_name):
    return get_file_size(pqt_dir + table_name + ".pqt")

# List of engines and compressions to iterate over
engines = ["pyarrow", "fastparquet"]
compressions = ["snappy", "gzip", "brotli", None]

# Iterate over all combinations of engines and compressions
for engine, compression in itertools.product(engines, compressions):
    # Skip the combination of fastparquet and brotli
    if engine == "fastparquet" and compression == "brotli":
        continue
    config_name = f"{engine}_{str(compression).lower()}"
    pqt_subdir = pqt_dir + config_name + "/"
    
    # Calculate and store the file size for each table using the specified engine and compression
    metadata[f"pqt_{config_name}_size"] = metadata.table_name.apply(
        lambda x: pqt_file_size(pqt_subdir, x)
    )
    
    # Measure and store the read time for each table using the specified engine and compression
    metadata[f"pqt_{config_name}_read_time"] = metadata.table_name.apply(
        lambda x: pqt_read_time(pqt_subdir, x)
    )

# Display the updated metadata dataframe
display(metadata)

Unnamed: 0,table_name,n_samples,n_features,n_cells,csv_size,csv_read_time,pqt_pyarrow_snappy_size,pqt_pyarrow_snappy_read_time,pqt_pyarrow_gzip_size,pqt_pyarrow_gzip_read_time,pqt_pyarrow_brotli_size,pqt_pyarrow_brotli_read_time,pqt_pyarrow_none_size,pqt_pyarrow_none_read_time,pqt_fastparquet_snappy_size,pqt_fastparquet_snappy_read_time,pqt_fastparquet_gzip_size,pqt_fastparquet_gzip_read_time,pqt_fastparquet_none_size,pqt_fastparquet_none_read_time
6,installments_payments,13605401,8,108843208,723118349,10.38217,330470104,3.219193,246648550,1.621335,234206927,1.731876,478259694,1.064058,417551342,2.224942,273744883,1.681106,874103290,1.536755
4,credit_card_balance,3840312,23,88327176,424582605,7.57021,111274155,1.300946,87382864,1.31059,84062930,1.254331,231893301,1.012982,158525573,1.203214,99998309,1.233918,671997342,1.389169
3,bureau_balance,27299925,3,81899775,375592889,4.59367,21426895,2.538228,7220080,2.998022,6528751,2.655512,212427359,2.393628,39104894,2.044033,8773070,1.879325,573299674,2.351481
7,POS_CASH_balance,10001358,8,80010864,392703158,5.342954,124435906,1.242508,89478858,1.52219,84330013,1.436072,192425645,1.203197,166379319,1.406085,93648196,1.31837,664506876,1.580753
8,previous_application,1670214,37,61797918,404973293,7.247273,62912447,2.056663,49908242,2.102959,48304590,1.969963,80714342,1.791365,115293797,1.773969,62131222,1.770063,514893753,1.924712
1,application_train,307511,122,37516342,166133370,3.262308,22225869,0.517867,18770994,0.567186,18486431,0.455677,24879919,0.505942,49802974,0.458047,25306399,0.482109,253550609,0.560997
2,bureau,1716428,17,29179276,170016717,2.385545,35241265,0.518472,25883443,0.500996,24365040,0.473197,61232824,0.510756,52235506,0.479927,29284991,0.4816,234062490,0.5843
0,application_test,48744,121,5898024,26567651,0.523602,4255523,0.099171,3596498,0.096352,3505436,0.087762,4861820,0.097831,8361289,0.083632,4258899,0.094115,40157544,0.108442
9,sample_submission,48744,2,97488,536202,0.01599,296358,0.015504,170995,0.012058,156444,0.019707,489947,0.023255,215625,0.0126,77364,0.012637,780849,0.013176
5,HomeCredit_columns_description,219,5,1095,37391,0.006014,13372,0.010037,10505,0.009007,9931,0.011017,23605,0.009794,10992,0.009524,7179,0.015567,41639,0.008494
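
Each file above is read only once, so the timings of the small tables (a few milliseconds) are noisy and may be influenced by operating system file caching. One possible refinement, not used in this notebook, is to repeat each read and keep the shortest duration. The helper name `best_of` is hypothetical.

```python
import time


def best_of(func, repeats=3):
    """Call func() `repeats` times and return the shortest duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)
    return min(durations)


# Possible use (pandas assumed to be imported as pd):
# best_of(lambda: pd.read_parquet(pqt_subdir + "bureau.pqt"))
```

This multiplies the benchmark duration by `repeats`, so it is a trade-off rather than a free improvement.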


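Based on the summary below, the best choice seems to be `pyarrow` + `gzip`.

This is the configuration we have settled on for the rest of the project.

Compared with CSV, it shrinks the disk footprint by about 5x (2.68 GB down to 0.53 GB) and cuts the total read time by about 4x (41.3 s down to 10.7 s).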

In [10]:
# Display the sum of each column in the `metadata` DataFrame
display(metadata.sum(axis=0))

table_name                          installments_paymentscredit_card_balancebureau...
n_samples                                                                    58538856
n_features                                                                        346
n_cells                                                                     493571166
csv_size                                                                   2684261625
csv_read_time                                                               41.329736
pqt_pyarrow_snappy_size                                                     712551894
pqt_pyarrow_snappy_read_time                                                11.518589
pqt_pyarrow_gzip_size                                                       529071029
pqt_pyarrow_gzip_read_time                                                  10.740693
pqt_pyarrow_brotli_size                                                     503956493
pqt_pyarrow_brotli_read_time                          
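
The gains of the `pyarrow` + `gzip` configuration over CSV can be derived directly from the totals printed above. The short calculation below only reuses those four numbers.

```python
# Totals copied from the summary output above
csv_size, csv_read_time = 2_684_261_625, 41.329736
pqt_size, pqt_read_time = 529_071_029, 10.740693  # pyarrow + gzip

size_ratio = csv_size / pqt_size         # how many times smaller on disk
speedup = csv_read_time / pqt_read_time  # how many times faster to read

print(f"disk footprint: {size_ratio:.1f}x smaller")
print(f"read time: {speedup:.1f}x faster")
```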

Let's generate a *pretty* table of these results for use on presentation slides (the Markdown table can be copied to the clipboard).

The code snippet below takes the `metadata` DataFrame, selects the desired columns for the resulting table, and creates a copy of it. It then calculates the total values for each column and assigns them to the "TOTAL" row. Next, it applies formatting functions to specific columns to format file sizes, time durations, and large integers. The column names are modified for better readability. Finally, the formatted table is displayed using the `display_dataframe_in_markdown` function.

In [11]:
from pepper.utils import display_dataframe_in_markdown, format_iB, pretty_timedelta_str

# Function to format file size
def format_size(x):
    sz, unity = format_iB(x)
    return f"{sz:.1f} {unity}"

# Function to format time duration
def format_time(x):
    return pretty_timedelta_str(x, 1)

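# Function to format large integers with thousand separators.
# The "n" format type follows the current locale: separators only appear
# if a locale that defines them has been set (e.g. with locale.setlocale).
def format_bigint(x):
    return f"{x:n}"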

# Get the columns from the metadata DataFrame
cols = metadata.columns

# Select the desired columns for the resulting table
res_cols = list(cols[:6]) + list(cols[cols.str.contains("pyarrow_gzip")])
res = metadata[res_cols]

# Create a copy of the resulting table
res_2 = res.copy()

# Calculate the total values for each column and assign them to the "TOTAL" row
total = res_2.sum(axis=0)
total.iloc[0] = "**TOTAL**"  # positional access made explicit with .iloc
res_2.loc["TOTAL"] = total

# Apply formatting functions to specific columns
res_2.csv_size = res_2.csv_size.apply(format_size)
res_2.pqt_pyarrow_gzip_size = res_2.pqt_pyarrow_gzip_size.apply(format_size)
res_2.csv_read_time = res_2.csv_read_time.apply(format_time)
res_2.pqt_pyarrow_gzip_read_time = res_2.pqt_pyarrow_gzip_read_time.apply(format_time)
res_2.n_samples = res_2.n_samples.apply(format_bigint)
res_2.n_cells = res_2.n_cells.apply(format_bigint)

# Modify column names for better readability
res_2.columns = (
    res_2.columns
    .str.replace("n_", "#")
    .str.replace("pqt_pyarrow_gzip", "parquet")
    .str.replace("read_time", "readtime")
    .str.replace("_", " ")
)

# Format the "TOTAL" row to be displayed in bold
# (the row label is already bold, so it is left as is)
res_2.loc["TOTAL"] = res_2.loc["TOTAL"].apply(
    lambda x: x if str(x).startswith("**") else f"**{x}**"
)

# Display the formatted table in Markdown format
display_dataframe_in_markdown(res_2)

|table name|#samples|#features|#cells|csv size|csv readtime|parquet size|parquet readtime|
|-|-|-|-|-|-|-|-|
|installments_payments|13 605 401|8|108 843 208|689.6 MiB|10 s|235.2 MiB|1 s|
|credit_card_balance|3 840 312|23|88 327 176|404.9 MiB|7 s|83.3 MiB|1 s|
|bureau_balance|27 299 925|3|81 899 775|358.2 MiB|4 s|6.9 MiB|2 s|
|POS_CASH_balance|10 001 358|8|80 010 864|374.5 MiB|5 s|85.3 MiB|1 s|
|previous_application|1 670 214|37|61 797 918|386.2 MiB|7 s|47.6 MiB|2 s|
|application_train|307 511|122|37 516 342|158.4 MiB|3 s|17.9 MiB|567 ms|
|bureau|1 716 428|17|29 179 276|162.1 MiB|2 s|24.7 MiB|500 ms|
|application_test|48 744|121|5 898 024|25.3 MiB|523 ms|3.4 MiB|96 ms|
|sample_submission|48 744|2|97 488|523.6 KiB|15 ms|167.0 KiB|12 ms|
|HomeCredit_columns_description|219|5|1 095|36.5 KiB|6 ms|10.3 KiB|9 ms|
|**TOTAL**|**58 538 856**|**346**|**493 571 166**|**2.5 GiB**|**41 s**|**504.6 MiB**|**10 s**|
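
The formatting helpers used above (`format_iB` and the locale-aware `n` format) depend on the private `pepper` package and on the active locale. The sketch below shows locale-independent equivalents under the assumption that sizes use binary (IEC) units and that thousands are separated by spaces, as in the table. Both function names are hypothetical.

```python
def format_binary_size(n_bytes):
    """Format a byte count with binary (IEC) units, e.g. 1536 -> '1.5 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    size = float(n_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024


def format_int_with_spaces(n):
    """Format an integer with spaces as thousand separators."""
    return f"{n:,}".replace(",", " ")
```

For instance, `format_binary_size(723118349)` gives `689.6 MiB`, matching the `csv size` shown for `installments_payments`.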

# Cleanup

Let's retrieve the files generated in the `pyarrow` + `gzip` configuration and move them to the parent directory `dataset/pqt/` for safekeeping:

In [13]:
from pathlib import Path
def move_files_to_parent(dir_path):
    """Moves all files in the specified directory to its parent directory.
    
    Parameters:
    -----------
    dir_path (str):
        The path to the directory containing the files.
    """
    parent_dir_path = Path(dir_path).parent
    for file in Path(dir_path).iterdir():
        print(file.name)
        file.rename(parent_dir_path.joinpath(file.name))

pqt_subdir = "../../dataset/pqt/pyarrow_gzip/"
move_files_to_parent(pqt_subdir)

application_test.pqt
application_train.pqt
bureau.pqt
bureau_balance.pqt
credit_card_balance.pqt
HomeCredit_columns_description.pqt
installments_payments.pqt
POS_CASH_balance.pqt
previous_application.pqt
sample_submission.pqt


And let's free up disk space by removing the files in the other parquet formats generated during our benchmark (because life is all about taking risks, right?):

In [42]:
import os, shutil
def _dangerous_rmtree_all_subdirs(dir_path):
    """Recursively removes all subdirectories and their contents within the specified directory.
    
    Parameters:
    -----------
    dir_path (str):
        The path to the directory whose subdirectories are removed.
    """
    for child_name in os.listdir(dir_path):
        child_path = os.path.join(dir_path, child_name)
        if os.path.isdir(child_path):
            shutil.rmtree(child_path)

pqt_dir = "../../dataset/pqt"
_dangerous_rmtree_all_subdirs(pqt_dir)
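
Since this deletion is irreversible, a more cautious variant could default to a dry run that only reports what would be removed. This is a sketch of one possible safeguard, not the function used in this notebook; the name `remove_subdirs` and its `dry_run` parameter are hypothetical.

```python
import shutil
from pathlib import Path


def remove_subdirs(dir_path, dry_run=True):
    """Remove every subdirectory of dir_path; files directly inside it are kept.

    With dry_run=True (the default) nothing is deleted: the function only
    returns the names of the subdirectories that would be removed.
    """
    targets = sorted(p for p in Path(dir_path).iterdir() if p.is_dir())
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return [p.name for p in targets]
```

Calling `remove_subdirs(pqt_dir)` first and inspecting the returned names before calling it again with `dry_run=False` reduces the risk of deleting the wrong directory.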