# Memory Efficient File Format for Fast Data Loading for Images and Tables
The goal of this notebook is to convert all the data, especially the images (105,100 files in 86 folders), into a single `HDF5` file with 86 *groups* and 105,100 *datasets*. Since the size of an `HDF5` file increases rapidly with the number of images when they are stored as decoded arrays, we store the images in their original `binary` form instead. The `csv` files are converted to compressed `Parquet` files.

**NOTE: All conversion code is commented out. The final data can be found [here](https://www.kaggle.com/ismailbaris/hum-parquet-hdf5). See the section `Read Images` at the end of this notebook to read the images!**

**Links:**
- **[Link](https://www.kaggle.com/ismailbaris/hum-parquet-hdf5) to the dataset.**
- [Link](https://databricks.com/glossary/what-is-parquet) to `parquet` homepage.
- [Link](https://www.hdfgroup.org/solutions/hdf5/) to `hdf5` homepage.
- [Link](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/) to *How to use hdf5 files with keras*.

## What is Parquet?
*Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC* ([Reference](https://databricks.com/glossary/what-is-parquet)).

## What is HDF5?
*Utilize the HDF5 high performance data software library and file format to manage, process, and store your heterogeneous data. HDF5 is built for fast I/O processing and storage. HDF® is portable, with no vendor lock-in, and is a self-describing file format, meaning everything all data and metadata can be passed along in one file. There is no limit on the number or size of data objects in the collection, giving great flexibility for big data* ([Reference](https://www.hdfgroup.org/solutions/hdf5/)).

Importing necessary packages and defining paths.

In [None]:
from pathlib import Path
import os
import pandas as pd
import h5py
import numpy as np
from PIL import Image
import io

# data_path = Path("../input/h-and-m-personalized-fashion-recommendations")  # Path to the H&M data.
# image_path = Path("../input/h-and-m-personalized-fashion-recommendations/images")  # Path to the images.

# export_path = Path("./")  # Path, where the new files will be exported.

# export_path.mkdir(parents=True, exist_ok=True)

## Reading Articles
We read the `articles.csv` file and determine which articles have an image present.

In [None]:
# articles = pd.read_csv(data_path / "articles.csv")

The next step is to collect the names of all images present in the `images` directory.

In [None]:
# file_list = []
# for p, d, f in os.walk(image_path):
#     for item in f:
#         file_list.append(int(item.split(".")[0]))
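The directory walk above can also be sketched with `pathlib`. This is only a minimal sketch, assuming that every image is a `.jpg` file named after its `article_id`; the helper name `collect_article_ids` is hypothetical. Note that `int()` drops the leading zero of the file name (`0108775015` becomes `108775015`), which is what makes the later comparison with an integer `article_id` column possible.

```python
from pathlib import Path


def collect_article_ids(image_dir):
    """Return the set of article ids that have an image below `image_dir`.

    Assumption: images are `.jpg` files whose stem is the numeric article id.
    """
    ids = set()
    for path in Path(image_dir).rglob("*.jpg"):
        if path.stem.isdigit():  # Skip anything that is not a numeric file name.
            ids.add(int(path.stem))
    return ids
```

A set is used instead of a list because membership tests such as `isin` only need unique values.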

We add a new column to the data and mark all the article ids that have an image.

In [None]:
# articles["image"] = False
# articles.loc[articles['article_id'].isin(file_list), "image"] = True

Export the `articles` to the `parquet` file format now.

In [None]:
# articles.to_parquet(export_path / "articles.parquet.gzip", compression='gzip')

## Conversion to Parquet
Next, we convert all the remaining `csv` files to the `parquet` file format:

In [None]:
# files = ["customers", "sample_submission", "transactions_train"]

# for item in files:
#     print(f"\r> Processing File: {item}", end="")
#     dataframe = pd.read_csv(data_path / f"{item}.csv")
#     dataframe.to_parquet(export_path / f"{item}.parquet.gzip", compression='gzip')
#     del dataframe

# print("\n> [Done]")

## Enhance Transaction Dataset
Since the file `transactions_train` contains only 5 columns, we join it with all the available information from the `customers` file.

In [None]:
# customers = pd.read_parquet(export_path / "customers.parquet.gzip")  # Open the customer dataset.
# transactions = pd.read_parquet(export_path / "transactions_train.parquet.gzip")  # Open the transaction dataset.

# training_data = pd.merge(transactions, customers, on=["customer_id"])
# del customers
# del transactions
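The merge above relies on the pandas default `how="inner"`, which drops every transaction whose `customer_id` does not appear in `customers`. The sketch below uses tiny made-up frames (the values and the `age` column are illustrative only) to show how `how="left"`, `validate` and `indicator` can make such losses visible. It is a hedged illustration, not a change to the join used in this notebook.

```python
import pandas as pd

# Made-up toy data; customer "z" has no entry in `customers`.
transactions = pd.DataFrame({
    "customer_id": ["a", "a", "b", "z"],
    "article_id": [108775015, 108775044, 110065001, 110065001],
})
customers = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "age": [25, 41, 33],
})

# Default inner join: the transaction of customer "z" disappears.
inner = pd.merge(transactions, customers, on="customer_id")

# Left join keeps every transaction; `validate` raises an error if a
# customer_id occurs more than once in `customers`, and `indicator`
# adds a `_merge` column that flags rows without a match.
left = pd.merge(
    transactions,
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = left[left["_merge"] == "left_only"]
```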


In [None]:
# # Save the new dataset in a new `parquet` file.
# training_data.to_parquet(export_path / "train.parquet.gzip", compression='gzip')

In [None]:
# training_data.columns

In [None]:
# del training_data

## Convert Images to HDF5
Although HDF5 is a very efficient format for reading and writing image data, the file size grows rapidly with the number of images when they are stored as decoded NumPy arrays, because an uncompressed pixel array takes far more space than the original compressed image file. To overcome this problem, we store each image as its raw binary file content. The strategy is to create a single HDF5 file and represent each image subdirectory as a group within it. Here is the file structure:

- images.h5 (Created `HDF5` file)
    - 010 (Group 1: The first three digits of the `article_id`)
       - 0108775015 (Image 1: `article_id`)
       - 0108775044 (Image 2: `article_id`)
       - ...
    - 011 (Group 2)
       - 0110065001 (Image 1)
       - ...

First, we group the image paths by subdirectory in a dictionary:

In [None]:
# image_dict = dict()
# for p, d, f in os.walk(image_path):
#     file_list = list()
#     cat_id = os.path.basename(p)
#     for item in f:
#         file_list.append(os.path.join(p, item))

#     if len(file_list) != 0:
#         image_dict[cat_id] = file_list

After that, we write the images to the `HDF5` file.

In [None]:
# hf = h5py.File(export_path / "images.h5", "a")  # Open the HDF5 file in append mode (it is created if missing).
# 
# for group, image_files in image_dict.items():  # Iterate over all image groups (first three digits of `article_id`).
#     print(f"\r> Processing Group {group}", end="")
# 
#     grp = hf.create_group(group)  # Create a group named after the first three digits of `article_id`.
# 
#     for image_file in image_files:  # Iterate over all images within that group.
#         name = Path(image_file).stem  # Extract the `article_id` (filename without directory and extension).
# 
#         with open(image_file, 'rb') as img_f:  # Read the encoded image file as raw bytes.
#             binary_data = img_f.read()
# 
#         binary_data_np = np.asarray(binary_data)  # Wrap the bytes in a 0-d NumPy byte-string array.
# 
#         grp.create_dataset(name, data=binary_data_np)  # Save the binary array in the group.
# 
# hf.close()
# print("\n> [Done]")
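As an alternative to the byte-string array used above, the raw bytes can be stored as a 1-D `uint8` array. This is a swapped-in variant and not how the published `images.h5` was written: NumPy fixed-length byte strings drop trailing null bytes when read back, whereas a `uint8` array preserves every byte. The helper names `write_blob` and `read_blob` are hypothetical.

```python
import h5py
import numpy as np


def write_blob(h5_path, key, raw_bytes):
    """Store `raw_bytes` under `key` (e.g. "010/0108775015") as uint8 data."""
    with h5py.File(h5_path, "a") as hf:
        # Intermediate groups in `key` are created automatically.
        hf.create_dataset(key, data=np.frombuffer(raw_bytes, dtype=np.uint8))


def read_blob(h5_path, key):
    """Return the bytes stored under `key` exactly as they were written."""
    with h5py.File(h5_path, "r") as hf:
        return hf[key][()].tobytes()
```

Using `with` closes the file even if an error occurs, so no explicit `hf.close()` is needed.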

### Read Images
To read the data back, we use the `visititems` method of `h5py`, which calls a function for every group and dataset in the file.

In [None]:
group = []  # List all groups.

data = []  # Store all the full data paths (group/file). These are the keys to access the image data.


def func(name, obj):  # Function to recursively store all the keys
    if isinstance(obj, h5py.Dataset):
        data.append(name)
    elif isinstance(obj, h5py.Group):
        group.append(name)

hf_path = Path("../input/hum-parquet-hdf5/hum-data-efc")
hf = h5py.File(hf_path / "images.h5", 'r')
hf.visititems(func)  # This operation fills the previously created lists `group` and `data`.

Print the first dataset.

In [None]:
print(data[0])

Read the first file as image:

In [None]:
hf_data = np.array(hf[data[0]])
image = Image.open(io.BytesIO(hf_data))
print('Image Size:', image.size)
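When the `article_id` is already known, the key can be built directly instead of collecting all keys first. The following is a minimal sketch, assuming that file names are the `article_id` zero-padded to 10 digits (as in the example `0108775015`) and that the group is the first three digits; `article_key` and `load_image` are hypothetical helper names.

```python
import io

import h5py
from PIL import Image


def article_key(article_id):
    """Build the dataset key "<first three digits>/<10-digit id>"."""
    name = f"{int(article_id):010d}"  # Assumption: ids are padded to 10 digits.
    return f"{name[:3]}/{name}"


def load_image(hf, article_id):
    """Return the PIL image for `article_id`, or None if no image is stored."""
    key = article_key(article_id)
    if key not in hf:
        return None
    raw = bytes(hf[key][()])  # Read the stored scalar and convert it to bytes.
    return Image.open(io.BytesIO(raw))
```

For example, `load_image(hf, 108775015)` looks up the dataset `010/0108775015` in an open file handle `hf`.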

You can also loop over all images in order to use them in the training process.

In [None]:
# for ds in data:
#     hf_ds = np.array(hf[ds])
#     image = Image.open(io.BytesIO(hf_ds))
#     print('Image Size:', image.size)
hf.close()