## 1. Load and read the dataset

Read the dataset from an HDF5 file and print its structure.

In [4]:
import h5py
import json
import os
import random

def get_all_hdf5_files(directory):
    """
    Get all HDF5 files in a directory and its subdirectories.
    """
    hdf5_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.hdf5'):
                hdf5_files.append(os.path.join(root, file))
    return hdf5_files

def print_structure(group, indent=0):
    """
    Print the hierarchical structure of an HDF5 group.
    """
    for key in group.keys():
        item = group[key]
        print(" " * indent + key)
        if isinstance(item, h5py.Group):
            print_structure(item, indent + 4)
        elif isinstance(item, h5py.Dataset):
            print(" " * indent, item.shape, item.dtype)
            # These datasets hold byte strings; decode them for readable output.
            if key in ("entities", "target_entity", "instruction"):
                decoded = [x.decode('utf-8') for x in item]
                print(" " * indent, decoded)
        elif isinstance(item, h5py.Datatype):
            # Named (committed) datatype: show the NumPy dtype it describes.
            print(" " * (indent + 4) + "Datatype", item.dtype)

In [5]:
dataset_root = "/mnt/data/310_jiarui/datafactory/get_coffee" 
hdf5_files = get_all_hdf5_files(dataset_root)
print(hdf5_files[:3])

['/mnt/data/310_jiarui/datafactory/get_coffee/data_52.hdf5', '/mnt/data/310_jiarui/datafactory/get_coffee/data_50.hdf5', '/mnt/data/310_jiarui/datafactory/get_coffee/data_0.hdf5']
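
The listing above comes back in filesystem order (`data_52`, `data_50`, `data_0`). If a reproducible order is needed, the paths can be sorted by their numeric index. The sketch below assumes the file names follow the `data_<N>.hdf5` pattern seen in the output; `sort_by_episode_index` is a hypothetical helper, not part of the repository.

```python
import os
import re

def sort_by_episode_index(paths):
    """Sort paths by the integer in a `data_<N>.hdf5` file name.

    Paths whose file name does not match the assumed pattern are placed
    last, in alphabetical order.
    """
    def key(path):
        match = re.fullmatch(r"data_(\d+)\.hdf5", os.path.basename(path))
        if match:
            return (0, int(match.group(1)), path)
        return (1, 0, path)
    return sorted(paths, key=key)
```

For example, `hdf5_files = sort_by_episode_index(hdf5_files)` would place `data_0.hdf5` before `data_50.hdf5` and `data_52.hdf5`.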


In [6]:
example_file = random.choice(hdf5_files)
with h5py.File(example_file, 'r') as f:
    print_structure(f)

data
    2025-07-09 17:16:11
        instruction
         (1,) |S23
         ['Get me a cup of coffee.']
        meta_info
            entities
             (4,) |S14
             ['table', 'coffee_machine', 'mug_seen', 'bottom']
            episode_config
             () |S1460
            target_entity
             (1,) |S8
             ['mug_seen']
        observation
            depth
             (211, 6, 480, 480) float32
            ee_state
             (211, 8) float32
            point_cloud_colors
             (211, 30330, 3) float32
            point_cloud_points
             (211, 30330, 3) float32
            q_acceleration
             (211, 7) float32
            q_state
             (211, 7) float32
            q_velocity
             (211, 7) float32
            rgb
             (211, 6, 480, 480, 3) uint8
            robot_mask
             (211, 6, 480, 480) float32
        trajectory
         (211, 8) float32


## 2. Convert to TFDS dataset

Among the baseline methods, OpenVLA and Octo are trained on datasets in the TFDS format. This repository includes a modified TFDS dataset builder that converts the HDF5 source format of VLABench into the TFDS format. We offer a [single-threaded TFDS builder](../VLABench/utils/rlds_builder.py) and a [multi-threaded TFDS builder](../VLABench/utils/multithread_rlds_builder.py).

To convert the dataset, run
```sh
python scripts/convert_to_rlds.py --save_dir /your/vlabench/dataset/root --task {target_task}
```
to create the TFDS builder in the dataset directory.
Then, run
```sh
cd /your/vlabench/dataset/root/target_task
tfds build --overwrite
```
to generate the dataset in RLDS format.

For further guidance, we recommend referring to https://github.com/kpertsch/rlds_dataset_builder.

## 3. Convert to LeRobot dataset

Among the baseline methods, $\pi_0$ is fine-tuned on datasets in the LeRobot format. We provide a script that converts an HDF5 dataset into the LeRobot format, similar to the LIBERO dataset used by $\pi_0$.

Run 
```sh
python scripts/convert_to_lerobot.py --dataset-name xxx --dataset-path /your/path/to/hdf5 --max-files 500 --task-list task1 task2 ... task n
```
to create a multi-task LeRobot dataset.
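
When the conversion is launched from Python rather than a shell, the same command can be assembled as an argument list and passed to `subprocess.run`. The sketch below uses only the flags shown above; `build_lerobot_command` is a hypothetical helper and `my_dataset` is a placeholder name, and the script is assumed to be run from the repository root.

```python
import shlex
import sys

def build_lerobot_command(dataset_name, dataset_path, task_list, max_files=500):
    """Build the argument list for scripts/convert_to_lerobot.py.

    Each task becomes its own argument after --task-list, so several
    tasks end up in one multi-task dataset.
    """
    return [
        sys.executable, "scripts/convert_to_lerobot.py",
        "--dataset-name", dataset_name,
        "--dataset-path", dataset_path,
        "--max-files", str(max_files),
        "--task-list", *task_list,
    ]

cmd = build_lerobot_command(
    "my_dataset",                      # placeholder dataset name
    "/your/path/to/hdf5",
    ["get_coffee", "set_study_table"],
)
print(shlex.join(cmd))  # show the equivalent shell command
# To actually run the conversion:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single string avoids shell quoting issues when paths contain spaces.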

In [None]:
!python scripts/convert_to_lerobot.py --dataset-name get_coff_simple_100 --dataset-path /mnt/data/310_jiarui/datafactory --max-files 500 --task-list get_coffee

In [None]:
!python scripts/convert_to_lerobot.py --dataset-name set_study_table --dataset-path /mnt/data/310_jiarui/datafactory/fix --max-files 500 --task-list set_study_table