# Uploading Image Data Resource (IDR) bioimaging datasets to Hugging Face

This notebook demonstrates how to reformat a bioimaging dataset from the Image Data Resource (IDR) so that it meets the Hugging Face dataset conventions.

## Data download

### List all Zarr files in the given dataset
First, obtain the dataset S3 path by clicking on a single well in the IDR web interface, and selecting "Show filepaths..." in the right thumbnail:

 ![image.png](images/image1.png)

In [None]:
import bioimageaipub as bp
import pandas as pd  # used by later cells
from pathlib import Path

bp.list_available_data_s3(
    endpoint_url="https://uk1s3.embassy.ebi.ac.uk",
    data_dir="s3://bia-integrator-data/S-BIAD845/",
    file_type=".zarr",
)

['bia-integrator-data/S-BIAD845/009bd3ab-eb79-4cf4-8a11-ad028b827c03/009bd3ab-eb79-4cf4-8a11-ad028b827c03.zarr',
 'bia-integrator-data/S-BIAD845/016aec49-1db4-4f7e-a134-6accf678b0da/016aec49-1db4-4f7e-a134-6accf678b0da.zarr',
 'bia-integrator-data/S-BIAD845/032164c1-bad8-43c0-9f08-842c5fc12665/032164c1-bad8-43c0-9f08-842c5fc12665.zarr',
 'bia-integrator-data/S-BIAD845/0414c2c6-9c98-4d8e-832b-e0a626586b60/0414c2c6-9c98-4d8e-832b-e0a626586b60.zarr',
 'bia-integrator-data/S-BIAD845/041afe9e-e15c-406f-83d3-1f46d5260b61/041afe9e-e15c-406f-83d3-1f46d5260b61.zarr',
 'bia-integrator-data/S-BIAD845/08568419-fdb4-4f4d-a00f-c9f9c0f74713/08568419-fdb4-4f4d-a00f-c9f9c0f74713.zarr',
 'bia-integrator-data/S-BIAD845/0869dc84-bcb6-4392-af3f-0f9764a5a1bb/0869dc84-bcb6-4392-af3f-0f9764a5a1bb.zarr',
 'bia-integrator-data/S-BIAD845/09d35e01-6ba8-467d-9e35-517f05510a1d/09d35e01-6ba8-467d-9e35-517f05510a1d.zarr',
 'bia-integrator-data/S-BIAD845/0e7a70c5-664f-4b77-b981-cc3ee3120545/0e7a70c5-664f-4b77-b981-cc3

#### Download the actual Zarrs

You can edit `to_download_zarrs.txt` to specify exactly which Zarrs to download (useful when memory or disk space is limited).
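A reduced download list could be produced along the following lines. This is a minimal sketch that assumes the list file holds one Zarr path per line; the file format is not documented in this notebook, so treat that as an assumption. `write_zarr_subset` is a hypothetical helper and not part of `bioimageaipub`.

```python
from pathlib import Path


def write_zarr_subset(zarr_paths, list_path, max_items=10):
    """Write the first `max_items` Zarr paths to a text file, one per line.

    Assumption: the download step reads one path per line from this file.
    """
    subset = list(zarr_paths)[:max_items]
    list_path = Path(list_path)
    # Create the parent folder if it does not exist yet
    list_path.parent.mkdir(parents=True, exist_ok=True)
    list_path.write_text("".join(p + "\n" for p in subset), encoding="utf-8")
    return subset
```

The list returned by `bp.list_available_data_s3(...)` could be passed as `zarr_paths`, with `list_path` pointing at the file given to the download step.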

In [None]:
bp.download_zarrs_from_s3(path="data/idr_zarrs", list_path="../example-data/to_download_zarrs.txt")

## Image format conversion

Here, we convert the raw image data to a more accessible format for the wider AI community. Recommended formats are 16-bit PNG and TIFF.

In [None]:
bp.convert_zarr(root_data_path=Path("data/idr_zarrs"), converted_data_path=Path("data/idr_converted"), zarr_file_list="example-data/to_download_zarrs.txt", output_format="png")
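To confirm that the converted files really are 16-bit, the bit depth can be read directly from the PNG header. The sketch below relies only on the layout defined by the PNG specification (an 8-byte signature followed by the `IHDR` chunk); `png_bit_depth` is a hypothetical helper, not a `bioimageaipub` function.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_bit_depth(path):
    """Return (width, height, bit_depth, color_type) from a PNG's IHDR chunk."""
    with open(path, "rb") as f:
        header = f.read(26)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type ("IHDR")
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    # Bytes 16-25: width (4), height (4), bit depth (1), colour type (1)
    return struct.unpack(">IIBB", header[16:26])
```

A 16-bit PNG reports a bit depth of 16 in the third position of the returned tuple.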

### Reformat the dataset as ImageFolder
See the Hugging Face documentation on image datasets: https://huggingface.co/docs/datasets/image_dataset

In [None]:
train_split_ratio = 0.8

bp.test_train_split(converted_data_path=Path("data/idr_converted"), train_ratio=train_split_ratio)
bp.split_into_folders(target=Path("data/idr_converted/test"), max_files=10000)
bp.split_into_folders(target=Path("data/idr_converted/train"), max_files=10000)
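The logic behind these two steps can be sketched in plain Python. This is an illustration of the general technique (a seeded shuffle, a split by ratio, then chunking into sub-folders of bounded size), not the actual `bioimageaipub` implementation; the function names and the numeric sub-folder names are assumptions.

```python
import random


def split_train_test(file_names, train_ratio=0.8, seed=0):
    """Shuffle file names reproducibly and split them into (train, test)."""
    names = sorted(file_names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_ratio)
    return names[:n_train], names[n_train:]


def chunk_into_folders(file_names, max_files=10000):
    """Map sub-folder names ("1", "2", ...) to at most `max_files` names each."""
    return {
        str(start // max_files + 1): file_names[start:start + max_files]
        for start in range(0, len(file_names), max_files)
    }
```

Keeping each folder at or below 10,000 entries matches the limit that the upload step warns about later in this notebook.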

## Image-level metadata annotation

#### Import and harmonize metadata from IDR

A particular strength of IDR is the high-quality annotation it provides. In the following, we download the annotation as a data table from a separate source (the study's annotation CSV on GitHub).

In [None]:
anno = bp.fetch_idr_annotation(study_id="idr0012")

  anno = pd.read_csv("https://raw.githubusercontent.com/IDR/idr0012-fuchs-cellmorph/700788ee94c21e35b0d614845159e780440772ca/screenA/idr0012-screenA-annotation.csv", sep = ",")


In [27]:
anno

Unnamed: 0,Plate,Well Number,Well,Plate_Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,Term Source 2 Accession,...,Phenotype 15 Term Accession,Phenotype 16,Phenotype 16 Term Name a,Phenotype 16 Term Accession a,Phenotype 16 Term Name b,Phenotype 16 Term Accession b,Phenotype 17,Phenotype 17 Term Name,Phenotype 17 Term Accession,Phenotype 18
0,HT01,1,A1,HT01_A1,,,,,,,...,,,,,,,,,,
1,HT01,2,A2,HT01_A2,,,,,,,...,,,,,,,,,,
2,HT01,3,A3,HT01_A3,,,,,,,...,,,,,,,,,,
3,HT01,4,A4,HT01_A4,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,EFO_0001185,...,,,,,,,,,,
4,HT01,5,A5,HT01_A5,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,EFO_0001185,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26107,HT68,380,P20,HT68_P20,,,,,,,...,,,,,,,,,,
26108,HT68,381,P21,HT68_P21,,,,,,,...,,,,,,,,,,
26109,HT68,382,P22,HT68_P22,,,,,,,...,,,,,,,,,,
26110,HT68,383,P23,HT68_P23,,,,,,,...,,,,,,,,,,


In [31]:
anno["Plate_Well"].unique()

array(['HT01_A1', 'HT01_A2', 'HT01_A3', ..., 'HT68_P22', 'HT68_P23',
       'HT68_P24'], shape=(26112,), dtype=object)

In [None]:
huggingface_df = bp.produce_hf_anno(num_fields=2, anno=anno)

Unnamed: 0,file_name,Plate_Well,Plate,Well Number,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,...,Phenotype 15 Term Accession,Phenotype 16,Phenotype 16 Term Name a,Phenotype 16 Term Accession a,Phenotype 16 Term Name b,Phenotype 16 Term Accession b,Phenotype 17,Phenotype 17 Term Name,Phenotype 17 Term Accession,Phenotype 18
0,HT01_A1_0.png,HT01_A1,HT01,1,A1,,,,,,...,,,,,,,,,,
1,HT01_A1_1.png,HT01_A1,HT01,1,A1,,,,,,...,,,,,,,,,,
2,HT01_A2_0.png,HT01_A2,HT01,2,A2,,,,,,...,,,,,,,,,,
3,HT01_A2_1.png,HT01_A2,HT01,2,A2,,,,,,...,,,,,,,,,,
4,HT01_A3_0.png,HT01_A3,HT01,3,A3,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52219,HT68_P22_1.png,HT68_P22,HT68,382,P22,,,,,,...,,,,,,,,,,
52220,HT68_P23_0.png,HT68_P23,HT68,383,P23,,,,,,...,,,,,,,,,,
52221,HT68_P23_1.png,HT68_P23,HT68,383,P23,,,,,,...,,,,,,,,,,
52222,HT68_P24_0.png,HT68_P24,HT68,384,P24,,,,,,...,,,,,,,,,,
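Judging from the output above, each well-level row is repeated once per imaging field and receives a `file_name` of the form `<Plate_Well>_<field>.png`. The sketch below reproduces that pattern on plain dictionaries; it is an assumption about what `bp.produce_hf_anno` does, and `expand_wells_to_images` is a hypothetical name.

```python
def expand_wells_to_images(well_rows, num_fields=2, ext="png"):
    """Repeat each well-level row once per field and add a `file_name` key."""
    image_rows = []
    for row in well_rows:
        for field in range(num_fields):
            # file_name comes first, as the ImageFolder metadata format expects this column
            image_row = {"file_name": f"{row['Plate_Well']}_{field}.{ext}"}
            image_row.update(row)
            image_rows.append(image_row)
    return image_rows


wells = [
    {"Plate_Well": "HT01_A1", "Plate": "HT01", "Well": "A1"},
    {"Plate_Well": "HT01_A2", "Plate": "HT01", "Well": "A2"},
]
for image_row in expand_wells_to_images(wells):
    print(image_row["file_name"])
# HT01_A1_0.png
# HT01_A1_1.png
# HT01_A2_0.png
# HT01_A2_1.png
```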


### Add OMExcavator metadata

[OMExcavator](https://codebase.helmholtz.cloud/stefan.dvoretskii/omero-crawler) is a tool that extracts OMERO image metadata in an interoperable JSON-LD format. We use it here to extract metadata fields that are present in the OMERO annotation but not in the official CSV; these are mostly technical metadata related to imaging.

In [None]:
# Use a raw string so the backslashes in the Windows path are not treated as escape sequences
bp.fetch_omero_metadata(omexcavator_path=r"c:\src\omexcavator")

Fetching the metadata is currently slow: it can take around 14 hours for a dataset of 55,000 images.

In [None]:
## Parse a single OMExcavator JSON file to see what metadata fields are available
import json
with open("../example-data/image-1811248-well-1000475-plate-4287-screen-1202.jsonld", "r") as f:
    omexcavator_metadata = json.load(f)

In [34]:
for ann_obj in omexcavator_metadata["image"]["Annotations"]:
    print(ann_obj["@type"], ":", ann_obj["Value"])

http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['siRNA Identifier', 'D-003255-06'], ['siRNA Identifier', 'D-003255-07'], ['siRNA Identifier', 'D-003255-08'], ['siRNA Identifier', 'D-003255-09'], ['siRNA Pool Identifier', 'M-003255-02']]
http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['Original GeneID Target', '1111'], ['Original Gene Target', 'CHEK1'], ['Original RefSeq Target', 'NM_001274'], ['Original LocusLink Target', '20127419']]
http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['RefSeq Accession', 'NM_001274'], ['RefSeq Hits', '4'], ['Analysis Gene Annotation Build', 'NCBI36.3, RefSeq release 27, Jan 2008']]
http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['Organism', 'Homo sapiens']]
http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['Channels', 'Alexa 488:tubulin;Hoechst:DNA;Tritc:actin']]
http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation : [['Cell Line', 'HeLa']]
http://
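Since each `MapAnnotation` holds a list of key/value pairs, it can be convenient to flatten them into a single dictionary before comparing against the IDR column names. A minimal sketch, assuming the structure printed above (`@type` plus a `Value` list of pairs); `flatten_map_annotations` is a hypothetical helper.

```python
from collections import defaultdict

MAP_ANNOTATION = "http://www.openmicroscopy.org/Schemas/OME/2016-06#MapAnnotation"


def flatten_map_annotations(annotations):
    """Collect key/value pairs from MapAnnotation objects into {key: [values]}."""
    flat = defaultdict(list)
    for ann in annotations:
        if ann.get("@type") != MAP_ANNOTATION:
            continue  # skip other annotation types
        for key, value in ann.get("Value", []):
            # A key may occur several times (e.g. 'siRNA Identifier'), so keep a list
            flat[key].append(value)
    return dict(flat)
```

Calling `flatten_map_annotations(omexcavator_metadata["image"]["Annotations"])` would then give one entry per annotation key, which makes it easy to check which keys are missing from `anno.columns`.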

In [35]:
idr_anno_collist = anno.columns.tolist()
idr_anno_collist

['Plate',
 'Well Number',
 'Well',
 'Plate_Well',
 'Characteristics [Organism]',
 'Term Source 1 REF',
 'Term Source 1 Accession',
 'Characteristics [Cell Line]',
 'Term Source 2 REF',
 'Term Source 2 Accession',
 'siRNA Pool Identifier',
 'siRNA Identifier',
 'Original GeneID Target',
 'Original Gene Target',
 'Original RefSeq Target',
 'Original LocusLink Target',
 'Reagent Design Gene Annotation Build',
 'Gene Identifier',
 'Gene Symbol',
 'RefSeq Accession',
 'RefSeq Hits',
 'Gene Annotation Comments',
 'Analysis Gene Annotation Build',
 'Control Type',
 'Control Comments',
 'Channels',
 'n',
 'ext',
 'ecc',
 'Ato',
 'Nex',
 'Nin',
 'Nto',
 'AF',
 'BC',
 'C',
 'M',
 'LA',
 'P',
 'Cluster',
 'Has Phenotype',
 'Phenotype Annotation Level',
 'Phenotype 1',
 'Phenotype 1 Term Name a',
 'Phenotype 1 Term Accession a',
 'Phenotype 1 Term Name b',
 'Phenotype 1 Term Accession b',
 'Phenotype 2',
 'Phenotype 2 Term Name',
 'Phenotype 2 Term Accession',
 'Phenotype 3',
 'Phenotype 3 Term Na

In [36]:
list(filter(lambda x: "Channel" in x, idr_anno_collist))

['Channels']

It looks like all the image annotations are already present in the IDR annotation table (**this might be different for other datasets!**). However, the IDR annotation contains neither the physical size of the images nor the channel colors.

In [37]:
list(filter(lambda x: "Size" in x, idr_anno_collist))

[]

In [38]:
# Add physical size metadata to the Hugging Face dataframe.
# Note: the values come from the single example image loaded above and are applied to every row.
pixels = omexcavator_metadata["image"]["Pixels"]
huggingface_df["PhysicalSizeX"] = f'{pixels["PhysicalSizeX"]["Value"]} {pixels["PhysicalSizeX"]["Unit"]}'
huggingface_df["PhysicalSizeY"] = f'{pixels["PhysicalSizeY"]["Value"]} {pixels["PhysicalSizeY"]["Unit"]}'

In [39]:
huggingface_df["Channel_Color"] = "; ".join([f"{i}: {chan['Color']}" for i, chan in enumerate(omexcavator_metadata["channels"])])
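The integers stored in `Channel_Color` are hard to read as they stand. In the OME data model, a channel `Color` is a signed 32-bit integer that encodes RGBA, so it can be decoded as sketched below. This assumes OMExcavator passes the OME value through unchanged; `decode_ome_color` is a hypothetical helper.

```python
def decode_ome_color(value):
    """Convert a signed 32-bit OME colour integer into an (R, G, B, A) tuple."""
    unsigned = value & 0xFFFFFFFF  # reinterpret the signed integer as unsigned
    return (
        (unsigned >> 24) & 0xFF,  # red
        (unsigned >> 16) & 0xFF,  # green
        (unsigned >> 8) & 0xFF,   # blue
        unsigned & 0xFF,          # alpha
    )


# Colour values of the three channels in this dataset
for channel, color in enumerate([-16776961, 16711935, 65535]):
    print(channel, decode_ome_color(color))
# 0 (255, 0, 0, 255)
# 1 (0, 255, 0, 255)
# 2 (0, 0, 255, 255)
```

Storing the decoded tuple (or a hex string) instead of the raw integer may make the metadata easier to interpret for downstream users.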

In [None]:
huggingface_df

Unnamed: 0,file_name,Plate_Well,Plate,Well Number,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,...,Phenotype 16 Term Accession a,Phenotype 16 Term Name b,Phenotype 16 Term Accession b,Phenotype 17,Phenotype 17 Term Name,Phenotype 17 Term Accession,Phenotype 18,PhysicalSizeX,PhysicalSizeY,Channel_Color
0,HT01_A1_0.png,HT01_A1,HT01,1,A1,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
1,HT01_A1_1.png,HT01_A1,HT01,1,A1,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
2,HT01_A2_0.png,HT01_A2,HT01,2,A2,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
3,HT01_A2_1.png,HT01_A2,HT01,2,A2,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
4,HT01_A3_0.png,HT01_A3,HT01,3,A3,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52219,HT68_P22_1.png,HT68_P22,HT68,382,P22,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
52220,HT68_P23_0.png,HT68_P23,HT68,383,P23,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
52221,HT68_P23_1.png,HT68_P23,HT68,383,P23,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
52222,HT68_P24_0.png,HT68_P24,HT68,384,P24,,,,,,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535


In [None]:
bp.save_hf_anno(huggingface_df, converted_data_path="data/idr_converted", split_folder="all", mixed_data_type_columns=["Original GeneID Target"])

### Test conversion to PyArrow

Hugging Face uses PyArrow to build the dataset preview, so the metadata tables need to convert to Arrow without errors.

![image.png](images/image2.png)

See https://discuss.huggingface.co/t/dataset-preview-pyarrow-lib-arrowtypeerror-expected-bytes-got-a-float-object/169874 for the bug description. I check the tables locally and remove the problematic mixed-dtype columns (unfortunately sacrificing some metadata).
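Candidate columns for `mixed_data_type_columns` can be found with a rough pre-check on the CSV itself. The sketch below flags columns whose non-empty cells mix numeric and non-numeric values. It uses only the standard library and a deliberately simple classification, so it approximates rather than reproduces pandas' dtype inference; `find_mixed_columns` is a hypothetical helper.

```python
import csv
import io


def cell_kind(text):
    """Classify a CSV cell as 'empty', 'number' or 'text'."""
    if text.strip() == "":
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def find_mixed_columns(csv_file):
    """Return the columns whose non-empty cells mix numbers and text."""
    kinds = {}
    for row in csv.DictReader(csv_file):
        for column, text in row.items():
            if column is None:
                continue  # surplus cells beyond the header row
            kinds.setdefault(column, set()).add(cell_kind(text or ""))
    return sorted(col for col, seen in kinds.items() if {"number", "text"} <= seen)


sample = io.StringIO(
    "file_name,Cluster,Plate\n"
    "a.png,1,HT01\n"
    "b.png,A,HT01\n"
    "c.png,,HT02\n"
)
print(find_mixed_columns(sample))
# ['Cluster']
```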

In [None]:
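import pandas as pd
import pyarrow as pa

# Use a Path so that later cells can join sub-folders with the "/" operator
converted_data_path = Path("data/idr_converted")
df = pd.read_csv(f"{converted_data_path}/metadata/metadata.csv")
pat = pa.Table.from_pandas(df)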


In [None]:
import os
import pandas as pd

# Wrap in Path so the join also works when converted_data_path is a plain string
train_dir = Path(converted_data_path) / "train"
for splitdir in os.listdir(train_dir):
    if os.path.isdir(train_dir / splitdir):
        bp.save_hf_anno(huggingface_df, converted_data_path="data/idr_converted", split_folder=f"train/{splitdir}", mixed_data_type_columns=["Cluster"])

In [43]:
import pyarrow as pa
for i in range(1,5):
    df = pd.read_csv(f"{converted_data_path}/train{i}/metadata.csv")
    
    pat = pa.Table.from_pandas(df)


In [None]:
test_images = []
for image_fname in os.listdir(f"{converted_data_path}/test/"):
    test_images.append(image_fname)
    
test_huggingface_anno = pd.merge(pd.Series(test_images, name = "file_name"), huggingface_df, how="left")
test_huggingface_anno

Unnamed: 0,file_name,Plate_Well,Plate,Well Number,Well,Characteristics [Organism],Term Source 1 REF,Term Source 1 Accession,Characteristics [Cell Line],Term Source 2 REF,...,Phenotype 16 Term Accession a,Phenotype 16 Term Name b,Phenotype 16 Term Accession b,Phenotype 17,Phenotype 17 Term Name,Phenotype 17 Term Accession,Phenotype 18,PhysicalSizeX,PhysicalSizeY,Channel_Color
0,HT60_F21_1.png,HT60_F21,HT60,141.0,F21,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
1,HT53_O10_0.png,HT53_O10,HT53,346.0,O10,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
2,HT30_I23_1.png,HT30_I23,HT30,215.0,I23,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,other phenotype,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
3,HT47_K19_0.png,HT47_K19,HT47,259.0,K19,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
4,HT25_O18_1.png,HT25_O18,HT25,354.0,O18,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9263,HT24_P9_0.png,HT24_P9,HT24,369.0,P9,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
9264,HT37_C13_0.png,HT37_C13,HT37,61.0,C13,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
9265,HT44_O9_1.png,HT44_O9,HT44,345.0,O9,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535
9266,HT55_L24_0.png,HT55_L24,HT55,288.0,L24,Homo sapiens,NCBITaxon,NCBITaxon_9606,HeLa,EFO,...,,,,,,,,338.6666666666667 MICROMETER,338.6666666666667 MICROMETER,0: -16776961; 1: 16711935; 2: 65535


In [None]:
# Wrap in Path so the join also works when converted_data_path is a plain string
test_dir = Path(converted_data_path) / "test"
for splitdir in os.listdir(test_dir):
    if os.path.isdir(test_dir / splitdir):
        bp.save_hf_anno(huggingface_df, converted_data_path="data/idr_converted", split_folder=f"test/{splitdir}", mixed_data_type_columns=["Cluster"])
        
df = pd.read_csv(f"{converted_data_path}/test/metadata.csv")
pat = pa.Table.from_pandas(df)


In [46]:
df = pd.read_csv(f"{converted_data_path}/train4/metadata.csv")
pat = pa.Table.from_pandas(df)

## Image data upload

Upload the converted dataset to the Hugging Face Hub as one large folder.
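The upload warns when a folder holds more than the recommended 10,000 entries, so it can be worth checking the folder sizes before starting a long upload. A minimal pre-flight sketch using only the standard library; `oversized_folders` is a hypothetical helper and not part of `bioimageaipub` or `huggingface_hub`.

```python
from pathlib import Path


def oversized_folders(root, max_entries=10000):
    """Return {relative_folder: entry_count} for folders exceeding `max_entries`."""
    root = Path(root)
    report = {}
    # Check the root folder itself and every sub-folder below it
    for folder in [root, *(p for p in root.rglob("*") if p.is_dir())]:
        count = sum(1 for _ in folder.iterdir())
        if count > max_entries:
            report[str(folder.relative_to(root))] = count
    return report
```

An empty result means that no folder under `root` exceeds the limit; otherwise the listed folders are candidates for further splitting into sub-folders.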

In [None]:
destination_dataset = "stefanches/genomic-bioimaging"
converted_folder_path = "/vol/sdb/S-BIAD845-converted/"

bp.hf_upload_converted_folder(converted_folder_path, destination_dataset)

Folder 'train' contains 36,426 entries (36,426 files and 0 subdirectories). This exceeds the recommended 10,000 entries per folder.
Consider reorganising into sub-folders.
Recovering from metadata files: 100%|██████████| 45763/45763 [00:15<00:00, 2885.82it/s]





---------- 2025-11-03 14:05:12 (0:00:00) ----------
Files:   hashed 19/45763 (344.5K/279.7G) | pre-uploaded: 0/0 (0.0/279.7G) (+45763 unsure) | committed: 0/45763 (0.0/279.7G) | ignored: 0
Workers: hashing: 13 | get upload mode: 1 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------


---------- 2025-11-03 14:06:12 (0:01:00) ----------
Files:   hashed 7751/45763 (47.0G/279.7G) | pre-uploaded: 6400/7430 (39.2G/279.7G) (+38265 unsure) | committed: 800/45763 (4.5G/279.7G) | ignored: 0
Workers: hashing: 6 | get upload mode: 3 | pre-uploading: 4 | committing: 1 | waiting: 0
---------------------------------------------------
                             

---------- 2025-11-03 14:07:13 (0:02:01) ----------
Files:   hashed 15075/45763 (91.8G/279.7G) | pre-uploaded: 13312/14780 (81.5G/279.7G) (+30914 unsure) | committed: 1800/45763 (10.6G/279.7G) | ignored: 0
Workers: hashing: 6 | get upload mode: 2 | pre-uploading: 5 | committing: 1 | waiting: 0
---------------------------------------------------
                             

---------- 2025-11-03 14:08:13 (0:03:01) ----------
Files:   hashed 21693/45763 (132.3G/279.7G) | pre-uploaded: 19456/21601 (119.0G/279.7G) (+24093 unsure) | committed: 2800/45763 (16.7G/279.7G) | ignored: 0
Workers: hashing: 4 | get upload mode: 1 | pre-uploading: 8 | committing: 1 | waiting: 0
---------------------------------------------------
                             

---------- 2025-11-03 14:09:13 (0:04:01) ----------
Files:   hashed 28582/45763 (174.5G/279.7G) | pre-uploaded: 26903/28370 (164.7G/279.7G) (+17324 unsure) | committed: 3800/45763 (22.8G/279.7G) | ignored: 0
Workers: hashing: 6 | get upload mode: 2 | pre-uploading: 5 | committing: 1 | waiting: 0
---------------------------------------------------
                             

---------- 2025-11-03 14:10:14 (0:05:02) ----------
Files:   hashed 35924/45763 (219.5G/279.7G) | pre-uploaded: 34561/35684 (211.5G/279.7G) (+10010 unsure) | committed: 4800/45763 (29.0G/279.7G) | ignored: 0
Workers: hashing: 7 | get upload mode: 2 | pre-uploading: 4 | committing: 1 | waiting: 0
---------------------------------------------------
                             


---------- 2025-11-03 14:11:15 (0:06:03) ----------
Files:   hashed 43130/45763 (263.5G/279.7G) | pre-uploaded: 41984/42943 (256.9G/279.7G) (+2751 unsure) | committed: 6800/45763 (41.2G/279.7G) | ignored: 0
Workers: hashing: 9 | get upload mode: 1 | pre-uploading: 3 | committing: 1 | waiting: 0
---------------------------------------------------
                             

ERROR:huggingface_hub._upload_large_folder:Failed to preupload LFS: 502 Server Error: Bad Gateway for url: https://huggingface.co/api/datasets/stefanches/genomic-bioimaging/revision/main?expand=xetEnabled
---------- 2025-11-03 14:12:15 (0:07:03) ----------
Files:   hashed 45763/45763 (279.7G/279.7G) | pre-uploaded: 45694/45694 (279.7G/279.7G) | committed: 8800/45763 (53.5G/279.7G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 1 | waiting: 13
---------------------------------------------------
                             

---------- 2025-11-03 14:13:15 (0:08:03) ----------
Files:   hashed 45763/45763 (279.7G/279.7G) | pre-uploaded: 45694/45694 (279.7G/279.7G) | committed: 11381/45763 (69.3G/279.7G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 1 | waiting: 13
---------------------------------------------------
                             

---------- 2025-11-03 14:14:16 (0:09:04) ----------
Files:   hashed 45763/45763 (279.7G/279.7G) | pre-uploaded: 45694/45694 (279.7G/279.7G) | committed: 13800/45763 (84.0G/279.7G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 1 | waiting: 13
---------------------------------------------------
                             



## Dataset-level metadata annotation

To add the dataset details (dataset source, license, etc.), use the Hugging Face Dataset Card interface:

![image.png](images/image3.png)



To save time, Large Language Models can help with filling out the Dataset Card. However, they were not used in the test experiments because of the high likelihood of errors, and their use is in no way encouraged here.
If you do use them, always check the information they fill in for factual accuracy.

This step is manual and consists of gathering the dataset metadata from existing sources. For Image Data Resource datasets, there are ["study" pages](https://idr.openmicroscopy.org/study/idr0012/) which provide useful links to the repositories where additional dataset-level metadata can be found:
![image.png](images/image4.png)


## Conclusion

This notebook demonstrates how to upload bioimaging study data to Hugging Face so that the metadata is AI-ready and accessible to users through their favourite machine learning framework. Small deviations from the process may be needed for other datasets, e.g. those hosted on FTP servers instead of S3 buckets.