# CFDE Dataset Croissant Builder Tutorial 🥐

Author: Ido Diamant, Ma'ayan Lab, CFDE DRC

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant) Python library empowers developers to interact with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

Croissant specifications describe datasets hierarchically in four layers:
1. Metadata
2. Resource
3. Structure
4. ML Semantics
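
To make the four layers concrete, here is a rough sketch of where each one tends to appear in a Croissant JSON-LD document, written as a plain Python dict. The identifiers and values (`example-dataset`, `table.tsv`, `records/gene`) are placeholders invented for illustration, and the skeleton omits required properties such as `@context`, so it is not a valid Croissant file on its own.

```python
import json

# Illustrative skeleton only: placeholder values, not a complete Croissant file.
croissant_skeleton = {
    # 1. Metadata layer: dataset-level properties
    "@type": "sc:Dataset",
    "name": "example-dataset",
    "description": "A placeholder description.",
    # 2. Resource layer: the files and archives that hold the data
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "table.tsv",
            "encodingFormat": "text/tab-separated-values",
        }
    ],
    # 3. Structure layer: record sets and the fields they contain
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "records",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "records/gene",
                    # 4. Semantics layer: annotations such as data types on fields
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "table.tsv"},
                        "extract": {"column": "Gene"},
                    },
                }
            ],
        }
    ],
}

print(json.dumps(croissant_skeleton, indent=2))
```
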

This tutorial notebook will walk through the process of following the [Croissant specification](https://docs.mlcommons.org/croissant/docs/croissant-spec.html) to construct a Croissant JSON file for the CM4AI U2OS Cell Map Protein Localization Assemblies dataset created by the Bridge2AI DCC.

In [None]:
import datetime
import hashlib
import mlcroissant as mlc  # Install with `pip install mlcroissant` or from source
import pandas as pd
import requests
from zipfile import ZipFile

## Construct Croissant from CFDE Attribute Table

### Metadata

For each dataset, we need to define a number of properties to populate the metadata layer of the Croissant specification. Many of these fields are required, but there are some optional ones that can provide users of the dataset with additional context that may be useful for dataset discovery or pre-processing steps.

**Required**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|name|Text|ONE|The name of the dataset.|YES|
|description|Text|ONE|A description of the dataset|YES|
|license|URL/CreativeWork|MANY|The license of the dataset. Croissant recommends using the URL of a known license, e.g., one of the licenses listed at https://spdx.org/licenses/.|YES|
|url|URL|ONE|The URL of the dataset. This generally corresponds to the Web page for the dataset.|YES|
|creator|Organization/Person|MANY|The creator(s) of the dataset.|YES|
|datePublished|Date/DateTime|ONE|The date the dataset was published.|YES|
|@context|URL|ONE|A set of JSON-LD context definitions that make the rest of the Croissant description less verbose. See the recommended JSON-LD context in Appendix 1 of the Croissant specification.|YES|
|@type|Text|ONE|The type of a croissant dataset must be schema.org/Dataset.|YES|
|dct:conformsTo|URL|ONE|Croissant datasets must declare that they conform to the versioned schema: http://mlcommons.org/croissant/1.0|YES|
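
As a quick sanity check on the table above, the following sketch lists which required keys are absent or empty in a Croissant-style dict. The helper name `missing_required_keys` is hypothetical, and the key spellings assume the recommended JSON-LD context (for example, `dct:conformsTo` appearing as `conformsTo`). It is not a substitute for the validation that `mlcroissant` performs.

```python
# Required metadata keys, spelled as they appear under the recommended JSON-LD context.
REQUIRED_KEYS = [
    "@context",
    "@type",
    "name",
    "description",
    "license",
    "url",
    "creator",
    "datePublished",
    "conformsTo",
]


def missing_required_keys(doc):
    """Return the required keys that are absent or empty in a Croissant-style dict."""
    return [key for key in REQUIRED_KEYS if doc.get(key) in (None, "", [], {})]


# A deliberately incomplete example document.
partial_doc = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "sc:Dataset",
    "name": "example-dataset",
    "description": "A placeholder description.",
}

missing = missing_required_keys(partial_doc)
print(missing)
```
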


**Recommended**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|keywords|DefinedTerm/Text/URL|MANY|A set of keywords associated with the dataset, either as free text, or a DefinedTerm with a formal definition.|NO|
|publisher|Organization/Person|MANY|The publisher of the dataset, which may be distinct from its creator.|NO|
|version|Number/Text|ONE|The version of the dataset, following the versioning requirements in the Croissant specification.|NO|
|dateCreated|Date/DateTime|ONE|The date the dataset was initially created.|NO|
|dateModified|Date/DateTime|ONE|The date the dataset was last modified.|NO|
|sameAs|URL|MANY|The URL of another Web resource that represents the same dataset as this one.|NO|
|sdLicense|CreativeWork/URL|MANY|A license document that applies to this structured data, typically indicated by URL.|NO|
|inLanguage|Language/Text|MANY|The language(s) of the content of the dataset.|NO|
|isLiveDataset|Boolean|ONE|Whether the dataset is a live dataset.|NO|
|citeAs|Text|ONE|A citation to the dataset itself, or a citation for a publication that describes the dataset, ideally using the bibtex format.|NO|

In [None]:
name='CM4AI U2OS Cell Map Protein Localization Assemblies'
description="Protein localization assemblies constructed from integrating AP-MS biomolecular interaction and IF imaging data"

short_citation = 'Schaffer, Nature, 2025'
title = 'Multimodal cell maps as a foundation for structural and functional genomics'
author = 'Schaffer, LV'
journal = 'Nature'
year = 2025
volume = 642
pages = '222–231'

cite_as=(f'@article{{{short_citation}, title={{{title}}}, author={{{author}}}, journal={{{journal}}}, year={{{year}}}, volume={{{volume}}}, pages={{{pages}}}}}')

creators = [
    mlc.Organization(name="Cell Maps for AI", url="https://cm4ai.org/")
]

publishers = [
    mlc.Organization(name="Ma'ayan Lab", url="https://maayanlab.cloud/")
]

license = "http://creativecommons.org/licenses/by-nc-sa/4.0"
version = "0.1.0"
url="https://maayanlab.cloud/Harmonizome/dataset/CM4AI+U2OS+Cell+Map+Protein+Localization+Assemblies"
date_published=datetime.date(2025, 9, 29)
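
The triple-brace f-string used for `cite_as` is easy to mistype. One possible alternative is a small helper that assembles the BibTeX entry from a dict. The function name `bibtex_entry` and the citation key `example2025` are hypothetical and not part of the original notebook; the helper only formats text and does not validate BibTeX.

```python
def bibtex_entry(entry_type, key, fields):
    """Format a single-line BibTeX entry from a mapping of field names to values."""
    body = ", ".join(f"{name}={{{value}}}" for name, value in fields.items())
    return f"@{entry_type}{{{key}, {body}}}"


# Hypothetical citation key and field values, for illustration only.
example_citation = bibtex_entry(
    "article",
    "example2025",
    {"title": "An example title", "journal": "Nature", "year": 2025},
)
print(example_citation)
```
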

### Resource
The resource layer describes how to access the files that contain the dataset's records. Each archive or file (a FileObject or FileSet) is listed in the `distribution` property.

This layer also handles the required `distribution` property:
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|distribution|FileObject/FileSet|MANY|By contrast with schema.org/Dataset, Croissant requires the distribution property to have values of type FileObject or FileSet.|YES|


Within the distribution property, we need to define several properties for each FileObject or FileSet that is included in the dataset.

**FileObject**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|sc:name|Text|ONE|The name of the file. As much as possible, the name should reflect the name of the file as downloaded, including the file extension. e.g. "images.zip".|YES|
|sc:contentUrl|URL|ONE|Actual bytes of the media object, for example the image file or video file.|YES|
|sc:contentSize|Text|ONE|File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified.|NO|
|sc:encodingFormat|Text|ONE|The format of the file, given as a mime type.|YES|
|sc:sameAs|URL|MANY|URL (or local name) of a FileObject with the same content, but in a different format.|NO|
|sc:sha256|Text|ONE|Checksum for the file contents.|YES|
|containedIn|Text|MANY|Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object.|NO|

**FileSet**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|containedIn|Reference|MANY|The source of data for the FileSet, e.g., an archive. If multiple values are provided for containedIn, then the union of their contents is taken (e.g., this can be used to combine files from multiple archives).|YES|
|includes|Text|MANY|A glob pattern that specifies the files to include.|NO|
|excludes|Text|MANY|A glob pattern that specifies the files to exclude.|NO|
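
To illustrate how `includes` and `excludes` patterns could narrow down the members of an archive, here is a minimal sketch using the standard library's `fnmatch` module. The function `select_files` and the file names are invented for this example, and `fnmatch` semantics (where `*` also matches `/`) may differ from the glob matching that `mlcroissant` applies internally.

```python
from fnmatch import fnmatchcase


def select_files(paths, includes, excludes=()):
    """Keep paths that match any include pattern and no exclude pattern."""
    selected_paths = []
    for path in paths:
        included = any(fnmatchcase(path, pattern) for pattern in includes)
        excluded = any(fnmatchcase(path, pattern) for pattern in excludes)
        if included and not excluded:
            selected_paths.append(path)
    return selected_paths


# Hypothetical archive listing, for illustration only.
archive_members = [
    "images/a.png",
    "images/b.png",
    "images/thumbs/a.png",
    "README.md",
]

selected = select_files(
    archive_members,
    includes=["images/*.png"],
    excludes=["images/thumbs/*"],
)
print(selected)  # ['images/a.png', 'images/b.png']
```
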

In [None]:
# Utility function to compute the SHA-256 checksum of a remote file, streamed in chunks
def get_sha256(url):
    sha256 = hashlib.sha256()
    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # Fail early instead of hashing an error page
        for chunk in response.iter_content(chunk_size=65536):
            sha256.update(chunk)
    return sha256.hexdigest()
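
A consumer of the Croissant file can use the recorded `sha256` value to verify a download. The sketch below shows one way to do that for a local file, hashing it in chunks as `get_sha256` does for a URL. The helper names `sha256_of_file` and `matches_checksum` are hypothetical, and the temporary file stands in for a downloaded archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a local file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_sha256):
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of_file(path) == expected_sha256.lower()


# Demo with a temporary file standing in for a downloaded archive.
payload = b"example bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

expected = hashlib.sha256(payload).hexdigest()
checksum_ok = matches_checksum(tmp_path, expected)
os.remove(tmp_path)
print(checksum_ok)
```
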

In [None]:
file_url = 'https://maayanlab.cloud/static/hdfs/harmonizome/data/cm4aiu2os/gene_attribute_matrix.txt.zip'

distribution = [
    mlc.FileObject(
        id="dataset-attribute-table-archive",
        name="dataset-attribute-table-archive",
        description="Dataset attribute table archive from Harmonizome.",
        content_url=file_url,
        encoding_formats="application/zip",
        sha256=get_sha256(file_url)
    ),
    mlc.FileObject(
        id="dataset-attribute-table",
        name="dataset-attribute-table",
        description="Dataset attribute table from Harmonizome.",
        content_url="gene_attribute_matrix.txt",
        encoding_formats="text/tab-separated-values",
        contained_in=["dataset-attribute-table-archive"]
    )
]
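
The second FileObject uses `contained_in` so that its `content_url` is read as a path inside the archive rather than as a Web URL. The following sketch mimics that lookup with the standard library, using a tiny archive built in memory. The helper `read_member` and the column name `Example assembly` are made up for illustration; this is not how `mlcroissant` itself resolves contained files.

```python
import io
import zipfile


def read_member(archive_bytes, member_path):
    """Return the bytes of one file stored at member_path inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member_path)


# Build a tiny stand-in archive in memory (hypothetical contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "gene_attribute_matrix.txt",
        "Gene\tExample assembly\nAAAS\t0\n",
    )

# The relative path plays the role of content_url; the bytes play the role of the parent archive.
member = read_member(buffer.getvalue(), "gene_attribute_matrix.txt")
header_line = member.decode("utf-8").splitlines()[0]
print(header_line)
```
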

### Structure/Semantics
To load in our dataset using Croissant, we need to define at least one record set. Within the record set, we'll then define the fields which contain information for each record.

In the case of our attribute tables, each column in the file will be a field, and we'll define our index as the key(s) for the record set.


**RecordSet**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|field|Field|MANY|A data element that appears in the records of the RecordSet (e.g., one column of a table).|YES|
|key|Text|MANY|One or more fields whose values uniquely identify each record in the RecordSet.|NO|
|data|JSON|MANY|One or more records that constitute the data of the RecordSet.|NO|
|examples|JSON/URL|MANY|One or more records provided as example content of the RecordSet, or a reference to data source that contains examples.|NO|

**Field**
| Field | Type | Cardinality | Description | Required |
|-------|------|-------------|-------------|----------|
|source|DataSource/URL|ONE|The data source of the field. This will generally reference a FileObject or FileSet's contents (e.g., a specific column of a table).|YES|
|dataType|DataType|MANY|The data type of the field, identified by the URI of the corresponding class. It could be either an atomic type (e.g, sc:Integer) or a semantic type (e.g., sc:GeoLocation).|YES|
|repeated|Boolean|ONE|If true, then the Field is a list of values of type dataType.|NO|
|equivalentProperty|URL|MANY|A property that is equivalent to this Field. Used in the case a dataType is specified on the RecordSet to map specific fields to specific properties associated with that dataType.|NO|
|references|Reference|MANY|Another Field of another RecordSet that this field references. This is the equivalent of a foreign key reference in a relational database.|NO|
|subField|Field|MANY|Another Field that is nested inside this one.|NO|
|parentField|Reference|MANY|A special case of SubField that should be hidden because it references a Field that already appears in the RecordSet.|NO|

For this dataset, we're not making use of all the semantic features available (e.g. defining train/test splits), but we'll include data types and instructions to extract each column from the data matrix into its corresponding Field.
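
Before declaring a data type for each column, it can help to check what the columns actually contain. The sketch below infers a coarse type per column from tab-separated text using only the standard library. The function `infer_data_types`, the type labels it returns, and the sample table are assumptions for illustration; the notebook itself assigns `mlc.DataType.TEXT` and `mlc.DataType.INTEGER` by hand.

```python
import csv
import io


def _is_int(value):
    """Return True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def infer_data_types(tsv_text):
    """Map each column name to 'sc:Integer' if every value parses as int, else 'sc:Text'."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    inferred = {}
    for column in reader.fieldnames or []:
        values = [row[column] for row in rows]
        all_integers = bool(values) and all(_is_int(v) for v in values)
        inferred[column] = "sc:Integer" if all_integers else "sc:Text"
    return inferred


# Hypothetical two-column table in the same shape as the attribute matrix.
sample_tsv = "Gene\tExample assembly\nAAAS\t0\nAAK1\t1\n"
inferred_types = infer_data_types(sample_tsv)
print(inferred_types)
```
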

In [None]:
matrix = pd.read_csv(file_url, sep='\t', compression='zip')
display(matrix)

Unnamed: 0,Gene,9-1-1-RAD17-RFC complex,AP-1 transcription factor complex,Actin cytoskeleton remodeling and signal transduction 1,Actin cytoskeleton remodeling and signal transduction 2,Actin filaments,Aminoacyl-tRNA synthetase multienzyme assembly,Amyloid precursor protein complex 1,Amyloid precursor protein complex 2,Anaphase-promoting and cell cycle kinase complexes,...,mRNA Surveillance and Repair Pathway,mRNA cleavage and polyadenylation specificity factor complex,mRNA regulation and decay,mRNA regulation complex,mRNA surveillance and decay pathway,proton-transporting V-type ATPase complex,snRNP assembly,tRNA protein synthesis complex,tRNA splicing endonuclease and telomere cap assembly,tRNA-splicing ligase complex
0,AAAS,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,AAGAB,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,AAK1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,AAMDC,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,AAMP,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4830,ZW10,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4831,ZWINT,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4832,ZYG11B,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4833,ZYX,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
fields = []
array_size = matrix.shape[0]

fields.append(
  mlc.Field(
    id="associations/gene",
    name="gene",
    description="The NCBI gene symbol",
    data_types=mlc.DataType.TEXT,
    is_array=True,
    array_shape=str(array_size),
    source=mlc.Source(
      file_object="dataset-attribute-table",
      extract=mlc.Extract(column="Gene")
    )
  )
)

for col in matrix.columns[1:]:
  fields.append(
    mlc.Field(
      id=f"associations/{col.replace(' ','_')}",
      name=f"associations/{col}",
      data_types=mlc.DataType.INTEGER,
      is_array=True,
      array_shape=str(array_size),
      source=mlc.Source(
        file_object="dataset-attribute-table",
        extract=mlc.Extract(column=col)
      )
    )
  )

print(len(fields))

272


In [None]:
record_sets = [
    mlc.RecordSet(
        id="associations",
        name="associations",
        key='associations/gene',
        fields = fields
    )
]

### Assemble Dataset and Write JSON
Once we have defined each dataset layer, we can package the dataset metadata into a Croissant JSON-LD file.

In [None]:
metadata = mlc.Metadata(
    name = name,
    description = description,
    cite_as = cite_as,
    url = url,
    date_published=date_published,
    creators = creators,
    publisher = publishers,
    license = license,
    version = version,
    distribution = distribution,
    record_sets = record_sets
)

# Display any warnings/suggestions encountered by the validator when building the metadata
print(metadata.issues.report())




In [None]:
import os
import json

# Create the output directory if it does not already exist
os.makedirs('CFDE', exist_ok=True)
with open("CFDE/cm4aiu2os.json", "w") as f:
  content = metadata.to_json()
  content = json.dumps(content, indent=4, default=str)
  f.write(content)
  f.write("\n")  # Terminate file with newline
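
The `default=str` argument matters when the dict being serialized holds values that the `json` module cannot encode, such as date objects. The sketch below demonstrates that behaviour on a small stand-in dict. Whether `metadata.to_json()` returns date objects in a given `mlcroissant` version is an assumption here, not something this example verifies.

```python
import datetime
import json

# Stand-in for serialized metadata containing a date object.
record = {"name": "example", "datePublished": datetime.date(2025, 9, 29)}

# Without a fallback encoder, json.dumps raises TypeError on the date value.
try:
    json.dumps(record)
    failed_without_default = False
except TypeError:
    failed_without_default = True

# With default=str, unsupported values are converted with str() before encoding.
serialized = json.dumps(record, indent=4, default=str)
print(failed_without_default)
print(serialized)
```
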

## Load Dataset from Croissant JSON

In [None]:
dataset = mlc.Dataset(jsonld=f"CFDE/cm4aiu2os.json")
dataset.metadata.to_json().keys()

dict_keys(['@context', '@type', 'name', 'description', 'conformsTo', 'citeAs', 'creator', 'datePublished', 'license', 'publisher', 'url', 'version', 'distribution', 'recordSet'])

Although the mlcroissant package can iterate over the records of a RecordSet, doing so is slow for datasets with many records or fields. Here we instead use the distribution and structure/semantics defined earlier to load the dataset with zipfile and pandas.

In [None]:
os.makedirs('/tmp/croissant', exist_ok=True)

In [None]:
dataset = mlc.Dataset(jsonld=f"CFDE/cm4aiu2os.json")
meta = dataset.metadata.to_json()
content_url = meta['distribution'][0]['contentUrl']
file_name = meta['distribution'][1]['contentUrl']
extract = list(map(lambda x: x['source']['extract']['column'], meta['recordSet'][0]['field']))
fields = list(map(lambda x: x['@id'], meta['recordSet'][0]['field']))
key = meta['recordSet'][0]['key']['@id']

response = requests.get(content_url)
response.raise_for_status()  # Stop if the archive could not be downloaded
with open(f'/tmp/croissant/{file_name}.zip', 'wb') as f:
  f.write(response.content)

with ZipFile(f'/tmp/croissant/{file_name}.zip', 'r') as z:
  z.extractall('/tmp/croissant/')

cm4aiu2os = pd.read_csv(f'/tmp/croissant/{file_name}', sep='\t')
cm4aiu2os = cm4aiu2os[extract]
cm4aiu2os.columns = fields
cm4aiu2os = cm4aiu2os.set_index(key)
display(cm4aiu2os)

print(dataset.metadata.description)
print(dataset.metadata.cite_as)
print(dataset.metadata.date_published)

associations/gene,associations/9-1-1-RAD17-RFC_complex,associations/AP-1_transcription_factor_complex,associations/Actin_cytoskeleton_remodeling_and_signal_transduction_1,associations/Actin_cytoskeleton_remodeling_and_signal_transduction_2,associations/Actin_filaments,associations/Aminoacyl-tRNA_synthetase_multienzyme_assembly,associations/Amyloid_precursor_protein_complex_1,associations/Amyloid_precursor_protein_complex_2,associations/Anaphase-promoting_and_cell_cycle_kinase_complexes,associations/Arp2/3_protein_complex,...,associations/mRNA_Surveillance_and_Repair_Pathway,associations/mRNA_cleavage_and_polyadenylation_specificity_factor_complex,associations/mRNA_regulation_and_decay,associations/mRNA_regulation_complex,associations/mRNA_surveillance_and_decay_pathway,associations/proton-transporting_V-type_ATPase_complex,associations/snRNP_assembly,associations/tRNA_protein_synthesis_complex,associations/tRNA_splicing_endonuclease_and_telomere_cap_assembly,associations/tRNA-splicing_ligase_complex
AAAS,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAGAB,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAK1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAMDC,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
AAMP,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZW10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ZWINT,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ZYG11B,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ZYX,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Protein localization assemblies constructed from integrating AP-MS biomolecular interaction and IF imaging data
@article{Schaffer, Nature, 2025, title={Multimodal cell maps as a foundation for structural and functional genomics}, author={Schaffer, LV}, journal={Nature}, year={2025}, volume={642}, pages={222–231}}
2025-09-29 00:00:00
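
The loading cell above pulls the source column names, the field identifiers, and the key out of the serialized metadata. The sketch below isolates that step on a toy dict with the same nesting (`recordSet` → `field` → `source` → `extract` → `column`). The function name `column_mapping` and the toy values are hypothetical.

```python
def column_mapping(meta, record_set_index=0):
    """Return ({source column: field @id}, key @id) for one record set."""
    record_set = meta["recordSet"][record_set_index]
    mapping = {
        field["source"]["extract"]["column"]: field["@id"]
        for field in record_set["field"]
    }
    key = record_set["key"]["@id"]
    return mapping, key


# Toy metadata mirroring the structure read by the loading cell (hypothetical values).
toy_meta = {
    "recordSet": [
        {
            "key": {"@id": "associations/gene"},
            "field": [
                {
                    "@id": "associations/gene",
                    "source": {"extract": {"column": "Gene"}},
                },
                {
                    "@id": "associations/Example_assembly",
                    "source": {"extract": {"column": "Example assembly"}},
                },
            ],
        }
    ]
}

mapping, key = column_mapping(toy_meta)
print(mapping)
print(key)
```

With pandas, such a mapping could then be applied as `df[list(mapping)].rename(columns=mapping).set_index(key)`, which selects, renames, and indexes the columns in one chain.
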
