# Tutorial for `mlcroissant` 🥐

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/python/mlcroissant) Python library lets developers work with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

In [1]:
# Install mlcroissant from source (system packages shown for Debian/Ubuntu):
# !apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [2]:
import mlcroissant as mlc
import hashlib
import json

In [3]:
# DEMAND
test_demand = "gym4real/data/wds/demand/datasets/test1000.csv"

with open(test_demand, "rb") as f:
    sha256_hash_test_demand = hashlib.sha256(f.read()).hexdigest()
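
The cell above reads the whole CSV into memory before hashing it. For larger resources, a chunked variant gives the same digest with bounded memory use. The following is a minimal sketch; the helper name `sha256_of_file` and the 1 MiB default chunk size are illustrative choices, not part of `mlcroissant`.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

With this helper, the hash used later in the notebook could be computed as `sha256_of_file(test_demand)`.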

In [36]:
# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="github-repository",
        name="github-repository",
        description="Gym4ReaL repository on GitHub.",
        content_url="https://github.com/Daveonwave/gym4ReaL",
        encoding_formats=["git+https"],
        sha256="main",
    ),
    mlc.FileObject(
        id="demand_test",
        name="demand_test",
        description="Demand patterns (test set)",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + test_demand,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_test_demand,
    ),
    mlc.FileSet(
        id="demand_train",
        name="demand_train",
        description="Train demand patterns files",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/wds/demand/datasets/interval_*.csv",
    ),
]
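
The `includes` value of the `FileSet` is a glob pattern evaluated against the files of the containing resource. To get a rough feel for which paths such a pattern selects, it can be tried out with the standard library. This is only a sketch: the candidate paths below are invented for illustration, and `fnmatchcase` is assumed to approximate the matching that `mlcroissant` performs, which may differ in details.

```python
from fnmatch import fnmatchcase

# The `includes` pattern from the FileSet above.
INCLUDES = "gym4real/data/wds/demand/datasets/interval_*.csv"

# Hypothetical repository paths, made up for this example.
candidate_paths = [
    "gym4real/data/wds/demand/datasets/interval_1.csv",
    "gym4real/data/wds/demand/datasets/interval_12.csv",
    "gym4real/data/wds/demand/datasets/test1000.csv",
    "gym4real/data/wds/demand/README.md",
]

# Keep only the paths that the glob pattern selects.
matched = [path for path in candidate_paths if fnmatchcase(path, INCLUDES)]
print(matched)
```

Only the two `interval_*.csv` paths are kept; the test file is excluded, which is why it is declared as a separate `FileObject`.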


In [37]:
# --- DEMAND TRAIN ---
# One Field per CSV column; columns are addressed by the names "0" to "53".
train_demand_fields = [
    mlc.Field(
        id=f"train_{i}",
        name=str(i),
        description="Demand pattern of a week within a year (train set)",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
            file_set="demand_train",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column=str(i)),
        ),
    )
    for i in range(54)
]

# --- DEMAND TEST ---
# One Field per CSV column; columns are addressed by the names "0" to "999".
test_demand_fields = [
    mlc.Field(
        id=f"test_{i}",
        name=str(i),
        description="Demand pattern of a week within a year (test set)",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
            file_object="demand_test",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column=str(i)),
        ),
    )
    for i in range(1000)
]
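
Each field is extracted by column name, so the CSV headers must actually contain the names `"0"`, `"1"`, and so on. A quick header check before building the fields can catch a mismatch early. The sketch below uses only the standard library; the helper `missing_columns` and the tiny sample CSV are made up for illustration and do not reflect the real data files.

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = set(header)
    return [name for name in expected if name not in present]


# Made-up CSV content with columns "0" to "3".
sample = "0,1,2,3\n0.1,0.2,0.3,0.4\n"
expected = [str(i) for i in range(4)]

print(missing_columns(sample, expected))    # all expected columns are present
print(missing_columns(sample, ["0", "4"]))  # column "4" is not in the header
```

For a real file, the text could be obtained with `open(path, newline="").read()`, or the helper could be adapted to read only the first line.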

In [38]:
record_sets = [
    mlc.RecordSet(
        id="demand_pattern_train",
        name="demand_pattern_train",
        # Each record has one or many fields...
        fields=train_demand_fields,
    ),
    mlc.RecordSet(
        id="demand_pattern_test",
        name="demand_pattern_test",
        # Each record has one or many fields...
        fields=test_demand_fields,
    ),
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="wds_data",
    # Descriptions can contain plain text or markdown.
    description=(
        "Water Distribution System 'Anutown' demand pattern data (train and test)."
    ),
    # cite_as=(
    #     "@article{brown2020language, title={Language Models are Few-Shot"
    # ),
    url="https://github.com/Daveonwave/gym4ReaL",
    license="https://creativecommons.org/licenses/by/4.0/",
    distribution=distribution,
    record_sets=record_sets,
)

In [39]:
print(metadata.issues.report())

  -  [Metadata(wds_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/version" is recommended, but does not exist.
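
These three entries are recommendations rather than errors, so the metadata is still usable. If the report is checked in a script, the missing property names can be pulled out of the text. The sketch below assumes the report keeps the exact wording shown above, which may change between library versions; the `report` string is copied from that output.

```python
import re

# Report text copied from the output above.
report = '''
  -  [Metadata(wds_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/version" is recommended, but does not exist.
'''

# Capture the property URL between the double quotes.
PATTERN = re.compile(r'Property "([^"]+)" is recommended, but does not exist\.')

# Keep only the last path segment of each URL, i.e. the property name.
missing = [url.rsplit("/", 1)[-1] for url in PATTERN.findall(report)]
print(missing)  # ['citeAs', 'datePublished', 'version']
```

The `citeAs` entry corresponds to the `cite_as` argument that is commented out in the `mlc.Metadata` call; for the other two properties, consult the library documentation for the matching arguments.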


In [40]:
with open("metadata_WDS.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re

In [41]:
dataset = mlc.Dataset(jsonld="metadata_WDS.json")

  -  [Metadata(wds_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(wds_data)] Property "https://schema.org/version" is recommended, but does not exist.


With the dataset loaded from the JSON-LD file, yield records from one of its record sets:

In [42]:
records = dataset.records(record_set="demand_pattern_train")

# Print the first 12 records (i runs from 0 to 11 before the loop breaks).
for i, record in enumerate(records):
    print(record)
    if i > 10:
        break

{'train_0': 0.0263059516896281, 'train_1': 0.0507741118009319, 'train_2': 0.0264841291985131, 'train_3': 0.0242795299314405, 'train_4': 0.0261226765121758, 'train_5': 0.01671504745887, 'train_6': 0.0082237633889451, 'train_7': 0.0104343913386532, 'train_8': 0.0309155451515997, 'train_9': 0.0239738072883243, 'train_10': 0.0169113270651604, 'train_11': -0.0007399897782609, 'train_12': 0.0239447734971827, 'train_13': 0.0370419357907571, 'train_14': 0.0228664729933332, 'train_15': 0.0163147568791094, 'train_16': 0.0178869772234797, 'train_17': 0.0098753279829056, 'train_18': 0.0343250568171407, 'train_19': 0.0186230309622721, 'train_20': 0.0329460699050128, 'train_21': 0.034002015021041, 'train_22': 0.0403209054081022, 'train_23': 0.0219601085583627, 'train_24': 0.0181399773035339, 'train_25': 0.030884085415979, 'train_26': 0.0209436523094625, 'train_27': 0.0185719894297001, 'train_28': 0.0179164983875613, 'train_29': 0.0315279472381174, 'train_30': 0.0316867317247521, 'train_31': 0.021380
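
The `enumerate`/`break` loop above stops after 12 records, because the break only fires once `i` reaches 11. An alternative that states the limit directly is `itertools.islice`. This is a minimal sketch; the helper name `head` is an illustrative choice and works with any iterable, not only with Croissant records.

```python
from itertools import islice


def head(iterable, n=10):
    """Return the first n items of an iterable as a list, reading no further than needed."""
    return list(islice(iterable, n))


# Works on any iterable, for example a generator of squares.
squares = (i * i for i in range(1_000_000))
print(head(squares, 5))  # [0, 1, 4, 9, 16]
```

Applied to this notebook, `head(dataset.records(record_set="demand_pattern_train"), 12)` would collect the same 12 records that the loop prints.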

In [43]:
import pandas as pd

pd.DataFrame.from_records(list(records))

Unnamed: 0,train_0,train_1,train_2,train_3,train_4,train_5,train_6,train_7,train_8,train_9,...,train_44,train_45,train_46,train_47,train_48,train_49,train_50,train_51,train_52,train_53
0,0.026306,0.050774,0.026484,0.024280,0.026123,0.016715,0.008224,0.010434,0.030916,0.023974,...,0.031533,0.039852,0.020140,0.020462,0.006265,0.045847,0.026999,-0.004558,0.036097,0.022512
1,0.016751,0.010319,0.001768,0.012497,0.002994,0.010508,0.011454,-0.004152,-0.007820,0.019974,...,0.029186,0.014953,0.024390,0.006842,0.023565,0.007037,-0.001514,0.020469,0.012829,0.016945
2,0.004875,0.013385,0.019027,0.024996,0.027797,0.011457,0.010905,-0.002278,0.028342,0.005058,...,0.008298,0.030901,0.003734,0.027949,0.026256,0.006755,0.026708,0.000995,0.026634,0.002821
3,0.021159,0.011407,0.005788,0.026918,0.010900,0.031438,0.026038,0.010626,0.028941,0.013072,...,0.001920,0.013456,0.022301,0.008879,0.029388,0.001208,0.023917,0.007627,0.035184,0.012492
4,0.031284,0.035913,0.010201,0.014955,0.020589,-0.001218,0.016108,0.021996,0.030372,0.036109,...,0.021662,0.011421,0.021159,0.029968,0.036582,0.043277,0.034045,0.023711,0.039454,0.034689
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5539,0.642907,0.610875,0.713977,0.568482,0.547551,0.599023,0.545045,0.659483,0.664110,0.618589,...,0.620095,0.711989,0.601223,0.680003,0.627978,0.542115,0.629203,0.670349,0.675510,0.657331
5540,0.512012,0.564214,0.556908,0.505588,0.591297,0.565704,0.554468,0.649790,0.560428,0.546584,...,0.526727,0.603055,0.591176,0.541517,0.570879,0.562715,0.620995,0.627673,0.600452,0.588287
5541,0.555436,0.517716,0.515272,0.518963,0.559871,0.501079,0.574604,0.596186,0.565110,0.553670,...,0.452269,0.527857,0.518052,0.633685,0.654373,0.489990,0.528807,0.599134,0.516563,0.527317
5542,0.502079,0.548542,0.504645,0.468332,0.385093,0.499699,0.462575,0.502673,0.481098,0.506005,...,0.405953,0.441832,0.421683,0.476084,0.469088,0.427968,0.500619,0.432514,0.457281,0.471290
