# Tutorial for `mlcroissant` 🥐

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/python/mlcroissant) Python library empowers developers to interact with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

In [1]:
# # Install mlcroissant from the source
# !apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [2]:
import mlcroissant as mlc
import hashlib
import json

In [3]:
# DEMAND
train_demand = "gym4real/data/microgrid/demand/profiles_train.csv"
test_demand = "gym4real/data/microgrid/demand/profiles_test.csv"

with open(train_demand, "rb") as f:
    sha256_hash_train_demand = hashlib.sha256(f.read()).hexdigest()

with open(test_demand, "rb") as f:
    sha256_hash_test_demand = hashlib.sha256(f.read()).hexdigest()
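
Reading each file fully into memory is fine for small CSVs. For larger resources, a chunked variant keeps memory use flat. The following is a minimal sketch using only the standard library; the helper name `sha256_of_file` and the 1 MiB default chunk size are illustrative choices, not part of `mlcroissant`.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

It could stand in for the two `with` blocks above, for example `sha256_hash_train_demand = sha256_of_file(train_demand)`.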


In [4]:
# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="github-repository",
        name="github-repository",
        description="Gym4ReaL repository on GitHub.",
        content_url="https://github.com/Daveonwave/gym4ReaL",
        encoding_formats=["git+https"],
        sha256="main",
    ),
    mlc.FileObject(
        id="energy_demands_train",
        name="energy_demands_train",
        description="Energy consumption of different households (train set)",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + train_demand,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_train_demand,
    ),
    mlc.FileObject(
        id="energy_demands_test",
        name="energy_demands_test",
        description="Energy consumption of different households (test set)",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + test_demand,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_test_demand,
    ),
    mlc.FileSet(
        id="energy_generation",
        name="energy_generation",
        description="Train and test sets of PV energy generation",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/generation/*.csv",
    ),
    mlc.FileSet(
        id="energy_prices",
        name="energy_prices",
        description="Train and test sets of energy market prices",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/market/*.csv",
    ),
    mlc.FileSet(
        id="ambient_temperature",
        name="ambient_temperature",
        description="Train and test sets of ambient temperature",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/temp_amb/*.csv",
    ),
]
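
Each `FileSet` above selects files inside `github-repository` through a glob in `includes`. To get a feel for which paths such a pattern would pick up, here is a rough sketch with the standard library's `fnmatch`. The file listing is hypothetical (the real repository contents may differ), and `mlcroissant`'s own matching rules may not be identical to `fnmatch`; in particular, `*` in `fnmatch` also matches across `/`.

```python
from fnmatch import fnmatchcase

# Hypothetical listing of repository paths, for illustration only.
repo_files = [
    "gym4real/data/microgrid/generation/pv_train.csv",
    "gym4real/data/microgrid/generation/pv_test.csv",
    "gym4real/data/microgrid/generation/README.md",
    "gym4real/data/microgrid/market/prices_train.csv",
    "gym4real/data/microgrid/temp_amb/temp_amb_train.csv",
]

# The `includes` patterns used by the FileSets in the distribution above.
patterns = {
    "energy_generation": "gym4real/data/microgrid/generation/*.csv",
    "energy_prices": "gym4real/data/microgrid/market/*.csv",
    "ambient_temperature": "gym4real/data/microgrid/temp_amb/*.csv",
}


def select(pattern, paths):
    """Return the paths that match a glob pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


matched = {name: select(pattern, repo_files) for name, pattern in patterns.items()}
for name, paths in matched.items():
    print(name, paths)
```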


In [33]:
# --- DEMAND TRAIN ---
train_demand_fields = [
    mlc.Field(
        id="delta_time_demand_train",
        name="delta_time",
        description="Time corresponding to the energy consumption in seconds",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_object="energy_demands_train",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time")))]

for i in range(370):
    train_demand_fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"Energy consumption of household {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                        file_object="energy_demands_train",
                        # file_object='github-repository',
                        # Extract the field from the column of a FileObject/FileSet:
                        extract=mlc.Extract(column=str(i)))))

# --- DEMAND TEST ---
test_demand_fields = [
    mlc.Field(
        id="delta_time_demand_test",
        name="delta_time",
        description="Time corresponding to the energy consumption in seconds",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_object="energy_demands_test",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time")))]

for i in range(370, 398):
    test_demand_fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"Energy consumption of household {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                        file_object="energy_demands_test",
                        # file_object='github-repository',
                        # Extract the field from the column of a FileObject/FileSet:
                        extract=mlc.Extract(column=str(i)))))

# --- GENERATION ---
generation_fields = [
    mlc.Field(
        id="delta_time_generation",
        name="delta_time",
        description="Time corresponding to the hourly energy generation",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_generation",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"))),
    mlc.Field(
        id="pv",
        name="pv",
        description="Energy generation",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_generation",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="PV"))),
    mlc.Field(
        id="timestamp_generation",
        name="timestamp",
        description="Timestamp",
        data_types=mlc.DataType.DATE,
        source=mlc.Source(
                    file_set="energy_generation",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="timestamp")))]

# --- MARKET ---
market_fields = [
    mlc.Field(
        id="delta_time_market",
        name="delta_time",
        description="Time corresponding to the hourly energy price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"),
                ),
    ),
    mlc.Field(
        id="ask",
        name="ask",
        description="Energy cost price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="ask"),
                ),
    ),
    mlc.Field(
        id="bid",
        name="bid",
        description="Energy sell price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="bid"),
                ),
    ),
    mlc.Field(
        id="timestamp_market",
        name="timestamp",
        description="Timestamp",
        data_types=mlc.DataType.DATE,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="timestamp"),
                ),
    ),
]



# --- AMBIENT TEMPERATURE ---
temp_amb_fields = [
    mlc.Field(
        id="delta_time",
        name="delta_time",
        description="Time corresponding to the daily ambient temperature",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="ambient_temperature",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"),
                ),
    ),
    mlc.Field(
        id="temp",
        name="temp_amb",
        description="Ambient Temperature [K]",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="ambient_temperature",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="temp_amb"),
                ),
    ),
    mlc.Field(
        id="timestamp_temp_amb",
        name="timestamp",
        description="Timestamp",
        data_types=mlc.DataType.DATE,
        source=mlc.Source(
                    file_set="ambient_temperature",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="timestamp"),
                ),
    ),
]
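
In the field definitions above, the `delta_time` column shared by several tables gets a different `id` in most RecordSets, since identifiers act as keys and should not collide within one dataset. A quick standard-library check along the following lines can catch accidental duplicates before the metadata is built. The helper `duplicate_ids` is hypothetical, and the id lists only mirror a subset of the fields defined above.

```python
from collections import Counter


def duplicate_ids(*id_groups):
    """Return the sorted ids that appear more than once across all groups."""
    counts = Counter(i for group in id_groups for i in group)
    return sorted(i for i, n in counts.items() if n > 1)


# Ids mirroring the demand and ambient-temperature fields defined above.
train_ids = ["delta_time_demand_train"] + [str(i) for i in range(370)]
test_ids = ["delta_time_demand_test"] + [str(i) for i in range(370, 398)]
temp_amb_ids = ["delta_time", "temp", "timestamp_temp_amb"]

print(duplicate_ids(train_ids, test_ids, temp_amb_ids))  # prints []
```

Had both demand tables reused `delta_time` as the id, the check would report it.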

In [34]:
record_sets = [
    mlc.RecordSet(
        id="energy_consumption_table_train",
        name="energy_consumption_table_train",
        # Each record has one or many fields...
        fields=train_demand_fields,
    ),
    mlc.RecordSet(
        id="energy_consumption_table_test",
        name="energy_consumption_table_test",
        # Each record has one or many fields...
        fields=test_demand_fields,
    ),
    # RecordSets contain the records of the dataset.
    mlc.RecordSet(
        id="energy_generation_tables",
        name="energy_generation_tables",
        # Each record has one or many fields...
        fields=generation_fields,
    ),
    # RecordSets contain the records of the dataset.
    mlc.RecordSet(
        id="energy_market_tables",
        name="energy_market_tables",
        # Each record has one or many fields...
        fields=market_fields,
    ),
    # RecordSets contain the records of the dataset.
    mlc.RecordSet(
        id="temp_amb_tables",
        name="temp_amb_tables",
        # Each record has one or many fields...
        fields=temp_amb_fields,
    ),
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="microgrid_data",
    # Descriptions can contain plain text or markdown.
    description=(
        "Italian energy data of consumption, PV generation, market and ambient temperature."
    ),
    # cite_as=(
    #     "@article{brown2020language, title={Language Models are Few-Shot"
    # ),
    url="https://github.com/Daveonwave/gym4ReaL",
    license="https://creativecommons.org/licenses/by/4.0/",
    distribution=distribution,
    record_sets=record_sets,
)

In [35]:
print(metadata.issues.report())

  -  [Metadata(microgrid_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/version" is recommended, but does not exist.


In [36]:
with open("metadata_MG.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re
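
The serialized file is plain JSON, so it can be inspected without `mlcroissant`. The sketch below assumes that resources sit under a top-level `distribution` key and RecordSets under `recordSet`, each with a `field` list, as the `@context` above suggests. The sample document is hand-written and far smaller than the real output, and `summarize` is an illustrative helper.

```python
import json


def summarize(croissant):
    """Return (resource names, {record set name: number of fields})."""
    resources = [r["name"] for r in croissant.get("distribution", [])]
    record_sets = {
        rs["name"]: len(rs.get("field", []))
        for rs in croissant.get("recordSet", [])
    }
    return resources, record_sets


# Hand-written miniature of a Croissant file; the structure is an assumption.
sample = json.loads("""
{
  "name": "microgrid_data",
  "distribution": [
    {"@type": "cr:FileObject", "name": "energy_demands_train"},
    {"@type": "cr:FileSet", "name": "ambient_temperature"}
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "temp_amb_tables",
      "field": [
        {"@type": "cr:Field", "name": "delta_time"},
        {"@type": "cr:Field", "name": "temp_amb"},
        {"@type": "cr:Field", "name": "timestamp"}
      ]
    }
  ]
}
""")

resources, record_sets = summarize(sample)
print(resources)
print(record_sets)
```

Against the file written above, the same helper could be called on the result of `json.load` over `metadata_MG.json`.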

In [37]:
dataset = mlc.Dataset(jsonld="metadata_MG.json")

  -  [Metadata(microgrid_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/version" is recommended, but does not exist.


Once the dataset is loaded, you can yield records from one of its RecordSets:

In [38]:
records = dataset.records(record_set="temp_amb_tables")

for i, record in enumerate(records):
    print(record)
    if i > 10:
        break

{'delta_time': 0.0, 'temp': 273.894, 'timestamp_temp_amb': Timestamp('2015-01-01 00:00:00')}
{'delta_time': 86400.0, 'temp': 275.95099999999996, 'timestamp_temp_amb': Timestamp('2015-01-02 00:00:00')}
{'delta_time': 172800.0, 'temp': 278.027, 'timestamp_temp_amb': Timestamp('2015-01-03 00:00:00')}
{'delta_time': 259200.0, 'temp': 279.32099999999997, 'timestamp_temp_amb': Timestamp('2015-01-04 00:00:00')}
{'delta_time': 345600.0, 'temp': 278.34799999999996, 'timestamp_temp_amb': Timestamp('2015-01-05 00:00:00')}
{'delta_time': 432000.0, 'temp': 277.998, 'timestamp_temp_amb': Timestamp('2015-01-06 00:00:00')}
{'delta_time': 518400.0, 'temp': 278.10499999999996, 'timestamp_temp_amb': Timestamp('2015-01-07 00:00:00')}
{'delta_time': 604800.0, 'temp': 278.62899999999996, 'timestamp_temp_amb': Timestamp('2015-01-08 00:00:00')}
{'delta_time': 691200.0, 'temp': 279.142, 'timestamp_temp_amb': Timestamp('2015-01-09 00:00:00')}
{'delta_time': 777600.0, 'temp': 280.82399999999996, 'timestamp_temp_

In [39]:
import pandas as pd

pd.DataFrame.from_records(list(records))

      delta_time     temp  timestamp_temp_amb
0            0.0  273.894          2015-01-01
1        86400.0  275.951          2015-01-02
2       172800.0  278.027          2015-01-03
3       259200.0  279.321          2015-01-04
4       345600.0  278.348          2015-01-05
...          ...      ...                 ...
1821  31104000.0  280.005          2019-12-27
1822  31190400.0  278.614          2019-12-28
1823  31276800.0  277.226          2019-12-29
1824  31363200.0  276.363          2019-12-30
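
The records above suggest that `delta_time` is expressed in seconds (it grows by 86400 per daily row) and that `temp` is in kelvin, as the field description states. Under those assumptions, a record can be converted to friendlier units with the standard library alone. `kelvin_to_celsius` and `describe` are illustrative helpers, not part of `mlcroissant`.

```python
from datetime import datetime, timedelta

KELVIN_OFFSET = 273.15


def kelvin_to_celsius(kelvin):
    """Convert a temperature from kelvin to degrees Celsius."""
    return kelvin - KELVIN_OFFSET


def describe(record):
    """Return elapsed time and Celsius temperature for one record."""
    return {
        # Assumes delta_time counts seconds from the start of the series.
        "elapsed": timedelta(seconds=record["delta_time"]),
        "temp_celsius": round(kelvin_to_celsius(record["temp"]), 3),
        "timestamp": record["timestamp_temp_amb"],
    }


# Values taken from the second row above; a plain datetime stands in
# for the pandas Timestamp.
record = {
    "delta_time": 86400.0,
    "temp": 275.951,
    "timestamp_temp_amb": datetime(2015, 1, 2),
}

converted = describe(record)
print(converted["elapsed"], converted["temp_celsius"])
```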
