# Tutorial for `mlcroissant` 🥐

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/python/mlcroissant) Python library empowers developers to interact with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

In [170]:
# # Install mlcroissant from the source
# !apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [171]:
import mlcroissant as mlc
import hashlib
import json

In [172]:
file = "gym4real/data/microgrid/demand/consumption_profiles.csv"

with open(file, "rb") as f:
    sha256_hash = hashlib.sha256(f.read()).hexdigest()

# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="energy_demands",
        name="energy_demands",
        description="energy consumptions of different households",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + file,
        encoding_formats=["text/csv"],
        sha256=sha256_hash,
    ),
]

fields = [
    mlc.Field(
        id="delta_times",
        name="delta_times",
        description="Time corresponding to the energy consumption in seconds",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_object="energy_demands",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_times"),
                ),
    )
]

for i in range(398):
    fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"The data regarding {i}-th house",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                        file_object="energy_demands",
                        # file_object='github-repository',
                        # Extract the field from the column of a FileObject/FileSet:
                        extract=mlc.Extract(column=str(i)),
                    ),
        )
    )

record_sets = [
    # RecordSets contain the records of the dataset.
    mlc.RecordSet(
        id="energy_consumption_table",
        name="energy_consumption_table",
        # Each record has one or many fields...
        fields=fields,
    )
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="energy_consumption_profiles",
    # Descriptions can contain plain text or markdown.
    description=(
        "Energy consumption profiles of Italian households"
    ),
    # cite_as=(
    #     "@article{brown2020language, title={Language Models are Few-Shot"
    # ),
    url="https://github.com/Daveonwave/gym4ReaL",
    license="https://creativecommons.org/licenses/by/4.0/",
    distribution=distribution,
    record_sets=record_sets,
)
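
The cell above reads the whole CSV into memory before hashing it. For larger files, reading in chunks yields the same digest with bounded memory. A minimal sketch, where `sha256_of_file` is a hypothetical helper name and not part of `mlcroissant`:

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be passed as the `sha256` argument of `mlc.FileObject`, exactly like `sha256_hash` above.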

When `Metadata` is created, `mlcroissant`:
- checks the configuration for errors;
- generates warnings if the configuration doesn't follow guidelines and best practices.

For instance, in this case:

In [173]:
print(metadata.issues.report())

  -  [Metadata(energy_consumption_profiles)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(energy_consumption_profiles)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(energy_consumption_profiles)] Property "https://schema.org/version" is recommended, but does not exist.


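`Property "https://schema.org/version" is recommended`...

The report shows at a glance which recommended metadata is still missing: the citation (`citeAs`), the publication date (`datePublished`), and the version. The license, an important piece of metadata for building datasets for responsible AI, is not flagged because it was set when creating `Metadata`.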

## Build the Croissant file and yield data

Let's write the Croissant JSON-LD to a file on disk!

In [174]:
with open("gym4real/data/microgrid/demand/metadata_consumption_profiles.json", "w") as f:
  content = metadata.to_json()
  content = json.dumps(content, indent=2)
  print(content)
  f.write(content)
  f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re

From this JSON-LD file, we can easily create a dataset...

In [175]:
dataset = mlc.Dataset(jsonld="gym4real/data/microgrid/demand/metadata_consumption_profiles.json")

  -  [Metadata(energy_consumption_profiles)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(energy_consumption_profiles)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(energy_consumption_profiles)] Property "https://schema.org/version" is recommended, but does not exist.


...and yield records from this dataset:

In [176]:
records = dataset.records(record_set="energy_consumption_table")

for i, record in enumerate(records):
  print(record)
  if i > 10:
    break

{'delta_times': 0.0, '0': 96.9999999999999, '1': 289.0, '2': 142.0, '3': 294.0, '4': 1743.0, '5': 226.0, '6': 46.0, '7': 136.9999999999999, '8': 136.9999999999999, '9': 70.0, '10': 114.0, '11': 88.0, '12': 231.0, '13': 333.9999999999999, '14': 46.0, '15': 200.0, '16': 86.0, '17': 421.0, '18': 83.0, '19': 190.0, '20': 346.0, '21': 223.0, '22': 482.0, '23': 52.0, '24': 321.0, '25': 421.0, '26': 115.0, '27': 800.9999999999999, '28': 417.0, '29': 87.0, '30': 253.0, '31': 296.0, '32': 540.0, '33': 238.0, '34': 782.0, '35': 468.0, '36': 385.0, '37': 424.0, '38': 111.0, '39': 789.0, '40': 457.0, '41': 102.0, '42': 321.0, '43': 869.0000000000001, '44': 662.0, '45': 63.0, '46': 289.0, '47': 85.0, '48': 547.9999999999999, '49': 176.0, '50': 292.0, '51': 128.0, '52': 35.0, '53': 218.0, '54': 253.0, '55': 1216.0, '56': 203.0, '57': 166.9999999999999, '58': 853.0, '59': 192.0, '60': 136.9999999999999, '61': 27.0, '62': 113.0, '63': 93.0, '64': 241.0, '65': 669.0, '66': 74.0, '67': 237.0, '68': 120.
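
The loop above prints 12 records, because the `i > 10` check runs after `print`. When an exact count matters, `itertools.islice` is a common alternative. A small sketch, with `first_n` as a hypothetical helper and a plain generator standing in for `dataset.records(...)`:

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


# Stand-in generator; with mlcroissant this would be `dataset.records(...)`.
fake_records = ({"row": i} for i in range(100))
for record in first_n(fake_records, 3):
    print(record)
```

Because `islice` stops pulling from the source once `n` items are read, the rest of the dataset is never materialized.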

In [177]:
# DEMAND
train_demand = "gym4real/data/microgrid/demand/profiles_train.csv"
test_demand = "gym4real/data/microgrid/demand/profiles_test.csv"

with open(train_demand, "rb") as f:
    sha256_hash_train_demand = hashlib.sha256(f.read()).hexdigest()

with open(test_demand, "rb") as f:
    sha256_hash_test_demand = hashlib.sha256(f.read()).hexdigest()

# MARKET
train_market = "gym4real/data/microgrid/market/gme_2015-2019_train.csv"
test_market = "gym4real/data/microgrid/market/gme_2019-2020_test.csv"

with open(train_market, "rb") as f:
    sha256_hash_train_market = hashlib.sha256(f.read()).hexdigest()

with open(test_market, "rb") as f:
    sha256_hash_test_market = hashlib.sha256(f.read()).hexdigest()

# GENERATION
train_generation = "gym4real/data/microgrid/generation/pv_ninja_2015-2019_3kW_train.csv"
test_generation = "gym4real/data/microgrid/generation/pv_ninja_2019-2020_3kW_test.csv"

with open(train_generation, "rb") as f:
    sha256_hash_train_generation = hashlib.sha256(f.read()).hexdigest()

with open(test_generation, "rb") as f:
    sha256_hash_test_generation = hashlib.sha256(f.read()).hexdigest()

# TEMP AMBIENT
# No per-file hashes are computed here: the ambient temperature CSVs are
# described by the "ambient_temperature" FileSet below.



# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="github-repository",
        name="github-repository",
        description="Gym4ReaL repository on GitHub.",
        content_url="https://github.com/Daveonwave/gym4ReaL",
        encoding_formats=["git+https"],
        sha256="main",
    ),
    mlc.FileObject(
        id="energy_demands_train",
        name="energy_demands_train",
        description="Energy consumptions of different households (train set)",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + train_demand,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_train_demand,
    ),
    mlc.FileObject(
        id="energy_demands_test",
        name="energy_demands_test",
        description="Energy consumptions of different households (test set)",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + test_demand,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_test_demand,
    ),
    mlc.FileSet(
        id="energy_generation",
        name="energy_generation",
        description="Train and test of energy PV generation",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/generation/*.csv",
    ),
    mlc.FileSet(
        id="energy_prices",
        name="energy_prices",
        description="Train and test of energy market",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/market/*.csv",
    ),
    mlc.FileSet(
        id="ambient_temperature",
        name="ambient_temperature",
        description="Train and test of ambient temperature",
        encoding_formats=["text/csv"],
        contained_in=["github-repository"],
        includes="gym4real/data/microgrid/temp_amb/*.csv",
    ),
]

# --- DEMAND TRAIN ---
train_demand_fields = [
    mlc.Field(
        id="delta_time_demand_train",
        name="delta_time_demand_train",
        description="Time corresponding to the energy consumption in seconds",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_object="energy_demands_train",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"),
                ),
    )
]

for i in range(370):
    train_demand_fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"The data consumption regarding {i}-th house",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                        file_object="energy_demands_train",
                        # file_object='github-repository',
                        # Extract the field from the column of a FileObject/FileSet:
                        extract=mlc.Extract(column=str(i)),
                    ),
        )
    )

# --- DEMAND TEST ---
test_demand_fields = [
    mlc.Field(
        id="delta_time_demand_test",
        name="delta_time_demand_test",
        description="Time corresponding to the energy consumption in seconds",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_object="energy_demands_test",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"),
                ),
    )
]

for i in range(370, 398):
    test_demand_fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"The data consumption regarding {i}-th house",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                        file_object="energy_demands_test",
                        # file_object='github-repository',
                        # Extract the field from the column of a FileObject/FileSet:
                        extract=mlc.Extract(column=str(i)),
                    ),
        )
    )

# --- GENERATION ---


# --- MARKET ---
market_fields = [
    mlc.Field(
        id="delta_time_market",
        name="delta_time_market",
        description="Time corresponding to the hourly energy price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="delta_time"),
                ),
    ),
    mlc.Field(
        id="ask",
        name="ask",
        description="Energy cost price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="ask"),
                ),
    ),
    mlc.Field(
        id="bid",
        name="bid",
        description="Energy sell price",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="bid"),
                ),
    ),
    mlc.Field(
        id="timestamp",
        name="timestamp",
        description="Timestamp",
        data_types=mlc.DataType.DATE,
        source=mlc.Source(
                    file_set="energy_prices",
                    # file_object='github-repository',
                    # Extract the field from the column of a FileObject/FileSet:
                    extract=mlc.Extract(column="timestamp"),
                ),
    ),
]



# --- AMBIENT TEMPERATURE ---



# ---------------------------

record_sets = [
    # RecordSets contain the records of the dataset.
    mlc.RecordSet(
        id="energy_market_tables",
        name="energy_market_tables",
        # Each record has one or many fields...
        fields=market_fields,
    ),
    mlc.RecordSet(
        id="energy_consumption_table_train",
        name="energy_consumption_table_train",
        # Each record has one or many fields...
        fields=train_demand_fields,
    ),
    mlc.RecordSet(
        id="energy_consumption_table_test",
        name="energy_consumption_table_test",
        # Each record has one or many fields...
        fields=test_demand_fields,
    ),
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="microgrid_data",
    # Descriptions can contain plain text or markdown.
    description=(
        "Italian energy data of consumption, PV generation, market and ambient temperature."
    ),
    # cite_as=(
    #     "@article{brown2020language, title={Language Models are Few-Shot"
    # ),
    url="https://github.com/Daveonwave/gym4ReaL",
    license="https://creativecommons.org/licenses/by/4.0/",
    distribution=distribution,
    record_sets=record_sets,
)
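
The `includes` values of the `FileSet`s above are glob patterns. As a rough, standard-library illustration of which of the paths used in this notebook such a pattern selects, the sketch below uses `fnmatch`. This is an assumption made for illustration: how `mlcroissant` resolves patterns internally may differ in detail, and `select` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Paths taken from the cell above.
paths = [
    "gym4real/data/microgrid/market/gme_2015-2019_train.csv",
    "gym4real/data/microgrid/market/gme_2019-2020_test.csv",
    "gym4real/data/microgrid/generation/pv_ninja_2015-2019_3kW_train.csv",
    "gym4real/data/microgrid/demand/profiles_train.csv",
]


def select(paths, pattern):
    """Return the paths matching a glob pattern (case-sensitive)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


market_files = select(paths, "gym4real/data/microgrid/market/*.csv")
for path in market_files:
    print(path)
```

Both the train and the test market files match, so a record set sourced from the `energy_prices` `FileSet` would be expected to draw rows from both.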

In [178]:
print(metadata.issues.report())

  -  [Metadata(microgrid_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/version" is recommended, but does not exist.


In [179]:
with open("metadata_MG.json", "w") as f:
  content = metadata.to_json()
  content = json.dumps(content, indent=2)
  print(content)
  f.write(content)
  f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re

In [180]:
dataset = mlc.Dataset(jsonld="metadata_MG.json")

  -  [Metadata(microgrid_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(microgrid_data)] Property "https://schema.org/version" is recommended, but does not exist.
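
Before yielding records, it can help to check which record sets and fields a written Croissant file declares. The sketch below uses only the `json` module and the `recordSet` and `field` keys visible in the `@context` above. The `sample` document is a hand-written, heavily trimmed stand-in rather than the real content of `metadata_MG.json`, and `summarize_record_sets` is a hypothetical helper.

```python
import json


def summarize_record_sets(croissant_doc):
    """Map each record set name to the list of its field names."""
    summary = {}
    for record_set in croissant_doc.get("recordSet", []):
        fields = record_set.get("field", [])
        summary[record_set["name"]] = [field["name"] for field in fields]
    return summary


# Hand-written stand-in; a real file would be read with json.load(open(path)).
sample = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "microgrid_data",
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "energy_market_tables",
      "field": [
        {"@type": "cr:Field", "name": "delta_time_market"},
        {"@type": "cr:Field", "name": "ask"},
        {"@type": "cr:Field", "name": "bid"},
        {"@type": "cr:Field", "name": "timestamp"}
      ]
    }
  ]
}
""")

print(summarize_record_sets(sample))
```

The record set names returned this way are the values accepted by `dataset.records(record_set=...)`.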


In [181]:
records = dataset.records(record_set="energy_market_tables")

for i, record in enumerate(records):
  print(record)

{'delta_time_market': 0.0, 'ask': 0.000139327563, 'bid': 5.2327563e-05, 'timestamp': Timestamp('2015-01-01 00:00:00')}
{'delta_time_market': 3600.0, 'ask': 0.0001368927779999, 'bid': 4.9892778e-05, 'timestamp': Timestamp('2015-01-01 01:00:00')}
{'delta_time_market': 7200.0, 'ask': 0.0001261, 'bid': 3.91e-05, 'timestamp': Timestamp('2015-01-01 02:00:00')}
{'delta_time_market': 10800.0, 'ask': 0.00012287, 'bid': 3.587e-05, 'timestamp': Timestamp('2015-01-01 03:00:00')}
{'delta_time_market': 14400.0, 'ask': 0.0001204, 'bid': 3.34e-05, 'timestamp': Timestamp('2015-01-01 04:00:00')}
{'delta_time_market': 18000.0, 'ask': 0.000123473838, 'bid': 3.6473838e-05, 'timestamp': Timestamp('2015-01-01 05:00:00')}
{'delta_time_market': 21600.0, 'ask': 0.0001261, 'bid': 3.91e-05, 'timestamp': Timestamp('2015-01-01 06:00:00')}
{'delta_time_market': 25200.0, 'ask': 0.00013151852, 'bid': 4.4518520000000006e-05, 'timestamp': Timestamp('2015-01-01 07:00:00')}
{'delta_time_market': 28800.0, 'ask': 0.00012542

In [185]:
records.dataset



In [187]:
import pandas as pd

df = pd.DataFrame.from_records([r for r in records])
df.head()

|   | delta_time_market | ask | bid | timestamp |
|---|---|---|---|---|
| 0 | 0.0 | 0.000139 | 5.2e-05 | 2015-01-01 00:00:00 |
| 1 | 3600.0 | 0.000137 | 5e-05 | 2015-01-01 01:00:00 |
| 2 | 7200.0 | 0.000126 | 3.9e-05 | 2015-01-01 02:00:00 |
| 3 | 10800.0 | 0.000123 | 3.6e-05 | 2015-01-01 03:00:00 |
| 4 | 14400.0 | 0.00012 | 3.3e-05 | 2015-01-01 04:00:00 |
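
In the rows shown above, `delta_time_market` looks like the number of seconds elapsed since the first timestamp. That reading is an assumption drawn from the output, not something the dataset documents. A quick consistency check with the standard library, using the five rows copied from the output and a hypothetical helper `offsets_match`:

```python
from datetime import datetime, timedelta

# (delta_time_market, timestamp) pairs copied from the output above.
rows = [
    (0.0, "2015-01-01 00:00:00"),
    (3600.0, "2015-01-01 01:00:00"),
    (7200.0, "2015-01-01 02:00:00"),
    (10800.0, "2015-01-01 03:00:00"),
    (14400.0, "2015-01-01 04:00:00"),
]


def offsets_match(rows, fmt="%Y-%m-%d %H:%M:%S"):
    """Check that each timestamp equals the start time plus its offset in seconds."""
    first_delta, first_stamp = rows[0]
    start = datetime.strptime(first_stamp, fmt) - timedelta(seconds=first_delta)
    return all(
        datetime.strptime(stamp, fmt) - start == timedelta(seconds=delta)
        for delta, stamp in rows
    )


print(offsets_match(rows))
```

Running the same check over the full table would flag gaps or duplicated hours in the market data.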
