# Tutorial for `mlcroissant` 🥐

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/python/mlcroissant) Python library lets developers work with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

In [1]:
# # Install mlcroissant from the source
# !apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [2]:
import mlcroissant as mlc
import hashlib
import json

When `Metadata` is created, `mlcroissant`:
- Checks the configuration for errors.
- Generates warnings if the configuration doesn't follow guidelines and best practices.

For instance, a warning such as:

`Property "https://schema.org/license" is recommended`...

shows at a glance that the dataset is missing a piece of metadata that matters for responsible AI: the license!

## Build the Croissant file and yield data

Let's write the Croissant JSON-LD to a file on disk!

From this JSON-LD file, we can easily create a dataset...

...and yield records from this dataset:

In [27]:
# Paths to the CSV files (MEF, demand, inflow), relative to the repository root
mef = "gym4real/data/dam/MEF.csv"
demand_train = 'gym4real/data/dam/demand/demand_train.csv'
demand_test = 'gym4real/data/dam/demand/demand_test.csv'
inflow_train = 'gym4real/data/dam/inflow/inflow_train.csv'
inflow_test = 'gym4real/data/dam/inflow/inflow_test.csv'

with open(mef, "rb") as f:
    sha256_hash_mef = hashlib.sha256(f.read()).hexdigest()

with open(demand_train, "rb") as f:
    sha256_hash_demand_train = hashlib.sha256(f.read()).hexdigest()
with open(demand_test, "rb") as f:
    sha256_hash_demand_test = hashlib.sha256(f.read()).hexdigest()

with open(inflow_train, "rb") as f:
    sha256_hash_inflow_train = hashlib.sha256(f.read()).hexdigest()
with open(inflow_test, "rb") as f:
    sha256_hash_inflow_test = hashlib.sha256(f.read()).hexdigest()


# FileObjects and FileSets define the resources of the dataset.
distribution = [
    mlc.FileObject(
        id="demand_train",
        name="demand_train",
        description="Train water demand",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + demand_train,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_demand_train,
    ),
    mlc.FileObject(
        id="demand_test",
        name="demand_test",
        description="Test water demand",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + demand_test,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_demand_test,
    ),

    mlc.FileObject(
        id="inflow_train",
        name="inflow_train",
        description="Train water inflow",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + inflow_train,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_inflow_train,
    ),
    mlc.FileObject(
        id="inflow_test",
        name="inflow_test",
        description="Test water inflow",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + inflow_test,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_inflow_test,
    ),

    mlc.FileObject(
        id="mef_file_object",
        name="min_env_flow",
        description="Minimum amount of water to release",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + mef,
        encoding_formats=["text/csv"],
        sha256=sha256_hash_mef,
    ),
]


train_demand_fields = [
    mlc.Field(
        id="year_demand_train",
        name="year",
        description="The year that the data was collected in",
        data_types=mlc.DataType.INTEGER,
        source=mlc.Source(
            file_object="demand_train",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="year"),
        ),
    )
]

for i in range(366):
    train_demand_fields.append(
        mlc.Field(
            id=str(i) + '_demand_train',
            name=str(i),
            description=f"The data regarding day {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                file_object="demand_train",
                # Extract the field from the column of a FileObject/FileSet:
                extract=mlc.Extract(column=str(i)),
            ),
        )
    )

test_demand_fields = [
    mlc.Field(
        id="year_demand_test",
        name="year",
        description="The year that the data was collected in",
        data_types=mlc.DataType.INTEGER,
        source=mlc.Source(
            file_object="demand_test",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="year"),
        ),
    )
]

for i in range(366):
    test_demand_fields.append(
        mlc.Field(
            id=str(i) + '_demand_test',
            name=str(i),
            description=f"The data regarding day {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                file_object="demand_test",
                # Extract the field from the column of a FileObject/FileSet:
                extract=mlc.Extract(column=str(i)),
            ),
        )
    )



train_inflow_fields = [
    mlc.Field(
        id="year_inflow_train",
        name="year",
        description="The year that the data was collected in",
        data_types=mlc.DataType.INTEGER,
        source=mlc.Source(
            file_object="inflow_train",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="year"),
        ),
    )
]

for i in range(366):
    train_inflow_fields.append(
        mlc.Field(
            id=str(i) + '_inflow_train',
            name=str(i),
            description=f"The data regarding day {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                file_object="inflow_train",
                # Extract the field from the column of a FileObject/FileSet:
                extract=mlc.Extract(column=str(i)),
            ),
        )
    )

test_inflow_fields = [
    mlc.Field(
        id="year_inflow_test",
        name="year",
        description="The year that the data was collected in",
        data_types=mlc.DataType.INTEGER,
        source=mlc.Source(
            file_object="inflow_test",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="year"),
        ),
    )
]

for i in range(366):
    test_inflow_fields.append(
        mlc.Field(
            id=str(i) + '_inflow_test',
            name=str(i),
            description=f"The data regarding day {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                file_object="inflow_test",
                # Extract the field from the column of a FileObject/FileSet:
                extract=mlc.Extract(column=str(i)),
            ),
        )
    )

mef_fields = [
    mlc.Field(
        id="MEF",
        name="MEF",
        description="Minimum amount of water to release",
        data_types=mlc.DataType.FLOAT,
        source=mlc.Source(
            file_object="mef_file_object",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="MEF"),
        ),
    )
]


# ---------------------------

record_sets = [
    # RecordSets contains records in the dataset.
    mlc.RecordSet(
        id="demand_train_table",
        name="demand_train_table",
        # Each record has one or many fields...
        fields=train_demand_fields,
    ),
    mlc.RecordSet(
        id="demand_test_table",
        name="demand_test_table",
        # Each record has one or many fields...
        fields=test_demand_fields,
    ),
    mlc.RecordSet(
        id="inflow_train_table",
        name="inflow_train_table",
        # Each record has one or many fields...
        fields=train_inflow_fields,
    ),
    mlc.RecordSet(
        id="inflow_test_table",
        name="inflow_test_table",
        # Each record has one or many fields...
        fields=test_inflow_fields,
    ),
    mlc.RecordSet(
        id="mef_table",
        name="mef_table",
        # Each record has one or many fields...
        fields=mef_fields,
    ),

]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="dam_data",
    # Descriptions can contain plain text or markdown.
    description=(
        "Data of water demand, inflow and minimum daily release"
    ),
    # cite_as=(
    #     "@article{brown2020language, title={Language Models are Few-Shot"
    # ),
    url="https://github.com/Daveonwave/gym4ReaL",
    license="https://creativecommons.org/licenses/by/4.0/",
    distribution=distribution,
    record_sets=record_sets,
)
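
The cell above repeats the same hashing block five times and the same field-building loop four times. A minimal sketch of how that repetition could be folded into two helpers follows. The names `sha256_of_file` and `daily_field_specs` are hypothetical, and the second helper returns plain dicts rather than `mlc.Field` objects so that the sketch runs without `mlcroissant` installed.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def daily_field_specs(file_object_id, n_days=366):
    """Describe the `year` column plus one column per day index.

    The dict layout (id / column / kind) is an assumption made for this
    sketch; it mirrors the ids and column names used in the cell above.
    """
    specs = [
        {"id": f"year_{file_object_id}", "column": "year", "kind": "integer"}
    ]
    specs += [
        {"id": f"{i}_{file_object_id}", "column": str(i), "kind": "float"}
        for i in range(n_days)
    ]
    return specs


specs = daily_field_specs("demand_train")
print(len(specs), specs[0]["id"], specs[-1]["id"])
```

Each spec would then map onto one `mlc.Field(...)` call with the same `id`, `name`, and `mlc.Extract(column=...)` arguments as in the cell above, and `sha256_of_file(demand_train)` would replace the corresponding `with open(...)` block.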

In [28]:
print(metadata.issues.report())

  -  [Metadata(dam_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/version" is recommended, but does not exist.
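
To act on these warnings programmatically, the property names can be pulled out of the report text. The sketch below assumes the message wording shown in the output above; other `mlcroissant` versions may phrase it differently, and the helper name `missing_recommended_properties` is hypothetical.

```python
import re

# Pattern inferred from the report lines printed above (assumption: the
# wording 'Property "<uri>" is recommended' is stable).
RECOMMENDED_PROPERTY = re.compile(r'Property "([^"]+)" is recommended')


def missing_recommended_properties(report: str) -> list[str]:
    """Return the property URIs that the report flags as recommended."""
    return RECOMMENDED_PROPERTY.findall(report)


# Report text copied from the output above.
report = """
  -  [Metadata(dam_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/version" is recommended, but does not exist.
"""

for uri in missing_recommended_properties(report):
    # Keep only the last path segment of each URI.
    print(uri.rsplit("/", 1)[-1])
```

In the notebook, the string returned by `metadata.issues.report()` would be passed in place of the copied `report` text.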


In [29]:
with open("metadata_dam.json", "w") as f:
    content = metadata.to_json()
    content = json.dumps(content, indent=2)
    print(content)
    f.write(content)
    f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re

In [30]:
dataset = mlc.Dataset(jsonld="metadata_dam.json")

  -  [Metadata(dam_data)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(dam_data)] Property "https://schema.org/version" is recommended, but does not exist.


In [31]:
records = dataset.records(record_set="demand_train_table")

# Print the first 11 records (indices 0 through 10), then stop.
for i, record in enumerate(records):
    print(record)
    if i == 10:
        break

{'year_demand_train': 1985, '0_demand_train': 84.3, '1_demand_train': 113.6, '2_demand_train': 113.9, '3_demand_train': 96.6, '4_demand_train': 83.0, '5_demand_train': 83.1, '6_demand_train': 98.5, '7_demand_train': 98.7, '8_demand_train': 98.7, '9_demand_train': 98.7, '10_demand_train': 88.9, '11_demand_train': 82.9, '12_demand_train': 83.9, '13_demand_train': 99.1, '14_demand_train': 111.5, '15_demand_train': 114.9, '16_demand_train': 128.5, '17_demand_train': 133.0, '18_demand_train': 131.1, '19_demand_train': 131.9, '20_demand_train': 133.2, '21_demand_train': 119.3, '22_demand_train': 115.0, '23_demand_train': 117.6, '24_demand_train': 114.3, '25_demand_train': 113.1, '26_demand_train': 112.5, '27_demand_train': 113.9, '28_demand_train': 114.4, '29_demand_train': 113.7, '30_demand_train': 112.8, '31_demand_train': 112.8, '32_demand_train': 112.8, '33_demand_train': 113.4, '34_demand_train': 113.6, '35_demand_train': 115.0, '36_demand_train': 115.0, '37_demand_train': 115.0, '38_de
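
Because the loop above breaks when `i == 10`, it prints 11 records. The same limit can be expressed with `itertools.islice`, which works on any iterable. The sketch below uses a hypothetical generator, `fake_records`, with made-up values as a stand-in for `dataset.records(...)`, so it runs without the dataset.

```python
from itertools import islice


def fake_records():
    """Hypothetical stand-in for `dataset.records(...)`: yields dicts lazily."""
    year = 1985  # made-up starting value, for illustration only
    while True:
        yield {"year_demand_train": year}
        year += 1


# Take at most 11 items without consuming the rest of the iterable.
first_records = list(islice(fake_records(), 11))
print(len(first_records))
```

In the notebook, `islice(records, 11)` would play the same role as the `enumerate` / `break` pair.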

In [32]:
records.dataset



In [33]:
import pandas as pd

df = pd.DataFrame.from_records([r for r in records])
df.head()

| | year_demand_train | 0_demand_train | 1_demand_train | 2_demand_train | 3_demand_train | 4_demand_train | 5_demand_train | 6_demand_train | 7_demand_train | 8_demand_train | ... | 356_demand_train | 357_demand_train | 358_demand_train | 359_demand_train | 360_demand_train | 361_demand_train | 362_demand_train | 363_demand_train | 364_demand_train | 365_demand_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1985 | 84.3 | 113.6 | 113.9 | 96.6 | 83.0 | 83.1 | 98.5 | 98.7 | 98.7 | ... | 59.3 | 59.3 | 59.5 | 59.2 | 59.3 | 60.3 | 60.9 | 60.9 | 60.9 | |
| 1 | 1959 | 99.0 | 99.0 | 99.0 | 100.0 | 99.0 | 99.0 | 122.0 | 122.0 | 122.0 | ... | 120.0 | 121.0 | 119.0 | 119.0 | 118.0 | 120.0 | 119.0 | 119.0 | 119.0 | |
| 2 | 1997 | 140.3 | 139.7 | 139.8 | 140.3 | 139.2 | 138.5 | 138.5 | 138.5 | 138.5 | ... | 109.8 | 109.9 | 109.4 | 103.5 | 100.2 | 99.5 | 99.2 | 99.2 | 100.7 | |
| 3 | 2008 | 45.28 | 45.04 | 45.22 | 45.34 | 45.84 | 45.89 | 45.8 | 50.58 | 50.87 | ... | 154.11 | 151.41 | 136.46 | 134.77 | 134.32 | 134.82 | 135.13 | 134.37 | 134.25 | 134.53 |
| 4 | 2006 | 44.98 | 44.53 | 40.43 | 37.51 | 36.8 | 36.67 | 36.56 | 36.4 | 36.48 | ... | 99.87 | 99.27 | 99.19 | 99.5 | 99.28 | 86.26 | 85.03 | 84.97 | 84.66 | |
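
In the rows shown, the last column (`365_demand_train`) is filled only for 2008. One plausible reading, which is an assumption and not stated in the source, is that columns `0` through `365` are zero-based day-of-year indices, so the 366th column only has a value in leap years. A quick standard-library check of the five years displayed:

```python
import calendar

# Years taken from the first five rows of the DataFrame above.
years = [1985, 1959, 1997, 2008, 2006]

for year in years:
    n_days = 366 if calendar.isleap(year) else 365
    print(year, n_days)

# Only leap years would need the column named "365" (the 366th day).
leap_years = [year for year in years if calendar.isleap(year)]
print(leap_years)
```

If this reading holds, the missing values in column `365` are expected for non-leap years and do not indicate a loading problem.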
