# Tutorial for `mlcroissant` 🥐

## Introduction

Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file.

Croissant builds on schema.org and its `sc:Dataset` vocabulary, a widely used format for representing datasets on the Web and making them searchable.

The [`mlcroissant`](https://github.com/mlcommons/croissant/python/mlcroissant) Python library empowers developers to interact with Croissant:

- Programmatically write your JSON-LD Croissant files.
- Verify your JSON-LD Croissant files.
- Load data from Croissant datasets.

In [109]:
# Install mlcroissant from source (the system packages below are Debian/Ubuntu names; adapt them for your OS)
# !apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
# !pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"

In [110]:
import mlcroissant as mlc
import hashlib

file = "gym4real/data/dam/clean_data/demand_dummy.csv"

with open(file, "rb") as f:
    sha256_hash = hashlib.sha256(f.read()).hexdigest()

# FileObjects and FileSets define the resources of the dataset.
distribution = [
    # The water demand CSV is hosted in the gym4ReaL GitHub repository:
    mlc.FileObject(
        id="water_demand",
        name="water_demand",
        description="Water demand that the reservoir must meet",
        content_url="https://raw.githubusercontent.com/Daveonwave/gym4ReaL/main/" + file,
        encoding_formats=["text/csv"],
        sha256=sha256_hash,
    ),
]

fields = [
    mlc.Field(
        id="year",
        name="year",
        description="The year that the data was collected in",
        data_types=mlc.DataType.INTEGER,
        source=mlc.Source(
            file_object="water_demand",
            # Extract the field from the column of a FileObject/FileSet:
            extract=mlc.Extract(column="year"),
        ),
    )
]

# Add one FLOAT field per day column; the CSV columns are named "0" through "365".
for i in range(366):
    fields.append(
        mlc.Field(
            id=str(i),
            name=str(i),
            description=f"The data regarding day {i}",
            data_types=mlc.DataType.FLOAT,
            source=mlc.Source(
                file_object="water_demand",
                # Extract the field from the column of a FileObject/FileSet:
                extract=mlc.Extract(column=str(i)),
            ),
        )
    )

record_sets = [
    # A RecordSet contains the records of the dataset.
    mlc.RecordSet(
        id="jsonl",
        name="jsonl",
        # Each record has one or many fields...
        fields=fields
    )
]

# Metadata contains information about the dataset.
metadata = mlc.Metadata(
    name="LakeComoDemand",
    # Descriptions can contain plain text or markdown.
    description=(
        "Data of water demand of the reservoir of Lake Como"
    ),
    # cite_as is left unset in this example.
    url="https://github.com/Daveonwave/gym4ReaL",
    distribution=distribution,
    record_sets=record_sets,
)
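
The cell above hashes the CSV by reading it into memory in one call, which is fine for a small file. For larger resources, a chunked read gives the same digest without holding the whole file in memory. The following is a minimal sketch using only the standard library; the helper name `sha256_of_file` and the throwaway demo file are illustrative assumptions, not part of `mlcroissant`.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (hypothetical content, not the real CSV).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"year,0,1\n1994,0.0,0.0\n")
    demo_path = tmp.name

chunked_digest = sha256_of_file(demo_path, chunk_size=8)
with open(demo_path, "rb") as f:
    whole_file_digest = hashlib.sha256(f.read()).hexdigest()
os.remove(demo_path)

# Both approaches produce the same digest.
print(chunked_digest == whole_file_digest)
```

The resulting string can be passed as the `sha256` argument of `mlc.FileObject`, exactly as `sha256_hash` is in the cell above.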

When `Metadata` is created, `mlcroissant`:
- checks the configuration for errors;
- generates warnings if the configuration doesn't follow guidelines and best practices.

For instance, in this case:

In [111]:
print(metadata.issues.report())

  -  [Metadata(LakeComoDemand)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/license" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/version" is recommended, but does not exist.


`Property "https://schema.org/license" is recommended`...

We can see at a glance that an important piece of metadata for building datasets for responsible AI is missing: the license!
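
To act on such a report in a script rather than by eye, the flagged property names can be pulled out of the text. This is a rough sketch that assumes every warning line has the shape shown in the output above; the helper `missing_recommended_properties` is hypothetical and not an `mlcroissant` function.

```python
import re

# Report text copied from the output above.
report = """\
  -  [Metadata(LakeComoDemand)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/license" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/version" is recommended, but does not exist.
"""

# Assumed line shape: Property "<URI>" is recommended, but does not exist.
MISSING_RECOMMENDED = re.compile(
    r'Property "([^"]+)" is recommended, but does not exist\.'
)


def missing_recommended_properties(report_text: str) -> list[str]:
    """Return the short names of recommended properties flagged as missing."""
    uris = MISSING_RECOMMENDED.findall(report_text)
    # Keep only the last path segment, e.g. ".../license" -> "license".
    return [uri.rsplit("/", 1)[-1] for uri in uris]


missing = missing_recommended_properties(report)
print(missing)  # ['citeAs', 'datePublished', 'license', 'version']
```

With a real `Metadata` object, the same helper could be fed `metadata.issues.report()` instead of the pasted string.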

## Build the Croissant file and yield data

Let's write the Croissant JSON-LD to a file on disk!

In [112]:
import json

with open("croissant.json", "w") as f:
  content = metadata.to_json()
  content = json.dumps(content, indent=2)
  print(content)
  f.write(content)
  f.write("\n")  # Terminate file with newline

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:re

From this JSON-LD file, we can easily create a dataset...

In [113]:
dataset = mlc.Dataset(jsonld="croissant.json")

  -  [Metadata(LakeComoDemand)] Property "http://mlcommons.org/croissant/citeAs" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/datePublished" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/license" is recommended, but does not exist.
  -  [Metadata(LakeComoDemand)] Property "https://schema.org/version" is recommended, but does not exist.
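
The warnings name properties by full URI (for example `http://mlcommons.org/croissant/citeAs`), while `croissant.json` uses compact terms such as `citeAs`. The `@context` block is what links the two. The sketch below illustrates that mapping on a small stand-in document built only from keys visible in the JSON output earlier; `expand_term` is a hypothetical helper and a heavy simplification, not a full JSON-LD processor.

```python
import json

# A small stand-in for croissant.json: only keys visible in the printed output.
sample = """
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "dct": "http://purl.org/dc/terms/"
  }
}
"""


def expand_term(context: dict, term: str) -> str:
    """Expand a compact term to a full URI using the prefixes in @context."""
    value = context.get(term, term)
    if isinstance(value, dict):  # e.g. {"@id": "cr:data", "@type": "@json"}
        value = value["@id"]
    prefix, sep, suffix = value.partition(":")
    if sep and isinstance(context.get(prefix), str) and not suffix.startswith("//"):
        # "cr:citeAs" -> context["cr"] + "citeAs"
        return context[prefix] + suffix
    # Terms without a prefix fall back to the default vocabulary.
    return value if sep else context["@vocab"] + value


context = json.loads(sample)["@context"]
print(expand_term(context, "citeAs"))   # http://mlcommons.org/croissant/citeAs
print(expand_term(context, "license"))  # https://schema.org/license
```

The two printed URIs match the ones reported in the warnings for `citeAs` and `license`.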


...and yield records from this dataset:

In [114]:
records = dataset.records(record_set="jsonl")

for i, record in enumerate(records):
  print(record)
  if i > 10:
    break

{'year': 1994, '0': 0.0, '1': 0.0, '2': 0.0, '3': 0.0, '4': 0.0, '5': 0.0, '6': 0.0, '7': 0.0, '8': 0.0, '9': 0.0, '10': 0.0, '11': 0.0, '12': 0.0, '13': 0.0, '14': 0.0, '15': 0.0, '16': 0.0, '17': 0.0, '18': 0.0, '19': 0.0, '20': 0.0, '21': 0.0, '22': 0.0, '23': 0.0, '24': 0.0, '25': 0.0, '26': 0.0, '27': 0.0, '28': 0.0, '29': 0.0, '30': 0.0, '31': 0.0, '32': 0.0, '33': 0.0, '34': 0.0, '35': 0.0, '36': 0.0, '37': 0.0, '38': 0.0, '39': 0.0, '40': 0.0, '41': 0.0, '42': 0.0, '43': 0.0, '44': 0.0, '45': 0.0, '46': 0.0, '47': 0.0, '48': 0.0, '49': 0.0, '50': 0.0, '51': 0.0, '52': 0.0, '53': 0.0, '54': 0.0, '55': 0.0, '56': 0.0, '57': 0.0, '58': 0.0, '59': 0.0, '60': 0.0, '61': 0.0, '62': 0.0, '63': 0.0, '64': 0.0, '65': 0.0, '66': 0.0, '67': 0.0, '68': 0.0, '69': 0.0, '70': 0.0, '71': 0.0, '72': 0.0, '73': 0.0, '74': 0.0, '75': 0.0, '76': 0.0, '77': 0.0, '78': 0.0, '79': 0.0, '80': 0.0, '81': 0.0, '82': 0.0, '83': 0.0, '84': 0.0, '85': 0.0, '86': 0.0, '87': 0.0, '88': 0.0, '89': 0.0, '90':
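
Each record holds one year with a separate key per day, which is awkward for time-series work. As a sketch, the records can be flattened into `(year, day, demand)` tuples. The example assumes records are plain dicts with a `year` key and string day keys, as in the output above; `to_long_format` and the sample values for 1995 are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def to_long_format(records: Iterable[dict]) -> Iterator[tuple[int, int, float]]:
    """Turn one-row-per-year records into (year, day, demand) tuples."""
    for record in records:
        year = record["year"]
        for key, value in record.items():
            if key == "year":
                continue
            yield year, int(key), value


# Hypothetical stand-in for dataset.records(...), with three day columns only.
sample_records = [
    {"year": 1994, "0": 0.0, "1": 0.0, "2": 0.0},
    {"year": 1995, "0": 1.5, "1": 2.5, "2": 3.5},
]

# islice takes the first N items without a manual counter and break.
first_four = list(islice(to_long_format(sample_records), 4))
print(first_four)
# [(1994, 0, 0.0), (1994, 1, 0.0), (1994, 2, 0.0), (1995, 0, 1.5)]
```

Because `to_long_format` is a generator, it can wrap the iterable returned by `dataset.records(record_set="jsonl")` without loading every record first.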