# Using Croissant in Machine Learning Pipelines 🥐

Croissant is a single-file JSON-LD format for Machine Learning (ML) datasets. A Croissant file describes a dataset's data sources, its structure, and any additional relevant metadata. The standardized format aims to improve the discoverability, accessibility, and interoperability of ML datasets. In this notebook we'll use an example Croissant file, which points to a dataset from the UKCEH Environment Information Data Centre (EIDC), in an ML pipeline.
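
To make the data sources, structure, and metadata concrete, here is a minimal sketch of the overall shape of a Croissant file, written as a Python dict and serialized to JSON. The dataset name, URL, and column name are hypothetical placeholders and are not taken from the EIDC file. The JSON-LD `@context` block that a real file needs is omitted for brevity.

```python
import json

# Hypothetical, heavily abbreviated Croissant description.
# A real file also carries a full JSON-LD "@context"; it is omitted here.
croissant_sketch = {
    "@type": "sc:Dataset",
    # Additional metadata: name, description, license, and so on.
    "name": "example_dataset",  # placeholder name
    "description": "Placeholder description of the dataset.",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Data sources: where the raw files live.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "data.csv",
            "contentUrl": "https://example.org/data.csv",  # placeholder URL
            "encodingFormat": "text/csv",
        }
    ],
    # Data structure: how the files map onto records and fields.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "rs-example",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "rs-example/temperature",
                    "dataType": "sc:Float",
                    "source": {
                        "fileObject": {"@id": "data.csv"},
                        "extract": {"column": "temperature"},
                    },
                }
            ],
        }
    ],
}

print(json.dumps(croissant_sketch, indent=2))
```

Each field points back to a file in `distribution` through its `source`, which is how a loader knows which column of which file to read.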

In [None]:
%%capture --no-display
# Install the necessary libraries (the %%capture cell magic must be the first line of the cell)
# Install mlcroissant from source
!apt-get install -y python3-dev graphviz libgraphviz-dev pkg-config
!pip install "git+https://github.com/${GITHUB_REPOSITORY:-mlcommons/croissant}.git@${GITHUB_HEAD_REF:-main}#subdirectory=python/mlcroissant&egg=mlcroissant[dev]"
!pip install array_record
!pip install tfds-nightly
!pip install tensorflow
!pip install torch
!apt-get install -y tree
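
Because `%%capture` hides the installer output, it can help to confirm afterwards that the packages are actually present. The following is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the package names are assumed to match the distribution names used in the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as used in the pip commands above (assumed).
wanted = ["mlcroissant", "array_record", "tfds-nightly", "tensorflow", "torch"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```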

In [None]:
# Importing necessary libraries
import json
import os
from etils import epath
import mlcroissant as mlc
import requests
import tensorflow_datasets as tfds
import torch
from tqdm import tqdm
import pandas as pd

## Loading the data

Currently, the underlying data described in the Croissant file can be loaded directly using either the [mlcroissant](https://github.com/mlcommons/croissant/tree/main/python/mlcroissant) Python library or the [TensorFlow Datasets Croissant builder](https://www.tensorflow.org/datasets/format_specific_dataset_builders#croissantbuilder). Here we'll demonstrate both.

In [None]:
# Load the dataset from the Croissant file using mlcroissant
croissant_file_path = "/tmp/croissantSpikeZip.json"  # or "../../croissantSpikeZip.json"
dataset = mlc.Dataset(jsonld=croissant_file_path)  # Parse the Croissant metadata
metadata = dataset.metadata.to_json()  # Convert the metadata to a JSON-compatible dict
records = dataset.records(record_set="rs-abberfraw")  # Extract records from the dataset
df = pd.DataFrame(records)  # Convert the records to a pandas DataFrame
df.head()  # Display the first 5 records
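
Records come back from `mlcroissant` as plain Python dicts, and text fields may arrive as `bytes` rather than `str`. Here is a minimal sketch for previewing a handful of records without reading the whole record set, decoding any byte strings along the way. `preview_records` is a hypothetical helper, and the stand-in records are invented so that the block runs without the dataset. In the notebook you would pass the `records` object instead.

```python
import itertools


def preview_records(records, limit=5, encoding="utf-8"):
    """Take up to `limit` records and decode any bytes values to str."""
    rows = []
    for record in itertools.islice(records, limit):
        rows.append({
            key: value.decode(encoding) if isinstance(value, bytes) else value
            for key, value in record.items()
        })
    return rows


# Invented stand-in for `dataset.records(...)`: an iterable of dicts.
fake_records = ({"site": b"example", "value": float(i)} for i in range(100))
rows = preview_records(fake_records, limit=3)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` then gives a small dataframe with readable string columns.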

In [None]:
# Load the dataset from the Croissant file using the TensorFlow Datasets CroissantBuilder
builder = tfds.core.dataset_builders.CroissantBuilder(
    jsonld="/tmp/croissantSpikeZip.json",
    record_set_ids=["rs-abberfraw"],
    file_format='array_record',
    data_dir="/tmp/croissant_ukceh",
)
print(f"Dataset's description:\n{builder.info.description}\n")
print(f"Dataset's citation:\n{builder.info.citation}\n")
print(f"Dataset's features:\n{builder.info.features}")

builder.download_and_prepare()  # Download and prepare the dataset
datasource = builder.as_data_source()  # Data sources keyed by split name
for record in datasource['default'][:10]:  # Print the first 10 records
    print(record)
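
The object returned by `as_data_source()` is indexed by split name, and each split can be read by position. The sketch below previews the first few records using only `len()` and integer indexing, which is the minimal interface a Python sequence offers. `head` is a hypothetical helper, and a plain list stands in for the real source so that the block runs on its own.

```python
def head(source, n=10):
    """Return the first `n` items of a random-access source as a list."""
    count = min(n, len(source))
    return [source[index] for index in range(count)]


# Stand-in for `datasource['default']`.
fake_source = [{"id": i} for i in range(3)]
for record in head(fake_source, n=10):
    print(record)
```

Relying on integer indexing alone keeps the preview independent of whether a given source type also accepts slices.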