## Installation

First, install the required libraries:

In [None]:
!pip install mlcroissant pandas numpy

## Step 1: Understand Your Dataset Structure

Before creating Croissant metadata, you need to understand:
- What files make up your dataset?
- What is the structure of your data (tabular, images, text, etc.)?
- How are different files related to each other?
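
A quick way to answer the first two questions is to count the files under the dataset root by extension. The sketch below is a minimal illustration using only the standard library; the helper name `summarize_files` is hypothetical and not part of mlcroissant.

```python
from collections import Counter
from pathlib import Path


def summarize_files(root):
    """Count files under `root` (recursively) by lowercase extension.

    Hypothetical helper for a first look at a dataset directory.
    """
    root = Path(root)
    counts = Counter(
        p.suffix.lower() or "<no extension>"
        for p in root.rglob("*")
        if p.is_file()
    )
    return dict(counts)
```

Calling `summarize_files("path/to/dataset")` on a layout like the one described in the next section would report one `.csv` file and one `.npy` entry per image, which already suggests a single `FileObject` plus a `FileSet`.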

### Example Dataset Structure

Let's work with a typical ML dataset containing:
- A CSV index file with metadata and labels
- Associated image/tensor files referenced in the CSV

In [None]:
import pandas as pd
import json
from pathlib import Path

# Example: Create a sample dataset structure
# In practice, you'd have your actual data already

sample_data = {
    'id': [1, 2, 3, 4, 5],
    'feature1': [0.5, 0.3, 0.8, 0.2, 0.9],
    'feature2': [1.2, 2.1, 1.5, 2.8, 1.1],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'label': [0, 1, 0, 1, 1],
    'image_filename': ['images/img_001.npy', 'images/img_002.npy', 
                       'images/img_003.npy', 'images/img_004.npy', 
                       'images/img_005.npy']
}

df = pd.DataFrame(sample_data)
print("Sample dataset structure:")
print(df.head())
print(f"\nShape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")

## Step 2: Design Your Croissant Schema

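The Croissant format uses JSON-LD to describe datasets. The skeleton below is schematic: the `[...]` placeholders and `//` comments are not valid JSON, and `metadata.to_json()` in mlcroissant generates the full `@context` for you. The main components are: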

### Basic Structure

```json
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "ml": "http://mlcommons.org/schema/"
  },
  "@type": "sc:Dataset",
  "name": "Your Dataset Name",
  "description": "Dataset description",
  "distribution": [...],  // File references
  "recordSet": [...]      // Data structure
}
```
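
The same skeleton can be written as a Python dictionary and serialized with the standard `json` module, which guarantees syntactically valid output. This is only a sketch of the top-level shape: the lists are left empty, and the `@context` is copied from the skeleton above rather than from what mlcroissant emits.

```python
import json

# Top-level keys copied from the schematic skeleton; lists are empty placeholders.
skeleton = {
    "@context": {
        "@language": "en",
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "ml": "http://mlcommons.org/schema/",
    },
    "@type": "sc:Dataset",
    "name": "Your Dataset Name",
    "description": "Dataset description",
    "distribution": [],  # file references go here
    "recordSet": [],  # data structure goes here
}

text = json.dumps(skeleton, indent=2)
print(text)
```

In practice you would not write this dictionary by hand; the following steps build the equivalent objects through the mlcroissant API.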

## Step 3: Define File Distribution

The `distribution` section describes the actual files in your dataset.

In [None]:
# Define the distribution (files) in your dataset
# Using mlcroissant Python API

import mlcroissant as mlc
import hashlib

def get_sha256(filepath):
    """Calculate SHA256 hash of a file."""
    try:
        sha256_hash = hashlib.sha256()
        with open(filepath, "rb") as f:
            for byte_block in iter(lambda: f.read(4096), b""):
                sha256_hash.update(byte_block)
        return sha256_hash.hexdigest()
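    except FileNotFoundError:
        # An empty string is not a valid checksum; fix the path before publishing
        print(f"Warning: {filepath} not found; sha256 left empty")
        return ""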

# Calculate hash for CSV file
csv_sha256 = get_sha256("master_index.csv")

distribution = [
    mlc.FileObject(
        id="master_index",
        name="master_index",
        content_url="master_index.csv",
        encoding_formats=["text/csv"],
        sha256=csv_sha256  # Checksum used to verify file integrity
    ),
    mlc.FileSet(
        id="npy_images",
        name="npy_images",
        includes="images/*.npy",  # Pattern for files to include
        encoding_formats=["application/x-numpy"]
    )
]

print("Distribution objects created successfully")
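
Before committing to an `includes` pattern for a `FileSet`, it can help to preview which files it would select. The sketch below uses `pathlib` globbing as an approximation; mlcroissant's own matching may differ in edge cases, and the helper name `preview_fileset` is hypothetical.

```python
from pathlib import Path


def preview_fileset(root, pattern):
    """Return sorted paths (relative to `root`) of files matching a glob pattern.

    Approximates a FileSet `includes` pattern with pathlib semantics.
    """
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file()
    )
```

For example, `preview_fileset(".", "images/*.npy")` lists only `.npy` files directly inside `images/`, while `images/**/*.npy` also descends into subdirectories.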

### Important Notes on Distribution

**Encoding Formats**:
- CSV files: `["text/csv"]`
- NumPy files: `["application/x-numpy"]`
- FITS files: `["application/fits"]`
- JSON files: `["application/json"]`

**SHA256 Hashing**:
- The Croissant specification expects a checksum on each `FileObject` so that consumers can verify data integrity; treat `sha256` as required when publishing
- Calculate it using `hashlib.sha256()` as shown above

**FileSet Patterns**:
- Use glob patterns like `images/*.npy` or `cutouts/*.npy`
- Supports wildcards: `*`, `**`, `?`
- Path is relative to the dataset root directory

## Step 4: Define RecordSet and Fields

The `recordSet` section defines how to interpret the data. Each field describes a column or feature. The cell below uses column names from the RAPID example later in this guide (`x`, `y`, `label`, `image_filename`); substitute the columns of your own CSV.

In [None]:
# Define the record structure using mlcroissant API

record_sets = [
    mlc.RecordSet(
        id="transient_candidates",
        name="transient_candidates",
        fields=[
            mlc.Field(
                id="transient_candidates/id",
                name="id",
                data_types=[mlc.DataType.FLOAT],
                source=mlc.Source(
                    file_object="master_index",
                    extract=mlc.Extract(column="id")
                )
            ),
            mlc.Field(
                id="transient_candidates/x",
                name="x",
                data_types=[mlc.DataType.FLOAT],
                source=mlc.Source(
                    file_object="master_index",
                    extract=mlc.Extract(column="x")
                )
            ),
            mlc.Field(
                id="transient_candidates/y",
                name="y",
                data_types=[mlc.DataType.FLOAT],
                source=mlc.Source(
                    file_object="master_index",
                    extract=mlc.Extract(column="y")
                )
            ),
            mlc.Field(
                id="transient_candidates/label",
                name="label",
                description="0 = Bogus, 1 = Real",
                data_types=[mlc.DataType.INTEGER],
                source=mlc.Source(
                    file_object="master_index",
                    extract=mlc.Extract(column="label")
                )
            ),
            mlc.Field(
                id="transient_candidates/image_path",
                name="image_path",
                description="Relative path to the .npy file",
                data_types=[mlc.DataType.TEXT],
                source=mlc.Source(
                    file_object="master_index",
                    extract=mlc.Extract(column="image_filename")
                )
            )
            # ... more fields can be added here
        ]
    )
]

print(f"RecordSet created with {len(record_sets[0].fields)} fields")

## Step 5: Assemble Complete Croissant Metadata

Now combine all components into a complete Croissant JSON file.

In [None]:
def generate_croissant(csv_path, output_path):
    """
    Generate complete Croissant metadata using mlcroissant API.
    
    Args:
        csv_path: Path to master index CSV file
        output_path: Output path for croissant.json
    """
    import os
    
    # Calculate CSV hash
    csv_sha256 = get_sha256(csv_path)
    csv_filename = os.path.basename(csv_path)
    
    # Create distribution
    distribution = [
        mlc.FileObject(
            id="master_index",
            name="master_index",
            content_url=csv_filename,
            encoding_formats=["text/csv"],
            sha256=csv_sha256
        ),
        mlc.FileSet(
            id="npy_images",
            name="npy_images",
            includes="images/*.npy",
            encoding_formats=["application/x-numpy"]
        )
    ]
    
    # Create metadata object
    metadata = mlc.Metadata(
        name="sample_dataset",
        description="Example dataset in Croissant format",
        distribution=distribution,
        record_sets=record_sets  # uses the record_sets list defined in Step 4
    )
    
    # Save to JSON file
    with open(output_path, 'w') as f:
        f.write(json.dumps(metadata.to_json(), indent=2))
    
    print(f"Successfully generated {output_path}")


# Example usage
output_path = "croissant_sample.json"
# generate_croissant("master_index.csv", output_path)
print("Function defined. Call generate_croissant() with your CSV path to create metadata.")

## Step 6: Validate Your Croissant Metadata

It's important to validate that your Croissant file is correctly formatted.

In [None]:
import mlcroissant as mlc

from pathlib import Path

if not Path(output_path).exists():
    print(f"{output_path} not found; run generate_croissant() first")
else:
    try:
        # Loading the JSON-LD runs mlcroissant's validation;
        # errors are raised as mlc.ValidationError
        dataset = mlc.Dataset(jsonld=output_path)
        print("Croissant metadata loaded without validation errors")
        # Warnings (if any) are collected on the metadata object
        print(dataset.metadata.issues.report())

        # If you have the actual data files, you can read records:
        # records = dataset.records("transient_candidates")
        # for record in records:
        #     print(record)
        #     break  # Print first record

    except mlc.ValidationError as e:
        print(f"Validation error: {e}")
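
Structural validation does not confirm that the files on disk still match the checksums recorded in the metadata. A minimal sketch of that check follows, working on the parsed JSON directly. It assumes `contentUrl` values are local paths relative to a dataset root (not HTTP URLs), and the helper name `verify_checksums` is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_checksums(data, root="."):
    """Compare recorded sha256 values with files on disk.

    `data` is the parsed Croissant JSON. Returns a list of
    (content_url, status) pairs with status "ok", "mismatch", or "missing".
    Entries without both `contentUrl` and `sha256` (e.g. FileSets) are skipped.
    """
    results = []
    for dist in data.get("distribution", []):
        expected = dist.get("sha256")
        url = dist.get("contentUrl")
        if not expected or not url:
            continue
        path = Path(root) / url
        if not path.is_file():
            results.append((url, "missing"))
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        results.append((url, "ok" if actual == expected else "mismatch"))
    return results
```

Reading the whole file with `read_bytes()` keeps the sketch short; for large files, the chunked approach used in `get_sha256` is the better choice.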

## Step 7: Load and Use Your Croissant Dataset

Once your Croissant metadata is created, you can load the dataset using the mlcroissant library.

In [None]:
# Example of loading a Croissant dataset
# (This assumes your data files actually exist)

def load_croissant_dataset(croissant_path, record_set_id="transient_candidates"):
    """
    Load a dataset using Croissant metadata.
    
    Args:
        croissant_path: Path to croissant.json file
        record_set_id: ID of the record set to load
    
    Returns:
        Generator of records
    """
    try:
        dataset = mlc.Dataset(jsonld=croissant_path)
        return dataset.records(record_set_id)
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None

### Example Usage

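```py
import mlcroissant as mlc
import numpy as np

# Load dataset
dataset = mlc.Dataset(jsonld="croissant.json")
# The record set id must match the one defined in the metadata
records = dataset.records("transient_candidates")

# Iterate through records
for record in records:
    # Keys follow the fields defined in the RecordSet. Depending on the
    # mlcroissant version they may be prefixed with the record set id
    # (e.g. "transient_candidates/label"); print(record.keys()) to check.
    record_id = record['id']
    label = record['label']

    # Text values may be returned as bytes; decode before using as a path
    image_path = record['image_path']
    if isinstance(image_path, bytes):
        image_path = image_path.decode()
    image_data = np.load(image_path)

    # Use in your ML pipeline
    # ...
```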
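
Paths stored in the index CSV are relative to the dataset root, so loading them from a different working directory fails unless they are resolved first. The sketch below shows one way to do that; the helper name `resolve_record_path` is hypothetical, and the bytes handling reflects the assumption that some mlcroissant versions return text fields as `bytes`.

```python
from pathlib import Path


def resolve_record_path(dataset_root, value):
    """Join a relative path taken from a record onto the dataset root.

    Accepts `str` or `bytes`, since text fields may arrive as either.
    """
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return Path(dataset_root) / value
```

The result can be passed straight to `np.load`, for example `np.load(resolve_record_path("/data/rapid", path_value))`, where the root directory is whatever folder contains your `croissant.json`.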

## Real-World Example: RAPID Transient Dataset

Let's look at a real example from the RAPID pipeline for astronomical transient detection.

In [None]:
# Real-world example: RAPID Astronomical Transient Detection
# This shows the actual implementation from the RAPID pipeline

# Dataset structure:
# - master_index.csv: Contains candidate metadata (coordinates, photometry, labels)
# - images/ or cutouts/: 4-channel tensor files (science, reference, difference, score)

# Distribution using mlcroissant API
rapid_distribution = [
    mlc.FileObject(
        id="master_index",
        name="master_index",
        content_url="master_index.csv",
        encoding_formats=["text/csv"],
        sha256=get_sha256("master_index.csv")
    ),
    mlc.FileSet(
        id="npy_images",
        name="npy_images",
        includes="images/*.npy",
        encoding_formats=["application/x-numpy"]  # Correct format for numpy files
    )
]

# Fields for astronomical data as (name, data type, description) tuples
# All 16 fields of the full-image variant are listed here
astronomical_fields_info = [
    ("id", "FLOAT", "Candidate ID from finder catalog"),
    ("jid", "TEXT", "Job identifier"),
    ("x", "FLOAT", "X coordinate in image"),
    ("y", "FLOAT", "Y coordinate in image"),
    ("sharpness", "FLOAT", "Source sharpness"),
    ("roundness1", "FLOAT", "First roundness metric"),
    ("roundness2", "FLOAT", "Second roundness metric"),
    ("npix", "FLOAT", "Number of pixels"),
    ("peak", "FLOAT", "Peak pixel value"),
    ("flux", "FLOAT", "Measured flux"),
    ("mag", "FLOAT", "Instrumental magnitude"),
    ("daofind_mag", "FLOAT", "DAOFind magnitude"),
    ("flags", "FLOAT", "Quality flags from psfcat"),
    ("match", "FLOAT", "Match indicator from psfcat"),
    ("label", "INTEGER", "Binary label (0=bogus, 1=real)"),
    ("image_filename", "TEXT", "Path to .npy tensor file")
]

print("RAPID Transient Detection Dataset Structure:")
print(f"Total fields: {len(astronomical_fields_info)}")
print(f"Distribution files: {len(rapid_distribution)}")
print("\nKey fields:")
for name, dtype, desc in astronomical_fields_info[:5]:
    print(f"  - {name} ({dtype}): {desc}")
print("\nNote: Full implementation in generate_croissant.py includes all fields")
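
Writing many nearly identical field definitions by hand is error-prone, so a tuple list like the one above is a natural input for a loop. The sketch below expands such tuples into plain dictionaries shaped like Croissant JSON-LD field entries instead of calling mlcroissant. The helper `field_entry`, the `DATA_TYPES` mapping, and the `cr:Field` type label (the Croissant 1.0 prefix; older files may use `ml:`) are assumptions for illustration.

```python
# Maps the type names used in the tuples to schema.org type labels.
DATA_TYPES = {"FLOAT": "sc:Float", "INTEGER": "sc:Integer", "TEXT": "sc:Text"}


def field_entry(record_set_id, file_object_id, name, dtype, description):
    """Build one JSON-LD style field dict that reads CSV column `name`."""
    return {
        "@type": "cr:Field",  # assumed Croissant 1.0 prefix
        "@id": f"{record_set_id}/{name}",
        "name": name,
        "description": description,
        "dataType": DATA_TYPES[dtype],
        "source": {
            "fileObject": {"@id": file_object_id},
            "extract": {"column": name},
        },
    }


# A three-field excerpt of the tuple list, to keep the example short.
fields_info = [
    ("id", "FLOAT", "Candidate ID from finder catalog"),
    ("label", "INTEGER", "Binary label (0=bogus, 1=real)"),
    ("image_filename", "TEXT", "Path to .npy tensor file"),
]

entries = [
    field_entry("transient_candidates", "master_index", *info)
    for info in fields_info
]
```

The same loop structure works with `mlc.Field`, `mlc.Source`, and `mlc.Extract` in place of the dictionary, which would shorten the long field list in the function that follows.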

In [None]:
# Complete RAPID implementation example
# This function mirrors the actual generate_croissant.py implementation

def generate_rapid_croissant(csv_path, output_path, dataset_type="full_images"):
    """
    Generate Croissant metadata for RAPID transient detection dataset.
    
    Args:
        csv_path: Path to master_index.csv
        output_path: Output path for croissant.json
        dataset_type: Either "full_images" or "cutouts"
    """
    import os  # needed for os.path.basename below

    csv_sha256 = get_sha256(csv_path)
    csv_filename = os.path.basename(csv_path)
    
    # Choose appropriate file pattern based on dataset type
    if dataset_type == "cutouts":
        file_pattern = "cutouts/*.npy"
        dataset_name = "roman_croissant_cutouts"
        description = "64x64 cutouts of transient candidates"
    else:
        file_pattern = "images/*.npy"
        dataset_name = "roman_croissant_full_images"
        description = "Full image dataset with transient candidates"
    
    distribution = [
        mlc.FileObject(
            id="master_index",
            name="master_index",
            content_url=csv_filename,
            encoding_formats=["text/csv"],
            sha256=csv_sha256
        ),
        mlc.FileSet(
            id="npy_images" if dataset_type == "full_images" else "npy_cutouts",
            name="npy_images" if dataset_type == "full_images" else "npy_cutouts",
            includes=file_pattern,
            encoding_formats=["application/x-numpy"]
        )
    ]
    
    # Define all fields (matching actual implementation)
    fields = [
        mlc.Field(id="transient_candidates/id", name="id", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="id"))),
        mlc.Field(id="transient_candidates/x", name="x", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="x"))),
        mlc.Field(id="transient_candidates/y", name="y", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="y"))),
        mlc.Field(id="transient_candidates/sharpness", name="sharpness", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="sharpness"))),
        mlc.Field(id="transient_candidates/roundness1", name="roundness1", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="roundness1"))),
        mlc.Field(id="transient_candidates/roundness2", name="roundness2", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="roundness2"))),
        mlc.Field(id="transient_candidates/npix", name="npix", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="npix"))),
        mlc.Field(id="transient_candidates/peak", name="peak", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="peak"))),
        mlc.Field(id="transient_candidates/flux", name="flux", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="flux"))),
        mlc.Field(id="transient_candidates/mag", name="mag", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="mag"))),
        mlc.Field(id="transient_candidates/daofind_mag", name="daofind_mag", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="daofind_mag"))),
        mlc.Field(id="transient_candidates/flags", name="flags", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="flags"))),
        mlc.Field(id="transient_candidates/match", name="match", 
                  data_types=[mlc.DataType.FLOAT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="match"))),
        mlc.Field(id="transient_candidates/jid", name="jid", 
                  description="Job ID",
                  data_types=[mlc.DataType.TEXT],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="jid"))),
        mlc.Field(id="transient_candidates/label", name="label", 
                  description="0 = Bogus, 1 = Real",
                  data_types=[mlc.DataType.INTEGER],
                  source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="label"))),
    ]
    
    # Add cutout_id field if using cutouts
    if dataset_type == "cutouts":
        fields.insert(13, mlc.Field(
            id="transient_candidates/cutout_id", 
            name="cutout_id",
            description="Unique cutout identifier",
            data_types=[mlc.DataType.INTEGER],
            source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="cutout_id"))
        ))
        path_field = mlc.Field(
            id="transient_candidates/cutout_path", 
            name="cutout_path",
            description="Relative path to the .npy file containing the (64,64,4) cutout tensor",
            data_types=[mlc.DataType.TEXT],
            source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="cutout_filename"))
        )
    else:
        path_field = mlc.Field(
            id="transient_candidates/image_path", 
            name="image_path",
            description="Relative path to the .npy file containing the full image tensor",
            data_types=[mlc.DataType.TEXT],
            source=mlc.Source(file_object="master_index", extract=mlc.Extract(column="image_filename"))
        )
    
    fields.append(path_field)
    
    record_sets = [
        mlc.RecordSet(
            id="transient_candidates",
            name="transient_candidates",
            fields=fields
        )
    ]
    
    metadata = mlc.Metadata(
        name=dataset_name,
        description=description,
        distribution=distribution,
        record_sets=record_sets
    )
    
    with open(output_path, 'w') as f:
        f.write(json.dumps(metadata.to_json(), indent=2))
    
    print(f"Successfully generated {output_path}")
    print(f"Dataset: {dataset_name}")
    print(f"Fields: {len(fields)}")

# Example: This is how the actual generate_croissant.py works
print("Complete RAPID implementation function defined.")
print("Usage: generate_rapid_croissant('master_index.csv', 'croissant.json', 'full_images')")

## Common Data Types in Croissant

Here are the most commonly used data types in mlcroissant:

| mlcroissant API | JSON-LD Schema | Description | Python Equivalent |
|----------------|----------------|-------------|------------------|
| `mlc.DataType.INTEGER` | `sc:Integer` | Integer values | `int` |
| `mlc.DataType.FLOAT` | `sc:Float` | Floating-point values | `float` |
| `mlc.DataType.TEXT` | `sc:Text` | String/text data | `str` |
| `mlc.DataType.BOOL` | `sc:Boolean` | Boolean values | `bool` |
| `mlc.DataType.DATE` | `sc:Date` | Date values | `datetime.date` |
| `mlc.DataType.DATETIME` | `sc:DateTime` | Date and time values | `datetime.datetime` |

**Note**: The mlcroissant Python library uses `mlc.DataType.FLOAT`, `mlc.DataType.INTEGER`, etc., which are converted to the JSON-LD schema types (`sc:Float`, `sc:Integer`) when calling `metadata.to_json()`.
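
When a dataset has many columns, a first guess at each column's type can be derived from its values and then reviewed by hand. The sketch below follows the table above and returns the JSON-LD type labels; the helper name `infer_data_type` and the fallback to `sc:Text` for empty or mixed columns are assumptions, not mlcroissant behavior.

```python
import datetime


def infer_data_type(values):
    """Guess a schema.org type label for a column; None counts as missing."""
    present = [v for v in values if v is not None]
    if not present:
        return "sc:Text"
    # bool is a subclass of int in Python, so test for it first
    if all(isinstance(v, bool) for v in present):
        return "sc:Boolean"
    if any(isinstance(v, bool) for v in present):
        return "sc:Text"  # mixed booleans and other values
    if all(isinstance(v, int) for v in present):
        return "sc:Integer"
    if all(isinstance(v, (int, float)) for v in present):
        return "sc:Float"
    # datetime is a subclass of date, so test for it first
    if all(isinstance(v, datetime.datetime) for v in present):
        return "sc:DateTime"
    if all(isinstance(v, datetime.date) for v in present):
        return "sc:Date"
    return "sc:Text"
```

A guess like this is only a starting point: an identifier column stored as integers may still be better declared as text, and the RAPID example deliberately declares `id` as a float.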

## Best Practices

1. **Descriptive Naming**: Use clear, descriptive names for all fields and file objects
2. **Complete Descriptions**: Provide detailed descriptions for each field
3. **Specify Data Types**: Always specify the correct data type for each field
4. **Use @id References**: Use @id to create references between components
5. **Include Metadata**: Add license, URL, and other dataset-level metadata
6. **Version Control**: Track versions of your Croissant metadata
7. **Validate**: Always validate your Croissant file before sharing
8. **Document Units**: Specify units for numerical fields in descriptions
9. **Handle Missing Data**: Document how missing values are represented
10. **Test Loading**: Verify the dataset can be loaded with mlcroissant

## Troubleshooting

### Common Issues:

1. **File Not Found**: Ensure all file paths in `contentUrl` and `containedIn` are correct
2. **Invalid JSON**: Validate your JSON structure using a JSON linter
3. **Missing @id**: All components that are referenced must have an `@id` field
4. **Type Mismatches**: Ensure data types match the actual data in your files
5. **Field Linking**: Verify that `source` references point to valid `@id` values

### Debugging Tips:

```python
# Validate JSON structure
import json
with open('croissant.json') as f:
    data = json.load(f)  # Will raise error if invalid JSON

# Check required fields
required = ['@context', '@type', 'name', 'distribution', 'recordSet']
missing = [k for k in required if k not in data]
if missing:
    print(f"Missing required fields: {missing}")

# Verify file paths exist
from pathlib import Path
for dist in data['distribution']:
    # Compare the entry's own @type (the prefix may be cr:, sc:, or ml:)
    if dist.get('@type', '').endswith('FileObject'):
        path = Path(dist['contentUrl'])
        if not path.exists():
            print(f"Warning: File not found: {path}")
```
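
The field-linking issue from the list above can be checked in a similar way, by collecting every `@id` in `distribution` and confirming each field's `source` points at one of them. This sketch assumes the Croissant 1.0 JSON layout, where each record set has a `field` list and sources look like `{"fileObject": {"@id": ...}}`; the helper name `find_dangling_sources` is hypothetical.

```python
def find_dangling_sources(data):
    """Return (field_id, missing_id) pairs whose source targets an unknown @id."""
    known = {d.get("@id") for d in data.get("distribution", [])}
    dangling = []
    for record_set in data.get("recordSet", []):
        for field in record_set.get("field", []):
            source = field.get("source", {})
            for key in ("fileObject", "fileSet"):
                ref = source.get(key)
                # References may be {"@id": "..."} objects or bare strings
                ref_id = ref.get("@id") if isinstance(ref, dict) else ref
                if ref_id is not None and ref_id not in known:
                    dangling.append((field.get("@id"), ref_id))
    return dangling
```

An empty list means every field source resolves to a declared file; any pair in the result names the field and the identifier it could not find.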

## Additional Resources

- [Croissant Official Documentation](https://github.com/mlcommons/croissant)
- [Schema.org Vocabulary](https://schema.org/)
- [mlcroissant Python Library](https://pypi.org/project/mlcroissant/)
- [MLCommons Croissant Tutorial](https://colab.research.google.com/github/mlcommons/croissant/blob/main/python/mlcroissant/recipes/introduction.ipynb)

## Summary

Converting a dataset to Croissant format involves:
1. Understanding your dataset structure
2. Defining file distribution (FileObject and FileSet)
3. Creating RecordSet with appropriate Fields
4. Specifying correct data types
5. Linking fields to data sources
6. Validating the metadata
7. Testing dataset loading

The Croissant format makes your dataset more discoverable, interoperable, and easier to use across different ML frameworks and platforms.