[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/03_advanced_workflows/advanced_workflows.ipynb)

---

# Advanced Foundry Workflows

**Time:** 20 minutes  
**Prerequisites:** Completed previous examples  
**What you'll learn:**
- Publishing datasets to Foundry
- Exporting to HuggingFace Hub
- Using the CLI
- MCP server for AI agent integration
- Structured error handling

---

## 1. Publishing Datasets

Share your data with the materials science community.

In [None]:
from foundry import Foundry

# HTTPS download is now the default
f = Foundry()

### 1.1 Prepare Your Metadata

Foundry follows the DataCite metadata standard. Describe your dataset in a Python dictionary, which you can later save as a JSON file:

In [None]:
metadata = {
    "dc": {
        "titles": [{"title": "My Band Gap Dataset"}],
        "creators": [
            {"creatorName": "Smith, John", "affiliation": "University of Example"}
        ],
        "descriptions": [{
            "description": "Band gap predictions for 1000 materials using DFT calculations.",
            "descriptionType": "Abstract"
        }],
        "publicationYear": 2024,
        "publisher": "Foundry",
        "resourceType": {"resourceType": "Dataset", "resourceTypeGeneral": "Dataset"}
    },
    "foundry": {
        "data_type": "tabular",
        "keys": [
            {
                "key": ["composition"],
                "type": "input",
                "description": "Chemical composition formula"
            },
            {
                "key": ["band_gap"],
                "type": "target",
                "description": "Calculated band gap",
                "units": "eV"
            }
        ],
        "splits": [
            {"label": "train", "type": "train"},
            {"label": "test", "type": "test"}
        ]
    }
}

print("Metadata structure ready!")
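
Before publishing, it can help to sanity-check the dictionary by hand. The sketch below is not Foundry's own validation: `check_metadata` is a hypothetical helper, and the accepted `type` values (`"input"`, `"target"`) are taken only from the example above, so the real schema may allow more.

```python
def check_metadata(metadata):
    """Return a list of human-readable problems found in a Foundry-style metadata dict."""
    problems = []

    # Both top-level blocks used in the example above are expected.
    for block in ("dc", "foundry"):
        if block not in metadata:
            problems.append(f"missing top-level block: {block!r}")

    dc = metadata.get("dc", {})
    if not dc.get("titles"):
        problems.append("dc.titles is empty")
    if not dc.get("creators"):
        problems.append("dc.creators is empty")

    # Each key entry should name its column(s) and declare a role.
    for i, entry in enumerate(metadata.get("foundry", {}).get("keys", [])):
        if not entry.get("key"):
            problems.append(f"foundry.keys[{i}] has no 'key' list")
        if entry.get("type") not in ("input", "target"):
            problems.append(f"foundry.keys[{i}] has unexpected type {entry.get('type')!r}")

    return problems


# Deliberately flawed example: the second key entry is empty and uses an unknown type.
example = {
    "dc": {
        "titles": [{"title": "My Band Gap Dataset"}],
        "creators": [{"creatorName": "Smith, John"}],
    },
    "foundry": {
        "keys": [
            {"key": ["band_gap"], "type": "target"},
            {"key": [], "type": "label"},
        ]
    },
}
for problem in check_metadata(example):
    print(problem)
```

An empty list means none of these basic checks failed; it does not guarantee that publishing will succeed.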

### 1.2 Publish (requires Globus authentication)

In [None]:
# To publish:
# 1. Save metadata to foundry.json
# 2. Prepare data files in a folder
# 3. Run:

# result = f.publish(
#     metadata,
#     data_path="./my_data_folder",
#     source_id="my_dataset_v1.0"
# )

print("See the Foundry.publish() documentation for the full publishing workflow")
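
Step 1 above (saving the metadata to `foundry.json`) might look like the following sketch. `write_metadata` is a hypothetical helper, and the target folder is an assumption: this tutorial does not say where Foundry expects the file to live.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(metadata, folder):
    """Write metadata to <folder>/foundry.json and return the file path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "foundry.json"
    path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_metadata = {"dc": {"titles": [{"title": "My Band Gap Dataset"}]}}
    saved = write_metadata(demo_metadata, tmp)
    # Read the file back to confirm the round trip preserves the structure.
    reloaded = json.loads(saved.read_text(encoding="utf-8"))
    print(saved.name, reloaded == demo_metadata)
```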

In [None]:
# Check publication status
# status = f.check_status("my_dataset_v1.0")
# print(status)

## 2. Exporting to HuggingFace Hub

Make your Foundry dataset discoverable on HuggingFace Hub.

In [None]:
# Install HuggingFace dependencies
# (quotes keep shells such as zsh from interpreting the square brackets)
# !pip install "foundry-ml[huggingface]"

In [None]:
# Export a Foundry dataset to HuggingFace Hub
from foundry.integrations.huggingface import push_to_hub

# Get a dataset
results = f.search("band gap", limit=1)
dataset = results.iloc[0].FoundryDataset

# Export to HF Hub (requires HF token)
# url = push_to_hub(
#     dataset,
#     repo_id="your-username/dataset-name",
#     token="hf_YOUR_TOKEN",  # or set HF_TOKEN env var
#     private=False
# )
# print(f"Published at: {url}")

print("HuggingFace export ready! Set your HF token to publish.")
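
Rather than pasting a token into the notebook, you may prefer to read it from the `HF_TOKEN` environment variable mentioned above. This is a minimal sketch; `resolve_hf_token` is a hypothetical helper and not part of Foundry or `huggingface_hub`.

```python
import os


def resolve_hf_token(explicit_token=None, env=None):
    """Prefer an explicitly passed token, then fall back to the HF_TOKEN variable."""
    env = os.environ if env is None else env
    token = explicit_token or env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No HuggingFace token found; pass one or set HF_TOKEN.")
    return token


# Demo with a fake environment mapping; a real token should come from your shell
# or a secrets manager, never from a committed notebook.
print(resolve_hf_token(env={"HF_TOKEN": "hf_placeholder"}))
```

The result can then be passed as `token=resolve_hf_token()` in the `push_to_hub` call shown above.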

### What Gets Created on HuggingFace

The export automatically creates:
- **Data files** in Parquet/Arrow format
- **Dataset Card (README.md)** with:
  - Title, description from DataCite
  - **Authors from original creators** (not the person pushing)
  - DOI link to original source
  - Field descriptions and units
  - BibTeX citation
  - Usage examples for both Foundry and HF
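
To make the citation item concrete, here is a rough sketch of how a BibTeX entry could be derived from the DataCite block. It is an illustration only and not the exporter's actual implementation: `bibtex_from_datacite`, the `@misc` entry type, and the cite key are assumptions, and the DOI is the placeholder used elsewhere in this tutorial.

```python
def bibtex_from_datacite(dc, doi, cite_key="foundry_dataset"):
    """Build a simple @misc BibTeX entry from a DataCite-style dictionary."""
    fields = {
        # BibTeX separates multiple authors with " and ".
        "author": " and ".join(c["creatorName"] for c in dc.get("creators", [])),
        "title": dc["titles"][0]["title"],
        "publisher": dc.get("publisher", ""),
        "year": str(dc.get("publicationYear", "")),
        "doi": doi,
    }
    # Skip empty fields so the entry contains only what the metadata provides.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items() if value)
    return f"@misc{{{cite_key},\n{body}\n}}"


dc_block = {
    "titles": [{"title": "My Band Gap Dataset"}],
    "creators": [{"creatorName": "Smith, John"}],
    "publicationYear": 2024,
    "publisher": "Foundry",
}
entry = bibtex_from_datacite(dc_block, doi="10.18126/abc123")
print(entry)
```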

## 3. Using the CLI

Foundry includes a command-line interface for quick operations.

```bash
# Search for datasets
foundry search "band gap"

# Get dataset info
foundry get 10.18126/abc123

# View schema
foundry schema 10.18126/abc123

# List all datasets
foundry catalog --limit 10

# JSON output for scripting
foundry catalog --json | jq '.[] | .name'

# Export to HuggingFace
foundry push-to-hf 10.18126/abc123 --repo your-org/dataset-name

# Check version
foundry version

# Get help
foundry --help
foundry search --help
```
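
The `--json` flag also makes the CLI usable from Python scripts. The sketch below mirrors the `jq` filter above; it assumes the output is a JSON array of objects with a `name` field, which is what that filter implies but is not confirmed here. Both function names are hypothetical.

```python
import json
import shutil
import subprocess


def names_from_catalog_json(text):
    """Mirror the jq filter `.[] | .name`: pull the name out of each catalog entry."""
    return [entry.get("name") for entry in json.loads(text)]


def catalog_names():
    """Run `foundry catalog --json`; return dataset names, or None if the CLI is missing."""
    if shutil.which("foundry") is None:
        return None
    completed = subprocess.run(
        ["foundry", "catalog", "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return names_from_catalog_json(completed.stdout)


# Hand-written sample in the assumed shape, so the parser can be tried offline.
sample = '[{"name": "dataset_a"}, {"name": "dataset_b"}]'
print(names_from_catalog_json(sample))
```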

In [None]:
# Run CLI from notebook
!foundry --help

## 4. MCP Server for AI Agents

Foundry includes an MCP (Model Context Protocol) server that allows AI agents like Claude Code to discover and use datasets.

### 4.1 Install for Claude Code

```bash
# Automatically configure Claude Code to use Foundry
foundry mcp install
```

This adds Foundry to your Claude Code configuration, so you can make natural-language requests like:
- "Find me a materials science dataset for band gap prediction"
- "What fields are in dataset X?"
- "Load the training data from dataset Y"

### 4.2 Available MCP Tools

The MCP server exposes these tools to AI agents:

In [None]:
from foundry.mcp.server import TOOLS, create_server

print("Available MCP Tools:")
print("=" * 50)
for tool in TOOLS:
    print(f"\n{tool['name']}")
    print(f"  {tool['description'][:80]}...")
    print(f"  Parameters: {list(tool['inputSchema']['properties'].keys())}")

In [None]:
# View full server configuration
config = create_server()
print(f"Server: {config['name']} v{config['version']}")
print(f"Tools: {len(config['tools'])}")
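
Because each tool carries a JSON Schema under `inputSchema`, an agent integration can check its arguments before calling a tool. The sketch below uses a hypothetical tool entry in the same shape as the `TOOLS` items printed above; the tool name and the optional `required` list are assumptions, not Foundry's actual tool definitions.

```python
def unknown_arguments(tool, arguments):
    """Return argument names that the tool's inputSchema does not declare."""
    declared = set(tool["inputSchema"]["properties"])
    return sorted(set(arguments) - declared)


def missing_arguments(tool, arguments):
    """Return required argument names that were not supplied (empty if none are listed)."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# Hypothetical tool entry, for illustration only.
example_tool = {
    "name": "example_search",
    "description": "Illustrative only",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["query"],
    },
}

print(unknown_arguments(example_tool, {"query": "band gap", "max": 5}))
print(missing_arguments(example_tool, {"limit": 5}))
```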

### 4.3 Start MCP Server Manually

```bash
# Start the server (for custom agent integrations)
foundry mcp start
```
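
For custom integrations, it may help to know that MCP messages are JSON-RPC 2.0 requests, with `tools/list` and `tools/call` as the standard tool methods. The sketch below only builds such messages. How they reach the Foundry server (the transport) is not covered in this tutorial, the tool name is hypothetical, and a real client must complete the protocol's `initialize` handshake first.

```python
import json


def jsonrpc_request(request_id, method, params=None):
    """Serialize a JSON-RPC 2.0 request as used by MCP clients."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


# Ask the server which tools it offers.
list_request = jsonrpc_request(1, "tools/list")

# Call a tool by name; "example_search" is a placeholder, not a real Foundry tool name.
call_request = jsonrpc_request(
    2,
    "tools/call",
    {"name": "example_search", "arguments": {"query": "band gap"}},
)

print(list_request)
print(call_request)
```

In practice an MCP client library handles this framing for you; the sketch is only meant to show what travels over the wire.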

## 5. Structured Error Handling

Foundry uses structured errors that provide clear context for both humans and AI agents.

In [None]:
from foundry.errors import (
    FoundryError,
    DatasetNotFoundError,
    AuthenticationError,
    DownloadError,
)

# Example: Handle a not-found error
try:
    raise DatasetNotFoundError("fake-doi-12345")
except DatasetNotFoundError as e:
    print(f"Error Code: {e.code}")
    print(f"Message: {e.message}")
    print(f"Details: {e.details}")
    print(f"Recovery Hint: {e.recovery_hint}")

In [None]:
# Errors can be serialized for API responses
import json

error = DatasetNotFoundError("test-query")
error_dict = error.to_dict()
print(json.dumps(error_dict, indent=2))

### Error Types

| Error Class | Code | When It's Raised |
|------------|------|------------------|
| `DatasetNotFoundError` | DATASET_NOT_FOUND | Search/get returns no results |
| `AuthenticationError` | AUTH_FAILED | Globus/service auth fails |
| `DownloadError` | DOWNLOAD_FAILED | File download fails |
| `DataLoadError` | DATA_LOAD_FAILED | Cannot parse data file |
| `ValidationError` | VALIDATION_FAILED | Metadata validation error |
| `PublishError` | PUBLISH_FAILED | Publishing workflow fails |
| `CacheError` | CACHE_ERROR | Local cache issue |
| `ConfigurationError` | CONFIG_ERROR | Invalid config setting |
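
Stable error codes make it possible to react differently to each failure, for example retrying only transient download problems. The sketch below assumes errors expose a `code` attribute as shown earlier; `call_with_retry` and `FakeDownloadError` are stand-ins written for this example and are not part of Foundry.

```python
import time


def call_with_retry(operation, retryable_codes=("DOWNLOAD_FAILED",), attempts=3, delay_seconds=0.0):
    """Call operation(); retry only when the raised error carries a retryable `code`."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            code = getattr(exc, "code", None)
            # Re-raise anything non-retryable, and give up after the last attempt.
            if code not in retryable_codes or attempt == attempts:
                raise
            time.sleep(delay_seconds)


class FakeDownloadError(Exception):
    """Stand-in for illustration; not Foundry's DownloadError."""
    code = "DOWNLOAD_FAILED"


calls = {"count": 0}


def flaky_download():
    """Fail twice, then succeed, to simulate a transient network problem."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeDownloadError("temporary failure")
    return "ok"


outcome = call_with_retry(flaky_download)
print(outcome, calls["count"])
```

Errors such as `DATASET_NOT_FOUND` are deliberately left out of `retryable_codes`, since repeating the same request is unlikely to change the result.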

## 6. Complete Workflow Example

The function below walks through a complete workflow, from discovering a dataset to having training-ready data in hand. It stops short of fitting a model:

In [None]:
from foundry import Foundry
from foundry.errors import DatasetNotFoundError

def train_band_gap_model():
    """Complete workflow: discover -> understand -> load -> train."""
    
    f = Foundry()
    
    # 1. Discover
    print("1. Searching for datasets...")
    results = f.search("band gap", limit=5, as_json=True)
    
    if not results:
        raise DatasetNotFoundError("band gap")
    
    print(f"   Found {len(results)} datasets")
    
    # 2. Understand
    print("\n2. Getting dataset schema...")
    # Use the top search hit so the schema matches the dataset found in step 1
    dataset = f.search("band gap", limit=1).iloc[0].FoundryDataset
    schema = dataset.get_schema()
    
    print(f"   Dataset: {schema['name']}")
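    # Loop variables are named `field` and `split` to avoid shadowing the Foundry client `f`
    print(f"   Fields: {[field['name'] for field in schema['fields']]}")
    print(f"   Splits: {[split['name'] for split in schema['splits']]}")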
    
    # 3. Load (with schema for context)
    print("\n3. Loading data...")
    result = dataset.get_as_dict(include_schema=True)
    
    data = result['data']
    print(f"   Loaded splits: {list(data.keys())}")
    
    # 4. Train
    print("\n4. Ready to train!")
    if 'train' in data:
        # Assumes each split unpacks into (inputs, targets)
        X_train, y_train = data['train']
        print("   Training samples available")
    
    # 5. Cite
    print("\n5. Citation:")
    print(dataset.get_citation())
    
    return dataset

# Run it
try:
    ds = train_band_gap_model()
except Exception as e:
    print(f"Workflow failed: {e}")
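
The workflow above ends once the training split is loaded. As a rough sketch of what step 4 could continue with, the code below fits a one-feature least-squares line. The numbers are synthetic stand-ins and not real band gap data, and the actual structure of `X_train` and `y_train` depends on the dataset, so treat this as a placeholder for your own model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(ys, predictions):
    """Average absolute difference between targets and predictions."""
    return sum(abs(y - p) for y, p in zip(ys, predictions)) / len(ys)


# Synthetic stand-ins for a numeric feature and its target values.
X_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.5, 2.5, 3.5, 4.5]

slope, intercept = fit_line(X_train, y_train)
predictions = [slope * x + intercept for x in X_train]
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"training MAE={mean_absolute_error(y_train, predictions):.3f}")
```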

## Summary

### Publishing
```python
f.publish(metadata, data_path="./data", source_id="my_dataset_v1")
f.check_status("my_dataset_v1")
```

### HuggingFace Export
```python
from foundry.integrations.huggingface import push_to_hub
push_to_hub(dataset, "org/name", token="hf_xxx")
```

### CLI
```bash
foundry search "query"
foundry schema <doi>
foundry mcp install
```

### Error Handling
```python
from foundry.errors import DatasetNotFoundError
try:
    f.get_dataset(doi)
except DatasetNotFoundError as e:
    print(e.recovery_hint)
```

### Configuration
```python
# Default: HTTPS download (no Globus needed)
f = Foundry()

# For cloud environments (Colab, etc.)
f = Foundry(no_browser=True, no_local_server=True)

# For Globus transfers (large datasets, institutional endpoints)
f = Foundry(use_globus=True)
```
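
If the same notebook runs both locally and on Colab, the choice between these configurations can be made at runtime. The sketch below checks for the `google.colab` module, a common way to detect Colab. `foundry_kwargs` is a hypothetical helper, and combining `use_globus` with the cloud flags is an assumption that this tutorial does not confirm.

```python
import sys


def foundry_kwargs(use_globus=False, modules=None):
    """Pick Foundry() keyword arguments based on the runtime environment."""
    modules = sys.modules if modules is None else modules
    kwargs = {}
    # Colab has no local browser or callback server, so switch both off there.
    if "google.colab" in modules:
        kwargs.update(no_browser=True, no_local_server=True)
    if use_globus:
        kwargs["use_globus"] = True
    return kwargs


# Simulate both environments by passing a fake module table.
print(foundry_kwargs(modules={"google.colab": object()}))
print(foundry_kwargs(use_globus=True, modules={}))
```

The result would be used as `f = Foundry(**foundry_kwargs())`.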

---

**You've completed the Foundry tutorial!**

- Documentation: https://github.com/MLMI2-CSSI/foundry
- Issues: https://github.com/MLMI2-CSSI/foundry/issues