[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/02_working_with_data/working_with_data.ipynb)

---

# Working with Foundry Data

**Time:** 15 minutes  
**Prerequisites:** Completed quickstart  
**What you'll learn:**
- Understanding dataset schemas
- Loading specific splits
- Using data with PyTorch and TensorFlow
- Working with different data types
- JSON output for programmatic access

---

In [2]:
!pip install --upgrade "pyarrow>=16.1.0"


In [3]:
from foundry import Foundry

# HTTPS download is now the default
f = Foundry()

## 1. Understanding Dataset Schemas

Before loading a dataset, call `get_schema()` to inspect its fields, splits, and metadata.

In [None]:
# Get a dataset
results = f.search("band gap", limit=1)
dataset = results.iloc[0].FoundryDataset

# Get the schema
schema = dataset.get_schema()

print(f"Dataset: {schema['name']}")
print(f"Title: {schema['title']}")
print(f"DOI: {schema['doi']}")
print(f"Data Type: {schema['data_type']}")

In [None]:
# Examine fields (columns)
print("Fields:")
print("-" * 60)
for field in schema['fields']:
    role = field['role']  # 'input' or 'target'
    name = field['name']
    desc = field['description'] or 'No description'
    units = field['units'] or ''
    print(f"  [{role:6}] {name}: {desc} {f'({units})' if units else ''}")

In [None]:
# Examine splits (train/test/validation)
print("Splits:")
print("-" * 60)
for split in schema['splits']:
    print(f"  - {split['name']}: {split.get('type', 'data')}")
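
Once the schema is in hand, a common next step is to separate input columns from target columns. The sketch below assumes the schema behaves like a plain dict shaped as in the cells above (`fields` entries with `name` and `role` keys). `example_schema` is a made-up stand-in so the snippet runs without a download, and `fields_by_role` is a hypothetical helper, not part of Foundry.

```python
from collections import defaultdict

# Made-up stand-in for dataset.get_schema(); only the keys used in the
# cells above ('fields', 'name', 'role', 'units') are assumed to exist.
example_schema = {
    "fields": [
        {"name": "composition", "role": "input", "units": None},
        {"name": "temperature", "role": "input", "units": "K"},
        {"name": "band_gap", "role": "target", "units": "eV"},
    ]
}


def fields_by_role(schema):
    """Group field names by role (hypothetical helper, not a Foundry API)."""
    grouped = defaultdict(list)
    for field in schema.get("fields", []):
        # Fields without a role are collected under 'unknown'.
        grouped[field.get("role", "unknown")].append(field["name"])
    return dict(grouped)


roles = fields_by_role(example_schema)
print("Inputs: ", roles.get("input", []))
print("Targets:", roles.get("target", []))
```

With a real dataset, `fields_by_role(dataset.get_schema())` should work the same way if the schema object supports `.get()`; if it does not, convert it to a dict first.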

## 2. Loading Specific Splits

Pass `split` to `get_as_dict()` to return only the split you need.

In [5]:
# Load only training data
train_data = dataset.get_as_dict(split='train')
print(f"Training data keys: {train_data.keys() if isinstance(train_data, dict) else type(train_data)}")

Starting Download of: https://data.materialsdatafacility.org/foundry/foundry_g4mp2_solvation_v1.2/g4mp2_data.json
Downloading... 206.19 MB
Training data keys: dict_keys(['train'])


In [6]:
# Load all splits at once
all_data = dataset.get_as_dict()
print(f"All splits: {list(all_data.keys())}")

All splits: ['train']
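
This dataset exposes only a `train` split, so code that expects `test` or `validation` needs a fallback. The following is a minimal sketch, assuming `get_as_dict()` returns a mapping from split name to data as the output above suggests. `pick_split` and `fake_data` are hypothetical names introduced for illustration.

```python
def pick_split(data, preferred, fallback="train"):
    """Return (split_name, split_data), falling back when `preferred` is absent.

    Hypothetical helper; `data` is assumed to map split names to data,
    matching the dict_keys(['train']) output above.
    """
    if preferred in data:
        return preferred, data[preferred]
    if fallback in data:
        return fallback, data[fallback]
    raise KeyError(
        f"Neither {preferred!r} nor {fallback!r} found; available: {sorted(data)}"
    )


# Made-up stand-in for dataset.get_as_dict()
fake_data = {"train": {"x": [1, 2, 3], "y": [0, 1, 0]}}

name, split = pick_split(fake_data, "test")
print(f"Using split: {name}")
```

Raising `KeyError` when neither split exists keeps a missing split from silently turning into an empty training run.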


## 3. Loading with Schema Information

Pass `include_schema=True` to return the data and its schema in a single result. This is especially useful for programmatic or agent workflows.

In [None]:
# Get data with schema attached
result = dataset.get_as_dict(include_schema=True)

print(f"Result keys: {result.keys()}")
print(f"\nSchema name: {result['schema']['name']}")
print(f"Data splits: {list(result['data'].keys())}")
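
Having schema and data in one object makes simple consistency checks possible, such as confirming that every split the schema declares was actually loaded. The sketch below assumes the result has the `schema` and `data` keys used in the cell above and that each schema split carries a `name`. `missing_splits` and `fake_result` are hypothetical.

```python
def missing_splits(result):
    """List split names declared in the schema but absent from the data.

    Hypothetical helper; assumes `result` has 'schema' and 'data' keys
    and that each entry in schema['splits'] has a 'name'.
    """
    declared = [s["name"] for s in result["schema"].get("splits", [])]
    loaded = result["data"]
    return [name for name in declared if name not in loaded]


# Made-up stand-in for dataset.get_as_dict(include_schema=True)
fake_result = {
    "schema": {
        "name": "example_dataset",
        "splits": [{"name": "train"}, {"name": "test"}],
    },
    "data": {"train": {"x": [1, 2], "y": [0, 1]}},
}

print("Missing splits:", missing_splits(fake_result))
```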

## 4. PyTorch Integration

In [None]:
# Load as a PyTorch Dataset
try:
    torch_dataset = dataset.get_as_torch(split='train')
    
    # Use with DataLoader
    from torch.utils.data import DataLoader
    loader = DataLoader(torch_dataset, batch_size=32, shuffle=True)
    
    # Get a batch
    batch = next(iter(loader))
    print(f"Batch type: {type(batch)}")
    # DataLoader's default collation returns a list (not a tuple) for tuple samples
    print(f"Batch size: {len(batch[0]) if isinstance(batch, (list, tuple)) else batch.shape[0]}")
except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
except Exception as e:
    print(f"Could not load as PyTorch: {e}")

## 5. TensorFlow Integration

In [None]:
# Load as a TensorFlow Dataset
try:
    tf_dataset = dataset.get_as_tensorflow(split='train')
    
    # Batch and prefetch
    tf_dataset = tf_dataset.batch(32).prefetch(1)
    
    # Get a batch
    for batch in tf_dataset.take(1):
        print(f"Batch type: {type(batch)}")
except ImportError:
    print("TensorFlow not installed. Install with: pip install tensorflow")
except Exception as e:
    print(f"Could not load as TensorFlow: {e}")

## 6. JSON Output for Programmatic Access

Use `as_json=True` for agent-friendly output (lists of dicts instead of DataFrames).

In [None]:
# Search with JSON output
# as_json=True returns a list of dicts instead of a DataFrame
results_json = f.search("band gap", limit=3, as_json=True)

print(f"Type: {type(results_json)}")
print(f"Number of results: {len(results_json)}")

for ds in results_json:
    print(f"\n- {ds['name']}")
    print(f"  Title: {ds['title']}")
    print(f"  DOI: {ds['doi']}")
    print(f"  Fields: {ds.get('fields', [])}")

In [None]:
# List the first 5 datasets as JSON
import json

all_datasets = f.list(limit=5, as_json=True)
print(json.dumps(all_datasets[0], indent=2))
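
Because `as_json=True` yields plain lists of dicts, the results can be filtered and serialized with the standard library alone. The sketch below assumes each record carries the `name`, `title`, and `doi` keys used above. The records themselves are made-up placeholders, and `filter_by_title` is a hypothetical helper.

```python
import json

# Made-up records standing in for f.search(..., as_json=True);
# names, titles, and DOIs are placeholders, not real datasets.
fake_results = [
    {"name": "example_band_gap", "title": "Example band gap data", "doi": "10.0000/example-1"},
    {"name": "example_solvation", "title": "Example solvation energies", "doi": "10.0000/example-2"},
]


def filter_by_title(records, keyword):
    """Keep records whose title contains `keyword`, case-insensitively."""
    keyword = keyword.lower()
    # `or ""` guards against records whose title is None.
    return [r for r in records if keyword in (r.get("title") or "").lower()]


matches = filter_by_title(fake_results, "band gap")
payload = json.dumps(matches, indent=2)  # ready to log or pass to another tool
print(payload)
```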

## 7. Browsing the Catalog

In [None]:
# List the first 10 available datasets
catalog = f.list(limit=10)
catalog

In [None]:
# Get a specific dataset by DOI
# Replace with a real DOI from your search results
# dataset = f.get_dataset("10.18126/xyz")

## 8. Working with HDF5 Data

Some datasets use HDF5 format for large arrays.

In [None]:
# Load data keeping HDF5 format (for very large datasets)
# data_hdf5 = dataset.get_as_dict(as_hdf5=True)
# This returns h5py objects that load lazily
print("Use as_hdf5=True for lazy loading of large datasets")

## Summary

| Method | Use Case |
|--------|----------|
| `get_schema()` | Understand dataset structure before loading |
| `get_as_dict()` | General purpose loading |
| `get_as_dict(split='train')` | Load specific split |
| `get_as_dict(include_schema=True)` | Data + metadata together |
| `get_as_torch()` | PyTorch DataLoader compatible |
| `get_as_tensorflow()` | tf.data.Dataset compatible |
| `f.search(as_json=True)` | Programmatic/agent access |

**Next:** See `03_advanced_workflows.ipynb` for publishing, CLI, and agent integration.