[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MLMI2-CSSI/foundry/blob/main/examples/00_hello_foundry/hello_foundry.ipynb)

---

# Hello Foundry! 🚀

Welcome to Foundry-ML! This notebook will show you how to:

1. **Search** for materials science datasets
2. **Load** a dataset into Python
3. **Explore** the data

No domain expertise required - just Python basics!

## Step 1: Install Foundry

If you haven't already, install Foundry-ML:

In [1]:
# Uncomment the line below to install
# !pip install foundry-ml
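
To check whether the package is already present before installing, a small sketch like the one below may help. It uses only the Python standard library. The helper name `installed_version` is hypothetical, and the distribution name `foundry-ml` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper usage: report whether foundry-ml still needs installing.
found = installed_version("foundry-ml")
if found is None:
    print("foundry-ml is not installed; run: pip install foundry-ml")
else:
    print(f"foundry-ml {found} is installed")
```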

## Step 2: Import and Connect

First, import Foundry and create a client. If you're running this in Google Colab or another cloud environment, pass `no_browser=True` and `no_local_server=True` when creating the client.

In [2]:
from foundry import Foundry

# Create a Foundry client (uses HTTPS download by default)
# For cloud environments (Colab, etc.), add: no_browser=True, no_local_server=True
f = Foundry()
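
One way to avoid editing the cell by hand is to pick the client arguments based on the environment. The sketch below is only an illustration: `foundry_client_kwargs` is a hypothetical helper, and checking `sys.modules` for `google.colab` is a common heuristic rather than an official detection method. The two flags are the ones named in the comment above.

```python
import sys


def foundry_client_kwargs() -> dict:
    """Return keyword arguments for Foundry() suited to the current environment."""
    # Heuristic (assumption): Colab kernels have the google.colab module loaded.
    in_colab = "google.colab" in sys.modules
    if in_colab:
        return {"no_browser": True, "no_local_server": True}
    return {}


kwargs = foundry_client_kwargs()
print(f"Client arguments: {kwargs}")

# With foundry-ml installed, the client would then be created as:
# from foundry import Foundry
# f = Foundry(**kwargs)
```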

## Step 3: Search for Datasets

Let's search for datasets. You can search by keyword - no need to know the exact name!

In [3]:
# Search for datasets related to "band gap" (a property in materials science)
results = f.search("band gap", limit=5)

# Display the results - Foundry shows a nice table in notebooks!
results

index,dataset_name,title,year,DOI
0,foundry_g4mp2_solvation_v1.2,DFT Estimates of Solvation Energy in Multiple ...,root=2022,10.18126/jos5-wj65
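
Because the results are indexed with `.iloc` in the next step, they appear to behave like a pandas DataFrame. Assuming that holds, ordinary pandas filtering should work on them. The sketch below uses a stand-in table (`results_demo`, a hypothetical name) built from the output above, so it runs without a Foundry connection.

```python
import pandas as pd

# Stand-in for the search results, copied from the output shown above.
# A real `results` object also carries a FoundryDataset column, omitted here.
results_demo = pd.DataFrame(
    {
        "dataset_name": ["foundry_g4mp2_solvation_v1.2"],
        "title": ["DFT Estimates of Solvation Energy in Multiple ..."],
        "year": ["root=2022"],
        "DOI": ["10.18126/jos5-wj65"],
    }
)

# Keep only rows whose title mentions a keyword (case-insensitive).
solvation = results_demo[results_demo["title"].str.contains("solvation", case=False)]

# Look up a single row by its DOI.
by_doi = results_demo.loc[results_demo["DOI"] == "10.18126/jos5-wj65"]

print(solvation[["dataset_name", "DOI"]])
print(f"Rows matching the DOI: {len(by_doi)}")
```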


## Step 4: Get a Dataset

Pick a dataset from the search results. You can access it by name or DOI.

In [4]:
# Get the first dataset from our search results
dataset = results.iloc[0].FoundryDataset

# Display dataset info
dataset

field,value
short_name,g4mp2_solvation
data_type,tabular
task_type,supervised
domain,"materials science, chemistry"
n_items,130258.0
splits,(see the splits table below)
keys,(see the keys table below)

field,value
type,train
path,g4mp2_data.json
label,train

key,type,filter,description,units,classes
smiles_0,input,,Input SMILES string,,
smiles_1,input,,SMILES string after relaxation,,
inchi_0,input,,InChi after generating coordinates with CORINA,,
inchi_1,input,,InChi after relaxation,,
xyz,input,,InChi after relaxation,XYZ coordinates after relaxation,
atomic_charges,input,,"Atomic charges on each atom, as predicted from B3LYP",,
A,input,,"Rotational constant, A",GHz,
B,input,,"Rotational constant, B",GHz,
C,input,,"Rotational constant, C",GHz,
inchi_1,input,,InChi after relaxation,,
n_electrons,input,,Number of electrons,,
n_heavy_atoms,input,,Number of non-hydrogen atoms,,
n_atom,input,,Number of atoms in molecule,,
mu,input,,Dipole moment,D,
alpha,input,,Isotropic polarizability,a_0^3,
R2,input,,Electronic spatial extant,a_0^2,
cv,input,,Heat capacity at 298.15K,cal/mol-K,
g4mp2_hf298,target,,"G4MP2 Standard Enthalpy of Formation, 298K",kcal/mol,
bandgap,input,,B3LYP Band gap energy,Ha,
homo,input,,B3LYP Energy of HOMO,Ha,
lumo,input,,B3LYP Energy of LUMO,Ha,
zpe,input,,B3LYP Zero point vibrational energy,Ha,
u0,input,,B3LYP Internal energy at 0K,Ha,
u,input,,B3LYP Internal energy at 298.15K,Ha,
h,input,,B3LYP Enthalpy at 298.15K,Ha,
u0_atom,input,,B3LYP atomization energy at 0K,Ha,
g,input,,B3LYP Free energy at 298.15K,Ha,
g4mp2_0k,target,,G4MP2 Internal energy at 0K,Ha,
g4mp2_energy,target,,G4MP2 Internal energy at 298.15K,Ha,
g4mp2_enthalpy,target,,G4MP2 Enthalpy at 298.15K,Ha,
g4mp2_free,target,,G4MP2 Free eergy at 0K,Ha,
g4mp2_atom,target,,G4MP2 atomization energy at 0K,Ha,
sol_acetone,target,,"Solvation energy, acetone",kcal/mol,
sol_acn,target,,"Solvation energy, acetonitrile",kcal/mol,
sol_dmso,target,,"Solvation energy, dimethyl sulfoxide",kcal/mol,
sol_ethanol,target,,"Solvation energy, ethanol",kcal/mol,
sol_water,target,,"Solvation energy, water",kcal/mol,


## Step 5: Understand the Schema

Before loading data, let's see what fields it contains:

In [5]:
# Get the schema - what columns/fields are in this dataset?
schema = dataset.get_schema()

print(f"Dataset: {schema['name']}")
print(f"Data Type: {schema['data_type']}")
print(f"\nSplits: {[s['name'] for s in schema['splits']]}")
print(f"\nFields:")
for field in schema['fields']:
    print(f"  - {field['name']} ({field['role']}): {field['description'] or 'No description'}")

Dataset: foundry_g4mp2_solvation_v1.2
Data Type: tabular

Splits: ['train']

Fields:
  - smiles_0 (input): Input SMILES string
  - smiles_1 (input): SMILES string after relaxation
  - inchi_0 (input): InChi after generating coordinates with CORINA
  - inchi_1 (input): InChi after relaxation
  - xyz (input): InChi after relaxation
  - atomic_charges (input): Atomic charges on each atom, as predicted from B3LYP
  - A (input): Rotational constant, A
  - B (input): Rotational constant, B
  - C (input): Rotational constant, C
  - inchi_1 (input): InChi after relaxation
  - n_electrons (input): Number of electrons
  - n_heavy_atoms (input): Number of non-hydrogen atoms
  - n_atom (input): Number of atoms in molecule
  - mu (input): Dipole moment
  - alpha (input): Isotropic polarizability
  - R2 (input): Electronic spatial extant
  - cv (input): Heat capacity at 298.15K
  - g4mp2_hf298 (target): G4MP2 Standard Enthalpy of Formation, 298K
  - bandgap (input): B3LYP Band gap energy
  - homo (i... (output truncated)
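
With this many fields, it can help to group them by role before deciding what to model. The sketch below works on a stand-in dictionary (`schema_demo`, a hypothetical name) that copies the key names used in the cell above and a few of the printed fields. It illustrates the idea only and does not call Foundry.

```python
from collections import defaultdict

# Stand-in for the dictionary returned by dataset.get_schema(), trimmed to a
# few fields from the output above. Key names mirror those used in the cell.
schema_demo = {
    "name": "foundry_g4mp2_solvation_v1.2",
    "data_type": "tabular",
    "splits": [{"name": "train"}],
    "fields": [
        {"name": "smiles_0", "role": "input", "description": "Input SMILES string"},
        {"name": "bandgap", "role": "input", "description": "B3LYP Band gap energy"},
        {
            "name": "g4mp2_hf298",
            "role": "target",
            "description": "G4MP2 Standard Enthalpy of Formation, 298K",
        },
    ],
}

# Collect field names under their role ("input" or "target").
fields_by_role = defaultdict(list)
for field in schema_demo["fields"]:
    fields_by_role[field["role"]].append(field["name"])

for role, names in fields_by_role.items():
    print(f"{role}: {names}")
```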

## Step 6: Load the Data

Now let's load the actual data. Foundry handles downloading and caching automatically!

In [6]:
# Load data as a dictionary
data = dataset.get_as_dict()

# See what we got
print(f"Data keys: {data.keys()}")

Data keys: dict_keys(['train'])


In [7]:
# For ML datasets, data is typically split into inputs (X) and targets (y)
# Let's explore the training split
if 'train' in data:
    train_data = data['train']
    print(f"Training data type: {type(train_data)}")

    # If it's a tuple of (inputs, targets), unpack it
    if isinstance(train_data, tuple) and len(train_data) == 2:
        X, y = train_data
        print(f"\nInputs (X): {type(X)}")
        print(f"Targets (y): {type(y)}")

Training data type: <class 'tuple'>

Inputs (X): <class 'pandas.core.frame.DataFrame'>
Targets (y): <class 'pandas.core.frame.DataFrame'>
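
Since both halves of the tuple are pandas DataFrames, the usual pandas tools apply once they are unpacked. The sketch below uses small stand-in frames so it runs offline. The column names come from the schema above, but the numbers are made up purely for illustration and are not values from the dataset.

```python
import pandas as pd

# Stand-in for data['train']: a tuple of (inputs, targets) DataFrames.
# Column names follow the schema; the values are invented for illustration.
X_demo = pd.DataFrame({"n_atom": [5, 8, 11], "n_electrons": [10, 18, 26]})
y_demo = pd.DataFrame({"sol_water": [-1.0, -2.5, -4.0]})
train_demo = (X_demo, y_demo)

X, y = train_demo
print(f"Inputs shape: {X.shape}, targets shape: {y.shape}")

# Join the inputs with one target column for quick inspection.
table = X.join(y["sol_water"])
print(table.describe().loc[["mean", "min", "max"]])
```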


## Step 7: Use with Your Favorite ML Framework

Foundry datasets can be loaded as PyTorch or TensorFlow dataset objects with `get_as_torch` and `get_as_tensorflow`:

In [8]:
# For PyTorch users:
# torch_dataset = dataset.get_as_torch(split='train')
# from torch.utils.data import DataLoader
# loader = DataLoader(torch_dataset, batch_size=32)

# For TensorFlow users:
# tf_dataset = dataset.get_as_tensorflow(split='train')
# model.fit(tf_dataset)

print("Foundry works with PyTorch and TensorFlow out of the box!")

Foundry works with PyTorch and TensorFlow out of the box!
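
For readers new to these frameworks, the `batch_size=32` argument above means the loader hands the model small shuffled chunks of data rather than the whole table at once. The framework-free sketch below shows that idea with NumPy. The function `iter_batches` is hypothetical and is not how Foundry, PyTorch, or TensorFlow implement batching.

```python
import numpy as np


def iter_batches(features, targets, batch_size, seed=0):
    """Yield shuffled (features, targets) mini-batches as NumPy arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))  # shuffled row indices
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], targets[idx]


# Toy arrays standing in for a dataset of 10 rows with 2 input columns.
features = np.arange(20, dtype=np.float32).reshape(10, 2)
targets = np.arange(10, dtype=np.float32)

for xb, yb in iter_batches(features, targets, batch_size=4):
    print(xb.shape, yb.shape)
```

The last batch is smaller whenever the row count is not a multiple of the batch size.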


## Step 8: Get the Citation

When you use a dataset in research, cite it properly!

In [9]:
# Get BibTeX citation
citation = dataset.get_citation()
print(citation)

@misc{https://doi.org/10.18126/jos5-wj65
doi = {10.18126/jos5-wj65}
url = {https://doi.org/10.18126/jos5-wj65}
author = {Ward, Logan and Dandu, Naveen and Blaiszik, Ben and Narayanan, Badri and Assary, Rajeev S. and Redfern, Paul C. and Foster, Ian and Curtiss, Larry A.}
title = {DFT Estimates of Solvation Energy in Multiple Solvents}
keywords = {machine learning, foundry}
publisher = {Materials Data Facility}
year = {root=2022}}
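
The entry printed above has no commas after the entry key or between fields, which BibTeX syntax requires. A possible workaround, sketched below with the hypothetical helper `add_field_commas`, is to add them before saving the entry to a `.bib` file. The sample string is a shortened copy of the output above, and the `year` value is left exactly as printed.

```python
import tempfile
from pathlib import Path


def add_field_commas(entry: str) -> str:
    """Append a comma to every line of a BibTeX entry except the last."""
    lines = entry.strip().splitlines()
    fixed = [
        line.rstrip() if line.rstrip().endswith(",") else line.rstrip() + ","
        for line in lines[:-1]
    ]
    return "\n".join(fixed + [lines[-1]])


# Shortened copy of the citation printed above (one field per line).
citation_demo = """@misc{https://doi.org/10.18126/jos5-wj65
doi = {10.18126/jos5-wj65}
url = {https://doi.org/10.18126/jos5-wj65}
title = {DFT Estimates of Solvation Energy in Multiple Solvents}
year = {root=2022}}"""

# Write the repaired entry to a temporary .bib file and show it.
bib_path = Path(tempfile.mkdtemp()) / "references.bib"
bib_path.write_text(add_field_commas(citation_demo) + "\n", encoding="utf-8")
print(bib_path.read_text(encoding="utf-8"))
```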


## 🎉 That's It!

You've just:
- Connected to Foundry
- Searched for datasets
- Explored the schema
- Loaded data into Python
- Retrieved a BibTeX citation

### Next Steps

- Explore more datasets with `f.list()`
- Check out other examples in the `/examples` folder
- Use the CLI: `foundry search "your query"`
- Read the docs: https://github.com/MLMI2-CSSI/foundry

Happy researching! 🔬