# üöÄ Machine Learning Potential Workshop

## Create Your Own ML Potential from DFT Data

Welcome to this comprehensive hands-on workshop, where you will build machine learning potentials from first principles!

### üéØ What You'll Learn:
- **Process crystallographic files** (CIF ‚Üí DFT inputs).
- **Convert DFT calculations** into ML training datasets.
- **Train MACE potentials** using state-of-the-art methods.
- **Run molecular dynamics** with your custom models.
- **Analyze and validate** your results.

### üèóÔ∏è Workflow Overview:

CIF File ‚Üí DFT Calculations ‚Üí JSON Dataset ‚Üí HDF5 Format ‚Üí MACE Training ‚Üí MD Simulations.

### üìã Complete Workflow:
| Step | Task | Duration | Output |
|------|------|----------|--------|
| 1 | **Setup & Dependencies** | 5 min | Environment |
| 2 | **Download Materials** | 3 min | Workshop files |
| 3 | **Install Requirements** | 5 min | Python packages |
| 4 | **Process Crystal Structure** | 2 min | DFT inputs |
| 5 | **Generate Training Data** | 3 min | JSON dataset |
| 6 | **Data Analysis & Filtering** | 5 min | Clean dataset |
| 7 | **Convert to MACE Format** | 2 min | HDF5 files |
| 8 | **Configure Training** | 3 min | Config file |
| 9 | **Train ML Potential** | 15 min | MACE model |
| 10 | **Run Simulations** | 10 min | MD results |

---

# üîß Phase 0: Installation & Setup

## üì¶ Step 1: Packages for downloads

Let's start by installing the essential packages needed for downloading and accessing our workshop materials.

### üîß What we are installing:
- **gdown**: For Google Drive file downloads.
- **Essential Python libraries**: For file handling and setup.

> üìù **Run the cell**.

In [2]:
# Install essential packages for downloading workshop materials
print("üîß Installing essential packages for setup...")
!pip install -q gdown  # For Google Drive downloads

# Import essential libraries for setup
import os
import sys
from pathlib import Path
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Check if we're in Colab
try:
    from google.colab import files
    IN_COLAB = True
    print("üîç Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("üîç Running in local Jupyter environment")

print("‚úÖ Setup packages installed successfully!")
print("üéØ Ready to download workshop materials!")

üîß Installing essential packages for setup...
üîç Running in Google Colab
‚úÖ Setup packages installed successfully!
üéØ Ready to download workshop materials!


## üîê Step 2: Access Workshop Materials

Now we will download the complete workshop package from the instructor's repository. This includes:

### üìÅ What's included:
- **AMLP toolkit** (amlpt.py, amlpa.py).
- **Sample crystal structures** (CIF files).
- **Pre-computed DFT data** (for demonstration).
- **Configuration templates**.
- **Requirements and dependencies**.

‚è≥ **Note**: This download may take 2-3 minutes depending on your connection.

> üìù **Run the cell**.

In [2]:
# Download workshop materials from Google Drive
import gdown

folder_url = "https://drive.google.com/drive/folders/1SshMWEXiKEztO2BKOCEYRf_pMIx3WqhG?usp=sharing"

print("üì• Downloading complete workshop package...")
print("‚è≥ This may take a few minutes...")

try:
    # Create workshop_materials directory if it doesn't exist
    os.makedirs("workshop_materials", exist_ok=True)

    gdown.download_folder(
        url=folder_url,
        output="workshop_materials",
        use_cookies=True,
        quiet=False
    )
    print()
    print("‚úÖ Workshop materials downloaded successfully!")
except Exception as e:
    print(f"‚ùå Download failed: {e}")
    print("üîÑ Please check your internet connection and try again.")
    print("üí° Alternative: Upload the workshop materials manually if provided separately.")

üì• Downloading complete workshop package...
‚è≥ This may take a few minutes...


Retrieving folder contents


Processing file 1Ejz51Xdypavwv2S_D4jgqQw0KQSEDnKj AMLP_workshop.zip
Processing file 1g0swq0nguIRP_wl-f6t3N3IYpY_4ILIO MACE-OFF23_small.model


Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From (original): https://drive.google.com/uc?id=1Ejz51Xdypavwv2S_D4jgqQw0KQSEDnKj
From (redirected): https://drive.google.com/uc?id=1Ejz51Xdypavwv2S_D4jgqQw0KQSEDnKj&confirm=t&uuid=b18f4ec3-8ea9-4e58-a929-17e8bdf24d53
To: /content/workshop_materials/AMLP_workshop.zip
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1.51G/1.51G [00:17<00:00, 84.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1g0swq0nguIRP_wl-f6t3N3IYpY_4ILIO
To: /content/workshop_materials/MACE-OFF23_small.model
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 7.35M/7.35M [00:00<00:00, 35.6MB/s]


‚úÖ Workshop materials downloaded successfully!



Download completed


### üìÇ Extract Workshop Archive

Let's extract the downloaded archive and verify its contents.

> üìù **Run the cell**.

In [3]:
import os

# Check download and extract archive
print("üìã Checking downloaded files...")

# Define the path to the zip file
archive_path = "workshop_materials/AMLP_workshop.zip"

# Check if the .zip file exists
if os.path.exists(archive_path):
    # Display file details
    !ls -lh {archive_path}

    print("\nüì¶ Extracting workshop archive...")
    # Ensure destination directory exists
    !mkdir -p workshop_materials
    # Extract the zip archive
    !unzip -o {archive_path} -d workshop_materials/

    print("\n‚úÖ Archive extracted successfully!")
else:
    print(f"‚ùå Archive not found at {archive_path}")
    print("üîç Checking for alternative locations...")
    !find workshop_materials -name "*.tar.gz" -o -name "*.zip" 2>/dev/null || echo "No archives found"

üìã Checking downloaded files...
-rw-r--r-- 1 root root 1.5G Jun 10 22:53 workshop_materials/AMLP_workshop.zip

üì¶ Extracting workshop archive...
Archive:  workshop_materials/AMLP_workshop.zip
   creating: workshop_materials/AMLP_workshop/
  inflating: workshop_materials/__MACOSX/._AMLP_workshop  
   creating: workshop_materials/AMLP_workshop/workshop_cif/
  inflating: workshop_materials/__MACOSX/AMLP_workshop/._workshop_cif  
   creating: workshop_materials/AMLP_workshop/AMLP-v0.2/
  inflating: workshop_materials/__MACOSX/AMLP_workshop/._AMLP-v0.2  
   creating: workshop_materials/AMLP_workshop/tools/
  inflating: workshop_materials/__MACOSX/AMLP_workshop/._tools  
  inflating: workshop_materials/AMLP_workshop/.DS_Store  
  inflating: workshop_materials/__MACOSX/AMLP_workshop/._.DS_Store  
   creating: workshop_materials/AMLP_workshop/workshop_DFT/
  inflating: workshop_materials/__MACOSX/AMLP_workshop/._workshop_DFT  
  inflating: workshop_materials/AMLP_workshop/workshop_cif/PYRZ

## ‚öôÔ∏è Step 3: Install Workshop Requirements

Now we'll install all the specialized packages needed for machine learning potential development.

### üìö Key packages being installed:
- **MACE**: Machine learning potential framework.
- **ASE**: Atomic Simulation Environment.
- **PyTorch**: Deep learning backend.
- **Scientific computing libraries**: NumPy, SciPy, Matplotlib.
- **File format handlers**: HDF5, JSON processors.

‚è≥ **Note**: This installation may take 5-10 minutes and will install many dependencies.

> üìù **Run the cell**.

In [4]:
# Install workshop requirements
from pathlib import Path
import sys

# Find requirements file
possible_req_files = [
    Path('workshop_materials/AMLP_workshop/AMLP-v0.2/requirements.txt'),
    Path('workshop_materials/requirements.txt'),
    Path('requirements.txt')
]

req_file = None
for file_path in possible_req_files:
    if file_path.exists():
        req_file = file_path
        break

if req_file:
    print(f"üìç Using requirements file: {req_file}")

    # Show what we're about to install
    print("\nüìã Preview of requirements:")
    with open(req_file, 'r') as f:
        lines = f.readlines()[:10]  # Show first 10 lines
        for line in lines:
            print(f"  {line.strip()}")
        if len(f.readlines()) > 10:
            print("  ... (and more)")

    # Install requirements
    print("\nüì¶ Installing requirements...")
    !pip install -r "{req_file}"

else:
    print("‚ùå No requirements file found. Installing essential packages manually...")
    essential_packages = [
        "torch",
        "ase",
        "numpy",
        "scipy",
        "matplotlib",
        "h5py",
        "mace-torch"
    ]
    for package in essential_packages:
        print(f"Installing {package}...")
        !pip install {package}

# Verify key installations
print("\nüîç Verifying key installations...")
try:
    import torch
    print(f"‚úÖ PyTorch {torch.__version__} installed")
except ImportError:
    print("‚ùå PyTorch installation failed")

try:
    import ase
    print(f"‚úÖ ASE {ase.__version__} installed")
except ImportError:
    print("‚ùå ASE installation failed")

try:
    import mace
    print("‚úÖ MACE installed")
except ImportError:
    print("‚ùå MACE installation failed")

print("\nüéØ Environment setup complete!")
print("üöÄ Ready to start the ML potential workflow!")

üìç Using requirements file: workshop_materials/AMLP_workshop/AMLP-v0.2/requirements.txt

üìã Preview of requirements:
  # Core dependencies
  numpy==2.0.0
  openai>=1.0.0
  tenacity>=8.0.0
  pyyaml>=6.0
  ase>=3.22.0  # Atomic Simulation Environment
  torch
  scipy
  mace
  mace-torch

üì¶ Installing requirements...
Collecting numpy==2.0.0 (from -r workshop_materials/AMLP_workshop/AMLP-v0.2/requirements.txt (line 2))
  Downloading numpy-2.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m60.9/60.9 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
Collecting ase>=3.22.0 (from -r workshop_materials/AMLP_workshop/AMLP-v0.2/requirements.txt (line 6))
  Downloading ase-3.25.0-py3-none-any.whl.metadata (4.2 kB)
Collecting mace (from -r workshop_materials/AMLP_workshop/AMLP-v0.2/requirements.txt (line 9))
  Downloading MA


üîç Verifying key installations...
‚úÖ PyTorch 2.6.0+cu124 installed
‚úÖ ASE 3.25.0 installed
‚úÖ MACE installed

üéØ Environment setup complete!
üöÄ Ready to start the ML potential workflow!


---

# üß™ Phase 1: Data Preparation

Now that our environment is ready, let's begin the actual development of the ML potential. We will start by processing our crystal structure and generating the necessary input files.

## üèóÔ∏è Step 4: Process Crystal Structure

We will use crystal structures as our example system. This step will:
- Load the CIF (Crystallographic Information File).
- Generate DFT input files.
- Prepare the structure for calculations.

### üî¨ About these crystals:
They are organic crystals that serve as an excellent example for ML potential development due to its:
- **Moderate complexity**: Not too simple, not too complex.
- **Interesting dynamics**: Molecular motions and interactions.
- **Computational feasibility**: Reasonable size for workshop timing.


> üìù **Change the name of the crystal file to the one of your computational sheet and run the cell.**

> üìù **Choose the input generation mode of the machine learning potential research system and specify the DFT parameters from your computational sheet.**

In [3]:
import os
# Process the crystal structure using AMLP toolkit
print("üß™ Processing your crystal crystal structure...")

name_of_your_crystal_file = "BENZAC24"  # MAKE SURE TO CHANGE THIS NAME.

# Find and copy CIF file
cif_sources = [
    f"workshop_materials/AMLP_workshop/workshop_cif/{name_of_your_crystal_file}.cif",
    f"workshop_materials/workshop_cif/{name_of_your_crystal_file}.cif",
    f"{name_of_your_crystal_file}.cif"
]

cif_found = False
for cif_path in cif_sources:
    if os.path.exists(cif_path):
        print(f"üìÅ Found CIF file at: {cif_path}")
        !cp "{cif_path}" .
        cif_found = True
        break

if not cif_found:
    print("‚ùå CIF file not found. Searching for available CIF files...")
    !find . -name "*.cif" 2>/dev/null || echo "No CIF files found"
    print("üí° Please ensure the workshop materials were downloaded correctly.")
else:
    print(f"‚úÖ {name_of_your_crystal_file}.cif copied successfully!")

# Find and run AMLP processor
amlpt_paths = [
    "workshop_materials/AMLP_workshop/AMLP-v0.2/amlpt.py",
    "workshop_materials/amlpt.py",
    "amlpt.py"
]

amlpt_found = False
for amlpt_path in amlpt_paths:
    if os.path.exists(amlpt_path):
        print(f"\n‚öôÔ∏è Running AMLP structure processor: {amlpt_path}")
        print("   This will generate DFT input files and setup directories.")
        !python3 "{amlpt_path}"
        amlpt_found = True
        break

if not amlpt_found:
    print("‚ùå AMLP processor not found. Searching...")
    !find . -name "amlpt.py" 2>/dev/null || echo "amlpt.py not found"
    print("üí° Continuing with manual structure processing...")

print("\nüéØ Crystal structure processing step completed!")

üß™ Processing your crystal crystal structure...
üìÅ Found CIF file at: workshop_materials/AMLP_workshop/workshop_cif/BENZAC24.cif
‚úÖ BENZAC24.cif copied successfully!

‚öôÔ∏è Running AMLP structure processor: workshop_materials/AMLP_workshop/AMLP-v0.2/amlpt.py
   This will generate DFT input files and setup directories.
2025-06-27 17:09:42,475 - __main__ - INFO - Initializing Multi-Agent DFT Research System


Please select an operation mode:
1. AI-agent feedback (summaries & reports)
2. Input generation (CP2K/VASP/Gaussian)
3. Output processing (extract forces, energies, coordinates)
4. ML potential dataset creation (JSON to MACE HDF5)
5. AIMD processing (JSON to CP2K/VASP AIMD inputs)
6. VASP MD input generation

Enter your choice (1/2/3/4/5/6): 2

==== Supercell Configuration ====
Create a supercell? (y/n) [n]: y

Enter supercell dimensions as multipliers for each axis:
Multiplier for x-axis [1]: 1
Multiplier for y-axis [1]: 1
Multiplier for z-axis [1]: 1

Supercell dimensions: 1

## üìä Step 5: Load Pre-computed DFT Data

For this workshop, we'll use pre-computed DFT (Density Functional Theory) calculations to save time.

### üî¨ What DFT calculations provide:
- **Accurate energies** for different configurations.
- **Atomic forces** for training the ML model.
- **Electronic structure information**.
- **Reference data** for validation.

### ‚è±Ô∏è Why we use pre-computed data:
- DFT calculations can take **hours to days** per structure.
- We need **hundreds of configurations** for good ML models.
- Workshop time constraints require pre-computed datasets.

‚è≥ **In practice**: You would run these calculations on HPC clusters over several days or weeks.

> üìù **Run the cell to copy over the pre-computed DFT data.**

In [4]:
# Copy pre-computed DFT data
print("üìä Loading pre-computed DFT data...")
print(f"üîÑ Copying DFT calculations for {name_of_your_crystal_file}...")

# Create output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# Find DFT data
dft_sources = [
    f"workshop_materials/AMLP_workshop/workshop_DFT/{name_of_your_crystal_file}",
    f"workshop_materials/workshop_DFT/{name_of_your_crystal_file}",
    f"workshop_DFT/{name_of_your_crystal_file}"
]

dft_found = False
for dft_path in dft_sources:
    if os.path.exists(dft_path):
        print(f"üìÅ Found DFT data at: {dft_path}")
        !cp -rf "{dft_path}" output/
        dft_found = True
        break

if not dft_found:
    print("‚ùå DFT data not found. Searching for available DFT directories...")
    !find . -name "*DFT*" -type d 2>/dev/null || echo "No DFT directories found"
    print("üí° Please ensure the workshop materials were downloaded correctly.")
else:
    print("‚úÖ DFT data loaded successfully!")

    print("\nüìÅ Checking loaded data structure:")
    output_path = f"output/{name_of_your_crystal_file}/"
    if os.path.exists(output_path):
        !ls -la "{output_path}"
    else:
        print(f"‚ö†Ô∏è Output directory not found: {output_path}")

print("\nüéØ Ready to process DFT outputs into training data!")

üìä Loading pre-computed DFT data...
üîÑ Copying DFT calculations for BENZAC24...
üìÅ Found DFT data at: workshop_materials/AMLP_workshop/workshop_DFT/BENZAC24
‚úÖ DFT data loaded successfully!

üìÅ Checking loaded data structure:
total 44
drwxr-xr-x 8 root root 4096 Jun 27 17:11 .
drwxr-xr-x 3 root root 4096 Jun 27 17:11 ..
-rw-r--r-- 1 root root  227 Jun 27 17:11 INCAR
-rw-r--r-- 1 root root   48 Jun 27 17:11 KPOINTS
drwxr-xr-x 2 root root 4096 Jun 27 17:11 opt
-rw-r--r-- 1 root root 3368 Jun 27 17:11 POSCAR
drwxr-xr-x 2 root root 4096 Jun 27 17:11 T300
drwxr-xr-x 2 root root 4096 Jun 27 17:11 T400
drwxr-xr-x 2 root root 4096 Jun 27 17:11 T500
drwxr-xr-x 2 root root 4096 Jun 27 17:11 T600
drwxr-xr-x 2 root root 4096 Jun 27 17:11 xyz_checks

üéØ Ready to process DFT outputs into training data!


## üîÑ Step 6: Convert DFT Output to JSON Dataset

Now we will process the DFT calculation outputs and convert them into a structured JSON format suitable for machine learning.

### üìã Data conversion process:
1. **Parse DFT output files** (VASP format).
2. **Extract energies and forces** for each configuration.
3. **Structure data** in machine-readable JSON format.
4. **Organize by temperature** for different sampling conditions.

### üå°Ô∏è Temperature sampling:
Our DFT data includes calculations at different temperatures to capture:
- **Low T**: Ground state and near-equilibrium configurations.
- **Medium T**: Thermal fluctuations and normal dynamics.
- **High T**: Rare events and high-energy configuration.

> üìù **Run the cell.**

> üìù **Choose the output processing mode of the machine learning potential research system to process the various VASP files in the output directory in batch mode.**

In [5]:
# Convert DFT outputs to JSON training data
print("üîÑ Converting DFT outputs to JSON training dataset...")
print("üìä Processing energies, forces, and structural data...")

# Try to run AMLP processor again for JSON conversion
for amlpt_path in amlpt_paths:
    if os.path.exists(amlpt_path):
        print(f"\n‚öôÔ∏è Running AMLP data converter: {amlpt_path}")
        !python3 "{amlpt_path}"
        break
else:
    print("‚ö†Ô∏è AMLP processor not found. Attempting manual JSON conversion...")

    # Manual JSON conversion if AMLP tools are not available
    print("üîß Creating manual JSON conversion script...")

    with open("manual_json_converter.py", "w") as f:
        f.write(json_converter)

    !python manual_json_converter.py

print("\n‚úÖ JSON dataset generation completed!")

print("\nüìÅ Moving all your JSON:")

!mkdir output/{name_of_your_crystal_file}/JSON
!mv output/{name_of_your_crystal_file}/*/*json output/{name_of_your_crystal_file}/JSON

print("\nüéØ Ready for data analysis!")


üîÑ Converting DFT outputs to JSON training dataset...
üìä Processing energies, forces, and structural data...

‚öôÔ∏è Running AMLP data converter: workshop_materials/AMLP_workshop/AMLP-v0.2/amlpt.py
2025-06-27 17:12:00,081 - __main__ - INFO - Initializing Multi-Agent DFT Research System


Please select an operation mode:
1. AI-agent feedback (summaries & reports)
2. Input generation (CP2K/VASP/Gaussian)
3. Output processing (extract forces, energies, coordinates)
4. ML potential dataset creation (JSON to MACE HDF5)
5. AIMD processing (JSON to CP2K/VASP AIMD inputs)
6. VASP MD input generation

Enter your choice (1/2/3/4/5/6): 3

==== DFT Output Processing ====

Available DFT codes:
1. CP2K
2. VASP (Enhanced - processes all optimization steps)
3. Gaussian

Select DFT code (1/2/3): 2

==== Enhanced VASP Output Processing ====

Processing modes:
1. Single directory (process one VASP calculation)
2. Batch mode (process all VASP calculations in a parent directory)

Select mode (1/2) [2]:

---

# üìà Phase 2: Data Analysis & Preparation

Before training our ML model, we need to understand and prepare our dataset.

### üéØ Key questions to address:
1. What information do we have (energies, forces, structures)?
2. How is the data distributed (energy/force ranges)?
3. Are there outliers or problematic data points?
4. How should we combine different temperature datasets?

### üìä Data analysis workflow:
- **Collect** all JSON files from different temperatures.
- **Merge** data into a single comprehensive dataset.
- **Shuffle** for better training statistics.
- **Visualize** energy and force distributions.
- **Identify** potential issues or outliers.

## üîç Step 7: Dataset Analysis & Merging

Let's start by merging and shuffling the JSON datasets from all temperatures.

> üìù **Run the cell.**

In [6]:
# Merge and shuffle all JSON datasets
print("üìä Merging JSON datasets from all temperatures...")

import json
import random
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

# Create JSON collection directory
json_dir = Path(f"output/{name_of_your_crystal_file}/JSON")
json_dir.mkdir(parents=True, exist_ok=True)

# Collect all JSONs in one directory
print("üîç Searching for JSON files...")
json_sources = [
    f"output/{name_of_your_crystal_file}/*/",
    f"output/{name_of_your_crystal_file}/",
    "output/*/"
]

json_files_found = []
for source_pattern in json_sources:
    json_files_found.extend(list(Path().glob(f"{source_pattern}*.json")))

if json_files_found:
    print(f"üìÅ Found {len(json_files_found)} JSON files")
    # Copy to JSON directory
    for json_file in json_files_found:
        !cp "{json_file}" "{json_dir}/"
else:
    print("‚ö†Ô∏è No JSON files found. Using dummy data from previous step.")

print("\nüìÅ JSON files in collection directory:")
!ls -la "{json_dir}/"

# Merge & shuffle using Python
print("\nüîÑ Merging and shuffling datasets...")

output_file = Path(f"output/{name_of_your_crystal_file}/merged_shuffled.json")

# Load all JSON files
json_files = sorted(json_dir.glob("*.json"))
if not json_files:
    print("‚ùå No JSON files found for merging.")
    print("üí° Creating minimal dataset for demonstration...")

    # Create minimal demo dataset
    demo_data = [{
        "energy": -360.0,
        "atoms": [
            {"element": "C", "x": 0.0, "y": 0.0, "z": 0.0},
            {"element": "H", "x": 1.0, "y": 0.0, "z": 0.0}
        ],
        "forces": [
            {"x": 0.1, "y": 0.0, "z": 0.0},
            {"x": -0.1, "y": 0.0, "z": 0.0}
        ]
    }]

    with output_file.open("w") as f:
        json.dump(demo_data, f, indent=2)

    print(f"‚úÖ Demo dataset created: {output_file}")

else:
    print(f"üìã Found {len(json_files)} JSON files to merge")

    merged = []
    for jf in json_files:
        print(f"   Processing: {jf.name}")
        try:
            with jf.open() as f:
                data = json.load(f)

            # Handle both list and single object formats
            if isinstance(data, list):
                merged.extend(data)
            else:
                merged.append(data)
        except json.JSONDecodeError as e:
            print(f"   ‚ö†Ô∏è Error reading {jf.name}: {e}")
            continue

    if not merged:
        print("‚ùå No valid data found in JSON files.")
    else:
        # Shuffle for better training statistics (with reproducible seed)
        random.seed(42)
        random.shuffle(merged)

        # Ensure output directory exists
        output_file.parent.mkdir(parents=True, exist_ok=True)

        # Write merged & shuffled dataset
        with output_file.open("w") as f:
            json.dump(merged, f, indent=2)

        print(f"\n‚úÖ Successfully merged {len(json_files)} files into {len(merged)} configurations")
        print(f"üìÅ Merged dataset saved to: {output_file}")
        print(f"üíæ Dataset size: {len(merged)} structures")

print("\nüéØ Ready for data visualization and analysis!")

üìä Merging JSON datasets from all temperatures...
üîç Searching for JSON files...
üìÅ Found 5 JSON files
cp: 'output/BENZAC24/JSON/opt_vasp_output.json' and 'output/BENZAC24/JSON/opt_vasp_output.json' are the same file
cp: 'output/BENZAC24/JSON/T300_vasp_output.json' and 'output/BENZAC24/JSON/T300_vasp_output.json' are the same file
cp: 'output/BENZAC24/JSON/T400_vasp_output.json' and 'output/BENZAC24/JSON/T400_vasp_output.json' are the same file
cp: 'output/BENZAC24/JSON/T500_vasp_output.json' and 'output/BENZAC24/JSON/T500_vasp_output.json' are the same file
cp: 'output/BENZAC24/JSON/T600_vasp_output.json' and 'output/BENZAC24/JSON/T600_vasp_output.json' are the same file

üìÅ JSON files in collection directory:
total 13320
drwxr-xr-x 2 root root    4096 Jun 27 17:12 .
drwxr-xr-x 9 root root    4096 Jun 27 17:12 ..
-rw-r--r-- 1 root root 1397625 Jun 27 17:12 opt_vasp_output.json
-rw-r--r-- 1 root root 3366823 Jun 27 17:12 T300_vasp_output.json
-rw-r--r-- 1 root root 2724276 Jun 

## üìä Step 8: Data Visualization & Quality Assessment

Let's visualize our dataset to understand its characteristics and identify any potential issues.

### üéØ What we're looking for:
- **Energy distribution**: Are energies in reasonable ranges?
- **Force magnitudes**: Do we have a good spread of force values?
- **Outliers**: Any suspiciously high/low values?
- **Data balance**: Sufficient sampling across different regimes?

### üö® Red flags to watch for:
- Extremely high energies (SCF convergence failures).
- Unreasonably large forces.
- Too narrow energy/force ranges (poor sampling).
- Isolated outliers far from main distribution.

> üìù **Modify the cell to plot the histogram of the `energies` and the `all_forces` arrays.**

> üìù **Find a reasonable cutoff for the forces to exclude outliers in the histogram.**

In [8]:
# Analyze and visualize the merged dataset
import json
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

print("üìä Loading merged dataset for analysis...")

# Load the merged dataset
merged_file = Path(f"output/{name_of_your_crystal_file}/merged_shuffled.json")
if not merged_file.exists():
    print(f"‚ùå {merged_file} not found. Please run the merging step first.")
    # Create a simple dataset for demonstration
    demo_data = [{
        "energy": -360.0 + np.random.normal(0, 0.1),
        "forces": [{"x": np.random.normal(0, 0.5), "y": np.random.normal(0, 0.5), "z": np.random.normal(0, 0.5)} for _ in range(10)]
    } for _ in range(50)]

    with merged_file.open("w") as f:
        json.dump(demo_data, f)
    print(f"üìù Created demo dataset: {merged_file}")

with merged_file.open() as f:
    data = json.load(f)

print(f"üìã Dataset loaded: {len(data)} configurations")

# Extract energies and forces
energies = []
all_forces = []

print("üîç Extracting energies and forces...")

for i, entry in enumerate(data):
    # Extract energy (try different possible keys)
    energy = entry.get("energy", entry.get("total_energy", entry.get("E")))
    if energy is not None:
        energies.append(energy)

    # Extract forces
    forces = entry.get("forces", [])
    if isinstance(forces, list) and forces:
        try:
            # Convert to numpy array and calculate magnitudes
            force_array = np.array([[f["x"], f["y"], f["z"]] for f in forces], dtype=float)
            force_magnitudes = np.linalg.norm(force_array, axis=1)
            all_forces.extend(force_magnitudes)
        except (KeyError, ValueError, TypeError) as e:
            print(f"   ‚ö†Ô∏è Error processing forces for entry {i}: {e}")
            continue

# Convert to numpy arrays
energies = np.array(energies)
all_forces = np.array(all_forces)

# TODO: ADD CODE HERE FOR PLOTTING HISTOGRAMS OF ENERGIES AND FORCES AND FIND REASONABLE FORCE CUTOFF.


üìä Loading merged dataset for analysis...
üìã Dataset loaded: 975 configurations
üîç Extracting energies and forces...


---

# ‚öôÔ∏è Phase 3: ML Model Preparation

## üîÑ Step 9: Convert to MACE Training Format

Now we'll convert our filtered JSON dataset into the HDF5 format required by MACE and prepare it for training. Here, the machine learning potential research system will also apply the force cutoff that you determined in the previous step.

### üìÅ Format conversion process:
- **JSON ‚Üí HDF5**: Convert to efficient binary format.
- **Train/Validation split**: Separate data for training and validation.
- **Data standardization**: Ensure consistent formatting.
- **Metadata preparation**: Add necessary headers and information.

### üéØ MACE requirements:
- **HDF5 format**: Efficient storage and fast data loading.
- **Specific data keys**: Energy, forces, atomic numbers, positions.
- **Proper data types**: Float64 precision for numerical stability.
- **Train/validation split**: Typically 90/10 or 80/20 ratio.

> üìù **Run the cell.**

> üìù **Choose the ML potential dataset creation mode of the machine learning potential research system to process the merged and shuffled JSON file in the output directory with your force cutoff (keep the default options for all other questions).**

In [None]:
# Convert JSON to HDF5 format for MACE training
print("üì¶ Converting JSON to HDF5 format for MACE training...")

# Try to use AMLP data processor
amlpt_paths = [
    "workshop_materials/AMLP_workshop/AMLP-v0.2/amlpt.py",
    "workshop_materials/amlpt.py",
    "amlpt.py"
]

amlpt_found = False
for amlpt_path in amlpt_paths:
    if os.path.exists(amlpt_path):
        print(f"‚öôÔ∏è Running AMLP data processor: {amlpt_path}")
        !python3 "{amlpt_path}"
        amlpt_found = True
        break

print("\n‚úÖ HDF5 conversion complete!")

print("\nüéØ MACE training format preparation completed!")

In [None]:
import h5py
import numpy as np
from ase.data import chemical_symbols

# Replace with your actual filename
filename = "output/merged_shuffled_train.h5"

unique_atomic_numbers = set()

with h5py.File(filename, "r") as f:
    for batch in f.keys():
        if not batch.startswith("config_batch"):
            continue
        for config in f[batch]:
            path = f"{batch}/{config}/atomic_numbers"
            if path in f:
                z = f[path][:]
                unique_atomic_numbers.update(z)

# Convert to element symbols using ASE
unique_symbols = sorted({chemical_symbols[z] for z in unique_atomic_numbers})
print("Unique atom types in the dataset:", unique_symbols)

## ‚öôÔ∏è Step 10: Configure MACE Training

Let's create the configuration file for MACE training with optimized parameters for our workshop.

### üèóÔ∏è Model architecture:
- **MACE framework**: State-of-the-art message passing.
- **Interaction layers**: 2 layers for efficiency.
- **Radial basis**: 8 functions for distance encoding.
- **Angular features**: Up to l=3 spherical harmonics.

### üìä Training parameters:
- **Batch size**: 5 (workshop-optimized for quick training).
- **Learning rate**: 0.01 with adaptive scheduling.
- **Early stopping**: Patience of 50 epochs.
- **Loss function**: Weighted energy + force loss.

> üìù **Adjust the `atomic_numbers` list and `E0s` dictionary to only include the elements appearing in your crystal based on the following table.**

element   E0


---


H         -0.27737697

---
C          -0.77091629

---
N          -0.83778712

---
O          -0.77578785

---
> üìù **Pick a Random number and replace the seed with your number** (do not use the 45 seed number)

> üìù **Run the cell.**

In [None]:
# Create MACE training configuration
print("üìù Creating MACE training configuration...")

mace_config = """# Model and training configuration
model: "MACE"
default_dtype: "float64"
multiheads_finetuning: False #True  # Adjust based on your needs
name: "name_of_the_model"
model_dir: "MACE_models"
log_dir: "MACE_models"
checkpoints_dir: "MACE_models"
results_dir: "MACE_models"

num_channels: 64
max_L: 1
num_interactions: 2
correlation: 3

# Data paths pointing to your dataset HDF5 files
train_file: "output/merged_shuffled_train.h5"  # Add the actual path to the training file
valid_file: "output/merged_shuffled_valid.h5"  # Add the actual path to the validation file
valid_fraction: 0.10
atomic_numbers: "[1]"  # Adjust based on your dataset
E0s: "{1: -0.27737697}"  # Adjusted to match the original script

# Data keys
energy_key: "energy"
forces_key: "forces"
# Training parameters
device: cuda
batch_size: 5  # Adjusted batch size
max_num_epochs: 10
swa: True
start_swa: 5
swa_lr: 0.0001
swa_forces_weight: 1
swa_energy_weight: 2
ema: True
ema_decay: 0.99
amsgrad: True
restart_latest: True
seed: 45

# Scaling and loss
scaling: "rms_forces_scaling"
error_table: "PerAtomMAE"
loss: "ef"

# Additional parameters
r_max: 6.0
"""

with open("config_train.yaml", "w") as f:
    f.write(mace_config)

print("‚úÖ MACE training configuration created: config_train.yaml")
print("\nüìã Training setup:")
print("   üèóÔ∏è Architecture: MACE with 2 interaction layers")
print("   üìä Batch size: 5 (workshop-optimized)")
print("   üéØ Max epochs: 10 with early stopping")
print("   üíª Device: GPU acceleration")
print("\nüéØ Ready to train the ML potential!")

## üöÄ Step 11a: Train MACE Model

Now for the exciting part, training our machine learning potential!

### ‚è±Ô∏è What to expect:
- **Training time**: 10-15 minutes (depending on GPU).
- **Progress monitoring**: Watch loss decrease over epochs.
- **Early stopping**: Training will stop when validation loss plateaus.
- **Model checkpoints**: Best model will be saved automatically.

### üìä Training metrics:
- **Energy MAE**: Mean Average error for energies.
- **Force MAE**: Mean Average error for forces.
- **Validation loss**: Performance on unseen data.
- **Training stability**: Smooth convergence curves.

‚òï **Perfect time for a coffee break!** The training will run automatically.

> üìù **Run the cell.**

In [None]:
import warnings
warnings.filterwarnings("ignore")
import sys
import logging
import os
from mace.cli.run_train import main as mace_run_train_main

def train_mace(config_file_path):
    logging.getLogger().handlers.clear()
    sys.argv = ["program", "--config", config_file_path]
    mace_run_train_main()

# Train the MACE model
print("üéì Starting MACE model training...")
print("‚è±Ô∏è Estimated training time: 10-15 minutes")
print("‚òï Perfect time for another coffee break!")

# Check if MACE is available
print("\nüîç Checking MACE installation...")
try:
    from mace.cli.run_train import main as mace_run_train_main
    mace_available = True
    print("‚úÖ MACE is available")
except ImportError:
    mace_available = False
    print("‚ùå MACE import failed")

# Start MACE training
print("\nüî• Launching MACE training...")
try:
    train_mace("config_train.yaml")
    print("\n‚úÖ MACE training completed successfully!")
except Exception as e:
    print(f"‚ùå Training failed: {e}")
    ("üí° This may be due to insufficient data or configuration issues.")


print("\nüìä Check the logs directory for training progress and metrics")
print("üéØ Your custom ML potential is ready!")


üìä Review Your MACE Training Results

Once your training process has completed, you can explore the output in the MACE_models directory. Each training session should generate a dedicated subfolder, typically named after your system or training configuration. Within each subfolder, you will find two .png image files:


1.   Energy RMSE plot ‚Äì illustrating how the model‚Äôs prediction error
2.   Force RMSE plot ‚Äì showing the root-mean-square error in predicted forces as a function of epochs.


These plots allow you to visually assess the learning behavior and convergence of your model. They are especially useful for identifying potential issues such as underfitting, overfitting, or unstable training.

In [None]:
import os
from IPython.display import Image, display

# üîÅ Replace this with the name of your model
model_name = "name_of_the_model"

# Construct full paths to stage one and stage two plots
stage_one_path = f"MACE_model/{model_name}_run-42_train_Default_stage_one.png"
stage_two_path = f"MACE_model/{model_name}_run-42_train_Default_stage_two.png"

# Display the images if they exist
if os.path.exists(stage_one_path):
    print("üìä Stage One Training Plot")
    display(Image(filename=stage_one_path))
else:
    print(f"‚ùå Stage one plot not found at {stage_one_path}")

if os.path.exists(stage_two_path):
    print("\nüìä Stage Two Training Plot")
    display(Image(filename=stage_two_path))
else:
    print(f"‚ùå Stage two plot not found at {stage_two_path}")

## üöÄ Step 11b: Train MACE Model with Foundation Model
Instead of training from scratch, let's use a **foundation model** that has already been pre-trained on millions of DFT calculations! This approach is called **transfer learning** or **fine-tuning**

 üåü **Why use foundation models?**
- **Better accuracy**: Start from a model that already understands chemistry
- **Faster convergence**: Requires fewer epochs to reach good performance
- **Less data needed**: Works well even with limited training data
- **Robust performance**: Pre-trained on diverse chemical environments

### üèóÔ∏è **MACE Foundation Models:**
- **MACE-MP-0**: Trained on ~150,000 Materials Project structures
- **MACE-OFF23**: Trained on organic molecules (good for molecular crystals)
- **Different sizes**: Small, Medium, Large (we'll use Small for memory efficiency)

### ‚ö° **Transfer Learning Benefits:**
- **10-100x faster training** compared to from-scratch
- **Better generalization** to new chemical environments
- **Lower computational requirements** due to faster convergence


> üìù **Run the cells.**

In [None]:
# Create MACE training configuration
print("üìù Creating MACE training configuration...")

mace_config = """# Model and training configuration
model: "MACE"
default_dtype: "float64"
foundation_model: "workshop_materials/MACE-OFF23_small.model"
multiheads_finetuning: False #True  # Adjust based on your needs
name: "foundation_models"
model_dir: "MACE_foundation_models"
log_dir: "MACE_foundation_models"
checkpoints_dir: "MACE_foundation_models"
results_dir: "MACE_foundation_models"

num_channels: 60
max_L: 2
num_interactions: 2
correlation: 2

# Data paths pointing to your dataset HDF5 files
train_file: "output/merged_shuffled_train.h5"  # Add the actual path to the training file
valid_file: "output/merged_shuffled_valid.h5"  # Add the actual path to the validation file
valid_fraction: 0.15
atomic_numbers: "[1]"  # Adjust based on your dataset
E0s: "{1: -0.27737421}"  # Adjusted to match the original script

# Data keys
energy_key: "energy"
forces_key: "forces"
# Training parameters
device: cuda
batch_size: 5  # Adjusted batch size
max_num_epochs: 10
swa: True
start_swa: 5
swa_lr: 0.0001
swa_forces_weight: 2
swa_energy_weight: 1
ema: True
ema_decay: 0.99
amsgrad: True
restart_latest: True
seed: 45

# Scaling and loss
scaling: "rms_forces_scaling"
error_table: "PerAtomMAE"
loss: "ef"

# Additional parameters
r_max: 6.0
"""

with open("config_foundation_train.yaml", "w") as f:
    f.write(mace_config)

print("‚úÖ MACE training configuration created: config_foundation_train.yaml")
print("\nüìã Training setup:")
print("   üèóÔ∏è Architecture: MACE with 2 interaction layers")
print("   üìä Batch size: 5 (workshop-optimized)")
print("   üèóÔ∏è Using foundation model")
print("   üéØ Max epochs: 10")
print("   üíª Device: GPU acceleration")
print("\nüéØ Ready to train the ML potential!")

üìù Creating MACE training configuration...
‚úÖ MACE training configuration created: config_foundation_train.yaml

üìã Training setup:
   üèóÔ∏è Architecture: MACE with 2 interaction layers
   üìä Batch size: 5 (workshop-optimized)
   üèóÔ∏è Using foundation model
   üéØ Max epochs: 10
   üíª Device: GPU acceleration

üéØ Ready to train the ML potential!


In [None]:
import warnings
warnings.filterwarnings("ignore")
from mace.cli.run_train import main as mace_run_train_main
import sys
import logging
import os

def train_mace(config_file_path):
    logging.getLogger().handlers.clear()
    sys.argv = ["program", "--config", config_file_path]
    mace_run_train_main()

# Train the MACE model
print("üéì Starting MACE model training...")
print("‚è±Ô∏è Estimated training time: 10-15 minutes")
print("‚òï Perfect time for another coffee break!")

# Check if MACE is available
print("\nüîç Checking MACE installation...")
try:
    from mace.cli.run_train import main as mace_run_train_main
    mace_available = True
    print("‚úÖ MACE is available")
except ImportError:
    mace_available = False
    print("‚ùå MACE import failed")

# Start MACE training
print("\nüî• Launching MACE training...")
try:
    train_mace("config_foundation_train.yaml")
    print("\n‚úÖ MACE training completed successfully!")
except Exception as e:
    print(f"‚ùå Training failed: {e}")
    ("üí° This may be due to insufficient data or configuration issues.")


print("\nüìä Check the logs directory for training progress and metrics")
print("üéØ Your custom ML potential is ready!")


üìä Review your Foundation MACE model and compare it to your scratch model.


In [None]:
import os
import base64
from IPython.display import display, HTML

# Replace with your model name
your_model_name = "name_of_the_model"

# Define image paths
your_stage1 = f"MACE_models/{your_model_name}_run-42_train_Default_stage_one.png"
your_stage2 = f"MACE_models/{your_model_name}_run-42_train_Default_stage_two.png"
foundation_stage1 = "MACE_foundation_models/foundation_models_run-42_train_Default_stage_one.png"
foundation_stage2 = "MACE_foundation_models/foundation_models_run-42_train_Default_stage_two.png"

# Helper to encode PNG to base64
def img_to_base64(img_path):
    if not os.path.exists(img_path):
        return None
    with open(img_path, "rb") as f:
        data = f.read()
    return base64.b64encode(data).decode("utf-8")

# Function to display two images side-by-side with labels
def display_comparison(img1_path, img2_path, label1, label2, title):
    img1_b64 = img_to_base64(img1_path)
    img2_b64 = img_to_base64(img2_path)

    if img1_b64 and img2_b64:
        html = f"""
        <h4>{title}</h4>
        <div style="display: flex; gap: 40px;">
            <div style="text-align: center;">
                <img src="data:image/png;base64,{img1_b64}" width="800"><br>{label1}
            </div>
            <div style="text-align: center;">
                <img src="data:image/png;base64,{img2_b64}" width="800"><br>{label2}
            </div>
        </div>
        """
        display(HTML(html))
    else:
        print("‚ùå One or both images not found:")
        if not img1_b64:
            print(f" - Missing: {img1_path}")
        if not img2_b64:
            print(f" - Missing: {img2_path}")

# Compare stage one
display_comparison(
    your_stage1, foundation_stage1,
    "Your Model ‚Äì Stage 1", "Foundation Model ‚Äì Stage 1",
    "üìä Training Comparison ‚Äì Stage One"
)

# Compare stage two
display_comparison(
    your_stage2, foundation_stage2,
    "Your Model ‚Äì Stage 2", "Foundation Model ‚Äì Stage 2",
    "üìä Training Comparison ‚Äì Stage Two"
)

---

# üß™ Phase 4: Testing & Validation

## ‚öôÔ∏è Step 12a: MD Simulation Configuration

Now let's test our trained ML potential with molecular dynamics simulations at different temperatures.

### üéØ Simulation goals:
- **Validate model performance** with realistic dynamics.
- **Test temperature response** across multiple conditions.
- **Generate thermal properties** and structural analysis.
- **Comparison between both models** (e.g. foundation model).


### üå°Ô∏è Temperature protocol:
- **200K**: Low temperature, near-equilibrium dynamics.
- **400K**: Room temperature range, normal fluctuations.
- **600K**: Elevated temperature, enhanced sampling.

> üìù **Make sure to choose the correct `model_path` (file ending with .model in the `MACE_models` directory) and `cell_params` (from your computational sheet) in the following cell and run it.**

> üìù **Run your dynamics at different temperatures with both models.**

In [None]:
from ase import units
from ase.md.langevin import Langevin
from ase.io import read, write
import numpy as np
import time

from mace.calculators import MACECalculator

# Initialize calculator
calculator = MACECalculator(model_path='MACE_models/name_of_the_model_run-42_stagetwo.model', device='cuda')

# Read initial configuration
init_conf = read('output/BENZAC24/xyz_checks/BENZAC24.xyz', '0')

# User-defined cell parameters
a, b, c, alpha, beta, gamma = 20.0, 20.0, 20.0, 90.0, 90.0, 90.0  # Angstroms and degrees


pbc = [True, True, True]  # PBC in x, y, z directions

# Set up the cell from lattice parameters
from ase.geometry import cellpar_to_cell
cell_matrix = cellpar_to_cell([a, b, c, alpha, beta, gamma])
init_conf.set_cell(cell_matrix)

# Set periodic boundary conditions
init_conf.set_pbc(pbc)

# Set the calculator
init_conf.set_calculator(calculator)

# Print cell information
print(f"Cell vectors:\n{init_conf.get_cell()}")
print(f"Cell volume: {init_conf.get_volume():.2f} √Ö¬≥")
print(f"PBC: {init_conf.get_pbc()}")

# Set up Langevin dynamics
dyn = Langevin(init_conf, 0.5*units.fs, temperature_K=310, friction=5e-3)

# Write function that preserves cell information
def write_frame():
    # Write with cell information preserved
    dyn.atoms.write('traj_base.xyz', append=True)

    # Alternative: Write in a format that better preserves cell info
    # dyn.atoms.write('traj_base.traj', append=True)  # ASE trajectory format

dyn.attach(write_frame, interval=50)

# Optional: Print initial system info
print(f"Number of atoms: {len(init_conf)}")
print(f"Initial temperature: {init_conf.get_temperature():.2f} K")

# Run the dynamics
dyn.run(100000)
print("MD finished!")

# Optional: Save final configuration with cell
init_conf.write('final_config.xyz')
print("Final configuration saved!")

## ‚öôÔ∏è Step 12b: Visualize your dynamics

Now visualize both trajectories with the code that you want (e.g. [vmd](https://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=VMD) or [Ovito](https://www.ovito.org/))

## üéì Workshop Complete!

---