Inverse Coulomb Explosion Imaging via Deep Learning

Recovering initial molecular geometry from fragment velocities measured in Coulomb Explosion Imaging (CEI) experiments using deep neural networks.

Overview

Coulomb Explosion Imaging (CEI) is an ultrafast technique that probes molecular structure by ionizing all atoms nearly instantaneously and recording the resulting fragment momenta. The inverse problem — reconstructing the initial 3D atomic positions from the measured velocity vectors — is highly non-trivial due to the many-body Coulomb dynamics. This repository provides two complementary deep learning approaches:

Model	Architecture	Description
MLP	Multi-Layer Perceptron	Deterministic feedforward regression with batch normalization
VAE	Variational Autoencoder	Encoder–decoder with a learned latent space and reparameterization trick

Both models are trained on an oriented molecular dataset (CHBrClF) and predict 3D initial positions (12 coordinates) from 12 measured velocity components.

Molecular System

The target molecule is bromochlorofluoromethane (CHBrClF), containing five atoms bonded to a central carbon:

     Br
      \
  Cl — C — F
      |
      H

Input features (12): Final velocity components of each fragment
Br_vxf, Br_vyf, Cl_vxf, Cl_vyf, Cl_vzf, F_vxf, F_vyf, F_vzf, C_vxf, H_vxf, H_vyf, H_vzf

Output targets (12): Initial atomic positions
Br_x0, Br_y0, Br_z0, Cl_x0, Cl_y0, Cl_z0, F_x0, F_y0, F_z0, H_x0, H_y0, H_z0

Repository Structure

CEI/
├── README.md
├── MLP/
│   ├── MLP_FineTune.py      # Hyperparameter optimization with Optuna
│   └── MLP_Train.py         # Multi-seed training with best hyperparameters
├── VAE/
│   ├── VAE_FineTune.py      # Hyperparameter optimization with Optuna
│   └── VAE_Train.py         # Multi-seed training with best hyperparameters
└── EVAL/
    ├── Metrics.ipynb         # Per-atom and overall metric computation
    └── Plots.ipynb           # 3D visualization of true vs. predicted positions

Pipeline

The workflow follows a three-stage pipeline for each model:

┌─────────────────────┐     ┌─────────────────────┐     ┌─────────────────────┐
│   1. Fine-Tune      │     │   2. Train           │     │   3. Evaluate        │
│   (FineTune.py)     │────▶│   (Train.py)         │────▶│   (EVAL/)            │
│                     │     │                      │     │                      │
│ • Optuna HPO        │     │ • Load best params   │     │ • Metrics.ipynb      │
│ • Minimize val MSE  │     │ • 10 random seeds    │     │ • Plots.ipynb        │
│ • Save best params  │     │ • Avg ± std metrics  │     │ • Per-atom analysis  │
│   to JSON           │     │ • Save predictions   │     │ • 3D visualization   │
└─────────────────────┘     └─────────────────────┘     └─────────────────────┘

Stage 1 — Hyperparameter Optimization (`*_FineTune.py`)

Uses Optuna to search over model architecture and training hyperparameters. The objective minimizes validation MSE on a held-out 15% validation split.

MLP search space:

Hyperparameter	Range
Number of layers	2 – 8
Hidden size	256 – 2048 (log)
Dropout	0.0 – 0.5
Learning rate	1e-4 – 1e-2 (log)
Weight decay	1e-6 – 1e-3 (log)
Batch size	{16384, 32768, 65536, 131072}
Gradient clipping	0.5 – 3.0
Activation	{ReLU, GELU, SiLU}

VAE search space:

Hyperparameter	Range
Number of layers	2 – 12
Hidden size	128 – 4096 (log)
Latent dimension	8 – 512 (log)
Dropout	0.0 – 0.6
Learning rate	5e-6 – 5e-3 (log)
Weight decay	1e-7 – 1e-1 (log)
Batch size	{512, 1024, ..., 16384}
β (KL weight)	1e-6 – 2.0 (log)
Gradient clipping	0.0 – 10.0
Scaler	{MinMax, Standard, Robust}
Activation	{LeakyReLU, GELU, SiLU}

Key features:

TPE / Hyperband pruning for efficient search
GPU-resident data caching (data loaded once, reused across all trials)
Mixed-precision training (FP16/BF16) for speed
Graceful shutdown with progress saving
SQLite-backed Optuna storage for resumable studies
Trial-level CSV logging for post-hoc analysis

Outputs:

best_hyperparameters.json / mlp_best_hyperparams.json — best hyperparameters
optuna_trials/*.csv — full trial history
models/ — trial checkpoints
results/*.db — Optuna study database

Stage 2 — Multi-Seed Training (`*_Train.py`)

Loads the best hyperparameters from Stage 1 and trains the model across 10 independent random seeds to produce robust performance estimates (mean ± std).

# MLP
python MLP_Train.py \
    --data_path data/4au_data_oriented.parquet \
    --best_params_path results/mlp_best_hyperparams.json \
    --output_dir ./results \
    --n_seeds 10

# VAE
python VAE_Train.py \
    --data_path data/data_oriented.parquet \
    --best_params_path results/best_hyperparameters.json \
    --output_dir ./results \
    --n_seeds 10

Key features:

Same data split (70/15/15) across all seeds for fair comparison
Early stopping with patience monitoring
Per-seed and averaged metric reporting (MSE, R², ADE, MAE, NRMSE)
Predictions saved in evaluation-ready format

Outputs:

results/mlp_test_full_data.csv / results/vae_test_full_data.csv — combined true vs. predicted values
results/multi_seed_summary.json — averaged metrics across seeds
models/best_model.pt — final model checkpoint

Stage 3 — Evaluation (`EVAL/`)

Jupyter notebooks that consume the CSV outputs from Stage 2.

Metrics.ipynb computes per-atom and overall metrics:

Metric	Description
MSE	Mean Squared Error
R²	Coefficient of Determination
ADE	Average Displacement Error (Euclidean distance in 3D)
NRMSE	Normalized Root Mean Squared Error (by data range)

Plots.ipynb generates interactive 3D visualizations using Plotly:

True atom positions (scatter)
Predicted atom positions (scatter)
True vs. Predicted overlay
Average molecular geometry with bonds (true vs. predicted)

Data Format

The input data is a Parquet file where each row represents one CEI event. Columns include 12 velocity features and 12 position targets. Data is split as:

Split	Ratio	Purpose
Train	70%	Model fitting
Validation	15%	Hyperparameter selection & early stopping
Test	15%	Final performance evaluation

Normalization is computed on the training set only and applied to all splits to prevent data leakage.

Requirements

torch>=2.0
optuna>=3.0
scikit-learn
pandas
numpy
joblib
plotly

Install with:

pip install torch optuna scikit-learn pandas numpy joblib plotly

Quick Start

1. Hyperparameter search

# MLP (50 Optuna trials)
cd MLP/
python MLP_FineTune.py \
    --data_path ../data/4au_data_oriented.parquet \
    --output_dir ./results \
    --n_trials 50

# VAE (50 Optuna trials)
cd VAE/
python VAE_FineTune.py \
    --data_path ../data/data_oriented.parquet \
    --output_dir ./results \
    --n_trials 50

2. Train with best hyperparameters (10 seeds)

# MLP
cd MLP/
python MLP_Train.py \
    --data_path ../data/4au_data_oriented.parquet \
    --best_params_path ./results/results/mlp_best_hyperparams.json \
    --output_dir ./results \
    --n_seeds 10

# VAE
cd VAE/
python VAE_Train.py \
    --data_path ../data/data_oriented.parquet \
    --best_params_path ./results/results/best_hyperparameters.json \
    --output_dir ./results \
    --n_seeds 10

3. Evaluate

Open the notebooks in EVAL/ and set the PATH / csv_file variable to point to the output CSV from Step 2:

# In Metrics.ipynb
csv_file = "../MLP/results/results/mlp_test_full_data.csv"

# In Plots.ipynb
PATH = "../MLP/results/results/mlp_test_full_data.csv"

4. Manual Split

These setup instructions assume that the entire dataset is stored in a single file.
The train/validation/test split is performed randomly within the script.

If your data is structured differently (e.g., multiple files, pre-separated splits, or grouped data), only the data loading logic needs to be modified. The rest of the pipeline remains unchanged.

This setup was specifically used for the "isomers" experiment, where one isomer was excluded from the dataset during training. In this case, the split was applied after filtering out the selected isomer.

Key Assumptions

All data is contained in a single file.
Random splitting is sufficient for the experiment.
No stratification is applied unless explicitly added.
Filtering (e.g., excluding one isomer) occurs before splitting.

When to Modify

You need to adjust the data loading section if:

The dataset is distributed across multiple files.
The split is predefined.
You require stratified or grouped splitting.
The exclusion logic (e.g., isomer-based filtering) changes.

Model Details

MLP

A standard feedforward network with configurable depth and width:

Input (12) → [Linear → BatchNorm → Activation → Dropout] × N → Linear → Output (12)

Trained with AdamW optimizer and ReduceLROnPlateau scheduler
Mixed-precision (FP16) via torch.amp
GPU-resident data with custom GPUDataLoader for zero CPU overhead

VAE

An encoder–decoder architecture with a stochastic latent bottleneck:

         ┌──────────── μ ─────────────┐
Input → Encoder → ─┤                  ├─ z → Decoder → Output
         └──── log(σ²) ── reparam ───┘

Loss: MSE (reconstruction) + β · KL divergence
At inference, the mean μ is used directly (no sampling noise)
Supports MinMax, Standard, and Robust scaling
Multi-GPU support with thread-safe GPU slot allocation

Hardware

Scripts are optimized for NVIDIA A40 GPUs but will run on any CUDA-capable device. CPU fallback is supported. Key optimizations:

TF32 matrix multiplication on Ampere+ GPUs
torch.compile with reduce-overhead mode (when available)
Fused AdamW optimizer kernels
Persistent GPU data caching across Optuna trials

Citation

If you use this work, please cite:

@article{ghanaatian_molecular_cei,
  author  = {Ghanaatian, A.},
  title   = {Neural Network Based Molecular Structure Retrieval from Coulomb Explosion Imaging Data},
  journal = {<Journal Name>},
  year    = {2026},
  volume  = {<Volume>},
  pages   = {<Pages>},
  doi     = {<DOI>}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Inverse Coulomb Explosion Imaging via Deep Learning

Overview

Molecular System

Repository Structure

Pipeline

Stage 1 — Hyperparameter Optimization (`*_FineTune.py`)

Stage 2 — Multi-Seed Training (`*_Train.py`)

Stage 3 — Evaluation (`EVAL/`)

Data Format

Requirements

Quick Start

1. Hyperparameter search

2. Train with best hyperparameters (10 seeds)

3. Evaluate

4. Manual Split

Key Assumptions

When to Modify

Model Details

MLP

VAE

Hardware

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
EVAL		EVAL
MLP		MLP
VAE		VAE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Inverse Coulomb Explosion Imaging via Deep Learning

Overview

Molecular System

Repository Structure

Pipeline

Stage 1 — Hyperparameter Optimization (*_FineTune.py)

Stage 2 — Multi-Seed Training (*_Train.py)

Stage 3 — Evaluation (EVAL/)

Data Format

Requirements

Quick Start

1. Hyperparameter search

2. Train with best hyperparameters (10 seeds)

3. Evaluate

4. Manual Split

Key Assumptions

When to Modify

Model Details

MLP

VAE

Hardware

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Stage 1 — Hyperparameter Optimization (`*_FineTune.py`)

Stage 2 — Multi-Seed Training (`*_Train.py`)

Stage 3 — Evaluation (`EVAL/`)

Packages