CodeHalwell/chemical-graph-series: view chemicals as graphs and perform operations on graphs for predictive chemistry
🧪 Chemical Graph Series

A progressive educational journey from basic cheminformatics to state-of-the-art Graph Neural Networks (GNNs) and Molecular Transformers. This series covers everything from representing molecules as graphs to predicting chemical properties using advanced deep learning architectures.

[Image: Molecular Graph Representation (molGraph.png)]


🎯 Who Is This For?

This series is designed for:

  • Computational chemists looking to apply deep learning to molecular data
  • ML engineers interested in graph neural networks with a chemistry application
  • Drug discovery researchers wanting to build property prediction models
  • Students with basic Python and chemistry knowledge

Prerequisites: Basic Python (loops, functions, data structures) and fundamental chemistry (molecular structure, bonds, functional groups). No prior experience with RDKit, graph theory, or deep learning required—we teach everything from scratch.


🚀 Curriculum Overview

The course is structured into 7 sequential notebooks, progressively building from foundations to production-ready models.

Lesson | Title                    | Key Concepts                                                    | Time
-------|--------------------------|-----------------------------------------------------------------|------------
01     | Building Graphs          | SMILES parsing, RDKit, Mol-to-Graph, Feature extraction         | 45-60 min
02     | Positional Encoding      | Laplacian Eigenvectors, RWPE, Spectral Analysis                 | 60-75 min
03     | GAT Model                | Graph Attention Networks, Message Passing, Multi-head Attention | 75-90 min
04     | Sparse Attention         | Efficiency in Graph Transformers, Virtual Edges, Locality       | 60-75 min
05     | Full Graph Transformer   | Global Self-Attention, Edge Features, Deep Architectures        | 90-105 min
06     | Advanced Graph Models    | GraphGPS, E(3)-GNNs, Equivariance, Hybrid Architectures         | 90-105 min
07     | Modelling & Predictions  | Property Prediction (ESOL, FreeSolv), Training Pipelines        | 120-150 min

Total Estimated Time: ~9-11 hours
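The Lesson 01 workflow, SMILES in and graph out, can be sketched in a few lines with RDKit; the feature set below (atomic number, degree, aromaticity) is illustrative rather than the notebooks' exact featurization.

```python
# Sketch of Lesson 01's mol-to-graph step: parse a SMILES string with
# RDKit, then emit per-atom features and a directed edge list.
from rdkit import Chem

def smiles_to_graph(smiles):
    """Return (node_features, edge_list) for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # One feature row per heavy atom: atomic number, degree, aromatic flag
    nodes = [
        [atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic())]
        for atom in mol.GetAtoms()
    ]
    # Each undirected bond becomes two directed edges (the usual GNN convention)
    edges = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    return nodes, edges

nodes, edges = smiles_to_graph("CCO")   # ethanol: C-C-O
print(len(nodes), len(edges))           # 3 heavy atoms, 4 directed edges
```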


📚 Learning Path

┌──────────────────────────────────────────────────────────┐
│               FOUNDATIONS (Lessons 01-02)                │
│  • Molecular representations    • Feature extraction     │
│  • Graph structures             • Positional encodings   │
└──────────────────────────────────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│           ATTENTION MECHANISMS (Lessons 03-04)           │
│  • Local attention (GAT)        • Sparse patterns        │
│  • Message passing              • Scalability            │
└──────────────────────────────────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│          ADVANCED ARCHITECTURES (Lessons 05-06)          │
│  • Graph Transformers           • GraphGPS               │
│  • Global context               • Equivariant networks   │
└──────────────────────────────────────────────────────────┘
                             ↓
┌──────────────────────────────────────────────────────────┐
│                 APPLICATION (Lesson 07)                  │
│  • Real datasets (ESOL, FreeSolv)   • Model comparison   │
│  • Training pipelines               • Deployment         │
└──────────────────────────────────────────────────────────┘

πŸ› οΈ Setup & Installation

This project manages dependencies through pyproject.toml. We recommend uv for fast, reliable installs.

Using uv (Recommended)

# Clone the repository
git clone https://github.com/CodeHalwell/chemical-graph-series.git
cd chemical-graph-series

# Sync environment and install all dependencies
uv sync

# Launch Jupyter
uv run jupyter notebook

Using pip

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install rdkit torch torch-geometric networkx matplotlib pandas jupyter py3dmol scipy

# Launch Jupyter
jupyter notebook

Verify Installation

# Run this in a notebook cell to verify everything works
from rdkit import Chem
import torch
import torch_geometric
import networkx as nx

print(f"RDKit: {Chem.rdBase.rdkitVersion}")
print(f"PyTorch: {torch.__version__}")
print(f"PyTorch Geometric: {torch_geometric.__version__}")
print("✅ All dependencies installed successfully!")

📂 Project Structure

chemical-graph-series/
├── notebooks/
│   ├── 01_Building_Graphs.ipynb      # Foundations: SMILES, RDKit, graphs
│   ├── 02_Positional_Encoding.ipynb  # Spectral graph theory & RWPE
│   ├── 03_GAT_Model.ipynb            # Graph Attention Networks
│   ├── 04_Sparse Attention.ipynb     # Efficient attention patterns
│   ├── 05_Full_Graph_Transformer.ipynb  # Complete transformer architecture
│   ├── 06_Advanced_Graph_Models.ipynb   # GraphGPS, E(3)-GNNs
│   └── 07_Modelling_and_Predictions.ipynb  # Real-world applications
├── molGraph.png                      # Visual for documentation
├── pyproject.toml                    # Project dependencies
├── uv.lock                           # Locked dependency versions
├── main.py                           # Utility scripts
└── README.md                         # This file

🧪 Requirements

Requirement       | Version
------------------|----------
Python            | ≥ 3.13
RDKit             | latest
PyTorch           | latest
PyTorch Geometric | latest
NetworkX          | latest
matplotlib        | latest
pandas            | latest
py3Dmol           | ≥ 2.5.3
scipy             | ≥ 1.16.3

🎓 What You'll Build

By the end of this series, you will have:

  1. Molecular featurization pipelines — Convert any SMILES string into ML-ready graph representations
  2. Custom GNN architectures — GATs, Graph Transformers, and hybrid models
  3. Property prediction models — Trained on ESOL (solubility) and FreeSolv (solvation energy) benchmarks
  4. Interpretable AI — Visualize attention weights to understand what your model "sees"
  5. Production-ready code — Deployable models for real-world molecular property prediction
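For item 1, the bridge from an edge list to something a GNN can consume is the COO-format edge_index tensor used throughout PyTorch Geometric; a minimal sketch in plain PyTorch, with toy ethanol features (these are the pieces a torch_geometric Data object wraps together).

```python
# Sketch: pack a directed edge list into the COO edge_index tensor
# (shape [2, num_edges]) that PyTorch Geometric models expect.
import torch

edges = [(0, 1), (1, 0), (1, 2), (2, 1)]   # ethanol C-C-O, both directions
x = torch.tensor([[6.0], [6.0], [8.0]])    # toy node features: atomic numbers
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()

print(edge_index.shape)   # torch.Size([2, 4])
```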

📖 Key Topics Covered

Cheminformatics

  • SMILES and SMARTS notation
  • Molecular visualization (2D, 3D, conformer ensembles)
  • Substructure matching and pharmacophore identification
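Substructure matching comes down to a couple of RDKit calls; a minimal sketch that locates the hydroxyl oxygen of ethanol with a SMARTS pattern:

```python
# Sketch: SMARTS substructure matching with RDKit.
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")           # ethanol
pattern = Chem.MolFromSmarts("[OX2H]")    # hydroxyl oxygen: 2 connections, 1 H
matches = mol.GetSubstructMatches(pattern)
print(matches)                            # ((2,),): atom index 2 is the -OH oxygen
```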

Graph Theory

  • Molecules as graphs (atoms = nodes, bonds = edges)
  • Adjacency and Laplacian matrices
  • Spectral graph theory and eigenvector decomposition
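These objects are easy to build with NetworkX and NumPy; a sketch for propane's three-carbon skeleton, ending with the eigendecomposition that Lesson 02 turns into Laplacian positional encodings:

```python
# Sketch: adjacency matrix A, degree matrix D, Laplacian L = D - A,
# and the Laplacian eigendecomposition for a 3-atom chain (propane's
# heavy-atom skeleton is just a path graph).
import networkx as nx
import numpy as np

G = nx.path_graph(3)                  # C-C-C
A = nx.to_numpy_array(G)              # adjacency matrix
D = np.diag(A.sum(axis=1))            # degree matrix
L = D - A                             # combinatorial Laplacian

eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues for symmetric L
print(np.round(eigvals, 4))           # [0. 1. 3.]
# The eigenvectors of the smallest non-trivial eigenvalues give each
# node a spectral "coordinate" usable as a positional encoding.
```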

Deep Learning

  • Message passing neural networks
  • Attention mechanisms (single-head, multi-head, sparse)
  • Transformer architectures adapted for graphs
  • Equivariant neural networks (E(3)-GNNs)
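The core of message passing fits in a few lines of plain PyTorch; a sketch with mean aggregation, where a GAT layer (Lesson 03) would instead weight each neighbour's message by a learned attention score:

```python
# Sketch: one message-passing step in plain PyTorch (no torch_geometric),
# using mean aggregation over incoming neighbours.
import torch

def message_passing_step(x, edge_index, lin):
    """h_i = W * mean_{j -> i} x_j : aggregate neighbours, then transform."""
    src, dst = edge_index                        # directed edges src -> dst
    agg = torch.zeros_like(x)
    agg.index_add_(0, dst, x[src])               # sum incoming messages
    deg = torch.zeros(x.size(0)).index_add_(
        0, dst, torch.ones(dst.size(0))).clamp(min=1)
    return lin(agg / deg.unsqueeze(1))           # mean-aggregate, then W

x = torch.eye(3)                                 # one-hot node features
edge_index = torch.tensor([[0, 1, 1, 2],         # 3-node path, both directions
                           [1, 0, 2, 1]])
lin = torch.nn.Linear(3, 3, bias=False)
out = message_passing_step(x, edge_index, lin)
print(out.shape)  # torch.Size([3, 3])
```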

Practical ML

  • Feature engineering for molecular properties
  • Train/validation/test splitting with scaffold awareness
  • Hyperparameter tuning and cross-validation
  • Model interpretation and error analysis
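Scaffold-aware splitting starts by grouping molecules on their Bemis-Murcko scaffolds, so close ring analogues never straddle the train/test boundary; a minimal RDKit sketch over an illustrative four-molecule list:

```python
# Sketch: group molecules by Bemis-Murcko scaffold before splitting,
# so near-identical ring systems stay on the same side of the split.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

smiles_list = ["c1ccccc1O", "c1ccccc1N", "CCO", "CCCO"]  # illustrative set

groups = defaultdict(list)
for smi in smiles_list:
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
    groups[scaffold].append(smi)

# Phenol and aniline share the benzene scaffold; the acyclic alcohols
# share the empty scaffold "" and stay together too.
print(dict(groups))
```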

🔗 Resources & Further Reading

RDKit Documentation: https://www.rdkit.org/docs/
PyTorch Geometric: https://pytorch-geometric.readthedocs.io/
DeepChem: https://deepchem.io/
OGB Molecular Benchmarks: https://ogb.stanford.edu/

Key Papers:

  • Veličković et al. (2018) — Graph Attention Networks
  • Rampášek et al. (2022) — GraphGPS
  • Dwivedi et al. (2021) — Benchmarking GNNs

πŸ“ License

This project is for educational purposes. Feel free to use, modify, and share with attribution.


Ready to start? Open Lesson 01: Building Graphs and begin your journey!
