GSGE is a Python package for functional-group-aware molecular fragment tokenization and fragment-level graph embeddings. It combines Group-SELFIES-based fragmentation with graph autoencoders so that molecules can be represented as graphs whose nodes are chemically meaningful fragments rather than individual atoms.
GSGE supports:
- Building fragment vocabularies for a specific chemical space
- Tokenizing molecules into Group-SELFIES-like fragment sequences
- Creating compound graphs with fragment nodes
- Training fragment graph autoencoders and reusing the learned embeddings
- Combining learned embeddings with fragment descriptors for downstream models
Figure 1. Example compound graph built from fragment nodes.
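To make the idea of a fragment-level graph concrete, here is a minimal sketch of a compound graph as a plain data structure. The `FragmentGraph` class and the three-fragment split of N-methylacetamide are hypothetical and hand-written for illustration; they are not GSGE classes and not the output of GSGE's fragmenter.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentGraph:
    """Toy compound graph: nodes are fragment strings, edges join attached fragments."""

    fragments: list[str]
    edges: list[tuple[int, int]] = field(default_factory=list)

    def neighbors(self, node: int) -> list[int]:
        """Return the indices of fragments directly attached to `node`."""
        out = []
        for a, b in self.edges:
            if a == node:
                out.append(b)
            elif b == node:
                out.append(a)
        return out


# Hand-picked split of CC(=O)NC (5 heavy atoms) into methyl, amide, methyl.
# Illustrative only: GSGE's own fragmentation may split this differently.
toy = FragmentGraph(
    fragments=["C", "C(=O)N", "C"],
    edges=[(0, 1), (1, 2)],
)

print(len(toy.fragments), "fragment nodes")
print("neighbors of the amide node:", toy.neighbors(1))
```

The graph has three nodes instead of five, and each node carries a chemically recognizable unit. That is the property the fragment embeddings are meant to exploit.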
Install from PyPI:

pip install GSGE

Or install from a source checkout:

git clone https://github.com/CDDLeiden/GSGE
cd GSGE
pip install .

Optional extras from a source checkout:
pip install ".[viz]"
pip install ".[notebooks]"
pip install ".[viz,notebooks]"

Running pip install GSGE or pip install . installs PyTorch from the default index. If you want a CUDA build, install PyTorch first and then install GSGE:
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install GSGE

See docs/getting-started/installation.md for more installation options and troubleshooting.
git clone https://github.com/CDDLeiden/GSGE
cd GSGE
bash install.sh

This creates a gsge-dev conda environment and installs GSGE in editable mode with the development extras.
python -c "import GSGE; from GSGE import GS_Vocab, GSGE_Corpus; print('Installation successful')"
GSGE_CLI run_test --help

Notes:
- Python >=3.10 is supported
- numpy<=2.3.0 is required by the current package metadata
- The group-selfies dependency is installed from https://github.com/JasperDurinck/group-selfies, so git must be available during install
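The first and last notes can be checked before running pip. The sketch below uses only the standard library; `check_install_prerequisites` is a hypothetical helper name and is not part of GSGE.

```python
import shutil
import sys


def check_install_prerequisites(min_python=(3, 10)):
    """Return simple pre-install checks as a dict of booleans."""
    return {
        # GSGE declares support for Python >= 3.10.
        "python_ok": sys.version_info[:2] >= min_python,
        # group-selfies is installed from a Git URL, so git must be on PATH.
        "git_available": shutil.which("git") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_install_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `git_available` is reported as missing, install git (or activate an environment that provides it) before retrying the GSGE install.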
Figure 2. Typical GSGE workflow.
Use GS_Vocab for the merged fragment vocabulary used in representation learning, and GSGE_Corpus for the non-merged fragment set used to train the graph autoencoder.
from GSGE import GS_Vocab, GSGE_Corpus, GSGE, CUSTOM_fragment_mol
smiles_list = [
    "CCO",
    "CC(=O)NC",
    "c1ccccc1O",
]

# Non-merged fragment corpus, used to train the graph autoencoder.
corpus = GSGE_Corpus()
corpus.build_corpus(
    smiles_list,
    min_size=1,
    max_size=15,
    fragment_mol_fn=CUSTOM_fragment_mol,
    convert=True,
    fragmented=False,
)
corpus.save_GSGE_corpus(vocab_name="GSGE_corpus_example")

# Merged fragment vocabulary, used for representation learning.
vocab = GS_Vocab()
vocab.build_vocab(
    m_set=smiles_list,
    convert=True,
    n_limit=1,
    target=200,
    MIN_SIZE=1,
    MAX_SIZE=15,
    fragment_mol_fn=CUSTOM_fragment_mol,
)
vocab.add_GS_fragment("O=C(*1)(*1)")
vocab.add_GS_fragment("N=C(*1)(*1)")
vocab.save_GS_vocab(vocab_name="GS_vocab_example")
gsge = GSGE(GS_vocab=vocab, GSGE_corpus=corpus)
gsge.add_all_single_elements()
gsge.add_GS_vocab_to_GSGE_corpus()

Why both objects matter:
- GS_Vocab stores merged, generalized fragments for representing molecules
- GSGE_Corpus keeps non-merged fragments, which is useful for fragment GAE training and data augmentation
from GSGE import GSGE
gsge = GSGE(GS_vocab="GS_vocab_example")
gsge.add_all_single_elements()
tokens = gsge.preprocess_from_SMILES("CCO")
compound_graphs = gsge.make_compound_graphs(["CCO", "CC(=O)NC"], pyg_data=False)
cg = gsge.get_CG_from_smiles("CCO", return_CG_object=True)
cg.plot_graph_rd_c_style()

The easiest route is the high-level GSGE wrapper:
from GSGE import GSGE
gsge = GSGE(GS_vocab="GS_vocab_example", GSGE_corpus="GSGE_corpus_example")
gsge.train_GSGE_Auto_Encoder(
    batch_size=64,
    num_epochs=300,
    checkpoint_interval=5,
    checkpoint_dir="model_checkpoints",
)

If you want lower-level control, the core training API lives in GSGE.graphs.fragment_graph.GAE and GSGE.core_gsge.CoreGSGE.
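Training writes checkpoints into checkpoint_dir, and the weight-loading example in this README points at model_checkpoints/checkpoint_epoch_300.pth. Assuming checkpoints follow that `checkpoint_epoch_<N>.pth` naming scheme (an inference from that one path, not a documented guarantee), a small helper can pick the most recent one. `latest_checkpoint` is a hypothetical name and is not part of GSGE.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from "checkpoint_epoch_300.pth".
_CHECKPOINT_RE = re.compile(r"checkpoint_epoch_(\d+)\.pth$")


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint Path with the highest epoch number, or None."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).glob("checkpoint_epoch_*.pth"):
        match = _CHECKPOINT_RE.search(path.name)
        # Compare epochs numerically so epoch 300 sorts after epoch 95.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The result can be passed on as a string, for example `gsge.load_GAE_weights(str(latest_checkpoint("model_checkpoints")), map_location="cpu")`, after checking that it is not `None`.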
from GSGE import GSGE, GSGE_Embedding
gsge = GSGE(GS_vocab="GS_vocab_example")
gsge.set_encoder()
gsge.load_GAE_weights("model_checkpoints/checkpoint_epoch_300.pth", map_location="cpu")
gsge.make_GS_fragment_embedding_dict()
lookup_table = gsge.get_fragment_embeddings()
token_vocab = gsge.get_GSGE_vocab()
embedding_layer = GSGE_Embedding(
    sparse_vocab_size=len(token_vocab),
    dense_size=lookup_table.shape[1],
    embedding_dim=128,
    GSGE_combined_embeddings=lookup_table,
    only_token2vec=True,
    no_grad=True,
)

You can also combine learned fragment embeddings with RDKit fragment descriptors:
gsge.calc_fragment_descriptors(
    descriptor_keys=["MolWt", "TPSA", "NumHDonors", "NumHAcceptors"]
)
combined = gsge.get_fragment_descriptors_and_embeddings()

Jupyter notebook tutorials live in use_examples/.
| Topic | Path |
|---|---|
| Vocabulary and corpus building | use_examples/00_making_vocabs/vocabulary_and_corpus_tutorial.ipynb |
| Compound graphs | use_examples/01_make_compound_graphs/compound_graphs_tutorial.ipynb |
| Tokenization | use_examples/02_tokenization_example/tokenization_tutorial.ipynb |
| GAE training and embedding visualization | use_examples/03_GAE/ |
| Using embeddings | use_examples/04_use_embeddings/embeddings_tutorial.ipynb |
| Fragment descriptors | use_examples/05_mol_frag_features/fragment_descriptors.ipynb |
| End-to-end property prediction | use_examples/06_end_to_end/property_prediction_tutorial.ipynb |
Recommended learning path:
- Start with use_examples/00_making_vocabs/
- Continue with use_examples/01_make_compound_graphs/
- Add use_examples/03_GAE/ if you want learned fragment embeddings
- Finish with the tutorial that matches your downstream use case
The repository includes example .pkl files in tests/ and use_examples/ that are useful when working from a source checkout.
from GSGE import GSGE, get_tests_dir
tests_dir = get_tests_dir()
if tests_dir is not None:
    gsge = GSGE(GSGE_load_path=tests_dir / "test_gsge_save_with_descriptors.pkl")

Note: get_tests_dir() returns None in a standard pip install because tests/ is not part of the installed package.
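Because the tests directory may be absent, it can help to resolve example files defensively before handing a path to GSGE. The helper below is a minimal sketch using only pathlib; `resolve_example_file` is a hypothetical name and is not part of GSGE.

```python
from pathlib import Path
from typing import Optional


def resolve_example_file(tests_dir: Optional[Path], filename: str) -> Optional[Path]:
    """Return the path to an example file, or None if it is not available."""
    if tests_dir is None:
        # Standard pip install: tests/ is not shipped with the package.
        return None
    candidate = Path(tests_dir) / filename
    # Also guard against a checkout where the file itself is missing.
    return candidate if candidate.is_file() else None
```

With it, `resolve_example_file(get_tests_dir(), "test_gsge_save_with_descriptors.pkl")` yields either a usable path for `GSGE(GSGE_load_path=...)` or `None`, in which case you can fall back to building a vocabulary yourself.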
The figure below shows an example fragment embedding projection from one of the included experiments.
Figure 3. Example 2D view of learned fragment embeddings.
From a source checkout, you can run tests either through pytest or through the CLI helper.
pytest
pytest tests/test_make_gsge_vocab.py
GSGE_CLI run_test
GSGE_CLI run_test --file test_make_cg.py

Note: GSGE_CLI run_test relies on the repository tests/ directory, so it is mainly intended for editable installs or source checkouts.
Core runtime requirements are declared in pyproject.toml and currently include:
- Python >=3.10
- rdkit==2024.9.6
- numpy<=2.3.0
- torch>=2.0.0
- torch_geometric>=2.3.0
- pandas, scipy, scikit-learn, joblib, pyarrow, selfies
- group-selfies from the maintained GitHub fork
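To see which of these packages an environment actually provides, the standard library's importlib.metadata can report installed versions. This is a minimal sketch: `installed_versions` is a hypothetical helper, and the distribution names are copied from the list above. group-selfies is left out because its distribution name is not stated here.

```python
from importlib import metadata

# Distribution names taken from the requirements listed in this README.
CORE_PACKAGES = [
    "rdkit", "numpy", "torch", "torch_geometric", "pandas",
    "scipy", "scikit-learn", "joblib", "pyarrow", "selfies",
]


def installed_versions(names=CORE_PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

The sketch only reports versions. Comparing them against the pins in pyproject.toml is left to pip, which enforces them at install time.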
- Docs: https://CDDLeiden.github.io/gsge/
- Installation guide: docs/getting-started/installation.md
- Contributing guide: CONTRIBUTING.md
- Tutorials: use_examples/