Group-SELFIES Graph Embeddings (GSGE) 🚀

GSGE is a Python package for functional group aware molecular fragment tokenization and fragment-level graph embeddings. It combines group-SELFIES-based fragmentation with graph autoencoders so molecules can be represented as graphs whose nodes are chemically meaningful fragments rather than individual atoms.

GSGE supports:

Building fragment vocabularies for a specific chemical space
Tokenizing molecules into Group-SELFIES-like fragment sequences
Creating compound graphs with fragment nodes
Training fragment graph autoencoders and reusing the learned embeddings
Combining learned embeddings with fragment descriptors for downstream models

Figure 1. Example compound graph built from fragment nodes.

Installation

Install from PyPI

pip install GSGE

Install from source

git clone https://github.com/CDDLeiden/GSGE
cd GSGE
pip install .

Optional extras from a source checkout:

pip install ".[viz]"
pip install ".[notebooks]"
pip install ".[viz,notebooks]"

GPU support

pip install GSGE or pip install . will install PyTorch from the default index. If you want a CUDA build, install PyTorch first and then install GSGE:

pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install GSGE

See docs/getting-started/installation.md for more installation options and troubleshooting.

Developer setup

git clone https://github.com/CDDLeiden/GSGE
cd GSGE
bash install.sh

This creates a gsge-dev conda environment and installs GSGE in editable mode with the development extras.

Verify installation

python -c "import GSGE; from GSGE import GS_Vocab, GSGE_Corpus; print('Installation successful')"
GSGE_CLI run_test --help

Notes:

Python >=3.10 is supported
numpy<=2.3.0 is required by the current package metadata
The group-selfies dependency is installed from https://github.com/JasperDurinck/group-selfies, so git must be available during install

Quick start

Figure 2. Typical GSGE workflow.

1. Build a vocabulary and corpus

Use GS_Vocab for the merged fragment vocabulary used in representation learning, and GSGE_Corpus for the non-merged fragment set used to train the graph autoencoder.

from GSGE import GS_Vocab, GSGE_Corpus, GSGE, CUSTOM_fragment_mol

smiles_list = [
    "CCO",
    "CC(=O)NC",
    "c1ccccc1O",
]

corpus = GSGE_Corpus()
corpus.build_corpus(
    smiles_list,
    min_size=1,
    max_size=15,
    fragment_mol_fn=CUSTOM_fragment_mol,
    convert=True,
    fragmented=False,
)
corpus.save_GSGE_corpus(vocab_name="GSGE_corpus_example")

vocab = GS_Vocab()
vocab.build_vocab(
    m_set=smiles_list,
    convert=True,
    n_limit=1,
    target=200,
    MIN_SIZE=1,
    MAX_SIZE=15,
    fragment_mol_fn=CUSTOM_fragment_mol,
)
vocab.add_GS_fragment("O=C(*1)(*1)")
vocab.add_GS_fragment("N=C(*1)(*1)")
vocab.save_GS_vocab(vocab_name="GS_vocab_example")

gsge = GSGE(GS_vocab=vocab, GSGE_corpus=corpus)
gsge.add_all_single_elements()
gsge.add_GS_vocab_to_GSGE_corpus()

Why both objects matter:

GS_Vocab stores merged, generalized fragments for representing molecules
GSGE_Corpus keeps non-merged fragments, which is useful for fragment GAE training and data augmentation

2. Tokenize molecules and build compound graphs

from GSGE import GSGE

gsge = GSGE(GS_vocab="GS_vocab_example")
gsge.add_all_single_elements()

tokens = gsge.preprocess_from_SMILES("CCO")
compound_graphs = gsge.make_compound_graphs(["CCO", "CC(=O)NC"], pyg_data=False)

cg = gsge.get_CG_from_smiles("CCO", return_CG_object=True)
cg.plot_graph_rd_c_style()

3. Train the fragment graph autoencoder

The easiest route is the high-level GSGE wrapper:

from GSGE import GSGE

gsge = GSGE(GS_vocab="GS_vocab_example", GSGE_corpus="GSGE_corpus_example")
gsge.train_GSGE_Auto_Encoder(
    batch_size=64,
    num_epochs=300,
    checkpoint_interval=5,
    checkpoint_dir="model_checkpoints",
)

If you want lower-level control, the core training API lives in GSGE.graphs.fragment_graph.GAE and GSGE.core_gsge.CoreGSGE.

4. Generate and use fragment embeddings

from GSGE import GSGE, GSGE_Embedding

gsge = GSGE(GS_vocab="GS_vocab_example")
gsge.set_encoder()
gsge.load_GAE_weights("model_checkpoints/checkpoint_epoch_300.pth", map_location="cpu")
gsge.make_GS_fragment_embedding_dict()

lookup_table = gsge.get_fragment_embeddings()
token_vocab = gsge.get_GSGE_vocab()

embedding_layer = GSGE_Embedding(
    sparse_vocab_size=len(token_vocab),
    dense_size=lookup_table.shape[1],
    embedding_dim=128,
    GSGE_combined_embeddings=lookup_table,
    only_token2vec=True,
    no_grad=True,
)

You can also combine learned fragment embeddings with RDKit fragment descriptors:

gsge.calc_fragment_descriptors(
    descriptor_keys=["MolWt", "TPSA", "NumHDonors", "NumHAcceptors"]
)
combined = gsge.get_fragment_descriptors_and_embeddings()

Tutorials and examples

Jupyter notebook tutorials live in use_examples/.

Topic	Path
Vocabulary and corpus building	`use_examples/00_making_vocabs/vocabulary_and_corpus_tutorial.ipynb`
Compound graphs	`use_examples/01_make_compound_graphs/compound_graphs_tutorial.ipynb`
Tokenization	`use_examples/02_tokenization_example/tokenization_tutorial.ipynb`
GAE training and embedding visualization	`use_examples/03_GAE/`
Using embeddings	`use_examples/04_use_embeddings/embeddings_tutorial.ipynb`
Fragment descriptors	`use_examples/05_mol_frag_features/fragment_descriptors.ipynb`
End-to-end property prediction	`use_examples/06_end_to_end/property_prediction_tutorial.ipynb`

Recommended learning path:

Start with use_examples/00_making_vocabs/
Continue with use_examples/01_make_compound_graphs/
Add use_examples/03_GAE/ if you want learned fragment embeddings
Finish with the tutorial that matches your downstream use case

Pretrained example assets

The repository includes example .pkl files in tests/ and use_examples/ that are useful when working from a source checkout.

from GSGE import GSGE, get_tests_dir

tests_dir = get_tests_dir()
if tests_dir is not None:
    gsge = GSGE(GSGE_load_path=tests_dir / "test_gsge_save_with_descriptors.pkl")

Note: get_tests_dir() returns None in a standard pip install because tests/ is not part of the installed package.

Visual example

The figure below shows an example fragment embedding projection from one of the included experiments.

Figure 3. Example 2D view of learned fragment embeddings.

Testing

From a source checkout, you can run tests either through pytest or through the CLI helper.

pytest
pytest tests/test_make_gsge_vocab.py

GSGE_CLI run_test
GSGE_CLI run_test --file test_make_cg.py

Note: GSGE_CLI run_test relies on the repository tests/ directory, so it is mainly intended for editable installs or source checkouts.

Requirements

Core runtime requirements are declared in pyproject.toml and currently include:

Python >=3.10
rdkit==2024.9.6
numpy<=2.3.0
torch>=2.0.0
torch_geometric>=2.3.0
pandas, scipy, scikit-learn, joblib, pyarrow, selfies
group-selfies from the maintained GitHub fork

Documentation and contributing

Docs: https://CDDLeiden.github.io/gsge/
Installation guide: docs/getting-started/installation.md
Contributing guide: CONTRIBUTING.md
Tutorials: use_examples/

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
GSGE		GSGE
docs		docs
tests		tests
use_examples		use_examples
.coveragerc		.coveragerc
.gitattributes		.gitattributes
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
install.sh		install.sh
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Group-SELFIES Graph Embeddings (GSGE) 🚀

Installation

Install from PyPI

Install from source

GPU support

Developer setup

Verify installation

Quick start

1. Build a vocabulary and corpus

2. Tokenize molecules and build compound graphs

3. Train the fragment graph autoencoder

4. Generate and use fragment embeddings

Tutorials and examples

Pretrained example assets

Visual example

Testing

Requirements

Documentation and contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Group-SELFIES Graph Embeddings (GSGE) 🚀

Installation

Install from PyPI

Install from source

GPU support

Developer setup

Verify installation

Quick start

1. Build a vocabulary and corpus

2. Tokenize molecules and build compound graphs

3. Train the fragment graph autoencoder

4. Generate and use fragment embeddings

Tutorials and examples

Pretrained example assets

Visual example

Testing

Requirements

Documentation and contributing

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages