John Mayfield edited this page Mar 3, 2024 · 16 revisions

Introduction

CDK v2.1 brings a large functional change to its core molecule representation. The primary interface for handling connection tables is still IAtomContainer; however, a new implementation has been added that provides useful methods for writing efficient algorithms. These methods are also used within IAtomContainer to speed up existing operations.

New API points

atom.getIndex() and bond.getIndex()

Atoms and bonds now know their index in the 'parent' atom container, and that index can be accessed in O(1) time. If an atom/bond is not accessed through a container then the index will be -1.

IAtomContainer mol = ...;
int aidx = mol.getAtom(0).getIndex();
int bidx = mol.getBond(0).getIndex();

atom.bonds() and atom.getBondCount()

Each atom knows and tracks its adjacency information, again providing O(1) access to the bond count and an iterator over the connected bonds. In previous versions, accessing the connected bonds via mol.getConnectedBondsList(atom) did an O(N) scan over the entire bond list. Attempting to call these methods on an atom not accessed through an AtomContainer will throw an UnsupportedOperationException.

IAtomContainer mol = ...;
IAtom atom = mol.getAtom(0);
int   deg = atom.getBondCount();
for (IBond bond : atom.bonds()) {
  IAtom nbr = bond.getOther(atom);
}

atom.getContainer() and bond.getContainer()

Atoms and bonds also contain a reference to the container they belong to. If an atom/bond has no parent the returned value will be null.

IAtomContainer mol = ...;
IAtomContainer parent = mol.getAtom(0).getContainer();

Reusing atoms/bonds in multiple containers

A challenge in the CDK API is that atoms and bonds can appear in multiple containers. This is still the case with the new implementation, but it introduces a major caveat: the atom put into a container is not the same reference that comes out. The two can nevertheless affect each other's properties. Essentially each atom is boxed up within an indirect wrapper. Study the following piece of code carefully:

IAtom atom = ...;
IAtomContainer mol1 = ...;
IAtomContainer mol2 = ...;

mol1.addAtom(atom);
mol2.addAtom(atom);

atom.getContainer(); // still null
mol1.getAtom(0).getContainer(); // mol1
mol2.getAtom(0).getContainer(); // mol2

mol1.getAtom(0) == atom; // false
mol2.getAtom(0) == atom; // false
mol1.getAtom(0).equals(atom); // true
mol2.getAtom(0).equals(atom); // true

mol1.getAtom(0).setFormalCharge(-1); // changes in all cases
atom.getFormalCharge() == -1; // true!
mol2.getAtom(0).getFormalCharge() == -1; // true!

Further discussion on safe usage is provided later in the Gotchas section.
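The boxing behaviour above can be easier to reason about with a toy model. The following is a minimal sketch in plain Java, not CDK code: ToyAtom, PlainAtom, AtomBox and ToyContainer are hypothetical names invented for illustration, and the real implementation may differ in detail. It only shows why the reference that comes out of a container differs from the one that went in while equals() and property changes still agree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration only; these are NOT CDK classes. */
final class BoxingSketch {

    interface ToyAtom {
        Integer getFormalCharge();
        void setFormalCharge(Integer charge);
        ToyContainer getContainer();
        ToyAtom unbox(); // the underlying atom that holds the data
    }

    /** An atom created on its own: holds the data, has no parent. */
    static final class PlainAtom implements ToyAtom {
        private Integer charge;
        public Integer getFormalCharge() { return charge; }
        public void setFormalCharge(Integer charge) { this.charge = charge; }
        public ToyContainer getContainer() { return null; }
        public ToyAtom unbox() { return this; }
        @Override public boolean equals(Object o) {
            return o instanceof ToyAtom && ((ToyAtom) o).unbox() == this;
        }
        @Override public int hashCode() { return System.identityHashCode(this); }
    }

    /** The wrapper a container hands back: delegates data, remembers the parent. */
    static final class AtomBox implements ToyAtom {
        private final ToyAtom target;
        private final ToyContainer parent;
        AtomBox(ToyAtom target, ToyContainer parent) {
            this.target = target;
            this.parent = parent;
        }
        public Integer getFormalCharge() { return target.getFormalCharge(); }
        public void setFormalCharge(Integer charge) { target.setFormalCharge(charge); }
        public ToyContainer getContainer() { return parent; }
        public ToyAtom unbox() { return target; }
        @Override public boolean equals(Object o) {
            return o instanceof ToyAtom && ((ToyAtom) o).unbox() == target;
        }
        @Override public int hashCode() { return System.identityHashCode(target); }
    }

    static final class ToyContainer {
        private final List<ToyAtom> atoms = new ArrayList<>();
        void addAtom(ToyAtom atom) { atoms.add(new AtomBox(atom.unbox(), this)); }
        ToyAtom getAtom(int idx) { return atoms.get(idx); }
    }
}
```

In this model, adding the same PlainAtom to two ToyContainers yields two distinct AtomBox instances that compare equal to the original and to each other, and a charge set through either one is visible through all of them.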

Activation

To begin with, v2.1 will still default to AtomContainer. Even with AtomContainer2 activated you should not notice any differences (please open an issue if you do). To activate the AtomContainer2 implementation, set the environment variable or system property CdkUseLegacyAtomContainer to false.

$ CdkUseLegacyAtomContainer=0 java -jar myapp.jar
$ java -DCdkUseLegacyAtomContainer=false -jar myapp.jar

The ChemObjectBuilder instance will then create AtomContainer2s.

IAtomContainer mol = SilentChemObjectBuilder.getInstance().newAtomContainer();

Planned default values for release schedule:

  • 2.1 CdkUseLegacyAtomContainer=t (AtomContainer is default)
  • 2.2 CdkUseLegacyAtomContainer=f (AtomContainer2 is default) - pushed back if issues are identified by the community
  • 2.x ...
  • 3.0 (original AtomContainer removed)
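The examples above write the flag as 0, false, t and f. As a rough illustration of how a flag like this could be interpreted, here is a minimal sketch in plain Java. It is not CDK's actual parser: the class LegacyFlagSketch, the accepted spellings beyond those shown above, and the choice to check the system property before the environment variable are all assumptions made for the example.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; NOT the CDK implementation. */
final class LegacyFlagSketch {

    static final String KEY = "CdkUseLegacyAtomContainer";

    /** Interprets a raw flag value; null means "not set" and yields the default. */
    static boolean parseFlag(String value, boolean defaultValue) {
        if (value == null)
            return defaultValue;
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "t":
            case "true":
            case "1":
                return true;
            case "f":
            case "false":
            case "0":
                return false;
            default:
                throw new IllegalArgumentException(
                    "Unrecognised value for " + KEY + ": " + value);
        }
    }

    /** Assumed lookup order: system property first, then environment variable. */
    static boolean useLegacy(boolean defaultValue) {
        String value = System.getProperty(KEY);
        if (value == null)
            value = System.getenv(KEY);
        return parseFlag(value, defaultValue);
    }
}
```

Under this sketch, the planned change of default between 2.1 and 2.2 corresponds to flipping the defaultValue argument from true to false, while an explicitly set flag wins in either release.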

Benchmark

Throughput is given in molecules per minute (measured on ChEMBL 23). AtomContainer2 trades a slight loss in write performance (i.e. creation from SMILES) for an improvement in read performance (i.e. accessing the connection info of the structure). The gains for existing operations are modest because many of them were previously optimised with foreknowledge of where the slow parts are. This often involved transforming the IAtomContainer into an int[][] graph, which is also benchmarked but is no longer needed (indicated by '*').

The largest gains are seen for ConnectedComponents, MorganNumbers, and SpanningTree. These are the true use cases, where the algorithms use the existing IAtomContainer API without special handling or workarounds (more explanation here). Simply changing the flag CdkUseLegacyAtomContainer to 0 gives a 10x-22x speed-up for ConnectedComponents and MorganNumbers. To put this in perspective, running MorganNumbers over the ChEMBL 23 SMILES goes from taking ~20 min to taking just over 1 min. The SpanningTree class sees an improvement of ~2x; note the SpanningTree algorithm is actually deprecated and replaced with RingFinder, but the demonstration still stands.
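The Improvement figures quoted here and in the table that follows are the AtomContainer2 throughput divided by the AtomContainer throughput. A small sketch to reproduce them from the raw numbers (the class and method names are made up for this example):

```java
import java.util.Locale;

/** Illustrative helper; the names are invented for this example. */
final class ImprovementSketch {

    /** Ratio of new to legacy throughput, both in molecules per minute. */
    static double improvement(long legacyPerMin, long newPerMin) {
        return (double) newPerMin / legacyPerMin;
    }

    /** Formats a ratio to two decimal places, as used in the table. */
    static String format(double ratio) {
        return String.format(Locale.ROOT, "%.2f", ratio);
    }
}
```

For example, format(improvement(88_345L, 1_956_144L)) gives "22.14" for MorganNumbers, and format(improvement(9_416_331L, 8_504_450L)) gives "0.90" for SmilesParse, i.e. a value below 1 means AtomContainer2 was slower for that measure.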

| Measure | Improvement | AtomContainer (mol/min) | AtomContainer2 (mol/min) |
|---|---|---|---|
| SmilesParse | 0.90 | 9,416,331 | 8,504,450 |
| GraphUtil | 1.24 | 10,772,008 | 13,345,360* |
| AssignAtomTypes | 1.20 | 19,387,600 | 23,313,098 |
| Gen2DLayout | 1.08 | 119,084 | 128,272 |
| PathFingerprint | 1.04 | 18,260,215 | 19,031,537 |
| RingFinder | 1.17 | 5,136,392 | 6,024,809 |
| SpanningTree | 2.08 | 1,337,291 | 2,783,419 |
| ConnectedComponents | 10.64 | 5,958,983 | 63,380,257 |
| MorganNumbers | 22.14 | 88,345 | 1,956,144 |

  • GraphUtil - *note AtomContainer2 makes much of this usage obsolete
int[][] g = GraphUtil.toAdjList(mol, EdgeToBondMap.withSpaceFor(mol));
  • AssignAtomTypes
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
  • Gen2DLayout - modest improvement as most of the algorithms are pre-converted to GraphUtil adjacency list
new StructureDiagramGenerator().generateDiagram(mol);
  • PathFingerprint - modest improvement as most of the algorithms are pre-converted to GraphUtil adjacency list
new Fingerprinter().getBitFingerprint(mol);
  • SpanningTree
new SpanningTree(mol).getCyclicFragmentsContainer();
  • RingFinder
Cycles.markRingAtomsAndBonds(mol);
  • ConnectedComponents
private static void TraversePart(int[] parts, int part, IAtom atom,
                                 IAtomContainer mol) {
    parts[mol.indexOf(atom)] = part;
    for (IBond bond : mol.getConnectedBondsList(atom)) {
        IAtom other = bond.getOther(atom);
        if (parts[mol.indexOf(other)] == 0)
            TraversePart(parts, part, other, mol);
    }
}

private static int ConnectedComponents(int[] parts, IAtomContainer mol) {
    int numParts = 0;
    for (IAtom atom : mol.atoms()) {
        if (parts[mol.indexOf(atom)] == 0)
            TraversePart(parts, ++numParts, atom, mol);
    }
    return numParts;
}
  • MorganNumbers
private static int[] MorganNumbers(IAtomContainer mol) {
    int[] prev = new int[mol.getAtomCount()];
    int[] next = new int[mol.getAtomCount()];
    for (int i = 0; i < mol.getAtomCount(); i++)
        prev[i] = mol.getAtom(i).getAtomicNumber();
    for (int i = 0; i < mol.getAtomCount(); i++) {
        for (int j = 0; j < mol.getAtomCount(); j++) {
            IAtom atom = mol.getAtom(j);
            for (IBond bond : mol.getConnectedBondsList(atom))
                next[j] += prev[mol.indexOf(bond.getOther(atom))];
        }
        System.arraycopy(next, 0, prev, 0, prev.length);
        Arrays.fill(next, 0);
    }
    return prev;
}
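
For comparison, the same Morgan-style summation can be written against the int[][] adjacency list form mentioned earlier, where each neighbour lookup is a plain array access rather than a scan of the bond list. This is a self-contained sketch in plain Java, not CDK code; the class MorganSketch and its input format are assumptions for illustration. With AtomContainer2, atom.bonds() and getIndex() give the existing IAtomContainer API the same kind of constant-time access.

```java
import java.util.Arrays;

/** Illustrative only: Morgan-style numbers over a plain adjacency list. */
final class MorganSketch {

    /**
     * @param adj           adj[v] lists the indices of the atoms bonded to atom v
     * @param atomicNumbers initial invariant for each atom
     * @return the invariants after adj.length rounds of neighbour summation
     */
    static int[] morganNumbers(int[][] adj, int[] atomicNumbers) {
        int n = adj.length;
        int[] prev = Arrays.copyOf(atomicNumbers, n);
        int[] next = new int[n];
        for (int round = 0; round < n; round++) {
            for (int v = 0; v < n; v++) {
                // each neighbour is an O(1) array lookup
                for (int w : adj[v])
                    next[v] += prev[w];
            }
            System.arraycopy(next, 0, prev, 0, n);
            Arrays.fill(next, 0);
        }
        return prev;
    }
}
```

For a three-carbon chain, morganNumbers(new int[][]{{1}, {0, 2}, {1}}, new int[]{6, 6, 6}) returns {12, 24, 12}: the central atom is distinguished from the two equivalent terminal atoms.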

Gotchas

  • Use interfaces rather than concrete types
IAtomContainer mol = builder.newAtomContainer(); // good!
AtomContainer2 mol = (AtomContainer2) builder.newAtomContainer(); // bad! will be renamed in future
  • Use object and not reference equality for atoms and bonds - if (atom.equals(other)) not if (atom == other). See blog post on how to analyse your code for this.
  • Deref custom atom/bond implementations using AtomRef.deref(atom). You can still use these and add them to the AtomContainers, but trying to cast the AtomRef back to your custom implementation will fail
IAtomContainer mol = ...;
mol.addAtom(new MyCustomAtom()); // MyCustomAtom extends Atom
MyCustomAtom myatom = (MyCustomAtom) mol.getAtom(0); // casting the AtomRef, will fail!
MyCustomAtom myatom = (MyCustomAtom) AtomRef.deref(mol.getAtom(0)); // casting the unwrapped atom, will work!
  • Avoid cloning - cloning works but is much more complicated; in general it's bad practice, so don't do it!
  • Avoid mixing implementations - using both AtomContainer and AtomContainer2
  • Add atoms before bonds
IAtomContainer mol = ...;
IAtom a = ..., b = ...;
mol.addBond(new Bond(a, b, IBond.Order.SINGLE)); // Error! a and b not in container
  • Modifying original bonds after adding to a container
IAtomContainer mol = ...;
IAtom a1 = ..., a2 = ...;
IBond b = ...;
mol.addAtom(a1);
mol.addAtom(a2);
mol.addBond(b);
b.setAtoms(a1, a2); // wrong, IAtomContainer doesn't know about the adjacency on these atoms!

To avoid this issue you need to get the bond that the container knows about (the BondRef).

mol.addBond(b);
b = mol.getBond(mol.getBondCount()-1); // get the ref to the last added bond

This is a bit tedious and so the newAtom and newBond methods are preferred when creating atoms/bonds.

IAtomContainer mol = ...;
IAtom a1 = ..., a2 = ...;
IBond b = ...;
a1 = mol.newAtom(a1);
a2 = mol.newAtom(a2);
b = mol.newBond(a1, a2);
b.setAtoms(a2, a1); // now okay as we're using the 'boxed' IBond instance but still O(2N) insertion (see below)

Performance tips

IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom          a1  = builder.newAtom();
IAtom          a2  = builder.newAtom();
IBond          b   = builder.newBond();
b.setAtoms(new IAtom[]{a1, a2});
mol.addAtom(a1);
mol.addAtom(a2);
mol.addBond(b); // the container needs to scan the atoms and find the appropriate reference (O(N))

Creating the atoms and bonds through the container avoids that scan:

IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom a1 = mol.newAtom();
IAtom a2 = mol.newAtom();
mol.newBond(a1, a2); // O(1) insertion

The longer version of the above is as follows:

IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom          a1  = builder.newAtom();
IAtom          a2  = builder.newAtom();
IBond          b   = builder.newBond();
mol.addAtom(a1);
mol.addAtom(a2);
b.setAtoms(new IAtom[]{mol.getAtom(0), mol.getAtom(1)});
mol.addBond(b); // O(1) insertion