AtomContainer2
CDK v2.1 brings in a large functional change to its core molecule representation. The primary interface for handling connection tables is still IAtomContainer, however a new implementation has been added that provides useful methods for writing efficient algorithms. These methods are also used within IAtomContainer to speed up current operations.
Atoms and bonds now know their index in the 'parent' atom container and can be accessed in O(1) time. If an atom/bond is not accessed through a container then the index will be -1.
IAtomContainer mol = ...;
int aidx = mol.getAtom(0).getIndex();
int bidx = mol.getBond(0).getIndex();
Each atom knows and tracks its adjacency information, again providing O(1) access to the bond count and an iterator over the connected bonds. In previous versions, accessing the connected bonds via mol.getConnectedBondsList(atom) did an O(N) scan over the entire bond list. Attempting to call these methods on an atom not accessed through an AtomContainer will throw an UnsupportedOperationException.
IAtomContainer mol = ...;
IAtom atom = mol.getAtom(0);
int deg = atom.getBondCount();
for (IBond bond : atom.bonds()) {
    IAtom nbr = bond.getOther(atom);
}
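The bookkeeping described above can be pictured with a small stand-alone model. This is only a sketch under stated assumptions and not CDK code: ToyMol, ToyAtom and ToyBond are hypothetical names, and the real AtomContainer2 internals may well differ. Each atom stores its own index and bond list, so the degree and neighbours are available without scanning the container.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model (not the CDK implementation).
class ToyAdjacencyDemo {
    public static void main(String[] args) {
        ToyMol mol = new ToyMol();
        ToyAtom c = mol.newAtom(), o = mol.newAtom(), n = mol.newAtom();
        mol.newBond(c, o);
        mol.newBond(c, n);
        System.out.println("index=" + c.index + " degree=" + c.getBondCount());
        for (ToyBond bond : c.bonds)
            System.out.println("neighbour index: " + bond.getOther(c).index);
    }
}

class ToyAtom {
    int index = -1;          // -1 until the atom is created through a container
    ToyMol container;        // null until the atom is created through a container
    final List<ToyBond> bonds = new ArrayList<>();

    int getBondCount() {
        if (container == null)
            throw new UnsupportedOperationException("atom is not in a container");
        return bonds.size(); // no scan over the container's bond list
    }
}

class ToyBond {
    final ToyAtom beg, end;
    int index = -1;

    ToyBond(ToyAtom beg, ToyAtom end) {
        this.beg = beg;
        this.end = end;
    }

    ToyAtom getOther(ToyAtom atom) {
        return atom == beg ? end : beg;
    }
}

class ToyMol {
    final List<ToyAtom> atoms = new ArrayList<>();
    final List<ToyBond> bondList = new ArrayList<>();

    ToyAtom newAtom() {
        ToyAtom atom = new ToyAtom();
        atom.index = atoms.size();
        atom.container = this;
        atoms.add(atom);
        return atom;
    }

    ToyBond newBond(ToyAtom beg, ToyAtom end) {
        ToyBond bond = new ToyBond(beg, end);
        bond.index = bondList.size();
        bondList.add(bond);
        beg.bonds.add(bond); // adjacency is updated as the bond is added
        end.bonds.add(bond);
        return bond;
    }
}
```

The design choice illustrated is that the cost of maintaining adjacency is paid once at insertion, which matches the write/read trade-off reported in the benchmarks.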
Atoms and bonds also contain a reference to the container they belong to. If an atom/bond has no parent the returned value will be null.
IAtomContainer mol = ...;
IAtomContainer parent = mol.getAtom(0).getContainer();
A challenge in the CDK API is that atoms and bonds can appear in multiple containers. This is still the case with the new implementation, but it introduces a major caveat: the atom put into the container is not the same reference that comes out. However, the two can affect each other's properties. Essentially each atom is boxed up within an indirect wrapper. Study the following piece of code carefully:
IAtom atom = ...;
IAtomContainer mol1 = ...;
IAtomContainer mol2 = ...;
mol1.addAtom(atom);
mol2.addAtom(atom);
atom.getContainer(); // still null
mol1.getAtom(0).getContainer(); // mol1
mol2.getAtom(0).getContainer(); // mol2
mol1.getAtom(0) == atom; // false
mol2.getAtom(0) == atom; // false
mol1.getAtom(0).equals(atom); // true
mol2.getAtom(0).equals(atom); // true
mol1.getAtom(0).setFormalCharge(-1); // changes in all cases
atom.getFormalCharge() == -1; // true!
mol2.getAtom(0).getFormalCharge() == -1; // true!
Further discussion on safe usage is provided later in the Gotchas section.
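One way to picture the boxing shown in the code above is a thin wrapper that delegates to the original atom. The following is a minimal sketch assuming that design, not the CDK source: SketchAtom, PlainAtom, BoxedAtom and SketchMol are hypothetical names standing in for IAtom, Atom, AtomRef and the container.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the 'boxed' atom idea (not the CDK classes).
class BoxingDemo {
    public static void main(String[] args) {
        PlainAtom atom = new PlainAtom();
        SketchMol mol1 = new SketchMol(), mol2 = new SketchMol();
        mol1.addAtom(atom);
        mol2.addAtom(atom);
        SketchAtom ref1 = mol1.getAtom(0), ref2 = mol2.getAtom(0);
        System.out.println(ref1 == atom);           // false, different references
        System.out.println(ref1.equals(atom));      // true, same underlying atom
        ref1.setFormalCharge(-1);                   // written through to the original
        System.out.println(atom.getFormalCharge()); // -1
        System.out.println(ref2.getFormalCharge()); // -1
    }
}

interface SketchAtom {
    Integer getFormalCharge();
    void setFormalCharge(Integer charge);
    SketchMol getContainer();
}

final class PlainAtom implements SketchAtom {
    private Integer charge = 0;
    public Integer getFormalCharge() { return charge; }
    public void setFormalCharge(Integer charge) { this.charge = charge; }
    public SketchMol getContainer() { return null; } // a bare atom has no parent
    @Override public boolean equals(Object obj) {
        return obj instanceof SketchAtom && this == BoxedAtom.unbox((SketchAtom) obj);
    }
    @Override public int hashCode() { return System.identityHashCode(this); }
}

final class BoxedAtom implements SketchAtom {
    private final PlainAtom atom; // shared state lives in the wrapped atom
    private final SketchMol mol;  // the parent is stored per wrapper

    BoxedAtom(PlainAtom atom, SketchMol mol) {
        this.atom = atom;
        this.mol = mol;
    }

    // Returns the underlying atom, unwrapping if needed.
    static PlainAtom unbox(SketchAtom a) {
        return a instanceof BoxedAtom ? ((BoxedAtom) a).atom : (PlainAtom) a;
    }

    public Integer getFormalCharge() { return atom.getFormalCharge(); }
    public void setFormalCharge(Integer charge) { atom.setFormalCharge(charge); }
    public SketchMol getContainer() { return mol; }
    @Override public boolean equals(Object obj) {
        return obj instanceof SketchAtom && atom == unbox((SketchAtom) obj);
    }
    @Override public int hashCode() { return atom.hashCode(); }
}

final class SketchMol {
    private final List<BoxedAtom> atoms = new ArrayList<>();
    void addAtom(SketchAtom atom) { atoms.add(new BoxedAtom(BoxedAtom.unbox(atom), this)); }
    SketchAtom getAtom(int idx) { return atoms.get(idx); }
}
```

Because each wrapper carries its own parent but shares the wrapped atom's state, getContainer() differs per container while property changes are visible everywhere.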
To begin with, v2.1 will still default to AtomContainer. Even with AtomContainer2 activated you should not notice any differences (please open an issue if you do). To activate the AtomContainer2 implementation, the environment variable or system property CdkUseLegacyAtomContainer should be set to false.
$ CdkUseLegacyAtomContainer=0 java -jar myapp.jar
$ java -DCdkUseLegacyAtomContainer=false -jar myapp.jar
The ChemObjectBuilder instance will then create AtomContainer2s.
IAtomContainer mol = SilentChemObjectBuilder.getInstance().newAtomContainer();
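How such a flag might be resolved can be sketched as follows. This is an illustration only: the helper names parseFlag and useLegacy, the precedence of the system property over the environment variable, and the exact set of accepted spellings are assumptions, not a description of how CDK reads the flag.

```java
import java.util.Locale;

// Hypothetical helper showing one way a boolean flag could be resolved.
class LegacyFlagDemo {
    static final String FLAG = "CdkUseLegacyAtomContainer";

    public static void main(String[] args) {
        // 'true' mirrors the planned 2.1 default (legacy AtomContainer)
        System.out.println("use legacy container: " + useLegacy(true));
    }

    // Interprets a raw value; unset or blank falls back to the default.
    static boolean parseFlag(String raw, boolean dflt) {
        if (raw == null || raw.trim().isEmpty())
            return dflt;
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "t": case "true": case "1":
                return true;
            case "f": case "false": case "0":
                return false;
            default:
                throw new IllegalArgumentException("Unrecognised value: " + raw);
        }
    }

    // Assumption: the system property (-D...) wins over the environment variable.
    static boolean useLegacy(boolean dflt) {
        String raw = System.getProperty(FLAG);
        if (raw == null)
            raw = System.getenv(FLAG);
        return parseFlag(raw, dflt);
    }
}
```

Note that -D options only become system properties when they appear before -jar on the java command line; anything after the jar name is passed to the application as an ordinary argument.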
Planned default values for release schedule:
- 2.1 CdkUseLegacyAtomContainer=t (AtomContainer is default)
- 2.2 CdkUseLegacyAtomContainer=f (AtomContainer2 is default) - pushed back if issues are identified by the community
- 2.x ...
- 3.0 (original AtomContainer removed)
Throughput in mol per min (measured on ChEMBL 23). AtomContainer2 has a slight trade-off in write performance (i.e. creation from SMILES) for an improvement in read performance (i.e. accessing connection info of the structure). The gains for existing operations are modest because many have previously been optimised with foreknowledge of where the slow parts are. This often involved transforming the IAtomContainer into an int[][] graph, a step which is also benchmarked but no longer needed (indicated by '*').
The largest gains are seen for ConnectedComponents, MorganNumbers, and SpanningTree. These are the true use cases, where the algorithms use the existing IAtomContainer API without special handling or workarounds (more explanation here). By only changing the flag CdkUseLegacyAtomContainer to 0, a 10x-22x speed up can be achieved for ConnectedComponents and MorganNumbers. To put this in perspective, running MorganNumbers over the ChEMBL 23 SMILES goes from taking ~20 min to taking just over 1 min. The use in the SpanningTree class sees an improvement of ~2x. Note the SpanningTree algorithm is actually deprecated and replaced with RingFinder, but the demonstration still stands.
Measure | Improvement (ratio) | AtomContainer (mol/min) | AtomContainer2 (mol/min) |
---|---|---|---|
SmilesParse | 0.90 | 9,416,331 | 8,504,450 |
GraphUtil | 1.24 | 10,772,008 | 13,345,360* |
AssignAtomTypes | 1.20 | 19,387,600 | 23,313,098 |
Gen2DLayout | 1.08 | 119,084 | 128,272 |
PathFingerprint | 1.04 | 18,260,215 | 19,031,537 |
RingFinder | 1.17 | 5,136,392 | 6,024,809 |
SpanningTree | 2.08 | 1,337,291 | 2,783,419 |
ConnectedComponents | 10.64 | 5,958,983 | 63,380,257 |
MorganNumbers | 22.14 | 88,345 | 1,956,144 |
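
The Improvement column appears to be the AtomContainer2 throughput divided by the AtomContainer throughput, so values below 1 indicate a slowdown. A short sketch that recomputes a few rows from the figures in the table (the class and method names are made up for this illustration):

```java
import java.util.Locale;

// Recomputes the improvement ratio from the throughputs listed in the table.
class ImprovementDemo {
    public static void main(String[] args) {
        System.out.println("SmilesParse         " + fmt(improvement(9_416_331L, 8_504_450L)));
        System.out.println("SpanningTree        " + fmt(improvement(1_337_291L, 2_783_419L)));
        System.out.println("ConnectedComponents " + fmt(improvement(5_958_983L, 63_380_257L)));
        System.out.println("MorganNumbers       " + fmt(improvement(88_345L, 1_956_144L)));
    }

    // ratio of new to legacy throughput, both in mol per min
    static double improvement(long legacyMolPerMin, long newMolPerMin) {
        return (double) newMolPerMin / legacyMolPerMin;
    }

    // two decimal places, locale independent
    static String fmt(double ratio) {
        return String.format(Locale.ROOT, "%.2f", ratio);
    }
}
```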
- GraphUtil - *note AtomContainer2 makes much of this usage obsolete
int[][] g = GraphUtil.toAdjList(mol, EdgeToBondMap.withSpaceFor(mol));
- AssignAtomTypes
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol);
- Gen2DLayout - modest improvement as most of the algorithms already convert to a GraphUtil adjacency list
new StructureDiagramGenerator().generateDiagram(mol);
- PathFingerprint - modest improvement as most of the algorithms already convert to a GraphUtil adjacency list
new Fingerprinter().getBitFingerprint(mol);
- SpanningTree
new SpanningTree(mol).getCyclicFragmentsContainer();
- RingFinder
Cycles.markRingAtomsAndBonds(mol);
- ConnectedComponents
private static void TraversePart(int[] parts, int part, IAtom atom,
                                 IAtomContainer mol) {
    parts[mol.indexOf(atom)] = part;
    for (IBond bond : mol.getConnectedBondsList(atom)) {
        IAtom other = bond.getOther(atom);
        if (parts[mol.indexOf(other)] == 0)
            TraversePart(parts, part, other, mol);
    }
}
private static int ConnectedComponents(int[] parts, IAtomContainer mol) {
    int numParts = 0;
    for (IAtom atom : mol.atoms()) {
        if (parts[mol.indexOf(atom)] == 0)
            TraversePart(parts, ++numParts, atom, mol);
    }
    return numParts;
}
- MorganNumbers
private static int[] MorganNumbers(IAtomContainer mol) {
    int[] prev = new int[mol.getAtomCount()];
    int[] next = new int[mol.getAtomCount()];
    for (int i = 0; i < mol.getAtomCount(); i++)
        prev[i] = mol.getAtom(i).getAtomicNumber();
    for (int i = 0; i < mol.getAtomCount(); i++) {
        for (int j = 0; j < mol.getAtomCount(); j++) {
            IAtom atom = mol.getAtom(j);
            for (IBond bond : mol.getConnectedBondsList(atom))
                next[j] += prev[mol.indexOf(bond.getOther(atom))];
        }
        System.arraycopy(next, 0, prev, 0, prev.length);
        Arrays.fill(next, 0);
    }
    return prev;
}
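
The ConnectedComponents and MorganNumbers benchmarks call getConnectedBondsList and indexOf inside loops, so where those calls are linear scans the total work grows roughly quadratically with molecule size. The toy count below illustrates the difference on a linear chain of atoms. It is not a measurement of CDK: the class name, the int[]{beg, end} bond encoding and the assumption of at least one atom are all choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Counts bond entries examined when finding every atom's connected bonds.
class ScanCostDemo {
    public static void main(String[] args) {
        int n = 1000;
        int[][] bonds = chain(n);
        System.out.println("list scan: " + degreesByScan(n, bonds, new int[n]));
        System.out.println("adjacency: " + degreesByAdjacency(n, bonds, new int[n]));
    }

    // linear chain 0-1-2-...-(n-1); assumes n >= 1
    static int[][] chain(int n) {
        int[][] bonds = new int[n - 1][];
        for (int i = 0; i < n - 1; i++)
            bonds[i] = new int[]{i, i + 1};
        return bonds;
    }

    // every lookup walks the whole bond list
    static long degreesByScan(int n, int[][] bonds, int[] deg) {
        long examined = 0;
        for (int atom = 0; atom < n; atom++) {
            for (int[] bond : bonds) {
                examined++;
                if (bond[0] == atom || bond[1] == atom)
                    deg[atom]++;
            }
        }
        return examined;
    }

    // every lookup reads only the atom's own bonds
    static long degreesByAdjacency(int n, int[][] bonds, int[] deg) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++)
            adj.add(new ArrayList<>());
        for (int[] bond : bonds) { // built once, as bonds are added
            adj.get(bond[0]).add(bond);
            adj.get(bond[1]).add(bond);
        }
        long examined = 0;
        for (int atom = 0; atom < n; atom++) {
            deg[atom] = adj.get(atom).size();
            examined += deg[atom];
        }
        return examined;
    }
}
```

For a chain of 1000 atoms the scan examines 1000 x 999 bond entries while the adjacency lookup examines 2 x 999, which is the kind of gap the two benchmarks above expose.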
- Use interfaces rather than concrete types
IAtomContainer mol = builder.newAtomContainer(); // good!
AtomContainer2 mol = (AtomContainer2) builder.newAtomContainer(); // bad! will be renamed in future
- Use object and not reference equality for atoms and bonds - if (atom.equals(other)) not if (atom == other). See blog post on how to analyse your code for this.
- Deref custom atom/bond implementations using AtomRef.deref(atom). You can still use and add these to the AtomContainers, but trying to cast the AtomRef back to your custom implementation will fail.
IAtomContainer mol = ...;
mol.addAtom(new MyCustomAtom()); // MyCustomAtom extends Atom
MyCustomAtom myatom = (MyCustomAtom) mol.getAtom(0); // unchecked cast, will fail!
MyCustomAtom myatom = (MyCustomAtom) AtomRef.deref(mol.getAtom(0)); // unchecked cast, will work!
- Avoid cloning - cloning works but is much more complicated; in general it's bad practice, so don't do it!
- Avoid mixing implementations - using both AtomContainer and AtomContainer2
- Add atoms before bonds
IAtomContainer mol = ...;
IAtom a = ..., b = ...;
mol.addBond(new Bond(a, b, IBond.Order.SINGLE)); // Error! a and b not in container
- Avoid modifying the original bonds after adding them to a container
IAtomContainer mol = ...;
IAtom a1 = ..., a2 = ...;
IBond b = ...;
mol.addAtom(a1);
mol.addAtom(a2);
mol.addBond(b);
b.setAtoms(a1, a2); // wrong, IAtomContainer doesn't know about the adjacency on these atoms!
To avoid this issue you need to get the bond that the container knows about (the BondRef).
mol.addBond(b);
b = mol.getBond(mol.getBondCount()-1); // get the ref to the last added bond
This is a bit tedious and so the newAtom and newBond methods are preferred when creating atoms/bonds.
IAtomContainer mol = ...;
IAtom a1 = ..., a2 = ...;
IBond b = ...;
a1 = mol.newAtom(a1);
a2 = mol.newAtom(a2);
b = mol.newBond(a1, a2);
b.setAtoms(a2, a1); // now okay as we're using the 'boxed' IBond instance but still O(2N) insertion (see below)
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom a1 = builder.newAtom();
IAtom a2 = builder.newAtom();
IBond b = builder.newBond();
b.setAtoms(new IAtom[]{a1, a2});
mol.addAtom(a1);
mol.addAtom(a2);
mol.addBond(b); // the container needs to scan the atoms and find the appropriate reference (O(N))
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom a1 = mol.newAtom();
IAtom a2 = mol.newAtom();
mol.newBond(a1, a2); // O(1) insertion
The longer version of the above is as follows:
IChemObjectBuilder builder = SilentChemObjectBuilder.getInstance();
IAtomContainer mol = builder.newAtomContainer();
IAtom a1 = builder.newAtom();
IAtom a2 = builder.newAtom();
IBond b = builder.newBond();
mol.addAtom(a1);
mol.addAtom(a2);
b.setAtoms(new IAtom[]{mol.getAtom(0), mol.getAtom(1)});
mol.addBond(b); // O(1) insertion