2. Chemical Example Usage

Jonas Schaub edited this page Jun 2, 2023 · 1 revision

The intended use case of the fragment fingerprinter functionality is to encode the presence and absence of substructures in a given molecule that result from a molecular fragmentation study, i.e. the algorithmic extraction of specific substructures from input molecules. These substructures are automatically extracted and can be represented by different string-based molecular structure encodings, like SMILES or InChI. Other key-based substructure fingerprint functionalities require SMARTS strings as inputs and are therefore not as ubiquitously applicable as the fragment fingerprint for this purpose.
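The idea of a string-keyed fingerprint can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the library's implementation: the class `KeyFingerprintSketch` and its `encode` method are hypothetical names, and a `java.util.BitSet` stands in for a real fingerprint object. The key strings are three fragment SMILES taken from the example further down.

```java
import java.util.BitSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class, not part of CDK or the fragment fingerprinter library
class KeyFingerprintSketch {

    /** Sets bit i if the i-th key string occurs in the set of fragment strings of a molecule. */
    static BitSet encode(List<String> keys, Set<String> moleculeFragments) {
        BitSet bits = new BitSet(keys.size());
        for (int i = 0; i < keys.size(); i++) {
            // Matching is plain string equality, no substructure search is performed
            if (moleculeFragments.contains(keys.get(i))) {
                bits.set(i);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("O=CO", "ONO", "C=1C=CC=2C=CC=CC2C1");
        Set<String> fragments = Set.of("O=CO", "OC=1C=CC=2C=CC=CC2C1");
        System.out.println(encode(keys, fragments)); // prints {0}

        // "OC=O" also describes formic acid, but it is a different string than the key "O=CO"
        System.out.println(encode(keys, Set.of("OC=O"))); // prints {}
    }
}
```

The second call shows why the fragment strings should be canonical: two different SMILES strings for the same structure would not be recognised as the same key by a string comparison.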

In the following, a molecular structure data set is imported that contains 100 natural products with a naphthalene substructure taken from the COCONUT natural products database. These are fragmented using the CDK ExhaustiveFragmenter functionality that breaks single non-ring bonds in input molecules to generate fragments. The resulting fragments are collected together with their frequencies as unique SMILES representations. Fragments that occur more than two times are then used to initialise the fragment fingerprinter. At the end, the "naphthalene-derivatives exhaustive fragmenter fingerprint" is generated for 3-hydroxy-2-naphthoic acid.

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import org.openscience.cdk.fragment.ExhaustiveFragmenter;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.io.iterator.IteratingSDFReader;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmiFlavor;
import org.openscience.cdk.smiles.SmilesGenerator;

//Forward slashes are used in the path because unescaped backslashes are not valid in Java string literals
InputStream tmpInputStream = new FileInputStream("/path/to/coconut_naphthalene_substructure_search_result.sdf");
IteratingSDFReader tmpSDFReader = new IteratingSDFReader(tmpInputStream, SilentChemObjectBuilder.getInstance());
//This fragmentation scheme simply breaks single non-ring bonds.
ExhaustiveFragmenter tmpFragmenter = new ExhaustiveFragmenter();
//Default would be 6 which is too high for the short side chains in the input molecules
tmpFragmenter.setMinimumFragmentSize(1);
//ExhaustiveFragmenter has a convenience method .getFragments() that returns the generated fragments already as
// unique SMILES strings, but to be explicit here, the fragments are retrieved as atom containers and unique
// SMILES strings created in a second step. Also note that any other string-based molecular structure representation
// like InChI could be used instead, but it should be canonical.
SmilesGenerator tmpSmiGen = new SmilesGenerator(SmiFlavor.Unique);
HashMap<String, Integer> tmpFrequenciesMap = new HashMap<>(50, 0.75f);
while (tmpSDFReader.hasNext()) {
    IAtomContainer tmpMolecule = tmpSDFReader.next();
    tmpFragmenter.generateFragments(tmpMolecule);
    IAtomContainer[] tmpFragments = tmpFragmenter.getFragmentsAsContainers();
    for (IAtomContainer tmpFragment : tmpFragments) {
        String tmpSmilesCode = tmpSmiGen.create(tmpFragment);
        //Inserts 1 for a new fragment, otherwise increments the stored frequency
        tmpFrequenciesMap.merge(tmpSmilesCode, 1, Integer::sum);
    }
}
tmpSDFReader.close();
//Printing size of fragment set and all the fragment SMILES with their frequencies
System.out.println(tmpFrequenciesMap.size());
for (String tmpFragmentSmilesCode : tmpFrequenciesMap.keySet()) {
    System.out.println(tmpFragmentSmilesCode + ": " + tmpFrequenciesMap.get(tmpFragmentSmilesCode));
}

Output:
28
BrC1=CC=CC=2C=CC=CC12: 4
BrC=1C=CC2=CC(O)=CC=C2C1: 1
OC1=C[CH](OC)=CC=2C=CC=CC12: 1
BrC1=CC=CC2=[C]C=CC=C12: 1
BrC1=CC=CC=2C=[C]C=CC12: 1
O=CC: 1
O[NH](O)[CH]1=CC=CC=2C=CC=CC21: 1
O=CCl: 1
OC=1C=CC=2C=CC=CC2C1: 6
BrC1=CC=CC=2C=C(C=CC12)C: 1
ON=[CH3]: 2
ONO: 3
OC1=CC=CC=2C=CC=CC12: 4
NC1=CC=CC=2C=CC=CC12: 1
O=C[CH]1=CC=CC=2C=CC=CC21: 2
O=CO: 8
ON=C: 1
O=[S](=O)O: 5
C=1C=CC=2C=CC=CC2C1: 20
C=1C=CC=2C=C(C=CC2C1)C: 2
OC1=CC=CC=2C1=CC=CC2C: 1
O=N[CH]1=CC=C(O)C=2C=CC=CC21: 1
OC=1C=2C=CC=CC2C=CC1C: 1
[CH2][CH]=1C=CC=2C=CC=CC2C1: 1
BrC1=CC=CC=2C1=CC=CC2C: 1
OC1=CC=C(O)C=2C=CC=CC12: 2
C=1C=CC2=C(C1)C=CC=C2C: 1
O=COC: 1

//Collecting fragments that appear more than 2 times
//(uses java.util.ArrayList and java.util.List)
List<String> tmpFragmentsList = new ArrayList<>(28);
for (String tmpFragment : tmpFrequenciesMap.keySet()) {
    if (tmpFrequenciesMap.get(tmpFragment) > 2) {
        tmpFragmentsList.add(tmpFragment);
    }
}
//Initialising fingerprinter
FragmentFingerprinter tmpNaphthaleneFingerprinter = new FragmentFingerprinter(tmpFragmentsList);
System.out.println(tmpNaphthaleneFingerprinter.getSize());

Output: 7

Only 7 out of the 28 fragments appear more than 2 times and are included in the fingerprint (see above).
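The count of 7 can be checked without CDK by applying the frequency threshold to the printed frequency table. The following is a minimal sketch in plain Java; the class `FrequencyFilterSketch` and its methods are hypothetical names introduced only for this illustration, and the map is filled by hand with the 28 entries from the output above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration class, not part of CDK or the fragment fingerprinter library
class FrequencyFilterSketch {

    /** Returns all keys whose frequency is strictly greater than the given threshold. */
    static List<String> moreFrequentThan(Map<String, Integer> frequencies, int threshold) {
        return frequencies.entrySet().stream()
                .filter(entry -> entry.getValue() > threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** The fragment frequencies as printed in the output above, in the printed order. */
    static Map<String, Integer> printedFrequencies() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("BrC1=CC=CC=2C=CC=CC12", 4);
        map.put("BrC=1C=CC2=CC(O)=CC=C2C1", 1);
        map.put("OC1=C[CH](OC)=CC=2C=CC=CC12", 1);
        map.put("BrC1=CC=CC2=[C]C=CC=C12", 1);
        map.put("BrC1=CC=CC=2C=[C]C=CC12", 1);
        map.put("O=CC", 1);
        map.put("O[NH](O)[CH]1=CC=CC=2C=CC=CC21", 1);
        map.put("O=CCl", 1);
        map.put("OC=1C=CC=2C=CC=CC2C1", 6);
        map.put("BrC1=CC=CC=2C=C(C=CC12)C", 1);
        map.put("ON=[CH3]", 2);
        map.put("ONO", 3);
        map.put("OC1=CC=CC=2C=CC=CC12", 4);
        map.put("NC1=CC=CC=2C=CC=CC12", 1);
        map.put("O=C[CH]1=CC=CC=2C=CC=CC21", 2);
        map.put("O=CO", 8);
        map.put("ON=C", 1);
        map.put("O=[S](=O)O", 5);
        map.put("C=1C=CC=2C=CC=CC2C1", 20);
        map.put("C=1C=CC=2C=C(C=CC2C1)C", 2);
        map.put("OC1=CC=CC=2C1=CC=CC2C", 1);
        map.put("O=N[CH]1=CC=C(O)C=2C=CC=CC21", 1);
        map.put("OC=1C=2C=CC=CC2C=CC1C", 1);
        map.put("[CH2][CH]=1C=CC=2C=CC=CC2C1", 1);
        map.put("BrC1=CC=CC=2C1=CC=CC2C", 1);
        map.put("OC1=CC=C(O)C=2C=CC=CC12", 2);
        map.put("C=1C=CC2=C(C1)C=CC=C2C", 1);
        map.put("O=COC", 1);
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> frequencies = printedFrequencies();
        System.out.println(frequencies.size());                      // prints 28
        System.out.println(moreFrequentThan(frequencies, 2).size()); // prints 7
        System.out.println(moreFrequentThan(frequencies, 1).size()); // prints 11
    }
}
```

The last line illustrates how sensitive the fingerprint size is to the threshold: including the fragments that occur exactly twice would yield 11 keys instead of 7.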

//Parsing 3-hydroxy-2-naphthoic acid, fragmenting it, and creating its fingerprint
//(uses org.openscience.cdk.smiles.SmilesParser and org.openscience.cdk.fingerprint.IBitFingerprint)
String tmpCNP0437667SmilesString = "O=C(O)C1=CC=2C=CC=CC2C=C1O"; //3-hydroxy-2-naphthoic acid
SmilesParser tmpSmiPar = new SmilesParser(SilentChemObjectBuilder.getInstance());
tmpFragmenter.generateFragments(tmpSmiPar.parseSmiles(tmpCNP0437667SmilesString));
IAtomContainer[] tmpFragments = tmpFragmenter.getFragmentsAsContainers();
List<String> tmpCNP0437667Fragments = new ArrayList<>(10);
for (IAtomContainer tmpFragment : tmpFragments) {
    tmpCNP0437667Fragments.add(tmpSmiGen.create(tmpFragment));
}
IBitFingerprint tmpCNP0437667BitFP = tmpNaphthaleneFingerprinter.getBitFingerprint(tmpCNP0437667Fragments);
//Printing each bit definition (fragment SMILES) together with the value of the corresponding bit
for (int i = 0; i < tmpNaphthaleneFingerprinter.getSize(); i++) {
    System.out.println(tmpNaphthaleneFingerprinter.getBitDefinition(i) + ": " + tmpCNP0437667BitFP.get(i));
}

Output:
BrC1=CC=CC=2C=CC=CC12: false
OC=1C=CC=2C=CC=CC2C1: true
ONO: false
OC1=CC=CC=2C=CC=CC12: false
O=CO: true
O=[S](=O)O: false
C=1C=CC=2C=CC=CC2C1: false

3-hydroxy-2-naphthoic acid contains the formic acid (O=CO) and naphthol (OC=1C=CC=2C=CC=CC2C1) fragments, so only these two bits are set. It does not produce a naphthalene fragment because, according to the CDK ExhaustiveFragmenter, the hydroxy fragment is too small to be considered on its own.
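Reading the fingerprint back into fragment structures is a simple index lookup. The sketch below assumes the seven bit definitions in the order printed above and uses a plain `java.util.BitSet` with bits 1 and 4 set as a stand-in for the `IBitFingerprint`; the class `FingerprintReadoutSketch` and its method are hypothetical names for this illustration only.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration class, not part of CDK or the fragment fingerprinter library
class FingerprintReadoutSketch {

    /** Returns the definitions (fragment strings) of all bits that are set. */
    static List<String> presentFragments(List<String> bitDefinitions, BitSet bits) {
        List<String> present = new ArrayList<>();
        // Standard BitSet idiom for iterating over the set bits only
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            present.add(bitDefinitions.get(i));
        }
        return present;
    }

    public static void main(String[] args) {
        // Bit definitions in the order of the output above (index 0 to 6)
        List<String> bitDefinitions = List.of(
                "BrC1=CC=CC=2C=CC=CC12",
                "OC=1C=CC=2C=CC=CC2C1",
                "ONO",
                "OC1=CC=CC=2C=CC=CC12",
                "O=CO",
                "O=[S](=O)O",
                "C=1C=CC=2C=CC=CC2C1");
        // Stand-in for the fingerprint of 3-hydroxy-2-naphthoic acid: bits 1 and 4 are true
        BitSet bits = new BitSet(bitDefinitions.size());
        bits.set(1);
        bits.set(4);
        System.out.println(presentFragments(bitDefinitions, bits));
        // prints [OC=1C=CC=2C=CC=CC2C1, O=CO]
    }
}
```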
