# Parsing and searching the COCONUT database for Natural Products

[COCONUT](https://coconut.naturalproducts.net) is a database of natural products. It stores each compound's structure, names, and computed chemical properties. In this notebook, we parse a BSON dump of the COCONUT MongoDB database and search for natural products by name or synonym.

Each COCONUT document contains, among others, the following fields:

1. **_id**: MongoDB's unique object identifier for each document.
2. **coconut_id**: A unique identifier specific to the COCONUT database.
3. **contains_sugar**: A boolean flag indicating the presence of sugar in the compound.
4. **heavy_atom_number**: The number of heavy atoms (non-hydrogen) in the molecule.
5. **inchi**: The International Chemical Identifier (InChI) string.
6. **inchikey**: A hash of the InChI string, used for quick database searches.
7. **smiles**: The Simplified Molecular Input Line Entry System string describing the molecule's structure.
8. **unique_smiles**: A canonical SMILES string that uniquely identifies the molecule.
9. **clean_smiles**: A cleaned-up version of the SMILES string.
10. **sugar_free_smiles**: SMILES string of the molecule without any sugar moieties.
11. **deep_smiles**: The DeepSMILES representation, a SMILES variant designed for machine-learning applications.
12. **name**: Common name of the chemical compound.
13. **nameTrustLevel**: A numerical trust level associated with the provided name.
14. **annotationLevel**: A numerical level indicating the depth of annotation data available.
15. **synonyms**: A list of other names by which the compound is known.
16. **cas**: The Chemical Abstracts Service registry number.
17. **iupac_name**: The systematic name according to the International Union of Pure and Applied Chemistry.
18. **contains_ring_sugars**: A boolean indicating if ring sugars are present.
19. **contains_linear_sugars**: A boolean indicating if linear sugars are present.
20. **collection**: A list indicating collections or categories the compound belongs to.
21. **molecular_formula**: The molecular formula of the compound.
22. **molecular_weight**: The calculated molecular weight of the compound.
23. **geoLocation**: A list of geographic locations associated with the compound.
24. **npl_noh_score**, **npl_score**, **npl_sugar_score**: Various scores related to natural product likeness.
25. **number_of_carbons**, **number_of_nitrogens**, **number_of_oxygens**: Counts of specific atom types.
26. **max_number_of_rings**, **min_number_of_rings**: The range of ring structures in the compound.
27. **sugar_free_heavy_atom_number**: Heavy atom count excluding sugars.
28. **sugar_free_total_atom_number**, **total_atom_number**: Counts of total atoms, with and without sugars.
29. **bond_count**: The total number of chemical bonds in the molecule.
30. **found_in_databases**: A list of databases where this compound is registered.
31. **xrefs**: External references with IDs and links to databases.
32. **fragments**: A dictionary describing molecular fragments present.
33. **fragmentsWithSugar**: Descriptions of molecular fragments including sugar components.
34. **murko_framework**: The Murcko scaffold which represents the core structure.
35. **ertlFunctionalFragments**, **ertlFunctionalFragmentsPseudoSmiles**: Functional fragment descriptors.
36. **pubchemFingerprint**, **pfCounts**: A compact binary fingerprint and count information for PubChem.
37. **circularFingerprint**: Circular fingerprinting typically used for structural similarity.
38. **extendedFingerprint**: Extended fingerprint data for more detailed structural information.
39. **alogp**, **alogp2**, **amralogp**, **apol**, **bcutDescriptor**, **bpol**: Various computed chemical properties and descriptors.
40. **eccentricConnectivityIndexDescriptor**, **fmfDescriptor**, **fsp3**: More molecular descriptors related to shape, connectivity, and saturation.
41. **fragmentComplexityDescriptor**, **gravitationalIndexHeavyAtoms**: Complexity and gravitational index descriptors.
42. **hBondAcceptorCount**, **hBondDonorCount**: Count of hydrogen bond acceptors and donors.
43. **hybridizationRatioDescriptor**, **kappaShapeIndex1**, **kappaShapeIndex2**, **kappaShapeIndex3**: Hybridization and kappa shape indices.
44. **manholdlogp**, **petitjeanNumber**, **petitjeanShapeTopo**, **petitjeanShapeGeom**: LogP values, Petitjean number, and shape descriptors.
45. **lipinskiRuleOf5Failures**: Number of violations of Lipinski's Rule of 5.
46. **numberSpiroAtoms**, **vabcDescriptor**, **vertexAdjMagnitude**: Count of spiro atoms and vertex adjacency magnitude.
47. **weinerPathNumber**, **weinerPolarityNumber**: Wiener path and polarity numbers.
48. **xlogp**, **zagrebIndex**, **topoPSA**: LogP value, Zagreb index, and topological polar surface area.
49. **tpsaEfficiency**: Efficiency of the topological polar surface area.
50. **_class**: Classification path in the COCONUT database.
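
Because `inchikey` has a fixed 27-character layout (14 letters, a hyphen, 10 letters, a hyphen, 1 letter), it works well as a lookup key. The following is a minimal sketch, not part of the original notebook: the helper names `is_valid_inchikey` and `index_by_inchikey` are hypothetical, and only the first sample record uses values that appear in this notebook.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")

def is_valid_inchikey(key):
    """Return True if `key` is a string with the 27-character InChIKey layout."""
    return isinstance(key, str) and INCHIKEY_PATTERN.fullmatch(key) is not None

def index_by_inchikey(documents):
    """Map each well-formed InChIKey to the list of coconut_id values carrying it."""
    index = {}
    for doc in documents:
        key = doc.get("inchikey")
        if is_valid_inchikey(key):
            index.setdefault(key, []).append(doc.get("coconut_id"))
    return index

# Illustrative in-memory sample; the second record is deliberately malformed.
sample_docs = [
    {"coconut_id": "CNP0355008", "inchikey": "FDSDTBUPSURDBL-UHFFFAOYSA-N"},
    {"coconut_id": "EXAMPLE_BAD", "inchikey": "not-a-key"},
]
inchikey_index = index_by_inchikey(sample_docs)
print(inchikey_index)
```

Records whose key does not match the layout are skipped rather than indexed, so a malformed entry cannot shadow a real compound.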


The natural product-likeness (NPL) score ranges from −5 (the compound is more similar to synthetic compounds) to 5 (the compound is more similar to natural products); see https://www.sciencedirect.com/science/article/pii/S2667318523000107.
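
To make the scale concrete, here is a small sketch that ranks compounds by `npl_score` and rejects values outside the stated range. The function name `rank_by_npl_score` is hypothetical, and the bounds are an assumption taken from the range stated above; the three scores are copied from the result table at the end of this notebook.

```python
# Assumed bounds of the NPL score, as stated in the text above.
NPL_MIN, NPL_MAX = -5.0, 5.0

def rank_by_npl_score(records):
    """Return (name, score) pairs sorted from most to least natural-product-like."""
    for name, score in records:
        if not NPL_MIN <= score <= NPL_MAX:
            raise ValueError(f"{name}: score {score} is outside [{NPL_MIN}, {NPL_MAX}]")
    return sorted(records, key=lambda pair: pair[1], reverse=True)

# (name, npl_score) pairs taken from the result table in this notebook.
scores = [
    ("Zeaxanthin", 0.998701),
    ("Fucoxanthin", 1.291161),
    ("Orobronze", 0.714854),
]
ranked = rank_by_npl_score(scores)
print(ranked)
```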

## Read COCONUT MongoDB

In [15]:
# bson.BSON(...).decode() is provided by the bson module bundled with PyMongo
import bson
import pandas as pd

def find_compounds_by_name(bson_path, compound_names):
    # Convert all search names to lower case for case-insensitive comparison
    search_names = set(name.lower() for name in compound_names)
    results = []

    with open(bson_path, 'rb') as file:
        while True:
            # Read a document's size from the first 4 bytes
            doc_size = file.read(4)
            if not doc_size:
                break  # End of file
            
            doc_size = int.from_bytes(doc_size, byteorder='little')
            document_data = doc_size.to_bytes(4, byteorder='little') + file.read(doc_size - 4)
            document = bson.BSON(document_data).decode()

            # Retrieve and normalize the primary name (the field may be missing or None)
            compound_name = (document.get('name') or '').lower()
            synonyms = document.get('synonyms') or []
            # Check against both the 'name' and 'synonyms' fields
            if compound_name in search_names or any(
                isinstance(syn, str) and syn.lower() in search_names for syn in synonyms
            ):
                results.append(document)
    
    return results

def build_compound_dataframe(compounds, columns=None):
    # If no specific columns are requested, use a default list
    if columns is None:
        columns = ['coconut_id', 'name', 'synonyms', 'unique_smiles', 
                   'iupac_name', 'molecular_weight', 'npl_score', 
                   'taxid', 'textTaxa', 'chemicalSuperClass',
                   'chemicalClass', 'chemicalSubClass', 'directParentClassification'
                   ]

    # Prepare data for DataFrame
    data = []
    for compound in compounds:
        # Extract data for each specified column, handling missing keys
        row = {col: compound.get(col, pd.NA) for col in columns}
        data.append(row)
    
    # Create and return the DataFrame
    df = pd.DataFrame(data, columns=columns)
    return df

{'_id': ObjectId('61a5787bc52bda1e67b86a1a'), 'coconut_id': 'CNP0355008', 'contains_sugar': 0, 'heavy_atom_number': 42, 'inchi': 'InChI=1S/C40H52O2/c1-29(17-13-19-31(3)21-23-35-33(5)37(41)25-27-39(35,7)8)15-11-12-16-30(2)18-14-20-32(4)22-24-36-34(6)38(42)26-28-40(36,9)10/h11-24H,25-28H2,1-10H3', 'inchikey': 'FDSDTBUPSURDBL-UHFFFAOYSA-N', 'smiles': '[H]C(C1=C(C(=O)C([H])([H])C([H])([H])C1(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])=C([H])C(=C([H])C([H])=C([H])C(=C([H])C([H])=C([H])C([H])=C(C([H])=C([H])C([H])=C(C([H])=C([H])C2=C(C(=O)C([H])([H])C([H])([H])C2(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])C([H])([H])[H]', 'unique_smiles': 'O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C(=O)CCC2(C)C)C)C)C)C)C)C(C)(C)CC1)C', 'clean_smiles': 'O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C(=O)CCC2(C)C)C)C)C)C)C)C(C)(C)CC1)C', 'sugar_free_smiles': 'O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C(=O)CCC2(C)C)C)C)C)C)C)C(C)(C)CC1)C', 'deep_smiles': '',
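
The read loop in `find_compounds_by_name` relies on how BSON frames documents: each one starts with a little-endian 32-bit integer giving its total size, and that size includes the four length bytes themselves. The sketch below isolates that framing step using only the standard library, so it can be tried without the database file. The generator name `iter_raw_bson_documents` is hypothetical, and the two byte strings are hand-built test documents, not COCONUT data.

```python
import io
import struct

def iter_raw_bson_documents(stream):
    """Yield the raw bytes of each length-prefixed BSON document in `stream`."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack("<i", header)  # little-endian int32
        if size < 5:
            # The smallest valid document is 4 size bytes plus a terminating NUL.
            raise ValueError(f"invalid document size: {size}")
        body = stream.read(size - 4)
        if len(body) < size - 4:
            raise ValueError("truncated document body")
        yield header + body

# Hand-built documents: an empty one, and {"a": 1} stored as an int32 element.
empty_doc = b"\x05\x00\x00\x00\x00"
int_doc = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
raw_docs = list(iter_raw_bson_documents(io.BytesIO(empty_doc + int_doc)))
print([len(d) for d in raw_docs])
```

Unlike the loop above, this version raises on a short read instead of passing a truncated buffer to the decoder, which makes a damaged dump easier to diagnose.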

In [None]:
bson_path = '/home/robaina/Databases/COCONUT_2021_11/uniqueNaturalProduct.bson'
compound_names = [
    "Beta-Carotene",
    "Astaxanthin",
    "Lutein",
    "Zeaxanthin",
    "Canthaxanthin",
    "Fucoxanthin",
]

found_compounds = find_compounds_by_name(bson_path, compound_names)

In [17]:
df_compounds = build_compound_dataframe(found_compounds)
df_compounds


| | coconut_id | name | synonyms | unique_smiles | iupac_name | molecular_weight | npl_score | taxid | textTaxa | chemicalSuperClass | chemicalClass | chemicalSubClass | directParentClassification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CNP0355008 | Orobronze | [Food Orange 8, 4,4'-Dioxo-Beta-Carotene, Caro... | O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C... | 2,4,4-trimethyl-3-[3,7,12,16-tetramethyl-18-(2... | 564.841165 | 0.714854 | [] | [Cantharellus cinnabarinus, Anabaena flos-aqua... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Xanthophylls |
| 1 | CNP0283320 | Zeaxanthin | [(3R,3'S)-Zeaxanthin, 4-[18-(4-Hydroxy-2,6,6-T... | OC1CC(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C... | 4-[18-(4-hydroxy-2,6,6-trimethylcyclohex-1-en-... | 568.872928 | 0.998701 | [48386, 502529, 36622, 4369, 134427, 4121, 183... | [Prosopis glandulosa, Agnorhiza bolanderi, Tax... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Xanthophylls |
| 2 | CNP0216202 | Bo-Xan | [(All-E)-Lutein, Lutein B, Xanthophyll, Lutein... | OC1C=C(C)C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C... | 4-[18-(4-hydroxy-2,6,6-trimethylcyclohex-1-en-... | 568.872928 | 1.119328 | [48386, 207754, 16719, 1672008, 29780, 340432,... | [Begonia nantoensis, Citrus sinensis, Pelteoba... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Xanthophylls |
| 3 | CNP0190579 | Ovoester | [Ovoester, E 161J, All-Trans-(3S,3'S)-Astaxant... | O=C1C(=C(C=CC(=CC=CC(=CC=CC=C(C=CC=C(C=CC2=C(C... | 6-hydroxy-3-[18-(4-hydroxy-2,6,6-trimethyl-3-o... | 596.839975 | 0.957261 | [] | [Pfaffia rhodozyma, Pelteobagrus nudiceps, Ado... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Xanthophylls |
| 4 | CNP0233748 | Beta | [Beta-Carotene, B-Carotene, (9Z)-Beta,Beta-Car... | C(=CC=C(C=CC=C(C=CC1=C(C)CCCC1(C)C)C)C)C=C(C=C... | 1,3,3-trimethyl-2-[3,7,12,16-tetramethyl-18-(2... | 536.874118 | 0.746975 | [48386, 714511, 502529, 73737, 1209881, 49675,... | [Prosopis glandulosa, Agnorhiza bolanderi, Tax... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Carotenes |
| 5 | CNP0233524 | Fucoxanthin | [3-Hydroxy-4-(18-{4-Hydroxy-2,2,6-Trimethyl-7-... | O=C(OC1CC(O)(C(=C=CC(=CC=CC(=CC=CC=C(C=CC=C(C(... | 3-hydroxy-4-(18-{4-hydroxy-2,2,6-trimethyl-7-o... | 658.907901 | 1.291161 | [4081, 2850, 88149, 127572, 45367, 510735, 743... | [Corbicula sandai, Sargassum fusiforme, Cratae... | Lipids and lipid-like molecules | Prenol lipids | Tetraterpenoids | Xanthophylls |
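
Several rows have a `name` that is not one of the search terms (for example `Orobronze`, `Bo-Xan`, and `Beta`), which means those documents were matched through `synonyms`. The sketch below reports which search terms a given document satisfies, which can help label such rows. The helper name `matched_queries` is hypothetical, and the sample record is abbreviated to the name and two synonyms visible in row 4 of the result table.

```python
def matched_queries(document, queries):
    """Return the queries that equal the document's name or a synonym, ignoring case."""
    candidates = [document.get("name")] + list(document.get("synonyms") or [])
    # Skip missing or non-string entries before normalizing the case.
    known = {c.lower() for c in candidates if isinstance(c, str)}
    # Preserve the caller's spelling and order of the queries.
    return [q for q in queries if q.lower() in known]

# Abbreviated record based on row 4 of the result table.
record = {"name": "Beta", "synonyms": ["Beta-Carotene", "B-Carotene"]}
hits = matched_queries(record, ["Beta-Carotene", "Lutein"])
print(hits)
```

Adding the returned terms as an extra column would let the DataFrame be grouped by the original search name instead of by COCONUT's primary `name`.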
