# Non-Natural Product Formula Analysis

This notebook identifies molecular formulas in the PTFI database that have no match in the natural product databases (SP3/Supernatural, LOTUS, and COCONUT). Formulas absent from these databases are likely synthetic or otherwise not found in nature, which helps distinguish natural from synthetic metabolites in the PTFI dataset.

In [51]:
import pandas as pd
import numpy as np


## Load PTFI Database

Load the PTFI metabolite dictionary and extract molecular formulas for analysis.

In [52]:
# Load PTFI database
ptfi_dic = pd.read_csv("../databases/ptfi_dic.csv")

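# Keep entries whose definition mentions "unknown" (na=False: missing definitions count as non-matches)
ptfi_unknown = ptfi_dic[ptfi_dic["definition"].str.contains("unknown", na=False)].copy()
# Drop entries whose element_name contains a literal "|"
ptfi_unkown_no_names = ptfi_unknown[~ptfi_unknown["element_name"].str.contains(r"\|")].copy()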

ptfi_unkown_no_names[['formula', 'rt_index']] = (
    ptfi_unkown_no_names['element_name']
        .str.split('-', n=1, expand=True)   # n=1 → split only on the first “-”
)
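
To make the split concrete, here is a small sketch on made-up `element_name` values (the names below are hypothetical, not taken from PTFI). It assumes each name has the form `<formula>-<retention index>`; with `n=1`, any later hyphen stays in `rt_index`.

```python
import pandas as pd

# Hypothetical element names, for illustration only
demo = pd.DataFrame(
    {"element_name": ["C6H12O6-1021.5", "C9H8O4-880-2", "C5H5N5|C4H4N2O2-700"]}
)

# Drop names containing a literal "|" (the pattern is a regex, hence the escape)
demo = demo[~demo["element_name"].str.contains(r"\|")].copy()

# Split on the first "-" only, so "880-2" is kept whole in rt_index
demo[["formula", "rt_index"]] = demo["element_name"].str.split("-", n=1, expand=True)

print(demo[["formula", "rt_index"]])
```

Note that `rt_index` remains a string column after the split; converting it (for example with `pd.to_numeric`) would be a separate step if numeric comparisons are needed.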


## Supernatural

We parse the SMILES strings in the Supernatural database with RDKit to extract their molecular formulas.

In [21]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as Descr

supernatural3 = pd.read_csv("../databases/full_data_download.csv", sep=";")

formulas    = []    # list aligned with the input order
bad_smiles  = []    # store SMILES we can’t parse

# Iterate over the raw column: casting with .astype(str) would turn missing
# values into the string "nan", which slips past the check below and reaches RDKit.
for i, smi in enumerate(supernatural3["smiles"]):
    if pd.isna(smi) or str(smi).strip() == "":
        formulas.append(None)
        bad_smiles.append(smi)
        continue

    try:
        # build Mol without full sanitisation
        mol = Chem.MolFromSmiles(smi, sanitize=False)
        if mol is None:
            raise ValueError("MolFromSmiles failed")

        # lightweight cache fill so CalcMolFormula won’t crash
        mol.UpdatePropertyCache(strict=False)
        form = Descr.CalcMolFormula(mol)

        formulas.append(form)
    except Exception:
        # any exception raised by RDKit: record the row as failed
        formulas.append(None)
        bad_smiles.append(smi)

    # progress ping every 50 000 rows
    if (i + 1) % 50_000 == 0:
        print(f"Processed {i+1:,} SMILES…")

# attach to the DataFrame
supernatural3["formula"] = formulas

# write out the rejects
pd.Series(bad_smiles, name="bad_smiles").to_csv(
    "supernatural2_bad_smiles.csv", index=False
)

print(f"✓ Parsed formulas: {supernatural3['formula'].notna().sum():,}")
print(f"✗ Failed SMILES  : {len(bad_smiles):,}  (stored in supernatural2_bad_smiles.csv)")


Processed 50,000 SMILES…
Processed 100,000 SMILES…
Processed 150,000 SMILES…
Processed 200,000 SMILES…
Processed 250,000 SMILES…
Processed 300,000 SMILES…
Processed 350,000 SMILES…
Processed 400,000 SMILES…
Processed 450,000 SMILES…
Processed 500,000 SMILES…
Processed 550,000 SMILES…
Processed 600,000 SMILES…
Processed 650,000 SMILES…
Processed 700,000 SMILES…
Processed 750,000 SMILES…
Processed 800,000 SMILES…
Processed 850,000 SMILES…
Processed 900,000 SMILES…
Processed 950,000 SMILES…
Processed 1,000,000 SMILES…
Processed 1,050,000 SMILES…
Processed 1,100,000 SMILES…
Processed 1,150,000 SMILES…
Processed 1,200,000 SMILES…
✓ Parsed formulas: 1,203,563
✗ Failed SMILES  : 1,636  (stored in supernatural2_bad_smiles.csv)


In [23]:
supernatural3_formulas = set(supernatural3["formula"].dropna().unique())

## COCONUT

COCONUT needs no additional parsing: the export already includes a `molecular_formula` column.

In [27]:
coconut = pd.read_csv("../databases/coconut_csv-07-2025.csv", sep=",", low_memory=False)


In [28]:
coconut_formulas = set(coconut["molecular_formula"].dropna().unique())
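
The formula sets are compared as exact strings, which assumes that COCONUT's `molecular_formula` column writes elements in the same order as the RDKit-derived formulas. That has not been verified here. If it turned out not to hold, a small normaliser like the hypothetical `hill_formula` below could rewrite simple formulas into Hill order (carbon first, then hydrogen, then the remaining elements alphabetically; purely alphabetical when there is no carbon) before the sets are built. It is only a sketch: it ignores charges, isotopes, and parentheses.

```python
import re
from collections import Counter


def hill_formula(formula: str) -> str:
    """Rewrite a simple molecular formula in Hill order (hypothetical helper)."""
    counts = Counter()
    # Each token is an element symbol followed by an optional count
    for elem, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[elem] += int(num) if num else 1

    if "C" in counts:
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        order = sorted(counts)

    # A count of 1 is written without a number
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


print(hill_formula("H12C6O6"))   # C6H12O6
print(hill_formula("C2H5OH"))    # C2H6O
```

Applied to a whole column, this would look like `set(coconut["molecular_formula"].dropna().map(hill_formula))`, again assuming the column holds plain, uncharged formulas.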

## LOTUS

The LOTUS SMILES also need to be parsed into molecular formulas.

In [36]:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors as Descr

def smiles2formula_safe(smi: str) -> str | None:
    """Return Hill-sorted formula from a SMILES, or None if RDKit chokes."""
    try:
        mol = Chem.MolFromSmiles(smi, sanitize=False)
        if mol is None:
            return None
        mol.UpdatePropertyCache(strict=False)          # tiny, fast
        return Descr.CalcMolFormula(mol)
    except Exception:
        return None


lotus_formulas = set()
bad_smiles     = []

with open("../databases/LOTUS_DB.smi", "r", encoding="utf-8") as fh:
    for line in fh:
        smi = line.split("\t", 1)[0].strip()          # grab the SMILES column
        if not smi:
            continue
        f = smiles2formula_safe(smi)
        if f:
            lotus_formulas.add(f)
        else:
            bad_smiles.append(smi)

print(f"LOTUS formulas parsed: {len(lotus_formulas):,}")
print(f"SMILES that failed   : {len(bad_smiles):,}")

LOTUS formulas parsed: 30,090
SMILES that failed   : 0


## PTFI Non-Natural Formulas

This section finalizes the analysis by classifying each PTFI formula as natural or non-natural. By default, a formula counts as natural if it appears in any of the natural product databases (COCONUT, LOTUS, or Supernatural).

A stricter rule applies to formulas that contain fluorine. Because fluorine is extremely rare in natural products, a fluorine-containing formula must appear in all three databases to count as natural; otherwise it is classified as non-natural.
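
The two rules can be sketched for a single formula in plain Python. The function name `is_natural` and the database flags passed in the examples are hypothetical and chosen only to exercise each branch; the regular expression matches an `F` that is not followed by a lowercase letter, so symbols such as `Fe` are not mistaken for fluorine.

```python
import re

# "F" not followed by a lowercase letter: matches fluorine, skips Fe, Fm, Fr, Fl
FLUORINE = re.compile(r"F(?![a-z])")


def is_natural(formula: str, in_coconut: bool, in_lotus: bool, in_supernatural3: bool) -> bool:
    """Apply the ANY rule, or the stricter ALL rule when fluorine is present."""
    flags = (in_coconut, in_lotus, in_supernatural3)
    if FLUORINE.search(formula):
        return all(flags)   # fluorine: must be found in every database
    return any(flags)       # otherwise: one database is enough


# Illustrative inputs only; the flags are made up
print(is_natural("C6H12O6", True, False, False))   # no F, found once
print(is_natural("C7H5F3O", True, True, False))    # has F, missing from one database
print(is_natural("C10H10Fe", False, True, False))  # Fe is iron, not fluorine
```

The pandas cell that follows expresses the same decision column-wise with `any(axis=1)`, `all(axis=1)`, and `np.where`.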

In [56]:
ptfi_formulas = ptfi_unkown_no_names[["formula", "rt_index", "element_name", "element_id"]].copy()
ptfi_formulas["in_coconut"] = ptfi_formulas["formula"].isin(coconut_formulas)
ptfi_formulas["in_lotus"] = ptfi_formulas["formula"].isin(lotus_formulas)
ptfi_formulas["in_supernatural3"] = ptfi_formulas["formula"].isin(supernatural3_formulas)

In [59]:
# Ensure database flags are boolean
ptfi_formulas[['in_coconut', 'in_lotus', 'in_supernatural3']] = (
    ptfi_formulas[['in_coconut', 'in_lotus', 'in_supernatural3']].fillna(False).astype(bool)
    )

# Rule 1: Natural product if present in ANY database
ptfi_formulas['np_any'] = ptfi_formulas[['in_coconut', 'in_lotus', 'in_supernatural3']].any(axis=1)

# Detect elemental Fluorine 'F' (not 'Fe') in the formula
ptfi_formulas['has_fluorine'] = (
    ptfi_formulas['formula']
        .fillna('')
        .astype(str)
        .str.contains(r'F(?![a-z])', regex=True)
    )

# Rule 2: If formula has F, require it to be present in ALL DBs; otherwise use ANY
ptfi_formulas['np_all'] = ptfi_formulas[['in_coconut', 'in_lotus', 'in_supernatural3']].all(axis=1)
ptfi_formulas['np_final'] = np.where(
    ptfi_formulas['has_fluorine'],
    ptfi_formulas['np_all'],
    ptfi_formulas['np_any']
    )

# Diagnostics
# Note: 'np_any_true' counts np_final (i.e. after the fluorine rule), not np_any
summary = {
    'total': len(ptfi_formulas),
    'np_any_true': int(ptfi_formulas['np_final'].sum()),
    'has_fluorine': int(ptfi_formulas['has_fluorine'].sum()),
    'non_np': len(ptfi_formulas) - int(ptfi_formulas['np_final'].sum()),
}
print('Summary:', summary)

ptfi_formulas.to_csv("../export_ptfi/ptfi_non_natural_formulas.csv", index=False)


Summary: {'total': 23714, 'np_any_true': 22612, 'has_fluorine': 126, 'non_np': 1102}
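
The summary reports totals only. One possible follow-up is to cross-tabulate `has_fluorine` against `np_final` to see how many non-natural calls come from the fluorine rule. The sketch below uses a toy table with made-up values, not the PTFI results, and the variable names `toy` and `breakdown` are illustrative.

```python
import pandas as pd

# Toy classification table; values are illustrative, not PTFI results
toy = pd.DataFrame({
    "has_fluorine": [True, True, False, False, False],
    "np_final":     [False, True, True, True, False],
})

# Rows: has_fluorine (False, True); columns: np_final (False, True)
breakdown = pd.crosstab(toy["has_fluorine"], toy["np_final"])
print(breakdown)

# Same arithmetic as the 'non_np' entry in the summary
non_np = len(toy) - int(toy["np_final"].sum())
print("non_np:", non_np)
```

On the real data the same call would be `pd.crosstab(ptfi_formulas["has_fluorine"], ptfi_formulas["np_final"])`.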
