# Processing SMILES and Errors

This notebook runs a **conversion pipeline** from raw SMILES text files to analysis-ready CSV files:

1. **TXT → R_*.csv** — Collect unique SMILES from each `.txt` file and record ID and occurrence count.
2. **R_*.csv → E_*.csv** — Check parsability with RDKit, compute canonical SMILES where possible, and attach parse error/warning messages.
3. **E_*.csv → errors.csv** — Classify unparsable SMILES by error type (syntax vs chemical) and summarize counts per file.

Inputs: `.txt` files in `./txt/` (one SMILES per line).  
Outputs: `./par/R_*.csv`, `./par/E_*.csv`, and `./data/errors.csv`.
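
The naming convention above can be summarized in a small helper. This is an illustrative sketch only: `stage_paths` is a hypothetical name and is not part of the notebook or of `libTMC`. The directory names are the ones listed above.

```python
from pathlib import PurePosixPath


def stage_paths(txt_name, txtdir="txt", pardir="par"):
    """Map one input file name to the file each stage reads or writes."""
    base = PurePosixPath(txt_name).stem  # 'example.txt' -> 'example'
    return {
        "txt": str(PurePosixPath(txtdir) / txt_name),
        "R": str(PurePosixPath(pardir) / f"R_{base}.csv"),
        "E": str(PurePosixPath(pardir) / f"E_{base}.csv"),
    }


paths = stage_paths("example.txt")
print(paths)
```

Every stage keeps the input basename and changes only the prefix, which is why the later cells can decide what to skip by comparing file names.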

In [1]:
import os
from collections import defaultdict

import pandas as pd
from rdkit.Chem.rdmolfiles import MolToSmiles

from libTMC.parse_mol_and_errors import get_parsability, append_error_messages

## Pipeline 1: Unique SMILES (TXT → R_*.csv)

For each `.txt` file in `txt/`, we read SMILES (one per line, blank lines ignored), give each distinct SMILES an ID in order of first appearance, and count its occurrences. The result is written to `par/R_{basename}.csv` with columns: **ID**, **LLM_SMI**, **Occur**. An input is skipped if its `R_{basename}.csv` already exists in `par/`.

In [2]:
repodir = '.'
txtdir = 'txt'
outdir = 'par'

os.makedirs(os.path.join(repodir, outdir), exist_ok=True)

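# Basenames that already have an R_*.csv, so reruns skip them.
# Slice off the prefix and suffix instead of splitting on '_',
# which would truncate basenames that themselves contain underscores.
existing = [
    n[len('R_'):-len('.csv')]
    for n in os.listdir(os.path.join(repodir, outdir))
    if n.startswith('R_') and n.endswith('.csv')
]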
processed = 0
for file in os.listdir(os.path.join(repodir, txtdir)):
    if not file.endswith('.txt'):
        continue
    pre_check = file.replace('.txt', '')
    if pre_check in existing:
        continue
    processed += 1
    with open(os.path.join(repodir, txtdir, file), 'r') as f:
        info = f.readlines()

    identical = {}
    begin = 0
    for line in info:
        smi = line.strip()  # drops the trailing newline and surrounding whitespace
        if not smi:
            continue
        if smi not in identical:
            begin += 1
            identical[smi] = {'ID': begin, 'Occur': 1}
        else:
            identical[smi]['Occur'] += 1

    output_data = defaultdict(list)
    for key, values in identical.items():
        output_data['ID'].append(values['ID'])
        output_data['LLM_SMI'].append(key)
        output_data['Occur'].append(values['Occur'])
    outputdf = pd.DataFrame(output_data)
    param = file.replace('.txt', '')
    outputdf.to_csv(os.path.join(repodir, outdir, f'R_{param}.csv'))
print('Files processed:', processed)

Files processed: 1
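
The counting step in the cell above can be isolated into a few lines. This is a minimal sketch of the same idea, not the notebook's code: `count_unique` is a hypothetical helper, and the input lines are made-up examples.

```python
def count_unique(lines):
    """Return (ID, LLM_SMI, Occur) rows; IDs follow order of first appearance."""
    counts = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        smi = line.strip()
        if not smi:
            continue  # skip blank lines
        counts[smi] = counts.get(smi, 0) + 1
    return [(i, smi, n) for i, (smi, n) in enumerate(counts.items(), start=1)]


rows = count_unique(["CCO\n", "c1ccccc1\n", "\n", "CCO\n", "C(C\n"])
for row in rows:
    print(row)
```

At this stage the comparison is on the raw text only, so two different spellings of the same molecule still get separate IDs. Merging them is left to Pipeline 2.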


## Pipeline 2: Parsability and canonical SMILES (R_*.csv → E_*.csv)

For each `R_*.csv` in `par/`, we run RDKit parsability checks. Parsable SMILES are canonicalized and merged by canonical SMILES: the first row seen for a canonical form keeps its **ID** and **LLM_SMI**, its **Occur** accumulates the counts of later duplicates, and the duplicates' IDs and original SMILES are appended to **RSC_IDs** and **RSC_SMIs**. Unparsable rows are not merged and keep the original SMILES in **Canon_SMI**. Output: `par/E_{basename}.csv` with **ID**, **Canon_SMI**, **LLM_SMI**, **Occur**, **Parsable**, **RSC_IDs**, **RSC_SMIs**, **Parse_errors**, **Parse_warns**, **Parse_probs**.

In [3]:
pardir = 'par'
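# Suffixes (e.g. '_example') that already have an E_*.csv, so reruns skip them.
# Match on the filename prefix: a substring test or str.replace would also hit
# any 'E' or 'R' inside the basename itself.
existing_e = [
    n[len('E'):-len('.csv')]
    for n in os.listdir(pardir)
    if n.startswith('E_') and n.endswith('.csv')
]
processed = 0
for file in os.listdir(pardir):
    if not (file.startswith('R_') and file.endswith('.csv')):
        continue
    filename = file[len('R'):-len('.csv')]  # keeps the leading underscore
    if filename in existing_e:
        continue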
    processed += 1
    df = pd.read_csv(os.path.join(pardir, file)).drop(columns=['Unnamed: 0'], errors='ignore')

    identical = {}
    for idx, smi, occur in zip(df['ID'], df['LLM_SMI'], df['Occur']):
        mol, parsable, messages = get_parsability(smi)
        if parsable:
            c_smi = MolToSmiles(mol, canonical=True)
            if c_smi not in identical:
                identical[c_smi] = {
                    'ID': idx, 'Canon_SMI': c_smi, 'LLM_SMI': smi, 'Occur': occur,
                    'Parsable': parsable, 'RSC_IDs': '', 'RSC_SMIs': '', 'Messages': messages
                }
            else:
                identical[c_smi]['Occur'] += occur
                identical[c_smi]['RSC_IDs'] += (',' if identical[c_smi]['RSC_IDs'] else '') + str(idx)
                identical[c_smi]['RSC_SMIs'] += (',' if identical[c_smi]['RSC_SMIs'] else '') + smi
        else:
            identical[smi] = {
                'ID': idx, 'Canon_SMI': smi, 'LLM_SMI': smi, 'Occur': occur,
                'Parsable': parsable, 'RSC_IDs': '', 'RSC_SMIs': '', 'Messages': messages
            }

    outputdict = defaultdict(list)
    for k, v in identical.items():
        for key, value in v.items():
            if key != 'Messages':
                outputdict[key].append(value)
            else:
                append_error_messages(outputdict, value, subtitle='Parse')
    outputdf = pd.DataFrame(outputdict)
    outputdf.to_csv(os.path.join(pardir, f'E{filename}.csv'))
print('Files processed:', processed)

Files processed: 1
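
The merge logic of this stage can be shown without RDKit by passing the canonicalizer in as a function. The sketch below makes several assumptions: `merge_by_canonical` and `toy_canonicalize` are hypothetical names, the lookup table is a stand-in for RDKit parsing plus `MolToSmiles`, and redundant IDs are kept as lists, whereas the notebook stores them as comma-joined strings.

```python
def merge_by_canonical(rows, canonicalize):
    """Merge (ID, SMILES, Occur) rows that share a canonical SMILES.

    canonicalize: callable returning a canonical SMILES string,
    or None when the input cannot be parsed.
    """
    merged = {}
    for idx, smi, occur in rows:
        canon = canonicalize(smi)
        key = canon if canon is not None else smi
        if canon is not None and key in merged:
            # Later duplicate: fold its count and identity into the first entry.
            entry = merged[key]
            entry["Occur"] += occur
            entry["RSC_IDs"].append(idx)
            entry["RSC_SMIs"].append(smi)
        else:
            merged[key] = {
                "ID": idx, "Canon_SMI": key, "LLM_SMI": smi, "Occur": occur,
                "Parsable": canon is not None, "RSC_IDs": [], "RSC_SMIs": [],
            }
    return list(merged.values())


# Toy stand-in for a real canonicalizer (assumption, for illustration only).
TOY_CANON = {"CCO": "CCO", "OCC": "CCO"}


def toy_canonicalize(smi):
    return TOY_CANON.get(smi)  # None means "unparsable"


result = merge_by_canonical(
    [(1, "CCO", 3), (2, "OCC", 2), (3, "C(C", 1)], toy_canonicalize
)
for entry in result:
    print(entry)
```

Here the two spellings of ethanol collapse into one entry with `Occur` 5, while the unparsable `C(C` stays as its own row with `Parsable` set to `False`.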


## Pipeline 3: Error classification (E_*.csv → errors.csv)

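We classify **unparsable** SMILES by searching the **Parse_warns** text of each row for characteristic phrases:
- **Chemical**: Improper valences, Kekulization issues, Aromatic labels for non-ring atoms, or multiple chemical issues (more than one chemical phrase in the same message).
- **Syntax**: Unclosed rings, Parentheses, Duplicate bonds on ring closure, Other syntax errors.

Each E_*.csv is summarized into one row of counts; results are written to `./data/errors.csv`. Unparsable rows with an empty message are skipped, and a message that matches none of the phrases raises a `ValueError`.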

In [4]:
psftdir = 'par'
# sorted() gives a stable row order; startswith avoids matching 'E_' mid-name
files = sorted(f for f in os.listdir(psftdir) if f.startswith('E_') and f.endswith('.csv'))

chemdict = {
    'Improper valences': 'Explicit valence',
    'Kekulization issues': 'kekulize',
    'Aromatic labels for non-ring atoms': 'non-ring'
}
syndict = {
    'Unclosed rings': 'unclosed ring',
    'Parentheses': 'parentheses',
    'Duplicate bonds on ring closure': 'ring closure',
    'Other syntax errors': 'syntax error while parsing'
}
errdict = {}
for key, value in chemdict.items():
    errdict[value] = {'Label': key, 'Occur': 0, 'Class': 'Chemical'}
for key, value in syndict.items():
    errdict[value] = {'Label': key, 'Occur': 0, 'Class': 'Syntax'}
errdict['multiple'] = {'Label': 'Multiple chemical issues', 'Occur': 0, 'Class': 'Chemical'}

chemphs = list(chemdict.values())
allphs = list(errdict.keys())

In [5]:
outdict = defaultdict(list)
for file in files:
    df = pd.read_csv(os.path.join(psftdir, file)).drop(columns=['Unnamed: 0'], errors='ignore')
    fname = file.replace('.csv', '')
    info = fname

    errdict_local = {}
    for key, value in chemdict.items():
        errdict_local[value] = {'Label': key, 'Occur': 0, 'Class': 'Chemical'}
    for key, value in syndict.items():
        errdict_local[value] = {'Label': key, 'Occur': 0, 'Class': 'Syntax'}
    errdict_local['multiple'] = {'Label': 'Multiple chemical issues', 'Occur': 0, 'Class': 'Chemical'}

    for idx, par, message in zip(df['ID'], df['Parsable'], df['Parse_warns']):
        if par:
            continue  # only unparsable SMILES are classified
        if pd.isna(message):
            continue  # empty cells are read back as NaN; nothing to classify
        check = sum(1 for ck in chemphs if ck in message)
        all_check = sum(1 for ck in allphs if ck in message)
        if check <= 1:
            for tk in allphs:
                if tk in message:
                    errdict_local[tk]['Occur'] += 1
                    break
        else:
            errdict_local['multiple']['Occur'] += 1
        if all_check < 1:
            raise ValueError(fname, idx, 'missing errors!', message)

    outdict['Info'].append(info)
    for k in errdict_local:
        outdict['|'.join([errdict_local[k]['Label'], errdict_local[k]['Class']])].append(errdict_local[k]['Occur'])

outdf = pd.DataFrame(outdict)
outdf.head()

Unnamed: 0,Info,Improper valences|Chemical,Kekulization issues|Chemical,Aromatic labels for non-ring atoms|Chemical,Unclosed rings|Syntax,Parentheses|Syntax,Duplicate bonds on ring closure|Syntax,Other syntax errors|Syntax,Multiple chemical issues|Chemical
0,E_example,148,166,8,549,35,19,7,5
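
The per-message decision made inside the loop above can be written as a standalone function. This is a sketch under stated assumptions: `classify` is a hypothetical helper, the phrases are copied from `chemdict` and `syndict`, and the example messages are illustrative strings rather than output captured from this notebook. Unlike the notebook's `allphs`, it does not treat the literal word `multiple` as a search phrase.

```python
CHEMICAL = {
    "Improper valences": "Explicit valence",
    "Kekulization issues": "kekulize",
    "Aromatic labels for non-ring atoms": "non-ring",
}
SYNTAX = {
    "Unclosed rings": "unclosed ring",
    "Parentheses": "parentheses",
    "Duplicate bonds on ring closure": "ring closure",
    "Other syntax errors": "syntax error while parsing",
}


def classify(message):
    """Return (label, class) for one parse message."""
    chem_hits = [label for label, phrase in CHEMICAL.items() if phrase in message]
    if len(chem_hits) > 1:
        return ("Multiple chemical issues", "Chemical")
    # Otherwise the first matching phrase wins, chemical phrases first.
    for table, cls in ((CHEMICAL, "Chemical"), (SYNTAX, "Syntax")):
        for label, phrase in table.items():
            if phrase in message:
                return (label, cls)
    raise ValueError(f"no known phrase in message: {message!r}")


examples = [
    "Explicit valence for atom # 1 C, 5, is greater than permitted",
    "unclosed ring for input",
    "Explicit valence for atom # 2 N, 4; Can't kekulize mol",
]
labels = [classify(m) for m in examples]
for message, label in zip(examples, labels):
    print(label, "<-", message)
```

Because the first matching phrase wins, the order of the dictionaries matters when a message contains one chemical phrase and one syntax phrase: it is counted under the chemical label.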


In [6]:
os.makedirs('data', exist_ok=True)
outdf.to_csv('./data/errors.csv')
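
Since each count column is named `Label|Class`, per-class totals can be recovered by splitting the column name. The sketch below uses the `E_example` row shown above and plain dictionaries; it is an illustration of how the column names can be used, not a step of the notebook.

```python
# Counts copied from the E_example row in the output above.
row = {
    "Improper valences|Chemical": 148,
    "Kekulization issues|Chemical": 166,
    "Aromatic labels for non-ring atoms|Chemical": 8,
    "Unclosed rings|Syntax": 549,
    "Parentheses|Syntax": 35,
    "Duplicate bonds on ring closure|Syntax": 19,
    "Other syntax errors|Syntax": 7,
    "Multiple chemical issues|Chemical": 5,
}

totals = {}
for column, count in row.items():
    label, cls = column.rsplit("|", 1)  # class is the part after the last '|'
    totals[cls] = totals.get(cls, 0) + count

print(totals)  # {'Chemical': 327, 'Syntax': 610}
```

For this file, syntax errors (610) outnumber chemical errors (327), with unclosed rings alone accounting for 549 of the 937 classified SMILES.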

    Copyright ©2025  The Regents of the University of California (Regents). All Rights Reserved. Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement, is hereby granted, provided that the above copyright notice, this paragraph and the following paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 408, Berkeley, CA 94704-1362, otl@berkeley.edu.
    
    Created by John Smith and Mary Doe, Department of Statistics, University of California, Berkeley.

    IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

    REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.