# PyClassyFire Tutorial: Classifying Chemical Compounds Using the ClassyFire API


## Introduction

Welcome to the **PyClassyFire** tutorial! This guide will walk you through the process of classifying a large set of chemical compounds using the [ClassyFire](http://classyfire.wishartlab.com/) API. We'll utilize the `PyClassyFire` package, which provides a command-line interface (CLI) and programmatic access to the ClassyFire service, enabling efficient and scalable classification of chemical structures.

By the end of this tutorial, you'll be able to:

1. **Preprocess your SMILES data**: Prepare your unique SMILES strings for classification.
2. **Submit classification jobs**: Use the `PyClassyFire` package to send your data to the ClassyFire API.
3. **Retrieve and process results**: Collect the classification results and merge them with your original data.
4. **Save the annotated data**: Store the enriched dataset for further analysis.

Let's get started!

## Prerequisites

Before diving into the tutorial, ensure you have the following:

- **Conda Environment**: A Conda environment named `classyfire_env` with all necessary dependencies installed.
- **PyClassyFire Package**: Installed and accessible within your Conda environment.
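- **Unique SMILES Data**: A headerless TSV file with one SMILES string per line (approximately 16,000 entries), located at `/Users/macbook/CODE/PyClassyFire/data/unique_valid_smiles_no_header.tsv`. The notebook reads it through the relative path `../data/unique_valid_smiles_no_header.tsv`.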

**Note:** This tutorial assumes that the Conda environment and `PyClassyFire` package are already set up. If not, please refer to the [repository's README](https://github.com/Jozefov/PyClassyFire) for setup instructions.




In [1]:
import json
import os

import pandas as pd

from classyfire_cli.src.utils import (
    MoleCule,
    load_existing_results,
    save_intermediate_results,
    merge_intermediate_files,
    check_all_smiles_present,
)
from classyfire_cli.src.batch import process_batches_with_saving_and_retry

In [2]:
# Define paths
smiles_file_path = '../data/unique_valid_smiles_no_header.tsv'
output_dir = '../data/intermediate_results/'
final_output_path = '../data/final_classification_results.json'

In [3]:
# Load SMILES data
smiles_df = pd.read_csv(smiles_file_path, sep='\t', header=None, names=['SMILES']).dropna()

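# Canonicalize SMILES. Rows that cannot be canonicalized are removed in the next cell;
# calling .dropna() on the right-hand side has no effect, because the assignment
# re-aligns on the DataFrame index.
smiles_df['Canonical_SMILES'] = smiles_df['SMILES'].apply(
    lambda x: MoleCule.from_smiles(x).canonical_smiles if x else None
)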

In [4]:
# Remove invalid entries
invalid_smiles = smiles_df['Canonical_SMILES'].isnull().sum()
print(f"Number of invalid SMILES after canonicalization: {invalid_smiles}")

if invalid_smiles > 0:
    smiles_df = smiles_df.dropna(subset=['Canonical_SMILES'])
    print(f"Removed {invalid_smiles} invalid SMILES entries.")

Number of invalid SMILES after canonicalization: 0


In [5]:
# Reset index after cleaning
smiles_df.reset_index(drop=True, inplace=True)

# Extract the unique canonical SMILES, keeping first-seen order so that
# batches are formed in the same order on every run
canonical_smiles_list = list(dict.fromkeys(smiles_df['Canonical_SMILES'].tolist()))

In [6]:
# Define parameters
batch_size = 10      # Number of SMILES per job
save_interval = 20   # Save intermediate results every 20 batches (not passed to the call below)
output_dir = '/Users/macbook/CODE/PyClassyFire/data/intermediate_results/'  # Overrides the relative path set earlier
max_retries = 3      # Maximum number of retries for failed batches
retry_delay = 10     # Delay between retries, in seconds
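
To make the effect of `batch_size` concrete, the sketch below splits a list into consecutive batches. It is a minimal illustration only, not the code that `process_batches_with_saving_and_retry` runs internally; the helper name `make_batches` is hypothetical and the strings are placeholders rather than real SMILES.

```python
import math


def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder strings stand in for real SMILES.
example = [f"smiles_{i}" for i in range(655)]
batches = make_batches(example, batch_size=10)

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partially filled batch
assert len(batches) == math.ceil(len(example) / 10)
```

With 655 items and a batch size of 10 this yields 66 batches, the last one holding 5 items, which is consistent with the batch count reported in the log output of the processing cell.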

In [12]:
# Process the batches with resumption and retry logic
intermediate_files = process_batches_with_saving_and_retry(
    smiles_list=canonical_smiles_list,
    batch_size=batch_size,
    output_dir=output_dir,
    max_retries=max_retries,
    retry_delay=retry_delay
)

2025-01-02 15:35:15,633 - INFO - All SMILES: 13984
2025-01-02 15:35:15,637 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_6.json
2025-01-02 15:35:15,642 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_54.json
2025-01-02 15:35:15,647 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_42.json
2025-01-02 15:35:15,651 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_15.json
2025-01-02 15:35:15,655 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_81.json
2025-01-02 15:35:15,659 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_39.json
2025-01-02 15:35:15,662 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_127.json
2025-01-02 15:35:1

All  smiles: 13984


2025-01-02 15:35:15,835 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_13.json
2025-01-02 15:35:15,839 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_68.json
2025-01-02 15:35:15,843 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_87.json
2025-01-02 15:35:15,848 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_121.json
2025-01-02 15:35:15,852 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_29.json
2025-01-02 15:35:15,856 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_137.json
2025-01-02 15:35:15,860 - INFO - Loaded results from /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_91.json
2025-01-02 15:35:15,863 - INFO - Loaded results from /Users/macbook

Already processed SMILES: 13877


2025-01-02 15:35:18,199 - INFO - Remaining SMILES to process: 655
2025-01-02 15:35:18,200 - INFO - Remaining unique SMILES to process after removing duplicates: 655
2025-01-02 15:35:18,200 - INFO - Total remaining batches to process: 66


Remaining SMILES to process: 655
Remaining unique SMILES to process after removing duplicates: 655
Total remaining batches to process: 66


Processing Batches:   0%|          | 0/66 [00:00<?, ?it/s]2025-01-02 15:35:27,037 - INFO - Submitted Batch 166 with Query ID 12022013


Submitted Batch 166 with Query ID 12022013


2025-01-02 15:36:29,852 - ERROR - Batch 166: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12022013.json. Retrying (1/3) after 10 seconds...


Batch 166: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12022013.json. Retrying (1/3) after 10 seconds...


KeyboardInterrupt: 
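
The log above shows a failed request being retried after a fixed delay. A minimal sketch of that pattern follows. It is not PyClassyFire's implementation; `call_with_retry` and `flaky_fetch` are hypothetical names used only for illustration, and the failing call is simulated instead of contacting the server.

```python
import time


def call_with_retry(func, max_retries=3, retry_delay=10, sleep=time.sleep):
    """Call func(); on an exception, wait retry_delay seconds and try again.

    Makes one initial attempt plus up to max_retries retries, then re-raises
    the last exception.
    """
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= max_retries:
                raise
            attempt += 1
            print(f"Error encountered: {exc}. Retrying ({attempt}/{max_retries}) "
                  f"after {retry_delay} seconds...")
            sleep(retry_delay)


# Stand-in for a request that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated server error")
    return {"ok": True}


# retry_delay=0 keeps the demonstration instant.
result = call_with_retry(flaky_fetch, max_retries=3, retry_delay=0)
print(result, calls["count"])
```

A fixed delay is the simplest choice; if the server is under sustained load, a longer or growing delay between attempts may be gentler on it.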

In [13]:
# Merge the intermediate files into the final JSON
merge_intermediate_files(output_dir, final_output_path)

Processing Batches:   0%|          | 0/7 [01:42<?, ?it/s]


Successfully merged 165 files into ../data/final_classification_results.json.


In [7]:
# Check if all SMILES are present in the final output
check_all_smiles_present(final_output_path, canonical_smiles_list)

Missing 655 SMILES in the final output:
Can be caused by server ERROR, try to lower batch and rerun
It keeps already processed molecules and will try to retrieve only missing smiles.


In [15]:
with open(final_output_path, 'r') as f:
    molecules = json.load(f)

In [17]:
len(molecules)

15115

In [22]:
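# Collect the SMILES strings exactly as returned in the merged results
# (they are not yet re-canonicalized, so they may differ textually from the inputs)
output_smiles = set()
for molecule in molecules:
    smiles = molecule.get('smiles')
    if smiles:
        output_smiles.add(smiles.strip())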

In [23]:
# Returned SMILES with no exact string match among the canonical inputs
len(output_smiles - set(canonical_smiles_list))

13352

In [24]:
canonical_smiles_list

['Cc1cc(N2CCN(c3ncnc(C4CC4)c3F)CC2)n2nccc2n1',
 'O=C(O)[C@H]1Cc2ccccc2CN1',
 'CC[C@@H](C)[C@H](NC(=O)[C@H](C)n1nnc2ccccc2c1=O)C(=O)O',
 'CCc1ccc(-c2nn(C)c(=O)c3c2CCCC3)cc1S(=O)(=O)N1CCC(C(N)=O)CC1',
 'CCOc1cc(N2CCC(O)CC2)ccc1Nc1ncc2c(n1)N(C1CCCC1)CCC(=O)N2C',
 'Cc1ccc(C(=O)Oc2ccc(C(CN(C)C)C3(O)CCCCC3)cc2)cc1',
 'COc1ccc(NC(=O)C(Cc2ccccc2)NC(=O)c2ccc(C)cc2)cc1',
 'C/C(=C(\\CCOC(=O)c1ccccc1)SC(=O)c1ccccc1)N(C=O)Cc1cnc(C)nc1N',
 'O=C(CSc1n[nH]c(-c2ccccc2Cl)n1)c1ccc(Br)cc1',
 'CCC(C)c1ccccc1OCC(O)CSc1ccccn1',
 'COCCOC(=O)c1c(N)n(CCc2ccc(OC)c(OC)c2)c2nc3ccccc3nc12',
 'O=C(Nc1cccc(NCc2ccncc2)c1)c1ccccc1Cl',
 'CCn1cc(-c2ccc(Cl)cc2)cc1C(=O)NCCOC',
 'CNc1nc2c(ncn2[C@@H]2O[C@H](CO)[C@@H](O)[C@H]2O)c(=O)[nH]1',
 'c1ccc(CNc2ncnc3nc[nH]c23)cc1',
 'Cc1cn(-c2ccc(NC(=O)Nc3ccccc3C)cc2)nn1',
 'O=C(CCc1ccccc1)NCc1nc(C2CCC(F)(F)CC2)no1',
 'COc1cccc(-c2cc(NCc3ccc4c(c3)OCO4)nc(N)n2)c1',
 'CCOC(=O)c1cnn2ccc(NC(=O)C3CC=CCC3)cc12',
 'CCOc1ccc(NC(=S)NNC(=S)NC2C=CCCC2)cc1',
 'CC#Cc1cncc(-c2ccc3c(c2)[C@]2(N=C(C)C

In [21]:
output_smiles

['OC1=CC=C(C=C1)C1(C(=O)NC2=CC=CC=C12)C1=CC=C(O)C=C1',
 'CC(C)CN1C2=C(NC=N2)C(=O)N(C)C1=O',
 'OC1=CC2=C(C=C1)C1(OC(=O)C3=CC=CC4=C3C1=CC=C4)C1=C(O2)C=C(O)C=C1',
 'COC(=O)C1(C)CCC(=O)C2(C)C1C(OC(C)=O)C(O)C1=CC(=CC(O)=C21)C(C)C',
 'ClC1=CC=CC(NNS(=O)(=O)C2=CC=C(Br)C=C2)=N1',
 'ClC1=C(NC2=NCCN2)C2=NSN=C2C=C1',
 'NC1=NC(N)=C(Cl)N=C1C(O)=NC(=N)NCC1=CC=CC=C1',
 'CCOCC1(COCC)CCC(CC1)C1=C(CN(C)CCNC)C=NN1',
 'COC1=C(OCC2=C(Cl)C=C(Cl)C=C2)C=CC(\\C=N/NC(=O)C2=C(C)N=C(N)S2)=C1',
 'OC(=O)C1=CC(NC2=NC=CC(NC3=CC(C(O)=O)=C(O)C=C3)=N2)=CC=C1',
 'FC1=CC=CC=C1N1C=C(C=N1)C(=O)NC1=CC(NC(=O)C=C)=C(F)C=C1',
 'NC(N)=NS(=O)(=O)C1=CC=C(N)C=C1',
 'CC(C)(O)CCC1=C(O)C=C(O)C2=C1OC(=C(O)C2=O)C1=CC=C(O)C=C1',
 'CCCOC1=C(C=C(C=C1)S(=O)(=O)NCCC1CCCN1C)C1=NC(=O)C2=C(N1)C(CCC)=NN2C',
 'C[C@H]1[C@H]2[C@H](CC3[C@@H]4CC[C@@H]5C[C@H](CC[C@]5(C)C4CC[C@]23C)O[C@@H]2O[C@H](CO)[C@@H](O[C@@H]3O[C@H](CO[C@@H]4OC[C@@H](O)[C@H](O)[C@H]4O)[C@@H](O[C@@H]4O[C@H](CO)[C@H](O)[C@H](O)[C@H]4O)[C@H](O)[C@H]3O)[C@H](O)[C@H]2O)O[C@]11CC[C@H](C

In [25]:
# Re-canonicalize the returned SMILES so they can be compared with canonical_smiles_list
canonical_output = []
for smi in output_smiles:
    canonical_smi = MoleCule.from_smiles(smi).canonical_smiles
    canonical_output.append(canonical_smi)
canonical_output_set = set(canonical_output)

In [26]:
len(canonical_output_set)

13877

In [27]:
# Returned SMILES that still have no match among the inputs after canonicalization
len(canonical_output_set - set(canonical_smiles_list))

548
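
The counts above depend on whether both sides of the set difference use the same textual form. The sketch below shows the general idea of applying one normalizer to both collections before comparing them. It is an illustration under stated assumptions: `compare_smiles` is a hypothetical helper, and the lookup table is a toy stand-in for a real canonicalizer such as `MoleCule.from_smiles(...).canonical_smiles`.

```python
def compare_smiles(expected, returned, normalize=lambda s: s):
    """Compare two SMILES collections after applying the same normalizer to both."""
    expected_keys = {normalize(s) for s in expected}
    returned_keys = {normalize(s) for s in returned}
    return {
        "matched": expected_keys & returned_keys,
        "missing": expected_keys - returned_keys,
        "unexpected": returned_keys - expected_keys,
    }


# Toy normalizer: a lookup table standing in for a real canonicalizer.
TOY_CANONICAL = {"C1=CC=CC=C1": "c1ccccc1", "OCC": "CCO"}


def toy_normalize(smiles):
    return TOY_CANONICAL.get(smiles, smiles)


submitted = ["c1ccccc1", "CCO", "CCN"]
returned = ["C1=CC=CC=C1", "OCC"]

raw = compare_smiles(submitted, returned)
normalized = compare_smiles(submitted, returned, toy_normalize)

print(len(raw["unexpected"]), len(normalized["unexpected"]))
print(sorted(normalized["missing"]))
```

In the toy data, comparing raw strings reports both returned entries as unexpected, while comparing normalized strings leaves only the one input that was never returned.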

In [None]:
def merge_intermediate_results(intermediate_files):
    """
    Merges multiple intermediate JSON result files into a single dictionary.
    """
    merged_results = {}
    for file in intermediate_files:
        try:
            with open(file, 'r') as f:
                data = json.load(f)
                merged_results.update(data)
            print(f"Successfully merged results from {file}")
        except Exception as e:
            print(f"Error merging results from {file}: {e}")
    return merged_results

In [None]:
# Merge all intermediate results
merged_results = merge_intermediate_results(intermediate_files)

In [None]:
# Display the number of classified SMILES
classified_count = len(merged_results)
print(f"Total number of classified SMILES: {classified_count}")

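# Convert the merged results dictionary to a DataFrame.
# Note: this assumes the keys are canonical SMILES and the values are flat records;
# the intermediate file inspected later in this notebook is keyed by query ID instead.
results_df = pd.DataFrame.from_dict(merged_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)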

# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

In [None]:

# Handle unclassified SMILES
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Fill NaN values with 'Unknown'
annotated_df[['superclass', 'class', 'subclass']] = annotated_df[['superclass', 'class', 'subclass']].fillna('Unknown')

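# Save the annotated DataFrame to a TSV file.
# final_output_path points at the merged JSON, so write to a separate path to avoid
# overwriting it (annotated_output_path is a name introduced here for illustration).
annotated_output_path = '../data/annotated_classification_results.tsv'
annotated_df.to_csv(annotated_output_path, sep='\t', index=False)

print(f"Annotated data has been saved to {annotated_output_path}")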

In [8]:
def analyze_structure(data, level=0):
    """Recursively analyze and print the structure of JSON data."""
    if isinstance(data, dict):
        print(" " * level + f"Object with keys: {list(data.keys())}")
        for key, value in data.items():
            analyze_structure(value, level + 2)
    elif isinstance(data, list):
        print(" " * level + f"List of {len(data)} items")
        if len(data) > 0:
            analyze_structure(data[0], level + 2)  # Analyze the first item as representative
    else:
        print(" " * level + f"Value type: {type(data).__name__}")



In [9]:
# Load the JSON file
file_path = "/Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_1.json"  # Replace with your file's path
with open(file_path, "r") as file:
    json_data = json.load(file)

# Analyze the JSON structure
analyze_structure(json_data)

Object with keys: ['12021409']
  List of 100 items
    Object with keys: ['identifier', 'smiles', 'inchikey', 'kingdom', 'superclass', 'class', 'subclass', 'intermediate_nodes', 'direct_parent', 'alternative_parents', 'molecular_framework', 'substituents', 'description', 'external_descriptors', 'ancestors', 'predicted_chebi_terms', 'predicted_lipidmaps_terms', 'classification_version']
      Value type: str
      Value type: str
      Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'c
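
Given the structure printed above (a query ID mapping to a list of entities, each with nested `kingdom`, `superclass`, `class` and `subclass` objects), one way to get a flat table is sketched below. This is an assumption-laden illustration: `flatten_results` is a hypothetical helper, the sample data is hand-written with placeholder names, and only the `name` field of each level is kept.

```python
def flatten_results(data):
    """Turn {query_id: [entity, ...]} into one flat row per classified entity."""
    rows = []
    for query_id, entities in data.items():
        for entity in entities:
            row = {"query_id": query_id, "smiles": entity.get("smiles")}
            for level in ("kingdom", "superclass", "class", "subclass"):
                node = entity.get(level) or {}  # a level may be absent or null
                row[level] = node.get("name")
            rows.append(row)
    return rows


# Tiny hand-written sample in the same shape as the structure printed above.
sample = {
    "q1": [
        {
            "smiles": "CCO",
            "kingdom": {"name": "Example kingdom"},
            "superclass": {"name": "Example superclass"},
            "class": {"name": "Example class"},
            "subclass": None,
        }
    ]
}

rows = flatten_results(sample)
print(rows[0])
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` if a tabular view is needed.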

In [9]:
existing_smiles = load_existing_results(output_dir)
print(f"Already processed SMILES: {len(existing_smiles[0])}")

Already processed SMILES: 300


In [9]:
existing_smiles

({'C(C=CC1=CC=CC=C1)N1CCN(CC1)C(C1=CC=CC=C1)C1=CC=CC=C1',
  'C(CN(CC1=CC=CC=N1)CC1=CC=CC=N1)N(CC1=CC=CC=N1)CC1=CC=CC=N1',
  'C(N1C=CN=C1)C1=CC(CN2C=CN=C2)=CC(CN2C=CN=C2)=C1',
  'C1=CC=C(C=C1)C1=C2C=CC3=C(C=CN=C3C2=NC=C1)C1=CC=CC=C1',
  'C1CN(CCN1)C1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC#CC1(O)CCC2C3CCC4=CC(=O)CCC4=C3C(CC12C)C1=CC=C(C=C1)N(C)C',
  'CC(=O)NCC1CN(C(=O)O1)C1=CC(F)=C(C=C1)N1CCN(CC1)C(=O)CO',
  'CC(C)(C)NC(=O)COC1=CC=C(CNC2=CC3=C(NC(=O)N3)C=C2)C=C1',
  'CC(C)(C)SC1=C(CC(C)(C)C(O)=O)N(CC2=CC=C(Cl)C=C2)C2=C1C=C(OCC1=NC3=CC=CC=C3C=C1)C=C2',
  'CC(C)(CC1CC2=CC=CC=C2C1)NCC(O)COC1=C(C=CC(CCC(O)=O)=C1)C#N',
  'CC(C)(OCc1nn(Cc2ccccc2)c2ccccc12)C(O)=O',
  'CC(C)C(=O)OCC1(CO1)C1=C(OC(=O)C(C)C)C=C(C)C=C1',
  'CC(C)C1=NOC(=N1)C1CCN(CC1)C1=C(C(NC2=C(F)C=C(C=C2)S(C)(=O)=O)=NC=N1)[N+]([O-])=O',
  'CC(C)CC(N1CC2=CC=CC=C2C1=O)C(=O)NC1=CC=CC2=C1C=CN2',
  'CC(C)OC1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC(C1CCC2C3CC=C4CC(CC(O)C4(C)C3CCC12C)OC1OC(COC2OC(CO)C(O

In [10]:
# Input SMILES with no exact string match among the already processed results
a = set(canonical_smiles_list) - set(existing_smiles[0])

In [11]:
len(a)

13971

In [12]:
# Canonicalize the already processed SMILES before comparing them with the inputs
b = set()
for smi in existing_smiles[0]:
    tmp = MoleCule.from_smiles(smi).canonical_smiles
    b.add(tmp)

In [13]:
c = set(canonical_smiles_list) - b

In [14]:
len(c)

13691

In [15]:
len(set(canonical_smiles_list))

13984

In [16]:
len(b)

300

In [22]:
b & c

set()

In [23]:
# Sanity check: c is a subset of canonical_smiles_list by construction, so this is always empty
weird = c - set(canonical_smiles_list)

In [24]:
weird

set()