# PyClassyFire Tutorial: Classifying Chemical Compounds Using the ClassyFire API


## Introduction

Welcome to the **PyClassyFire** tutorial! This guide walks you through classifying a large set of chemical compounds with the [ClassyFire](http://classyfire.wishartlab.com/) API. We'll use the `PyClassyFire` package, which provides a command-line interface (CLI) and programmatic access to the ClassyFire service, so that large sets of structures can be submitted in batches and resumed after failures.

By the end of this tutorial, you'll be able to:

1. **Preprocess your SMILES data**: Prepare your unique SMILES strings for classification.
2. **Submit classification jobs**: Use the `PyClassyFire` package to send your data to the ClassyFire API.
3. **Retrieve and process results**: Collect the classification results and merge them with your original data.
4. **Save the annotated data**: Store the enriched dataset for further analysis.

Let's get started!

## Prerequisites

Before diving into the tutorial, ensure you have the following:

- **Conda Environment**: A Conda environment named `classyfire_env` with all necessary dependencies installed.
- **PyClassyFire Package**: Installed and accessible within your Conda environment.
- **Unique SMILES Data**: A TSV file containing approximately 16,000 unique SMILES strings located at `/Users/macbook/CODE/PyClassyFire/data/unique_valid_smiles_no_header.tsv`.

> **Note:** This tutorial assumes that the Conda environment and `PyClassyFire` package are already set up. If not, please refer to the [repository's README](https://github.com/yourusername/PyClassyFire) for setup instructions.

## Table of Contents

1. [Importing Necessary Libraries](#importing-libraries)
2. [Loading and Exploring the Data](#loading-data)
3. [Preparing the SMILES Data for Classification](#preparing-data)
4. [Submitting Classification Jobs to ClassyFire API](#submitting-jobs)
5. [Monitoring Job Progress](#monitoring-progress)
6. [Retrieving and Processing Results](#retrieving-results)
7. [Saving the Annotated Data](#saving-data)
8. [Conclusion](#conclusion)


In [1]:
import json
import logging
import os

import pandas as pd

from classyfire_cli.src.utils import MoleCule, load_existing_results, save_intermediate_results, merge_intermediate_files, check_all_smiles_present
from classyfire_cli.src.batch import process_batches_with_saving_and_retry

In [2]:
# Define paths
smiles_file_path = '../data/unique_valid_smiles_no_header.tsv'
output_dir = '../data/intermediate_results/'
final_output_path = '../data/final_classification_results.json'

In [3]:
# Load SMILES data
smiles_df = pd.read_csv(smiles_file_path, sep='\t', header=None, names=['SMILES']).dropna()

# Canonicalize SMILES
# (entries that cannot be canonicalized are removed in the next cell)
smiles_df['Canonical_SMILES'] = smiles_df['SMILES'].apply(
    lambda x: MoleCule.from_smiles(x).canonical_smiles if x else None
)

In [4]:
# Remove invalid entries
invalid_smiles = smiles_df['Canonical_SMILES'].isnull().sum()
print(f"Number of invalid SMILES after canonicalization: {invalid_smiles}")

if invalid_smiles > 0:
    smiles_df = smiles_df.dropna(subset=['Canonical_SMILES'])
    print(f"Removed {invalid_smiles} invalid SMILES entries.")

Number of invalid SMILES after canonicalization: 0


In [5]:
# Reset index after cleaning
smiles_df.reset_index(drop=True, inplace=True)

# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()
# Deduplicate while preserving order, so batch composition is reproducible across runs
canonical_smiles_list = list(dict.fromkeys(canonical_smiles_list))

In [6]:
# Define parameters
batch_size = 100          # Number of SMILES per job
save_interval = 20        # Save intermediate results every 20 batches (not passed to the call below)
output_dir = '/Users/macbook/CODE/PyClassyFire/data/intermediate_results/'
max_retries = 3           # Maximum number of retries for failed batches
retry_delay = 10          # Delay between retries in seconds
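
As a rough illustration of what `batch_size` controls, the sketch below splits a list into consecutive chunks. The helper name `make_batches` and the placeholder strings are assumptions made for this example; this is not necessarily how `process_batches_with_saving_and_retry` chunks its input internally.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder strings stand in for real SMILES.
smiles = [f"SMILES_{i}" for i in range(8228)]
batches = make_batches(smiles, batch_size=100)

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

With 8,228 remaining SMILES and a batch size of 100, this gives 83 batches, which is consistent with the batch count reported in the run log.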

In [None]:
# Process the batches with resumption and retry logic
intermediate_files = process_batches_with_saving_and_retry(
    smiles_list=canonical_smiles_list,
    batch_size=batch_size,
    output_dir=output_dir,
    max_retries=max_retries,
    retry_delay=retry_delay
)

All  smiles: 13984
Already processed SMILES: 5960
Remaining SMILES to process: 8228
Remaining unique SMILES to process after removing duplicates: 8228
Total remaining batches to process: 83



Processing Batches:   0%|          | 0/83 [00:00<?, ?it/s][A

Submitted Batch 61 with Query ID 12021720



Processing Batches:   1%|          | 1/83 [01:27<1:59:01, 87.09s/it][A

Batch 61 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_61.json
Submitted Batch 62 with Query ID 12021721



Processing Batches:   2%|▏         | 2/83 [03:01<2:03:29, 91.47s/it][A

Batch 62 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_62.json
Submitted Batch 63 with Query ID 12021723



Processing Batches:   4%|▎         | 3/83 [04:38<2:05:20, 94.01s/it][A

Batch 63 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_63.json
Submitted Batch 64 with Query ID 12021724


ERROR:root:Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021724.json. Retrying (1/3) after 10 seconds...


Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021724.json. Retrying (1/3) after 10 seconds...
Submitted Batch 64 with Query ID 12021726


ERROR:root:Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021726.json. Retrying (2/3) after 10 seconds...


Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021726.json. Retrying (2/3) after 10 seconds...
Submitted Batch 64 with Query ID 12021728


ERROR:root:Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021728.json. Retrying (3/3) after 10 seconds...


Batch 64: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021728.json. Retrying (3/3) after 10 seconds...
Submitted Batch 64 with Query ID 12021730


ERROR:root:Batch 64: Maximum retries reached. Error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021730.json

Processing Batches:   5%|▍         | 4/83 [10:28<4:16:44, 195.00s/it][A

Batch 64: Maximum retries reached. Skipping batch.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_64.json
Submitted Batch 65 with Query ID 12021732


ERROR:root:Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021732.json. Retrying (1/3) after 10 seconds...


Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021732.json. Retrying (1/3) after 10 seconds...
Submitted Batch 65 with Query ID 12021734


ERROR:root:Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021734.json. Retrying (2/3) after 10 seconds...


Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021734.json. Retrying (2/3) after 10 seconds...
Submitted Batch 65 with Query ID 12021735


ERROR:root:Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021735.json. Retrying (3/3) after 10 seconds...


Batch 65: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021735.json. Retrying (3/3) after 10 seconds...
Submitted Batch 65 with Query ID 12021736


ERROR:root:Batch 65: Maximum retries reached. Error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021736.json

Processing Batches:   6%|▌         | 5/83 [15:30<5:03:42, 233.63s/it][A

Batch 65: Maximum retries reached. Skipping batch.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_65.json
Submitted Batch 66 with Query ID 12021737



Processing Batches:   7%|▋         | 6/83 [17:10<4:01:16, 188.00s/it][A

Batch 66 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_66.json
Submitted Batch 67 with Query ID 12021738



Processing Batches:   8%|▊         | 7/83 [19:05<3:27:59, 164.20s/it][A

Batch 67 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_67.json
Submitted Batch 68 with Query ID 12021740



Processing Batches:  10%|▉         | 8/83 [20:41<2:58:08, 142.51s/it][A

Batch 68 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_68.json
Submitted Batch 69 with Query ID 12021741



Processing Batches:  11%|█         | 9/83 [22:17<2:37:44, 127.90s/it][A

Batch 69 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_69.json
Submitted Batch 70 with Query ID 12021742



Processing Batches:  12%|█▏        | 10/83 [23:52<2:23:19, 117.80s/it][A

Batch 70 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_70.json
Submitted Batch 71 with Query ID 12021743



Processing Batches:  13%|█▎        | 11/83 [25:40<2:17:59, 114.99s/it][A

Batch 71 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_71.json
Submitted Batch 72 with Query ID 12021745



Processing Batches:  14%|█▍        | 12/83 [27:18<2:09:53, 109.77s/it][A

Batch 72 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_72.json
Submitted Batch 73 with Query ID 12021746


ERROR:root:Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021746.json. Retrying (1/3) after 10 seconds...


Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021746.json. Retrying (1/3) after 10 seconds...
Submitted Batch 73 with Query ID 12021747


ERROR:root:Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021747.json. Retrying (2/3) after 10 seconds...


Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021747.json. Retrying (2/3) after 10 seconds...
Submitted Batch 73 with Query ID 12021748


ERROR:root:Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021748.json. Retrying (3/3) after 10 seconds...


Batch 73: Error encountered: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021748.json. Retrying (3/3) after 10 seconds...
Submitted Batch 73 with Query ID 12021750


ERROR:root:Batch 73: Maximum retries reached. Error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021750.json

Processing Batches:  16%|█▌        | 13/83 [32:45<3:24:44, 175.50s/it][A

Batch 73: Maximum retries reached. Skipping batch.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_73.json
Submitted Batch 74 with Query ID 12021751



Processing Batches:  17%|█▋        | 14/83 [34:25<2:55:27, 152.58s/it][A

Batch 74 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_74.json
Submitted Batch 75 with Query ID 12021752



Processing Batches:  18%|█▊        | 15/83 [36:03<2:34:25, 136.26s/it][A

Batch 75 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_75.json
Submitted Batch 76 with Query ID 12021753



Processing Batches:  19%|█▉        | 16/83 [37:45<2:20:28, 125.81s/it][A

Batch 76 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_76.json
Submitted Batch 77 with Query ID 12021755



Processing Batches:  20%|██        | 17/83 [39:22<2:09:00, 117.28s/it][A

Batch 77 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_77.json
Submitted Batch 78 with Query ID 12021756



Processing Batches:  22%|██▏       | 18/83 [40:56<1:59:37, 110.43s/it][A

Batch 78 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_78.json
Submitted Batch 79 with Query ID 12021757



Processing Batches:  23%|██▎       | 19/83 [42:34<1:53:30, 106.41s/it][A

Batch 79 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_79.json
Submitted Batch 80 with Query ID 12021759



Processing Batches:  24%|██▍       | 20/83 [44:11<1:48:56, 103.75s/it][A

Batch 80 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_80.json
Submitted Batch 81 with Query ID 12021761



Processing Batches:  25%|██▌       | 21/83 [45:49<1:45:16, 101.89s/it][A

Batch 81 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_81.json
Submitted Batch 82 with Query ID 12021764



Processing Batches:  27%|██▋       | 22/83 [47:27<1:42:33, 100.89s/it][A

Batch 82 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_82.json
Submitted Batch 83 with Query ID 12021767



Processing Batches:  28%|██▊       | 23/83 [49:14<1:42:33, 102.56s/it][A

Batch 83 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_83.json
Submitted Batch 84 with Query ID 12021771



Processing Batches:  29%|██▉       | 24/83 [50:52<1:39:29, 101.18s/it][A

Batch 84 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_84.json
Submitted Batch 85 with Query ID 12021774



Processing Batches:  30%|███       | 25/83 [52:26<1:35:57, 99.26s/it] [A

Batch 85 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_85.json
Submitted Batch 86 with Query ID 12021776



Processing Batches:  31%|███▏      | 26/83 [54:02<1:33:19, 98.23s/it][A

Batch 86 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_86.json
Submitted Batch 87 with Query ID 12021779



Processing Batches:  33%|███▎      | 27/83 [55:43<1:32:24, 99.01s/it][A

Batch 87 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_87.json
Submitted Batch 88 with Query ID 12021783



Processing Batches:  34%|███▎      | 28/83 [57:15<1:28:50, 96.92s/it][A

Batch 88 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_88.json
Submitted Batch 89 with Query ID 12021785



Processing Batches:  35%|███▍      | 29/83 [58:52<1:27:11, 96.88s/it][A

Batch 89 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_89.json
Submitted Batch 90 with Query ID 12021788
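
The log above shows failing batches being resubmitted up to `max_retries` times before they are skipped. A minimal sketch of that pattern follows. The function `run_with_retries` and the `flaky_task` stand-in are hypothetical names for this example and do not reproduce the package's actual implementation.

```python
import time


def run_with_retries(task, max_retries=3, retry_delay=10, sleep=time.sleep):
    """Call task() until it succeeds or the retry budget is spent.

    Returns (result, attempts); result is None if every attempt failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception as exc:
            if attempts > max_retries:
                print(f"Maximum retries reached. Error: {exc}")
                return None, attempts
            print(f"Error encountered: {exc}. "
                  f"Retrying ({attempts}/{max_retries}) after {retry_delay} seconds...")
            sleep(retry_delay)


# Hypothetical task that fails twice and then succeeds.
calls = {"n": 0}


def flaky_task():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("500 Server Error")
    return "ok"


result, attempts = run_with_retries(flaky_task, max_retries=3, retry_delay=0)
print(result, attempts)
```

The first attempt is not counted as a retry, so a batch that never succeeds is submitted `max_retries + 1` times. That matches the four query IDs logged for each skipped batch.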


In [15]:
# Merge the intermediate files into the final JSON
merge_intermediate_files(output_dir, final_output_path)

13685

In [None]:
# Check if all SMILES are present in the final output
check_all_smiles_present(final_output_path, canonical_smiles_list)

In [None]:
def merge_intermediate_results(intermediate_files):
    """
    Merges multiple intermediate JSON result files into a single dictionary.
    """
    merged_results = {}
    for file in intermediate_files:
        try:
            with open(file, 'r') as f:
                data = json.load(f)
                merged_results.update(data)
            print(f"Successfully merged results from {file}")
        except Exception as e:
            print(f"Error merging results from {file}: {e}")
    return merged_results

In [None]:
# Merge all intermediate results
merged_results = merge_intermediate_results(intermediate_files)

In [None]:
# Display the number of classified SMILES
classified_count = len(merged_results)
print(f"Total number of classified SMILES: {classified_count}")

# Convert the merged results dictionary to a DataFrame
results_df = pd.DataFrame.from_dict(merged_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)

# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

In [None]:

# Handle unclassified SMILES
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Fill NaN values with 'Unknown'
annotated_df[['superclass', 'class', 'subclass']] = annotated_df[['superclass', 'class', 'subclass']].fillna('Unknown')

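# Save the annotated DataFrame to a TSV file.
# Use a separate path so the merged JSON at final_output_path is not overwritten.
annotated_output_path = '../data/classified_smiles.tsv'
annotated_df.to_csv(annotated_output_path, sep='\t', index=False)

print(f"Annotated data has been saved to {annotated_output_path}")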

In [8]:
def analyze_structure(data, level=0):
    """Recursively analyze and print the structure of JSON data."""
    if isinstance(data, dict):
        print(" " * level + f"Object with keys: {list(data.keys())}")
        for key, value in data.items():
            analyze_structure(value, level + 2)
    elif isinstance(data, list):
        print(" " * level + f"List of {len(data)} items")
        if len(data) > 0:
            analyze_structure(data[0], level + 2)  # Analyze the first item as representative
    else:
        print(" " * level + f"Value type: {type(data).__name__}")



In [9]:
# Load the JSON file
file_path = "/Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_1.json"  # Replace with your file's path
with open(file_path, "r") as file:
    json_data = json.load(file)

# Analyze the JSON structure
analyze_structure(json_data)

Object with keys: ['12021409']
  List of 100 items
    Object with keys: ['identifier', 'smiles', 'inchikey', 'kingdom', 'superclass', 'class', 'subclass', 'intermediate_nodes', 'direct_parent', 'alternative_parents', 'molecular_framework', 'substituents', 'description', 'external_descriptors', 'ancestors', 'predicted_chebi_terms', 'predicted_lipidmaps_terms', 'classification_version']
      Value type: str
      Value type: str
      Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'c
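
The structure printed above is a dictionary keyed by query ID, where each value is a list of per-molecule records and each taxonomy level is itself a nested object with a `name` field. One way to flatten it into rows is sketched below. The helper `flatten_results` and the sample record are made up for illustration; only the key names come from the output above.

```python
def flatten_results(raw):
    """Turn {query_id: [entity, ...]} into a list of flat row dicts."""
    rows = []
    for query_id, entities in raw.items():
        for entity in entities:
            row = {"query_id": query_id, "smiles": entity.get("smiles")}
            for level in ("kingdom", "superclass", "class", "subclass"):
                node = entity.get(level) or {}  # a level may be missing or null
                row[level] = node.get("name", "Unknown")
            rows.append(row)
    return rows


# Made-up record that mimics the layout shown above (names are placeholders).
sample = {
    "12021409": [
        {
            "smiles": "CCO",
            "kingdom": {"name": "Example kingdom"},
            "superclass": {"name": "Example superclass"},
            "class": {"name": "Example class"},
            "subclass": None,
        }
    ]
}

rows = flatten_results(sample)
print(rows[0])
```

A list of flat dictionaries like this can be passed directly to `pd.DataFrame(rows)`, which gives one row per molecule rather than one row per query ID.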

In [9]:
existing_smiles = load_existing_results(output_dir)
print(f"Already processed SMILES: {len(existing_smiles[0])}")

Already processed SMILES: 300


In [9]:
existing_smiles

({'C(C=CC1=CC=CC=C1)N1CCN(CC1)C(C1=CC=CC=C1)C1=CC=CC=C1',
  'C(CN(CC1=CC=CC=N1)CC1=CC=CC=N1)N(CC1=CC=CC=N1)CC1=CC=CC=N1',
  'C(N1C=CN=C1)C1=CC(CN2C=CN=C2)=CC(CN2C=CN=C2)=C1',
  'C1=CC=C(C=C1)C1=C2C=CC3=C(C=CN=C3C2=NC=C1)C1=CC=CC=C1',
  'C1CN(CCN1)C1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC#CC1(O)CCC2C3CCC4=CC(=O)CCC4=C3C(CC12C)C1=CC=C(C=C1)N(C)C',
  'CC(=O)NCC1CN(C(=O)O1)C1=CC(F)=C(C=C1)N1CCN(CC1)C(=O)CO',
  'CC(C)(C)NC(=O)COC1=CC=C(CNC2=CC3=C(NC(=O)N3)C=C2)C=C1',
  'CC(C)(C)SC1=C(CC(C)(C)C(O)=O)N(CC2=CC=C(Cl)C=C2)C2=C1C=C(OCC1=NC3=CC=CC=C3C=C1)C=C2',
  'CC(C)(CC1CC2=CC=CC=C2C1)NCC(O)COC1=C(C=CC(CCC(O)=O)=C1)C#N',
  'CC(C)(OCc1nn(Cc2ccccc2)c2ccccc12)C(O)=O',
  'CC(C)C(=O)OCC1(CO1)C1=C(OC(=O)C(C)C)C=C(C)C=C1',
  'CC(C)C1=NOC(=N1)C1CCN(CC1)C1=C(C(NC2=C(F)C=C(C=C2)S(C)(=O)=O)=NC=N1)[N+]([O-])=O',
  'CC(C)CC(N1CC2=CC=CC=C2C1=O)C(=O)NC1=CC=CC2=C1C=CN2',
  'CC(C)OC1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC(C1CCC2C3CC=C4CC(CC(O)C4(C)C3CCC12C)OC1OC(COC2OC(CO)C(O

In [10]:
# SMILES still to process, comparing against the stored strings as-is
a = set(canonical_smiles_list) - set(existing_smiles[0])

In [11]:
len(a)

13971

In [12]:
# Re-canonicalize the stored SMILES so they are comparable with canonical_smiles_list
b = set()
for smi in existing_smiles[0]:
    tmp = MoleCule.from_smiles(smi).canonical_smiles
    b.add(tmp)

In [13]:
# Recompute the remaining set using the re-canonicalized strings
c = set(canonical_smiles_list) - b

In [14]:
len(c)

13691

In [15]:
len(set(canonical_smiles_list))

13984

In [16]:
len(b)

300

In [22]:
b & c

set()

In [23]:
weird = c - set(canonical_smiles_list) 

In [24]:
weird

set()

In [None]:
def load_existing_results(output_dir):
    """
    Loads existing intermediate JSON files and returns their merged contents as a dictionary.
    """
    merged_results = {}
    if not os.path.exists(output_dir):
        return merged_results
    for file in os.listdir(output_dir):
        if file.startswith('intermediate_') and file.endswith('.json'):
            with open(os.path.join(output_dir, file), 'r') as f:
                data = json.load(f)
                merged_results.update(data)
    return merged_results
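
To resume a run, the stored results need to be reduced to the set of SMILES that already have a classification. The sketch below assumes each intermediate file has the `{query_id: [records]}` layout shown earlier, with a `smiles` field on every record. The function name `collect_processed_smiles` is hypothetical and is not part of the package.

```python
import json
import os
import tempfile


def collect_processed_smiles(output_dir):
    """Return the set of SMILES found in intermediate_*.json files."""
    processed = set()
    if not os.path.isdir(output_dir):
        return processed
    for name in sorted(os.listdir(output_dir)):
        if name.startswith("intermediate_") and name.endswith(".json"):
            with open(os.path.join(output_dir, name), "r") as f:
                data = json.load(f)
            for entities in data.values():
                if not isinstance(entities, list):
                    continue  # tolerate files that do not follow the assumed layout
                for entity in entities:
                    smi = entity.get("smiles")
                    if smi:
                        processed.add(smi)
    return processed


# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "intermediate_1.json"), "w") as f:
        json.dump({"1": [{"smiles": "CCO"}, {"smiles": "CCN"}]}, f)
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("ignored")
    done = collect_processed_smiles(tmp)

remaining = [s for s in ["CCO", "CCN", "CCC"] if s not in done]
print(sorted(done), remaining)
```

As the set comparisons earlier in this notebook suggest, stored SMILES are not always string-identical to the submitted ones, so it is safer to re-canonicalize both sides before computing what remains.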

In [None]:
def merge_intermediate_results(intermediate_files):
    """
    Merges multiple intermediate JSON result files into a single dictionary.

    Parameters:
    - intermediate_files (list): List of file paths to intermediate JSON files.

    Returns:
    - merged_results (dict): Merged classification results.
    """
    merged_results = {}
    for file in intermediate_files:
        try:
            with open(file, 'r') as f:
                data = json.load(f)
                merged_results.update(data)
            logging.info(f"Successfully merged results from {file}")
            print(f"Successfully merged results from {file}")
        except Exception as e:
            logging.error(f"Error merging results from {file}: {e}")
            print(f"Error merging results from {file}: {e}")
    return merged_results

In [None]:
# Merge all intermediate results
merged_results = merge_intermediate_results(intermediate_files)

# Display the number of classified SMILES
classified_count = len(merged_results)
print(f"Total number of classified SMILES: {classified_count}")

In [None]:
# Convert the merged results dictionary to a DataFrame
# This assumes merged_results maps canonical SMILES to classification details.
# The intermediate files inspected earlier are keyed by query ID, so flatten them first if needed.
results_df = pd.DataFrame.from_dict(merged_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)

# Display the first few entries of the results
results_df.head()

In [None]:
# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

# Display the merged DataFrame
annotated_df.head()

In [None]:
# Check for any SMILES that did not receive a classification
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Optionally, handle unclassified SMILES (e.g., mark as 'Unknown')
# Assign the result back; chained fillna(..., inplace=True) is deprecated in recent pandas
cols = ['superclass', 'class', 'subclass']
annotated_df[cols] = annotated_df[cols].fillna('Unknown')

In [None]:
# Define the output path for the annotated data
final_output_path = '/Users/macbook/CODE/PyClassyFire/data/classified_smiles.tsv'

# Save the annotated DataFrame to a TSV file
annotated_df.to_csv(final_output_path, sep='\t', index=False)

print(f"Annotated data has been saved to {final_output_path}")