# PyClassyFire Tutorial: Classifying Chemical Compounds Using the ClassyFire API


## Introduction

Welcome to the **PyClassyFire** tutorial! This guide walks you through classifying a large set of chemical compounds with the [ClassyFire](http://classyfire.wishartlab.com/) API. We'll use the `PyClassyFire` package, which provides a command-line interface (CLI) and programmatic access to the ClassyFire service, so that large sets of chemical structures can be submitted in batches and resumed after interruptions.

By the end of this tutorial, you'll be able to:

1. **Preprocess your SMILES data**: Prepare your unique SMILES strings for classification.
2. **Submit classification jobs**: Use the `PyClassyFire` package to send your data to the ClassyFire API.
3. **Retrieve and process results**: Collect the classification results and merge them with your original data.
4. **Save the annotated data**: Store the enriched dataset for further analysis.

Let's get started!

## Prerequisites

Before diving into the tutorial, ensure you have the following:

- **Conda Environment**: A Conda environment named `classyfire_env` with all necessary dependencies installed.
- **PyClassyFire Package**: Installed and accessible within your Conda environment.
- **Unique SMILES Data**: A TSV file containing approximately 16,000 unique SMILES strings located at `/Users/macbook/CODE/PyClassyFire/data/unique_valid_smiles_no_header.tsv`.

> **Note:** This tutorial assumes that the Conda environment and `PyClassyFire` package are already set up. If not, please refer to the [repository's README](https://github.com/yourusername/PyClassyFire) for setup instructions.

## Table of Contents

1. [Importing Necessary Libraries](#importing-libraries)
2. [Loading and Exploring the Data](#loading-data)
3. [Preparing the SMILES Data for Classification](#preparing-data)
4. [Submitting Classification Jobs to ClassyFire API](#submitting-jobs)
5. [Monitoring Job Progress](#monitoring-progress)
6. [Retrieving and Processing Results](#retrieving-results)
7. [Saving the Annotated Data](#saving-data)
8. [Conclusion](#conclusion)


In [1]:
import os
import json

import pandas as pd

from classyfire_cli.src.utils import MoleCule, load_existing_results, save_intermediate_results
from classyfire_cli.src.batch import process_batches_with_saving_and_retry

In [2]:
# Define paths
smiles_file_path = '../data/unique_valid_smiles_no_header.tsv'
output_dir = '../data/intermediate_results/'
final_output_path = '/Users/macbook/CODE/PyClassyFire/data/classified_smiles.tsv'

In [3]:
# Load SMILES data
smiles_df = pd.read_csv(smiles_file_path, sep='\t', header=None, names=['SMILES']).dropna()

# Canonicalize SMILES; rows without a canonical form are removed in the next cell
smiles_df['Canonical_SMILES'] = smiles_df['SMILES'].apply(
    lambda x: MoleCule.from_smiles(x).canonical_smiles if x else None
)

In [4]:
# Remove invalid entries
invalid_smiles = smiles_df['Canonical_SMILES'].isnull().sum()
print(f"Number of invalid SMILES after canonicalization: {invalid_smiles}")

if invalid_smiles > 0:
    smiles_df = smiles_df.dropna(subset=['Canonical_SMILES'])
    print(f"Removed {invalid_smiles} invalid SMILES entries.")

Number of invalid SMILES after canonicalization: 0


In [5]:
# Reset index after cleaning
smiles_df.reset_index(drop=True, inplace=True)

# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()
canonical_smiles_list = list(set(canonical_smiles_list))
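
`list(set(...))` removes duplicates but does not keep the original order, and the iteration order of a set of strings can differ between Python processes. If stable batch composition matters when a run is resumed, an order-preserving deduplication is one option. The sketch below is a minimal illustration; `unique_in_order` is a hypothetical helper, not part of PyClassyFire.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


smiles = ["CCO", "c1ccccc1", "CCO", "CC(=O)O", "c1ccccc1"]
print(unique_in_order(smiles))  # ['CCO', 'c1ccccc1', 'CC(=O)O']
```

In the cell above, this would replace the `list(set(canonical_smiles_list))` line with `unique_in_order(canonical_smiles_list)`.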

In [6]:
# Define parameters
batch_size = 100          # Number of SMILES per job
save_interval = 20        # Save intermediate results every 20 batches
output_dir = '/Users/macbook/CODE/PyClassyFire/data/intermediate_results/'
max_retries = 3           # Maximum number of retries for failed batches
retry_delay = 10          # Delay between retries in seconds

In [None]:
# Process the batches with resumption and retry logic
intermediate_files = process_batches_with_saving_and_retry(
    smiles_list=canonical_smiles_list,
    batch_size=batch_size,
    output_dir=output_dir,
    max_retries=max_retries,
    retry_delay=retry_delay
)

All  smiles: 13984
Already processed SMILES: 493
Remaining SMILES to process: 13503
Remaining unique SMILES to process after removing duplicates: 13503
Total remaining batches to process: 136




Processing Batches:   0%|          | 0/136 [00:00<?, ?it/s]

Submitted Batch 6 with Query ID 12021447

Processing Batches:   1%|          | 1/136 [01:54<4:17:42, 114.54s/it]

Batch 6 completed with 100 molecules.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_6.json
Submitted Batch 7 with Query ID 12021449


In [15]:
len(set(canonical_smiles_list))

13685

In [None]:
def merge_intermediate_results(intermediate_files):
    """
    Merges multiple intermediate JSON result files into a single dictionary.
    """
    merged_results = {}
    for file in intermediate_files:
        try:
            with open(file, 'r') as f:
                data = json.load(f)
                merged_results.update(data)
            print(f"Successfully merged results from {file}")
        except Exception as e:
            print(f"Error merging results from {file}: {e}")
    return merged_results

In [None]:
# Merge all intermediate results
merged_results = merge_intermediate_results(intermediate_files)

In [None]:
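# Display the number of classified SMILES.
# merged_results maps each query ID to a list of classified molecules,
# so flatten the lists before counting.
records = [entity for entities in merged_results.values() for entity in entities]
classified_count = len(records)
print(f"Total number of classified SMILES: {classified_count}")

# Convert the flattened records to a DataFrame (one row per molecule)
results_df = pd.DataFrame(records)
# Re-canonicalize the SMILES returned by ClassyFire so they match smiles_df
results_df['Canonical_SMILES'] = results_df['smiles'].apply(
    lambda x: MoleCule.from_smiles(x).canonical_smiles
)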

# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

In [None]:

# Handle unclassified SMILES
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Fill NaN values with 'Unknown'
annotated_df[['superclass', 'class', 'subclass']] = annotated_df[['superclass', 'class', 'subclass']].fillna('Unknown')

# Save the annotated DataFrame to a TSV file
annotated_df.to_csv(final_output_path, sep='\t', index=False)

print(f"Annotated data has been saved to {final_output_path}")

In [8]:
def analyze_structure(data, level=0):
    """Recursively analyze and print the structure of JSON data."""
    if isinstance(data, dict):
        print(" " * level + f"Object with keys: {list(data.keys())}")
        for key, value in data.items():
            analyze_structure(value, level + 2)
    elif isinstance(data, list):
        print(" " * level + f"List of {len(data)} items")
        if len(data) > 0:
            analyze_structure(data[0], level + 2)  # Analyze the first item as representative
    else:
        print(" " * level + f"Value type: {type(data).__name__}")



In [9]:
# Load the JSON file
file_path = "/Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_1.json"  # Replace with your file's path
with open(file_path, "r") as file:
    json_data = json.load(file)

# Analyze the JSON structure
analyze_structure(json_data)

Object with keys: ['12021409']
  List of 100 items
    Object with keys: ['identifier', 'smiles', 'inchikey', 'kingdom', 'superclass', 'class', 'subclass', 'intermediate_nodes', 'direct_parent', 'alternative_parents', 'molecular_framework', 'substituents', 'description', 'external_descriptors', 'ancestors', 'predicted_chebi_terms', 'predicted_lipidmaps_terms', 'classification_version']
      Value type: str
      Value type: str
      Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'chemont_id', 'url']
        Value type: str
        Value type: str
        Value type: str
        Value type: str
      Object with keys: ['name', 'description', 'c
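
The structure printed above shows that an intermediate file is keyed by query ID, with each key holding a list of per-molecule objects whose `kingdom`, `superclass`, and `class` entries are themselves objects with a `name` field. A dictionary of this shape has one key per batch, not per molecule. The sketch below flattens such a dictionary into one plain record per molecule. `flatten_results` is a hypothetical helper, the sample values are made up for illustration, and treating `subclass` like the other levels is an assumption because the printed output is truncated at that point.

```python
def flatten_results(results_by_query):
    """Turn {query_id: [entity, ...]} into a list of flat per-molecule dicts."""
    levels = ("kingdom", "superclass", "class", "subclass")
    rows = []
    for query_id, entities in results_by_query.items():
        for entity in entities or []:  # skipped batches may hold an empty list
            row = {"query_id": query_id, "smiles": entity.get("smiles")}
            for level in levels:
                node = entity.get(level) or {}  # a level can be missing or null
                row[level] = node.get("name")
            rows.append(row)
    return rows


# Made-up sample in the shape shown above (not real ClassyFire output)
sample = {
    "q1": [
        {"smiles": "CCO", "kingdom": {"name": "K"}, "superclass": {"name": "S"},
         "class": {"name": "C"}, "subclass": None},
    ],
    "q2": [],
}
rows = flatten_results(sample)
print(len(rows), rows[0]["superclass"], rows[0]["subclass"])  # 1 S None
```

The resulting list can be passed to `pd.DataFrame(rows)` to get one row per molecule with plain string columns.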

In [9]:
existing_smiles = load_existing_results(output_dir)
print(f"Already processed SMILES: {len(existing_smiles[0])}")

Already processed SMILES: 300


In [9]:
existing_smiles

({'C(C=CC1=CC=CC=C1)N1CCN(CC1)C(C1=CC=CC=C1)C1=CC=CC=C1',
  'C(CN(CC1=CC=CC=N1)CC1=CC=CC=N1)N(CC1=CC=CC=N1)CC1=CC=CC=N1',
  'C(N1C=CN=C1)C1=CC(CN2C=CN=C2)=CC(CN2C=CN=C2)=C1',
  'C1=CC=C(C=C1)C1=C2C=CC3=C(C=CN=C3C2=NC=C1)C1=CC=CC=C1',
  'C1CN(CCN1)C1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC#CC1(O)CCC2C3CCC4=CC(=O)CCC4=C3C(CC12C)C1=CC=C(C=C1)N(C)C',
  'CC(=O)NCC1CN(C(=O)O1)C1=CC(F)=C(C=C1)N1CCN(CC1)C(=O)CO',
  'CC(C)(C)NC(=O)COC1=CC=C(CNC2=CC3=C(NC(=O)N3)C=C2)C=C1',
  'CC(C)(C)SC1=C(CC(C)(C)C(O)=O)N(CC2=CC=C(Cl)C=C2)C2=C1C=C(OCC1=NC3=CC=CC=C3C=C1)C=C2',
  'CC(C)(CC1CC2=CC=CC=C2C1)NCC(O)COC1=C(C=CC(CCC(O)=O)=C1)C#N',
  'CC(C)(OCc1nn(Cc2ccccc2)c2ccccc12)C(O)=O',
  'CC(C)C(=O)OCC1(CO1)C1=C(OC(=O)C(C)C)C=C(C)C=C1',
  'CC(C)C1=NOC(=N1)C1CCN(CC1)C1=C(C(NC2=C(F)C=C(C=C2)S(C)(=O)=O)=NC=N1)[N+]([O-])=O',
  'CC(C)CC(N1CC2=CC=CC=C2C1=O)C(=O)NC1=CC=CC2=C1C=CN2',
  'CC(C)OC1=CC=C(C=C1)C1=CN2N=CC(=C2N=C1)C1=CC=NC2=CC=CC=C12',
  'CC(C1CCC2C3CC=C4CC(CC(O)C4(C)C3CCC12C)OC1OC(COC2OC(CO)C(O

In [10]:
a = set(canonical_smiles_list) - set(existing_smiles[0])

In [11]:
len(a)

13971

In [12]:
b = set()
for smi in existing_smiles[0]:
    tmp = MoleCule.from_smiles(smi).canonical_smiles
    b.add(tmp)

In [13]:
c = set(canonical_smiles_list) - b

In [14]:
len(c)

13691

In [15]:
len(set(canonical_smiles_list))

13984

In [16]:
len(b)

300

In [22]:
b & c

set()

In [23]:
weird = c - set(canonical_smiles_list) 

In [24]:
weird

set()
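
The comparisons above show why the resumption check depends on how the SMILES are written: subtracting the stored strings directly removed only 13 entries, while subtracting their re-canonicalized forms removed 293. The sketch below shows the same idea with a pluggable `normalize` function. `remaining_items` is a hypothetical helper, and the toy lookup table stands in for `MoleCule.from_smiles(...).canonical_smiles`; it is not a real canonicalizer.

```python
def remaining_items(all_items, done_items, normalize):
    """Return the items whose normalized form is not among the processed ones."""
    done = {normalize(item) for item in done_items}
    return {item for item in all_items if normalize(item) not in done}


# Toy stand-in for canonicalization: maps alternative spellings to one form
toy_canonical = {"OCC": "CCO", "C1=CC=CC=C1": "c1ccccc1"}


def normalize(smiles):
    return toy_canonical.get(smiles, smiles)


all_items = ["CCO", "c1ccccc1", "CC(=O)O"]
done_items = ["OCC", "C1=CC=CC=C1"]  # same molecules, written differently

naive = set(all_items) - set(done_items)
aware = remaining_items(all_items, done_items, normalize)
print(len(naive), sorted(aware))  # 3 ['CC(=O)O']
```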

In [8]:
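import json
import logging
import os
import time
import urllib.error

from tqdm import tqdm

# chunk_tasks, Job, and Scheduler are assumed to be provided by the
# PyClassyFire package; their import path is not shown in this notebook.

def process_batches_with_saving_and_retry(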
    smiles_list,
    batch_size=100,
    save_interval=20,
    output_dir='../data/intermediate_results/',
    max_retries=3,
    retry_delay=300  # in seconds (5 minutes)
):
    """
    Processes SMILES in batches, submits them to the ClassyFire API, saves intermediate results,
    and implements retry logic for failed batches.
    
    Parameters:
    - smiles_list (list): List of canonical SMILES strings to classify.
    - batch_size (int): Number of SMILES per batch/job.
    - save_interval (int): Save intermediate results every 'save_interval' batches.
    - output_dir (str): Directory to save intermediate result files.
    - max_retries (int): Maximum number of retries for failed batches.
    - retry_delay (int): Delay between retries in seconds.
    
    Returns:
    - saved_files (list): List of paths to the saved intermediate JSON files.
    """
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    
    # Split the SMILES list into batches
    batches = chunk_tasks(smiles_list, batch_size)
    total_batches = len(batches)
    print(f"Total number of batches to process: {total_batches}")
    logging.info(f"Total number of batches to process: {total_batches}")
    
    # Create Job instances for each batch
    jobs = [Job(batch) for batch in batches]
    
    # Initialize Scheduler
    scheduler = Scheduler(jobs)
    
    # Initialize counters
    processed_batches = 0
    saved_files = []
    current_batch = 0
    
    # Initialize progress bar
    pbar = tqdm(total=total_batches, desc="Processing Batches")
    
    while scheduler.jobs:
        try:
            job = scheduler.jobs[0]
            if job.query_id is None:
                # Submit the job
                job.submit()
                logging.info(f"Submitted Job {current_batch + 1}/{total_batches} with Query ID {job.query_id}")
                print(f"Submitted Job {current_batch + 1}/{total_batches} with Query ID {job.query_id}")
                time.sleep(60)  # Wait before checking status
            elif job.is_done:
                # Parse and store the results
                scheduler.results[job.query_id] = job.parse_results()
                logging.info(f"Job {current_batch + 1}/{total_batches} completed.")
                print(f"Job {current_batch + 1}/{total_batches} completed.")
                scheduler.jobs.pop(0)
                processed_batches += 1
                current_batch += 1
                pbar.update(1)
                time.sleep(10)  # Short wait before processing next job
                
                # Save intermediate results every 'save_interval' batches
                if processed_batches % save_interval == 0:
                    intermediate_file = os.path.join(output_dir, f'intermediate_{processed_batches}.json')
                    with open(intermediate_file, 'w') as f:
                        json.dump(scheduler.results, f, indent=4)
                    saved_files.append(intermediate_file)
                    logging.info(f"Saved intermediate results to {intermediate_file}")
                    print(f"Saved intermediate results to {intermediate_file}")
            elif job.is_stale:
                # Handle stale jobs
                logging.warning(f"Job {current_batch + 1}/{total_batches} is stale. Skipping.")
                print(f"Job {current_batch + 1}/{total_batches} is stale. Skipping.")
                scheduler.results[job.query_id] = []
                scheduler.jobs.pop(0)
                processed_batches += 1
                current_batch += 1
                pbar.update(1)
            else:
                # Job is still running
                time.sleep(60)  # Wait before rechecking
        except urllib.error.HTTPError as e:
            logging.error(f"HTTPError encountered: {e}. Retrying Job {current_batch + 1}/{total_batches}.")
            print(f"HTTPError encountered: {e}. Retrying Job {current_batch + 1}/{total_batches}.")
            if scheduler.retry < max_retries:
                scheduler.retry += 1
                logging.info(f"Retrying Job {current_batch + 1}/{total_batches} after {retry_delay} seconds.")
                print(f"Retrying Job {current_batch + 1}/{total_batches} after {retry_delay} seconds.")
                time.sleep(retry_delay)  # Wait before retrying
            else:
                logging.error(f"Maximum retries reached for Job {current_batch + 1}/{total_batches}. Skipping.")
                print(f"Maximum retries reached for Job {current_batch + 1}/{total_batches}. Skipping.")
                scheduler.jobs.pop(0)
                scheduler.results[job.query_id] = []
                processed_batches += 1
                current_batch += 1
                pbar.update(1)
                scheduler.retry = 0  # Reset retry counter
        except Exception as e:
            logging.error(f"Unexpected error: {e}. Skipping Job {current_batch + 1}/{total_batches}.")
            print(f"Unexpected error: {e}. Skipping Job {current_batch + 1}/{total_batches}.")
            scheduler.jobs.pop(0)
            scheduler.results[job.query_id] = []
            processed_batches += 1
            current_batch += 1
            pbar.update(1)
    
    pbar.close()
    
    # Save any remaining results after processing all jobs
    if processed_batches % save_interval != 0:
        intermediate_file = os.path.join(output_dir, f'intermediate_{processed_batches}.json')
        with open(intermediate_file, 'w') as f:
            json.dump(scheduler.results, f, indent=4)
        saved_files.append(intermediate_file)
        logging.info(f"Saved final intermediate results to {intermediate_file}")
        print(f"Saved final intermediate results to {intermediate_file}")
    
    return saved_files

In [None]:
def load_existing_results(output_dir):
    """
    Loads existing intermediate JSON files and returns their merged contents as a dictionary.
    """
    merged_results = {}
    if not os.path.exists(output_dir):
        return merged_results
    for file in os.listdir(output_dir):
        if file.startswith('intermediate_') and file.endswith('.json'):
            with open(os.path.join(output_dir, file), 'r') as f:
                data = json.load(f)
                merged_results.update(data)
    return merged_results
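
`os.listdir` returns entries in arbitrary order, and `dict.update` lets later files overwrite earlier ones, so the order in which intermediate files are merged can affect the result when files overlap. One way to make it deterministic is to sort the files by their numeric suffix. The sketch below assumes the `intermediate_<number>.json` naming used above; `list_intermediate_files` is a hypothetical helper, not part of PyClassyFire.

```python
import re
from pathlib import Path

_INTERMEDIATE = re.compile(r"^intermediate_(\d+)\.json$")


def list_intermediate_files(output_dir):
    """Return intermediate_<n>.json paths sorted by n (numeric, not lexicographic)."""
    found = []
    for path in Path(output_dir).glob("intermediate_*.json"):
        match = _INTERMEDIATE.match(path.name)
        if match:  # ignore names such as intermediate_final.json
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]
```

A plain string sort would place `intermediate_10.json` before `intermediate_2.json`; sorting on the parsed integer avoids that.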

In [9]:
# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()

# Define parameters
batch_size = 100          # Number of SMILES per job
save_interval = 20        # Save intermediate results every 20 batches
output_dir = '/Users/macbook/CODE/PyClassyFire/data/intermediate_results/'
max_retries = 3           # Maximum number of retries for failed batches
retry_delay = 300         # Delay between retries in seconds (5 minutes)

# Process the batches and save intermediate results with retry logic
intermediate_files = process_batches_with_saving_and_retry(
    smiles_list=canonical_smiles_list,
    batch_size=batch_size,
    save_interval=save_interval,
    output_dir=output_dir,
    max_retries=max_retries,
    retry_delay=retry_delay
)

Total number of batches to process: 140


Processing Batches:   0%|          | 0/140 [00:00<?, ?it/s]

Submitted Job 1/140 with Query ID 12021290


Processing Batches:   1%|          | 1/140 [01:49<4:13:04, 109.24s/it]

Job 1/140 completed.
Submitted Job 2/140 with Query ID 12021291


Processing Batches:   1%|▏         | 2/140 [03:52<4:29:52, 117.34s/it]

Job 2/140 completed.
Submitted Job 3/140 with Query ID 12021292


Processing Batches:   2%|▏         | 3/140 [05:49<4:28:16, 117.49s/it]

Job 3/140 completed.
Submitted Job 4/140 with Query ID 12021293


Processing Batches:   3%|▎         | 4/140 [07:49<4:27:57, 118.22s/it]

Job 4/140 completed.
Submitted Job 5/140 with Query ID 12021294


Processing Batches:   4%|▎         | 5/140 [09:46<4:25:21, 117.94s/it]

Job 5/140 completed.
Submitted Job 6/140 with Query ID 12021295


Processing Batches:   4%|▍         | 6/140 [11:44<4:23:34, 118.02s/it]

Job 6/140 completed.
Submitted Job 7/140 with Query ID 12021296


Processing Batches:   5%|▌         | 7/140 [13:53<4:29:34, 121.61s/it]

Job 7/140 completed.
Submitted Job 8/140 with Query ID 12021297


Processing Batches:   6%|▌         | 8/140 [16:06<4:35:05, 125.04s/it]

Job 8/140 completed.
Submitted Job 9/140 with Query ID 12021298


Processing Batches:   6%|▋         | 9/140 [18:16<4:36:43, 126.74s/it]

Job 9/140 completed.
Submitted Job 10/140 with Query ID 12021299


Processing Batches:   7%|▋         | 10/140 [20:30<4:39:13, 128.88s/it]

Job 10/140 completed.
Submitted Job 11/140 with Query ID 12021300


Processing Batches:   8%|▊         | 11/140 [22:41<4:38:38, 129.60s/it]

Job 11/140 completed.
Submitted Job 12/140 with Query ID 12021301


Processing Batches:   9%|▊         | 12/140 [24:56<4:39:42, 131.11s/it]

Job 12/140 completed.
Submitted Job 13/140 with Query ID 12021302


Processing Batches:   9%|▉         | 13/140 [27:08<4:38:26, 131.55s/it]

Job 13/140 completed.
Submitted Job 14/140 with Query ID 12021303


Processing Batches:  10%|█         | 14/140 [29:18<4:34:52, 130.89s/it]

Job 14/140 completed.
Submitted Job 15/140 with Query ID 12021304


Processing Batches:  11%|█         | 15/140 [31:25<4:30:15, 129.72s/it]

Job 15/140 completed.
Submitted Job 16/140 with Query ID 12021305


Processing Batches:  11%|█▏        | 16/140 [33:32<4:26:20, 128.87s/it]

Job 16/140 completed.
Submitted Job 17/140 with Query ID 12021306


Processing Batches:  12%|█▏        | 17/140 [35:42<4:25:14, 129.38s/it]

Job 17/140 completed.
Submitted Job 18/140 with Query ID 12021308


Processing Batches:  13%|█▎        | 18/140 [38:26<4:43:51, 139.60s/it]

Job 18/140 completed.
Submitted Job 19/140 with Query ID 12021309


Processing Batches:  14%|█▎        | 19/140 [40:38<4:37:23, 137.55s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021309.json. Skipping Job 19/140.
Submitted Job 20/140 with Query ID 12021311


Processing Batches:  14%|█▍        | 20/140 [42:55<4:34:30, 137.26s/it]

Job 20/140 completed.
Saved intermediate results to /Users/macbook/CODE/PyClassyFire/data/intermediate_results/intermediate_20.json
Submitted Job 21/140 with Query ID 12021312


Processing Batches:  15%|█▌        | 21/140 [45:08<4:30:01, 136.15s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021312.json. Skipping Job 21/140.
Submitted Job 22/140 with Query ID 12021313


Processing Batches:  16%|█▌        | 22/140 [47:13<4:20:56, 132.69s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021313.json. Skipping Job 22/140.
Submitted Job 23/140 with Query ID 12021314


Processing Batches:  16%|█▋        | 23/140 [49:34<4:23:38, 135.20s/it]

Job 23/140 completed.
Submitted Job 24/140 with Query ID 12021316


Processing Batches:  17%|█▋        | 24/140 [52:05<4:30:34, 139.96s/it]

Job 24/140 completed.
Submitted Job 25/140 with Query ID 12021319


Processing Batches:  18%|█▊        | 25/140 [54:28<4:30:04, 140.91s/it]

Job 25/140 completed.
Submitted Job 26/140 with Query ID 12021322


Processing Batches:  19%|█▊        | 26/140 [57:04<4:35:57, 145.24s/it]

Job 26/140 completed.
Submitted Job 27/140 with Query ID 12021325


Processing Batches:  19%|█▉        | 27/140 [59:44<4:41:59, 149.73s/it]

Job 27/140 completed.
Submitted Job 28/140 with Query ID 12021328


Processing Batches:  20%|██        | 28/140 [1:01:56<4:29:33, 144.41s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021328.json. Skipping Job 28/140.
Submitted Job 29/140 with Query ID 12021331


Processing Batches:  21%|██        | 29/140 [1:04:01<4:16:44, 138.78s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021331.json. Skipping Job 29/140.
Submitted Job 30/140 with Query ID 12021333


Processing Batches:  21%|██▏       | 30/140 [1:06:25<4:16:51, 140.11s/it]

Job 30/140 completed.
Submitted Job 31/140 with Query ID 12021337


Processing Batches:  22%|██▏       | 31/140 [1:08:39<4:11:21, 138.36s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021337.json. Skipping Job 31/140.
Submitted Job 32/140 with Query ID 12021338


Processing Batches:  23%|██▎       | 32/140 [1:11:04<4:12:23, 140.22s/it]

Job 32/140 completed.
Submitted Job 33/140 with Query ID 12021341


Processing Batches:  24%|██▎       | 33/140 [1:13:16<4:06:00, 137.95s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021341.json. Skipping Job 33/140.
Submitted Job 34/140 with Query ID 12021343


Processing Batches:  24%|██▍       | 34/140 [1:15:20<3:55:59, 133.58s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021343.json. Skipping Job 34/140.
Submitted Job 35/140 with Query ID 12021345


Processing Batches:  25%|██▌       | 35/140 [1:17:21<3:47:33, 130.03s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021345.json. Skipping Job 35/140.
Submitted Job 36/140 with Query ID 12021346


Processing Batches:  26%|██▌       | 36/140 [1:19:24<3:41:33, 127.82s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021346.json. Skipping Job 36/140.
Submitted Job 37/140 with Query ID 12021349


Processing Batches:  26%|██▋       | 37/140 [1:21:29<3:38:09, 127.08s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021349.json. Skipping Job 37/140.
Submitted Job 38/140 with Query ID 12021353


Processing Batches:  27%|██▋       | 38/140 [1:23:58<3:47:11, 133.65s/it]

Job 38/140 completed.
Submitted Job 39/140 with Query ID 12021356


Processing Batches:  28%|██▊       | 39/140 [1:26:31<3:54:48, 139.49s/it]

Job 39/140 completed.
Submitted Job 40/140 with Query ID 12021359


Processing Batches:  29%|██▊       | 40/140 [1:28:44<3:49:08, 137.48s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021359.json. Skipping Job 40/140.
Submitted Job 41/140 with Query ID 12021362


Processing Batches:  29%|██▉       | 41/140 [1:30:47<3:39:21, 132.95s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021362.json. Skipping Job 41/140.
Submitted Job 42/140 with Query ID 12021365


Processing Batches:  30%|███       | 42/140 [1:33:15<3:44:51, 137.67s/it]

Job 42/140 completed.
Submitted Job 43/140 with Query ID 12021371


Processing Batches:  31%|███       | 43/140 [1:35:32<3:41:56, 137.29s/it]

Job 43/140 completed.
Submitted Job 44/140 with Query ID 12021376


Processing Batches:  31%|███▏      | 44/140 [1:38:04<3:46:52, 141.79s/it]

Job 44/140 completed.
Submitted Job 45/140 with Query ID 12021377


Processing Batches:  32%|███▏      | 45/140 [1:40:40<3:51:14, 146.05s/it]

Job 45/140 completed.
Submitted Job 46/140 with Query ID 12021378


Processing Batches:  33%|███▎      | 46/140 [1:42:58<3:45:08, 143.70s/it]

Job 46/140 completed.
Submitted Job 47/140 with Query ID 12021379


Processing Batches:  34%|███▎      | 47/140 [1:45:26<3:44:27, 144.81s/it]

Job 47/140 completed.
Submitted Job 48/140 with Query ID 12021380


Processing Batches:  34%|███▍      | 48/140 [1:48:00<3:46:18, 147.60s/it]

Job 48/140 completed.
Submitted Job 49/140 with Query ID 12021381


Processing Batches:  35%|███▌      | 49/140 [1:49:36<3:20:36, 132.27s/it]

Unexpected error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021381.json. Skipping Job 49/140.
Submitted Job 50/140 with Query ID 12021382


Processing Batches:  36%|███▌      | 50/140 [1:51:57<3:22:20, 134.90s/it]

Job 50/140 completed.
Submitted Job 51/140 with Query ID 12021383


Processing Batches:  36%|███▋      | 51/140 [1:53:50<3:10:25, 128.38s/it]

Unexpected error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021383.json. Skipping Job 51/140.
Submitted Job 52/140 with Query ID 12021384


Processing Batches:  37%|███▋      | 52/140 [1:55:50<3:04:18, 125.66s/it]

Job 52/140 completed.
Submitted Job 53/140 with Query ID 12021386


Processing Batches:  38%|███▊      | 53/140 [1:58:36<3:19:45, 137.76s/it]

Job 53/140 completed.
Submitted Job 54/140 with Query ID 12021388


Processing Batches:  39%|███▊      | 54/140 [2:00:53<3:17:04, 137.49s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021388.json. Skipping Job 54/140.
Submitted Job 55/140 with Query ID 12021389


Processing Batches:  39%|███▉      | 55/140 [2:03:27<3:21:50, 142.48s/it]

Job 55/140 completed.
Submitted Job 56/140 with Query ID 12021390


Processing Batches:  40%|████      | 56/140 [2:05:40<3:15:45, 139.82s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021390.json. Skipping Job 56/140.
Submitted Job 57/140 with Query ID 12021391


Processing Batches:  41%|████      | 57/140 [2:07:15<2:54:51, 126.40s/it]

Unexpected error: 500 Server Error: Internal Server Error for url: http://classyfire.wishartlab.com/queries/12021391.json. Skipping Job 57/140.
Submitted Job 58/140 with Query ID 12021392


Processing Batches:  41%|████▏     | 58/140 [2:09:18<2:51:11, 125.26s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021392.json. Skipping Job 58/140.
Submitted Job 59/140 with Query ID 12021393


Processing Batches:  42%|████▏     | 59/140 [2:11:20<2:47:43, 124.24s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021393.json. Skipping Job 59/140.
Submitted Job 60/140 with Query ID 12021394


Processing Batches:  43%|████▎     | 60/140 [2:13:23<2:45:18, 123.98s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021394.json. Skipping Job 60/140.
Submitted Job 61/140 with Query ID 12021395


Processing Batches:  44%|████▎     | 61/140 [2:15:28<2:43:37, 124.27s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021395.json. Skipping Job 61/140.
Submitted Job 62/140 with Query ID 12021396


Processing Batches:  44%|████▍     | 62/140 [2:17:33<2:41:50, 124.49s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021396.json. Skipping Job 62/140.
Submitted Job 63/140 with Query ID 12021397


Processing Batches:  45%|████▌     | 63/140 [2:19:35<2:38:52, 123.80s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021397.json. Skipping Job 63/140.
Submitted Job 64/140 with Query ID 12021398


Processing Batches:  46%|████▌     | 64/140 [2:22:09<2:48:20, 132.90s/it]

Job 64/140 completed.
Submitted Job 65/140 with Query ID 12021399


Processing Batches:  46%|████▋     | 65/140 [2:24:23<2:46:18, 133.05s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021399.json. Skipping Job 65/140.
Submitted Job 66/140 with Query ID 12021400


Processing Batches:  47%|████▋     | 66/140 [2:26:53<2:50:21, 138.12s/it]

Job 66/140 completed.
Submitted Job 67/140 with Query ID 12021401


Processing Batches:  48%|████▊     | 67/140 [2:29:31<2:55:14, 144.04s/it]

Job 67/140 completed.
Submitted Job 68/140 with Query ID 12021402


Processing Batches:  49%|████▊     | 68/140 [2:32:12<2:59:06, 149.25s/it]

Job 68/140 completed.
Submitted Job 69/140 with Query ID 12021403


Processing Batches:  49%|████▉     | 69/140 [2:34:25<2:50:56, 144.46s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021403.json. Skipping Job 69/140.
Submitted Job 70/140 with Query ID 12021404


Processing Batches:  50%|█████     | 70/140 [2:36:33<2:42:44, 139.49s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021404.json. Skipping Job 70/140.
Submitted Job 71/140 with Query ID 12021405


Processing Batches:  51%|█████     | 71/140 [2:38:38<2:35:20, 135.08s/it]

Unexpected error: 504 Server Error: Gateway Time-out for url: http://classyfire.wishartlab.com/queries/12021405.json. Skipping Job 71/140.


Processing Batches:  51%|█████▏    | 72/140 [2:39:51<2:12:04, 116.54s/it]

Unexpected error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Skipping Job 72/140.
Submitted Job 73/140 with Query ID 12021406


KeyboardInterrupt: 
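
In the log above, every 504 and 500 response is reported as "Unexpected error ... Skipping", which is the message of the generic `except Exception` branch. The `max_retries` and `retry_delay` path under `urllib.error.HTTPError` therefore never ran, and each failed batch was dropped after a single attempt. One plausible cause is that the HTTP client raises a different exception type; that is an assumption, since the `Job` implementation is not shown here. The sketch below is a generic retry wrapper with exponential backoff that takes the retriable exception types as a parameter. `call_with_retries` and the simulated `flaky` function are hypothetical and only illustrate the pattern.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=10,
                      retriable=(Exception,), sleep=time.sleep):
    """Call func(); on a retriable error wait and try again, doubling the delay."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retriable:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller decide what to do
            sleep(retry_delay * (2 ** attempt))


# Simulated call that fails twice and then succeeds
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated gateway time-out")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=3, retry_delay=10,
                           retriable=(ConnectionError,), sleep=delays.append)
print(result, delays)  # ok [10, 20]
```

Passing the exception types explicitly keeps the wrapper independent of the HTTP library in use; errors outside `retriable` propagate immediately.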

In [None]:
def merge_intermediate_results(intermediate_files):
    """
    Merges multiple intermediate JSON result files into a single dictionary.

    Parameters:
    - intermediate_files (list): List of file paths to intermediate JSON files.

    Returns:
    - merged_results (dict): Merged classification results.
    """
    merged_results = {}
    for file in intermediate_files:
        try:
            with open(file, 'r') as f:
                data = json.load(f)
                merged_results.update(data)
            logging.info(f"Successfully merged results from {file}")
            print(f"Successfully merged results from {file}")
        except Exception as e:
            logging.error(f"Error merging results from {file}: {e}")
            print(f"Error merging results from {file}: {e}")
    return merged_results

In [None]:
# Merge all intermediate results
merged_results = merge_intermediate_results(intermediate_files)

# Display the number of classified SMILES
classified_count = len(merged_results)
print(f"Total number of classified SMILES: {classified_count}")

In [None]:
# Convert the merged results dictionary to a DataFrame
# The dictionary keys are canonical SMILES, and values are classification details
results_df = pd.DataFrame.from_dict(merged_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)

# Display the first few entries of the results
results_df.head()

In [None]:
# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

# Display the merged DataFrame
annotated_df.head()

In [None]:
# Check for any SMILES that did not receive a classification
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Optionally, handle unclassified SMILES (e.g., mark as 'Unknown')
level_cols = ['superclass', 'class', 'subclass']
annotated_df[level_cols] = annotated_df[level_cols].fillna('Unknown')

In [None]:
# Define the output path for the annotated data
final_output_path = '/Users/macbook/CODE/PyClassyFire/data/classified_smiles.tsv'

# Save the annotated DataFrame to a TSV file
annotated_df.to_csv(final_output_path, sep='\t', index=False)

print(f"Annotated data has been saved to {final_output_path}")

In [7]:
# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()

# Define the batch size (number of SMILES per job)
batch_size = 100  # Adjust based on API limitations and performance

# Create batches using the chunk_tasks utility function
batches = chunk_tasks(canonical_smiles_list, batch_size)

# Create Job instances for each batch
jobs = [Job(batch) for batch in batches]

print(f"Total number of jobs created: {len(jobs)}")

Total number of jobs created: 140
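
The job count follows directly from the batch size: 13,984 SMILES split into batches of 100 gives 139 full batches plus one final batch of 84. The sketch below shows one way such a chunking helper could work; it is a stand-in written for illustration, not the actual `chunk_tasks` implementation from the package.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return [items[i:i + size] for i in range(0, len(items), size)]


batches = chunk(list(range(13984)), 100)
print(len(batches), len(batches[-1]))  # 140 84
```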


In [8]:
# Initialize the Scheduler with the list of jobs
scheduler = Scheduler(jobs)

In [9]:
# Start the classification process
print("Submitting classification jobs to the ClassyFire API...")
scheduler.run()
print("All jobs have been processed.")

Submitting classification jobs to the ClassyFire API...


  4%|▍         | 6/140 [14:58<5:26:44, 146.30s/it]

KeyboardInterrupt: 

In [None]:
# Export the results from the Scheduler
classification_results = scheduler.export()

# Convert the results dictionary to a DataFrame
# The dictionary keys are canonical SMILES, and values are classification details
results_df = pd.DataFrame.from_dict(classification_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)

# Display the first few entries of the results
results_df.head()

In [None]:
# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

# Display the merged DataFrame
annotated_df.head()

In [None]:
# Check for any SMILES that did not receive a classification
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Optionally, handle unclassified SMILES (e.g., mark as 'Unknown')
level_cols = ['superclass', 'class', 'subclass']
annotated_df[level_cols] = annotated_df[level_cols].fillna('Unknown')

In [None]:
# Define the output path for the annotated data
output_file_path = '../data/classified_smiles.tsv'

# Save the annotated DataFrame to a TSV file
annotated_df.to_csv(output_file_path, sep='\t', index=False)

print(f"Annotated data has been saved to {output_file_path}")