# PyClassyFire Tutorial: Classifying Chemical Compounds Using the ClassyFire API


## Introduction

Welcome to the **PyClassyFire** tutorial! This guide walks you through classifying a large set of chemical compounds with the [ClassyFire](http://classyfire.wishartlab.com/) API. We'll use the `PyClassyFire` package, which provides both a command-line interface (CLI) and programmatic access to the ClassyFire service, so that many chemical structures can be submitted and classified in batches.

By the end of this tutorial, you'll be able to:

1. **Preprocess your SMILES data**: Prepare your unique SMILES strings for classification.
2. **Submit classification jobs**: Use the `PyClassyFire` package to send your data to the ClassyFire API.
3. **Retrieve and process results**: Collect the classification results and merge them with your original data.
4. **Save the annotated data**: Store the enriched dataset for further analysis.

Let's get started!

## Prerequisites

Before diving into the tutorial, ensure you have the following:

- **Conda Environment**: A Conda environment named `classyfire_env` with all necessary dependencies installed.
- **PyClassyFire Package**: Installed and accessible within your Conda environment.
- **Unique SMILES Data**: A headerless TSV file with one SMILES string per line (the run shown in this tutorial contains 13,984 unique strings), read from `../data/unique_valid_smiles_no_header.tsv` relative to the notebook.

> **Note:** This tutorial assumes that the Conda environment and `PyClassyFire` package are already set up. If not, please refer to the [repository's README](https://github.com/yourusername/PyClassyFire) for setup instructions.

## Table of Contents

1. [Importing Necessary Libraries](#importing-libraries)
2. [Loading and Exploring the Data](#loading-data)
3. [Preparing the SMILES Data for Classification](#preparing-data)
4. [Submitting Classification Jobs to ClassyFire API](#submitting-jobs)
5. [Monitoring Job Progress](#monitoring-progress)
6. [Retrieving and Processing Results](#retrieving-results)
7. [Saving the Annotated Data](#saving-data)
8. [Conclusion](#conclusion)


In [1]:
import os
import pandas as pd
import json
import time

# Import classes and functions from PyClassyFire
from classyfire_cli.src.batch import Job, Scheduler
from classyfire_cli.src.utils import MoleCule, chunk_tasks

In [2]:
# Define the path to the unique SMILES TSV file
smiles_file_path = '../data/unique_valid_smiles_no_header.tsv'

# Load the SMILES data into a pandas DataFrame
# Since the TSV is headless, we'll assign a column name 'SMILES'
smiles_df = pd.read_csv(smiles_file_path, sep='\t', header=None, names=['SMILES'])

# Display the first few entries
smiles_df.head()

|   | SMILES |
|---|--------|
| 0 | `COC1=C(C=CC(=C1)C(=O)O)O[C@H]2[C@@H]([C@H]([C@...` |
| 1 | `CCCCC(=O)O[C@H](CC(=O)O)C[N+](C)(C)C` |
| 2 | `CCN1C=C(C(=N1)C(=O)N)NC(=S)NC2=C(C=CC(=C2)Cl)OC` |
| 3 | `COC1=C(C=CC(=C1)/C=N\NC2=CC=CC=C2C(=O)O)OCC3=C...` |
| 4 | `CCOC(=O)CSC1=NN=C(N1C)C2=CN(N=C2OC)C` |


In [3]:
# Check the number of unique SMILES
unique_smiles_count = smiles_df['SMILES'].nunique()
print(f"Total unique SMILES in the dataset: {unique_smiles_count}")

Total unique SMILES in the dataset: 13984
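
If the input file were not already deduplicated, the unique count would be lower than the number of rows. The following is a minimal plain-Python sketch of order-preserving deduplication. The helper name `dedupe_preserving_order` and the sample strings are illustrative assumptions, not part of PyClassyFire or the dataset.

```python
def dedupe_preserving_order(smiles_list):
    """Return the unique strings in first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so they act as an ordered set
    return list(dict.fromkeys(smiles_list))


# Illustrative SMILES only; these are not taken from the tutorial dataset
sample = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]
unique = dedupe_preserving_order(sample)

print(f"{len(sample)} rows, {len(unique)} unique")
```

This compares strings exactly. Two different SMILES strings can describe the same molecule, which is why canonicalization is still worth doing before classification.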


In [None]:
# Function to convert SMILES to canonical SMILES using RDKit
def canonicalize_smiles(smiles):
    try:
        molecule = MoleCule.from_smiles(smiles)
        return molecule.canonical_smiles
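    except Exception:
        # Treat any parsing failure as an invalid SMILES; a bare `except:`
        # would also swallow KeyboardInterrupt and SystemExit
        return None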

In [5]:
smiles_df['Canonical_SMILES'] = smiles_df['SMILES'].apply(canonicalize_smiles)

# Remove any entries that failed canonicalization
invalid_smiles = smiles_df['Canonical_SMILES'].isnull().sum()
print(f"Number of invalid SMILES after canonicalization: {invalid_smiles}")

if invalid_smiles > 0:
    smiles_df = smiles_df.dropna(subset=['Canonical_SMILES'])
    print(f"Removed {invalid_smiles} invalid SMILES entries.")

Number of invalid SMILES after canonicalization: 0


In [6]:
# Reset index after cleaning
smiles_df.reset_index(drop=True, inplace=True)

In [7]:
# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()

# Define the batch size (number of SMILES per job)
batch_size = 100  # Adjust based on API limitations and performance

# Create batches using the chunk_tasks utility function
batches = chunk_tasks(canonical_smiles_list, batch_size)

# Create Job instances for each batch
jobs = [Job(batch) for batch in batches]

print(f"Total number of jobs created: {len(jobs)}")

Total number of jobs created: 140
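
The job count follows directly from the batch size: 13,984 SMILES in batches of 100 gives 139 full batches plus one partial batch. The sketch below shows the arithmetic with a hypothetical `chunk_list` helper. It is an assumption about what a chunking utility typically does, not the actual implementation of `chunk_tasks`.

```python
import math


def chunk_list(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


n_smiles = 13984   # unique SMILES count reported earlier in this tutorial
batch_size = 100

# Integers stand in for SMILES strings; only the list length matters here
batches = chunk_list(list(range(n_smiles)), batch_size)

print(len(batches))                       # 140
print(len(batches[-1]))                   # 84 (the final, partial batch)
print(math.ceil(n_smiles / batch_size))   # 140, the same count without building the list
```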


In [8]:
# Initialize the Scheduler with the list of jobs
scheduler = Scheduler(jobs)

In [None]:
# Start the classification process
print("Submitting classification jobs to the ClassyFire API...")
scheduler.run()
print("All jobs have been processed.")

Submitting classification jobs to the ClassyFire API...


  1%|▏         | 2/140 [05:10<6:01:23, 157.13s/it]
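
The progress bar reports 2 of 140 jobs done at about 157 seconds per job, so the full run takes roughly six hours at that rate. The remaining-time figure can be reproduced with a short calculation; the variable names are illustrative.

```python
from datetime import timedelta

total_jobs = 140
completed = 2
seconds_per_job = 157.13  # rate shown in the progress bar

# Remaining time = jobs still to run x average seconds per job
remaining_seconds = (total_jobs - completed) * seconds_per_job
eta = timedelta(seconds=int(remaining_seconds))

print(eta)  # 6:01:23
```

Because the rate is an average over only two jobs, the estimate can shift considerably as more batches complete.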

In [None]:
# Export the results from the Scheduler
classification_results = scheduler.export()

# Convert the results dictionary to a DataFrame
# The dictionary keys are canonical SMILES, and values are classification details
results_df = pd.DataFrame.from_dict(classification_results, orient='index')
results_df.reset_index(inplace=True)
results_df.rename(columns={'index': 'Canonical_SMILES'}, inplace=True)

# Display the first few entries of the results
results_df.head()
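
To make the reshaping step concrete, here is a plain-Python sketch of turning a dictionary keyed by canonical SMILES into one row per compound. The dictionary shape and the placeholder values are assumptions based on the comments above; the real output of `scheduler.export()` may contain more fields.

```python
# Assumed shape: {canonical_smiles: {field_name: value, ...}}
# The classification values are placeholders, not real ClassyFire output
example_results = {
    "CCO": {"superclass": "superclass A", "class": "class A1"},
    "c1ccccc1": {"superclass": "superclass B", "class": "class B1"},
}

# Move each key into a 'Canonical_SMILES' field, mirroring
# reset_index() followed by rename(columns={'index': 'Canonical_SMILES'})
rows = [
    {"Canonical_SMILES": smiles, **details}
    for smiles, details in example_results.items()
]

for row in rows:
    print(row)
```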

In [None]:
# Merge the classification results with the original SMILES DataFrame
annotated_df = pd.merge(smiles_df, results_df, on='Canonical_SMILES', how='left')

# Display the merged DataFrame
annotated_df.head()
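
With `how='left'`, every row of `smiles_df` is kept, and compounds with no matching result get missing values in the classification columns. The plain-Python sketch below mimics that behavior on made-up data so the semantics are easy to inspect; `None` plays the role of `NaN`, and all names and values are illustrative.

```python
# Left-hand keys: every input compound (illustrative SMILES)
input_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]

# Right-hand table: results exist for only two of the three compounds
results_by_smiles = {
    "CCO": {"superclass": "superclass A"},
    "c1ccccc1": {"superclass": "superclass B"},
}

result_fields = ["superclass"]
merged_rows = []
for smiles in input_smiles:
    details = results_by_smiles.get(smiles, {})
    row = {"Canonical_SMILES": smiles}
    for field in result_fields:
        # A missing match yields None, like NaN after a left merge
        row[field] = details.get(field)
    merged_rows.append(row)

unmatched = sum(1 for row in merged_rows if row["superclass"] is None)
print(f"{len(merged_rows)} rows kept, {unmatched} without a classification")
```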

In [None]:
# Check for any SMILES that did not receive a classification
unclassified = annotated_df['superclass'].isnull().sum()
print(f"Number of SMILES without classification: {unclassified}")

# Optionally, handle unclassified SMILES (e.g., mark as 'Unknown')
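# Assign the filled columns back; calling fillna(inplace=True) on a selected
# column is chained assignment and may not modify annotated_df in recent pandas
fill_columns = ['superclass', 'class', 'subclass']
annotated_df[fill_columns] = annotated_df[fill_columns].fillna('Unknown')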

In [None]:
# Define the output path for the annotated data
output_file_path = '../data/classified_smiles.tsv'

# Save the annotated DataFrame to a TSV file
annotated_df.to_csv(output_file_path, sep='\t', index=False)

print(f"Annotated data has been saved to {output_file_path}")
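
A quick sanity check after saving is to read the file back and confirm that the rows and columns survived the round trip. The sketch below uses the standard `csv` module with an in-memory buffer and made-up rows, so it runs without the dataset. The column names follow the ones used above, and the values are placeholders.

```python
import csv
import io

# Placeholder rows standing in for the annotated data
rows = [
    {"SMILES": "OCC", "Canonical_SMILES": "CCO", "superclass": "superclass A"},
    {"SMILES": "C1=CC=CC=C1", "Canonical_SMILES": "c1ccccc1", "superclass": "Unknown"},
]
fieldnames = ["SMILES", "Canonical_SMILES", "superclass"]

# Write tab-separated text with a header row, as to_csv(sep='\t', index=False) does
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

# Read it back and compare against what was written
buffer.seek(0)
read_back = list(csv.DictReader(buffer, delimiter="\t"))

print(len(read_back), "rows read back")
print(read_back == rows)
```

For the real file, the same check applies with `open(output_file_path, newline='')` in place of the buffer.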