# PyClassyFire Tutorial: Classifying Chemical Compounds Using the ClassyFire API


## Introduction

Welcome to the **PyClassyFire** tutorial! This guide will walk you through the process of classifying a large set of chemical compounds using the [ClassyFire](http://classyfire.wishartlab.com/) API.

By the end of this tutorial, you'll be able to:

1. **Preprocess your SMILES data**: Prepare your unique SMILES strings for classification.
2. **Submit classification jobs**: Use the `PyClassyFire` package to send your data to the ClassyFire API.
3. **Retrieve and process results**: Collect the classification results and merge them with your original data.
4. **Save the annotated data**: Store the enriched dataset.

## Prerequisites

Before diving into the tutorial, ensure you have the following:

- **Conda Environment**: A Conda environment named `pyclassyfire_env` with all necessary dependencies installed.
- **PyClassyFire Package**: Installed and accessible within your Conda environment.
- **Unique SMILES Data**: A file containing SMILES strings. An example is provided in `sample_data/sample_smiles.tsv`.

**Note:** This tutorial assumes that the Conda environment is already set up. If not, please refer to the [repository's README](https://github.com/Jozefov/PyClassyFire) for setup instructions.
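A quick way to confirm the prerequisites from inside the activated environment is to check that the required packages can be located. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper written for this illustration, and the module names are simply the ones imported later in this tutorial.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


# Module names taken from the import statements used in this tutorial.
for name in ("pandas", "pyclassyfire"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either package is reported as missing, the environment is probably not activated or the installation step from the README was skipped.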

In [1]:
import os
import pandas as pd
import json


from pyclassyfire.src.utils import MoleCule, load_existing_results

Before classification, you need to prepare your SMILES data. This involves loading the data from a TSV file and ensuring it’s clean and ready for processing.

In [2]:
# Define paths
smiles_file_path = '../sample_data/sample_smiles.tsv'
output_dir = '../sample_data/output'
intermediate_files_path = '../sample_data/output/intermediate_files'
final_output_path = '../sample_data/output/output.json'

# Load SMILES data
# The TSV file may or may not have a header. We load it without assuming a header.
smiles_df = pd.read_csv(smiles_file_path, sep='\t', header=None, names=['SMILES']).dropna()

# Check if the first row is a header (i.e., 'SMILES') and skip it if so
if smiles_df.iloc[0]['SMILES'].strip().upper() == 'SMILES':
    print("Header detected. Skipping the first row.")
    smiles_df = smiles_df.iloc[1:].reset_index(drop=True)
else:
    print("No header detected. Proceeding with all SMILES.")
    
# Display the first few entries
smiles_df.head()

Header detected. Skipping the first row.


| | SMILES |
|---|---|
| 0 | `COC1=C(C=C(C=C1)Br)CNC23CC4CC(C2)CC(C4)C3` |
| 1 | `CCOC1=CC=C(C=C1)S(=O)(=O)[C@H]2CS(=O)(=O)C[C@@...` |
| 2 | `CC1=C(C(N2C(=O)CCSC2=N1)C3=C(C=C(C=C3)Cl)Cl)C(...` |
| 3 | `C1=CC(=C(C=C1COC(=O)C2=CC(=C(C=C2)O)O)O)O[C@H]...` |
| 4 | `C[C@@H]1CCC[C@@H](N1CCCC(C2=CC=CC=C2)(C3=CC=CC...` |


	
•	**smiles_file_path:** Path to your input TSV file containing SMILES strings.

•	**output_dir:** Directory where final results will be stored.

•	**intermediate_files_path:** Directory where intermediate results will be stored.

•	**final_output_path:** Path to the final JSON file containing classification results.

•	**Loading Data:** Reads the TSV file into a pandas DataFrame and drops any rows with missing values.
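The tutorial does not state whether PyClassyFire creates the output directories on its own, so it can be worth creating them up front. The sketch below shows one defensive way to do this with `pathlib`; `ensure_output_dirs` is a hypothetical helper, and the demonstration runs in a temporary directory rather than in `../sample_data/output`.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(output_dir, intermediate_dir):
    """Create both directories (and any missing parents) if they do not exist yet."""
    created = []
    for directory in (output_dir, intermediate_dir):
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demonstrated in a temporary location so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "output"
    dirs = ensure_output_dirs(out, out / "intermediate_files")
    all_exist = all(d.is_dir() for d in dirs)
    print(all_exist)
```

In the notebook itself, the equivalent call would be `ensure_output_dirs(output_dir, intermediate_files_path)`.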

## **4. Canonicalizing SMILES**
Canonicalization converts each SMILES string to a single standard form, so that the same molecule written in different ways is recognized as one compound.

In [3]:
# Canonicalize SMILES
smiles_df['Canonical_SMILES'] = smiles_df['SMILES'].apply(
    lambda x: MoleCule.from_smiles(x).canonical_smiles if x else None
)

# Count invalid SMILES after canonicalization
invalid_smiles = smiles_df['Canonical_SMILES'].isnull().sum()
print(f"Number of invalid SMILES after canonicalization: {invalid_smiles}")

# Remove invalid entries
if invalid_smiles > 0:
    smiles_df = smiles_df.dropna(subset=['Canonical_SMILES'])
    print(f"Removed {invalid_smiles} invalid SMILES entries.")

# Reset index after cleaning
smiles_df.reset_index(drop=True, inplace=True)

Number of invalid SMILES after canonicalization: 0


•	**Canonicalization:** Uses the MoleCule class to convert each SMILES string to its canonical form.

•	**Invalid Entries:** Counts and removes any SMILES strings that couldn’t be canonicalized.

•	**Cleaning:** Ensures that only valid SMILES strings are retained for further processing. Duplicates are removed in the next step.
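It is not stated here whether `MoleCule.from_smiles` returns `None` or raises an exception for unparsable input. A defensive wrapper along the following lines handles both cases by mapping any failure to `None`, which the `dropna` step then removes. This is only a sketch: `safe_canonicalize` and `toy_canonicalizer` are hypothetical names, and the toy function stands in for a real canonicalizer purely to exercise the wrapper.

```python
from typing import Callable, Optional


def safe_canonicalize(
    smiles: object, canonicalize: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the canonical form, or None if the input is empty or cannot be parsed."""
    if not isinstance(smiles, str) or not smiles.strip():
        return None
    try:
        return canonicalize(smiles.strip()) or None
    except Exception:
        # Treat any parser error as "invalid SMILES" so it can be filtered out later.
        return None


# Hypothetical stand-in for a real canonicalizer: it rejects strings that contain
# spaces and otherwise returns the input unchanged.
def toy_canonicalizer(smiles: str) -> str:
    if " " in smiles:
        raise ValueError(f"cannot parse {smiles!r}")
    return smiles


inputs = ["CCO", "  CCO ", "C C O", "", None]
results = [safe_canonicalize(s, toy_canonicalizer) for s in inputs]
print(results)  # ['CCO', 'CCO', None, None, None]
```

In the notebook, the second argument could be `lambda s: MoleCule.from_smiles(s).canonical_smiles`, which is the same call used in the cell above.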

## **5. Generating SMILES Mapping**

Creating a mapping from original to canonical SMILES helps in tracking and verifying the classification results.

In [4]:
from pyclassyfire.src.utils import save_smiles_mapping

# Extract the list of canonical SMILES
canonical_smiles_list = smiles_df['Canonical_SMILES'].tolist()
canonical_smiles_list = list(set(canonical_smiles_list))

# Extract the list of original SMILES
original_smiles_list = smiles_df['SMILES'].tolist()
original_smiles_list = list(set(original_smiles_list))

# Save the mapping from original to canonical SMILES
save_smiles_mapping(original_smiles_list, output_dir)

Successfully saved SMILES mapping to ../sample_data/output/mapping.json.



•	**Deduplication:** Removes duplicate SMILES strings to optimize processing.

•	**Mapping:** Uses the save_smiles_mapping function to create and save a JSON file that maps each original SMILES to its canonical form.
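Because several original SMILES strings can share one canonical form, it is often convenient to invert the mapping when attaching results back to the input data. The sketch below assumes the mapping is a flat dictionary of the form `{original: canonical}`; the actual layout of `mapping.json` is not described in this tutorial, so check the file before relying on this. The example data and the helper `invert_mapping` are illustrative.

```python
from collections import defaultdict

# Illustrative data: two spellings of ethanol and one of methane.
original_to_canonical = {
    "OCC": "CCO",
    "CCO": "CCO",
    "C": "C",
}


def invert_mapping(mapping):
    """Group original SMILES by the canonical SMILES they map to."""
    grouped = defaultdict(list)
    for original, canonical in mapping.items():
        grouped[canonical].append(original)
    return dict(grouped)


canonical_to_originals = invert_mapping(original_to_canonical)
print(canonical_to_originals)  # {'CCO': ['OCC', 'CCO'], 'C': ['C']}
```

With the inverted mapping, one classification result for a canonical SMILES can be copied to every original string that produced it.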

## **6. Processing Batches for Classification**

Classification is performed in batches to efficiently handle large datasets and manage API interactions.

In [5]:
from pyclassyfire.src.batch import process_batches_with_saving_and_retry

# Define parameters
batch_size = 100          # Number of SMILES per job
max_retries = 1           # Maximum number of retries for failed batches
retry_delay = 10         # Delay between retries in seconds

# Process the batches with resumption and retry logic
intermediate_files = process_batches_with_saving_and_retry(
    smiles_list=canonical_smiles_list,
    batch_size=batch_size,
    output_dir=intermediate_files_path,
    max_retries=max_retries,
    retry_delay=retry_delay
)

2025-01-11 07:11:17,634 - INFO - All SMILES: 10
2025-01-11 07:11:17,635 - INFO - Loaded results from ../sample_data/output/intermediate_files/intermediate_1.json
2025-01-11 07:11:17,636 - INFO - Already processed SMILES: 10
2025-01-11 07:11:17,636 - INFO - Remaining SMILES to process: 0
2025-01-11 07:11:17,636 - INFO - Remaining unique SMILES to process after removing duplicates: 0
2025-01-11 07:11:17,637 - INFO - Total remaining batches to process: 0


All batches have already been processed.



•	**Batch Size:** Determines how many SMILES strings are processed in each API call.

•	**Retry Logic:** If a batch fails, the function will retry processing it up to max_retries times with a delay of retry_delay seconds between attempts.

•	**Function:** process_batches_with_saving_and_retry handles the classification process, saves intermediate results, and resumes from previously saved results when it is run again.
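To make the roles of `batch_size`, `max_retries`, and `retry_delay` concrete, here is a minimal sketch of batching with retries. It is not PyClassyFire's implementation: `chunked`, `run_with_retry`, and `flaky_classify` are hypothetical names, and the sketch assumes that `max_retries` counts additional attempts after the first one.

```python
import time


def chunked(items, batch_size):
    """Yield consecutive slices containing at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_with_retry(func, batch, max_retries=1, retry_delay=10):
    """Call func(batch); on failure, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return func(batch)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(retry_delay)


# Hypothetical classifier that fails on its first call to exercise the retry path.
calls = {"count": 0}


def flaky_classify(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated transient API error")
    return {smiles: "classified" for smiles in batch}


batches = list(chunked(["C", "CC", "CCC", "CCCC", "CCCCC"], batch_size=2))
first_result = run_with_retry(flaky_classify, batches[0], max_retries=1, retry_delay=0)
print(len(batches), first_result)
```

Five SMILES with a batch size of two give three batches, the last one holding a single item. The first batch succeeds on its second attempt.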

**Hint:**

•	**Multiple Runs**: If some batches still fail after all retries, run process_batches_with_saving_and_retry again. SMILES that were already processed are skipped, so each rerun only handles what remains.
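The idea of rerunning until nothing is left can be sketched as a bounded loop. This is an illustration of the pattern only: `classify_until_complete` and `partial_processor` are hypothetical, and the processor simulates a run in which only part of the remaining input succeeds.

```python
def classify_until_complete(smiles_list, process_once, max_passes=3):
    """Repeat process_once on whatever is still unclassified, up to max_passes times."""
    results = {}
    for _ in range(max_passes):
        remaining = [s for s in smiles_list if s not in results]
        if not remaining:
            break
        results.update(process_once(remaining))
    missing = [s for s in smiles_list if s not in results]
    return results, missing


# Hypothetical processor that only manages to classify two items per pass.
def partial_processor(remaining):
    return {s: "classified" for s in remaining[:2]}


smiles = ["C", "CC", "CCC", "CCCC", "CCCCC"]
results, missing = classify_until_complete(smiles, partial_processor)
print(len(results), missing)  # 5 []
```

Capping the number of passes avoids looping forever on SMILES that can never be classified; whatever is left in `missing` needs manual inspection.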

## **7. Merging Intermediate Results**

After processing all batches, the intermediate JSON files need to be merged into a single result file.

In [6]:
from pyclassyfire.src.utils import merge_intermediate_files

# Merge the intermediate files into the final JSON
merge_intermediate_files(intermediate_files_path, final_output_path)

Successfully merged 1 files into ../sample_data/output/output.json.


•	**Function:** merge_intermediate_files consolidates all intermediate JSON files into a single output.json file.
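Conceptually, merging amounts to reading each intermediate file and combining the contents. The sketch below assumes that every intermediate file holds a JSON object keyed by SMILES, which is not confirmed by this tutorial; `merge_json_files` is a hypothetical function, not the library's `merge_intermediate_files`. The file name pattern follows the `intermediate_1.json` name seen in the log output.

```python
import json
import tempfile
from pathlib import Path


def merge_json_files(input_dir, output_path, pattern="intermediate_*.json"):
    """Merge every matching JSON object file into one dict and write it to output_path."""
    merged = {}
    files = sorted(Path(input_dir).glob(pattern))
    for file in files:
        with open(file, encoding="utf-8") as handle:
            merged.update(json.load(handle))  # later files win on duplicate keys
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=2)
    return len(files), merged


# Demonstration with two small made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "intermediate_1.json").write_text(
        json.dumps({"C": {"label": "a"}}), encoding="utf-8"
    )
    (tmp_path / "intermediate_2.json").write_text(
        json.dumps({"CC": {"label": "b"}}), encoding="utf-8"
    )
    n_files, merged = merge_json_files(tmp_path, tmp_path / "output.json")
    print(n_files, sorted(merged))  # 2 ['C', 'CC']
```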

## **8. Evaluating Classification Results**

Once classification is complete, verify that every submitted SMILES string has an entry in the final output.

In [7]:
from pyclassyfire.src.utils import check_all_smiles_present

# Check if all SMILES are present in the final output
missing_smiles_mapping = check_all_smiles_present(final_output_path, canonical_smiles_list)

# Display summary of missing SMILES
if missing_smiles_mapping:
    print(f"Number of missing SMILES: {len(missing_smiles_mapping)}")
    print("Check 'missing_smiles.json' for details.")
else:
    print("All SMILES have been successfully processed.")

All SMILES are present in the final output.
All SMILES have been successfully processed.


•	**Function:** check_all_smiles_present verifies that every canonical SMILES string has a corresponding entry in the final classification results.

•	**Missing SMILES:** If any SMILES strings are missing, they are detailed in a missing_smiles.json file for further investigation.
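The completeness check boils down to a set difference between the SMILES that were submitted and the SMILES that came back. The following sketch assumes the final results can be viewed as a dictionary keyed by canonical SMILES; the real structure of `output.json` may differ, and `find_missing` is a hypothetical helper rather than part of PyClassyFire.

```python
def find_missing(expected_smiles, classified):
    """Return expected SMILES with no entry in classified, deduplicated, in input order."""
    present = set(classified)
    return [s for s in dict.fromkeys(expected_smiles) if s not in present]


# Made-up example: "CC" was submitted (twice) but never classified.
expected = ["C", "CC", "CCC", "CC"]
classified = {"C": {}, "CCC": {}}
missing_smiles = find_missing(expected, classified)
print(missing_smiles)  # ['CC']
```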

**Hint:**

•	**Investigate Missing SMILES:** If missing_smiles.json is generated, review it to understand which SMILES strings weren’t classified and why. This can help in troubleshooting and ensuring complete data processing.