# Bioactivity Data Curation

## Overview

This notebook performs systematic bioactivity data curation for subsequent QSAR modelling. Raw activity data are filtered, standardised, and validated to ensure consistency and suitability for quantitative analysis.

The curation process removes incomplete records, retains only exact IC50 measurements, removes duplicate structures, and converts activity values to pIC50. Each step reports how many records it removes, which keeps the process transparent and reproducible.

The resulting dataset pairs each retained structure with one positive numeric IC50 value and its pIC50, ready for descriptor generation and predictive modelling.
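
The before/after counts that each step prints could be factored into one small helper. The following is a minimal sketch and not part of the original notebook; the function name `report_step` and its return value are assumptions introduced for illustration.

```python
def report_step(step_name, before, after):
    """Print and return how many records a curation step removed."""
    if after > before:
        raise ValueError(
            f"{step_name}: record count increased from {before} to {after}"
        )
    removed = before - after
    print(f"{step_name}: {before} -> {after} records ({removed} removed)")
    return removed


# Toy usage with made-up counts
removed = report_step("Missing-value filter", before=120, after=97)
```

A filter should never add rows, so the helper treats an increased count as an error rather than reporting a negative removal.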

**Step 1: Import Required Libraries**

This section imports the core libraries required for data handling and retrieval of bioactivity records from the ChEMBL database.  
All dependencies must be defined in the repository’s `requirements.txt` file to ensure reproducible execution.
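
One way to confirm that the environment matches `requirements.txt` is to print the installed version of each dependency before running the rest of the notebook. This is a sketch under the assumption that the list `REQUIRED_PACKAGES` (a hypothetical name) mirrors the requirements file; it only reports versions and does not install anything.

```python
from importlib import metadata

# Distribution names assumed to be listed in requirements.txt
REQUIRED_PACKAGES = ["pandas", "numpy", "chembl_webresource_client"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(REQUIRED_PACKAGES)
for name, version in versions.items():
    print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```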

In [None]:
# Core data handling libraries
import pandas as pd
import numpy as np

# ChEMBL API client (requires 'chembl_webresource_client' in requirements.txt)
from chembl_webresource_client.new_client import new_client

**Step 2: Query ChEMBL for Target Information**

In [None]:
# Define target keyword (modifiable for reuse)
TARGET_KEYWORD = "KEAP1"

# Query ChEMBL for matching targets
target_client = new_client.target
target_query = target_client.search(TARGET_KEYWORD)

targets_df = pd.DataFrame.from_dict(target_query)

# Validate results
if targets_df.empty:
    raise ValueError(f"No targets found for keyword: {TARGET_KEYWORD}")

print(f"Number of targets retrieved for '{TARGET_KEYWORD}': {len(targets_df)}")

targets_df.head()

**Step 3: Select the Biologically Relevant Target**

The target is selected after filtering for biologically relevant criteria (e.g., organism and target type).  
Exactly one valid entry is expected. If multiple entries are found, manual inspection is required to avoid ambiguous target selection.

In [None]:
# Filter for Homo sapiens single-protein targets
filtered_targets = targets_df[
    (targets_df["organism"] == "Homo sapiens") &
    (targets_df["target_type"] == "SINGLE PROTEIN")
]

if filtered_targets.empty:
    raise ValueError("No matching Homo sapiens single-protein targets found.")

if len(filtered_targets) > 1:
    raise ValueError(
        "Multiple biologically relevant targets found. "
        "Please inspect filtered_targets manually to select the correct entry."
    )

# Exactly one expected entry
selected_target_id = filtered_targets.iloc[0]["target_chembl_id"]

print(f"Selected Target ChEMBL ID: {selected_target_id}")

**Step 4: Retrieve IC50 Bioactivity Records**

Bioactivity data corresponding to the selected target are retrieved from ChEMBL.  
Only records whose standard type is IC50 are requested. The retrieved records are then inspected and curated in the following steps.

In [None]:
# Retrieve IC50 bioactivity records for selected target
activity_client = new_client.activity

activity_query = activity_client.filter(
    target_chembl_id=selected_target_id,
    standard_type="IC50"
)

df_ic50 = pd.DataFrame.from_dict(activity_query)

# Validate retrieval
if df_ic50.empty:
    raise ValueError("No IC50 bioactivity records retrieved for the selected target.")

print(f"Total IC50 records retrieved: {len(df_ic50)}")

df_ic50.head()

**Step 5: Filter Records with Valid IC50 Values and Structural Information**

Records lacking quantitative activity values or canonical SMILES representations are removed.  
This ensures that each retained entry contains both a valid numerical bioactivity measurement and a defined molecular structure.
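
The effect of this filter can be illustrated on a toy table. The sketch below uses made-up records in place of ChEMBL data and also counts missing entries per column, which shows which field is responsible for the removals; the per-column count is an addition for illustration, not something the notebook itself reports.

```python
import pandas as pd

# Toy records standing in for ChEMBL activity rows (all values are made up)
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "B", "C", "D"],
        "standard_value": ["120.0", None, "35.0", None],
        "canonical_smiles": ["CCO", "CCN", None, None],
    }
)

# Count missing entries in each field to see which one drives the removals
missing_per_column = toy[["standard_value", "canonical_smiles"]].isna().sum()
print(missing_per_column)

# Same condition as the notebook: keep rows where both fields are present
complete = toy[
    toy["standard_value"].notna() & toy["canonical_smiles"].notna()
].copy()
print(f"Kept {len(complete)} of {len(toy)} toy records")
```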

In [None]:
# Initial record count
initial_count = len(df_ic50)

# Retain records with defined IC50 values and canonical SMILES
df_filtered = df_ic50[
    df_ic50["standard_value"].notna() &
    df_ic50["canonical_smiles"].notna()
].copy()

final_count = len(df_filtered)
removed_count = initial_count - final_count

print(f"Initial IC50 records: {initial_count}")
print(f"Records after removing missing values: {final_count}")
print(f"Records removed: {removed_count}")

df_filtered.head()

**Step 6: Retain Exact Quantitative Measurements**

Only records with an exact activity relationship (`standard_relation` equal to `"="`) are retained.  
Entries with inequality operators (e.g., ">" or "<") or with no recorded relation are excluded, because censored values cannot be used directly as regression targets.
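
Before filtering, it can help to see how the relations are distributed. The sketch below runs on made-up records and is only an illustration of the idea; it counts each relation, including missing ones, and then applies the same equality test as the notebook.

```python
import pandas as pd

# Toy records (values are made up); "E" has no recorded relation
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "B", "C", "D", "E"],
        "standard_relation": ["=", ">", "=", "<", None],
        "standard_value": ["50", "10000", "7.5", "1", "300"],
    }
)

# Count each relation, keeping missing relations visible in the summary
relation_counts = toy["standard_relation"].value_counts(dropna=False)
print(relation_counts)

# Comparing a missing relation with "=" yields False, so those rows are dropped too
exact = toy[toy["standard_relation"] == "="].copy()
print(f"Exact measurements kept: {len(exact)} of {len(toy)}")
```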

In [None]:
# Record count before filtering
initial_count = len(df_filtered)

# Retain only exact IC50 measurements
df_exact = df_filtered[df_filtered["standard_relation"] == "="].copy()

final_count = len(df_exact)
removed_count = initial_count - final_count

print(f"Records before relation filtering: {initial_count}")
print(f"Records after retaining exact values: {final_count}")
print(f"Records removed (inequality or missing relation): {removed_count}")

df_exact.head()

**Step 7: Remove Duplicate Compounds Based on Canonical SMILES**

Duplicate entries are removed based on canonical SMILES so that each structure appears only once in the curated dataset.  
When several records share a structure, only the first one in the retrieved order is kept; the other measurements for that structure are discarded rather than averaged.
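
Keeping the first record makes the retained value depend on row order. The sketch below contrasts that behaviour with one possible alternative, taking the median per structure. The alternative is an assumption shown for comparison, not the method used in this notebook, and the toy values are already numeric, whereas the notebook converts `standard_value` to numbers only in Step 9.

```python
import pandas as pd

# Toy records (values are made up): structure "CCO" was measured three times
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["A", "A", "A", "B"],
        "canonical_smiles": ["CCO", "CCO", "CCO", "CCN"],
        "standard_value": [100.0, 300.0, 200.0, 50.0],
    }
)

# Notebook behaviour: keep the first row for each structure
first_kept = toy.drop_duplicates(subset="canonical_smiles", keep="first")

# Alternative (assumption): one row per structure holding the median value
median_agg = toy.groupby("canonical_smiles", as_index=False, sort=False).agg(
    molecule_chembl_id=("molecule_chembl_id", "first"),
    standard_value=("standard_value", "median"),
    n_measurements=("standard_value", "size"),
)

print(first_kept)
print(median_agg)
```

On this toy table the first-record rule keeps 100.0 for `CCO`, while the median of its three measurements is 200.0.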

In [None]:
# Record count before duplicate removal
initial_count = len(df_exact)

# Remove duplicate compounds based on canonical SMILES
df_unique = df_exact.drop_duplicates(subset="canonical_smiles", keep="first").copy()

final_count = len(df_unique)
removed_count = initial_count - final_count

print(f"Records before duplicate removal: {initial_count}")
print(f"Unique compounds retained: {final_count}")
print(f"Duplicate entries removed: {removed_count}")

df_unique.head()

**Step 8: Construct Modelling Dataset with Essential Fields**

Only the essential fields required for QSAR modelling are retained: compound identifier, canonical SMILES, and quantitative IC50 values.  
This creates a clean and focused dataset for subsequent standardisation and transformation steps.

In [None]:
# Define required columns
required_columns = ["molecule_chembl_id", "canonical_smiles", "standard_value"]

# Validate column availability
missing_columns = [col for col in required_columns if col not in df_unique.columns]
if missing_columns:
    raise ValueError(f"Missing required columns: {missing_columns}")

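# Keep only records reported in nM, since the IC50_nM column name and the
# pIC50 formula in Step 9 both assume nanomolar units
df_nm = df_unique[df_unique["standard_units"] == "nM"]
print(f"Records dropped for non-nM units: {len(df_unique) - len(df_nm)}")

# Select relevant columns
dataset = df_nm[required_columns].copy()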

# Rename columns for clarity
dataset.rename(
    columns={
        "molecule_chembl_id": "ChEMBL_ID",
        "canonical_smiles": "SMILES",
        "standard_value": "IC50_nM"
    },
    inplace=True
)

print(f"Final curated dataset size: {dataset.shape[0]} compounds")
print(f"Number of columns retained: {dataset.shape[1]}")

dataset.head()

**Step 9: Standardise IC50 Values and Compute pIC50**

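IC50 values are converted to numeric format and filtered to retain only valid positive measurements.  
pIC50 is the negative base-10 logarithm of the IC50 in molar units. With IC50 expressed in nanomolar (nM), as assumed here, this gives:

pIC50 = 9 − log10(IC50 in nM)

The logarithmic scale compresses IC50 values, which often span several orders of magnitude, into a narrow range in which higher values indicate greater potency, a form better suited to regression modelling.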
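
The constant 9 comes from the unit conversion 1 nM = 1e-9 mol/L. The short sketch below makes that step explicit for single values; the helper name `pic50_from_nm` is hypothetical and is not used elsewhere in the notebook.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_molar)


for ic50_nm in (1.0, 1000.0, 10000.0):
    print(f"IC50 = {ic50_nm:>8.1f} nM  ->  pIC50 = {pic50_from_nm(ic50_nm):.2f}")
```

An IC50 of 1 nM corresponds to a pIC50 of 9, 1 µM (1000 nM) to 6, and 10 µM to 5, so each tenfold loss in potency lowers pIC50 by one unit.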

In [None]:
# Record count before numeric conversion
initial_count = len(dataset)

# Ensure IC50 values are numeric
dataset["IC50_nM"] = pd.to_numeric(dataset["IC50_nM"], errors="coerce")

# Remove non-numeric and non-positive values
dataset = dataset[dataset["IC50_nM"] > 0].copy()

final_count = len(dataset)
removed_count = initial_count - final_count

print(f"Records before IC50 cleaning: {initial_count}")
print(f"Records after removing invalid IC50 values: {final_count}")
print(f"Records removed: {removed_count}")

# Compute pIC50 (assuming IC50 in nM)
dataset["pIC50"] = 9 - np.log10(dataset["IC50_nM"])

dataset.head()

**Step 10: Validate Final Column Structure for QSAR Modelling**

The column names assigned in Step 8, together with the pIC50 column added in Step 9, are checked against the expected final structure.  
The dataset should contain an identifier, a molecular representation, the raw activity value, and the response variable.
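
A check on column names could be extended to cover data types, missing values, and duplicate structures. The sketch below is one possible form of such a check, shown on toy data; the function name `check_dataset` and the specific rules are assumptions, not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_dataset(df):
    """Return a list of problems found in a curated dataset (empty if none)."""
    problems = []
    for col in ("ChEMBL_ID", "SMILES", "IC50_nM", "pIC50"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col in ("IC50_nM", "pIC50"):
        if col in df.columns and not is_numeric_dtype(df[col]):
            problems.append(f"non-numeric column: {col}")
    if df.isna().any().any():
        problems.append("dataset contains missing values")
    if "SMILES" in df.columns and df["SMILES"].duplicated().any():
        problems.append("duplicate SMILES present")
    return problems


# Toy datasets (values are made up)
good = pd.DataFrame(
    {
        "ChEMBL_ID": ["A", "B"],
        "SMILES": ["CCO", "CCN"],
        "IC50_nM": [100.0, 1000.0],
        "pIC50": [7.0, 6.0],
    }
)
bad = good.assign(IC50_nM=["100", "1000"]).drop(columns="pIC50")

print(check_dataset(good))
print(check_dataset(bad))
```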

In [None]:
# Expected final column structure
expected_columns = ["ChEMBL_ID", "SMILES", "IC50_nM", "pIC50"]

missing_columns = [col for col in expected_columns if col not in dataset.columns]

if missing_columns:
    raise ValueError(f"Missing expected columns: {missing_columns}")

print("Column structure validated successfully.")
print("Final dataset columns:")
print(dataset.columns.tolist())

dataset.head()

**Step 11: Export Curated Bioactivity Dataset**

The curated dataset is written as a CSV file to `../data/processed`, a path relative to the working directory of the notebook that is created if it does not exist.  
Downstream descriptor generation and modelling workflows can then read the data from this fixed location.
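
A quick way to confirm that an export is usable downstream is to read the file back and compare it with the in-memory table. The sketch below does this with a toy dataset in a temporary directory, so it does not touch the project folders; the file name `toy_curated.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy curated dataset (values are made up)
toy = pd.DataFrame(
    {
        "ChEMBL_ID": ["A", "B"],
        "SMILES": ["CCO", "CCN"],
        "IC50_nM": [100.0, 1000.0],
        "pIC50": [7.0, 6.0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "processed" / "toy_curated.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare structure and content of the reloaded table with the original
round_trip_ok = (
    list(reloaded.columns) == list(toy.columns)
    and len(reloaded) == len(toy)
    and reloaded["SMILES"].tolist() == toy["SMILES"].tolist()
)
print(f"Round trip preserved columns, row count, and SMILES: {round_trip_ok}")
```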

In [None]:
from pathlib import Path

# Define structured output directory
OUTPUT_DIR = Path("../data/processed")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Define output file (target-agnostic)
output_file = OUTPUT_DIR / "bioactivity_dataset_curated.csv"

# Save dataset
dataset.to_csv(output_file, index=False)

# Final summary
print("Curated bioactivity dataset successfully saved.")
print(f"Output path: {output_file}")
print(f"Total compounds in final dataset: {len(dataset)}")