# Automated Data Retrieval Pipeline for Kinase Inhibitors (ChEMBL)

## Project Overview
This notebook implements an automated pipeline to retrieve, clean, and aggregate bioactive molecule data from the [ChEMBL database](https://www.ebi.ac.uk/chembl/). 
The goal is to curate a dataset of inhibitors for the PI3K/AKT/mTOR signaling pathway—specifically targeting **PI3K (Alpha, Beta, Gamma, Delta)**, **AKT**, and **mTOR**.

The pipeline performs the following steps:
1.  **API Integration:** Connects to ChEMBL using the `chembl_webresource_client`.
2.  **Target Search:** Identifies specific human targets using keyword matching and ChEMBL IDs.
3.  **Data Extraction:** Retrieves bioactivity data and enriches it with physicochemical properties (e.g., SMILES, LogP, Molecular Weight).
4.  **Aggregation:** Merges data across multiple targets and removes duplicates.
5.  **Export:** Saves the final structured dataset for downstream Machine Learning tasks.

## 1. Environment Setup
First, we verify and install the necessary dependencies, specifically the `chembl_webresource_client`, to allow Python to interface with the ChEMBL API.

In [1]:
pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m55.2/55.2 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.4/61.4 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading cattrs-25.3.0-py3-none-any.whl (70 kB)
[2K   [90m━━━━━━━━━━━━━━

## 2. Core Retrieval Logic
To ensure modularity and error handling, we define a reusable function: `get_target_inhibitors_data`.

**Function Capabilities:**
* **Target Identification:** Accepts broad search terms to locate the correct human target or uses a specific ChEMBL ID if provided.
* **Activity Retrieval:** Queries the database for all recorded activities (inhibitors) associated with the target.
* **Feature Enrichment:** Extracts key molecular features essential for drug discovery models, including:
    * **Canonical SMILES** (for molecular representation)
    * **LogP** (lipophilicity)
    * **Molecular Weight & Formula**
    * **Hydrogen Bond Acceptors/Donors**
    * **Lipinski's Rule of 5 violations** (implied by properties)
* **Robustness:** Gracefully handles missing values and API timeouts.

In [2]:
import pandas as pd
from chembl_webresource_client.new_client import new_client as client

def get_target_inhibitors_data(target_search_terms, target_name, specific_chembl_id=None):
    """
    Queries the ChEMBL API for compounds targeting a specified target (or isoforms),
    extracts their ChEMBL IDs, preferred names, logP values, SMILES strings,
    and additional molecular properties, handling potential missing values gracefully.

    Args:
        target_search_terms (list): A list of strings to search for the target.
        target_name (str): The name to assign to the 'Target' column.
        specific_chembl_id (str, optional): A specific ChEMBL ID to use for the target, bypassing search.

    Returns:
        pd.DataFrame: A DataFrame containing 'ChEMBL ID', 'Preferred Name', 'logP',
                      'SMILES', and new molecular properties for the target's inhibitors.
    """
    print(f"Searching for {target_name} targets broadly using terms: {target_search_terms}...")
    target_chembl_id = specific_chembl_id

    if not target_chembl_id:
        # Iterate through each search term to find the target
        for search_term in target_search_terms:
            targets = client.target.search(search_term)
            for target in targets:
                # Check for human organism and if the preferred name (case-insensitive) contains any of the search terms
                # We use `any(s_term.lower() in target['pref_name'].lower() for s_term in target_search_terms)`
                # to ensure flexibility and match any of the provided terms in the target's pref_name.
                if 'Homo sapiens' in target['organism'] and \
                   any(s_term.lower() in target['pref_name'].lower() for s_term in target_search_terms):
                    target_chembl_id = target['target_chembl_id']
                    print(f"Found human {target_name} target: {target['pref_name']} ({target_chembl_id})")
                    break
            if target_chembl_id: # If found, no need to try other search terms
                break

    if target_chembl_id and specific_chembl_id:
        print(f"Using specific ChEMBL ID for {target_name}: {target_chembl_id}")
    elif target_chembl_id:
        print(f"Found target for {target_name}: {target_chembl_id}")

    # If no suitable target is found, return an empty DataFrame with predefined columns
    if not target_chembl_id:
        print(f"Could not find a suitable human {target_name} target with current filtering. Exiting.")
        return pd.DataFrame(columns=['ChEMBL ID', 'Preferred Name', 'logP', 'SMILES',
                                     'aromatic_rings', 'full_molformula', 'full_mwt',
                                     'hba', 'hbd', 'heavy_atoms', 'np_likeness_score'])

    # Retrieve activities associated with this target
    print(f"Retrieving activities for target {target_chembl_id} with broader criteria...")
    activities = client.activity.filter(target_chembl_id=target_chembl_id)

    # Initialize an empty list to store the extracted data
    inhibitor_data = []

    # Iterate through the retrieved activities
    print(f"Processing {len(activities)} activities...")
    for activity in activities:
        chembl_id = activity.get('molecule_chembl_id')
        preferred_name = None
        logp_value = None
        smiles_value = None
        aromatic_rings = None
        full_molformula = None
        full_mwt = None
        hba = None
        hbd = None
        heavy_atoms = None
        np_likeness_score = None

        # a. Extract preferred name
        if activity.get('molecule_pref_name'):
            preferred_name = activity['molecule_pref_name']
        elif 'molecule_synonyms' in activity and activity['molecule_synonyms']:
            for synonym_entry in activity['molecule_synonyms']:
                if synonym_entry.get('molecule_synonym'):
                    preferred_name = synonym_entry['molecule_synonym']
                    break
        if preferred_name is None:
            preferred_name = chembl_id # Fallback to CHEMBL ID

        # b. Extract logP, SMILES, and new molecular properties from activity object first
        if 'molecule_properties' in activity and activity['molecule_properties']:
            mol_props = activity['molecule_properties']
            logp_value = mol_props.get('alogp')
            smiles_value = mol_props.get('standard_smiles')
            aromatic_rings = mol_props.get('aromatic_rings')
            full_molformula = mol_props.get('full_molformula')
            full_mwt = mol_props.get('full_mwt')
            hba = mol_props.get('hba')
            hbd = mol_props.get('hbd')
            heavy_atoms = mol_props.get('heavy_atoms')
            np_likeness_score = mol_props.get('np_likeness_score')

        # If any essential data is still None, try fetching full molecule details
        if (logp_value is None or smiles_value is None or preferred_name == chembl_id or \
            aromatic_rings is None or full_molformula is None or full_mwt is None or \
            hba is None or hbd is None or heavy_atoms is None or np_likeness_score is None) and chembl_id:
            try:
                molecule = client.molecule.get(chembl_id)
                if molecule:
                    # Try to get a better preferred name from the molecule object itself
                    if preferred_name == chembl_id and molecule.get('pref_name'):
                        preferred_name = molecule['pref_name']
                    elif preferred_name == chembl_id and molecule.get('molecule_synonyms') and not preferred_name:
                        for synonym_entry in molecule['molecule_synonyms']:
                            if synonym_entry.get('molecule_synonym'):
                                preferred_name = synonym_entry['molecule_synonym']
                                break

                    if 'molecule_properties' in molecule and molecule['molecule_properties']:
                        mol_props = molecule['molecule_properties']
                        if logp_value is None: logp_value = mol_props.get('alogp')
                        if smiles_value is None: smiles_value = mol_props.get('standard_smiles')
                        if aromatic_rings is None: aromatic_rings = mol_props.get('aromatic_rings')
                        if full_molformula is None: full_molformula = mol_props.get('full_molformula')
                        if full_mwt is None: full_mwt = mol_props.get('full_mwt')
                        if hba is None: hba = mol_props.get('hba')
                        if hbd is None: hbd = mol_props.get('hbd')
                        if heavy_atoms is None: heavy_atoms = mol_props.get('heavy_atoms')
                        if np_likeness_score is None: np_likeness_score = mol_props.get('np_likeness_score')
                    # Also check directly for canonical_smiles if standard_smiles is missing
                    if smiles_value is None:
                        smiles_value = molecule.get('canonical_smiles')
            except Exception as e:
                # print(f"Error fetching molecule details for {chembl_id}: {e}")
                pass # Suppress individual molecule fetch errors for brevity in output

        # c. Append the extracted data as a dictionary to your list
        inhibitor_data.append({
            'ChEMBL ID': chembl_id,
            'Preferred Name': preferred_name,
            'logP': logp_value,
            'SMILES': smiles_value,
            'aromatic_rings': aromatic_rings,
            'full_molformula': full_molformula,
            'full_mwt': full_mwt,
            'hba': hba,
            'hbd': hbd,
            'heavy_atoms': heavy_atoms,
            'np_likeness_score': np_likeness_score
        })

    # 7. Convert the list of dictionaries into a pandas DataFrame
    df = pd.DataFrame(inhibitor_data)
    df['Target'] = target_name
    df['Action'] = 'inhibitor'
    return df

print("The `get_target_inhibitors_data` function has been redefined to accept `specific_chembl_id`.")

The `get_target_inhibitors_data` function has been redefined to accept `specific_chembl_id`.


## 3. Target Configuration
We define the scope of our search by specifying the biological targets within the PI3K/AKT/mTOR pathway.

We create a configuration list where each entry contains:
* **Search Terms:** A list of synonyms to ensure the correct protein is found via text search.
* **Target Name:** A standardized label for the final dataset.
* **Specific ChEMBL ID:** (Optional) Hardcoded IDs for precision, ensuring we query the exact human isoform (e.g., *PIK3CA* for PI3K Alpha).

In [3]:
target_parameters = []

# PI3K Alpha
target_parameters.append({
    'target_search_terms': ['PI3K', 'PI3K alpha', 'PI3Kalpha', 'p110 alpha', 'pik3ca'],
    'target_name': 'PI3K alpha',
    'specific_chembl_id': 'CHEMBL1075102' # User-provided specific ChEMBL ID
})

# PI3K Beta
target_parameters.append({
    'target_search_terms': ['PI3K', 'PI3K beta', 'PI3Kbeta', 'p110 beta', 'pik3cb'],
    'target_name': 'PI3K beta',
    'specific_chembl_id': 'CHEMBL5554' # User-provided specific ChEMBL ID
})

# PI3K Gamma
target_parameters.append({
    'target_search_terms': ['PI3K', 'PI3K gamma', 'PI3Kgamma', 'p110 gamma', 'pik3cg'],
    'target_name': 'PI3K gamma',
    'specific_chembl_id': 'CHEMBL3267' # User-provided specific ChEMBL ID
})

# PI3K Delta
target_parameters.append({
    'target_search_terms': ['PI3K', 'PI3K delta', 'PI3Kdelta', 'p110 delta', 'pik3cd'],
    'target_name': 'PI3K delta',
    'specific_chembl_id': 'CHEMBL3130' # User-provided specific ChEMBL ID
})

# AKT
target_parameters.append({
    'target_search_terms': ['AKT', 'Protein kinase B', 'PKB'],
    'target_name': 'AKT'
})

# mTOR
target_parameters.append({
    'target_search_terms': ['mTOR', 'mechanistic target of rapamycin', 'FKBP12-rapamycin complex associated protein 1'],
    'target_name': 'mTOR'
})

print("The 'target_parameters' list has been updated with specific ChEMBL IDs.")

The 'target_parameters' list has been updated with specific ChEMBL IDs.


In [4]:
all_inhibitor_dfs = []
print("An empty list named `all_inhibitor_dfs` has been initialized.")

An empty list named `all_inhibitor_dfs` has been initialized.


## 4. Pipeline Execution
We iterate through the configuration list, triggering the data retrieval pipeline for each target. The script provides real-time feedback on the number of bioactivities fetched for each kinase. Valid datasets are collected into a list for subsequent aggregation.

In [5]:
for params in target_parameters:
    search_terms = params['target_search_terms']
    name = params['target_name']
    specific_chembl_id = params.get('specific_chembl_id') # Get specific ID if available
    print(f"\nFetching data for target: {name}")
    df_inhibitors = get_target_inhibitors_data(search_terms, name, specific_chembl_id=specific_chembl_id)
    if not df_inhibitors.empty:
        all_inhibitor_dfs.append(df_inhibitors)
        print(f"Successfully fetched data for {name}. Total records: {len(df_inhibitors)}")
    else:
        print(f"No inhibitors found for {name} or an error occurred.")

print(f"\nFinished processing all targets. Total DataFrames collected: {len(all_inhibitor_dfs)}")


Fetching data for target: PI3K alpha
Searching for PI3K alpha targets broadly using terms: ['PI3K', 'PI3K alpha', 'PI3Kalpha', 'p110 alpha', 'pik3ca']...
Using specific ChEMBL ID for PI3K alpha: CHEMBL1075102
Retrieving activities for target CHEMBL1075102 with broader criteria...
Processing 232 activities...
Successfully fetched data for PI3K alpha. Total records: 232

Fetching data for target: PI3K beta
Searching for PI3K beta targets broadly using terms: ['PI3K', 'PI3K beta', 'PI3Kbeta', 'p110 beta', 'pik3cb']...
Using specific ChEMBL ID for PI3K beta: CHEMBL5554
Retrieving activities for target CHEMBL5554 with broader criteria...
Processing 1486 activities...
Successfully fetched data for PI3K beta. Total records: 1486

Fetching data for target: PI3K gamma
Searching for PI3K gamma targets broadly using terms: ['PI3K', 'PI3K gamma', 'PI3Kgamma', 'p110 gamma', 'pik3cg']...
Using specific ChEMBL ID for PI3K gamma: CHEMBL3267
Retrieving activities for target CHEMBL3267 with broader cri

## 5. Data Aggregation and Cleaning
Once retrieval is complete, we unify the individual datasets into a single Master DataFrame.

**Post-Processing Steps:**
1.  **Concatenation:** Merge all target-specific DataFrames.
2.  **Deduplication:** Remove duplicate entries based on `ChEMBL ID` and `Target` to ensure that each unique molecule-target interaction is recorded only once.

In [6]:
# Concatenate all DataFrames
aggregated_df = pd.concat(all_inhibitor_dfs, ignore_index=True)
print(f"Total records after concatenation: {len(aggregated_df)}")

print("Aggregated DataFrame head:")
print(aggregated_df.head())
print(f"Aggregated DataFrame shape: {aggregated_df.shape}")

Total records after concatenation: 78036
Aggregated DataFrame head:
       ChEMBL ID Preferred Name   logP SMILES  aromatic_rings full_molformula  \
0  CHEMBL1085178  CHEMBL1085178   2.10   None             3.0     C18H22N6O2S   
1  CHEMBL1084926  CHEMBL1084926   1.79   None             3.0     C17H20N6O2S   
2  CHEMBL1256459         TORIN1   6.83   None             6.0    C35H28F3N5O2   
3  CHEMBL1684984      IZORLISIB  -0.33   None             2.0     C15H19N7O3S   
4  CHEMBL1765602  CHEMBL1765602   5.20   None             5.0     C24H15F3N4O   

  full_mwt  hba  hbd  heavy_atoms np_likeness_score      Target     Action  
0   386.48  9.0  2.0         27.0             -1.15  PI3K alpha  inhibitor  
1   372.45  9.0  2.0         26.0             -1.21  PI3K alpha  inhibitor  
2   607.64  6.0  0.0         45.0             -1.20  PI3K alpha  inhibitor  
3   377.43  9.0  1.0         26.0             -1.49  PI3K alpha  inhibitor  
4   432.41  5.0  1.0         32.0             -0.85  PI3K al

## 6. Data Export
Finally, the processed dataset is saved to a CSV file (`all_inhibitor_data_with_smiles.csv`). This file serves as the foundational dataset for subsequent exploratory data analysis (EDA) and machine learning modeling (e.g., QSAR prediction or generative modeling).

In [7]:
# Define the output file name
output_filename = 'all_inhibitor_data_with_smiles.csv'

# Save the aggregated DataFrame to a CSV file
aggregated_df.to_csv(output_filename, index=False)

print(f"Aggregated data successfully saved to '{output_filename}'")

Aggregated data successfully saved to 'all_inhibitor_data_with_smiles.csv'
