# ChEMBL Dataset Extraction

This notebook supplements the main notebook; it is only used to extract drugs from ChEMBL and save them as a `.csv` file.

The ChEMBL database can be downloaded [here](https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/latest/).

I am using the `"chembl_35.db"` version of the database.


## Libraries

In [1]:
import sqlite3
import pandas as pd
import re

## Reading the dataset

In [2]:
conn = sqlite3.connect("chembl_35.db")
cursor = conn.cursor()

In [3]:
query =  """
SELECT DISTINCT
  md.molregno,
  md.chembl_id      AS molecule_chembl_id,
  md.pref_name      AS molecule_name,
  cs.canonical_smiles,
  cs.standard_inchi_key,
  act.activity_id,
  act.activity_comment,
  act.standard_type,
  act.standard_value,
  ass.description   AS assay_description,
  ass.assay_type,
  td.pref_name      AS target_pref_name,
  td.target_type,
  td.organism       AS target_organism
FROM molecule_dictionary md
LEFT JOIN compound_structures cs ON cs.molregno = md.molregno
LEFT JOIN activities act ON act.molregno = md.molregno
LEFT JOIN assays ass ON ass.assay_id = act.assay_id
LEFT JOIN target_dictionary td ON td.tid = ass.tid;
"""

In [4]:
df = pd.read_sql(query, conn)

## Extracting all substrates of any Cytochrome P450 isoform

First, I check whether the target enzyme is a Cytochrome P450 (any isoform).  
For any other target, the function returns `False`.

Then, two criteria decide whether a drug is a substrate:

### 1. Direct indication from activity comment:
- If the activity comment is exactly "substrate" (ignoring case and surrounding whitespace), we immediately return `True`, because the label is stated outright.

### 2. Evidence tracker evaluation:
- Otherwise, we use an evidence tracker that accumulates a score to decide whether the given drug is a substrate. Every CYP450 row starts from a base score of `0.4`.
  - If the text mentions "substrate" together with a negation such as "not a substrate" (rare, but better safe than sorry), we return `False`.
  - We **add points** for a non-negated mention of "substrate" (`0.3`, or `0.15` when it is hedged with words such as "possible" or "likely"), for an indication of metabolism (`0.25`), and for a human target (`0.1`).
  - We **deduct points** (`0.35`) if there is an indication of inhibition (since the target enzyme is an isoform of CYP450, the compound can be either an inhibitor or a substrate).
  - If the total score is at or above a threshold (currently set to `0.7`), we return `True`; otherwise, `False`.
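
The scoring rule can be written out as plain arithmetic. The sketch below is only an illustration: `evidence_score` and its boolean parameters are hypothetical names, the weights are copied from `is_cyp_substrate_row` defined later in this notebook, and the negation shortcut is left out.

```python
def evidence_score(has_substrate_word=False, is_hedged=False,
                   has_metabolism_term=False, has_inhibition_term=False,
                   is_human_target=False):
    """Evidence score for a row whose target is already known to be a CYP (hypothetical helper)."""
    score = 0.4  # base weight for a CYP target
    if has_substrate_word:
        score += 0.15 if is_hedged else 0.3
    if has_metabolism_term:
        score += 0.25
    if has_inhibition_term:
        score -= 0.35
    if is_human_target:
        score += 0.1
    return max(0.0, min(1.0, score))  # clamp to [0, 1]


THRESHOLD = 0.7

# Human CYP assay reporting a metabolism term: 0.4 + 0.25 + 0.1 = 0.75, accepted
print(evidence_score(has_metabolism_term=True, is_human_target=True) >= THRESHOLD)

# Human CYP inhibition assay that also mentions "substrate":
# 0.4 + 0.3 - 0.35 + 0.1 = 0.45, rejected
print(evidence_score(has_substrate_word=True, has_inhibition_term=True,
                     is_human_target=True) >= THRESHOLD)
```

Note that a human CYP row whose text merely contains the word "substrate" already reaches 0.8 under these weights, so no metabolism evidence is needed to pass the threshold.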


In [5]:
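# Target text mentions a CYP: "cyp" or "cytochrome [p450]", optionally followed by an isoform fragment
cyp_pattern = re.compile(r"\b(?:cyp|cytochrome(?:\s*p450)?)\s*\.?\s*([0-9]{1,2}[A-Za-z]?)?\b", flags=re.I)
# The word "substrate" on its own
substrate_word = re.compile(r"\bsubstrate\b", flags=re.I)
# Explicit statements that the compound is not a substrate / not metabolised
negation_pattern = re.compile(r"\b(?:not a substrate|non-?substrate|no evidence of(?: metabolism)?|does not metabol|not metaboliz|not metabolized|not metabolised)\b", flags=re.I)
# Hedging words that weaken a "substrate" mention
qualifier_pattern = re.compile(r"\b(?:possible|potential|probable|likely|may be|might be|suggests|suggested|suspected|\?)\b", flags=re.I)
# Keywords that point to a metabolism assay
metabolism_indicators = re.compile(
    r"\b(km|vmax|intrinsic clearance|clint|microsom|hepatocyt|metabolit|metabolic stability|substrate depletion|turnover|t1/2|half-?life|metabolised|metabolized)\b",
    flags=re.I
)
# Keywords that point to an inhibition assay (the former `ki\(` and `ki$` alternatives were already covered by `ki`)
inhibitor_indicators = re.compile(r"\b(inhibitor|inhibition|inhibits|ic50|ki)\b", flags=re.I)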
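
Since every alternative in these patterns sits between two `\b` anchors, it is worth checking which strings actually match. The sketch below runs a copy of `metabolism_indicators` over a few made-up snippets (the snippets are illustrative, not taken from ChEMBL).

```python
import re

# Copy of `metabolism_indicators` from the cell above.
metabolism_indicators = re.compile(
    r"\b(km|vmax|intrinsic clearance|clint|microsom|hepatocyt|metabolit|metabolic stability|substrate depletion|turnover|t1/2|half-?life|metabolised|metabolized)\b",
    flags=re.I,
)

# Made-up snippets, for illustration only.
examples = [
    "log (1/Km) value for the enzyme",        # whole word "Km"
    "metabolic stability in vitro",           # full phrase
    "incubated with human liver microsomes",  # stem "microsom" followed by more letters
    "formation of the major metabolite",      # stem "metabolit" followed by more letters
]

for text in examples:
    print(f"{bool(metabolism_indicators.search(text))!s:5}  {text}")
```

The first two snippets match and the last two do not: the trailing `\b` requires the stem to end at a word boundary, so "microsomes" and "metabolite" are not counted as metabolism evidence. Writing the stems as, for example, `microsom\w*` would be one way to catch them, though that would change the counts reported further down.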

In [6]:
def normalize_text(x):
    if pd.isna(x):
        return ""
    return str(x).strip()

In [7]:
def is_cyp_target_row(row):
    """Return True if the target or assay text of a row matches `cyp_pattern`."""
    texts = " ".join([
        normalize_text(row.get("target_pref_name", "")),
        # not selected by the query above, so this falls back to ""
        normalize_text(row.get("target_chembl_id", "")),
        normalize_text(row.get("target_organism", "")),
    ]).lower()
    texts += " " + normalize_text(row.get("assay_description", "")).lower()
    return bool(cyp_pattern.search(texts))
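
To see which spellings `is_cyp_target_row` reacts to, the sketch below applies a copy of `cyp_pattern` to a few hand-picked target names. The list of names is an assumption chosen for illustration.

```python
import re

# Copy of `cyp_pattern` from the cell above.
cyp_pattern = re.compile(
    r"\b(?:cyp|cytochrome(?:\s*p450)?)\s*\.?\s*([0-9]{1,2}[A-Za-z]?)?\b",
    flags=re.I,
)

names = ["Cytochrome P450 3A4", "CYP 3A4", "CYP3A4", "Leukocyte elastase"]

for name in names:
    print(f"{bool(cyp_pattern.search(name))!s:5}  {name}")
```

The spelled-out and spaced forms match, while the compact form `CYP3A4` does not, because no word boundary follows `cyp`, `cyp3` or `cyp3a` inside that token. The positives shown in this notebook carry a spelled-out `target_pref_name`, so this mainly matters for rows where the CYP is named only in compact form.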

In [8]:
def is_cyp_substrate_row(row, accept_threshold=0.7):
    """Return True if the row is judged a CYP450 substrate (exact comment match or score >= accept_threshold)."""
    # normalize fields
    activity_comment = normalize_text(row.get("activity_comment", ""))
    assay_description = normalize_text(row.get("assay_description", ""))
    assay_type = normalize_text(row.get("assay_type", ""))
    target_organism = normalize_text(row.get("target_organism", ""))

    # Quick: require target is CYP for any positive decision
    if not is_cyp_target_row(row):
        return False

    # EXACT safe accept: activity_comment is exactly "substrate" (no surrounding text),
    # so it cannot contain a negation phrase
    if re.fullmatch(r"\s*substrate\s*", activity_comment, flags=re.I):
        return True

    # otherwise build evidence
    score = 0.0

    # CYP target presence gives base weight
    score += 0.4

    combined_text = " ".join([activity_comment, assay_description, assay_type]).lower()

    if substrate_word.search(combined_text):
        
        if negation_pattern.search(combined_text):
            return False
        
        if qualifier_pattern.search(combined_text):
            score += 0.15
        else:
            score += 0.3

    if metabolism_indicators.search(combined_text):
        score += 0.25

    if inhibitor_indicators.search(combined_text):
        score -= 0.35

    if target_organism and target_organism.lower() in ("homo sapiens", "human"):
        score += 0.1

    # clamp score
    score = max(0.0, min(1.0, score))

    return score >= accept_threshold

In [None]:
df["is_cyp450_substrate"] = df.apply(is_cyp_substrate_row, axis=1)

In [10]:
print("\nTarget distribution:")
print(df["is_cyp450_substrate"].value_counts())


Target distribution:
is_cyp450_substrate
False    21222731
True         2590
Name: count, dtype: int64


In [11]:
positives = df[df["is_cyp450_substrate"]]  # the column is boolean, so it can be used as a mask directly

In [12]:
positives

Unnamed: 0,molregno,molecule_chembl_id,molecule_name,canonical_smiles,standard_inchi_key,activity_id,activity_comment,standard_type,standard_value,assay_description,assay_type,target_pref_name,target_type,target_organism,is_cyp450_substrate
2720,115,CHEMBL3,NICOTINE,CN1CCC[C@H]1c1cccnc1,SNICXCGAKADSCV-JTQLQIEISA-N,6074386.0,Substrate,Activity,,Clinically relevant substrates of human liver ...,A,Cytochrome P450 2A6,SINGLE PROTEIN,Homo sapiens,True
53033,605,CHEMBL11,IMIPRAMINE,CN(C)CCCN1c2ccccc2CCc2ccccc21,BCGWQEUPMDMJNV-UHFFFAOYSA-N,1477930.0,,Log 1/Km,-2.0400,log (1/Km) value for human liver microsome cyt...,A,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,True
53211,605,CHEMBL11,IMIPRAMINE,CN(C)CCCN1c2ccccc2CCc2ccccc21,BCGWQEUPMDMJNV-UHFFFAOYSA-N,6074432.0,Substrate,Activity,,Clinically relevant substrates of human liver ...,A,Cytochrome P450 2D6,SINGLE PROTEIN,Homo sapiens,True
53479,605,CHEMBL11,IMIPRAMINE,CN(C)CCCN1c2ccccc2CCc2ccccc21,BCGWQEUPMDMJNV-UHFFFAOYSA-N,15470675.0,,CL,0.0380,Drug metabolism in assessed as human CYP2C19-m...,A,Cytochrome P450 2C19,SINGLE PROTEIN,Homo sapiens,True
53482,605,CHEMBL11,IMIPRAMINE,CN(C)CCCN1c2ccccc2CCc2ccccc21,BCGWQEUPMDMJNV-UHFFFAOYSA-N,15470681.0,,CL,0.0380,Drug metabolism in assessed as human CYP1A2-me...,A,Cytochrome P450 1A2,SINGLE PROTEIN,Homo sapiens,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21199947,2881563,CHEMBL5441417,,CC(C)[C@H](Nc1ncnc2[nH]ccc12)c1ccc2c(c1)S(=O)(...,WLJITGAGZLIWOY-KRWDZBQOSA-N,25737787.0,,IC50,18200.0000,"The compounds are incubated at 0, 0.15, 0.5, 1...",A,Cytochrome P450 2C9,SINGLE PROTEIN,Homo sapiens,True
21199948,2881563,CHEMBL5441417,,CC(C)[C@H](Nc1ncnc2[nH]ccc12)c1ccc2c(c1)S(=O)(...,WLJITGAGZLIWOY-KRWDZBQOSA-N,25737788.0,,IC50,50000.0000,"The compounds are incubated at 0, 0.15, 0.5, 1...",A,Cytochrome P450 2D6,SINGLE PROTEIN,Homo sapiens,True
21199949,2881563,CHEMBL5441417,,CC(C)[C@H](Nc1ncnc2[nH]ccc12)c1ccc2c(c1)S(=O)(...,WLJITGAGZLIWOY-KRWDZBQOSA-N,25737789.0,,IC50,3200.0000,"The compounds are incubated at 0, 0.15, 0.5, 1...",A,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,True
21206260,2882632,CHEMBL5483011,,CC(C)(C)NC(=O)[C@@H]1CN(Cc2cccnc2)CCN1C[C@@H](...,XTYSXGHMTNTKFH-BDEHJDMKSA-N,1477232.0,,Log 1/Km,-0.1139,log (1/Km) value for human liver microsome cyt...,A,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,True


## Excluding drugs that are already present in the DrugBank dataset and removing duplicates

In [13]:
drugbank_dataset = pd.read_csv("dataset.csv")

In [14]:
drugbank_chembl_ids = set(drugbank_dataset["chembl_id"].dropna().unique())
drugbank_smiles = set(drugbank_dataset["smiles"].dropna().unique())

In [15]:
df.rename(columns={"molecule_chembl_id": "chembl_id"}, inplace=True)

In [16]:
df = df[~df["chembl_id"].isin(drugbank_chembl_ids)]
df = df[~df["canonical_smiles"].isin(drugbank_smiles)]

In [17]:
print("\nTarget distribution:")
print(df["is_cyp450_substrate"].value_counts())


Target distribution:
is_cyp450_substrate
False    19258983
True         2181
Name: count, dtype: int64


In [18]:
df = df.drop_duplicates(subset=["canonical_smiles"])

In [19]:
print("\nTarget distribution:")
print(df["is_cyp450_substrate"].value_counts())


Target distribution:
is_cyp450_substrate
False    2465626
True         274
Name: count, dtype: int64
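
`drop_duplicates` keeps the first row it encounters for each SMILES, so a molecule's label after this step depends on row order. One alternative, sketched below on a made-up three-row frame, is to sort positives first so that a molecule with any positive row keeps it. This is an illustration under that assumption, not what the cell above does, and applying it would change the counts reported here.

```python
import pandas as pd

# Toy data: "CCO" has one negative and one positive row.
toy = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCN"],
    "is_cyp450_substrate": [False, True, False],
})

# Default behaviour: the first row per SMILES wins, so "CCO" ends up False.
first_wins = toy.drop_duplicates(subset=["canonical_smiles"])

# Alternative: sort so that True rows come first, then deduplicate.
positive_wins = (
    toy.sort_values("is_cyp450_substrate", ascending=False, kind="stable")
       .drop_duplicates(subset=["canonical_smiles"])
)

print(first_wins["is_cyp450_substrate"].sum())     # positives kept by the default
print(positive_wins["is_cyp450_substrate"].sum())  # positives kept by the alternative
```

Whether a molecule should count as a substrate when only some of its rows are positive is a modelling decision; the sketch only shows how the two choices differ.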


In [20]:
positives = df[df["is_cyp450_substrate"]]   # boolean column used directly as a mask
negatives = df[~df["is_cyp450_substrate"]]

In [21]:
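# Balance the classes: take every positive and an equal number of negatives
n_per_class = len(positives)  # 274 in this run
pos_sample = positives.sample(n=n_per_class, random_state=42)
neg_sample = negatives.sample(n=n_per_class, random_state=42)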
final_val_set = pd.concat([pos_sample, neg_sample]).reset_index(drop=True)

In [22]:
final_val_set

Unnamed: 0,molregno,chembl_id,molecule_name,canonical_smiles,standard_inchi_key,activity_id,activity_comment,standard_type,standard_value,assay_description,assay_type,target_pref_name,target_type,target_organism,is_cyp450_substrate
0,1122112,CHEMBL1743264,,CC[C@@]1(c2ccccc2)NC(=O)N(C)C1=O,GMHKMTDVRCWUDX-LBPRGKRZSA-N,6074390.0,Substrate,Activity,,Clinically relevant substrates of human liver ...,A,Cytochrome P450 2B6,SINGLE PROTEIN,Homo sapiens,True
1,2158270,CHEMBL3948899,,O=S(=O)(NC1CCc2c(-c3ccc(C(F)(F)F)cc3)cncc21)C1CC1,WXLZPMOUEMUHHE-UHFFFAOYSA-N,17720898.0,343607,EC50,2328.40000,Cellular Enzyme Assay: G-402 cells expressing ...,B,Cytochrome P450 11B1,SINGLE PROTEIN,Homo sapiens,True
2,2186745,CHEMBL3977374,,CCC(=O)NC1CCc2c(-c3cc4c(cc3F)N(C)C(=O)CC4)cncc21,RXJOFZNMHGDABT-UHFFFAOYSA-N,17773552.0,407740,EC50,10.20000,Cellular Enzyme Assay: The expression plasmids...,B,Cytochrome P450 11B2,SINGLE PROTEIN,Homo sapiens,True
3,2117343,CHEMBL3907972,,CN1C(=O)CCc2cc(-c3cncc(COc4ncccc4C(F)(F)F)c3)c...,JEOUUVHDJALODO-UHFFFAOYSA-N,17794779.0,434818,EC50,2.30000,G402-Based Assay: Herein we identified the use...,B,Cytochrome P450 11B2,SINGLE PROTEIN,Homo sapiens,True
4,2501239,CHEMBL4753902,,Cc1sc2c(c1C)C(c1ccc(Cl)cc1)=N[C@@H](CC(=O)NCCO...,ADVKPHNJNIYOGA-MBMZGMDYSA-N,22413651.0,,T1/2,0.01833,Metabolic stability assessed as recombinant CY...,A,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,1925324,CHEMBL3542197,,C[C@H](CO)[C@]1(C)SC(N[C@H]2C[C@H]3C[C@@H]2CC3...,PNVPXPSCFLTPOV-GOTIFHFNSA-N,15445357.0,,Drug recovery,0.10000,Total drug recovery in female Sprague-Dawley r...,A,Rattus norvegicus,ORGANISM,Rattus norvegicus,False
544,398059,CHEMBL231715,,O=C1Nc2cccc(O)c2/C1=C/c1ccc[nH]1,ITGNBFTVFUSOGU-CLFYSBASSA-N,1969507.0,,IC50,1000.00000,Inhibition of PDK1 mediated cAKT2 phosphorylat...,B,3-phosphoinositide dependent protein kinase-1,SINGLE PROTEIN,Homo sapiens,False
545,2749368,CHEMBL5178137,,C[Se]C1=C([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OP(...,JZPFYMOTUXQAKE-QMTIVRBISA-N,24816501.0,Active,Inhibition,,Inhibition of Escherichia coli RNA polymerase-...,B,Unchecked,UNCHECKED,,False
546,30431,CHEMBL22757,,CCCCc1nnc(C(=O)[C@@H](NC(=O)Cn2c(-c3ccc(F)cc3)...,OZNABUFLGPLWMM-IBGZPJMESA-N,1179061.0,,Ki,6.76000,Binding constant derived from inhibition of el...,B,Leukocyte elastase,SINGLE PROTEIN,Homo sapiens,False


In [None]:
## Sanity check to see if my thresholding was good
final_val_set.iloc[1]["assay_description"]

"Cellular Enzyme Assay: G-402 cells expressing CYP11 constructs were established as described above and maintained in McCoy's 5a Medium Modified, ATCC Catalog No. 30-2007 containing 10% FCS and 400 ug/ml G418 (Geneticin) at 37° C. under an atmosphere of 5% CO2/95% air. Cellular enzyme assays were performed in DMEM/F12 medium containing 2.5% charcoal treated FCS and appropriate concentration of substrate (0.3-10 uM 11-Deoxycorticosterone, 11-Deoxycortisol or Corticosterone). For assaying enzymatic activity, cells were plated onto 96 well plates and incubated for 16 h. An aliquot of the supernatant is then transferred and analyzed for the concentration of the expected product (Aldosterone for CYP11B2; Cortisol for CYP11B1). The concentrations of these steroids can be determined using HTRF assays from CisBio analyzing either Aldosterone or Cortisol."

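The score for this row can be traced by hand. The sketch below applies `substrate_word` and the keywords of `inhibitor_indicators` to an excerpt of the description printed above; `excerpt` is a name introduced here for illustration.

```python
import re

substrate_word = re.compile(r"\bsubstrate\b", flags=re.I)
# Same keywords as `inhibitor_indicators` earlier in this notebook.
inhibitor_indicators = re.compile(r"\b(inhibitor|inhibition|inhibits|ic50|ki)\b", flags=re.I)

# Excerpt of the assay description printed above.
excerpt = (
    "Cellular enzyme assays were performed in DMEM/F12 medium containing 2.5% "
    "charcoal treated FCS and appropriate concentration of substrate "
    "(0.3-10 uM 11-Deoxycorticosterone, 11-Deoxycortisol or Corticosterone)."
)

print(bool(substrate_word.search(excerpt)))        # the word "substrate" is present
print(bool(inhibitor_indicators.search(excerpt)))  # no inhibition keyword is present
```

With a CYP target (`0.4`), an unhedged "substrate" mention (`0.3`) and a human target (`0.1`), the row reaches `0.8` and is labelled `True`. In this description, however, "substrate" refers to the steroid added to the medium rather than to the tested compound, so this row looks like a false positive of the keyword approach.
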
In [None]:
## Saving the dataset
final_val_set.to_csv('Chembl_ext_val_dataset.csv', index=False)  

In [None]:
## A bit of feature engineering that should go in the main notebook: encode the boolean label as 0/1
final_val_set['is_cyp450_substrate'] = final_val_set['is_cyp450_substrate'].astype(int)

In [None]:
final_val_set

Unnamed: 0,molregno,chembl_id,molecule_name,canonical_smiles,standard_inchi_key,activity_id,activity_comment,standard_type,standard_value,assay_description,assay_type,target_pref_name,target_type,target_organism,is_cyp450_substrate
0,1122112,CHEMBL1743264,,CC[C@@]1(c2ccccc2)NC(=O)N(C)C1=O,GMHKMTDVRCWUDX-LBPRGKRZSA-N,6074390.0,Substrate,Activity,,Clinically relevant substrates of human liver ...,A,Cytochrome P450 2B6,SINGLE PROTEIN,Homo sapiens,1
1,2158270,CHEMBL3948899,,O=S(=O)(NC1CCc2c(-c3ccc(C(F)(F)F)cc3)cncc21)C1CC1,WXLZPMOUEMUHHE-UHFFFAOYSA-N,17720898.0,343607,EC50,2328.40000,Cellular Enzyme Assay: G-402 cells expressing ...,B,Cytochrome P450 11B1,SINGLE PROTEIN,Homo sapiens,1
2,2186745,CHEMBL3977374,,CCC(=O)NC1CCc2c(-c3cc4c(cc3F)N(C)C(=O)CC4)cncc21,RXJOFZNMHGDABT-UHFFFAOYSA-N,17773552.0,407740,EC50,10.20000,Cellular Enzyme Assay: The expression plasmids...,B,Cytochrome P450 11B2,SINGLE PROTEIN,Homo sapiens,1
3,2117343,CHEMBL3907972,,CN1C(=O)CCc2cc(-c3cncc(COc4ncccc4C(F)(F)F)c3)c...,JEOUUVHDJALODO-UHFFFAOYSA-N,17794779.0,434818,EC50,2.30000,G402-Based Assay: Herein we identified the use...,B,Cytochrome P450 11B2,SINGLE PROTEIN,Homo sapiens,1
4,2501239,CHEMBL4753902,,Cc1sc2c(c1C)C(c1ccc(Cl)cc1)=N[C@@H](CC(=O)NCCO...,ADVKPHNJNIYOGA-MBMZGMDYSA-N,22413651.0,,T1/2,0.01833,Metabolic stability assessed as recombinant CY...,A,Cytochrome P450 3A4,SINGLE PROTEIN,Homo sapiens,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,1925324,CHEMBL3542197,,C[C@H](CO)[C@]1(C)SC(N[C@H]2C[C@H]3C[C@@H]2CC3...,PNVPXPSCFLTPOV-GOTIFHFNSA-N,15445357.0,,Drug recovery,0.10000,Total drug recovery in female Sprague-Dawley r...,A,Rattus norvegicus,ORGANISM,Rattus norvegicus,0
544,398059,CHEMBL231715,,O=C1Nc2cccc(O)c2/C1=C/c1ccc[nH]1,ITGNBFTVFUSOGU-CLFYSBASSA-N,1969507.0,,IC50,1000.00000,Inhibition of PDK1 mediated cAKT2 phosphorylat...,B,3-phosphoinositide dependent protein kinase-1,SINGLE PROTEIN,Homo sapiens,0
545,2749368,CHEMBL5178137,,C[Se]C1=C([C@@H]2O[C@H](COP(=O)(O)OP(=O)(O)OP(...,JZPFYMOTUXQAKE-QMTIVRBISA-N,24816501.0,Active,Inhibition,,Inhibition of Escherichia coli RNA polymerase-...,B,Unchecked,UNCHECKED,,0
546,30431,CHEMBL22757,,CCCCc1nnc(C(=O)[C@@H](NC(=O)Cn2c(-c3ccc(F)cc3)...,OZNABUFLGPLWMM-IBGZPJMESA-N,1179061.0,,Ki,6.76000,Binding constant derived from inhibition of el...,B,Leukocyte elastase,SINGLE PROTEIN,Homo sapiens,0

