# ChEMBL webresource client
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Stef0916/chemoinformatics-bioinformatics/blob/main/cheminformatics-workflow/notebooks/3-chembl_webresource.ipynb)

## Content

1. [Import Libraries](#1)
2. [Listing all tables](#2)
3. [Search for Target Protein](#3)
    - 3.1 [Select Homo Sapiens Cyclooxygenase-2](#4)
    - 3.2 [Retrieve Bioactivity Data by IC50](#5)
    - 3.3 [Remove empty entries](#6)
    - 3.4 [Remove duplicates](#7)
    - 3.5 [Manage Std Values Units](#8)
    - 3.6 [Label: Active, Inactive, Intermediate](#9)
    - 3.7 [Convert IC50 to pIC50](#10)
4. [Molecular Visualization](#11)
5. [Save DataSet](#12)

## ChEMBL Database<a id = 4></a>

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity, and genomic data to aid the translation of genomic information into effective new drugs. Most of its bioactivity data is extracted from the scientific literature, which makes it a valuable resource for drug discovery and chemical biology research.<sup>[1](https://doi.org/10.1093/nar/gkr777)</sup>

## 1. Import Libraries<a name = 1></a>

In [1]:
!pip install chembl_webresource_client



In [2]:
!pip install rdkit



In [3]:
!pip install mols2grid



In [4]:
from chembl_webresource_client.new_client import new_client

In [48]:
import pandas as pd
import numpy as np

#--------------------------------------------------------------

from rdkit import Chem
from rdkit.Chem import Draw, PandasTools
import mols2grid

In [6]:
dir(new_client)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'activity',
 'activity_supplementary_data_by_activity',
 'assay',
 'assay_class',
 'atc_class',
 'binding_site',
 'biotherapeutic',
 'cell_line',
 'chembl_id_lookup',
 'compound_record',
 'compound_structural_alert',
 'description',
 'document',
 'document_similarity',
 'drug',
 'drug_indication',
 'go_slim',
 'image',
 'mechanism',
 'metabolism',
 'molecule',
 'molecule_form',
 'official',
 'organism',
 'protein_classification',
 'similarity',
 'source',
 'substructure',
 'target',
 'target_component',
 'target_relation',
 'tissue',
 'xref_source']

## 2. Listing all tables<a name = 2></a>

In [7]:
available_resources = [resource for resource in dir(new_client) if not resource.startswith('_')]
available_resources

['activity',
 'activity_supplementary_data_by_activity',
 'assay',
 'assay_class',
 'atc_class',
 'binding_site',
 'biotherapeutic',
 'cell_line',
 'chembl_id_lookup',
 'compound_record',
 'compound_structural_alert',
 'description',
 'document',
 'document_similarity',
 'drug',
 'drug_indication',
 'go_slim',
 'image',
 'mechanism',
 'metabolism',
 'molecule',
 'molecule_form',
 'official',
 'organism',
 'protein_classification',
 'similarity',
 'source',
 'substructure',
 'target',
 'target_component',
 'target_relation',
 'tissue',
 'xref_source']

These resources can be used to query the database for information on drugs, chemical compounds, biological assays, and much more. Each resource corresponds to a different type of data or a different functionality provided by the ChEMBL web API.

For example, use `new_client.molecule` to look up molecules by name:

```python
from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
mols = molecule.filter(pref_name__iexact='aspirin')
mols
```

More examples:

```python
# For targets
target_resource = new_client.target
target_results = target_resource.filter(organism__icontains='Homo sapiens')
target_results

# For assays
assay_resource = new_client.assay
assay_results = assay_resource.filter(assay_type__iexact='B')
assay_results

# For documents
document_resource = new_client.document
document_results = document_resource.filter(journal__icontains='Nature')
document_results
```
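
The keyword suffixes used above (`__iexact`, `__icontains`) follow Django-style lookup conventions. As a rough local illustration of what they mean, the sketch below applies the same matching rules to a small in-memory list of records. The `matches` helper and the sample records are hypothetical and are not part of `chembl_webresource_client`; with the real client, filtering is done by the web service.

```python
def matches(record, **lookups):
    """Return True if `record` satisfies every `field__lookup=value` condition."""
    for key, expected in lookups.items():
        field, _, lookup = key.partition("__")
        actual = str(record.get(field, ""))
        expected = str(expected)
        if lookup == "iexact":        # case-insensitive equality
            ok = actual.lower() == expected.lower()
        elif lookup == "icontains":   # case-insensitive substring match
            ok = expected.lower() in actual.lower()
        elif lookup == "":            # no suffix: plain equality
            ok = actual == expected
        else:
            raise ValueError(f"lookup not covered by this sketch: {lookup}")
        if not ok:
            return False
    return True


# Hypothetical records, for illustration only
records = [
    {"pref_name": "ASPIRIN"},
    {"pref_name": "ASPIRIN DL-LYSINE"},
    {"pref_name": "IBUPROFEN"},
]

exact = [r["pref_name"] for r in records if matches(r, pref_name__iexact="aspirin")]
partial = [r["pref_name"] for r in records if matches(r, pref_name__icontains="aspirin")]
```

Here `exact` holds only the record whose name equals "aspirin" ignoring case, while `partial` also picks up names that merely contain it.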

## 3. Search for Target Protein<a name = 3></a>

In [8]:
# Target search for Cyclooxygenase-2
target = new_client.target
target_query = target.search('Cyclooxygenase-2')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'Q8SPQ9', 'xref_name': None, 'xre...",Canis lupus familiaris,Cyclooxygenase-2,23.0,False,CHEMBL4033,"[{'accession': 'Q8SPQ9', 'component_descriptio...",SINGLE PROTEIN,9615.0
1,"[{'xref_id': 'P35354', 'xref_name': None, 'xre...",Homo sapiens,Cyclooxygenase-2,18.0,False,CHEMBL230,"[{'accession': 'P35354', 'component_descriptio...",SINGLE PROTEIN,9606.0
2,"[{'xref_id': 'O62698', 'xref_name': None, 'xre...",Bos taurus,Cyclooxygenase-2,18.0,False,CHEMBL3331,"[{'accession': 'O62698', 'component_descriptio...",SINGLE PROTEIN,9913.0
3,"[{'xref_id': 'Q05769', 'xref_name': None, 'xre...",Mus musculus,Cyclooxygenase-2,18.0,False,CHEMBL4321,"[{'accession': 'Q05769', 'component_descriptio...",SINGLE PROTEIN,10090.0
4,"[{'xref_id': 'P35355', 'xref_name': None, 'xre...",Rattus norvegicus,Cyclooxygenase-2,18.0,False,CHEMBL2977,"[{'accession': 'P35355', 'component_descriptio...",SINGLE PROTEIN,10116.0
...,...,...,...,...,...,...,...,...,...
2842,[],Mus musculus,Glutamate NMDA receptor,0.0,False,CHEMBL3832634,"[{'accession': 'P35436', 'component_descriptio...",PROTEIN COMPLEX GROUP,10090.0
2843,[],Mus musculus,L-type calcium channel,0.0,False,CHEMBL3988632,"[{'accession': 'Q01815', 'component_descriptio...",PROTEIN FAMILY,10090.0
2844,[],Rattus norvegicus,Voltage-gated sodium channel,0.0,False,CHEMBL3988641,"[{'accession': 'O88457', 'component_descriptio...",PROTEIN FAMILY,10116.0
2845,[],Homo sapiens,UDP-glucuronosyltransferases (UGTs),0.0,False,CHEMBL4523985,"[{'accession': 'P22310', 'component_descriptio...",PROTEIN FAMILY,9606.0


In [9]:
# Keep human targets and sort by search score so the best match comes first
target_hs = targets.loc[targets['organism'] == 'Homo sapiens'].sort_values(by='score', ascending=False)
target_hs

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
1,"[{'xref_id': 'P35354', 'xref_name': None, 'xre...",Homo sapiens,Cyclooxygenase-2,18.0,False,CHEMBL230,"[{'accession': 'P35354', 'component_descriptio...",SINGLE PROTEIN,9606.0
6,[],Homo sapiens,Cyclooxygenase,15.0,False,CHEMBL2094253,"[{'accession': 'P35354', 'component_descriptio...",PROTEIN FAMILY,9606.0
9,"[{'xref_id': 'P24557', 'xref_name': None, 'xre...",Homo sapiens,Thromboxane-A synthase,14.0,False,CHEMBL1835,"[{'accession': 'P24557', 'component_descriptio...",SINGLE PROTEIN,9606.0
14,[],Homo sapiens,COX-1/COX-2,14.0,False,CHEMBL4523964,"[{'accession': 'P35354', 'component_descriptio...",SELECTIVITY GROUP,9606.0
15,"[{'xref_id': 'PTGS1', 'xref_name': None, 'xref...",Homo sapiens,Cyclooxygenase-1,13.0,False,CHEMBL221,"[{'accession': 'P23219', 'component_descriptio...",SINGLE PROTEIN,9606.0
...,...,...,...,...,...,...,...,...,...
2828,[],Homo sapiens,Inhibitor of NF-kappa-B kinase (IKK),0.0,False,CHEMBL2111328,"[{'accession': 'O14920', 'component_descriptio...",PROTEIN COMPLEX,9606.0
2827,[],Homo sapiens,Serotonin (5-HT) receptor,0.0,False,CHEMBL2096904,"[{'accession': 'P30939', 'component_descriptio...",PROTEIN FAMILY,9606.0
2826,[],Homo sapiens,Alcohol dehydrogenase,0.0,False,CHEMBL2096668,"[{'accession': 'P07327', 'component_descriptio...",PROTEIN FAMILY,9606.0
2822,"[{'xref_id': 'Atrial_natriuretic_peptide', 'xr...",Homo sapiens,Atrial natriuretic factor,0.0,False,CHEMBL1293193,"[{'accession': 'P01160', 'component_descriptio...",SINGLE PROTEIN,9606.0


- **cross_references**: Contains cross-reference IDs for the target proteins.
- **organism**: The organism from which the target protein originates.
- **pref_name**: The preferred name of the target protein.
- **score**: A numerical score indicating how well the target matches the search query.
- **species_group_flag**: A boolean flag indicating if the target is a species group.
- **target_chembl_id**: The unique ChEMBL ID for the target protein.
- **target_components**: Contains accession numbers and descriptions of target components.
- **target_type**: Type of the target (e.g., SINGLE PROTEIN, SELECTIVITY GROUP).
- **tax_id**: Taxonomy ID for the organism.


### 3.1 Select Homo Sapiens Cyclooxygenase-2<a name = 4></a>

In [10]:
selected_target = target_hs.target_chembl_id.iloc[0]
selected_target

'CHEMBL230'
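
Selecting with `iloc[0]` relies on the row order of the search results. A minimal sketch of a more explicit selection, filtering on `organism`, `pref_name`, and `target_type`, is shown below on a small hand-made frame that mimics the columns above. `pick_target` is a hypothetical helper, not part of any library.

```python
import pandas as pd


def pick_target(targets, organism, pref_name, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single row matching all three criteria."""
    mask = (
        (targets["organism"] == organism)
        & (targets["pref_name"] == pref_name)
        & (targets["target_type"] == target_type)
    )
    hits = targets.loc[mask, "target_chembl_id"]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one match, found {len(hits)}")
    return hits.iloc[0]


# Toy rows copied from the search output above
toy_targets = pd.DataFrame(
    {
        "organism": ["Canis lupus familiaris", "Homo sapiens", "Homo sapiens"],
        "pref_name": ["Cyclooxygenase-2", "Cyclooxygenase-2", "Cyclooxygenase"],
        "target_type": ["SINGLE PROTEIN", "SINGLE PROTEIN", "PROTEIN FAMILY"],
        "target_chembl_id": ["CHEMBL4033", "CHEMBL230", "CHEMBL2094253"],
    }
)

selected = pick_target(toy_targets, "Homo sapiens", "Cyclooxygenase-2")
```

Raising an error when zero or several rows match makes an unexpected search result visible instead of silently picking the wrong target.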

### 3.2 Retrieve Bioactivity Data filtered by IC50<a name = 5></a>

In [11]:
activity = new_client.activity
bioactivities = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [12]:
activity_df = pd.DataFrame.from_dict(bioactivities)
activity_df

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,34205,[],CHEMBL762912,In vitro inhibitory activity against human pro...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,0.06
1,,,34209,[],CHEMBL762912,In vitro inhibitory activity against human pro...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,3.23
2,,,35476,[],CHEMBL762912,In vitro inhibitory activity against human pro...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,0.08
3,,,36218,[],CHEMBL769655,Tested in vitro for inhibition against Prostag...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,nM,UO_0000065,,0.12
4,,,36708,[],CHEMBL762912,In vitro inhibitory activity against human pro...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,100.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7499,,,24957364,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5214216,Selectivity interaction (NIBR principial panel...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,µM,,,0.8
7500,"{'action_type': 'INHIBITOR', 'description': 'N...",,24969909,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5217865,Inhibition of COX2 (unknown origin) assessed a...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,3.76
7501,"{'action_type': 'INHIBITOR', 'description': 'N...",,24969910,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5217865,Inhibition of COX2 (unknown origin) assessed a...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,2.29
7502,"{'action_type': 'INHIBITOR', 'description': 'N...",,24969911,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5217865,Inhibition of COX2 (unknown origin) assessed a...,B,,,BAO_0000190,...,Homo sapiens,Cyclooxygenase-2,9606,,,IC50,uM,UO_0000065,,1.93


In [13]:
activity_df.columns

Index(['action_type', 'activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint',
       'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment',
       'data_validity_description', 'document_chembl_id', 'document_journal',
       'document_year', 'ligand_efficiency', 'molecule_chembl_id',
       'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id',
       'standard_flag', 'standard_relation', 'standard_text_value',
       'standard_type', 'standard_units', 'standard_upper_value',
       'standard_value', 'target_chembl_id', 'target_organism',
       'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type',
       'units', 'uo_units', 'upper_value', 'value'],
      dtype='object')

In [14]:
# Columns to keep
columns_to_keep = ['canonical_smiles', 'molecule_chembl_id', 'parent_molecule_chembl_id',
                   'pchembl_value', 'standard_value', 'standard_relation', 'standard_units', 'standard_type',
                   'target_chembl_id', 'target_pref_name', 'molecule_pref_name', 'units']

# Dropping all other columns
activity_df = activity_df.loc[:, columns_to_keep]

In [15]:
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,parent_molecule_chembl_id,pchembl_value,standard_value,standard_relation,standard_units,standard_type,target_chembl_id,target_pref_name,molecule_pref_name,units
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,CHEMBL297008,7.22,60.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,CHEMBL289813,5.49,3230.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,CHEMBL43736,7.10,80.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,CHEMBL140167,9.92,0.12,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,nM
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,CHEMBL44194,,100000.0,>,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
...,...,...,...,...,...,...,...,...,...,...,...,...
7499,Cc1c(Nc2ccc(S(=O)(=O)N3CCN(C)CC3)cc2F)nc2ccc(N...,CHEMBL4758581,CHEMBL4758581,,0.8,=,µM,IC50,CHEMBL230,Cyclooxygenase-2,,µM
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,CHEMBL5220891,5.42,3760.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,CHEMBL5219013,5.64,2290.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
7502,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1Cl)...,CHEMBL5219227,CHEMBL5219227,5.71,1930.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM


### 3.3 Remove Empty entries<a name = 6></a>

In [16]:
empty_smiles = activity_df.loc[activity_df['canonical_smiles'].isna(), :]
len(empty_smiles)

33

In [17]:
empty_smiles

Unnamed: 0,canonical_smiles,molecule_chembl_id,parent_molecule_chembl_id,pchembl_value,standard_value,standard_relation,standard_units,standard_type,target_chembl_id,target_pref_name,molecule_pref_name,units
4009,,CHEMBL1366,CHEMBL1366,,,,,IC50,CHEMBL230,Cyclooxygenase-2,AURANOFIN,
4024,,CHEMBL1476898,CHEMBL1476898,,,,,IC50,CHEMBL230,Cyclooxygenase-2,CADMIUM ACETATE,
4035,,CHEMBL1458880,CHEMBL1458880,,,,,IC50,CHEMBL230,Cyclooxygenase-2,CADMIUM DICHLORIDE,
4199,,CHEMBL306043,CHEMBL306043,,,,,IC50,CHEMBL230,Cyclooxygenase-2,GOLD SODIUM THIOMALATE,
4216,,CHEMBL1201469,CHEMBL1201469,,,,,IC50,CHEMBL230,Cyclooxygenase-2,GRAMICIDIN,
4230,,CHEMBL1200431,CHEMBL1200431,,,,,IC50,CHEMBL230,Cyclooxygenase-2,GADOPENTETATE DIMEGLUMINE,
4251,,CHEMBL1909056,CHEMBL1909056,,,,,IC50,CHEMBL230,Cyclooxygenase-2,COBALT(II) ACETYLACETONATE,
4291,,CHEMBL11359,CHEMBL2068237,,,,,IC50,CHEMBL230,Cyclooxygenase-2,CISPLATIN,
4312,,CHEMBL1351,CHEMBL1351,,,,,IC50,CHEMBL230,Cyclooxygenase-2,CARBOPLATIN,
4316,,CHEMBL1909057,CHEMBL1909057,,,,,IC50,CHEMBL230,Cyclooxygenase-2,COPPER(II) OXIDE,


In [18]:
len(activity_df)

7504

In [19]:
activity_df = activity_df.loc[activity_df['canonical_smiles'].notna()]
len(activity_df)

7471

In [20]:
empty_std_val = activity_df.loc[activity_df['standard_value'].isna()]
len(empty_std_val)

969

In [21]:
empty_std_val

Unnamed: 0,canonical_smiles,molecule_chembl_id,parent_molecule_chembl_id,pchembl_value,standard_value,standard_relation,standard_units,standard_type,target_chembl_id,target_pref_name,molecule_pref_name,units
6,COC(=O)c1ccc(-n2c(C)ccc2-c2ccc(F)cc2)cc1,CHEMBL44464,CHEMBL44464,,,,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
13,CCN(CC)OC(=O)c1ccc(-n2c(C)ccc2-c2ccc(F)cc2)cc1,CHEMBL290608,CHEMBL290608,,,,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
23,Cc1ccc(-c2ccc(F)cc2)n1-c1ccc([S+](C)[O-])cc1,CHEMBL40891,CHEMBL40891,,,,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
33,CC(=O)c1ccc(-n2c(C)ccc2-c2ccc(F)cc2)cc1,CHEMBL442349,CHEMBL442349,,,,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
41,COc1ccc(-c2ccc(C)n2-c2ccc(C(C)=O)cc2)cc1,CHEMBL298079,CHEMBL298079,,,,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
...,...,...,...,...,...,...,...,...,...,...,...,...
7014,CC[C@H](C(=O)O)c1ccc2cc(OC)ccc2c1,CHEMBL4466610,CHEMBL4466610,,,,,IC50,CHEMBL230,Cyclooxygenase-2,,
7121,CC(c1cc2ccccc2s1)N(O)C(N)=O,CHEMBL93,CHEMBL93,,,,,IC50,CHEMBL230,Cyclooxygenase-2,ZILEUTON,
7372,COc1ccc2[nH]cc(CCNC(C)=O)c2c1,CHEMBL45,CHEMBL45,,,,,IC50,CHEMBL230,Cyclooxygenase-2,MELATONIN,
7428,CC[C@@]1(O)C(=O)OCc2c1cc1n(c2=O)Cc2cc3c(CN(C)C...,CHEMBL84,CHEMBL84,,,,,IC50,CHEMBL230,Cyclooxygenase-2,TOPOTECAN,


In [22]:
activity_df = activity_df.loc[activity_df['standard_value'].notna()]
len(activity_df)

6502
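
The two filtering steps above can also be expressed as a single call. The sketch below uses `DataFrame.dropna` with `subset` on a small made-up frame; the rows are illustrative and not taken from ChEMBL.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", None, "c1ccccc1", "CC(=O)O"],
        "standard_value": ["60.0", "12.0", None, "3230.0"],
    }
)

# Drop rows where either required column is missing
clean = toy.dropna(subset=["canonical_smiles", "standard_value"])
n_removed = len(toy) - len(clean)
```

Counting `n_removed` keeps the same bookkeeping as the separate `len(...)` checks in the cells above.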

In [23]:
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,parent_molecule_chembl_id,pchembl_value,standard_value,standard_relation,standard_units,standard_type,target_chembl_id,target_pref_name,molecule_pref_name,units
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,CHEMBL297008,7.22,60.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,CHEMBL289813,5.49,3230.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,CHEMBL43736,7.10,80.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,CHEMBL140167,9.92,0.12,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,nM
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,CHEMBL44194,,100000.0,>,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
...,...,...,...,...,...,...,...,...,...,...,...,...
7499,Cc1c(Nc2ccc(S(=O)(=O)N3CCN(C)CC3)cc2F)nc2ccc(N...,CHEMBL4758581,CHEMBL4758581,,0.8,=,µM,IC50,CHEMBL230,Cyclooxygenase-2,,µM
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,CHEMBL5220891,5.42,3760.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,CHEMBL5219013,5.64,2290.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
7502,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1Cl)...,CHEMBL5219227,CHEMBL5219227,5.71,1930.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM


### 3.4 Remove Duplicates<a name = 7></a>

In [25]:
len(activity_df['canonical_smiles'].unique())

4733

In [26]:
smiles_dupl = activity_df.loc[activity_df['canonical_smiles'].duplicated()]
smiles_dupl.sort_values(by='canonical_smiles')

Unnamed: 0,canonical_smiles,molecule_chembl_id,parent_molecule_chembl_id,pchembl_value,standard_value,standard_relation,standard_units,standard_type,target_chembl_id,target_pref_name,molecule_pref_name,units
3889,Brc1ccc(Oc2ccc(OCCN3CCCC3)cc2)cc1,CHEMBL1775104,CHEMBL1775104,5.29,5100.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
1991,C#CCCCC(=O)c1cc(C(C)(C)C)c(O)c(C(C)(C)C)c1,CHEMBL13878,CHEMBL13878,7.00,100.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,TEBUFELONE,uM
2355,C#CCCCC(=O)c1cc(C(C)(C)C)c(O)c(C(C)(C)C)c1,CHEMBL13878,CHEMBL13878,7.00,100.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,TEBUFELONE,uM
1986,C#CCCCC(=O)c1cc(C(C)(C)C)c2c(c1)C(C)(C)CO2,CHEMBL13920,CHEMBL13920,7.82,15.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
6019,C#Cc1cccc(Nc2ncnc3cc(OCCOC(=O)C(C)c4ccc5cc(OC)...,CHEMBL3740850,CHEMBL3740850,4.64,23000.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
...,...,...,...,...,...,...,...,...,...,...,...,...
6042,Oc1ccc(/C=C/c2cc(O)cc(O)c2)cc1,CHEMBL165,CHEMBL165,6.00,996.0,=,nM,IC50,CHEMBL230,Cyclooxygenase-2,RESVERATROL,uM
3877,c1cc(Oc2ccc(OCCN3CCCC3)cc2)ccn1,CHEMBL1775092,CHEMBL1775092,,100000.0,>,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
3871,c1ccc(Oc2ccc(OCCN3CCCC3)cc2)cc1,CHEMBL162424,CHEMBL162424,,100000.0,>,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM
3876,c1ccc(Oc2ccc(OCCN3CCCC3)cc2)nc1,CHEMBL1775091,CHEMBL1775091,,100000.0,>,nM,IC50,CHEMBL230,Cyclooxygenase-2,,uM


In [27]:
activity_df = activity_df.loc[~activity_df['canonical_smiles'].duplicated()]
len(activity_df)

4733
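
Filtering with `~duplicated()` keeps whichever measurement happens to appear first for each SMILES. One alternative, shown here only as a sketch on made-up values, is to aggregate repeated measurements per compound, for example with the median. This is not what the notebook does, and it changes the resulting values wherever replicates disagree.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "CCO", "CCO", "c1ccccc1"],
        "molecule_chembl_id": ["ID_A", "ID_A", "ID_A", "ID_B"],  # placeholder IDs
        "standard_value": [100.0, 300.0, 200.0, 50.0],
    }
)

# Equivalent to the notebook's approach: keep the first row per SMILES
first_only = toy.drop_duplicates(subset="canonical_smiles", keep="first")

# Alternative: one row per SMILES with the median of all measurements
median_agg = toy.groupby("canonical_smiles", as_index=False).agg(
    molecule_chembl_id=("molecule_chembl_id", "first"),
    standard_value=("standard_value", "median"),
)
```

Both results have one row per structure; they differ only in which value represents a compound with several measurements.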

In [28]:
activity_df = activity_df[['canonical_smiles', 'molecule_chembl_id', 'standard_value', 'standard_units', 'standard_type']]
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,standard_value,standard_units,standard_type
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,60.0,nM,IC50
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,3230.0,nM,IC50
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,80.0,nM,IC50
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,0.12,nM,IC50
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,100000.0,nM,IC50
...,...,...,...,...,...
7498,O=c1[nH]cc(CSc2ccc(C(F)F)cc2)[nH]c1=O,CHEMBL5191702,10000.0,nM,IC50
7499,Cc1c(Nc2ccc(S(=O)(=O)N3CCN(C)CC3)cc2F)nc2ccc(N...,CHEMBL4758581,0.8,µM,IC50
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,3760.0,nM,IC50
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,2290.0,nM,IC50


### 3.5 Manage Standard Value Units<a name = 8></a>

In [29]:
activity_df.loc[activity_df['standard_units'] != 'nM']

Unnamed: 0,canonical_smiles,molecule_chembl_id,standard_value,standard_units,standard_type
3023,C=C[C@@H]1[C@@H](O)C[C@@H]2[C@]3(CC[C@]4(C)[C@...,CHEMBL479210,4.7,ug,IC50
3117,O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12,CHEMBL151,100.0,ug.mL-1,IC50
3129,COc1cc2oc(-c3ccc(O)cc3)cc(=O)c2c(O)c1OC,CHEMBL348436,6.0,%,IC50
3221,COc1cc(C(O)C(COC(=O)/C=C/c2ccc(O)cc2)Oc2c(OC)c...,CHEMBL501943,100.0,ug.mL-1,IC50
3222,COc1cc(C(O)C(COC(=O)/C=C/c2ccc(O)cc2)Oc2ccc(/C...,CHEMBL455027,100.0,ug.mL-1,IC50
3223,C[C@@H]1O[C@@H](O[C@@H]2Cc3c(O)cc(O)cc3O[C@@H]...,CHEMBL517484,100.0,ug.mL-1,IC50
3228,COC(=O)[C@@H](c1ccccc1)[C@H](C1=C(O)/C(=C/c2cc...,CHEMBL463211,100.0,ug.mL-1,IC50
3270,C=C(CC[C@@H](C)[C@H]1CC[C@H]2[C@@H]3CC=C4C(=O)...,CHEMBL470866,10.0,ug.mL-1,IC50
7329,Cc1ccc2oc(=O)cc(CSc3nnc(CSc4nc5ccccc5o4)o3)c2c1,CHEMBL4856182,19.95,ug.mL-1,IC50
7330,COc1ccc2oc(=O)cc(CSc3nnc(CSc4nc5ccccc5o4)o3)c2c1,CHEMBL4868808,32.93,ug.mL-1,IC50


In [30]:
len(activity_df.loc[activity_df['standard_units'] != 'nM'])

22

In [31]:
activity_df = activity_df[activity_df['standard_units'] == 'nM']
len(activity_df)

4711
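
Filtering on `standard_units == 'nM'` discards rows reported in other units. Molar units such as µM could instead be converted to nM, whereas mass-based units (e.g. `ug.mL-1`) need a molecular weight and percentages cannot be converted at all. A minimal sketch on made-up rows, assuming the hypothetical `TO_NM` factor table below covers the molar units of interest:

```python
import pandas as pd

# Assumed conversion factors to nM; extend only with units you have verified
TO_NM = {"nM": 1.0, "uM": 1e3, "µM": 1e3, "mM": 1e6, "M": 1e9}

toy = pd.DataFrame(
    {
        "standard_value": [60.0, 0.8, 100.0, 6.0],
        "standard_units": ["nM", "µM", "ug.mL-1", "%"],
    }
)

# Units missing from TO_NM map to NaN and are left out
factor = toy["standard_units"].map(TO_NM)
convertible = factor.notna()

converted = toy.loc[convertible].copy()
converted["standard_value"] = converted["standard_value"] * factor[convertible]
converted["standard_units"] = "nM"
```

Whether the extra rows are worth keeping depends on the dataset; in this notebook only 22 rows are affected.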

### 3.6 Label: Active, Inactive, Intermediate<a name = 9></a>

In [32]:
activity_df['standard_value'].describe()

count         4711
unique        1470
top       100000.0
freq           368
Name: standard_value, dtype: object

In [33]:
activity_df = activity_df.copy()
activity_df['standard_value'] = activity_df['standard_value'].astype('float64')

In [34]:
activity_df['standard_value'].describe()

count    4.711000e+03
mean     5.641769e+04
std      1.003254e+06
min      0.000000e+00
25%      1.700000e+02
50%      2.000000e+03
75%      1.616500e+04
max      6.000000e+07
Name: standard_value, dtype: float64

In [35]:
bio_class = []

# Lower IC50 means higher potency: < 1,000 nM is active, >= 10,000 nM is inactive
for i in activity_df['standard_value']:
    if i >= 10000:
        bio_class.append('inactive')
    elif i >= 1000:
        bio_class.append('intermediate')
    else:
        bio_class.append('active')

In [36]:
activity_df['class'] = bio_class

In [37]:
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,standard_value,standard_units,standard_type,class
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,60.00,nM,IC50,active
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,3230.00,nM,IC50,intermediate
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,80.00,nM,IC50,active
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,0.12,nM,IC50,active
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,100000.00,nM,IC50,inactive
...,...,...,...,...,...,...
7497,O=c1[nH]cc(CSc2ccc(C(F)(F)F)cc2)[nH]c1=O,CHEMBL5178658,10000.00,nM,IC50,inactive
7498,O=c1[nH]cc(CSc2ccc(C(F)F)cc2)[nH]c1=O,CHEMBL5191702,10000.00,nM,IC50,inactive
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,3760.00,nM,IC50,intermediate
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,2290.00,nM,IC50,intermediate
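
Because a lower IC50 means a more potent compound, a common convention labels compounds with IC50 below 1,000 nM as active and those at or above 10,000 nM as inactive. The labelling can also be written in vectorized form. The sketch below uses `numpy.select` on made-up values; the cut-offs are the convention assumed here and may need adjusting for a given project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"standard_value": [0.12, 60.0, 1000.0, 3230.0, 10000.0, 100000.0]}  # nM
)

conditions = [
    toy["standard_value"] < 1000,     # potent compounds
    toy["standard_value"] >= 10000,   # weak compounds
]
labels = ["active", "inactive"]

# Everything between the two cut-offs falls through to the default label
toy["class"] = np.select(conditions, labels, default="intermediate")
```

Keeping the thresholds in one place makes it easier to check that potent compounds end up in the `active` class.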


### 3.7 Convert IC50 to pIC50<a name = 10></a>

In [38]:
# Replace 0 with a small value (e.g., 1 nM) to avoid log(0)
activity_df['standard_value'] = activity_df['standard_value'].replace(0,1)

In [39]:
pIC50 = []

for std_values in activity_df['standard_value']:
    molar = std_values*(10**-9)
    pIC50.append(-np.log10(molar))

In [40]:
pIC50[:10]

[7.221848749616356,
 5.490797477668897,
 7.096910013008056,
 9.920818753952375,
 4.0,
 5.79317412396815,
 4.0,
 4.0,
 4.0,
 4.0]
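
Since pIC50 = -log10(IC50 in mol/L) and 1 nM = 10^-9 mol/L, the conversion simplifies to pIC50 = 9 - log10(IC50 in nM). A vectorized sketch of the same calculation with NumPy, using a few values from the table above:

```python
import numpy as np

# IC50 values in nM
ic50_nM = np.array([60.0, 3230.0, 80.0, 0.12, 100000.0])

# pIC50 = -log10(IC50 [M]) = 9 - log10(IC50 [nM])
pic50 = 9.0 - np.log10(ic50_nM)
```

The values must be strictly positive, which is why zeros are replaced before the conversion.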

In [41]:
activity_df['pIC50'] = pIC50
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,standard_value,standard_units,standard_type,class,pIC50
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,60.00,nM,IC50,active,7.221849
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,3230.00,nM,IC50,intermediate,5.490797
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,80.00,nM,IC50,active,7.096910
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,0.12,nM,IC50,active,9.920819
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,100000.00,nM,IC50,inactive,4.000000
...,...,...,...,...,...,...,...
7497,O=c1[nH]cc(CSc2ccc(C(F)(F)F)cc2)[nH]c1=O,CHEMBL5178658,10000.00,nM,IC50,inactive,5.000000
7498,O=c1[nH]cc(CSc2ccc(C(F)F)cc2)[nH]c1=O,CHEMBL5191702,10000.00,nM,IC50,inactive,5.000000
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,3760.00,nM,IC50,intermediate,5.424812
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,2290.00,nM,IC50,intermediate,5.640165


In [42]:
activity_df.drop(['standard_value', 'standard_units', 'standard_type'], axis=1, inplace=True)

In [43]:
activity_df

Unnamed: 0,canonical_smiles,molecule_chembl_id,class,pIC50
0,Cc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccccc1,CHEMBL297008,active,7.221849
1,Cc1c(C=O)cc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL289813,intermediate,5.490797
2,Cc1c(COc2cccc(Cl)c2)cc(-c2ccc(S(C)(=O)=O)cc2)n...,CHEMBL43736,active,7.096910
3,Fc1ccc(-c2[nH]c(-c3ccc(F)cc3)c3c2C2CCC3CC2)cc1,CHEMBL140167,active,9.920819
4,CCc1ccc(-c2ccc(S(C)(=O)=O)cc2)n1-c1ccc(F)cc1,CHEMBL44194,inactive,4.000000
...,...,...,...,...
7497,O=c1[nH]cc(CSc2ccc(C(F)(F)F)cc2)[nH]c1=O,CHEMBL5178658,inactive,5.000000
7498,O=c1[nH]cc(CSc2ccc(C(F)F)cc2)[nH]c1=O,CHEMBL5191702,inactive,5.000000
7500,C[C@H](C(=O)OC(Cn1ccnc1)c1ccc(F)cc1)c1ccc(-c2c...,CHEMBL5220891,intermediate,5.424812
7501,C[C@H](C(=O)N(C)CC(O)(Cn1ccnc1)c1ccc(Cl)cc1)c1...,CHEMBL5219013,intermediate,5.640165


## 4. Molecular Visualization<a name = 11></a>

In [44]:
# Build an RDKit Mol object for each SMILES string
activity_df['Molecules'] = activity_df['canonical_smiles'].apply(Chem.MolFromSmiles)

In [45]:
activity_df.sort_values(by='pIC50', ascending=False)

Unnamed: 0,canonical_smiles,molecule_chembl_id,class,pIC50,Molecules
3044,COc1ccc(-c2c(-c3ccc(S(N)(=O)=O)cc3)[nH]c3ccccc...,CHEMBL499068,active,11.221849,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d74900>
7481,[11CH3]Oc1ccc(-c2c(-c3ccc(S(N)(=O)=O)cc3)[nH]c...,CHEMBL5196097,active,11.221849,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0bc19a0>
3043,COc1ccc(-c2c(-c3ccc(S(C)(=O)=O)cc3)[nH]c3ccccc...,CHEMBL499069,active,10.698970,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d74890>
3047,CS(=O)(=O)c1ccc(-c2[nH]c3ccccc3c2-c2ccc(F)cc2)cc1,CHEMBL501208,active,10.698970,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d74a50>
3049,Cc1ccc2[nH]c(-c3ccc(S(N)(=O)=O)cc3)c(-c3ccccc3...,CHEMBL525247,active,10.698970,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d74b30>
...,...,...,...,...,...
2359,CS(=O)(=O)c1ccc(-c2occ(Cl)c(=O)c2-c2ccccc2F)cc1,CHEMBL309156,inactive,2.200000,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d68120>
2365,COc1ccc(-c2c(-c3ccc(S(C)(=O)=O)cc3)occ(Cl)c2=O...,CHEMBL69683,inactive,2.163000,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d683c0>
2369,CS(=O)(=O)c1ccc(-c2occ(Cl)c(=O)c2-c2cccnc2)cc1,CHEMBL69030,inactive,2.151900,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d68580>
5212,CS(=O)(=O)c1ccc(-c2nc(NC3CC=CCC3)cc(C(F)(F)F)n...,CHEMBL2234864,inactive,1.517190,<rdkit.Chem.rdchem.Mol object at 0x7ac8e0d8e5e0>


In [46]:
mols2grid.display(activity_df, mol_col = 'Molecules', subset = ['molecule_chembl_id', 'class', 'pIC50'], transform={"pIC50": lambda x: f"{x:.2f}"})

MolGridWidget()

## 5. Save DataSet<a name = 12></a>

In [47]:
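# Drop the RDKit Mol column before saving; in a CSV it would only be written as an object repr
activity_df.drop(columns=['Molecules']).to_csv('bioactivity_final_cox2.csv', index=False)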
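
RDKit `Mol` objects do not survive a CSV round-trip; only their string representation is written. Below is a sketch of saving just the plain columns and reading them back, using a temporary directory and a stand-in object column. The file name `bioactivity_demo.csv` and the `object()` placeholders are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "c1ccccc1"],
        "class": ["active", "inactive"],
        "pIC50": [7.22, 4.00],
        "Molecules": [object(), object()],  # placeholder for RDKit Mol objects
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bioactivity_demo.csv")
    # Write only columns that serialize cleanly; molecules can be rebuilt from SMILES
    toy.drop(columns=["Molecules"]).to_csv(path, index=False)
    restored = pd.read_csv(path)
```

After reloading, the molecule column can be regenerated from `canonical_smiles` in the same way as in the visualization step.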