## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. These are real measurements that researchers also use in their work.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
[?25l[K     |██████                          | 10 kB 25.5 MB/s eta 0:00:01[K     |███████████▉                    | 20 kB 10.1 MB/s eta 0:00:01[K     |█████████████████▊              | 30 kB 8.8 MB/s eta 0:00:01[K     |███████████████████████▋        | 40 kB 8.0 MB/s eta 0:00:01[K     |█████████████████████████████▌  | 51 kB 4.5 MB/s eta 0:00:01[K     |████████████████████████████████| 55 kB 1.7 MB/s 
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3

## **Importing libraries**

In [3]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Anoctamin**



In [4]:
# Search ChEMBL for targets matching "anoctamin"
target = new_client.target
# The query behaves like the search box on the ChEMBL website
target_query = target.search('anoctamin')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Anoctamin-1,18.0,False,CHEMBL4105874,"[{'accession': 'Q8BHY3', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Homo sapiens,Anoctamin-1,17.0,False,CHEMBL2046267,"[{'accession': 'Q5XXA6', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Anoctamin-2,17.0,False,CHEMBL4105767,"[{'accession': 'Q9NQ90', 'component_descriptio...",SINGLE PROTEIN,9606


### **Select and retrieve bioactivity data for *Human Anoctamin-1* (second entry)**

We will assign the second entry (index 1, which corresponds to the target protein *Human Anoctamin-1*) to the ***selected_target*** variable.

In [5]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL2046267'
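
Picking the target by position works for the table shown here, but the order of search hits may differ between runs. A minimal sketch of a selection by organism and preferred name follows; `pick_target` and `targets_demo` are hypothetical names, and the small table is copied by hand from the search output above rather than fetched from ChEMBL.

```python
import pandas as pd

# Hand-copied subset of the search result shown above (three rows, four columns).
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Anoctamin-1", "Anoctamin-1", "Anoctamin-2"],
    "target_type": ["SINGLE PROTEIN"] * 3,
    "target_chembl_id": ["CHEMBL4105874", "CHEMBL2046267", "CHEMBL4105767"],
})


def pick_target(targets, organism, pref_name):
    """Return the ChEMBL ID of the single row matching organism and name."""
    mask = (targets["organism"] == organism) & (targets["pref_name"] == pref_name)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong protein.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


selected_demo = pick_target(targets_demo, "Homo sapiens", "Anoctamin-1")
print(selected_demo)
```

With the real `targets` DataFrame, the same call would replace the hard-coded index, assuming the `organism` and `pref_name` columns are present as in the output above.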

Here, we will retrieve only bioactivity data for *Human Anoctamin-1* (CHEMBL2046267) that are reported as IC50 values.

In [6]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [7]:
df = pd.DataFrame.from_dict(res)

In [8]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
118,Not Active,18873553,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4459865,,CHEMBL4459865,,False,,3149443,,1,False,,,IC50,,,,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,,,,
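
The `pchembl_value` column in the table above appears to be the negative base-10 logarithm of the molar IC50, and it is empty for rows whose relation is `>` rather than `=`. The sketch below checks that reading against three rows of the output; `pchembl_from_nm` is a hypothetical helper, not part of the ChEMBL client.

```python
import math


def pchembl_from_nm(ic50_nm):
    """Convert an IC50 given in nM to -log10 of the molar concentration."""
    return -math.log10(ic50_nm * 1e-9)


# standard_value entries (nM) taken from rows 0, 1 and 2 of the table above
for ic50_nm in (28700.0, 5900.0, 16300.0):
    print(ic50_nm, round(pchembl_from_nm(ic50_nm), 2))
```

Rounded to two decimals, the results match the 4.54, 5.23 and 4.79 reported for those rows.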


Finally, we will save the resulting bioactivity data to a CSV file, **anoctamin_data_raw.csv**.

In [9]:
#save data into your system
df.to_csv('anoctamin_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [10]:
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,,18873549,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,,,CHEMBL4308858,Eur J Med Chem,2018,"{'bei': '14.83', 'le': '0.29', 'lle': '1.93', ...",CHEMBL4517769,,CHEMBL4517769,4.50,False,http://www.openphacts.org/units/Nanomolar,3149439,=,1,True,=,,IC50,nM,,31300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,31.3
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
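
The same filter can be expressed in one step with `DataFrame.dropna` and its `subset` argument. The sketch below uses three rows copied from the output above (rows 0, 115 and 118); storing `standard_value` as strings is an assumption made here to show the numeric coercion, not a statement about what the API returns.

```python
import pandas as pd

# Three records copied from the raw table; the last one has no standard_value.
records = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL2046972", "CHEMBL4581421", "CHEMBL4459865"],
    "canonical_smiles": [
        "COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1",
        "Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1",
        "Cc1cc(Cl)ccc1OC(C)(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F",
    ],
    "standard_value": ["28700.0", "100000.0", None],
})

# Coerce to numbers so that anything unparseable becomes NaN as well.
records["standard_value"] = pd.to_numeric(records["standard_value"], errors="coerce")

# Drop rows missing either of the two columns needed downstream.
cleaned = records.dropna(subset=["standard_value", "canonical_smiles"])
print(cleaned[["molecule_chembl_id", "standard_value"]])
```

Only the "Not Active" record without a measured value is removed, which mirrors what happens to row 118 in the notebook.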


In [11]:
len(df2.canonical_smiles.unique())

107

In [12]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,,18873549,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,,,CHEMBL4308858,Eur J Med Chem,2018,"{'bei': '14.83', 'le': '0.29', 'lle': '1.93', ...",CHEMBL4517769,,CHEMBL4517769,4.50,False,http://www.openphacts.org/units/Nanomolar,3149439,=,1,True,=,,IC50,nM,,31300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,31.3
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
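
`drop_duplicates` keeps the first row it sees for each SMILES string, so the retained IC50 depends on row order. The sketch below contrasts that with one possible alternative, taking the median per structure. The two molecules and their values are made up purely for illustration and do not come from the Anoctamin-1 data.

```python
import pandas as pd

# Made-up measurements: ethanol appears twice with different values.
dup_demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "c1ccccc1"],
    "standard_value": [1000.0, 3000.0, 500.0],
})

# Behaviour used in the notebook: the first occurrence wins.
first_kept = dup_demo.drop_duplicates(["canonical_smiles"])

# Alternative: one row per structure carrying the median of its measurements.
median_per_smiles = (
    dup_demo.groupby("canonical_smiles", as_index=False)["standard_value"].median()
)

print(first_kept)
print(median_per_smiles)
```

Which strategy is appropriate depends on how much the repeated measurements disagree; the notebook keeps the first occurrence.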


## **Data pre-processing of the bioactivity data**

### **Select the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [13]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0
...,...,...,...
114,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0
115,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0
116,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0
117,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0


Save the DataFrame to a CSV file.

In [14]:
df3.to_csv('Anoctamin_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.

In [15]:
df4 = pd.read_csv('Anoctamin_data_preprocessed.csv')

In [16]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [17]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0,inactive
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0,intermediate
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0,inactive
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0,inactive
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0,inactive
...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0,inactive
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0,inactive
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0,inactive
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0,inactive
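
The thresholding loop above can also be written without an explicit Python loop. The sketch below uses `numpy.select` with the same cut-offs; the first three values come from the table, while 1000.0 and 10000.0 are added here only to show how the boundaries are treated.

```python
import numpy as np
import pandas as pd

# IC50 values in nM: three from the table plus the two boundary cases.
values = pd.Series([28700.0, 5900.0, 16300.0, 1000.0, 10000.0], name="standard_value")

# Conditions are evaluated in order; anything matching neither is "intermediate".
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]
classes = pd.Series(np.select(conditions, labels, default="intermediate"), name="class")

print(pd.concat([values, classes], axis=1))
```

Because both comparisons are inclusive, a value of exactly 1,000 nM is labeled active and exactly 10,000 nM is labeled inactive, as in the loop.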


Save the DataFrame to a CSV file.

In [18]:
# checkpoint 2
df5.to_csv('Anoctamin1_bioactivity_data_curated.csv', index=False)

In [None]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

In [21]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [23]:
df5['canonical_smiles'][0]

'COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1'

In [24]:
molecule = Chem.MolFromSmiles(df5['canonical_smiles'][0])

In [25]:
molecule

<rdkit.Chem.rdchem.Mol at 0x7f8449ce4bc0>

In [26]:
# https://codeocean.com/explore/capsules?query=tag:data-curation

def lipinski(smiles, verbose=False):
    """Compute the four Lipinski descriptors for an iterable of SMILES strings."""
    # `verbose` is unused; it is kept so that existing calls keep working.
    rows = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)
        rows.append([Descriptors.MolWt(mol),        # molecular weight
                     Descriptors.MolLogP(mol),      # Crippen logP estimate
                     Lipinski.NumHDonors(mol),      # hydrogen-bond donors
                     Lipinski.NumHAcceptors(mol)])  # hydrogen-bond acceptors

    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]
    # Float array with one row per molecule, even when only one SMILES is given
    baseData = np.array(rows, dtype=float).reshape(-1, len(columnNames))
    descriptors = pd.DataFrame(data=baseData, columns=columnNames)

    return descriptors

In [27]:
df_lipinski = lipinski(df5['canonical_smiles'])
df_lipinski

Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors
0,392.382,5.52470,1.0,4.0
1,500.288,5.99020,1.0,4.0
2,453.288,6.14810,1.0,4.0
3,408.837,6.03900,1.0,4.0
4,410.372,5.66380,1.0,4.0
...,...,...,...,...
102,303.749,2.57252,1.0,4.0
103,301.733,3.67242,0.0,5.0
104,413.861,3.60692,1.0,6.0
105,400.784,4.46462,1.0,4.0


In [28]:
df6 = pd.concat([df5,df_lipinski],axis=1)

In [29]:
df6

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0,inactive,392.382,5.52470,1.0,4.0
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0,intermediate,500.288,5.99020,1.0,4.0
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0,inactive,453.288,6.14810,1.0,4.0
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0,inactive,408.837,6.03900,1.0,4.0
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0,inactive,410.372,5.66380,1.0,4.0
...,...,...,...,...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0,inactive,303.749,2.57252,1.0,4.0
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0,inactive,301.733,3.67242,0.0,5.0
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0,inactive,413.861,3.60692,1.0,6.0
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0,inactive,400.784,4.46462,1.0,4.0


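Lipinski's rule of five uses the following thresholds for the descriptors computed above:
NumHDonors <= 5, 
NumHAcceptors <= 10, 
LogP <= 5, 
MW <= 500.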
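
The four thresholds above can be applied directly to the descriptor table. The sketch below counts violations per compound, using the customary form of the rule (no more than 5 donors, no more than 10 acceptors, LogP no greater than 5, MW no greater than 500). `count_ro5_violations` is a hypothetical helper, and the three rows are copied from rows 0, 1 and 102 of the descriptor output.

```python
import pandas as pd

# Descriptor rows copied from the Lipinski table (rows 0, 1 and 102).
desc_demo = pd.DataFrame({
    "MW": [392.382, 500.288, 303.749],
    "LogP": [5.52470, 5.99020, 2.57252],
    "NumHDonors": [1.0, 1.0, 1.0],
    "NumHAcceptors": [4.0, 4.0, 4.0],
})


def count_ro5_violations(desc):
    """Return, per row, how many of the four thresholds are exceeded."""
    violations = (
        (desc["MW"] > 500).astype(int)
        + (desc["LogP"] > 5).astype(int)
        + (desc["NumHDonors"] > 5).astype(int)
        + (desc["NumHAcceptors"] > 10).astype(int)
    )
    return violations.rename("ro5_violations")


print(pd.concat([desc_demo, count_ro5_violations(desc_demo)], axis=1))
```

The first compound exceeds only the LogP limit, the second exceeds both the LogP and MW limits, and the third passes all four checks.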

---