## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. This is real experimental data, and researchers use it in their own work.

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Anoctamin**



In [3]:
# Target search for anoctamin
target = new_client.target
# The query works the same way as the search box on the ChEMBL website
target_query = target.search('anoctamin')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Anoctamin-1,18.0,False,CHEMBL4105874,"[{'accession': 'Q8BHY3', 'component_descriptio...",SINGLE PROTEIN,10090
1,[],Homo sapiens,Anoctamin-1,17.0,False,CHEMBL2046267,"[{'accession': 'Q5XXA6', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,Anoctamin-2,17.0,False,CHEMBL4105767,"[{'accession': 'Q9NQ90', 'component_descriptio...",SINGLE PROTEIN,9606


### **Select and retrieve bioactivity data for *Human Anoctamin-1* (second entry)**

We will assign the second entry (index 1, which corresponds to the target protein *Human Anoctamin-1*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[1]
selected_target

'CHEMBL2046267'
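
Selecting by position depends on the order in which the search returns its results. A minimal sketch of selecting by content instead is shown here, using a small hand-built table that mirrors three columns of the search output above. The `pick_target` helper and the `targets_demo` frame are hypothetical and are not part of the ChEMBL client.

```python
import pandas as pd

# Hand-built stand-in for the search result shown above (three columns only).
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Anoctamin-1", "Anoctamin-1", "Anoctamin-2"],
    "target_chembl_id": ["CHEMBL4105874", "CHEMBL2046267", "CHEMBL4105767"],
})

def pick_target(df, organism, pref_name):
    """Return the one target_chembl_id matching organism and name (hypothetical helper)."""
    mask = (df["organism"] == organism) & (df["pref_name"] == pref_name)
    matches = df.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly rather than silently picking the wrong target.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected_demo = pick_target(targets_demo, "Homo sapiens", "Anoctamin-1")
print(selected_demo)
```

With the real `targets` frame, the same call would replace `targets.target_chembl_id[1]`, assuming the `organism` and `pref_name` columns are present as in the output above.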

Here, we will retrieve only the bioactivity data for *Human Anoctamin-1* (CHEMBL2046267) that are reported as IC50 values.

In [5]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [6]:
df = pd.DataFrame.from_dict(res)

In [7]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
118,Not Active,18873553,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4459865,,CHEMBL4459865,,False,,3149443,,1,False,,,IC50,,,,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,,,,


Finally, we will save the resulting bioactivity data to a CSV file, **anoctamin_data_raw.csv**.

In [8]:
# Save the raw data to disk
df.to_csv('anoctamin_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [9]:
df2 = df[df.standard_value.notna()]
# Build the second mask from df2 (not df) so that it matches the rows being filtered
df2 = df2[df2.canonical_smiles.notna()]
df2

  


Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,,18873549,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,,,CHEMBL4308858,Eur J Med Chem,2018,"{'bei': '14.83', 'le': '0.29', 'lle': '1.93', ...",CHEMBL4517769,,CHEMBL4517769,4.50,False,http://www.openphacts.org/units/Nanomolar,3149439,=,1,True,=,,IC50,nM,,31300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,31.3
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
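
The two-step filter above can also be written as a single `dropna` call with a `subset` argument. This is a minimal sketch on made-up rows (the IDs `A` to `D` are placeholders, not ChEMBL records), offered as an equivalent form rather than as the notebook's own method.

```python
import numpy as np
import pandas as pd

# Made-up rows: B lacks a SMILES string, C lacks a standard_value.
demo = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C", "D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": [28700.0, 5900.0, np.nan, 100.0],
})

# Drop a row if either required column is missing.
required = ["standard_value", "canonical_smiles"]
demo_clean = demo.dropna(subset=required)

print(demo_clean["molecule_chembl_id"].tolist())
```

Only rows with both fields present survive, and the original index labels are kept, which is why gaps appear in the index after filtering.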


In [10]:
len(df2.canonical_smiles.unique())

107

In [11]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,10943671,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.58', 'le': '0.21', 'lle': '-0.98',...",CHEMBL2046972,,CHEMBL2046972,4.54,False,http://www.openphacts.org/units/Nanomolar,1651934,=,1,True,=,,IC50,nM,,28700.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,28.7
1,,10943672,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.45', 'le': '0.25', 'lle': '-0.76',...",CHEMBL2046973,,CHEMBL2046973,5.23,False,http://www.openphacts.org/units/Nanomolar,1651935,=,1,True,=,,IC50,nM,,5900.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,5.9
2,,10943673,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '10.56', 'le': '0.23', 'lle': '-1.36',...",CHEMBL2046974,,CHEMBL2046974,4.79,False,http://www.openphacts.org/units/Nanomolar,1651936,=,1,True,=,,IC50,nM,,16300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,16.3
3,,10943674,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.09', 'le': '0.21', 'lle': '-1.51',...",CHEMBL2047075,,CHEMBL2047075,4.54,False,http://www.openphacts.org/units/Nanomolar,1652074,=,1,True,=,,IC50,nM,,29200.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,29.2
4,,10943675,[],CHEMBL2050116,Inhibition of human TMEM16A transfected in FRT...,B,,,BAO_0000190,BAO_0000219,cell-based format,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,,,CHEMBL2046432,Bioorg. Med. Chem.,2012,"{'bei': '11.13', 'le': '0.21', 'lle': '-1.09',...",CHEMBL2047076,,CHEMBL2047076,4.57,False,http://www.openphacts.org/units/Nanomolar,1652075,=,1,True,=,,IC50,nM,,27000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,27.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,,18873549,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,,,CHEMBL4308858,Eur J Med Chem,2018,"{'bei': '14.83', 'le': '0.29', 'lle': '1.93', ...",CHEMBL4517769,,CHEMBL4517769,4.50,False,http://www.openphacts.org/units/Nanomolar,3149439,=,1,True,=,,IC50,nM,,31300.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,31.3
115,,18873550,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4581421,,CHEMBL4581421,,False,http://www.openphacts.org/units/Nanomolar,3149440,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
116,,18873551,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4471557,,CHEMBL4471557,,False,http://www.openphacts.org/units/Nanomolar,3149441,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
117,,18873552,[],CHEMBL4310424,Inhibition of YFP-fused ANO1 (unknown origin) ...,B,,,BAO_0000190,BAO_0000219,cell-based format,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,,,CHEMBL4308858,Eur J Med Chem,2018,,CHEMBL4453183,,CHEMBL4453183,,False,http://www.openphacts.org/units/Nanomolar,3149442,>,1,True,>,,IC50,nM,,100000.0,CHEMBL2046267,Homo sapiens,Anoctamin-1,9606,,,IC50,uM,UO_0000065,,100.0
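
`drop_duplicates` keeps the first row it encounters for each SMILES string and discards the rest, so the IC50 that survives depends on row order. One alternative, which is not what this notebook does, is to aggregate repeated measurements, for example with the median. A sketch on made-up values (the IDs `X1` and `X2` are placeholders):

```python
import pandas as pd

# Made-up data: compound X1 was measured three times, X2 once.
demo = pd.DataFrame({
    "molecule_chembl_id": ["X1", "X1", "X2", "X1"],
    "canonical_smiles": ["CCO", "CCO", "CCN", "CCO"],
    "standard_value": [100.0, 300.0, 50.0, 200.0],
})

# Approach used in this notebook: keep the first occurrence only.
first_kept = demo.drop_duplicates(["canonical_smiles"])

# Alternative: one row per compound, holding the median of its measurements.
aggregated = (
    demo.groupby(["molecule_chembl_id", "canonical_smiles"], as_index=False, sort=False)["standard_value"]
    .median()
)

print(first_kept["standard_value"].tolist())
print(aggregated["standard_value"].tolist())
```

Which approach is appropriate depends on how the replicate measurements were produced; the median is only one reasonable choice.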


## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [12]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0
...,...,...,...
114,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0
115,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0
116,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0
117,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0


Save the DataFrame to a CSV file.

In [13]:
df3.to_csv('Anoctamin_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with an IC50 of 1,000 nM or less are considered **active**, those with an IC50 of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.

In [14]:
df4 = pd.read_csv('Anoctamin_data_preprocessed.csv')

In [15]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")

In [16]:
bioactivity_class = pd.Series(bioactivity_threshold, name='class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0,inactive
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0,intermediate
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0,inactive
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0,inactive
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0,inactive
...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0,inactive
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0,inactive
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0,inactive
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0,inactive
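
The labeling loop above can also be expressed without an explicit loop. This is a minimal sketch using `numpy.select` with the same cut-offs (10,000 nM or more is inactive, 1,000 nM or less is active); the `label_bioactivity` helper is hypothetical.

```python
import numpy as np
import pandas as pd

def label_bioactivity(values):
    """Label IC50 values (nM) with the cut-offs used in this notebook (hypothetical helper)."""
    values = pd.Series(values, dtype=float)
    # Conditions are checked in order; anything matching neither is "intermediate".
    conditions = [values >= 10000, values <= 1000]
    labels = np.select(conditions, ["inactive", "active"], default="intermediate")
    return pd.Series(labels, index=values.index, name="class")

example = label_bioactivity([28700.0, 5900.0, 1000.0, 10000.0, 300.0])
print(example.tolist())
```

Because the returned Series reuses the index of its input, it can be concatenated with the source frame in the same way as `bioactivity_class`.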


Save the DataFrame to a CSV file.

In [17]:
# checkpoint 2
df5.to_csv('Anoctamin1_bioactivity_data_curated.csv', index=False)

In [18]:
! wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
! chmod +x Miniconda3-py37_4.8.2-Linux-x86_64.sh
! bash ./Miniconda3-py37_4.8.2-Linux-x86_64.sh -b -f -p /usr/local
! conda install -c rdkit rdkit -y
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

--2022-01-27 17:10:55--  https://repo.anaconda.com/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.131.3, 104.16.130.3, 2606:4700::6810:8303, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.131.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85055499 (81M) [application/x-sh]
Saving to: ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’


2022-01-27 17:10:55 (151 MB/s) - ‘Miniconda3-py37_4.8.2-Linux-x86_64.sh’ saved [85055499/85055499]

PREFIX=/usr/local
Unpacking payload ...
Collecting package metadata (current_repodata.json): - \ | / done
Solving environment: \ | done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - _libgcc_mutex==0.1=main
    - asn1crypto==1.3.0=py37_0
    - ca-certificates==2020.1.1=0
    - certifi==2019.11.28=py37_0
    - cffi==1.14.0=py37h2e261b9_0
    - chardet==3.0.4=py37_1003
    - conda-package-handling==1.6.0=py37h7b6447

In [19]:
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

In [20]:
df5['canonical_smiles'][0]

'COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1'

In [21]:
molecule= Chem.MolFromSmiles(df5['canonical_smiles'][0]) 

In [22]:
molecule

<rdkit.Chem.rdchem.Mol at 0x7fb6a1741710>

In [23]:
# https://codeocean.com/explore/capsules?query=tag:data-curation

def lipinski(smiles, verbose=False):

    moldata= []
    for elem in smiles:
        mol=Chem.MolFromSmiles(elem) 
        moldata.append(mol)
       
    baseData= np.arange(1,1)
    i=0  
    for mol in moldata:        
       
        desc_MolWt = Descriptors.MolWt(mol)
        desc_MolLogP = Descriptors.MolLogP(mol)
        desc_NumHDonors = Lipinski.NumHDonors(mol)
        desc_NumHAcceptors = Lipinski.NumHAcceptors(mol)
           
        row = np.array([desc_MolWt,
                        desc_MolLogP,
                        desc_NumHDonors,
                        desc_NumHAcceptors])   
    
        if(i==0):
            baseData=row
        else:
            baseData=np.vstack([baseData, row])
        i=i+1      
    
    columnNames=["MW","LogP","NumHDonors","NumHAcceptors"]   
    descriptors = pd.DataFrame(data=baseData,columns=columnNames)
    
    return descriptors

In [24]:
df_lipinski = lipinski(df5['canonical_smiles'])
df_lipinski

Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors
0,392.382,5.52470,1.0,4.0
1,500.288,5.99020,1.0,4.0
2,453.288,6.14810,1.0,4.0
3,408.837,6.03900,1.0,4.0
4,410.372,5.66380,1.0,4.0
...,...,...,...,...
102,303.749,2.57252,1.0,4.0
103,301.733,3.67242,0.0,5.0
104,413.861,3.60692,1.0,6.0
105,400.784,4.46462,1.0,4.0


In [25]:
df6 = pd.concat([df5,df_lipinski],axis=1)

In [26]:
df6

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,class,MW,LogP,NumHDonors,NumHAcceptors
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,28700.0,inactive,392.382,5.52470,1.0,4.0
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,5900.0,intermediate,500.288,5.99020,1.0,4.0
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,16300.0,inactive,453.288,6.14810,1.0,4.0
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,29200.0,inactive,408.837,6.03900,1.0,4.0
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,27000.0,inactive,410.372,5.66380,1.0,4.0
...,...,...,...,...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,31300.0,inactive,303.749,2.57252,1.0,4.0
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,100000.0,inactive,301.733,3.67242,0.0,5.0
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,100000.0,inactive,413.861,3.60692,1.0,6.0
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,100000.0,inactive,400.784,4.46462,1.0,4.0
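
`pd.concat(..., axis=1)` aligns rows by index label, not by position. The frames above line up because `df4` was read back from CSV, which gives it a fresh 0, 1, 2, ... index, and `lipinski` also builds a frame with a default index. A sketch with made-up data of what happens when one side still carries gaps from filtering:

```python
import pandas as pd

# Left frame has gaps in its index, as a frame does after rows are filtered out.
left = pd.DataFrame({"id": ["a", "b", "c"]}, index=[0, 2, 5])
# Right series has the default index 0, 1, 2.
right = pd.Series(["x", "y", "z"], name="label")

# Labels 0, 1, 2, 5 are all kept, so rows are padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows pair up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```

Checking that the result has the expected number of rows and no unexpected missing values is a cheap way to catch this kind of mismatch.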


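The compounds are screened with Lipinski's rule of five, using these cut-offs:

- NumHDonors < 5
- NumHAcceptors < 10
- LogP <= 5
- MW <= 500

The rule is usually stated as no more than 5 hydrogen-bond donors and no more than 10 acceptors; the strict inequalities used here are this notebook's choice.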
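
The cut-offs listed above can be expressed as a reusable boolean mask. This is a minimal sketch: `passes_rule_of_five` is a hypothetical helper, and the two demo rows are copied from the descriptor table above.

```python
import pandas as pd

def passes_rule_of_five(desc):
    """Boolean mask using the same cut-offs as this notebook (hypothetical helper)."""
    return (
        (desc["MW"] <= 500)
        & (desc["LogP"] <= 5)
        & (desc["NumHDonors"] < 5)
        & (desc["NumHAcceptors"] < 10)
    )

# Rows 0 and 102 of the descriptor table above.
demo = pd.DataFrame({
    "MW": [392.382, 303.749],
    "LogP": [5.52470, 2.57252],
    "NumHDonors": [1.0, 1.0],
    "NumHAcceptors": [4.0, 4.0],
})

mask = passes_rule_of_five(demo)
print(mask.tolist())
```

The first row fails only on LogP, which shows that a single violated criterion is enough to exclude a compound under this filter.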

In [27]:
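def pIC50(df):
    # Work on a copy so the caller's DataFrame is not modified
    new_df = df.copy()

    # Convert nM to M, then take the negative base-10 logarithm
    molar = new_df['standard_value'] * 1e-9
    new_df['pIC50'] = -np.log10(molar)

    # The raw IC50 column is no longer needed once pIC50 is available
    return new_df.drop('standard_value', axis=1)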

In [28]:
df_final = pIC50(df6)
df_final

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,inactive,392.382,5.52470,1.0,4.0,4.542118
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,intermediate,500.288,5.99020,1.0,4.0,5.229148
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,inactive,453.288,6.14810,1.0,4.0,4.787812
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,inactive,408.837,6.03900,1.0,4.0,4.534617
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,inactive,410.372,5.66380,1.0,4.0,4.568636
...,...,...,...,...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,inactive,303.749,2.57252,1.0,4.0,4.504456
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,inactive,301.733,3.67242,0.0,5.0,4.000000
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,inactive,413.861,3.60692,1.0,6.0,4.000000
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,inactive,400.784,4.46462,1.0,4.0,4.000000
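
Since 1 nM is 10^-9 M, the conversion above is equivalent to pIC50 = 9 - log10(IC50 in nM). A minimal sketch of that shortcut, checked against three rows of the table above (`pic50_from_nm` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def pic50_from_nm(ic50_nm):
    """Convert IC50 values in nM to pIC50 (hypothetical vectorized helper)."""
    ic50_nm = pd.Series(ic50_nm, dtype=float)
    # -log10(x * 1e-9) simplifies to 9 - log10(x)
    return 9.0 - np.log10(ic50_nm)

# standard_value of rows 0, 1 and 103 in the tables above
demo = pic50_from_nm([28700.0, 5900.0, 100000.0])
print(demo.round(6).tolist())
```

A higher pIC50 means a more potent compound: the cut-offs of 1,000 nM and 10,000 nM correspond to pIC50 values of 6 and 5.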


In [29]:
l = []
dfr = pd.DataFrame(l)
dfr = df_final

In [30]:
# NumHDonors<5, NumHAcceptors<10, logp<=5, MW<=500.

In [31]:
dfr

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL2046972,COc1ccc(-c2oc3ccc(OCc4cccc(F)c4)cc3c2C(=O)O)cc1,inactive,392.382,5.52470,1.0,4.0,4.542118
1,CHEMBL2046973,COc1ccc(-c2oc3ccc(OCc4cccc(I)c4)cc3c2C(=O)O)cc1,intermediate,500.288,5.99020,1.0,4.0,5.229148
2,CHEMBL2046974,COc1ccc(-c2oc3ccc(OCc4ccc(Br)cc4)cc3c2C(=O)O)cc1,inactive,453.288,6.14810,1.0,4.0,4.787812
3,CHEMBL2047075,COc1ccc(-c2oc3ccc(OCc4cccc(Cl)c4)cc3c2C(=O)O)cc1,inactive,408.837,6.03900,1.0,4.0,4.534617
4,CHEMBL2047076,COc1ccc(-c2oc3ccc(OCc4c(F)cccc4F)cc3c2C(=O)O)cc1,inactive,410.372,5.66380,1.0,4.0,4.568636
...,...,...,...,...,...,...,...,...
102,CHEMBL4517769,Cc1cc(Cl)ccc1OCC(=O)N/N=C/c1ccccn1,inactive,303.749,2.57252,1.0,4.0,4.504456
103,CHEMBL4581421,Cc1cc(Cl)ccc1OCc1nnc(-c2ccccn2)o1,inactive,301.733,3.67242,0.0,5.0,4.000000
104,CHEMBL4471557,CCOC(=O)c1cc2ccccn2c1/C=N/NC(=O)COc1ccc(Cl)cc1C,inactive,413.861,3.60692,1.0,6.0,4.000000
105,CHEMBL4453183,Cc1cc(Cl)ccc1OC(C)C(=O)N/N=C/c1ccccc1OC(F)(F)F,inactive,400.784,4.46462,1.0,4.0,4.000000


In [32]:
df_filter = dfr[((dfr.MW<=500) & (dfr.LogP<=5)) & ((dfr.NumHDonors<5) & (dfr.NumHAcceptors<10))]

In [33]:
df_filter

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
31,CHEMBL4062416,CC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2,inactive,342.464,4.53612,2.0,3.0,5.0
32,CHEMBL4068320,CCC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2,inactive,356.491,4.92622,2.0,3.0,4.69897
33,CHEMBL3134585,O=C(Nc1ccccc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,active,382.407,4.7701,2.0,3.0,6.522879
35,CHEMBL4081619,CC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1)CCCCC2,inactive,328.437,4.2277,2.0,3.0,4.69897
36,CHEMBL4087045,CCC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1)CCCCC2,intermediate,342.464,4.6178,2.0,3.0,5.920819
38,CHEMBL4072679,O=C(Nc1ccccc1F)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,intermediate,400.397,4.9092,2.0,3.0,5.886057
39,CHEMBL4099581,O=C(Nc1ccc(F)cc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,active,400.397,4.9092,2.0,3.0,6.49485
49,CHEMBL4082703,O=C(Nc1ccccc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCC2,active,368.38,4.38,2.0,3.0,6.431798
50,CHEMBL4069503,Cc1ccccc1NC(=O)c1c(NC(=O)C(F)(F)F)sc2c1CCCC2,active,382.407,4.68842,2.0,3.0,6.769551
51,CHEMBL4090522,Cc1ccc(NC(=O)c2c(NC(=O)C(F)(F)F)sc3c2CCCC3)cc1,active,382.407,4.68842,2.0,3.0,6.657577


In [34]:
# Copy the filtered frame so that later in-place changes do not touch df_filter
dfn = df_filter.copy()

In [35]:
# to make indices starting from 0,1,2,...
dfn.reset_index(inplace = True)

In [36]:
dfn.drop('index',axis=1)
# drop() returns a new frame for display; dfn itself still has the 'index' column.
# These are the drug-like compounds. The Lipinski filter only screens out compounds that are unlikely to work as oral drugs.

Unnamed: 0,molecule_chembl_id,canonical_smiles,class,MW,LogP,NumHDonors,NumHAcceptors,pIC50
0,CHEMBL4062416,CC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2,inactive,342.464,4.53612,2.0,3.0,5.0
1,CHEMBL4068320,CCC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2,inactive,356.491,4.92622,2.0,3.0,4.69897
2,CHEMBL3134585,O=C(Nc1ccccc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,active,382.407,4.7701,2.0,3.0,6.522879
3,CHEMBL4081619,CC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1)CCCCC2,inactive,328.437,4.2277,2.0,3.0,4.69897
4,CHEMBL4087045,CCC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1)CCCCC2,intermediate,342.464,4.6178,2.0,3.0,5.920819
5,CHEMBL4072679,O=C(Nc1ccccc1F)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,intermediate,400.397,4.9092,2.0,3.0,5.886057
6,CHEMBL4099581,O=C(Nc1ccc(F)cc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCCC2,active,400.397,4.9092,2.0,3.0,6.49485
7,CHEMBL4082703,O=C(Nc1ccccc1)c1c(NC(=O)C(F)(F)F)sc2c1CCCC2,active,368.38,4.38,2.0,3.0,6.431798
8,CHEMBL4069503,Cc1ccccc1NC(=O)c1c(NC(=O)C(F)(F)F)sc2c1CCCC2,active,382.407,4.68842,2.0,3.0,6.769551
9,CHEMBL4090522,Cc1ccc(NC(=O)c2c(NC(=O)C(F)(F)F)sc3c2CCCC3)cc1,active,382.407,4.68842,2.0,3.0,6.657577
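
The renumbering above takes two steps because `reset_index` keeps the old labels in a new `index` column by default. A minimal sketch of the one-step form with `drop=True`, on three made-up rows that reuse index labels from the filtered table:

```python
import pandas as pd

# Stand-in for a filtered frame whose index still carries the old labels.
filtered = pd.DataFrame({"pIC50": [5.0, 4.69897, 6.522879]}, index=[31, 32, 33])

# Default behaviour: the old labels move into a column named 'index'.
with_old_index = filtered.reset_index()

# drop=True discards the old labels, so no extra column has to be removed.
renumbered = filtered.reset_index(drop=True)

print(list(with_old_index.columns), list(renumbered.columns))
```

Both calls return new frames and leave `filtered` unchanged, which avoids modifying a frame that was derived from another one.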


## **PaDEL descriptors**

PaDEL descriptors are another way to describe compounds for predicting activity, essentially the IC50 values. Each compound is represented by a fingerprint of 881 binary values (0s and 1s), and the pattern of bits characterizes the compound.

In [37]:

selection = ['canonical_smiles','molecule_chembl_id']
dfn_selection = dfn[selection]
dfn_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)
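
The file written above is plain text with one compound per line: the SMILES string, a tab, then the molecule ID, with no header row. A minimal sketch that writes two rows from the filtered table to a temporary file and reads them back to confirm the layout (the file name `molecule_demo.smi` is a placeholder):

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the filtered table above, SMILES column first.
demo = pd.DataFrame({
    "canonical_smiles": [
        "CC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2",
        "CCC(=O)Nc1sc2c(c1C(=O)Nc1ccccc1C)CCCCC2",
    ],
    "molecule_chembl_id": ["CHEMBL4062416", "CHEMBL4068320"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "molecule_demo.smi")
    demo.to_csv(path, sep="\t", index=False, header=False)

    # Inspect the raw lines, then parse the file back into a DataFrame.
    with open(path) as handle:
        lines = handle.read().splitlines()
    roundtrip = pd.read_csv(
        path, sep="\t", header=None,
        names=["canonical_smiles", "molecule_chembl_id"],
    )

print(lines[0])
```

Reading the file back is a quick check that the column order and separator are what the downstream tool is expected to receive.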

In [38]:
# downloading the padel files from a repository, for more info look at http://www.yapcwsoft.com/dd/padeldescriptor/

! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

--2022-01-27 17:11:57--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip [following]
--2022-01-27 17:11:58--  https://raw.githubusercontent.com/dataprofessor/bioinformatics/master/padel.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25768637 (25M) [application/zip]
Saving to: ‘padel.zip’


2022-01-27 17:12:00 (140 MB/s) - ‘padel.zip’ saved [25768637/25768637]

--2022-01-27 17:12:00--  https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh
Resolving github.com (gith

In [39]:
! unzip padel.zip

Archive:  padel.zip
   creating: PaDEL-Descriptor/
  inflating: __MACOSX/._PaDEL-Descriptor  
  inflating: PaDEL-Descriptor/MACCSFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._MACCSFingerprinter.xml  
  inflating: PaDEL-Descriptor/AtomPairs2DFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._AtomPairs2DFingerprinter.xml  
  inflating: PaDEL-Descriptor/EStateFingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._EStateFingerprinter.xml  
  inflating: PaDEL-Descriptor/Fingerprinter.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._Fingerprinter.xml  
  inflating: PaDEL-Descriptor/.DS_Store  
  inflating: __MACOSX/PaDEL-Descriptor/._.DS_Store  
   creating: PaDEL-Descriptor/license/
  inflating: __MACOSX/PaDEL-Descriptor/._license  
  inflating: PaDEL-Descriptor/KlekotaRothFingerprintCount.xml  
  inflating: __MACOSX/PaDEL-Descriptor/._KlekotaRothFingerprintCount.xml  
  inflating: PaDEL-Descriptor/config  
  inflating: __MACOSX/PaDEL-Descriptor/._config  
  inf

In [64]:
## Run the .sh file with bash
# The SMILES strings and molecule IDs were saved in molecule.smi; .smi files are usually plain text, and
# this kind of file is the input for the PaDEL descriptor calculation,
# which gives us the fingerprints of each compound. You can see molecule.smi in the Files section of the Colab notebook.
! bash padel.sh

Processing CHEMBL4062416 in molecule.smi (1/49). 
Processing CHEMBL4068320 in molecule.smi (2/49). 
Processing CHEMBL3134585 in molecule.smi (3/49). Average speed: 2.11 s/mol.
Processing CHEMBL4081619 in molecule.smi (4/49). Average speed: 1.06 s/mol.
Processing CHEMBL4072679 in molecule.smi (6/49). Average speed: 0.71 s/mol.
Processing CHEMBL4087045 in molecule.smi (5/49). Average speed: 0.91 s/mol.
Processing CHEMBL4099581 in molecule.smi (7/49). Average speed: 0.68 s/mol.
Processing CHEMBL4082703 in molecule.smi (8/49). Average speed: 0.59 s/mol.
Processing CHEMBL4069503 in molecule.smi (9/49). Average speed: 0.58 s/mol.
Processing CHEMBL4090522 in molecule.smi (10/49). Average speed: 0.53 s/mol.
Processing CHEMBL4080209 in molecule.smi (11/49). Average speed: 0.53 s/mol.
Processing CHEMBL4098234 in molecule.smi (12/49). Average speed: 0.48 s/mol.
Processing CHEMBL4088086 in molecule.smi (13/49). Average speed: 0.48 s/mol.
Processing CHEMBL4059811 in molecule.smi (14/49). Average sp

In [42]:
! cat molecule.smi | wc -l

49


In [43]:
# fingerprints of molecules
fnp = pd.read_csv('descriptors_output.csv')

In [44]:
fnp

Unnamed: 0,Name,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,...,PubchemFP841,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
0,CHEMBL4062416,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,CHEMBL4068320,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,CHEMBL3134585,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,CHEMBL4081619,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,CHEMBL4072679,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,CHEMBL4087045,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,CHEMBL4099581,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,CHEMBL4082703,1,1,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,CHEMBL4069503,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,CHEMBL4090522,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
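
A quick sanity check on the fingerprint table can be sketched as follows. The `fnp_demo` frame is made-up stand-in data with only four bits (the real table has 881 `PubchemFP` columns), and the compound names are hypothetical; the same checks should carry over to `fnp`.

```python
import pandas as pd

# Hypothetical stand-in for descriptors_output.csv (4 bits instead of 881)
fnp_demo = pd.DataFrame({
    'Name': ['CHEMBL_A', 'CHEMBL_B', 'CHEMBL_C'],
    'PubchemFP0': [1, 1, 1],
    'PubchemFP1': [1, 0, 1],
    'PubchemFP2': [0, 0, 1],
    'PubchemFP3': [0, 0, 0],
})

# Keep only the fingerprint columns
fp_cols = [c for c in fnp_demo.columns if c.startswith('PubchemFP')]
bits = fnp_demo[fp_cols]

# Every entry should be either 0 or 1
is_binary = bool(bits.isin([0, 1]).all().all())

# Number of bits switched on per compound
bits_on = bits.sum(axis=1)

print(len(fp_cols), is_binary)
print(bits_on.tolist())
```

On the real table, `len(fp_cols)` would be expected to equal 881.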


In [45]:
ic = dfn['pIC50']

In [46]:
ic

0     5.000000
1     4.698970
2     6.522879
3     4.698970
4     5.920819
5     5.886057
6     6.494850
7     6.431798
8     6.769551
9     6.657577
10    6.309804
11    5.795880
12    5.522879
13    5.886057
14    5.301030
15    5.420216
16    4.698970
17    5.207608
18    4.698970
19    4.698970
20    5.602060
21    5.886057
22    6.431798
23    5.677781
24    6.958607
25    6.000000
26    5.795880
27    4.195179
28    4.590067
29    4.804100
30    6.522879
31    4.000000
32    7.677781
33    6.677781
34    4.000000
35    4.000000
36    4.673664
37    6.050610
38    4.000000
39    4.000000
40    4.712198
41    4.000000
42    4.000000
43    4.970616
44    4.504456
45    4.000000
46    4.000000
47    4.000000
48    4.000000
Name: pIC50, dtype: float64

In [47]:
data = pd.concat([ic,fnp],axis=1)
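
`pd.concat(..., axis=1)` pairs rows by index, so it assumes that `descriptors_output.csv` lists the compounds in the same order as `dfn`. The PaDEL log above is not strictly sequential (5/49 and 6/49 appear swapped), so that assumption may be worth checking. Below is a minimal sketch of joining on the compound ID instead; `dfn_demo` and `fnp_demo` are made-up stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins: activity table and fingerprint table in different row order
dfn_demo = pd.DataFrame({
    'molecule_chembl_id': ['CHEMBL_A', 'CHEMBL_B', 'CHEMBL_C'],
    'pIC50': [5.0, 4.7, 6.5],
})
fnp_demo = pd.DataFrame({
    'Name': ['CHEMBL_B', 'CHEMBL_A', 'CHEMBL_C'],
    'PubchemFP0': [0, 1, 1],
    'PubchemFP1': [1, 0, 1],
})

# Join on the compound ID rather than on row position;
# validate raises an error if an ID occurs more than once on either side
merged = dfn_demo.merge(
    fnp_demo,
    left_on='molecule_chembl_id',
    right_on='Name',
    how='inner',
    validate='one_to_one',
)

# Split into target and features
data_y_demo = merged['pIC50']
data_x_demo = merged.drop(columns=['molecule_chembl_id', 'Name', 'pIC50'])
print(data_x_demo)
```

If the real data contains repeated compound IDs, `validate='one_to_one'` will raise; in that case the duplicates would need to be handled first.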

In [48]:
data.drop('Name',axis=1,inplace = True)

In [49]:
data_y = data['pIC50']

In [50]:
data.drop('pIC50',axis=1,inplace = True)

In [51]:
data_x = data
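
Many fingerprint columns in the table above are 0 for every compound shown, and a column that never changes cannot help a model. One optional step, not part of the original notebook, is to drop constant columns with scikit-learn's `VarianceThreshold`. The sketch below uses a made-up four-column frame (`X_demo`) as a stand-in for `data_x`.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical stand-in for data_x: two constant columns, two informative ones
X_demo = pd.DataFrame({
    'PubchemFP0': [1, 1, 1, 1],
    'PubchemFP1': [1, 0, 1, 0],
    'PubchemFP2': [0, 0, 0, 0],
    'PubchemFP3': [0, 1, 1, 0],
})

# threshold=0.0 removes only columns whose value is identical in every row
selector = VarianceThreshold(threshold=0.0)
selector.fit(X_demo)

# Keep the surviving columns, preserving their names
kept = X_demo.columns[selector.get_support()]
X_reduced = X_demo[kept]
print(list(kept))
```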

In [52]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [53]:
X_train, X_test, Y_train, Y_test = train_test_split(data_x,data_y, test_size=0.2)
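
Without a fixed seed, `train_test_split` draws a different split on every run, so the scores further down change from run to run. A minimal sketch of a reproducible split, using random stand-in arrays of the same shape as the real data (the seed value 42 is an arbitrary choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 49 compounds, 881 binary fingerprint bits
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(49, 881))
y_demo = rng.uniform(4.0, 8.0, size=49)

# Fixing random_state makes the split identical on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```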

In [54]:
X_train

Unnamed: 0,PubchemFP0,PubchemFP1,PubchemFP2,PubchemFP3,PubchemFP4,PubchemFP5,PubchemFP6,PubchemFP7,PubchemFP8,PubchemFP9,PubchemFP10,PubchemFP11,PubchemFP12,PubchemFP13,PubchemFP14,PubchemFP15,PubchemFP16,PubchemFP17,PubchemFP18,PubchemFP19,PubchemFP20,PubchemFP21,PubchemFP22,PubchemFP23,PubchemFP24,PubchemFP25,PubchemFP26,PubchemFP27,PubchemFP28,PubchemFP29,PubchemFP30,PubchemFP31,PubchemFP32,PubchemFP33,PubchemFP34,PubchemFP35,PubchemFP36,PubchemFP37,PubchemFP38,PubchemFP39,...,PubchemFP841,PubchemFP842,PubchemFP843,PubchemFP844,PubchemFP845,PubchemFP846,PubchemFP847,PubchemFP848,PubchemFP849,PubchemFP850,PubchemFP851,PubchemFP852,PubchemFP853,PubchemFP854,PubchemFP855,PubchemFP856,PubchemFP857,PubchemFP858,PubchemFP859,PubchemFP860,PubchemFP861,PubchemFP862,PubchemFP863,PubchemFP864,PubchemFP865,PubchemFP866,PubchemFP867,PubchemFP868,PubchemFP869,PubchemFP870,PubchemFP871,PubchemFP872,PubchemFP873,PubchemFP874,PubchemFP875,PubchemFP876,PubchemFP877,PubchemFP878,PubchemFP879,PubchemFP880
28,1,1,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
43,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
38,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
22,1,1,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
33,1,1,0,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
25,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,1,0,0,0,0,0,0,1,1,1,1,0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [55]:
X_train.shape, Y_train.shape

((39, 881), (39,))

In [56]:
X_test.shape, Y_test.shape

((10, 881), (10,))

## Try other models too, such as SVM (for regression) or K-means clustering (for grouping similar compounds)
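
As a starting point for the SVM suggestion, here is a minimal sketch using scikit-learn's `SVR` (support vector regression). It runs on random stand-in data, not the real fingerprints, and the hyperparameters are untuned defaults chosen only for illustration. K-means is an unsupervised method, so it would group compounds rather than predict pIC50.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Random stand-in data with the same shape as the notebook's features and target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(49, 881))
y_demo = rng.uniform(4.0, 8.0, size=49)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Support vector regression with an RBF kernel (illustrative, untuned settings)
svr = SVR(kernel='rbf', C=1.0, epsilon=0.1)
svr.fit(X_tr, y_tr)
svr_pred = svr.predict(X_te)
print(svr_pred.shape)
```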

In [61]:
# Linear Regression
from sklearn.linear_model import LinearRegression

LR = LinearRegression()

LR.fit(X_train,Y_train)


LinearRegression()

In [62]:
y_prediction =  LR.predict(X_test)
y_prediction

array([ 4.66406250e+00,  4.05468750e+00,  3.86328125e+00,  5.85156250e+00,
        7.41406250e+00,  5.92187500e+00,  1.91768849e+12, -3.78702539e+11,
        6.96875000e+00, -3.78702539e+11])

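### The huge and negative values (on the order of 10^11 to 10^12, while pIC50 here lies roughly between 4 and 8) indicate that the model didn't train well. There are only 39 training rows for 881 features, so whenever a new input comes in, the model may predict absurd values like the ones shown above.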
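
With 881 features and only 39 training rows, ordinary least squares has more unknowns than equations, so it can reproduce the training targets almost exactly while the coefficients stay poorly constrained. The sketch below illustrates this on random stand-in data and also fits `Ridge`, a regularized linear model that is one common remedy; it is not something the original notebook uses, and `alpha=1.0` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Random stand-in data: 39 training rows, 881 binary features
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(39, 881)).astype(float)
y_demo = rng.uniform(4.0, 8.0, size=39)

# Plain least squares: far more features than rows, so it fits the training set exactly
lr = LinearRegression().fit(X_demo, y_demo)
train_r2 = lr.score(X_demo, y_demo)

# Ridge adds an L2 penalty that keeps the coefficients small
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
ridge_pred = ridge.predict(X_demo)

print(train_r2)
print(np.abs(lr.coef_).max(), np.abs(ridge.coef_).max())
```

A near-perfect training score here is a sign of overfitting, not of a good model; only held-out data shows how the model generalizes.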

In [59]:
#Random Forest
model = RandomForestRegressor(n_estimators=50)
model.fit(X_train, Y_train)
r2 = model.score(X_test, Y_test)
r2

-0.43283680241144795

In [60]:
Y_pred = model.predict(X_test)

In [63]:
Y_pred

array([4.90530996, 4.88573375, 5.08756382, 5.4790927 , 6.43775024,
       5.98675818, 4.99053059, 5.37016535, 6.19811825, 5.37016535])

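### As you can see here, random forest predicts far better than linear regression: its predictions stay within a plausible pIC50 range. The R² score on the test set is still negative (about -0.43), which means the model does worse than always predicting the mean. The standard deviation of the values also isn't large, which can be another reason for the poor model training.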
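
With only 49 compounds, a single 10-row test set gives a noisy score. A rough sketch of k-fold cross-validation, which averages the score over several splits, is shown below. It uses random stand-in data, and the choice of 5 folds and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Random stand-in data with the same shape as the notebook's features and target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(49, 881))
y_demo = rng.uniform(4.0, 8.0, size=49)

# Same model family as above, with a fixed seed for repeatability
model_demo = RandomForestRegressor(n_estimators=50, random_state=0)

# 5-fold cross-validation: each compound is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model_demo, X_demo, y_demo, cv=cv, scoring='r2')

print(scores.mean(), scores.std())
```

The spread of `scores` across folds gives a sense of how much a single split, like the one above, can vary.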