# **Computational Drug Discovery Pipelines**
## **Data Collection and Pre-processing**

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.2-py3-none-any.whl (49 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for acetylcholinesterase**

In [3]:
# Target search for acetylcholinesterase
target = new_client.target
target_query = target.search('Acetylcholinesterase')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,"[{'xref_id': 'P22303', 'xref_name': None, 'xre...",Homo sapiens,Acetylcholinesterase,27.0,False,CHEMBL220,"[{'accession': 'P22303', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Homo sapiens,Cholinesterases; ACHE & BCHE,27.0,False,CHEMBL2095233,"[{'accession': 'P06276', 'component_descriptio...",SELECTIVITY GROUP,9606
2,[],Drosophila melanogaster,Acetylcholinesterase,17.0,False,CHEMBL2242744,"[{'accession': 'P07140', 'component_descriptio...",SINGLE PROTEIN,7227
3,"[{'xref_id': 'P04058', 'xref_name': None, 'xre...",Torpedo californica,Acetylcholinesterase,15.0,False,CHEMBL4780,"[{'accession': 'P04058', 'component_descriptio...",SINGLE PROTEIN,7787
4,"[{'xref_id': 'P21836', 'xref_name': None, 'xre...",Mus musculus,Acetylcholinesterase,15.0,False,CHEMBL3198,"[{'accession': 'P21836', 'component_descriptio...",SINGLE PROTEIN,10090
5,"[{'xref_id': 'P37136', 'xref_name': None, 'xre...",Rattus norvegicus,Acetylcholinesterase,15.0,False,CHEMBL3199,"[{'accession': 'P37136', 'component_descriptio...",SINGLE PROTEIN,10116
6,"[{'xref_id': 'O42275', 'xref_name': None, 'xre...",Electrophorus electricus,Acetylcholinesterase,15.0,False,CHEMBL4078,"[{'accession': 'O42275', 'component_descriptio...",SINGLE PROTEIN,8005
7,"[{'xref_id': 'P23795', 'xref_name': None, 'xre...",Bos taurus,Acetylcholinesterase,15.0,False,CHEMBL4768,"[{'accession': 'P23795', 'component_descriptio...",SINGLE PROTEIN,9913
8,[],Anopheles gambiae,Acetylcholinesterase,15.0,False,CHEMBL2046266,"[{'accession': 'Q869C3', 'component_descriptio...",SINGLE PROTEIN,7165
9,[],Bemisia tabaci,AChE2,15.0,False,CHEMBL2366409,"[{'accession': 'B3SST5', 'component_descriptio...",SINGLE PROTEIN,7038


### **Select and retrieve bioactivity data for Human Acetylcholinesterase (first entry)**

We will assign the first entry (which corresponds to the target protein, *human acetylcholinesterase*) to the ***selected_target*** variable.

In [4]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL220'
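
Picking the first row by position depends on how the search results happen to be ranked. As a minimal sketch of a more explicit selection, the snippet below filters on `organism`, `pref_name`, and `target_type`. The small frame is a stand-in built from three rows of the search output above so that the snippet runs offline, and `pick_target` is a hypothetical helper name, not part of the ChEMBL client.

```python
import pandas as pd

# Stand-in for the `targets` frame, using three rows from the search output above.
targets_demo = pd.DataFrame({
    "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
    "pref_name": ["Acetylcholinesterase", "Cholinesterases; ACHE & BCHE", "Acetylcholinesterase"],
    "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL220", "CHEMBL2095233", "CHEMBL3198"],
})

def pick_target(targets, organism, pref_name, target_type="SINGLE PROTEIN"):
    """Return the ChEMBL ID of the single row matching all three criteria (hypothetical helper)."""
    mask = (
        (targets["organism"] == organism)
        & (targets["pref_name"] == pref_name)
        & (targets["target_type"] == target_type)
    )
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]

selected_demo = pick_target(targets_demo, "Homo sapiens", "Acetylcholinesterase")
print(selected_demo)
```

Raising an error when zero or several rows match is a design choice here: it surfaces an ambiguous search instead of silently picking a row.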

Here, we will retrieve only bioactivity data for Human Acetylcholinesterase (CHEMBL220) that are reported as IC50 values.

In [7]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [8]:
df = pd.DataFrame.from_dict(res)

In [9]:
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '19.61', 'le': '0.36', 'lle': '3.32', ...",CHEMBL133897,,CHEMBL133897,6.12,False,http://www.openphacts.org/units/Nanomolar,252547,=,1,True,=,,IC50,nM,,750.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '18.57', 'le': '0.38', 'lle': '2.45', ...",CHEMBL336398,,CHEMBL336398,7.0,False,http://www.openphacts.org/units/Nanomolar,252533,=,1,True,=,,IC50,nM,,100.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,,,CHEMBL1148382,J. Med. Chem.,2004.0,,CHEMBL131588,,CHEMBL131588,,False,http://www.openphacts.org/units/Nanomolar,252530,>,1,True,>,,IC50,nM,,50000.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0


In [10]:
df.standard_type.unique()

array(['IC50'], dtype=object)
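
The same kind of check can be applied to other columns. Before any threshold in nM is used, it may be worth confirming that every record shares one unit and noting how many values are censored (reported with `>` or `<` rather than `=`). The sketch below uses a small mock frame that mirrors three columns of the output above; `activity_demo` is a hypothetical name.

```python
import pandas as pd

# Mock rows mirroring three columns of the ChEMBL activity output above.
activity_demo = pd.DataFrame({
    "standard_units": ["nM", "nM", "nM"],
    "standard_relation": ["=", "=", ">"],
    "standard_value": ["750.0", "100.0", "50000.0"],
})

# Distinct units; more than one entry would mean the values are not directly comparable.
units = activity_demo["standard_units"].dropna().unique().tolist()

# How many measurements are exact versus censored (e.g. "> 50000 nM").
relation_counts = activity_demo["standard_relation"].value_counts().to_dict()

print(units)
print(relation_counts)
```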

Finally, we will save the resulting bioactivity data to a CSV file, **AChE_bioactivity_data.csv**.

In [11]:
df.to_csv('AChE_bioactivity_data.csv', index=False)

## **Copying files to Google Drive**

First, we need to mount Google Drive in Colab so that we can access our Google Drive from within Colab.

In [12]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


Next, we save the file in the **data** folder inside our Colab Notebooks folder.

In [None]:
#! mkdir "/content/gdrive/My Drive/Colab Notebooks/Bioinformatics Project/Drug Discovery/data"

In [17]:
! cp AChE_bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/Bioinformatics Project/Drug Discovery/Data Collection and Pre-processing/data"
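
The shell commands above can also be expressed in Python, which avoids a failed copy when the destination folder does not exist yet. This is a minimal sketch: `copy_to_folder` is a hypothetical helper, and the demo runs in a temporary directory rather than on the mounted Drive so that it works outside Colab.

```python
import shutil
import tempfile
from pathlib import Path

def copy_to_folder(src, dest_dir):
    """Copy `src` into `dest_dir`, creating the folder first if it is missing."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dest_dir))

# Demo in a temporary directory; in Colab, `dest_dir` would be the mounted Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "AChE_bioactivity_data.csv"
    src.write_text("activity_id,standard_value\n33969,750.0\n")
    copied = copy_to_folder(src, Path(tmp) / "data")
    copied_ok = copied.exists() and copied.read_text() == src.read_text()

print(copied_ok)
```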

In [18]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/Bioinformatics Project/Drug Discovery/Data Collection and Pre-processing/data"

total 3859
-rw------- 1 root root 3870914 Aug  2 09:35 AChE_bioactivity_data.csv
-rw------- 1 root root   70334 Aug  2 02:17 bioactivity_data.csv
-rw------- 1 root root    9337 Aug  2 02:38 bioactivity_preprocessed_data.csv


Let's see the CSV files that we have so far.

In [19]:
! ls

AChE_bioactivity_data.csv  gdrive  sample_data


Let's take a glimpse of the **AChE_bioactivity_data.csv** file that we've just created.

In [20]:
! head AChE_bioactivity_data.csv

activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholinesterase,B,,,BAO_0000190,BAO_0000357,single protein format,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '19.61', 'le': '0.36', 'lle': '3.32', 'sei': '9.21'}",CHEMBL133897,,CHEMBL133897,6.12,False,http://www.openphac

## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [21]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,33969,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '19.61', 'le': '0.36', 'lle': '3.32', ...",CHEMBL133897,,CHEMBL133897,6.12,False,http://www.openphacts.org/units/Nanomolar,252547,=,1,True,=,,IC50,nM,,750.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.75
1,,37563,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '18.57', 'le': '0.38', 'lle': '2.45', ...",CHEMBL336398,,CHEMBL336398,7.00,False,http://www.openphacts.org/units/Nanomolar,252533,=,1,True,=,,IC50,nM,,100.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.1
2,,37565,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1,,,CHEMBL1148382,J. Med. Chem.,2004.0,,CHEMBL131588,,CHEMBL131588,,False,http://www.openphacts.org/units/Nanomolar,252530,>,1,True,>,,IC50,nM,,50000.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,50.0
3,,38902,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '16.11', 'le': '0.34', 'lle': '1.81', ...",CHEMBL130628,,CHEMBL130628,6.52,False,http://www.openphacts.org/units/Nanomolar,252534,=,1,True,=,,IC50,nM,,300.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.3
4,,41170,[],CHEMBL643384,Inhibitory concentration against acetylcholine...,B,,,BAO_0000190,BAO_0000357,single protein format,CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C,,,CHEMBL1148382,J. Med. Chem.,2004.0,"{'bei': '17.60', 'le': '0.36', 'lle': '3.00', ...",CHEMBL130478,,CHEMBL130478,6.10,False,http://www.openphacts.org/units/Nanomolar,252552,=,1,True,=,,IC50,nM,,800.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7544,,20703835,[],CHEMBL4627889,Inhibition of AChE (unknown origin) using acet...,B,,,BAO_0000190,BAO_0000357,single protein format,COc1ccc(CCC(=O)Nc2nc(-c3cc4ccccc4oc3=O)cs2)cc1OC,,,CHEMBL4627271,Bioorg Med Chem Lett,2020.0,"{'bei': '14.05', 'le': '0.27', 'lle': '1.62', ...",CHEMBL4645659,,CHEMBL4645659,6.13,False,http://www.openphacts.org/units/Nanomolar,3486808,=,1,True,=,,IC50,nM,,740.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.74
7545,,20703856,[],CHEMBL4627888,Inhibition of AChE (unknown origin),B,,,BAO_0000190,BAO_0000357,single protein format,COc1ccc(-c2csc(NC(=O)CCN3CCCC3)n2)cc1,,,CHEMBL4627271,Bioorg Med Chem Lett,2020.0,"{'bei': '18.99', 'le': '0.37', 'lle': '3.05', ...",CHEMBL513063,,CHEMBL513063,6.29,False,http://www.openphacts.org/units/Nanomolar,3486809,=,1,True,=,,IC50,nM,,510.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,0.51
7546,,20708928,[],CHEMBL4628756,Inhibition of human AchE,A,,,BAO_0000190,BAO_0000357,single protein format,COc1cc(C2C3=C(CCCC3=O)NC3=C2C(=O)CCC3)ccc1OCc1...,Outside typical range,Values for this activity type are unusually la...,CHEMBL4627331,Bioorg Med Chem Lett,2020.0,,CHEMBL4640608,,CHEMBL4640608,,False,http://www.openphacts.org/units/Nanomolar,3487873,=,1,True,=,,IC50,nM,,125000.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,125.0
7547,,20708929,[],CHEMBL4628756,Inhibition of human AchE,A,,,BAO_0000190,BAO_0000357,single protein format,O=C1CCCC2=C1C(c1ccc(OCc3cccc(F)c3)c(Br)c1)C1=C...,,,CHEMBL4627331,Bioorg Med Chem Lett,2020.0,,CHEMBL4173961,,CHEMBL4173961,,False,http://www.openphacts.org/units/Nanomolar,3487876,>,1,True,>,,IC50,nM,,100000.0,CHEMBL220,Homo sapiens,Acetylcholinesterase,9606,,,IC50,uM,UO_0000065,,100.0


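For this dataset, the filter does remove records: the index labels run up to 7548, while the filtered column shown in the next section has a length of 6342. The same code cell can be reused for the bioactivity data of other target proteins.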
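
The filter above can also be written with `dropna`, which makes it easy to count how many rows were removed. The frame below is a mock with one missing value added purely for illustration; it is not taken from the ChEMBL data.

```python
import pandas as pd

# Mock frame; the missing value in the second row is invented for the demo.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398", "CHEMBL131588"],
    "standard_value": ["750.0", None, "50000.0"],
})

# Equivalent to demo[demo.standard_value.notna()], but names the column explicitly.
kept = demo.dropna(subset=["standard_value"])
n_dropped = len(demo) - len(kept)

print(f"kept {len(kept)} rows, dropped {n_dropped}")
print(kept.index.tolist())
```

Note that the surviving rows keep their original index labels, so the index has gaps after filtering.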

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.

In [22]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
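
The loop above can also be written as a vectorized operation. The following is a sketch using `numpy.select` with the same thresholds; the five input values are chosen to exercise both boundaries and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Values are stored as strings in the ChEMBL output, so convert them first.
values = pd.to_numeric(pd.Series(["750.0", "1000.0", "2400.0", "10000.0", "50000.0"]))

# Conditions are checked in order, mirroring the if/elif/else chain.
conditions = [values >= 10000, values <= 1000]
labels = ["inactive", "active"]
classes = np.select(conditions, labels, default="intermediate").tolist()

print(classes)
```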

### **Collect *molecule_chembl_id* values into a list**

In [23]:
df2.molecule_chembl_id

0        CHEMBL133897
1        CHEMBL336398
2        CHEMBL131588
3        CHEMBL130628
4        CHEMBL130478
            ...      
7544    CHEMBL4645659
7545     CHEMBL513063
7546    CHEMBL4640608
7547    CHEMBL4173961
7548         CHEMBL95
Name: molecule_chembl_id, Length: 6342, dtype: object

In [24]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

In [25]:
mol_cid

['CHEMBL133897',
 'CHEMBL336398',
 'CHEMBL131588',
 'CHEMBL130628',
 'CHEMBL130478',
 'CHEMBL130112',
 'CHEMBL130098',
 'CHEMBL337486',
 'CHEMBL336538',
 'CHEMBL131051',
 'CHEMBL341437',
 'CHEMBL335033',
 'CHEMBL122983',
 'CHEMBL338720',
 'CHEMBL339995',
 'CHEMBL335158',
 'CHEMBL131536',
 'CHEMBL106126',
 'CHEMBL334971',
 'CHEMBL336625',
 'CHEMBL130666',
 'CHEMBL134061',
 'CHEMBL133388',
 'CHEMBL130645',
 'CHEMBL133580',
 'CHEMBL336524',
 'CHEMBL336276',
 'CHEMBL334395',
 'CHEMBL131320',
 'CHEMBL339297',
 'CHEMBL337714',
 'CHEMBL122575',
 'CHEMBL130704',
 'CHEMBL46151',
 'CHEMBL54126',
 'CHEMBL297316',
 'CHEMBL65667',
 'CHEMBL431519',
 'CHEMBL296429',
 'CHEMBL154972',
 'CHEMBL152722',
 'CHEMBL544022',
 'CHEMBL349127',
 'CHEMBL156659',
 'CHEMBL45118',
 'CHEMBL542609',
 'CHEMBL155322',
 'CHEMBL154211',
 'CHEMBL1203537',
 'CHEMBL86',
 'CHEMBL47375',
 'CHEMBL542360',
 'CHEMBL345849',
 'CHEMBL154689',
 'CHEMBL346733',
 'CHEMBL416464',
 'CHEMBL1203539',
 'CHEMBL347729',
 'CHEMBL296786',
 'CH

### **Collect *canonical_smiles* values into a list**

In [26]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

In [27]:
canonical_smiles

['CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1',
 'O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1',
 'CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1',
 'O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F',
 'CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C',
 'CSc1nc(-c2ccc(C)cc2)nn1C(=O)N(C)c1ccccc1',
 'CSc1nc(-c2ccc(Cl)cc2)nn1C(=O)N(C)C',
 'CCCCCCSc1nc(-c2ccc(Cl)cc2)nn1C(=O)N1CCOCC1',
 'COc1ccc(-c2nc(SC)n(C(=O)N(C)C)n2)cc1',
 'CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)c1ccccc1',
 'CCSc1nc(-c2ccc(OC)cc2)nn1C(=O)N1CCOCC1',
 'CSc1nc(-c2ccc3ccccc3c2)nn1C(=O)N(C)C',
 'C[C@H]1C(=O)N(C(=O)NCc2ccccc2)[C@@H]1Oc1ccc(C(=O)C(C)(C)C)cc1',
 'CSc1nc(-c2ccc(-c3ccccc3)cc2)nn1C(=O)N(C)C',
 'CSc1nc(/C=C/c2ccccc2)nn1C(=O)N(C)C',
 'CCCCCCSc1nc(-c2ccc(Cl)cc2)nn1C(=O)N1CCCCC1',
 'CSc1nc(-c2ccc(Cl)cc2)nn1C(=O)N(C)c1ccccc1',
 'Cc1c(C(C)C)c(=O)on1C(=O)N1CCC[C@H](C)C1',
 'CCSc1nc(-c2ccc(OC)cc2)nn1C(=O)N(C)c1ccccc1',
 'CCCCCCSc1nc(-c2ccc(C)cc2)nn1C(=O)N(C)c1ccccc1',
 'CSc1nc(-c2ccc(Cl)cc2)nn1C(=O)N1CCCCC1',
 'O=C(N1CCOCC1)n1nc(-c2cc

### **Collect *standard_value* values into a list**

In [28]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)
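
Each of these append loops can be replaced by a single `tolist()` call on the column. The sketch below uses a two-row mock frame with values copied from the output above; it also converts `standard_value` to numbers with `pd.to_numeric`, since the API returns the values as strings. The `_demo` names are hypothetical.

```python
import pandas as pd

# Two rows copied from the output above, used as a stand-in for df2.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL133897", "CHEMBL336398"],
    "canonical_smiles": [
        "CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1",
        "O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1",
    ],
    "standard_value": ["750.0", "100.0"],
})

# One call per column instead of a for loop with append.
mol_cid_demo = demo["molecule_chembl_id"].tolist()
smiles_demo = demo["canonical_smiles"].tolist()

# Convert the string values to floats before collecting them.
values_demo = pd.to_numeric(demo["standard_value"]).tolist()

print(mol_cid_demo)
print(values_demo)
```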

In [29]:
standard_value

['750.0',
 '100.0',
 '50000.0',
 '300.0',
 '800.0',
 '2400.0',
 '100.0',
 '50000.0',
 '800.0',
 '50000.0',
 '50000.0',
 '50.0',
 '100000.0',
 '560.0',
 '10.0',
 '10.0',
 '1400.0',
 '17000.0',
 '1100.0',
 '50000.0',
 '26.0',
 '50000.0',
 '800.0',
 '50000.0',
 '50000.0',
 '56.0',
 '200.0',
 '300.0',
 '200.0',
 '50000.0',
 '500.0',
 '100000.0',
 '50000.0',
 '260000.0',
 '22.0',
 '1000000.0',
 '3800.0',
 '40.0',
 '1000000.0',
 '73.0',
 '3100.0',
 '520000.0',
 '4500.0',
 '1200.0',
 '1000000.0',
 '1000000.0',
 '540.0',
 '30.0',
 '830.0',
 '20000.0',
 '1000000.0',
 '1000000.0',
 '390.0',
 '390.0',
 '430.0',
 '1000000.0',
 '600.0',
 '2000.0',
 '1000000.0',
 '21000.0',
 '290.0',
 '700.0',
 '460000.0',
 '630.0',
 '2100.0',
 '330000.0',
 '340000.0',
 '1000000.0',
 '290.0',
 '210000.0',
 '40.0',
 '80.0',
 '95.0',
 '710000.0',
 '180000.0',
 '20000.0',
 '60.0',
 '70.0',
 '280.0',
 '11000.0',
 '1500.0',
 '650.0',
 '930.0',
 '370.0',
 '4500.0',
 '720000.0',
 '30.0',
 '7600.0',
 '80.0',
 '1000000.0',
 

### **Combine the 4 lists into a dataframe**

In [32]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, standard_value))
df3_alt = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value'])

In [33]:
df3_alt

| | molecule_chembl_id | canonical_smiles | bioactivity_class | standard_value |
|---|---|---|---|---|
| 0 | CHEMBL133897 | CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | active | 750.0 |
| 1 | CHEMBL336398 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1 | active | 100.0 |
| 2 | CHEMBL131588 | CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1 | inactive | 50000.0 |
| 3 | CHEMBL130628 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F | active | 300.0 |
| 4 | CHEMBL130478 | CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C | active | 800.0 |
| ... | ... | ... | ... | ... |
| 6337 | CHEMBL4645659 | COc1ccc(CCC(=O)Nc2nc(-c3cc4ccccc4oc3=O)cs2)cc1OC | active | 740.0 |
| 6338 | CHEMBL513063 | COc1ccc(-c2csc(NC(=O)CCN3CCCC3)n2)cc1 | active | 510.0 |
| 6339 | CHEMBL4640608 | COc1cc(C2C3=C(CCCC3=O)NC3=C2C(=O)CCC3)ccc1OCc1... | inactive | 125000.0 |
| 6340 | CHEMBL4173961 | O=C1CCCC2=C1C(c1ccc(OCc3cccc(F)c3)c(Br)c1)C1=C... | inactive | 100000.0 |


### **Alternative Method**

In [30]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]

In [31]:
df3

| | molecule_chembl_id | canonical_smiles | standard_value |
|---|---|---|---|
| 0 | CHEMBL133897 | CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | 750.0 |
| 1 | CHEMBL336398 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1 | 100.0 |
| 2 | CHEMBL131588 | CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1 | 50000.0 |
| 3 | CHEMBL130628 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F | 300.0 |
| 4 | CHEMBL130478 | CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C | 800.0 |
| ... | ... | ... | ... |
| 7544 | CHEMBL4645659 | COc1ccc(CCC(=O)Nc2nc(-c3cc4ccccc4oc3=O)cs2)cc1OC | 740.0 |
| 7545 | CHEMBL513063 | COc1ccc(-c2csc(NC(=O)CCN3CCCC3)n2)cc1 | 510.0 |
| 7546 | CHEMBL4640608 | COc1cc(C2C3=C(CCCC3=O)NC3=C2C(=O)CCC3)ccc1OCc1... | 125000.0 |
| 7547 | CHEMBL4173961 | O=C1CCCC2=C1C(c1ccc(OCc3cccc(F)c3)c(Br)c1)C1=C... | 100000.0 |


In [34]:
pd.concat([df3, pd.Series(bioactivity_class)], axis = 1)

| | molecule_chembl_id | canonical_smiles | standard_value | 0 |
|---|---|---|---|---|
| 0 | CHEMBL133897 | CCOc1nn(-c2cccc(OCc3ccccc3)c2)c(=O)o1 | 750.0 | active |
| 1 | CHEMBL336398 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC1CC1 | 100.0 | active |
| 2 | CHEMBL131588 | CN(C(=O)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F)c1ccccc1 | 50000.0 | inactive |
| 3 | CHEMBL130628 | O=C(N1CCCCC1)n1nc(-c2ccc(Cl)cc2)nc1SCC(F)(F)F | 300.0 | active |
| 4 | CHEMBL130478 | CSc1nc(-c2ccc(OC(F)(F)F)cc2)nn1C(=O)N(C)C | 800.0 | active |
| ... | ... | ... | ... | ... |
| 7544 | CHEMBL4645659 | COc1ccc(CCC(=O)Nc2nc(-c3cc4ccccc4oc3=O)cs2)cc1OC | 740.0 | |
| 7545 | CHEMBL513063 | COc1ccc(-c2csc(NC(=O)CCN3CCCC3)n2)cc1 | 510.0 | |
| 7546 | CHEMBL4640608 | COc1cc(C2C3=C(CCCC3=O)NC3=C2C(=O)CCC3)ccc1OCc1... | 125000.0 | |
| 7547 | CHEMBL4173961 | O=C1CCCC2=C1C(c1ccc(OCc3cccc(F)c3)c(Br)c1)C1=C... | 100000.0 | |
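
The blank cells in the last rows of the output above appear to come from index alignment: `pd.Series(bioactivity_class)` gets a fresh index starting at 0, while `df3` keeps the gapped index left by the missing-value filter, and `pd.concat` pairs rows by index label rather than by position. The sketch below reproduces the effect on a small mock frame and shows one possible fix, which is to reuse the frame's own index. The `_demo` names and values are illustrative only.

```python
import pandas as pd

# Mock frame with a gapped index, as left behind by filtering out rows.
df3_demo = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL133897", "CHEMBL131588", "CHEMBL130628"],
        "standard_value": [750.0, 50000.0, 300.0],
    },
    index=[0, 2, 3],
)
labels = ["active", "inactive", "active"]

# A bare Series gets index 0, 1, 2, so concat pairs rows by label and leaves gaps.
misaligned = pd.concat([df3_demo, pd.Series(labels)], axis=1)

# Reusing the frame's index keeps each label on its own row and names the column.
aligned = df3_demo.assign(bioactivity_class=pd.Series(labels, index=df3_demo.index))

print(misaligned)
print(aligned)
```

Assigning the result back to a variable, as done with `aligned` here, is also needed if the class column should be present in any file saved afterwards.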


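Save the dataframe to a CSV file. Note that the result of the `pd.concat` call above was not assigned back, so `df3` as saved here holds only the three selected columns.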

In [35]:
df3.to_csv('AChE_bioactivity_preprocessed_data.csv', index=False)

In [36]:
! ls -l

total 4232
-rw-r--r-- 1 root root 3870914 Aug  2 09:31 AChE_bioactivity_data.csv
-rw-r--r-- 1 root root  450457 Aug  2 09:39 AChE_bioactivity_preprocessed_data.csv
drwx------ 5 root root    4096 Aug  2 09:31 gdrive
drwxr-xr-x 1 root root    4096 Jul 16 13:20 sample_data


Let's copy it to Google Drive.

In [37]:
! cp AChE_bioactivity_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/Bioinformatics Project/Drug Discovery/Data Collection and Pre-processing/data"

In [38]:
! ls "/content/gdrive/My Drive/Colab Notebooks/Bioinformatics Project/Drug Discovery/Data Collection and Pre-processing/data"

AChE_bioactivity_data.csv		bioactivity_data.csv
AChE_bioactivity_preprocessed_data.csv	bioactivity_preprocessed_data.csv


---