# **Data Collection**
## Computational Drug Discovery: Download Bioactivity Data
### Real-life data project for a portfolio. Specifically, we will build an ML model using bioactivity data from the ChEMBL DB.

### ChEMBL DB
The ChEMBL DB is a database that contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76k documents and 1.2 million assays, and the data span 13k targets, 1.8k cells and 33k indications.

### Libraries used
We install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL DB.

In [1]:
! pip install chembl_webresource_client

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.8-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting attrs<22.0,>=21.2
  Downloading attrs-21.4.0-py2.py3-none-any.whl (60 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: url-normalize, attrs, requests-cache, chembl_webresource_client
  Attempting uninstall: attrs
    Found existing installation: attrs 23.1.0
    Uninstalling attrs-23.1.0:
      Successfully uninstalled attrs-23.1.0
Successfully installed attrs-

In [2]:
# import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

### Search for Target protein
Target search for schistosoma

In [3]:
# Target search for schistosoma
target = new_client.target
target_query = target.search('schistosoma')
targets = pd.DataFrame.from_dict(target_query)
targets
# 10 are single protein, 3 are organisms and 1 nucleic acid

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Schistosoma japonicum,Schistosoma japonicum,16.0,False,CHEMBL612644,[],ORGANISM,6182
1,[],Schistosoma mansoni,Schistosoma mansoni,16.0,False,CHEMBL612893,[],ORGANISM,6183
2,[],Schistosoma haematobium,Schistosoma haematobium,16.0,False,CHEMBL4296598,[],ORGANISM,6185
3,[],Schistosoma mansoni,Thioredoxin glutathione reductase,10.0,False,CHEMBL6110,"[{'accession': 'Q962Y6', 'component_descriptio...",SINGLE PROTEIN,6183
4,[],Schistosoma mansoni,Thioredoxin peroxidase,10.0,False,CHEMBL1293279,"[{'accession': 'O97161', 'component_descriptio...",SINGLE PROTEIN,6183
5,[],Schistosoma mansoni,Voltage-activated calcium channel beta 1 subunit,10.0,False,CHEMBL2363079,"[{'accession': 'Q95US7', 'component_descriptio...",SINGLE PROTEIN,6183
6,[],Schistosoma mansoni,Voltage-activated calcium channel beta 2 subunit,10.0,False,CHEMBL2363080,"[{'accession': 'Q962H3', 'component_descriptio...",SINGLE PROTEIN,6183
7,[],Schistosoma mansoni,DNA,10.0,False,CHEMBL2366043,[],NUCLEIC-ACID,6183
8,[],Schistosoma mansoni,Histone deacetylase 8,10.0,False,CHEMBL3797017,"[{'accession': 'A5H660', 'component_descriptio...",SINGLE PROTEIN,6183
9,[],Schistosoma japonicum,Glutathione-S-transferase,10.0,False,CHEMBL4105850,"[{'accession': 'Q26513', 'component_descriptio...",SINGLE PROTEIN,6182


# **Select and retrieve bioactivity data for NAD-dependent protein deacetylase (entry at index 10)**

The entry at index 10 (the target protein NAD-dependent protein deacetylase, i.e. the eleventh row when counting from one) is assigned to the selected_target variable.

In [4]:
selected_target = targets.target_chembl_id[10]
selected_target

'CHEMBL4523517'
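
Selecting by row position works here, but it can break if the search results come back in a different order. As a minimal sketch, assuming the target is identified by its pref_name, the same ID could be looked up by name. The small frame below is an illustrative subset of the search result, not the full `targets` table.

```python
import pandas as pd

# Illustrative subset of the search result; in the notebook, `targets`
# comes from new_client.target.search('schistosoma').
targets = pd.DataFrame({
    "pref_name": [
        "Schistosoma mansoni",
        "Thioredoxin glutathione reductase",
        "NAD-dependent protein deacetylase",
    ],
    "target_chembl_id": ["CHEMBL612893", "CHEMBL6110", "CHEMBL4523517"],
})

# Select by name rather than by row position.
mask = targets["pref_name"] == "NAD-dependent protein deacetylase"
selected_target = targets.loc[mask, "target_chembl_id"].iloc[0]
print(selected_target)
```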

Here we retrieve only the bioactivity data for NAD-dependent protein deacetylase (CHEMBL4523517) that are reported as IC50 values in nM (nanomolar) units.

In [11]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [12]:
df = pd.DataFrame.from_dict(res)

In [13]:
df

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18918972,[],CHEMBL4322639,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,23.1
1,,18918974,[],CHEMBL4322641,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,420.0
2,,18918975,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,1.9


The higher the standard_value, the lower the potency, so we want values that are as low as possible. A higher IC50 means that more of the compound is needed to reach 50% inhibition, i.e. to produce the same effect (think "take 5 mg of medication" versus "take 5 litres").
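
To make the units concrete, here is a small sketch of the arithmetic, assuming (as the rows in this notebook suggest) that `value` is reported in uM and `standard_value` holds the same measurement in nM. The three numbers are taken from the table above.

```python
# IC50 values as reported in the `value` column (units: uM).
reported_um = [23.1, 420.0, 1.9]

# 1 uM = 1000 nM; round to hide floating-point noise.
standard_nm = [round(v * 1000, 1) for v in reported_um]
print(standard_nm)

# The lowest IC50 is the most potent of the three.
most_potent_nm = min(standard_nm)
print(most_potent_nm)
```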

In [15]:
df.standard_type.unique()

array(['IC50'], dtype=object)

We'll save the resulting bioactivity data to the CSV file bioactivity_data_schisto.csv.

In [16]:
df.to_csv('bioactivity_data_schisto.csv', index=False)

Add to Drive
First, mount GDrive in Colab so that we can access it from the notebook.

In [17]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


We will create a data folder inside the Colab Notebooks folder in GDrive.

In [18]:
! ls "/content/gdrive/My Drive/Colab Notebooks/"

'Breast Cancer EDA And Classification.ipynb'
'Breast Cancer prediction using ANN.ipynb'
 CDD_ML_Part_2_Exploratory_Data_Analysis.ipynb
'CN Sandu-Martinas.ipynb'
'Copie a blocnotesului 1_Introducere in COLAB_RO.ipynb'
'Copie a blocnotesului Homework 1 (1).ipynb'
'Copie a blocnotesului Homework 1.ipynb'
'Copie a blocnotesului lab9.ipynb'
'Copie a blocnotesului rdkit-pip.ipynb'
 data
 dogs-vs-cats
'Identificare tesut canceros.ipynb'
'Intro to Keras with breast cancer data[ANN].ipynb'
'Keras with TensorFlow Course - FreeCodeCamp'
 lab9.ipynb
'ML_2_Exploratory Data Analysis.ipynb'
 ML_bioactivity_data.ipynb
 ML_bioactivity_schisto_data.ipynb
 ML_Exploratory_Data_Analysis.ipynb
'Proiect RN Dynamic Duo.ipynb'
'Stock Price Predictor.ipynb'
 Untitled0.ipynb
'Untitled1 (1).ipynb'
 Untitled1.ipynb


In [20]:
! mkdir "/content/gdrive/My Drive/Colab Notebooks/data"

mkdir: cannot create directory ‘/content/gdrive/My Drive/Colab Notebooks/data’: File exists
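
The error above only says that the folder already exists. As a hedged alternative, the folder could be created from Python with `os.makedirs(..., exist_ok=True)`, which does nothing when the directory is already there. The temporary directory below is a stand-in for the Drive path so the sketch runs anywhere.

```python
import os
import tempfile

# Stand-in for "/content/gdrive/My Drive/Colab Notebooks" (assumption for the demo).
base = tempfile.mkdtemp()
data_dir = os.path.join(base, "data")

os.makedirs(data_dir, exist_ok=True)  # creates the folder
os.makedirs(data_dir, exist_ok=True)  # second call is a no-op, no error
print(os.path.isdir(data_dir))
```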


In [21]:
! cp bioactivity_data_schisto.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [22]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 89
-rw------- 1 root root 68258 Apr 29 15:36 bioactivity_data.csv
-rw------- 1 root root 13120 Apr 29 20:07 bioactivity_data_schisto.csv
-rw------- 1 root root  9112 Apr 29 15:37 bioactivity_preprocessed_data.csv


These are the CSV files in the data folder so far.

In [23]:
!ls

bioactivity_data_schisto.csv  gdrive  sample_data


We'll take a brief look at the bioactivity_data_schisto.csv file that was just created.

In [24]:
! head bioactivity_data_schisto.csv

activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,18918972,[],CHEMBL4322639,Inhibition of recombinant Schistosoma mansoni sirtuin 2 (21 to 322 residues) expressed in Escherichia coli BL21(DE3) cells assessed as inhibition of substrate deacetylation using ZMAL as substrate,B,,,BAO_0000190,BAO_0000219,cell-based format,NC(=O)c1cccnc1,,,CHEMBL4321809,J Med Chem,2019,"{'be

# **Taking care of missing data**

If any compound has a missing value in the standard_value column, drop it.

In [25]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,18918972,[],CHEMBL4322639,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,23.1
1,,18918974,[],CHEMBL4322641,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,420.0
2,,18918975,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,1.9
3,,18918976,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,18.2
4,,18918977,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,14.5,14.0
5,,18918978,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,23.7
6,,18918979,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,12.8
7,,18918980,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,27.7
8,,18918981,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,2.34
9,,18918982,[],CHEMBL4322624,Inhibition of recombinant Schistosoma mansoni ...,B,,,BAO_0000190,BAO_0000219,...,Schistosoma mansoni,NAD-dependent protein deacetylase,6183,,,IC50,uM,UO_0000065,,23.1


There is no missing data in this dataset, but the code cell above can be reused for the bioactivity data of other target proteins.
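
Since this dataset has nothing to drop, here is a small sketch on toy data showing the same idea with `dropna`. The missing value for CHEMBL252556 is made up purely for illustration.

```python
import pandas as pd

# Toy frame; the None entry is invented to demonstrate the filter.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1140", "CHEMBL252556", "CHEMBL3430999"],
    "standard_value": ["23100.0", None, "18200.0"],
})

# Drop rows without a standard_value and renumber the index.
clean = toy.dropna(subset=["standard_value"]).reset_index(drop=True)
print(clean)
```

Resetting the index is optional, but it keeps later positional operations (such as attaching a list of labels) aligned with the rows.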

# **Data pre-processing of the bioactivity data**

## **Labeling compounds as either being active, inactive or intermediate**

The bioactivity data are reported as IC50 values. Compounds with values of 1,000 nM or less are considered active, while those with values of 10,000 nM or more are considered inactive. Values between 1,000 and 10,000 nM are labeled intermediate.

In [27]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
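
The same thresholds can also be wrapped in a small function and applied to the whole column. This is only a sketch: `label_bioactivity` is a hypothetical helper name, and the 500.0 value is invented so that all three classes appear.

```python
import pandas as pd

def label_bioactivity(ic50_nm: float) -> str:
    """Map an IC50 in nM to a class using the thresholds described above."""
    if ic50_nm >= 10000:
        return "inactive"
    if ic50_nm <= 1000:
        return "active"
    return "intermediate"

# standard_value arrives as strings, so convert to numbers first.
values = pd.Series(["23100.0", "1900.0", "2340.0", "500.0"])
labels = pd.to_numeric(values).apply(label_bioactivity)
print(labels.tolist())
```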

Collect the molecule_chembl_id values into a list

In [28]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

Collect the canonical_smiles values into a list


*   This dataset is made up of multiple compounds. A compound is a molecule, i.e. a chemical structure, that produces a modulatory activity (it exerts some effect on a target protein) in order to produce a desired biological effect, which ultimately relieves symptoms.
*   Each compound is identified by a molecule ChEMBL ID. Multiple rows can share the same ChEMBL ID; for simplicity it is preferable to keep only unique IDs and avoid redundancy (the cells here do not deduplicate yet).
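
Keeping one row per compound could be sketched with `drop_duplicates`. Which measurement to keep for a repeated compound is a judgment call; `keep="first"` below is just an assumption for the example, using the first three rows of this dataset.

```python
import pandas as pd

# First three rows of the dataset: CHEMBL1140 appears twice.
toy = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL1140", "CHEMBL1140", "CHEMBL252556"],
    "canonical_smiles": [
        "NC(=O)c1cccnc1",
        "NC(=O)c1cccnc1",
        "COC1=C(OC)C(=O)C(CCCCCCCCCCO)=C(C)C1=O",
    ],
    "standard_value": [23100.0, 420000.0, 1900.0],
})

# Keep the first row seen for each molecule_chembl_id.
unique = toy.drop_duplicates(subset="molecule_chembl_id", keep="first")
unique = unique.reset_index(drop=True)
print(unique)
```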



In [29]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

Collect the standard_value values into a list

In [31]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)

Combine the 4 lists into a dataframe

In [32]:
# The column names must follow the same order as the lists passed to zip()
data_tuples = list(zip(mol_cid, canonical_smiles, standard_value, bioactivity_class))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'standard_value', 'bioactivity_class'])
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL1140,NC(=O)c1cccnc1,23100.0,inactive
1,CHEMBL1140,NC(=O)c1cccnc1,420000.0,inactive
2,CHEMBL252556,COC1=C(OC)C(=O)C(CCCCCCCCCCO)=C(C)C1=O,1900.0,intermediate
3,CHEMBL3430999,O=C(c1cc2cc(Cl)ccc2[nH]1)N1CCCCC1c1cccnc1,18200.0,inactive
4,CHEMBL3431191,CC(=O)Nc1ccc(-c2noc(CC3CCCN(Cc4cccc(F)c4C)C3)n...,14000.0,inactive
5,CHEMBL3431127,COc1ccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)cc...,23700.0,inactive
6,CHEMBL4459211,CCC(c1ccc(Oc2ccc(OC)cc2)cc1)N(C)c1ncc2c(N)nc(N...,12800.0,inactive
7,CHEMBL4571708,COc1ccc(Oc2ccc(C(C(C)C)N(C)c3ncc4c(N)nc(N)nc4n...,27700.0,inactive
8,CHEMBL4442390,COc1ccc(Oc2ccc(C(c3ccccc3)N(C)c3ncc4c(N)nc(N)n...,2340.0,intermediate
9,CHEMBL4529336,COc1cccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)c...,23100.0,inactive


### **Alternative method (better)**

In [33]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL1140,NC(=O)c1cccnc1,23100.0
1,CHEMBL1140,NC(=O)c1cccnc1,420000.0
2,CHEMBL252556,COC1=C(OC)C(=O)C(CCCCCCCCCCO)=C(C)C1=O,1900.0
3,CHEMBL3430999,O=C(c1cc2cc(Cl)ccc2[nH]1)N1CCCCC1c1cccnc1,18200.0
4,CHEMBL3431191,CC(=O)Nc1ccc(-c2noc(CC3CCCN(Cc4cccc(F)c4C)C3)n...,14000.0
5,CHEMBL3431127,COc1ccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)cc...,23700.0
6,CHEMBL4459211,CCC(c1ccc(Oc2ccc(OC)cc2)cc1)N(C)c1ncc2c(N)nc(N...,12800.0
7,CHEMBL4571708,COc1ccc(Oc2ccc(C(C(C)C)N(C)c3ncc4c(N)nc(N)nc4n...,27700.0
8,CHEMBL4442390,COc1ccc(Oc2ccc(C(c3ccccc3)N(C)c3ncc4c(N)nc(N)n...,2340.0
9,CHEMBL4529336,COc1cccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)c...,23100.0


In [34]:
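# Note: the result is only displayed here, not assigned, so df3 itself does not gain the class column
pd.concat([df3, pd.Series(bioactivity_class)], axis=1)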

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,0
0,CHEMBL1140,NC(=O)c1cccnc1,23100.0,inactive
1,CHEMBL1140,NC(=O)c1cccnc1,420000.0,inactive
2,CHEMBL252556,COC1=C(OC)C(=O)C(CCCCCCCCCCO)=C(C)C1=O,1900.0,intermediate
3,CHEMBL3430999,O=C(c1cc2cc(Cl)ccc2[nH]1)N1CCCCC1c1cccnc1,18200.0,inactive
4,CHEMBL3431191,CC(=O)Nc1ccc(-c2noc(CC3CCCN(Cc4cccc(F)c4C)C3)n...,14000.0,inactive
5,CHEMBL3431127,COc1ccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)cc...,23700.0,inactive
6,CHEMBL4459211,CCC(c1ccc(Oc2ccc(OC)cc2)cc1)N(C)c1ncc2c(N)nc(N...,12800.0,inactive
7,CHEMBL4571708,COc1ccc(Oc2ccc(C(C(C)C)N(C)c3ncc4c(N)nc(N)nc4n...,27700.0,inactive
8,CHEMBL4442390,COc1ccc(Oc2ccc(C(c3ccccc3)N(C)c3ncc4c(N)nc(N)n...,2340.0,intermediate
9,CHEMBL4529336,COc1cccc(Oc2ccc(C(C)N(C)c3ncc4c(N)nc(N)nc4n3)c...,23100.0,inactive
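
One caveat with `pd.concat(..., axis=1)` is that it aligns on the index: if rows had been dropped earlier, the frame's index would have gaps while the new Series would be numbered 0, 1, 2, and the labels would land on the wrong rows. A minimal sketch of an alternative, assuming the label list has one entry per row, is `assign`, which attaches a list positionally and also gives the column a proper name. The toy frame below uses a gapped index on purpose.

```python
import pandas as pd

# Toy frame with a gap in the index, as if row 1 had been dropped.
df3 = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL1140", "CHEMBL252556"],
        "standard_value": [23100.0, 1900.0],
    },
    index=[0, 2],
)
bioactivity_class = ["inactive", "intermediate"]

# assign() with a list matches by position, not by index label.
labeled = df3.assign(bioactivity_class=bioactivity_class)
print(labeled)

# For comparison: concat aligns on index labels and yields 3 rows with NaNs.
misaligned = pd.concat([df3, pd.Series(bioactivity_class)], axis=1)
print(len(misaligned))
```

Saving `labeled` instead of `df3` would then write the class column to the CSV as well.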


Save dataframe to CSV

In [35]:
df3.to_csv('bioactivity_preprocessed_schisto_data.csv', index=False)

In [36]:
! ls -l

total 28
-rw-r--r-- 1 root root 13120 Apr 29 20:03 bioactivity_data_schisto.csv
-rw-r--r-- 1 root root  1399 Apr 29 20:33 bioactivity_preprocessed_schisto_data.csv
drwx------ 5 root root  4096 Apr 29 20:05 gdrive
drwxr-xr-x 1 root root  4096 Apr 27 13:35 sample_data


### Copy to GDrive

In [37]:
! cp bioactivity_preprocessed_schisto_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [38]:
! ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv	      bioactivity_preprocessed_data.csv
bioactivity_data_schisto.csv  bioactivity_preprocessed_schisto_data.csv
