# Molecular Modeling Project Work: Computational Drug Discovery (Tegegne E.)


### Part_I - Download Bioactivity Data
 - In this part, bioactivity data are collected from the ChEMBL database and preprocessed.



## **ChEMBL Database**

ChEMBL is a database of curated bioactivity data for more than 2.1 million compounds. It is compiled from more than 81,000 documents and 1.4 million assays, and the data span 14,000 targets, 2,800 cells, and 40,000 indications. [Data as of December 13, 2021; ChEMBL version 29].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [1]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.2-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-24.1.2-py3-none-any.whl (66 kB)

## **Importing libraries**

In [2]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## Search for the target protein thioredoxin-glutathione reductase (TGR)
### Thioredoxin-glutathione reductase (TGR)
 - TGR is an enzyme that plays a significant role in maintaining cellular redox homeostasis by reducing thioredoxin and glutathione disulfides.


### **Target search for *Schistosoma mansoni***

In [3]:
# Target search for Schistosoma mansoni
target = new_client.target
target_query = target.search('Schistosoma mansoni')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Schistosoma mansoni,Schistosoma mansoni,34.0,False,CHEMBL612893,[],ORGANISM,6183
1,[],Schistosoma japonicum,Schistosoma japonicum,16.0,False,CHEMBL612644,[],ORGANISM,6182
2,[],Schistosoma haematobium,Schistosoma haematobium,16.0,False,CHEMBL4296598,[],ORGANISM,6185
3,[],Schistosoma mansoni,Thioredoxin glutathione reductase,11.0,False,CHEMBL6110,"[{'accession': 'Q962Y6', 'component_descriptio...",SINGLE PROTEIN,6183
4,[],Schistosoma mansoni,Thioredoxin peroxidase,11.0,False,CHEMBL1293279,"[{'accession': 'O97161', 'component_descriptio...",SINGLE PROTEIN,6183
5,[],Schistosoma mansoni,Voltage-activated calcium channel beta 1 subunit,11.0,False,CHEMBL2363079,"[{'accession': 'Q95US7', 'component_descriptio...",SINGLE PROTEIN,6183
6,[],Schistosoma mansoni,Voltage-activated calcium channel beta 2 subunit,11.0,False,CHEMBL2363080,"[{'accession': 'Q962H3', 'component_descriptio...",SINGLE PROTEIN,6183
7,[],Schistosoma mansoni,Histone deacetylase 8,11.0,False,CHEMBL3797017,"[{'accession': 'A5H660', 'component_descriptio...",SINGLE PROTEIN,6183
8,[],Schistosoma mansoni,NAD-dependent protein deacetylase,11.0,False,CHEMBL4523517,"[{'accession': 'T1VXA1', 'component_descriptio...",SINGLE PROTEIN,6183
9,[],Schistosoma mansoni,"Dihydroorotate dehydrogenase (quinone), mitoch...",11.0,False,CHEMBL4523950,"[{'accession': 'G4VFD7', 'component_descriptio...",SINGLE PROTEIN,6183


### Select and retrieve bioactivity data for the *Schistosoma mansoni* target protein thioredoxin glutathione reductase (smTGR; index 3)

- The entry at index 3 (the fourth row, which corresponds to the target protein *smTGR*) is assigned to the ***selected_target*** variable.


In [4]:
df = pd.DataFrame.from_dict(target_query)

In [5]:
selected_target = targets.target_chembl_id[3]
selected_target

'CHEMBL6110'
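
Selecting by position works here, but the row order of a search result is not guaranteed to stay the same between ChEMBL releases. As a hedged alternative, the sketch below selects the target by name, organism, and target type instead. It uses a small hand-copied subset of the table above rather than a live query, so the `targets` frame in it is a stand-in for the real one.

```python
import pandas as pd

# Toy subset of the search result shown above (columns trimmed for brevity).
targets = pd.DataFrame({
    "organism": ["Schistosoma mansoni", "Schistosoma japonicum",
                 "Schistosoma mansoni", "Schistosoma mansoni"],
    "pref_name": ["Schistosoma mansoni", "Schistosoma japonicum",
                  "Thioredoxin glutathione reductase", "Thioredoxin peroxidase"],
    "target_chembl_id": ["CHEMBL612893", "CHEMBL612644",
                         "CHEMBL6110", "CHEMBL1293279"],
    "target_type": ["ORGANISM", "ORGANISM", "SINGLE PROTEIN", "SINGLE PROTEIN"],
})

# Match on content rather than on row position.
mask = (
    (targets["pref_name"] == "Thioredoxin glutathione reductase")
    & (targets["organism"] == "Schistosoma mansoni")
    & (targets["target_type"] == "SINGLE PROTEIN")
)
matches = targets.loc[mask, "target_chembl_id"]

# Fail loudly if the search returned zero or several candidates.
if len(matches) != 1:
    raise ValueError(f"Expected exactly one matching target, found {len(matches)}")

selected_target = matches.iloc[0]
print(selected_target)
```

With the real search result, the same mask can be applied to the `targets` DataFrame built in the cell above.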

Here, only bioactivity data for thioredoxin glutathione reductase (CHEMBL6110) reported as IC$_{50}$ values will be retrieved. The **standard_value** column used in the later steps holds these values in nM (nanomolar) units.

In [6]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [7]:
df = pd.DataFrame.from_dict(res)

In [8]:
df.head(100)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,2931445,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.35
1,,,2931446,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.4
2,,,2931447,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.48
3,,,2931448,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,1.0
4,,,2931449,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,3.5
5,,,2931450,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.063
6,,,2931451,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,3.5
7,,,2931452,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,2.8
8,,,2931453,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,17.9
9,,,2931454,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,15.8
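
Note that the `value` and `units` columns in the table hold the figures as reported (here in uM), while `standard_value` holds the standardized figure (350.0 for the first row). The sketch below shows one way to confirm the units before relying on the thresholds used later. It assumes the frame has a `standard_units` column, which is hidden behind the `...` in the output above, and it uses a toy frame whose `standard_units` entries are assumed for illustration.

```python
import pandas as pd

# Toy records modeled on the first rows above; the standard_units values are
# an assumption, since that column is elided in the displayed output.
records = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL576624", "CHEMBL569420", "CHEMBL576265"],
    "value": ["0.35", "0.4", "0.063"],
    "units": ["uM", "uM", "uM"],
    "standard_value": ["350.0", "400.0", "63.0"],
    "standard_units": ["nM", "nM", "nM"],
})

# Values may arrive as strings, so convert them to numbers first.
records["standard_value"] = pd.to_numeric(records["standard_value"], errors="coerce")

# Keep only rows whose standardized unit is nM.
in_nm = records[records["standard_units"] == "nM"]
print(in_nm[["molecule_chembl_id", "standard_value", "standard_units"]])
```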


Finally, we save the resulting bioactivity data to the CSV file **smTGR_bioactivity_IC50.csv**.

In [9]:
df.to_csv('smTGR_bioactivity_IC50.csv', index=False)

## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [10]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,2931445,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.35
1,,,2931446,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.4
2,,,2931447,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.48
3,,,2931448,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,1.0
4,,,2931449,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,3.5
5,,,2931450,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,0.063
6,,,2931451,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,3.5
7,,,2931452,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,2.8
8,,,2931453,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,17.9
9,,,2931454,[],CHEMBL1048278,Inhibition of Schistosoma mansoni TGR,B,,,BAO_0000190,...,Schistosoma mansoni,Thioredoxin glutathione reductase,6183,,,IC50,uM,UO_0000065,,15.8


This dataset appears to have no missing values, but the code cell above can be reused for bioactivity data of other target proteins.
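
To illustrate what the filter does when values are missing, here is a minimal sketch on a toy frame. The compound IDs in it are hypothetical placeholders, not real ChEMBL entries.

```python
import pandas as pd

# Toy frame with one missing standard_value (IDs are hypothetical).
toy = pd.DataFrame({
    "molecule_chembl_id": ["HYPOTHETICAL_A", "HYPOTHETICAL_B", "HYPOTHETICAL_C"],
    "standard_value": ["350.0", None, "15800.0"],
})

# Same filter as in the cell above.
kept = toy[toy.standard_value.notna()]

print(len(toy) - len(kept), "row(s) dropped")
print(kept.index.tolist())
```

The surviving rows keep their original index labels (0 and 2 in this toy case), which matters later when other columns are attached by index.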

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less are considered **active**, while those with values of 10,000 nM or more are considered **inactive**. Values between 1,000 and 10,000 nM are labeled **intermediate**.

In [11]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
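
The same labeling can also be written without an explicit loop. The sketch below is one possible vectorized variant using `numpy.select`, applied to the ten `standard_value` figures of this dataset. The names `values` and `labels` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# The ten standard_value figures (nM) from this dataset.
values = pd.Series(
    [350.0, 400.0, 480.0, 1000.0, 3500.0, 63.0, 3500.0, 2800.0, 17900.0, 15800.0],
    name="standard_value",
)

# Same thresholds as the loop: >= 10000 inactive, <= 1000 active, else intermediate.
labels = pd.Series(
    np.select(
        [values >= 10000, values <= 1000],
        ["inactive", "active"],
        default="intermediate",
    ),
    index=values.index,
    name="bioactivity_class",
)

print(labels.tolist())
```

Building the result with `index=values.index` keeps the labels aligned with the rows they describe.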

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_class into a DataFrame**

In [12]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL576624,N#Cc1c(-c2csc(-c3no[n+]([O-])c3C#N)c2)no[n+]1[O-],350.0
1,CHEMBL569420,N#Cc1c(-c2ccc(-c3no[n+]([O-])c3C#N)s2)no[n+]1[O-],400.0
2,CHEMBL570121,N#Cc1c(-c2cc(F)cc(-c3no[n+]([O-])c3C#N)c2)no[n...,480.0
3,CHEMBL582970,N#Cc1c(-c2ccc(-c3no[n+]([O-])c3C#N)cc2)no[n+]1...,1000.0
4,CHEMBL575013,N#Cc1c(-c2cccc(-c3no[n+]([O-])c3C#N)c2)no[n+]1...,3500.0
5,CHEMBL576265,N#Cc1c(C(=O)c2cccs2)no[n+]1[O-],63.0
6,CHEMBL583780,N#Cc1c(-c2cccs2)no[n+]1[O-],3500.0
7,CHEMBL565242,N#Cc1c(-c2ccco2)no[n+]1[O-],2800.0
8,CHEMBL570132,N#Cc1c(-c2ccc(O)cc2)no[n+]1[O-],17900.0
9,CHEMBL576082,N#Cc1c(-c2ccc(-c3ccccc3)cc2)no[n+]1[O-],15800.0


In [13]:
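# Give the labels the same index as df3 so that concat aligns row by row,
# even when rows were dropped while handling missing data.
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4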

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL576624,N#Cc1c(-c2csc(-c3no[n+]([O-])c3C#N)c2)no[n+]1[O-],350.0,active
1,CHEMBL569420,N#Cc1c(-c2ccc(-c3no[n+]([O-])c3C#N)s2)no[n+]1[O-],400.0,active
2,CHEMBL570121,N#Cc1c(-c2cc(F)cc(-c3no[n+]([O-])c3C#N)c2)no[n...,480.0,active
3,CHEMBL582970,N#Cc1c(-c2ccc(-c3no[n+]([O-])c3C#N)cc2)no[n+]1...,1000.0,active
4,CHEMBL575013,N#Cc1c(-c2cccc(-c3no[n+]([O-])c3C#N)c2)no[n+]1...,3500.0,intermediate
5,CHEMBL576265,N#Cc1c(C(=O)c2cccs2)no[n+]1[O-],63.0,active
6,CHEMBL583780,N#Cc1c(-c2cccs2)no[n+]1[O-],3500.0,intermediate
7,CHEMBL565242,N#Cc1c(-c2ccco2)no[n+]1[O-],2800.0,intermediate
8,CHEMBL570132,N#Cc1c(-c2ccc(O)cc2)no[n+]1[O-],17900.0,inactive
9,CHEMBL576082,N#Cc1c(-c2ccc(-c3ccccc3)cc2)no[n+]1[O-],15800.0,inactive
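
Because `pd.concat(..., axis=1)` aligns rows by index label rather than by position, a label Series with a default 0..n-1 index only lines up when no rows were dropped earlier. The toy sketch below shows the difference, assuming a frame whose row 1 was removed in the missing-data step.

```python
import pandas as pd

# Toy frame whose index has a gap, as after dropping a row with missing data.
df3 = pd.DataFrame({"standard_value": [350.0, 15800.0]}, index=[0, 2])
labels = ["active", "inactive"]

# Default index (0, 1) does not match df3's index (0, 2).
misaligned = pd.concat(
    [df3, pd.Series(labels, name="bioactivity_class")], axis=1
)

# Reusing df3's index keeps each label on its own row.
aligned = pd.concat(
    [df3, pd.Series(labels, name="bioactivity_class", index=df3.index)], axis=1
)

print(misaligned)
print(aligned)
```

In the misaligned result an extra row appears and one label ends up as NaN. In this notebook no rows were dropped, so both approaches give the same table.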


Save the preprocessed DataFrame to a CSV file.

In [14]:
df4.to_csv('smTGR_bioactivity_IC50_preprocessed.csv', index=False)
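
The `index=False` argument keeps the row index out of the file. As a small sketch of why that is useful, the round trip below writes a toy frame to an in-memory CSV both ways and reads it back. No file is created, and the frame is a two-row stand-in for `df4`.

```python
import io
import pandas as pd

# Two-row stand-in for the preprocessed table.
df4 = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL576624", "CHEMBL570132"],
    "standard_value": [350.0, 17900.0],
    "bioactivity_class": ["active", "inactive"],
})

# With index=False the columns survive the round trip unchanged.
without_index = pd.read_csv(io.StringIO(df4.to_csv(index=False)))

# With the default index=True, the index comes back as an extra unnamed column.
with_index = pd.read_csv(io.StringIO(df4.to_csv()))

print(list(without_index.columns))
print(list(with_index.columns))
```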

In [15]:
! ls -l

total 44
drwxr-xr-x 1 root root  4096 Nov 15 14:19 sample_data
-rw-r--r-- 1 root root 32234 Nov 19 07:00 smTGR_bioactivity_IC50.csv
-rw-r--r-- 1 root root  4135 Nov 19 07:11 smTGR_bioactivity_IC50_preprocessed.csv


---