# **Bioinformatics Project - Computational Drug Discovery [Part 1]**
**Download CORONAVIRUS Bioactivity Data**


![img](https://www.paho.org/sites/default/files/styles/top_hero/public/2023-11/coronavirus-microscopic-view.jpg?h=4aea44c5&itok=28SpnJoB)

* In this Jupyter notebook, we will build a real-life **data science project** that you can include in your **data science portfolio**. Specifically, we will build a machine learning model using the ChEMBL bioactivity data.

* In **Part 1**, we will be performing Data Collection and Pre-Processing from the ChEMBL Database.

---

## **ChEMBL Database**

* The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications.
* [Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.0-py3-none-any.whl (61 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-23.2.3-py3-none-any.whl (57 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: url-normalize, cattrs, requests-cache, chembl_webresource_client
Successfully installed cattrs-23

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for Coronavirus**

In [None]:
# Target search for coronavirus
from tabulate import tabulate
target = new_client.target
target_query = target.search('Coronavirus')
targets = pd.DataFrame.from_dict(target_query)
#print(tabulate(targets.tail(5), headers = 'keys', tablefmt = 'psql'))
#targets
newtarget=targets.drop(['cross_references','target_components','organism'],axis=1)
print(tabulate(newtarget, headers = 'keys', tablefmt = 'psql'))

+----+------------------------------------------------------+---------+----------------------+--------------------+----------------+----------+
|    | pref_name                                            |   score | species_group_flag   | target_chembl_id   | target_type    |   tax_id |
|----+------------------------------------------------------+---------+----------------------+--------------------+----------------+----------|
|  0 | Coronavirus                                          |      17 | False                | CHEMBL613732       | ORGANISM       |    11119 |
|  1 | Feline coronavirus                                   |      14 | False                | CHEMBL612744       | ORGANISM       |    12663 |
|  2 | Murine coronavirus                                   |      14 | False                | CHEMBL5209664      | ORGANISM       |   694005 |
|  3 | Canine coronavirus                                   |      14 | False                | CHEMBL5291668      | ORGANISM       |    

### **Select and retrieve bioactivity data for *SARS coronavirus 3C-like proteinase* (Seventh entry)**

We will assign the seventh entry (index 6, which corresponds to the target protein, *SARS coronavirus 3C-like proteinase*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[6]
selected_target

'CHEMBL3927'
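
Selecting by position depends on the order in which the search results happen to come back. As a hedged alternative, the sketch below looks the target up by its `pref_name` instead. It works on plain dictionaries shaped like the search records; the helper name `find_target_id` is hypothetical, and the three records are a hand-copied subset of the table above, not a live query.

```python
# Illustrative records shaped like the dictionaries behind the search results;
# only the two fields needed here are included (copied from the table above).
records = [
    {"pref_name": "Coronavirus", "target_chembl_id": "CHEMBL613732"},
    {"pref_name": "Feline coronavirus", "target_chembl_id": "CHEMBL612744"},
    {"pref_name": "SARS coronavirus 3C-like proteinase", "target_chembl_id": "CHEMBL3927"},
]


def find_target_id(records, pref_name):
    """Return the ChEMBL ID of the first record whose pref_name matches exactly.

    Hypothetical helper: raises LookupError when no record matches.
    """
    for record in records:
        if record["pref_name"] == pref_name:
            return record["target_chembl_id"]
    raise LookupError(f"No target named {pref_name!r} in the search results")


selected_target = find_target_id(records, "SARS coronavirus 3C-like proteinase")
print(selected_target)
```

In the notebook itself, the same idea could be applied to the `targets` DataFrame by filtering on the `pref_name` column rather than indexing with `[6]`.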

Here, we will retrieve only bioactivity data for *SARS coronavirus 3C-like proteinase* (CHEMBL3927) that are reported as IC50 values.

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
print(tabulate(df.head(), headers = 'keys', tablefmt = 'psql'))


In [None]:
print("Shape of data", df.shape)
# info() must be called with parentheses; it prints the column summary itself
df.info()

Shape of data (133, 46) 

Finally, we will save the resulting bioactivity data to a CSV file, **bioactivity_data_raw.csv**.

In [None]:
df.to_csv('bioactivity_data_raw.csv', index=False)

**Copying the CSV file**

First, we need to mount Google Drive in Colab so that we can access our Google Drive from within Colab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive',force_remount=True)

Mounted at /content/gdrive


Next, we create a data folder inside the Project folder on Google Drive to save the file.

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/Project/data"


In [None]:
! cp bioactivity_data_raw.csv "/content/gdrive/My Drive/Colab Notebooks/Project/data"
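
The folder creation and copy can also be done from Python, which makes the cell safe to re-run when the folder already exists. The following is a minimal sketch using only the standard library; the helper name `copy_into_folder` is hypothetical, and the demonstration runs in a temporary directory rather than on the mounted Drive path.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_folder(source_file, target_dir):
    """Create target_dir if needed, then copy source_file into it.

    Hypothetical helper: exist_ok=True means an existing folder is not an error.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source_file, target))


# Demonstration in a throwaway directory (stand-in for the Drive folder).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "bioactivity_data_raw.csv"
    source.write_text("molecule_chembl_id,standard_value\n")
    data_dir = Path(tmp) / "Project" / "data"
    copy_into_folder(source, data_dir)
    copy_into_folder(source, data_dir)  # a second run does not raise
    copied_names = sorted(p.name for p in data_dir.iterdir())

print(copied_names)
```

In Colab, `target_dir` would be the Drive path used in the shell commands above.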

Let's see the CSV files that we have so far.

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/Project/data"

total 353
-rw------- 1 root root  15608 Mar 11 15:26 bioactivity_data_2class_pIC50.csv
-rw------- 1 root root  17579 Mar 11 15:26 bioactivity_data_3class_pIC50.csv
-rw------- 1 root root 248010 Mar 11 12:26 bioactivity_data_3class_pIC50_fp.csv
-rw------- 1 root root  10357 Mar 11 11:08 bioactivity_data_preprocessed.csv
-rw------- 1 root root  68403 Mar 17 12:25 bioactivity_data_raw.csv


Take a glimpse of the bioactivity_data_raw.csv file that we have created.

In [None]:
! head bioactivity_data_raw.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
,,1480935,[],CHEMBL829584,In vitro inhibitory concentration against SARS coronavirus main protease (SARS CoV 3C-like protease),B,,,BAO_0000190,BAO_0000357,single protein format,Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21,,,CHEMBL1139624,Bioorg Med Chem Lett,2005,"{'bei': '18.28', 'le': '0.33', 'lle': '3.25', 'sei'

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1480935,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,7.2
1,,,1480936,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,9.4
2,,,1481061,[],CHEMBL830868,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.5
3,,,1481065,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,13.11
4,,,1481066,[],CHEMBL829584,In vitro inhibitory concentration against SARS...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
128,,,12041507,[],CHEMBL2150313,Inhibition of SARS-CoV PLpro expressed in Esch...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,10.6
129,,,12041508,[],CHEMBL2150313,Inhibition of SARS-CoV PLpro expressed in Esch...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,10.1
130,,,12041509,[],CHEMBL2150313,Inhibition of SARS-CoV PLpro expressed in Esch...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,11.5
131,,,12041510,[],CHEMBL2150313,Inhibition of SARS-CoV PLpro expressed in Esch...,B,,,BAO_0000190,...,SARS coronavirus,SARS coronavirus 3C-like proteinase,227859,,,IC50,uM,UO_0000065,,10.7
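
The same rule (keep a record only when both fields are present) can be illustrated on plain dictionaries, before any DataFrame is built. This is a sketch under stated assumptions: the helper `has_required_fields` is hypothetical, it treats only `None` and empty strings as missing, and the two `HYPOTHETICAL_*` records are invented solely to show the filtering.

```python
REQUIRED_FIELDS = ("standard_value", "canonical_smiles")


def has_required_fields(record, fields=REQUIRED_FIELDS):
    """True when every required field is present and neither None nor empty."""
    return all(record.get(field) not in (None, "") for field in fields)


records = [
    # First record uses values shown in this notebook.
    {
        "molecule_chembl_id": "CHEMBL187579",
        "canonical_smiles": "Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21",
        "standard_value": "7200.0",
    },
    # Hypothetical records, invented to illustrate missing fields.
    {"molecule_chembl_id": "HYPOTHETICAL_1", "canonical_smiles": None, "standard_value": "9400.0"},
    {"molecule_chembl_id": "HYPOTHETICAL_2", "canonical_smiles": "CCO", "standard_value": None},
]

complete = [record for record in records if has_required_fields(record)]
print([record["molecule_chembl_id"] for record in complete])
```

On a DataFrame, pandas offers `dropna(subset=[...])` for the same purpose in a single call.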


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values, expressed in nM in the **standard_value** column. Compounds with values of 1,000 nM or less are considered **active**, while those with values of 10,000 nM or more are considered **inactive**. Values between 1,000 and 10,000 nM are labeled **intermediate**.

In [None]:
bioactivity_threshold = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
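
The thresholding logic can also be wrapped in a small function so that the cut-offs are named and easy to test. This is a minimal sketch, not the notebook's own code: the function name `classify_ic50` and the parameters `active_max` and `inactive_min` are hypothetical, and the sample values are the first five `standard_value` entries shown later in this notebook.

```python
def classify_ic50(value_nm, active_max=1000.0, inactive_min=10000.0):
    """Label an IC50 value (in nM) as active, inactive, or intermediate.

    Boundaries are inclusive, matching the loop above:
    value >= inactive_min -> "inactive", value <= active_max -> "active".
    """
    value = float(value_nm)  # the values may arrive as strings
    if value >= inactive_min:
        return "inactive"
    if value <= active_max:
        return "active"
    return "intermediate"


standard_values = ["7200.0", "9400.0", "13500.0", "13110.0", "2000.0"]
labels = [classify_ic50(value) for value in standard_values]
print(labels)
# ['intermediate', 'intermediate', 'inactive', 'inactive', 'intermediate']
```

With a function like this, the labels could be produced in one line, for example with a list comprehension over `df2.standard_value`.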

**Collect the molecule_chembl_id values into a list**

In [None]:
# list() copies the column values in order, same result as the explicit loop
mol_cid = list(df2.molecule_chembl_id)

**Collect the canonical_smiles values into a list**

In [None]:
canonical_smiles = list(df2.canonical_smiles)

**Collect the standard_value values into a list**

In [None]:
standard_value = list(df2.standard_value)

### **Combine the 3 columns (molecule_chembl_id,canonical_smiles,standard_value) and bioactivity_threshold into a DataFrame**

In [None]:
data_tuples=list(zip(mol_cid,canonical_smiles,standard_value,bioactivity_threshold))
df3=pd.DataFrame(data_tuples,columns=['molecule_chembl_id','canonical_smiles','standard_value','bioactivity_class'])

In [None]:
print(tabulate(df3.head(10), headers = 'keys', tablefmt = 'psql'))

+----+----------------------+-------------------------------------------------+------------------+---------------------+
|    | molecule_chembl_id   | canonical_smiles                                |   standard_value | bioactivity_class   |
|----+----------------------+-------------------------------------------------+------------------+---------------------|
|  0 | CHEMBL187579         | Cc1noc(C)c1CN1C(=O)C(=O)c2cc(C#N)ccc21          |             7200 | intermediate        |
|  1 | CHEMBL188487         | O=C1C(=O)N(Cc2ccc(F)cc2Cl)c2ccc(I)cc21          |             9400 | intermediate        |
|  2 | CHEMBL185698         | O=C1C(=O)N(CC2COc3ccccc3O2)c2ccc(I)cc21         |            13500 | inactive            |
|  3 | CHEMBL426082         | O=C1C(=O)N(Cc2cc3ccccc3s2)c2ccccc21             |            13110 | inactive            |
|  4 | CHEMBL187717         | O=C1C(=O)N(Cc2cc3ccccc3s2)c2c1cccc2[N+](=O)[O-] |             2000 | intermediate        |
|  5 | CHEMBL365134         | O=

Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

In [None]:
! cp bioactivity_data_preprocessed.csv "/content/gdrive/My Drive/Colab Notebooks/Project/data"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/Project/data"

total 353
-rw------- 1 root root  15608 Mar 11 15:26 bioactivity_data_2class_pIC50.csv
-rw------- 1 root root  17579 Mar 11 15:26 bioactivity_data_3class_pIC50.csv
-rw------- 1 root root 248010 Mar 11 12:26 bioactivity_data_3class_pIC50_fp.csv
-rw------- 1 root root  10357 Mar 17 12:25 bioactivity_data_preprocessed.csv
-rw------- 1 root root  68403 Mar 17 12:25 bioactivity_data_raw.csv


---