# **Bioinformation Poliovirus - Data Collection and Pre-Processing**

Muhammad Ikhwan Bin Baharuddin

[*Ikhwan Github Profile*](https://github.com/Ikhwen)

In this Jupyter notebook, I will be building a real-life **data science project** that includes a machine learning model using the ChEMBL bioactivity data.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 81,000 documents and 1.3 million assays, and the data span 14,000 targets, 1,900 cells, and 33,000 indications.
[Data as of December 8, 2021]

## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.7-py3-none-any.whl (55 kB)
Collecting requests-cache~=0.7.0
  Downloading requests_cache-0.7.5-py3-none-any.whl (39 kB)
Collecting url-normalize<2.0,>=1.4
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl (6.8 kB)
Collecting itsdangerous>=2.0.1
  Downloading itsdangerous-2.0.1-py3-none-any.whl (18 kB)
Collecting pyyaml>=5.4
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

## **Importing libraries**

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

For more information, please refer to the [*ChEMBL webresource client GitHub*](https://github.com/chembl/chembl_webresource_client/) repository.

In [None]:
dir(new_client)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'activity',
 'activity_supplementary_data_by_activity',
 'assay',
 'assay_class',
 'atc_class',
 'binding_site',
 'biotherapeutic',
 'cell_line',
 'chembl_id_lookup',
 'compound_record',
 'compound_structural_alert',
 'description',
 'document',
 'document_similarity',
 'drug',
 'drug_indication',
 'go_slim',
 'image',
 'mechanism',
 'metabolism',
 'molecule',
 'molecule_form',
 'official',
 'organism',
 'protein_class',
 'similarity',
 'source',
 'substructure',
 'target',
 'target_component',
 'target_relation',
 'tissue',
 'xref_source']
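
Most of the names in the listing above are Python internals; only the entries without a leading underscore are of interest here. A minimal sketch of filtering them out follows. It uses a `SimpleNamespace` stand-in (an assumption for illustration, so the snippet runs without network access), and `public_names` is a hypothetical helper name; the same comprehension could be applied to `new_client`.

```python
from types import SimpleNamespace


def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Stand-in object carrying a few of the attribute names listed above.
fake_client = SimpleNamespace(activity=None, molecule=None, target=None)

resources = public_names(fake_client)
print(resources)
```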

## **Determining the search**

### **Target search for poliovirus**

In [None]:
target = new_client.target
target_query = target.search('poliovirus')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Nectin-4,19.0,False,CHEMBL3712928,"[{'accession': 'Q96NY8', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Human enterovirus C,Poliovirus,17.0,False,CHEMBL612462,[],ORGANISM,138950
2,[],Human poliovirus 1,Human poliovirus 1,13.0,False,CHEMBL613556,[],ORGANISM,12080
3,[],Human poliovirus 3,Human poliovirus 3,13.0,False,CHEMBL613557,[],ORGANISM,12086
4,[],Human poliovirus 2,Human poliovirus 2,13.0,False,CHEMBL613752,[],ORGANISM,12083
5,"[{'xref_id': 'P03300', 'xref_name': None, 'xre...",Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,11.0,False,CHEMBL5127,"[{'accession': 'P03300', 'component_descriptio...",SINGLE PROTEIN,12081
6,[],Human poliovirus 1 strain Sabin,Human poliovirus 1 strain Sabin,10.0,False,CHEMBL2366966,[],ORGANISM,12082


### **Select and retrieve bioactivity data for *Poliovirus* (entry at index 5)**


We will assign the entry at index 5 (the sixth row, which corresponds to the target protein) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[5]
selected_target

'CHEMBL5127'
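
Selecting by position depends on the order in which the search returns its results. As a hedged alternative, the sketch below picks the target by its `target_type` and organism instead. The records are a subset of the search results shown above with most fields omitted, and `pick_target` is a hypothetical helper, not part of the ChEMBL client.

```python
# A few records copied from the search results above (other fields omitted).
target_records = [
    {"organism": "Homo sapiens", "pref_name": "Nectin-4",
     "target_chembl_id": "CHEMBL3712928", "target_type": "SINGLE PROTEIN"},
    {"organism": "Human enterovirus C", "pref_name": "Poliovirus",
     "target_chembl_id": "CHEMBL612462", "target_type": "ORGANISM"},
    {"organism": "Human poliovirus 1 Mahoney",
     "pref_name": "Poliovirus type 1 polyprotein",
     "target_chembl_id": "CHEMBL5127", "target_type": "SINGLE PROTEIN"},
]


def pick_target(records, target_type, organism_keyword):
    """Return the ChEMBL ID of the first record matching type and organism keyword."""
    for record in records:
        if (record["target_type"] == target_type
                and organism_keyword.lower() in record["organism"].lower()):
            return record["target_chembl_id"]
    raise LookupError("no matching target found")


picked_target = pick_target(target_records, "SINGLE PROTEIN", "poliovirus")
print(picked_target)
```

Nectin-4 is skipped because its organism is *Homo sapiens*, and the organism-level entries are skipped because of their target type.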

In [None]:
# list the organisms present in the search results

targets.organism.unique()

array(['Homo sapiens', 'Human enterovirus C', 'Human poliovirus 1',
       'Human poliovirus 3', 'Human poliovirus 2',
       'Human poliovirus 1 Mahoney', 'Human poliovirus 1 strain Sabin'],
      dtype=object)

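Here, we retrieve the bioactivity data for *Poliovirus type 1 polyprotein* (CHEMBL5127), with an additional `organism='homo sapiens'` filter. Note that the returned rows report `target_organism` as *Human poliovirus 1 Mahoney*.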

In [None]:
activity = new_client.activity  # the activity resource is obtained from new_client
res = activity.filter(target_chembl_id=selected_target).filter(organism='homo sapiens')
df = pd.DataFrame.from_dict(res)
df.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,144743,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(O)c1ccc2c(c1)nc(-c1cccs1)n2C1CCCCC1,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174629,,CHEMBL174629,,False,http://www.openphacts.org/units/Nanomolar,350900,>,1,True,>,,IC50,nM,,250000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,250.0
1,,145888,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(O)c1ccc2c(c1)nc(-c1ccncc1)n2C1CCCCC1,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174836,,CHEMBL174836,,False,http://www.openphacts.org/units/Nanomolar,350893,>,1,True,>,,IC50,nM,,500000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,500.0
2,,147043,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1CCCCC1n1c(-c2ccccn2)nc2cc(C(=O)O)ccc21,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174542,,CHEMBL174542,,False,http://www.openphacts.org/units/Nanomolar,350881,>,1,True,>,,IC50,nM,,500000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,500.0


Finally, we will save the resulting bioactivity data to a CSV file, **Poliovirus(homosapiens).csv**.

In [None]:
df.to_csv('Poliovirus(homosapiens).csv', index=False)
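
If the notebook may run either locally or in Google Colab, the local save and the Colab download can be folded into one helper. The sketch below is one possible way to do that. `running_in_colab` and `save_frame` are hypothetical names, and the only assumption about `frame` is that it has a pandas-style `to_csv` method.

```python
def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


def save_frame(frame, filename: str) -> str:
    """Write frame to CSV and, inside Colab, also offer the file as a download."""
    frame.to_csv(filename, index=False)
    if running_in_colab():
        from google.colab import files
        files.download(filename)
    return filename


# Usage in the notebook (df is the dataframe built above):
# save_frame(df, 'Poliovirus(homosapiens).csv')
```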

Saving a CSV file when using Google Colab

In [None]:
from google.colab import files
df.to_csv('Poliovirus(homosapiens).csv', index=False)
files.download('Poliovirus(homosapiens).csv')


## **Handling missing data**
If any compound has a missing value in the **standard_value** column, drop it.

In [None]:
df2 = df[df.standard_value.notna()]
df2.head(3)

Unnamed: 0,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,144743,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(O)c1ccc2c(c1)nc(-c1cccs1)n2C1CCCCC1,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174629,,CHEMBL174629,,False,http://www.openphacts.org/units/Nanomolar,350900,>,1,True,>,,IC50,nM,,250000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,250.0
1,,145888,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,O=C(O)c1ccc2c(c1)nc(-c1ccncc1)n2C1CCCCC1,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174836,,CHEMBL174836,,False,http://www.openphacts.org/units/Nanomolar,350893,>,1,True,>,,IC50,nM,,500000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,500.0
2,,147043,[],CHEMBL771458,Inhibitory activity against polio virus RNA po...,B,,,BAO_0000190,BAO_0000357,single protein format,CC1CCCCC1n1c(-c2ccccn2)nc2cc(C(=O)O)ccc21,Outside typical range,Values for this activity type are unusually la...,CHEMBL1149223,Bioorg. Med. Chem. Lett.,2004,,CHEMBL174542,,CHEMBL174542,,False,http://www.openphacts.org/units/Nanomolar,350881,>,1,True,>,,IC50,nM,,500000.0,CHEMBL5127,Human poliovirus 1 Mahoney,Poliovirus type 1 polyprotein,12081,,,IC50,uM,UO_0000065,,500.0


This dataset appears to have no missing values in that column, but the code cell above can be reused for the bioactivity data of other target proteins.
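
The same filter can also be written with `dropna`, which makes it easy to require more than one column at once. The sketch below uses a small toy dataframe (illustrative values, not ChEMBL data) and additionally requires `canonical_smiles`, on the assumption that rows without a structure are not useful for the later steps.

```python
import pandas as pd

# Toy records standing in for ChEMBL activity rows (values are illustrative only).
toy = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C", "MOL_D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCN"],
    "standard_value": ["250000.0", "500000.0", None, "1200.0"],
})

# Keep only rows where both required columns are present.
cleaned = toy.dropna(subset=["standard_value", "canonical_smiles"])

print(f"dropped {len(toy) - len(cleaned)} of {len(toy)} rows")
```

Printing the number of dropped rows gives a quick check of whether a dataset had any gaps in the first place.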

## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity values are IC50 measurements in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those in between are labeled **intermediate**.
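
These thresholds only apply if every row really is an IC50 reported in nM. A minimal sketch of that check is given below, using three fields copied from the activity rows shown earlier. With the full dataframe, `df2.standard_units.value_counts()` would give a comparable summary. Note also that the rows shown have a `standard_relation` of `>`, meaning the reported value is a lower bound rather than an exact measurement.

```python
from collections import Counter

# Three fields copied from the activity rows shown earlier.
records = [
    {"standard_type": "IC50", "standard_units": "nM", "standard_relation": ">"},
    {"standard_type": "IC50", "standard_units": "nM", "standard_relation": ">"},
    {"standard_type": "IC50", "standard_units": "nM", "standard_relation": ">"},
]

# Count each (type, unit) combination and each relation symbol.
type_unit_counts = Counter(
    (r["standard_type"], r["standard_units"]) for r in records
)
relation_counts = Counter(r["standard_relation"] for r in records)

print(type_unit_counts)
print(relation_counts)

# A single (type, unit) pair means one set of thresholds applies to every row.
single_scale = len(type_unit_counts) == 1
```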

In [None]:
poliovirus_class = []

# Label each compound by its IC50 value (nM)
for i in df2.standard_value:
  if float(i) >= 10000:
    poliovirus_class.append("inactive")
  elif float(i) <= 1000:
    poliovirus_class.append("active")
  else:
    poliovirus_class.append("intermediate")
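
The labeling rule can also be wrapped in a small function so that the boundary cases are easy to check. The sketch below mirrors the comparisons in the loop above; `classify_ic50` and the two constant names are hypothetical, chosen for illustration.

```python
ACTIVE_MAX_NM = 1_000      # at or below this IC50 (nM): active
INACTIVE_MIN_NM = 10_000   # at or above this IC50 (nM): inactive


def classify_ic50(ic50_nm) -> str:
    """Label one IC50 value (nM) using the same comparisons as the loop above."""
    value = float(ic50_nm)
    if value >= INACTIVE_MIN_NM:
        return "inactive"
    if value <= ACTIVE_MAX_NM:
        return "active"
    return "intermediate"


# Boundary values plus a string input, since the loop above casts with float().
examples = [500, 1000, 1000.1, 5000, 9999.9, 10000, "250000.0"]
labels = [classify_ic50(v) for v in examples]
print(list(zip(examples, labels)))
```

With this function, the whole list could be built as `[classify_ic50(v) for v in df2.standard_value]`.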

### **Collect the *molecule_chembl_id* values in a list**

In [None]:
mol_cid = []

for i in df2.molecule_chembl_id:
  mol_cid.append(i)

### **Collect the *canonical_smiles* values in a list**

In [None]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

### **Collect the *standard_value* values in a list**

In [None]:
standard_value = []
for i in df2.standard_value:
  standard_value.append(i)

### **Combine the 4 lists into a dataframe**

In [None]:
data_tuples = list(zip(mol_cid, canonical_smiles, poliovirus_class, standard_value))
df3 = pd.DataFrame(data_tuples, columns=['molecule_chembl_id', 'canonical_smiles', 'poliovirus_class', 'standard_value'])

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,poliovirus_class,standard_value
0,CHEMBL174629,O=C(O)c1ccc2c(c1)nc(-c1cccs1)n2C1CCCCC1,inactive,250000.0
1,CHEMBL174836,O=C(O)c1ccc2c(c1)nc(-c1ccncc1)n2C1CCCCC1,inactive,500000.0
2,CHEMBL174542,CC1CCCCC1n1c(-c2ccccn2)nc2cc(C(=O)O)ccc21,inactive,500000.0
3,CHEMBL173718,CC(C)=Cc1nc2cc(C(=O)O)ccc2n1C1CCCCC1,inactive,500000.0
4,CHEMBL178076,COc1ccc(CCNC(=O)c2ccc3c(c2)nc(-c2ccccn2)n3C2CC...,inactive,500000.0
5,CHEMBL362058,O=C(O)c1ccc2c(c1)nc(-c1ccco1)n2C1CCCCC1,inactive,500000.0
6,CHEMBL426227,O=C(O)c1ccc2c(c1)nc(-c1ncc[nH]1)n2C1CCCCC1,inactive,250000.0
7,CHEMBL366941,COc1ccc(CNC(=O)c2ccc3c(c2)nc(-c2ccccn2)n3C2CCC...,inactive,500000.0
8,CHEMBL175177,O=C(O)c1ccc2c(c1)nc(-c1ccoc1)n2C1CCCCC1,inactive,250000.0
9,CHEMBL368168,Cn1cccc1-c1nc2cc(C(=O)O)ccc2n1C1CCCCC1,inactive,500000.0


### **Alternative method**

In [None]:
# selection = ['molecule_chembl_id', 'canonical_smiles', 'standard_value']
# df3 = df2[selection]
# df3

In [None]:
# pd.concat([df3, pd.Series(poliovirus_class)], axis=1)
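
A runnable version of this alternative might look like the sketch below. It selects the columns directly and adds the class label as a new column, so no intermediate lists are needed. The small `df2` here is a stand-in built from three rows of the table above, and `label_activity` is a hypothetical helper that applies the same thresholds as the labeling loop. The resulting column order differs slightly from the zipped version, with the class label last.

```python
import pandas as pd


def label_activity(ic50_nm: float) -> str:
    """Apply the same IC50 thresholds (nM) as the labeling loop."""
    if ic50_nm >= 10000:
        return "inactive"
    if ic50_nm <= 1000:
        return "active"
    return "intermediate"


# Stand-in for df2: three rows from the table above plus one column to discard.
df2 = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL174629", "CHEMBL174836", "CHEMBL174542"],
    "canonical_smiles": [
        "O=C(O)c1ccc2c(c1)nc(-c1cccs1)n2C1CCCCC1",
        "O=C(O)c1ccc2c(c1)nc(-c1ccncc1)n2C1CCCCC1",
        "CC1CCCCC1n1c(-c2ccccn2)nc2cc(C(=O)O)ccc21",
    ],
    "standard_value": ["250000.0", "500000.0", "500000.0"],
    "standard_units": ["nM", "nM", "nM"],
})

selection = ["molecule_chembl_id", "canonical_smiles", "standard_value"]
df3 = df2[selection].copy()

# Convert to numbers once, then derive the label column from them.
df3["standard_value"] = pd.to_numeric(df3["standard_value"])
df3["poliovirus_class"] = df3["standard_value"].apply(label_activity)
print(df3)
```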

Save the dataframe to a CSV file

In [None]:
df3.to_csv('poliovirus_preprocessed_data.csv', index=False)

In [None]:
from google.colab import files
df3.to_csv('poliovirus_preprocessed_data.csv', index=False)
files.download('poliovirus_preprocessed_data.csv')


In [None]:
! ls -l

total 28
-rw-r--r-- 1 root root  1585 Dec  8 12:59  bioactivity_preprocessed_data.csv
drwx------ 6 root root  4096 Dec  8 10:22  gdrive
-rw-r--r-- 1 root root 11385 Dec  8 13:12 'Poliovirus(homosapiens).csv'
-rw-r--r-- 1 root root  1585 Dec  8 13:14  poliovirus_preprocessed_data.csv
drwxr-xr-x 1 root root  4096 Dec  3 14:33  sample_data


Let's copy the file to Google Drive

In [None]:
! cp poliovirus_preprocessed_data.csv "/content/gdrive/My Drive/Colab Notebooks/polio"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/polio"

'/content/gdrive/My Drive/Colab Notebooks/polio'
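
The listing above prints the path itself rather than a file name, which is what `ls` does when its argument is a regular file. This suggests that no folder called `polio` existed and that `cp` wrote the data to a file of that name instead. A hedged sketch of a copy step that creates the folder first is shown below; `copy_into_folder` is a hypothetical helper built on the standard library.

```python
import shutil
from pathlib import Path


def copy_into_folder(source, folder) -> Path:
    """Copy source into folder, creating the folder first if needed."""
    destination_dir = Path(folder)
    # Raises FileExistsError if the path already exists as a regular file.
    destination_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source, destination_dir))


# Usage in Colab with Drive mounted (paths taken from the cells above):
# copy_into_folder("poliovirus_preprocessed_data.csv",
#                  "/content/gdrive/My Drive/Colab Notebooks/polio")
```

If `polio` already exists as a regular file, it would have to be renamed or removed before the folder can be created.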


---