<a href="https://colab.research.google.com/github/Meet2197/ChEMBL/blob/main/bioinformatics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bioinformatics Project - Computational Drug Discovery

In this Jupyter notebook, we will build a real-life data science project that you can include in your data science portfolio. In particular, we will build a machine learning model using ChEMBL bioactivity data.

In Part 1, we collect bioactivity data from the ChEMBL Database and pre-process it.

Note for this concise version:

- Redundant code cells were deleted.
- Code cells for saving files to Google Drive have been deleted.

# ChEMBL Database

The ChEMBL Database contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells, and 33,000 indications. [Data as of March 25, 2020; ChEMBL version 26].

# Installing libraries

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
pip install chembl-webresource-client

In [88]:
import pandas as pd
from chembl_webresource_client.new_client import new_client
target = new_client.target
target_query = target.search('human immunodeficiency viruses')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Transcription factor HIVEP2,26.0,False,CHEMBL4523214,"[{'accession': 'P31629', 'component_descriptio...",SINGLE PROTEIN,9606
1,[],Human immunodeficiency virus,Human immunodeficiency virus,21.0,False,CHEMBL613758,[],ORGANISM,12721
2,"[{'xref_id': 'P15822', 'xref_name': None, 'xre...",Homo sapiens,Human immunodeficiency virus type I enhancer-b...,19.0,False,CHEMBL2909,"[{'accession': 'P15822', 'component_descriptio...",SINGLE PROTEIN,9606
3,[],Human immunodeficiency virus 1,Human immunodeficiency virus 1,18.0,False,CHEMBL378,[],ORGANISM,11676
4,[],Human immunodeficiency virus 2,Human immunodeficiency virus 2,18.0,False,CHEMBL380,[],ORGANISM,11709
...,...,...,...,...,...,...,...,...,...
179,"[{'xref_id': 'P35329', 'xref_name': None, 'xre...",Mus musculus,B-cell receptor CD22,4.0,False,CHEMBL1075279,"[{'accession': 'P35329', 'component_descriptio...",SINGLE PROTEIN,10090
180,[],Homo sapiens,Coagulation factor VII/tissue factor,4.0,False,CHEMBL2095194,"[{'accession': 'P08709', 'component_descriptio...",PROTEIN COMPLEX,9606
181,[],Homo sapiens,Cytochrome P450 1A,4.0,False,CHEMBL3544905,"[{'accession': 'P04798', 'component_descriptio...",PROTEIN FAMILY,9606
182,[],Homo sapiens,Cytochrome P450,3.0,False,CHEMBL4523986,"[{'accession': 'P08684', 'component_descriptio...",PROTEIN FAMILY,9606
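
The search returns 183 hits, and several of them (for example the mouse B-cell receptor CD22 or human cytochrome P450) are not viral targets. A minimal sketch of narrowing the list by organism follows. It works on plain dictionaries copied from a few of the rows above so that it runs without network access; `filter_hits` is a hypothetical helper name, not part of the ChEMBL client.

```python
# A few rows copied from the search output above (only the columns used here).
hits = [
    {"organism": "Homo sapiens", "pref_name": "Transcription factor HIVEP2",
     "target_chembl_id": "CHEMBL4523214", "target_type": "SINGLE PROTEIN"},
    {"organism": "Human immunodeficiency virus", "pref_name": "Human immunodeficiency virus",
     "target_chembl_id": "CHEMBL613758", "target_type": "ORGANISM"},
    {"organism": "Human immunodeficiency virus 1", "pref_name": "Human immunodeficiency virus 1",
     "target_chembl_id": "CHEMBL378", "target_type": "ORGANISM"},
    {"organism": "Mus musculus", "pref_name": "B-cell receptor CD22",
     "target_chembl_id": "CHEMBL1075279", "target_type": "SINGLE PROTEIN"},
]


def filter_hits(records, organism_contains):
    """Keep records whose organism contains the given text (case-insensitive)."""
    needle = organism_contains.lower()
    return [r for r in records if needle in r["organism"].lower()]


viral_hits = filter_hits(hits, "immunodeficiency virus")
viral_ids = [r["target_chembl_id"] for r in viral_hits]
print(viral_ids)  # ['CHEMBL613758', 'CHEMBL378']
```

On the real DataFrame, the same idea can likely be expressed as `targets[targets.organism.str.contains("immunodeficiency virus", case=False)]`.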


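# Selecting the target ID

Row 8 of the search results is HIV-1 protease (`CHEMBL243`), the target whose bioactivity data we retrieve below.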

In [89]:
selected_target = targets.target_chembl_id[8]
selected_target

'CHEMBL243'
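
Selecting by position (`[8]`) depends on the order of the search results, which may change between ChEMBL releases. As a hedged alternative, the sketch below looks the ID up by its preferred name and fails loudly if the name is missing or ambiguous. `id_for_name` is a hypothetical helper, and the three records are taken from names and IDs that appear in this notebook's outputs.

```python
def id_for_name(records, pref_name):
    """Return the target_chembl_id of the single record with this pref_name."""
    matches = [r["target_chembl_id"] for r in records if r["pref_name"] == pref_name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one target named {pref_name!r}, found {len(matches)}"
        )
    return matches[0]


# Names and IDs as they appear in this notebook's outputs.
records = [
    {"pref_name": "Human immunodeficiency virus 1", "target_chembl_id": "CHEMBL378"},
    {"pref_name": "Human immunodeficiency virus 2", "target_chembl_id": "CHEMBL380"},
    {"pref_name": "Human immunodeficiency virus type 1 protease",
     "target_chembl_id": "CHEMBL243"},
]

selected_target = id_for_name(records, "Human immunodeficiency virus type 1 protease")
print(selected_target)  # CHEMBL243
```

With the DataFrame from the search cell, `targets.to_dict("records")` would produce a list in the shape this helper expects.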

In [90]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [91]:
df = pd.DataFrame.from_dict(res)
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,32165,[],CHEMBL763424,Inhibitory concentration against HIV-1 Proteas...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,ug ml-1,UO_0000274,,30.0
1,,,32166,[],CHEMBL763424,Inhibitory concentration against HIV-1 Proteas...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,ug ml-1,UO_0000274,,93.0
2,,,32456,[],CHEMBL769366,In vivo antiviral activity (IC50) against HIV-...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,42.0
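
The three rows shown already mix `ug ml-1` and `nM` in the `units` column, so the raw `value` column is not directly comparable across records. A quick tally of units is a cheap sanity check before any modelling. The sketch below uses only the three rows printed above, stored as plain dictionaries; on the full DataFrame, `df.units.value_counts()` should give the same kind of summary.

```python
from collections import Counter

# activity_id, units, and value copied from the df.head(3) output above.
rows = [
    {"activity_id": 32165, "units": "ug ml-1", "value": 30.0},
    {"activity_id": 32166, "units": "ug ml-1", "value": 93.0},
    {"activity_id": 32456, "units": "nM", "value": 42.0},
]

# Count how many records report each unit.
unit_counts = Counter(r["units"] for r in rows)
for unit, n in unit_counts.most_common():
    print(f"{unit}: {n}")
```

The column listing later in this notebook also includes `standard_value` and `standard_units`, which are the fields the missing-data step relies on.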


In [92]:
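# index=False (the boolean) omits the row index; the string 'False' is truthy, so the index was still written
df.to_csv('bioactivity_data.csv', index=False)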

In [93]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


# Creating a directory in Google Drive

In [94]:
mkdir -p "/content/gdrive/My Drive/Colab Notebooks/data"


In [95]:
cp bioactivity_data.csv "/content/gdrive/My Drive/Colab Notebooks/data"

In [96]:
ls "/content/gdrive/My Drive/Colab Notebooks/data"

bioactivity_data.csv


# Listing the saved file with its timestamp

In [97]:
ls -l "/content/gdrive/My Drive/Colab Notebooks/data"

total 1972
-rw------- 1 root root 2018793 Oct  3 14:30 bioactivity_data.csv
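
The shell steps above (create the folder, copy the CSV, list the folder) can also be written in Python, which keeps the cell safe to re-run because `exist_ok=True` accepts a folder that already exists. This is a sketch under stated assumptions: `copy_to_folder` is a hypothetical helper name, and a temporary directory stands in for the mounted Drive path so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_folder(source, dest_dir):
    """Create dest_dir if needed, copy source into it, and return the new path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return Path(shutil.copy2(source, dest_dir))


# A temporary directory stands in for /content/gdrive/My Drive in this demo.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "bioactivity_data.csv"
    source.write_text("molecule_chembl_id\nCHEMBL9650\n")
    data_dir = Path(tmp) / "Colab Notebooks" / "data"

    copy_to_folder(source, data_dir)
    copied = copy_to_folder(source, data_dir)  # second run overwrites, no error
    listing = sorted(p.name for p in data_dir.iterdir())

print(copied.name, listing)
```

In Colab, after mounting the drive, the call would be `copy_to_folder("bioactivity_data.csv", "/content/gdrive/My Drive/Colab Notebooks/data")`.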


# Handling Missing Data

In [98]:
print(df.columns)

Index(['action_type', 'activity_comment', 'activity_id', 'activity_properties',
       'assay_chembl_id', 'assay_description', 'assay_type',
       'assay_variant_accession', 'assay_variant_mutation', 'bao_endpoint',
       'bao_format', 'bao_label', 'canonical_smiles', 'data_validity_comment',
       'data_validity_description', 'document_chembl_id', 'document_journal',
       'document_year', 'ligand_efficiency', 'molecule_chembl_id',
       'molecule_pref_name', 'parent_molecule_chembl_id', 'pchembl_value',
       'potential_duplicate', 'qudt_units', 'record_id', 'relation', 'src_id',
       'standard_flag', 'standard_relation', 'standard_text_value',
       'standard_type', 'standard_units', 'standard_upper_value',
       'standard_value', 'target_chembl_id', 'target_organism',
       'target_pref_name', 'target_tax_id', 'text_value', 'toid', 'type',
       'units', 'uo_units', 'upper_value', 'value'],
      dtype='object')


In [99]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,32165,[],CHEMBL763424,Inhibitory concentration against HIV-1 Proteas...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,ug ml-1,UO_0000274,,30.0
1,,,32166,[],CHEMBL763424,Inhibitory concentration against HIV-1 Proteas...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,ug ml-1,UO_0000274,,93.0
2,,,32456,[],CHEMBL769366,In vivo antiviral activity (IC50) against HIV-...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,42.0
3,,,32459,[],CHEMBL769366,In vivo antiviral activity (IC50) against HIV-...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,1000.0
4,,,33353,[],CHEMBL696055,Inhibitory activity was determined against HIV...,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3449,"{'action_type': 'INHIBITOR', 'description': 'N...",,25090011,[],CHEMBL5257798,Inhibition of HIV-1 protease,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,uM,UO_0000065,,2.0
3450,"{'action_type': 'INHIBITOR', 'description': 'N...",,25090012,[],CHEMBL5257798,Inhibition of HIV-1 protease,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,0.01
3451,"{'action_type': 'INHIBITOR', 'description': 'N...",,25105067,[],CHEMBL5262984,Inhibition of HIV-1 protease by FRET-based assay,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,4.2
3452,"{'action_type': 'INHIBITOR', 'description': 'N...",,25105068,[],CHEMBL5262984,Inhibition of HIV-1 protease by FRET-based assay,B,,,BAO_0000190,...,Human immunodeficiency virus 1,Human immunodeficiency virus type 1 protease,11676,,,IC50,nM,UO_0000065,,11.8
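
The filter above keeps only rows whose `standard_value` is present. To make the rule explicit, here is a minimal plain-Python sketch of the same idea, treating `None` and NaN as missing. The records are hypothetical placeholders (the `MOL_*` IDs and their values are invented for illustration and are not ChEMBL data), and `is_missing` and `drop_missing` are hypothetical helper names.

```python
import math


def is_missing(value):
    """True for None or a float NaN, the two cases treated as missing here."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(records, field):
    """Return only the records that have a usable value in `field`."""
    return [r for r in records if not is_missing(r.get(field))]


# Hypothetical records, invented for illustration only.
records = [
    {"molecule_chembl_id": "MOL_A", "standard_value": "42.0"},
    {"molecule_chembl_id": "MOL_B", "standard_value": None},
    {"molecule_chembl_id": "MOL_C", "standard_value": float("nan")},
    {"molecule_chembl_id": "MOL_D", "standard_value": "1000.0"},
]

kept = drop_missing(records, "standard_value")
kept_ids = [r["molecule_chembl_id"] for r in kept]
print(kept_ids)  # ['MOL_A', 'MOL_D']
```

In pandas, `df.dropna(subset=["standard_value"])` should be equivalent to the boolean mask used in the cell above.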


In [100]:
df2.molecule_chembl_id

Unnamed: 0,molecule_chembl_id
0,CHEMBL9650
1,CHEMBL273396
2,CHEMBL108102
3,CHEMBL322033
4,CHEMBL20143
...,...
3449,CHEMBL5284565
3450,CHEMBL478728
3451,CHEMBL159580
3452,CHEMBL5268749
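
The same compound can appear in more than one activity record, so it can be useful to check for repeated molecule IDs before treating each row as a distinct compound. The following sketch counts repeats and builds an order-preserving list of unique IDs. The first three IDs come from the output above; the repeated `CHEMBL9650` at the end is an assumption added only to demonstrate a duplicate.

```python
from collections import Counter

# First three IDs are from the output above; the final repeat is hypothetical.
molecule_ids = ["CHEMBL9650", "CHEMBL273396", "CHEMBL108102", "CHEMBL9650"]

# IDs that occur more than once.
counts = Counter(molecule_ids)
repeated = sorted(mol for mol, n in counts.items() if n > 1)

# Unique IDs in first-seen order (dict keys preserve insertion order).
unique_ids = list(dict.fromkeys(molecule_ids))

print(repeated)
print(unique_ids)
```

On the DataFrame itself, `df2.molecule_chembl_id.nunique()` reports the number of distinct compounds, and `df2.drop_duplicates(subset="molecule_chembl_id")` keeps the first record per compound. Whether to keep the first record or aggregate repeats is a modelling choice this notebook leaves open.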
