<a href="https://colab.research.google.com/github/JesicaBA/CB-Bio/blob/main/bioactivity_data_melanoma_associated_antigen_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Computational Drug Discovery: Bioactivity Data**

Jesica Allende

[*LinkedIn Profile*](https://iplogger.com/2hKMx4)

In this notebook I show how to retrieve compounds from ChEMBL and classify them by their bioactivity against *Melanoma-associated antigen 4*.

---

## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) is a freely available and curated database that provides information on the bioactivities of small molecules and their targets, facilitating drug discovery and development in the field of chemogenomics.


## **Installing libraries**

Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client



## **Importing libraries**

In [None]:
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for melanoma**

In [None]:
target = new_client.target
target_query = target.search('melanoma')
targets = pd.DataFrame.from_dict(target_query)
targets.head(15)

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Homo sapiens,Cell surface glycoprotein MUC18,15.0,False,CHEMBL3712863,"[{'accession': 'P43121', 'component_descriptio...",SINGLE PROTEIN,9606.0
1,[],Homo sapiens,Melanoma cells,14.0,False,CHEMBL614126,[],CELL-LINE,9606.0
2,[],Homo sapiens,Melanoma-associated antigen 4,14.0,False,CHEMBL4296022,"[{'accession': 'P43358', 'component_descriptio...",SINGLE PROTEIN,9606.0
3,[],Homo sapiens,Interferon-inducible protein AIM2,14.0,False,CHEMBL4630802,"[{'accession': 'O14862', 'component_descriptio...",SINGLE PROTEIN,9606.0
4,[],Homo sapiens,Melanoma-associated antigen 3,14.0,False,CHEMBL4662941,"[{'accession': 'P43357', 'component_descriptio...",SINGLE PROTEIN,9606.0
5,[],Homo sapiens,CD63 antigen,13.0,False,CHEMBL3713303,"[{'accession': 'P08962', 'component_descriptio...",SINGLE PROTEIN,9606.0
6,[],Homo sapiens,Melanoma cell line,12.0,False,CHEMBL613892,[],CELL-LINE,9606.0
7,[],Homo sapiens,Cereblon/Melanoma-associated antigen D1,12.0,False,CHEMBL4742325,"[{'accession': 'Q96SW2', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606.0
8,[],Homo sapiens,3677 melanoma cell line,11.0,False,CHEMBL612820,[],CELL-LINE,9606.0
9,[],Homo sapiens,BRO melanoma cell line,11.0,False,CHEMBL614665,[],CELL-LINE,9606.0


### **Select and retrieve bioactivity data for *Melanoma-associated antigen 4* (third entry)**

I assign the ChEMBL ID of the third entry (index 2, the target protein *Melanoma-associated antigen 4*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[2]
selected_target

'CHEMBL4296022'
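
Picking the target by position works for this search result, but the order of results could change between ChEMBL releases. A minimal sketch of selecting by name instead is given here. It uses a small stand-in table (`targets_demo`, a hypothetical name) built from three rows of the output above rather than a live query, and assumes the preferred name matches exactly.

```python
import pandas as pd

# Toy stand-in for the `targets` DataFrame, copied from three rows of the search output.
targets_demo = pd.DataFrame({
    "pref_name": [
        "Cell surface glycoprotein MUC18",
        "Melanoma cells",
        "Melanoma-associated antigen 4",
    ],
    "target_type": ["SINGLE PROTEIN", "CELL-LINE", "SINGLE PROTEIN"],
    "target_chembl_id": ["CHEMBL3712863", "CHEMBL614126", "CHEMBL4296022"],
})

# Select by exact preferred name rather than by row position.
mask = targets_demo["pref_name"] == "Melanoma-associated antigen 4"
matches = targets_demo.loc[mask, "target_chembl_id"]

# Fail loudly if the name is missing or ambiguous.
if len(matches) != 1:
    raise ValueError(f"expected exactly one match, found {len(matches)}")

selected_target_demo = matches.iloc[0]
print(selected_target_demo)
```

The same mask could be applied to the real **targets** DataFrame in place of `targets.target_chembl_id[2]`.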

Let's retrieve only the bioactivity data for *Melanoma-associated antigen 4* (CHEMBL4296022) that are reported as IC$_{50}$ values (in nM for the rows shown below).

In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df.head(6)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922732,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,2.15
1,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922733,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,1.0
2,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922734,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,1.26
3,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922735,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,10.6
4,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922736,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,13.9
5,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922737,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,862.2


What is the size of the DataFrame?

In [None]:
num_rows = df.shape[0]
num_columns = df.shape[1]

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

Number of Rows: 17
Number of Columns: 46


In [None]:
df.standard_type.unique()

array(['IC50'], dtype=object)
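
The query filters on the activity type only, so it can be worth confirming that every record uses the same unit before comparing values. The sketch below shows one way to do that on a small stand-in table (`activities_demo`, a hypothetical name) built from the first three rows above. Storing the values as strings is an assumption made here for illustration.

```python
import pandas as pd

# Toy stand-in for `df`; IDs and values are taken from the first three records shown above.
activities_demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL5188210", "CHEMBL5187815", "CHEMBL5199001"],
    "type": ["IC50", "IC50", "IC50"],
    "units": ["nM", "nM", "nM"],
    "value": ["2.15", "1.0", "1.26"],  # assumed to be strings for this sketch
})

# Count how often each unit occurs, including missing units.
unit_counts = activities_demo["units"].value_counts(dropna=False)
print(unit_counts)

# True only if every record is reported in nM.
all_nm = bool(activities_demo["units"].eq("nM").all())
print(all_nm)
```

On the real data, `df["units"].value_counts(dropna=False)` would give the same kind of summary.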

Finally, I save the resulting bioactivity data to the CSV file **bioactivity_data_melanoma-associated_antigen-4.csv**.

In [None]:
df.to_csv('bioactivity_data_melanoma-associated_antigen-4.csv', index=False)
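
As a quick sanity check, a saved CSV can be read back to confirm its shape and column types. This is a minimal sketch on a two-row stand-in table written to a temporary directory. The file name `demo_bioactivity.csv` is hypothetical and the values are stored as strings only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row stand-in for the bioactivity table.
demo = pd.DataFrame({
    "activity_id": [24922732, 24922733],
    "value": ["2.15", "1.0"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "demo_bioactivity.csv"  # hypothetical file name
    # index=False keeps the row index out of the file.
    demo.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
print(reloaded.dtypes)
```

Note that `read_csv` infers types, so numeric-looking strings come back as numbers.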

## **Copying files to Google Drive**

First, I mount Google Drive in Colab so that I can access my Drive files from the notebook.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)


Mounted at /content/gdrive/


Next, let's create a **Bioinfo** folder inside my **Colab Notebooks** folder in Google Drive.

In [None]:
! mkdir -p "/content/gdrive/My Drive/Colab Notebooks/Bioinfo"


In [None]:
! cp bioactivity_data_melanoma-associated_antigen-4.csv "/content/gdrive/My Drive/Colab Notebooks/Bioinfo"

In [None]:
! ls -l "/content/gdrive/My Drive/Colab Notebooks/Bioinfo"

total 26
-rw------- 1 root root 25925 Jun 29 00:05 bioactivity_data_melanoma-associated_antigen-4.csv
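
The folder creation and copy can also be done in Python, which makes the step safe to re-run. The sketch below is one possible version that uses a temporary directory in place of the mounted Drive. The file `demo.csv` and its contents are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)  # stands in for /content/gdrive

    # Hypothetical source file to copy.
    source_file = root / "demo.csv"
    source_file.write_text("molecule_chembl_id,value\nCHEMBL5188210,2.15\n")

    target_dir = root / "My Drive" / "Colab Notebooks" / "Bioinfo"
    # exist_ok=True means a second call does nothing instead of raising an error.
    target_dir.mkdir(parents=True, exist_ok=True)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Copy the file into the folder and list the folder contents.
    shutil.copy(source_file, target_dir)
    copied_names = sorted(p.name for p in target_dir.iterdir())

print(copied_names)
```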


Let's see the CSV files that we have so far.

In [None]:
! ls

bioactivity_data_melanoma-associated_antigen-4.csv  gdrive  sample_data


Let's take a look at the **bioactivity_data_melanoma-associated_antigen-4.csv** file that was just created.

In [None]:
! head bioactivity_data_melanoma-associated_antigen-4.csv

action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,bao_format,bao_label,canonical_smiles,data_validity_comment,data_validity_description,document_chembl_id,document_journal,document_year,ligand_efficiency,molecule_chembl_id,molecule_pref_name,parent_molecule_chembl_id,pchembl_value,potential_duplicate,qudt_units,record_id,relation,src_id,standard_flag,standard_relation,standard_text_value,standard_type,standard_units,standard_upper_value,standard_value,target_chembl_id,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
"{'action_type': 'INHIBITOR', 'description': 'Negatively effects (inhibits) the normal functioning of the protein e.g., prevention of enzymatic reaction or activation of downstream pathway', 'parent_type': 'NEGATIVE MODULATOR'}",,24922732,"[{'comments': None, 'relation': '=', 'result_flag': 0, 'standard_relat

## **Missing data**
If a compound has a missing entry in the **value** column, its row is dropped.

In [None]:
df2 = df[df.value.notna()]
df2.head(6)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922732,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,2.15
1,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922733,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,1.0
2,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922734,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,1.26
3,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922735,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,10.6
4,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922736,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,13.9
5,"{'action_type': 'INHIBITOR', 'description': 'N...",,24922737,"[{'comments': None, 'relation': '=', 'result_f...",CHEMBL5166839,Inhibition of Eu-W1024 streptavidin-labeled MA...,B,,,BAO_0000190,...,Homo sapiens,Melanoma-associated antigen 4,9606,,,IC50,nM,UO_0000065,,862.2


What is the size of **df2**?

In [None]:
num_rows = df2.shape[0]
num_columns = df2.shape[1]

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

Number of Rows: 16
Number of Columns: 46


One record had a missing value: **df2** has 16 rows, down from the original 17.
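
An equivalent way to count and drop missing entries is `isna().sum()` followed by `dropna(subset=...)`. The following sketch runs on a three-row stand-in table with hypothetical molecule IDs. It also shows that the original index labels are kept after the drop, which matters whenever rows are later matched by index.

```python
import numpy as np
import pandas as pd

# Stand-in table with one missing value; the IDs are hypothetical.
demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C"],
    "value": ["2.15", np.nan, "862.2"],
})

# Count the missing entries before dropping anything.
n_missing = int(demo["value"].isna().sum())

# Drop only the rows where `value` is missing.
cleaned = demo.dropna(subset=["value"])

print(n_missing, len(demo), len(cleaned))
# The surviving rows keep their original labels, leaving a gap at 1.
print(cleaned.index.tolist())
```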

## **Data pre-processing of the bioactivity data**

### **Classification of compounds as active, inactive, or intermediates**
Potency is expressed by the IC$_{50}$ value: the lower the IC$_{50}$, the more potent the compound. Compounds with values of 1000 nM or less are considered **active**, those with values of 7000 nM or more are considered **inactive**, and those in between are referred to as **intermediate**.

In [None]:
bioactivity_class = []
for i in df2.value:
  if float(i) >= 7000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    bioactivity_class.append("intermediate")
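
The same thresholds can be applied without an explicit loop. Here is a minimal sketch using `numpy.select`. The function name `classify_ic50` and the two constants are hypothetical names introduced for illustration, and the input values are made up to exercise both boundaries.

```python
import numpy as np
import pandas as pd

ACTIVE_MAX_NM = 1000    # values at or below this are "active"
INACTIVE_MIN_NM = 7000  # values at or above this are "inactive"


def classify_ic50(values):
    """Label IC50 values (in nM) as active, intermediate, or inactive."""
    # Convert strings to numbers; non-numeric entries raise an error.
    ic50 = pd.to_numeric(pd.Series(values), errors="raise")
    labels = np.select(
        [ic50 >= INACTIVE_MIN_NM, ic50 <= ACTIVE_MAX_NM],
        ["inactive", "active"],
        default="intermediate",
    )
    # Reuse the input index so the labels line up with the source rows.
    return pd.Series(labels, index=ic50.index, name="bioactivity_class")


demo_classes = classify_ic50(["2.15", "1000", "2211.8", "7000", "10000.0"])
print(demo_classes.tolist())
```

Applied to the real data, this would be `classify_ic50(df2.value)`. The returned Series carries the index of **df2**.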

### **Adding the *molecule_chembl_id* to a list**

In [None]:
mol_cid = []
for i in df2.molecule_chembl_id:
  mol_cid.append(i)

### **Adding *canonical_smiles* to a list**

In [None]:
canonical_smiles = []
for i in df2.canonical_smiles:
  canonical_smiles.append(i)

### **Adding *value* to a list**

In [None]:
value = []
for i in df2.value:
  value.append(i)

### **Combine the 4 lists into a dataframe**

In [None]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, value))
df3 = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'value']).sort_values(by="bioactivity_class")

_The table is sorted by bioactivity_class in ascending (alphabetical) order; only the first rows are shown._

In [None]:
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,value
0,CHEMBL5188210,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,2.15
1,CHEMBL5187815,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,active,1.0
2,CHEMBL5199001,CC(C)C[C@@H]1NC(=O)[C@H]([C@H](C)O)NC(=O)[C@H]...,active,1.26
3,CHEMBL5208958,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,10.6
4,CHEMBL5182108,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,13.9
5,CHEMBL5189065,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,862.2
6,CHEMBL5197762,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,67.5
8,CHEMBL5187977,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,103.1
11,CHEMBL5175619,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,147.55
12,CHEMBL5197324,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,190.14


_The table is sorted by bioactivity_class in descending (reverse alphabetical) order; only the first rows are shown._

In [None]:
data_tuples = list(zip(mol_cid, canonical_smiles, bioactivity_class, value))
df3 = pd.DataFrame( data_tuples,  columns=['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'value']).sort_values(by="bioactivity_class", ascending=False)
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,bioactivity_class,value
7,CHEMBL5188864,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,intermediate,2211.8
9,CHEMBL5202217,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,intermediate,6183.7
10,CHEMBL5174657,CC[C@H](C)[C@@H]1NC(=O)[C@H](C)NC(=O)CSC[C@@H]...,intermediate,1892.7
13,CHEMBL5175748,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,intermediate,2227.95
14,CHEMBL5174081,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,inactive,10000.0
15,CHEMBL5208599,CC[C@H](C)[C@H](NC(=O)[C@H](Cc1ccc(O)cc1)NC(C)...,inactive,10000.0
0,CHEMBL5188210,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,2.15
1,CHEMBL5187815,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,active,1.0
2,CHEMBL5199001,CC(C)C[C@@H]1NC(=O)[C@H]([C@H](C)O)NC(=O)[C@H]...,active,1.26
3,CHEMBL5208958,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,active,10.6
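
Sorting a text column orders the classes alphabetically, which puts "inactive" between "active" and "intermediate". If an order by potency is preferred, one option is an ordered categorical. The sketch below uses a three-row stand-in table with rows taken from the data above; `class_order` is a hypothetical name.

```python
import pandas as pd

# Three rows from the table above, one per class.
demo = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL5188210", "CHEMBL5188864", "CHEMBL5174081"],
    "bioactivity_class": ["active", "intermediate", "inactive"],
    "value": [2.15, 2211.8, 10000.0],
})

# Plain string sort: alphabetical order.
alphabetical = demo.sort_values(by="bioactivity_class")["bioactivity_class"].tolist()

# Ordered categorical: the sort follows the order listed in class_order.
class_order = ["active", "intermediate", "inactive"]
demo["bioactivity_class"] = pd.Categorical(
    demo["bioactivity_class"], categories=class_order, ordered=True
)
by_potency = demo.sort_values(by=["bioactivity_class", "value"])["bioactivity_class"].tolist()

print(alphabetical)
print(by_potency)
```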


In [None]:
num_rows = df3.shape[0]
num_columns = df3.shape[1]

print(f"Number of Rows: {num_rows}")
print(f"Number of Columns: {num_columns}")

Number of Rows: 16
Number of Columns: 4


### **Alternative method: selecting the columns directly**

In [None]:
selection = ['molecule_chembl_id', 'canonical_smiles', 'value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,value
0,CHEMBL5188210,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,2.15
1,CHEMBL5187815,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,1.0
2,CHEMBL5199001,CC(C)C[C@@H]1NC(=O)[C@H]([C@H](C)O)NC(=O)[C@H]...,1.26
3,CHEMBL5208958,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,10.6
4,CHEMBL5182108,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,13.9
5,CHEMBL5189065,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,862.2
6,CHEMBL5197762,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,67.5
7,CHEMBL5188864,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,2211.8
8,CHEMBL5187977,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,103.1
9,CHEMBL5202217,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,6183.7


In [None]:
pd.concat([df3.reset_index(drop=True), pd.Series(bioactivity_class, name='bioactivity_class')], axis=1)

Unnamed: 0,molecule_chembl_id,canonical_smiles,value,bioactivity_class
0,CHEMBL5188210,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,2.15,active
1,CHEMBL5187815,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,1.0,active
2,CHEMBL5199001,CC(C)C[C@@H]1NC(=O)[C@H]([C@H](C)O)NC(=O)[C@H]...,1.26,active
3,CHEMBL5208958,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,10.6,active
4,CHEMBL5182108,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,13.9,active
5,CHEMBL5189065,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,862.2,active
6,CHEMBL5197762,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,67.5,active
7,CHEMBL5188864,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,2211.8,intermediate
8,CHEMBL5187977,CC[C@H](C)[C@@H]1NC(=O)[C@H](Cc2ccc(O)cc2)NC(=...,103.1,active
9,CHEMBL5202217,CC(C)C[C@@H]1NC(=O)[C@H](CCCNC(=N)N)NC(=O)[C@H...,6183.7,intermediate
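
Two details of `pd.concat` are worth keeping in mind here: with `axis=1` it matches rows by index label, and it returns a new DataFrame rather than modifying its inputs. The sketch below illustrates both on a two-row stand-in table with hypothetical molecule IDs, and shows `DataFrame.assign` as one alternative that attaches a list by position.

```python
import pandas as pd

# The index has a gap at 1, as it would after dropping a row with a missing value.
demo = pd.DataFrame(
    {"molecule_chembl_id": ["MOL_A", "MOL_C"], "value": [2.15, 2211.8]},
    index=[0, 2],
)
classes = ["active", "intermediate"]

# pd.Series(classes) is labelled 0 and 1, so concat aligns on labels
# and pads the non-matching rows with NaN.
misaligned = pd.concat([demo, pd.Series(classes)], axis=1)
print(misaligned)

# assign() attaches the list by position and returns a new DataFrame,
# which has to be stored if the column should appear in later steps.
aligned = demo.assign(bioactivity_class=classes)
print(aligned)
```

For the column to end up in a saved file, the result has to be assigned to a variable before calling `to_csv`.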


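Save the DataFrame to a CSV file. This reuses the earlier file name, so the raw export is overwritten. Note that **df3** holds only the three selected columns here, because the result of `pd.concat` above was not assigned to a variable.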

In [None]:
df3.to_csv('bioactivity_data_melanoma-associated_antigen-4.csv', index=False)

In [None]:
! ls -l

total 12
-rw-r--r-- 1 root root 3936 Jun 29 00:05 bioactivity_data_melanoma-associated_antigen-4.csv
drwx------ 5 root root 4096 Jun 29 00:05 gdrive
drwxr-xr-x 1 root root 4096 Jun 27 13:35 sample_data


Let's copy it to Google Drive.

In [None]:
! cp bioactivity_data_melanoma-associated_antigen-4.csv "/content/gdrive/My Drive/Colab Notebooks/Bioinfo"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/Bioinfo"

bioactivity_data_melanoma-associated_antigen-4.csv


---