# **Predictive Modeling of FLT3 Bioactivity Using Machine Learning [Part 1] Download Bioactivity Data**



## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2.4 million compounds. It is compiled from more than 89,900 documents and 1.6 million assays, and the data span 15,000 targets, 2,000 cells, and 48,800 indications.
[Data as of November 14, 2024; ChEMBL version 34].

## **Installing libraries**

Install the ChEMBL web service package so that bioactivity data can be retrieved from the ChEMBL Database.

In [3]:
! pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-24.1.2-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=1.4 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-1.4.3-py2.py3-none-any.whl.metadata (3.1 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.2.1-py3-none-any.whl (61 kB)
Downloading cattrs-24.1.2-py3-none-any.whl (66 kB)

## **Importing libraries**

In [4]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**

### **Target search for FLT3**

In [5]:
# Target search for FLT3
target = new_client.target
target_query = target.search('FLT3')
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,cross_references,organism,pref_name,score,species_group_flag,target_chembl_id,target_components,target_type,tax_id
0,[],Mus musculus,Receptor-type tyrosine-protein kinase FLT3,18.0,False,CHEMBL2034796,"[{'accession': 'Q00342', 'component_descriptio...",SINGLE PROTEIN,10090
1,"[{'xref_id': 'P36888', 'xref_name': None, 'xre...",Homo sapiens,Tyrosine-protein kinase receptor FLT3,16.0,False,CHEMBL1974,"[{'accession': 'P36888', 'component_descriptio...",SINGLE PROTEIN,9606
2,[],Homo sapiens,VHL/FLT3,16.0,False,CHEMBL4523735,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606
3,[],Homo sapiens,Protein cereblon/Tyrosine-protein kinase recep...,16.0,False,CHEMBL4630730,"[{'accession': 'P36888', 'component_descriptio...",PROTEIN-PROTEIN INTERACTION,9606


### **Select and retrieve bioactivity data for *Receptor-type tyrosine-protein kinase FLT3* (first entry)**

Assign the first entry, the *Mus musculus* target *Receptor-type tyrosine-protein kinase FLT3*, to the ***selected_target*** variable. The human protein, *Tyrosine-protein kinase receptor FLT3* (CHEMBL1974), is the second entry in the search results.

In [6]:
selected_target = targets.target_chembl_id[0]
selected_target

'CHEMBL2034796'
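Selecting by row position depends on the order in which the search returns its hits. As a minimal sketch, the same target could be picked by organism and target type instead. The `targets_demo` frame below is a hand-typed copy of three columns from the search results shown above, and `pick_target` is a hypothetical helper, not part of the ChEMBL client.

```python
import pandas as pd

# Hand-typed copy of the relevant columns from the search results above.
targets_demo = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens", "Homo sapiens"],
    "target_type": [
        "SINGLE PROTEIN",
        "SINGLE PROTEIN",
        "PROTEIN-PROTEIN INTERACTION",
        "PROTEIN-PROTEIN INTERACTION",
    ],
    "target_chembl_id": ["CHEMBL2034796", "CHEMBL1974", "CHEMBL4523735", "CHEMBL4630730"],
})


def pick_target(targets, organism, target_type="SINGLE PROTEIN"):
    """Hypothetical helper: return the single ChEMBL ID matching both criteria."""
    mask = (targets["organism"] == organism) & (targets["target_type"] == target_type)
    matches = targets.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly rather than silently picking an arbitrary row.
        raise ValueError(f"Expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


# Same target as the positional selection used in this notebook.
selected_demo = pick_target(targets_demo, "Mus musculus")
print(selected_demo)
```

With the real `targets` DataFrame, the call would be `pick_target(targets, "Mus musculus")`; passing `"Homo sapiens"` would return the human entry instead.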

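Here, we retrieve only the bioactivity data for *Receptor-type tyrosine-protein kinase FLT3* (CHEMBL2034796) that are reported as IC$_{50}$ values. The `standard_value` column used in later steps holds these measurements in nM (nanomolar); for example, a reported value of 5.1 uM is stored as 5100.0.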

In [7]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [9]:
df = pd.DataFrame.from_dict(res)

In [10]:
df.head(3)

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,10905866,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,5.1
1,,,10905904,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,2.1
2,,,10905951,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,5.5


In [11]:
df.standard_type.unique()

array(['IC50'], dtype=object)
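The query above filters on `standard_type` only, so it can be worth confirming the units before applying nM thresholds later on. The sketch below runs on a small made-up frame; it assumes the activity records carry a `standard_units` column (that column is hidden behind the `...` in the truncated output above), and `keep_nanomolar` is a hypothetical helper.

```python
import pandas as pd

# Made-up rows for illustration; the third one uses a non-molar unit.
demo = pd.DataFrame({
    "standard_value": ["5100.0", "1.5", "12.0"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
})


def keep_nanomolar(frame, unit_col="standard_units"):
    """Hypothetical helper: keep only rows whose standardized unit is nM."""
    return frame[frame[unit_col] == "nM"].copy()


print(demo["standard_units"].unique())  # inspect which units are present
demo_nm = keep_nanomolar(demo)
print(len(demo_nm), "of", len(demo), "rows are in nM")
```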

Save the resulting bioactivity data to the CSV file **bioactivity_data_raw.csv**.

In [12]:
df.to_csv('bioactivity_data_raw.csv', index=False)

## **Copying files to Google Drive**


In [13]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/
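Once the drive is mounted, a file can be copied into it with the standard library. The following is a minimal sketch: `copy_to_folder` is a hypothetical helper, the demonstration uses temporary folders so that it runs anywhere, and the Drive folder name in the final comment (`flt3_data`) is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_folder(src, dest_dir):
    """Copy one file into dest_dir (created if missing) and return the new path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest))


# Offline demonstration: temporary folders stand in for Colab and Drive.
work_dir = Path(tempfile.mkdtemp())
fake_drive = Path(tempfile.mkdtemp()) / "flt3_data"
source_file = work_dir / "bioactivity_data_raw.csv"
source_file.write_text("molecule_chembl_id,standard_value\n")

copied = copy_to_folder(source_file, fake_drive)
print(copied)

# In Colab, after drive.mount('/content/gdrive/'), the call might look like:
# copy_to_folder('bioactivity_data_raw.csv', '/content/gdrive/MyDrive/flt3_data')
```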


**Handling missing data**

In [14]:
df2 = df[df.standard_value.notna()]
df2

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,10905866,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,5.1
1,,,10905904,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,2.1
2,,,10905951,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,5.5
3,,,10906029,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,6.4
4,,,10906062,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,4.7
5,,,10906970,[],CHEMBL2038873,Inhibition of Tel-fused FLT3 in mouse BA/F3 cells,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,uM,UO_0000065,,9.4
6,,,14545494,[],CHEMBL3225224,Inhibition of FLT3 autophosphorylation in mous...,B,,,BAO_0000190,...,Mus musculus,Receptor-type tyrosine-protein kinase FLT3,10090,,,IC50,nM,UO_0000065,,1.5
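Beyond missing `standard_value` entries, it may also help to drop rows without a structure and to keep one record per compound. This is a sketch of one possible cleaning step, not the procedure used in this notebook. The rows and molecule IDs below are invented for illustration, and `clean_activities` is a hypothetical helper.

```python
import pandas as pd

# Invented rows: one missing value, one missing SMILES, one duplicate compound.
demo = pd.DataFrame({
    "molecule_chembl_id": ["MOL_A", "MOL_B", "MOL_C", "MOL_A", "MOL_D"],
    "canonical_smiles": ["CCO", "CCN", None, "CCO", "c1ccccc1"],
    "standard_value": ["5100.0", None, "12.0", "4800.0", "1.5"],
})


def clean_activities(frame):
    """Hypothetical helper: drop incomplete rows, then keep one row per compound."""
    required = ["standard_value", "canonical_smiles"]
    cleaned = frame.dropna(subset=required)
    # keep="first" retains the earliest record for each compound.
    cleaned = cleaned.drop_duplicates(subset="molecule_chembl_id", keep="first")
    return cleaned.reset_index(drop=True)


demo_clean = clean_activities(demo)
print(demo_clean)
```

Resetting the index at the end keeps row labels contiguous, which avoids alignment surprises when new columns are attached later.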


## **Data pre-processing of the bioactivity data**

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values expressed in nM. Compounds with values of 1,000 nM or less are considered **active**, those with values of 10,000 nM or more are considered **inactive**, and those with values between 1,000 and 10,000 nM are referred to as **intermediate**.
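As an illustration of these thresholds, the labels can also be computed in a vectorized way. The sketch below is an alternative formulation rather than the notebook's own code: `label_bioactivity` is a hypothetical helper, and the example values are a few of the `standard_value` entries from this dataset plus one made-up inactive value (25000).

```python
import numpy as np
import pandas as pd


def label_bioactivity(values, active_max=1000.0, inactive_min=10000.0):
    """Hypothetical helper: map IC50 values in nM to activity classes."""
    v = pd.to_numeric(pd.Series(values), errors="coerce")
    labels = np.select(
        [v <= active_max, v >= inactive_min],
        ["active", "inactive"],
        default="intermediate",
    )
    # Keep the input index so the result lines up with the source rows,
    # and leave unparseable or missing values unlabeled.
    return pd.Series(labels, index=v.index, name="bioactivity_class").where(v.notna())


demo_labels = label_bioactivity(["5100.0", "2100.0", "1.5", "25000"])
print(demo_labels.tolist())
```

Because the result carries the same index as its input, it can be joined to the source DataFrame without relying on row order.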

In [15]:
bioactivity_class = []
for i in df2.standard_value:
  if float(i) >= 10000:
    bioactivity_class.append("inactive")
  elif float(i) <= 1000:
    bioactivity_class.append("active")
  else:
    # Every row needs a label; otherwise the list is shorter than df2
    # and the labels misalign when concatenated below.
    bioactivity_class.append("intermediate")

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) and bioactivity_class into a DataFrame**

In [19]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2[selection]
df3

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value
0,CHEMBL2037226,Cc1ccc(NC(=O)Nc2cc(C(F)(F)F)ccc2F)cc1Nc1ccc2c(...,5100.0
1,CHEMBL2037224,O=C(Nc1cccc(Nc2ccc3c(c2)NC(=O)/C3=C\c2ccc[nH]2...,2100.0
2,CHEMBL2037220,CN1CCN(c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,5500.0
3,CHEMBL2037211,CN1CCN(c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,6400.0
4,CHEMBL2037209,Cc1cn(-c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,4700.0
5,CHEMBL2037208,O=C1Nc2cc(Nc3cccc(NC(=O)c4cccc(C(F)(F)F)c4)c3)...,9400.0
6,CHEMBL603469,C[C@]12O[C@H](C[C@]1(O)CO)n1c3ccccc3c3c4c(c5c6...,1.5


In [20]:
# Reuse df3's index so the labels stay aligned even if rows were dropped earlier
bioactivity_class = pd.Series(bioactivity_class, name='bioactivity_class', index=df3.index)
df4 = pd.concat([df3, bioactivity_class], axis=1)
df4

Unnamed: 0,molecule_chembl_id,canonical_smiles,standard_value,bioactivity_class
0,CHEMBL2037226,Cc1ccc(NC(=O)Nc2cc(C(F)(F)F)ccc2F)cc1Nc1ccc2c(...,5100.0,intermediate
1,CHEMBL2037224,O=C(Nc1cccc(Nc2ccc3c(c2)NC(=O)/C3=C\c2ccc[nH]2...,2100.0,intermediate
2,CHEMBL2037220,CN1CCN(c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,5500.0,intermediate
3,CHEMBL2037211,CN1CCN(c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,6400.0,intermediate
4,CHEMBL2037209,Cc1cn(-c2cc(C(=O)Nc3cccc(Nc4ccc5c(c4)NC(=O)/C5...,4700.0,intermediate
5,CHEMBL2037208,O=C1Nc2cc(Nc3cccc(NC(=O)c4cccc(C(F)(F)F)c4)c3)...,9400.0,intermediate
6,CHEMBL603469,C[C@]12O[C@H](C[C@]1(O)CO)n1c3ccccc3c3c4c(c5c6...,1.5,active


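Save the DataFrame to a CSV file. The cell below writes `df3`, which holds the three selected columns without `bioactivity_class`; write `df4` instead if the class labels are needed in the file.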

In [22]:
df3.to_csv('bioactivity_data_preprocessed.csv', index=False)

In [None]:
! ls -l

total 20
-rw-r--r-- 1 root root  755 Nov 14 11:57 bioactivity_data_preprocessed.csv
-rw-r--r-- 1 root root 4347 Nov 14 10:44 bioactivity_data_raw.csv
drwx------ 7 root root 4096 Nov 14 10:55 gdrive
drwxr-xr-x 1 root root 4096 Nov 12 14:25 sample_data


---