## **ChEMBL Database**

The [*ChEMBL Database*](https://www.ebi.ac.uk/chembl/) contains curated bioactivity data for more than 2 million compounds. It is compiled from more than 76,000 documents and 1.2 million assays, and the data span 13,000 targets, 1,800 cells and 33,000 indications.
[Data as of March 25, 2020; ChEMBL version 26].

## **Installing libraries**
Install the ChEMBL web service package so that we can retrieve bioactivity data from the ChEMBL Database.

In [None]:
! pip install chembl_webresource_client

## **Importing libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from chembl_webresource_client.new_client import new_client

## **Search for Target protein**



### **Target search for BACE**

In [None]:

target = new_client.target
target_query = target.search('BACE')
targets = pd.DataFrame.from_dict(target_query)
targets

### **Select and retrieve bioactivity data for *Human BACE1* (fourth entry)**

We will assign the fourth entry (index 3, which corresponds to the target protein, *Human BACE1*) to the ***selected_target*** variable.

In [None]:
selected_target = targets.target_chembl_id[3]
selected_target
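
Picking the target by its position in the search results is fragile, because the order of the results may change between ChEMBL releases. Below is a minimal sketch of a more explicit selection. It assumes the search results carry `organism`, `pref_name` and `target_chembl_id` columns. The rows are hand-written stand-ins rather than real query output: only CHEMBL4822 comes from this notebook, and `pick_target` is a hypothetical helper name.

```python
import pandas as pd

# Hand-written stand-in for the search results. Only CHEMBL4822 comes from
# this notebook; the other IDs and all names are illustrative placeholders.
mock_targets = pd.DataFrame({
    "organism": ["Mus musculus", "Homo sapiens", "Homo sapiens"],
    "pref_name": ["Beta-secretase 1", "Beta-secretase 2", "Beta-secretase 1"],
    "target_chembl_id": ["PLACEHOLDER_A", "PLACEHOLDER_B", "CHEMBL4822"],
})


def pick_target(frame, organism, name):
    """Return the single target ID matching the organism and preferred name."""
    mask = (frame["organism"] == organism) & (frame["pref_name"] == name)
    matches = frame.loc[mask, "target_chembl_id"]
    if len(matches) != 1:
        # Fail loudly instead of silently picking the wrong protein.
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches.iloc[0]


picked = pick_target(mock_targets, "Homo sapiens", "Beta-secretase 1")
print(picked)
```

Raising an error when zero or several rows match makes a changed search result visible immediately, rather than letting a different target flow into the later steps.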

Here, we will retrieve only the bioactivity data for *Human BACE1* (CHEMBL4822) that are reported as IC50 values.


In [None]:
activity = new_client.activity
res = activity.filter(target_chembl_id=selected_target).filter(standard_type="IC50")

In [None]:
df = pd.DataFrame.from_dict(res)

In [None]:
df
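
Before going further, it can help to confirm what was actually retrieved. Below is a small sketch that assumes the activity records expose `standard_type` and `standard_units` columns. The frame is a made-up stand-in for `df`, so the values are illustrative only.

```python
import pandas as pd

# Made-up stand-in for df; values are illustrative, not real ChEMBL records.
mock_activities = pd.DataFrame({
    "standard_type": ["IC50", "IC50", "IC50"],
    "standard_units": ["nM", "nM", "ug.mL-1"],
    "standard_value": ["750.0", "12000.0", "3.2"],
})

# Every row should carry the measurement type that was requested.
all_ic50 = bool((mock_activities["standard_type"] == "IC50").all())

# Count the units present; rows not in nM cannot be compared to nM cutoffs.
unit_counts = mock_activities["standard_units"].value_counts(dropna=False)
nm_only = mock_activities[mock_activities["standard_units"] == "nM"]

print(all_ic50)
print(unit_counts)
print(len(nm_only))
```

Whether rows in other units should be dropped or converted depends on the data. This sketch only shows how to detect them.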

Finally, we will save the resulting bioactivity data to the CSV file **BACE_1_bioactivity_data_raw.csv**.

In [None]:
df.to_csv('BACE_1_bioactivity_data_raw.csv', index=False)

## **Handling missing data**
If a compound has a missing value in the **standard_value** or **canonical_smiles** column, drop it.

In [None]:
# Keep only rows that have both a standard_value and a canonical SMILES string
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
df2

In [None]:
len(df2.canonical_smiles.unique())

In [None]:
df2_nr = df2.drop_duplicates(['canonical_smiles'])
df2_nr
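
The filtering and de-duplication steps can also be written with `dropna` and `drop_duplicates` directly. Below is a minimal sketch on a made-up frame, where the compound IDs and values are placeholders. Note that `drop_duplicates` keeps the first row for each SMILES string, so when a compound has several measurements, the value that survives depends on row order.

```python
import pandas as pd

# Made-up records; IDs and values are placeholders for illustration.
mock_records = pd.DataFrame({
    "molecule_chembl_id": ["ID_A", "ID_B", "ID_C", "ID_A", "ID_D"],
    "canonical_smiles": ["CCO", None, "c1ccccc1", "CCO", "CCN"],
    "standard_value": ["100.0", "50.0", None, "250.0", "9000.0"],
})

# Drop rows that are missing either required column in a single call.
complete = mock_records.dropna(subset=["standard_value", "canonical_smiles"])

# Keep the first record for each unique SMILES string.
unique_rows = complete.drop_duplicates(subset=["canonical_smiles"])

print(unique_rows)
```

Here the second measurement for `CCO` (250.0) is discarded in favor of the first (100.0). If that matters, the duplicates could be aggregated first, for example by taking a median.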

## **Data pre-processing of the bioactivity data**

### **Combine the 3 columns (molecule_chembl_id, canonical_smiles, standard_value) into a DataFrame**

In [None]:
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_nr[selection]
df3

Save the DataFrame to a CSV file.

In [None]:
df3.to_csv('BACE_2_bioactivity_data_preprocessed.csv', index=False)

### **Labeling compounds as either being active, inactive or intermediate**
The bioactivity data are IC50 values in nM. Compounds with values of 1,000 nM or less will be considered **active**, while those with values of 10,000 nM or more will be considered **inactive**. Values between 1,000 and 10,000 nM will be referred to as **intermediate**.

In [None]:
df4 = pd.read_csv('BACE_2_bioactivity_data_preprocessed.csv')

In [None]:
bioactivity_threshold = []
for i in df4.standard_value:
  if float(i) >= 10000:
    bioactivity_threshold.append("inactive")
  elif float(i) <= 1000:
    bioactivity_threshold.append("active")
  else:
    bioactivity_threshold.append("intermediate")
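
The same labeling can be expressed as a small function, which makes the boundary behavior easy to check. This sketch assumes the cutoffs used in the loop above (1,000 nM and 10,000 nM, both inclusive). `classify_ic50` and the two constants are hypothetical names, not part of any library.

```python
import pandas as pd

ACTIVE_MAX_NM = 1000      # at or below this value: active
INACTIVE_MIN_NM = 10000   # at or above this value: inactive


def classify_ic50(value_nm):
    """Label one IC50 value (in nM) as active, inactive or intermediate."""
    value_nm = float(value_nm)
    if value_nm >= INACTIVE_MIN_NM:
        return "inactive"
    if value_nm <= ACTIVE_MAX_NM:
        return "active"
    return "intermediate"


# Illustrative values chosen to sit on and around both cutoffs.
values = pd.Series(["1000", "1000.1", "9999", "10000", "0.5"], name="standard_value")
labels = values.apply(classify_ic50).rename("bioactivity_class")
print(labels.tolist())
# ['active', 'intermediate', 'intermediate', 'inactive', 'active']
```

Keeping the cutoffs in named constants means a later change of thresholds only has to be made in one place.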

In [None]:
bioactivity_class = pd.Series(bioactivity_threshold, name='bioactivity_class')
df5 = pd.concat([df4, bioactivity_class], axis=1)
df5

Save the DataFrame to a CSV file.

In [None]:
df5.to_csv('BACE_03_bioactivity_data_curated.csv', index=False)

In [None]:
! zip BACE1.zip *.csv
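
The `zip` shell command is not available on every system. As a portable alternative, here is a minimal sketch using Python's standard `zipfile` module. `zip_csv_files` is a hypothetical helper name, and the demo runs in a temporary directory with placeholder files so that nothing in the working directory is touched.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_csv_files(folder, archive_name):
    """Bundle every .csv file in folder into archive_name; return the stored names."""
    folder = Path(folder)
    archive_path = folder / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for csv_path in sorted(folder.glob("*.csv")):
            # Store files by name only, without the folder prefix.
            archive.write(csv_path, arcname=csv_path.name)
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder files standing in for the CSV outputs of this notebook.
    for name in ["a.csv", "b.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("placeholder\n")
    archived = zip_csv_files(tmp, "BACE1.zip")

print(archived)
```

In the notebook itself, the equivalent call would be `zip_csv_files('.', 'BACE1.zip')`.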

In [None]:
! ls -l