<a href="https://colab.research.google.com/github/HEK-Research/Multitask-Deep-Learning-Affinity-Prediction/blob/main/Sci_Reports_(2021)_Data_Curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [18]:
import pandas as pd
import numpy as np

##### After mounting Google Drive in Colab, I can access any file in my Google Drive by modifying `file_path` (see below). You can copy the entire dataset folder to your own Google Drive and adjust the path accordingly.

In [5]:
file_path = '/content/drive/MyDrive/Project_4_MTDNN/ChEMBL Datasets/CHEMBL3371.csv'

In [9]:
# Read in the original ChEMBL dataset
df = pd.read_csv(file_path, delimiter=';', skiprows=0, low_memory=False)
# Take a look at the size of the original dataset
print(df.shape)

(9363, 45)


In [10]:
# A quick look at all columns 
df.columns

Index(['Molecule ChEMBL ID', 'Molecule Name', 'Molecule Max Phase',
       'Molecular Weight', '#RO5 Violations', 'AlogP', 'Compound Key',
       'Smiles', 'Standard Type', 'Standard Relation', 'Standard Value',
       'Standard Units', 'pChEMBL Value', 'Data Validity Comment', 'Comment',
       'Uo Units', 'Ligand Efficiency BEI', 'Ligand Efficiency LE',
       'Ligand Efficiency LLE', 'Ligand Efficiency SEI', 'Potential Duplicate',
       'Assay ChEMBL ID', 'Assay Description', 'Assay Type', 'BAO Format ID',
       'BAO Label', 'Assay Organism', 'Assay Tissue ChEMBL ID',
       'Assay Tissue Name', 'Assay Cell Type', 'Assay Subcellular Fraction',
       'Assay Parameters', 'Assay Variant Accession', 'Assay Variant Mutation',
       'Target ChEMBL ID', 'Target Name', 'Target Organism', 'Target Type',
       'Document ChEMBL ID', 'Source ID', 'Source Description',
       'Document Journal', 'Document Year', 'Cell ChEMBL ID', 'Properties'],
      dtype='object')

###### Some simple analysis to check how each filter maps onto the dataset columns
1. Only compounds with reported direct interactions. This corresponds to 'Assay Type' = 'B' 

This dataset includes three assay types:
* 'F' - Biological effect of a compound
* 'B' - Binding of compounds to a molecular target, e.g. Ki, IC50, Kd
* 'A' - ADME data, e.g. t1/2, oral bioavailability

(For further information: https://chembl.gitbook.io/chembl-interface-documentation/frequently-asked-questions/chembl-data-questions) 

In [11]:
print(df['Assay Type'].unique())

['F' 'B' 'A']
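
The notebook only inspects the assay types at this point, so the following is a minimal sketch of how the filter itself could be applied. The toy DataFrame and its molecule IDs are made up for illustration; on the real data the same boolean mask would be applied to `df`.

```python
import pandas as pd

# Toy stand-in for the ChEMBL export (hypothetical IDs, not real data).
toy = pd.DataFrame({
    'Molecule ChEMBL ID': ['MOL_A', 'MOL_B', 'MOL_C', 'MOL_D'],
    'Assay Type': ['F', 'B', 'A', 'B'],
})

# Keep only binding assays, i.e. rows reporting a direct compound-target interaction.
binding_only = toy[toy['Assay Type'] == 'B'].reset_index(drop=True)
print(binding_only)
```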


2. Exact activity measures. This corresponds to 'Standard Relation' = "'='"

In [12]:
print(df['Standard Relation'].unique())

[nan "'='" "'<'" "'>'" "'~'" "'<='"]
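
Note that the relation strings carry literal single quotes, so a comparison against a bare `'='` would match nothing. A small sketch of the filter on made-up values (the toy numbers are assumptions, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the quoted relation strings seen in the export.
toy = pd.DataFrame({
    'Standard Relation': [np.nan, "'='", "'<'", "'>'", "'='"],
    'Standard Value': [np.nan, 12.0, 100.0, 10000.0, 3.5],
})

# Compare against "'='" (with the embedded quotes); NaN rows evaluate to False and are dropped.
exact = toy[toy['Standard Relation'] == "'='"]
print(exact)

# A bare '=' matches no rows because of the embedded quotes.
print((toy['Standard Relation'] == '=').sum())
```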


3. Only standard potency measurements were considered. 

This is a bit complicated. However, we can use the relationship between 'Standard Type' (activity type) and 'pChEMBL Value'.

Potency measurements reported in various units are standardized to 'Standard Value' in nM. For standard potency types, this value is then converted to a negative logarithmic scale (the negative log10 of the molar concentration).

IC50 = 1 nM is equivalent to a pChEMBL Value of 9.
The higher the potency (lower IC50), the larger the pChEMBL Value. (It is a negative log scale, like pH for hydrogen ion concentration.)

This means that only standard potency measurements have numerical pChEMBL Values reported.
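
As a worked illustration of the negative log scale, the sketch below converts a few concentrations in nM to the corresponding value. The function name `pchembl_from_nM` is made up for this example and is not part of any ChEMBL tooling.

```python
import math

def pchembl_from_nM(value_nM: float) -> float:
    """Return -log10 of a concentration given in nM, after converting it to molar."""
    molar = value_nM * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(molar)

# Lower concentration (more potent compound) gives a larger value on this scale.
for ic50_nM in (1.0, 10.0, 1000.0):
    print(ic50_nM, round(pchembl_from_nM(ic50_nM), 2))
```

A 1 nM measurement maps to 9 and a 1000 nM (1 µM) measurement maps to 6, so each tenfold loss in potency lowers the value by one unit.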

I did a groupby analysis to identify and extract the 'Standard Type' values that have numerical 'pChEMBL Value' entries.

In [13]:
print(df['Standard Type'].unique())

['IC50' 'Ki' 'Activity' 'Imax' 'EC50' 'Inhibition' 'Emax' 'Kbapp' 'Kd'
 'Kb' 'Efficacy' '%max' 'Log Ki' 'IA' 'Delta pKi' 'pKb' 'INH' '% Ctrl'
 'pA2' 'Displacement' 'pKi'
 '% Inhibition of Control Agonist Response (Mean n=2)'
 'Activation (% of control)' 'effect'
 '% of Control Agonist Response (Mean n=2)' 'Affinity'
 'Mean fold stimulation'
 '% Inhibition of Control Specific Binding (Mean n=2)'
 'Inhibition (% of control)']


In [14]:
# group the dataframe by type and inspect the pchembl_value column
grouped = df.groupby('Standard Type')['pChEMBL Value']

activities = []
# check if any group has only numerical values or only None values
for group_name, group_values in grouped:
    group_values = group_values.dropna()
    if len(group_values) == 0:
        print(f"All values in group {group_name} are None.")
    elif all(isinstance(val, float) for val in group_values):
        print(f"All values in group {group_name} are numerical.")
        activities.append(group_name)
    else:
        print(f"Group {group_name} has mixed data types.")
        
print("Standard potency measurement types to keep are:", activities)

All values in group % Ctrl are None.
All values in group % Inhibition of Control Agonist Response (Mean n=2) are None.
All values in group % Inhibition of Control Specific Binding (Mean n=2) are None.
All values in group % of Control Agonist Response (Mean n=2) are None.
All values in group %max are None.
All values in group Activation (% of control) are None.
All values in group Activity are None.
All values in group Affinity are None.
All values in group Delta pKi are None.
All values in group Displacement are None.
All values in group EC50 are numerical.
All values in group Efficacy are None.
All values in group Emax are None.
All values in group IA are None.
All values in group IC50 are numerical.
All values in group INH are None.
All values in group Imax are None.
All values in group Inhibition are None.
All values in group Inhibition (% of control) are None.
All values in group Kb are None.
All values in group Kbapp are None.
All values in group Kd are numerical.
All values in gr
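
The same check can also be sketched without an explicit loop, since `count` in a groupby aggregation counts only non-missing values. This is an alternative formulation on made-up rows, not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# Toy rows (assumed values) with some potency types and one percentage-style type.
toy = pd.DataFrame({
    'Standard Type': ['IC50', 'IC50', 'Ki', 'Inhibition', 'Inhibition', 'Kd'],
    'pChEMBL Value': [7.2, np.nan, 8.1, np.nan, np.nan, 6.5],
})

# 'size' counts all rows per group; 'count' counts only rows with a non-missing pChEMBL Value.
summary = toy.groupby('Standard Type')['pChEMBL Value'].agg(['size', 'count'])
print(summary)

# Types with at least one numerical pChEMBL Value are the ones to keep.
types_with_pchembl = summary.index[summary['count'] > 0].tolist()
print(types_with_pchembl)
```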

4. All compounds annotated as ('inactive', 'not active', 'inconclusive', 'potential transcription error', or 'pan assay interference compounds (PAINS)') were discarded. 

I also did some investigation of how to best implement this filter, as explained below:

In [15]:
# The 'Comment' column contains a mixture of NaN, string, and numerical values:
df.Comment[0:20]

0          Not Active
1                 NaN
2                 NaN
3                 NaN
4                 NaN
5              309762
6              309798
7                 NaN
8                 NaN
9                 NaN
10             221395
11                NaN
12             222882
13             222783
14                NaN
15                NaN
16    Partial agonist
17                NaN
18                NaN
19                NaN
Name: Comment, dtype: object

In [21]:
# I used regular expressions to replace all numerical values by NaN
df['Comment'] = df['Comment'].replace(to_replace=r'^\d+$', value=np.nan, regex=True)
print(df.Comment[0:20])
print(df['Comment'].unique())

0          Not Active
1                 NaN
2                 NaN
3                 NaN
4                 NaN
5                 NaN
6                 NaN
7                 NaN
8                 NaN
9                 NaN
10                NaN
11                NaN
12                NaN
13                NaN
14                NaN
15                NaN
16    Partial agonist
17                NaN
18                NaN
19                NaN
Name: Comment, dtype: object
['Not Active' nan 'Partial agonist' 'Not Determined'
 'Not Active (inhibition < 50% @ 10 uM and thus dose-reponse curve not measured)'
 'Active' 'Agonist' 'Dose-dependent effect' 'Antagonist' 'Slightly Active'
 'Dose-Dependent Effect']


In [22]:
# It appears that any compound with a numerical pChEMBL Value has no Comment, in other words, its Comment is NaN.
# The following groupby analysis checks whether this observation holds:

grouped = df.groupby('Comment')['pChEMBL Value']

# check if any group has only numerical values or only None values
for group_name, group_values in grouped:
    group_values = group_values.dropna()
    if len(group_values) == 0:
        print(f"All values in group {group_name} are None.")
    elif all(isinstance(val, float) for val in group_values):
        print(f"All values in group {group_name} are numerical.")
        
    else:
        print(f"Group {group_name} has mixed data types.")

All values in group Active are None.
All values in group Agonist are None.
All values in group Antagonist are None.
All values in group Dose-Dependent Effect are None.
All values in group Dose-dependent effect are None.
All values in group Not Active are None.
All values in group Not Active (inhibition < 50% @ 10 uM and thus dose-reponse curve not measured) are None.
All values in group Not Determined are None.
All values in group Partial agonist are None.
All values in group Slightly Active are None.


We can conclude that by keeping only rows with 'Comment' = NaN, we drop every compound with an undesirable annotation.
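
Putting the four checks together, one possible way to apply them in a single pass is sketched below. The helper `apply_curation_filters`, its `keep_types` parameter, and the toy rows are assumptions made for illustration; the sketch also assumes that numeric entries in 'Comment' have already been replaced by NaN as shown earlier.

```python
import numpy as np
import pandas as pd

def apply_curation_filters(df: pd.DataFrame, keep_types: list) -> pd.DataFrame:
    """Hypothetical helper combining the four filters discussed in this notebook."""
    mask = (
        (df['Assay Type'] == 'B')                # 1. binding assays only
        & (df['Standard Relation'] == "'='")     # 2. exact measurements (quotes are part of the value)
        & df['Standard Type'].isin(keep_types)   # 3. standard potency types ...
        & df['pChEMBL Value'].notna()            #    ... with a numerical pChEMBL Value
        & df['Comment'].isna()                   # 4. no activity annotation in 'Comment'
    )
    return df[mask].reset_index(drop=True)

# Toy rows (assumed values) that each exercise a different filter.
toy = pd.DataFrame({
    'Assay Type': ['B', 'F', 'B', 'B', 'B'],
    'Standard Relation': ["'='", "'='", "'<'", "'='", "'='"],
    'Standard Type': ['IC50', 'IC50', 'IC50', 'Inhibition', 'Kd'],
    'pChEMBL Value': [7.2, 6.1, np.nan, np.nan, 8.0],
    'Comment': [np.nan, np.nan, np.nan, 'Not Active', np.nan],
})

# The list of types here is illustrative; on real data it would come from the groupby analysis.
curated = apply_curation_filters(toy, keep_types=['IC50', 'Kd'])
print(curated)
```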