# Introduction
### The goal of this task is to check that the authors' ADME@NCATS model gives the same results when it is run through the Ersilia Model Hub (eos74bo).

# Model description
### Kinetic aqueous solubility (μg/mL) was experimentally determined using the same SOP in over 200 NCATS drug discovery projects. A final dataset of 11780 non-redundant molecules and their associated solubility values was used to train an SVM classifier. Approximately half of the dataset has poor solubility (< 10 μg/mL), and two-thirds of these poorly soluble molecules report values of < 1 μg/mL. A subset of the data used is available at PubChem (AID 1645848). The model outputs a float: the probability that a compound has poor solubility (< 10 μg/mL).
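
The class definition above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `solubility_class` is hypothetical, and treating a value of exactly 10 μg/mL as not poorly soluble is inferred from the "< 10 μg/mL" cutoff rather than taken from the ADME@NCATS code.

```python
# Hypothetical helper (not part of ADME@NCATS or Ersilia): map a measured
# kinetic solubility to the binary class the classifier is trained on.
POOR_SOLUBILITY_CUTOFF_UG_ML = 10.0  # cutoff stated in the model description


def solubility_class(solubility_ug_ml: float) -> int:
    """Return 1 for poor solubility (< 10 ug/mL), otherwise 0."""
    if solubility_ug_ml < 0:
        raise ValueError("solubility cannot be negative")
    return 1 if solubility_ug_ml < POOR_SOLUBILITY_CUTOFF_UG_ML else 0


if __name__ == "__main__":
    for value in (0.5, 9.9, 10.0, 85.0):
        print(value, "->", solubility_class(value))
```

With this coding, class 1 corresponds to the "low solubility" label used later in the notebook and class 0 to "high solubility".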

# Validation Data Set

### I will validate the model by running predictions on a subset of the NPC data and comparing the results with those of the ADME@NCATS GCNN solubility model.

In [1]:
%%capture
# %%capture is a cell magic, so it has to be the first line of the cell.
# Import the necessary packages and specify the paths to the relevant folders.
%env MINICONDA_INSTALLER_SCRIPT=Miniconda3-py37_4.12.0-Linux-x86_64.sh
%env MINICONDA_PREFIX=/usr/local
%env PYTHONPATH="$PYTHONPATH:/usr/local/lib/python3.7/site-packages"
%env PIP_ROOT_USER_ACTION=ignore

!wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh -O miniconda.sh
!bash miniconda.sh -b -f -p $MINICONDA_PREFIX

!python -m pip install git+https://github.com/ersilia-os/ersilia.git
!python -m pip install requests --upgrade
!pip install rdkit

## Mount google drive
from google.colab import drive

drive.mount("/content/drive")
import sys
import os
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, confusion_matrix, cohen_kappa_score


_ = sys.path.append("/usr/local/lib/python3.7/site-packages")

sys.path.append("/content/drive/MyDrive/Ersilia_ModelValidation")


# specify your output folder

output_folder = "/content/drive/MyDrive/Ersilia_ModelValidation/Data/Output/eos74bo_validation"  # @param {type:"string"}

# specify the input folder path

input_folder = "/content/drive/MyDrive/Ersilia_ModelValidation/Data/Input/eos74bo_validation"  # @param {type:"string"}


In [9]:
# Load the validation data from the input folder into a pandas DataFrame and inspect its columns

test_data = pd.read_csv(os.path.join(input_folder, 'valid_test_data.csv'))

# check the first five rows with its header

print(test_data.head())

                                 standardized_smiles  \
0  CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...   
1                           Clc1cc(Cl)c(OCC#CI)cc1Cl   
2            c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1   
3    Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1   
4               CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12   

                     inchi_key  outcomes  
0  ZBGXUVOIWDMMJE-JNGLTUCJSA-N         1  
1  CTETYYAZBPJBHE-UHFFFAOYSA-N         1  
2  OCAPBUJLXMYKEJ-UHFFFAOYSA-N         1  
3  YIBOMRUWOWDFLG-ONEGZZNKSA-N         1  
4  DOMXUEMWDBAQBQ-WEVVVXLNSA-N         1  


In [10]:
print(test_data.shape)

(176, 3)


### Check data quality

To make sure the validation is not biased by data leakage, I will check that none of the validation compounds appear in the [subset of the training data](https://pubchem.ncbi.nlm.nih.gov/bioassay/1645848) that was made publicly available.

In [5]:
train_data = pd.read_csv(os.path.join(input_folder, 'train_data.csv'))

print(train_data.head())

                                 standardized_smiles  \
0            O=c1cc(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12   
1                C=CCc1ccc(O)c(-c2ccc(O)c(CC=C)c2)c1   
2  CC[C@H]1NC(=O)[C@@H](NC(=O)c2ncccc2O)[C@@H](C)...   
3     O=c1ncn2nc(Sc3ccc(F)cc3F)ccc2c1-c1c(Cl)cccc1Cl   
4  O=C(Cc1ccc(Cl)c(Cl)c1)Nc1ccc(S(=O)(=O)Nc2ccon2...   

                     inchi_key  outcomes  
0  IQPNAANSBPBGFQ-UHFFFAOYSA-N         0  
1  FVYXIJYOAGAUQK-UHFFFAOYSA-N         0  
2  FEPMHVLSLDOMQC-IYPFLVAKSA-N         0  
3  VEPKQEUBKLEPRA-UHFFFAOYSA-N         0  
4  AIDVIFPYWYKRCE-UHFFFAOYSA-N         0  


In [6]:
print(train_data.shape)

(2455, 3)


In [11]:


# Extract the unique InChIKeys from train_data and test_data
train_inchi_keys = set(train_data['inchi_key'])
test_inchi_keys = set(test_data['inchi_key'])

# Check for InChIKeys that occur in both sets
common_inchi_keys = train_inchi_keys.intersection(test_inchi_keys)

if len(common_inchi_keys) == 0:
    print("No common Inchi_keys found between train_data and test_data.")
else:
    print("There are common Inchi_keys between train_data and test_data.")
    print("Common Inchi_keys:", common_inchi_keys)

No common Inchi_keys found between train_data and test_data.
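
The check above compares full InChIKeys, so two stereoisomers of the same skeleton count as different compounds. A stricter variant, sketched below, also compares only the first 14-character block of each key, which encodes the molecular connectivity. The helper name `overlap_report` is hypothetical, the second training key in the example is made up for illustration, and this extra check was not part of the original validation.

```python
def overlap_report(train_keys, test_keys):
    """Compare two collections of InChIKeys exactly and by connectivity block."""
    train = {key.strip().upper() for key in train_keys}
    test = {key.strip().upper() for key in test_keys}

    # Exact matches: the same full key occurs in both sets
    exact = train & test

    # The first hyphen-separated block (14 characters) hashes the connectivity,
    # so keys sharing it describe the same skeleton (e.g. stereoisomers)
    train_skeletons = {key.split("-")[0] for key in train}
    same_skeleton = {key for key in test if key.split("-")[0] in train_skeletons}

    return {"exact": exact, "same_skeleton": same_skeleton - exact}


if __name__ == "__main__":
    example_train = [
        "IQPNAANSBPBGFQ-UHFFFAOYSA-N",  # from the training data shown above
        "ZBGXUVOIWDMMJE-UHFFFAOYSA-N",  # made-up key, for illustration only
    ]
    example_test = [
        "ZBGXUVOIWDMMJE-JNGLTUCJSA-N",
        "CTETYYAZBPJBHE-UHFFFAOYSA-N",
    ]
    print(overlap_report(example_train, example_test))
```

In the notebook, the two `inchi_key` columns could be passed in directly; an empty `same_skeleton` set would be a stronger statement than the exact-match result alone.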


# Model Predictions

In [14]:
#  Extract SMILES to a list
standardized_smiles_list = test_data['standardized_smiles'].tolist()

In [12]:
# enter model name
model_name = "eos74bo"  # @param {type:"string"}

# Fetch the Model
import time

begin = time.time()
!ersilia fetch $model_name
end = time.time()

print("Time taken:", round((end - begin), 2), "seconds")


⬇️  Fetching model eos74bo: ncats-solubility
sudo: unknown user udockerusername
sudo: error initializing audit plugin sudoers_audit
  Running command git clone -q https://github.com/ersilia-os/bentoml-ersilia.git /tmp/pip-req-build-gndrzl_n
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done


  current version: 4.12.0
  latest version: 24.3.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /usr/local/envs/eosbase-bentoml-0.11.0-py37

  added / updat

In [13]:
# Serve the model
!ersilia serve $model_name

sudo: unknown user udockerusername
sudo: error initializing audit plugin sudoers_audit
sudo: unknown user udockerusername
sudo: error initializing audit plugin sudoers_audit
🚀 Serving model eos74bo: ncats-solubility

   URL: http://127.0.0.1:47441
   PID: 15486
   SRV: conda

👉 To run model:
   - run

💁 Information:
   - info


In [15]:
# Run predictions

api = "predict"  # @param {type:"string"}

from ersilia import ErsiliaModel
import time

model = ErsiliaModel(model_name)
begin = time.time()
output = model.api(input=standardized_smiles_list, output="pandas")
end = time.time()

print("Successful 👍! Time taken:", round((end - begin), 2), "seconds")
model.close()

Successful 👍! Time taken: 14.16 seconds


In [17]:
# check the size of the output to make sure it matches the size of input
print(output.shape)

(176, 3)
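
Matching shapes do not guarantee that the rows came back in the order they were sent, and the following cells copy the predictions over by position. A minimal sketch of an order check is given here; `misaligned_rows` is a hypothetical helper, and it assumes that the `input` column of the output holds the SMILES exactly as submitted, which the printed output suggests but the Ersilia documentation should confirm.

```python
def misaligned_rows(submitted, returned):
    """Return the positions where the returned SMILES differ from the submitted ones."""
    if len(submitted) != len(returned):
        raise ValueError(
            f"length mismatch: {len(submitted)} submitted, {len(returned)} returned"
        )
    return [i for i, (s, r) in enumerate(zip(submitted, returned)) if s != r]


if __name__ == "__main__":
    sent = ["Clc1cc(Cl)c(OCC#CI)cc1Cl", "c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1"]
    # Same strings, same order: nothing is reported
    print(misaligned_rows(sent, list(sent)))
    # Swapped order: both positions are reported
    print(misaligned_rows(sent, sent[::-1]))
```

In this notebook the call would look like `misaligned_rows(standardized_smiles_list, output["input"].tolist())`, and an empty list means the positional assignment is safe.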


In [16]:
# Check your results
print(output.head())

# Save my results in Google Drive

output.to_csv(os.path.join(output_folder, 'eos74bo_validation_pred.csv'), index=False)


                           key  \
0  ZBGXUVOIWDMMJE-JNGLTUCJSA-N   
1  CTETYYAZBPJBHE-UHFFFAOYSA-N   
2  OCAPBUJLXMYKEJ-UHFFFAOYSA-N   
3  YIBOMRUWOWDFLG-ONEGZZNKSA-N   
4  DOMXUEMWDBAQBQ-WEVVVXLNSA-N   

                                               input  outcome  
0  CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...    0.997  
1                           Clc1cc(Cl)c(OCC#CI)cc1Cl    1.000  
2            c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1    0.996  
3    Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1    1.000  
4               CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12    0.996  


In [18]:
eos74bo_validation_prediction = pd.read_csv(os.path.join(input_folder, 'valid_test_data.csv'))

In [25]:
# Extract the 'outcome' column from the output DataFrame
outcome = output.iloc[:, 2]

# Assign the extracted 'outcome' column to the test DataFrame
eos74bo_validation_prediction['predicted_probability'] = outcome

# Print the resulting DataFrame
print(eos74bo_validation_prediction.head())

                                 standardized_smiles  \
0  CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...   
1                           Clc1cc(Cl)c(OCC#CI)cc1Cl   
2            c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1   
3    Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1   
4               CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12   

                     inchi_key  outcomes  predicted_probability  
0  ZBGXUVOIWDMMJE-JNGLTUCJSA-N         1                  0.997  
1  CTETYYAZBPJBHE-UHFFFAOYSA-N         1                  1.000  
2  OCAPBUJLXMYKEJ-UHFFFAOYSA-N         1                  0.996  
3  YIBOMRUWOWDFLG-ONEGZZNKSA-N         1                  1.000  
4  DOMXUEMWDBAQBQ-WEVVVXLNSA-N         1                  0.996  


In [26]:
import numpy as np
# Define a threshold (e.g., 0.5 for binary classification)
threshold = 0.5

# Convert predicted probabilities to class labels
eos74bo_validation_prediction['predicted_outcomes'] = np.where(eos74bo_validation_prediction['predicted_probability'] >= threshold, 1, 0)


In [37]:
eos74bo_validation_prediction.head()

| | standardized_smiles | inchi_key | outcomes | predicted_probability | predicted_outcomes |
|---|---|---|---|---|---|
| 0 | `CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...` | ZBGXUVOIWDMMJE-JNGLTUCJSA-N | 1 | 0.997 | 1 |
| 1 | `Clc1cc(Cl)c(OCC#CI)cc1Cl` | CTETYYAZBPJBHE-UHFFFAOYSA-N | 1 | 1.0 | 1 |
| 2 | `c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1` | OCAPBUJLXMYKEJ-UHFFFAOYSA-N | 1 | 0.996 | 1 |
| 3 | `Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1` | YIBOMRUWOWDFLG-ONEGZZNKSA-N | 1 | 1.0 | 1 |
| 4 | `CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12` | DOMXUEMWDBAQBQ-WEVVVXLNSA-N | 1 | 0.996 | 1 |


In [47]:
# Create a Prediction column that turns the predicted class into a solubility category (1 = poor solubility, < 10 µg/mL)
eos74bo_validation_prediction['Prediction'] = eos74bo_validation_prediction['predicted_outcomes'].map({0: 'high solubility', 1: 'low solubility'})

In [48]:
eos74bo_validation_prediction.head()

| | standardized_smiles | inchi_key | outcomes | predicted_probability | predicted_outcomes | Prediction |
|---|---|---|---|---|---|---|
| 0 | `CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...` | ZBGXUVOIWDMMJE-JNGLTUCJSA-N | 1 | 0.997 | 1 | low solubility |
| 1 | `Clc1cc(Cl)c(OCC#CI)cc1Cl` | CTETYYAZBPJBHE-UHFFFAOYSA-N | 1 | 1.0 | 1 | low solubility |
| 2 | `c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1` | OCAPBUJLXMYKEJ-UHFFFAOYSA-N | 1 | 0.996 | 1 | low solubility |
| 3 | `Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1` | YIBOMRUWOWDFLG-ONEGZZNKSA-N | 1 | 1.0 | 1 | low solubility |
| 4 | `CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12` | DOMXUEMWDBAQBQ-WEVVVXLNSA-N | 1 | 0.996 | 1 | low solubility |


In [49]:
# Save my results in Google Drive

eos74bo_validation_prediction.to_csv(os.path.join(output_folder, 'eos74bo_npc_predictions.csv'), index=False)


In [51]:
eos74bo_validation_prediction['Prediction'].value_counts()

high solubility    128
low solubility      48
Name: Prediction, dtype: int64

In [2]:
eos74bo_validation_prediction = pd.read_csv(os.path.join(output_folder, 'eos74bo_npc_predictions.csv'))

In [4]:
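# Evaluate the model with the following metrics.
# Note: roc_auc_score receives the thresholded class labels here, not the
# predicted probabilities, so the reported AUC equals the balanced accuracy.
auc_roc = roc_auc_score(eos74bo_validation_prediction['outcomes'], eos74bo_validation_prediction['predicted_outcomes'])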
bacc = balanced_accuracy_score(eos74bo_validation_prediction['outcomes'], eos74bo_validation_prediction['predicted_outcomes'])
tn, fp, fn, tp = confusion_matrix(eos74bo_validation_prediction['outcomes'], eos74bo_validation_prediction['predicted_outcomes']).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
kappa = cohen_kappa_score(eos74bo_validation_prediction['outcomes'], eos74bo_validation_prediction['predicted_outcomes'])
print(f'auc_roc score is {auc_roc:.4f}')
print(f'The balanced accuracy is {bacc:.4f}')
print(f'The sensitivity is {sensitivity:.4f} and the specificity is {specificity:.4f}')
print(f'The kappa score is {kappa:.4f}')


auc_roc score is 0.8235
The balanced accuracy is 0.8235
The sensitivity is 0.7838 and the specificity is 0.8633
The kappa score is 0.5835
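
Because `roc_auc_score` is given the thresholded labels rather than the probabilities, the AUC printed above necessarily equals the balanced accuracy. The sketch below shows why, using plain Python on made-up toy data (the labels and probabilities are illustrative and do not come from the NPC set). The helper functions are simplified re-implementations written for this example, not the scikit-learn code.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def roc_auc(y_true, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.45]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

if __name__ == "__main__":
    print("balanced accuracy      :", round(balanced_accuracy(y_true, hard_labels), 4))
    print("AUC on hard labels     :", round(roc_auc(y_true, hard_labels), 4))
    print("AUC on probabilities   :", round(roc_auc(y_true, probabilities), 4))
```

A threshold-independent AUC would be obtained by passing the `predicted_probability` column as the score instead of `predicted_outcomes`; that number would generally differ from the one reported here.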


# ADME@NCATS PREDICTIONS

In [31]:
adme_ncats_result = pd.read_csv(os.path.join(output_folder, 'ADME_NPC_Prediction.csv'))
print(adme_ncats_result.head())

                                 standardized_smiles  \
0  CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...   
1                           Clc1cc(Cl)c(OCC#CI)cc1Cl   
2            c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1   
3    Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1   
4               CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12   

                     inchi_key  outcomes Predicted Class (Probability)  \
0  ZBGXUVOIWDMMJE-JNGLTUCJSA-N         1                       1 (1.0)   
1  CTETYYAZBPJBHE-UHFFFAOYSA-N         1                       1 (1.0)   
2  OCAPBUJLXMYKEJ-UHFFFAOYSA-N         1                       1 (1.0)   
3  YIBOMRUWOWDFLG-ONEGZZNKSA-N         1                       1 (1.0)   
4  DOMXUEMWDBAQBQ-WEVVVXLNSA-N         1                       1 (1.0)   

       Prediction  Tanimoto Similarity       Model  
0  low solubility                  NaN  Solubility  
1  low solubility                  NaN  Solubility  
2  low solubility                  NaN  Solubility  
3  low

In [32]:
adme_ncats_result['Prediction'].unique()

array(['low solubility', 'high solubility'], dtype=object)

In [34]:
adme_ncats_result.dtypes

standardized_smiles              object
inchi_key                        object
outcomes                          int64
Predicted Class (Probability)    object
Prediction                       object
dtype: object

In [33]:
adme_ncats_result.drop(columns=['Tanimoto Similarity', 'Model'], inplace=True)

In [35]:
# Split the column into two separate columns
adme_ncats_result[['Predicted Class', 'Probability']] = adme_ncats_result['Predicted Class (Probability)'].str.split(' ', expand=True)

# Remove parentheses from the Probability column
adme_ncats_result['Probability'] = adme_ncats_result['Probability'].str.strip('()')

adme_ncats_result.drop(columns='Predicted Class (Probability)', inplace=True)

# Convert Probability column to float
adme_ncats_result['Probability'] = adme_ncats_result['Probability'].astype(float)

# convert predicted class column to int
adme_ncats_result['Predicted Class'] = adme_ncats_result['Predicted Class'].astype(int)
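
The split on a single space works as long as every entry has exactly the form `1 (1.0)`. A slightly more defensive variant, sketched here with the standard `re` module, fails loudly on anything unexpected. The function name `parse_class_probability` is hypothetical, and the accepted format is inferred from the five rows shown above rather than from any ADME@NCATS specification.

```python
import re

# One class digit (0 or 1), then a probability in parentheses, e.g. "1 (1.0)"
_CLASS_PROB = re.compile(r"^\s*([01])\s*\(\s*([0-9]*\.?[0-9]+)\s*\)\s*$")


def parse_class_probability(text):
    """Split a 'class (probability)' string into (int, float)."""
    match = _CLASS_PROB.match(text)
    if match is None:
        raise ValueError(f"unexpected format: {text!r}")
    predicted_class = int(match.group(1))
    probability = float(match.group(2))
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {text!r}")
    return predicted_class, probability


if __name__ == "__main__":
    for raw in ("1 (1.0)", "0 (0.12)"):
        print(raw, "->", parse_class_probability(raw))
```

Applied to the DataFrame, this could be used through `Series.apply`, with the two elements of each tuple assigned to the class and probability columns.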

In [36]:
adme_ncats_result.head()

| | standardized_smiles | inchi_key | outcomes | Prediction | Predicted Class | Probability |
|---|---|---|---|---|---|---|
| 0 | `CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...` | ZBGXUVOIWDMMJE-JNGLTUCJSA-N | 1 | low solubility | 1 | 1.0 |
| 1 | `Clc1cc(Cl)c(OCC#CI)cc1Cl` | CTETYYAZBPJBHE-UHFFFAOYSA-N | 1 | low solubility | 1 | 1.0 |
| 2 | `c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1` | OCAPBUJLXMYKEJ-UHFFFAOYSA-N | 1 | low solubility | 1 | 1.0 |
| 3 | `Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1` | YIBOMRUWOWDFLG-ONEGZZNKSA-N | 1 | low solubility | 1 | 1.0 |
| 4 | `CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12` | DOMXUEMWDBAQBQ-WEVVVXLNSA-N | 1 | low solubility | 1 | 1.0 |


In [41]:
# Rename both columns so they match the names used for the Ersilia predictions;
# 'predicted_probability' is needed by the column reordering further down
adme_ncats_result = adme_ncats_result.rename(columns={'Probability': 'predicted_probability', 'Predicted Class': 'predicted_outcomes'})

In [42]:
adme_ncats_result.head()

| | standardized_smiles | inchi_key | outcomes | Prediction | predicted_outcomes | predicted_probability |
|---|---|---|---|---|---|---|
| 0 | `CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...` | ZBGXUVOIWDMMJE-JNGLTUCJSA-N | 1 | low solubility | 1 | 1.0 |
| 1 | `Clc1cc(Cl)c(OCC#CI)cc1Cl` | CTETYYAZBPJBHE-UHFFFAOYSA-N | 1 | low solubility | 1 | 1.0 |
| 2 | `c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1` | OCAPBUJLXMYKEJ-UHFFFAOYSA-N | 1 | low solubility | 1 | 1.0 |
| 3 | `Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1` | YIBOMRUWOWDFLG-ONEGZZNKSA-N | 1 | low solubility | 1 | 1.0 |
| 4 | `CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12` | DOMXUEMWDBAQBQ-WEVVVXLNSA-N | 1 | low solubility | 1 | 1.0 |


In [44]:
ordered_columns = ['standardized_smiles', 'inchi_key', 'outcomes', 'predicted_probability', 'predicted_outcomes', 'Prediction']
adme_ncats_result = adme_ncats_result[ordered_columns]

In [45]:
adme_ncats_result.head()

| | standardized_smiles | inchi_key | outcomes | predicted_probability | predicted_outcomes | Prediction |
|---|---|---|---|---|---|---|
| 0 | `CCOC(=O)N[C@@H]1CC[C@@H]2[C@@H](C1)C[C@H]1C(=O...` | ZBGXUVOIWDMMJE-JNGLTUCJSA-N | 1 | 1.0 | 1 | low solubility |
| 1 | `Clc1cc(Cl)c(OCC#CI)cc1Cl` | CTETYYAZBPJBHE-UHFFFAOYSA-N | 1 | 1.0 | 1 | low solubility |
| 2 | `c1ccc(-c2ccc(C(c3ccccc3)n3ccnc3)cc2)cc1` | OCAPBUJLXMYKEJ-UHFFFAOYSA-N | 1 | 1.0 | 1 | low solubility |
| 3 | `Cc1cc(/C=C/C#N)cc(C)c1Nc1ccnc(Nc2ccc(C#N)cc2)n1` | YIBOMRUWOWDFLG-ONEGZZNKSA-N | 1 | 1.0 | 1 | low solubility |
| 4 | `CN(C/C=C/C#CC(C)(C)C)Cc1cccc2ccccc12` | DOMXUEMWDBAQBQ-WEVVVXLNSA-N | 1 | 1.0 | 1 | low solubility |


In [46]:
adme_ncats_result.to_csv(os.path.join(output_folder, 'adme@ncats_npc_predictions.csv'), index=False)


In [55]:
adme_ncats_result['Prediction'].value_counts()

high solubility    128
low solubility      48
Name: Prediction, dtype: int64

In [5]:
adme_ncats_result = pd.read_csv(os.path.join(output_folder, 'adme@ncats_npc_predictions.csv'))

In [6]:
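# Evaluate the model with the following metrics.
# Note: roc_auc_score receives the predicted class labels here, not the
# probabilities, so the reported AUC equals the balanced accuracy.
auc_roc = roc_auc_score(adme_ncats_result['outcomes'], adme_ncats_result['predicted_outcomes'])
bacc = balanced_accuracy_score(adme_ncats_result['outcomes'], adme_ncats_result['predicted_outcomes'])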
tn, fp, fn, tp = confusion_matrix(adme_ncats_result['outcomes'], adme_ncats_result['predicted_outcomes']).ravel()

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
kappa = cohen_kappa_score(adme_ncats_result['outcomes'], adme_ncats_result['predicted_outcomes'])
print(f'auc_roc score is {auc_roc:.4f}')
print(f'The balanced accuracy is {bacc:.4f}')
print(f'The sensitivity is {sensitivity:.4f} and the specificity is {specificity:.4f}')
print(f'The kappa score is {kappa:.4f}')


auc_roc score is 0.8235
The balanced accuracy is 0.8235
The sensitivity is 0.7838 and the specificity is 0.8633
The kappa score is 0.5835


# Conclusion
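The Ersilia implementation (eos74bo) and ADME@NCATS produced the same results on a cleaned subset of 176 compounds from the NPC marketed drug dataset. Both predicted 48 compounds as low solubility and 128 as high solubility, and both gave a balanced accuracy of 0.8235, a sensitivity of 0.7838, a specificity of 0.8633 and a kappa score of 0.5835. The predicted probabilities differ slightly in the rows shown (for example 0.997 versus 1.0), which does not change the predicted class in those rows.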
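
Identical summary metrics do not by themselves prove that every compound received the same class, so a per-compound comparison is a useful final check. The sketch below uses only the five rows printed in the `head()` outputs above; `compare_predictions` is a hypothetical helper, and applying it to all 176 rows of the two saved CSV files is an assumption about how the check would be run, not something done in this notebook.

```python
def compare_predictions(classes_a, classes_b, probs_a, probs_b):
    """Summarise per-compound agreement between two sets of predictions."""
    lengths = {len(classes_a), len(classes_b), len(probs_a), len(probs_b)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(classes_a)
    # Positions where the two models assign a different class
    disagreements = [i for i in range(n) if classes_a[i] != classes_b[i]]
    # Largest absolute difference between the predicted probabilities
    max_prob_gap = max(abs(a - b) for a, b in zip(probs_a, probs_b))

    return {
        "agreement": 1 - len(disagreements) / n,
        "disagreements": disagreements,
        "max_prob_gap": max_prob_gap,
    }


# First five rows as printed above (Ersilia eos74bo versus ADME@NCATS)
ersilia_classes = [1, 1, 1, 1, 1]
ersilia_probs = [0.997, 1.0, 0.996, 1.0, 0.996]
adme_classes = [1, 1, 1, 1, 1]
adme_probs = [1.0, 1.0, 1.0, 1.0, 1.0]

if __name__ == "__main__":
    print(compare_predictions(ersilia_classes, adme_classes, ersilia_probs, adme_probs))
```

On the full data, the `predicted_outcomes` and `predicted_probability` columns of the two DataFrames would be passed in as lists after confirming that both files are in the same row order.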