<a href="https://colab.research.google.com/github/Malikbadmus/model-validation-eos30gr/blob/main/notebooks/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook showcases the data cleaning performed on small-molecule compounds that have successfully completed all phases of clinical development and have been approved for use. The dataset was sourced from ChEMBL by Malik Badmus as part of an Outreachy 2024 contribution.

### Data Preprocessing

In [2]:
import os
import sys
import pandas as pd
# Add the local src directory to the module search path
sys.path.append(os.path.abspath("../src"))
DATAPATH = "../data"
SRC = "../src"

# File paths
input_file_path = os.path.join(DATAPATH, "Raw", "mol_datasets1.csv")
output_file_path = os.path.join(DATAPATH, "Processed", "100_Molecules.csv")

# Reading the CSV file into a pandas DataFrame
Data = pd.read_csv(input_file_path, delimiter=';', quotechar='"')

In [2]:
# Inspect the DataFrame
Data.head()

Unnamed: 0,ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,...,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Np Likeness Score,Molecular Species,Molecular Formula,Smiles,Inchi Key
0,CHEMBL1868702,GESTRINONE,A 46 745|A-46-745|A-46745|DIMETRIOSE|GESTRINON...,Small molecule,4.0,308.42,19.0,61.0,3.72,37.3,...,23.0,2.0,1.0,0.0,308.1776,1.86,NEUTRAL,C21H24O2,C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3C=...,BJJXHLWLUDYTGC-ANULTFPQSA-N
1,CHEMBL2106076,CEFPIROME SULFATE,CEFPIROME SULFATE|CEFPIROME SULFATE (1:1)|CEFP...,Small molecule,4.0,612.67,1.0,1.0,-1.04,153.92,...,35.0,11.0,3.0,2.0,514.1093,-0.19,ACID,C22H24N6O9S3,CO/N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[n...,RKTNPKZEPLCLSF-QHBKFCFHSA-N
2,CHEMBL1446650,MEBEVERINE HYDROCHLORIDE,COLOFAC|COLOFAC 100|COLOFAC IBS|COLOFAC MR|CSA...,Small molecule,4.0,466.02,15.0,44.0,4.6,57.23,...,31.0,6.0,0.0,0.0,429.2515,-0.6,BASE,C25H36ClNO5,CCN(CCCCOC(=O)c1ccc(OC)c(OC)c1)C(C)Cc1ccc(OC)c...,PLGQWYOULXPJRE-UHFFFAOYSA-N
3,CHEMBL3707281,MAGNESIUM LACTATE,"ANHYDROUS MAGNESIUM LACTATE, DL-|DL-LACTIC ACI...",Small molecule,4.0,202.44,,,,,...,,,,,202.0328,,,C6H10MgO6,CC(O)C(=O)[O-].CC(O)C(=O)[O-].[Mg+2],OVGXLJDWSLQDRT-UHFFFAOYSA-L
4,CHEMBL3833409,HYDROTALCITE,HYDROTALCITE,Small molecule,4.0,531.91,,,,,...,,,,,529.9019,,,CH16Al2Mg6O19,O=C([O-])[O-].[Al+3].[Al+3].[Mg+2].[Mg+2].[Mg+...,GDVKFRBCXAPAQJ-UHFFFAOYSA-A


In [3]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3592 entries, 0 to 3591
Data columns (total 33 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ChEMBL ID                        3592 non-null   object 
 1   Name                             3592 non-null   object 
 2   Synonyms                         3538 non-null   object 
 3   Type                             3592 non-null   object 
 4   Max Phase                        3592 non-null   float64
 5   Molecular Weight                 3522 non-null   float64
 6   Targets                          2954 non-null   float64
 7   Bioactivities                    2954 non-null   float64
 8   AlogP                            3166 non-null   float64
 9   Polar Surface Area               3166 non-null   float64
 10  HBA                              3166 non-null   float64
 11  HBD                              3166 non-null   float64
 12  #RO5 Violations     

Based on the observations above, our DataFrame has 33 columns, most of which are extraneous for our purposes and amount to "noise" in this context.

Following the given example (Notebook1), we only need one (1) column to run predictions with Ersilia: the canonical SMILES, a standardized and unique representation of a molecular structure. The canonical SMILES is derived from the SMILES string in the "Smiles" column. The InChIKey is already provided in the dataset, but we standardize it as well to ensure consistency.
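
To illustrate what canonicalization means, here is a minimal sketch using plain RDKit. It is not the implementation of `standardise_smiles` in `../src`, which may apply further standardization steps; the helper name `to_canonical_smiles` is hypothetical and exists only for this example.

```python
from rdkit import Chem


def to_canonical_smiles(smiles):
    """Return RDKit's canonical SMILES for `smiles`, or None if it cannot be parsed.

    Hypothetical helper for illustration only; not the notebook's standardise_smiles.
    """
    if not isinstance(smiles, str):
        return None  # e.g. NaN from a missing cell
    mol = Chem.MolFromSmiles(smiles)  # None when parsing or sanitization fails
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)  # RDKit writes canonical SMILES by default


# Two different spellings of ethanol map to the same canonical string
print(to_canonical_smiles("OCC") == to_canonical_smiles("CCO"))

# An unparsable string yields None instead of raising
print(to_canonical_smiles("not_a_smiles"))
```

Returning `None` for molecules that cannot be processed keeps the column aligned with the input rows, so failures can be counted and dropped later.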

Moreover, the RangeIndex of 3,592 records exceeds the 1,000 molecules we need. The data therefore needs to be cleaned up.

In [4]:
# Additional verification: inspect a single record
print(Data.iloc[890])

ChEMBL ID                                                              CHEMBL4216467
Name                                                                      RIPRETINIB
Synonyms                                                 DCC-2618|QINLOCK|RIPRETINIB
Type                                                                  Small molecule
Max Phase                                                                        4.0
Molecular Weight                                                              510.37
Targets                                                                          6.0
Bioactivities                                                                   22.0
AlogP                                                                           5.67
Polar Surface Area                                                             88.05
HBA                                                                              5.0
HBD                                                              

In [5]:
from processing import standardise_smiles, standardise_inchikey

new_columns = ['Smiles', 'Inchi Key']
N_Data = Data.drop(columns=Data.columns.difference(new_columns)).dropna(subset=['Smiles'])

# Standardize the SMILES strings to create canonical SMILES, and standardize the InChIKeys
N_Data['Canonical_smiles'] = standardise_smiles(N_Data['Smiles'])
N_Data['Inchi Key'] = standardise_inchikey(N_Data['Inchi Key'])
N_Data.info()


[23:27:17] Can't kekulize mol.  Unkekulized atoms: 0 2 4 6 7 9
[23:27:26] Can't kekulize mol.  Unkekulized atoms: 3 10


<class 'pandas.core.frame.DataFrame'>
Index: 3384 entries, 0 to 3591
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Smiles            3384 non-null   object
 1   Inchi Key         3384 non-null   object
 2   Canonical_smiles  3102 non-null   object
dtypes: object(3)
memory usage: 105.8+ KB


**We can now inspect our new dataset**

In [6]:
N_Data.head()

Unnamed: 0,Smiles,Inchi Key,Canonical_smiles
0,C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3C=...,BJJXHLWLUDYTGC-ANULTFPQSA-N,C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3C=...
1,CO/N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[n...,RKTNPKZEPLCLSF-QHBKFCFHSA-N,CO/N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[n...
2,CCN(CCCCOC(=O)c1ccc(OC)c(OC)c1)C(C)Cc1ccc(OC)c...,PLGQWYOULXPJRE-UHFFFAOYSA-N,CCN(CCCCOC(=O)c1ccc(OC)c(OC)c1)C(C)Cc1ccc(OC)cc1
3,CC(O)C(=O)[O-].CC(O)C(=O)[O-].[Mg+2],OVGXLJDWSLQDRT-UHFFFAOYSA-L,
4,O=C([O-])[O-].[Al+3].[Al+3].[Mg+2].[Mg+2].[Mg+...,GDVKFRBCXAPAQJ-UHFFFAOYSA-A,


From the above, we can see that standardization returned null values in Canonical_smiles for 282 of the 3,384 molecules (3,384 - 3,102), including those that RDKit could not kekulize. Our dataset now contains three (3) features: the original SMILES string, the standardized (canonical) SMILES string, and an identifier (the InChIKey).
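
The count of failed molecules can also be computed directly rather than read off the `info()` output. The sketch below uses a small toy frame in place of `N_Data`; the rows and the `None` value are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking N_Data: None marks a molecule with no canonical SMILES
toy = pd.DataFrame({
    "Smiles": ["CCO", "c1ccccc1", "[Mg+2]", "CC(=O)O"],
    "Canonical_smiles": ["CCO", "c1ccccc1", None, "CC(=O)O"],
})

n_total = len(toy)
n_failed = int(toy["Canonical_smiles"].isna().sum())
print(f"{n_failed} of {n_total} molecules have no canonical SMILES")

# Keep the failing rows for manual inspection
failed_rows = toy[toy["Canonical_smiles"].isna()]
print(failed_rows["Smiles"].tolist())
```

Applied to the real data, the same `isna().sum()` call on `N_Data['Canonical_smiles']` should reproduce the difference between the two non-null counts shown above.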

In [9]:
check_columns = 'Canonical_smiles'

# Drop rows with null values in Canonical_smiles, then randomly select 1000 molecules
Updated_Data = N_Data.dropna(subset=[check_columns])
New_Data = Updated_Data.sample(n=1000, random_state=42)

New_Data.drop(columns=['Smiles'], inplace=True)
New_Data.rename(columns={'Canonical_smiles': 'smiles'}, inplace=True)
New_Data.reset_index(drop=True, inplace=True)

In [10]:
New_Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Inchi Key  1000 non-null   object
 1   smiles     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [11]:
New_Data.head()

Unnamed: 0,Inchi Key,smiles
0,IPVQLZZIHOAWMC-QXKUPLGCSA-N,CCC[C@H](N[C@@H](C)C(=O)N1[C@H](C(=O)O)C[C@@H]...
1,GBXSMTUPTTWBMN-XIRDDKMYSA-N,CCOC(=O)[C@H](CCc1ccccc1)N[C@@H](C)C(=O)N1CCC[...
2,ZIIJJOPLRSCQNX-UHFFFAOYSA-N,CCN(CC)CCN1C(=O)CN=C(c2ccccc2F)c2cc(Cl)ccc21
3,QZFHIXARHDBPBY-UHFFFAOYSA-N,COC(=O)Nc1c(N)nc(-c2nn(Cc3ccccc3F)c3ncc(F)cc23...
4,VBHQKCBVWWUUKN-KZNAEPCWSA-N,COc1ccc(CCO[C@@H]2CCCC[C@H]2N2CC[C@@H](O)C2)cc1OC


All null values have been dropped and 1000 molecules have been selected for our new dataset. We can now save it to /data/Processed in CSV format.
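
As a sanity check on the selection and save steps, the following sketch runs the same sequence on a toy frame: drop nulls, draw a seeded sample, and round-trip through CSV. The keys, the sample size of 3, and the in-memory buffer are stand-ins chosen for illustration, not values from the notebook.

```python
import io

import pandas as pd

# Toy data standing in for the cleaned molecules; one row has no SMILES
toy = pd.DataFrame({
    "Inchi Key": ["KEY-A", "KEY-B", "KEY-C", "KEY-D", "KEY-E"],
    "smiles": ["CCO", None, "CCN", "CCC", "CO"],
})

clean = toy.dropna(subset=["smiles"])
n_wanted = 3  # stands in for the 1000 molecules used in the notebook
assert len(clean) >= n_wanted, "not enough valid molecules to sample from"

# The same random_state gives the same selection on every run
sample_a = clean.sample(n=n_wanted, random_state=42).reset_index(drop=True)
sample_b = clean.sample(n=n_wanted, random_state=42).reset_index(drop=True)
print(sample_a.equals(sample_b))

# Round-trip through CSV (in memory here) without writing the index column
buffer = io.StringIO()
sample_a.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns), len(reloaded))
```

Fixing `random_state` makes the selection reproducible, and `index=False` keeps the saved file to the two columns the model input needs.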

In [12]:
# Save the DataFrame to a CSV file

New_Data.to_csv(output_file_path, index=False)

### Getting our predictions on the Processed Dataset


The model eos30f3 was downloaded and served via the Ersilia Model Hub on my Linux Ubuntu 22.04 system. The processed input file was passed to it, and predictions for the 1000 molecules were generated. The model predictions were saved to /data/Model_predictions in CSV format (eos30f3_output.csv).

### Model Bias Evaluation

In [6]:
# Load the Model Predictions
Model_predictions = pd.read_csv(os.path.join(DATAPATH, "Model_predictions", "eos30f3_output.csv"))

Inspect the DataFrame

In [7]:
Model_predictions.head()

Unnamed: 0,key,input,activity
0,IPVQLZZIHOAWMC-QXKUPLGCSA-N,CCC[C@H](N[C@@H](C)C(=O)N1[C@H](C(=O)O)C[C@@H]...,0.141321
1,GBXSMTUPTTWBMN-XIRDDKMYSA-N,CCOC(=O)[C@H](CCc1ccccc1)N[C@@H](C)C(=O)N1CCC[...,0.235772
2,SAADBVWGJQAEFS-UHFFFAOYSA-N,CCN(CC)CCN1C(=O)CN=C(c2ccccc2F)c2cc(Cl)ccc21,0.853313
3,QZFHIXARHDBPBY-UHFFFAOYSA-N,COC(=O)Nc1c(N)nc(-c2nn(Cc3ccccc3F)c3ncc(F)cc23...,0.225853
4,VBHQKCBVWWUUKN-KZNAEPCWSA-N,COc1ccc(CCO[C@@H]2CCCC[C@H]2N2CC[C@@H](O)C2)cc1OC,0.836128


In [8]:
Model_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   key       1000 non-null   object 
 1   input     1000 non-null   object 
 2   activity  1000 non-null   float64
dtypes: float64(1), object(2)
memory usage: 23.6+ KB


From the above, we can see that our prediction data contains three (3) features: key, which is the InChIKey identifier; input, which is our SMILES string; and activity, which is the predicted probability that a compound is active, that is, a potential hERG blocker, with activity defined at a predefined threshold of 10 µM.
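
One simple way to summarize these probabilities is to binarize them and count the predicted blockers. The sketch below uses the five `activity` values shown in the `head()` output above. The 0.5 probability cutoff is an assumption made purely for illustration; it is separate from the 10 µM threshold, which describes how activity is defined rather than where to cut the predicted probability.

```python
import pandas as pd

# The five activity values shown in Model_predictions.head() above
preds = pd.DataFrame({
    "activity": [0.141321, 0.235772, 0.853313, 0.225853, 0.836128],
})

CUTOFF = 0.5  # hypothetical decision cutoff, chosen only for illustration

# Flag molecules whose predicted probability reaches the cutoff
preds["predicted_blocker"] = preds["activity"] >= CUTOFF
n_blockers = int(preds["predicted_blocker"].sum())
fraction = n_blockers / len(preds)
print(n_blockers, fraction)
```

On the full `Model_predictions` frame, the same two lines would give the share of the 1000 molecules predicted as blockers at whichever cutoff is chosen.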

### 