# Data preparation

For property prediction methods like MPNNs (Message Passing Neural Networks), it is crucial to have a dataset that includes the SMILES (simplified molecular-input line-entry system) representation of the molecules.

We preprocessed the initial NRAS ligand dataset by removing duplicate SMILES and dropping entries whose IC50 value carries a qualifier symbol (> or <). We also used RDKit to compute a set of molecular features for each ligand.
In this script, two distinct datasets were created:

1. Dataset_Positive: This dataset contains only the NRAS ligands that were obtained from the initial dataset.
2. Dataset_Negative: This dataset includes only negative molecules that are not capable of inhibiting NRAS.


## Dataset_Positive

In [10]:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, Crippen, rdMolDescriptors

df = pd.read_csv('../Data/NRAS_ligands.csv')
print(df.columns)

Index(['BindingDB Reactant_set_id', 'Ligand SMILES', 'Ligand InChI',
       'Ligand InChI Key', 'BindingDB MonomerID', 'BindingDB Ligand Name',
       'Target Name',
       'Target Source Organism According to Curator or DataSource',
       'IC50 (nM)'],
      dtype='object')


### Features with RDKit
For each ligand, several molecular features were calculated using the open-source RDKit package. The selected features include:
1. Molecular Weight: This is the sum of the atomic weights of all the atoms in a molecule. The code uses `Descriptors.ExactMolWt`, which returns the monoisotopic mass (computed from the most abundant isotope of each element) rather than the average molecular weight.
2. LogP (Partition Coefficient): LogP is a measure of a compound's lipophilicity or hydrophobicity, indicating how a molecule partitions between a hydrophobic phase and an aqueous phase.
3. Topological Polar Surface Area (TPSA): TPSA estimates the surface area of a molecule that is polar or capable of forming hydrogen bonds. It provides insight into a compound's capacity to interact with other molecules, such as binding to biological targets.
4. Rotatable Bonds: This is the number of single bonds in a molecule that can rotate freely. This feature is relevant because it influences the molecule's conformation and flexibility.

The code also records the molecular formula and the numbers of hydrogen-bond donors and acceptors.
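
As a cross-check on the first feature, the sketch below recomputes the mass of the first ligand in the printed table (formula C20H21N7O) using only the standard library. The feature-extraction cell calls `Descriptors.ExactMolWt`, so the value in the "Molecular weight" column should match the monoisotopic mass rather than the average molecular weight. The helper `formula_mass` and the hard-coded atomic masses are assumptions added for illustration and are not part of the original notebook; the parser handles only simple, uncharged formulas.

```python
import re

# Monoisotopic masses (most abundant isotope) and standard average atomic
# weights for a few elements; values are hard-coded here for illustration.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
AVERAGE = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def formula_mass(formula, masses):
    """Sum element masses for a simple formula such as 'C20H21N7O'."""
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += masses[symbol] * (int(count) if count else 1)
    return total


exact = formula_mass("C20H21N7O", MONOISOTOPIC)
average = formula_mass("C20H21N7O", AVERAGE)
print(f"monoisotopic: {exact:.4f}")
print(f"average:      {average:.3f}")
```

The monoisotopic result agrees with the 375.180758 reported for this ligand in the output table, while the average molecular weight comes out roughly 0.26 higher.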

In [11]:
features = []

for i, row in df.iterrows():
    smiles = row['Ligand SMILES']

    # MolFromSmiles returns None (and logs a warning) for SMILES that RDKit cannot parse
    mol = Chem.MolFromSmiles(smiles)

    # Skip unparsable molecules; compute descriptors for the rest
    if mol is not None:
        formula = Chem.rdMolDescriptors.CalcMolFormula(mol)
        mw = Descriptors.ExactMolWt(mol)           # monoisotopic mass
        logp = Crippen.MolLogP(mol)                # Crippen LogP estimate
        num_hbd = Descriptors.NumHDonors(mol)      # H-bond donors
        num_hba = Descriptors.NumHAcceptors(mol)   # H-bond acceptors
        tpsa = Descriptors.TPSA(mol)               # topological polar surface area
        num_rb = Descriptors.NumRotatableBonds(mol)
        ic50 = row['IC50 (nM)']                    # kept as read; may contain '>' or '<'

        features.append([smiles, formula, mw, logp, num_hbd, num_hba, tpsa, num_rb, ic50])


df_features = pd.DataFrame(features, columns=['SMILES','Formula', 'Molecular weight', 'LogP', "H-bond donor","H-bond acceptor","TPSA","Rotatable bonds", "IC50"])
print(len(df_features))
print(df_features.head(5))

[18:52:32] Explicit valence for atom # 27 N, 4, is greater than permitted
[18:52:33] Explicit valence for atom # 27 N, 4, is greater than permitted
[18:52:34] Explicit valence for atom # 27 N, 4, is greater than permitted
[18:52:35] Explicit valence for atom # 27 N, 4, is greater than permitted


1831
                                              SMILES     Formula  \
0    COc1cc2ncc(-c3cccc(NC4CCNC4)n3)n2cc1-c1cn[nH]c1   C20H21N7O   
1      COc1cc2ncc(-c3cccc(NC4CCNC4)n3)n2cc1-c1cccnc1   C22H22N6O   
2      COc1cc2ncc(-c3cccc(NC4CCNC4)n3)n2cc1-c1ccncc1   C22H22N6O   
3  Cc1n[nH]c(C)c1-c1cn2c(cnc2cc1CO)-c1cccc(NC2CCN...   C22H25N7O   
4  COc1cc2ncc(-c3cccc(NC4CCNC4)n3)n2cc1-c1cnn(CCN...  C26H32N8O2   

   Molecular weight     LogP  H-bond donor  H-bond acceptor    TPSA  \
0        375.180758  2.56880             3                7   92.16   
1        386.185509  3.24070             2                7   76.37   
2        386.185509  3.24070             2                7   76.37   
3        403.212058  2.66934             4                7  103.16   
4        488.264822  2.37440             2               10   93.77   

   Rotatable bonds    IC50  
0                5   33170  
1                5    9920  
2                5    1830  
3                5   15720  
4             

### Drop entries with special characters (>/<) in the IC50 column

Some entries in the IC50 column carry a qualifier symbol (> or <), which prevents the column from being treated as numerical. For this reason, we remove the rows that contain these symbols from our data frame.
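
To make the effect of the qualifiers concrete, here is a minimal standard-library sketch that splits an IC50 string into an optional qualifier and a numeric value. The helper `split_ic50`, the pattern `IC50_PATTERN`, and the sample strings are hypothetical and only illustrate the idea; the notebook itself filters with pandas.

```python
import re

# Optional '>' or '<' qualifier, followed by a plain decimal number.
IC50_PATTERN = re.compile(r"^\s*([<>]?)\s*(\d+(?:\.\d+)?)\s*$")


def split_ic50(text):
    """Return (qualifier, value) for an IC50 string, or None if it does not match."""
    match = IC50_PATTERN.match(text)
    if match is None:
        return None
    qualifier, number = match.groups()
    return qualifier, float(number)


samples = ["33170", ">10000", " <2.5 ", "n/a"]  # hypothetical entries
parsed = [split_ic50(s) for s in samples]

# Keep only exact measurements, i.e. entries without a qualifier.
exact_values = [p[1] for p in parsed if p is not None and p[0] == ""]
print(parsed)
print(exact_values)
```

A qualified entry such as `>10000` only gives a bound on the IC50, so dropping it (rather than stripping the symbol) avoids treating a bound as an exact measurement.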

In [12]:
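# Keep only the rows whose IC50 entry contains neither '>' nor '<'.
# .copy() makes the result an independent frame, so the assignment below
# does not trigger pandas' SettingWithCopyWarning.
df_features = df_features[df_features['IC50'].str.contains(">|<") == False].copy()

# Convert the IC50 column to a numeric dtype; unparsable values become NaN
column = 'IC50'
df_features[column] = pd.to_numeric(df_features[column], errors='coerce')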

summary = df_features.describe()
print(summary)

       Molecular weight         LogP  H-bond donor  H-bond acceptor  \
count       1443.000000  1443.000000   1443.000000      1443.000000   
mean         929.880380     5.904186      2.243243        11.722800   
std          111.243312     0.926376      0.499390         1.349762   
min          299.093773     0.510500      2.000000         5.000000   
25%          913.485045     5.544500      2.000000        11.000000   
50%          937.488416     5.996000      2.000000        12.000000   
75%          977.519717     6.401700      2.000000        13.000000   
max         1154.630408     8.069200      6.000000        15.000000   

              TPSA  Rotatable bonds          IC50  
count  1443.000000      1443.000000   1443.000000  
mean    168.349473         9.802495   2138.860707  
std      19.032505         1.439658   3281.785689  
min      54.250000         3.000000      2.000000  
25%     161.890000         9.000000     55.000000  
50%     171.540000        10.000000    550.00000

### Drop SMILES duplicates

In [13]:
df = df_features.drop_duplicates(subset=['SMILES'])
print(len(df))

578
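
`drop_duplicates` keeps the first row for each SMILES by default (`keep='first'`), so when a ligand has several IC50 measurements only the first one survives. The pure-Python sketch below mimics that behaviour on hypothetical records and contrasts it with taking the median per SMILES. The records and the median alternative are illustrative assumptions, not something the notebook does.

```python
from statistics import median

# Hypothetical (SMILES, IC50 in nM) records with one repeated ligand.
records = [("CCO", 100.0), ("c1ccccc1", 50.0), ("CCO", 300.0), ("CCO", 200.0)]

# First occurrence wins, mirroring drop_duplicates(subset=['SMILES']).
first_kept = {}
for smiles, ic50 in records:
    first_kept.setdefault(smiles, ic50)

# Alternative: collect every measurement and aggregate with the median.
grouped = {}
for smiles, ic50 in records:
    grouped.setdefault(smiles, []).append(ic50)
median_ic50 = {smiles: median(values) for smiles, values in grouped.items()}

print(first_kept)
print(median_ic50)
```

Note also that the deduplication compares SMILES strings literally, so two different SMILES spellings of the same molecule would both be kept.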


In [14]:
df.to_csv('../Data/Dataset_Positive.csv', index=False)

## Dataset_Negative

In [15]:
# load file 
df2 = pd.read_csv('../Data/BindingDB_1000.csv')
print(df2.columns)

Index(['BindingDB Reactant_set_id', 'Ligand SMILES', 'Ligand InChI',
       'Ligand InChI Key', 'BindingDB MonomerID', 'BindingDB Ligand Name',
       'Target Name',
       'Target Source Organism According to Curator or DataSource',
       'IC50 (nM)'],
      dtype='object')


### Features with RDKit

We use RDKit to calculate the same features that were selected for Dataset_Positive.

In [16]:
features = []

for i, row in df2.iterrows():
    smiles = row['Ligand SMILES']

    # MolFromSmiles returns None (and logs a warning) for SMILES that RDKit cannot parse
    mol = Chem.MolFromSmiles(smiles)

    # Skip unparsable molecules; compute descriptors for the rest
    if mol is not None:
        formula = Chem.rdMolDescriptors.CalcMolFormula(mol)
        mw = Descriptors.ExactMolWt(mol)           # monoisotopic mass
        logp = Crippen.MolLogP(mol)                # Crippen LogP estimate
        num_hbd = Descriptors.NumHDonors(mol)      # H-bond donors
        num_hba = Descriptors.NumHAcceptors(mol)   # H-bond acceptors
        tpsa = Descriptors.TPSA(mol)               # topological polar surface area
        num_rb = Descriptors.NumRotatableBonds(mol)

        features.append([smiles, formula, mw, logp, num_hbd, num_hba, tpsa, num_rb])


df_features = pd.DataFrame(features, columns=['SMILES','Formula', 'Molecular weight', 'LogP', "H-bond donor","H-bond acceptor","TPSA","Rotatable bonds"])

print(len(df_features))
print(df_features.head(5))

[18:52:53] Can't kekulize mol.  Unkekulized atoms: 24 25 26 27 28


999
                                              SMILES        Formula  \
0   O=C(CCc1cccnc1)N(Cc1cccs1)Cc1nc2CCOCc2c(=O)[nH]1    C21H22N4O3S   
1  COc1cccc(OC)c1-c1noc(C2CC2)c1COc1ccc(c(Cl)c1)C...  C30H27ClFN3O7   
2    CNC(=O)c1ccccc1Sc1ccc2c(\C=C\c3ccccn3)n[nH]c2c1     C22H18N4OS   
3  CC(C)NCCN([C@@H]1CC[C@@]2(CC2C1)c1cccc(c1)C#N)...   C25H29Cl2N5O   
4  Clc1cccc2c(ccnc12)-n1cnc2c(cc(cc12)N1CCOCC1)-c...    C22H18ClN7O   

   Molecular weight     LogP  H-bond donor  H-bond acceptor    TPSA  \
0        410.141262  2.46070             1                6   88.18   
1        595.152156  5.40870             2                9  127.38   
2        386.120132  4.63910             2                4   70.67   
3        485.174916  5.60238             2                4   81.05   
4        431.126136  3.84880             1                7   84.75   

   Rotatable bonds  
0                7  
1               10  
2                5  
3                7  
4                3  


### Drop SMILES duplicates

In [17]:
df2 = df_features.drop_duplicates(subset=['SMILES'])
print(len(df2))
print(df2.head(5))

992
                                              SMILES        Formula  \
0   O=C(CCc1cccnc1)N(Cc1cccs1)Cc1nc2CCOCc2c(=O)[nH]1    C21H22N4O3S   
1  COc1cccc(OC)c1-c1noc(C2CC2)c1COc1ccc(c(Cl)c1)C...  C30H27ClFN3O7   
2    CNC(=O)c1ccccc1Sc1ccc2c(\C=C\c3ccccn3)n[nH]c2c1     C22H18N4OS   
3  CC(C)NCCN([C@@H]1CC[C@@]2(CC2C1)c1cccc(c1)C#N)...   C25H29Cl2N5O   
4  Clc1cccc2c(ccnc12)-n1cnc2c(cc(cc12)N1CCOCC1)-c...    C22H18ClN7O   

   Molecular weight     LogP  H-bond donor  H-bond acceptor    TPSA  \
0        410.141262  2.46070             1                6   88.18   
1        595.152156  5.40870             2                9  127.38   
2        386.120132  4.63910             2                4   70.67   
3        485.174916  5.60238             2                4   81.05   
4        431.126136  3.84880             1                7   84.75   

   Rotatable bonds  
0                7  
1               10  
2                5  
3                7  
4                3  


In [None]:
df2.to_csv('../Data/Dataset_Negative.csv', index=False)