<a href="https://colab.research.google.com/github/Malikbadmus/model-validation-eos30gr/blob/main/notebooks/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook contains the data cleaning performed on 1000 compounds sourced from ChEMBL that have completed all phases of clinical development and have been approved for use. Prepared by Malik Badmus as part of an Outreachy 2024 contribution.**

In [16]:
import os
import sys
import pandas as pd
# Project directories, relative to the notebooks folder
DATAPATH = "../data"
SRC = "../src"

# Add the src directory to the system path
sys.path.append(os.path.abspath(SRC))

# File paths
input_file_path = os.path.join(DATAPATH, "Raw", "mol_datasets1.csv")
output_file_path = os.path.join(DATAPATH, "Processed", "100_Molecules.csv")

# Read the CSV file into a pandas DataFrame
Data = pd.read_csv(input_file_path, delimiter=';', quotechar='"')

In [17]:
# Inspect the DataFrame

Data.head()

Unnamed: 0,ChEMBL ID,Name,Synonyms,Type,Max Phase,Molecular Weight,Targets,Bioactivities,AlogP,Polar Surface Area,...,Heavy Atoms,HBA (Lipinski),HBD (Lipinski),#RO5 Violations (Lipinski),Molecular Weight (Monoisotopic),Np Likeness Score,Molecular Species,Molecular Formula,Smiles,Inchi Key
0,CHEMBL1868702,GESTRINONE,A 46 745|A-46-745|A-46745|DIMETRIOSE|GESTRINON...,Small molecule,4.0,308.42,19.0,61.0,3.72,37.3,...,23.0,2.0,1.0,0.0,308.1776,1.86,NEUTRAL,C21H24O2,C#C[C@]1(O)CC[C@H]2[C@@H]3CCC4=CC(=O)CCC4=C3C=...,BJJXHLWLUDYTGC-ANULTFPQSA-N
1,CHEMBL2106076,CEFPIROME SULFATE,CEFPIROME SULFATE|CEFPIROME SULFATE (1:1)|CEFP...,Small molecule,4.0,612.67,1.0,1.0,-1.04,153.92,...,35.0,11.0,3.0,2.0,514.1093,-0.19,ACID,C22H24N6O9S3,CO/N=C(\C(=O)N[C@@H]1C(=O)N2C(C(=O)[O-])=C(C[n...,RKTNPKZEPLCLSF-QHBKFCFHSA-N
2,CHEMBL1446650,MEBEVERINE HYDROCHLORIDE,COLOFAC|COLOFAC 100|COLOFAC IBS|COLOFAC MR|CSA...,Small molecule,4.0,466.02,15.0,44.0,4.6,57.23,...,31.0,6.0,0.0,0.0,429.2515,-0.6,BASE,C25H36ClNO5,CCN(CCCCOC(=O)c1ccc(OC)c(OC)c1)C(C)Cc1ccc(OC)c...,PLGQWYOULXPJRE-UHFFFAOYSA-N
3,CHEMBL3707281,MAGNESIUM LACTATE,"ANHYDROUS MAGNESIUM LACTATE, DL-|DL-LACTIC ACI...",Small molecule,4.0,202.44,,,,,...,,,,,202.0328,,,C6H10MgO6,CC(O)C(=O)[O-].CC(O)C(=O)[O-].[Mg+2],OVGXLJDWSLQDRT-UHFFFAOYSA-L
4,CHEMBL3833409,HYDROTALCITE,HYDROTALCITE,Small molecule,4.0,531.91,,,,,...,,,,,529.9019,,,CH16Al2Mg6O19,O=C([O-])[O-].[Al+3].[Al+3].[Mg+2].[Mg+2].[Mg+...,GDVKFRBCXAPAQJ-UHFFFAOYSA-A


In [18]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3592 entries, 0 to 3591
Data columns (total 33 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ChEMBL ID                        3592 non-null   object 
 1   Name                             3592 non-null   object 
 2   Synonyms                         3538 non-null   object 
 3   Type                             3592 non-null   object 
 4   Max Phase                        3592 non-null   float64
 5   Molecular Weight                 3522 non-null   float64
 6   Targets                          2954 non-null   float64
 7   Bioactivities                    2954 non-null   float64
 8   AlogP                            3166 non-null   float64
 9   Polar Surface Area               3166 non-null   float64
 10  HBA                              3166 non-null   float64
 11  HBD                              3166 non-null   float64
 12  #RO5 Violations     

Based on the output above, the DataFrame has 33 columns, most of which are extraneous for our purposes and act as noise. Following the example given (Notebook1), we only need two columns to run predictions with Ersilia: the "Name" column, which contains the names of the chemical compounds, and the "Smiles" column, which contains the SMILES strings. Moreover, the RangeIndex shows 3,592 records, which exceeds the 1000 molecules we need. The data therefore needs to be cleaned up.
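Before subsetting, it can help to quantify how many rows are actually usable. The sketch below is not part of the original workflow: it uses a small toy DataFrame (the names and values are illustrative assumptions, not ChEMBL data) to show one way of counting missing and duplicated SMILES strings with pandas.

```python
import pandas as pd

# Toy stand-in for the ChEMBL export; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "Name": ["ASPIRIN", "UNKNOWN SALT", "ETHANOL", "ASPIRIN COPY"],
        "Smiles": ["CC(=O)Oc1ccccc1C(=O)O", None, "CCO", "CC(=O)Oc1ccccc1C(=O)O"],
        "Max Phase": [4.0, 4.0, 4.0, 4.0],
    }
)

# Rows without a SMILES string cannot be used for predictions.
missing_smiles = int(toy["Smiles"].isna().sum())

# Repeated SMILES strings (ignoring missing ones) may indicate duplicate molecules.
duplicate_smiles = int(toy["Smiles"].dropna().duplicated().sum())

# Rows that survive a dropna on the Smiles column.
usable_rows = int(toy["Smiles"].notna().sum())

print(f"missing={missing_smiles}, duplicates={duplicate_smiles}, usable={usable_rows}")
```

On the real `Data` DataFrame, the same three expressions could be applied to `Data["Smiles"]` to confirm that enough non-null rows remain to select 1000 molecules.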

In [19]:
# Additional verification: inspect a single record

print(Data.iloc[890])

ChEMBL ID                                                              CHEMBL4216467
Name                                                                      RIPRETINIB
Synonyms                                                 DCC-2618|QINLOCK|RIPRETINIB
Type                                                                  Small molecule
Max Phase                                                                        4.0
Molecular Weight                                                              510.37
Targets                                                                          6.0
Bioactivities                                                                   22.0
AlogP                                                                           5.67
Polar Surface Area                                                             88.05
HBA                                                                              5.0
HBD                                                              

In [20]:
# The required columns
new_columns = ['Name', 'Smiles']
check_columns = 'Smiles'

# Keep only the required columns
Updated_Data = Data.drop(columns=Data.columns.difference(new_columns))

# Drop rows with null values in Smiles and select the first 1000 molecules (in file order)
New_Data = Updated_Data.dropna(subset=[check_columns]).iloc[:1000]

# Sort the DataFrame alphabetically by name
N_Data = New_Data.sort_values(by='Name')

# Reset the index so it runs from 0 to 999 without gaps
N_Data.reset_index(drop=True, inplace=True)
N_Data.info()
N_Data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    1000 non-null   object
 1   Smiles  1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB
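Note that the cell above takes the first 1000 rows in file order and only then sorts them by name. Sorting first and slicing afterwards would select a different subset. The following minimal sketch, using hypothetical toy names and a limit of 2 in place of 1000, illustrates the difference.

```python
import pandas as pd

# Toy data; the names and SMILES strings are placeholders.
toy = pd.DataFrame(
    {
        "Name": ["DELTA", "ALPHA", "CHARLIE", "BRAVO"],
        "Smiles": ["CCCC", "C", "CCC", "CC"],
    }
)

n = 2  # stands in for the 1000-molecule limit

# Same order of operations as the cleaning cell: slice in file order, then sort.
slice_then_sort = toy.iloc[:n].sort_values(by="Name").reset_index(drop=True)

# Alternative: sort the whole table first, then slice.
sort_then_slice = toy.sort_values(by="Name").iloc[:n].reset_index(drop=True)

print(slice_then_sort["Name"].tolist())  # ['ALPHA', 'DELTA']
print(sort_then_slice["Name"].tolist())  # ['ALPHA', 'BRAVO']
```

Either choice is defensible; what matters is that the selection rule is stated so the subset can be reproduced.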


We can now inspect the new dataset:

In [21]:
N_Data.head()

Unnamed: 0,Name,Smiles
0,"3,3',4',5-TETRACHLOROSALICYLANILIDE",O=C(Nc1ccc(Cl)c(Cl)c1)c1cc(Cl)cc(Cl)c1O
1,ABAMETAPIR,Cc1ccc(-c2ccc(C)cn2)nc1
2,ABIRATERONE,C[C@]12CC[C@H](O)CC1=CC[C@@H]1[C@@H]2CC[C@]2(C...
3,ACALABRUTINIB MALEATE,CC#CC(=O)N1CCC[C@H]1c1nc(-c2ccc(C(=O)Nc3ccccn3...
4,ACECLOFENAC,O=C(O)COC(=O)Cc1ccccc1Nc1c(Cl)cccc1Cl


From the above, we can see that the dataset has been cleaned up successfully and matches the format of the example given (eml_canonical.csv).

In [27]:
# Save the DataFrame to a CSV file

N_Data.to_csv(output_file_path, index=False)
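As an optional sanity check, the saved file can be read back and compared with the DataFrame in memory. The sketch below is an assumption-laden illustration rather than part of the original notebook: it writes a toy DataFrame to a temporary directory (the file name `toy_molecules.csv` is hypothetical), creates the output folder if it is missing, and confirms that the columns and values survive the round trip, including a name that contains a comma.

```python
import os
import tempfile

import pandas as pd

# Toy data; the first name contains a comma to exercise CSV quoting.
toy = pd.DataFrame(
    {
        "Name": ["3,3'-EXAMPLE", "ETHANOL"],
        "Smiles": ["CC", "CCO"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder first; to_csv does not create missing directories.
    out_dir = os.path.join(tmp_dir, "Processed")
    os.makedirs(out_dir, exist_ok=True)

    out_path = os.path.join(out_dir, "toy_molecules.csv")
    toy.to_csv(out_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(out_path)

# Compare column names and cell values between the original and the reloaded table.
round_trip_ok = (
    list(reloaded.columns) == list(toy.columns)
    and reloaded.values.tolist() == toy.values.tolist()
)
print(list(reloaded.columns), len(reloaded), round_trip_ok)
```

For the real output, the same comparison could be run between `N_Data` and `pd.read_csv(output_file_path)`.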