# Prediction of toxicity of small molecules

This notebook contains an end-to-end project for predicting the toxicity of small molecules. Details about the code can be found in this notebook or in the helper scripts referenced in each section.

The dataset used in this project comes from [MolToxPred](https://pubs.rsc.org/en/content/articlelanding/2024/ra/d3ra07322j). See the linked reference for further information.

In [2]:
import pandas as pd

In [10]:
data = pd.read_csv('https://raw.githubusercontent.com/FabioHerrera97/Cheminformatics_ML_toxicity/refs/heads/main/data/smiles_10449_train_test.csv')
data.head()

,SMILES,Toxicity
0,Cn1cnc2c(F)c(Nc3ccc(Br)cc3Cl)c(C(=O)NOCCO)cc21,0
1,COC(=O)c1ccc2c(c1)NC(=O)/C2=C(\Nc1ccc(N(C)C(=O...,0
2,CN(C)C/C=C/C(=O)Nc1cc2c(Nc3ccc(F)c(Cl)c3)ncnc2...,0
3,COC1CC2CCC(C)C(O)(O2)C(=O)C(=O)N2CCCCC2C(=O)OC...,0
4,CS(=O)(=O)O.Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc...,0


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10449 entries, 0 to 10448
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   SMILES    10449 non-null  object
 1   Toxicity  10449 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 163.4+ KB


In [15]:
print(data['Toxicity'].value_counts())

Toxicity
0    5833
1    4616
Name: count, dtype: int64


This is a relatively balanced dataset containing 10449 compounds: 5833 molecules (55.8%) are non-toxic (label 0), while the remaining 4616 (44.2%) are toxic (label 1).
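The balance claim can be checked with a few lines of arithmetic. The sketch below is only an illustration: it hard-codes the two counts printed above rather than reloading the CSV, and the variable names are made up for this example. On the full DataFrame, `data['Toxicity'].value_counts(normalize=True)` should give the same fractions directly.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 5833, 1: 4616}

total = sum(counts.values())
# Fraction of the dataset that falls in each class.
fractions = {label: n / total for label, n in counts.items()}
# Ratio of the majority class to the minority class (1.0 = perfectly balanced).
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, frac in fractions.items():
    print(f"label {label}: {counts[label]} compounds ({frac:.1%})")
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio this close to 1 suggests that plain accuracy is still informative here, although per-class metrics remain worth reporting.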

## Standardization of the compounds

**NOTE: This section of the project is based on the [DeepMol](https://deepmol.readthedocs.io/en/latest/) standardization tutorial and the [MolPipeline](https://pubs.acs.org/doi/10.1021/acs.jcim.4c00863) example notebooks.**

Standardization refers to transforming a set of chemical structures into a consistent representation using a predefined set of rules. This makes it possible to compare the structures in the dataset with one another and supports steps such as removing duplicate entries and ensuring data consistency.
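To illustrate why a common representation matters for duplicate removal, here is a minimal sketch using plain RDKit. It is not the DeepMol or MolPipeline implementation. The helper `canonical_smiles` and the toy input list are hypothetical, and canonical SMILES is used here as just one possible comparison key.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Return RDKit's canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)  # parsing also sanitizes by default
    return Chem.MolToSmiles(mol) if mol is not None else None


# Hypothetical toy inputs: ethanol written twice, benzene in two notations.
raw = ["OCC", "CCO", "c1ccccc1", "C1=CC=CC=C1"]

unique = []
seen = set()
for smi in raw:
    can = canonical_smiles(smi)
    # Keep only the first occurrence of each canonical form.
    if can is not None and can not in seen:
        seen.add(can)
        unique.append(can)

print(unique)
```

Comparing the raw strings would treat all four entries as different compounds, whereas comparing canonical forms leaves two.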

There are three common standardization options: a basic standardizer, a complex standardizer, and the ChEMBL standardizer. A basic standardizer only performs sanitization, which includes steps such as kekulization, valence checks, and setting aromaticity, conjugation, and hybridization. A complex standardizer adds customized procedures on top of that, such as removing isotope information, neutralizing charges, removing stereochemistry, or removing smaller fragments. Finally, the [ChEMBL](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-020-00456-1) standardizer formats compounds according to the rules and conventions defined by ChEMBL.
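The difference between the basic and complex options can be sketched with RDKit alone. This is an illustrative approximation under stated assumptions, not the code used by DeepMol or MolPipeline: the function names `basic_standardize` and `complex_standardize` are hypothetical, the order of the extra steps is one reasonable choice among several, and the sodium salt of alanine is a made-up test input.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def basic_standardize(smiles):
    """Sanitization only: valence checks, kekulization, aromaticity, etc."""
    mol = Chem.MolFromSmiles(smiles, sanitize=False)
    if mol is None:  # the SMILES string could not be parsed
        return None
    Chem.SanitizeMol(mol)  # raises an exception on e.g. invalid valences
    return mol


def complex_standardize(smiles):
    """Sanitize, then apply the extra steps named in the text (assumed order)."""
    mol = basic_standardize(smiles)
    if mol is None:
        return None
    # Keep the largest fragment, dropping counter-ions and other small pieces.
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where possible.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # Remove isotope labels.
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)
    # Remove stereochemistry (modifies the molecule in place).
    Chem.RemoveStereochemistry(mol)
    return mol


smi = "[Na+].C[C@H](N)C(=O)[O-]"  # hypothetical input: sodium salt of alanine
print(Chem.MolToSmiles(basic_standardize(smi)))
print(Chem.MolToSmiles(complex_standardize(smi)))
```

The basic variant keeps the counter-ion, the charge, and the stereocenter, while the complex variant reduces the input to the neutral parent structure. Which of these steps is appropriate depends on the modelling task; removing stereochemistry, for example, discards information that may be relevant to toxicity.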