# Data preparation

This tutorial takes a closer look at the data preparation options available in QSPRpred.

The first step is to load the data.

In [1]:
import pandas as pd

# Load the tab-separated A2A ligand data set into a DataFrame
df = pd.read_csv('../../tutorial_data/A2A_LIGANDS.tsv', sep='\t')

# Preview the first five rows
df.head()

| | SMILES | pchembl_value_Mean |
|---|---|---|
| 0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 |
| 1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 |
| 2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 |
| 3 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 |
| 4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 |
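
Before preparing the data further, it can help to sanity-check the loaded table. The sketch below assumes only the two columns shown above (`SMILES` and `pchembl_value_Mean`) and uses a small in-memory stand-in for the TSV file; the sample rows are copied from this tutorial's outputs, and `check_table` is a hypothetical helper, not part of QSPRpred.

```python
import io

import pandas as pd

# In-memory stand-in for the TSV file (assumption: the real file has at least
# these two columns; only structures shown in full in the tutorial are used).
tsv_text = (
    "SMILES\tpchembl_value_Mean\n"
    "O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1\t5.65\n"
    "CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12\t7.09\n"
    "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1\t8.22\n"
)


def check_table(df, smiles_col="SMILES", target_col="pchembl_value_Mean"):
    """Hypothetical helper: summarize basic quality checks for a ligand table."""
    return {
        "n_rows": len(df),
        "missing_smiles": int(df[smiles_col].isna().sum()),
        "missing_target": int(df[target_col].isna().sum()),
        "duplicate_smiles": int(df[smiles_col].duplicated().sum()),
    }


sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
report = check_table(sample_df)
print(report)
```

The same helper could be called on the `df` loaded above, provided its column names match.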


## Data representation (`MoleculeTable`)

You may have already encountered `QSPRDataset` in some of the other tutorials.
`QSPRDataset` is a subclass of `MoleculeTable`,
which is designed specifically for data sets that contain molecular structures.

For example, you can easily convert the SMILES strings to RDKit molecules in your table:

In [3]:
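from qsprpred.data.data import MoleculeTable

# Wrap the DataFrame in a MoleculeTable; add_rdkit=True adds an RDMol column
mt = MoleculeTable(df=df, store_dir="../../tutorial_output/data", name="A2A_moleculetable", add_rdkit=True)
# Show the underlying DataFrame
mt.getDF()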

| QSPRID (index) | SMILES | pchembl_value_Mean | QSPRID | RDMol |
|---|---|---|---|---|
| A2A_moleculetable_0 | `Cc1nn(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n2)c(...` | 8.68 | A2A_moleculetable_0 | |
| A2A_moleculetable_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2c1cc1CCCC...` | 4.82 | A2A_moleculetable_1 | |
| A2A_moleculetable_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | A2A_moleculetable_2 | |
| A2A_moleculetable_3 | `CNC(=O)C12CC1C(n1cnc3c1nc(C#CCCCCC(=O)OC)nc3NC...` | 5.45 | A2A_moleculetable_3 | |
| A2A_moleculetable_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.20 | A2A_moleculetable_4 | |
| ... | ... | ... | ... | ... |
| A2A_moleculetable_4077 | `CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12` | 7.09 | A2A_moleculetable_4077 | |
| A2A_moleculetable_4078 | `Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1` | 8.22 | A2A_moleculetable_4078 | |
| A2A_moleculetable_4079 | `Nc1nc(Nc2ccc(F)cc2)nc(CSc2nnc(N)s2)n1` | 4.89 | A2A_moleculetable_4079 | |
| A2A_moleculetable_4080 | `CCCOc1ccc(C=Cc2cc3c(c(=O)n(C)c(=O)n3C)n2C)cc1` | 6.51 | A2A_moleculetable_4080 | |
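
The output shows that the table is indexed by a `QSPRID` built from the table name and a running row number, and that the same identifier is also kept as a regular column. The pandas-only sketch below mimics that layout for illustration. It is an assumption based on the displayed output, not QSPRpred's actual implementation, and `add_ids` is a hypothetical helper.

```python
import pandas as pd


def add_ids(df, name, id_col="QSPRID"):
    """Hypothetical helper: add '<name>_<row number>' identifiers and index by them."""
    out = df.reset_index(drop=True).copy()
    out[id_col] = [f"{name}_{i}" for i in range(len(out))]
    # drop=False keeps the identifier as a column as well as the index
    return out.set_index(id_col, drop=False)


# Two rows copied from the tutorial output, used as a toy table
toy = pd.DataFrame(
    {
        "SMILES": [
            "CNc1ncc(C(=O)NCc2ccc(OC)cc2)c2nc(-c3ccco3)nn12",
            "Nc1nc(-c2ccco2)c2ncn(C(=O)NCCc3ccccc3)c2n1",
        ],
        "pchembl_value_Mean": [7.09, 8.22],
    }
)

indexed = add_ids(toy, "A2A_moleculetable")
print(indexed.index.tolist())
```

Indexing by a stable identifier like this makes it possible to look up a molecule by ID, for example `indexed.loc["A2A_moleculetable_1"]`, regardless of row order.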
