# Data Sets

How you preprocess the data depends on the model you want to build. Refer to the section relevant to your model of interest:

1. [Graph-Based Transformer](#Preparing-Data-for-the-Graph-Based-Transformer)
2. [Smiles-Based Transformer](#Preparing-Data-for-the-Smiles-Based-Transformer)

In this tutorial, we assume you have already extracted a list of SMILES strings that you want to use for either pretraining or finetuning. You should also have placed the downloaded example data in `jupyter/data`. For simplicity, there are also reduced sets of 500 molecules each that you can use to quickly test the code. For example:

In [1]:
import pandas as pd

smiles = pd.read_csv('jupyter/data/LIGAND_RAW_small.tsv', sep='\t', header=0, usecols=('Smiles',), na_values=('NA', 'nan', 'NaN')).iloc[:,0]
smiles.dropna(inplace=True)

print(smiles.head())
smiles.shape

0    CCCCn1cc2c(nc(NC(=O)Nc3ccc(Cl)c(Cl)c3)n3nc(-c4...
1    O=C(Cc1ccccc1)Nc1nc2nn(CCc3ccccc3)cc2c2nc(-c3c...
2    O=C(COc1ccccc1)Nc1nc2nn(CCc3ccccc3)cc2c2nc(-c3...
3    CC(C)(C)NC(=O)Nc1nc2nn(CCCc3ccccc3)cc2c2nc(-c3...
4                COc1ccc(-n2cc3c(n2)c(N)nc2ccccc23)cc1
Name: Smiles, dtype: object


(500,)

This set contains the first 500 molecules from the original `jupyter/data/LIGAND_RAW.tsv` file, which contains ligands related to the protein targets of interest in this tutorial. The same ligands were also used as examples in [the original DrugEx v3 study](https://chemrxiv.org/engage/chemrxiv/article-details/61aa8b58bc299c0b30887f80). These will be our molecules of interest for the rest of the tutorial, where they are used for [finetuning existing pretrained models](finetuning.ipynb). Here, we create the proper data sets for this task. Note that the procedure is the same if you want to pretrain your own model.

## Standardization

The first step is standardization of the data. That is easily accomplished with the built-in `Standardization` processor that we can apply to our compounds:

In [2]:
from drugex.data.processing import Standardization

# standardization (like many tasks in this tutorial) can run in parallel,
# so we save the desired number of CPUs to use here
N_PROC = 4
standardizer = Standardization(n_proc=N_PROC)
smiles = standardizer.apply(smiles)

len(smiles)


429

The standardizer also removes duplicates for us, so the resulting number of molecules is lower than in the original data. Some unusual molecules may also fail to parse. When that happens, DrugEx prints a warning and skips the failed molecule. `Standardization` also allows you to change the standardization method by supplying a custom standardizer function (e.g. `Standardization(standardizer=my_function)`). You can find more details in the documentation.
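To make the behaviour described above more concrete, here is a minimal pure-Python sketch of a standardize, skip-failures, and deduplicate loop. It is not the DrugEx implementation: `standardize_all` and the toy `my_function` are hypothetical, and the sketch assumes a custom standardizer takes one SMILES string and returns one SMILES string (or `None` on failure). String length is used as a crude stand-in for component size, where a real standardizer would rely on a chemistry toolkit.

```python
from typing import Callable, Iterable, List, Optional


def my_function(smiles: str) -> Optional[str]:
    """Hypothetical custom standardizer: keep the longest '.'-separated component."""
    smiles = smiles.strip()
    if not smiles:
        return None  # signal that this entry could not be standardized
    # crude proxy for "largest fragment": the longest component string
    return max(smiles.split("."), key=len)


def standardize_all(
    smiles_list: Iterable[str],
    standardizer: Callable[[str], Optional[str]],
) -> List[str]:
    """Apply the standardizer, warn about and skip failures, drop duplicates."""
    seen = set()
    result = []
    for smi in smiles_list:
        try:
            std = standardizer(smi)
        except Exception:
            std = None
        if std is None:
            print(f"Warning: skipping molecule that failed to standardize: {smi!r}")
            continue
        if std not in seen:  # keep only the first occurrence
            seen.add(std)
            result.append(std)
    return result


raw = ["CCO", " CCO ", "CC(=O)[O-].[Na+]", "", "c1ccccc1"]
cleaned = standardize_all(raw, my_function)
print(cleaned)
```

In this toy input, the second entry collapses into the first after whitespace stripping and the empty string is skipped with a warning, which mirrors why the standardized set above is smaller than the raw one.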

## Preparing Data for the Graph-Based Transformer

The input for the transformer model consists of the fragments that make up the molecules of interest, while the molecules of interest themselves are the output. To convert the input SMILES to this representation, we have to generate a so-called corpus data set that defines the underlying chemistry or grammar rules for the model. You can use the `FragmentCorpusEncoder` processor in combination with the `GraphFragmentEncoder` to generate your data set:

*Note*: The term 'corpus' comes from NLP (Natural Language Processing) and was originally used in DrugEx v1 to describe the tokenized SMILES input for the recurrent neural network often used in NLP to represent textual data.

In [3]:
from drugex.data.fragments import FragmentCorpusEncoder
from drugex.data.fragments import GraphFragmentEncoder, FragmentPairsSplitter
from drugex.molecules.converters.fragmenters import Fragmenter
from drugex.data.corpus.vocabulary import VocGraph

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'), # handles how fragment-molecule pairs are created
    encoder=GraphFragmentEncoder(
        VocGraph(n_frags=4) # encoder uses the graph vocabulary to create the graph matrix from the created fragment-molecule pairs
    ),
    pairs_splitter=FragmentPairsSplitter(0.1, 100, unique_only=True), # in this instance, we also use a splitter to divide the fragment-molecule pairs into a test set and training set
    n_proc=N_PROC # we can again run these actions in parallel
)
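To illustrate what a fragment-molecule pair is, the following conceptual sketch enumerates combinations of a molecule's fragments and pairs each combination (the model input) with the parent molecule (the model output). This is an assumption-laden illustration and not how DrugEx builds its pairs: `fragment_molecule_pairs` is a hypothetical helper, the fragments are hand-written rather than produced by BRICS, and joining fragments with `.` is just one convenient way to show a combination.

```python
from itertools import combinations
from typing import List, Tuple


def fragment_molecule_pairs(
    molecule: str,
    fragments: List[str],
    max_combination_size: int,
) -> List[Tuple[str, str]]:
    """Pair every fragment combination (up to a maximum size) with its parent molecule."""
    unique_fragments = sorted(set(fragments))  # deterministic order, no repeats
    pairs = []
    for size in range(1, max_combination_size + 1):
        for combo in combinations(unique_fragments, size):
            pairs.append((".".join(combo), molecule))
    return pairs


# hand-written example values for illustration only (not BRICS output)
molecule = "COc1ccc(CN)cc1"
fragments = ["CO", "c1ccccc1", "CN"]

pairs = fragment_molecule_pairs(molecule, fragments, max_combination_size=2)
for frags, mol in pairs:
    print(f"{frags:>15} -> {mol}")
```

Every pair shares the same target molecule, so a single molecule can contribute several training examples, one per fragment combination.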

Once we have defined the encoder, we can apply it to our data and use a `GraphFragDataSet` to collect the results from the parallel processes. Depending on the output of the splitter, the `FragmentCorpusEncoder` creates one data set per split. Above we specified `unique_only=True` in the splitter definition, which means we will only collect a training set of unique fragment-molecule combinations and a randomly chosen test set of fragment-molecule pairs. This is one way to reduce computational complexity. We also asked the splitter to cap the test set size at 100 fragment-molecule pairs, which helps to speed up training down the line.
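The effect of combining a test fraction with a size cap can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the `FragmentPairsSplitter` implementation: `split_pairs` is a hypothetical helper, the `unique_only` behaviour is not modelled, and the 1180 placeholder pairs are a made-up count chosen so that 10% comes out at 118.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair],
    test_fraction: float,
    max_test_size: int,
    seed: int = 42,
) -> Tuple[List[Pair], List[Pair]]:
    """Randomly split pairs into (train, test), capping the test set size."""
    rng = random.Random(seed)  # local generator keeps the split reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    # the test set is the requested fraction, but never larger than the cap
    test_size = min(int(len(shuffled) * test_fraction), max_test_size)
    return shuffled[test_size:], shuffled[:test_size]


# placeholder fragment-molecule pairs (made-up data)
all_pairs = [(f"frag_{i}", f"mol_{i}") for i in range(1180)]
train_pairs, test_pairs = split_pairs(all_pairs, test_fraction=0.1, max_test_size=100)
print(len(train_pairs), len(test_pairs))
```

Without the cap the test set would hold 118 pairs. With the cap it holds 100 and the remaining pairs go to the training set.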

Next, we create the empty data sets that will represent the corpus data generated. Each data set must be associated with a file where the data will be saved. We use the `.tsv` extension because the output is a standard tab-delimited text file:

In [4]:
from drugex.data.datasets import GraphFragDataSet
import os

# create the directory for our input files (no error if it already exists)
graph_input_folder = "data/model_inputs/graph"
os.makedirs(graph_input_folder, exist_ok=True)

# create empty data sets (we can specify a path to a file where the data set can be saved)
train = GraphFragDataSet(f"{graph_input_folder}/ligand_train.tsv")
test = GraphFragDataSet(f"{graph_input_folder}/ligand_test.tsv")

# apply the encoder and collect data (test data is collected first)
encoder.apply(smiles, encodingCollectors=[test, train])

To speed up the training, the test set size was automatically capped at 100 fragments instead of the default 10% of original data, which would have been: 118.
An exception occurred when converting molecule data: CCNC(=O)C1OC(n2cnc3c(NCCCCCCCCNC(=O)CCCCCNC(=O)COc4ccc(C=CC5=[N+]6C(=Cc7ccc(-c8cccs8)n7[B-]6(F)F)C=C5)cc4)ncnc32)C(O)C1O
 Cause: <class 'drugex.data.fragments.FragmentPairsEncodedSupplier.MoleculeEncodingException'>: Failed to encode molecule: CCNC(=O)C1OC(n2cnc3c(NCCCCCCCCNC(=O)CCCCCNC(=O)COc4ccc(C=CC5=[N+]6C(=Cc7ccc(-c8cccs8)n7[B-]6(F)F)C=C5)cc4)ncnc32)C(O)C1O
An exception occurred when converting molecule data: N=c1ccc2c(-c3ccc(C(=O)NCCCCCCn4cc(CCCC(=O)Nc5nc6ccc(Cl)cc6c6nc(-c7ccco7)nn56)nn4)cc3C(=O)O)c3ccc(N)c(S(=O)(=O)O)c3oc-2c1S(=O)(=O)O
 Cause: <class 'drugex.data.fragments.FragmentPairsEncodedSupplier.MoleculeEncodingException'>: Failed to encode molecule: N=c1ccc2c(-c3ccc(C(=O)NCCCCCCn4cc(CCCC(=O)Nc5nc6ccc(Cl)cc6c6nc(-c7ccco7)nn56)nn4)cc3C(=O)O)c3ccc(N)c(S(=O)(=O)O)c3oc

Some molecules may still fail to encode at this stage, so check the output to make sure no important structures were lost. Now that the data sets are ready, we still have to save them to their destination:

In [5]:
train.save()
test.save()

You can check that the appropriate files were indeed created in the `data/model_inputs/graph/` folder. We can easily recreate the `GraphFragDataSet` instances from these files when we need them for training:

In [6]:
train_from_file = GraphFragDataSet('imported')
train_from_file.fromFile(train.outpath)

# we can check the output by converting the data set to a pandas DataFrame
df = train_from_file.getDataFrame()
df.head()

Unnamed: 0,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399
0,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,2,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,8,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,8,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


It is also a good idea to save the vocabulary used to encode the structures along with the files for future reference:

In [7]:
train.getVoc().toFile(f"{graph_input_folder}/vocabulary.txt")
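Once the data sets and the vocabulary have been written, a quick sanity check can confirm that every expected file exists and is not empty. The sketch below uses only the standard library. `missing_or_empty` is a hypothetical helper, and the demonstration runs against a temporary folder with dummy contents rather than the real `data/model_inputs/graph/` folder.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_or_empty(folder: str, filenames: Iterable[str]) -> List[str]:
    """Return the names of expected files that are absent or have zero size."""
    base = Path(folder)
    problems = []
    for name in filenames:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems


expected = ["ligand_train.tsv", "ligand_test.tsv", "vocabulary.txt"]

# demonstration in a throwaway folder with dummy file contents
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "ligand_train.tsv").write_text("C0\tC1\n1\t0\n")
    (Path(tmp) / "ligand_test.tsv").write_text("")  # empty on purpose
    problems = missing_or_empty(tmp, expected)

print(problems)
```

In the tutorial setting you would call the helper with the folder used above and expect an empty list in return.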

## Preparing Data for the Smiles-Based Transformer

TODO