# Data Sets

How you preprocess the data depends on the model you want to build. Refer to the section for your model of interest:

1. [Graph-Based Transformer](#Preparing-Data-for-the-Graph-Based-Transformer)
2. [Smiles-Based Transformer](#Preparing-Data-for-the-Smiles-Based-Transformer)

In this tutorial, we assume you have already extracted a list of SMILES strings that you want to use for either pretraining or finetuning. You can download the example data file [here]() and extract the molecules with `pandas`:

In [1]:
import pandas as pd

smiles = pd.read_csv('data/A2AR_raw.txt', sep='\t', header=0, usecols=('CANONICAL_SMILES',), na_values=('NA', 'nan', 'NaN')).iloc[:,0]
smiles.dropna(inplace=True)

print(smiles.head())
smiles.shape

0    CCCCn1cc2c(nc(NC(=O)Nc3ccc(cc3)S(=O)(=O)O)n4nc...
1    CCCCn1cc2c(nc(NC(=O)Nc3ccc(cc3)S(=O)(=O)O)n4nc...
2    NC1=Nc2c(cnn2CCN3CCC(CC3)N4CCOCC4)C5=NN(Cc6ccc...
3        CCN1C(=O)N(CC)c2nc3N(CCc4ccccc4OC)CCCn3c2C1=O
4    COc1cncc(c1)c2cc(NC(=O)CN3CCOCC3)nc(n2)n4nc(C)...
Name: CANONICAL_SMILES, dtype: object


(11464,)
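
For readers who want to see what the `read_csv` call above does, here is a minimal standard-library sketch of the same idea: read a tab-separated file, keep one column, and drop missing values. The helper name `read_smiles_column` and the three-row sample are made up for illustration; only the column name `CANONICAL_SMILES` and the NA markers come from the cell above.

```python
import csv
import io

# markers treated as missing, mirroring the cell above (plus the empty string)
NA_VALUES = {"", "NA", "nan", "NaN"}

def read_smiles_column(handle, column="CANONICAL_SMILES"):
    """Return the non-missing values of `column` from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = []
    for row in reader:
        value = row.get(column) or ""  # short rows yield None, treat as missing
        if value not in NA_VALUES:
            values.append(value)
    return values

# hypothetical in-memory sample standing in for data/A2AR_raw.txt
sample = io.StringIO(
    "ID\tCANONICAL_SMILES\n"
    "1\tCCO\n"
    "2\tNA\n"
    "3\tc1ccccc1\n"
)
smiles_list = read_smiles_column(sample)
print(smiles_list)
```

For the real file, `pandas` remains the more convenient choice because it returns a `Series` that the later cells can pass on directly.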

These will be our molecules of interest, and later in the tutorial we will use them for [finetuning existing DrugEx models](finetuning.ipynb). We therefore need to create the proper data sets for this task, but first the molecules have to be standardized. That is easily accomplished with the `Standardization` processor, which we can apply to our compounds:

In [2]:
from drugex.datasets.processing import Standardization

N_PROC = 4 # standardization (like many tasks in this tutorial) can be done in parallel so we save the desired number of CPUs to use here
standardizer = Standardization(n_proc=N_PROC)
smiles = standardizer.apply(smiles)

len(smiles)

Parsing Error: [V+8]
Traceback (most recent call last):
  File "/home/sichom/projects/DrugEx/drugex/molecules/converters/standardizers.py", line 58, in __call__
    raise StandardizationException(f"No carbon in SMILES: {smileR}")
drugex.molecules.converters.standardizers.StandardizationException: No carbon in SMILES: [V+8]
 Cause: <class 'drugex.molecules.converters.standardizers.StandardizationException'>: No carbon in SMILES: [V+8]
ERROR:root:Parsing Error: [Zn+2]
Traceback (most recent call la

7958

The standardizer also removes duplicates, so the resulting number of molecules (7958) is smaller than in the original data (11464). Some molecules also failed to parse, for example `[V+8]`, which contains no carbon; DrugEx logs an error when that happens.
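
As a rough illustration of the duplicate removal only, the sketch below drops exact duplicate strings while keeping the first-seen order. This is not how `Standardization` works internally: the processor standardizes each structure, so two different SMILES strings for the same molecule may also collapse into one entry. The function name `drop_duplicates` and the sample list are hypothetical.

```python
def drop_duplicates(smiles):
    """Remove exact duplicate strings, keeping the first occurrence of each."""
    # dict keys preserve insertion order (Python 3.7+), which gives an ordered set
    return list(dict.fromkeys(smiles))

# hypothetical input with two repeated entries
raw = ["CCO", "c1ccccc1", "CCO", "CCN", "c1ccccc1"]
unique = drop_duplicates(raw)
print(len(raw), "->", len(unique))
```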

## Preparing Data for the Graph-Based Transformer

The input of the transformer model is the set of fragments that the molecules of interest are made up of, while the molecules themselves are the output. You can use the `FragmentEncoder` processor in combination with the `GraphFragmentEncoder` to generate your data set:

In [3]:
from drugex.datasets.processing import FragmentEncoder
from drugex.datasets.fragments import GraphFragmentEncoder, FragmentPairsSplitter
from drugex.molecules.converters.fragmenters import Fragmenter
from drugex.corpus.vocabulary import VocGraph

encoder = FragmentEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'), # handles how fragment-molecule pairs are created
    encoder=GraphFragmentEncoder(
        VocGraph(n_frags=4) # encoder uses the graph vocabulary to create the graph matrix from the fragment-molecule pairs (see: https://chemrxiv.org/engage/api-gateway/chemrxiv/assets/orp/resource/item/61aa8b58bc299c0b30887f80/original/drug-ex-v3-scaffold-constrained-drug-design-with-graph-transformer-based-reinforcement-learning.pdf)
    ),
    pairs_splitter=FragmentPairsSplitter(0.1, 100, unique_only=True), # in this instance, we also use a splitter to divide the fragment-molecule pairs into a test set and training set
    n_proc=N_PROC # we can again run these actions in parallel
)

With the encoder defined, we can apply it to our data and use `GraphFragDataSet` instances to collect the results. Depending on the output of the splitter, the `FragmentEncoder` creates one data set per split. Above we specified `unique_only=True` in the splitter definition, which means we will be able to collect a training set of only unique fragment-molecule combinations and a randomly chosen test set of fragment-molecule pairs. The splitter returns the test set first and then the training set, so we have to pass the collectors in that order. Before creating the empty data sets, we reduce the amount of logging output:

In [4]:
from drugex.logs import logger
import logging

logger.setLevel(logging.ERROR) # remove warnings from the drugex logger to reduce the amount of output

In [5]:
from drugex.datasets.processing import GraphFragDataSet
import os

# create the directory for our input files
graph_input_folder = "data/inputs/graph"
os.makedirs(graph_input_folder, exist_ok=True)

# create empty data sets (we can specify a path to a file where the data set can be saved)
train = GraphFragDataSet(f"{graph_input_folder}/train.txt")
test = GraphFragDataSet(f"{graph_input_folder}/test.txt")

# apply the encoder and collect data (test data is collected first)
encoder.apply(smiles, encodingCollectors=[test, train])
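
The order of `encodingCollectors` matters because the splits are handed out positionally. The toy sketch below shows that pattern with plain lists: a hypothetical `split_pairs` function returns the test split first and the training split second, and each split goes to the collector at the same position. The splitting rule used here (a random fraction capped at a maximum size) is an assumption chosen for illustration and is not taken from the `FragmentPairsSplitter` implementation.

```python
import random

def split_pairs(pairs, test_fraction=0.1, max_test=100, seed=42):
    """Toy splitter: return (test, train), with the test split first."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = min(int(len(shuffled) * test_fraction), max_test)
    return shuffled[:n_test], shuffled[n_test:]

# hypothetical (fragment, molecule) pairs
pairs = [(f"frag{i}", f"mol{i}") for i in range(50)]

test_collector, train_collector = [], []
collectors = [test_collector, train_collector]  # same order as the splitter output

# each split is stored in the collector at the matching position
for split, collector in zip(split_pairs(pairs), collectors):
    collector.extend(split)

print(len(test_collector), len(train_collector))
```

Swapping the two collectors would silently store the small test split under the training file name, which is why the order is worth double-checking.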

Now the data sets are ready and we can save them to their destination:

In [6]:
train.save()
test.save()

You can check that the appropriate files were indeed created in the `data/inputs/graph/` folder. We can easily recreate the `GraphFragDataSet` instances from these files when we need them for training:

In [7]:
train_from_file = GraphFragDataSet('imported')
train_from_file.fromFile(train.outpath)

# we can check the output by converting the data set to a pandas DataFrame
df = train_from_file.getDataFrame()
df.head()

Unnamed: 0,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399
0,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,18,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


It is also a good idea to save the used vocabulary along with the files for future reference:

In [8]:
train.getVoc().toFile(f"{graph_input_folder}/vocabulary.txt")
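
To confirm which files this section wrote to disk, a small standard-library helper along the following lines could be used. The function name `list_data_files` is hypothetical, and the exact set of file names DrugEx produces is not assumed here; the helper just reports whatever is present in the folder.

```python
from pathlib import Path

def list_data_files(folder):
    """Return the sorted names of regular files in `folder` (empty list if it is missing)."""
    path = Path(folder)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.is_file())

# folder used throughout this section
print(list_data_files("data/inputs/graph"))
```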

## Preparing Data for the Smiles-Based Transformer

TODO