# Data Sets

You have to preprocess the data depending on the model you want to build. You can refer to sections relevant for your model of interest:

1. [Graph-Based Transformer](#Preparing-Data-for-the-Graph-Based-Transformer)
    * [Scaffold encoding](#Scaffold-encoding)
2. [Smiles-Based Transformer](#Preparing-Data-for-the-Smiles-Based-Transformer) (section incomplete)
3. [Smiles-Based Encoder-Decoder](#Preparing-Data-for-the-Smiles-Based-Encoder-Decoder-Model) (section incomplete)
4. [Smiles-Based Encoder-Decoder with Attention](#Preparing-Data-for-the-Smiles-Based-Encoder-Decoder-Model-with-Attention) (section incomplete)

In this tutorial, we assume you have already extracted a list of SMILES strings that you want to use for either pretraining or finetuning. If you want to use the data from this tutorial, you should already have placed the [downloaded example data](https://drive.google.com/file/d/1lYOmQBnAawnDR2Kwcy8yVARQTVzYDelw/view?usp=sharing) in `jupyter/data` (see [README.md](README.md)). For the sake of simplicity, this tutorial uses reduced sets of only 500 compounds (files with the `small` suffix), but the full sets (without a suffix) are also available so that you can build the full models as well.

## Loading Molecules

Let's start by preprocessing the data from the so-called `LIGAND` set that was used in one of the DrugEx publications:

In [1]:
import pandas as pd

smiles = pd.read_csv('jupyter/data/LIGAND_RAW_small.tsv', sep='\t', header=0, usecols=('Smiles',), na_values=('NA', 'nan', 'NaN')).iloc[:,0]
smiles.dropna(inplace=True)

print(smiles.head())
smiles.shape

0    CCCCn1cc2c(nc(NC(=O)Nc3ccc(Cl)c(Cl)c3)n3nc(-c4...
1    O=C(Cc1ccccc1)Nc1nc2nn(CCc3ccccc3)cc2c2nc(-c3c...
2    O=C(COc1ccccc1)Nc1nc2nn(CCc3ccccc3)cc2c2nc(-c3...
3    CC(C)(C)NC(=O)Nc1nc2nn(CCCc3ccccc3)cc2c2nc(-c3...
4                COc1ccc(-n2cc3c(n2)c(N)nc2ccccc23)cc1
Name: Smiles, dtype: object


(500,)

This set contains the first 500 molecules from the original `jupyter/data/LIGAND_RAW.tsv` file, which contains ligands related to the protein targets of interest in this tutorial. The full set was also used to train and evaluate models in [the original DrugEx v3 study](https://chemrxiv.org/engage/chemrxiv/article-details/61aa8b58bc299c0b30887f80). These will be our molecules of interest in the rest of the tutorial. They will be used for [finetuning existing pretrained models](finetuning.ipynb), and here we create the data sets for this task. Note that if you aim to pretrain your own model, the procedure is the same; you just need to use a more general data set (see [pretraining](pretraining.ipynb)).

## Logging

Because we will potentially be processing a lot of data, it is a good idea to redirect logging output to a separate file so that this notebook stays clean. DrugEx uses a package-wide logger (available as `drugex.logs.logger`), which we can configure with the standard Python `logging` package. For this tutorial, we already created a function in the `utils` module (see [`utils.py`](utils.py)) that configures a log file for us, which will be saved in the `data/logs/` folder:

In [2]:
from utils import initLogger

initLogger('datasets.log')

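The body of `initLogger` is not shown in this tutorial, so the following is only a sketch of how a logger can be pointed at a file with the standard `logging` package. The function name `init_file_logger`, the logger name, and the temporary path are hypothetical and used for illustration only; the helper in `utils.py` may work differently.

```python
import logging
import os
import tempfile

def init_file_logger(name, log_path, level=logging.INFO):
    """Attach a file handler to the named logger and return the logger."""
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(level)
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the notebook output
    return logger

# hypothetical logger name and a temporary folder, used only for this illustration
log_file = os.path.join(tempfile.mkdtemp(), "logs", "example.log")
example_logger = init_file_logger("example_logger", log_file)
example_logger.warning("could not parse molecule")

# read the file back to confirm the message ended up there
with open(log_file, encoding="utf-8") as f:
    log_text = f.read()
print(log_text)
```

Setting `propagate` to `False` is a design choice in this sketch: it stops the messages from also reaching the root logger, which is what would otherwise print them in the notebook.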


## Standardization

The first step is the standardization of the data. That is easily accomplished with the built-in `Standardization` processor that we can apply to our compounds:

In [3]:
from drugex.data.processing import Standardization

N_PROC = 12 # standardization (like many preprocessing tasks in this tutorial) can be done in parallel
standardizer = Standardization(n_proc=N_PROC)
smiles = standardizer.apply(smiles)

len(smiles)

Standardizing molecules (batch processing): 100%|██████████| 1/1 [00:01<00:00,  1.13s/it]


500

The standardizer also handles parsing errors for us, so the resulting number of molecules can be lower than in the original data. `Standardization` also lets you change the standardization method by supplying a custom standardizer function (e.g. `Standardization(standardizer=my_function)`). You can find more details in the [documentation](https://martin-sicho.github.io/drugex-docs/api/drugex.data.html?highlight=standardization#drugex.data.processing.Standardization).

The standardizer does not handle duplicates, so we remove them now. The standardizer should output canonical standardized SMILES, so filtering out duplicate SMILES strings should be sufficient. We do this by creating a `set`:

In [4]:
smiles = set(smiles)
len(smiles)

429
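A `set` removes duplicates but does not keep the input order. If a reproducible order matters for later steps, an order-preserving alternative is `dict.fromkeys`. The sketch below uses a toy list (the strings are example SMILES, not taken from the `LIGAND` set):

```python
# toy list with one repeated entry
example_smiles = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]

unique_unordered = set(example_smiles)                # order is not guaranteed
unique_ordered = list(dict.fromkeys(example_smiles))  # keeps first occurrences in input order

print(len(unique_unordered))
print(unique_ordered)
```

Both variants rely on exact string equality, so they only remove duplicates reliably once the SMILES are canonical, as they should be after standardization.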

## Preparing Data for the Graph-Based Transformer

The input of the transformer model is the set of fragments that the molecules of interest are made up of, while the molecules of interest themselves are the output. The model then learns to create new valid molecules from the given input fragments. In order to convert the SMILES strings to the encoded model input, we have to generate a so-called corpus data set that defines the underlying chemistry or grammar rules for the model. You can use the `FragmentCorpusEncoder` processor in combination with the `GraphFragmentEncoder` to generate the data set for this model:

*Note: The term 'corpus' comes from NLP (Natural Language Processing) and was originally used in DrugEx v1 to describe the tokenized SMILES input for the recurrent neural network often used in NLP to represent textual data. We still use the term here for historical reasons even though the graph-based model is a very different type of model.*

In [5]:
from drugex.data.fragments import FragmentCorpusEncoder
from drugex.data.fragments import GraphFragmentEncoder, FragmentPairsSplitter
from drugex.molecules.converters.fragmenters import Fragmenter
from drugex.data.corpus.vocabulary import VocGraph

encoder = FragmentCorpusEncoder(
    fragmenter=Fragmenter(4, 4, 'brics'), # handles how fragment-molecule pairs are created
    encoder=GraphFragmentEncoder(
        VocGraph(n_frags=4) # encoder uses the graph vocabulary to create the graph matrix from the created fragment-molecule pairs
    ),
    pairs_splitter=FragmentPairsSplitter(0.1, 100), # in this instance, we also use a splitter to divide the fragment-molecule pairs into a test set and training set
    n_proc=N_PROC # we can again run these actions in parallel
)

Once we have defined the encoder (basically a template for data processing), we can apply it to our data to start encoding (see below). There are two operations involved in this process:

1. **Fragmentation** - Determined by the `Fragmenter`, each input molecule is split into fragment-molecule pairs, each of which forms one sample for the model. Depending on the splitting strategy (as determined by `FragmentPairsSplitter`), these pairs are divided into two or more sets. In this instance, we collect two data sets in total:

    1. **test set** - The set of fragment-molecule pairs used for validation after an epoch of training. Its size is controlled by the arguments of `FragmentPairsSplitter`, in this case a test fraction of 0.1 and a maximum of 100.
    2. **training set** - It contains all fragment-molecule combinations not selected for the *test set*.

2. **Encoding** - This step is handled by the `GraphFragmentEncoder`, which is an implementation of `FragmentEncoder` that is specific to the graph-based model. Using the `VocGraph` vocabulary, it encodes each fragment-molecule pair in the data sets above to a representation understood by the model. This representation is saved to the resulting `.tsv` files (one for each set after splitting). These files form the `GraphFragDataSet` and are loaded to the model via a PyTorch `DataLoader`.
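To make the idea of a split with a ratio and a cap more concrete, here is a generic sketch in plain Python. It is not the DrugEx implementation: `split_pairs` and its parameters are hypothetical, and `FragmentPairsSplitter` may count and select test samples differently (for example per fragment rather than per pair).

```python
import random

def split_pairs(pairs, ratio=0.1, max_test=100, seed=42):
    """Split (fragment, molecule) pairs into a test list and a training list.

    The test size is `ratio` of all pairs, capped at `max_test`.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = min(int(len(shuffled) * ratio), max_test)
    return shuffled[:n_test], shuffled[n_test:]

# toy pairs: placeholder names instead of real fragment and molecule SMILES
toy_pairs = [(f"frag{i}", f"mol{i}") for i in range(2000)]
toy_test, toy_train = split_pairs(toy_pairs)
print(len(toy_test), len(toy_train))
```

With 2000 toy pairs, 10% would be 200, so the cap of 100 takes effect and the remaining 1900 pairs go to the training list.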

We initialize the `GraphFragDataSet` instances first with the names of the associated `.tsv` files:

In [6]:
from drugex.data.datasets import GraphFragDataSet
import os

# create a dedicated directory for our graph data set files
graph_input_folder = "data/sets/graph/"
if not os.path.exists(graph_input_folder):
    os.makedirs(graph_input_folder)

# create empty data sets (we have to specify a path to a file where the data set will be saved)
train = GraphFragDataSet(f"{graph_input_folder}/ligand_train.tsv", rewrite=True)
test = GraphFragDataSet(f"{graph_input_folder}/ligand_test.tsv", rewrite=True)

Now, empty data sets are initialized and if any file already exists it will be overwritten (as set by `rewrite=True`). 
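As a side note on the path handling above: because `graph_input_folder` already ends with a slash, the f-string produces a doubled separator. This is usually harmless on POSIX systems, but `os.path.join` avoids it, and `os.makedirs(..., exist_ok=True)` replaces the explicit existence check. The sketch below uses a temporary folder so that it does not touch the tutorial data:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # temporary location used only for this illustration
folder = os.path.join(base, "data", "sets", "graph")

os.makedirs(folder, exist_ok=True)  # creates the folder and any missing parents
os.makedirs(folder, exist_ok=True)  # calling it again does not raise an error

folder_with_slash = "data/sets/graph/"
formatted = f"{folder_with_slash}/ligand_train.tsv"          # doubled separator
joined = os.path.join(folder_with_slash, "ligand_train.tsv")  # single separator

print(formatted)
print(joined)
```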

We can finally run the encoder. We pass our data sets as `encodingCollectors`, which means only the results of the second step described above will be saved:

In [7]:
# apply the encoder and collect data (test data is collected first)
encoder.apply(list(smiles), encodingCollectors=[test, train])

Creating fragment-molecule pairs (batch processing): 100%|██████████| 1/1 [00:00<00:00,  1.92it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:00<00:00,  2.78it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 1/1 [00:01<00:00,  1.04s/it]


Some molecules may still have failed to parse, so it is worth checking the log file to make sure no important molecules were dropped. Now that the data sets are ready, we can check that they indeed contain data:

In [8]:
train.getData()

Unnamed: 0,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399
0,1,0,0,0,1,18,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,18,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,18,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2337,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2338,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2339,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2340,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [9]:
test.getData()

Unnamed: 0,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399
0,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
288,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
289,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
290,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
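Checking the log file for failed molecules, as suggested earlier, can be partly automated. The helper below is a minimal sketch: `find_failures`, the keywords, and the demo log content are all made up for illustration, and real DrugEx log lines may look different.

```python
import os
import tempfile

def find_failures(log_path, keywords=("Failed", "exception")):
    """Return log lines that contain any of the given keywords (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    hits = []
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            if any(k in line.lower() for k in lowered):
                hits.append(line.rstrip("\n"))
    return hits

# made-up log content written to a temporary file
demo_log = os.path.join(tempfile.mkdtemp(), "demo.log")
with open(demo_log, "w", encoding="utf-8") as handle:
    handle.write("INFO started encoding\n")
    handle.write("WARNING Failed to parse item 17\n")
    handle.write("INFO finished encoding\n")

failures = find_failures(demo_log)
print(len(failures))
```

For the tutorial data, the same function could be pointed at the log file configured in the Logging section, with the keywords adjusted to whatever the actual messages contain.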


You can now also check that the appropriate files were indeed created in the `data/sets/graph/` folder. We can easily associate `GraphFragDataSet` instances with these files if we need them again (note that the `rewrite` flag is off):

In [10]:
test_from_file = GraphFragDataSet('data/sets/graph/ligand_test.tsv')
assert os.path.exists(test_from_file.outpath)

# we can check the output by converting the data set to a pandas DataFrame again
test_from_file.getData()

Unnamed: 0,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,...,C390,C391,C392,C393,C394,C395,C396,C397,C398,C399
0,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
287,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
288,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
289,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
290,1,0,0,0,1,5,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


The `GraphFragDataSet` instance also automatically saves the vocabulary used by the encoder for its creation: 

In [11]:
train.getVocPath()

'data/sets/graph//ligand_train.tsv.vocab'

In [12]:
test.getVocPath()

'data/sets/graph//ligand_test.tsv.vocab'

The model requires a vocabulary during training, and we can easily recreate it from these files (see [finetuning](finetuning.ipynb)).

### Scaffold encoding

In some cases, you might have an idea of a scaffold or fragments that your new molecules should contain. In this case, it is useful to do the [reinforcement learning](rl_optimization.ipynb) and the [molecule generation](generation.ipynb) with preselected fragments (a single fragment or a combination of multiple fragments).

Here, we show how to preprocess these fragments using an example of two inputs: a single pyrazine and a combination of two pyrazines.

In [3]:
from utils import smilesToGrid

frags = ['c1cnccn1', 'c1cnccn1.c1cnccn1' ]  

smilesToGrid(frags)
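In SMILES notation, a dot separates disconnected components, which is how the second entry of `frags` expresses a combination of two pyrazines. A quick way to see how many fragments each entry contains is to split on the dot (a plain-Python sketch that treats the entries as strings and does no chemical validation):

```python
frags = ['c1cnccn1', 'c1cnccn1.c1cnccn1']

# a dot in SMILES separates disconnected components
components = [entry.split('.') for entry in frags]
for entry, parts in zip(frags, components):
    print(entry, '->', len(parts), 'fragment(s)')
```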

We use the same encoder as before to encode the fragment-molecule pairs, with two small modifications:

1. Instead of a `Fragmenter`, we create dummy molecules from the fragments with `dummyMolsFromFragments`.
2. We set `pairs_splitter` to `None`, and `n_proc` and `chunk_size` to 1.

In [4]:
from drugex.molecules.converters.dummy_molecules import dummyMolsFromFragments
from drugex.data.fragments import FragmentCorpusEncoder, GraphFragmentEncoder
from drugex.data.corpus.vocabulary import VocGraph

fragmenter = dummyMolsFromFragments()
splitter = None

encoder = FragmentCorpusEncoder(
    fragmenter=fragmenter, 
    encoder=GraphFragmentEncoder(
        VocGraph(n_frags=4) 
    ),
    pairs_splitter=splitter, 
    n_proc=1,
    chunk_size=1
)

In [7]:
import os
from drugex.data.datasets import GraphFragDataSet

# create a dedicated directory for our graph data set files
graph_input_folder = "data/sets/graph/"
if not os.path.exists(graph_input_folder):
    os.makedirs(graph_input_folder)
    
dataset = GraphFragDataSet(f"{graph_input_folder}/scaffold_graph.tsv", rewrite=True)

Initialized empty dataset. The data set file does not exist (yet): data/sets/graph//scaffold_graph.tsv. You can add data by calling this instance with the appropriate parameters.


In [8]:
encoder.apply(list(frags), encodingCollectors=[dataset])


Creating fragment-molecule pairs (batch processing): 100%|██████████| 2/2 [00:00<00:00, 114.52it/s]
Encoding fragment-molecule pairs. (batch processing):   0%|          | 0/2 [00:00<?, ?it/s]The following exception occured while encoding fragment c1cnccn1.c1cnccn1 for molecule c1cn(-c2cnccn2)ccn1: 'NoneType' object has no attribute 'GetSubstructMatches'
Failed to convert item None to the new representation in <drugex.data.fragments.FragmentPairsEncodedSupplier object at 0x7f1220358eb0>
	 Cause: FragmentEncodingException('Failed to encode fragment c1cnccn1.c1cnccn1 from molecule: c1cn(-c2cnccn2)ccn1')
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 2/2 [00:00<00:00, 45.37it/s]


## Preparing Data for the Smiles-Based Transformer

TODO

### Scaffold encoding

In some cases, you might have an idea of a scaffold or fragments that your new molecules should contain. In this case, it is useful to do the [reinforcement learning](rl_optimization.ipynb) and the [molecule generation](generation.ipynb) with preselected fragments (a single fragment or a combination of multiple fragments).

Here, we show how to preprocess these fragments using an example of two inputs: a single pyrazine and a combination of two pyrazines.

In [None]:
from utils import smilesToGrid

frags = ['c1cnccn1', 'c1cnccn1.c1cnccn1' ]  

smilesToGrid(frags)

We use the same kind of encoder as for the graph-based model to encode the fragment-molecule pairs, with a few small modifications:

1. Instead of a `Fragmenter`, we create dummy molecules from the fragments with `dummyMolsFromFragments`.
2. We set `pairs_splitter` to `None`, `min_len` in `VocSmiles` to 2, and `n_proc` and `chunk_size` to 1.

In [9]:
from drugex.molecules.converters.dummy_molecules import dummyMolsFromFragments
from drugex.data.fragments import FragmentCorpusEncoder, SequenceFragmentEncoder
from drugex.data.corpus.vocabulary import VocSmiles

fragmenter = dummyMolsFromFragments()
splitter = None

encoder = FragmentCorpusEncoder(
    fragmenter=fragmenter, 
    encoder=SequenceFragmentEncoder(
        VocSmiles(min_len=2) 
    ),
    pairs_splitter=splitter, 
    n_proc=1,
    chunk_size=1
)

In [12]:
import os
from drugex.data.datasets import SmilesFragDataSet

# create a dedicated directory for our SMILES data set files
smiles_input_folder = "data/sets/smiles/"
if not os.path.exists(smiles_input_folder):
    os.makedirs(smiles_input_folder)
    
dataset = SmilesFragDataSet(f"{smiles_input_folder}/scaffold_smi.tsv", rewrite=True)

Initialized empty dataset. The data set file does not exist (yet): data/sets/smiles//scaffold_smi.tsv. You can add data by calling this instance with the appropriate parameters.


In [13]:
encoder.apply(list(frags), encodingCollectors=[dataset])

Creating fragment-molecule pairs (batch processing): 100%|██████████| 2/2 [00:00<00:00, 136.65it/s]
Encoding fragment-molecule pairs. (batch processing): 100%|██████████| 2/2 [00:00<00:00, 125.07it/s]
