
# `Lib-INVENT Dataset`: Reaction Based Slicing Demo
The purpose of this notebook is to illustrate how to generate a configuration for slicing compounds in a dataset based on reactions.

Inputs:
- A `dataset` with compound SMILES (typically already filtered to exclude non-drug-like compounds). For the training of Lib-INVENT, the filtered ChEMBL 27 SMILES were used. This dataset can be found in the Lib-INVENT project repository.
- A `reaction.smirks` file containing the reactions that will be used to slice the compounds.

The `output SMILES file` will contain three columns: `scaffolds`, `decorations` and the `original compounds`.

In [32]:
# load dependencies
import os
import re
import json
import tempfile

# --------- change these path variables as required
lib_invent_dataset_project_dir = "<path/to/your/project/>"
input_dataset_path = "</path/to/input_smiles_file>"

reactions_file_path = os.path.join(lib_invent_dataset_project_dir, "tutorial/data/reaction.smirks")
output_dir = "</path/to/output_directory/>"

# --------- do not change
# get the notebook's root path
try: ipynb_path
except NameError: ipynb_path = os.getcwd()

# if required, generate a folder to store the results
# (exist_ok=True makes this a no-op when the folder already exists)
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`Lib-INVENT dataset` has an entry point that loads a specified `JSON` file on startup. As in the first tutorial, the remainder of this notebook demonstrates how to assemble the dictionary, which can then be converted to the desired `JSON` file and saved on a local machine.

In contrast to the `stats_extraction` running mode, `reaction_based_slicing` is a relatively straightforward operation that requires fewer arguments. As mentioned above, the essential inputs are a dataset of compounds to be sliced, provided in a text file with one SMILES per line, and a file of reaction SMIRKS, which is provided along with these tutorials. Finally, the technicalities of the slicing are specified: the maximum number of cuts per compound and the conditions on the form of the resulting fragments.
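
The one-SMILES-per-line input format can be checked before a run. The following is a minimal sketch and not part of Lib-INVENT Dataset: `find_malformed_lines` is a hypothetical helper that flags lines which are empty, padded with whitespace, or carry extra whitespace-separated tokens. It does not validate the chemistry of the SMILES themselves.

```python
from typing import List, Tuple


def find_malformed_lines(path: str) -> List[Tuple[int, str]]:
    """Return (line_number, content) for lines that are not a single token.

    Hypothetical helper for illustration; assumes the input format described
    above (one SMILES per line, no header, no extra columns).
    """
    malformed = []
    with open(path, "r") as handle:
        for number, line in enumerate(handle, start=1):
            content = line.rstrip("\n")
            # A valid line holds exactly one token and no surrounding whitespace.
            if len(content.split()) != 1 or content != content.strip():
                malformed.append((number, content))
    return malformed
```

Calling `find_malformed_lines(input_dataset_path)` should return an empty list for a file in the expected format.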

### 1. Condition Configuration
A `condition configuration` will be assembled first. This configuration will be used as an input in the reaction based slicing configuration. It specifies particular conditions the scaffolds and decorations have to satisfy in order to be included in the sliced dataset. 

No conditions on the decorations were imposed when preprocessing data for the Lib-INVENT publication. The scaffolds are required to contain at least one ring. This assumption is motivated by the fact that in typical library design scenarios, the base scaffold is an aromatic compound; simple scaffolds without rings do not carry significant chemical properties and are therefore not typically useful for lead optimisation applications as described in the Lib-INVENT publication. 
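
To make the meaning of a condition entry concrete, here is a minimal sketch of how a list such as `[{"name": "ring_count", "min": 1}]` could be evaluated. This is an illustration only, not the Lib-INVENT Dataset implementation: `satisfies_conditions` is a hypothetical helper, the property values are supplied by hand rather than computed from a molecule, and only the `min` key shown in this tutorial is handled.

```python
from typing import Dict, List


def satisfies_conditions(conditions: List[dict], properties: Dict[str, float]) -> bool:
    """Check precomputed property values against a list of conditions.

    Hypothetical helper: each condition names a property and a lower bound
    ("min"), mirroring the entries used in this tutorial.
    """
    for condition in conditions:
        value = properties[condition["name"]]
        # Reject as soon as one property falls below its lower bound.
        if "min" in condition and value < condition["min"]:
            return False
    # An empty condition list imposes no restriction.
    return True


scaffold_conditions = [{"name": "ring_count", "min": 1}]

print(satisfies_conditions(scaffold_conditions, {"ring_count": 2}))  # True
print(satisfies_conditions(scaffold_conditions, {"ring_count": 0}))  # False
print(satisfies_conditions([], {"ring_count": 0}))                   # True
```

Under this reading, an empty `decoration` list accepts every decoration, which matches the statement that no conditions were imposed on decorations.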

In [10]:
condition_configuration = {
    "scaffold": [{
        "name": "ring_count",
        "min": 1
    }],
    "decoration": []
}

In [11]:
# write the condition configuration file to disk
condition_configuration_JSON_path = os.path.join(output_dir, "filter_conditions.json")
with open(condition_configuration_JSON_path, 'w') as f:
    json.dump(condition_configuration, f, indent=4, sort_keys=False)

### 2. The Reaction Based Slicing Configuration

The JSON configuration passed to the Lib-INVENT Dataset entry point (`input.py`) contains two blocks: `run_type` and `parameters`. The necessary arguments are passed in these blocks as follows:

In [12]:
# initialize the dictionary
configuration = {
    "run_type": "reaction_based_slicing"
}

In [13]:
configuration["parameters"] = {
    "input_file": input_dataset_path,        # path variable defined in the first cell.
    "output_path": os.path.join(output_dir, "sliced"),
    "output_smiles_file": os.path.join(output_dir, "reaction_smiles"),
    "conditions_file": condition_configuration_JSON_path,
    "reactions_file": reactions_file_path,
    "max_cuts": 4,                           # the maximum number of cuts to perform on each molecule.
    "number_of_partitions": 1000,            # relevant for PySpark. Do not change.
    "validate_randomization": True           # check that randomised molecules correspond to the originals.
}

In [14]:
# write the slicing configuration file to disk
configuration_JSON_path = os.path.join(output_dir, "reaction_slicing_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=False)
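
Before launching the run, it can help to reload the written file and confirm that the expected keys are present. A minimal sketch follows; `missing_parameters` is a hypothetical helper, and the list of expected keys is simply the set used in this tutorial, not a schema published by Lib-INVENT Dataset.

```python
import json
from typing import List

# Keys used in this tutorial's "parameters" block (assumed to be required).
EXPECTED_PARAMETERS = [
    "input_file", "output_path", "output_smiles_file", "conditions_file",
    "reactions_file", "max_cuts", "number_of_partitions",
    "validate_randomization",
]


def missing_parameters(config_path: str) -> List[str]:
    """Return the expected keys that are absent from a configuration file."""
    with open(config_path, "r") as handle:
        config = json.load(handle)
    # This tutorial only covers the reaction based slicing run type.
    if config.get("run_type") != "reaction_based_slicing":
        raise ValueError("unexpected run_type: {!r}".format(config.get("run_type")))
    parameters = config.get("parameters", {})
    return [key for key in EXPECTED_PARAMETERS if key not in parameters]
```

For the configuration assembled above, `missing_parameters(configuration_JSON_path)` should return an empty list.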

## Run
Execute in a Jupyter notebook

In [30]:
%%capture captured_err_stream --no-stderr 

# execute
%cd {lib_invent_dataset_project_dir}
!spark-submit --driver-memory=80g --conf spark.driver.maxResultSize=32g input.py {configuration_JSON_path}

In [31]:
# print the output to a file, just to have it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

Execute from the command line
```
# activate environment
conda activate lib_invent_data

# go to the root folder of input.py 
cd </path/to/Lib-INVENT-dataset/directory>

# execute in command line
spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py </path/to/configuration.json>
```
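
The same command can also be assembled from Python, for example when scripting several runs. The sketch below only builds the argument list; `build_spark_submit_command` is a hypothetical helper, and its default memory values are the ones from the command-line example above, not recommendations.

```python
import shlex
from typing import List


def build_spark_submit_command(config_path: str,
                               driver_memory: str = "32g",
                               max_result_size: str = "16g") -> List[str]:
    """Build the spark-submit argument list used in this tutorial."""
    return [
        "spark-submit",
        "--driver-memory={}".format(driver_memory),
        "--conf", "spark.driver.maxResultSize={}".format(max_result_size),
        "input.py",
        config_path,
    ]


command = build_spark_submit_command("reaction_slicing_config.json")
# Show the command as a single shell-quoted string for inspection.
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, cwd=lib_invent_dataset_project_dir, check=True)`, so that `input.py` is resolved relative to the project directory.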