> **How to run this notebook (command-line)?**
1. Install the `lib_invent_data` environment:
`conda env create -f environment.yml`
2. Activate the environment:
`conda activate lib_invent_data`
3. Execute `jupyter`:
`jupyter notebook`
4. Copy the link to a browser


# `Lib-INVENT Datasets`: Reaction Based Slicing Demo
The purpose of this notebook is to illustrate how to generate a configuration for slicing compounds in a dataset based on reactions.

Inputs:
- A `dataset` of compound SMILES. For the training of Lib-INVENT, the filtered ChEMBL 27 SMILES were used. This dataset can be found in the Lib-INVENT project repository.
- `reaction.smirks` file which contains reactions that will be used to slice the compounds.

The `output SMILES file` will contain three columns: `scaffolds`, `decorations` and the `original compounds`.

In [32]:
# load dependencies
import os
import re
import json
import tempfile

# --------- change these path variables as required
lib_invent_dataset_project_dir = "<path/to/your/project/>"
input_dataset_path = "</path/to/input_smiles_file>"  # unzip first
reactions_file_path = os.path.join(lib_invent_dataset_project_dir, "tutorial/data/reaction.smirks")
output_dir = "</path/to/output_directory/>"

# --------- do not change
# get the notebook's root path
try:
    ipynb_path
except NameError:
    ipynb_path = os.getcwd()

# if required, generate a folder to store the results
# (makedirs also creates missing parent folders and ignores an existing folder)
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`Lib-INVENT datasets` has an entry point that loads a specified `JSON` file on startup. `JSON` is a low-level data format that makes it possible to specify a fairly large number of parameters in a cascading fashion very quickly. The parameters are structured into *blocks*, which can in turn contain blocks or simple values such as *True* or *False*, strings and numbers. In this tutorial, we will go through the different blocks step by step, explaining their purpose and the potential values for given parameters. Note that while we will write out the configuration as a `JSON` file in the end, in `python` we handle the same information as a simple `dict`.
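
As a small illustration of the nested block structure described above, the following sketch builds a toy configuration as a `dict`, serializes it to `JSON` and parses it back. The block and parameter names used here (`example_run`, `example_block`, `flag`, `count`) are made up for illustration and are not actual `Lib-INVENT datasets` parameters.

```python
import json

# hypothetical configuration: all names are placeholders, not real parameters
toy_configuration = {
    "run_type": "example_run",
    "parameters": {
        "flag": True,                                     # simple value
        "count": 4,                                       # simple value
        "example_block": {"name": "example", "min": 1},   # nested block
    },
}

# serialize the dict to a JSON string ...
toy_json = json.dumps(toy_configuration, indent=4)

# ... and parse it back into an equivalent dict
round_tripped = json.loads(toy_json)
print(round_tripped == toy_configuration)
```

Note that the `python` value `True` is written as `true` in the `JSON` text and converted back on loading.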

A `condition configuration` will be assembled first. It holds the conditions used to filter the scaffolds and decorations obtained from slicing molecules, and it serves as an input to the reaction based slicing configuration.

In [10]:
condition_configuration={
    "scaffold": [{
        "name":"ring_count",
        "min": 1
    }],
    "decoration": [
    ]
} 

In [11]:
# write the configuration file to the disk
condition_configuration_JSON_path = os.path.join(output_dir, "filter_conditions.json")
with open(condition_configuration_JSON_path, 'w') as f:
    json.dump(condition_configuration, f, indent=4, sort_keys=False)
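
To make the role of such conditions more concrete, here is a minimal sketch of how a list of conditions could be applied to precomputed properties of a scaffold. This is not the `Lib-INVENT datasets` implementation. The helper `passes_conditions`, the treatment of `min` as an inclusive lower bound and the optional `max` key are all assumptions made for illustration.

```python
def passes_conditions(properties, conditions):
    """Return True if every condition's bounds hold for the given properties.

    `properties` maps a property name (e.g. "ring_count") to a number.
    Each condition is a dict with a "name" and optional "min"/"max" bounds
    ("max" is an assumption; only "min" appears in the configuration above).
    """
    for condition in conditions:
        value = properties[condition["name"]]
        if "min" in condition and value < condition["min"]:
            return False
        if "max" in condition and value > condition["max"]:
            return False
    return True


scaffold_conditions = [{"name": "ring_count", "min": 1}]

print(passes_conditions({"ring_count": 2}, scaffold_conditions))  # True
print(passes_conditions({"ring_count": 0}, scaffold_conditions))  # False
print(passes_conditions({"ring_count": 0}, []))                   # True: no conditions
```

Under this reading, the empty `decoration` list places no restriction on decorations.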

The `reaction based configuration` will be assembled by executing the three code blocks provided below.

In [12]:
# initialize the dictionary
configuration = {
    "run_type": "reaction_based_slicing"                                          
}

In [13]:
configuration["parameters"] = {
    "input_file": input_dataset_path,
    "output_path": os.path.join(output_dir, "sliced"),
    "output_smiles_file": os.path.join(output_dir, "reaction_smiles"),
    "conditions_file": condition_configuration_JSON_path,
    "reactions_file": reactions_file_path,
    "max_cuts": 4,                           # the maximum number of cuts to perform on each molecule
    "number_of_partitions": 1000,
    "validate_randomization": True
}

In [14]:
# write the configuration file to the disk
configuration_JSON_path = os.path.join(output_dir, "reaction_slicing_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=False)
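
Before starting a long Spark run, it can be useful to confirm that the files referenced in the written configuration actually exist. The sketch below reads a configuration file back and reports the missing ones. The helper name `missing_input_files` is hypothetical. The three keys it checks are the file parameters used in the configuration above.

```python
import json
import os


def missing_input_files(config_path,
                        keys=("input_file", "conditions_file", "reactions_file")):
    """Return the parameter keys whose file paths do not exist on disk."""
    with open(config_path) as f:
        loaded = json.load(f)
    parameters = loaded.get("parameters", {})
    # a key that is absent from the configuration is also reported as missing
    return [key for key in keys
            if not os.path.isfile(str(parameters.get(key, "")))]
```

In this notebook it could be called as `missing_input_files(configuration_JSON_path)`. An empty list means that all three files were found.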

## Run
Execute in the Jupyter notebook:

In [30]:
%%capture captured_err_stream --no-stderr 

# execute
%cd {lib_invent_dataset_project_dir}
!spark-submit --driver-memory=80g --conf spark.driver.maxResultSize=32g input.py {configuration_JSON_path}

In [31]:
# write the captured output to a file, just to have it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

Execute on the command line:
```
# activate environment
conda activate lib_invent_data

# go to the root folder of input.py 
cd </path/to/Lib-INVENT-datasets/directory>

# execute in command line
spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py </path/to/configuration.json>
```
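
The same command can also be launched from a plain `python` script. This is a minimal sketch, assuming `spark-submit` is on the `PATH` of the active environment. The helper names `build_spark_submit_command` and `run_slicing` are hypothetical, and the default memory values simply mirror the command-line example above.

```python
import subprocess


def build_spark_submit_command(config_path, driver_memory="32g", max_result_size="16g"):
    """Assemble the spark-submit call shown above as an argument list."""
    return [
        "spark-submit",
        f"--driver-memory={driver_memory}",
        "--conf", f"spark.driver.maxResultSize={max_result_size}",
        "input.py",
        config_path,
    ]


def run_slicing(project_dir, config_path, **memory_settings):
    """Run spark-submit from the project directory; raises if the run fails."""
    command = build_spark_submit_command(config_path, **memory_settings)
    return subprocess.run(command, cwd=project_dir, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces, and `check=True` turns a non-zero exit code into a `subprocess.CalledProcessError`.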