> **How to run this notebook (command-line)?**
1. Install the `lib_invent_data` environment:
`conda env create -f environment.yml`
2. Activate the environment:
`conda activate lib_invent_data`
3. Execute `jupyter`:
`jupyter notebook`
4. Copy the link printed in the terminal into a browser


# `Lib-INVENT Datasets`: Data preparation demo
This demo shows how to compute the distribution of chemical properties and, based on those properties, filter data from ChEMBL or other sources so that only drug-like molecules remain.

To proceed, please update the first code block below (the path variables) so that it reflects your system's installation, then execute it.

### Motivation
> **There are a number of reasons to pre-process the data used for training a generative model.**
1. Removal of invalid or duplicated entries.
2. Removal of unusual compounds that are clearly not drug-like (too big, containing reactive groups, etc.). There is normally no point in training a model on such examples, since that bias will be reflected by the generative model.
3. Removal of rare tokens. Rare compounds can be seen as outliers, and they may in turn contain rare tokens. Excluding them frees slots in the vocabulary and makes it smaller. A smaller vocabulary means faster training and lower memory use, so removing compounds that introduce rare tokens speeds up the generative model.
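
The rare-token removal in point 3 can be sketched as follows. This is a minimal illustration, not the Lib-INVENT implementation: the regular-expression tokenizer is a simplified assumption (bracket atoms, `Cl`, `Br`, `%NN` ring closures, otherwise single characters), and the function names, the toy dataset and the `min_count` threshold are all hypothetical.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer (an assumption, not the Lib-INVENT vocabulary):
# bracket atoms, two-letter halogens, %NN ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)

def rare_tokens(smiles_list, min_count):
    """Return the tokens that occur fewer than `min_count` times in the dataset."""
    counts = Counter(tok for smi in smiles_list for tok in tokenize(smi))
    return {tok for tok, n in counts.items() if n < min_count}

def drop_rare(smiles_list, min_count):
    """Keep only the molecules that contain no rare token."""
    rare = rare_tokens(smiles_list, min_count)
    return [smi for smi in smiles_list if not rare.intersection(tokenize(smi))]

# Toy dataset, made up for illustration.
dataset = ["CCO", "CCN", "c1ccccc1O", "CC[Se]C", "ClCCBr", "CCCl", "BrCC"]
print(sorted(rare_tokens(dataset, min_count=2)))
print(drop_rare(dataset, min_count=2))
```

The sketch filters in a single pass. Dropping molecules can make further tokens rare, so a real pipeline might repeat the step until the vocabulary is stable.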

### Introduction
This configuration can be used to prepare data so that it only includes drug-like molecules, or to calculate statistics for sliced datasets. This demo mainly focuses on preparing and filtering data.
> **The rules used for filtering data:**
- 2 <= num heavy atoms <= 70   
- allowed elements: [6, 7, 8, 9, 16, 17, 35]  
- remove salts, neutralize charges, sanitize
- remove side chains with 5 or more carbon atoms
- 0 < num_rings <= 10
- num_atoms >= 6
- mol_weights <= 760
- num_aromatic_rings <= 8
- heteroatom_ratio > 0.5

In [168]:
# load dependencies
import os
import re
import json
import tempfile

# --------- change these path variables as required
data_path = "<path/to/chembl/smiles>" # for the training of Lib-INVENT, we used ChEMBL 27 and converted to SMILES using RDKit.
output_dir = "<path/to/your/output_directory/>"

# --------- do not change
# get the notebook's root path
try: ipynb_path
except NameError: ipynb_path = os.getcwd()  

# if required, generate a folder to store the results
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`Lib-INVENT datasets` has an entry point that loads a specified `JSON` file on startup. `JSON` is a simple data format that makes it possible to specify a fairly large number of parameters quickly, in a cascading (nested) fashion. The parameters are structured into *blocks*, which can in turn contain further blocks or simple values such as *True* or *False*, strings and numbers. In this tutorial, we will go through the different blocks step by step, explaining their purpose and the possible values of the parameters. Note that while we will write out the configuration as a `JSON` file in the end, in `python` we handle the same information as a simple `dict`.
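
As a small illustration of how nested blocks in a `dict` map to `JSON`, the sketch below serialises a made-up configuration and reads it back. The keys are only examples and are not meant as a complete or authoritative configuration. Note how Python's `True` and `None` become `true` and `null` in the `JSON` text.

```python
import json

# A made-up nested configuration: one block ("parameters") containing
# simple values and a further block ("filter").
example = {
    "run_type": "stats_extraction",
    "parameters": {
        "plotting": True,
        "reactions": None,
        "filter": {"mol_wts": ["max", 760]},
    },
}

# Serialise the dict to JSON text and inspect it.
text = json.dumps(example, indent=4)
print(text)

# Parsing the text again gives back an equal dict.
restored = json.loads(text)
print(restored == example)
```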

In [169]:
# initialize the dictionary
configuration = {
    "run_type": "stats_extraction"                                          
}

`standardisation_config` contains rules for standardising the molecules using functions from reinvent chemistry. The standardisation includes:
- 2 <= num heavy atoms <= 70
- allowed elements: [6, 7, 8, 9, 16, 17, 35]
- remove salts, neutralise charges, sanitize
- remove side chains with 5 or more carbon atoms

`filter` contains the filtering rules to be applied to the data.

In [174]:
configuration["parameters"] = {
    "data_path": data_path,                 # location of the input data
    "output_path": output_dir,              # location to store the results
    "properties": ['mol_wts',               # properties of interest, available properties:
                   'num_rings',              # 'mol_wts', 'num_rings', 'num_aromatic_rings', 'num_atoms',
                   'num_aromatic_rings',     # 'hbond_donors', 'hbond_acceptors'
                   'num_atoms',
                   'hbond_donors',
                   'hbond_acceptors'],
    "token_distribution": True,             # calculate the counts of individual tokens
    "columns": ["original"],                # other options for sliced datasets: "scaffolds", "decorations"
    "mode": "orig_data",                    # other option: "sliced_data"
    "plotting": True,                       # plot the distribution of chemical properties
    "standardisation_config": {"neutralise_charges": {"reactions": None}},  # standardise molecules
    "save_standardised": True,              # save the standardised dataframe
    "filter": {                             # conditions on the properties used for filtering
        "num_rings": ["max", 10],            # "min" means ">=", "max" means "<="
        "num_rings": ["min", 1],             # NOTE: repeated key; a Python dict keeps only this last
                                             # "num_rings" entry, so the "max" rule above is dropped
        "num_atoms": ["min",6],
        "mol_wts": ["max", 760],
        "num_aromatic_rings": ["max", 8]
    },                                    
                                                          
    "save_cut_precomputed": True,          # save the filtered dataframe
    "token_atom_ratio": True,              # if the ratio is too high then the molecule is too complicated for the 
                                             # model to learn
    "count_decorations": False             
}
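
The `"min"`/`"max"` semantics of the `filter` block can be sketched in plain Python as below. This only illustrates the comparison logic (`"min"` as `>=`, `"max"` as `<=`); it is not the code that `Lib-INVENT datasets` runs. The `passes` helper, the tuple-based rule list and the property values are hypothetical. The last lines also show why a repeated key in a `dict` literal, such as `"num_rings"` above, keeps only its final value. How the tool itself expects two bounds on one property is not stated here.

```python
import operator

# "min" keeps values >= bound, "max" keeps values <= bound.
COMPARATORS = {"min": operator.ge, "max": operator.le}

def passes(properties, rules):
    """Return True if every (name, kind, bound) rule holds for one molecule."""
    return all(COMPARATORS[kind](properties[name], bound) for name, kind, bound in rules)

# Hypothetical rule list; tuples allow two bounds on the same property.
rules = [
    ("num_rings", "min", 1),
    ("num_rings", "max", 10),
    ("num_atoms", "min", 6),
    ("mol_wts", "max", 760),
    ("num_aromatic_rings", "max", 8),
]

# Made-up property values for two molecules.
small_aromatic = {"num_rings": 1, "num_atoms": 13, "mol_wts": 180.2, "num_aromatic_rings": 1}
acyclic = {"num_rings": 0, "num_atoms": 9, "mol_wts": 130.2, "num_aromatic_rings": 0}

print(passes(small_aromatic, rules))  # all rules hold
print(passes(acyclic, rules))         # fails the num_rings >= 1 rule

# A dict literal with a repeated key silently keeps only the last value.
duplicated = {"num_rings": ["max", 10], "num_rings": ["min", 1]}
print(duplicated)
```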

In [175]:
# write the configuration file to disk
configuration_JSON_path = os.path.join(output_dir, "data_preparation_example.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=False)

## Run
Execute in a Jupyter notebook

In [166]:
%%capture captured_err_stream --no-stderr 

# execute
%cd </path/to/Lib-INVENT-datasets/project/directory/>
!spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py {configuration_JSON_path}

In [167]:
# write the captured output to a file, to keep it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

Execute on the command line
```
# activate environment
conda activate lib_invent_data

# go to the root folder of input.py 
cd </path/to/Lib-INVENT-datasets/directory>

# execute in command line
spark-submit --driver-memory=32g --conf spark.driver.maxResultSize=16g input.py </path/to/configuration.json>
```
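
If the run should be started from a Python script instead of a shell, the same command can be assembled as an argument list. This is a sketch under assumptions: `build_spark_command` is a hypothetical helper, not part of `Lib-INVENT datasets`, and the memory settings simply mirror the values used above.

```python
import shlex

def build_spark_command(config_path, driver_memory="32g", max_result_size="16g"):
    """Assemble the spark-submit call shown above as an argument list."""
    return [
        "spark-submit",
        f"--driver-memory={driver_memory}",
        "--conf", f"spark.driver.maxResultSize={max_result_size}",
        "input.py",
        config_path,
    ]

command = build_spark_command("data_preparation_example.json")
print(shlex.join(command))

# To actually launch the job (requires Spark and the project directory):
# import subprocess
# subprocess.run(command, cwd="</path/to/Lib-INVENT-datasets/directory>", check=True)
```

Passing the arguments as a list avoids shell quoting problems when the configuration path contains spaces.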