> **How to run this notebook (command-line)?**
1. Install the `lib-invent` environment. Navigate to the project directory and install:
`conda env create -f environment.yml`
2. Activate the environment:
`conda activate lib-invent`
3. Execute `jupyter`:
`jupyter notebook`
4. Copy the link printed in the terminal into a browser

# `Lib-INVENT`: Prior Training - teacher forcing

The purpose of this notebook is to demonstrate how to set up the training of a prior capable of producing valid, ChEMBL-like SMILES strings. Once the prior has learned the syntax of the SMILES language, it is used in reinforcement learning.

The datasets provided in the public repository include a training dataset and a validation dataset. For details of the preprocessing, see the Lib-INVENT Datasets project repository and tutorials. The expected input format is a tab-separated file in which each line contains a scaffold, its decorations and the complete compound.
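
As a rough illustration of this layout, the sketch below reads a tab-separated stream and checks that every line has three fields. The column order (scaffold, decorations, complete compound) is assumed from the description above, `read_training_rows` is a hypothetical helper that is not part of Lib-INVENT, and the sample line is a placeholder rather than a real dataset entry.

```python
import io


def read_training_rows(handle):
    """Yield (scaffold, decorations, compound) tuples from a tab-separated stream.

    Assumes three fields per line, in the order described in the text.
    """
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        yield tuple(fields)


# Placeholder content for illustration only
sample = io.StringIO("[*]c1ccccc1\t[*]C\tCc1ccccc1\n")
rows = list(read_training_rows(sample))
print(rows)
```

To check a real file, the same helper could be given an open file handle, e.g. `open(training_set_path)`.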

To train the prior model, the required input is an initialised empty model along with the training and validation data.

The state of the model is saved with a user-specified frequency during training, resulting in a sequence of models saved in the output directory.

In [1]:
# load dependencies
import os
import re
import json
import tempfile

# --------- change these path variables as required
project_directory = "</path/to/project/directory>"
output_dir = "<path/to/output/directory>"
empty_model_path = os.path.join(project_directory, "tutorial/models/empty_model/model.empty")
training_set_path = os.path.join(project_directory, "training_sets/chembl_train.smi")
validation_set_path = "</path/to/validation/data>"  # same format as the training data


# --------- do not change
# get the notebook's root path
try: ipynb_path
except NameError: ipynb_path = os.getcwd()

# if required, generate a folder to store the results
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`Lib-INVENT` has an entry point that loads a specified `JSON` file on startup. `JSON` is a lightweight data format that makes it possible to specify a large number of parameters quickly in a nested fashion. The parameters are structured into *blocks*, which can in turn contain other blocks or simple values such as *True* or *False*, strings and numbers.

This notebook demonstrates the process of assembling an input `JSON` to pretrain the prior model. It details the purpose of each of the necessary blocks and suggests potential values for the parameters. Note that while we will write out the configuration as a `JSON` file in the end, in Python we handle the same information as a simple `dict`.
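
The correspondence between a nested `dict` and its `JSON` form can be seen in a small, self-contained sketch. The key names used here (`outer_block`, `inner_flag`, `inner_number`) are made up for illustration and are not Lib-INVENT parameters.

```python
import json

# A block containing another block with simple values (hypothetical key names)
example_block = {
    "outer_block": {
        "inner_flag": False,
        "inner_number": 20,
    }
}

# Python's False is written as JSON's false; nesting is preserved
text = json.dumps(example_block, sort_keys=True)
print(text)  # {"outer_block": {"inner_flag": false, "inner_number": 20}}

# Reading the text back yields an equal dict
restored = json.loads(text)
print(restored == example_block)  # True
```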

At the highest level, the teacher forcing input configuration consists of two blocks. The string parameter `run_type` specifies the type of training or action to be performed by the model (e.g. reinforcement learning or compound sampling), while a second, larger block, `parameters`, specifies all the parameters necessary for performing this action. Depending on the running mode, this can include the specification of the scoring function, the logging setup or training details such as the number of epochs and the learning rate.

In [2]:
# initialize the dictionary
configuration = {
    "run_type": "transfer_learning"
}

Assembling the `parameters` block requires specifying the parameters to be used in the training. First, the paths to the appropriate files and directories are given:

In [None]:
parameters = {
    "model_path": empty_model_path,
    "training_set_path": training_set_path,
    "output_path": os.path.join(output_dir, "trained"),
    "validation_sets_path": validation_set_path,
    "logging_path": os.path.join(output_dir, "run.log")
}

Other necessary parameters involve the setup of the run itself and of the logging. The "do not change" parameters are needed for development purposes only and should not be altered during standard model usage.

In [None]:
parameters.update({
    "decoration_type": "single", # Do not change
    "with_weights": False,       # Do not change
    
    "sample_size": 10000,        # Relevant for logging
    "save_frequency": 1,         # Frequency of saving of trained models
    "epochs": 20,               
    "batch_size": 256,          
    "clip_gradients": 1.0,
    "collect_stats_frequency": 1
})

Finally, set up the learning rate. A learning-rate (LR) scheduler is used, decreasing the LR by the factor `gamma` until the minimum value `min` is reached. The frequency of the change is defined by the `step` parameter.

In [3]:
parameters.update({
    "learning_rate": {
        "start": 0.0001,
        "min": 0.000001,
        "gamma": 0.95,
        "step": 1
    }
})
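
To get a feel for these values, the sketch below computes the learning rate per epoch under the assumption that the scheduler multiplies the rate by `gamma` once every `step` epochs and never goes below `min`. This only illustrates the description above and is not Lib-INVENT's actual scheduler code; `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(epoch, start=0.0001, minimum=0.000001, gamma=0.95, step=1):
    """Return the learning rate at a given epoch (counted from 0).

    Assumed schedule: multiply by `gamma` every `step` epochs, floored at `minimum`.
    """
    decayed = start * gamma ** (epoch // step)
    return max(decayed, minimum)


for epoch in (0, 1, 10, 19):
    print(epoch, learning_rate_at(epoch))
```

Under this assumption, the floor is not reached within the 20 epochs configured above; with `gamma = 0.95` and `step = 1` it would take roughly 90 epochs for the rate to fall from `start` to `min`.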

This completes the assembly of the second block of the input `JSON`. It can now be added to the previously initialised configuration and written out as a `JSON` file to be passed to the model.

In [None]:
# Complete the configuration
configuration.update({
    "parameters": parameters
})

In [4]:
# Save as a JSON file
configuration_JSON_path = os.path.join(output_dir, "TL_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=True)
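
Before launching a long run, it can be worth reloading the file and checking its overall shape. The sketch below is self-contained and uses a stand-in configuration written to a temporary directory. `check_configuration` is a hypothetical helper, and the keys it checks are taken only from this notebook, not from a schema published by Lib-INVENT.

```python
import json
import os
import tempfile


def check_configuration(path):
    """Reload a configuration file and check the two top-level blocks."""
    with open(path) as f:
        loaded = json.load(f)
    missing = [key for key in ("run_type", "parameters") if key not in loaded]
    if missing:
        raise KeyError(f"missing top-level keys: {missing}")
    if not isinstance(loaded["parameters"], dict):
        raise TypeError("'parameters' should be a block (a JSON object)")
    return loaded


# Stand-in configuration, for illustration only
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "TL_config.json")
    with open(demo_path, "w") as f:
        json.dump(
            {"run_type": "transfer_learning", "parameters": {"epochs": 20}},
            f, indent=4, sort_keys=True,
        )
    loaded = check_configuration(demo_path)

print(loaded["run_type"])
```

In the notebook itself, `check_configuration(configuration_JSON_path)` would apply the same check to the file written above.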

## Run
Please note that this training might take days with the suggested dataset and number of epochs.

Execute in the Jupyter notebook:


In [7]:
%%capture captured_err_stream --no-stderr

# execute Lib-INVENT from the command line
!python {project_directory}/input.py {configuration_JSON_path}

In [8]:
# write the captured standard output to a file, just to have it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

Execute from the command line:
```
# activate environment
$ conda activate lib-invent

# execute in command line
$ python <project_directory>/input.py <configuration_JSON_path>
```

## Analyse the results
`tensorboard` is used for logging of all Lib-INVENT runs. The relevant logs are saved to the directory specified by the `logging_path` argument. 

To open and run tensorboard from the command line:

```
# go to the root folder of the output
$ cd <output_dir>

$ conda activate lib-invent

# start tensorboard (the path is relative to the output folder)
$ tensorboard --logdir "run.log"
```
