> **How to run this notebook (command-line)?**
1. Install the `ReinventCommunity` environment:
`conda env create -f environment.yml`
2. Activate the environment:
`conda activate ReinventCommunity`
3. Execute `jupyter`:
`jupyter notebook`
4. Copy the link printed in the terminal into a browser


# `REINVENT 3.2`: create model demo
The *create model* running mode is used when you plan to train a new model from scratch, and it is the first step of that process. In this step we only parse the training data and extract all tokens that occur in the pool of SMILES. This information is stored in the model's vocabulary. The model itself has not learned anything yet; the actual training happens in the second step, using the *transfer learning* mode of `REINVENT 3.2`.
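
To make the idea of a vocabulary more concrete, here is a minimal sketch of how tokens could be collected from a handful of SMILES strings. The regular expression and the helper names `tokenize` and `build_vocabulary` are assumptions made for illustration only; they are not `REINVENT`'s actual tokenizer, which may split strings differently and add special start/end tokens.

```python
import re

# Hypothetical token pattern (an assumption, not REINVENT's own):
# bracket atoms, two-letter halogens and %NN ring closures are kept whole,
# everything else is split into single characters.
TOKEN_PATTERN = re.compile(r"(\[[^\]]+\]|Br|Cl|%\d{2}|.)")

def tokenize(smiles):
    """Split one SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)

def build_vocabulary(smiles_list):
    """Collect the sorted set of unique tokens found in all SMILES."""
    tokens = set()
    for smi in smiles_list:
        tokens.update(tokenize(smi))
    return sorted(tokens)

example_smiles = ["CCO", "c1ccccc1Cl", "C[NH3+]"]
vocabulary = build_vocabulary(example_smiles)
print(vocabulary)  # ['1', 'C', 'Cl', 'O', '[NH3+]', 'c']
```

Note that the vocabulary only records which tokens exist; no model weights are involved at this point.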

To proceed, please update the following code block such that it reflects your system's installation and execute it.

In [1]:
# load dependencies
import os
import re
import json
import tempfile

# --------- change these path variables as required
reinvent_dir = os.path.expanduser("~/Desktop/Reinvent")
reinvent_env = os.path.expanduser("~/miniconda3/envs/reinvent.v3.2")
output_dir = os.path.expanduser("~/Desktop/REINVENT_create_model_demo")

# --------- do not change
# get the notebook's root path
try: ipynb_path
except NameError: ipynb_path = os.getcwd()

# if required, generate a folder to store the results
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`REINVENT` has an entry point that loads a specified `JSON` file on startup. `JSON` is a low-level data format that makes it quick to specify a fairly large number of parameters in a nested fashion. The parameters are structured into *blocks*, which can in turn contain other blocks or simple values such as *True* or *False*, strings and numbers. In this tutorial, we go through the different blocks step by step, explaining their purpose and the possible values of each parameter. Note that while we write out the configuration as a `JSON` file in the end, in `python` we handle the same information as a simple `dict`.
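
As a small illustration of the correspondence between a nested `dict` and `JSON`, the sketch below serializes a toy dictionary and reads it back. The keys `outer_block`, `flag`, `name` and `count` are made up for this example and are not `REINVENT` parameters.

```python
import json

# A toy nested dictionary: one block containing simple values.
example = {"outer_block": {"flag": True, "name": "demo", "count": 3}}

# Serialize to JSON text; Python's True becomes JSON's true.
text = json.dumps(example, indent=4, sort_keys=True)
print(text)

# Parsing the text again yields an equal dictionary.
restored = json.loads(text)
print(restored == example)  # True
```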

In [2]:
# initialize the dictionary
configuration = {
    "version": 3,                          # we are going to use REINVENT's newest release
    "run_type": "create_model"             # other run types: "scoring", "validation",
                                           #                  "transfer_learning",
                                           #                  "reinforcement_learning" and
                                           #                  "sampling"
}

In [3]:
# add block to specify whether to run locally or not and
# where to store the results and logging
configuration["logging"] = {
    "sender": "http://127.0.0.1",          # only relevant if "recipient" is set to "remote"
    "recipient": "local",                  # either "local" (log locally) or "remote" (use a REST interface)
    "logging_path": os.path.join(output_dir, "progress.log"), # where the run's progress log is stored
    "result_folder": os.path.join(output_dir, "results"), # where the run's results are stored
    "job_name": "Create model demo",       # set an arbitrary job name for identification
    "job_id": "demo"                       # only relevant if "recipient" is set to "remote"
}

Next we specify the training data (parameter `input_smiles_path`) and the location where the new, untrained model is written (parameter `output_model_path`). For the purpose of this notebook, we use a purged dataset shipped with this repository.

In [4]:
# provide the input dataset that will be used for training;
# we use a purged dataset provided with this repo

input_SMILES_path = os.path.join(ipynb_path, "data/chembl.filtered.smi") 
output_model_path = os.path.join(output_dir, "empty_model.ckpt")

# add the "parameters" block
configuration["parameters"] = {
    "output_model_path": output_model_path,
    "input_smiles_path": input_SMILES_path,
    "num_layers": 3,
    "layer_size": 512,
    "cell_type": "lstm",              # use lstm cell. The options are "gru" and "lstm"
    "embedding_layer_size": 256,      
    "dropout": 0.,
    "max_sequence_length": 256,
    "layer_normalization": False,
    "standardize": False              # standardization is set to false for efficiency
                                      # we assume the data is being standardized during purging
                                      # for details check the Data Preparation notebook
}

In [5]:
# write the configuration file to disk
configuration_JSON_path = os.path.join(output_dir, "create_model_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=True)
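
Before launching a run, it can be useful to reload the written file and confirm that the expected top-level blocks are present. The sketch below shows one way to do this; the helper `missing_blocks` and the list of required keys are assumptions based only on the blocks built in this notebook, not an official `REINVENT` schema. To keep the example self-contained, it checks a deliberately incomplete demo file in a temporary directory.

```python
import json
import os
import tempfile

# Top-level keys used in this notebook (assumed, not an official schema).
REQUIRED_TOP_LEVEL = ("version", "run_type", "logging", "parameters")

def missing_blocks(config_path):
    """Return the required top-level keys absent from a JSON config file."""
    with open(config_path) as f:
        loaded = json.load(f)
    return [key for key in REQUIRED_TOP_LEVEL if key not in loaded]

# Demo with an incomplete configuration written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_config.json")
    with open(demo_path, "w") as f:
        json.dump({"version": 3, "run_type": "create_model"}, f)
    demo_missing = missing_blocks(demo_path)

print(demo_missing)  # ['logging', 'parameters']
```

In the notebook itself, `missing_blocks(configuration_JSON_path)` should return an empty list if all four blocks were added.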

## Run `REINVENT`
Now it is time to execute `REINVENT` locally. The execution time will vary depending on the size of your dataset.
The resulting file is `empty_model.ckpt`, which can be used as input for the *transfer learning* mode, where the same SMILES dataset should be used to train for multiple epochs.

The command-line execution looks like this:
```
# activate environment
conda activate reinvent.v3.2

# execute REINVENT
python <your_path>/input.py <config>.json
```

In [6]:
%%capture captured_err_stream --no-stderr

# execute REINVENT from the command-line
!{reinvent_env}/bin/python {reinvent_dir}/input.py {configuration_JSON_path}
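
Outside a notebook, the same call can be made with Python's `subprocess` module, which also gives access to the return code. The following is a minimal sketch under that assumption; the helper `run_script` is hypothetical, and a tiny stand-in script is used instead of `input.py` so that the example runs without a `REINVENT` installation.

```python
import os
import subprocess
import sys
import tempfile

def run_script(python_exe, script_path, config_path):
    """Run `python_exe script_path config_path` and capture its output."""
    return subprocess.run(
        [python_exe, script_path, config_path],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output as str
    )

# Stand-in for input.py: it only echoes the configuration path it receives.
with tempfile.TemporaryDirectory() as tmp:
    stand_in = os.path.join(tmp, "stand_in.py")
    with open(stand_in, "w") as f:
        f.write("import sys\nprint('config:', sys.argv[1])\n")
    result = run_script(sys.executable, stand_in, "demo.json")

print(result.returncode, result.stdout.strip())
```

For a real run, the arguments would be the interpreter of the `REINVENT` environment, the path to `input.py` and `configuration_JSON_path`; a non-zero `returncode` indicates that the run failed.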