# Synthesis Planning
This notebook walks through the whole process of using the optimized model to produce the synthesis planning results.

## Environment Setup
First of all, we need to make sure that the notebook is running in the correct environment.

To do that, follow these steps:
 1. Create the project's environment
    From the project's root, run:
    `conda env create -f environment.yml`
    This creates a new, clean conda environment with the packages needed by the project.
 2. Activate the environment
    On Linux and Mac:
    `source activate synnet`
    On Windows:
    `conda activate synnet`
 3. Install the project's module
    Now that the environment is activated, we need to install the project as a module.
    From the project's root, run:
    `pip install -e .`
 4. Restart Jupyter from the new environment
    Now we can start Jupyter from the environment, so that it has all the dependencies we need. Simply run `jupyter notebook` and open this notebook.

To test the setup, run the following cell.

In [None]:
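import sys
from pathlib import Path

from helpers import paths

# Check that the correct conda env is being used.
# Path(...).name returns the last path component on every OS,
# whereas splitting on "\\" only works on Windows.
if Path(sys.prefix).name != "synnet":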
    print(
        "You are not using the correct conda environment, please follow the instructions above"
    )
else:
    try:
        import synnet

        print("The environment is set up correctly")
    except ImportError:
        print(
            "The module 'synnet' is not installed, please follow the instructions above"
        )

## Pre-Processing

Now that the conda environment is correctly set up, we can start the preliminary steps needed to produce the synthesis results.

First, let's import some packages and define some constants.
Make sure they are correct for your setup.

In [None]:
from helpers.loader import *
from helpers.preprocessor import *
from helpers.synthesis import synthesis

# Number of CPU cores to use for the computation. More cores means faster runs.
cpu_cores = 6
# Number of molecules to randomly pick from each dataset.
# Our results were produced with a sample of 10000 molecules.
num_samples = 10000
# Seed used to sample the datasets
seed = 42

### Load data
First, we need to load the trained model checkpoints to use.

We start with the ones provided by the paper's authors.

In [None]:
original_checkpoints = get_original_checkpoints()

We also load the checkpoints of the model we trained ourselves.

In [None]:
trained_checkpoints = get_trained_checkpoints()

Now, we need to retrieve the building blocks. We asked the company to provide them so that we can correctly reproduce their results.

To simplify the workflow, this also performs step 0 described in INSTRUCTIONS.md.

In [None]:
bblocks_raw = get_building_blocks()

We also need to download the molecules we want to test the model on.

We will use three datasets:
 - the reachable molecules
 - the ZINC dataset
 - the ChEMBL dataset

In [None]:
# Our reachable set is generated using a sample size of 10000 and a seed of 42
test_smiles = get_reachable_dataset(num_samples, seed)
zinc_smiles = get_zinc_dataset(num_samples, seed)
chembl_smiles = get_chembl_dataset(num_samples, seed)
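
The loader internals are not shown in this notebook. As a minimal sketch of why fixing `num_samples` and `seed` makes the sampled datasets reproducible, assume the loaders draw a uniform random sample without replacement. The helper `sample_smiles` below is hypothetical and is not part of `helpers.loader`.

```python
import random
from typing import List, Sequence


def sample_smiles(smiles: Sequence[str], num_samples: int, seed: int) -> List[str]:
    """Draw a reproducible random sample without replacement (hypothetical helper)."""
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    # Never ask for more items than the dataset contains.
    k = min(num_samples, len(smiles))
    return rng.sample(list(smiles), k)


# Toy dataset standing in for a real SMILES file.
toy_dataset = ["CCO", "CCN", "CCC", "c1ccccc1", "CC(=O)O", "CCCl"]
first_draw = sample_smiles(toy_dataset, num_samples=3, seed=42)
second_draw = sample_smiles(toy_dataset, num_samples=3, seed=42)
```

With the same seed and sample size, both draws contain the same molecules in the same order, which is what allows a reachable set generated elsewhere to be compared with one generated here.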

## Process Building Blocks

### Filter Building Blocks
First, we apply step 1 from INSTRUCTIONS.md.

We pre-process the building blocks to identify the applicable reactants for each reaction template. In other words, we filter out all building blocks that do not match any reaction template. There is no need to keep them, as they cannot act as reactants.

In a first step, we match all building blocks with each reaction template.
In a second step, we save all matched building blocks and a collection of `Reactions` with their available building blocks.

In [None]:
bblocks, rxn_collection = filter_bblocks(bblocks_raw)
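
The two steps above can be sketched as follows. This is a toy illustration only: `filter_by_templates` and the `matches` callback are hypothetical names, and the substring test stands in for the real substructure matching against reaction templates, which this notebook does not show.

```python
from typing import Callable, Dict, List, Tuple


def filter_by_templates(
    blocks: List[str],
    templates: Dict[str, str],
    matches: Callable[[str, str], bool],
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Keep blocks matching at least one template and record matches per template."""
    per_template: Dict[str, List[str]] = {name: [] for name in templates}
    kept: List[str] = []
    for block in blocks:
        hit = False
        # Step 1: match the block against every template.
        for name, pattern in templates.items():
            if matches(block, pattern):
                per_template[name].append(block)
                hit = True
        # Step 2: keep only blocks that matched at least one template.
        if hit:
            kept.append(block)
    return kept, per_template


# Toy stand-in for substructure matching: plain substring search on the SMILES.
def toy_match(block: str, pattern: str) -> bool:
    return pattern in block


toy_blocks = ["CC(=O)O", "CCN", "CCC"]
toy_templates = {"acid": "C(=O)O", "amine": "N"}
kept_blocks, matches_by_template = filter_by_templates(toy_blocks, toy_templates, toy_match)
```

Here `"CCC"` matches neither toy template and is dropped, while the other two blocks are kept and listed under the template they match.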

### Pre-compute embeddings
Next, we apply step 2 from INSTRUCTIONS.md.

We use the embedding space of the building blocks a lot. Hence, we pre-compute and store the building block embeddings.

In [None]:
mol_embedder = compute_embeddings(bblocks, cpu_cores)
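
As a rough sketch of why pre-computing pays off: each building block is embedded once, the vectors are stored, and every later lookup reuses them. Everything below is an assumption made for illustration. `toy_embedding` is a hashed character-bigram bit vector standing in for a real molecular fingerprint, and `precompute_embeddings` and `nearest_block` are hypothetical helpers, not the project's `compute_embeddings`.

```python
import hashlib
from typing import Dict, List


def toy_embedding(smiles: str, n_bits: int = 64) -> List[int]:
    """Hash character bigrams into a fixed-size bit vector (toy stand-in)."""
    bits = [0] * n_bits
    for i in range(len(smiles) - 1):
        bigram = smiles[i : i + 2]
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(bigram.encode("utf-8")).hexdigest()
        bits[int(digest, 16) % n_bits] = 1
    return bits


def precompute_embeddings(blocks: List[str], n_bits: int = 64) -> Dict[str, List[int]]:
    """Embed every building block once and keep the vectors in memory."""
    return {block: toy_embedding(block, n_bits) for block in blocks}


def nearest_block(query: str, store: Dict[str, List[int]], n_bits: int = 64) -> str:
    """Return the stored block whose embedding is closest in Hamming distance."""
    query_bits = toy_embedding(query, n_bits)

    def hamming(block: str) -> int:
        return sum(a != b for a, b in zip(query_bits, store[block]))

    # Only the query is embedded here; the stored vectors are reused as-is.
    return min(store, key=hamming)


store = precompute_embeddings(["CCO", "CCN", "c1ccccc1"])
closest = nearest_block("CCO", store)
```

Each lookup only embeds the query, so the cost of embedding the building blocks is paid once rather than at every search step.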

## Synthesis

Now that everything is loaded and pre-processed, we can do the synthesis prediction.

First, we compute synthetic trees for the reachable SMILES, using both the original and the trained checkpoints.

In [None]:
synthesis(
    test_smiles,
    bblocks,
    original_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("reachable", "original"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)

In [None]:
synthesis(
    test_smiles,
    bblocks,
    trained_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("reachable", "trained"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)

Next, we compute synthetic trees for the ZINC dataset, using both the original and the trained checkpoints.

In [None]:
synthesis(
    zinc_smiles,
    bblocks,
    original_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("zinc", "original"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)

In [None]:
synthesis(
    zinc_smiles,
    bblocks,
    trained_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("zinc", "trained"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)

Finally, we compute synthetic trees for the ChEMBL dataset, again using both the original and the trained checkpoints.

In [None]:
synthesis(
    chembl_smiles,
    bblocks,
    original_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("chembl", "original"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)

In [None]:
synthesis(
    chembl_smiles,
    bblocks,
    trained_checkpoints,
    rxn_collection,
    mol_embedder,
    paths.synthesis_result_path("chembl", "trained"),
    rxn_template="hb",
    n_bits=4096,
    beam_width=3,
    max_step=15,
    cpu_cores=cpu_cores,
)
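
The six calls above differ only in the dataset and the checkpoint set. As a sketch of how that grid could be enumerated in one place, the snippet below builds the six run configurations. `build_runs` and `COMMON_KWARGS` are hypothetical names, and the keyword values are copied from the cells above.

```python
from itertools import product
from typing import Dict, List

# Dataset and checkpoint labels, as used in the result paths above.
DATASETS = ("reachable", "zinc", "chembl")
CHECKPOINTS = ("original", "trained")

# Settings shared by every call to synthesis() in this notebook.
COMMON_KWARGS = {"rxn_template": "hb", "n_bits": 4096, "beam_width": 3, "max_step": 15}


def build_runs() -> List[Dict[str, object]]:
    """Enumerate one configuration per (dataset, checkpoint) pair (hypothetical helper)."""
    return [
        {"dataset": dataset, "checkpoint": checkpoint, **COMMON_KWARGS}
        for dataset, checkpoint in product(DATASETS, CHECKPOINTS)
    ]


runs = build_runs()
for run in runs:
    print(run["dataset"], run["checkpoint"])
```

Inside this notebook, each entry could then be passed to `synthesis` together with the matching SMILES list, checkpoints, `bblocks`, `rxn_collection`, and `mol_embedder`, which keeps the shared settings in a single place.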