# OPEN-SOURCE DEEP DOCKING GUIDE

## Original Paper summary - *Deep Docking: A Deep Learning Platform for Augmentation of Structure Based Drug Discovery*

- the recent surge in small-molecule availability presents great drug discovery opportunities, but also demands much faster screening protocols
- there is still a global lack of experience in screening such libraries, and the advantage of docking them versus smaller collections is still a matter of debate. However, a few recently published works seem to advocate for expanding VS to ultralarge chemical libraries.
- the current chemical space remains largely inaccessible to structure-based drug discovery; the current approach to address this disparity is to filter large chemical collections to manageable drug-, lead-, fragment-, and hit-like subsets (among others) using precomputed physicochemical parameters and drug-like criteria, such as molecular weight, volume, octanol−water partition coefficient, polar surface area, number of rotatable bonds, and number of hydrogen bond donors and acceptors, among many others.
- conventional docking workflow is remarkably neglectful of negative results and the vast majority of docking data (both favorable and, especially unfavorable) is not being utilized in any way or form, while it could represent a very relevant, well-formatted, and content-rich input for machine learning algorithms


### Results

- DD achieves up to 100-fold reduction of an ultralarge docking database and up to 6000-fold enrichment for the top-ranked hits, while avoiding significant loss of favorable virtual hits, as it will be discussed below.
- DD pipeline:
    - For each entry of an ultralarge docking database (such as ZINC15), the standard set of ligand-based QSAR descriptors (such as molecular fingerprints) is computed;
    - A reasonably sized training subset is randomly sampled from the database and docked into the target of interest using conventional docking protocol(s)
    - The generated docking scores of the training compounds are then related to their 2D molecular descriptors through a DL model; a docking score cutoff (typically negative) is then used to divide training compounds into virtual hits (scoring below the cutoff) and nonhits (scoring above the cutoff). The cutoff also gradually becomes more stringent over iterations
    - The resulting QSAR deep model (trained on empirical docking scores) is then used to predict docking outcomes of yet unprocessed entries of the database. A predefined number of predicted virtual hits are then randomly sampled and used for the training set augmentation;
    - Steps 2−4 (sampling and docking, model training, and prediction with training set augmentation) are repeated iteratively until a predefined number of iterations is reached and/or the processed entries of the ultralarge docking database have converged
- Ultra Large Docking Database Sampling:
    - Selection of a representative and balanced training set is a critical step of any modeling workflow - a proper DD training set should effectively reflect database’s chemical diversity
    - Biasing sampling toward molecules that are highly ranked by DD as potential virtual hits could exclude low ranked, yet true positive molecules from being selected for model training; therefore we selected random sampling for all DD iterations.
    - To establish an optimal sampling of ZINC15 base, the relationship between the size of DD training set and the corresponding means and standard deviations of the test set recall values was evaluated
- Size reduction by virtual screening:
    - The main goal of DD methodology is to reduce an ultralarge docking database of billions of entries to a manageable few-million-molecule subset which still encompasses the vast majority of virtual hits
    - This final molecular subset can then be normally docked into the target using one or several docking programs or can be postprocessed with other VS means.
    - DD itself is not a docking engine, but a DL score predictor to be used in conjunction with any docking program to rapidly eliminate a priori unfavorable, “undockable” molecular entities, and therefore drastically increase the speed of actual docking.
    - The majority of nonhits were removed during the first iteration for all targets, while fewer molecules were discarded in successive steps, as expected due to larger portions of unfavorable compounds being present at the beginning of the runs. It was observed that the decrease rate and the number of hits identified were target-dependent
- Analysis of DD performance
    - all underlying DL models were generalizable in a consistent way
    - FDRE scores were compared, and enrichment values were evaluated for the top 10, 100 and 1000.
    - true hits are highly concentrated at the top of the DD rank
    - Overall, the above analysis indicates that the DD procedure can effectively discard most of the unqualified molecules in an ultralarge docking database, without losing more than a predefined percentage of virtual hits. In our opinion, this makes DD methodology an efficient means for conducting large-scale VS campaigns involving billions of small molecule structures, and a valid alternative to brute force approaches demanding large amounts of computational resources.
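
The labelling step of the pipeline summarised above can be illustrated with a minimal sketch. It assumes that lower (more negative) docking scores are better and that the cutoff is taken from a top percentage of the ranked scores; the function names `score_cutoff` and `label_hits` and the toy scores are hypothetical and are not part of the DD code.

```python
def score_cutoff(scores, top_percent):
    """Return the docking score that bounds the best `top_percent` % of `scores`.

    Lower (more negative) scores are treated as better.
    """
    ranked = sorted(scores)  # best (lowest) score first
    n_hits = max(1, int(len(ranked) * top_percent / 100))
    return ranked[n_hits - 1]


def label_hits(scores, cutoff):
    """Label each score 1 (virtual hit, at or below the cutoff) or 0 (nonhit)."""
    return [1 if s <= cutoff else 0 for s in scores]


# Toy docking scores for ten molecules (illustrative values only).
scores = [-9.1, -5.0, -7.4, -6.2, -8.3, -4.8, -5.5, -6.9, -7.0, -3.9]
cutoff = score_cutoff(scores, top_percent=20)  # top 20% of 10 scores = 2 hits
labels = label_hits(scores, cutoff)
print(cutoff, labels)
```

Making the cutoff more stringent over iterations then amounts to calling `score_cutoff` with a smaller `top_percent`.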

### Discussion

- These models then enable approximation of the docking outcome for unprocessed database entries. Importantly, DL allows the use of simple 2D protein-independent descriptors such as Morgan fingerprints to capture the docking scores. We have demonstrated that such an approach can yield a manageably small subset of a database, highly enriched with favorably “dockable” molecular structures.
- Moreover, DD appears to enrich final subsets with active ligands, even when only small portions of top ranked molecules are considered. This unexpected result suggests that true binders carry certain chemical features that are complementary to the binding pocket and that the model is able to capture such features through the QSAR descriptors.

## Original Protocol summary - *Artificial intelligence–enabled virtual screening of ultra-large chemical libraries with deep docking*

- Docking is not just computationally demanding, but also a remarkably wasteful process in which a very small subset of top-scoring compounds is considered for experimental evaluation. Thus, most docked molecules are simply discarded
- DD is a technique that iteratively trains deep neural networks (DNNs) with small batches of explicitly docked compounds to infer the ranking of the yet-unprocessed remainder of the library

### Experimental design
#### Preparation of chemical libraries
- most used ultra-large chemical libraries are ZINC15, ZINC20 and 'make-on-demand' Enamine
- Morgan fingerprints with radius 2 and size of 1,024 bits - these extended-connectivity fingerprints represent a machine-readable description of molecules based on a fixed-length binary bit vector encoding the presence or absence of specific substructures 

#### Receptor preparation
- Target structures need to be prepared before the docking grids can be generated.
- Non-structural water, lipids and solvent molecules are usually removed; the target protein may require structural optimization to repair any missing parts, add hydrogens, compute correct protonation states of residues and energetically relax the structure.

#### Molecular sample size
- Validation, test and initial training sets are randomly sampled from the entire docking library at the first DD pass.
- From the second iteration on, the training set is iteratively augmented with random batches of molecules classified as virtual hits in the inference stage of the previous iteration
- For libraries on the order of a billion compounds, the recommended size of the validation and test sets in the first iteration is as large as possible, ideally 1 million molecules each, and never fewer than 250,000 molecules
- The size of the training set, on the other hand, influences mainly model precision, and performances improve with larger training sets (700,000–1,000,000 molecules) and more iterations (8–11).
- Validation and test sets are generated only in the first iteration. Because the score threshold used to define virtual hits is decreased at each DD iteration, using small-sized sets can cause generalization issues, especially in the last iteration, in which the number of positive samples in the two sets is very limited (e.g., 0.01%).
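
A short worked calculation shows why small validation and test sets become a problem in late iterations. It uses only the set sizes and percentages quoted above; the helper name `expected_positives` is hypothetical.

```python
def expected_positives(set_size, hit_percent):
    """Number of molecules labelled as hits when the top `hit_percent` % are positives."""
    return round(set_size * hit_percent / 100)


for size in (250_000, 1_000_000):
    first = expected_positives(size, 1.0)   # 1% of the set
    last = expected_positives(size, 0.01)   # 0.01% of the set
    print(f"{size:>9} molecules: {first} positives at 1%, {last} at 0.01%")
```

With 250,000 molecules only 25 positives remain at a 0.01% threshold, compared with 100 for a set of 1 million molecules.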

#### Model training and inference
- Each iteration step in the DD protocol encompasses model training and inference. To identify virtual hits, the protocol uses binary classifiers in the form of feedforward DNN models (multilayer perceptrons) trained on 1,024-bit circular Morgan fingerprints.
- Binary ‘positive samples’ in training, validation and test sets are virtual hits with scores above a threshold, corresponding to a predefined top percentage of the docking-ranked molecules in the validation set. The rest of the molecules are labeled as ‘negative samples’
- After the binary labels are generated, a user-specified number of models with different combinations of hyperparameters (number of hidden layers and neurons, dropout frequencies, oversampling of minority class and class weights) are trained to optimize model test set accuracy by using a grid search strategy
- After the training phase is finished for the initial iteration, the optimal binary classifier is used for inference of virtual hit-likeness of the remainder of the molecular library. For the next iterations, training, validation and test sets are augmented with new compounds randomly selected from molecules with predicted virtual hit-likenesses higher than a classification threshold corresponding to a user-defined recall value for validation predictions.
- The total number of iterations typically ranges from 4 to 11, and we normally train 24 models at each iteration in the optimization step. For most docking campaigns, these parameters are sufficient to shrink a database of 1–1.5 billion molecules to a few million compounds that could be conventionally docked with regular computational resources. Alternatively, as we mentioned before, the preset recall value could be adjusted for more ‘aggressive’ DD-selection of top-scored compounds.
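
The recall-based classification threshold described above can be sketched as follows. This is an illustration under stated assumptions, not the protocol's implementation: the function name `threshold_for_recall` is hypothetical, and the threshold is taken as the largest predicted probability that still recovers the requested fraction of validation hits.

```python
import math


def threshold_for_recall(probabilities, labels, target_recall):
    """Largest probability threshold that still recovers `target_recall` of the true hits."""
    hit_probs = sorted((p for p, y in zip(probabilities, labels) if y == 1), reverse=True)
    if not hit_probs:
        raise ValueError("validation set contains no positive samples")
    # Number of true hits that must score at or above the threshold
    # (the small epsilon guards against floating-point round-up).
    n_required = max(1, math.ceil(target_recall * len(hit_probs) - 1e-9))
    return hit_probs[n_required - 1]


# Toy validation predictions: probability of being a virtual hit, and true label.
val_probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
val_labels = [1, 1, 0, 1, 0, 1]

threshold = threshold_for_recall(val_probs, val_labels, target_recall=0.75)
selected = [p for p in val_probs if p >= threshold]
print(threshold, selected)
```

A lower recall value gives a higher threshold and therefore a more aggressive reduction of the library, at the cost of losing more true hits.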

#### Applications
- The DD protocol can be used in conjunction with any popular docking program. 

#### Comparison with alternative methods
- One of the major challenges of modern CADD is a constantly growing need for computational resources required to screen chemical libraries that are exploding in size because of recent advances in automated synthesis and robotics.
- OpenEye GigaDocking
- The AutoDock program has been parallelized for Compute Unified Device Architecture (CUDA) and deployed on the Summit supercomputer
- VirtualFlow
- Bender et al. developed a guide for ultra-large docking campaigns
- These docking platforms achieve high throughput but are extremely resource-demanding in comparison to DD.
- Conventional docking of ultra-large libraries remains unaffordable for most of the research community
- Hence, many new machine learning emulation techniques have been developed
- DD is one of the fastest AI-enabled docking platforms and the only method that has been extensively tested on 1B+ libraries. In addition, the DD protocol does not rely on a particular docking program, and thus it is compatible with the emerging large-scale docking methods to improve their high-throughput capabilities.


#### Limitations
- DD is implemented for fast and economical virtual screening and thus provides docking details exclusively for the top-scoring molecules and disregards large fractions of chemical libraries.
-  In addition, the quality of DD results entirely depends on the suitability of the docking program to prioritize active molecules from an ultra-large library. Hence, we anticipated that it would be challenging to discover active molecules from DD of a library of a billion molecules if docking performs poorly on the specific target, just like in the case of conventional docking

https://pubs.acs.org/doi/10.1021/cn100008c#

## Open-source Protocol - *automated workflow*

<font color="red"> NOTE: Before proceeding, look at the section **SET-UP** at the end of the notebook. It is optimized for users of CSD3 (University of Cambridge), but can be adjusted to other use cases too.</font>


### Install environments

Main environment is in *DD_protocol.yml*.

*DD_protocol_tensor.yml* is an environment with a different TensorFlow version, required for running models on A100 GPUs. If the GPU used is not an A100, it is probably not needed. Otherwise, this environment is required in the phases where models are trained or used for inference (phases 4 and 5).

In [None]:
# !conda env create -f DD_protocol.yml
# !conda env create -f DD_protocol_tensor.yml

In case of problems installing *DD_protocol_tensor.yml*, the main dependencies to install are shown below. Additional packages such as keras and matplotlib are required too, but these are the main ones.

<font color="red"> Note (CSD3 user): </font> There might be problems using pip commands on CSD3, and one might need to create and set up a virtual environment for this to work. See how this is done at https://docs.hpc.cam.ac.uk/hpc/software-packages/jupyter.html?highlight=pip#setup-jupyter-on-csd3

In [1]:
# !pip install nvidia-pyindex
# !pip install nvidia-tensorflow

# !conda install tensorboard

### Chemical library processing

We are using the ready-to-screen version of ZINC20 from https://files.docking.org/zinc20-ML/, which comes with SMILES and precomputed Morgan fingerprints. This library is already prepared, so there is no need to enumerate stereoisomers, tautomers and protomers or to compute fingerprints. The chemical space contains 1,006,651,037 compounds.

The library is split into 100 text files, each containing approximately 10,000,000 compounds. There is one folder for SMILES and a separate folder for the corresponding fingerprints.

However, if we want to reduce the chemical space based on physicochemical properties, we need to apply the desired filters.

To do so, the *filter_by_properties.py* script can be used. The arguments it takes are listed below.

In [2]:
!python scripts_3/filter_by_properties.py --help

usage: filter_by_properties.py [-h] -file FILE -output_directory
                               OUTPUT_DIRECTORY [-max_logP MAX_LOGP]
                               [-max_molWt MAX_MOLWT] [-min_TPSA MIN_TPSA]
                               [-max_TPSA MAX_TPSA] [-max_HBD MAX_HBD]
                               [-use_only_molWt USE_ONLY_MOLWT]

optional arguments:
  -h, --help            show this help message and exit
  -file FILE            File to process in format SMILES <whitespace>
                        UNIQUE_ID
  -output_directory OUTPUT_DIRECTORY
                        Directory where to save the filtered output
  -max_logP MAX_LOGP    The upper bound for logP(lipophilicity)
  -max_molWt MAX_MOLWT  The upper bound for molWt(molecular_weight)
  -min_TPSA MIN_TPSA    The lower bound for
                        TPSA(topological_polar_surface_area)
  -max_TPSA MAX_TPSA    The upper bound for
                        TPSA(topological_polar_surface_area)
  -max_HBD MAX_HBD      The 

As we want to parallelize this process and run it on SLURM, *run_filtering_for_multiple_files.sh* (shown below) can be used. This script parallelizes runs of *filter_by_properties.py*, one job per library subfile. If you want to use parameters other than the defaults, pass them to *filter_by_properties.py* in the `--wrap` command inside the script. If the sbatch properties have to be changed, do so in the script itself.

In [3]:
!cat scripts_3/run_filtering_for_multiple_files.sh 

#!/bin/bash
directory_to_filter=$1
output_directory=$2

# If you want to use different filtering options, please change it here
for file in $directory_to_filter/*; 
do sbatch --account=VENDRUSCOLO-SL3-CPU --partition=skylake --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap "python scripts_3/filter_by_properties.py -file $file -output_directory $output_directory -use_only_molWt True"
done


Filtering is done using RDKit and enables filtering based on **max logP, max molecular weight, min and max topological polar surface area and max hydrogen bond donors**, or on **max molecular weight alone**.

There are molecules for which the SMILES is not in a form that RDKit can read. For these, **SMILITE** (https://github.com/rasbt/smilite) can be used (code to do this is provided): it tries to retrieve an alternative SMILES from the ZINC database and uses that to determine the properties. Note that the retrieved alternative SMILES can correspond to a slightly different version of the molecule.
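
The threshold logic of such a filter can be sketched as below. This is a minimal illustration that assumes the descriptors have already been computed (for example with RDKit); the function `passes_filters`, the dictionary keys and the default limits are hypothetical and are not the defaults of *filter_by_properties.py*.

```python
def passes_filters(props, max_logP=5.0, max_molWt=500.0, min_TPSA=0.0,
                   max_TPSA=140.0, max_HBD=5, use_only_molWt=False):
    """Return True if a molecule's precomputed properties satisfy all bounds.

    `props` is a dict with keys 'logP', 'molWt', 'TPSA' and 'HBD'.
    The default limits are illustrative values only.
    """
    if props["molWt"] > max_molWt:
        return False
    if use_only_molWt:
        return True  # molecular weight is the only criterion in this mode
    return (props["logP"] <= max_logP
            and min_TPSA <= props["TPSA"] <= max_TPSA
            and props["HBD"] <= max_HBD)


small_polar = {"logP": 1.2, "molWt": 310.4, "TPSA": 85.0, "HBD": 2}
greasy = {"logP": 6.8, "molWt": 420.0, "TPSA": 40.0, "HBD": 1}

print(passes_filters(small_polar))                 # within all bounds
print(passes_filters(greasy))                      # fails the logP bound
print(passes_filters(greasy, use_only_molWt=True)) # passes on weight alone
```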

#### Remove duplicates and merge with Morgan fingerprints

Once the filtering is done, *remove_duplicates_and_merge_with_morgan.py* can be used to remove duplicates and to keep only the Morgan fingerprints that correspond to the filtered compounds.

<font color="red"> Note: </font>If the original dataset is used for other purposes, or if only one molecule per ZINC ID should be kept, part of the script can be uncommented to remove all non-unique ZINC IDs.
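
The idea behind this step can be sketched in a few lines. The sketch makes several assumptions that may not match the actual script: SMILES lines have the form `SMILES <whitespace> ID`, a duplicate is an exact repeat of a (SMILES, ID) pair, and each fingerprint line starts with the molecule ID followed by a comma. The function names are hypothetical.

```python
def dedupe_smiles(lines):
    """Keep the first occurrence of each (SMILES, ID) pair, in input order."""
    seen, kept = set(), []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines
        key = (parts[0], parts[1])
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


def matching_fingerprints(kept, fingerprint_lines):
    """Keep only fingerprint lines whose ID (first comma-separated field) survived filtering."""
    wanted = {mol_id for _, mol_id in kept}
    return [fp for fp in fingerprint_lines if fp.split(",", 1)[0] in wanted]


smiles_lines = ["CCO ZINC1", "CCO ZINC1", "c1ccccc1 ZINC2"]
fingerprint_lines = ["ZINC1,3,17,512", "ZINC2,8,90", "ZINC3,1,2"]

kept = dedupe_smiles(smiles_lines)
fingerprints = matching_fingerprints(kept, fingerprint_lines)
print(kept)
print(fingerprints)
```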



To run this process concurrently on batches of files using SLURM, *run_multiple_remove_duplicates_and_merge_with_morgan.sh* can be used.

In [11]:
!cat scripts_3/run_multiple_remove_duplicates_and_merge_with_morgan.sh

#!/bin/bash
input_directory_smiles=$1
input_directory_fingerprints=$2
output_directory_smiles=$3
output_directory_fingerprints=$4
account=$5
partition=$6

sbatch --account=$account --partition=$partition --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap "python scripts_3/remove_duplicates_and_merge_with_morgan.py -start_file 0 -end_file 25 -input_directory_smiles $1 -input_directory_fingerprints $2 -output_directory_smiles $3 -output_directory_fingerprints $4"; 

sbatch --account=$account --partition=$partition --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap "python scripts_3/remove_duplicates_and_merge_with_morgan.py -start_file 26 -end_file 50 -input_directory_smiles $1 -input_directory_fingerprints $2 -output_directory_smiles $3 -output_directory_fingerprints $4"; 

sbatch --account=$account --partition=$partition --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap "python scripts_3/remove_duplicates_and_merge_with_morgan.py -start_file 51 -end_

<font color="red"> Important: </font> Also run the following line to put the fingerprint files in the right format.

In [None]:
# for file in path_to_filtered_fingerprints/*.txt; do sed -i 's/?/,/g' $file; done

### Fill in the *logs.txt* file

Fill in the log file, which should ideally be located under project_path/project_name (here results/abeta); in our case project_path and project_name coincide with the file path and protein name entered below. Fill in:
1.  file path, 
2. protein, 
3. path to conf file for VINA specific for this protein, 
4. path to fingerprints folder, 
5. path to smiles folder, 
6. name of tool used for docking (with VINA this does not matter)
7. number of hyperparameters to test for model, 
8. size of validation and test sets (the same), 
9. path to receptor file that will be used during VINA docking, 
10. path to OBABEL software, 
11. path to VINA software,
12. path to VINA-GPU software.

<font color="red"> Important: </font> Each should be on a new line and there should be **NO** trailing white space after the path/name on each line  as this may cause issues when resolving paths. Some paths need to be absolute as some scripts change directory to a specific one from the main.

An example *logs.txt* file is shown below.

In [5]:
!cat results/abeta/logs.txt

/home/mb2462/rds/hpc-work/DD/DD_protocol_data/DD_main_clean/results
abeta
/home/mb2462/rds/hpc-work/DD/DD_protocol_data/DD_main_clean/results/abeta/conf.txt
../library_ready_filtered_with_isomers_fingerprints
../library_ready_filtered_with_isomers_smiles
Vina
24
450000
/home/mb2462/rds/hpc-work/DD/DD_protocol_data/DD_main_clean/results/abeta/receptor.pdbqt
/home/mb2462/test/DD_protocol_data/OPENBABEL/build/bin/obabel
/home/mb2462/rds/hpc-work/DD/DD_protocol_data/VINA/autodock_vina_1_1_2_linux_x86/bin/vina
/home/mb2462/rds/hpc-work/DD/DD_protocol_data/VINA_GPU/Vina-GPU
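
Because the scripts read this file by line number, it can help to check it before launching any phase. The sketch below is a hypothetical helper, not part of the repository: the field names in `LOG_FIELDS` are labels chosen here for the twelve lines listed above, and the paths in the example are placeholders.

```python
LOG_FIELDS = [
    "file_path", "protein", "vina_conf", "fingerprints_dir", "smiles_dir",
    "docking_tool", "n_hyperparameters", "valid_test_size", "receptor",
    "obabel_path", "vina_path", "vina_gpu_path",
]


def parse_logs(text):
    """Map the twelve lines of a logs file to named fields, rejecting stray whitespace."""
    lines = text.splitlines()
    while lines and lines[-1] == "":
        lines.pop()  # ignore empty lines at the end of the file
    if len(lines) != len(LOG_FIELDS):
        raise ValueError(f"expected {len(LOG_FIELDS)} lines, found {len(lines)}")
    padded = [name for name, line in zip(LOG_FIELDS, lines) if line != line.strip()]
    if padded:
        raise ValueError(f"leading or trailing whitespace in: {', '.join(padded)}")
    return dict(zip(LOG_FIELDS, lines))


example = "\n".join([
    "/abs/path/results", "abeta", "/abs/path/results/abeta/conf.txt",
    "../fingerprints", "../smiles", "Vina", "24", "450000",
    "/abs/path/receptor.pdbqt", "/abs/path/obabel", "/abs/path/vina",
    "/abs/path/Vina-GPU",
]) + "\n"

config = parse_logs(example)
print(config["protein"], config["valid_test_size"])
```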


### PHASE 1: Random sampling

This is the method used by the original protocol. Training set size is set in this command, while testing/validation set size is retrieved from the logs file. 

It can be run using a command like the one shown below:

In [9]:
# phase_1.sh current_iteration n_cpus_per_node path_project project training_sample_size conda_env
#!sbatch  --account={account_name} --partition={partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=01:00:00 phase_1.sh 1 10 results abeta 450000 DD_protocol
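
Conceptually, this phase draws the training, validation and test sets at random from the library. The sketch below illustrates the idea on a small in-memory list of IDs; it assumes the three sets are disjoint, and the function `sample_sets` is hypothetical (the actual scripts sample from the library files in parallel).

```python
import random


def sample_sets(molecule_ids, train_size, valid_test_size, seed=0):
    """Randomly draw disjoint training, validation and test sets from a list of IDs."""
    rng = random.Random(seed)  # fixed seed for a reproducible illustration
    picked = rng.sample(molecule_ids, train_size + 2 * valid_test_size)
    train = picked[:train_size]
    valid = picked[train_size:train_size + valid_test_size]
    test = picked[train_size + valid_test_size:]
    return train, valid, test


library = [f"ZINC{i:012d}" for i in range(10_000)]
train, valid, test = sample_sets(library, train_size=450, valid_test_size=450)
print(len(train), len(valid), len(test))
```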


### PHASE 2: Prepare ligands

To prepare ligands for docking, we download 3D conformations as **SDF** files, by ZINC ID, for compounds that do not have multiple isomers. For compounds with multiple isomers present (i.e. IDs containing _1/_2/..), we create the 3D conformations using RDKit. All SDF conformations are then converted to **PDBQT** using **OBABEL**.

For the 3D conformation generation using RDKit, we use code by Berenger et al., available at https://github.com/UnixJunkie/smi2sdf3d/blob/master/smi2sdf.py.

<font color="red"> Note: </font> For some SMILES, the generation might not be successful due to no good conformations produced/available. 


For this, we first split the train/test/valid sets into compounds without isomers (whose conformations are downloaded) and compounds with isomers (whose 3D conformations are generated).

Then we download 3D conformations for batches of 1000 ligands (similarly we create 3D conformations in batches of 1000), in parallel. For download, we first create download jobs to run downloads in parallel. 

This step is automated in ***phase_2_vina_mixed_download_and_creation_ligands.sh***.
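
The split and batching described above can be sketched as follows. The sketch assumes that an ID with multiple isomers ends in an underscore followed by a number (such as `_1` or `_2`); the helper names and the example IDs are hypothetical.

```python
def has_isomer_suffix(mol_id):
    """True if the ID ends in '_<number>', i.e. it denotes one of several isomers."""
    _, sep, suffix = mol_id.rpartition("_")
    return bool(sep) and suffix.isdigit()


def chunks(items, size=1000):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


ids = ["ZINC000000000001", "ZINC000000000002_1", "ZINC000000000002_2"]
to_download = [m for m in ids if not has_isomer_suffix(m)]  # fetch 3D SDF by ID
to_generate = [m for m in ids if has_isomer_suffix(m)]      # build 3D with RDKit

batches = chunks(list(range(2500)), size=1000)
print(to_download, to_generate, [len(b) for b in batches])
```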



You can run ***phase_2_vina_mixed_download_and_creation_ligands.sh*** using a command similar to the following:

In [10]:
# STRUCTURE: phase_2_vina_mixed_download_and_creation_ligands.sh current_iteration n_cpus_per_node path_project project_name name_cpu_partition account_name
#!sbatch --account={account_name} --partition={partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 phase_2_vina_mixed_download_and_creation_ligands.sh 1 10 results  abeta skylake VENDRUSCOLO-SL3-CPU


The script that creates the download commands, *create_download_ligand_scripts.py*, is shown below.

In [6]:
!cat scripts_3/create_download_ligand_scripts.py

import pandas as pd
from argparse import ArgumentParser

# Parse the arguments
parser = ArgumentParser()
parser.add_argument("-file", required=True,
                    help="File to process")
parser.add_argument("-path_to_store_scripts", default="",
                    help="Path where to store scripts")
parser.add_argument("-path_to_store_ligands", default="",
                    help="Path where to store the ligands")
parser.add_argument("-chunk_size", default=1000,
                    help="Size of a chunk that will be downloaded in parallel")
parser.add_argument("-prefix_to_chunk_files", default="",
                    help="Prefix to use with chunk files")
parser.add_argument('--remove_ZINC_name', action='store_true', 
                    help="Boolean indicator of whether ZINC word should be removed from ID")
parser.add_argument('-output_format', default='sdf',
                    help="Output format for the compounds you want to download (sdf/smi/...)")
# This argument is used 

<font color="red"> Important (CSD3 user): </font> The script requires a recent version of curl, hence the module *curl-7.63.0-intel-17.0.4-lxwgw2f* needs to be loaded when using CSD3. As the request may fail sometimes, the command is designed to be retried a few times.

<font color="red"> Important: </font> As the request may fail even after a few retries, a separate script can be used to download the batches for which the requests have failed. This should not happen for many batches. The script can be rerun as many times as needed until all batches are downloaded.

This script retries downloads for batches whose download scripts failed or did not finish.

In [8]:
!cat phase_2_vina_retry_downloads.sh

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=10:00:00

current_iteration=$1
path_project=$2
project_name=$3
chunk_size=$4 

# Get paths
file_path=`sed -n '1p' $path_project/$project_name/logs.txt`
protein=`sed -n '2p' $path_project/$project_name/logs.txt`

# Go to directory with the current iteration
cd $file_path/$protein/iteration_${current_iteration}

pdbqt_directory="pdbqt"

# For each batch file that has not been downloaded due to request failure, run the download again.
echo "retry based on number of lines"
for d in ${pdbqt_directory}/*_download;
do
tmp="$d"
directory_set_name_full="${tmp##*/}"
set_type="${directory_set_name_full%_*}" # train/test/validation
echo $set_type
   for f in $d/*.sdf
   do
       x=$(wc -l < "$f")
       if [ $x -lt 1000 ];
       then
           tmp="$f"
           full_filename="${tmp##*/}"
           filename="${full_filename%.*}"
           script_name=${set_type}_set_scripts/download_${filename}.sh
   

Example run of the retry function is shown below.

In [None]:
#sbatch --account={account_name} --partition={partition_name} phase_2_vina_retry_downloads.sh 1 results abeta
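
The check that the retry script performs on each batch file (fewer than 1000 lines means the download is treated as incomplete) can be mirrored in Python to list the affected batches before resubmitting. This is a sketch under that assumption; `incomplete_batches` is a hypothetical helper and the files below are dummies created in a temporary directory.

```python
import tempfile
from pathlib import Path


def incomplete_batches(download_dir, min_lines=1000):
    """Return the names (without extension) of .sdf files with fewer than `min_lines` lines."""
    short = []
    for sdf in sorted(Path(download_dir).glob("*.sdf")):
        with sdf.open() as handle:
            n_lines = sum(1 for _ in handle)
        if n_lines < min_lines:
            short.append(sdf.stem)
    return short


# Demonstration with two dummy batch files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch_0.sdf").write_text("line\n" * 1200)  # looks complete
    (Path(tmp) / "batch_1.sdf").write_text("line\n" * 10)    # looks truncated
    to_retry = incomplete_batches(tmp)

print(to_retry)
```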

#### Final ligands preparation

Following either the download or the RDKit method, we end up with SDF files containing conformations for batches of ligands. We then have to split the ligands into separate files and convert them to PDBQT format.

This can be done using the *phase_2_vina_prepare_ligands.sh* script; an example command is shown below.

<font color="red"> Note (CSD3 user): </font> Sometimes you might get libboost_iostreams.so.1.66.0 error. For this you should *module load gcc* and *module load boost-1.66.0-gcc-5.4.0-sdffwvs*


In [12]:
# STRUCTURE phase_2_vina_prepare_ligands.sh current_iteration path_project project_name name_cpu_partition account_name
#!sbatch phase_2_vina_prepare_ligands.sh 1 results abeta {partition_name} {account_name}

<font color="red"> Important (CSD3 user): </font> CSD3 *rds* storage has a limit of 1 million on the number of files stored. When splitting SDFs into separate files, this can cause issues, as the limit will be exceeded when preparing more than 1 million ligands. One option is to do preparation and docking set by set (train/test/valid). An alternative is to create the files and ZIP them so that only *one* file is stored.
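
One way to keep the file count low is to write the per-ligand records straight into a single archive instead of individual files. The sketch below is an illustration only, not what *phase_2_vina_prepare_ligands.sh* does: it assumes each SDF record ends with a `$$$$` line and uses the record's first line (the molecule title) as the member name. The function name and the dummy records are hypothetical.

```python
import io
import zipfile


def split_sdf_to_zip(sdf_text, zip_target):
    """Store every SDF record as its own member of one ZIP archive; return member names."""
    names, record = [], []
    with zipfile.ZipFile(zip_target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for line in sdf_text.splitlines():
            if line.strip() != "$$$$":
                record.append(line)
                continue
            # End of a record: write it out unless it is empty.
            if any(part.strip() for part in record):
                title = record[0].strip() or f"ligand_{len(names)}"
                name = f"{title}.sdf"
                archive.writestr(name, "\n".join(record) + "\n$$$$\n")
                names.append(name)
            record = []
    return names


# Two dummy records (not valid molecules) written to an in-memory archive.
example = "ZINC1\n  dummy block\nM  END\n$$$$\nZINC2\n  dummy block\nM  END\n$$$$\n"
buffer = io.BytesIO()
members = split_sdf_to_zip(example, buffer)
archived = zipfile.ZipFile(buffer).namelist()
print(members)
```

Passing a file path instead of the in-memory buffer produces one archive on disk, which counts as a single file against the storage quota.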

### PHASE 3: Docking

In this step, we dock the molecules from the sets that need to be docked using **VINA** or **VINA-GPU** based on the user's preference. We dock each batch within each set separately and output the docking results for given batch as a concatenated txt file. 

In [13]:
!cat scripts_3/run_batch_docking.sh

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1 # COMMENT OUT WHEN USING CLASSIC VINA
#SBATCH --cpus-per-task=1
#SBATCH --time=02:15:00

# PARAMETERS
batch_directory_to_dock=$1 # THIS SHOULD BE A FULL PATH IF USING VINA GPU
receptor=$2 # THIS SHOULD BE A FULL PATH IF USING VINA GPU
configuration_file=$3 # THIS SHOULD BE A FULL PATH IF USING VINA GPU
path_to_output_directory=$4 # THIS SHOULD BE A FULL PATH IF USING VINA GPU
vina_path=$5
using_vina_gpu=$6

tmp="${batch_directory_to_dock}"
batch="${tmp##*/}"

# Prepare output file paths
output_file_txt=$path_to_output_directory/docking_vina_gpu_$batch.txt
output_file_pdbqt=$path_to_output_directory/docking_vina_gpu_${batch}_dummy.pdbqt

echo $batch_directory_to_dock

# Use Vina or Vina-GPU based on the user's choice.
if [ "$using_vina_gpu" = true ] ; then
    echo "Using VINA-GPU"
    # Required for Vina-GPU to work
    ulimit -s 8192
    # For each file in the batch directory, run VINA docking. Output of all docking

<font color="red"> Note: </font> For AutoDock Vina, comment out *#SBATCH --gres=gpu:1* to spare resources.

Example commands to run docking with Vina and VINA-GPU are shown below:

In [14]:
# STRUCTURE phase_3_vina.sh current_iteration path_project project_name use_vina_gpu account_name partition

# For VINA-GPU
#!sbatch phase_3_vina.sh 1 results abeta true {account_name} {gpu_partition_name}

# For AutoDock Vina
#!sbatch phase_3_vina.sh 1 results abeta false {account_name} {cpu_partition_name}

#### Post processing

Now that we have the docked data, we need to extract labels. We also have to correct the SMILES and Morgan fingerprints for the downloaded compounds so that the model training in the next step is as accurate as possible.

For this, we can run:

In [15]:
# STRUCTURE phase_3_vina_post_processing.sh path_to_iteration
#!sbatch phase_3_vina_post_processing.sh results/abeta/iteration_1
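
As an illustration of the label-extraction idea, the sketch below pulls the best docking score out of AutoDock Vina output. It assumes the text contains the `REMARK VINA RESULT:` lines that Vina writes for each binding mode in its output PDBQT; the concatenated txt files produced in phase 3 may be laid out differently, and `best_vina_score` is a hypothetical helper rather than part of the post-processing script.

```python
def best_vina_score(pdbqt_text):
    """Return the lowest affinity (kcal/mol) found in 'REMARK VINA RESULT:' lines, or None."""
    scores = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # The first number after the colon is the predicted affinity of the mode.
            scores.append(float(line.split(":", 1)[1].split()[0]))
    return min(scores) if scores else None


example_output = """MODEL 1
REMARK VINA RESULT:      -7.3      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:      -6.9      1.812      2.455
ENDMDL
"""

print(best_vina_score(example_output))
```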

### PHASE 4: Model training

<font color="red"> Note (relevant to users of A100): </font> If you did not set up nvidia-tensorflow for CUDA 11 directly in your conda environment, it is recommended to set up a separate environment with nvidia-tensorflow that can be used for phase 4 and phase 5. This is explained in more detail in the *Install environments* section above.

This step is almost the same as in the original pipeline, with small changes. Extraction of labels is excluded here as it is included in the post processing step of phase 3.

Phase 4 (slightly adjusted) can be run with a command similar to the following:

In [17]:
# STRUCTURE: phase_4_vina.sh current_iteration number_of_processors_available project_path project_name gpu_partition_name desired_final_number_of_iterations percent_first_mols_hits percent_last_mols_hits recall_value max_wall_time conda_environment_name account_name
#!sbatch phase_4_vina.sh 1 3 results abeta {gpu_partition_name} 11 1 0.01 0.9 00-12:00 DD_protocol_tensor {account_name}
#!sbatch phase_4_vina.sh 1 3 results abeta {gpu_partition_name} 5 1 0.01 0.9 00-12:00 DD_protocol_tensor {account_name}

### PHASE 5: Inference

<font color="red"> Note (relevant to users of A100): </font> If you did not set up nvidia-tensorflow for CUDA 11 directly in your conda environment, it is recommended to set up a separate environment with nvidia-tensorflow that can be used for phase 4 and phase 5. This is explained in more detail in the *Install environments* section above.


Inference can be used with no major change from the original pipeline (just slight adjustments that do not change the workflow). *simple_job_predictions.py* has been updated to take a **full path to the Morgan fingerprints directory**, as there were issues when the path was not full. An example command is shown below.

In [18]:
# STRUCTURE:  phase_5_vina.sh current_iteration path_to_project project_name recall_value gpu_partition_name env account_name 
# !sbatch phase_5_vina.sh 1 results abeta 0.9 {gpu_partition_name} DD_protocol_tensor {account_name}

This command first identifies the best-performing model (*hyperparameter_result_evaluation.py*) and then infers scores on the whole library, as a separate job for each library file.

<font color="red"> **Important**: </font>  For each iteration, check that AUC, precision and recall are as expected. As advised in the original protocol:

- Precision should be at least 2.25% (0.0225)
- Recall should not differ by more than 0.015. 
- *Total Left Testing* in the *best_model_stats* and the number of molecules in *morgan_1024_predictions* should not be too far apart (advised max 10%).

If these conditions are not met, follow the advice in the original protocol: regenerate the test/valid sets, use bigger test/valid sets, etc.
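
These checks are easy to script once the numbers have been read from the model statistics. The sketch below is a hypothetical helper; in particular, it assumes that the recall criterion compares two recall values (for example validation and test) and that the 10% tolerance is measured relative to *Total Left Testing*.

```python
def check_iteration(precision, recall_a, recall_b, total_left_testing, n_predicted):
    """Return a list of warnings for one DD iteration (an empty list means all checks pass)."""
    issues = []
    if precision < 0.0225:
        issues.append("precision below 0.0225")
    if abs(recall_a - recall_b) > 0.015:
        issues.append("recall values differ by more than 0.015")
    if abs(total_left_testing - n_predicted) > 0.10 * total_left_testing:
        issues.append("predicted and estimated library sizes differ by more than 10%")
    return issues


ok = check_iteration(0.05, 0.90, 0.895, 40_000_000, 41_500_000)
bad = check_iteration(0.01, 0.90, 0.85, 40_000_000, 60_000_000)
print(ok)
print(bad)
```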

<font color="red"> Note: </font>If there are issues with Keras and you are getting error ***AttributeError: 'str' object has no attribute 'decode'***, run the following command to resolve it.

In [2]:
# !pip install 'h5py==2.10.0' --force-reinstall

### FINAL PHASE: Extraction
To extract a given number of molecules (or all molecules, by passing 'all_mol' instead of an integer), use the command shown below. We decided to extract 3 million molecules for further processing down the line.

In [19]:
# STRUCTURE: sbatch --cpus-per-task no_of_cpus utilities/final_extraction.sh path_to_smiles_directory path_to_last_predicted_hits num_of_cores num_of_molecules_to_extract(or 'all_mol') conda_environment
# sbatch --cpus-per-task 10 utilities/final_extraction.sh /home/mb2462/rds/hpc-work/DD/DD_protocol_data/library_ready_filtered_with_isomers_smiles /home/mb2462/rds/hpc-work/DD/DD_protocol_data/DD_main_clean/results/abeta/iteration_5/morgan_1024_predictions 10 3000000 DD_protocol 

## Clustering

<font color="red"> Note: Full clustering and downstream analysis can be found in **clustering_and_downstream_analysis/Clustering_and_downstream_analysis.ipynb** notebook </font>

As we have a large number of molecules to cluster (3 million), we cannot use traditional Butina clustering with RDKit. Following https://www.macinchem.org/reviews/clustering/clustering.php, we can cluster the molecules with Chemfp, which does allow clustering of larger libraries. We can use the 1.x development line, which is non-commercial. Note that Chemfp 1.x is **not compatible with Python 3**, so we have to create a separate environment that runs the code in **Python 2.7**. All steps to create the environment, install a compatible RDKit (versions before 2019), and finally install chemfp are shown below.

In [2]:
# conda create -y -n DD_protocol_py27 python=2.7
# conda activate DD_protocol_py27
# conda install -c rdkit rdkit=2018.09.1
# pip install chemfp

Alternatively, one can use the DD_protocol_py27.yml file that is already provided.

In [None]:
# !conda env create -f DD_protocol_py27.yml

Now, to create compatible fingerprints from the SMILES of the molecules we want to cluster, we can run

In [1]:
# sbatch --account={account_name} --partition={cpu_partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=02:00:00 --wrap="rdkit2fps extracted_smiles.smi --morgan --radius 2 --useChirality 1 > extracted_smiles.fps"

And to get the clusters

<font color="red"> Note: </font>This step may have larger memory requirements, so use a partition with more memory per CPU.  

<font color="red"> Note (CSD3 user): </font> On CSD3 this is, for example, cclake-himem. If the default is not enough, increase either the cpus-per-task or the mem sbatch parameter (as per the CSD3 guide).

In [20]:
# sbatch --account={account_name} --partition={cpu_partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=10:30:00 --wrap="python ../scripts_3/taylor_butina.py --profile --threshold 0.78 extracted_smiles.fps -o extracted_smiles_clusters.txt"
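
For intuition about what the clustering step does, the following is a minimal pure-Python sketch of Taylor-Butina clustering on toy fingerprints. It is not the code in *taylor_butina.py*: the set-based fingerprints, the function names, and the reading of the threshold as a minimum Tanimoto similarity are assumptions made for illustration, and the quadratic neighbor search here would not scale to 3 million molecules.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return float(len(a & b)) / len(a | b)


def taylor_butina(fingerprints, threshold):
    """Cluster fingerprints; returns a list of (centroid_index, member_indices)."""
    n = len(fingerprints)

    # Step 1: for every molecule, collect neighbors at or above the threshold.
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Step 2: visit molecules from most to fewest neighbors (ties by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    # Step 3: each unassigned molecule becomes a centroid and claims its
    # still-unassigned neighbors as cluster members.
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = sorted(neighbors[centroid] - assigned)
        assigned.add(centroid)
        assigned.update(members)
        clusters.append((centroid, members))
    return clusters


# Toy fingerprints (sets of on-bit positions), purely illustrative.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {10, 11}, {10, 11, 12}, {20}]
print(taylor_butina(toy_fps, threshold=0.5))
# [(0, [1, 2]), (3, [4]), (5, [])]
```

Molecules with no neighbors above the threshold end up as singleton clusters, as with index 5 in the toy example.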

## Possible issues and hacks that could be helpful

Some of these might be obvious but they are included just in case.

1. (CSD3 user) Running GPU processes on ampere and then trying to run CPU processes on the same login node can give an error such as ***/lib64/libc.so.6: version GLIBC_2.27' not found (required by /usr/local/software/slurm/slurm-20.11.9-rhel8/lib/slurm/libslurmfull.so)***. In that case, try switching to a different login node, for example a CPU-exclusive one such as user-id@login-e-9.hpc.cam.ac.uk

2. To find the number of lines in a file (in our case, lines correspond to compounds), run

In [1]:
#!wc -l filename

3. Remove files of a specific type in all subdirectories of a directory

In [None]:
#!for subd in */; do cd $subd; rm *.sdf; cd ..; done

4. Find the number of files in each subdirectory of a directory

In [None]:
#!for subd in */;do cd $subd; ls | wc -l; cd ..;done

5. You can run zip and unzip as follows

In [22]:
# ZIP DIRECTORY
#!sbatch --account={account_name} --partition={cpu_partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=03:00:00 --wrap "zip -r --quiet test_iteration_1.zip test"

# UNZIP DIRECTORY
#!sbatch --account={account_name} --partition={cpu_partition_name} --nodes=1 --ntasks=1 --cpus-per-task=10 --time=03:00:00 --wrap "unzip test_iteration_1.zip"

6. Count occurrences of a word in a file (for all files in the directory)

In [7]:
# OPTION 1
# for f in *; do grep -o 'ZINC'  $f | wc -l; done;
# OPTION 2 - for characters that need escaping, use \character (backslash followed by the character); note that grep -c counts matching lines
# for f in *; do grep -c '\$\$\$\$' $f; done;  
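
The same per-file count can be done from Python, for example inside the notebook. The sketch below mirrors `grep -c` (it counts lines containing the marker, not individual occurrences); the function name is hypothetical, and treating `$$$$` as the marker of interest simply follows the example above.

```python
import os


def count_marker_lines(directory, marker="$$$$"):
    """Return {file_name: number of lines containing marker} for a directory."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        # errors="replace" keeps the loop going if a file is not valid text.
        with open(path, "r", errors="replace") as handle:
            counts[name] = sum(1 for line in handle if marker in line)
    return counts
```

For example, `count_marker_lines(".", "ZINC")` would report, for each file in the current directory, how many lines mention a ZINC identifier.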

## Set-up to run this notebook (Adjusted for CSD3 user)

This set-up was optimized for a CSD3 cluster user. However, it can be adjusted to any cluster with sufficient  <font color="red">  *space* </font> and  <font color="red">  *resources* </font>.

### Technical Requirements:

1. *Enough CPU and GPU resources* - For time efficiency, ideally at least 200 CPU cores and 50 GPU cores, preferably with exclusive access or, failing that, with sufficiently short queues. Fewer resources are also fine if the user accepts longer run times.
2. *Space* - The library itself is around 267GB, so at least this much disk space is needed, plus additional space for intermediate files and results.
3. *File number limit* - Running Vina requires a separate file per ligand. If 1,000,000 ligands are being docked, the disk must allow that many files and not have a low limit on the number of files. There are workarounds, however; for example, docking can be done in batches.


<font color="red"> Note:</font> Due to space limits, the big project files (such as datasets) should probably be stored in the *rds* space on CSD3. The whole project can be kept there too; however, there is a limit on the number of files (1 million). Remember, though, that RDS is not backed up.
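
As a rough way to plan the batching workaround, the arithmetic can be sketched as follows. The function, the number of files already in use, the assumption of two files per ligand (for example one input and one output), and the 80% safety margin are all hypothetical; adjust them to your own docking set-up.

```python
def plan_docking_batches(n_ligands, file_limit, files_in_use=0,
                         files_per_ligand=2, safety_percent=80):
    """Return (batch_size, n_batches) that keep the file count under the limit."""
    # Only use part of the remaining quota, leaving headroom for other files.
    budget = (file_limit - files_in_use) * safety_percent // 100
    batch_size = budget // files_per_ligand
    if batch_size < 1:
        raise ValueError("not enough file quota left for even one ligand")
    # Ceiling division without floating point.
    n_batches = -(-n_ligands // batch_size)
    return batch_size, n_batches


# 1,000,000 ligands, a 1,000,000-file limit, 200,000 files already present.
print(plan_docking_batches(1000000, 1000000, files_in_use=200000))
# (320000, 4)
```

Each batch would then be docked, its per-ligand files archived or removed, and the next batch started.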

### 1. Register for access to CSD3 at https://www.hpc.cam.ac.uk/rcs-application .

For CSD3: The default for the team is the SL3 option (lower service level), which means access to 200,000 CPU core hours and 3000 GPU hours per quarter per PI. SL3 also has a per-job, per-user limit of 32 GPUs and additional job runtime limits.

### 2. Follow the user guide (https://docs.hpc.cam.ac.uk/hpc/user-guide/quickstart.html) to familiarize yourself with CSD3: logging in, getting the hang of SLURM and job submission (even though the code should already include the SLURM commands), and file transfers.

To see the state of the jobs, one possible command is:

In [2]:
# FOR CPU JOBS:
!gstatement -p vendruscolo-sl3-cpu
# FOR GPU JOBS:
!gstatement -p vendruscolo-sl3-gpu

       JobID      User    Account    JobName  Partition                 End ExitCode      State  CompHrs
------------ --------- ---------- ---------- ---------- ------------------- -------- ---------- --------


You can also find the output of a job in its *slurm-jobID.out* file, e.g. *slurm-63016496.out*. The status of a job can also be checked with the following command (if the job is still active)

In [None]:
!scontrol show job {JobID}

To see quota on space and number of files, run

In [7]:
!quota

Filesystem/Project    GB        quota     limit          grace           files    quota    limit   grace User/Grp/Proj
/home                 25.7       50.0      55.0                     -    ------- No File Quotas  ------- U:mb2462
/rds-d7              633.7     1099.5    1209.5                     -   190192  1048576  1048576       - P:44042


To see credits available, run

In [21]:
!mybalance

User           Usage |        Account     Usage | Account Limit Available (hours)
---------- --------- + -------------- --------- + ------------- ---------
mb2462        22,930 | VENDRUSCOLO-SL3-CPU   838,233 |       974,229   135,996
mb2462         5,974 | VENDRUSCOLO-SL3-GPU    30,607 |        33,518     2,911


### 3. Copy the directory with the code (including this jupyter notebook) to the system using the following <font color="red"> *scp* </font>command from your local console (similarly, copy any additional files you need).   

<font color="red">Note: </font> Ideally, keep big files in the <font color="red"> *rds*</font> disk space, as it has a 1TB limit. However, it also has a limit on the number of files (1,000,000). Remember, though, that RDS is not backed up.

<font color="red">For later: </font> Copying from the system to the local computer works similarly, just with the order switched. An alternative is the **rsync** command.

In [None]:
scp DD_code.zip user@login-icelake.hpc.cam.ac.uk:[destination]

### 4. Download required ZINC20 files using <font color="red"> *wget* </font> from https://files.docking.org/zinc20-ML/. 

Download the SMILES files as well as the fingerprints. To prevent confusion, make sure that all SMILES files (*smiles_all_{no}.txt*) are together in one directory, *library_prepared*, and all fingerprint files (*smiles_all_{no}.txt*) are together in another directory, *fingerprint*.
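
After downloading, a quick consistency check can catch missing or misplaced files. The sketch below assumes, based on the naming pattern above, that each SMILES file has a fingerprint file with the same name in the other directory; the function name is hypothetical.

```python
import os


def unmatched_library_files(smiles_dir, fingerprint_dir):
    """Return (files only in smiles_dir, files only in fingerprint_dir)."""
    smiles_files = set(os.listdir(smiles_dir))
    fingerprint_files = set(os.listdir(fingerprint_dir))
    missing_fingerprints = sorted(smiles_files - fingerprint_files)
    missing_smiles = sorted(fingerprint_files - smiles_files)
    return missing_fingerprints, missing_smiles
```

Calling `unmatched_library_files("library_prepared", "fingerprint")` should return two empty lists when every file has its counterpart.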

### 5. Set up a conda environment following https://docs.hpc.cam.ac.uk/hpc/software-tools/python.html#using-anaconda-python

<font color="red">Note: </font> Do not forget to activate the environment before every usage

### 6. (Alternative A) Install VINA from https://vina.scripps.edu/downloads/ . This is version 1.1.2. If a higher version or another program is required, change this accordingly.

### 6. (Alternative B. Optional) Install VINA-GPU
<font color="red">Note: </font> VINA takes quite a long time to dock molecules. If GPU power is available, try to set up and use **VINA GPU** (https://github.com/DeltaGroupNJUPT/Vina-GPU), or a newer version of it, following the steps they give for the Linux set-up. You can install boost and CUDA or load them as modules (CUDA is automatically loaded on q nodes). To find the paths they ask for, look at

In [None]:
# !module show module_name


The other two variables, GPU_PLATFORM and OPENCL_VERSION, can be left as they are. To achieve good precision, try setting *thread=8000* and *search_depth=10*, or adjust these values to your needs. The automatic code uses *thread=8000* and *search_depth=10*.

### 7. Install Open Babel from https://github.com/openbabel/openbabel/releases/tag/openbabel-3-1-1 (or Open Babel version of choice). 

On CSD3 (CentOS), this might have to be built from source; follow the instructions at https://open-babel.readthedocs.io/en/latest/Installation/install.html.

<font color="red">Note: </font> On *login-e-{no}* nodes, there could be issues with libboost. Fix these by loading the appropriate modules: *module load gcc* is required, as well as one of the boost gcc modules.

### 8. (Alternative A) Set up a jupyter notebook and forwarding as per user guide (https://docs.hpc.cam.ac.uk/hpc/software-packages/jupyter.html#running-jupyter)

<font color="red">Note: </font> As there is a typo in the docs, from local machine, use the following statement (with an appropriate login node, here *login-e-1* is used) instead

### 9. (Alternative A) Run the jupyter notebook as per guide, open this notebook and follow up on the previous section *Protocol summary and walk-through*

<font color="red">Note: </font> If the connection is lost or ended incorrectly, the port can remain in use, so you cannot re-establish a new connection on that port. To fix this, on the local machine, run

### 8. and 9. (Alternative B) Open this repository and work on the cluster via Visual Studio Code, then follow up on the previous section *Protocol summary and walk-through*

The protocol and this notebook can be viewed through it as well. 

Follow steps described on https://code.visualstudio.com/docs/remote/ssh