# Using HPC SLURM in the GRAPE evaluation pipelines

[SLURM](https://slurm.schedmd.com/) is one of the most widely used schedulers for distributing jobs across high-performance computing (HPC) clusters. For this reason, the GRAPE evaluation pipelines for node-label, edge-label and edge prediction support SLURM to parallelize the computation of the different holdouts.

## Parallelization logic
The parallelization logic we adopted is relatively straightforward:

1. First, we execute a preliminary step in which we retrieve the data of interest. This may include, for instance, downloading graphs through the graph retrieval, so that the various SLURM nodes do not all try to download them in parallel and can instead load the data directly.
2. Secondly, we run the SLURM script. In this step, we actually execute the pipeline and run the holdouts in parallel. Note that the number of SLURM nodes must be less than or equal to the number of holdouts you intend to run. You can also execute a grid search that runs a set of holdouts for each value: just pass to the pipeline a number of cluster nodes that is less than or equal to the number of holdouts.
3. Finally, we collect the results produced by the various nodes.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/slurm.jpg?raw=true" width="50%">
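The constraint in step 2 can be illustrated with a minimal sketch of one possible way to split holdouts across nodes. This is an illustration only and not GRAPE's actual implementation: the helper `holdouts_for_node` and its round-robin assignment are assumptions made for this example.

```python
def holdouts_for_node(node_id: int, number_of_holdouts: int, number_of_nodes: int) -> list:
    """Return the holdout indices a node would run under a round-robin split.

    Hypothetical helper, for illustration only: it is not part of GRAPE.
    """
    if number_of_nodes > number_of_holdouts:
        raise ValueError(
            "The number of nodes must be less than or equal to the number of holdouts."
        )
    if not 0 <= node_id < number_of_nodes:
        raise ValueError("The node ID must be in [0, number_of_nodes).")
    # Node `node_id` takes every `number_of_nodes`-th holdout, starting from its own ID.
    return list(range(node_id, number_of_holdouts, number_of_nodes))


# Split 10 holdouts across 5 nodes, the same values used later in this tutorial.
assignment = {
    node_id: holdouts_for_node(node_id, number_of_holdouts=10, number_of_nodes=5)
    for node_id in range(5)
}

for node_id, holdouts in assignment.items():
    print(f"node {node_id} -> holdouts {holdouts}")
```

Under such a split, requesting more nodes than holdouts would leave some nodes with nothing to run, which is why the node count is capped by the number of holdouts.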

### Setting up a virtual environment
While this is an obvious step for SLURM users, you will most likely need to set up a virtual environment. You can do this with multiple tools, such as `conda` or `venv`. Just use the one you are most comfortable with and install `grape` inside the environment by running `pip install grape`.

### Example of the data retrieval portion
In the following code we retrieve the STRING Homo Sapiens graph. In the bash script further below we refer to this code as `retrieve_data.py`.

```python
from grape.datasets.string import HomoSapiens

# We download the data.
_ = HomoSapiens()
```

### Example of distributed pipeline run
In the following section we execute the edge prediction evaluation pipeline to get the performance of an edge prediction Perceptron trained exclusively on the Jaccard Index. We will execute `NUMBER_OF_HOLDOUTS` holdouts and we expect to parallelize the task on `NUMBER_OF_SLURM_NODES` nodes. We will be referring to this code snippet as `run_pipeline.py` in the SLURM script below.

Of course, running this code snippet within this tutorial raises an error, as we are not within a SLURM cluster.

```python
from grape.datasets.string import HomoSapiens
from grape.edge_prediction import edge_prediction_evaluation, PerceptronEdgePrediction

NUMBER_OF_HOLDOUTS = 10
NUMBER_OF_SLURM_NODES = 5

# The number of SLURM nodes cannot exceed the number of holdouts.
assert NUMBER_OF_HOLDOUTS >= NUMBER_OF_SLURM_NODES

_ = edge_prediction_evaluation(
    holdouts_kwargs=dict(train_size=0.7),
    graphs=HomoSapiens().filter_from_ids(min_edge_weight=700),
    models=PerceptronEdgePrediction(),
    number_of_holdouts=NUMBER_OF_HOLDOUTS,
    number_of_slurm_nodes=NUMBER_OF_SLURM_NODES
)
```
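Since the pipeline snippet only works inside a SLURM job, it may be convenient to fail early with a clear message when the script is launched elsewhere. The minimal sketch below checks for the `SLURM_JOB_ID` environment variable, which SLURM sets for running jobs. The helper name `running_inside_slurm` is hypothetical and is not part of GRAPE.

```python
import os


def running_inside_slurm(environ=None) -> bool:
    """Return whether the given environment looks like a SLURM job.

    Hypothetical helper, for illustration only. It relies on SLURM
    exporting the `SLURM_JOB_ID` variable to the processes of a job.
    """
    if environ is None:
        environ = os.environ
    return "SLURM_JOB_ID" in environ


if not running_inside_slurm():
    print("Not inside a SLURM job: skipping the distributed pipeline run.")
```

A guard of this kind could be placed at the top of `run_pipeline.py`, before calling the evaluation pipeline.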

### Example of results collection
Finally, in this third step, we simply collect the results that the nodes have computed in the various holdouts using pandas and glob. We refer to this code as `collect_results.py` in the SLURM script below.

```python
from glob import glob
import pandas as pd

TASK_NAME = "Edge Prediction"
RESULTS_PATH = "results.csv.gz"

pd.concat([
    pd.read_csv(path, index_col=0)
    for path in glob(
        "experiments/{task_name}/*/holdout_*/*.csv.gz".format(task_name=TASK_NAME)
    )
]).to_csv(RESULTS_PATH, index=False)
```
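Before concatenating the results, it can be useful to check that every holdout actually produced a file, since a node that failed or timed out would otherwise go unnoticed. The sketch below is only a possible approach: the helper `missing_holdouts` is hypothetical, and it assumes that the suffix of each `holdout_*` directory is the integer holdout number, counted from zero, which should be verified against your own `experiments` directory.

```python
import re
from glob import glob

# Assumption: directories are named `holdout_<integer>`.
HOLDOUT_PATTERN = re.compile(r"holdout_(\d+)")


def missing_holdouts(paths, number_of_holdouts: int) -> list:
    """Return the holdout numbers for which no result path was found.

    Hypothetical helper, for illustration only: it is not part of GRAPE.
    """
    found = set()
    for path in paths:
        match = HOLDOUT_PATTERN.search(path)
        if match is not None:
            found.add(int(match.group(1)))
    return sorted(set(range(number_of_holdouts)) - found)


paths = glob("experiments/Edge Prediction/*/holdout_*/*.csv.gz")
missing = missing_holdouts(paths, number_of_holdouts=10)
if missing:
    print(f"No results found for holdouts: {missing}")
```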

## The SLURM bash script
One extremely important ingredient is the actual SLURM script used to launch these jobs.
Here is a generic example that you may want to adapt:

```bash
#!/bin/bash
#SBATCH --job-name=name_of_your_experiment
#SBATCH --output=name_of_your_experiment.out
# The number of tasks per node should always be one, as we already parallelize within the pipeline.
#SBATCH --ntasks-per-node=1
# The number of nodes, which should be <= number of holdouts
# This is generally the same as `NUMBER_OF_SLURM_NODES` from the `run_pipeline.py` script
# but may be much higher when you are running some other layer of parallelization, such
# as when you are running a grid search.
#SBATCH --nodes=5
# RAM to be used, just set a reasonable amount for your task
#SBATCH --mem=16GB
# Computation time to be used, just set a reasonable amount for your task
#SBATCH --time=24:00:00
# Number of processing cores to be used per node, just set a reasonable amount for your task
#SBATCH --cpus-per-task=16
# We want to wait for the script to complete running.
#SBATCH --wait

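# Remember to activate your Python 3.7+ virtual environment.
# The activation script must be sourced, not executed, to affect this shell.
source ./path/to/your/virtual/environment/activate

python3 retrieve_data.py

for i in $(seq "$SLURM_NNODES"); do
    srun --ntasks=1 --nodes=1 --exclusive python3 run_pipeline.py &
done

# Wait for all the background srun steps to finish before collecting the results.
wait

python3 collect_results.py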
```

If we call this file `slurm_script.sh`, we can launch it with:

```bash
sbatch slurm_script.sh
```

Good luck!