Structural biology - Practical day 01
=======================================

Part 01
-------

Document written by [Adrián Diaz](mailto:adrian.diaz@vub.be) & [David Bickel](mailto:david.bickel@vub.be)

**Vrije Universiteit Brussel**

## The scenario

The following proteins are involved:

- Gyrase (dimer) `Gyr:Gyr`
- Toxin (dimer) `CcdB:CcdB`
- Anti-toxin `CcdA`

### Sequences

The code block below gives the amino acid sequences of the toxin (CcdB) and the anti-toxin (CcdA) in FASTA format. You will use them to create the prediction job.

```fasta
> CcdA
MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW

> CcdB
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
```

### Task A
Predict the following complexes using ColabFold:

- Group A: `CcdB:CcdB:CcdA`
- Group B: `CcdB:CcdB`


## ColabFold job parameters

### About CSV input format
ColabFold accepts CSV input files with an `id` column and a `sequence` column. To define a multimer, join the sequences of its chains with the `:` symbol.

```csv
id,sequence
monomer_A,DSYQLLKAYDVNISGL
dimer_A_A,DSYQLLKAYDVNISGL:DSYQLLKAYDVNISGL
complex_A_A_B,DSYQLLKAYDVNISGL:DSYQLLKAYDVNISGL:LKAYDVNISGL
```
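
As a minimal sketch, the two Task A inputs could be assembled from the sequences listed earlier as shown below. The helper `build_colabfold_csv` and the identifiers `group_a` and `group_b` are illustrative names chosen for this example, not something ColabFold requires.

```python
import csv
import io

# Sequences copied from the FASTA block in the "Sequences" section.
CCDA = "MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW"
CCDB = "MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI"


def build_colabfold_csv(complexes):
    """Return CSV text with one row per complex; chains are joined with ':'."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "sequence"])
    for name, chains in complexes.items():
        writer.writerow([name, ":".join(chains)])
    return buffer.getvalue()


task_a_input = build_colabfold_csv({
    "group_a": [CCDB, CCDB, CCDA],  # CcdB:CcdB:CcdA
    "group_b": [CCDB, CCDB],        # CcdB:CcdB
})

# Read the text back to check how many chains each row defines.
for row in csv.DictReader(io.StringIO(task_a_input)):
    print(row["id"], "->", len(row["sequence"].split(":")), "chains")
```

The resulting text has the same shape as the `prediction_input` string used in the next cell, so it could be assigned to that variable directly.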

In [None]:
import subprocess

job_name         = "structbio2022_template"
input_name       = "default_input.csv"
output_path      = "structbio2022_template"
prediction_input = """id,sequence
ccda_ccdA,MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW:MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW
"""

In [None]:
!mkdir -p ./input
!mkdir -p ./output

In [None]:
import os
input_path = os.path.abspath(os.path.join('./input', input_name))

print("Saving input file in", input_path)

with open(input_path, "w") as file_handler:
    file_handler.write(prediction_input)

### About Slurm jobs in Hydra

Slurm provides a set of commands to manage and monitor your jobs. Two of the most common tasks are:

- submitting job scripts to the queue (`sbatch`)
- printing information about the queue (`mysqueue`)

Jobs are described in Bash scripts whose header contains `#SBATCH` options such as:

- `--job-name=job_name` : Set job name to job_name
- `--time=DD-HH:MM:SS`: Define the time limit
- `--mail-type=BEGIN|END|FAIL|REQUEUE|ALL`: Conditions for sending alerts by email
- `--partition=cluster_type`: Request a specific partition (a group of nodes of the same type)
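
To make the link between these options and the job file concrete, here is a small sketch that renders a dictionary of options as `#SBATCH` header lines. The function name `sbatch_header` is hypothetical; the option values are the ones used in the job script later in this notebook.

```python
def sbatch_header(options):
    """Render a dict of long-option names and values as '#SBATCH' lines."""
    lines = ["#!/bin/bash"]
    for name, value in options.items():
        lines.append(f"#SBATCH --{name}={value}")
    return "\n".join(lines) + "\n"


header = sbatch_header({
    "job-name": "structbio2022_template",
    "time": "01:30:00",
    "mail-type": "ALL",
    "partition": "ampere_gpu",
})
print(header)
```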


More information: https://hpc.vub.be/docs/job-submission/

### About Singularity
Usually, Hydra provides the software you need as _modules_ configured by VSC staff. For instance, AlphaFold is available as a module.

- To list modules already loaded: `module list`
- To search a module: `module spider NAME`
- To load a module: `module load NAME`

However, when an application is not yet available as a module, you can use a container as a workaround. Containerization wraps a software component in an isolated, portable file (called an image) that has all of its requirements installed, so it can be executed independently of the host machine.

**Hydra supports Singularity as the official container provider.**

Today, we are going to use an image stored in the Singularity cloud: `agdiaz/bio2byte/colabfold:1.3.0`.

```bash
singularity run \
    --contain \
    --no-home \
    --nv \
    --bind $BASE_DIR:/data,$TEMP_DIR:/tmp \
    library://agdiaz/bio2byte/colabfold:1.3.0 /bin/bash -c "COMMAND"
```

Note the two bindings set with `--bind`:

- `$BASE_DIR` in our VSC filesystem is mapped to `/data` inside the container.
- `$TEMP_DIR` in our VSC filesystem is mapped to `/tmp` inside the container.
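
The effect of the bindings can be illustrated with a short sketch that translates a path on the host into the path the container sees. The function `to_container_path` and the host directories below are hypothetical and serve only as an illustration of the mapping; Singularity itself does this translation for you.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, binds):
    """Translate a host path to the path seen inside the container.

    `binds` maps host directories to container mount points, mirroring
    the HOST:CONTAINER pairs given to --bind.
    """
    host_path = PurePosixPath(host_path)
    for host_dir, container_dir in binds.items():
        try:
            relative = host_path.relative_to(host_dir)
        except ValueError:
            continue  # host_path is not located under this bound directory
        return str(PurePosixPath(container_dir) / relative)
    raise ValueError(f"{host_path} is not inside any bound directory")


# Hypothetical host directories standing in for $BASE_DIR and $TEMP_DIR.
binds = {
    "/scratch/user/practical": "/data",
    "/scratch/user/.singularity_tmp": "/tmp",
}
print(to_container_path("/scratch/user/practical/input/default_input.csv", binds))
```

This is why the ColabFold command refers to `/data/input/...` and `/data/output/...` rather than to paths on the VSC filesystem.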

More about Containers in Hydra: https://docs.vscentrum.be/en/latest/software/singularity.html

### About ColabFold

`colabfold_batch` takes a number of optional parameters followed by two positional arguments: the input path and the output directory.

```bash
colabfold_batch \
    --save-pair-representations \
    --save-single-representations \
    --amber \
    --templates \
    --data /data \
    --use-gpu-relax \
    --num-recycle 3 \
    --model-type AlphaFold2-multimer-v2 \
    /data/input/$INPUT_FILE \
    /data/output/$OUTPUT_DIR
```
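
The same command can be assembled programmatically. The sketch below is one possible way to do it, assuming the container paths `/data/input` and `/data/output` shown above; the function `colabfold_command` and its keyword arguments are hypothetical, and only flags that appear in this notebook are used.

```python
import shlex


def colabfold_command(input_file, output_dir, num_recycle=3,
                      model_type="AlphaFold2-multimer-v2",
                      use_amber=True, use_templates=True):
    """Build the colabfold_batch argument list; positional arguments go last."""
    args = ["colabfold_batch"]
    if use_amber:
        # Relax the predicted structures with Amber, running on the GPU.
        args += ["--amber", "--use-gpu-relax"]
    if use_templates:
        args.append("--templates")
    args += [
        "--data", "/data",
        "--num-recycle", str(num_recycle),
        "--model-type", model_type,
    ]
    # Input file and output directory as seen from inside the container.
    args += [f"/data/input/{input_file}", f"/data/output/{output_dir}"]
    return args


command = colabfold_command("default_input.csv", "structbio2022_template")
print(shlex.join(command))
```

Keeping the arguments in a list and joining them with `shlex.join` avoids quoting mistakes when the command is later embedded in a shell string.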

#### Available parameters

```text
positional arguments:
  input                 Can be one of the following: Directory with fasta/a3m
                        files, a csv/tsv file, a fasta file or an a3m file
  results               Directory to write the results to

optional arguments:
  -h, --help            show this help message and exit
  --stop-at-score STOP_AT_SCORE
                        Compute models until plddt (single chain) or ptmscore
                        (complex) > threshold is reached. This can make
                        colabfold much faster by only running the first model
                        for easy queries.
  --stop-at-score-below STOP_AT_SCORE_BELOW
                        Stop to compute structures if plddt (single chain) or
                        ptmscore (complex) < threshold. This can make
                        colabfold much faster by skipping sequences that do
                        not generate good scores.
  --num-recycle NUM_RECYCLE
                        Number of prediction cycles.Increasing recycles can
                        improve the quality but slows down the prediction.
  --num-ensemble NUM_ENSEMBLE
                        Number of ensembles.The trunk of the network is run
                        multiple times with different random choices for the
                        MSA cluster centers.
  --random-seed RANDOM_SEED
                        Changing the seed for the random number generator can
                        result in different structure predictions.
  --num-models {1,2,3,4,5}
  --recompile-padding RECOMPILE_PADDING
                        Whenever the input length changes, the model needs to
                        be recompiled, which is slow. We pad sequences by this
                        factor, so we can e.g. compute sequence from length
                        100 to 110 without recompiling. The prediction will
                        become marginally slower for the longer input, but
                        overall performance increases due to not recompiling.
                        Set to 1 to disable.
  --model-order MODEL_ORDER
  --host-url HOST_URL
  --data DATA
  --msa-mode {MMseqs2 (UniRef+Environmental),MMseqs2 (UniRef only),single_sequence}
                        Using an a3m file as input overwrites this option
  --model-type {auto,AlphaFold2-ptm,AlphaFold2-multimer-v1,AlphaFold2-multimer-v2}
                        predict strucutre/complex using the following
                        model.Auto will pick "AlphaFold2" (ptm) for structure
                        predictions and "AlphaFold2-multimer-v2" for
                        complexes.
  --amber               Use amber for structure refinement
  --templates           Use templates from pdb
  --custom-template-path CUSTOM_TEMPLATE_PATH
                        Directory with pdb files to be used as input
  --env
  --cpu                 Allow running on the cpu, which is very slow
  --rank {auto,plddt,ptmscore,multimer}
                        rank models by auto, plddt or ptmscore
  --pair-mode {unpaired,paired,unpaired+paired}
                        rank models by auto, unpaired, paired, unpaired+paired
  --recompile-all-models
                        recompile all models instead of just model 1 ane 3
  --sort-queries-by {none,length,random}
                        sort queries by: none, length, random
  --save-single-representations
                        saves the single representation embeddings of all
                        models
  --save-pair-representations
                        saves the pair representation embeddings of all models
  --training            turn on training mode of the model to activate drop
                        outs
  --max-msa {512:5120,512:1024,256:512,128:256,64:128,32:64,16:32}
                        defines: `max_msa_clusters:max_extra_msa` number of
                        sequences to use
  --zip                 zip all results into one <jobname>.result.zip and
                        delete the original files
  --use-gpu-relax       run amber on GPU instead of CPU
  --overwrite-existing-results
```

- Available Colab Notebook: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb
- Official repository: https://github.com/sokrypton/ColabFold

In [None]:
prediction_job = """#!/bin/bash 
# General parameters:
#SBATCH --job-name={job_name}
#SBATCH --time=01:30:00
#SBATCH --mail-type=ALL

# Resources: (1x Nvidia A100 GPU node, 40GB GPU memory, 1x processor, 16x cpu-cores, 192G cpu RAM)
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --mem-per-cpu=12G
#SBATCH --reservation=structbio1

# Script:

echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Starting job on VUB-HPC's (Hydra) GPU clusters AMPERE_GPU (Nvidia A100)"

BASE_DIR={wrkdir}
mkdir -p $BASE_DIR
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Base directory: $BASE_DIR"

mkdir -p $BASE_DIR/input
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Directory for input files: $BASE_DIR/input"
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Sequences to predict will be read from: $BASE_DIR/input/{input_name}"

mkdir -p $BASE_DIR/output/{output_path}
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Results will be available inside directory: $BASE_DIR/output/{output_path}"

CACHE_DIR=$VSC_SCRATCH/.singularity_cache
mkdir -p $CACHE_DIR
export SINGULARITY_CACHEDIR=$CACHE_DIR
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Singularity cache directory: $CACHE_DIR"

TEMP_DIR=$VSC_SCRATCH/.singularity_tmp
mkdir -p $TEMP_DIR
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Singularity tmp directory: $TEMP_DIR"

mkdir -p $BASE_DIR/data
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - ColabFold will download parameters files inside directory (~6GB): $BASE_DIR/data"

echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - All ready to start execution using Singularity image"

singularity run --contain --no-home --nv --bind $BASE_DIR:/data,$TEMP_DIR:/tmp library://agdiaz/bio2byte/colabfold:1.3.0 /bin/bash -c "source activate /colabfold_batch/colabfold-conda && colabfold_batch --save-pair-representations --save-single-representations --amber --templates --data /data --use-gpu-relax --num-recycle 3 --model-type AlphaFold2-multimer-v2 /data/input/{input_name} /data/output/{output_path}"

STATUS=$?
echo "[ColabFold on Hydra from Singularity image] - $(date --rfc-3339=seconds) - Job finished with exit code $STATUS"
exit $STATUS
""".format(job_name=job_name, input_name=input_name, output_path=output_path, wrkdir=os.getcwd())

In [None]:
job_path = os.path.abspath(job_name + ".sh")

print("Saving job file in", job_path)

with open(job_path, "w") as file_handler:
    file_handler.write(prediction_job)

print(prediction_job)

## Job enqueueing

The job is submitted using the command `sbatch` followed by the name of our job file. You will receive a job identifier as output.

In [None]:
process = subprocess.Popen(['sbatch', job_path],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
retcode        = process.poll()

print(f"stdout (exit code={retcode}):")
for line in stdout.decode().split("\n"):
    print(line)

print(f"stderr (exit code={retcode}):")
for line in stderr.decode().split("\n"):
    print(line)
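
The job identifier can be extracted from the captured output for later use. The sketch below assumes that `sbatch` reports a successful submission with a line of the form `Submitted batch job <ID>`, which is the usual Slurm message; the helper `parse_job_id` is a hypothetical name, and the identifier in the example is made up.

```python
import re


def parse_job_id(sbatch_stdout):
    """Return the job ID found in sbatch output, or None if there is none."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else None


# Made-up output for illustration; in this notebook you would pass stdout.decode().
print(parse_job_id("Submitted batch job 1234567\n"))
```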

### Query the queue status
Run `mysqueue` to list your jobs. While the job is running, you can follow the Slurm log in the `slurm-JOBID.out` file:

- `cat slurm-JOBID.out`: print the full content of the file.
- `tail -f slurm-JOBID.out`: print the last lines of the file; with `-f`, the command keeps following the output as it grows.
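
If you prefer to stay inside the notebook, the last lines of the log can also be read from Python. This is a minimal sketch: the helpers `slurm_log_path` and `tail` are hypothetical names, the log is assumed to be in the directory you pass in, and the demonstration writes a made-up log file to a temporary directory instead of reading a real one.

```python
import tempfile
from collections import deque
from pathlib import Path


def slurm_log_path(job_id, directory="."):
    """Return the path of the Slurm log file, slurm-JOBID.out."""
    return Path(directory) / f"slurm-{job_id}.out"


def tail(path, n=10):
    """Return the last n lines of a text file, similar to `tail -n`."""
    with open(path, "r", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


# Demonstration on a made-up log file with 20 lines.
with tempfile.TemporaryDirectory() as tmp:
    log = slurm_log_path(1234567, tmp)
    log.write_text("".join(f"line {i}\n" for i in range(1, 21)))
    print("\n".join(tail(log, n=3)))
```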

In [None]:
process = subprocess.Popen(['mysqueue'],
                     stdout=subprocess.PIPE, 
                     stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
retcode        = process.poll()

print(f"stdout (exit code={retcode}):")
for line in stdout.decode().split("\n"):
    print(line)

print(f"stderr (exit code={retcode}):")
for line in stderr.decode().split("\n"):
    print(line)

We are ready to continue working on the next Jupyter Notebook!