Structural biology - Practical day 01
=======================================

Part 01
-------

Document written by [Adrián Diaz](mailto:adrian.diaz@vub.be) & [David Bickel](mailto:david.bickel@vub.be)

**Vrije Universiteit Brussel**

## The scenario

The following proteins are involved:

- Gyrase (dimer) `Gyr:Gyr`
- Toxin (dimer) `CcdB:CcdB`
- Anti-toxin `CcdA`

### Sequences

The following code block contains the amino acid sequences of the toxin and anti-toxin proteins in FASTA format. You will use them to create the prediction job.

```fasta
>CcdA
MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW

>CcdB
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
```
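
Before building a job, it can help to confirm that the sequences were copied without truncation. The sketch below parses the FASTA text above and reports the length of each chain. The helper name `parse_fasta` is made up for this illustration and is not part of AlphaFold or ColabFold.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


fasta_text = """
>CcdA
MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW

>CcdB
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
"""

sequences = parse_fasta(fasta_text)
for name, sequence in sequences.items():
    print(f"{name}: {len(sequence)} residues")
```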

### Task A
Predict the following complexes using ColabFold:

- Group A: `CcdB:CcdB:CcdA`
- Group B: `CcdB:CcdB`
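
A multimer input lists every chain copy as its own FASTA record, so a homodimer such as `CcdB:CcdB` contains the same sequence twice. As a minimal sketch, the input text for both groups could be generated from a stoichiometry dictionary. The function `build_multimer_fasta` and the `_1`, `_2` suffixes are illustrative choices, not a requirement of AlphaFold.

```python
def build_multimer_fasta(sequences, stoichiometry):
    """Return FASTA text with one record per chain copy."""
    records = []
    for name, copies in stoichiometry.items():
        for index in range(1, copies + 1):
            # Each copy gets a unique header such as CcdB_1, CcdB_2
            records.append(f">{name}_{index}\n{sequences[name]}")
    return "\n\n".join(records) + "\n"


sequences = {
    "CcdA": "MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW",
    "CcdB": "MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI",
}

group_a_input = build_multimer_fasta(sequences, {"CcdB": 2, "CcdA": 1})
group_b_input = build_multimer_fasta(sequences, {"CcdB": 2})

print(group_a_input)
```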


## AlphaFold2 Job
AlphaFold is installed for the Nvidia Ampere GPUs in Hydra. You only need the sequence file (FASTA format) of the protein for your job. VUB-HPC provides the full datasets for AlphaFold in a central location accessible to all users. The software modules of AlphaFold are configured to use the datasets in this shared central location by default.

#### Group A

`CcdB:CcdB:CcdA`

In [None]:
prediction_input = """
>CcdB_1
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI

>CcdB_2
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI

>CcdA_1
MKQRITVTVDSDSYQLLKAYDVNISGLVSTTMQNEARRLRAERWKAENQEGMAEVARFIEMNGSFADENRDW
"""

#### Group B

`CcdB:CcdB`

In [None]:
prediction_input = """
>CcdB_1
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI

>CcdB_2
MQFKVYTYKRESRYRLFVDVQSDIIDTPGRRMVIPLASARLLSDKVSRELYPVVHIGDESWRMMTTDMASVPVSVIGEEVADLSHRENDIKNAINLMFWGI
"""

#### Creating the required directories and files

First the input/output directories:

In [None]:
%%bash
# Create input directory
mkdir -p ./input

# Create output directory
mkdir -p ./output

Now the input FASTA file:

In [None]:
import os

input_path = os.path.abspath(os.path.join('./input', "multimer.fasta"))

print("Saving input file in", input_path)

with open(input_path, "w") as file_handler:
    file_handler.write(prediction_input)

print("Input file saved in", input_path)

And finally, the job script:

In [None]:
prediction_job = """#!/bin/bash

# SCRIPT RESOURCE CONFIG:
#SBATCH --partition=ampere_gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --cpus-per-gpu=16
#SBATCH --job-name=structbio_multimer
#SBATCH --time=02:00:00
#SBATCH --reservation=structbio1

# LOADING CUDA (GPU) DEPENDENCIES:
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-pipe
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-mps-log
nvidia-cuda-mps-control -d

# LOADING ALPHAFOLD IN MEMORY:
module load AlphaFold/2.3.4-foss-2022a-CUDA-11.7.0-ColabFold

# SETUP INPUT/OUTPUT FILES
BASE_DIR={base_dir}
INPUT_FASTA=$BASE_DIR/input/multimer.fasta
OUTPUT_DIR=$BASE_DIR/output

# LAUNCH ALPHAFOLD
run_alphafold.py \\
    --model_preset=multimer \\
    --fasta_paths=$INPUT_FASTA \\
    --max_template_date=1900-01-01 \\
    --db_preset=reduced_dbs \\
    --models_to_relax=best \\
    --enable_gpu_relax=true \\
    --use_precomputed_msas=true \\
    --num_multimer_predictions_per_model=1 \\
    --output_dir=$OUTPUT_DIR
""".format(base_dir=os.getcwd())

About AlphaFold flags:

**Templates:**

The `--max_template_date` flag is mandatory when running AlphaFold. If you are predicting the structure of a protein that is already in PDB and you wish to avoid using it as a template, then max_template_date must be set to be before the release date of the structure.
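
To illustrate the rule above, the following sketch derives a cutoff one day before a structure's release date. The release date used here is invented for the example; look up the real date of your structure in the PDB. The helper `template_cutoff` is hypothetical and not part of AlphaFold.

```python
from datetime import date, timedelta


def template_cutoff(release_date):
    """Return a date string one day before the given release date."""
    return (release_date - timedelta(days=1)).isoformat()


# Invented release date, for illustration only
example_release = date(2020, 5, 14)

flag = f"--max_template_date={template_cutoff(example_release)}"
print(flag)  # --max_template_date=2020-05-13
```

The job script in this notebook uses `1900-01-01`, which predates every PDB entry.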

**Model:**

- `monomer`: This is the original model used at CASP14 with no ensembling.
- `monomer_casp14`: This is the original model used at CASP14 with `num_ensemble=8`, matching the CASP14 configuration. It is largely provided for reproducibility, as it is 8x more computationally expensive for a limited accuracy gain (+0.1 average GDT gain on CASP14 domains).
- `monomer_ptm`: This is the original CASP14 model fine tuned with the pTM head, providing a pairwise confidence measure. It is slightly less accurate than the normal monomer model.
- `multimer`: This is the AlphaFold-Multimer model. To use this model, provide a multi-sequence FASTA file. In addition, the UniProt database should have been downloaded.

**DB:**

You can control the MSA speed/quality tradeoff by adding `--db_preset=reduced_dbs` or `--db_preset=full_dbs` to the run command. The following presets are available:

- `reduced_dbs`: This preset is optimized for speed and lower hardware requirements. It runs with a reduced version of the BFD database. It requires 8 CPU cores (vCPUs), 8 GB of RAM, and 600 GB of disk space.
- `full_dbs`: This runs with all genetic databases used at CASP14.

**Relaxation:**

After generating the predicted models, AlphaFold runs a relaxation step to improve local geometry. By default, only the best model (by pLDDT) is relaxed (`--models_to_relax=best`). Alternatively, all models (`--models_to_relax=all`) or none of them (`--models_to_relax=none`) can be relaxed.

The relaxation step can run on the GPU (faster, but potentially less stable) or on the CPU (slower, but stable). This is controlled with `--enable_gpu_relax=true` (default) or `--enable_gpu_relax=false`.

**MSA:**

AlphaFold can re-use MSAs (multiple sequence alignments) for the same sequence via the `--use_precomputed_msas=true` option, which is useful when trying different AlphaFold parameters. This option assumes that the directory structure generated by the first AlphaFold run still exists in the output directory and that the protein sequence is unchanged.
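
Whether a previous run left MSAs behind can be checked before submitting a new job. The sketch below assumes that AlphaFold writes its results to `<output_dir>/<fasta_name>/msas`, where `<fasta_name>` is the FASTA file name without extension. Treat that layout as an assumption and verify it against your own output directory. The function `has_precomputed_msas` is a hypothetical helper.

```python
import os


def has_precomputed_msas(output_dir, fasta_path):
    """Check for a non-empty msas directory from an earlier run (assumed layout)."""
    fasta_name = os.path.splitext(os.path.basename(fasta_path))[0]
    msa_dir = os.path.join(output_dir, fasta_name, "msas")
    return os.path.isdir(msa_dir) and len(os.listdir(msa_dir)) > 0


# Paths match the directories created earlier in this notebook
print(has_precomputed_msas("./output", "./input/multimer.fasta"))
```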

**Seed:**

By default, the multimer system runs 5 seeds per model (25 predictions in total). For a small drop in accuracy, you may wish to run a single seed per model. This is done via the `--num_multimer_predictions_per_model` flag, e.g. `--num_multimer_predictions_per_model=1`.
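
The arithmetic behind these numbers is simple: the total is the number of models times the number of seeds per model. A small sketch, assuming five multimer models as implied by the figures above:

```python
NUM_MULTIMER_MODELS = 5  # assumption: 25 total predictions / 5 seeds per model


def total_predictions(predictions_per_model, num_models=NUM_MULTIMER_MODELS):
    """Number of structures produced for a given seeds-per-model setting."""
    return num_models * predictions_per_model


print(total_predictions(5))  # default setting: 25
print(total_predictions(1))  # setting used in this practical: 5
```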

Now we can save the script into the job file:

In [None]:
job_name = "structbio_multimer.sh"
job_path = os.path.join("./input", job_name)

print("Saving job file in", job_path)

with open(job_path, "w") as file_handler:
    file_handler.write(prediction_job)

print("Job file saved in", job_path)

### About Slurm jobs in Hydra

Slurm provides a complete toolbox to manage and control your jobs. Some of its tools carry out common tasks:

- submitting job scripts to the queue (`sbatch`)
- printing information about the queue (`mysqueue`)

Jobs are described in Bash scripts with these `#SBATCH` header options:

- `--job-name=job_name`: Set the job name to `job_name`
- `--time=DD-HH:MM:SS`: Define the time limit
- `--mail-type=BEGIN|END|FAIL|REQUEUE|ALL`: Conditions for sending alerts by email
- `--partition=partition_name`: Request a partition (a group of nodes, such as `ampere_gpu`)


More information: https://hpc.vub.be/docs/job-submission/
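
As a rough illustration of how these options end up in a job file, the sketch below renders a dictionary of options as `#SBATCH` lines. The helper `sbatch_header` is hypothetical; the option values are taken from the job script used in this notebook.

```python
def sbatch_header(options):
    """Render a dictionary of Slurm options as the header of a Bash job script."""
    lines = ["#!/bin/bash"]
    for key, value in options.items():
        lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines) + "\n"


options = {
    "job-name": "structbio_multimer",
    "time": "02:00:00",
    "partition": "ampere_gpu",
    "nodes": 1,
}

header = sbatch_header(options)
print(header)
```

Slurm reads `#SBATCH` lines only at the top of the script, before the first executable command, so the header must come first.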

## Job enqueueing

The job is submitted using the command `sbatch` followed by the name of our job file. You will receive a job identifier as output.

In [None]:
import subprocess

print("Submitting job file saved in", job_path)

process = subprocess.Popen(['sbatch', job_path],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)

stdout, stderr = process.communicate()
return_code    = process.poll()

print(f"stdout (exit code={return_code}):")
for line in stdout.decode().split("\n"):
    print(line)

print(f"stderr (exit code={return_code}):")
for line in stderr.decode().split("\n"):
    print(line)
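
On success, `sbatch` normally prints a line of the form `Submitted batch job <id>`. A minimal sketch for extracting that identifier, so it can be reused to locate the log file, is shown below. The example output string and its job ID are made up, and `parse_job_id` is a hypothetical helper.

```python
import re


def parse_job_id(sbatch_stdout):
    """Extract the numeric job ID from sbatch output, or None if absent."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    return int(match.group(1)) if match else None


# Made-up example output; in the notebook you would pass stdout.decode()
example_stdout = "Submitted batch job 1234567\n"

job_id = parse_job_id(example_stdout)
print(job_id)
print(f"slurm-{job_id}.out")
```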

### Query the queue status
The command to run is `mysqueue`. While the job is running, you can inspect the Slurm log in the `slurm-JOBID.out` file:

- Cat command: View the full content of the file. `cat slurm-JOBID.out`
- Tail command: View the last lines of the file. `tail -f slurm-JOBID.out` (with `-f`, the command keeps following the output as the file grows).

In [None]:
import subprocess

process = subprocess.Popen(['mysqueue'],
                     stdout=subprocess.PIPE,
                     stderr=subprocess.PIPE)

stdout, stderr = process.communicate()
return_code    = process.poll()

print(f"stdout (exit code={return_code}):")
for line in stdout.decode().split("\n"):
    print(line)

print(f"stderr (exit code={return_code}):")
for line in stderr.decode().split("\n"):
    print(line)
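
The log file can also be inspected from Python instead of the shell. The sketch below mimics `tail` (without `-f`) by keeping only the last lines of a file. It runs on a throwaway file so that it works anywhere; in practice you would pass the path of your own `slurm-JOBID.out` file. The function name `tail` is an illustrative choice.

```python
import os
import tempfile
from collections import deque


def tail(path, num_lines=10):
    """Return the last num_lines lines of a text file."""
    with open(path) as handle:
        # A bounded deque keeps only the most recent lines while reading
        return [line.rstrip("\n") for line in deque(handle, maxlen=num_lines)]


# Demonstration on a temporary file with 20 numbered lines
with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False) as tmp:
    tmp.write("\n".join(f"line {i}" for i in range(1, 21)) + "\n")

print(tail(tmp.name, num_lines=3))  # ['line 18', 'line 19', 'line 20']
os.remove(tmp.name)
```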

We are ready to continue working on the next Jupyter Notebook!