Usage on Various Compute Clusters

This document presents step-by-step instructions for installing and training Saber on various compute clusters.

Compute Canada

These instructions are written for the Béluga cluster in particular, but usage across all Compute Canada (CC) clusters should be nearly identical.

Installation

Start by SSH'ing into a login node, e.g.

$ ssh <username>@beluga.computecanada.ca

Then clone the repo into your project folder:

# "def-someuser" will be the group you belong to
$ PROJECT_DIR=~/projects/<def-someuser>/<username>
$ cd $PROJECT_DIR
$ git clone https://github.com/BaderLab/saber.git
$ cd saber

Next, we will create a virtual environment and install the package and all of its dependencies. Note that you only need to do this once.

# Path to where the environment will be created
$ ENVDIR=~/saber

# Create a virtual environment
$ module load python/3.7 cuda/10.0
$ virtualenv --no-download $ENVDIR
$ source $ENVDIR/bin/activate
(saber) $ pip install --upgrade pip

# Packages available in the CC wheelhouse
(saber) $ pip install scikit-learn torch pytorch_transformers Keras-Preprocessing spacy nltk neuralcoref --no-index

# Install Saber
(saber) $ git checkout development
(saber) $ pip install -e .

# Download and install a spaCy model (OPTIONAL; not required for training)
(saber) $ python -m spacy download en_core_web_md
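
# OPTIONAL sanity check (an assumed extra step): verify the spaCy model loads
(saber) $ python -c "import spacy; spacy.load('en_core_web_md')"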

# Install seqeval fork (TEMPORARY)
(saber) $ pip install git+https://github.com/JohnGiorgi/seqeval.git

# Install Apex (OPTIONAL)
(saber) $ module load gcc/7.3.0
(saber) $ git clone https://github.com/NVIDIA/apex
(saber) $ cd apex
(saber) $ python setup.py install --cpp_ext --cuda_ext
# Return to the saber directory so later files land in the right place
(saber) $ cd ..

# Keep track of all requirements in this env so it can be recreated (OPTIONAL)
(saber) $ pip freeze > cc_requirements.txt
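
Later, you can recreate the environment from this file. A minimal sketch, assuming every pinned package is available in the CC wheelhouse (packages installed from GitHub, like the seqeval fork above, would need to be reinstalled separately):

$ module load python/3.7 cuda/10.0
$ virtualenv --no-download $ENVDIR
$ source $ENVDIR/bin/activate
(saber) $ pip install --no-index -r cc_requirements.txt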

Downloading the datasets

Make a directory to store your datasets, e.g.

(saber) $ mkdir $PROJECT_DIR/saber/datasets

Place any datasets you would like to train on in this folder.
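
For example, to copy a dataset from your local machine into this folder, you could use scp (a sketch; path/to/my_dataset is a placeholder):

$ scp -r path/to/my_dataset <username>@beluga.computecanada.ca:projects/<def-someuser>/<username>/saber/datasets/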

Download a BERT model

Because the compute nodes do not have internet access, you will need to download a BERT model on a login node. Note that you only have to do this once.

  • If you want to use the default BERT model (BioBERT v1.1; recommended), simply start a training session and cancel it as soon as training begins:

(saber) $ python -m saber.cli.train --dataset_folder path/to/dataset

  • If you want to use one of the BERT models from pytorch-transformers (see the pytorch-transformers documentation for a list of pre-trained BERT models), first set saber.constants.PRETRAINED_BERT_MODEL to your model's name, then start and cancel a training session as above (see the sketch after this list).

  • If you want to supply your own model, simply set saber.constants.PRETRAINED_BERT_MODEL to your model's path on disk. There is no need to run a training session.
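
For the second option, one way to set the constant is to edit it in place. A sketch, assuming the constant is defined as a simple assignment in saber/constants.py (editing the file by hand works just as well; bert-base-cased is a stand-in model name):

(saber) $ sed -i 's/^PRETRAINED_BERT_MODEL = .*/PRETRAINED_BERT_MODEL = "bert-base-cased"/' saber/constants.py
# Trigger the download, then cancel (CTRL-C) as soon as training begins
(saber) $ python -m saber.cli.train --dataset_folder path/to/dataset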

Training

To train the model, you will need to create a train.sh script. For example:

#!/bin/bash
#SBATCH --account=def-someuser
# Requested resources
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10 
# Wall time and job details
#SBATCH --time=1:00:00
#SBATCH --job-name=example
#SBATCH --output=./output/%j.txt
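# Note: Slurm will not create this directory; make sure ./output exists before submitting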
# Emails me when job starts, ends or fails
#SBATCH --mail-user=example@gmail.com
#SBATCH --mail-type=ALL
# Use this command to run the same job interactively
# salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00

# Load required modules and activate the environment
ENVDIR=~/saber
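# Set this to your own clone of Saber (i.e. $PROJECT_DIR/saber)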
WORKDIR=/home/johnmg/projects/def-gbader/johnmg/saber

module load python/3.7 cuda/10.0
source $ENVDIR/bin/activate

cd $WORKDIR

# Train the model
python -m saber.cli.train --dataset_folder path/to/dataset

Submit this job to the queue with sbatch train.sh. To run the same job interactively, use

salloc --account=def-someuser --nodes=1 --mem=0 --gres=gpu:1 --cpus-per-task=10 --time=0:30:00
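
Once submitted, you can monitor or cancel the job with standard Slurm commands, e.g.

# List your queued and running jobs
$ squeue -u $USER

# Cancel a job by ID
$ scancel <jobid>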

Note that on Béluga you should request a maximum of 10 CPUs per GPU.
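
For example, a job requesting 2 GPUs should request at most 20 CPUs; the relevant lines of train.sh would look like this (a sketch; whether training actually makes use of multiple GPUs depends on Saber itself):

#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=20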