This tutorial is intended to run on Google Cloud, specifically by starting a Google Compute Engine Instance, shelling in, installing snakemake, and running a pipeline. Full instructions are included here.
Important: read/write access is currently required for the bucket hn-snakemake/pig5, and your project must be granted Google Life Sciences permissions on the project that owns this bucket in order to run this pipeline. Both the credentials that you generate and your Compute Engine credentials need GET access to that bucket. In testing, we had to use admin access for it to fully work.
You will want to first log in to your project and create an instance. If you don't have any instances, this coincides with clicking "Create":
Specifically, these are the choices that are recommended:
- Region: you want this to be as close as possible to where your main operations are. For California, this typically means us-west1, and then zone a, b, or c. The actual instances behind a/b/c vary by project, so your choice of a isn't the same as another project's; don't worry too much about your choice.
- Machine Family: I typically choose General purpose, because we just need a basic Linux base.
- You don't need to select the option to deploy a container to the instance.
- Machine Type: it's best to choose a smaller (but not too small) size. I typically choose n1-standard-2 (2 vCPUs, 7.5 GB memory).
- Boot Disk: even for development, you always want to choose an image with long term support. E.g., for now I would choose Ubuntu 18.04 LTS (long term support). There is a minimal image that works well too. For the actual disk size, I tend to choose a larger one to be safe (100GB).
- API Access: you generally want to limit access to only those endpoints that are needed, but if you are developing it's easier to grant access to all APIs. For this example, we want access to Google Compute Engine, Google Life Sciences, and Google Storage.
- I typically allow both HTTP and HTTPS traffic in case I need it.
You don't need to enable delete protection, or to ask for a static (not ephemeral) ip address under networking. This is just a test instance and it will go away rather quickly. When you are happy with your setup, click on "Create" at the bottom.
A tip that I like to share is that you can click on "equivalent REST or command line" directly below this button. If you might want to create the instance programmatically in the future, click there.
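As a rough sketch of what such a programmatic equivalent could look like, the snippet below assembles a gcloud command that mirrors the choices listed above. The helper name gcloud_create_command is hypothetical, the instance name snakemake-googlemags is borrowed from the scp command later in this tutorial, and the image family, scope, and tag values are my assumptions about what the console would generate, so compare them against the command the console shows you before relying on them.

```python
import shlex


def gcloud_create_command(name, zone="us-west1-a"):
    """Return the argv list for creating a small Ubuntu test instance.

    The values mirror the console choices in this tutorial; the image
    family, scope, and tags are assumptions, not copied from the console.
    """
    return [
        "gcloud", "compute", "instances", "create", name,
        "--zone", zone,
        "--machine-type", "n1-standard-2",
        "--image-family", "ubuntu-1804-lts",
        "--image-project", "ubuntu-os-cloud",
        "--boot-disk-size", "100GB",
        # cloud-platform grants access to all APIs (convenient for development)
        "--scopes", "cloud-platform",
        # network tags commonly used to allow http/https traffic
        "--tags", "http-server,https-server",
    ]


cmd = gcloud_create_command("snakemake-googlemags")
# Quote each argument so the printed line can be pasted into a shell
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command instead of running it makes it easy to review first; to execute it you could pass the same list to subprocess.run(cmd, check=True).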
After you create the instance, it will take a brief time to spin up, and when it's ready, a small green dot will appear on the left side.
You can then click on the "SSH" dropdown on the right side. I like to copy and paste the command for shelling into the instance from my command line (View gcloud command). If you haven't yet installed gcloud, see these instructions.
Now let's install dependencies! This means snakemake, and also downloading the pipeline files.
First, here are the system packages needed for snakemake.
sudo apt-get update
sudo apt-get install -y git gcc
Let's install anaconda so we can test pipelines with conda.
$ wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
$ chmod +x Anaconda3-2019.10-Linux-x86_64.sh
$ ./Anaconda3-2019.10-Linux-x86_64.sh -b -p $HOME/anaconda3
Add to our path (edit in your bash profile to make permanent, if desired).
export PATH=$HOME/anaconda3/bin:$PATH
which python
/home/vanessa/anaconda3/bin/python
python --version
Python 3.7.4
which pip
/home/vanessa/anaconda3/bin/pip
If you like, you can use a virtual environment. However, since this is a one-off testing instance, I'm going to install snakemake directly into the Anaconda Python that we just put on the path.
git clone -b add/google-cloud-pipelines https://github.com/vsoch/snakemake
cd snakemake
To ensure that we fetch tags (versions) from the upstream, add it as a remote, and fetch those tags:
git remote add upstream https://github.com/snakemake/snakemake.git
git fetch upstream
And then install with pip. You should see a version 5.10.0 or later.
pip install -e .
And ensure it installed successfully and snakemake is on your path.
$ which snakemake
/home/vanessa/anaconda3/bin/snakemake
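If you would rather check the install from a script than by eye, here is a minimal sketch. It assumes that snakemake --version prints a version string that starts with dotted numbers (possibly followed by a development suffix), and the helper names parse_version and snakemake_is_recent are my own, not part of Snakemake.

```python
import re
import shutil
import subprocess

MINIMUM = (5, 10, 0)  # the version this tutorial expects, or later


def parse_version(text):
    """Return the leading numeric part of a version string as a 3-tuple of ints."""
    match = re.match(r"\s*v?(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    parts = tuple(int(part) for part in match.group(1).split("."))[:3]
    # Pad short versions such as "5.10" so tuple comparison behaves as expected
    return parts + (0,) * (3 - len(parts))


def snakemake_is_recent(minimum=MINIMUM):
    """Return True if snakemake is on the PATH and reports at least `minimum`."""
    executable = shutil.which("snakemake")
    if executable is None:
        return False
    output = subprocess.check_output(
        [executable, "--version"], universal_newlines=True
    )
    return parse_version(output) >= minimum
```

Calling snakemake_is_recent() on the instance should return True after the pip install above; False means either the executable is not on your path or the version is older than expected.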
Google always suggests that you upgrade your python clients, so let's do that. Also, these aren't provided by default with Snakemake (there are many users that want to use Snakemake in a context outside of Google). However, we need them.
pip install --upgrade google-api-python-client
pip install --upgrade google-cloud-storage
pip install oauth2client
If you haven't yet, create a Google Storage Bucket in the interface. You'll want to be sure to add pipelines service accounts to your Storage bucket users. This step is hairy and error prone and I never really get it right the first time.
Let's clone the repository for Google MAGS.
cd $HOME
git clone https://github.com/WatsonLab/GoogleMAGs
cd GoogleMAGs
Snakemake requires GOOGLE_APPLICATION_CREDENTIALS, and since you might want to run this in non-Google places too, you should download your service account key and export it to the environment. From your host, you can copy the key to the instance as follows:
gcloud compute scp credentials.json [username]@snakemake-googlemags:/home/[username]/credentials.json
Note that your computer username might not correspond to your gcloud username (in the command above I use the same placeholder for both). Then you can export your credentials to the environment.
export GOOGLE_APPLICATION_CREDENTIALS="/home/[username]/credentials.json"
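Before launching anything, it can help to confirm that the variable points at a real service account key. This is a minimal sketch, assuming the key is the standard JSON file downloaded from the console (which contains "type" and "client_email" fields); check_credentials is a hypothetical helper name.

```python
import json
import os


def check_credentials(environ=None):
    """Return the service account email from the key file named by
    GOOGLE_APPLICATION_CREDENTIALS, or raise RuntimeError with a hint."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError("credentials file not found: %s" % path)
    with open(path) as handle:
        key = json.load(handle)
    # Service account keys downloaded from the console declare this type
    if key.get("type") != "service_account":
        raise RuntimeError("%s is not a service account key" % path)
    return key["client_email"]
```

The returned email is the account that needs access to the storage bucket, which is handy to have when you are fixing permissions.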
Note that I've also tested the above by instead installing python3-dev and using get-pip.py to install pip. If you install to the system Python you'd need to use sudo; otherwise you can use virtualenv.
Now let's test running Snakemake! Here we are in the root folder of the GoogleMAGs repository.
The snakefile we are targeting is Snakefilev14, and we are going to choose the same region that our instance is in.
Although running from within Python is more complicated, I tested it first since I anticipated that the command line would lead to trouble. Note that this is run after changing directory (cd) into the GoogleMAGs directory.
from snakemake import snakemake
import os
workdir = os.getcwd()
Add the present working directory to the Python path:
os.environ['PYTHONPATH'] = os.getcwd()
We want to provide the path to the snakefile:
snakefile = 'Snakefilev14'
And now create a temporary directory for working:
import tempfile
# mkdtemp creates a uniquely named directory and returns its path,
# without relying on private helpers of the tempfile module
tmpdir = tempfile.mkdtemp(prefix="snakemake-")
Copy all our files there.
import shutil
def copy(src, dst):
    # Copy a directory recursively, or a single file, into dst
    if os.path.isdir(src):
        shutil.copytree(src, os.path.join(dst, os.path.basename(src)))
    else:
        shutil.copy(src, dst)
path = os.getcwd()
for f in os.listdir(path):
    print(f)
    copy(os.path.join(path, f), tmpdir)
Snakefilev6
Dockerfile
Snakefile
Snakefilev11
cluster.json
runs
Snakefilev8
envs
environment.yaml
Snakefilev7
create_local_text.sh
Snakefilev10
fileofaccessions.txt
.git
Snakefilev13
scripts
Snakefilev14
Snakefilev9
.snakemake
LICENSE
README.md
Snakefilev12
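The listing above shows that the copy also carries along .git and .snakemake. If you would rather leave those behind, one option is shutil.copytree with an ignore filter. This is a sketch under the assumption that the pipeline does not need either directory in the working copy; copy_workdir and the "workdir" subfolder name are hypothetical.

```python
import os
import shutil
import tempfile


def copy_workdir(src, skip=(".git", ".snakemake")):
    """Copy `src` into a fresh temporary directory and return the new path.

    Entries whose names match `skip` are left out at every directory level.
    """
    parent = tempfile.mkdtemp(prefix="snakemake-")
    # copytree wants a destination that does not exist yet, so use a subfolder
    dst = os.path.join(parent, "workdir")
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*skip))
    return dst
```

With this, something like tmpdir = copy_workdir(os.getcwd()) could stand in for the directory creation and the copy loop, and the rest of the walkthrough would use tmpdir the same way.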
Let's get ready to run! These are the same parameters used in testing.
cores = 3
config = {}
verbose = True
success = snakemake(
    snakefile,
    cores=cores,
    workdir=tmpdir,
    stats="stats.txt",
    config=config,
    use_conda=True,
    google_lifesciences_regions=["us-west1"],
    default_remote_prefix="hn-snakemake/pig5",
    google_lifesciences=True,
    google_lifesciences_cache=True,
    verbose=verbose,
)
And then the pipeline will run and print verbose output to the screen.
Now that we've seen a test running from within Python, let's review the more likely use case (running from the command line).
snakemake --google-lifesciences --verbose -s Snakefilev14 --default-remote-prefix hn-snakemake/pig5 --use-conda --google-lifesciences-keep-cache --google-lifesciences-region us-west1
And here is the same example with 400 jobs and an explicit container image:
snakemake --google-lifesciences --verbose -s Snakefilev14 -j 400 --default-remote-prefix hn-snakemake/pig5 --use-conda --google-lifesciences-keep-cache --google-lifesciences-region us-west1 --container-image snakemake/snakemake:v5.10.0
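Because the two commands above differ only in the job count and the container image, it can be less error prone to assemble them from parameters. The sketch below does that using only the flags that appear in the commands above. Those flags come from the development branch used in this tutorial and may be named differently in other Snakemake versions, and build_snakemake_command is a hypothetical helper.

```python
def build_snakemake_command(snakefile, remote_prefix, region,
                            jobs=None, container_image=None):
    """Return the argv list for a Google Life Sciences run.

    Only flags shown in the tutorial's own commands are used.
    """
    cmd = ["snakemake", "--google-lifesciences", "--verbose", "-s", snakefile]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    cmd += [
        "--default-remote-prefix", remote_prefix,
        "--use-conda",
        "--google-lifesciences-keep-cache",
        "--google-lifesciences-region", region,
    ]
    if container_image is not None:
        cmd += ["--container-image", container_image]
    return cmd


basic = build_snakemake_command("Snakefilev14", "hn-snakemake/pig5", "us-west1")
large = build_snakemake_command(
    "Snakefilev14", "hn-snakemake/pig5", "us-west1",
    jobs=400, container_image="snakemake/snakemake:v5.10.0",
)
print(" ".join(basic))
print(" ".join(large))
```

To actually launch a run you could hand either list to subprocess.run(cmd, check=True) from inside the GoogleMAGs directory.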
under development