SSAHA Workflow Example

This example workflow demonstrates how to parallelize the SSAHA2 (Sequence Search and Alignment by Hashing Algorithm) tool published by the Sanger Institute.

If you have not done so already, please clone this example repository like so:

git clone https://github.com/cooperative-computing-lab/makeflow-examples.git
cd ./makeflow-examples/ssaha

First, download and unpack a suitable SSAHA2 binary and copy it into this directory:

export SSAHA_BINARY=ssaha2_v2.5.5_x86_64
wget ftp://ftp.sanger.ac.uk/pub/resources/software/ssaha2/${SSAHA_BINARY}.tgz
tar xvzf ${SSAHA_BINARY}.tgz
cp ${SSAHA_BINARY}/ssaha2 .

If you do not have any data of your own, you can generate random data in FASTQ format for testing purposes. The first argument to fastq_generate.pl is the number of sequences and the second is the length of each sequence; the second invocation also passes db.fastq so that the generated query sequences are drawn from the reference database and some of them will align against it.

./fastq_generate.pl 10000 2000 > db.fastq
./fastq_generate.pl 100000 100 db.fastq > query.fastq
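
Each record in the generated FASTQ files follows the standard four-line format: an identifier line, the sequence, a separator line, and a quality string the same length as the sequence. The record below is only an illustration; the identifiers and quality characters emitted by fastq_generate.pl may differ.

@seq_0001
ACGTTAGCATTGACCA
+
IIIIIIIIIIIIIIII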

Make sure that the sequential ssaha2 executable works. A full run takes roughly 5 minutes, so cancel it once you are satisfied that it is working.

./ssaha2 db.fastq query.fastq

Then, generate a workflow to parallelize the job into sub-jobs of 1000 sequences each:

./make_ssaha_workflow db.fastq query.fastq output.fastq 1000 > ssaha.mf
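
The generated ssaha.mf is an ordinary Makeflow file: it splits query.fastq into chunks of 1000 sequences, runs ssaha2 on each chunk against db.fastq, and joins the per-chunk results into output.fastq. The rules below are only a sketch of that pattern; the chunk file names and helper commands in the file actually produced by make_ssaha_workflow may differ.

output.0: ssaha2 db.fastq query.fastq.0
	./ssaha2 db.fastq query.fastq.0 > output.0

output.1: ssaha2 db.fastq query.fastq.1
	./ssaha2 db.fastq query.fastq.1 > output.1

output.fastq: output.0 output.1
	cat output.0 output.1 > output.fastq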

Finally, run the workflow using Makeflow locally, or on a batch system such as HTCondor, SGE, or Work Queue:

makeflow ssaha.mf
makeflow -T condor ssaha.mf
makeflow -T sge ssaha.mf
makeflow -T wq ssaha.mf
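
In Work Queue mode, Makeflow acts as a master and waits for workers to connect, so you must also start one or more work_queue_worker processes pointed at the machine running Makeflow. For example (mymachine.example.org and port 9123 are placeholders):

makeflow -T wq -p 9123 ssaha.mf
work_queue_worker mymachine.example.org 9123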

Alternatively, the workflow can be run using the JX or JSON representation:

makeflow --jx ssaha.jx
makeflow --json ssaha.json
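
In the JX form, the workflow is a JSON object whose "rules" array lists each task with its command, inputs, and outputs. A minimal sketch of a single ssaha2 rule (file names are illustrative) would look like:

{
    "rules": [
        {
            "command": "./ssaha2 db.fastq query.fastq.0 > output.0",
            "inputs": ["ssaha2", "db.fastq", "query.fastq.0"],
            "outputs": ["output.0"]
        }
    ]
}
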
| Workflow Size | Reference Size (Number x Length) | Query Size (Number x Length) | Sequences per Split | Approx. Time : Machines |
|---|---|---|---|---|
| Small  | 10000x2000 (Fixed 20M)   | 100000x100 (237K)  | 100  | ~5 min : 1 machine    |
| Medium | 10000x2000 (Fixed 1000M) | 1000000x100 (237M) | 1000 | ~3 : 20 machines      |
| Large  | 100000x2000 (Fixed 4.0G) | 1000000x500 (386M) | 1000 | ~15 min : 20 machines |