A Fast, Lightweight Smith-Waterman Sequence Aligner
bioseq
is packaged as a module for Node environments, but automatically
attaches itself to window
if included in a browser. To get it in Node, from
your project directory,
npm install bioseq --save
From there, you can require it wherever is convenient:
const bioseq = require("bioseq");
Alternately, if you just want it in the browser, download it and include it with a script tag, like so:
<script src="bioseq.js"></script>
var target = "ATAGCTAGCTAGCATAAGC";
var query = "AGCTAcCGCAT";
var rst = bioseq.align(target, query);
console.log("Score: ", rst.score);
console.log("Starting position: ", rst.position);
console.log("CIGAR: ", bioseq.cigar2str(rst.CIGAR));
var fmt = bioseq.cigar2gaps(target, query, rst.position, rst.CIGAR);
console.log(fmt[0]);
console.log(fmt[1]);
For a more comprehensive example, view the source of demo/index.html
bioseq is an object with the following functions: align, cigar2gaps, cigar2str, bsg_enc_seq, gen_query_profile, gen_score_matrix, makeAlphabetMap, and makeIntArray.
In general, the only functions of general interest are align
and cigar2*
,
so let's talk about those.
align
is the fundamental function of bioseq
. It accepts seven parameters:
- reference - the reference sequence
- query - the query sequence or profile
And, optionally:
- is_local - perform local alignment
- matrix - square score matrix or [match,mismatch] array
- gapsc - [gap_open,gap_ext] array; k-length gap costs gap_open+gap_ext
- w - bandwidth, disabled by default
- table - encoding table. It defaults to bioseq.nt5.
Returns an object containing a score
, position
, and CIGAR
. CIGAR is
encoded in the BAM way, where higher 28 bits keeps the length and lower 4 bits
the operation in order of 'MIDNSH'. See bioseq.cigar2str()
for converting
CIGAR to string.
Before I get in to what this function does, let's talk about what it needs. CIGAR is an encoding scheme for sequence alignment instructions. Basically, it's a list of instructions to make two sequences line up. These instructions are assembled from a very simple set:
- M for match
- I for insertion
- D for deletion
- N for gap
- S for substitution
- H for hard clipping
Each of these can be preceded by an integer, indicating the number of base pairs for which this instruction is correct between the two sequences. However, CIGARs are stored as simpler machine instructions, rather than strings. So, if you want to show the CIGAR (for some reason), you're going to need this.
Takes a pair of sequences and a CIGAR and returns an array of sequences which have been aligned according to the instructions in the CIGAR.
- reference - The same reference sequence you passed to
bioseq.align
- query - The same query you passed to
bioseq.align
- start - The
position
value of the object returned bybioseq.align
- cigar - The
CIGAR
value of the object retruned bybioseq.align
And optionally,
- fixed - A boolean indicating whether the reference is fixed or modifiable.
Basically, if you're aligning many sequences to the same reference, you should
set fixed to
true
. Otherwise you can just leave it out.
bioseq
implements local and global pairwise alignment with affine gap
penalties. There are two formulations: the Durbin formulation as is
described in his book and the Green formulation as is implemented in phrap.
The Durbin formulation is easier to understand, while the Green formulation
is simpler to code and probably faster in practice.
The Durbin formulation is:
M(i,j) = max{M(i-1,j-1)+S(i,j), E(i-1,j-1), F(i-1,j-1)}
E(i,j) = max{M(i-1,j)-q-r, F(i-1,j)-q-r, E(i-1,j)-r}
F(i,j) = max{M(i,j-1)-q-r, F(i,j-1)-r, E(i,j-1)-q-r}
where q is the gap open penalty, r the gap extension penalty and S(i,j) is the score between the i-th residue in the row sequence and the j-th residue in the column sequence. Note that the original Durbin formulation disallows transitions between between E and F states, but we allow them here.
In the Green formulation, we introduce:
H(i,j) = max{M(i,j), E(i,j), F(i,j)}
The recursion becomes:
H(i,j) = max{H(i-1,j-1)+S(i,j), E(i,j), F(i,j)}
E(i,j) = max{H(i-1,j)-q, E(i-1,j)} - r
F(i,j) = max{H(i,j-1)-q, F(i,j-1)} - r
It is in fact equivalent to the Durbin formulation. In implementation, we calculate the scores in a different order:
H(i,j) = max{H(i-1,j-1)+S(i,j), E(i,j), F(i,j)}
E(i+1,j) = max{H(i,j)-q, E(i,j)} - r
F(i,j+1) = max{H(i,j)-q, F(i,j)} - r
i.e. at cell (i,j), we compute E for the next row and F for the next column.
In an online conversation, Istvan Albert was complaining he could not find a good Smith-Waterman implementation in Javascript. Li Heng thought he could write one over night by porting ksw.c to javascript. It took longer than he planned because he found a couple of subtle bugs in ksw.c. And while he was porting the C code to javascript, he realized that it is not that difficult to merge local and banded global alignments in one function. Achieving that also took extra time.
The end product is a fast and lightweight javascript library for affine-gap local and banded global pairwise alignment. With a modern Javascript engine, it is not much slower than a non-SSE C implementation.
This repository adapted Li's original algorithm for release on NPM. Li did all the hard work, and we owe him a large debt of gratitude for it.
This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.
The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.
This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.
This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.
You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html
The source code forked from other open source projects will inherit its license.
This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Surveillance Platform Disclaimer and Code of Conduct. For more information about CDC's privacy policy, please visit http://www.cdc.gov/privacy.html.
Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.
All comments, messages, pull requests, and other submissions received through CDC including this GitHub page are subject to the Presidential Records Act and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.
This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.
Please refer to CDC's Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.