Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
305 lines (246 sloc) 14.6 KB
The \eslmod{sq} module provides \Easel's object for single biological
sequences: an \ccode{ESL\_SQ}.
Sequence objects invariably become complicated, even though their
designer intends them to be simple. There's many things we want to do
with a sequence, and useful features naturally accrete over time. If a
library isn't careful to balance creeping featuritis against having an
easy way to start using the object in simple applications, then the
sequence object - possibly the most fundamental object of a
biosequence library - can become a barrier to anyone else actually
using the library. All those useful features won't matter much if you
can't figure out how to turn your sequence data into an object, or get
it back out. \Easel\ expects you to have your own preferred way of
dealing with sequence data that's not necessarily \Easel's way, so it
provides simple ways to create sequence objects from elemental (C
string) data, and simple ways to get elemental C strings back out.
This lets you minimize your exposure to \Easel's more complicated
capabilities if you like.
The most basic use of an \ccode{ESL\_SQ} object is to hold one
complete sequence, simply as a plain C string. A sequence may also
have a name, an accession, and a description line. This is called a
\esldef{text mode} sequence. In text mode, \Easel\ doesn't know
whether the sequence is DNA, RNA, protein, or something else; it's
just an ASCII character string. This limits some of \Easel's more
powerful abilities, such as the ability to check the sequence for
errors, or to automatically deal with degenerate residue codes; but
it's a simple mode that's easy to start using.
Alternatively, a sequence may be in \esldef{digital mode}. In digital
mode, sequences are predigested and encoded into \Easel's internal
format, which makes many sequence routines more robust, efficient, and
powerful.
In addition to storing a complete sequence, an \ccode{ESL\_SQ} is
designed to be used in three other situations:
\begin{itemize}
\item to hold a \esldef{subsequence} of a larger source sequence. The
object maintains source and coordinate information necessary for
crossreferencing the subsequence's coordinate system to the original
source coordinate system.
\item to hold a \esldef{window} of a larger source sequence. This is
like a subsequence, but is more specifically intended for reading a
sequence from a file in overlapping windows. This avoids having to
suck an entire chromosome (for example) into memory at any one
time. The stored subsequence is composed of two segments, a
\esldef{previous context} that gets saved from the previous window,
and a \esldef{new window} of fresh residues. The size of both the
context and the window are configurable at the time each new window
is read.
\item to hold only \esldef{information} about a sequence, such as its
name, its length, and its position in a file, excluding the sequence
(and optional secondary structure annotation) itself. This is handy
for example when indexing a sequence file, when we'd rather not read
any (possibly prohibitively large) sequence into memory until after
we've mapped out how big it is.
\end{itemize}
To keep all this straight, the object contains a bunch of internal
bookkeeping data.
Sequence objects are growable and reusable, for efficiency in memory
allocation. If you're going to go through many different sequences
sequentially, you would typically just allocate a single
\ccode{ESL\_SQ} object and \ccode{esl\_sq\_Reuse()} it for each new
sequence, rather than creating and destroying a lot of objects.
A sequence object can also store an optional secondary structure
annotation line for the sequence, one character per residue.
An interface to the \ccode{ESL\_MSA} multiple alignment object
provides the ability to extract single unaligned sequences from a
multiple alignment.
You would often use the \eslmod{sq} module in conjunction with
\eslmod{sqio}, which provides the ability to read and write
\ccode{ESL\_SQ} objects from and to files.
Table~\ref{tbl:sq_api} lists the functions in the \eslmod{sq} API.
% Table generated by autodoc -t esl_sq.c (so don't edit here, edit esl_sq.c:)
\input{apitables/esl_sq_api}
\subsection{Example of getting data in and out of an \ccodeincmd{ESL\_SQ}}
The easiest way to create a new \ccode{ESL\_SQ} object is with the
\ccode{esl\_sq\_CreateFrom()} function, which just takes character
strings for a sequence and its name (and also, optionally, an
accession, description, and/or secondary structure annotation string).
You can also build up (and/or change and manipulate) the contents of
an \ccode{ESL\_SQ} object by accessing the name, accession,
description, sequence, and structure annotation line more directly.
This code shows examples of both approaches:
\input{cexcerpts/sq_example}
A few things to notice about that code:
\begin{itemize}
\item Every sequence has a name and a sequence. If we didn't want to
add the optional accession, description, or structure annotation
line, we'd pass \ccode{NULL} for those arguments to
\ccode{esl\_sq\_CreateFrom()}.
\item An RNA secondary structure annotation line is shown here as part
of the example, but it's really sort of a more advanced
feature. It's good to know it's there (see the \eslmod{wuss} module
for more information about how \Easel\ annotates RNA structure) but
you can ignore it if you're getting started.
\item The \ccode{esl\_sq\_Set*} functions use the same syntax as C's
\ccode{*printf()} family, which gives you a flexible way to create
new sequence names, accessions, and descriptions automatically.
\item The sequence in \ccode{sq->seq} is just a C string. (Here it's a
copy of the \ccode{testseq} string.) That has a couple of
implications. One is that it's a verbatim copy of what you provided;
\Easel\ doesn't know (or care) whether it's DNA or protein sequence,
upper case or lower case, or if it contains illegal non-sequence
characters. With a text mode sequence, that's \emph{your} problem!
For more robustness and defined biosequence alphabets, read on below
about digital mode sequences. The second implication is that, as a C
string, the \ccode{n} residues are indexed \ccode{0..sq->n-1}, not
\ccode{1..sq->n}.
\item If you're going to directly copy a sequence of length \ccode{n}
into a \ccode{sq->seq} field, note the \ccode{esl\_sq\_GrowTo()}
call, which makes sure the sequence object is allocated with enough
space for \ccode{n} residues; and don't forget to set \ccode{sq->n}.
\item The structure annotation \ccode{sq->ss} is also a C string,
indexed identically to \ccode{sq->seq}, but it's optional, and isn't
allocated by default; \ccode{esl\_sq\_GrowTo()} calls will only
reallocate for the structure annotation string after it's been
allocated at least once. Hence the \ccode{esl\_strdup} call in the
example, which duplicates (allocates and copies) the annotation into
\ccode{sq->ss}.
\end{itemize}
To get simple character strings back out of an \ccode{ESL\_SQ} object,
you're encouraged to peek inside the object. (Yeah, I know, object
oriented design says that there should be methods for this,
independent of the object's implementation; but I balance that against
simplicity, and here, simplicity wins.) The object is defined and
documented in \ccode{esl\_sq.h}. It contains various information; the
stuff you need to know is:
\input{cexcerpts/sq_sq}
Ignore the \ccode{dsq} field for now; we're about to get to it, when
we talk about digital mode sequences.
The \ccode{ESL\_SQ} object itself doesn't particularly care about the
contents of the annotation text fields, so long as they're C strings, and so
long as \ccode{n} is the length of the \ccode{seq} (and optional
\ccode{ss}, if it's non-\ccode{NULL}) strings. However, sequence file
formats do impose some expectations on the annotation strings, and it
would be a Good Idea to adhere to them:
\begin{sreitems} {\emcode{desc}}
\item [\emcode{name}] A sequence name is almost always expected to be
a single ``word'' (no whitespace), like \ccode{SNRPA\_HUMAN}.
\item [\emcode{acc}] An accession is also usually expected to be a
single ``word'' with no whitespace, like \ccode{P09012}. Database
accessions only make sense if you know what database they're for, so
when sequences might be from different databases, you'll sometimes
see accessions prefixed with a code indicating the source database,
as in something like \ccode{UniProt:P09012}. Again, \Easel\ itself
isn't enforcing the format of this string, so your application is
free to create its own accession/version/database format as needed.
\item [\emcode{desc}] A description line is something like \ccode{U1
small nuclear ribonucleoprotein A (U1 snRNP protein A) (U1A protein)
(U1-A).}; a one-line summary of what the sequence is. You can expect
the description line to show up in the tabular output of sequence
analysis applications, so ideally you want it to be short and sweet
(so it fits on one line with a name, accession, score, coords, and
other information from an analysis app). You also don't want the
description line to end in a newline (\verb+\n+) character, or the
description line will introduce unexpected line breaks in these
tabular output files.
\end{sreitems}
You can reach into a \ccode{ESL\_SQ} and copy or modify any of these
strings, but don't try to overwrite them with a larger string unless
You Know What You're Doing. Their memory allocations are managed by
the \ccode{ESL\_SQ} object. Instead, use the appropriate
\ccode{esl\_sq\_Set*} function to overwrite an annotation field.
The \eslmod{sq} module isn't much use by itself; it's a building block
for several other modules. For example, one of the main things you'll
want to do with sequences is to read them from a file. For examples
and documentation of sequence input, see the \eslmod{sqio} module.
\subsection{Example of using a digital \ccodeincmd{ESL\_SQ}}
What follows might make more sense if you've read about the
\eslmod{alphabet} module first. \eslmod{alphabet}'s documentation
explains how \Easel uses an internal digital biosequence ``alphabet'',
where residues are encoded as small integers, suitable for direct use
as array indices. But here's an example anyway, of creating and
accessing a digital mode sequence:
\input{cexcerpts/sq_example2}
Things to notice about this code:
\begin{itemize}
\item An \ccode{ESL\_SQ} object has a \ccode{sq->seq} if it's in text
mode, and \ccode{sq->dsq} if its in digital mode. These two fields are
mutually exclusive; one of them is \ccode{NULL}.
\item If you looked at the contents of \ccode{sq->dsq} in either of
the objects, you'd see that each residue is encoded as a value
\ccode{0..3}, representing (for an RNA alphabet) the residues
\ccode{ACGU}.
\item That representation is defined by the digital RNA alphabet
\ccode{abc}, which was the first thing we created.
\item In digital mode, both the sequence residues and the optional
secondary structure characters are indexed \ccode{1..n}.
\item To make the digital sequence in the first sequence object, we
created a digital sequence \ccode{dsq} by encoding the
\ccode{testseq} using \ccode{esl\_abc\_CreateDsq()}; this
function allocated new memory for \ccode{dsq}, so we have to
free it. An \ccode{ESL\_DSQ *} is just a special character array;
it's not a full-fledged \Easel\ object, and so there's no
conventional \ccode{Create()},\ccode{Destroy()} function pair.
\item In the second sequence object, we used
\ccode{esl\_abc\_Digitize()} to encode the \ccode{testseq} directly
into space that the \ccode{sq2} object already had allocated, saving
us the temporary allocation of another \ccode{dsq}, because we
created it in digital mode (\ccode{esl\_sq\_CreateDigital()}) and
made it big enough to hold \ccode{n} digital residues with
\ccode{esl\_sq\_GrowTo()}. Notice that \ccode{esl\_sq\_GrowTo()} is
smart enough to know whether to grow the digital or the text mode
sequence field.
\item By convention, when using digital sequences, we usually keep
track of (and pass as arguments) both a digital sequence \ccode{dsq}
and its length \ccode{n}, and we also need to have the digital
alphabet itself \ccode{abc} available to know what the \ccode{dsq}
means; with text mode sequences, we usually just pass the string
pointer. Thus the \ccode{esl\_sq\_CreateDigitalFrom()} function
takes \ccode{abc}, \ccode{dsq}, and \ccode{n} as arguments, whereas
the counterpart text mode \ccode{esl\_sq\_CreateDigitalFrom()} only
took a C string \ccode{seq}. This is solely a convention - digital
sequences begin and end with a special sentinel character, so we
could always count the length of a \ccode{dsq} if we had to (using
\ccode{esl\_abc\_dsqlen()}, for example), much as we can use ANSI
C's \ccode{strlen()} to count the number of chars in a C string up
to the sentinel \verb+\0+ \ccode{NUL} character at the end.
\item To get the structure annotation to be indexed \ccode{1..n} for
consistency with the \ccode{dsq}, even though the annotation string
is still just an ASCII string, it's offset by one, and the leading
character is set by convention to a \verb+\0+. Therefore to access
the whole structure string (for printing, for instance), you want to
access \ccode{sq->ss+1}. This is a hack, but it's a simple one, so
long as you don't forget about the convention.
\item Because the original sequence has been encoded, you may not get
the original sequence back out when you decode the digital values as
alphabet symbols. \ccode{abc->sym[sq2->dsq[3]]}, for example, takes
the third digital residue and looks it up in the alphabet's symbol
table, returning the canonical character it's
representing. Upper/lower case distinctions are lost, for example;
digital alphabet symbol tables are uniformly upper case. And this
example shows another example, where the input \ccode{testseq}
contains T's, but since the digital alphabet was declared as RNA,
the symbol table represents those residues as U's when you access
them.
\item In that respect, a more careful example should have checked the
return status of the \ccode{esl\_abc\_CreateDsq()} and
\ccode{esl\_abc\_Digitize()} calls. These have a normal failure
mode, when the input text sequence contains one or more ASCII
characters that are unrecognized and therefore invalid in the
digital alphabet. If this had happened, these functions would have
returned \ccode{eslEINVAL} instead of \ccode{eslOK}. We can get away
without checking, however, because the functions just replace any
invalid character with an ``any'' character (representing \ccode{N}
for DNA or RNA, \ccode{X} for protein).
\end{itemize}
For more information about how digital sequence alphabets work, see
the \eslmod{alphabet} module.
You can’t perform that action at this time.