Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
276 lines (218 sloc) 11.8 KB
The \eslmod{sqio} module handles input from unaligned sequence data
files, such as FASTA files.
\eslmod{sqio} can automatically recognize and parse several different
file formats, including FASTA, UniProt, Genbank, DDBJ, and EMBL.
Additionally, it can read individual unaligned sequences from multiple
alignment files in several different formats, including Stockholm,
Clustal, aligned FASTA, PSI-BLAST, and Phylip.
Sequences can be read from normal files, directly from the
\ccode{stdin} pipe, or from \ccode{gzip}-compressed files.
Sequence files can be automatically looked for in a list of one or
more database directories, specified by an environment variable (such
as \ccode{HMMERDB}).
Table~\ref{tbl:sqio_api} lists the functions in the \eslmod{sqio} API.
The module uses an \ccode{ESL\_SQFILE} object which works much like an
ANSI C \ccode{FILE}, maintaining information for an open sequence file
while it's being read.
% API table is auto generated by the Makefile,
% using autodoc -t esl_sqio.c
%
\input{apitables/esl_sqio_api}
\subsection{Example: reading sequences from a file}
Figure~\ref{fig:sqio_example_text} shows a program that opens a file, reads
sequences from it one at a time, then closes the file.
\begin{figure}
\input{cexcerpts/sqio_example_text}
\caption{Example of reading sequences from a file.}
\label{fig:sqio_example_text}
\end{figure}
A FASTA file named \ccode{seqfile} is opened for reading by calling
\ccode{esl\_sqfile\_Open(filename, format, env, \&sqfp)}, which
creates a new \ccode{ESL\_SQFILE} and returns it through the
\ccode{sqfp} pointer. If the \ccode{format} is passed as
\ccode{eslSQFILE\_UNKNOWN}, then the format of the file is
autodetected. Here, we bypass autodetection by asserting that the file
is in FASTA format by passing a \ccode{eslSQFILE\_FASTA} code. (See
below for a list of valid codes and formats.) The optional \ccode{env}
argument is described below too; here, we're passing \ccode{NULL} and
not using it.
Several things can go wrong in trying to open a sequence file that are
beyond the control of Easel or your application, so it's important
that you check the return code. \ccode{esl\_sqfile\_Open()} returns
\ccode{eslENOTFOUND} if the file can't be opened, and
\ccode{eslEFORMAT} if the file is empty, or if autodetection can't
determine its format.\footnote{Additionally, internal errors might be
thrown, which you should check for if you installed a nonfatal error
handler.}
The file is then read one sequence at a time by calling
\ccode{esl\_sq\_Read(sqfp, sq)}. This function returns \ccode{eslOK}
if it read a new sequence, and leaves that sequence in the \ccode{sq}
object that the caller provided. When there is no more data in the
file, \ccode{esl\_sq\_Read()} returns \ccode{eslEOF}.
If at any point the file does not appear to be in the proper format,
\ccode{esl\_sq\_Read()} returns \ccode{eslEFORMAT}. The application
must check for this. The API can provide a little information about
what went wrong and where. \ccode{sqfp->filename} is the name of the
file that we were parsing (not necessarily the same as
\ccode{seqfile}; \ccode{sqfp->filename} can be a full pathname if we
used an \ccode{env} argument to look for \ccode{seqfile} in installed
database directories). The function \ccode{esl\_sqfile\_GetErrorBuf()}
should be called to get a pointer to the generated error message. The
buffer is a brief explanatory message that gets filled in when a
\ccode{eslEFORMAT} error occurs.
\footnote{Unlike in the MSA module, you don't get access to the
current line text; some of sqio's parsers use fast block-based
(\ccode{fread()}) input instead of line-based input.}
We can reuse the same \ccode{ESL\_SQ} object for subsequent sequences
by calling \ccode{esl\_sq\_Reuse()} on it when we're done with the
previous sequence. If we wanted to load a set of sequences, we'd
\ccode{\_Create()} an array of \ccode{ESL\_SQ} objects.
Finally, to clean up properly, a \ccode{ESL\_SQ} that was created is
destroyed with \ccode{esl\_sq\_Destroy(sq)}, and a \ccode{ESL\_SQFILE}
is closed with \ccode{esl\_sqfile\_Close()}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Digital sequence input mode}
Most Easel-based programs manipulate sequences in Easel's digital
sequence format, using \eslmod{alphabet}, as opposed to manipulating
them as plaintext. The \eslmod{sqio} reader can be used either in text
mode or digital mode. In text mode, you get the \ccode{sq->seq} field
of the \ccode{ESL\_SQ}; in digital mode, you get \ccode{sq->dsq}.
To use digital mode, both the \ccode{ESL\_SQFILE} reader and the
\ccode{ESL\_SQ} sequence object must be set to digital mode. The
reader, because it has an input map that maps plaintext input
characters to internal residue codes (plaintext or digital), or
errors. The sequence object, because it needs to have either its
\ccode{seq} or \ccode{dsq} field allocated. Both also carry a copy of
the pointer to the alphabet.
Figure~\ref{fig:sqio_example_digital} shows an example of the standard
idiom for opening files in digital mode, autoguessing their format and
alphabet by default, and allowing format and alphabet to be specified
on the command line.
\begin{figure}
\input{cexcerpts/sqio_example_digital}
\caption{Standard idiom for reading sequences in digital mode.}
\label{fig:sqio_example_digital}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Accepted formats}
Accepted unaligned sequence file formats (and their Easel format
codes) include:
\begin{tabular}{lll}
\textbf{name} & \textbf{code} & \textbf{description} \\
fasta & \ccode{eslSQFILE\_FASTA} & FASTA format \\
embl & \ccode{eslSQFILE\_EMBL} & EMBL DNA database format \\
genbank & \ccode{eslSQFILE\_GENBANK} & GenBank DNA database format \\
ddbj & \ccode{eslSQFILE\_DDBJ} & DDBJ DNA database format \\
uniprot & \ccode{eslSQFILE\_UNIPROT} & UniProt protein database format \\
\end{tabular}
The above names, case-insensitive, are what a user uses to specify a
format on a command line: i.e.\ \ccode{--informat fasta} or
\ccode{--informat FASTA}.
The codes are what you use as a developer to specify a format to an
Easel function call.
Additionally, there is a code \ccode{eslSQFILE\_UNKNOWN}. It
tells \ccode{esl\_sqfile\_Open()} to perform format
autodetection.\footnote{There are some other formats as well, which we
don't advertise because they're less well supported.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reading from stdin and compressed files}
There are two special cases for input files.
The module can read sequence input from a stdin pipe. If the
\ccode{seqfile} argument is ``-'', \ccode{esl\_sqfile\_Open()} ``opens''
standard input (really, it just associates \ccode{stdin}, which is
always open, with the \ccode{ESL\_SQFILE}).
The module can read compressed sequence files. If the \ccode{seqfile}
argument to \ccode{esl\_sqfile\_Open()} ends in \ccode{.gz}, the file
is assumed to be compressed with \ccode{gzip}; instead of opening it
normally, \ccode{esl\_sqfile\_Open()} opens it as a pipe from
\ccode{gzip -dc}. Your system must support pipes to use
this.\footnote{Specifically, it must support the \ccode{popen()}
system call (POSIX.2 compliant operating systems do). The
\ccode{configure} script automatically checks this at compile-time
and defines \ccode{HAVE\_POPEN} appropriately.} Obviously, the user
must also have \ccode{gzip} installed and in his PATH.
For both special cases, the catch is that you can't use format
autodetection; you must provide a valid known format code when you
read from stdin or from a compressed file. Pipes are not rewindable,
and format autodetection currently relies on a two-pass algorithm: it
reads partway into the file to determine the format, then rewinds to
start parsing for real.\footnote{The \eslmod{msafile} module is more
advanced. Its parsers are based on the newer \eslmod{buffer} module
which provides rewindable input buffers even for stdin and pipes.}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% SRE: commented NCBI section out.
%% I don't believe that it works, or that it's up to date.
%% Needs testing.
%% \subsection{NCBI BLAST database support}
%% If sqio is augmented with the ncbi module, then the sqio API gains the
%% ability to read NCBI BLAST database formats in addition to ASCII file
%% formats. The sqio API remains exactly the same (the caller doesn't
%% have to use any msa module functions).
%% To open a BLAST database, the format must be supplied. If a format
%% of \ccode{eslSQFILE\_UNKNOWN} is specified, only ASCII based format,
%% i.e. FASTA, GENBANK, etc. will be opened.
%% When opening a BLAST database, the file name should not include any
%% extensions. If the \ccode{ESL\_ALPHABET} is not specified, Easel
%% will first try to open a protein database, followed by a DNA database
%% and finally a multi-volume database. The handling of a multi-volume
%% database is done through the alias file. The only directive in
%% the alias file is the DBLIST line, listing all the database volumes.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Adding a sequence parser}
New parsers for new formats can be plugged into \eslmod{sqio} without
any API changes. Existing Easel-based programs don't need code changes
to use the new parser.
A new parser will need a format type, a structure for parser specific
data, API functions and a hook into the \ccode{sqfile\_open} function.
The list of formats are defined in \ccode{esl\_sqio.h}. A new
\ccode{\#define} will be added to the existing formats:
\input{cexcerpts/sq_sqio_format}
A data structure for parser specific data will need to be added
to \ccode{ESL\_SQDATA}. This structure is a union of the
different parser specific data structures.
\input{cexcerpts/sq_sqio_data}
Finally, a set of parser specific function pointers need to be
defined. The functions in \ccode{esl\_sqio.c} in turn call these
function pointers. The \ccode{esl\_sqfile\_Open} function initializes
the function pointers to NULL, so if they are not set, an exception
will occur when the function is called. At a minimum, the function
should be defined to return an \ccode{eslEUNIMPLEMENTED}. Below is a
map the the function pointers to their respective function.
\begin{tabular}{ll} \\
Function pointer & \eslmod{sqio} function \\ \hline
position & esl\_sqfile\_Position \\
close & esl\_sqfile\_Close \\
set\_digital & esl\_sqfile\_SetDigital \\
guess\_alphabet & esl\_sqfile\_GuessAlphabet \\
read & esl\_sqio\_Read \\
read\_info & esl\_sqio\_ReadInfo \\
read\_seq & esl\_sqio\_ReadSequence \\
read\_window & esl\_sqio\_ReadWindow \\
echo & esl\_sqio\_Echo \\
read\_block & esl\_sqio\_ReadBlock \\
open\_ssi & esl\_sqfile\_OpenSSI \\
pos\_by\_key & esl\_sqfile\_PositionByKey \\
pos\_by\_number & esl\_sqfile\_PositionByNumber \\
fetch & esl\_sqio\_Fetch \\
fetch\_info & esl\_sqio\_FetchInfo \\
fetch\_subseq & esl\_sqio\_FetchSubseq \\
is\_rewindable & esl\_sqfile\_IsRewindable \\
get\_error & esl\_sqfile\_GetErrorBuf \\
\end{tabular}
\bigskip
A hook needs to be added to the function \ccode{sqfile\_open}.
This hook will try to open the specified file. If successful,
the \ccode{ESL\_SQFILE} structure should be filled in with function
pointers and the parser specific data and the open hook
return \ccode{eslOK}. If the sequence files were not found for
the specific parser, an \ccode{eslENOTFOUND} is returned and
the next parser tries to open the file. Below is an example
of code that tries to open an NCBI BLAST database if not
successful, then the ASCII sequence parsers try to open the
file.
\begin{cchunk}
if (format == eslSQFILE\_NCBI && status == eslENOTFOUND)
status = esl\_sqncbi\_Open(sqfp->filename, sqfp->format, sqfp);
if (status == eslENOTFOUND)
status = esl\_sqascii\_Open(sqfp->filename, sqfp->format, sqfp);
\end{cchunk}
You can’t perform that action at this time.