Sequences
Sequences are the core data of the Bio++ libraries. They come as character strings, from text files or from a database. To interpret a sequence, an alphabet is required: it is used to encode the sequence and to ensure the translation between the computer representation and the human-readable one. In some cases, a sequence can be associated with several features, such as gene annotations or quality scores.
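As a rough illustration of what "encoding through an alphabet" means, the sketch below maps letters to integer states and back. ToyDnaAlphabet, encode and decode are hypothetical names invented for this example, and the letter-to-integer mapping is an assumption; none of this is the Bio++ API.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy alphabet (not a Bio++ class): each letter is mapped to
// its index in letters_, and back.
class ToyDnaAlphabet {
public:
  // Human representation -> computer representation.
  int charToInt(char c) const {
    const std::string::size_type pos = letters_.find(c);
    if (pos == std::string::npos)
      throw std::invalid_argument(std::string("unknown state: ") + c);
    return static_cast<int>(pos);
  }

  // Computer representation -> human representation.
  char intToChar(int state) const {
    if (state < 0 || state >= static_cast<int>(letters_.size()))
      throw std::out_of_range("state outside alphabet");
    return letters_[static_cast<std::string::size_type>(state)];
  }

private:
  std::string letters_ = "ACGT";  // assumed ordering, for illustration only
};

// Encode a text sequence into a series of integer states.
std::vector<int> encode(const ToyDnaAlphabet& alpha, const std::string& text) {
  std::vector<int> states;
  states.reserve(text.size());
  for (char c : text) states.push_back(alpha.charToInt(c));
  return states;
}

// Decode a series of integer states back into text.
std::string decode(const ToyDnaAlphabet& alpha, const std::vector<int>& states) {
  std::string text;
  text.reserve(states.size());
  for (int s : states) text.push_back(alpha.intToChar(s));
  return text;
}
```

With this toy alphabet, decoding an encoded sequence gives back the original text, and any character outside the alphabet is rejected at encoding time.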
Depending on the user's need, there are several ways to manipulate sequences in Bio++.
The simplest way to manipulate sequences is to store them as character strings (using std::string). The class StringSequenceTools offers several methods to process such sequences. While this is the easiest way to process sequences, it is rather limited when it comes to more complex data manipulation, particularly when states span more than one character (for instance, codon sequences).
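To make the multi-character limitation concrete, here is a minimal sketch of what handling codons on a plain std::string involves. The helper splitIntoCodons is hypothetical and is not part of StringSequenceTools.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: cut a nucleotide string into 3-character codon states.
std::vector<std::string> splitIntoCodons(const std::string& text) {
  if (text.size() % 3 != 0)
    throw std::invalid_argument("length is not a multiple of 3");
  std::vector<std::string> codons;
  codons.reserve(text.size() / 3);
  for (std::string::size_type i = 0; i < text.size(); i += 3)
    codons.push_back(text.substr(i, 3));  // codon k covers characters [3k, 3k+3)
  return codons;
}
```

The string length no longer equals the number of states, and every position has to be converted by hand, which is the kind of bookkeeping an object implementation with an alphabet takes care of.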
Most methods in Bio++ require a full object implementation of sequence data. In Bio++ 2.00, the class hierarchy was rewritten to accommodate several implementations. It nevertheless remains, to a large extent, backward compatible with previous versions of Bio++.
The most basic feature of a sequence is to store its constitutive series of elements, together with the associated alphabet required to decode it. Basic operations on the sequence include changing, inserting or deleting some elements. The CoreSymbolList interface therefore defines all the required operations. It is parameterized by an element type T, which can be either int for sequences of letters (encoded through their state in an alphabet), or vector for sequences of sets of values (typically probabilities or counts on the alphabet). There are currently two implementations of this interface:
- BasicIntSymbolList (T=int) or ProbabilisticSymbolList (T=vector), offering a minimal implementation.
- EdIntSymbolList, where Ed stands for event-driven. This implementation defines the IntSymbolListListener (aka CoreSymbolListListener) and IntSymbolListEvent (aka CoreSymbolListEvent) classes. It allows you to capture any modification of the sequence through appropriate events.
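The idea of a symbol list parameterized by its element type can be sketched as follows. ToySymbolList and its method names are assumptions made for this illustration; the actual CoreSymbolList interface is richer and also carries the alphabet, which is left out here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical minimal symbol list: a series of elements of type T that
// can be changed, inserted into, or deleted from.
template <class T>
class ToySymbolList {
public:
  std::size_t size() const { return content_.size(); }

  const T& getElement(std::size_t pos) const { return content_.at(pos); }

  // Append an element at the end.
  void addElement(const T& element) { content_.push_back(element); }

  // Insert an element before position pos (pos == size() appends).
  void insertElement(std::size_t pos, const T& element) {
    if (pos > content_.size()) throw std::out_of_range("position outside list");
    content_.insert(content_.begin() + static_cast<std::ptrdiff_t>(pos), element);
  }

  // Replace the element at position pos.
  void setElement(std::size_t pos, const T& element) { content_.at(pos) = element; }

  // Delete the element at position pos.
  void removeElement(std::size_t pos) {
    if (pos >= content_.size()) throw std::out_of_range("position outside list");
    content_.erase(content_.begin() + static_cast<std::ptrdiff_t>(pos));
  }

private:
  std::vector<T> content_;
};

// T = int: one integer state per position.
using ToyIntSymbolList = ToySymbolList<int>;
// T = vector of values: one set of probabilities or counts per position.
using ToyProbabilisticSymbolList = ToySymbolList<std::vector<double>>;
```

The same editing operations then apply whether a position holds a single state or a vector of per-state values.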
The Sequence interface inherits from the CoreSymbolList interface, and adds some simple features like sequence names and comments. It also contains some utility methods for automatically converting a sequence from/to a character string. Two implementations are available:
- BasicSequence, which is based on the BasicIntSymbolList implementation, and
- SequenceWithAnnotation, which offers an event-driven implementation based on the EdIntSymbolList class. In addition, a SequenceAnnotation interface is defined, extending the IntSymbolListListener interface. Sequence annotations can therefore be handled in a very general way by the SequenceWithAnnotation class, as a special case of listeners. Some utility methods dedicated to annotations are provided.
The SequenceWithQuality class is a special case of SequenceWithAnnotation. It contains a mandatory annotation, an instance of the SequenceQuality class, which holds sequence quality scores (such as those obtained from the phred format). For convenience, this class provides methods to edit the scores together with the sequence.
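Editing scores together with the sequence amounts to keeping two series aligned position by position. The sketch below shows that invariant with a hypothetical ToySequenceWithQuality class; it does not reproduce the SequenceWithQuality API, and it stores the scores directly rather than as an annotation.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical toy class: one quality score per sequence position, and
// every edit touches both series so they never get out of step.
class ToySequenceWithQuality {
public:
  std::size_t size() const { return states_.size(); }
  int getState(std::size_t pos) const { return states_.at(pos); }
  int getScore(std::size_t pos) const { return scores_.at(pos); }

  // Append a state and its quality score in one call.
  void addElement(int state, int score) {
    states_.push_back(state);
    scores_.push_back(score);
  }

  // Remove a position from the sequence and from the scores.
  void removeElement(std::size_t pos) {
    if (pos >= states_.size()) throw std::out_of_range("position outside sequence");
    states_.erase(states_.begin() + static_cast<std::ptrdiff_t>(pos));
    scores_.erase(scores_.begin() + static_cast<std::ptrdiff_t>(pos));
  }

private:
  std::vector<int> states_;  // encoded sequence
  std::vector<int> scores_;  // quality score of each position
};
```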
The EdIntSymbolList class fires events every time the sequence content is modified. Depending on the modification, several events can be generated:
- IntSymbolListEditionEvent when the full content was affected, for instance after using the setContent method,
- SymbolListDeletionEvent and SymbolListInsertionEvent in case of an indel, for instance after calling addElement, removeElement or resize,
- SymbolListSubstitutionEvent when the sequence content was changed, for instance with a call to setElement.
Each of these events is fired twice: once before attempting the modification, and once after it has been performed. These events can be caught by implementing the SymbolListListener interface and adding an instance of the resulting class to the sequence object using the addSymbolListListener method.
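The before/after notification pattern described above can be sketched in a few lines. All names here (ToyEvent, ToyListener, ToyEdSymbolList, LoggingListener) are hypothetical and only mimic the mechanism; the real listener and event classes have their own method names and signatures.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event: what kind of edit happened, and where.
struct ToyEvent {
  std::string kind;
  std::size_t pos;
};

// Hypothetical listener interface: one callback before the edit, one after.
class ToyListener {
public:
  virtual ~ToyListener() = default;
  virtual void beforeChange(const ToyEvent&) {}
  virtual void afterChange(const ToyEvent&) {}
};

// Hypothetical event-driven list: every edit notifies listeners twice.
class ToyEdSymbolList {
public:
  // The list does not own its listeners; they must outlive it.
  void addListener(ToyListener* listener) { listeners_.push_back(listener); }

  int getElement(std::size_t pos) const { return content_.at(pos); }

  void addElement(int state) {
    const ToyEvent event{"insertion", content_.size()};
    fireBefore(event);
    content_.push_back(state);
    fireAfter(event);
  }

  void setElement(std::size_t pos, int state) {
    const ToyEvent event{"substitution", pos};
    fireBefore(event);          // fired before attempting the edit
    content_.at(pos) = state;   // throws if pos is invalid; no "after" is fired then
    fireAfter(event);
  }

private:
  void fireBefore(const ToyEvent& e) { for (ToyListener* l : listeners_) l->beforeChange(e); }
  void fireAfter(const ToyEvent& e) { for (ToyListener* l : listeners_) l->afterChange(e); }

  std::vector<int> content_;
  std::vector<ToyListener*> listeners_;
};

// Example listener that records every notification it receives.
class LoggingListener : public ToyListener {
public:
  void beforeChange(const ToyEvent& e) override { log.push_back("before " + e.kind); }
  void afterChange(const ToyEvent& e) override { log.push_back("after " + e.kind); }
  std::vector<std::string> log;
};
```

A listener registered on such a list sees each successful edit as a "before" notification followed by an "after" notification, which is what lets an annotation veto or mirror a change.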
See the bpp-seq-example repository!