Libraries

Michal J. Gajda edited this page Oct 28, 2020 · 3 revisions

An important goal of the BioHaskell effort is to organize libraries. Below is the list of the various libraries and their current state.

General

Shared definitions library: biocore

biocore on hackage

This contains (very) basic functionality and data types intended to be shared among other libraries. Using it ensures that libraries are compatible, and that the same types are used to represent the same things.

Sequences

Sequence format parsers

Reading and writing Fasta-formatted sequences (although biocore can generate Fasta-formatted ByteStrings directly).
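
To make the format concrete, here is a minimal sketch of reading and writing Fasta text using only base. It is not the API of biocore or of the Fasta library: the names FastaRecord, parseFasta and renderFasta are hypothetical, and the sketch works on String rather than ByteString for brevity.

```haskell
-- A minimal FASTA sketch using only base; not the biocore or Fasta library API.
import Data.Char (isSpace)

-- | Hypothetical record type for illustration: a header line and its residues.
data FastaRecord = FastaRecord
  { recHeader   :: String  -- ^ text after the leading '>'
  , recSequence :: String  -- ^ all sequence lines of the record, concatenated
  } deriving (Eq, Show)

-- | Parse FASTA text into records. Blank lines and any lines before the
-- first '>' header are ignored.
parseFasta :: String -> [FastaRecord]
parseFasta = go . dropWhile (not . isHeader) . filter (not . null) . map strip . lines
  where
    isHeader l = take 1 l == ">"
    -- Drop trailing whitespace only (for example a '\r' from Windows files).
    strip = reverse . dropWhile isSpace . reverse
    go [] = []
    go (h : rest) =
      let (body, others) = break isHeader rest
      in FastaRecord (drop 1 h) (concat body) : go others

-- | Render records as FASTA text, wrapping sequence lines at the given width.
-- Widths below 1 are treated as 1 so that chunking always terminates.
renderFasta :: Int -> [FastaRecord] -> String
renderFasta width = concatMap render
  where
    w = max 1 width
    render (FastaRecord h s) = unlines (('>' : h) : chunk s)
    chunk [] = []
    chunk xs = let (a, b) = splitAt w xs in a : chunk b
```

For input whose sequence lines are already wrapped at the chosen width, renderFasta applied to the output of parseFasta reproduces the input text.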

Reading and writing FastQ-formatted sequences (but see above).
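
As an illustration of what a FastQ reader has to do, the sketch below parses four-line records and decodes quality characters. It is not the FastQ library's API: FastqRecord and parseFastq are hypothetical names, and the sketch assumes unwrapped four-line records with Phred+33 (Sanger) quality encoding.

```haskell
-- A minimal FASTQ sketch using only base; not the API of any BioHaskell library.
import Data.Char (ord)

-- | Hypothetical record type for illustration.
data FastqRecord = FastqRecord
  { fqName  :: String  -- ^ text after the leading '@'
  , fqBases :: String  -- ^ base calls
  , fqQuals :: [Int]   -- ^ one Phred quality score per base
  } deriving (Eq, Show)

-- | Parse FASTQ text, assuming each record occupies exactly four lines:
-- '@' name, bases, '+' separator, quality string.
parseFastq :: String -> Either String [FastqRecord]
parseFastq = go . lines
  where
    go [] = Right []
    go (('@' : name) : bases : ('+' : _) : quals : rest)
      | length bases == length quals =
          (FastqRecord name bases (map phred33 quals) :) <$> go rest
      | otherwise = Left ("length mismatch in record " ++ name)
    go (l : _) = Left ("malformed record near: " ++ l)
    -- Assumption: Phred+33 encoding, so '!' maps to quality 0.
    phred33 c = ord c - 33
```

Returning Either keeps malformed input visible to the caller instead of silently dropping records.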

Functionality for dealing with SFF files, as produced by Roche 454 and ABI Ion Torrent sequencers. Includes the flower executable, which can convert SFF files into a variety of formats.

The Genbank library contains tools, a parser, and data structures for the NCBI (National Center for Biotechnology Information) Genbank format.

Alignment format parsers

Parsing the BLAST XML output format.

Parsing the XML output of Multiple EM for Motif Elicitation (MEME).

The ACE alignment output format.

The PHD sequence format, as output by Phred.

A small library enabling reading and writing of PSL files, as output by e.g. BLAT. It also contains some example programs for extracting and manipulating PSL data.

This package supports parsing and rendering of files in Stockholm 1.0 format, which is used by Pfam and Rfam for multiple sequence alignments. The library supports both a streaming interface that runs in constant memory and a convenient document interface that uses as much memory as the largest family in the Stockholm file. Both interfaces are accessed through conduit, but a lazy version of the document interface is provided for one-off scripts.

For reading BAM files (which are Binary SAM files), there is a samtools library, with separate libraries providing iteratee and enumerator interfaces. Available on Hackage.

Annotations

The library by Nick Ingolia provides facilities for working with sequence locations, for instance to describe and manipulate genome annotations.
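
To give a feel for what working with sequence locations involves, here is a small sketch. It is not the API of the library above: the types Strand and Loc and the functions locEnd, overlaps and extract are hypothetical, and the sketch assumes 0-based start coordinates and contiguous locations only.

```haskell
-- A minimal sequence-location sketch using only base; all names are hypothetical.

-- | Strand of a double-stranded sequence.
data Strand = Plus | Minus deriving (Eq, Show)

-- | A contiguous location: 0-based start, length, and strand.
data Loc = Loc
  { locStart  :: Int
  , locLength :: Int
  , locStrand :: Strand
  } deriving (Eq, Show)

-- | Exclusive end coordinate of a location.
locEnd :: Loc -> Int
locEnd l = locStart l + locLength l

-- | Two locations overlap if they share at least one position
-- (strand is ignored here).
overlaps :: Loc -> Loc -> Bool
overlaps a b = locStart a < locEnd b && locStart b < locEnd a

-- | Extract the subsequence at a location, reverse-complementing on the
-- minus strand. Returns Nothing if the location falls outside the sequence.
extract :: Loc -> String -> Maybe String
extract l s
  | locStart l < 0 || locLength l < 0 || locEnd l > length s = Nothing
  | otherwise = Just (orient (take (locLength l) (drop (locStart l) s)))
  where
    orient = case locStrand l of
      Plus  -> id
      Minus -> reverse . map complement
    complement c = case c of
      'A'   -> 'T'
      'T'   -> 'A'
      'C'   -> 'G'
      'G'   -> 'C'
      other -> other
```

A genome annotation can then be modelled as a name paired with one or more such locations, with overlaps used to find annotations that cover a region of interest.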

Calculating alignments

Calculating alignments.

Searches for a provided nucleotide or protein sequence with the NCBI BLAST REST service and returns the result, which arrives in XML format, as a BlastResult datatype.

RNA secondary structure

Parsing parameter files

Work with RNA secondary structure parameter files, transform strings into a highly efficient internal format, and use some functions for dealing with Infernal covariance models. The library will be extended with several "DataSource"s soon, which will allow users to import typical data easily.

Note: the other Biobase libraries are deprecated; their functionality is now included in Biobase.*

BiobaseDotP

  • import/export secondary structures based on some form of the Vienna dot-bracket notation (((...(((...)))..)))
  • import/export extended secondary structures as used by RNAwolf
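
The dot-bracket notation pairs each opening parenthesis with its matching closing one and marks unpaired positions with a dot. The sketch below shows one way to import and export the plain (non-extended) notation. It is not the BiobaseDotP API: parseDotBracket and renderDotBracket are hypothetical names, and positions are assumed to be 0-based.

```haskell
-- A minimal dot-bracket sketch using only base; not the BiobaseDotP API.
import Data.List (sort)

-- | Parse a dot-bracket string into base pairs (i, j) with i < j,
-- sorted by opening position. A stack holds the open positions.
parseDotBracket :: String -> Either String [(Int, Int)]
parseDotBracket = go [] [] . zip [0 ..]
  where
    go [] pairs [] = Right (sort pairs)
    go (i : _) _ [] = Left ("unmatched '(' at position " ++ show i)
    go stack pairs ((i, c) : rest) = case (c, stack) of
      ('(', _)      -> go (i : stack) pairs rest
      (')', j : st) -> go st ((j, i) : pairs) rest
      (')', [])     -> Left ("unmatched ')' at position " ++ show i)
      ('.', _)      -> go stack pairs rest
      _             -> Left ("unexpected character " ++ show c)

-- | Render base pairs back to a dot-bracket string of the given length.
renderDotBracket :: Int -> [(Int, Int)] -> String
renderDotBracket n pairs = [symbol i | i <- [0 .. n - 1]]
  where
    symbol i
      | i `elem` map fst pairs = '('
      | i `elem` map snd pairs = ')'
      | otherwise              = '.'
```

For the example string (((...(((...)))..))) this yields six pairs, the outermost being (0, 19) and the innermost (8, 12).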

BiobaseFR3D

  • FR3D contains already parsed PDB RNA structures
  • this library extracts basepairs and sequence from FR3D data
  • including complete directories full of entries

BiobaseInfernal

  • verbose hits
  • tabulated hits
  • stockholm files
  • covariance models
  • currently being converted to iteratee

BiobaseMAF

  • reading of MAF files
  • based on iteratee

BiobaseTrainingData

  • TrainingData to be used for training RNAwolf
  • imports from FR3D and DotP
  • exports trainingdata elements
  • imports trainingdata

BiobaseTurner

  • import Turner 2004 energy parameter files

BiobaseXNA

  • RNA primary and secondary structure
  • tree-based representations
  • some datasources reading and writing dot-bracket and similar notations (e.g. rnastrand data)
  • named -xna to support both -dna and -rna. The internals are a bit rough, which is acceptable because the library targets high-performance use

BiobaseVienna

  • Importer and Exporter for Vienna energy files. Allows converting Turner parameter files to Vienna parameter files.

BiobaseTypes

  • Provides an algebraic ring class and instances for Gibbs free energy, partition function probabilities, and scores. Conversion between different entities is provided by a convert function. All entities are ready for the vector library.
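
To illustrate why a shared algebraic class is useful here, the sketch below evaluates the same set of alternatives once as energies and once as weights. It is not the BiobaseTypes API: the Semiring class, the Energy and Weight newtypes and the evaluate function are hypothetical, the choice of operations (minimum and addition for energies, addition and multiplication for weights) is an assumption, and the conversion function is left out.

```haskell
-- A minimal semiring sketch using only base; not the BiobaseTypes API.

-- | Hypothetical class: (<+>) chooses between alternatives,
-- (<.>) combines independent parts.
class Semiring a where
  zero  :: a
  one   :: a
  (<+>) :: a -> a -> a
  (<.>) :: a -> a -> a

-- | Free energy (lower is better); assumed unit: kcal/mol.
newtype Energy = Energy Double deriving (Eq, Ord, Show)

-- | Unnormalized weight, as summed up in a partition function.
newtype Weight = Weight Double deriving (Eq, Ord, Show)

-- Energies: pick the minimum among alternatives, add up parts.
instance Semiring Energy where
  zero = Energy (1 / 0)
  one  = Energy 0
  Energy a <+> Energy b = Energy (min a b)
  Energy a <.> Energy b = Energy (a + b)

-- Weights: sum over alternatives, multiply parts.
instance Semiring Weight where
  zero = Weight 0
  one  = Weight 1
  Weight a <+> Weight b = Weight (a + b)
  Weight a <.> Weight b = Weight (a * b)

-- | Evaluate a list of alternatives, each given as a list of parts.
evaluate :: Semiring a => [[a]] -> a
evaluate = foldr (<+>) zero . map (foldr (<.>) one)
```

With this design a single piece of evaluation code serves both questions: instantiated at Energy it returns the best alternative, and at Weight it returns the sum over all alternatives.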

BiobaseFasta

  • Enumeratees for FASTA-handling and convenience functions. In a typical application, the user should write an enumeratee to extract information to allow for efficient low-memory handling of queries.

Folding

  • Vienna RNAfold v2.0
  • importing and exporting of Turner and Vienna tables
  • asymptotically fast reimplementation of MC-Fold (Parisien, Major, 2008)
  • importing of mcfold-db
  • extended rna secondary structure folding
  • version 0.3 includes full stacking
  • folding is reasonably fast due to the use of additional arrays (expect to fold 300-500 nt in seconds)
  • 2-diagrams will be back soon, if no bugs show up
  • complete 2-diagrams for multibranched loops will follow later due to the large constant overhead

Tertiary structures (3D)

For reading and analyzing Protein Data Bank (PDB) format files.

Nuclear Magnetic Resonance

Parsing STAR* format files from the Biological Magnetic Resonance Data Bank (BMRB).

Parsing the output of the TALOS+ program for predicting protein backbone torsion angles from chemical shifts.

Diagnostics

Hemokit is a library and tool suite for the Emotiv Epoc EEG.

An electroencephalograph is a device that measures brain activity from the scalp. The library is a Haskell port of the Emokit project, but at the time of writing it seems to be more stable and comes with versatile tools for non-Haskell programmers.

Deprecated: bio

This library contains data types for sequences and various kinds of alignments, as well as functionality for reading and writing many different file formats. Development is driven by the needs of applications, so while large parts of the library are solid and efficient, other parts are less mature or feature-complete.

Planned new libraries

The following libraries are "kind of" new. Mostly, they have been part of some package but are sufficiently different that they can stand apart from bioinformatics in general.

passive-aggressive optimization

An optimization scheme that has been used successfully for NLP, RNA secondary structure and other tasks (I guess ;-).

convex optimization

Depending on time constraints, a variant of the Haskell GLPK library, but for convex optimizers.