Google Summer of Code 2017

Final Work Product Submission Report

Introduction

Genoverse is a genome browser written in JavaScript. My project was to add support to Genoverse for large binary file formats such as BigWig, BigBed and compressed (tabix-indexed) VCF. In the process I also enabled support for Wiggle and BED, as these were required for their binary counterparts (BigWig and BigBed respectively) to work. Genoverse supported uncompressed VCF and BAM prior to the commencement of this project.

What work was done?

I first researched the existing C libraries used to parse these file formats (HTSlib, libBigWig) and ported them to JavaScript using Emscripten. However, I quickly realized that this approach was unreliable, since Emscripten does not translate C/C++ to JavaScript perfectly, and the resulting code is almost unreadable to humans and hence not maintainable. I therefore reverted to the original plan of writing my own parsers, taking inspiration from existing open-source implementations (Dalliance, libBigWig). This required both understanding the structure of the binary file formats and writing the JavaScript parsers themselves.
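
To give a flavour of what the handwritten parsers involve, below is a minimal illustrative sketch (not the code that was merged into Genoverse) of reading the fixed-size header of a remote BigWig/BigBed file with an HTTP Range request. The magic numbers and field offsets follow the UCSC bbiFile header layout.

```js
// Illustrative sketch only: fetch the first 64 bytes of a remote bbi file
// (BigWig/BigBed) and use the magic number to detect the format and byte order.
const BIGWIG_MAGIC = 0x888FFC26;
const BIGBED_MAGIC = 0x8789F2EB;

async function readBbiHeader(url) {
  // Ask the server for only the first 64 bytes of the file.
  const response = await fetch(url, { headers: { Range: 'bytes=0-63' } });
  const view     = new DataView(await response.arrayBuffer());

  // The magic number tells us both the format and the endianness.
  let littleEndian = true;
  let magic        = view.getUint32(0, littleEndian);

  if (magic !== BIGWIG_MAGIC && magic !== BIGBED_MAGIC) {
    littleEndian = false;
    magic        = view.getUint32(0, littleEndian);
  }

  if (magic !== BIGWIG_MAGIC && magic !== BIGBED_MAGIC) {
    throw new Error('Not a BigWig/BigBed file');
  }

  return {
    format         : magic === BIGWIG_MAGIC ? 'bigWig' : 'bigBed',
    littleEndian   : littleEndian,
    version        : view.getUint16(4, littleEndian),
    zoomLevels     : view.getUint16(6, littleEndian),
    fullDataOffset : Number(view.getBigUint64(16, littleEndian))
  };
}
```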

Finally, I added support for rendering these formats in Genoverse, either by reusing components of the existing Genoverse code base or by writing code from scratch:

  • tabix-indexed VCF: new parsing code, existing rendering code
  • BigWig and Wiggle: new parsing code and an extension of the pre-existing Bar.lineGraph drawing component
  • (Big)BED: new parsing code and new rendering code (a minimal parsing sketch follows this list).
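
As an illustration of the kind of parsing this involved, here is a minimal sketch of turning a single BED line into a feature object. The field names follow the standard 12-column BED definition; the code actually merged into Genoverse is more involved than this.

```js
// Illustrative sketch of BED parsing: a BED line has 3 required and up to
// 9 optional tab-separated fields.
const BED_FIELDS = [
  'chrom', 'chromStart', 'chromEnd',   // required
  'name', 'score', 'strand',           // optional from here on
  'thickStart', 'thickEnd', 'itemRgb',
  'blockCount', 'blockSizes', 'blockStarts'
];

function parseBedLine(line) {
  const values  = line.trim().split(/\t/);
  const feature = {};

  BED_FIELDS.slice(0, values.length).forEach(function (field, i) {
    feature[field] = values[i];
  });

  // BED coordinates are 0-based and half-open; convert them to numbers.
  feature.chromStart = parseInt(feature.chromStart, 10);
  feature.chromEnd   = parseInt(feature.chromEnd, 10);

  return feature;
}
```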

Current State of the Project

The goals for this project have been accomplished, in that support for BigWig, BigBed, compressed/tabix-indexed VCF, Wiggle and BED formats has been added to Genoverse. Having said this, there are still some known bugs and further work is needed.

How to use?

Method 1

This is suitable for small, uncompressed files, but it also works with binary files.

  1. Open http://wtsi-web.github.io/Genoverse/ in your browser.
  2. Drag any of your genome data files onto the browser area: the extensions .bw, .bb, .vcf.gz, .wig and .bed are all now supported through code written during this project.

Method 2 (local deployment)

This is suitable for attaching large files over HTTP.

  1. Clone this repository: git clone https://github.com/wtsi-web/Genoverse.git
  2. Copy the contents of this folder to your server and open http://SERVER_IP/Genoverse/expanded.html in your browser.
  3. Edit expanded.html to add a source track. For example, for a BigWig file:
Genoverse.Track.File.BIGWIG.extend({
  name : 'bigwig-demo',
  url  : 'path/to/bigwig/file'
});

For other formats, replace BIGWIG with the appropriate data type; for example, for BED it would be:

Genoverse.Track.File.BED.extend({
  name : 'bed-demo',
  url  : 'path/to/bed/file'
});

Note: for indexed VCF files, i.e. .vcf.gz files, you must enable the gz : true option as follows:

Genoverse.Track.File.VCF.extend({
  name : 'vcf-gz-demo',
  url  : 'path/to/vcf.gz/file',
  gz   : true
});

You can also use all of the normal track parameters that work with standard Genoverse tracks, for example height : 100, to set the track height (see the combined example after these steps).

  4. Save expanded.html and reload the page to see a new track of the type you have chosen. You can change the url field in the added track source to try a different remote file.
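
Putting the pieces together, a track definition added to expanded.html could look like the following. The track name and URL are placeholders, and gz and height are the options described above.

```js
// Example track definition combining the options shown above.
// Replace the url with your own remote file.
Genoverse.Track.File.VCF.extend({
  name   : 'my-variants',
  url    : 'http://example.com/data/variants.vcf.gz',
  gz     : true,   // required for tabix-indexed (.vcf.gz) files
  height : 100     // any normal Genoverse track parameter can be added
});
```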

Testing setup:

I initially set up a Jasmine + Karma test suite and automated the testing through Travis CI, so that every time I committed additional code to the repository, Travis automatically checked it against the test cases. During the course of my project a Mocha testing environment within the main Genoverse repo was made public, so I have not committed my own testing suite to the main repo.
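
For illustration, this is the kind of spec such a Jasmine + Karma suite might contain; parseBedLine is a stand-in name used only for this example, not necessarily a real function in the suite.

```js
// Hypothetical Jasmine spec, shown only to illustrate the testing style;
// parseBedLine is a stand-in for a parser function under test.
describe('BED parsing', function () {
  it('splits a minimal 3-column BED line into a feature', function () {
    const feature = parseBedLine('chr1\t1000\t5000');

    expect(feature.chrom).toBe('chr1');
    expect(feature.chromStart).toBe(1000);
    expect(feature.chromEnd).toBe(5000);
  });
});
```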

Repository contributed to:

Genoverse.

What code got merged?

The following pull requests were either automatically or manually merged into the main Genoverse repo by the author.

List of pull requests

| Link | Description |
| --- | --- |
| DECIPHER-genomics/Genoverse#37 | The binary VCF parser code was merged in this PR |
| DECIPHER-genomics/Genoverse#38 | Added parsing and rendering for BED data |
| DECIPHER-genomics/Genoverse#39 | Support added for Wiggle data |
| DECIPHER-genomics/Genoverse#40 | Respect thickStart and thickEnd fields while displaying BED data |
| DECIPHER-genomics/Genoverse#42 | Support added for BigWig and BigBed data |

What code didn't get merged?

I wrote a webapp to compare the speed and verify the correctness of my BigWig and BigBed parsers for remote files by comparing their output against that of Dalliance's parsers. This demonstrated that the contents were parsed correctly and showed no apparent difference in performance between my parsers and Dalliance's. However, the variability in timings between requests was very large, so I could not give accurate measurements of performance. The code has been committed to the Ensembl GSoC repository.
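
The comparison follows the general pattern sketched below; this is illustrative rather than the committed webapp code, and parseWithMyParser / parseWithDalliance are stand-in names for the real calls.

```js
// Sketch of the comparison approach: parse the same remote region with both
// implementations, then compare the results and the wall-clock time taken.
async function compareParsers(url, chr, start, end) {
  const t0     = performance.now();
  const mine   = await parseWithMyParser(url, chr, start, end);   // hypothetical call
  const t1     = performance.now();
  const theirs = await parseWithDalliance(url, chr, start, end);  // hypothetical call
  const t2     = performance.now();

  return {
    sameOutput  : JSON.stringify(mine) === JSON.stringify(theirs),
    myTimeMs    : t1 - t0,
    theirTimeMs : t2 - t1
  };
}
```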

The exploratory Emscripten work described above has not been committed.

Known bugs and further work:

  1. The performance of VCF parsing needs to be improved by making fewer network requests, as was done for BigWig. For example, if we need data from blocks with indices [1, 3, 4, 7], then rather than sending four network requests we could send a single request spanning block 1 through to the end of block 7 (its offset plus its data size). This would enable remote vcf.gz files to be processed in real time (a sketch of this idea follows this list).
  2. BED files that do not contain all 12 fields are currently not rendered.
  3. The present code fails to process some BigBed files.
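
A possible shape for the request coalescing described in point 1, assuming each block is known by its byte offset and size; this is a sketch of the idea, not committed code.

```js
// Sketch of the proposed optimisation: instead of one request per block,
// fetch a single byte range spanning the first to the last block needed,
// then slice the individual blocks out of the combined buffer.
// `blocks` is assumed to be sorted by offset: [{ offset, size }, ...].
async function fetchBlocks(url, blocks) {
  const first = blocks[0];
  const last  = blocks[blocks.length - 1];
  const start = first.offset;
  const end   = last.offset + last.size - 1;   // HTTP Range headers are inclusive

  const response = await fetch(url, { headers: { Range: 'bytes=' + start + '-' + end } });
  const buffer   = await response.arrayBuffer();

  // Slice each requested block back out of the single combined response.
  return blocks.map(function (block) {
    return buffer.slice(block.offset - start, block.offset - start + block.size);
  });
}
```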

File Format Explanations and Parsing:

| Explanation | Parsing |
| --- | --- |
| BIGWIG.md | BIGWIG_parsing.md |
| BED.md | BED_parsing.md |
| VCF.md | VCF_parsing.md |
| WIG.md | WIG_parsing.md |

Challenges and Learning:

The main challenge I faced was that I had absolutely no knowledge of bioinformatics when I started my GSoC project, but I got enormous amounts of help from my mentor and other members of my organization. I slowly learnt about the complexities of the project I had taken on: I learnt about NGS (Next Generation Sequencing) and why these large binary file formats are important to genome researchers. Once I understood their importance, working on the project gave me a new-found sense of satisfaction, as I understood the impact my code would have.

In fact, my main "aha" moments came when I fully understood how these file formats work: they are so cleverly designed for fast remote access that it is sheer genius. My favourite parts of the journey were my code reviews; I learnt a lot through them, and they improved my coding style. I learnt how to write tests and documentation, which are so important for maintaining the code. I have really enjoyed my journey so far, and I plan to keep contributing to Genoverse, as it helps me learn so many things I would never have learnt otherwise.