- Author: Qianqian Fang <q.fang at neu.edu>
- License: Data files in the data folder are in the public domain; BSD License for codes in the src folder
- Version: 0.5
- URL: http://github.com/fangq/gbff2json
Table of content
- Overview
- How to parse genomic data files
- How to use the gbff2json converter
- Contribute to this project
We provide a GenBank .gbff file to JSON converter and various converted formats for the recently sequenced COVID-19 coronavirus genomic data, with a hope to facilitate automated processing in computer programs using Python, MATLAB, Javascript etc.
The source of the COVID-19 gemonic data is downloaded from NIH NCBI GenBank repository:
https://www.ncbi.nlm.nih.gov/nuccore/NC_045512
The follow data formats are provided:
- *.gbff: original GenBank gbff data format (https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)
- *.json: text-based JSON file (http://json.org)
- *.mat: MATLAB/GNU Octave .mat format (https://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf)
- *.pickle: Python pickle format (https://docs.python.org/3/library/pickle.html)
- *.jdt: text-based JData (JSON compatible) format (http://openjdata.org)
- *.jdb: binary JData (UBJSON compatible) format (http://openjdata.org and http://ubjson.org)
- *.h5: HDF5 format (https://www.hdfgroup.org/solutions/hdf5/)
- *.msgpk: MessagePack format (https://msgpack.org)
Most of the above data formats are widely supported data exchange formats, and can be loaded and processed in a variety of programming environments. Among these data formats, we strongly recommend .json and .jdt formats because they are both human-readable and easy to parse/convert.
To load the data in python, one can use the below sample codes
import json
from collections import OrderedDict
with open('datafile.json') as f:
covid19= json.load(f,object_pairs_hook=OrderedDict);
if you do not attach object_pairs_hook=OrderedDict
in the above command, the generated
dict
object may have random orders in the subfields.
First, install the pyjdata package via
pip install jdata
pip install bjdata
then open python, and run
import jdata
from collections import OrderedDict
covid19=jdata.loadt('datafile.jdt',object_pairs_hook=OrderedDict);
to load the text-based JData file, or
import jdata
covid19=jdata.loadb('datafile.jdb');
to load the binary-version of the jdata file.
import pickle
covid19=pickle.load( open( "datafile.pickle", "rb" ) )
import h5py
covid19=h5py.File('datafile.h5','r')
First, install the msgpack package via
pip install msgpack
then open python, and run
import msgpack
with open('datafile.msgpack') as fd:
covid19 = msgpack.unpack(fd)
Once the data is loaded in Python, the full data structured is typically stored as a nested dict
object.
One can access the individual subfields via python's standard object indexing and reference methods. For
example,
covid19['Version'] # this prints the Version subfield in the top level
covid19['Features']['Location/Qualifiers']['gene'].keys() # print the gene positions
covid19['Features']['Location/Qualifiers']['gene']['266..21555'] # print the gene between positions 266 and 21555
len(covid19['Origin']) # print the genome sequence length
covid19['Origin'][0:10] # print the first 10 nucleotide bases
To load the data in MATLAB/Octave, one should first download and install JSONLab toolbox from
https://github.com/fangq/jsonlab
To read the compressed JData files in Octave (not needed in MATLAB), one shall also install the ZMat toolbox from
If you use Fedora Linux, these toolboxes can be installed via
sudo dnf install octave-jsonlab
covid19=loadjson('datafile.json');
If you do not have JSONLab, in MATLAB 2018a or newer, you can also use the below command to load the JSON data
covid19=jsondecode(fileread('datafile.json'));
The text-based JData file is directly supported by JSONLab
covid19=loadjson('datafile.jdt');
similarly, the binary JData file can be loaded as
covid19=loadubjson('datafile.jdb');
Loading the .mat file does not need any additional toolboxes as it is the native format for MATLAB. One can simply
load GCF_009858895.2_ASM985889v3_genomic.gbff_matlab.mat
This will produce a struct variable named covid19
in the 'base' workspace.
To load the hdf5 file in MATLAB using a single command, you need to download the EasyH5 toolbox from the below links
https://github.com/fangq/easyh5
then you can load the file using
covid19=loadh5('datafile.h5');
The MessagePack-format is also supported by JSONLab
covid19=loadmsgpack('datafile.msgpack');
Once the data is loaded in Python, the full data structured is typically stored as a nested dict
object.
One can access the individual subfields via python's standard object indexing and reference methods. For
example,
covid19.Version # this prints the Version subfield in the top level
fieldnames(covid19.Features.Location_Qualifiers.gene) # print the gene positions
covid19.Features.Location_Qualifiers.gene.From_266_to_21555 # print the gene between positions 266 and 21555
length(covid19.Origin) # print the genome sequence length
covid19.Origin(1:10) # print the first 10 nucleotide bases
The gbff2json
converter is a Perl script that converts GenBank gbff files to a JSON file.
To see the help information, simply type gbff2json.pl
without any parameter. The below
help info will be printed
gbff2json.pl - converting GenBank database file to a JSON/JData file
Format: gbff2json.pl <options> input.gbff > output.json
The supported options include (multiple parameters can be used, separated by spaces)
-m convert key names to MATLAB friendly forms, otherwise, keeps the original key names
-C use all-caps for top-level key names, otherwise, capitalize only the first letter
-c print JSON in compact form, otherwise, print in the indented form
Example commands using this script include
gbff2json.pl datafile.gbff # print the JSON output to the terminal
gbff2json.pl datafile.gbff > datafile.json # print the JSON output to a file
echo datafile.gbff | gbff2json.pl -m # read from pipe and convert key names to remove special letters
echo datafile.gbff | gbff2json.pl -c >out.json# read from pipe and output compact JSON format
For better readability and easy extension, gbff2json
makes the below minor changes
to the original gbff data entry names:
- the top-level keys (
LOCUS, VERSION, ORIGIN
...) are converted to keep only capital form for the first letter. One can revert to the all-caps version by adding the"-C"
flag - the _matlab.json file contains a JSON structure with converted key-names to facilitate
data manipulation in MATLAB/Octave. Because special characters can not be used as structure
field names, we converted range operators
xxxxx..xxxxx
toFrom_xxxxx_to_xxxxx
, and replaced "/", "(", ")" and "," to "_"
The gbff2json.pl
converter requires the JSON
and Tie::IxHash
perl modules, which can be
install via
sudo apt-get install libtie-ixhash-perl libjson-perl
on Ubuntu/Debian used Linux, or run perl -MCPAN -e shell
and type
install JSON
install Tie::IxHash
in the CPAN shell to install these modules. If the cpan command is installed on your OS, you may install those directly using the below commands.
cpan JSON
cpan Tie::IxHash
Please submit your bug reports, feature requests and questions to the Github Issues page at
https://github.com/fangq/covid19/issues
Please feel free to fork our software, making changes, and submit your revision back to us via "Pull Requests". gbff2json is open-source and welcome to your contributions!