
Commit

Added a pipeline to process GFF3 files with Tabix and also store an index in an HDF5 file, in the same manner as the bed and wig files, for quick identification of which GFF files to request data from.
markmcdowall committed Mar 17, 2017
1 parent 9382e06 commit 9201c11
Showing 6 changed files with 484 additions and 1 deletion.
2 changes: 2 additions & 0 deletions docs/install.rst
@@ -9,6 +9,7 @@ Software

- Mongo DB 3.2
- Python 2.7.10+
- SAMtools

Python Modules
^^^^^^^^^^^^^^
@@ -20,6 +21,7 @@ Python Modules
- numpy
- h5py
- pyBigWig
- pysam

Installation
------------
42 changes: 41 additions & 1 deletion docs/pipelines.rst
@@ -80,7 +80,7 @@ WIG File Indexing
Returns
-------
BigWig : file
BigWig File
BigWig file
HDF5 : file
HDF5 index file

@@ -111,6 +111,46 @@ WIG File Indexing
.. autoclass:: process_wig.process_wig
:members:

GFF3 File Indexing
------------------
.. automodule:: process_gff3

This pipeline can process GFF3 files into Tabix and HDF5 index files for web
use.
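
The Tabix index makes region queries over the GFF3 cheap. As a minimal sketch (not part of this pipeline; the file name is illustrative), a REST endpoint could pull out just the features overlapping a requested region with pysam:

.. code-block:: python

   import pysam

   # open the bgzip-compressed, Tabix-indexed GFF3 produced by the pipeline
   tbx = pysam.TabixFile("expt.sorted.gff3.gz")

   # stream the GFF3 lines overlapping 1:1,000,000-1,010,000 back to the client
   for line in tbx.fetch("1", 1000000, 1010000):
       print(line)

   tbx.close()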

Running from the command line
=============================

Parameters
----------
assembly : str
Genome assembly ID (e.g. GCA_000001405.22)
gff3_file : str
Location of the source gff3 file
h5_file : str
Location of HDF5 index file

Returns
-------
Tabix : file
Tabix index file
HDF5 : file
HDF5 index file

Example
-------
When using a local version of the `COMPSs virtual machine <http://www.bsc.es/computer-sciences/grid-computing/comp-superscalar/downloads-and-documentation>`_:

.. code-block:: none
   :linenos:

   runcompss --lang=python /home/compss/mg-process-files/process_gff3.py --assembly GCA_000001405.22 --gff3_file <data_dir>/expt.gff3 --h5_file <data_dir>/expt.hdf5

Methods
=======
.. autoclass:: process_gff3.process_gff3
   :members:
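
The same indexing run can also be launched programmatically. Below is a minimal sketch mirroring the ``__main__`` block of ``process_gff3.py``; the ``"test"`` user and the ``<data_dir>`` paths are placeholders:

.. code-block:: python

   from basic_modules import WorkflowApp
   from dmp import dmp
   from process_gff3 import process_gff3

   # register the source GFF3 and the target HDF5 index with the DMP
   da = dmp()
   g3_file = da.set_file("test", "<data_dir>/expt.gff3", "gff3", "Assembly", "", None)
   h5_file = da.set_file("test", "<data_dir>/expt.hdf5", "hdf5", "index", "", None)

   # launch the workflow through the (py)COMPSs-aware WorkflowApp
   app = WorkflowApp()
   results = app.launch(process_gff3, [g3_file, h5_file],
                        {"assembly": "GCA_000001405.22"})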

3D JSON Indexing
----------------
.. automodule:: process_json_3d
5 changes: 5 additions & 0 deletions docs/tool.rst
@@ -13,6 +13,11 @@ Tools to index genomic files
.. autoclass:: tool.wig_indexer.wigIndexerTool
:members:

GFF3 Indexer
----------------
.. autoclass:: tool.gff3_indexer.gff3IndexerTool
:members:

3D JSON Indexer
----------------
.. autoclass:: tool.json_3d_indexer.json3dIndexerTool
133 changes: 133 additions & 0 deletions process_gff3.py
@@ -0,0 +1,133 @@
#!/usr/bin/python

# Sort the GFF3 file by chromosome and start position before indexing:
# (grep ^"#" in.gff3; grep -v ^"#" in.gff3 | sort -k1,1 -k4,4n) > out.sorted.gff3

"""
.. Copyright 2017 EMBL-European Bioinformatics Institute
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""

import argparse, urllib2, gzip, shutil, shlex, subprocess, os.path, json
from functools import wraps


from basic_modules import Tool, Workflow, Metadata
from dmp import dmp

import tool
import os

try:
    from pycompss.api.parameter import FILE_IN, FILE_OUT
    from pycompss.api.task import task
    from pycompss.api.constraint import constraint
except ImportError:
    print "[Warning] Cannot import \"pycompss\" API packages."
    print "          Using mock decorators."

    from dummy_pycompss import *


# ------------------------------------------------------------------------------

class process_gff3(Workflow):
    """
    Workflow to index GFF3 formatted files within the Multiscale Genomics
    (MuG) Virtual Research Environment (VRE)
    """

    def __init__(self):
        """
        Initialise the class
        """

    def run(self, file_ids, metadata):
        """
        Main run function to index the GFF3 files ready for use in the RESTful
        API. GFF3 files are indexed in 2 different ways to allow for optimal
        data retrieval. The first is as a Tabix file, which allows the data to
        be easily extracted as GFF3 documents and served to the user. The
        second is as an HDF5 file that is used to identify which GFF3 files
        have information at a given location. This helps the REST clients make
        only the required calls to the relevant GFF3 files rather than needing
        to poll all potential GFF3 files.

        Parameters
        ----------
        file_ids : list
            List of file locations
        metadata : list

        Returns
        -------
        outputfiles : list
            List of locations for the output Tabix and HDF5 files
        """

        gff3_file = file_ids[0]
        hdf5_file = file_ids[1]
        assembly = metadata["assembly"]

        # GFF3 Indexer
        tbi = tool.gff3IndexerTool(self.configuration)
        gff3_idx, h5_idx = tbi.run((gff3_file, hdf5_file), {'assembly': assembly})

        return (gff3_idx, h5_idx)

# ------------------------------------------------------------------------------

if __name__ == "__main__":
    import sys
    import os

    # Set up the command line parameters
    parser = argparse.ArgumentParser(description="Index the gff3 file")
    parser.add_argument("--assembly", help="Assembly")
    parser.add_argument("--gff3_file", help="GFF3 file to get indexed")
    parser.add_argument("--h5_file", help="HDF5 index file")

    # Get the matching parameters from the command line
    args = parser.parse_args()

    assembly = args.assembly
    gff3_file = args.gff3_file
    h5_file = args.h5_file

    #
    # MuG Tool Steps
    # --------------
    #
    # 1. Create data files
    #    This should have already been done by the VRE - potentially. If these
    #    are ones that are present in the ENA then I would need to download them

    # 2. Register the data with the DMP
    da = dmp()

    print da.get_files_by_user("test")

    g3_file = da.set_file("test", gff3_file, "gff3", "Assembly", "", None)
    h5_file = da.set_file("test", h5_file, "hdf5", "index", "", None)

    print da.get_files_by_user("test")

    # 3. Instantiate and launch the App
    from basic_modules import WorkflowApp
    app = WorkflowApp()
    results = app.launch(process_gff3, [g3_file, h5_file], {"assembly": assembly})

    print da.get_files_by_user("test")
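
The diff for tool/gff3_indexer.py (the sixth changed file) is not shown in this excerpt. Below is a minimal, hypothetical sketch of the two indexing steps described by the commit message and the run() docstring: bgzip + Tabix indexing with pysam, and a per-window presence index written to HDF5 with h5py. The function name, file naming and HDF5 layout here are illustrative assumptions, not the committed implementation.

# Hypothetical sketch only - not the committed tool/gff3_indexer.py code
import pysam
import h5py
import numpy as np

def gff3_index(sorted_gff3, hdf5_file, assembly, file_id):
    # 1. bgzip-compress the coordinate-sorted GFF3 and build the Tabix (.tbi) index
    bgzip_file = sorted_gff3 + ".gz"
    pysam.tabix_compress(sorted_gff3, bgzip_file, force=True)
    pysam.tabix_index(bgzip_file, preset="gff", force=True)

    # 2. Record which 1 kb windows of each chromosome contain features, so the
    #    REST layer can tell which GFF3 files are worth querying for a region
    with h5py.File(hdf5_file, "a") as h5_handle:
        grp = h5_handle.require_group(assembly)
        tabix_handle = pysam.TabixFile(bgzip_file)
        for chrom in tabix_handle.contigs:
            starts = [int(row.split("\t")[3]) for row in tabix_handle.fetch(chrom)]
            windows = (max(starts) // 1000) + 1
            presence = np.zeros(windows, dtype="b")
            for start in starts:
                presence[start // 1000] = 1
            dset = grp.require_dataset(chrom + "_" + file_id, (windows,), dtype="b")
            dset[...] = presence
        tabix_handle.close()

    return bgzip_file + ".tbi", hdf5_file

The presence-per-window encoding mirrors what the commit message says is already stored for the bed and wig files, which is what lets REST clients skip GFF3 files with no data in a requested region.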
3 changes: 3 additions & 0 deletions tool/__init__.py
@@ -15,6 +15,9 @@
"""

import bed_indexer
import gff3_indexer
import json_3d_indexer
import wig_indexer

__author__ = 'Mark McDowall'
__version__ = '0.0'
