Skip to content

NCBI-Hackathons/EndoVir

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

84 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EndoVir

Discovery of Novel Endogenous Viruses

Abstract

This project strives to implement the BUD algorithm from ViruSpy in Python. In short, reads from a Short Read Archive (SRA) are screened for putative virus motifs/domains using known virus nucleotide an protein sequences. Sequences containing such motifs or domains serve as queries in subsequent searches using the same SRA to extend the initial sequence until non-virus sequences/domains are encountered. This indicates that either en exogenous virus has been identified or an endogenous virus within a the host genome.

Setup

Setup analysis enviroment:

  • git clone https://github.com/NCBI-Hackathons/EndoVir.git
  • cd Endovir
  • ./setup.sh

Install external tools:

All external tools have to be currently in $PATH. Please see the corresponding README files for installation instructions.

  • cd $ENDOVIR (should be in EndoVir/)
  • mkdir tools
  • cd tools MagicBLAST 1.3.0 [check for updates]
    • wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz
    • tar -xvzf ncbi-magicblast-1.3.0-x64-linux.tar.gz
    • export PATH=$PATH:$ENDOVIR/tools/ncbi-magicblast-1.3.0/bin/
    • sra-toolkit [check for updates; might also want to use ubuntu]
      • wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-centos_linux64.tar.gz
      • tar -xvzf sratoolkit.2.8.2-1-centos_linux64.tar.gz
      • export PATH=$PATH:$ENDOVIR/tools/sratoolkit.2.8.2-1-centos_linux64/bin/
    • MEGAHIT
    • [SOAPdenovo]
    • [SPADES]*
    • echo "PATH=$PATH:$ENDOVIRPATH" >> ~/.bashrc # only if you switch the console/login after install
    • source ~/.bashrc # just in case

Run

Change into your working directory work

  • cd $ENDOVIR/work
  • python3.6 ../src/endovir.py

Design

The underlying design of endovir will facilitate the use of external tools, e.g. assemblers or parser, without changing the BUD routine itself. Further, the results of the intermediate steps can be parsed and used to set the parameters for each subsequent step in the analysis pipeline.

The use of STDIN and STDOUT is used were possible to communicate between the external tools, thereby reducing the usage of intermediate files as much as possible. In addition, only the Python standard libraries should be used.

Workflow diagram

Endovir diagram

Pipeline approach

The pipline has three major steps (in src):

  • endovir.Endovir(): creates the analysis environment and prepares the screening.

  • screener.Screener(): Initiates the screen and identifies the initial, putative virus contigs.

  • viruscontig.VirusContig(): Each putative virus contig is expanded and analyzed independently.

external screening tools, e.g MagicBLAST, have wrappers and parser in their corresponding namespace in lib.

Dependencies

Only the Python standard libraries are used. However, the pipeline depends on several external tools which are called using the subprocess module:

References:

About

Discovery of Novel Endogenous Viruses

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published