
iGDP-rc v1.1.0

Good Active GPL

An integrated Genome Decontamination Pipeline (iGDP) for rumen ciliates

iGDP-rc can work as a "positive or negative filter" to obtain target rumen ciliate sequences from genomic sequencing data containing various contaminants by integrating homology search, telomere reads-assisted and clustering approaches.



Install

  • Dependencies (skip any tool that is already installed)

# mmseqs2 (>=v13.45111)
$ conda install -c bioconda mmseqs2
  
# bwa (>=v0.7.17)
$ conda install -c bioconda bwa
  
# samtools (>=v1.7)
$ conda install -c bioconda samtools
  
# metabat2 (>=v2.12.1)
$ conda install -c bioconda metabat2
  • iGDP-rc

$ git clone https://github.com/GWang2022/iGDP.git
# give executable permission to all scripts in iGDP scripts directory
$ chmod a+x iGDP/scripts/*pl
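# add the iGDP-rc scripts directory to your PATH environment variable
# (run this from the directory that contains iGDP/; double quotes make $(pwd) expand now, so the absolute path is stored in ~/.bashrc)
$ echo "export PATH=$(pwd)/iGDP/scripts:\$PATH" >> ~/.bashrc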
$ source ~/.bashrc
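Before running the pipeline, it can help to confirm that the dependencies and the iGDP-rc scripts are actually visible on your PATH. The following is a minimal sketch, not part of iGDP-rc: the function name `check_tools` is hypothetical, and the tool names are the executables installed by the conda packages listed above.

```shell
# check_tools: print a "missing: <name>" line for every command that is not found on PATH.
# Returns 0 when all commands are found, 1 otherwise.
# (Hypothetical helper for illustration; not shipped with iGDP-rc.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools mmseqs bwa samtools metabat2 iGDP_homology_search.pl \
#       || echo "install the missing tools before continuing"
```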

Download NCBI NR protein database using mmseqs

# Usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]
# Downloading NR database named with prefix 'NRdb' in your working directory using the following command
$ mmseqs databases NR NRdb tmpDir

Tip: You can create your own database for homology search using the mmseqs createdb module. For more details, see the mmseqs documentation.

Usage

Workflow

Run iGDP-rc

  • Run the homology search program

$ iGDP_homology_search.pl -i <input.contigs.fa> -o <output_dir> -d <mmseqs_DB> [options]

options:
  -i <required>:  input assembled contigs [.gz or uncompressed]
  -o <required>:  output directory [e.g. homology_search]
  -d <required>:  database for mmseqs search
  -rank [optional]: target taxonomic space of homology search [format, rank:taxon; rank must be
                  phylum/class/order/family/genus/species and taxon begins with a capital letter;
                  default: phylum:Ciliophora]
  -b [optional]:  bin size [contig is cut to -b bp for homology search; default: 1000]
  -s [optional]:  mmseqs search sensitivity [1.0 faster; 4.0 fast; 7.5 sensitive; default: 5.7]
  -t [optional]:  number of threads used for mmseqs [default: 72]
  -T [optional]:  translation table of the target genome [default: 6 for ciliates]
  • Run the telomere reads-assisted program

$ iGDP_telomere_reads.pl -i <input.contigs.fa> -o <output_dir> -r1 <reads1> -r2 <reads2> [options]

options:
  -i  <required>:  input assembled contigs [.gz or uncompressed]
  -o  <required>:  output directory [e.g. telomere_reads]
  -r1 <required>:  read1 input file name [.gz or uncompressed]
  -r2 <required>:  read2 input file name [.gz or uncompressed]
  -u  [optional]:  5' to 3' telomeric repeat unit of the target genome [default: CCCCAA for Tetrahymena species]
  -b  [optional]:  threads for bwa mem [default: 8]
  -s  [optional]:  threads for samtools view [default: 8]
  • Run the clustering program

$ iGDP_clustering.pl -i <input.contigs.fa> -o <output_dir> -r1 <reads1> -r2 <reads2> [options]

options:
  -i  <required>:  input assembled contigs [.gz or uncompressed]
  -o  <required>:  output directory [e.g. clustering]
  -r1 <required>:  read1 input file name [.gz or uncompressed]
  -r2 <required>:  read2 input file name [.gz or uncompressed]
  -b  [optional]:  threads for bwa mem [default: 8]
  -s  [optional]:  threads for samtools view [default: 8]

Tip: Run iGDP_clustering.pl only after both iGDP_homology_search.pl and iGDP_telomere_reads.pl have finished.
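Because the clustering step depends on the two earlier steps, it can be convenient to chain all three so that a failure stops the run. The following is a minimal sketch under stated assumptions: `run_igdp`, `run_step` and the `DRY_RUN` variable are hypothetical names, not part of iGDP-rc, and the output directory names are simply the example values from the option lists above.

```shell
# run_step: execute a command, or only print it when DRY_RUN=1.
# (Hypothetical helper for illustration; not shipped with iGDP-rc.)
run_step() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# run_igdp: run the three iGDP-rc programs in the required order.
# Arguments: <contigs> <reads1> <reads2> <mmseqs_DB>
# The && chain stops at the first step that exits with a non-zero status.
run_igdp() {
    contigs=$1; r1=$2; r2=$3; db=$4
    run_step iGDP_homology_search.pl -i "$contigs" -o homology_search -d "$db" &&
    run_step iGDP_telomere_reads.pl -i "$contigs" -o telomere_reads -r1 "$r1" -r2 "$r2" &&
    run_step iGDP_clustering.pl -i "$contigs" -o clustering -r1 "$r1" -r2 "$r2"
}
```

Setting `DRY_RUN=1` before calling `run_igdp` prints the three commands without executing them, which is a cheap way to check the arguments before starting a long mmseqs search.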

An example of running iGDP-rc

Positive filtering mode (default)

This mode directly selects ciliate sequences as the target genome.

After downloading iGDP and the NR protein database, enter the iGDP/ directory. The example/ directory contains three files:

  • The file assemly.fa.gz is a contaminated genome assembly.
  • The files read1.fq.gz and read2.fq.gz are paired-end short-read sequencing data for the above genome.

Enter the example/ directory and run the following commands:

$ iGDP_homology_search.pl -i assemly.fa.gz -o homology_search -d {path_to_NR}/NRdb
$ iGDP_telomere_reads.pl -i assemly.fa.gz -o telomere_reads -r1 read1.fq.gz -r2 read2.fq.gz
$ iGDP_clustering.pl -i assemly.fa.gz -o clustering -r1 read1.fq.gz -r2 read2.fq.gz

The following data files will then be created in the example/ directory:

  • The files homology_search.homology.recall.contigs, telomere_reads.telo_reads.recall.contigs and clustering.contigs contain the contig IDs obtained by iGDP_homology_search.pl, iGDP_telomere_reads.pl and iGDP_clustering.pl, respectively.

  • The folders homology_search/, telomere_reads/ and clustering/ contain intermediate data files generated by the above commands.

  • The file final_genome.fa is the final genome after contamination removal.
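To see how much of the contaminated assembly was retained, you can compare the sequence count and total length of the input and the final genome. The following is a minimal sketch that assumes both files are standard FASTA; the function name `fasta_stats` is hypothetical and not part of iGDP-rc.

```shell
# fasta_stats: print "<number of sequences> <total bases>" for FASTA input.
# Reads the file given as an argument, or standard input when no argument is given.
# (Hypothetical helper for illustration; not shipped with iGDP-rc.)
fasta_stats() {
    awk '
        /^>/ { n++; next }                      # header line: count one sequence
        { gsub(/[ \t\r]/, ""); bases += length($0) }   # sequence line: add its length
        END { print n + 0, bases + 0 }
    ' "$@"
}

# Usage:
#   fasta_stats final_genome.fa
#   gunzip -c assemly.fa.gz | fasta_stats
```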

Negative filtering mode

This mode first selects sequences from all non-Ciliophora contaminants and then keeps the rest as the target genome. Compared with positive filtering, the genome obtained by this mode usually has higher completeness but lower precision.

After running iGDP_homology_search.pl and iGDP_telomere_reads.pl as above, run the following command:

$ iGDP_clustering_negative.pl -i assemly.fa.gz -o clustering_negative -r1 read1.fq.gz -r2 read2.fq.gz
  • The file final_genome.negative.fa is the final genome after contamination removal.
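Since the two modes trade completeness against precision, it can be useful to list the contigs that only the negative mode kept. The following is a minimal sketch, assuming both outputs are standard FASTA files in which contig IDs are unchanged from the input assembly; the function names `fasta_ids` and `only_in_second` are hypothetical and not part of iGDP-rc.

```shell
# fasta_ids: print the sorted, unique sequence IDs (first word of each header line).
# (Hypothetical helper for illustration; not shipped with iGDP-rc.)
fasta_ids() {
    awk '/^>/ { sub(/^>/, ""); print $1 }' "$1" | sort -u
}

# only_in_second: print IDs present in the second FASTA file but absent from the first.
only_in_second() {
    a=$(mktemp) && b=$(mktemp) || return 1
    fasta_ids "$1" > "$a"
    fasta_ids "$2" > "$b"
    comm -13 "$a" "$b"      # -13 suppresses lines unique to the first file and lines common to both
    rm -f "$a" "$b"
}

# Usage:
#   only_in_second final_genome.fa final_genome.negative.fa
```

Contigs reported here are the candidates to inspect first if the negative-mode genome appears to contain residual contamination.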

Update

  • 2022/10/14
    • integrate the clustering program into iGDP
    • add the -rank option, allowing users to set the homology search space for the target species.
  • 2023/01/25
    • add negative filtering mode to iGDP. This mode is suitable for genomic data without contamination from other ciliates, such as single-cell sequencing data.
