iGDP-rc v1.1.0

An integrated Genome Decontamination Pipeline (iGDP) for rumen ciliates

iGDP-rc can act as a positive or a negative filter to recover target rumen ciliate sequences from genomic sequencing data containing various contaminants. It integrates three approaches: homology search, telomere read-assisted recall and clustering.

Issues, bug reports and feature requests: GitHub issues
Contact: Fei Xie (xiefei_njau@163.com); Chuanqi Jiang (jiangchuanqi@ihb.ac.cn)
Citation: Chuanqi Jiang, Guangying Wang, Jing Zhang, Siyu Gu, Xueyan Wang, Weiwei Qin, Kai Chen, Dongxia Yuan, Xiaocui Chai, Mingkun Yang, Fang Zhou, Jie Xiong, Wei Miao (2023). iGDP: An integrated Genome Decontamination Pipeline for wild ciliated microeukaryotes. Molecular Ecology Resources, 23, 1182–1193.

Install

Dependencies (skip any that are already installed)

# mmseqs2 (>=v13.45111)
$ conda install -c bioconda mmseqs2
  
# bwa (>=v0.7.17)
$ conda install -c bioconda bwa
  
# samtools (>=v1.7)
$ conda install -c bioconda samtools
  
# metabat2 (>=v2.12.1)
$ conda install -c bioconda metabat2
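
Before running the pipeline, it can help to confirm that the four tools are visible on your PATH. The snippet below is a minimal sketch and not part of iGDP-rc: check_tools is a hypothetical helper, and it assumes the executables are named mmseqs, bwa, samtools and metabat2. It does not check version numbers.

```shell
# check_tools: report which of the given executables can be found on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Executable names are assumed to match the conda package names above.
check_tools mmseqs bwa samtools metabat2 ||
  echo "install the missing tools before running iGDP-rc" >&2
```

Any tool reported as missing can be installed with the matching conda command above.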

iGDP-rc

$ git clone https://github.com/GWang2022/iGDP.git
# give executable permission to all scripts in iGDP scripts directory
$ chmod a+x iGDP/scripts/*pl
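# add the iGDP-rc scripts directory to your PATH environment variable
# (run this from the directory that contains the iGDP/ clone; the double quotes expand $(pwd) now, so the absolute path is stored)
$ echo "export PATH=$(pwd)/iGDP/scripts:\$PATH" >> ~/.bashrc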
$ source ~/.bashrc

Download the NCBI NR protein database using mmseqs

# Usage: mmseqs databases <name> <o:sequenceDB> <tmpDir> [options]
# Download the NR database into your working directory with the prefix 'NRdb'
$ mmseqs databases NR NRdb tmpDir

Tip: You can create your own database for homology search using the mmseqs createdb module. For more details, see the mmseqs documentation.

Usage

Workflow

Run iGDP-rc

Run the homology search program

$ iGDP_homology_search.pl -i <input.contigs.fa> -o <output_dir> -d <mmseqs_DB> [options]

options:
  -i <required>:  input assembled contigs [.gz or uncompressed]
  -o <required>:  output directory [e.g. homology_search]
  -d <required>:  database for mmseqs search
  -rank [optional]: target taxonomic space of homology search [format: rank:taxon; rank must be phylum/class/order/family/genus/species
                  and taxon must begin with a capital letter; default: phylum:Ciliophora]
  -b [optional]:  bin size [contig is cut to -b bp for homology search; default: 1000]
  -s [optional]:  mmseqs search sensitivity [1.0 faster; 4.0 fast; 7.5 sensitive; default: 5.7]
  -t [optional]:  number of threads used for mmseqs [default: 72]
  -T [optional]:  translation table of the target genome [default: 6 for ciliates]
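
The -rank format described above can be checked before launching a long search. The following is a minimal sketch of such a pre-check, assuming only the rules stated in the option description. valid_rank is a hypothetical helper, not part of iGDP-rc, and the script's own validation may differ.

```shell
# valid_rank: succeed if the argument looks like rank:taxon, where rank is one
# of the six listed ranks and taxon begins with a capital letter.
valid_rank() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^(phylum|class|order|family|genus|species):[A-Z][^:]*$'
}

rank_arg="phylum:Ciliophora"   # the documented default
if valid_rank "$rank_arg"; then
  echo "ok: -rank $rank_arg"
else
  echo "invalid -rank value: $rank_arg" >&2
fi
```

A value such as phylum:ciliophora (lowercase taxon) or kingdom:Fungi (unlisted rank) is rejected by this check.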

Run the telomere reads-assisted program

$ iGDP_telomere_reads.pl -i <input.contigs.fa> -o <output_dir> -r1 <reads1> -r2 <reads2> [options]

options:
  -i  <required>:  input assembled contigs [.gz or uncompressed]
  -o  <required>:  output directory [e.g. telomere_reads]
  -r1 <required>:  read1 input file name [.gz or uncompressed]
  -r2 <required>:  read2 input file name [.gz or uncompressed]
  -u  [optional]:  5' to 3' telomeric repeat unit of the target genome [default: CCCCAA for Tetrahymena species]
  -b  [optional]:  threads for bwa mem [default: 8]
  -s  [optional]:  threads for samtools view [default: 8]
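
The -u option expects the repeat unit written 5' to 3'. Reads sequenced from the opposite strand carry the reverse complement of that unit, which is worth keeping in mind when inspecting reads by hand. The helper below is an illustrative sketch: revcomp is a hypothetical name and is not part of iGDP-rc, and how iGDP_telomere_reads.pl treats the two strands internally is not described here.

```shell
# revcomp: print the reverse complement of a DNA string (A/C/G/T, either case).
revcomp() {
  printf '%s\n' "$1" |
    tr 'ACGTacgt' 'TGCAtgca' |
    awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

revcomp CCCCAA   # prints TTGGGG
```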

Run the clustering program

$ iGDP_clustering.pl -i <input.contigs.fa> -o <output_dir> -r1 <reads1> -r2 <reads2> [options]

options:
  -i  <required>:  input assembled contigs [.gz or uncompressed]
  -o  <required>:  output directory [e.g. clustering]
  -r1 <required>:  read1 input file name [.gz or uncompressed]
  -r2 <required>:  read2 input file name [.gz or uncompressed]
  -b  [optional]:  threads for bwa mem [default: 8]
  -s  [optional]:  threads for samtools view [default: 8]

Tip: iGDP_clustering.pl must be run after both iGDP_homology_search.pl and iGDP_telomere_reads.pl have finished.
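
One way to enforce this order is a small wrapper that stops at the first failing step. The sketch below is an assumption-laden illustration rather than part of iGDP-rc: run and run_igdp are hypothetical helpers, the output directory names are illustrative, and DRY_RUN=1 only prints the commands so the order can be reviewed without running anything.

```shell
# run: execute a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# run_igdp <contigs> <reads1> <reads2> <mmseqs_DB>
# Runs the three programs in the required order; '&&' stops at the first failure.
run_igdp() {
  contigs=$1
  reads1=$2
  reads2=$3
  db=$4
  run iGDP_homology_search.pl -i "$contigs" -o homology_search -d "$db" &&
  run iGDP_telomere_reads.pl -i "$contigs" -o telomere_reads -r1 "$reads1" -r2 "$reads2" &&
  run iGDP_clustering.pl -i "$contigs" -o clustering -r1 "$reads1" -r2 "$reads2"
}

# Preview the commands; the file and database names are placeholders.
DRY_RUN=1
run_igdp contigs.fa.gz reads1.fq.gz reads2.fq.gz path_to_NR/NRdb
```

Setting DRY_RUN=0 (or leaving it unset) runs the three commands for real.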

An example of running iGDP-rc

Positive filtering mode (default)

This mode directly selects ciliate sequences as the target genome.

After downloading iGDP and the NR protein database, enter the iGDP/ directory. The example/ directory contains three files:

The file assemly.fa.gz is a contaminated genome assembly.
The files read1.fq.gz and read2.fq.gz are paired-end short-read sequencing data for the above genome.

Enter the example/ directory and run the following commands:

$ iGDP_homology_search.pl -i assemly.fa.gz -o homology_search -d {path_to_NR}/NRdb
$ iGDP_telomere_reads.pl -i assemly.fa.gz -o telomere_reads -r1 read1.fq.gz -r2 read2.fq.gz
$ iGDP_clustering.pl -i assemly.fa.gz -o clustering -r1 read1.fq.gz -r2 read2.fq.gz

The following data files will then be created in the example/ directory:

The files homology_search.homology.recall.contigs, telomere_reads.telo_reads.recall.contigs and clustering.contigs contain the contig IDs obtained by the iGDP_homology_search.pl, iGDP_telomere_reads.pl and iGDP_clustering.pl programs, respectively.
The folders homology_search/, telomere_reads/ and clustering/ contain intermediate data files generated by the above commands.
The file final_genome.fa is the final genome after contamination removal.
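
A quick sanity check after the run is to compare the number of sequences in the input assembly with the number kept in the final genome. The snippet below is a minimal sketch, not part of iGDP-rc: count_seqs is a hypothetical helper that counts FASTA header lines and handles both .gz and uncompressed files.

```shell
# count_seqs: print the number of FASTA records (lines starting with '>').
count_seqs() {
  case $1 in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | grep -c '^>'
}

# Report counts for the example input and the decontaminated output, if present.
for f in assemly.fa.gz final_genome.fa; do
  if [ -f "$f" ]; then
    echo "$f: $(count_seqs "$f") sequences"
  fi
done
```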

Negative filtering mode

This mode first selects sequences from all non-Ciliophora contaminants and then keeps the rest as the target genome. Compared with positive filtering, the genome obtained in this mode usually has higher completeness but lower precision.

After running iGDP_homology_search.pl and iGDP_telomere_reads.pl as above, run the following command:

$ iGDP_clustering_negative.pl -i assemly.fa.gz -o clustering_negative -r1 read1.fq.gz -r2 read2.fq.gz

The file final_genome.negative.fa is the final genome after contamination removal.
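
If both modes have been run, listing the contigs that appear only in the negative-mode result shows where the two modes disagree. This is a minimal sketch under the assumption that both output files are FASTA files in the current directory. fasta_ids and extra_in_negative are hypothetical helpers, not part of iGDP-rc.

```shell
# fasta_ids: print the sorted, unique sequence IDs of a FASTA file
# (header text up to the first whitespace, without the leading '>').
fasta_ids() {
  grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*//' | sort -u
}

# extra_in_negative <positive.fa> <negative.fa>
# Print IDs present in the negative-mode genome but absent from the positive one.
extra_in_negative() {
  pos_ids=$(mktemp)
  neg_ids=$(mktemp)
  fasta_ids "$1" > "$pos_ids"
  fasta_ids "$2" > "$neg_ids"
  comm -13 "$pos_ids" "$neg_ids"
  rm -f "$pos_ids" "$neg_ids"
}

if [ -f final_genome.fa ] && [ -f final_genome.negative.fa ]; then
  extra_in_negative final_genome.fa final_genome.negative.fa
fi
```

Contigs listed here are kept only by negative filtering, so they are natural candidates for manual inspection.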

Update

2022/10/14
- integrate the clustering program into iGDP
- add the -rank option, allowing the user to set the homology search space for the target species
2023/01/25
- add negative filtering mode to iGDP. This mode is suitable for genomic data without contamination from other ciliates, such as single-cell sequencing data.
