ViralCellDetector is an R-based pipeline for detecting viral contamination in RNA-seq samples from any host species. It accepts raw RNA-seq FASTQ files as input and outputs a list of putative viruses identified in each sample. The results can be visualized in genome browsers such as IGV (Integrative Genomics Viewer).
git clone https://github.com/Bin-Chen-Lab/ViralCellDetector
cd ViralCellDetector

Before running the detection script, download the host genome and annotation files by executing the provided shell script.
- Edit Genome_file.txt to include the appropriate FTP links for your species' genome and annotation files. For example, for the human genome:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh38.primary_assembly.genome.fa.gz
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
- Then run:
bash Genome_index.sh Genome_file.txt

This will:
- Download the host genome and annotation files
- Download the viral genomes from NCBI
- Build STAR and BWA index files for alignment
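
Because this step builds STAR and BWA indexes, it can help to confirm first that both aligners are reachable on the PATH. The following is a minimal sketch and not part of the repository: the function name check_tools is hypothetical, and the executable names STAR, bwa and Rscript are assumed to be the standard ones for those tools.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not shipped with ViralCellDetector).
# Prints every tool name that cannot be found on PATH; returns 1 if any is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumed executable names: STAR, bwa, Rscript.
check_tools STAR bwa Rscript || echo "Install the tools listed above before running Genome_index.sh"
```

The helper only checks that the executables exist; it does not verify versions.
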

Then prepare the input files:
- Create a folder named fastq/ inside the ViralCellDetector directory and move your FASTQ files into this folder.
- Create a file named sample_input.txt containing the base names (without the _1.fastq or _2.fastq suffixes) of your paired-end FASTQ files.
Example entry for a sample: input_file
For instance, if the paired-end files are input_file_1.fastq and input_file_2.fastq, write only input_file in sample_input.txt.
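
For a directory with many samples, the list can be generated rather than typed by hand. This is a minimal sketch, assuming uncompressed files that follow the _1.fastq / _2.fastq naming described above; the function name list_samples is hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one base name per *_1.fastq file in a directory.
list_samples() {
  local dir="$1" f
  for f in "$dir"/*_1.fastq; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when no files match
    basename "$f" _1.fastq    # strip the directory and the _1.fastq suffix
  done
}

# Write the base names of everything in fastq/ to sample_input.txt.
list_samples fastq > sample_input.txt
```

Only the _1 files are scanned, so a sample with a missing _2 file would still be listed.
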
After the setup, execute the main R script:
Rscript ViralCellDetector.R sample_input.txt

The script expects the following input:
- Raw RNA-seq FASTQ files (preferably rRNA-depleted, though poly-A data can also be used)
- A text file (sample_input.txt) listing sample base names
- A fastq/ directory containing all input FASTQ files
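
Since the sample list and the fastq/ directory must agree, a quick consistency check before launching the R script can catch typos early. The sketch below is an illustration only: check_pairs is a hypothetical name, and it assumes uncompressed files named <base>_1.fastq and <base>_2.fastq, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical check: report samples whose paired files are missing.
# Arguments: sample list file, FASTQ directory. Returns 1 if anything is missing.
check_pairs() {
  local list="$1" dir="$2" sample mate status=0
  while IFS= read -r sample || [ -n "$sample" ]; do
    [ -n "$sample" ] || continue          # ignore blank lines
    for mate in 1 2; do
      if [ ! -f "$dir/${sample}_${mate}.fastq" ]; then
        echo "missing: $dir/${sample}_${mate}.fastq"
        status=1
      fi
    done
  done < "$list"
  return "$status"
}

# Run the check only when a sample list is present in the current directory.
if [ -f sample_input.txt ]; then
  check_pairs sample_input.txt fastq || echo "Fix the files listed above before running ViralCellDetector.R"
fi
```
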

The pipeline produces the following output:
- A summary file for each sample listing:
  - Putative virus names
  - Genome size
  - Number of mapped reads
  - Covered genomic regions
The viral mapping file can be loaded into IGV or other genome browsers for visualization.
- Ensure high-quality sequencing reads for optimal detection.
- Poly-A enriched data may result in reduced viral diversity compared to rRNA-depleted data.
Make sure the following tools and R packages are installed before running the script:
Install the R packages via CRAN if not already available:
install.packages(c("dplyr", "data.table", "tidyr"))

To run the example test case:
bash Run_test_ViralCellDetector.sh

ViralCellDetector is developed and maintained by the Bin Chen Lab.
For questions or suggestions, please contact:
- Rama Shankar, PhD – ramashan@msu.edu
- Bin Chen, PhD (PI) – Chenbi12@msu.edu