Skip to content

khturner/metaRNA-seq

Repository files navigation

metaRNA-seq

Custom scripts for creating, annotating, and determining differential gene expression in multispecies bacterial metagenomes

Copyright (c) 2013 Peter A. Jorth, Keith H. Turner, Pinar Gumus, Nejat Nizam, Nurcan Buduneli, Marvin Whiteley

The scripts contained herein can be used to assemble .gff and .fna files representing a multispecies bacterial "metaorganism" for bioinformatic analyses. Brief descriptions of the scripts follow, but also see the scripts themselves and the usage instructions on the command line for information about how to use them. Please direct any questions on their use or construction to Keith H. Turner (khturner at utexas.edu).

GenomeMerge.pl

This program takes a tab-separated list of bacterial genome directory names (find at ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/) and the associated accession numbers and joins the .fna and the .gff files to make new .fna and .gff files specifying one "metaorganism" encompassing all of them.

usage: ./GenomeMerge.pl (genomes).txt (out_pfx)

Arguments:

(genomes).txt - A tab-separated list of directory names and accession numbers

(out_pfx) - The prefix you want used for the output files

Dependencies:

-Net::FTP (CPAN)

PullEC.pl

This program takes a GFF file from GenomeMerge.pl on STDIN, searches http://www.kegg.jp for the locus_tag and returns a new GFF file with EC and Kegg pathway information on STDOUT.

usage: ./PullEC.pl < (in).gff > (out.gff)

Dependencies:

-wget (Unix)

HOMDpull.sh

This script takes a .gbk file from a dynamically-annotated genome on the Human Oral Microbiome Database (http://www.homd.org/index.php?&name=seqDownload&type=G) with prefix (prefix), generates a .fna file of the DNA sequence and a .gff file with EC number (obtained from the associated _swisstopmatch.xls file) and KEGG pathway annotations (using the same prefix provided).

usage: ./HOMDpull.sh (prefix)

Requires: (prefix).gbk and (prefix)_swisstopmatch.xls from HOMD

Dependencies:

-bp_seqconvert.pl (BioPerl)

-HOMD_GenomeMerge.pl (this package)

HOMD_GenomeMerge.pl

This is a suite of utilities used in a pipeline to merge genomes from the Human Oral Microbiome Database (http://www.homd.org).

Usage: ./HOMD_GenomeMerge.pl (command) [arg]

(command) is a required argument that has one of the following values:

GFF_Cleaner - takes a GFF file generated from a GenBank file from the HOMD by bp_genbank2gff on STDIN and parses and relabels the 9th field to match the GFF file notation as in GenBank proper on STDOUT.

Get_EC - takes an (ID, SwissProt Accession No.) pair on STDIN, searches UniProt for the accession number and returns (ID, EC number) on STDOUT (provide genome prefix at [arg]).

EC_Pathways - takes an (ID, EC number) pair on STDIN, searches http://www.kegg.jp for the EC number and returns (ID, EC number, Kegg pathway information) on STDOUT (provide genome prefix at [arg]).

Tag_GFF - reads (ID, EC number, Kegg pathway information) from file [arg] and prints (EC number, Kegg pathway information) on the correct line of the GFF file read from STDIN on STDOUT.

GFF_Sort - takes a GFF file generated by our HOMD annotation pipeline on STDIN, sorts it by (contig)-(position) and returns it on STDOUT.

Dependencies:

-URI::Escape (CPAN)

-wget (Unix)

ECcounter.pl

This is a perl script that takes a tab-delimited table with EC numbers in the first column and read counts in the following 6 columns. The table needs to be sorted by EC number. The script then scans through each row in the table and sums the total read count for each unique EC number for each replicate.

Usage: ./ECcounter.pl (ECcount_file)

MapCount_RNASeq.sh

This UNIX shell script uses Bowtie 2.0 to map metaRNA-seq reads to a metagenome, and counts the number of reads mapping to each feature in the metagenome using HTSeq.

Usage: ./MapCount_RNASeq.sh (in_file) (out_pfx) (assembly) (threads(x))

in_file Path of the trimmed input fastq file.

out_pfx Desired prefix of output files.

assembly Prefix for assembly.

threads 1 = 1 processing thread; 2 = 2 processing threads, etc.

Requires: a fastq formatted in_file, a metagenome fasta file indexed by Bowtie 2.0, and a metagenome gff annotation file.

The following variable needs to be defined in the user .profile for proper execution:

METARNASEQDIR=/path/to/metagenome

where metagenome is the directory containing the gff file and the Bowtie indices.

Dependencies:

-Bowtie 2.0 (http://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.1.0/)

-htseq-count (https://pypi.python.org/pypi/HTSeq)

calcRNASeqPaired.sh

This shell script joins read count files produced by MapCount_RNASeq.sh and then determines differentially expressed genes in the metagenome using a paired EdgeR analysis. The end results are a table with differentially expressed genes, a mutlidimensional scaling plot showing relatedness of samples, and a plot showing the goodness of fit for the variance estimation.

Usage: ./calcRNASeqPaired.sh [-o ] [-c ] [-x <#>] [-t ] [-y <#>] ...

Requires: Pairwise_edgeR.R must be present in the directory defined by the user .profile for proper execution:

METARNASEQDIR=/path/to/metagenome

Dependencies:

-EdgeR

-Pairwise_edgeR.R (this package)

PairwiseEdgeR.sh

This shell script reads in a tab-delimited table with genes or EC numbers in the first column, followed by raw read counts per gene/EC in the following columns. Then the script determines differentially expressed genes in the metagenome using a paired EdgeR analysis. The end results are a table with differentially expressed genes, a mutlidimensional scaling plot showing relatedness of samples, and a plot showing the goodness of fit for the variance estimation.

Usage: ./PairwiseEdgeR.sh [-o ] [-c ] [-x <#>] [-t ] [-y <#>] ...

Requires: Pairwise_edgeR.R must be present in the directory defined by the user .profile for proper execution:

METARNASEQDIR=/path/to/metagenome

Dependencies:

-EdgeR

-Pairwise_edgeR.R (this package)