Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
GSTAr: Generic Small RNA-Transcriptome Aligner
Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
LICENSE GSTAr.pl Copyright (c) 2013 Michael J. Axtell This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. SYNOPSIS GSTAr : Generic Small RNA-Transcriptome Aligner. Flexible RNAplex-based alignment of miRNAs and siRNAs (15-26 nts) to a transcriptome. AUTHOR Michael J. Axtell, Penn State University, firstname.lastname@example.org VERSION 1.0 : September 17, 2013 INSTALL Dependencies perl RNAplex (from Vienna RNA package) RNAplex must be executable from your PATH. GSTAr.pl was developed using RNAplex version RNAplex 2.1.3. It has not been tested on other versoins of RNAplex. Installation No "real" installation. If the script is in your working directory, you can call it with ./GSTAr.pl For convenience, you can add it to your PATH. e.g. sudo mv GSTAr.pl /usr/bin/ GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1 (the hashbang) accordingly. USAGE GSTAr.pl [options] queries.fasta transcriptome.fasta Output alignments go to STDOUT, and can be redirected to a file with > or piped to another process with | Log and progress information goes to STDERR, and can be suppressed with option -q (quiet mode). Options Options: -h Print help message and quit -v Print version and quit -q Quiet mode .. no log/progress information to STDERR -t Tabular output format ... More suitable for parsing -a Sort by Allen et al. score instead of the default MFEratio -r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65 METHODS GSTAr.pl is essentially a wrapper and parser for RNAplex desgined for aligning short queries (15-26nts) against the rev-comp. strand of a eukaryotic transcriptome. For each query sequence, the minimum free energy of a perfectly complelementary sequence is calculated using RNAplex under all default parameters. Following this, the same query is then analyzed against the entire transcriptome. Hits where the MFEratio (i.e. MFE / MFE-perfect) is >= the cutoff established by option r are retained and parsed. The detailed RNAplex parameters (see RNAplex man page for details) for each query analysis are -f 2 : Fast mode .. structure based on approximated plex model. -e [minMFEratio * perfectMFE] : Minimum acceptable MFE value to keep a hit -z : [10 + query_length] : Acceptable alignments can span no more than length of query + 10nts. Slice Site is the transcript nt opposite nt 10 of the query. This is where one should look to find evidence of AGO-catalyzed slicing in the event that a) the transcript was really a target of the query at that site, and b) it really was sliced. GSTAr makes no judgements on the likelihood of either of those events, and the recording of the Slice Site position should NOT be taken as evidence that slicing exists or is even possible at that site. For a given query, all returned sites are non-redundant. Redunancy is based upon the putative slice site only. By default, the output is sorted in descending order (best to worst) according to the MFE ratio. In the alternative option -a mode, the results are instead sorted in ascending order (best to worst) according to the Allen et al. score. Allen et al. score: This is a score for plant miRNA/siRNA-target interactions based on the position-specific penalties described by Allen et al. (2005) Cell, 121:207-221 [PMID: 15851028]. Specifically, mismatched query bases or target-bulged bases, are penalized 1. G-U wobbles are penalized 0.5. These penalties are double within positions 2-13 of the query. WARNINGS NOT a target predictor GSTAr is very explicitly NOT a target predictor for miRNAs or siRNAs. It is only an aligner based on RNA-RNA hybridization thermodynamic predictions. Users should make no claims as to whether the identified alignments are actually targets of the query without independent data of some sort. Slice Sites are NOT predictions of slicing Although GSTAr reports a "Slicing Site" position for each alignment, this is merely for conveneince when using GSTAr alignments to guide subsequent experiments searching for AGO-catalyzed slicing evidence. No claim is made that any alignment is actually AGO-cleaved or even theoretically AGO-cleavable. Not for whole genomes GSTAr holds the entire contents of the transcripts.fasta file in memory to speed the isolation of sub-sequences. This will be impractical in terms of memory usage if a user attempts to load a whole genome. Similarly, GSTAr will only search for pairing between the top strand of the transcripts.fasta file, making it also impractical for a genome analysis, where sites might be on either strand. Temp files GSTAr writes temp files to the working directory. Their contents change dynamically during a run, and they will be deleted at the end of a run. So, don't mess with them during a run. In addition, it is a very bad idea to have two GSTAr runs operating concurrently from the same working directory because there will be clashes and overwrites for these temp files. Not too fast GSTAr uses RNAplex (Tafer and Hofacker, 2008. Bioinformatics 24:2657-63, PMID: 18434344, doi:10.1093/bioinformatics/btn193), which is exceptionally fast for an inter-molecular RNA-RNA hybridization calculator. However, when applied to entire eukaryotic transcriptomes the CPU time per query is still significant. Run time is only slightly affected (much less than 2-fold) by the setting of -r. Setting tabular mode (option -t) also increases speed just a tiny bit for runs with a low option -r. In tests with the Arabidopsis transcriptome (33,602 mRNAs, total nts=51,074,197), a single 21nt miRNA query typically takes about 90-110 seconds to complete. No ambiguity codes Query sequences with characters other than A, T, U, C, or G (case-insensitive) will not be analyzed, and a warning will be sent to the user. Transcript sub-sequences for potential alignments will be *silently* ignored if they contain any characters other than A, T, U, C,or G (case-insensitive). Small queries Query sequences must be small (between 15 and 26nts). Queries that don't meet these size requirements will not be analyzed and a warning sent to the user. Redundant output GSTAr guarantees that, FOR A GIVEN QUERY, the returned alignments will be unique in terms of their PUTATIVE SLICING SITE POSITION. However, the same query could generate multiple overlapping aligments that each have different putative slice sites. Furthermore, if different queries in a multi-query analysis have similar (or identical !) sequences, the same alignment position (based on putative slicing site position) could be returned multiple times, once for each of the similar/identical queries. Therefore, for multi-query result files, there is no guarantee of non-redundancy among the returned sites. OUTPUT Both output formats begin with a series of commented lines that provide details about the run. Pretty output The default output "pretty" mode is easily parsed by humans and should be self-explanatory Tabular output Tabular output (which occurs when option -t is specified) is a tab-delimited text file. The first non-commented line is the column headings, with meanings as follows: 1: Query: Name of query 2: Transcript: Name of transcript 3: TStart: One-based start position of the alignment within the transcript 4: TStop: One-based stop position of the alignment within the transcript 5: TSLice: One-based position of the alignment opposite position 10 of the query 6: MFEperfect: Minimum free energy of a perfectly matched site (approximate) 7: MFEsite: Minimum free energy of the alignment in question 8: MFEratio: MFEsite / MFEperfect 9: AllenScore: Penalty score calculated per Allen et al. (2005) Cell, 121:207-221 [PMID: 15851028]. 10: Paired: String representing paired positions in the query and transcript. The format is Query5'-Query3',Transcript3'-Transcript5'. Positions are one-based. Discrete blocks of pairing are separated by ; 11: Unpaired: String representing unpaired positions in the query and transcript. The format is Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are "UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3' end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric internal loop with more unpaired nts on the transcript side), "AILq" (asymmetric internal loop with more unpaired nts on the query side), "BULt" (Bulged on the transcript side), or "BULq" (bulged on the query side). Positions are one-based. Discrete blocks of pairing are separated by ; 12: Structure: Aligned secondary structure. The region before the "&" represents the transcript, 5'-3', while the region after the "&" represents the query, 5'-3'. "(" represents a transcript base that is paired, ")" represents a query based that is paired, "." represents an unpaired base, and "-" represents a gap inserted to facilitate alignment. 13: Sequence: Aligned sequence. The region before the "&" represents the transcript, 5'-3', while the region after the "&" represents the query, 5'-3'.