Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
CleaveLand4: Analysis of degradome data to find sliced miRNA and siRNA targets
Latest commit 451cf29
Jul 11, 2018
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
LICENSE CleaveLand4.pl Copyright (c) 2013-2018 Michael J. Axtell This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. SYNOPSIS CleaveLand4 : Finding evidence of sliced targets of small RNAs from degradome data AUTHOR Michael J. Axtell, Penn State University, firstname.lastname@example.org VERSION 4.5 : July 11, 2018 INSTALL Dependencies - Required Perl Modules Getopt::Std Math::CDF CleaveLand4.pl is a perl program, so it needs perl installed on your system. It was developed on perl version 5.10.0, and hasn't been tested on other versions (but there is no reason to suspect problems with other perl 5.x versions). CleaveLand4.pl will not compile unless the above two Perl modules are also installed in your perl's @INC. Getopt::Std is pre-loaded into most (all?) Perl distros. But you may need to install Math::CDF from CPAN. Only one Math::CDF function is used by CleaveLand4 ('pbinom'). Dependencies - PATH executables bowtie (version 0.12.x or 1.x) bowtie-build RNAplex (from Vienna RNA package) GSTAr.pl (Version 1.0 or higher -- distrubuted with CleaveLand4) R samtools All of the above must be executable from your PATH. Depending on the mode of the CleaveLand4 run (see below), only a subset of these programs may be required for a given run. Installation Except for the above dependencies, there is no "real" installation. If the script is in your working directory, you can call it with ./CleaveLand4.pl For convenience, you can add it to your PATH. e.g. sudo mv CleaveLand4.pl /usr/bin/ GSTAr.pl expects to find perl in /usr/bin/perl .. if not, edit line 1 (the hashbang) accordingly. USAGE CleaveLand4.pl [options] > [out.txt] Log and progress information goes to STDERR, and can be suppressed with option -q (quiet mode). Options -h Print help message and quit -v Print version and quit -q Quiet mode .. no log/progress information to STDERR -a Sort small RNA / transcript alignments by Allen et al. score* instead of default MFEratio -- for GSTAr -t Output in tabular format instead of the default verbose format -r [float >0..1] Minimum Free Energy Ratio cutoff. Default: 0.65 -- for GSTAr -o [string] : Produce T-plots in the directory indicated by the string. If the dir does not exist, it will be created -d [string] : Path to degradome density file. -e [string] : Path to FASTA-formatted degradome reads. -g [string] : Path to GSTAr-created tabular formatted query-transcript alignments. -u [string] : Path to FASTA-formatted small RNA queries -n [string] : Path to FASTA-formatted transcriptome -p [float >0..1] : p-value for reporting. Default is 1 (no p-value filtering). -c [integer 0..4] : Maximum category for reporting. Default is 4 (all categories reported). *Allen et al. score: This is a score based on the position-specific penalties described by Allen et al. (2005) Cell, 121:207-221 [PMID: 15851028]. Specifically, mismatched query bases or target-bulged bases, are penalized 1. G-U wobbles are penalized 0.5. These penalties are double within positions 2-13 of the query. Modes CleaveLand4 runs in one of four different modes. Each mode has a required set of options and a disallowed set of options, as described below: Mode 1: Align degradome data and create degradome density file, perform new small RNA query/transcriptome alignment with GSTAr, and analyze. Required options: -e, -u, -n. Disallowed options: -d, -g. Mode 2: Use existing degradome density file, perform new small RNA query/transcriptome alignment with GSTAr, and analyze. Required options: -d, -u, -n. Disallowed options: -e, -g. Mode 3: Align degradome data and create degradome density file, use existing GSTAr alignments, and analyze. Required options: -e, -n, -g. Disallowed options: -d, -u. Mode 4: Use existing degradome density file and existing GSTAr alignments, and analyze. Required options: -d, -g. Disallowed options: -e, -u. METHODS Degradome data --> transcriptome alignments --> degradome density file creation (modes 1 and 3) Degradome data is aligned to the reference transcriptome using bowtie. If needed, the bowtie indices for the transcript are built with bowtie-build using default parameters. This results in the creation of six files, each including "ebwt" in their suffix. Alignment parameters allow zero or one mismatch, and are only allowed to the forward strand of the transcriptome. In the case of multiple valid alignments only one is randomly selected and reported. (The specific bowtie command used is "bowtie -f -v 1 --best -k 1 --norc -S"). The alignment process uses samtools to generate a sorted BAM alignment file from the bowtie output stream. After creation of the sorted BAM alignment file, the alignments are parsed to quantify the density of observed 5' ends at each nt of the transcriptome. The results are written to a 'degradome density' file in the working directory. The BAM alignment file is deleted at the completion of this process. The degradome density files contain the position of the transcript, the number of 5' ends at that position, and the degradome peak 'category'. Categories are determined as follows: Category 4: Just one read at that position Category 3: >1 read, but below or equal to the average* depth of coverage on the transcript Category 2: >1 read, above the average* depth, but not the maximum on the transcript Cateogry 1: >1 read, equal to the maximum on the transcript, when there is >1 position at maximum value Cateogry 0: >1 read, equal to the maximum on the transcript, when there is just 1 position at the the maximum value * Note that the average does not include all of the 'zeroes' for non-occupied positions within a transcript. Instead, it is the average of all positions that have at least one read. Small RNA --> transcriptome alignments with GSTAr (modes 1 and 2) Potential target sites are generated with GSTAr.pl, which ships with CleaveLand4. Options -r and -a are passed to GSTAr. By default, potential target sites are sorted in descending order by MFE ratio. If the -a switch is present, this is changed to ascending order based on Allen et al. score. Note that GSTAr can be expected to take 90-120 seconds per query when analyzing a typically sized transcritpome. GSTAr.pl is always called to output in tabular format by CleaveLand4.pl. Only alignments to the reverse-comp strand of the transcriptome are considered. Upon completion, a GSTAr alignment file is written to the working directory. See the GSTAr documentation for more details on this program. Analysis (all modes) After loading valid degradome density and GSTAr alignment files, CleaveLand4 first checks to ensure that the transcriptome (as noted in the headers) is the same. If so, analysis progresses. For each alignment in the GSTAr alignment file, CleaveLand4 searches the degradome density file to see if there are any degradome reads at the predicted slicing site. If there are, a p-value is calculated. The p-value takes into account both the noise in the degradome density file and the quality of the small RNA-transcriptome alignment. First, the chances of observing a deagrdome 'peak' of the given category by random chance is calculated. The chance is the total number of peaks of the given category divided by the effective transcriptome size*. Then, the quality of the alignment is simply the rank of the alignment in the GSTAr alignment file (which is either sorted by MFE ratio [default] or by Allen et al. score). The p-value is calculated as the binomial probability of observing one or more 'hits' in 'x' trials given probability 'c', where 'x' is the rank of the alignment, and 'c' is the chance described above. * The effective transcriptome size is the total number of bases in the transcriptome - (n * mean_read_size), where 'n' is the number of transcripts. This adjustment accounts for the fact that the very ends of the transcripts could not possibly have any mapped 5' ends. Any hits with a p-value <= the cutoff specified by option -p AND a category <= the cutoff specified by option -c are output to STDOUT. By default, all hits are reported (option p default is 1, option c default is 4). INPUT FILE FORMAT REQUIREMENTS Newlines All files are assumed to have "\n" as newline characters. Files with MS-DOS text encoding, or others, that do not conform to this assumption will cause unexpected behavior and likely meaningless results. Transcriptome (option -n) This must be a multiline FASTA file. The headers should be short and simple and devoid of whitespace (e.g. ">AT1G12345" is good, ">AT1G12345 | this is my favorite gene | it is awesome" is not. The filename of the transcriptome file should also be devoid of whitespace. Degradome reads (option -e) This must be a multiline FASTA file. The reads are assumed to have already clipped to remove adapters. Furthermore, the reads must not have been collapsed in any way. In other words, each read off the sequencer should have an entry. Sequences that appear 50 times in the raw data from 50 different reads should each have a line. Finally, CleaveLand4 assumes that each degradome read represents the 5'-3' sequence of a transcript, and that the first nt of each read represents the 5' end of an RNA. Small RNA Queries (option -u) This must be a multiline FASTA file with the full sequence of a given small RNA on one line (e.g. each line is either a header beginning with ">" or the full-length sequence of the small RNA). The headers should be short and simple and devoid of whitespace (e.g. ">ath-miR169a" is good, ">ath-miR169a MIMAT0000200 Arabidopsis thaliana miR169a" is not. Sequences can have either T's or U's. Note: miRBase often has several mature miRNAs with exactly the same sequence, relfecting paralogous miRNA genes within a species. There is no use querying the same sequence multiple times, so it is a good idea to collapse the redundancy by query sequence when making a file of small RNA queries. Degradome density files (option -d) Most of the time, these will be files created by previous runs of CleaveLand4 that will have the suffix "_dd.txt". If you don't like the alignment parameters that CleaveLand4 uses, you could create your own degradome denstiy files. The format specification is: Header region: Lines begin with "#". Here is an example: [line1]# CleaveLand4 degradome density [line2]# Fri Sep 13 09:21:38 EDT 2013 [line3]# Degradome Reads:GSM278335.fasta [line4]# Transcriptome:TAIR10_cdna_20110103_rgmupdated_cleand.fasta [line5]# TranscriptomeCharacters:51074197 [line6]# Mean Degradome Read Size:20 [line7]# Estimated effective Transcriptome Size:50402157 [line8]# Category 0:16430 [line9]# Category 1:3456 [line10]# Category 2:95747 [line11]# Category 3:207062 [line12]# Category 4:78279 CleaveLand4 demands that the first line of a valid degradome density file is "# CleaveLand4 degradome density". It also requires all other header lines, except the date on line 2, to be present. All of this information is required for analysis. Data Region: Each transcript begins with two lines as follows: [line1]@ID:AT1G50920.1 [line2]@LN:2394 The @ID: gives the name of the transcript, while the @LN: gives the length of the transcript. After this, each line gives a one-based position, the number of 5' ends at that positions, and the degradome category. The data lines are tab-delimited. Note that positons with zero reads are NOT shown. GSTAr query-transcriptome alignments (option -g) These are files created by GSTAr. If they were created as part of a CleaveLand4 run, they will have the suffix "_GSTAr.txt". They must be in the 'tabular' format, and have a proper header as shown below: [line1]# GSTAr version 1.0 [line2]# Thu Sep 12 13:56:58 EDT 2013 [line3]# Queries: test_mir.fasta [line4]# Transcripts: TAIR10_cdna_20110103_rgmupdated_cleand.fasta [line5]# Hit seed length required to initiate RNAplex analysis (option -s): 7 [line6]# Minimum Free Energy Ratio cutoff (option -r): 0.65 [line7]# Sorted by: MFEratio [line8]# Output Format: Tabular Is is strongly recommended NOT to try to produce these files by means not involving CleaveLand4/GSTAr. OUTPUT Pretty format By default, CleaveLand prints hits that pass the p-value and category filters to STDOUT in a human-readble, verbose format that is self-explanatory. A header (lines beginning with "#") is printed giving basic information on the analysis. Tabular format If option -t is specified, any hits passing the p-value filter are printed in a tab-delimited format. First, a header (lines beginning with "#") is printed giving basic information on the analysis. After that, a line giving the names of the columns is printed. Each subsequent line gives information on a single hit. The format is similar to that of the GSTAr alignments. Column information is: 1: SiteID: A unique name (within the scope of the output of a particular run) for the putative slicing site. In the form "[transcript]:[slice_site]". The output is sorted by SiteID. 2: Query: Name of query 3: Transcript: Name of transcript 4: TStart: One-based start position of the alignment within the transcript 5: TStop: One-based stop position of the alignment within the transcript 6: TSLice: One-based position of the alignment opposite position 10 of the query 7: MFEperfect: Minimum free energy of a perfectly matched site (approximate) 8: MFEsite: Minimum free energy of the alignment in question 9: MFEratio: MFEsite / MFEperfect 10: AllenScore: Penalty score calculated per Allen et al. (2005) Cell, 121:207-221 [PMID: 15851028]. 11: Paired: String representing paired positions in the query and transcript. The format is Query5'-Query3',Transcript3'-Transcript5'. Positions are one-based. Discrete blocks of pairing are separated by ; 12: Unpaired: String representing unpaired positions in the query and transcript. The format is Query5'-Query3',Transcript3'-Transcript5'[code]. Possible codes are "UP5" (Unpaired region at 5' end of query), "UP3" (Unpaired region at 3' end of query), "SIL" (symmetric internal loop), "AILt" (asymmetric internal loop with more unpaired nts on the transcript side), "AILq" (asymmetric internal loop with more unpaired nts on the query side), "BULt" (Bulged on the transcript side), or "BULq" (bulged on the query side). Positions are one-based. Discrete blocks of pairing are separated by ; 13: Structure: Aligned secondary structure. The region before the "&" represents the transcript, 5'-3', while the region after the "&" represents the query, 5'-3'. "(" represents a transcript base that is paired, ")" represents a query based that is paired, "." represents an unpaired base, and "-" represents a gap inserted to facilitate alignment. 14: Sequence: Aligned sequence. The region before the "&" represents the transcript, 5'-3', while the region after the "&" represents the query, 5'-3'. 15: DegradomeCategory: Category 4: Just one read at that position Category 3: >1 read, but below or equal to the average* depth of coverage on the transcript Category 2: >1 read, above the average* depth, but not the maximum on the transcript Cateogry 1: >1 read, equal to the maximum on the transcript, when there is >1 position at maximum value Cateogry 0: >1 read, equal to the maximum on the transcript, when there is just 1 position at the the maximum value 16: DegradomePval: p-value for this degradome hit. 17: Tplot_file_path: File path for the T-plot of this hit. * Note that the average does not include all of the 'zeroes' for non-occupied positions within a transcript. Instead, it is the average of all positions that have at least one read. =head2 T-plots If the user requests them by including the -o option, T-plots for each hit that passes the p-value cutoff are created and written to the directory specificied by option -o. The black line on the plot shows all of the degradome data, and the red dot shows the putative slicing site. The title of each T-plot indicates the transcript ("T="), the query ("Q="), and the putative slicing site ("S="), as well as the category and p-value. Existing T-plot files in the -o directory with the same name will be over-written without warning. WARNINGS Don't believe the hype - part 1 Under default settings, CleaveLand4 reports ALL putative slicing sites with ANY degradome reads at all, regardless of the liklihood of a given hit of being due to random chance. Without any filtering, most of your hits probably ARE due to random chance. This means that there will be many many hits of Category 4 (just one read) and/or at very high p-values. Exercise skepticism when interpreting these results. More confidence in the reality of a given slicing event can come from restricting analysis to hits with low p-values and/or high categories, and, even better, observing the slicing event in multiple degradome libraries. Don't believe the hype - part 2 The p-value calculation is built around the ASSUMPTION that the rank order of alignments for a given query reflects their liklihood of being functional. Under default settings, GSTAr will sort the alignments for each query based on descending MFE ratio. Alternatively, when option a is specified for the GSTAr run, they will be sorted in ascending order according the Allen et al. score. The extent to which the p-values are trustworthy is dictated directly by the extent to which you believe that MFEratio or Allen et al. scores are predictive of function. If you don't believe that, you should treat the p-values with due skepticism, and focus instead on high category hits and reproducibility between libraries. Not for whole genomes Degradome alignment by CleaveLand4 only searches the top strand of the transcriptome. Also, GSTAr holds the entire contents of the transcripts.fasta file in memory to speed the isolation of sub-sequences, and CleaveLand4 holds the entire contents of the degradome density file in memory. This will be impractical in terms of memory usage if a user attempts to load a whole genome. Similarly, GSTAr will only search for pairing between the top strand of the transcripts.fasta file, making it also impractical for a genome analysis, where sites might be on either strand. Temp files CleaveLand4 writes several temp files during the course of a run. So, don't mess with them during a run. In addition, it is a very bad idea to have two CleaveLand4 runs operating concurrently from the same working directory. CleaveLand4 will clean up all temp files at the conclusion of a run. Not too fast in modes 1 and 2 (and maybe 3). GSTAr is a very fast intermolecular RNA-RNA hybridization calculator. But when applied to whole transcriptomes, it is still very time-consuming. When running in modes 1 or 2, plan on about 90-120 seconds per query during the GSTAr phase. Additionally, bowtie alignments and index bulding (modes 1 or 3) can also be time-consuming. Finally, requesting T-plots can also slow things down, especially if a great number of hits are being returned. No ambiguity codes Query sequences with characters other than A, T, U, C, or G (case-insensitive) will not be analyzed, and a warning will be sent to the user. Transcript sub-sequences for potential query alignments will be *silently* ignored if they contain any characters other than A, T, U, C,or G (case-insensitive). Small queries GSTAR demands that query sequences must be small (15-26 nts). Queries that don't meet these size requirements will not be analyzed and a warning sent to the user. No redundancy For a GIVEN QUERY, GSTAr alignments are non-redundant in terms of the slicing site of the alignment. However, a single query can have multiple overlapping alignment patterns that have differing predicted slicing sites. In addition, if multiple queries are similar in sequence, the same alignment (in terms of putative slicing site) can be present multiple times for different queries. If CleaveLand4 identifies more than one alignment at the same putative slicing site, it will only report the one with the best (lowest) p-value (subject to the maximum allowed p-value, option -p). Therefore there should be no redundancy in the putative slice sites returned by a given CleaveLand4 run. No reverse-compatibility Degradome density files created by versions of CleaveLand prior to 4.0 are NOT compatible with CleaveLand 4. Sorry. Change in category definitions The categories used by CleaveLand4 differ slightly from those used in CleaveLand3 and earlier. Specifically, categories 3 and 2 now rely upon calculating the mean, not the median, level of coverage in the transcript. In addition, transcript positions with zero coverage are no longer included in the calculation. The effect of this is to make category 2 hits much more rare, and category 3 hits much more common. Slicing at position 10 CleaveLand4 only looks for evidence of slicing at position 10 relative to the aligned small RNA. There is no ambiguity -- data at position 11 or 9 is not relevant to CleaveLand4. This is because, as far as I know, there is no direct evidence showing Argonaute proteins cut anywhere besides position 10. However, there IS clear evidence for isomirs: lower-abundance variants of miRNAs with alternative 5' or 3' ends. Isomirs with alternative 5' ends could certainly cause offset slicing. If you wish to search for slicing at slightly 'off' locations with CleaveLand4, you will need to explictly query with the isomirs of interest.