FASTAptamer: A Bioinformatic Toolkit for High-Throughput Sequence Analysis of Combinatorial Selections
License
FASTAptamer/FASTAptamer
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
-------------------------------------------------------------------------------- FASTAptamer v1.0.14 If you use, adapt, or modify FASTAptamer please cite: Khalid K. Alam, Jonathan L. Chang & Donald H. Burke "FASTAptamer: A Bioinformatic Toolkit for High-Throughput Sequence Analysis of Combinatorial Selections" Molecular Therapy - Nucleic Acids. 2015. DOI: 10.1038/mtna.2015.4 (C) 2014. Web: http://burkelab.missouri.edu/ Email: burkelab@missouri.edu Twitter: @BurkeLabRNA -------------------------------------------------------------------------------- I. Introduction II. Installation III. FASTAptamer-Count IV. FASTAptamer-Compare V. FASTAptamer-Cluster VI. FASTAptamer-Enrich VII. FASTAptamer-Search VIII. Version history IX. License X. Miscellaneous -------------------------------------------------------------------------------- I. Introduction FASTAptamer was developed to address the analysis needs of high throughput sequ- encing data from combinatorial selections. The FASTAptamer toolkit rapidly conv- erts large FASTQ files into manageable FASTA files, ranks, sorts, and normalizes read counts for each unique sequence, compares two populations for sequence dis- tribution, calculates enrichment fold across two or three populations, clusters sequences according to a user-defined Levenshtein edit distance and searches for degenerate sequence motifs. While originally developed for aptamer and ribozyme discovery, FASTAptamer provides insight into phage display, in vivo mutagenesis selection and other DNA-encoded libraries. The FASTAptamer toolkit currently in- cludes five tools, FAST-Aptamer-Count, FASTAptamer-Compare, FASTAptamer-Cluster, FASTAptamer-Enrich, and FASTAptamer-Search. ** For detailed instructions on installation and use of the FASTAptamer toolkit, including sample sequence data and screenshots, please see the PDF user's guide included with the download. ** -------------------------------------------------------------------------------- II. Installation The FASTAptamer toolkit is written in Perl 5 and has no external dependencies. It should operate on any modern Unix-like system (including Linux and Mac OS X) and has been tested using CentOS Linux 5.4, Mac OS X 10.5+, and Debian GNU/Linu- x 7.0. FASTAptamer should also be able to run on a Microsoft Windows platform p- rovided that a Perl interpreter has been installed, such as ActiveState Perl or Strawberry Perl. FASTAptamer can be used by installing or by saving the files to an accessible d- irectory and executing the Perl interpreter on each script as needed. Installation and use of the FASTAptamer toolkit assumes a basic working knowled- ge of command line operation. FASTAptamer is currently provided as a collection of scripts, after downloading the scripts there is one main task to complete the "installation" process -- saving the scripts in an executable directory. The scripts should be saved in a directory where executable programs are found (known as the "path variable" and typically /bin, /usr/bin, or /usr/local/bin). To find these directories, enter the following on the command line on your shell prompt. echo $PATH This should list the directories (separated by colons) your system uses to sear- ch for executable programs. You may not have permission to modify some of these folders, but you should be able to move the FASTAptamer files from their current location into one of these directories using the move (mv) or copy (cp) command. Once the scripts are in an executable directory and have the appropriate permis- sion to execute, the FASTAptamer toolkit should be ready for use. Alternately, if you're having trouble accessing your PATH directories, you can use FASTAptamer without installation. For details on this process, refer to the PDF user's guide included with the download. -------------------------------------------------------------------------------- III. FASTAptamer-Count FASTAptamer-Count serves as the gateway to the FASTAptamer suite of bioinformat- ics tools. FASTAptamer-Count will accept a FASTQ input file (the de facto NGS/ HTS file format). Ideally, the data should be trimmed to remove any constant r- egions and filtered to include only the high-quality reads. FASTAptamer-Count n- on-destructively outputs a sorted and non-redundant FASTA formatted file in whi- ch each unique sequence generates a FASTA entry with the following information in the description line of each sequence entry: >RANK-READS-RPM Where RANK is the relative abundance of the sequence within the population. In cases where two or more sequences are sampled with equal abundance, FASTAptamer- Count follows standard competition ranking (e.g., "1-2-2-4" where two sequences are tied for second). READS is the raw number of times a sequence was counted. RPM is "Reads per million," which is a normalized value that allows for compari- son across populations of varying read depth. RPM is calculated as: RPM = (READS/(population size)) x 10^6. In addition to generating a FASTA output file, FASTAptamer-Count will display a summary report on the screen (STDOUT) that includes the number of total reads f- ound in the input file, the number of unique sequences, the file input/output n- ames and the program execution time. The summary report can be suppressed by i- ncluding the optional flag [-q] on the command line. Usage: fastaptamer_count [-h] [-q] [-v] [-i INFILE] [-o OUTFILE] [-h] = Help screen. [-q] = Suppress STDOUT of run report. [-v] = Display version. [-i INFILE] = FASTQ input file. REQUIRED. [-o OUTFILE] = FASTA output file. REQUIRED. [-u ] = Create unique sequence ID's If the -u flag is used, the FASTA file will have the following format (starting at the second sequence that has the same rank): >RANK(UNIQ)-READS-RPM SEQUENCE Where UNIQ is a numeric label that distinguishes sequences with identical RANK and READS (and therefore identical RPM too). This label starts at '2' and increments by one for each new sequence that has the same RANK. This label is r- eset each time RANK increases. -------------------------------------------------------------------------------- IV. FASTAptamer-Compare FASTAptamer-Compare facilitates statistical analysis of two populations by rapi- dly generating a tab-delimited output file that lists each unique sequence along with RPM (reads per million) in each population file (if available) and log(2) of the ratio of their RPM values in each population. RPM data for both populations can be utilized to generate an XY-scatter plot of sequence distribution across two populations. FASTAptamer-Compare also facilit- ates the generation of a histogram of the sequence distribution by creating 102 bins for the log(2) values. This histogram can provide a quick visual comparis- on of the two populations: distributions centered around 0 indicate similar pop- ulations, while distributions shifted to the left or right indicate overall enr- ichment or depletion. Input for FASTAptamer-Compare MUST come from FASTAptamer-Count output files. Usage: fastaptamer_compare [-h] [-x INFILE] [-y INFILE] [-o OUTFILE] [-q] [-a] [-v] [-h] = Help screen. [-x INFILE] = Input file (from FASTAptamer-Count). REQUIRED. [-y INFILE] = Input file (from FASTAptamer-Count). REQUIRED. [-o OUTFILE] = Plain text output file with tab separated values. REQUIRED [-q] = Quiet mode. Suppresses standard output of file I/O and execution time. [-a] = Output all sequences, including those present in only one input file. Default behavior suppresses output of sequences without a match. [-v] = Display version. -------------------------------------------------------------------------------- V. FASTAptamer-Cluster FASTAptamer-Cluster uses the Levenshtein algorithm to cluster together sequences based on a user-defined edit distance. The most abundant and unclustered seque- nce is used as the "seed sequence" for which edit distance is calculated from. Output is FASTA with the following information on the identifier line of each sequence entry: >Rank-Reads-RPM-Cluster#-RankWithinCluster-EditDistanceFromSeed SEQUENCE To prevent clustering of sequences not highly sampled (and improve execution ti- me), invoke the read filter [-f] and enter a number. Only sequences with total reads greater than the number entered will be clustered. To limit the number of clusters to the most you would be interested in (and imp- rove execution time), use the maximum clusters option [-c]. Input for FASTAptamer-Cluster MUST come from FASTAptamer-Count output files. PLEASE NOTE: This is a computationally intense program that can take multiple h- ours to finish, depending on the size and complexity of your population. Usage: fastaptamer_cluster [-h] [-i INFILE] [-o OUTFILE] [-d] [-f] [-q] [-v] [-h] = Help screen. [-i INFILE] = Input file from fastaptamer_count. REQUIRED. [-o OUTFILE] = Output file, FASTA format. REQUIRED. [-d] = Edit distance for clustering sequences. REQUIRED. [-f] = Read filter. Only sequences with total reads greater than the value supplied will be clustered. [-c] = Maximum number of clusters to find. [-q] = Quiet mode. Suppresses standard output of file I/O, numb- er of clusters, cluster size and execution time. [-v] = Display version. (For a speed boost, you can instead use fastaptamer_cluster_xs if you have the Text::LevenshteinXS module installed.) -------------------------------------------------------------------------------- VI. FASTAptamer-Enrich FASTAptamer-Enrich rapidly calculates "fold-enrichment" values for each sequence across two or three input files. Output is provided as a tab-delimited plain t- ext file and is formatted to include sequence composition, length, rank, reads, reads per million (RPM), and enrichment values for each sequence. If any files from FASTAptamer-Cluster are provided, output will include cluster information for that population. A threshold filter can be applied to exclude sequences with total reads per million (across all input populations) less than the number ent- ered after the [-f] option. Default behavior is to include all sequences. Enri- chment is calculated by dividing reads per million of y/x (and z/y and z/x, if a third input file is specified). Input for FASTAptamer-Enrich MUST come from FASTAptamer-Count or FASTAptamer- Cluster output files. Usage: fastaptamer_enrich [-h] [-x INFILE] [-y INFILE] [-z INFILE] [-o OUTFILE] [-f #] [-q] [-v] [-h] = Help screen. [-x INFILE] = First input file from FASTAptamer-Count or FASTAptamer-Cluster. REQUIRED. [-y INFILE] = Second input file from FASTAptamer-Count or FASTAptamer-Cluster. REQUIRED. *** For two populations only, use -x and -y. *** [-z INFILE] = Optional third input file from FASTAptamer-Count or FASTAptamer-Cluster. [-o OUTFILE] = Plain text output file with tab separated values. REQUIRED [-f] = Optional reads per million threshold filter. [-q] = Quiet mode. Suppresses standard output of file I/O, number of matched sequences and execution time. [-v] = Display version. -------------------------------------------------------------------------------- VII. FASTAptamer-Search FASTAptamer-Search allows users to search for specific patterns within one or m- ore sequence files. To search through more than one input file, simply use the [-i] flag multiple t- imes. All input files must use FASTA format. Similarly, to search for multiple patterns simultaneously, use the [-p] flag as many times as needed. When searching for multiple patterns, note that partial m- atches are not returned. For example, entering the following command: fastaptamer_search -i FILE1 -i FILE2 -p ATTGCC -p TGGCAT would search FILE1 and FILE2 for sequences containing both ATTGCC and TGGCAT. Patterns and input sequence data are case insensitive, and T/U are interchangea- ble. In addition to single bases, patterns can include any of the degenerate ba- se symbols from IUPAC-IUBMB nucleic acid notation: A/T/G/C/U single bases R puRines (A/G) Y pYrimidines (C/T) W Weak (A/T) S Strong (G/C) M aMino (A/C) K Keto (G/T) B not A D not C H not G V not T N aNy base (not a gap) For greater visibility, pattern matches can be highlighted by parentheses in the output by calling the [-highlight] flag. A summary report is generated after each file's search results and after search completion. To suppress these reports, enable quiet mode using the [-quiet] flag Usage: fastaptamer_search [-i INFILE] [-o OUTFILE] [-p PATTERN] [-v] [-help] = Show help screen and exit. [-i FILENAME] = Input filename; can be used multiple times. [-p PATTERN] = Sequence pattern to search for; can be used multiple times. [-o FILENAME] = Output file for search results. If none given, output goes to STDOUT. [-highlight] = "Highlight" matched portion in parentheses. [-quiet] = Suppress summary report. [-v] = Display version. -------------------------------------------------------------------------------- VIII. Version History Version 1.0.14 - Released July 25, 2018 Restored error message resulting from empty input file name (Fixed bug introduced in 1.0.13) Removed tests' dependencies on non-core Perl modules. Version 1.0.13 - Released July 25, 2018 fastaptamer_count will automatically decompress input file ending in ".gz". Version 1.0.12 - Released October 20th, 2017 Implemented fastaptamer_cluster as fastaptamer_cluster_xs using the module Text::LevenshteinXS. Thanks vladimirkhramkov! Version 1.0.11 - Released July 20th, 2016 fastaptamer_count may autodetect a FASTA file and switch to FASTA mode. This is just a safety net. This behavior is not defined in the usage, and should not be depended upon. Version 1.0.10 - Released July 20th, 2016 Now fastaptamer_count can accept FASTA files as input, by using the -f flag. Version 1.0.9 - Released July 7th, 2015 If selected, optional uniqueness labels only start appearing on the second seq- uence of the same rank. Tests were updated to reflect this change. However, co- unt files created with version 1.0.8 should still process correctly with this version. Version 1.0.8 - Released June 17th, 2015 Optional uniqueness labels now in format like ">RANK(UNIQ)-READS-RPM" instead of ">RANK-READS-RPM-UNIQ". Version 1.0.7 - Released June 8th, 2015 Modified fastaptamer_compare to handle optional uniqueness labels in sequence I- Ds. Added test for fastaptamer_compare. Version 1.0.6 - Released June 8th, 2015. Added the option to fastaptamer_count for generating unique IDs. Modified fastaptamer_cluster to be able to read count files that contain unique IDs Version 1.0.5 - Released May 29th, 2015. Fixed bug in fastapatmer_count that was sensitive to an at-sign (@) in quality header or as the last (or only) character of the sequence header. Version 1.0.4 - Released March 10th, 2015. fastaptamer_cluster should now be much more memory efficient. Version 1.0.3 - Released March 3rd, 2015. Added maximum number of clusters option [-c] to fastaptamer_cluster. Version 1.0.2 - Released January 21st, 2015. Added FASTAptamer citation and license information to all scripts. Improvements to readability and consistency of code and STDOUT summary reports. Version 1.0.1 - Released December 24th, 2014. Added version option [-v] to all scripts. Updated help screens. FASTAptamer-Clu- ster's RPM filter [-f] is now a more intuitive read filter. Version 1.0 - Released August 31st, 2014. Added FASTAptamer-Search. Updated FASTAptamer-Cluster and FASTAptamer-Enrich to include RPM filters. Built in cluster auto-detect to FASTAptamer-Enrich and sub- sequently removed FASTAptamer-Cluster_enrich from toolkit. Included user's guide as a PDF bundled with the download. Sample data made available. Version 0.2.2 - Released May 5th, 2014. Fixed a formatting error in FASTAptamer -Cluster_enrich in which sequences present only in population x had their infor- mation split across two lines. Version 0.2.1 - Released April 29th, 2014. Fixed formatting issue with FASTAptamer-Cluster_enrich where enrichment value of (y/x) appeared under column for cluster (z) when three populations are used. Version 0.2 - Released April 27th, 2014. Added FASTAptamer-Cluster and FASTAptamer-Cluster_enrich. README document upda- ted. License included. Version 0.1 - initial "beta." Released April 11th, 2014. -------------------------------------------------------------------------------- IX. License FASTAptamer is distributed under the GNU General Public License version 3. See file "LICENSE.txt" for more information and complete license. -------------------------------------------------------------------------------- X. Miscellaneous For help, comments, criticism, error reporting, etc. please contact burkelab@missouri.edu. If you're on Twitter, reach out to us @BurkeLabRNA or c- ontact the primary author of the toolkit, Khalid K. Alam @BiochemPhD. "Fork" development of FASTAptamer on GitHub. http://github.com/FASTAptamer/FASTAptamer Funding for this work was provided by NSF [CHE-1057506] (Chemistry of Life Proc- esses) & NASA [NAG5-12360] (NASA Exobiology Program).
About
FASTAptamer: A Bioinformatic Toolkit for High-Throughput Sequence Analysis of Combinatorial Selections
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published