Eli-Meyer/2bRAD_utilities


A collection of scripts for analysis of 2bRAD sequence data. 

For instructions, see the User's Guide at http://eli-meyer.github.io/2bRAD_utilities

-------------------------
AlleleFilter.pl
-------------------------

------------------------------------------------------------
AlleleFilter.pl
Excludes loci containing too many alleles.
Usage: AlleleFilter.pl -i input -n max_alleles <options>
Required arguments:
	-i input	name of the input file, a matrix of genotypes.
			Input format: rows=loci and columns=samples.
                        row 1 = column label, column 1 = tag, column 2 = position
                        subsequent columns contain genotypes for each sample
	-n max_alleles	maximum number of alleles allowed. Loci with more than this 
			number of alleles will be excluded.
Options:
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------
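
The filter described above can be sketched in a few lines of Python. This is an illustration only, not the script's own code. It assumes the genotype encoding described under CompareSNPMatrices.pl (homozygotes as "A", heterozygotes as "A C", missing data as "0"), and the function names are hypothetical.

```python
def count_alleles(genotypes, missing="0"):
    """Count the distinct alleles seen across one locus's genotype calls."""
    alleles = set()
    for gt in genotypes:
        if gt == missing:
            continue  # missing calls contribute no alleles
        alleles.update(gt.split())  # "A C" -> {"A", "C"}
    return len(alleles)


def allele_filter(rows, max_alleles):
    """Keep the header row plus loci with at most max_alleles alleles.

    rows: list of lists; row 0 is the header, columns 0-1 are tag and position.
    """
    header, loci = rows[0], rows[1:]
    kept = [r for r in loci if count_alleles(r[2:]) <= max_alleles]
    return [header] + kept


rows = [
    ["tag", "pos", "s1", "s2", "s3"],
    ["tag1", "5", "A", "A C", "0"],   # 2 alleles
    ["tag2", "9", "A", "C G", "T"],   # 4 alleles
]
filtered = allele_filter(rows, max_alleles=2)
print(len(filtered) - 1, "loci passed")
```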

-------------------------
BuildRef.pl
-------------------------

------------------------------------------------------------
BuildRef.pl
Builds a reference for de novo analysis of 2bRAD sequences from samples
without a sequenced genome. The script filters, clusters, and compares 
similar sequences to infer the set of loci present in the species of
interest, using a subset of reads from the samples themselves.
Usage: BuildRef.pl -i input -o output <OPTIONS>
Required arguments:
	-i input	The set of processed (truncated, HQ) reads (FASTQ) to be used for reference
			development. Ideally this should include 10-20 million reads spanning the range
			of known diversity (e.g. from all populations in your study). Prepare this 
			ahead of time by concatenating together a subset of reads from your samples. 
	-o output	a name for the output file, to be used as a reference in mapping and genotyping
Options:
	-v overwrite	0=do not overwrite existing files; use them for analysis. (default) 
			1=do not use existing files; overwrite them with new files. 
	-n mincov	Minimum depth to qualify as a valid allele. (default=2)
	-q threshold	Quality scores below this threshold are low quality (default=30)
	-x number	Maximum number of low quality bases allowed for reference construction (default=0)
	-m mismatches	Maximum number of mismatches allowed in clustering of related alleles (default=2)
	-d distance	Minimum number of bases required to resolve sub-clusters (default=1)
	-a haplotypes	For very large clusters containing more than this number of unique sequences, 
			do not attempt to resolve sub clusters. These indicate repetitive sequences 
			which are not useful for genotyping anyway, and resolving these large clusters
			is computationally intensive. (default=32) 
------------------------------------------------------------

-------------------------
CallGenotypes.pl
-------------------------

------------------------------------------------------------
CallGenotypes.pl

Determines SNP genotypes from nucleotide frequencies. Input file contains nucleotide frequencies 
from multiple samples. Output file lists the genotypes called from those frequencies. 
Usage: CallGenotypes.pl -i input -o output <OPTIONS>
Required arguments:
	-i input	Input file, tab delimited text file of nucleotide frequencies (output from NtFrequences.pl)
			column 1 = tag, column 2 = locus, column 3 = reference genotype
			subsequent columns = nucleotide frequencies in each sample, as A/C/G/T (e.g. 0/0/10/12)
	-o output	A name for the output file (tab delimited text)
Options:
	-c coverage	Minimum coverage required to determine genotypes. Lower coverage loci will be discarded.
			Default: 10
	-e ends		y: exclude terminal positions in alignments where errors may arise during ligation. (default)
			n: do not exclude terminal positions. 
	-m method	"nf": nucleotide frequencies (classic method; the default). 
			  This method determines genotypes directly from nucleotide frequencies, using thresholds
			  defined by the user. If minor allele frequency (MAF) <= x at a locus, the genotype is called 
			  homozygous for the major allele at that locus. If MAF >= n, genotype is called heterozygous.
			  Genotypes are not called at intermediate MAF (x < MAF < n), where errors are likely.
			"pgf": NF informed by population genotype frequencies. (an update on the classic method)
			  This method first identifies valid alleles at each locus based on their frequency in the 
			  population (the two most common alleles, each detected at frequency >= y in >= q individuals),
			  then applies relaxed nucleotide frequency thresholds for those alleles (using y instead
			  of n for valid alleles). 
			"bgc" = Bayesian Genotype Caller
			  This method calls the BGC software, which implements a maximum-likelihood (ML) method for 
			  calling genotypes that incorporates prior population-level information on genotype 
			  frequencies and error rates from a genotype-frequency estimator. For more details see 
			  Maruki & Lynch (doi: 10.1534/g3.117.039008), and cite that paper if using this option.
	Options for method "nf" or "pgf":
	-x max_MAF	Maximum frequency of the minor allele you're willing to ignore and call the position 
			homozygous for the major allele (0-1). Default: 0.01
	-n min_MAF	Minimum frequency of the minor allele you're willing to accept as evidence of 
			heterozygosity, and call the locus heterozygous (0-1). Default: 0.25
	-r min_reads	Because low frequencies translate into 1 or fewer reads at low coverage, the script
			also imposes a minimum read number for detection of heterozygotes. (default: 2) 
	Options for method "pgf":
	-y frequency	Minimum frequency a second allele must be detected to be considered valid.
			(default: 0.05)
	-q samples	Each allele must be present in at least q samples to be considered valid.
			(default: 2)	
	Options for method "bgc":
	-p p-value	Critical p-value for the chi-square polymorphism test (BGC)
			(default: 0.05)
	-v maxcov	Coverage at which the pipeline switches from BGC (for low coverage data) to
			HGC (for high coverage data). Default=80 (i.e. HGC above 80). 
Examples:
  CallGenotypes.pl -i allele_counts.tab -o genotypes.tab		# basic usage
  CallGenotypes.pl -i allele_counts.tab -o genotypes.tab -c 20 		# increase coverage threshold
  CallGenotypes.pl -i allele_counts.tab -o genotypes.tab -m bgc		# use Bayesian Genotype Caller
  CallGenotypes.pl -i allele_counts.tab -o genotypes.tab -m pgf -y 0.05	# use population method
------------------------------------------------------------
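
The "nf" thresholds described above can be illustrated with a short Python sketch. It is not the script's implementation. It assumes that coverage means the total read count at the position, that no-calls are written as "0", and that heterozygotes are written as two bases separated by a space. The function name call_nf is hypothetical, and the defaults mirror -c, -x, -n and -r.

```python
def call_nf(counts, min_cov=10, max_maf=0.01, min_maf=0.25, min_reads=2):
    """Call a genotype from an 'A/C/G/T' count string using MAF thresholds.

    Returns the major allele (homozygote), two alleles joined by a space
    (heterozygote), or "0" when no call is made.
    """
    bases = "ACGT"
    n = [int(x) for x in counts.split("/")]
    total = sum(n)
    if total < min_cov:
        return "0"  # too little coverage to call
    # Rank bases by read count, highest first
    ranked = sorted(zip(n, bases), key=lambda pair: -pair[0])
    (major_n, major), (minor_n, minor) = ranked[0], ranked[1]
    maf = minor_n / total
    if maf <= max_maf:
        return major  # minor allele rare enough to ignore
    if maf >= min_maf and minor_n >= min_reads:
        return " ".join(sorted([major, minor]))
    return "0"  # intermediate MAF (or too few minor reads): no call


for example in ["0/0/10/12", "20/0/0/0", "18/2/0/0", "3/0/0/0"]:
    print(example, "->", call_nf(example))
```

With the default thresholds, a position with 18 reads of one base and 2 of another (MAF = 0.1) falls in the uncalled intermediate range.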

-------------------------
CombineAlleleCounts.pl
-------------------------

------------------------------------------------------------
CombineAlleleCounts.pl
Counts observations of alleles at each locus in a collection of base counts from 2bRAD 
(the output from SAMBaseCounts.pl). 

This script identifies the major and minor allele at each locus, and combines all samples
into a single file containing the number of times each of these alleles was observed
in each sample (two columns per sample, for major and minor allele respectively).

Output format: columns 1=tag, 2=position, 3=major allele, 4=minor allele,
5=major allele counts for sample A, 6=minor allele counts for sample A, etc.

Missing data are shown as "NA" for both alleles, and the minor allele is reported as 
"NA" for monomorphic loci.

Usage: CombineAlleleCounts.pl <options> file_1 file_2 ... file_n > output_file
Required arguments:
	files 1-n:	nucleotide frequencies (output from SAMBaseCounts.pl) for each sample
	output_file:	a name for the output; tab-delimited text
Options:
	-a max_alleles	maximum number of alleles allowed at each locus. Loci with more than this
			number of alleles will be excluded. (default=2)
        -v min_cov      minimum coverage required to consider an allele present. (default=2)
        -s min_samp     minimum number of samples in which an allele must be present
                        (default=1)

------------------------------------------------------------
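
As a rough illustration of how major and minor alleles might be identified under the -a, -v and -s options, consider the Python sketch below. It works on in-memory counts rather than on SAMBaseCounts.pl output, whose file layout is not described here. The function name and the tie-breaking rule are assumptions, not the script's behavior.

```python
def major_minor(sample_counts, min_cov=2, min_samp=1, max_alleles=2):
    """Pick the major and minor allele for one locus.

    sample_counts: one dict per sample mapping base -> read count,
                   or None where the sample has no data.
    Returns (major, minor), with minor = "NA" for monomorphic loci,
    or None if the locus has no valid allele or too many alleles.
    """
    totals = {}   # reads per base, summed over samples
    support = {}  # number of samples where the base reaches min_cov
    for counts in sample_counts:
        if counts is None:
            continue
        for base, n in counts.items():
            totals[base] = totals.get(base, 0) + n
            if n >= min_cov:
                support[base] = support.get(base, 0) + 1
    present = [b for b in totals if support.get(b, 0) >= min_samp]
    if not present or len(present) > max_alleles:
        return None
    # Most frequent first; ties broken alphabetically (an assumption)
    present.sort(key=lambda b: (-totals[b], b))
    minor = present[1] if len(present) > 1 else "NA"
    return present[0], minor


samples = [
    {"A": 10, "C": 0, "G": 4, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 1},
    None,  # sample with missing data
]
print(major_minor(samples))
```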

-------------------------
CombineBaseCounts.pl
-------------------------

------------------------------------------------------------
CombineBaseCounts.pl
Counts the number of times each allele was observed, for each locus, in a collection of 
2bRAD data describing nucleotide frequencies for each locus and sample (the output from
SAMBaseCounts.pl). 

Output format: columns 1=tag, 2=position, 3=reference allele,
4=allele counts for sample 1 (A/C/G/T), 5=for sample 2, etc.
Missing data are shown as "NA" for all alleles.

Usage: CombineBaseCounts.pl file_1 file_2 ... file_n > output_file
Where:
	files 1-n:	nucleotide frequencies (output from SAMBaseCounts.pl) for each sample
	output_file:	a name for the output; tab-delimited text
------------------------------------------------------------

-------------------------
CompareSNPMatrices.pl
-------------------------

------------------------------------------------------------
CompareSNPMatrices.pl

Compare two matrices of SNP genotypes (e.g. produced from RAD data) from the same set of 
samples to evaluate overlap in genotyped loci, and the level of agreement and disagreement in genotypes.
This is useful for comparing different genotyping algorithms. 
Input files are formatted as the output from NFGenotyper or BGCGenotyper: tab-delimited text, 
rows are loci, columns 1-2 are tag and locus and subsequent columns are samples, homozygotes 
shown as e.g. "A", heterozygotes as e.g. "A C", and missing data as "0". 

Usage: CompareSNPMatrices.pl -f file1 -s file2 <OPTIONS>
Required arguments:
	-f file1	name of the first SNP matrix (tab delimited text)
	-s file2	name of the second SNP matrix (tab delimited text)
Options:
	-o option	0: (default) don't print any detailed info on disagreements
			1: show detailed info on loci called different homozygous genotypes in each file
			2: show detailed info on loci called different heterozygous genotypes in each file
			3: show detailed info on loci called homozygous in file1 and heterozygous in file2 
			4: show detailed info on loci called homozygous in file2 and heterozygous in file1 
			5: show detailed info on loci called in file 1 but not in file 2
	-b counts	(required if -o > 0) the file of allele counts from which genotypes were called.
------------------------------------------------------------
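
The comparison categories listed under -o can be pictured with the following Python sketch, which classifies one pair of genotype calls for the same locus and sample. It relies only on the genotype encoding stated above ("A", "A C", "0"). The category labels and the function name are hypothetical.

```python
def classify(g1, g2, missing="0"):
    """Classify a pair of genotype calls (file1 vs file2) for one locus and sample."""
    if g1 == missing and g2 == missing:
        return "both_missing"
    if g1 == missing:
        return "only_file2"
    if g2 == missing:
        return "only_file1"
    if sorted(g1.split()) == sorted(g2.split()):
        return "agree"  # same alleles, regardless of order
    het1, het2 = " " in g1, " " in g2
    if not het1 and not het2:
        return "diff_hom"   # different homozygous genotypes
    if het1 and het2:
        return "diff_het"   # different heterozygous genotypes
    return "hom1_het2" if het2 else "het1_hom2"


pairs = [("A", "A"), ("A", "C"), ("A C", "A G"), ("A", "A C"), ("A C", "A"), ("A", "0")]
for g1, g2 in pairs:
    print(repr(g1), repr(g2), classify(g1, g2))
```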

-------------------------
EvalFrags.pl
-------------------------

------------------------------------------------------------
EvalFrags.pl
Evaluates the uniqueness of type IIb restriction fragments in a FASTA file.
e.g. a collection of 36-bp AlfI fragments extracted from a genome sequence 
using AlfI_Extract.pl
Usage: EvalFrags.pl input.fasta 
Where:
	input.fasta:	a collection of 36-bp sequences (FASTA)
------------------------------------------------------------

-------------------------
ExtractSites.pl
-------------------------

------------------------------------------------------------
ExtractSites.pl
Counts and extracts type IIb restriction fragments from a set of DNA sequences.
Output:  a fasta file of those sites, named by position.
Usage:   ExtractSites.pl -i input -e enzyme -o output
Required arguments:
         -i input	a fasta file containing the sequences to be searched
	 -e enzyme	choice of enzyme (AlfI, BsaXI, BcgI)
         -o output	name for the output file, a fasta file of those sites
------------------------------------------------------------
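
For illustration, site extraction for AlfI could look like the Python sketch below. It assumes the AlfI recognition site GCA(N6)TGC with 12 flanking bases on each side, which gives the 36-bp fragments mentioned under EvalFrags.pl. The pattern, the 1-based positions and the function name are assumptions, not taken from the script.

```python
import re

# Assumed AlfI fragment: 12 bases, GCA, 6 bases, TGC, 12 bases = 36 bp.
# The lookahead lets overlapping sites be reported as well.
ALFI_SITE = re.compile(r"(?=([ACGT]{12}GCA[ACGT]{6}TGC[ACGT]{12}))")


def extract_sites(seq):
    """Return (1-based start position, 36-bp fragment) for each site in seq."""
    seq = seq.upper()
    return [(m.start() + 1, m.group(1)) for m in ALFI_SITE.finditer(seq)]


seq = "A" * 12 + "GCA" + "C" * 6 + "TGC" + "T" * 12
for pos, frag in extract_sites(seq):
    print(f">site_{pos}")
    print(frag)
```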

-------------------------
FastaStats.pl
-------------------------

------------------------------------------------------------
FastaStats.pl
Summarizes length statistics for a set of DNA sequences.
Usage: FastaStats.pl -i input -o output
Required arguments:
	-i input	name of the input file (FASTA)
	-o output	a name for the output file (TXT)
------------------------------------------------------------

-------------------------
gt2bayes.pl
-------------------------

------------------------------------------------------------
gt2bayes.pl
Converts a 2bRAD genotype matrix into the input format required by BayeScan.
Usage: gt2bayes.pl -i input -p pop.file -o output
Required arguments:
        -i input        tab-delimited genotype matrix, with rows=loci and columns=samples.
                        First two columns indicate tag and position respectively.
                        This format is the output from CallGenotypes.pl.
	-p pop.file	a tab-delimited text file showing which population each
			sample was drawn from. Formatted as: SampleName "\t" PopName "\n"
			Note -- make sure that sample names in this file are exactly identical
			to those shown in the first row of the genotype matrix.
	-o output 	a name for the BayeScan formatted output file.
------------------------------------------------------------

-------------------------
gt2colony.pl
-------------------------

------------------------------------------------------------
gt2colony.pl
Converts a genotype matrix (loci x samples) into the appropriate input format for COLONY.
Usage: gt2colony.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. COLONY input format.
------------------------------------------------------------

-------------------------
gt2dadi.pl
-------------------------

------------------------------------------------------------
gt2dadi.pl

Converts a SNP matrix (produced from CallGenotypes.pl) into the format required by
the software DADI, described at: https://bitbucket.org/gutenkunstlab/dadi/wiki/DataFormats
Usage: gt2dadi.pl -i input -k key -r reference -o output
Where:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-k key		a tab-delimited text file associating each sample in the input
			with a population label. Alleles will be counted and reported
			by the population labels assigned in this file. Formatted as:
			Sample_name	Population_name
	-r reference	Name of the reference file from which these SNPs were called (FASTA format)
	-o output	a name for the output file. Tab delimited text in DADI format.

------------------------------------------------------------

-------------------------
gt2fasta.pl
-------------------------

------------------------------------------------------------
gt2fasta.pl
Converts a genotype matrix (loci x samples) to a FASTA-formatted alignment.
Usage: gt2fasta.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. FASTA alignment format.
------------------------------------------------------------

-------------------------
gt2fstat.pl
-------------------------

------------------------------------------------------------
gt2fstat.pl
Converts a genotype matrix (loci x samples) to FSTAT format.
Usage: gt2fstat.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. FSTAT format. corresponding locus_key 
			and sample_key files are also produced.
------------------------------------------------------------

-------------------------
gt2phy.pl
-------------------------

------------------------------------------------------------
gt2phy.pl
Converts a genotype matrix (loci x samples) to a PHYLIP-formatted alignment.
Usage: gt2phy.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. PHYLIP alignment format.
------------------------------------------------------------

-------------------------
gt2related.pl
-------------------------

------------------------------------------------------------
gt2related.pl
Converts a genotype matrix (loci x samples) into the appropriate input format for the R package related
Usage: gt2related.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. RELATED format.
------------------------------------------------------------

-------------------------
gt2remlf90.pl
-------------------------

------------------------------------------------------------
gt2remlf90.pl
Converts a SNP genotype matrix (loci x samples) produced from 2bRAD genotyping
into the format required for the BLUPF90 family of programs for mixed models
and quantitative genetic analysis. See BLUPF90 manual for details of that format.
Usage: gt2remlf90.pl -i input -o output
Required arguments:
	-i input	Name of the input file, from CallGenotypes.pl.
			(rows=loci, columns=samples, columns 1 & 2 show tag name and position in tag)
	-o output	A name for the output file, in the format expected by BLUPF90 programs, e.g.
				sample0   02221022511020101020
				sample100 12221222221222200010
------------------------------------------------------------

-------------------------
gt2Rqtl.pl
-------------------------

------------------------------------------------------------
gt2Rqtl.pl
Converts a 2bRAD genotype matrix into the csv input format required by R/qtl.
Usage: gt2Rqtl.pl -i input -t traits -m map -o output
Where:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-t traits	tab-delimited file of data on  traits, as
				"sample1	trait1	...	traitN"
			(note that sample names must match column headers in snps file)
	-m map		tab-delimited file of map positions as
				"marker	  LG	position"
	-o output	a name for the csv formatted output file.
------------------------------------------------------------

-------------------------
gt2snpmatrix.pl
-------------------------

------------------------------------------------------------
gt2snpmatrix.pl
Converts a genotype matrix (loci x samples) to a snp matrix, as described in the manual
for the R package diveRsity. This snp matrix can be converted to genepop format
using diveRsity's snp2gen function.
Usage: gt2snpmatrix.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. SNP matrix format, input for snp2gen.
------------------------------------------------------------

-------------------------
gt2structure.pl
-------------------------

------------------------------------------------------------
gt2structure.pl
Converts a genotype matrix (loci x samples) into the appropriate input format for STRUCTURE.
Usage: gt2structure.pl -i input -o output
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. STRUCTURE input format.
------------------------------------------------------------

-------------------------
gt2vcf.pl
-------------------------

------------------------------------------------------------
gt2vcf.pl

Converts a genotype matrix (loci x samples) to a VCF file.
Usage: gt2vcf.pl -i input -r reference -o output <options>
Required arguments:
	-i input	tab-delimited genotype matrix, with rows=loci and columns=samples.
                	First two columns indicate tag and position respectively.
                	This format is the output from CallGenotypes.pl.
	-o output	a name for the output file. VCF format.
	-r reference	Complete path to the reference file used to generate these genotypes (FASTA). 
Options:
	-f filters	a text file describing the filters applied to the genotypes. This information
			will be included in the VCF file. e.g.	
			  "MD	removed loci genotyped in <20 samples"

------------------------------------------------------------

-------------------------
LowcovSampleFilter.pl
-------------------------

------------------------------------------------------------
LowcovSampleFilter.pl

Excludes samples with too much missing data (genotypes called at too few loci)
Usage: LowcovSampleFilter.pl -i input -n min_data <OPTIONS>
Required arguments:
	-i input	name of the input file, a matrix of genotypes or allele counts
			(see -m for format)
	-n min_data	samples in which fewer than this number of loci were genotyped will be excluded
Options:
	-m mode		g=genotypes (default). 
			  Input file contains genotypes from individuals.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          subsequent columns contain genotypes for each sample
			a=allele counts.
			  Input file contains allele counts from pooled samples.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          column 3 = major allele, column 4 = minor allele
                          subsequent pairs of columns contain allele counts 
			  (major then minor) for each sample
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------

-------------------------
MDFilter.pl
-------------------------

------------------------------------------------------------
MDFilter.pl

Excludes loci containing too much missing data (genotyped in too few samples)
Usage: MDFilter.pl -i input -n min_data <OPTIONS>
Required arguments:
	-i input	name of the input file, a matrix of genotypes or allele counts
			(see -m for format)
	-n min_data	loci that were genotyped in fewer samples than this will be excluded
Options:
	-m mode		g=genotypes (default). 
			  Input file contains genotypes from individuals.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          subsequent columns contain genotypes for each sample
			a=allele counts.
			  Input file contains allele counts from pooled samples.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          column 3 = major allele, column 4 = minor allele
                          subsequent pairs of columns contain allele counts 
			  (major then minor) for each sample
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------
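
In genotype mode (-m g), the missing-data filter amounts to counting called genotypes per row. The Python sketch below shows the idea. It assumes missing genotypes are written as "0", as described under CompareSNPMatrices.pl, and the function name is hypothetical.

```python
def md_filter(rows, min_data, missing="0"):
    """Keep the header plus loci genotyped in at least min_data samples.

    rows: list of lists; row 0 is the header, columns 0-1 are tag and position.
    """
    header, loci = rows[0], rows[1:]
    kept = [r for r in loci if sum(g != missing for g in r[2:]) >= min_data]
    return [header] + kept


rows = [
    ["tag", "pos", "s1", "s2", "s3"],
    ["tag1", "5", "A", "A C", "0"],  # genotyped in 2 samples
    ["tag2", "9", "0", "0", "T"],    # genotyped in 1 sample
]
print(md_filter(rows, min_data=2))
```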

-------------------------
OneSNPPerTag.pl
-------------------------

------------------------------------------------------------
OneSNPPerTag.pl

Selects a single SNP from each tag in a matrix of genotypes or allele counts. 
Chooses the locus with the least missing data. 
Usage: OneSNPPerTag.pl -i input <OPTIONS>
Required arguments:
	-i input	name of the input file, a matrix of genotypes or allele counts
			(see -m for format)
Options:
	-m mode		g=genotypes (default). 
			  Input file contains genotypes from individuals.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          subsequent columns contain genotypes for each sample
			a=allele counts.
			  Input file contains allele counts from pooled samples.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          column 3 = major allele, column 4 = minor allele
                          subsequent pairs of columns contain allele counts 
			  (major then minor) for each sample
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------
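
Selecting the locus with the least missing data per tag might be sketched in Python as follows (genotype mode only). The tie-breaking rule (keep the first locus encountered) and the "0" missing-data code are assumptions, and the function name is hypothetical.

```python
def one_snp_per_tag(rows, missing="0"):
    """Keep the header plus, for each tag, the locus with the fewest missing genotypes."""
    header, loci = rows[0], rows[1:]
    best = {}  # tag -> (number of missing genotypes, row)
    for r in loci:
        n_missing = sum(g == missing for g in r[2:])
        tag = r[0]
        # Strict '<' keeps the first locus when two loci tie
        if tag not in best or n_missing < best[tag][0]:
            best[tag] = (n_missing, r)
    return [header] + [r for _, r in best.values()]


rows = [
    ["tag", "pos", "s1", "s2", "s3"],
    ["tag1", "5", "A", "0", "0"],
    ["tag1", "17", "A", "A C", "C"],
    ["tag2", "9", "G", "G", "0"],
]
print(one_snp_per_tag(rows))
```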

-------------------------
PolyFilter.pl
-------------------------

------------------------------------------------------------
PolyFilter.pl
Excludes loci containing too few classes of genotypes or numbers of alleles (keeps polymorphic loci).
Usage: PolyFilter.pl -i input <OPTIONS>
Required arguments:
	-i input	name of the input file, a matrix of genotypes or allele counts
			(see -m for format)
Options:
	-g genotypes	minimum number of unique genotypes (for -m g) or alleles (for -m a) 
			required to consider a locus polymorphic (default=2)
	-m mode		g=genotypes (default). 
			  Input file contains genotypes from individuals.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          subsequent columns contain genotypes for each sample
			a=allele counts.
			  Input file contains allele counts from pooled samples.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          column 3 = major allele, column 4 = minor allele
                          subsequent pairs of columns contain allele counts 
			  (major then minor) for each sample
	-v min_cov	(for -m a) minimum coverage required to consider an allele present
			(default=2)
	-s min_samp	minimum number of samples in which an allele must be present
			(default=1)
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------

-------------------------
QualFilterFastq.pl
-------------------------

------------------------------------------------------------
QualFilterFastq.pl

Removes reads containing too many low quality basecalls from a set of short sequences
Output:  high-quality reads in FASTQ format
Usage:   QualFilterFastq.pl -i input -m min_score -x max_LQ -o output
Required arguments:
	-i input	raw input reads in FASTQ format
	-m min_score	quality scores below this are considered low quality (LQ)
	-x max_LQ	reads with more than this many LQ bases are excluded
	-o output	name for output file of HQ reads in FASTQ format
------------------------------------------------------------
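
The quality filter can be pictured with the Python sketch below. It assumes Phred+33 quality encoding, which the usage text above does not state, and it represents each read as a (header, sequence, separator, quality) tuple. The function names are hypothetical.

```python
def count_low_quality(qual, min_score, offset=33):
    """Count bases whose quality score falls below min_score (Phred+33 assumed)."""
    return sum((ord(c) - offset) < min_score for c in qual)


def qual_filter(records, min_score, max_lq):
    """Keep reads with no more than max_lq low-quality bases.

    records: iterable of (header, sequence, separator, quality) tuples.
    """
    return [r for r in records if count_low_quality(r[3], min_score) <= max_lq]


reads = [
    ("@read1", "ACGT", "+", "IIII"),  # 'I' encodes Q40 under Phred+33
    ("@read2", "ACGT", "+", "I!!I"),  # '!' encodes Q0 under Phred+33
]
print(qual_filter(reads, min_score=20, max_lq=0))
```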

-------------------------
RandomFastq.pl
-------------------------

------------------------------------------------------------
RandomFastq.pl
Draws the specified number of sequences randomly from a FASTQ sequence file.
Usage: RandomFastq.pl -i input -n num_seq -o output
Required arguments:
	-i input	name of the input file from which sequences will be randomly drawn.
	-n num_seq	number of sequences to draw
	-o output	a name for the output file (FASTQ)
------------------------------------------------------------
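
Random subsampling of reads can be sketched in Python as shown below, for example to prepare the input for BuildRef.pl. The sketch draws without replacement and loads all records into memory, which may differ from how the script works. The seed parameter and function name are hypothetical additions for reproducibility.

```python
import random


def random_reads(records, num_seq, seed=None):
    """Draw num_seq records at random, without replacement.

    If fewer than num_seq records exist, all of them are returned.
    """
    rng = random.Random(seed)
    records = list(records)
    return rng.sample(records, min(num_seq, len(records)))


reads = [(f"@read{i}", "ACGT", "+", "IIII") for i in range(10)]
subset = random_reads(reads, 3, seed=1)
print([r[0] for r in subset])
```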

-------------------------
RepTagFilter.pl
-------------------------

------------------------------------------------------------
RepTagFilter.pl

Excludes tags containing too many SNPs, suggesting repetitive regions of the genome
Usage: RepTagFilter.pl -i input -n max_snps <OPTIONS>
Required arguments:
	-i input	name of the SNP input file, a matrix of genotypes or allele counts
			(see -m for format)
			note: the script assumes the input only includes polymorphic loci
	-n max_snps	all SNPs from tags containing more than this number of SNPs will be excluded
Options:
	-m mode		g=genotypes (default). 
			  Input file contains genotypes from individuals.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          subsequent columns contain genotypes for each sample
			a=allele counts.
			  Input file contains allele counts from pooled samples.
			  Input format: rows=loci and columns=samples.
                          row 1 = column label, column 1 = tag, column 2 = position
                          column 3 = major allele, column 4 = minor allele
                          subsequent pairs of columns contain allele counts 
			  (major then minor) for each sample
	-p option	y=print filtered loci and summary; n=only print summary
			(default=n)
	-o output	a name for the output file (loci passing this filter) (required if -p y)

------------------------------------------------------------
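
A minimal Python sketch of this filter is shown below. Like the script, it assumes that every row in the input is a polymorphic locus, so the number of rows per tag equals the number of SNPs in that tag. The function name is hypothetical.

```python
from collections import Counter


def rep_tag_filter(rows, max_snps):
    """Keep the header plus all SNPs from tags with at most max_snps SNPs."""
    header, loci = rows[0], rows[1:]
    per_tag = Counter(r[0] for r in loci)  # SNPs per tag (column 0)
    return [header] + [r for r in loci if per_tag[r[0]] <= max_snps]


rows = [
    ["tag", "pos", "s1", "s2"],
    ["tag1", "3", "A", "A C"],
    ["tag1", "8", "G", "G T"],
    ["tag1", "20", "C", "C T"],
    ["tag2", "11", "A G", "A"],
]
print(rep_tag_filter(rows, max_snps=2))
```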

-------------------------
SAMBaseCounts.pl
-------------------------

------------------------------------------------------------
SAMBaseCounts.pl
Counts nucleotide frequencies at each locus in a 2bRAD sequence data set.
Usage: SAMBaseCounts.pl -i input -r reference -o output <OPTIONS>
Required arguments:
	-i input		input alignments, SAM format
	-r reference		reference used to generate the input alignments, FASTA format
	-o output		a name for the output file (tab delimited text)
Options:
	-c coverage		loci with lower coverage are discarded (default: 3)
------------------------------------------------------------

-------------------------
SAMFilter.pl
-------------------------

------------------------------------------------------------
SAMFilter.pl
Filters the alignments produced by mapping short reads against a reference,
excluding ambiguous, short, and weak matches.
NOTE: make sure that when a read matches multiple reference sequences (ambiguous)
your mapper reports at least two alignments in the output. This is NOT the default 
behavior for some mappers, but is required to exclude ambiguous matches before genotyping.

Usage:  SAMFilter.pl -i input -m matches -o output <options>
Required arguments:
	-i input	Output from any short read mapper, in SAM format.
	-m matches	Minimum number of matching bases required to consider an alignment valid. 
	-o output	A name for the filtered output (SAM format). 
Options:
	-c option	1: Report the number of reads matching each reference sequence
			in a separate output file, "counts.tab". 0: Don't produce this file (default).
	-l length	Minimum length of aligned region (match, mismatch, + gaps) required to consider 
			an alignment valid. Only relevant if your mapper uses local alignment. For global
			alignments, this is set equal to -m. 
------------------------------------------------------------

-------------------------
TruncateFastq.pl
-------------------------

------------------------------------------------------------
TruncateFastq.pl
Truncates a set of short reads in FASTQ format to keep the region specified
Usage:   TruncateFastq.pl -i input -s start -e end -o output
Required arguments:
	-i input	file of short reads to be filtered, fastq format
	-s start	beginning of the region to keep, nucleotide position
	-e end		end of the region to keep, nucleotide position
	-o output	a name for the output file (fastq format)
------------------------------------------------------------
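
Truncation of a single read can be sketched in Python as follows. The sketch assumes that -s and -e are 1-based and inclusive, which the usage text above does not confirm. The function name and the tuple representation of a read are hypothetical.

```python
def truncate_record(record, start, end):
    """Keep positions start..end (1-based, inclusive; an assumption) of one read.

    record: (header, sequence, separator, quality) tuple.
    The sequence and quality strings are trimmed identically.
    """
    header, seq, plus, qual = record
    return (header, seq[start - 1:end], plus, qual[start - 1:end])


read = ("@read1", "ACGTACGTAC", "+", "IIIIIHHHHH")
print(truncate_record(read, start=2, end=5))
```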
