Sequence Manipulation Wiki
Last Updated: 07/16/2018
- Biopython (used by many scripts in this subrepo)
BCBio (used only by
How to Use this Repo
To use this repo (just the
sequence/ scripts, at least), simply add the directory to your PYTHONPATH and PATH. No compilation is required.
export PYTHONPATH=$PYTHONPATH:<path_to_Cupcake>/sequence export PATH=$PATH:<path_to_Cupcake>/sequence
List of Scripts
get_seq_stats.py: Summarize length distribution of a FASTA/FASTQ file.
fq2fa.py: Convert between FASTA and FASTQ format.
rev_comp.py: Reverse complement a sequence from command line.
sort_fasta_by_len.py: sort fasta file by length (increasing or decreasing).
get_seqs_from_list.py: extract list of sequences given a fasta file and a list of IDs.
err_correct_w_genome.py: generate fasta sequences given genome and SAM file.
summarize_gmap_sam.py: Summarize SAM file.
calc_expected_accuracy_from_fastq.py: Calculate expected accuracy from FASTQ file.
filter_lq_isoforms.py: Filter LQ isoforms using customized FL count and expected accuracy cutoff.
sam_to_bam.py: Convert SAM to BAM using samtools.
sam_to_gff3.py: Convert SAM to GFF3 format using BioPython and BCBio GFF library.
Summarize sequence lengths in fasta/fastq files
$ get_seq_stats.py test.fa file type is: fasta 50 sequences min: 387 max: 7394 avg: 1906.34 Length Breakdown by kb range: 0-1 kb: 17 1-2 kb: 10 2-3 kb: 18 3-4 kb: 3 4-5 kb: 0 5-6 kb: 0 6-7 kb: 0 7-8 kb: 2
Convert between fasta and fastq format
fa2fq.py <fasta_filename> fq2fa.py <fastq_filename>
fa2fq.py test.fasta fq2fa.py test.fastq
Reverse complement sequence(s)
revcomp.py <seq1> [seq2] [seq3] [seq4]
$ revcomp.py AAACCT AGGTTT $ revcomp.py AAACCT GGCCGAA AGGTTT TTCGGCC
Sort fasta file by length (increasing or decreasing)
sort_fasta_by_len.py [-r] <fasta_filename>
$ sort_fasta_by_len.py test.fa (sort by increasing length) $ sort_fasta_by_len.py -r test.fa (sort by decreasing length)
Extract list of sequences given a fasta file and a list of IDs
get_seqs_from_list.py <fasta_filename> <list_filename>
list_filename should be a text file where each line is a sequence ID to be found in the fasta file.
get_seqs_from_list.py test.fasta list.fasta > extracted.fasta
Generate fasta sequences given genome and SAM file
NOTE: for this script to properly work (since it depends on other scripts in the same directory), you must add to your PATH variable the
sequence/ subdirectory of the repo. For example:
err_correct_w_genome.py <genome_fasta> <SAM_file> <output_fasta_filename>
genome_fasta should be the genomic fasta file (ex:
hg38.fa), but note that the program reads the entire genome so it may take a while and you must have enough memory!
SAM_file is the SAM alignment file from which to reconstitute your genomic sequences. A common example would be the GMAP SAM output of a PacBio transcriptome dataset. The SAM file does not need to be sorted.
Example (including an example GMAP command to first generate the SAM file):
gmap -D /home/gmap_db -d hg38 -f samse -n 0 -t 12 input.fasta > input.fasta.sam 2> input.fasta.sam.log err_correct_w_genome.py hg38.fa input.fasta.sam input.hg38_corrected.fasta
Summarize (GMAP) SAM file
summarize_gmap_sam.py produces a simple tab-delimited report file that shows you a human-readable format for each aligned read in a (GMAP) SAM file.
summarize_gmap_sam.py <input_fasta_or_fastq> <gmap_sam_file>
The output will look something like this:
id qLength qCoverage identity unique i2_HQ_sample1|c36312/f3p0/1169 1076 1.0000 1.0000 Y i1_HQ_sample1|c14559/f3p0/1177 1074 1.0000 0.9991 Y
qLength is the length of the sequence,
qCoverage is how much of the sequence was aligned,
identity is the alignment identity, and
unique is if the read is uniquely aligned. If there are multiple alignments for that read, each alignment will be reported on a separate line (but not necessarily sorted, unless the SAM file itself is sorted).
Calculate expected accuracy from FASTQ file.
This is for calculating expected accuracy based on FASTQ QV. Useful for looking at expected accuracy in low-quality (LQ) isoforms.
usage: calc_expected_accuracy_from_fastq.py [-h] [--qv_trim_5 QV_TRIM_5] [--qv_trim_3 QV_TRIM_3] fastq_filename output_filename
An example usage would be:
python calc_expected_accuracy_from_fastq.py lq_isoforms.fastq lq_isoforms.with_exp_acc.fastq
This would add an extra field at the end of the sequence header, for example:
i0_LQ_sample32f89e|c21/f13p0/373 isoform=c21; \ full_length_coverage=13; \ non_full_length_coverage=0; \ isoform_length=373; \ expected_accuracy=0.365
Filter LQ isoforms using customized FL count and expected accuracy cutoff
IMPORTANT: The FASTQ sequence IDs must already have
expected_accuracy=. If not, please run
calc_expected_accuracy_from_fastq.py first to modify the sequence IDs.
usage: filter_lq_isoforms.py [-h] [--min_fl_count MIN_FL_COUNT] [--min_exp_acc MIN_EXP_ACC] fastq_filename output_filename
An example usage is:
python filter_lq_isoforms.py lq_isoforms.with_exp_acc.fastq lq_filtered.fastq --min_fl_count=1 --min_exp_acc=0.95
While running, you will see messages like:
Including i10_LQ_sample32f89e|c526/f1p0/10181 into output file (FL: 1, acc: 0.968). Including i10_LQ_sample32f89e|c545/f1p35/10190 into output file (FL: 1, acc: 0.993). Including i10_LQ_sample32f89e|c550/f1p0/10108 into output file (FL: 1, acc: 0.97). Including i11_LQ_sample32f89e|c204/f1p0/11961 into output file (FL: 1, acc: 0.994). Including i11_LQ_sample32f89e|c313/f1p0/11313 into output file (FL: 1, acc: 0.989).
Convert SAM to BAM
usage: sam_to_bam.py sam_filename
Uses samtools to convert SAM to BAM file.
Convert SAM to GFF3
usage: sam_to_gff3.py [-i INPUT_FASTA] sam_filename