Skip to content

Sequence Manipulation Wiki

Elizabeth Tseng edited this page Jul 16, 2018 · 19 revisions

Last Updated: 07/16/2018

Python Requirement

  • Biopython (used by many scripts in this subrepo)
  • BCBio (used only by sam_to_gff3.py)

How to Use this Repo

To use this repo (just the sequence/ scripts, at least), simply add the directory to your PYTHONPATH and PATH. No compilation is required.

export PYTHONPATH=$PYTHONPATH:<path_to_Cupcake>/sequence
export PATH=$PATH:<path_to_Cupcake>/sequence

List of Scripts

Detailed Documentation

Summarize sequence lengths in fasta/fastq files

Usage:

get_seq_stats.py <fasta_or_fastq_filename>

Example:

$ get_seq_stats.py test.fa
file type is: fasta
50 sequences
min: 387
max: 7394
avg: 1906.34
Length Breakdown by kb range:
0-1 kb: 17
1-2 kb: 10
2-3 kb: 18
3-4 kb: 3
4-5 kb: 0
5-6 kb: 0
6-7 kb: 0
7-8 kb: 2

Convert between fasta and fastq format

Usage:

fa2fq.py <fasta_filename>
fq2fa.py <fastq_filename>

Example:

fa2fq.py test.fasta
fq2fa.py test.fastq

Reverse complement sequence(s)

Usage:

revcomp.py <seq1> [seq2] [seq3] [seq4]

Example:

$ revcomp.py AAACCT
AGGTTT
$ revcomp.py AAACCT GGCCGAA
AGGTTT
TTCGGCC

Sort fasta file by length (increasing or decreasing)

Usage:

sort_fasta_by_len.py [-r] <fasta_filename>

Example:

$ sort_fasta_by_len.py test.fa   (sort by increasing length)
$ sort_fasta_by_len.py -r test.fa  (sort by decreasing length)

Extract list of sequences given a fasta file and a list of IDs

Usage:

get_seqs_from_list.py <fasta_filename> <list_filename>

list_filename should be a text file where each line is a sequence ID to be found in the fasta file.

Example:

get_seqs_from_list.py test.fasta list.fasta > extracted.fasta

Generate fasta sequences given genome and SAM file

NOTE: for this script to properly work (since it depends on other scripts in the same directory), you must add to your PATH variable the sequence/ subdirectory of the repo. For example:

export PATH=$PATH:<path_to_Cupcake>/sequence/

Usage:

err_correct_w_genome.py <genome_fasta> <SAM_file> <output_fasta_filename>

genome_fasta should be the genomic fasta file (ex: hg38.fa), but note that the program reads the entire genome so it may take a while and you must have enough memory!

SAM_file is the SAM alignment file from which to reconstitute your genomic sequences. A common example would be the GMAP SAM output of a PacBio transcriptome dataset. The SAM file does not need to be sorted.

Example (including an example GMAP command to first generate the SAM file):

gmap -D /home/gmap_db -d hg38 -f samse -n 0 -t 12 input.fasta > input.fasta.sam 2> input.fasta.sam.log

err_correct_w_genome.py hg38.fa input.fasta.sam input.hg38_corrected.fasta

Summarize (GMAP) SAM file

The script summarize_gmap_sam.py produces a simple tab-delimited report file that shows you a human-readable format for each aligned read in a (GMAP) SAM file.

Usage:

summarize_gmap_sam.py <input_fasta_or_fastq> <gmap_sam_file>

The output will look something like this:

id  qLength qCoverage   identity    unique
i2_HQ_sample1|c36312/f3p0/1169   1076    1.0000  1.0000  Y
i1_HQ_sample1|c14559/f3p0/1177   1074    1.0000  0.9991  Y

where qLength is the length of the sequence, qCoverage is how much of the sequence was aligned, identity is the alignment identity, and unique is if the read is uniquely aligned. If there are multiple alignments for that read, each alignment will be reported on a separate line (but not necessarily sorted, unless the SAM file itself is sorted).

Calculate expected accuracy from FASTQ file.

This is for calculating expected accuracy based on FASTQ QV. Useful for looking at expected accuracy in low-quality (LQ) isoforms.

usage: calc_expected_accuracy_from_fastq.py [-h] [--qv_trim_5 QV_TRIM_5]
                                            [--qv_trim_3 QV_TRIM_3]
                                            fastq_filename output_filename

An example usage would be:

python calc_expected_accuracy_from_fastq.py lq_isoforms.fastq lq_isoforms.with_exp_acc.fastq

This would add an extra field at the end of the sequence header, for example:

i0_LQ_sample32f89e|c21/f13p0/373 isoform=c21; \
full_length_coverage=13; \
non_full_length_coverage=0; \
isoform_length=373; \
expected_accuracy=0.365

Filter LQ isoforms using customized FL count and expected accuracy cutoff

IMPORTANT: The FASTQ sequence IDs must already have expected_accuracy=. If not, please run calc_expected_accuracy_from_fastq.py first to modify the sequence IDs.

Usage:

usage: filter_lq_isoforms.py [-h] [--min_fl_count MIN_FL_COUNT]
                             [--min_exp_acc MIN_EXP_ACC]
                             fastq_filename output_filename

An example usage is:

python filter_lq_isoforms.py lq_isoforms.with_exp_acc.fastq lq_filtered.fastq --min_fl_count=1 --min_exp_acc=0.95

While running, you will see messages like:

Including i10_LQ_sample32f89e|c526/f1p0/10181 into output file (FL: 1, acc: 0.968).
Including i10_LQ_sample32f89e|c545/f1p35/10190 into output file (FL: 1, acc: 0.993).
Including i10_LQ_sample32f89e|c550/f1p0/10108 into output file (FL: 1, acc: 0.97).
Including i11_LQ_sample32f89e|c204/f1p0/11961 into output file (FL: 1, acc: 0.994).
Including i11_LQ_sample32f89e|c313/f1p0/11313 into output file (FL: 1, acc: 0.989).

Convert SAM to BAM

usage: sam_to_bam.py sam_filename

Uses samtools to convert SAM to BAM file.

Convert SAM to GFF3

Convert SAM file to GFF3 format using BCBio GFF library. Requires installing BCBio first.

usage: sam_to_gff3.py  [-i INPUT_FASTA]  sam_filename
You can’t perform that action at this time.