-
Notifications
You must be signed in to change notification settings - Fork 17
Tutorial: Using Cogent to collapse redundant transcripts in absence of genome
Last Updated: 09/20/2019
This tutorial describes the use case for when:
- There is no reference genome
- There is Iso-Seq data (HQ isoform sequences) that have been processed through Cogent (follow tutorial here).
And you would like to:
- Obtain a final set of unique (non-redundant) transcript isoforms
You will need the following files:
- Iso-Seq output (HQ isoform sequences). This file is usually called
hq_transcripts.fasta
. - Iso-Seq output of
cluster_report.csv
. This is optional. If you have this file, you will be able to get the abundance information for the collapsed isoforms at the end of this tutorial. - Cogent family finding output file. This would be called
test_human.partition.txt
in the tutorial. - Cogent reconstruction output. If you followed the tutorial, and say the directory is
test_human/
, then the output files aretest_human/*/cogent2.renamed.fasta
.
We will create a fake genome that consists of all the Cogent reconstructed contigs and the "unassigned" HQ isoforms. The "unassigned" HQ isoforms are isoforms that did not have a similarity hit to any other isoforms during the family finding part of Cogent.
Use the commands below to extract just the unassigned sequences. You will need get_seqs_from_list.py
from the Cupcake repository (no installation required).
$ export PATH=$PATH:<path_to_Cupcake>/sequence
$ tail -n 1 test_human.partition.txt | tr ',' '\n' > unassigned.list
$ get_seqs_from_list.py hq_transcripts.fasta unassigned.list > unassigned.fasta
Next we will create a "fake genome" by concatenating all Cogent output with the unassigned:
$ mkdir collected
$ cd collected
$ cat ../test_human/*/cogent2.renamed.fasta ../unassigned.fasta > cogent.fake_genome.fasta
We need to first create a SAM alignment. This can be done either with Minimap2 (faster, may be more sensitive but slightly less precise) or GMAP (slower, may be more precise). For more info on parameter choices, see this tutorial on minimap2, deSALT, GMAP, STAR, BLAT.
Here is an example using minimap2:
$ minimap2 -ax splice -t 30 -uf --secondary=no \
cogent.fake_genome.fasta hq_transcripts.fasta > \
hq_transcripts.fasta.sam
Or you can use GMAP instead of minimap2:
$ gmap_build -D . -d fake_genome cogent.fake_genome.fasta
$ gmap -D . -d fake_genome -f samse -n 0 -t 30 \
hq_transcripts.fasta > hq_transcripts.fasta.sam 2> \
hq_transcripts.fasta.sam.log
We will then follow the collapse tutorial from Cupcake. You will need to install the Cupcake repo at this point!
$ sort -k 3,3 -k 4,4n hq_transcripts.fasta.sam > hq_transcripts.fasta.sorted.sam
$ collapse_isoforms_by_sam.py --input hq_transcripts.fasta -s hq_transcripts.fasta.sorted.sam \
-c 0.95 -i 0.85 --dun-merge-5-shorter \
-o hq_transcripts.fasta.no5merge
$ get_abundance_post_collapse.py hq_transcripts.fasta.no5merge.collapsed cluster_report.csv
$ filter_away_subset.py hq_transcripts.fasta.no5merge.collapsed
The output will be hq_transcripts.fasta.no5merge.collapsed.filtered.*
which have been collapsed into unique isoforms.
You can then follow up with rarefaction analysis using the collapsed output as well.