Skip to content

Tutorial: Using Cogent to collapse redundant transcripts in absence of genome

Elizabeth Tseng edited this page Sep 20, 2019 · 10 revisions

Last Updated: 09/20/2019

This tutorial describes the use case for when:

  1. There is no reference genome
  2. There is Iso-Seq data (HQ isoform sequences) that have been processed through Cogent (follow tutorial here).

And you would like to:

  • Obtain a final set of unique (non-redundant) transcript isoforms

Prerequisite

You will need the following files:

  1. Iso-Seq output (HQ isoform sequences). This file is usually called hq_transcripts.fasta.
  2. Iso-Seq output of cluster_report.csv. This is optional. If you have this file, you will be able to get the abundance information for the collapsed isoforms at the end of this tutorial.
  3. Cogent family finding output file. This would be called test_human.partition.txt in the tutorial.
  4. Cogent reconstruction output. If you followed the tutorial, and say the directory is test_human/, then the output files are test_human/*/cogent2.renamed.fasta.

Creating the "fake genome"

We will create a fake genome that consists of all the Cogent reconstructed contigs and the "unassigned" HQ isoforms. The "unassigned" HQ isoforms are isoforms that did not have a similarity hit to any other isoforms during the family finding part of Cogent.

Getting the unassigned sequences

Use the commands below to extract just the unassigned sequences. You will need get_seqs_from_list.py from the Cupcake repository (no installation required).

$ export PATH=$PATH:<path_to_Cupcake>/sequence
$ tail -n 1 test_human.partition.txt | tr ',' '\n'  > unassigned.list
$ get_seqs_from_list.py hq_transcripts.fasta unassigned.list > unassigned.fasta

Concatenate unassigned with Cogent contigs

Next we will create a "fake genome" by concatenating all Cogent output with the unassigned:

$ mkdir collected
$ cd collected
$ cat ../test_human/*/cogent2.renamed.fasta ../unassigned.fasta > cogent.fake_genome.fasta

Collapsing redundant isoforms

We need to first create a SAM alignment. This can be done either with Minimap2 (faster, may be more sensitive but slightly less precise) or GMAP (slower, may be more precise). For more info on parameter choices, see this tutorial on minimap2, deSALT, GMAP, STAR, BLAT.

Here is an example using minimap2:

$ minimap2 -ax splice -t 30 -uf --secondary=no \
  cogent.fake_genome.fasta hq_transcripts.fasta > \
   hq_transcripts.fasta.sam

Or you can use GMAP instead of minimap2:

$ gmap_build -D . -d fake_genome cogent.fake_genome.fasta
$ gmap -D . -d fake_genome -f samse -n 0 -t 30 \
   hq_transcripts.fasta > hq_transcripts.fasta.sam 2> \
   hq_transcripts.fasta.sam.log

We will then follow the collapse tutorial from Cupcake. You will need to install the Cupcake repo at this point!

$ sort -k 3,3 -k 4,4n hq_transcripts.fasta.sam > hq_transcripts.fasta.sorted.sam
$ collapse_isoforms_by_sam.py --input hq_transcripts.fasta -s hq_transcripts.fasta.sorted.sam \
           -c 0.95 -i 0.85 --dun-merge-5-shorter \
           -o hq_transcripts.fasta.no5merge

$ get_abundance_post_collapse.py hq_transcripts.fasta.no5merge.collapsed cluster_report.csv
$ filter_away_subset.py hq_transcripts.fasta.no5merge.collapsed

The output will be hq_transcripts.fasta.no5merge.collapsed.filtered.* which have been collapsed into unique isoforms.

You can then follow up with rarefaction analysis using the collapsed output as well.