Circularizing and trimming

rhallPB edited this page Sep 25, 2015 · 13 revisions
Clone this wiki locally

When an assembly results in a circular molecule, confirmed via a dot plot, the beginning and end of the contig contain the same sequence. This results in reads mapping to both locations during the HGAP polishing step, as such the reads have a low mapping score, and are not used to call consensus. In this case a polished assemby will have low quality consensus sequence in these regions. As read lengths increase the size of this region increases, and in small circular molecules such as BACs or plasmid this region can make up a significant portion of the sequence. The following workflow addresses this problem and will result in a blunt ended circular sequence that has high quality consensus throughout. Other than SMRT Analysis the AMOS package is required. Note part of the AMOS package is included in SMRT Analysis, but a full install is required for this workflow. Note as of 2.3 the AMOS executables in SMRT Analysis work for this procedure, simply source /etc/setup.sh.

  1. Open up the polished_assembly.fasta file in a text editor

  2. In the middle of the circular contig introduce a break, i.e. a new line '>Break'. It does not matter where in the sequence you introduce the break, but the sequence immediately after the break will be the start of your circularized sequence.

  3. toAmos -s polished_assembly.fasta -o circularized.afg

  4. minimus2 circularized

  5. Minimus will overlap and join the ends of the two contigs, the resulting circularized.fasta file should contain one contig for the sequence in which you introduced a break.

  6. Any sequence that was not circularized, or extra contigs in which you did not introduce a break will be in circularized.singletons.seq, and can be added back to the circularized.fasta file cat circularized.singletons.seq >> circularized.fasta

  7. The sequence that was overlapped now needs to be quiver corrected, to do this simply import the circularized.fasta file into SMRT Portal as a reference and run a resequencing job with the raw data. Problems with the coverage in the region which was overlapped could indicate an issue with the circularization and the possibility that the molecule was not circular, or was split at a repeat that has not been accounted for.

Advanced usage:

If a contig terminates in a repeat structure and there is a possibility that circularization will collapse a repeat minimus2 can be limited to finding an exact overlap using the min identity parameter:
minimus2 circularized -D MINID=100
Other parameters affecting the finding of overlaps / possiblity of false positives, are the minimum length of the sequences that must overlap:
-D OVERLAP=<n>
and the amount of sequence that can be trimmed from the ends of the contigs in order to form this overlap:
-D MAXTRIM=<n>

##Further reading Blog post discussing circularization and where to break the circle - Omics! Omics! Assembly Could Benefit From More Circular Reasoning

Circlator also can circularize genomes.