Generating Contigs using UG and Reads #9

ardakdemir · 2019-06-16T00:34:52Z

Generating Contigs using UG and Reads

We would like to generate contigs using the unitigs on the dbg and the reads given as the initial input for the dbg construction. This step of the project is more challenging compared to the previous two steps : DBG construction and DBG-to-UG.

We should decide on the core functionalities required during the contig generation and start implementing them. Right now, the dbg constructor takes as input a set of kmers in their canonical form.

First, we need to generalize the dbg constructor by having a preprocessing step for the raw reads to be represented as a set of canonical kmers. The preprocessing step should enumerate all unique canonical kmers, from a set of reads.

TransGirlCodes · 2019-07-03T01:20:18Z

For contigging or solving the graph beyond untagging we need to be able to map reads to the graph.

For Paired-End Illumina reads (or any sequencing tech where per bp quality is very high), a simple mapper that is good enough in many situations is to create an index of all the unique kmers and where they are in the graph.

A unique kmer being a kmer that appears once, in one place in the graph, and never again anywhere else in the graph. The index is simply a Dict{Kmer (canonical), (nodeid, position)}.

Mapping a read to the graph then is a somewhat straightforward process: Go through the read, one kmer at a time, and check if it is in the unique kmer index of the graph. If it is in the index, then we note the node and position, by going through each kmer in the read and collecting a series of node+positions, we can then check the graph to see if the sequence of nodes we got is a sensible path through the graph (i.e. it exists). If it is, then the read has been successfully mapped to a path in the graph.

ardakdemir · 2019-07-03T04:53:57Z

Thanks a lot for the description.

What exactly do we mean by the position of a node?
Until now, we represented the dbg using a dictionary of nodes and a dictionary of links.
I understand that the position information is important to quickly check if the path is a consecutive list of nodes each have an outgoing edge to the successor in the list. Maybe only using the nodeid's are sufficient, we can use some kind of a distance measure between each consecutive node in the collection of nodes to check the sensibility of the path.

TransGirlCodes · 2019-07-03T10:52:59Z

Sorry, when I say position, I don't mean a node's position, I mean the position of the kmer, inside of the node. So the index would look like this: `(canonicalkmer => (ID of the node the kmer is in, the position of the kmer in the node's sequence). So If a canonical kmer mapped to (5, 10) It would mean it was in node 5, and started at the 10th base in node 5.

ardakdemir · 2019-07-03T11:24:47Z

Thanks for the clarification!
I will try to implement the mapping phase in this way.

TransGirlCodes · 2019-07-03T12:56:49Z

We should do any future work relating to contigging on a gsoc/contigging branch, to make it a bit easier to review.

ardakdemir · 2019-07-04T03:13:44Z

Ok I will make the updates on a separate gsoc/contigging branch.

ardakdemir · 2019-07-05T07:55:38Z

@benjward I am working on mapping reads using the method you mentioned.

Assuming there are errors in the reads, as you mentioned previously, finding the positions of each unique kmer inside a read on the dbg may not generate a clean path. In this case ignoring some mapped kmers (assuming that they correspond to the error-prone region of the read) may solve the problem. I am thinking something of a majority voting to select between candidate paths for a read.

ardakdemir added this to the GSoC 2019 - Arda Akdemir - DeBruijn Graphs milestone Jun 16, 2019

TransGirlCodes closed this as completed Jan 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generating Contigs using UG and Reads #9

Generating Contigs using UG and Reads #9

ardakdemir commented Jun 16, 2019

TransGirlCodes commented Jul 3, 2019

ardakdemir commented Jul 3, 2019

TransGirlCodes commented Jul 3, 2019

ardakdemir commented Jul 3, 2019

TransGirlCodes commented Jul 3, 2019 •

edited

Loading

ardakdemir commented Jul 4, 2019

ardakdemir commented Jul 5, 2019

Generating Contigs using UG and Reads #9

Generating Contigs using UG and Reads #9

Comments

ardakdemir commented Jun 16, 2019

TransGirlCodes commented Jul 3, 2019

ardakdemir commented Jul 3, 2019

TransGirlCodes commented Jul 3, 2019

ardakdemir commented Jul 3, 2019

TransGirlCodes commented Jul 3, 2019 • edited Loading

ardakdemir commented Jul 4, 2019

ardakdemir commented Jul 5, 2019

TransGirlCodes commented Jul 3, 2019 •

edited

Loading