Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generating Contigs using UG and Reads #9

Closed
ardakdemir opened this issue Jun 16, 2019 · 7 comments
Closed

Generating Contigs using UG and Reads #9

ardakdemir opened this issue Jun 16, 2019 · 7 comments

Comments

@ardakdemir
Copy link
Member

Generating Contigs using UG and Reads

We would like to generate contigs using the unitigs on the dbg and the reads given as the initial input for the dbg construction. This step of the project is more challenging compared to the previous two steps : DBG construction and DBG-to-UG.

We should decide on the core functionalities required during the contig generation and start implementing them. Right now, the dbg constructor takes as input a set of kmers in their canonical form.

First, we need to generalize the dbg constructor by having a preprocessing step for the raw reads to be represented as a set of canonical kmers. The preprocessing step should enumerate all unique canonical kmers, from a set of reads.

@TransGirlCodes
Copy link
Member

For contigging or solving the graph beyond untagging we need to be able to map reads to the graph.

For Paired-End Illumina reads (or any sequencing tech where per bp quality is very high), a simple mapper that is good enough in many situations is to create an index of all the unique kmers and where they are in the graph.

A unique kmer being a kmer that appears once, in one place in the graph, and never again anywhere else in the graph. The index is simply a Dict{Kmer (canonical), (nodeid, position)}.

Mapping a read to the graph then is a somewhat straightforward process: Go through the read, one kmer at a time, and check if it is in the unique kmer index of the graph. If it is in the index, then we note the node and position, by going through each kmer in the read and collecting a series of node+positions, we can then check the graph to see if the sequence of nodes we got is a sensible path through the graph (i.e. it exists). If it is, then the read has been successfully mapped to a path in the graph.

@ardakdemir
Copy link
Member Author

Thanks a lot for the description.

What exactly do we mean by the position of a node?
Until now, we represented the dbg using a dictionary of nodes and a dictionary of links.
I understand that the position information is important to quickly check if the path is a consecutive list of nodes each have an outgoing edge to the successor in the list. Maybe only using the nodeid's are sufficient, we can use some kind of a distance measure between each consecutive node in the collection of nodes to check the sensibility of the path.

@TransGirlCodes
Copy link
Member

Sorry, when I say position, I don't mean a node's position, I mean the position of the kmer, inside of the node. So the index would look like this: `(canonicalkmer => (ID of the node the kmer is in, the position of the kmer in the node's sequence). So If a canonical kmer mapped to (5, 10) It would mean it was in node 5, and started at the 10th base in node 5.

@ardakdemir
Copy link
Member Author

Thanks for the clarification!
I will try to implement the mapping phase in this way.

@TransGirlCodes
Copy link
Member

TransGirlCodes commented Jul 3, 2019

We should do any future work relating to contigging on a gsoc/contigging branch, to make it a bit easier to review.

@ardakdemir
Copy link
Member Author

Ok I will make the updates on a separate gsoc/contigging branch.

@ardakdemir
Copy link
Member Author

@benjward I am working on mapping reads using the method you mentioned.

Assuming there are errors in the reads, as you mentioned previously, finding the positions of each unique kmer inside a read on the dbg may not generate a clean path. In this case ignoring some mapped kmers (assuming that they correspond to the error-prone region of the read) may solve the problem. I am thinking something of a majority voting to select between candidate paths for a read.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants