-
-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generating Contigs using UG and Reads #9
Comments
For contigging or solving the graph beyond untagging we need to be able to map reads to the graph. For Paired-End Illumina reads (or any sequencing tech where per bp quality is very high), a simple mapper that is good enough in many situations is to create an index of all the unique kmers and where they are in the graph. A unique kmer being a kmer that appears once, in one place in the graph, and never again anywhere else in the graph. The index is simply a Dict{Kmer (canonical), (nodeid, position)}. Mapping a read to the graph then is a somewhat straightforward process: Go through the read, one kmer at a time, and check if it is in the unique kmer index of the graph. If it is in the index, then we note the node and position, by going through each kmer in the read and collecting a series of node+positions, we can then check the graph to see if the sequence of nodes we got is a sensible path through the graph (i.e. it exists). If it is, then the read has been successfully mapped to a path in the graph. |
Thanks a lot for the description. What exactly do we mean by the position of a node? |
Sorry, when I say position, I don't mean a node's position, I mean the position of the kmer, inside of the node. So the index would look like this: `(canonicalkmer => (ID of the node the kmer is in, the position of the kmer in the node's sequence). So If a canonical kmer mapped to (5, 10) It would mean it was in node 5, and started at the 10th base in node 5. |
Thanks for the clarification! |
We should do any future work relating to contigging on a gsoc/contigging branch, to make it a bit easier to review. |
Ok I will make the updates on a separate gsoc/contigging branch. |
@benjward I am working on mapping reads using the method you mentioned. Assuming there are errors in the reads, as you mentioned previously, finding the positions of each unique kmer inside a read on the dbg may not generate a clean path. In this case ignoring some mapped kmers (assuming that they correspond to the error-prone region of the read) may solve the problem. I am thinking something of a majority voting to select between candidate paths for a read. |
Generating Contigs using UG and Reads
We would like to generate contigs using the unitigs on the dbg and the reads given as the initial input for the dbg construction. This step of the project is more challenging compared to the previous two steps : DBG construction and DBG-to-UG.
We should decide on the core functionalities required during the contig generation and start implementing them. Right now, the dbg constructor takes as input a set of kmers in their canonical form.
First, we need to generalize the dbg constructor by having a preprocessing step for the raw reads to be represented as a set of canonical kmers. The preprocessing step should enumerate all unique canonical kmers, from a set of reads.
The text was updated successfully, but these errors were encountered: