Create basic SequenceDistanceGraph and DBG building based on Arda's code #10

Merged: 14 commits merged into gsoc/dbg from benjward/sdg on Jul 9, 2019

TransGirlCodes
Member

@ardakdemir This is the SequenceDistanceGraph, based on your DeBruijn Graph code with some tweaks. Have a look at the methods it has and the dbg constructor. Let me know if you're happy with my edits to that algorithm.

@ardakdemir
Member

Everything looks fine to me!
I think some issues may arise during edge addition with your dbg constructor if the reverse complement of a prefix/suffix of a kmer is equal to itself, e.g. ATA. In that case, adding it only to bw_nodes or fw_nodes will skip some of the edges. I tried to explain this issue in my blog as well: https://ardakdemir.github.io/pages/gsoc.html

Let me know if I am not clear or you think there is a mistake in my formulation.
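
One way to read the case described above, written as a small Python sketch and not as the project's Julia code: for the kmer ATAT, the prefix ATA and the suffix TAT are reverse complements of each other, so both ends of the kmer map to the same canonical (k-1)-mer. The helper names (`reverse_complement`, `canonical`, `ends_collapse`) are invented for illustration and do not come from the PR.

```python
# Hypothetical helpers for illustration; not the PR's Julia implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """The lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def ends_collapse(kmer):
    """True when the (k-1)-mer prefix and suffix share one canonical form."""
    prefix, suffix = kmer[:-1], kmer[1:]
    return canonical(prefix) == canonical(suffix)


print(reverse_complement("ATA"))  # TAT
print(ends_collapse("ATAT"))      # True: ATA and TAT are the same canonical node
print(ends_collapse("ACGT"))      # False: ACG and CGT share a canonical form? see below
```

Note that `ends_collapse("ACGT")` is in fact also `True`, because CGT is the reverse complement of ACG; any kmer that is its own reverse complement behaves this way. A builder that files such a (k-1)-mer under only a forward list or only a backward list would presumably see just one of the two overlaps, which seems to be the skipped-edge situation raised in the comment.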

@TransGirlCodes
Member Author

TransGirlCodes commented Jul 6, 2019

Hey @ardakdemir, I've added an improvement to the DBG builder. This new function, called make_graph_from_kmerlist, has several improvements:
It can handle larger genomes than the previous DBG code we've been working on, because it uses a sorted list of canonical kmers rather than a set, which uses more memory.
It also builds unitigs straight away and adds them to the graph.
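
As a rough illustration of the sorted-list idea, here is a minimal Python sketch. It is not the `make_graph_from_kmerlist` code from this PR, and the names `sorted_canonical_kmers` and `contains` are hypothetical. It collects the canonical form of every kmer, sorts and deduplicates the list, and answers membership queries with a binary search instead of a hash set. The memory saving itself depends on how kmers are stored, so the sketch only shows the lookup logic.

```python
from bisect import bisect_left

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(seq):
    """The lexicographically smaller of a sequence and its reverse complement."""
    return min(seq, seq.translate(COMPLEMENT)[::-1])


def sorted_canonical_kmers(sequences, k):
    """Collect canonical kmers from all sequences, then sort and deduplicate."""
    kmers = []
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.append(canonical(seq[i:i + k]))
    kmers.sort()
    unique = []
    for kmer in kmers:
        # Duplicates are adjacent after sorting, so compare with the last kept kmer.
        if not unique or unique[-1] != kmer:
            unique.append(kmer)
    return unique


def contains(sorted_kmers, kmer):
    """Binary search for the canonical form of a kmer in the sorted list."""
    target = canonical(kmer)
    i = bisect_left(sorted_kmers, target)
    return i < len(sorted_kmers) and sorted_kmers[i] == target


kmers = sorted_canonical_kmers(["ACGTT"], 3)
print(kmers)                   # ['AAC', 'ACG']
print(contains(kmers, "GTT"))  # True, because GTT is stored as AAC
```

Each lookup costs O(log n) comparisons, in exchange for avoiding the hash table overhead that a set carries.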

The ATAT ATA -> TAT problem you described affects this algorithm too. It is a small example of a more general problem called the circle problem. I'm glad you found it, as I had forgotten about it. In the new function, this edge case takes the form of kmers in circles being excluded from the graph. To fix this, we will need to take all the kmers that have not been used once unitig creation is done and add those "circles" to the graph as units by choosing a random breakpoint.

But for now, until we squash this bug, this edge case should be very rare if the K used to build the kmers/graph is large enough, and we could add a test or warning message in case it does pop up!
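
The proposed fix can be sketched in Python under some loud simplifying assumptions: the graph here is forward-strand only (reverse complements are ignored), `build_unitigs` is a hypothetical name, and the breakpoint is the lexicographically smallest unused kmer rather than a random one, so that the output is reproducible. It is not the PR's algorithm, only an illustration of why circular kmers are missed by a walk that begins at unitig ends and how a second pass can recover them.

```python
def build_unitigs(kmers):
    """Return (unitigs, circles) for a forward-strand-only de Bruijn graph."""
    kmers = set(kmers)
    used = set()

    def successors(kmer):
        return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

    def predecessors(kmer):
        return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]

    def is_start(kmer):
        # A unitig starts where the incoming path is missing or branches.
        preds = predecessors(kmer)
        return len(preds) != 1 or len(successors(preds[0])) != 1

    def walk(start):
        # Extend while the path is unambiguous and does not revisit a kmer.
        path = [start]
        used.add(start)
        while True:
            succs = successors(path[-1])
            if len(succs) != 1:
                break
            nxt = succs[0]
            if nxt in used or len(predecessors(nxt)) != 1:
                break
            path.append(nxt)
            used.add(nxt)
        return path[0] + "".join(k[-1] for k in path[1:])

    unitigs = []
    for kmer in sorted(kmers):
        if is_start(kmer):
            unitigs.append(walk(kmer))

    # Kmers on an unbranched circle have no start, so the first pass skips them.
    circles = []
    for kmer in sorted(kmers):
        if kmer not in used:
            circles.append(walk(kmer))  # break the circle at this kmer
    return unitigs, circles


linear = ["GGC", "GCT", "CTT"]
circular = ["ACG", "CGT", "GTA", "TAC"]
print(build_unitigs(linear + circular))  # (['GGCTT'], ['ACGTAC'])
```

The second loop is also a natural place for the test or warning mentioned above: a non-empty `circles` list means the edge case has occurred.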

@ardakdemir
Member

@benjward thanks for the explanation.
I agree that the sorted list is a better implementation for representing the input kmers.
I think I am happy with the content :)

@TransGirlCodes TransGirlCodes force-pushed the benjward/sdg branch 3 times, most recently from 306784e to 8002fff Compare July 9, 2019 20:43
Genome Assembly with Julia automation moved this from In progress to Done Jul 9, 2019
@TransGirlCodes TransGirlCodes reopened this Jul 9, 2019
Genome Assembly with Julia automation moved this from Done to In progress Jul 9, 2019
@TransGirlCodes
Member Author

Oops, hit the close button by accident.

@TransGirlCodes TransGirlCodes merged commit 3277fec into gsoc/dbg Jul 9, 2019
Genome Assembly with Julia automation moved this from In progress to Done Jul 9, 2019
@TransGirlCodes TransGirlCodes deleted the benjward/sdg branch July 9, 2019 21:30