Enabling additional equalities for generic sequences #152

mobinasri · 2020-05-25T19:14:44Z

This PR contains the implementation of a disjoint-set data structure and using it to activate additional equalities for generic sequences.
The main change is based on the transitive relation, which means that if A is equivalent to B (A ~ B) and B ~ C then A ~ C. If we have such an assumption we can use a disjoint-set structure to map all the elements in the same equivalency set to just one element.
Here is the list of main changes:

Implementation of disjoint-set in a separate header file and including it in the edlib source file.
The elements in each set are replaced by their representative element in both query and target sequences. For example if we had A ~ B and B ~ C so all A,B and C would be in the same set and one of them is being used (Actually the root of disjoint set) as the representative. If we choose A as the representative, then we replace B and C by A in our sequences.This process is done in the function, transformSequences().
The class of EqualityDefintion is removed and wherever we were calling areEqual it is now replaced by equality operator ==. It is going to be faster than before.

Martinsos

Great work @masri2019, but after almost completing the review of your PR, I just realized something that you will not be happy with I am afraid -> I was looking that test that you adjusted, because due to transitivity it basically had just one big equality test, and realized that that we actually can't assume transitivity :(. I am so sorry about this, I can't believe I haven't realized this before, I completely forgot about that use case. The use case is the one from the test, when we want "N" to match any other character -> A, C, T or G, but we of course don't want A to match G and so on, which we can't achieve if we assume transitivity! So to generalize, if we assume transitivity, we can't use additional equalities to describe the "joker" character that matches multiple other characters.
Please take my apologies again, I am really sorry I didn't figure this out before.
I am afraid we will have to drop this PR and go with non-transitive approach. I don't think we have much choice there, we most likely need to construct unordered_map, or unordered_set, and use it inside of "isEqual()". That will not be so fast as the solution we had for transitivity, but I don't think we can go around that.

So if we have (A, N), (C, N), (G, H), then we would have unordered set of [(A, N), (N, A), (C, N), (N, C), (G, H), (H, G)] -> it probably makes sense to have these duplicates that don't care about order of elements.
Or we could go with unordered_map of unordered sets: { A -> [N], C -> [N], G -> [H], N -> [A, C] }.
I think unordered set is a better match, in theory, than unordered_map of unordered sets, but I am not sure how complicated it is to ensure that EqualityPair can be the key for unordered_set, so if that is a problem, it might be easier to go with map of sets.
Let me know if you have some better ideas how to approach this!

Martinsos · 2020-06-01T22:27:18Z

Oh oh, one idea -> what if we let them define the equality function themselves?
Then, they could define it as they want, and in certain use cases it could be much faster then the solution with unordered sets/maps.

For example, let's say they have big alphabet, and only one equality: A == B.
They could provide us with function { if (x == 'A' && y == 'B' || x == 'B' || y == 'A') then true else false }.
While this still means we have to run this function for every equality checks while building Peq, at least it does not necessarily require calculating hashes of each element and looking them up in the map/set, which I believe is more expensive, regardless of time being constant in average.
This also allows them to specify some fancy equalities hm. It also allows them to mess things up if the relation the provide is not a proper equality hm.

I am not sure how we would let them provide such function. Would it be implementation of interface that we specified? They could overload operator == on Element, but I am afraid that might be problematic in some other places if we don't expect it (or maybe not?).
Hm, also, how would be implement this in Python? They could provide us with lambda, but can we transfer that to C++? I don't think so.

mobinasri · 2020-06-16T07:19:30Z

@Martinsos
My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

mobinasri · 2020-06-16T07:32:42Z

And one more thing is that I had to use boost/functional/hash.hpp for hashing std::pair and inserting pairs into a std::unordered_set. That's because there is no standard way of computing a hash on a pair.

The declaration of the set is something like this:
unordered_set< pair<int, int>, boost::hash< pair<int, int> > > equalitySet;

We have to replace int by Element when we use it for edlib.
I got this idea from here:
https://stackoverflow.com/questions/15160889/how-can-i-make-an-unordered-set-of-pairs-of-integers-in-c

Martinsos · 2020-06-24T19:47:42Z

@Martinsos
My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

Great job on doing the tests @masri2019!
I am slightly surprised that set of pairs is faster hm, I would have thought that there would be less hash collisions if we use map of sets, meaning it is all faster. Maybe it is slower because of memory allocation for sets hm?
Anyway, if tests show that set of pairs is faster, great, let's go with that!

Regarding the function in Python - I also don't know an easy way, so I agree, let's drop that for now, that can be an optimization for the future at some moment if we decide it is worth bothering with it.

Regarding hashing of pair -> yes, I thought that might be a problem hm! Using boost normally makes sense, but, in Edlib we have no external dependencies, and adding Boost for something small like this feels like an overkill. I found this link https://stackoverflow.com/a/9729747/1509394 where they describe how to easily implement the same hashing as boost has -> so how about we do it on our own? I checked Boost source and implementation in Boost is still the same as in this SO answer.
Although, they also mention that this hashing is not very good, but also that there is no better method out there, so I guess that is ok then.

Btw, as a final thought -> we could have specialization of AdditionalEqualities for char, where we build the table, like we did before. Or also some other kinds of optimizations. But I would also leave that for the future, I just wanted to mention it.

TLDR; -> Great, let's do set of pairs, however let's not use Boost, instead let's copy paste their hashing logic (it should be function hash_combine and then defining the hash struct for pair -> this struct uses hash_combine and then we give it to the unordered_set as second template argument I believe).

mobinasri · 2020-06-27T08:39:32Z

@Martinsos
Great! I used the hash function you sent in my test code. Let me share my code and the result.
Here is the test code:
https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp
Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate)
#std::unordered_map -> Finished in 0.051417 seconds
#std::unordered_map -> Hit Count / all: 19853 / 100000
#std::unordered_set -> Finished in 0.012612 seconds
#std::unordered_set -> Hit Count / all: 19919 / 100000
I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one.

Martinsos · 2020-06-27T11:16:51Z

I took a look and looks good! Ok, let's go for it! I think this is good first effort, and if needed we can try optimizing this further in the future. Btw. one thing I was thinking about for some time is setting up some really nice performance tests, that we could use in the future to optimize parts of Edlib as needed, without having to manually check the speed -> it would be great to have a suite of perpformance tests that target very specific things and that measure how much is specific part of edlib taking, and then making optimization would just be trying couple of approaches and checking out what are the numbers. Ok, certainly not something we should be doing now, but just wanted to share the idea, it would be really cool to have that one day. I did write some perf tests in the past, but that was to compare with other tools, so not really the same thing. I will open an issue for it.

…

On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri ***@***.***> wrote: @Martinsos <https://github.com/Martinsos> Great! I used the hash function you sent in my test code. Let me share my code and the result. Here is the test code: https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate) #std::unordered_map -> Finished in 0.051417 seconds #std::unordered_map -> Hit Count / all: 19853 / 100000 #std::unordered_set -> Finished in 0.012612 seconds #std::unordered_set -> Hit Count / all: 19919 / 100000 I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#152 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q> .

Martinsos · 2020-06-27T11:21:06Z

Ok, created an issue here: #153 .

…

On Sat, Jun 27, 2020 at 1:16 PM Martin Šošić ***@***.***> wrote: I took a look and looks good! Ok, let's go for it! I think this is good first effort, and if needed we can try optimizing this further in the future. Btw. one thing I was thinking about for some time is setting up some really nice performance tests, that we could use in the future to optimize parts of Edlib as needed, without having to manually check the speed -> it would be great to have a suite of perpformance tests that target very specific things and that measure how much is specific part of edlib taking, and then making optimization would just be trying couple of approaches and checking out what are the numbers. Ok, certainly not something we should be doing now, but just wanted to share the idea, it would be really cool to have that one day. I did write some perf tests in the past, but that was to compare with other tools, so not really the same thing. I will open an issue for it. On Sat, Jun 27, 2020 at 10:39 AM Mobin Asri ***@***.***> wrote: > @Martinsos <https://github.com/Martinsos> > Great! I used the hash function you sent in my test code. Let me share my > code and the result. > Here is the test code: > > https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp > Here is the result which shows that using a set of pairs is nearly 4 > times faster: (for 20% expected hit rate) > #std::unordered_map -> Finished in 0.051417 seconds > #std::unordered_map -> Hit Count / all: 19853 / 100000 > #std::unordered_set -> Finished in 0.012612 seconds > #std::unordered_set -> Hit Count / all: 19919 / 100000 > I'm going to do the similar implementation for edlib. Please tell me if > you have any fundamental comments especially on how I implemented the set > of pairs. If you think it is OK you can close this PR and I'll come back > with a new one. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#152 (comment)>, or > unsubscribe > <https://github.com/notifications/unsubscribe-auth/AALXFB7IGJQVBHPRATYRQ5LRYWV47ANCNFSM4NJXI77Q> > . >

Martinsos requested changes Jun 1, 2020

View reviewed changes

mobinasri closed this Jul 3, 2020

mobinasri force-pushed the gen-seqs branch from 84513ba to d0c6201 Compare July 3, 2020 18:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling additional equalities for generic sequences #152

Enabling additional equalities for generic sequences #152

mobinasri commented May 25, 2020

Martinsos left a comment •

edited

Loading

Martinsos commented Jun 1, 2020

mobinasri commented Jun 16, 2020

mobinasri commented Jun 16, 2020 •

edited

Loading

Martinsos commented Jun 24, 2020

mobinasri commented Jun 27, 2020

Martinsos commented Jun 27, 2020 via email

Martinsos commented Jun 27, 2020 via email

Enabling additional equalities for generic sequences #152

Enabling additional equalities for generic sequences #152

Conversation

mobinasri commented May 25, 2020

Martinsos left a comment • edited Loading

Choose a reason for hiding this comment

Martinsos commented Jun 1, 2020

mobinasri commented Jun 16, 2020

mobinasri commented Jun 16, 2020 • edited Loading

Martinsos commented Jun 24, 2020

mobinasri commented Jun 27, 2020

Martinsos commented Jun 27, 2020 via email

Martinsos commented Jun 27, 2020 via email

Martinsos left a comment •

edited

Loading

mobinasri commented Jun 16, 2020 •

edited

Loading