Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling additional equalities for generic sequences #152

Closed
wants to merge 0 commits into from

Conversation

mobinasri
Copy link

This PR contains the implementation of a disjoint-set data structure and using it to activate additional equalities for generic sequences.
The main change is based on the transitive relation, which means that if A is equivalent to B (A ~ B) and B ~ C then A ~ C. If we have such an assumption we can use a disjoint-set structure to map all the elements in the same equivalency set to just one element.
Here is the list of main changes:

  1. Implementation of disjoint-set in a separate header file and including it in the edlib source file.
  2. The elements in each set are replaced by their representative element in both query and target sequences. For example if we had A ~ B and B ~ C so all A,B and C would be in the same set and one of them is being used (Actually the root of disjoint set) as the representative. If we choose A as the representative, then we replace B and C by A in our sequences.This process is done in the function, transformSequences().
  3. The class of EqualityDefintion is removed and wherever we were calling areEqual it is now replaced by equality operator ==. It is going to be faster than before.

Copy link
Owner

@Martinsos Martinsos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @masri2019, but after almost completing the review of your PR, I just realized something that you will not be happy with I am afraid -> I was looking that test that you adjusted, because due to transitivity it basically had just one big equality test, and realized that that we actually can't assume transitivity :(. I am so sorry about this, I can't believe I haven't realized this before, I completely forgot about that use case. The use case is the one from the test, when we want "N" to match any other character -> A, C, T or G, but we of course don't want A to match G and so on, which we can't achieve if we assume transitivity! So to generalize, if we assume transitivity, we can't use additional equalities to describe the "joker" character that matches multiple other characters.
Please take my apologies again, I am really sorry I didn't figure this out before.
I am afraid we will have to drop this PR and go with non-transitive approach. I don't think we have much choice there, we most likely need to construct unordered_map, or unordered_set, and use it inside of "isEqual()". That will not be so fast as the solution we had for transitivity, but I don't think we can go around that.

So if we have (A, N), (C, N), (G, H), then we would have unordered set of [(A, N), (N, A), (C, N), (N, C), (G, H), (H, G)] -> it probably makes sense to have these duplicates that don't care about order of elements.
Or we could go with unordered_map of unordered sets: { A -> [N], C -> [N], G -> [H], N -> [A, C] }.
I think unordered set is a better match, in theory, than unordered_map of unordered sets, but I am not sure how complicated it is to ensure that EqualityPair can be the key for unordered_set, so if that is a problem, it might be easier to go with map of sets.
Let me know if you have some better ideas how to approach this!

@Martinsos
Copy link
Owner

Oh oh, one idea -> what if we let them define the equality function themselves?
Then, they could define it as they want, and in certain use cases it could be much faster then the solution with unordered sets/maps.

For example, let's say they have big alphabet, and only one equality: A == B.
They could provide us with function { if (x == 'A' && y == 'B' || x == 'B' || y == 'A') then true else false }.
While this still means we have to run this function for every equality checks while building Peq, at least it does not necessarily require calculating hashes of each element and looking them up in the map/set, which I believe is more expensive, regardless of time being constant in average.
This also allows them to specify some fancy equalities hm. It also allows them to mess things up if the relation the provide is not a proper equality hm.

I am not sure how we would let them provide such function. Would it be implementation of interface that we specified? They could overload operator == on Element, but I am afraid that might be problematic in some other places if we don't expect it (or maybe not?).
Hm, also, how would be implement this in Python? They could provide us with lambda, but can we transfer that to C++? I don't think so.

@mobinasri
Copy link
Author

@Martinsos
My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

@mobinasri
Copy link
Author

mobinasri commented Jun 16, 2020

And one more thing is that I had to use boost/functional/hash.hpp for hashing std::pair and inserting pairs into a std::unordered_set. That's because there is no standard way of computing a hash on a pair.

The declaration of the set is something like this:
unordered_set< pair<int, int>, boost::hash< pair<int, int> > > equalitySet;

We have to replace int by Element when we use it for edlib.
I got this idea from here:
https://stackoverflow.com/questions/15160889/how-can-i-make-an-unordered-set-of-pairs-of-integers-in-c

@Martinsos
Copy link
Owner

@Martinsos
My apologies for the late response. Last week was the final week of the Spring quarter and there were a lot of assignments to do.

I thought about the two approaches you said (I mean using a set of pairs or using a hash table of sets). I wrote a script to implement both and compare their performance. My test scenario was that I made an array of EqualityPair similar to what edlib has. Then I generated a lot of random numbers in pairs and check if they are equal using both approaches. The execution time when I used a set of pairs was at least 4 times faster than using a hash table of sets. By the way, I could implement sets of pairs in fewer lines of code. So, I agree with you on using sets of pairs.

About allowing the user to define the equality function, I don't have any idea on how to use lambda in python and define a function based on that in C++. If you know any technique please tell me so maybe we can go that way. I guess this step is critical because this function is going to be called many times in edlib and if the user cannot provide a well-defined function it will harm the performance. I think using std::unordered_set is fine.

Great job on doing the tests @masri2019!
I am slightly surprised that set of pairs is faster hm, I would have thought that there would be less hash collisions if we use map of sets, meaning it is all faster. Maybe it is slower because of memory allocation for sets hm?
Anyway, if tests show that set of pairs is faster, great, let's go with that!

Regarding the function in Python - I also don't know an easy way, so I agree, let's drop that for now, that can be an optimization for the future at some moment if we decide it is worth bothering with it.

Regarding hashing of pair -> yes, I thought that might be a problem hm! Using boost normally makes sense, but, in Edlib we have no external dependencies, and adding Boost for something small like this feels like an overkill. I found this link https://stackoverflow.com/a/9729747/1509394 where they describe how to easily implement the same hashing as boost has -> so how about we do it on our own? I checked Boost source and implementation in Boost is still the same as in this SO answer.
Although, they also mention that this hashing is not very good, but also that there is no better method out there, so I guess that is ok then.

Btw, as a final thought -> we could have specialization of AdditionalEqualities for char, where we build the table, like we did before. Or also some other kinds of optimizations. But I would also leave that for the future, I just wanted to mention it.

TLDR; -> Great, let's do set of pairs, however let's not use Boost, instead let's copy paste their hashing logic (it should be function hash_combine and then defining the hash struct for pair -> this struct uses hash_combine and then we give it to the unordered_set as second template argument I believe).

@mobinasri
Copy link
Author

@Martinsos
Great! I used the hash function you sent in my test code. Let me share my code and the result.
Here is the test code:
https://github.com/masri2019/temporaryCodes/blob/master/edlib/compareSetAndMap/compareSetAndMap.cpp
Here is the result which shows that using a set of pairs is nearly 4 times faster: (for 20% expected hit rate)
#std::unordered_map -> Finished in 0.051417 seconds
#std::unordered_map -> Hit Count / all: 19853 / 100000
#std::unordered_set -> Finished in 0.012612 seconds
#std::unordered_set -> Hit Count / all: 19919 / 100000
I'm going to do the similar implementation for edlib. Please tell me if you have any fundamental comments especially on how I implemented the set of pairs. If you think it is OK you can close this PR and I'll come back with a new one.

@Martinsos
Copy link
Owner

Martinsos commented Jun 27, 2020 via email

@Martinsos
Copy link
Owner

Martinsos commented Jun 27, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants