Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling additional equalities for generic sequences (using sets) #154

Merged
merged 1 commit into from
Sep 26, 2020

Conversation

mobinasri
Copy link

@mobinasri mobinasri commented Jul 4, 2020

This PR contains the implementation of an unordered_set of pairs and using it to activate additional equalities for generic sequences.

Here is the list of main changes:

  1. Added pairhash.hpp which contains a hash function for std::pair
  2. If we are given an array of additional equality pairs, an unordered_set of these pairs is created. For example if we are given these pairs; (a,b) , (b,c), (a,d) -> we make this set { (a,b), (b,a), (b,c), (c,b), (a,d), (d,a) } which contains both permutations for each pair.
  3. This unordered_set is used for checking the equality of any two elements. To speed up the process, searching the set is avoided as much as possible. We check the below conditions in advance, if any of them is true we ignore searching the set;
    1. If two given elements are the same
    1. If at least one of the given elements is not a member of any additional equality pair (using a prepossessed boolean array)

It can speed up the process because we expect that a large portion of comparisons fall into at least one of the above scenarios.

Copy link
Owner

@Martinsos Martinsos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job @masri2019 ! I think we are there, we just need a little bit of reorganization of the code, the main comment on it is here #154 (comment) and there are also other supporting comments.

edlib/include/pairHash.hpp Show resolved Hide resolved
edlib/include/pairHash.hpp Show resolved Hide resolved
edlib/src/edlib.tpp Outdated Show resolved Hide resolved
edlib/src/edlib.tpp Outdated Show resolved Hide resolved
edlib/src/edlib.tpp Outdated Show resolved Hide resolved
edlib/src/edlib.tpp Outdated Show resolved Hide resolved
edlib/src/edlib.tpp Outdated Show resolved Hide resolved
makeEqualitySet<Element, AlphabetIdx>(elementToAlphabetIdx,
config.additionalEqualities,
config.additionalEqualitiesLength);
EqualityDefinition<AlphabetIdx> equalityDefinition(equalitySet, hasAdditionalEquality);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at all this, I think the approach / algorithm is good, however it feels (as mentioned in some comments above) that we could encapsulate some things better, so let's see how to do that.

We have config.additionalEqualities array, we have its size, and we have elementToAlphabetIdx map. If I am correct, that is all that we need to construct the EqualityDefinition object, that will then allow us to call .areEqual(a, b) to check if two alphabet indexes are equal, right?
If that is so, then in edlibAlign function, we would ideally have only the following logic: passing inputs (config.additionalEqualities + their size, elementToAlphabetIdx, maybe smth else I missed) to a function, and receiving as a result EqualityDefinition object! Nothing more, no internals exposed. Then, we use that object where needed.

Now, the function that generates EqualityDefinition -> it could either be constructor of EqualityDefinition, in which case most of the logic will go into constructor, or it could be standalone function and constructor of EqualityDefinition will be as simple as it is now. I personally prefer doing stuff in constructor, but I also heard opposite opinions so I am not 100% sure at the moment what is better approach, probably it is not very important.

Also, in this approach, important thing is that EqualityDefinition would worry about allocating and deallocating memory it needs, reducing chance of memory leaks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Martinsos Let me explain why I put the function that generates EqualityDefinition out of its constructor. As you mentioned the EqualityDefinition has just one method and that is areEqual. When we call .areEqual we have to pass two variables of type AlphabetIdx to be compared so knowing the custom type of AlphabetIdx is enough and there is no need to carry Element. areEqual has nothing to do with Element.
On the other hand, the pairs in additionalEqualities are of type Element and generating EqualityDefintion needs mapping those pairs from the type Element to AlphabetIdx. If we put this mapping procedure in the constructor we have to add a template argument to EqualityDefinition for Element, which is needed just for generating the object.
I decided to avoid adding this template argument. Maybe you have better idea? (or you think we can have it?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on what you said I put all the procedure for generating EqualityDefintion in a single function.
This is the new commit -> 4e6c4dd

Copy link
Owner

@Martinsos Martinsos Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, good point with the template argument, I missed that!
Makes sense then, I agree with the approach of external constructor. Although it maybe somewhat complicates the situation with allocating the memory hm, since I think now we might have some copying happening with vectors/sets, although I am not 100% sure. While if it was happening in constructor, it would be easier, because vector/set would get allocated as object is constructed and you would just fill them and that is it.

Hm here is a thought: if we first transformed additionalEqualities from EdlibEqualityPair<Element>* into vector<EdlibEqualityPair<AlphabetIdx>>, which is very easy to do, we could just pass that one vector to constructor of EqualityDefinition and that is it! It would take that vector by reference, so no copying, and then it would just fill the set and vector it already has as private members.

Hm this sounds great to me, what do you think? So to summarize:

  1. Very simple function that takes EdlibEqualityPair<Element>* and its length and elementToAlphabetIdx and returns vector<EdlibEqualityPair<AlphabetIdx>>. So basically just transforms array of elements to vector of alphabet indexes. Or maybe we can even inline this code, whatever you like better.
  2. Constructor in EqualityDefinition that takes only vector<EdlibEqualityPair<AlphabetIdx>> by reference and constructs equalitySet and hasAdditionalEqualities. It doesn't even have to call that template argument AlphabetIdx, it can call it anything at this point, just T, since it is really in no way connected to AlphabetIdx per se any more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All agreed!

edlib/include/edlib.hpp Outdated Show resolved Hide resolved
edlib/include/edlib.hpp Outdated Show resolved Hide resolved
bool areEqual(AlphabetIdx a, AlphabetIdx b) const {
return a == b;
return (a == b) || (hasAdditionalEqualities[a] &&
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, and nice formatting!
One thing only: I prefer, and I believe I am doing it so in the rest of the codebase (at least I hope so), to put newline before the binary operator when I have to break the line into multiple lines. So let's do that? (Put && at start of the line, not the end.)

There is also good discussion here that I believe nicely demonstrates pros and cons of two approaches: https://softwareengineering.stackexchange.com/questions/160522/should-i-put-newlines-before-or-after-binary-operators .

SetOfPairs<AlphabetIdx> equalitySet;
std::vector<bool> hasAdditionalEqualities;
public:
EqualityDefinition(const SetOfPairs<AlphabetIdx>& equalitySet_, const std::vector<bool>& hasAdditionalEqualities_){
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I am wondering here is, if you take these sets/vector by reference, what happens exactly? I believe if it was not reference, they could get completely copied, together with all the contents, and new memory would be allocated on heap, that would live as long as this object lives. However, in this case with reference hm, I am not sure what happens. Maybe they also get copied in that fashion when you do assignment below? Yes, that is probably so I think.

This means they will survive as long as they should, but we do have one copying that maybe could have been avoided. I am not sure what is the impact though, probably not very big.

}
};

/**
* Takes additional equalities for alphabet elements, transforms them into index equalities
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Takes additional equalities for alphabet elements, transforms them into index equalities
* Builds equality definition from additional equalities.

I think this is enough, who wants to know more can check additional equalities, no need for us to explain here what they do.

Btw, we could even completely remove this description, it would still be clear from the name.
Although I might vote then to rename function to buildEqualityDefinition, because "build" indicates that something resource intensive might be happening, at least I perceive it so.

}
};

/**
* Takes additional equalities for alphabet elements, transforms them into index equalities
* @param [in] additionalEqualitiesLength
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add other params?

makeEqualitySet<Element, AlphabetIdx>(elementToAlphabetIdx,
config.additionalEqualities,
config.additionalEqualitiesLength);
EqualityDefinition<AlphabetIdx> equalityDefinition(equalitySet, hasAdditionalEquality);
Copy link
Owner

@Martinsos Martinsos Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, good point with the template argument, I missed that!
Makes sense then, I agree with the approach of external constructor. Although it maybe somewhat complicates the situation with allocating the memory hm, since I think now we might have some copying happening with vectors/sets, although I am not 100% sure. While if it was happening in constructor, it would be easier, because vector/set would get allocated as object is constructed and you would just fill them and that is it.

Hm here is a thought: if we first transformed additionalEqualities from EdlibEqualityPair<Element>* into vector<EdlibEqualityPair<AlphabetIdx>>, which is very easy to do, we could just pass that one vector to constructor of EqualityDefinition and that is it! It would take that vector by reference, so no copying, and then it would just fill the set and vector it already has as private members.

Hm this sounds great to me, what do you think? So to summarize:

  1. Very simple function that takes EdlibEqualityPair<Element>* and its length and elementToAlphabetIdx and returns vector<EdlibEqualityPair<AlphabetIdx>>. So basically just transforms array of elements to vector of alphabet indexes. Or maybe we can even inline this code, whatever you like better.
  2. Constructor in EqualityDefinition that takes only vector<EdlibEqualityPair<AlphabetIdx>> by reference and constructs equalitySet and hasAdditionalEqualities. It doesn't even have to call that template argument AlphabetIdx, it can call it anything at this point, just T, since it is really in no way connected to AlphabetIdx per se any more.

Copy link
Owner

@Martinsos Martinsos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masri2019 Ok we are really close, but I just realized something based on your explanation of why you used external constructor, please check #154 (comment), I think this is the way to go at the end. Other comments are minor.

* @return equalitySet as a set of all equality pairs where each pair contains the indexes of an additional equality
*/
template<class Element, class AlphabetIdx>
EqualityDefinition<AlphabetIdx> makeEqualityDefinition(const EdlibAlignConfig<Element> config,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am somewhat confused with the implementation here, you are taking both config and then additionalEqualities and their length! I would say you need either config or additionalEqualities and their length, not both at the same time, right?

ALso, below you have two for loops, why? Why not just have one for loop, merge them together?

Btw check this comments, I think this is the way to go: #154 (comment) .

Copy link
Owner

@Martinsos Martinsos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great I think this is it, just couple of small comments, one more important (one with assert).

elementToAlphabetIdx,
config.additionalEqualities,
config.additionalEqualitiesLength);
EqualityDefinition<AlphabetIdx>(additionalEqualities, elementToAlphabetIdx.size());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you can just do inline: EqualityDefinition<AlphabetIdx>(additionalEqualities, elementToAlphabetIdx.size());, no need for assignment, you are just doing redundant copying possibly.

for(int i = 0; i < additionalEqualitiesLength; i++){
Element a = additionalEqualities[i].first;
Element b = additionalEqualities[i].second;
if (elementToAlphabetIdx.count(a) && elementToAlphabetIdx.count(b)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If correct arguments are given, this check should not fail, right?

What I am worried about is if this check fails, we will never know -> it will be a silent fail!

If it fails, it means we made a mistake in our program, it is not users fault, in this case.
So the whole thing should crash. How about we put assert here, instead of this if-else?

Copy link
Author

@mobinasri mobinasri Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I did this was that when at least one of the elements in an equality is not present in the alphabet, we can remove that equality. I thought that maybe the purpose of test12 in runTests.cpp is to ignore such equalities. Could you check that test? Maybe I have to change that test too?

edlib/test/runTests.cpp

Lines 427 to 442 in ba784e5

bool test12() {
EdlibEqualityPair additionalEqualities[24] = {{'R','A'},{'R','G'},{'M','A'},{'M','C'},{'W','A'},{'W','T'},
{'S','C'},{'S','G'},{'Y','C'},{'Y','T'},{'K','G'},{'K','T'},
{'V','A'},{'V','C'},{'V','G'},{'H','A'},{'H','C'},{'H','T'},
{'D','A'},{'D','G'},{'D','T'},{'B','C'},{'B','G'},{'B','T'}};
const char* query = "GCATATCAATAAGCGGAGGA";
const char* target = "TAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTATCGAATAAACTTGATGGGTTGTCGCTGGCTTCTAGGAGCATGTGCACATCCGTCATTTTTATCCATCCACCTGTGCACCTTTTGTAGTCTTTGGAGGTAATAAGCGTGAATCTATCGAGGTCCTCTGGTCCTCGGAAAGAGGTGTTTGCCATATGGCTCGCCTTTGATACTCGCGAGTTACTCTAAGACTATGTCCTTTCATATACTACGAATGTAATAGAATGTATTCATTGGGCCTCAGTGCCTATAAAACATATACAACTTTCAGCAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATTAAATTCTCAACCCCTTCCGGTTTTTTGACTGGCTTTGGGGCTTGGATGTGGGGGATTCATTTGCGGGCCTCTGTAGAGGTCGGCTCCCCTGAAATGCATTAGTGGAACCGTTTGCGGTTACCGTCGCTGGTGTGATAACTATCTATGCCAAAGACAAACTGCTCTCTGATAGTTCTGCTTCTAACCGTCCATTTATTGGACAACATTATTATGAACACTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAAGCATATCAATAAGCGGAGGAAAAGAAACTAACAAGGATTCCCCTAGTAACTGCGAGTGAAGCGGGAAAAGCTCAAATTTAAAATCTGGCGGTCTTTGGCCGTCCGAGTTGTAATCTAGAGAAGCGACACCCGCGCTGGACCGTGTACAAGTCTCCTGGAATGGAGCGTCATAGAGGGTGAGAATCCCGTCTCTGACACGGACTACCAGGGCTTTGTGGTGCGCTCTCAAAGAGTCGAGTTGTTTGGGAATGCAGCTCTAAATGGGTGGTAAATTCCATCTAAAGCTAAATATTGGCGAGAGACCGATAGCGAACAAGTACCGTGAGGGAAAGATGAAAAGAACTTTGGAAAGAGAGTTAAACAGTACGTGAAATTGCTGAAAGGGAAACGCTTGAAGTCAGTCGCGTTGGCCGGGGATCAGCCTCGCTTTTGCGTGGTGTATTTCCTGGTTGACGGGTCAGCATCAATTTTGACCGCTGGAAAAGGACTTGGGGAATGTGGCATCTTCGGATGTGTTATAGCCCTTTGTCGCATACGGCGGTTGGGATTGAGGAACTCAGCACGCCGCAAGGCCGGGTTTCGACCACGTTCGTGCTTAGGATGCTGGCATAATGGCTTTAATCGACCCGTCTTGAAACACGGACCAAGGAGTCTAACATGCCTGCGAGTGTTTGGGTGGAAAACCCGAGCGCGTAATGAAAGTGAAAGTTGAGATCCCTGTCGTGGGGAGCATCGACGCCCGGACCAGAACTTTTGGGACGGATCTGCGGTAGAGCATGTATGTTGGGACCCGAAAGATGGTGAACTATGCCTGAATAGGGTGAAGCCAGAGGAAACTCTGGTGGAGGCTCGTAGCGATTCTGACGTGCAAATCGATCGTCAAATTTGGGTATAGGGGCGAAAGACTAATCGAACCATCTAGTAGCTGGTTCCTGCCGAAGTTTCCCTCAGGATAGCAGAAACTCATATCAGATTTATGTGGTAAAGCGAATGATTAGAGGCCTTGGGGTTGAAACAACCTTAACCTATTCTCAAACTTTAAATATGTAAGAACGAGCCGTTTCTTGATTGAACCGCTCGGCGATTGAGAGTTTCTAGTGGGCCATTTTTGGTAAGCAGAACTGGCGATGCGGGATGAACCGAACGCGAGGTTAAGGTGCCGGAATTCACGCTCATCAGACACCACAAAAGGTGTTAGTTCATCTAGACAGCAGGACGGTGGCCATGGAAGTCGGAATCCGCTAAGGAGTGTGTAACAACTCACCTGCCGAATGAACTAGCCCTGAAAATGGATGGCGCTTAAGCGTGATACCCATACCTCGCCGTCAGCGTTGAAGTGACGCGCTGACGAGTAGGCAGGCGTGGAGGTCAGTGAAGAAGCCTTGGCAGTGATGCTGGGTGAAACGGCCTCC";
EdlibAlignResult result = edlibAlign(query, static_cast<int>(std::strlen(query)),
target, static_cast<int>(std::strlen(target)),
edlibNewAlignConfig(-1, EDLIB_MODE_HW,EDLIB_TASK_LOC, additionalEqualities, 24));
bool pass = result.status == EDLIB_STATUS_OK && result.editDistance == 0;
printf(pass ? "\x1B[32m""OK""\x1B[0m\n" : "\x1B[31m""FAIL""\x1B[0m\n");
edlibFreeAlignResult(result);
return pass;
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ufff you are right! I am being silly, yes let's keep it this way!

@Martinsos
Copy link
Owner

I somehow missed the changes you made following my comments!

So, all is good I believe and we can merge, but I see that continuous integration on AppVeyor is failing, with error C:\projects\edlib\edlib\include\edlib.hpp(140): error C4519: default template arguments are only allowed on a class template [C:\projects\edlib\runTests.vcxproj]. I found this SO Q/A about it: https://stackoverflow.com/questions/2447458/default-template-arguments-for-function-templates . We should fix this before proceeding. It seems that having default templates on functions is problem, and while modern compilers support it, some older ones still don't, especially on windows. I wonder what to do about it? Should we not have defaults? Or should we say: you need to use this new compiler and so on? Because allegedly, it is just a legacy thing, which Bjourne said is just preventing good programming practices today.

@Martinsos
Copy link
Owner

I somehow missed the changes you made following my comments!

So, all is good I believe and we can merge, but I see that continuous integration on AppVeyor is failing, with error C:\projects\edlib\edlib\include\edlib.hpp(140): error C4519: default template arguments are only allowed on a class template [C:\projects\edlib\runTests.vcxproj]. I found this SO Q/A about it: https://stackoverflow.com/questions/2447458/default-template-arguments-for-function-templates . We should fix this before proceeding. It seems that having default templates on functions is problem, and while modern compilers support it, some older ones still don't, especially on windows. I wonder what to do about it? Should we not have defaults? Or should we say: you need to use this new compiler and so on? Because allegedly, it is just a legacy thing, which Bjourne said is just preventing good programming practices today.

Or maybe we just shouldn't care about VS 10 and drop it from AppVeyor check hm! I like that better, if people complain we can see what to do, but I don't like not using modern features just to support very old systems.

@Martinsos
Copy link
Owner

@masri2019 ok, I figured it out! I updated AppVeyor (our CI service for Windows) to use newer versions of Visual Studio, so hopefully that will help. I did that update on master, so you could cherry pick it here if you want or we will rebase later.

This PR is ready to merge, but could you first squash the commits? Let's squash them into optimaly one (or multiple if there are distinct sets of changes) commit, and then I can merge it.

@Martinsos
Copy link
Owner

Also, this change just happened on master: #158 . Since we did big changes, I wouldn't want us to miss that and override once we merge, so maybe we should do it now here also? Or we write it down and remember later when merging to keep an eye out for it.

Defined hash functions for std::pair to be able to create set of pairs, implemented EqualityDefinition to support generic types and updated tests and aligner app accordingly
@mobinasri
Copy link
Author

Hi @Martinsos, My apologies for being so late. Because of fire evacuation order in California and its conflict with some of my university deadlines I completely missed to finish this PR. I just added a commit based on #158 and re-based this branch.

@Martinsos
Copy link
Owner

@masri2019 no problem, I didn't forget about it but assumed you are on vacation and did not want to bother you! Sorry to hear about you being affected by the situation in California, I hope all is well!
Great work, I merged this. I also dedicated special issue to make it easier for us to track progress, check out this comment: #90 (comment) , so we can now use it to see what needs to be done next and how is the progress going.

Next step would be updating main README.md according to the changes you have made -> would you like to take that on? Let me know how you would like to continue, I would love it of course if you take on as much as possible :D, but I understand if you have time limits and similar.

@mobinasri
Copy link
Author

@Martinsos Thanks for your understanding. Great idea for tracking the progress. I'd love to take the next steps with your guidance. I start updating the README.md this week and send you the PR as soon as it is done.

@Martinsos
Copy link
Owner

@Martinsos Thanks for your understanding. Great idea for tracking the progress. I'd love to take the next steps with your guidance. I start updating the README.md this week and send you the PR as soon as it is done.

Awesome let's continue then :)!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants