Changed distance() to be more memory efficient based on email from mailing list. #2195
tt.popFront();
auto cIns = matrix(i,j - 1) + _insertionIncrement;
auto cDel = matrix(i - 1,j) + _deletionIncrement;
CostType distance(Range s, Range t)
Trailing space.
Factor those into levenshteinDistance()?
auto cDel = matrix(i - 1,j) + _deletionIncrement;
switch (min_index(cSub, cIns, cDel)) {
olddiag = matrix(0,y);
auto cSub = lastdiag + (equals(s[y-1], t[x-1]) ? 0 : _substitutionIncrement);
You cannot use opIndex on Range as it isn't necessarily random access. You will need to iterate s/t using front/popFront as was done previously.
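As a minimal sketch of the point above (not code from this pull request), the nested traversal can be written with only `empty`, `front`, `popFront` and `save`, so it works on any forward range. The helper name `countEqualPairs` is hypothetical and exists only for illustration.

```d
import std.algorithm.iteration : filter;
import std.range.primitives;

/// Hypothetical helper: counts the pairs (a, b), with a taken from s and
/// b taken from t, for which a == b. No indexing is used.
size_t countEqualPairs(R1, R2)(R1 s, R2 t)
if (isForwardRange!R1 && isForwardRange!R2)
{
    size_t count = 0;
    for (; !s.empty; s.popFront())
    {
        // Restart the inner traversal from a saved copy of t each time.
        for (auto tt = t.save; !tt.empty; tt.popFront())
        {
            if (s.front == tt.front)
                ++count;
        }
    }
    return count;
}

unittest
{
    // Each letter of "parks" occurs exactly once in "spark".
    assert(countEqualPairs("parks", "spark") == 5);
    // filter!"true" yields a range without opIndex.
    assert(countEqualPairs("aab".filter!"true", "ab".filter!"true") == 3);
}
```

The inner loop walks `t.save` rather than `t` itself, which is what lets the outer loop revisit `t` from the beginning on every pass.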
The unittests for levenshteinDistance are disappointing (not your fault). They only test one range type ( Some good unittests should highlight the issues I have raised:
assert(levenshteinDistance("parks".filter!"true", "spark".filter!"true") == 2);
assert(levenshteinDistance("ID", "I♥D") == 1);
The first ensures that only forward range operations are used, and the second tests that strings are treated by code point instead of code unit.
Both filtering and UTF decoding are potentially bidirectional operations. If they currently aren't, that's because it isn't yet implemented (I believe this is the case for decoding). Therefore, in order to test that only forward range operations are used, something else needs to be used.
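One possible "something else" is a small wrapper that exposes only the forward range primitives, so any use of `back`, `popBack`, `length` or indexing fails to compile. This is a sketch under assumptions: `ForwardOnly` is a hypothetical test helper, not part of Phobos, and the call assumes `levenshteinDistance` from `std.algorithm.comparison` accepts two forward ranges of the same type.

```d
import std.algorithm.comparison : levenshteinDistance;
import std.range.primitives;

/// Hypothetical test helper: wraps a slice but offers only
/// empty/front/popFront/save.
struct ForwardOnly(T)
{
    private T[] data;

    @property bool empty() { return data.length == 0; }
    @property T front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property ForwardOnly save() { return this; }
}

// The wrapper is a forward range and nothing stronger.
static assert(isForwardRange!(ForwardOnly!dchar));
static assert(!isBidirectionalRange!(ForwardOnly!dchar));
static assert(!isRandomAccessRange!(ForwardOnly!dchar));

unittest
{
    auto s = ForwardOnly!dchar("parks"d.dup);
    auto t = ForwardOnly!dchar("spark"d.dup);
    assert(levenshteinDistance(s, t) == 2);
}
```

Because the capability checks are `static assert`s, a later change that quietly makes the wrapper bidirectional would be caught at compile time.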
Thanks for the changes @JamesDavidCarr! I've added a few more comments, but I think that should be the last of it. Unfortunately the original algorithm had a couple of incorrect usages of ranges. Not your fault, but we might as well fix it while we're here if it's okay with you!
@schuetzm |
@Poita Thank you so much for all your help. This was my first commit to open source and I really appreciate it.
s.popFront();
auto tt = t;
foreach (j; 1 .. cols)
matrix(y,0) = y;
Hm... so basically the matrix is a plain array (slen+1 rows of 1-item columns, or simply slen+1 columns).
Then using rows instead of columns simply adds confusion; just use it as the array it actually is (_matrix).
Sorry Dmitry, I don't quite understand what you're trying to say.
Apparently in this case matrix(y,0) is the same as matrix(0, y), and I'm not sure that making this distinction between this line and the one a few lines below makes anything better.
I'd argue that since it's exactly the same as _matrix[y], just use it directly (only in this function), making the life of the optimizer that much simpler.
It's been a month since this was last touched. Any news?
P.S. this is the only concern why I am not merging it straight away, changes look good.
When I was writing this I wasn't sure how to get around the old distance() implementation modifying the _matrix field as a side-effect. I don't think it's possible to have the low memory version and also figure out the path at the same time so I split those into two separate use cases and didn't consider that people would declare their own versions of the struct and call distance() from that. I also have one other question. How do you personally make changes and test them when working on phobos? |
It sounds very unlikely to me but there is some risk. What about keeping low
For my Linux box I have this layout:
With this setup testing changes in Phobos module are as simple as |
Hey, I'm sorry for causing this merge conflict. This is the first project I've worked on where I'm dealing with a fork. I pulled in the upstream changes and then pushed everything by accident. How should I deal with this?
Recommended approach is to do this:
instead of this:
Now that you already have a merge commit in your local branch you can remove it manually:
Assuming no other conflicts will appear that should get you to the state where your commits are based on top of most recent upstream master. If everything is ok at this point, just do |
What should I do with the line with 4e5acf1? |
delete it completely, sorry for the missing word ;) |
Modified the implementation of Levenshtein distance so that it now uses O(n) memory but retains the same time complexity. However, does not work with path().
Moved all braces so that they are on their own lines and removed the trailing whitespace after function declaration;
Moved to be a private method and only expose new more memory efficient distance(). This is done because we need the old implementation for path(). Modified levenshteinDistanceAndPath() to use the old distance() to reflect this.
Moved checks for which Range is longer to levenshteinDistance() method. Reverted to implementation found in original distance() with regards to using front/popFront instead of opIndex. Fixed formatting of return statement. Added two unit tests for levenshteinDistance(). Credit to @Poita.
Refactored distance() so that it now accepts the lengths of the ranges, because we already calculate these in levenshteinDistance(). Altered ss in distance() and tt in distanceWithPath(). Moved the popFront() calls to after their final usage to handle transient ranges. Corrected path() so that it calls distanceWithPath() instead of distance().
Since _matrix is now a single dimension matrix, calls to matrix() have been removed and replaced with direct access equivalents.
…nefficient version to prevent breakage in the API. Efficient version has been moved to distanceLowMem() in the struct and is called internally by levenshteinDistance()
Auto-merge toggled on |
Changed distance() to be more memory efficient based on email from mailing list.
Changed the implementation so that it uses a smaller array and overwrites it on loop iterations.
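The idea of a smaller array that is overwritten on each loop iteration can be sketched as follows. This is an illustrative reconstruction, not the code merged into Phobos: the function name `lowMemDistance` is hypothetical, all three edit costs are assumed to be 1, and elements are compared with `==`. It keeps a single row of length n + 1 (n being the length of `t`) and uses only forward range operations.

```d
import std.algorithm.comparison : min;
import std.range.primitives;

/// Hypothetical single-row Levenshtein distance: O(n) memory,
/// O(m * n) time, unit costs for every edit.
size_t lowMemDistance(R1, R2)(R1 s, R2 t)
if (isForwardRange!R1 && isForwardRange!R2)
{
    immutable tlen = walkLength(t.save);
    // row[j] is the distance between the consumed prefix of s and the
    // first j elements of t.
    auto row = new size_t[tlen + 1];
    foreach (j, ref cell; row)
        cell = j;

    size_t i = 0;
    for (; !s.empty; s.popFront())
    {
        ++i;
        size_t lastDiag = row[0]; // previous row, previous column
        row[0] = i;
        size_t j = 0;
        for (auto tt = t.save; !tt.empty; tt.popFront())
        {
            ++j;
            immutable oldDiag = row[j]; // previous row, same column
            immutable cSub = lastDiag + (s.front == tt.front ? 0 : 1);
            immutable cIns = row[j - 1] + 1; // current row, previous column
            immutable cDel = row[j] + 1;     // previous row, same column
            row[j] = min(cSub, cIns, cDel);
            lastDiag = oldDiag;
        }
    }
    return row[tlen];
}

unittest
{
    assert(lowMemDistance("parks", "spark") == 2);
    assert(lowMemDistance("ID", "I♥D") == 1); // compared by code point
}
```

Because each cell is overwritten as soon as the next row needs it, the full matrix is never available afterwards, which is why this form cannot reconstruct the edit path the way path() does.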
http://forum.dlang.org/thread/rwrjgydeiytufcuiqaqk@forum.dlang.org