Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed distance() to be more memory efficient based on email from mailing list. #2195

Merged
merged 9 commits into from Jul 16, 2014

Conversation

JamesDavidCarr
Copy link
Contributor

Changed the implementation so that it uses a smaller array and overwrites it on loop iterations.

http://forum.dlang.org/thread/rwrjgydeiytufcuiqaqk@forum.dlang.org

tt.popFront();
auto cIns = matrix(i,j - 1) + _insertionIncrement;
auto cDel = matrix(i - 1,j) + _deletionIncrement;
CostType distance(Range s, Range t)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing space.

@JamesDavidCarr
Copy link
Contributor Author

Factor those into levenshteinDistance() ?

auto cDel = matrix(i - 1,j) + _deletionIncrement;
switch (min_index(cSub, cIns, cDel)) {
olddiag = matrix(0,y);
auto cSub = lastdiag + (equals(s[y-1], t[x-1]) ? 0 : _substitutionIncrement);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot use opIndex on Range as it isn't necessarily random access. You will need to iterate s/t using front/popFront as was done previously.

@Poita
Copy link
Contributor

Poita commented May 25, 2014

The unittests for levenshteinDistance are disappointing (not your fault). They only test one range type (string) and even then don't test that it works with non-ASCII characters.

Some good unittests should highlight the issues I have raised:

assert(levenshteinDistance("parks".filter!"true", "spark".filter!"true") == 2);
assert(levenshteinDistance("ID", "I♥D") == 1);

The first ensures that only forward range operations are used, and the second tests that strings are treated by code point instead of code unit.

@schuetzm
Copy link
Contributor

Both filtering and UTF decoding are potentially bidirectional operations. If they currently aren't, that's because it isn't yet implemented (I believe this is the case for decoding). Therefore, in order to test that only forward range operations are used, something else needs to be used.

@Poita
Copy link
Contributor

Poita commented May 26, 2014

Thanks for the changes @JamesDavidCarr! I've added a few more comments, but I think that should be the last of it. Unfortunately the original algorithm had a couple of incorrect usages of ranges. Not your fault, but we might as well fix it while we're here if it's okay with you!

@Poita
Copy link
Contributor

Poita commented May 26, 2014

@schuetzm filter is only a forward range. We introduced filterBidirectional for the bidirectional case because it incurs extra overhead of eagerly finding the last element (filter is not lazy per element).

@JamesDavidCarr
Copy link
Contributor Author

@Poita Thank you so much for all your help, this was my first commit to open source and I really appreciate all your help.

s.popFront();
auto tt = t;
foreach (j; 1 .. cols)
matrix(y,0) = y;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... so basically a matrix is a plain array. (or slan+1 rows of 1-item columns, or simply slen+1 columns)
Then using raws instead of columns simply adds confusion, just use it as an array it actually is (_matrix).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry Dmirty, I don't quite understand what you're trying to say.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently in this case :
matrix(y,0) is the same as matrix(0, y) and I'm not sure if this distinction on this line vs few lines down below makes anything better.

Then I'd argue that since it's exactly the same as _matrix[y] just use it directly (only in this function) making the life of the optimizer that much simpler.

@Hackerpilot
Copy link
Member

It's been a month since this has been touched last. Any news?

@mihails-strasuns
Copy link

Levenstein struct seems to be public though undocumented - and this is a breaking change in its API. Was it supposed to be private?

@mihails-strasuns
Copy link

P.S. this is the only concern why I am not merging it straight away, changes look good.

@JamesDavidCarr
Copy link
Contributor Author

When I was writing this I wasn't sure how to get around the old distance() implementation modifying the _matrix field as a side-effect.

I don't think it's possible to have the low memory version and also figure out the path at the same time so I split those into two separate use cases and didn't consider that people would declare their own versions of the struct and call distance() from that.

I also have one other question.

How do you personally make changes and test them when working on phobos?

@mihails-strasuns
Copy link

and didn't consider that people would declare their own versions of the struct and call distance() from that

It sounds very unlikely to me but there is some risk. What about keeping low distance() as is and providing new low-memory method which will get used internally by levenshteinDistance?

How do you personally make changes and test them when working on phobos?

For my Linux box I have this layout:

  1. 3 git clones : ~/devel/dlang/dmd, ~/devel/dlang/druntime, ~/devel/dlang/phobos, all updated to most recent master
  2. dmd-git symlink alias for ~/dlang/dmd/src/dmd
  3. ~/dmd.conf (one in the home dir takes priority over /etc/dmd.conf)
[Environment]
DFLAGS=-I/home/dicebot/devel/dlang/druntime/src -I/home/dicebot/devel/dlang/phobos/ -L--export-dynamic -L-L/home/dicebot/devel/dlang/phobos/generated/linux/release/64

With this setup testing changes in Phobos module are as simple as rdmd --compiler=dmd-git -unittest -main std/algorithm.d and when I don't need it anymore I can simply do mv ~/dmd.conf ~/dmd.conf.bak to enable system-wide DMD install back.

@JamesDavidCarr
Copy link
Contributor Author

Hey, I'm sorry for causing this merge conflict.

This is the first project I've worked on where I'm dealing with a fork. I pulled in the upstream changes and then pushed everything on accident.

How should I deal with this?

@mihails-strasuns
Copy link

Recommended approach is to do this:

git fetch upstream
git rebase upstream/master

instead of this:

git pull upstream

Now that you already have a merge commit in your local branch you can remove it manually:

git fetch upstream
git rebase -i upstream/master # -i stands for "interactive"
# at this point text editor will open with list of commits.
# you can simply the line with 4e5acf1 , save and close

Assuming no other conflicts will appear that should get you to the state where your commits are based on top of most recent upstream master. If everything is ok at this point, just do git push -f origin master to overwrite your pull request.

@JamesDavidCarr
Copy link
Contributor Author

What should I do with the line with 4e5acf1?

@mihails-strasuns
Copy link

delete it completely, sorry for the missing word ;)

Modified the implementation of Levenshtein distance so that it now uses O(n) memory but retains the same time complexity. However, does not work with path().
Modified the implementation of Levenshtein distance so that it now uses O(n) memory but retains the same time complexity. However, does not work with path().
Moved all braces so that they are on their own lines and removed the trailing whitespace after function declaration;
Moved to be a private method and only expose new more memory efficient
distance(). This is done because we need the old implementation for
path(). Modified levenshteinDistanceAndPath() to use the old distance()
to reflect this.
Moved checks for which Range is longer to levenshteinDistance() method.

Reverted to implementation found in original distance() with regards to
using front/popFront instead of opIndex.

Fixed formatting of return statement.

Added two unit tests for levenshteinDistance().

Credit to @Poita.
Refactored distacen() so that it now accepts the lengths of the ranges
because we already calculate these in levenshteinDistance().

Altered ss in distance() and tt in distanceWithPath().

Moved popFront()’s to after their final usage to handle transitive
ranges.

Corrected path() so that it calls distanceWithPath() instead of
distance().
Since _matrix is now a single dimension matrix, calls to matrix() have
been removed and replaced with direct access equivalents.
…nefficient version to prevent breakage in the API. Efficient version has been moved to distanceLowMem() in the struct and is called internally by levenshteinDistance()
@DmitryOlshansky
Copy link
Member

Auto-merge toggled on

DmitryOlshansky added a commit that referenced this pull request Jul 16, 2014
Changed distance() to be more memory efficient based on email from mailing list.
@DmitryOlshansky DmitryOlshansky merged commit 1f4c14a into dlang:master Jul 16, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants