Changed distance() to be more memory efficient based on email from mailing list. #2195

JamesDavidCarr · 2014-05-22T19:36:29Z

Changed the implementation so that it uses a smaller array and overwrites it on loop iterations.

http://forum.dlang.org/thread/rwrjgydeiytufcuiqaqk@forum.dlang.org
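For readers who have not seen the single-row trick before, here is a minimal sketch of the idea described above. It is not the code from this pull request: `lowMemDistance` is a hypothetical name, it assumes unit costs for insertion, deletion and substitution, and it indexes plain arrays (so it does not address the range issues raised later in the review).

```d
import std.algorithm : min;

// Hypothetical helper, not the Phobos implementation.
// Assumes unit edit costs and random-access arrays.
size_t lowMemDistance(T)(const(T)[] s, const(T)[] t)
{
    // A single row of the classic (s.length + 1) x (t.length + 1) table.
    auto row = new size_t[t.length + 1];
    foreach (j, ref cell; row)
        cell = j;                   // distance from the empty prefix of s

    foreach (i; 1 .. s.length + 1)
    {
        size_t diag = row[0];       // table[i-1][j-1]
        row[0] = i;                 // table[i][0]
        foreach (j; 1 .. t.length + 1)
        {
            immutable above = row[j];   // table[i-1][j], about to be overwritten
            immutable cSub = diag + (s[i - 1] == t[j - 1] ? 0 : 1);
            row[j] = min(cSub, row[j - 1] + 1, above + 1);
            diag = above;
        }
    }
    return row[t.length];
}
```

Memory is proportional to the length of `t` rather than to the product of both lengths, while the number of cell updates is unchanged. The full table is no longer available afterwards, which is why an edit path cannot be recovered from this variant.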

monarchdodra · 2014-05-22T21:11:45Z

std/algorithm.d

-                tt.popFront();
-                auto cIns = matrix(i,j - 1) + _insertionIncrement;
-                auto cDel = matrix(i - 1,j) + _deletionIncrement;
+    CostType distance(Range s, Range t) 


Trailing space.

JamesDavidCarr · 2014-05-25T22:53:01Z

Factor those into levenshteinDistance()?

Poita · 2014-05-25T22:53:23Z

std/algorithm.d

-                auto cDel = matrix(i - 1,j) + _deletionIncrement;
-                switch (min_index(cSub, cIns, cDel)) {
+                olddiag = matrix(0,y);
+                auto cSub = lastdiag + (equals(s[y-1], t[x-1]) ? 0 : _substitutionIncrement);


You cannot use opIndex on Range as it isn't necessarily random access. You will need to iterate s/t using front/popFront as was done previously.
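To illustrate the point about avoiding opIndex, the following is a rough sketch of a single-row distance that touches its inputs only through `empty`, `front`, `popFront` and `save`. It is an illustration under assumptions, not the code that was merged: `forwardDistance` is a hypothetical name and unit edit costs are assumed.

```d
import std.algorithm : filter, min;
import std.range;

// Hypothetical helper, not the Phobos code; unit costs are an assumption.
size_t forwardDistance(R1, R2)(R1 s, R2 t)
if (isForwardRange!R1 && isForwardRange!R2)
{
    // Count t once on a saved copy so that t itself is not consumed.
    immutable tLen = t.save.walkLength;
    auto row = new size_t[tLen + 1];
    foreach (j, ref cell; row)
        cell = j;

    size_t i = 0;
    for (; !s.empty; s.popFront())
    {
        ++i;
        size_t diag = row[0];       // value diagonally up-left
        row[0] = i;
        size_t j = 0;
        // Restart the inner range from a saved copy on every outer step.
        for (auto tt = t.save; !tt.empty; tt.popFront())
        {
            ++j;
            immutable above = row[j];
            immutable cSub = diag + (s.front == tt.front ? 0 : 1);
            row[j] = min(cSub, row[j - 1] + 1, above + 1);
            diag = above;
        }
    }
    return row[tLen];
}

unittest
{
    // filter yields a forward range without indexing.
    assert(forwardDistance("parks".filter!"true", "spark".filter!"true") == 2);
}
```

Only the row buffer is indexed; the inputs are never indexed, so the function also accepts ranges such as the result of `filter`.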

Poita · 2014-05-25T23:03:41Z

The unittests for levenshteinDistance are disappointing (not your fault). They only test one range type (string) and even then don't test that it works with non-ASCII characters.

Some good unittests should highlight the issues I have raised:

assert(levenshteinDistance("parks".filter!"true", "spark".filter!"true") == 2);
assert(levenshteinDistance("ID", "I♥D") == 1);

The first ensures that only forward range operations are used, and the second tests that strings are treated by code point instead of code unit.
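As a small illustration of the code point versus code unit distinction, the sketch below compares the two ways of measuring the string from the second assertion. It assumes a Phobos version in which `levenshteinDistance` already treats strings by code point, as the assertion above expects.

```d
import std.algorithm : levenshteinDistance;
import std.range : walkLength;

void main()
{
    string s = "I♥D";
    assert(s.length == 5);       // UTF-8 code units: '♥' takes three bytes
    assert(s.walkLength == 3);   // code points, which is what the test relies on
    assert(levenshteinDistance("ID", s) == 1);   // one inserted code point
}
```

An implementation that indexed the string by code unit would see three extra bytes rather than one extra character.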

schuetzm · 2014-05-26T09:53:34Z

Both filtering and UTF decoding are potentially bidirectional operations. If they currently aren't, that's because it isn't yet implemented (I believe this is the case for decoding). Therefore, in order to test that only forward range operations are used, something else needs to be used.

Poita · 2014-05-26T09:53:44Z

Thanks for the changes @JamesDavidCarr! I've added a few more comments, but I think that should be the last of it. Unfortunately the original algorithm had a couple of incorrect usages of ranges. Not your fault, but we might as well fix it while we're here if it's okay with you!

Poita · 2014-05-26T09:56:09Z

@schuetzm filter is only a forward range. We introduced filterBidirectional for the bidirectional case because it incurs extra overhead of eagerly finding the last element (filter is not lazy per element).
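The difference can be checked at compile time. This is a minimal sketch using the range traits from std.range; the alias names `F` and `B` are only for illustration.

```d
import std.algorithm : filter, filterBidirectional;
import std.range : isBidirectionalRange, isForwardRange;

// Illustrative aliases for the two result types.
alias F = typeof("parks".filter!"true");
alias B = typeof("parks".filterBidirectional!"true");

// filter offers save but no back/popBack.
static assert(isForwardRange!F && !isBidirectionalRange!F);
// filterBidirectional keeps the bidirectional interface of its source.
static assert(isBidirectionalRange!B);

void main() {}
```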

JamesDavidCarr · 2014-05-26T11:01:50Z

@Poita Thank you so much for all your help, this was my first commit to open source and I really appreciate all your help.

DmitryOlshansky · 2014-06-01T05:48:12Z

std/algorithm.d

-            s.popFront();
-            auto tt = t;
-            foreach (j; 1 .. cols)
+            matrix(y,0) = y;


Hm... so basically the matrix is a plain array (or slen+1 rows of 1-item columns, or simply slen+1 columns).
Then using rows instead of columns simply adds confusion; just use it as the array it actually is (_matrix).

JamesDavidCarr · Jun 6, 2014

Sorry Dmitry, I don't quite understand what you're trying to say.

DmitryOlshansky · Jun 6, 2014

Apparently in this case:
matrix(y,0) is the same as matrix(0, y), and I'm not sure that the distinction between this line and the one a few lines below makes anything better.

Then I'd argue that since it's exactly the same as _matrix[y], just use it directly (only in this function), making the life of the optimizer that much simpler.

Hackerpilot · 2014-07-12T00:54:11Z

It's been a month since this has been touched last. Any news?

mihails-strasuns · 2014-07-12T10:57:54Z

The Levenshtein struct seems to be public though undocumented, and this is a breaking change in its API. Was it supposed to be private?

mihails-strasuns · 2014-07-12T11:01:46Z

P.S. this is the only concern why I am not merging it straight away, changes look good.

JamesDavidCarr · 2014-07-12T21:53:27Z

When I was writing this I wasn't sure how to get around the old distance() implementation modifying the _matrix field as a side-effect.

I don't think it's possible to have the low memory version and also figure out the path at the same time so I split those into two separate use cases and didn't consider that people would declare their own versions of the struct and call distance() from that.

I also have one other question.

How do you personally make changes and test them when working on phobos?

mihails-strasuns · 2014-07-12T22:24:59Z

> and didn't consider that people would declare their own versions of the struct and call distance() from that

It sounds very unlikely to me, but there is some risk. What about keeping distance() as is and providing a new low-memory method that is used internally by levenshteinDistance?

> How do you personally make changes and test them when working on phobos?

For my Linux box I have this layout:

3 git clones : ~/devel/dlang/dmd, ~/devel/dlang/druntime, ~/devel/dlang/phobos, all updated to most recent master
dmd-git symlink alias for ~/dlang/dmd/src/dmd
~/dmd.conf (one in the home dir takes priority over /etc/dmd.conf)

[Environment]
DFLAGS=-I/home/dicebot/devel/dlang/druntime/src -I/home/dicebot/devel/dlang/phobos/ -L--export-dynamic -L-L/home/dicebot/devel/dlang/phobos/generated/linux/release/64

With this setup, testing changes in a Phobos module is as simple as rdmd --compiler=dmd-git -unittest -main std/algorithm.d, and when I don't need it anymore I can simply do mv ~/dmd.conf ~/dmd.conf.bak to re-enable the system-wide DMD install.

JamesDavidCarr · 2014-07-13T11:57:12Z

Hey, I'm sorry for causing this merge conflict.

This is the first project I've worked on where I'm dealing with a fork. I pulled in the upstream changes and then pushed everything by accident.

How should I deal with this?

mihails-strasuns · 2014-07-13T12:03:02Z

Recommended approach is to do this:

git fetch upstream
git rebase upstream/master

instead of this:

git pull upstream

Now that you already have a merge commit in your local branch you can remove it manually:

git fetch upstream
git rebase -i upstream/master # -i stands for "interactive"
# at this point text editor will open with list of commits.
# you can simply the line with 4e5acf1 , save and close

Assuming no other conflicts will appear that should get you to the state where your commits are based on top of most recent upstream master. If everything is ok at this point, just do git push -f origin master to overwrite your pull request.

JamesDavidCarr · 2014-07-13T12:20:48Z

What should I do with the line with 4e5acf1?

mihails-strasuns · 2014-07-13T12:23:05Z

delete it completely, sorry for the missing word ;)

@Poita

Moved checks for which Range is longer to levenshteinDistance() method. Reverted to implementation found in original distance() with regards to using front/popFront instead of opIndex. Fixed formatting of return statement. Added two unit tests for levenshteinDistance(). Credit to @Poita.

Refactored distance() so that it now accepts the lengths of the ranges, because we already calculate these in levenshteinDistance(). Altered ss in distance() and tt in distanceWithPath(). Moved the popFront() calls to after their final usage to handle transitive ranges. Corrected path() so that it calls distanceWithPath() instead of distance().

DmitryOlshansky · 2014-07-16T19:56:53Z

Auto-merge toggled on

monarchdodra reviewed May 22, 2014

Poita reviewed May 25, 2014

DmitryOlshansky reviewed Jun 1, 2014

JamesDavidCarr added 8 commits July 13, 2014 13:26

Modified Levenshtein distance()

4c9c85c

Modified the implementation of Levenshtein distance so that it now uses O(n) memory but retains the same time complexity. However, does not work with path().

Modified Levenshtein distance()

3a87433

Modified the implementation of Levenshtein distance so that it now uses O(n) memory but retains the same time complexity. However, does not work with path().

Fixed formatting

7f5aff5

Moved all braces so that they are on their own lines and removed the trailing whitespace after function declaration;

Moved previous distance() implementation

a9ae102

Moved to be a private method and only expose new more memory efficient distance(). This is done because we need the old implementation for path(). Modified levenshteinDistanceAndPath() to use the old distance() to reflect this.

Removed calls to matrix() in distance()

5102e9e

Since _matrix is now a single dimension matrix, calls to matrix() have been removed and replaced with direct access equivalents.

Changed the implementation of distance() so that it uses the memory i…

4beda7d

…nefficient version to prevent breakage in the API. Efficient version has been moved to distanceLowMem() in the struct and is called internally by levenshteinDistance()

Removed the accidental semicolon

0a7d2c9

DmitryOlshansky added a commit that referenced this pull request Jul 16, 2014

Merge pull request #2195 from JamesDavidCarr/master

1f4c14a

Changed distance() to be more memory efficient based on email from mailing list.

DmitryOlshansky merged commit 1f4c14a into dlang:master Jul 16, 2014
