Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use a open addressing map to store folding paths #36

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

arnaudroger
Copy link
Contributor

  • change simplefold logic to use an open addressing map with phi mix hashing.
    • quick path check of value if < ORBIT_MIN_VALUE
    • if there is a folding array just iterate over avoiding continuous lookup
    • if not just got through toUpper toLower path
  • reduce binary search to exclude the element searched through the linear searched

This optimisation came from profiling my app with real payload, the simpleFold was taking around 15% cpu and is now negligible.

I added a few test to verify that the logic is still working as expected.

int[] folds = Unicode.optimizedFoldOrbit(r0);

if (folds == null) {
return (r == Unicode.optimizedFoldNonOrbit(r0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you benchmarked this against an implementation based on Character.toLowerCase(r) == r? Ideally with a few changes like that, we could throw away all the Unicode tables and use the ones built into java.lang.

Be sure to use the int (code point) not char (UTF-16 code) variants of those functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you are talking about the optimizedFoldNonOrbit? no I have not, was trying to be the least impactful as possible :) will have a look. From what I understand the Orbit exception will still need to be there ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the orbit case will still need special handling. I suggest you do this experiment as a separate change.

static int[] optimizedFoldOrbit(int r) {
if (r >= ORBIT_MIN_VALUE) {
return FOLD_MAP.get(r);
} else return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use {...} around every block, or outdent the else block:

if (r < ORBIT_MIN_VALUE) {
    return null;
}
return FOLD_MAP.get(r);

private static final int ORBIT_MIN_VALUE = UnicodeTables.CASE_ORBIT[0][0];

static {
for(int i = 0; i < UnicodeTables.CASE_ORBIT.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (int[][] orbit : UnicodeTables.CASE_ORBIT)

int[] folds = new int[0];
int r1 = r0;
while((r1 = simpleFold(r1)) != r0) {
folds = Arrays.copyOf(folds, folds.length + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't an orbit necessarily of length at least 3? I realize they are small and few, but it seems perverse to copy the entire array each time we add an element. Use a temporary large enough for the longest orbit (MAX_DISTANCE?) and copy it once before each call to FOLD_MAP.put.

}

/*
associate a rune to it's unfolded case equivalent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A FoldMap maps a rune to an array of runes that are equivalent to it in a case-insensitive pattern."

return toUpper(r);
}

private static final FoldMap FOLD_MAP = new FoldMap(UnicodeTables.CASE_ORBIT.length * 4);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whence 4? Is it a load factor based on MAX_DISTANCE?

I don't think this should be a parameter. FoldMap is instantiated exactly once; specialize it.

@@ -13,6 +13,8 @@
*/
class Inst {

private static final int RUNES_LINER_SEARCH_SIZE = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LINEAR?

Move this constant into the sole function that uses it.

// Otherwise binary search.
for (int lo = 0, hi = runes.length / 2; lo < hi; ) {
// Otherwise binary search on rest of the array
for (int lo = 0, hi = (runes.length - RUNES_LINER_SEARCH_SIZE) / 2; lo < hi; ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth the fuss to subtract the portion initially scanned. It's a logarithmic algorithm and this is not the common case. I'm sure it's fast enough to search the complete array. Did you find evidence to the contrary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, will remove it

@@ -84,12 +98,12 @@ boolean matchRune(int r) {
}
}

// Otherwise binary search.
for (int lo = 0, hi = runes.length / 2; lo < hi; ) {
// Otherwise binary search on rest of the array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're here, please document:

// Invariant: lo, hi, m are even.

associate a rune to it's unfolded case equivalent
*/
private static class FoldMap {
private static final int MAX_DISTANCE = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document:

// number of closed-hashing probes before giving up

private final int mask;
private final int maxDistance;

FoldMap(int size) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FoldMap serves exactly one purpose. It doesn't need a parameter, and it doesn't need to be separated from the logic that instantiates and populates it.

@arnaudroger
Copy link
Contributor Author

I did check the Character.toLower/toUpper it does not affect the results, though I change the isUpper/isLower to use the Character one and it broke some test so I rolled it back.

@martin-a martin-a mentioned this pull request Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants