Re: utf8 implementation: shared keys #541

p5pRT opened this issue Sep 20, 1999 · 2 comments

p5pRT commented Sep 20, 1999

Migrated from (status was 'resolved')

Searchable as RT1388$

p5pRT commented Sep 20, 1999

From The RT System itself

That's fair.

But ... why do we need to dis-associate utf8 and byte in this way?
So long as we can tell they are not "equal" we may as well share the bytes.
So we just have a flag which sets the utf8-ness of this shared value.
We hash them the same and store them the same, but two entries only match
if the utf8-ness matches as well as the hash, length, and bytes.
Now I can have one clump-of-bytes
Because I don't believe that these strings will coincide _exactly_.
So why have multiple sets of overhead and time consuming decision process?
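
A minimal sketch of that matching rule, for illustration only. The `shared_key` struct and `shared_key_eq` function are hypothetical names, not Perl's actual shared-key layout; the only point is that the utf8 flag takes part in the comparison while the hash and bytes are handled identically for both kinds of string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical shared-key entry (an assumption, not perl's real struct). */
struct shared_key {
    unsigned long hash;    /* hash of the raw bytes, same for either encoding */
    size_t        len;     /* length in bytes */
    int           is_utf8; /* 1 if the bytes are to be read as UTF-8 */
    const char   *bytes;   /* the key's bytes */
};

/* Returns 1 only if hash, length, utf8-ness and bytes all agree. */
static int shared_key_eq(const struct shared_key *a,
                         const struct shared_key *b)
{
    return a->hash == b->hash
        && a->len == b->len
        && a->is_utf8 == b->is_utf8
        && memcmp(a->bytes, b->bytes, a->len) == 0;
}
```

With this rule, the two bytes 0xC3 0xA9 stored as a byte string and the same two bytes stored as a utf8 string land in the same bucket but never compare equal.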

To create tighter hashes, and hopefully those "time consuming decisions"
would only be made once for every string either offered as input from
some source that would be considered tainted (i.e. external data), or
once at compile time for strings specified at the source level.

In fact, as I mentioned above, I believe that the datatab string table
is likely not worth storing at all.
I know I need it - it allows the app. to run without swapping...

This is fair. Although, it might be worth allowing apps which use
binary data for some other purpose to disable shared keys for that
specific hash.

What we need is a better algorithm that will be able to eliminate
hash keys with only one reference after some unspecified period.
Shared keys are only useful if the key is actually shared.
Tighter hashes have the potential to be faster for the simple reason that
they take up fewer pages of memory.

Given an analysis of the string contents,
Um, that sounds slow.
Not if we keep Larry Wall's suggested 7-bit clean optimization in place.
But that means finding another flag bit. Which may mean making TYPEMASK
0x7F rather than 0xFF and converting a byte read in to a read +

Well, you could always OR all the bytes in sequence and check the 8th
bit when you are done.
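
For what it's worth, the OR-the-bytes check could look roughly like this. It is a sketch with a made-up function name (`is_7bit_clean`), not code from the perl source: it ORs every byte into an accumulator and tests the top bit once at the end, so there is no branch per byte.

```c
#include <stddef.h>

/* Returns 1 if no byte in buf[0..len) has its high (8th) bit set.
 * is_7bit_clean is a hypothetical helper name used for illustration. */
static int is_7bit_clean(const unsigned char *buf, size_t len)
{
    unsigned char acc = 0;
    size_t i;

    for (i = 0; i < len; i++)
        acc |= buf[i];          /* accumulate every bit seen so far */

    return (acc & 0x80) == 0;   /* a single test after the loop */
}
```

The cost is one pass over the string, which is why it matters how often the check has to run at all.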

(Crossing my fingers... :-) ) The majority of cases should be able to be
determined at compile time.
That is far from clear.

The only part unclear is how much effort would be required in adding
this optimization into perl. It is for sure possible.

And assuming that all operations that combine
strings, or operate on them, propagate the bits whenever it is accurate
to do so, the slow down should be fairly minimal. (And hopefully, absolutely
minimal for "normal" perl...)
For instance, if there are ZERO uses of utf8 code anywhere so far, there
is no need at all to perform the extra analysis to determine if the
string is 7-bit clean. Only on the first occurrence of "use utf8;" or
"\x{...}" or some other set of specifications, does this analysis need
to begin to take place.
Having a huge hash with many keys that are only referenced once is slow.
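
To make the propagation idea concrete, here is a rough sketch under assumed names. `struct str`, its `clean7` field and `str_concat` are hypothetical, not perl's SV API. The result of a concatenation is known to be 7-bit clean exactly when both operands are known to be, so the flag can be carried forward without rescanning any bytes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical string value carrying a "known 7-bit clean" flag. */
struct str {
    char  *bytes;
    size_t len;
    int    clean7;  /* 1 = known 7-bit clean, 0 = unknown or not clean */
};

/* Concatenates a and b into a newly allocated string and propagates
 * the flag. On allocation failure, returns an empty, unflagged result. */
static struct str str_concat(const struct str *a, const struct str *b)
{
    struct str r;

    r.len = a->len + b->len;
    r.bytes = (char *)malloc(r.len + 1);
    if (r.bytes == NULL) {
        r.len = 0;
        r.clean7 = 0;
        return r;
    }
    memcpy(r.bytes, a->bytes, a->len);
    memcpy(r.bytes + a->len, b->bytes, b->len);
    r.bytes[r.len] = '\0';

    /* No rescan: clean only if both inputs were already known clean. */
    r.clean7 = a->clean7 && b->clean7;
    return r;
}
```

The caller owns `r.bytes` and frees it. A program that never touches utf8 would never need to set or consult `clean7` in the first place, which is the "absolutely minimal" case for normal perl.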

I really think you have to go all the way, or else you'll regret it later.

Of course, perhaps smaller steps are possible.


markm@​​​ __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | CUE Development (4Y21)
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | Nortel Networks
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
  and in the darkness bind them...


p5pRT commented Apr 22, 2003

@iabyn - Status changed from 'stalled' to 'resolved'
