Hash keys are limited to 2 GB #15237
From @arc

It is not currently possible to use a string of 2**31 bytes (or larger) as a hash key, even on a 64-bit system. Furthermore, trying to do so seems to cause invalid length calculations, resulting in an overlarge allocation and a panic in malloc:

$ ./perl -le 'print $^V'

The attached patch tries to remedy at least the second of those problems, by detecting the situation and emitting a diagnostic that explicitly says what's happening. However, it is incomplete; I got as far as working out that, to make OP_MULTIDEREF work, we'd need to change the argument types of newSVpvn_share() (part of the public API), and decided that this wasn't something I wanted to tackle during code freeze. But I think it would probably be worth looking into this during the 5.25.x cycle.
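The panic size quoted below (18446744071562068026) is consistent with a 2**31-byte length being stored into a signed 32-bit slot and the resulting negative value being widened back to size_t. A minimal standalone C sketch of that arithmetic on a 64-bit system; the 58-byte per-key overhead is an assumption chosen only so the output matches the reported panic, the real figure would come from the HEK header layout:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    size_t keylen = (size_t)1 << 31;  /* a 2**31-byte hash key */
    int32_t klen = (int32_t)keylen;   /* stored as I32: wraps to -2147483648 */
    /* Widening the negative length (plus an assumed 58 bytes of per-key
       overhead) back to size_t sign-extends into an enormous request. */
    size_t alloc = (size_t)(klen + 58);
    printf("%zu\n", alloc);           /* 18446744071562068026 */
    return 0;
}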
From @arc

0001-Better-handling-for-huge-hash-keys.patch:

From ff8fb2afb9808d99fc7f2a3704df927958492ab8 Mon Sep 17 00:00:00 2001
From: Aaron Crane <arc@cpan.org>
Date: Fri, 18 Mar 2016 16:56:43 +0000
Subject: [PATCH] Better handling for huge hash keys
We currently require hash keys to be less than 2**31 bytes long. But (a)
nothing actually tries to enforce that, and (b) if a Perl program tries to
create a hash with such a key (using a 64-bit system), we miscalculate the
size of a memory block, yielding a panic:
$ ./perl -e '+{ "x" x 2**31, undef }'
panic: malloc, size=18446744071562068026 at -e line 1.
Instead, check for this situation, and croak with an appropriate (new)
diagnostic in the unlikely event that it occurs.
This also involves changing the type of an argument to a public API function:
Perl_share_hek() previously took the key's length as an I32, but that makes
it impossible to detect over-long keys, so it must be SSize_t instead.
---
embed.fnc | 4 ++--
hv.c | 10 +++++++---
pod/perldiag.pod | 6 ++++++
proto.h | 4 ++--
t/bigmem/hash.t | 33 +++++++++++++++++++++++++++++++++
5 files changed, 50 insertions(+), 7 deletions(-)
create mode 100644 t/bigmem/hash.t
diff --git a/embed.fnc b/embed.fnc
index 049f6c1..87c2855 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -1331,7 +1331,7 @@ AMpd |OP* |op_scope |NULLOK OP* o
: Only used by perl.c/miniperl.c, but defined in caretx.c
px |void |set_caret_X
Apd |void |setdefout |NN GV* gv
-Ap |HEK* |share_hek |NN const char* str|I32 len|U32 hash
+Ap |HEK* |share_hek |NN const char* str|SSize_t len|U32 hash
#if defined(HAS_SIGACTION) && defined(SA_SIGINFO)
: Used in perl.c
np |Signal_t |sighandler |int sig|NULLOK siginfo_t *info|NULLOK void *uap
@@ -1926,7 +1926,7 @@ sa |HE* |new_he
sanR |HEK* |save_hek_flags |NN const char *str|I32 len|U32 hash|int flags
sn |void |hv_magic_check |NN HV *hv|NN bool *needs_copy|NN bool *needs_store
s |void |unshare_hek_or_pvn|NULLOK const HEK* hek|NULLOK const char* str|I32 len|U32 hash
-sR |HEK* |share_hek_flags|NN const char *str|I32 len|U32 hash|int flags
+sR |HEK* |share_hek_flags|NN const char *str|STRLEN len|U32 hash|int flags
rs |void |hv_notallowed |int flags|NN const char *key|I32 klen|NN const char *msg
in |U32|ptr_hash|PTRV u
s |struct xpvhv_aux*|hv_auxinit|NN HV *hv
diff --git a/hv.c b/hv.c
index 7b5ad95..9df211e 100644
--- a/hv.c
+++ b/hv.c
@@ -2922,7 +2922,7 @@ S_unshare_hek_or_pvn(pTHX_ const HEK *hek, const char *str, I32 len, U32 hash)
* len and hash must both be valid for str.
*/
HEK *
-Perl_share_hek(pTHX_ const char *str, I32 len, U32 hash)
+Perl_share_hek(pTHX_ const char *str, SSize_t len, U32 hash)
{
bool is_utf8 = FALSE;
int flags = 0;
@@ -2954,7 +2954,7 @@ Perl_share_hek(pTHX_ const char *str, I32 len, U32 hash)
}
STATIC HEK *
-S_share_hek_flags(pTHX_ const char *str, I32 len, U32 hash, int flags)
+S_share_hek_flags(pTHX_ const char *str, STRLEN len, U32 hash, int flags)
{
HE *entry;
const int flags_masked = flags & HVhek_MASK;
@@ -2963,6 +2963,10 @@ S_share_hek_flags(pTHX_ const char *str, I32 len, U32 hash, int flags)
PERL_ARGS_ASSERT_SHARE_HEK_FLAGS;
+ if (UNLIKELY(len > (STRLEN) I32_MAX)) {
+ Perl_croak_nocontext("Sorry, hash keys must be smaller than 2**31 bytes");
+ }
+
/* what follows is the moral equivalent of:
if (!(Svp = hv_fetch(PL_strtab, str, len, FALSE)))
@@ -2977,7 +2981,7 @@ S_share_hek_flags(pTHX_ const char *str, I32 len, U32 hash, int flags)
for (;entry; entry = HeNEXT(entry)) {
if (HeHASH(entry) != hash) /* strings can't be equal */
continue;
- if (HeKLEN(entry) != len)
+ if (HeKLEN(entry) != (SSize_t) len)
continue;
if (HeKEY(entry) != str && memNE(HeKEY(entry),str,len)) /* is this it? */
continue;
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index b0106f0..e2d1f80 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -5525,6 +5525,12 @@ overhauled.
(F) An ancient error message that almost nobody ever runs into anymore.
But before sort was a keyword, people sometimes used it as a filehandle.
+=item Sorry, hash keys must be smaller than 2**31 bytes
+
+(F) You tried to create a hash containing a very large key, where "very
+large" means that it needs at least 2 gigabytes to store. Unfortunately,
+Perl doesn't yet handle such large hash keys.
+
=item Source filters apply only to byte streams
(F) You tried to activate a source filter (usually by loading a
diff --git a/proto.h b/proto.h
index 8807867..b0e17b1 100644
--- a/proto.h
+++ b/proto.h
@@ -2792,7 +2792,7 @@ PERL_CALLCONV void Perl_set_numeric_standard(pTHX);
PERL_CALLCONV void Perl_setdefout(pTHX_ GV* gv);
#define PERL_ARGS_ASSERT_SETDEFOUT \
assert(gv)
-PERL_CALLCONV HEK* Perl_share_hek(pTHX_ const char* str, I32 len, U32 hash);
+PERL_CALLCONV HEK* Perl_share_hek(pTHX_ const char* str, SSize_t len, U32 hash);
#define PERL_ARGS_ASSERT_SHARE_HEK \
assert(str)
PERL_CALLCONV void Perl_sortsv(pTHX_ SV** array, size_t num_elts, SVCOMPARE_t cmp);
@@ -4269,7 +4269,7 @@ STATIC HEK* S_save_hek_flags(const char *str, I32 len, U32 hash, int flags)
#define PERL_ARGS_ASSERT_SAVE_HEK_FLAGS \
assert(str)
-STATIC HEK* S_share_hek_flags(pTHX_ const char *str, I32 len, U32 hash, int flags)
+STATIC HEK* S_share_hek_flags(pTHX_ const char *str, STRLEN len, U32 hash, int flags)
__attribute__warn_unused_result__;
#define PERL_ARGS_ASSERT_SHARE_HEK_FLAGS \
assert(str)
diff --git a/t/bigmem/hash.t b/t/bigmem/hash.t
new file mode 100644
index 0000000..e3d2980
--- /dev/null
+++ b/t/bigmem/hash.t
@@ -0,0 +1,33 @@
+#!perl
+BEGIN {
+ chdir 't' if -d 't';
+ @INC = "../lib";
+ require './test.pl';
+}
+
+use Config qw(%Config);
+
+$ENV{PERL_TEST_MEMORY} >= 4
+ or skip_all("Need ~4Gb for this test");
+$Config{ptrsize} >= 8
+ or skip_all("Need 64-bit pointers for this test");
+
+plan(2);
+
+sub exn {
+ my ($code_string) = @_;
+ local $@;
+ return undef if eval "do { $code_string }; 1";
+ return $@;
+}
+
+like(exn('my $h = { "x" x 2**31, undef }'),
+ qr/^\QSorry, hash keys must be smaller than 2**31 bytes\E\b/,
+ "hash constructed with huge key");
+
+TODO: {
+ local $TODO = "Doesn't yet work with OP_MULTIDEREF";
+ like(exn('my %h; $h{ "x" x 2**31 } = undef'),
+ qr/^\QSorry, hash keys must be smaller than 2**31 bytes\E\b/,
+ "assign to huge hash key");
+}
--
2.7.2
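The commit message's rationale for widening the parameter (an I32 length makes over-long keys impossible to detect) can be demonstrated in isolation: by the time a 2**31 value reaches an I32 parameter it has already wrapped, so no check inside the callee can catch it. A standalone sketch, using ptrdiff_t as a stand-in for Perl's SSize_t and illustrative function names:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static void take_i32(int32_t len)     { printf("as I32:     %d\n", len); }
static void take_ssize(ptrdiff_t len) { printf("as SSize_t: %td\n", len); }

int main(void) {
    size_t keylen = (size_t)1 << 31;   /* 2**31 bytes */
    take_i32((int32_t)keylen);         /* prints -2147483648: overflow invisible */
    take_ssize((ptrdiff_t)keylen);     /* prints 2147483648: can be range-checked */
    return 0;
}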
From @bulk88

On Fri Mar 18 11:10:21 2016, arc wrote:

I think it's a waste of memory to allow hash keys >2GB or >4GB. Keep the length 32-bit. Trying to dedup >2GB strings with a Perl hash sounds like a programmer error unless the machine has dozens of TBs of RAM and 100s of TBs of SSDs.
The RT System itself - Status changed from 'new' to 'open'
From @jkeenan

On Sun, 20 Mar 2016 15:51:29 GMT, bulk88 wrote:

I concur with bulk88. Is there any *reasonable* case -- not merely a curiosity -- for hash keys > 2GB? Thank you very much.
From zefram@fysh.org

James E Keenan via RT wrote:

If there's any case for strings >2GB, then there's a case for hash keys.

-zefram
From @demerphq

On 16 May 2017 at 03:06, Zefram <zefram@fysh.org> wrote:

Agreed; especially when you consider that the length data is stored ...

Also some of the "short string" hacks might be applicable here.

There are all kinds of trade-offs here to consider, but IMO wasting ...

Yves
From @dur-randir

2017-05-16 9:49 GMT+03:00 demerphq <demerphq@gmail.com>:

I'd disagree with this. Wasting some bytes seems small, but wasting ...

I think that the reasonable behavior would be just to croak() on ...

Best regards,
From @toddr

I'm glad you replied. This was well said and exactly what I couldn't figure out how to put into words. Anybody using HEKs that big should be using some sort of library or XS, not Perl.
From @toddr

I can imagine a scenario where I might want to mmap a 5GB file into a PV. I cannot imagine a scenario where I'd want the file contents in a HEK.

Todd
From zefram@fysh.org

Todd Rinaldo wrote:

Using a hash to represent a set of strings. Specifically, construct a ...

-zefram
From @demerphq

On 17 May 2017 18:09, "Todd Rinaldo" <toddr@cpanel.net> wrote:

> I can imagine a scenario where I might want to mmap a 5GB file into a PV. I ...

The fact strings and keys aren't completely normalised shouldn't be an ...

I don't have a problem with a build option supporting this if you want to ...

Yves
From @cpansprout

On Wed, 17 May 2017 09:22:59 -0700, zefram@fysh.org wrote:

Is it possible to reach a compromise between the two positions? First, since it is quicker to do, have perl detect the situation it cannot handle, and croak with a better message (or a butter massage if you prefer). Then, when someone who really wants 2GB-HEK support gets around to writing a patch, we apply the patch *as long as* it does not significantly increase memory usage by making all HEKs larger. (One way to do this might be to put large hash key support in a separate module that gets loaded automatically when a large hash key is created. The hash could get uvar or some other magic attached automatically; it would not have to use HEKs for storing those long strings.)

Father Chrysostomos
From @toddr

Right, so we're talking a 4-6GB program just to store the 2-3 files' contents in HEKs. Sure, you could do it. But should you? AFAIK you can't / shouldn't be mem-mapping those files into a HEK, so that's right out. I'd question the time required to even generate the relevant hash for such a string. Not to mention the fact that you're going to double your memory just to refer to those hash keys by putting their contents in an SVPV. Honestly, the whole thing seems very contrived to me.

Todd
From @demerphq

On 18 May 2017 at 22:21, Todd Rinaldo <toddr@cpanel.net> wrote:

I misread the original bug. I thought you wanted to reduce it from 64 ...

Until and unless we actually make it support 64-bit lengths, which I ...

The thing I am missing is why Aaron put the length check where he did.

I think something like this will do once my hash cleanup patches are applied:

-#define _PERL_HASH_WITH_STATE(state,str,len)

Yves
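The quoted macro fragment is truncated; what follows is only a hedged guess at the shape such a check might take if folded into the hashing macro, so that every keyed operation passes through it. _perl_hash_func and the macro body are assumptions, not actual perl source:

/* Hypothetical sketch only: names and placement are assumed. */
#define _PERL_HASH_WITH_STATE(state,str,len)                        \
    (LIKELY((len) <= (STRLEN)I32_MAX)                               \
        ? _perl_hash_func((state),(str),(len))                      \
        : (Perl_croak_nocontext("Sorry, hash keys must be smaller " \
                                "than 2**31 bytes"), 0))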
From @demerphq

On 19 May 2017 at 00:05, demerphq <demerphq@gmail.com> wrote:

I have merged Aaron's patch with mine and pushed it as b332a97d8b83526c8664d70288943f830d351ae9.

Thanks, Aaron!

Yves
Migrated from rt.perl.org#127742 (status was 'open')
Searchable as RT127742$