-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
unpack fails on utf-8 strings #7742
Comments
From pcg@goof.comCreated by pcg@goof.comThe following program should output "65535" twice, but doesn't: use Convert::Scalar; The program creates the string "\xff\xff" and runs it through unpack, once The result must be the same in both cases (same string content), but the As the internal encoding (wether latin1 or utf8) does NOT change the (I found this bug because for some reason perl upgraded my string to The solution is to downgrade the string to latin1 before converting it Perl Info
|
From @nwc10On Sun, Jan 09, 2005 at 10:19:43PM -0000, Marc Lehmann wrote:
I agree. Well, I thought I did. Then..
However, I'm confused. There is this code in pp_pack.c: #ifdef PACKED_IS_OCTETS and the default is the #else clause. If I recompile with -DPACKED_IS_OCTETS Failed 40 test scripts out of 903, 95.57% okay. which doesn't look great. It looks like some cases in unpack expect to find Nicholas Clark |
The RT System itself - Status changed from 'new' to 'open' |
From nick@ing-simmons.netNicholas Clark <nick@ccl4.org> writes:
I know I agree ;-)
I am reasonably sure that was my code, and the #ifdef backs it out
Then snag is that with 5.6 using pack/unpack and messing with SvUTF8 flag |
From pcg@goof.comOn Mon, Jan 10, 2005 at 02:35:18PM -0000, Nick Ing-Simmons via RT <perlbug-followup@perl.org> wrote:
Reminds me of mozilla, "MSIE has the same bug, so we don't follow the RFC
Hmm..
Which shouldn't be a problem, as perl-5.6 unicode code will not work on The problem with this bug is that, on the perl level, there is no easy way Not fixing this bug means that a programmer must ALWAYS downgrade the Bug-compatibility with earlier versions of perl doesn't seem like a reason to (I know you agree, but if you are still overruled, it might be useful to At the very least this breakage should be documented, and a workaround -- |
From pcg@goof.comOn Mon, Jan 10, 2005 at 01:42:14PM -0000, Nicholas Clark via RT <perlbug-followup@perl.org> wrote:
Yupp - pack makes octets out of data, unpack makes data out of octets.
Uh.
Do you know which cases? I'd say (Without knowing them :) that they are I cannot find any conversion operator that would make sense when feed with
That would still break "b" and would have questionable semantics on "a" I frankly cannot see any reason why >255 characters can make any sense as -- |
From @nwc10On Tue, Jan 11, 2005 at 06:08:26PM +0100, Marc A. Lehmann wrote:
I didn't know, but looking at the pack implementation, it's 'U', and only 'U': $ ./perl -Ilib -we 'use Devel::Peek; Dump pack "U", 256' It expects UTF8 on the way back in, *marked* with the UTF8 flag. $ ./perl -Ilib -MCarp -we 'use Devel::Peek; $a = pack "U", 256; utf8::encode $a; Dump $a; Dump unpack "U", $a' I'm about to test a hack that might make most things work. Nicholas Clark |
From @nwc10On Thu, Jan 13, 2005 at 01:56:33PM +0000, Nicholas Clark wrote:
Seems to be 'C' and 'U'
After changing t/op/join.t to avoid using H* to probe the innards of UTF8 Nicholas Clark ==== //depot/perl/pp_pack.c#52 - /Users/nick/p4perl/perl/pp_pack.c ==== Inline Patch--- /tmp/tmp.5559.0 Thu Jan 13 21:21:50 2005
+++ /Users/nick/p4perl/perl/pp_pack.c Thu Jan 13 16:26:17 2005
@@ -1869,7 +1869,13 @@ PP(pp_unpack)
*/
register char *s = SvPVbyte(right, rlen);
#else
- register char *s = SvPV(right, rlen);
+ /* This is a hack. Only the "U" pattern requires Unicode input, so
+ downgrade everything else. We're assuing that no-one is mad enough
+ to mix U patterns and regular packed data. This will, of course, be
+ wrong.
+ */
+ register char *s = (strchr (pat, 'U') || strchr (pat, 'C'))
+ ? SvPV(right, rlen) : SvPVbyte(right, rlen);
#endif
char *strend = s + rlen;
register char *patend = pat + llen; |
From pcg@goof.comOn Thu, Jan 13, 2005 at 09:24:16PM -0000, Nicholas Clark via RT <perlbug-followup@perl.org> wrote:
But "C" is documented as: An unsigned char value. Only does bytes. See U for I assume that "byte" == "octet", which is not generally true in perl, but While "U" is documented as: A Unicode character number. Encodes to UTF-8 If that indeed encodes to a unicode string, not to UTF-8 (as the confusing
Well, then how do you propose to fix the situaiton? The current behaviour The question, if this is not the right fix, is what the semantics of As of now, I don't see how I can control this on the perl level, as perl In any case, the current state of undocumented random breakage is a Try this: $x = "\xff\xff"; die unpack "n", $x; Can you guess the output of this program without running it? For every $x = "\xff\xff"; die unpack "n", "$x$y"; So concatenation with an empty string suddenly changes how the firts two It gets worse: use utf8; Now the output suddenly depends on thenormalization form the text editor Does I/O to a utf-8-encoded file change the unpack outcome? Does Why have all of the above code snippets defined behaviour regardless of Upgrading behaviour can change with every perl version, and often enough If that is supposed behaviour, then I would be happy if it's semantics Forcing dveelopers to paste a call to Encode between every scalar and In any case, if the old behaviour is to stay it simply needs to be defined This is just a time bomb (and it exploded in my code, and might do so in -- |
From perl5-porters@ton.iguana.beSome points I forgot to make in the previous mail: - If the first character in the unpackstring is U it now behaves like A few bugfixes and improvements relative to the previous patch: Inline Patch--- pp_pack.c.1 Sat Jan 22 14:43:22 2005
+++ pp_pack.c Sat Jan 22 16:06:04 2005
@@ -668,21 +668,32 @@
STATIC bool
need_utf8(const char *pat, const char *patend)
{
- if (pat >= patend) return FALSE;
- if (*pat == 'U') return TRUE;
+ bool first = TRUE;
while (pat < patend) {
if (pat[0] == '#') {
pat++;
pat = memchr(pat, '\n', patend-pat);
if (!pat) return FALSE;
} else if (pat[0] == 'U') {
- if (pat[1] == '0') return TRUE;
- }
+ if (first || pat[1] == '0') return TRUE;
+ } else first = FALSE;
pat++;
}
return FALSE;
}
+STATIC char
+first_symbol(const char *pat, const char *patend) {
+ while (pat < patend) {
+ if (pat[0] != '#') return pat[0];
+ pat++;
+ pat = memchr(pat, '\n', patend-pat);
+ if (!pat) return 0;
+ pat++;
+ }
+ return 0;
+}
+
/*
=for apidoc unpack_str
@@ -707,7 +718,7 @@
flags |= FLAG_UNPACK_DO_UTF8;
}
- if (pat < patend && *pat == 'U') {
+ if (first_symbol(pat, patend) == 'U') {
/*
if (!(flags & FLAG_UNPACK_DO_UTF8))
Perl_croak(aTHX_ "U0 mode on a byte string");
@@ -747,7 +758,7 @@
flags |= FLAG_UNPACK_DO_UTF8;
}
- if (pat < patend && *pat == 'U') {
+ if (first_symbol(pat, patend) == 'U') {
/*
if (!(flags & FLAG_UNPACK_DO_UTF8))
Perl_croak(aTHX_ "U0 mode on a byte string");
@@ -774,7 +785,7 @@
warnings), but these two mean we make no progress in the string and
might enter an infinite loop */
if (retlen == (STRLEN) -1 || retlen == 0)
- Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+ Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
if (val >= 0x100) Perl_croak(aTHX_ "'%c' applied to character value %"UVf,
(int) datumtype, val);
*s += retlen;
@@ -782,37 +793,58 @@
}
STATIC bool
-next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
+next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
{
UV val;
STRLEN retlen;
char *from = *s;
- val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
- if (val >= 0x100 || !ISUUCHAR(val) ||
- retlen == (STRLEN) -1 || retlen == 0) {
- *out = 0;
- return FALSE;
+ int bad = 0;
+ U32 flags = ckWARN(WARN_UTF8) ?
+ UTF8_CHECK_ONLY : (UTF8_CHECK_ONLY | UTF8_ALLOW_ANY);
+ for (;buf_len > 0; buf_len--) {
+ if (from >= end) return FALSE;
+ val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen, flags));
+ if (retlen == (STRLEN) -1 || retlen == 0) {
+ from += UTF8SKIP(from);
+ bad |= 1;
+ } else from += retlen;
+ if (val >= 0x100) {
+ bad |= 2;
+ val &= 0xff;
+ }
+ *(unsigned char *)buf++ = val;
+ }
+ /* We have enough characters for the buffer. Did we have problems ? */
+ if (bad) {
+ if (bad & 1) {
+ /* Rewalk the string fragment while warning */
+ char *ptr;
+ flags = ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY;
+ for (ptr = *s; ptr < from; ptr += UTF8SKIP(ptr))
+ utf8n_to_uvuni(ptr, end-ptr, &retlen, flags);
+ if (from > end) from = end;
+ }
+ if ((bad & 2) && ckWARN(WARN_UNPACK))
+ Perl_warner(aTHX_ packWARN(WARN_UNPACK),
+ "Character(s) wrapped in unpack");
}
- *out = PL_uudmap[val] & 077;
*s = from;
return TRUE;
}
STATIC bool
-next_uni_bytes(pTHX_ char **s, const char *end, char *buf, int buf_len)
+next_uni_uu(pTHX_ char **s, const char *end, I32 *out)
{
UV val;
STRLEN retlen;
char *from = *s;
- while (buf_len > 0) {
- if (from >= end) return 1;
- val = UNI_TO_NATIVE(utf8n_to_uvuni(from, end-from, &retlen,
- UTF8_CHECK_ONLY));
- if (val >= 0x100 || retlen == (STRLEN) -1 || retlen == 0) return FALSE;
- from += retlen;
- *(unsigned char *)buf++ = val;
- buf_len--;
+ val = UNI_TO_NATIVE(utf8n_to_uvuni(*s, end-*s, &retlen, UTF8_CHECK_ONLY));
+ if (val >= 0x100 || !ISUUCHAR(val) ||
+ retlen == (STRLEN) -1 || retlen == 0) {
+ *out = 0;
+ return FALSE;
}
+ *out = PL_uudmap[val] & 077;
*s = from;
return TRUE;
}
@@ -884,6 +916,8 @@
symptr->level++;
PUTBACK;
while (len--) {
+ if (utf8) symptr->flags |= FLAG_UNPACK_PARSE_UTF8;
+ else symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
symptr->patptr = savsym.grpbeg;
unpack_rec(symptr, s, strbeg, strend, &s);
if (s == strend && savsym.howlen == e_star)
@@ -952,7 +986,7 @@
I32 l = 0;
for (hop = strbeg; hop < s; hop += UTF8SKIP(hop)) l++;
if (s != hop)
- Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+ Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
ai32 = l % len;
} else ai32 = (s - strbeg) % len;
if (ai32 == 0) break;
@@ -985,12 +1019,12 @@
for (l=len, hop=s; l>0; l--, hop += UTF8SKIP(hop)) {
if (hop >= strend) {
if (hop > strend)
- Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+ Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
break;
}
}
if (hop > strend)
- Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+ Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
len = hop - s;
} else if (len > strend - s)
len = strend - s;
@@ -1168,16 +1202,9 @@
break;
case 'C':
if (len == 0) {
- if (literal) {
+ if (literal)
/* Switch to "natural" mode */
- if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
- symptr->flags |= FLAG_UNPACK_PARSE_UTF8;
- utf8 = 1;
- } else {
- symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
- utf8 = 0;
- }
- }
+ utf8 = (symptr->flags & FLAG_UNPACK_DO_UTF8) ? 1 : 0;
break;
}
uchar_checksum:
@@ -1215,14 +1242,11 @@
if (len == 0) {
if (literal) {
/* Switch to "bytes in utf-8" mode */
- if (symptr->flags & FLAG_UNPACK_DO_UTF8) {
- utf8 = 0;
- symptr->flags &= ~FLAG_UNPACK_PARSE_UTF8;
- } else {
+ if (symptr->flags & FLAG_UNPACK_DO_UTF8) utf8 = 0;
+ else
/* Should be impossible due to the need_utf8() test */
Perl_croak(aTHX_ "U0 mode on a byte string");
}
- }
break;
}
if (len > strend - s) len = strend - s;
@@ -1235,7 +1259,7 @@
STRLEN retlen;
UV auv = utf8n_to_uvuni((U8*)s, strend - s, &retlen, ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANYUV);
if (retlen == (STRLEN) -1 || retlen == 0)
- Perl_croak(aTHX_ "malformed UTF-8 string in unpack");
+ Perl_croak(aTHX_ "Malformed UTF-8 string in unpack");
s += retlen;
if (!checksum)
PUSHs(sv_2mortal(newSVuv((UV) auv))); |
From @nwc10On Tue, Jan 11, 2005 at 06:08:26PM +0100, Marc A. Lehmann wrote:
I agree, but I suspect that there are also programs out there which are
I'm never so confident about that. The testsuite may be buggy, but I tend to On Fri, Jan 14, 2005 at 05:18:32PM +0100, Marc A. Lehmann wrote:
I think Ton's semantics as incorporated into 5.9.x are right (or at least a
I agree. There is already this fleeting reference to the current behaviour in =item * I've merged Ton's code into maint, disabled all the new features, and kept I don't like this compromise, but it seems less likely to trigger new bugs Nicholas Clark |
From perl@nevcal.comOn approximately 4/27/2006 8:04 AM, came the following characters from
You produced some amazing magic with loadable Regexp work, to get Dave's But it makes me wonder if Ton's un/pack changes could be turned into a use feature newpack; (I forget the syntax you used for the new regexp stuff) --
|
From schmorp@schmorp.deOn Thu, Apr 27, 2006 at 10:05:19PM -0700, Glenn Linderman via RT <perlbug-followup@perl.org> wrote:
It's probably not worth it: I think its important to fix unpack. Wetehr However, wether it gets fixed with 5.9 or 5.8 is a matter of taste, and if Just my thoughts... -- |
From guest@guest.guest.xxxxxxxxCould this issue be revisited? I get tons of reports from people who Example: people transfer a binary string via JSON::XS. JSON::XS might These kind of bugs actively hurt perls reputation for being If you really want to keep unpack's broken semantics, then you *need* to can this bug be fixed? it is still in bleedperl, after so many years. it is trivial to fix: just downgrade the argument string, this is (the same problem is SvPV, which should be aliases to SvPVbyte, as this Users do rightly expect that *the same* string gives the *the same* |
From guest@guest.guest.xxxxxxxxThe actual exmaple, btw, is: unpack ('CCCCVCC', $$string); fatc is, programs use unpack to extratc structs regularly. They do not The othe rusage, using unpack to peek at internal implementation details Giving "C" any semantics other than "extract the next byte", compatible Adding another specifier that ignores bytes and peeks at the internal This needs to be fixed ASAP. The more people fall into that trap the |
From @cpansproutThis was fixed in 5.8.9 and 5.10.0. Didn’t you fix this? |
@cpansprout - Status changed from 'open' to 'resolved' |
From schmorp@schmorp.deOn Thu, Oct 06, 2011 at 02:24:35PM -0700, Father Chrysostomos via RT <perlbug-followup@perl.org> wrote:
My memory is fuzzy, and this is long ago, so, could well be... In any case, yes, its indeed fixed - thanks! -- |
Migrated from rt.perl.org#33734 (status was 'resolved')
Searchable as RT33734$
The text was updated successfully, but these errors were encountered: