Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CVE-2018-6798] Heap-buffer-overflow in Perl__byte_dump_string (utf8.c) #16143

Closed
p5pRT opened this issue Sep 11, 2017 · 58 comments

Comments

@p5pRT
Copy link
Collaborator

@p5pRT p5pRT commented Sep 11, 2017

Migrated from rt.perl.org#132063 (status was 'resolved')

Searchable as RT132063$

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 11, 2017

From imdb95@gmail.com

Greetings,
With crafted regex match, I have found a heap-over-flow in function
Perl__byte_dump_string, which would lead to memory leak.

**********Build Date & Hardware**********
Version​: Version​: the dev version (https://perl5.git.perl.org/perl.git)
manh@​manh-VirtualBox​:~/Fuzzing/afl/perl$ ./perl/perl -v

This is perl 5, version 27, subversion 4 (v5.27.4 (v5.27.3-14-gd2dccc0))
built for x86_64-linux

Copyright 1987-2017, Larry Wall

Perl may be copied only under the terms of either the Artistic License or
the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http​://www.perl.org/, the Perl Home Page.


OS​: Ubuntu 16.04 Desktop
manh@​manh-VirtualBox​:~/Fuzzing/afl/perl$ uname -a
Linux manh-VirtualBox 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09​:04​:33
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


Compilation​:
AFL_USE_ASAN=1 ./Configure -des -Dusedevel -DDEBUGGING -Dcc=afl-clang-fast
-Doptimize=-O0\ -g && AFL_USE_ASAN=1 make

**********Reproduce**********
manh@​manh-VirtualBox​:~/Fuzzing/afl/perl$ ./perl/perl -e
'$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'

==11464==ERROR​: AddressSanitizer​: heap-buffer-overflow on address
0x602000001b3a at pc 0x000000c66a61 bp 0x7ffd2e7fafb0 sp 0x7ffd2e7fafa8
READ of size 1 at 0x602000001b3a thread T0
  #0 0xc66a60 in Perl__byte_dump_string
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:977​:39
  #1 0xc66a60 in S_unexpected_non_continuation_text
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1036
  #2 0xc66a60 in Perl_utf8n_to_uvchr_error
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1707
  #3 0xc5d678 in Perl__force_out_malformed_utf8_message
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:90​:12
  #4 0xc726ae in Perl__to_utf8_fold_flags
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:3583​:5
  #5 0xbf7b22 in S_find_byclass
/home/manh/Fuzzing/afl/perl/perl/regexec.c​:2626​:25
  #6 0xbde716 in Perl_regexec_flags
/home/manh/Fuzzing/afl/perl/perl/regexec.c​:3389​:13
  #7 0x91ee92 in Perl_pp_match
/home/manh/Fuzzing/afl/perl/perl/pp_hot.c​:2222​:10
  #8 0x8396bc in Perl_runops_debug
/home/manh/Fuzzing/afl/perl/perl/dump.c​:2486​:23
  #9 0x5e0342 in S_run_body /home/manh/Fuzzing/afl/perl/perl/perl.c
  #10 0x5e0342 in perl_run /home/manh/Fuzzing/afl/perl/perl/perl.c​:2484
  #11 0x5095cb in main /home/manh/Fuzzing/afl/perl/perl/perlmain.c​:154​:9
  #12 0x7fc01990082f in __libc_start_main
(/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
  #13 0x435928 in _start (/home/manh/Fuzzing/afl/perl/perl/perl+0x435928)

0x602000001b3a is located 0 bytes to the right of 10-byte region
[0x602000001b30,0x602000001b3a)
allocated by thread T0 here​:
  #0 0x4dc62c in malloc
/scratch/llvm/clang-4/xenial/final/llvm.src/projects/compiler-rt/lib/asan/asan_malloc_linux.cc​:66​:3
  #1 0x83ed2b in Perl_safesysmalloc
/home/manh/Fuzzing/afl/perl/perl/util.c​:153​:21

SUMMARY​: AddressSanitizer​: heap-buffer-overflow
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:977​:39 in Perl__byte_dump_string
Shadow bytes around the buggy address​:
  0x0c047fff8310​: fa fa 00 02 fa fa 00 06 fa fa 00 02 fa fa 00 02
  0x0c047fff8320​: fa fa 00 fa fa fa 00 02 fa fa 00 fa fa fa 00 03
  0x0c047fff8330​: fa fa 00 05 fa fa 02 fa fa fa fd fd fa fa fd fd
  0x0c047fff8340​: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c047fff8350​: fa fa 00 04 fa fa fd fa fa fa fd fd fa fa fd fd
=>0x0c047fff8360​: fa fa fd fa fa fa 00[02]fa fa fd fa fa fa fd fd
  0x0c047fff8370​: fa fa fd fd fa fa fd fa fa fa fd fd fa fa 00 02
  0x0c047fff8380​: fa fa fd fd fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c047fff8390​: fa fa fd fa fa fa fd fd fa fa fd fd fa fa fd fd
  0x0c047fff83a0​: fa fa fd fd fa fa fd fd fa fa fd fa fa fa 00 00
  0x0c047fff83b0​: fa fa fd fd fa fa fd fd fa fa 00 00 fa fa fd fd
Shadow byte legend (one shadow byte represents 8 application bytes)​:
  Addressable​: 00
  Partially addressable​: 01 02 03 04 05 06 07
  Heap left redzone​: fa
  Freed heap region​: fd
  Stack left redzone​: f1
  Stack mid redzone​: f2
  Stack right redzone​: f3
  Stack after return​: f5
  Stack use after scope​: f8
  Global redzone​: f9
  Global init order​: f6
  Poisoned by user​: f7
  Container overflow​: fc
  Array cookie​: ac
  Intra object redzone​: bb
  ASan internal​: fe
  Left alloca redzone​: ca
  Right alloca redzone​: cb
==11464==ABORTING
**********More info**********
If perl is compiled with gcc, in non-debug mode
  ./Configure -des -Dusedevel -Dcc=gcc -Doptimize=-O0\ -ggdb
then perl will leak some bytes after the crafted utf8 string​:
root@​manh-VirtualBox​:/home# ./perl/perl -MConfig -e 'print
$Config{optimize}, "\n"'
-O0 -ggdb
root@​manh-VirtualBox​:/home# ./perl/perl -MConfig -e 'print
$Config{ccflags}, "\n"'
-fwrapv -DDEBUGGING -fno-strict-aliasing -pipe -fstack-protector-strong
-I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
-D_FORTIFY_SOURCE=2
root@​manh-VirtualBox​:/home# ./perl/perl -MConfig -e 'print $Config{cc},
"\n"'
gcc
root@​manh-VirtualBox​:/home# ./perl/perl -e '$x="(?il)\x{100}|\x{100}";
"\xff" =~ /$x/;'
Malformed UTF-8 character​:
*\xff\x00\xfa\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00* (unexpected
non-continuation byte 0x00, immediately after start byte 0xff; need 13
bytes, got 1) in pattern match (m//) at -e line 1.
Malformed UTF-8 character (fatal) at -e line 1.
root@​manh-VirtualBox​:/home# ./perl/perl -e '$x="(?il)\x{100}|\x{100}";
"\xff" =~ /$x/;'
Malformed UTF-8 character​:
*\xff\x00\xb0\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00* (unexpected
non-continuation byte 0x00, immediately after start byte 0xff; need 13
bytes, got 1) in pattern match (m//) at -e line 1.
Malformed UTF-8 character (fatal) at -e line 1.

Best,
Manh

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

From @khwilliamson

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated the attached results to see whether the others are problematic or not. But these certainly are.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

From @khwilliamson

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated the attached results to see whether the others are problematic or not. But these certainly are.

1 similar comment
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

From @khwilliamson

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated the attached results to see whether the others are problematic or not. But these certainly are.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

@khwilliamson - Status changed from 'new' to 'open'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Sep 27, 2017

From @khwilliamson

On 09/26/2017 10​:28 PM, Karl Williamson via RT wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated the attached results to see whether the others are problematic or not. But these certainly are.

By "these", I meant the ones in regexec.c

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 22, 2018

From @khwilliamson

On 09/26/2017 11​:03 PM, Karl Williamson wrote​:

On 09/26/2017 10​:28 PM, Karl Williamson via RT wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES.  Fortunately there aren't many.  I haven't investigated
the attached results to see whether the others are problematic or
not.  But these certainly are.

By "these", I meant the ones in regexec.c

This should be fixed by the attached patch.

The consequences of this bug are that if a perl program were to allow
user input of patterns, this one could be used to read up to 11 bytes of
memory beyond the end of the scalar being matched.

I don't know if this is a real security issue or not.

I tried the test without using /l as a pattern modifier, and it did not
fail, so the attacker would have to furnish that, or the pattern would
need to be executed within the scope of 'use locale'

Tell me how to proceed.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 22, 2018

From @khwilliamson

0002-PATCH-perl-132063-Heap-buffer-overflow.patch
From 4c9e25a6146d63470ac98a35766bbf4f4b6db459 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Sun, 21 Jan 2018 17:48:22 -0700
Subject: [PATCH 2/2] PATCH: [perl #132063]: Heap buffer overflow

This was caused by several instances in regexec.c of the code assuming
that the input was valid UTF-8, whereas the input was too short for what
the start byte claimed it would be.

This commit changes to use the actual length available.

I grepped through the core for any other similar uses, and did not find
any.
---
 regexec.c              | 47 ++++++++++++++++++++++++++++-------------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 35 insertions(+), 19 deletions(-)

diff --git a/regexec.c b/regexec.c
index 9ee7e6bab5..0cdaa49487 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1626,7 +1626,13 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* Convenience, because 'e' is easier to use than a size */
+#define utf8_to_uvchr_buf_flags(s, e, lenp, flags)                             \
+                     utf8n_to_uvchr(s, (U8*)(e) - (U8*)(s), lenp, flags)
+
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1634,7 +1640,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end);                     \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1643,14 +1649,14 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8_to_uvchr_buf_flags( (const U8*) uscan, foldbuf + sizeof(foldbuf), &len, uniflags ); \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
         } else {                                                                    \
-            len = UTF8SKIP(uc);                                                     \
-            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen,  \
+            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc_end, foldbuf, &foldlen,    \
                                                                             flags); \
+            len = UTF8SKIP(uc);                                                     \
             skiplen = UVCHR_SKIP( uvc );                                            \
             foldlen -= skiplen;                                                     \
             uscan = foldbuf + skiplen;                                              \
@@ -1661,7 +1667,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8_to_uvchr_buf_flags( (const U8*) uscan, foldbuf + sizeof(foldbuf), &len, uniflags ); \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1676,11 +1682,11 @@ STMT_START {
     case trie_utf8l:                                                                \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end);          \
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8_to_uvchr_buf_flags( (const U8*) uc, uc_end, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -1821,7 +1827,7 @@ REXEC_FBC_SCAN( /* Loops while (s < strend) */                 \
     }                                                                          \
     else { /* Back-up to the start of the previous character */                \
         U8 * const r = reghop3((U8*)s, -1, (U8*)reginfo->strbeg);              \
-        tmp = utf8n_to_uvchr(r, (U8*) reginfo->strend - r,                     \
+        tmp = utf8_to_uvchr_buf_flags(r, (U8*) reginfo->strend,                     \
                                                        0, UTF8_ALLOW_DEFAULT); \
     }                                                                          \
     tmp = TEST_UV(tmp);                                                        \
@@ -2785,10 +2791,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -5847,9 +5853,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
-			charcount++;
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
+                        charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
 			if (charid &&
@@ -5983,8 +5990,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8_to_uvchr_buf_flags(uscan,
+                                                    foldbuf + sizeof(foldbuf),
+                                                    &len,
+                                                    uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
@@ -9559,7 +9568,7 @@ S_reginclass(pTHX_ regexp * const prog, const regnode * const n, const U8* const
     if (! UTF8_IS_INVARIANT(c) && utf8_target) {
         STRLEN c_len = 0;
         const U32 utf8n_flags = UTF8_ALLOW_DEFAULT;
-	c = utf8n_to_uvchr(p, p_end - p, &c_len, utf8n_flags | UTF8_CHECK_ONLY);
+	c = utf8_to_uvchr_buf_flags(p, p_end, &c_len, utf8n_flags | UTF8_CHECK_ONLY);
 	if (c_len == (STRLEN)-1) {
             _force_out_malformed_utf8_message(p, p_end,
                                               utf8n_flags,
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jan 25, 2018

From @tonycoz

On Tue, 26 Sep 2017 21​:28​:51 -0700, khw wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated
the attached results to see whether the others are problematic or not.
But these certainly are.

./pp_pack.c​:2585​: if (utf8) end -= UTF8_MAXLEN-1;
./pp_pack.c​:2598​: end = start+SvLEN(cat)-UTF8_MAXLEN;
./pp_pack.c​:2612​: end = start+SvLEN(cat)-UTF8_MAXLEN;
./pp_pack.c​:2644​: if (!utf8) end -= UTF8_MAXLEN;
./pp_pack.c​:2665​: end = start+SvLEN(cat)-UTF8_MAXLEN;

These can both lead to end pointing before the beginning of the allocated PV for the W and U pack formats.

This is classified as undefined behaviour by the C standard.

The end pointer is only used for comparisons to ensure that enough space is allocated when adding new characters to the buffer, with the small offset (UTF8_MAXLEN) the pointer is unlikely to wrap on modern systems.

There's no memory accesses through the end pointers, so while I think the pack() issues are bugs, I don't think they're security issues.

The consequences of this bug are that if a perl program were to allow
user input of patterns, this one could be used to read up to 11 bytes of
memory beyond the end of the scalar being matched.

I don't know if this is a real security issue or not.

It sounds like a security issue to me.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 1, 2018

From @tonycoz

On Wed, 24 Jan 2018 20​:26​:09 -0800, tonyc wrote​:

It sounds like a security issue to me.

Karl asked me in IRC what needs to be done with security patches.

I believe we need to do the following (not necessarily in this order).

1) produce a fix (Karl's done this), including fixes for currently supported maint branches. These should not be pushed to the public repository unless the issue is public.

Hopefully you'll get votes from the people monitoring the security list for the change to be included in maint.

2) Allocate a CVE number, this is done with the form at​:

https://cveform.mitre.org/

If the issue is not public yet then the description should be vague and the rest of form should not contain the details of the problem. You might want to think about those details for when we update the CVE.

I can do this if you can get me to understand the issue.

3) Notify downstreams, with the fixes. Last time Sawyer did this.

4) Deal with queries from downstreams, it's more than likely they'll want to know​:

a) whether the issue applies to the unsupported maint releases they ship, and if it does

b) whether the fixes work for those older maints

and possibly platform issues.

5) Pick a publication date in consultation with downstreams.

On publication date​:

6) push the fix to blead, maint

7) Update the CVE entry with the details of the issue (if the issue wasn't public.)

8) Once downstreams are ready, make the ticket public.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 1, 2018

From @khwilliamson

On 01/31/2018 08​:30 PM, Tony Cook via RT wrote​:

On Wed, 24 Jan 2018 20​:26​:09 -0800, tonyc wrote​:

It sounds like a security issue to me.

Karl asked me in IRC what needs to be done with security patches.

I believe we need to do the following (not necessarily in this order).

1) produce a fix (Karl's done this), including fixes for currently supported maint branches. These should not be pushed to the public repository unless the issue is public.

Hopefully you'll get votes from the people monitoring the security list for the change to be included in maint.

2) Allocate a CVE number, this is done with the form at​:

https://cveform.mitre.org/

If the issue is not public yet then the description should be vague and the rest of form should not contain the details of the problem. You might want to think about those details for when we update the CVE.

I can do this if you can get me to understand the issue.

So is this a real security problem?

The original code is

'$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'

The trick here is that the pattern has to compile as a trie. It doesn't
have to be this particular pattern. In UTF-8, the length of a sequence
to form a character is given by the first byte in the sequence. The
maximum length of a character is 13 bytes. If only a partial character
is furnished, and the code doesn't guard against it, it can lead to
reading beyond the buffer end. The maximum it could read is 12 bytes.

It turns out that tries are vulnerable, because they just assume there
is enough space, so they go ahead and try to access it. It is almost
certain that the bytes that are off the end won't be legal UTF-8
continuation bytes, so the translator will detect that this is
malformed, and will output a message showing what those bytes are
(unless it segfaults). Thus someone knowing that there is a trie, can
feed it input that will show them up to 12 bytes beyond the end of the
end of the buffer that their string is in.

3) Notify downstreams, with the fixes. Last time Sawyer did this.

4) Deal with queries from downstreams, it's more than likely they'll want to know​:

a) whether the issue applies to the unsupported maint releases they ship, and if it does

b) whether the fixes work for those older maints

and possibly platform issues.

5) Pick a publication date in consultation with downstreams.

On publication date​:

6) push the fix to blead, maint

7) Update the CVE entry with the details of the issue (if the issue wasn't public.)

8) Once downstreams are ready, make the ticket public.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 1, 2018

From @khwilliamson

On 01/24/2018 09​:26 PM, Tony Cook via RT wrote​:

On Tue, 26 Sep 2017 21​:28​:51 -0700, khw wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated
the attached results to see whether the others are problematic or not.
But these certainly are.

./pp_pack.c​:2585​: if (utf8) end -= UTF8_MAXLEN-1;
./pp_pack.c​:2598​: end = start+SvLEN(cat)-UTF8_MAXLEN;
./pp_pack.c​:2612​: end = start+SvLEN(cat)-UTF8_MAXLEN;
./pp_pack.c​:2644​: if (!utf8) end -= UTF8_MAXLEN;
./pp_pack.c​:2665​: end = start+SvLEN(cat)-UTF8_MAXLEN;

These can both lead to end pointing before the beginning of the allocated PV for the W and U pack formats.

This is classified as undefined behaviour by the C standard.

The end pointer is only used for comparisons to ensure that enough space is allocated when adding new characters to the buffer, with the small offset (UTF8_MAXLEN) the pointer is unlikely to wrap on modern systems.

There's no memory accesses through the end pointers, so while I think the pack() issues are bugs, I don't think they're security issues.

The consequences of this bug are that if a perl program were to allow
user input of patterns, this one could be used to read up to 11 bytes of
memory beyond the end of the scalar being matched.

I don't know if this is a real security issue or not.

It sounds like a security issue to me.

Tony

Hmm. I missed this email until just now. My analysis is slightly
different in the email I just sent. I said 12 bytes; 11 is more
realistic given the liklihood of a trailing NUL. And the attacker
doesn't have to specify the pattern. It could happen for various tries,
given malicious input to match against.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 1, 2018

From @tonycoz

On Wed, Jan 31, 2018 at 09​:00​:09PM -0700, Karl Williamson wrote​:

2) Allocate a CVE number, this is done with the form at​:

https://cveform.mitre.org/

If the issue is not public yet then the description should be vague and the rest of form should not contain the details of the problem. You might want to think about those details for when we update the CVE.

I can do this if you can get me to understand the issue.

So is this a real security problem?

The original code is

'$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'

The trick here is that the pattern has to compile as a trie. It doesn't
have to be this particular pattern. In UTF-8, the length of a sequence to
form a character is given by the first byte in the sequence. The maximum
length of a character is 13 bytes. If only a partial character is
furnished, and the code doesn't guard against it, it can lead to reading
beyond the buffer end. The maximum it could read is 12 bytes.

It turns out that tries are vulnerable, because they just assume there is
enough space, so they go ahead and try to access it. It is almost certain
that the bytes that are off the end won't be legal UTF-8 continuation bytes,
so the translator will detect that this is malformed, and will output a
message showing what those bytes are (unless it segfaults). Thus someone
knowing that there is a trie, can feed it input that will show them up to 12
bytes beyond the end of the end of the buffer that their string is in.

There's two potential attacks that I can see with this bug​:

1) the regexp engine can access beyond the end of a malloced block.
This can crash perl, leading to a denial of service if the perl
process is running as some sort of long-term process.

2) the error message may reveal sensitive information, eg. if the
memory used by the SV had been previously used for a password, it
could reveal part of the password.

This all assumes that the regexp engine is intended to work with
untrusted regexps - if it isn't, we should probably remove C< use re 'eval';>

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 1, 2018

From @xsawyerx

On 1 February 2018 at 04​:30, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Wed, 24 Jan 2018 20​:26​:09 -0800, tonyc wrote​:

It sounds like a security issue to me.

Karl asked me in IRC what needs to be done with security patches.

I believe we need to do the following (not necessarily in this order).

1) produce a fix (Karl's done this), including fixes for currently supported maint branches. These should not be pushed to the public repository unless the issue is public.

Hopefully you'll get votes from the people monitoring the security list for the change to be included in maint.

2) Allocate a CVE number, this is done with the form at​:

https://cveform.mitre.org/

If the issue is not public yet then the description should be vague and the rest of form should not contain the details of the problem. You might want to think about those details for when we update the CVE.

I can do this if you can get me to understand the issue.

Thanks, Tony.

3) Notify downstreams, with the fixes. Last time Sawyer did this.

Yes. I handle that.

4) Deal with queries from downstreams, it's more than likely they'll want to know​:

a) whether the issue applies to the unsupported maint releases they ship, and if it does

b) whether the fixes work for those older maints

and possibly platform issues.

This happens sometimes and when it does, I normally contact the person
who fixed it for more details, if I cannot answer it myself.

5) Pick a publication date in consultation with downstreams.

This is usually about two weeks, a month, or far longer. Vendors
usually need at least two weeks, but sometimes prefer a month or more.
It depends on several factors, including the urgency, the likliness of
it being publicly known or discovered soon, whether we also released a
maint with a fix, etc.

On publication date​:

6) push the fix to blead, maint

7) Update the CVE entry with the details of the issue (if the issue wasn't public.)

8) Once downstreams are ready, make the ticket public.

These can be done prior to a release, too, to avoid the fix being
public until a release is made.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @demerphq

On 27 September 2017 at 06​:28, Karl Williamson via RT
<rt-comment@​perl.org> wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space available.

I grepped blead for other possible problems with this, or its synonym UTF8_MAXBYTES. Fortunately there aren't many. I haven't investigated the attached results to see whether the others are problematic or not. But these certainly are.

I have reviewed this ticket and I disagree with Karls analysis.

  U8 foldbuf[ UTF8_MAXBYTES_CASE + 1 ];

Anyway, when we use UTF8_MAXBYTES in

uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags );

if you check you will find that uscan points into foldbuf[] mentioned
above. The code in

  uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len,
foldbuf, &foldlen, \

  flags); \

would have to construct a malformed utf8 sequence inside of foldbuf.

What is actually is happening is that the logic to choose what
strategy we use for reading data in the trie code does not account for
latin strings when under case folding locale dependent logic. The code
falls into the utf8 reading branch of the logic, which it should not
do.

What is needed is the patch I am attaching here for blead.

This affects v5.22 and later, so we need patches for 22, 24, and 26,
and one for blead.

Me personally, if i had found this without the security report I
would have just pushed the blead fix, so I dont know if these patches
need to sequestered orwhatever.

I will let you guys figure this out. All i know is that m//li is not
very common, or we would definitely have seen and heard about this bug
before.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @demerphq

blead_fix_132063.patch
commit d94aa431d493f1e6a8ad4ad335bad04f7ceccc82
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Feb 2 21:29:21 2018 +0100

    fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target

diff --git a/regexec.c b/regexec.c
index 08bf713..2b54a7d 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1805,7 +1805,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1815,10 +1815,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
@@ -1827,7 +1829,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1850,10 +1852,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 61b8c87..3db034c 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1989,7 +1989,7 @@ AB\s+\x{100}	AB \x{100}X	y	-	-
 ([[:ascii:]]+)\x81	a\x80b\x81	y	$&	b\x81
 [[:^ascii:]]+b	\x80a\x81b	y	$&	\x81b
 [[:^ascii:]]+b	\x80a\x81\x{100}b	y	$&	\x81\x{100}b
-
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @demerphq

v5224_fix_132063.patch
commit 02920531c2f12d3dcbbf885a4203b6dd4662dcb6
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Feb 2 21:29:21 2018 +0100

    v5.22.4: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target

diff --git a/regexec.c b/regexec.c
index b7335ae..25d4aaa 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1451,7 +1451,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1461,10 +1461,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
@@ -1473,7 +1475,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1495,10 +1497,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 85ce7f4..bac789e 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1941,6 +1941,7 @@ A+(*PRUNE)BC(?{})	AAABC	y	$&	AAABC
 .{1}??	-	c	-	Nested quantifiers
 .{1}?+	-	c	-	Nested quantifiers
 (?:.||)(?|)000000000@	000000000@	y	$&	000000000@		#  [perl #126405]
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @demerphq

v5243_fix_132063.patch
commit 6e4801235e12ad2e5f742955f7270f034789b34b
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Feb 2 21:29:21 2018 +0100

    v5.24.3: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target

diff --git a/regexec.c b/regexec.c
index 5735b99..2eef483 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1451,7 +1451,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1461,10 +1461,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
@@ -1473,7 +1475,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1495,10 +1497,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 7e8522d..ab7ddbb 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1968,6 +1968,7 @@ ab(?#Comment){2}c	abbc	y	$&	abbc
 (?:.||)(?|)000000000@	000000000@	y	$&	000000000@		#  [perl #126405]
 aa$|a(?R)a|a	aaa	y	$&	aaa		# [perl 128420] recursive matches
 (?:\1|a)([bcd])\1(?:(?R)|e)\1	abbaccaddedcb	y	$&	abbaccaddedcb		# [perl 128420] recursive match with backreferences
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @demerphq

v5261_fix_132063.patch
commit 80853ef9d835d1f24f3913a0b78eb3b5ca71e34f
Author: Yves Orton <demerphq@gmail.com>
Date:   Fri Feb 2 21:29:21 2018 +0100

    5.26.1: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target

diff --git a/regexec.c b/regexec.c
index 134b196..9b8581f 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1472,7 +1472,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1482,10 +1482,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 #define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
@@ -1494,7 +1496,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1517,10 +1519,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 410fcea..78baed6 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1985,6 +1985,7 @@ AB\s+\x{100}	AB \x{100}X	y	-	-
 /(?x)[a b]/xx	\N{SPACE}	yS	$&	 	# Note a space char here
 /(?xx)[a b]/x	\N{SPACE}	n	-	-
 /(?-x:[a b])/xx	\N{SPACE}	yS	$&	 	# Note a space char here
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 2, 2018

From @khwilliamson

I agree with Yves' analysis. Mine was flawed about the cause of this particular issue. I think I still think that when dealing with UTF-8, it's best to use the real space available size, but I'd have to examine it more closely in light of the new analysis.

What Yves didn't mention is that this bug requires the following things in order to manifest​:

1) the pattern is compiled under /il
2) the pattern does not contain any characters below 256
3) the target string is not UTF-8.

Items 1) and 2) together would be extremely rare in the field, unless the attacker gets to choose the pattern as well.

No non-UTF-8 string may match this pattern, so if we could just short circuit the rest of trying the match, it would work.
--
Karl Williamson

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2018

From @tonycoz

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

On 27 September 2017 at 06​:28, Karl Williamson via RT
<rt-comment@​perl.org> wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space
available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't
investigated the attached results to see whether the others are
problematic or not. But these certainly are.

I have reviewed this ticket and I disagree with Karls analysis.

U8 foldbuf[ UTF8_MAXBYTES_CASE + 1 ];

Anyway, when we use UTF8_MAXBYTES in

uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags
);

if you check you will find that uscan points into foldbuf[] mentioned
above. The code in

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len,
foldbuf, &foldlen, \

flags); \

would have to construct a malformed utf8 sequence inside of foldbuf.

No, it points into the string we're matching​:

tony@​mars​:.../git/perl$ valgrind --vgdb-error=1 --vgdb=full -q ./perl -Ilib -e '$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'
==25312==
==25312== TO DEBUG THIS PROCESS USING GDB​: start GDB like this
==25312== /path/to/gdb ./perl
==25312== and then give GDB the following command
==25312== target remote | /usr/lib/valgrind/../../bin/vgdb --pid=25312
==25312== --pid is optional if only one valgrind process is running
==25312==
==25312== Conditional jump or move depends on uninitialised value(s)
==25312== at 0x2E9522​: Perl__byte_dump_string (utf8.c​:991)
==25312== by 0x2E9629​: S_unexpected_non_continuation_text (utf8.c​:1036)
==25312== by 0x2EA52B​: Perl_utf8n_to_uvchr_msgs (utf8.c​:1935)
...

tony@​mars​:.../git/perl$ gdb ./perl
...
Reading symbols from lib/auto/re/re.so...done.
0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
  format=false) at utf8.c​:991
991 if (high_nibble < 10) {
(gdb) p start
$1 = (const U8 * const) 0x5f82be0 "\377"
(gdb) bt
#0 0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377",
  len=13, format=false) at utf8.c​:991
#1 0x00000000002e962a in S_unexpected_non_continuation_text (
  s=0x5f82be0 "\377", print_len=13, non_cont_byte_pos=1, expect_len=13)
  at utf8.c​:1036
#2 0x00000000002ea52c in Perl_utf8n_to_uvchr_msgs (s=0x5f82be1 "", curlen=1,
  retlen=0x0, flags=0, errors=0xffeffffb4, msgs=0x0) at utf8.c​:1935
#3 0x00000000002e88ee in Perl__force_out_malformed_utf8_message (
  p=0x5f82be0 "\377", e=0x5f82bed "", flags=0, die_here=true) at utf8.c​:90
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
  e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
  file=0x37041d "", line=0) at utf8.c​:3856
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
  s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
  at regexec.c​:3003
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
  stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
  minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
...
(gdb) up 6
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
  stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
  minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
3766 if (find_byclass(prog, c, s, strend, reginfo))
(gdb) p foldbuf
No symbol "foldbuf" in current context.
(gdb) down
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
  s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
  at regexec.c​:3003
3003 REXEC_TRIE_READ_CHAR(trie_type, trie,
(gdb) p foldbuf
$2 = "\000\000\000\000\000\000x\272\372\005\000\000\000"
(gdb) p &foldbuf
$3 = (U8 (*)[14]) 0xfff00023a
(gdb) down
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
  e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
  file=0x37041d "", line=0) at utf8.c​:3856

so the bytes being dumped come from the SV, not from foldbuf[]

If I continue until we get the invalid access​:

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000002e94c3 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
  format=false) at utf8.c​:978
978 const unsigned high_nibble = (*s & 0xF0) >> 4;
(gdb) p s
$4 = (const U8 *) 0x5f82bea ""

Still in the string we're matching.

What is actually is happening is that the logic to choose what
strategy we use for reading data in the trie code does not account for
latin strings when under case folding locale dependent logic. The code
falls into the utf8 reading branch of the logic, which it should not
do.

I suspect code that uses the :utf8 layer to read from a file could hit this bug with this confusion[1], but I tend to think that such code is buggy in itself.

I don't know if the regexp engine is intended to handle badly encoded SVf_UTF8 SVs.

What is needed is the patch I am attaching here for blead.

This affects v5.22 and later, so we need patches for 22, 24, and 26,
and one for blead.

Thanks for the version range, saves me having to work it out for the CVE.

Tony

[1] since :utf8 could return invalidly encoded SVf_UTF8 SVs.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2018

From @tonycoz

On Fri, 02 Feb 2018 13​:15​:17 -0800, khw wrote​:

I agree with Yves' analysis. Mine was flawed about the cause of this
particular issue. I think I still think that when dealing with UTF-8,
it's best to use the real space available size, but I'd have to
examine it more closely in light of the new analysis.

What Yves didn't mention is that this bug requires the following
things in order to manifest​:

1) the pattern is compiled under /il
2) the pattern does not contain any characters below 256
3) the target string is not UTF-8.

Items 1) and 2) together would be extremely rare in the field, unless
the attacker gets to choose the pattern as well.

No non-UTF-8 string may match this pattern, so if we could just short
circuit the rest of trying the match, it would work.

I can imagine an older application containing/managing non-Unicode data allowing a regular expression search that could trigger this issue.

CVE form entry for once this is public​:

Vulnerability type​:
  Buffer Overflow

Vendor of the product​:
  Perl5 Porters

Product​:
  perl

Version​:
  5.22 through 5.26

Has vendor confirmed or acknowledged the vulnerability?
  Yes

Attack vector​:
  Context dependent

Impact​:
  Denial of service (it can crash perl)
  Information disclosure (it exposes the bytes beyond the end of the NUL)

- the other choices for Impact are​:
  Code Execution - I don't think is this possible
  Escalation of Privilege - nope
  Other - not that I can think of

Affected Components​:
  regexec.c Perl_regexec_flags()

Attack vector

Matching a locale dependent regular expression against a non-UTF-8 encoded scalar could result in a heap buffer read overflow while reporting an error message. That error message includes bytes beyond the end of the string, and possibly beyond the end of the buffer, providing a potential information disclosure if the memory had contained any sensitive information.

Suggested description of the vulnerability for use in the CVE

  Matching a crafted locale dependent regular expression can cause a heap buffer read overflow and potentially information disclosure.

Discoverer(s)/Credits

  Nguyen Duc Manh <imdb95@​gmail.com>

Reference(s)

  https://rt.perl.org/Public/Bug/Display.html?id=132063

Additional Information

(blank)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2018

From @demerphq

On 5 February 2018 at 05​:32, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

On 27 September 2017 at 06​:28, Karl Williamson via RT
<rt-comment@​perl.org> wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space
available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't
investigated the attached results to see whether the others are
problematic or not. But these certainly are.

I have reviewed this ticket and I disagree with Karls analysis.

U8 foldbuf[ UTF8_MAXBYTES_CASE + 1 ];

Anyway, when we use UTF8_MAXBYTES in

uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags
);

if you check you will find that uscan points into foldbuf[] mentioned
above. The code in

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len,
foldbuf, &foldlen, \

flags); \

would have to construct a malformed utf8 sequence inside of foldbuf.

No, it points into the string we're matching​:

Yes, because it is not going through the correct code path. It is
trying to read a latin-1 string as utf8, and blowing up.

If it went through the right code path it would not blow up. IOW, Karl
analysed this being a problem in code that is not involved.

This ticket is extremely misleading FWIW.

The bug says its about Perl__byte_dump_string. It isnt. That is a
red-herring. That comes from the utf8 error throwing code which is
triggered by other logic failures further up the stack.

If you look at the stack the problem is caused here

#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856

while we are in

#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003

That Perl__to_utf8_fold_flags() is the same thing as

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen, flags);

And as you can see the *target* buffer is foldbuf. The *source* is the
string, which we have misidentified as utf8, even though we know it is
not as the utf8 flag is NOT set.

We should *never* end up in the code to read utf8, however a bug in
the logic for handling the EXACTLU8 variants of the TRIE code does not
account for the fact that target string might not be unicode (in which
case it cannot match), and mistakenly tries to read a _raw_ \xFF as
though it was utf8.

If you look at the patch I posted you can see how this was fixed.

tony@​mars​:.../git/perl$ valgrind --vgdb-error=1 --vgdb=full -q ./perl -Ilib -e '$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'
==25312==
==25312== TO DEBUG THIS PROCESS USING GDB​: start GDB like this
==25312== /path/to/gdb ./perl
==25312== and then give GDB the following command
==25312== target remote | /usr/lib/valgrind/../../bin/vgdb --pid=25312
==25312== --pid is optional if only one valgrind process is running
==25312==
==25312== Conditional jump or move depends on uninitialised value(s)
==25312== at 0x2E9522​: Perl__byte_dump_string (utf8.c​:991)
==25312== by 0x2E9629​: S_unexpected_non_continuation_text (utf8.c​:1036)
==25312== by 0x2EA52B​: Perl_utf8n_to_uvchr_msgs (utf8.c​:1935)
...

tony@​mars​:.../git/perl$ gdb ./perl
...
Reading symbols from lib/auto/re/re.so...done.
0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
format=false) at utf8.c​:991
991 if (high_nibble < 10) {
(gdb) p start
$1 = (const U8 * const) 0x5f82be0 "\377"
(gdb) bt
#0 0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377",
len=13, format=false) at utf8.c​:991
#1 0x00000000002e962a in S_unexpected_non_continuation_text (
s=0x5f82be0 "\377", print_len=13, non_cont_byte_pos=1, expect_len=13)
at utf8.c​:1036
#2 0x00000000002ea52c in Perl_utf8n_to_uvchr_msgs (s=0x5f82be1 "", curlen=1,
retlen=0x0, flags=0, errors=0xffeffffb4, msgs=0x0) at utf8.c​:1935
#3 0x00000000002e88ee in Perl__force_out_malformed_utf8_message (
p=0x5f82be0 "\377", e=0x5f82bed "", flags=0, die_here=true) at utf8.c​:90
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
...
(gdb) up 6
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
3766 if (find_byclass(prog, c, s, strend, reginfo))
(gdb) p foldbuf
No symbol "foldbuf" in current context.
(gdb) down
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003
3003 REXEC_TRIE_READ_CHAR(trie_type, trie,
(gdb) p foldbuf
$2 = "\000\000\000\000\000\000x\272\372\005\000\000\000"
(gdb) p &foldbuf
$3 = (U8 (*)[14]) 0xfff00023a
(gdb) down
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856

so the bytes being dumped come from the SV, not from foldbuf[]

That wasnt the point. That function reads a character of utf8, and
then writes the folded equivalent to a buffer, in this case foldbuf[].
The place were UTF8_MAXLEN is relevant is when we fold, we have to
assume the input string is wellformed, or could end up validating it
inside of a quadratic loop, which is unnacceptable in terms of
performance.

However in this case it is intentionally (albeit incorrectly) being fed latin1.

If I continue until we get the invalid access​:

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000002e94c3 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
format=false) at utf8.c​:978
978 const unsigned high_nibble = (*s & 0xF0) >> 4;
(gdb) p s
$4 = (const U8 *) 0x5f82bea ""

Still in the string we're matching.

And still not relevant. If you construct a sting with a single 0xFF
byte in it, and then feed it to ANY of our utf8 handling code it will
blow up.

The thing is you shouldnt be able to construct a string with a single
0xFF byte in it and have it be utf8-on.

What is actually is happening is that the logic to choose what
strategy we use for reading data in the trie code does not account for
latin strings when under case folding locale dependent logic. The code
falls into the utf8 reading branch of the logic, which it should not
do.

I suspect code that uses the :utf8 layer to read from a file could hit this bug with this confusion[1], but I tend to think that such code is buggy in itself.

Well, I would need to a see a proper case. This ticket isn't such.

I don't know if the regexp engine is intended to handle badly encoded SVf_UTF8 SVs.

No, it is not, and it shouldn't be. If you load malformed utf8 without
validating it you get to keep the results.

The regex engine might read the same utf8 character millions of times
during matching, we cannot defend against deliberate malformed input
without paying a disproportionate costs. I adamantly refuse to accept
that the regex engine should guard against malformed input in any way
other than doing a prescan over the target string. No such validation
should occur internally or it could end up being repeated over and
over and over and over and over and over.

What is needed is the patch I am attaching here for blead.

This affects v5.22 and later, so we need patches for 22, 24, and 26,
and one for blead.

Thanks for the version range, saves me having to work it out for the CVE.

Tony

[1] since :utf8 could return invalidly encoded SVf_UTF8 SVs.

Then that is a bug in the :utf8 layer, not in the regex engine, and
should be addressed there.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2018

From @khwilliamson

On 02/05/2018 01​:09 AM, demerphq wrote​:

On 5 February 2018 at 05​:32, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

On 27 September 2017 at 06​:28, Karl Williamson via RT
<rt-comment@​perl.org> wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space
available.

I grepped blead for other possible problems with this, or its synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't
investigated the attached results to see whether the others are
problematic or not. But these certainly are.

I have reviewed this ticket and I disagree with Karls analysis.

U8 foldbuf[ UTF8_MAXBYTES_CASE + 1 ];

Anyway, when we use UTF8_MAXBYTES in

uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags
);

if you check you will find that uscan points into foldbuf[] mentioned
above. The code in

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len,
foldbuf, &foldlen, \

flags); \

would have to construct a malformed utf8 sequence inside of foldbuf.

No, it points into the string we're matching​:

Yes, because it is not going through the correct code path. It is
trying to read a latin-1 string as utf8, and blowing up.

If it went through the right code path it would not blow up. IOW, Karl
analysed this being a problem in code that is not involved.

This ticket is extremely misleading FWIW.

The bug says its about Perl__byte_dump_string. It isnt. That is a
red-herring. That comes from the utf8 error throwing code which is
triggered by other logic failures further up the stack.

If you look at the stack the problem is caused here

#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856

while we are in

#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003

That Perl__to_utf8_fold_flags() is the same thing as

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen, flags);

And as you can see the *target* buffer is foldbuf. The *source* is the
string, which we have misidentified as utf8, even though we know it is
not as the utf8 flag is NOT set.

We should *never* end up in the code to read utf8, however a bug in
the logic for handling the EXACTLU8 variants of the TRIE code does not
account for the fact that target string might not be unicode (in which
case it cannot match), and mistakenly tries to read a _raw_ \xFF as
though it was utf8.

If you look at the patch I posted you can see how this was fixed.

tony@​mars​:.../git/perl$ valgrind --vgdb-error=1 --vgdb=full -q ./perl -Ilib -e '$x="(?il)\x{100}|\x{100}"; "\xff" =~ /$x/;'
==25312==
==25312== TO DEBUG THIS PROCESS USING GDB​: start GDB like this
==25312== /path/to/gdb ./perl
==25312== and then give GDB the following command
==25312== target remote | /usr/lib/valgrind/../../bin/vgdb --pid=25312
==25312== --pid is optional if only one valgrind process is running
==25312==
==25312== Conditional jump or move depends on uninitialised value(s)
==25312== at 0x2E9522​: Perl__byte_dump_string (utf8.c​:991)
==25312== by 0x2E9629​: S_unexpected_non_continuation_text (utf8.c​:1036)
==25312== by 0x2EA52B​: Perl_utf8n_to_uvchr_msgs (utf8.c​:1935)
...

tony@​mars​:.../git/perl$ gdb ./perl
...
Reading symbols from lib/auto/re/re.so...done.
0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
format=false) at utf8.c​:991
991 if (high_nibble < 10) {
(gdb) p start
$1 = (const U8 * const) 0x5f82be0 "\377"
(gdb) bt
#0 0x00000000002e9522 in Perl__byte_dump_string (start=0x5f82be0 "\377",
len=13, format=false) at utf8.c​:991
#1 0x00000000002e962a in S_unexpected_non_continuation_text (
s=0x5f82be0 "\377", print_len=13, non_cont_byte_pos=1, expect_len=13)
at utf8.c​:1036
#2 0x00000000002ea52c in Perl_utf8n_to_uvchr_msgs (s=0x5f82be1 "", curlen=1,
retlen=0x0, flags=0, errors=0xffeffffb4, msgs=0x0) at utf8.c​:1935
#3 0x00000000002e88ee in Perl__force_out_malformed_utf8_message (
p=0x5f82be0 "\377", e=0x5f82bed "", flags=0, die_here=true) at utf8.c​:90
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
...
(gdb) up 6
#6 0x00000000002d5117 in Perl_regexec_flags (rx=0x5f835e0,
stringarg=0x5f82be0 "\377", strend=0x5f82be1 "", strbeg=0x5f82be0 "\377",
minend=0, sv=0x5f735d8, data=0x0, flags=97) at regexec.c​:3766
3766 if (find_byclass(prog, c, s, strend, reginfo))
(gdb) p foldbuf
No symbol "foldbuf" in current context.
(gdb) down
#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0, c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003
3003 REXEC_TRIE_READ_CHAR(trie_type, trie,
(gdb) p foldbuf
$2 = "\000\000\000\000\000\000x\272\372\005\000\000\000"
(gdb) p &foldbuf
$3 = (U8 (*)[14]) 0xfff00023a
(gdb) down
#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0 "\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2 '\002',
file=0x37041d "", line=0) at utf8.c​:3856

so the bytes being dumped come from the SV, not from foldbuf[]

That wasnt the point. That function reads a character of utf8, and
then writes the folded equivalent to a buffer, in this case foldbuf[].
The place were UTF8_MAXLEN is relevant is when we fold, we have to
assume the input string is wellformed, or could end up validating it
inside of a quadratic loop, which is unnacceptable in terms of
performance.

However in this case it is intentionally (albeit incorrectly) being fed latin1.

If I continue until we get the invalid access​:

Program received signal SIGTRAP, Trace/breakpoint trap.
0x00000000002e94c3 in Perl__byte_dump_string (start=0x5f82be0 "\377", len=13,
format=false) at utf8.c​:978
978 const unsigned high_nibble = (*s & 0xF0) >> 4;
(gdb) p s
$4 = (const U8 *) 0x5f82bea ""

Still in the string we're matching.

And still not relevant. If you construct a sting with a single 0xFF
byte in it, and then feed it to ANY of our utf8 handling code it will
blow up.

The thing is you shouldnt be able to construct a string with a single
0xFF byte in it and have it be utf8-on.

What is actually is happening is that the logic to choose what
strategy we use for reading data in the trie code does not account for
latin strings when under case folding locale dependent logic. The code
falls into the utf8 reading branch of the logic, which it should not
do.

I suspect code that uses the :utf8 layer to read from a file could hit this bug with this confusion[1], but I tend to think that such code is buggy in itself.

Well, I would need to a see a proper case. This ticket isn't such.

I don't know if the regexp engine is intended to handle badly encoded SVf_UTF8 SVs.

No, it is not, and it shouldn't be. If you load malformed utf8 without
validating it you get to keep the results.

The regex engine might read the same utf8 character millions of times
during matching, we cannot defend against deliberate malformed input
without paying a disproportionate costs. I adamantly refuse to accept
that the regex engine should guard against malformed input in any way
other than doing a prescan over the target string. No such validation
should occur internally or it could end up being repeated over and
over and over and over and over and over.

Yves is correct that the trie code is wrong (and I made it wrong) and if
it were correct, this bug would not have happened. The problem is that
it is incorrectly branching to a portion of the code that expects it is
handling UTF-8. And the input isn't UTF-8, and the code that assumes it
is UTF-8 should not have been branched to.

But I also think we should look that any UTF-8 we handle isn't too long
for the space available. What I diagnosed was code that passed in a
made-up number as to the available length. We shouldn't be making up
numbers. The actual length is known, and we should use that. The
made-up number doesn't save any time, as the translator uses that value;
it might as well use the correct one.

So this bug wouldn't have manifested as being a security issue if we had
used the correct length, instead of a made-up one.

I want to emphasize that no extra time is required here to process using
the correct length as opposed to a made-up one.

What is needed is the patch I am attaching here for blead.

This affects v5.22 and later, so we need patches for 22, 24, and 26,
and one for blead.

Thanks for the version range, saves me having to work it out for the CVE.

Tony

[1] since :utf8 could return invalidly encoded SVf_UTF8 SVs.

Then that is a bug in the :utf8 layer, not in the regex engine, and
should be addressed there.

Yves

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 5, 2018

From @demerphq

On 5 February 2018 at 19​:04, Karl Williamson <public@​khwilliamson.com> wrote​:

On 02/05/2018 01​:09 AM, demerphq wrote​:

On 5 February 2018 at 05​:32, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

The regex engine might read the same utf8 character millions of times
during matching, we cannot defend against deliberate malformed input
without paying a disproportionate costs. I adamantly refuse to accept
that the regex engine should guard against malformed input in any way
other than doing a prescan over the target string. No such validation
should occur internally or it could end up being repeated over and
over and over and over and over and over.

Yves is correct that the trie code is wrong (and I made it wrong) and if it
were correct, this bug would not have happened. The problem is that it is
incorrectly branching to a portion of the code that expects it is handling
UTF-8. And the input isn't UTF-8, and the code that assumes it is UTF-8
should not have been branched to.

But I also think we should look that any UTF-8 we handle isn't too long for
the space available. What I diagnosed was code that passed in a made-up
number as to the available length.

Well that really isnt clear to me at all. I assume you mean the patch
from the 22nd.

The changes with utf8_to_uvchr_buf_flags() I am not fond of. The
UTF8_MAXLEN should just be replaced with "foldlen" in pretty much
every place it is used in the TRIE code. Once you do that replacement,
there is only one other use of utf8_to_uvchr_buf_flags which I would
prefer see just gets fixed explicitly.

What is left after that looks fine to me.

We shouldn't be making up numbers. The
actual length is known, and we should use that. The made-up number doesn't
save any time, as the translator uses that value; it might as well use the
correct one.

Sure, but we dont need to do the things that patch that does. foldlen
tells us how much is left in the buffer.

So this bug wouldn't have manifested as being a security issue if we had
used the correct length, instead of a made-up one.

Actually I beg to differ. The *security* issue comes from the utf8
code, and the logic in Perl__byte_dump_string().

The security issue doesn't come from the trie code at all, nor does it
come from any call utf8_to_uvchr_buf_flags(), but rather from the call
to

_toFOLD_utf8_flags()

and then from the internal error throwing logic in the utf8 code. If
we see a bad utf8 sequence we should not dump any of its bytes as it
is a security violation. Also notice that the problem in that code is
that it trusts UTFSKIP(), not anything to do with "a made up value",
like UTF8_MAXLEN.

If you look at the stack trace​:

READ of size 1 at 0x602000001b3a thread T0
  #0 0xc66a60 in Perl__byte_dump_string
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:977​:39
  #1 0xc66a60 in S_unexpected_non_continuation_text
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1036
  #2 0xc66a60 in Perl_utf8n_to_uvchr_error
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1707
  #3 0xc5d678 in Perl__force_out_malformed_utf8_message
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:90​:12
  #4 0xc726ae in Perl__to_utf8_fold_flags

The security issues comes from S_unexpected_non_continuation_text() or
Perl_byte_dump_string() leaking data from obviously corrupt input. To
fix the security issue we should fix them.

On the other hand the bug that leads to the security issue is that we
treat latin as utf8, and/or use UTF8-skip and not actual buffer length
in the call to _to_utf8_fold_flags(). But it doesnt have anything to
do with reading foldbuf, which is where we use UTF8_MAXLEN.

Anyway, yes we can and should change the UTF8_MAXLEN, and stop using
UTF8SKIP like this, but we should use foldlen, and do the other fixes
explicitly.

I want to emphasize that no extra time is required here to process using the
correct length as opposed to a made-up one.

Yeah I get it. Lets redo that patch reflecting the responses above and
then I see no problem with it.

Yves

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 6, 2018

From @tonycoz

On Mon, 05 Feb 2018 00​:09​:08 -0800, demerphq wrote​:

On 5 February 2018 at 05​:32, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

On 27 September 2017 at 06​:28, Karl Williamson via RT
<rt-comment@​perl.org> wrote​:

I looked at this enough so I believe I understand the cause.
In several places in the trie handling code in regexec.c, it
passes
UTF8_MAXLEN to utf8n_to_uvchr() instead of the actual space
available.

I grepped blead for other possible problems with this, or its
synonym
UTF8_MAXBYTES. Fortunately there aren't many. I haven't
investigated the attached results to see whether the others are
problematic or not. But these certainly are.

I have reviewed this ticket and I disagree with Karls analysis.

U8 foldbuf[ UTF8_MAXBYTES_CASE + 1 ];

Anyway, when we use UTF8_MAXBYTES in

uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags
);

if you check you will find that uscan points into foldbuf[]
mentioned
above. The code in

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len,
foldbuf, &foldlen, \

flags); \

would have to construct a malformed utf8 sequence inside of foldbuf.

No, it points into the string we're matching​:

Yes, because it is not going through the correct code path. It is
trying to read a latin-1 string as utf8, and blowing up.

If it went through the right code path it would not blow up. IOW, Karl
analysed this being a problem in code that is not involved.

It's true the bug isn't in Perl__byte_dump_string(), but that is where an attacker takes advantage of the bug.

This ticket is extremely misleading FWIW.

The bug says its about Perl__byte_dump_string. It isnt. That is a
red-herring. That comes from the utf8 error throwing code which is
triggered by other logic failures further up the stack.

The title simply says there's a heap buffer overflow in Perl__byte_dump_string(), which is the case, it just happens because of a bug elsewhere.

If you look at the stack the problem is caused here

#4 0x00000000002eeb44 in Perl__to_utf8_fold_flags (p=0x5f82be0
"\377",
e=0x5f82bed "", ustrp=0xfff00023a "", lenp=0xfff000120, flags=2
'\002',
file=0x37041d "", line=0) at utf8.c​:3856

while we are in

#5 0x00000000002d2bdc in S_find_byclass (prog=0x61254c0,
c=0x7059540,
s=0x5f82be0 "\377", strend=0x5f82be1 "", reginfo=0xfff000390)
at regexec.c​:3003

That Perl__to_utf8_fold_flags() is the same thing as

uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen,
flags);

And as you can see the *target* buffer is foldbuf. The *source* is the
string, which we have misidentified as utf8, even though we know it is
not as the utf8 flag is NOT set.

We should *never* end up in the code to read utf8, however a bug in
the logic for handling the EXACTLU8 variants of the TRIE code does not
account for the fact that target string might not be unicode (in which
case it cannot match), and mistakenly tries to read a _raw_ \xFF as
though it was utf8.

If you look at the patch I posted you can see how this was fixed.

Sorry, I misunderstood the point you were trying to make.

I don't know if the regexp engine is intended to handle badly encoded
SVf_UTF8 SVs.

No, it is not, and it shouldn't be. If you load malformed utf8 without
validating it you get to keep the results.

The regex engine might read the same utf8 character millions of times
during matching, we cannot defend against deliberate malformed input
without paying a disproportionate costs. I adamantly refuse to accept
that the regex engine should guard against malformed input in any way
other than doing a prescan over the target string. No such validation
should occur internally or it could end up being repeated over and
over and over and over and over and over.

The regular expression engine does have code supporting badly encoded inputs - all the calls to utf8n_to_uvchr() and its variants could be replaced with calls to a much simpler function with less error checking/handling if it wasn't trying to handle such badly encoded data (by croaking instead of crashing).

I think your change is needed to fix both the behaviour bug (treating the Latin-1 data as UTF-8) which results in a mistaken call that results in a heap buffer read overflow.

I think Karl's change is useful to improve robustness - it's not a direct fix the behaviour bug - which would have resulted in an error with only Karl's change, rather than a heap buffer read overflow.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 6, 2018

From @tonycoz

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 6, 2018

From @xsawyerx

Please let me know when we decide on a patch (or patches, depending on
version). I will notify vendors once I have it.

On 6 February 2018 at 07​:47, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 6, 2018

From @khwilliamson

On 02/05/2018 01​:38 PM, demerphq wrote​:

On 5 February 2018 at 19​:04, Karl Williamson <public@​khwilliamson.com> wrote​:

On 02/05/2018 01​:09 AM, demerphq wrote​:

On 5 February 2018 at 05​:32, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Fri, 02 Feb 2018 12​:41​:41 -0800, demerphq wrote​:

The regex engine might read the same utf8 character millions of times
during matching, we cannot defend against deliberate malformed input
without paying a disproportionate costs. I adamantly refuse to accept
that the regex engine should guard against malformed input in any way
other than doing a prescan over the target string. No such validation
should occur internally or it could end up being repeated over and
over and over and over and over and over.

Yves is correct that the trie code is wrong (and I made it wrong) and if it
were correct, this bug would not have happened. The problem is that it is
incorrectly branching to a portion of the code that expects it is handling
UTF-8. And the input isn't UTF-8, and the code that assumes it is UTF-8
should not have been branched to.

But I also think we should look that any UTF-8 we handle isn't too long for
the space available. What I diagnosed was code that passed in a made-up
number as to the available length.

Well that really isnt clear to me at all. I assume you mean the patch
from the 22nd.

The changes with utf8_to_uvchr_buf_flags() I am not fond of. The
UTF8_MAXLEN should just be replaced with "foldlen" in pretty much
every place it is used in the TRIE code. Once you do that replacement,
there is only one other use of utf8_to_uvchr_buf_flags which I would
prefer see just gets fixed explicitly.

What is left after that looks fine to me.

We shouldn't be making up numbers. The
actual length is known, and we should use that. The made-up number doesn't
save any time, as the translator uses that value; it might as well use the
correct one.

Sure, but we dont need to do the things that patch that does. foldlen
tells us how much is left in the buffer.

So this bug wouldn't have manifested as being a security issue if we had
used the correct length, instead of a made-up one.

Actually I beg to differ. The *security* issue comes from the utf8
code, and the logic in Perl__byte_dump_string().

The security issue doesn't come from the trie code at all, nor does it
come from any call utf8_to_uvchr_buf_flags(), but rather from the call
to

_toFOLD_utf8_flags()

and then from the internal error throwing logic in the utf8 code. If
we see a bad utf8 sequence we should not dump any of its bytes as it
is a security violation. Also notice that the problem in that code is
that it trusts UTFSKIP(), not anything to do with "a made up value",
like UTF8_MAXLEN.

If you look at the stack trace​:

READ of size 1 at 0x602000001b3a thread T0
#0 0xc66a60 in Perl__byte_dump_string
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:977​:39
#1 0xc66a60 in S_unexpected_non_continuation_text
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1036
#2 0xc66a60 in Perl_utf8n_to_uvchr_error
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:1707
#3 0xc5d678 in Perl__force_out_malformed_utf8_message
/home/manh/Fuzzing/afl/perl/perl/utf8.c​:90​:12
#4 0xc726ae in Perl__to_utf8_fold_flags

The security issues comes from S_unexpected_non_continuation_text() or
Perl_byte_dump_string() leaking data from obviously corrupt input. To
fix the security issue we should fix them.

On the other hand the bug that leads to the security issue is that we
treat latin as utf8, and/or use UTF8-skip and not actual buffer length
in the call to _to_utf8_fold_flags(). But it doesnt have anything to
do with reading foldbuf, which is where we use UTF8_MAXLEN.

Anyway, yes we can and should change the UTF8_MAXLEN, and stop using
UTF8SKIP like this, but we should use foldlen, and do the other fixes
explicitly.

I want to emphasize that no extra time is required here to process using the
correct length as opposed to a made-up one.

Yeah I get it. Lets redo that patch reflecting the responses above and
then I see no problem with it.

Yves

The point of dumping out the data is to save a lot of debugging time.
It's not clear if you are arguing that when we find a UTF-8 malformation
that we shouldn't give any data to help the recipient discern what the
problem is, out of fear of running off the end of the buffer.

It occurs to me as I write, that the dump function could assume that a
NUL it encounters is the terminating NUL of the string, and to stop
dumping there. NULs can occur in Perl strings, but not in these
contexts. That would have prevented this issue.

I could change blead to do this, marked as a defensive coding measure,
and not give an attacker any real information as to this vector.

The function already has a flag to indicate that this particular call
has doubt about if we're too close to the buffer end to safely display
the whole malformation. It then dumps just to the first error byte
(which we've already had to read in order to determine that it's malformed)

But it is also good housekeeping to not assume that there is sufficient
space, hence my original patch. In working with UTF-8 over the years, I
have also come to believe that it's easier and safer to give the upper
limit pointer in calling these functions, rather than a length, which
varies as one moves along the input string. which is what I did in my
original patch. But attached is a revision based on Yves' objections.
I do think it's better to pass foldbuf + sizeof(foldbuf) than foldlen,
because that avoids any problems with the calculated foldlen getting too
big. It's just safer.

And this is the final place in blead that still just just passes this
MAX value without checking, though there are some still that pass
UTF8SKIP as the length.

So now we have three things
This security bug becomes a non-security one if I just stop dumping a
malformation when it finds a NUL

Or, this security bug becomes non-security if the calls to the utf8
translator are changed to not assume that there is space available

Or, this security bug becomes non-security if the trie handler is fixed.

I now think the first one is what we should do for the security fix. It
also would prevent other similar problems that we haven't discovered
yet. I haven't looked at backporting. The dumping has been revised in
recent releases, so it may be that the trie bug isn't a security issue
except recently.

The attached patch is testing for something that won't happen if either
of the other two changes are done.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 6, 2018

From @khwilliamson

0004-Subject-PATCH-perl-132063-Heap-buffer-overflow.patch
From e4d9f67624f2547cb3a48377b7840e74ad024f9d Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 6 Feb 2018 14:50:48 -0700
Subject: [PATCH 4/4] Subject: PATCH: [perl #132063]: Heap buffer overflow

The proximal cause is several instances in regexec.c of the code
assuming that the input was valid UTF-8, whereas the input was too short
for what the start byte claimed it would be.

I grepped through the core for any other similar uses, and did not find
any.
---
 regexec.c              | 33 ++++++++++++++++++---------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/regexec.c b/regexec.c
index 31a133f20b..163e4c4c5b 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1820,7 +1820,9 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1828,7 +1830,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1837,14 +1839,14 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
         } else {                                                                    \
-            len = UTF8SKIP(uc);                                                     \
-            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen,  \
+            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc_end, foldbuf, &foldlen,    \
                                                                             flags); \
+            len = UTF8SKIP(uc);                                                     \
             skiplen = UVCHR_SKIP( uvc );                                            \
             foldlen -= skiplen;                                                     \
             uscan = foldbuf + skiplen;                                              \
@@ -1855,7 +1857,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1874,7 +1876,7 @@ STMT_START {
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8n_to_uvchr( (const U8*) uc, uc_end - uc, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -3000,10 +3002,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -6062,8 +6064,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
 			charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
@@ -6198,8 +6201,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8n_to_uvchr(uscan, foldlen, &len,
+                                                 uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 7, 2018

From @tonycoz

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 13, 2018

From @khwilliamson

On 02/07/2018 03​:23 PM, Tony Cook via RT wrote​:

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Tony

Attached is my utf8.c patch that solves this at the lowest level. I
wrote the commit message in such a way as to not indicate we're solving
a particular bug, and its far enough away from the scene of the crime
that we might be able to sneak it into blead earlier than CVE disclosure
date. I don't know.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 13, 2018

From @khwilliamson

0006-utf8.c-Don-t-dump-malformation-past-first-NUL.patch
From 2578c13f9160090fb3c40f799148df8dc60c328d Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 13 Feb 2018 07:03:43 -0700
Subject: [PATCH 6/6] utf8.c: Don't dump malformation past first NUL

When a UTF-8 string contains a malformation, the bytes are dumped out as
a debugging aid.  One should exercise caution, however, and not dump out
bytes that are actually past the end of the string.  Commit 99a765e9e37
from 2016 added the capability to signal to the dumping routines that
we're not sure where the string ends, and to dump the minimal possible.

It occurred to me that an additional safety measure can be easily added,
which this commit does.  And that is, in the dumping routines to stop at
the first NUL.  All strings automatically get a traiing NUL added, even
if they contain embedded NULs.  A NUL can never be part of a
malformation, and so its presence likely signals the end of the string.
---
 utf8.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index 9cd4fa0927..e586c120ae 100644
--- a/utf8.c
+++ b/utf8.c
@@ -1121,7 +1121,7 @@ Perl__byte_dump_string(pTHX_ const U8 * const start, const STRLEN len, const boo
 PERL_STATIC_INLINE char *
 S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
 
-                                         /* How many bytes to print */
+                                         /* Max number of bytes to print */
                                          STRLEN print_len,
 
                                          /* Which one is the non-continuation */
@@ -1137,6 +1137,8 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
                                ? "immediately"
                                : Perl_form(aTHX_ "%d bytes",
                                                  (int) non_cont_byte_pos);
+    const U8 * x = s + non_cont_byte_pos;
+    const U8 * e = s + print_len;
 
     PERL_ARGS_ASSERT_UNEXPECTED_NON_CONTINUATION_TEXT;
 
@@ -1144,10 +1146,20 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
      * calculated, it's likely faster to pass it; verify under DEBUGGING */
     assert(expect_len == UTF8SKIP(s));
 
+    /* As a defensive coding measure, don't output anything past a NUL.  Such
+     * bytes shouldn't be in the middle of a malformation, and could mark the
+     * end of the allocated string, and what comes after is undefined */
+    for (; x < e; x++) {
+        if (*x == '\0') {
+            x++;            /* Output this particular NUL */
+            break;
+        }
+    }
+
     return Perl_form(aTHX_ "%s: %s (unexpected non-continuation byte 0x%02x,"
                            " %s after start byte 0x%02x; need %d bytes, got %d)",
                            malformed_text,
-                           _byte_dump_string(s, print_len, 0),
+                           _byte_dump_string(s, x - s, 0),
                            *(s + non_cont_byte_pos),
                            where,
                            *s,
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 15, 2018

From @khwilliamson

On 02/07/2018 03​:23 PM, Tony Cook via RT wrote​:

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Tony

Attached a patch for 5.24, since the other one doesn't apply.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 15, 2018

From @khwilliamson

0375-5.24-maint-perl-132063-Heap-buffer-overflow-in-byte_.patch
From cb4f483cf79debd0177f5695f501f173ab24a986 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Wed, 14 Feb 2018 20:05:26 -0700
Subject: [PATCH 375/375] 5.24 maint [perl #132063] Heap-buffer-overflow in
 byte_dump_strin)

This is a specially formulated patch for 5.24 maint, as the regular
patch does not apply.  It adds a check and dies if malformed.  Later
releases dump what's in error as a debugging aid, but to keep this patch
minimal for a maint release, it dispenses with that, and just dies with
a generic message.
---
 regexec.c              | 32 +++++++++++++++++++-------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/regexec.c b/regexec.c
index 5735b997fd..cb3fb4bfdc 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1466,7 +1466,9 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1474,7 +1476,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1483,11 +1485,14 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
         } else {                                                                    \
+            if (uc + UTF8SKIP(uc) > uc_end) {                                       \
+                Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");              \
+            }                                                                       \
             uvc = _to_utf8_fold_flags( (const U8*) uc, foldbuf, &foldlen, flags);   \
             len = UTF8SKIP(uc);                                                     \
             skiplen = UVCHR_SKIP( uvc );                                            \
@@ -1500,7 +1505,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1519,7 +1524,7 @@ STMT_START {
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8n_to_uvchr( (const U8*) uc, uc_end - uc, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -2599,10 +2604,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -5511,8 +5516,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
 			charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
@@ -5642,8 +5648,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8n_to_uvchr(uscan, foldlen, &len,
+                                                 uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

On Wed, 07 Feb 2018 14​:23​:08 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Patches for 5.24 for this ticket, all merged and de-conflicted.

These apply after the changes in 131844 (though there should be no conflicts there.)

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.24-0002-perl-132063-Heap-buffer-overflow.patch
From 29231d73407542051a287cab5e18546e5a622f4a Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 6 Feb 2018 14:50:48 -0700
Subject: [perl #132063]: Heap buffer overflow

The proximal cause is several instances in regexec.c of the code
assuming that the input was valid UTF-8, whereas the input was too short
for what the start byte claimed it would be.

I grepped through the core for any other similar uses, and did not find
any.
---
 regexec.c              | 29 ++++++++++++++++-------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/regexec.c b/regexec.c
index 5735b997fd..ea432c39d3 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1466,7 +1466,9 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1474,7 +1476,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1483,7 +1485,7 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1500,7 +1502,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1519,7 +1521,7 @@ STMT_START {
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8n_to_uvchr( (const U8*) uc, uc_end - uc, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -2599,10 +2601,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -5511,8 +5513,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
 			charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
@@ -5642,8 +5645,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8n_to_uvchr(uscan, foldlen, &len,
+                                                 uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.24-0003-v5.24.3-fix-TRIE_READ_CHAR-and-DECL_TRIE_TYPE-to-acc.patch
From a59dc1f157bd0f626b6b864b9996480f9bac44aa Mon Sep 17 00:00:00 2001
From: Yves Orton <demerphq@gmail.com>
Date: Mon, 19 Feb 2018 13:49:46 +1100
Subject: v5.24.3: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for
 non-utf8 target

---
 regexec.c     | 14 ++++++++++----
 t/re/re_tests |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/regexec.c b/regexec.c
index ea432c39d3..ff8e89cb65 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1451,7 +1451,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1461,10 +1461,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 /* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
  * 'foldbuf+sizeof(foldbuf)' */
@@ -1475,7 +1477,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1497,10 +1499,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 7e8522da98..ab7ddbb848 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1968,6 +1968,7 @@ ab(?#Comment){2}c	abbc	y	$&	abbc
 (?:.||)(?|)000000000@	000000000@	y	$&	000000000@		#  [perl #126405]
 aa$|a(?R)a|a	aaa	y	$&	aaa		# [perl 128420] recursive matches
 (?:\1|a)([bcd])\1(?:(?R)|e)\1	abbaccaddedcb	y	$&	abbaccaddedcb		# [perl 128420] recursive match with backreferences
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.24-0004-perl-132063-we-should-no-longer-warn-for-this-code.patch
From 9dd4e0280eca2ba666cc0671ec3724610ed7d366 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 19 Feb 2018 15:11:42 +1100
Subject: (perl #132063) we should no longer warn for this code

The first patch for 132063 prevented the buffer read overflow when
dumping the warning but didn't fix the underlying problem.

The next change treats the supplied buffer correctly, preventing the
non-UTF-8 SV from being treated as UTF-8, preventing the warning.
---
 t/lib/warnings/regexec | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 6635142dea..c370ddc3c7 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -262,8 +262,5 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 EXPECT
 ########
 # NAME perl #132063, read beyond buffer end
-# OPTION fatal
 "\xff" =~ /(?il)\x{100}|\x{100}/;
 EXPECT
-Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
-Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

On Wed, 07 Feb 2018 14​:23​:08 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Patches for 5.26.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.26-0002-perl-132063-Heap-buffer-overflow.patch
From 7dc63e39d2d84920efd005092bc4d03b6ab24e1c Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 6 Feb 2018 14:50:48 -0700
Subject: [perl #132063]: Heap buffer overflow

The proximal cause is several instances in regexec.c of the code
assuming that the input was valid UTF-8, whereas the input was too short
for what the start byte claimed it would be.

I grepped through the core for any other similar uses, and did not find
any.
---
 regexec.c              | 33 ++++++++++++++++++---------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/regexec.c b/regexec.c
index 134b196fc4..fa888823bd 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1487,7 +1487,9 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1495,7 +1497,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1504,14 +1506,14 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
         } else {                                                                    \
-            len = UTF8SKIP(uc);                                                     \
-            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen,  \
+            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc_end, foldbuf, &foldlen,    \
                                                                             flags); \
+            len = UTF8SKIP(uc);                                                     \
             skiplen = UVCHR_SKIP( uvc );                                            \
             foldlen -= skiplen;                                                     \
             uscan = foldbuf + skiplen;                                              \
@@ -1522,7 +1524,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1541,7 +1543,7 @@ STMT_START {
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8n_to_uvchr( (const U8*) uc, uc_end - uc, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -2623,10 +2625,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -5686,8 +5688,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
 			charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
@@ -5822,8 +5825,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8n_to_uvchr(uscan, foldlen, &len,
+                                                 uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.26-0003-5.26.1-fix-TRIE_READ_CHAR-and-DECL_TRIE_TYPE-to-acco.patch
From d58a6811d830c2f2f850a03a18129c38cb732791 Mon Sep 17 00:00:00 2001
From: Yves Orton <demerphq@gmail.com>
Date: Tue, 13 Feb 2018 16:11:55 +1100
Subject: 5.26.1: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8
 target

---
 regexec.c     | 14 ++++++++++----
 t/re/re_tests |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/regexec.c b/regexec.c
index fa888823bd..cf81b07e30 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1472,7 +1472,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1482,10 +1482,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 /* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
  * 'foldbuf+sizeof(foldbuf)' */
@@ -1496,7 +1498,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1519,10 +1521,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 410fceadac..78baed6ffc 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1985,6 +1985,7 @@ AB\s+\x{100}	AB \x{100}X	y	-	-
 /(?x)[a b]/xx	\N{SPACE}	yS	$&	 	# Note a space char here
 /(?xx)[a b]/x	\N{SPACE}	n	-	-
 /(?-x:[a b])/xx	\N{SPACE}	yS	$&	 	# Note a space char here
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.26-0004-perl-132063-we-should-no-longer-warn-for-this-code.patch
From 3c4ea4886efdb477335f6931d0a553e818ee172f Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 19 Feb 2018 15:11:42 +1100
Subject: (perl #132063) we should no longer warn for this code

The first patch for 132063 prevented the buffer read overflow when
dumping the warning but didn't fix the underlying problem.

The next change treats the supplied buffer correctly, preventing the
non-UTF-8 SV from being treated as UTF-8, preventing the warning.
---
 t/lib/warnings/regexec | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 6635142dea..c370ddc3c7 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -262,8 +262,5 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 EXPECT
 ########
 # NAME perl #132063, read beyond buffer end
-# OPTION fatal
 "\xff" =~ /(?il)\x{100}|\x{100}/;
 EXPECT
-Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
-Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

5.26-0005-utf8.c-Don-t-dump-malformation-past-first-NUL.patch
From d72ba890c8d8ac800c9d00a1f542deca11551f33 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 13 Feb 2018 07:03:43 -0700
Subject: utf8.c: Don't dump malformation past first NUL

When a UTF-8 string contains a malformation, the bytes are dumped out as
a debugging aid.  One should exercise caution, however, and not dump out
bytes that are actually past the end of the string.  Commit 99a765e9e37
from 2016 added the capability to signal to the dumping routines that
we're not sure where the string ends, and to dump the minimal possible.

It occurred to me that an additional safety measure can be easily added,
which this commit does.  And that is, in the dumping routines to stop at
the first NUL.  All strings automatically get a traiing NUL added, even
if they contain embedded NULs.  A NUL can never be part of a
malformation, and so its presence likely signals the end of the string.
---
 utf8.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index a3d5f61b64..61346f0cb6 100644
--- a/utf8.c
+++ b/utf8.c
@@ -810,7 +810,7 @@ Perl__byte_dump_string(pTHX_ const U8 * s, const STRLEN len, const bool format)
 PERL_STATIC_INLINE char *
 S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
 
-                                         /* How many bytes to print */
+                                         /* Max number of bytes to print */
                                          STRLEN print_len,
 
                                          /* Which one is the non-continuation */
@@ -826,6 +826,8 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
                                ? "immediately"
                                : Perl_form(aTHX_ "%d bytes",
                                                  (int) non_cont_byte_pos);
+    const U8 * x = s + non_cont_byte_pos;
+    const U8 * e = s + print_len;
 
     PERL_ARGS_ASSERT_UNEXPECTED_NON_CONTINUATION_TEXT;
 
@@ -833,10 +835,20 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
      * calculated, it's likely faster to pass it; verify under DEBUGGING */
     assert(expect_len == UTF8SKIP(s));
 
+    /* As a defensive coding measure, don't output anything past a NUL.  Such
+     * bytes shouldn't be in the middle of a malformation, and could mark the
+     * end of the allocated string, and what comes after is undefined */
+    for (; x < e; x++) {
+        if (*x == '\0') {
+            x++;            /* Output this particular NUL */
+            break;
+        }
+    }
+
     return Perl_form(aTHX_ "%s: %s (unexpected non-continuation byte 0x%02x,"
                            " %s after start byte 0x%02x; need %d bytes, got %d)",
                            malformed_text,
-                           _byte_dump_string(s, print_len, 0),
+                           _byte_dump_string(s, x - s, 0),
                            *(s + non_cont_byte_pos),
                            where,
                            *s,
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

On Wed, 07 Feb 2018 14​:23​:08 -0800, tonyc wrote​:

On Mon, 05 Feb 2018 21​:47​:34 -0800, tonyc wrote​:

On Sun, 04 Feb 2018 20​:40​:45 -0800, tonyc wrote​:

CVE form entry for once this is public​:

I've requested a CVE id for this issue.

This is CVE-2018-6798

Patches for blead.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

blead-0002-perl-132063-Heap-buffer-overflow.patch
From af3b52eb96c0db51ca5efb7c605f59b4c89c3ae3 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 6 Feb 2018 14:50:48 -0700
Subject: [perl #132063]: Heap buffer overflow

The proximal cause is several instances in regexec.c of the code
assuming that the input was valid UTF-8, whereas the input was too short
for what the start byte claimed it would be.

I grepped through the core for any other similar uses, and did not find
any.
---
 regexec.c              | 33 ++++++++++++++++++---------------
 t/lib/warnings/regexec |  7 +++++++
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/regexec.c b/regexec.c
index 1cda2e8e88..47793e6e16 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1822,7 +1822,9 @@ Perl_re_intuit_start(pTHX_
                                            ? trie_utf8_fold                         \
                                            :   trie_latin_utf8_fold)))
 
-#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
+/* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
+ * 'foldbuf+sizeof(foldbuf)' */
+#define REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc, uc_end, uscan, len, uvc, charid, foldlen, foldbuf, uniflags) \
 STMT_START {                                                                        \
     STRLEN skiplen;                                                                 \
     U8 flags = FOLD_FLAGS_FULL;                                                     \
@@ -1830,7 +1832,7 @@ STMT_START {
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
         if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
-            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc + UTF8SKIP(uc));          \
+            _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
     case trie_utf8_exactfa_fold:                                                    \
@@ -1839,14 +1841,14 @@ STMT_START {
     case trie_utf8_fold:                                                            \
       do_trie_utf8_fold:                                                            \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
         } else {                                                                    \
-            len = UTF8SKIP(uc);                                                     \
-            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc + len, foldbuf, &foldlen,  \
+            uvc = _toFOLD_utf8_flags( (const U8*) uc, uc_end, foldbuf, &foldlen,    \
                                                                             flags); \
+            len = UTF8SKIP(uc);                                                     \
             skiplen = UVCHR_SKIP( uvc );                                            \
             foldlen -= skiplen;                                                     \
             uscan = foldbuf + skiplen;                                              \
@@ -1857,7 +1859,7 @@ STMT_START {
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
-            uvc = utf8n_to_uvchr( (const U8*) uscan, UTF8_MAXLEN, &len, uniflags ); \
+            uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
             uscan += len;                                                           \
             len=0;                                                                  \
@@ -1876,7 +1878,7 @@ STMT_START {
         }                                                                           \
         /* FALLTHROUGH */                                                           \
     case trie_utf8:                                                                 \
-        uvc = utf8n_to_uvchr( (const U8*) uc, UTF8_MAXLEN, &len, uniflags );        \
+        uvc = utf8n_to_uvchr( (const U8*) uc, uc_end - uc, &len, uniflags );        \
         break;                                                                      \
     case trie_plain:                                                                \
         uvc = (UV)*uc;                                                              \
@@ -3038,10 +3040,10 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s,
                     }
                     points[pointpos++ % maxlen]= uc;
                     if (foldlen || uc < (U8*)strend) {
-                        REXEC_TRIE_READ_CHAR(trie_type, trie,
-                                         widecharmap, uc,
-                                         uscan, len, uvc, charid, foldlen,
-                                         foldbuf, uniflags);
+                        REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
+                                             (U8 *) strend, uscan, len, uvc,
+                                             charid, foldlen, foldbuf,
+                                             uniflags);
                         DEBUG_TRIE_EXECUTE_r({
                             dump_exec_pos( (char *)uc, c, strend,
                                         real_start, s, utf8_target, 0);
@@ -6100,8 +6102,9 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 		    if ( base && (foldlen || uc < (U8*)(reginfo->strend))) {
 			I32 offset;
 			REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
-					     uscan, len, uvc, charid, foldlen,
-					     foldbuf, uniflags);
+                                             (U8 *) reginfo->strend, uscan,
+                                             len, uvc, charid, foldlen,
+                                             foldbuf, uniflags);
 			charcount++;
 			if (foldlen>0)
 			    ST.longfold = TRUE;
@@ -6236,8 +6239,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog)
 			while (foldlen) {
 			    if (!--chars)
 				break;
-			    uvc = utf8n_to_uvchr(uscan, UTF8_MAXLEN, &len,
-					    uniflags);
+			    uvc = utf8n_to_uvchr(uscan, foldlen, &len,
+                                                 uniflags);
 			    uscan += len;
 			    foldlen -= len;
 			}
diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 900dd6ee7f..6635142dea 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -260,3 +260,10 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 "k" =~ /(?[ \N{KELVIN SIGN} ])/i;
 ":" =~ /(?[ \: ])/;
 EXPECT
+########
+# NAME perl #132063, read beyond buffer end
+# OPTION fatal
+"\xff" =~ /(?il)\x{100}|\x{100}/;
+EXPECT
+Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
+Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

blead-0003-fix-TRIE_READ_CHAR-and-DECL_TRIE_TYPE-to-account-for.patch
From 7cb7018008ced2cac377126364a3842d3f91d275 Mon Sep 17 00:00:00 2001
From: Yves Orton <demerphq@gmail.com>
Date: Wed, 14 Feb 2018 10:29:26 +1100
Subject: fix TRIE_READ_CHAR and DECL_TRIE_TYPE to account for non-utf8 target

---
 regexec.c     | 14 ++++++++++----
 t/re/re_tests |  2 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/regexec.c b/regexec.c
index 47793e6e16..1630b676ac 100644
--- a/regexec.c
+++ b/regexec.c
@@ -1807,7 +1807,7 @@ Perl_re_intuit_start(pTHX_
 #define DECL_TRIE_TYPE(scan) \
     const enum { trie_plain, trie_utf8, trie_utf8_fold, trie_latin_utf8_fold,       \
                  trie_utf8_exactfa_fold, trie_latin_utf8_exactfa_fold,              \
-                 trie_utf8l, trie_flu8 }                                            \
+                 trie_utf8l, trie_flu8, trie_flu8_latin }                           \
                     trie_type = ((scan->flags == EXACT)                             \
                                  ? (utf8_target ? trie_utf8 : trie_plain)           \
                                  : (scan->flags == EXACTL)                          \
@@ -1817,10 +1817,12 @@ Perl_re_intuit_start(pTHX_
                                          ? trie_utf8_exactfa_fold                   \
                                          : trie_latin_utf8_exactfa_fold)            \
                                       : (scan->flags == EXACTFLU8                   \
-                                         ? trie_flu8                                \
+                                         ? (utf8_target                             \
+                                           ? trie_flu8                              \
+                                           : trie_flu8_latin)                       \
                                          : (utf8_target                             \
                                            ? trie_utf8_fold                         \
-                                           :   trie_latin_utf8_fold)))
+                                           : trie_latin_utf8_fold)))
 
 /* 'uscan' is set to foldbuf, and incremented, so below the end of uscan is
  * 'foldbuf+sizeof(foldbuf)' */
@@ -1831,7 +1833,7 @@ STMT_START {
     switch (trie_type) {                                                            \
     case trie_flu8:                                                                 \
         _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
-        if (utf8_target && UTF8_IS_ABOVE_LATIN1(*uc)) {                             \
+        if (UTF8_IS_ABOVE_LATIN1(*uc)) {                                            \
             _CHECK_AND_OUTPUT_WIDE_LOCALE_UTF8_MSG(uc, uc_end - uc);                \
         }                                                                           \
         goto do_trie_utf8_fold;                                                     \
@@ -1854,10 +1856,14 @@ STMT_START {
             uscan = foldbuf + skiplen;                                              \
         }                                                                           \
         break;                                                                      \
+    case trie_flu8_latin:                                                           \
+        _CHECK_AND_WARN_PROBLEMATIC_LOCALE;                                         \
+        goto do_trie_latin_utf8_fold;                                               \
     case trie_latin_utf8_exactfa_fold:                                              \
         flags |= FOLD_FLAGS_NOMIX_ASCII;                                            \
         /* FALLTHROUGH */                                                           \
     case trie_latin_utf8_fold:                                                      \
+      do_trie_latin_utf8_fold:                                                      \
         if ( foldlen>0 ) {                                                          \
             uvc = utf8n_to_uvchr( (const U8*) uscan, foldlen, &len, uniflags );     \
             foldlen -= len;                                                         \
diff --git a/t/re/re_tests b/t/re/re_tests
index 61b8c875e2..3db034c67e 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -1989,7 +1989,7 @@ AB\s+\x{100}	AB \x{100}X	y	-	-
 ([[:ascii:]]+)\x81	a\x80b\x81	y	$&	b\x81
 [[:^ascii:]]+b	\x80a\x81b	y	$&	\x81b
 [[:^ascii:]]+b	\x80a\x81\x{100}b	y	$&	\x81\x{100}b
-
+(?il)\x{100}|\x{100}|\x{FF}	\xFF	y	$&	\xFF
 
 # Keep these lines at the end of the file
 # vim: softtabstop=0 noexpandtab
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

blead-0004-perl-132063-we-should-no-longer-warn-for-this-code.patch
From f7234720f553e29e4629b88d3928052def520da3 Mon Sep 17 00:00:00 2001
From: Tony Cook <tony@develop-help.com>
Date: Mon, 19 Feb 2018 15:11:42 +1100
Subject: (perl #132063) we should no longer warn for this code

The first patch for 132063 prevented the buffer read overflow when
dumping the warning but didn't fix the underlying problem.

The next change treats the supplied buffer correctly, preventing the
non-UTF-8 SV from being treated as UTF-8, preventing the warning.
---
 t/lib/warnings/regexec | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/t/lib/warnings/regexec b/t/lib/warnings/regexec
index 6635142dea..c370ddc3c7 100644
--- a/t/lib/warnings/regexec
+++ b/t/lib/warnings/regexec
@@ -262,8 +262,5 @@ setlocale(&POSIX::LC_CTYPE, $utf8_locale);
 EXPECT
 ########
 # NAME perl #132063, read beyond buffer end
-# OPTION fatal
 "\xff" =~ /(?il)\x{100}|\x{100}/;
 EXPECT
-Malformed UTF-8 character: \xff (too short; 1 byte available, need 13) in pattern match (m//) at - line 2.
-Malformed UTF-8 character (fatal) at - line 2.
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 20, 2018

From @tonycoz

blead-0005-utf8.c-Don-t-dump-malformation-past-first-NUL.patch
From f348493a414bc87cf78b4c882c409e2d4b775595 Mon Sep 17 00:00:00 2001
From: Karl Williamson <khw@cpan.org>
Date: Tue, 13 Feb 2018 07:03:43 -0700
Subject: utf8.c: Don't dump malformation past first NUL

When a UTF-8 string contains a malformation, the bytes are dumped out as
a debugging aid.  One should exercise caution, however, and not dump out
bytes that are actually past the end of the string.  Commit 99a765e9e37
from 2016 added the capability to signal to the dumping routines that
we're not sure where the string ends, and to dump the minimal possible.

It occurred to me that an additional safety measure can be easily added,
which this commit does.  And that is, in the dumping routines to stop at
the first NUL.  All strings automatically get a traiing NUL added, even
if they contain embedded NULs.  A NUL can never be part of a
malformation, and so its presence likely signals the end of the string.
---
 utf8.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index a30182a43f..4bec553d56 100644
--- a/utf8.c
+++ b/utf8.c
@@ -1121,7 +1121,7 @@ Perl__byte_dump_string(pTHX_ const U8 * const start, const STRLEN len, const boo
 PERL_STATIC_INLINE char *
 S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
 
-                                         /* How many bytes to print */
+                                         /* Max number of bytes to print */
                                          STRLEN print_len,
 
                                          /* Which one is the non-continuation */
@@ -1137,6 +1137,8 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
                                ? "immediately"
                                : Perl_form(aTHX_ "%d bytes",
                                                  (int) non_cont_byte_pos);
+    const U8 * x = s + non_cont_byte_pos;
+    const U8 * e = s + print_len;
 
     PERL_ARGS_ASSERT_UNEXPECTED_NON_CONTINUATION_TEXT;
 
@@ -1144,10 +1146,20 @@ S_unexpected_non_continuation_text(pTHX_ const U8 * const s,
      * calculated, it's likely faster to pass it; verify under DEBUGGING */
     assert(expect_len == UTF8SKIP(s));
 
+    /* As a defensive coding measure, don't output anything past a NUL.  Such
+     * bytes shouldn't be in the middle of a malformation, and could mark the
+     * end of the allocated string, and what comes after is undefined */
+    for (; x < e; x++) {
+        if (*x == '\0') {
+            x++;            /* Output this particular NUL */
+            break;
+        }
+    }
+
     return Perl_form(aTHX_ "%s: %s (unexpected non-continuation byte 0x%02x,"
                            " %s after start byte 0x%02x; need %d bytes, got %d)",
                            malformed_text,
-                           _byte_dump_string(s, print_len, 0),
+                           _byte_dump_string(s, x - s, 0),
                            *(s + non_cont_byte_pos),
                            where,
                            *s,
-- 
2.11.0

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Feb 24, 2018

From @xsawyerx

The public disclosure date is set for March 15th.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Mar 18, 2018

From @tonycoz

On Sat, 24 Feb 2018 08​:56​:44 -0800, xsawyerx@​gmail.com wrote​:

The public disclosure date is set for March 15th.

This has been delayed.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Mar 19, 2018

From @xsawyerx

The disclosure date has been postponed officially to April 14th.

On 19 March 2018 at 00​:13, Tony Cook via RT
<perl5-security-report-followup@​perl.org> wrote​:

On Sat, 24 Feb 2018 08​:56​:44 -0800, xsawyerx@​gmail.com wrote​:

The public disclosure date is set for March 15th.

This has been delayed.

Tony

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 14, 2018

@xsawyerx - Status changed from 'open' to 'pending release'

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Apr 14, 2018

From @xsawyerx

On Mon, 19 Mar 2018 15​:26​:47 -0700, xsawyerx@​gmail.com wrote​:

The disclosure date has been postponed officially to April 14th.

Moved to public queue.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release yesterday of Perl 5.28.0, this and 185 other issues have been
resolved.

Perl 5.28.0 may be downloaded via​:
https://metacpan.org/release/XSAWYERX/perl-5.28.0

If you find that the problem persists, feel free to reopen this ticket.

@p5pRT

This comment has been minimized.

Copy link
Collaborator Author

@p5pRT p5pRT commented Jun 23, 2018

@khwilliamson - Status changed from 'pending release' to 'resolved'

@p5pRT p5pRT closed this Jun 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant
You can’t perform that action at this time.