Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash while using parse with non-utf-8 data from file #2476

Closed
semarie opened this issue Jan 7, 2022 · 11 comments
Closed

crash while using parse with non-utf-8 data from file #2476

semarie opened this issue Jan 7, 2022 · 11 comments

Comments

@semarie
Copy link

semarie commented Jan 7, 2022

The following script makes rebol3 binary to crash on OpenBSD (it is fine on Linux where I also tested).

If I correctly isolated the problem, it is related to the input, which contains insecable space in number as mille separator (1 917,00) in iso-8859-1 (or some windows) encoding, and to the fact that a file! is used (directly using data doesn't make rebol3 crash).

I made a self contained example with downloading of csv-tools.r script from rebol.org, and test.csv file creation on the fly.

REBOL []

secure [
  file throw
  net  throw
  %./  allow
]

print "loading csv-tools.r"
if ! exists? %csv-tools.r [
	save %csv-tools.r load/all http://www.rebol.org/download-a-script.r?script-name=csv-tools.r
]
do load %csv-tools.r

;-- 1234;ABCDEF;1 917,00;1 948,00;1,93\r\n
data: #{
  31 32 33 34 3b 41 42 43  44 45 46 3b 31 a0 39 31
  37 2c 30 30 3b 31 a0 39  34 38 2c 30 30 3b 31 2c
  39 33 0d 0a
}

print "writing file"
write/binary %test.csv data

print "loading csv from binary"
print load-csv/with data ";"

print "loading csv from file"
print load-csv/with %test.csv ";"

print "done"

The output is the following:

$ rebol3-bulk-openbsd-x64 -s test.r
loading csv-tools.r
writing file
loading csv from binary
1234 ABCDEF 1�917,00 1�948,00 1,93
loading csv from file
Segmentation fault (core dumped) 

The gdb backtrace is the following:

Program received signal SIGSEGV, Segmentation fault.
0x00000cb01b0136ac in To_Thru (parse=0x7f7ffffc15b0, index=<optimized out>, block=0xcb26b5e1570, is_thru=0) at src/core/u-parse.c:416
416                                     if (!HAS_CASE(parse)) ch1 = UP_CASE(ch1);
(gdb) bt
#0  0x00000cb01b0136ac in To_Thru (parse=0x7f7ffffc15b0, index=<optimized out>, block=0xcb26b5e1570, is_thru=0) at src/core/u-parse.c:416
#1  Parse_To (parse=0x7f7ffffc15b0, index=12, item=0xcb26b5e1570, is_thru=0) at src/core/u-parse.c:511
#2  Parse_Rules_Loop (parse=0x7f7ffffc15b0, index=12, rules=0xcb26b5e1570, depth=5) at src/core/u-parse.c:907
#3  0x00000cb01b01302c in Parse_Rules_Loop (parse=0x7f7ffffc15b0, index=12, rules=0xcb2a6288a70, depth=4) at src/core/u-parse.c:948
#4  0x00000cb01b01302c in Parse_Rules_Loop (parse=0x7f7ffffc15b0, index=12, rules=0xcb26b5e1890, depth=3) at src/core/u-parse.c:948
#5  0x00000cb01b01302c in Parse_Rules_Loop (parse=0x7f7ffffc15b0, index=11, rules=0xcb2b98114b0, depth=2) at src/core/u-parse.c:948
#6  0x00000cb01b01302c in Parse_Rules_Loop (parse=0x7f7ffffc15b0, index=0, rules=0xcb26b5e1830, depth=1) at src/core/u-parse.c:948
#7  0x00000cb01b0125d1 in Parse_Series (val=<optimized out>, rules=0x2, flags=<optimized out>, depth=0) at src/core/u-parse.c:96
#8  N_parse (ds=0xcb2c66e8900) at src/core/u-parse.c:1313
#9  0x00000cb01afada32 in Do_Native (func=<optimized out>) at src/core/c-function.c:317
#10 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=40, op=0) at src/core/c-do.c:923
#11 0x00000cb01afa53c3 in Do_Blk (block=0xcb2cf69c370, index=37) at src/core/c-do.c:1055
#12 0x00000cb01afaddb8 in Do_Function (func=<optimized out>) at src/core/c-function.c:443
#13 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=32, op=0) at src/core/c-do.c:923
#14 0x00000cb01afa6816 in Do_Args (func_offset=<optimized out>, path=0x0, block=0xcb2cf6981b0, index=29) at src/core/c-do.c:702
#15 0x00000cb01afa6205 in Do_Next (block=<optimized out>, index=28, op=0) at src/core/c-do.c:915
#16 0x00000cb01afa53c3 in Do_Blk (block=0xcb2cf6981b0, index=28) at src/core/c-do.c:1055
#17 0x00000cb01afd4312 in N_do (ds=0xcb2c66e8480) at src/core/n-control.c:588
#18 0x00000cb01afada32 in Do_Native (func=<optimized out>) at src/core/c-function.c:317
#19 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=6, op=0) at src/core/c-do.c:923
#20 0x00000cb01afa53c3 in Do_Blk (block=0xcb24dffe650, index=3) at src/core/c-do.c:1055
#21 0x00000cb01afd454d in N_either (ds=<optimized out>) at src/core/n-control.c:676
#22 0x00000cb01afada32 in Do_Native (func=<optimized out>) at src/core/c-function.c:317
#23 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=21, op=0) at src/core/c-do.c:923
#24 0x00000cb01afa53c3 in Do_Blk (block=0xcb24dffe6d0, index=12) at src/core/c-do.c:1055
#25 0x00000cb01afd454d in N_either (ds=<optimized out>) at src/core/n-control.c:676
#26 0x00000cb01afada32 in Do_Native (func=<optimized out>) at src/core/c-function.c:317
#27 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=11, op=0) at src/core/c-do.c:923
#28 0x00000cb01afa53c3 in Do_Blk (block=0xcb24dffe710, index=5) at src/core/c-do.c:1055
#29 0x00000cb01afd461c in N_if (ds=<optimized out>) at src/core/n-control.c:703
#30 0x00000cb01afada32 in Do_Native (func=<optimized out>) at src/core/c-function.c:317
#31 0x00000cb01afa62a5 in Do_Next (block=<optimized out>, index=116, op=0) at src/core/c-do.c:923
#32 0x00000cb01afa53c3 in Do_Blk (block=0xcb24dffe730, index=112) at src/core/c-do.c:1055
#33 0x00000cb01afaddb8 in Do_Function (func=<optimized out>) at src/core/c-function.c:443
#34 0x00000cb01afa7db9 in Apply_Function (wblk=<optimized out>, widx=<optimized out>, func=0xcb239172000, args=<optimized out>) at src/core/c-do.c:1520
#35 0x00000cb01afa8001 in Do_Sys_Func (inum=<optimized out>) at src/core/c-do.c:1580
#36 0x00000cb01afa8f8c in Init_Mezz (reserved=<optimized out>) at src/core/c-do.c:2317
#37 0x00000cb01afa1787 in RL_Start (bin=<optimized out>, len=<optimized out>, flags=<optimized out>) at src/core/a-lib.c:186
#38 0x00000cb01b075222 in main (argc=<optimized out>, argv=<optimized out>) at src/os/host-main.c:337
@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

It is fine on macOS as well :/

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

When checking the script, the difference between using data directly and from file is, that when using the file without /binary the parsed data are first converted to string (and so it expects UTF8 data source). If I remember it well, parsing string is by default case insensitive while binary is sensitive, and so the location of the crash is clear: UP_CASE(ch1).

Now just find out, why it crashes on OpenBSD and not elsewhere. (Will try it on Windows later)

@Oldes Oldes added Type.bug Type.parse Parse related issues labels Jan 7, 2022
@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

It's OK also on FreeBSD and Windows... so the crash is only on OpenBSD.

@semarie
Copy link
Author

semarie commented Jan 7, 2022

OpenBSD could be very sensitive to bugs and lot of code is made to make the program crashes if it tries to do something it shouldn't.

With your tip about UP_CASE(ch1), I did some debug: at crash time, ch1 value is 65533.

The macro is defined as #define UP_CASE(c) Upper_Cases[c], and Upper_Cases is a memory segment of 23552 bytes (Upper_Cases = Make_Mem(UNICODE_CASES * sizeof(REBUNI)), UNICODE_CASES = 0x2E00 and sizeof(REBUNI) = 2). So it crashes because of the read out-of-bound.

@semarie
Copy link
Author

semarie commented Jan 7, 2022

For testing purpose, the following diff should make any OS to abort(3) on array out-of-bound read.

blob - c19e0ed8a3e78da1aeec394c78e5d57c2411d09f
file + src/core/u-parse.c
--- src/core/u-parse.c
+++ src/core/u-parse.c
@@ -413,7 +413,12 @@ no_result:
 				REBCNT ch1 = GET_ANY_CHAR(series, index);
 				REBCNT ch2;
 
-				if (!HAS_CASE(parse)) ch1 = UP_CASE(ch1);
+				if (!HAS_CASE(parse)) {
+					if ((0 <= ch1) && (ch1 < UNICODE_CASES))
+						ch1 = UP_CASE(ch1);
+					else
+						abort();
+				}
 
 				// Handle special string types:
 				if (IS_CHAR(item)) {

I will continue to look at it tomorrow.

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

You are right... the 65533 value is UNI_REPLACEMENT_CHAR and is out of the case conversion table... on Windows (and probably in other systems is the ch1 silently filled with a garbage value. So there probably must be the ch1 < UNICODE_CASES test. I should again start using Valgrind in CI tests. This is exactly the case which would be found.

Btw... good tip is, that when compiled with _DEBUG define, one can use dump function to see internal value structures.

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

Test can be simplified to just:

parse to-string #{A032} [to ["2"]]
; or
parse to-string #{A032} [to [#"2"]]
; or
parse to-string #{A032} [thru ["2"]]
; and also
f: to-string #{A0} parse "a" [to [f]]
; and
parse to-string #{A0} [#"a"]
; or
c: to char! 65533 parse "a" [c]
; or
parse [#"a"][c]
; and also not parse related:
equal? [#"a"] reduce [c]

Everywhere, where is used UP_CASE or LO_CASE in the code without bound check.

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

Question is, if such a conversion should be allowed. For example in Red it is not:

>> to-string #{A032}
*** Access Error: invalid UTF-8 encoding: #{A0322D6E}

Especially when it is possible to get correct chars using iconv function:

>> iconv #{31A032} 'iso-8859-1
== "1 2"

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

Classic uppercase does not crash, because it have the bound checks.

@Oldes
Copy link
Owner

Oldes commented Jan 7, 2022

It should be fixed in the commit above. Better to have all the checks, because even in Red, one can construct such a strings using methods like:

>> append "a" to-char 65533
== "a�"

@semarie
Copy link
Author

semarie commented Jan 8, 2022

yes, the commit fix it.

Personally, I would have change UP_CASE() macro to do the check systematically (instead of explicitly check it at each call). But it is fine.

@semarie semarie closed this as completed Jan 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants