Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

jakebolewski · 2015-04-27T17:51:16Z

No description provided.

sample in a uniform interval and map the sample to valid codepoint ranges

jakebolewski · 2015-04-28T18:58:45Z

@StefanKarpinski I autogenerated some code using is_valid_char to subset possible valid codepoints and then do a simple bisection. There may be a faster way to do this.

StefanKarpinski · 2015-04-28T19:49:01Z

Seems reasonable to me. If we do want this functionality, this is a good starting point and performance can be tweaked in the future. Do we want this functionality?

tkelman · 2015-04-29T07:25:34Z

Wouldn't the autogeneration version of the code be much more concise here?

jakebolewski · 2015-05-06T13:47:23Z

closed by 5986e58

ScottPJones · 2015-05-06T15:48:25Z

Ummm.... this code is actually incorrect (please see the FAQ http://www.unicode.org/faq/private_use.html).
This makes the same mistake that is in utf8proc, by not considering the 66 non-characters as valid codepoints, which they definitely are. I think the code should be really simple:

function rand(r::AbstractRNG, ::Type{Char})
   v = rand(0x00000000:0x0010f7ff)
   (v < 0xd800) ? Char(v) : Char(v+0x800)
end

I think maybe this explains why @StefanKarpinski thought checking validity of Char could be too expensive...

StefanKarpinski · 2015-05-06T18:06:32Z

That seems like a reasonable definition for the this function. I do think that even that amount of checking is a performance problem for something that you do for every character value you pull out of a string. I'd love to be proven wrong.

ScottPJones · 2015-05-06T18:11:09Z

@StefanKarpinski That's why @JeffBezanson and I want validated strings ;-) Also, the test is just:

function is_valid_char(ch::Unsigned) ; !Bool((ch-0xd800<0x800)|(ch>0x10ffff)) ; end

carnaval · 2015-05-06T18:16:55Z

This should come down to a few ALU instructions and no branches so I don't think it's gonna be significant (if inlined)

ScottPJones · 2015-05-06T18:25:55Z

😀 Thanks @carnaval, that's what I'd thought too (given that I'd assembly optimized that years ago where I used to work)

ScottPJones · 2015-05-06T19:08:15Z

@carnaval Julia is too smart for me... how do I get it to show the code generated when it doesn't know the value passed in? Also, this shows an important optimization that is not being done!!!

julia> ch = 0x45
0x45

julia> @code_native(is_valid_char(ch))

    .section    __TEXT,__text,regular,pure_instructions
Filename: utf8proc.jl
Source line: 16
    pushq   %rbp
    movq    %rsp, %rbp
    movb    $1, %al
Source line: 16
    popq    %rbp
    ret

carnaval · 2015-05-06T19:13:04Z

ch is only one byte long (UInt8) so the function is constant.

I don't know what you mean by an important optimization not being done. If you're talking about the frame pointer setup then maybe it is necessary for debugger support ? Anyway this is a llvm flag you can surely enable with something like -fomit-frame-pointer in the JIT config somewhere. It will go away if inlined of course.

pao · 2015-05-06T19:15:46Z

how do I get it to show the code generated when it doesn't know the value passed in

It's not clear what you mean by this. If you just mean you don't want to make up a value to pass to the @code_ macros, they have functional forms which take a tuple of argument types as their second argument: code_native(is_valid_char, (Char,)). But the result will be the same since Julia doesn't specialize on argument values, only their types.

ScottPJones · 2015-05-06T20:07:11Z

@carnaval... wow! I never thought about Julia using the size of the type of the variable and generating special case code... very impressive! I checked an it actually generates different code for UInt8, UInt16, and UInt32...
I'll investigate the frame-pointer flag... I was just surprised, because the default is to have them... I always compiled my C programs with -O3, no frames, no symbols for production, and with frames and symbols only for debugging builds.
I did find a sad missing optimization (unless you tell me what I've done wrong):

julia> valid(ch) = (0xe000 <= ch <= 0x10ffff)
valid (generic function with 1 method)
julia> @code_native(valid(ch))
```asm
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
    cmpl    $57344, %edi            ## imm = 0xE000
    jae L20
    xorb    %al, %al
Source line: 1
    popq    %rbp
    ret
L20:    cmpl    $1114112, %edi          ## imm = 0x110000
    setb    %al
    popq    %rbp
    ret

At least I found one case where I'm smarter than Julia! (should I create a new issue for this? I have no idea how to fix that!)

ScottPJones · 2015-05-06T20:09:36Z

@pao thanks - just wasn't thinking that the types were different, still wrapping my head around the fact that unsigned literals' types depend on their length, not their value... unlike any other language I've ever used, and unlike signed literals... [not complaining though, just makes it hard switching between C & Julia]

carnaval · 2015-05-06T20:14:54Z

I don't know why LLVM is not figuring out the branch free version. You can get it by writing (0xe000 <= ch) & (ch <= 0x10ffff).

ScottPJones · 2015-05-06T20:18:50Z

Yes, which is essentially what I did, I would have liked to been able to write it as: !((0xd800 <= ch <= 0xdfff) | (ch > 0x10ffff)), the Julia range comparison is definitely more readable, but that didn't generate good code... Hope you guys can fix this, there's a lot of code I found that generates branches when it's not needed... [speed killer]

mbauman · 2015-05-06T20:19:34Z

I played with that quite a bit in trying to optimize checkbounds. My conclusion was surprising - I found the branching version to be faster. I don't know why, and I didn't dig into it very much. I may have also not been measuring what I thought I was measuring.

ScottPJones · 2015-05-06T20:21:43Z

@mbauman, how did you test it? With random input?

carnaval · 2015-05-06T20:22:17Z

It probably depends a lot on the surrounding code and if the branching pattern is obvious or random.

ScottPJones · 2015-05-06T20:23:36Z

@carnaval, yes, what I was trying to point out... in general, I'd always go with the non-branching version, unless you really know how random the inputs are...

StefanKarpinski · 2015-05-06T20:27:11Z

Literals in C don't have a type: the type is determined by the type of the expression, not the literal; the literal only determines the value, not the type. In Julia, expressions don't have types, values do. Therefore literals can't just determine a value and let the context determine the type – they must have a value, which implies that they have a type as well. We also can't follow the example of other dynamic languages since most of them don't have unsigned integers, let alone literal syntaxes for such. So we're in uncharted territory here – we can't follow the examples of existing static or dynamic languages.

mbauman · 2015-05-06T20:27:38Z

Yes, that's a very good point. This was in my array indexing work, and there the bounds checks should always be true (unless the user screwed up). Crazy modern processors.

ScottPJones · 2015-05-06T20:34:30Z

@StefanKarpinski But that doesn't mean that the unsigned literals had to have types that depended on length instead of value (I do understand the convenience, once you are used to it, but it does have the disadvantage of cognitive dissonance, for people who are going back and forth between C/C++/Java and Julia...). It just as easily could have been, decimal literals are signed, start at Int64, promote to Int128 (based on value), hex/oct/bin literals are unsigned, start at UInt64, promote to UInt128 (based on value)...

StefanKarpinski · 2015-05-06T20:49:30Z

Right, but then you have to pick a specific unsigned integer type – UInt is the obvious one – and you'd have no literal syntax for bytes, shorts, etc. There are certainly a lot of cases where being able to decouple literal syntax from type would be nice, but that's just in the cards for a dynamic language.

ScottPJones · 2015-05-06T20:55:24Z

@StefanKarpinski I did say that I understood the convenience, once you were used to it, of the Julian way, I'm not suggesting that it be changed at this point. It is a point that really needs to be stressed though for people coming from other languages, which is why I did the addition to the docs. In my C code I'm constantly casting (uint8_t), (uint16_t), (uint32_t), or these days, (char16_t) and (char32_t)...
0xff, 0x00ff, etc. is more convenient, but at a price.

jakebolewski force-pushed the jcb/randchar branch from d1b193c to abd59fd Compare April 28, 2015 18:54

add missing rand function for Char's

abd59fd

sample in a uniform interval and map the sample to valid codepoint ranges

jakebolewski closed this May 6, 2015

jiahao mentioned this pull request May 6, 2015

is_valid_char does not correctly follow the Unicode standard #11171

Closed

ScottPJones mentioned this pull request May 6, 2015

Simplify and improve performance of rand(Char) #11178

Merged

DilumAluthge deleted the jcb/randchar branch March 25, 2021 22:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

jakebolewski commented Apr 27, 2015

jakebolewski commented Apr 28, 2015

StefanKarpinski commented Apr 28, 2015

tkelman commented Apr 29, 2015

jakebolewski commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

pao commented May 6, 2015

ScottPJones commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

mbauman commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

mbauman commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

ScottPJones commented May 6, 2015

Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

Conversation

jakebolewski commented Apr 27, 2015

jakebolewski commented Apr 28, 2015

StefanKarpinski commented Apr 28, 2015

tkelman commented Apr 29, 2015

jakebolewski commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

pao commented May 6, 2015

ScottPJones commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

mbauman commented May 6, 2015

ScottPJones commented May 6, 2015

carnaval commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

mbauman commented May 6, 2015

ScottPJones commented May 6, 2015

StefanKarpinski commented May 6, 2015

ScottPJones commented May 6, 2015