Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add missing rand(::AbstractRNG, ::Type{Char}) function #11033

Closed
wants to merge 1 commit into from

Conversation

jakebolewski
Copy link
Member

No description provided.

sample in a uniform interval and map the sample to valid
codepoint ranges
@jakebolewski
Copy link
Member Author

@StefanKarpinski I autogenerated some code using is_valid_char to subset possible valid codepoints and then do a simple bisection. There may be a faster way to do this.

@StefanKarpinski
Copy link
Sponsor Member

Seems reasonable to me. If we do want this functionality, this is a good starting point and performance can be tweaked in the future. Do we want this functionality?

@tkelman
Copy link
Contributor

tkelman commented Apr 29, 2015

Wouldn't the autogeneration version of the code be much more concise here?

@jakebolewski
Copy link
Member Author

closed by 5986e58

@ScottPJones
Copy link
Contributor

Ummm.... this code is actually incorrect (please see the FAQ http://www.unicode.org/faq/private_use.html).
This makes the same mistake that is in utf8proc, by not considering the 66 non-characters as valid codepoints, which they definitely are. I think the code should be really simple:

function rand(r::AbstractRNG, ::Type{Char})
   v = rand(0x00000000:0x0010f7ff)
   (v < 0xd800) ? Char(v) : Char(v+0x800)
end

I think maybe this explains why @StefanKarpinski thought checking validity of Char could be too expensive...

@StefanKarpinski
Copy link
Sponsor Member

That seems like a reasonable definition for the this function. I do think that even that amount of checking is a performance problem for something that you do for every character value you pull out of a string. I'd love to be proven wrong.

@ScottPJones
Copy link
Contributor

@StefanKarpinski That's why @JeffBezanson and I want validated strings ;-) Also, the test is just:

function is_valid_char(ch::Unsigned) ; !Bool((ch-0xd800<0x800)|(ch>0x10ffff)) ; end

@carnaval
Copy link
Contributor

carnaval commented May 6, 2015

This should come down to a few ALU instructions and no branches so I don't think it's gonna be significant (if inlined)

@ScottPJones
Copy link
Contributor

😀 Thanks @carnaval, that's what I'd thought too (given that I'd assembly optimized that years ago where I used to work)

@ScottPJones
Copy link
Contributor

@carnaval Julia is too smart for me... how do I get it to show the code generated when it doesn't know the value passed in? Also, this shows an important optimization that is not being done!!!

julia> ch = 0x45
0x45

julia> @code_native(is_valid_char(ch))
    .section    __TEXT,__text,regular,pure_instructions
Filename: utf8proc.jl
Source line: 16
    pushq   %rbp
    movq    %rsp, %rbp
    movb    $1, %al
Source line: 16
    popq    %rbp
    ret

@carnaval
Copy link
Contributor

carnaval commented May 6, 2015

ch is only one byte long (UInt8) so the function is constant.

I don't know what you mean by an important optimization not being done. If you're talking about the frame pointer setup then maybe it is necessary for debugger support ? Anyway this is a llvm flag you can surely enable with something like -fomit-frame-pointer in the JIT config somewhere. It will go away if inlined of course.

@pao
Copy link
Member

pao commented May 6, 2015

how do I get it to show the code generated when it doesn't know the value passed in

It's not clear what you mean by this. If you just mean you don't want to make up a value to pass to the @code_ macros, they have functional forms which take a tuple of argument types as their second argument: code_native(is_valid_char, (Char,)). But the result will be the same since Julia doesn't specialize on argument values, only their types.

@ScottPJones
Copy link
Contributor

@carnaval... wow! I never thought about Julia using the size of the type of the variable and generating special case code... very impressive! I checked an it actually generates different code for UInt8, UInt16, and UInt32...
I'll investigate the frame-pointer flag... I was just surprised, because the default is to have them... I always compiled my C programs with -O3, no frames, no symbols for production, and with frames and symbols only for debugging builds.
I did find a sad missing optimization (unless you tell me what I've done wrong):

julia> valid(ch) = (0xe000 <= ch <= 0x10ffff)
valid (generic function with 1 method)
julia> @code_native(valid(ch))
```asm
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
    cmpl    $57344, %edi            ## imm = 0xE000
    jae L20
    xorb    %al, %al
Source line: 1
    popq    %rbp
    ret
L20:    cmpl    $1114112, %edi          ## imm = 0x110000
    setb    %al
    popq    %rbp
    ret

At least I found one case where I'm smarter than Julia! (should I create a new issue for this? I have no idea how to fix that!)

@ScottPJones
Copy link
Contributor

@pao thanks - just wasn't thinking that the types were different, still wrapping my head around the fact that unsigned literals' types depend on their length, not their value... unlike any other language I've ever used, and unlike signed literals... [not complaining though, just makes it hard switching between C & Julia]

@carnaval
Copy link
Contributor

carnaval commented May 6, 2015

I don't know why LLVM is not figuring out the branch free version. You can get it by writing (0xe000 <= ch) & (ch <= 0x10ffff).

@ScottPJones
Copy link
Contributor

Yes, which is essentially what I did, I would have liked to been able to write it as: !((0xd800 <= ch <= 0xdfff) | (ch > 0x10ffff)), the Julia range comparison is definitely more readable, but that didn't generate good code... Hope you guys can fix this, there's a lot of code I found that generates branches when it's not needed... [speed killer]

@mbauman
Copy link
Sponsor Member

mbauman commented May 6, 2015

I played with that quite a bit in trying to optimize checkbounds. My conclusion was surprising - I found the branching version to be faster. I don't know why, and I didn't dig into it very much. I may have also not been measuring what I thought I was measuring.

@ScottPJones
Copy link
Contributor

@mbauman, how did you test it? With random input?

@carnaval
Copy link
Contributor

carnaval commented May 6, 2015

It probably depends a lot on the surrounding code and if the branching pattern is obvious or random.

@ScottPJones
Copy link
Contributor

@carnaval, yes, what I was trying to point out... in general, I'd always go with the non-branching version, unless you really know how random the inputs are...

@StefanKarpinski
Copy link
Sponsor Member

Literals in C don't have a type: the type is determined by the type of the expression, not the literal; the literal only determines the value, not the type. In Julia, expressions don't have types, values do. Therefore literals can't just determine a value and let the context determine the type – they must have a value, which implies that they have a type as well. We also can't follow the example of other dynamic languages since most of them don't have unsigned integers, let alone literal syntaxes for such. So we're in uncharted territory here – we can't follow the examples of existing static or dynamic languages.

@mbauman
Copy link
Sponsor Member

mbauman commented May 6, 2015

Yes, that's a very good point. This was in my array indexing work, and there the bounds checks should always be true (unless the user screwed up). Crazy modern processors.

@ScottPJones
Copy link
Contributor

@StefanKarpinski But that doesn't mean that the unsigned literals had to have types that depended on length instead of value (I do understand the convenience, once you are used to it, but it does have the disadvantage of cognitive dissonance, for people who are going back and forth between C/C++/Java and Julia...). It just as easily could have been, decimal literals are signed, start at Int64, promote to Int128 (based on value), hex/oct/bin literals are unsigned, start at UInt64, promote to UInt128 (based on value)...

@StefanKarpinski
Copy link
Sponsor Member

Right, but then you have to pick a specific unsigned integer type – UInt is the obvious one – and you'd have no literal syntax for bytes, shorts, etc. There are certainly a lot of cases where being able to decouple literal syntax from type would be nice, but that's just in the cards for a dynamic language.

@ScottPJones
Copy link
Contributor

@StefanKarpinski I did say that I understood the convenience, once you were used to it, of the Julian way, I'm not suggesting that it be changed at this point. It is a point that really needs to be stressed though for people coming from other languages, which is why I did the addition to the docs. In my C code I'm constantly casting (uint8_t), (uint16_t), (uint32_t), or these days, (char16_t) and (char32_t)...
0xff, 0x00ff, etc. is more convenient, but at a price.

@DilumAluthge DilumAluthge deleted the jcb/randchar branch March 25, 2021 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants