
ncodeunits(c::Char): fast equivalent of ncodeunits(string(c)) #29153

Merged
2 commits merged into master from sk/ncodeunits-char on Sep 14, 2018

Conversation (4 participants)

StefanKarpinski (Member) commented Sep 12, 2018

No description provided.

@@ -134,7 +141,7 @@ function decode_overlong(c::Char)
end

"""
decode_overlong(c::AbstractChar)
decode_overlong(c::AbstractChar) -> UInt32

KristofferC (Contributor) commented Sep 12, 2018

Is this the return type for all AbstractChar subtypes, or only for Char?

StefanKarpinski (Author, Member) commented Sep 12, 2018

The result is a code point, which there's no reason to represent as anything but a UInt32.

ncodeunits(c::Char) = max(1, 4 - (trailing_zeros(reinterpret(UInt32, c)) >> 3))

"""
codepoint(c::AbstractChar) -> UInt32

KristofferC (Contributor) commented Sep 12, 2018

Perhaps change UInt32 -> Integer if this is for AbstractChar, or add an extra line:

codepoint(c::AbstractChar) -> Integer
codepoint(c::Char) -> UInt32

StefanKarpinski (Author, Member) commented Sep 12, 2018

Likewise, there's no reason not to represent a code point as a UInt32.

KristofferC (Contributor) commented Sep 13, 2018

OK, I was thinking about the docs for this function, which explicitly say:

For `Char`, this is a `UInt32` value, but
`AbstractChar` types that represent only a subset of Unicode may
return a different-sized integer (e.g. `UInt8`).

The signature and the docs seem at odds now.

StefanKarpinski (Author, Member) commented Sep 13, 2018

I guess someone could do that, although it seems kind of silly to me. I didn't write these docs.

KristofferC (Contributor) commented Sep 13, 2018

Who wrote the docs isn't really that important, though... just that the end result is consistent and doesn't say in the signature that codepoint(c::AbstractChar) has to return a UInt32 while the documentation just below it says it can return a UInt8.
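To make the docstring's claim concrete, here is a minimal hypothetical AbstractChar subtype whose codepoint returns a narrower integer; the name ASCIIChar and its methods are illustrative only, not part of Base:

```julia
# Hypothetical AbstractChar subtype covering only the ASCII subset of
# Unicode; illustrative only, not from Base.
struct ASCIIChar <: AbstractChar
    b::UInt8
end

# For this type codepoint returns a UInt8 rather than a UInt32, which is
# exactly the situation the docstring describes.
Base.codepoint(c::ASCIIChar) = c.b

# Conversion to Char, so generic AbstractChar machinery (printing,
# comparison) can fall back to the full Char representation.
Base.Char(c::ASCIIChar) = Char(c.b)
```

With such a type, codepoint(ASCIIChar(0x41)) yields the UInt8 0x41, so a signature written as codepoint(c::AbstractChar) -> UInt32 would indeed contradict the prose below it.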

stevengj (Member) commented Sep 13, 2018

You could define a fallback method ncodeunits(c::AbstractChar) = write(devnull, c)? Maybe not, since write will return a count of bytes, which might not be code units if c represents some other encoding?

stevengj (Member) commented Sep 13, 2018

Interestingly, according to @btime, the ncodeunits implementation here takes exactly the same amount of time as write(devnull, c).

@@ -91,7 +98,7 @@ end
# not to support malformed or overlong encodings.

"""
ismalformed(c::AbstractChar)
ismalformed(c::AbstractChar) -> Bool

stevengj (Member) commented Sep 13, 2018

Are we still using this syntax or are we transitioning to ismalformed(c::AbstractChar)::Bool in documentation?

StefanKarpinski (Author, Member) commented Sep 13, 2018

As far as I'm aware, we're still using this syntax in most places.

StefanKarpinski (Member, Author) commented Sep 13, 2018

This is pretty efficient:

julia> @code_native write(devnull, 'x')
    bswapl	%edi
    xorl	%eax, %eax
    nopw	%cs:(%rax,%rax)
L16:
    shrl	$8, %edi
    addq	$1, %rax
    testl	%edi, %edi
    jne	L16
    retq

Versus:

julia> @code_native ncodeunits('x')
	tzcntl	%edi, %eax
	shrl	$3, %eax
	movl	$4, %ecx
	subq	%rax, %rcx
	testq	%rcx, %rcx
	movl	$1, %eax
	cmovgq	%rcx, %rax
	retq

The write version might be slower if the loop branch is unpredictable, i.e. for a mix of characters with different sizes. The ncodeunits version could be faster if I could get rid of the max in it. There may be some clever way to do that, but I'm not sure this really warrants that much micro-optimization.
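For reference, the max guard in the definition above exists for exactly one input: '\0' reinterprets to 0x00000000, so trailing_zeros returns 32 and the unguarded arithmetic would give 4 - 4 == 0. A sketch (nc is a local copy of the proposed definition, not a new Base name):

```julia
# Local copy of the proposed definition, showing why the max(1, …) guard
# is needed: '\0' reinterprets to 0x00000000, so trailing_zeros returns
# 32 and the unguarded formula would yield 0 instead of 1.
nc(c::Char) = max(1, 4 - (trailing_zeros(reinterpret(UInt32, c)) >> 3))

nc('\0')  # 1, only because of the guard
nc('x')   # 1 (1-byte UTF-8)
nc('α')   # 2 (2-byte UTF-8)
nc('∀')   # 3 (3-byte UTF-8)
nc('🚀')  # 4 (4-byte UTF-8)
```

For every Char, nc agrees with ncodeunits(string(c)); dropping the max would break only the NUL case.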

vtjnash (Member) commented Sep 13, 2018

Intel predicts the second is faster, by 10%:

$ ./usr/tools/llvm-mca
    bswapl	%edi
    xorl	%eax, %eax
    nopw	%cs:(%rax,%rax)
L16:
    shrl	$8, %edi
    addq	$1, %rax
    testl	%edi, %edi
    jne	L16
^D
Iterations:        100
Instructions:      700
Total Cycles:      205
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    3.41
IPC:               3.41
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     0.50                        bswapl	%edi
 1      1     0.25                        xorl	%eax, %eax
 1      1     0.17                        nopw	%cs:(%rax,%rax)
 1      1     0.50                        shrl	$8, %edi
 1      1     0.25                        addq	$1, %rax
 1      1     0.25                        testl	%edi, %edi
 1      1     0.50                        jne	L16


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.50   1.50    -      -      -     1.50   1.50    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     0.50    -      -      -     0.50    -      -     bswapl	%edi
 -      -     0.49    -      -      -      -     0.50   0.01    -     xorl	%eax, %eax
 -      -      -      -      -      -      -      -      -      -     nopw	%cs:(%rax,%rax)
 -      -     0.50    -      -      -      -      -     0.50    -     shrl	$8, %edi
 -      -      -     0.50    -      -      -     0.50    -      -     addq	$1, %rax
 -      -      -     0.50    -      -      -      -     0.50    -     testl	%edi, %edi
 -      -     0.51    -      -      -      -      -     0.49    -     jne	L16


$ ./usr/tools/llvm-mca
	tzcntl	%edi, %eax
	shrl	$3, %eax
	movl	$4, %ecx
	subq	%rax, %rcx
	testq	%rcx, %rcx
	movl	$1, %eax
	cmovgq	%rcx, %rax
^D
Iterations:        100
Instructions:      700
Total Cycles:      184
Total uOps:        700

Dispatch Width:    6
uOps Per Cycle:    3.80
IPC:               3.80
Block RThroughput: 1.2


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      3     1.00                        tzcntl	%edi, %eax
 1      1     0.50                        shrl	$3, %eax
 1      1     0.25                        movl	$4, %ecx
 1      1     0.25                        subq	%rax, %rcx
 1      1     0.25                        testq	%rcx, %rcx
 1      1     0.25                        movl	$1, %eax
 1      1     0.50                        cmovgq	%rcx, %rax


Resources:
[0]   - SKXDivider
[1]   - SKXFPDivider
[2]   - SKXPort0
[3]   - SKXPort1
[4]   - SKXPort2
[5]   - SKXPort3
[6]   - SKXPort4
[7]   - SKXPort5
[8]   - SKXPort6
[9]   - SKXPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     1.76   1.74    -      -      -     1.73   1.77    -     

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -     1.00    -      -      -      -      -      -     tzcntl	%edi, %eax
 -      -     0.17    -      -      -      -      -     0.83    -     shrl	$3, %eax
 -      -     0.02    -      -      -      -     0.96   0.02    -     movl	$4, %ecx
 -      -     0.41   0.17    -      -      -     0.03   0.39    -     subq	%rax, %rcx
 -      -     0.31   0.31    -      -      -     0.07   0.31    -     testq	%rcx, %rcx
 -      -     0.01   0.26    -      -      -     0.67   0.06    -     movl	$1, %eax
 -      -     0.84    -      -      -      -      -     0.16    -     cmovgq	%rcx, %rax

But that's actually only for my native CPU; if we look back in time (Ivy Bridge, Broadwell, Haswell, Nehalem), we see that the predicted performance of the first has remained relatively unchanged, while the performance of the second has been steadily improving.

What I think is likely happening is that the first loop is actually much cheaper for the processor to execute (much lower latency), so it has always done fairly well in a benchmarking loop, whereas the second requires more transistors to reach the same level of performance (the above output is truncated; the full output includes some graphs to illustrate this point). I could be wrong, since I'm just reverse-engineering the output of a static-prediction tool, but that would be my analysis.

@StefanKarpinski StefanKarpinski force-pushed the sk/ncodeunits-char branch from cb4bc8e to b931684 Sep 13, 2018

define ncodeunits(c::Char) as fast equivalent of ncodeunits(string(c))
There was a non-public `codelen(c::Char)` method which previously did
this; this commit also replaces internal uses of `codelen` with `ncodeunits(c)`.

@StefanKarpinski StefanKarpinski force-pushed the sk/ncodeunits-char branch from b931684 to d4d577e Sep 13, 2018

StefanKarpinski (Member, Author) commented Sep 14, 2018

I replaced the internal-only Base.codelen function with this new method of ncodeunits.

@StefanKarpinski StefanKarpinski merged commit 3b02991 into master Sep 14, 2018

4 checks passed

ci/circleci: build-i686: Your tests passed on CircleCI!
ci/circleci: build-x86_64: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@StefanKarpinski StefanKarpinski deleted the sk/ncodeunits-char branch Sep 14, 2018

StefanKarpinski (Member, Author) commented Sep 14, 2018

Also, I went with the write(devnull, c) definition, since I figure that something that works well across a larger span of older and newer CPUs is somewhat better. If this changes in the future, we can always use a different definition.
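As a sanity check that the merged fallback agrees with the original specification, a sketch:

```julia
# write returns the number of bytes written; for a Char written to
# devnull that count is exactly its UTF-8 code unit count, so
# write(devnull, c) agrees with ncodeunits(string(c)).
for c in ('x', 'α', '∀', '🚀')
    @assert write(devnull, c) == ncodeunits(string(c))
end
```

This holds for Char because Char writes as UTF-8; as noted earlier in the thread, an AbstractChar type using some other encoding could write a byte count that differs from its code unit count.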

fredrikekre added a commit that referenced this pull request Dec 5, 2018

Addition of NEWS and compat admonitions for important
changes between Julia 1.0 and 1.1, including:

- Custom .css-style for compat admonitions.

- Information about compat annotations to CONTRIBUTING.md.

- NEWS.md entry for PRs #30090, #30035, #30022, #29978,
  #29969, #29858, #29845, #29754, #29638, #29636, #29615,
  #29600, #29506, #29469, #29316, #29259, #29178, #29153,
  #29033, #28902, #28761, #28745, #28708, #28696, #29997,
  #28790, #29092, #29108, #29782

- Compat annotation for PRs #30090, #30013, #29978,
  #29890, #29858, #29827, #29754, #29679, #29636, #29623,
  #29600, #29440, #29316, #29259, #29178, #29157, #29153,
  #29033, #28902, #28878, #28761, #28708, #28156, #29733,
  #29670, #29997, #28790, #29092, #29108, #29782, #25278

- Documentation for broadcasting CartesianIndices (#30230).
- Documentation for Base.julia_cmd().
- Documentation for colon constructor of CartesianIndices (#29440).
- Documentation for ^(::Matrix, ::Number) and ^(::Number, ::Matrix).

- Run NEWS-update.jl.

Co-authored-by: Morten Piibeleht <morten.piibeleht@gmail.com>
Co-authored-by: Fredrik Ekre <ekrefredrik@gmail.com>
