ncodeunits(c::Char): fast equivalent of ncodeunits(string(c)) #29153
This is pretty efficient:
    julia> @code_native write(devnull, 'x')
            bswapl  %edi
            xorl    %eax, %eax
            nopw    %cs:(%rax,%rax)
    L16:
            shrl    $8, %edi
            addq    $1, %rax
            testl   %edi, %edi
            jne     L16
            retq

    julia> @code_native ncodeunits('x')
            tzcntl  %edi, %eax
            shrl    $3, %eax
            movl    $4, %ecx
            subq    %rax, %rcx
            testq   %rcx, %rcx
            movl    $1, %eax
            cmovgq  %rcx, %rax
            retq
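For readers who prefer Julia to assembly, here is a rough sketch of what the two listings appear to compute. The function names `ncodeunits_loop` and `ncodeunits_tzcnt` are hypothetical and chosen only for illustration. The bodies are reconstructed from the instruction sequences above, so they are an assumption about the source, not a copy of it. Both rely on `Char` storing its UTF-8 bytes left-aligned in a `UInt32`.

```julia
# Hypothetical names; bodies reconstructed from the assembly, not copied from Base.

# Loop form (first listing): byte-swap, then shift out one byte at a time
# and count how many shifts it takes to reach zero.
function ncodeunits_loop(c::Char)
    u = bswap(reinterpret(UInt32, c))
    n = 1
    while true
        (u >>= 8) == 0 && return n
        n += 1
    end
end

# Branch-free form (second listing): count the trailing zero *bytes* and
# subtract from 4, clamping to at least 1 so that '\0' still yields 1.
ncodeunits_tzcnt(c::Char) =
    max(1, 4 - (trailing_zeros(reinterpret(UInt32, c)) >> 3))

# Both should agree with the string-based definition from the PR title.
for c in ('\0', 'x', 'é', '€', '😀')
    @assert ncodeunits_loop(c) == ncodeunits_tzcnt(c) == ncodeunits(string(c))
end
```

The loop form runs one iteration per code unit, while the `trailing_zeros` form is a fixed sequence of instructions regardless of the character.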
Intel predicts the second is faster, by 10%:
But that's actually only for my native CPU. If we look back in time (ivybridge, broadwell, haswell, nehalem), we see that the predicted performance of the first has been relatively unchanged over time, while the predicted performance of the second has been steadily improving.
What I think is likely happening is that the first version's loop is actually much cheaper for the processor to execute (much lower latency), so it has always done fairly well in a benchmarking loop, whereas the second version requires more transistors to reach the same level of performance (the above output is truncated; the full output includes some graphs to illustrate this point). I could be wrong, since I'm just reverse-engineering the output of a static-prediction tool, but that would be my analysis.