-
-
Notifications
You must be signed in to change notification settings - Fork 3k
shift-based DFA implementation of utf8ValidateSlice #23968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e0d3885
to
fb482e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two nitpicks here: one state_from_offset
/ offset_from_state
are in snake case, when the preferred style for function is camel case. More importantly though, the offsets could probably be expressed much more simply as tag values for the State enum. It would look something like this:
const State = enum(u5) {
ok = 8,
one = 13,
two1 = 23,
two2 = 18,
three1 = 3,
three2 = 28,
fail = 0,
fn toOffset(state: State) u5 {
return @intFromEnum(state);
}
fn fromOffset(offset: u5) State {
return @enumFromInt(offset);
}
};
Hi, thanks for feedback. For some reason I though enums with explicit tag type didn't play well with exhaustive switches and as the State is only a comptime construct, I'd rather have those than the tag type. That was an incorrect assumption on my part. The enum now has an explicit tag type and values and the helper methods are completely gone. |
lib/std/unicode.zig
Outdated
|
||
const shift_table = blk: { | ||
@setEvalBranchQuota(30000); | ||
var t = [_]u32{0} ** 256; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a total nitpick, but why not just use @splat
to initialize the shift table? At least for me, this makes the type of the shift table quicker to understand (even if it only takes a few seconds to understand it to begin with), and means the compiler doesn't have to do any comptime list concatenation.
var t = [_]u32{0} ** 256; | |
var t: [256]u32 = @splat(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I don't find much of a difference readability-wise, but if that's the idiomatic pattern nowadays then it's better to be consistent.
I've been playing around with shift-based DFAs and I think they could be a good fit for the Zig standard library, as they're 3 to 8 times faster than the current
utf8ValidateSlice
implementation in my testing on an i5 8350u.The main difference from the original gist is that this implementation uses only 32bit shifts, making it compatible with important Zig targets like ARM, RISC-V and good old x86. The awkward downside is that this is achieved by validating the string from end to start.
I chose to submit an implementation that first defines a DFA and converts it to a shift transition table using
comptime
, although I am open to replacing that code with a pre-built table literal.