Skip to content

shift-based DFA implementation of utf8ValidateSlice #23968

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

hollmmax
Copy link
Contributor

I've been playing around with shift-based DFAs and I think they could be a good fit for the Zig standard library, as they're 3 to 8 times faster than the current utf8ValidateSlice implementation in my testing on an i5 8350u.

The main difference from the original gist is that this implementation uses only 32bit shifts, making it compatible with important Zig targets like ARM, RISC-V and good old x86. The awkward downside is that this is achieved by validating the string from end to start.

I chose to submit an implementation that first defines a DFA and converts it to a shift transition table using comptime, although I am open to replacing that code with a pre-built table literal.

@hollmmax hollmmax force-pushed the master branch 2 times, most recently from e0d3885 to fb482e1 Compare May 22, 2025 22:55
Copy link
Contributor

@SeanTheGleaming SeanTheGleaming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two nitpicks here: one state_from_offset / offset_from_state are in snake case, when the preferred style for function is camel case. More importantly though, the offsets could probably be expressed much more simply as tag values for the State enum. It would look something like this:

        const State = enum(u5) { 
             ok = 8, 
             one = 13, 
             two1 = 23, 
             two2 = 18, 
             three1 = 3, 
             three2 = 28, 
             fail = 0, 

             fn toOffset(state: State) u5 {
                 return @intFromEnum(state);
             }

             fn fromOffset(offset: u5) State {
                 return @enumFromInt(offset);
             }
         };

@hollmmax
Copy link
Contributor Author

Hi, thanks for feedback. For some reason I though enums with explicit tag type didn't play well with exhaustive switches and as the State is only a comptime construct, I'd rather have those than the tag type.

That was an incorrect assumption on my part. The enum now has an explicit tag type and values and the helper methods are completely gone.


const shift_table = blk: {
@setEvalBranchQuota(30000);
var t = [_]u32{0} ** 256;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a total nitpick, but why not just use @splat to initialize the shift table? At least for me, this makes the type of the shift table quicker to understand (even if it only takes a few seconds to understand it to begin with), and means the compiler doesn't have to do any comptime list concatenation.

Suggested change
var t = [_]u32{0} ** 256;
var t: [256]u32 = @splat(0);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants