-
-
Notifications
You must be signed in to change notification settings - Fork 745
add byCodeunit, byChar, byWchar and byDchar to std.utf #2043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
auto byCodeunit(R)(R r) if (isSomeString!R) | ||
{ | ||
alias tchar = Unqual!(ElementEncodingType!R); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't used in this function.
If I use "SomeText".byCodeUnit.map!(c => c) what is the type of the given output items? Is it a dchar again? |
@bearophile it's char |
/******************************************** | ||
* Iterate a range of char, wchar, or dchars by code unit. | ||
* The purpose is to bypass the special case decoding that | ||
* std.array.put does to character arrays. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only std.array.put
, not front
? And as far as I can remember, put
doesn't actually special-case character arrays, it just offers transcoding support for character ranges in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't, but could. As a matter of fact, it's something I'd like to change. Once I get the time to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant front().
None of the Ddoc comments use any of the standard sections, i.e. |
@property bool empty() { return r.length == 0; } | ||
@property auto front() { return r[0]; } | ||
void popFront() { r = r[1 .. $]; } | ||
auto opIndex(size_t index) const { return r[index]; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure I like the style of these methods, putting everything on a single line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer it for trivial functions. Otherwise things tend to get spread out, making it harder to digest at a glance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto here about inout and ref.
I would like to see some descriptions/annotations for the unit tests. |
@WalterBright see also pull #2038, and add a test case for bad UTF-8 with values above 0x10_FFFF. |
This PR addresses https://d.puremagic.com/issues/show_bug.cgi?id=12113 |
Point being that this pull has the same bug in UTF-8 decoding. I could make a separate bug report for these new primitives, but it's kind of meh. |
The same bug? What am I missing? |
Updated to address comments. |
} | ||
|
||
// check for out of range only needed for 4 bytes | ||
static if (i == 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WalterBright
I've meant this check. It seems that either you've fixed it or I've missed that it was here in the first place.
Anyway, all good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had missed it until you pointed it out. It's fixed now, with a unittest.
@blackwhale and @monarchdodra, you are both right, @andralex I misled you. I was using the wrong test case. Will change it back. |
@monarchdodra no, U+FFFF is not checked for. It's a bit in limbo, being reserved for the application. |
I don't think it'll happen very often, but it could be trivially fixed by setting |
@monarchdodra I did a fair amount of thinking about that. We've used U+FFFF as a 'nan' value from the beginning without any reported issues with it. U+FFFF will remain in the transcoded UTF stream as U+FFFF, it won't get lost or missed. U+FFFF is not a transcoding error. U+D800 is not an "actually illegal" value according to the spec, and I think it would be a mistake to mix it up with U+FFFF. Etc. |
You have me misunderstood, and I agree with you. I'm not suggesting we replace U+FFFF or anything. I agree it is not an 'illegal' value, and should remain in the stream. I'm just pointing out you are also using the value U+FFFF (dchar.init) as a flag to mean "front not yet evaluated". As such, if there does happen to be a U+FFFF in your stream, calling front more than once in a row will pop it. In essence, you'd get 2 different values for two consecutive calls to front. I'm instead suggesting you choose the value U+D800 as the "not yet evaluated" value: If there are any U+D800 values in your stream, they are flagged as "invalid", and replaced by U+FFFD. In essence, this makes it impossible to "have" U+D800 as a front, avoiding the issue above. It's just an implementation tweak, not a change of behavior/handling for U+FFFF. |
Updated so that:
|
static struct ByCodeUnitImpl | ||
{ | ||
@property bool empty() const { return r.length == 0; } | ||
@property auto ref front() const { return r[0]; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (and opIndex
) should be inout
, or you won't actually be able to mutate the returned reference:
auto s = "hello".dup.byCodeUnit();
s.front = 'H';
Unless that was done on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
I gave one final review to the entire pull. I like the new implementations. It looks good to ship as is. I left some nitpicks; it would be nice if you could address them, but they aren't "blocker" issues. |
Doesn't this fail miserably with std.range.popFrontN ?? I'm not convinced that I like the whole design here*, but at the very least you need to have an accumulator for how many elements front should remove when called. *take a range of auto r2 = r1.byDchar.
if(!r2.empty)
{
auto e = r2.front; //Woops, access to an already known value is blocked
//waiting on getting the *next* data point.
} |
How so? No any worst than with a
I don't see how
Again, I don't see how that's different from any other range addaptor. Besides, this could block also block at |
auto s = "fdsa";
s.popFrontN(2);
assert(equal(s, "sa"));
s = "fdsa";
auto sByDchar = s.byDchar();
sByDchar.popFrontN(2);
assert(equal(sByDChar, "sa")); //fails, only the 'f' has been popped. |
I think we still have a serious problem with the semantic specification of ranges and their primitives. |
Indeed, I missed that. There's an implementation problem here.
Just the implementation I think. |
BTW, I hadn't suggested yet, but we might want to split this pull between |
|
did all the issues here |
Anybody see any outstanding (non-style related) issues left? @John-Colvin maybe? Will soon merge otherwise. |
I still think a range wrapper like byDchar shouldn't advance it's inner range by more than is necessary to calculate the current front. They should maximally lazy w.r.t. the amount of data they request, to avoid unecessary latency. Having said that, I see no problem merging this how it is, it can always be changed later (becoming more lazy here isn't a breaking change unelss someone is relying on undocumented implementation details). |
Isn't that what it's doing though? I see most of the work being done in |
Talking specifically about byDchar for a range of char: Calling front causes the inner range to be popped such that it's front points to the first char not in the current code-point. For an inner range that updates its state in popFront, you are requesting more information from the original data-source than is currently needed. Sometimes this wouldn't be a problem, but anything that handles user input events in realtime can't afford to be blocked waiting for the next input in order to access the previous one, Once a wrapper uses eager popping it requires a redesign of the inner range (or another intermediate range that defers the work in popFront to front) to restore the fully lazy behaviour and avoid latency problems. It's inelegant and leads to a rats nest of guard variables that confuse the optimiser and increase register pressure. |
@John-Colvin popFront() does not have the behavior you describe - it does not wait for more input. It 'consumes' the current input. |
I kind of like it but turns out that I, for instance, can't use anything of this in std.regex. Reasons:
All of the above makes it next to useless in my use cases, might as well for other parser/lexer stuff. |
I think he means that I'm not sure that's a problem though. |
@WalterBright You misunderstand me. I am referring to |
@John-Colvin ok, done. |
Auto-merge toggled on |
At this point, I think this is good enough for inclusion. Anything left we can fix later. |
add byCodeunit, byChar, byWchar and byDchar to std.utf
thanks! |
{ | ||
if (!haveData) | ||
front; | ||
r.popFront(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum... looks like we were a little fast with the whole "lazy pop thing":
http://forum.dlang.org/thread/trqnqtzspoyhggvvftgp@forum.dlang.org
Long story short, if the last character in the string is actually truncated, then the popFront()
of the underlying range will fail, since it will already be empty.
A "fix" would be to add:
if (!r.empty)
r.popFront();
But that check could be more costly than what we are trying to save on? It needs to be fixed one way or the other. @WalterBright ?
I didn't check if byWchar
is also subject to the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this has been merged, please file bugzilla issues for problems with it.
Edit: filed a bug https://issues.dlang.org/show_bug.cgi?id=13535 |
Please file a bug for this, since this PR is already merged. |
These should form the foundation for higher performance string processing, as they:
Unit tests are at 100%.