WalterBright
Member

These should form the foundation for higher performance string processing, as they:

  1. do not auto-decode
  2. are pure/nothrow/safe, depending on the input parameter
  3. are lazy
  4. do not allocate memory
  5. replace invalid UTF sequences with U+FFFD, per the Unicode Standard 6.2, rather than throwing or asserting
  6. are composable - completely compatible with ranges and algorithms
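A minimal sketch of what "do not auto-decode" means in practice, assuming a current Phobos in which `std.utf.byCodeUnit` is available. The helper names `codeUnitCount` and `codePointCount` are hypothetical and only serve the illustration.

```d
import std.range.primitives : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Counts UTF-8 code units without decoding (hypothetical helper).
size_t codeUnitCount(string s)
{
    return s.byCodeUnit.walkLength;
}

// Counts code points; iterating a string directly auto-decodes to dchar.
size_t codePointCount(string s)
{
    return s.walkLength;
}

unittest
{
    // A plain string is a range of dchar, the wrapper is a range of char.
    static assert(is(ElementType!string == dchar));
    static assert(is(Unqual!(ElementType!(typeof("".byCodeUnit))) == char));

    // U+00E9 occupies two UTF-8 code units.
    assert(codePointCount("h\u00E9llo") == 5);
    assert(codeUnitCount("h\u00E9llo") == 6);
}
```

The block can be exercised with `dmd -unittest -main -run file.d`.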

Unit test coverage is at 100%.


auto byCodeunit(R)(R r) if (isSomeString!R)
{
alias tchar = Unqual!(ElementEncodingType!R);

This isn't used in this function.

@bearophile

If I use "SomeText".byCodeUnit.map!(c => c) what is the type of the given output items? Is it a dchar again?

@WalterBright
Member Author

@bearophile it's char
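The answer can be checked at compile time. This is a small sketch against a current Phobos; `mappedElementIsChar` is a hypothetical helper, and the check strips qualifiers because the element of a `string` is `immutable(char)`.

```d
import std.algorithm.iteration : map;
import std.range.primitives : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit;

// Reports whether mapping over byCodeUnit keeps char-sized elements
// instead of widening them to dchar.
bool mappedElementIsChar()
{
    auto mapped = "SomeText".byCodeUnit.map!(c => c);
    return is(Unqual!(ElementType!(typeof(mapped))) == char);
}

unittest
{
    assert(mappedElementIsChar());
}
```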

/********************************************
* Iterate a range of char, wchar, or dchars by code unit.
* The purpose is to bypass the special case decoding that
* std.array.put does to character arrays.
Contributor

Only std.array.put, not front? And as far as I can remember, put doesn't actually special-case character arrays, it just offers transcoding support for character ranges in general.

Collaborator

It doesn't, but it could. As a matter of fact, it's something I'd like to change once I get the time.

Member Author

I meant front().

@jacob-carlborg
Contributor

None of the Ddoc comments use any of the standard sections, i.e. Params, Returns and so on.

@property bool empty() { return r.length == 0; }
@property auto front() { return r[0]; }
void popFront() { r = r[1 .. $]; }
auto opIndex(size_t index) const { return r[index]; }
Contributor

I'm not so sure I like the style of these methods, putting everything on a single line.

Member Author

I prefer it for trivial functions. Otherwise things tend to get spread out, making it harder to digest at a glance.

Collaborator

ditto here about inout and ref.

@jacob-carlborg
Contributor

I would like to see some descriptions/annotations for the unit tests.

@DmitryOlshansky
Member

@WalterBright see also pull #2038, and add a test case for bad UTF-8 with values above 0x10_FFFF.

@WalterBright
Member Author

@DmitryOlshansky
Member

This PR addresses https://d.puremagic.com/issues/show_bug.cgi?id=12113

Point being that this pull has the same bug in UTF-8 decoding. I could make a separate bug report for these new primitives, but it's kind of meh.

@WalterBright
Member Author

> Point being that this pull has the same bug in UTF-8 decoding.

The same bug? What am I missing?

@WalterBright
Member Author

Updated to address comments.

}

// check for out of range only needed for 4 bytes
static if (i == 3)
Member

@WalterBright
I meant this check. It seems that either you've fixed it or I missed that it was here in the first place.
Anyway, all good.

Member Author

I had missed it until you pointed it out. It's fixed now, with a unittest.

@WalterBright
Member Author

@blackwhale and @monarchdodra, you are both right, @andralex I misled you. I was using the wrong test case. Will change it back.

@WalterBright
Member Author

@monarchdodra no, U+FFFF is not checked for. It's a bit in limbo, being reserved for the application.

@monarchdodra
Collaborator

> @monarchdodra no, U+FFFF is not checked for. It's a bit in limbo, being reserved for the application.

I don't think it'll happen very often, but it could be trivially fixed by setting frontChar's .init value to a value that is actually illegal and checked for, such as U+D800.

@WalterBright
Member Author

@monarchdodra I did a fair amount of thinking about that. We've used U+FFFF as a 'nan' value from the beginning without any reported issues with it. U+FFFF will remain in the transcoded UTF stream as U+FFFF, it won't get lost or missed. U+FFFF is not a transcoding error. U+D800 is not an "actually illegal" value according to the spec, and I think it would be a mistake to mix it up with U+FFFF. Etc.

@monarchdodra
Collaborator

> @monarchdodra I did a fair amount of thinking about that. We've used U+FFFF as a 'nan' value from the beginning without any reported issues with it. U+FFFF will remain in the transcoded UTF stream as U+FFFF, it won't get lost or missed. U+FFFF is not a transcoding error. U+D800 is not an "actually illegal" value according to the spec, and I think it would be a mistake to mix it up with U+FFFF. Etc.

You have misunderstood me, and I agree with you. I'm not suggesting we replace U+FFFF or anything. I agree it is not an 'illegal' value, and should remain in the stream. I'm just pointing out that you are also using the value U+FFFF (dchar.init) as a flag to mean "front not yet evaluated". As such, if there does happen to be a U+FFFF in your stream, calling front more than once in a row will pop it. In essence, you'd get two different values for two consecutive calls to front.

I'm instead suggesting you choose the value U+D800 as the "not yet evaluated" value: If there are any U+D800 values in your stream, they are flagged as "invalid", and replaced by U+FFFD. In essence, this makes it impossible to "have" U+D800 as a front, avoiding the issue above.

It's just an implementation tweak, not a change of behavior/handling for U+FFFF.
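The suggestion can be sketched as follows. This is not the PR's code: `CachedDchars`, `notDecoded` and `decodeAll` are hypothetical names, and the sketch assumes `std.utf.decodeFront` with `Yes.useReplacementDchar` as found in current Phobos. The sentinel is U+D800, which decoding never returns because lone surrogates are replaced by U+FFFD, so a real U+FFFF in the input can be cached and returned repeatedly.

```d
import std.typecons : Yes;
import std.utf : decodeFront;

// Hypothetical caching decoder illustrating the sentinel choice.
struct CachedDchars
{
    const(char)[] src;

    // Lone surrogate: decoding turns it into U+FFFD, so it can never be
    // a legitimately cached front value.
    enum dchar notDecoded = cast(dchar) 0xD800;
    dchar cache = notDecoded;

    @property bool empty() const
    {
        return cache == notDecoded && src.length == 0;
    }

    @property dchar front()
    {
        if (cache == notDecoded)
            cache = decodeFront!(Yes.useReplacementDchar)(src);
        return cache;
    }

    void popFront()
    {
        if (cache == notDecoded)
            decodeFront!(Yes.useReplacementDchar)(src); // skip an unread code point
        cache = notDecoded;
    }
}

// Decodes the whole input, reading front twice per element.
dchar[] decodeAll(const(char)[] input)
{
    dchar[] result;
    for (auto r = CachedDchars(input); !r.empty; r.popFront())
    {
        immutable first = r.front;
        assert(first == r.front); // a repeated front must not advance
        result ~= first;
    }
    return result;
}

unittest
{
    // 'a', U+FFFF encoded as EF BF BF, 'b'
    const(char)[] s = ['a', cast(char) 0xEF, cast(char) 0xBF, cast(char) 0xBF, 'b'];
    assert(decodeAll(s) == [cast(dchar) 'a', dchar.init, cast(dchar) 'b']);
}
```

A separate `bool` flag would work as well; the sentinel merely avoids the extra field.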

@WalterBright
Member Author

Updated so that:

  1. empty needn't be called
  2. front can be called repeatedly

static struct ByCodeUnitImpl
{
@property bool empty() const { return r.length == 0; }
@property auto ref front() const { return r[0]; }
Collaborator

This (and opIndex) should be inout, or you won't actually be able to mutate the returned reference:

    auto s = "hello".dup.byCodeUnit();
    s.front = 'H';

Unless that was done on purpose?
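A short sketch of the mutation this comment is asking for, assuming the `inout`/`ref` form of `front` that current Phobos ships for `byCodeUnit`. The helper `upcaseFirst` is hypothetical.

```d
import std.ascii : toUpper;
import std.utf : byCodeUnit;

// Upper-cases the first code unit of `buf` in place, writing through
// the reference returned by the range's front.
void upcaseFirst(char[] buf)
{
    auto r = buf.byCodeUnit;
    if (!r.empty)
        r.front = toUpper(r.front);
}

unittest
{
    char[] buf = "hello".dup;
    upcaseFirst(buf);
    assert(buf == "Hello");
}
```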

Member Author

done

@monarchdodra
Collaborator

I gave one final review to the entire pull. I like the new implementations. It looks good to ship as is.

I left some nitpicks; it would be nice if you could address them, but they aren't "blocker" issues.

@John-Colvin
Contributor

Doesn't this fail miserably with std.range.popFrontN ??

I'm not convinced that I like the whole design here*, but at the very least you need to have an accumulator for how many elements front should remove when called.

*Take a range of char r1 that blocks on popFront while waiting for data from the network or an expensive calculation, with its current front cached.

auto r2 = r1.byDchar;
if (!r2.empty)
{
    auto e = r2.front; // Whoops: access to an already known value is blocked
                       // waiting on getting the *next* data point.
}

@monarchdodra
Collaborator

> Doesn't this fail miserably with std.range.popFrontN ??

How so? It's no worse than with a string...?

> I'm not convinced that I like the whole design here*, but at the very least you need to have an accumulator for how many elements front should remove when called.

I don't see how byDchar is really any different from any other input range adaptor, such as filter, splitter or whatever. Simply put, yes, there is not a 1:1 mapping in terms of elements popped, but I don't see that as a problem.

> auto r2 = r1.byDchar;
> if (!r2.empty)
> {
>     auto e = r2.front; // Whoops: access to an already known value is blocked
>                        // waiting on getting the *next* data point.
> }

Again, I don't see how that's different from any other range adaptor. Besides, this could also block at if (!r2.empty). At the end of the day, if the range/stream halts waiting for data, then a wait will occur. !empty does not mean that data is in the buffer.

@John-Colvin
Contributor

auto s = "fdsa";
s.popFrontN(2);
assert(equal(s, "sa"));
s = "fdsa";
auto sByDchar = s.byDchar();
sByDchar.popFrontN(2);
assert(equal(sByDchar, "sa")); // fails: only the 'f' has been popped
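The behaviour this snippet expects can be written as a self-contained check. It is only a sketch against whichever Phobos version is installed, not against the revision under review here; `remainderAfterPop` is a hypothetical helper.

```d
import std.algorithm.comparison : equal;
import std.range : popFrontN;
import std.utf : byDchar;

// Pops n code points from s.byDchar and compares what is left
// with `expected`.
bool remainderAfterPop(string s, size_t n, string expected)
{
    auto r = s.byDchar;
    r.popFrontN(n);
    return r.equal(expected);
}

unittest
{
    assert(remainderAfterPop("fdsa", 2, "sa"));
    // The same must hold when a popped code point spans several code units.
    assert(remainderAfterPop("h\u00E9llo", 2, "llo"));
}
```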

@John-Colvin
Contributor

I think we still have a serious problem with the semantic specification of ranges and their primitives.

@monarchdodra
Collaborator

> //fails, only the 'f' has been popped.

Indeed, I missed that. There's an implementation problem here.

> I think we still have a serious problem with the semantic specification of ranges.

Just the implementation I think.

@monarchdodra
Collaborator

BTW, I hadn't suggested it yet, but we might want to split this pull between byCodeUnit and byXChar. I think byCodeUnit is much more important, and trivial, and virtually ready to go.

@John-Colvin
Contributor

> BTW, I hadn't suggested it yet, but we might want to split this pull between byCodeUnit and byXChar. I think byCodeUnit is much more important, and trivial, and virtually ready to go.

I agree. My problems are with the by*char ranges; byCodeUnit looks fine.

@WalterBright
Member Author

Addressed all the issues raised here.

@monarchdodra
Collaborator

Anybody see any outstanding (non-style related) issues left? @John-Colvin maybe? Will soon merge otherwise.

@John-Colvin
Contributor

I still think a range wrapper like byDchar shouldn't advance its inner range by more than is necessary to calculate the current front. Such wrappers should be maximally lazy w.r.t. the amount of data they request, to avoid unnecessary latency.

Having said that, I see no problem merging this as it is; it can always be changed later (becoming more lazy here isn't a breaking change unless someone is relying on undocumented implementation details).

@monarchdodra
Collaborator

> I still think a range wrapper like byDchar shouldn't advance its inner range by more than is necessary to calculate the current front.

Isn't that what it's doing though? I see most of the work being done in front? popFront only does any work if it's called and front has not yet been evaluated...

@John-Colvin
Contributor

Talking specifically about byDchar for a range of char: calling front causes the inner range to be popped such that its front points to the first char not in the current code point. For an inner range that updates its state in popFront, you are requesting more information from the original data source than is currently needed.

Sometimes this wouldn't be a problem, but anything that handles user input events in real time can't afford to be blocked waiting for the next input in order to access the previous one.

Once a wrapper uses eager popping, it requires a redesign of the inner range (or another intermediate range that defers the work in popFront to front) to restore the fully lazy behaviour and avoid latency problems. It's inelegant and leads to a rat's nest of guard variables that confuse the optimiser and increase register pressure.

@WalterBright
Member Author

@John-Colvin popFront() does not have the behavior you describe - it does not wait for more input. It 'consumes' the current input.

@DmitryOlshansky
Member

I kind of like it, but it turns out that I, for instance, can't use any of this in std.regex.

Reasons:

  • no tail-const as in slices, so I can't forward byCodeUnit!(string) to byCodeUnit!(const(char)[])
  • decoding is baked into a range - can't mix and match iterating by say char and dchar on the same string (though I'm getting rid of that in std.regex soon enough)
  • anyhow C's run-time memchr still needs to peek at naked character (or rather byte[]) arrays

All of the above makes it next to useless in my use cases, and the same may well hold for other parser/lexer code.

@monarchdodra
Collaborator

> @John-Colvin popFront() does not have the behavior you describe - it does not wait for more input. It 'consumes' the current input.

I think he means that front calls popFront() at the end of its execution, and then popFront() itself does nothing. So arguably, the underlying range is popped just a little bit eagerly.

I'm not sure that's a problem though.

@John-Colvin
Contributor

@WalterBright You misunderstand me. I am referring to byDchar.front calling r.popFront more eagerly than is needed.

@WalterBright
Member Author

@John-Colvin ok, done.

@monarchdodra
Collaborator

Auto-merge toggled on

@monarchdodra
Collaborator

At this point, I think this is good enough for inclusion. Anything left we can fix later.

monarchdodra added a commit that referenced this pull request Jun 7, 2014
add byCodeunit, byChar, byWchar and byDchar to std.utf
@monarchdodra monarchdodra merged commit 5319539 into dlang:master Jun 7, 2014
@WalterBright WalterBright deleted the byChar branch June 7, 2014 18:04
@WalterBright
Member Author

thanks!

{
if (!haveData)
front;
r.popFront();
Collaborator

Hum... looks like we were a little fast with the whole "lazy pop thing":
http://forum.dlang.org/thread/trqnqtzspoyhggvvftgp@forum.dlang.org

Long story short, if the last character in the string is actually truncated, then the popFront() of the underlying range will fail, since it will already be empty.

A "fix" would be to add:

if (!r.empty)
    r.popFront();

But that check could be more costly than what we are trying to save on? It needs to be fixed one way or the other. @WalterBright ?

I didn't check if byWchar is also subject to the issue.
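A possible regression check for the truncated-tail case, sketched against a current Phobos rather than the merged revision discussed here. `decodeLossy` is a hypothetical helper; the input ends with 0xC3, the lead byte of a two-byte sequence whose continuation byte is missing.

```d
import std.array : array;
import std.utf : byDchar, replacementDchar;

// Decodes possibly malformed UTF-8; a bad sequence becomes U+FFFD
// instead of throwing.
dchar[] decodeLossy(const(char)[] s)
{
    return s.byDchar.array;
}

unittest
{
    // 'a' followed by a truncated two-byte sequence at the very end.
    const(char)[] truncated = ['a', cast(char) 0xC3];
    assert(decodeLossy(truncated) == [cast(dchar) 'a', replacementDchar]);
}
```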

Member Author

Since this has been merged, please file bugzilla issues for problems with it.

@sinkuu
Contributor

sinkuu commented Sep 26, 2014

ByCodeUnitImpl.opSlice returns a slice of the original array. Shouldn't it return the same type as the range itself (as hasSlicing requires)?

Edit: filed a bug https://issues.dlang.org/show_bug.cgi?id=13535
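The requirement can be expressed as a compile-time check. This sketch assumes a Phobos version in which the reported issue has been resolved; `sliceKeepsRangeType` is a hypothetical helper.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : hasSlicing;
import std.utf : byCodeUnit;

// True when slicing the byCodeUnit range yields the range's own type
// and the range satisfies hasSlicing.
bool sliceKeepsRangeType()
{
    auto r = "hello".byCodeUnit;
    alias R = typeof(r);
    return hasSlicing!R && is(typeof(r[1 .. 3]) == R);
}

unittest
{
    assert(sliceKeepsRangeType());
    assert("hello".byCodeUnit[1 .. 3].equal("el"));
}
```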

@quickfur
Member

Please file a bug for this, since this PR is already merged.
