Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue 14861 - Error in stdio.d in LockingTextReader.readFront() #3622

Closed
wants to merge 1 commit into from

Conversation

aG0aep6G
Copy link
Contributor

@aG0aep6G aG0aep6G commented Sep 1, 2015

Calling ungetc more than once is not guaranteed to work. Replacing the
ungetc calls with ftell/fseek.

https://issues.dlang.org/show_bug.cgi?id=14861

immutable c = chars[$ - 1 - i];
enforce(ungetc(c, cast(FILE*) _f._p.handle) == c);
}
_f.seek(start); // rewind
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is horrible. Do we ungetc the file only to maintain range abstraction?? I think it's apparent that C's I/O doesn't map well to codepoint ranges so one more point for native i/o package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @schveiguy - any news btw ? ;)

Calling ungetc more than once is not guaranteed to work. Replacing the
ungetc calls with ftell/fseek.
@CyberShadow
Copy link
Member

Is this going to affect performance much? Maybe it can cringe do an ungetc if it read only one character?

@DmitryOlshansky
Copy link
Member

Maybe it can cringe do an ungetc if it read only one character?

I guess the main problem is DMC++ runtime which is lacking proper support for most basic features, other libc-s do just fine.

@MGWL
Copy link

MGWL commented Sep 2, 2015

I think algorithm with ungetc bad idea. The read-ahead should be carried not in the ring buffer of the driver input-output, and in own this to structure.

1 similar comment
@MGWL
Copy link

MGWL commented Sep 2, 2015

I think algorithm with ungetc bad idea. The read-ahead should be carried not in the ring buffer of the driver input-output, and in own this to structure.

@DmitryOlshansky
Copy link
Member

I think algorithm with ungetc bad idea. The read-ahead should be carried not in the ring buffer of the driver input-output, and in own this to structure.

Well the problem is that if somebody stops reading this range and craetes a new one with the current stream then whatever extra buffering on top of C's stdio will stay with the first range.. Ehm now the 2nd one will miss a lot of data that's supposed to be there.

The whole "just wrap C's IO in ranges" idea is bunkrupt from the start and I'm really surprized how long it managed to survive.

@DmitryOlshansky
Copy link
Member

Another idea would be to explicitly disable C's buffering and replicate that in std.stdio.File however that won't work if any I/O was done on a stream by a 3rd party e.g. a C library.

@aG0aep6G
Copy link
Contributor Author

aG0aep6G commented Sep 2, 2015

Is this going to affect performance much? Maybe it can cringe do an ungetc if it read only one character?

The performance hit is big. I tested reading 10 MB of some patterns (test code). Results:

Multiple ungetc calls (master):

"a" 575 ms, 344 μs, and 1 hnsec
"ä" 456 ms, 378 μs, and 7 hnsecs
"aaaaaaaaä" 559 ms, 995 μs, and 9 hnsecs

No ungetc, always tell and seek (what's currently in this PR):

"a" 1 sec, 975 ms, 638 μs, and 9 hnsecs
"ä" 1 sec, 122 ms, 810 μs, and 9 hnsecs
"aaaaaaaaä" 1 sec, 804 ms, 806 μs, and 4 hnsecs

Not good.

Call ungetc instead of seek when chars.length == 1 (always call tell):

"a" 755 ms, 866 μs, and 4 hnsecs
"ä" 1 sec, 120 ms, 895 μs, and 7 hnsecs
"aaaaaaaaä" 831 ms, 562 μs, and 4 hnsecs

Better, but not good.

Don't call tell and use relative seeking with SEEK_CUR instead:

"a" 556 ms, 168 μs, and 3 hnsecs
"ä" 1 sec, 57 ms, 159 μs, and 1 hnsec
"aaaaaaaaä" 673 ms, 818 μs, and 5 hnsecs

I'm not sure how portable SEEK_CUR is.

Tested on linux with dmd and no compilation flags whatsoever.

Seems that seek is just costly.

How public is LockingTextReader exactly? I mean, it's not private, but it's not documented either. Could we change its interface or drop it completely? Looks like it's only used in readf.

@DmitryOlshansky
Copy link
Member

How public is LockingTextReader exactly? I mean, it's not private, but it's not documented either. Could we change its interface or drop it completely? Looks like it's only used in readf.

As in lockingTextReader public. The latter being quite convenient input range for dchars.

@aG0aep6G
Copy link
Contributor Author

aG0aep6G commented Sep 2, 2015

As in lockingTextReader public. The latter being quite convenient input range for dchars.

You mean there's a lockingTextReader (lower L) function somewhere? I can't find it.

@DmitryOlshansky
Copy link
Member

You mean there's a lockingTextReader (lower L) function somewhere? I can't find it.

Mistook for lockingTextWriter. Anyway the thing is public so it very well may be used... even if undocumented.

@quickfur
Copy link
Member

quickfur commented Sep 2, 2015

@DmitryOlshansky I agree wrapping around C's stdio is kludgy. It's the reason for a lot of back-and-forth in std.stdio, e.g., the drama with File originally being a "thin" wrapper around FILE*, then we have to add hacks to make range API work, then we have to use ungetc, then we have to turn it into a struct-wrapped reference type, then we have to make it a ref-counted type, ad nauseaum.

Maybe what we really need to do is to implement our own stdio, that directly uses OS APIs to read file data. Handle all buffering in D instead of trying to kludge around C's API. (Where's the long-promised but never materialized std.io?!)

@DmitryOlshansky
Copy link
Member

(Where's the long-promised but never materialized std.io?!)

Stay tuned.

@quickfur
Copy link
Member

quickfur commented Sep 2, 2015

LOL... I've been tuned for the past, oh, how many years now? and have yet to see anything materialize.

@DmitryOlshansky
Copy link
Member

LOL... I've been tuned for the past, oh, how many years now? and have yet to see anything materialize.

And I've participated in 3 iterations of the std.io effort. Patience, patience it's coming.

@quickfur
Copy link
Member

quickfur commented Sep 2, 2015

On a more serious note, I agree that any algorithm that uses ungetc is fundamentally flawed. Any read-ahead should be kept in the range itself. As for creating a new range that "loses" some characters because of read-ahead, why not make lockingTextReader a reference type to a unique instance per File? The idea of multiple readers per File already doesn't work anyway (e.g., if I create multiple readers then hand them to different threads -- the data read will basically be randomly interspersed bytes from the underlying file, which is almost always not what is wanted).

@quickfur
Copy link
Member

quickfur commented Sep 2, 2015

As for std.io, is the current effort on github somewhere?

@schveiguy
Copy link
Member

LOL... I've been tuned for the past, oh, how many years now?

This is a fair and deserved dig. I have been very poor at completing this.

As for std.io, is the current effort on github somewhere?

I am ashamed that I have not pushed any code to github for my latest effort, because I've designed it as I was writing (and a lot of things have changed), and I didn't want an ugly history with all kinds of deletes and rewrites, plus some files there now probably will be removed. However, it IS forthcoming, I'm going to commit more time for this.

The previous efforts are here:

https://github.com/schveiguy/phobos/tree/new-io
https://github.com/schveiguy/phobos/tree/new-io2
https://github.com/schveiguy/phobos/tree/new-io3

The last one is very similar to what I have now, but I've redesigned a crucial portion of it.

@quickfur
Copy link
Member

quickfur commented Sep 2, 2015

@schveiguy I apologize for making light of the situation; I was not aware of all the work that has gone into the prospective new std.io.

Maybe you could put the code up on github in a separate repo or something like that, which will be merged into Phobos when it's ready. That way people could see what's happening, and more importantly, contribute. For something as crucial as std.io, I think a group effort would do a lot of good.

@schveiguy
Copy link
Member

I apologize for making light of the situation

Not necessary, I agree that I should have finished it by now :)

Maybe you could put the code up on github in a separate repo or something like that

I will push the code to github as soon as it's working to any degree (and that should be soon). At this point, I'm unsure where it will fit in, the design is a drastic change from standard io mechanisms.

@WalterBright
Copy link
Member

Seems to me that the problem is that the 'fix' for LockingTextReader:

#2663

attempts to do decoding. NONONONONONONONONONONONOOOO

There is no reason whatsoever to decode when reading characters and putting them into a string. It should simply have made it a range over char.

@DmitryOlshansky
Copy link
Member

attempts to do decoding. NONONONONONONONONONONONOOOO

Stop it. LockingTextReader was SUPPOSED to do so, except that it was broken.
And we can't silently change dchar to char[] API anyhow.

@WalterBright
Copy link
Member

Low level code should not be decoding. It's just nothing but trouble, as repeated regressions show.

@DmitryOlshansky
Copy link
Member

It's just nothing but trouble, as repeated regressions show.

That ship has long sailed when std.stdio was designed. Now we either fix it or not, but breaking API is out of question.

@WalterBright
Copy link
Member

It's already broken. Before the last 'fix', it simply cast char to dchar. And frankly, I don't believe in throwing the whole thing out for std.io just because of this. And lastly, as the benchmarks show, this decoding is absolutely miserable.

@DmitryOlshansky
Copy link
Member

And frankly, I don't believe in throwing the whole thing out for std.io just because of this

Frankly I never like std.stdio nor wrapping C's IO was EVER fastest and/or easiest of things. The cost of maintenance goes up with every run-time we try to support. There is a case for std.io and I'm helping the effort as time permits.

And lastly, as the benchmarks show, this decoding is absolutely miserable.

Want no decoding - use raw reads. Anyhow I don't decode and it takes some WORK to avoid it in every case. The mistake with decoding was made, now we need to clean up the consequences properly not by waving hands and silently disabling decoding in existing code.

@WalterBright
Copy link
Member

LockingTextReader isn't even documented. Fixing it won't break things. The only user of it is readf(), which can use .byDchar if it must, although the decoding should only be done as necessary by formattedRead.

The fix is being applied in the wrong place.

@DmitryOlshansky
Copy link
Member

LockingTextReader isn't even documented

Then let's start with making it private. But you know, bug reports came for it even though it's undocumented. Keeping private stuff undocumented doesn't prevent people from using it. Keeping it private does.

The fix is being applied in the wrong place.

That might be true.

@WalterBright
Copy link
Member

bug reports came for it even though it's undocumented

3 bug reports. 2 were reporting a bug in readf. The third https://issues.dlang.org/show_bug.cgi?id=12320 doesn't say where the use comes from - could have been isolated down from a call to readf as well.

We don't need to maintain undocumented interfaces. They are implicitly use at your own risk.

@DmitryOlshansky
Copy link
Member

We don't need to maintain undocumented interfaces. They are implicitly use at your own risk.

Cool - so let's close this pull and start with making lockingTextReader private.

@CyberShadow
Copy link
Member

Then let's start with making it private. But you know, bug reports came for it even though it's undocumented.

The original test case does not mention LockingTextReader anywhere, only a reduced test case in a comment.

@aG0aep6G
Copy link
Contributor Author

aG0aep6G commented Oct 5, 2015

New PR that makes LockingTextReader private and turns it into a char range: #3696

@WalterBright
Copy link
Member

Things seem to be worse than I expected. formattedRead is doing one character lookahead, but readf closes down its copy of LockingTextReader after formattedRead returns, which causes the buffered looked ahead character to be lost. When readf is called again, the looked ahead character is lost.

This isn't the protocol for InputRange.

@CyberShadow
Copy link
Member

What about ungetc?

@WalterBright
Copy link
Member

ungetc fails if a dchar is pushed back.

@CyberShadow
Copy link
Member

Right, I thought the new thing was going to use chars.

@WalterBright
Copy link
Member

Right, but when you use byDchar to read more than one in order to decode, then you've looked ahead more than one.

@WalterBright
Copy link
Member

I have to do a bit more investigating to see just where the fault lies.

@WalterBright
Copy link
Member

It's looking like formattedRead, when it is reading dchars, does a popFront after the lookahead. If formattedRead is just reading chars, this does not happen.

Correction: it's byDchar doing the extra popFront.

https://issues.dlang.org/show_bug.cgi?id=15164

@WalterBright
Copy link
Member

More research. Fixing byDchar won't fix the lookahead problem. The fix has to be in formattedRead, and the fix for that is for char ranges, it only reads as far as necessary (which should only be one lookahead) and decode only when absolutely necessary. This should run faster, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants