Fix Issue 12556 - Add persistent stdio.File.byLine #2077

ntrel · 2014-04-10T15:11:06Z

https://issues.dlang.org/show_bug.cgi?id=12556

Add stdio.File.byLineCopy, which uses ByLineImpl internally.

By default, lines are immutable. This can be changed by using a mutable Char type.
ByLineCopyImpl.front is cached to avoid unnecessary allocations.
The API follows byLine for consistency.

monarchdodra · 2014-04-10T16:15:33Z

std/stdio.d

+    {
+        ByLineImpl!(Char, Terminator) impl;
+        bool gotFront;
+        immutable(Char)[] line;


I think the design would be more efficient if instead, it simply returned a Char[]: No point in forcing the user to allocate "yet again", if he needed a mutable (but unique) buffer to begin with.

To keep things "safe by default", you could have byLineCopy take Char = immutable(char) as a parameter by default. This should give us a good mix of safe, but customizeable:

File myFile = ...; string[] lines = myFile.byLineCopy().array(); //safe by copy char[][] mutableLines = myFileCopy!char().array(); //efficient, without much overhead.

This would need an extra overload to byLineCopy to work, but is completely manageable.

No point in forcing the user to allocate "yet again", if he needed a mutable (but unique) buffer

I thought perhaps they would use plain byLine for that and [i]dup when ready, but you're right that byLineCopy would be more useful if it supported mutable copies.

you could have byLineCopy take Char = immutable(char) as a parameter by default

Another option would be to take Front = string, which might be nicer to allow passing [dw]string instead of (immutable [dw]char) as a template argument for multi-byte characters. Mutable copies would pass (char[]). Not a big difference, but makes the immutable case's syntax a bit nicer.

This would need an extra overload to byLineCopy to work

Do you mean because byLine's first template argument is Terminator, not Char (and byLineCopy is consistent with that order), or am I misunderstanding you?

Another option would be to take Front = string , which might be nicer to allow passing [dw]string instead of (immutable [dw]char) as a template argument for multi-byte characters. Mutable copies would pass (char[]) . Not a big difference, but makes the immutable case's syntax a bit nicer.

If at all possible, the idea would be to remain consistent with vanilla byLine. Having one that take a Char type, while the other a String type, would be confusing :/

Do you mean because byLine 's first template argument is Terminator , not Char (and byLineCopy is consistent with that order)

Yes, that's exactly what I'm saying. This is something I've thought of introducing in byLine too anyways, to be able to myFile.byLine!dchar().

OK, I'll make the Char = immutable char change.

This is something I've thought of introducing in byLine too anyways, to be able to myFile.byLine!dchar().

I understand now. That would work, although I have wondered about solving this by adding an overload that requires terminator to be passed explicitly as the first argument:

byLine(Char, Terminator)(Terminator terminator, KeepTerminator kt = KeepTerminator.no)

This would be more flexible as "\r\n" is usually needed on Windows, and this would be clearer to people who may assume file.byLine() is portable with line endings.

This overload could replace the other two, which could potentially be deprecated.

To be clear, my proposed overload switches the existing order of kt and terminator as well as Terminator and Char. AIUI, your overload would be:

byLine(Char = char)()

So if I wanted Windows newlines with multi-byte chars I would still have to write:

byLine!(dstring, dchar)(KeepTerminator.no, "\r\n"d);

vs:

byLine!dchar("\r\n"d);

edit: dchar -> dstring

ntrel · 2014-04-11T14:44:55Z

Updated with Char = immutable char. I haven't added the overload yet, see my last comment.

monarchdodra · 2014-04-14T11:18:04Z

Well, I think the design is correct. Any other improvements to simplify the declaration of the byLineCopy (eg: byLineCopy!(wchar[])) can be addressed in a separate pull.

I think this function is both more efficient that what we can do with UFCS + map/dup, and more convenient to use. It is newbie friendly, and overall, good to have.

So this gets my stamp of approval for inclusion. I would like some feedback from others, before merging anything though. @andralex ? Is byLineCopy good?

Nitpick: If this is approved, I will request you give byChunk the same treatment.

ntrel · 2014-04-14T12:16:03Z

@monarchdodra Thanks for the review. Rebased.

monarchdodra · 2014-04-17T13:16:16Z

std/stdio.d

+        @property front()
+        {
+            if (!gotFront)
+                line = impl.front.dup;


I think this should set gotFront to true somewhere...

Oops, fixed.

JakobOvrum · 2014-05-25T10:12:13Z

std/stdio.d

@@ -1720,6 +1721,129 @@ the contents may well have changed).
        return ByLine!(Char, Terminator)(this, keepTerminator, terminator);
    }

+    private struct ByLineCopy(Char, Terminator)


Is this struct really necessary? Couldn't it basically just have been an alias?

You mean have byLineCopy return a RefCounted!(ByLineCopyImpl, ...) instance? I suppose it might be OK, but when I added the ref-counting for ByLine @monarchdodra preferred that I didn't use alias this, and instead added individual forwarding for each method - this avoids exposing any implementation details (such as internal types). Having ByLineCopy as a separate struct follows the same principle. I don't really mind either way.

Is this struct really necessary? Couldn't it basically just have been an alias?

Technically, no, it's not needed. But arguably, it leaves a hole in the type system. Also, the "end type" would be RefCounted!ByLineCopyImpl, which could confuse users, who might think the return type was actually a ByLineCopyImpl, and something else ref counted it. By creating a wrapper ByLineCopy, we can keep things tidy in terms of type and interface.

quickfur · 2014-07-11T18:33:19Z

ping
This looks ready to merge, is anything else holding it up?

mihails-strasuns · 2014-07-11T21:54:37Z

std/stdio.d

+    auto byLineCopy(Terminator = char, Char = immutable char)
+    (KeepTerminator keepTerminator = KeepTerminator.no,
+            Terminator terminator = '\n')
+    if (isScalarType!Terminator)


where does this indentation style comes from? It looks absolutely unreadable and does not fit rest of the Phobos I have been reading so far. I'd expect something like this:

auto byLineCopy(Terminator = char, Char = immutable char) (KeepTerminator keepTerminator = KeepTerminator.no, Terminator terminator = '\n') if (isScalarType!Terminator) { }

(actually Phobos hard line length limit is 120 as per http://dlang.org/dstyle.html so line breaks here can be reduced/removeD)

I just copied the style from byLine. It's quite common in algorithm.d to have the opening bracket on column 0. Anyway, I indented the runtime args as suggested, but kept the constraint unindented - this is more in keeping with stdio.d IMO. Less line breaks would exceed the soft limit of 80, so I prefer it like this.

Ok, thanks. Looks like quite lot of stdio code has style notably different from i.e. std.algorithm :(

mihails-strasuns · 2014-07-11T21:55:32Z

Other than code style - LGTM

mihails-strasuns · 2014-07-14T11:35:59Z

Auto-merge toggled on

Fix Issue 12556 - Add persistent stdio.File.byLine

bearophile · 2014-07-14T17:08:57Z

This syntax to produce mutable strings is not intuitive, I don't understand it:

byLineCopy!(char, char)(KeepTerminator.no)

And it seems KeepTerminator.no still doesn't work on Windows.

ntrel · 2014-07-15T11:12:33Z

The syntax was chosen to be consistent with byLine. We discussed alternatives here:
#2077 (comment)

And it seems KeepTerminator.no still doesn't work on Windows.

The unit test passes, can you be more specific?

bearophile · 2014-07-15T23:00:01Z

@ntrel: I think a syntax byLineCopy!(char, char) is confusing. And the docs don't explain well the situation.

(Regarding the newlines, I was wrong, sorry, it's my "special" console on Windows that has tricked me again. On the regular Windows console the bug isn't present).

bearophile · 2014-07-15T23:14:54Z

I have changed my mind again. If I save this program on Windows with normal Windows newlines (that are 2 chars long each) with the "test.d" name:

void main() {
    import std.stdio, std.string, std.array,
           std.algorithm;
    auto data1 = "test.d".File
                 .byLine
                 .map!(r => r.chomp.dup)
                 .join;
    auto data2 = "test.d"
                 .File.byLineCopy!(char, char)
                                  (KeepTerminator.no)
                 .join;
    writeln(data1.length, " ", data2.length);
}

It outputs 392 404, this shows that KeepTerminator.no is not stripping away the Windows newlines correctly (it means it's not stripping away both \n and \r every time, that chomp instead is correctly removing).

ntrel · 2014-07-16T15:57:59Z

@bearophile:

@ntrel: I think a syntax byLineCopy!(char, char) is confusing. And the docs don't explain well the situation.

I want to improve that syntax (as discussed here), but it's difficult to get agreement. Incidentally, putting the terminator argument first and required would solve your problem below:

KeepTerminator.no is not stripping away the Windows newlines

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'. Personally, I don't think we should have a default if it's not portable.

bearophile · 2014-07-16T21:15:10Z

@ntrel:

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

At least this needs to be written clearly in the ddocs.

Personally, I don't think we should have a default if it's not portable.

Can't byLine/byLineCopy strip the newline with a bit smarter algorithm like chomp is doing?

DmitryOlshansky · 2014-07-16T21:19:25Z

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

In this day and age we got to accept that line ending is a sequence:
\u{A} | \u{B} | \u{C} | \u{D} | \u{85} | \u{2028} | \u{2029} | \u{D A}

mihails-strasuns · 2014-07-17T13:07:28Z

Personally, I don't think we should have a default if it's not portable.

What is the problem in having platform-specific defaults?

ntrel · 2014-07-17T16:01:32Z

What is the problem in having platform-specific defaults?

No problem, but it would be a breaking change. The same for making it automatically handle all newline variants.

bearophile · 2014-07-17T16:29:52Z

I think it's a welcome breaking change.

ntrel · 2014-07-17T16:46:15Z

But people may have corrected for the trailing CR with byLine() on Windows by calling popBack(). Your breaking change would then silently break their correct, working code. That is a no-go IMO, and I doubt anyone would merge that change.

mihails-strasuns · 2014-07-17T16:49:57Z

and I doubt anyone would merge that change

Yes I agree, we can't do breakage that does not catch all problematic cases automatically.

bearophile · 2014-07-17T20:05:37Z

But people may have corrected for the trailing CR with byLine() on Windows by calling popBack().

This is a bit mind-bending.

mihails-strasuns · 2014-07-17T20:22:31Z

It is theoretically possible. It will result in silent will breakage => not an option even if completely unreasonable unless some sort of verification/porting tool is provided.

DmitryOlshansky · 2014-07-18T13:51:34Z

@bearophile I mostly agree but I think we'd better get things right in std.io while std.stdio would hopefully drift along deprecation path with all of its "historical" choices.

ntrel · 2014-07-23T14:38:58Z

@bearophile:

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

At least this needs to be written clearly in the ddocs.

Created #2364.

bearophile · 2014-07-24T08:43:25Z

@ntrel:

Created #2364.

What's needed in the byLine/byLineCopy documentation is a sign: "Warning: use "\r\n" on Windows otherwise your code will not work, because of a design mistake of those functions.".

But even this is not a good enough solution. I think that the breaking change (of stripping away newlines automatically in a bit smarter way using chomp) on the whole will cause less bugs than leaving the current wrong code/design untouched + comment. Sometimes breaking changes are the lesser evil.

monarchdodra reviewed Apr 10, 2014
View reviewed changes

ntrel added 2 commits April 14, 2014 13:09

Fix Issue 12556 - Add persistent byLine

22edb2e

Make byLineCopy support mutable line copies

5d7e008

monarchdodra reviewed Apr 17, 2014
View reviewed changes

Fix caching byLineCopy.front (oops)

adafe19

ntrel changed the title ~~Fix Issue 12556 - Add persistent byLine~~ Fix Issue 12556 - Add persistent stdio.File.byLine May 22, 2014

JakobOvrum reviewed May 25, 2014
View reviewed changes

Clarify when to use byLineCopy vs byLine

7c4e7f5

mihails-strasuns reviewed Jul 11, 2014
View reviewed changes

Indent parameters

73da5d2

mihails-strasuns pushed a commit that referenced this pull request Jul 14, 2014

Merge pull request #2077 from ntrel/byLineCopy

9c41595

Fix Issue 12556 - Add persistent stdio.File.byLine

mihails-strasuns merged commit 9c41595 into dlang:master Jul 14, 2014

ntrel deleted the byLineCopy branch July 15, 2014 11:05

Fix Issue 12556 - Add persistent stdio.File.byLine #2077

Fix Issue 12556 - Add persistent stdio.File.byLine #2077

Conversation

ntrel commented Apr 10, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ntrel commented Apr 11, 2014

monarchdodra commented Apr 14, 2014

ntrel commented Apr 14, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

quickfur commented Jul 11, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mihails-strasuns commented Jul 11, 2014

mihails-strasuns commented Jul 14, 2014

bearophile commented Jul 14, 2014

ntrel commented Jul 15, 2014

bearophile commented Jul 15, 2014

bearophile commented Jul 15, 2014

ntrel commented Jul 16, 2014

bearophile commented Jul 16, 2014

DmitryOlshansky commented Jul 16, 2014

mihails-strasuns commented Jul 17, 2014

ntrel commented Jul 17, 2014

bearophile commented Jul 17, 2014

ntrel commented Jul 17, 2014

mihails-strasuns commented Jul 17, 2014

bearophile commented Jul 17, 2014

mihails-strasuns commented Jul 17, 2014

DmitryOlshansky commented Jul 18, 2014

ntrel commented Jul 23, 2014

bearophile commented Jul 24, 2014