Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Issue 12556 - Add persistent stdio.File.byLine #2077

Merged
merged 5 commits into from Jul 14, 2014

Conversation

ntrel
Copy link
Contributor

@ntrel ntrel commented Apr 10, 2014

https://issues.dlang.org/show_bug.cgi?id=12556

Add stdio.File.byLineCopy, which uses ByLineImpl internally.

  • By default, lines are immutable. This can be changed by using a mutable Char type.
  • ByLineCopyImpl.front is cached to avoid unnecessary allocations.
  • The API follows byLine for consistency.

{
ByLineImpl!(Char, Terminator) impl;
bool gotFront;
immutable(Char)[] line;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the design would be more efficient if instead, it simply returned a Char[]: No point in forcing the user to allocate "yet again", if he needed a mutable (but unique) buffer to begin with.

To keep things "safe by default", you could have byLineCopy take Char = immutable(char) as a parameter by default. This should give us a good mix of safe, but customizeable:

File myFile = ...;
string[] lines = myFile.byLineCopy().array(); //safe by copy
char[][] mutableLines = myFileCopy!char().array(); //efficient, without much overhead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would need an extra overload to byLineCopy to work, but is completely manageable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point in forcing the user to allocate "yet again", if he needed a mutable (but unique) buffer

I thought perhaps they would use plain byLine for that and [i]dup when ready, but you're right that byLineCopy would be more useful if it supported mutable copies.

you could have byLineCopy take Char = immutable(char) as a parameter by default

Another option would be to take Front = string, which might be nicer to allow passing [dw]string instead of (immutable [dw]char) as a template argument for multi-byte characters. Mutable copies would pass (char[]). Not a big difference, but makes the immutable case's syntax a bit nicer.

This would need an extra overload to byLineCopy to work

Do you mean because byLine's first template argument is Terminator, not Char (and byLineCopy is consistent with that order), or am I misunderstanding you?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to take Front = string , which might be nicer to allow passing [dw]string instead of (immutable [dw]char) as a template argument for multi-byte characters. Mutable copies would pass (char[]) . Not a big difference, but makes the immutable case's syntax a bit nicer.

If at all possible, the idea would be to remain consistent with vanilla byLine. Having one that take a Char type, while the other a String type, would be confusing :/

Do you mean because byLine 's first template argument is Terminator , not Char (and byLineCopy is consistent with that order)

Yes, that's exactly what I'm saying. This is something I've thought of introducing in byLine too anyways, to be able to myFile.byLine!dchar().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll make the Char = immutable char change.

This is something I've thought of introducing in byLine too anyways, to be able to myFile.byLine!dchar().

I understand now. That would work, although I have wondered about solving this by adding an overload that requires terminator to be passed explicitly as the first argument:

byLine(Char, Terminator)(Terminator terminator, KeepTerminator kt = KeepTerminator.no)

This would be more flexible as "\r\n" is usually needed on Windows, and this would be clearer to people who may assume file.byLine() is portable with line endings.

This overload could replace the other two, which could potentially be deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, my proposed overload switches the existing order of kt and terminator as well as Terminator and Char. AIUI, your overload would be:

byLine(Char = char)()

So if I wanted Windows newlines with multi-byte chars I would still have to write:

    byLine!(dstring, dchar)(KeepTerminator.no, "\r\n"d);

vs:

    byLine!dchar("\r\n"d);

edit: dchar -> dstring

@ntrel
Copy link
Contributor Author

ntrel commented Apr 11, 2014

Updated with Char = immutable char. I haven't added the overload yet, see my last comment.

@monarchdodra
Copy link
Collaborator

Well, I think the design is correct. Any other improvements to simplify the declaration of the byLineCopy (eg: byLineCopy!(wchar[])) can be addressed in a separate pull.

I think this function is both more efficient that what we can do with UFCS + map/dup, and more convenient to use. It is newbie friendly, and overall, good to have.

So this gets my stamp of approval for inclusion. I would like some feedback from others, before merging anything though. @andralex ? Is byLineCopy good?


Nitpick: If this is approved, I will request you give byChunk the same treatment.

@ntrel
Copy link
Contributor Author

ntrel commented Apr 14, 2014

@monarchdodra Thanks for the review. Rebased.

@property front()
{
if (!gotFront)
line = impl.front.dup;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should set gotFront to true somewhere...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed.

@ntrel ntrel changed the title Fix Issue 12556 - Add persistent byLine Fix Issue 12556 - Add persistent stdio.File.byLine May 22, 2014
@@ -1720,6 +1721,129 @@ the contents may well have changed).
return ByLine!(Char, Terminator)(this, keepTerminator, terminator);
}

private struct ByLineCopy(Char, Terminator)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this struct really necessary? Couldn't it basically just have been an alias?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean have byLineCopy return a RefCounted!(ByLineCopyImpl, ...) instance? I suppose it might be OK, but when I added the ref-counting for ByLine @monarchdodra preferred that I didn't use alias this, and instead added individual forwarding for each method - this avoids exposing any implementation details (such as internal types). Having ByLineCopy as a separate struct follows the same principle. I don't really mind either way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this struct really necessary? Couldn't it basically just have been an alias?

Technically, no, it's not needed. But arguably, it leaves a hole in the type system. Also, the "end type" would be RefCounted!ByLineCopyImpl, which could confuse users, who might think the return type was actually a ByLineCopyImpl, and something else ref counted it. By creating a wrapper ByLineCopy, we can keep things tidy in terms of type and interface.

@quickfur
Copy link
Member

ping
This looks ready to merge, is anything else holding it up?

auto byLineCopy(Terminator = char, Char = immutable char)
(KeepTerminator keepTerminator = KeepTerminator.no,
Terminator terminator = '\n')
if (isScalarType!Terminator)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does this indentation style comes from? It looks absolutely unreadable and does not fit rest of the Phobos I have been reading so far. I'd expect something like this:

auto byLineCopy(Terminator = char, Char = immutable char)
        (KeepTerminator keepTerminator = KeepTerminator.no,
        Terminator terminator = '\n')
    if (isScalarType!Terminator)
{
}

(actually Phobos hard line length limit is 120 as per http://dlang.org/dstyle.html so line breaks here can be reduced/removeD)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied the style from byLine. It's quite common in algorithm.d to have the opening bracket on column 0. Anyway, I indented the runtime args as suggested, but kept the constraint unindented - this is more in keeping with stdio.d IMO. Less line breaks would exceed the soft limit of 80, so I prefer it like this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. Looks like quite lot of stdio code has style notably different from i.e. std.algorithm :(

@mihails-strasuns
Copy link

Other than code style - LGTM

@mihails-strasuns
Copy link

Auto-merge toggled on

mihails-strasuns pushed a commit that referenced this pull request Jul 14, 2014
Fix Issue 12556 - Add persistent stdio.File.byLine
@mihails-strasuns mihails-strasuns merged commit 9c41595 into dlang:master Jul 14, 2014
@bearophile
Copy link

This syntax to produce mutable strings is not intuitive, I don't understand it:

byLineCopy!(char, char)(KeepTerminator.no)

And it seems KeepTerminator.no still doesn't work on Windows.

@ntrel ntrel deleted the byLineCopy branch July 15, 2014 11:05
@ntrel
Copy link
Contributor Author

ntrel commented Jul 15, 2014

The syntax was chosen to be consistent with byLine. We discussed alternatives here:
#2077 (comment)

And it seems KeepTerminator.no still doesn't work on Windows.

The unit test passes, can you be more specific?

@bearophile
Copy link

@ntrel: I think a syntax byLineCopy!(char, char) is confusing. And the docs don't explain well the situation.

(Regarding the newlines, I was wrong, sorry, it's my "special" console on Windows that has tricked me again. On the regular Windows console the bug isn't present).

@bearophile
Copy link

I have changed my mind again. If I save this program on Windows with normal Windows newlines (that are 2 chars long each) with the "test.d" name:

void main() {
    import std.stdio, std.string, std.array,
           std.algorithm;
    auto data1 = "test.d".File
                 .byLine
                 .map!(r => r.chomp.dup)
                 .join;
    auto data2 = "test.d"
                 .File.byLineCopy!(char, char)
                                  (KeepTerminator.no)
                 .join;
    writeln(data1.length, " ", data2.length);
}

It outputs 392 404, this shows that KeepTerminator.no is not stripping away the Windows newlines correctly (it means it's not stripping away both \n and \r every time, that chomp instead is correctly removing).

@ntrel
Copy link
Contributor Author

ntrel commented Jul 16, 2014

@bearophile:

@ntrel: I think a syntax byLineCopy!(char, char) is confusing. And the docs don't explain well the situation.

I want to improve that syntax (as discussed here), but it's difficult to get agreement. Incidentally, putting the terminator argument first and required would solve your problem below:

KeepTerminator.no is not stripping away the Windows newlines

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'. Personally, I don't think we should have a default if it's not portable.

@bearophile
Copy link

@ntrel:

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

At least this needs to be written clearly in the ddocs.

Personally, I don't think we should have a default if it's not portable.

Can't byLine/byLineCopy strip the newline with a bit smarter algorithm like chomp is doing?

@DmitryOlshansky
Copy link
Member

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

In this day and age we got to accept that line ending is a sequence:
\u{A} | \u{B} | \u{C} | \u{D} | \u{85} | \u{2028} | \u{2029} | \u{D A}

@mihails-strasuns
Copy link

Personally, I don't think we should have a default if it's not portable.

What is the problem in having platform-specific defaults?

@ntrel
Copy link
Contributor Author

ntrel commented Jul 17, 2014

What is the problem in having platform-specific defaults?

No problem, but it would be a breaking change. The same for making it automatically handle all newline variants.

@bearophile
Copy link

I think it's a welcome breaking change.

@ntrel
Copy link
Contributor Author

ntrel commented Jul 17, 2014

But people may have corrected for the trailing CR with byLine() on Windows by calling popBack(). Your breaking change would then silently break their correct, working code. That is a no-go IMO, and I doubt anyone would merge that change.

@mihails-strasuns
Copy link

and I doubt anyone would merge that change

Yes I agree, we can't do breakage that does not catch all problematic cases automatically.

@bearophile
Copy link

But people may have corrected for the trailing CR with byLine() on Windows by calling popBack().

This is a bit mind-bending.

@mihails-strasuns
Copy link

It is theoretically possible. It will result in silent will breakage => not an option even if completely unreasonable unless some sort of verification/porting tool is provided.

@DmitryOlshansky
Copy link
Member

@bearophile I mostly agree but I think we'd better get things right in std.io while std.stdio would hopefully drift along deprecation path with all of its "historical" choices.

@ntrel
Copy link
Contributor Author

ntrel commented Jul 23, 2014

@bearophile:

You need to pass "\r\n" as the separator, otherwise it uses the default of '\n'.

At least this needs to be written clearly in the ddocs.

Created #2364.

@bearophile
Copy link

@ntrel:

Created #2364.

What's needed in the byLine/byLineCopy documentation is a sign: "Warning: use "\r\n" on Windows otherwise your code will not work, because of a design mistake of those functions.".

But even this is not a good enough solution. I think that the breaking change (of stripping away newlines automatically in a bit smarter way using chomp) on the whole will cause less bugs than leaving the current wrong code/design untouched + comment. Sometimes breaking changes are the lesser evil.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants