Skip to content

[API Proposal]: Get/Enumerate line spans from TextReader #114987

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pawchen opened this issue Apr 24, 2025 · 10 comments
Open

[API Proposal]: Get/Enumerate line spans from TextReader #114987

pawchen opened this issue Apr 24, 2025 · 10 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO untriaged New issue has not been triaged by the area owner

Comments

@pawchen
Copy link

pawchen commented Apr 24, 2025

Background and motivation

When parsing a file line by line, and there's no need to keep the whole line as a string, span based APIs will be helpful for reducing unnecessary allocations up to the file size, as we already have span based Regex APIs.

API Proposal

namespace System.IO;

public class TextReader
{
    public ReadOnlySpan<char> ReadLineAsSpan();
    public TextReaderLineEnumerator EnumerateLines();
}

public ref struct TextReaderLineEnumerator
{
    // Similar to SpanLineEnumerator
}

API Usage

ReadLineAsSpan:

using var reader = new StreamReader(file);

while (!reader.EndOfStream)
{
    ReadOnlySpan<char> line = reader.ReadLineAsSpan();   // <----- gets the span

    if (line.TrimStart().StartsWith('['))
    {
        // continue parsing with Regex
        foreach (var match in OpenBlockRegex.EnumerateMatches(line))
        {
        }
    }
        // or ignore the line
}

EnumerateLines:

using var reader = new StreamReader(file);

foreach (ReadOnlySpan<char> line in reader.EnumerateLines())
{
    if (line.TrimStart().StartsWith('['))
    {
        // continue parsing with Regex
        foreach (var match in OpenBlockRegex.EnumerateMatches(line))
        {
        }
    }
        // or ignore the line
}

Alternative Designs

/// <returns> false when no more lines can be read (EndOfStream)</returns>
/// <param name="length"> -1 when the span.Length is less than the line</param>
bool TryReadLine(Span<char> destination, out int length);

Risks

For the span returning proposal, subsequent read will invalidate the span, user should not keep the value before the next call.

For the TryReadLine design, user need to manage the buffer by themselves.

@pawchen pawchen added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 24, 2025
@ghost ghost added the area-System.IO label Apr 24, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 24, 2025
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

@huoyaoyuan
Copy link
Member

There will be lifetime problem to return spans.

The API shape of TextReader isn't friendly for span-based processing. The implementations don't keep a buffer for chars either, but using string builders.

The shape of PipeReader would be more suitable for this.

@huoyaoyuan
Copy link
Member

The key difference between TextReader and Split(ReadOnlySpan) is that the underlying data isn't necessarily stored in UTF-16, thus TextReader is usually doing decoding, and there's no existing storage to expose.

@pawchen
Copy link
Author

pawchen commented Apr 24, 2025

I see. The current implementation of StreamReader holds a byte buffer and a char buffer at the same time, for ReadLine it uses a StringBuilder which internally could use linked list storage thus could not get a single span.

Any chances the char buffer be resizable, or be backed by ArrayPool/Pipelines? The Pipelines introduction is just too long and warns about a lot of gotchas, and doesn't include how to get a line span, but StringBuilder instead. Looks way more complex compared to the example I have up here.

I'm OK with TextReader's ReadLineAsSpan being abstract or throws by default. Need StreamReader to work and StringReader to test.

As for decoding overhead, yes, ultimately one would need some sort of Utf8StreamReader and some sort of Utf8Regex to avoid decoding. But those sound like asking too much?

@huoyaoyuan
Copy link
Member

Looks way more complex compared to the example I have up here.

The example in OP doesn't work at all because of lifetime violation. It should at least be:

bool TryReadLine(Span<char> destination);

@pawchen
Copy link
Author

pawchen commented Apr 25, 2025

This shape could work, with an additional parameter out int length. It leaves the user to manage the buffer on their own, might end up using the similar code from ReadLineAsyncInternal but synchronous instead.

I don't get why returning span violates lifetime though. The following code compiles fine.

    class MyReader
    {
        char[] _charBuffer = ['1', '\n'];
        public ReadOnlySpan<char> ReadLine() => _charBuffer.AsSpan(0..1);
    }

@huoyaoyuan
Copy link
Member

This shape could work, with an additional parameter out int length. It leaves the user to manage the buffer on their own, might end up using the similar code from ReadLineAsyncInternal but synchronous instead.

No, this requires the char buffer to be persisted as a state of the reader. Subsequent read will either invalidate the span, or prevent it to be pooled in any form at all.

@pawchen
Copy link
Author

pawchen commented Apr 25, 2025

This shape could work

I meant the TryReadLine you proposed.

@pawchen
Copy link
Author

pawchen commented Apr 25, 2025

.. this requires the char buffer to be persisted as a state of the reader. Subsequent read will either invalidate the span ..

Simply throw away the old value and get the new one. Spans are not meant to be kept for a long time by callers, for example, the Span returned by CollectionsMarshal.AsSpan are not expected to be used again after List.Add.

@huoyaoyuan
Copy link
Member

Simply throw away the old value and get the new one. Spans are not meant to be kept for a long time by callers, for example, the Span returned by CollectionsMarshal.AsSpan are not expected to be used again after List.Add.

This is considered unsafe and not widely used, only with strong performance considerations. TextReader is already not performant enough, so it's unlikely to have this level unsafety on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO untriaged New issue has not been triaged by the area owner
Projects
None yet
Development

No branches or pull requests

2 participants