Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CodeBuffer: implement .subSequence(), plus tests #2

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fge
Copy link

@fge fge commented Mar 22, 2015

There is no reason why CodeBuffer isn't able to support it.

Just use a CharBuffer to .wrap() the buffer, and use this CharBuffer's
.subSequence() with the indices adjusted according to the buffer's current
position.

@Godin
Copy link
Member

Godin commented Mar 25, 2015

Thanks for PR, it will be reviewed during next sprint on SSLR.

@Godin
Copy link
Member

Godin commented May 15, 2015

Implementation of method "subSequence" can be copy-on-read or wrapping of underlying array, and both have their advantages and disadvantages depending on scenarios of usage. Since currently there is no potential consumers of this feature nor in SSLR itself, nor in main consumers of this library (SonarQube plugins), scenarios of usage are unknown and method left unimplemented as a kind of defensive approach against disadvantages of both implementations.

Now wrap(...).subSequence(...) will create CharBuffer, which will retain reference on underlying byte array from CodeBuffer, opening door for possible memory leaks. This PR is not backed by any usage scenario, so that unlikely to be accepted.

@Godin Godin self-assigned this May 15, 2015
@fge
Copy link
Author

fge commented May 15, 2015

Well, another possibility would have been to wrap the whole byte[] into a CharBuffer to begin with. A .subsequence() would then have been marginally cheaper.

The problem here is basically that CodeBuffer breaks the CharSequence contract; this patch solves it, at a cost which is on the whole pretty negligible since no SonarQube code uses that... BUT if client code wants to use that (I want; and since CodeBuffer doesn't support that I have to create a wrapper just for that), I believe the principle of least surprise here should have priority over _very_hypothetical memory leaks.

@Godin
Copy link
Member

Godin commented May 15, 2015

Not hypothetical at all and defensive principle is exactly to avoid bad surprises. And again - "I want" is not a concrete usage scenario.

@fge
Copy link
Author

fge commented May 15, 2015

Sorry but those are purely hypothetical. And like I said, another implementation (I can provide it) would just have been to wrap the whole char array. This way a call to .subSequence() would just reuse that instance-wide charbuffer.

And I maintain: the principle of least surprise takes priority over anything else. CodeBuffer implements CharSequence, therefore it should obey the CharSequence contract, and CharSequence never says anywhere, in any way, shape or form, that.subSequence() may throw an UnsupportedOperationException.

I'll redo the patch with a global CharBuffer.

@fge
Copy link
Author

fge commented May 15, 2015

Done.

Note however that this could be vastly improved. In particular, it looks like pop() could be rewritten as a simple get() on the buffer which in turn means the offset calculations could be simplified.

If you want me to work on this aspect, I can do that too.

Not only that but it can also lead the way to real unicode support, as I do in grappa.

@Godin
Copy link
Member

Godin commented May 15, 2015

And I repeat - not hypothetical, but a real issue faced in past. Current implementation is the principle of least surprise for us. Update has exactly the same flaw as previous change. Instead of being aggressively pushy, might be better to start listening to what others say. So once again - how you use subsequence (usage scenario)? Without this another solution to problem of contract - remove inheritance of CharSequence, ticket created - http://jira.sonarsource.com/browse/SSLR-372

@fge
Copy link
Author

fge commented May 16, 2015

In grappa, I use what is called an InputBuffer.

What parboiled did in the past, for no good reason, was to specialize implementations for String and char[] where CharSequence could do it all. I unified this into a single implementation relying only on a CharSequence.

I was quite happy to see that CodeBuffer implemented it... Until I saw that in fact no it didn't. Therefore I had to reimplement the interface:

/**
 * An {@link InputBuffer} over a {@link CodeReader}
 *
 * <p>Unfortunately, this is required. Sonar's {@link CodeBuffer} does not
 * support {@link CharSequence#subSequence(int, int)} for no good reason that I
 * can see... So this is basically a {@link CharSequenceInputBuffer} with
 * subsequence extraction rewritten.</p>
 *
 */
@Immutable
public final class CodeReaderInputBuffer
    implements InputBuffer
{
    private static final ExecutorService EXECUTOR_SERVICE;

    static {
        final ThreadFactory factory = new ThreadFactoryBuilder()
            .setDaemon(true).setNameFormat("linecounter-thread-%d").build();
        EXECUTOR_SERVICE = Executors.newCachedThreadPool(factory);
    }

    private final CodeReader reader;
    private final Future<LineCounter> lineCounter;

    public CodeReaderInputBuffer(@Nonnull final CodeReader reader)
    {
        this.reader = Objects.requireNonNull(reader);
        lineCounter = EXECUTOR_SERVICE.submit(() -> new LineCounter(reader));
    }

    @Override
    public char charAt(final int index)
    {
        return index >= 0 && index < reader.length()
            ? reader.charAt(index) : Chars.EOI;
    }

    /**
     * Returns the Unicode code point starting at a given index
     * <p>If the index is greater than, or equal to, the buffer's length, this
     * method returns -1.</p>
     *
     * @param index the index
     * @return the code point at this index, or -1 if the end of input has been
     * reached
     *
     * @throws IllegalArgumentException index is negative
     */
    @Override
    public int codePointAt(final int index)
    {
        final int length = reader.length();
        if (index >= length)
            return -1;
        if (index < 0)
            throw new IllegalArgumentException("index is negative");

        final char c = reader.charAt(index);
        if (!Character.isHighSurrogate(c))
            return c;
        if (index == length - 1)
            return c;
        final char c2 = reader.charAt(index + 1);
        return Character.isLowSurrogate(c2) ? Character.toCodePoint(c, c2) : c;
    }

    @Override
    public String extract(final int start, final int end)
    {
        final int realStart = Math.max(start, 0);
        final int realEnd = Math.min(end, reader.length());
        final CharBuffer buf = CharBuffer.allocate(realEnd - realStart);
        for (int i = realStart; i < realEnd; i++)
            buf.put(charAt(i));
        return new String(buf.array());
    }

    @Override
    public String extract(final IndexRange range)
    {
        return extract(range.start, range.end);
    }

    @Override
    public Position getPosition(final int index)
    {
        /*
         * A CodeReader column index starts at 0, not 1; we therefore need to
         * substract one from the extracted position...
         */
        final Position position
            = Futures.getUnchecked(lineCounter).toPosition(index);
        return new Position(position.getLine(), position.getColumn() - 1);
    }

    @Override
    public String extractLine(final int lineNumber)
    {
        Preconditions.checkArgument(lineNumber > 0, "line number is negative");
        final LineCounter counter = Futures.getUnchecked(lineCounter);
        final Range<Integer> range = counter.getLineRange(lineNumber);
        final int start = range.lowerEndpoint();
        int end = range.upperEndpoint();
        if (charAt(end - 1) == '\n')
            end--;
        if (charAt(end - 1) == '\r')
            end--;
        return extract(start, end);
    }

    /**
     * Get the index range matching a given line number
     *
     * @param lineNumber the line number
     * @return the index range
     */
    @Override
    public IndexRange getLineRange(final int lineNumber)
    {
        final Range<Integer> range
            = Futures.getUnchecked(lineCounter).getLineRange(lineNumber);
        return new IndexRange(range.lowerEndpoint(), range.upperEndpoint());
    }

    @Override
    public int getLineCount()
    {
        return Futures.getUnchecked(lineCounter).getNrLines();
    }

    @Override
    public int length()
    {
        return reader.length();
    }
}

One more reason that I fail to see the non implementation of .subsequence() is that it is actually extremely benefical for regexes which I see is used a lot in SSLR: Pattern.matcher() accepts a CharSequence as an argument.

@fge
Copy link
Author

fge commented May 16, 2015

As to this:

And I repeat - not hypothetical, but a real issue faced in past

would you please elaborate? The interface is quite simple to implement. In fact, CharBuffer does it all.

@Godin
Copy link
Member

Godin commented May 16, 2015

Yes, I can elaborate: consumer of "subSequence" instead creation of copy keeps reference on result as is, and result survives multiple parses of different files, accumulating content of those files in memory.

@fge
Copy link
Author

fge commented May 16, 2015

Then why don't you just return a String? String implements CharSequence... Which means you don't need to pass a hidden reference to the caller...

I can modify the patch to do that, too...

Just copy the necessary contents into a new char[] and return it in a String,
which implements CharSequence.
@fge
Copy link
Author

fge commented May 16, 2015

Ate my own dog's food, new patch.

(oh, and tests still pass)

@fge
Copy link
Author

fge commented May 16, 2015

Further note about the patch: it fully obeys the contract in that if an index is passed which is out of bounds, or start is greater than end etc, it will return an IndexOutOfBoundsException. No violation at all here ;)

@Godin
Copy link
Member

Godin commented May 16, 2015

Creation of String - is a copy-on-read, which might be undesirable in scenarios, where expected that method "subSequence" guarantees constant time. That's why I asked to provide usage scenarios, which are supposed to be not just a dump of some code, but an explanations (data, data flows, workloads, performance requirements, ...), which would help to choose solution / implementation approach, instead of trying quickly randomly find "correct" hot patch, whose behavior we'll be forced to support as soon as consumers will start relying on it.

Pattern itself doesn't use/require implementation of method subSequence.

@fge
Copy link
Author

fge commented May 16, 2015

Pattern itself doesn't use/require implementation of method subSequence.

It does. I speak from experience here.

A Matcher takes a CharSequence as an argument and this is not for kicks. What a Matcher does is call .charAt() to fecth characters from the CharSequence, and when it finds a match and you want the match, it extracts the .subsequence() from the CharSequence and calls .toString() on it.

I discovered this fact while working on largetext, and tried to print what was actually matched... I didn't implement .toString() over my CharSequences at first and I obtained a .toString() representation which looked like MyShortClassName@xxxxx....

@Godin
Copy link
Member

Godin commented May 16, 2015

Ok, excuse me, usage of Pattern within SSLR doesn't require this.

@fge
Copy link
Author

fge commented May 16, 2015

OK, fine, then the decision is yours. I believe my patch is pretty simple to be included as is, since the results of it are basically stateless.

@fge
Copy link
Author

fge commented May 17, 2015

Some more comments...

Creation of String - is a copy-on-read, which might be undesirable in scenarios, where expected that method "subSequence" guarantees constant time.

Let's be realistic here for a moment. Can a copy of some arbitrary amount of data be in constant time, that is O(1)?

The answer is obvious: NO. It can never be. The only condition where it can be is that if a reference to the original data is kept, and THAT is the scenario which is undesirable.

While my first patch was indeed "constant time" in that regard, and undesirable, the latest patch is NOT constant time (its runtime depends on the length of the requested subsequence).

But both patches cure the fact that CodeBuffer, as it is now, breaks the CharSequence contract; with my patch, it obeys it to its fullest extent, bounds checking and all, for no additional cost whatsoever, except for the guys like me who do happen to rely on CharSequence.

That's why I asked to provide usage scenarios, which are supposed to be not just a dump of some code, but an explanations (data, data flows, workloads, performance requirements, ...), which would help to choose solution / implementation approach, instead of trying quickly randomly find "correct" hot patch, whose behavior we'll be forced to support as soon as consumers will start relying on it.

Well there you have one scenario. That dump of code as you call it is precisely the critical part in all of my parsin process. And there is no data flow/performance requirements to speak of: just the fact that if an interface is claimed to be implemented, IT SHOULD BE. And my fix for that is simple, obvious and comes at no additional cost.

So, what's the problem?

@Godin Godin force-pushed the master branch 3 times, most recently from e1979f9 to 9a641a5 Compare October 28, 2020 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants