Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

review: feat: java lexer for better position detection #5753

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

SirYwell
Copy link
Collaborator

@SirYwell SirYwell commented Apr 8, 2024

Fixes #4705.

As mentioned in the linked PR, the current approach to find source locations in spoon is insufficient. The naive approach on only caring about spaces is not enough as there are other separators. By implementing a lexer, we can deal with that properly and also support unicode sequences.

This solution will hopefully - besides correctness - be more future-proof.

I tested the lexer on all .java files of the OpenJDK. By manually inspecting failures, it looks like only actual invalid code is rejected. So I'm confident this is a robust implementation.

There is more code in the PositionBuilder relying on the whitespace-based methods. It might be possible to replace those usages in future too.

@SirYwell SirYwell marked this pull request as ready for review April 8, 2024 15:56
@SirYwell SirYwell changed the title wip: feat: java lexer for better position detection review: feat: java lexer for better position detection Apr 8, 2024
@MartinWitt
Copy link
Collaborator

I had a nightmare that @SirYwell was writing a complete lexer. Luckily, this did not happen or?

Copy link
Collaborator

@I-Al-Istannen I-Al-Istannen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a good chunk, I hope I didn't miss anything important :) Have fun telling me why the comments are invalid :P

Comment on lines 587 to 591
contents,
start,
Math.max(start, end) + 1, //move end after the last char
explicitModifiersByKind,
(modStart, modEnd) -> cf.createSourcePosition(cu, modStart, modEnd, cu.getLineSeparatorPositions())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
contents,
start,
Math.max(start, end) + 1, //move end after the last char
explicitModifiersByKind,
(modStart, modEnd) -> cf.createSourcePosition(cu, modStart, modEnd, cu.getLineSeparatorPositions())
contents,
start,
Math.max(start, end) + 1, //move end after the last char
explicitModifiersByKind,
(modStart, modEnd) -> cf.createSourcePosition(cu, modStart, modEnd, cu.getLineSeparatorPositions())

and why is there a Math.max? Is end ever smaller than start? I think this either deserves an explanation or should be end + 1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some unholy reason, end can be -1 sometimes. That can probably be fixed, but I don't want to include that here.

i += 5;
if (this.positionRemap == null) {
this.positionRemap = createPositionRemap(chars);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment stating why this is 6? And maybe a short comment at the top saying that you are first building a map from index -> skip value to next char and then accumulate it at the bottom or something in that spirit?

Comment on lines +47 to +53
if (this.content[i] == '\\') {
if (escape) {
escape = false;
} else if (this.end > i + 1 && this.content[i + 1] == '\\') {
escape = true;
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what this is doing. Maybe rename escape to escapeBackslash or just handle it inline and increment i by one in that branch and skip the loop iteration? At present I am not exactly sure what this is doing.

package spoon.support.util.internal.lexer;

/**
* Valid Java (contextual) keywords
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But VAR for example is missing, so not all contextual?

readHexadecimalsAndUnderscore();
}
} else if (peek() == 'b' || peek() == 'B') {
next();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
next();
// binary literal
next();

// check if hexadecimal notation
if (first == '0') {
if (peek() == 'x' || peek() == 'X') {
next();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
next();
// hexadecimal literal
next();

readDigitsOrUnderscore();
}
} else {
readDigitsOrUnderscore();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
readDigitsOrUnderscore();
// decimal literal
readDigitsOrUnderscore();

?

return false;
}

private void readNumericLiteral(char first) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this lex 1e20f? Can you add a test for that?

Copy link
Collaborator

@I-Al-Istannen I-Al-Istannen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I pressed the wrong button :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants