review: feat: java lexer for better position detection #5753

SirYwell · 2024-04-08T14:57:04Z

As mentioned in the linked PR, the current approach to finding source locations in Spoon is insufficient. The naive approach of only caring about spaces is not enough, as there are other separators. By implementing a lexer, we can deal with that properly and also support Unicode escape sequences.
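To illustrate the problem, here is a minimal sketch of a whitespace-only split. The class and method names are hypothetical and not taken from Spoon; it only shows the kind of input where treating whitespace as the sole separator breaks down.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Spoon.
final class WhitespaceSplitDemo {

    // Naive approach: treat only whitespace as a separator between tokens.
    static List<String> splitOnWhitespace(String source) {
        List<String> tokens = new ArrayList<>();
        for (String part : source.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A comment is a valid separator in Java, so this declares two
        // modifiers, but the naive split glues them into a single token.
        System.out.println(splitOnWhitespace("public/*c*/static int x;"));
        // Here the 'p' of "public" is written as a Unicode escape, so a
        // plain string comparison against "public" would not match.
        System.out.println(splitOnWhitespace("\\u0070ublic static int x;"));
    }
}
```

In the first case the first token comes out as `public/*c*/static`, so neither modifier is found by a plain comparison.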

This solution will hopefully - besides correctness - be more future-proof.

I tested the lexer on all .java files of the OpenJDK. By manually inspecting failures, it looks like only actual invalid code is rejected. So I'm confident this is a robust implementation.

There is more code in the PositionBuilder relying on the whitespace-based methods. It might be possible to replace those usages in the future too.

MartinWitt · 2024-04-15T18:42:22Z

I had a nightmare that @SirYwell was writing a complete lexer. Luckily, this did not happen, right?

I-Al-Istannen

This was a good chunk, I hope I didn't miss anything important :) Have fun telling me why the comments are invalid :P

I-Al-Istannen · 2024-04-20T10:39:52Z

src/main/java/spoon/support/compiler/jdt/PositionBuilder.java

+				contents,
+				start,
+				Math.max(start, end) + 1, //move end after the last char
+				explicitModifiersByKind,
+				(modStart, modEnd) -> cf.createSourcePosition(cu, modStart, modEnd, cu.getLineSeparatorPositions())


Suggested change

contents,
start,
Math.max(start, end) + 1, //move end after the last char
explicitModifiersByKind,
(modStart, modEnd) -> cf.createSourcePosition(cu, modStart, modEnd, cu.getLineSeparatorPositions())

And why is there a Math.max? Is end ever smaller than start? I think this either deserves an explanation or should be end + 1.

For some unholy reason, end can be -1 sometimes. That can probably be fixed, but I don't want to include that here.
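A small sketch of the arithmetic behind that answer, assuming `start` is non-negative. The class and method names are hypothetical and only isolate the expression from the diff.

```java
// Hypothetical helper, not part of Spoon; it isolates the expression from the diff.
final class EndClampSketch {

    // Exclusive end offset that never falls before start, even if end is -1.
    static int exclusiveEnd(int start, int end) {
        return Math.max(start, end) + 1;
    }

    public static void main(String[] args) {
        // Regular case: end is after start, so this is just end + 1.
        System.out.println(exclusiveEnd(10, 14));
        // Degenerate case: with end == -1, plain end + 1 would be 0, which
        // lies before start; the clamped version yields start + 1 instead.
        System.out.println(exclusiveEnd(10, -1));
    }
}
```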

src/main/java/spoon/support/util/internal/lexer/CharRemapper.java

I-Al-Istannen · 2024-04-20T10:57:25Z

src/main/java/spoon/support/util/internal/lexer/CharRemapper.java

+					i += 5;
+					if (this.positionRemap == null) {
+						this.positionRemap = createPositionRemap(chars);
+					}


Could you add a comment stating why this is 6? And maybe a short comment at the top saying that you are first building a map from index -> skip value to next char and then accumulate it at the bottom or something in that spirit?
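For context, a rough sketch of the idea being asked about: translating `\uXXXX` escapes into single chars while remembering where each translated char started in the original input. This is not the PR's `CharRemapper`; the class, field, and helper names are hypothetical, and it stores the original index directly instead of accumulating skip values.

```java
import java.util.Arrays;

// Hypothetical sketch; not Spoon's CharRemapper.
final class UnicodeEscapeSketch {
    final char[] translated;
    // originalIndex[k] is the index in the input where translated[k] started.
    final int[] originalIndex;

    UnicodeEscapeSketch(String source) {
        char[] in = source.toCharArray();
        char[] out = new char[in.length];
        int[] map = new int[in.length];
        int n = 0;
        for (int i = 0; i < in.length; i++) {
            int start = i;
            char c = in[i];
            if (c == '\\' && i + 1 < in.length) {
                if (in[i + 1] == '\\') {
                    // Escaped backslash: emit the first one here and the second
                    // one below, so the second cannot begin a Unicode escape.
                    out[n] = c;
                    map[n++] = i;
                    i++;
                    start = i;
                    c = in[i];
                } else if (in[i + 1] == 'u') {
                    int j = i + 1;
                    while (j < in.length && in[j] == 'u') {
                        j++; // the JLS allows more than one 'u'
                    }
                    if (j + 4 > in.length) {
                        throw new IllegalArgumentException("truncated escape at " + i);
                    }
                    int value = 0;
                    for (int k = j; k < j + 4; k++) {
                        int d = hex(in[k]);
                        if (d < 0) {
                            throw new IllegalArgumentException("bad hex digit at " + k);
                        }
                        value = value * 16 + d;
                    }
                    c = (char) value;
                    // With a single 'u' this equals i += 5; the loop's own i++
                    // then steps over the sixth char of the escape.
                    i = j + 3;
                }
            }
            out[n] = c;
            map[n++] = start;
        }
        translated = Arrays.copyOf(out, n);
        originalIndex = Arrays.copyOf(map, n);
    }

    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        UnicodeEscapeSketch s = new UnicodeEscapeSketch("a\\u0041b");
        System.out.println(new String(s.translated));
        System.out.println(Arrays.toString(s.originalIndex));
    }
}
```

For the input `a\u0041b`, the translated text is `aAb` and the recorded original indices are 0, 1 and 7, so a position in the translated text can be mapped back to the raw source.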

I-Al-Istannen · 2024-04-20T11:25:21Z

src/main/java/spoon/support/util/internal/lexer/CharRemapper.java

+			if (this.content[i] == '\\') {
+				if (escape) {
+					escape = false;
+				} else if (this.end > i + 1 && this.content[i + 1] == '\\') {
+					escape = true;
+				}
+			}


I am not quite sure what this is doing. Maybe rename escape to escapeBackslash, or just handle it inline: increment i by one in that branch and skip the loop iteration?
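My reading, which is an assumption and not confirmed by the PR, is that the flag tracks the JLS rule that a backslash may only begin a Unicode escape if it is preceded by an even number of contiguous backslashes. A stateless way to express that rule could look like the sketch below; the class and method names are hypothetical.

```java
// Hypothetical sketch of the eligibility rule; not the PR's implementation.
final class BackslashParitySketch {

    // Returns true if content[i] is a backslash preceded by an even number
    // of contiguous backslashes, i.e. it may begin a Unicode escape.
    static boolean mayStartUnicodeEscape(char[] content, int i) {
        if (content[i] != '\\') {
            return false;
        }
        int preceding = 0;
        for (int k = i - 1; k >= 0 && content[k] == '\\'; k--) {
            preceding++;
        }
        return preceding % 2 == 0;
    }

    public static void main(String[] args) {
        char[] escaped = "\\\\u0041".toCharArray(); // raw text: \\u0041
        // The second backslash has one backslash before it, so the
        // following u0041 is plain text and not an escape.
        System.out.println(mayStartUnicodeEscape(escaped, 1));
    }
}
```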

I-Al-Istannen · 2024-04-20T11:48:13Z

src/main/java/spoon/support/util/internal/lexer/JavaKeyword.java

+package spoon.support.util.internal.lexer;
+
+/**
+ * Valid Java (contextual) keywords


But VAR, for example, is missing, so not all contextual keywords?

I-Al-Istannen · 2024-04-20T19:01:51Z