New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Identifier wrapper that strips backticks from token text #2576
base: main
Are you sure you want to change the base?
Add Identifier wrapper that strips backticks from token text #2576
Conversation
581dea4
to
7d0faac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting up the PR 👍🏽
Sources/SwiftSyntax/Identifier.swift
Outdated
/// An abstraction for sanitized values on a token. | ||
public struct Identifier: Equatable, Sendable { | ||
/// The sanitized `text` of a token. | ||
public let name: String |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to store the name as pure bytes instead of a String
. That way the identifier actually represents the raw byte name of the identifier in the source file and doesn’t do any unicode normalization.
The best option would probably to use SyntaxText
(TokenSyntax.rawText
) but SyntaxText
doesn’t own the underlying text buffer, so Identifier
would also need to keep the SyntaxArena
it was allocated in alive (RetainedSyntaxArena
).
CC @rintaro In case you have opinions / thoughts here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just chatted to @rintaro about this.
What we want is two types:
- An
@_spi(RawSyntax) public struct RawIdentifier
that stores the name of the identifier without keeping theSyntaxArena
alive. This is intended to be used in performance-critical situations that can guarantee that theSyntaxArena
stays alive and don’t want to pay the ref-counting overhead for it. - A
public struct Identifier
that wrapsRawIdentifier
and also keeps the syntax arena alive usingRetainedSyntaxArena
. Identifier should then have a computed propertyname: String
that returns a Unicode-normalized version of the identifier.String
is probably what most clients choose to use but for uses in the compiler, we need to be byte accurate because the compiler considersU+00E0
(à LATIN SMALL LETTER A WITH GRAVE) to be different thanU+0061 U+0300
(a LATIN SMALL LETTER A followed by ̀ COMBINING GRAVE ACCENT) while Swift’sString
performs Unicode normalization and considers them the same (print("\u{e0}" == "\u{61}\u{300}")
printstrue
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ahoppen thanks for following up on this!
I'm looking in to the above and wanted to clarify whether the RawIdentifier.name
needs to be a SyntaxText
or a RawSyntax
I've pushed a WIP commit which covers the above and uses a SyntaxText
which mostly works (except for the test testRawIdentifier()
which I'll look in to after the conclusion of this conversation)
However on attempting to use RawSyntax
type I'm getting a bit lost in the code from me being so new to this codebase.
Are you able to:
- clarify if we want
RawIdentifier.name
to be aSyntaxText
or aRawSyntax
- look at my latest WIP commit and provide some input on my attempts so far?
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RawIdentifier.name
should be a SyntaxText
. Sorry if that wasn’t clear from my last comment.
ac33a8c
to
3b4820e
Compare
Added a new Identifier type which contains a name property This name property contains a sanitized version of the TokenSyntax's text property which for now only consists of trimming backticks
This acts as a convenience property to convert a TokenSyntax to an Identifier
This isn't explicitly needed for this problem but it seems as though the default for all types that can conform to Sendable/Hashable should This also sets up this new type for Swift 6.0 (and the current Swift 5.10) to pass this type around safely when needed As well as allows Identifier to be a key in a dictionary which could be a common scenario
3b4820e
to
336b4b3
Compare
When trying to create an Identifier from a non-identifier token, the initializer should fail, returning nil Additionally the identifier property of the TokenSyntax should also return nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! This is going in a great direction. It’s about the design that I had in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you indent the file using 2 spaces and run swift-format on it? https://github.com/apple/swift-syntax/blob/main/CONTRIBUTING.md#formatting
public static func == (lhs: Identifier, rhs: Identifier) -> Bool { | ||
lhs.rawIdentifier == rhs.rawIdentifier | ||
} | ||
|
||
public func hash(into hasher: inout Hasher) { | ||
hasher.combine(rawIdentifier) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don’t these implementations get automatically synthesized?
|
||
@_spi(RawSyntax) | ||
public struct RawIdentifier: Equatable, Hashable, Sendable { | ||
public let name: SyntaxText |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should do the trimming of backticks in RawIdentifier
. That way it’s possible to construct a RawIdentifier
from a RawSyntaxTokenView
that contains a token. Ie. I think it should have an initializer with the signature init(_ raw: RawSyntaxTokenView)
.
let name = rawText.withUTF8 { | ||
syntaxArena.intern( | ||
SyntaxText(buffer: SyntaxArenaAllocatedBufferPointer<UInt8>($0)) | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t think we need to create a new arena and intern the string at all. If you pick up my suggestion of doing the trimming in RawIdentifier
, you should be able to
- Get the
RawSyntaxTokenView.rawText
- If the text has a leading and trailing backtick, slice off the first and last byte of the
SyntaxText
using a subscript - Constructing a
SyntaxText
from the slice again usingSyntaxText.init(rebasing:)
What’s conceptually happening then, is that the whole text including backticks is allocated memory in the SyntaxArena
and RawIdentifier
just references the slice of that text without the backticks without doing any more memory allocations.
Also, I think we don’t want to trim an arbitrary number of backticks at the front and back, but a single backtick at the front and a single backtick at the back if both exist.
@_spi(RawSyntax) | ||
public let rawIdentifier: RawIdentifier | ||
|
||
let arena: RetainedSyntaxArena |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the arena can even be private.
let someToken = TokenSyntax(stringLiteral: "someToken") | ||
XCTAssertNotNil(Identifier(someToken)) | ||
|
||
let nonIdentifierToken = DeclSyntax("let a = 1").firstToken(viewMode: .all)! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use XCTUnwrap
instead of the force unwrapping? That way test execution continues and doesn’t crash if firstToken
should be nil
.
@_spi(RawSyntax) | ||
public let rawIdentifier: RawIdentifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just call this public let raw: RawIdentifier
to match the naming scheme that Syntax
have a raw: RawSyntax
property.
Test functions with backticks in their identifier should have them filtered out. This is a stopgap until apple/swift-syntax#2576 is ready.
Motivation
Following the issue opened in #1936, take the code below as an example:
When wanting to pull out the character
you will have to manually strip the backslashes to get the
z
character.The backslashes are completely valid in this case but often using the Swift syntax you might want to remove the backslashes to get the sanitised name of the token for your own purposes which can introduce unnecessary boilerplate.
Solution
Thinking forward, this new
Identifier
type allows an abstraction over tokens to further sanitise/strip data out, for example, backticks (as in this PR) or future updates like:é
#
symbols such as in#if
/*
,///
or//
With the changes in this PR we can take the same example code:
And by calling
[token].identifier.name
we can getz
without the backticks.Alternatives considered
Stripping the backslashes when returning the
.text
is one solution, but the backticks are a valid part of the text and so wouldn't be an accurate way of returning the source codeTrimming the backticks as part of
trimmedDescription
is a valid and simpler solution IMO, but doesn't allow for future implementations like the above under theIdentifier
abstraction