Add Identifier wrapper that strips backticks from token text #2576

adammcarter · 2024-03-28T19:41:13Z

Motivation

Following the issue opened in #1936, take the code below as an example:

let x = y.`z`()

When you want to pull out the identifier

`z`

you have to manually strip the backticks to get the z character.

The backticks are completely valid in this case, but when using swift-syntax you often want to remove them to get the sanitised name of the token for your own purposes, which introduces unnecessary boilerplate.
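The boilerplate in question tends to look something like the sketch below. It operates on a plain String rather than on a real TokenSyntax, and the helper name strippingBackticks is hypothetical; it is not part of swift-syntax.

```swift
/// Hypothetical helper: removes one leading and one trailing backtick,
/// but only when both are present.
func strippingBackticks(_ text: String) -> String {
  guard text.count >= 2, text.hasPrefix("`"), text.hasSuffix("`") else {
    return text
  }
  return String(text.dropFirst().dropLast())
}

let escaped = strippingBackticks("`z`")  // "z"
let plain = strippingBackticks("z")      // "z"
```

Every client that needs the bare name ends up writing some variant of this, which is the repetition the PR aims to remove.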

Solution

Looking ahead, the new Identifier type provides an abstraction over tokens that can sanitise or strip further data, for example backticks (as in this PR) or, in future updates, things like:

combined characters, e.g. é
# symbols such as in #if
comment blocks like /*, /// or //

With the changes in this PR we can take the same example code:

let x = y.`z`()

And by calling [token].identifier.name we can get z without the backticks.

Alternatives considered

Stripping the backticks when returning .text is one solution, but the backticks are a valid part of the text, so this would not accurately return the source code.

Trimming the backticks as part of trimmedDescription is a valid and simpler solution IMO, but doesn't allow for future implementations like the above under the Identifier abstraction

Tests/SwiftSyntaxTest/IdentifierTests.swift

ahoppen

Thanks for putting up the PR 👍🏽

ahoppen · 2024-03-29T07:51:30Z

Sources/SwiftSyntax/Identifier.swift

+/// An abstraction for sanitized values on a token.
+public struct Identifier: Equatable, Sendable {
+  /// The sanitized `text` of a token.
+  public let name: String


I would prefer to store the name as pure bytes instead of a String. That way the identifier actually represents the raw byte name of the identifier in the source file and doesn’t do any unicode normalization.

The best option would probably be to use SyntaxText (TokenSyntax.rawText), but SyntaxText doesn’t own the underlying text buffer, so Identifier would also need to keep the SyntaxArena it was allocated in alive (RetainedSyntaxArena).

CC @rintaro In case you have opinions / thoughts here.

I just chatted to @rintaro about this.

What we want is two types:

An @_spi(RawSyntax) public struct RawIdentifier that stores the name of the identifier without keeping the SyntaxArena alive. This is intended to be used in performance-critical situations that can guarantee that the SyntaxArena stays alive and don’t want to pay the ref-counting overhead for it.

A public struct Identifier that wraps RawIdentifier and also keeps the syntax arena alive using RetainedSyntaxArena. Identifier should then have a computed property name: String that returns a Unicode-normalized version of the identifier. String is probably what most clients choose to use but for uses in the compiler, we need to be byte accurate because the compiler considers U+00E0 (à LATIN SMALL LETTER A WITH GRAVE) to be different than U+0061 U+0300 (a LATIN SMALL LETTER A followed by ̀ COMBINING GRAVE ACCENT) while Swift’s String performs Unicode normalization and considers them the same (print("\u{e0}" == "\u{61}\u{300}") prints true)
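The difference between String comparison and byte comparison can be checked with standard-library calls only. This is a small illustration of the point above, not code from the PR.

```swift
let precomposed = "\u{E0}"        // à as a single scalar (U+00E0)
let decomposed = "\u{61}\u{300}"  // a followed by COMBINING GRAVE ACCENT

// String equality uses canonical equivalence, so these compare equal.
let equalAsStrings = precomposed == decomposed

// The UTF-8 bytes differ, which is what a byte-accurate comparison sees.
let equalAsBytes = Array(precomposed.utf8) == Array(decomposed.utf8)

print(equalAsStrings)            // true
print(equalAsBytes)              // false
print(Array(precomposed.utf8))   // [195, 160]
print(Array(decomposed.utf8))    // [97, 204, 128]
```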

Hey @ahoppen thanks for following up on this!

I'm looking into the above and wanted to clarify whether RawIdentifier.name needs to be a SyntaxText or a RawSyntax.

I've pushed a WIP commit which covers the above and uses a SyntaxText, which mostly works (except for the test testRawIdentifier(), which I'll look into once this conversation is settled).

However, when attempting to use the RawSyntax type I'm getting a bit lost in the code, because I'm so new to this codebase.

Are you able to:

clarify if we want RawIdentifier.name to be a SyntaxText or a RawSyntax

look at my latest WIP commit and provide some input on my attempts so far?

Thanks!

RawIdentifier.name should be a SyntaxText. Sorry if that wasn’t clear from my last comment.

Sources/SwiftSyntax/Identifier.swift

Tests/SwiftSyntaxTest/IdentifierTests.swift

Sources/SwiftSyntax/Identifier.swift

Added a new Identifier type which contains a name property. This name property contains a sanitized version of the TokenSyntax's text property, which for now only consists of trimming backticks.

This acts as a convenience property to convert a TokenSyntax to an Identifier.

This isn't explicitly needed for this problem, but it seems as though all types that can conform to Sendable/Hashable should do so by default. This also sets up this new type for Swift 6.0 (and the current Swift 5.10) so it can be passed around safely when needed, and it allows Identifier to be a key in a dictionary, which could be a common scenario.

When trying to create an Identifier from a non-identifier token, the initializer should fail, returning nil. Additionally, the identifier property of the TokenSyntax should also return nil.
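As a rough illustration of the behaviour these commit messages describe, the sketch below uses hypothetical stand-in types (ToyTokenKind, ToyToken, ToyName). They are not the swift-syntax types and only mirror the shape of a failable initializer plus a convenience property.

```swift
// Hypothetical stand-ins for a token and its kind; not the swift-syntax API.
enum ToyTokenKind {
  case identifier(String)
  case keyword(String)
}

struct ToyToken {
  let kind: ToyTokenKind
}

struct ToyName: Hashable, Sendable {
  let name: String

  /// Fails (returns nil) for any token that is not an identifier.
  init?(_ token: ToyToken) {
    guard case .identifier(let text) = token.kind else { return nil }
    name = text
  }
}

extension ToyToken {
  /// Convenience property mirroring the failable initializer.
  var identifier: ToyName? { ToyName(self) }
}

let fromIdentifier = ToyToken(kind: .identifier("z")).identifier
let fromKeyword = ToyToken(kind: .keyword("let")).identifier

// Hashable conformance lets the value act as a dictionary key.
var counts: [ToyName: Int] = [:]
if let key = fromIdentifier { counts[key, default: 0] += 1 }
```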

ahoppen

Thank you! This is going in a great direction. It’s about the design that I had in mind.

ahoppen · 2024-04-15T23:34:20Z

Sources/SwiftSyntax/Identifier.swift

Could you indent the file using 2 spaces and run swift-format on it? https://github.com/apple/swift-syntax/blob/main/CONTRIBUTING.md#formatting

ahoppen · 2024-04-15T23:36:21Z

Sources/SwiftSyntax/Identifier.swift

+    public static func == (lhs: Identifier, rhs: Identifier) -> Bool {
+        lhs.rawIdentifier == rhs.rawIdentifier
+    }
+
+    public func hash(into hasher: inout Hasher) {
+        hasher.combine(rawIdentifier)
+    }


Don’t these implementations get automatically synthesized?
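For context, Swift synthesizes == and hash(into:) for a struct that declares Hashable conformance when every stored property is itself Hashable. The sketch below uses a hypothetical Wrapper type; whether synthesis applies to the Identifier in this PR depends on all of its stored properties (including the arena) being Hashable, which is not confirmed here.

```swift
// Hypothetical type: no hand-written == or hash(into:) is needed because
// the only stored property is Hashable.
struct Wrapper: Hashable {
  let raw: String
}

let first = Wrapper(raw: "z")
let second = Wrapper(raw: "z")
let third = Wrapper(raw: "y")

let sameValue = first == second          // uses the synthesized ==
let unique: Set<Wrapper> = [first, second, third]
```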

ahoppen · 2024-04-15T23:36:56Z

Sources/SwiftSyntax/Identifier.swift

+
+@_spi(RawSyntax) 
+public struct RawIdentifier: Equatable, Hashable, Sendable {
+    public let name: SyntaxText


I think we should do the trimming of backticks in RawIdentifier. That way it’s possible to construct a RawIdentifier from a RawSyntaxTokenView that contains a token. I.e., I think it should have an initializer with the signature init(_ raw: RawSyntaxTokenView).

ahoppen · 2024-04-15T23:41:46Z

Sources/SwiftSyntax/Identifier.swift

+        let name = rawText.withUTF8 {
+            syntaxArena.intern(
+                SyntaxText(buffer: SyntaxArenaAllocatedBufferPointer<UInt8>($0))
+            )
+        }


I don’t think we need to create a new arena and intern the string at all. If you pick up my suggestion of doing the trimming in RawIdentifier, you should be able to:

Get the RawSyntaxTokenView.rawText

If the text has a leading and trailing backtick, slice off the first and last byte of the SyntaxText using a subscript

Construct a SyntaxText from the slice again using SyntaxText.init(rebasing:)

Conceptually, the whole text including the backticks is allocated in the SyntaxArena, and RawIdentifier just references the slice of that text without the backticks, so no further memory allocation is needed.

Also, I think we don’t want to trim an arbitrary number of backticks at the front and back, but a single backtick at the front and a single backtick at the back if both exist.
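The slicing approach described in this comment can be sketched with the standard library's UnsafeBufferPointer, which also offers an init(rebasing:) initializer. This is an analogy rather than the SyntaxText code itself, and the helper names trimmingBacktickPair and trimmedName are hypothetical.

```swift
/// Hypothetical helper: returns a view of `bytes` without one leading and one
/// trailing backtick when both are present. No bytes are copied; the result
/// points into the same memory as `bytes`.
func trimmingBacktickPair(_ bytes: UnsafeBufferPointer<UInt8>) -> UnsafeBufferPointer<UInt8> {
  let backtick = UInt8(ascii: "`")
  guard bytes.count >= 2, bytes.first == backtick, bytes.last == backtick else {
    return bytes
  }
  return UnsafeBufferPointer(rebasing: bytes[1..<(bytes.count - 1)])
}

/// Copies the trimmed bytes into a String so the result can outlive the buffer.
func trimmedName(of source: String) -> String {
  Array(source.utf8).withUnsafeBufferPointer { buffer in
    String(decoding: trimmingBacktickPair(buffer), as: UTF8.self)
  }
}

let single = trimmedName(of: "`z`")     // "z"
let doubled = trimmedName(of: "``z``")  // "`z`" (only one pair is removed)
let unmatched = trimmedName(of: "`z")   // "`z" (no trailing backtick, unchanged)
```

Only a single pair is removed, and only when both ends carry a backtick, matching the last point in the comment.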

ahoppen · 2024-04-15T23:41:54Z

Sources/SwiftSyntax/Identifier.swift

+    @_spi(RawSyntax)
+    public let rawIdentifier: RawIdentifier
+
+    let arena: RetainedSyntaxArena


I think the arena can even be private.

ahoppen · 2024-04-15T23:43:33Z

Tests/SwiftSyntaxTest/IdentifierTests.swift

+        let someToken = TokenSyntax(stringLiteral: "someToken")
+        XCTAssertNotNil(Identifier(someToken))
+
+        let nonIdentifierToken = DeclSyntax("let a = 1").firstToken(viewMode: .all)!


Could you use XCTUnwrap instead of the force unwrapping? That way test execution continues and doesn’t crash if firstToken should be nil.
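A minimal sketch of the XCTUnwrap pattern, using a plain array instead of the token lookup from the PR's test. The test class and method names are hypothetical. Note that the test method has to be marked throws.

```swift
import XCTest

final class UnwrapSketchTests: XCTestCase {
  func testFirstElement() throws {
    let values: [Int] = [1, 2, 3]
    // If `first` were nil, XCTUnwrap would record a failure and throw,
    // ending this test without crashing the whole test process.
    let first = try XCTUnwrap(values.first)
    XCTAssertEqual(first, 1)
  }
}
```

In the PR's test, the optional being unwrapped would be the result of firstToken(viewMode: .all).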

ahoppen · 2024-04-15T23:44:34Z

Sources/SwiftSyntax/Identifier.swift

+    @_spi(RawSyntax)
+    public let rawIdentifier: RawIdentifier


I would just call this public let raw: RawIdentifier, to match the existing naming scheme where Syntax has a raw: RawSyntax property.
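Taken together, the review comments suggest a shape roughly like the toy model below. All three types are hypothetical stand-ins (ToyArena for SyntaxArena, ToyRawIdentifier for RawIdentifier, ToyIdentifier for Identifier), and a byte range replaces the pointer-based SyntaxText so the sketch stays safe and self-contained.

```swift
/// Stand-in for SyntaxArena: a reference type that owns the source bytes.
final class ToyArena {
  let bytes: [UInt8]
  init(source: String) { bytes = Array(source.utf8) }
}

/// Stand-in for RawIdentifier: cheap to copy, but only meaningful together
/// with the arena it was created from.
struct ToyRawIdentifier: Hashable {
  let byteRange: Range<Int>
}

/// Stand-in for Identifier: wraps the raw value and retains the arena.
struct ToyIdentifier {
  let raw: ToyRawIdentifier
  private let arena: ToyArena

  init(raw: ToyRawIdentifier, arena: ToyArena) {
    self.raw = raw
    self.arena = arena
  }

  /// Decodes the referenced bytes on demand.
  var name: String {
    String(decoding: arena.bytes[raw.byteRange], as: UTF8.self)
  }
}

let arena = ToyArena(source: "let x = y.`z`()")
// Byte 11 is the z between the two backticks.
let identifier = ToyIdentifier(raw: ToyRawIdentifier(byteRange: 11..<12), arena: arena)
print(identifier.name)  // z
```

The public raw property exposes the lightweight value, while the private arena reference is what keeps the underlying bytes alive for as long as the wrapper exists.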


adammcarter requested review from ahoppen and bnbarham as code owners March 28, 2024 19:41


adammcarter force-pushed the adamcarter93/separate-backtick-tokens branch from 581dea4 to 7d0faac Compare March 28, 2024 19:49

adammcarter mentioned this pull request Mar 28, 2024

Backticks on identifiers should be separate tokens #1936

Open

ahoppen reviewed Mar 29, 2024


adammcarter force-pushed the adamcarter93/separate-backtick-tokens branch 4 times, most recently from ac33a8c to 3b4820e Compare March 29, 2024 14:54

adammcarter added 3 commits March 29, 2024 15:12

Remove backticks when creating Identifier

23e48f1

Added a new Identifier type which contains a name property. This name property contains a sanitized version of the TokenSyntax's text property, which for now only consists of trimming backticks.

Added TokenSyntax identifier property

d5e150e

This acts as a convenience property to convert a TokenSyntax to an Identifier

adammcarter force-pushed the adamcarter93/separate-backtick-tokens branch from 3b4820e to 336b4b3 Compare March 29, 2024 15:14

adammcarter added 2 commits March 29, 2024 19:11

Make Identifier initializer failable

a63ade8

When trying to create an Identifier from a non-identifier token, the initializer should fail, returning nil. Additionally, the identifier property of the TokenSyntax should also return nil.

WIP - SyntaxText

21ba4fb

ahoppen reviewed Apr 15, 2024


plemarquand mentioned this pull request Apr 30, 2024

Trim parameter names in SwiftTestingScanner apple/sourcekit-lsp#1209

Merged

plemarquand added a commit to plemarquand/sourcekit-lsp that referenced this pull request May 1, 2024

Filter backticks in TestItem IDs

1ac0685

Test functions with backticks in their identifier should have them filtered out. This is a stopgap until apple/swift-syntax#2576 is ready.

plemarquand mentioned this pull request May 1, 2024

Filter backticks in TestItem IDs apple/sourcekit-lsp#1211

Open
