Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Identifier wrapper that strips backticks from token text #2576

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

adammcarter
Copy link
Contributor

@adammcarter adammcarter commented Mar 28, 2024

Motivation

Following the issue opened in #1936, take the code below as an example:

let x = y.`z`()

When wanting to pull out the character

`z`

you will have to manually strip the backslashes to get the z character.

The backslashes are completely valid in this case but often using the Swift syntax you might want to remove the backslashes to get the sanitised name of the token for your own purposes which can introduce unnecessary boilerplate.

Solution

Thinking forward, this new Identifier type allows an abstraction over tokens to further sanitise/strip data out, for example, backticks (as in this PR) or future updates like:

  • combined characters, e.g. é
  • # symbols such as in #if
  • comment blocks like /*, /// or //

With the changes in this PR we can take the same example code:

let x = y.`z`()

And by calling [token].identifier.name we can get z without the backticks.

Alternatives considered

Stripping the backslashes when returning the .text is one solution, but the backticks are a valid part of the text and so wouldn't be an accurate way of returning the source code

Trimming the backticks as part of trimmedDescription is a valid and simpler solution IMO, but doesn't allow for future implementations like the above under the Identifier abstraction

Copy link
Collaborator

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting up the PR 👍🏽

/// An abstraction for sanitized values on a token.
public struct Identifier: Equatable, Sendable {
/// The sanitized `text` of a token.
public let name: String
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to store the name as pure bytes instead of a String. That way the identifier actually represents the raw byte name of the identifier in the source file and doesn’t do any unicode normalization.

The best option would probably to use SyntaxText (TokenSyntax.rawText) but SyntaxText doesn’t own the underlying text buffer, so Identifier would also need to keep the SyntaxArena it was allocated in alive (RetainedSyntaxArena).

CC @rintaro In case you have opinions / thoughts here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just chatted to @rintaro about this.

What we want is two types:

  • An @_spi(RawSyntax) public struct RawIdentifier that stores the name of the identifier without keeping the SyntaxArena alive. This is intended to be used in performance-critical situations that can guarantee that the SyntaxArena stays alive and don’t want to pay the ref-counting overhead for it.
  • A public struct Identifier that wraps RawIdentifier and also keeps the syntax arena alive using RetainedSyntaxArena. Identifier should then have a computed property name: String that returns a Unicode-normalized version of the identifier. String is probably what most clients choose to use but for uses in the compiler, we need to be byte accurate because the compiler considers U+00E0 (à LATIN SMALL LETTER A WITH GRAVE) to be different than U+0061 U+0300 (a LATIN SMALL LETTER A followed by ̀ COMBINING GRAVE ACCENT) while Swift’s String performs Unicode normalization and considers them the same (print("\u{e0}" == "\u{61}\u{300}") prints true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ahoppen thanks for following up on this!

I'm looking in to the above and wanted to clarify whether the RawIdentifier.name needs to be a SyntaxText or a RawSyntax

I've pushed a WIP commit which covers the above and uses a SyntaxText which mostly works (except for the test testRawIdentifier() which I'll look in to after the conclusion of this conversation)

However on attempting to use RawSyntax type I'm getting a bit lost in the code from me being so new to this codebase.

Are you able to:

  • clarify if we want RawIdentifier.name to be a SyntaxText or a RawSyntax
  • look at my latest WIP commit and provide some input on my attempts so far?

Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawIdentifier.name should be a SyntaxText. Sorry if that wasn’t clear from my last comment.

Sources/SwiftSyntax/Identifier.swift Outdated Show resolved Hide resolved
Sources/SwiftSyntax/Identifier.swift Outdated Show resolved Hide resolved
Tests/SwiftSyntaxTest/IdentifierTests.swift Show resolved Hide resolved
Sources/SwiftSyntax/Identifier.swift Outdated Show resolved Hide resolved
Sources/SwiftSyntax/Identifier.swift Outdated Show resolved Hide resolved
@adammcarter adammcarter force-pushed the adamcarter93/separate-backtick-tokens branch 4 times, most recently from ac33a8c to 3b4820e Compare March 29, 2024 14:54
Added a new Identifier type which contains a name property

This name property contains a sanitized version of the TokenSyntax's
text property which for now only consists of trimming backticks
This acts as a convenience property to convert a TokenSyntax to an
Identifier
This isn't explicitly needed for this problem but it seems as though
the default for all types that can conform to Sendable/Hashable should

This also sets up this new type for Swift 6.0 (and the current Swift 5.10)
to pass this type around safely when needed

As well as allows Identifier to be a key in a dictionary which could be
a common scenario
@adammcarter adammcarter force-pushed the adamcarter93/separate-backtick-tokens branch from 3b4820e to 336b4b3 Compare March 29, 2024 15:14
When trying to create an Identifier from a non-identifier token, the
initializer should fail, returning nil

Additionally the identifier property of the TokenSyntax should also return nil
Copy link
Collaborator

@ahoppen ahoppen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This is going in a great direction. It’s about the design that I had in mind.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you indent the file using 2 spaces and run swift-format on it? https://github.com/apple/swift-syntax/blob/main/CONTRIBUTING.md#formatting

Comment on lines +44 to +50
public static func == (lhs: Identifier, rhs: Identifier) -> Bool {
lhs.rawIdentifier == rhs.rawIdentifier
}

public func hash(into hasher: inout Hasher) {
hasher.combine(rawIdentifier)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don’t these implementations get automatically synthesized?


@_spi(RawSyntax)
public struct RawIdentifier: Equatable, Hashable, Sendable {
public let name: SyntaxText
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should do the trimming of backticks in RawIdentifier. That way it’s possible to construct a RawIdentifier from a RawSyntaxTokenView that contains a token. Ie. I think it should have an initializer with the signature init(_ raw: RawSyntaxTokenView).

Comment on lines +34 to +38
let name = rawText.withUTF8 {
syntaxArena.intern(
SyntaxText(buffer: SyntaxArenaAllocatedBufferPointer<UInt8>($0))
)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think we need to create a new arena and intern the string at all. If you pick up my suggestion of doing the trimming in RawIdentifier, you should be able to

  • Get the RawSyntaxTokenView.rawText
  • If the text has a leading and trailing backtick, slice off the first and last byte of the SyntaxText using a subscript
  • Constructing a SyntaxText from the slice again using SyntaxText.init(rebasing:)

What’s conceptually happening then, is that the whole text including backticks is allocated memory in the SyntaxArena and RawIdentifier just references the slice of that text without the backticks without doing any more memory allocations.


Also, I think we don’t want to trim an arbitrary number of backticks at the front and back, but a single backtick at the front and a single backtick at the back if both exist.

@_spi(RawSyntax)
public let rawIdentifier: RawIdentifier

let arena: RetainedSyntaxArena
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the arena can even be private.

let someToken = TokenSyntax(stringLiteral: "someToken")
XCTAssertNotNil(Identifier(someToken))

let nonIdentifierToken = DeclSyntax("let a = 1").firstToken(viewMode: .all)!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use XCTUnwrap instead of the force unwrapping? That way test execution continues and doesn’t crash if firstToken should be nil.

Comment on lines +20 to +21
@_spi(RawSyntax)
public let rawIdentifier: RawIdentifier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just call this public let raw: RawIdentifier to match the naming scheme that Syntax have a raw: RawSyntax property.

plemarquand added a commit to plemarquand/sourcekit-lsp that referenced this pull request May 1, 2024
Test functions with backticks in their identifier should have them
filtered out.

This is a stopgap until apple/swift-syntax#2576 is ready.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants