Auto merge of #59706 - matklad:the-essence-of-lexer, r=petrochenkov
The essence of lexer

cc @eddyb

I would love to make a reusable library to lex Rust code, which could be used by rustc, rust-analyzer, proc macros, etc. This **draft** PR is my attempt at the API. Currently, the PR uses the new lexer to lex comments and shebangs, while using the old lexer for everything else. This should be enough to agree on the API though!

### High-level picture

A `rustc_lexer` crate is introduced, with zero or minimal dependencies (just `unicode-xid`, for XID_Start and other Unicode queries). This crate basically exposes a single function: `next_token(&str) -> (TokenKind, usize)`, which returns the first token of a non-empty string (the `usize` is the length of the token). The main goal of the API is to be minimal. Concerns that are not strictly essential, like string interning, are left to the clients.
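For concreteness, here is a minimal sketch of that shape: a toy that only classifies whitespace versus everything else (the real `TokenKind` is of course much richer, and these names are illustrative, not the crate's final items).

```rust
/// Toy token kinds; a sketch of the API shape, not the real enum.
#[derive(Debug, PartialEq)]
pub enum TokenKind {
    Whitespace,
    Unknown,
}

/// Returns the kind and byte length of the first token of a non-empty string.
pub fn next_token(input: &str) -> (TokenKind, usize) {
    let first = input.chars().next().expect("input must be non-empty");
    if first.is_whitespace() {
        // Consume the maximal run of whitespace as one token.
        let len = input
            .chars()
            .take_while(|c| c.is_whitespace())
            .map(char::len_utf8)
            .sum();
        (TokenKind::Whitespace, len)
    } else {
        (TokenKind::Unknown, first.len_utf8())
    }
}
```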

### Finer Points

#### Iterator API

We should probably expose a convenience function `fn tokenize(&str) -> impl Iterator<Item = Token>`.

EDIT: I've added `tokenize`
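A sketch of how `tokenize` can be layered on top of `next_token` (reusing the toy `next_token` above; the PR's actual item type is a `Token` struct rather than a tuple):

```rust
/// Repeatedly calls `next_token`, advancing past each token, until the
/// input is exhausted.
pub fn tokenize(mut input: &str) -> impl Iterator<Item = (TokenKind, usize)> + '_ {
    std::iter::from_fn(move || {
        if input.is_empty() {
            return None;
        }
        let (kind, len) = next_token(input);
        input = &input[len..];
        Some((kind, len))
    })
}
```

The caller can slice the source itself if it needs the token text, which keeps the iterator free of allocation and interning concerns.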

#### Error handling

The lexer itself provides only a minimal amount of error detection and reporting. Additionally, it never fatal-errors and always produces some non-empty token. Examples of errors detected by the lexer:

* an unterminated block comment
* an unterminated string literal

Examples of errors **not** detected by the lexer:

* an invalid escape sequence in a string literal
* an out-of-range integer literal
* a bare `\r` in a doc comment

The idea is that clients are responsible for additional validation of tokens. This is the mode an IDE operates in: you want to skip validation for library files, because you are not showing errors there anyway, while for user code you want deep validation with quick fixes and suggestions, which does not really fit into the lexer itself.
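As a sketch of what that division of labor could look like (the names and fields here are illustrative, not necessarily the final enum): the lexer records facts, such as whether a literal was terminated, and the client turns those facts into diagnostics.

```rust
/// Illustrative token kinds that record facts instead of reporting errors.
pub enum RawTokenKind {
    BlockComment { terminated: bool },
    StrLiteral { terminated: bool },
    Unknown,
}

/// Client-side validation: a rustc-like client turns the recorded facts
/// into diagnostics, while an IDE may skip this pass entirely.
fn validate(kind: &RawTokenKind) -> Option<&'static str> {
    match kind {
        RawTokenKind::BlockComment { terminated: false } => {
            Some("unterminated block comment")
        }
        RawTokenKind::StrLiteral { terminated: false } => {
            Some("unterminated string literal")
        }
        _ => None,
    }
}
```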

In particular, in this PR an unclosed `/*` comment is handled by the new lexer, while a bare `\r` and the distinction between doc and non-doc comments are handled by the old lexer.

#### Performance

No attempt at performance measurement has been made so far :) I think it is acceptable to regress perf here a bit in exchange for cleaner code, and I hope the regression won't be too costly. In particular, because we validate tokens separately, we'll have to do one more pass over some of the tokens. I hope this is not a prohibitive cost. For example, for doc comments we already do two passes (lexing + interning), so adding a third one shouldn't be that much slower (and we also do an additional pass for UTF-8 validation). And lexing is hopefully not a bottleneck anyway. Note that for IDEs separate validation might actually improve performance, because we will be able to skip validation when, for example, computing completions.

Long term, I hope that this approach will allow for *better* performance. If we separate out pure lexing, in the future we can code-gen a super-optimized state machine that walks UTF-8 directly, instead of the current manual char-by-char toil.

#### Cursor API

For the implementation, I am going slightly against convention. Instead of defining a `Lexer` struct with a bunch of helper methods (`current`, `bump`) and a bunch of lexing methods (`lex_comment`, `lex_whitespace`), I define a `Cursor` struct which has only the helpers, and a top-level function with a `&mut Cursor` argument for each grammar production. I find this C style more readable for parsers and lexers.

EDIT: switched to a more conventional setup with lexing methods.
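For illustration, here are the two styles side by side, sketched with the `Cursor` helpers from this diff (`whitespace` and `lex_whitespace` are made-up names, not items from this PR):

```rust
// C-style: a free function per grammar production, taking the cursor.
fn whitespace(cursor: &mut Cursor<'_>) {
    while !cursor.is_eof() && cursor.nth_char(0).is_whitespace() {
        cursor.bump();
    }
}

// Conventional style (what the PR eventually switched to): the same
// production as a method on the cursor/lexer type.
impl<'a> Cursor<'a> {
    fn lex_whitespace(&mut self) {
        while !self.is_eof() && self.nth_char(0).is_whitespace() {
            self.bump();
        }
    }
}
```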

So, what do folks think about this?
bors committed Jul 21, 2019
2 parents 1301422 + 395ee0b commit 83dfe7b
Showing 15 changed files with 1,335 additions and 1,259 deletions.
8 changes: 8 additions & 0 deletions Cargo.lock
@@ -2972,6 +2972,13 @@ dependencies = [
"tempfile 3.0.5 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "rustc_lexer"
version = "0.1.0"
dependencies = [
"unicode-xid 0.1.0 (registry+https://github.com/rust-lang/crates.io-index)",
]

[[package]]
name = "rustc_lint"
version = "0.0.0"
@@ -3622,6 +3629,7 @@ dependencies = [
"log 0.4.6 (registry+https://github.com/rust-lang/crates.io-index)",
"rustc_data_structures 0.0.0",
"rustc_errors 0.0.0",
"rustc_lexer 0.1.0",
"rustc_macros 0.1.0",
"rustc_target 0.0.0",
"scoped-tls 1.0.0 (registry+https://github.com/rust-lang/crates.io-index)",
9 changes: 9 additions & 0 deletions src/librustc_lexer/Cargo.toml
@@ -0,0 +1,9 @@
[package]
authors = ["The Rust Project Developers"]
name = "rustc_lexer"
version = "0.1.0"
edition = "2018"

# Note that this crate purposefully does not depend on other rustc crates
[dependencies]
unicode-xid = { version = "0.1.0", optional = true }
57 changes: 57 additions & 0 deletions src/librustc_lexer/src/cursor.rs
@@ -0,0 +1,57 @@
use std::str::Chars;

/// Peekable iterator over the chars of the input string.
pub(crate) struct Cursor<'a> {
    /// Byte length of the original input; used to compute how much
    /// has been consumed so far.
    initial_len: usize,
    chars: Chars<'a>,
    #[cfg(debug_assertions)]
    prev: char,
}

/// Sentinel returned when peeking past the end of the input.
pub(crate) const EOF_CHAR: char = '\0';

impl<'a> Cursor<'a> {
    pub(crate) fn new(input: &'a str) -> Cursor<'a> {
        Cursor {
            initial_len: input.len(),
            chars: input.chars(),
            #[cfg(debug_assertions)]
            prev: EOF_CHAR,
        }
    }

    /// For debug assertions only: the previously consumed character.
    pub(crate) fn prev(&self) -> char {
        #[cfg(debug_assertions)]
        {
            self.prev
        }

        #[cfg(not(debug_assertions))]
        {
            '\0'
        }
    }

    /// Peeks at the `n`-th character ahead without consuming anything,
    /// returning `EOF_CHAR` if the input runs out.
    pub(crate) fn nth_char(&self, n: usize) -> char {
        self.chars().nth(n).unwrap_or(EOF_CHAR)
    }

    /// Checks whether the whole input has been consumed.
    pub(crate) fn is_eof(&self) -> bool {
        self.chars.as_str().is_empty()
    }

    /// Returns the number of bytes consumed so far.
    pub(crate) fn len_consumed(&self) -> usize {
        self.initial_len - self.chars.as_str().len()
    }

    /// Returns an iterator over the remaining characters.
    fn chars(&self) -> Chars<'a> {
        self.chars.clone()
    }

    /// Moves to the next character.
    pub(crate) fn bump(&mut self) -> Option<char> {
        let c = self.chars.next()?;

        #[cfg(debug_assertions)]
        {
            self.prev = c;
        }

        Some(c)
    }
}
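As a usage sketch (this function is illustrative, not part of the diff): a lexing routine drives the cursor with `bump` and `nth_char`, then reads off the token length with `len_consumed`.

```rust
/// Byte length of a `//` line comment at the start of `input`
/// (a made-up example, not part of this PR).
fn line_comment_len(input: &str) -> usize {
    let mut cursor = Cursor::new(input);
    // Consume the leading `//`.
    assert_eq!(cursor.bump(), Some('/'));
    assert_eq!(cursor.bump(), Some('/'));
    // Consume everything up to, but not including, the newline.
    while !cursor.is_eof() && cursor.nth_char(0) != '\n' {
        cursor.bump();
    }
    cursor.len_consumed()
}
```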
