Tracking PR for v0.10.0 release #1340

Open: wants to merge 104 commits into base: main


bobbinth

This is the tracking PR for the v0.10.0 release.

Overcastan and others added 30 commits February 27, 2024 22:27
* chore: fix no-std errors
* fix: Falcon DSA decorators and tests
This commit introduces two things of interest:

1. The `miette` crate, with dependency configuration to support no-std
   builds. This crate provides the foundation of the diagnostics infra
   that will be used in later commits. It is primarily based around a
   `Diagnostic` trait, with a derive macro similar to `thiserror`'s for
   decorating Error types with diagnostics data. It also provides some
   primitives for source spans and source file information, which ties
   into those diagnostics to print source snippets in errors with spans.

2. The `diagnostics` module, which in addition to re-exporting the parts
   of `miette` that we want to use, also defines some utility types that
   make expressing common error structures more convenient. It also
   defines the `SourceFile` alias which we will use throughout the crate
   when referencing the source file some construct was derived from.
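To make the idea of "source snippets in errors with spans" concrete, here is a minimal std-only sketch. It is not `miette`'s renderer or API; `render_snippet` and its output layout are hypothetical, and the code assumes ASCII input with a non-empty span.

```rust
/// Hypothetical helper (not part of `miette`): renders the source line that
/// contains the byte range `start..end`, with a caret underline and a label.
/// Assumes ASCII input and `start < end`.
fn render_snippet(source: &str, start: usize, end: usize, label: &str) -> String {
    // Byte offsets of the line that contains `start`.
    let line_start = source[..start].rfind('\n').map_or(0, |i| i + 1);
    let line_end = source[start..].find('\n').map_or(source.len(), |i| start + i);
    // 1-based line number: count the newlines that precede the span.
    let line_no = source[..start].matches('\n').count() + 1;
    let line = &source[line_start..line_end];
    // Clamp the underline to the first line covered by the span.
    let width = end.min(line_end).saturating_sub(start).max(1);
    let gutter = " ".repeat(line_no.to_string().len());
    let pad = " ".repeat(start - line_start);
    let carets = "^".repeat(width);
    format!("{line_no} | {line}\n{gutter} | {pad}{carets} {label}")
}
```

Given a span covering a misspelled instruction, this prints the offending line followed by a caret underline and the label, which is the general shape of output the diagnostics infrastructure is meant to produce.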
This commit adds `thiserror` to the workspace for use in defining error
types. The way it is added is a bit unusual, for reasons that need some
background:

Currently, the `std::error::Error` trait is only stable in libstd, and
its use in libcore is unstable, behind the `error_in_core` feature. This
makes defining `Error` types very awkward in general. As a result of
this, `thiserror` currently depends on libstd.

However, the crate author, dtolnay, has expressed that once
`error_in_core` stabilizes, `thiserror` will support no-std
environments. It is expected that this will stabilize soon-ish, but
there are no definite dates on feature stabilization.

Even though `thiserror` ostensibly requires libstd, it actually is
trivial to support no-std with it, by simply choosing _not_ to emit the
`std::error::Error` trait implementation when building for no-std, or
by enabling the `error_in_core` feature when building with nightly.
However, dtolnay would rather not maintain that as a build option, since
`error_in_core` is so close to stabilization. So to bridge the gap, I've
forked `thiserror` and implemented the necessary modifications. I've done
the same for `miette`, used for diagnostics, which depends on `thiserror`
internally.

In the future, when `thiserror` natively supports no-std builds, we can remove
these forked dependencies in favor of mainline.  In the meantime, we can
benefit from the use of `thiserror`'s ergonomic improvements when it comes
to defining error types, and it allows us to use `miette` as well.
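As a rough illustration of the "do not emit the `std::error::Error` impl for no-std" approach described above, here is a hand-written sketch of what such output could look like. The `TokenError` type and the `std` cargo feature name are assumptions for the example, not taken from the forked crates.

```rust
use core::fmt;

/// Hypothetical error type, written by hand to show the shape of the impls.
#[derive(Debug, PartialEq)]
pub enum TokenError {
    UnexpectedEof,
    InvalidChar(char),
}

// `Display` lives in libcore, so this impl is available in no-std builds.
impl fmt::Display for TokenError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnexpectedEof => write!(f, "unexpected end of file"),
            Self::InvalidChar(c) => write!(f, "invalid character '{c}'"),
        }
    }
}

// Only emitted when the (assumed) `std` cargo feature is enabled, so that
// no-std builds never reference `std::error::Error`.
#[cfg(feature = "std")]
impl std::error::Error for TokenError {}
```

Without the feature, the type is still usable as an error value with a `Display` message. It just does not implement the `Error` trait.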
NOTE: This commit contains `use` statements and references
types/functions that do not exist in the source tree yet, because it is
being introduced retroactively. It also leaves the `parser` module
disconnected from the actual module hierarchy.

This commit implements a new LALR(1) grammar and parser for the Miden
Assembly language, which will replace the existing MASM parsing code. It
consists of the following components:

* The grammar, expressed using `lalrpop` (which supports no-std builds
  if you were wondering). This grammar is LALR(1) in formal grammar
  parlance, and can be found in
  `assembly/src/parser/grammar.lalrpop`. Many common validations and
  optimizations are performed during parsing, as we can restrict the
  space of what is possible to express in the grammar itself, rather
  than having to implement it manually via recursive descent.

* The lexer, found in `assembly/src/parser/token.rs`, which makes use of
  `logos` (also no-std compatible), a well-established and fast
  lexer-generator. It is defined in terms of a `Token` type, on which
  the lexer trait is derived using a set of rules attached to each
  variant of the `Token` enum. There are various ways you can approach
  defining lexers, but in our case I opted for a stricter definition, in
  which the full MASM instruction set is tokenized as keywords, rather
  than parsing the instruction names later in the pipeline. This means
  that typos are caught immediately during parsing, with precise
  locations and diagnostics which tell the user what tokens are expected
  at the erroneous location.

* The parser interface, found in `assembly/src/parser/mod.rs`, which is
  basically a thin wrapper around instantiating the lexer, named source
  file, and invocation of the generated LALRPOP parser.

* A set of types and a trait for expressing source-spanned types. The
  `SourceSpan` type expresses the range of bytes covered by a token in
  the source file from which it was parsed, and is composed of two u32
  indices. The lexer emits these indices for each token. The grammar
  then can make use of those indices to construct a `SourceSpan` for
  each production as it sees fit. The `Spanned` trait is implemented for
  types which have an associated `SourceSpan`; typically these types
  would be the types making up the AST, but it is also useful as the AST
  is lowered to propagate source locations through the pipeline. Lastly,
  the `Span` type allows wrapping third-party types such that they
  implement `Spanned`, e.g. `Span<u32>` is a spanned-`u32` value, which
  would otherwise be impossible to associate a `SourceSpan` to.

* Two error types, `LexerError` and `ParsingError`; a `LexerError` can be
  converted into a `ParsingError`. These make use of the new diagnostics
  and `thiserror` infrastructure and make for a good illustration of how
  ergonomic such types can be with those additions.
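The span types described in the list above can be sketched roughly as follows. The type and trait names come from the commit message, but the fields and methods (`new`, `len`, `merge`, `into_inner`) are assumptions made for illustration and may not match the actual implementation.

```rust
use core::ops::Range;

/// A byte range in a source file, stored as two `u32` indices.
/// Assumes `start <= end`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSpan {
    start: u32,
    end: u32,
}

impl SourceSpan {
    pub fn new(range: Range<u32>) -> Self {
        SourceSpan { start: range.start, end: range.end }
    }

    /// Number of bytes covered by the span.
    pub fn len(&self) -> u32 {
        self.end - self.start
    }

    /// Smallest span covering both inputs, e.g. for a grammar production
    /// built from several tokens.
    pub fn merge(self, other: SourceSpan) -> SourceSpan {
        SourceSpan {
            start: self.start.min(other.start),
            end: self.end.max(other.end),
        }
    }
}

/// Implemented by anything that knows where it came from in the source.
pub trait Spanned {
    fn span(&self) -> SourceSpan;
}

/// Wraps a value that has no span of its own, such as a `u32` literal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span<T> {
    span: SourceSpan,
    inner: T,
}

impl<T> Span<T> {
    pub fn new(span: SourceSpan, inner: T) -> Self {
        Span { span, inner }
    }

    pub fn into_inner(self) -> T {
        self.inner
    }
}

impl<T> Spanned for Span<T> {
    fn span(&self) -> SourceSpan {
        self.span
    }
}
```

With this shape, `Span<u32>` carries a location for a plain integer, and a production covering two tokens can compute its own span by merging theirs.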
This commit introduces a simple implementation of Prettier-style source
code formatting infrastructure, using Philip Wadler's algorithm extended
with some extra features recently described in a blog post by GitHub
user justinpombrio.

This commit does not make use of the infrastructure yet; that will come
in later PRs which introduce changes to the AST.
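For readers unfamiliar with this family of pretty printers, here is a heavily simplified sketch of the core idea: a document tree whose groups are laid out flat if they fit in the remaining width, and broken across lines otherwise. This is not the implementation in this commit. The `Doc` type and `render` function are hypothetical, and unlike Wadler's full algorithm the fit check only looks at the group itself, not at the text that follows it.

```rust
/// Hypothetical document tree for a greedy, Wadler-inspired layout.
enum Doc {
    Text(String),
    /// A line break that renders as a single space when its group is flat.
    Line,
    Concat(Vec<Doc>),
    /// Increases the indentation used after line breaks inside the child.
    Nest(usize, Box<Doc>),
    /// Laid out flat if it fits in the remaining width, else broken.
    Group(Box<Doc>),
}

#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Flat,
    Break,
}

/// Width of `doc` when every `Line` is rendered as a space.
fn flat_width(doc: &Doc) -> usize {
    match doc {
        Doc::Text(s) => s.chars().count(),
        Doc::Line => 1,
        Doc::Concat(docs) => docs.iter().map(flat_width).sum(),
        Doc::Nest(_, d) | Doc::Group(d) => flat_width(d),
    }
}

fn render(doc: &Doc, width: usize) -> String {
    let mut out = String::new();
    let mut col = 0;
    // Work stack of (indentation, mode, document), processed left to right.
    let mut stack = vec![(0usize, Mode::Break, doc)];
    while let Some((indent, mode, d)) = stack.pop() {
        match d {
            Doc::Text(s) => {
                out.push_str(s);
                col += s.chars().count();
            }
            Doc::Line => {
                if mode == Mode::Flat {
                    out.push(' ');
                    col += 1;
                } else {
                    out.push('\n');
                    out.push_str(&" ".repeat(indent));
                    col = indent;
                }
            }
            Doc::Concat(docs) => {
                // Push in reverse so the first child is popped first.
                for child in docs.iter().rev() {
                    stack.push((indent, mode, child));
                }
            }
            Doc::Nest(n, child) => stack.push((indent + *n, mode, &**child)),
            Doc::Group(child) => {
                let fits = col + flat_width(child) <= width;
                let m = if mode == Mode::Flat || fits { Mode::Flat } else { Mode::Break };
                stack.push((indent, m, &**child));
            }
        }
    }
    out
}
```

The same document therefore renders on one line when the width allows it, and as an indented multi-line block when it does not, without the caller choosing between the two.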
This commit adds implementations of the new AST types:

* `Form`, represents items which are valid at the top-level of a MASM
  module. The parser produces a vector of `Form` when parsing a module,
  which is then later translated into a `Module` (coming in a later
  commit) during semantic analysis. After semantic analysis, this type
  is never used again.
* `Block`, represents a block in Miden Assembly, i.e. a flat sequence of
  operations. These are akin to "regions" in compiler parlance, a subtle
  extension of basic blocks that allows instructions to have nested
  regions/blocks, whereas strict basic blocks in a typical SSA compiler
  do not permit nesting in this way. Since we have structured control
  flow operations, our blocks have region-like semantics.
* `Op`, represents the full MASM instruction set, unlike `Instruction`
  which represents the subset without control flow ops.
* `Constant` and `ConstantExpr`, which represent the subset of the
  syntax for constant expressions and definitions. Unlike the previous
  parser and AST, we do not evaluate constants during parsing - except to
  do infallible constant folding where possible - but instead do it later
  during semantic analysis when we have the full set of constant
  definitions on hand. This lets constant definitions appear anywhere in
  the source file, and in any order as long as there are no cyclic
  dependencies.
* `Immediate<T>`, which represents instruction immediates generically,
  and in a form that supports the superposition of literal values and
  constant identifiers anywhere that immediates are allowed. These are
  then resolved to concrete values during semantic analysis. Immediates
  are thus represented as `Immediate<T>` in the AST universally, except
  in a small number of cases where we may only want to allow literals.
* `Ident`, which represents a cheaply-clonable identifier in the source
  code (not quite interned, but close in many cases). When parsing a
  string into an `Ident`, it imposes the general set of validation rules
  which apply to bare (unquoted) identifiers, such as those used for
  module names, or import aliases. An `Ident` can be constructed without
  enforcing those rules, such as the case for `ProcedureName`, which
  uses `Ident` internally, but enforces a looser set of rules so as to
  support quoted identifiers in the source code. `Ident` is used
  wherever an identifier is represented in the syntax tree.

NOTE: These modules are disconnected from the module hierarchy in this
commit, and may reference types that are not listed, or types which have
familiar names but which will have new implementations in later commits.
Please keep that in mind during review.
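To illustrate how deferring constant evaluation lets definitions appear in any order, and how an `Immediate<T>` can be either a literal or a constant name, here is a toy sketch. The variants, the `eval` and `resolve` functions, and the use of `u64` with wrapping addition are assumptions for the example and do not reflect the real AST or field arithmetic.

```rust
use std::collections::{HashMap, HashSet};

/// Toy constant expression: a literal, a named constant, or a sum.
enum ConstantExpr {
    Literal(u64),
    Var(String),
    Add(Box<ConstantExpr>, Box<ConstantExpr>),
}

/// An instruction immediate: a literal value or the name of a constant.
enum Immediate<T> {
    Value(T),
    Constant(String),
}

#[derive(Debug, PartialEq)]
enum ConstError {
    Undefined(String),
    Cycle(String),
}

/// Evaluates `expr` against the full set of definitions. `visiting` holds
/// the names currently being evaluated, which is how cycles are detected.
fn eval(
    expr: &ConstantExpr,
    defs: &HashMap<String, ConstantExpr>,
    visiting: &mut HashSet<String>,
) -> Result<u64, ConstError> {
    match expr {
        ConstantExpr::Literal(v) => Ok(*v),
        ConstantExpr::Add(a, b) => {
            let lhs = eval(a, defs, visiting)?;
            let rhs = eval(b, defs, visiting)?;
            Ok(lhs.wrapping_add(rhs)) // toy arithmetic, not field arithmetic
        }
        ConstantExpr::Var(name) => {
            let def = defs
                .get(name)
                .ok_or_else(|| ConstError::Undefined(name.clone()))?;
            // `insert` returns false if `name` is already being evaluated.
            if !visiting.insert(name.clone()) {
                return Err(ConstError::Cycle(name.clone()));
            }
            let value = eval(def, defs, visiting);
            visiting.remove(name);
            value
        }
    }
}

/// Resolves an immediate to a concrete value, as semantic analysis would.
fn resolve(
    imm: &Immediate<u64>,
    defs: &HashMap<String, ConstantExpr>,
) -> Result<u64, ConstError> {
    match imm {
        Immediate::Value(v) => Ok(*v),
        Immediate::Constant(name) => {
            eval(&ConstantExpr::Var(name.clone()), defs, &mut HashSet::new())
        }
    }
}
```

Because evaluation happens only once every definition is known, a constant may refer to one defined later in the file, and a cyclic definition surfaces as an error rather than looping forever.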
This commit introduces the `Visit` and `VisitMut` traits and associated
helpers, which can be used to succinctly express analysis and rewrite
passes on the Miden Assembly syntax tree.

No such passes are implemented in this commit, but will be defined in
subsequent commits.
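As a sketch of the visitor pattern this commit describes, the following shows a read-only `Visit` trait over a toy syntax tree. The `Op` and `Block` shapes, the method names, and the `CountInsts` pass are all hypothetical; the real traits cover the full AST and include a mutable counterpart.

```rust
/// Toy syntax tree: instructions plus structured control flow.
enum Op {
    Inst(String),
    If { then_blk: Block, else_blk: Block },
    While(Block),
}

struct Block {
    ops: Vec<Op>,
}

/// Each method defaults to "just recurse", so a pass only overrides the
/// node kinds it cares about.
trait Visit {
    fn visit_block(&mut self, block: &Block) {
        visit_block(self, block)
    }
    fn visit_op(&mut self, op: &Op) {
        visit_op(self, op)
    }
    fn visit_inst(&mut self, _name: &str) {}
}

/// Default traversal of a block: visit each operation in order.
fn visit_block<V: Visit + ?Sized>(visitor: &mut V, block: &Block) {
    for op in &block.ops {
        visitor.visit_op(op);
    }
}

/// Default traversal of an operation: descend into nested blocks.
fn visit_op<V: Visit + ?Sized>(visitor: &mut V, op: &Op) {
    match op {
        Op::Inst(name) => visitor.visit_inst(name),
        Op::If { then_blk, else_blk } => {
            visitor.visit_block(then_blk);
            visitor.visit_block(else_blk);
        }
        Op::While(body) => visitor.visit_block(body),
    }
}

/// Example analysis pass: counts instructions at any nesting depth.
struct CountInsts(usize);

impl Visit for CountInsts {
    fn visit_inst(&mut self, _name: &str) {
        self.0 += 1;
    }
}
```

The analysis pass is a three-line `impl`, because the traversal logic lives in the shared helper functions rather than in each pass.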
This commit is the first in a sequence that refactors the `assembly`
crate to use the new parser, introduces the remaining AST changes, and
then propagates those changes while refactoring the parts of the
compilation pipeline that can take advantage of new features and
analyses that were previously unavailable.

This commit specifically refactors/rewrites the set of types which
represent various details about procedures and procedure "aliases", i.e.
re-exported procedures. There are some new types implemented as well, to
better represent the specificity of a particular procedure identifier,
and to build on other types in a more structured fashion.

NOTE: This commit references things which are not yet implemented in the
source tree. This is intentional, so that you can focus on this set of
related changes abstractly and then review later changes with this
context in mind.
This commit builds on changes to the procedure types to represent a
richer set of targets, with varying degrees of specificity:

* The `MastRoot` variant remains unchanged, but gains a source span for
  use later during compilation
* The `ProcedureName` variant represents a local name
* The `ProcedurePath` variant represents an unresolved projection of an
  imported module/function. It depends on the current module context to
  resolve.
* The `AbsoluteProcedurePath` variant represents a resolved projection
  of an imported module/function. This type is used when we have
  resolved a `ProcedurePath` to an imported module, and thus know the
  absolute path of the imported function. However, the distinction
  between an absolute path which is "fully-resolved", i.e. not an alias,
  and an absolute path which is "partially-resolved", i.e.
  possibly-aliased, is not represented here. Instead, that distinction
  is implicit depending on the phase of compilation we are in. This
  will become clearer in later commits.

Later, this type will be used to represent the targets of any
instruction which references a callable (name or MAST root).
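The four variants listed above might be modeled along these lines. This is a simplified sketch: the enum is called `InvocationTarget` here (the name used for it elsewhere in this PR), source spans are omitted, the payload types are plain placeholders, and the `resolve` function and import map are hypothetical.

```rust
use std::collections::HashMap;

/// Simplified sketch of the target variants; payload types are placeholders.
#[derive(Debug, Clone, PartialEq)]
enum InvocationTarget {
    /// A procedure identified by its MAST root digest.
    MastRoot([u8; 32]),
    /// A procedure defined in the current module.
    ProcedureName(String),
    /// `alias::name`, where `alias` still has to be looked up in the imports.
    ProcedurePath { module: String, name: String },
    /// A path whose module has been resolved to its absolute location.
    AbsoluteProcedurePath { path: String, name: String },
}

/// Resolves an unresolved path against the current module's imports
/// (import alias to absolute module path). Other variants pass through.
fn resolve(
    target: InvocationTarget,
    imports: &HashMap<String, String>,
) -> Option<InvocationTarget> {
    match target {
        InvocationTarget::ProcedurePath { module, name } => {
            let path = imports.get(&module)?.clone();
            Some(InvocationTarget::AbsoluteProcedurePath { path, name })
        }
        other => Some(other),
    }
}
```

Note that the resolved variant says nothing about whether the target is itself a re-export, which matches the point above that this distinction is implicit in the compilation phase.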
This commit refactors the `LibraryPath` and `LibraryNamespace` types to
build on the `Ident` type, support `Spanned`, and to provide better
building blocks for other types, such as `FullyQualifiedProcedureName`.

In addition, the structure of the `library` namespace is cleaned up, and
the `MaslLibrary` type has its internals rewritten to make use of the
new parsing infrastructure.

The `Library` trait was refactored to remove the associated `ModuleIter`
type, as many forms of iterators have "unnameable" types which cannot be
easily expressed when defining a trait implementation. Instead, we make
use of RPIT (return-position impl trait) to achieve the same goal, while
retaining the ability to impose some useful constraints on the type.
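A minimal sketch of the trade-off described above, assuming a toolchain where return-position `impl Trait` in traits is stable (Rust 1.75 or later). The `Module` and `InMemoryLibrary` types and the exact bounds are assumptions for the example and may differ from the real `Library` trait.

```rust
/// Placeholder module type for the example.
struct Module {
    name: String,
}

trait Library {
    /// Rather than an associated `ModuleIter` type, the trait returns an
    /// opaque iterator while still constraining it (here: exact size).
    fn modules(&self) -> impl ExactSizeIterator<Item = &Module> + '_;
}

/// Hypothetical implementation that stores modules keyed by path.
struct InMemoryLibrary {
    entries: Vec<(String, Module)>,
}

impl Library for InMemoryLibrary {
    fn modules(&self) -> impl ExactSizeIterator<Item = &Module> + '_ {
        // The closure makes this iterator's concrete type unnameable, which
        // is exactly what an associated type would have forced us to spell.
        self.entries.iter().map(|(_path, module)| module)
    }
}
```

Callers can still rely on `len()` and normal iterator adapters, since those come from the bounds stated in the trait signature.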
This commit refactors the `Instruction` syntax tree type in the
following ways:

* Remove the various `Call*`, `Exec*`, etc. instruction variants in
  favor of `Call`, `Exec`, `SysCall`, and `ProcRef`, all of which now
  take an `InvocationTarget`.
* Replace all immediate values with `Immediate<T>`
* Introduce wrapper types where necessary to support the `Serializable`
  and `Deserializable` traits, and to shield us from breaking upstream
  changes to those types (to some degree). See `SignatureKind` and
  `DebugOptions` specifically.
* Give the `AdviceInjectorNode`, `DebugOptions`, and other enumerated
  types explicit physical representation and discriminant values, so
  that we can safely serialize/deserialize the discriminant tags.
* Allow expressing a slightly wider range of variation in the
  `DebugOptions` syntax, which is then converted to the more explicit
  representation during semantic analysis.
* Make use of the new `PrettyPrint` trait for formatting
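The point about explicit representations and discriminant values can be sketched as follows. `DemoTag` and its variants are hypothetical stand-ins, not the real `AdviceInjectorNode` or `DebugOptions` definitions, and `write_tag` is an illustrative helper rather than the `Serializable` API.

```rust
use core::convert::TryFrom;

/// Hypothetical tag enum with a fixed one-byte representation. Pinning the
/// discriminants means reordering variants cannot change the wire format.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoTag {
    StackAll = 0,
    StackTop = 1,
    MemAll = 2,
}

impl TryFrom<u8> for DemoTag {
    type Error = u8;

    /// Decodes a tag byte, returning the unknown byte on failure.
    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0 => Ok(DemoTag::StackAll),
            1 => Ok(DemoTag::StackTop),
            2 => Ok(DemoTag::MemAll),
            other => Err(other),
        }
    }
}

/// Serializes the tag as its discriminant byte.
fn write_tag(tag: DemoTag, out: &mut Vec<u8>) {
    out.push(tag as u8);
}
```

Decoding goes through `TryFrom` rather than a transmute, so an unrecognized byte becomes an error value instead of undefined behavior.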
This is the last commit which will contain changes to the abstract
syntax tree, and is the one which answers the question: "what does a
parsed module look like?".

This commit introduces a new `Module` type, which supersedes the
previous `ProgramAst`, `ModuleAst` and `Module` types, providing all of
the information you might want to know about a given module, as well as
supporting the core functionality necessary to parse, serialize, and
pretty print modules.

A `Module` has a `ModuleKind`, which identifies what type of module it
represents: an executable (e.g. what was previously a `ProgramAst`), a
library (e.g. what was previously a `ModuleAst`), or a kernel (which
previously had no specific representation).

The `ModuleKind` dictates the semantics of the module, but in practice
all the "kinds" of modules are virtually identical except for these
slight semantic differences, which is why they have been unified here.
These semantics are validated by the semantic analysis pass, and catch
the set of things you are not allowed to do in certain types of modules,
e.g. exporting from an executable module, syscall/call in a kernel
module, `begin` in a library module.

Lastly, this commit removes the old `ModuleImports` type, which is
superseded by these changes (as well as subsequent ones in other parts
of the assembler). In its place is a new `Import` type which represents
all of the details about a specific module import. Each `Module` has a
set of `Import`s associated with it, which it uses to resolve names, as
well as to determine syntax-level inter-module dependencies.
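The per-kind restrictions mentioned in this commit message (no exports from an executable, no `call`/`syscall` in a kernel, no `begin` in a library) could be checked roughly as below. Only `ModuleKind` is named in the commit message; the `Item` and `SemanticError` enums and the `check` function are hypothetical, and only the rules stated above are modeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModuleKind {
    Executable,
    Library,
    Kernel,
}

/// Toy stand-in for the syntax items that the rules care about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Item {
    Export,
    Begin,
    Call,
    SysCall,
    Exec,
}

#[derive(Debug, PartialEq, Eq)]
enum SemanticError {
    ExportInExecutable,
    CallInKernel,
    SysCallInKernel,
    BeginInLibrary,
}

/// Checks one item against the rules for the enclosing module kind.
fn check(kind: ModuleKind, item: Item) -> Result<(), SemanticError> {
    match (kind, item) {
        (ModuleKind::Executable, Item::Export) => Err(SemanticError::ExportInExecutable),
        (ModuleKind::Kernel, Item::Call) => Err(SemanticError::CallInKernel),
        (ModuleKind::Kernel, Item::SysCall) => Err(SemanticError::SysCallInKernel),
        (ModuleKind::Library, Item::Begin) => Err(SemanticError::BeginInLibrary),
        // Combinations not mentioned in the commit message are accepted here.
        _ => Ok(()),
    }
}
```

Because the kinds differ only in rules like these, a single `Module` type with a kind tag is enough, and the checks can live in one analysis pass.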
This commit allows enabling/disabling the storage of debug info and
source code in serialized modules and various AST structures.
bitwalker and others added 30 commits April 29, 2024 11:47
…onacci

Signed-off-by: GopherJ <alex_cj96@foxmail.com>