Tracking PR for v0.10.0 release #1340
Merged
* chore: fix no-std errors * fix: Falcon DSA decorators and tests
This commit introduces two things of interest: 1. The `miette` crate, with dependency configuration to support no-std builds. This crate provides the foundation of the diagnostics infra that will be used in later commits. It is primarily based around a Diagnostics trait, with a derive-macro similar to thiserror for decorating Error types with diagnostics data. It also provides some primitives for source spans and source file information, which ties into those diagnostics to print source snippets in errors with spans. 2. The `diagnostics` module, which in addition to re-exporting the parts of `miette` that we want to use, also defines some utility types that make expressing common error structures more convenient. It also defines the `SourceFile` alias which we will use throughout the crate when referencing the source file some construct was derived from.
This commit adds `thiserror` to the workspace for use in defining error types. It is a bit odd, however, due to some background context you will want to know: Currently, the `std::error::Error` trait is only stable in libstd, and its use in libcore is unstable, behind the `error_in_core` feature. This makes defining `Error` types very awkward in general. As a result of this, `thiserror` currently depends on libstd. However, the crate author, dtolnay, has expressed that once `error_in_core` stabilizes, `thiserror` will support no-std environments. It is expected that this will stabilize soon-ish, but there are no definite dates on feature stabilization. Even though `thiserror` ostensibly requires libstd, it is actually trivial to support no-std with it, by simply choosing _not_ to emit the `std::error::Error` trait implementation when building for no-std, or by enabling the `error_in_core` feature when building with nightly. dtolnay has expressed that they would rather not maintain that as a build option since `error_in_core` is so close to stabilization. So to bridge the gap, I've forked both `thiserror` and `miette` (used for diagnostics, and which depends on `thiserror` internally) and implemented the necessary modifications. In the future, when `thiserror` natively supports no-std builds, we can remove these forked dependencies in favor of mainline. In the meantime, we can benefit from `thiserror`'s ergonomic improvements when it comes to defining error types, and it allows us to use `miette` as well.
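The "choose not to emit the `std::error::Error` implementation" approach mentioned above can be sketched by hand, without any derive macro. This is only an illustration: the `ParseError` type and the `std` Cargo feature name are hypothetical and are not taken from the forked crates.

```rust
use std::fmt; // `core::fmt` offers the same items in a no-std build

/// Hypothetical error type, for illustration only.
#[derive(Debug, PartialEq)]
pub enum ParseError {
    UnexpectedEof,
    InvalidToken { offset: u32 },
}

// `Display` is defined in libcore, so this impl does not require libstd.
impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::UnexpectedEof => write!(f, "unexpected end of file"),
            Self::InvalidToken { offset } => write!(f, "invalid token at byte {offset}"),
        }
    }
}

// Only emitted when the (assumed) `std` feature is enabled, so that
// no-std builds never mention `std::error::Error`.
#[cfg(feature = "std")]
impl std::error::Error for ParseError {}
```

The type stays usable for formatting and matching in either configuration; only the trait implementation is gated.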
NOTE: This commit has `use` statements, and has types/functions that do not exist in the source tree yet. This is because this commit is being introduced retroactively. This commit leaves the `parser` module disconnected from the actual module hierarchy in this commit. This commit implements a new LALR(1) grammar and parser for the Miden Assembly language, which will replace the existing MASM parsing code. It consists of the following components: * The grammar, expressed using `lalrpop` (which supports no-std builds if you were wondering). This grammar is LALR(1) in formal grammar parlance, and can be found in `assembly/src/parser/grammar.lalrpop`. Many common validations and optimizations are performed during parsing, as we can restrict the space of what is possible to express in the grammar itself, rather than having to implement it manually via recursive descent. * The lexer, found in `assembly/src/parser/token.rs`, which makes use of `logos` (also no-std compatible), a well-established and fast lexer-generator. It is defined in terms of a `Token` type, on which the lexer trait is derived using a set of rules attached to each variant of the `Token` enum. There are various ways you can approach defining lexers, but in our case I opted for a stricter definition, in which the full MASM instruction set is tokenized as keywords, rather than parsing the instruction names later in the pipeline. This means that typos are caught immediately during parsing, with precise locations and diagnostics which tell the user what tokens are expected at the erroneous location. * The parser interface, found in `assembly/src/parser/mod.rs`, which is basically a thin wrapper around instantiating the lexer, named source file, and invocation of the generated LALRPOP parser. * A set of types and a trait for expressing source-spanned types. The `SourceSpan` type expresses the range of bytes covered by a token in the source file from which it was parsed, and is composed of two u32 indices. 
The lexer emits these indices for each token. The grammar then can make use of those indices to construct a `SourceSpan` for each production as it sees fit. The `Spanned` trait is implemented for types which have an associated `SourceSpan`; typically these types would be the types making up the AST, but it is also useful as the AST is lowered to propagate source locations through the pipeline. Lastly, the `Span` type allows wrapping third-party types such that they implement `Spanned`, e.g. `Span<u32>` is a spanned-`u32` value, which would otherwise be impossible to associate a `SourceSpan` to. * Two error types, `LexerError` and `ParsingError`, the latter of which can be converted into from the former. These make use of the new diagnostics and `thiserror` infrastructure and make for a good illustration of how ergonomic such types can be with those additions.
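To make the relationship between `SourceSpan`, `Spanned`, and `Span<T>` concrete, here is a minimal sketch. Only the three names and the "two u32 indices" detail come from the description above; the field names, constructors, and the `slice` helper are assumptions made for illustration.

```rust
/// A byte range in a source file, stored as two u32 indices.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSpan {
    start: u32,
    end: u32,
}

impl SourceSpan {
    pub fn new(start: u32, end: u32) -> Self {
        assert!(start <= end, "span start must not exceed its end");
        Self { start, end }
    }

    /// Hypothetical helper: the source text covered by this span, if in bounds.
    pub fn slice<'a>(&self, source: &'a str) -> Option<&'a str> {
        source.get(self.start as usize..self.end as usize)
    }
}

/// Implemented by anything that knows where in the source it came from.
pub trait Spanned {
    fn span(&self) -> SourceSpan;
}

/// Wraps a value of any type so that it carries a `SourceSpan`.
#[derive(Debug, Clone, PartialEq)]
pub struct Span<T> {
    span: SourceSpan,
    inner: T,
}

impl<T> Span<T> {
    pub fn new(span: SourceSpan, inner: T) -> Self {
        Self { span, inner }
    }

    pub fn into_inner(self) -> T {
        self.inner
    }
}

impl<T> Spanned for Span<T> {
    fn span(&self) -> SourceSpan {
        self.span
    }
}
```

With this shape, a `Span<u32>` can carry the location of a literal such as the `42` in `push.42` even though `u32` itself has nowhere to store a span.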
This commit introduces a simple implementation of Prettier-style source code formatting infrastructure, using the algorithm of Philip Wadler, extended with some extra features recently described in a blog post by GitHub user justinpombrio. This commit does not make use of the infrastructure yet; that will come in later PRs which introduce changes to the AST.
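For readers unfamiliar with Wadler-style printers, the core idea is a document tree in which each group is laid out flat when it fits in the remaining width and is broken across lines otherwise. The following is a deliberately simplified sketch of that idea, not the implementation in this commit: the `Doc` and `render` names are hypothetical, and none of the extensions from the blog post are included.

```rust
/// A tiny document algebra in the style of Wadler's pretty printer.
pub enum Doc {
    Text(String),
    /// A space when the enclosing group is flat, a newline otherwise.
    Line,
    /// Increases indentation for newlines inside the nested document.
    Nest(usize, Box<Doc>),
    Concat(Vec<Doc>),
    /// A unit that is printed on one line if it fits.
    Group(Box<Doc>),
}

#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Flat,
    Break,
}

/// Width of a document when rendered entirely on one line.
fn flat_width(doc: &Doc) -> usize {
    match doc {
        Doc::Text(s) => s.len(),
        Doc::Line => 1,
        Doc::Nest(_, d) | Doc::Group(d) => flat_width(d),
        Doc::Concat(ds) => ds.iter().map(flat_width).sum(),
    }
}

/// Renders `doc`, keeping a group flat when it fits within `width` columns.
pub fn render(doc: &Doc, width: usize) -> String {
    let mut out = String::new();
    let mut col = 0usize;
    // Work stack of (indentation, layout mode, document) triples.
    let mut stack = vec![(0usize, Mode::Break, doc)];
    while let Some((indent, mode, doc)) = stack.pop() {
        match doc {
            Doc::Text(s) => {
                out.push_str(s);
                col += s.len();
            }
            Doc::Line if mode == Mode::Flat => {
                out.push(' ');
                col += 1;
            }
            Doc::Line => {
                out.push('\n');
                out.push_str(&" ".repeat(indent));
                col = indent;
            }
            Doc::Nest(n, d) => stack.push((indent + n, mode, &**d)),
            // Push children in reverse so the first child is popped first.
            Doc::Concat(ds) => stack.extend(ds.iter().rev().map(|d| (indent, mode, d))),
            Doc::Group(d) => {
                let fits = mode == Mode::Flat || col + flat_width(d) <= width;
                let mode = if fits { Mode::Flat } else { Mode::Break };
                stack.push((indent, mode, &**d));
            }
        }
    }
    out
}
```

The same document therefore renders as `begin push.1 end` at a generous width and as an indented three-line block at a narrow one, without the caller deciding where to break.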
This commit adds the implementations of the AST types which are new: * `Form`, represents items which are valid at the top-level of a MASM module. The parser produces a vector of `Form` when parsing a module, which is then later translated into a `Module` (coming in a later commit) during semantic analysis. After semantic analysis, this type is never used again. * `Block`, represents a block in Miden Assembly, i.e. a flat sequence of operations. These are akin to "regions" in compiler parlance, a subtle extension of basic blocks that allows instructions to have nested regions/blocks, whereas strict basic blocks in a typical SSA compiler do not permit nesting in this way. Since we have structured control flow operations, our blocks have region-like semantics. * `Op`, represents the full MASM instruction set, unlike `Instruction` which represents the subset without control flow ops. * `Constant` and `ConstantExpr`, which represent the subset of the syntax for constant expressions and definitions. Unlike the previous parser and AST, we do not evaluate constants during parsing - except to do infallible constant folding where possible - but instead do it later during semantic analysis when we have the full set of constant definitions on hand. This lets constant definitions appear anywhere in the source file, and in any order as long as there are no cyclic dependencies. * `Immediate<T>`, which represents instruction immediates generically, and in a form that supports the superposition of literal values and constant identifiers anywhere that immediates are allowed. These are then resolved to concrete values during semantic analysis. Immediates are thus represented as `Immediate<T>` in the AST universally, except in a small number of cases where we may only want to allow literals. * `Ident`, which represents a cheaply-clonable identifier in the source code (not quite interned, but close in many cases). 
When parsing a string into an `Ident`, it imposes the general set of validation rules which apply to bare (unquoted) identifiers, such as those used for module names, or import aliases. An `Ident` can be constructed without enforcing those rules, such as the case for `ProcedureName`, which uses `Ident` internally, but enforces a looser set of rules so as to support quoted identifiers in the source code. `Ident` is used anywhere where an identifier is represented in the syntax tree. NOTE: These modules are disconnected from the module hierarchy in this commit, and may reference types that are not listed, or types which have familiar names but which will have new implementations in later commits. Please keep that in mind during review.
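The "superposition of literal values and constant identifiers" behind `Immediate<T>` can be pictured as a two-variant enum that is resolved once all constant definitions are known. The sketch below is an assumption-laden illustration: the variant names, the `resolve` function, and the `ResolveError` type are hypothetical, and detection of cyclic constant definitions is not modeled.

```rust
use std::collections::HashMap;

/// An instruction immediate: either a literal or a reference to a named constant.
#[derive(Debug, Clone, PartialEq)]
pub enum Immediate<T> {
    Value(T),
    Constant(String),
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    UndefinedConstant(String),
}

/// Resolves an immediate to a concrete value, as semantic analysis would do
/// once the full set of constant definitions is on hand.
pub fn resolve<T: Copy>(
    imm: &Immediate<T>,
    constants: &HashMap<String, T>,
) -> Result<T, ResolveError> {
    match imm {
        Immediate::Value(value) => Ok(*value),
        Immediate::Constant(name) => constants
            .get(name)
            .copied()
            .ok_or_else(|| ResolveError::UndefinedConstant(name.clone())),
    }
}
```

Because resolution happens after parsing, a constant may be referenced before the line that defines it, which matches the ordering freedom described above.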
This commit introduces the `Visit` and `VisitMut` traits and associated helpers, which can be used to succinctly express analysis and rewrite passes on the Miden Assembly syntax tree. No such passes are implemented in this commit, but will be defined in subsequent commits.
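As a rough picture of how a `Visit`-style trait keeps analysis passes succinct, the sketch below uses a drastically simplified `Op` tree. The method names, the `walk_op` helper, and the `CountInsts` pass are assumptions for illustration and are not the API added by this commit.

```rust
/// Simplified stand-in for the syntax tree: instructions plus structured control flow.
pub enum Op {
    Inst(&'static str),
    If { then_blk: Vec<Op>, else_blk: Vec<Op> },
    While(Vec<Op>),
}

/// A visitor with default methods that simply recurse; a pass overrides
/// only the hooks it cares about.
pub trait Visit {
    fn visit_inst(&mut self, _name: &str) {}

    fn visit_op(&mut self, op: &Op) {
        walk_op(self, op);
    }
}

/// Default traversal, callable from overridden methods to continue recursing.
pub fn walk_op<V: Visit + ?Sized>(visitor: &mut V, op: &Op) {
    match op {
        Op::Inst(name) => visitor.visit_inst(name),
        Op::If { then_blk, else_blk } => {
            for op in then_blk.iter().chain(else_blk) {
                visitor.visit_op(op);
            }
        }
        Op::While(body) => {
            for op in body {
                visitor.visit_op(op);
            }
        }
    }
}

/// Example pass: counts instructions at any nesting depth.
#[derive(Default)]
pub struct CountInsts {
    pub count: usize,
}

impl Visit for CountInsts {
    fn visit_inst(&mut self, _name: &str) {
        self.count += 1;
    }
}
```

A rewrite pass would follow the same pattern with `&mut Op` arguments, which is presumably the role of `VisitMut`.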
This commit is the first in a sequence of commits that represent the refactoring of the `assembly` crate to use the new parser, etc., to introduce the remaining AST changes, and then propagate those changes in addition to refactoring parts of the compilation pipeline that can take advantage of new features and analysis that were previously not available. This commit specifically refactors/rewrites the set of types which represent various details about procedures and procedure "aliases", i.e. re-exported procedures. There are some new types implemented as well, to better represent the specificity of a particular procedure identifier, and to build on other types in a more structured fashion. NOTE: This commit references things which are not yet implemented in the source tree, this is intentional so as to let you focus on this set of related changes abstractly, and then be able to review other later changes with this context in mind.
This commit builds on changes to the procedure types to represent a richer set of targets, with varying degrees of specificity: * The `MastRoot` variant remains unchanged, but gains a source span for use later during compilation. * The `ProcedureName` variant represents a local name. * The `ProcedurePath` variant represents an unresolved projection of an imported module/function. It depends on the current module context to resolve. * The `AbsoluteProcedurePath` variant represents a resolved projection of an imported module/function. This type is used when we have resolved a `ProcedurePath` to an imported module, and thus know the absolute path of the imported function. However, the distinction between an absolute path which is "fully-resolved", i.e. not an alias, and an absolute path which is "partially-resolved", i.e. possibly-aliased, is not represented here. Instead, that distinction is implicit, depending on the phase of compilation we are in. This will become clearer in later commits. Later, this type will be used to represent the targets of any instruction which references a callable (name or MAST root).
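The four variants above, and the step that turns a `ProcedurePath` into an `AbsoluteProcedurePath`, can be sketched as follows. The variant names come from the description; the enum name `Target`, the field layout, the `resolve` function, and the use of a plain alias-to-path map are assumptions, and source spans are omitted for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical layout of the target type; the real variants also carry spans.
#[derive(Debug, Clone, PartialEq)]
pub enum Target {
    /// A procedure identified directly by its digest.
    MastRoot([u8; 32]),
    /// A name local to the current module.
    ProcedureName(String),
    /// `alias::name`, where `alias` must be looked up in the module's imports.
    ProcedurePath { module: String, name: String },
    /// The same reference after the import alias has been resolved.
    AbsoluteProcedurePath { path: String, name: String },
}

/// Resolves import-relative paths against the current module's imports
/// (alias -> absolute module path). Other variants pass through unchanged.
pub fn resolve(target: Target, imports: &HashMap<String, String>) -> Option<Target> {
    match target {
        Target::ProcedurePath { module, name } => {
            let path = imports.get(&module)?.clone();
            Some(Target::AbsoluteProcedurePath { path, name })
        }
        other => Some(other),
    }
}
```

Note that, as the text says, nothing in the type records whether an absolute path still points at an alias; that has to be known from the compilation phase.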
This commit refactors the `LibraryPath` and `LibraryNamespace` types to build on the `Ident` type, support `Spanned`, and to provide better building blocks for other types, such as `FullyQualifiedProcedureName`. In addition, the structure of the `library` namespace is cleaned up, and the `MaslLibrary` type has its internals rewritten to make use of the new parsing infrastructure. The `Library` trait was refactored to remove the associated `ModuleIter` type, as many forms of iterators have "unnameable" types which cannot be easily expressed when defining a trait implementation. Instead, we make use of RPIT (return-position impl trait) to achieve the same goal, while retaining the ability to impose some useful constraints on the type.
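The trade described above, dropping an associated iterator type in favor of return-position `impl Trait`, looks roughly like this. It is a sketch only: the method name `module_paths`, the `VecLibrary` type, and the choice of `ExactSizeIterator` as the imposed constraint are assumptions, and `impl Trait` in trait methods requires Rust 1.75 or newer.

```rust
/// Hypothetical stand-in for a parsed module.
pub struct Module {
    pub path: String,
}

pub trait Library {
    /// No associated `ModuleIter` type is needed, yet callers can still
    /// rely on the `ExactSizeIterator` bound.
    fn module_paths(&self) -> impl ExactSizeIterator<Item = &str>;
}

pub struct VecLibrary {
    modules: Vec<Module>,
}

impl VecLibrary {
    pub fn new(paths: &[&str]) -> Self {
        let modules = paths.iter().map(|p| Module { path: p.to_string() }).collect();
        Self { modules }
    }
}

impl Library for VecLibrary {
    fn module_paths(&self) -> impl ExactSizeIterator<Item = &str> {
        // The concrete type is a `Map` over a closure, which cannot be named
        // in an associated type without boxing.
        self.modules.iter().map(|m| m.path.as_str())
    }
}
```

The implementer is free to change the iterator adaptor chain later without touching the trait definition.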
This commit refactors the `Instruction` syntax tree type in the following ways: * Remove the various `Call*`, `Exec*`, etc. instruction variants in favor of `Call`, `Exec`, `SysCall`, and `ProcRef`, all of which now take an `InvocationTarget`. * Replace all immediate values with `Immediate<T>` * Introduce wrapper types where necessary to support the `Serializable` and `Deserializable` traits, and to shield us from breaking upstream changes to those types (to some degree). See `SignatureKind` and `DebugOptions` specifically. * Give the `AdviceInjectorNode`, `DebugOptions`, and other enumerated types explicit physical representation and discriminant values, so that we can safely serialize/deserialize the discriminant tags. * Allow expressing a slightly wider range of variation in the `DebugOptions` syntax, which is then converted to the more explicit representation during semantic analysis. * Make use of the new `PrettyPrint` trait for formatting
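Giving an enumerated type an explicit physical representation and discriminant values, so that its tag can be serialized safely, might look like the sketch below. The `DebugKind` enum and its variants are hypothetical; they only stand in for types such as `DebugOptions`.

```rust
/// Hypothetical enum with a fixed one-byte representation. The explicit
/// discriminants mean that reordering variants cannot change the wire format.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DebugKind {
    StackAll = 0,
    StackTop = 1,
    MemAll = 2,
}

impl DebugKind {
    /// The tag written to the serialized form.
    pub fn tag(self) -> u8 {
        self as u8
    }

    /// Decodes a tag, rejecting values that do not name a variant.
    pub fn from_tag(tag: u8) -> Option<Self> {
        match tag {
            0 => Some(Self::StackAll),
            1 => Some(Self::StackTop),
            2 => Some(Self::MemAll),
            _ => None,
        }
    }
}
```

Decoding through an explicit `match` rather than a transmute keeps malformed input from producing an invalid enum value.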
This is the last commit which will contain changes to the abstract syntax tree, and is the one which answers the question: "what does a parsed module look like?". This commit introduces a new `Module` type, which supersedes the previous `ProgramAst`, `ModuleAst` and `Module` types, providing all of the information you might want to know about a given module, as well as supporting the core functionality necessary to parse, serialize, and pretty print modules. A `Module` has a `ModuleKind`, which identifies what type of module it represents: an executable (e.g. what was previously a `ProgramAst`), a library (e.g. what was previously a `ModuleAst`), or a kernel (which previously had no specific representation). The `ModuleKind` dictates the semantics of the module, but in practice all the "kinds" of modules are virtually identical except for these slight semantic differences, which is why they have been unified here. These semantics are validated by the semantic analysis pass, which catches the set of things you are not allowed to do in certain types of modules, e.g. exporting from an executable module, syscall/call in a kernel module, `begin` in a library module. Lastly, this commit removes the old `ModuleImports` type, which is superseded by these changes (as well as subsequent ones in other parts of the assembler). In its place is a new `Import` type which represents all of the details about a specific module import. Each `Module` has a set of `Import`s associated with it, which it uses to resolve names, as well as determine syntax-level inter-module dependencies.
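The kind-specific restrictions listed above can be pictured as a small table over `ModuleKind`. In this sketch only `ModuleKind` and the three example restrictions come from the description; the `Item` enum and the `is_allowed` function are hypothetical, and any rule not named in the text (for instance whether `begin` is valid in a kernel) is deliberately left unmodeled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModuleKind {
    Executable,
    Library,
    Kernel,
}

/// Hypothetical classification of the constructs the analysis checks.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Item {
    Export,
    Begin,
    SysCall,
    Call,
}

/// Encodes only the three restrictions given as examples in the text.
pub fn is_allowed(kind: ModuleKind, item: Item) -> bool {
    !matches!(
        (kind, item),
        (ModuleKind::Executable, Item::Export)
            | (ModuleKind::Kernel, Item::SysCall | Item::Call)
            | (ModuleKind::Library, Item::Begin)
    )
}
```

Keeping the differences in one predicate is what lets a single `Module` type serve all three kinds.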
This commit allows enabling/disabling the storage of debug info and source code in serialized modules and various AST structures.
@phklive - could you take a look at the changelog job? I think it is failing, but for some extraneous reason.
Looking into it.
* refactor: remove MerkleTreeNode trait * chore: added comments and section separators * fix: remove implicit serialization of MastNodeId
* add `Assembler::assemble_library()` * changelog * fix `no_std` * add test * `MastForestStore::procedure_digests()` * assemble_graph: fix comment * fix weird formatting * remove `Assembler::assemble_graph()` * remove `ModuleGraph.topo` * Rename `Assembler::assemble` to `assemble_program` * Remove `ModuleInfo` and `ProcedureInfo` * fix docs * fix docs
* Introduce `ExternalNode` * Replace `Assembler.node_id_by_digest` map * add TODOP * Add `Host::get_mast_forest` * Move kernel and entrypoint out of `MastForest` * Remove ProgramError * docs * cleanup Program constructors * fix docs * Make `Program.kernel` an `Arc` * fix executable * invoke_mast_root: fix external node creation logic * add failing test * don't make root in `combine_mast_node_ids` and `compile_body` * fix External docs * fmt * fix `entrypoint` doc * Rename `Program::new_with_kernel()` * Document `MastForestStore` and `MemMastForestStore` * fix syscall * execute_* functions: use `MastForest` * `Program`: Remove `Arc` around kernel * remove `Arc` around `MastForest` in `Program` * Return error on malformed host * Simplify `DefaultHost` * `MastForest::add_node()`: add docs * fmt * add failing `duplicate_procedure()` test * Introduce `MastForestBuilder` * Rename `mod tests` -> `testing` * add `duplicate_node()` test * changelog * Program: use `assert!()` instead of `debug_assert!()` * `MastForest::make_root()`: add assert * fmt * Serialization for `MastNodeId` * serialization for MastNode variants except basic block * MastForest serialization scaffolding * define `MastNodeType` constructor from `MastNode` * test join serialization of MastNodeType * `MastNodeType` serialization of split * Revert "serialization for MastNode variants except basic block" This reverts commit efc24fd. 
* add TODOP * impl Deserializable for `MastForest` (scaffold) * mast_node_to_info() scaffold * try_info_to_mast_node scaffold * Rename `EncodedMastNodeType` * add info module * encode operations into `data` field * decode operations * implement `BasicBlockNode::num_operations_and_decorators()` * OperationOrDecoratorIterator * basic block node: move tests in new file * operation_or_decorator_iterator test * Implement `Operation::with_opcode_and_data()` * encode decorators * implement `decode_decorator()` * fix exec invocation * no else blk special case * add procedure roots comment * implement forgotten `todo!()` * `serialize_deserialize_all_nodes` test * `decode_operations_and_decorators`: fix bit check * confirm_assumptions test scaffold * minor adjustments * Introduce `StringTableBuilder` * naming * test confirm_operation_and_decorator_structure * remove TODOP * remove unused `MastNode::new_dyncall()` * Remove `Error` type * add TODOP * complete test `serialize_deserialize_all_nodes` * check digest on deserialization * remove TODOP * safely decode mast node ids * use method syntax in `MastNodeType` decoding * TODOPs * rewrite <= expression * new `MastNodeType` * implement `Deserializable` for `MastNodeType` * migrate tests to new * Use new MastNodeType * rename string_table_builder_ module * implement `BasicBlockDataBuilder` * add TODOP * BasicBlockDataDecoder * use `BasicBlockDataDecoder` * add headers * add `MastNodeInfo` method * return `Result` instead of `Option` * Remove TODOP * docs * chore: add section separators and fix typos * refactor: change type of the error code of u32assert2 from Felt to u32 (#1382) * impl `Serializable` for `Operation` * impl Deserializable for `Operation` * `StringTableBuilder`: switch to using blake 3 * `EncodedDecoratorVariant`: moved discriminant bit logic to `discriminant()` method * Remove basic block offset * Cargo: don't specify patch versions * make deserialization more efficient * num-traits and num-derive: set 
default-features false * Remove `OperationData` * `StringRef`: move string length to data buffer * store offset in block * Use `source.read_u32/u64()` * Update `MastNodeInfo` docstring * rename arguments in `encode_u32_pair` * Use basic block offset in deserialization * `BasicBlockDataDecoder`: use `ByteReader::read_u16/32()` methods * `StringTableBuilder`: fix comment * Remove `StringRef` in favor of `DataOffset` * cleanup `MastNodeType` serialization * derive `Copy` for `MastNodeType` * `MastNodeType` tests * add `MastNodeType` tests * use assert * fix asserts * `ModuleGraph::recompute()` reverse edge caller/callee * Implement `Assembler::assemble_library()` * changelog * fix docs * Introduce `CompiledFQDN` * Introduce `WrapperModule` to module graph * split `ModuleGraph::add_module()` * fix compile errors from API changes * fix debug structs * fix `Assembler::get_module_exports()` * fix `process_graph_worklist` * fix procedure * fix `NameResolver` * move `CompiledModule` * `CompiledLibrary::into_compiled_modules` * `Assembler::add_compiled_library()` * changelog * fix `assemble_library()` signature * test `compiled_library()` * nits * register mast roots in `Assembler::add_compiled_library()` * fix resolve * `ModuleGraph::topological_sort_from_root`: only include AST procedures * `Assembler::resolve_target()`: look for digest in module graph first * remove `AssemblyContext::allow_phantom_calls` flag * remove TODOP * `ResolvedProcedure` is no longer `Spanned` * improve test * remove TODOP * `CompiledProcedure` -> `ProcedureInfo` * Document `CompiledLibrary` * Rename `CompiledModule` -> `ModuleInfo` * Refactor `ModuleInfo` * `ModuleWrapper` -> `WrappedModule` * Document `PendingModuleWrapper` * document `Assembler::assemble_library()` * fix TODOP * rename * fix test * cleanup `ModuleGraph::topological_sort_from_root` * fix CI * re-implement `Spanned` for `ResolvedProcedure` * reintroduce proper error message * remove unused methods * Remove all `allow(unused)` 
methods * Document `unwrap_ast()` call * `NameResolver`: remove use of `unwrap_ast()` * Document or remove all calls to `WrappedModule.unwrap_ast()` * rename `PendingWrappedModule` * Add `ModuleGraph::add_compiled_modules()` * Remove `ModuleGraph::add_module_info()` * refactor: remove Assembler::compile_program() internal method * refactor: remove Assembler::assemble_with_options() internal method --------- Co-authored-by: Bobbin Threadbare <bobbinth@protonmail.com> Co-authored-by: Andrey Khmuro <andrey@polygon.technology>
* feat(assembler): implement serialization for `CompiledLibrary` and `KernelLibrary` * feat(assembly): Add `CompiledLibrary::write_to_dir()` and `KernelLibrary::write_to_dir()` * feat(assembler): Remove `MaslLibrary` * remove docs building * fix(assembly): ensure aliases are resolved to roots This commit fixes the issue where re-exported procedures (aliases) were not being handled properly during assembly. It now explicitly represents them in the call graph, so that topological ordering of the call graph will ensure that we only visit the aliases once the aliased procedure has been compiled/visited. As part of the solution implemented here, some refinements to `AliasTarget` were made, in order to explicitly represent whether the target is absolute or not (just like `InvocationTarget`). Additionally, to avoid confusion, `FullyQualifiedProcedureName` was renamed to `QualifiedProcedureName`, so as to make it clear that just because the path is qualified, does not mean it is absolute (and thus "fully" qualified). Some conveniences were added to `LibraryNamespace`, `LibraryPath`, and `AliasTarget` to make certain operations/conversions more ergonomic and expressive. * feat: add pretty-print helper for lists of values * feat: support assembling with compiled libraries This commit refactors `CompiledLibrary` a bit, to remove some unnecessary restrictions leftover from the old MASL libraries: * A `CompiledLibrary` no longer has a name, but it has a content digest obtained by lexicographically ordering the exported MAST roots of the library, and merging the hashes in order. * As a consequence of being unnamed/having no namespace, a `CompiledLibrary` can now consist of procedures from many modules _and_ many namespaces. Any limitation we impose on top of that can be done via wrapper types, like how `KernelLibrary` is implemented. * Support for re-exported procedures in a `CompiledLibrary` is implemented. 
It is assumed that all required libraries will be provided to the `Host` when executing a program. * Some ergonomic improvements to APIs which accept libraries or sets of modules, to allow a greater variety of ways you can pass them. * fix(assembly): address conflicting procedure definitions bug Previously, we assumed that two procedures with the same MAST root, but differing numbers of locals, was a bug in the assembler. However, this is not the case, as I will elaborate on below. If you compile a program like so:

```masm
use.std::u64

begin
    exec.u64::checked_and
end
```

The resulting MAST would look something like:

```mast
begin
    external.0x....
end
```

This MAST will have the exact same MAST root as `std::u64::checked_and`, because `external` nodes have the same digest as the node they refer to. Now, if the exec'd procedure has the same number of locals as the caller, this is presumed to be a "compatible" procedure, meaning it is fine to let both procedures refer to the same MAST. However, when the number of procedure locals _differs_, we were raising a conflicting definition error, because it was assumed that due to the instructions added when procedure locals are present, two procedures with differing numbers of locals _could not_ have the same root by definition. This is incorrect; let me illustrate:

```masm
export.foo.2
    ...
end

use.foo

proc.bar.3
    exec.foo::foo
end

begin
    exec.foo::foo
end
```

Assume that `foo.masm` is first compiled to a `CompiledLibrary`, which is then added to the assembler when assembling an executable program from `bar.masm`. Also, recall that `begin .. end` blocks are effectively an exported procedure definition, with zero locals, that has a special name - but in all other ways it is just a normal procedure. The above program is perfectly legal, but we would raise an error during assembly of the program due to the `begin .. end` block compiling to an `external` node referencing the MAST root of `foo::foo`, which has a non-zero number of procedure locals. The `bar` procedure is there to illustrate that even though it too simply "wraps" the `foo::foo` procedure, it has a non-zero number of procedure locals, and thus cannot ever have the same MAST root as a wrapped procedure with a non-zero number of locals, due to the presence of locals changing the root of the wrapping procedure. A check has been kept around that ensures we catch if ever there are two procedures with non-zero counts of procedure locals, with the same MAST root. * chore: update changelog --------- Co-authored-by: Bobbin Threadbare <bobbinth@protonmail.com> Co-authored-by: Paul Schoenfelder <paulschoenfelder@fastmail.com>
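The content digest of an unnamed `CompiledLibrary`, described earlier in this commit series as lexicographically ordering the exported MAST roots and merging the hashes in order, can be sketched as a sort followed by a fold. Everything concrete here is an assumption: digests are shortened to 8 bytes, `merge` uses the standard library's `DefaultHasher` purely as a stand-in for the real hash function, and a left-to-right fold is only one reading of "merging in order".

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Shortened stand-in for a MAST root digest.
pub type Digest = [u8; 8];

/// Stand-in for the real two-to-one hash; NOT cryptographically secure.
fn merge(a: Digest, b: Digest) -> Digest {
    let mut hasher = DefaultHasher::new();
    a.hash(&mut hasher);
    b.hash(&mut hasher);
    hasher.finish().to_be_bytes()
}

/// Sorts the roots lexicographically, then folds them left to right.
/// Returns `None` for a library with no exports.
pub fn content_digest(roots: &[Digest]) -> Option<Digest> {
    let mut sorted = roots.to_vec();
    sorted.sort(); // byte arrays compare lexicographically
    sorted.into_iter().reduce(merge)
}
```

Sorting first is what makes the digest independent of the order in which modules or procedures were added to the library.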
* feat(assembly): track source locations in debug mode This commit introduces an initial, minimal implementation of source mapping for Miden Assembly instructions assembled to MAST. This is done by piggy-backing on the existing `AssemblyOp` decorator, extending it with an optional `SourceLocation`, which is only emitted to the serialized MAST when debug mode is enabled _and_ there is source location information available. A `SourceLocation` consists of a string representing the file path of the source (as known to the assembler), and a span, whose start and end bounds are byte indices in the original source file, representing a contiguous range of valid UTF-8 bytes corresponding to the relevant source code from which the given Miden Assembly operation was derived. This will be pretty much 1:1 when the source is itself MASM code, but when the MASM in question is lowered from a high-level language, such as Rust, the source span may correspond to many instructions, and the connection between a given span and a specific MASM instruction may be non-obvious at times - this is to be expected though, and is simply a result of translating a high-level language to a low-level one. The serialization used here is not particularly efficient, but isn't horrible. In the future though, we may want to have a dedicated table representation for source locations, so that streams of instructions with a shared location can simply reference a single instance of the source location, rather than duplicating the information for every instruction. Additionally, we may want to allow including the actual source file content in a library containing debug information, rather than requiring end users to supply a path to a directory containing the correct files. One issue to be aware of is that this implementation uses absolute paths, but we may want to make paths relative when possible, so as to make the paths portable. 
Technically, this is in the hands of users of the assembler, by setting the source file paths to relative paths - but in practice it would be quite awkward to force end users to handle this. Additionally, there are times where an absolute path may be desirable (such as when referencing well-known global paths, e.g. `/usr/local/etc/..`). In any case, that is not a problem we attempt to solve in this commit. * Remove `SingleThreadedSourceManager` * Rename `MultiThreadedSourceManager` to `DefaultSourceManager` * Replace all uses of `SingleThreadedSourceManager` with `DefaultSourceManager`, remove all clippy attributes used to disable the lint about wrapping non-Send objects in an Arc * Give `ProgramFile` two constructors, one for the common case which constructs a default source manager, the other for specific use cases where the caller owns the source manager and needs to construct the `ProgramFile` with it
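A `SourceLocation` as described in the source-tracking commit above (a file path plus byte-index bounds) is enough for a debugger to report a line and column once it has the file contents. The sketch below illustrates that; the field names and the `line_col` helper are assumptions, not the shape actually serialized into MAST.

```rust
/// A file path (as known to the assembler) plus a byte range in that file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SourceLocation {
    pub path: String,
    pub start: u32,
    pub end: u32,
}

impl SourceLocation {
    /// Hypothetical helper: 1-based line and byte column of the span start.
    /// Returns `None` if `start` is out of bounds or not on a UTF-8 boundary.
    pub fn line_col(&self, source: &str) -> Option<(usize, usize)> {
        let prefix = source.get(..self.start as usize)?;
        let line = prefix.matches('\n').count() + 1;
        // Byte offset at which the current line begins.
        let line_start = prefix.rfind('\n').map_or(0, |i| i + 1);
        Some((line, prefix.len() - line_start + 1))
    }
}
```

Since the location stores only a path, a consumer still has to find the matching file contents, which is why the portability of absolute versus relative paths matters.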
* refactor: rename Assembler::add_compiled_library() into add_library() * refactor: move ModuleInfo into a separate file * refactor: rename CompiledLibrary into Library
This is the tracking PR for the v0.10.0 release.