AppThreat/wasm-tools

wasm-tools

wasm-tools is a pure-Python WebAssembly parser and disassembler. It is designed around binary decoding and callback-based visitors rather than a large object model. The project currently focuses on practical inspection of .wasm binaries, objdump-style disassembly, and programmatic extraction of decoded instructions for integration into other tooling.

AI-DECLARATION: auto

What this project is for

This repository is useful when you need a lightweight WebAssembly parser that can:

  • inspect a binary module without depending on native parsing libraries,
  • produce readable instruction traces for analyst review,
  • expose structured instruction data as Python dictionaries or JSON,
  • behave safely on malformed or truncated input by reporting parser errors through callbacks instead of crashing the caller.

For a security engineering audience, the main value is that the code path is short and inspectable. Most behavior lives in four files:

  • wasm_tools/parser.py for binary decoding and traversal,
  • wasm_tools/opcodes.py for opcode and immediate metadata,
  • wasm_tools/visitor.py for human-readable output,
  • wasm_tools/api.py for library-first structured output.

Command-line usage

The installed console script is wasm-tools, as defined in pyproject.toml.

Disassemble a fixture module:

python -m wasm_tools.cli tests/fixtures/simple_add.wasm -d

If installed as a package, the equivalent entrypoint is:

wasm-tools tests/fixtures/simple_add.wasm -d

Current CLI flags in wasm_tools/cli.py:

  • -h, --headers — print section header table with ids, sizes, and offsets
  • -x, --details — print section contents: type signatures, imports, exports, globals, tables, memories, data segments, elements, tags, and code body summaries
  • -d, --disassemble — decode and print function body instructions
  • --json — print a minified JSON report to stdout
  • --json-out PATH — write a minified JSON report to PATH
  • --analysis-only — with --json and/or --json-out, emit only the high-level analysis object

With no flags, --details is the default.

Index notes for CLI output:

  • function/global/table/memory/tag indices are printed in module-global index space,
  • locally-defined function bodies therefore start at func[imported_function_count] when function imports are present,
  • section detail headers use entry counts (for example Function[3], Code[3], Data[1]) and DataCount prints the decoded count value.
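The index rule above follows from how WebAssembly orders its index spaces: imported functions come first, then locally defined ones. A minimal sketch of that arithmetic, using a hypothetical helper name that is not part of wasm_tools:

```python
def global_function_index(imported_function_count: int, local_index: int) -> int:
    """Map the N-th locally defined function body to its module-global index.

    Hypothetical helper for illustration; not part of the wasm_tools API.
    """
    if imported_function_count < 0 or local_index < 0:
        raise ValueError("counts and indices must be non-negative")
    # Imports occupy indices 0 .. imported_function_count - 1.
    return imported_function_count + local_index


# In a module with two function imports, the first code body is func[2].
print(f"func[{global_function_index(2, 0)}]")
```

The same offset applies to the global, table, memory, and tag index spaces, each using the import count of its own kind.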

Write a minified JSON report to a file:

wasm-tools tests/fixtures/simple_add.wasm --json-out simple_add.json

Print a minified JSON report to stdout:

wasm-tools tests/fixtures/simple_add.wasm --json

Print only the high-level analysis object to stdout:

wasm-tools tests/fixtures/wasi_capabilities.wasm --json --analysis-only

Use both JSON options together to write a file and print the same payload:

wasm-tools tests/fixtures/simple_add.wasm --json --json-out simple_add.json

Write only the analysis object to a file:

wasm-tools tests/fixtures/dos_growth_loop.wasm --json-out analysis.json --analysis-only

Library usage

Parse from a file

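from wasm_tools.api import parse_wasm_file

# Parse failures are recorded in report["errors"] rather than raised.
report = parse_wasm_file("tests/fixtures/simple_add.wasm")
print(report["module_version"])  # wasm version from the module header
print(report["function_count"])  # number of decoded function bodies
print(report["functions"][0]["instructions"])  # instructions of the first function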

Parse from bytes and emit JSON

from wasm_tools.api import parse_wasm_bytes_json

with open("tests/fixtures/unicode_names.wasm", "rb") as wasm_file:
    print(parse_wasm_bytes_json(wasm_file.read(), filename="unicode_names.wasm"))

Trust and provenance

The source code in this repository was fully generated by AI assistants, with any human edits limited to formatting or minor changes. For a technical reader, the practical implication is simple: treat the codebase as useful, but review it line by line, paying particular attention to parser behavior, test coverage, and known gaps, before depending on it in a security workflow.

The repository itself already reflects this review posture:

  • parser failures are covered by unit tests for malformed input,
  • end-to-end tests assert exact disassembly substrings,
  • CLI and JSON outputs use module-global index spaces for functions, globals, tables, memories, and tags, including imported-entity offsets.

Architecture

A detailed description of the WebAssembly binary format, the parser internals, visitor pattern, two-pass execution model, and security-relevant design decisions is in ARCHITECTURE.md.

The short version:

BinaryReader in wasm_tools/parser.py owns the binary walk. It reads the module header, iterates sections, and decodes function bodies instruction by instruction. It does not build a full AST. Instead, it emits parser events to a delegate object. The parser checks callbacks with hasattr(...) before calling them, so a visitor only needs to implement the hooks it cares about.
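The optional-hook pattern described above can be sketched in a few lines. This is a toy stand-in, not the project's BinaryReader: the class names and the hook names (begin_function_body, on_opcode, end_function_body) are assumptions chosen for illustration.

```python
class MiniReader:
    """Toy event emitter; a stand-in for the real parser, not its API."""

    def __init__(self, delegate):
        self.delegate = delegate

    def _emit(self, hook, *args):
        # Call the hook only if the delegate actually defines it.
        if hasattr(self.delegate, hook):
            getattr(self.delegate, hook)(*args)

    def run(self, opcodes):
        self._emit("begin_function_body", 0)
        for offset, name in enumerate(opcodes):
            self._emit("on_opcode", offset, name)
        self._emit("end_function_body", 0)


class OpcodeCollector:
    """Implements a single hook and silently ignores every other event."""

    def __init__(self):
        self.seen = []

    def on_opcode(self, offset, name):
        self.seen.append((offset, name))


collector = OpcodeCollector()
MiniReader(collector).run(["local.get", "local.get", "i32.add", "end"])
print(collector.seen)
```

The trade-off of this design is that a misspelled hook name fails silently, so visitors benefit from tests that assert the hook was actually invoked.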

The CLI and the JSON API both run the parse twice. The first pass collects names and type information into ObjdumpState. The second pass uses that state to produce disassembly, section details, or a structured JSON report. The shared state lives in wasm_tools/models.py.
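One reason a two-pass design helps is that names usually appear in the binary after the code that refers to them. The sketch below illustrates the idea with a toy event list; State, collect_names, and render are hypothetical names, not the contents of wasm_tools/models.py.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical shared state: filled by pass one, read by pass two."""
    function_names: dict = field(default_factory=dict)


# Toy "module" in file order: the names arrive after the calls that use them.
EVENTS = [
    ("call", 1),
    ("call", 0),
    ("name", 0, "add"),
    ("name", 1, "main"),
]


def collect_names(events, state):
    """Pass one: record function names only."""
    for event in events:
        if event[0] == "name":
            state.function_names[event[1]] = event[2]


def render(events, state):
    """Pass two: render calls, resolving names gathered by pass one."""
    lines = []
    for event in events:
        if event[0] == "call":
            index = event[1]
            name = state.function_names.get(index, f"func[{index}]")
            lines.append(f"call {index} <{name}>")
    return lines


state = State()
collect_names(EVENTS, state)
print(render(EVENTS, state))
```

Rendering with an empty State shows what a single pass would produce: every call falls back to a func[N] placeholder.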

wasm_tools/opcodes.py defines the mapping from (prefix, opcode) to (mnemonic, immediate type). The parser uses this table inside BinaryReader.read_instructions() to decide how many bytes to consume. When extending the instruction set, only this table and the immediate dispatch branches in the parser need to change.
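A table of this kind might look like the following sketch. The four opcode encodings are taken from the WebAssembly binary format; the immediate-kind constants, the use of 0x00 for "no prefix", and the hexadecimal formatting of the fallback name are assumptions made here, not details confirmed by the project source.

```python
# Immediate kinds, named here for illustration only.
NONE, INDEX = "none", "index"

# (prefix, opcode) -> (mnemonic, immediate kind).
# Assumption: prefix 0x00 stands for "no prefix byte".
TABLE = {
    (0x00, 0x01): ("nop", NONE),
    (0x00, 0x20): ("local.get", INDEX),
    (0x00, 0x6A): ("i32.add", NONE),
    (0xFC, 0x09): ("data.drop", INDEX),
}


def lookup(prefix, opcode):
    """Return (mnemonic, immediate kind), with a fallback for unmapped opcodes."""
    fallback = (f"unknown_{prefix:02x}_{opcode:02x}", NONE)
    return TABLE.get((prefix, opcode), fallback)


print(lookup(0x00, 0x6A))
print(lookup(0xFE, 0x7F))
```

Because the immediate kind tells the decoder how many bytes to consume, an unmapped opcode with real immediates will desynchronize the rest of the function body; the fallback name only makes that visible.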

Relationship to the specification

The repository includes a local specification snapshot under specification/wasm-latest/. The most relevant files for current implementation work are:

  • specification/wasm-latest/5.3-binary.instructions.spectec
  • specification/wasm-latest/5.4-binary.modules.spectec
  • specification/wasm-latest/6.3-text.instructions.spectec

These files are useful when validating opcode encodings, section layouts, and text-to-binary expectations. The current parser is not a full implementation of everything described by the latest specification snapshot. It implements a practical subset and falls back to unknown_<prefix>_<opcode> names for unsupported instructions.

Spec coverage matrix

This matrix is a planning aid, not a certification statement. It reflects what the current codebase does today based on wasm_tools/parser.py, wasm_tools/opcodes.py, wasm_tools/visitor.py, wasm_tools/api.py, and the current test suite.

Status terms used below:

  • Tested: implemented and covered by the current automated tests.
  • Partial: implemented in a limited way, or traversed without full semantic decoding.
  • Known gap: explicitly tracked as missing behavior in tests.
  • Not implemented or unverified: no support or no current evidence in tests.

Module and section coverage

| Area | Spec reference | Status | Current behavior and evidence |
|---|---|---|---|
| Module header and version | 5.4-binary.modules.spectec | Tested | Validates magic and version in BinaryReader._do_read_module(). Error cases for short files and bad magic are covered in tests/test_parser.py. |
| Section framing and bounds checks | 5.4-binary.modules.spectec | Tested | Reads section id and size, checks file bounds, and reports errors through on_error. Covered by truncated section tests. |
| Custom sections, generic | 5.4-binary.modules.spectec | Partial | Parser reads custom section name and skips unknown payloads. The JSON API records the custom section name, but does not decode arbitrary custom payloads. |
| Custom name section for function and local names | 5.4-binary.modules.spectec | Tested | Subsections 1 (function names) and 2 (local names) are decoded and stored in ObjdumpState. Names appear in disassembly and JSON reports. Covered by custom_name.wasm and unicode_names.wat. |
| Type section | 5.4-binary.modules.spectec | Tested | Full function type decoding with GC subtype / rec-type wrappers. Params and results stored as FuncType in ObjdumpState.types and surfaced in --details, JSON types[], and tests/test_details.py. |
| Import section | 5.4-binary.modules.spectec | Tested | All five import kinds (func, table, memory, global, tag) fully decoded into ImportEntry with kind-specific fields. Exposed in --details output, JSON imports[], and covered by tests/test_details.py. |
| Function section | 5.4-binary.modules.spectec | Tested | Function signature indices decoded and stored via on_function. Used in prepass and JSON reports. |
| Table section | 5.4-binary.modules.spectec | Tested | Reference type and limits decoded into TableEntry. Exposed in --details and JSON tables[]. |
| Memory section | 5.4-binary.modules.spectec | Tested | Limits decoded (i32 and i64 variants, including shared flag combinations) into MemoryEntry. Exposed in --details and JSON memories[]. |
| Global section | 5.4-binary.modules.spectec | Tested | Value type, mutability, and constant init expression decoded into GlobalEntry. Exposed in --details and JSON globals[]. |
| Export section | 5.4-binary.modules.spectec | Tested | All five export kinds decoded into ExportEntry. Exposed in --details and JSON exports[]. |
| Start section | 5.4-binary.modules.spectec | Tested | Start function index stored and surfaced in JSON start_function field and --details output. |
| Element section | 5.4-binary.modules.spectec | Tested | All 8 element segment variants decoded, with mode, ref type, table index, offset expression, and function index list stored in ElementEntry. |
| Code section and function bodies | 5.4-binary.modules.spectec | Tested | Local declaration headers are consumed, instructions are decoded, and end-of-body tracking is implemented. Covered heavily by tests/test_e2e.py and tests/test_json_api.py. |
| Data section | 5.4-binary.modules.spectec | Tested | Active (mem 0), passive, and active (mem x) variants decoded into DataEntry. Exposed in --details and JSON data_segments[]. Covered by bulk_memory.wat and memory_data.wat. |
| Data count section | 5.4-binary.modules.spectec | Tested | Data count is decoded and forwarded to delegates via on_data_count. |
| Tag section | 5.4-binary.modules.spectec | Tested | Tag entries decoded into TagEntry with type index. Exposed in --details and JSON tags[]. |

Instruction coverage

| Area | Spec reference | Status | Current behavior and evidence |
|---|---|---|---|
| Basic parametric instructions (unreachable, nop, drop, select) | 5.3-binary.instructions.spectec | Tested | All mapped explicitly in OPCODES. Typed select with result type vector is handled via SELECT_T immediate dispatch. Covered by fixture disassembly tests. |
| Block/control structure (block, loop, if, else, end) | 5.3-binary.instructions.spectec | Tested | Block signatures and expression depth tracking are implemented in read_instructions(). Covered by control_flow.wat and complex_flow.wat. |
| Branching (br, br_if, br_table, return) | 5.3-binary.instructions.spectec | Tested | Core branch immediates are decoded. br_table target list decoded and printed. Covered by tests/test_e2e.py and adversarial_ops.wat. |
| Direct and indirect calls (call, call_indirect) | 5.3-binary.instructions.spectec | Tested | Direct index operands and call_indirect signature/table operands decoded. Covered by call_indirect.wat and complex_flow.wat. |
| Return-call extensions (return_call, return_call_indirect, call_ref, return_call_ref) | 5.3-binary.instructions.spectec | Tested | All four opcodes are in OPCODES with correct immediate types. Covered by tests/test_extended_ops.py and fixture-level call_ref disassembly in call_refs.wat. |
| Variable access (local.get/set/tee, global.get/set) | 5.3-binary.instructions.spectec | Tested | Index immediates decoded and printed. Covered by arithmetic, globals, and control-flow fixtures. |
| Memory load/store with memarg | 5.3-binary.instructions.spectec | Tested | All scalar load/store instructions use the MEMARG decoder path, including memory64 large-offset fixtures. Covered by memory_data.wat, complex_flow.wat, and load64.wat. |
| Integer and float constants | 5.3-binary.instructions.spectec | Tested | i32.const, i64.const, f32.const, and f64.const immediates decoded. Edge signed immediates covered in parser tests and adversarial_ops.wat. |
| Scalar numeric arithmetic and comparisons | 5.3-binary.instructions.spectec | Tested | Full i32, i64, f32, f64 arithmetic, comparison, and conversion opcode sets are in OPCODES. Sign-extension opcodes (0xC0-0xC4) included. Covered by tests/test_extended_ops.py. |
| Reference type instructions (ref.null, ref.func, ref.eq, etc.) | 5.3-binary.instructions.spectec | Tested | 0xD0-0xD6 fully mapped. ref.null uses HEAP_TYPE immediate. br_on_null/br_on_non_null use INDEX. Covered by tests/test_extended_ops.py. |
| Saturating truncation (i32.trunc_sat_*, i64.trunc_sat_*) | 5.3-binary.instructions.spectec | Tested | All eight 0xFC 0-7 opcodes in OPCODES with NONE immediate. Dispatch covered by tests/test_extended_ops.py::test_dispatch_sat_trunc. |
| Bulk memory (memory.init, data.drop, memory.copy, memory.fill) | 5.3-binary.instructions.spectec | Tested | 0xFC 8-11 with correct binary operand order for memory.init. Covered by tests/test_confidence_parser.py, tests/test_e2e.py, tests/test_json_api.py. |
| Table bulk ops (table.init, elem.drop, table.copy, table.grow, table.size, table.fill) | 5.3-binary.instructions.spectec | Tested | 0xFC 12-17 fully mapped with TABLE_INIT, TABLE_COPY, and INDEX immediate types. Dispatch covered by tests/test_extended_ops.py. |
| Exception handling (throw, throw_ref, try_table) | 5.3-binary.instructions.spectec | Tested | throw (0x08), throw_ref (0x0A), and try_table (0x1F with full catch list) decoded. TRY_TABLE_BLOCK parses catch opcodes 0x00-0x03. Covered by tests/test_extended_ops.py. |
| GC / reference types (0xFB prefix, struct/array/ref ops) | 5.3-binary.instructions.spectec | Tested | All 31 0xFB 0-30 opcodes in OPCODES. BR_ON_CAST (flags + label + 2 heaptypes) fully decoded. tests/test_extended_ops.py covers table completeness and dispatch for array.len, struct.new, ref.test. |
| SIMD / vector instructions (0xFD prefix) | 5.3-binary.instructions.spectec | Tested | All standard SIMD opcodes 0-275 mapped, including relaxed SIMD. Load/store use MEMARG, v128.const uses V128_CONST (16 raw bytes), i8x16.shuffle uses V128_SHUFFLE, lane ops use LANE_IDX and MEMARG_LANE. Covered by tests/test_extended_ops.py. |
| Threads / atomics (0xFE prefix) | 5.3-binary.instructions.spectec | Tested | All atomic operations mapped. atomic.fence uses ATOMIC_FENCE (reads reserved byte). All others use MEMARG. Covered by tests/test_extended_ops.py. |
| Unknown opcode resilience | 5.3-binary.instructions.spectec | Tested | Unsupported opcodes fall back to unknown_<prefix>_<opcode> rather than crashing. Covered by tests/test_confidence_parser.py. |

Interface and analysis coverage

| Area | Status | Current behavior and evidence |
|---|---|---|
| CLI disassembly mode (-d) | Tested | Covered by tests/test_e2e.py with exact substring assertions across all fixture files. |
| CLI headers mode (--headers) | Tested | BinaryReaderObjdumpHeaders prints section id, name, size, and offset. Covered by tests/test_details.py. |
| CLI details mode (-x) | Tested | BinaryReaderObjdumpDetails prints all section contents: types, imports, exports, globals, tables, memories, data segments, elements, tags, and code bodies. Covered by tests/test_details.py. |
| JSON-friendly library API | Tested | parse_wasm_file() and related helpers return full semantic reports including types, imports, exports, globals, tables, memories, data segments, and elements. Covered in tests/test_json_api.py. |
| Non-throwing parse errors for library callers | Tested | Malformed inputs populate errors instead of forcing a traceback. Covered in parser and JSON API tests. |
| Full validation against the specification | Not implemented | The current code decodes and reports binary structure; it does not implement the validation chapters from the bundled specification snapshot. |
| Text-format parsing (.wat as input) | Not implemented | The repository consumes .wat only through the external fixture build step with wat2wasm. |

How to use this matrix

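At the decoding level, the matrix marks every standard section and instruction family listed above as Tested; generic custom section payloads are the one Partial entry, and unmapped opcodes fall back to unknown_<prefix>_<opcode> names. The remaining gaps are deliberate scope choices rather than missing work items: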

  1. Spec validation (type checking, structural constraints from chapters 2 and 3 of the spec) is not the goal of this library. Validation belongs in a downstream consumer such as a language runtime.
  2. Text-format (.wat) input is handled externally by WABT and is not in scope.
  3. The specification snapshot is kept locally under specification/wasm-latest/ to serve as an authoritative reference during development but is not shipped with the distributed package.

Report schema

The structured report currently contains:

  • file: source path or caller-supplied label,
  • module_version: wasm version from the module header, or None on parse failure,
  • section_count: number of recorded sections,
  • sections: list of section dictionaries with index, id, name, size, and offset,
  • function_count: number of decoded function bodies,
  • functions: list of function dictionaries with index, name, signature_index, offset, body_size, instruction_count, and instructions,
  • tables: list of decoded table entries with index, ref_type, and limits (min, max, is_64),
  • memories: list of decoded memory entries with index and limits (min, max, is_64),
  • errors: list of parsing or file read errors.

Each instruction entry contains:

  • offset: byte offset used by the parser when the opcode was decoded,
  • opcode: mnemonic from OPCODES or an unknown_... fallback,
  • immediates: decoded immediate values in parser order,
  • decode_incomplete: present only when a function body ended with a partially decoded instruction record.

This shape is covered by tests/test_json_api.py.
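As a rough illustration of consuming this shape, the sketch below builds an opcode histogram from a hand-written dictionary that uses the field names listed above. The sample values (offsets, names, counts) are invented for the example and are not real output from the library; opcode_histogram is a hypothetical helper.

```python
from collections import Counter

# Hand-written sample shaped like the report described above.
# All values are illustrative, not real output from wasm_tools.
sample_report = {
    "file": "simple_add.wasm",
    "module_version": 1,
    "function_count": 1,
    "functions": [
        {
            "index": 0,
            "name": "add",
            "instructions": [
                {"offset": 30, "opcode": "local.get", "immediates": [0]},
                {"offset": 32, "opcode": "local.get", "immediates": [1]},
                {"offset": 34, "opcode": "i32.add", "immediates": []},
                {"offset": 35, "opcode": "end", "immediates": []},
            ],
        }
    ],
    "errors": [],
}


def opcode_histogram(report):
    """Count how often each opcode mnemonic appears across all functions."""
    counts = Counter()
    for function in report["functions"]:
        for instruction in function["instructions"]:
            counts[instruction["opcode"]] += 1
    return counts


if sample_report["errors"]:
    print("parse errors:", sample_report["errors"])
print(opcode_histogram(sample_report).most_common())
```

Checking errors before trusting the counts matters because a report for a malformed module can contain a partial instruction list.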

High-level security analysis

The JSON report includes an analysis object designed for analyst triage.

  • summary: overall risk_score, risk_tier, and finding_count,
  • detections.wasi: explicit WASI import detection (detected, variants, matched import modules/count),
  • detections.js_interface: JavaScript-interface signals from imports/exports (js/wbg namespaces, wasm:* builtins such as wasm:js-string, and common glue symbol patterns),
  • detections.format: coarse format classification (core, possible-component, invalid-core) with evidence signals,
  • capabilities: inferred host capability tags from imports (for example fs.path, network, process.terminate),
  • profiles.memory: memory access density, memory.grow, bulk-memory activity, and total data segment bytes,
  • profiles.control_flow: dynamic dispatch metrics (call_indirect, call_ref) and table mutation counts,
  • profiles.compute: loop depth and loop-contained memory/control-flow pressure,
  • findings: actionable rule-based results with stable ids and remediation guidance.

Current built-in finding ids:

  • WASM-CAP-001: filesystem and network host capabilities imported together.
  • WASM-CFG-002: indirect call surface combined with mutable table operations.
  • WASM-DOS-003: memory growth in loop context.
  • WASM-LOOP-004: deep loop nesting amplification signal.
  • WASM-FMT-005: binary appears to be non-core or otherwise parse-incompatible for this decoder.
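To make the idea behind a rule such as WASM-DOS-003 concrete, here is one way "memory growth in loop context" could be approximated over a flat list of mnemonics. This is an independent sketch under simplifying assumptions (mnemonics only, no immediates), not the project's actual rule implementation.

```python
# Mnemonics that open a new structured block closed by a matching `end`.
BLOCK_OPENERS = ("block", "loop", "if", "try_table")


def memory_grow_in_loop(opcodes):
    """Return True if memory.grow occurs inside at least one enclosing loop."""
    open_blocks = []  # kinds of the currently open structured blocks
    for op in opcodes:
        if op in BLOCK_OPENERS:
            open_blocks.append(op)
        elif op == "end":
            # The final `end` closes the function body, so the stack may be empty.
            if open_blocks:
                open_blocks.pop()
        elif op == "memory.grow" and "loop" in open_blocks:
            return True
    return False


growing_loop = ["loop", "i32.const", "memory.grow", "drop", "br", "end", "end"]
grow_after_loop = ["loop", "nop", "end", "i32.const", "memory.grow", "drop", "end"]
print(memory_grow_in_loop(growing_loop), memory_grow_in_loop(grow_after_loop))
```

A syntactic check like this cannot tell whether the loop actually iterates, which is consistent with treating such findings as triage signals rather than proof.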

Error handling model

The parser does not re-raise WasmParseError by default. BinaryReader.read_module() catches parse exceptions and forwards the message to delegate.on_error(...) when that callback exists.

This behavior is important for integration scenarios:

  • command-line flows can report errors without a Python traceback,
  • library callers can collect structured failure information,
  • fuzzing or batch inspection pipelines can continue after a malformed file.

Unit tests cover this behavior in tests/test_parser.py and tests/test_confidence_parser.py.
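The catch-and-forward model can be sketched as follows. ParseError, ErrorCollector, and this read_module function are stand-ins written for illustration; they mirror the described behavior but are not the project's WasmParseError or BinaryReader.read_module().

```python
class ParseError(Exception):
    """Stand-in for the parser's own exception type."""


class ErrorCollector:
    """Delegate that records error messages instead of printing them."""

    def __init__(self):
        self.errors = []

    def on_error(self, message):
        self.errors.append(message)


def read_module(data, delegate):
    """Return True on success; on failure report via on_error and return False."""
    try:
        if len(data) < 8:
            raise ParseError("file too short for module header")
        return True
    except ParseError as exc:
        # Forward the message only when the delegate defines the callback.
        if hasattr(delegate, "on_error"):
            delegate.on_error(str(exc))
        return False


# A batch loop keeps going after a malformed input.
for blob in (b"\x00asm\x01\x00\x00\x00", b"\x00as"):
    collector = ErrorCollector()
    print(read_module(blob, collector), collector.errors)
```

Only the parser's own exception type is caught here, so unexpected bugs (for example an IndexError) would still surface as tracebacks rather than being masked.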

Examples of currently tested failure cases include:

  • truncated modules,
  • bad magic values,
  • sections extending beyond file boundaries,
  • malformed LEB128 encodings,
  • truncated instruction immediates.
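Several of these failure cases come down to a handful of bounds checks. The sketch below shows, under stated assumptions, how a header check, a strict unsigned LEB128 reader, and a section-size bounds check might reject such inputs. The function names, the error messages, and the choice to record the payload start as "offset" are made up for this example and do not describe the project's parser.

```python
MAGIC = b"\x00asm"


def read_u32_leb128(data, pos):
    """Decode an unsigned 32-bit LEB128 value; return (value, next_pos)."""
    result = 0
    for i in range(5):  # a u32 needs at most ceil(32 / 7) = 5 bytes
        if pos + i >= len(data):
            raise ValueError("truncated LEB128")
        byte = data[pos + i]
        if i == 4 and byte & 0xF0:
            # Fifth byte: only 4 payload bits allowed, and no continuation bit.
            raise ValueError("LEB128 does not fit in 32 bits")
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return result, pos + i + 1
    raise AssertionError("unreachable: the fifth byte always returns or raises")


def list_sections(data):
    """Return (version, sections) or raise ValueError on malformed input."""
    if len(data) < 8:
        raise ValueError("truncated module header")
    if data[:4] != MAGIC:
        raise ValueError("bad magic value")
    version = int.from_bytes(data[4:8], "little")
    sections, pos = [], 8
    while pos < len(data):
        section_id = data[pos]
        size, payload = read_u32_leb128(data, pos + 1)
        if payload + size > len(data):
            raise ValueError(f"section {section_id} extends beyond end of file")
        sections.append({"id": section_id, "size": size, "offset": payload})
        pos = payload + size
    return version, sections


# Header plus a type section holding one empty function type.
module = MAGIC + b"\x01\x00\x00\x00" + b"\x01\x04\x01\x60\x00\x00"
print(list_sections(module))
for bad in (module[:3], b"wasm" + module[4:], module[:-1], module[:9] + b"\xff" * 5):
    try:
        list_sections(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

The four malformed inputs exercise, in order, a truncated module, a bad magic value, a section that runs past the end of the file, and an over-long LEB128 size.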

Test fixtures and what they cover

The repository uses .wat fixtures under tests/fixtures/, compiled to .wasm with WABT's wat2wasm.

Representative fixtures include:

  • simple_add.wat for minimal arithmetic and local access,
  • control_flow.wat for block, loop, br, and br_if,
  • labels_control.wat for named-label lowering, br_table depth vectors, and label shadowing/redefinition patterns,
  • memory_data.wat for memory load semantics and data segments,
  • globals_imports.wat for imported globals and functions,
  • call_indirect.wat for indirect calls,
  • call_refs.wat for typed call_ref through locals and globals, plus null-ref call paths,
  • load64.wat for memory64 ((memory i64 ...)) addressing and large memarg offsets,
  • float_memory64.wat for memory64 float load/store decoding across f32.* and f64.* memory ops,
  • bulk64.wat for memory64 memory.init, data.drop, memory.copy, and memory.fill,
  • memory_trap64.wat for memory64 boundary-style address construction with memory.size, memory.grow, and scalar load/store ops,
  • memory64_shared.wat for shared memory64 limit decoding and memory.size/memory.grow disassembly,
  • table_fill64.wat for table64 table.fill and table.get,
  • table_set64.wat for table64 table.set/table.get on externref and funcref tables,
  • table_size64.wat for table64 table.size/table.grow plus i64 table limits,
  • table_init64.wat for table64 ((table ... i64 ...)) offsets plus table.init, table.copy, and table-indexed call_indirect,
  • simd_store64_lane.wat for SIMD lane memory operands, including v128.store64_lane alignment, offset, and lane immediates,
  • unreachable.wat for stack-polymorphic unreachable behavior across blocks, loops, calls, branches, memory, and numeric operators,
  • bulk_memory.wat for memory.init, data.drop, and memory.fill,
  • complex_flow.wat for mixed control flow, memory, direct calls, and indirect calls,
  • unicode_names.wat for Unicode content,
  • adversarial_ops.wat for edge immediates and br_table,
  • wasi_capabilities.wat for host capability/risk analysis checks,
  • wasi_preview2_like.wat for WASI preview2-like namespace detection (wasi:* imports),
  • js_interface.wat for JavaScript embedding detection (js, wbg, and wasm:js-string imports),
  • dos_growth_loop.wat for loop + memory.grow DoS heuristics.

These fixtures are used in tests/test_e2e.py to validate the disassembly output and in tests/test_json_api.py to validate the structured API.

Known limitations

The repository is a practical decoder, not a full specification implementation:

  • Spec validation (type checking, module-level structural constraints) is deliberately out of scope.
  • The custom name section decodes subsections 1 (function names) and 2 (local names); other subsections such as label names are skipped.
  • Some rarely used init-expression forms in element and data segments fall back to a hex scan rather than full expression decoding.
  • The analysis layer is heuristic by design and is intended for triage, not formal proof of exploitability.

Development workflow

Run the full test suite:

python -m pytest -q

Rebuild .wasm fixtures from .wat sources:

python tests/fixtures/build.py

The fixture build script requires WABT's wat2wasm binary to be available on PATH.
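A quick way to confirm that prerequisite before running the build script is to look the tool up on PATH with the standard library. The helper name find_tool is hypothetical; the project's build script may perform its own check.

```python
import shutil


def find_tool(name="wat2wasm"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = find_tool()
print(path if path else "wat2wasm not found on PATH; install WABT first")
```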

The project metadata in pyproject.toml indicates Poetry-based packaging, so the same steps can also be run through Poetry:

poetry install
poetry run pytest -q
poetry run python tests/fixtures/build.py

Guidance for reviewers and integrators

If you are evaluating this project for security tooling or pipeline integration, start with these files:

  • wasm_tools/parser.py for parse correctness,
  • wasm_tools/opcodes.py for current opcode coverage,
  • wasm_tools/api.py for the stable integration surface,
  • tests/test_e2e.py for output expectations,
  • specification/wasm-latest/5.3-binary.instructions.spectec for spec alignment work.

License

This project is licensed under the MIT License. See LICENSE for details.

The inputs to the AI agents came from the WebAssembly specification, the WABT project, and the author's knowledge of Python and WebAssembly. The outputs are original code generated by the AI agents based on those inputs. It is possible this project is therefore not MIT-licensed due to the presence of third-party specification text in the training data. The author has made a good faith effort to generate original code and to avoid copying any specific text from the specification, but this cannot be guaranteed. Users should review the code and the specification to ensure compliance with their licensing needs.