Secure code generation

During code generation, the supplied schema is treated as untrusted input (but in JSON form).

While DoS safety of an untrusted schema can't be guaranteed even with complexity checks, both the process of code generation and the generated validator/parser code are supposed to be safe against arbitrary code execution, even when both schema and the data being validated were received as untrusted JSON input.

It should be noted that supplying untrusted schemas is not recommended.

Even though they should not cause arbitrary code execution, they can cause DoS. Also, security issues can arise at any point along the way, and the best way to avoid the problem is to not use untrusted input in the process of code generation, if possible.

Code generation approach

To protect code generation against code injection from schemas, the following approach is taken:

  1. All code generated from the schema goes through the format(template, ...args) function, which embeds the supplied arguments into a template string.

    It typechecks the supplied arguments, and everything that is not code must be wrapped in JSON.

  2. The first argument of the format() function (the template) is trusted and must always be a literal string or come from a wrapper where it is a literal string. That way, it can't come from a runtime variable and can't be affected by untrusted input directly.

    Custom ESLint rules additionally enforce this in the codebase so that it isn't broken accidentally.

  3. Trusted code can be produced only in a limited number of ways:

    1. Output emitted by format() is treated as code.

    2. Safe generated variable ids are also treated as code.

    3. Logical and (&&) and or (||) operations on a variable-length list of code arguments are also treated as code. That is used in complex block conditions.

    4. Certain explicitly listed constant values can be inserted into a format() template, but only at designated places.

    format() rejects everything else when an attempt is made to insert it into the template without proper escaping.

  4. No string transformations are performed on the generated code, as any non-context-aware transformation of generated code is unsafe.
    See below for more information.

    As an AST transformation would be too complex here, all optimizations (eliminating empty blocks, for example) have to happen before the code is emitted.

    The approach used here wraps block body generators so that when the body generator does not emit any code, the whole block is omitted.
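The block-wrapping idea from item 4 can be illustrated with a small sketch. This is not the library's code: createWriter, write and block are hypothetical names, and plain strings stand in for trusted code.

```javascript
// Hypothetical line collector; all names here are illustrative only.
function createWriter() {
  const lines = []
  return {
    // Append one line of (already trusted) code.
    write(line) {
      lines.push(line)
    },
    // Emit `header ... footer` only if writeBody() actually wrote something.
    block(header, footer, writeBody) {
      const start = lines.length
      lines.push(header)
      writeBody()
      if (lines.length === start + 1) {
        lines.pop() // the body was empty, so drop the header as well
      } else {
        lines.push(footer)
      }
    },
    toString() {
      return lines.join('\n')
    },
  }
}

const writer = createWriter()
// This block has an empty body and leaves no trace in the output.
writer.block('if (a) {', '}', () => {})
// This block has a body and is emitted in full.
writer.block('if (b) {', '}', () => writer.write('return false'))
const emitted = writer.toString()
```

Because the decision is made while the code is being produced, no search-and-replace over the finished code is needed, and nested empty blocks collapse naturally.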

Source code for that can be found in safe-format.js.
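The following is a minimal sketch of the general idea, written for illustration and not taken from safe-format.js. SafeString, sketchFormat and the %s placeholder for trusted code are assumptions of this sketch; only %j (a JSON-escaped value) is mentioned in this document.

```javascript
// Hypothetical marker type: a string that is known to be trusted code.
class SafeString extends String {}

// JSON-encode a value for embedding into code (see the notes on \u2028 and \u2029).
function jsonForCode(value) {
  const json = JSON.stringify(value)
  if (json === undefined) throw new Error('Value is not JSON-serializable')
  return json.replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
}

// Embed arguments into a trusted template:
//   %s accepts only trusted code (SafeString), %j JSON-encodes anything else.
function sketchFormat(template, ...args) {
  if (typeof template !== 'string') throw new Error('Template must be a string')
  let index = 0
  const out = template.replace(/%[%sj]/g, (match) => {
    if (match === '%%') return '%'
    if (index >= args.length) throw new Error('Not enough arguments')
    const arg = args[index++]
    if (match === '%s') {
      if (!(arg instanceof SafeString)) throw new Error('Expected trusted code for %s')
      return String(arg)
    }
    return jsonForCode(arg)
  })
  if (index !== args.length) throw new Error('Too many arguments')
  return new SafeString(out) // output of the formatter is itself trusted code
}

// Usage: `data` stands in for a safe generated variable id.
const dataId = new SafeString('data')
const check = sketchFormat('if (%s !== %j) return false', dataId, 'it\'s "quoted"')
```

Passing a plain string where code is expected throws instead of being concatenated, which is the property the list above relies on.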

Separate logic is used for function stringification (e.g. formats), but those do not come from the schema and are assumed to be trusted input (and are typechecked to be functions).

Things to note

format() does not make things magically safe

Templates for format() should be treated with care.

E.g. a template that places %j (a JSON-escaped variable) inside single quotes, such as format("'%j'", arg), is an arbitrary code execution vulnerability if arg is untrusted: JSON encoding does not escape single quotes, so the contents of arg can close the single quotes and escape from the string.

Variables should be inserted only in those contexts where inserting any JSON-encoded value is safe against code execution.
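The difference between a safe and an unsafe insertion context can be shown with plain JSON.stringify and template literals. This is a sketch with a harmless payload; run, unsafeCode and safeCode are illustrative names, and new Function is used only to evaluate the generated text.

```javascript
// Untrusted value carrying a harmless payload that records that it ran.
const untrusted = "' + hits.push('injected') + '"
// JSON output is double-quoted; single quotes inside it stay unescaped.
const json = JSON.stringify(untrusted)

// Unsafe: the JSON text is placed inside a single-quoted string literal.
const unsafeCode = `return '${json}'`
// Safe: the JSON text is placed where a whole expression is expected.
const safeCode = `return ${json}`

// Evaluate generated code with a fresh `hits` array and report what happened.
function run(code) {
  const hits = []
  const result = new Function('hits', code)(hits)
  return { result, hits }
}

const unsafeRun = run(unsafeCode) // the payload executes and fills `hits`
const safeRun = run(safeCode) // the payload is returned as inert string data
```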

JSON.stringify is not always safe for inclusion in code

The \u2028 and \u2029 characters should be escaped when embedding JSON-wrapped objects into code.

They became valid inside JS string literals only in ES2019, so it's best not to rely on that.

See https://v8.dev/features/subsume-json#security for more details.

format() function handles that.
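A minimal sketch of such post-processing is shown below. escapeLineSeparators is a hypothetical helper name, and this is not necessarily how format() implements it.

```javascript
// Replace raw U+2028 / U+2029 with their escape sequences, so the JSON text
// is a valid JS string literal even on pre-ES2019 engines.
function escapeLineSeparators(json) {
  return json.replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
}

const rawJson = JSON.stringify('a\u2028b\u2029c') // still contains the raw characters
const safeJson = escapeLineSeparators(rawJson)
```

The escaped text still parses back to the original value with JSON.parse, so the data is unchanged.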

__proto__ properties should be post-processed after JSON.stringify

Even with the \u2028 and \u2029 difference resolved in newer ECMAScript specification versions and by post-processing, there is one more parsing difference between JSON and JS contexts which has to be accounted for before including JSON-stringified variables into a JS context.

{"__proto__": ... parses differently because JS handles it specially and JSON does not. In a JS object literal it sets the prototype of the object, while JSON.parse creates an ordinary own property named __proto__.
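The difference can be observed by handling the same text once as JSON and once as JS source. This is an illustrative sketch; the variable names are arbitrary.

```javascript
const protoJson = '{"__proto__":{"polluted":true}}'

// As JSON: `__proto__` becomes an ordinary own property.
const parsedAsJson = JSON.parse(protoJson)
// As JS source: the same text sets the prototype of the new object instead.
const evaluatedAsJs = new Function(`return (${protoJson})`)()

const jsonKeys = Object.keys(parsedAsJson) // the key is present as data
const jsKeys = Object.keys(evaluatedAsJs) // no own keys at all
const inherited = evaluatedAsJs.polluted // the value is reachable via the prototype
```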

To account for that, all occurrences of {"__proto__": should be replaced with {["__proto__"]:, and all occurrences of ,"__proto__": with ,["__proto__"]:, after each JSON.stringify call.

The replacement above works given that JSON.stringify is used without the space formatting option. Full regex pattern for properties that need replacement is /[^\\]"__proto__":/g.

format() function handles that.
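The replacement described above could be sketched as follows. jsonWithProtoFix is a hypothetical name, the sketch assumes JSON.stringify is called without the space option, and it is not necessarily identical to what format() does.

```javascript
// JSON-encode a value, then turn `"__proto__":` keys into computed keys,
// which JS treats as ordinary properties rather than as the prototype.
function jsonWithProtoFix(value) {
  return JSON.stringify(value).replace(/([{,])"__proto__":/g, '$1["__proto__"]:')
}

// JSON.parse keeps `__proto__` as plain data, as an untrusted schema would.
const value = JSON.parse('{"__proto__":{"a":1},"b":{"x":1,"__proto__":2}}')
const fixedCode = jsonWithProtoFix(value)
// Evaluating the fixed code as JS now yields the same data as JSON.parse did.
const roundTripped = new Function(`return (${fixedCode})`)()
```

Quotes inside JSON strings are always escaped, so the unescaped sequence being replaced can only appear at a real property key.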

RegExp stringification should use new RegExp()

Using /regexp/ form, produced by converting a RegExp object to a string, is not safe.

Consider the output of console.log(String(new RegExp("\n"))) on Node.js 10 and below.
That behaviour differs between platforms and versions.

If the pattern is arbitrary input from the schema and the RegExp object is stringified by converting it to a string (e.g. ${regexp}, regexp + '', regexp.toString() or String(regexp)), it can break the generated code in certain cases.

Instead, convert those to code in the form of new RegExp(pattern, flags) (where pattern and flags should be properly escaped, as with other string variables).

Stringification of RegExp objects is supported in the format() function.
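A sketch of that conversion is shown below. regexpToCode is a hypothetical helper, not the format() implementation; it JSON-escapes the pattern and flags and also escapes \u2028 and \u2029.

```javascript
// Produce JS source text that recreates the given RegExp via its constructor,
// so no /.../ literal syntax is ever built from untrusted pattern text.
function regexpToCode(regexp) {
  if (!(regexp instanceof RegExp)) throw new TypeError('Expected a RegExp')
  const asCode = (str) =>
    JSON.stringify(str).replace(/\u2028/g, '\\u2028').replace(/\u2029/g, '\\u2029')
  return `new RegExp(${asCode(regexp.source)}, ${asCode(regexp.flags)})`
}

const original = new RegExp('a/b\n', 'i')
const regexpCode = regexpToCode(original)
// Evaluating the generated code yields an equivalent RegExp object.
const rebuilt = new Function(`return ${regexpCode}`)()
```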

Non-context aware code transformations are unsafe

That includes any type of replacements performed on the code after its generation.

Even a seemingly harmless replacement of else {} with an empty string can be abused in some situations, because that string can occur in a non-code context, e.g. inside an object property name.

That also includes altering newlines in the code.

The only safe way to transform generated code is by parsing it to AST, transforming the AST and re-generating the code from AST.
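The else {} case can be reproduced in a few lines. This is an illustrative sketch in which the generated code is built by hand and the untrusted value ends up inside a string literal.

```javascript
// Untrusted data that happens to contain text resembling an empty else block.
const untrustedText = 'a else {} b'
const generatedCode =
  `if (!ok) { return false } else {}\nreturn ${JSON.stringify(untrustedText)}`

// A non-context-aware "optimization" applied after generation.
const strippedCode = generatedCode.replace(/ else \{\}/g, '')

// The real empty block is gone, but so is part of the string data.
const valueBefore = new Function('ok', generatedCode)(true)
const valueAfter = new Function('ok', strippedCode)(true)
```

Here the damage is limited to corrupted data, but the same mechanism can change the structure of the code when the matched text spans a string boundary.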

Trusted code concatenation is fine without being context aware, though.