Internals: jq Assignment Operators

Mattias Wadman edited this page Jul 26, 2023 · 10 revisions

Assignment semantics recap

As with all internals pages, reader familiarity with the jq language is assumed.

In jq all data and all bindings are "immutable". But there is in fact a way to appear to mutate data: use assignment operators, or low-level builtins like setpath/2 to produce new values that are the same as the original value mutated as requested, with copy-on-write semantics. The new values produced by assignments (and setpath()) are only visible "to the right" of the assignment expression. E.g., in stuff | (((.a |= more_stuff) | even_more_stuff), last_stuff) the value produced by the assignment is only visible to even_more_stuff, and when that backtracks the changes will be "undone", and last_stuff will see the unmodified value.
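
The visibility rule can be illustrated with a small Python model. This is only a sketch of the semantics, not jq's implementation: the hypothetical helper update_a stands in for .a |= f, and a generator stands in for the two branches of the , operator.

```python
def update_a(value, fn):
    """Model of `.a |= fn`: build a new object, leave `value` untouched."""
    new = dict(value)          # copy, so the caller's value is not mutated
    new["a"] = fn(value["a"])
    return new


def program(dot):
    """Model of `((.a |= . + 1) | .a), .a`."""
    # Left branch of `,`: the updated value is visible only downstream.
    yield update_a(dot, lambda v: v + 1)["a"]
    # Right branch of `,`: it receives the original, unmodified input.
    yield dot["a"]


original = {"a": 1}
results = list(program(original))
print(results)   # [2, 1]
```

The second result is 1 because the right branch never sees the value produced by the assignment in the left branch.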

jq internally depends critically on the jv API implementing a copy-on-write abstraction, and for performance jq depends critically on the jv API mutating values in place when those values have just one reference.
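
As a rough illustration of that contract, here is a toy Python model with a hand-maintained reference count. The names CowArray, share, and set_index are hypothetical and are not part of the jv API; the sketch assumes, as a simplification, that an update consumes one reference to its argument.

```python
class CowArray:
    """Toy reference-counted array; `refs` is tracked by hand."""

    def __init__(self, items):
        self.items = list(items)
        self.refs = 1


def share(arr):
    """Take an additional reference to the same array."""
    arr.refs += 1
    return arr


def set_index(arr, index, value):
    """Consume one reference to `arr` and return the updated array."""
    if arr.refs == 1:
        arr.items[index] = value      # sole owner: mutate in place
        return arr
    arr.refs -= 1                     # shared: give up our reference ...
    copy = CowArray(arr.items)        # ... and write to a private copy
    copy.items[index] = value
    return copy


unique = CowArray([1, 2, 3])
same = set_index(unique, 0, 99)       # no copy is made

shared = CowArray([1, 2, 3])
other = share(shared)                 # a second holder appears
changed = set_index(shared, 0, 99)    # forces a copy
```

In the first case the update costs one element write; in the second it costs a full copy, which is the cost a reduction has to avoid paying on every step.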

Note that the , operator retains a reference to . while the left-hand side (LHS) of the , is executing so that it can pass that to the right-hand side (RHS) when the LHS is done. It can be hard to "see" when jq is holding on to a reference to a value such that a mutation of that value will incur a copy. Having to always copy a value in a reduction would ruin performance.

Every assignment in jq can be modeled as a call to a function with two arguments: one being a path expression matching paths to be updated, and the other being a value expression that produces values to set at the matching paths.

Assignments in the compiler and in the block representation

The jq assignment operators =, //=, <op>= (e.g., +=, -=, etc.), and |= are very special. They're not like assignments in most languages -- they are just another kind of jq expression that produces zero, one, or more values, but each value produced is the input to the assignment with the changes that the right-hand side (RHS) specifies applied at the paths matched by the left-hand side (LHS).

The LHS is very special: it is a path expression, which is an expression consisting only of sub-expressions like .a, if/then/else with path expressions as the actions, and/or calls to functions whose bodies are path expressions.

The RHS is some expression which, in the case of |= receives the current value at the LHS in ., while in the other cases the RHS receives . (the input to the whole assignment expression). The latter can be confusing.
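
The difference in what the RHS receives can be sketched in Python. This is a model of the semantics for a single fixed path and a single-valued RHS, not jq code; assign_a, update_a, and get_b are hypothetical names.

```python
def assign_a(dot, rhs):
    """Model of `.a = rhs`: rhs sees `.`, the input to the assignment."""
    return {**dot, "a": rhs(dot)}


def update_a(dot, rhs):
    """Model of `.a |= rhs`: rhs sees the current value at `.a`."""
    return {**dot, "a": rhs(dot["a"])}


dot = {"a": {"b": 5}, "b": 2}
get_b = lambda value: value["b"]      # stands in for the jq expression `.b`

assigned = assign_a(dot, get_b)       # like `.a = .b`
updated = update_a(dot, get_b)        # like `.a |= .b`
```

Here assigned picks up the top-level b (2), while updated picks up the b nested inside the old value of a (5).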

Inspecting src/parser.y is instructive.

First we have //= and <op>=

Exp "//=" Exp {
  $$ = gen_definedor_assign($1, $3);
} |

static block gen_definedor_assign(block object, block val) {
  block tmp = gen_op_var_fresh(STOREV, "tmp");
  return BLOCK(gen_op_simple(DUP),
               val, tmp,
               gen_call("_modify", BLOCK(gen_lambda(object),
                                         gen_lambda(gen_definedor(gen_noop(),
                                                                  gen_op_bound(LOADV, tmp))))));
}

Exp "+=" Exp {
  $$ = gen_update($1, $3, '+');
} |

static block gen_update(block object, block val, int optype) {
  block tmp = gen_op_var_fresh(STOREV, "tmp");
  return BLOCK(gen_op_simple(DUP),
               val,
               tmp,
               gen_call("_modify", BLOCK(gen_lambda(object),
                                         gen_lambda(gen_binop(gen_noop(),
                                                              gen_op_bound(LOADV, tmp),
                                                              optype)))));
}

Having val before the gen_call("_modify", ...) is the reason that the RHS of //= receives the same . as the LHS (that is, the input to the whole assignment expression), the reason that it's evaluated every time, and also the reason that the assignment is done once per value output by the RHS.

Compare to |= which is coded like this:

Exp "|=" Exp {
  $$ = gen_call("_modify", BLOCK(gen_lambda($1), gen_lambda($3)));
} |

Ok, let's translate all of this to English:

  • First |=: gen_call("_modify", BLOCK(gen_lambda($1), gen_lambda($3))); means: "generate a call to _modify with the LHS ($1) as the first argument and the RHS ($3) as the second argument" (note that jq function arguments are lambdas, thus the gen_lambda()s).

  • Now gen_definedor_assign() and gen_update() (which are very similar):

    • the DUP is memory management -- ignore for this analysis
    • val is the RHS, and we will invoke it immediately
    • store the val output(s) (RHS) in tmp (a gensym'ed $binding)
    • call _modify (the heart of modify-assign operators) with the LHS as the first argument and a second argument that amounts to . // $tmp (or, in gen_update(), . <op> $tmp), where $tmp is the gensym'ed binding mentioned above
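
The steps above can be modeled in Python. This is a sketch of the observable behavior only, for a single existing key; op_assign, definedor, and plus are hypothetical names, and the RHS is modeled as a function returning a list of outputs.

```python
def definedor(current, fallback):
    """Model of `. // $tmp`: keep `current` unless it is null or false."""
    return fallback if current is None or current is False else current


def op_assign(dot, key, rhs, combine):
    """Model of `.[key] <op>= rhs` for a single, existing key."""
    for tmp in rhs(dot):              # val runs first, with `.` as input
        new = dict(dot)               # each $tmp yields its own result
        new[key] = combine(dot[key], tmp)
        yield new


plus = lambda current, tmp: current + tmp

# Like `.a += (1, 2)` on {"a": 1}: one output per RHS value.
sums = list(op_assign({"a": 1}, "a", lambda dot: [1, 2], plus))

# Like `.a //= .b` on {"a": null, "b": 7}: the RHS reads `.b` from `.`.
filled = list(op_assign({"a": None, "b": 7}, "a",
                        lambda dot: [dot["b"]], definedor))
```

The outer loop corresponds to val and tmp sitting in front of the call to _modify: every RHS output gets bound in turn, and each binding produces a separate result.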

The difference between //= and other op= assignments is that // is block-coded in gen_definedor() while the ops are builtins like _plus. // could have been jq-coded, but it's not.

jq-coded assignment helpers

The jq-coded builtin _assign implements the jq-coded part of the = assignment operator:

def _assign(paths; $value): reduce path(paths) as $p (.; setpath($p; $value));

_assign is pretty self-explanatory. All it does is reduce over the paths setting the given value at each path. It helps to first see the yacc/bison/compiler side of things (see above).
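
For readers who prefer to see the fold spelled out, here is a Python model of the same reduction. It is a simplification: this hypothetical setpath requires every container along the path to exist, and the paths are listed by hand rather than generated by path(paths).

```python
from functools import reduce


def setpath(value, path, new):
    """Return a copy of `value` with `new` stored at `path`.

    Simplification: every container along `path` must already exist.
    """
    if not path:
        return new
    head, rest = path[0], path[1:]
    copy = dict(value) if isinstance(value, dict) else list(value)
    copy[head] = setpath(value[head], rest, new)
    return copy


def assign(value, paths, new):
    """Model of `_assign`: fold over the paths, setting `new` at each."""
    return reduce(lambda acc, p: setpath(acc, p, new), paths, value)


doc = {"a": [1, 2], "b": 3}
# Like `(.a[0], .b) = 0`; the two paths are spelled out by hand here.
result = assign(doc, [["a", 0], ["b"]], 0)
```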

The jq-coded builtin _modify implements the jq-coded part of all the other assignment operators:

def _modify(paths; update):
    reduce path(paths) as $p ([., []];
        . as $dot
      | null
      | label $out
      | ($dot[0] | getpath($p)) as $v
      | (
          (   $$$$v
            | update
            | (., break $out) as $v
            | $$$$dot
            | setpath([0] + $p; $v)
          ),
          (
              $$$$dot
            | setpath([1, (.[1] | length)]; $p)
          )
        )
    ) | . as $dot | $dot[0] | delpaths($dot[1]);

The $$$$v thing is an internal-only hack where evaluating $$$$v produces $v's value, but also sets $v to null so that the next invocation of $$$$v or $v produces null. This is done to avoid holding on to a reference that would cause copy-on-write behavior that would make _modify accidentally quadratic.

In English we're reducing over the paths using an array as the reduction state containing . and an initially empty array of paths to delete. For any path for which update produces a value, we take the first value and alter . to set that value at that path. For any path for which update produces no value (empty), we add that path to the array of paths to delete. Once the reduction completes we then delete all the paths queued up for deletion. We delay deletions because otherwise we risk deleting the wrong array elements: we generally traverse array elements from the first to the last, and deleting any non-last element decrements the indices of the elements after it, which in turn causes subsequent deletions to be off.
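
The effect of deferring deletions can be seen in a small Python model restricted to flat arrays. Both functions are hypothetical illustrations: modify_deferred mirrors the reduction described above, while modify_eager is a contrasting variant that deletes immediately and is not meant to reproduce any particular jq version.

```python
def first_or_none(outputs):
    """Return (True, first value) or (False, None) if there are no outputs."""
    for value in outputs:
        return True, value
    return False, None


def modify_deferred(items, update):
    """Model of the reduction: state is [value, paths queued for deletion]."""
    state = [list(items), []]
    for index in range(len(items)):
        found, value = first_or_none(update(state[0][index]))
        if found:
            state[0][index] = value       # keep only the first output
        else:
            state[1].append(index)        # update was empty: queue the path
    doomed = set(state[1])
    return [v for i, v in enumerate(state[0]) if i not in doomed]


def modify_eager(items, update):
    """Hypothetical variant that deletes immediately, for contrast."""
    work = list(items)
    for index in range(len(items)):
        if index >= len(work):
            break                         # earlier deletions shortened the list
        found, value = first_or_none(update(work[index]))
        if found:
            work[index] = value
        else:
            del work[index]               # later indices now point one too far
    return work


# Stands in for `.[] |= select(. < 2)`: empty for every value >= 2.
drop_big = lambda v: [] if v >= 2 else [v]

deferred = modify_deferred([1, 5, 3, 0, 7], drop_big)   # [1, 0]
eager = modify_eager([1, 5, 3, 0, 7], drop_big)         # [1, 3, 0]
```

The eager variant leaves 3 in place because, after 5 is removed, index 2 refers to 0 rather than to 3.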

What's really tricky here is that we need to make sure we have just one reference to the reduction state when we get to setpath([0] + $p; $v) (where update produced a value) or setpath([1, (.[1] | length)]; $p) (where update was empty so we're queuing up a deletion of that path). We also need to have only one reference to the value to be altered.

Internal use only special form: $$$$binding

In jq $name bindings are a) lexically scoped, b) immutable. This presents a problem in reductions where one may wish to have a binding and then also somehow not hold on to an extra reference to the bound value.

The $$$$name syntax is currently meant for internal use only, and is a variant of $name which produces the named value and also replaces the binding with null. This is done by generating a LOADVN instruction instead of the LOADV instruction that $name would generate. Some jq special forms already use LOADVN (and related instructions like STOREVN), but those are not available directly to jq code because immutability is a central concept in jq. Still, when the alternative was rewriting _modify/2 in C in the compiler, it was an easy call to add $$$$v! But we retain the right to remove $$$$v at any time, so be warned and do not use it.
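
The load-and-clear behavior can be pictured with a toy Python slot. The Binding class and its method names are hypothetical and say nothing about how the jq VM lays out its frames; the sketch only shows the hand-over of the value.

```python
class Binding:
    """Toy variable slot; `load_and_clear` is the hypothetical `$$$$name`."""

    def __init__(self, value):
        self.value = value

    def load(self):
        """Like `$name`: read the value, the slot keeps its reference."""
        return self.value

    def load_and_clear(self):
        """Like `$$$$name`: hand the value over and leave null behind."""
        value, self.value = self.value, None
        return value


state = [1, 2, 3]
binding = Binding(state)
del state                          # the binding now holds the only reference

taken = binding.load_and_clear()   # the caller becomes the sole owner
after = binding.load()             # None: the slot no longer pins the value
```

Once the slot has been cleared, whoever holds taken owns the only reference, which is the condition under which an update can happen in place instead of copying.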

Relevant issues
