pkoppstein edited this page May 30, 2023 · 83 revisions

For delicacies too choice for the manual.


Using bag to implement a sort-free version of unique

jq's unique built-in involves a sort, which in practice is usually fast enough, but which may be undesirable for very large arrays, for very long streams of entities, or when the order of first occurrence matters. One solution is to use "bags", that is, multisets in the sense of sets-with-multiplicities. Here is a stream-oriented implementation that preserves generality and takes advantage of jq's implementation of lookups in JSON objects:

# bag(stream) uses a two-level dictionary: .[type][tostring]
# So given a bag, $b, to recover a count for an entity, $e, use
# $e | $b[type][tostring]
def bag(stream):
  reduce stream as $x ({}; .[$x|type][$x|tostring] += 1 );

def bag:  bag(.[]);

def bag_to_entries:
  [to_entries[]
   | .key as $type
   | .value
   | to_entries[]
   | {key: (if $type == "string" then .key else .key|fromjson end), value} ] ;

It is now a simple matter to define uniques(stream), the "s" being appropriate here because the filter produces a stream:

# Produce a stream of the distinct elements in the given stream
def uniques(stream):
  bag(stream)
  | to_entries[]
  | .key as $type
  | .value
  | to_entries[]
  | if $type == "string" then .key else .key|fromjson end ;

As a bonus, we have a histogram function:

# Emit an array of [value, frequency] pairs, sorted by value
def histogram(stream):
  bag(stream)
  | bag_to_entries
  | sort_by( .key )
  | map( [.key, .value] ) ;
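
As a rough usage sketch (assuming jq 1.5 or later is on the PATH; the shell variable names defs and histogram_out are chosen here purely for illustration), the definitions above can be kept in a shell variable and applied to a small mixed-type stream:

```shell
# The bag-related definitions from above, held in a shell variable
# (the name "defs" is only for this illustration).
defs='
def bag(stream):
  reduce stream as $x ({}; .[$x|type][$x|tostring] += 1 );
def bag_to_entries:
  [to_entries[]
   | .key as $type
   | .value
   | to_entries[]
   | {key: (if $type == "string" then .key else .key|fromjson end), value} ] ;
def histogram(stream):
  bag(stream) | bag_to_entries | sort_by(.key) | map([.key, .value]);
'

# Count the occurrences of each distinct value in a mixed-type stream.
histogram_out=$(jq -nc "$defs"' histogram(3, "a", 1, "a", 3, null)')
echo "$histogram_out"
```

This should print [[null,1],[1,1],[3,2],["a",2]]: the pairs are ordered by jq's standard sort order (null, then numbers, then strings), and the string "a" is kept distinct from any non-string value because the bag is keyed first by type.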

Find the maximal elements of an array or stream

# Given an array of values as input, generate a stream of values of the 
# maximal elements as determined by f.
# Notes:
# 1. If the input is [] then the output stream is empty.
# 2. If f evaluates to null for all the input elements,
#    then the output stream will be the stream of all the input items.

def maximal_by(f):
  (map(f) | max) as $mx
  | .[] | select(f == $mx);

Example:

[ {"a":1, "id":1},  {"a":2, "id":2}, {"a":2, "id":3}, {"a":1, "id":4} ] | maximal_by(.a)

emits the objects with "id" equal to 2 and 3.

The above can also be used to find the maximal elements of a stream, but if the stream has a very large number of items, then an approach that requires less space might be warranted. Here are two alternative stream-oriented functions. The first simply iterates through the given stream, s, twice, and therefore assumes that [s]==[s], which is not the case, for example, for inputs:

# Emit a stream of the f-maximal elements of the given stream on the assumption
# that `[stream]==[stream]`
def maximals_by_(stream; f):
   (reduce stream as $x (null;  ($x|f) as $y | if . == null or . < $y then $y else . end)) as $mx
   | stream
   | select(f == $mx);

Here is a one-pass implementation that maintains a candidate list of maximal elements:

# Emit a stream of the f-maximal elements of the stream, s:
def maximals_by(s; f):
  reduce s as $x ([];
    ($x|f) as $y
    | if length == 0 then [$x]
      else (.[0]|f) as $v
      | if $y == $v then . + [$x] elif $y > $v then [$x] else . end
      end )
  | .[] ;
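
For illustration, here is a small sketch (assuming jq 1.5 or later; the shell variable ids is just a name picked for this example) that applies the one-pass maximals_by to the same four objects as in the earlier example, this time supplied as a stream rather than as an array:

```shell
ids=$(jq -nc '
def maximals_by(s; f):
  reduce s as $x ([];
    ($x|f) as $y
    | if length == 0 then [$x]
      else (.[0]|f) as $v
      | if $y == $v then . + [$x] elif $y > $v then [$x] else . end
      end )
  | .[] ;

# Collect the ids of the .a-maximal objects in a stream of four objects.
[ maximals_by( ({"a":1,"id":1}, {"a":2,"id":2}, {"a":2,"id":3}, {"a":1,"id":4}); .a )
  | .id ]')
echo "$ids"
```

The result is [2,3]: the candidate list is reset when a strictly larger value of .a is seen, and extended when an equal value is seen.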

Using jq as a template engine

Here we describe three approaches:

  • the first uses jq "$-variables" as template variables; it might be suitable if there are only a small number of template variables, and if it is a requirement that all template variables be given values.

  • the second approach is similar to the first approach but scales well and does not require that all template variables be explicitly given values. It uses jq accessors (such as .foo or .["foo-bar"]) as template variables instead of "$-variables".

  • the third approach uses a JSON dictionary to define the template variables; it scales well but is slightly more complex and presupposes that the JSON dictionary is accurate.

Using jq variables as template variables

One straightforward approach is to use a jq object as a template, with jq variables as the template variables. The template can then be instantiated at the command line.

For example, suppose we start with the following template in a file named ab.jq:

{a: $a, b: $a}

One way to instantiate it would be by invoking jq as follows:

jq -n --argjson a 0 -f ab.jq

Notice that the contents of the file ab.jq need not be valid JSON; in fact, any valid jq program will do, so long as JSON values are provided for all the global "$-variables".

Notice also that if a key name is itself to be a template variable, it would have to be specified in parentheses, as for example:

{($a) : 0}

The disadvantage of this approach is that it does not scale so well for a large number of template variables, though jq's support for object destructuring might help. For example, one might want to set the "$-variables" in the template file using object destructuring, like so:

. as {a: $a}       # use the incoming data to set the $-variables 
| {a: $a, b: $a}   # the template
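
A brief sketch of how these two variations might be invoked (assuming jq 1.5 or later for destructuring; the sample values and the shell variable names filled and keyed are illustrative only):

```shell
# Destructure the input object to set $a, then fill in the template.
filled=$(echo '{"a": 0}' | jq -c '. as {a: $a} | {a: $a, b: $a}')
echo "$filled"

# A template variable used as a key name must be parenthesized.
keyed=$(jq -nc --arg a name '{($a): 0}')
echo "$keyed"
```

The first command prints {"a":0,"b":0} and the second prints {"name":0}.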

Using jq accessors as template variables

Using this approach, jq accessors are used as template variables. With the above example in mind, the template file (ab.jq) would be:

{a: .a, b: .a}

To instantiate the variables, we now only need a JSON object specifying the values, e.g.

echo '{"a":0}' | jq -f ab.jq

This approach scales well, but considerable care may be required: for example, a misspelled or missing key in the input silently yields null rather than an error.

Arbitrary strings as template variables

Another scalable approach would be to use special JSON string values as template variables, and a JSON object for mapping these strings to JSON values.

For example, suppose that the file template.json contains the template:

{"a": "<A>", "b": ["<A>"]}

Here, the intent is that "<A>" is a template variable.

Now suppose that dictionary.json contains the dictionary as a JSON object:

{ "<A>": 0 }

and that fillin.jq contains the following jq program for instantiating templates:

# $dict should be the dictionary for mapping template variables to JSON entities.
# WARNING: this definition does not support template-variables being 
# recognized as such in key names.
reduce paths as $p (.;
  getpath($p) as $v
  | if $v|type == "string" and $dict[$v] then setpath($p; $dict[$v]) else . end)

Then the invocation:

jq --argfile dict dictionary.json -f fillin.jq template.json

produces:

{
  "a": 0,
  "b": [
    0
  ]
}

Summary

  • dictionary.json is a JSON object defining the mapping
  • template.json is a JSON document defining the template
  • fillin.jq is the jq program for instantiating the template

The main disadvantage of this approach is that care must be taken to ensure that template variable names do not "collide" with string values that are intended to be fixed.

Emit the ids of JSON objects in a Riak database

The following script illustrates how curl and jq can work nicely together, especially if the entities stored at each Riak key are JSON entities.

The specific task we consider is as follows:

Task:

Given a Riak database at $RIAK with a bucket $BUCKET, and assuming that the value at each Riak key is a JSON entity, then for each top-level object or array of objects, emit the value, if any, of its "id" key. The values should be emitted as a stream, it being understood that any object without an "id" key should be skipped.

The following script has been tested as a bash script with these values for RIAK and BUCKET:

RIAK=http://127.0.0.1:8098
BUCKET=test
curl -Ss "$RIAK/buckets/$BUCKET/keys?keys=stream" |\
  jq -r '.keys[] | @uri' |\
while read key
do
  curl -Ss "$RIAK/buckets/$BUCKET/keys/$key?keys"
done | jq 'if type == "array" then .[] | .id elif type == "object" then .id else empty end'

Filter objects based on the contents of a key

E.g., I only want objects whose genre key contains "house".

$ json='[{"genre":"deep house"}, {"genre": "progressive house"}, {"genre": "dubstep"}]'
$ echo "$json" | jq -c '.[] | select(.genre | contains("house"))'
{"genre":"deep house"}
{"genre":"progressive house"}

If it is possible that some objects might not contain the key you want to check, and you just want to ignore the objects that don't have it, then the above will need to be modified. For example:

$ json='[{"genre":"deep house"}, {"genre": "progressive house"}, {"volume": "wubwubwub"}]'
$ echo "$json" | jq -c '.[] | select(.genre | . and contains("house"))'

If your version of jq supports ? then it could also be used:

$ echo "$json" | jq -c '.[] | select(.genre | contains("house"))?'

In jq version 1.4+ (that is, in sufficiently recent versions of jq after 1.4), you can also use regular expressions, e.g. using the first "$json" variable defined above (the one in which every object has a genre key):

$ echo "$json" | jq -c 'map( select(.genre | test("HOUSE"; "i")))'
[{"genre":"deep house"},{"genre":"progressive house"}]

Note: use a semi-colon (";") to separate the arguments of test.
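
Because test raises an error when its input is null, objects that lack the key need some extra handling when regular expressions are used. One possible sketch (assuming a jq built with regex support that also provides the strings builtin; the shell variable matches is an illustrative name) passes only string values on to test:

```shell
json='[{"genre":"deep house"}, {"genre": "progressive house"}, {"volume": "wubwubwub"}]'

# "strings" emits its input only if it is a string, so objects without
# a string-valued genre never reach test and are simply not selected.
matches=$(echo "$json" | jq -c 'map( select(.genre | strings | test("HOUSE"; "i")) )')
echo "$matches"
```

This prints [{"genre":"deep house"},{"genre":"progressive house"}]; the object with only a volume key is dropped without an error.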

Filter objects based on tags in an array

In this section, we discuss how to select items from an array of objects each of which has an array of tags, where the selection is based on the presence or absence of a given tag in the array of tags.

For the sake of illustration, suppose the following sample JSON is in a file named input.json:

[ { "name": "Item 1",
    "tags": [{ "name": "TAG" },  { "name": "TAG" }, { "name": "Not-TAG" } ] },
  { "name": "Item 2",
    "tags": [ { "name": "Not-TAG" } ] } ]

Notice that the first item is tagged twice with the tag "TAG".

Here is a jq filter that will select the objects with the tag "TAG":

map(select( any(.tags[]; .name == "TAG" )))

In words: select an item if any of its tags matches "TAG".

Using the -c command-line option would result in the following output:

[{"name":"Item 1","tags":[{"name":"TAG"},{"name":"TAG"},{"name":"Not-TAG"}]}]

Using any/2 here is recommended because it allows the search for the matching tag to stop once a match is found.

A less efficient approach would be to use any/0:

map(select([ .tags[] | .name == "TAG" ] | any))

The subexpression [ .tags[] | .name == "TAG" ] creates an array of boolean values, where true means the corresponding tag matched; this array is then passed as input to the any filter to determine whether there is a match.

If the tags are distinct, the subexpression could be written as select(.tags[] | .name == "TAG") with the same results; however if this subexpression is used, then the same item will appear as many times as there is a matching tag, as illustrated here:

$ jq 'map(select(.tags[] | .name == "TAG"))[] | .name'  input.json
"Item 1"
"Item 1"

Selecting all items that do NOT have a specific tag

To select items that do NOT have the "TAG" tag, we could use all/2 or all/0 with the same results:

$ jq -c 'map(select( all( .tags[]; .name != "TAG") ))'  input.json
[{"name":"Item 2","tags":[{"name":"Not-TAG"}]}]
$ jq -c 'map(select([ .tags[] | .name != "TAG" ] | all))'  input.json
[{"name":"Item 2","tags":[{"name":"Not-TAG"}]}]

Using all/2 would be more efficient if only because it avoids the intermediate array.

Find the most recent object in an S3 bucket

$ json=`aws s3api list-objects --bucket my-bucket-name`
$ echo "$json" | jq '.Contents | max_by(.LastModified) | {Key}' 

Sort by numeric values extracted from text

Say you have an array of objects with an "id" key and a text value that embeds a numeric ID among other text, and you want to sort by that numeric ID:

sort_by(.id|scan("[0-9]+$")|tonumber)
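
As a quick check of the idea, here is a sketch with made-up ids (the sample data and the shell variable sorted are assumptions for illustration). It uses the pattern [0-9]+$, which requires at least one trailing digit, so an empty match can never be handed to tonumber:

```shell
sorted=$(jq -nc '
  [{"id":"item10"}, {"id":"item9"}, {"id":"item100"}]
  | sort_by(.id | scan("[0-9]+$") | tonumber)
  | map(.id)')
echo "$sorted"
```

The output is ["item9","item10","item100"], whereas a plain string sort would place "item10" and "item100" before "item9".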

Add an element to an object array

Given an array of objects, I want to add another key to each of those objects, with a value based on the existing keys:

$ json='[{"a":1,"b":2},{"a":1,"b":1}]'
$ echo "$json" | jq -c 'map(. + {color:(if (.a/.b) == 1 then "red" else "green" end)})'
[{"a":1,"b":2,"color":"green"},{"a":1,"b":1,"color":"red"}]

Explanation

This example uses map. The filter given to map copies all the keys of the input object using . and then merges this object with the color object using the + operator. The color object itself is formed using the if conditional.

Note that this could also be done in the following manner:

jq 'map(.color = if (.a/.b) == 1 then "red" else "green" end)'

Zip column headers with their rows

Given the following JSON:

{
    "columnHeaders": [
        {
            "name": "ga:pagePath",
            "columnType": "DIMENSION",
            "dataType": "STRING"
        },
        {
            "name": "ga:pageviews",
            "columnType": "METRIC",
            "dataType": "INTEGER"
        }
    ],
    "rows": [
        [ "/" , 8 ],
        [ "/a", 4 ],
        [ "/b", 3 ],
        [ "/c", 2 ],
        [ "/d", 1 ]
    ]
}

How can I convert this into a form like:

[
    { "ga:pagePath": "/", "ga:pageviews": 8 },
    { "ga:pagePath": "/a", "ga:pageviews": 4 },
    { "ga:pagePath": "/b", "ga:pageviews": 3 },
    { "ga:pagePath": "/c", "ga:pageviews": 2 },
    { "ga:pagePath": "/d", "ga:pageviews": 1 }
]

Explanation

Okay, so first we want to get the columnHeaders as an array of names:

(.columnHeaders | map(.name)) as $headers

Then, for each row, we take the $headers as entries (if this doesn't mean anything to you, refer to the with_entries section of the manual) and we use those to create a new object, in which the keys are the values from the entries and the values are the corresponding values on the row for each of said entries. Tricky, I know.

.rows
  | map(. as $row
        | $headers
        | with_entries({ "key": .value,
                         "value": $row[.key]}) )

Then we put it all together; wrapping it in a function is left as an exercise for the reader.

(.columnHeaders | map(.name)) as $headers
| .rows
| map(. as $row
      | $headers
      | with_entries({"key": .value,
                      "value": $row[.key]}) )

(This recipe is from #623.)
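
One possible way to package the filter as a reusable function is sketched below. The function name zip_rows, the shell variable names, and the abbreviated input are all hypothetical, introduced only for this illustration:

```shell
input='{ "columnHeaders": [ {"name": "ga:pagePath"}, {"name": "ga:pageviews"} ],
         "rows": [ ["/", 8], ["/a", 4] ] }'

zipped=$(echo "$input" | jq -c '
# zip_rows is a hypothetical name for the filter assembled above.
def zip_rows:
  (.columnHeaders | map(.name)) as $headers
  | .rows
  | map(. as $row
        | $headers
        | with_entries({key: .value, value: $row[.key]}));
zip_rows')
echo "$zipped"
```

With this abbreviated input the result is [{"ga:pagePath":"/","ga:pageviews":8},{"ga:pagePath":"/a","ga:pageviews":4}].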

Delete elements from objects recursively

A straightforward and general way to delete key/value pairs from all objects, no matter where they occur, is to use walk/1. (If your jq does not have walk/1, then you can copy its definition from https://github.com/jqlang/jq/blob/master/src/builtin.jq)

For example, to delete all "foo" keys, you could use the filter:

walk(if type == "object" then del(.foo) else . end)

It may also be possible to use the recurse builtin, as shown in the following example.

Let's take the recurse example from the manual, and add a bunch of useless {"foo": "bar"} to it:

{"name": "/", "foo": "bar", "children": [
  {"name": "/bin", "foo": "bar", "children": [
    {"name": "/bin/ls", "foo": "bar", "children": []},
    {"name": "/bin/sh", "foo": "bar", "children": []}]},
  {"name": "/home", "foo": "bar", "children": [
    {"name": "/home/stephen", "foo": "bar", "children": [
      {"name": "/home/stephen/jq", "foo": "bar", "children": []}]}]}]}

recurse(.children[]) | .name will give me all the names, but destroy the structure of the JSON in the process.

Is there a way to get that information, but preserve the structure?

That is, with the JSON above as input, the desired output would be:

{"name": "/", "children": [
  {"name": "/bin", "children": [
    {"name": "/bin/ls", "children": []},
    {"name": "/bin/sh", "children": []}]},
  {"name": "/home", "children": [
    {"name": "/home/stephen", "children": [
      {"name": "/home/stephen/jq", "children": []}]}]}]}

Explanation

In order to remove the "foo" attribute from each element of the structure, you want to recurse through the structure and set each element to the result of deleting the foo attribute from itself. This translates to jq as:

recurse(.children[]) |= del(.foo)

If, instead of blacklisting foo, you'd rather whitelist name and children, you could do something like:

recurse(.children[]) |= {name, children}

(This recipe is from #263.)

Extract Specific Data for While Loop in Shell Script

Thanks to @pkoppstein and @wtlangford in Issue #663, I (@RickCogley) was able to finalize a shell script to pull descriptive metadata from a database of ours, which has a REST interface.

This cookbook entry makes use of curl, while read loops, and of course jq in a bash shell script. Once the JSON metadata files are output, they can be git pushed to a git repo, and diffed to see how the database settings change over time.

We assume a JSON stream like the following, with unique values for table id's, aliases and names:

{
  "id": "99999",
  "name": "My Database",
  "description": "Lorem ipsum, the description.",
  "culture": "en-US",
  "timeZone": "CST",
  "tables": [
    {
      "id": 12341,
      "recordName": "Company",
      "recordsName": "Companies",
      "alias": "t_12341",
      "showTab": true,
      "color": "#660000"
    },
    {
      "id": 12342,
      "recordName": "Order",
      "recordsName": "Orders",
      "alias": "t_12342",
      "showTab": true,
      "color": "#006600"
    },
    {
      "id": 12343,
      "recordName": "Order Item",
      "recordsName": "Order Items",
      "alias": "t_12343",
      "showTab": true,
      "color": "#000099"
    }
  ]
}

... the goal is to extract to a file only the table aliases using curl against a db's REST interface, then use the file's aliases as input to a while loop, in which curl again can be used to grab the details about tables.

First we set variables, then run curl against the REST API. The resulting JSON stream has no newlines, so piping it through jq '.' fixes this (bonus, if you also have XML, you can pipe it through xmllint to get a similar effect: xmllint --format -). The result is output to a file which contains JSON like the above.

#!/bin/bash
db_id="98765"
db_rest_token="ABCDEFGHIJK123456789"
compcode="ACME"

curl -k "https://mydb.tld/api/$db_id/$db_rest_token/getinfo.json" |\
  jq '.' > $compcode-$db_id-Database-describe.json 
jq -r '.tables[] | "\(.alias) \(.recordName)"' \
  $compcode-$db_id-Database-describe.json > $compcode-tables.txt

The filter '.tables[] | "\(.alias) \(.recordName)"' selects the "tables" array, then from that, uses the filter "\(.foo) \(.bar)" to create a string with just those elements. Note, the -r here gives you just raw output in the file, which is what you need for the while read loop.

The output file looks like:

t_12341 Company
t_12342 Order
t_12343 Order Item

Next, the shell script uses a while read loop to parse that output file $compcode-tables.txt, then curl again to get table-specific info using the table alias talias as input. It passes the raw JSON output from the REST interface through jq '.' to add newlines, then outputs that to a file using the two loop variables in the filename (as well as variables from the top of the script).

while read talias tname
do
  curl -k "https://mydb.tld/api/$db_id/$db_rest_token/$talias/getinfo.json" |\
    jq '.' >"$compcode-$db_id-Table-$talias-$tname-getinfo.json"
done < $compcode-tables.txt

The result is a collection of files like these:

ACME-98765-Table-t_12341-Company-getinfo.json
ACME-98765-Table-t_12342-Order-getinfo.json
ACME-98765-Table-t_12343-Order Item-getinfo.json

... that can be committed to a git repo, for diffing.

Extract data and set shell variables

A variation on the preceding entry:

$ eval "$(jq -r '@sh "a=\(.a) b=\(.b)"')"

This works because the @sh format type quotes strings to be shell-eval safe.
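
To make the mechanics concrete, here is a small sketch with a made-up input object (the sample values and the intermediate shell variable assignments are assumptions for illustration):

```shell
# @sh quotes each interpolated value so the result is safe to pass to eval.
assignments=$(echo '{"a": "hello world", "b": 42}' | jq -r '@sh "a=\(.a) b=\(.b)"')
echo "$assignments"

# Evaluating the generated text sets the shell variables a and b.
eval "$assignments"
echo "a is $a; b is $b"
```

The generated text is a='hello world' b=42, so the embedded space in the string value survives the eval.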

Another variant:

$ jq -r '@sh "a=\(.a) b=\(.b)"' | while read -r line; do eval "$line"; ...; done

To share multiple values without using eval, consider setting a bash array variable, e.g.

vars=( $(jq -n -r '[1,2.3, null, "abc"] | .[] | @sh' ) )
for f in "${vars[@]}"; do echo "$f" ; done
1
2.3
null
'abc'

This approach will only work if the values are all single tokens, as in the example. In general, it is better to use jq -c to emit each value on a line separately; they can then be read using mapfile or one at a time.

For Windows, here is a .bat file that illustrates two approaches using jq. In the first example, the name of the variable is determined in the .bat file; in the second example, the name is determined by the jq program:

@echo off
setlocal

for /f "delims=" %%I in ('jq -n -r "\"123\""') do set A=%%I
echo A is %A%

jq -n -r  "@sh \"set B=123\"" > setvars.bat
call .\setvars.bat
echo B is %B%

Convert a CSV file with Headers to JSON

There are several freely available tools for converting CSV files to JSON. For example, the npm package d3-dsv (npm install -g d3-dsv) includes a command-line program named csv2json, which expects the first line of the input file to be a header row, and uses these as keys. Such tools may be more convenient than jq for converting CSV files to JSON, not least because there are several "standard" CSV file formats.

For trivially simple CSV files, however, the jq invocation jq -R 'split(",")' can be used to convert each line to a JSON array. If the trivially simple CSV file has a row of headers, then as shown below, jq can also be used to produce a stream or array of objects using the header values as keys.

In this recipe, therefore, we will assume that either the CSV is trivially simple or that a suitable tool for performing the basic row-by-row conversion to JSON arrays is available. One such tool is any-json.

The following jq program expects as input an array, the first element of which is to be interpreted as a row of headers, and the other elements of which are to be interpreted as rows.

# Requires: jq 1.5

# objectify/1 expects an array of atomic values as inputs, and packages
# these into an object with keys specified by the "headers" array and
# values obtained by trimming string values, replacing empty strings
# by null, and converting strings to numbers if possible.
def objectify(headers):
  def tonumberq: tonumber? // .;
  def trimq: if type == "string" then sub("^ +";"") | sub(" +$";"") else . end;
  def tonullq: if . == "" then null else . end;
  . as $in
  | reduce range(0; headers|length) as $i
      ({}; .[headers[$i]] = ($in[$i] | trimq | tonumberq | tonullq) );

def csv2jsonHelper:
  .[0] as $headers
  | reduce (.[1:][] | select(length > 0) ) as $row
      ([]; . + [ $row|objectify($headers) ]);

csv2jsonHelper

Usage example:

$ any-json input.csv | jq -f csv2json-helper.jq
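
For a trivially simple CSV file, the row-by-row conversion can also be done by jq itself. The following sketch (assuming jq 1.5 or later; the three-line sample CSV and the shell variable objects are made up for illustration) reads raw lines with -R, splits each on commas, and hands the resulting array of rows to csv2jsonHelper:

```shell
objects=$(printf 'name,qty\nwidget, 3\ngadget,\n' | jq -c -R -n '
def objectify(headers):
  def tonumberq: tonumber? // .;
  def trimq: if type == "string" then sub("^ +";"") | sub(" +$";"") else . end;
  def tonullq: if . == "" then null else . end;
  . as $in
  | reduce range(0; headers|length) as $i
      ({}; .[headers[$i]] = ($in[$i] | trimq | tonumberq | tonullq) );

def csv2jsonHelper:
  .[0] as $headers
  | reduce (.[1:][] | select(length > 0) ) as $row
      ([]; . + [ $row|objectify($headers) ]);

# Read each raw line, split it on commas, and collect the rows into an array.
[inputs | split(",")] | csv2jsonHelper')
echo "$objects"
```

The expected result is [{"name":"widget","qty":3},{"name":"gadget","qty":null}]. Note that this simple splitting does not handle quoted fields or embedded commas.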

Processing a large number of lines or JSON entities

Using jq 1.4 to process a file consisting of a large number of JSON entities or lines of raw text can be very challenging if any kind of reduction step is necessary, as the --slurp option requires the input to be stored in memory. One way to circumvent the limitations of jq 1.4 in this respect would be to break up the input file into smaller pieces, process them separately (perhaps in parallel), and then combine the results. Examples and utilities for parallel processing using jq can be found in jq-hopkok's parallelism folder.

The introduction of the inputs builtin in jq 1.5 allows files to be read in efficiently on an entity-by-entity or line-by-line basis. That is, the entire file no longer need be read in using the "slurp" option.

(Here is an example drawn from http://stackoverflow.com/questions/31035704/use-jq-to-count-on-multiple-levels.)

The input file consists of JSON entities, like so:

{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071870}
{"machine": "possible_victim01", "domain": "evil.com", "timestamp":1435071875}
{"machine": "possible_victim01", "domain": "soevil.com", "timestamp":1435071877}
{"machine": "possible_victim02", "domain": "bad.com", "timestamp":1435071877}
{"machine": "possible_victim03", "domain": "soevil.com", "timestamp":1435071879}

The task is to produce a report consisting of a single object, like so:

{
  "possible_victim01": {
    "total": 3,
    "evildoers": {
      "evil.com": 2,
      "soevil.com": 1
    }
  },
  "possible_victim02": {
    "total": 1,
    "evildoers": {
      "bad.com": 1
    }
  },
  "possible_victim03": {
    "total": 1,
    "evildoers": {
      "soevil.com": 1
    }
  }
}

Here is a straightforward jq program that will do the job:

reduce inputs as $line
  ({};
   $line.machine as $machine
   | $line.domain as $domain
   | .[$machine].total as $total
   | .[$machine].evildoers as $evildoers
   | . + { ($machine): {"total": (1 + $total),
                        "evildoers": ($evildoers | (.[$domain] += 1)) }} )

The program would be invoked with the -n option, e.g., like so:

jq -n -f program.jq data.json

The -n option is required as the invocation of inputs does the reading of the file.

If the task requires both per-line (or per-entity) processing as well as some kind of reduction, then the foreach builtin, also introduced in jq 1.5, is very useful, as it obviates the need to accumulate anything that is not required for the reduction.

The trick is to use foreach (inputs, null) rather than just foreach inputs. As a simple example, suppose we have a file consisting of a large number of JSON objects, some of which have a key, say "n", and we are required to extract the corresponding values as well as determine the number of objects for which the "n" value is present and not null.

foreach (inputs, null) as $line 
  (0;
   if $line.n then .+1 else . end;
   if $line == null then . else $line.n // empty end)
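
Here is a sketch of how this program might be run on a tiny made-up input (the four sample objects and the shell variable report are assumptions for illustration):

```shell
report=$(printf '{"n":1}\n{"m":2}\n{"n":null}\n{"n":5}\n' | jq -nc '
foreach (inputs, null) as $line
  (0;
   if $line.n then .+1 else . end;
   if $line == null then . else $line.n // empty end)')
echo "$report"
```

The output consists of the lines 1, 5 and 2: the two non-null "n" values are emitted as they are encountered, and the trailing null appended to the stream triggers the emission of the final count.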

Processing huge JSON texts

jq incorporates a so-called "streaming parser" so that it can process very large (and even certain types of arbitrarily large) JSON files without requiring very much memory. This parser, which has been available since the release of version 1.5, is activated by jq's "--stream" command-line option.

Unfortunately, the streaming parser is somewhat cumbersome to use, and can be very slow, so before delving into examples, it is worth emphasizing that when dealing with one or more very large (perhaps more than 10GB) monolithic JSON blobs, it is usually better to use some other tool in conjunction with jq. For example, it often makes sense to use such a tool to extract the relevant portions of a large blob, or to break it up into smaller pieces, for subsequent processing by jq.

In particular, jstream and jm both work very nicely in conjunction with jq, especially when dealing with ginormous files. When used in this way, they are both very easy to use.

For example, consider the task of converting a top-level JSON array into a stream of its elements. The jq FAQ shows how this can be done using jq's streaming parser. By contrast, this can be accomplished very simply by running jm or jstream -d 1.

jm has the added advantage of having a mode which preserves the numerical accuracy of all JSON numbers (not just integers).

In the following, we consider three other tasks and how they can be accomplished using jq's streaming parser alone, and using jm alone. The point of these examples is primarily to illustrate how jq's streaming parser can be used. In practice, if jm or jstream is available, it would probably be simpler to use one of them for simple tasks such as these.

(a) a single JSON object with no arrays

Input:

{"a":1, "b": {"c": 3}}

Program:

jq -c --stream '. as $in | select(length == 2) | {}|setpath($in[0]; $in[1])' # stream of leaflets

or:

jm -s

Output:

{"a":1}
{"b":{"c":3}}

Notice that the output consists of a stream of "leaflets", that is, a stream of JSON entities, one for each "leaf", where each "leaflet" reflects the original structure.
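
To see what the streaming parser hands to the program in (a), it may help to look at the raw events for the same input. This sketch (assuming jq 1.5 or later; the shell variable events is an illustrative name) simply prints them:

```shell
events=$(echo '{"a":1, "b": {"c": 3}}' | jq -c --stream '.')
echo "$events"
```

The events are [["a"],1], [["b","c"],3], [["b","c"]] and [["b"]], one per line. The two-element events pair a path with a leaf value, while the one-element events mark the closing of an object or array; this is why the program in (a) selects the events of length 2.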

(b) A JSON object with a single key that is a flat array

Input:

{"a": [1, 2.2, true, "abc", null]}

Program:

jq -nc --stream '
    fromstream( 1|truncate_stream(inputs)
      |  select(length>1)
      | .[0] |= .[1:] )'

or

jm /a

Output:

1
2.2
true
"abc"
null  

(c) An arbitrary JSON object

Input:

{"a": [1, 2], "b": [3, 4]}

Program:

jq -nc --stream '
  def atomize(s):
    fromstream(foreach s as $in ( {previous:null, emit: null};
      if ($in | length == 2) and ($in|.[0][0]) != .previous and .previous != null
      then {emit: [[.previous]], previous: $in|.[0][0]}
      else { previous: ($in|.[0][0]), emit: null}
      end;
      (.emit // empty), $in) ) ;
  atomize(inputs)'

or

jm -s

Output:

{"a":[1,2]}
{"b":[3,4]} 

For further information about the streaming parser, see the jq Manual and the FAQ.

List keys used in any object in a list

If you have an array of JSON objects and want to obtain a listing of the top-level keys in these objects, consider:

add | keys

If you want to obtain a listing of all the keys in all the objects, no matter how deeply nested, you can use this filter:

[.. | objects | keys[]] | unique

For example, given the array:

[{"a": {"b":1}}, {"a": {"c":2}}]

the previous filter will produce:

["a", "b", "c"]

Include or import a module and call its functions

Key points:

  • If the module, say M.jq, is located in ~/.jq/ or ~/.jq/M/ then there should be no need to invoke jq with the -L option unless there is a file M.jq in the pwd;
  • The search path can be specified using the command-line option: -L <path> -- the path may be relative or absolute, and may begin with ~/
  • include "filename"; -- a reference to filename.jq
  • include "filename" {"search": "PATH"}; -- e.g. jq -n 'include "sigma" {search: "~/jq"}; sigma(inputs)'
  • import "filename" as symbol; -- a reference to filename.jq
  • :: is the scope resolution operator, e.g. builtin::walk

Example 1: ~/.jq/library/library.jq

  1. Copy the definition of walk/1 to $HOME/.jq/library/library.jq (see e.g. https://github.com/jqlang/jq/blob/master/src/builtin.jq)

  2. Invoke jq:

jq 'include "library"; walk(if type == "object" then del(.foo) else . end)' <<< '{"a":1, "foo": 2}'

Example 2: ~/jq/library.jq

  1. Copy the definition of walk/1 to $HOME/jq/library.jq (see e.g. https://github.com/jqlang/jq/blob/master/src/builtin.jq)

  2. Invoke jq with the -L option:

jq -L $HOME/jq 'import "library" as lib;
   lib::walk(if type == "object" then del(.foo) else . end)' <<< '{"a":1, "foo": 2}'
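
The difference between import and include can be tried out without touching ~/.jq by using a throwaway directory. In this sketch the module contents (a function named double), the temporary directory, and the shell variable names are all hypothetical, chosen only for illustration:

```shell
# Create a throwaway module directory containing library.jq.
moddir=$(mktemp -d)
printf 'def double: . * 2;\n' > "$moddir/library.jq"

# import: the module's functions are reached through the chosen prefix.
imported=$(jq -n -L "$moddir" 'import "library" as lib; 21 | lib::double')

# include: the module's functions are brought in without a prefix.
included=$(jq -n -L "$moddir" 'include "library"; 21 | double')

echo "$imported $included"
rm -r "$moddir"
```

Both invocations print 42.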

Remove adjacent matching elements from a list

The unique built-in will give you each unique element in a list, but sometimes it's useful to mimic the behavior of the unix uniq command. One way to do that is to use range to select only elements which differ from their neighbor:

def uniq: 
  [range(0;length) as $i
   | .[$i] as $x
   | if $i == 0 or $x != .[$i-1] then $x else empty end];

Example input:

[1,1,3,1,2,2,1]

Applying the uniq filter produces:

[1,3,1,2,1]

And here is a stream-oriented version:

def uniq(s):
  foreach s as $x (null;
    if . == null or .emitted != $x then {emit: true, emitted: $x}
    else .emit = false
    end;
    if .emit then $x else empty end);
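
As a quick check, the stream-oriented version can be applied to the same values as in the array example above (a sketch assuming jq 1.5 or later; the shell variable deduped is an illustrative name):

```shell
deduped=$(jq -nc '
def uniq(s):
  foreach s as $x (null;
    if . == null or .emitted != $x then {emit: true, emitted: $x}
    else .emit = false
    end;
    if .emit then $x else empty end);

# Collect the stream into an array only to make the result easy to compare.
[uniq(1,1,3,1,2,2,1)]')
echo "$deduped"
```

This prints [1,3,1,2,1], matching the output of the array-oriented uniq.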

Parse ncdu output (WIP)

Personally, I make backups using the LABFD (Literally a Billion Flash Drives) technique (partly due to the sense of adventure, partly as a bad habit). This is not actually as terrible as it sounds if you remember to label your drives with their contents, but fragments of post-it notes can only do so much--some kind of browsable offline metadata archive would be ideal.

For storing this metadata of a file tree, JSON is an excellent choice--the jq manual even notes this with an example schema under recurse:

{"name": "/", "children": [
   {"name": "/bin", "children": [
     {"name": "/bin/ls", "children": []},
     {"name": "/bin/sh", "children": []}]},
   {"name": "/home", "children": [
     {"name": "/home/stephen", "children": [
       {"name": "/home/stephen/jq", "children": []}]}]}]}

The good news is that there is already a widely available tool that can generate JSON trees like this: ncdu. A single invocation of ncdu -eo fd02.json /mnt/fdrive02 will make a comprehensive (albeit potentially large) listing of the entirety of said drive (complete with extended attributes!) that can then be browsed without access to the drive via ncdu -ef fd02.json.

The bad news is that the format, while efficient, seems hard to parse. There is no easy way to discern files from folders (from what I can tell): files are objects, yet so is metadata. It makes for a challenge, for sure. I asked someone proficient in jq for help on the ##linux IRC channel and got this:

jq 'walk(if type == "array" then . else (if type == "object" then {name: .name?} else (if type == "string" then . else null end) end) end) | walk(select(.?))'

I'm not quite sure what it's intended for, but that may be because I was more vague in what I was asking for than I should have been. It's a start though.

Currently I use ncdu's json files very simply with things like cat fd.json | grep -i filename. What would eventually be cool to do would be to write a set of scripts (possibly modules) so that it is possible to:

  • Search for file names based on a regex and return JSON results containing full paths.
  • Query files based on other metadata (like size < 8MIB) or filter existing search results.
  • Systematically add other metadata for other purposes (I think ncdu would happily ignore unknown fields anyway). For example, a hash field could be used to locate identical files, either within the same ncdu file or across multiple ones.
  • Add binary metadata as, for example, z85-encoded lowres PNGs for image previews. Possibly excessive but cool.

After some tinkering and 'borrowing' snippets from elsewhere, I now have this:

jq -c '.[3]
  | paths(scalars) as $p
  | [$p, getpath($p)]'

With the sample data this yields:

[[0,"name"],"/media/harddrive"]
[[0,"dsize"],4096]
[[0,"asize"],422]
[[0,"dev"],39123423]
[[0,"ino"],29342345]
[[1,"name"],"SomeFile"]
[[1,"dsize"],32768]
[[1,"asize"],32414]
[[1,"ino"],91245479284]
[[2,0,"name"],"EmptyDir"]
[[2,0,"dsize"],4096]
[[2,0,"asize"],10]
[[2,0,"ino"],3924]

This is promising: the indexing is more explicit, and I think that it's possible to discern folders from files, because the index part of a folder's path always ends in a zero.

Another improvement:

jq -c '.[3]
  | paths(scalars) as $p
  | [$p, getpath($p)]
  | [ .[0][0:-1], .[0][-1], .[1] ]'

Yields:

[[0],"name","/media/harddrive"]
[[0],"dsize",4096]
[[0],"asize",422]
[[0],"dev",39123423]
[[0],"ino",29342345]
[[1],"name","SomeFile"]
[[1],"dsize",32768]
[[1],"asize",32414]
[[1],"ino",91245479284]
[[2,0],"name","EmptyDir"]
[[2,0],"dsize",4096]
[[2,0],"asize",10]
[[2,0],"ino",3924]

Now the index/level information is all in one array.

Edit: with more IRC help, now each entry is in a single object:

jq '[.[3]
  | paths(scalars) as $p
  | [$p, getpath($p)]
  | { keys: .[0][0:-1], ( .[0][-1] ): ( .[1] ) }]
  | group_by(.keys)
  | .[] | add'

Yields:

{"keys":[0],"name":"/media/harddrive","dsize":4096,"asize":422,"dev":39123423,"ino":29342345}
{"keys":[1],"name":"SomeFile","dsize":32768,"asize":32414,"ino":91245479284}
{"keys":[2,0],"name":"EmptyDir","dsize":4096,"asize":10,"ino":3924}

TODO: work on this more (but feel free to add to it if you know more than me! :) )
