Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse JSON using simdjson #7249

Closed
wants to merge 5 commits into from
Closed

Parse JSON using simdjson #7249

wants to merge 5 commits into from

Conversation

yorickvP
Copy link
Contributor

@yorickvP yorickvP commented Nov 1, 2022

Description

This PR changes the json-to-value function to use simdjson, an optimized json library.
It additionally adds an ExprJSON expression in order to generate json thunks, which lazily convert the rest of the document from the simdjson representation to nix Values (allocating nix values turns out to be expensive). These are generated for complex attrset values.

Motivation

  • Some real world nix evaluations spend a significant fraction of the eval time parsing JSON (I've seen up to 50%).
  • There has been a simdjson plugin for a while. At NixCon, some people expressed interest in having it upstream.

This mainly saves time on json-heavy libs like haskell.nix and nix-pypi-fetcher, which can be quite slow. The added laziness helps lot for the real world big-attrset-of-pkg-descriptions json files.

Downsides

  • simdjson can't fully replace nlohmann, since it doesn't generate any JSON.
  • The generated syntax errors are worse. There is no position information and the error messages are generic. (e.g. "The JSON document has an improper structure: missing or superfluous commas, braces, missing keys, etc.")

Current state

  • I'm not happy with the autoconf code
  • Everything should work.

Benchmarks

Real world performance difference (~30%)

This PR:

{
  "cpuTime": 3.17696,
  "envs": {
    "number": 9101249,
    "elements": 13673660,
    "bytes": 255009264
  },
  "list": {
    "elements": 3733294,
    "bytes": 29866352,
    "concats": 375405
  },
  "values": {
    "number": 16694476,
    "bytes": 400667424
  },
  "symbols": {
    "number": 114647,
    "bytes": 1468921
  },
  "sets": {
    "number": 2499791,
    "bytes": 474100000,
    "elements": 27131459
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 892412,
  "nrOpUpdateValuesCopied": 19229361,
  "nrThunks": 12884099,
  "nrAvoided": 10452895,
  "nrLookups": 5494571,
  "nrPrimOpCalls": 4107432,
  "nrFunctionCalls": 7952752,
  "gc": {
    "heapSize": 1610612736,
    "totalBytes": 1589656096
  }
}

Nix 2.11.0:

{
  "cpuTime": 4.70948,
  "envs": {
    "number": 9101249,
    "elements": 13673346,
    "bytes": 255006752
  },
  "list": {
    "elements": 5487105,
    "bytes": 43896840,
    "concats": 375405
  },
  "values": {
    "number": 20152112,
    "bytes": 483650688
  },
  "symbols": {
    "number": 521783,
    "bytes": 17180015
  },
  "sets": {
    "number": 3380559,
    "bytes": 515393184,
    "elements": 28831515
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 892098,
  "nrOpUpdateValuesCopied": 19225593,
  "nrThunks": 12884099,
  "nrAvoided": 10452581,
  "nrLookups": 5494571,
  "nrPrimOpCalls": 4107432,
  "nrFunctionCalls": 7952752,
  "gc": {
    "heapSize": 1073803264,
    "totalBytes": 2901247072
  }
}
Artificial testcase (strict) (~35%)

testcase.nix

let dir = "${builtins.getFlake "github:davhau/nix-pypi-fetcher"}/pypi";
    files = builtins.readDir dir;
    s = q: builtins.deepSeq q q; # defeat laziness
    all = builtins.concatMap (fn: builtins.attrNames (s (builtins.fromJSON (builtins.readFile (dir + "/${fn}"))))) (builtins.attrNames files);
in
builtins.length all

This PR

{
  "cpuTime": 7.46029,
  "envs": {
    "number": 513,
    "elements": 517,
    "bytes": 12344
  },
  "list": {
    "elements": 11552840,
    "bytes": 92422720,
    "concats": 0
  },
  "values": {
    "number": 22141715,
    "bytes": 531401160
  },
  "symbols": {
,
    "bytes": 119547228
  },
  "sets": {
    "number": 5504285,
    "bytes": 263017072,
    "elements": 10934282
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 0,
  "nrOpUpdateValuesCopied": 0,
  "nrThunks": 1030,
  "nrAvoided": 516,
  "nrLookups": 1028,
  "nrPrimOpCalls": 1028,
  "nrFunctionCalls": 512,
  "gc": {
    "heapSize": 1073741824,
    "totalBytes": 2581761360
  }
}

Nix 2.11.0

{
  "cpuTime": 11.6449,
  "envs": {
    "number": 513,
    "elements": 517,
    "bytes": 12344
  },
  "list": {
    "elements": 11552840,
    "bytes": 92422720,
    "concats": 0
  },
  "values": {
    "number": 22141716,
    "bytes": 531401184
  },
  "symbols": {
    "number": 3200205,
    "bytes": 119547238
  },
  "sets": {
    "number": 5504285,
    "bytes": 263017072,
    "elements": 10934282
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 0,
  "nrOpUpdateValuesCopied": 0,
  "nrThunks": 1030,
  "nrAvoided": 516,
  "nrLookups": 1028,
  "nrPrimOpCalls": 1028,
  "nrFunctionCalls": 512,
  "gc": {
    "heapSize": 1073741824,
    "totalBytes": 9016760096
  }
}
Artificial testcase (lazy) (~95%)

testcase.nix

let dir = "${builtins.getFlake "github:davhau/nix-pypi-fetcher"}/pypi";
    files = builtins.readDir dir;
    all = builtins.concatMap (fn: builtins.attrNames (builtins.fromJSON (builtins.readFile (dir + "/${fn}")))) (builtins.attrNames files);
in
builtins.length all

This PR

{
  "cpuTime": 0.549965,
  "envs": {
    "number": 513,
    "elements": 517,
    "bytes": 12344
  },
  "list": {
    "elements": 693098,
    "bytes": 5544784,
    "concats": 0
  },
  "values": {
    "number": 694496,
    "bytes": 16667904
  },
  "symbols": {
    "number": 346895,
    "bytes": 4537351
  },
  "sets": {
    "number": 258,
    "bytes": 5553008,
    "elements": 346805
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 0,
  "nrOpUpdateValuesCopied": 0,
  "nrThunks": 1030,
  "nrAvoided": 4,
  "nrLookups": 772,
  "nrPrimOpCalls": 772,
  "nrFunctionCalls": 512,
  "gc": {
    "heapSize": 1073741824,
    "totalBytes": 893077040
  }
}

Nix 2.11.0

{
  "cpuTime": 9.54107,
  "envs": {
    "number": 513,
    "elements": 517,
    "bytes": 12344
  },
  "list": {
    "elements": 11552840,
    "bytes": 92422720,
    "concats": 0
  },
  "values": {
    "number": 22141716,
    "bytes": 531401184
  },
  "symbols": {
    "number": 3200205,
    "bytes": 119547238
  },
  "sets": {
    "number": 5504285,
    "bytes": 263017072,
    "elements": 10934282
  },
  "sizes": {
    "Env": 16,
    "Value": 24,
    "Bindings": 16,
    "Attr": 16
  },
  "nrOpUpdates": 0,
  "nrOpUpdateValuesCopied": 0,
  "nrThunks": 1030,
  "nrAvoided": 4,
  "nrLookups": 772,
  "nrPrimOpCalls": 772,
  "nrFunctionCalls": 512,
  "gc": {
    "heapSize": 1073741824,
    "totalBytes": 9016760272
  }
}
hackage.nix

testcase2.nix

let hackage-json = builtins.fetchurl "https://raw.githubusercontent.com/input-output-hk/hackage.nix/master/hackage.json";
in
   builtins.fromJSON (builtins.readFile hackage-json)
$ time nix-2.11 eval --json -f testcase2.nix yeller
{..}
Executed in  613.84 millis

$ time nix-unstable eval --json -f testcase2.nix yeller
{..}
Executed in  124.87 millis

In some usecases, Nix spends a significant amount of time parsing
JSON. This library is a lot (at least 10x) faster, and
decreases real world eval times 15-25%, mostly when using
json-heavy code like nix-pypi-fetcher or haskell.nix.
GC_STRNDUP was calling strlen on the source string, but some strings
aren't zero-terminated, resulting in pathological slowdowns.

Use strnlen to only scan up to n bytes and call malloc manually.
@edolstra
Copy link
Member

edolstra commented Nov 2, 2022

I'm not really keen on having three JSON implementations in Nix (nlohmann, src/libutil/json.cc and simdjson). But maybe simdjson can replace src/libutil/json.cc?

The generated syntax errors are worse. There is no position information and the error messages are generic.

That's a pretty significant downside, given that we generally want to improve error handling in Nix. Probably for most users, the error messages are more important than being able to deal with gigabytes of JSON. Is this an inherent limitation of simdjson?

@xaverdh
Copy link

xaverdh commented Nov 7, 2022

If the only issue is that syntax errors are bad, and it is easy to distinguish syntax errors from other (e.g. internal ones), one could first try with simdjson and if that fails parse again with a library that has better errors?

@fricklerhandwerk
Copy link
Contributor

Discussed in the Nix team meeting 2023-02-10:

Decision: closed. If simdjson gets better error messages, we will reconsider.

Complete discussion
  • @thufschmitt: this makes error messages strictly worse, and tips the trade-off to the negative
    • @roberth: this could be fixed by first parsing with the existing library
  • @edolstra: also it adds another dependency
  • @thufschmitt: the question is if it happens often enough to warrant an optimisation
    • so far we're only sure it happens for Haskell builds
    • @Ericson2314: the real problem is shoving lots of data into Nix
      • taking a step back, maybe we should instead have a fromBSON or fromCBOR to optimise that transmission
      • @edolstra: CBOR is a pretty small dependency
      • @Ericson2314: fromSQLite
        • @edolstra: that's an interesting thought!
          • you woulnd't have to read everything, you could specify only what you need
            • @thufschmitt: then you'd need to specify a DSL to query the DB
          • it's basically what the eval cache does already
            • @Ericson2314: the builtin should work on queries exclusively against this schema
            • @roberth: the schema is not stable though, we'd need to specify properly it first
          • @edolstra will open an issue, we consider the general idea approved
  • decision: closed. if simdjson gets better error messages, we will reconsider.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-02-10-nix-team-meeting-minutes-31/25438/1

@Ericson2314
Copy link
Member

@yorickvP I recommend the solution in #7826 as also tentatively Nix-team-approved, less risky than fromSQLite, and easier for other 3rd party tools to work with (don't need to learn the eval cache schema / deal with its restriction).

Hopefully the parsing is even faster with those, too :).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants