- Fixed jn:intersect#1 to always be run locally
- General performance improvements for many expressions and iterators that return at most one item
- New builtin functions supported: fn:min#2, fn:max#2, fn:unordered#1, fn:distinct-values#2, fn:index-of#3, fn:deep-equal#3, fn:string#0, fn:string#1, fn:substring-before#3, fn:substring-after#3, fn:string-length#0, fn:resolve-uri#1, fn:resolve-uri#2, fn:ends-with#3, fn:starts-with#3, fn:contains#3, fn:normalize-space#0, fn:default-collation#0, fn:number#0, fn:implicit-timezone#0, fn:not#0, fn:static-base-uri#1, fn:dateTime#2, fn:false#0, fn:true#0
- All JSONiq builtin types are now supported; newly supported are byte, dateTimeStamp, gDay, gMonth, gYear, gYearMonth, gMonthDay, int, long, negativeInteger, nonNegativeInteger, positiveInteger, nonPositiveInteger, unsignedInt, unsignedLong, unsignedByte, unsignedShort, and short.
- ceiling, floor, round, abs, and round-half-to-even are now correctly in the fn namespace (not math), all accept any numeric values instead of converting everything to doubles, and a few bugs have been fixed
- support for open object types via the JSound verbose syntax (they are, of course, not implemented as DataFrames, but this makes no difference at the syntactic level except they cannot be used with ML estimators and transformers)
- support for user-defined array types via the JSound verbose syntax, including subtypes
- validation of atomic values is now correctly done by casting the lexical value (not the typed value) to the expected type.
- Fixed serialization of NaN, double/float infinity, dates, etc (the quotes are now correctly included to make them JSON strings)
- positive and negative zero (for double, float) now compare as equals in value/general comparison
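As an illustration of a few of the newly supported builtins and the zero-comparison fix, the following sketch shows expected behavior (literal values are illustrative only):

```jsoniq
fn:starts-with("JSONiq", "JSON"),  (: true :)
fn:index-of((10, 20, 30), 20),     (: 2 :)
fn:deep-equal([ 1, 2 ], [ 1, 2 ]), (: true :)
0.0 eq -0.0                        (: true: positive and negative zero now compare as equal :)
```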
Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.15.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.15.0.jar package.
- Rumble now outputs error messages displaying the faulty line of code and pointing to the place of error.
- Machine Learning estimators and models can now run at scale (in parallel) on very large amounts of data. This is automatically detected.
- Many stability improvements in the Machine Learning library
- Machine Learning Pipelines are now supported with stages given as function items
- Static typing is now always done and used to optimize even more
- Initial (experimental) support for user-defined types with the JSound Compact syntax. Types can be used everywhere builtin types can be used (instance of, treat as, type annotations for variables...).
- New validate type expression to validate against user-defined types and (if the type is DF-compatible) to create object* instances as optimized dataframes.
- Features must be assembled with the VectorAssembler transformer before being used with an estimator or transformer (for example, at the start of a pipeline). featuresCol and inputCol must specify the name (as a string) of the assembled feature vector field. This is now fully consistent with the Spark ML framework.
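A user-defined type together with the new validate type expression might look as follows (a syntax sketch; the type name, field names, and file name are placeholders):

```jsoniq
declare type local:person as {
  "name" : "string",
  "age"  : "integer"
};

validate type local:person* {
  json-file("people.json")
}
```

If the type is DataFrame-compatible, the validated object* sequence is created as an optimized DataFrame under the hood.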
Note that Spark 2.4.x is no longer maintained. We provide rumbledb-1.14.0-for-spark-2.jar only for legacy purposes for a smooth transition, and recommend instead using Spark 3.0.x or 3.1.x with the rumbledb-1.14.0.jar package.
- Fixed performance issue when a big for clause follows other small clauses
- Fixed grouping and ordering of floats
- Fixed a bug that prevented grouping with keys of incompatible types when hashcodes collided.
- Experimental (and incomplete) support for XQuery 3.1 syntax (prefix queries with xquery version "3.1"; to activate)
- project() calls are pushed down if the argument is structured (e.g., coming from parquet-file(), etc).
- Performance improvements for round() and abs()
- Variable references ($x) are resolved quicker
- Support for general function types (including their signature) and type checking (including statically)
- When iterating on schema-based data (Parquet, Avro, structured-json-file()...) in a FLWOR expression, some let, for, where, group-by and order-by clauses will be automatically faster if they only involve literals, variable references, object/array lookups, and value comparison (native mapping to Spark SQL)
- Fixed several bugs in switch expressions
- Switch expressions and conditional expressions can handle/forward structured data faster (underlying DataFrames)
- experimental support for static typing (--static-typing yes) following the W3C standard.
- performance improvements in arithmetics, logics, comparison
- spaces are now supported in paths to json-file()
- HTTP URLs are now supported by unparsed-text() and unparsed-text-lines()
- yearMonthDuration, dayTimeDuration, hexBinary, and base64Binary can now be compared for inequality in addition to equality
- performance improvements for comparison
- the effective boolean value is now correctly taken in quantified expressions
- quantified expressions now work in parallel as well (they leverage the FLWOR iterators)
- support for floats
- sum(), avg() are now pushed down and work on large homogeneous as well as heterogeneous sequences
- stability improvements and improved conformance for comparison, arithmetics and casts
- dayTimeDuration and yearMonthDuration can now be compared
- all constructors are now available (semantics identical to cast as)
- switch and index-of no longer throw an error for incompatible types; this now follows the standard
- empty function bodies are now allowed (in which case the function is considered to return the empty sequence)
- variable names $null, $array, $object are now allowed
- annotate() can now automatically cast whenever it makes sense, and is thus more flexible
- the Item hierarchy is now flat, with a public Item interface available in the Rumble Java API, and individual classes providing the implementation, which should lead to a small performance boost with lighter method calls.
- fixed an issue (null pointer exception) when an ordering key is always the empty sequence
- constant predicate lookups with small numbers (<= materialization cap) are pushed down, e.g., json-file("...")
- General support at the parser level for any type QName. Prefixes like xs: and js: are now accepted but remain optional (e.g., xs:integer, js:null).
- an error is appropriately thrown if an order-by expression evaluates to more than one item or to a non-atomic item
- builtin functions can now be called with fn:, jn: and math: prefixes as well (depending on their namespace). It is still, however, possible to refer to them without prefix, i.e., this is backward compatible.
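For instance, a FLWOR query of the following shape can be mapped natively to Spark SQL when the input is structured and the clauses only involve literals, variable references, object/array lookups, and value comparisons (file and field names are placeholders):

```jsoniq
for $p in structured-json-file("people.json")
where $p.age ge 21
group by $name := $p.name
return { "name" : $name, "count" : count($p) }
```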
The main jar is for Spark 3, but there is another jar for Spark 2.
- Fixed navigation issue with structured datasets when objects are nested in arrays.
- Fixed a bug that prevented calling a user-defined function repeatedly in a FLWOR expression in some cases
- Verbose messages are now printed to stderr rather than stdout, for those who want to pipe the output in bash
- Bugfixes in unary expressions (an error is now thrown for more than one item, and multiple unary signs, allowed by the spec, are handled correctly)
- Big integers can now be cast from strings
- string() now returns serialized numbers consistent with JSON output
- typeswitch now correctly matches the empty sequence type
- improved stability for user-defined function calls consuming a DataFrame parameter. Seamless materialization for the ? and 1 cardinalities.
- max() and min() are now pushed down to Spark and work on big sequences
- +INF and -INF (doubles) are now serialized to strings correctly
- Fixed division by 0 on doubles to correctly produce +INF and -INF, and mod by 0 to produce NaN. idiv raises an error as per the spec.
- It is now possible to build INF, -INF, and NaN doubles by casting from a string literal.
- Fixed bug in the object lookup expression leading to a crash when the field to lookup depends on a variable, and the sequence of objects being looked up is partitioned on Spark. Same fix for array lookup expressions.
- Fixed a crash happening in a FLWOR expression in a group-by clause executed in parallel, when none of the variables before and including this group clause is used anywhere in the remainder of the FLWOR expression.
- Performance improvements in the processing of items.
- Performance improvement for distinct-values calls on heterogeneous sequences.
- support for W3C-standard functions unparsed-text, unparsed-text-lines (in parallel) and parse-json (all with arity 1 for now)
- Fixed a bug occasionally happening with JsonIter streaming by switching to another JSON parser (gson).
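The newly supported W3C-standard functions can be combined as in the following sketch (the file path and field names are placeholders):

```jsoniq
for $line in unparsed-text-lines("logs.txt")
let $record := parse-json($line)
where $record.status eq "error"
return $record
```

Because unparsed-text-lines() runs in parallel, this pattern scales to large text files.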
Interim release with the following fixes and improvements:
There is a new CLI parameter --deactivate-jsoniter-streaming, to be set to yes if any error appears regarding the JsonIter dependency, the library we use to parse JSON (the error in question being "com.jsoniter.spi.JsonException: javassist.CannotCompileException: by java.lang.ClassFormatError: class com.jsoniter.IterImpl cannot access its superclass com.jsoniter.IterImplForStreaming"). This flag deactivates streaming (i.e., avoids dynamic code generation by JsonIter) and thus avoids the error. This is a known issue with the Rumble Docker image, but it never happened on our own machines; we are actively investigating why the Rumble Docker image has this issue. Note that deactivating JsonIter streaming makes json-doc() unavailable after json-file() has been used in the same Rumble application (which is why JsonIter streaming is activated by default).
The public Rumble API (also accessible via the Rumble Maven dependency) now allows passing any lists of items as an external variable. You can thus gather the results of a query as a list of items, and put it back as the input of another query in Java as a host language.
- Left-outer equi-joins with let clauses: if you have two large tabular datasets, Rumble can nest one into the other with just a few lines of code, and fast.
- Inner equi-joins and generic joins with where clauses are detected.
- Renamed --result-size to --materialization-size to avoid confusion, and added more hints about --output-path for getting the complete output from a parallel query.
- New CLI options --output-format and --output-format-option:* for outputting structured output to formats other than JSON (Parquet, CSV...).
- New CLI option --number-of-output-partitions to repartition the output as desired
- New function local-text-file() to read a file as a sequence of string items, but without Spark parallelism (streaming instead). This makes Rumble faster for smaller files
- Performance improvements for FLWOR queries on structured data (Avro, Parquet, structured JSON, CSV...)
- Performance improvements when parallelism is not used at all
- Stability improvement for json-doc(), which will now also work after json-file() has been used.
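A left-outer equi-join with a let clause follows the pattern below, which Rumble detects and executes as a parallel join (file and field names are placeholders):

```jsoniq
for $c in json-file("customers.json")
let $orders :=
  for $o in json-file("orders.json")
  where $o.customerId eq $c.id
  return $o
return { "customer" : $c.name, "orders" : [ $orders ] }
```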
Interim release with small fixes
- Improve performance of joins whenever possible (quadratic -> linear)
- fixed a bug with non-exact averages with avg()
Note that Rumble is in beta. Use at your own risk.
- Support for joining two large datasets: joins are automatically detected if a for clause's expression is a predicate expression whose left-hand side can be evaluated independently of the preceding clauses; the right-hand side (the predicate) is the join criterion. Left outer joins (allowing empty) are also supported in parallel.
- outer joins ("allowing empty" in a for clause) are now supported both locally and in parallel.
- support for empty sequence order least/greatest prolog setter (for order by clauses)
- positional variables in for clauses are now supported both locally and in parallel (except for large-scale joins).
- arbitrary large integer literals are now supported (an error was thrown before beyond 32 bits)
- json-file() and json-doc() can both read over HTTP
- you can store your JSONiq modules on the Web and import them with an HTTP URL
- you can store your queries on the Web and execute them via the Rumble command line with their URL
- an error with the appropriate code is now thrown if a collation is specified that is not supported (the W3C standard requires support for at least the Unicode codepoint collation, which Rumble recognizes and supports).
- It is now possible to specify a hostname in the server mode (--host), and to filter for specific URI prefixes for security reasons (--allowed-uri-prefixes)
- big integers are now seamlessly supported: no more overflows, and arbitrary large integer literals are accepted in JSONiq code
- fixed display bugs in debug mode (--print-iterator-tree yes)
- fixed an error with local group-by queries nested inside local FLWORs
- fixed an error when counting items in a variable that was not a post-grouping variable, in parallelized FLWORs.
- fixed a bug encountered when a local iteration followed by a parallel for clause produced, and unioned, several Spark jobs internally.
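Positional variables in for clauses follow the standard JSONiq syntax, as in this sketch (the file name is a placeholder):

```jsoniq
for $o at $i in json-file("orders.json")
return { "position" : $i, "order" : $o }
```

The positional variable $i is bound to 1, 2, 3, ... both in local and in parallel execution (except for large-scale joins, as noted above).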
Important: The jar for Spark 3.0.0 does not have Laurelin (ROOT parser) support. We are waiting for a 3.0.0-compatible Laurelin release. If you need to query ROOT files, please use Spark 2.4.6.