Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MOD-3283: Initial support for multi-value numeric #2985

Merged
merged 42 commits into from
Sep 11, 2022
Merged

Conversation

oshadmi
Copy link
Collaborator

@oshadmi oshadmi commented Aug 9, 2022

Adding support for indexing and searching multiple numeric values.
Multiple values could be encountered with JSONPath leading to an array, or using JSONPath operators such as wildcard, recursive descent, slices, etc.

  • Multiple numeric values mixed with non-numerical values or non-scalar values are causing indexing failure,
    except for null values, which are skipped.

  • Sort - By first value

Other considerations:

  • Previous invariant of "one value <==> one doc" is no longer true and can now be "many values <==> one doc"
  • doc Id delta can now be zero (same docId as previous value)
    • Affecting encoder encodeNumeric and decoder readNumeric
    • See structs EncodingHeader and NumEncodingCommon for details
  • One Inverted Index Block can now have a larger Buffer when indexing multi numeric values
    • We add a new block when a number of entries is reached (not docs), but only when another doc is encountered
      • Entries from a single doc do not span more than a single block
      • InvertedIndex_WriteEntryGeneric
      • Easier change and with smaller impact
      • Maybe better locality of values
      • If we change that, and entries from the same doc can span more than a single block - need to adjust IndexReader_SkipToBlock (and maybe more)
  • Multiple values on same doc - currently we keep repetitions (not only distinct values)
    • In next phases, with more elaborated predicates, it is also a question whether duplicates are considered or not, e.g., something like at_least(2) with range [10,20] should match value as [15, 15] or just something like [15, 17]
    • Currently we use an implicit Any predicate
      • For this predicate, there's no need to keep duplicates
        • Consider "uniqifying" similar values (e.g., use khash_t)

Related fixes:

  • This PR also fixes GC with regard to multi numeric values
    • Also fixes FT.INFO return value total_inverted_index_blocks which was not updated in GC, also for HASH and single value JSON (see python test checkInfoAndGC)

Followup PRs:

  • Add documentation on indexing and searching multi numeric #3067
    • Also add example for searching a numeric attribute with ALL values in a specific range, e.g., all values are in the range [-20, 100] (inclusive)
      FT.SEARCH idx:all '-@val:[-inf (-20] -@val:[(100 +inf]'
  • Remove code that handles loading old rdb's (also add this to Release Notes MOD-4081)
    • Remove NumericIndexType_RdbSave and NumericIndexType_RdbLoad

@codecov
Copy link

codecov bot commented Aug 10, 2022

Codecov Report

Merging #2985 (21266e1) into master (2744b3b) will decrease coverage by 0.16%.
The diff coverage is 69.13%.

@@            Coverage Diff             @@
##           master    #2985      +/-   ##
==========================================
- Coverage   82.08%   81.91%   -0.17%     
==========================================
  Files         180      181       +1     
  Lines       29713    29960     +247     
==========================================
+ Hits        24389    24543     +154     
- Misses       5324     5417      +93     
Impacted Files Coverage Δ
src/inverted_index.h 100.00% <ø> (ø)
src/numeric_index.h 0.00% <0.00%> (ø)
src/numeric_index.c 75.17% <17.64%> (-6.11%) ⬇️
src/redis_index.c 50.87% <40.00%> (ø)
src/fork_gc.c 56.14% <58.33%> (-0.17%) ⬇️
src/inverted_index.c 68.69% <62.76%> (-1.53%) ⬇️
src/json.c 87.54% <84.48%> (-0.49%) ⬇️
src/document.c 74.52% <91.66%> (+0.74%) ⬆️
src/debug_commads.c 87.60% <93.90%> (+0.99%) ⬆️
src/document_basic.c 85.09% <100.00%> (+0.20%) ⬆️
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Contributor

@ashtul ashtul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!
Some small comments.

src/document.c Outdated Show resolved Hide resolved
src/document.c Show resolved Hide resolved
src/inverted_index.c Outdated Show resolved Hide resolved
src/inverted_index.c Outdated Show resolved Hide resolved
src/json.c Outdated Show resolved Hide resolved
src/numeric_index.c Outdated Show resolved Hide resolved
src/numeric_index.c Outdated Show resolved Hide resolved
tests/pytests/test_json_multi_numeric.py Outdated Show resolved Hide resolved
Copy link
Contributor

@ashtul ashtul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

src/document.c Outdated Show resolved Hide resolved
src/document.c Outdated Show resolved Hide resolved
@@ -351,11 +351,16 @@ void Document_Clear(Document *d) {
rm_free(field->strval);
break;
case FLD_VAR_T_ARRAY:
for (int i = 0; i < field->arrayLen; ++i) {
rm_free(field->multiVal[i]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we freed the multiVal anyway, now there is a case where we do not free it at all, is that OK?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We free it in all currently supported types for array: TEXT, TAG and now NUMERIC (in the else clause)
TEXT and TAG use field multiVal and NUMERIC use field arrNumval

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we reach this code and not enter one of those conditions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, if we reach here - we enter one of those conditions.
We use unionType FLD_VAR_T_ARRAY only with TEXT TAG and NUMERIC
(VECTOR is using FLD_VAR_T_CSTR - so not reaching here)

We can add an assert.

rafie
rafie previously approved these changes Sep 1, 2022
Copy link
Contributor

@ashtul ashtul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few last comments

src/inverted_index.c Outdated Show resolved Hide resolved
src/inverted_index.c Outdated Show resolved Hide resolved
tests/cpptests/test_cpp_index.cpp Show resolved Hide resolved
tests/cpptests/test_cpp_index.cpp Show resolved Hide resolved
src/inverted_index.c Outdated Show resolved Hide resolved
src/inverted_index.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@MeirShpilraien MeirShpilraien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one question, but not critical.

@oshadmi oshadmi merged commit e81502d into master Sep 11, 2022
@oshadmi oshadmi deleted the omer_multi_numeric branch September 11, 2022 16:05
oshadmi added a commit that referenced this pull request Sep 21, 2022
* WIP [skip ci] allow zero as doc id delta

* WIP [skip ci] fix encoder, prepare FieldIndexerData

* WIP [skip ci] ingest multi values and avoid returning same doc

* add missing init + adjust testMultiValueErrors

* avoid test failure due to slow background indexing on CI

* add basic tests

* Per review by Ariel

* add range tests

* Fix duplicated results

* skip tests with FT.DEBUG on cluster

* remove old TODO

* Skip writing/reading zero delta to/from buffer

* add debug dump for NumericRangeTree

* Add testInvertedIndexMultipleBlocks

* Add missing free in InvertedIndex_Dump

* add cpp test testNumericEncodingMulti and fix cpp tests with inverted index with zero delta

* add cpp test testRangeIteratorMulti

* Skip test with DEBUG command on cluster

* Update TotalIIBlocks in GC

* Fix GC with multi and add test

* add test for SORTBY

* Cleanup commented out printf/debug code

* Fix FT.INFO num_records and adjust GC

* fix test for coord

* split NumericRange by InvertedIndex numEntries instead of numDocs

* improve coverage of NumericRangeNode_Balance

* Avoid calling DocTable_Exists for the same doc

* Add subcommand FT.DEBUG DUMP_NUMIDXTREE

* Add test for num_records

* Remove unused member `card` of NumericRangeTree

* Fix testRangeIteratorMulti for change in commit a982d18

(split NumericRange by InvertedIndex numEntries instead of numDocs)

* fix per code review

* Add numEntries to IndexBlock and split by it

* Cosmetic fixes per Ariel's review

(cherry picked from commit e81502d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants