create_index: No validation when split_length <= split_overlap [BUG] #605

Open
@pandu-k

Describe the bug
An internal error occurs on add_docs when split_length <= split_overlap, because the window step (split_length - split_overlap) is then less than 1. This issue was raised on our forums.

Reproducing the issue
To reproduce:

# create index: 

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index -d '{ "index_defaults": { "text_preprocessing": { "split_length": 2, "split_overlap": 5, "split_method": "word" }, "treat_urls_and_pointers_as_images": false, "model": "hf/all_datasets_v4_MiniLM-L6", "normalize_embeddings": true, "image_preprocessing": { "patch_method": null }, "ann_parameters" : { "space_type": "cosinesimil", "parameters": { "ef_construction": 128, "m": 16 } } }, "number_of_shards": 3, "number_of_replicas": 0 }'

# add docs

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index/documents -d '{ "documents" : [{"_id":"1","title":"Fat cat","description":"The fat cat sits on the mat in the sunshine"},{"_id":"2","title":"Brown fox","description":"The quick brown fox jumps over the lazy dog"}], "tensorFields" : ["description"] }'

Yields this error:

Marqo logs:

  File "/app/src/marqo/tensor_search/tensor_search.py", line 522, in add_documents
    content_chunks = text_processor.split_text(field_content, split_by=split_by,
  File "/app/src/marqo/s2_inference/processing/text.py", line 147, in split_text
    segments = list(windowed(split_text, n=split_length, step=split_length - split_overlap))
  File "/usr/local/lib/python3.8/dist-packages/more_itertools/more.py", line 841, in windowed
    raise ValueError('step must be >= 1')
ValueError: step must be >= 1

The response returned to the client is unhelpful: Internal Server Error.
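The arithmetic behind the traceback can be checked in isolation. The snippet below is only an illustration: `window_step` is a hypothetical helper, not a function in the codebase, and it mirrors the `step=split_length - split_overlap` expression shown in the traceback above.

```python
def window_step(split_length: int, split_overlap: int) -> int:
    """Return the step that the traceback shows being passed to windowed().

    Hypothetical helper for illustration; it only reproduces the
    subtraction visible in the traceback.
    """
    return split_length - split_overlap


# Settings from the reproduction above: split_length=2, split_overlap=5.
step = window_step(split_length=2, split_overlap=5)
print(step)       # -3
print(step >= 1)  # False, which is the condition windowed() rejects

# Equal values also fail the check, since the step becomes 0.
print(window_step(split_length=5, split_overlap=5) >= 1)  # False
```

Any pair of settings where split_length is not strictly greater than split_overlap produces a step below 1, so the equal case is affected as well as the smaller case.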

Expected behavior
Index-creation-time validation should prevent creating an index with these problematic settings.
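A minimal sketch of what such a check could look like, assuming the `text_preprocessing` settings arrive as a dictionary with the `split_length` and `split_overlap` keys used in the request above. The function name, the error class, and the handling of missing keys are all hypothetical and are not taken from the existing code.

```python
from typing import Any, Mapping


class InvalidIndexSettingsError(ValueError):
    """Hypothetical error type; a service could map it to a 4xx response."""


def validate_text_preprocessing(text_preprocessing: Mapping[str, Any]) -> None:
    """Reject settings whose window step would be less than 1.

    Assumption: missing keys are filled in with defaults elsewhere,
    so this sketch skips the check when either value is absent.
    """
    split_length = text_preprocessing.get("split_length")
    split_overlap = text_preprocessing.get("split_overlap")
    if split_length is None or split_overlap is None:
        return

    if split_length <= split_overlap:
        raise InvalidIndexSettingsError(
            "split_length must be greater than split_overlap, "
            f"got split_length={split_length}, split_overlap={split_overlap}"
        )


# The settings from the reproduction above would be rejected up front.
try:
    validate_text_preprocessing(
        {"split_length": 2, "split_overlap": 5, "split_method": "word"}
    )
except InvalidIndexSettingsError as err:
    print(f"rejected: {err}")
```

Running a check like this at index creation would surface the problem with a descriptive message, before any documents are added.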
