split docs instead of truncating on tokenizer max_input_tokens #253

K-Schubert · 2024-07-02T14:58:34Z

Description

Currently, when the input text has more tokens than the embedding model's max_input_tokens, the call to the embedding model truncates the text to that limit, so everything past the limit is never embedded.

It would be better to split the text into chunks, embed each chunk separately, and keep track of chunk IDs and how the chunks relate to one another, so that no data is lost.
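A minimal sketch of what the splitting could look like, not the project's implementation. The names `Chunk` and `split_doc`, the `"<doc_id>:<index>"` ID scheme, and the `prev_id`/`next_id` links are all hypothetical. The default whitespace tokenizer is only a stand-in: in practice the embedding model's own tokenizer would need to be passed in so that the token count matches max_input_tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    chunk_id: str            # hypothetical ID scheme: "<doc_id>:<index>"
    doc_id: str              # ID of the document the chunk came from
    index: int               # position of the chunk within the document
    text: str                # text to send to the embedding model
    prev_id: Optional[str]   # ID of the preceding chunk, None for the first
    next_id: Optional[str]   # ID of the following chunk, None for the last


def split_doc(
    doc_id: str,
    text: str,
    max_input_tokens: int,
    tokenize: Callable[[str], List[str]] = str.split,       # stand-in tokenizer
    detokenize: Callable[[List[str]], str] = " ".join,      # stand-in detokenizer
) -> List[Chunk]:
    """Split text into consecutive chunks of at most max_input_tokens tokens."""
    if max_input_tokens < 1:
        raise ValueError("max_input_tokens must be at least 1")

    tokens = tokenize(text)
    # Consecutive, non-overlapping windows over the token sequence.
    windows = [
        tokens[start:start + max_input_tokens]
        for start in range(0, len(tokens), max_input_tokens)
    ]

    last = len(windows) - 1
    return [
        Chunk(
            chunk_id=f"{doc_id}:{index}",
            doc_id=doc_id,
            index=index,
            text=detokenize(window),
            prev_id=f"{doc_id}:{index - 1}" if index > 0 else None,
            next_id=f"{doc_id}:{index + 1}" if index < last else None,
        )
        for index, window in enumerate(windows)
    ]
```

Each chunk would then be embedded on its own, with `doc_id`, `index`, `prev_id` and `next_id` stored next to the vector so the original order can be rebuilt at retrieval time. With a subword tokenizer, decoding a window and tokenizing it again may not give exactly the same token count, so leaving a small margin below max_input_tokens may be needed.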

K-Schubert added the feature label Jul 2, 2024

K-Schubert added this to the MVP2 milestone Jul 2, 2024

K-Schubert self-assigned this Jul 2, 2024

