@mskarlin commented Dec 5, 2024

When a corpus mixes different flavors of docs (say, JSON or code texts alongside PDF-parsed paper text), the embedding ranking won't be based on semantics alone. The JSON texts may all cluster together, and even when they are highly relevant, they can be embedded farther from a natural-language query purely because of their structure. Our summary step corrects for this downstream of the embedding ranking, but in this case some good texts never get the chance to reach the summary step.

To deal with this, I added an optional partitioning_fn input to the Docs object. It lets a user supply a function that maps each Text object's Doc to an integer category. The similarity search in the retrieval step is then split by category: each partition is ranked independently, and the per-partition rankings are interleaved into a final output. This way each "partition" gets a shot at being included in the summary step.
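
A minimal sketch of the idea, not the code in this PR: the names `partitioned_rank`, `by_doc_type`, and the `(name, embedding, doc_type)` tuples are hypothetical stand-ins for the Docs/Text objects, and round-robin interleaving is an assumption about how the per-partition rankings are merged.

```python
from collections import defaultdict
from itertools import zip_longest
from math import sqrt
from typing import Callable, Sequence

# Hypothetical stand-in for a Text object: (name, embedding, doc_type).
Item = tuple[str, Sequence[float], str]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def partitioned_rank(
    query: Sequence[float],
    items: Sequence[Item],
    partitioning_fn: Callable[[Item], int],
    k: int,
) -> list[Item]:
    """Rank each partition independently, then interleave the rankings."""
    # Split the candidates by the integer category the user's function returns.
    buckets: dict[int, list[Item]] = defaultdict(list)
    for item in items:
        buckets[partitioning_fn(item)].append(item)

    # Rank every partition on its own, most similar first.
    ranked = [
        sorted(bucket, key=lambda it: cosine(query, it[1]), reverse=True)
        for _, bucket in sorted(buckets.items())
    ]

    # Round-robin interleave (assumed merge order): best of each partition,
    # then second best of each, and so on, skipping exhausted partitions.
    interleaved = [
        it for group in zip_longest(*ranked) for it in group if it is not None
    ]
    return interleaved[:k]


def by_doc_type(item: Item) -> int:
    """Hypothetical partitioning function: JSON texts get their own category."""
    return 1 if item[2] == "json" else 0


query = [1.0, 0.0]
items: list[Item] = [
    ("paper-1", [1.0, 0.0], "paper"),
    ("paper-2", [0.9, 0.1], "paper"),
    ("paper-3", [0.8, 0.2], "paper"),
    ("json-1", [0.1, 0.9], "json"),
    ("json-2", [0.3, 0.7], "json"),
]

top = partitioned_rank(query, items, by_doc_type, k=3)
print([name for name, _, _ in top])  # ['paper-1', 'json-2', 'paper-2']
```

Without partitioning, the top three here would all be paper texts; with it, the best JSON text gets a slot even though its embedding sits far from the query.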

@dosubot (bot) added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels Dec 5, 2024
@mskarlin changed the title from "Add partitioning func capabilities to allow doc types based embedding ranking" to "Add partitioning func capabilities to allow doc-types-based embedding ranking" Dec 5, 2024
@mskarlin requested a review from nadolskit December 6, 2024 04:26
@mskarlin requested a review from jamesbraza December 6, 2024 18:52

@jamesbraza left a comment

@dosubot (bot) added the lgtm (This PR has been approved by a maintainer) label Dec 6, 2024
@mskarlin merged commit e3623ed into main Dec 10, 2024
5 checks passed
@mskarlin deleted the add-partioning-funcs branch December 10, 2024 00:05
@jamesbraza mentioned this pull request Dec 10, 2024
Labels: enhancement, lgtm, size:L