@danilojsl
Contributor

Description

This PR introduces Chunking Strategies, which enhance the Partition and PartitionTransformer components by dividing content into meaningful units based on the document's structure and content.

Incorporated strategies:

  • basic: A general-purpose chunker that segments data into coherent chunks based on character limits.
  • by_title: A structure-aware chunker tailored for documents with headings, tables, and mixed semantic elements.
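
To illustrate the difference between the two strategies, here is a minimal, library-independent sketch in plain Python. It is not the implementation in this PR and does not use the Partition or PartitionTransformer API. The function names `chunk_basic` and `chunk_by_title`, the `max_characters` parameter, and the `(element_type, text)` tuple shape with `"Title"` and `"NarrativeText"` labels are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical element shape for this sketch: (element_type, text).
Element = Tuple[str, str]


def chunk_basic(text: str, max_characters: int = 100) -> List[str]:
    """Greedily pack whole words into chunks of at most max_characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_characters:
            current = candidate
            continue
        # The word does not fit: close the current chunk and start a new one.
        if current:
            chunks.append(current)
        # A single word longer than the limit is hard-split.
        while len(word) > max_characters:
            chunks.append(word[:max_characters])
            word = word[max_characters:]
        current = word
    if current:
        chunks.append(current)
    return chunks


def chunk_by_title(elements: List[Element], max_characters: int = 100) -> List[str]:
    """Start a new section at every title, then chunk each section on its own.

    Text from different sections never ends up in the same chunk.
    """
    sections: List[List[str]] = []
    for kind, text in elements:
        if kind == "Title" or not sections:
            sections.append([])
        sections[-1].append(text)

    chunks: List[str] = []
    for section in sections:
        chunks.extend(chunk_basic(" ".join(section), max_characters))
    return chunks


if __name__ == "__main__":
    doc = [
        ("Title", "Intro"),
        ("NarrativeText", "Chunking splits documents into units."),
        ("Title", "Usage"),
        ("NarrativeText", "Pick a strategy per document type."),
    ]
    for chunk in chunk_by_title(doc, max_characters=60):
        print(chunk)
```

In this sketch, the character-limit chunker only looks at length, while the title-based chunker first groups elements into sections and then applies the same length limit within each section, which is what keeps unrelated topics out of a single chunk.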

Motivation and Context

  • Improved Retrieval Quality: Chunking respects natural topic boundaries, resulting in more semantically coherent and contextually relevant retrieval results.
  • Reduced Noise: By preventing the inclusion of unrelated or partial fragments, the updated chunking logic minimizes irrelevant content in retrieval pipelines.

How Has This Been Tested?

  • Local Tests
  • Google Colab
  • Databricks

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@danilojsl danilojsl requested a review from DevinTDHa June 3, 2025 21:07
@danilojsl danilojsl self-assigned this Jun 3, 2025
@danilojsl danilojsl requested a review from maziyarpanahi June 3, 2025 21:07
@DevinTDHa DevinTDHa changed the base branch from master to release/603-release-candidate June 6, 2025 09:31
Member

@DevinTDHa DevinTDHa left a comment

Just a small question; otherwise, looks good to me!

@DevinTDHa DevinTDHa merged commit 5ed5063 into release/603-release-candidate Jun 9, 2025
4 of 6 checks passed

3 participants