CaliLuke/julienne

Julienne

Julienne is a Rust library for cutting text into range-preserving chunks for retrieval, embedding, indexing, search, and context-building pipelines.

It is deliberately a chunking library, not an ingestion framework. Bring strings that you already extracted from documents, Markdown, HTML/XML, SQL, prose, or source code; Julienne returns chunks with explicit boundaries and provenance.

Documentation

Features

  • Character-based splitters:
    • CharacterTextSplitter
    • RecursiveCharacterTextSplitter
  • Sentence-aware splitters:
    • SentenceChunker
  • Semchunk-inspired recursive splitter:
    • SemchunkSplitter
  • Embedding-boundary splitter:
    • SemanticChunker
  • Structure-aware splitters:
    • MarkdownChunker
    • HtmlChunker
    • CodeChunker behind the code feature
  • Pluggable length function (LengthFn) for character, word, or tokenizer-based sizing.
  • Pluggable embedding function (EmbeddingFn) for semantic chunking.
  • Zero-copy structured chunks with byte and character offsets.

Splitters

Choosing A Chunker

  • SemchunkSplitter: the general-purpose default for natural language and mixed prose; it applies a punctuation-aware delimiter hierarchy before falling back to smaller units.
  • RecursiveCharacterTextSplitter: when you want LangChain-style separator behavior.
  • SentenceChunker: when sentence boundaries are mandatory and embeddings are not available.
  • SemanticChunker: when a domain-relevant embedder can identify topic shifts; it is only as good as the embedding signal and falls back to sentence-sized packing when no embedder is configured.
  • try_split_chunks: with fallible embedders, when provider failures should be returned as ChunkError instead of panicking through the infallible convenience API.
  • MarkdownChunker or HtmlChunker: when the input format carries useful block structure.
  • CodeChunker: when source code should be split by parser-recognized Rust or Python AST nodes.

CharacterTextSplitter

LangChain-style single-separator splitting followed by an overlap-aware merge.

Best for simple and predictable chunk boundaries.

RecursiveCharacterTextSplitter

LangChain-style recursive fallback over separators ("\n\n" -> "\n" -> " " -> character fallback).

Best general-purpose default when text structure varies.

SentenceChunker

Sentence-aware chunking with sentence-preserving overlap backtracking.

Supports:

  • min_characters_per_sentence
  • min_sentences_per_chunk
  • custom sentence delimiters

SemchunkSplitter

Semchunk-inspired recursive splitter with punctuation-aware hierarchy.

Supports:

  • adaptive merge (binary-search span fitting)
  • optional memoization (memoize)
  • stricter delimiter precedence mode (strict_mode)
  • configurable length_fn (including tokenizers)

SemanticChunker

Embedding-similarity boundary detection:

  • sentence windows
  • cosine similarity between adjacent windows
  • Savitzky-Golay smoothing
  • local minima boundary detection
  • optional skip-window reconnection for tangential asides

Falls back to sentence-based greedy splitting if no embedder is configured.

Quick Start

use julienne::{RecursiveCharacterTextSplitter, SentenceChunker, SemchunkSplitter, SemanticChunker};

let text = "Hello world. This is a document with multiple sentences.";

let recursive = RecursiveCharacterTextSplitter::new(500, 50);
let recursive_chunks = recursive.split_text(text);

let sentence = SentenceChunker::new(500, 50);
let sentence_chunks = sentence.split_text(text);

let semchunk = SemchunkSplitter::new(500, 50);
let semchunk_chunks = semchunk.split_text(text);

let semantic = SemanticChunker::new(500, 50);
let semantic_chunks = semantic.split_text(text);

For a fuller walkthrough, see Getting started.

Structured Chunks And Offsets

Use split_chunks when downstream code needs provenance:

use julienne::{RecursiveCharacterTextSplitter, TextChunk};

let text = "Intro.\n\nDetails with café.";
let splitter = RecursiveCharacterTextSplitter::new(80, 10);
let chunks: Vec<TextChunk<'_>> = splitter.split_chunks(text);

for chunk in chunks {
    assert_eq!(&text[chunk.start_byte..chunk.end_byte], chunk.text);
}

TextChunk contains text, start_byte, end_byte, start_char, end_char, measured_length, and optional metadata. The byte range always indexes the original input passed to the splitter. The character offsets are counted from the start of that same input.

split_text is the owned-string convenience API. split_chunks collects structured chunks. chunks is the iterator-style API for infallible splitters that can expose structured output without changing the caller-facing contract.

Token-Aware Length Example

use julienne::SemchunkSplitter;

fn word_len(s: &str) -> usize {
    s.split_whitespace().count()
}

let splitter = SemchunkSplitter::builder()
    .chunk_size(90)
    .chunk_overlap(15)
    .length_fn(std::sync::Arc::new(word_len))
    .build()
    .unwrap();

let chunks = splitter.split_text("Some longer text...");

Typed Sizing And Token Windows

use julienne::{ChunkConfig, ChunkSizer, WordSizer, TokenBoundaryProvider, TokenChunker, TokenSpan};

let word_config = ChunkConfig::new(90, 15, WordSizer);
assert_eq!(word_config.sizer.size("one two three"), 3);

#[derive(Clone)]
struct WhitespaceTokens;

impl TokenBoundaryProvider for WhitespaceTokens {
    fn token_spans(&self, input: &str) -> Result<Vec<TokenSpan>, julienne::ChunkError> {
        let mut spans = Vec::new();
        let mut start = None;
        for (idx, ch) in input.char_indices() {
            if ch.is_whitespace() {
                if let Some(s) = start.take() {
                    spans.push(TokenSpan { start_byte: s, end_byte: idx });
                }
            } else if start.is_none() {
                start = Some(idx);
            }
        }
        if let Some(s) = start {
            spans.push(TokenSpan { start_byte: s, end_byte: input.len() });
        }
        Ok(spans)
    }
}

let token_chunks = TokenChunker::new(WhitespaceTokens, 3, 1)
    .unwrap()
    .try_split_text("one two three four five")
    .unwrap();
assert_eq!(token_chunks, vec!["one two three", "three four five"]);

Optional tokenizer integrations are feature-gated:

  • tiktoken-rs enables julienne::token::tiktoken::TiktokenBoundaryProvider.
  • tokenizers enables julienne::token::huggingface::HuggingFaceBoundaryProvider.
  • unicode-segmentation enables GraphemeSizer and UnicodeWordSizer.

#[cfg(feature = "tiktoken-rs")]
{
    use julienne::token::tiktoken::TiktokenBoundaryProvider;
    use julienne::TokenChunker;

    let provider = TiktokenBoundaryProvider::new(tiktoken_rs::cl100k_base().unwrap());
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using tiktoken.")
        .unwrap();
}
#[cfg(feature = "tokenizers")]
{
    use julienne::token::huggingface::HuggingFaceBoundaryProvider;
    use julienne::TokenChunker;
    use tokenizers::Tokenizer;

    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    let provider = HuggingFaceBoundaryProvider::new(tokenizer);
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using Hugging Face tokenizers.")
        .unwrap();
}

Structure-Aware Chunking

use julienne::{MarkdownChunker, HtmlChunker};

let markdown = MarkdownChunker::new(500, 50)
    .unwrap()
    .split_text("# Title\n\nA paragraph.\n\n```rust\nfn main() {}\n```");

let html = HtmlChunker::new(500, 50)
    .unwrap()
    .split_text("<section><h1>Title</h1><p>Body</p></section>");

MarkdownChunker preserves headings, paragraphs, lists, and fenced code blocks. HtmlChunker works on already-extracted HTML/XML strings and uses block-level tag boundaries; it does not fetch pages, sanitize markup, remove boilerplate, or perform readability extraction.

With --features code, CodeChunker uses tree-sitter parsers for Rust and Python and returns explicit ChunkError values for parser failures or oversized semantic nodes.

SemanticChunker Example

use julienne::SemanticChunker;

fn simple_embedding(text: &str) -> Vec<f32> {
    let lower = text.to_lowercase();
    let a = ["sql", "table", "vectorizer"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    let b = ["weather", "rain", "forecast"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    vec![a, b]
}

let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    .window_size(3)
    .skip_window(1)
    .reconnect_similarity_threshold(0.75)
    .max_aside_length(512)
    .embedding_fn(std::sync::Arc::new(simple_embedding))
    .build()
    .unwrap();

let chunks = chunker.split_text("...");

Testing

cargo test

Integration tests use fixtures under:

  • examples/summarize_article.sql
  • examples/embeddings_from_documents/documents/pgai.md

Benchmarks

cargo bench --bench splitters_bench

Code chunker benchmarks are feature-gated:

cargo bench --bench splitters_bench --features code -- splitters_code

The benchmark compares all splitter strategies on markdown and SQL fixtures, including word-count and tiktoken length profiles for semchunk.

API Contracts

Structured chunk APIs return TextChunk with stable source offsets. Public splitter types and builders are Send + Sync + Clone when their configured handles are. Infallible splitters expose split_text, split_chunks, and iterator-style chunks where available. Fallible integrations use try_* methods and return ChunkError; this includes batch embedding and tree-sitter code parsing.

Scope And Non-Goals

This crate is a chunking library. Compared with Chonkie, the focus is Rust-native range-preserving chunks, explicit fallibility, and configurable splitter strategies rather than a Python-first experimentation surface. Compared with Chunkr, this crate does not try to be document intelligence infrastructure: no OCR, hosted API, document loading service, layout extraction, vector database handshake, or ingestion pipeline is included. Bring already-extracted text, Markdown, HTML/XML, or source code strings; the crate returns chunks.
