Julienne is a Rust library for cutting text into range-preserving chunks for retrieval, embedding, indexing, search, and context-building pipelines.
It is deliberately a chunking library, not an ingestion framework. Bring strings that you already extracted from documents, Markdown, HTML/XML, SQL, prose, or source code; Julienne returns chunks with explicit boundaries and provenance.
- Getting started
- Chunker guide
- Structured chunks and offsets
- Sizing and token windows
- Semantic chunking
- Structure-aware chunking
- API contracts
- Release and quality gates
- Character-based splitters: `CharacterTextSplitter`, `RecursiveCharacterTextSplitter`
- Sentence-aware splitter: `SentenceChunker`
- Semchunk-inspired recursive splitter: `SemchunkSplitter`
- Embedding-boundary splitter: `SemanticChunker`
- Structure-aware splitters: `MarkdownChunker`, `HtmlChunker`, `CodeChunker` (behind the `code` feature)
- Pluggable length function (`LengthFn`) for character, word, or tokenizer-based sizing
- Pluggable embedding function (`EmbeddingFn`) for semantic chunking
- Zero-copy structured chunks with byte and character offsets
Use `SemchunkSplitter` as the general-purpose default for natural language and
mixed prose: it applies a punctuation-aware delimiter hierarchy before falling
back to smaller units. Use `RecursiveCharacterTextSplitter` when you want
LangChain-style separator behavior. Use `SentenceChunker` when sentence
boundaries are mandatory and embeddings are not available. Use
`SemanticChunker` when a domain-relevant embedder can identify topic shifts; it
is only as good as the embedding signal and falls back to sentence-sized packing
when no embedder is configured. Use `try_split_chunks` with fallible embedders
when provider failures should be returned as `ChunkError` instead of panicking
through the infallible convenience API. Use `MarkdownChunker` or `HtmlChunker`
when the input format carries useful block structure. Use `CodeChunker` when
source code should be split by parser-recognized Rust or Python AST nodes.
LangChain-style single-separator splitting + merge with overlap.
Best for simple and predictable chunk boundaries.
LangChain-style recursive fallback over separators (`\n\n` -> `\n` -> `" "` -> character fallback).
Best general-purpose default when text structure varies.
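The recursive-fallback idea can be sketched in plain Rust. This is an illustrative toy, not the crate's implementation: try the coarsest separator first, recurse into any piece that is still too long, and hard-split on characters when no separator is left.

```rust
// Illustrative sketch of recursive separator fallback (not the crate's code).
fn recursive_split(text: &str, seps: &[&str], max_len: usize) -> Vec<String> {
    if text.len() <= max_len {
        return vec![text.to_string()];
    }
    match seps.split_first() {
        // Split on the coarsest remaining separator, then recurse with the finer ones.
        Some((&sep, rest)) => text
            .split(sep)
            .filter(|piece| !piece.is_empty())
            .flat_map(|piece| recursive_split(piece, rest, max_len))
            .collect(),
        // No separators left: hard-split on character boundaries.
        None => text
            .chars()
            .collect::<Vec<_>>()
            .chunks(max_len)
            .map(|cs| cs.iter().collect())
            .collect(),
    }
}
```

For example, `recursive_split("one two\nthree four\n\nfive", &["\n\n", "\n", " "], 9)` yields `["one two", "three", "four", "five"]`: the paragraph split fires first, and only the oversized middle piece falls through to finer separators.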
Sentence-aware chunking with sentence-preserving overlap backtracking.
Supports:
- `min_characters_per_sentence`
- `min_sentences_per_chunk`
- custom sentence delimiters
Semchunk-inspired recursive splitter with punctuation-aware hierarchy.
Supports:
- adaptive merge (binary-search span fitting)
- optional memoization (`memoize`)
- stricter delimiter precedence mode (`strict_mode`)
- configurable `length_fn` (including tokenizers)
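Binary-search span fitting can be illustrated with a standalone sketch. The helper below is hypothetical, not the crate's internals: given unit strings starting at index `from`, it binary-searches the largest count of adjacent units whose joined length still fits the chunk size, instead of growing the span one unit at a time.

```rust
// Sketch of binary-search span fitting (an assumption about the technique,
// not the crate's actual code). Requires `from < units.len()`.
fn fit_span(units: &[&str], from: usize, chunk_size: usize) -> usize {
    let joined_len = |n: usize| units[from..from + n].join(" ").len();
    let (mut lo, mut hi) = (1, units.len() - from);
    if joined_len(hi) <= chunk_size {
        return hi; // Everything remaining fits.
    }
    // Invariant: lo fits (or is a single oversized unit we must take anyway),
    // hi does not fit.
    while hi - lo > 1 {
        let mid = (lo + hi) / 2;
        if joined_len(mid) <= chunk_size {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    lo
}
```

With units `["aa", "bb", "cc", "dd"]` and a budget of 8, the search settles on 3 units (`"aa bb cc"` is exactly 8 characters); adding `"dd"` would overflow.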
Embedding-similarity boundary detection:
- sentence windows
- cosine similarity between adjacent windows
- Savitzky-Golay smoothing
- local minima boundary detection
- optional skip-window reconnection for tangential asides
Falls back to sentence-based greedy splitting if no embedder is configured.
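The core of the boundary detection can be sketched in plain Rust (illustrative only, not the crate's code): compute cosine similarity between adjacent window embeddings, then treat local minima in the similarity curve as candidate chunk boundaries.

```rust
// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// Indices where similarity dips below both neighbors: candidate boundaries.
fn local_minima(sims: &[f32]) -> Vec<usize> {
    (1..sims.len().saturating_sub(1))
        .filter(|&i| sims[i] < sims[i - 1] && sims[i] < sims[i + 1])
        .collect()
}
```

In the real pipeline the similarity curve is smoothed (Savitzky-Golay) before minima are taken, which suppresses spurious single-point dips.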
```rust
use julienne::{RecursiveCharacterTextSplitter, SentenceChunker, SemchunkSplitter, SemanticChunker};

let text = "Hello world. This is a document with multiple sentences.";

let recursive = RecursiveCharacterTextSplitter::new(500, 50);
let recursive_chunks = recursive.split_text(text);

let sentence = SentenceChunker::new(500, 50);
let sentence_chunks = sentence.split_text(text);

let semchunk = SemchunkSplitter::new(500, 50);
let semchunk_chunks = semchunk.split_text(text);

let semantic = SemanticChunker::new(500, 50);
let semantic_chunks = semantic.split_text(text);
```

For a fuller walkthrough, see Getting started.
Use `split_chunks` when downstream code needs provenance:

```rust
use julienne::{RecursiveCharacterTextSplitter, TextChunk};

let text = "Intro.\n\nDetails with café.";
let splitter = RecursiveCharacterTextSplitter::new(80, 10);
let chunks: Vec<TextChunk<'_>> = splitter.split_chunks(text);
for chunk in chunks {
    assert_eq!(&text[chunk.start_byte..chunk.end_byte], chunk.text);
}
```

`TextChunk` contains `text`, `start_byte`, `end_byte`, `start_char`, `end_char`,
`measured_length`, and optional `metadata`. The byte range always indexes the
original input passed to the splitter, and the character offsets are counted
from the start of that same input.
`split_text` is the owned-string convenience API. `split_chunks` collects
structured chunks. `chunks` is the iterator-style API for infallible splitters
that can expose structured output without changing the caller-facing contract.
```rust
use julienne::SemchunkSplitter;

fn word_len(s: &str) -> usize {
    s.split_whitespace().count()
}

let splitter = SemchunkSplitter::builder()
    .chunk_size(90)
    .chunk_overlap(15)
    .length_fn(std::sync::Arc::new(word_len))
    .build()
    .unwrap();
let chunks = splitter.split_text("Some longer text...");
```

```rust
use julienne::{ChunkConfig, ChunkSizer, WordSizer, TokenBoundaryProvider, TokenChunker, TokenSpan};

let word_config = ChunkConfig::new(90, 15, WordSizer);
assert_eq!(word_config.sizer.size("one two three"), 3);

#[derive(Clone)]
struct WhitespaceTokens;

impl TokenBoundaryProvider for WhitespaceTokens {
    fn token_spans(&self, input: &str) -> Result<Vec<TokenSpan>, julienne::ChunkError> {
        let mut spans = Vec::new();
        let mut start = None;
        for (idx, ch) in input.char_indices() {
            if ch.is_whitespace() {
                if let Some(s) = start.take() {
                    spans.push(TokenSpan { start_byte: s, end_byte: idx });
                }
            } else if start.is_none() {
                start = Some(idx);
            }
        }
        if let Some(s) = start {
            spans.push(TokenSpan { start_byte: s, end_byte: input.len() });
        }
        Ok(spans)
    }
}

let token_chunks = TokenChunker::new(WhitespaceTokens, 3, 1)
    .unwrap()
    .try_split_text("one two three four five")
    .unwrap();
assert_eq!(token_chunks, vec!["one two three", "three four five"]);
```

Optional tokenizer integrations are feature-gated:
- `tiktoken-rs` enables `julienne::token::tiktoken::TiktokenBoundaryProvider`.
- `tokenizers` enables `julienne::token::huggingface::HuggingFaceBoundaryProvider`.
- `unicode-segmentation` enables `GraphemeSizer` and `UnicodeWordSizer`.
```rust
#[cfg(feature = "tiktoken-rs")]
{
    use julienne::token::tiktoken::TiktokenBoundaryProvider;
    use julienne::TokenChunker;

    let provider = TiktokenBoundaryProvider::new(tiktoken_rs::cl100k_base().unwrap());
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using tiktoken.")
        .unwrap();
}
```

```rust
#[cfg(feature = "tokenizers")]
{
    use julienne::token::huggingface::HuggingFaceBoundaryProvider;
    use julienne::TokenChunker;
    use tokenizers::Tokenizer;

    let tokenizer = Tokenizer::from_file("tokenizer.json").unwrap();
    let provider = HuggingFaceBoundaryProvider::new(tokenizer);
    let chunks = TokenChunker::new(provider, 128, 16)
        .unwrap()
        .try_split_text("Token-sized chunks using Hugging Face tokenizers.")
        .unwrap();
}
```

````rust
use julienne::{MarkdownChunker, HtmlChunker};

let markdown = MarkdownChunker::new(500, 50)
    .unwrap()
    .split_text("# Title\n\nA paragraph.\n\n```rust\nfn main() {}\n```");

let html = HtmlChunker::new(500, 50)
    .unwrap()
    .split_text("<section><h1>Title</h1><p>Body</p></section>");
````

`MarkdownChunker` preserves headings, paragraphs, lists, and fenced code blocks.
`HtmlChunker` works on already-extracted HTML/XML strings and uses block-level
tag boundaries; it does not fetch pages, sanitize markup, remove boilerplate, or
perform readability extraction.
With `--features code`, `CodeChunker` uses tree-sitter parsers for Rust and
Python and returns explicit `ChunkError` values for parser failures or oversized
semantic nodes.
```rust
use julienne::SemanticChunker;

fn simple_embedding(text: &str) -> Vec<f32> {
    let lower = text.to_lowercase();
    let a = ["sql", "table", "vectorizer"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    let b = ["weather", "rain", "forecast"].iter().map(|k| lower.matches(k).count() as f32).sum::<f32>();
    vec![a, b]
}

let chunker = SemanticChunker::builder()
    .chunk_size(500)
    .chunk_overlap(50)
    .window_size(3)
    .skip_window(1)
    .reconnect_similarity_threshold(0.75)
    .max_aside_length(512)
    .embedding_fn(std::sync::Arc::new(simple_embedding))
    .build()
    .unwrap();
let chunks = chunker.split_text("...");
```

```shell
cargo test
```

Integration tests use fixtures under:
- `examples/summarize_article.sql`
- `examples/embeddings_from_documents/documents/pgai.md`
```shell
cargo bench --bench splitters_bench
```

Code chunker benchmarks are feature-gated:
```shell
cargo bench --bench splitters_bench --features code -- splitters_code
```

The benchmark compares all splitter strategies on Markdown and SQL fixtures,
including word-count and tiktoken length profiles for semchunk.
Structured chunk APIs return `TextChunk` with stable source offsets. Public
splitter types and builders are `Send + Sync + Clone` when their configured
handles are. Infallible splitters expose `split_text`, `split_chunks`, and
iterator-style `chunks` where available. Fallible integrations use `try_*`
methods and return `ChunkError`; this includes batch embedding and tree-sitter
code parsing.
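The fallible/infallible split can be sketched generically. Every name below is an illustrative stand-in, not julienne's actual signature: the point is only that the `try_*` path returns a `Result`, while the convenience path is reserved for configurations that cannot fail.

```rust
// Hypothetical types sketching the try_* contract (not the crate's API).
#[derive(Debug, PartialEq)]
enum ChunkError {
    Provider(String),
}

// Fallible path: provider failures surface as Err(ChunkError).
fn try_split(text: &str, provider_ok: bool) -> Result<Vec<String>, ChunkError> {
    if !provider_ok {
        return Err(ChunkError::Provider("embedder unavailable".into()));
    }
    Ok(text.split(". ").map(str::to_string).collect())
}

// Infallible convenience path: only valid when nothing configured can fail,
// so unwrapping here encodes an invariant rather than hiding an error.
fn split(text: &str) -> Vec<String> {
    try_split(text, true).expect("infallible configuration")
}
```

Callers wiring in a network-backed embedder or a parser should prefer the `try_*` form and propagate the error; the convenience form exists for pure, local configurations.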
This crate is a chunking library. Compared with Chonkie, the focus is Rust-native range-preserving chunks, explicit fallibility, and configurable splitter strategies rather than a Python-first experimentation surface. Compared with Chunkr, this crate does not try to be document intelligence infrastructure: no OCR, hosted API, document loading service, layout extraction, vector database handshake, or ingestion pipeline is included. Bring already-extracted text, Markdown, HTML/XML, or source code strings; the crate returns chunks.