Milvus Segment Generator

Multi-language text segmentation library using the Gemma tokenizer from HuggingFace Transformers.

Owner(s)

Project description

Milvus Segment Generator tokenizes text and splits it into segments of at most a fixed number of tokens, returning character-level span information for each segment. It supports multiple languages (Tibetan, English, Chinese) with language-specific delimiter handling and token post-processing rules.

Features

Multi-language support: Tibetan, English, and Chinese
Gemma tokenizer: Uses HuggingFace Transformers' Gemma model tokenizer
Language-specific rules: Custom delimiters and token merging for each language
Character spans: Returns precise character offsets for each segment
JSON export: Save segmentation results to JSON format
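
To make the idea of delimiter-aware segmentation with character spans concrete, here is a minimal sketch. It is not the library's implementation: it counts characters instead of Gemma tokens, and the function name `toy_segment` and its parameters are hypothetical. It only illustrates how a window is cut back to the last delimiter, and why a window without any delimiter cannot be split cleanly.

```python
def toy_segment(text, delimiters, max_chars):
    """Split text into spans of at most max_chars characters.

    Every span except possibly the last ends right after a delimiter.
    Simplified illustration only: the real library measures tokens.
    """
    spans = []
    start = 0
    n = len(text)
    while start < n:
        window_end = min(start + max_chars, n)
        if window_end == n:
            # The rest of the text fits in one window.
            cut = n
        else:
            # Walk backwards to find the last delimiter inside the window.
            cut = -1
            for i in range(window_end - 1, start - 1, -1):
                if text[i] in delimiters:
                    cut = i + 1
                    break
            if cut == -1:
                raise ValueError("No delimiter found within window")
        spans.append({"span": {"start": start, "end": cut}})
        start = cut
    return spans


spans = toy_segment("One. Two. Three.", delimiters={"."}, max_chars=7)
print(spans)
```

The spans are contiguous, so slicing the input with each `start` and `end` reproduces the text without gaps.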

Project dependencies

Before using Milvus Segment Generator, ensure you have:

Python 3.8 or higher
pip package manager
HuggingFace account (for downloading the Gemma model tokenizer)

Instructions for use

Installation

Clone the repository:

git clone https://github.com/OpenPecha/milvus_segment_generator.git
cd milvus_segment_generator

Install dependencies:

pip install -e .

This will install:

transformers>=4.30.0 - HuggingFace Transformers library
torch>=2.0.0 - PyTorch (required by transformers)

(Optional) For development, install dev dependencies:

pip install -e ".[dev]"

Quick Start

from milvus_segment_generator import segment_text, segment_text_to_json

# Segment Tibetan text
tibetan_text = "བཅོམ་ལྡན་འདས། དེ་བཞིན་གཤེགས་པ།"
spans = segment_text(tibetan_text, lang="tibetan", segment_size=2000)
print(spans)
# Illustrative output; the exact offsets depend on the tokenizer and delimiters:
# [{"span": {"start": 0, "end": 15}}, {"span": {"start": 15, "end": 30}}]

# Save to JSON file
segment_text_to_json(
    tibetan_text,
    lang="bo",
    output_path="output/segments.json",
    segment_size=2000
)
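
Since the result contains only offsets, you will usually want to map the spans back to text. The sketch below assumes the offsets are ordinary Python string indices (end exclusive), which the documentation here does not state explicitly. The helper name `slice_spans` is hypothetical, and the spans are written by hand rather than produced by the library.

```python
def slice_spans(text, spans):
    """Return the substring for each {"span": {"start": ..., "end": ...}} item."""
    return [text[item["span"]["start"]:item["span"]["end"]] for item in spans]


# Hand-written spans in the same shape that segment_text returns.
example_text = "Hello world. Bye."
example_spans = [
    {"span": {"start": 0, "end": 12}},
    {"span": {"start": 12, "end": 17}},
]

for segment in slice_spans(example_text, example_spans):
    print(repr(segment))
```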

Supported Languages

Tibetan: tibetan, bo
English: english, en
Chinese: chinese, zh

API Reference

`segment_text(text, lang, segment_size=1990)`

Tokenize and segment text into chunks.

Parameters:

text (str): Input text to segment
lang (str): Language code
segment_size (int): Maximum tokens per segment (default: 1990)

Returns:

List of dictionaries, each with a `span` key holding the `start` and `end` character offsets of one segment
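
The Quick Start output suggests that spans are contiguous and cover the whole input, although this is not stated as a guarantee. If your pipeline relies on that, a small check like the following can catch surprises early. The function name `spans_cover_text` is hypothetical and not part of the library.

```python
def spans_cover_text(text, spans):
    """Return True if spans start at 0, follow each other without gaps, and end at len(text)."""
    expected_start = 0
    for item in spans:
        start = item["span"]["start"]
        end = item["span"]["end"]
        # Each span must begin where the previous one ended and be non-empty.
        if start != expected_start or end <= start:
            return False
        expected_start = end
    return expected_start == len(text)


good = [{"span": {"start": 0, "end": 3}}, {"span": {"start": 3, "end": 6}}]
gapped = [{"span": {"start": 0, "end": 3}}, {"span": {"start": 4, "end": 6}}]
print(spans_cover_text("abcdef", good))
print(spans_cover_text("abcdef", gapped))
```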

`segment_text_to_json(text, lang, output_path, segment_size=1990)`

Segment text and save to JSON file.

Parameters:

text (str): Input text to segment
lang (str): Language code
output_path (str | Path): Output file path
segment_size (int): Maximum tokens per segment (default: 1990)

Returns:

Path object pointing to the created JSON file
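
To read the saved file back, something like the sketch below may work. It assumes the JSON file holds the same list of `{"span": {...}}` items that `segment_text` returns; the actual file layout is not documented here, so check one output file before relying on it. The helper name `load_spans` is hypothetical.

```python
import json
from pathlib import Path


def load_spans(path):
    """Load (start, end) pairs from a JSON file.

    Assumption: the file contains a top-level list of {"span": {...}} items.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise TypeError("Expected a top-level JSON list of span items")
    return [(item["span"]["start"], item["span"]["end"]) for item in data]
```

Usage would look like `load_spans("output/segments.json")`, which returns a list of `(start, end)` tuples.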

Troubleshooting

| Issue | Solution |
| --- | --- |
| ImportError: No module named 'transformers' | Install the missing packages: `pip install transformers torch` |
| HuggingFace authentication error | Log in with `huggingface-cli login` or set the `HF_TOKEN` environment variable |
| ValueError: No delimiter found within window | The text has no delimiter within a window of `segment_size` tokens. Add appropriate punctuation or increase `segment_size`. |
| Model download is slow | The first run downloads the Gemma tokenizer (~500MB). Subsequent runs use the cached version. |
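
For the "No delimiter found within window" case, one possible approach is to retry with progressively larger segment sizes. The wrapper below is a sketch, not part of the library: `segment_with_fallback` and the default `sizes` values are hypothetical, and it takes the segmentation function as an argument so it can wrap `segment_text` or any function with the same signature.

```python
def segment_with_fallback(segment_fn, text, lang, sizes=(1990, 4000, 8000)):
    """Call segment_fn with increasing segment_size until it stops raising ValueError."""
    last_error = None
    for size in sizes:
        try:
            return segment_fn(text, lang=lang, segment_size=size)
        except ValueError as err:
            # Remember the failure and try the next, larger window.
            last_error = err
    raise ValueError(f"Segmentation failed for all sizes {tuple(sizes)}") from last_error
```

With the library installed, usage would be `segment_with_fallback(segment_text, text, "bo")`. Keep in mind that larger segments may exceed the limits of whatever consumes them downstream, so adding punctuation to the source text can be the better fix.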

Environment Variables

HF_TOKEN: HuggingFace API token for model access
TRANSFORMERS_CACHE: Directory for caching downloaded models (default: ~/.cache/huggingface/hub). Recent Transformers releases deprecate this variable in favor of HF_HOME.
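
Before the first run, it can help to confirm that these variables are visible to Python. The helper below is a small sketch using only the standard library; the name `describe_hf_environment` is hypothetical.

```python
import os


def describe_hf_environment(env=None):
    """Report whether HF_TOKEN is set and which cache directory is configured, if any."""
    env = os.environ if env is None else env
    return {
        # Only report presence, never print the token itself.
        "has_token": bool(env.get("HF_TOKEN")),
        # None means the variable is unset and the library default applies.
        "cache_dir": env.get("TRANSFORMERS_CACHE"),
    }


print(describe_hf_environment())
```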

Contributing guidelines

If you'd like to help out, check out our contributing guidelines.

Additional documentation

For more information:

New API Documentation - Detailed API reference and architecture
Test Documentation - Testing guidelines and structure
Examples - Usage examples for different languages

How to get help

File an issue.
Email us at openpecha[at]gmail.com.
Join our Discord.

Terms of use

Milvus Segment Generator is licensed under the MIT License.
