Multi-language text segmentation library using the Gemma tokenizer from HuggingFace Transformers.
Project description • Who this project is for • Project dependencies • Instructions for use • Contributing guidelines • Additional documentation • How to get help • Terms of use
Milvus Segment Generator helps you tokenize and segment text into fixed-size chunks with character-level span information. It supports multiple languages (Tibetan, English, Chinese) with language-specific delimiter handling and token post-processing rules.
- Multi-language support: Tibetan, English, and Chinese
- Gemma tokenizer: Uses HuggingFace Transformers' Gemma model tokenizer
- Language-specific rules: Custom delimiters and token merging for each language
- Character spans: Returns precise character offsets for each segment
- JSON export: Save segmentation results to JSON format
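The delimiter-aware chunking described above can be illustrated with a simplified sketch. It is not the library's implementation: it counts characters instead of Gemma tokens, and the function name `chunk_spans` and the default delimiter set are assumptions made for illustration. Only the error message is borrowed from the troubleshooting table.

```python
from typing import Dict, List


def chunk_spans(text: str, max_len: int, delimiters: str = "།.!?。") -> List[Dict[str, Dict[str, int]]]:
    """Split text into spans of at most max_len characters, cutting after a delimiter.

    Hypothetical stand-in: characters play the role of tokens here.
    """
    spans = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Search backwards inside the window for the last delimiter.
            cut = max(text.rfind(d, start, end) for d in delimiters)
            if cut == -1:
                raise ValueError("No delimiter found within window")
            end = cut + 1  # keep the delimiter with the segment it closes
        spans.append({"span": {"start": start, "end": end}})
        start = end
    return spans


sample = "One. Two. Three."
result = chunk_spans(sample, max_len=8)
print(result)
```

Each span starts where the previous one ended, so concatenating the slices reproduces the input text.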
Before using Milvus Segment Generator, ensure you have:
- Python 3.8 or higher
- pip package manager
- HuggingFace account (for downloading Gemma model tokenizer)
- Clone the repository:
git clone https://github.com/OpenPecha/milvus_segment_generator.git
cd milvus_segment_generator
- Install dependencies:
pip install -e .
This will install:
- transformers>=4.30.0: HuggingFace Transformers library
- torch>=2.0.0: PyTorch (required by transformers)
- (Optional) For development, install dev dependencies:
pip install -e ".[dev]"

from milvus_segment_generator import segment_text, segment_text_to_json
# Segment Tibetan text
tibetan_text = "བཅོམ་ལྡན་འདས། དེ་བཞིན་གཤེགས་པ།"
spans = segment_text(tibetan_text, lang="tibetan", segment_size=2000)
print(spans)
# [{"span": {"start": 0, "end": 15}}, {"span": {"start": 15, "end": 30}}]
# Save to JSON file
segment_text_to_json(
    tibetan_text,
    lang="bo",
    output_path="output/segments.json",
    segment_size=2000
)

Supported language codes:
- Tibetan: tibetan, bo
- English: english, en
- Chinese: chinese, zh
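Since each language accepts two spellings, calling code may want to normalize them before passing `lang` along. The sketch below is one way to do that; the `LANGUAGE_ALIASES` table and `normalize_lang` helper are hypothetical and are not part of the library's API.

```python
# Hypothetical alias table built from the codes listed above.
LANGUAGE_ALIASES = {
    "tibetan": "bo", "bo": "bo",
    "english": "en", "en": "en",
    "chinese": "zh", "zh": "zh",
}


def normalize_lang(lang: str) -> str:
    """Map a language name or code to its two-letter code."""
    key = lang.strip().lower()
    if key not in LANGUAGE_ALIASES:
        raise ValueError(f"Unsupported language: {lang!r}")
    return LANGUAGE_ALIASES[key]


print(normalize_lang("Tibetan"))
```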
segment_text: Tokenize and segment text into chunks.
Parameters:
- text (str): Input text to segment
- lang (str): Language code
- segment_size (int): Maximum tokens per segment (default: 1990)
Returns:
- List of dictionaries, each with a span key containing start and end character offsets
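Because the return value holds offsets rather than text, the segment strings have to be sliced out of the original input. A minimal sketch, assuming `end` is exclusive (as in Python slicing); the spans below are hand-written sample values, not real library output, and `spans_to_segments` is a hypothetical helper.

```python
text = "First sentence. Second sentence."
# Hand-written spans in the documented shape, for illustration only.
spans = [
    {"span": {"start": 0, "end": 16}},
    {"span": {"start": 16, "end": 32}},
]


def spans_to_segments(text, spans):
    """Slice the original text using each span's character offsets."""
    return [text[s["span"]["start"]:s["span"]["end"]] for s in spans]


segments = spans_to_segments(text, spans)
print(segments)
```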
segment_text_to_json: Segment text and save the result to a JSON file.
Parameters:
- text (str): Input text to segment
- lang (str): Language code
- output_path (str | Path): Output file path
- segment_size (int): Maximum tokens per segment (default: 1990)
Returns:
- Path object pointing to the created JSON file
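The saved file can be read back with the standard library. The sketch below assumes the file stores the same list of span dictionaries that `segment_text` returns; the actual on-disk layout is not documented here, so treat that as an assumption. To stay self-contained, it writes a stand-in file instead of calling the library, and `load_spans` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def load_spans(path):
    """Read a segments file, assuming it stores a JSON list of {"span": {...}} items."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    return [(item["span"]["start"], item["span"]["end"]) for item in data]


# Stand-in for a file produced by segment_text_to_json; the layout is an assumption.
sample = [{"span": {"start": 0, "end": 15}}, {"span": {"start": 15, "end": 30}}]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "segments.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    offsets = load_spans(path)

print(offsets)
```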
| Issue | Solution |
| --- | --- |
| ImportError: No module named 'transformers' | Install transformers: pip install transformers torch |
| HuggingFace authentication error | Login to HuggingFace: huggingface-cli login or set HF_TOKEN environment variable |
| ValueError: No delimiter found within window | Your text segment doesn't contain any delimiters within the segment_size. Add appropriate punctuation or increase segment_size. |
| Model download is slow | The first run downloads the Gemma tokenizer (~500MB). Subsequent runs use cached version. |
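For the "No delimiter found within window" case, one option is to retry with progressively larger values of segment_size. The helper below is a hypothetical sketch: it takes the segmentation function as an argument (for example `segment_text`), and the fallback sizes are arbitrary illustrative values.

```python
def segment_with_fallback(segment_fn, text, lang, sizes=(1990, 4000, 8000)):
    """Try segment_fn with each segment_size in turn until one succeeds."""
    if not sizes:
        raise ValueError("sizes must contain at least one segment size")
    last_error = None
    for size in sizes:
        try:
            return segment_fn(text, lang=lang, segment_size=size)
        except ValueError as err:
            # Assumed to be the missing-delimiter error; remember it and retry.
            last_error = err
    raise last_error


# Fake segmenter used only to demonstrate the retry behaviour.
def fake_segmenter(text, lang, segment_size):
    if segment_size < 4000:
        raise ValueError("No delimiter found within window")
    return [{"span": {"start": 0, "end": len(text)}}]


print(segment_with_fallback(fake_segmenter, "abc", "en"))
```

Larger segments may exceed limits elsewhere in a pipeline, so adding punctuation to the source text can be the better fix when that is possible.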
Environment variables:
- HF_TOKEN: HuggingFace API token for model access
- TRANSFORMERS_CACHE: Directory for caching downloaded models (default: ~/.cache/huggingface)
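Before a first run, it can help to check how these variables are set. The sketch below only inspects the environment and does not contact HuggingFace; `describe_hf_environment` is a hypothetical helper, and the fallback path is the default stated above.

```python
import os


def describe_hf_environment(env=None):
    """Report whether a token is set and which cache directory would apply."""
    env = os.environ if env is None else env
    return {
        "has_token": bool(env.get("HF_TOKEN")),
        "cache_dir": env.get(
            "TRANSFORMERS_CACHE", os.path.expanduser("~/.cache/huggingface")
        ),
    }


print(describe_hf_environment())
```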
If you'd like to help out, check out our contributing guidelines.
For more information:
- New API Documentation - Detailed API reference and architecture
- Test Documentation - Testing guidelines and structure
- Examples - Usage examples for different languages
- File an issue.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
Milvus Segment Generator is licensed under the MIT License.