(py package) Tokenizer based on the BPE algorithm for LLMs (supports regex patterns and special tokens)
Updated May 29, 2024 - Jupyter Notebook
A grammar describes the syntax of a programming language and may be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with structure: does the sequence of tokens fit the grammar? A compiler's front end combines a lexer and a parser built for a specific grammar; later stages analyze the resulting tree and translate it into another form, such as machine code.
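The lexer/parser split described above can be sketched in a few lines of Python. This is only an illustration, assuming a toy arithmetic grammar that is not taken from any of the listed projects; the names `TOKEN_RE`, `tokenize`, and `parse`, and the nested-tuple AST shape, are hypothetical choices made for the example.

```python
import re

# Assumed toy grammar (BNF-style), invented for this illustration:
#   expr   ::= term (("+" | "-") term)*
#   term   ::= factor (("*" | "/") factor)*
#   factor ::= NUMBER | "(" expr ")"
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<OP>[-+*/()])|(?P<SKIP>\s+)|(?P<ERROR>.)"
)


def tokenize(text):
    """Lexer: turn raw text into a list of (kind, value) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":          # whitespace carries no meaning here
            continue
        if kind == "ERROR":         # any character the grammar does not know
            raise SyntaxError(f"unexpected character {match.group()!r}")
        tokens.append((kind, match.group()))
    return tokens


def parse(tokens):
    """Parser: check the tokens against the grammar and build a tuple AST."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def factor():
        kind, value = advance()
        if kind == "NUMBER":
            return ("num", int(value))
        if value == "(":
            node = expr()
            if advance()[1] != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    def term():
        node = factor()
        while peek()[1] in ("*", "/"):
            op = advance()[1]
            node = (op, node, factor())
        return node

    def expr():
        node = term()
        while peek()[1] in ("+", "-"):
            op = advance()[1]
            node = (op, node, term())
        return node

    tree = expr()
    if peek()[0] != "EOF":          # leftover tokens do not fit the grammar
        raise SyntaxError(f"unexpected token {peek()[1]!r}")
    return tree


if __name__ == "__main__":
    print(tokenize("1 + 2 * 3"))
    print(parse(tokenize("1 + 2 * 3")))
```

The lexer only decides what each piece of text is; the parser decides whether the pieces appear in an order the grammar allows, which is why an input like `1 +` tokenizes without complaint but fails in `parse`.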
Lexical analysis for tokenizing a basic programming language
⛄ Possibly the smallest Lua compiler ever
DOM-aware tokenization for Hugging Face language models
Byte-Pair Encoding tokenizer for large language models
Web tool to count LLM tokens (GPT, Claude, Llama, ...)
A multilingual morphological analysis library.
Tools and resources for the computational processing of Nheengatu (Modern Tupi)
A simple table data editor, with easily scalable functions and operations & a nice GUI
Byte-level byte pair encoding (BPE) in Haskell
Implementation of a logic programming language, including a tokenizer, LL(1) parser, translator, evaluator, and an execution CLI
DadmaTools is a Persian NLP toolkit developed by Dadmatech Co.