EFFICIENT AND OPTIMIZED TOKENIZER ENGINE FOR LLM INFERENCE SERVING
Updated May 25, 2025 · C++
Fast and memory-efficient library for WordPiece tokenization as used by BERT.
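BERT's WordPiece tokenizer splits each word by greedy longest-match-first lookup against a fixed vocabulary, prefixing word-internal subwords with "##". A minimal illustrative sketch (the tiny vocabulary and function name are hypothetical, not taken from any listed library):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Shrink the candidate substring until it matches a vocab entry.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal continuation
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no prefix matches: the whole word is unknown
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Real implementations precompute the vocabulary lookup (often with a trie) instead of rescanning substrings as this sketch does.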
Learning BPE embeddings by first learning a segmentation model and then training word2vec on the segmented corpus.
Byte Pair Encoding (BPE)
A curated list of tokenizer libraries for blazing-fast NLP processing.
WordPiece Tokenizer for BERT models.
Detect whether text is AI-generated, either by training a new tokenizer combined with tree-based classification models or by training language models on a large dataset of human- and AI-generated texts.
A framework for generating a subword vocabulary from a TensorFlow dataset and building custom BERT tokenizer models.
A fast and lightweight implementation of a likelihood-based Byte Pair Encoding (BPE) tokeniser. Unlike traditional BPE, this tokeniser selects merge candidates based on a normalized likelihood score, improving tokenisation quality by prioritising statistically significant merges.
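For context, classic BPE learns its merge table by repeatedly merging the most frequent adjacent symbol pair; the likelihood-based variant described above replaces that frequency criterion with a normalized likelihood score (the scoring details are the library's own and are not reproduced here). A minimal sketch of the classic frequency-based loop, with an illustrative toy corpus:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges by greedily merging the most frequent pair."""
    # Represent each word as a tuple of symbols (initially characters).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge throughout the corpus.
        merged = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
    return merges

print(learn_bpe(["low", "lower", "lowest"], 2))
```

A likelihood-scored variant would only change the `max(...)` selection, ranking candidate pairs by how much a merge improves corpus likelihood rather than by raw count.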
Fast WordPiece and SentencePiece tokenizer built on a trie, OpenMP, SIMD, and a memory pool.
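The trie is what makes such tokenizers fast: longest-prefix matching becomes a single walk from each start position instead of repeated substring lookups. A hypothetical sketch of that core lookup (the OpenMP/SIMD/memory-pool optimizations are out of scope here):

```python
class TrieNode:
    """One node of a character trie over the vocabulary."""
    __slots__ = ("children", "is_token")

    def __init__(self):
        self.children = {}
        self.is_token = False

def build_trie(vocab):
    root = TrieNode()
    for token in vocab:
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token = True
    return root

def longest_match(root, text, start):
    """Return the end index of the longest vocab token at `start`, or -1."""
    node, best = root, -1
    for i in range(start, len(text)):
        node = node.children.get(text[i])
        if node is None:
            break
        if node.is_token:
            best = i + 1  # remember the longest match seen so far
    return best

trie = build_trie({"play", "playing", "ground"})
print(longest_match(trie, "playground", 0))  # matches "play" -> 4
```

Each character is examined at most once per start position, which is the basis for the linear-time "fast WordPiece" style of matching.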