@CodeWithKyrian
Owner

What:

  • Bug Fix
  • New Feature

Description:

This PR adds a comprehensive test suite for the tokenizers and fixes the issues found while writing it. It also adds several new pretokenizers, decoders, and pretrained tokenizers, listed below:

New Additions

  1. Pretokenizers:
    • WhitespacePretokenizer: Added support for tokenizing text by whitespace.
  2. Decoders:
    • VitsDecoder: Added a decoder for text-to-speech (VITS) decoding.
  3. Pretrained Tokenizers:
    • BlenderBotTokenizer
    • CohereTokenizer
    • NougatTokenizer
    • GemmaTokenizer
    • Grok1Tokenizer
    • SiglipTokenizer
    • SpeechT5Tokenizer
    • VitsTokenizer
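
To illustrate the whitespace pretokenization mentioned above, here is a minimal sketch in Python. It is not the code from this PR: the function name `whitespace_pretokenize` is hypothetical, and it assumes the simplest rule (split on runs of whitespace, keep character offsets). The actual WhitespacePretokenizer may use a different rule, for example also separating punctuation.

```python
import re

# Hypothetical helper, not part of this PR: splits on runs of whitespace
# and records each piece's (start, end) character offsets in the input.
def whitespace_pretokenize(text):
    return [(m.group(0), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


if __name__ == "__main__":
    for piece, span in whitespace_pretokenize("Hello  world\n"):
        print(piece, span)
```

Keeping offsets alongside each piece is a common design choice because later stages can map tokens back to the original text.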

Modifications

  • SplitPretokenizer: Fixed incorrect behavior in the previous implementation.
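
For context, a split pretokenizer generally breaks text on a pattern. The sketch below is an assumption-laden Python illustration, not the corrected implementation from this PR: the function name `split_pretokenize` is hypothetical, and it covers only two behaviors, `removed` (drop the delimiter) and `isolated` (keep the delimiter as its own piece), named after options of the Hugging Face tokenizers `Split` pre-tokenizer.

```python
import re

# Hypothetical sketch, not the PR's implementation.
# behavior="removed":  delimiters matched by `pattern` are dropped.
# behavior="isolated": each delimiter is kept as a separate piece.
def split_pretokenize(text, pattern, behavior="removed"):
    if behavior not in ("removed", "isolated"):
        raise ValueError(f"unsupported behavior: {behavior}")
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        if m.start() > last:
            # Text between the previous delimiter and this one.
            pieces.append(text[last:m.start()])
        if behavior == "isolated" and m.group(0):
            pieces.append(m.group(0))
        last = m.end()
    if last < len(text):
        # Trailing text after the final delimiter.
        pieces.append(text[last:])
    return pieces


if __name__ == "__main__":
    print(split_pretokenize("a, b", r",\s*"))
    print(split_pretokenize("a, b", r",\s*", behavior="isolated"))
```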

Bug Fixes

Several tokenizer bugs surfaced while writing the test suite and are fixed in this PR.

@CodeWithKyrian CodeWithKyrian merged commit 89ab0f1 into main Nov 17, 2024
@CodeWithKyrian CodeWithKyrian deleted the tests branch November 17, 2024 16:48
