Update dependency transformers to v4.48.2 #47
Merged
This PR contains the following updates:
==4.47.1 -> ==4.48.2
Release Notes
huggingface/transformers (transformers)
v4.48.2: Patch release v4.48.2
Sorry, the fixes for num_items_in_batches are not done yet 😓 To follow along, see this PR; a new patch will be available soon!
Now, we mostly had a backward-compatibility (BC) issue with Python 3.9:
Then we had a small regression for DBRX saving:
Finally we have a fix for gemma and the hybrid attention architectures:
Miscellaneous:
v4.48.1: Patch release v4.48.1
Yet again, we have a gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken!
Moonshine had a small issue when wrapping generate, so we removed that! 🤗
v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine
New models
ModernBERT
The ModernBERT model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
Aria
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
TimmWrapper
We add a TimmWrapper set of classes so that timm models can be loaded into the library as transformers models.
Here's a general usage example:
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc.:
Pixtral-Large
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
ColPali
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
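The late interaction scoring mentioned above can be sketched in a few lines. This is a minimal illustration of the ColBERT-style "MaxSim" idea using plain Python lists, not the ColPali implementation; the function name `late_interaction_score` and the toy 2-dimensional embeddings are assumptions made for the example.

```python
def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query vector, keep the highest dot product
    against any document vector, then sum those maxima over the query."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings, made up for illustration (one vector per token or image patch).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a good match for both query vectors
doc_b = [[1.0, 0.0], [0.9, 0.1]]              # has a good match for the first one only

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a > score_b)  # doc_a ranks higher for this query
```

Because each query vector is matched independently, document embeddings can be computed and stored ahead of time, and only this cheap scoring step runs at query time.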
Falcon3
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performance while reducing training costs:
One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
Bamba
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
VitPose
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in “ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation”.
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
DINOv2 with registers
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
Emu3
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on the VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
Cohere2
A new Cohere update was added through a new "Cohere2" set of classes.
TextNet
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
DiffLlama
DiffLlama combines the Llama architecture with Differential Transformer's attention.
PixtralLarge
The conversion script needed a few updates, while the modeling code was barely changed!
Moonshine
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
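To illustrate why rotary embeddings are not tied to a fixed window, here is a minimal pure-Python sketch of RoPE in its interleaved-pair formulation. It is not Moonshine's code (the library implementation may pair dimensions differently); the helper names `rope` and `dot`, the vectors, and the positions are assumptions chosen for the example.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (x0, x1), (x2, x3), ... of `vec` by an angle
    proportional to `position`; earlier pairs rotate faster than later ones."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Made-up query/key vectors with an even number of dimensions.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = dot(rope(q, 5), rope(k, 3))         # positions 5 and 3, offset 2
far = dot(rope(q, 1005), rope(k, 1003))    # positions 1005 and 1003, offset 2
print(math.isclose(near, far, abs_tol=1e-9))
```

The attention score between a rotated query and key depends only on their offset, not on their absolute positions, so there is no learned table of positions that runs out at a fixed length.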
Quantization methods
VPTQ Quantization
From the VPTQ contributors:
HIGGS Quantization
From the contributors:
Cleanup
We merged a cleanup for vision language models, to make sure all models are standardized.
Breaking changes
Conversion scripts
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
Backtracking in Nougat
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
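For readers unfamiliar with this class of problem, the sketch below shows how a regular expression with nested quantifiers can backtrack heavily, and how an equivalent pattern without nesting avoids it. The two patterns are hypothetical textbook examples, not the expressions used in the Nougat code.

```python
import re

# Hypothetical patterns for illustration only (not taken from Nougat).
# Nested quantifiers: on a near-miss input the engine tries exponentially many
# ways to split the run of "a" characters before giving up.
BACKTRACKING = re.compile(r"^(a+)+$")
# Accepts exactly the same strings, with a single level of repetition.
LINEAR = re.compile(r"^a+$")

# Inputs are kept short on purpose: the last one is a near-miss, and making it
# much longer would make BACKTRACKING take a very long time.
samples = ["a", "aaaa", "", "aab", "a" * 16 + "b"]
results = [(bool(BACKTRACKING.match(s)), bool(LINEAR.match(s))) for s in samples]

print(all(slow == fast for slow, fast in results))
```

Both patterns agree on every sample here, which is the kind of check that supports the "same results" expectation, although a handful of inputs cannot prove equivalence in general.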
Whisper decoding
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
Attention refactor
In order to have cleaner, isolated, future-proof code for the attention layers, they have been refactored so that each model keeps its own attention code within its file, while attention definitions relating to SDPA, Flash Attention, and other types of attention have been moved to a common file.
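The general shape of such a refactor can be sketched as a shared registry of attention functions that model code selects from by name. This is a toy illustration in plain Python, not the library's code: `ATTENTION_FUNCTIONS`, `ToyAttentionLayer`, and both attention functions are hypothetical names invented for the example, and the "online" variant only stands in for an alternative backend.

```python
import math


def eager_attention(query, keys, values):
    """Scaled dot-product attention for one query vector (the 'model file' version)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(len(values[0]))]


def online_attention(query, keys, values):
    """Same result, computed in a single pass with a running max (online softmax)."""
    scale = 1.0 / math.sqrt(len(query))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for key, value in zip(keys, values):
        score = scale * sum(q * k for q, k in zip(query, key))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # rescales what was accumulated so far
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * v for a, v in zip(acc, value)]
        running_max = new_max
    return [a / denom for a in acc]


# Hypothetical shared registry, standing in for the "common file".
ATTENTION_FUNCTIONS = {"eager": eager_attention, "online": online_attention}


class ToyAttentionLayer:
    """Model-side layer: looks up a backend by name instead of one subclass per backend."""

    def __init__(self, implementation="eager"):
        self.attention_fn = ATTENTION_FUNCTIONS[implementation]

    def __call__(self, query, keys, values):
        return self.attention_fn(query, keys, values)


query, keys, values = [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
out_eager = ToyAttentionLayer("eager")(query, keys, values)
out_online = ToyAttentionLayer("online")(query, keys, values)
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(out_eager, out_online)))
```

With this layout, adding a new backend means registering one function in the shared table, and the per-model layer code does not change.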
Bugfixes and improvements
num_items_in_batch not being an integer by @xspirus in #35115
docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
Thread for SF conversion by @ydshieh in #35236
rsfE with pytest by @ydshieh in #35119
benchmark job in push-important-models.yml by @ydshieh in #35292
benchmarks_entrypoint.py by @McPatate in #34495
text by @probicheaux in #35201
[docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
[Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
_make_causal_mask by @jiwoong-choi in #35291
weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
test_generate_with_static_cache even less flaky by @ydshieh in #34995
is_causal is passed explicitly by @Cyrilvallez in #35390
PaliGemmaProcessor by @alvarobartt in #35278
.github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
[GPTQ, CompressedTensors] Fix unsafe imports and metadata check by @vasqu in #34815
ACCELERATE_MIN_VERSION on error by @KSafran in #35189
model_accepts_loss_kwargs for timm model by @qubvel in #35257
sdpa_kernel by @jla524 in #35410
docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
sdpa_kernel by @jla524 in #35461
Significant community contributions
The following contributors have made significant changes to the library over the last release:
Thread for SF conversion (#35236)
rsfE with pytest (#35119)
benchmark job in push-important-models.yml (#35292)
weights_only=True with torch.load for transfo_xl (#35241)
test_generate_with_static_cache even less flaky (#34995)
.github/workflows/self-comment-ci.yml for now (#35366)
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.