
add huggingface transformers plugin #350

Merged
maxkahan merged 5 commits into main from add-transformers
Feb 19, 2026

Conversation

@maxkahan
Contributor

@maxkahan maxkahan commented Feb 10, 2026

This pull request introduces support for streaming text-to-speech (TTS) in the Vision Agents core, allowing LLM responses to be sent to TTS as sentences are produced, reducing perceived latency. It also reorganizes and expands HuggingFace plugin examples, adding new local inference demos for both text and vision-language models. Additionally, it improves dependency management for local model support.

The most important changes are:

Streaming TTS Support in Core Agent:

  • Added a streaming_tts option to the Agent class, enabling TTS to receive text incrementally as sentences are generated by the LLM, rather than waiting for the full response. This includes buffering logic, sentence boundary detection, and buffer clearing on barge-in or turn events. [1] [2] [3] [4] [5] [6] [7]
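The diff itself isn't shown here, but the described behavior (accumulate deltas, emit complete sentences at boundaries, flush the remainder on completion or barge-in) can be sketched roughly as follows. Class and method names are illustrative, not the PR's actual API:

```python
# Hypothetical sketch of the streaming-TTS buffering described above.
# emitted stands in for text handed to the TTS service.

class StreamingTTSBuffer:
    def __init__(self) -> None:
        self._buf = ""
        self.emitted: list[str] = []

    def on_delta(self, delta: str) -> None:
        """Accumulate an LLM chunk and emit each complete sentence."""
        self._buf += delta
        while True:
            # Find the first sentence terminator followed by whitespace.
            boundary = -1
            for i in range(len(self._buf) - 1):
                if self._buf[i] in ".!?" and self._buf[i + 1] in " \n":
                    boundary = i
                    break
            if boundary == -1:
                break
            self.emitted.append(self._buf[: boundary + 1].strip())
            self._buf = self._buf[boundary + 1 :].lstrip()

    def flush(self) -> None:
        """On completion or interruption: send any remainder, then clear."""
        if self._buf.strip():
            self.emitted.append(self._buf.strip())
        self._buf = ""
```

The point of the design is that TTS synthesis can start after the first sentence instead of after the full response, which is where the latency win comes from.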

HuggingFace Plugin Example Reorganization and Expansion:

  • Moved and renamed the HuggingFace Inference API example to plugins/huggingface/examples/inference_api/, updated its metadata, and removed the old example and README. [1] [2] [3] [4]
  • Added a new example for running local HuggingFace LLMs using the transformers library (transformers_llm_example.py), demonstrating agent setup for local inference. [1] [2]
  • Added a new example for running local vision-language models (transformers_vlm_example.py), showing how to use local VLMs with Vision Agents. [1] [2]

Dependency Management Improvements:

  • Introduced optional dependencies in the HuggingFace plugin for local model support ([project.optional-dependencies] for transformers and transformers-quantized), making it easier to install required packages for local inference.

Summary by CodeRabbit

  • New Features

    • Streaming text-to-speech option for real-time incremental speech output.
    • Local Transformer LLM and on-device Vision-Language Model (VLM) support for offline inference.
    • Optional dependency groups for transformers and quantized workflows.
  • Documentation & Examples

    • Added new local Transformers LLM and VLM example projects and a HuggingFace Inference API example.
    • Removed an outdated HuggingFace example README.
  • Tests

    • Added comprehensive tests for Transformers LLM and VLM integrations.
  • Chores

    • Ignore rule added for a new top-level tooling directory.

@coderabbitai

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough


Adds on-device HuggingFace Transformers LLM and VLM plugins with warmup, streaming, quantization and tool-call support; streaming TTS buffering and sentence-boundary emission in Agent; new examples and pyproject configs; extensive tests; and minor repo housekeeping (.claude/ in .gitignore and removal/consolidation of legacy example files).

Changes

Cohort / File(s) Summary
Agent — Streaming TTS
agents-core/vision_agents/core/agents/agents.py
Adds streaming_tts: bool ctor flag, _streaming_tts_buffer state, LLMResponseChunkEvent subscription to accumulate deltas, sentence-boundary emission to TTS, _flush_streaming_tts_buffer() helper, and buffer clearing on turn interruptions.
Transformers LLM Core
plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py
New TransformersLLM: dtype/quantization helpers, model/tokenizer loading, warmup/unload, streaming and non-streaming generation, chunk events, tool-call parsing and handling, and resource management.
Transformers VLM Core
plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py
New TransformersVLM: processor/model loading, VLMResources, frame buffering/VideoForwarder, on-device inference, warmup/unload, and VLM event flows.
Plugin exports & deps
plugins/huggingface/vision_agents/plugins/huggingface/__init__.py, plugins/huggingface/pyproject.toml
Guards optional imports for TransformersLLM/TransformersVLM with helpful warning on missing optional deps; adds project.optional-dependencies groups (transformers, transformers-quantized).
Examples & configs (new)
plugins/huggingface/examples/transformers/..., plugins/huggingface/examples/inference_api/pyproject.toml, plugins/huggingface/examples/transformers/pyproject.toml
Adds new Transformers LLM/VLM example scripts and pyproject configs with editable local sources; updates inference API example docstring.
Examples removed / consolidated
plugins/huggingface/example/README.md, plugins/huggingface/example/pyproject.toml
Removes legacy example README and pyproject; content consolidated under new examples/ layout.
Tests
plugins/huggingface/tests/test_transformers_llm.py, plugins/huggingface/tests/test_transformers_vlm.py
Adds comprehensive unit and integration tests for LLM/VLM streaming/non-streaming flows, tool calls, fallbacks, error handling, and frame processing.
Repository housekeeping
.gitignore
Adds ignore rule for top-level .claude/ directory.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Agent
    participant LLM as "Transformers LLM (stream)"
    participant Buffer as "TTS Buffer"
    participant TTS

    User->>Agent: Send prompt / start turn
    Agent->>LLM: Request streaming generation
    LLM->>Agent: LLMResponseChunkEvent (delta)
    Agent->>Buffer: Append delta to _streaming_tts_buffer
    Buffer->>Buffer: Detect sentence boundary
    alt sentence complete
        Buffer->>TTS: Send sentence for synthesis
        TTS->>User: Play audio
        Buffer->>Buffer: Remove emitted sentence
    end
    LLM-->>Agent: LLMResponseCompletedEvent
    Agent->>Buffer: _flush_streaming_tts_buffer()
    Buffer->>TTS: Send remaining text
    TTS->>User: Play final audio

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

The throat keeps ledger lines of halting speech,
a ledger inked in fragments, clipped and kept.
Sentences crowd like birds against a pane—
one wing, then another—until the pane cracks,
and something like full language falls at last.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 22.78%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title 'add huggingface transformers plugin' directly and concisely summarizes the main purpose of this changeset: introducing HuggingFace Transformers support to the Vision Agents framework.


@Nash0x7E2
Member

@maxkahan is this one nearing completion/merge-ready?

@maxkahan
Contributor Author

This is actually mergeable as is; it adds the transformers VLM and LLM support. There's still work to do on the processors, and the inference API code needs tidying up, but that can be another PR.

@maxkahan maxkahan marked this pull request as ready for review February 16, 2026 17:47
@maxkahan
Contributor Author

@aliev please review!


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Fix all issues with AI agents
In `@plugins/huggingface/tests/test_transformers_vlm.py`:
- Around line 1-6: Tests in plugins/huggingface/tests/test_transformers_vlm.py
heavily rely on unittest.mock.MagicMock which violates the "Never mock in tests"
guideline; replace MagicMock-based processor/model fakes with lightweight real
or explicit fake implementations (e.g., a tiny randomly-initialized transformers
processor/model or a small hand-rolled fake) used by the unit tests that
exercise TransformersVLM; specifically remove imports/usages of MagicMock,
create minimal real objects from transformers (config-only or small
AutoModel/processor instances) or deterministic fake classes that implement the
same methods used by the tests, and update any test setup/fixtures that
referenced MagicMock to use these concrete instances so tests no longer mock
behavior.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 276-294: The try/except around tokenizer.apply_chat_template
currently catches Exception broadly; change it to catch specific
template-related errors (e.g., jinja2.TemplateError) and likely
TypeError/ValueError to avoid swallowing unrelated exceptions: update the except
clauses around tokenizer.apply_chat_template (the block that references
messages, template_kwargs, tools_param and returns LLMResponseEvent) to handle
jinja2.TemplateError and then fallback to TypeError/ValueError as needed,
preserve the existing log messages and retry behavior (pop "tools" and retry
when tools_param is present), and add the appropriate import for
jinja2.TemplateError if not already imported.
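The fallback-and-retry shape this comment asks for can be sketched in isolation. Here `render` stands in for `tokenizer.apply_chat_template`, and the handler catches only template-shaped errors (real code would also include `jinja2.TemplateError` in the tuple); everything else propagates:

```python
# Hedged sketch: retry chat-template rendering without tool schemas on
# template-related errors only, instead of a bare "except Exception".

def render_with_fallback(render, messages, tools=None):
    kwargs = {"tools": tools} if tools is not None else {}
    try:
        return render(messages, **kwargs)
    except (TypeError, ValueError):  # plus jinja2.TemplateError in the real code
        if "tools" in kwargs:
            kwargs.pop("tools")      # retry once without tools, as the PR does
            return render(messages, **kwargs)
        raise                        # no fallback left: let it propagate
```

Unrelated failures (e.g. `KeyError`, CUDA errors) are no longer swallowed, which is the substance of the suggestion.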
- Around line 471-482: Replace the broad "except Exception as e" around the
await asyncio.to_thread(_do_generate) call with narrow, specific exception
handlers for the errors you expect from generation (for example RuntimeError,
ValueError, OSError and any library-specific errors your stack uses such as
Transformers/torch exceptions); for each specific except block still call
logger.error(...) and send events.LLMErrorEvent(plugin_name=PLUGIN_NAME,
error_message=str(e), event_data=e) and return LLMResponseEvent(original=None,
text=""), and add a final re-raise for any truly unexpected exceptions so you
don't swallow unknown failures; locate and update the try/except that wraps
_do_generate, logger.error, events.LLMErrorEvent, and LLMResponseEvent.
- Around line 374-383: The generation thread currently uses a broad "except
Exception as e" in run_generation which can mask critical torch/CUDA errors;
change this to catch torch.cuda.OutOfMemoryError and RuntimeError explicitly
(set generation_error, log distinct messages including the exception for each),
and move the unblocking call to
loop.call_soon_threadsafe(async_queue.put_nowait, None) into a finally block so
it always runs; do not swallow other unexpected exceptions—either let them
propagate or handle them explicitly if needed so you don't hide errors from
model.generate(**generate_kwargs).
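The finally-block pattern being requested can be shown with a plain worker thread and queue standing in for the PR's generation thread and asyncio queue (names here are illustrative). Note that `torch.cuda.OutOfMemoryError` subclasses `RuntimeError` in recent torch releases, so catching both is mostly about distinct log messages:

```python
import queue
import threading

def run_generation(generate, out_q: "queue.Queue", errors: list) -> None:
    """Worker-thread sketch: catch only expected errors, always unblock the consumer."""
    try:
        for token in generate():          # stands in for model.generate + streamer
            out_q.put(token)
    except (MemoryError, RuntimeError) as e:
        errors.append(e)                  # record/log; real code sets generation_error
    finally:
        out_q.put(None)                   # sentinel is sent on every path, even on error
```

Because the sentinel lives in `finally`, the consumer loop can never hang waiting on a queue that the failed producer abandoned.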
- Around line 184-198: The load_kwargs currently uses the wrong key "dtype"
which AutoModelForCausalLM.from_pretrained ignores; update load_kwargs to use
the correct "torch_dtype" key so the dtype setting is applied when calling
AutoModelForCausalLM.from_pretrained; specifically change the entry in
load_kwargs (constructed near the load_kwargs variable and used in
AutoModelForCausalLM.from_pretrained(self.model_id, **load_kwargs)) from
"dtype": torch_dtype to "torch_dtype": torch_dtype and keep the existing
device_map and quantization_config logic
(get_quantization_config(self._quantization)) unchanged.
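As a sketch, the corrected kwargs construction might look like the following. `torch_dtype` is the long-standing `from_pretrained` key (very recent transformers releases have also started accepting `dtype`, so this is worth checking against the pinned version); values and the helper name are illustrative:

```python
# Illustrative load_kwargs construction per the suggestion above.

def build_load_kwargs(torch_dtype, device_map="auto", quantization_config=None) -> dict:
    load_kwargs = {
        "torch_dtype": torch_dtype,  # was "dtype", which older from_pretrained ignores
        "device_map": device_map,
    }
    if quantization_config is not None:
        load_kwargs["quantization_config"] = quantization_config
    return load_kwargs
```

The dict would then be splatted into `AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)` exactly as the existing code does.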

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`:
- Around line 267-279: Replace the broad "except Exception as e" around the call
to _build_vlm_inputs with specific exception handlers (e.g., except ValueError
as e, except TypeError as e, except RuntimeError as e) so each anticipated error
type is caught and handled the same way (log with logger.error, send
VLMErrorEvent via self.events.send with plugin_name=PLUGIN_NAME and
inference_id, and return LLMResponseEvent(original=None, text="")). Do the same
replacement for the other two occurrences noted (the try/except blocks that call
_build_vlm_inputs at the other locations mentioned); do not swallow unexpected
exceptions—let truly unexpected errors propagate so they can be observed by
higher-level error handlers. Ensure you reference the existing symbols
_build_vlm_inputs, VLMErrorEvent, PLUGIN_NAME, and LLMResponseEvent when making
the changes.
- Around line 358-416: The method _build_vlm_inputs currently reads
self._frame_buffer inside a background thread which can race with VideoForwarder
appends; instead snapshot the frames on the calling (main) side and pass them
into _build_vlm_inputs (change its signature to accept frames:
List[av.VideoFrame] or List[Image]) or protect access with a threading.Lock
around reads/writes to _frame_buffer; update the site that calls
asyncio.to_thread(...) to pass the copied list (e.g., list(self._frame_buffer))
and modify _build_vlm_inputs to use the passed frames variable rather than
reading self._frame_buffer, ensuring any references to images/all_frames,
processor.apply_chat_template, and message construction remain unchanged.
- Around line 154-157: The load_kwargs dict currently uses the wrong key "dtype"
which prevents Transformers' from_pretrained from receiving the configured
precision; update the dict used when calling from_pretrained (the load_kwargs
variable used in the model loading path in transformers_vlm.py) to use
"torch_dtype": self._torch_dtype (or the existing torch_dtype variable) instead
of "dtype" so the model loads with the intended dtype/quantization settings.
🧹 Nitpick comments (5)
.gitignore (1)

96-96: LGTM! The ignore rule is correct.

The .claude/ pattern will properly ignore the directory. The syntax is valid and functional.

📂 Optional: Consider moving to the Editors / IDEs section

For better organization, you could move this entry to the "Editors / IDEs" section (after line 68) alongside .vscode/ and .idea/, since .claude/ appears to be an IDE or AI assistant workspace directory:

 # Editors / IDEs
 .vscode/
 .idea/
+.claude/

And remove it from line 96. This groups similar tooling artifacts together.

agents-core/vision_agents/core/agents/agents.py (2)

133-136: streaming_tts is missing from the docstring Args section.

The parameter is well-commented inline, but the class docstring at lines 143–170 documents every other __init__ parameter except this one.

📝 Proposed docstring addition

Add after the profiler entry in the Args block:

             profiler: Optional profiler for performance monitoring.
+            streaming_tts: Send text to TTS as sentences stream from the LLM
+                rather than waiting for the complete response. Reduces perceived
+                latency for non-realtime LLMs that emit LLMResponseChunkEvent.
             broadcast_metrics: Whether to periodically broadcast agent metrics

Also applies to: 215-216


361-381: Sentence boundary detection won't split on punctuation followed by a quote or parenthesis.

The boundary heuristic at line 375 only fires when the character after .!? is a space or newline. Patterns like "Hello!" he said or (end.) won't match because the immediate next character is " or ), not " " or "\n". These fragments will accumulate in the buffer until the next qualifying boundary or the final flush.

This is likely acceptable for typical conversational TTS output, but worth noting in case models produce quoted dialogue.

plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py (1)

545-592: Generic JSON pattern won't match nested objects in arguments.

The regex on line 575 uses \{[^{}]*\} for the arguments value, which doesn't handle nested objects (e.g., {"type": {"nested": true}}). This is acceptable since the Hermes pattern (tried first) handles the common case via re.DOTALL, and deeply nested tool arguments from local models are uncommon. Just flagging for awareness.
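A quick demonstration of the failure mode with a representative pattern (the exact regex in the PR may differ; the `\{[^{}]*\}` arguments group is the part being flagged):

```python
import re

# Representative tool-call pattern: a flat JSON object for "arguments".
FLAT_ARGS = re.compile(r'"arguments":\s*(\{[^{}]*\})')

flat = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
nested = '{"name": "get_weather", "arguments": {"loc": {"city": "Oslo"}}}'
```

The flat case matches, but the nested case fails entirely: `[^{}]*` cannot cross the inner brace, so `re.search` returns no match rather than a truncated one, which is at least a clean failure for the fallback path.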

plugins/huggingface/tests/test_transformers_llm.py (1)

43-62: Mock model's streaming simulation is clever but tightly coupled.

The _generate_side_effect manually calls streamer.put() and streamer.end() to simulate the TextIteratorStreamer protocol. This works but is brittle — if the streamer API changes, these tests won't catch it. Consider noting this coupling for future maintainers.

Member

@aliev aliev left a comment


Looks good overall, just a couple of things to address. The main concern is the direct `LLM.__init__(self)` / `VideoLLM.__init__(self)` calls, which bypass the MRO in a multiple-inheritance context. The rest are minor suggestions and nitpicks; see inline comments.


@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
agents-core/vision_agents/core/agents/agents.py (1)

144-170: ⚠️ Potential issue | 🟡 Minor

streaming_tts is missing from the docstring Args section.

All other __init__ parameters are documented; streaming_tts was omitted.

📝 Proposed fix
         broadcast_metrics_interval: Interval in seconds between metric broadcasts.
         multi_speaker_filter: Audio filter for handling overlapping speech from
             multiple participants.
            Takes effect only when more than one participant is present.
             Defaults to `FirstSpeakerWinsFilter`, which uses VAD to lock onto
             the first participant who starts speaking and drops audio from
             everyone else until the active speaker's turn ends, or they go
             silent.
+        streaming_tts: When True, sends LLM output to TTS incrementally at
+            sentence boundaries rather than waiting for the full response.
+            Reduces perceived latency for non-realtime LLMs that emit
+            LLMResponseChunkEvent. Requires a TTS instance to be provided.
 
     """

As per coding guidelines, "Use Google style docstrings and keep them short."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 144 - 170, The
docstring for the Agent __init__ is missing the streaming_tts parameter; update
the Args section in agents_core/vision_agents/core/agents/agents.py to include a
short Google-style entry for streaming_tts (e.g., "streaming_tts: Streaming
text-to-speech service used for incremental/real-time audio output; not needed
when using a realtime LLM."), placing it near the other TTS/STT params so it’s
clear when it is required; modify the __init__ docstring for the Agent class
(look for __init__ and the Args block) to add this line and keep wording concise
and consistent with existing entries.
🧹 Nitpick comments (11)
plugins/huggingface/examples/inference_api/pyproject.toml (1)

1-18: Consider adding a [build-system] table.

Without a [build-system] declaration, certain tools (e.g., pip, older uv versions) may refuse to install this project in editable mode or treat it as a non-installable artifact. For a uv workspace example that only needs to be run locally this may be intentional, but it's worth being explicit.

⚙️ Proposed addition of a minimal build-system table
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
 [project]
 name = "huggingface-inference-api-example"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/examples/inference_api/pyproject.toml` around lines 1 -
18, Add a minimal [build-system] table to pyproject.toml so the package is
installable/editable by tools like pip; include keys like requires =
["setuptools>=61.0","wheel"] (or ["pip>=21.0"] depending on your build tool) and
build-backend = "setuptools.build_meta" (or another appropriate backend),
placing the [build-system] table alongside the existing [project] section.
agents-core/vision_agents/core/agents/agents.py (2)

315-315: Add -> None return type annotation.

Per coding guidelines, type annotations are required everywhere.

📝 Proposed fix
-    async def _flush_streaming_tts_buffer(self):
+    async def _flush_streaming_tts_buffer(self) -> None:

As per coding guidelines, "Use type annotations everywhere."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` at line 315, The method
_flush_streaming_tts_buffer lacks a return type annotation; update its async def
signature in agents.py to include the explicit return type -> None (i.e., change
"async def _flush_streaming_tts_buffer(self):" to "async def
_flush_streaming_tts_buffer(self) -> None:") so it conforms to the project's
type-annotation guideline and static checks.

373-381: Sentence boundary scan finds the LAST boundary, not the FIRST — reducing streaming latency gains.

The loop overwrites boundary on every hit, so when a chunk contains "Hello. World! More text", both complete sentences are batched into one TTS call. For fine-grained token streaming this rarely matters (one boundary per chunk), but for LLMs that emit larger deltas the first complete sentence is unnecessarily delayed. Breaking on the first hit sends each sentence to TTS the moment it's complete, which is the primary goal of streaming_tts.

⚡ Proposed fix — break on first boundary
-                boundary = -1
-                for i in range(len(buf) - 1):
-                    if buf[i] in ".!?" and buf[i + 1] in " \n":
-                        boundary = i
+                boundary = -1
+                for i in range(len(buf) - 1):
+                    if buf[i] in ".!?" and buf[i + 1] in " \n":
+                        boundary = i
+                        break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 373 - 381, The
loop in streaming_tts currently scans buf and keeps the last sentence boundary
because it keeps overwriting boundary; change it to stop at the first sentence
terminator so sentences are sent as soon as they complete: inside the function
that contains variables buf, boundary, to_send and uses
self._streaming_tts_buffer and await
self.tts.send(self._sanitize_text(to_send)), detect the first index where buf[i]
in ".!?" and buf[i+1] in " \n", set boundary and break the loop immediately
(preserving the existing trimming into self._streaming_tts_buffer and the
conditional await self.tts.send call) so the first complete sentence is emitted
without delay.
plugins/huggingface/examples/transformers/transformers_llm_example.py (1)

33-33: Add Any annotation to **kwargs (same as the VLM example).

♻️ Suggested fix
-async def create_agent(**kwargs) -> Agent:
+async def create_agent(**kwargs: Any) -> Agent:
-async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs: Any) -> None:

As per coding guidelines, **/*.py: "Use type annotations everywhere."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/examples/transformers/transformers_llm_example.py` at
line 33, The create_agent function's variadic keyword parameter is missing a
type annotation; update the signature of create_agent to annotate **kwargs with
typing.Any (e.g., **kwargs: Any) and add the corresponding import for Any from
the typing module so the function definition and file comply with the project's
"use type annotations everywhere" guideline.
plugins/huggingface/examples/transformers/transformers_vlm_example.py (1)

32-32: Add Any annotation to **kwargs.

Both create_agent and join_call accept **kwargs without a type annotation, which violates the "type annotations everywhere" guideline.

♻️ Suggested fix
-async def create_agent(**kwargs) -> Agent:
+async def create_agent(**kwargs: Any) -> Agent:
-async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
+async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs: Any) -> None:

Add from typing import Any to the imports (or use the existing Any if already imported transitively).

As per coding guidelines, **/*.py: "Use type annotations everywhere."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/examples/transformers/transformers_vlm_example.py` at
line 32, The functions create_agent and join_call use untyped variadic keyword
parameters (**kwargs); add explicit typing by importing Any from typing and
annotating the parameters as **kwargs: Any to satisfy the "type annotations
everywhere" guideline—update the top-level imports to include "from typing
import Any" (or reuse an existing Any) and change the function signatures for
create_agent and join_call to accept **kwargs: Any.
plugins/huggingface/vision_agents/plugins/huggingface/__init__.py (1)

12-18: Move import warnings to module top level.

import warnings inside the except block violates the import ordering guideline. Stdlib imports belong at the top of the file.

♻️ Suggested fix
+import warnings
+
 from .huggingface_llm import HuggingFaceLLM as LLM
 from .huggingface_vlm import HuggingFaceVLM as VLM
 
 __all__ = ["LLM", "VLM"]
 
 try:
     from .transformers_llm import TransformersLLM
     from .transformers_vlm import TransformersVLM
 
     __all__ += ["TransformersLLM", "TransformersVLM"]
 except ImportError as e:
     if e.name not in ("torch", "transformers", "av", "aiortc", "jinja2"):
-        import warnings
-
         warnings.warn(

As per coding guidelines, **/*.py: "Order imports as: stdlib, third-party, local package, relative."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/__init__.py` around
lines 12 - 18, Move the stdlib import out of the except block by adding "import
warnings" at module top-level and removing the in-block import; specifically
modify the module containing the except that checks "if e.name not in
(\"torch\", \"transformers\", \"av\", \"aiortc\", \"jinja2\")" so the
warnings.warn call continues to work but the import lives with other top-level
imports to comply with import ordering.
plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py (3)

471-482: _generate_non_streaming only catches RuntimeError; other exceptions propagate without emitting LLMErrorEvent.

asyncio.to_thread(_do_generate) propagates any exception raised by _do_generate directly. A ValueError (e.g., invalid generate_kwargs) or OSError (e.g., failed model weight access) would escape without sending LLMErrorEvent, leaving the caller without an error signal.

♻️ Suggested fix
-        except RuntimeError as e:
+        except (RuntimeError, ValueError, OSError) as e:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 471 - 482, The try/except in _generate_non_streaming only catches
RuntimeError so other exceptions (e.g., ValueError, OSError) raised by
_do_generate propagate without emitting an LLMErrorEvent; modify the error
handling in _generate_non_streaming to catch Exception (or add a second broad
except) around the asyncio.to_thread(_do_generate) call, log the full exception
via logger.exception, call self.events.send with
events.LLMErrorEvent(plugin_name=PLUGIN_NAME, error_message=str(e),
event_data=e) for the caught exception, and return the same fallback
LLMResponseEvent(original=None, text="") so all failures consistently emit
LLMErrorEvent and return the empty response.

59-61: Remove section-comment dividers throughout the file.

The repeated # -----...----- / label / # -----...----- blocks (e.g., lines 59–61, 109–111, 128–130, 170–172, etc.) are section comments and violate the guideline.

As per coding guidelines, **/*.py: "Do not use section comments like # -- some section --."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 59 - 61, Remove all section-divider comments composed of repeated
'#' and dashes (e.g., the blocks surrounding "Shared helpers (imported by
transformers_vlm.py)" and other similar headers) in transformers_llm.py; replace
them with either a concise single-line comment or remove them entirely so normal
inline comments/docstrings remain, ensuring you do not introduce new
section-style separators—look for the exact string "Shared helpers (imported by
transformers_vlm.py)" and other labeled divider blocks and delete the
surrounding "# -----...-----" lines.

28-28: Prefer modern type annotation syntax throughout.

With from __future__ import annotations in effect, Dict, List, and Optional from typing can be replaced with built-in generics and union syntax as required by the guidelines. For example: dict[str, Any], list[str], str | None.

♻️ Example diff (import line)
-from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, cast
+from typing import TYPE_CHECKING, Any, Literal, cast

Then replace all Dict[K, V] → dict[K, V], List[T] → list[T], Optional[T] → T | None throughout the file.

As per coding guidelines, **/*.py: "Use modern syntax: X | Y unions, dict[str, T] generics."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py` at
line 28, The import line currently brings Dict, List, and Optional from typing;
update to use modern built-in generics and union syntax (e.g., dict[str, Any],
list[Any], T | None) by removing Dict, List, and Optional from the import and
replacing their usages across the file (search for occurrences of Dict, List,
Optional in this module, e.g., in function/type hints). Ensure the file has
"from __future__ import annotations" at the top if not present, keep
TYPE_CHECKING, Any, Literal, and cast as needed, and convert all Dict[...] →
dict[...], List[...] → list[...], Optional[T] → T | None consistently.
plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py (2)

305-324: _do_generate exception handling only covers RuntimeError; other exceptions propagate without emitting error events.

Same pattern as _generate_non_streaming in transformers_llm.py — a ValueError or OSError from model.generate() escapes without sending VLMErrorEvent or LLMErrorEvent, and the caller receives an unhandled exception.

♻️ Suggested fix
-        except RuntimeError as e:
+        except (RuntimeError, ValueError, OSError) as e:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`
around lines 305 - 324, The except block for await
asyncio.to_thread(_do_generate) only catches RuntimeError, so other exceptions
(e.g., ValueError, OSError) bypass the VLM error reporting; change the handler
to catch a broader exception (catch Exception as e) in the block around
asyncio.to_thread(_do_generate), call logger.exception with context, send both
VLMErrorEvent (plugin_name=PLUGIN_NAME, inference_id=inference_id, error=e,
context="generation") and events.LLMErrorEvent (plugin_name=PLUGIN_NAME,
error_message=str(e), event_data=e) via self.events.send, and return an empty
LLMResponseEvent(original=None, text="") to ensure all errors from _do_generate
are reported and the caller receives a controlled response.

26-26: Same old-style typing and section-comment issues as transformers_llm.py.

Dict, List, Optional should use modern built-in generics (dict, list, T | None), and the # -----...----- section dividers should be removed — same guidelines as flagged in transformers_llm.py.

As per coding guidelines, **/*.py: "Use modern syntax: X | Y unions, dict[str, T] generics" and "Do not use section comments like # -- some section --."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py` at
line 26, Update the typing import and usage in transformers_vlm.py: replace the
old-style imports Dict, List, Optional with modern built-in generics (use dict,
list and T | None unions) by removing Dict/List/Optional from the from typing
import line and updating any function signatures/annotations that reference
Dict/List/Optional with dict/list and X | None respectively; also remove
any legacy section-divider comments like lines starting with "# -----" (same
pattern as fixes in transformers_llm.py). Ensure TYPE_CHECKING, Any, and cast
remain only if still used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 374-383: The run_generation() thread currently only places the
None sentinel into async_queue on RuntimeError; modify run_generation (in
_generate_streaming) to add a finally block that ensures
loop.call_soon_threadsafe(async_queue.put_nowait, None) is invoked whenever
generation did not finish normally (i.e., if neither
_AsyncBridgeStreamer.on_finalized_text(stream_end=True) already signaled
completion nor the generate call completed successfully); preserve setting
generation_error in the except RuntimeError block and also capture/assign any
non-RuntimeError exceptions to generation_error before the finally sentinel
push, and avoid double-pushing the sentinel if the streamer already finished.

---

Outside diff comments:
In `@agents-core/vision_agents/core/agents/agents.py`:
- Around line 144-170: The docstring for the Agent __init__ is missing the
streaming_tts parameter; update the Args section in
agents-core/vision_agents/core/agents/agents.py to include a short Google-style
entry for streaming_tts (e.g., "streaming_tts: Streaming text-to-speech service
used for incremental/real-time audio output; not needed when using a realtime
LLM."), placing it near the other TTS/STT params so it’s clear when it is
required; modify the __init__ docstring for the Agent class (look for __init__
and the Args block) to add this line and keep wording concise and consistent
with existing entries.

---

Nitpick comments:
In `@agents-core/vision_agents/core/agents/agents.py`:
- Line 315: The method _flush_streaming_tts_buffer lacks a return type
annotation; update its async def signature in agents.py to include the explicit
return type -> None (i.e., change "async def _flush_streaming_tts_buffer(self):"
to "async def _flush_streaming_tts_buffer(self) -> None:") so it conforms to the
project's type-annotation guideline and static checks.
- Around line 373-381: The loop in streaming_tts currently scans buf and keeps
the last sentence boundary because it keeps overwriting boundary; change it to
stop at the first sentence terminator so sentences are sent as soon as they
complete: inside the function that contains variables buf, boundary, to_send and
uses self._streaming_tts_buffer and await
self.tts.send(self._sanitize_text(to_send)), detect the first index where buf[i]
in ".!?" and buf[i+1] in " \n", set boundary and break the loop immediately
(preserving the existing trimming into self._streaming_tts_buffer and the
conditional await self.tts.send call) so the first complete sentence is emitted
without delay.
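The first-boundary behaviour can be sketched in isolation (the function name is illustrative; the real method also sanitizes the text and awaits `self.tts.send`):

```python
def split_first_sentence(buf: str) -> tuple[str, str]:
    """Return (complete_sentence, remainder); sentence is "" if none is complete."""
    for i in range(len(buf) - 1):
        if buf[i] in ".!?" and buf[i + 1] in " \n":
            # Break at the FIRST boundary so the sentence is emitted immediately,
            # instead of overwriting `boundary` and keeping only the last match.
            return buf[: i + 1], buf[i + 2 :].lstrip()
    return "", buf
```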

In `@plugins/huggingface/examples/inference_api/pyproject.toml`:
- Around line 1-18: Add a minimal [build-system] table to pyproject.toml so the
package is installable/editable by tools like pip; include keys like requires =
["setuptools>=61.0","wheel"] (or ["pip>=21.0"] depending on your build tool) and
build-backend = "setuptools.build_meta" (or another appropriate backend),
placing the [build-system] table alongside the existing [project] section.
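A minimal table of the kind the comment asks for; the backend choice is an assumption and should match whatever the other example pyproject files in the repo use:

```toml
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"
```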

In `@plugins/huggingface/examples/transformers/transformers_llm_example.py`:
- Line 33: The create_agent function's variadic keyword parameter is missing a
type annotation; update the signature of create_agent to annotate **kwargs with
typing.Any (e.g., **kwargs: Any) and add the corresponding import for Any from
the typing module so the function definition and file comply with the project's
"use type annotations everywhere" guideline.

In `@plugins/huggingface/examples/transformers/transformers_vlm_example.py`:
- Line 32: The functions create_agent and join_call use untyped variadic keyword
parameters (**kwargs); add explicit typing by importing Any from typing and
annotating the parameters as **kwargs: Any to satisfy the "type annotations
everywhere" guideline—update the top-level imports to include "from typing
import Any" (or reuse an existing Any) and change the function signatures for
create_agent and join_call to accept **kwargs: Any.

In `@plugins/huggingface/vision_agents/plugins/huggingface/__init__.py`:
- Around line 12-18: Move the stdlib import out of the except block by adding
"import warnings" at module top-level and removing the in-block import;
specifically modify the module containing the except that checks "if e.name not
in (\"torch\", \"transformers\", \"av\", \"aiortc\", \"jinja2\")" so the
warnings.warn call continues to work but the import lives with other top-level
imports to comply with import ordering.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 471-482: The try/except in _generate_non_streaming only catches
RuntimeError so other exceptions (e.g., ValueError, OSError) raised by
_do_generate propagate without emitting an LLMErrorEvent; modify the error
handling in _generate_non_streaming to catch Exception (or add a second broad
except) around the asyncio.to_thread(_do_generate) call, log the full exception
via logger.exception, call self.events.send with
events.LLMErrorEvent(plugin_name=PLUGIN_NAME, error_message=str(e),
event_data=e) for the caught exception, and return the same fallback
LLMResponseEvent(original=None, text="") so all failures consistently emit
LLMErrorEvent and return the empty response.
- Around line 59-61: Remove all section-divider comments composed of repeated
'#' and dashes (e.g., the blocks surrounding "Shared helpers (imported by
transformers_vlm.py)" and other similar headers) in transformers_llm.py; replace
them with either a concise single-line comment or remove them entirely so normal
inline comments/docstrings remain, ensuring you do not introduce new
section-style separators—look for the exact string "Shared helpers (imported by
transformers_vlm.py)" and other labeled divider blocks and delete the
surrounding "# -----...-----" lines.
- Line 28: The import line currently brings Dict, List, and Optional from
typing; update to use modern built-in generics and union syntax (e.g., dict[str,
Any], list[Any], T | None) by removing Dict, List, and Optional from the import
and replacing their usages across the file (search for occurrences of Dict,
List, Optional in this module, e.g., in function/type hints). Ensure the file
has "from __future__ import annotations" at the top if not present, keep
TYPE_CHECKING, Any, Literal, and cast as needed, and convert all Dict[...] →
dict[...], List[...] → list[...], Optional[T] → T | None consistently.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`:
- Around line 305-324: The except block for await
asyncio.to_thread(_do_generate) only catches RuntimeError, so other exceptions
(e.g., ValueError, OSError) bypass the VLM error reporting; change the handler
to catch a broader exception (catch Exception as e) in the block around
asyncio.to_thread(_do_generate), call logger.exception with context, send both
VLMErrorEvent (plugin_name=PLUGIN_NAME, inference_id=inference_id, error=e,
context="generation") and events.LLMErrorEvent (plugin_name=PLUGIN_NAME,
error_message=str(e), event_data=e) via self.events.send, and return an empty
LLMResponseEvent(original=None, text="") to ensure all errors from _do_generate
are reported and the caller receives a controlled response.
- Line 26: Update the typing import and usage in transformers_vlm.py: replace
the old-style imports Dict, List, Optional with modern built-in generics (use
dict, list and T | None unions) by removing Dict/List/Optional from the from
typing import line and updating any function signatures/annotations that
reference Dict/List/Optional with dict/list and X | None respectively; also
remove any legacy section-divider comments like lines starting with "# -----"
(same pattern as fixes in transformers_llm.py). Ensure TYPE_CHECKING, Any, and
cast remain only if still used.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (5)
plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py (5)

684-703: Two guideline issues: del obj.attr (≡ delattr) and method ordering.

  1. del on object attributes (lines 687–688): del self._resources.model and del self._resources.tokenizer are runtime-equivalent to delattr. Setting them to None is idiomatic and doesn't require removing the attribute slot:
     if self._resources is not None:
-        del self._resources.model
-        del self._resources.tokenizer
+        self._resources.model = None   # type: ignore[assignment]
+        self._resources.tokenizer = None  # type: ignore[assignment]
         self._resources = None
  2. Method ordering — per the coding guidelines, ordering should be __init__ → public lifecycle → properties → public feature methods → private helpers → dunder. unload (public lifecycle) and the is_loaded/device properties currently sit after all private helpers. They should be moved up, ahead of simple_response / create_response.

As per coding guidelines, "Avoid getattr, hasattr, delattr, setattr; prefer normal attribute access" and "Order class methods as: __init__, public lifecycle methods, properties, public feature methods, private helpers, dunder methods."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 684 - 703, Replace the unsafe use of del on resources in unload by
assigning None to the attributes (e.g., set self._resources.model = None and
self._resources.tokenizer = None or set self._resources = None directly) instead
of using del, and move the public lifecycle and property methods (unload,
is_loaded, device) to follow __init__ and precede public feature methods like
simple_response and create_response so the class method ordering matches the
guideline (__init__ → public lifecycle → properties → public feature methods →
private helpers → dunder).
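A skeletal illustration of the target ordering; the names mirror the review's description and the bodies are placeholders, not the plugin's real implementations:

```python
class TransformersLLMSketch:
    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self._resources: object | None = None

    # Public lifecycle methods come right after __init__.
    def unload(self) -> None:
        # Dropping the container releases its contents without del/delattr.
        self._resources = None

    # Properties follow lifecycle methods.
    @property
    def is_loaded(self) -> bool:
        return self._resources is not None

    # Public feature methods (simple_response, create_response, ...) go here.
    def simple_response(self, text: str) -> str:
        return self._format(text)

    # Private helpers come last, before any dunder methods.
    def _format(self, text: str) -> str:
        return text.strip()
```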

525-525: Redundant tools or [] guard on a non-optional parameter.

tools: List[ToolSchema] is never None by its type signature, and the function is only called when tools_spec is non-empty (line 266). The or [] adds noise without protection.

-        for t in tools or []:
+        for t in tools:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py` at
line 525, The loop uses a redundant fallback "for t in tools or []:" despite
tools being typed List[ToolSchema] and guaranteed non-None by callers; update
the loop in the function (in transformers_llm.py where the loop appears) to
iterate directly with "for t in tools:" and remove the unnecessary "or []" guard
to reduce noise and reflect the non-optional parameter contract.

28-28: Prefer modern built-in generics over typing.Dict/List/Optional.

The file has from __future__ import annotations on line 18, so PEP 585/604 syntax works on all supported Python versions. Replace the old-style imports with native generics throughout.

♻️ Proposed change (imports and representative annotations)
-from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, cast
+from typing import TYPE_CHECKING, Any, Literal, cast

Then replace occurrences across the file, e.g.:

-    load_kwargs: Dict[str, Any] = {
+    load_kwargs: dict[str, Any] = {

-    messages: Optional[List[Dict[str, Any]]] = None,
+    messages: list[dict[str, Any]] | None = None,

-    tools_param: Optional[List[Dict[str, Any]]] = None
+    tools_param: list[dict[str, Any]] | None = None

-    ) -> Optional[Any]:
+    ) -> Any | None:

-    generation_error: Optional[Exception] = None
+    generation_error: Exception | None = None

As per coding guidelines, "Use modern syntax: X | Y unions, dict[str, T] generics."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py` at
line 28, Replace old-style typing generics in transformers_llm.py: remove usage
of typing.Dict, typing.List and typing.Optional in imports and annotations and
switch to built-in generics (dict[..., ...], list[...], and X | None for
optional) and PEP 604 unions where used; keep TYPE_CHECKING, Any, Literal, cast
if still needed. Update the import line (currently "from typing import
TYPE_CHECKING, Any, Dict, List, Literal, Optional, cast") to only import the
required symbols (e.g., TYPE_CHECKING, Any, Literal, cast) and update all
annotations in functions and classes such as the ones referencing Dict, List,
Optional to use dict[str, T], list[T], and T | None respectively.

416-416: thread.join(timeout=5.0) is a blocking call on the async event loop.

Although in practice the thread is nearly finished by this point, calling a blocking join directly inside a coroutine can stall the event loop for up to 5 seconds if the generation thread is slow to exit (e.g., due to PyTorch cleanup).

♻️ Proposed fix
-        thread.join(timeout=5.0)
+        await asyncio.to_thread(thread.join, 5.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py` at
line 416, Replace the blocking call thread.join(timeout=5.0) inside the
coroutine with a non-blocking await that runs thread.join in a threadpool; e.g.,
use asyncio.get_running_loop().run_in_executor(None, thread.join, 5.0) or
asyncio.to_thread(thread.join, 5.0) so the event loop is not blocked. Locate the
occurrence of thread.join in the coroutine (the join call currently used) and
change it to await the executor/to_thread call described above, preserving the
same timeout argument.
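A sketch of offloading the join so the event loop stays responsive; the worker target here is illustrative:

```python
import asyncio
import threading
import time


async def finish_generation(thread: threading.Thread) -> bool:
    """Join a worker thread without blocking the event loop; True if it exited."""
    # A bare thread.join(5.0) would stall the loop for up to 5 s;
    # running it via to_thread keeps other tasks scheduled meanwhile.
    await asyncio.to_thread(thread.join, 5.0)
    return not thread.is_alive()


def _worker() -> None:
    time.sleep(0.05)


thread = threading.Thread(target=_worker)
thread.start()
finished = asyncio.run(finish_generation(thread))
```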

59-65: Remove # --- section-divider comments throughout the file.

Lines 59–61, 109–111, 128–130, 170–172, 218–220, 328–330, 504–506, 518–519, and 680–682 all use the # --------------------------------------------------------------------------- / # Section Name pattern. Class method grouping can be expressed through blank lines and ordered method placement without section banners.

As per coding guidelines, "Do not use section comments like # -- some section --."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 59 - 65, Remove all the long section-divider comments ("#
---------------------------------------------------------------------------" and
the following section-name comment lines) throughout the transformers_llm.py
file; for each occurrence (e.g., the banner above the
DeviceType/QuantizationType/TorchDtypeType block and other banners used to group
methods/classes referenced by transformers_vlm.py), delete the divider lines and
the accompanying section-name comment and instead separate logical groups with a
single blank line so class/method grouping relies on ordering and whitespace
only.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 374-426: In run_generation, broaden the exception handler so any
non-RuntimeError exceptions from model.generate(...) are captured into
generation_error and result in an LLMErrorEvent like RuntimeError does: keep the
existing except RuntimeError as e: generation_error = e; logger.exception(...)
block, and add a following except Exception as e: generation_error = e;
logger.error("Generation failed: %s", e, exc_info=True) (or similar) so all
exceptions set generation_error and still let the finally block call
loop.call_soon_threadsafe(async_queue.put_nowait, None); ensure downstream logic
that checks generation_error and emits events.LLMErrorEvent (and returns
LLMResponseEvent(original=None, text="")) remains unchanged.
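The sentinel discipline described above can be sketched with a plain queue; the names stand in for the plugin's `run_generation`/`async_queue` and the `loop.call_soon_threadsafe` bridge is simplified away:

```python
import queue
from collections.abc import Callable, Iterable

_SENTINEL = None


def run_generation(
    work: Callable[[], Iterable[str]],
    out: queue.Queue,
    errors: list[Exception],
) -> None:
    """Worker body: capture any exception and always push the sentinel once."""
    try:
        for token in work():
            out.put(token)
    except Exception as e:  # broader than RuntimeError, as the review asks
        errors.append(e)
    finally:
        out.put(_SENTINEL)  # consumer unblocks on success and on failure alike


def consume(out: queue.Queue) -> list[str]:
    """Drain tokens until the sentinel arrives."""
    items: list[str] = []
    while (item := out.get()) is not _SENTINEL:
        items.append(item)
    return items
```

Putting the sentinel in `finally` (rather than only in the `except`) is what guarantees the consumer never hangs; the real code additionally has to skip the push when the streamer's `stream_end=True` path already delivered it.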

---

Nitpick comments:
In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 684-703: Replace the unsafe use of del on resources in unload by
assigning None to the attributes (e.g., set self._resources.model = None and
self._resources.tokenizer = None or set self._resources = None directly) instead
of using del, and move the public lifecycle and property methods (unload,
is_loaded, device) to follow __init__ and precede public feature methods like
simple_response and create_response so the class method ordering matches the
guideline (__init__ → public lifecycle → properties → public feature methods →
private helpers → dunder).
- Line 525: The loop uses a redundant fallback "for t in tools or []:" despite
tools being typed List[ToolSchema] and guaranteed non-None by callers; update
the loop in the function (in transformers_llm.py where the loop appears) to
iterate directly with "for t in tools:" and remove the unnecessary "or []" guard
to reduce noise and reflect the non-optional parameter contract.
- Line 28: Replace old-style typing generics in transformers_llm.py: remove
usage of typing.Dict, typing.List and typing.Optional in imports and annotations
and switch to built-in generics (dict[..., ...], list[...], and X | None for
optional) and PEP 604 unions where used; keep TYPE_CHECKING, Any, Literal, cast
if still needed. Update the import line (currently "from typing import
TYPE_CHECKING, Any, Dict, List, Literal, Optional, cast") to only import the
required symbols (e.g., TYPE_CHECKING, Any, Literal, cast) and update all
annotations in functions and classes such as the ones referencing Dict, List,
Optional to use dict[str, T], list[T], and T | None respectively.
- Line 416: Replace the blocking call thread.join(timeout=5.0) inside the
coroutine with a non-blocking await that runs thread.join in a threadpool; e.g.,
use asyncio.get_running_loop().run_in_executor(None, thread.join, 5.0) or
asyncio.to_thread(thread.join, 5.0) so the event loop is not blocked. Locate the
occurrence of thread.join in the coroutine (the join call currently used) and
change it to await the executor/to_thread call described above, preserving the
same timeout argument.
- Around line 59-65: Remove all the long section-divider comments ("#
---------------------------------------------------------------------------" and
the following section-name comment lines) throughout the transformers_llm.py
file; for each occurrence (e.g., the banner above the
DeviceType/QuantizationType/TorchDtypeType block and other banners used to group
methods/classes referenced by transformers_vlm.py), delete the divider lines and
the accompanying section-name comment and instead separate logical groups with a
single blank line so class/method grouping relies on ordering and whitespace
only.

@maxkahan maxkahan merged commit 7446200 into main Feb 19, 2026
9 of 10 checks passed
@maxkahan maxkahan deleted the add-transformers branch February 19, 2026 21:15

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (6)
plugins/huggingface/examples/inference_api/inference_api_example.py (1)

32-32: Stale function docstring after the module rename.

create_agent's docstring still says "Create the agent with HuggingFace LLM." while the module is now titled "HuggingFace Inference API Example." Worth aligning for consistency.

✏️ Proposed update
-    """Create the agent with HuggingFace LLM."""
+    """Create the agent with the HuggingFace Inference API."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/examples/inference_api/inference_api_example.py` at line
32, Update the stale docstring for create_agent to reflect the module rename to
"HuggingFace Inference API Example": locate the create_agent function and
replace the current docstring "Create the agent with HuggingFace LLM." with a
concise, accurate description such as "Create the agent using the HuggingFace
Inference API" (or similar wording that mentions the Inference API) so the
function-level documentation matches the module title.
plugins/huggingface/pyproject.toml (1)

19-23: Consider bumping accelerate lower bound past its 1.0 breaking change.

The floor of accelerate>=0.25.0 is well below the 1.0.0 release, which introduced breaking changes including removal of Accelerator().use_fp16, removal of direct DataLoader config args on Accelerator() in favour of DataLoaderConfiguration, and other API-level removals. The current release is 1.12.0. While transformers will likely pull a higher version transitively, the explicit lower bound here is misleadingly permissive and could cause surprises in edge-case lockfile scenarios.

♻️ Suggested tightening
-    "accelerate>=0.25.0",
+    "accelerate>=1.0.0",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/pyproject.toml` around lines 19 - 23, Update the explicit
accelerate dependency floor in plugins/huggingface/pyproject.toml: the current
declaration "accelerate>=0.25.0" is too permissive given the 1.0.0 breaking
changes, so bump the lower bound to a 1.x release (e.g., "accelerate>=1.0.0" or
the current stable "accelerate>=1.12.0") to avoid surprising installs; modify
the accelerate entry in the transformers list to the chosen tightened version
string.
agents-core/vision_agents/core/agents/agents.py (1)

373-376: Sentence boundary heuristic false-positives on common abbreviations.

buf[i] in ".!?" and buf[i + 1] in " \n" treats any . as a sentence end. Common abbreviations like "Dr. Smith", "Mr. Jones", "e.g. ", and "vs. " will trigger a flush mid-sentence, delivering "Dr." as a standalone TTS utterance and breaking prosody.

A minimal improvement is to also require that the character following the space is uppercase (title-case heuristic), or to maintain a blocked-abbreviation list:

♻️ Optional improvement
-                for i in range(len(buf) - 1):
-                    if buf[i] in ".!?" and buf[i + 1] in " \n":
-                        boundary = i
+                _ABBREVS = {"dr", "mr", "mrs", "ms", "prof", "sr", "jr", "vs", "etc", "e.g", "i.e"}
+                for i in range(len(buf) - 1):
+                    if buf[i] in ".!?" and buf[i + 1] in " \n":
+                        # Skip known abbreviations to avoid splitting mid-sentence
+                        word_start = buf.rfind(" ", 0, i) + 1
+                        preceding_word = buf[word_start:i].lower().rstrip(".")
+                        if preceding_word not in _ABBREVS:
+                            boundary = i
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 373 - 376, The
current sentence-boundary loop that sets boundary when buf[i] in ".!?" and
buf[i+1] in " \n" produces false positives on common abbreviations; update the
heuristic inside that loop (the code that updates boundary using buf and the
for-loop) to skip treating a period+space as a sentence end when the next
non-space character is lowercase or when the token before the period matches a
short-abbreviation blacklist (e.g.,
{"Mr","Mrs","Ms","Dr","Prof","Sr","Jr","vs","e.g","i.e"}); implement by scanning
ahead from i+1 to find the next non-space character and checking its isupper()
status and/or checking the word before the period against the blacklist before
assigning boundary.
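One way to realize the abbreviation guard as a standalone predicate; the blacklist is illustrative, not exhaustive:

```python
_ABBREVS = {"dr", "mr", "mrs", "ms", "prof", "sr", "jr", "vs", "etc", "e.g", "i.e"}


def is_sentence_boundary(buf: str, i: int) -> bool:
    """True if buf[i] ends a sentence and is not a known abbreviation's period."""
    if buf[i] not in ".!?" or i + 1 >= len(buf) or buf[i + 1] not in " \n":
        return False
    # Look at the word immediately before the terminator.
    word_start = buf.rfind(" ", 0, i) + 1
    preceding = buf[word_start:i].lower().rstrip(".")
    return preceding not in _ABBREVS
```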
plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py (1)

426-436: Same del-on-attribute redundancy as in TransformersLLM.unload().

del self._resources.model and del self._resources.processor are equivalent to delattr calls; setting self._resources = None releases both references. See the corresponding comment in transformers_llm.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`
around lines 426 - 436, The unload method in TransformersVLM currently deletes
attributes on the _resources object before nulling it, which is redundant;
update the TransformersVLM.unload implementation to stop calling del on
self._resources.model and self._resources.processor and instead simply set
self._resources = None to release references, keeping the surrounding logging,
gc.collect(), and CUDA cache clearing intact so the cleanup behavior remains the
same.
plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py (2)

684-694: del self._resources.model / del self._resources.tokenizer are redundant and equivalent to delattr.

Once self._resources = None executes, the reference to ModelResources is dropped and its contents become eligible for GC. The preceding del calls are the functional equivalent of delattr(self._resources, "model"), which the coding guidelines prohibit.

♻️ Proposed simplification
     def unload(self) -> None:
         logger.info(f"Unloading model: {self.model_id}")
         if self._resources is not None:
-            del self._resources.model
-            del self._resources.tokenizer
             self._resources = None

As per coding guidelines, "Avoid using getattr, hasattr, delattr, setattr; prefer normal attribute access."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 684 - 694, Remove the explicit del statements in unload() that
delete attributes on self._resources (the lines deleting self._resources.model
and self._resources.tokenizer) and simply set self._resources = None; update the
unload() method so it logs and nulls out the ModelResources reference
(self._resources) directly, then runs gc.collect() and CUDA cache clearing as
before—this removes use of attribute deletion while preserving the original
cleanup behavior in the unload method.

59-61: Section-comment dividers violate the coding guideline throughout this file and transformers_vlm.py.

The # ---...---\n# Name\n# ---...--- pattern is the same construct as # -- some section -- — a section divider the guideline explicitly prohibits. This pattern recurs at lines 109, 128, 170, 218, 328, 504, 517, and 680 in this file, and similarly throughout transformers_vlm.py.

As per coding guidelines, "Do not use section comments like # -- some section --."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 59 - 61, Remove the forbidden section-divider comments (the "#
---...---" pattern) throughout this module and transformers_vlm.py and replace
them with simple, guideline-compliant single-line comments or small descriptive
comments above the related functions/blocks (e.g., above the shared helpers
block or near functions/classes that follow those dividers). Specifically,
search for the repeated pattern used as section dividers in this file (and in
transformers_vlm.py) and either delete the divider lines or convert them into
concise comments that name the subsequent block (for example, a single-line
comment like "# Shared helpers" above the helper functions) so the intent
remains clear without using the prohibited section-divider style.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agents-core/vision_agents/core/agents/agents.py`:
- Around line 315-320: The method _flush_streaming_tts_buffer is missing a
return type annotation; update its signature in agents.CoreVisionAgent (function
_flush_streaming_tts_buffer) to include an explicit -> None return type (async
def _flush_streaming_tts_buffer(self) -> None:) and ensure any related type
hints remain consistent with project guidelines.
- Around line 133-136: The constructor (__init__) docstring for the agent class
in agents.py is missing the streaming_tts parameter in the Args section; update
the Google-style docstring for the __init__ method to include a short entry for
streaming_tts (bool) explaining that it streams TTS sentences from
LLMResponseChunkEvent to reduce perceived latency, matching the other parameter
entries and style used in the Args section.
- Around line 366-381: Add a -> None return type annotation to the method
_flush_streaming_tts_buffer and update the class __init__ docstring to include
an Args entry describing the streaming_tts parameter (name, type, purpose) to
satisfy type and docstring guidelines; in the streaming TTS chunk handling (the
loop that inspects buf in the method that appends to _streaming_tts_buffer and
calls await self.tts.send(self._sanitize_text(to_send))), harden the sentence
boundary heuristic to avoid splitting on common abbreviations (e.g., check
preceding token against an abbreviation list or require a capital letter after
the space/newline) before treating ".!?" + space/newline as a boundary, and
ensure the code still trims and lstrips the remaining _streaming_tts_buffer and
only calls self.tts.send when to_send.strip() is non-empty.

In `@plugins/huggingface/pyproject.toml`:
- Around line 24-27: The extras entry "transformers-quantized" currently pins
bitsandbytes as "bitsandbytes>=0.41.0", which silently breaks on macOS because
bitsandbytes did not provide macOS wheels until 0.49.0; fix by either bumping
the minimum to "bitsandbytes>=0.49.0" in the "transformers-quantized" extra to
ensure macOS support, or restrict the extra to supported platforms by adding an
explicit platform marker (e.g., only include bitsandbytes for linux/windows) and
update the package README to document the CUDA/MPS and platform constraints for
the "transformers-quantized" extra.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 374-384: run_generation currently only catches RuntimeError so
other exceptions (MemoryError, ValueError, OSError, etc.) are swallowed or
propagate; change it to catch Exception (which excludes SystemExit and
KeyboardInterrupt), set generation_error = e for any Exception, call
logger.exception with the exception, and ensure the async sentinel is always
posted exactly once (avoid queuing a second None when the success path already
emitted one via _AsyncBridgeStreamer.on_finalized_text); similarly update
_generate_non_streaming to wrap the model.generate call in a try/except
Exception block (catching non-RuntimeError exceptions), convert the exception
into the same LLM error path (set generation_error or raise a mapped
LLMErrorEvent/return an error response consistent with streaming flow) and log
it via logger.exception so non-RuntimeError failures are reported instead of
leaking or being ignored.
- Around line 524-532: The loop over tools mutates the caller's schema by
calling params.setdefault(...) on the params object which may be a direct
reference to t.get("parameters_schema") or t.get("parameters"); instead, make a
shallow (or deep if nested) copy of params before mutating (e.g., replace the
current params assignment with a copied version) so modifications to
params.setdefault("type", "object") and params.setdefault("properties", {}) do
not alter the original ToolSchema referenced by tools/t; update the code around
the variables tools, t, params, parameters_schema to operate on the copied dict.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`:
- Line 293: Accessing processor.tokenizer can raise AttributeError for some
AutoProcessor types; update the code around processor.tokenizer and pad_token_id
(used after AutoProcessor.from_pretrained) to defensively handle missing
tokenizer by wrapping the access in a try/except AttributeError and setting
pad_token_id = None on failure; ensure you reference processor.tokenizer and
pad_token_id in the fix so the rest of the function (e.g., any calls that rely
on pad_token_id) continues to work with a None fallback.
- Around line 305-324: The try/except around awaiting
asyncio.to_thread(_do_generate) only catches RuntimeError so other exceptions
from _do_generate (e.g., MemoryError, ValueError, OSError) will propagate out of
simple_response; update the exception handling to catch Exception (or the
appropriate broad base class) around the asyncio.to_thread(_do_generate) call,
log the full exception via logger.exception, and send the same VLMErrorEvent and
events.LLMErrorEvent (using VLMErrorEvent, events.LLMErrorEvent, inference_id)
before returning LLMResponseEvent(original=None, text="") so all errors from
model.generate() are handled consistently.

---

Duplicate comments:
In `@plugins/huggingface/tests/test_transformers_llm.py`:
- Line 4: Tests in this file violate the "never mock in tests" guideline by
using MagicMock for _make_mock_tokenizer, _make_mock_model, and _make_resources;
replace those mock-based helpers with real lightweight fixtures or factory
functions that construct real tokenizer/model/resource objects (or reuse the
real-path setup used by TestTransformersLLMIntegration) so unit tests exercise
actual behavior; remove the MagicMock import and update any tests calling
_make_mock_tokenizer/_make_mock_model/_make_resources to use the new real
factories/pytest fixtures, ensuring the tests remain fast by using minimal toy
models or shared test fixtures.

In `@plugins/huggingface/tests/test_transformers_vlm.py`:
- Line 5: The tests in TestTransformersVLM use unittest.mock.MagicMock for the
processor and model (the MagicMock instances created in that class), which
violates the "Never mock in tests" guideline; replace those MagicMock usages
with small, deterministic fake objects or minimal real implementations (e.g.,
lightweight fake Processor and FakeModel classes or pytest fixtures that return
real minimal behavior used by the tests) and update the test methods to call the
same methods/properties on those fakes (match names used in the tests such as
any process/forward/infer methods referenced in TestTransformersVLM) so the
tests exercise real logic instead of mocks and align with the pattern used by
TestTransformersVLMIntegration.
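
The mock-to-fake replacement both duplicate comments ask for can be sketched like this. These class and method names are illustrative stand-ins, not the plugin's actual interfaces:

```python
# Small deterministic fakes instead of MagicMock, per the
# "never mock in tests" guideline. Names are illustrative.
from dataclasses import dataclass, field
from typing import Any, Dict, List


class FakeTokenizer:
    """Minimal stand-in exposing only the attributes the tests touch."""

    pad_token_id = 0
    eos_token_id = 1

    def apply_chat_template(self, messages: List[Dict[str, str]], **kwargs: Any) -> str:
        # Deterministic output the tests can assert on directly.
        return "\n".join(m["content"] for m in messages)


@dataclass
class FakeModel:
    """Records generate() calls so tests can assert on real behavior."""

    calls: List[Dict[str, Any]] = field(default_factory=list)

    def generate(self, **kwargs: Any) -> List[int]:
        self.calls.append(kwargs)
        return [1, 2, 3]
```

Unlike a MagicMock, these fakes fail loudly when the code under test calls a method they don't define, and their recorded calls support real assertions.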

---

Nitpick comments:
In `@agents-core/vision_agents/core/agents/agents.py`:
- Around line 373-376: The current sentence-boundary loop that sets boundary
when buf[i] in ".!?" and buf[i+1] in " \n" produces false positives on common
abbreviations; update the heuristic inside that loop (the code that updates
boundary using buf and the for-loop) to skip treating a period+space as a
sentence end when the next non-space character is lowercase or when the token
before the period matches a short-abbreviation blacklist (e.g.,
{"Mr","Mrs","Ms","Dr","Prof","Sr","Jr","vs","e.g","i.e"}); implement by scanning
ahead from i+1 to find the next non-space character and checking its isupper()
status and/or checking the word before the period against the blacklist before
assigning boundary.
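
A minimal sketch of the hardened heuristic described above, combining the abbreviation blacklist with the capital-letter look-ahead. The function name and abbreviation set are illustrative, not taken from the PR:

```python
# Hypothetical hardened sentence-boundary check. Skips boundaries after
# known abbreviations and requires the next sentence to start uppercase.
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "Prof", "Sr", "Jr", "vs", "e.g", "i.e"}


def find_sentence_boundary(buf: str) -> int:
    """Return the index of the last safe sentence terminator, or -1."""
    boundary = -1
    for i in range(len(buf) - 1):
        if buf[i] not in ".!?" or buf[i + 1] not in " \n":
            continue
        # Skip if the token before a period is a known abbreviation.
        word = buf[:i].rsplit(None, 1)[-1] if buf[:i].strip() else ""
        if buf[i] == "." and word.rstrip(".") in ABBREVIATIONS:
            continue
        # Require the next non-space character to start a new sentence.
        j = i + 1
        while j < len(buf) and buf[j] in " \n":
            j += 1
        if j < len(buf) and not buf[j].isupper():
            continue
        boundary = i
    return boundary
```

This keeps "Dr. Smith" intact while still splitting on ordinary terminators; the caller can slice at `boundary + 1` exactly as the existing buffer code does.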

In `@plugins/huggingface/examples/inference_api/inference_api_example.py`:
- Line 32: Update the stale docstring for create_agent to reflect the module
rename to "HuggingFace Inference API Example": locate the create_agent function
and replace the current docstring "Create the agent with HuggingFace LLM." with
a concise, accurate description such as "Create the agent using the HuggingFace
Inference API" (or similar wording that mentions the Inference API) so the
function-level documentation matches the module title.

In `@plugins/huggingface/pyproject.toml`:
- Around line 19-23: Update the explicit accelerate dependency floor in
plugins/huggingface/pyproject.toml: the current declaration "accelerate>=0.25.0"
is too permissive given the 1.0.0 breaking changes, so bump the lower bound to a
1.x release (e.g., "accelerate>=1.0.0" or the current stable
"accelerate>=1.12.0") to avoid surprising installs; modify the accelerate entry
in the transformers list to the chosen tightened version string.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`:
- Around line 684-694: Remove the explicit del statements in unload() that
delete attributes on self._resources (the lines deleting self._resources.model
and self._resources.tokenizer) and simply set self._resources = None; update the
unload() method so it logs and nulls out the ModelResources reference
(self._resources) directly, then runs gc.collect() and CUDA cache clearing as
before—this removes use of attribute deletion while preserving the original
cleanup behavior in the unload method.
- Around line 59-61: Remove the forbidden section-divider comments (the "#
---...---" pattern) throughout this module and transformers_vlm.py and replace
them with simple, guideline-compliant single-line comments or small descriptive
comments above the related functions/blocks (e.g., above the shared helpers
block or near functions/classes that follow those dividers). Specifically,
search for the repeated pattern used as section dividers in this file (and in
transformers_vlm.py) and either delete the divider lines or convert them into
concise comments that name the subsequent block (for example, a single-line
comment like "# Shared helpers" above the helper functions) so the intent
remains clear without using the prohibited section-divider style.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`:
- Around line 426-436: The unload method in TransformersVLM currently deletes
attributes on the _resources object before nulling it, which is redundant;
update the TransformersVLM.unload implementation to stop calling del on
self._resources.model and self._resources.processor and instead simply set
self._resources = None to release references, keeping the surrounding logging,
gc.collect(), and CUDA cache clearing intact so the cleanup behavior remains the
same.

Comment on lines +133 to +136
# Send text to TTS as sentences stream from the LLM rather than
# waiting for the complete response. Reduces perceived latency for
# non-realtime LLMs that emit LLMResponseChunkEvent.
streaming_tts: bool = False,

⚠️ Potential issue | 🟡 Minor

streaming_tts is absent from the __init__ docstring's Args section.

📝 Proposed fix
             broadcast_metrics_interval: Interval in seconds between metric broadcasts.
             multi_speaker_filter: Audio filter for handling overlapping speech from
                 multiple participants.
                 Takes effect only more than one participant is present.
                 Defaults to `FirstSpeakerWinsFilter`, which uses VAD to lock onto
                 the first participant who starts speaking and drops audio from
                 everyone else until the active speaker's turn ends, or they go
                 silent.
+            streaming_tts: If True, LLM text chunks are buffered and forwarded
+                to TTS at sentence boundaries instead of waiting for the full
+                response. Reduces perceived latency for non-realtime LLMs that
+                emit LLMResponseChunkEvent. Requires a TTS instance to be
+                configured. Defaults to False.
         """

As per coding guidelines, "Use Google style docstrings and keep them short" — all constructor parameters must appear in the Args section.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Send text to TTS as sentences stream from the LLM rather than
# waiting for the complete response. Reduces perceived latency for
# non-realtime LLMs that emit LLMResponseChunkEvent.
streaming_tts: bool = False,
broadcast_metrics_interval: Interval in seconds between metric broadcasts.
multi_speaker_filter: Audio filter for handling overlapping speech from
multiple participants.
Takes effect only more than one participant is present.
Defaults to `FirstSpeakerWinsFilter`, which uses VAD to lock onto
the first participant who starts speaking and drops audio from
everyone else until the active speaker's turn ends, or they go
silent.
streaming_tts: If True, LLM text chunks are buffered and forwarded
to TTS at sentence boundaries instead of waiting for the full
response. Reduces perceived latency for non-realtime LLMs that
emit LLMResponseChunkEvent. Requires a TTS instance to be
configured. Defaults to False.
"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 133 - 136, The
constructor (__init__) docstring for the agent class in agents.py is missing the
streaming_tts parameter in the Args section; update the Google-style docstring
for the __init__ method to include a short entry for streaming_tts (bool)
explaining that it streams TTS sentences from LLMResponseChunkEvent to reduce
perceived latency, matching the other parameter entries and style used in the
Args section.

Comment on lines +315 to +320
async def _flush_streaming_tts_buffer(self):
"""Send any remaining text in the streaming TTS buffer."""
remaining = self._streaming_tts_buffer.strip()
self._streaming_tts_buffer = ""
if remaining and self.tts:
await self.tts.send(self._sanitize_text(remaining))

⚠️ Potential issue | 🟡 Minor

Missing -> None return type annotation on _flush_streaming_tts_buffer.

🛠️ Proposed fix
-    async def _flush_streaming_tts_buffer(self):
+    async def _flush_streaming_tts_buffer(self) -> None:

As per coding guidelines, "Use type annotations everywhere."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 315 - 320, The
method _flush_streaming_tts_buffer is missing a return type annotation; update
its signature in agents.CoreVisionAgent (function _flush_streaming_tts_buffer)
to include an explicit -> None return type (async def
_flush_streaming_tts_buffer(self) -> None:) and ensure any related type hints
remain consistent with project guidelines.

Comment on lines +366 to +381
if not self.tts or not event.delta:
return
if event.is_first_chunk:
self._streaming_tts_buffer = ""
self._streaming_tts_buffer += event.delta
# Send complete sentences to TTS immediately
buf = self._streaming_tts_buffer
boundary = -1
for i in range(len(buf) - 1):
if buf[i] in ".!?" and buf[i + 1] in " \n":
boundary = i
if boundary >= 0:
to_send = buf[: boundary + 1]
self._streaming_tts_buffer = buf[boundary + 1 :].lstrip()
if to_send.strip():
await self.tts.send(self._sanitize_text(to_send))

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

echo "=== Locating LLMResponseChunkEvent definition ==="
find . -type f -name "*.py" | xargs grep -l "class LLMResponseChunkEvent" | head -5

echo ""
echo "=== LLMResponseChunkEvent class definition and is_first_chunk ==="
find . -type f -name "*.py" -exec grep -l "class LLMResponseChunkEvent" {} \; | head -1 | xargs cat | head -100

echo ""
echo "=== All references to is_first_chunk in codebase ==="
rg -n "is_first_chunk" --type py

echo ""
echo "=== _flush_streaming_tts_buffer method signature ==="
rg -n "_flush_streaming_tts_buffer" -A 5 --type py

echo ""
echo "=== __init__ docstring in agents.py ==="
sed -n '115,180p' agents-core/vision_agents/core/agents/agents.py

Repository: GetStream/Vision-Agents

Length of output: 10756


Add return type annotation and docstring entry for streaming_tts parameter.

The is_first_chunk attribute exists on LLMResponseChunkEvent (line 122 of llm/events.py) and is properly used—no AttributeError concern.

However, two guideline violations remain:

  1. _flush_streaming_tts_buffer() at line 315 lacks a -> None return type annotation (required by "use type annotations everywhere").
  2. The __init__ docstring (lines 143–170) is missing an Args entry for the streaming_tts parameter (declared at line 133), which violates the Google style docstring guideline.

The sentence boundary heuristic at line 375 (buf[i] in ".!?" and buf[i + 1] in " \n") will incorrectly split on abbreviations like "Dr. " or "e.g. ", causing mid-sentence TTS calls; consider a more robust check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agents-core/vision_agents/core/agents/agents.py` around lines 366 - 381, Add
a -> None return type annotation to the method _flush_streaming_tts_buffer and
update the class __init__ docstring to include an Args entry describing the
streaming_tts parameter (name, type, purpose) to satisfy type and docstring
guidelines; in the streaming TTS chunk handling (the loop that inspects buf in
the method that appends to _streaming_tts_buffer and calls await
self.tts.send(self._sanitize_text(to_send))), harden the sentence boundary
heuristic to avoid splitting on common abbreviations (e.g., check preceding
token against an abbreviation list or require a capital letter after the
space/newline) before treating ".!?" + space/newline as a boundary, and ensure
the code still trims and lstrips the remaining _streaming_tts_buffer and only
calls self.tts.send when to_send.strip() is non-empty.

Comment on lines +24 to +27
transformers-quantized = [
"vision-agents-plugins-huggingface[transformers]",
"bitsandbytes>=0.41.0",
]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

bitsandbytes macOS Apple Silicon support version history

💡 Result:

bitsandbytes macOS (Apple Silicon) support — version history (high level)

  • ≤ 0.42.0 (Jan 8, 2024): macOS/Apple Silicon was not officially supported (docs at the time say only CUDA; “M1 chips (MacOS) is coming soon”). The PyPI artifact for 0.42.0 was a platform-agnostic wheel (py3-none-any), so it could install on macOS even though GPU/MPS support wasn’t there. [1], [2]
  • 0.43.0 (Mar 8, 2024): users reported macOS (arm64) not supported for their use case; some projects worked around this by downgrading to 0.42.0. [3], [2]
  • 0.44.0 (Sep 24, 2024): no Mac wheel was published; on macOS, pip install bitsandbytes would resolve to 0.42.0 because 0.44.0 shipped only Windows/Linux wheels. [4], [2]
  • Planning note (Aug 29, 2024): maintainers stated Apple Silicon support was planned ~Q4/2024–Q2/2025 (seeking contributors). [5]
  • 0.49.0 (Dec 11, 2025): macOS 14+ wheels began being published. Release notes also state 4-bit and 8-bit quantization on MPS is supported via “slow implementations” (with plans for faster Metal kernels later). [6], [2]
  • 0.49.1 (Jan 8, 2026) and newer: continues the post-0.49.0 era where macOS wheels exist (see PyPI release history). [2], [6]

Sources: PyPI release history [2]; historical docs for 0.42.0 [1]; macOS install/resolution issues around 0.43–0.44 [3], [4]; Apple Silicon support planning discussion [5]; macOS wheel introduction in 0.49.0 release notes [6].


🌐 Web query:

bitsandbytes 0.41.0 0.44.0 release notes macOS support

💡 Result:

bitsandbytes 0.41.0 (Jul 22, 2023): release notes / macOS support

  • macOS support: Not supported. The 0.41.0 PyPI description states bitsandbytes is “currently only supported on Linux distributions” (and also says Windows is not supported). So macOS wheels/support are not provided in 0.41.0. [1]

bitsandbytes 0.44.0 (Sep 24, 2024): release notes / macOS support

  • Release notes / highlights (0.44.0):
    • New optimizer AdEMAMix (+ 8bit and paged variants)
    • 8-bit optimizer block size reduced 2048 → 256
    • Fix for CUDA Graphs capture support
    • Embedding4bit / Embedding8bit (embedding quantization)
    • “Continuous builds” for binary wheels on main [2]
  • macOS support (0.44.0):
    • A GitHub issue notes that pip on Mac installs an older version (e.g., 0.42) because 0.44.0 wheels were published only for Windows and Linux, i.e. no macOS wheel for 0.44.0. [3]
    • Hugging Face’s bitsandbytes docs for the 0.44.x line explicitly say “MacOS support is still a work in progress.” [4]

Sources:
[1] bitsandbytes 0.41.0 on PyPI (pypi.org)
[2] bitsandbytes v0.44.0 announcement/discussion (github.com)
[3] GitHub issue “Release v44 not available for Mac” (github.com)
[4] Hugging Face docs (v0.44.1) installation page (huggingface.co)


🌐 Web query:

bitsandbytes latest version 2024 2025 platform support

💡 Result:

Latest bitsandbytes versions (2024–2025) + current latest

  • Latest overall (stable on PyPI): 0.49.1 (Jan 8, 2026). [1]
  • Latest released in 2025: 0.49.0 (Dec 11, 2025). [1]
  • Latest released in 2024: 0.45.0 (Dec 5, 2024) (with 0.44.1 on Sep 30, 2024). [1]

Platform / accelerator support (as of current project docs)

Minimum requirements

  • Python 3.10+
  • PyTorch 2.3+ [1]

OS support listed

  • Linux (glibc ≥ 2.24): x86_64 and aarch64. [1]
  • Windows 11 / Windows Server 2022+: x86_64. [1]
  • macOS 14+: arm64 (Apple Silicon); MPS (Metal) support is marked slow. [1]

Accelerator support highlights (from the project’s support matrix / release notes)

  • NVIDIA CUDA GPUs (SM60+ minimum; SM75+ recommended) across Linux/Windows; 8-bit optimizers supported on CUDA. [1]
  • Intel GPUs (XPU) supported on Linux and Windows (introduced as an “official support” highlight in 0.48.0). [2]
  • Intel Gaudi2/Gaudi3 (HPU) supported for some features; optimizers not implemented (per 0.48.0 notes / matrix). [1] [2]
  • CPU backend supported for int8 / 4-bit quantization; 8-bit optimizers not supported on CPU per the matrix. [1]

If you tell me your OS + GPU (NVIDIA/AMD/Intel/Apple) + PyTorch version, I can map it to the exact row in the current support table.

Sources: PyPI release history + requirements/support matrix [1]; GitHub release notes (0.48.0 Intel GPU/Gaudi support) [2].


bitsandbytes platform limitations not guarded — macOS installs will silently break.

The bitsandbytes library did not ship macOS wheels until version 0.49.0 (Dec 11, 2025). The lower bound of 0.41.0 explicitly documents Linux-only support at that time, and even versions through 0.48.x lack macOS artifacts. A developer on macOS installing the transformers-quantized extra will either fail to install a compatible wheel or get a non-functional package.

Bumping the minimum to 0.44.0 will not resolve this, as 0.44.0 has no macOS wheel either. The first version with official macOS 14+ arm64 support and MPS quantization is 0.49.0. Consider one of the following:

  1. Bump the minimum to 0.49.0 without a platform guard (to support macOS users):

    transformers-quantized = [
        "vision-agents-plugins-huggingface[transformers]",
    -    "bitsandbytes>=0.41.0",
    +    "bitsandbytes>=0.49.0",
    ]
  2. Or restrict to Linux/Windows with an explicit platform marker and document the CUDA-only constraint prominently:

    transformers-quantized = [
        "vision-agents-plugins-huggingface[transformers]",
    -    "bitsandbytes>=0.41.0",
    +    "bitsandbytes>=0.41.0; sys_platform != 'darwin'",
    ]
📝 Committable suggestion


Suggested change
transformers-quantized = [
"vision-agents-plugins-huggingface[transformers]",
"bitsandbytes>=0.41.0",
]
transformers-quantized = [
"vision-agents-plugins-huggingface[transformers]",
"bitsandbytes>=0.49.0",
]
Suggested change
transformers-quantized = [
"vision-agents-plugins-huggingface[transformers]",
"bitsandbytes>=0.41.0",
]
transformers-quantized = [
"vision-agents-plugins-huggingface[transformers]",
"bitsandbytes>=0.41.0; sys_platform != 'darwin'",
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/pyproject.toml` around lines 24 - 27, The extras entry
"transformers-quantized" currently pins bitsandbytes as "bitsandbytes>=0.41.0",
which silently breaks on macOS because bitsandbytes did not provide macOS wheels
until 0.49.0; fix by either bumping the minimum to "bitsandbytes>=0.49.0" in the
"transformers-quantized" extra to ensure macOS support, or restrict the extra to
supported platforms by adding an explicit platform marker (e.g., only include
bitsandbytes for linux/windows) and update the package README to document the
CUDA/MPS and platform constraints for the "transformers-quantized" extra.

Comment on lines +374 to +384
def run_generation() -> None:
nonlocal generation_error
try:
with torch.no_grad():
model.generate(**generate_kwargs) # type: ignore[operator]
except RuntimeError as e:
generation_error = e
logger.exception("Generation failed")
finally:
# Unblock the async consumer so it doesn't hang forever
loop.call_soon_threadsafe(async_queue.put_nowait, None)

⚠️ Potential issue | 🟠 Major

Non-RuntimeError exceptions from model.generate() are silently swallowed in streaming mode, and propagate unchecked in non-streaming mode.

In run_generation, only RuntimeError is caught. Other realistic failures — MemoryError (CPU OOM), ValueError (invalid generation parameters), OSError (disk/cache errors) — escape the except clause. The finally block correctly unblocks the async consumer via the sentinel, but generation_error stays None, so no LLMErrorEvent is emitted and _generate_streaming returns an empty LLMResponseEvent indistinguishable from a legitimate empty response.

In _generate_non_streaming (line 474), the same narrow except RuntimeError means non-RuntimeError exceptions propagate unhandled through await asyncio.to_thread(...) back to the caller.

Additionally, on the success path the finally block queues a second None sentinel after _AsyncBridgeStreamer.on_finalized_text(stream_end=True) already queued one. The orphan is harmless — the consumer breaks on the first — but it indicates the guard wasn't written with the success path in mind.

🐛 Proposed fix for run_generation
+        generation_complete = False
+
         def run_generation() -> None:
-            nonlocal generation_error
+            nonlocal generation_error, generation_complete
             try:
                 with torch.no_grad():
                     model.generate(**generate_kwargs)  # type: ignore[operator]
+                generation_complete = True
-            except RuntimeError as e:
+            except (RuntimeError, MemoryError, ValueError, OSError) as e:
                 generation_error = e
                 logger.exception("Generation failed")
             finally:
-                # Unblock the async consumer so it doesn't hang forever
-                loop.call_soon_threadsafe(async_queue.put_nowait, None)
+                if not generation_complete:
+                    loop.call_soon_threadsafe(async_queue.put_nowait, None)
🐛 Proposed fix for _generate_non_streaming
-        except RuntimeError as e:
+        except (RuntimeError, MemoryError, ValueError, OSError) as e:

Also applies to: 472-474

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 374 - 384, run_generation currently only catches RuntimeError so
other exceptions (MemoryError, ValueError, OSError, etc.) are swallowed or
propagate; change it to catch Exception (not
SystemExit/KeyboardInterrupt), set generation_error = e for any Exception, call
logger.exception with the exception, and ensure the async sentinel is always
posted exactly once (avoid queuing a second None when the success path already
emitted one via _AsyncBridgeStreamer.on_finalized_text); similarly update
_generate_non_streaming to wrap the model.generate call in a try/except
Exception block (catching non-RuntimeError exceptions), convert the exception
into the same LLM error path (set generation_error or raise a mapped
LLMErrorEvent/return an error response consistent with streaming flow) and log
it via logger.exception so non-RuntimeError failures are reported instead of
leaking or being ignored.
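
The "sentinel exactly once" pattern can be sketched independently of transformers. Everything below (the function names, the `emit` callback protocol) is illustrative, not the plugin's actual code:

```python
# Sketch: run blocking generation in a thread, catch Exception (never
# SystemExit/KeyboardInterrupt), and enqueue the None sentinel exactly
# once even when the streamer's success path already emitted it.
import asyncio
from typing import Callable, List, Optional


async def stream_generation(
    generate: Callable[[Callable[[Optional[str]], None]], None],
) -> str:
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    error: Optional[Exception] = None
    sentinel_sent = False

    def emit(token: Optional[str]) -> None:
        nonlocal sentinel_sent
        if token is None:
            if sentinel_sent:
                return  # success path already closed the stream
            sentinel_sent = True
        loop.call_soon_threadsafe(queue.put_nowait, token)

    def run_generation() -> None:
        nonlocal error
        try:
            generate(emit)  # the streamer calls emit(None) on success
        except Exception as e:  # deliberately broad, per the review
            error = e
        finally:
            emit(None)  # no-op if the sentinel was already sent

    worker = asyncio.create_task(asyncio.to_thread(run_generation))
    parts: List[str] = []
    while (token := await queue.get()) is not None:
        parts.append(token)
    await worker
    if error is not None:
        raise error
    return "".join(parts)


def fake_streamer(emit: Callable[[Optional[str]], None]) -> None:
    for token in ("Hello", " ", "world"):
        emit(token)
    emit(None)  # success sentinel, analogous to stream_end=True
```

The guard in `emit` is what prevents the orphan second sentinel the review flags, while the `finally` clause still guarantees the consumer is unblocked on any failure path.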

Comment on lines +524 to +532
result: List[Dict[str, Any]] = []
for t in tools or []:
name = t.get("name", "unnamed_tool")
description = t.get("description", "") or ""
params = t.get("parameters_schema") or t.get("parameters") or {}
if not isinstance(params, dict):
params = {}
params.setdefault("type", "object")
params.setdefault("properties", {})

⚠️ Potential issue | 🟡 Minor

setdefault mutates the caller's tool-schema dict in-place.

params may be a direct reference to the parameters_schema or parameters value inside a ToolSchema. Calling params.setdefault(...) silently modifies the original dict. The first call is idempotent, but it still reaches into the caller's data structure.

🛡️ Proposed fix — copy before mutating
-        params = t.get("parameters_schema") or t.get("parameters") or {}
+        params = dict(t.get("parameters_schema") or t.get("parameters") or {})
         if not isinstance(params, dict):
             params = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_llm.py`
around lines 524 - 532, The loop over tools mutates the caller's schema by
calling params.setdefault(...) on the params object which may be a direct
reference to t.get("parameters_schema") or t.get("parameters"); instead, make a
shallow (or deep if nested) copy of params before mutating (e.g., replace the
current params assignment with a copied version) so modifications to
params.setdefault("type", "object") and params.setdefault("properties", {}) do
not alter the original ToolSchema referenced by tools/t; update the code around
the variables tools, t, params, parameters_schema to operate on the copied dict.
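
The copy-before-mutate fix can be demonstrated in isolation. The output wrapper shape below is an assumption for illustration, not necessarily the plugin's exact format:

```python
# Sketch: normalize tool schemas without mutating the caller's dicts.
from typing import Any, Dict, List


def convert_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    result: List[Dict[str, Any]] = []
    for t in tools or []:
        raw = t.get("parameters_schema") or t.get("parameters") or {}
        # dict() makes a shallow copy, so setdefault below cannot reach
        # into the caller's ToolSchema; use copy.deepcopy if nested
        # values are also mutated downstream.
        params = dict(raw) if isinstance(raw, dict) else {}
        params.setdefault("type", "object")
        params.setdefault("properties", {})
        result.append(
            {
                "type": "function",
                "function": {
                    "name": t.get("name", "unnamed_tool"),
                    "description": t.get("description", "") or "",
                    "parameters": params,
                },
            }
        )
    return result
```

After this change the caller's schema is observably untouched: keys added by `setdefault` appear only in the copy.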

processor = self._resources.processor
model = self._resources.model

pad_token_id = processor.tokenizer.pad_token_id

⚠️ Potential issue | 🟡 Minor

processor.tokenizer access may raise AttributeError for some processor types.

AutoProcessor.from_pretrained returns different concrete processor classes depending on the model. While most VLM processors expose .tokenizer, it is not part of the AutoProcessor public contract. A defensive fallback preserves the intent without relying on a structural assumption:

tokenizer = getattr(processor, "tokenizer", None)
pad_token_id = tokenizer.pad_token_id if tokenizer is not None else None

However, since the guidelines discourage getattr, the idiomatic alternative is:

try:
    pad_token_id = processor.tokenizer.pad_token_id
except AttributeError:
    pad_token_id = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py` at
line 293, Accessing processor.tokenizer can raise AttributeError for some
AutoProcessor types; update the code around processor.tokenizer and pad_token_id
(used after AutoProcessor.from_pretrained) to defensively handle missing
tokenizer by wrapping the access in a try/except AttributeError and setting
pad_token_id = None on failure; ensure you reference processor.tokenizer and
pad_token_id in the fix so the rest of the function (e.g., any calls that rely
on pad_token_id) continues to work with a None fallback.

Comment on lines +305 to +324
try:
outputs = await asyncio.to_thread(_do_generate)
except RuntimeError as e:
logger.exception("VLM generation failed")
self.events.send(
VLMErrorEvent(
plugin_name=PLUGIN_NAME,
inference_id=inference_id,
error=e,
context="generation",
)
)
self.events.send(
events.LLMErrorEvent(
plugin_name=PLUGIN_NAME,
error_message=str(e),
event_data=e,
)
)
return LLMResponseEvent(original=None, text="")

⚠️ Potential issue | 🟠 Major

Non-RuntimeError exceptions from model.generate() propagate unhandled to callers of simple_response.

asyncio.to_thread re-raises any exception thrown inside _do_generate on the event-loop thread. Only RuntimeError is caught; a MemoryError, ValueError, or OSError will propagate unhandled through simple_response without emitting a VLMErrorEvent.

🐛 Proposed fix
-        except RuntimeError as e:
+        except (RuntimeError, MemoryError, ValueError, OSError) as e:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@plugins/huggingface/vision_agents/plugins/huggingface/transformers_vlm.py`
around lines 305 - 324, The try/except around awaiting
asyncio.to_thread(_do_generate) only catches RuntimeError so other exceptions
from _do_generate (e.g., MemoryError, ValueError, OSError) will propagate out of
simple_response; update the exception handling to catch Exception (or the
appropriate broad base class) around the asyncio.to_thread(_do_generate) call,
log the full exception via logger.exception, and send the same VLMErrorEvent and
events.LLMErrorEvent (using VLMErrorEvent, events.LLMErrorEvent, inference_id)
before returning LLMResponseEvent(original=None, text="") so all errors from
model.generate() are handled consistently.
