Revert "[lora][tinker] Add pause and resume for multi-tenant lora "#1678
Conversation
)" This reverts commit 77f92c2.
There was a problem hiding this comment.
Code Review
This pull request removes the transient per-LoRA targeted pause and resume functionality, including the lora_name parameter from generation control methods and the sample_with_retry logic. Associated server endpoints and test suites have also been deleted. Feedback identifies a potential issue in the sample method where removing dictionary guards for prompt and sampling_params could lead to AttributeError if those fields are explicitly provided as null.
| prompt = body.get("prompt", {}) | ||
|
|
||
| # Render prompt: flatten text tokens and, if images are present, | ||
| # call the render endpoint to get placeholder tokens + features. | ||
| token_ids, mm_features = await self._render_for_sample(prompt, session_id, model=model) | ||
|
|
||
| return await self._sample_with_rendered_tokens( | ||
| token_ids=token_ids, | ||
| mm_features=mm_features, | ||
| body=body, | ||
| session_id=session_id, | ||
| model=model, | ||
| ) | ||
|
|
||
| async def _sample_with_rendered_tokens( | ||
| self, | ||
| token_ids: List[int], | ||
| mm_features: Optional[MultiModalFeatures], | ||
| body: Dict[str, Any], | ||
| session_id: Optional[str], | ||
| model: str, | ||
| ) -> SampleResponse: | ||
| """Dispatch a single rendered sample request to /inference/v1/generate. | ||
|
|
||
| Shared between ``sample`` (one shot) and ``sample_with_retry`` (loops | ||
| on this method with accumulating ``token_ids`` and decremented | ||
| ``max_tokens``). No retry, no LoRA gating; those live in the caller. | ||
| """ | ||
| num_samples = body.get("num_samples", 1) | ||
| tinker_params = body.get("sampling_params", {}) or {} | ||
| tinker_params = body.get("sampling_params", {}) |
There was a problem hiding this comment.
The removal of the or {} guard for prompt and sampling_params reduces the robustness of the sample method. If the input JSON explicitly contains null for these fields, body.get(key, {}) will return None, leading to an AttributeError when .get() or other dictionary methods are called on them later (e.g., at line 635).
| prompt = body.get("prompt", {}) | |
| # Render prompt: flatten text tokens and, if images are present, | |
| # call the render endpoint to get placeholder tokens + features. | |
| token_ids, mm_features = await self._render_for_sample(prompt, session_id, model=model) | |
| return await self._sample_with_rendered_tokens( | |
| token_ids=token_ids, | |
| mm_features=mm_features, | |
| body=body, | |
| session_id=session_id, | |
| model=model, | |
| ) | |
| async def _sample_with_rendered_tokens( | |
| self, | |
| token_ids: List[int], | |
| mm_features: Optional[MultiModalFeatures], | |
| body: Dict[str, Any], | |
| session_id: Optional[str], | |
| model: str, | |
| ) -> SampleResponse: | |
| """Dispatch a single rendered sample request to /inference/v1/generate. | |
| Shared between ``sample`` (one shot) and ``sample_with_retry`` (loops | |
| on this method with accumulating ``token_ids`` and decremented | |
| ``max_tokens``). No retry, no LoRA gating; those live in the caller. | |
| """ | |
| num_samples = body.get("num_samples", 1) | |
| tinker_params = body.get("sampling_params", {}) or {} | |
| tinker_params = body.get("sampling_params", {}) | |
| prompt = body.get("prompt") or {} | |
| num_samples = body.get("num_samples", 1) | |
| tinker_params = body.get("sampling_params") or {} |
Reverts #1657, since we can skip pause/resume for multi-tenant lora after #1677