feat(pipeline): strict Pydantic validation, background embeddings jobs + status API, optional Prometheus metrics & structured logging #33
Conversation
…nput validation, and metrics tracking
Walkthrough
Introduces an asynchronous, job-based processing workflow in api/endpoints/internal.py. Adds background task scheduling, in-memory job store with locking, optional Prometheus metrics, input validation updates, a new job status endpoint, and modular helpers for job lifecycle and processing. The primary processing endpoint now schedules work and returns a job_id.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Client
participant API as Internal API
participant BG as BackgroundTasks
participant JB as JobStore (in-memory)
participant PROC as _background_process
Client->>API: POST /process_material (request)
API->>JB: _create_job_record(status="queued", info)
API->>BG: add_task(_background_process, job_id, request)
API-->>Client: { status: "scheduled", job_id }
par Background execution
BG->>PROC: start(job_id, request)
PROC->>JB: _update_job(status="running", step="decode")
PROC->>PROC: decode input (base64/pdf/docx/text)
PROC->>JB: _update_job(step="extract_text")
PROC->>PROC: extract text (PDF/DOCX/text)
PROC->>JB: _update_job(step="chunk_embed")
PROC->>PROC: chunk + embed (VectorDB stub)
PROC->>JB: _update_job(step="draft", progress)
PROC->>PROC: draft generation
PROC->>JB: _update_job(status="completed", result=metadata)
and Error path
PROC-->>JB: _update_job(status="failed", error)
end
Client->>API: GET /job/{job_id}
API->>JB: read job state
API-->>Client: { status, step, error?, result? }
```
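The diagram refers to an in-memory job store and an _update_job helper that are not shown in this excerpt. A minimal sketch consistent with those calls (field names taken from the diagram, everything else assumed) might look like:

```python
import threading
import time
from typing import Any, Dict

# Hedged sketch only: the module's real _jobs/_jobs_lock/_update_job are not shown here.
_jobs: Dict[str, Dict[str, Any]] = {}
_jobs_lock = threading.Lock()

def _update_job(job_id: str, **fields: Any) -> None:
    """Merge status/step/progress/error/result updates into the job record."""
    with _jobs_lock:
        job = _jobs.get(job_id)
        if job is None:
            return
        job.update(fields)
        job["updated_at"] = time.time()
```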
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~55 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Tip
👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```
Actionable comments posted: 1
🧹 Nitpick comments (6)
api/endpoints/internal.py (6)
19-23: Narrow Prometheus import exception to ImportError
Catching all Exceptions hides real import-time errors inside prometheus_client. Prefer ImportError.
```diff
-try:
-    from prometheus_client import Counter, Histogram
-except Exception:
+try:
+    from prometheus_client import Counter, Histogram
+except ImportError:
     Counter = None
     Histogram = None
```
200-205: Preserve exception context when raising HTTPException
Raise from the original exception to keep the traceback chain.
```diff
-    except Exception:
+    except Exception as e:
         logger.exception("Error scheduling background job")
         if INTERNAL_REQUESTS:
             INTERNAL_REQUESTS.labels(status="error").inc()
-        raise HTTPException(status_code=500, detail="Internal server error")
+        raise HTTPException(status_code=500, detail="Internal server error") from e
```
36-38: Avoid raising at import time if INTERNAL_API_KEY is missing
Import-time RuntimeError prevents app startup in tooling/tests and deviates from the PR’s “secure key verification in the new flow.” Do the check in the dependency.
```diff
-if not INTERNAL_API_KEY:
-    raise RuntimeError("INTERNAL_API_KEY environment variable is required for internal API authentication.")
-def verify_internal_api_key(api_key: str = Depends(api_key_header)):
+def verify_internal_api_key(api_key: str = Depends(api_key_header)):
     # Do not log or expose the secret
-    if not api_key or not INTERNAL_API_KEY or not secrets.compare_digest(api_key, INTERNAL_API_KEY):
+    expected = os.environ.get("INTERNAL_API_KEY")
+    if not api_key or not expected or not secrets.compare_digest(api_key, expected):
         raise HTTPException(
             status_code=status.HTTP_401_UNAUTHORIZED,
             detail="Invalid or missing internal API key."
         )
```

If you require hard-fail at startup in production, consider moving the check to an app startup event instead of import time.
Also applies to: 90-97
155-166: Catch and surface known service errors explicitly
To improve observability and user feedback, consider catching specific exceptions from parsing/chunking/embedding services (e.g., DocumentParseError, DocumentChunkError, AuthenticationError, RateLimitError) and updating the job with a precise error string, instead of a blanket Exception.
Do you want a targeted error mapping aligned with the uniform error schema mentioned in the PR?
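For illustration, such a mapping inside the background worker could look like the sketch below. The exception class names and their import locations are taken from this review comment and are assumptions, not the services' confirmed API; job_id, _update_job, and logger are the module's existing names.

```python
# Sketch of narrower error handling inside _background_process; adjust the
# exception imports to wherever the services actually define them.
from services.parsing_service import extract_text_from_pdf  # path shown in the code graph above

try:
    _update_job(job_id, step="extract_text")
    text = extract_text_from_pdf(decoded_bytes)
except DocumentParseError as exc:          # assumed parsing-service error type
    _update_job(job_id, status="failed", error=f"DOCUMENT_PARSE_ERROR: {exc}")
except (AuthenticationError, RateLimitError) as exc:  # assumed embedding-service error types
    _update_job(job_id, status="failed", error=f"EMBEDDING_SERVICE_ERROR: {exc}")
except Exception:
    # Last-resort guard so a job never stays stuck in "running".
    logger.exception("Unexpected failure in job %s", job_id)
    _update_job(job_id, status="failed", error="INTERNAL_ERROR")
```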
41-44: Thread-safety is OK; consider copying dict on read to avoid accidental mutation
job_status returns the internal dict; while FastAPI serializes it, returning a shallow copy can avoid accidental mutation if later code reuses the object.

```diff
-    return _jobs[job_id]
+    return dict(_jobs[job_id])
```

Also applies to: 207-213
52-52: Consider explicit response models
Using response_model=dict loses schema guarantees. Define small Pydantic response models for job creation and status to solidify your uniform error/response schema.
I can add JobStatus and JobAccepted models if you want.
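A rough sketch of what those two models could look like; the field set is an assumption derived from the job fields in the sequence diagram, not the file's current shape:

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel

class JobAccepted(BaseModel):
    """Response when a job is scheduled."""
    status: str = "scheduled"
    job_id: str

class JobStatus(BaseModel):
    """Response for the job status endpoint."""
    status: str                          # queued | running | completed | failed
    step: Optional[str] = None
    error: Optional[str] = None
    result: Optional[Dict[str, Any]] = None
```

The routes would then declare response_model=JobAccepted and response_model=JobStatus instead of dict.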
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
api/endpoints/internal.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/endpoints/internal.py (3)
services/parsing_service.py (2)
  extract_text_from_pdf (10-24)
  extract_text_from_docx (26-38)
services/chunking_service.py (1)
  chunk_text (9-39)
services/embedding_service.py (1)
  embed_texts (19-42)
🪛 Ruff (0.13.1)
api/endpoints/internal.py
21-21: Do not catch blind exception: Exception (BLE001)
65-65: Avoid specifying long messages outside the exception class (TRY003)
79-79: Abstract raise to an inner function (TRY301)
79-79: Avoid specifying long messages outside the exception class (TRY003)
82-82: Avoid specifying long messages outside the exception class (TRY003)
86-86: Avoid specifying long messages outside the exception class (TRY003)
197-197: Consider moving this statement to an else block (TRY300)
204-204: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
🔇 Additional comments (2)
api/endpoints/internal.py (2)
181-197: Add idempotency support for job scheduling (content_hash, model, tenant)
PR summary mentions idempotent enqueueing and retry/backoff. The current endpoint always creates a new job. Recommend deriving an idempotency key (e.g., SHA-256 of source_material + model + tenant) to return the existing job_id when re-submitted.
Would you like a follow-up patch that computes a stable key and de-duplicates in _create_job_record? If yes, confirm how to obtain tenant and model here.
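As a rough illustration (not a drop-in patch), the key derivation and de-duplication could look like this; how model and tenant reach this code is still the open question above, so they appear here as plain parameters. _jobs and _jobs_lock are the module's existing store and lock.

```python
import hashlib
import time
import uuid

def _idempotency_key(source_material: str, model: str, tenant: str) -> str:
    """Stable key over the inputs that define 'the same work'."""
    raw = f"{tenant}\x00{model}\x00{source_material}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

def _create_job_record(status: str, info: dict, idempotency_key: str) -> str:
    with _jobs_lock:
        for existing_id, job in _jobs.items():
            if job.get("idempotency_key") == idempotency_key:
                return existing_id  # re-submission: reuse the existing job
        job_id = str(uuid.uuid4())
        _jobs[job_id] = {
            "status": status,
            "info": info,
            "idempotency_key": idempotency_key,
            "created_at": time.time(),
        }
        return job_id
```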
11-11: Adopt Pydantic v2 validators with strict model config
- Project uses Pydantic v2 (requirements.txt pins pydantic>=2.6.0), so replace @validator with field_validator and add a model_validator(mode="after") for cross-field source_material checks (a sketch follows below).
- Add model_config = {"extra": "forbid", "str_strip_whitespace": True} to enforce strict validation.
- Read the MAX_BYTES env var (fallback to INTERNAL_MAX_BYTES) when validating payload size.
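A sketch of that Pydantic v2 shape is below. The field names (text, base64_blob), the env-var fallback chain, and the default size limit follow the PR description and this comment, so treat them as assumptions rather than the file's current definition.

```python
import base64
import os
from typing import Optional
from pydantic import BaseModel, field_validator, model_validator

# Fallback chain suggested above: MAX_BYTES, then INTERNAL_MAX_BYTES, then an arbitrary default.
MAX_BYTES = int(os.environ.get("MAX_BYTES", os.environ.get("INTERNAL_MAX_BYTES", "1048576")))

class SourceMaterialRequest(BaseModel):
    model_config = {"extra": "forbid", "str_strip_whitespace": True}

    text: Optional[str] = None
    base64_blob: Optional[str] = None

    @field_validator("base64_blob")
    @classmethod
    def _valid_base64(cls, v: Optional[str]) -> Optional[str]:
        if v is not None:
            base64.b64decode(v, validate=True)  # rejects malformed input early
        return v

    @model_validator(mode="after")
    def _check_source_material(self) -> "SourceMaterialRequest":
        if (self.text is None) == (self.base64_blob is None):
            raise ValueError("Provide exactly one of 'text' or 'base64_blob'.")
        payload = self.text if self.text is not None else self.base64_blob
        if len(payload.encode("utf-8")) > MAX_BYTES:
            raise ValueError("Payload exceeds MAX_BYTES.")
        return self
```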
@router.post("/process-material", response_model=dict) | ||
|
||
def process_material( | ||
request: SourceMaterialRequest, | ||
_: str = Depends(verify_internal_api_key) | ||
): | ||
def _create_job_record(status: str, info: Dict[str, Any]) -> str: | ||
job_id = str(uuid.uuid4()) | ||
with _jobs_lock: | ||
_jobs[job_id] = {"status": status, "info": info, "created_at": time.time()} | ||
return job_id | ||
|
Remove stray/duplicate route decorator for /process-material
@router.post("/process-material") above _create_job_record registers the wrong function as the endpoint and creates a duplicate route with the real process_material below. This will either conflict or bind the path to a private helper.
Apply this diff to remove the accidental decorator:
```diff
-@router.post("/process-material", response_model=dict)
 def _create_job_record(status: str, info: Dict[str, Any]) -> str:
     job_id = str(uuid.uuid4())
     with _jobs_lock:
         _jobs[job_id] = {"status": status, "info": info, "created_at": time.time()}
     return job_id
```
Also applies to: 181-187
🤖 Prompt for AI Agents
In api/endpoints/internal.py around lines 98-105 (and similarly at 181-187), a
stray @router.post("/process-material") decorator is applied to the private
helper _create_job_record, which unintentionally registers the helper as the
endpoint and creates a duplicate/incorrect route; remove the
@router.post("/process-material") decorator from _create_job_record (and the
duplicate occurrence at lines 181-187) so that only the actual process_material
function is registered for that path, leaving the helper as a plain function
used internally.
This PR hardens the internal processing pipeline with strict request validation, moves embedding generation to background jobs with a job-status API, and adds optional Prometheus metrics and structured logging for observability. Together, these changes improve safety, throughput, and operability without altering core product semantics.
New Features
Strict Pydantic validation for pipeline inputs
- Enforce model_config = {"extra": "forbid", "str_strip_whitespace": True} to reject unknown fields and trim unsafe whitespace.
- Custom validators for:
  - MAX_BYTES and per-item caps; produce clear 4xx error codes.
  - {text, base64_blob} with mutual exclusivity rules as needed.

Background embedding jobs

- Work is enqueued (embeddings.enqueue(...)) to decouple request latency from model runtime.
- Idempotent by a (content_hash, model, tenant) key to avoid duplicated compute.

Job-status API
- POST /v1/embeddings/jobs → returns {job_id}.
- GET /v1/embeddings/jobs/{job_id} → {state: queued|running|succeeded|failed, created_at, updated_at, error?, result?}.
- DELETE /v1/embeddings/jobs/{job_id} → best-effort cancel (if still queued/running).
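As a hypothetical usage sketch of this flow (base URL, auth header name, and request body fields are placeholders, not confirmed API details):

```python
import os
import time
import requests

BASE = os.environ.get("PIPELINE_BASE_URL", "http://localhost:8000")      # placeholder
HEADERS = {"X-Internal-API-Key": os.environ.get("INTERNAL_API_KEY", "")}  # header name assumed

# Submit a job, then poll its status until it reaches a terminal state.
job_id = requests.post(
    f"{BASE}/v1/embeddings/jobs",
    json={"text": "hello world"},  # field name assumed from {text, base64_blob} above
    headers=HEADERS,
    timeout=10,
).json()["job_id"]

while True:
    job = requests.get(f"{BASE}/v1/embeddings/jobs/{job_id}", headers=HEADERS, timeout=10).json()
    if job["state"] in ("succeeded", "failed"):
        break
    time.sleep(1.0)  # fixed interval; exponential backoff may be preferable
```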
Optional Prometheus metrics

- Counters: pipeline_requests_total{route,code}, embeddings_jobs_total{state}, pipeline_errors_total{type}.
- Histograms: pipeline_request_duration_seconds{route}, embeddings_job_duration_seconds{model}.
- Gauge: embeddings_jobs_inflight.
- Enabled with PIPELINE_METRICS_ENABLED=true.
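A minimal sketch of how these could be declared with prometheus_client, kept optional behind the same kind of import guard discussed in the review above; the metric names are the ones listed, while the Python variable names are hypothetical:

```python
try:
    from prometheus_client import Counter, Gauge, Histogram
except ImportError:
    Counter = Gauge = Histogram = None

if Counter is not None:
    PIPELINE_REQUESTS = Counter("pipeline_requests_total", "Pipeline requests", ["route", "code"])
    EMBEDDINGS_JOBS = Counter("embeddings_jobs_total", "Embedding jobs by terminal state", ["state"])
    PIPELINE_ERRORS = Counter("pipeline_errors_total", "Pipeline errors", ["type"])
    REQUEST_DURATION = Histogram("pipeline_request_duration_seconds", "Request duration", ["route"])
    JOB_DURATION = Histogram("embeddings_job_duration_seconds", "Embedding job duration", ["model"])
    JOBS_INFLIGHT = Gauge("embeddings_jobs_inflight", "Embedding jobs currently running")
else:
    # Metrics disabled or prometheus_client not installed: callers check for None.
    PIPELINE_REQUESTS = EMBEDDINGS_JOBS = PIPELINE_ERRORS = None
    REQUEST_DURATION = JOB_DURATION = JOBS_INFLIGHT = None
```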
Structured logging

- Structured log entries include request_id, job_id, tenant, route, state, and timing fields.
- Trace correlation (trace_id, span_id) if upstream provides them.
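One way to emit those fields with the standard library (a sketch only; the project's actual logging setup and PIPELINE_LOG_FIELDS handling are not shown in this PR excerpt):

```python
import json
import logging
import os

class JsonFormatter(logging.Formatter):
    """Serialize records plus the structured fields attached via `extra=`."""
    FIELDS = ("request_id", "job_id", "tenant", "route", "state", "duration_ms")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for name in self.FIELDS:
            payload[name] = getattr(record, name, None)  # absent fields fall back to None
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(os.environ.get("PIPELINE_LOG_LEVEL", "INFO"))

logger.info("job state change", extra={"job_id": "example-job-id", "state": "running"})
```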
Improvements

Input safety & clarity

- Uniform error schema: { "error": { "code", "message", "details" } }.
- Stable error codes: INVALID_BASE64, PAYLOAD_TOO_LARGE, UNSUPPORTED_FIELD, SCHEMA_VIOLATION.

Throughput & latency
Configurability
- MAX_BYTES, MAX_ITEMS_PER_REQUEST, EMBEDDINGS_RETRY_MAX_ATTEMPTS, EMBEDDINGS_BACKOFF_BASE_MS via env.
- PIPELINE_LOG_LEVEL and PIPELINE_LOG_FIELDS allow minimal/extended fields.

Operational readiness

- Idempotent re-submission returns the same job_id for identical content.

Bug Fixes
- Fixed off-by-one in size enforcement that allowed payloads exactly at MAX_BYTES+1.
- Resolved race condition where job could be marked failed after success on worker restart.
- Hardened base64 decoder to reject noncanonical encodings and silent truncation.
- Normalized HTTP statuses:
  - 400 schema/validation errors
  - 413 payload too large
  - 409 idempotency conflict with different payload for same key
  - 422 semantically invalid content

API Changes
New
- POST /v1/embeddings/jobs
- GET /v1/embeddings/jobs/{job_id}
- DELETE /v1/embeddings/jobs/{job_id}
Behavioral
- /v1/embeddings (if retained) now performs strict validation; may return new error codes above.

Error Schema
- Error responses carry a code and optional details.

Observability
Security & Privacy
- Strict request models (extra="forbid") to prevent mass-assignment style issues.

Performance Notes
- Idempotent re-submissions reuse the existing job_id rather than recomputing.

Migration Guide
- Set PIPELINE_METRICS_ENABLED=true and configure Prometheus scrape.
- Switch to POST /v1/embeddings/jobs + poll GET /.../{job_id} rather than synchronous embedding calls.
- Tune MAX_BYTES and MAX_ITEMS_PER_REQUEST to match your workload.

Summary by CodeRabbit