SupplyHub Backend - Current State

Document reconciliation service for foreign-trade paperwork. FastAPI delivers both JSON APIs and a minimal Jinja web UI, with Celery workers orchestrating OCR, LLM-based field extraction, review, and validation workflows.

Status Snapshot

Working: batch CRUD plus upload (app/api/routes/batches.py:28), background pipeline fallback when Celery is unavailable (app/services/pipeline.py:281), report generation (app/services/reporting.py:34), validation rules persisted to DB (app/services/validation.py:940), HTML review UI (app/api/routes/web.py:31).
Partially implemented: OCR integrates the local dots.ocr + vLLM runtime with stub fallback (app/services/ocr.py:17); the JSON filler adapter still needs real API keys and otherwise returns deterministic stubs (app/services/json_filler.py:17). Confidence scoring is random and unsuitable for production (app/services/confidence.py:9). Document classification only covers a subset of DocumentType values (app/services/classification.py:9).
Missing: production-ready frontend, authentication and authorization, migrations, automated tests, monitoring dashboards, hardened AI integrations.

Architecture Overview

API entrypoint mounts batch storage and routers (app/main.py:13); /ping provides basic health (app/main.py:22).
Config and infrastructure sourced from Settings (app/core/config.py:7); storage helpers create /srv/supplyhub/batches/{id} tree for raw, derived, and report assets (app/core/storage.py:22).
Data model (Batch, Document, FilledField, Validation, SystemStatusSnapshot) defined with SQLAlchemy (app/models.py:20).
Schemas and validation baseline enumerate expected fields per document type (app/core/schema.py:9) and feed the rule engine (app/services/validation.py:60).
Services layer
- Batch lifecycle utilities with PDF splitter (app/services/batches.py:59).
- Pipeline orchestrating OCR -> classification -> JSON fill -> scoring -> persistence (app/services/pipeline.py:78) and validation/reporting loop (app/services/pipeline.py:262).
- Manual review helpers (app/services/review.py:32) and system metrics snapshotting (app/services/status.py:12, app/services/status.py:50).
- Stubbed confidence scoring (app/services/confidence.py:9) and regex-based classification (app/services/classification.py:9).
Workers boot Celery with Redis broker and backend (app/workers/celery_app.py:9, app/workers/tasks.py:17). Failover runs tasks inline via asyncio if Celery dispatch fails (app/services/pipeline.py:281).
Frontend surface served via FastAPI templating (app/api/routes/web.py:31, app/templates/batch.html) for upload, batch view, and manual field confirmation; text is currently Russian and styling minimal.
Documentation assets: high-level overview (docs/OVERVIEW.md), detailed Russian-language spec (SPEC.md). Integration scripts for OCR and LLM testing live under test_ocr/.
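As a rough illustration of the stage ordering described for the services layer (OCR -> classification -> JSON fill -> scoring -> persistence), the sketch below chains plain functions over a per-document state object. Every name here (DocumentState, run_ocr, classify, fill_json, score, persist, run_pipeline) is hypothetical and the stage bodies are toy stand-ins; the real logic lives in app/services/pipeline.py and may be structured quite differently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass
class DocumentState:
    """Hypothetical carrier for per-document results as they move through stages."""
    name: str
    ocr_text: str = ""
    doc_type: str = "UNKNOWN"
    fields: Dict[str, str] = field(default_factory=dict)
    confidence: Dict[str, float] = field(default_factory=dict)
    status: str = "NEW"


Stage = Callable[[DocumentState], DocumentState]


def run_ocr(doc: DocumentState) -> DocumentState:
    # Toy stand-in: a real stage would call the OCR runtime.
    doc.ocr_text = f"INVOICE No. 42 for {doc.name}"
    return doc


def classify(doc: DocumentState) -> DocumentState:
    # Keyword check, loosely mirroring the regex-based classifier mentioned above.
    doc.doc_type = "INVOICE" if "INVOICE" in doc.ocr_text.upper() else "UNKNOWN"
    return doc


def fill_json(doc: DocumentState) -> DocumentState:
    # Toy stand-in for LLM-based field extraction.
    doc.fields = {"number": "42"} if doc.doc_type == "INVOICE" else {}
    return doc


def score(doc: DocumentState) -> DocumentState:
    # Deterministic placeholder: non-empty values get full confidence.
    doc.confidence = {key: (1.0 if value else 0.0) for key, value in doc.fields.items()}
    return doc


def persist(doc: DocumentState) -> DocumentState:
    # A real stage would write to the database; here only the status changes.
    doc.status = "PROCESSED"
    return doc


STAGES: Sequence[Stage] = (run_ocr, classify, fill_json, score, persist)


def run_pipeline(doc: DocumentState, stages: Sequence[Stage] = STAGES) -> DocumentState:
    """Apply each stage in order, passing the state along."""
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(DocumentState(name="sample.pdf"))
```

Keeping the stages as an ordered sequence of callables makes it straightforward to swap a stub stage for a real one in tests.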

Data Flow

Create batch -> batches.create_batch() (app/api/routes/batches.py:28) persists a row and storage folders (app/services/batches.py:59).
Upload files -> sanitized, optionally split per PDF page, stored under raw/ (app/services/batches.py:81).
A process request enqueues a Celery task; if dispatch fails, the fallback runs the pipeline locally (app/services/pipeline.py:281). The pipeline writes OCR payloads and flattened filled fields, and updates document status (app/services/pipeline.py:118, app/services/pipeline.py:230).
Review endpoints surface low-confidence and missing fields (app/api/routes/batches.py:64, app/services/review.py:32) sorted by confidence.
Manual edits create new FilledField versions (app/services/review.py:80). Completion triggers validation (app/services/pipeline.py:262) and report generation with warnings merged into validations (app/services/reporting.py:34).
Reports and archive retrieval read from the filesystem (app/services/reports.py:9, app/api/routes/archive.py:17). /system/status returns DB stats plus the latest snapshot (app/api/routes/system.py:14).
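The enqueue-or-run-inline behaviour in the flow above can be sketched as follows. This is a minimal illustration, not the code at app/services/pipeline.py:281: dispatch, process_batch_inline, and the enqueue callable (standing in for something like a Celery task's delay method) are hypothetical names.

```python
import asyncio
from typing import Callable


async def process_batch_inline(batch_id: str) -> str:
    """Hypothetical local pipeline; a real one would run OCR, extraction, and so on."""
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    return f"processed:{batch_id}"


def dispatch(batch_id: str, enqueue: Callable[[str], object]) -> str:
    """Try to hand the batch to the queue; run it inline if enqueueing fails."""
    try:
        enqueue(batch_id)
        return "queued"
    except Exception:
        # Broad catch for the sketch only; real code would likely narrow this
        # to broker and connection errors and log the failure.
        return asyncio.run(process_batch_inline(batch_id))


def broken_enqueue(batch_id: str) -> None:
    # Simulates an unreachable broker.
    raise ConnectionError("broker unreachable")


queued = dispatch("batch-1", lambda batch_id: None)
fallback = dispatch("batch-1", broken_enqueue)
```

Note that asyncio.run raises RuntimeError when called from a thread that already has a running event loop, so inside an async FastAPI handler the fallback would need to await the coroutine or schedule it as a task instead.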

Directory Layout

app/
  api/             # FastAPI routers and schemas
  core/            # config, storage, enums, document schema
  mock_services/   # ChatGPT OCR and JSON filler adapters plus templates
  scripts/         # init_db helper
  services/        # business logic (pipeline, validation, review, etc.)
  templates/       # HTML views for manual review
  workers/         # Celery wiring
docs/              # human-facing docs
test_ocr/          # local adapters smoke tests plus sample assets
Dockerfile, docker-compose.yml, SPEC.md, pyproject.toml

Configuration and Deployment

Runtime dependencies are declared in pyproject.toml; the Docker image installs the package in editable mode for live reload.
docker-compose.yml provisions Postgres, Redis, API, worker, init-db, and shared storage volume; the worker expects the JSON filler adapter on host port 9002 (OCR now runs in-process).
Environment variables prefixed SUPPLYHUB_ configure DB, Redis, Celery, paths, and stub toggles (app/core/config.py:7). SUPPLYHUB_USE_STUB_SERVICES=1 forces internal mocks.
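To make the prefix convention concrete, here is a standard-library sketch of reading SUPPLYHUB_-prefixed variables. It is not the Settings class from app/core/config.py, which may well be built on a settings library. Only SUPPLYHUB_USE_STUB_SERVICES and SUPPLYHUB_OCR_ENDPOINT appear in this document; the other variable names and all default values are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "SUPPLYHUB_"


@dataclass(frozen=True)
class SketchSettings:
    """Hypothetical settings holder; field names are illustrative."""
    database_url: str
    redis_url: str
    ocr_endpoint: Optional[str]
    use_stub_services: bool


def _as_bool(raw: str) -> bool:
    # Accept a few common truthy spellings; anything else is False.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> SketchSettings:
    """Build settings from a mapping (defaults to the process environment)."""
    source = os.environ if env is None else env

    def get(key: str, default: str = "") -> str:
        return source.get(PREFIX + key, default)

    return SketchSettings(
        # DATABASE_URL and REDIS_URL are assumed names, not confirmed by the README.
        database_url=get("DATABASE_URL", "postgresql://localhost/supplyhub"),
        redis_url=get("REDIS_URL", "redis://localhost:6379/0"),
        ocr_endpoint=get("OCR_ENDPOINT") or None,
        use_stub_services=_as_bool(get("USE_STUB_SERVICES", "0")),
    )


stubbed = load_settings({"SUPPLYHUB_USE_STUB_SERVICES": "1"})
defaults = load_settings({})
```

Passing an explicit mapping keeps the loader easy to test without mutating the real environment.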

External and Mock Services

OCR pipeline runs in-process via app/services/ocr.py and app/services/dots_ocr_adapter.py, talking to the vLLM container directly (HTTP fallback is only used when SUPPLYHUB_OCR_ENDPOINT is set). The JSON filler adapter (app/mock_services/chatgpt_json_filler.py:57) targets OpenRouter/OpenAI but falls back to deterministic stubs loaded from app/mock_services/docs_json/.
CLI smoke tests (test_ocr/testim.py, test_ocr/test_json_filler.py) can hit adapters or external endpoints directly.
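The fall-back-to-stub behaviour of the JSON filler adapter might look roughly like the sketch below. It assumes one stub file per document type named after the lowercased type (for example invoice.json); that naming, the fill_fields signature, and the injected call_llm callable are all hypothetical and are not taken from app/mock_services/chatgpt_json_filler.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical signature: (doc_type, ocr_text, api_key) -> extracted fields
LlmCall = Callable[[str, str, str], dict]


def fill_fields(
    doc_type: str,
    ocr_text: str,
    stub_dir: Path,
    api_key: Optional[str] = None,
    call_llm: Optional[LlmCall] = None,
) -> dict:
    """Use the live extractor when configured; otherwise load a deterministic stub."""
    if api_key and call_llm is not None:
        try:
            return call_llm(doc_type, ocr_text, api_key)
        except Exception:
            pass  # sketch only: fall through to the stub on any failure
    stub_path = stub_dir / f"{doc_type.lower()}.json"
    if stub_path.is_file():
        return json.loads(stub_path.read_text(encoding="utf-8"))
    return {}  # no stub available for this document type


# Demonstration with a throwaway stub directory.
with tempfile.TemporaryDirectory() as tmp:
    stub_dir = Path(tmp)
    (stub_dir / "invoice.json").write_text(
        json.dumps({"invoice_number": "STUB-001"}), encoding="utf-8"
    )
    stub_result = fill_fields("INVOICE", "raw ocr text", stub_dir)
    live_result = fill_fields(
        "INVOICE",
        "raw ocr text",
        stub_dir,
        api_key="dummy-key",
        call_llm=lambda doc_type, text, key: {"invoice_number": "A-17"},
    )
    missing_result = fill_fields("PACKING_LIST", "raw ocr text", stub_dir)
```

Because the stub path depends only on the document type, repeated runs without an API key return identical output, which is what makes the stubs usable in smoke tests.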

Current Gaps and Risks

AI quality: classification misses the PROFORMA and SPECIFICATION types, confidence is random, OCR and LLM prompts are not tuned, and extraction has no grounding in the document schema beyond the template merge.
Frontend UX: Jinja pages lack pagination, diffing, previews, keyboard shortcuts, or upload progress; no localization toggle or auth.
Ops: no Alembic migrations, tests, tracing, or rate limiting; error messages partly in Russian, inconsistent logging; warnings stored in batch meta but not surfaced elsewhere.
Security and compliance: unauthenticated endpoints, no audit logging, static file serving exposes the entire storage root.
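One common way to narrow the storage-exposure risk noted above is to resolve every requested file against a specific base directory and reject anything that escapes it. The helper below is a generic sketch of that check, not code from this repository; resolve_under and the example paths are hypothetical, and Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_under(base: Path, requested: str) -> Path:
    """Return the resolved path for `requested`, refusing paths outside `base`."""
    base_resolved = base.resolve()
    # Joining with an absolute `requested` discards the base, so the check
    # below also rejects absolute paths such as "/etc/passwd".
    candidate = (base_resolved / requested).resolve()
    if not candidate.is_relative_to(base_resolved):
        raise PermissionError(f"path escapes storage base: {requested!r}")
    return candidate


report_dir = Path("/srv/supplyhub/batches/b1/report")
allowed = resolve_under(report_dir, "summary.pdf")

try:
    resolve_under(report_dir, "../../b2/raw/contract.pdf")
    escaped = True
except PermissionError:
    escaped = False
```

A check like this complements authentication rather than replacing it: it limits which files a request can reach, not who may make the request.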

Backlog and Next Steps

Frontend

Replace Jinja views with a modern SPA (or richer server-rendered UI) supporting live status, previews, validation drilldown, and multi-user review.
Implement authentication, session management, and role-based access.
Improve localization (RU and EN), accessibility, and error messaging consistency.

AI and Document Processing

Build a deterministic confidence model (for example leverage OCR confidence and schema heuristics) to replace random scoring (app/services/confidence.py:9).
Extend classification beyond regex keywords (app/services/classification.py:9) using ML or template matching; cover all DocumentType values.
Integrate production OCR and LLM endpoints with retry, backoff, payload chunking, token budgets, and observability.
Enhance schema handling for nested product tables and support multi-page or multi-product outputs; add a normalization layer before validation.
Capture provenance metadata (model version, prompt ID) alongside FilledField records.
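As one possible shape for the deterministic confidence model proposed above, the sketch below combines an OCR-reported confidence with a simple schema heuristic (a regular expression the value is expected to match). The function name, the halving penalty, and the date pattern are illustrative assumptions, not a design taken from the codebase.

```python
import re
from typing import Optional


def field_confidence(
    value: Optional[str],
    ocr_confidence: float,
    pattern: Optional[str] = None,
) -> float:
    """Score a field deterministically from OCR confidence and a format check."""
    if value is None or not value.strip():
        return 0.0  # missing or blank values get no confidence
    # Clamp the OCR confidence into [0, 1] in case the engine reports out-of-range values.
    score = max(0.0, min(1.0, ocr_confidence))
    if pattern is not None and not re.fullmatch(pattern, value.strip()):
        score *= 0.5  # assumed penalty when the value breaks the expected format
    return round(score, 3)


ISO_DATE = r"\d{4}-\d{2}-\d{2}"  # hypothetical schema hint for a date field

well_formed = field_confidence("2024-01-31", 0.9, ISO_DATE)
malformed = field_confidence("31 Jan", 0.9, ISO_DATE)
blank = field_confidence("   ", 0.9, ISO_DATE)
```

Unlike random scoring, the same inputs always produce the same score, so review queues sorted by confidence stay stable between runs.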

Platform and Quality

Introduce migrations (Alembic) and seed scripts; add unit and integration tests (pytest plus httpx) and CI.
Harden storage (per-tenant directories, antivirus, cleanup of derived artifacts).
Add structured logging, metrics, and alerting around Celery queues and pipeline failures.
Implement concurrency safeguards (idempotent processing, deduplication) and batch cancellation flows.
Document API schema via OpenAPI tags and provide a Postman or Insomnia collection.

Artifacts referenced above:

app/main.py:13
app/core/config.py:7
app/core/storage.py:22
app/models.py:20
app/core/schema.py:9
app/api/routes/batches.py:28
app/api/routes/web.py:31
app/services/batches.py:59
app/services/pipeline.py:78
app/services/pipeline.py:118
app/services/pipeline.py:230
app/services/pipeline.py:262
app/services/pipeline.py:281
app/services/classification.py:9
app/services/confidence.py:9
app/services/json_filler.py:17
app/services/ocr.py:17
app/services/review.py:32
app/services/review.py:80
app/services/validation.py:940
app/services/reporting.py:34
app/services/status.py:12
app/api/routes/archive.py:17
app/api/routes/system.py:14
app/services/reports.py:9
docs/OVERVIEW.md
SPEC.md
