Production-style AI platform lab for DevOps workflows with split inference architecture, RAG retrieval, model routing/fallback, and Kubernetes-native deployment patterns.
This project also includes a CI/CD failure-analysis agent with metadata-aware log ingestion and a focused operational response mode.
- `rag-service`: public API for ingest/retrieve/ask/chat, dedicated CI/CD log ingestion, and in-memory cost aggregation.
- `inference-router`: internal generation gateway with provider abstraction (`openai`, `ollama`, `mock`), fallback logic, retries, JSON logs, and Prometheus metrics.
- `deploy/k8s`: namespace, config, secret, deployments, services, PVC, and HPAs.
- `dashboard`: Grafana dashboards (v1, v2, v3) plus screenshot galleries for observability and CI/CD analysis scenarios.
- `eval`: golden prompt harness for regression checks (`eval/run_eval.py`, `eval/golden.json`).
- `docs`: architecture and routing design notes.
- `app.py`: legacy monolith prototype kept for reference.
```
devops-genai/
├── README.md
├── app.py
├── requests.http
├── ROADMAP.md
├── dashboard/
│   ├── grafana-dashboard-1.jpg
│   ├── grafana-dashboard-2.jpg
│   ├── grafana-dashboard-3.jpg
│   ├── grafana-dashboard-v1.json
│   ├── grafana-dashboard-v2.json
│   ├── grafana-dashboard-v3.json
│   ├── response-scenario-1.jpg
│   ├── response-scenario-2.jpg
│   └── response-scenario-3.jpg
├── deploy/k8s/
│   ├── namespace.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── pvc.yaml
│   ├── rag-deployment.yaml
│   ├── rag-service.yaml
│   ├── router-deployment.yaml
│   ├── router-service.yaml
│   └── hpa.yaml
├── docs/
│   ├── architecture_diagram.png
│   ├── inference-architecture.md
│   └── multi-model-routing.md
├── eval/
│   ├── golden.json
│   └── run_eval.py
└── services/
    ├── inference-router/
    │   ├── Dockerfile
    │   ├── requirements.txt
    │   └── app/
    └── rag-service/
        ├── Dockerfile
        ├── requirements.txt
        └── app/
```
- Accepts raw docs/logs via `/ingest`.
- Applies content-aware chunking (`docs` vs `logs`).
- Generates embeddings (`text-embedding-3-small`).
- Stores chunks + metadata in ChromaDB.
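The chunking step above can be sketched as follows. This is a simplified illustration: the function name, chunk sizes, and splitting heuristics are assumptions, not the service's actual implementation.

```python
def chunk_text(text: str, content_type: str = "docs") -> list[str]:
    """Split text with a strategy keyed to content type (illustrative sizes)."""
    if content_type == "logs":
        # Logs: group whole lines so stack traces and events stay intact.
        lines = text.splitlines()
        size = 40  # lines per chunk
        return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    # Docs: split on blank-line paragraph boundaries, pack into ~1000-char chunks.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) > 1000 and current:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```

Keeping log lines whole matters because a stack trace split mid-frame retrieves poorly.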
- Accepts CI/CD logs via `/ingest-log`.
- Forces `content_type="logs"` and defaults source to `cicd` when missing.
- Persists CI/CD metadata fields for filtered retrieval: `repo`, `pipeline`, `environment`, `status`, `workflow`, `service_name`.
This enables targeted incident queries like: “show failed deploy logs for payments-api in dev.”
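The normalization rules above can be sketched as a small helper. The function name and payload shape are illustrative assumptions; only the forced `content_type`, the `cicd` default, and the metadata field names come from the behavior described here.

```python
CICD_FIELDS = ("repo", "pipeline", "environment", "status", "workflow", "service_name")

def normalize_log_payload(payload: dict) -> dict:
    """Apply the /ingest-log rules: force logs content type, default the
    source, and keep only the CI/CD fields used for filtered retrieval."""
    meta = {
        "content_type": "logs",                     # always forced on this endpoint
        "source": payload.get("source") or "cicd",  # default when missing
    }
    for field in CICD_FIELDS:
        if payload.get(field) is not None:
            meta[field] = payload[field]
    return meta
```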
- Embeds the question.
- Retrieves top-k chunks from ChromaDB with optional metadata filters.
- Builds a citation-aware prompt template.
- Calls `inference-router` (`/v1/generate`).
- Returns answer, retrieved chunks, token usage, and endpoint cost fields.
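The citation-aware prompt step can be illustrated with a minimal sketch; the service's real template and chunk schema may differ.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a prompt with numbered context blocks the model can cite as [n]."""
    context = "\n\n".join(
        f"[{i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite supporting chunks as [n].\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering chunks in the prompt is what lets the response carry verifiable citations back to retrieved evidence.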
- Uses a CI/CD-specific system prompt and response template.
- Separates the immediate failure from the likely underlying cause.
- Produces concise, operations-first output with the sections: Immediate Failure, Likely Underlying Cause, Evidence, First 3 Checks, Suggested Fix, Confidence.
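The fixed section order can be sketched as a simple renderer (the helper name and markdown heading style are illustrative; only the section names come from the template described above):

```python
SECTIONS = (
    "Immediate Failure", "Likely Underlying Cause", "Evidence",
    "First 3 Checks", "Suggested Fix", "Confidence",
)

def format_cicd_response(fields: dict) -> str:
    """Render the operations-first template in fixed order; missing sections
    get an explicit placeholder rather than being dropped."""
    return "\n".join(f"## {name}\n{fields.get(name, 'n/a')}" for name in SECTIONS)
```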
- `model_hint` starts with `gpt*` → `openai` adapter.
- `model_hint` starts with `llama*`, `phi*`, `mistral*`, `qwen*`, `gemma*` → `ollama` adapter.
- `model_hint` starts with `mock*` → `mock` adapter.
- `model_hint == OLLAMA_DEFAULT_MODEL` → `ollama` adapter.
- Otherwise → `ROUTER_DEFAULT_PROVIDER`.
- On primary failure, the router optionally retries with `ROUTER_FALLBACK_PROVIDER` when enabled.
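The routing rules above condense into a small dispatch function. This is a sketch; the router's actual code may structure the checks differently, and the default values shown are the ones from the example `.env`.

```python
OLLAMA_PREFIXES = ("llama", "phi", "mistral", "qwen", "gemma")

def route(model_hint: str,
          ollama_default: str = "llama3.2:1b",
          default_provider: str = "openai") -> str:
    """Map a model hint to a provider adapter per the routing rules."""
    hint = (model_hint or "").lower()
    if hint.startswith("gpt"):
        return "openai"
    if hint.startswith(OLLAMA_PREFIXES) or model_hint == ollama_default:
        return "ollama"
    if hint.startswith("mock"):
        return "mock"
    return default_provider
```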
| Provider | Status | Notes |
|---|---|---|
| OpenAI | ✅ Supported | Default cloud inference path (gpt* model hints). |
| Ollama | ✅ Supported | Local inference via OLLAMA_BASE_URL and OLLAMA_DEFAULT_MODEL. |
| Mock | ✅ Supported | Lightweight fallback/testing provider (mock* hints). |
- `GET /healthz`
- `POST /chat`
- `POST /ingest`
- `POST /ingest-log`
- `POST /ask`
- `GET /sources`
- `GET /ingested`
- `GET /costs`
- `DELETE /reset?confirm=true`
- `DELETE /delete_source?source=...`
- `GET /healthz`
- `GET /metrics`
- `POST /v1/generate`
```sh
python -m venv .venv
source .venv/Scripts/activate
```

Create `.env` in repo root:

```
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_MODEL_DEFAULT=gpt-4o-mini
OPENAI_EMBED_MODEL=text-embedding-3-small
CHROMA_DIR=./chroma_db
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_DEFAULT_MODEL=llama3.2:1b
ROUTER_DEFAULT_PROVIDER=openai
ROUTER_ENABLE_FALLBACK=true
ROUTER_FALLBACK_PROVIDER=mock
INFERENCE_ROUTER_URL=http://localhost:8001
INFERENCE_TIMEOUT_S=30
```

Install dependencies:

```sh
pip install -r services/inference-router/requirements.txt
pip install -r services/rag-service/requirements.txt
```

Terminal A (inference-router):

```sh
cd services/inference-router
uvicorn app.main:app --host 0.0.0.0 --port 8001 --reload
```

Terminal B (rag-service):

```sh
cd services/rag-service
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

Use the `requests.http` file for quick endpoint testing.
Ingest CI/CD logs with metadata:

```sh
curl -X POST http://localhost:8000/ingest-log \
  -H "Content-Type: application/json" \
  -d '{
    "source": "github-actions",
    "text": "<paste your failed pipeline logs here>",
    "repo": "payments-api",
    "pipeline": "deploy",
    "environment": "dev",
    "status": "failed",
    "workflow": "deploy.yml",
    "service_name": "payments-api"
  }'
```

Run CI/CD failure analysis:

```sh
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Why did this deployment fail and what should I check first?",
    "top_k": 5,
    "source": "github-actions",
    "repo": "payments-api",
    "pipeline": "deploy",
    "environment": "dev",
    "status": "failed",
    "analysis_mode": "cicd",
    "model_hint": "gpt-4o-mini"
  }'
```

Build images:
```sh
docker build -t rag-service:local services/rag-service
docker build -t inference-router:local services/inference-router
```

Run example:

```sh
docker run --rm -p 8001:8000 --env-file .env inference-router:local
docker run --rm -p 8000:8000 --env-file .env -e INFERENCE_ROUTER_URL=http://host.docker.internal:8001 rag-service:local
```

Apply resources:
```sh
kubectl apply -f deploy/k8s/namespace.yaml
kubectl apply -f deploy/k8s/configmap.yaml
kubectl apply -f deploy/k8s/secret.yaml
kubectl apply -f deploy/k8s/pvc.yaml
kubectl apply -f deploy/k8s/router-deployment.yaml
kubectl apply -f deploy/k8s/router-service.yaml
kubectl apply -f deploy/k8s/rag-deployment.yaml
kubectl apply -f deploy/k8s/rag-service.yaml
kubectl apply -f deploy/k8s/hpa.yaml
```

Notes:

- Namespace: `devops-genai`
- `rag-service` mounts PVC `chroma-pvc` at `/data/chroma_db`
- HPA configured for both deployments (`min=1`, `max=3`, CPU target `60%`)
- `secret.yaml` must be updated with a valid `OPENAI_API_KEY`
- `inference_requests_total{provider,purpose,model}`
- `inference_failures_total{provider,purpose,failure_stage}`
- `inference_latency_seconds{provider,purpose,model}`
- `inference_input_tokens_total{provider,purpose,model}`
- `inference_output_tokens_total{provider,purpose,model}`
- `dashboard/grafana-dashboard-v1.json`
- `dashboard/grafana-dashboard-v2.json`
- `dashboard/grafana-dashboard-v3.json`
- Local Grafana home: http://localhost:3000
- Suggested dashboard path after import: http://localhost:3000/dashboards
- JSON: [Dashboard V1](dashboard/grafana-dashboard-v1.json)
- JSON: [Dashboard V2](dashboard/grafana-dashboard-v2.json)
- JSON: [Dashboard V3](dashboard/grafana-dashboard-v3.json)
| Dashboard View 1 | Dashboard View 2 | Dashboard View 3 |
|---|---|---|
| ![](dashboard/grafana-dashboard-1.jpg) | ![](dashboard/grafana-dashboard-2.jpg) | ![](dashboard/grafana-dashboard-3.jpg) |
This project now supports a CI/CD-focused analysis workflow that combines:
- log-first retrieval
- metadata-scoped filtering
- operational response formatting
- evidence-backed recommendations with citations
| Scenario 1 | Scenario 2 | Scenario 3 |
|---|---|---|
| ![](dashboard/response-scenario-1.jpg) | ![](dashboard/response-scenario-2.jpg) | ![](dashboard/response-scenario-3.jpg) |
Golden tests call POST /ask and validate answer shape/content and citations.
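One such golden check can be sketched as follows. The `expect_substrings` and `require_citation` field names and the `[1]` citation convention are assumptions for illustration; the real `eval/golden.json` schema may differ.

```python
def run_golden_case(case: dict, answer: str) -> list[str]:
    """Return failure messages for one golden case (empty list = pass)."""
    failures = []
    # Shape/content check: every expected phrase must appear in the answer.
    for needle in case.get("expect_substrings", []):
        if needle.lower() not in answer.lower():
            failures.append(f"missing expected text: {needle!r}")
    # Citation check: the answer must reference at least one retrieved chunk.
    if case.get("require_citation") and "[1]" not in answer:
        failures.append("answer has no citation marker")
    return failures
```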
```sh
python eval/run_eval.py
```

Optional custom file:

```sh
python eval/run_eval.py eval/golden.json
```

- `docs/inference-architecture.md`
- `docs/multi-model-routing.md`
- `ROADMAP.md`
- `OPENAI_API_KEY` not set: provide the key in `.env` (local) or `deploy/k8s/secret.yaml` (K8s).
- `/ask` returns "Insufficient context": ingest relevant docs/logs first via `/ingest`.
- CI/CD query returns weak/no evidence: use `/ingest-log` and include matching metadata (`repo`, `pipeline`, `environment`, `status`).
- Router 502 errors: inspect router logs for primary/fallback failure events.
- No autoscaling in cluster: verify metrics-server is installed for HPA.






