Enterprise AI Gateway with Intelligent Semantic Routing on AWS Bedrock
ModelMesh is a production-ready AI gateway built on LiteLLM that exposes a single OpenAI-compatible inference endpoint while intelligently routing requests across multiple AWS Bedrock foundation models using semantic embeddings. It enforces AI governance through PII detection and prompt sanitization via Microsoft Presidio before any request reaches a model.

    ┌─────────────────────────────────────┐
    │             API Client              │
    │   (OpenAI SDK / curl / any HTTP)    │
    └──────────────────┬──────────────────┘
                       │ POST /v1/chat/completions
                       ▼
    ┌─────────────────────────────────────┐
    │           LiteLLM Gateway           │  ← Centralized auth, routing, logging
    │             (port 4000)             │
    └──────┬───────────────────┬──────────┘
           │                   │
           ▼                   ▼
    ┌─────────────┐   ┌─────────────────┐
    │ PostgreSQL  │   │ Presidio Guards │  ← PII detection & anonymization
    │ (metadata)  │   │   Analyzer +    │    runs before every LLM call
    └─────────────┘   │   Anonymizer    │
                      └────────┬────────┘
                               │ sanitized prompt
                               ▼
                      ┌─────────────────┐
                      │ Semantic Router │  ← Cohere multilingual embeddings
                      │  (router.json)  │    cosine similarity matching
                      └────────┬────────┘
                               │ selected model
                               ▼
              ┌─────────────────────────────────┐
              │           AWS Bedrock           │
              │                                 │
              │ ├── Claude Sonnet 4 (reasoning) │
              │ ├── Gemma 3 27B (writing)       │
              │ ├── Llama 4 Maverick (creative) │
              │ └── Nova Lite (casual)          │
              └─────────────────────────────────┘

A single OpenAI-compatible endpoint (POST /v1/chat/completions) abstracts all backend foundation models. Clients using the OpenAI SDK require zero code changes — only the base_url and API key need updating. Centralized authentication, request logging, and model configuration are managed through LiteLLM backed by PostgreSQL.
Requests are not routed by static rules. Each incoming prompt is embedded using Cohere embed-multilingual-v3 via AWS Bedrock. The resulting vector is compared against a library of intent utterances defined in router.json using cosine similarity. The model whose utterances most closely match the request is automatically selected.
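The matching step described above can be illustrated with a minimal, dependency-free sketch. It assumes the prompt and the route utterances have already been embedded; the toy 3-dimensional vectors, the route names, and the choice to score a route by its single closest utterance are illustrative assumptions, not details taken from router.json.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def best_route(prompt_vec, route_utterance_vecs):
    """Return (route_name, score) for the route with the closest utterance."""
    best_name, best_score = None, -1.0
    for name, vecs in route_utterance_vecs.items():
        # Assumption: a route's score is the max over its utterances.
        score = max(cosine_similarity(prompt_vec, v) for v in vecs)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy vectors standing in for real embeddings (hypothetical values).
routes = {
    "reasoning": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "casual": [[0.0, 0.1, 0.9]],
}
name, score = best_route([1.0, 0.0, 0.0], routes)
print(name, round(score, 3))
```

In a real deployment the vectors would come from the embedding model and have far more dimensions, but the comparison works the same way.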
| Model | Inference Profile | Best For |
|---|---|---|
| Claude Sonnet 4 | us.anthropic.claude-sonnet-4-6 | Complex reasoning, architecture, code review, security analysis |
| Gemma 3 27B | google.gemma-3-27b-it | Documentation, summarization, business writing, translation |
| Llama 4 Maverick | meta.llama4-maverick-17b-instruct-v1:0 | Creative generation, marketing, ideation, blog writing |
| Nova Lite | us.amazon.nova-lite-v1:0 | Lightweight conversations, casual queries, simple factual Q&A |
All models are accessed through AWS Bedrock cross-region inference profiles, enabling automatic failover and load distribution across AWS regions.
Every request passes through a pre-call guardrail powered by Microsoft Presidio before it reaches any foundation model. Personally identifiable information is detected and anonymized, enforcing data privacy policy at the infrastructure level rather than the application level.
Detected and anonymized entity types include:
- Email addresses
- Phone numbers
- Credit card numbers
- Aadhaar numbers
- Social Security Numbers (SSN)
- Personal names

The gateway design provides the following benefits:
- Vendor abstraction — swap or add models without changing client code
- Cost optimization — lightweight models handle simple tasks; premium models handle complex ones
- Intelligent model selection — semantic routing eliminates manual model selection
- Centralized governance — all policy enforcement, authentication, and logging in one layer
- OpenAI SDK compatibility — no client-side migration required
- Cloud-native deployment — fully containerized, orchestrated via Docker Compose
ModelMesh/
├── config.yml # LiteLLM gateway config: master key, guardrail bindings, settings
├── docker-compose.yml # Orchestrates LiteLLM, PostgreSQL, Presidio Analyzer, Presidio Anonymizer
├── router.json # Semantic routing config: encoder, model routes, intent utterances
└── .env.example # Environment variable template
config.yml — Configures the LiteLLM proxy: master key (loaded from env), database persistence, and the presidio-pre-guard guardrail binding that runs Presidio in pre_call mode.
docker-compose.yml — Defines four containerized services on a shared litellm Docker network: the LiteLLM gateway, PostgreSQL for model/config persistence, Presidio Analyzer (port 5002), and Presidio Anonymizer (port 5001).
router.json — Defines the semantic routing engine. Specifies cohere.embed-multilingual-v3 as the encoder, then declares each model route with a name, set of representative utterances, and a similarity score threshold of 0.15.

Prerequisites:
- Docker and Docker Compose
- AWS account with Bedrock access enabled for the models listed above
- AWS credentials configured locally

git clone https://github.com/your-username/ModelMesh.git
cd ModelMesh
cp .env.example .env

Edit .env with your values:

LITELLM_MASTER_KEY=your-secure-master-key
POSTGRES_PASSWORD=your-postgres-password
DATABASE_URL=postgresql://litellm:your-postgres-password@postgres:5432/litellm

ModelMesh uses your local AWS credentials to call Bedrock. Ensure these are set:

export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1

Or use an IAM role if deploying on EC2/ECS.

docker network create litellm
docker compose up -d

docker compose ps
curl http://localhost:4000/health

Send any prompt to the smart-router model; the semantic router selects the best foundation model automatically.

curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-master-key" \
-d '{
"model": "smart-router",
"messages": [
{
"role": "user",
"content": "Design a high availability Kubernetes architecture for a payment processing system."
}
]
}'

This request will be routed to Claude Sonnet 4 based on semantic similarity to architecture and system design utterances.
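
For clients that do not use curl, the same call can be assembled from Python. The sketch below uses only the standard library and builds the request without sending it; the helper name `build_chat_request` and the example prompt are hypothetical, and the URL and key are the placeholders used in the curl example.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, prompt, model="smart-router"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request(
    "http://localhost:4000", "your-master-key", "Summarize this changelog."
)

# To actually send it (requires the gateway to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The commented-out lines assume the gateway returns the standard OpenAI chat completion response shape.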
- Client sends a POST /v1/chat/completions request with model: smart-router
- LiteLLM receives the request and authenticates via master key
- The prompt text is extracted and sent to the semantic router
- The router calls cohere.embed-multilingual-v3 on AWS Bedrock to produce a vector embedding of the prompt
- Cosine similarity is computed against all pre-defined utterance embeddings in router.json
- The route with the highest similarity score above the 0.15 threshold wins
- LiteLLM rewrites the model target to the matched Bedrock inference profile
- The request is forwarded to AWS Bedrock
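
The threshold and model-rewrite steps in this flow can be sketched as a small lookup. This is an illustration under assumptions rather than the gateway's actual code: the route names and the fallback route are hypothetical, while the 0.15 threshold and the inference profile IDs are the ones listed in this document.

```python
SCORE_THRESHOLD = 0.15  # similarity threshold stated for router.json

# Route names are hypothetical; profile IDs are copied from the model table.
ROUTE_TO_PROFILE = {
    "reasoning": "us.anthropic.claude-sonnet-4-6",
    "writing": "google.gemma-3-27b-it",
    "creative": "meta.llama4-maverick-17b-instruct-v1:0",
    "casual": "us.amazon.nova-lite-v1:0",
}
DEFAULT_ROUTE = "casual"  # assumed fallback when no route clears the threshold

def resolve_model(route_scores):
    """Map per-route similarity scores to a Bedrock inference profile ID."""
    # Keep only routes whose score is above the threshold.
    eligible = {r: s for r, s in route_scores.items() if s > SCORE_THRESHOLD}
    if not eligible:
        return ROUTE_TO_PROFILE[DEFAULT_ROUTE]
    winner = max(eligible, key=eligible.get)
    return ROUTE_TO_PROFILE[winner]

print(resolve_model({"reasoning": 0.62, "writing": 0.31, "casual": 0.08}))
```

How the real gateway behaves when no route clears the threshold is not stated here, so the fallback to a lightweight model is only one plausible choice.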

The Presidio guardrail handles each request as follows:
- An inbound request is intercepted by the presidio-pre-guard policy before the model call
- The raw prompt is sent to Presidio Analyzer (http://presidio-analyzer:3000), which returns a list of detected PII entities with confidence scores and character spans
- The prompt is forwarded to Presidio Anonymizer (http://presidio-anonymizer:3000) with the detected entity spans
- Anonymizer replaces each entity with a typed placeholder (e.g., <PERSON>, <EMAIL_ADDRESS>, <CREDIT_CARD>)
- The sanitized prompt continues to the routing engine and eventually to the foundation model
- The original PII never leaves the gateway layer
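
The placeholder replacement in this flow can be illustrated without running Presidio. The sketch below is a simplified stand-in, not Presidio's implementation: the `anonymize` function is hypothetical, the entity dictionaries only mirror the entity type, character span, and score described above, and overlapping spans are not handled.

```python
def anonymize(text, entities):
    """Replace each detected span with a <ENTITY_TYPE> placeholder.

    `entities` is a list of dicts with "entity_type", "start", and "end" keys.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_type']}>"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text

prompt = "Email jane@example.com about the invoice."
# Hand-written detection result standing in for the analyzer's output.
detected = [{"entity_type": "EMAIL_ADDRESS", "start": 6, "end": 22, "score": 1.0}]
print(anonymize(prompt, detected))
```

Only the sanitized string would be passed on to routing and inference; the original prompt stays inside the gateway.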
| Layer | Technology | Role |
|---|---|---|
| Gateway | LiteLLM Proxy | OpenAI-compatible API, routing, auth, logging |
| Inference | AWS Bedrock | Managed foundation model inference |
| Routing | Semantic Router + Cohere Embeddings | Intent-based model selection |
| Guardrails | Microsoft Presidio | PII detection and prompt sanitization |
| Persistence | PostgreSQL 16 | Model config, keys, request metadata |
| Orchestration | Docker Compose | Containerized multi-service deployment |
| Configuration | YAML + JSON | Gateway and routing configuration |
| Language | Python (LiteLLM internals) | Runtime |

Planned enhancements:
- RAG Integration — Retrieval-augmented generation with vector database backends (pgvector, Pinecone, Weaviate)
- Redis Caching — Semantic cache layer to deduplicate embeddings and reduce Bedrock API costs
- Observability — Prometheus metrics export and Grafana dashboards for request volume, latency, routing distribution, and cost per model
- Kubernetes Deployment — Helm chart for production-grade, horizontally scalable deployment
- Multi-region — Active-active Bedrock routing across AWS regions for latency optimization and resilience
- Authentication Providers — SSO/OIDC integration (Okta, Azure AD) for enterprise identity management
- Rate Limiting — Per-user and per-team token budgets with enforcement at the gateway layer
- Cost Dashboards — Real-time spend tracking and alerting per model, team, and use case
- Additional Guardrails — Prompt injection detection, toxicity filtering, and output validation

License: MIT
