A production-style distributed inference system for serving local LLMs with load balancing, response caching, and low-latency request handling. It provides:
- FastAPI inference gateway
- Distributed worker processes
- Redis-based response caching
- Round-robin load balancing across workers (see the gateway sketch below)
- Local GPT4All LLM inference (CPU-only)
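
As a rough sketch (not the project's exact code), the gateway below assumes Redis on its default local port and two workers at http://localhost:8001 and http://localhost:8002; the `/generate` route, the SHA-256 cache key scheme, and the one-hour TTL are illustrative choices.

```python
import hashlib
import itertools

import httpx
import redis.asyncio as redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Redis on its default local port; decode_responses returns str instead of bytes.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical worker addresses; list one URL per running worker process.
workers = itertools.cycle([
    "http://localhost:8001/generate",
    "http://localhost:8002/generate",
])


class InferenceRequest(BaseModel):
    prompt: str


@app.post("/generate")
async def generate(req: InferenceRequest):
    # Cache key derived from the prompt, so identical prompts are served from Redis.
    key = "resp:" + hashlib.sha256(req.prompt.encode()).hexdigest()
    cached = await cache.get(key)
    if cached is not None:
        return {"response": cached, "cached": True}

    # Round-robin load balancing: each cache miss goes to the next worker in the cycle.
    worker_url = next(workers)
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(worker_url, json={"prompt": req.prompt})
        resp.raise_for_status()
        text = resp.json()["response"]

    await cache.set(key, text, ex=3600)  # keep responses for an hour
    return {"response": text, "cached": False}
```

If saved as `gateway.py`, this could be served with `uvicorn gateway:app --port 8000`; hashing the prompt keeps cache keys short and uniform while letting repeated prompts hit the cache.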
Built with:
- Python
- FastAPI
- Redis
- GPT4All
- Local GGUF model files
Request flow: Client → API Gateway → Cache → Worker Nodes → LLM
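
Each worker node could look roughly like the sketch below, assuming the `gpt4all` Python bindings and a GGUF file already present on disk; `MODEL_DIR`, `MODEL_NAME`, and the example Mistral filename are placeholders rather than names fixed by the project.

```python
import os

from fastapi import FastAPI
from gpt4all import GPT4All
from pydantic import BaseModel

# Hypothetical configuration: point these at your local GGUF directory and file.
MODEL_DIR = os.environ.get("MODEL_DIR", "./models")
MODEL_NAME = os.environ.get("MODEL_NAME", "mistral-7b-instruct-v0.1.Q4_0.gguf")

app = FastAPI()
# CPU-only inference; allow_download=False insists on an already-downloaded local file.
model = GPT4All(MODEL_NAME, model_path=MODEL_DIR, device="cpu", allow_download=False)


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 256


@app.post("/generate")
def generate(req: InferenceRequest):
    # Generation blocks the worker, which is why several worker processes run
    # behind the gateway's round-robin scheduler.
    text = model.generate(req.prompt, max_tokens=req.max_tokens)
    return {"response": text}
```

Saved as `worker.py`, several copies can be started on different ports (for example `uvicorn worker:app --port 8001` and `uvicorn worker:app --port 8002`), which gives the gateway's round-robin scheduler something to balance across.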
To run the system:
- Start Redis
- Run multiple worker processes
- Start the gateway
- Send inference requests to the gateway (see the example below)
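
A minimal way to exercise the whole path once Redis, the workers, and the gateway are up; the port 8000 and the `/generate` route match the assumptions made in the sketches above.

```python
import requests

# Assumes the gateway listens on localhost:8000; adjust the host and port as needed.
payload = {"prompt": "Summarize what this inference gateway does in one sentence."}

resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())

# Sending the same prompt again should return the Redis-cached response
# without touching a worker.
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
print(resp.json())
```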
- Model files (GGUF) are kept locally and are not committed to the repository
- Inference runs entirely on CPU; no training is required