A production-style distributed inference system for serving local LLMs with load balancing, response caching, and low-latency request handling. It provides:
- FastAPI inference gateway
- Distributed worker processes
- Redis-based response caching
- Round-robin load balancing across workers (see the gateway sketch below)
- Local GPT4All LLM inference (CPU-only)
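
As a rough sketch (not the project's exact code), the gateway below assumes Redis on its default local port and two workers at http://localhost:8001 and http://localhost:8002; the `/generate` route, the SHA-256 cache key scheme, and the one-hour TTL are illustrative choices.

```python
import hashlib
import itertools

import httpx
import redis.asyncio as redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
# Redis on its default local port; decode_responses returns str instead of bytes.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical worker addresses; list one URL per running worker process.
workers = itertools.cycle([
    "http://localhost:8001/generate",
    "http://localhost:8002/generate",
])


class InferenceRequest(BaseModel):
    prompt: str


@app.post("/generate")
async def generate(req: InferenceRequest):
    # Cache key derived from the prompt, so identical prompts are served from Redis.
    key = "resp:" + hashlib.sha256(req.prompt.encode()).hexdigest()
    cached = await cache.get(key)
    if cached is not None:
        return {"response": cached, "cached": True}

    # Round-robin load balancing: each cache miss goes to the next worker in the cycle.
    worker_url = next(workers)
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(worker_url, json={"prompt": req.prompt})
        resp.raise_for_status()
        text = resp.json()["response"]

    await cache.set(key, text, ex=3600)  # keep responses for an hour
    return {"response": text, "cached": False}
```

If saved as `gateway.py`, this could be served with `uvicorn gateway:app --port 8000`; hashing the prompt keeps cache keys short and uniform while letting repeated prompts hit the cache.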
Built with:
- Python
- FastAPI
- Redis
- GPT4All
- Local GGUF model files
Request flow: Client → API Gateway → Cache → Worker Nodes → LLM
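
Each worker node could look roughly like the sketch below, assuming the `gpt4all` Python bindings and a GGUF file already present on disk; `MODEL_DIR`, `MODEL_NAME`, and the example Mistral filename are placeholders rather than names fixed by the project.

```python
import os

from fastapi import FastAPI
from gpt4all import GPT4All
from pydantic import BaseModel

# Hypothetical configuration: point these at your local GGUF directory and file.
MODEL_DIR = os.environ.get("MODEL_DIR", "./models")
MODEL_NAME = os.environ.get("MODEL_NAME", "mistral-7b-instruct-v0.1.Q4_0.gguf")

app = FastAPI()
# CPU-only inference; allow_download=False insists on an already-downloaded local file.
model = GPT4All(MODEL_NAME, model_path=MODEL_DIR, device="cpu", allow_download=False)


class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 256


@app.post("/generate")
def generate(req: InferenceRequest):
    # Generation blocks the worker, which is why several worker processes run
    # behind the gateway's round-robin scheduler.
    text = model.generate(req.prompt, max_tokens=req.max_tokens)
    return {"response": text}
```

Saved as `worker.py`, several copies can be started on different ports (for example `uvicorn worker:app --port 8001` and `uvicorn worker:app --port 8002`), which gives the gateway's round-robin scheduler something to balance across.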
To run the system:
- Start Redis
- Run multiple worker processes
- Start the gateway
- Send inference requests to the gateway (see the example below)
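
A minimal way to exercise the whole path once Redis, the workers, and the gateway are up; the port 8000 and the `/generate` route match the assumptions made in the sketches above.

```python
import requests

# Assumes the gateway listens on localhost:8000; adjust the host and port as needed.
payload = {"prompt": "Summarize what this inference gateway does in one sentence."}

resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())

# Sending the same prompt again should return the Redis-cached response
# without touching a worker.
resp = requests.post("http://localhost:8000/generate", json=payload, timeout=120)
print(resp.json())
```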
- Model files (GGUF) are kept locally and are not committed to the repository
- Inference runs entirely on CPU; no training is required