Distributed LLM Inference Platform

A production-style distributed inference system for serving local LLMs with load balancing, caching, and low-latency request handling.

Features

  • FastAPI inference gateway
  • Distributed worker processes
  • Redis-based response caching
  • Round-robin load balancing (see the caching/load-balancing sketch after this list)
  • Local GPT4All LLM inference (CPU-only)
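
A minimal sketch of how the caching and load-balancing pieces could fit together. The worker URLs, Redis key scheme, TTL, and helper names below are illustrative assumptions, not the exact code in this repository.

```python
# Sketch: Redis response cache + round-robin worker selection.
# Worker URLs, key scheme, and TTL are assumptions for illustration only.
import hashlib
import itertools

import redis

WORKER_URLS = [
    "http://localhost:8001",  # hypothetical worker endpoints
    "http://localhost:8002",
]
_worker_cycle = itertools.cycle(WORKER_URLS)

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_key(prompt: str) -> str:
    """Derive a stable Redis key from the prompt text."""
    return "llm:resp:" + hashlib.sha256(prompt.encode()).hexdigest()

def get_cached_response(prompt: str) -> str | None:
    """Return a previously cached completion, or None on a cache miss."""
    return cache.get(cache_key(prompt))

def store_response(prompt: str, response: str, ttl_seconds: int = 3600) -> None:
    """Cache the completion with a TTL so stale entries eventually expire."""
    cache.set(cache_key(prompt), response, ex=ttl_seconds)

def next_worker() -> str:
    """Round-robin: hand out worker URLs in rotation."""
    return next(_worker_cycle)
```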

Tech Stack

  • Python
  • FastAPI
  • Redis
  • GPT4All
  • Local GGUF models (worker sketch after this list)
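
A minimal sketch of what a worker node might look like using the GPT4All Python bindings with a local GGUF model. The model filename, route, and port are assumptions; substitute whatever model file you have downloaded locally.

```python
# Sketch of a worker node: FastAPI wrapper around a local GPT4All GGUF model.
# The model filename and route are placeholders, not the repo's actual values.
from fastapi import FastAPI
from gpt4all import GPT4All
from pydantic import BaseModel

app = FastAPI()

# Loads a local .gguf file; allow_download=False keeps inference fully offline.
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf", allow_download=False)

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 200

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    """Run CPU inference on the local model and return the completion."""
    text = model.generate(req.prompt, max_tokens=req.max_tokens)
    return {"response": text}
```

Each worker could then be started on its own port, e.g. `uvicorn worker:app --port 8001` (module name assumed).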

Architecture

Client → API Gateway → Cache → Worker Nodes → LLM
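
A sketch of how the gateway might implement this flow: check the Redis cache first, otherwise forward the prompt to the next worker in round-robin order and cache the result. The endpoint names, worker response shape, and the `cache_utils` module (assumed to contain the helpers from the caching sketch above) are all illustrative assumptions.

```python
# Sketch of the API Gateway: Client -> Gateway -> Cache -> Worker -> LLM.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical module holding the cache/round-robin helpers sketched earlier.
from cache_utils import get_cached_response, next_worker, store_response

app = FastAPI()

class InferenceRequest(BaseModel):
    prompt: str

@app.post("/infer")
async def infer(req: InferenceRequest) -> dict:
    # 1. Cache hit: return immediately without touching any worker.
    cached = get_cached_response(req.prompt)
    if cached is not None:
        return {"response": cached, "cached": True}

    # 2. Cache miss: forward to the next worker in round-robin order.
    worker_url = next_worker()
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(f"{worker_url}/generate", json={"prompt": req.prompt})
    resp.raise_for_status()
    text = resp.json()["response"]

    # 3. Store the completion so repeated prompts are served from Redis.
    store_response(req.prompt, text)
    return {"response": text, "cached": False}
```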

How to Run

  1. Start Redis
  2. Run multiple workers
  3. Start the gateway
  4. Send inference requests (example client below)
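
For step 4, a request could look something like this; the gateway port and route are assumptions matching the gateway sketch above.

```python
# Sketch of a client call against the gateway (port and route are assumed).
import requests

resp = requests.post(
    "http://localhost:8000/infer",
    json={"prompt": "Explain round-robin load balancing in one sentence."},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```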

Notes

  • Model files are local and not committed
  • CPU-only inference; no training required
