End-to-end LLM inference system with experimentation and monitoring.
Client → nginx (8780) → Gateway (8000) → Modal vLLM
                              ↓
                      Prometheus (9090)
- nginx: Load balancer with health checks
- Gateway: FastAPI with X-Technique routing, Prometheus metrics
- vLLM: Modal-hosted Mistral-7B (baseline + chunked prefill)
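A minimal sketch of how X-Technique routing could work, assuming the gateway maps the header value to one of the two backend URLs configured in .env. The function name `resolve_backend`, the mapping table, and the fallback to `baseline` when the header is missing are illustrative assumptions, not taken from gateway/main.py.

```python
import os

# Assumed mapping from X-Technique header values to the .env variable
# that holds the matching Modal URL.
TECHNIQUE_ENV_VARS = {
    "baseline": "VLLM_BASELINE_URL",
    "chunked_prefill": "VLLM_CHUNKED_URL",
}


def resolve_backend(headers, env=None, default="baseline"):
    """Return (technique, backend_url) for a request's headers.

    `default` is an assumption: the real gateway may instead reject
    requests that carry no X-Technique header.
    """
    env = os.environ if env is None else env
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    technique = lowered.get("x-technique", default)
    if technique not in TECHNIQUE_ENV_VARS:
        raise ValueError(f"unknown technique: {technique!r}")
    var_name = TECHNIQUE_ENV_VARS[technique]
    url = env.get(var_name)
    if not url:
        raise KeyError(f"{var_name} is not set")
    return technique, url
```

Passing `env` explicitly keeps the function testable without touching the real environment.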
git clone <repo-url>
cd InferenceProject
uv pip install -r requirements.txt
cp .env.example .env
# Edit .env with your Modal URLs:
# VLLM_BASELINE_URL=https://your-account--vllm-baseline-serve.modal.run
# VLLM_CHUNKED_URL=https://your-account--vllm-chunked-prefill-serve.modal.run
modal deploy vllm_engine/modal_baseline.py
modal deploy vllm_engine/modal_chunked.py
Terminal 1 - Gateway:
uv run python gateway/main.py
Terminal 2 - nginx:
nginx -c $(pwd)/nginx/nginx.conf
Terminal 3 - Prometheus (optional):
docker run -p 9090:9090 -v $(pwd)/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
curl http://localhost:8780/health
curl http://localhost:8000/metrics
uv run python experiments/runner.py
Runs a 60-second load test against both the baseline and chunked_prefill arms. Results are saved to data/metrics/.
uv run python experiments/evaluate_agent.py
Runs code generation tasks from the golden set and reports pass rates.
uv run python experiments/generate_plots.py
Creates latency distribution and throughput comparison plots in figures/.
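The latency and throughput comparison between the two arms can be sketched as follows. This is only an illustration of the kind of summary such an experiment might produce. The function names, the nearest-rank percentile method, and the choice of p50/p95 are assumptions, not the contents of experiments/runner.py or experiments/generate_plots.py.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_arm(latencies_s, duration_s):
    """Summarise one experiment arm from per-request latencies in seconds."""
    return {
        "requests": len(latencies_s),
        "throughput_rps": len(latencies_s) / duration_s,
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
    }


def compare_arms(results, duration_s=60.0):
    """Map each arm name (e.g. "baseline") to its summary dict."""
    return {arm: summarize_arm(lat, duration_s) for arm, lat in results.items()}
```

Calling `compare_arms({"baseline": [...], "chunked_prefill": [...]})` with the recorded latencies for each arm returns one summary per arm, which can then be compared side by side.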
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Technique: baseline" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10
}'
curl -X POST http://localhost:8780/v1/chat/completions \
-H "X-Technique: chunked_prefill" \
-d '{...}'
.
├── gateway/ # FastAPI gateway with metrics
├── vllm_engine/ # Modal deployment scripts
├── agent/ # Code generation agent
├── experiments/ # Load testing and evaluation
├── data/ # Experiment results
├── figures/ # Generated plots
├── nginx/ # Load balancer config
└── monitoring/ # Prometheus config
Gateway exposes Prometheus metrics at http://localhost:8000/metrics:
llm_gateway_requests_total{technique, model}
llm_gateway_request_duration_seconds{technique}
llm_gateway_errors_total{error_type, layer}
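To inspect these metrics outside Prometheus, the text returned by /metrics can be parsed directly. The sketch below handles only the simple `name{labels} value` and `name value` line shapes of the Prometheus text format and silently skips anything else, including lines with a trailing timestamp. `parse_metrics` is a hypothetical helper, not part of the gateway.

```python
import re

# One sample line: metric name, optional {label="value",...} block, numeric value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces; the value may contain escaped characters.
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, ignored in this sketch
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_by_label(samples, metric, label):
    """Sum a metric's values grouped by one label, e.g. requests per technique."""
    totals = {}
    for name, labels, value in samples:
        if name == metric and label in labels:
            totals[labels[label]] = totals.get(labels[label], 0.0) + value
    return totals
```

For example, `total_by_label(parse_metrics(body), "llm_gateway_requests_total", "technique")` gives a request count per technique, where `body` is the response text fetched from the metrics endpoint.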
Generate submission PDF:
jupyter nbconvert submission.ipynb --to pdf
See submission.ipynb for full analysis and results.