A self-hosted platform for fine-tuning, RAG, and inference with local LLMs via Ollama.
llm-forge/
├── backend/                        FastAPI backend (Python)
│   ├── main.py
│   ├── routers/
│   │   ├── data.py                 Upload & ingest documents
│   │   ├── rag.py                  RAG query / chat
│   │   ├── finetune.py             LoRA fine-tuning jobs
│   │   └── models.py               Ollama model management
│   ├── services/
│   │   ├── vector_store.py         ChromaDB + embeddings
│   │   ├── ollama_service.py       Ollama HTTP client
│   │   ├── rag_pipeline.py         RAG pipeline
│   │   └── finetune_service.py     LoRA trainer
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   └── index.html                  Single-page dashboard
├── docker/
│   └── nginx.conf
├── data/
│   ├── uploads/                    Uploaded files land here
│   ├── adapters/                   LoRA adapter weights
│   └── example_dataset.jsonl
└── docker-compose.yml
# 1. Clone / copy this project
cd llm-forge
# 2. Start all services
docker compose up --build -d
# 3. Open the UI
open http://localhost:3000
# 4. Pull your first model (one-time)
# Either through the UI → Models → Pull
# Or via CLI:
docker exec llmforge-ollama ollama pull llama3.2:3b

Browser (localhost:3000)
        │
        ▼
Nginx (frontend) ──/api/──► FastAPI Backend (:8000)
                               │             │
                            Ollama        ChromaDB
                           (:11434)       (:8001)
                         (LLM models)    (vectors)
RAG:
User question
→ embed question (sentence-transformers)
→ query ChromaDB for top-K chunks
→ build prompt: system=context + user=question
→ Ollama generate
→ return answer + source chunks
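
The prompt-building step of the RAG flow can be sketched as follows. This is a minimal illustration, not the code in `services/rag_pipeline.py`: the function name `build_rag_payload`, the system-prompt wording, and the sample chunks are assumptions. The payload keys (`model`, `system`, `prompt`, `stream`) are the ones Ollama's `/api/generate` endpoint accepts.

```python
import json


def build_rag_payload(question, chunks, model="llama3.2:3b"):
    """Assemble a non-streaming Ollama /api/generate payload (hypothetical helper)."""
    # Number each retrieved chunk so the answer can point back at its sources.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    # Assumed wording: the retrieved context goes in the system prompt.
    system = (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}"
    )
    # The user's question is sent unchanged as the prompt.
    return {"model": model, "system": system, "prompt": question, "stream": False}


# Sample chunks stand in for the top-K results returned by ChromaDB.
payload = build_rag_payload(
    "Which port does ChromaDB use?",
    ["ChromaDB listens on :8001.", "Ollama listens on :11434."],
)
print(json.dumps(payload, indent=2))
```

Keeping the context in `system` and the question in `prompt` mirrors the "system=context + user=question" split described above, and returning the chunks alongside the answer is what lets the UI show sources.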
Fine-tune:
Upload JSONL dataset (instruction/output pairs)
→ load HuggingFace base model
→ attach LoRA adapters (PEFT)
→ SFTTrainer (TRL)
→ save adapter weights to /data/adapters/<job_id>/
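
Before training, each instruction/output pair has to be rendered as a single text sequence. How `finetune_service.py` does this is not shown here; the sketch below assumes an Alpaca-style template, and both the template and the name `format_example` are hypothetical.

```python
def format_example(record):
    """Render one instruction/output pair as a single training string.

    The "### Instruction / ### Response" layout is an assumed template,
    not necessarily the one the trainer in this project uses.
    """
    return (
        "### Instruction:\n"
        f"{record['instruction'].strip()}\n\n"
        "### Response:\n"
        f"{record['output'].strip()}"
    )


text = format_example(
    {"instruction": "What is the capital of France?", "output": "Paris."}
)
print(text)
```

Whatever template is used for training should also be used at inference time, otherwise the adapter sees prompts in a shape it was never trained on.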
| Method | Path | Description |
|---|---|---|
| GET | `/api/health` | Health check |
| POST | `/api/data/ingest/text` | Ingest raw text |
| POST | `/api/data/ingest/file` | Upload & ingest file |
| POST | `/api/data/ingest/jsonl-dataset` | Upload fine-tune dataset |
| GET | `/api/data/stats` | Vector DB stats |
| DELETE | `/api/data/clear` | Clear vector DB |
| POST | `/api/rag/query` | RAG single query |
| POST | `/api/rag/chat` | RAG multi-turn chat |
| GET | `/api/models/` | List Ollama models |
| POST | `/api/models/pull` | Pull Ollama model |
| POST | `/api/models/generate` | Direct generate |
| POST | `/api/finetune/start` | Start LoRA job |
| GET | `/api/finetune/jobs` | List all jobs |
| GET | `/api/finetune/jobs/{id}` | Job status + logs |
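
The endpoints above can be called from any HTTP client. Below is a minimal standard-library sketch that builds requests against the Nginx proxy on port 3000. The helper name `build_api_request` and the request body field `question` are assumptions; the actual request schemas are defined in `backend/routers/` and may differ.

```python
import json
import urllib.request


def build_api_request(path, payload=None, base_url="http://localhost:3000"):
    """Build a request for the API: POST with a JSON body if given, else GET."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


health = build_api_request("/api/health")
# "question" is a hypothetical field name; check routers/rag.py for the real schema.
query = build_api_request("/api/rag/query", {"question": "What is LoRA?"})

# With the stack running, send a request like this:
# with urllib.request.urlopen(health) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Because FastAPI serves interactive documentation by default, the exact request bodies can usually be checked at the backend's `/docs` page, provided the project has not disabled it.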
Create a .jsonl file with one JSON object per line:
{"instruction": "What is the capital of France?", "output": "Paris."}
{"instruction": "Write a Python hello world.", "output": "print('Hello, world!')"}

Upload it via Ingest → Upload Fine-tune Dataset, then use the filename in Fine-tune → Dataset File.
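
It can save a failed training run to check the dataset before uploading it. The sketch below validates the instruction/output format described above using only the standard library; the function name `validate_jsonl` is hypothetical and is not part of this project.

```python
import json


def validate_jsonl(lines):
    """Return (line_number, problem) tuples for lines that break the dataset format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # Both keys must be present and hold non-empty strings.
        for key in ("instruction", "output"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


sample = [
    '{"instruction": "What is the capital of France?", "output": "Paris."}',
    '{"instruction": "No answer here"}',
    "not json",
]
problems = validate_jsonl(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

To check a file on disk, pass the open file object directly, for example `validate_jsonl(open("data/example_dataset.jsonl", encoding="utf-8"))`.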
Uncomment the deploy.resources block in docker-compose.yml for NVIDIA GPU:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]

Requires NVIDIA Container Toolkit.
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://ollama:11434` | Ollama API endpoint |
| `CHROMA_HOST` | `chromadb` | ChromaDB host |
| `DEFAULT_MODEL` | `llama3.2:3b` | Default Ollama model |
| `EMBED_MODEL` | `all-MiniLM-L6-v2` | Sentence-transformer model |
| `RAG_TOP_K` | `5` | Default RAG chunks to retrieve |
| `CHUNK_SIZE` | `512` | Words per chunk |
| `CHUNK_OVERLAP` | `64` | Overlap between chunks |
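
To illustrate how `CHUNK_SIZE` and `CHUNK_OVERLAP` interact, here is a minimal word-based chunker that reads both from the environment with the defaults listed above. It is a sketch, not the ingestion code itself: the function name `chunk_words` is hypothetical, and it assumes the overlap is measured in words, like the chunk size.

```python
import os

# Fall back to the documented defaults when the variables are unset.
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", "512"))
CHUNK_OVERLAP = int(os.environ.get("CHUNK_OVERLAP", "64"))


def chunk_words(text, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


print(chunk_words("a b c d e f g h i j", size=4, overlap=1))
# → ['a b c d', 'd e f g', 'g h i j']
```

With the defaults, each chunk advances by 512 - 64 = 448 words, so a larger overlap means more chunks and more stored vectors for the same document.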
- Add PDF support: Install `pypdf` and extend `_extract_text()` in `routers/data.py`
- Add evaluation: Create `services/eval_service.py` with accuracy metrics
- Scheduled retraining: Add a cron job or APScheduler that calls `start_finetune()` with new data
- Export to Ollama: After fine-tuning, use `llama.cpp` to convert the merged model to GGUF and load it into Ollama via `ollama create`
- Swap embedding model: Change the `EMBED_MODEL` env var to any sentence-transformers model