Gen3 Admin Ollama Host Setup

Ollama Host Setup

Start Here

Pick your platform preset below: 4gb-minimum, v100-16gb, dgx-spark, or rtx-pro-6000-96gb.
On the Ubuntu GPU host, install Ollama and apply the preset systemd override so Ollama listens on 0.0.0.0:11434.
Pull embeddinggemma (required for GT AI OS embeddings) plus at least one chat model for your VRAM tier.
In Control Panel Models → Inference Providers, edit the Ollama provider and set Base URL to http://<host-lan-ip>:11434 (a LAN address the operator enters by hand, not an in-cluster bridge URL).
Run Test connection, then Discover on the Configured Models tab. Confirm the models appear before setting the deployment default embedding model in Models.

Why this matters

GT AI OS uses Ollama for on-premises embeddings and optional local chat inference. The Control Panel discovers models from a reachable Ollama root URL (http://host:11434, not an OpenAI-style /v1 path). Cluster pods must be able to route to the host’s LAN IP and port, so operators enter that endpoint explicitly in Inference Providers.

Without embeddinggemma (or another enabled embedding model) and a valid provider URL, dataset ingestion, contextual memory, and default embedding selection will fail or remain incomplete in QuickStart.

Details

Pick your platform

Preset (anchor)	When to use	Typical VRAM
4gb-minimum	Any Ubuntu host with an NVIDIA GPU 4 GB or larger	4–8 GB
v100-16gb	Volta V100 workstation or server	16 GB
dgx-spark	NVIDIA DGX Spark (GB10, unified memory)	128 GB UMA
rtx-pro-6000-96gb	RTX PRO 6000 Blackwell Workstation Edition	96 GB GDDR7

4gb-minimum

Minimum tier for Ubuntu + NVIDIA GPU (4 GB+). Tuned for a single small chat model plus embeddings.

Install script

#!/usr/bin/env bash
# Requires bash: the script uses pipefail and &> redirection.
set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# Ubuntu + 4–8 GB GPU (minimum tier)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# Reserve 512 MiB of VRAM per GPU (value is in bytes)
Environment="OLLAMA_GPU_OVERHEAD=536870912"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

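# Trusted lab/LAN hosts only: turn off host firewalls (see Security note).
sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  # '|| true' keeps set -e from aborting if the unit turns out not to be loaded.
  sudo systemctl stop firewalld || true
  sudo systemctl disable firewalld || true
fi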

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."
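
One way to confirm that systemd actually picked up the override is to read the service environment back. The sketch below assumes that `systemctl show ollama -p Environment` prints a single `Environment=KEY=value KEY=value ...` line; `override_value` is a hypothetical helper name, not part of Ollama or systemd, and it does not handle values that contain spaces.

```bash
# Sketch: extract one variable from the line printed by
#   systemctl show ollama -p Environment
# override_value is a hypothetical helper; values with spaces are not handled.
override_value() {
  local key="$1" line rest
  line=" ${2#Environment=}"            # leading space so every KEY is preceded by one
  rest=${line#*" ${key}="}             # drop everything up to and including " KEY="
  [ "$rest" != "$line" ] || return 1   # unchanged means KEY was not present
  printf '%s\n' "${rest%% *}"          # the value runs until the next space
}

# Usage on the Ollama host:
#   override_value OLLAMA_HOST "$(systemctl show ollama -p Environment)"
```

If the helper prints nothing and returns non-zero for OLLAMA_HOST, the override was probably not loaded; re-run `sudo systemctl daemon-reload` and restart the service.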

Pull models

ollama pull embeddinggemma
ollama pull llama3.2:1b
# or: ollama pull phi3:mini
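
After pulling, it can help to check that the models are really present before moving on to the Control Panel. This is a minimal sketch that assumes `ollama list` prints a header row followed by one row per model with the model name in the first column; `has_model` is a hypothetical helper name.

```bash
# Sketch: succeed if a model name appears in the first column of `ollama list`.
# A bare name such as "embeddinggemma" also matches "embeddinggemma:latest".
has_model() {
  local want="$1" listing="$2"
  printf '%s\n' "$listing" | awk -v want="$want" '
    NR > 1 && ($1 == want || $1 == want ":latest") { found = 1 }
    END { exit found ? 0 : 1 }'
}

# Usage on the Ollama host:
#   has_model embeddinggemma "$(ollama list)" || ollama pull embeddinggemma
```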

v100-16gb

Volta V100 with 16 GB VRAM. For 32 GB cards, set OLLAMA_MAX_LOADED_MODELS=2 in the override.

Install script

#!/usr/bin/env bash
# Requires bash: the script uses pipefail and &> redirection.
set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# Volta V100 16 GB
# Multi-GPU: uncomment:
# Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
# Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
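# Note: per the Ollama FAQ, a quantized KV cache applies only when flash attention
# is enabled; this preset does not set OLLAMA_FLASH_ATTENTION.
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"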
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

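# Trusted lab/LAN hosts only: turn off host firewalls (see Security note).
sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  # '|| true' keeps set -e from aborting if the unit turns out not to be loaded.
  sudo systemctl stop firewalld || true
  sudo systemctl disable firewalld || true
fi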

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."
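
For the 32 GB variant mentioned above, a single value in the override needs to change once the script has run. The following sketch rewrites one `Environment=` line on stdin and prints the result; `rewrite_override_var` is a hypothetical helper name, and it assumes simple values without `|` or `&` characters. Commented-out lines are left untouched.

```bash
# Sketch: replace the value of one Environment="KEY=..." line read from stdin.
rewrite_override_var() {
  local key="$1" value="$2"
  sed "s|^Environment=\"${key}=.*\"\$|Environment=\"${key}=${value}\"|"
}

# Usage on the Ollama host (writes via a temp file, then reloads systemd):
#   f=/etc/systemd/system/ollama.service.d/override.conf
#   tmp="$(mktemp)"
#   rewrite_override_var OLLAMA_MAX_LOADED_MODELS 2 < "$f" > "$tmp"
#   sudo install -m 644 "$tmp" "$f" && rm -f "$tmp"
#   sudo systemctl daemon-reload && sudo systemctl restart ollama
```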

Pull models

ollama pull embeddinggemma
ollama pull llama3.2:3b
ollama pull nemotron-mini

dgx-spark

NVIDIA DGX Spark (GB10, 128 GB unified memory). Supports higher concurrency than discrete-GPU tiers.
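
A rough way to see what higher concurrency costs: according to the Ollama FAQ, parallel request processing multiplies the context a loaded model allocates by the number of parallel requests. The sketch below applies that rule to the OLLAMA_NUM_PARALLEL and OLLAMA_CONTEXT_LENGTH values used in the presets on this page. It counts tokens only; actual memory per token depends on the model and KV cache type and is not estimated here. `total_context_tokens` is a hypothetical helper name.

```bash
# Sketch: context tokens allocated per loaded model = parallel requests x context length.
total_context_tokens() {
  echo $(( $1 * $2 ))
}

# Preset name, OLLAMA_NUM_PARALLEL, OLLAMA_CONTEXT_LENGTH (values from this page).
for row in "4gb-minimum 1 4096" "v100-16gb 1 8192" \
           "dgx-spark 4 16384" "rtx-pro-6000-96gb 4 16384"; do
  set -- $row   # split the row into $1 (preset), $2 (parallel), $3 (context)
  printf '%-18s %6d tokens\n' "$1" "$(total_context_tokens "$2" "$3")"
done
```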

Install script

#!/usr/bin/env bash
# Requires bash: the script uses pipefail and &> redirection.
set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# DGX Spark — 128 GB UMA Blackwell
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

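# Trusted lab/LAN hosts only: turn off host firewalls (see Security note).
sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  # '|| true' keeps set -e from aborting if the unit turns out not to be loaded.
  sudo systemctl stop firewalld || true
  sudo systemctl disable firewalld || true
fi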

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull nemotron
ollama pull llama3.3:70b

rtx-pro-6000-96gb

NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB GDDR7). For 48 GB Max-Q SKUs, set OLLAMA_MAX_LOADED_MODELS=2 and OLLAMA_NUM_PARALLEL=2.

Install script

#!/usr/bin/env bash
# Requires bash: the script uses pipefail and &> redirection.
set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# RTX PRO 6000 Blackwell Workstation Edition (96 GB GDDR7)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

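# Trusted lab/LAN hosts only: turn off host firewalls (see Security note).
sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  # '|| true' keeps set -e from aborting if the unit turns out not to be loaded.
  sudo systemctl stop firewalld || true
  sudo systemctl disable firewalld || true
fi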

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull llama3.3:70b
ollama pull nemotron

Model pulls by VRAM (quick reference)

VRAM tier	Example chat pulls	Embedding (required)
4–8 GB (minimum)	`llama3.2:1b` or `phi3:mini`	`embeddinggemma`
16 GB (V100)	`llama3.2:3b`, `nemotron-mini`	`embeddinggemma`
128 GB UMA (DGX Spark)	`nemotron`, `llama3.3:70b`	`embeddinggemma`
96 GB (RTX PRO 6000)	`llama3.3:70b`, `nemotron`	`embeddinggemma`

Always pull embeddinggemma before setting the deployment default embedding model in Control Panel.
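
The quick reference can also be kept as a small lookup so that one command pulls everything for a tier. This is a sketch only: `models_for_preset` is a hypothetical helper name, and the model lists simply mirror the table above (with `llama3.2:1b` chosen for the minimum tier).

```bash
# Sketch: print the models for a preset, embedding model first.
models_for_preset() {
  case "$1" in
    4gb-minimum)       echo "embeddinggemma llama3.2:1b" ;;
    v100-16gb)         echo "embeddinggemma llama3.2:3b nemotron-mini" ;;
    dgx-spark)         echo "embeddinggemma nemotron llama3.3:70b" ;;
    rtx-pro-6000-96gb) echo "embeddinggemma llama3.3:70b nemotron" ;;
    *) echo "unknown preset: $1" >&2; return 1 ;;
  esac
}

# Usage on the Ollama host:
#   for m in $(models_for_preset v100-16gb); do ollama pull "$m"; done
```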

Point GT AI OS at the host

GT AI OS does not use in-cluster bridge URLs for Ollama. The operator enters the host’s routable LAN endpoint in Control Panel Models → Inference Providers.

Where you test	URL
Control Panel Ollama provider Base URL	`http://<host-lan-ip>:11434` (example: `http://192.168.1.50:11434`)
On the Ollama host shell	`http://127.0.0.1:11434`

Configuration checklist:

Base URL must be the server root (no /v1 suffix).
Models listing path is typically /api/tags for Ollama.
Use an IPv4 address that pods can route to from the cluster network, not localhost (which, from inside the cluster, points at the pod itself).
Run Test connection, then Discover on the Configured Models tab.
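
The first two checklist items can be sanity-checked from a shell before anything is typed into the Control Panel. The sketch below strips a trailing slash or `/v1` suffix from a candidate Base URL and derives the model-listing URL from it; `normalize_base_url` and `tags_url` are hypothetical helper names, not GT AI OS or Ollama commands.

```bash
# Sketch: reduce a candidate Base URL to the server root.
normalize_base_url() {
  local url="$1"
  url="${url%/}"     # drop one trailing slash
  url="${url%/v1}"   # drop an OpenAI-style /v1 suffix
  url="${url%/}"     # drop a slash that preceded /v1, if any
  printf '%s\n' "$url"
}

# Sketch: build the native model-listing URL from the normalized root.
tags_url() {
  printf '%s/api/tags\n' "$(normalize_base_url "$1")"
}

# Usage from a machine that can reach the host:
#   curl -fsS "$(tags_url http://192.168.1.50:11434/v1/)"
```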

See Models and the Ollama discovery step in QuickStart.

Clients on the LAN

Client	Endpoint style	Notes
Ollama native API	`http://<lan-ip>:11434/api/chat`, `/api/embed`, `/api/tags`	GT AI OS adapter uses native Ollama routes
OpenAI-compatible shim	`http://<lan-ip>:11434/v1/...`	Optional for external tools; do not use `/v1` as the Control Panel base URL
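
To exercise the native routes by hand, something like the following may be enough. It is a sketch under assumptions: `OLLAMA_URL`, `ollama_embed`, and `ollama_chat` are hypothetical names introduced here, the request bodies follow the Ollama native API (`input` for `/api/embed`, `messages` for `/api/chat`), and the text arguments are inserted into JSON without escaping, so they must not contain quotes or backslashes.

```bash
# Sketch: call Ollama's native routes with curl.
# OLLAMA_URL is an assumed variable; point it at http://<lan-ip>:11434 for LAN tests.
OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"

# usage: ollama_embed MODEL TEXT
ollama_embed() {
  curl -fsS "$OLLAMA_URL/api/embed" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$1\", \"input\": \"$2\"}"
}

# usage: ollama_chat MODEL PROMPT   (non-streaming, single user message)
ollama_chat() {
  curl -fsS "$OLLAMA_URL/api/chat" \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$1\", \"stream\": false, \"messages\": [{\"role\": \"user\", \"content\": \"$2\"}]}"
}

# Usage:
#   ollama_embed embeddinggemma "hello world"
#   ollama_chat llama3.2:1b "Say hello in one sentence."
```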

Troubleshooting

Symptom	What to check
Connection refused from Control Panel	Confirm `OLLAMA_HOST=0.0.0.0:11434` in the systemd override; `systemctl is-active ollama`; from another machine on the LAN: `curl http://<lan-ip>:11434/api/tags`
Firewall blocking port 11434	Lab scripts disable ufw / firewalld for trusted LANs; in production, allow TCP `11434` only from cluster node subnets
Wrong endpoint in Control Panel	Base URL must be `http://<lan-ip>:11434` without `/v1`; rediscover after correcting
Pods cannot reach host IP	Verify cluster nodes route to the Ollama host subnet; avoid NAT-only addresses pods cannot reach
Empty model list after discovery	Run `ollama pull embeddinggemma` and at least one chat model on the host; retry discovery
OOM or 503 under load	Lower `OLLAMA_NUM_PARALLEL`, then `OLLAMA_MAX_LOADED_MODELS`, then `OLLAMA_CONTEXT_LENGTH` in the systemd override
GPU not visible to Ollama	Run `nvidia-smi` on the host; reinstall NVIDIA driver stack if the GPU is missing

After OOM or slow responses: reduce parallelism first, then loaded-model count, then context length.
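
When a manual `curl` test fails, its exit code usually narrows down which row of the table applies. The mapping below is a sketch: the exit codes are curl's documented ones (6, 7, 22, 28), while the hints and the `explain_curl_exit` name are illustrative and drawn from the table above.

```bash
# Sketch: translate a curl exit code into a troubleshooting hint.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "hostname did not resolve: use the host's IPv4 LAN address" ;;
    7)  echo "connection failed: check OLLAMA_HOST=0.0.0.0:11434, service state, firewall" ;;
    22) echo "HTTP error from server: check that the base URL has no /v1 suffix" ;;
    28) echo "timed out: check routing and firewall between cluster nodes and host" ;;
    *)  echo "curl exit code $1: see the curl manual" ;;
  esac
}

# Usage from a cluster node or another LAN machine:
#   curl -fsS --max-time 5 "http://<lan-ip>:11434/api/tags" >/dev/null
#   explain_curl_exit $?
```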

Security note

Disabling ufw / firewalld is intended for trusted lab or LAN inference hosts. Do not expose port 11434 to the public internet without TLS and access control in front.
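
For hosts where the firewall should stay on, a narrower rule set might look like the sketch below. It only prints the `ufw` commands so they can be reviewed first; `ufw_allow_ollama_from` is a hypothetical helper name and the subnet in the usage line is a placeholder, not a value from this deployment.

```bash
# Sketch: print ufw rules that allow TCP 11434 only from the given subnets.
ufw_allow_ollama_from() {
  local cidr
  for cidr in "$@"; do
    printf 'sudo ufw allow from %s to any port 11434 proto tcp\n' "$cidr"
  done
}

# Usage (placeholder subnet; substitute the cluster node subnets):
#   ufw_allow_ollama_from 10.0.10.0/24
# Review the printed commands, run them, then re-enable the firewall:
#   sudo ufw enable
```

Make sure SSH access is allowed before enabling ufw on a remote host.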

Related pages

GT AI OS Instructions

Home

Self-Hosted deployment
