Skip to content

Gen3 Admin Ollama Host Setup

GT AI OS Release edited this page Jun 11, 2026 · 2 revisions

Ollama Host Setup

Start Here

  1. Pick your platform preset below: 4gb-minimum, v100-16gb, dgx-spark, or rtx-pro-6000-96gb.
  2. On the Ubuntu GPU host, install Ollama and apply the preset systemd override so Ollama listens on 0.0.0.0:11434.
  3. Pull embeddinggemma (required for GT AI OS embeddings) plus at least one chat model for your VRAM tier.
  4. In Control Panel Models → Inference Providers, edit the Ollama provider and set Base URL to http://<host-lan-ip>:11434 (operator-entered LAN address—no in-cluster bridge URL).
  5. Run Test connection, then Discover on Configured Models. Confirm models appear before setting the deployment default embedding model in Models.

Why this matters

GT AI OS uses Ollama for on-premises embeddings and optional local chat inference. The Control Panel discovers models from a reachable Ollama root URL (http://host:11434, not an OpenAI-style /v1 path). Cluster pods must route to the host’s LAN IP and port you configure—operators enter that endpoint explicitly in Inference Providers.

Without embeddinggemma (or another enabled embedding model) and a valid provider URL, dataset ingestion, contextual memory, and default embedding selection will fail or remain incomplete in QuickStart.

Details

Pick your platform

Anchor When to use Typical VRAM
4gb-minimum Any Ubuntu host with an NVIDIA GPU 4 GB or larger 4–8 GB
v100-16gb Volta V100 workstation or server 16 GB
dgx-spark NVIDIA DGX Spark (GB10, unified memory) 128 GB UMA
rtx-pro-6000-96gb RTX PRO 6000 Blackwell Workstation Edition 96 GB GDDR7

4gb-minimum

Minimum tier for Ubuntu + NVIDIA GPU (4 GB+). Tuned for a single small chat model plus embeddings.

Install script

set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# Ubuntu + 4–8 GB GPU (minimum tier)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=4096"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_GPU_OVERHEAD=536870912"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  sudo systemctl stop firewalld
  sudo systemctl disable firewalld
fi

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull llama3.2:1b
# or: ollama pull phi3:mini

v100-16gb

Volta V100 with 16 GB VRAM. For 32 GB cards, set OLLAMA_MAX_LOADED_MODELS=2 in the override.

Install script

set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# Volta V100 16 GB
# Multi-GPU: uncomment:
# Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
# Environment="OLLAMA_SCHED_SPREAD=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  sudo systemctl stop firewalld
  sudo systemctl disable firewalld
fi

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull llama3.2:3b
ollama pull nemotron-mini

dgx-spark

NVIDIA DGX Spark (GB10, 128 GB unified memory). Supports higher concurrency than discrete-GPU tiers.

Install script

set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# DGX Spark — 128 GB UMA Blackwell
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  sudo systemctl stop firewalld
  sudo systemctl disable firewalld
fi

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull nemotron
ollama pull llama3.3:70b

rtx-pro-6000-96gb

NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB GDDR7). For 48 GB Max-Q SKUs, set OLLAMA_MAX_LOADED_MODELS=2 and OLLAMA_NUM_PARALLEL=2.

Install script

set -euo pipefail

if ! command -v ollama >/dev/null; then
  curl -fsSL https://ollama.com/install.sh | sh
fi

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
# RTX PRO 6000 Blackwell Workstation Edition (96 GB GDDR7)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=6h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_CONTEXT_LENGTH=16384"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama
sudo systemctl restart ollama

sudo ufw disable 2>/dev/null || true
sudo systemctl stop ufw 2>/dev/null || true
sudo systemctl disable ufw 2>/dev/null || true
if systemctl list-unit-files firewalld.service &>/dev/null; then
  sudo systemctl stop firewalld
  sudo systemctl disable firewalld
fi

LAN_IP="$(hostname -I | awk '{print $1}')"
echo "==> systemd: $(systemctl is-active ollama)"
curl -fsS http://127.0.0.1:11434/api/tags
curl -fsS "http://${LAN_IP}:11434/api/tags"
ollama ps || true
echo "Done. Next: pull models (see below)."

Pull models

ollama pull embeddinggemma
ollama pull llama3.3:70b
ollama pull nemotron

Model pulls by VRAM (quick reference)

VRAM tier Example chat pulls Embedding (required)
4–8 GB (minimum) llama3.2:1b or phi3:mini ollama pull embeddinggemma
16 GB (V100) llama3.2:3b, nemotron-mini embeddinggemma
128 GB UMA (DGX Spark) nemotron, llama3.3:70b embeddinggemma
96 GB (RTX PRO 6000) llama3.3:70b, nemotron embeddinggemma

Always pull embeddinggemma before setting the deployment default embedding model in Control Panel.


Point GT AI OS at the host

GT AI OS does not use in-cluster bridge URLs for Ollama. The operator enters the host’s routable LAN endpoint in Control Panel Models → Inference Providers.

Where you test URL
Control Panel Ollama provider Base URL http://<host-lan-ip>:11434 (example: http://192.168.1.50:11434)
On the Ollama host shell http://127.0.0.1:11434

Configuration checklist:

  • Base URL must be the server root (no /v1 suffix).
  • Models listing path is typically /api/tags for Ollama.
  • Use an IPv4 address pods can route to from the cluster network—not localhost from the cluster’s perspective.
  • Run Test connection, then Discover on the Configured Models tab.

See Models and the Ollama discovery step in QuickStart.


Clients on the LAN

Client Endpoint style Notes
Ollama native API http://<lan-ip>:11434/api/chat, /api/embed, /api/tags GT AI OS adapter uses native Ollama routes
OpenAI-compatible shim http://<lan-ip>:11434/v1/... Optional for external tools; do not use /v1 as the Control Panel base URL

Troubleshooting

Symptom What to check
Connection refused from Control Panel Confirm OLLAMA_HOST=0.0.0.0:11434 in the systemd override; systemctl is-active ollama; from another machine on the LAN: curl http://<lan-ip>:11434/api/tags
Firewall blocking port 11434 Lab scripts disable ufw / firewalld for trusted LANs; in production, allow TCP 11434 only from cluster node subnets
Wrong endpoint in Control Panel Base URL must be http://<lan-ip>:11434 without /v1; rediscover after correcting
Pods cannot reach host IP Verify cluster nodes route to the Ollama host subnet; avoid NAT-only addresses pods cannot reach
Empty model list after discovery Run ollama pull embeddinggemma and at least one chat model on the host; retry discovery
OOM or 503 under load Lower OLLAMA_NUM_PARALLEL, then OLLAMA_MAX_LOADED_MODELS, then OLLAMA_CONTEXT_LENGTH in the systemd override
GPU not visible to Ollama Run nvidia-smi on the host; reinstall NVIDIA driver stack if the GPU is missing

After OOM or slow responses: reduce parallelism first, then loaded-model count, then context length.


Security note

Disabling ufw / firewalld is intended for trusted lab or LAN inference hosts. Do not expose port 11434 to the public internet without TLS and access control in front.

Related pages

Clone this wiki locally