<a href="https://colab.research.google.com/github/OJB-Quantum/Notebooks-for-Ideas/blob/main/Mistral_LLM_Usage_in_Colab_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""Install and run Ollama on Google Colab with an NVIDIA L4 or A100 GPU.

This notebook-style script:
  1) Detects the active GPU (expects L4 or A100, continues otherwise).
  2) Installs dependencies and Ollama (Linux).
  3) Starts `ollama serve` in the background.
  4) Pulls the `mistral-nemo` model.
  5) Runs a minimal chat request using `from ollama import chat`.

References (documentation-level):
  - Ollama Linux install and manual install guidance.
  - Ollama hardware support (NVIDIA, CUDA_VISIBLE_DEVICES).
  - Ollama API version endpoint for health checks.
"""

from __future__ import annotations

import json
import os
import socket
import subprocess
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ColabOllamaConfig:
    """Configuration for installing and running Ollama in Colab.

    Args:
        model_name: The Ollama model tag to pull and run.
        ollama_host: Host interface to bind the Ollama server.
        ollama_port: TCP port for the Ollama server.
        models_dir: Directory where Ollama will store downloaded models.
        log_path: File path for Ollama server logs.
        cuda_visible_devices: CUDA device selection string, for multi-GPU cases.
        num_ctx: Context window size, in tokens, that chat calls may request
            through the `options` argument.
        install_ollama: Whether to install Ollama (set False if already present).
    """

    model_name: str = "mistral-nemo"
    ollama_host: str = "127.0.0.1"
    ollama_port: int = 11434
    models_dir: Path = Path("/content/ollama_models")
    log_path: Path = Path("/content/ollama_serve.log")
    cuda_visible_devices: str = "0"
    num_ctx: int = 8192
    install_ollama: bool = True


CFG = ColabOllamaConfig()


def _is_root() -> bool:
    """Return True if the current process has root privileges."""
    try:
        return os.geteuid() == 0
    except AttributeError:
        return False


def _sudo_prefix() -> str:
    """Return 'sudo ' when needed, otherwise ''."""
    return "" if _is_root() else "sudo "


def run_bash(command: str, *, check: bool = True) -> subprocess.CompletedProcess:
    """Run a bash command in a Colab-friendly way.

    Args:
        command: Command string to run via `bash -lc`.
        check: When True, raise CalledProcessError on non-zero exit.

    Returns:
        CompletedProcess containing return code and execution metadata.
    """
    print(f"\n[run] {command}\n")
    return subprocess.run(
        ["bash", "-lc", command],
        check=check,
    )


def capture_bash(command: str) -> str:
    """Run a bash command and capture stdout as text.

    Args:
        command: Command string to run via `bash -lc`.

    Returns:
        Standard output, as a stripped string.
    """
    out = subprocess.check_output(["bash", "-lc", command], text=True)
    return out.strip()


def is_tcp_port_open(host: str, port: int, timeout_s: float = 0.25) -> bool:
    """Return True if a TCP port is accepting connections.

    Args:
        host: Hostname or IP address.
        port: TCP port number.
        timeout_s: Socket timeout in seconds.

    Returns:
        True if connect() succeeds, otherwise False.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        return sock.connect_ex((host, port)) == 0


def wait_for_ollama_ready(
    host: str,
    port: int,
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
) -> dict:
    """Wait until the Ollama server responds to GET /api/version.

    Args:
        host: Ollama server host.
        port: Ollama server port.
        timeout_s: Max time to wait for readiness.
        poll_s: Poll interval.

    Returns:
        Parsed JSON response from /api/version.

    Raises:
        TimeoutError: If the server does not become ready in time.
    """
    url = f"http://{host}:{port}/api/version"
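    # Use a monotonic clock so wall-clock adjustments cannot skew the deadline.
    deadline = time.monotonic() + timeout_s
    last_err: Optional[Exception] = None

    while time.monotonic() < deadline:
        # URLError subclasses OSError, so catching OSError also covers socket
        # timeouts and connection resets raised while reading the response.
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                payload = resp.read().decode("utf-8")
            return json.loads(payload)
        except (OSError, json.JSONDecodeError) as err:
            last_err = err
            time.sleep(poll_s)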

    raise TimeoutError(f"Ollama did not become ready: {last_err}")


def nvidia_smi_summary() -> str:
    """Return a concise GPU summary from nvidia-smi, if available."""
    if subprocess.call(["bash", "-lc", "command -v nvidia-smi >/dev/null 2>&1"]) != 0:
        return "nvidia-smi not found (GPU runtime may be disabled)."
    query = (
        "nvidia-smi --query-gpu=name,driver_version,memory.total "
        "--format=csv,noheader"
    )
    return capture_bash(query)


def assert_or_warn_l4_a100(gpu_summary: str) -> None:
    """Warn if the GPU does not look like an L4 or A100.

    Args:
        gpu_summary: Output line(s) from nvidia_smi_summary().
    """
    upper = gpu_summary.upper()
    is_expected = (" L4" in upper) or ("A100" in upper)
    if is_expected:
        print(f"[gpu] Detected supported target GPU family: {gpu_summary}")
        return

    print(
        "[gpu] WARNING: This runtime does not look like an L4 or A100.\n"
        f"      Detected: {gpu_summary}\n"
        "      The notebook will continue, since Ollama can still run on other\n"
        "      NVIDIA GPUs, although performance and capacity will differ."
    )
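
The substring test `" L4" in upper` would also accept names such as `NVIDIA L40S`. A stricter check could parse the CSV line and compare whole name tokens. This is a minimal sketch, assuming the `name, driver_version, memory.total` layout requested by `nvidia_smi_summary()` and one GPU per line; `GpuInfo`, `parse_gpu_line`, and `is_l4_or_a100` are hypothetical helper names, not part of the notebook.

```python
from typing import NamedTuple


class GpuInfo(NamedTuple):
    """One parsed line of nvidia-smi CSV output (hypothetical helper type)."""

    name: str
    driver_version: str
    memory_mib: int


def parse_gpu_line(line: str) -> GpuInfo:
    """Parse a single 'name, driver_version, memory.total' CSV line."""
    name, driver, memory = (field.strip() for field in line.split(","))
    # memory.total is reported like '23034 MiB'; keep only the integer part.
    return GpuInfo(name, driver, int(memory.split()[0]))


def is_l4_or_a100(name: str) -> bool:
    """Match 'L4' as a whole token, or any token that starts with 'A100'."""
    tokens = name.upper().replace("-", " ").split()
    return "L4" in tokens or any(tok.startswith("A100") for tok in tokens)


info = parse_gpu_line("NVIDIA L4, 550.54.15, 23034 MiB")
print(info, is_l4_or_a100(info.name))
```

With token matching, a name like `NVIDIA L40S` no longer counts as an L4, while `NVIDIA A100-SXM4-40GB` still matches.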

In [2]:
# GPU / driver check (Colab runtime must be set to GPU).
gpu = nvidia_smi_summary()
print(gpu)
assert_or_warn_l4_a100(gpu)

# Disk sanity (models can be multiple gigabytes).
run_bash("df -h /content")
run_bash("uname -a")

NVIDIA L4, 550.54.15, 23034 MiB
[gpu] Detected supported target GPU family: NVIDIA L4, 550.54.15, 23034 MiB

[run] df -h /content


[run] uname -a



CompletedProcess(args=['bash', '-lc', 'uname -a'], returncode=0)
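
The disk check above is only informational. If you would rather have the free space as a number you can compare against, `shutil.disk_usage` from the standard library is one option. This is a minimal sketch; `free_gib`, `has_room_for`, and the `REQUIRED_GIB` budget are assumptions for illustration, not values taken from the notebook or from Ollama.

```python
import os
import shutil


def free_gib(path: str) -> float:
    """Return free space on the filesystem that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room_for(path: str, required_gib: float) -> bool:
    """Return True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


# Hypothetical budget; check the model's listed size before relying on it.
REQUIRED_GIB = 10.0
# Fall back to the current directory when not running inside Colab.
target = "/content" if os.path.isdir("/content") else "."
print(
    f"{target}: {free_gib(target):.1f} GiB free, "
    f"enough for {REQUIRED_GIB} GiB: {has_room_for(target, REQUIRED_GIB)}"
)
```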

In [3]:
# Install baseline tools. The Ollama installer relies on standard utilities
# such as curl, and zstd may be needed for extraction, depending on packaging.
sudo = _sudo_prefix()

if CFG.install_ollama:
    run_bash(f"{sudo}apt-get update -y")
    run_bash(
        f"{sudo}apt-get install -y "
        "curl ca-certificates zstd pciutils lshw"
    )

    # Official install script (Linux).
    # This installs Ollama and its libraries, and it may attempt systemd setup,
    # which is typically inactive inside Colab (that is expected).
    run_bash("curl -fsSL https://ollama.com/install.sh | sh")

# Verify installation.
run_bash("command -v ollama")
run_bash("ollama -v")



[run] apt-get update -y


[run] apt-get install -y curl ca-certificates zstd pciutils lshw


[run] curl -fsSL https://ollama.com/install.sh | sh


[run] command -v ollama


[run] ollama -v



CompletedProcess(args=['bash', '-lc', 'ollama -v'], returncode=0)
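
The `[run]` entries above are followed by no command output. One possible cause is that `subprocess.run` without capture writes to the process-level stdout, which the notebook front end may not display. The variant below pipes the output and prints each line as it arrives. It is a sketch under that assumption, and `run_bash_streamed` is a hypothetical name, not a replacement the notebook itself defines.

```python
import subprocess
from typing import List


def run_bash_streamed(command: str) -> List[str]:
    """Run `command` via bash, echoing each output line as it arrives."""
    print(f"\n[run] {command}\n")
    lines: List[str] = []
    with subprocess.Popen(
        ["bash", "-lc", command],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            print(line, end="")
            lines.append(line.rstrip("\n"))
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, command)
    return lines


run_bash_streamed("echo hello from bash")
```

Output is read line by line, so tools that redraw a progress bar with carriage returns may appear to stall until they emit a newline.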

In [4]:
# Ensure models directory exists (and is writable).
CFG.models_dir.mkdir(parents=True, exist_ok=True)

# If a server is already bound, do not start a second one.
if is_tcp_port_open(CFG.ollama_host, CFG.ollama_port):
    print(
        f"[ollama] Server already listening on "
        f"{CFG.ollama_host}:{CFG.ollama_port}"
    )
else:
    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"{CFG.ollama_host}:{CFG.ollama_port}"
    env["OLLAMA_MODELS"] = str(CFG.models_dir)

    # For completeness, and for multi-GPU environments, Ollama docs recommend
    # CUDA_VISIBLE_DEVICES for selecting NVIDIA GPUs.
    env["CUDA_VISIBLE_DEVICES"] = CFG.cuda_visible_devices

    CFG.log_path.parent.mkdir(parents=True, exist_ok=True)
    print(f"[ollama] Logging to: {CFG.log_path}")

    with CFG.log_path.open("a", encoding="utf-8") as log_file:
        proc = subprocess.Popen(
            ["ollama", "serve"],
            env=env,
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )

    print(f"[ollama] Started server PID={proc.pid}")

# Health check via documented endpoint: GET /api/version
version_payload = wait_for_ollama_ready(CFG.ollama_host, CFG.ollama_port)
print("[ollama] /api/version =>", version_payload)


[ollama] Logging to: /content/ollama_serve.log
[ollama] Started server PID=4220
[ollama] /api/version => {'version': '0.14.3'}
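
The server started above keeps running in the background until the runtime is recycled. If you want to stop it explicitly, one approach is to send a terminate signal and escalate to a kill only if the process does not exit in time. This is a minimal sketch; `stop_process` is a hypothetical helper, and a stand-in child process is used so the example does not depend on Ollama being installed.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, timeout_s: float = 10.0) -> int:
    """Terminate `proc`, escalating to kill if it does not exit in time."""
    if proc.poll() is None:  # still running
        proc.terminate()
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


# Stand-in child process; in the notebook you would pass the `proc` object
# returned by subprocess.Popen(["ollama", "serve"], ...) instead.
demo = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("exit code:", stop_process(demo))
```

Note that `proc` exists in the notebook only when the cell actually started the server; if a server was already listening, there is no process handle to stop.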


In [5]:
# Pull the model (first-time download can take a while and uses disk).
run_bash(f"ollama pull {CFG.model_name}")

# Install the official Python client library for Ollama.
run_bash("python -m pip install -U pip")
run_bash("python -m pip install -U ollama")

# Optional: show installed models.
run_bash("ollama list")



[run] ollama pull mistral-nemo


[run] python -m pip install -U pip


[run] python -m pip install -U ollama


[run] ollama list



CompletedProcess(args=['bash', '-lc', 'ollama list'], returncode=0)
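
Re-running the pull cell contacts the registry again even when the model is already on disk. The Ollama API documents a `GET /api/tags` endpoint that lists local models, so a cell could check there first. This is a sketch under the assumption that the response has the shape `{"models": [{"name": ...}, ...]}`; the helper names and the sample payload are illustrative, not real server output.

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_local_models(base_url: str = "http://127.0.0.1:11434") -> Dict[str, Any]:
    """GET /api/tags from a running Ollama server (not called in this demo)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5.0) as resp:
        return json.loads(resp.read().decode("utf-8"))


def normalize_tag(name: str) -> str:
    """Append ':latest' when no tag is given (assumed default tag)."""
    return name if ":" in name else f"{name}:latest"


def model_is_present(tags_payload: Dict[str, Any], model_name: str) -> bool:
    """Return True if `model_name` appears in a /api/tags style payload."""
    wanted = normalize_tag(model_name)
    return any(
        normalize_tag(entry.get("name", "")) == wanted
        for entry in tags_payload.get("models", [])
    )


# Hand-written sample payload, for illustration only.
sample_tags = {"models": [{"name": "mistral-nemo:latest"}]}
print(model_is_present(sample_tags, "mistral-nemo"))
```

In the notebook, `fetch_local_models()` would supply the payload once the server from the earlier cell is running.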

In [6]:
from ollama import chat

response = chat(
    model=CFG.model_name,
    messages=[{"role": "user", "content": "Hello!"}],
    # If you want to explicitly request a larger context window:
    # options={"num_ctx": CFG.num_ctx},
)
print(response.message.content)


Hi there! How can I assist you today? Let's chat about anything you'd like. 😊
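
The call above blocks until the whole reply is ready. The `ollama` Python client also accepts `stream=True`, in which case `chat` yields partial chunks that can be printed as they arrive. This is a minimal sketch; `collect_stream` and `stream_reply` are hypothetical helpers, and the demonstration at the bottom uses hand-written chunks rather than real model output so it runs without a server.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any]) -> str:
    """Print streamed chunks as they arrive and return the full reply."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


def stream_reply(model: str, prompt: str) -> str:
    """Stream one reply from a running Ollama server."""
    from ollama import chat  # imported lazily; needs the `ollama` package

    stream = chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)


# Offline demonstration with hand-written chunks (not real model output).
fake_chunks = [
    {"message": {"content": "Hel"}},
    {"message": {"content": "lo"}},
]
reply = collect_stream(fake_chunks)
```

In the notebook, `stream_reply(CFG.model_name, "Hello!")` would be the live equivalent once the server is running and the model has been pulled.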


In [7]:
# Ollama provides `ollama ps` to show whether a model is loaded on GPU/CPU.
run_bash("ollama ps")

# Also verify via NVIDIA telemetry. You can rerun this during generation.
run_bash("nvidia-smi")


[run] ollama ps


[run] nvidia-smi



CompletedProcess(args=['bash', '-lc', 'nvidia-smi'], returncode=0)
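
The same information that `ollama ps` prints is also documented as available from `GET /api/ps`, which makes it possible to check GPU placement programmatically. This is a sketch assuming each entry carries `size` and `size_vram` byte counts as described in the Ollama API documentation; the helper names and the sample payload are illustrative, not captured from a real server.

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_running_models(base_url: str = "http://127.0.0.1:11434") -> Dict[str, Any]:
    """GET /api/ps from a running Ollama server (not called in this demo)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5.0) as resp:
        return json.loads(resp.read().decode("utf-8"))


def vram_fraction(model_entry: Dict[str, Any]) -> float:
    """Return the share of a loaded model that resides in GPU memory."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0


# Hand-written sample payload, for illustration only.
sample_ps = {
    "models": [
        {"name": "example:latest", "size": 8_000_000_000, "size_vram": 6_000_000_000}
    ]
}
for entry in sample_ps["models"]:
    print(f"{entry['name']}: {vram_fraction(entry):.0%} in VRAM")
```

A fraction below 1.0 would suggest that part of the model was offloaded to CPU memory, which usually shows up as slower generation.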

In [8]:
"""Interactive prompting loop for Ollama chat in Google Colab.

This cell creates a minimal REPL that:
  1) Reads your prompt via input().
  2) Sends it to Ollama using `from ollama import chat`.
  3) Prints the assistant response.
  4) Preserves conversation state across turns.

To stop the loop:
  - Type: quit
  - Or click the Colab stop button for the running cell.
"""

from __future__ import annotations

from typing import Dict, List

from ollama import chat

# Control knobs (edit these freely).
MODEL_NAME = "mistral-nemo"
NUM_CTX = 8192  # Context window in tokens; larger values use more VRAM (video RAM).
SYSTEM_PROMPT = ""  # Example: "You are a precise technical assistant."

# Conversation state.
messages: List[Dict[str, str]] = []
if SYSTEM_PROMPT.strip():
    messages.append({"role": "system", "content": SYSTEM_PROMPT.strip()})


def send_turn(user_text: str) -> str:
    """Send one user turn, return one assistant turn.

    Args:
        user_text: The user's prompt.

    Returns:
        The assistant's response text.
    """
    messages.append({"role": "user", "content": user_text})
    response = chat(
        model=MODEL_NAME,
        messages=messages,
        options={"num_ctx": NUM_CTX},
    )
    assistant_text = response.message.content
    messages.append({"role": "assistant", "content": assistant_text})
    return assistant_text


while True:
    user_text = input("\nYou: ").strip()
    if user_text.lower() in {"quit", "exit", "q"}:
        print("Stopped.")
        break
    if not user_text:
        continue

    try:
        reply = send_turn(user_text)
    except Exception as exc:  # pylint: disable=broad-except
        raise RuntimeError(
            "Prompting failed. Verify these in earlier cells:\n"
            "  1) `ollama serve` is running (and reachable on localhost:11434)\n"
            f"  2) the model is present (`ollama pull {MODEL_NAME}`)\n"
            "  3) the `ollama` Python package is installed (`pip install ollama`)\n"
        ) from exc

    print(f"\nAssistant: {reply}")


You: Write a rigorous symbolic mathematical derivation of the averaged momentum equation. Then provide two numerical examples with realistic values and then show the results in scientific e-notation rounded to 4 decimal places. 

Assistant: **Derivation of the Averaged Momentum Equation:**

Consider a control volume with inflow and outflow velocities, \(V_{in}\) and \(V_{out}\), respectively. The mass flow rates at the inlet and outlet are \(\dot{m}_{in} = \rho V_{in}A\) and \(\dot{m}_{out} = \rho V_{out}A\), where \(\rho\) is the fluid density and \(A\) is the cross-sectional area of the control volume.

The momentum entering the control volume per unit time is \((\rho V_{in}^2)A + (\rho g)(\frac{1}{2})V_{in}tA\), where \(g\) is the acceleration due to gravity, \(t\) is the thickness of the boundary layer, and we've assumed the inlet flow is perpendicular to the control volume's surface.

The momentum leaving the control volume per unit time is \((\rho V_{out}^2)A + (\rho g)(\frac{1}