# Run an LLM with Ollama server in a Colab Notebook

This Colab notebook shows how to run an LLM with Ollama. We'll set up an Ollama server inside Colab so you can interact with language models directly from your browser.

Running Ollama in Google Colab pairs Ollama's simple model management and serving with Colab's cloud GPUs, so you can run large language models (LLMs) without local hardware.

Let’s get started!

By: [DNALinux.com](https://dnalinux.com)

# Preparation work (install dependencies)

In [1]:
!apt install pciutils lshw
!curl -fsSL https://ollama.com/install.sh | sh


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpci3 pci.ids usb.ids
The following NEW packages will be installed:
  libpci3 lshw pci.ids pciutils usb.ids
0 upgraded, 5 newly installed, 0 to remove and 34 not upgraded.
Need to get 883 kB of archives.
After this operation, 3,256 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 pci.ids all 0.0~2022.01.22-1ubuntu0.1 [251 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libpci3 amd64 1:3.7.0-6 [28.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 lshw amd64 02.19.git.2021.06.19.996aaad9c7-2build1 [321 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 pciutils amd64 1:3.7.0-6 [63.6 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/main amd64 usb.ids all 2022.04.02-1 [219 kB]
Fetched 883 kB in 2s (378 kB/s)
Selecting previously unselected package pc

# Start the LLM server (Ollama)

In [2]:
import os
import threading
import subprocess
import requests
import json

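def ollama():
    # Listen on all interfaces and accept requests from any origin
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    # Popen returns immediately; the server keeps running in the background
    subprocess.Popen(["ollama", "serve"])


# Launch the server without blocking the notebook
ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()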
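
The server takes a moment to start, so later cells can fail if they run too early. The following is a minimal sketch of a readiness check, assuming the server answers HTTP 200 on its root URL at port 11434 (the port set in `OLLAMA_HOST` above). The helper name `wait_for_server` and its defaults are hypothetical, not part of Ollama.

```python
import time
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: the server is not up yet
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after starting the thread returns `True` once the server responds, or `False` if it does not come up within the timeout.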

# Download an LLM model


Your choice of model depends on your GPU's memory and on how long you are willing to wait for responses.

## Key Considerations

Larger models (e.g., **Llama4** or **Llama3.3:70B**) require a large amount of GPU RAM (over 50 GB). On platforms like Google Colab, this is only feasible with an A100 GPU. Don't use T4 GPUs for these models: with 16 GB of memory, they cannot hold them.

Speed vs. Quality: Larger models deliver higher-quality outputs but respond slower. Smaller models are faster but produce less accurate results.

Balance: For a middle ground between speed, resource usage, and output quality, consider **Gemma3:12B** or **Llama3.1:8B**.

T4 Compatibility: **Phi4** offers strong performance and, despite its size, is compatible with T4 GPUs.

Avoid the v2 TPU runtime on Google Colab: it is slow for most tasks in this notebook.
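
To apply the memory guidance above, it helps to know how much GPU RAM the current runtime has. The following is a rough sketch that reads total memory from `nvidia-smi` and compares it with the 50 GB figure mentioned above. The function names are hypothetical, and the check is only a coarse threshold, not a guarantee that a given model will load.

```python
import shutil
import subprocess


def parse_memory_mib(nvidia_smi_output):
    """Parse one integer (MiB) per GPU from nvidia-smi's CSV output."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpu_memory_mib():
    """Return total memory per GPU in MiB, or [] when no NVIDIA GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_memory_mib(result.stdout) if result.returncode == 0 else []


def fits_large_models(total_mib, required_gb=50):
    """True when total_mib exceeds the roughly 50 GB the largest models need."""
    return total_mib / 1024 > required_gb
```

For example, `[fits_large_models(m) for m in gpu_memory_mib()]` gives one boolean per visible GPU, and an empty list on a CPU-only runtime.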

In [6]:
# @title Select models to download
import subprocess
# @markdown ---

# @markdown Select an LLM:

LLM_model = "gemma3:4b" # @param ["phi4", "deepseek-r1:7b", "gemma3:4b", "gemma3:12b", "llama4", "llama3.3:70b", "llama3.2:3b", "llama3.1:8b"]
!ollama pull {LLM_model}
# @markdown ---
# @markdown ### For more information on available models, check [Ollama](https://ollama.com/search)

!ollama list

[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠏ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠋ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠸ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠦ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling ma

The next cell sends a prompt to the selected model with `ollama run` and prints the response.

In [None]:
# @title Run a query against the LLM
# @markdown Select the model and enter the prompt

# @markdown ---
# @markdown ### Select a model:
LLM_model = "gemma3:4b" # @param ["phi4", "gemma3:4b", "gemma3:12b", "llama4", "llama3.3:70b", "llama3.2:3b", "llama3.1:8b"]
# @markdown ### Enter a prompt:
query = "shorter version of this: In the rapidly evolving landscape of artificial intelligence, leveraging large language models has become integral to a myriad of applications. Running these sophisticated models efficiently and effectively is crucial for researchers, developers, and enthusiasts alike. One powerful yet accessible way to achieve this is by utilizing the Ollama server within Google Colab Notebooks. This combination offers an ideal platform that merges the computational power and versatility of Ollama with the ease of access provided by Colab’s cloud-based environment. In this guide, we will explore how you can set up and run an LLM using the Ollama server in a Colab Notebook, enabling you to harness advanced AI capabilities directly from your browser without any local infrastructure requirements. Whether for experimentation, research, or practical applications, mastering this setup is a valuable skill that opens up new possibilities in the realm of artificial intelligence. Let's dive into the process and unlock the potential of LLMs with Ollama and Colab!" # @param {type:"string"}
import shlex
# Quote the prompt so the shell passes it to ollama as one safe argument
escaped_input = shlex.quote(query)
# @markdown ---

!ollama run {LLM_model} {escaped_input}
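
The same query can also be sent from Python through the server's HTTP API, which avoids shell quoting altogether. The following is a minimal sketch, assuming the server started earlier is reachable at `127.0.0.1:11434` and using the `/api/generate` endpoint with streaming disabled. The helper names `build_generate_request` and `generate` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(model, prompt, host="http://127.0.0.1:11434"):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, host="http://127.0.0.1:11434", timeout=300):
    """Send the prompt and return the generated text."""
    request = build_generate_request(model, prompt, host)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running and the model pulled, `print(generate(LLM_model, query))` should print the model's answer. The generous timeout is an assumption to allow for slow first responses while the model loads.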


# Advanced options

In [None]:
import tempfile


# @title Adjust model parameters
# @markdown To adjust model parameters in Ollama, you need to create a derived model. This cell generates one.

# @markdown **Warning**: Not all models support all parameters. Using the wrong parameters may produce a degraded model.

# @markdown ---
# @markdown Select a model to change (input model):
LLM_model = "gemma3:4b" # @param ["phi4", "gemma3:4b", "gemma3:12b", "llama4", "llama3.3:70b", "llama3.2:3b", "llama3.1:8b"]
# @markdown Enter new model name:
new_model = "genma3T07" # @param {type:"string"}
# @markdown Enter a name for the Modelfile (leave blank to use a random temporary file):
tpl_fn = "" # @param {type:"string","placeholder":"Modelfile"}

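if tpl_fn == "":
    # Create a uniquely named temporary file (tempfile.mktemp is deprecated and unsafe)
    with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as tmp:
        tpl_fn = tmp.name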

# @markdown Temperature:  Increasing the temperature will make the model answer more creatively.
temp = 0.7 # @param {"type":"slider","min":0,"max":1,"step":0.05}
# @markdown num_ctx: Sets the size of the context window used to generate the next token
num_ctx = 4018 # @param {"type":"slider","min":1024,"max":18000,"step":1}
# @markdown seed: Sets the random number seed used for generation. With a fixed seed, the model generates the same text for the same prompt.
seed = 4482 # @param {"type":"slider","min":0,"max":10000,"step":1}
# @markdown num_predict: Maximum number of tokens to predict when generating text. (Default: -1, infinite generation)
num_predict = 4790 # @param {"type":"slider","min":-1,"max":10000,"step":1}
# @markdown Top K: Reduces the probability of generating nonsense. A higher value will give more diverse answers, while a lower value will be more conservative.
top_k = 20 # @param {"type":"slider","min":1,"max":100,"step":1}
# @markdown Top P: Works together with top-k. A higher value will lead to more diverse text, while a lower value will generate more focused and conservative text.
top_p = 0.47 # @param {"type":"slider","min":0,"max":1,"step":0.01}
# @markdown Min P: Alternative to the top_p, and aims to ensure a balance of quality and variety. The parameter p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with p=0.05 and the most likely token having a probability of 0.9, logits with a value less than 0.045 are filtered out.
min_p = 0 # @param {"type":"slider","min":0,"max":1,"step":0.01}


# @markdown ---

# @markdown [More about model parameters](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#valid-parameters-and-values)

# @markdown ---
model_file = f"""FROM {LLM_model}
PARAMETER temperature {temp}
PARAMETER num_ctx {num_ctx}
PARAMETER seed {seed}
PARAMETER num_predict {num_predict}
PARAMETER top_k {top_k}
PARAMETER top_p {top_p}
PARAMETER min_p {min_p}
"""

#mdir = "/content/miniforge3/envs/ml/lib/python3.10/site-packages/ollama/models/"

with open(f"{tpl_fn}", "w") as f:
    f.write(model_file)


!ollama create {new_model} -f {tpl_fn}



gathering model components
using existing layer sha256:aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25
using existing layer sha256:e0a42594d802e5d31cdc786deb4823edb8adff66094d49de8fffe976d753e348
using existing layer sha256:dd084c7d92a3c1c14cc09ae77153b903fd2024b64a100a0cc8ec9316063d2dbc
creating new layer sha256:5c713b945982e66464f922de12120c66e8453df5ea5bfa57c16b629e7c6d7cd6
writing manifest
success
