# **Using LLMs with Ollama**

******************************************************************
In Colab, choose: Runtime > Change runtime type > CPU

Install Ollama (server + CLI)

In [None]:
# "!" at the beginning means this command runs in the system shell
#   (works in Colab/Jupyter, not regular Python).

# Install the Ollama server and CLI (Command Line Interface).
# - curl -fsSL:
#     -f : fail on HTTP errors
#     -s : run silently
#     -S : show errors (even when -s is used)
#     -L : follow redirects
!curl -fsSL https://ollama.com/install.sh | sh

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
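Optional check from Python: the short sketch below only looks for the `ollama` executable on the PATH using the standard library. The helper name `cli_path` is made up for this notebook; it does not tell you whether the server is running.

```python
import shutil


def cli_path(name="ollama"):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


path = cli_path()
print(f"ollama CLI found at: {path}" if path else "ollama CLI not found on PATH")
```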


Install the Python client

In [None]:
# Install the Ollama Python client (for using ollama.chat / ollama.generate).
# Note: this does NOT start the Ollama server.
# Browse the available models at: https://ollama.com/search
!pip install ollama



Start the Ollama server

In [None]:
# Start the Ollama server in the background:
# - nohup : keep it running even if the notebook/terminal stops
# - > /tmp/ollama.log 2>&1 : redirect all output (stdout & stderr) to a log file
# - & : run in the background so the notebook stays usable
!nohup ollama serve > /tmp/ollama.log 2>&1 &

# Give the server a couple of seconds to start up before using it
!sleep 2
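A fixed sleep can be too short on a slow machine. As an alternative, here is a minimal sketch that polls the server address until it answers. The address comes from the install log above; the function name `wait_for_server` and the `timeout` / `interval` defaults are assumptions chosen for illustration.

```python
import time
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers or `timeout` seconds pass.

    Returns True if the server responded, False if the deadline was reached.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Any successful HTTP response means the server is up.
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or HTTP error: wait and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_server()` in the next cell returns as soon as the server is reachable, instead of always waiting a fixed time.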

Check if the server is running

In [None]:
# List processes related to "ollama" to check if the server is running.
# Note: this will also show the "grep ollama" command itself — that's normal.
!ps aux | grep ollama

root        2983  2.5  0.2 1707884 28096 ?       Sl   09:54   0:00 ollama serve
root        3033  0.0  0.0   7376  3496 ?        S    09:54   0:00 /bin/bash -c ps aux | grep ollama
root        3035  0.0  0.0   6484  2284 ?        S    09:54   0:00 grep ollama


Download a model

In [None]:
# Download the model from the Ollama registry.
# - Requires the Ollama server to be running.
# - This pulls the model weights into the local cache; it won’t start the model yet.
# - You can later run it with:  ollama run <model>   or   via ollama.chat in Python.
# - On Colab, models are saved under: /root/.ollama/models

!ollama pull gemma3:4b # !ollama pull llama3.2:1b

[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l


******************************************************************
*Example: Zero-shot prompting using Ollama from Python (chat API)*

Block 1: Import Ollama client

In [None]:
# Import the Python client that talks to the local Ollama server.
# By default, it connects to http://127.0.0.1:11434
import ollama

Block 2: Define the conversation messages

In [None]:
# Define the "system" message:
# - This sets the assistant’s role, style, or instructions.
system_content = "You are a friendly lecturer who explains things simply and concretely."

# Define the "user" message:
# - This is the actual question or request we want answered.
user_content = "Explain large language models in one sentence."

Block 3: Send the chat request

In [None]:
# Send a chat request to the Ollama server.
# - model=<model> selects which model to run (here, a small CPU-friendly one).
#   (make sure you already pulled it with:  !ollama pull <model>)
# - messages is a list of role-tagged turns; order matters (system → user → ...).
# - If your server is running on another machine, create a client with
#   ollama.Client(host="http://<ip>:11434") and call client.chat(...) instead.

llm_model = "gemma3:4b" # "llama3.2:1b"

response = ollama.chat(
    model=llm_model,
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content}
    ]
)
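For intuition, the same request can be expressed against the server's REST endpoint `/api/chat`. The sketch below only builds the HTTP request with the standard library and prints it; nothing is sent. The helper name `build_chat_request` is made up here, and the payload shows the basic fields only, so it is not meant as an exact copy of what the Python client sends.

```python
import json
import urllib.request


def build_chat_request(model, messages, base_url="http://127.0.0.1:11434"):
    """Build (but do not send) a POST request for the /api/chat endpoint."""
    # stream=False asks for one complete JSON reply instead of a token stream.
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "gemma3:4b",
    [
        {"role": "system", "content": "You are a friendly lecturer who explains things simply and concretely."},
        {"role": "user", "content": "Explain large language models in one sentence."},
    ],
)
print(req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (requires a running server with the model pulled):
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
```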

Block 4: Display the LLM’s reply

In [None]:
# The response supports dictionary-style access (newer versions of the
# Python client return a ChatResponse object rather than a plain dict).
# The generated text is stored under: response['message']['content']
print(response['message']['content'])

Okay, let's tackle that! 

Essentially, a large language model is like a really, *really* good student who's read a massive library of text and learned to predict what words should come next – that's how it generates its responses! 

Does that make sense as a starting point? We can definitely break it down further if you’d like!


******************************************************************
*Example: Few-shot prompting using Ollama from Python (chat API)*

Block 1: Import Ollama client

In [None]:
# Import the Python client that talks to the local Ollama server.
# By default, it connects to http://127.0.0.1:11434
import ollama

Block 2: Define the system prompt

In [None]:
# The "system" message sets the assistant’s overall role and behavior.
system_content = "You are a friendly lecturer who explains things simply and concretely."

Block 3: Provide few-shot examples

In [None]:
# Few-shot prompting = showing the model a few example Q&A pairs
# before asking the real question.
# This helps guide the style and depth of the answer.
few_shots = [
    {"role": "user", "content": "What is overfitting?"},
    {"role": "assistant", "content": "Overfitting is when a model memorizes the training data so well that it fails to generalize to new data."},

    {"role": "user", "content": "Explain gradient descent in one sentence."},
    {"role": "assistant", "content": "Gradient descent repeatedly nudges model parameters in the direction that most reduces error, based on the current slope of the loss."}
]
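If your examples live as plain (question, answer) pairs, a small helper can turn them into the alternating user/assistant turns shown above. This is only a sketch; the name `pairs_to_messages` is hypothetical.

```python
def pairs_to_messages(pairs):
    """Turn (question, answer) pairs into alternating user/assistant chat turns."""
    messages = []
    for question, answer in pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages


example_pairs = [
    ("What is overfitting?",
     "Overfitting is when a model memorizes the training data so well that it fails to generalize to new data."),
    ("Explain gradient descent in one sentence.",
     "Gradient descent repeatedly nudges model parameters in the direction that most reduces error, based on the current slope of the loss."),
]

few_shots_from_pairs = pairs_to_messages(example_pairs)
for m in few_shots_from_pairs:
    print(m["role"], "->", m["content"][:40])
```

Keeping the pairs separate from the message format makes it easy to add, remove, or reorder examples.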

Block 4: Add the actual prompt

In [None]:
# The real user question (after the examples).
user_content = "Explain large language models in one sentence."

Block 5: Send the chat request

In [None]:
# Send a chat request to the Ollama server.
# - model=<model> selects a small, CPU-friendly model.
#   (make sure you’ve pulled it first with: !ollama pull <model>)
# - messages include: system prompt → few-shot examples → user question

response = ollama.chat(
    model=llm_model,
    messages=[
        {"role": "system", "content": system_content},
        *few_shots,
        {"role": "user", "content": user_content},
    ],
)

Block 6: Show the LLM's reply

In [None]:
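# The response supports dictionary-style access (newer versions of the
# Python client return a ChatResponse object rather than a plain dict).
# The generated text is stored inside response["message"]["content"]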
print(response["message"]["content"])

Large language models are essentially super-smart computer programs trained on massive amounts of text to predict the next word in a sequence, allowing them to generate human-like text. 

Does that make sense, or would you like me to break it down a little further?


******************************************************************
*Example: Minimal RAG using Ollama from Python (chat API)*

Block 1: Pull the embedding model (once)

In [None]:
# Pull the embedding model used for vector search.
# (Also make sure you've pulled your generator model, e.g., !ollama pull <model>)

!ollama pull nomic-embed-text # embedding model

[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l


Block 2: Imports + config

In [None]:
# Talk to the local Ollama server and use NumPy for vector math
import ollama, numpy as np

# Generator (LLM) and embedding model
GEN_MODEL = llm_model              # generator; llm_model was set in the zero-shot example
EMB_MODEL = "nomic-embed-text"     # embeddings for retrieval

# System prompt controls tone/behavior of the assistant
system_content = "You are a friendly lecturer who explains things simply and accurately."

Block 3: Tiny knowledge base (your source texts)

In [None]:
# Documents you want the model to ground its answers on.
# Replace these with chunks from your notes, PDFs, web pages, etc.
docs = [
    {"id": "note1", "text": "A large language model (LLM) is a neural network trained on vast text to predict the next token and perform language tasks."},
    {"id": "note2", "text": "Tokenization splits text into subword units; LLMs learn statistical patterns over these tokens."},
    {"id": "note3", "text": "During inference, the model autoregressively generates tokens conditioned on the prompt and previous outputs."},
    {"id": "note4", "text": "Pretraining uses self-supervised objectives (like next-token prediction) on large corpora; fine-tuning adapts to specific tasks."},
]
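To build such a list from a longer text, one simple option is to split it into overlapping word windows. The sketch below is an assumption, not the only way to chunk: the names `chunk_words` and `to_docs` and the default sizes are made up for illustration.

```python
def chunk_words(text, max_words=40, overlap=10):
    """Split text into chunks of at most `max_words` words.

    Consecutive chunks share `overlap` words so a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_docs(text, prefix="chunk", **kwargs):
    """Wrap chunks in the same {"id", "text"} shape used by `docs` above."""
    return [
        {"id": f"{prefix}{i}", "text": chunk}
        for i, chunk in enumerate(chunk_words(text, **kwargs), 1)
    ]


print(to_docs("a b c d e", prefix="note", max_words=3, overlap=1))
```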

Block 4: Embedding + retrieval helpers

In [None]:
def embed(texts):
    """
    Convert a list of strings into embedding vectors using Ollama.
    Returns: numpy array of shape (n_texts, embedding_dim).
    """
    vecs = []
    for t in texts:
        # Ask Ollama to produce an embedding vector for this text
        e = ollama.embeddings(model=EMB_MODEL, prompt=t)["embedding"]
        vecs.append(e)
    # Stack all embeddings into a single NumPy array
    return np.array(vecs, dtype=np.float32)


def cosine_sim_matrix(A, b):
    """
    Compute cosine similarity between:
      - A = 2D array of many embeddings (shape: [n_docs, dim])
      - b = 1D array, a single query embedding (shape: [dim])
    Returns: similarity scores (length = n_docs)
    """
    # Normalize each vector in A to unit length
    A_norm = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
    # Normalize query vector
    b_norm = b / (np.linalg.norm(b) + 1e-12)
    # Cosine similarity = dot product of normalized vectors
    return A_norm @ b_norm


# --- Indexing step (done once at startup) ---
# Extract raw text from docs
doc_texts = [d["text"] for d in docs]
# Precompute embeddings for all docs (so we don’t recompute each query)
doc_vecs = embed(doc_texts)


def retrieve(query, k=3):
    """
    Given a query string:
      1. Embed the query
      2. Compute cosine similarity against all document embeddings
      3. Return the top-k most similar docs (with IDs, text, and scores)
    """
    # Step 1: Embed the query (embed() returns shape (1, dim), so take the first row)
    q_vec = embed([query])[0]
    # Step 2: Compare the query embedding to all doc embeddings
    sims = cosine_sim_matrix(doc_vecs, q_vec)
    # Step 3: Pick the indices of the top-k most similar docs (highest scores first),
    #         then return those docs together with their similarity scores
    idxs = np.argsort(-sims)[:k]
    return [
        {"id": docs[i]["id"], "text": docs[i]["text"], "score": float(sims[i])}
        for i in idxs
    ]
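To see the ranking logic without a server, here is a pure-Python sketch of the same idea on hand-made 3-dimensional vectors. The toy vectors are invented for illustration (real embeddings have hundreds of dimensions), and the helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # The small constant guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scores = [cosine(v, query_vec) for v in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Made-up "embeddings": doc 0 points along x, doc 1 along y, doc 2 in between.
toy_doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_query = [2.0, 0.1, 0.0]  # mostly along x

ranking = top_k(toy_query, toy_doc_vecs, k=2)
for idx, score in ranking:
    print(f"doc {idx}: {score:.3f}")
```

The query points mostly along the first axis, so doc 0 ranks first and the in-between doc 2 second. Note that vector length does not matter, only direction.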


Block 5: Build a context block the LLM will read

In [None]:
def build_context_block(retrieved):
    """
    Take a list of retrieved documents (with id, text, score)
    and format them into a readable block of text that we can
    pass into the model as context.
    """
    lines = []  # will hold formatted strings for each retrieved doc

    # Loop over retrieved docs, numbering them starting at 1
    for i, r in enumerate(retrieved, 1):
        # Each entry shows:
        # - the document number in this batch (Doc 1, Doc 2, …)
        # - the doc ID (from our original docs list)
        # - the similarity score (3 decimals)
        # - the actual text content
        lines.append(f"[Doc {i} | {r['id']} | score={r['score']:.3f}]\n{r['text']}")

    # Join all docs together into one string, separated by blank lines
    return "\n\n".join(lines)
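A quick way to see the format is to call the function on hand-written entries. The sketch repeats the function (without comments) so it runs on its own; the two sample snippets are placeholders, not real retrieval results.

```python
def build_context_block(retrieved):
    """Same formatting as above: one header line per doc, then its text."""
    lines = []
    for i, r in enumerate(retrieved, 1):
        lines.append(f"[Doc {i} | {r['id']} | score={r['score']:.3f}]\n{r['text']}")
    return "\n\n".join(lines)


# Placeholder entries in the shape returned by retrieve()
sample = [
    {"id": "note1", "text": "First snippet.", "score": 0.7794513702392578},
    {"id": "note2", "text": "Second snippet.", "score": 0.6357552409172058},
]

block = build_context_block(sample)
print(block)
# [Doc 1 | note1 | score=0.779]
# First snippet.
#
# [Doc 2 | note2 | score=0.636]
# Second snippet.
```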

Block 6: Ask a question -> retrieve -> generate (RAG)

In [None]:
# Your question
user_content = "Explain large language models in one sentence."

# Retrieve supporting snippets
retrieved = retrieve(user_content, k=2)
context_block = build_context_block(retrieved)

# Instruct the model to rely on the retrieved context
rag_instructions = (
    "Use ONLY the context to answer if possible. "
    "If the answer isn't in the context, say so briefly. "
    "Cite the doc numbers you used like [Doc 1]. Keep it concise."
)

# Compose messages (system + user with context + question)
messages = [
    {"role": "system", "content": system_content},
    {"role": "user", "content": f"{rag_instructions}\n\n=== Context ===\n{context_block}\n\n=== Question ===\n{user_content}"},
]

# Generate the grounded answer
response = ollama.chat(model=GEN_MODEL, messages=messages)
answer = response["message"]["content"]
print(answer)

Large language models are neural networks trained to predict the next token in a sequence of text [Doc 1].
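If you plan to ask several questions, the message assembly can be wrapped in a function. This is a sketch of one possible refactoring using the same prompt layout as the cell above; the name `build_rag_messages` is hypothetical and no request is sent here.

```python
def build_rag_messages(system_content, rag_instructions, context_block, question):
    """Assemble the system + user messages for a RAG request."""
    user_prompt = (
        f"{rag_instructions}\n\n"
        f"=== Context ===\n{context_block}\n\n"
        f"=== Question ===\n{question}"
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_prompt},
    ]


demo_messages = build_rag_messages(
    "You are a friendly lecturer who explains things simply and accurately.",
    "Use ONLY the context to answer if possible.",
    "[Doc 1 | note1 | score=0.779]\nPlaceholder context text.",
    "Explain large language models in one sentence.",
)
print(demo_messages[1]["content"])
```

The returned list can be passed directly as `messages=` to `ollama.chat`.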


Block 7: (Optional) Inspect what was retrieved

In [None]:
# See which documents were fed to the model and how similar they were.
for r in retrieved:
    print(r)

{'id': 'note1', 'text': 'A large language model (LLM) is a neural network trained on vast text to predict the next token and perform language tasks.', 'score': 0.7794513702392578}
{'id': 'note2', 'text': 'Tokenization splits text into subword units; LLMs learn statistical patterns over these tokens.', 'score': 0.6357552409172058}


******************************************************************
Stop any running Ollama server processes

In [None]:
# Stop any running Ollama server processes (if they exist).
# - pkill -f "ollama serve" : looks for processes matching the string "ollama serve" and kills them
# - || true : ensures this command never errors out
#              (so if no server is running, the cell still succeeds quietly)
!pkill -f "ollama serve" || true

# Wait a moment to let the process fully shut down
!sleep 3



Check if Ollama server is running

In [None]:
# List all running processes that mention "ollama" in their command line.
# - ps aux : show details of all processes
# - grep ollama : filter the list to only lines containing "ollama"
# Note: this will also show the "grep ollama" command itself — that's normal.
!ps aux | grep ollama

root        3777  0.0  0.0   7376  3572 ?        S    09:57   0:00 /bin/bash -c ps aux | grep ollama
root        3779  0.0  0.0   6484  2260 ?        S    09:57   0:00 grep ollama
