# RAG with Ollama and Gemma3

**NOTE:** We will want to use a GPU to run the examples in this notebook. In Google Colab, go to **Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4**.

----

In [None]:
%%capture
# %%capture prevents this cell from printing a ton of STDERR stuff to the screen

## First, check to see if the LlamaIndex packages are installed; if not, install them.
##
## NOTE: If you **do** need to install something, just know that you may need to
##       restart your session for python to find the new module(s).
##
##       To restart your session:
##       - In Google Colab, click on the "Runtime" menu and select
##         "Restart Session" from the pulldown menu
##       - In a local jupyter notebook, click on the "Kernel" menu and select
##         "Restart Kernel" from the pulldown menu
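import importlib
import subprocess
import sys

try:
  # pip package names use hyphens, but the importable modules use dots
  for module in ("llama_index.core",
                 "llama_index.llms.ollama",
                 "llama_index.embeddings.huggingface"):
    importlib.import_module(module)
except ImportError:
  # Install with the same python that is running this notebook
  subprocess.check_call([sys.executable, "-m", "pip", "install",
                         "llama-index",
                         "llama-index-llms-ollama",
                         "llama-index-embeddings-huggingface"])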

In [None]:
# Install Ollama
## Ollama is a relatively easy way to download and run LLMs locally.
## It takes care of the details required to get them going. Ollama
## uses a client-server model and gives us (and LlamaIndex) an API
## for accessing the LLMs.
!curl -fsSL https://ollama.com/install.sh | sh

For this RAG tutorial, we're going to use **LlamaIndex**, which is a data framework for LLMs.

In [None]:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

## NOTE: If you get an error, just restart the session.
##
##       To restart your session:
##       - In Google Colab, click on the "Runtime" menu and select
##         "Restart Session" from the pulldown menu
##       - In a local jupyter notebook, click on the "Kernel" menu and select
##         "Restart Kernel" from the pulldown menu

In [None]:
# Start the local Ollama server
## We run the server in the background ('nohup ... &') so that the
## notebook stays usable, and we send its output to 'ollama.log'.
!nohup ollama serve > ollama.log 2>&1 &
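
Because the server starts in the background, the next cell can run before the server is ready to accept requests. One way to guard against that is to poll the server until it answers. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`; the helper name `wait_for_ollama` and its timeout values are made up for this example.

```python
import time
import urllib.request


def wait_for_ollama(base_url="http://localhost:11434", timeout=30.0, interval=0.5):
    """Poll base_url until it answers with HTTP 200 or `timeout` seconds pass.

    NOTE: base_url is an assumption (Ollama's default address); change it
    if your server listens somewhere else.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(base_url, timeout=2) as reply:
                if reply.status == 200:
                    return True
        except OSError:
            # Connection refused, timed out, or an HTTP error: not ready yet
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_ollama()` in its own cell right after starting the server returns `True` once the server responds. If it returns `False`, `ollama.log` is the first place to look.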

In [None]:
## Download the weights for the smallest gemma3 model: 1 billion parameters
!ollama pull gemma3:1b

In [None]:
# Set the LLM that we will use to the smallest gemma3 LLM
llm = Ollama(model="gemma3:1b", request_timeout=120.0)

## Download an embedding model (aka, an Encoder-Only Transformer)
## "bge-large-en-v1.5" is an embedding model with 335M parameters
## so, even though it is called "large", it's not that big.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)

-----

Now we need to upload a PDF file that we can use for RAG. There are two demo PDF files that we can download. The original **[BERT manuscript](https://github.com/StatQuest/UphillConf2025/blob/main/pdf_files/bert_manuscript.pdf)**, and a short description of **[Norm and 'Squatch](https://github.com/StatQuest/UphillConf2025/blob/main/pdf_files/norm_and_squatch_text.pdf)**.

Download both PDFs, but, for now, just upload the one about BERT. To upload a file, click on the **File** icon on the left and drag and drop the PDF into that directory.

In [None]:
## Click on the file icon on the left and drag and drop PDFs into that directory.

## NOTE: Start by adding just "bert_manuscript.pdf" to the directory so
## we can demonstrate that it works, but also show how it gracefully
## fails when we ask about something not in that manuscript.
input_dir_path = "/content/"

In [None]:
# Load every PDF found in the input directory
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()
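
Before building the index, it can be worth checking what was actually loaded, since the demonstration depends on only the BERT manuscript being present at this point. The sketch below assumes each loaded document carries a `file_name` entry in its `metadata` dictionary, which is what `SimpleDirectoryReader` usually provides; the helper name `count_documents_per_file` is made up for this example.

```python
from collections import Counter


def count_documents_per_file(docs):
    """Count loaded documents per source file.

    ASSUMPTION: each document has a `metadata` dict with a "file_name"
    key; documents without one are counted under "<unknown>".
    """
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in docs)
```

Running `count_documents_per_file(docs)` in the notebook shows one entry per PDF. A PDF typically loads as several documents (for example, one per page), so the counts are usually larger than 1.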

In [None]:
# Create an index from loaded data
## First, set the LlamaIndex embedding model with 'Settings'
Settings.embed_model = embed_model
## Second, with the embedding model set, we can break
## the documents into chunks (called "nodes" in LlamaIndex speak)
## and apply the embedding model to each chunk, creating an index
## of vectors that we can use for our RAG database.
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Create the query engine - a query engine takes a query, retrieves
## chunks (nodes) of text from the RAG vector database, and then uses
## an LLM to generate a response to the query that is based on
## the chunks in the RAG database that matched the query.

## First, just like we did for the embedding model, we set the LlamaIndex
## LLM with Settings
Settings.llm = llm
## Now create the query engine that we will ultimately pass the prompt
## to.
query_engine = index.as_query_engine()

# Generate the response
response = query_engine.query("What exactly is BERT?")
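
To make the retrieval step less of a black box, here is a toy version of what a vector index does: turn each chunk and the query into a vector, score each chunk by cosine similarity to the query, and keep the best matches. This is only a conceptual sketch and not how LlamaIndex is implemented. The "embedding" is a plain bag-of-words count standing in for a real embedding model, and the three example chunks are made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up chunks, just for illustration
chunks = [
    "BERT is a language representation model built from Transformer encoders.",
    "The model is pre-trained on unlabeled text and then fine-tuned.",
    "Gradient descent updates the weights to reduce the loss.",
]

for score, chunk in retrieve("What exactly is BERT?", chunks, top_k=2):
    print(f"{score:.2f}  {chunk}")
```

In a real RAG pipeline, the retrieved chunks are then placed into the prompt that is sent to the LLM, which is the part the query engine handles for us.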

In [None]:
print(response)

In [None]:
# Generate the response
response = query_engine.query("Who are Norm and 'Squatch?")

In [None]:
print(response)
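
To see why a response came out the way it did, it can help to look at the chunks that were retrieved for the query. LlamaIndex responses expose these as `response.source_nodes`, where each entry has a similarity `score` and a `node` holding the text and metadata. The helper below is a small sketch built on those attributes; the name `describe_sources` and the `file_name` metadata key are assumptions made for this example.

```python
def describe_sources(response, max_chars=80):
    """Return one summary line per retrieved chunk in response.source_nodes.

    ASSUMPTION: node metadata contains a "file_name" key; if it does not,
    "<unknown>" is shown instead.
    """
    lines = []
    for hit in response.source_nodes:
        file_name = hit.node.metadata.get("file_name", "<unknown>")
        # Collapse whitespace so that each preview fits on one line
        preview = " ".join(hit.node.get_content().split())[:max_chars]
        score = "n/a" if hit.score is None else f"{hit.score:.3f}"
        lines.append(f"{score}  {file_name}  {preview}")
    return lines
```

In the notebook, `print("\n".join(describe_sources(response)))` lists the retrieved chunks. When the question is about something that is not in the indexed PDF, the chunks still come from that PDF because the retriever returns the closest matches it has, even when none of them are relevant.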

Now upload the file **norm_and_squatch_text.pdf**, rebuild the index so that it includes the new document, and then ask the same question.

In [None]:
# Load the data again, now that the directory contains both PDFs
## NOTE: Consider trying WikipediaReader(): https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-wikipedia
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()

# Creating an index from loaded data
## First, set the LlamaIndex embedding model with 'Settings'
Settings.embed_model = embed_model
## Second, with the embedding model set, we can break
## the documents into chunks (called "nodes" in LlamaIndex speak)
## and apply the embedding model to each chunk, creating an index
## of vectors that we can use for our RAG database.
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Create the query engine - a query engine takes a query, retrieves
## chunks (nodes) of text from the RAG vector database, and then uses
## an LLM to generate a response to the query that is based on
## the chunks in the RAG database that matched the query.

## First, just like we did for the embedding model, we set the LlamaIndex
## LLM with Settings
Settings.llm = llm
## Now create the query engine that we will ultimately pass the prompt
## to.
query_engine = index.as_query_engine()

# Generate the response
response = query_engine.query("Who are Norm and 'Squatch?")

In [None]:
print(response)