In [1]:
%%capture
# %%capture prevents this cell from printing a ton of STDERR stuff to the screen

## First, check to see if LlamaIndex and its Ollama and Hugging Face
## integrations are installed, if not, install them.
##
## NOTE: If you **do** need to install something, just know that you may need to
##       restart your session for python to find the new module(s).
##
##       To restart your session:
##       - In Google Colab, click on the "Runtime" menu and select
##         "Restart Session" from the pulldown menu
##       - In a local jupyter notebook, click on the "Kernel" menu and select
##         "Restart Kernel" from the pulldown menu
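import importlib
import subprocess
import sys

try:
  ## The import names use underscores and dots; the hyphenated names are pip package names.
  for module in ("llama_index.core",
                 "llama_index.llms.ollama",
                 "llama_index.embeddings.huggingface"):
    importlib.import_module(module)
except ImportError:
  ## Install with the same python that is running this notebook.
  subprocess.check_call([sys.executable, "-m", "pip", "install",
                         "llama-index",
                         "llama-index-llms-ollama",
                         "llama-index-embeddings-huggingface"])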

In [2]:
# Install Ollama
## Ollama is a relatively easy way to download and
## use LLMs locally. It takes care of all the details
## required to get them going. Ollama uses a client-server
## model and gives us (and LlamaIndex)
## an API for accessing the LLMs.
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


For this RAG tutorial, we're going to use **LlamaIndex**, which is a data framework for LLMs.

In [1]:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

## NOTE: If you get an error, just restart the session.
##
##       To restart your session:
##       - In Google Colab, click on the "Runtime" menu and select
##         "Restart Session" from the pulldown menu
##       - In a local jupyter notebook, click on the "Kernel" menu and select
##         "Restart Kernel" from the pulldown menu

In [3]:
# Start the local Ollama Server
## 'ollama serve' starts the server that the Ollama client (and LlamaIndex)
## talks to. 'nohup ... &' runs it in the background so the notebook can
## keep going, and everything the server prints is saved in 'ollama.log'.
!nohup ollama serve > ollama.log 2>&1 &
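
Because the server is started in the background, the next cell can run before the server is ready to answer. One way to guard against that is to poll the API address until it responds. The sketch below is only an illustration: `wait_for_ollama` is a hypothetical helper (not part of Ollama or LlamaIndex), and the default address is the one printed by the installer above.

```python
import time
import urllib.error
import urllib.request


def wait_for_ollama(base_url="http://127.0.0.1:11434", timeout_s=30.0, poll_s=0.5):
    """Poll base_url until it answers an HTTP request or timeout_s elapses.

    Returns True if the server responded, False otherwise.
    """
    # The server is local, so skip any HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener.open(base_url, timeout=2.0):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_ollama()` in its own cell before `!ollama pull gemma3:1b` should return `True` once the server is reachable. If it returns `False`, `ollama.log` is the first place to look.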

In [4]:
## Download the weights for the smallest gemma3 model: about 1 billion parameters
!ollama pull gemma3:1b

[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠙ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠹ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠼ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠴ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠧ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest ⠇ [K[?25h[?2026l[?2026h[?25l[1Gpulling manifest [K
pulling 7cd4618c1faf...   0% ▕▏    0 B/815 MB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling 7cd4618c1faf...   0% ▕▏    0 B/815 MB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling 7cd4618c1faf...   0% ▕▏    0 B/815 MB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling manifest [K
pulling 7cd4618c1faf...   4% ▕▏  33 MB/815 MB                  [K[?25h[?2026l[?2026h[?25l[A[1Gpulling 

In [5]:
# Set the LLM that we will use to the smallest gemma3 LLM
llm = Ollama(model="gemma3:1b", request_timeout=120.0)
## Download an Embedding Model (aka, Encoder-Only Transformer)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/779 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/191 [00:00<?, ?B/s]

In [6]:
## Click on the file icon on the left and drag and drop PDFs into that directory.
input_dir_path = "/content/"

In [8]:
# load data
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()
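
If `docs` comes back empty or smaller than expected, it can help to check which PDF files are actually sitting under the input directory. The sketch below uses only the standard library. `find_pdfs` is a hypothetical helper that approximates the `required_exts=[".pdf"]` and `recursive=True` settings; the reader's own rules (for example, how it treats hidden files or upper-case extensions) may differ.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return the sorted paths of all .pdf files under input_dir, searched recursively."""
    root = Path(input_dir)
    # Assumption: match the extension case-insensitively, so 'paper.PDF' is listed too.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

For example, `for p in find_pdfs("/content/"): print(p)` lists the candidate files before any of them are parsed.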

In [9]:
# Creating an index from the loaded data
Settings.embed_model = embed_model
index = VectorStoreIndex.from_documents(docs, show_progress=True)

Parsing nodes:   0%|          | 0/16 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/28 [00:00<?, ?it/s]

In [10]:
# Create an index from loaded data
## First, set the LlamaIndex embedding model with 'Settings'
Settings.embed_model = embed_model
## Second, with the embedding model set, we can break
## the documents into chunks (called "nodes" in LlamaIndex speak)
## and apply the embedding model to each chunk, creating an index
## of vectors that we can use for our RAG database.
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Create the query engine - a query engine takes a query, retrieves
## chunks (nodes) of text from the RAG vector database, and then uses
## an LLM to generate a nice response to the query that is based on
## the stuff in the RAG database that matched the query.

## First, just like we did for the embedding model, we set the LlamaIndex
## LLM with Settings
Settings.llm = llm
## Now create the query engine that we will ultimately pass the prompt
## to.
query_engine = index.as_query_engine()

# Generate the response
response = query_engine.query("What exactly is BERT?")

Parsing nodes:   0%|          | 0/16 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/28 [00:00<?, ?it/s]
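
The two progress bars count different things: 16 items were parsed, but 28 embeddings were generated. That is presumably because a long document gets split into more than one chunk (node). The sketch below is a toy illustration of chunking with overlap. It is **not** LlamaIndex's splitter (which works on sentences and tokens rather than raw words), and `chunk_words`, `chunk_size` and `overlap` are hypothetical names chosen for this example.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words, so text near a boundary
    shows up in both of the neighboring chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# A made-up 120 word "document" gets split into 3 overlapping chunks.
toy_text = " ".join(f"w{i}" for i in range(120))
print(len(chunk_words(toy_text)))
```

The overlap costs a little extra storage, but it reduces the chance that the one sentence that answers a question gets cut in half at a chunk boundary.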

In [11]:
print(response)

BERT is Bidirectional Encoder Representations from Transformers. It’s a language model that pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
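
Behind the scenes, the retrieval step compares the embedding of the query to the embedding of every chunk and keeps the closest matches. The sketch below shows that idea with cosine similarity on tiny made-up vectors. It is a simplified stand-in for what a vector index does, and the names `cosine_similarity`, `top_k`, `toy_chunks` and `toy_query` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return (index, score) pairs for the k chunks most similar to query_vec."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embeddings have many more dimensions.
toy_chunks = [
    [0.9, 0.1, 0.0],  # chunk 0: points almost the same way as the query
    [0.0, 1.0, 0.0],  # chunk 1: unrelated to the query
    [0.7, 0.7, 0.0],  # chunk 2: partly related to the query
]
toy_query = [1.0, 0.0, 0.0]
print(top_k(toy_query, toy_chunks, k=2))
```

Here chunks 0 and 2 are returned, in that order, and chunk 1 is left out. Only the retrieved chunks are passed on to the LLM, which is why the quality of the embedding model matters so much.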


In [12]:
# Generate the response
response = query_engine.query("Who are Norm and 'Squatch?")

In [13]:
print(response)

Norm and 'Squatch' are not mentioned in the provided text.
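
This answer makes sense once you see roughly what the LLM is given: the retrieved chunks are pasted into a prompt that tells the model to answer from that context rather than from what it already knows. The sketch below shows the general shape of such a prompt. The wording is an approximation made up for illustration (the exact template LlamaIndex uses may differ), and `build_rag_prompt` is a hypothetical helper.

```python
def build_rag_prompt(chunks, question):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {question}\n"
        "Answer: "
    )


# Made-up chunks standing in for whatever the retriever returned.
toy_retrieved = [
    "BERT pre-trains deep bidirectional representations from unlabeled text.",
    "The model is fine-tuned with one additional output layer.",
]
print(build_rag_prompt(toy_retrieved, "Who are Norm and 'Squatch?"))
```

None of the retrieved chunks mention Norm or 'Squatch, so a model that follows the instruction has nothing to answer with.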


Now upload the file **norm_and_squatch_text.pdf**, rebuild the index so that the new file gets embedded, and then ask the same question...

In [14]:
# load data
## NOTE: consider trying WikipediaReader(): https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-wikipedia
loader = SimpleDirectoryReader(
    input_dir=input_dir_path,
    required_exts=[".pdf"],
    recursive=True,
)
docs = loader.load_data()

# Creating an index from loaded data
## First, set the LlamaIndex embedding model with 'Settings'
Settings.embed_model = embed_model
## Second, with the embedding model set, we can break
## the documents into chunks (called "nodes" in LlamaIndex speak)
## and apply the embedding model to each chunk, creating an index
## of vectors that we can use for our RAG database.
index = VectorStoreIndex.from_documents(docs, show_progress=True)

# Create the query engine - a query engine takes a query, retrieves
## chunks (nodes) of text from the RAG vector database, and then uses
## an LLM to generate a nice response to the query that is based on
## the stuff in the RAG database that matched the query.

## First, just like we did for the embedding model, we set the LlamaIndex
## LLM with Settings
Settings.llm = llm
## Now create the query engine that we will ultimately pass the prompt
## to.
query_engine = index.as_query_engine()

# Generate the response
response = query_engine.query("Who are Norm and 'Squatch?")

Parsing nodes:   0%|          | 0/17 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/29 [00:00<?, ?it/s]

In [15]:
print(response)

Norm and 'Squatch' are characters in StatQuest videos that are a play-off of Sasquatch. 'Squatch' asks a lot of questions and plays the role of the “novice” in the videos and the goal is to teach ‘Squatch’.
