<a href="https://colab.research.google.com/github/Chrisbarbosa711/CSC360researchProject-RAGLLMs/blob/main/RAGmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Building a Mini Retrieval-Augmented Generation (RAG) System Using Gemma-3 + Wikipedia

This Colab notebook walks through:

Using a Gemma-3 model as a lightweight LLM for generation.

Building a RAG pipeline using:

Wikipedia dataset (HuggingFace)

Document chunking

Sentence Transformers for embeddings

FAISS vector store

Running example queries from earlier.

Monitoring simple latency and token-count metrics.

Highly documented code, suitable for training, demos, or prototyping.

This notebook is fully self-contained and runnable in Colab GPU mode.

In [56]:
!pip install -q transformers datasets sentence-transformers faiss-cpu langchain accelerate langchain-community tokenizers torch

The Wikipedia dataset is very large (many GB), so for a simple prototype RAG system we will:

Load only the first 5,000 examples.

Extract each page's text field.

Perform simple character-based chunking into chunks of up to 700 characters, with a 100-character overlap between neighbouring chunks.


This gives us enough material to test retrieval & generation realistically.

In [3]:
from datasets import load_dataset

# Load small subset to keep memory + compute low
# The 'wikipedia' dataset script is deprecated. Use 'wikimedia/wikipedia' with a specific configuration.
# Using a recent English snapshot '20231101.en'.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:5000]")

print("Loaded:", len(wiki), "Wikipedia pages")
print("Sample page:", wiki[0]["title"])
print(wiki[0]["text"][:500], "...")

Loaded: 5000 Wikipedia pages
Sample page: Anarchism
Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. As a historically left-wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as ...
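
A split slice such as train[:5000] is only applied after the full train split has been downloaded and prepared, which for this snapshot means dozens of parquet files. As a possible alternative, the sketch below streams rows lazily and stops after the first n pages with non-empty text. The helper names first_n_pages and stream_wikipedia are hypothetical and not part of the original notebook; streaming=True is the standard datasets option for lazy loading.

```python
from itertools import islice

def first_n_pages(rows, n):
    """Return the first n rows with non-empty text from any iterable of dict rows."""
    non_empty = (row for row in rows if row.get("text"))
    return list(islice(non_empty, n))

def stream_wikipedia(n=5000):
    """Hypothetical helper: stream the snapshot instead of downloading every shard."""
    # Imported lazily so first_n_pages can be used without `datasets` installed.
    from datasets import load_dataset
    stream = load_dataset(
        "wikimedia/wikipedia", "20231101.en", split="train", streaming=True
    )
    return first_n_pages(stream, n)

# Usage in Colab (requires network access):
# wiki = stream_wikipedia(5000)
```

The result is a plain list of dicts, so wiki[0]["title"] and the later `for row in wiki` loop would work unchanged under this assumption.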


RAG pipelines typically split long documents into chunks so that:

Each chunk can be embedded efficiently.

Retrieval can return only relevant parts.

The LLM's context window does not overflow.

We make a simple fixed-size chunker, which is fine for demos.
(Production systems may use token-based or semantic chunking.)

In [36]:
from langchain_core.documents import Document

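def chunk_text(text, chunk_size=700, overlap=100):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start = end - overlap  # step forward by chunk_size - overlap
    return chunks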

docs = []
for row in wiki:
    text = row["text"]
    if not text:
        continue
    chunks = chunk_text(text)
    for chunk in chunks:
        docs.append(Document(page_content=chunk))

print("Total chunks created:", len(docs))
print("Example chunk:\n", docs[0].page_content[:300], "...")

Total chunks created: 119038
Example chunk:
 Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation-states, and capitalism. Anarchism advocates for the replacement of the state  ...
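
To make the overlap arithmetic concrete: with a 700-character window and a 100-character overlap, the window advances by 600 characters each step. The sketch below (helper names chunk_starts and expected_chunks are illustrative, not from the notebook) lists the start offsets the chunker visits and the resulting chunk count for a page of a given length.

```python
import math

def chunk_starts(text_len, chunk_size=700, overlap=100):
    """Start offsets visited by a sliding window of chunk_size with the given overlap."""
    stride = chunk_size - overlap  # how far the window moves each step
    return list(range(0, text_len, stride))

def expected_chunks(text_len, chunk_size=700, overlap=100):
    """Number of chunks produced for a text of text_len characters."""
    stride = chunk_size - overlap
    return math.ceil(text_len / stride)

print(chunk_starts(1500))     # [0, 600, 1200]
print(expected_chunks(1500))  # 3
```

One consequence of this scheme is that the last chunk can be short, and for some lengths (for example 650 characters) it consists only of text already covered by the previous chunk.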


We use:

sentence-transformers/all-MiniLM-L6-v2
Light & fast; great for RAG demos.

FAISS for similarity search

We embed all document chunks, then build the FAISS index.

In [57]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

# Build the FAISS vector store
vector_store = FAISS.from_documents(docs, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

print("Vector store ready.")

RuntimeError: Failed to import transformers.models.fastspeech2_conformer.configuration_fastspeech2_conformer because of the following error (look up to see its traceback):
No module named 'transformers.models.fastspeech2_conformer.configuration_fastspeech2_conformer'
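
Conceptually, a flat vector index answers a query by comparing the query embedding against every stored embedding and returning the k closest. The sketch below shows that idea in plain Python with toy 3-dimensional vectors (all-MiniLM-L6-v2 actually produces 384-dimensional vectors). It is an illustration only: FAISS uses optimized index structures, and whether it ranks by L2 distance or inner product depends on how the index is configured. The names l2_distance and top_k are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = [(i, l2_distance(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Toy "embeddings": the first two point in nearly the same direction.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
print([i for i, _ in top_k(query, doc_vecs, k=2)])  # [0, 1]
```

In the notebook, search_kwargs={"k": 3} plays the role of k here: the retriever returns the three closest chunks per query.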

We load the Gemma-3 model through Hugging Face Transformers.
If this exact model name is not yet published, substitute any Gemma model such as:

"google/gemma-2-2b"

"google/gemma-7b"

"google/gemma-3-4b" (if available)

We wrap it using HuggingFacePipeline to integrate it with LangChain.

In [59]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# HuggingFacePipeline is imported from langchain-community (installed above);
# the older `langchain.llms` path raises ModuleNotFoundError on recent LangChain releases.
from langchain_community.llms import HuggingFacePipeline

model_name = "google/gemma-3-4b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

gen_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=350,
    do_sample=True,   # temperature only takes effect when sampling is enabled
    temperature=0.1,  # low temperature keeps answers close to deterministic
)

llm = HuggingFacePipeline(pipeline=gen_pipe)
print("Gemma-3 Loaded.")

We now connect:

Retriever (FAISS)

LLM (Gemma-3)

Simple “stuff” chain (concatenate retrieved chunks)

LangChain’s RetrievalQA handles this.
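
As a rough picture of what the "stuff" strategy does, the sketch below joins the retrieved chunks into one context block and places the question after it. The template wording and the function name build_stuff_prompt are assumptions made for illustration; LangChain uses its own prompt template internally.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into a single prompt (illustrative template)."""
    context = "\n\n".join(chunks)  # all chunks are "stuffed" into one context block
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "What does a seismograph measure?",
    ["A seismograph records ground motion.", "Seismology is the study of earthquakes."],
)
print(prompt)
```

Because every retrieved chunk is placed in the prompt, the prompt length grows with k and the chunk size, which is why both are kept small here.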

In [49]:
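try:
    from langchain.chains import RetrievalQA  # LangChain < 1.0
except ModuleNotFoundError:
    # LangChain 1.x moved the legacy chains into the separate
    # `langchain-classic` package (pip install langchain-classic)
    from langchain_classic.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",          # concatenate all retrieved chunks into one prompt
    retriever=retriever,
    return_source_documents=True
)

print("RAG system ready.")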

We now run the same example questions provided earlier:

Example Queries

“Explain the history and major events of the Roman Empire.”

“What are the main differences between mitosis and meiosis?”

“List best practices for storing sensitive data securely.”

“Summarize the key concepts from the article titled ‘Quantum entanglement and Bell inequalities’.”

“Given documentation excerpt… generate an FAQ.”

“MCQ: What does a seismograph measure?”

We will evaluate:

Retrieved documents

Model answer

Time taken

Simple rough token counts (input + output)

In [None]:
import time

def count_tokens(text):
    # Approximate token count using whitespace split (cheap + OK for demo);
    # real tokenizer counts will usually be higher.
    return len(text.split())

def run_query(query):
    print("="*70)
    print("QUERY:", query)
    print("="*70)

    # perf_counter is a monotonic clock intended for measuring elapsed time
    start = time.perf_counter()
    result = qa.invoke({"query": query})
    elapsed = time.perf_counter() - start

    answer = result["result"]
    sources = result["source_documents"]

    # collect metrics
    retrieved_text = "\n".join([s.page_content for s in sources])
    input_tokens = count_tokens(retrieved_text) + count_tokens(query)
    output_tokens = count_tokens(answer)

    print("\n--- ANSWER ---\n", answer)
    print("\n--- SOURCES (first 300 chars each) ---")
    for i, s in enumerate(sources):
        print(f"[{i}] {s.page_content[:300]} ...\n")

    print("--- METRICS ---")
    print("Time (sec):", round(elapsed, 3))
    print("Approx input tokens:", input_tokens)
    print("Approx output tokens:", output_tokens)
    print()

**Run All Example Queries**

In [None]:
queries = [
    "Explain the history and major events of the Roman Empire.",
    "What are the main differences between mitosis and meiosis?",
    "List best practices for storing sensitive data securely.",
    "Summarize key concepts from the article titled 'Quantum entanglement and Bell inequalities'.",
    "Generate a short FAQ answer for users needing to comply with chemical safety standards.",
    "What does a seismograph measure? Choose from A) Earthquakes B) Rainfall C) Sunlight D) Temperature."
]

for q in queries:
    run_query(q)
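
To compare runs across queries, the per-query numbers could be collected and summarized. The sketch below assumes a variant of run_query that returns a dict with seconds, input_tokens and output_tokens (the notebook's version only prints them); summarize_runs is a hypothetical helper, and the values in the example are placeholders, not measurements.

```python
from statistics import mean

def summarize_runs(runs):
    """Aggregate per-query metrics: mean latency, mean input size, output throughput."""
    total_out = sum(r["output_tokens"] for r in runs)
    total_time = sum(r["seconds"] for r in runs)
    return {
        "mean_latency_s": mean(r["seconds"] for r in runs),
        "mean_input_tokens": mean(r["input_tokens"] for r in runs),
        # Guard against division by zero if no time was recorded
        "output_tokens_per_s": total_out / total_time if total_time > 0 else 0.0,
    }

# Placeholder values for illustration only.
example_runs = [
    {"seconds": 2.0, "input_tokens": 300, "output_tokens": 100},
    {"seconds": 4.0, "input_tokens": 500, "output_tokens": 200},
]
print(summarize_runs(example_runs))
```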

This notebook’s metrics include:

Latency

Approx token counts

To estimate energy:

You can instrument GPU power by polling nvidia-smi --query-gpu=power.draw --format=csv in a loop during inference and integrating the readings over the run time.
Even without exact measurement, you can fall back on typical per-token energy figures for 2–4B-parameter models (a low number of Wh per query), treating the result as an order-of-magnitude estimate only.

Collected metrics:

input tokens ≈ amount of text retrieved + query tokens.

output tokens ≈ LLM answer.

Energy tends to scale with total tokens processed.
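
As a sketch of the measurement approach, power readings sampled during inference can be integrated over time to obtain energy, then divided by the tokens processed. The code below assumes readings in the form nvidia-smi prints for power.draw (for example "45.32 W") paired with timestamps in seconds; the function names are hypothetical, and the sample values are placeholders rather than measurements.

```python
def parse_power_watts(line):
    """Parse one power.draw reading such as '45.32 W' (or '45.32') into watts."""
    return float(line.split()[0])

def energy_wh(samples):
    """Trapezoidal integration of (time_s, watts) samples, returned in watt-hours."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)  # average power times interval
    return joules / 3600.0  # 1 Wh = 3600 J

def wh_per_token(total_wh, total_tokens):
    """Energy per token, given input + output tokens for the same run."""
    return total_wh / total_tokens

# Placeholder samples: a constant 100 W draw for 36 seconds.
samples = [(0.0, 100.0), (18.0, 100.0), (36.0, 100.0)]
print(energy_wh(samples))  # 1.0
```

Note that this captures GPU power only; CPU, memory and idle draw are not included, so the figure is a lower bound on the energy of a query.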