# Agentic Wikipedia Proof-of-Concept (Vertex AI Edition)

Welcome to a short roadside pull-off on the information highway. Today’s trip is simple: pick a narrow topic, collect a few Wikipedia pages, embed them with Vertex AI, and see whether retrieval puts the right pages in the passenger seat before the model starts talking.

This notebook mirrors `docs/agentic_wikipedia_gcp_spec.md`, but with extra signposts for incremental development. Keep the knobs small at first (e.g., `WIKIPEDIA_MAX_DOCS=5`), then widen the map once the pipeline runs end-to-end.

## Data Source (what we’re picking up)

`WikipediaLoader` searches Wikipedia for your query and converts the matching pages into LangChain `Document` objects. The page content is truncated by the loader's `doc_content_chars_max` parameter (4,000 characters by default), so you typically get only the first sections of an article. Each document also comes with metadata you can use for grounding and citations.

**Recommendation**: Filter down to a manageable slice (often **10k or fewer** documents), then expand only if retrieval quality needs it.

In the `metadata` dictionary of each document, you’ll see:

| Key     | Definition |
|---------|------------|
| title   | The Wikipedia page title (e.g., "Quantum Computing"). |
| summary | A short extract or condensed description from the page content. |
| source  | The URL of the original Wikipedia article. |
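
As a small illustration of how these keys might feed a citation, the sketch below turns one metadata dictionary into a numbered reference line. `format_citation` is a hypothetical helper (not part of LangChain), and the sample dictionary is made up to mirror the three keys in the table; with real data you would pass `doc.metadata`.

```python
def format_citation(metadata: dict, index: int) -> str:
    """Build a short, numbered citation line from a document's metadata."""
    # Fall back to placeholders so a missing key never raises.
    title = metadata.get("title", "(no title)")
    source = metadata.get("source", "(no source)")
    return f"[{index}] {title} <{source}>"


# Hypothetical metadata shaped like the table above (not fetched from Wikipedia).
sample = {
    "title": "Interstate 81",
    "summary": "Placeholder summary text.",
    "source": "https://en.wikipedia.org/wiki/Interstate_81",
}
print(format_citation(sample, 1))
```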


In [None]:
# Notebook Setup (Install Dependencies)

# If you're running this notebook locally from the repo root:
# %pip install -U -qqqq -r requirements.txt
# %pip install -U -qqqq -r requirements-dev.txt
#
# Optional (FAISS vector store):
# %pip install -U -qqqq -r requirements-faiss.txt


In [None]:
# Python Package Imports (Reference)

from langchain_community.document_loaders import WikipediaLoader

from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
import vertexai

# Vector stores
from langchain_chroma import Chroma

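# The FAISS wrapper imports the `faiss` package lazily, so importing the wrapper
# succeeds even when `faiss` is missing. Check for the package itself instead.
import importlib.util

_HAS_FAISS = importlib.util.find_spec("faiss") is not None
if _HAS_FAISS:
    from langchain_community.vectorstores import FAISS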


In [None]:
# Vertex AI Initialization

import os

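try:
    from dotenv import load_dotenv

    load_dotenv()  # optional: read settings from a local .env file
except ImportError:
    pass  # python-dotenv is not installed; rely on the existing environment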

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "your-project-id")
LOCATION = os.environ.get("GOOGLE_CLOUD_LOCATION", "us-central1")

vertexai.init(project=PROJECT_ID, location=LOCATION)

# **Advice**: Ensure `aiplatform.googleapis.com` is enabled and your account has Vertex AI permissions.


In [None]:
# Config (LLMs, Embeddings, Vector Store, Data Loader)

import os

# DataLoader Config
query_terms = [
    "Interstate 81",
    "Shenandoah Valley",
    "Roanoke",
    "Appalachian Mountains",
]  # fallback if WIKIPEDIA_QUERY is unset

wiki_query = os.environ.get("WIKIPEDIA_QUERY", "").strip() or " OR ".join(query_terms)
max_docs = int(os.environ.get("WIKIPEDIA_MAX_DOCS", "10"))

# Retriever Config
k = int(os.environ.get("WIKIPEDIA_TOP_K", "3"))
EMBEDDING_MODEL = os.environ.get("VERTEX_EMBEDDING_MODEL", "text-embedding-005")

# LLM Config
LLM_MODEL_NAME = os.environ.get("VERTEX_LLM_MODEL", "gemini-flash-latest")

# Vector store choice
VECTORSTORE = os.environ.get("VECTORSTORE", "chroma").lower()
CHROMA_PERSIST_DIR = os.environ.get("CHROMA_PERSIST_DIR", "./.chroma")

example_question = "What states does Interstate 81 run through?"


In [None]:
# Wikipedia Data Loader

docs = WikipediaLoader(query=wiki_query, load_max_docs=max_docs).load()

print(f"Loaded {len(docs)} docs for query: {wiki_query!r}")


In [None]:
# Retriever: Vector store + Vertex AI embeddings

embeddings = VertexAIEmbeddings(model_name=EMBEDDING_MODEL)

if VECTORSTORE == "faiss":
    if not _HAS_FAISS:
        raise RuntimeError(
            "FAISS requested but not available. Install it with: python -m pip install -r requirements-faiss.txt"
        )
    vector_store = FAISS.from_documents(docs, embeddings)
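else:
    # Note: re-running this cell against the same persist directory adds the
    # documents again. Delete CHROMA_PERSIST_DIR between runs to avoid duplicate hits.
    vector_store = Chroma.from_documents(
        docs,
        embeddings,
        persist_directory=CHROMA_PERSIST_DIR,
        collection_name="agentic-wikipedia",
    )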

results = vector_store.similarity_search(example_question, k=k)
for res in results:
    title = res.metadata.get("title", "(no title)")
    preview = res.page_content.replace("\n", " ")[:220]
    print(f"* {title}: {preview}…")


In [None]:
# LLM: Using Vertex AI Foundation Model

llm = ChatVertexAI(model_name=LLM_MODEL_NAME)
response = llm.invoke(example_question)
print(response.content)

# Note: this is a direct LLM call (not RAG). Use the retrieved docs above to build a grounded prompt.
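
One way to turn retrieved documents into a grounded prompt is sketched below. It is a minimal illustration, not the notebook's prescribed method: `RetrievedDoc` is a hypothetical stand-in that exposes only `page_content` and `metadata`, `build_grounded_prompt` and its `max_chars` parameter are assumptions, and the demo passage is placeholder text.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Hypothetical stand-in exposing the two attributes this sketch reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_grounded_prompt(question: str, docs: list, max_chars: int = 1000) -> str:
    """Number each retrieved passage and ask the model to cite by number."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        title = doc.metadata.get("title", "(no title)")
        source = doc.metadata.get("source", "(no source)")
        # Flatten newlines and cap the excerpt so the prompt stays small.
        excerpt = doc.page_content.replace("\n", " ")[:max_chars]
        blocks.append(f"[{i}] {title} ({source})\n{excerpt}")
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources like [1]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder document so the sketch runs offline.
demo_docs = [
    RetrievedDoc(
        page_content="Placeholder passage text about Interstate 81.",
        metadata={
            "title": "Interstate 81",
            "source": "https://en.wikipedia.org/wiki/Interstate_81",
        },
    )
]
prompt = build_grounded_prompt("What states does Interstate 81 run through?", demo_docs)
print(prompt)
```

In the notebook, you could pass `results` from the similarity search in place of `demo_docs` and send the prompt through `llm.invoke(prompt)`.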


## a) GenAI Application Development

**REQUIRED**: Add your custom logic here to create and run your agentic workflow. Feel free to add as many code cells as needed.




In [None]:
# TODO: Enter your Agentic workflow code here

## b) Reflection

**REQUIRED**: Provide a detailed reflection addressing these two questions:
1. If you had more time, which specific improvements or enhancements would you make to your agentic workflow, and why?
2. What concrete steps are required to move this workflow from prototype to production?

> Enter your reflection here


## Vertex AI Checklist (Pre-Run)

- Confirm your project is correct and billing is enabled.
- Enable `aiplatform.googleapis.com`.
- Ensure the notebook service account has `Vertex AI User` and `Service Account User` roles.
- Verify model availability in your region (Gemini + Embeddings).
- Check quotas for embedding and LLM calls.
- Keep `max_docs` small for the first run and scale gradually.
- Note expected costs for embeddings and model usage.

## Primers and Background to Explore (Internet Research)

- Vertex AI authentication patterns (local vs. managed notebooks).
- LangChain + Vertex AI integration guide.
- Wikipedia API usage limits and best practices.
- Vector search fundamentals: embeddings, cosine similarity, FAISS indexing types.
- Retrieval-Augmented Generation (RAG) design patterns.
- LangGraph basics and agentic workflow orchestration.
- Prompt engineering for grounding and citation safety.
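
For the vector search item in the list above, the sketch below shows cosine similarity and a brute-force top-k ranking in plain Python. The function names and the toy 3-dimensional vectors are assumptions for illustration; a real vector store uses much longer embeddings and, usually, an optimized index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, score) pairs for the k most similar vectors."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" invented for this example.
toy_vectors = {
    "highway": [1.0, 0.0, 0.0],
    "valley": [0.0, 1.0, 0.0],
    "mountain": [0.6, 0.8, 0.0],
}
print(top_k([1.0, 0.1, 0.0], toy_vectors, k=2))
```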


## Alternatives and Discussion Items

- **Vector stores**: Vertex AI Vector Search, ChromaDB, Weaviate, Pinecone, or Qdrant.
- **Embeddings**: OpenAI embeddings, Hugging Face sentence-transformers, or Vertex AI text embeddings.
- **LLMs**: Gemini variants, open-source LLMs (e.g., Llama), or hosted APIs.
- **Retrievers**: hybrid search (BM25 + embeddings), rerankers, or multi-vector approaches.
- **Agent frameworks**: LangGraph, LangChain Agents, Semantic Kernel, or custom orchestration.
- **Evaluation**: How you will measure retrieval accuracy and answer quality (golden sets, human review).
- **Safety and grounding**: How to handle hallucinations, citations, and source validation.
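
For the evaluation item above, one simple retrieval metric is the hit rate at k: the share of golden-set questions whose expected page shows up in the top k results. The sketch below is one possible way to compute it. `hit_rate_at_k`, the golden set, and the retrieved titles are all hypothetical; in practice the titles would come from `res.metadata["title"]` on real search results.

```python
def hit_rate_at_k(golden, retrieved, k=3):
    """Fraction of questions whose expected title is among the top-k retrieved titles."""
    if not golden:
        return 0.0
    hits = 0
    for question, expected_title in golden.items():
        top_titles = retrieved.get(question, [])[:k]
        if expected_title in top_titles:
            hits += 1
    return hits / len(golden)


# Hypothetical golden set: question -> title of the page that should be retrieved.
golden_set = {
    "What states does Interstate 81 run through?": "Interstate 81",
    "Where is the Shenandoah Valley?": "Shenandoah Valley",
}
# Made-up retrieval output: question -> ranked list of retrieved titles.
retrieved_titles = {
    "What states does Interstate 81 run through?": ["Interstate 81", "Roanoke"],
    "Where is the Shenandoah Valley?": ["Appalachian Mountains", "Roanoke"],
}
print(hit_rate_at_k(golden_set, retrieved_titles, k=3))
```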


## Incremental Build Plan (Suggested)

1. Validate Vertex AI authentication and model calls with a tiny prompt.
2. Load a very small set of Wikipedia articles (e.g., 5–10).
3. Build embeddings and run a simple similarity search.
4. Add an LLM response that cites retrieved documents.
5. Implement a basic agentic workflow (single tool + loop).
6. Add evaluation checks and reflection notes after each stage.
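
Step 5 (single tool + loop) can be sketched without any framework, as below. This is only an outline under stated assumptions: `run_agent`, the action dictionary shape, and both stub functions are hypothetical. In the notebook, `decide` would wrap an LLM call and `search_tool` would wrap the vector store's similarity search.

```python
def run_agent(question, decide, search_tool, max_steps=3):
    """Loop: ask `decide` for an action, run the tool if requested, stop on an answer."""
    observations = []
    for _ in range(max_steps):
        action = decide(question, observations)
        if action["type"] == "answer":
            return action["text"]
        # The only other action in this sketch is a search.
        observations.append(search_tool(action["query"]))
    return "Stopped: step limit reached without a final answer."


# Hypothetical stubs so the loop runs offline.
def fake_search(query):
    return f"(stub passage for {query!r})"


def fake_decide(question, observations):
    # Search once, then answer from whatever was collected.
    if not observations:
        return {"type": "search", "query": question}
    return {"type": "answer", "text": f"Answer based on {len(observations)} passage(s)."}


print(run_agent("What states does Interstate 81 run through?", fake_decide, fake_search))
```

The `max_steps` cap is a deliberate design choice here: it keeps a confused model from looping on tool calls indefinitely.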