# Agentic Wikipedia Proof-of-Concept (Vertex AI Edition)

This notebook is a **text-first template** that mirrors the `agentic_wikipedia_gcp_spec.md` and adds advice, primers, and alternatives for incremental development on Vertex AI. Use these notes to guide your implementation as you build and test each step.

## Data Source

The Wikipedia Loader ingests documents from the Wikipedia API and converts them into LangChain document objects. The page content contains the first sections of each Wikipedia article; the metadata fields are described below.

**Recommendation**: If you are using the LangChain document loader, we recommend limiting the load to 10k or fewer documents. Update the `query_terms` argument below to change the search terms sent to Wikipedia, and make sure they match the use case you defined.

In the metadata of the LangChain document object, we have the following information:

| Column  | Definition |
|---------|------------|
| title   | The Wikipedia page title (e.g., "Quantum Computing"). |
| summary | A short extract or condensed description from the page content. |
| source  | The URL link to the original Wikipedia article. |


In [33]:
# Notebook Setup (Install Dependencies)

# %pip install -U -qqqq langchain-google-vertexai google-cloud-aiplatform langgraph==0.5.3 uv chromadb sentence-transformers langchain-huggingface langchain-chroma wikipedia faiss-cpu


In [34]:
# Python Package Imports (Reference)

from langchain_community.document_loaders import WikipediaLoader
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

from langchain_google_vertexai import (
    ChatVertexAI,
    VertexAIEmbeddings,
)
import vertexai


# **Advice**: If you do not use `faiss` directly, you can omit that import; `FAISS` from LangChain will use it internally when available.


In [35]:
# Vertex AI Initialization
PROJECT_ID = "gen-lang-client-0041376756"
LOCATION = "global"
vertexai.init(project=PROJECT_ID, location=LOCATION)

# **Advice**: Ensure `aiplatform.googleapis.com` is enabled and your service account has Vertex AI permissions.
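If you would rather not hard-code the project ID in the notebook, one option is to read it from the environment. The sketch below is only an illustration: the helper name `resolve_vertex_config` and the choice of the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` variables are assumptions, not something the spec requires.

```python
import os
from typing import Tuple


def resolve_vertex_config(default_location: str = "global") -> Tuple[str, str]:
    """Return (project_id, location) read from environment variables.

    The variable names are assumptions made for this sketch; adjust them to
    whatever your environment actually sets.
    """
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        raise RuntimeError("Set GOOGLE_CLOUD_PROJECT before running the notebook.")
    location = os.environ.get("GOOGLE_CLOUD_LOCATION", default_location)
    return project_id, location


# Usage in the notebook (not executed here):
# PROJECT_ID, LOCATION = resolve_vertex_config()
# vertexai.init(project=PROJECT_ID, location=LOCATION)
```

Failing fast when the variable is missing tends to give a clearer error than a later authentication failure.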


In [36]:
# Config (LLMs, Embeddings, Vector Store, Data Loader)


# DataLoader Config
query_terms = ["sport", "football", "soccer", "basketball", "baseball", "track", "swimming", "gymnastics"]  # TODO: update to match your use case requirements
max_docs = 10 # TODO: recommend starting with a smaller number for testing purposes

# Retriever Config
k = 2 # number of documents to return
EMBEDDING_MODEL = "text-embedding-005" # Vertex AI Embedding model

# LLM Config
LLM_MODEL_NAME = "gemini-flash-latest" # Vertex AI Gemini model

example_question = "What is the most popular sport in the US?"

# **Advice**: Verify model availability in your region. Some model IDs are region-specific.


In [37]:
# Wikipedia Data Loader

# Loading from Wikipedia takes roughly 10 minutes per 1K articles
docs = WikipediaLoader(query=query_terms, load_max_docs=max_docs).load()

# **Advice**: Some versions of `WikipediaLoader` expect `query` to be a string rather than a list. If you hit errors, try `query = " OR ".join(query_terms)` or loop over terms and merge results.
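The "loop over terms and merge results" option from the advice above could look roughly like the sketch below. The helper `load_and_merge` and the offline stand-in `fake_load` are hypothetical names introduced for illustration; deduplicating on the `title` metadata field is an assumption about what counts as a duplicate.

```python
from types import SimpleNamespace
from typing import Callable, Iterable, List


def load_and_merge(terms: Iterable[str], load_fn: Callable[[str], list]) -> List:
    """Call load_fn once per term and merge the results, skipping repeated titles."""
    seen_titles = set()
    merged = []
    for term in terms:
        for doc in load_fn(term):
            title = doc.metadata.get("title")
            # Documents without a title are kept, since they cannot be compared.
            if title is not None and title in seen_titles:
                continue
            seen_titles.add(title)
            merged.append(doc)
    return merged


def fake_load(term: str) -> list:
    """Offline stand-in for the real loader, returning toy documents."""
    pages = {
        "football": ["Football", "Sport"],
        "soccer": ["Association football", "Sport"],
    }
    return [
        SimpleNamespace(page_content=f"About {t}", metadata={"title": t})
        for t in pages.get(term, [])
    ]


docs_demo = load_and_merge(["football", "soccer"], fake_load)
print([d.metadata["title"] for d in docs_demo])
```

In the notebook you would pass the real loader instead of the stand-in, for example `load_fn=lambda term: WikipediaLoader(query=term, load_max_docs=max_docs).load()`. Note that `max_docs` then applies per term, so the merged total can be larger.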


In [38]:
# FAISS Retriever: Using Vertex AI Embedding Model

# Define the embeddings and the FAISS vector store
embeddings = VertexAIEmbeddings(model_name=EMBEDDING_MODEL) # Use to generate embeddings
vector_store = FAISS.from_documents(docs, embeddings)

# Example of how to invoke the vector store
results = vector_store.similarity_search(
    "What is the most popular sport in the US?",
    k=k
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

# **Advice**: Start with a very small `max_docs` to validate the pipeline quickly before scaling up.


* Gymnastics is a group of sport that includes physical exercises requiring balance, strength, flexibility, agility, coordination, artistry and endurance. The movements involved in gymnastics contribute to the development of the arms, legs, shoulders, back, chest, and abdominal muscle groups. Gymnastics evolved from exercises used by the ancient Greeks that included skills for mounting and dismounting a horse.
The most common form of competitive gymnastics is artistic gymnastics; for women, the events include floor, vault, uneven bars, and balance beam; for men, besides floor and vault, it includes rings, pommel horse, parallel bars, and horizontal bar.
The governing body for competition in gymnastics throughout the world is World Gymnastics. Eight sports are governed by the FIG, including gymnastics for all, men's and women's artistic gymnastics, rhythmic gymnastics (women's branch only), trampolining (including double mini-trampoline), tumbling, acrobatic, aerobic, parkour and para-g

In [39]:
# LLM: Using Vertex AI Foundation Model

llm = ChatVertexAI(model_name=LLM_MODEL_NAME)
response = llm.invoke("What is the most popular sport in the US?")
print("\n", response.content)

# **Example retrieval output from the FAISS cell (will vary)**:
# * The USA Gymnastics National Championships is the annual artistic gymnastics national competition held in the United States for elite-level competition. It is currently organized by USA Gymnastics, the governing body for gymnastics in the United States. The national championships have been held since 1963.



 The most popular sport in the US is **American football** (often just called "football").

It consistently leads in terms of viewership, revenue, and overall fan interest. The **National Football League (NFL)** is the dominant professional sports league in the United States.

Here is a breakdown of the typical popularity ranking:

1. **American Football (NFL)**
2. **Baseball (MLB)**
3. **Basketball (NBA)**
4. **Ice Hockey (NHL)**
5. **Soccer (MLS)** (Growing rapidly in popularity)


## a) GenAI Application Development

**REQUIRED**: This section is where you input your custom logic to create and run your agentic workflow. Feel free to add as many code cells as needed.




In [40]:
# TODO: Enter your Agentic workflow code here

## b) Reflection

**REQUIRED**: Provide a detailed reflection addressing these two questions:
1. If you had more time, which specific improvements or enhancements would you make to your agentic workflow, and why?
2. What concrete steps are required to move this workflow from prototype to production?

> Enter your reflection here


## Vertex AI Checklist (Pre-Run)

- Confirm your project is correct and billing is enabled.
- Enable `aiplatform.googleapis.com`.
- Ensure the notebook service account has `Vertex AI User` and `Service Account User` roles.
- Verify model availability in your region (Gemini + Embeddings).
- Check quotas for embedding and LLM calls.
- Keep `max_docs` small for the first run and scale gradually.
- Note expected costs for embeddings and model usage.

## Primers and Background to Explore (Internet Research)

- Vertex AI authentication patterns (local vs. managed notebooks).
- LangChain + Vertex AI integration guide.
- Wikipedia API usage limits and best practices.
- Vector search fundamentals: embeddings, cosine similarity, FAISS indexing types.
- Retrieval-Augmented Generation (RAG) design patterns.
- LangGraph basics and agentic workflow orchestration.
- Prompt engineering for grounding and citation safety.
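As a starting point for the vector search primer, here is a brute-force top-k search by cosine similarity in plain Python. It is a conceptual sketch only: the function names and toy vectors are made up for illustration, and a real FAISS index may use a different metric (such as Euclidean distance) and a far more efficient search.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))
```

The query points mostly along the first axis, so the first document ranks highest and the diagonal one second.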


## Alternatives and Discussion Items

- **Vector stores**: Vertex AI Vector Search, ChromaDB, Weaviate, Pinecone, or Qdrant.
- **Embeddings**: OpenAI embeddings, Hugging Face sentence-transformers, or Vertex AI text embeddings.
- **LLMs**: Gemini variants, open-source LLMs (e.g., Llama), or hosted APIs.
- **Retrievers**: hybrid search (BM25 + embeddings), rerankers, or multi-vector approaches.
- **Agent frameworks**: LangGraph, LangChain Agents, Semantic Kernel, or custom orchestration.
- **Evaluation**: How you will measure retrieval accuracy and answer quality (golden sets, human review).
- **Safety and grounding**: How to handle hallucinations, citations, and source validation.
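For the hybrid search item, one common way to combine a keyword ranking (e.g., BM25) with an embedding ranking is reciprocal rank fusion (RRF). The sketch below assumes you already have two ranked lists of document IDs; the function name, the toy IDs, and the constant `c = 60` (a conventional default) are illustrative choices, not part of the spec.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], c: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (c + rank) in every list it appears in
    (ranks start at 1), and the scores are summed across lists.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["A", "B", "C"]       # toy keyword-search result
embedding_ranking = ["B", "C", "D"]  # toy vector-search result
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which is why RRF is a reasonable baseline before trying a learned reranker.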


## Incremental Build Plan (Suggested)

1. Validate Vertex AI authentication and model calls with a tiny prompt.
2. Load a very small set of Wikipedia articles (e.g., 5–10).
3. Build embeddings and run a simple similarity search.
4. Add an LLM response that cites retrieved documents.
5. Implement a basic agentic workflow (single tool + loop).
6. Add evaluation checks and reflection notes after each stage.
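Step 5 (single tool + loop) can be prototyped without any framework before moving to LangGraph. The sketch below is one possible shape, not a prescribed design: the `SEARCH:` / `ANSWER:` reply protocol, the function names, and the scripted stand-ins for the model and the retriever are all assumptions made so the example runs offline.

```python
from typing import Callable, List


def run_agent(question: str,
              llm_fn: Callable[[str], str],
              search_fn: Callable[[str], List[str]],
              max_steps: int = 3) -> str:
    """Loop until the model answers or the step budget is exhausted."""
    notes: List[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Question: {question}\n"
            "Notes so far:\n" + "\n".join(notes) + "\n"
            "Reply with 'SEARCH: <query>' or 'ANSWER: <text>'."
        )
        reply = llm_fn(prompt).strip()
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("SEARCH:"):
            # Run the single tool and feed its results into the next prompt.
            notes.extend(search_fn(reply[len("SEARCH:"):].strip()))
        else:
            # Unrecognized reply: return it rather than loop indefinitely.
            return reply
    return "No answer within the step budget."


def scripted_llm(prompt: str) -> str:
    """Toy model: searches first, answers once a retrieved note is in the prompt."""
    if "NFL" in prompt:
        return "ANSWER: American football, per the retrieved note."
    return "SEARCH: most popular sport in the US"


def fake_search(query: str) -> List[str]:
    """Toy retriever returning a single made-up note."""
    return ["Toy note: the NFL draws the largest audiences."]


answer = run_agent("What is the most popular sport in the US?", scripted_llm, fake_search)
print(answer)
```

In the notebook, the stand-ins could be swapped for the real objects, for example `llm_fn=lambda p: llm.invoke(p).content` and `search_fn=lambda q: [d.page_content for d in vector_store.similarity_search(q, k=k)]`. The `max_steps` budget is the main safeguard against a model that never stops searching.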