# Vector Store Retriever (LangChain + Chroma)

Purpose: show how to create a Chroma vector store from documents using embeddings, convert it into a retriever, and run similarity searches.

Run cells top-to-bottom. The first cell installs dependencies if needed.

In [None]:
!pip install langchain chromadb openai tiktoken pypdf langchain_google_genai langchain-community wikipedia

## Prerequisites & API keys

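- Make provider API keys available to the notebook. This notebook reads `GEMINI_API_KEY` from Colab secrets via `userdata`; outside Colab, load the key from an environment variable instead.
- Install packages (first cell).
- Use small sample documents for quick runs.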

In [34]:
from google.colab import userdata
gemini_api_key = userdata.get('GEMINI_API_KEY')
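
The cell above only works inside Colab. As a minimal sketch, assuming you also want to run the notebook locally, a small helper can try Colab secrets first and fall back to an environment variable. The name `load_api_key` is hypothetical and not part of any library.

```python
import os


def load_api_key(name: str = "GEMINI_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the environment.
        return os.environ.get(name)


gemini_api_key = load_api_key("GEMINI_API_KEY")
```

The helper returns `None` when the key is not found anywhere, so check the result before creating the embeddings client.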

## Imports & Clients

This cell imports the Chroma vectorstore adapter, embeddings client, and `Document` model used to create records.

In [46]:
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_core.documents import Document

## Step 1 — Prepare source documents

Provide the documents you want to embed. Keep content short for demos; in production, split longer texts into chunks before embedding them.

In [47]:
# Step 1: Your source documents
documents = [
    Document(page_content="Football is the most popular sport in the world, played by over 250 million players across more than 200 countries."),
    Document(page_content="Lionel Messi is known for his incredible dribbling, vision, and goal-scoring ability, earning multiple Ballon d'Or awards."),
    Document(page_content="The FIFA World Cup is held every four years and is the most prestigious international football tournament."),
    Document(page_content="Tactics in football involve formations, pressing strategies, and player roles that determine how a team controls the game."),
]
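
The sample documents above are short enough to embed whole. For longer texts, one simple approach is fixed-size character windows with some overlap. The sketch below is plain Python for illustration only; `chunk_text` and its parameters are hypothetical names, and LangChain's own text splitters are usually the better choice in practice.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


long_text = "Football tactics evolve constantly. " * 20
chunks = chunk_text(long_text, chunk_size=200, overlap=50)
print(len(chunks), "chunks")
```

Each chunk could then be wrapped in a `Document(page_content=chunk)` before being passed to the vector store.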

## Step 2 — Initialize embeddings

Create an embeddings client. The example uses Google Gemini embeddings; replace with OpenAI or other providers as needed.

In [48]:
# Step 2: Initialize embedding model
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
    google_api_key=gemini_api_key
)


## Step 3 — Create Chroma vector store

Create the Chroma collection from the documents. Setting `persist_directory` stores the database on disk so it survives between runs.

In [52]:
# Step 3: Create a Chroma vector store persisted to disk in "my_chroma_db"
vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="my_chroma_db",
    collection_name="my_collection"
)
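
Because the store is persisted, re-running the cell above against the same directory can add the same documents a second time. One way to guard against that, sketched here under the assumption that you pass explicit IDs through the `ids` argument of `Chroma.from_documents`, is to derive each ID from the document text. The helper `stable_id` is a hypothetical name.

```python
import hashlib


def stable_id(text: str) -> str:
    """Derive a deterministic ID from the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


texts = [
    "The FIFA World Cup is held every four years.",
    "Tactics in football involve formations and pressing strategies.",
]
ids = [stable_id(t) for t in texts]
print(ids)

# Assumed usage with the notebook's objects (not executed here):
# Chroma.from_documents(documents=documents, embedding=embeddings, ids=ids,
#                       persist_directory="my_chroma_db",
#                       collection_name="my_collection")
```

The same text always maps to the same ID, so a repeated run would target the existing records rather than create new ones.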

## Step 4 — Convert vector store to retriever

Use `as_retriever` to get a retriever with search parameters; here `k` sets how many documents each query returns. This retriever can be used in RAG pipelines.

In [53]:
# Step 4: Convert vectorstore into a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
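
To make the role of `k` concrete, here is a toy sketch of top-k retrieval over hand-written 3-dimensional vectors. It uses cosine similarity purely for illustration; the metric Chroma actually applies depends on how the collection is configured, and real embeddings have far more dimensions. All names and vectors below are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for document embeddings.
stored = {
    "messi": [0.9, 0.1, 0.0],
    "popularity": [0.6, 0.4, 0.2],
    "world_cup": [0.2, 0.9, 0.1],
    "tactics": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

nearest = top_k(query_vec, stored, k=2)
for label, score in nearest:
    print(f"{label}: {score:.3f}")
```

Raising `k` returns more candidates at the cost of passing less relevant text to whatever consumes the results.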

## Query & view results

Run a query using the retriever and inspect the returned `Document` objects. You can also call `vectorstore.similarity_search(...)` directly.

In [54]:
query = "Who is Lionel Messi and what makes him a great football player?"

In [55]:
results = retriever.invoke(query)

In [56]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Lionel Messi is known for his incredible dribbling, vision, and goal-scoring ability, earning multiple Ballon d'Or awards.

--- Result 2 ---
Football is the most popular sport in the world, played by over 250 million players across more than 200 countries.


## Similarity search vs Retriever

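`retriever.invoke(query)` goes through the retriever interface, while `vectorstore.similarity_search(query, k=2)` calls the vector store API directly. Both return a list of `Document` objects; with the default `search_type="similarity"` and the same `k`, the two calls return the same results, as the outputs in this notebook show.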

In [57]:
results = vectorstore.similarity_search(query, k=2)

In [58]:
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content)


--- Result 1 ---
Lionel Messi is known for his incredible dribbling, vision, and goal-scoring ability, earning multiple Ballon d'Or awards.

--- Result 2 ---
Football is the most popular sport in the world, played by over 250 million players across more than 200 countries.
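
The relationship between the two calls compared in this section can be pictured with a toy model: the retriever stores its `search_kwargs` and forwards each query to the store's `similarity_search`. This is a simplified sketch, not LangChain's implementation; `ToyVectorStore` and `ToyRetriever` are hypothetical classes that score by shared words instead of embeddings.

```python
class ToyVectorStore:
    """Toy stand-in for a vector store; ranks texts by words shared with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())

        def overlap(text):
            return len(query_words & set(text.lower().split()))

        return sorted(self.texts, key=overlap, reverse=True)[:k]

    def as_retriever(self, search_kwargs=None):
        return ToyRetriever(self, search_kwargs or {})


class ToyRetriever:
    """Holds search parameters and delegates every query to the store."""

    def __init__(self, store, search_kwargs):
        self.store = store
        self.search_kwargs = search_kwargs

    def invoke(self, query):
        return self.store.similarity_search(query, **self.search_kwargs)


store = ToyVectorStore([
    "messi wins the ballon d'or",
    "football is popular worldwide",
    "the world cup happens every four years",
])
toy_retriever = store.as_retriever(search_kwargs={"k": 2})

via_retriever = toy_retriever.invoke("who is messi")
via_store = store.similarity_search("who is messi", k=2)
print(via_retriever == via_store)
```

The retriever form is convenient when a pipeline expects an object with a uniform `invoke` method; the direct call is handy for ad hoc inspection.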
