# Project CogniVerse: The Interactive Lab

Welcome to the CogniVerse Interactive Lab! This notebook deconstructs the multimodal RAG pipeline used in our main `cogniverse_app.py` script.

Here, we go step by step through the entire process, from raw data to the final, synthesized answer. Running each cell shows you the exact output and data structures at every stage, which is a good way to build intuition for how this RAG architecture works.

**Our Goal:** To understand the **Multi-Vector Retriever** architecture in depth.

## 1. Setup and Configuration
First, we'll import all the necessary libraries and set up our configuration. We will use the same LLMs as our main application.
   

In [1]:
# --- Core LangChain and Utility Imports ---
import uuid
import base64
from pathlib import Path
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage
from langchain.storage import InMemoryStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_ollama import OllamaLLM

# --- Configuration ---
# Define which local Ollama LLMs to use for different tasks.
TEXT_SUMMARY_MODEL = "phi3:mini"
IMAGE_SUMMARY_MODEL = "llava"
FINAL_RESPONSE_MODEL = "llava"

print("✅ Setup Complete. Libraries imported and models configured.")

✅ Setup Complete. Libraries imported and models configured.


## 2. Our Sample Data
Instead of processing the entire 600-page PDF, we'll use a small, representative sample of data. This sample includes:
1.  A text chunk (a paragraph).
2.  A table (represented as HTML, which is what `unstructured` provides).
3.  An image (represented as a base64 string, as if we had loaded it from a file).

In [2]:
# --- Sample Text Chunk ---
sample_text = "Virtual clusters are built with VMs installed at distributed servers from one or more physical clusters. The VMs in a virtual cluster are interconnected logically by a virtual network. This allows for dynamic properties such as nodes being either physical or virtual machines, and the size of the cluster can grow or shrink dynamically. The failure of physical nodes may disable some VMs, but the failure of VMs will not pull down the host system."

# --- Sample Table (as HTML) ---
sample_table_html = """<table>
  <tr>
    <th>Cloud Model</th>
    <th>Ownership</th>
    <th>Best For</th>
  </tr>
  <tr>
    <td>Public Cloud</td>
    <td>Provider</td>
    <td>Standardization, Flexibility</td>
  </tr>
  <tr>
    <td>Private Cloud</td>
    <td>Client/Organization</td>
    <td>Customization, Security</td>
  </tr>
</table>"""

# --- Sample Image (as a placeholder base64 string) ---
# This is a real base64 string for a tiny 1x1 PNG image (a single pixel).
# In our real app, this would be a large string from a real diagram.
sample_image_b64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/wcAAwAB/epv2AAAAABJRU5ErkJggg=="

print("--- Sample Data Loaded ---")
print("Text:", sample_text[:50] + "...")
print("Table:", sample_table_html[:50].replace('\n', ' ') + "...")
print("Image:", sample_image_b64[:50] + "...")

--- Sample Data Loaded ---
Text: Virtual clusters are built with VMs installed at d...
Table: <table>   <tr>     <th>Cloud Model</th>     <th>Ow...
Image: iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0...
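
To make the "as if we had loaded it from a file" comment concrete, here is a minimal sketch of that round trip using only the standard library. The helper names `encode_image_file` and `png_dimensions` are ours for illustration and are not taken from `cogniverse_app.py`; the sketch also assumes the file is a PNG.

```python
import base64
import struct
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def encode_image_file(path):
    """Read an image file and return its bytes as a base64 string (hypothetical helper)."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def png_dimensions(image_b64):
    """Decode a base64 PNG and read (width, height) from its IHDR chunk."""
    raw = base64.b64decode(image_b64)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG image")
    # Bytes 16-23 hold width and height as big-endian unsigned 32-bit integers.
    return struct.unpack(">II", raw[16:24])


sample_image_b64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/wcAAwAB/epv2AAAAABJRU5ErkJggg=="

# Simulate "loading from a file": write the bytes to disk, then encode them again.
with tempfile.TemporaryDirectory() as tmp:
    image_path = Path(tmp) / "pixel.png"
    image_path.write_bytes(base64.b64decode(sample_image_b64))
    reloaded_b64 = encode_image_file(image_path)

width, height = png_dimensions(reloaded_b64)
print("Image size:", width, "x", height)
```

Checking the decoded header like this is a cheap way to confirm that a base64 string really is the image you expect before sending it to a multimodal model.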


## 3. The Summarization Step
Now, we'll perform the first key step of the Multi-Vector Retriever architecture: creating a short, concise summary for each piece of our raw data. These summaries will be used for the similarity search.

**Technical Note:** This step can be slow because it involves calling our local LLMs. We're doing it here interactively to see the output. In our main app, this is the slow, one-time process that runs when the vector store is first built.
  

In [5]:
# --- Initialize our LLMs ---
text_llm = OllamaLLM(model=TEXT_SUMMARY_MODEL, temperature=0)
image_llm = OllamaLLM(model=IMAGE_SUMMARY_MODEL, temperature=0)

# --- Define the summarization prompts ---
text_summary_prompt = ChatPromptTemplate.from_template("Provide a very concise, one-sentence summary of the following text from a computer science textbook: {element}")
table_summary_prompt = ChatPromptTemplate.from_template("Provide a very concise, one-sentence summary of the following table from a computer science textbook: {element}")

# --- Create the summarization chains ---
text_summarizer = text_summary_prompt | text_llm | StrOutputParser()
table_summarizer = table_summary_prompt | text_llm | StrOutputParser()

# --- Generate the summaries ---
print("Generating summaries (this may take a moment)...\n")

text_summary = text_summarizer.invoke({"element": sample_text})
table_summary = table_summarizer.invoke({"element": sample_table_html})

# For the image, we bind the base64 data to the multimodal LLM.
# OllamaLLM is a text-completion wrapper: a HumanMessage with an image_url part
# is converted to a plain text prompt, so the model never receives the image.
# Passing the data through bind(images=[...]) sends it as an actual image.
image_summary = image_llm.bind(images=[sample_image_b64]).invoke(
    "Summarize this image in one sentence for a search index:"
)

print("--- Generated Summaries ---")
print(f"[Text Summary]: {text_summary}")
print(f"[Table Summary]: {table_summary}")
print(f"[Image Summary]: {image_summary}")

Generating summaries (this may take a moment)...

--- Generated Summaries ---
[Text Summary]: A computer science textbook describes a flexible network where both hardware-based servers and software-simulated machines are interconnected through virtual networks in distributed clusters that can dynamically adjust size without affecting overall stability or pulling down systems upon individual failures.
[Table Summary]: A table comparing Public and Private Cloud models in terms of ownership and ideal use cases: public clouds are provider-owned for standardized flexibility while private clouds are client/organization owned for customization and security.
[Image Summary]:  The image shows a person standing on a beach, looking out at the ocean. 
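
Because summarization is the slow step, one common way to make it a one-time cost is to cache each summary under a hash of its source content. The sketch below is an assumption about how that could look, not the caching logic of `cogniverse_app.py`. `SummaryCache` and `fake_summarizer` are hypothetical names, and the fake summarizer stands in for a call such as `text_summarizer.invoke(...)`.

```python
import hashlib


class SummaryCache:
    """Remember summaries by content hash so each element is summarized once."""

    def __init__(self, summarize):
        self._summarize = summarize  # any callable: str -> str
        self._cache = {}
        self.calls = 0  # how many times the slow summarizer actually ran

    def get(self, content):
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._summarize(content)
        return self._cache[key]


def fake_summarizer(text):
    """Stand-in for an LLM call (hypothetical); just truncates the text."""
    return text[:20] + "..."


cache = SummaryCache(fake_summarizer)
paragraph = "Virtual clusters are built with VMs installed at distributed servers."
first = cache.get(paragraph)
second = cache.get(paragraph)  # served from the cache, no second "LLM" call
print(first, "| summarizer calls:", cache.calls)
```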


## 4. Building the Multi-Vector Retriever

This is the most complex and important part of the architecture. We will build the retriever, which consists of two main storage components:

1.  **The Vector Store (`ChromaDB`):** This will store the vector embeddings of our **summaries**.
2.  **The Document Store (`InMemoryStore`):** This will store our **original, full-sized data** (the long text, the HTML table, and the image's base64 string).
    
The retriever's job is to link these two stores together.
  

### The Core Concept: The Library and the Card Catalog Analogy
Imagine your goal is to find information in a massive library. You have two options:

**The Naive Way:** Go to the first shelf, pull out every book, read the whole thing, put it back, and repeat for thousands of books. This is slow and inefficient.

**The Smart Way (The Multi-Vector Retriever):** Go to the card catalog. Each card is a tiny summary of a book (title, author, a short description). You can quickly scan these small summaries (this is our fast vector search). When you find the perfect summary card, you don't read the card for your answer. You look at the call number on the card (e.g., 796.357). This number is our `doc_id`. It's a unique pointer that tells you exactly where the full, original book is on the shelf. You then use that ID to go to the shelf and get the actual book.

Our code implements this exact "smart" system. It separates the searchable "summaries" from the "original content."

**The Card Catalog (`vectorstore`):** This is our Chroma database. It only stores the small, searchable summaries.

**The Bookshelves (`docstore`):** This is our `InMemoryStore`. It stores the large, original data (full text, tables, images).

**The Linking Mechanism (`doc_id`):** The unique ID in each summary's metadata is the "call number" that links the card catalog to the bookshelf.
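
The analogy can be sketched in a few lines of plain Python. This is a toy stand-in, not the LangChain implementation: the bag-of-words `embed` function replaces a real embedding model, and the `catalog`, `shelves`, and `lookup` names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The card catalog: short summaries, each carrying the ID of its original.
catalog = [
    {"summary": "Virtual clusters connect virtual machines over a virtual network",
     "doc_id": "id-text"},
    {"summary": "Table comparing public and private cloud ownership",
     "doc_id": "id-table"},
]

# The bookshelves: the full originals, keyed by the same IDs.
shelves = {
    "id-text": "<the full paragraph about virtual clusters>",
    "id-table": "<table>...</table>",
}


def lookup(query):
    """Search the summaries, then follow the doc_id to the original."""
    query_vec = embed(query)
    best_card = max(catalog, key=lambda card: cosine(query_vec, embed(card["summary"])))
    return shelves[best_card["doc_id"]]


result = lookup("difference between public and private clouds")
print(result)
```

The key design point is that the text being searched (the summary) and the text being returned (the original) are different objects joined only by the ID.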

In [6]:
# --- Initialize the Retriever's Components ---
# We'll use an in-memory version of Chroma for this lab to keep it simple.
vectorstore = Chroma(
    collection_name="cogniverse_lab_summaries",
    embedding_function=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
)

# This is a simple in-memory dictionary to hold our original data.
docstore = InMemoryStore()
id_key = "doc_id"

# Create the main retriever object, connecting the two stores.
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

# --- Populate the Retriever --- 

# 1. Generate unique IDs for each of our original documents.
# The uuid.uuid4() function generates a random, universally unique identifier. Think of this as creating a unique library barcode for each of our original items.
text_id = str(uuid.uuid4())
table_id = str(uuid.uuid4())
image_id = str(uuid.uuid4())

# 2. Store the ORIGINAL documents in the docstore, linking them with their IDs.
retriever.docstore.mset([
    (text_id, Document(page_content=sample_text, metadata={"source_type": "text"})),
    (table_id, Document(page_content=sample_table_html, metadata={"source_type": "table"})),
    (image_id, Document(page_content=sample_image_b64, metadata={"source_type": "image"})),
])

# 3. Store the SUMMARY documents in the vectorstore. 
#    Crucially, each summary's metadata contains the ID of its original document.
retriever.vectorstore.add_documents([
    Document(page_content=text_summary, metadata={id_key: text_id}),
    Document(page_content=table_summary, metadata={id_key: table_id}),
    Document(page_content=image_summary, metadata={id_key: image_id}),
])

print("✅ Multi-Vector Retriever successfully built and populated!")



✅ Multi-Vector Retriever successfully built and populated!


The `.mset()` method (multi-set) takes a list of `(key, value)` pairs and stores them in our `InMemoryStore`.

**Technical Intricacy:** The docstore now behaves like a simple key-value dictionary. The key is our unique ID, and the value is the full, original `Document` object containing the raw content.

**Visualizing the docstore (Our Bookshelves)**, with the real UUID keys shortened to placeholder IDs:

| Key (ID) | Value (The Full, Original Content) |
| :--- | :--- |
| "doc-uuid-123" | Document(page_content="Virtual clusters are built with VMs...") |
| "doc-uuid-456" | Document(page_content="<table><tr><th>Cloud Model...</table>") |
| "doc-uuid-789" | Document(page_content="iVBORw0KGgoAAA...") |


**Visualizing the vectorstore (Our Card Catalog):**

| Content (The Summary) | Vector (The Embedding) | Metadata (The Link!) |
| :--- | :--- | :--- |
| "A text describing virtual clusters..." | [0.1, -0.5, 0.3, ...] | {'doc_id': 'doc-uuid-123'} |
| "A table comparing Public and Private..."| [-0.2, 0.8, 0.9, ...] | {'doc_id': 'doc-uuid-456'} |
| "A simple red pixel image..." | [0.7, 0.1, -0.4, ...] | {'doc_id': 'doc-uuid-789'} |

The `MultiVectorRetriever` is now fully built. It knows about both storage locations and the `doc_id` key that links them.
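
Because the link between the two stores is just a metadata value, nothing prevents a summary from pointing at an ID that was never stored. A small sanity check can catch that early. The sketch below uses plain dictionaries rather than the real stores, and `find_dangling_links` is a hypothetical helper name.

```python
def find_dangling_links(summary_metadatas, docstore_keys, id_key="doc_id"):
    """Return the IDs referenced by summaries that have no entry in the docstore."""
    known = set(docstore_keys)
    return [
        meta.get(id_key)
        for meta in summary_metadatas
        if meta.get(id_key) not in known
    ]


# Placeholder IDs, mirroring the tables above.
docstore_keys = ["doc-uuid-123", "doc-uuid-456", "doc-uuid-789"]
summary_metadatas = [
    {"doc_id": "doc-uuid-123"},
    {"doc_id": "doc-uuid-456"},
    {"doc_id": "doc-uuid-000"},  # deliberately broken link for the demo
]

dangling = find_dangling_links(summary_metadatas, docstore_keys)
print("Dangling IDs:", dangling)
```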

## 5. Retrieval in Action: Seeing the Magic
Now, let's test our retriever. We will ask a question that is clearly related to our sample table.

Watch the two-step process that runs automatically when you call the retriever (`retriever.invoke(query)`, or the older `retriever.get_relevant_documents(query)`):

1.  **Search the Summaries:** The retriever takes your query and performs a similarity search over the summaries in the `vectorstore`. This keeps the search fast, because only the short summaries are embedded and compared.
2.  **Fetch the Original:** For each matching summary, it reads the linked ID (the `doc_id`) from that summary's metadata and uses it as a key to retrieve the full, original document from the `docstore`.
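
The "fetch the original" step can be sketched in plain Python. This is an illustration under our own assumptions (IDs are deduplicated in ranked order and IDs missing from the store are skipped), not a copy of the library's code; `resolve_originals` is a hypothetical name and a plain dict stands in for the docstore.

```python
def resolve_originals(ranked_summary_metadatas, docstore, id_key="doc_id"):
    """Map ranked summary hits to their original documents.

    Keeps the ranking order, returns each original only once, and skips
    IDs that are absent from the docstore.
    """
    ordered_ids = []
    for meta in ranked_summary_metadatas:
        doc_id = meta.get(id_key)
        if doc_id is not None and doc_id not in ordered_ids:
            ordered_ids.append(doc_id)
    return [docstore[doc_id] for doc_id in ordered_ids if doc_id in docstore]


docstore = {
    "id-text": "full paragraph about virtual clusters",
    "id-table": "<table>...</table>",
}

# Metadata of the summaries returned by the similarity search, best match first.
ranked_hits = [
    {"doc_id": "id-table"},
    {"doc_id": "id-text"},
    {"doc_id": "id-table"},    # a second summary pointing at the same original
    {"doc_id": "id-missing"},  # a link with no stored original
]

originals = resolve_originals(ranked_hits, docstore)
print(originals)
```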

In [7]:
query = "What are the differences between public and private clouds?"

print(f"Searching for: '{query}'...\n")

# --- Step 1 (Internal): Similarity search on summaries ---
# This is what the retriever does behind the scenes. We'll simulate it here.
retrieved_summaries = retriever.vectorstore.similarity_search(query, k=1)
best_summary = retrieved_summaries[0]

print("--- Step 1: Best Matching Summary Found ---")
print("Content:", best_summary.page_content)
print("Linked Original Doc ID:", best_summary.metadata[id_key])

# --- Step 2 (Automatic): Retrieving the original documents ---
# This is the main call to the retriever. It handles everything automatically.
# (.invoke() replaces the deprecated .get_relevant_documents().)
retrieved_docs = retriever.invoke(query)

print("\n--- Step 2: Full Original Document Retrieved from Docstore ---")
print("Number of docs retrieved:", len(retrieved_docs))
print("Type of content:", retrieved_docs[0].metadata['source_type'])
print("Full Content:\n", retrieved_docs[0].page_content)

Searching for: 'What are the differences between public and private clouds?'...

--- Step 1: Best Matching Summary Found ---
Content: A table comparing Public and Private Cloud models in terms of ownership and ideal use cases: public clouds are provider-owned for standardized flexibility while private clouds are client/organization owned for customization and security.
Linked Original Doc ID: 86686935-f662-46ee-a5e8-9a3f53dbbd85

--- Step 2: Full Original Document Retrieved from Docstore ---
Number of docs retrieved: 3
Type of content: table
Full Content:
 <table>
  <tr>
    <th>Cloud Model</th>
    <th>Ownership</th>
    <th>Best For</th>
  </tr>
  <tr>
    <td>Public Cloud</td>
    <td>Provider</td>
    <td>Standardization, Flexibility</td>
  </tr>
  <tr>
    <td>Private Cloud</td>
    <td>Client/Organization</td>
    <td>Customization, Security</td>
  </tr>
</table>




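Success! The query about clouds matched the summary of our table, and the retriever used the linked ID to fetch the full, original HTML table from the docstore. Note that three documents came back, not one: the retriever's internal similarity search uses the vector store's default `k` (4 for Chroma), and we only indexed three summaries, so all of them are returned in ranked order with the table first. This summary-to-original lookup is the core principle of the Multi-Vector Retriever.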

## 6. The Final Prompt and Generation
The final step is to take the retrieved documents (which can be a mix of text, tables, and images) and format them into a single prompt for our multimodal LLM, `llava`.
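
Image parts in such a prompt are passed as data URIs, which need a MIME type. The cell below hard-codes `image/png`; if the documents may also contain other formats, the type can be sniffed from the decoded header instead. This is a minimal sketch under that assumption, and `data_uri` is a hypothetical helper that is not part of `cogniverse_app.py`.

```python
import base64


def data_uri(image_b64):
    """Build a data URI for a base64 image, detecting PNG or JPEG from its header."""
    header = base64.b64decode(image_b64)[:8]
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        mime = "image/png"
    elif header.startswith(b"\xff\xd8\xff"):
        mime = "image/jpeg"
    else:
        raise ValueError("unsupported or unrecognized image format")
    return f"data:{mime};base64,{image_b64}"


# Demo inputs: just enough header bytes to be recognizable, not full images.
png_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")
jpeg_b64 = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("ascii")

png_uri = data_uri(png_b64)
jpeg_uri = data_uri(jpeg_b64)
print(png_uri[:22], "|", jpeg_uri[:23])
```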

In [8]:
def format_for_final_prompt(docs):
    """Prepares the context for the multimodal LLM, separating text and images."""
    prompt_content = []
    prompt_content.append({"type": "text", "text": "You are an expert study buddy... (Full prompt text)"})

    for doc in docs:
        if doc.metadata.get('source_type') == 'image':
            prompt_content.append({"type": "image_url", "image_url": f"data:image/png;base64,{doc.page_content}"})
        else:
            prompt_content.append({"type": "text", "text": f"\n[Text/Table Context]:\n{doc.page_content}"})

    prompt_content.append({"type": "text", "text": "\n--- CONTEXT END ---\n"})
    return prompt_content

# --- Let's simulate the full process for a new query ---
final_query = "Explain virtual clusters and show me a diagram."

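# 1. Retrieve the relevant docs (ideally the text and the image; which two
#    come back depends on how similar each summary is to the query).
#    A `k` argument passed at call time is ignored by this retriever, so we
#    set the number of summaries to search for via `search_kwargs` instead.
retriever.search_kwargs = {"k": 2}
final_retrieved_docs = retriever.invoke(final_query)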

# 2. Format them for the LLM
final_prompt_content = format_for_final_prompt(final_retrieved_docs)
final_prompt_content.append({"type": "text", "text": f"\nQuestion: {final_query}"})

print("--- Final Prompt Sent to LLaVA ---")
import json
print(json.dumps(final_prompt_content, indent=2))

# 3. (Simulated) Call the LLM and get the answer
print("\n--- (Simulated) Final Answer from LLaVA ---")
print("""Based on the textbook, a **virtual cluster** is a collection of Virtual Machines (VMs) that are interconnected by a logical, virtual network. They are highly flexible because their size can grow or shrink dynamically as needed.\n\nThe provided diagram, which appears to be a simple placeholder, illustrates the concept that visual information can be included alongside text to explain complex topics.""")

--- Final Prompt Sent to LLaVA ---
[
  {
    "type": "text",
    "text": "You are an expert study buddy... (Full prompt text)"
  },
  {
    "type": "text",
    "text": "\n[Text/Table Context]:\nVirtual clusters are built with VMs installed at distributed servers from one or more physical clusters. The VMs in a virtual cluster are interconnected logically by a virtual network. This allows for dynamic properties such as nodes being either physical or virtual machines, and the size of the cluster can grow or shrink dynamically. The failure of physical nodes may disable some VMs, but the failure of VMs will not pull down the host system."
  },
  {
    "type": "text",
    "text": "\n[Text/Table Context]:\n<table>\n  <tr>\n    <th>Cloud Model</th>\n    <th>Ownership</th>\n    <th>Best For</th>\n  </tr>\n  <tr>\n    <td>Public Cloud</td>\n    <td>Provider</td>\n    <td>Standardization, Flexibility</td>\n  </tr>\n  <tr>\n    <td>Private Cloud</td>\n    <td>Client/Organization</td>\n    <td>Cus
... [output truncated]

### Conclusion
This interactive lab has demonstrated the complete, end-to-end workflow of an advanced, multimodal RAG system.

We have seen how to:

1.  Take raw, mixed-media content.
2.  Generate concise summaries for each piece of content.
3.  Build a `MultiVectorRetriever` that links these summaries to their original, full-sized documents.
4.  Perform a search that accurately retrieves a mix of text, tables, and images.
5.  Construct a final, rich prompt to be sent to a multimodal LLM.

This exact logic is what powers our full `cogniverse_app.py` script. By understanding these fundamental steps, you now have a deep intuition for how the entire application works.