# Part 3: "The Librarian" (Advanced RAG System)

## Project 01 - Operation Ledger-Mind
**Course Module:** Weeks 01-03 (Prompt Engineering, Fine-Tuning, Advanced RAG)
**Scenario:** Financial Analysis of Uber Technologies (2024 Annual Report)

### Technical Requirements Checklist:
- [x] **Vector Database**: Weaviate (Cloud or Embedded)
- [x] **Hybrid Search**: Dense Vector + BM25 Keyword Search
- [x] **Refinement**: Explicit Reciprocal Rank Fusion (RRF)
- [x] **Citations**: Exact Page Number Mapping
- [x] **Reranking**: Cross-Encoder (ms-marco-MiniLM-L-6-v2)
- [x] **Inference**: `query_librarian(question)`

## 0. Setup & Dependency Installation

Standardizing dependencies for both Google Colab and Local environments.

In [1]:
import os
import sys
import subprocess

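def is_colab():
    # Colab preloads the google.colab package, so checking sys.modules is enough.
    # Unlike get_ipython(), this does not raise NameError in a plain Python process.
    return 'google.colab' in sys.modules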

if is_colab():
    print(" Detected Google Colab environment.")
    PROJECT_NAME = "ZuuCrew-AEE-Project01"
    REPO_URL = "https://github.com/Sulamaxx/ZuuCrew-AEE-Project01.git"
    
    if not os.path.exists(PROJECT_NAME):
        !git clone {REPO_URL}
    else:
        !git -C {PROJECT_NAME} pull
    
    os.chdir(PROJECT_NAME)
    
    if os.path.abspath("src") not in sys.path:
        sys.path.append(os.path.abspath("src"))
    
    print(" Installing dependencies...")
    !pip install "numpy>=1.26.4,<2.0" -q
    !pip install -r requirements.txt -q
    !pip install "weaviate-client>=4.5.4" -q
    
    print(" Installation complete.")
else:
    print(" Running in local environment.")

 Detected Google Colab environment.
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 8 (delta 4), reused 8 (delta 4), pack-reused 0 (from 0)
Unpacking objects: 100% (8/8), 5.25 KiB | 1.75 MiB/s, done.
From https://github.com/Sulamaxx/ZuuCrew-AEE-Project01
   fe570f4..bfcff9e  main       -> origin/main
Updating fe570f4..bfcff9e
Fast-forward
 Assessment_extracted.txt                    |   9 -
 notebooks/02_finetuning_intern.ipynb        |  84 ++++--
 notebooks/02_finetuning_intern_backup.ipynb | 414 ----------------------------
 notebooks/03_rag_librarian.ipynb            | 222 ++++++++++++---
 src/utils/data_processing.py                |  13 +-
 5 files changed, 267 insertions(+), 475 deletions(-)
 delete mode 100644 Assessment_extracted.txt
 delete mode 100644 notebooks/02_finetuning_intern_backup.ipynb
 Installin

## 1. Environment & Advanced Ingestion

Preserving **Page Numbers** during ingestion to support exact citations.

In [2]:
import torch
import yaml
from dotenv import load_dotenv
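from src.services.llm_services import load_config, get_llm
# NOTE: the recorded run of this cell failed on the next import, because `get_text_embeddings`
# is not exported by src/services/llm_services.py at the cloned revision.
# Check that module for the embedding helper's actual name before running.
from src.services.llm_services import get_text_embeddings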
from src.utils.data_processing import load_pdf_with_pages, chunk_text

# Load environment & config
load_dotenv(".env" if os.path.exists(".env") else "../.env")
config = load_config("src/config/config.yaml" if os.path.exists("src/config/config.yaml") else "../src/config/config.yaml")

# Ingestion with Metadata
pdf_path = config.get("pdf_path", "data/pdfs/2024-Annual-Report.pdf")
if not os.path.exists(pdf_path):
    # Fall back to the parent directory when running from notebooks/
    pdf_path = os.path.join("..", pdf_path)

print(f" Loading document with metadata: {pdf_path}...")
pages = load_pdf_with_pages(pdf_path)

processed_chunks = []
for pg in pages:
    pg_chunks = chunk_text(pg['content'], chunk_size=1500, chunk_overlap=200)
    for c in pg_chunks:
        processed_chunks.append({"content": c, "page": pg['page_number']})

print(f" Created {len(processed_chunks)} chunks across {len(pages)} pages.")

ImportError: cannot import name 'get_text_embeddings' from 'src.services.llm_services' (/content/ZuuCrew-AEE-Project01/src/services/llm_services.py)
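
The repository helpers `load_pdf_with_pages` and `chunk_text` are not shown in this notebook. To illustrate the ingestion pattern above (chunk each page on its own so every chunk inherits exactly one page number), here is a minimal sketch. `chunk_text_sketch` and `chunk_pages` are hypothetical names, and the fixed-size character window is an assumption; the real `chunk_text` may split text differently.

```python
def chunk_text_sketch(text, chunk_size=1500, chunk_overlap=200):
    """Hypothetical stand-in for chunk_text: fixed-size character windows with overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def chunk_pages(pages, chunk_size=1500, chunk_overlap=200):
    """Chunk page by page so no chunk spans two pages and each keeps its page number."""
    out = []
    for pg in pages:
        for c in chunk_text_sketch(pg["content"], chunk_size, chunk_overlap):
            out.append({"content": c, "page": pg["page_number"]})
    return out


# Toy input in the same shape the cell above expects from load_pdf_with_pages
demo_pages = [
    {"page_number": 1, "content": "a" * 25},
    {"page_number": 2, "content": "b" * 8},
]
demo_chunks = chunk_pages(demo_pages, chunk_size=10, chunk_overlap=2)
print([(c["page"], len(c["content"])) for c in demo_chunks])
```

Chunking per page trades some context at page boundaries for an unambiguous citation: a chunk can never straddle two pages.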

## 2. Weaviate Schema & Indexing (v4 API)

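Connecting to Weaviate with the v4 client. The collection needs a `content` text property for keyword (BM25) search and a `page` property for citations, alongside the stored vectors.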

In [None]:
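import weaviate
import weaviate.classes as wvc  # used for HybridFusion in Section 3
from weaviate.classes.init import Auth

w_url = os.getenv("WEAVIATE_URL") or config.get("weaviate_url")
w_key = os.getenv("WEAVIATE_API_KEY")
if not w_url:
    raise ValueError("Set WEAVIATE_URL in the environment or weaviate_url in config.yaml.")

is_local = "localhost" in w_url or "127.0.0.1" in w_url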

if is_colab() and is_local:
    print("\n⚠️  WARNING: You are in Google Colab but using a 'localhost' Weaviate URL.")
    print("   Colab cannot reach your local machine's localhost. ")
    print("   Please use a Weaviate Cloud (WCD) URL or a tunnel (like ngrok).\n")

print(f" Connecting to Weaviate at {w_url}...")
if is_local:
    client = weaviate.connect_to_local(
        host="localhost" if "localhost" in w_url else "127.0.0.1",
        headers={"X-Google-Vertex-Api-Key": os.getenv("GOOGLE_API_KEY", "")}
    )
else:
    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=w_url,
        auth_credentials=Auth.api_key(w_key) if w_key else None,
        headers={"X-Google-Vertex-Api-Key": os.getenv("GOOGLE_API_KEY", "")}
    )
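
The cell above only opens the connection; the collection creation and indexing step that defines `uber_report` is not shown. A minimal sketch of that step with the v4 client follows. The collection name `UberReport`, the helper names `to_rows` and `index_chunks`, and the use of client-supplied vectors are all assumptions, not the notebook's confirmed implementation.

```python
def to_rows(chunks, vectors):
    """Pair each chunk dict with its embedding vector; the two lists must align."""
    if len(chunks) != len(vectors):
        raise ValueError("chunks and vectors must have the same length")
    return [
        ({"content": c["content"], "page": int(c["page"])}, list(v))
        for c, v in zip(chunks, vectors)
    ]


def index_chunks(client, chunks, vectors, name="UberReport"):
    """Create the collection (hypothetical name) and batch-insert chunks with their vectors."""
    # Imported lazily so to_rows can be used without Weaviate installed
    import weaviate.classes as wvc

    # Assumption: rebuilding from scratch is acceptable. This deletes existing data.
    if client.collections.exists(name):
        client.collections.delete(name)

    collection = client.collections.create(
        name=name,
        # Vectors are supplied by the client, so no server-side vectorizer.
        # Argument name as in earlier v4 client releases; check the docs for your version.
        vectorizer_config=wvc.config.Configure.Vectorizer.none(),
        properties=[
            wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
            wvc.config.Property(name="page", data_type=wvc.config.DataType.INT),
        ],
    )

    with collection.batch.dynamic() as batch:
        for props, vec in to_rows(chunks, vectors):
            batch.add_object(properties=props, vector=vec)
    return collection
```

Assuming `embeddings_model` follows the LangChain embeddings interface (it exposes `embed_query` in Section 3, so `embed_documents` is a plausible counterpart), usage could look like `uber_report = index_chunks(client, processed_chunks, embeddings_model.embed_documents([c["content"] for c in processed_chunks]))`.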


## 3. Hybrid RAG Pipeline (Explicit RRF + Cross-Encoder)

Implementing the core 'Librarian' logic with Fusion and Reranking.

In [None]:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
llm = get_llm(config)

def query_librarian(question, top_k=15, final_n=4):
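    # 1. Hybrid search with explicit Reciprocal Rank Fusion.
    #    `embeddings_model` and `uber_report` (the Weaviate collection) must be defined in earlier cells.
    query_vector = embeddings_model.embed_query(question)
    
    response = uber_report.query.hybrid(
        query=question,
        vector=query_vector,
        alpha=0.5,  # equal weight for BM25 and vector results
        fusion_type=wvc.query.HybridFusion.RANKED,  # RANKED is rank-based fusion (RRF); RELATIVE_SCORE is score-based
        limit=top_k
    )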
    
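    candidates = [{"content": obj.properties["content"], "page": obj.properties["page"]} for obj in response.objects]
    if not candidates:
        # Nothing retrieved: skip reranking and generation rather than pass an empty batch
        return "I don't know: no passages were retrieved for this question."
    
    # 2. Cross-Encoder Reranking: score each (question, passage) pair jointly
    pairs = [[question, cand['content']] for cand in candidates]
    scores = reranker.predict(pairs)
    
    # Sort by descending reranker score and keep the top final_n
    sorted_indices = torch.argsort(torch.tensor(scores), descending=True)[:final_n].tolist()
    reranked = [candidates[i] for i in sorted_indices]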
    
    # 3. LLM Generation with CITATIONS
    context_blocks = []
    for i, r in enumerate(reranked):
        context_blocks.append(f"[DOC {i+1} | Page {r['page']}]: {r['content']}")
    
    context_str = "\n\n".join(context_blocks)
    system_msg = "You are 'The Librarian'. Answer questions precisely based on the context. You MUST cite the page numbers used (e.g., [Page 45]). If the context doesn't have the answer, say you don't know."
    
    prompt = f"Context Blocks:\n{context_str}\n\nQuestion: {question}\n\nAnswer:"
    ans = llm.invoke([("system", system_msg), ("user", prompt)])
    
    return ans.content if hasattr(ans, 'content') else ans
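
Weaviate performs the fusion server-side, so the ranking arithmetic is not visible in `query_librarian`. As a standalone illustration of Reciprocal Rank Fusion, the sketch below scores each document as the sum of `1 / (k + rank)` over every ranked list it appears in. The function name, the toy chunk ids, and `k=60` (a commonly used default) are assumptions; Weaviate's internal constants may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; returns (id, score) pairs, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); ids missing from a list contribute nothing
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a deterministic order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy rankings standing in for the BM25 and dense-vector result lists
bm25_ranking = ["c1", "c2", "c3"]
vector_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because only ranks enter the formula, RRF needs no normalization between BM25 scores and vector distances, which live on different scales.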

## 4. Verification

Demonstrating precise citations for numbers and entities.

In [None]:
test_queries = [
    "What was Uber's total revenue in 2024?",
    "What are the specific risk factors mentioned regarding autonomous vehicle competitors?",
    "How many monthly active platform consumers (MAPCs) did Uber have in Q4 2024?"
]

for q in test_queries:
    print(f"\n{'='*50}\nQUERY: {q}\n{'='*50}")
    print(f"RESPONSE: {query_librarian(q)}\n")
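
Printing the responses shows the citations but does not check them. A small sketch of an automated check follows: it extracts `[Page N]` markers from an answer and flags any page that was not among the retrieved chunks. The helper names and the sample answer are made up for illustration, and the check assumes the retrieved page numbers are available to the caller, which `query_librarian` as written does not return.

```python
import re


def cited_pages(answer):
    """Return the sorted, de-duplicated page numbers cited as [Page N] in the answer."""
    return sorted({int(n) for n in re.findall(r"\[Page\s+(\d+)\]", answer)})


def unsupported_citations(answer, retrieved_pages):
    """Return cited pages that do not appear among the retrieved chunks' pages."""
    allowed = set(retrieved_pages)
    return [p for p in cited_pages(answer) if p not in allowed]


# Made-up answer text, used only to exercise the helpers
sample_answer = "The figure appears in one table [Page 45] and is discussed again later [Page 47]."
print(cited_pages(sample_answer))
print(unsupported_citations(sample_answer, retrieved_pages=[45, 12]))
```

A non-empty result from `unsupported_citations` suggests the model cited a page it was never shown, which is worth inspecting by hand.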