# 🔎 RAG Information Retrieval Techniques

**Advanced Search and Filtering Methods for RAG Systems**

This notebook demonstrates four information retrieval techniques that can improve RAG system performance, using ChromaDB as the vector store.

## 🎯 What You'll Learn

1. **Semantic Search**: Vector-based semantic similarity search
2. **Metadata Filtering**: Using document metadata to refine search results
3. **Keyword Search**: Full-text search for exact matches
4. **Hybrid Search**: Combining multiple search strategies for better results

---

## 📊 Project Overview

**Goal:** Explore different retrieval techniques to improve search quality in RAG systems.

**Key Techniques:**
- 🔍 **Semantic Search**: Finding documents by meaning
- 🏷️ **Metadata Filters**: Filtering by categories, dates, etc.
- 🔤 **Keyword Search**: Exact word/phrase matching
- 🎯 **Hybrid Approach**: Combining multiple methods

---


## 1. Setup and Installation

In [None]:
!pip install -q chromadb

In [None]:
import chromadb

In [None]:
# --- Configuration ---
CHROMA_PATH = "./chroma_ir_demo"
client = chromadb.PersistentClient(path=CHROMA_PATH)

## 2. Dataset Creation and Ingestion

In [None]:
data = [
    {"id": "doc1", "text": "The quick brown fox jumps over the lazy dog. A true mammal.", "category": "animals", "year": 2023},
    {"id": "doc2", "text": "Artificial intelligence will transform healthcare in the coming decade. LLMs are key.", "category": "technology", "year": 2024},
    {"id": "doc3", "text": "Python is a versatile programming language for data science and machine learning. Code is king.", "category": "technology", "year": 2023},
    {"id": "doc4", "text": "A beautiful sunset over the Pacific Ocean is truly breathtaking. Nature's art.", "category": "nature", "year": 2022},
    {"id": "doc5", "text": "Retrieval-Augmented Generation (RAG) improves LLM factual consistency and is the future.", "category": "technology", "year": 2024},
    {"id": "doc6", "text": "Dogs and foxes are mammals, part of the animal kingdom. They are friendly.", "category": "animals", "year": 2023},
]

# Prepare data for ChromaDB
ids = [d['id'] for d in data]
documents = [d['text'] for d in data]
metadatas = [{'category': d['category'], 'year': d['year']} for d in data]
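Before ingestion, it can help to check the records for problems that would otherwise only surface as errors during `collection.add`. The sketch below is a hypothetical helper (`validate_records` is not part of ChromaDB). It assumes IDs must be unique, non-empty strings and that metadata values must be scalars (`str`, `int`, `float`, or `bool`); the `sample` records are made up for illustration.

```python
def validate_records(records):
    """Return a list of human-readable problems found in `records`."""
    problems = []
    seen_ids = set()
    for rec in records:
        rec_id = rec.get("id")
        if not isinstance(rec_id, str) or not rec_id:
            problems.append(f"invalid id: {rec_id!r}")
        elif rec_id in seen_ids:
            problems.append(f"duplicate id: {rec_id}")
        else:
            seen_ids.add(rec_id)
        if not rec.get("text"):
            problems.append(f"{rec_id}: empty text")
        # Every key other than id/text is treated as metadata
        for key, value in rec.items():
            if key in ("id", "text"):
                continue
            if not isinstance(value, (str, int, float, bool)):
                problems.append(f"{rec_id}: metadata {key!r} is not a scalar")
    return problems


# Made-up records: one duplicate ID and one list-valued metadata field
sample = [
    {"id": "doc1", "text": "The quick brown fox.", "category": "animals", "year": 2023},
    {"id": "doc1", "text": "Duplicate id.", "category": "animals", "year": 2023},
    {"id": "doc3", "text": "Bad metadata.", "category": ["a", "b"], "year": 2023},
]
for problem in validate_records(sample):
    print(problem)
```

An empty return value means the records pass these particular checks; it does not guarantee the store will accept them.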

## 3. Create and Populate Collection

In [None]:
COLLECTION_NAME = "rag_ir_techniques"

# Delete and recreate for a clean run
try:
    client.delete_collection(name=COLLECTION_NAME)
except Exception:
    pass  # Most likely the collection does not exist yet

collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
)

collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

print(f"ChromaDB Collection '{COLLECTION_NAME}' created and populated with {len(ids)} documents in '{CHROMA_PATH}'.")

## 4. Information Retrieval Demonstrations


In [None]:
# Helper function to display query results cleanly
def display_chroma_results(title, results):
    print("\n" + "="*50)
    print(title)
    print("* Lower score (distance) means higher similarity.")
    print("="*50)
    if results and results['ids'] and results['ids'][0]:
        distances = results.get('distances')
        for i, doc_id in enumerate(results['ids'][0]):
            # Format the distance only when it is present; a string fallback cannot take ':.4f'
            score = f"{distances[0][i]:.4f}" if distances else 'N/A'
            metadata = results['metadatas'][0][i]
            document = results['documents'][0][i]
            print(f"[{i+1}] Score: {score} | ID: {doc_id} | Category: {metadata['category']} | Year: {metadata['year']}")
            print(f"    Text: {document[:70]}...")
    else:
        print("    No results found.")
    print("-"*50)
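The helper indexes every field with `[0]` because query results are column-oriented: each field holds one inner list per query text. A minimal sketch of flattening that shape into one dict per hit follows. `flatten_results` and the `mock_results` values are hypothetical, and the shape is assumed to match what the helper above reads.

```python
def flatten_results(results, query_index=0):
    """Turn column-oriented query results into a list of per-hit dicts."""
    ids = results.get("ids") or [[]]
    if query_index >= len(ids):
        return []
    rows = []
    for i, doc_id in enumerate(ids[query_index]):
        row = {"id": doc_id}
        for field in ("documents", "metadatas", "distances"):
            column = results.get(field)
            # A field may be None when it was not requested in the query
            row[field[:-1]] = column[query_index][i] if column else None
        rows.append(row)
    return rows


# Made-up results for a single query text, with distances omitted
mock_results = {
    "ids": [["doc2", "doc5"]],
    "documents": [["AI text", "RAG text"]],
    "metadatas": [[{"year": 2024}, {"year": 2024}]],
    "distances": None,
}
rows = flatten_results(mock_results)
for row in rows:
    print(row)
```

Row-oriented dicts are often easier to pass on to a prompt builder or a re-ranker than parallel lists.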

### A. KEYWORD SEARCH (Full-Text Search)

In [None]:
# Matches documents based on the literal presence of a word or phrase.
# ChromaDB supports full-text search using `where_document`.
query_keyword_text = "mammal"
keyword_results = collection.query(
    query_texts=["irrelevant query for full-text search"], # Placeholder query: it only affects the ordering and scores of the matches
    n_results=5,
    where_document={"$contains": query_keyword_text} # Substring filter on document text; only matching documents are returned
)
display_chroma_results(f"A. KEYWORD SEARCH (Full-Text Search for: '{query_keyword_text}')", keyword_results)
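To see what the filter alone selects, independent of any vector ranking, the same check can be reproduced in plain Python. This sketch assumes `$contains` behaves as a case-sensitive substring test; `contains_filter` is a hypothetical stand-in, not a ChromaDB function, and `corpus` repeats the notebook's six documents.

```python
corpus = {
    "doc1": "The quick brown fox jumps over the lazy dog. A true mammal.",
    "doc2": "Artificial intelligence will transform healthcare in the coming decade. LLMs are key.",
    "doc3": "Python is a versatile programming language for data science and machine learning. Code is king.",
    "doc4": "A beautiful sunset over the Pacific Ocean is truly breathtaking. Nature's art.",
    "doc5": "Retrieval-Augmented Generation (RAG) improves LLM factual consistency and is the future.",
    "doc6": "Dogs and foxes are mammals, part of the animal kingdom. They are friendly.",
}


def contains_filter(docs, needle):
    """Return the IDs of documents whose text contains `needle` as a substring."""
    return [doc_id for doc_id, text in docs.items() if needle in text]


print(contains_filter(corpus, "mammal"))  # ['doc1', 'doc6']
print(contains_filter(corpus, "Mammal"))  # []
```

Because the match is on substrings, "mammal" also selects the document that only contains "mammals"; there is no stemming or tokenization involved.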

### B. SEMANTIC SEARCH (Vector Search)

In [None]:
# Finds documents based on meaning/context (vector similarity).
query_semantic = "AI and medical advancements and health"
semantic_results = collection.query(
    query_texts=[query_semantic],
    n_results=2
)
display_chroma_results("B. SEMANTIC SEARCH (AI, Medical, Health)", semantic_results)
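The score printed for each hit is a distance between the query embedding and a document embedding. The sketch below computes two common distances on made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and the metric a collection uses depends on its configuration, so the numbers are illustrative only.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings: "close" points almost the same way as the query
query = [1.0, 0.0, 0.0]
toy_docs = {"close": [0.9, 0.1, 0.0], "far": [0.0, 1.0, 0.0]}

# Rank by ascending distance, as a vector store does
ranked = sorted(toy_docs, key=lambda name: squared_l2(query, toy_docs[name]))
print(ranked)  # ['close', 'far']
```

Either metric puts "close" first here; they can disagree when vectors differ mainly in length rather than direction.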

### C. HYBRID SEARCH (Semantic + Keyword)

In [None]:
# Hybrid search is simulated here by combining semantic search with Chroma's `where_document` filter.
# The `where_document` filter acts as a lexical pre-filter: only documents that pass it are ranked by vector similarity.
query_hybrid_semantic = "pets and animals"
query_hybrid_keyword = "mammal" # Only documents containing this keyword are considered

hybrid_results = collection.query(
    query_texts=[query_hybrid_semantic], # Semantic Query: "pets and animals"
    n_results=3,
    where_document={"$contains": query_hybrid_keyword} # Keyword Filter: Must contain "mammal"
)
display_chroma_results(f"C. HYBRID SEARCH (Keyword: '{query_hybrid_keyword}' + Semantic: '{query_hybrid_semantic}')", hybrid_results)
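Filtering by keyword and then ranking by vector distance is one way to combine the two signals. Another common approach, not used in this notebook, is to run the searches separately and merge the ranked lists with reciprocal rank fusion (RRF). A minimal sketch follows, with hypothetical ranked lists and the conventional constant `k = 60`:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a semantic search and a keyword search
semantic_ranking = ["doc6", "doc4", "doc1"]
keyword_ranking = ["doc1", "doc6"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)  # ['doc6', 'doc1', 'doc4']
```

Unlike a hard keyword filter, RRF keeps documents that appear in only one list (here `doc4`), just at a lower rank.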

### D. METADATA FILTERING

In [None]:
# Restricts the semantic search space to documents where the category is 'technology' and the year is 2024
query_vector_tech = "latest RAG research in 2024"
metadata_filter = {
    "$and": [
        {"category": {"$eq": "technology"}},  # Condition 1
        {"year": {"$eq": 2024}}               # Condition 2
    ]
}

metadata_results = collection.query(
    query_texts=[query_vector_tech],
    n_results=3,
    where=metadata_filter # The metadata filter applied before/during vector search
)
display_chroma_results("D. METADATA FILTERING (Category: Tech, Year: 2024) - Query: latest RAG research", metadata_results)
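The filter is a small boolean expression evaluated against each document's metadata. To make its logic explicit, the sketch below evaluates the same `$and`/`$eq` structure in plain Python. `matches` is a hypothetical helper that handles only these two operators; it is not how ChromaDB evaluates filters internally. `catalog` repeats the notebook's metadata.

```python
def matches(metadata, where):
    """Evaluate a filter made of `$and` and `{field: {"$eq": value}}` clauses."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    # Otherwise expect exactly one {field: {"$eq": value}} condition
    field, condition = next(iter(where.items()))
    return metadata.get(field) == condition["$eq"]


catalog = {
    "doc1": {"category": "animals", "year": 2023},
    "doc2": {"category": "technology", "year": 2024},
    "doc3": {"category": "technology", "year": 2023},
    "doc4": {"category": "nature", "year": 2022},
    "doc5": {"category": "technology", "year": 2024},
    "doc6": {"category": "animals", "year": 2023},
}
where = {
    "$and": [
        {"category": {"$eq": "technology"}},
        {"year": {"$eq": 2024}},
    ]
}

selected = [doc_id for doc_id, meta in catalog.items() if matches(meta, where)]
print(selected)  # ['doc2', 'doc5']
```

Only two documents pass both conditions, so a query with `n_results=3` under this filter can return at most two hits.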

## 5. Cleanup

In [None]:
client.delete_collection(name=COLLECTION_NAME)
print(f"Collection '{COLLECTION_NAME}' dropped.")