## Project 1: A Hybrid Vector Database with Reranking and Filtering

This project walks through the full lifecycle of the retrieval component of a RAG pipeline. A hybrid approach combines the semantic matching of vector search with the exact-term precision of traditional keyword search, then uses reranking to order the merged candidates.


### Core Components:

* Vector Indexing: Build an in-memory or on-disk vector index using a library like FAISS or ScaNN. Your database should be able to add, delete, and update vectors.

* Metadata Storage: A relational database (like SQLite or PostgreSQL) or a simple JSON file store to hold the original text chunks and associated metadata (e.g., document ID, author, date, categories).

* Hybrid Search:

    * Vector Search: Use your vector index to find the top N most semantically similar documents to a user's query.

    * Keyword Search: Implement a basic keyword search using a library like Whoosh, or even a simple inverted index, to find documents with exact keyword matches.

    * Combining Results: Fuse the results from both searches. You can use a simple scoring system that weights both semantic similarity and keyword presence.

* Reranking: The top results from the hybrid search may not be in the best order. Implement a reranker model (a small, performant cross-encoder from Hugging Face) that takes the user query and each of the top-retrieved documents and re-scores them to improve the final ranking.

* Advanced Filtering: Allow users to filter their search results based on the metadata. For example, a user could search for "deep learning" but filter the results to only include documents from the "2024" year.
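
The "Combining Results" step can be sketched as a weighted sum of normalized scores. This is a minimal illustration rather than a prescribed method: the function names `min_max_normalize` and `weighted_fusion`, the `alpha` weight, and the sample scores are all assumptions made for the example. It also assumes both inputs are "higher is better" scores, so L2 distances from FAISS (where lower is better) would need to be converted first, for instance by negating them.

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, keyword_scores, alpha=0.5):
    """Blend two score dicts into one ranked list of (doc_id, score) pairs.

    alpha weights the vector side and (1 - alpha) the keyword side.
    A document missing from one result set contributes 0 for that side.
    """
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(keyword_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: cosine similarities on one side, keyword scores on the other
vector_scores = {"a": 0.9, "b": 0.7, "c": 0.5}
keyword_scores = {"b": 12.0, "d": 4.0}
ranking = weighted_fusion(vector_scores, keyword_scores, alpha=0.5)
print(ranking[:2])  # [('b', 0.75), ('a', 0.5)]
```

Normalizing first matters because the two scorers use unrelated scales. Without it, the larger-valued keyword scores would dominate regardless of `alpha`.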

## Project Description

The goal of this project is to build a complete, end-to-end retrieval system that goes beyond a simple vector database. The system will be a "hybrid" search engine built from the following techniques:

* Vector Search: Provides semantic understanding by finding documents that are conceptually similar to a query. Documents and queries are converted into numerical vectors (embeddings), and the system returns the documents whose vectors are geometrically closest to the query vector.

* Keyword Search: Provides precise matching of specific terms, which can be critical for proper names, acronyms, and direct references. It is a traditional search technique built on an inverted index, and it is fast at finding exact term matches.

* Reranking: Re-orders the top-K results from the hybrid search so that the most relevant documents appear at the very top. A secondary, more powerful model (typically a cross-encoder) re-evaluates each candidate against the query. This is computationally more expensive per document, but because it only operates on a small number of candidates, the cost stays manageable and the final ordering usually improves.

* Hybrid Search: Combines the results from the vector and keyword searches. The challenge here is how to score and merge the two result sets into a single ranked list.

* Metadata Filtering: Narrows down search results based on specific attributes of the documents, such as author, date, or category. This gives users direct control over which documents are eligible to be returned.
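
As a rough sketch of metadata filtering, the predicate below checks a chunk's metadata against a dictionary of filters. The function name `matches_filters`, the rule that a list, set, or tuple means "any of these values", and the sample records are assumptions made for illustration. Excluding a document that lacks a filtered key is also a design choice, not the only reasonable one.

```python
def matches_filters(metadata: dict, filters: dict) -> bool:
    """Return True when the metadata satisfies every filter.

    A filter value may be a single value (equality) or a list/set/tuple
    (membership). A document missing a filtered key is excluded.
    """
    for key, wanted in filters.items():
        if key not in metadata:
            return False
        actual = metadata[key]
        if isinstance(wanted, (list, set, tuple)):
            if actual not in wanted:
                return False
        elif actual != wanted:
            return False
    return True


# Hypothetical metadata records
docs = [
    {"doc_id": 0, "author": "Alice", "year": 2024},
    {"doc_id": 1, "author": "Bob", "year": 2023},
    {"doc_id": 2, "author": "Alice", "year": 2023},
    {"doc_id": 3, "author": "Charlie"},  # no year recorded
]

only_2023 = [d["doc_id"] for d in docs if matches_filters(d, {"year": 2023})]
alice_2023 = [
    d["doc_id"]
    for d in docs
    if matches_filters(d, {"year": 2023, "author": {"Alice", "Charlie"}})
]
print(only_2023)   # [1, 2]
print(alice_2023)  # [2]
```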


### Proposed System Architecture

The system will have two primary phases: an Ingestion Pipeline and a Query Pipeline.

1. Ingestion Pipeline: This is where you prepare your knowledge base.

    * Document Loader: A component that reads in raw documents (e.g., text files, PDFs).

    * Text Splitter: Breaks down large documents into smaller, manageable chunks. This is crucial for RAG, as you want to retrieve specific relevant snippets, not entire long documents.

    * Embedding Model: Converts each text chunk into a vector embedding.

    * Metadata Extractor: Parses and extracts metadata (e.g., title, year, author, source) for each chunk.

    * Storage: Stores the text chunks, their embeddings, and the associated metadata.

2. Query Pipeline: This is how the system responds to a user's query.

    * User Query: A user inputs a query and any desired filters.

    * Query Embedding: The user's query is converted into an embedding using the same model as the ingestion pipeline.

    * Parallel Search: The query is sent to two different search components simultaneously.

        * Vector Database: Performs a semantic search to find the top-K semantically similar documents.

        * Keyword Search Index: Performs a keyword search to find the top-K documents with matching terms.

    * Hybrid Combination: The results from both searches are combined and a score is calculated for each document. A simple method is to re-rank the union of the two result sets.

    * Metadata Filtering: The combined results are filtered to only include documents that match the user's specified metadata criteria.

    * Reranking Model: The final top-N documents from the hybrid search are passed to a reranker model, which re-sorts them for optimal relevance.

    * Final Results: The top documents are returned to the user.
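
For the Hybrid Combination step, one widely used alternative to blending raw scores is reciprocal rank fusion (RRF), which only needs each document's rank in each result list. The sketch below is an illustration of that swapped-in technique, not the method used in this project's code. The function name `reciprocal_rank_fusion` and the sample ID lists are assumptions, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc_id for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first
vector_hits = [3, 7, 1, 9]
keyword_hits = [7, 4, 3]
fused_ids = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused_ids)  # [7, 3, 4, 1, 9]
```

Because RRF ignores the raw scores, it avoids having to reconcile vector distances with keyword scores that live on different scales.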

## Suggested Technology Stack

* Programming Language: Python is the standard for this type of project due to its rich ecosystem of libraries.

* Embedding Model: Use a pre-trained model from Hugging Face via the sentence-transformers library.

    * Recommendation: BAAI/bge-small-en-v1.5, a small English model that performs well for its size.

* Vector Index: FAISS, developed by Meta AI (formerly Facebook AI Research), is a highly optimized library for similarity search.

* Keyword Index: Whoosh is a pure-Python library for indexing and searching text. It's a great choice for this project as it's simple to set up and use.

* Reranking Model: Use a cross-encoder model from Hugging Face, also via the sentence-transformers library.

    * Recommendation: cross-encoder/ms-marco-MiniLM-L-6-v2 is a good, lightweight choice.

* Data Storage: A simple JSON file or a lightweight database like TinyDB can be used to store the original text chunks and their metadata.
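
To make the keyword index less of a black box, here is a toy inverted index in plain Python. It is only a sketch of the underlying data structure and is not how Whoosh is implemented. A real library adds stemming, relevance scoring, and on-disk storage. The function names and the three sample sentences are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def keyword_search(index, query: str) -> list[int]:
    """Return IDs of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical chunks keyed by chunk ID
docs = {
    0: "Qubits are the fundamental units of quantum information.",
    1: "Deep learning is a subset of machine learning.",
    2: "Quantum computing uses superposition and entanglement.",
}
index = build_inverted_index(docs)
print(keyword_search(index, "quantum"))           # [0, 2]
print(keyword_search(index, "deep learning"))     # [1]
print(keyword_search(index, "quantum learning"))  # []
```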

## Project Workflow and Code Structure

Your project can be organized into a few key modules:

* ingestion.py:

    * A function to load a document.

    * A function to split text into chunks.

    * A function to generate embeddings using sentence-transformers.

    * A function to build the FAISS index and the Whoosh index, and save the metadata to a file.

* database.py:

    * A class that acts as the main interface to your hybrid database.

    * Methods like add_documents(), delete_documents(), and hybrid_search().

    * It will handle loading the FAISS index, the Whoosh index, and the metadata on startup.

* query.py:

    * A simple command-line interface or a main function that demonstrates the query process.

    * Takes user input for the query and filters.

    * Calls the hybrid_search() method from your database class and prints the final reranked results.
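
The chunk-splitting function listed under ingestion.py could look roughly like the following. It groups sentences into overlapping windows so that a retrieved chunk carries a little surrounding context. This is a sketch under stated assumptions: the function name `split_into_chunks`, its `max_sentences` and `overlap` parameters, and the regex-based sentence splitting are illustrative choices, and a production splitter would likely also cap chunks by token count.

```python
import re


def split_into_chunks(text: str, max_sentences: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of up to max_sentences, sharing `overlap`
    sentences between consecutive chunks."""
    if max_sentences <= overlap:
        raise ValueError("max_sentences must be greater than overlap")
    # Split after ., ! or ? followed by whitespace; punctuation stays attached
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


text = "One. Two. Three. Four. Five."
chunks = split_into_chunks(text, max_sentences=3, overlap=1)
print(chunks)  # ['One. Two. Three.', 'Three. Four. Five.']
```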

In [None]:
# Before running, you will need to install the required libraries:
# pip install sentence-transformers faiss-cpu whoosh
# We will use faiss-cpu for simplicity, but for GPU support, you would install faiss-gpu.

In [None]:
! pip install sentence-transformers faiss-cpu whoosh

In [None]:

import os
import json
import faiss
import numpy as np
from whoosh.index import create_in
from whoosh.fields import Schema, ID, TEXT, NUMERIC
from whoosh.filedb.filestore import FileStorage
from whoosh.analysis import StemmingAnalyzer
from whoosh.qparser import MultifieldParser
from sentence_transformers import SentenceTransformer, CrossEncoder

In [None]:

class HybridDatabase:
    """
    A class to manage the hybrid vector database, including loading indices,
    performing searches, and reranking results.
    """
    def __init__(self, index_dir="index"):
        """
        Initializes the database by loading the pre-built FAISS index, Whoosh index,
        and metadata. Also loads the embedding and reranking models.
        """
        print("Initializing HybridDatabase...")
        self.index_dir = index_dir
        self.embedding_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
        self.reranker_model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        self.faiss_index = None
        self.whoosh_index = None
        self.metadata = []

        # Load the indices and metadata from the specified directory
        self._load_indices()
        print("Database initialized and ready to query.")

    def _load_indices(self):
        """
        Loads the FAISS index, Whoosh index, and metadata from the index directory.
        """
        try:
            # Load FAISS index
            self.faiss_index = faiss.read_index(os.path.join(self.index_dir, "faiss.index"))

            # Load Whoosh index
            storage = FileStorage(os.path.join(self.index_dir, "whoosh"))
            self.whoosh_index = storage.open_index()

            # Load metadata
            with open(os.path.join(self.index_dir, "metadata.json"), 'r') as f:
                self.metadata = json.load(f)

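        except (FileNotFoundError, RuntimeError) as e:
            # faiss.read_index raises RuntimeError (not FileNotFoundError)
            # when the index file is missing. Raise instead of calling exit(),
            # which would kill a notebook kernel.
            raise RuntimeError(
                f"Could not load indices from '{self.index_dir}': {e}. "
                "Please run the ingestion pipeline first."
            ) from e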

    def hybrid_search(self, query: str, top_k: int = 10, alpha: float = 0.5, filters: dict = None):
        """
        Performs a hybrid search combining vector search and keyword search.

        Args:
            query (str): The user's search query.
            top_k (int): The number of top results to return.
            alpha (float): Reserved for weighting vector against keyword scores.
                Currently unused: candidates are unioned and ordered by the reranker.
            filters (dict): A dictionary of metadata filters (e.g., {"year": 2023}).

        Returns:
            list: A list of reranked and filtered document snippets.
        """
        # 1. Perform parallel searches
        print(f"\nSearching for: '{query}'...")

        # Vector search (semantic search); FAISS expects a 2-D float32 array
        query_embedding = self.embedding_model.encode(query, convert_to_numpy=True)
        query_embedding = np.expand_dims(query_embedding, axis=0).astype("float32")
        vector_distances, vector_ids = self.faiss_index.search(query_embedding, top_k * 2)

        # Keyword search
        keyword_results = []
        with self.whoosh_index.searcher() as searcher:
            query_parser = MultifieldParser(["content", "title"], schema=self.whoosh_index.schema)
            parsed_query = query_parser.parse(query)
            results = searcher.search(parsed_query, limit=top_k * 2)
            # doc_id is stored as a string in Whoosh; convert it back to int
            # so it can index self.metadata and deduplicate against FAISS IDs
            keyword_results = [int(result['doc_id']) for result in results]

        # 2. Combine and filter results
        # FAISS pads its output with -1 when fewer neighbors exist than requested
        vector_results = [int(i) for i in vector_ids[0] if i != -1]
        # Collect a union of unique document IDs from both search methods
        unique_doc_ids = set(vector_results).union(keyword_results)
        combined_candidates = []
        for doc_id in sorted(unique_doc_ids):
            doc_metadata = self.metadata[doc_id]

            # Apply metadata filters: every key must be present and equal
            if filters and any(doc_metadata.get(k) != v for k, v in filters.items()):
                continue

            combined_candidates.append(doc_metadata['content'])
        
        # If no results after filtering, return empty list
        if not combined_candidates:
            print("No results found after filtering.")
            return []

        # 3. Rerank the top candidates
        # Create a list of tuples for the reranker model: [(query, doc1), (query, doc2), ...]
        cross_encoder_input = [(query, doc) for doc in combined_candidates]
        reranker_scores = self.reranker_model.predict(cross_encoder_input)

        # 4. Sort the candidates based on reranker scores
        scored_results = sorted(zip(combined_candidates, reranker_scores), key=lambda x: x[1], reverse=True)

        # Return the top_k results
        return [result[0] for result in scored_results[:top_k]]


def ingest_documents(documents: list, index_dir="index"):
    """
    Ingests documents into the hybrid database. This function simulates
    the ingestion pipeline, creating and saving the indices and metadata.
    """
    print("Starting document ingestion pipeline...")
    # Create the index directories (the parent index_dir is created implicitly)
    os.makedirs(os.path.join(index_dir, "whoosh"), exist_ok=True)

    # Load the embedding model
    embedding_model = SentenceTransformer('BAAI/bge-small-en-v1.5')

    # Prepare document chunks and metadata
    metadata = []
    chunk_id = 0
    # For a real-world app, you would use a proper text splitter here.
    # We will use simple splitting for this example.
    for doc in documents:
        # Split documents into smaller chunks
        chunks = doc['content'].split('. ')
        for chunk in chunks:
            # Drop any trailing period so every chunk ends with exactly one
            chunk = chunk.strip().rstrip('.')
            if chunk:
                # Add a unique ID and other metadata
                chunk_metadata = {
                    "doc_id": chunk_id,
                    "title": doc.get('title', 'Unknown'),
                    "author": doc.get('author', 'Unknown'),
                    # None rather than 'Unknown': the Whoosh year field is NUMERIC
                    "year": doc.get('year'),
                    "content": chunk + '.'
                }
                metadata.append(chunk_metadata)
                chunk_id += 1

    # 1. Build FAISS index for vector search
    print("Building FAISS vector index...")
    text_chunks = [item['content'] for item in metadata]
    embeddings = embedding_model.encode(text_chunks, convert_to_numpy=True)
    dimension = embeddings.shape[1]
    faiss_index = faiss.IndexFlatL2(dimension)
    faiss_index.add(embeddings)
    faiss.write_index(faiss_index, os.path.join(index_dir, "faiss.index"))
    print("FAISS index built and saved.")

    # 2. Build Whoosh index for keyword search
    print("Building Whoosh keyword index...")
    schema = Schema(doc_id=ID(stored=True), title=TEXT(stored=True, analyzer=StemmingAnalyzer()),
                    author=TEXT(stored=True), year=NUMERIC(stored=True),
                    content=TEXT(stored=True, analyzer=StemmingAnalyzer()))
    # create_in expects a directory path, not a FileStorage object
    whoosh_ix = create_in(os.path.join(index_dir, "whoosh"), schema)
    writer = whoosh_ix.writer()
    for item in metadata:
        writer.add_document(doc_id=str(item['doc_id']), title=item['title'], author=item['author'],
                            year=item['year'], content=item['content'])
    writer.commit()
    print("Whoosh index built and saved.")

    # 3. Save metadata
    print("Saving metadata...")
    with open(os.path.join(index_dir, "metadata.json"), 'w') as f:
        json.dump(metadata, f, indent=4)
    print("Metadata saved.")
    print("Ingestion pipeline complete.")

if __name__ == "__main__":
    # Mock data to simulate new documents being added to the knowledge base
    mock_documents = [
        {
            "title": "The Rise of AI",
            "author": "Alice",
            "year": 2024,
            "content": (
                "Artificial intelligence has seen a massive surge in popularity. "
                "Large Language Models, or LLMs, are at the forefront of this revolution. "
                "These models use a transformer architecture to process and generate human-like text. "
                "The core technology of LLMs is based on deep learning principles."
            )
        },
        {
            "title": "Deep Learning Techniques",
            "author": "Bob",
            "year": 2023,
            "content": (
                "Deep learning is a subset of machine learning that utilizes neural networks. "
                "A key component is the backpropagation algorithm. "
                "Transfer learning is a technique that uses pre-trained models. "
                "Researchers are exploring new ways to optimize these complex models."
            )
        },
        {
            "title": "Vector Databases Explained",
            "author": "Alice",
            "year": 2024,
            "content": (
                "Vector databases are specialized databases for storing and searching vector embeddings. "
                "They are crucial for efficient semantic search in RAG applications. "
                "FAISS and Milvus are popular examples of vector database technologies. "
                "These databases use approximate nearest neighbor algorithms."
            )
        },
        {
            "title": "Quantum Computing",
            "author": "Charlie",
            "year": 2024,
            "content": (
                "Quantum computing promises to solve problems intractable for classical computers. "
                "It uses principles of quantum mechanics, such as superposition and entanglement. "
                "Qubits are the fundamental units of quantum information. "
                "The field is still in its early stages of development."
            )
        }
    ]

    # --- Step 1: Ingestion ---
    # This step would typically be run periodically to update the knowledge base.
    # It creates all the necessary indices and stores the metadata.
    ingest_documents(mock_documents)

    # --- Step 2: Querying ---
    # This simulates a user querying the already-ingested knowledge base.
    db = HybridDatabase()

    # Example 1: Pure semantic search query
    print("--- Example 1: Semantic Query ---")
    results = db.hybrid_search(query="How do LLMs work?", top_k=3)
    for i, res in enumerate(results):
        print(f"{i+1}. {res}")

    # Example 2: Keyword-heavy search query
    print("\n--- Example 2: Keyword Query ---")
    results = db.hybrid_search(query="superposition and entanglement", top_k=2)
    for i, res in enumerate(results):
        print(f"{i+1}. {res}")

    # Example 3: Hybrid query with filtering
    print("\n--- Example 3: Hybrid Query with Metadata Filter ---")
    # This query looks for information about AI, but only from documents published in 2023
    results = db.hybrid_search(query="what is transfer learning in AI?", top_k=2, filters={"year": 2023})
    for i, res in enumerate(results):
        print(f"{i+1}. {res}")
