# Documentation for backend/document_loading.py

### Description:

#### Imports

- `os` : A standard library module for interacting with the operating system, primarily for file and directory operations.

- `FAISS` : A library for efficient similarity search and clustering of dense vectors.

- `HuggingFaceEmbeddings` : A class for generating embeddings using models from Hugging Face's model hub.

- `BM25Retriever` : An implementation of the BM25 retrieval algorithm, which ranks documents based on their relevance to a query.

- `EnsembleRetriever` : Combines multiple retrieval strategies for improved performance.

- `RecursiveCharacterTextSplitter` : A utility for splitting text into smaller chunks for processing.

- `PyPDFDirectoryLoader` : A loader for reading and processing PDF documents from a directory.

- `tqdm` : A library for displaying progress bars in loops and iterative processes.
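
To make the BM25 entry above more concrete, here is a minimal sketch of a textbook Okapi BM25 scorer. It is not the code used by `BM25Retriever`; library implementations differ in details such as tokenization and IDF smoothing. The function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` (illustrative only)."""
    tokenized = [doc.lower().split() for doc in corpus]  # naive whitespace tokens
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_counts:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_counts[term]
            # Term frequency saturates and is normalized by document length
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "faiss stores dense vectors",
    "bm25 ranks documents by term overlap",
    "pdf loaders read documents from a directory",
]
scores = bm25_scores("bm25 documents", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(corpus[best])
```

The second document wins because it contains both query terms, while the first contains neither and scores zero.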

#### Configuration Variables

- `tqdm.pandas()` : Registers tqdm with pandas so that `DataFrame` operations such as `progress_apply` display progress bars.

- `EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"` : A string naming the embedding model to use. This model is hosted on Hugging Face and is designed for general English text.

- `model_kwargs = {'trust_remote_code': True}` : A dictionary of keyword arguments passed to the embedding model. Setting `trust_remote_code` to `True` allows the custom Python code shipped in the model's repository to be downloaded and run, so it should only be enabled for repositories you trust.

#### Embedding Function

`EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs)`

- `EMBEDDING_FUNCTION`: An instance of the `HuggingFaceEmbeddings` class, initialized with the specified model name and the additional keyword arguments. Despite its name it is an object rather than a plain function; it is used to convert text into embeddings for downstream processing, such as similarity search and document retrieval.
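
As a rough illustration of what "similarity search" over embeddings means, the sketch below ranks a few texts by their distance to a query vector. The three-dimensional vectors are made up for the example (real embeddings from a model have far more dimensions), and Euclidean distance is used here as an assumed metric rather than a statement about how any particular vector store is configured.

```python
import math

# Hypothetical, hand-written "embeddings" used only for illustration
toy_embeddings = {
    "invoice total": [0.9, 0.1, 0.0],
    "payment amount": [0.8, 0.2, 0.1],
    "holiday schedule": [0.0, 0.1, 0.9],
}

# Pretend this vector is the embedding of the user's query
query_vector = [1.0, 0.0, 0.0]

# Smaller distance means the text is more similar to the query
ranked = sorted(
    toy_embeddings,
    key=lambda text: math.dist(query_vector, toy_embeddings[text]),
)
print(ranked)
```

Texts whose vectors lie close to the query vector come first, which is the behaviour a vector store provides at scale with an index instead of a full sort.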
  
#### Process Overview:

- Loading Documents: Use `PyPDFDirectoryLoader` to load PDF documents from a specified directory.

- Text Splitting: Use `RecursiveCharacterTextSplitter` to split documents into smaller, manageable chunks for processing.

- Generating Embeddings: Pass `EMBEDDING_FUNCTION` to the vector store, which uses it to convert text chunks into embeddings.

- Storing and Retrieving: Use `FAISS` for efficient storage of embeddings and to perform similarity searches. Combine the FAISS retriever with a `BM25Retriever` through an `EnsembleRetriever` to rank and retrieve relevant documents based on queries.

In [3]:
import os
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader
from tqdm import tqdm

# Register tqdm with pandas (adds progress_apply to DataFrame and Series)
tqdm.pandas()

EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"
model_kwargs = {'trust_remote_code': True}
EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs)
print("Modules imported and tqdm configured for pandas.")

Modules imported and tqdm configured for pandas.


### Description:

`load_documents_from_directory`

- This function is designed to load PDF documents from a specified directory, split the documents into smaller text chunks based on given parameters, and return those chunks for further processing.
  
#### Parameters

- `document_path (str)` : The path to the directory containing the PDF files. This is a required argument.

- `chunk_size (int, optional)`: The maximum size of each text chunk. Because the splitter is created with `from_tiktoken_encoder`, the size is measured in tokens rather than characters. The default value is 2048.

- `chunk_overlap (int, optional)`: The number of overlapping tokens between consecutive chunks. The default value is 200. Overlap helps preserve context across chunk boundaries, which can improve retrieval quality.

- Returns `List of document chunks`: A list containing the chunks obtained from splitting the loaded documents. Each chunk is a `Document` object whose `page_content` holds the text and whose `metadata` carries details such as the source file.

#### Process overview

- Loading Documents: The function prints a message indicating the loading process and the specified document path. It then calls `PyPDFDirectoryLoader(document_path).load_and_split()`, which reads the PDF documents found in the given directory and applies the loader's default text splitter as a first pass.

- Creating a Text Splitter: The function initializes a `RecursiveCharacterTextSplitter` using a Tiktoken encoder, so chunk lengths are counted in tokens. This text splitter is configured with the specified `chunk_size` and `chunk_overlap`.

- Splitting Documents: Finally, the function splits the loaded documents again with the token-based splitter and returns the resulting list of document chunks.

- Example call: `chunks = load_documents_from_directory('/path/to/pdf/directory', chunk_size=1024, chunk_overlap=100)`
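
The interaction between `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences, whereas the hypothetical helper below (`sliding_window_chunks`) cuts at fixed positions and uses whitespace-separated words as a stand-in for tokens.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` words sharing `chunk_overlap` words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()  # stand-in for real tokenizer output
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
result = sliding_window_chunks(sample, chunk_size=4, chunk_overlap=1)
for chunk in result:
    print(chunk)
```

Each chunk starts with the last word of the previous one, which is the overlap that keeps context from being lost at chunk boundaries.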

In [8]:
def load_documents_from_directory(
    document_path: str, 
    chunk_size: int = 2048, 
    chunk_overlap: int = 200
):
    """
    Load PDF documents from a directory and split them into chunks.

    Args:
        document_path (str): Path to the directory containing PDF files.
        chunk_size (int): Size of each text chunk (default: 2048).
        chunk_overlap (int): Overlap between chunks (default: 200).

    Returns:
        List of document chunks.
    """
    print(f"Loading documents from {document_path}...\n")
    # Load PDF documents from the specified directory
    documents = PyPDFDirectoryLoader(document_path).load_and_split()
    # Create a text splitter using tiktoken encoder
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    # Split the documents into chunks
    chunks = text_splitter.split_documents(documents)
    
    print("Documents loaded and split into chunks.")  # Output message indicating action taken
    return chunks

# Example function call
document_path = "path/to/your/pdf/directory"  # Specify the correct path to your PDF directory
chunks = load_documents_from_directory(document_path)

Loading documents from path/to/your/pdf/directory...

Documents loaded and split into chunks.


### Description:
- The `Document` class is defined to represent a document with:
    - `id` : A unique identifier for the document.
    - `page_content` : The main text content of the document.
    - `metadata` : An optional dictionary for any additional data related to the document. If not provided, it defaults to an empty dictionary.
- `EMBEDDING_MODEL_NAME` : Specifies the name of the embedding model to be used.
  
- `model_kwargs` : A dictionary containing additional parameters for the embedding model; in this case, it allows for trusting remote code.
  
- `EMBEDDING_FUNCTION`: Initializes the embedding function using the specified model name and parameters.
  
- `load_or_create_faiss_vector_store` : This function is responsible for either loading an existing FAISS vector store from disk or creating a new one if it does not exist. It utilizes the FAISS library for efficient similarity search and indexing of document embeddings.
  
#### Parameters

- `documents` : A list of documents to be indexed in the FAISS vector store. These should be the pre-processed text chunks; they do not need to be embedded beforehand, because `FAISS.from_documents()` embeds them with the specified embedding function.

- `collection_name (str)` : The name of the collection, used to name the location where the FAISS index is stored. This is a required argument that helps identify the specific collection of documents.

- `persist_directory (str)` : The directory where the `FAISS` index will be saved or loaded from. This should be a valid directory path on the filesystem.

  
Returns `FAISS vector store object`:

An instance of the FAISS vector store that can be used for similarity search and retrieval of documents based on their embeddings.

#### Function Logic

1. Determine Index Path:
The function builds the index path by joining `persist_directory` with a name in the format `<collection_name>_faiss_index`. `save_local()` treats this path as a folder and writes the index files inside it.

2. Check for Existing Index:
    - If the index path exists, the function loads the existing `FAISS` vector store and prints a message to indicate that the existing store is being loaded. The `documents` argument is not used in this branch.

    - The `FAISS.load_local()` method is called with `allow_dangerous_deserialization` set to True. The flag is needed because part of the store is saved with pickle, and unpickling can execute arbitrary code, so only load index files that you created or otherwise trust.

3. Create New Index:
    - If the index path does not exist, a new `FAISS` vector store is created from the provided documents. A message is printed indicating that a new store is being created.
    - The `FAISS.from_documents()` method embeds the documents with `EMBEDDING_FUNCTION` and builds the index. The newly created index is then saved to disk using `faiss_store.save_local(index_path)`.

4. Return the Vector Store: Finally, the function returns the FAISS vector store object, whether it was loaded from disk or newly created.

Example call: `faiss_store = load_or_create_faiss_vector_store(documents, 'my_collection', '/path/to/persist/directory')`

- Pass the document chunks as they are; the function embeds them with `EMBEDDING_FUNCTION`, which must be defined in the same context.
- The `persist_directory` should be accessible and writable; otherwise, the function may fail to create or save the FAISS index.
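
The load-or-create logic described above is a general caching pattern, sketched here without FAISS so it can run anywhere. The helper `load_or_create`, the JSON file format, and the `build_index` callback are hypothetical stand-ins; the real function persists a FAISS store rather than a JSON file.

```python
import json
import os
import tempfile


def load_or_create(path, build):
    """Return the cached value at `path`, building and saving it if missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle), "loaded"
    value = build()  # only runs when nothing is cached yet
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(value, handle)
    return value, "created"


build_calls = []


def build_index():
    """Pretend to do the expensive embedding and indexing work."""
    build_calls.append(1)
    return {"doc-1": [0.1, 0.2], "doc-2": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as persist_directory:
    index_path = os.path.join(persist_directory, "example_collection_faiss_index", "index.json")
    first, first_status = load_or_create(index_path, build_index)
    second, second_status = load_or_create(index_path, build_index)

print(first_status, second_status, len(build_calls))
```

One consequence of this pattern is worth keeping in mind: once an index exists on disk, later calls return it as-is, so changes to the input documents have no effect until the stored index is deleted or a new collection name is used.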

In [13]:
import os
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

# Define a Document class with id, page_content, and metadata
class Document:
    def __init__(self, doc_id, page_content, metadata=None):
        self.id = doc_id  # Add an id attribute
        self.page_content = page_content
        self.metadata = metadata or {}

# Assume EMBEDDING_FUNCTION is defined somewhere in your code.
# Replace with your actual embedding function as needed.
EMBEDDING_MODEL_NAME = "Alibaba-NLP/gte-large-en-v1.5"
model_kwargs = {'trust_remote_code': True}
EMBEDDING_FUNCTION = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME, model_kwargs=model_kwargs)

def load_or_create_faiss_vector_store(
    documents, 
    collection_name, 
    persist_directory
):
    """
    Load an existing FAISS vector store or create a new one if it doesn't exist.
    
    Args:
        documents: List of documents to be indexed.
        collection_name (str): Name of the collection.
        persist_directory (str): Directory to save/load the FAISS index.
        
    Returns:
        FAISS vector store object.
    """
    index_path = os.path.join(persist_directory, f'{collection_name}_faiss_index')
    
    if os.path.exists(index_path):
        # Load existing FAISS index
        print(f"Loading existing FAISS vector store from {index_path}...\n")
        faiss_store = FAISS.load_local(index_path, embeddings=EMBEDDING_FUNCTION, allow_dangerous_deserialization=True)
        print("FAISS vector store loaded successfully.")
    else:
        # Create new FAISS index
        print(f"Creating new FAISS vector store in {index_path}...\n")
        faiss_store = FAISS.from_documents(documents, embedding=EMBEDDING_FUNCTION)
        faiss_store.save_local(index_path)
        print(f"New FAISS vector store created and saved to {index_path}.")
        print(f"Number of documents indexed: {len(documents)}.")
        
    return faiss_store

# Example usage
# Replace these with your actual document contents
documents = [
    Document(doc_id=1, page_content="Document 1 content", metadata={"source": "source1"}),
    Document(doc_id=2, page_content="Document 2 content", metadata={"source": "source2"})
]

collection_name = "example_collection"
persist_directory = "path/to/persist/directory"  # Replace with your actual path

# Call the function and print the output
faiss_store = load_or_create_faiss_vector_store(documents, collection_name, persist_directory)

Loading existing FAISS vector store from path/to/persist/directory/example_collection_faiss_index...

FAISS vector store loaded successfully.


### Description:

The example cell for this section defines small stand-in classes so that it can run on its own. Its `BM25Retriever` and `EnsembleRetriever` shadow the LangChain classes of the same name and only mimic their interfaces.

1. Document Class: Represents a document with an ID, content, and optional metadata.

- Attributes:
    - `id`: Unique identifier for the document.
    - `page_content`: Textual content of the document.
    - `metadata`: Optional dictionary for additional document information.
- Example: `doc = Document(doc_id, page_content, metadata)`

2. BM25Retriever Class (stand-in): Mimics a retriever based on the BM25 ranking algorithm.

- `from_documents(documents, search_kwargs)`: Creates a BM25 retriever from a list of documents.
- Returns: In this stand-in, a message string indicating the number of documents and the value of k.
- Example: `bm25 = BM25Retriever.from_documents(documents, {'k': 5})`

3. FakeFAISS Class: Simulates a FAISS vector store for demonstration.

- `as_retriever(search_kwargs)`: Returns a vector retriever with specified search parameters.
- Returns: In this stand-in, a message string indicating the value of k.
- Example: `vector_retriever = FakeFAISS.as_retriever({'k': 5})`

4. EnsembleRetriever Class (stand-in): Combines multiple retrievers into a single ensemble for document retrieval.

- Attributes:
    - `retrievers`: List of retriever instances.
    - `weights`: List of weights for each retriever.
- Example: `ensemble = EnsembleRetriever([bm25, vector_retriever], [0.6, 0.4])`

- `get_hybrid_retriever` : This function creates a hybrid retriever that combines the BM25 retrieval method with a vector search using a FAISS vector store. The hybrid retriever aims to improve retrieval quality by combining keyword matching (BM25) with semantic similarity (vector search).

#### Parameters

- `documents`: A list of documents that will be used by the BM25 retriever. These documents should be pre-processed and in a suitable format for retrieval.

- `vector_store`: An instance of a FAISS vector store that will be used for vector-based retrieval. This store should already contain embeddings of the documents.

- `k (int)`: The number of documents each underlying retriever returns. The function passes the same `k` to both the BM25 retriever and the vector retriever, so the fused result list may contain more than `k` documents.

Returns

- `EnsembleRetriever object`: An instance of the `EnsembleRetriever` that combines the BM25 and vector retrievers. This object can be used to perform searches that leverage both retrieval techniques.

#### Function Logic

1. Create BM25 Retriever: The function initializes a BM25 retriever using the provided documents. This is done using the `BM25Retriever.from_documents()` method, with the `search_kwargs` parameter set to retrieve `k` documents. This call signature matches the stand-in class in the example cell; the `langchain_community` `BM25Retriever` is configured through its `k` attribute, so check that `k` actually takes effect when using the real class.

2. Create Vector Retriever: A vector retriever is created from the provided FAISS vector store by calling `vector_store.as_retriever()`, also specifying `search_kwargs` to retrieve `k` documents.

3. Combine Retrievers: An `EnsembleRetriever` is instantiated to combine the two retrievers (BM25 and vector) with specified weights. In this case, BM25 is weighted at 0.6 and the vector search at 0.4, so keyword matches count somewhat more than semantic matches.

4. Return the Hybrid Retriever: The function returns the combined `EnsembleRetriever` object, which can now be used to perform searches using the hybrid approach.

Example call: `hybrid_retriever = get_hybrid_retriever(documents, faiss_store, k=5)`

- The weights assigned to the retrievers in the ensemble can be adjusted based on the specific use case and the performance of each retrieval method.

- Ensure that the FAISS vector store contains the necessary embeddings for the documents prior to using this function.
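
To show how two weighted rankings can be merged into one, here is a minimal sketch of weighted Reciprocal Rank Fusion, the technique LangChain's `EnsembleRetriever` is documented as using. The helper name `weighted_rrf`, the constant `c=60`, and the document IDs are assumptions for illustration; the real class works on `Document` objects and handles deduplication itself.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more the higher it ranks and the heavier the retriever
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the two retrievers, best match first
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents found by both retrievers (`doc_a`, `doc_c`) rise to the top, and the 0.6 weight lets the BM25 order break the tie between them. Note that the fused list has four entries even though each input list has three.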

In [17]:
# Sample Document class for demonstration purposes
class Document:
    def __init__(self, doc_id, page_content, metadata=None):
        self.id = doc_id
        self.page_content = page_content
        self.metadata = metadata or {}

# Sample BM25Retriever class for demonstration purposes
class BM25Retriever:
    @staticmethod
    def from_documents(documents, search_kwargs):
        return f"BM25 Retriever created with {len(documents)} documents and k={search_kwargs['k']}"

# Sample FAISS Vector Store class for demonstration purposes
class FakeFAISS:
    @staticmethod
    def as_retriever(search_kwargs):
        return f"Vector retriever created with k={search_kwargs['k']}"

# Sample EnsembleRetriever class for demonstration purposes
class EnsembleRetriever:
    def __init__(self, retrievers, weights):
        self.retrievers = retrievers
        self.weights = weights

    def __repr__(self):
        return f"EnsembleRetriever with {len(self.retrievers)} retrievers."

# Define the hybrid retriever function
def get_hybrid_retriever(documents, vector_store, k):
    """
    Create a hybrid retriever combining BM25 and vector search.
    Args:
        documents: List of documents for BM25 retriever.
        vector_store: FAISS vector store for vector retriever.
        k (int): Number of documents to retrieve.
    Returns:
        EnsembleRetriever object combining BM25 and vector search.
    """
    # Create BM25 retriever
    bm25_retriever = BM25Retriever.from_documents(documents, search_kwargs={'k': k})
    # Create vector retriever
    vector_retriever = vector_store.as_retriever(search_kwargs={'k': k})
    # Combine retrievers with specified weights
    fusion_retriever = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.6, 0.4]
    )
    return fusion_retriever

# Sample documents for testing
documents = [
    Document(doc_id=1, page_content="Document content 1"),
    Document(doc_id=2, page_content="Document content 2"),
]

# Create a fake vector store
vector_store = FakeFAISS()

# Set the number of documents to retrieve
k = 5

# Call the hybrid retriever function and print the output
fusion_retriever = get_hybrid_retriever(documents, vector_store, k)
print(f"Hybrid retriever created: {fusion_retriever}")

Hybrid retriever created: EnsembleRetriever with 2 retrievers.
