Source: https://python.langchain.com/docs/tutorials/retrievers/


If you run into an `import` problem in VS Code, make sure you have selected the right Python interpreter (`> Python: Select Interpreter`) and notebook kernel (`Notebook: Select Notebook Kernel`).

In [1]:
# type: ignore
import os
from pathlib import Path

import dotenv
from httpx import ConnectError
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo.mongo_client import MongoClient
from pymongo.operations import SearchIndexModel

Set the project path

In [2]:
current_dir = Path(os.getcwd())

# # method 1: based on the root dir name
# root_dir_name = 'RAG'
# for p in current_dir.parents:
#     if p.name.lower() == root_dir_name.lower():
#         root_dir = p
#         break
# else:
#     raise Exception(f"Root dir \"{root_dir_name}\" Not found")

# method 2: walk up the parents until one contains a ".git" or ".project-root" entry
for p in current_dir.parents:
    # check the candidate `p` itself, not only the immediate parent
    if (p / ".git").exists() or (p / ".project-root").exists():
        root_dir = p
        print(root_dir)
        break
else:
    raise Exception("No parent directory containing .git or .project-root was found")

/Users/firoozas/Documents/AI-from-scratch/RAG


In [3]:
# load variables into env
f = root_dir / ".secrets" / ".env"
assert f.exists(), f"File not found: {f}"
dotenv.load_dotenv(f)

True

## <span style='color:Orange;'>LangChain Document Object</span>

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

- `page_content`: a string representing the content;
- `metadata`: a dict containing arbitrary metadata;
- `id`: (optional) a string identifier for the document.

The metadata attribute can capture **information about the source** of the document, its **relationship to other documents,** and other information. 

> _Note that an individual Document object often represents a chunk of a larger document._

In [4]:
documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.", metadata={"source": "mammal-pets-doc"}
    ),
]

### <span style='color:Khaki;'>Document loaders</span>

DocumentLoaders load data into the standard LangChain Document format. See the [available document loaders](https://python.langchain.com/docs/integrations/document_loaders/).

For more detail on PDF document loaders, see [this guide](https://python.langchain.com/docs/how_to/document_loader_pdf/).

In [5]:
file_path = root_dir / "data" / "vmd_sample.pdf"
assert os.path.exists(file_path)
loader = PyPDFLoader(file_path)

docs = loader.load()

print(f"{len(docs)=}")
print("CONTENT")
print(f"{docs[37].page_content[:200]}\n")
print(docs[0].metadata)

len(docs)=200
CONTENT
described in Chapter 4). If VMD is unable to guess the appropriate le type or guesses incorrectly,
you must select it from the list manually.
You can control into which VMD molecule you want to load 

{'source': '/Users/firoozas/Documents/AI-from-scratch/RAG/data/vmd_sample.pdf', 'page': 0}


#### <span style='color:LightGreen;'>Splitting Text</span>

Further splitting the PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.

We will split our documents **into chunks of 1000 characters with 200 characters of overlap** between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/how_to/recursive_text_splitter/), which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.

In [6]:
# `add_start_index=True` stores the character offset at which each chunk starts within its source Document, under the metadata key "start_index".
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

626
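
To make the roles of `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a deliberately simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraph and line breaks); the function name `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut `text` into fixed-size windows; consecutive windows share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "page_content": text[start : start + chunk_size]})
        if start + chunk_size >= len(text):  # this window already reached the end of the text
            break
    return chunks


sample = "".join(str(i % 10) for i in range(25))  # 25 characters of repeating digits
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk["start_index"], chunk["page_content"])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 3 characters of one chunk reappear as the first 3 of the next.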

## <span style='color:Orange;'>Embeddings</span>

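Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we embed it as a vector of the same dimension and use a vector similarity metric (such as cosine similarity) to identify related text.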
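
A toy illustration of this idea, using hand-written 3-dimensional vectors in place of real embeddings: each text is stored with its vector, and a query vector is compared against every stored vector by cosine similarity. The texts, the numbers, and the `search` helper are invented for the example; a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three texts.
store = {
    "loading molecules": [0.9, 0.1, 0.0],
    "rendering images": [0.1, 0.9, 0.1],
    "scripting with Tcl": [0.0, 0.2, 0.9],
}


def search(query_vector: list[float], k: int = 2) -> list[tuple[str, float]]:
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


results = search([0.8, 0.3, 0.0])
for text, score in results:
    print(f"{score:.3f}  {text}")
```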


### <span style='color:Khaki;'>Installing Ollama</span>

- [Installing Ollama](https://github.com/ollama/ollama?tab=readme-ov-file#ollama)

- [Available models](https://ollama.com/search)

In [7]:
embeddings_model = OllamaEmbeddings(model="llama3.2")

In [8]:
try:
    # embedding example
    vector_1 = embeddings_model.embed_query(all_splits[0].page_content)
    vector_2 = embeddings_model.embed_query(all_splits[1].page_content)

    embedding_length = len(vector_1)

    print("Both embedding have same length:", len(vector_1) == len(vector_2))
    print(f"Generated vectors of length {embedding_length}\n")
    print(vector_1[:5])
except ConnectError as e:
    print(e)
    print("Please install and run Ollama server locally or use the hosted version")

Both embedding have same length: True
Generated vectors of length 3072

[0.0018432532, -0.025140692, 0.014092629, -0.023098888, -0.0020301666]


## <span style='color:Orange;'>Vector stores</span>

### <span style='color:Khaki;'>Option 1: Local Qdrant Vector DB</span>

```bash
pip install qdrant-client llama-index llama-index-vector-stores-qdrant
```

```python
import qdrant_client
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.storage.storage_context import StorageContext
from llama_index.core.indices.vector_store.base import VectorStoreIndex

# Specify the collection (DB) name
collection_name = "chat_with_docs"

# Initialize Qdrant client
client = qdrant_client.QdrantClient(
    host="localhost",
    port=6333
)

# Configure the Qdrant vector store
vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name
)

# Set up the storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the Vector Store Index
index = VectorStoreIndex(
    nodes,  # assumed to exist: a list of llama_index nodes (e.g. TextNode objects) built from your documents
    storage_context=storage_context
)
```


### <span style='color:Khaki;'>Option 2: Cloud MongoDB Atlas Vector DB</span>
[Available DBs](https://python.langchain.com/docs/integrations/vectorstores/)

```bash
pip install "pymongo[srv]"
```

###### <span style='color:LightGreen;'>Notes</span>

1. The `relevance_score_fn` parameter of `MongoDBAtlasVectorSearch()` tells the client library how to interpret the relevance scores returned by MongoDB. It does not create a vector search index; the index must be created separately. The similarity functions are:
   - Cosine Similarity: Returns values typically between -1 (opposite) and 1 (identical).
   - Euclidean Distance: Returns non-negative values where 0 means identical, and larger numbers indicate greater dissimilarity.
   - Dot Product: Returns unbounded values where higher scores indicate greater similarity.

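
The three measures can be compared on a pair of small vectors. This is a plain-Python sketch for intuition only: the vectors are arbitrary, and it does not reproduce how Atlas or `MongoDBAtlasVectorSearch` post-process their scores.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


v = [1.0, 2.0]
same_direction = [2.0, 4.0]  # v scaled by 2: same direction, different length
opposite = [-1.0, -2.0]  # v negated: opposite direction, same length

for name, other in [("same_direction", same_direction), ("opposite", opposite)]:
    print(
        f"{name}: cosine={cosine_similarity(v, other):.3f}, "
        f"euclidean={euclidean_distance(v, other):.3f}, "
        f"dot={dot_product(v, other):.3f}"
    )
```

Cosine similarity ignores vector length, so the scaled copy still scores close to 1, while Euclidean distance and the dot product both change with magnitude.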


In [34]:
# https://www.mongodb.com/docs/manual/reference/connection-string/
# https://swethag04.medium.com/rag-using-mongodb-atlas-vector-search-and-langchain-cba57b67fe29
# https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/


class AtlasClient:
    def __init__(self, atlas_uri=None, dbname: str = None, collection_name: str = None, index_name: str = None):
        if atlas_uri is None:
            atlas_uri = os.getenv("MONGODB_URI")
        if atlas_uri is None:
            raise ValueError("Please provide a valid MongoDB Atlas URI or set MONGODB_URI in .env file")

        self._clt = MongoClient(atlas_uri)
        self.database = self._clt[dbname] if dbname is not None else None
        self.collection = self.database[collection_name] if self.database is not None else None
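        self._init_vector_store = False
        self._similarity = None  # set by create_indexes(); None makes init_vector_store() fall back to "cosine"
        self.index_name = index_name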

    # A quick way to test if we can connect to Atlas instance
    def ping(self, debug=True):
        try:
            self._clt.admin.command("ping")
            if debug:
                print("Ping: Successfully connected to MongoDB!")
        except Exception as e:
            print("Ping:", e)
            if debug:
                print(
                    "You may need to add your IP address to Network Access list in MongoDB deployment\n"
                    "https://cloud.mongodb.com -> Security -> Network Access"
                )

    def create_indexes(self, embedding_model, index_name: str, index_def: dict = None):
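        # use this instance's collection rather than the global `client` variable
        existing_indexes = [d["name"] for d in self.collection.list_search_indexes()]
        if index_name in existing_indexes:
            print(f"Index {index_name} already exists.")
            self.index_name = index_name  # remember the index so init_vector_store() can use it
            return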
        
        # get the embedding length
        embedding_length = len(embedding_model.embed_query("test"))
        # Define the vector search index
        vector_search_index_definition = (
            {"fields": [{"type": "vector", "path": "embedding", "similarity": "cosine", "numDimensions": embedding_length}]}
            if index_def is None
            else index_def
        )
        # Create the search index model
        search_index_model = SearchIndexModel(definition=vector_search_index_definition, name=index_name, type="vectorSearch")
        # Create the index on the collection
        self.collection.create_search_index(model=search_index_model)
        self.index_name = index_name
        self._similarity = vector_search_index_definition["fields"][0]["similarity"]

    def init_vector_store(self, embedding_model, score_fn: str = None):
        if self.collection is None or self.index_name is None:
            raise ValueError("Run reinit(...) with db, collection, and index names as needed.")

        if score_fn is None and self._similarity is not None:
            score_fn = self._similarity
        elif score_fn is None:
            score_fn = "cosine"

        print(f"Using similarity function: {score_fn} and index: {self.index_name}")

        self._vector_store = MongoDBAtlasVectorSearch(
            embedding=embedding_model, collection=self.collection, index_name=self.index_name, relevance_score_fn=score_fn
        )
        self._init_vector_store = True

    @property
    def vector_store(self):
        if not self._init_vector_store:
            raise ValueError("Please run init_vector_store(...) first.")
        return self._vector_store

    # init a new collection
    def reinit(self, dbname: str = None, collection_name: str = None, index_name: str = None):
        self.database = self._clt[dbname] if dbname is not None else self.database
        self.collection = self.database[collection_name] if collection_name is not None else self.collection
        self.index_name = index_name if index_name is not None else self.index_name


# Create a new client and connect to the server
client = AtlasClient(dbname="VMD_RAG", collection_name="VMD_PDF")
client.ping()

Ping: Successfully connected to MongoDB!


In [35]:
client.create_indexes(embeddings_model, index_name="pdf_cosine")

Index pdf_cosine already exists.
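
The default index definition built inside `create_indexes` can also be assembled by a small helper, which makes it easier to pass a different similarity function through the `index_def` argument. The helper name `build_vector_index_definition` is hypothetical. The field layout mirrors the default definition in the class above, and the three similarity names are the ones Atlas Vector Search accepts as far as we know, so check them against the Atlas documentation.

```python
# Similarity names assumed to be accepted by Atlas Vector Search indexes.
ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_vector_index_definition(num_dimensions: int, similarity: str = "cosine", path: str = "embedding") -> dict:
    """Return a definition dict shaped like the default one in AtlasClient.create_indexes."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITIES)}, got {similarity!r}")
    if num_dimensions <= 0:
        raise ValueError("num_dimensions must be a positive integer")
    return {
        "fields": [
            {"type": "vector", "path": path, "similarity": similarity, "numDimensions": num_dimensions}
        ]
    }


# 3072 is the embedding length printed earlier for the "llama3.2" model.
index_def = build_vector_index_definition(3072, similarity="euclidean")
print(index_def)
```

The result could then be passed as `client.create_indexes(embeddings_model, index_name="pdf_euclidean", index_def=index_def)`, where `pdf_euclidean` is only an example name.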


In [None]:
# Having instantiated our vector store, we can now index the documents.
if len(all_splits) != client.collection.count_documents({}):
    client.init_vector_store(embeddings_model)
    ids = client.vector_store.add_documents(documents=all_splits)
else:
    print("Documents already indexed")
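
Comparing document counts only detects a fully indexed collection; a run that stopped halfway would insert every chunk again. One possible refinement, sketched here with a hypothetical helper `make_chunk_id`, is to derive a deterministic ID for each chunk from its `source`, `page`, and `start_index` metadata, so the same chunk always maps to the same ID. Whether re-adding an existing ID overwrites the stored document or raises an error depends on the vector store and its version, so treat that part as an assumption to verify.

```python
import hashlib


def make_chunk_id(metadata: dict) -> str:
    """Derive a stable hex ID from the metadata fields that locate a chunk."""
    key = f"{metadata.get('source')}|{metadata.get('page')}|{metadata.get('start_index')}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Stand-ins for the `.metadata` of two chunks (values invented for the example).
example_metadata = [
    {"source": "vmd_sample.pdf", "page": 0, "start_index": 0},
    {"source": "vmd_sample.pdf", "page": 0, "start_index": 800},
]
chunk_ids = [make_chunk_id(m) for m in example_metadata]
print(chunk_ids)
```

In the notebook this would correspond to passing `ids=[make_chunk_id(s.metadata) for s in all_splits]` to `add_documents`.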