# Lesson 4: Retrieval Methods and Vector Databases

**Objective**: Build a retrieval system that efficiently searches for relevant document chunks.

**Topics**:
- Sparse vs. dense retrieval methods
- Hybrid search methods (e.g., combining BM25 with dense retrieval)
- Overview of vector databases: Milvus, Faiss, Qdrant

**Practical Task**: Set up a vector database and implement a retrieval method.

**Resources**:
- What is a vector database
- Choosing a vector database


### Load the dataset

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from dotenv import load_dotenv

load_dotenv()

file_path = (
    "/Users/maximilianocruz/Documents/GitHub/practicos-rag/data/Regulaciones cacao y chocolate 2003.pdf"
)
loader = PyPDFLoader(file_path)
docs = loader.load_and_split()

### Embeddings function

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embedded_document = embedding_model.embed_query(docs[0].page_content)
embedded_document[:3]

[0.0287257619202137, -0.04786234349012375, 0.0054370807483792305]

# A first approach

In [3]:
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

In [5]:
client = QdrantClient(path="/tmp/langchain_qdrant")

In [6]:
# Drop any collection left over from a previous run so this notebook can be re-executed.
client.delete_collection(collection_name="demo_collection")

True

In [7]:
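# The vector size must match the embedding dimension:
# all-mpnet-base-v2 produces 768-dimensional vectors.
client.create_collection(
    collection_name="demo_collection",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)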

vector_store = QdrantVectorStore(
    client=client,
    collection_name="demo_collection",
    embedding=embedding_model,
)
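The collection above is configured with `Distance.COSINE`, so points are ranked by the cosine of the angle between the query vector and each stored vector. The following is a minimal pure-Python sketch of that measure, not Qdrant's implementation. The function name `cosine_similarity` and the toy 3-dimensional vectors are hypothetical and chosen only for illustration; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, different length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query_vec, doc_same_direction))
print(cosine_similarity(query_vec, doc_orthogonal))
```

Because the measure depends only on direction, the longer vector pointing the same way as the query scores as a near-perfect match, while the orthogonal one scores zero.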

In [8]:
vector_store.add_documents(docs)


['d13875579c594ea4abf20a7d88d04d17',
 '47e30feba06e4b32a0f2b5333278a567',
 'e891a6d48a314b608bdd98e03de21b89',
 'd0ef2d46620f4d8695c32e19618d286a',
 'dc9a861fd92b4b378697f5634ed222c1',
 '6d8b161242344afd83a6ed5d960c6c8c',
 '6fc7601c7ce54c27bbd4a6129b1ed94b',
 '81bd5374068b41d59ba611e3365d74ce',
 'efbd2463af4249f1bcf6e00bf0239404',
 'e0091781cd944a69832f93623c2d9d4e',
 '9bac38d192cd4ec7bd85b737fe17e441',
 '75bcea056c6d40b0b731db6f2d9bb841',
 '713076bb181643dbb9003ea2b64cecb3']

In [9]:
client.scroll(collection_name="demo_collection", limit=3)

([Record(id='47e30feba06e4b32a0f2b5333278a567', payload={'page_content': 'Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n“catering establishment” means a restaurant, canteen, club, public house, school, hospital or\nsimilar establishment (including a vehicle or a fixed or mobile stall) where, in the course of a\nbusiness, food is prepared for delivery to the ultimate consumer and is ready for consumption\nwithout further preparation;\n“designated product” means any cocoa or chocolate product specified in column 2 of\nSchedule 1, as read with any Note to that Schedule and any provision of regulation 3 and\nSchedule 2 relating to that product; and “designated chocolate product” and “designated cocoa\nproduct” mean any such product which is respectively a chocolate product or a cocoa product;\n“EEA Agreement” means the Agreement on the European Economic Area(4) signed at Oporto\non 2nd May 1992 as adjusted by the Protocol(5) signed at B

# Dense search

In [10]:
from langchain_qdrant import RetrievalMode

qdrant = QdrantVectorStore.from_documents(
    docs,
    embedding=embedding_model,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.DENSE,
)

query = "What did the president say about Ketanji Brown Jackson"
found_docs = qdrant.similarity_search(query)

In [11]:
found_docs

[Document(metadata={'source': '/Users/maximilianocruz/Documents/GitHub/practicos-rag/data/Regulaciones cacao y chocolate 2003.pdf', 'page': 4, '_id': 'ef1656b64f46458b8772cd0f4ddcb341', '_collection_name': 'my_documents'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n(c) section 20 (offences due to fault of another person);\n(d) section 21 (defence of due diligence) as it applies for the purposes of section 8, 14 or 15\nof the Act;\n(e) section 22 (defence of publication in the course of business);\n(f) section 30(8) (which relates to documentary evidence);\n(g) section 33(1) (obstruction etc. of officers);\n(h) section 33(2), with the modification that the reference to “any such requirement as is\nmentioned in subsection (1)(b) above”, shall be deemed to be a reference to any such\nrequirement as is mentioned in that subsection as applied by sub-paragraph (g) above;\n(i) section 35(1) (punishment of offences) in so f

# Sparse Vector Search

To search with only sparse vectors:

- Set the `retrieval_mode` parameter to `RetrievalMode.SPARSE`.
- Pass an implementation of the `SparseEmbeddings` interface, backed by any sparse embeddings provider, to the `sparse_embedding` parameter.

The langchain-qdrant package provides a FastEmbed-based implementation out of the box.
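Sparse retrieval scores documents by the terms they share with the query. As a rough illustration of the idea behind a BM25-style sparse model, here is a textbook Okapi BM25 scorer in pure Python. It is a sketch under stated assumptions: whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant. The `Qdrant/bm25` FastEmbed model may tokenize and weight terms differently, and the names `bm25_scores` and `toy_docs` are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Fall back to 1.0 so empty corpora do not divide by zero.
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical sentences, not the PDF chunks).
toy_docs = [
    "white chocolate contains cocoa butter and milk",
    "coconut oil may be used in ice cream",
    "cocoa powder and sugar",
]
scores = bm25_scores("white chocolate", toy_docs)
best = max(range(len(toy_docs)), key=lambda i: scores[i])
print(toy_docs[best])
```

Only documents that contain a query term receive a non-zero score, which is why sparse search is strong on exact keywords and weak on paraphrases.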

In [12]:
from langchain_qdrant import FastEmbedSparse, RetrievalMode

sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25", cache_dir="cache")

qdrant = QdrantVectorStore.from_documents(
    docs,
    embedding=embedding_model,
    sparse_embedding=sparse_embeddings,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.SPARSE,
)

query = "What is chocolate?"
found_docs = qdrant.similarity_search(query)



In [13]:
found_docs

[Document(metadata={'source': '/Users/maximilianocruz/Documents/GitHub/practicos-rag/data/Regulaciones cacao y chocolate 2003.pdf', 'page': 11, '_id': '0ee0ddfcbaa34500b6d0f0d9da105dba', '_collection_name': 'my_documents'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\nUsual name of vegetable fat Scientific name of the plants from which\nthe fats listed can be obtained\n6. Mango kernel Mangifera indica\n3. Coconut oil may be used in chocolate for the manufacture of ice cream and similar frozen\nproducts.\n4. In this Schedule—\n“P” means palmitic acid;\n“O” means oleic acid;\n“St” means stearic acid.\nEXPLANATORY NOTE\n(This note is not part of the Regulations)\nThese Regulations, which apply to England, implement Directive 2000/36/EC of the European\nParliament and the Council relating to cocoa and chocolate products intended for human\nconsumption(23). They revoke and replace the Cocoa and Chocolate Products Regulatio

# Hybrid Search

To perform a hybrid search using dense and sparse vectors with score fusion:

- Set the `retrieval_mode` parameter to `RetrievalMode.HYBRID`.
- Pass a dense embeddings value to the `embedding` parameter.
- Pass an implementation of the `SparseEmbeddings` interface, backed by any sparse embeddings provider, to the `sparse_embedding` parameter.

Note that if you've added documents in `HYBRID` mode, you can switch to any retrieval mode when searching, since both the dense and sparse vectors are available in the collection.
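Score fusion merges the dense and sparse result lists into a single ranking. One widely used technique is reciprocal rank fusion (RRF), sketched below in pure Python. This lesson does not state which fusion method the library applies, so treat RRF here as an illustrative assumption. The function name, the constant `k=60` (a common default), and the document IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a dense and a sparse retriever.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

RRF uses only rank positions, so it sidesteps the fact that cosine similarities and BM25 scores live on different scales. Documents ranked well by both retrievers rise to the top.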

In [14]:
from langchain_qdrant import FastEmbedSparse, RetrievalMode

sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

qdrant = QdrantVectorStore.from_documents(
    docs,
    embedding=embedding_model,
    sparse_embedding=sparse_embeddings,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.HYBRID,
)

query = "What did the president say about Ketanji Brown Jackson"
found_docs = qdrant.similarity_search(query)



In [15]:
found_docs

[Document(metadata={'source': '/Users/maximilianocruz/Documents/GitHub/practicos-rag/data/Regulaciones cacao y chocolate 2003.pdf', 'page': 4, '_id': '69892ab960bd40e48f7bbc730dd698e2', '_collection_name': 'my_documents'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n(c) section 20 (offences due to fault of another person);\n(d) section 21 (defence of due diligence) as it applies for the purposes of section 8, 14 or 15\nof the Act;\n(e) section 22 (defence of publication in the course of business);\n(f) section 30(8) (which relates to documentary evidence);\n(g) section 33(1) (obstruction etc. of officers);\n(h) section 33(2), with the modification that the reference to “any such requirement as is\nmentioned in subsection (1)(b) above”, shall be deemed to be a reference to any such\nrequirement as is mentioned in that subsection as applied by sub-paragraph (g) above;\n(i) section 35(1) (punishment of offences) in so f

In [16]:
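# Run a similarity search and return each match together with its score.
# Note: `vector_store` is the dense "demo_collection" store created earlier,
# not the hybrid `qdrant` store defined above.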
results = vector_store.similarity_search_with_score(
    query="What is chocolate?", k=1
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

* [SIM=0.479522] Document Generated: 2023-04-24
Status:  This is the original version (as it was originally made).
Column 1 Column 2
Reserved Descriptions Designated Products
— not less than 25 per cent total fat (cocoa
butter and milk fat).
6. White chocolate The product obtained from cocoa butter, milk
or milk products and sugars which contains not
less than 20 per cent cocoa butter and not less
than 14 per cent dry milk solids obtained by
partly or wholly dehydrating whole milk, semi-
skimmed or skimmed milk, cream, or from
partly or wholly dehydrated cream, butter or
milk fat, of which not less than 3.5 per cent is
milk fat.
7. Filled chocolate or
Chocolate with … filling or
Chocolate with … centre
The filled product, the outer part of which
consists of a product specified in column
2 of item 3, 4, 5 or 6 of this Schedule and
constitutes not less than 25 per cent of the total
weight of the product, but does not include any
filled product, the inside of which consists of
bakery prod

# Metadata filtering

In [17]:
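from qdrant_client.http import models

# LangChain stores each document's metadata under the nested "metadata" payload key,
# and PyPDFLoader records page numbers as integers. The filter therefore targets
# "metadata.page" with an int value. Filtering on key="page" with value="5"
# matches no points and returns an empty list.
results = vector_store.similarity_search(
    query="What is chocolate?",
    k=1,
    filter=models.Filter(
        should=[
            models.FieldCondition(
                key="metadata.page",
                match=models.MatchValue(
                    value=5
                ),
            ),
        ]
    ),
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")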

# Query by turning into a retriever

In [19]:
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 5})
retriever.invoke("What is chocolate?")


[Document(metadata={'source': '/Users/maximilianocruz/Documents/GitHub/practicos-rag/data/Regulaciones cacao y chocolate 2003.pdf', 'page': 9, '_id': 'e0091781cd944a69832f93623c2d9d4e', '_collection_name': 'demo_collection'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\nColumn 1 Column 2\nReserved Descriptions Designated Products\n— not less than 25 per cent total fat (cocoa\nbutter and milk fat).\n6. White chocolate The product obtained from cocoa butter, milk\nor milk products and sugars which contains not\nless than 20 per cent cocoa butter and not less\nthan 14 per cent dry milk solids obtained by\npartly or wholly dehydrating whole milk, semi-\nskimmed or skimmed milk, cream, or from\npartly or wholly dehydrated cream, butter or\nmilk fat, of which not less than 3.5 per cent is\nmilk fat.\n7. Filled chocolate or\nChocolate with … filling or\nChocolate with … centre\nThe filled product, the outer part of which\nco
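The retriever above uses `search_type="mmr"`, that is, maximal marginal relevance: each result is chosen to be relevant to the query while differing from the results already selected. The following is a generic pure-Python sketch of that selection loop, not LangChain's implementation. The names `mmr_select` and `lambda_mult`, the default of 0.5, and the toy 2-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy vectors (hypothetical): documents 0 and 1 are near-duplicates.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With a balanced `lambda_mult`, the near-duplicate of the top hit is skipped in favor of a less similar document. With `lambda_mult=1.0` the selection reduces to plain similarity ranking and the near-duplicate is returned.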