# Lesson 4: Retrieval Methods and Vector Databases

**Objective**: Build a retrieval system that efficiently searches for relevant document chunks.

**Topics**:
- Sparse vs. dense retrieval methods
- Hybrid search methods (e.g., combining BM25 with dense retrieval)
- Overview of vector databases: Milvus, Faiss, Qdrant

**Practical Task**: Set up a vector database and implement a retrieval method.

**Resources**:
- What is a vector database
- Choosing a vector database


## Load the dataset

In [1]:
from langchain_community.document_loaders import PyPDFLoader

file_path = (
    "../data/Regulaciones cacao y chocolate 2003.pdf"
)
loader = PyPDFLoader(file_path)
splitted_doc = loader.load_and_split()

## Documents

In [4]:
# page content
splitted_doc[0].page_content

'Status:  This is the original version (as it was originally made).\nSTATUTORY INSTRUMENTS\n2003 No. 1659\nFOOD, ENGLAND\nThe Cocoa and Chocolate Products (England) Regulations 2003\nMade        -      -       -      - 25th June 2003\nLaid before Parliament 3rd July 2003\nComing into force       -      - 3rd August 2003\nThe Secretary of State, in exercise of the powers conferred by sections 16(1)(e), 17(1), 26(1) and (3)\nand 48(1) of the Food Safety Act 1990 (1) and now vested in him (2) and of all other powers enabling\nhim in that behalf, having had regard in accordance with section 48(4A) of that Act to relevant\nadvice given by the Food Standards Agency, and after consultation both as required by Article 9\nof Regulation (EC) No. 178/2002  of the European Parliament and of the Council laying down the\ngeneral principles and requirements of food law, establishing the European Food Safety Authority\nand laying down procedures in matters of food safety (3) and in accordance with sec

In [5]:
# metadata
splitted_doc[0].metadata

{'source': '../data/Regulaciones cacao y chocolate 2003.pdf', 'page': 0}

## Embedding function

In [7]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
embedded_document = embedding_model.embed_query(splitted_doc[0].page_content)
embedded_document[:3]

[0.02872576005756855, -0.047862425446510315, 0.005437064450234175]
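
The collection created later in this lesson compares these vectors with cosine distance (`Distance.COSINE`). As a minimal sketch of what that comparison computes, the snippet below uses short plain-Python lists in place of the real 768-dimensional embeddings. The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of LangChain or Qdrant.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up values, not model output).
query_vec = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]
doc_orthogonal = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, doc_same_direction))  # close to 1: same direction
print(cosine_similarity(query_vec, doc_orthogonal))      # 0: nothing in common
```

Because only the direction of the vectors matters, a long chunk and a short chunk about the same topic can still score as similar.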

# A first approach

In [9]:
from dotenv import load_dotenv

load_dotenv()

True

In [11]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

In [13]:
client = QdrantClient(path="/tmp/langchain_qdrant")

In [15]:
# Drop any collection left over from a previous run before recreating it.
client.delete_collection(collection_name="demo_collection")

True

In [16]:
client.create_collection(
    collection_name="demo_collection",
    # size must match the embedding dimension (768 for all-mpnet-base-v2).
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="demo_collection",
    embedding=embedding_model,
)

In [18]:
vector_store.add_documents(splitted_doc)

['34c8015847ed42d5b044422a54a6e949',
 '1d5404308b014962b5ff72c202897496',
 '1364bd14b8d4406a913e2db53f5b4ed1',
 '561918c1ecbd4469bdcae34a3c47b3a6',
 'e97d7aeb7b6545b795a9116979e7f01a',
 '473e831e58ae438ba848ac7e30ec8711',
 '521ab2c7a937400bb66163e7f57575b7',
 '91e8ea2ba0c94114944e2ef75b5fb42d',
 '6895b98da7174891b154487ba3c071b3',
 '5f749923c308414e8352983b03fddd2f',
 '9ae3749d1c844886b8c61b572e074f69',
 '9bf10e2daa6845c9a575624f228d581b',
 'd517cd42781d41caaabec3124f3afa0b']

In [21]:
client.scroll(collection_name="demo_collection", limit=3)

([Record(id='00e1564af7744febb15d32453ea5d00f', payload={'page_content': '“EEA Agreement” means the Agreement on the European Economic Area (4) signed at Oporto\non 2nd May 1992 as adjusted by the Protocol (5) signed at Brussels on 17th March 1993;\n“EEA State” means a State which is a Contracting Party to the EEA Agreement;\n“food authority” does not include—\n(a) the council of a district in a non-metropolitan county except where the county functions\nhave been transferred to that council pursuant to a structural change, or\n(b) the appropriate Treasurer referred to in section 5(1)(c) of the Act (which deals with the\nInner and the Middle Temple);\n“the 1996 Regulations” means the Food Labelling Regulations 1996 (6);\n“other edible substances” does not include vegetable fats referred to in regulation 3 or the\nfilling of any product specified in column 2 of item 7 or of item 10(a) of Schedule 1;\n“preparation” includes manufacture and any form of processing or treatment;\n“reserved d

# Dense search

In [23]:
from langchain_qdrant import RetrievalMode

qdrant = QdrantVectorStore.from_documents(
    splitted_doc,
    embedding=embedding_model,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.DENSE,
)

query = "What did the president say about Ketanji Brown Jackson"
found_docs = qdrant.similarity_search(query)

In [24]:
found_docs

[Document(metadata={'source': '../data/Regulaciones cacao y chocolate 2003.pdf', 'page': 4, '_id': '06f68166c5ef46b4a1b8961866619faf', '_collection_name': 'my_documents'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n(c)section 20 (offences due to fault of another person);\n(d)section 21 (defence of due diligence) as it applies for the purposes of section 8, 14 or 15\nof the Act;\n(e)section 22 (defence of publication in the course of business);\n(f)section 30(8) (which relates to documentary evidence);\n(g)section 33(1) (obstruction etc. of officers);\n(h)section 33(2), with the modification that the reference to “any such requirement as is\nmentioned in subsection (1)(b) above”, shall be deemed to be a reference to any such\nrequirement as is mentioned in that subsection as applied by sub-paragraph (g) above;\n(i)section 35(1) (punishment of offences) in so far as it relates to offences under section 33(1)\nas appli

# Sparse Vector Search

To search with only sparse vectors:

- Set the `retrieval_mode` parameter to `RetrievalMode.SPARSE`.
- Pass an implementation of the `SparseEmbeddings` interface, backed by any sparse embeddings provider, as the `sparse_embedding` parameter.

The `langchain-qdrant` package provides a FastEmbed-based implementation out of the box.

In [26]:
from langchain_qdrant import FastEmbedSparse, RetrievalMode

sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

qdrant = QdrantVectorStore.from_documents(
    splitted_doc,
    embedding=embedding_model,
    sparse_embedding=sparse_embeddings,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.SPARSE,
)

query = "What did the president say about Ketanji Brown Jackson"
found_docs = qdrant.similarity_search(query)


In [27]:
found_docs

[]
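
The empty list above is consistent with how lexical scoring works: BM25 gives a nonzero score only to documents that share at least one term with the query, and this query has little vocabulary in common with a cocoa regulation. The sketch below is a minimal BM25 scorer that assumes whitespace tokenization and the common defaults `k1=1.5` and `b=0.75`. The `Qdrant/bm25` model has its own tokenizer and stop-word handling, so this is an illustration and not its implementation. The second toy document is invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "family milk chocolate contains not less than 20 per cent total dry cocoa solids",
    "cocoa butter is the fat obtained from cocoa beans",  # invented toy sentence
]
print(bm25_scores("cocoa solids in milk chocolate", docs))   # first document scores higher
print(bm25_scores("president ketanji brown jackson", docs))  # no shared terms, all zeros
```

A query phrased in the vocabulary of the document is the simplest way to see sparse retrieval return results.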

# Hybrid Search

To perform a hybrid search that combines dense and sparse vectors with score fusion:

- Set the `retrieval_mode` parameter to `RetrievalMode.HYBRID`.
- Pass a dense embeddings implementation as the `embedding` parameter.
- Pass an implementation of the `SparseEmbeddings` interface, backed by any sparse embeddings provider, as the `sparse_embedding` parameter.

Note that if you added documents in `HYBRID` mode, you can switch to any retrieval mode when searching, since both the dense and sparse vectors are available in the collection.

In [29]:
from langchain_qdrant import FastEmbedSparse, RetrievalMode

sparse_embeddings = FastEmbedSparse(model_name="Qdrant/bm25")

qdrant = QdrantVectorStore.from_documents(
    splitted_doc,
    embedding=embedding_model,
    sparse_embedding=sparse_embeddings,
    location=":memory:",
    collection_name="my_documents",
    retrieval_mode=RetrievalMode.HYBRID,
)

query = "What did the president say about Ketanji Brown Jackson"
found_docs = qdrant.similarity_search(query)


In [30]:
found_docs

[Document(metadata={'source': '../data/Regulaciones cacao y chocolate 2003.pdf', 'page': 4, '_id': '0bc76322de1b427b9933631bca239882', '_collection_name': 'my_documents'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n(c)section 20 (offences due to fault of another person);\n(d)section 21 (defence of due diligence) as it applies for the purposes of section 8, 14 or 15\nof the Act;\n(e)section 22 (defence of publication in the course of business);\n(f)section 30(8) (which relates to documentary evidence);\n(g)section 33(1) (obstruction etc. of officers);\n(h)section 33(2), with the modification that the reference to “any such requirement as is\nmentioned in subsection (1)(b) above”, shall be deemed to be a reference to any such\nrequirement as is mentioned in that subsection as applied by sub-paragraph (g) above;\n(i)section 35(1) (punishment of offences) in so far as it relates to offences under section 33(1)\nas appli
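
Score fusion merges the dense and sparse rankings into a single list. One widely used scheme is reciprocal rank fusion (RRF), in which each document receives `1 / (k + rank)` from every ranking it appears in. The sketch below assumes RRF with the conventional constant `k = 60` and uses made-up chunk ids. It illustrates the idea only and does not claim to reproduce what Qdrant runs internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


dense_ranking = ["chunk_4", "chunk_7", "chunk_2"]  # hypothetical dense results
sparse_ranking = ["chunk_7", "chunk_9"]            # hypothetical sparse results

print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

A chunk ranked by both retrievers (`chunk_7` here) rises to the top, while a chunk found by only one retriever still stays in the fused list. That would be consistent with the hybrid cell returning results even though the sparse-only search returned none.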

In [31]:
# Run a similarity search that also returns the score of each match.
results = vector_store.similarity_search_with_score(
    query="Will it be hot tomorrow", k=1
)
for doc, score in results:
    print(f"* [SIM={score:3f}] {doc.page_content} [{doc.metadata}]")

* [SIM=0.024672] dehydrating whole milk, semi-skimmed
or skimmed milk, cream, or from partly
or wholly dehydrated cream, butter or
milk fat and (2) from finely ground
hazelnuts in such quantities that 100
grams of the product contain not less than
15 grams and not more than 40 grams of
hazelnuts; and to which may have been
added almonds, hazelnuts and other nut
varieties, either whole or broken, in such
quantities that, together with the ground
hazelnuts, they do not exceed 60 per cent
of the total weight of the product.
(c) (c)  If “Milk” is replaced by— (c) (c) 
(i)“cream” (i)The product containing a minimum milk
fat content of 5.5 per cent.
(ii)“skimmed milk” (ii)The product containing a milk fat content
not greater than 1 per cent.
5.Family milk chocolate or Milk chocolate The product obtained from cocoa products,
sugars and milk or milk products which
contains—
—not less than 20 per cent total dry cocoa
solids;
—not less than 20 per cent dry milk solids
obtained by partly or wholl

# Metadata filtering

In [32]:
from qdrant_client.http import models

results = vector_store.similarity_search(
    query="Who are the best soccer players in the world?",
    k=1,
    filter=models.Filter(
        should=[
            models.FieldCondition(
                key="page_content",
                match=models.MatchValue(
                    value="The top 10 soccer players in the world right now."
                ),
            ),
        ]
    ),
)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")
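
In a `Filter`, `should` conditions behave like a logical OR and `must` conditions like a logical AND, and only points whose payload passes the filter are candidates for the similarity ranking. The cell above shows no output, which is what one would expect if no chunk's `page_content` equals the given sentence exactly. The sketch below mimics that logic on plain dictionaries. The helpers `get_nested` and `passes_filter` and the sample payloads are hypothetical, and the dotted key `metadata.page` assumes that document metadata is stored in a nested `metadata` field of the payload.

```python
def get_nested(payload, dotted_key):
    """Follow a dotted key such as 'metadata.page' through nested dicts."""
    value = payload
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def passes_filter(payload, must=(), should=()):
    """must: every (key, value) pair matches; should: at least one matches."""
    must_ok = all(get_nested(payload, key) == value for key, value in must)
    should_ok = not should or any(
        get_nested(payload, key) == value for key, value in should
    )
    return must_ok and should_ok


# Made-up payloads shaped like the assumed layout.
payloads = [
    {"page_content": "chunk from page 0", "metadata": {"page": 0}},
    {"page_content": "chunk from page 4", "metadata": {"page": 4}},
]

on_page_4 = [p for p in payloads if passes_filter(p, must=[("metadata.page", 4)])]
exact_text = [
    p
    for p in payloads
    if passes_filter(
        p,
        should=[("page_content", "The top 10 soccer players in the world right now.")],
    )
]
print(len(on_page_4), len(exact_text))
```

Filtering on a metadata field such as the page number keeps some chunks, whereas an exact match on a sentence that appears in no chunk keeps none.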

## Query by turning into a retriever

In [33]:
retriever = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1})
retriever.invoke("Stealing from the bank is a crime")

[Document(metadata={'source': '../data/Regulaciones cacao y chocolate 2003.pdf', 'page': 4, '_id': 'e97d7aeb7b6545b795a9116979e7f01a', '_collection_name': 'demo_collection'}, page_content='Document Generated: 2023-04-24\nStatus:  This is the original version (as it was originally made).\n(c)section 20 (offences due to fault of another person);\n(d)section 21 (defence of due diligence) as it applies for the purposes of section 8, 14 or 15\nof the Act;\n(e)section 22 (defence of publication in the course of business);\n(f)section 30(8) (which relates to documentary evidence);\n(g)section 33(1) (obstruction etc. of officers);\n(h)section 33(2), with the modification that the reference to “any such requirement as is\nmentioned in subsection (1)(b) above”, shall be deemed to be a reference to any such\nrequirement as is mentioned in that subsection as applied by sub-paragraph (g) above;\n(i)section 35(1) (punishment of offences) in so far as it relates to offences under section 33(1)\nas ap
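
`search_type="mmr"` requests maximal marginal relevance: candidates are picked one at a time, trading similarity to the query against similarity to the chunks already picked. In the standard formulation the first pick is simply the most relevant chunk, so with `k=1` the diversity term never comes into play. The sketch below shows the selection loop under stated assumptions: toy 2-dimensional vectors, cosine similarity, and a hypothetical `lambda_mult` weight of 0.5.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # most relevant
    [1.0, 0.11],  # near-duplicate of the first
    [0.9, -0.5],  # less relevant, but points in a different direction
]
print(mmr(query_vec, candidates, k=2))
```

With `k=2` the near-duplicate is skipped in favor of the more distinct third vector. Raising `k` in `search_kwargs` is what lets this trade-off show up in the retriever's results.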

In [8]:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1")

In [9]:
# Retrieve and generate using the relevant chunks of the PDF.
from langchain import hub

retriever = vector_store.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


In [12]:
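from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    # Join the retrieved chunks into one context string, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)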


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("Can you tell me what are the regulations about cacao and chocolate in 2003?")

"Based on the provided text, it appears that you are asking about a specific part of the regulations regarding food labeling. However, I don't see any question being asked. Could you please provide the actual question or problem you need help with? If not, I'll do my best to summarize and highlight key points from the given regulations.\n\nThe provided text seems to be an excerpt from UK legislation (the Cocoa and Chocolate Products (England) Regulations 2003), which outlines specific requirements for labeling cocoa and chocolate products. Key takeaways include:\n\n1. **Labeling Requirements**: Specific details about ingredients, reserved descriptions, and their corresponding designations are outlined.\n2. **Offenses and Penalties**: Failure to comply with regulations can lead to offenses punishable by fines.\n3. **Defence in Relation to Exports**: A defense is allowed for exports under certain conditions.\n4. **Application of Provisions of the Food Safety Act 1990**: Certain sections 