# Lesson 4: Retrieval Methods and Vector Databases

**Objective**: Build a retrieval system that efficiently searches for relevant document chunks.

**Topics**:
- Sparse vs. dense retrieval methods
- Hybrid search methods (e.g., combining BM25 with dense retrieval)
- Overview of vector databases: Milvus, Faiss, Qdrant

**Practical Task**: Set up a vector database and implement a retrieval method.

**Resources**:
- What is a vector database
- Choosing a vector database
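
To make the sparse, dense, and hybrid topics above concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring (sparse) and reciprocal rank fusion (one common way to build a hybrid ranking). Everything in it is an illustrative assumption rather than part of this lesson's pipeline: the helper names `tokenize`, `bm25_scores`, `rank`, and `reciprocal_rank_fusion`, the toy documents, the hard-coded `dense_ranking`, and the parameter values `k1=1.5`, `b=0.75`, and `k=60`.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # "Plus one" IDF variant, which keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank(scores):
    """Return document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document earns 1 / (k + position) per list."""
    fused = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + position)
    return sorted(fused, key=lambda d: fused[d], reverse=True)


# Toy corpus (invented for illustration).
docs = [
    "chocolate products must list cocoa solids on the label",
    "penalties apply to offences under these regulations",
    "milk chocolate contains milk solids and cocoa butter",
    "cocoa butter is the fat obtained from cocoa beans",
]
query = "cocoa label"

sparse_ranking = rank(bm25_scores(query, docs))
# Hypothetical ranking, standing in for what a dense retriever might return.
dense_ranking = [2, 0, 3, 1]
hybrid_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(sparse_ranking, hybrid_ranking)
```

Rank fusion only needs the positions from each retriever, so it avoids having to put BM25 scores and cosine similarities on a common scale.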


### Load the dataset

In [1]:
from langchain_community.document_loaders import PyPDFLoader

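# Path to the source PDF (the 2003 cocoa and chocolate regulations).
file_path = "../data/Regulaciones cacao y chocolate 2003.pdf"
loader = PyPDFLoader(file_path)
# Load the PDF and split it into chunks (a list of Document objects).
splitted_doc = loader.load_and_split()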

### Embeddings function

In [2]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

dense_embedding_model = FastEmbedEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embed_documents expects a list of strings, so wrap the single chunk in a list.
dense_embeddings = dense_embedding_model.embed_documents([splitted_doc[0].page_content])
len(dense_embeddings[0])



384
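
Each chunk is now a 384-dimensional vector, and dense retrieval typically ranks chunks by how close their vectors are to the query vector. The sketch below shows that idea with cosine similarity on tiny 3-dimensional toy vectors. The helper names `cosine_similarity` and `top_k` and all vector values are made up for illustration; a real vector store would use the model's embeddings and usually an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy vectors standing in for real 384-dimensional embeddings.
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [0.9, 0.1, 0.0]

nearest = top_k(toy_query_vector, toy_doc_vectors, k=2)
print(nearest)
```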

# A first approach

In [3]:
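# Note: `!export` runs in a temporary subshell, so it cannot set variables for this kernel,
# and `export` does not exist in the Windows shell (hence the errors below).
# Set the variables through os.environ instead.
!export LANGCHAIN_TRACING_V2="true"
!export LANGCHAIN_API_KEY="<your-langsmith-api-key>"

'export' is not recognized as an internal or external command,
operable program or batch file.
'export' is not recognized as an internal or external command,
operable program or batch file.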


In [4]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()
# Running this cell opens an input prompt; paste your LangSmith API key there.
# Do not hard-code the key in the notebook.
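
If the cell is re-run often, one option is to prompt only when the variable is missing. The helper below is a small sketch of that pattern; `ensure_env` is a hypothetical name, not part of LangChain or LangSmith, and the `ask` parameter exists only so the prompt can be swapped out.

```python
import getpass
import os


def ensure_env(name, ask=getpass.getpass):
    """Set os.environ[name] from a hidden prompt only if it is not already set."""
    if not os.environ.get(name):
        os.environ[name] = ask(f"{name}: ")
    return os.environ[name]


# Example usage in the notebook (commented out because it would prompt for input):
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# ensure_env("LANGCHAIN_API_KEY")
```

Because the value is read at run time, the key never needs to appear in the notebook source or its saved outputs.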

In [5]:
from langchain import hub
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [7]:
vectorstore = Chroma.from_documents(documents=splitted_doc, embedding=dense_embedding_model)

In [8]:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="llama3.1")

In [9]:
# Retrieve and generate using the relevant snippets of the PDF.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")



In [12]:
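def format_docs(docs):
    # Join the retrieved chunks into one context string, separated by blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# The retriever fetches chunks for the question and format_docs turns them into
# the prompt's "context"; the question itself is passed through unchanged.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)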

rag_chain.invoke("Can you tell me what are the regulations about cacao and chocolate in 2003?")

"Based on the provided text, it appears that you are asking about a specific part of the regulations regarding food labeling. However, I don't see any question being asked. Could you please provide the actual question or problem you need help with? If not, I'll do my best to summarize and highlight key points from the given regulations.\n\nThe provided text seems to be an excerpt from UK legislation (the Cocoa and Chocolate Products (England) Regulations 2003), which outlines specific requirements for labeling cocoa and chocolate products. Key takeaways include:\n\n1. **Labeling Requirements**: Specific details about ingredients, reserved descriptions, and their corresponding designations are outlined.\n2. **Offenses and Penalties**: Failure to comply with regulations can lead to offenses punishable by fines.\n3. **Defence in Relation to Exports**: A defense is allowed for exports under certain conditions.\n4. **Application of Provisions of the Food Safety Act 1990**: Certain sections 