# Lesson 3: Embedding Models for Retrieval

**Objective**: Understand the role of embeddings in representing document chunks for retrieval.

**Topics**:
- Overview of embedding models: LLM-Embedder, BAAI/bge, etc.
- Selecting the right embedding model
- Integrating embeddings into the retrieval pipeline

**Practical Task**: Implement and test embedding models on the chunked documents.

**Resources**:
- Choosing an embedding model
- How to select an embedding model
- [Mastering RAG: How to Select an Embedding Model](https://www.rungalileo.io/blog/mastering-rag-how-to-select-an-embedding-model)
- [Vector Embeddings in RAG Applications](https://wandb.ai/mostafaibrahim17/ml-articles/reports/Vector-Embeddings-in-RAG-Applications--Vmlldzo3OTk1NDA5)


## Load the dataset

In [1]:
from langchain_community.document_loaders import PyPDFLoader

# PyPDFLoader returns one Document per PDF page.
file_path = "../data/Regulaciones cacao y chocolate 2003.pdf"
loader = PyPDFLoader(file_path)
doc = loader.load()

In [6]:
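from langchain_community.document_loaders import PDFMinerLoader

# PDFMinerLoader returns the whole PDF as a single Document by default.
# Note that this overwrites the `doc` loaded with PyPDFLoader.
file_path = "../data/Regulaciones cacao y chocolate 2003.pdf"
loader = PDFMinerLoader(file_path)
doc = loader.load()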
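
The following cells index into `splitted_doc`, which is not defined in the cells shown here; it is presumably the output of the chunking step from the previous lesson. As a rough stand-in, the sketch below cuts a text into fixed-size character windows with overlap using only the standard library. The `Chunk` class, the `split_text` helper, and the size parameters are hypothetical names chosen for illustration, not part of LangChain, and the notebook's actual splitter may behave differently.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document: only page_content is modelled."""
    page_content: str


def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[Chunk]:
    """Cut text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(Chunk(text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input; in the notebook this would be the text extracted from the PDF.
sample_text = "The Cocoa and Chocolate Products (England) Regulations 2003. " * 40
sample_chunks = split_text(sample_text, chunk_size=500, chunk_overlap=50)
print(len(sample_chunks), len(sample_chunks[0].page_content))
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.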

In [7]:
splitted_doc[0].page_content

'Status:  This is the original version (as it was originally made).\nSTATUTORY INSTRUMENTS\n2003 No. 1659\nFOOD, ENGLAND\nThe Cocoa and Chocolate Products (England) Regulations 2003\nMade        -      -       -      - 25th June 2003\nLaid before Parliament 3rd July 2003\nComing into force       -      - 3rd August 2003\nThe Secretary of State, in exercise of the powers conferred by sections 16(1)(e), 17(1), 26(1) and (3)\nand 48(1) of the Food Safety Act 1990 (1) and now vested in him (2) and of all other powers enabling\nhim in that behalf, having had regard in accordance with section 48(4A) of that Act to relevant\nadvice given by the Food Standards Agency, and after consultation both as required by Article 9\nof Regulation (EC) No. 178/2002  of the European Parliament and of the Council laying down the\ngeneral principles and requirements of food law, establishing the European Food Safety Authority\nand laying down procedures in matters of food safety (3) and in accordance with sec

## Embedding models

In [16]:
from fastembed import TextEmbedding

# List the dense text embedding models that fastembed can download and run.
TextEmbedding.list_supported_models()

[{'model': 'BAAI/bge-base-en',
  'dim': 768,
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year',
  'size_in_GB': 0.42,
  'sources': {'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz'},
  'model_file': 'model_optimized.onnx'},
 {'model': 'BAAI/bge-base-en-v1.5',
  'dim': 768,
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year',
  'size_in_GB': 0.21,
  'sources': {'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz',
   'hf': 'qdrant/bge-base-en-v1.5-onnx-q'},
  'model_file': 'model_optimized.onnx'},
 {'model': 'BAAI/bge-large-en-v1.5',
  'dim': 1024,
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year',
  'size_in_GB': 1.2,
  'sources': {'

In [8]:
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

### Dense embeddings

In [14]:

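# all-MiniLM-L6-v2 maps each text to a single 384-dimensional vector.
dense_embedding_model = FastEmbedEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embed_query returns one vector (a list of floats) for one input string.
dense_embeddings = dense_embedding_model.embed_query(splitted_doc[0].page_content)
len(dense_embeddings)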

384
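
At retrieval time, a dense query vector is typically compared against every chunk vector with cosine similarity, and the highest-scoring chunks are returned. The sketch below illustrates that step with toy 3-dimensional vectors in place of the 384-dimensional ones produced above; `cosine_similarity` and `rank_chunks` are hypothetical helper names, and a real pipeline would normally delegate this to a vector store.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_chunks(query_vector, chunk_vectors):
    """Return (chunk_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(chunk_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_chunks = [
    [0.9, 0.1, 0.8],    # points in nearly the same direction as the query
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0],  # points in the opposite direction
]
ranking = rank_chunks(toy_query, toy_chunks)
print(ranking)
```

Cosine similarity ignores vector length, so a long chunk and a short chunk are compared only by the direction of their embeddings.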

### Sparse embeddings

In [17]:
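from fastembed.sparse.bm25 import Bm25

bm25_embedding_model = Bm25("Qdrant/bm25")
# passage_embed takes an iterable of texts and yields one sparse embedding per text.
bm25_embeddings = list(bm25_embedding_model.passage_embed([splitted_doc[0].page_content]))
# Number of non-zero entries in the first chunk's sparse vector.
len(bm25_embeddings[0].values)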

156
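
Unlike the dense case, the number above is not a fixed dimension: a sparse embedding stores only its non-zero entries as parallel `indices` and `values` arrays, so its length varies with the text. Scoring a query against a chunk then reduces to a dot product over the indices the two share. The sketch below shows this with plain lists; `sparse_dot` and the toy indices and weights are assumptions made for illustration, not fastembed output.

```python
def sparse_dot(query_indices, query_values, doc_indices, doc_values):
    """Dot product of two sparse vectors given as parallel index/value lists."""
    doc_weights = dict(zip(doc_indices, doc_values))
    # Only indices present in both vectors contribute to the score.
    return sum(qv * doc_weights.get(qi, 0.0) for qi, qv in zip(query_indices, query_values))


# Toy data: each index identifies a term, each value is that term's weight.
toy_query_indices, toy_query_values = [7, 42], [1.0, 1.0]
toy_doc_indices, toy_doc_values = [3, 42, 99], [0.5, 1.7, 0.9]

toy_score = sparse_dot(toy_query_indices, toy_query_values, toy_doc_indices, toy_doc_values)
print(toy_score)
```

Because only shared terms contribute, sparse scores reward exact keyword matches, which complements the semantic matching of dense embeddings.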

### Late interaction embeddings

In [None]:
from fastembed.late_interaction import LateInteractionTextEmbedding

late_interaction_embedding_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
# Each passage yields a matrix with one vector per token, not a single vector.
late_interaction_embeddings = list(late_interaction_embedding_model.passage_embed([splitted_doc[0].page_content]))
# Number of token vectors for the first chunk.
len(late_interaction_embeddings[0])
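
Late interaction models such as ColBERT keep one vector per token and score a query against a passage with the MaxSim operator: for every query token, take its best dot product against any passage token, then sum those maxima. The sketch below is a minimal pure-Python version of that scoring rule on toy 2-dimensional token vectors; `maxsim_score` and the toy data are hypothetical and are not produced by the model above.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vectors, doc_token_vectors):
    """Sum over query tokens of the best dot product against any passage token."""
    if not doc_token_vectors:
        raise ValueError("the passage must contain at least one token vector")
    return sum(
        max(dot(q, d) for d in doc_token_vectors)
        for q in query_token_vectors
    )


# Toy token vectors standing in for real per-token embeddings.
toy_query_tokens = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
toy_doc_b = [[1.0, 0.0], [0.6, 0.0]]              # matches only the first query token

score_a = maxsim_score(toy_query_tokens, toy_doc_a)
score_b = maxsim_score(toy_query_tokens, toy_doc_b)
print(score_a, score_b)
```

Storing a vector per token makes the index much larger than a dense one, which is why late interaction is often used to rerank a short candidate list rather than to search the whole collection.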