# Building RAG pipelines with Optimized Embedding Models

In the following notebook we will show how to utilize two fastRAG components that use an optimized and quantized bi-encoder.

We will showcase `IPEXSentenceTransformersDocumentEmbedder` for embedding documents in a vector store, and `IPEXBiEncoderSimilarityRanker` for re-ranking documents in a retrieval pipeline.

**NOTE**: Please carefully read the [guide](../scripts/optimizations/embedders/README.md) we provide on how to optimize speed and latency on Intel Xeon backends.

First, let's build an index. We create 3 documents:

In [6]:
from haystack import Document

In [7]:
examples = [
    "There is a blue house on Oxford street.",
    "Paris is the capital of France.",
    "fastRAG had its first commit in 2022."
]

In [8]:
documents = []
for i, d in enumerate(examples):
    # Haystack declares Document.id as a string, so convert the counter explicitly.
    documents.append(Document(content=d, id=str(i + 1)))

In [9]:
from haystack.document_stores.in_memory import InMemoryDocumentStore

In [10]:
from fastrag.embedders import IPEXSentenceTransformersDocumentEmbedder, IPEXSentenceTransformersTextEmbedder

Bi-encoders are implemented as two classes: one encodes the documents and the other encodes the queries.
Embedding performance on Intel hardware depends on the data input strategy. We recommend calibrating the batch size and padding strategy to minimize latency or maximize throughput when embedding.

If your sequences are shorter than the maximum length of the model (for example, shorter than 512 tokens for BGE), it is recommended to lower the truncation length via the `max_seq_length` argument to speed up encoding.
Padding can be set to `True`, which pads each batch to the length of its longest sequence (so the padded length can vary between batches), or to `max_length`, which pads every batch to the configured maximum length.
With `padding=True`, the batch size also affects the throughput of the embedding model: larger batches are more likely to contain a long sequence and so get padded to longer lengths, while smaller batches produce a large number of batches of varying sizes.
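
To make this trade-off concrete, the sketch below counts how many token positions would be processed under each padding strategy. It is a stand-alone illustration, not fastRAG or tokenizer code: the sequence lengths are made up, and the helper names `make_batches` and `padded_token_count` are hypothetical.

```python
def make_batches(lengths, batch_size):
    """Split a list of sequence lengths into consecutive batches."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padded_token_count(lengths, batch_size, padding, max_seq_length=512):
    """Count token positions processed, including padding.

    padding=True         -> pad each batch to its own longest sequence
    padding="max_length" -> pad every batch to max_seq_length
    """
    total = 0
    for batch in make_batches(lengths, batch_size):
        # Sequences longer than max_seq_length are truncated first.
        truncated = [min(n, max_seq_length) for n in batch]
        width = max(truncated) if padding is True else max_seq_length
        total += width * len(truncated)
    return total


# Made-up token lengths for eight documents (illustrative only).
lengths = [12, 480, 35, 40, 18, 22, 300, 16]

dynamic_small = padded_token_count(lengths, batch_size=2, padding=True)
dynamic_large = padded_token_count(lengths, batch_size=8, padding=True)
fixed = padded_token_count(lengths, batch_size=8, padding="max_length")
print(dynamic_small, dynamic_large, fixed)  # 1684 3840 4096
```

In this toy case the small batches waste far less padding, but they also produce four differently shaped batches. Which setting is actually faster on a given machine has to be measured.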

Experimentation on your data is key to maximizing performance!
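
One simple way to run such an experiment is a small timing loop over candidate batch sizes. The sketch below is only an assumption about how you might structure it: `time_batches` is a hypothetical helper, and `fake_embed` is a placeholder that you would swap for a callable wrapping your own embedder.

```python
import time


def time_batches(embed_fn, texts, batch_size, repeats=3):
    """Return the best wall-clock time (seconds) to embed all texts in batches."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(0, len(texts), batch_size):
            embed_fn(texts[i:i + batch_size])
        best = min(best, time.perf_counter() - start)
    return best


def fake_embed(batch):
    """Hypothetical stand-in for a real embedder; replace with your own callable."""
    return [[float(len(text))] for text in batch]


texts = ["There is a blue house on Oxford street."] * 64
timings = {bs: time_batches(fake_embed, texts, bs) for bs in (1, 8, 32)}
for bs, seconds in timings.items():
    # Guard against a zero reading on timers with coarse resolution.
    print(f"batch_size={bs}: {len(texts) / max(seconds, 1e-9):.0f} texts/s")
```

Taking the best of several repeats reduces noise from warm-up and background load. Use texts drawn from your own corpus so that the length distribution is realistic.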

We load our quantized embedding model for both the query embedder and the document embedder:

In [11]:
query_embedder = IPEXSentenceTransformersTextEmbedder(model="Intel/bge-small-en-v1.5-rag-int8-static", batch_size=1, max_seq_length=512, padding=True)

In [12]:
doc_embedder = IPEXSentenceTransformersDocumentEmbedder(model="Intel/bge-small-en-v1.5-rag-int8-static", batch_size=32, max_seq_length=512, padding=True)

In [14]:
doc_embedder.warm_up()
query_embedder.warm_up()

We embed the documents and store them in a simple in-memory store:

In [15]:
docs_with_embeddings = doc_embedder.run(documents)["documents"]

Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.30it/s]


In [16]:
document_store = InMemoryDocumentStore()

In [17]:
document_store.write_documents(docs_with_embeddings)

3

Retrieving is done using a wrapper class called `InMemoryEmbeddingRetriever`:

In [19]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

In [20]:
retriever = InMemoryEmbeddingRetriever(document_store)

We embed the query and retrieve:

In [21]:
query_vec = query_embedder.run("What is Paris?")

Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.67it/s]


In [22]:
print(retriever.run(query_vec['embedding'], top_k=1)['documents'])

[Document(id=2, content: 'Paris is the capital of France.', score: 57.35980398339408)]
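
Conceptually, embedding retrieval scores every stored vector against the query vector and keeps the best `top_k`. The following is a minimal sketch of that idea, assuming dot-product similarity. The 3-dimensional vectors are invented for illustration (the model above produces vectors of size 384), and `retrieve` is a hypothetical helper, not the Haystack implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_embedding, doc_embeddings, top_k=1):
    """Return (doc_id, score) pairs for the top_k highest dot-product scores."""
    scored = [(doc_id, dot(query_embedding, emb)) for doc_id, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors keyed by document id (illustrative values only).
doc_embeddings = {
    "1": [0.9, 0.1, 0.0],
    "2": [0.1, 0.9, 0.2],
    "3": [0.0, 0.2, 0.9],
}
query_embedding = [0.2, 1.0, 0.1]

top = retrieve(query_embedding, doc_embeddings, top_k=1)
print(top)
```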


---

## Optimized Re-ranker and Running a Pipeline

We can add an optimized ranker to re-order the documents coming from the retriever.
Note that this component has no dependencies on the previous retrieval steps. It takes the document content and the query, encodes both into vectors, and re-orders the documents by their similarity to the query in descending order.
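
The re-ranking step can be sketched in a few lines. This is only an illustration of the idea and not the `IPEXBiEncoderSimilarityRanker` implementation: `toy_embed` is a hypothetical stand-in for the bi-encoder (normalized word counts over a small vocabulary), and `rerank` is an assumed helper name.

```python
import math
import re


def toy_embed(text, vocabulary):
    """Hypothetical stand-in for a bi-encoder: L2-normalized word counts."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = [float(words.count(term)) for term in vocabulary]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def rerank(query, contents, embed):
    """Encode the query and every document, then sort by similarity, highest first."""
    query_vec = embed(query)
    scored = []
    for content in contents:
        doc_vec = embed(content)
        score = sum(q * d for q, d in zip(query_vec, doc_vec))
        scored.append((content, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


contents = [
    "There is a blue house on Oxford street.",
    "Paris is the capital of France.",
    "fastRAG had its first commit in 2022.",
]
vocabulary = sorted({w for c in contents for w in re.findall(r"[a-z0-9]+", c.lower())})
ranked = rerank("What is Paris?", contents, lambda text: toy_embed(text, vocabulary))
for content, score in ranked:
    print(f"{score:.3f}  {content}")
```

Because the ranker works from raw text, it never reads the embeddings or scores produced by the retriever; it only needs the query string and the document contents.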

We build a pipeline to automate the previous sections where we had to manually embed queries before doing retrieval:

In [23]:
from fastrag.rankers import IPEXBiEncoderSimilarityRanker

In [24]:
ranker = IPEXBiEncoderSimilarityRanker("Intel/bge-small-en-v1.5-rag-int8-static")

We combine all the components into a pipeline:

In [25]:
from haystack import Pipeline

In [26]:
pipe = Pipeline()

In [27]:
pipe.add_component("retriever", retriever)
pipe.add_component("embedder", query_embedder)
pipe.add_component("ranker", ranker)

In [28]:
pipe.connect("embedder", "retriever")
pipe.connect("retriever", "ranker.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f56f4c578d0>
🚅 Components
  - retriever: InMemoryEmbeddingRetriever
  - embedder: IPEXSentenceTransformersTextEmbedder
  - ranker: IPEXBiEncoderSimilarityRanker
🛤️ Connections
  - retriever.documents -> ranker.documents (List[Document])
  - embedder.embedding -> retriever.query_embedding (List[float])

In [29]:
query = "What is Paris?"

In [30]:
result = pipe.run(
    {
        "embedder": {"text": query},
        "ranker": {"query": query},
    }
)

Loading IPEX ST Transformer model


Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.67it/s]
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.89it/s]
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.99it/s]


In [31]:
print(result['ranker']['documents'])

[Document(id=2, content: 'Paris is the capital of France.', score: 57.35980398339408, embedding: vector of size 384), Document(id=1, content: 'There is a blue house on Oxford street.', score: 29.665641486886365, embedding: vector of size 384), Document(id=3, content: 'fastRAG had its first commit in 2022.', score: 21.54239634529506, embedding: vector of size 384)]
