<a href="https://colab.research.google.com/github/422171/transformers/blob/main/notebooks/en/rag_with_hf_and_milvus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build RAG with Hugging Face and Milvus

_Authored by: [Chen Zhang](https://github.com/zc277584121)_


[Milvus](https://milvus.io/) is a popular open-source vector database that powers AI applications with highly performant and scalable vector similarity search. In this tutorial, we will show you how to build a RAG (Retrieval-Augmented Generation) pipeline with Hugging Face and Milvus.

The RAG system combines a retrieval system with an LLM. The system first retrieves relevant documents from a corpus using Milvus vector database, then uses an LLM hosted in Hugging Face to generate answers based on the retrieved documents.

## Preparation
### Dependencies and Environment

In [1]:
! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm

Collecting pymilvus
  Downloading pymilvus-2.5.6-py3-none-any.whl.metadata (5.7 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.0.2-py3-none-any.whl.metadata (13 kB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting langchain_community
  Downloading langchain_community-0.3.21-py3-none-any.whl.metadata (2.4 kB)
Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Collecting grpcio<=1.67.1,>=1.49.1 (from pymilvus)
  Downloading grpcio-1.67.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting python-dotenv<2.0.0,>=1.0.1 (from pymilvus)
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Collecting ujson>=2.0.0 (from pymilvus)
  Downloading ujson-5.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Collecting milvus-lite>=2.4.0 (from pymilvus)
  Downloading milvus_lite-2.4.12-py3-none-manylinux2014

> If you are using Google Colab, to enable the dependencies, you may need to **restart the runtime** (click on the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

In addition, we recommend that you configure your [Hugging Face User Access Token](https://huggingface.co/docs/hub/security-tokens), and set it in your environment variables because we will use a LLM from the Hugging Face Hub. You may get a low limit of requests if you don't set the token environment variable.

In [1]:
import os

import getpass

# enter API key
os.environ["HF_TOKEN"] = HF_API_KEY = getpass.getpass()

··········


### Prepare the data

We use the [AI Act PDF](https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf), a regulatory framework for AI with different risk levels corresponding to more or less regulation, as the private knowledge in our RAG.

In [2]:
# %%bash

# if [ ! -f "The-AI-Act.pdf" ]; then
#     wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
# fi

In [32]:
%%bash
# https://drive.google.com/file/d/1oiJk3wH0TVJvnPfG9nsZT7uFt0f2jJ_s/view?usp=sharing
if [ ! -f "Powerless-Book.pdf" ]; then
    wget -q -O Powerless-Book.pdf "https://drive.google.com/uc?export=download&id=1oiJk3wH0TVJvnPfG9nsZT7uFt0f2jJ_s"
fi

We use the [`PyPDFLoader`](https://python.langchain.com/v0.1/docs/modules/data_connection/document_loaders/pdf/) from LangChain to extract the text from the PDF, and then split the text into smaller chunks. By default, we set the chunk size as 1000 and the overlap as 200, which means each chunk will nearly have 1000 characters and the overlap between two chunks will be 200 characters.

In [33]:
from langchain_community.document_loaders import PyPDFLoader

# loader = PyPDFLoader("The-AI-Act.pdf")
loader = PyPDFLoader("Powerless-Book.pdf")
docs = loader.load()
print(len(docs))

496


In [34]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)

In [35]:
text_lines = [chunk.page_content for chunk in chunks]

### Prepare the Embedding Model
Define a function to generate text embeddings. We use [BGE embedding model](https://huggingface.co/BAAI/bge-small-en-v1.5) as an example, but you can use any embedding models, such as those found on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard).

In [39]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]

Generate a test embedding and print its dimension and first few elements.

In [40]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

384
[-0.07660680264234543, 0.02531672641634941, 0.012505538761615753, 0.004595162346959114, 0.02577998675405979, 0.038167111575603485, 0.08050814270973206, 0.0030353872571140528, 0.024392176419496536, 0.004880355205386877]


## Load data into Milvus

### Create the Collection

In [44]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./powerless.db")

collection_name = "rag_collection"

> As for the argument of `MilvusClient`:
> - Setting the `uri` as a local file, e.g.`./hf_milvus_demo.db`, is the most convenient method, as it automatically utilizes [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.
> - If you have a large amount of data, say more than a million vectors, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, please use the server uri, e.g.`http://localhost:19530`, as your `uri`.
> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and Api key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#cluster-details) in Zilliz Cloud.


Check if the collection already exists and drop it if it does.

In [45]:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

Create a new collection with specified parameters.

If we don't specify any field information, Milvus will automatically create a default `id` field for primary key, and a `vector` field to store the vector data. A reserved JSON field is used to store non-schema-defined fields and their values.

In [46]:
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

### Insert data
Iterate through the text lines, create embeddings, and then insert the data into Milvus.

Here is a new field `text`, which is a non-defined field in the collection schema. It will be automatically added to the reserved JSON dynamic field, which can be treated as a normal field at a high level.

In [47]:
from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]

Creating embeddings: 100%|██████████| 1276/1276 [00:15<00:00, 80.03it/s]


1276

## Build RAG

### Retrieve data for a query

Let's specify a question to ask about the corpus.

In [56]:
question = "How does Kai feel about Paedyn and his brother dancing?"

Search for the question in the collection and retrieve the top 3 semantic matches.

In [57]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

Let's take a look at the search results of the query


In [58]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        "him as he says, \u201cDance with me, will you? Please?\u201d\nPaedyn hesitates for only a moment before nodding. And then I\u2019m\nstaring after them as they stride onto the dance floor where several other\ncouples have begun spinning in time to the music.\nBlair is suddenly saying something to me, dragging me to my feet\nbefore dragging me onto the dance floor. I don\u2019t remember when we started\ndancing. Suddenly, she\u2019s in my arms, and we are spinning across the marble\nfloor. The feel of her is foreign to me after the nights spent with Paedyn in\nmy arms. Nights that I still haven\u2019t told Kitt about.\nI was doing him a favor.  \nMy eyes wander across the dance floor, landing on my brother and the\ngirl in his arms. I\u2019m not wearing green, but I feel it, nonetheless. Envy\nclaws at me as I watch them step in time to the very waltz I led Paedyn\nthrough only last night. She looks elegant, enticing, entrancing.\nWhat the hell is wrong with me?\nI turn 

### Use LLM to get an RAG response

Before composing the prompt for LLM, let's first flatten the retrieved document list into a plain string.

In [59]:
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

Define prompts for the Language Model. This prompt is assembled with the retrieved documents from Milvus.

In [60]:
PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

We use the [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) hosted on Hugging Face inference server to generate a response based on the prompt.

In [61]:
from huggingface_hub import InferenceClient

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(model=repo_id, timeout=120)

Finally, we can format the prompt and generate the answer.

In [62]:
prompt = PROMPT.format(context=context, question=question)

In [63]:
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000,
).strip()
print(answer)

Kai feels a sense of duty and devotion towards his brother, and he wants to ensure that his brother looks his best at the ball. He is willing to go to great lengths to fulfill this duty, including ensuring that Paedyn and his brother have a successful dance together.


Congratulations! You have built an RAG pipeline with Hugging Face and Milvus.

In [64]:
question = "How does Kai feel about Paedyn and his brother dancing? Does he feel jealous?"

In [65]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

In [66]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        "Chapter Fifty-One\nPaedyn\nI\u2019 M  AVOIDING  HIM . N OT  THE  BEST  WAY  TO  DEAL  WITH  A  PROBLEM , I\u2019 LL\nadmit. But Kai is a very pressing problem. A very desirable distraction.\nSo, I keep myself busy, though I still manage to notice that he is doing\nthe same. Girl after gorgeous girl finds their way into his arms and onto the\ndance floor, all of them wearing glowing smiles and green dresses.\nI bury the emotion I don\u2019t want to identify as jealousy, though it claws\nat me nonetheless.\nI have a job to do.\nI turn my attention back to my partner for the dozenth time. Kitt smiles,\ncontinuing our easy conversation that my mind keeps wanting to wander\nfrom. I force myself to focus on his words rather than the thing I need to\nsteal from him. We spin, and I catch a glimpse of the keyring against the\ninside of his suit coat pocket. My fingers twitch, itching to tap into the\nthieving instincts I\u2019ve suppressed while in the castle\u2014for the most 

In [67]:
prompt = PROMPT.format(context=context, question=question)

In [68]:
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000,
).strip()
print(answer)

Kai does not feel jealous about Paedyn and his brother dancing. Instead, he feels a sense of duty to ensure that his brother looks his best at the ball. He is willing to go to great lengths to fulfill this duty, including dancing with the protagonist himself to distract her from his brother. This is evident when he states, "You're attending this ball with my brother, and he needs to look the best he is able." Additionally, when the protagonist asks him why he is doing this, Kai responds by saying that it is simple and that his brother needs to look his best. Therefore, Kai's actions and words suggest that he is focused on supporting his brother rather than feeling jealous or competitive.


In [69]:
question = "What does Kai say to Paedyn in the rain?"
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))

[
    [
        "adorable as you looked blinking up at me in the rain, I want you to see me\nclearly when I tell you this.\u201d\nThere goes that stupid flutter in my chest.\n\u201cI meant what I said. I can\u2019t take my eyes off you. I can\u2019t take my mind\noff you.\u201d\nI look away from his burning gaze, shaking my head as I mutter, \u201cKai, I\n\u2014\u201d\n\u201cPaedyn.\u201d\nI still. I shiver. He says my name like it\u2019s sacred, like it\u2019s an oath he\u2019s\nswearing.\nHe tilts his head to the side, eyes roaming over my face. \u201cTell me,\u201d he\nmurmurs, \u201cwhat do you want me to call you?\u201d\nMy eyes slowly meet his, confused by his question. \u201cWhat do you want\nto call me?\u201d\n\u201cI want to call you mine.\u201d\nWe stare at each other. Both of us breathing hard, both of us taking in\nthe other. The rain is still splattering Kai, clinging to his thick lashes and\ndripping from his jaw.\n\u201cI know you feel it too,\u201d he says quietly.\n\u2

In [70]:
prompt = PROMPT.format(context=context, question=question)
answer = llm_client.text_generation(
    prompt,
    max_new_tokens=1000,
).strip()
print(answer)

Kai does not say anything to Paedyn in the rain in the provided context.
