# Build RAG with Milvus + pii-masker

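In this tutorial, we show how to build a RAG (Retrieval-Augmented Generation) pipeline with Milvus and pii-masker.
The documents are masked first, so that personally identifiable information (PII) is replaced with placeholders before it reaches the vector database or the LLM.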

## Preparation

### Dependencies and Environment

In [None]:
! pip install --upgrade pymilvus openai requests tqdm datasets pandas

Follow the pii-masker README.md to install and set up PIIMasker.

We use OpenAI as the LLM in this example. Set your API key in the OPENAI_API_KEY environment variable.

In [1]:
import os

os.environ["OPENAI_API_KEY"] = "sk-***********"
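Hard-coding a key in a notebook makes it easy to leak. As a minimal sketch, assuming the key has already been exported in the shell, the hypothetical helper `require_env` (not part of any library used here) reads the variable and fails early with a clear message when it is missing:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the notebook surfaces a missing key before any API request is made.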




### Prepare the data

We use a PII dataset from Hugging Face as an example.

In [4]:
from datasets import load_dataset
import pandas as pd

# Load the test split of the PII dataset from Hugging Face
dataset = load_dataset("bigcode/bigcode-pii-dataset", split="test")

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(dataset)

# Randomly select 7 rows from the dataset
random_rows = df.sample(n=7, random_state=42)

# Save the randomly selected rows as a JSON Lines file (one JSON object per line)
random_rows.to_json("pii_test_sample.json", orient="records", lines=True)




In [5]:
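import json

text_lines = []

# The sample file is in JSON Lines format (one JSON object per line),
# so parse it line by line; json.load on the whole file would fail
with open("pii_test_sample.json", "r", encoding="utf-8") as file:
    for line in file:
        if line.strip():
            # Keep only the document text
            text_lines.append(json.loads(line)["full_text"])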




### Mask the data with PIIMasker

In [6]:
from pii_masker import PIIMasker

masker = PIIMasker()




In [7]:
masked_results = []
for full_text in text_lines:
    masked_text, _ = masker.mask_pii(full_text)
    masked_results.append(masked_text)
print(masked_results[0])

John Doe, a resident of Almaty, Kazakhstan, visited Central Park on March 1, 2023. He entered the park at 10:30 AM using his digital passport, number [ID_NUM]. John spent the day hiking, birdwatching, and had a picnic at the lake with his family. They shared a lunch consisting of sandwiches, chips, and a bottle of water. Afterward, John and his family took a stroll along the park trail. During their walk, they encountered another family, the Smiths, from New York City. The Smiths were also on vacation and had rented a cabin in the park for a week. They exchanged contact details: the Smiths provided their home address, [STREET_ADDRESS], and their cell phone number, [PHONE_NUM]. John's family also shared their contact details: home address, [STREET_ADDRESS], and cell phone number, [PHONE_NUM].
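The output above shows detected spans replaced by typed placeholders such as `[PHONE_NUM]`. To make that transformation concrete, here is a toy sketch based on regular expressions. It is not how pii-masker detects PII; the patterns, the `[EMAIL]` placeholder, the `toy_mask` helper, and the shape of its second return value are all assumptions made for illustration.

```python
import re

# Toy patterns for illustration only; real PII detection needs far more than regexes.
TOY_PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "[PHONE_NUM]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def toy_mask(text: str):
    """Replace every pattern match with its placeholder and report what was replaced."""
    found = {}
    for placeholder, pattern in TOY_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[placeholder] = matches
            text = pattern.sub(placeholder, text)
    return text, found


masked, found = toy_mask("Call +1 (555) 010-9999 or write to jane@example.com.")
print(masked)  # Call [PHONE_NUM] or write to [EMAIL].
```

Regexes like these miss names, addresses, and most free-form identifiers, which is why the tutorial relies on PIIMasker rather than hand-written rules.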


### Prepare the Embedding Model

We initialize the OpenAI client to prepare the embedding model.

In [8]:
from openai import OpenAI

openai_client = OpenAI()




Define a function to generate text embeddings using OpenAI client. We use the text-embedding-3-small model as an example.

In [9]:
def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )

In [10]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])

1536
[0.00988506618887186, -0.005540902726352215, 0.0068014683201909065, -0.03810417652130127, -0.018254263326525688, -0.041231658309698105, -0.007651153020560741, 0.03220026567578316, 0.01892443746328354, 0.00010708322952268645]


## Load data into Milvus

Create the collection. This step follows the Milvus guide: https://milvus.io/docs/build-rag-with-milvus.md

In [11]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")

collection_name = "my_rag_collection"

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)




In [19]:
from tqdm import tqdm

data = []

# Embed and store the masked text, not the raw text, so that no PII enters Milvus
for i, line in enumerate(tqdm(masked_results, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

milvus_client.insert(collection_name=collection_name, data=data)

Creating embeddings: 100%|██████████| 7/7 [00:27<00:00,  2.67it/s] 
{'insert_count': 7, 'ids': [0, 1, 2, 3, 4, 5, 6], 'cost': 0}
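The loop above sends one embedding request per document. The OpenAI embeddings endpoint also accepts a list of inputs, so larger collections can be embedded in batches. The following is a sketch under stated assumptions: the `emb_texts` helper and the `batch_size` default are hypothetical, and the client is passed in explicitly so the function does not depend on notebook state.

```python
def emb_texts(client, texts, model="text-embedding-3-small", batch_size=64):
    """Embed texts in batches, returning one vector per text in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of the input it belongs to,
        # so sort by it to keep vectors aligned with the input texts.
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

In this notebook it could be called as `emb_texts(openai_client, masked_results)`, and the resulting vectors zipped with the masked texts to build the rows to insert.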


## Build RAG

Retrieve data for a query

In [20]:
question = "What is Al-Fatah Mosque in Mogadishu? What are the donors' names and addresses?"

Search for the question in the collection and retrieve the semantic top-1 match.

In [23]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[
        emb_text(question)
    ],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=1,  # Return only the top result
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)
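Conceptually, a search with the `IP` metric scores each stored vector by its inner product with the query vector and keeps the highest scores. The brute-force sketch below illustrates that idea in plain Python. It is not how Milvus executes the search internally, and the helper names are hypothetical.

```python
def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query, vectors, k=1):
    """Return (index, score) pairs for the k vectors with the highest inner product."""
    scored = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
best = top_k_by_inner_product([0.0, 1.0], stored, k=1)
print(best)  # [(2, 1.0)]
```

When all vectors have unit length, the inner product equals cosine similarity, so a higher score means a closer semantic match.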




Let's take a look at the search results for the query.

In [28]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))


[
    [
        "Sheikh Ahmed Omar Abdi, the imam of the Al-Fatah Mosque in Mogadishu, has shared his sermon on the importance of donating to charity during Ramadan. He emphasized the need to protect the privacy of donors and assured them that their personal information, such as names, addresses, and phone numbers, would be kept confidential. The following donors were recognized during the sermon: 1. [NAME], [STREET_ADDRESS], phone number [PHONE_NUM] 2. [NAME], [STREET_ADDRESS], phone number [PHONE_NUM]",
        0.6721526807489667
    ]
]
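The retrieved passage contains placeholders where the donors' details used to be. Before passing retrieved text to the LLM, it can be useful to confirm that placeholders are present. The sketch below counts them, assuming placeholders always look like the bracketed upper-case tokens seen in the outputs above; the pattern and the `count_placeholders` helper are illustrative, not part of pii-masker.

```python
import re
from collections import Counter

# Assumed placeholder shape: upper-case letters and underscores inside square brackets.
PLACEHOLDER = re.compile(r"\[[A-Z][A-Z_]*\]")


def count_placeholders(text: str) -> Counter:
    """Count each distinct placeholder token found in the text."""
    return Counter(PLACEHOLDER.findall(text))


sample = (
    "1. [NAME], [STREET_ADDRESS], phone number [PHONE_NUM] "
    "2. [NAME], [STREET_ADDRESS], phone number [PHONE_NUM]"
)
counts = count_placeholders(sample)
print(dict(counts))  # {'[NAME]': 2, '[STREET_ADDRESS]': 2, '[PHONE_NUM]': 2}
```

A count of zero for a passage that is expected to contain PII is a hint that unmasked text may have been stored by mistake.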



### Use LLM to get a RAG response

This step follows the Milvus guide: https://milvus.io/docs/build-rag-with-milvus.md

In [31]:
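# Join the retrieved passages into a single context string for the prompt
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)

SYSTEM_PROMPT = """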
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""




Use OpenAI ChatGPT to generate a response based on the prompts.

In [32]:
response = openai_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)


Based on the context provided, Al-Fatah Mosque is a mosque in Mogadishu where Sheikh Ahmed Omar Abdi serves as the imam.
Regarding the donors' information - I notice that while donors were recognized, their specific personal details are represented with placeholders 
[NAME], [STREET_ADDRESS], and [PHONE_NUM] in the text. 
This appears to align with the imam's emphasis on protecting donor privacy and keeping their personal information confidential. 
Therefore, I cannot and should not share the actual names and addresses of the donors, as this information is meant to remain private.

