# Privacera AI Governance - Milvus Vector Database Filter

This notebook shows how to use the Privacera Shield library with a LangChain application that uses the Milvus vector database. To run this notebook, you will need the following:


1.  Sign up for a free account at [Privacera AI Governance (PAIG)](https://privacera.ai). Sign-up is simple; all you need is your email address.
2.  Your OpenAI API key. This will allow you to create your first OpenAI application governed by Privacera AI Governance.

# 0. Reset a system library and restart the environment

In [None]:
!pip install packaging

import subprocess
from packaging import version
import grpc

# Get the current version of grpcio
current_version = grpc.__version__
print(f"Current grpcio version: {current_version}")

# Define the version to compare against
target_version = version.parse("1.63")

# Compare the versions
if version.parse(current_version) > target_version:
    # Uninstall grpcio if the version is greater than 1.63
    subprocess.check_call(["pip", "uninstall", "-y", "grpcio"])
    print("grpcio has been uninstalled.")
    # We need to restart the runtime
    # Ignore the warning at the bottom that says the runtime crashed
    exit()
else:
    print("grpcio version is not greater than 1.63. No action needed.")
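
The cell above relies on `packaging.version` because comparing version strings as plain text gives misleading results. The sketch below illustrates the difference with a simplified, hypothetical stand-in (`parse_numeric_version`) that only handles dotted numeric versions; it is not a replacement for `packaging`.

```python
# Simplified stand-in for packaging.version.parse (an assumption for
# illustration): handles only dotted numeric versions such as "1.63".
def parse_numeric_version(text: str) -> tuple:
    return tuple(int(part) for part in text.split("."))


def is_newer(current: str, target: str) -> bool:
    """Return True when `current` is strictly newer than `target`."""
    cur = parse_numeric_version(current)
    tgt = parse_numeric_version(target)
    # Pad the shorter tuple with zeros so "1.63" and "1.63.0" compare equal
    width = max(len(cur), len(tgt))
    cur += (0,) * (width - len(cur))
    tgt += (0,) * (width - len(tgt))
    return cur > tgt


# Text comparison treats "1.9" as greater than "1.63" because '9' > '6'
print("string comparison:", "1.9" > "1.63")
# Numeric comparison sees 9 < 63, so 1.9 is the older release
print("numeric comparison:", is_newer("1.9", "1.63"))
print("1.64.1 newer than 1.63:", is_newer("1.64.1", "1.63"))
```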

# 1. Install the Python packages
This will take several seconds, up to a minute.

In [None]:
!pip -q install  \
  milvus \
  pymilvus \
  langchain==0.2.0 \
  langchain-core==0.2.0 \
  langchain-community==0.2.0 \
  langchain-openai==0.1.7 \
  langchain-text-splitters==0.2.0 \
  privacera_shield==1.1.9
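
After the install finishes, it can be useful to confirm which versions actually ended up in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and the package list simply mirrors a few of the names from the install command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of a few packages from the install step
for name in ["pymilvus", "langchain", "langchain-openai", "privacera_shield"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```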


# 2. Start Milvus Vector Database
This step takes less than a minute. A few connection errors may appear while Milvus starts, but the cell should eventually print 'Connected to Milvus'.

In [None]:
get_ipython().system_raw('milvus-server &')
!while ! (ps aux | grep -q '[m]ilvus' && ps aux | grep -q '[m]ilvus-server'); do sleep 1; done; echo 'Milvus is ready'

# Replace with your actual Milvus server parameters if different
MILVUS_HOST = "127.0.0.1"
MILVUS_PORT = "19530"

import time
from pymilvus import connections

# Keep retrying until the Milvus server accepts connections
while True:
    try:
        connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
        print("Connected to Milvus")
        break
    except Exception as e:
        print(f"Connection failed: {e}")
        time.sleep(1)
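
The loop above retries forever, which is convenient in a notebook but can hang if the server never starts. Below is a sketch of a bounded variant. The helper `wait_until_ready` and the `flaky_connect` stand-in are hypothetical and use no Milvus code, so the retry logic can be tried on its own.

```python
import time
from typing import Callable


def wait_until_ready(connect: Callable[[], None],
                     max_attempts: int = 60,
                     delay_seconds: float = 1.0) -> int:
    """Call `connect` until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            connect()
            return attempt
        except Exception as error:
            print(f"Attempt {attempt} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise TimeoutError(f"Service not ready after {max_attempts} attempts")


# Hypothetical stand-in for a real connect call: fails twice, then succeeds
_calls = {"count": 0}


def flaky_connect() -> None:
    _calls["count"] += 1
    if _calls["count"] < 3:
        raise ConnectionError("server still starting")


attempts_used = wait_until_ready(flaky_connect, max_attempts=5, delay_seconds=0)
print(f"Connected after {attempts_used} attempts")
```

In this notebook, the callable could be something like `lambda: connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)`.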

# 3. Create a Sample Collection in Milvus Vector Database

In this step, we create a sample collection in the Milvus vector database with the
following schema:
- source - name of the document file
- text - content of the document
- pk - primary key
- vector - embedding vector of the content
- users - list of users that have access to this document
- groups - list of groups that have access to this document
- metadata - additional metadata associated with this document

In [None]:
from pymilvus import CollectionSchema, FieldSchema, DataType

COLLECTION_NAME = "PrivaceraSampleCollection"

def create_collection():
    source = FieldSchema(
        name="source",
        dtype=DataType.VARCHAR,
        max_length=65535
    )
    text = FieldSchema(
        name="text",
        dtype=DataType.VARCHAR,
        max_length=65535
    )
    pk = FieldSchema(
        name="pk",
        dtype=DataType.INT64,
        is_primary=True,
        auto_id=True
    )
    vector = FieldSchema(
        name="vector",
        dtype=DataType.FLOAT_VECTOR,
        dim=1536
    )
    users = FieldSchema(
        name="users",
        dtype=DataType.ARRAY,
        element_type=DataType.VARCHAR,
        max_length=65535,
        max_capacity=1024
    )
    groups = FieldSchema(
        name="groups",
        dtype=DataType.ARRAY,
        element_type=DataType.VARCHAR,
        max_length=65535,
        max_capacity=1024
    )
    metadata = FieldSchema(
        name="metadata",
        dtype=DataType.JSON
    )

    schema = CollectionSchema(
        fields=[source, text, pk, vector, users, groups, metadata],
        description="Sample Privacera Milvus Collection",
        enable_dynamic_field=True
    )

    from pymilvus import connections, Collection

    # Make sure the default connection is established
    connections.connect(
        alias="default",
        host=MILVUS_HOST,
        port=MILVUS_PORT
    )

    # Create the collection with the schema defined above
    collection = Collection(
        name=COLLECTION_NAME,
        schema=schema,
        using='default'
    )

    index_params = {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {
            "M": 10,
            "efConstruction": 8
        }
    }

    collection.create_index(
        field_name="vector",
        index_params=index_params,
        index_name="index"
    )
    print(f"Collection = {COLLECTION_NAME} created")

create_collection()
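
To make the schema more concrete, the sketch below shows the shape of a single record that would fit it, along with a small consistency check. `validate_entity` is a hypothetical helper written for illustration, not part of pymilvus. The `pk` field is left out because the schema generates it automatically (`auto_id=True`).

```python
# Field names and expected Python types, mirroring the schema above
EXPECTED_FIELDS = {
    "source": str,
    "text": str,
    "vector": list,
    "users": list,
    "groups": list,
    "metadata": dict,
}
VECTOR_DIM = 1536  # matches dim=1536 on the "vector" field


def validate_entity(entity: dict) -> list:
    """Return a list of problems found in the entity (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in entity:
            problems.append(f"missing field: {name}")
        elif not isinstance(entity[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    vector = entity.get("vector")
    if isinstance(vector, list) and len(vector) != VECTOR_DIM:
        problems.append(f"vector should have {VECTOR_DIM} values")
    return problems


# Example record; the zero vector is a placeholder, not a real embedding
sample_entity = {
    "source": "raw_data/x10.txt",
    "text": "Product Specification Sheet of x10",
    "vector": [0.0] * VECTOR_DIM,
    "users": ["sally", "peter", "emily", "mark"],
    "groups": [],
    "metadata": {"file_name": "x10.txt"},
}
print(validate_entity(sample_entity))
```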

# 4. Create sample documents in a folder

This step creates some sample documents in a folder named raw_data.

- x10.txt - Contains the specification of the existing product.
- x11.txt - Contains the specification of the product that is under development. This is highly classified data.
- x10-salesdata.txt - Sales numbers for the product X10. Only the Sales team has access to it.
- customer-feedback.txt - Customer feedback that contains PII data. Only a few people can see the PII data.

In [None]:
import os
import warnings
warnings.filterwarnings('ignore')

def create_raw_data():
    raw_data_dir = "raw_data"

    file_contents = {
        "x10.txt": """
Product Specification Sheet of x10
Display: Size and resolution - 6.5" AMOLED, 120Hz refresh rate
Processor: Model name  Snapdragon 8 Gen 1
RAM: Options 8GB/12GB
Storage: Options 128GB/256GB
Camera: rear camera system with multiple lenses, front-facing camera
Battery: Capacity 5000mAh
Operating System: Version Android 13
Key Features: long battery life, fast performance, high-quality camera
        """
        , "x11.txt": """
Product Specification Sheet of x11
Display: Size and resolution - 7.5" AMOLED, 360Hz refresh rate
Processor: Model name  Snapdragon 10 Gen 3
RAM: Options 16GB/24GB
Storage: Options 256GB/512GB
Camera: 360 camera system with multiple lenses, front-facing camera
Battery: Capacity 10000mAh
Operating System: Version Android 13
Key Features: super long battery life, ultra fast performance, 360 camera
        """
        , "x10-salesdata.txt": """
Sales Data for X10 Model:
Monthly Sales Report (Internal)
Region	Units Sold	Revenue
North America	20,000	$10,000,000
Europe	15,000	$7,500,000
Asia Pacific	10,000	$5,000,000
Total	45,000	$22,500,000
    """
        , "customer-feedback.txt": """
Customer Feedback Analysis - X10 Model

Positive Feedback for X10 Model:

"The X10's battery life is amazing! I can finally ditch the portable charger."

Sarah Jones, Busy Professional
Email: sarah.jones@samplemail.com
Phone: (123) 456-7890
"The camera takes crystal-clear pictures, even in low-light conditions. Perfect for capturing memories on the go!"

David Lee, Travel Blogger
Email: david.lee@travelblogger.com
Phone: (234) 567-8901
"The phone's design is sleek and feels luxurious in hand. The user interface is user-friendly and easy to navigate, even for non-tech-savvy users like me."

Emily Garcia, Teacher
Email: emily.garcia@schoolmail.com
Phone: (345) 678-9012

Areas for Improvement for X10 Model:

"The phone is a bit bulky for one-handed use. It can be challenging to reach the top of the screen comfortably."

Michael Chen, Gamer
Email: michael.chen@gamermail.com
Phone: (456) 789-0123
"I've encountered a few minor software bugs that require restarting the phone. Hopefully, future updates will address these."

Olivia Rodriguez, Social Media Manager
Email: olivia.rodriguez@socialhub.com
Phone: (567) 890-1234
"The current storage options are a bit limiting for someone who stores a lot of photos and videos. A higher storage tier or microSD card support would be ideal."

William Smith, Content Creator
Email: william.smith@creatorhub.com
Phone: (678) 901-2345

Feature Requests for X10 Model:

"Wireless charging would be a fantastic addition for convenience. No more fumbling with cables!" (Multiple Users)
"A built-in fingerprint sensor would be a welcome security feature for added peace of mind." (Several Users)
"The ability to expand storage with a microSD card would be incredibly helpful for users who need more space." (Content Creators & Photographers)
"""
    }

    os.makedirs(raw_data_dir, exist_ok=True)

    for file_path, content in file_contents.items():
        file_path_with_dir = raw_data_dir + "/" + file_path
        with open(file_path_with_dir, 'w') as file:
            file.write(content)

    print("Raw data created successfully.")


create_raw_data()

# 5. Associate metadata with the documents
Here, we create a custom loader class that adds metadata to each *document* in the collection. Each document has a list of users who are allowed to access it, a list of groups that are allowed to access it, and additional metadata (in this notebook, the file name and an optional security level).

We will use the users, groups, and metadata attributes to filter the documents based on the user querying the vector database.

In [None]:
import os

from typing import Optional, Iterator
from langchain_community.document_loaders import TextLoader
from langchain.schema import Document

# Define raw data metadata information
file_metadata = {
    "x10.txt": {
        "users": ["sally", "peter", "emily", "mark"],
        "groups": [],
        "metadata": {"file_name": "x10.txt"}
    },
    "x11.txt": {
        "users": ["mark", "peter"],
        "groups": [],
        "metadata": {"SECURITY_LEVEL": "CONFIDENTIAL", "file_name": "x11.txt"}
    },
    "x10-salesdata.txt": {
        "users": ["sally"],
        "groups": ["Sales"],
        "metadata": {"file_name": "x10-salesdata.txt"}
    },
    "customer-feedback.txt": {
        "users": ["emily", "sally", "peter", "mark"],
        "groups": ["Sales"],
        "metadata": {"file_name": "customer-feedback.txt"}
    }
}

class PrivaceraTextLoader(TextLoader):
    def __init__(self, file_path: str, encoding: Optional[str] = None, autodetect_encoding: bool = False):
        super().__init__(file_path, encoding, autodetect_encoding)
        print(f"inside PrivaceraTextLoader init, file_path={file_path}")

    def lazy_load(self) -> Iterator[Document]:
        documents = super().lazy_load()

        for doc in documents:
            file_name = os.path.basename(self.file_path)
            print(f"lazy_load: file_name={file_name}")
            metadata = file_metadata.get(file_name)
            if metadata:
                # Copy the access-control attributes onto the document
                doc.metadata["users"] = metadata["users"]
                doc.metadata["groups"] = metadata["groups"]
                doc.metadata["metadata"] = metadata["metadata"]

            yield doc

print("PrivaceraTextLoader is ready")
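
The core of the loader is a lookup keyed by the file's base name. The sketch below isolates that step without LangChain so it can be run on its own; `attach_access_metadata` and `sample_file_metadata` are hypothetical names, and the table is a shortened copy of the one above.

```python
import os

# Shortened copy of the metadata table, for illustration only
sample_file_metadata = {
    "x11.txt": {
        "users": ["mark", "peter"],
        "groups": [],
        "metadata": {"SECURITY_LEVEL": "CONFIDENTIAL", "file_name": "x11.txt"},
    },
}


def attach_access_metadata(file_path: str, doc_metadata: dict, table: dict) -> dict:
    """Return a copy of doc_metadata enriched with the entry for this file."""
    enriched = dict(doc_metadata)
    entry = table.get(os.path.basename(file_path))
    if entry:
        enriched["users"] = list(entry["users"])
        enriched["groups"] = list(entry["groups"])
        enriched["metadata"] = dict(entry["metadata"])
    return enriched


known = attach_access_metadata(
    "raw_data/x11.txt", {"source": "raw_data/x11.txt"}, sample_file_metadata
)
unknown = attach_access_metadata(
    "raw_data/notes.txt", {"source": "raw_data/notes.txt"}, sample_file_metadata
)
print(known)
print(unknown)  # files missing from the table keep their original metadata
```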


# 6. Set your OpenAI API key in the environment
Enter your OpenAI API key so that it is set in the environment. The key is not uploaded to the Privacera AI Governance service.

In [None]:
from getpass import getpass

#if os.environ.get("OPENAI_API_KEY") is None:
openai_api_key = getpass("🔑 Enter your OpenAI API key and hit Enter:")
os.environ["OPENAI_API_KEY"] = openai_api_key

# 7. Load the sample documents into Milvus vector database
Now the sample documents are loaded into the Milvus vector database using LangChain and the OpenAI embeddings API.

In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.milvus import Milvus

text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader("raw_data", glob="**/*.txt",
                         loader_cls=PrivaceraTextLoader,
                         loader_kwargs=text_loader_kwargs)
docs = loader.load()

print(f"len docs = {len(docs)}")

text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(docs)

# Create OpenAI Embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

vector_store = Milvus.from_documents(
    docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT}
)

print("Loaded data into collection successfully.")
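
Before embedding, the documents are split into chunks of at most 1024 characters with no overlap. The sketch below is a simplified approximation of that idea, assuming the text is split on blank lines and adjacent pieces are merged until the size limit is reached. `split_on_separator` is a hypothetical helper and does not reproduce every detail of `CharacterTextSplitter`.

```python
def split_on_separator(text: str, chunk_size: int = 1024,
                       separator: str = "\n\n") -> list:
    """Split on the separator, then merge pieces up to chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either it still fits, or this is an oversized first piece
            # that is kept whole rather than cut mid-paragraph
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample_text = "aaaa\n\nbbbb\n\ncccc"
print(split_on_separator(sample_text, chunk_size=10))
print(split_on_separator(sample_text, chunk_size=4))
```

Most of the sample files in this notebook are shorter than 1024 characters, so they are likely to end up as a single chunk each.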

# 8. Create Privacera AI Application and the VectorDB configuration

In this step, we create an AI Application configuration in PAIG that associates PAIG with the sample RAG LangChain application.

1. Log into your account in PAIG.
1. Click on Application -> Vector DB, create a Vector DB named **Product Catalog - Milvus**, and save it.
1. Navigate back to Application -> AI Application and create a new application named **Product Catalog - Milvus**.
1. Click **DOWNLOAD APP CONFIG** to download your application configuration file to your local disk.
1. Click on the pencil icon in the Information panel, turn on the Enabled toggle, select the **Product Catalog - Milvus** vector database from the Associated VectorDB drop-down, and then click Save in the application panel.

# 9. Upload the PAIG Application Config file to Colab

Run the cell and click on the **Choose Files** button. Select the application config file from your local disk and it will be uploaded into Colab.

In [None]:
from google.colab import files

uploaded = files.upload()
# Use a separate name so the imported `files` module is not shadowed
uploaded_names = list(uploaded.keys())
if len(uploaded_names) != 1:
    print("Upload only the application config json file")
else:
    app_config_file_content = uploaded[uploaded_names[0]].decode('UTF-8')
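
Outside Colab, the same check can be expressed as a small function that takes the uploaded files as a dictionary of file name to bytes. This is a sketch under that assumption; `pick_single_config` is a hypothetical helper, and the sample file content is a placeholder rather than a real PAIG configuration.

```python
import json


def pick_single_config(uploaded: dict) -> str:
    """Return the text of the only uploaded file, checking that it is JSON."""
    if len(uploaded) != 1:
        raise ValueError(f"Expected exactly one file, got {len(uploaded)}")
    (file_name, raw_bytes), = uploaded.items()
    content = raw_bytes.decode("utf-8")
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON
    json.loads(content)
    print(f"Using config file: {file_name}")
    return content


# Placeholder upload; the key and value are invented for illustration
fake_upload = {"app-config.json": b'{"example_key": "example_value"}'}
config_text = pick_single_config(fake_upload)
```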

# 10. LangChain RAG bot
We have implemented a small RAG bot using LangChain that uses the Milvus vector database to provide the context.


In [None]:
import privacera_shield
from privacera_shield import client as privacera_shield_client
from langchain.memory import ConversationBufferWindowMemory
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True, k=3)

# Create Milvus vector store
vector_store = Milvus(embeddings, COLLECTION_NAME,
                      connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT})

# expose this index in a retriever interface
milvus_retriever = vector_store.as_retriever(
    search_type="similarity", search_kwargs={"k": 100}
)

# Initialize Privacera Shield
privacera_shield_client.setup(frameworks=["milvus", "langchain"], application_config=app_config_file_content)

llm = ChatOpenAI(openai_api_key=openai_api_key, model_name="gpt-3.5-turbo")
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

def query_as_user(username, query):
    print(f"Prompt: {query}")
    print()

    llm_chain = ConversationalRetrievalChain.from_llm(llm=llm,
                                                      retriever=milvus_retriever,
                                                      memory=memory,
                                                      verbose=False)
    try:
        with privacera_shield_client.create_shield_context(username=username):
            response = llm_chain.invoke({"question": query})
            print("LLM Response:")
            print(f"{response.get('answer')}")
            #wrap_text(f"{response.get('answer')}")
    except privacera_shield.exception.AccessControlException as e:
        # If access is denied, then this exception will be thrown. You can handle it accordingly.
        print(f"AccessControlException: {e}")

# utility function to wrap the output
def wrap_text(text, width=80):
    words = text.split()
    character_count = 0
    for word in words:
        if character_count + len(word) + 1 > width:  # Check if adding the word would exceed the width
            print("\n", end="")  # Start a new line
            character_count = 0  # Reset the character count for the new line
        print(word, end=" ")  # Print the word followed by a space
        character_count += len(word) + 1  # Update the character count

print("RAG Bot is ready")
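
As an alternative to the hand-written `wrap_text` utility, the standard library's `textwrap` module can produce similar output. A minimal sketch follows; `wrap_answer` is a hypothetical name. Unlike `wrap_text`, it returns a string instead of printing word by word.

```python
import textwrap


def wrap_answer(text: str, width: int = 80) -> str:
    """Wrap text so that no line is longer than `width` characters."""
    return textwrap.fill(text, width=width)


sample_answer = (
    "The X10 has a 6.5 inch AMOLED display with a 120Hz refresh rate, "
    "a 5000mAh battery, and runs Android 13."
)
print(wrap_answer(sample_answer, width=40))
```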

# 11. Create users in PAIG portal

1. Click on Account -> User Management and click on the Add User button.
1. Enter First Name as mark, Last Name as mark, and User Name as mark, select Role as User, and save the user.
1. Similarly, create the users sally, emily, and peter.

# 12. Ask question about the product X11 which is under development


Peter belongs to the R&D team and has access to the details of the unreleased product X11, so he should be able to compare all the phone models.

Sally belongs to the Sales team and doesn't have access to the details of X11, so she shouldn't be able to compare the phone models.

Since the X11 product specification is marked as CONFIDENTIAL, only certain users have access to it.

In [None]:
query_as_user("peter", "Compare the product specifications for X10 and X11")
# this will compare the specifications of both products

In [None]:
query_as_user("sally", "Compare the product specifications for X10 and X11")
# since Sally doesn't have access to new development, she won't be able to compare the models
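
The different results for Peter and Sally follow from the users and groups lists attached to each document. The sketch below reproduces that decision in plain Python, using the lists from the metadata table in step 5. It only illustrates the rule; the actual enforcement is done by PAIG and the vector database, and the `user_groups` argument is a hypothetical input because this notebook does not define group membership.

```python
# Users and groups per file, copied from the metadata table in step 5
document_access = {
    "x10.txt": {"users": ["sally", "peter", "emily", "mark"], "groups": []},
    "x11.txt": {"users": ["mark", "peter"], "groups": []},
    "x10-salesdata.txt": {"users": ["sally"], "groups": ["Sales"]},
    "customer-feedback.txt": {
        "users": ["emily", "sally", "peter", "mark"],
        "groups": ["Sales"],
    },
}


def can_access(document: dict, username: str, user_groups: list) -> bool:
    """A user may read a document if listed directly or through a group."""
    if username in document["users"]:
        return True
    return any(group in document["groups"] for group in user_groups)


def visible_files(username: str, user_groups: list) -> list:
    """Return the sorted names of the files the user may read."""
    return sorted(
        name for name, doc in document_access.items()
        if can_access(doc, username, user_groups)
    )


print("peter:", visible_files("peter", []))
print("sally:", visible_files("sally", []))
```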

# 13. Ask sales details by members of Sales and other teams

Sally belongs to the Sales team and has access to the sales numbers.

Peter, who belongs to the R&D team, doesn't have access to the sales data.

Only the Sales team has access to the sales documents; these permissions are carried forward into the VectorDB and enforced there.

In [None]:
query_as_user("sally", "Give me the monthly sales data for X10?")

In [None]:
query_as_user("peter", "Give me the monthly sales data for X10?")
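
Enforcing access in the vector database means the search itself is restricted by a filter over the `users` and `groups` array fields. Milvus boolean expressions include array operators such as `array_contains` and `array_contains_any`, so one plausible filter could be built as sketched below. Whether PAIG generates this exact expression is an assumption, and `build_access_filter` is a hypothetical helper.

```python
import json


def build_access_filter(username: str, groups: list) -> str:
    """Build a filter expression that matches documents the user may read."""
    # json.dumps adds the double quotes and escaping for string literals
    clauses = [f"array_contains(users, {json.dumps(username)})"]
    if groups:
        clauses.append(f"array_contains_any(groups, {json.dumps(groups)})")
    return " or ".join(clauses)


print(build_access_filter("sally", ["Sales"]))
print(build_access_filter("peter", []))
```

An expression like this could be passed as the filter of a Milvus search so that documents the user cannot read are never returned as context.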

# 14. Let's redact PII data based on policy
Sally belongs to the Sales team and can see customer details.

Peter, who belongs to the R&D team, can't see customer PII data, but can see the feedback.

1. Go to **Application -> AI Applications** and select the **AI Application** you created
2. Now select the **PERMISSIONS** tab
3. Click the pencil for the **Personal Identifier Redaction** policy
1. Remove **Everyone** and add **peter**
1. On the right side for **Prompt** select the dropdown value **Allow**
1. Leave the **Reply** as **Redact**
1. Save the policy
1. Now **Enable** the policy by switching on the **Status** toggle


In [None]:
query_as_user("sally", "Give me the customer feedbacks and their contact information")

In [None]:
query_as_user("peter", "Give me the customer feedbacks and their contact information")
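
To give a rough idea of what a redacted reply might look like, the sketch below masks the email addresses and phone numbers found in the sample feedback using regular expressions. This is only an illustration: PAIG's redaction is driven by the policy configured above, and the patterns, placeholder tokens, and `redact_pii` helper here are assumptions, not its implementation.

```python
import re

# Patterns tuned to the formats used in customer-feedback.txt
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_PATTERN.sub("<EMAIL>", text)
    return PHONE_PATTERN.sub("<PHONE>", text)


feedback_excerpt = (
    "Sarah Jones, Busy Professional\n"
    "Email: sarah.jones@samplemail.com\n"
    "Phone: (123) 456-7890"
)
print(redact_pii(feedback_excerpt))
```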