# Data Security Broker: dynamically de-identify sensitive data in RAG prompt context

Sensitive Information Disclosure is one of the risks listed in the [OWASP Top Ten for LLM Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-2023-v1_1.pdf). While the LLM cannot memorize the prompts or their context, it generates responses from whatever context it is given, so sensitive data in that context can surface in the output. In this notebook, we will show how this potential threat can play out and demonstrate how to shield your applications from it.

This notebook shows how IBM Cloud Security & Compliance Center Data Security Broker (DSB or SCC DSB) can be used to dynamically encrypt/mask sensitive data that is fed into an LLM in a typical Retrieval Augmented Generation (RAG) workflow. We will demonstrate this with an example showing a couple of personas - a privileged user and a non-privileged user - and how the responses generated by LLM can be altered by dynamic RBAC and data masking without any changes to your Gen AI application!

For this demo, we will be using the IBM watsonx platform and one of the available LLMs - it should work with any of the available decoder models.

You can learn more about Data Security Broker in its [product documentation](https://cloud.ibm.com/docs/security-broker).

## Table of Contents

This notebook contains the following parts:

1.	[Intro to basic concepts](#rag)
1.	[Prerequisites](#prereq)
1.	[Setup](#steps)
1.	[Persona 1: pass context to the LLM as a Nurse](#persona1)
1.	[Persona 2: pass context to the LLM as a Physician](#persona2)
1.	[Summary](#summary)

<a id="rag"></a>
## Intro to basic concepts 

### Retrieval Augmented Generation

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information. 

Implementing RAG in an LLM-based question answering system has the following benefits: 
-	The model has access to the most current, reliable facts.
-	Users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted.
-	The model has fewer opportunities to pull information baked into its parameters because the LLM is grounded on a set of external, verifiable facts. This reduces the chances that an LLM will leak sensitive data, or ‘hallucinate’ incorrect or misleading information.
-	Users don’t need to continuously train the model on new data and update its parameters as circumstances evolve. In this way, RAG can lower the computational and financial costs of running LLM-powered chatbots in an enterprise setting.


You can learn more about RAG from this [source document](https://research.ibm.com/blog/retrieval-augmented-generation-RAG).
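
As a rough illustration of the retrieve-then-ground idea, here is a minimal sketch. It uses a naive word-overlap retriever in place of a real search index or vector store, and the helper names (`tokenize`, `retrieve`, `build_prompt`) and the sample documents are made up for this example. The prompt layout mirrors the `Context: ... Input: ... Output:` pattern used later in this notebook.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents):
    """Return the document that shares the most words with the question."""
    question_words = tokenize(question)
    return max(documents, key=lambda doc: len(question_words & tokenize(doc)))

def build_prompt(question, documents):
    """Ground the question on the retrieved document before calling an LLM."""
    context = retrieve(question, documents)
    return f"Context: {context} Input: {question} Output:"

# Illustrative documents only; a real system would query a database or index.
documents = [
    "The clinic is open Monday to Friday from 8am to 5pm.",
    "Annual physicals are covered once per calendar year.",
]
print(build_prompt("When is the clinic open?", documents))
```

The point of the sketch is that whatever the retriever returns is pasted into the prompt verbatim, which is why the retrieved context is the place to apply data protection.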


### Application-Level Encryption

Application-level encryption (ALE) is an approach to data protection that encrypts data at its source, before it is stored or transmitted to other layers of the system. Typical data and access controls tend to sit at the perimeter, but ALE shrinks them down to the data itself. Because the data is already protected by the time it reaches the other layers, the application's overall risk profile improves.

Since DSB is a reverse-proxy-based ALE solution, you can deploy DSB Shield in your data pipelines/paths and dynamically encrypt/mask data based on your role-based access controls (RBAC).
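
To make the idea of role-based masking concrete, here is a small conceptual sketch. It is not how DSB Shield is implemented or configured: the function names, the `PRIVILEGED_USERS` set, and the asterisk masking format are all assumptions chosen for illustration. It only shows the decision a proxy in the data path makes for each row: return it as-is for a privileged user, or mask the sensitive columns otherwise.

```python
MASKED_COLUMNS = {"notes", "reason"}   # columns treated as sensitive in this sketch
PRIVILEGED_USERS = {"physician_1"}     # hypothetical stand-in for an RBAC admin group

def mask_value(value):
    """Replace every character with '*' (an assumed format, not DSB's actual output)."""
    return "*" * len(str(value))

def apply_policy(row, user=None):
    """Return the row unchanged for privileged users, masked for everyone else."""
    if user in PRIVILEGED_USERS:
        return dict(row)
    return {col: mask_value(val) if col in MASKED_COLUMNS else val
            for col, val in row.items()}

row = {"Name": "John Doe", "reason": "annual physical"}
print(apply_policy(row))                       # no user given: 'reason' is masked
print(apply_policy(row, user="physician_1"))   # privileged user: clear text
```

Because the decision is made in the data path, the calling application stays the same for both users.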



<a id="prereq"></a>
## Prerequisites

Prepare the database you will be working with. In this notebook, we will be using a Postgres database. 

<a id="steps"></a>
## Setup 

Let's set up the Postgres table and a record we will be using in this demo:

```SQL
CREATE TABLE public.patient_visits2_1 (
	"ID" int4 NULL,
	"Name" varchar(100) NULL,
	address varchar(200) NULL,
	clinic varchar(100) NULL,
	"InsuranceProvider" varchar(16) NULL,
	"Date" varchar(100) NULL,
	notes varchar(200) NULL,
	reason varchar(100) NULL
);
```

Now, let's add a record to the table:

```SQL
INSERT INTO public.patient_visits2_1
("ID", "Name", address, clinic, "InsuranceProvider", "Date", notes, reason)
VALUES(1, 'John Doe', '123 Maine St Raleigh NC', 'NC Rex', 'Aetna', '07/23', 'Diagnosed with Costochondritis and prescribed steroids for two weeks', 'annual physical');
```

For the next step, install Data Security Broker Manager by following the documented steps [here](https://cloud.ibm.com/docs/security-broker?topic=security-broker-sb_install_catalog). Then install the Data Security Broker Shield on the cluster by following the steps [here](https://cloud.ibm.com/docs/security-broker?topic=security-broker-sb_ui_procedure).

For the final step, set up RBAC with a policy that masks the columns `notes` and `reason` for all non-admin users, following the steps [here](https://cloud.ibm.com/docs/security-broker?topic=security-broker-sb_configure_rbac). Note the user/admin name in your RBAC group so that you can supply the same name in cells 6 and 10 below.

In [None]:
# For IBM watsonx on IBM Cloud

import os

# setting watsonx specific credentials

os.environ['CLOUD_API_KEY'] = '<API_KEY>' # cloud API KEY
os.environ['CLOUD_URL'] = '<CLOUD_URL>' # region where the watsonx instance is running
os.environ['PROJECT_ID'] = '<PROJECT_ID>' # watsonx instance project ID

# setting up DB credentials
os.environ['DB_USER'] = '<DB_USER>' # enter the DB User to connect to the DSB Shield (database user)
os.environ['DB_PASSWORD'] = '<PASSWORD>' # password for the DSB Shield instance (database password)
os.environ['DB_DSB_HOST'] = '<DB_URL>' # url connecting to the DB through DSB Shield proxy
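
Before moving on, it can help to confirm that none of the placeholders above were left in place. The following is a minimal sketch; the helper name `unset_settings` is hypothetical, and it assumes placeholders keep the `<...>` form used in the cell above.

```python
import os

REQUIRED_SETTINGS = ["CLOUD_API_KEY", "CLOUD_URL", "PROJECT_ID",
                     "DB_USER", "DB_PASSWORD", "DB_DSB_HOST"]

def unset_settings(env, names=REQUIRED_SETTINGS):
    """Return the names that are missing, empty, or still a '<PLACEHOLDER>' value."""
    def is_unset(value):
        return not value or (value.startswith("<") and value.endswith(">"))
    return [name for name in names if is_unset(env.get(name))]

missing = unset_settings(os.environ)
if missing:
    print("Please set: " + ", ".join(missing))
```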

In [None]:
# For IBM watsonx on IBM Cloud

credentials = {
    "url"    : os.getenv('CLOUD_URL'),
    "apikey" : os.getenv('CLOUD_API_KEY')
}

Here we initialize the LLM and the parameters that define its behavior. While we chose `meta-llama/llama-2-70b-chat`, it should work for other LLMs as well; to try one, change the `model_id`. Similarly, experiment with the decoding method: try `sample` (sampling) instead of `greedy` to get more creative responses. For the use case at hand, the responses I got with `repetition_penalty` set to `1` were satisfactory, but feel free to change the values if needed.

In [None]:
try:
    from ibm_watsonx_ai.foundation_models import Model
except ImportError:
    # Install the SDK on first run, then retry the import
    !pip install --upgrade pip
    !pip install -U ibm-watsonx-ai
    from ibm_watsonx_ai.foundation_models import Model

model_id = "meta-llama/llama-2-70b-chat"

# Generation parameter names are lowercase strings
gen_parms = {
    "decoding_method" : "greedy",
    "min_new_tokens" : 1,
    "max_new_tokens" : 200,
    "repetition_penalty" : 1,
}


project_id = os.getenv('PROJECT_ID')

model = Model( model_id, credentials, gen_parms, project_id )
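
If you want to experiment with sampling, the sketch below derives a sampling variant from a greedy baseline without modifying it. The helper `with_sampling` is hypothetical, and the parameter names (`temperature`, `top_p`, `top_k`) and the value `sample` reflect my understanding of the watsonx.ai text generation parameters; please confirm them against the SDK documentation before relying on them.

```python
# Greedy baseline mirroring the values used in this notebook (keys in lowercase here).
greedy_parms = {
    "decoding_method": "greedy",
    "min_new_tokens": 1,
    "max_new_tokens": 200,
    "repetition_penalty": 1,
}

def with_sampling(base, temperature=0.7, top_p=0.9, top_k=50):
    """Return a copy of `base` switched to sampling; the input dict is not modified."""
    parms = dict(base)
    parms.update({
        "decoding_method": "sample",  # assumed value name; check the SDK's enums
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
    })
    return parms

sampling_parms = with_sampling(greedy_parms, temperature=0.5)
print(sampling_parms)
```

Keeping both dictionaries around makes it easy to compare a deterministic answer with a sampled one for the same prompt.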

Here is the `generate` function that makes the outbound call to the chosen model with the given prompt. We will call this function a couple of times to demonstrate some scenarios further down in the notebook.

In [None]:
import json

def generate(model, prompt):

    generated_response = model.generate(prompt)

    # print(generated_response)  - uncomment and notice the returned results.

    if ("results" in generated_response) \
       and (len( generated_response["results"] ) > 0) \
       and ("generated_text" in generated_response["results"][0]):
        return generated_response["results"][0]["generated_text"]
    else:
        print("The model failed to generate an answer")
        print("\nDebug info:\n" + json.dumps(generated_response, indent=3))
        return ""
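
The response handling can be exercised without calling watsonx by using a stub. The sketch below is self-contained: `first_generated_text` restates the same check in a standalone helper, and `StubModel` is a hypothetical stand-in that only mimics the `results[0]["generated_text"]` shape inspected above.

```python
def first_generated_text(response):
    """Return results[0]['generated_text'], or '' when the shape is unexpected."""
    results = response.get("results") or []
    if results and "generated_text" in results[0]:
        return results[0]["generated_text"]
    return ""

class StubModel:
    """Hypothetical stand-in for the real model; returns a canned response."""
    def __init__(self, response):
        self.response = response

    def generate(self, prompt):
        return self.response

ok = StubModel({"results": [{"generated_text": "Aetna"}]})
empty = StubModel({"results": []})

print(first_generated_text(ok.generate("any prompt")))
print(repr(first_generated_text(empty.generate("any prompt"))))
```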

`searchAndAnswer` calls `generate` with the given `model` and `prompt`, then prints the output.

In [None]:
def searchAndAnswer(model, prompt):

    # question = input("Type your question:\n")

    # Generate output
    output = generate(model, prompt)
    print(output)

<a id="persona1"></a>
## Persona 1: pass context to the LLM as a Nurse

Here we connect to the database through the DSB Shield proxy, using `DB_USER` credentials that have `admin` privileges. The application in question does need higher privileges to help serve some of its users. However, the current user persona, `nurse`, doesn't need access to sensitive data to perform her job.

So we rely on DSB's RBAC to dynamically change the context that is retrieved from Postgres. When `/*+ User:<value> */` is not passed as a comment in the query, the default `masking` on sensitive columns is triggered. You will see the retrieved masked context in the following code block.

In [None]:
import psycopg2 as pg

conn = pg.connect(
    host=os.getenv('DB_DSB_HOST'),
    port="8444",
    database="ibmclouddb",
    user=os.getenv('DB_USER'),
    password=os.getenv('DB_PASSWORD'),
    connect_timeout=10)

cursor = conn.cursor()

cursor.execute("SELECT * FROM public.patient_visits2_1 ;")

context = f"Context: {cursor.fetchone()}. You will only provide one succinct answer to my question. If you cannot find the answer, you will respond with I don't know."

print(context)
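
The context above is the raw row tuple, so the model has to infer which value belongs to which column. One optional variation, sketched below, is to label each value with its column name. The helper `labeled_context` is hypothetical; with a live psycopg2 cursor the names could be read from `cursor.description`, as noted in the comment.

```python
def labeled_context(column_names, row):
    """Pair each column name with its value, separated by semicolons."""
    return "; ".join(f"{name}: {value}" for name, value in zip(column_names, row))

# With a live cursor, the names could come from the DB-API description attribute:
#   column_names = [col[0] for col in cursor.description]
column_names = ["ID", "Name", "clinic", "InsuranceProvider"]
row = (1, "John Doe", "NC Rex", "Aetna")

print(labeled_context(column_names, row))
# ID: 1; Name: John Doe; clinic: NC Rex; InsuranceProvider: Aetna
```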

In [None]:
prompt = f"{context} Input: Summarize John Doe's last visit. Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: Who is the insurance provider for John Doe? Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: Okay, what was the reason for his visit? Output:"
searchAndAnswer(model, prompt)

<a id="persona2"></a>
## Persona 2: pass context to the LLM as a Physician 

Now, for this scenario, we run the same commands as above, but we pass `/*+ User:karnatip */` in our query. DSB Shield picks up this comment and finds a match in the RBAC admin user group. Since this is a privileged user authorized to read sensitive data, DSB Shield sends the context back in clear text. You can see that dynamic behavior in the following code block.

Feel free to change the username to whatever you configured in your RBAC settings in DSB Manager.
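
If the user name comes from your application rather than being hard-coded, you may want to build the comment programmatically. The sketch below is one way to do that; the helper `with_user_hint` and the set of allowed characters are assumptions for illustration, not part of DSB. Rejecting unexpected characters keeps a crafted name from closing the comment early and altering the SQL.

```python
import re

def with_user_hint(sql, user):
    """Append a '/*+ User:<value> */' comment to a single SQL statement."""
    # Assumed allow-list; adjust it to the naming rules of your RBAC groups.
    if not re.fullmatch(r"[A-Za-z0-9_.@-]+", user):
        raise ValueError(f"unexpected characters in user name: {user!r}")
    statement = sql.rstrip().rstrip(";").rstrip()
    return f"{statement} /*+ User:{user} */;"

print(with_user_hint("SELECT * FROM public.patient_visits2_1 ;", "karnatip"))
# SELECT * FROM public.patient_visits2_1 /*+ User:karnatip */;
```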

In [None]:
import psycopg2 as pg

conn = pg.connect(
    host=os.getenv('DB_DSB_HOST'),
    port="8444",
    database="ibmclouddb",
    user=os.getenv('DB_USER'),
    password=os.getenv('DB_PASSWORD'),
    connect_timeout=10)

cursor = conn.cursor()

cursor.execute("SELECT * FROM public.patient_visits2_1 /*+ User:karnatip */;") # "karnatip" - physician's ID as defined in RBAC Groups

# context = f"Context: {cursor.fetchone()}. You will limit your answers to the given context. If you cannot find the answer, you will respond with I don't know. You provide rationale to your responses. Your answers will be short and succinct."

context = f"Context: {cursor.fetchone()}. You will only provide one succinct answer to my question. If you cannot find the answer, you will respond with I don't know."

print(context)

In [None]:
prompt = f"{context} Input: Give me everything you know about John Doe's last visit including the date and reason for his visit. Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: When was his last visit? Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: Who is the insurance provider for John Doe? Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: Okay, what was the reason for his visit? Output:"
searchAndAnswer(model, prompt)

In [None]:
prompt = f"{context} Input: Did I ask John Doe for a follow up visit? Output:"
searchAndAnswer(model, prompt)

<a id="summary"></a>
## Summary

Data Security Broker is the only no-code application-level encryption offering that protects data with column/field/row level granularity. Since it is a reverse proxy, you can deploy it in any data path and enforce an encryption and/or RBAC policy to seamlessly mask data and enforce granular access controls. As we saw in this demo, the context retrieved from Postgres is dynamically masked based on the end user persona. In doing so, we make sure sensitive data does not leave the secure data store in clear text when there is no need for it. This addresses one of the concerns in "OWASP Top Ten for LLM Applications" - Sensitive Information Disclosure.


### Author

**Pratheek Karnati, IBM Cloud**

Copyright © 2024 IBM. This notebook and its source code are released under the terms of the MIT License.