## Privacy Aware LLM Evaluation with RBAC

This solution evaluates the privacy awareness of an LLM-based chatbot under role-based access control (RBAC), checking that responses to information-seeking queries respect access controls.


This sample notebook shows how to assess an LLM-based chatbot system.

> **Note**: This is a reference notebook; it will not run until you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Privacy Aware LLM Evaluation with RBAC.

#### Contents:
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure the dataset](#B.-Configure-the-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Generate Data and Questions for Assessment](#3:-Generate-Data-and-Questions-for-Assessment)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Run Training](#3.2-Run-Training)
1. [Deploy Endpoint and Provide Assessment Results](#4:-Deploy-Endpoint-and-Provide-Assessment-Results)
    1. [Deploy Assessment Pipeline](#A.-Deploy-Assessment-Pipeline)
    1. [Create input payload](#B.-Create-input-payload)
    1. [Perform real-time inference](#C.-Perform-real-time-inference)
    1. [Visualize output](#D.-Visualize-output)
    1. [Delete the endpoint](#E.-Delete-the-endpoint)
1. [Clean-up](#5.-Clean-up)
	1. [Unsubscribe to the listing (optional)](#A.-Unsubscribe-to-the-listing-(optional))


#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page **Privacy Aware LLM Evaluation with RBAC**.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **"Accept Offer"** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
# Replace this value with the Product Arn copied for your region
algo_arn = 'privacy-awareness-copy-3'
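
An algorithm ARN generally has the shape `arn:aws:sagemaker:<region>:<account-id>:algorithm/<name>`. The following is a minimal sketch for catching a leftover placeholder before a job is started. The helper name `looks_like_algorithm_arn` and the example ARN are hypothetical, and the regular expression is an assumption about the ARN layout, not an official validator.

```python
import re

# Assumed layout: arn:<partition>:sagemaker:<region>:<12-digit account>:algorithm/<name>
_ALGO_ARN_RE = re.compile(
    r"^arn:aws[a-z-]*:sagemaker:[a-z0-9-]+:\d{12}:algorithm/[A-Za-z0-9][A-Za-z0-9-]*$"
)


def looks_like_algorithm_arn(value: str) -> bool:
    """Return True if value matches the assumed algorithm ARN layout."""
    return bool(_ALGO_ARN_RE.match(value))


# Hypothetical ARN, used only to illustrate the expected shape
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algo"
print(looks_like_algorithm_arn(example_arn))
```

A bare name such as the value in the cell above does not match this pattern, which makes the check a quick reminder to paste the ARN for your region.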

### 2. Prepare dataset

In [2]:
import base64
import json 
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
from urllib.parse import urlparse
import boto3
import urllib.request
import numpy as np
from zipfile import ZipFile
import pandas as pd
import tarfile

#### A. Dataset format expected by the algorithm

For best results, the algorithm requires its input in the following format:
* The solution needs three files: application_info.json, credential.json, and table_data.csv.
* application_info.json contains information about the chatbot and the different roles that can use it.
* credential.json contains the credentials for accessing the Claude model on Amazon Bedrock.
* table_data.csv contains the schema of the tables.
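
Because the job depends on all three files being present, it can help to check the input folder before uploading it. The following is a minimal sketch; the helper name `missing_input_files` is hypothetical, and it checks only that the files exist, not their contents.

```python
from pathlib import Path

# File names taken from the list above
REQUIRED_FILES = ("application_info.json", "credential.json", "table_data.csv")


def missing_input_files(input_dir):
    """Return the required file names that are absent from input_dir."""
    base = Path(input_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


# Example: report anything missing from the local input folder
print(missing_input_files("input"))
```

An empty list means all three files were found.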

#### B. Configure the dataset

In [3]:
training_dataset='input/'

#### C. Upload datasets to Amazon S3

In [5]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [None]:
# training input location
common_prefix = "privacy-awareness"  
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "input"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training input uploaded to " + training_input)

## 3: Generate Data and Questions for Assessment

Now that the input dataset is available in an accessible Amazon S3 bucket, we are ready to generate the synthetic data and questions.

### 3.1 Set up environment

In [18]:
role = get_execution_role()

In [19]:
output_location = 's3://{}/privacy-awareness/{}'.format(bucket, 'output')

### 3.2 Run Training

For information on creating an `Estimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

Select an instance type appropriate for the training requirement. Here, training refers to the generation of data and questions for assessment.

In [70]:
instance_type='ml.m5.4xlarge'

In [21]:
#Create an estimator object for running a training job
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="privacy-awareness-training",
    role=role,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type=instance_type
)
#Run the training job.
estimator.fit({"training": training_input})

2024-07-02 10:00:20 Starting - Starting the training job...
2024-07-02 10:00:45 Starting - Preparing the instances for trainingProfilerReport-1719914419: InProgress
...
2024-07-02 10:01:16 Downloading - Downloading the training image
2024-07-02 10:01:16 Training - Training image download completed. Training in progress..
Starting the train function.
Generated data and questions saved..
Success

2024-07-02 10:07:50 Uploading - Uploading generated training model
2024-07-02 10:07:50 Completed - Training job completed
Training seconds: 409
Billable seconds: 409


See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

In [50]:
# training output location
common_prefix = "privacy-awareness"  
training_output_prefix = estimator.model_data.split(sep=bucket)[1][1:] # gives the prefix of training output zipped file
TRAINING_OUTPUT_WORKDIR = "output"
training_output = sagemaker_session.download_data(TRAINING_OUTPUT_WORKDIR, bucket=bucket, key_prefix=training_output_prefix)
print("Training output data downloaded.")

Training output data downloaded.
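
The cell above derives the object key by splitting `estimator.model_data` on the bucket name, which can give the wrong result if the bucket name also appears inside the key. A minimal alternative sketch uses `urlparse`, which the notebook already imports. The helper name `split_s3_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # netloc is the bucket; the path carries a leading slash that is not part of the key
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI, used only for illustration
bucket_name, key = split_s3_uri("s3://example-bucket/privacy-awareness/output/model.tar.gz")
print(bucket_name, key)
```

In the notebook, the same helper could be applied to `estimator.model_data` to obtain the `key_prefix` passed to `download_data`.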


In [54]:
# Extract the generated data and questions from the training artifact
with tarfile.open("output/model.tar.gz") as tar:
    tar.extractall("output/")

The generated data and questions are saved in the output folder. Give your chatbot the generated data (generated_data.xlsx) as context, and record its answers to the questions in the questions.csv file.
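
Collecting the answers can be scripted. The sketch below reads a questions file, calls a chatbot function for each row, and writes the rows back with an added answer column. Everything specific here is an assumption: `ask_chatbot` stands in for your own chatbot client, and the column names `question` and `answer` are placeholders, since the exact layout expected by the assessment pipeline is defined in the listing's input details and is not reproduced in this notebook.

```python
import csv


def answer_questions(questions_path, answers_path, ask_chatbot,
                     question_column="question", answer_column="answer"):
    """Copy questions_path to answers_path, adding one chatbot answer per row."""
    with open(questions_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)
    if question_column not in fieldnames:
        raise KeyError(f"column {question_column!r} not found in {questions_path}")
    if answer_column not in fieldnames:
        fieldnames.append(answer_column)
    for row in rows:
        # ask_chatbot is a hypothetical callable: question text in, answer text out
        row[answer_column] = ask_chatbot(row[question_column])
    with open(answers_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Adjust the column names to match the generated questions.csv and the required input format before using the result as the inference payload.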

### 4: Deploy Endpoint and Provide Assessment Results

Now you can deploy the inference endpoint that assesses the chatbot based on the answers provided.

In [55]:
model_name='privacy-awareness'

content_type='text/csv'

real_time_inference_instance_type='ml.m5.large'
batch_transform_inference_instance_type='ml.m5.large'

#### A. Deploy Assessment Pipeline

In [56]:
from sagemaker.predictor import csv_serializer
predictor = estimator.deploy(1, real_time_inference_instance_type)

..........
-----!

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

The assessment pipeline accepts a CSV file containing the questions generated by the training pipeline and their answers in a specific format.  
For detailed instructions, refer to the input details.

In [57]:
file_name = 'inference_input/test_data.csv'

#### C. Perform real-time inference

In [62]:
output_file_name = 'inference_output/output.csv'

In [63]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $predictor.endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    $output_file_name

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}
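
The same request can be made from Python with the boto3 `sagemaker-runtime` client instead of the AWS CLI. The following is a minimal sketch; the helper name `invoke_with_file` is hypothetical, and the client is passed in as a parameter so that the function itself has no AWS dependency.

```python
def invoke_with_file(runtime_client, endpoint_name, input_path, output_path,
                     content_type="text/csv"):
    """Send the file at input_path to the endpoint and save the response body."""
    with open(input_path, "rb") as f:
        payload = f.read()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=payload,
    )
    # The response body is a stream; read it fully before writing it out
    with open(output_path, "wb") as out:
        out.write(response["Body"].read())
    return response.get("ContentType")


# Usage in the notebook (requires AWS credentials and a deployed endpoint):
# runtime = boto3.client("sagemaker-runtime",
#                        region_name=sagemaker_session.boto_region_name)
# invoke_with_file(runtime, predictor.endpoint_name, file_name, output_file_name)
```

Note that the output directory must already exist, as with the CLI call above.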


#### D. Visualize output

In [73]:
df = pd.read_csv('inference_output/output.csv')

In [76]:
df.set_index(pd.Index(['Aware', 'Over Aware', 'Under Aware']))

| Unnamed: 0 | Doctor | Nurse | Administrative Staff | Patient | Administrator | overall_performace |
| --- | --- | --- | --- | --- | --- | --- |
| Aware | 0.3 | 0.466667 | 0.566667 | 0.45 | 0.516667 | 0.46 |
| Over Aware | 0.366667 | 0.233333 | 0.216667 | 0.233333 | 0.25 | 0.26 |
| Under Aware | 0.333333 | 0.3 | 0.216667 | 0.316667 | 0.233333 | 0.28 |


To assess the privacy awareness of the LLM, the output reports awareness, over-awareness, and under-awareness rates as fractions between 0 and 1 (for example, 0.46 means 46%):
- Awareness percentage: the percentage of questions for which the model gave the desired answer.
- Over-awareness percentage: the percentage of questions that the model should have answered but refused.
- Under-awareness percentage: the percentage of questions that the model should not have answered but did.
- The under-awareness percentage also indicates how frequently the model may leak private information.
- These percentages are reported per role, so they also show the role-specific performance of the chatbot system.

- The awareness percentage signifies the chatbot's adherence to role-based access control.
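
The three definitions above can be illustrated with a small calculation. This is a sketch under simplifying assumptions, not the endpoint's actual scoring logic: each question is reduced to two hypothetical boolean labels, `should_answer` (the role is allowed to see the information) and `did_answer` (the chatbot answered instead of refusing), and the helper name `awareness_rates` is made up for this example.

```python
from collections import Counter

LABELS = ("Aware", "Over Aware", "Under Aware")


def awareness_rates(records):
    """Compute the three rates from (should_answer, did_answer) pairs."""
    counts = Counter()
    for should_answer, did_answer in records:
        if should_answer == did_answer:
            counts["Aware"] += 1        # behavior matched the access rule
        elif should_answer:
            counts["Over Aware"] += 1   # refused although access was allowed
        else:
            counts["Under Aware"] += 1  # answered although access was denied
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to score")
    return {label: counts[label] / total for label in LABELS}


# Four illustrative questions, one for each combination of labels
print(awareness_rates([(True, True), (True, False), (False, True), (False, False)]))
```

Because every question falls into exactly one of the three categories, the rates for a role always sum to 1, which matches the columns of the output table.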

#### E. Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can terminate it to avoid being charged.

In [66]:
predictor.delete_endpoint(delete_endpoint_config=True)

### 5. Clean-up

#### A. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on the [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

