# **RI Generative Stress Test Walkthrough**

In this walkthrough, we run RIME's Generative AI Stress Testing on OpenAI's ChatGPT model to demonstrate how RIME can be used with generative models. The dataset consists of sample question-answering records of the kind a ChatGPT model takes as input and produces as output. Each record includes a title, context, question, answer, and prompt.

## **Install Dependencies, Import Libraries and Download Data**
Run the cells below to install the RIME SDK and import the analysis libraries.

In [None]:
!pip install rime-sdk &> /dev/null

In [None]:
import pandas as pd
from pathlib import Path
from rime_sdk import Client

Run the cell below to download and unzip the generated dataset and the pre-trained model used in this walkthrough.

In [None]:
!pip install https://github.com/RobustIntelligence/ri-public-examples/archive/master.zip    
from ri_public_examples.download_files import download_files

download_files('generative/question_answering', 'question_answering') 

## **Establish the RIME Client**

To get started, provide the API credentials and the base domain/address of the RIME service. You can generate and copy an API token from the API Access Tokens page under Workspace settings. For the domain/address of the RIME service, contact your admin.

![API_token](https://drive.google.com/uc?id=1zF1inyDbU2Q08SW9Ans2eVw0LqUnA2uP)


In [None]:
API_TOKEN = '' # PASTE API TOKEN
CLUSTER_URL = '' # PASTE DEDICATED DOMAIN OF RIME SERVICE (eg: rime.stable.rbst.io)
AGENT_ID = '' # PASTE AGENT_ID IF USING AN AGENT THAT IS NOT THE DEFAULT

client = Client(CLUSTER_URL, API_TOKEN)
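
Pasting secrets into a notebook makes them easy to commit by accident. As a minimal sketch, the credentials could instead be read from environment variables. The variable names `RIME_CLUSTER_URL` and `RIME_API_TOKEN` and the helper `load_rime_credentials` are assumptions made for this example. They are not defined by the RIME SDK.

```python
import os


def load_rime_credentials(env=None):
    """Return (cluster_url, api_token) read from environment variables.

    RIME_CLUSTER_URL and RIME_API_TOKEN are hypothetical names chosen for
    this sketch; use whatever names your team prefers.
    """
    env = os.environ if env is None else env
    required = ("RIME_CLUSTER_URL", "RIME_API_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["RIME_CLUSTER_URL"], env["RIME_API_TOKEN"]
```

With both variables set, `CLUSTER_URL, API_TOKEN = load_rime_credentials()` can replace the pasted values before calling `Client(CLUSTER_URL, API_TOKEN)` as in the cell above.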

## **Create a New Project**

You can create projects in RIME to organize your test runs. Each project represents a workspace for a given machine learning task. It can contain multiple candidate models, but should only contain one promoted production model. 

In [None]:
description = (
    "Run Generative Stress Testing on a"
    " generative model and dataset. Demonstration uses"
    " a dataset composed of short scenarios that are each"
    " identified by title and characterized by context,"
    " a question, an answer, and a prompt."
)
project = client.create_project(
    name="GAI Security Demo", 
    description=description, 
    model_task="MODEL_TASK_QUESTION_ANSWERING"
)

**Go back to the UI to see the new `GAI Security Demo` project.**

## **Register and Upload the Model + Dataset**

We now want to kick off RIME Stress Tests, which evaluate the model in more depth than basic performance metrics such as accuracy, precision, and recall. To do this, we upload the pre-trained model and the dataset it was evaluated on to an S3 bucket that RIME can access. We also upload a fact sheet that is used later to configure the stress test. Finally, we register the model and dataset with RIME.

In [None]:
upload_path = "ri_public_examples_generative"

model_s3_dir = client.upload_directory(
    Path('question_answering/models'), upload_path=upload_path
)
model_s3_path = model_s3_dir + "/chat_gpt_model.py"

eval_s3_path = client.upload_file(
    Path('question_answering/data/squad_v2_test_with_labels.json'), upload_path=upload_path
)

fact_sheet_path = client.upload_file(
    Path('question_answering/data/fact_sheet.txt'), upload_path=upload_path
)
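
Before uploading, it can help to confirm that the download cell actually produced the files the upload cell expects. The sketch below checks the three local paths used above. The helper name `find_missing_paths` is an assumption made for this example and is not part of the RIME SDK.

```python
from pathlib import Path

# Local paths used by the upload cell above.
REQUIRED_PATHS = [
    "question_answering/models",
    "question_answering/data/squad_v2_test_with_labels.json",
    "question_answering/data/fact_sheet.txt",
]


def find_missing_paths(paths, root="."):
    """Return the subset of paths that do not exist under root."""
    return [p for p in paths if not (Path(root) / p).exists()]


missing = find_missing_paths(REQUIRED_PATHS)
if missing:
    print("Missing local files, re-run the download cell:", missing)
```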

Once the data and model are uploaded to S3, we can register them with RIME. Once they're registered, we can refer to these resources by their RIME-generated IDs. We also need to create an integration in the workspace and obtain its ID, which is used to register the model and to configure the stress test.

In [None]:
from datetime import datetime

dt = str(datetime.now())

integration_id = client.create_integration(
    workspace_id="", # PASTE WORKSPACE ID FROM INFO BUTTON ON WORKSPACE OVERVIEW PAGE
    name=f"GAI_walkthrough_{dt}",
    integration_type="INTEGRATION_TYPE_CUSTOM", 
    integration_schema=[
        {
            "name": "OPENAI_API_KEY",
            "sensitivity": "VARIABLE_SENSITIVITY_WORKSPACE_SECRET",
            "value": "", # FILL IN YOUR OPENAI API KEY HERE
        }
    ],
)

# Note: models and datasets need to have unique names.
model_id = project.register_model(
    name=f"model_{dt}",
    model_config={"generative_language_model": {
        "model_path": model_s3_path,
        "system_prompt": "I am ChatGPT, a large language model trained by OpenAI, based on the GPT-3.5 architecture.\nKnowledge cutoff: 2021-09\nCurrent date: 2023-03-28"
    }},    
    model_endpoint_integration_id=integration_id,
    skip_validation=True
)

data_params = {
    "label_col": "answer",
    "prompt_col": "prompt",
    "text_features": [
        "context",
        "question"]
}
eval_dataset_id = project.register_dataset(
    name=f"eval_dataset_{dt}",
    data_config={
        "connection_info": {"data_file": {"path": eval_s3_path}},
        "data_params":data_params
    }
)

tests_config = {
    "row_wise_factual_inconsistency": {"fact_sheet_path": fact_sheet_path},
}
test_suite_config = {"individual_tests_config": tests_config}
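
A mismatch between `data_params` and the columns in the dataset is an easy mistake to make, so a quick local check can save a failed run. The sketch below assumes the dataset can be loaded as a list of dictionaries, one per row. That layout and the helper `missing_columns` are assumptions for illustration, and the sample row is a made-up placeholder rather than real data.

```python
def missing_columns(records, data_params):
    """Return the column names referenced by data_params that any row lacks.

    records: list of dicts, one per row (an assumed layout for this sketch).
    """
    required = {data_params["label_col"], data_params["prompt_col"]}
    required.update(data_params["text_features"])
    missing = set()
    for row in records:
        missing.update(required - set(row))
    return sorted(missing)


# Same data_params as in the registration cell above.
data_params = {
    "label_col": "answer",
    "prompt_col": "prompt",
    "text_features": ["context", "question"],
}

# Made-up placeholder row, for illustration only (not taken from the dataset).
sample_rows = [
    {"title": "t", "context": "c", "question": "q", "answer": "a", "prompt": "p"}
]
print(missing_columns(sample_rows, data_params))  # prints []
```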

## **Running a Generative Stress Test**


A Generative Stress Test lets you test your data and model before deployment. It is a comprehensive suite of hundreds of tests that automatically identify implicit assumptions and weaknesses of pre-production models. Each stress test runs on a single model and displays the corresponding results.

Below is a sample configuration showing how to set up and run a RIME Generative Stress Test. This configuration specifies the relevant metadata for our dataset and model.

In [None]:
generative_stress_test_config = {
    "run_name":"GPT-3.5", 
    "model_task":"Question Answering",
    "model_id":model_id,
    "eval_dataset_id":eval_dataset_id,
    "integration_id":{"uuid": integration_id},
    "run_time_info":{"explicit_errors":False},
    "test_suite_config":test_suite_config,
    "categories": [
        "TEST_CATEGORY_TYPE_ADVERSARIAL",
        "TEST_CATEGORY_TYPE_DATA_POISONING_DETECTION",
        "TEST_CATEGORY_TYPE_SUBSET_PERFORMANCE",
        "TEST_CATEGORY_TYPE_TRANSFORMATIONS",
        "TEST_CATEGORY_TYPE_EVASION_ATTACK_DETECTION",
        "TEST_CATEGORY_TYPE_MODEL_ALIGNMENT",
        "TEST_CATEGORY_TYPE_FACTUAL_AWARENESS",
        "TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS"
    ]
}

stress_job = client._start_generative_stress_test(
    test_run_config=generative_stress_test_config,
    project_id=project.project_id
)
stress_job.get_status(verbose=True, wait_until_finish=True)
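
The same configuration shape is reused in the appendix with a different model and without an integration ID. A small builder can keep the two in sync. This is only a convenience sketch: `build_stress_test_config` and `DEFAULT_CATEGORIES` are hypothetical names, not part of the RIME SDK, and the keys simply mirror the dictionary shown above.

```python
# Categories copied from the configuration cell above.
DEFAULT_CATEGORIES = [
    "TEST_CATEGORY_TYPE_ADVERSARIAL",
    "TEST_CATEGORY_TYPE_DATA_POISONING_DETECTION",
    "TEST_CATEGORY_TYPE_SUBSET_PERFORMANCE",
    "TEST_CATEGORY_TYPE_TRANSFORMATIONS",
    "TEST_CATEGORY_TYPE_EVASION_ATTACK_DETECTION",
    "TEST_CATEGORY_TYPE_MODEL_ALIGNMENT",
    "TEST_CATEGORY_TYPE_FACTUAL_AWARENESS",
    "TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS",
]


def build_stress_test_config(
    run_name,
    model_id,
    eval_dataset_id,
    test_suite_config,
    integration_id=None,
    categories=None,
):
    """Build a config dict with the same keys as the cell above."""
    config = {
        "run_name": run_name,
        "model_task": "Question Answering",
        "model_id": model_id,
        "eval_dataset_id": eval_dataset_id,
        "run_time_info": {"explicit_errors": False},
        "test_suite_config": test_suite_config,
        # Copy the list so callers cannot mutate the shared default.
        "categories": list(categories or DEFAULT_CATEGORIES),
    }
    # The integration ID is only included when one is supplied.
    if integration_id is not None:
        config["integration_id"] = {"uuid": integration_id}
    return config
```

The returned dictionary can be passed as `test_run_config` to `client._start_generative_stress_test` in the same way as the hand-written configuration above.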

## **Generative Stress Test Results**

Generative Stress Test results are grouped into categories that measure various aspects of generative model robustness (adversarial, transformations, model alignment, subset performance). These categories determine production readiness. Suggestions to improve your model are also aggregated at the category level. By default, tests are ranked by a shared severity metric. Clicking on an individual test surfaces more detailed information.

You can view the detailed results in the UI by running the cell below and following the generated link. This page shows granular results for a given Generative Stress Test run.

In [None]:
test_run = stress_job.get_test_run()
test_run

### **Analyzing the Results**

Below you can see a snapshot of the results. 

![stress_test.png](https://drive.google.com/uc?id=1Tyn3QHU7UM9EMgE3S7WngGSEW-2Mv4BR)

Notice that each test result category states the status of the model with respect to that category. Under each category description is a list of any 'Data Requirements' that must be met for the category to run successfully.

Here are the results of the "Transformations" tests. These tests augment your evaluation dataset with synthetic abnormal values to proactively test your pipeline’s error-handling behavior and measure the performance degradation caused by different types of abnormal values. 

![transformations_category](https://drive.google.com/uc?id=17Ggnf2fw4p3SteTtfitd6l3ZuHxO6eMH)

Below we explore the "Generative Synonym Swap" test cases. This test measures the robustness of your model to synonym swap transformations. It does this by randomly swapping synonyms in the input string and measuring the difference in model outputs between the original input and the transformed input. For the first test case, the SBERT Score, which measures similarity between the original and transformed outputs, is 0.64. This suggests that the model's output changed significantly and that the model is likely not robust to synonym swaps.

![datapoint_details.png](https://drive.google.com/uc?id=19KHVhZaYhm2uv8edqJdTJg0WNef0vMof)
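
To make the idea concrete, here is a toy sketch of the two ingredients described above: a random synonym swap and a cosine similarity between two embedding vectors, which is the usual way SBERT-style similarity scores are computed. RIME's actual implementation is not shown in this walkthrough, so the function names, the toy synonym table, and the choice of cosine similarity are all assumptions for illustration.

```python
import math
import random

# Toy synonym table, made up for this example.
TOY_SYNONYMS = {"big": ["large"], "quick": ["fast"]}


def synonym_swap(text, synonyms, rng, swap_prob=0.5):
    """Replace each word that has a synonym entry with probability swap_prob."""
    out = []
    for word in text.split():
        candidates = synonyms.get(word.lower())
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(word)
    return " ".join(out)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


rng = random.Random(0)
print(synonym_swap("the big dog", TOY_SYNONYMS, rng, swap_prob=1.0))
```

In a real check, the two vectors would be sentence embeddings of the model's outputs for the original and the transformed input. A score close to 1 would indicate that the output barely changed.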

### **Programmatically Querying the Results**

RIME not only provides an intuitive UI to visualize and explore these results, but also lets you query them programmatically. This allows customers to integrate with their MLOps pipeline, log results to experiment management tools like MLflow, bring automated decision making to their ML practices, or store the results for future reference.

Run the cells below to programmatically query the results. The results are returned as a pandas DataFrame.

**Access results at the test run overview level**

In [None]:
test_run_result = test_run.get_result_df()
test_run_result.to_csv("QA_Test_Run_Results.csv")
test_run_result

**Access detailed test results at the individual test case level.**

In [None]:
test_case_result = test_run.get_test_cases_df()
# Use a separate file name so the test run overview CSV is not overwritten.
test_case_result.to_csv("QA_Test_Case_Results.csv")
test_case_result.head()
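
Once the results are exported to CSV, a pipeline step can summarize them without the SDK or pandas installed. The sketch below counts the values in one column using only the standard library. The column names in the exported file are not listed in this walkthrough, so `"status"` in the usage comment is a hypothetical placeholder. Inspect the header of your CSV for the real names.

```python
import csv
from collections import Counter


def count_values(lines, column):
    """Count how often each value appears in one column of CSV text.

    lines: an open text file or any iterable of CSV lines (header first).
    """
    reader = csv.DictReader(lines)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"Column {column!r} not found; available: {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Usage against the file exported above ("status" is a hypothetical column):
# with open("QA_Test_Run_Results.csv", newline="") as f:
#     print(count_values(f, "status"))
```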

## **Appendix**

### **Uploading a Model to RIME**
To be able to run certain tests, RIME needs query access to your model. To give RIME access, you'll need to write a Python file that implements the `generate(system_prompt: str, prompt: str) -> str` function, and upload that file (and any objects that it loads) to the platform. Here we provide an example model file, show you how to upload this file and the relevant model artifacts, and show you how to configure stress tests to use this model.

In [None]:
%%writefile question_answering/models/FLAN_T5_base.py
from pathlib import Path

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load FLAN-T5-base model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def generate(system_prompt: str, prompt: str) -> str:
    """Given a system prompt and prompt, return the response as a string."""
    text = f"{system_prompt}\n{prompt}"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=100)
    # Drop special tokens such as <pad> and </s> from the returned text.
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded
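
Before uploading a model file, it can be useful to check locally that its `generate` function matches the interface described above. The sketch below uses a stand-in echo model so that it runs without downloading any weights. `check_generate_interface` and `echo_generate` are hypothetical helpers for this example. Whether RIME requires these exact parameter names is an assumption based on the signature given above.

```python
import inspect


def check_generate_interface(fn):
    """Return True if fn takes exactly the parameters (system_prompt, prompt)."""
    params = list(inspect.signature(fn).parameters)
    return params == ["system_prompt", "prompt"]


def echo_generate(system_prompt: str, prompt: str) -> str:
    """Stand-in model for local testing: returns the prompt unchanged."""
    return prompt


assert check_generate_interface(echo_generate)
assert isinstance(echo_generate("You are a helpful assistant.", "Hello"), str)
```

The same check can be applied to the `generate` function in your own model file after importing it.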

In [None]:
from datetime import datetime

dt = str(datetime.now())

appendix_model_dir = client.upload_directory(
    Path('question_answering/models'), upload_path=upload_path
)
appendix_model_path = appendix_model_dir + "/FLAN_T5_base.py"
# Note: models need to have unique names
appendix_model_id = project.register_model(
    name=f"appendix_model_{dt}",
    model_config={"generative_language_model": {
        "model_path": appendix_model_path,
        "system_prompt": "I am a Hugging Face language model."
    }},
    skip_validation=True
)

appendix_fact_sheet_path = client.upload_file(
    Path('question_answering/data/fact_sheet.txt'), upload_path=upload_path
)
appendix_tests_config = {
    "row_wise_factual_inconsistency": {"fact_sheet_path": appendix_fact_sheet_path},
}
appendix_test_suite_config = {"individual_tests_config": appendix_tests_config}

stress_test_with_model_config = {
    "run_name":"Uploaded Model Example", 
    "model_task":"Question Answering",
    "model_id":appendix_model_id,
    "eval_dataset_id":eval_dataset_id,
    "run_time_info":{"explicit_errors":False},
    "test_suite_config":appendix_test_suite_config,
    "categories": [
        "TEST_CATEGORY_TYPE_ADVERSARIAL",
        "TEST_CATEGORY_TYPE_DATA_POISONING_DETECTION",
        "TEST_CATEGORY_TYPE_SUBSET_PERFORMANCE",
        "TEST_CATEGORY_TYPE_TRANSFORMATIONS",
        "TEST_CATEGORY_TYPE_EVASION_ATTACK_DETECTION",
        "TEST_CATEGORY_TYPE_MODEL_ALIGNMENT",
        "TEST_CATEGORY_TYPE_FACTUAL_AWARENESS",
        "TEST_CATEGORY_TYPE_BIAS_AND_FAIRNESS"
    ]
}
stress_job = client._start_generative_stress_test(
    test_run_config=stress_test_with_model_config,
    project_id=project.project_id
)
stress_job.get_status(verbose=True, wait_until_finish=True)