# 🚀 **Exploring Braintrust Evaluation APIs**

This notebook explores the evaluation APIs offered by [Braintrust](https://www.braintrustdata.com/docs).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/GiacomoMeloni/ExploringLLMs/blob/main/notebooks/exploring_braintrust.ipynb)

## 🤔  **About Braintrust**
Braintrust is a service that helps you evaluate your LLM app, with a focus on production-quality applications. You can use it through the [Python SDK](https://www.braintrustdata.com/docs/libs/python) or the [Node library](https://www.braintrustdata.com/docs/libs/nodejs).

Evaluations are crucial in AI, and accuracy deserves as much focus as speed when developing LLM-based applications. Braintrust's evaluation-driven approach aims to improve product reliability and quality while speeding up the testing process. The service lets teams share versioned evaluation datasets and either define their own evaluation metrics or use the predefined metrics offered by the framework.

## 🔧 **Get Started**
To access the `Braintrust` dashboard you need an API key. After signing up on the platform, open the "Settings" page of your account and click on "API Keys" to generate a new key for your projects.

⚠️ **Remember that an API key is visible only once, when it is created, and it won't be displayed again. Store it safely!**

ℹ️ **The API key needs to be stored as an environment variable named `BRAINTRUST_API_KEY`**
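
One way to make sure the variable is set before running the rest of the notebook is sketched below. The helper name `ensure_env_var` is hypothetical and not part of the Braintrust SDK; it only relies on the standard library and asks for the value interactively when the variable is missing.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_env_var(name: str, prompt: Callable[[str], str] = getpass) -> bool:
    """Make sure `name` is set in os.environ, asking for it if it is missing.

    Returns True when the variable ends up set to a non-empty value.
    """
    if not os.environ.get(name):
        # getpass hides the typed value, so the key does not end up in the cell output
        value = prompt(f"Enter {name}: ").strip()
        if value:
            os.environ[name] = value
    return bool(os.environ.get(name))
```

Calling `ensure_env_var("BRAINTRUST_API_KEY")` in a cell would then prompt for the key only when it is not already defined.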

Another point worth mentioning is that you need to provide a GenAI provider API key (e.g. OpenAI, Anthropic, etc.) to use some of the model-based metrics covered later in this notebook.

ℹ️ **You do not need an enterprise plan to use the basic evaluation features offered by the library**

ℹ️ **For more details on API key set-up, refer to the [Create an API key](https://www.braintrustdata.com/docs#create-an-api-key) guide section**

In [None]:
!pip install braintrust autoevals openai

In [None]:
# Basic check to get the environment in which the notebook is running

if 'google.colab' in str(get_ipython()):
    COLAB_ENV = True
    from google.colab import userdata
else:
    COLAB_ENV = False

## 🧑🏼‍🚀 Set up
To access all the functionality offered by Braintrust we need a Braintrust API key. Additionally, the basic test use case illustrated in this notebook calls an LLM through the Braintrust AI proxy (the code below uses `mistral-7b-instruct` with a PerplexityAI key), so an LLM-provider API key is needed as well.

🗝️ To make this notebook work, set up the following environment variables:


| Env Variable              | Description                                          |
|---------------------------|------------------------------------------------------|
| `BRAINTRUST_API_KEY`      | Braintrust API key                                   |
| `OPENAI_API_KEY`          | OpenAI API key                                       |
| `PPLXAI_API_KEY`          | PerplexityAI API key                                 |
| `TRUSTBRAIN_PROJECT_NAME` | The name assigned to Braintrust's Evaluation Project | 

In [None]:
import os
from openai import OpenAI
from braintrust import Eval, init_dataset
from datetime import datetime
from autoevals import Factuality, LevenshteinScorer 

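# On Colab the key is stored as a secret: copy it into the environment,
# which is where the notebook expects `BRAINTRUST_API_KEY` to be defined
if COLAB_ENV:
    os.environ['BRAINTRUST_API_KEY'] = userdata.get('BRAINTRUST_API_KEY') or ''

check_bt_api_key = bool(os.environ.get('BRAINTRUST_API_KEY'))

PROJECT_NAME: str = os.environ.get('TRUSTBRAIN_PROJECT_NAME')

assert check_bt_api_key, "❌ A Braintrust API Key is needed for the purpose of this notebook!"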

## 🛜 AI Proxy
Braintrust also provides an AI proxy that gives access to different LLM providers and models through a single API. This guide uses a PerplexityAI API key, but OpenAI, Anthropic or other provider API keys work as well.

ℹ️ More information is provided in the official [AI Proxy](https://www.braintrustdata.com/docs/guides/proxy) guide section.

In [None]:
# This project uses a PerplexityAI API key. OpenAI, Anthropic or other provider API keys work as well.
client_key = os.environ.get('PPLXAI_API_KEY') if not COLAB_ENV else userdata.get('PPLXAI_API_KEY')

# Check the key before building the client, so that a missing key fails with a clear message
assert client_key, "❌ An LLM-provider API Key is needed for the purpose of this notebook!"

client = OpenAI(
  base_url="https://braintrustproxy.com/v1",
  api_key=client_key
)

## 💡 Use Case: Sentiment Classifier
Imagine a scenario where you need a custom classifier to detect positive or negative reviews of your products. This is a common classification scenario that can benefit from an evaluation pipeline.

As a first step, an evaluation dataset needs to be defined. This dataset consists of a list of objects composed of the following elements:

* `input`: the input to evaluate. In this scenario it is the review, but if you were evaluating a question answering model it might be a question;
* `output`: the expected application response (you can name this field *expected* directly in your evaluation dataset, which is what the framework expects to receive and what the records below use);
* `metadata`: a set of key-value pairs that you can use to filter and group your data.

These data will be used to evaluate the LLM-based app later! 

ℹ️ For further information about evaluation datasets, see the [Datasets](https://www.braintrustdata.com/docs/guides/datasets) page in the official guide.
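
Before running an evaluation it can help to check that every record follows the structure described above. The following is a minimal sketch, assuming records use the `input` and `expected` keys plus an optional `metadata` dictionary; `validate_record` is a hypothetical helper, not a Braintrust function.

```python
from typing import Any

# Keys every record in this notebook is assumed to carry
REQUIRED_KEYS = {"input", "expected"}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return the list of problems found in one evaluation record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # metadata is optional, but when present it must hold key-value pairs
    if "metadata" in record and not isinstance(record["metadata"], dict):
        problems.append("metadata must be a dict of key-value pairs")
    return problems


sample = {
    "input": "Great customer service, very helpful and responsive.",
    "expected": "POSITIVE",
    "metadata": {"source": "demo"},
}
print(validate_record(sample))
```

Running the same check over the whole dataset, for example with `[validate_record(r) for r in eval_set]`, surfaces malformed records before any model call is made.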

In [None]:
eval_set = [
  {
    "input": "The product exceeded my expectations. Even Gandalf would love it!",
    "expected": "POSITIVE"
  },
  {
    "input": "How's the weather?",
    "expected": "NA"
  },
  {
    "input": "Terrible experience with this product. I would not recommend it.",
    "expected": "NEGATIVE"
  }, 
  {
    "input": "The service was not professional, how could you offer this experience to your clients?",
    "expected": "NEGATIVE"
  },
  {
    "input": "Great customer service, very helpful and responsive.",
    "expected": "POSITIVE"
  }
]

## ⬆️ Upload Dataset
Braintrust lets you upload your dataset so that a whole team can share the same evaluation records. It is also possible to store it on your own cloud infrastructure!

⚠️ **Uploading datasets is available only to enterprise users, so skip this cell if you do not have an enterprise subscription plan with Braintrust!**

In [None]:
# Upload these records to a new dataset in Braintrust so your teammates can also easily use and manage it.

# dataset = init_dataset(PROJECT_NAME,
#                        name="Sentiment Evaluation for Product reviews",
#                        description="A set of customer review samples for evaluating an LLM zero-shot task response",
#                        version="0.1")
#
# for test_case in eval_set:
#     dataset.insert(**test_case)

## ©️ Prompt Setup
In LLM-based applications, the prompt is the core element that steers the generative model's behaviour. In this notebook there are different system prompts that need to be tested.

In [None]:
# 🖋️ Set your system prompt here

# system_prompt = """You are the review classifier, your job is to classify customer reviews in two categories: "POSITIVE" or "NEGATIVE". Do not answer to any question and if some question arrives that is not related to a customer review respond with "NA" """

system_prompt = """
You are the review classifier, your job is to classify customer reviews in two categories: "POSITIVE" or "NEGATIVE". If a question is posed in the input respond with "NA". 
Format: reply only with "POSITIVE", "NEGATIVE" or "NA".
"""

review_prompt = """
Review: {review}
Category:
"""

In [None]:
# The classifier supports only three labels as defined here

SUPPORTED_LABELS = ["POSITIVE", "NEGATIVE", "NA"]
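
Model outputs often come back with extra whitespace, quotes or a trailing period, which would make an otherwise correct answer fall outside the supported labels. The sketch below shows one possible way to normalize a raw response before comparing it; `normalize_label` and `is_supported` are hypothetical helpers that are not used elsewhere in this notebook.

```python
SUPPORTED_LABELS = ["POSITIVE", "NEGATIVE", "NA"]


def normalize_label(raw: str) -> str:
    """Strip surrounding whitespace, quotes and end punctuation, then upper-case."""
    return raw.strip().strip("\"'.!").strip().upper()


def is_supported(raw: str) -> bool:
    """Tell whether a raw model response maps to one of the supported labels."""
    return normalize_label(raw) in SUPPORTED_LABELS


for response in [' "positive". ', "NA", "I think it is positive"]:
    print(repr(response), "->", normalize_label(response), is_supported(response))
```

Whether to normalize is a design choice: scoring the raw output, as the scorers in this notebook do, also measures how well the model respects the requested format.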

## 🧰 Task definition 
To evaluate an LLM-based pipeline, you need to define a function containing all the steps to be evaluated. This function is then passed as a parameter to `Eval` to start the tests.

In [None]:
def generation_pipeline(input: str):
  response = client.chat.completions.create(
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": review_prompt.format(review=input)}],
    model="mistral-7b-instruct", 
    max_tokens=50)
  
  return response.choices[0].message.content
  # import numpy as np
  # return SUPPORTED_LABELS[np.random.randint(0,3)] 

In [None]:
# This cell is just to verify that everything is set up correctly!
review_input = input('Insert your review here: ')
# Classify the review typed above
result = generation_pipeline(review_input)

print("Result: ", result)

| Metrics              | Type                       | Status    |
|----------------------|----------------------------|-----------|
| Factuality           | Model-Based Classification | Available |
| Levenshtein distance | Heuristic                  | Available |
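
The heuristic metric in the table can be illustrated without any API call. The following is a minimal sketch, assuming the score is one minus the edit distance divided by the length of the longer string; the function names are hypothetical and this is not a copy of the `autoevals` implementation.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(output: str, expected: str) -> float:
    """Normalized score in [0, 1]: 1.0 means the two strings are identical."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(output, expected) / longest


print(similarity("POSITIVE", "POSITIVE"))
print(similarity("POSITIVE", "NEGATIVE"))
```

A score like this rewards answers that are textually close to the expected label, which is why it is paired here with a model-based metric and the stricter custom scorers.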



In [None]:
# 🤖 Factuality is one of the model-based metrics offered by the library 
fact_evaluator = Factuality(model="mistral-7b-instruct",
                            api_key=client_key,
                            base_url="https://braintrustproxy.com/v1") 
leven_evaluator = LevenshteinScorer()

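# Scores 1 when the response is a single word, 0 otherwise
def single_response_scorer(input, output=None, expected=None):
  if len(output.split())>1:
    return 0
  
  return 1
  
# Scores 1 when both the response and the expected value are supported labels
def label_scorer(input, output=None, expected=None):
  if output not in SUPPORTED_LABELS:
    return 0
  if expected not in SUPPORTED_LABELS:
    return 0
  
  return 1

# Scores 1 when both the response and the expected value are upper-case
def format_scorer(input, output=None, expected=None):
  if not str.isupper(output):
    return 0
  if not str.isupper(expected):
    return 0
  
  return 1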
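
Custom scorers like these are plain functions, so they can be tried on a few hand-written cases before being passed to an evaluation. The sketch below follows the same `(input, output, expected)` signature and 0/1 return convention used in this notebook; `exact_match_scorer` is a hypothetical extra scorer, not one provided by the library.

```python
def exact_match_scorer(input, output=None, expected=None):
    """Return 1 when the predicted label equals the expected one, else 0."""
    if output is None or expected is None:
        return 0
    # Compare case-insensitively and ignore surrounding whitespace
    return int(output.strip().upper() == expected.strip().upper())


cases = [
    ("Great customer service.", "POSITIVE", "POSITIVE"),
    ("Terrible experience.", "positive", "NEGATIVE"),
    ("How's the weather?", None, "NA"),
]

for review, predicted, expected in cases:
    score = exact_match_scorer(review, output=predicted, expected=expected)
    print(f"{review!r}: predicted={predicted!r} expected={expected!r} score={score}")
```

Guarding against a missing `output` keeps the scorer from raising when the task returns nothing, at the cost of counting that case as a plain failure.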

In [None]:
await Eval(PROJECT_NAME,
           data=eval_set,
           task=generation_pipeline,
           scores=[fact_evaluator, leven_evaluator, single_response_scorer, label_scorer, format_scorer],
           metadata={"experiment_name": f"Evaluation-{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"})