# Logging, evaluating, and tracing Cerebras models with Braintrust

## Setup

Let's install some dependencies.


%pip install autoevals braintrust openai


Braintrust knows how to intercept calls to the `openai` client library to automatically trace them. Since Cerebras has an OpenAI-compatible API, it's a breeze to set this up!


In [1]:
import os

import openai
import braintrust

client = braintrust.wrap_openai(
    openai.OpenAI(
        api_key=os.getenv("CEREBRAS_API_KEY"),
        base_url="https://api.cerebras.ai/v1",
    )
)
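
One thing to watch in the cell above: `os.getenv` returns `None` when the variable is unset, and the `openai` client then falls back to `OPENAI_API_KEY` if that happens to be defined. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper, not part of Braintrust or the `openai` library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for this notebook; it is not a Braintrust or openai API.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating the client")
    return value


# Sketch of how it could replace os.getenv in the client setup:
#   openai.OpenAI(api_key=require_env("CEREBRAS_API_KEY"), base_url="https://api.cerebras.ai/v1")
```

This way a missing key surfaces when the client is built rather than on the first request.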

## Logging

To log to Braintrust, simply initialize a logger. All Cerebras model calls will be automatically traced and logged to Braintrust. This works for both streaming and non-streaming calls.


In [3]:
braintrust.init_logger("Cerebras test")

response = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of Nevada?"},
    ],
)

print(response.choices[0].message.content)

The capital of Nevada is Carson City.


In Braintrust, we'll see the completion alongside metrics such as duration and token counts. Wow, Cerebras is fast!

![Log view](./assets/Log-view.png)


If you enter your Cerebras API key in Braintrust (under Settings -> AI providers), you can also reproduce the call in the UI, and even tweak the prompt!

![Tweak prompt](./assets/Tweak-prompt.gif)
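
As noted at the start of this section, streaming calls are traced as well. Here is a minimal sketch of what a streaming call could look like, assuming the wrapped `client` from the setup cell and the standard `stream=True` option of the OpenAI chat completions API. `stream_text` is a hypothetical helper name introduced for illustration.

```python
def stream_text(client, prompt, model="llama3.1-8b"):
    """Stream a chat completion, print tokens as they arrive, and return the full text.

    `client` is expected to be an OpenAI-compatible client, such as the wrapped
    client created in the setup cell.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Some chunks may carry no choices at all; skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        # The first and last chunks often have no text content.
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)


# Sketch of a call from the notebook:
#   stream_text(client, "What is the capital of Nevada?")
```

The helper takes the client as an argument so it can be exercised with any OpenAI-compatible object.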

## Evaluating

Evals automatically support Cerebras models as well. Let's run a simple math eval and see how the model does.


In [8]:
from braintrust import Eval
from autoevals import Factuality

await Eval(
    "Cerebras test",
    data=[
        {"input": "What is 100-94?", "expected": "6"},
        {"input": "square root of 16?", "expected": "4"},
    ],
    task=lambda input: client.chat.completions.create(
        model="llama3.1-8b",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": input},
        ],
    )
    .choices[0]
    .message.content,
    # We'll use the smarter Llama 3.1-70b model to evaluate the output.
    scores=[Factuality(model="llama3.1-70b", api_key=os.environ["CEREBRAS_API_KEY"])],
)


Experiment add-bt-1727847550 is running at https://www.braintrust.dev/app/braintrustdata.com/p/Cerebras%20test/experiments/add-bt-1727847550
Cerebras test (data): 2it [00:00, 25420.02it/s]


Cerebras test (tasks):   0%|          | 0/2 [00:00<?, ?it/s]


50.00% 'Factuality' score

0.18s duration
0.17s llm_duration
27.50tok prompt_tokens
10tok completion_tokens
37.50tok total_tokens
0.00$ estimated_cost

See results for add-bt-1727847550 at https://www.braintrust.dev/app/braintrustdata.com/p/Cerebras%20test/experiments/add-bt-1727847550


EvalResultWithSummary(summary="...", results=[...])

Looks like the output is getting penalized for containing an explanation of how to solve the problem. Let's tweak the prompt and try again.


In [9]:
from braintrust import Eval
from autoevals import Factuality

await Eval(
    "Cerebras test",
    data=[
        {"input": "What is 100-94?", "expected": "6"},
        {"input": "square root of 16?", "expected": "4"},
    ],
    task=lambda input: client.chat.completions.create(
        model="llama3.1-8b",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Solve the problem and provide the answer only.",
            },
            {"role": "user", "content": input},
        ],
    )
    .choices[0]
    .message.content,
    # We'll use the smarter Llama 3.1-70b model to evaluate the output.
    scores=[Factuality(model="llama3.1-70b", api_key=os.environ["CEREBRAS_API_KEY"])],
)

Experiment add-bt-1727847722 is running at https://www.braintrust.dev/app/braintrustdata.com/p/Cerebras%20test/experiments/add-bt-1727847722
Cerebras test (data): 2it [00:00, 1514.19it/s]


Cerebras test (tasks):   0%|          | 0/2 [00:00<?, ?it/s]


add-bt-1727847722 compared to add-bt-1727847550:
100.00% (+50.00%) 'Factuality' score	(2 improvements, 0 regressions)

25.38s (+2519.12%) 'duration'         	(0 improvements, 2 regressions)
25.37s (+2519.73%) 'llm_duration'     	(0 improvements, 2 regressions)
36.50tok (+900.00%) 'prompt_tokens'    	(0 improvements, 2 regressions)
2tok (-800.00%) 'completion_tokens'	(2 improvements, 0 regressions)
38.50tok (+100.00%) 'total_tokens'     	(0 improvements, 1 regressions)
0.00$ (+00.00%) 'estimated_cost'   	(0 improvements, 0 regressions)

See results for add-bt-1727847722 at https://www.braintrust.dev/app/braintrustdata.com/p/Cerebras%20test/experiments/add-bt-1727847722


EvalResultWithSummary(summary="...", results=[...])

Excellent! It looks like we improved both cases.

![Updated eval](./assets/Eval-2.gif)
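
The two experiments above differ only in the system prompt, and the expected answers are short strings. The sketch below factors the task into a reusable function and adds a deterministic scorer to sit next to `Factuality`. Both `make_task` and `normalized_match` are hypothetical helpers, and the scorer assumes Braintrust calls plain scoring functions with `input`, `output`, and `expected` keyword arguments.

```python
def make_task(client, system_prompt, model="llama3.1-8b"):
    """Build an eval task that sends each input to the model with a fixed system prompt."""

    def task(input):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": input},
            ],
        )
        return response.choices[0].message.content

    return task


def _normalize(text):
    # Trim whitespace and a trailing period so "6." and " 6 " both compare equal to "6".
    return str(text).strip().rstrip(".").strip().lower()


def normalized_match(input, output, expected, **kwargs):
    """Return 1.0 when the output equals the expected answer after normalization, else 0.0.

    Assumption: Braintrust passes input/output/expected as keyword arguments;
    **kwargs absorbs anything extra, such as metadata.
    """
    return 1.0 if _normalize(output) == _normalize(expected) else 0.0


# Sketch of how these could plug into the eval above:
#   task=make_task(client, "You are a helpful assistant. Solve the problem and provide the answer only."),
#   scores=[normalized_match, Factuality(model="llama3.1-70b", api_key=os.environ["CEREBRAS_API_KEY"])],
```

An exact-match scorer like this is stricter than `Factuality`: it gives 0 to any answer that includes an explanation, so it is only useful once the prompt asks for the answer alone.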

## Where to go from here

Now that you can log and evaluate your Cerebras models, you can ship applications knowing that you can reproduce user issues,
run evals to improve your prompts, and keep iterating with confidence.

To learn more about Braintrust, check out the [docs](https://braintrust.dev/docs).
