# Submitting Custom Evaluations for your LLM Application

Datadog's LLM Observability tool lets you evaluate your LLM application and tie each evaluation to a specific trace.

In this notebook, we'll build a simple LLM application that collects user feedback. We'll generate dummy feedback data and submit it to LLM Observability as a custom evaluation.

#### Learning Goals
- Understand how to export span context to correlate with your custom evaluation
- Understand how to submit custom evaluations tied to specific traces

### Initial Setup

In [2]:
from dotenv import load_dotenv
import os
import openai
import time

from ddtrace.llmobs import LLMObs

load_dotenv()

LLMObs.enable(
    api_key=os.environ.get("DD_API_KEY"),
    site=os.environ.get("DD_SITE", "datadoghq.com"),
    ml_app="ask-llmobs-docs",
    agentless_enabled=True,
)
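
Before enabling LLM Observability, it can help to confirm that the credentials the notebook relies on are actually set. The sketch below is only an illustration: `missing_env_vars` is a hypothetical helper (not part of `ddtrace`), and it assumes the OpenAI client reads its key from `OPENAI_API_KEY`, which is that client's default.

```python
import os

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# DD_API_KEY is read in the cell above; OPENAI_API_KEY is an assumption
# based on the OpenAI client's default configuration.
missing = missing_env_vars(["DD_API_KEY", "OPENAI_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If anything is reported as missing, add it to your `.env` file and re-run `load_dotenv()` before calling `LLMObs.enable()`.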

## Sample LLM Application

### Instrumenting the Application

Here's a sample LLM application workflow with two steps:
1. Preprocess the user input to sanitize it
2. Send the sanitized input to OpenAI in an LLM call

In [13]:
def llm_call(prompt):
    client = openai.OpenAI()
    resp = client.completions.create(model="gpt-3.5-turbo-instruct", prompt=prompt, temperature=0.9, max_tokens=50)
    return resp.choices[0].text

def sanitize_prompt(prompt):
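    time.sleep(0.3)  # to simulate more complex work
    # Note: str.strip(chars) removes any leading/trailing characters found in
    # the given set of characters; it does not match the literal substring.
    sanitized_prompt = prompt.strip("invalid_string").strip()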
    return sanitized_prompt


def workflow_trace(prompt):
    sanitized_prompt = sanitize_prompt(prompt)
    resp = llm_call(sanitized_prompt)
    return resp
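
A note on `sanitize_prompt()`: `str.strip(chars)` treats its argument as a set of characters rather than a literal marker, so it can also remove legitimate letters from the start or end of a prompt. If you want to remove an exact marker instead, one possible alternative is sketched below. `strip_marker` and its default marker are hypothetical names chosen for illustration, and `removeprefix`/`removesuffix` require Python 3.9 or later.

```python
def strip_marker(prompt, marker="invalid_str"):
    """Remove a literal marker from both ends of the prompt (Python 3.9+)."""
    cleaned = prompt.strip()
    cleaned = cleaned.removeprefix(marker).removesuffix(marker)
    return cleaned.strip()

# Character-set stripping eats the leading word "start", because every one
# of its letters appears in "invalid_string".
print(repr("start of a story".strip("invalid_string").strip()))

# Literal-marker stripping leaves ordinary prompts untouched...
print(repr(strip_marker("start of a story")))

# ...and still removes the exact marker when it is present.
print(repr(strip_marker("invalid_str   Who is the greatest?   invalid_str")))
```

The trade-off is that the literal version only removes the exact marker, so a misspelled marker would be left in place.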

We only need to instrument `sanitize_prompt()` and `workflow_trace()` ourselves, since the OpenAI call is auto-instrumented.

In [14]:
from ddtrace.llmobs.decorators import task, workflow

@task
def sanitize_prompt(prompt):
    time.sleep(0.3)  # to simulate more complex work
    sanitized_prompt = prompt.strip("invalid_string").strip()
    LLMObs.annotate(input_data=prompt, output_data=sanitized_prompt)
    return sanitized_prompt

@workflow
def workflow_trace(prompt):
    sanitized_prompt = sanitize_prompt(prompt)
    resp = llm_call(sanitized_prompt)
    LLMObs.annotate(input_data=prompt, output_data=resp)
    return resp

Now that the application is instrumented, run it:

In [15]:
print(workflow_trace("invalid_str                                Who is the greatest basketball player of all time and why?              invalid_str"))



The greatest basketball player of all time is widely considered to be Michael Jordan. There are several reasons why he is considered the greatest:

1. Unmatched Accomplishments: Jordan won 6 NBA Championships, 5 MVP awards, and 10


## Custom Evaluations

Now that the application is instrumented, each run submits a trace to LLM Observability. Next, suppose we also collect user feedback on each response. Here's a dummy helper that stands in for that feedback:

In [19]:
def _measure_user_satisfaction(resp):
    """Dummy feedback generator, recording user satisfaction with the response on a scale of 1 to 10."""
    return 10
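
The dummy helper always returns a valid score, but real feedback may arrive as a string, be out of range, or be missing entirely. A minimal sketch of normalizing such input before submitting it as a score is shown below. `normalize_score` is a hypothetical helper, and the 1 to 10 bounds are taken from the docstring above.

```python
import math

def normalize_score(raw, low=1, high=10):
    """Coerce raw feedback to a float clamped to [low, high].

    Returns None when the input cannot be interpreted as a number.
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if math.isnan(value):
        return None
    return float(max(low, min(high, value)))

print(normalize_score("7"))      # numeric string is accepted
print(normalize_score(42))       # clamped to the upper bound
print(normalize_score("great"))  # not numeric, so None
```

Returning `None` for unusable input lets the caller decide whether to skip the evaluation rather than submit a misleading value.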

We can submit this custom evaluation to LLM Observability by correlating it with the trace from before. We just need to make two tweaks:
- Export the span context of `workflow_trace()` using `LLMObs.export_span()` and save that for later.
- Pass that exported span context from `workflow_trace()` to `record_user_satisfaction()`, and use `LLMObs.submit_evaluation()` to submit the evaluation to Datadog.

In [21]:
@workflow
def workflow_trace(prompt):
    sanitized_prompt = sanitize_prompt(prompt)
    resp = llm_call(sanitized_prompt)
    LLMObs.annotate(input_data=prompt, output_data=resp)
    # Export the span here and return it as well as the response
    return resp

def record_user_satisfaction(resp, span_context):
    satisfaction_value = _measure_user_satisfaction(resp)
    # Submit the evaluation to Datadog using the exported span context

Once you've done that, your code should look something like this:

In [22]:
@workflow
def workflow_trace(prompt):
    sanitized_prompt = sanitize_prompt(prompt)
    resp = llm_call(sanitized_prompt)
    LLMObs.annotate(input_data=prompt, output_data=resp)
    # Export the span here and return it as well as the response
    span_context = LLMObs.export_span()
    return resp, span_context

def record_user_satisfaction(resp, span_context):
    satisfaction_value = _measure_user_satisfaction(resp)
    # Submit the evaluation to Datadog using the exported span context
    LLMObs.submit_evaluation(span_context, label="user_satisfaction", metric_type="score", value=satisfaction_value)
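
If `LLMObs.export_span()` is called when no LLM Observability span is active (for example, outside a decorated function), there is no span to correlate with. The sketch below assumes the export yields `None` in that case and skips the submission instead of failing. `submit_if_exported` and `fake_submit` are hypothetical names, and the dictionary used as a span context is an illustrative stand-in. In the notebook you would pass `LLMObs.submit_evaluation` as `submit`.

```python
def submit_if_exported(span_context, submit, **evaluation):
    """Call `submit` only when a span context was actually exported.

    Returns True if the evaluation was handed to `submit`, False otherwise.
    """
    if not span_context:
        print("No span context exported; skipping evaluation.")
        return False
    submit(span_context, **evaluation)
    return True

# Stand-in for LLMObs.submit_evaluation so the sketch runs without credentials.
submitted = []

def fake_submit(span_context, **evaluation):
    submitted.append((span_context, evaluation))

# Skipped: nothing was exported.
submit_if_exported(None, fake_submit,
                   label="user_satisfaction", metric_type="score", value=10)

# Submitted: the span context here is an assumed, illustrative shape.
submit_if_exported({"span_id": "123", "trace_id": "456"}, fake_submit,
                   label="user_satisfaction", metric_type="score", value=10)
```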

Run your LLM workflow, then record the user satisfaction for its response.

In [None]:
resp, span_context = workflow_trace("invalid_str                                Who is the greatest soccer player of all time and why?             invalid_str")
print(resp)
record_user_satisfaction(resp, span_context)

Now, try checking out the [LLM Observability interface](https://app.datadoghq.com/llm) in Datadog. You should see a trace that describes the workflow we just ran, and you should see the custom evaluation we've associated with this trace.
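
Numeric scores are not the only option: `submit_evaluation()` also accepts `metric_type="categorical"` with a string value. As a sketch, a satisfaction score could be bucketed into a coarse label before submission. The `satisfaction_category` helper, its thresholds, and the `user_sentiment` label are assumptions made for illustration.

```python
def satisfaction_category(score):
    """Map a 1-10 satisfaction score to a coarse label (thresholds are illustrative)."""
    if score >= 8:
        return "satisfied"
    if score >= 5:
        return "neutral"
    return "dissatisfied"

category = satisfaction_category(10)
print(category)

# In the notebook, this could be submitted alongside the score, for example:
# LLMObs.submit_evaluation(span_context, label="user_sentiment",
#                          metric_type="categorical", value=category)
```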