# Using Additional Input Data 📥

In some cases, a single input value isn't enough; you may need to process multiple values for a task.

For this example, we'll use a Taskmaster dataset that contains comments and topics. Our goal is to evaluate whether a given comment is relevant to its list of topics.

Additionally, your model's output may need to include multiple values. In this case, instead of just returning a label (e.g., "in-topic" or "off-topic"), we might also want to return a confidence score.
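To make the shape of such a task concrete, here is a minimal offline sketch of a function that reads two input fields and returns two output values. Everything in it is hypothetical: the function name `toy_topic_relevance`, the sample record, the assumption that `topics` is a comma-separated string, and the keyword-overlap heuristic, which only stands in for a model call and is not the method used in this notebook.

```python
from typing import Any, Dict


def toy_topic_relevance(input_data: Dict[str, Any]) -> Dict[str, Any]:
    """Toy stand-in for a model call: label a comment by keyword overlap."""
    # Input 1: the comment, split into lowercase words.
    comment_words = set(input_data["prompt"].lower().split())
    # Input 2: the topics, assumed here to be a comma-separated string.
    topics = [t.strip().lower() for t in input_data["topics"].split(",")]

    # Count how many topics appear verbatim as a word in the comment.
    hits = sum(1 for topic in topics if topic in comment_words)
    matched_fraction = hits / len(topics) if topics else 0.0

    # Output 1: a label. Output 2: a made-up confidence score.
    label = "in-topic" if hits > 0 else "off-topic"
    confidence = matched_fraction if hits > 0 else 1.0 - matched_fraction
    return {"label": label, "confidence": confidence}


# Hypothetical record, invented for illustration only.
example = {"prompt": "I want to order a pizza", "topics": "pizza, movies"}
result = toy_topic_relevance(example)
print(result)
```

Returning a dictionary keeps every output value addressable by name, so a later evaluation step can read the label and the score independently.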

In [None]:
import os
import math

from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)

from typing import Dict, Any

from ddtrace.llmobs import LLMObs

from openai import OpenAI

LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    project_name="Onboarding",
    ml_app="Onboarding-ML-App",
)

oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [None]:
# Specify the columns that contain the input and expected output.
dataset = LLMObs.create_dataset_from_csv(
    csv_path="./data/taskmaster.csv",
    dataset_name="taskmaster-mini",
    input_data_columns=["prompt", "topics"],
    expected_output_columns=["labels"],
)

In [None]:
# A label of False means the comment is on topic for its list of topics; True means it is off topic.
dataset.as_dataframe()

## Task definition

The following task checks whether a prompt is relevant to a set of topics. Both values come from the dataset.

The task returns several metrics, which the evaluators then use.

The certainty computation is somewhat involved. Feel free to skip it; you don't need it to understand the workflow.
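For readers who do want the details, the certainty calculation can be sketched on its own, without an API call. The helper name `binary_certainty` is hypothetical, and treating the case where neither answer has any probability mass as maximally uncertain is an assumption of this sketch. The idea is to turn the two log probabilities into a normalized YES/NO distribution, then report one minus its binary entropy.

```python
import math
from typing import Dict


def binary_certainty(yes_logprob: float, no_logprob: float) -> Dict[str, float]:
    """Normalize two log probabilities and compute an entropy-based certainty."""
    # Convert log probabilities to raw probabilities (exp(-inf) is 0.0).
    yes_raw = math.exp(yes_logprob)
    no_raw = math.exp(no_logprob)
    total = yes_raw + no_raw

    if total == 0:
        # Assumption: no information about either answer means no certainty.
        return {"yes_probability": 0.5, "no_probability": 0.5, "certainty": 0.0}

    # Renormalize so the two probabilities sum to 1.
    yes_p = yes_raw / total
    no_p = no_raw / total

    # Binary entropy in bits: 1.0 for a 50/50 split, 0.0 when one side has all the mass.
    entropy = -sum(p * math.log2(p) for p in (yes_p, no_p) if p > 0)
    return {"yes_probability": yes_p, "no_probability": no_p, "certainty": 1.0 - entropy}


even_split = binary_certainty(math.log(0.5), math.log(0.5))
skewed = binary_certainty(math.log(0.9), math.log(0.1))
one_sided = binary_certainty(0.0, float("-inf"))
print(even_split["certainty"], skewed["certainty"], one_sided["certainty"])
```

Note that certainty is stricter than raw probability: a 90/10 split yields a certainty of only about 0.53, so a certainty threshold of 0.8 corresponds to a much more lopsided distribution.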

In [None]:
# A task that uses both the prompt and the topics from the input to decide whether the prompt is relevant.
def topic_relevance(input_data, config):
    output = oai_client.chat.completions.create(
        model=config['model'],
        messages=[
            {"role": "system", "content": f"You are a {config['personality']} assistant that can detect if a comment is in topic with a given list of topics. Return YES if it is, otherwise return NO. Nothing else."},
            # Read from the input_data argument, not the built-in (or a global) named `input`.
            {"role": "user", "content": f"Comment: {input_data['prompt']}\n\nTopics: {input_data['topics']}"}
        ],
        logprobs=True,
        top_logprobs=10,
        temperature=config["temperature"]
    )

    # Tolerate surrounding whitespace and an empty (None) message body.
    response = (output.choices[0].message.content or "").strip() == "YES"

    # Get logprobs for YES and NO responses
    logprobs = output.choices[0].logprobs.content[0].top_logprobs
    yes_prob = next((lp.logprob for lp in logprobs if lp.token == "YES"), float("-inf"))
    no_prob = next((lp.logprob for lp in logprobs if lp.token == "NO"), float("-inf"))

    # Convert log probabilities to raw probabilities
    yes_raw_prob = math.exp(yes_prob)
    no_raw_prob = math.exp(no_prob)

    # Normalize probabilities to get proper probability distribution
    total_prob = yes_raw_prob + no_raw_prob
    if total_prob > 0:  # Avoid division by zero
        yes_norm_prob = yes_raw_prob / total_prob
        no_norm_prob = no_raw_prob / total_prob
    else:
        # Neither token appeared in the top logprobs, so treat both answers as equally likely
        yes_norm_prob = 0.5
        no_norm_prob = 0.5

    # Calculate normalized confidence for the chosen response
    confidence = yes_norm_prob if response else no_norm_prob

    # Calculate entropy-based certainty (1 = perfectly certain, 0 = completely uncertain)
    if yes_norm_prob > 0 and no_norm_prob > 0:
        entropy = -(yes_norm_prob * math.log2(yes_norm_prob) + no_norm_prob * math.log2(no_norm_prob))
        max_entropy = 1.0  # Maximum entropy for binary choice
        certainty = 1 - (entropy / max_entropy)
    else:
        certainty = 1.0  # If one probability is 0, the model is completely certain

    return {
        "response": str(not response),  # Inverted to match the dataset labels (False = in topic)
        "confidence": confidence,       # Normalized probability of chosen answer
        "certainty": certainty,         # Entropy-based measure of model certainty
        "yes_probability": yes_norm_prob,
        "no_probability": no_norm_prob,
        "raw_confidence": math.exp(yes_prob if response else no_prob)  # Unnormalized probability of chosen answer
    }


# Evaluator: does the task's response equal the expected label?
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data["response"]

# Evaluator: flag answers where the certainty score is above 0.8 but the response is not the expected output.
def false_confidence(input_data, output_data, expected_output):
    return output_data["certainty"] > 0.8 and expected_output != output_data["response"]


experiment = LLMObs.experiment(
    name="taskmaster-experiment",
    dataset=dataset,
    task=topic_relevance,
    evaluators=[exact_match, false_confidence],
    config={"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"},
)

In [None]:
# Let's test just on one sample
input = dataset[0]["input_data"]
output = topic_relevance(input, {"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"})
print(output)
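The evaluators can be sanity-checked the same way, without calling the model. The sketch below redefines two evaluators of the same shape so that it is self-contained, and feeds them hand-written outputs. The sample values are invented for illustration, and it assumes the expected label is the string `"True"` or `"False"`, matching what the task returns in `response`.

```python
def exact_match(input_data, output_data, expected_output):
    # True when the task's response equals the expected label.
    return expected_output == output_data["response"]


def false_confidence(input_data, output_data, expected_output):
    # True when the task is highly certain and also wrong.
    return output_data["certainty"] > 0.8 and expected_output != output_data["response"]


# Hypothetical input and task outputs, written by hand.
sample_input = {"prompt": "hypothetical comment", "topics": "hypothetical topics"}
confident_wrong = {"response": "True", "certainty": 0.95}
unsure_wrong = {"response": "True", "certainty": 0.40}
confident_right = {"response": "False", "certainty": 0.95}

for name, out in [
    ("confident_wrong", confident_wrong),
    ("unsure_wrong", unsure_wrong),
    ("confident_right", confident_right),
]:
    print(
        name,
        exact_match(sample_input, out, "False"),
        false_confidence(sample_input, out, "False"),
    )
```

Only the confident, wrong output trips `false_confidence`, which is the case this evaluator is meant to surface.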

Great! Now let's run the experiment.

In [None]:
results = experiment.run(jobs=50)

experiment.url

Open the link above in Datadog (it may take a few seconds to become accessible) to see the experiment, which uses a dataset imported from a CSV file and more complex evaluators!