# Using Additional Input Data 📥

In some cases, a single input value isn't enough: a task may need to process several values at once.

For this example, we'll use the [Taskmaster dataset](https://huggingface.co/datasets/taskmaster), which contains comments and topics. Our goal is to evaluate whether a given comment is relevant to a list of topics.

Your model's output may also need to include multiple values. Here, instead of returning only a label (e.g., "in-topic" or "off-topic"), we also return a confidence score.
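
As a minimal sketch of the shapes involved, both a multi-value record and a multi-value output can be plain dictionaries. The values below are made up for illustration and are not taken from the dataset; only the key names (`prompt`, `topics`, `response`, `confidence`) match the ones used in this notebook.

```python
# Hypothetical record: two input fields instead of a single value.
record_input = {
    "prompt": "Which laptop has the best battery life?",
    "topics": "laptops, consumer electronics",
}

# Hypothetical task output: a label plus a confidence score.
task_output = {"response": "False", "confidence": 0.93}

# A task or evaluator reads the fields it needs by key.
user_message = f"Comment: {record_input['prompt']}\n\nTopics: {record_input['topics']}"
print(user_message)
print(task_output["response"], task_output["confidence"])
```

Keeping both sides as dictionaries means an extra field can be added later without changing the function signatures.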

In [1]:
import os
import math

from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from a .env file, overriding existing ones.
# Disable override if your environment is defined outside the virtualenv.
load_dotenv(override=True)

from pydantic import BaseModel

from ddtrace.llmobs import Dataset, Experiment, task, evaluator

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [15]:
# Specify the columns that contain the input and the expected output.
dataset = Dataset.from_csv(
    "./data/taskmaster.csv",
    name="taskmaster-mini-2",
    input_columns=["prompt", "topics"],
    expected_output_columns=["labels"],
)

In [None]:
# A label of False means the comment is on topic for the listed topics; True means it is off topic.
dataset.as_dataframe()

In [None]:
# Uncomment to push the dataset (for example, on the first run).
# dataset.push()
# Reference the dataset by name.
dataset = Dataset(name='taskmaster-mini-2')

In [27]:
# Define a task that uses both the prompt and the topics from the input to decide whether the comment is on topic.
@task
def topic_relevance(input, config):
    output = client.chat.completions.create(
        model=f"{config['model']}",
        messages=[
            {"role": "system", "content": f"You are a {config['personality']} assistant that can detect if a comment is in topic with a given list of topics. Return YES if it is, otherwise return NO. Nothing else."},
            {"role": "user", "content": f"Comment: {input['prompt']}\n\nTopics: {input['topics']}"}
        ],
        logprobs=True,
        top_logprobs=10,
        temperature=config["temperature"]
    )

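    # Strip surrounding whitespace so a trailing newline does not break the comparison.
    response = output.choices[0].message.content.strip() == "YES"

    # Look up the log probabilities of the YES and NO tokens among the top candidates for the first token.
    logprobs = output.choices[0].logprobs.content[0].top_logprobs
    yes_prob = next((lp.logprob for lp in logprobs if lp.token == "YES"), float("-inf"))
    no_prob = next((lp.logprob for lp in logprobs if lp.token == "NO"), float("-inf"))

    # Convert the log probability of the chosen answer into a probability.
    # A token missing from the top candidates yields exp(-inf) == 0.0.
    confidence = math.exp(yes_prob if response else no_prob)

    # The dataset label is False when the comment is on topic, so negate before converting to a string.
    return {"response": str(not response), "confidence": confidence}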

# Define an evaluator that checks whether the returned label equals the expected output.
@evaluator
def exact_match(input, output, expected_output):
    return expected_output == output["response"]

# Define an evaluator that flags confidently wrong answers: confidence above 0.8 while the label differs from the expected output.
@evaluator
def false_confidence(input, output, expected_output):
    return output["confidence"] > 0.8 and expected_output != output["response"]


experiment = Experiment(
    name="taskmaster-experiment",
    dataset=dataset,
    task=topic_relevance,
    evaluators=[exact_match, false_confidence],
    config={"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"},
)

In [None]:
# Test the task on a single sample first.
input = dataset[0]["input"]
output = topic_relevance(input, {"model": "gpt-4o-mini", "temperature": 0.3, "personality": "helpful"})
print(output)
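
The confidence value printed above comes from exponentiating a token's log probability. The conversion can be checked offline with a small sketch; the `SimpleNamespace` objects are stand-ins for the entries of `top_logprobs`, the numbers are invented, and `token_probability` is a hypothetical helper that is not part of the notebook.

```python
import math
from types import SimpleNamespace

# Stand-ins for the entries of `top_logprobs`; the numbers are invented.
top_logprobs = [
    SimpleNamespace(token="YES", logprob=math.log(0.9)),
    SimpleNamespace(token="NO", logprob=math.log(0.1)),
]

def token_probability(candidates, token):
    """Return the probability of `token`, or 0.0 if it is not among the candidates."""
    logprob = next((c.logprob for c in candidates if c.token == token), float("-inf"))
    return math.exp(logprob)  # math.exp(-inf) evaluates to 0.0

yes_confidence = token_probability(top_logprobs, "YES")
maybe_confidence = token_probability(top_logprobs, "MAYBE")
print(round(yes_confidence, 2), maybe_confidence)
```

Defaulting to negative infinity means a missing token yields a confidence of zero instead of raising an error, which is the same fallback the task uses.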

In [None]:
results = experiment.run()

In [None]:
results.as_dataframe()
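
To read the two evaluator columns, it can help to run the same logic on a few hand-written outputs. The sketch below uses undecorated copies of the evaluators under hypothetical names (`check_exact_match`, `check_false_confidence`) and invented outputs. It assumes the expected labels are stored as the strings `"True"` and `"False"`, which is what the `str(...)` call in the task suggests.

```python
# Undecorated copies of the evaluator logic; the function names are hypothetical.
def check_exact_match(output, expected_output):
    return expected_output == output["response"]

def check_false_confidence(output, expected_output, threshold=0.8):
    return output["confidence"] > threshold and expected_output != output["response"]

# Invented outputs, all compared against the expected label "False".
cases = [
    {"response": "False", "confidence": 0.95},  # correct and confident
    {"response": "True", "confidence": 0.95},   # wrong and confident
    {"response": "True", "confidence": 0.55},   # wrong but unsure
]
flags = [(check_exact_match(c, "False"), check_false_confidence(c, "False")) for c in cases]
print(flags)
```

Only the second case is flagged by the confidence check: the answer is wrong and the confidence exceeds the threshold.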

## Running More Experiments 🚀

In this round of experiments, we'll modify the prompt to see how it affects model performance. Unlike the previous experiment, the task does not take a config parameter.

In [11]:
# Use a chain-of-thought (CoT) prompt and have the model return its answer as structured output.

class TopicRelevanceCoT(BaseModel):
    reasoning_why_in_topic: str
    reasoning_why_not_in_topic: str
    deliberation: str
    in_topic: bool
    confidence: float

@task
def topic_relevance_CoT(input):
    output = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """
            You are a helpful assistant that can detect if a comment is in topic with a given list of topics.
             
            Reason carefully and answer correctly only.

            You must return a JSON with the following fields in that order:
            - reasoning_why_in_topic: a string with the reasoning of why the comment is in topic with the list of topics.
            - reasoning_why_not_in_topic: a string with the reasoning of why the comment is not in topic with the list of topics.
            - deliberation: a string with an argument of why the comment is in topic or not in topic with the list of topics.
            - in_topic: a boolean that indicates if the comment is in topic with the list of topics.
            - confidence: a number between 0 and 1 that indicates the confidence of the model in its answer.
            """,
            },
            {"role": "user", "content": f"Comment: {input['prompt']}\n\nTopics: {input['topics']}"}
        ],
        response_format=TopicRelevanceCoT
    )

    parsed = output.choices[0].message.parsed

    # Also return the model's reasoning so we can inspect whether it reasons correctly.
    # The dataset label is False when the comment is on topic, so negate in_topic.
    return {
        "response": str(not parsed.in_topic),
        "confidence": parsed.confidence,
        "reasoning_why_in_topic": parsed.reasoning_why_in_topic,
        "reasoning_why_not_in_topic": parsed.reasoning_why_not_in_topic,
        "deliberation": parsed.deliberation,
    }
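
The mapping from a parsed answer to the task's output dictionary can be exercised without calling the API. In the sketch below, `ParsedAnswer` is a dataclass stand-in for the parsed `TopicRelevanceCoT` object, used only to keep the example free of third-party imports, and `to_task_output` is a hypothetical helper. Clamping the confidence is an added assumption: the schema declares it as a plain `float`, so a self-reported value may fall outside 0 to 1.

```python
from dataclasses import dataclass, asdict

# Stand-in for the parsed TopicRelevanceCoT object; a dataclass keeps the sketch stdlib-only.
@dataclass
class ParsedAnswer:
    reasoning_why_in_topic: str
    reasoning_why_not_in_topic: str
    deliberation: str
    in_topic: bool
    confidence: float

def to_task_output(parsed):
    """Flatten a parsed answer into the dictionary shape the evaluators expect."""
    fields = asdict(parsed)
    in_topic = fields.pop("in_topic")
    # Self-reported confidence is not guaranteed to stay within [0, 1], so clamp it.
    fields["confidence"] = min(1.0, max(0.0, fields["confidence"]))
    # The dataset label is False when the comment is on topic, hence the negation.
    return {"response": str(not in_topic), **fields}

example = ParsedAnswer("Mentions laptops.", "No clear counterargument.", "On topic.", True, 1.2)
result = to_task_output(example)
print(result["response"], result["confidence"])
```

Note that this confidence is the model's own estimate written into the JSON, not a token probability, so it is not directly comparable to the logprob-based score from the first task.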


To test the task, let's run it on one sample.

In [None]:
input = dataset[0]["input"]
output = topic_relevance_CoT(input)
print(output)

Great! Now let's run the experiment.

In [16]:
experiment = Experiment(name="taskmaster-experiment-cot", dataset=dataset, task=topic_relevance_CoT, evaluators=[exact_match, false_confidence])

In [None]:
results = experiment.run()
results.as_dataframe()

In the Datadog LLM Observability UI, you can compare the results of both experiments to see how the new prompt changed performance.