# Evaluating LLM-based tasks

Evaluating LLM-based use cases is important for several reasons.
First, with so many methods available, we need a way to compare them.
By systematically evaluating different approaches, we can tell which techniques are more effective or better suited to specific tasks, and better understand their strengths and weaknesses.
Second, evaluation enables optimization: without proper metrics and rigorous testing, it is hard to tune methods or models to reach their full potential.
Third, comparisons with state-of-the-art (SOTA) and open-source methods are crucial.
Such comparisons not only provide benchmarks but also let users determine the value added by proprietary or newer models over freely available counterparts.

However, evaluating LLMs, especially in the domain of text generation, presents unique challenges.
Text generation is inherently subjective, and what one evaluator deems coherent and relevant, another might find disjointed or off-topic. This subjectivity complicates the establishment of universal evaluation standards, making it imperative to approach LLM evaluation with a multifaceted and comprehensive strategy.

### Evaluating classification use-cases

To sidestep the issue described in the last paragraph (at least for now), let's look at a task that is easier to evaluate: classification.
Why is this easier?
Well, unlike other tasks such as QA, the result of a classification task is more or less binary (true/false).
There are very few grey areas, as it is unlikely that a classification result is somewhat or "half" correct.
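
To make the "binary" nature of this check concrete, here is a minimal sketch of how a single-label result could be judged. It assumes the classifier returns a score per label and that the prediction is simply the highest-scoring label. The function name `is_correct` and the scores are made up for illustration, and this is not necessarily how the framework's own evaluator decides correctness.

```python
from typing import Mapping


def is_correct(scores: Mapping[str, float], expected_label: str) -> bool:
    """Return True if the highest-scoring label equals the expected label.

    Hypothetical helper for illustration only.
    """
    predicted_label = max(scores, key=lambda label: scores[label])
    return predicted_label == expected_label


# Illustrative scores, not real model output.
example_scores = {"negative": 0.1, "positive": 0.9}
print(is_correct(example_scores, "positive"))
print(is_correct(example_scores, "negative"))
```

Either the predicted label matches the expected one or it does not, which is what makes classification comparatively easy to score.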

Make sure that you have familiarized yourself with `PromptBasedClassify` and `EmbeddingBasedClassify` before starting this notebook.


First, we need to instantiate our task and an evaluator for it.


In [2]:
import os

from intelligence_layer.connectors.limited_concurrency_client import LimitedConcurrencyClient
from intelligence_layer.core import InMemoryEvaluationRepository
from intelligence_layer.use_cases.classify.classify import SingleLabelClassifyEvaluator
from intelligence_layer.use_cases.classify.prompt_based_classify import PromptBasedClassify


client = LimitedConcurrencyClient.from_token(os.getenv("AA_TOKEN"))
task = PromptBasedClassify(client)
repository = InMemoryEvaluationRepository()
evaluator = SingleLabelClassifyEvaluator(task, repository)


Now, let's run a single example and see what comes of it!

In [3]:
from intelligence_layer.core.tracer import InMemoryTracer
from intelligence_layer.core.chunk import Chunk
from intelligence_layer.use_cases.classify.classify import ClassifyInput


classify_input = ClassifyInput(
    chunk=Chunk("This is good"),
    labels=frozenset({"positive", "negative"}),
)
evaluation_tracer = InMemoryTracer()
expected_output = "positive"
evaluation = evaluator.evaluate(
    input=classify_input, tracer=evaluation_tracer, expected_output=[expected_output]
)

print("The task result:", evaluation.output.scores)
print("The expected output:", expected_output)
print("The eval result:", evaluation.correct)


The task result: {'negative': 0.001098731072924643, 'positive': 0.9989012689270754}
The expected output: positive
The eval result: True


Cool!

Let's have a look at this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for a more elaborate evaluation.

In [4]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
all_data = list(dataset[test_set_name])
data, all_data = all_data[:10], all_data[10:] # this has 573 datapoints, let's take a look at 10 for now


We need to transform our dataset into the required format. 
Therefore, let's check out what it looks like.

In [5]:
data[1]


{'text': 'COMING UP!! At the beginning of the pandemic, various models predicted a doomsday scenario for Nigeria & other African countries, yet fewer cases than expected have been recorded. For the latest on Nigeria’s response to #COVID19, join DG… {{URL}} VIA {@NCDC@} ',
 'date': '2020-08-17',
 'label': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 'label_name': ['news_&_social_concern'],
 'id': '1295296955732471809'}

Accordingly, this must be translated into the interface of our `Evaluator`.

This is the target structure:

``` python
class Example(BaseModel, Generic[Input, ExpectedOutput]):
    input: Input
    expected_output: ExpectedOutput
    id: Optional[str] = Field(default_factory=lambda: str(uuid4()))


class Dataset(Protocol, Generic[Input, ExpectedOutput]):
    @property
    def name(self) -> str:
        ...

    @property
    def examples(self) -> Iterable[Example[Input, ExpectedOutput]]:
        ...
```

We want the `input` in each `Example` to mimic the input of an actual task, so every example must include the text (chunk) and all possible labels.
The `expected_output` shall correspond to anything we wish to compare our generated output to.
In this case, that means the correct class(es).

In [6]:
from intelligence_layer.core import Example, SequenceDataset


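# Collect every label that occurs in our sample; `labels` is a frozenset, as in the single example above
all_labels = frozenset(c for d in data for c in d["label_name"])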
dataset = SequenceDataset(
    name="tweet topics",
    examples=[
        Example(
            input=ClassifyInput(
                chunk=Chunk(d["text"]),
                labels=all_labels
            ),
            expected_output=d["label_name"]
        ) for d in data
    ]
)


Ok, let's run this!

Note that this may take a while as we parallelise the tasks in a way that accommodates the inference API.
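
As a rough illustration of what bounded parallelism looks like, the sketch below uses Python's standard `ThreadPoolExecutor` to cap the number of requests in flight. It is not the implementation of `LimitedConcurrencyClient`. `fake_api_call` and `run_with_limited_concurrency` are hypothetical names, and the sleep merely stands in for network latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_api_call(text: str) -> str:
    """Hypothetical stand-in for a request to an inference API."""
    time.sleep(0.01)  # simulate network latency
    return text.upper()


def run_with_limited_concurrency(inputs: list[str], max_workers: int = 4) -> list[str]:
    """Process inputs in parallel with at most `max_workers` calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs
        return list(executor.map(fake_api_call, inputs))


print(run_with_limited_concurrency(["a", "b", "c"], max_workers=2))
```

A lower `max_workers` is gentler on the API but makes the whole run slower, which is the trade-off mentioned above.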

In [7]:
from intelligence_layer.core.tracer import JsonSerializer
from intelligence_layer.use_cases.classify.classify import ClassifyEvaluation


evaluation_tracer = InMemoryTracer()
result = evaluator.evaluate_dataset(dataset=dataset, tracer=evaluation_tracer)
if result.failed_evaluation_count > 0:
    results = repository.evaluation_run_results(result.id, ClassifyEvaluation)
    print(JsonSerializer(root=results).model_dump_json(indent=2))
    raise RuntimeError("Failed evaluation")


Evaluating: 10it [00:05,  1.83it/s]

[
  {
    "example_id": "898fd6e6-0879-4d6d-9c81-1d2f7dac69ee",
    "result": {
      "correct": true,
      "output": {
        "scores": {
          "film_tv_&_video": 0.00005261736444822042,
          "arts_&_culture": 0.00030798010534601383,
          "news_&_social_concern": 0.0025233538483477353,
          "fitness_&_health": 0.0003954542830984499,
          "other_hobbies": 0.0004275880310893618,
          "science_&_technology": 0.00006765425080210426,
          "celebrity_&_pop_culture": 0.00011314511381048893,
          "sports": 0.9961122070030577
        }
      }
    },
    "trace": {
      "traces": [
        {
          "traces": [
            {
              "traces": [
                {
                  "traces": [],
                  "start": "2023-11-24T16:05:37.346195",
                  "end": "2023-11-24T16:05:42.431985",
                  "input": {
                    "request": {
                      "prompt": {
                        "items": [
            




RuntimeError: Failed evaluation

Checking out the results...

In [None]:
print("Percentage correct:", result.statistics.percentage_correct)
print("First example:", result.statistics.evaluations[0])


Looking good!

Because we designed the `ClassifyEvaluator` in a way that allows it to evaluate any `Task` with `ClassifyInput` and `ClassifyOutput` (both single- and multi-label), it can even evaluate different classifier implementations, such as `EmbeddingBasedClassify`.

To achieve this, let's first find some examples for the different labels within our eval set.

In [None]:
from pprint import pprint

from intelligence_layer.use_cases.classify.embedding_based_classify import LabelWithExamples

# Collect the example texts for each label from the held-out part of the dataset.
# Mutable dict/list types are needed here because we append to the lists.
labels_with_examples_dict: dict[str, list[str]] = {}
for d in all_data:
    for label in d["label_name"]:
        labels_with_examples_dict.setdefault(label, []).append(d["text"])
labels_with_examples = [LabelWithExamples(name=k, examples=v) for k, v in labels_with_examples_dict.items()]
# Show one example text per label.
pprint({k: v[:1] for k, v in labels_with_examples_dict.items()})


Alright, let's instantiate our `EmbeddingBasedClassify` task with these examples.
Again, this may take a few seconds, because we embed all examples.

In [None]:
from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify
from intelligence_layer.use_cases.classify.classify import MultiLabelClassifyEvaluator


ebc = EmbeddingBasedClassify(labels_with_examples, client)
ebc_evaluator = MultiLabelClassifyEvaluator(ebc, InMemoryEvaluationRepository())

And, let's run!

In [None]:
ebc_evaluation_tracer = InMemoryTracer()
ebc_result = ebc_evaluator.evaluate_dataset(dataset=dataset, tracer=ebc_evaluation_tracer)

Once again, let's print part of the result.

In [None]:
print("Percentage correct:", ebc_result.statistics.percentage_correct)
print("First example:", ebc_result.statistics.evaluations[0])

As we can see, our `EmbeddingBasedClassify` outperformed the prompt-based approach here.
However, note the small sample size of 10.
To achieve statistical significance in evaluation, we generally recommend evaluating on at least 100, if not 1000, examples.
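
To see why sample size matters, the sketch below computes an approximate 95% Wilson score interval for an observed accuracy at different sample sizes. The 80% accuracy is a made-up figure for illustration, not a result from this notebook, and `wilson_interval` is a hypothetical helper rather than part of the framework.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for an observed accuracy."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denominator = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denominator
    half_width = (z / denominator) * math.sqrt(
        p * (1 - p) / total + z**2 / (4 * total**2)
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed accuracy (80%, illustrative) at three sample sizes.
for n in (10, 100, 1000):
    low, high = wilson_interval(correct=n * 8 // 10, total=n)
    print(f"n={n:>4}: interval [{low:.2f}, {high:.2f}], width {high - low:.2f}")
```

The interval narrows as the number of examples grows, so a difference between two classifiers measured on 10 examples can easily be noise.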

In the case at hand, the embedding-based approach likely benefited from the large number of examples we were able to provide on the basis of the extensive dataset.
Generally, we recommend using this approach once you can provide around 10 or more examples per label.
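
A quick way to check this rule of thumb is to count the examples per label before building the classifier. The sketch below assumes a plain dictionary that maps each label to its example texts. The helper name `labels_below_threshold` and the toy data are hypothetical.

```python
def labels_below_threshold(
    examples_per_label: dict[str, list[str]], min_examples: int = 10
) -> dict[str, int]:
    """Return each label with fewer than `min_examples` examples and its count."""
    return {
        label: len(examples)
        for label, examples in examples_per_label.items()
        if len(examples) < min_examples
    }


# Toy data for illustration only.
toy_examples = {
    "sports": [f"sports tweet {i}" for i in range(12)],
    "gaming": ["gaming tweet 0", "gaming tweet 1"],
}
print(labels_below_threshold(toy_examples))
```

Labels reported by such a check may be better served by the prompt-based approach until more examples are available.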

### Wrap up

There you go, this is how to evaluate any task using the Intelligence Layer framework.
Simply define an `Evaluator` that takes the target `Task` as input and customize the `evaluate` as well as `aggregate` methods.
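
The following is a minimal, framework-independent sketch of that idea. It does not use the Intelligence Layer base classes, whose exact signatures are not shown in this notebook. All class names here are hypothetical, and only the split into a per-example `evaluate` and a dataset-level `aggregate` is taken from the text above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ExactMatchEvaluation:
    """Result of evaluating a single example (hypothetical type)."""

    correct: bool


@dataclass
class AggregatedExactMatch:
    """Summary over all evaluated examples (hypothetical type)."""

    percentage_correct: float  # fraction between 0.0 and 1.0


class ExactMatchEvaluator:
    """Hypothetical evaluator, not the Intelligence Layer `Evaluator` class."""

    def evaluate(self, output: str, expected_output: str) -> ExactMatchEvaluation:
        # Compare after normalising surrounding whitespace and case.
        return ExactMatchEvaluation(
            correct=output.strip().lower() == expected_output.strip().lower()
        )

    def aggregate(
        self, evaluations: Iterable[ExactMatchEvaluation]
    ) -> AggregatedExactMatch:
        evaluation_list = list(evaluations)
        if not evaluation_list:
            return AggregatedExactMatch(percentage_correct=0.0)
        correct_count = sum(e.correct for e in evaluation_list)
        return AggregatedExactMatch(
            percentage_correct=correct_count / len(evaluation_list)
        )


sketch_evaluator = ExactMatchEvaluator()
pairs = [("Positive", "positive"), ("negative", "positive")]
sketch_evaluations = [sketch_evaluator.evaluate(out, exp) for out, exp in pairs]
print(sketch_evaluator.aggregate(sketch_evaluations))
```

Keeping the per-example judgement separate from the aggregation makes it easy to swap in a different metric without touching the logic that runs the task.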