# Evaluating LLM-based tasks

Evaluating LLM-based use cases is pivotal for several reasons.
First, with so many methods available, comparability is essential.
By systematically evaluating different approaches, we can discern which techniques are more effective or better suited to specific tasks, and build a deeper understanding of their strengths and weaknesses.
Second, evaluation enables optimization. Without proper metrics and rigorous testing, it is hard to tune methods and models to their full potential.
Finally, comparisons with state-of-the-art (SOTA) and open-source methods are crucial.
They not only provide benchmarks but also let users determine the value added by proprietary or newer models over freely available counterparts.

However, evaluating LLMs, especially in the domain of text generation, presents unique challenges.
Text generation is inherently subjective, and what one evaluator deems coherent and relevant, another might find disjointed or off-topic. This subjectivity complicates the establishment of universal evaluation standards, making it imperative to approach LLM evaluation with a multifaceted and comprehensive strategy.

### Evaluating classification use-cases

To sidestep this issue (at least for now), let's look at a methodology that is easier to evaluate: classification.
Why is this easier?
Well, unlike other tasks such as QA, the result of a classification task is more or less binary (true/false).
There are very few grey areas, as it is unlikely that a classification result is somewhat or "half" correct.
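
Because each prediction is simply right or wrong, scoring reduces to counting matches. The following is a minimal sketch in plain Python, independent of the Intelligence Layer; the function name `percentage_correct` and the toy labels are assumptions made up for illustration.

```python
from typing import Sequence


def percentage_correct(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Return the share of predictions that exactly match the expected label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


# Toy data, made up for illustration only.
predicted = ["positive", "negative", "positive", "negative"]
expected = ["positive", "negative", "negative", "negative"]
print(percentage_correct(predicted, expected))  # 3 of 4 match -> 0.75
```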

Make sure that you have familiarized yourself with `PromptBasedClassify` before starting this notebook.


First, we need to instantiate our task together with a runner, an evaluator and an aggregator for it, as well as the repositories that store the datasets, runs, evaluation results and aggregations, along with tracing information for the evaluated examples.


In [None]:
from dotenv import load_dotenv

from intelligence_layer.connectors import LimitedConcurrencyClient
from intelligence_layer.evaluation import (
    Evaluator,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    InMemoryDatasetRepository,
    InMemoryAggregationRepository,
    Runner,
    Aggregator,
)
from intelligence_layer.use_cases import (
    PromptBasedClassify,
    SingleLabelClassifyEvaluationLogic,
    SingleLabelClassifyAggregationLogic,
)

load_dotenv()

task = PromptBasedClassify()
dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()
aggregation_logic = SingleLabelClassifyAggregationLogic()
evaluation_logic = SingleLabelClassifyEvaluationLogic()


evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    evaluation_logic,
)
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    aggregation_logic,
)
runner = Runner(task, dataset_repository, run_repository, "prompt-based-classify")

Now, let's run a single example and see what comes of it!

In [None]:
from intelligence_layer.core import TextChunk, NoOpTracer
from intelligence_layer.use_cases import ClassifyInput
from intelligence_layer.evaluation import Example


classify_input = ClassifyInput(
    chunk=TextChunk("This is good"),
    labels=frozenset({"positive", "negative"}),
)

single_example_dataset = dataset_repository.create_dataset(
    examples=[Example(input=classify_input, expected_output="positive")],
    dataset_name="ClassifyDataset",
)

run_overview = runner.run_dataset(single_example_dataset.id, NoOpTracer())
evaluation_overview = evaluator.evaluate_runs(run_overview.id)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)

print("Statistics: ", aggregation_overview.statistics)

Cool!

Let's have a look at this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for a more elaborate evaluation.

In [None]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
all_data = list(dataset[test_set_name])
data = all_data[:25]  # this has 573 datapoints, let's take a look at 25 for now

We need to transform our dataset into the required format. 
Therefore, let's check out what it looks like.

In [None]:
data[1]

Accordingly, this must be translated into the interface of our `Evaluator`.

This is the target structure:

``` python
class Example(BaseModel, Generic[Input, ExpectedOutput]):
    input: Input
    expected_output: ExpectedOutput
    id: Optional[str] = Field(default_factory=lambda: str(uuid4()))

```

We want the `input` in each `Example` to mimic the input of an actual task, so every example must include the text (chunk) and all possible labels.
The `expected_output` should correspond to whatever we wish to compare our generated output to.
In this case, that means the correct class(es).

In [None]:
# A frozenset matches the type used for `labels` in the single-example cell above.
all_labels = frozenset(c for d in data for c in d["label_name"])
dataset = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(d["text"]), labels=all_labels),
            expected_output=d["label_name"],
        )
        for d in data
    ],
    dataset_name="tweet_topic_multi",
)

Ok, let's run this!

Note that this may take a while as we parallelise the tasks in a way that accommodates the inference API.
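
How the `Runner` schedules its calls internally is not shown in this notebook. As a generic sketch of the idea of bounding parallel requests, the snippet below uses the standard library's thread pool; `run_with_limited_concurrency` and `fake_classify` are hypothetical names, and `fake_classify` merely stands in for a remote inference call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_with_limited_concurrency(
    task: Callable[[str], str], inputs: Sequence[str], max_workers: int = 4
) -> List[str]:
    """Apply `task` to every input using at most `max_workers` threads at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # `map` yields results in input order, whatever order the calls finish in.
        return list(executor.map(task, inputs))


def fake_classify(text: str) -> str:
    # Hypothetical stand-in for a call to an inference API.
    return "positive" if "good" in text else "negative"


print(run_with_limited_concurrency(fake_classify, ["This is good", "This is bad"]))
```

A smaller `max_workers` is gentler on a rate-limited API at the cost of a longer total runtime.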

In [None]:
run_overview = runner.run_dataset(dataset.id)
evaluation_overview = evaluator.evaluate_runs(run_overview.id)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)
aggregation_overview.raise_on_evaluation_failure()

Checking out the results...

In [None]:
evaluation_overview

In [None]:
from intelligence_layer.use_cases import SingleLabelClassifyEvaluation

print("Percentage correct:", aggregation_overview.statistics.percentage_correct)
print(
    "First example:",
    evaluation_repository.example_evaluations(
        evaluation_id=next(iter(aggregation_overview.evaluation_overviews)).id,
        evaluation_type=SingleLabelClassifyEvaluation,
    )[0],
)

For the sake of comparison, let's see if we can achieve a better result with our `EmbeddingBasedClassify`.
Here, we have to provide some examples for each class.

We can even reuse our data repositories.

In [None]:
from collections import defaultdict
from typing import Any, Mapping, Sequence
from intelligence_layer.use_cases import (
    MultiLabelClassifyEvaluationLogic,
    MultiLabelClassifyAggregationLogic,
    EmbeddingBasedClassify,
    LabelWithExamples,
)


def build_labels_and_examples(hf_data: Any) -> Mapping[str, Sequence[str]]:
    """Collect up to 20 example texts per label from the given records."""
    examples = defaultdict(list)
    for d in hf_data:
        labels = d["label_name"]
        for label in labels:
            # Cap the number of examples per label.
            if len(examples[label]) < 20:
                examples[label].append(d["text"])
    return examples


client = LimitedConcurrencyClient.from_env()
embedding_based_classify = EmbeddingBasedClassify(
    client=client,
    labels_with_examples=[
        LabelWithExamples(name=name, examples=examples)
        for name, examples in build_labels_and_examples(all_data[25:]).items()
    ],
)
eval_logic = MultiLabelClassifyEvaluationLogic(threshold=0.6)
aggregation_logic = MultiLabelClassifyAggregationLogic()

embedding_based_classify_evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "multi-label-classify",
    eval_logic,
)
embedding_based_classify_aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "multi-label-classify",
    aggregation_logic,
)
embedding_based_classify_runner = Runner(
    embedding_based_classify,
    dataset_repository,
    run_repository,
    "embedding-based-classify",
)

In [None]:
embedding_based_classify_run_result = embedding_based_classify_runner.run_dataset(
    dataset.id
)
embedding_based_classify_evaluation_result = (
    embedding_based_classify_evaluator.evaluate_runs(
        embedding_based_classify_run_result.id
    )
)
embedding_based_classify_aggregation_result = (
    embedding_based_classify_aggregator.aggregate_evaluation(
        embedding_based_classify_evaluation_result.id
    )
)
embedding_based_classify_aggregation_result.raise_on_evaluation_failure()

In [None]:
embedding_based_classify_aggregation_result.statistics.macro_avg

Apparently, our method has a great recall value, but we tend to falsely predict labels at times, which lowers precision.

Note that the evaluation criteria for the multi-label approach are a lot harsher: we evaluate whether we correctly predict all labels, not just one of the correct ones!
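
To make the relationship between recall and falsely predicted labels concrete, here is a small sketch of macro-averaged precision, recall and F1 for multi-label predictions. It is written in plain Python and is not the implementation behind `MultiLabelClassifyAggregationLogic`; the function name `macro_average` and the toy data are assumptions for illustration.

```python
from typing import Dict, Sequence, Set


def macro_average(
    predicted: Sequence[Set[str]],
    expected: Sequence[Set[str]],
    labels: Sequence[str],
) -> Dict[str, float]:
    """Compute precision, recall and F1 per label, then average with equal weight."""
    if not labels:
        raise ValueError("labels must not be empty")
    pairs = list(zip(predicted, expected))
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label in labels:
        tp = sum(label in p and label in e for p, e in pairs)
        fp = sum(label in p and label not in e for p, e in pairs)
        fn = sum(label not in p and label in e for p, e in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        totals["precision"] += precision
        totals["recall"] += recall
        totals["f1"] += f1
    return {name: total / len(labels) for name, total in totals.items()}


# Toy data: every expected label is found, but "music" is predicted once too often.
predicted = [{"sports", "music"}, {"sports"}, {"music"}]
expected = [{"sports"}, {"sports"}, {"music"}]
result = macro_average(predicted, expected, ["sports", "music"])
print(result)  # precision is 0.75 while recall is 1.0
```

In this toy case no expected label is missed, so recall stays perfect, while the single superfluous prediction is what pulls precision down.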



### Wrap up

There you go, this is how to evaluate any task using the Intelligence Layer framework.
Simply set up a `Runner` for the target `Task` together with an `Evaluator` and an `Aggregator`, and customize the evaluation and aggregation logic passed to them (the `do_evaluate` and `aggregate` methods).
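
The two-step pattern of a per-example evaluation followed by a dataset-level aggregation can be sketched without the framework. The code below is a toy illustration only: `ToyEvaluation`, `ToyAggregation` and the function signatures are assumptions and do not reproduce the Intelligence Layer's actual interfaces.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass(frozen=True)
class ToyEvaluation:
    correct: bool


@dataclass(frozen=True)
class ToyAggregation:
    percentage_correct: float


def do_evaluate(output: str, expected_output: Sequence[str]) -> ToyEvaluation:
    # Per-example step: compare one output with its expected output.
    return ToyEvaluation(correct=output in expected_output)


def aggregate(evaluations: Iterable[ToyEvaluation]) -> ToyAggregation:
    # Dataset-level step: condense all per-example results into statistics.
    results = list(evaluations)
    if not results:
        return ToyAggregation(percentage_correct=0.0)
    return ToyAggregation(sum(e.correct for e in results) / len(results))


# Toy outputs and expected labels, made up for illustration.
outputs = ["sports", "music"]
expected = [["sports", "gaming"], ["news"]]
evaluations = [do_evaluate(o, e) for o, e in zip(outputs, expected)]
print(aggregate(evaluations))
```

Keeping the two steps separate means the per-example results can be stored and inspected individually, and re-aggregated later without rerunning the task.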