# Evaluating LLM-based tasks

Evaluating LLM-based use cases matters for several reasons.
First, with so many methods available, comparability is essential.
By systematically evaluating different approaches, we can discern which techniques are more effective or better suited for specific tasks, and learn their strengths and weaknesses.
Second, evaluation enables optimization. Without proper metrics and rigorous testing, it is hard to tune methods or models to reach their full potential.
Third, comparisons with state-of-the-art (SOTA) and open-source methods are crucial.
Such comparisons not only provide benchmarks but also let users determine the value added by proprietary or newer models over freely available counterparts.

However, evaluating LLMs, especially in the domain of text generation, presents unique challenges.
Text generation is inherently subjective: what one evaluator deems coherent and relevant, another might find disjointed or off-topic. This subjectivity makes universal evaluation standards hard to establish, so LLM evaluation calls for a multifaceted strategy.

### Evaluating classification use cases

To sidestep this issue for now, let's look at a methodology that is easier to evaluate: classification.
Make sure you are familiar with `SingleLabelClassify` and `EmbeddingBasedClassify` before starting this notebook.


Next, we instantiate an evaluator that takes our classification task (`task`) together with some datapoints and returns evaluation metrics.

First, let's evaluate a single example and see what happens.

In [None]:
from intelligence_layer.use_cases.classify.classify import ClassifyEvaluator

# `task`, `ClassifyInput`, `Chunk` and `InMemoryDebugLogger` are assumed to be
# defined or imported in earlier cells of this notebook.
evaluator = ClassifyEvaluator(task)
classify_input = ClassifyInput(
    chunk=Chunk("This is good"),
    labels=frozenset({"positive", "negative"}),
)
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
expected_output = "positive"
# The expected output is passed as a list of labels.
evaluation = evaluator.evaluate(
    input=classify_input, logger=evaluation_logger, expected_output=[expected_output]
)

print("The task result:", evaluation.output.scores)
print("The expected output:", expected_output)
print("The eval result:", evaluation.correct)
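
To make the printed result easier to interpret, here is a minimal sketch of one way a single-label evaluation could decide correctness. It assumes the prediction is the label with the highest score and that it counts as correct if it appears in the expected labels. The helper `is_correct` and the scores are hypothetical and not part of `intelligence_layer`; the actual `ClassifyEvaluator` may apply a different rule.

```python
from typing import Mapping, Sequence


def is_correct(scores: Mapping[str, float], expected: Sequence[str]) -> bool:
    """Return True if the highest-scoring label is one of the expected labels."""
    # Assumed rule: the prediction is the label with the highest score.
    predicted = max(scores, key=lambda label: scores[label])
    return predicted in expected


# Made-up scores for illustration; not real model output.
example_scores = {"positive": 0.9, "negative": 0.1}
print(is_correct(example_scores, ["positive"]))  # True
print(is_correct(example_scores, ["negative"]))  # False
```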


Cool!
Now let's find a dataset to use.
We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on Hugging Face; let's see if we can get an evaluation going!

In [None]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
# This split has 573 datapoints; let's look at the first 10 for now.
data = list(dataset[test_set_name])[:10]


We need to transform our dataset into the required format.
First, let's check out what a single record looks like.

In [None]:
data[1]


Each record must now be translated into the format our `Evaluator` expects.

In [None]:
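from intelligence_layer.core.evaluator import Dataset, Example

# Collect every label that occurs in the sample. `labels` is a frozenset,
# as in the single-example cell above.
all_labels = frozenset(c for d in data for c in d["label_name"])
# Note: this rebinds `dataset`, replacing the Hugging Face dataset loaded above.
dataset = Dataset(
    name="tweet topics",
    examples=[
        Example(
            # Wrap the tweet text in a Chunk.
            input=ClassifyInput(chunk=Chunk(d["text"]), labels=all_labels),
            expected_output=d["label_name"],
        )
        for d in data
    ],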
)
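
The same transformation can be sketched without the library, which may help when debugging the data format. The records below are made up and only mimic the two fields used above (`text` and `label_name`); the names `toy_data`, `toy_labels` and `toy_examples` are hypothetical.

```python
# Made-up records that mimic the two fields used above ("text", "label_name").
toy_data = [
    {"text": "Great match last night!", "label_name": ["sports"]},
    {"text": "New album and a world tour announced", "label_name": ["music", "news"]},
]

# Union of all labels across records; a frozenset removes duplicates and is immutable.
toy_labels = frozenset(label for record in toy_data for label in record["label_name"])

# Each record becomes an (input text, candidate labels, expected labels) triple.
toy_examples = [(record["text"], toy_labels, record["label_name"]) for record in toy_data]

print(sorted(toy_labels))
for text, labels, expected in toy_examples:
    print(text, "->", expected)
```

Every example receives the full label set as candidates, while its own `label_name` list serves as the expected output.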


Ok, let's run this!

In [None]:
evaluation_logger = InMemoryDebugLogger(name="evaluation logger")
result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)


Checking out the results...

In [None]:
print("Percentage correct:", result.percentage_correct)
print("First example:", result.evaluations[0])
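
As a rough illustration of what an aggregate such as `percentage_correct` could mean, the sketch below averages per-example correctness flags. It assumes the value is reported as a fraction between 0 and 1; the library may instead use a 0 to 100 scale or a different aggregation. The function and the flags are hypothetical.

```python
from typing import Sequence


def percentage_correct(correct_flags: Sequence[bool]) -> float:
    """Fraction of examples marked correct, in the range 0.0 to 1.0."""
    if not correct_flags:
        raise ValueError("need at least one evaluation")
    # True counts as 1 and False as 0, so the sum is the number of correct examples.
    return sum(correct_flags) / len(correct_flags)


# Made-up per-example results for illustration.
flags = [True, False, True, True]
print(percentage_correct(flags))  # 0.75
```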
