# Assessing the Effectiveness of LLM-based Email Classification Systems

In the fast-paced world of business, effectively managing incoming support emails is crucial. The ability to quickly and accurately classify these emails into the appropriate department and determine their urgency is not just a matter of operational efficiency; it directly impacts customer satisfaction and overall business success. Given the high stakes, it's essential to rigorously evaluate any solution designed to automate this process. This tutorial focuses on the evaluation of an LLM-based program developed to automate the classification of support emails.

### The Importance of Evaluating LLM-Based Solutions

In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance.

### Business Impact and Evaluation Challenges

For a business, the deployment of an LLM-based email classification system can be transformative. It can streamline workflows, expedite response times, and ensure that critical issues are swiftly addressed. However, evaluating these models presents unique challenges, particularly due to the generative nature of LLMs.

Unlike traditional models, LLMs generate outputs based on complex and often less transparent internal processes. This complexity can make it challenging to understand why a model categorizes an email in a certain way, which is crucial for business stakeholders who need to trust and rely on the system's decisions. Moreover, in the context of email classification, the nuances of language and context can lead to misclassifications, which can have significant business ramifications.

### Addressing Evaluation Challenges with Traceability

One way to mitigate these challenges is through traceability of model outputs. By implementing mechanisms that allow us to trace back how and why a particular classification was made, we can gain insights into the model's decision-making process. This transparency is invaluable for fine-tuning the implementation, addressing misclassifications, and building trust among stakeholders (and users).
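
As a minimal sketch of what such traceability could look like, the snippet below records the input, the intermediate scores, and the final label of a toy keyword classifier. `Trace`, `TraceEntry`, `classify_with_trace`, and the keyword rules are hypothetical stand-ins invented for illustration; they are not part of the Intelligence Layer, which provides its own tracers (such as the `NoOpTracer` used later in this tutorial).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TraceEntry:
    """One recorded step of a classification (hypothetical structure)."""

    step: str
    detail: Dict[str, Any]


@dataclass
class Trace:
    """Collects the steps taken while producing one classification."""

    entries: List[TraceEntry] = field(default_factory=list)

    def log(self, step: str, **detail: Any) -> None:
        self.entries.append(TraceEntry(step, detail))


# Toy keyword rules standing in for a real model (assumption for illustration).
KEYWORDS = {
    "billing": ["invoice", "refund", "charge"],
    "technical": ["error", "crash", "login"],
}


def classify_with_trace(email: str, trace: Trace) -> str:
    """Classify an email and record every intermediate step in the trace."""
    trace.log("input", text=email)
    words = email.lower().split()
    # Score each label by how many of its keywords appear in the email.
    scores = {
        label: sum(keyword in words for keyword in keywords)
        for label, keywords in KEYWORDS.items()
    }
    trace.log("scores", **scores)
    label = max(scores, key=scores.get)
    trace.log("output", label=label)
    return label


trace = Trace()
predicted = classify_with_trace("I was charged twice and need a refund", trace)
for entry in trace.entries:
    print(entry.step, entry.detail)
```

With a record like this, a misclassified email can be inspected step by step instead of being judged only by its final label.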


In [None]:
# First step: import the task and try it on the two examples the product manager provided.
# The results look good, so let's move on to a proper evaluation.

In [None]:
from dotenv import load_dotenv

from intelligence_layer.evaluation import (
    Evaluator,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    InMemoryDatasetRepository,
    InMemoryAggregationRepository,
    Runner,
    Aggregator,
)
from intelligence_layer.use_cases import (
    PromptBasedClassify,
    SingleLabelClassifyEvaluationLogic,
    SingleLabelClassifyAggregationLogic,
)

load_dotenv()

prompt_based_classify = PromptBasedClassify()
dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()


evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)
runner = Runner(prompt_based_classify, dataset_repository, run_repository, "prompt-based-classify")

Now, let's run a single example and see what comes of it!

In [None]:
from intelligence_layer.core import TextChunk, NoOpTracer
from intelligence_layer.use_cases import ClassifyInput
from intelligence_layer.evaluation import Example


classify_input = ClassifyInput(
    chunk=TextChunk("This is good"),
    labels=frozenset({"positive", "negative"}),
)

single_example_dataset = dataset_repository.create_dataset(
    examples=[Example(input=classify_input, expected_output="positive")]
)

run_overview = runner.run_dataset(single_example_dataset, NoOpTracer())
evaluation_overview = evaluator.evaluate_runs(run_overview.id)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)

print("Statistics: ", aggregation_overview.statistics)

Cool!

Let's have a look at this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for a more elaborate evaluation.

In [None]:
from datasets import load_dataset

dataset = load_dataset("cardiffnlp/tweet_topic_multi")
test_set_name = "validation_random"
all_data = list(dataset[test_set_name])
data = all_data[:25]  # this has 573 datapoints, let's take a look at 25 for now

We need to transform our dataset into the required format. 
Therefore, let's check out what it looks like.

In [None]:
data[1]

Accordingly, this must be translated into the interface of our `Evaluator`.

This is the target structure:

``` python
class Example(BaseModel, Generic[Input, ExpectedOutput]):
    input: Input
    expected_output: ExpectedOutput
    id: Optional[str] = Field(default_factory=lambda: str(uuid4()))

```

We want the `input` in each `Example` to mimic the input of an actual task, so every example must include both the text (chunk) and all possible labels.
The `expected_output` should hold whatever we wish to compare our generated output to.
In this case, that means the correct class(es).
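
To make the mapping concrete, here is a library-free sketch that mirrors the structure above with plain dataclasses. `ToyInput`, `ToyExample`, and the sample records are made up for illustration; the only thing taken from the tutorial's data is the assumption that each record carries a `text` field and a `label_name` list.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ToyInput:
    """Hypothetical stand-in for the task input: the text plus all candidate labels."""

    chunk: str
    labels: FrozenSet[str]


@dataclass(frozen=True)
class ToyExample:
    """Hypothetical stand-in for an example: an input and what we expect back."""

    input: ToyInput
    expected_output: List[str]


# Made-up records shaped like the rows used in this tutorial.
records = [
    {"text": "Great match last night!", "label_name": ["sports"]},
    {"text": "New album and tour dates announced", "label_name": ["music", "news"]},
]

# Every example receives the full label set, just as a real request would.
all_labels = frozenset(label for record in records for label in record["label_name"])

examples = [
    ToyExample(
        input=ToyInput(chunk=record["text"], labels=all_labels),
        expected_output=record["label_name"],
    )
    for record in records
]

print(len(examples), sorted(all_labels))
```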

In [None]:
# Collect every label that occurs in the sample; ClassifyInput was given a frozenset above.
all_labels = frozenset(c for d in data for c in d["label_name"])
dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(d["text"]), labels=all_labels),
            expected_output=d["label_name"],
        )
        for d in data
    ]
)

Ok, let's run this!

Note that this may take a while as we parallelise the tasks in a way that accommodates the inference API.
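
How the `Runner` schedules its work internally is not shown in this tutorial. As a generic sketch of the idea of running tasks in parallel while capping the number of simultaneous requests, the snippet below uses a thread pool from the standard library. `fake_classify` and `MAX_CONCURRENT_REQUESTS` are hypothetical names chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_REQUESTS = 4  # hypothetical limit, chosen for illustration


def fake_classify(text: str) -> str:
    """Stand-in for a network call to an inference API."""
    time.sleep(0.01)  # simulate request latency
    return "long" if len(text) > 10 else "short"


texts = ["hi", "a much longer message", "ok", "another lengthy example"]

# At most MAX_CONCURRENT_REQUESTS calls are in flight at any time.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as pool:
    # map() returns results in input order even though the calls overlap in time.
    results = list(pool.map(fake_classify, texts))

print(results)
```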

In [None]:
run_overview = runner.run_dataset(dataset_id)
evaluation_overview = evaluator.evaluate_runs(run_overview.id)
aggregation_overview = aggregator.aggregate_evaluation(evaluation_overview.id)
aggregation_overview.raise_on_evaluation_failure()

Checking out the results...

In [None]:
evaluation_overview

In [None]:
from intelligence_layer.use_cases import SingleLabelClassifyEvaluation

print("Percentage correct:", aggregation_overview.statistics.percentage_correct)
print(
    "First example:",
    evaluation_repository.example_evaluations(
        evaluation_id=next(iter(aggregation_overview.evaluation_overviews)).id,
        evaluation_type=SingleLabelClassifyEvaluation,
    )[0],
)
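
As a rough illustration of what a percentage-correct statistic measures, the sketch below counts a prediction as correct when the predicted label is among the expected labels. This matching rule is an assumption made for illustration (the precise logic lives in `SingleLabelClassifyEvaluationLogic`), and `fraction_correct` as well as the sample predictions are made up.

```python
from typing import Sequence


def fraction_correct(
    predictions: Sequence[str], expected: Sequence[Sequence[str]]
) -> float:
    """Return the share of predictions that hit at least one expected label."""
    if not predictions:
        raise ValueError("need at least one prediction")
    hits = sum(pred in exp for pred, exp in zip(predictions, expected))
    return hits / len(predictions)


# Made-up predictions and expected labels for four examples.
predictions = ["sports", "music", "news", "sports"]
expected = [["sports"], ["music", "news"], ["sports"], ["sports", "news"]]

print(fraction_correct(predictions, expected))  # 3 of 4 hit -> 0.75
```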

For the sake of comparison, let's see if we can achieve a better result with our `EmbeddingBasedClassify` task.
Here, we have to provide some examples for each class.

We can even reuse our data repositories.
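
To give an intuition for why an embedding-based classifier needs examples per class, the sketch below scores each label by the similarity between the input and that label's closest example. It swaps in plain word counts and cosine similarity for a learned embedding model, so it is only an illustration of the general idea and not a description of how `EmbeddingBasedClassify` works internally; `embed`, `cosine`, `score_labels`, and the example texts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Made-up examples per label.
labels_with_examples: Dict[str, List[str]] = {
    "sports": ["the team won the match", "a late goal decided the game"],
    "music": ["the band released a new album", "the concert tour starts soon"],
}


def score_labels(text: str) -> Dict[str, float]:
    """Score each label by its most similar example."""
    query = embed(text)
    return {
        label: max(cosine(query, embed(example)) for example in examples)
        for label, examples in labels_with_examples.items()
    }


scores = score_labels("the new album is great")
print(max(scores, key=scores.get), scores)
```

Because every label receives its own score, an approach like this can assign several labels to one text, which is why the cells below use multi-label evaluation logic.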

In [None]:
from collections import defaultdict
from typing import Any, Mapping, Sequence

from intelligence_layer.connectors import LimitedConcurrencyClient
from intelligence_layer.use_cases import (
    MultiLabelClassifyEvaluationLogic,
    MultiLabelClassifyAggregationLogic,
    EmbeddingBasedClassify,
    LabelWithExamples,
)


def build_labels_and_examples(hf_data: Any) -> Mapping[str, Sequence[str]]:
    examples = defaultdict(list)
    for d in hf_data:
        labels = d["label_name"]
        for label in labels:
            if len(examples[label]) < 20:
                examples[label].append(d["text"])
    return examples


client = LimitedConcurrencyClient.from_env()
embedding_based_classify = EmbeddingBasedClassify(
    client=client,
    labels_with_examples=[
        LabelWithExamples(name=name, examples=examples)
        for name, examples in build_labels_and_examples(all_data[25:]).items()
    ],
)
eval_logic = MultiLabelClassifyEvaluationLogic(threshold=0.6)
aggregation_logic = MultiLabelClassifyAggregationLogic()

embedding_based_classify_evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "multi-label-classify",
    eval_logic,
)
embedding_based_classify_aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "multi-label-classify",
    aggregation_logic,
)
embedding_based_classify_runner = Runner(
    embedding_based_classify,
    dataset_repository,
    run_repository,
    "embedding-based-classify",
)

In [None]:
embedding_based_classify_run_result = embedding_based_classify_runner.run_dataset(
    dataset_id
)
embedding_based_classify_evaluation_result = (
    embedding_based_classify_evaluator.evaluate_runs(
        embedding_based_classify_run_result.id
    )
)
embedding_based_classify_aggregation_result = (
    embedding_based_classify_aggregator.aggregate_evaluation(
        embedding_based_classify_evaluation_result.id
    )
)
embedding_based_classify_aggregation_result.raise_on_evaluation_failure()

In [None]:
embedding_based_classify_aggregation_result.statistics.macro_avg

Apparently, our method achieves high recall, but it sometimes predicts labels that do not apply, which lowers precision.

Note that the evaluation criteria for the multi-label approach are a lot harsher: we evaluate whether we correctly predict all labels, not just one of the correct ones!
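
The following sketch shows one common way to compute macro-averaged precision, recall, and F1 for multi-label predictions. It assumes that a label counts as predicted when its score is strictly above the threshold; that rule, the `macro_average` helper, and the scores are illustrative assumptions and may differ from what `MultiLabelClassifyEvaluationLogic` and `MultiLabelClassifyAggregationLogic` actually do.

```python
from typing import Dict, List, Set

THRESHOLD = 0.6  # same value as passed to the evaluation logic in this tutorial


def macro_average(
    scored: List[Dict[str, float]], expected: List[Set[str]], labels: List[str]
) -> Dict[str, float]:
    """Compute per-label precision/recall/F1 and average them over all labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = fp = fn = 0
        for scores, truth in zip(scored, expected):
            predicted = scores.get(label, 0.0) > THRESHOLD
            actual = label in truth
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up scores: every true label is found, but some extra labels slip through.
scored = [
    {"sports": 0.9, "music": 0.7},
    {"sports": 0.8, "music": 0.9},
    {"sports": 0.2, "music": 0.8},
]
expected = [{"sports"}, {"music"}, {"music"}]

result = macro_average(scored, expected, ["sports", "music"])
print(result)
```

In this toy data recall is perfect while precision suffers from the extra predicted labels, which is the same pattern described above.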



### Wrap up

There you go: this is how to evaluate any task using the Intelligence Layer framework.
Simply wrap the target `Task` in a `Runner`, then define an `Evaluator` and an `Aggregator` whose evaluation and aggregation logic customize the `do_evaluate` and `aggregate` methods, respectively.