In [None]:
from dotenv import load_dotenv
from intelligence_layer.core import InMemoryTracer, LuminousControlModel, TextChunk
from intelligence_layer.evaluation import (
    Aggregator,
    Evaluator,
    Example,
    InMemoryAggregationRepository,
    InMemoryDatasetRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    Runner,
)
from intelligence_layer.use_cases import (
    ClassifyInput,
    PromptBasedClassify,
    SingleLabelClassifyAggregationLogic,
    SingleLabelClassifyEvaluation,
    SingleLabelClassifyEvaluationLogic,
    SingleLabelClassifyOutput,
)
import json


load_dotenv()

# Assessing the Effectiveness of LLM-based Email Classification Systems

In the fast-paced world of business, effectively managing incoming support emails is crucial. The ability to quickly and accurately classify these emails into the appropriate department and determine their urgency is not just a matter of operational efficiency; it directly impacts customer satisfaction and overall business success. Given the high stakes, it's essential to rigorously evaluate any solution designed to automate this process. This tutorial focuses on evaluating an LLM-based program that automates the classification of support emails.

In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance.


To start off, we are only given a few anecdotal examples. Let's see how far we can get with these.


In [None]:
examples = [
    "Hi, my laptop crashed and I can't start it anymore. Do you need the serial number or sth?",
    "Hello,\n\nI am writing my Master's Thesis and would like to investigate the model's performance. Could I get some free credits?\n\nCheers, Niklas",
]

labels = {
    "Product",
    "Customer",
    "CEO Office",
    "Research",
    "Finance and Accounting",
    "Legal",
    "Communications",
    "Infrastructure",
    "Human Resources",
}

Luckily, the Intelligence Layer provides some classification tasks out of the box.

Let's run it!


In [None]:
# instantiating the default task
prompt_based_classify = PromptBasedClassify()

classify_inputs = [
    ClassifyInput(chunk=TextChunk(example), labels=labels) for example in examples
]


outputs = prompt_based_classify.run_concurrently(classify_inputs, InMemoryTracer())
outputs

Hmm, we have some results, but they aren't really legible (yet).

In [5]:
# pick the (label, score) pair with the highest score for each output
[max(o.scores.items(), key=lambda item: item[1]) for o in outputs]

It appears that the Finance department can fix my laptop and the Comms people can grant free credits...
We probably need to refine our classification approach.
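The step of reducing each output to its best label can be tried without calling a model. The sketch below assumes that a classification output boils down to a mapping from label to score; the scores are made up for illustration and are not real model output, and `top_label` is a hypothetical helper, not part of the Intelligence Layer.

```python
# Hypothetical scores for illustration only; these are NOT real model outputs.
scores_per_email = [
    {"Finance and Accounting": 0.41, "Infrastructure": 0.35, "Product": 0.24},
    {"Communications": 0.52, "Research": 0.30, "Customer": 0.18},
]


def top_label(scores: dict[str, float]) -> tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


top_predictions = [top_label(scores) for scores in scores_per_email]
print(top_predictions)
```

Looking only at the top label hides how close the runner-up was, so it can be worth printing the full score mapping when a prediction looks wrong.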

    },
    {
        "label": "Sales",
        "message": "Jonas, we have met each other at the event in Nürnberg, can we meet for a follow up in your Office in Heidelberg?"

    },
    {
        "label": "Security",
        "message": "Your hTTPs Certificate is not valid on your www.aleph-alpha.de"
    },
    {
        "label": "HR",
        "message": "I want to take a week off immediatly"
    },
    {
        "label": "HR",
        "message": "I want to take a sabbatical"
    },
    {
        "label": "HR",
        "message": "How can I work more, I want to work weekends, can I get paid overtime?"
    }
]

In [None]:
with open("data/classify_examples.json", "r") as file:
    labeled_examples: list[dict[str, str]] = json.load(file)

labeled_examples

The Intelligence Layer offers support to run task evaluations.

First, we have to create a dataset inside a repository.
There are different repositories (that persist datasets in different ways), but an `InMemoryDatasetRepository` will do for now.


In [None]:
dataset_repository = InMemoryDatasetRepository()

dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(example["message"]), labels=labels),
            expected_output=example["label"],
        )
        for example in labeled_examples
    ],
    dataset_name="MyDataset",
).id

When a dataset is created, we generate a unique ID. We'll need it later.

In [None]:
dataset_id

In [None]:
# we need a few repositories to store runs, evals and aggregated evaluations
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()


# each repository is used by a class that has a dedicated responsibility
runner = Runner(
    prompt_based_classify, dataset_repository, run_repository, "prompt-based-classify"
)
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)

Before evaluating, we must generate predictions for each example in our dataset.


In [None]:
run_overview = runner.run_dataset(dataset_id)
run_overview

In [None]:
eval_overview = evaluator.evaluate_runs(run_overview.id)
eval_overview

Finally, let's aggregate all individual evaluations to get some eval statistics.

In [None]:
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
aggregation_overview

It looks like we classified only around 25% of the examples correctly.

However, a closer look at the overview suggests that we have a bunch of incorrect labels in our test dataset.
We will fix this later.
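To make the aggregate number concrete, here is a minimal sketch of how such a percentage can be derived from per-example results. It assumes the reported figure is simply the fraction of examples whose predicted label equals the expected one; `ToyEvaluation` and `accuracy` are hypothetical names for illustration and do not reflect how `SingleLabelClassifyAggregationLogic` is implemented.

```python
from dataclasses import dataclass


@dataclass
class ToyEvaluation:
    """Hypothetical stand-in for a single evaluated example."""

    expected: str
    predicted: str

    @property
    def correct(self) -> bool:
        return self.expected == self.predicted


def accuracy(evaluations: list[ToyEvaluation]) -> float:
    """Fraction of evaluations whose prediction matches the expected label."""
    if not evaluations:
        return 0.0
    return sum(e.correct for e in evaluations) / len(evaluations)


# Made-up results for illustration; the last expected label ("Sales") is not
# in our label set, so a classifier restricted to that set can never match it.
toy_evaluations = [
    ToyEvaluation(expected="Human Resources", predicted="Human Resources"),
    ToyEvaluation(expected="Infrastructure", predicted="Finance and Accounting"),
    ToyEvaluation(expected="Customer", predicted="Communications"),
    ToyEvaluation(expected="Sales", predicted="Customer"),
]

toy_accuracy = accuracy(toy_evaluations)
print(f"{toy_accuracy:.0%}")
```

The last toy entry shows why mismatched label sets depress the score: an expected label outside the label set counts as a failure regardless of how sensible the prediction is.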

First, let's have a look at a few failed examples in detail.

In [None]:
def get_failed_examples(run_id: str, eval_id: str, dataset_id: str, first_n: int):
    """Return up to `first_n` incorrectly classified examples with their top prediction."""
    overview = [
        {
            "input": example.input,
            "expected_output": example.expected_output,
            "result": sorted(
                list(
                    next(
                        example_output
                        for example_output in run_repository.example_outputs(
                            run_id, SingleLabelClassifyOutput
                        )
                        if example_output.example_id == example.id
                    ).output.scores.items()
                ),
                key=lambda i: i[1],
                reverse=True,
            )[0],
            "eval": evaluation_repository.example_evaluation(
                evaluation_id=eval_id,
                example_id=example.id,
                evaluation_type=SingleLabelClassifyEvaluation,
            ).result,
        }
        for example in dataset_repository.examples(
            dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
        )
    ]
    return [example for example in overview if not example["eval"].correct][:first_n]


get_failed_examples(run_overview.id, eval_overview.id, dataset_id, 3)

This confirms it: some expected labels in the dataset are missing from our set of labels. Let's try fixing this.

We can do this in two ways: adjust our set of labels or adjust the eval set. In this case, we'll do the latter.
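Before changing anything, it can help to list exactly which expected labels fall outside the label set. The sketch below does this with a set difference. It restates the label set from earlier in this notebook and uses a small sample in the shape of `labeled_examples`: three entries are taken from the dataset excerpt above, and the last one is hypothetical, added only to show a label that passes the check.

```python
labels = {
    "Product",
    "Customer",
    "CEO Office",
    "Research",
    "Finance and Accounting",
    "Legal",
    "Communications",
    "Infrastructure",
    "Human Resources",
}

sample_examples = [
    {
        "label": "Security",
        "message": "Your hTTPs Certificate is not valid on your www.aleph-alpha.de",
    },
    {"label": "HR", "message": "I want to take a sabbatical"},
    {
        "label": "Sales",
        "message": "Jonas, we have met each other at the event in Nürnberg, "
        "can we meet for a follow up in your Office in Heidelberg?",
    },
    # Hypothetical entry, not part of the dataset excerpt.
    {"label": "Infrastructure", "message": "The VPN is down again."},
]

# Expected labels that the classifier can never predict.
unknown_labels = {example["label"] for example in sample_examples} - labels
print(sorted(unknown_labels))
```

Every label reported by this check needs either an entry in the label set or a mapping onto an existing label; otherwise the corresponding examples will always be counted as failures.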


In [None]:
# let's translate the other labels into the correct department
label_map = {
    "IT Support": "Infrastructure",
    "Sales": "Customer",
    "Marketing": "Customer",
    "Security": "Infrastructure",
    "Finance": "Finance and Accounting",
}

for example in labeled_examples:
    label = example["label"]
    if label in label_map:
        example["label"] = label_map[label]

# datasets in the IL are immutable, so we must create a new one
cleaned_dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(example["message"]), labels=labels),
            expected_output=example["label"],
        )
        for example in labeled_examples
    ],
    dataset_name="CleanedDataset",
).id

The prompt used for the `PromptBasedClassify`-task looks as follows:

In [None]:
print(prompt_based_classify.instruction)

We can probably improve this task by making the prompt more specific, like so:

In [None]:
adjusted_prompt = """Identify the department that would be responsible for handling the given request.
Reply with only the department name."""
prompt_adjusted_classify = PromptBasedClassify(instruction=adjusted_prompt)

Let's run the cleaned dataset using this task...

In [None]:
runner_prompt_adjusted = Runner(
    prompt_adjusted_classify,
    dataset_repository,
    run_repository,
    "running for adjusted prompt",
)
run_overview_prompt_adjusted = runner_prompt_adjusted.run_dataset(cleaned_dataset_id)
eval_overview_prompt_adjusted = evaluator.evaluate_runs(run_overview_prompt_adjusted.id)
aggregation_overview_prompt_adjusted = aggregator.aggregate_evaluation(
    eval_overview_prompt_adjusted.id
)

In [None]:
aggregation_overview_prompt_adjusted

Cool, this already got us up to 58%!

So far, we only used the `luminous-base-control` model. Let's see if we can improve our classifications by upgrading to a bigger model!

In [None]:
classify_with_extended = PromptBasedClassify(
    instruction=adjusted_prompt, model=LuminousControlModel("luminous-supreme-control")
)

Ok, let's run it again and see if we improved!


In [None]:
runner_with_extended = Runner(
    classify_with_extended,
    dataset_repository,
    run_repository,
    "running for adjusted prompt & better model",
)
run_overview_with_extended = runner_with_extended.run_dataset(cleaned_dataset_id)
eval_overview_with_extended = evaluator.evaluate_runs(run_overview_with_extended.id)
aggregation_overview_with_extended = aggregator.aggregate_evaluation(
    eval_overview_with_extended.id
)

In [None]:
aggregation_overview_with_extended

So using a bigger model further improved our results to 66.66%.

As you can see, there are plenty of options for further improving the accuracy of our classify task. Notice, for instance, that so far we have not told our classification task what each class means.

In [None]:
get_failed_examples(
    run_overview_prompt_adjusted.id,
    eval_overview_prompt_adjusted.id,
    cleaned_dataset_id,
    3,
)

The model had to 'guess' what we mean by each class purely from the given labels. To tackle this issue, you could use the `PromptBasedClassifyWithDefinitions` task, which also lets you provide a short description for each class.

Feel free to experiment further and improve our classification example.