In [None]:
import json

import numpy
import pandas
from dotenv import load_dotenv
from matplotlib import pyplot

from intelligence_layer.core import InMemoryTracer, LuminousControlModel, TextChunk
from intelligence_layer.evaluation import (
    Aggregator,
    Evaluator,
    Example,
    InMemoryAggregationRepository,
    InMemoryDatasetRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    Runner,
    evaluation_lineages_to_pandas,
)
from intelligence_layer.evaluation.evaluation.domain import FailedExampleEvaluation
from intelligence_layer.use_cases import (
    ClassifyInput,
    PromptBasedClassify,
    SingleLabelClassifyAggregationLogic,
    SingleLabelClassifyEvaluationLogic,
)

load_dotenv()

# Assessing the Effectiveness of LLM-based Email Classification Systems

In the fast-paced world of business, effectively managing incoming support emails is crucial. The ability to quickly and accurately classify these emails into the appropriate department and determine their urgency is not just a matter of operational efficiency; it directly impacts customer satisfaction and overall business success. Given the high stakes, it's essential to rigorously evaluate any solution designed to automate this process. This tutorial focuses on the evaluation of an LLM-based program developed to automate the classification of support emails.

In an environment brimming with various methodologies and tools, understanding the comparative effectiveness of different approaches is vital. Systematic evaluation allows us to identify which techniques are best suited for specific tasks, understand their strengths and weaknesses, and optimize their performance.

## Setup and Evaluation

To start off, we are only given a few anecdotal examples:
two emails and a set of potential departments to which they could be routed.

Let's have a look.


In [None]:
examples = [
    "Hi, my laptop crashed and I can't start it anymore. Do you need the serial number or sth?",
    "Hello,\n\nI am writing my Master's Thesis and would like to investigate the model's performance. Could I get some free credits?\n\nCheers, Niklas",
]

labels = {
    "Product",
    "Customer",
    "CEO Office",
    "Research",
    "Finance and Accounting",
    "Legal",
    "Communications",
    "Infrastructure",
    "Human Resources",
}

Luckily, the Intelligence Layer provides some classification tasks out of the box.

Let's run one of them!


In [None]:
# instantiating the default task
prompt_based_classify = PromptBasedClassify()

# building the input object for each example
classify_inputs = [
    ClassifyInput(chunk=TextChunk(example), labels=labels) for example in examples
]

# running the tasks concurrently
outputs = prompt_based_classify.run_concurrently(classify_inputs, InMemoryTracer())
outputs

Hmm, we have some results, but they aren't really legible (yet).
So let's look at the sorted individual results for more clarity: 

In [None]:
[o.sorted_scores for o in outputs]

For the first example, 'Finance and Accounting' gets the highest score, while for the second example 'Communications' is the clear winner.
This suggests that the Finance department can fix my laptop and the Comms people can grant free credits... Not very likely.
We probably have to do some fine-tuning of our classification approach.
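
To make the idea of "sorted scores" concrete, here is a minimal plain-Python sketch of ranking labels by score and checking how clear the winner is. The scores are invented for illustration, and `rank_labels` is a hypothetical helper, not part of the Intelligence Layer.

```python
# Hypothetical scores for a single email; the numbers are made up for illustration.
scores = {
    "Product": 0.10,
    "Customer": 0.15,
    "Finance and Accounting": 0.35,
    "Infrastructure": 0.30,
    "Communications": 0.10,
}


def rank_labels(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Return (label, score) pairs ordered from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank_labels(scores)
top_label, top_score = ranked[0]
runner_up_label, runner_up_score = ranked[1]

# A small margin means the winning label is not a confident choice.
margin = top_score - runner_up_score
print(f"{top_label} wins with a margin of {margin:.2f} over {runner_up_label}")
```

Looking at the margin between the top two labels, and not only at the winner, helps to tell a confident misclassification from a near tie.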

However, let's first make sure that this evidence is not anecdotal.
For this, we need to run a proper evaluation. Luckily, we now have access to a few more examples...



In [None]:
with open("data/classify_examples.json", "r") as file:
    labeled_examples: list[dict[str, str]] = json.load(file)

labeled_examples

The Intelligence Layer offers support to run task evaluations.

First, we have to create a dataset inside a repository.
There are different repositories (that persist datasets in different ways), but an `InMemoryDatasetRepository` will do for now.


In [None]:
dataset_repository = InMemoryDatasetRepository()

examples = [
    Example(
        input=ClassifyInput(chunk=TextChunk(example["message"]), labels=labels),
        expected_output=example["label"],
    )
    for example in labeled_examples
]

dataset_id = dataset_repository.create_dataset(
    examples=examples,
    dataset_name="MyDataset",
).id

When a dataset is created, we generate a unique ID. We'll need it later.

In [None]:
dataset_id

Now that we have a dataset, let's actually run an evaluation on it!


In [None]:
# we need a few repositories to store runs, evals and aggregated evaluations
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()


# each repository is used by a class that has a dedicated responsibility
runner = Runner(
    prompt_based_classify, dataset_repository, run_repository, "prompt-based-classify"
)
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)

Before evaluating, we must generate predictions for each example in our dataset.


In [None]:
run_overview = runner.run_dataset(dataset_id)
run_overview

Next, let's evaluate this run.

In [None]:
eval_overview = evaluator.evaluate_runs(run_overview.id)
eval_overview

The evaluation throws many warnings and we will take care of them below.

Finally, let's aggregate all individual evaluations to get some eval statistics.

In [None]:
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
aggregation_overview

It looks like we only predicted around 25% of classes correctly.

Again, we get warnings that there are examples for which the expected labels are not part of the labels that the model can predict.
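
The two observations above (low accuracy, and expected labels outside the allowed set) can be reproduced on toy data. The following is only a sketch under stated assumptions: the pairs are invented, and `accuracy` and `missing_expected` are hypothetical helpers, not the library's evaluation logic.

```python
allowed_labels = {"Product", "Customer", "Infrastructure", "Finance and Accounting"}

# (expected, predicted) pairs, invented for illustration only
pairs = [
    ("Infrastructure", "Infrastructure"),
    ("IT Support", "Infrastructure"),
    ("Customer", "Product"),
    ("Finance", "Finance and Accounting"),
]


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of examples whose predicted label equals the expected label."""
    if not pairs:
        return 0.0
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return correct / len(pairs)


def missing_expected(pairs: list[tuple[str, str]], allowed: set[str]) -> list[str]:
    """Expected labels that the classifier can never predict."""
    return sorted({expected for expected, _ in pairs if expected not in allowed})


print(accuracy(pairs))
print(missing_expected(pairs, allowed_labels))
```

An example whose expected label is missing from the allowed set can never be counted as correct, so such labels put a ceiling on the achievable accuracy regardless of the model.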

## Fixing the Data

Let's have a look at a few failed examples in detail:

In [None]:
passed_lineages = [
    lineage
    for lineage in evaluator.evaluation_lineages(eval_overview.id)
    if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)
]


lineages = [
    lineage for lineage in passed_lineages if not lineage.evaluation.result.correct
][:2]


for lineage in lineages:
    display(lineage)

This confirms it: The first example has an expected label "IT Support". However, this label is not listed in the set of labels our model can predict for that example.

Let's see how often this is the case and which are the invalid expected labels:

In [None]:
lineages = [
    lineage
    for lineage in passed_lineages
    if lineage.evaluation.result.expected_label_missing
]

print(
    f"Number of examples with invalid expected label: {len(lineages)} out of {len(passed_lineages)}"
)
print(
    f"Invalid expected labels: {set([lineage.example.expected_output for lineage in lineages])}"
)

We can fix this in two ways: add the missing labels to the set of allowed labels, or change each invalid expected label to the closest matching available label. In this case, we'll do the latter.

In [None]:
# let's translate the other labels into the correct department
label_map = {
    "IT Support": "Infrastructure",
    "Sales": "Customer",
    "Marketing": "Customer",
    "Security": "Infrastructure",
    "Finance": "Finance and Accounting",
}

# we update the existing examples in place with the correct labels
for example in examples:
    if example.expected_output in label_map:
        example.expected_output = label_map[example.expected_output]

# datasets in the IL are immutable, so we must create a new one
cleaned_dataset_id = dataset_repository.create_dataset(
    examples=examples,
    dataset_name="CleanedDataset",
).id
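
After a remapping like this, it is worth checking that no invalid label is left. The sketch below shows one way to do that in plain Python; `LabeledEmail` and `remap` are hypothetical names, and the emails are invented. Unlike the cell above, it builds new records instead of mutating the existing ones.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LabeledEmail:
    """Hypothetical stand-in for a labeled example."""

    message: str
    label: str


allowed_labels = {"Infrastructure", "Customer", "Finance and Accounting"}
label_map = {"IT Support": "Infrastructure", "Sales": "Customer"}

raw = [
    LabeledEmail("My laptop won't boot.", "IT Support"),
    LabeledEmail("Can I get a quote?", "Sales"),
    LabeledEmail("The invoice is wrong.", "Finance and Accounting"),
]


def remap(emails: list[LabeledEmail], label_map: dict[str, str]) -> list[LabeledEmail]:
    """Return copies with mapped labels; labels without a mapping are kept as-is."""
    return [replace(e, label=label_map.get(e.label, e.label)) for e in emails]


cleaned = remap(raw, label_map)

# Any label that is still outside the allowed set needs a new mapping entry.
unmapped = {e.label for e in cleaned} - allowed_labels
print(f"Labels still missing from the allowed set: {unmapped or 'none'}")
```

Keeping the raw records untouched makes it easy to compare the dataset before and after cleaning.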

## Improving the Prompt

The prompt used for the `PromptBasedClassify`-task looks as follows:

In [None]:
print(prompt_based_classify.instruction)

We can probably improve this task by making the prompt more specific, like so:

In [None]:
adjusted_prompt = """Identify the department that would be responsible for handling the given request.
Reply with only the department name."""
prompt_adjusted_classify = PromptBasedClassify(instruction=adjusted_prompt)

Let's run the cleaned dataset using this task...

In [None]:
runner_prompt_adjusted = Runner(
    prompt_adjusted_classify,
    dataset_repository,
    run_repository,
    "running for adjusted prompt",
)
run_overview_prompt_adjusted = runner_prompt_adjusted.run_dataset(cleaned_dataset_id)
eval_overview_prompt_adjusted = evaluator.evaluate_runs(run_overview_prompt_adjusted.id)
aggregation_overview_prompt_adjusted = aggregator.aggregate_evaluation(
    eval_overview_prompt_adjusted.id
)

In [None]:
aggregation_overview_prompt_adjusted

Our adjustments improved the accuracy to 58%!

So far, we only used the `luminous-base-control` model. Let's see if we can improve our classifications by upgrading to a bigger model!

In [None]:
classify_with_extended = PromptBasedClassify(
    instruction=adjusted_prompt, model=LuminousControlModel("luminous-supreme-control")
)

Ok, let's run it again and see if we improved!


In [None]:
runner_with_extended = Runner(
    classify_with_extended,
    dataset_repository,
    run_repository,
    "running for adjusted prompt & better model",
)
run_overview_with_extended = runner_with_extended.run_dataset(cleaned_dataset_id)
eval_overview_with_extended = evaluator.evaluate_runs(run_overview_with_extended.id)
aggregation_overview_with_extended = aggregator.aggregate_evaluation(
    eval_overview_with_extended.id
)

In [None]:
aggregation_overview_with_extended

So using a bigger model further improved our results to 67%. But there are still wrongly predicted labels:

In [None]:
incorrect_predictions_lineages = [
    lineage
    for lineage in evaluator.evaluation_lineages(eval_overview_with_extended.id)
    if not isinstance(lineage.evaluation.result, FailedExampleEvaluation)
    and not lineage.evaluation.result.correct
]

df = evaluation_lineages_to_pandas(incorrect_predictions_lineages)
df["input"] = [i.chunk for i in df["input"]]
df["predicted"] = [r.predicted for r in df["result"]]
df.reset_index()[["example_id", "input", "expected_output", "predicted"]]

So let's analyze this in more depth by plotting, as a grouped bar chart, how often each label was expected and how often it was predicted.

In [None]:
by_labels = aggregation_overview_with_extended.statistics.by_label

expected_counts_by_labels = {
    label: by_labels[label].expected_count for label in by_labels.keys()
}
predicted_counts_by_labels = {
    label: by_labels[label].predicted_count for label in by_labels.keys()
}

x_axis = numpy.arange(len(expected_counts_by_labels.keys()))
pyplot.bar(
    x_axis - 0.2, expected_counts_by_labels.values(), width=0.4, label="expected counts"
)
pyplot.bar(
    x_axis + 0.2,
    predicted_counts_by_labels.values(),
    width=0.4,
    label="predicted counts",
)
pyplot.ylabel("Classification count")
pyplot.xlabel("Labels")
pyplot.legend()
_ = pyplot.xticks(x_axis, by_labels.keys(), rotation=45)

As we can see, our task tends to overpredict the `Customer` label, while it underpredicts `Infrastructure`, `CEO Office` and `Product`.
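
The same over- and underprediction can be read off numerically as the difference between predicted and expected counts per label. This is a minimal sketch with invented counts; `prediction_bias` is a hypothetical helper.

```python
# Invented counts for illustration; in the notebook they come from the aggregation statistics.
expected_counts = {"Customer": 4, "Infrastructure": 5, "Product": 3}
predicted_counts = {"Customer": 8, "Infrastructure": 2, "Product": 2}


def prediction_bias(expected: dict[str, int], predicted: dict[str, int]) -> dict[str, int]:
    """Predicted minus expected count per label; positive means overpredicted."""
    labels = sorted(set(expected) | set(predicted))
    return {label: predicted.get(label, 0) - expected.get(label, 0) for label in labels}


bias = prediction_bias(expected_counts, predicted_counts)
for label, diff in sorted(bias.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:15s} {diff:+d}")
```

Because every example receives exactly one prediction, the differences sum to zero: each overpredicted label is paid for by underpredicted ones.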

We can get even more insight into the classification behaviour of our task by analysing its confusion matrix. The off-diagonal cells of the confusion matrix show the explicit mislabeling for each class. This helps us to see if a specific class is frequently mislabeled as a particular other class.

In [None]:
confusion_matrix = aggregation_overview_with_extended.statistics.confusion_matrix

data = []
for (predicted_label, expected_label), count in confusion_matrix.items():
    data.append(
        {
            "Expected Label": expected_label,
            "Predicted Label": predicted_label,
            "Count": count,
        }
    )

df = pandas.DataFrame(data)
df = df.pivot(index="Expected Label", columns="Predicted Label", values="Count")
df = df.fillna(0)
df = df.reindex(
    index=sorted(labels), columns=sorted(labels), fill_value=0
)  # this will add any labels that were neither expected nor predicted; sorting fixes the row and column order
df = df.style.background_gradient(cmap="gray", vmin=df.min().min(), vmax=df.max().max())
df = df.format("{:.0f}")
df

In our case, we can see that the bias towards the `Customer` class does not come at the cost of one particular other class, but is caused by a more general mislabeling.
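
A confusion matrix can also be condensed into per-class precision and recall. The sketch below assumes a dictionary keyed by `(predicted, expected)` pairs, mirroring how the loop above unpacks its keys; the counts are invented and `precision_recall` is a hypothetical helper.

```python
from collections import defaultdict

# Invented counts, keyed by (predicted label, expected label).
confusion = {
    ("Customer", "Customer"): 3,
    ("Customer", "Product"): 2,
    ("Customer", "Infrastructure"): 1,
    ("Infrastructure", "Infrastructure"): 2,
    ("Product", "Product"): 1,
}


def precision_recall(confusion: dict[tuple[str, str], int]) -> dict[str, tuple[float, float]]:
    """Return {label: (precision, recall)} computed from the confusion counts."""
    predicted_totals: dict[str, int] = defaultdict(int)
    expected_totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for (predicted, expected), count in confusion.items():
        predicted_totals[predicted] += count
        expected_totals[expected] += count
        if predicted == expected:
            hits[predicted] += count
    labels = sorted(set(predicted_totals) | set(expected_totals))
    return {
        label: (
            hits[label] / predicted_totals[label] if predicted_totals[label] else 0.0,
            hits[label] / expected_totals[label] if expected_totals[label] else 0.0,
        )
        for label in labels
    }


for label, (precision, recall) in precision_recall(confusion).items():
    print(f"{label:15s} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, `Customer` has perfect recall but low precision, while the other classes have perfect precision and reduced recall. That is the pattern to expect when one class absorbs errors from several others.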

As you can see, there is plenty of room for further improvement of our classification task.

Notice, for instance, that so far we did not tell our classification task what each class means.

The model had to 'guess' what we mean by each class purely from the given labels. In order to tackle this issue you could use the `PromptBasedClassifyWithDefinitions` task. This task allows you to also provide a short description for each class.
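
To illustrate the idea of class descriptions without relying on a specific task's API, here is a plain-Python sketch that folds short definitions into an instruction string. The definitions are invented, `build_instruction` is a hypothetical helper, and this is not how `PromptBasedClassifyWithDefinitions` is implemented.

```python
# Invented definitions for illustration; real ones should come from the departments themselves.
label_definitions = {
    "Infrastructure": "Hardware, IT and access problems.",
    "Customer": "Questions about pricing, sales and existing contracts.",
    "Research": "Academic collaborations and questions about model behaviour.",
}


def build_instruction(definitions: dict[str, str]) -> str:
    """Build an instruction that lists each department together with its definition."""
    lines = [
        "Identify the department that would be responsible for handling the given request.",
        "Choose one of the following departments:",
    ]
    for name, definition in sorted(definitions.items()):
        lines.append(f"- {name}: {definition}")
    lines.append("Reply with only the department name.")
    return "\n".join(lines)


instruction = build_instruction(label_definitions)
print(instruction)
```

A string like this could be tried as the `instruction` argument of `PromptBasedClassify`, in the same way as `adjusted_prompt` above, and evaluated with the same runner, evaluator and aggregator to check whether the definitions actually help.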

Feel free to experiment further and improve our classification example.