# Assessing the Effectiveness of LLM-based Email Classification Systems

Managing incoming support emails well is crucial for any business. Classifying these emails quickly and accurately by department and urgency is not just a matter of operational efficiency; it directly affects customer satisfaction. Given these stakes, any solution that automates the process should be evaluated rigorously. This tutorial focuses on evaluating an LLM-based program developed to automate the classification of support emails.

With many methodologies and tools available, it is vital to understand how different approaches compare. Systematic evaluation allows us to identify which techniques are best suited to a specific task, understand their strengths and weaknesses, and optimize their performance.


To start off, we are only given a few anecdotal examples: two emails and a set of departments to which they could be routed.

Let's have a look.


In [3]:
examples = [
    "Hi, my laptop crashed and I can't start it anymore. Do you need the serial number or sth?",
    "Hello,\n\nI am writing my Master's Thesis and would like to investigate the model's performance. Could I get some free credits?\n\nCheers, Niklas",
]

labels = {
    "Product",
    "Customer",
    "CEO Office",
    "Research",
    "Finance",
    "Accounting",
    "Legal",
    "Communication Department",
    "Infrastructure",
    "People & Culture",
}

Luckily, the Intelligence Layer provides some classification tasks out of the box.

Let's import it and run!


In [4]:
from intelligence_layer.core import TextChunk, InMemoryTracer
from intelligence_layer.use_cases import PromptBasedClassify, ClassifyInput


# instantiating the default task
prompt_based_classify = PromptBasedClassify()

# building the input object for each example
classify_inputs = [
    ClassifyInput(chunk=TextChunk(example), labels=labels) for example in examples
]

# running the tasks concurrently
outputs = prompt_based_classify.run_concurrently(classify_inputs, InMemoryTracer())
outputs

[SingleLabelClassifyOutput(scores={'People & Culture': 0.00540895946326583, 'Product': 0.0012847404821690297, 'Legal': 0.011450767273591046, 'Infrastructure': 7.63933851429347e-06, 'Accounting': 0.0037175198485175764, 'Communication Department': 0.4868987707545542, 'Research': 0.0001194994797172138, 'Finance': 0.4868987707545542, 'CEO Office': 8.307395237093146e-07, 'Customer': 0.004212501865592915}),
 SingleLabelClassifyOutput(scores={'People & Culture': 0.0002471222567108989, 'Product': 1.034420307682754e-06, 'Legal': 5.253219782981775e-06, 'Infrastructure': 1.797552960512546e-07, 'Accounting': 0.0024014826120226144, 'Communication Department': 0.9688271358793473, 'Research': 0.02010736772998545, 'Finance': 0.008381997922340201, 'CEO Office': 2.756636746614045e-08, 'Customer': 2.839863783934064e-05})]

Hmm, we have some results, but they aren't really legible (yet).

In [5]:
[max(o.scores.items(), key=lambda item: item[1]) for o in outputs]

[('Communication Department', 0.4868987707545542),
 ('Communication Department', 0.9688271358793473)]

It appears that both inputs were mistakenly classified as belonging to the Communication Department.
We probably have to adjust our classification approach.
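
Looking only at the top-1 label hides a detail in the raw output: for the first email, 'Communication Department' and 'Finance' share exactly the same score. The following is a minimal sketch, independent of the Intelligence Layer, of how the runner-up and such ties could be surfaced. The helper names `top_labels` and `is_tie` are hypothetical, and the dictionary holds only a subset of the scores printed above.

```python
import math


def top_labels(scores: dict[str, float], k: int = 2) -> list[tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def is_tie(scores: dict[str, float], rel_tol: float = 1e-9) -> bool:
    """Report whether the two best labels have (almost) the same score."""
    (_, best), (_, runner_up) = top_labels(scores, k=2)
    return math.isclose(best, runner_up, rel_tol=rel_tol)


# Subset of the scores printed for the first email above.
first_scores = {
    "Legal": 0.011450767273591046,
    "Communication Department": 0.4868987707545542,
    "Finance": 0.4868987707545542,
    "Customer": 0.004212501865592915,
}

print(top_labels(first_scores))
print(is_tie(first_scores))
```

Because Python's sort is stable, the label that appears first in the dictionary wins a tie, so the top-1 result for such an input depends on ordering rather than on the model's preference.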

However, let's first make sure that this is not just anecdotal evidence.
For this, we need to run a proper evaluation. Luckily, we now have access to a few more examples...


In [6]:
import json


with open("data/classify_examples.json", 'r') as file:
    labeled_examples = json.load(file)

labeled_examples
    

[{'label': 'Finance',
  'message': 'I just traveled to Paris for a conference, where can I get the train ride refunded?'},
 {'label': 'Customer',
  'message': 'Hello, we would like to get in contact with your sales team, because we are interested in your solution.'},
 {'label': 'Communication Department',
  'message': 'We are working on a documentation on AI and would like to film a piece about you. Would you be interested?'},
 {'label': 'Research',
  'message': 'I am working with Stanford and was hoping to win you over for a research collaboration.'},
 {'label': 'IT Support', 'message': 'My laptop is broken'},
 {'label': 'Communications',
  'message': 'I already tried to call many times. Can I get a meeting with Jonas?'},
 {'label': 'Communications', 'message': 'Can you send your models via email?'},
 {'label': 'Research', 'message': 'We should do a research collaboration.'},
 {'label': 'Research',
  'message': 'H100 cluster available right now. Would you like to procure at low prices
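
Some expected labels in this excerpt, such as 'IT Support' and 'Communications', do not appear in the `labels` set defined earlier. A classifier that only scores the given labels presumably cannot get these examples right, so it may be worth checking the data before evaluating. The sketch below is one way to do that; `find_unknown_labels` is a hypothetical helper, and `sample` repeats only three of the entries shown above.

```python
def find_unknown_labels(
    examples: list[dict[str, str]], allowed_labels: set[str]
) -> dict[str, list[str]]:
    """Map each expected label missing from allowed_labels to its messages."""
    unknown: dict[str, list[str]] = {}
    for example in examples:
        if example["label"] not in allowed_labels:
            unknown.setdefault(example["label"], []).append(example["message"])
    return unknown


# Same departments as the `labels` set defined earlier.
allowed = {
    "Product", "Customer", "CEO Office", "Research", "Finance",
    "Accounting", "Legal", "Communication Department",
    "Infrastructure", "People & Culture",
}

# Three entries copied from the excerpt above.
sample = [
    {"label": "Research", "message": "We should do a research collaboration."},
    {"label": "IT Support", "message": "My laptop is broken"},
    {"label": "Communications", "message": "Can you send your models via email?"},
]

for label, messages in find_unknown_labels(sample, allowed).items():
    print(f"{label!r} is not a known department ({len(messages)} example(s))")
```

Whether such examples should be relabeled, dropped, or kept as deliberately hard cases is a judgment call for whoever owns the dataset.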

The Intelligence Layer offers support for running task evaluations.

First, we have to create a dataset inside a repository.
There are different repositories (that persist datasets in different ways), but an `InMemoryDatasetRepository` will do for now.


In [7]:
from intelligence_layer.evaluation import InMemoryDatasetRepository, Example

dataset_repository = InMemoryDatasetRepository()

dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(example["message"]), labels=labels),
            expected_output=example["label"],
        )
        for example in labeled_examples
    ]
)


When a dataset is created, it is assigned a unique ID. We'll need it later.

In [8]:
dataset_id

'2521c77d-114d-4e65-8f81-449e53108f7f'

Now that we have a dataset, let's actually run an evaluation on it!


In [9]:
from dotenv import load_dotenv

from intelligence_layer.evaluation import (
    Evaluator,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    InMemoryAggregationRepository,
    Runner,
    Aggregator,
)
from intelligence_layer.use_cases import (
    SingleLabelClassifyEvaluationLogic,
    SingleLabelClassifyAggregationLogic,
)

load_dotenv()

# we need a few repositories to store runs, evals and aggregated evaluations
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()


# each repository is used by a class that has a dedicated responsibility
runner = Runner(
    prompt_based_classify,
    dataset_repository,
    run_repository,
    "prompt-based-classify"
)
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)
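
The split into runner, evaluator and aggregator can be pictured as three small stages, each consuming what the previous one stored. The following toy sketch illustrates only that separation of responsibilities; it is not how the Intelligence Layer implements these classes. All names (`ToyExample`, `run`, `evaluate`, `aggregate`, `keyword_task`) and the three messages are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyExample:
    id: str
    text: str
    expected: str


def run(task: Callable[[str], str], examples: list[ToyExample]) -> dict[str, str]:
    """Runner stage: produce one output per example, keyed by example id."""
    return {ex.id: task(ex.text) for ex in examples}


def evaluate(outputs: dict[str, str], examples: list[ToyExample]) -> dict[str, bool]:
    """Evaluator stage: compare each output with its expected output."""
    return {ex.id: outputs[ex.id] == ex.expected for ex in examples}


def aggregate(evaluations: dict[str, bool]) -> float:
    """Aggregator stage: reduce per-example results to a single statistic."""
    return sum(evaluations.values()) / len(evaluations)


def keyword_task(text: str) -> str:
    """Stand-in for a real classifier: a trivial keyword rule."""
    return "Research" if "research" in text.lower() else "Customer"


toy_examples = [
    ToyExample("1", "We should do a research collaboration.", "Research"),
    ToyExample("2", "I would like to talk to your sales team.", "Customer"),
    ToyExample("3", "Where can I get my train ride refunded?", "Finance"),
]

outputs = run(keyword_task, toy_examples)
evaluations = evaluate(outputs, toy_examples)
accuracy = aggregate(evaluations)
print(f"accuracy: {accuracy:.2f}")
```

Keeping the stages separate means stored outputs can be re-evaluated with different logic without calling the model again, which is presumably why each stage gets its own repository.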


Before evaluating, we must generate predictions for each sample in our dataset.


In [10]:
run_overview = runner.run_dataset(dataset_id)
run_overview

Evaluating: 0it [00:00, ?it/s]

Evaluating: 27it [00:10,  2.49it/s]


RunOverview(dataset_id='2521c77d-114d-4e65-8f81-449e53108f7f', id='e547db06-baee-4c69-8210-01c8e059a68f', start=datetime.datetime(2024, 3, 12, 15, 45, 25, 706068, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 45, 36, 593248, tzinfo=datetime.timezone.utc), failed_example_count=0, successful_example_count=27, description='prompt-based-classify')

Next, let's evaluate this run.

In [11]:
eval_overview = evaluator.evaluate_runs(run_overview.id)
eval_overview


Evaluating: 0it [00:00, ?it/s]


EvaluationOverview(run_overviews=frozenset({RunOverview(dataset_id='2521c77d-114d-4e65-8f81-449e53108f7f', id='e547db06-baee-4c69-8210-01c8e059a68f', start=datetime.datetime(2024, 3, 12, 15, 45, 25, 706068, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 45, 36, 593248, tzinfo=datetime.timezone.utc), failed_example_count=0, successful_example_count=27, description='prompt-based-classify')}), id='48db85f8-e808-4b75-bb81-ad23947553d8', start=datetime.datetime(2024, 3, 12, 15, 45, 36, 598450, tzinfo=datetime.timezone.utc), description='single-label-classify')

Finally, let's aggregate all individual evaluations to get some evaluation statistics.

In [12]:
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
aggregation_overview


AggregationOverview(evaluation_overviews=frozenset({EvaluationOverview(run_overviews=frozenset({RunOverview(dataset_id='2521c77d-114d-4e65-8f81-449e53108f7f', id='e547db06-baee-4c69-8210-01c8e059a68f', start=datetime.datetime(2024, 3, 12, 15, 45, 25, 706068, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 45, 36, 593248, tzinfo=datetime.timezone.utc), failed_example_count=0, successful_example_count=27, description='prompt-based-classify')}), id='48db85f8-e808-4b75-bb81-ad23947553d8', start=datetime.datetime(2024, 3, 12, 15, 45, 36, 598450, tzinfo=datetime.timezone.utc), description='single-label-classify')}), id='c53bf1be-96b1-4de9-bd48-aa097c3eeefe', start=datetime.datetime(2024, 3, 12, 15, 45, 36, 607875, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 45, 36, 607970, tzinfo=datetime.timezone.utc), successful_evaluation_count=27, crashed_during_evaluation_count=0, description='single-label-classify', statistics=AggregatedSingleLabelClassif

In [13]:
from intelligence_layer.use_cases import (
    SingleLabelClassifyOutput,
    SingleLabelClassifyEvaluation,
)


overview = [
    {
        "input": example.input,
        "expected_output": example.expected_output,
        "result": next(
            e
            for e in run_repository.example_outputs(
                run_overview.id, SingleLabelClassifyOutput
            )
            if e.example_id == example.id
        ).output,
        "eval": evaluation_repository.example_evaluation(
            evaluation_id=eval_overview.id,
            example_id=example.id,
            evaluation_type=SingleLabelClassifyEvaluation,
        ).result,
    }
    for example in dataset_repository.examples(
        dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
    )
]

In [14]:
[e for e in overview if not e["eval"].correct]

[{'input': ClassifyInput(chunk='I am working with Stanford and was hoping to win you over for a research collaboration.', labels=frozenset({'People & Culture', 'Product', 'Legal', 'Infrastructure', 'Accounting', 'Communication Department', 'Research', 'Finance', 'CEO Office', 'Customer'})),
  'expected_output': 'Research',
  'result': SingleLabelClassifyOutput(scores={'People & Culture': 0.0007585755008278973, 'Product': 2.1306824628346814e-06, 'Legal': 9.644205228204069e-05, 'Infrastructure': 2.992136017512143e-08, 'Accounting': 0.0001092831623539844, 'Communication Department': 0.9426427384693921, 'Research': 0.053180248281635, 'Finance': 0.003193722149143611, 'CEO Office': 7.066480296161476e-08, 'Customer': 1.6759115739504213e-05}),
  'eval': SingleLabelClassifyEvaluation(correct=False)},
 {'input': ClassifyInput(chunk='Hey, I did not get a t-shirt in the onboarding. Could I still get one?', labels=frozenset({'People & Culture', 'Product', 'Legal', 'Infrastructure', 'Accounting', 'C
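
With more than a handful of failures, reading the raw list gets tedious. One option is to count how often each expected label is confused with each predicted label. The sketch below assumes plain dictionaries rather than the `overview` entries above; `predicted_label` and `confusion_pairs` are hypothetical helpers. The first row uses a subset of the scores printed for the Stanford email, the other two rows are made up.

```python
from collections import Counter


def predicted_label(scores: dict[str, float]) -> str:
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def confusion_pairs(rows: list[dict]) -> Counter:
    """Count (expected, predicted) pairs over all misclassified rows."""
    pairs: Counter = Counter()
    for row in rows:
        predicted = predicted_label(row["scores"])
        if predicted != row["expected"]:
            pairs[(row["expected"], predicted)] += 1
    return pairs


rows = [
    # Subset of the scores printed above for the Stanford email.
    {
        "expected": "Research",
        "scores": {
            "Communication Department": 0.9426427384693921,
            "Research": 0.053180248281635,
            "Finance": 0.003193722149143611,
        },
    },
    # Made-up rows for illustration only.
    {"expected": "Research", "scores": {"Communication Department": 0.6, "Research": 0.4}},
    {"expected": "Finance", "scores": {"Finance": 0.9, "Accounting": 0.1}},
]

for (expected, predicted), count in confusion_pairs(rows).most_common():
    print(f"expected {expected!r}, got {predicted!r}: {count}x")
```

A pair that dominates the counts points to a systematic confusion that a prompt change might address.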

In [15]:
prompt_adjusted_classify_task = PromptBasedClassify(
    instruction="""Identify the department that would be responsible for handling the given request.
Reply with only the department name."""
)

In [16]:
runner_prompt_adjusted = Runner(
    prompt_adjusted_classify_task,
    dataset_repository,
    run_repository,
    "running for adjusted prompt",
)
run_overview_prompt_adjusted = runner_prompt_adjusted.run_dataset(dataset_id)

Evaluating: 27it [00:32,  1.19s/it]


In [17]:
eval_overview_prompt_adjusted = evaluator.evaluate_runs(run_overview_prompt_adjusted.id)

Evaluating: 0it [00:00, ?it/s]


In [18]:
aggregation_overview_prompt_adjusted = aggregator.aggregate_evaluation(
    eval_overview_prompt_adjusted.id
)
aggregation_overview_prompt_adjusted

AggregationOverview(evaluation_overviews=frozenset({EvaluationOverview(run_overviews=frozenset({RunOverview(dataset_id='2521c77d-114d-4e65-8f81-449e53108f7f', id='886a10f9-4ddb-4c36-a0e4-f1215c1ead00', start=datetime.datetime(2024, 3, 12, 15, 45, 36, 624811, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 46, 8, 689225, tzinfo=datetime.timezone.utc), failed_example_count=0, successful_example_count=27, description='running for adjusted prompt')}), id='1be5cc48-316e-498b-ad53-9662545f2e28', start=datetime.datetime(2024, 3, 12, 15, 46, 8, 693089, tzinfo=datetime.timezone.utc), description='single-label-classify')}), id='6a4e4df1-443f-47cc-b3bd-54267abb1c4a', start=datetime.datetime(2024, 3, 12, 15, 46, 8, 700178, tzinfo=datetime.timezone.utc), end=datetime.datetime(2024, 3, 12, 15, 46, 8, 700319, tzinfo=datetime.timezone.utc), successful_evaluation_count=27, crashed_during_evaluation_count=0, description='single-label-classify', statistics=AggregatedSingleLabelClass

In [19]:
overview = [
    {
        "input": example.input,
        "expected_output": example.expected_output,
        "result": next(
            e
            for e in run_repository.example_outputs(
                run_overview_prompt_adjusted.id, SingleLabelClassifyOutput
            )
            if e.example_id == example.id
        ).output,
        "eval": evaluation_repository.example_evaluation(
            evaluation_id=eval_overview_prompt_adjusted.id,
            example_id=example.id,
            evaluation_type=SingleLabelClassifyEvaluation,
        ).result,
    }
    for example in dataset_repository.examples(
        dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
    )
]
[e for e in overview if not e["eval"].correct]

[{'input': ClassifyInput(chunk='Hey, I did not get a t-shirt in the onboarding. Could I still get one?', labels=frozenset({'People & Culture', 'Product', 'Legal', 'Infrastructure', 'Accounting', 'Communication Department', 'Research', 'Finance', 'CEO Office', 'Customer'})),
  'expected_output': 'People & Culture',
  'result': SingleLabelClassifyOutput(scores={'People & Culture': 0.11905821963385722, 'Product': 0.0007529779112762469, 'Legal': 0.0026281511497473755, 'Infrastructure': 9.573039863173891e-05, 'Accounting': 0.00976476612874965, 'Communication Department': 0.028255287665935772, 'Research': 1.1433354228584936e-05, 'Finance': 0.19613051178613103, 'CEO Office': 0.00021573178351041137, 'Customer': 0.6430871901879318}),
  'eval': SingleLabelClassifyEvaluation(correct=False)},
 {'input': ClassifyInput(chunk='Jonas, we have met each other at the event in Nürnberg, can we meet for a follow up in your Office in Heidelberg?', labels=frozenset({'People & Culture', 'Product', 'Legal', 'I
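
To judge whether the adjusted prompt is really an improvement, it can help to compare the two runs example by example rather than only through the aggregate. The sketch below buckets example ids by how their correctness changed. `compare_runs` is a hypothetical helper, and the ids and correctness flags are made up for illustration.

```python
def compare_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Bucket example ids by how their correctness changed between two runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "fixed": [i for i in shared if not baseline[i] and candidate[i]],
        "regressed": [i for i in shared if baseline[i] and not candidate[i]],
        "still_wrong": [i for i in shared if not baseline[i] and not candidate[i]],
    }


# Made-up correctness flags, keyed by hypothetical example ids.
baseline_correct = {"ex-1": False, "ex-2": True, "ex-3": False, "ex-4": True}
adjusted_correct = {"ex-1": True, "ex-2": True, "ex-3": False, "ex-4": False}

changes = compare_runs(baseline_correct, adjusted_correct)
for bucket, ids in changes.items():
    print(f"{bucket}: {ids}")
```

In this notebook, such flags could be collected from the `correct` field of each entry's `eval`, provided the first `overview` list is kept under a different name before it is overwritten.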