# Assessing the Effectiveness of LLM-based Email Classification Systems

Handling incoming support emails well matters to any business. Routing each email to the right department quickly and accurately, and judging its urgency, affects not only operational efficiency but also customer satisfaction. Because mistakes are costly, any solution that automates this process should be evaluated rigorously. This tutorial evaluates an LLM-based program that classifies support emails.

Many methodologies and tools are available for this task, so it helps to compare them on the same data. Systematic evaluation shows which techniques suit a given task, where each one is strong or weak, and how changes affect performance.


To start off, we are only given a few anecdotal examples. Let's see how far we can get with these.


In [1]:
examples = [
    "Hi, my laptop crashed and I can't start it anymore. Do you need the serial number or sth?",
    "Hello,\n\nI am writing my Master's Thesis and would like to investigate the model's performance. Could I get some free credits?\n\nCheers, Niklas",
]

labels = {
    "Product",
    "Customer",
    "CEO Office",
    "Research",
    "Finance",
    "Accounting",
    "Legal",
    "Communication Department",
    "Infrastructure",
    "People & Culture",
}

In [None]:
from intelligence_layer.core import TextChunk, InMemoryTracer
from intelligence_layer.use_cases import PromptBasedClassify, ClassifyInput


prompt_based_classify = PromptBasedClassify()

classify_inputs = [
    ClassifyInput(chunk=TextChunk(example), labels=labels) for example in examples
]


outputs = prompt_based_classify.run_concurrently(classify_inputs, InMemoryTracer())
outputs

In [None]:
# For each output, pick the (label, score) pair with the highest score.
[max(o.scores.items(), key=lambda item: item[1]) for o in outputs]

In [None]:
labeled_examples = [
    {
        "label": "Finance",
        "message": "I just traveled to Paris for a conference, where can I get the train ride refunded?",
    },
    {
        "label": "Customer",
        "message": "Hello, we would like to get in contact with your sales team, because we are interested in your solution.",
    },
    {
        "label": "Communication Department",
        "message": "We are working on a documentation on AI and would like to film a piece about you. Would you be interested?",
    },
    {
        "label": "Research",
        "message": "I am working with Stanford and was hoping to win you over for a research collaboration.",
    },
    {
        "label": "IT Support",
        "message": "My laptop is broken",
    },
    {
        "label": "Communications",
        "message": "I already tried to call many times. Can I get a meeting with Jonas?",
    },
    {
        "label": "Communications",
        "message": "Can you send your models via email?"
    },
    {
        "label": "Research",
        "message": "We should do a research collaboration.",
    },
    {
        "label": "Research",
        "message": "H100 cluster available right now. Would you like to procure at low prices?",
    },
    {
        "label": "Research",
        "message": "My company has been working on time series and signal processing for a long time. It would make sense to define a joint go to market.",
    },
    {
        "label": "People & Culture",
        "message": "Full stack developer in your area available now.",
    },
    {
        "label": "Product",
        "message": "Hi,\n\nI am having trouble running your docker container in my environment. It fails to start. Can you help?",
    },
    {
        "label": "Product",
        "message": "Hello,\n\nI am getting strange errors from your API. It is saying the queue is full, but I am only sending one task at a time. Why is this happening?",
    },
    {
        "label": "Customer",
        "message": "Can you show me a demo of different use cases your product can solve?",
    },
    {
        "label": "People & Culture",
        "message": "Hey, I did not get a t-shirt in the onboarding. Could I still get one?",
    },
    {
        "label": "Customer",
        "message": "Hi, can you name me a couple of timeslots for a first call? Would be really interested in learning more about the product?",
    },
    {
        "label": "Product",
        "message": "Hi Jan, is your tool ISO 37301 compliant?",
    },
    {
        "label": "IT Support",
        "message": "I can’t login to Mattermost or Sharepoint, how can I gain access?",
    },
    {
        "label": "Ignore",
        "message": "Hi, Jonas here. I need something really urgently right now. Could you share your number with me?",
    },
    {
        "label": "Finance",
        "message": "I did not get paid last month, when do I get paid? What is going on?"
    },
    {
        "label": "Security",
        "message": "Hi, I want to get a new badge, the photo of me looks ugly and I just got new glasses so it does not look like me. "
    },
    {
        "label": "Marketing",
        "message": "Let us celebrate AI day in style, we want to invite you and the CEO to join us.",
    },
    {
        "label": "Sales",
        "message": "Jonas, we have met each other at the event in Nürnberg, can we meet for a follow up in your Office in Heidelberg?",
    },
    {
        "label": "Security",
        "message": "Your hTTPs Certificate is not valid on your www.aleph-alpha.de"
    },
    {
        "label": "HR",
        "message": "I want to take a week off immediatly"
    },
    {
        "label": "HR",
        "message": "I want to take a sabbatical"
    },
    {
        "label": "HR",
        "message": "How can I work more, I want to work weekends, can I get paid overtime?"
    }
]
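
Some of the expected labels above (for example "IT Support" or "HR") do not appear in the `labels` set passed to the classifier, so those examples can never be counted as correct. A minimal sketch of a sanity check for this follows. It uses a hand-picked subset of the examples, and `find_unknown_labels` is a hypothetical helper, not part of `intelligence_layer`.

```python
# Allowed labels, copied from the `labels` set defined earlier in this notebook.
allowed_labels = {
    "Product", "Customer", "CEO Office", "Research", "Finance",
    "Accounting", "Legal", "Communication Department",
    "Infrastructure", "People & Culture",
}

# A small subset of the labeled examples above (illustrative only).
sample_examples = [
    {"label": "Finance", "message": "I did not get paid last month, when do I get paid? What is going on?"},
    {"label": "IT Support", "message": "My laptop is broken"},
    {"label": "HR", "message": "I want to take a sabbatical"},
    {"label": "Research", "message": "We should do a research collaboration."},
]


def find_unknown_labels(examples, allowed):
    """Return the expected labels that are not in the allowed label set."""
    return {example["label"] for example in examples} - allowed


unknown_labels = find_unknown_labels(sample_examples, allowed_labels)
print(sorted(unknown_labels))  # ['HR', 'IT Support']
```

Whether such labels should be added to the allowed set or mapped onto existing ones (for instance "HR" to "People & Culture") is a judgment call that depends on how the departments are actually organized.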

In [None]:
from intelligence_layer.evaluation import InMemoryDatasetRepository, Example

dataset_repository = InMemoryDatasetRepository()

dataset_id = dataset_repository.create_dataset(
    examples=[
        Example(
            input=ClassifyInput(chunk=TextChunk(example["message"]), labels=labels),
            expected_output=example["label"],
        )
        for example in labeled_examples
    ]
)

In [None]:
dataset_id

In [None]:
from dotenv import load_dotenv

from intelligence_layer.evaluation import (
    Evaluator,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    InMemoryAggregationRepository,
    Runner,
    Aggregator,
)
from intelligence_layer.use_cases import (
    SingleLabelClassifyEvaluationLogic,
    SingleLabelClassifyAggregationLogic,
)

load_dotenv()

run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()


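# The evaluator compares each run output with the expected label of its example.
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)
# The aggregator condenses the per-example evaluations into overall statistics.
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)
# The runner applies the classification task to every example in the dataset
# and stores the outputs in the run repository.
runner = Runner(
    prompt_based_classify, dataset_repository, run_repository, "prompt-based-classify"
)
run_overview = runner.run_dataset(dataset_id)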

In [None]:
eval_overview = evaluator.evaluate_runs(run_overview.id)

In [None]:
aggregation_overview = aggregator.aggregate_evaluation(eval_overview.id)
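
To make the aggregated number easier to interpret, here is a plain-Python sketch of how a fraction-correct metric could be computed by hand. It assumes an example counts as correct when its top-scoring label equals the expected label. The scores below are made up, and `top_label` and `fraction_correct` are hypothetical helpers that do not reflect the library's internals.

```python
def top_label(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


def fraction_correct(predictions, expected):
    """Share of examples whose top-scoring label matches the expected label."""
    hits = sum(
        top_label(scores) == label for scores, label in zip(predictions, expected)
    )
    return hits / len(expected)


# Made-up score dictionaries, one per example (illustrative only).
predictions = [
    {"Finance": 0.7, "Customer": 0.3},
    {"Product": 0.2, "Customer": 0.8},
    {"Research": 0.6, "Product": 0.4},
]
expected = ["Finance", "Customer", "Product"]

accuracy = fraction_correct(predictions, expected)
print(f"{accuracy:.2f}")  # 0.67
```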

In [None]:
from intelligence_layer.use_cases import (
    SingleLabelClassifyOutput,
    SingleLabelClassifyEvaluation,
)


overview = [
    {
        "input": example.input,
        "expected_output": example.expected_output,
        "result": next(
            e
            for e in run_repository.example_outputs(
                run_overview.id, SingleLabelClassifyOutput
            )
            if e.example_id == example.id
        ).output,
        "eval": evaluation_repository.example_evaluation(
            evaluation_id=eval_overview.id,
            example_id=example.id,
            evaluation_type=SingleLabelClassifyEvaluation,
        ).result,
    }
    for example in dataset_repository.examples(
        dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
    )
]

In [None]:
[e for e in overview if not e["eval"].correct]
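
When the list of failures gets long, counting how often each (expected, predicted) pair occurs can reveal systematic confusions. The sketch below uses the standard library's `collections.Counter`. The pairs are invented for illustration and are not results from this notebook.

```python
from collections import Counter

# Hypothetical (expected, predicted) pairs for misclassified examples.
failures = [
    ("HR", "People & Culture"),
    ("HR", "People & Culture"),
    ("Sales", "Customer"),
    ("Research", "Infrastructure"),
]

# Count identical pairs and list the most frequent confusion first.
confusions = Counter(failures)
most_common = confusions.most_common(1)[0]

for (expected_label, predicted_label), count in confusions.most_common():
    print(f"{expected_label} -> {predicted_label}: {count}")
```

A pair that dominates the list often points to a labeling problem, such as two names for the same department, rather than to a weakness of the classifier.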

In [None]:
prompt_adjusted_classify_task = PromptBasedClassify(
    instruction="""Identify the department that would be responsible for handling the given request.
Reply with only the department name."""
)

In [None]:
runner_prompt_adjusted = Runner(
    prompt_adjusted_classify_task,
    dataset_repository,
    run_repository,
    "running for adjusted prompt",
)
run_overview_prompt_adjusted = runner_prompt_adjusted.run_dataset(dataset_id)

In [None]:
eval_overview_prompt_adjusted = evaluator.evaluate_runs(run_overview_prompt_adjusted.id)

In [None]:
aggregation_overview_prompt_adjusted = aggregator.aggregate_evaluation(
    eval_overview_prompt_adjusted.id
)
aggregation_overview_prompt_adjusted

In [None]:
overview = [
    {
        "input": example.input,
        "expected_output": example.expected_output,
        "result": next(
            e
            for e in run_repository.example_outputs(
                run_overview_prompt_adjusted.id, SingleLabelClassifyOutput
            )
            if e.example_id == example.id
        ).output,
        "eval": evaluation_repository.example_evaluation(
            evaluation_id=eval_overview_prompt_adjusted.id,
            example_id=example.id,
            evaluation_type=SingleLabelClassifyEvaluation,
        ).result,
    }
    for example in dataset_repository.examples(
        dataset_id=dataset_id, input_type=ClassifyInput, expected_output_type=str
    )
]
[e for e in overview if not e["eval"].correct]
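
Beyond comparing the aggregated scores of the two runs, it can help to see which individual examples changed. The sketch below assumes that the per-example correctness of each run has been collected into a dictionary keyed by example id. The ids and values are invented, and `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline, adjusted):
    """Return ids that were fixed and ids that regressed between two runs."""
    fixed = sorted(k for k in baseline if not baseline[k] and adjusted[k])
    regressed = sorted(k for k in baseline if baseline[k] and not adjusted[k])
    return fixed, regressed


# Hypothetical per-example correctness for the original and the adjusted prompt.
baseline_correct = {"ex-1": True, "ex-2": False, "ex-3": False, "ex-4": True}
adjusted_correct = {"ex-1": True, "ex-2": True, "ex-3": False, "ex-4": False}

fixed, regressed = compare_runs(baseline_correct, adjusted_correct)
print("fixed:", fixed)          # fixed: ['ex-2']
print("regressed:", regressed)  # regressed: ['ex-4']
```

Two prompts with the same overall score can still differ on individual examples, so a per-example comparison like this complements the aggregate.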