# Customer Service Email Classification Agent

This notebook implements a DSPy-based email classifier for customer service tickets.
The agent classifies incoming emails into predefined contact reasons based on the subject and first message.

## 0. Imports

In [54]:
import dspy
import mlflow
import pandas as pd
from typing import Literal
import json
import os
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential

## 1. LM Configuration

Configure the Azure OpenAI language model using DefaultAzureCredential for authentication.
The model follows the LiteLLM provider format: `azure/<deployment-name>`.

In [69]:
# === MLflow Tracing Setup ===
mlflow.dspy.autolog()
mlflow.set_experiment("dspy-email-classifier")

# === Load Environment ===
load_dotenv()

credential = DefaultAzureCredential()
token = credential.get_token("https://cognitiveservices.azure.com/.default").token

# === Configure Language Models ===
# DSPy uses LiteLLM under the hood - for Azure OpenAI use format: azure/<deployment-name>
lm_1 = dspy.LM(
    model="azure/gpt-4.1",
    api_base="https://ai-ecom-data-agent-resource.cognitiveservices.azure.com",
    api_key=token
)

lm_2 = dspy.LM(
    model="azure/gpt-4.1-mini",
    api_base="https://ai-ecom-data-agent-resource.cognitiveservices.azure.com",
    api_key=token,
    cache=False
)

# Set default LM for DSPy modules
dspy.configure(lm=lm_2)

print("✓ MLflow tracing enabled")
print(f"✓ LM configured: {lm_1.model}")
print(f"✓ LM configured: {lm_2.model}")

# Test the connection (will be traced!)
# Note: the second call overwrites `response`, so only lm_2's reply is printed.
response = lm_1("Say 'Hello!' if you can hear me.")
response = lm_2("Say 'Hello!' if you can hear me.")

# Calling an LM returns a list of completions, so [:50] slices the list, not the text.
print(f"✓ Connection test: {response[:50]}...")

✓ MLflow tracing enabled
✓ LM configured: azure/gpt-4.1
✓ LM configured: azure/gpt-4.1-mini
✓ Connection test: ['Hello!']...
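The token above is fetched once and passed as a static `api_key`. Access tokens returned by `get_token` carry an `expires_on` Unix timestamp, so a long session may need to rebuild the LM objects with a fresh token. A minimal sketch of such a check, using only the standard library; `token_needs_refresh` and the 5-minute margin are hypothetical choices, not part of this notebook:

```python
import time
from typing import Optional

def token_needs_refresh(expires_on: float, margin_seconds: int = 300,
                        now: Optional[float] = None) -> bool:
    """Return True when the token expires within `margin_seconds`."""
    current = time.time() if now is None else now
    return expires_on - current <= margin_seconds

# Fixed timestamps keep the example deterministic.
issued_at = 1_700_000_000
expires_on = issued_at + 3600  # assumed one-hour lifetime, for illustration only

print(token_needs_refresh(expires_on, now=issued_at))         # False
print(token_needs_refresh(expires_on, now=issued_at + 3400))  # True
```

In this notebook the check could be fed with `credential.get_token(...).expires_on` before starting a long evaluation run.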


In [68]:
lm_2("hello, which model provider is the best?")

['Hello! There isn\'t a single "best" model provider, as it really depends on your specific needs and use case. Different providers excel in different areas such as natural language understanding, image recognition, customization options, pricing, or integration capabilities. Here are some popular model providers and what they’re known for:\n\n- **OpenAI**: Known for advanced language models like GPT-4, very strong in natural language understanding and generation, widely used for chatbots, content creation, and coding assistance.\n- **Google (Vertex AI, PaLM)**: Offers powerful models with strong integration into Google Cloud services, excellent for scalable enterprise solutions.\n- **Microsoft (Azure AI, OpenAI partnership)**: Provides access to OpenAI models with enterprise-grade security, plus additional Azure AI tools.\n- **Anthropic**: Focuses on developing safe and steerable AI models.\n- **Cohere**: Emphasizes custom NLP models and ease of integration.\n- **Hugging Face**: Not

In [62]:
lm_1("hello, which model provider is the best?")

['Hello! The "best" model provider depends on what you need—there isn’t a single answer for everyone. Here’s a quick overview of major AI model providers and their strengths as of 2024:\n\n### 1. **OpenAI (makers of ChatGPT, GPT-4 and GPT-4o)**\n   - **Strengths:** State-of-the-art language models, high accuracy, conversational depth, great tooling (APIs, plugins), reliable ethics guardrails.\n   - **Best for:** General-purpose chatbots, creative writing, coding help, research, and enterprise applications.\n\n### 2. **Google (Gemini, formerly Bard)**\n   - **Strengths:** Integrates well with Google ecosystem, powerful with factual retrieval, strong in reasoning tasks, leading-edge research.\n   - **Best for:** Web search integration, summarization, and real-time information tasks.\n\n### 3. **Anthropic (Claude 3 series)**\n   - **Strengths:** Advanced safety, very long context windows (can process large documents), transparent model behavior.\n   - **Best for:** Business/enterpri

## 2. Load Labels

Load the contact reason labels from `labels.json`. These define the possible classification categories.

In [14]:
with open("labels.json", encoding="utf-8") as f:
    config = json.load(f)

LABELS = config["labels"]
CONTACT_REASONS = list(LABELS.keys())

print(f"✓ {len(CONTACT_REASONS)} contact reasons loaded:")
for reason in CONTACT_REASONS:
    print(f"  - {reason}: {LABELS[reason]['description'][:50]}...")

✓ 12 contact reasons loaded:
  - Order Delay: Klant vraagt naar verzendstatus, track & trace upd...
  - Lost Order: Pakket staat als afgeleverd maar klant geeft aan h...
  - Return Order: Klant wil een product retourneren voor terugbetali...
  - Cancel Order: Klant wil een bestelling annuleren voordat deze is...
  - Damaged Item: Product is kapot, gebarsten, gescheurd of beschadi...
  - Bad Product Quality: Product is defect, werkt niet zoals verwacht, of k...
  - Wrong Order: Klant heeft verkeerd product ontvangen, verkeerde ...
  - Missing Item: Bestelling is aangekomen maar één of meerdere arti...
  - Product Question: Vragen over productspecificaties, eigenschappen, c...
  - Shipping Question: Algemene vragen over verzendopties, kosten, levert...
  - Special Request: Aangepaste verzoeken zoals cadeauverpakking, speci...
  - Other: Algemene vragen die niet in andere categorieën pas...
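The loading code implies that `labels.json` has a top-level `labels` object that maps each contact reason to an object with a `description` key. A minimal sketch of that shape with a small validation step; the inline JSON is an illustrative two-entry excerpt (the real file holds 12 labels), and `validate_labels` is a hypothetical helper:

```python
import json

# Illustrative excerpt; the structure is inferred from how the notebook reads the file.
raw = """
{
  "labels": {
    "Order Delay": {"description": "Klant vraagt naar verzendstatus, track & trace updates, of vertraagde levering."},
    "Cancel Order": {"description": "Klant wil een bestelling annuleren voordat deze is verzonden of geleverd."}
  }
}
"""

def validate_labels(config: dict) -> list:
    """Check the assumed structure and return the label names in file order."""
    labels = config["labels"]
    for name, info in labels.items():
        description = info.get("description")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"Label {name!r} has no usable description")
    return list(labels)

contact_reasons = validate_labels(json.loads(raw))
print(contact_reasons)  # ['Order Delay', 'Cancel Order']
```

Failing early here avoids an empty or missing description silently ending up in the prompt.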


In [17]:
def build_label_descriptions() -> str:
    """Build label descriptions for the signature."""
    return "\n".join([
        f"- {key}: {info['description']}" 
        for key, info in LABELS.items()
    ])

# Preview the label descriptions
print(build_label_descriptions())

- Order Delay: Klant vraagt naar verzendstatus, track & trace updates, of vertraagde levering. Bestelling is niet binnen de verwachte termijn aangekomen.
- Lost Order: Pakket staat als afgeleverd maar klant geeft aan het niet ontvangen te hebben, of tracking toont langere tijd geen updates. Bestelling lijkt verloren tijdens transport.
- Return Order: Klant wil een product retourneren voor terugbetaling of omruiling. Kan ook vragen over retourlabels of retourbeleid bevatten.
- Cancel Order: Klant wil een bestelling annuleren voordat deze is verzonden of geleverd.
- Damaged Item: Product is kapot, gebarsten, gescheurd of beschadigd aangekomen tijdens verzending. Zichtbare fysieke schade aan het artikel.
- Bad Product Quality: Product is defect, werkt niet zoals verwacht, of kwaliteit komt niet overeen met beschrijving/verwachtingen. Niet beschadigd tijdens verzending maar inherent gebrekkig.
- Wrong Order: Klant heeft verkeerd product ontvangen, verkeerde kleur, verkeerde maat, of verkee

## 3. Signature Definition

Define a DSPy Signature for the classification task. The signature specifies:
- **Inputs**: `subject` and `first_message` from the customer email
- **Output**: `contact_reason` - one of the predefined categories (using `Literal` type)

In [18]:
# Contact-reason signature class, inheriting from dspy.Signature
class ContactReasonSignature(dspy.Signature):
    """Classify a customer email to the most appropriate contact reason."""
    
    subject: str = dspy.InputField(desc="The email subject line")
    first_message: str = dspy.InputField(desc="The first customer message")
    
    # Open question: how many input and output fields are possible or best practice?
    # Examples exist with several output fields: result, confidence score, reasoning process, etc.
    
    contact_reason: Literal[tuple(CONTACT_REASONS)] = dspy.OutputField(
        # Description listing the labels and their descriptions
        desc=f"The category that best matches the customer's issue.\n\nCategories:\n{build_label_descriptions()}"
    )
    
# Everything in the signature (docstring, field descriptions, type annotations) ends up in the LLM prompt in a structured form.
print(f"✓ Signature defined with {len(CONTACT_REASONS)} possible output categories")

✓ Signature defined with 12 possible output categories
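The annotation `Literal[tuple(CONTACT_REASONS)]` builds the output type at runtime from the loaded label list. A small sketch of what that expression produces, using only `typing`; the two-item list is a stand-in for the real twelve labels:

```python
from typing import Literal, get_args

reasons = ["Order Delay", "Lost Order"]  # stand-in for CONTACT_REASONS

# Subscripting Literal with a tuple is equivalent to listing the values one by one.
ContactReason = Literal[tuple(reasons)]

allowed = get_args(ContactReason)
print(allowed)                                                # ('Order Delay', 'Lost Order')
print(ContactReason == Literal["Order Delay", "Lost Order"])  # True
```

Static type checkers cannot evaluate this dynamic form, but it is valid at runtime, which is what matters for building the prompt.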


## 4. Classifier Module

Create a DSPy Module that wraps the signature with a predictor. The `forward` method defines how inputs flow through the module.

In [6]:
class EmailClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predictor = dspy.Predict(ContactReasonSignature)

    def forward(self, subject: str, first_message: str):
        return self.predictor(subject=subject, first_message=first_message)

# Initialize the classifier
classifier = EmailClassifier()
print("✓ EmailClassifier module initialized")

✓ EmailClassifier module initialized
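To make the call flow concrete without an LM: calling a module instance runs its `forward` method. The sketch below imitates that pattern in plain Python. `MiniModule` and `KeywordClassifier` are toy stand-ins invented for illustration, not DSPy classes, and the keyword rule is not how the real classifier decides:

```python
from types import SimpleNamespace

class MiniModule:
    """Toy stand-in for a module base class: calling the instance runs forward()."""
    def __call__(self, **kwargs):
        return self.forward(**kwargs)

class KeywordClassifier(MiniModule):
    """Hypothetical rule-based predictor, used only to show the call flow."""
    def forward(self, subject: str, first_message: str):
        text = f"{subject} {first_message}".lower()
        label = "Cancel Order" if "annuleren" in text else "Other"
        # Return an object with a contact_reason attribute, like a prediction.
        return SimpleNamespace(contact_reason=label)

toy = KeywordClassifier()
toy_pred = toy(subject="Bestelling #20069",
               first_message="Graag wil ik bestelling 20069 annuleren")
print(toy_pred.contact_reason)  # Cancel Order
```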


## 5. Dataset Preparation

Utility function to convert a pandas DataFrame into DSPy Examples for training/evaluation.

In [7]:
def prepare_dataset(df: pd.DataFrame):
    """Convert DataFrame to DSPy Examples."""
    dataset = []
    for _, row in df.iterrows():
        example = dspy.Example(
            subject=row['subject'] or "",
            first_message=row['first_message'] or "",
            contact_reason=row['contact_reason']
        ).with_inputs('subject', 'first_message')
        dataset.append(example)
    return dataset

print("✓ prepare_dataset function defined")

✓ prepare_dataset function defined
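One caveat with the `row['subject'] or ""` fallback: it catches `None` and empty strings, but a missing value stored as a float `NaN` is truthy and would pass through unchanged. A minimal sketch of a stricter helper; `clean_text` is a hypothetical name, not part of this notebook:

```python
import math

def clean_text(value) -> str:
    """Return a string, mapping None and float NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

nan = float("nan")
print(repr(nan or ""))                   # nan  (NaN is truthy, so `or` keeps it)
print(repr(clean_text(nan)))             # ''
print(repr(clean_text(None)))            # ''
print(repr(clean_text("Retour 19558")))  # 'Retour 19558'
```

Whether this matters depends on how the parquet file encodes missing text, so it is worth checking the columns for nulls first.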


## 6. Load Dataset

Load the cleaned ticket dataset and prepare it for evaluation.

In [8]:
# Load the dataset
df = pd.read_parquet("../data/ticket_details_clean.parquet")
print(f"✓ {len(df)} tickets loaded")

# Display sample
df.head()

✓ 333 tickets loaded


Unnamed: 0,ticket_id,subject,message_count,customer_name,first_message,first_message_from_agent,tags,contact_reason,ai_intent
1,38556826,38556826: Re: Jouw bestelling is verzonden!,1,Joanneke Duitman,"Goedemorgen, \n\nMeer dan een week geleden is ...",False,"[{'decoration': {'color': '#84db2d'}, 'id': 14...",Special Request,Order::Status::Other
14,38497638,38497638: Hondenjas past niet,3,Eva Jansen,Ik heb van jullie een hondenjas ontvangen en b...,False,"[{'decoration': {'color': '#27a74c'}, 'id': 14...",Return Order,Exchange::Request::Other
22,38473261,38473261: Klacht,3,Dunja Bruijnes,"Geachte heer/mevrouw, \nOnlangs heb ik bij u e...",False,"[{'decoration': {'color': '#7b825b'}, 'id': 18...",Order Delay,Order::Refund::Other
31,38455822,38455822: Bestelling #20069,3,Marieke van Buren,"Beste heer/mevrouw, \n\nGraag wil ik bestellin...",False,"[{'decoration': {'color': '#c5a608'}, 'id': 14...",Cancel Order,Order::Cancel::Other
32,38454445,38454445: Bestelling hondentuig,3,Ria van Buren,"Goedemiddag,\r\n\r\nIk heb twee tuigjes bestel...",False,"[{'decoration': {'color': '#7b825b'}, 'id': 18...",Order Delay,Order::Edit::Other


In [None]:
# Prepare the dataset for DSPy
df_prepared = prepare_dataset(df)
print(f"✓ {len(df_prepared)} examples prepared for DSPy")

✓ 333 examples prepared for DSPy


In [18]:
pred = classifier(
    subject=df_prepared[2].subject,
    first_message=df_prepared[2].first_message
)

print(f"Predicted contact reason: {pred.contact_reason}")

Predicted contact reason: Bad Product Quality


## Batch Classification

Classify the entire dataset and save predictions to a file.

In [19]:
from dspy.evaluate import Evaluate

# Define metric for classification
def classification_metric(example, pred, trace=None):
    """Return True if the prediction matches the label, False otherwise."""
    return example.contact_reason == pred.contact_reason

# Set up the evaluator (DSPy best practice)
evaluator = Evaluate(
    devset=df_prepared,
    metric=classification_metric,
    num_threads=4,  # Parallel evaluation
    display_progress=True,
    display_table=10  # Show first 10 results in table
)

# Run evaluation
eval_result = evaluator(classifier)

print("\n✓ Evaluation complete!")
print(f"  Accuracy: {eval_result.score:.1f}%")

Average Metric: 260.00 / 333 (78.1%): 100%|██████████| 333/333 [09:10<00:00,  1.65s/it]

2026/01/28 07:16:58 INFO dspy.evaluate.evaluate: Average Metric: 260 / 333 (78.1%)





Unnamed: 0,subject,first_message,example_contact_reason,pred_contact_reason,classification_metric
0,38556826: Re: Jouw bestelling is verzonden!,"Goedemorgen, Meer dan een week geleden is de deurmat die we bij ju...",Special Request,Order Delay,✔️ [False]
1,38497638: Hondenjas past niet,Ik heb van jullie een hondenjas ontvangen en betaald . Het past he...,Return Order,Return Order,✔️ [True]
2,38473261: Klacht,"Geachte heer/mevrouw, Onlangs heb ik bij u een bestelling geplaats...",Order Delay,Bad Product Quality,✔️ [False]
3,38455822: Bestelling #20069,"Beste heer/mevrouw, \n\nGraag wil ik bestelling 20069 annuleren \n...",Cancel Order,Cancel Order,✔️ [True]
4,38454445: Bestelling hondentuig,"Goedemiddag, Ik heb twee tuigjes besteld voor onze honden Tommie e...",Order Delay,Other,✔️ [False]
5,38452192: Re: Belangrijk nieuws over je bestelling #19984,Ik heb nog niks mogen ontvangen Op 23 jan 2026 09:32 schreef Info ...,Order Delay,Order Delay,✔️ [True]
6,38440061: Re: Jouw bestelling is verzonden!,Hallo waar blijf mij bestelling staat al sins afgelopen zonda als ...,Order Delay,Order Delay,✔️ [True]
7,38438330: Nieuw klantbericht op 23 januari 2026 om 10:40,Nieuw klantbericht op 23 januari 2026 om 10:40 Je hebt een nieuw b...,Return Order,Wrong Order,✔️ [False]
8,38427070: Retour 19558,"Hoi, Ik zou graag mijn bestelling 19558 retourneren, maar ook inee...",Return Order,Return Order,✔️ [True]
9,38338984: Retour,Ik wil de riem retour sturen omdat hij veel te grof is. Dit is nie...,Return Order,Return Order,✔️ [True]

✓ Evaluation complete!
  Accuracy: 78.1%
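The reported score follows directly from the metric: each example contributes `True` (1) or `False` (0), and the progress line shows their sum over the number of examples. A quick check of the arithmetic with the figures from the run above:

```python
matches = 260  # examples where the prediction equals the label
total = 333    # examples evaluated

accuracy_pct = 100 * matches / total
mismatches = total - matches

print(f"{accuracy_pct:.1f}%")  # 78.1%
print(mismatches)              # 73
```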


In [20]:
# Convert evaluation results to DataFrame for analysis
# eval_result.results contains: [(example, prediction, score), ...]

results_data = []
for example, pred, score in eval_result.results:
    results_data.append({
        "subject": example.subject,
        "first_message": example.first_message[:200],  # Truncate for readability
        "actual_label": example.contact_reason,
        "predicted_label": pred.contact_reason,
        "match": bool(score),
        "potential_mislabel": not score  # Where model disagrees with human
    })

df_results = pd.DataFrame(results_data)

# Save all results
output_path = "../data/classification_results"
df_results.to_parquet(f"{output_path}.parquet", index=False)
df_results.to_csv(f"{output_path}.csv", index=False)

print(f"✓ Results saved to {output_path}.parquet/.csv")
print("\n📊 Summary:")
print(f"  Total tickets: {len(df_results)}")
print(f"  Matches: {df_results['match'].sum()}")
print(f"  Discrepancies: {(~df_results['match']).sum()} ← Review these for potential mislabels!")

✓ Results saved to ../data/classification_results.parquet/.csv

📊 Summary:
  Total tickets: 333
  Matches: 260
  Discrepancies: 73 ← Review these for potential mislabels!


## Review Potential Mislabels

Show tickets where the model's prediction differs from the human label.
These are candidates for manual review: either the model is wrong, or the agent mislabeled the ticket.

In [21]:
# Filter discrepancies for manual review
df_discrepancies = df_results[df_results["potential_mislabel"]].copy()

print(f"🔍 {len(df_discrepancies)} tickets to review:\n")

# Show discrepancies grouped by actual vs predicted
confusion = df_discrepancies.groupby(["actual_label", "predicted_label"]).size().reset_index(name="count")
confusion = confusion.sort_values("count", ascending=False)
print("Confusion patterns (actual → predicted):")
print(confusion.to_string(index=False))

# Display sample discrepancies for manual review
print("\n" + "="*80)
print("Sample tickets to review:")
print("="*80)
for i, row in df_discrepancies.head(5).iterrows():
    print(f"\n📧 Subject: {row['subject']}")
    print(f"   Message: {row['first_message'][:150]}...")
    print(f"   👤 Agent labeled: {row['actual_label']}")
    print(f"   🤖 Model suggests: {row['predicted_label']}")

🔍 73 tickets to review:

Confusion patterns (actual → predicted):
     actual_label     predicted_label  count
      Order Delay          Lost Order      6
      Order Delay               Other      5
      Order Delay    Product Question      5
            Other         Order Delay      5
     Return Order Bad Product Quality      5
     Return Order         Order Delay      4
      Order Delay   Shipping Question      4
      Order Delay     Special Request      3
     Return Order         Wrong Order      3
      Order Delay         Wrong Order      3
      Order Delay        Missing Item      3
      Order Delay        Cancel Order      3
     Cancel Order         Order Delay      3
     Damaged Item Bad Product Quality      3
      Order Delay        Return Order      2
      Wrong Order        Return Order      2
      Order Delay Bad Product Quality      1
       Lost Order         Order Delay      1
     Cancel Order               Other      1
     Cancel Order   Shipping 
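The discrepancy counts alone do not show how often each label is predicted correctly. A sketch of a per-label accuracy breakdown using only the standard library, fed with the ten (actual, predicted) pairs from the evaluation table shown earlier; `per_label_accuracy` is a hypothetical helper, and on the full results the same breakdown could be computed with a pandas `groupby`:

```python
from collections import Counter

# (actual, predicted) pairs from the first ten rows of the evaluation table
pairs = [
    ("Special Request", "Order Delay"),
    ("Return Order", "Return Order"),
    ("Order Delay", "Bad Product Quality"),
    ("Cancel Order", "Cancel Order"),
    ("Order Delay", "Other"),
    ("Order Delay", "Order Delay"),
    ("Order Delay", "Order Delay"),
    ("Return Order", "Wrong Order"),
    ("Return Order", "Return Order"),
    ("Return Order", "Return Order"),
]

def per_label_accuracy(pairs):
    """Map each actual label to the share of its examples predicted correctly."""
    totals = Counter(actual for actual, _ in pairs)
    correct = Counter(actual for actual, predicted in pairs if actual == predicted)
    return {label: correct[label] / totals[label] for label in totals}

label_accuracy = per_label_accuracy(pairs)
for label, acc in label_accuracy.items():
    print(f"{label}: {acc:.2f}")
```

Ten rows are far too few to draw conclusions from; the point is only the shape of the calculation.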

## 7. Test Classification

Test the classifier with a sample email.

In [18]:
# Test with a realistic Order Delay example
pred = classifier(
    subject="Re: Belangrijk nieuws over je bestelling #19984",
    first_message="Ik heb nog niks mogen ontvangen"
)

print(f"Subject: Re: Belangrijk nieuws over je bestelling #19984")
print(f"Message: Ik heb nog niks mogen ontvangen")
print(f"\n→ Predicted contact reason: {pred.contact_reason}")

Subject: Re: Belangrijk nieuws over je bestelling #19984
Message: Ik heb nog niks mogen ontvangen

→ Predicted contact reason: Order Delay


## 8. Inspect LM History

Use `dspy.inspect_history()` to see the prompts sent to the LM and the responses received.

In [19]:
# Inspect the last LM call to see the prompt and response
dspy.inspect_history(n=1)





[2026-01-25T17:10:03.113096]

System message:

Your input fields are:
1. `subject` (str): The email subject line
2. `first_message` (str): The first customer message
Your output fields are:
1. `contact_reason` (Literal['Order Delay', 'Lost Order', 'Return Order', 'Cancel Order', 'Damaged Item', 'Bad Product Quality', 'Wrong Order', 'Missing Item', 'Product Question', 'Shipping Question', 'Special Request', 'Other']): The category that best matches the customer's issue.

Categories:
- Order Delay: Klant vraagt naar verzendstatus, track & trace updates, of vertraagde levering. Bestelling is niet binnen de verwachte termijn aangekomen.
- Lost Order: Pakket staat als afgeleverd maar klant geeft aan het niet ontvangen te hebben, of tracking toont langere tijd geen updates. Bestelling lijkt verloren tijdens transport.
- Return Order: Klant wil een product retourneren voor terugbetaling of omruiling. Kan ook vragen over retourlabels of retourbeleid bevatten.
- Cancel Ord

In [24]:
# Test Predict
# Configure DSPy with the right LM
dspy.configure(lm=dspy.LM(
    model="azure/gpt-4.1-mini",  # or "azure/gpt-4.1" - pick one
    api_base="https://ai-ecom-data-agent-resource.cognitiveservices.azure.com",
    api_key=token
))
               
class QASignature(dspy.Signature):
    """Answer customer questions based on provided context."""
    
    question: str = dspy.InputField(desc="The customer's question")
    
    answer: str = dspy.OutputField(
        desc="The answer to the customer's question based on the provided context."
    )

module = dspy.Predict(QASignature)

response = module(question="What is the capital city of spain?")

print(f"Question: What is the capital city of spain?")
print(f"Answer: {response.answer}")

Question: What is the capital city of spain?
Answer: The capital city of Spain is Madrid.


In [25]:
# Create a module with a signature

# ChainOfThought still uses Predict to produce the final result; the ChainOfThought module
# adds reasoning steps to the prompt that is sent to the LLM
cot_module = dspy.ChainOfThought(QASignature)
 
# Call it like a function
cot_result = cot_module(question="What is 2 x 450?")
 
# Access outputs by name
print(cot_result.reasoning)
print(cot_result.answer)

# It takes a Signature, a DSPy abstraction describing input/output fields.
# It prepends a "reasoning" field (like an internal reasoning text) to that signature.
# It builds a Predict module using that new signature.
# When called, it returns the model's prediction (including reasoning).
# This is basically the "reasoning-enabled" version of a normal Predict module.

To find 2 times 450, multiply 2 by 450. Multiplying these gives 900.
2 x 450 = 900.


In [33]:
import httpx

def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    # First, geocode the city name to coordinates
    geo_url = f"https://geocoding-api.open-meteo.com/v1/search?name={city}&count=1"
    geo_response = httpx.get(geo_url).json()
    
    if not geo_response.get("results"):
        return f"Could not find city: {city}"
    
    lat = geo_response["results"][0]["latitude"]
    lon = geo_response["results"][0]["longitude"]
    name = geo_response["results"][0]["name"]
    
    # Get current weather
    weather_url = f"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}&current=temperature_2m,weather_code"
    weather = httpx.get(weather_url).json()
    
    temp = weather["current"]["temperature_2m"]
    return f"The weather in {name} is {temp}°C"
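`get_weather` interpolates the city name straight into the query string, which leaves escaping of spaces, `&`, and non-ASCII characters to the HTTP client. A sketch of building the same geocoding URL with `urllib.parse.urlencode`; `build_geo_url` is a hypothetical helper, and the endpoint is the one used above:

```python
from urllib.parse import urlencode

GEO_ENDPOINT = "https://geocoding-api.open-meteo.com/v1/search"

def build_geo_url(city: str) -> str:
    """Build the geocoding URL with the city name percent-encoded."""
    query = urlencode({"name": city, "count": 1})
    return f"{GEO_ENDPOINT}?{query}"

print(build_geo_url("Den Haag"))
# https://geocoding-api.open-meteo.com/v1/search?name=Den+Haag&count=1
```

Alternatively, `httpx.get` accepts a `params` dict and performs this encoding itself, which avoids assembling the URL by hand.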
 

In [35]:
# Create a ReAct agent
react_agent = dspy.ReAct(
    signature=QASignature,
    tools=[get_weather],
    max_iters=5 # max times to run in a loop
)
 
# Use the agent
result = react_agent(question="What's the weather like in Tarragona?")
print(result.answer)
print("Tool calls made:", result.trajectory)

The current weather in Tarragona is 10.4°C.
Tool calls made: {'thought_0': 'To provide the current weather in Tarragona, I need to fetch the latest weather data for that city.', 'tool_name_0': 'get_weather', 'tool_args_0': {'city': 'Tarragona'}, 'observation_0': 'The weather in Tarragona is 10.4°C', 'thought_1': "I have obtained the current temperature in Tarragona. Since the question only asks about the weather condition generally, I'll conclude the task by providing this information.", 'tool_name_1': 'finish', 'tool_args_1': {}, 'observation_1': 'Completed.'}
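As the output above shows, the trajectory is a flat dict whose keys carry a step index (`thought_0`, `tool_name_0`, and so on). A sketch that regroups it into one record per step; `trajectory_steps` is a hypothetical helper, and the sample dict is copied from the output above:

```python
def trajectory_steps(trajectory: dict) -> list:
    """Group the indexed keys of a trajectory dict into one record per step."""
    steps = []
    i = 0
    while f"tool_name_{i}" in trajectory:
        steps.append({
            "thought": trajectory.get(f"thought_{i}"),
            "tool": trajectory[f"tool_name_{i}"],
            "args": trajectory.get(f"tool_args_{i}", {}),
            "observation": trajectory.get(f"observation_{i}"),
        })
        i += 1
    return steps

sample = {
    "thought_0": "To provide the current weather in Tarragona, I need to fetch the latest weather data for that city.",
    "tool_name_0": "get_weather",
    "tool_args_0": {"city": "Tarragona"},
    "observation_0": "The weather in Tarragona is 10.4°C",
    "thought_1": "I have obtained the current temperature in Tarragona. Since the question only asks about the weather condition generally, I'll conclude the task by providing this information.",
    "tool_name_1": "finish",
    "tool_args_1": {},
    "observation_1": "Completed.",
}

steps = trajectory_steps(sample)
for step in steps:
    print(step["tool"], step["args"])
# get_weather {'city': 'Tarragona'}
# finish {}
```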
