# Multi-Agent Customer Support Routing System

This notebook implements an intelligent customer support system that:
- Automatically classifies incoming queries by department
- Routes queries to specialized RAG agents
- Provides accurate answers grounded in company documentation
- Maintains full observability with Langfuse tracing
- Evaluates response quality automatically

## Architecture
```
User Query → Orchestrator (Classification) → Specialized Agent (RAG) → Response
                                            ↓
                                    Evaluator (Quality Check)
                                            ↓
                                    Langfuse (Observability)
```
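
The flow in the diagram can be sketched in plain Python. This is a minimal, hypothetical outline, not the code in `src.agents`: `classify`, `AGENTS`, `evaluate`, and this `process_query` are stand-in names, and the keyword rules replace the LLM-based classifier purely for illustration. The `query`, `classification`, and `answer` keys mirror those used later in this notebook; the `score` key is an addition for this sketch.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-ins for the specialized agents: each maps a query to an answer.
AGENTS: Dict[str, Callable[[str], str]] = {
    "HR": lambda q: f"[HR agent] answer to: {q}",
    "IT": lambda q: f"[IT agent] answer to: {q}",
    "Finance": lambda q: f"[Finance agent] answer to: {q}",
}

# Illustrative keyword rules; the real orchestrator uses an LLM classifier.
KEYWORDS = {
    "HR": {"pto", "leave", "benefits"},
    "IT": {"laptop", "vpn", "password"},
    "Finance": {"expense", "reimbursement", "invoice"},
}


def classify(query: str) -> str:
    """Return the first department with a keyword among the query's tokens."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    for department, words in KEYWORDS.items():
        if tokens & words:
            return department
    return "HR"  # arbitrary default for this sketch


def evaluate(answer: str) -> int:
    """Placeholder quality check: non-empty answers get a fixed top score."""
    return 10 if answer.strip() else 1


def process_query(query: str) -> dict:
    """Classify, route to the matching agent, then score the answer."""
    department = classify(query)
    answer = AGENTS[department](query)
    return {
        "query": query,
        "classification": {"department": department},
        "answer": answer,
        "score": evaluate(answer),
    }


result = process_query("I forgot my VPN password. How can I reset it?")
print(result["classification"]["department"])  # IT
```

Matching on whole tokens rather than substrings matters even in a toy classifier: a substring test would find "pto" inside "laptop" and misroute a hardware question to HR.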

## 1. Setup & Imports

First, let's import all necessary libraries and set up our environment.

In [1]:
# Standard library imports
import os
import json
from pathlib import Path
from dotenv import load_dotenv

# LangChain imports
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Langfuse imports
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

# Our custom agents
from src.agents import (
    HRAgent,
    TechAgent,
    FinanceAgent,
    OrchestratorAgent,
    EvaluatorAgent
)

print("✅ All imports successful!")

✅ All imports successful!


In [2]:
# Load environment variables from .env file
load_dotenv()

# Verify API keys are set
assert os.getenv("OPENROUTER_API_KEY"), "OPENROUTER_API_KEY not found in environment variables"
assert os.getenv("LANGFUSE_PUBLIC_KEY"), "LANGFUSE_PUBLIC_KEY not found in environment variables"
assert os.getenv("LANGFUSE_SECRET_KEY"), "LANGFUSE_SECRET_KEY not found in environment variables"

print("✅ Environment variables loaded successfully!")
print(f"   OpenRouter API Key: {os.getenv('OPENROUTER_API_KEY')[:8]}...")
print(f"   Langfuse Public Key: {os.getenv('LANGFUSE_PUBLIC_KEY')[:15]}...")

✅ Environment variables loaded successfully!
   OpenRouter API Key: sk-or-v1...
   Langfuse Public Key: pk-lf-9cace04a-...


## 2. Document Loading & Vector Stores

We'll load company documentation for each department and create vector stores for retrieval.
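
Before documents can be embedded, they are split into overlapping chunks. The sketch below is a simplified fixed-size character splitter, shown only to illustrate what a chunk size and a chunk overlap mean. It is not `RecursiveCharacterTextSplitter`, which additionally prefers to break on separators such as paragraph and line breaks. The function name `split_with_overlap` and the sizes used are assumptions for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 40, chunk_overlap: int = 10) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "Full-time employees accrue PTO based on years of service. " * 3
chunks = split_with_overlap(sample, chunk_size=40, chunk_overlap=10)
print(len(chunks), "chunks")
for chunk in chunks[:3]:
    print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, at the cost of storing some text twice.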

In [3]:
# Verify document directories exist
data_dir = Path("data")
hr_docs = data_dir / "hr_docs"
tech_docs = data_dir / "tech_docs"
finance_docs = data_dir / "finance_docs"

print("📁 Document directories:")
print(f"   HR docs: {len(list(hr_docs.glob('*.txt')))} files")
print(f"   Tech docs: {len(list(tech_docs.glob('*.txt')))} files")
print(f"   Finance docs: {len(list(finance_docs.glob('*.txt')))} files")
print("\n✅ Document directories verified!")

📁 Document directories:
   HR docs: 4 files
   Tech docs: 3 files
   Finance docs: 4 files

✅ Document directories verified!


### Initialize Langfuse for Observability

Langfuse provides complete tracing and monitoring of our multi-agent system.

In [4]:
# Initialize Langfuse client
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
    host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com")
)

# Create callback handler for tracing
# Note: In newer langfuse.langchain, the CallbackHandler uses the global Langfuse client
langfuse_handler = CallbackHandler(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY")
)

print("✅ Langfuse initialized!")
print("   You can view traces at: https://cloud.langfuse.com")

✅ Langfuse initialized!
   You can view traces at: https://cloud.langfuse.com


## 3. Initialize Specialized RAG Agents

Each agent specializes in a specific domain (HR, IT, Finance) with its own documentation.
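
Conceptually, each RAG agent retrieves the chunks most relevant to the query and asks the model to answer only from them. The following is a toy sketch of that retrieve-then-prompt step. It scores chunks by word overlap instead of embedding similarity, and `retrieve` and `build_prompt` are hypothetical helpers, not methods of the agents in `src.agents`. The sample chunks are adapted from answers shown later in this notebook.

```python
import re
from typing import List


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks sharing the most tokens with the query (toy similarity)."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble a prompt that tells the model to answer from the context only."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


chunks = [
    "0-2 years of service: 15 days (120 hours) of PTO per year",
    "Hold down the power button for 10 seconds to perform a hard reset.",
    "Airfare for business trips (economy class) is an eligible expense.",
]
question = "How many PTO days do I get per year?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Keeping one document collection per department, as this notebook does, narrows the search space so that an HR question is never answered from finance documents.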

In [5]:
print("🚀 Initializing HR Agent...")
hr_agent = HRAgent(langfuse_handler=langfuse_handler)
hr_agent.initialize(docs_path="data/hr_docs")
print("\n" + "="*80)

🚀 Initializing HR Agent...



In [6]:
print("🚀 Initializing Tech/IT Agent...")
tech_agent = TechAgent(langfuse_handler=langfuse_handler)
tech_agent.initialize(docs_path="data/tech_docs")
print("\n" + "="*80)

🚀 Initializing Tech/IT Agent...



In [7]:
print("🚀 Initializing Finance Agent...")
finance_agent = FinanceAgent(langfuse_handler=langfuse_handler)
finance_agent.initialize(docs_path="data/finance_docs")
print("\n" + "="*80)

🚀 Initializing Finance Agent...



## 4. Initialize Orchestrator Agent

The orchestrator classifies queries and routes them to the appropriate specialized agent.
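
One way to picture the routing step is a lookup from the predicted department to an agent, with a guard for unknown labels or low confidence. The sketch below assumes a `Classification` structure with the `department`, `confidence`, and `reasoning` fields that appear in the notebook's output. The `route` function, the `min_confidence` threshold, and the fallback message are hypothetical and may differ from what `OrchestratorAgent` does.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Classification:
    department: str    # "HR", "IT", or "Finance"
    confidence: float  # 0.0 to 1.0
    reasoning: str


def route(
    classification: Classification,
    agents: Dict[str, Callable[[str], str]],
    query: str,
    min_confidence: float = 0.5,  # assumed threshold, not taken from the notebook
) -> str:
    """Dispatch the query to the agent for the predicted department."""
    agent = agents.get(classification.department)
    if agent is None or classification.confidence < min_confidence:
        # Hypothetical fallback: ask for clarification instead of guessing.
        return "Could you clarify whether this is an HR, IT, or Finance question?"
    return agent(query)


agents = {
    "HR": lambda q: "HR agent handles: " + q,
    "IT": lambda q: "IT agent handles: " + q,
    "Finance": lambda q: "Finance agent handles: " + q,
}
label = Classification("IT", 0.95, "Technical issue with a VPN password.")
print(route(label, agents, "I forgot my VPN password. How can I reset it?"))
```

A threshold like this gives ambiguous or multi-department queries an explicit path instead of silently sending them to whichever label scored highest.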

In [8]:
print("🎯 Initializing Orchestrator Agent...")
orchestrator = OrchestratorAgent(
    hr_agent=hr_agent,
    tech_agent=tech_agent,
    finance_agent=finance_agent,
    langfuse_handler=langfuse_handler
)
print("\n✅ Orchestrator ready!")
print("   Can route queries to: HR, IT, Finance")

🎯 Initializing Orchestrator Agent...

✅ Orchestrator ready!
   Can route queries to: HR, IT, Finance


## 5. Initialize Evaluator Agent (BONUS)

The evaluator assesses response quality on multiple dimensions and logs scores to Langfuse.
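
The evaluation result can be thought of as a small record with one score per dimension. The sketch below mirrors the attribute names used later in this notebook (`relevance_score`, `completeness_score`, `accuracy_score`, `clarity_score`, `overall_score`). Computing the overall score as a rounded mean is an assumption made for illustration; how the actual `EvaluatorAgent` derives its scores is not shown in this notebook.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Evaluation:
    relevance_score: int
    completeness_score: int
    accuracy_score: int
    clarity_score: int

    def __post_init__(self) -> None:
        # Reject scores outside the 1-10 scale used in this notebook.
        for name, value in vars(self).items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def overall_score(self) -> int:
        """Assumed aggregation: the mean of the four dimensions, rounded."""
        return round(mean(vars(self).values()))


evaluation = Evaluation(relevance_score=10, completeness_score=9, accuracy_score=9, clarity_score=9)
print(f"Overall: {evaluation.overall_score}/10")  # Overall: 9/10
```

Validating the range at construction time catches malformed model output before a bad score is logged.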

In [9]:
print("⭐ Initializing Evaluator Agent (BONUS)...")
evaluator = EvaluatorAgent(langfuse_client=langfuse)
print("\n✅ Evaluator ready!")
print("   Will evaluate responses on: Relevance, Completeness, Accuracy, Clarity")

⭐ Initializing Evaluator Agent (BONUS)...

✅ Evaluator ready!
   Will evaluate responses on: Relevance, Completeness, Accuracy, Clarity


## 6. Testing with Sample Queries

Let's test the system with queries from different departments.

### Test 1: HR Query - Paid Time Off

In [10]:
query1 = "How many PTO days do I get per year?"
result1 = orchestrator.process_query(query1, verbose=True)


PROCESSING QUERY: How many PTO days do I get per year?

[Orchestrator] Classifying query: How many PTO days do I get per year?
[Orchestrator] Classified as: HR (confidence: 0.90)
[Orchestrator] Reasoning: The query is related to employee benefits and leave policies, which fall under the HR department's responsibilities.

--------------------------------------------------------------------------------
CLASSIFICATION:
  Department: HR
  Confidence: 0.90
  Reasoning: The query is related to employee benefits and leave policies, which fall under the HR department's responsibilities.
--------------------------------------------------------------------------------

ANSWER:
Based on the information provided in the context, full-time employees accrue PTO as follows:
- 0-2 years of service: 15 days (120 hours) per year
- 3-5 years of service: 20 days (160 hours) per year
- 6-10 years of service: 25 days (200 hours) per year
- 10+ years of service: 30 days (240 hours) per year

Part-time employ

### Test 2: IT Query - Laptop Issues

In [11]:
query2 = "My laptop won't turn on, what should I do?"
result2 = orchestrator.process_query(query2, verbose=True)


PROCESSING QUERY: My laptop won't turn on, what should I do?

[Orchestrator] Classifying query: My laptop won't turn on, what should I do?
[Orchestrator] Classified as: IT (confidence: 0.90)
[Orchestrator] Reasoning: The query is related to hardware troubleshooting, which falls under the IT department's responsibilities.

--------------------------------------------------------------------------------
CLASSIFICATION:
  Department: IT
  Confidence: 0.90
  Reasoning: The query is related to hardware troubleshooting, which falls under the IT department's responsibilities.
--------------------------------------------------------------------------------

ANSWER:
If your laptop won't turn on, follow these steps:

1. Check that the power adapter is securely connected to both the laptop and the power outlet.
2. Try plugging the power adapter into a different power outlet to rule out any issues with the current outlet.
3. Hold down the power button for 10 seconds to perform a hard reset.
4. If

### Test 3: Finance Query - Expense Reimbursement

In [12]:
query3 = "What is the reimbursement policy for business travel expenses?"
result3 = orchestrator.process_query(query3, verbose=True)


PROCESSING QUERY: What is the reimbursement policy for business travel expenses?

[Orchestrator] Classifying query: What is the reimbursement policy for business travel expenses?
[Orchestrator] Classified as: Finance (confidence: 0.95)
[Orchestrator] Reasoning: The query is related to expenses and reimbursement policies, which falls under the Finance department's area of expertise.

--------------------------------------------------------------------------------
CLASSIFICATION:
  Department: Finance
  Confidence: 0.95
  Reasoning: The query is related to expenses and reimbursement policies, which falls under the Finance department's area of expertise.
--------------------------------------------------------------------------------

ANSWER:
The reimbursement policy for business travel expenses includes the following eligible expenses when properly documented and approved:
- Airfare for business trips (economy class)
- Hotel accommodations at reasonable rates
- Ground transportation (re

### Test 4: IT Query - VPN Access

In [13]:
query4 = "I forgot my VPN password. How can I reset it?"
result4 = orchestrator.process_query(query4, verbose=True)


PROCESSING QUERY: I forgot my VPN password. How can I reset it?

[Orchestrator] Classifying query: I forgot my VPN password. How can I reset it?
[Orchestrator] Classified as: IT (confidence: 0.95)
[Orchestrator] Reasoning: The query is related to a technical issue with VPN password, which falls under IT department's expertise.

--------------------------------------------------------------------------------
CLASSIFICATION:
  Department: IT
  Confidence: 0.95
  Reasoning: The query is related to a technical issue with VPN password, which falls under IT department's expertise.
--------------------------------------------------------------------------------

ANSWER:
If you have forgotten your VPN password, you can reset it by following these steps:

1. Call IT Support at Extension 4357 to initiate the password reset process.
2. Verify your identity with your employee ID and birth date.
3. A temporary password will be provided to you via your registered phone number.
4. Log in to the VPN 

## 7. Response Quality Evaluation (BONUS)

Now let's evaluate the quality of our responses using the Evaluator Agent.

In [14]:
# Evaluate the first query result
print("\n" + "="*80)
print("EVALUATING RESPONSE QUALITY")
print("="*80)

# Get trace_id from result if available
trace_id = result1.get('trace_id')
if trace_id:
    print(f"Using trace_id: {trace_id[:16]}...")

evaluation1 = evaluator.evaluate_response(
    query=result1["query"],
    answer=result1["answer"],
    department=result1["classification"]["department"],
    source_documents=result1["source_documents"],
    trace_id=trace_id  # Pass trace_id to link scores to trace
)

print("\n📊 Evaluation Results:")
print(f"   Overall Score: {evaluation1.overall_score}/10")
print(f"   Relevance: {evaluation1.relevance_score}/10")
print(f"   Completeness: {evaluation1.completeness_score}/10")
print(f"   Accuracy: {evaluation1.accuracy_score}/10")
print(f"   Clarity: {evaluation1.clarity_score}/10")
print(f"\n💬 Feedback: {evaluation1.feedback}")
print(f"\n✅ Strengths: {evaluation1.strengths}")
print(f"\n🔧 Improvements: {evaluation1.improvements}")

if trace_id:
    print("\n" + "="*80)
    print("✅ Scores logged to Langfuse with trace_id!")
    print("🔗 Check 'Scores' tab at: https://cloud.langfuse.com")
    print("="*80)


EVALUATING RESPONSE QUALITY
Using trace_id: 9a0a40865a8458f9...

[Evaluator] Evaluating response for query: How many PTO days do I get per year?...
[Evaluator] Overall Score: 9/10
[Evaluator] Relevance: 10/10
[Evaluator] Completeness: 9/10
[Evaluator] Accuracy: 9/10
[Evaluator] Clarity: 9/10
[Evaluator] Scores logged to Langfuse (trace_id: 9a0a40865a8458f98c884552dfac52a2)

📊 Evaluation Results:
   Overall Score: 9/10
   Relevance: 10/10
   Completeness: 9/10
   Accuracy: 9/10
   Clarity: 9/10

💬 Feedback: The response is highly relevant, providing a detailed breakdown of PTO accrual based on years of service for full-time employees and a general guideline for part-time employees. The information is comprehensive and accurate, addressing the user's question effectively. The clarity of the response is good, but it could be improved by organizing the information into bullet points for better readability.

✅ Strengths: 1. Highly relevant information provided based on the user's q

## 8. Batch Testing with Test Queries

Let's test the system with all queries from our test dataset.
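
The loading code below expects `test_queries.json` to contain a top-level `test_queries` list whose entries carry a `query` and an `expected_department`. A minimal sketch of that shape, with a hypothetical `validate_test_data` helper that fails early on malformed entries, might look like this. The two sample entries reuse queries from the tests above, and the allowed department labels are inferred from the notebook output.

```python
import json

ALLOWED_DEPARTMENTS = {"HR", "IT", "Finance"}  # inferred from the notebook output

sample_json = """
{
  "test_queries": [
    {"query": "How many PTO days do I get per year?", "expected_department": "HR"},
    {"query": "My laptop won't turn on, what should I do?", "expected_department": "IT"}
  ]
}
"""


def validate_test_data(data: dict) -> list:
    """Return the list of test cases, raising ValueError on a malformed entry."""
    cases = data.get("test_queries")
    if not isinstance(cases, list):
        raise ValueError("expected a top-level 'test_queries' list")
    for index, case in enumerate(cases):
        if not isinstance(case.get("query"), str) or not case["query"].strip():
            raise ValueError(f"entry {index}: missing or empty 'query'")
        if case.get("expected_department") not in ALLOWED_DEPARTMENTS:
            raise ValueError(f"entry {index}: unknown 'expected_department'")
    return cases


cases = validate_test_data(json.loads(sample_json))
print(f"{len(cases)} valid test cases")  # 2 valid test cases
```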

In [15]:
# Load test queries
with open('test_queries.json', 'r') as f:
    test_data = json.load(f)

print(f"📝 Loaded {len(test_data['test_queries'])} test queries")
print("\nTest queries by department:")

dept_counts = {}
for test in test_data['test_queries']:
    dept = test['expected_department']
    dept_counts[dept] = dept_counts.get(dept, 0) + 1

for dept, count in dept_counts.items():
    print(f"   {dept}: {count} queries")

📝 Loaded 15 test queries

Test queries by department:
   HR: 5 queries
   IT: 5 queries
   Finance: 5 queries


In [16]:
# Run all test queries
results = []
correct_classifications = 0
total_queries = len(test_data['test_queries'])

print("\n" + "="*80)
print("RUNNING BATCH TESTS")
print("="*80 + "\n")

for i, test in enumerate(test_data['test_queries'], 1):
    print(f"\n[{i}/{total_queries}] Testing: {test['query'][:60]}...")
    
    # Process query
    result = orchestrator.process_query(test['query'], verbose=False)
    
    # Check if classification is correct
    expected = test['expected_department']
    actual = result['classification']['department']
    is_correct = expected == actual
    
    if is_correct:
        correct_classifications += 1
        status = "✅ CORRECT"
    else:
        status = "❌ INCORRECT"
    
    print(f"   Expected: {expected} | Got: {actual} | {status}")
    print(f"   Confidence: {result['classification']['confidence']:.2f}")
    
    results.append({
        'query': test['query'],
        'expected': expected,
        'actual': actual,
        'correct': is_correct,
        'confidence': result['classification']['confidence'],
        'answer': result['answer']
    })

# Calculate accuracy
accuracy = (correct_classifications / total_queries) * 100

print("\n" + "="*80)
print("BATCH TEST RESULTS")
print("="*80)
print(f"\n📊 Overall Accuracy: {accuracy:.1f}% ({correct_classifications}/{total_queries})")
print(f"\n✅ Correct Classifications: {correct_classifications}")
print(f"❌ Incorrect Classifications: {total_queries - correct_classifications}")


RUNNING BATCH TESTS


[1/15] Testing: How many PTO days do I get per year?...

[Orchestrator] Classifying query: How many PTO days do I get per year?
[Orchestrator] Classified as: HR (confidence: 0.90)
[Orchestrator] Reasoning: The query is related to employee benefits and leave policies, which fall under the HR department's responsibilities.
   Expected: HR | Got: HR | ✅ CORRECT
   Confidence: 0.90

[2/15] Testing: My laptop won't turn on, what should I do?...

[Orchestrator] Classifying query: My laptop won't turn on, what should I do?
[Orchestrator] Classified as: IT (confidence: 0.90)
[Orchestrator] Reasoning: The query is related to hardware troubleshooting, which falls under the IT department's expertise.
   Expected: IT | Got: IT | ✅ CORRECT
   Confidence: 0.90

[3/15] Testing: What is the reimbursement policy for business travel expense...

[Orchestrator] Classifying query: What is the reimbursement policy for business travel expenses?
[Orchestrator] Classified as: Finance
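
Overall accuracy can hide a weak department. Assuming a `results` list shaped like the one built in the batch loop (dicts with `expected`, `actual`, and `correct` keys), a per-department breakdown can be sketched as follows. The sample rows are illustrative placeholders, not output from this run, and `accuracy_by_department` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_department(results: List[dict]) -> Dict[str, float]:
    """Return the fraction of correctly classified queries per expected department."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in results:
        totals[row["expected"]] += 1
        hits[row["expected"]] += int(row["correct"])
    return {dept: hits[dept] / totals[dept] for dept in totals}


# Placeholder rows for illustration only.
sample_results = [
    {"expected": "HR", "actual": "HR", "correct": True},
    {"expected": "HR", "actual": "Finance", "correct": False},
    {"expected": "IT", "actual": "IT", "correct": True},
]
for dept, acc in accuracy_by_department(sample_results).items():
    print(f"{dept}: {acc:.0%}")
```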

## 9. Evaluate All Responses (BONUS)

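Let's evaluate the quality of the first five responses. Each query is processed again to obtain its `trace_id` and source documents, which the batch results above did not keep.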

In [17]:
# Evaluate quality of first 5 responses
print("\n" + "="*80)
print("EVALUATING RESPONSE QUALITY (First 5 queries)")
print("="*80)

evaluation_results = []
avg_scores = {
    'overall': 0,
    'relevance': 0,
    'completeness': 0,
    'accuracy': 0,
    'clarity': 0
}

for i, test in enumerate(test_data['test_queries'][:5], 1):
    print(f"\n[{i}/5] Evaluating: {test['query'][:60]}...")
    
    # Get fresh result
    result = orchestrator.process_query(test['query'], verbose=False)
    
    # Get trace_id from result
    trace_id = result.get('trace_id')
    
    # Evaluate - will automatically log to Langfuse with trace_id
    evaluation = evaluator.evaluate_response(
        query=result["query"],
        answer=result["answer"],
        department=result["classification"]["department"],
        source_documents=result["source_documents"],
        trace_id=trace_id  # Pass trace_id to link scores
    )
    
    print(f"   Overall: {evaluation.overall_score}/10")
    print(f"   Relevance: {evaluation.relevance_score}/10 | Completeness: {evaluation.completeness_score}/10")
    print(f"   Accuracy: {evaluation.accuracy_score}/10 | Clarity: {evaluation.clarity_score}/10")
    if trace_id:
        print(f"   Trace ID: {trace_id[:16]}... ✓")
    
    # Accumulate scores
    avg_scores['overall'] += evaluation.overall_score
    avg_scores['relevance'] += evaluation.relevance_score
    avg_scores['completeness'] += evaluation.completeness_score
    avg_scores['accuracy'] += evaluation.accuracy_score
    avg_scores['clarity'] += evaluation.clarity_score
    
    evaluation_results.append(evaluation)

# Calculate averages
n = len(evaluation_results)
for key in avg_scores:
    avg_scores[key] /= n

print("\n" + "="*80)
print("AVERAGE QUALITY SCORES")
print("="*80)
print(f"\n📊 Overall Average: {avg_scores['overall']:.1f}/10")
print(f"   Relevance: {avg_scores['relevance']:.1f}/10")
print(f"   Completeness: {avg_scores['completeness']:.1f}/10")
print(f"   Accuracy: {avg_scores['accuracy']:.1f}/10")
print(f"   Clarity: {avg_scores['clarity']:.1f}/10")

print("\n" + "="*80)
print("✅ All scores logged to Langfuse with trace IDs!")
print("📊 View scores at: https://cloud.langfuse.com")
print("   Navigate to: Scores tab")
print("="*80)


EVALUATING RESPONSE QUALITY (First 5 queries)

[1/5] Evaluating: How many PTO days do I get per year?...

[Orchestrator] Classifying query: How many PTO days do I get per year?
[Orchestrator] Classified as: HR (confidence: 0.90)
[Orchestrator] Reasoning: The query is related to employee benefits and leave policies, which fall under the HR department's responsibilities.

[Evaluator] Evaluating response for query: How many PTO days do I get per year?...
[Evaluator] Overall Score: 9/10
[Evaluator] Relevance: 10/10
[Evaluator] Completeness: 9/10
[Evaluator] Accuracy: 9/10
[Evaluator] Clarity: 9/10
[Evaluator] Scores logged to Langfuse (trace_id: 925a187ad3dc7a2fb30c8662c96b3a56)
   Overall: 9/10
   Relevance: 10/10 | Completeness: 9/10
   Accuracy: 9/10 | Clarity: 9/10
   Trace ID: 925a187ad3dc7a2f... ✓

[2/5] Evaluating: My laptop won't turn on, what should I do?...

[Orchestrator] Classifying query: My laptop won't turn on, what should I do?
[Orchestrator] Classified as: IT (confidenc

## 10. Interactive Testing

Try your own queries!

In [18]:
# Interactive query testing
def test_query(query_text):
    """
    Test a custom query and evaluate the response.
    
    Args:
        query_text: Your question to test
    """
    print("\n" + "="*80)
    result = orchestrator.process_query(query_text, verbose=True)
    
    print("\n" + "-"*80)
    print("EVALUATING RESPONSE QUALITY")
    print("-"*80)
    
    # Get trace_id from result
    trace_id = result.get('trace_id')
    
    evaluation = evaluator.evaluate_response(
        query=result["query"],
        answer=result["answer"],
        department=result["classification"]["department"],
        source_documents=result["source_documents"],
        trace_id=trace_id  # Pass trace_id to link scores
    )
    
    print("\n📊 Quality Scores:")
    print(f"   Overall: {evaluation.overall_score}/10")
    print(f"   Relevance: {evaluation.relevance_score}/10")
    print(f"   Completeness: {evaluation.completeness_score}/10")
    print(f"   Accuracy: {evaluation.accuracy_score}/10")
    print(f"   Clarity: {evaluation.clarity_score}/10")
    print(f"\n💬 {evaluation.feedback}")
    
    if trace_id:
        print("\n" + "-"*80)
        print(f"✅ Scores logged to Langfuse (trace_id: {trace_id[:16]}...)")
    print("="*80)
    
    return result, evaluation

# Example usage - uncomment to try your own!
# test_query("What is the parental leave policy?")
# test_query("How do I connect to the office WiFi?")
# test_query("When will I receive my expense reimbursement?")

## 11. View Results in Langfuse

All queries have been traced and logged to Langfuse. You can now:

1. Visit [cloud.langfuse.com](https://cloud.langfuse.com)
2. Navigate to your project
3. View **Traces** to see all query processing steps
4. View **Scores** to see quality evaluations
5. Debug any misclassifications or poor responses

### What You Can See in Langfuse:

**Traces:**
- Complete execution path for each query
- Classification reasoning
- Retrieved documents
- Generated responses
- Execution time and token usage

**Scores:**
- Overall quality scores (1-10)
- Dimension-specific scores (relevance, completeness, accuracy, clarity)
- Detailed feedback and suggestions

**Analytics:**
- Query volume by department
- Average response quality
- Most common query types
- Performance metrics

## 12. Summary & Next Steps

### What We've Built:

✅ **Multi-Agent System**: Orchestrator + 3 specialized RAG agents (HR, IT, Finance)

✅ **Intent Classification**: Automatic query routing with confidence scores

✅ **RAG Implementation**: Document retrieval with 50+ chunks per domain

✅ **Langfuse Integration**: Complete observability and tracing

✅ **Quality Evaluation**: Automated response scoring (BONUS)

### Technical Highlights:

- **LangChain Framework**: Production-grade components
- **Vector Stores**: ChromaDB for efficient retrieval
- **Structured Outputs**: Pydantic models for type safety
- **Observability**: Full tracing with Langfuse
- **Quality Metrics**: Multi-dimensional evaluation

### Next Steps:

1. **Review Langfuse Dashboard**: Analyze traces and scores
2. **Test Edge Cases**: Try ambiguous or multi-department queries
3. **Tune Parameters**: Adjust chunk size, k-value, temperature
4. **Add More Departments**: Legal, Sales, Marketing
5. **Deploy to Production**: API wrapper, web interface
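
For the parameter-tuning step, one simple approach is to enumerate a small grid of settings and re-run the batch test for each combination. The sketch below only generates the configurations. The parameter names (`chunk_size`, `k`, `temperature`) follow the list above, while the candidate values are arbitrary assumptions. How each setting is passed to the agents depends on `src.agents`, which is not shown in this notebook.

```python
from itertools import product

# Candidate values are arbitrary examples, not recommendations.
grid = {
    "chunk_size": [500, 1000],
    "k": [3, 5],
    "temperature": [0.0, 0.3],
}

# Build one config dict per combination of values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs), "configurations")  # 8 configurations
print(configs[0])  # {'chunk_size': 500, 'k': 3, 'temperature': 0.0}
```

Because every configuration triggers a full batch run, the grid is best kept small: each added value multiplies the number of LLM calls.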

### Performance Expectations:

- **Classification Accuracy**: 90%+ expected
- **Response Quality**: 7-9/10 average
- **Latency**: 2-5 seconds per query
- **Cost**: ~$0.01-0.05 per query

---

**🎉 Congratulations! You've built a production-grade multi-agent system with full observability!**

In [19]:
# Ensure all data is sent to Langfuse
print("🔄 Flushing all data to Langfuse...")
langfuse.flush()
print("✅ All data sent to Langfuse!")
print("\n" + "="*80)
print("📊 VIEW YOUR RESULTS IN LANGFUSE")
print("="*80)
print("\n1. Go to: https://cloud.langfuse.com")
print("2. Navigate to your project")
print("3. Click on 'Traces' tab to see all query executions")
print("4. Click on 'Scores' tab to see all evaluation scores")
print("\nYou should see:")
print("  - Query traces (RetrievalQA, ChatOpenAI)")
print("  - Evaluation traces with scores")
print("  - Score names: overall_quality, relevance, completeness, accuracy, clarity")
print("\n" + "="*80)

🔄 Flushing all data to Langfuse...
✅ All data sent to Langfuse!

📊 VIEW YOUR RESULTS IN LANGFUSE

1. Go to: https://cloud.langfuse.com
2. Navigate to your project
3. Click on 'Traces' tab to see all query executions
4. Click on 'Scores' tab to see all evaluation scores

You should see:
  - Query traces (RetrievalQA, ChatOpenAI)
  - Evaluation traces with scores
  - Score names: overall_quality, relevance, completeness, accuracy, clarity

