# Evaluating Virtual Assistant Conversations with Label Studio

Want to understand how well your virtual assistant handles complex conversations? This notebook shows you how to set up a Label Studio project to evaluate multi-turn chat dialogues using human feedback and analyze the results.

By labeling and analyzing real conversations, you can:
- Identify common user intents and conversation patterns
- Measure how effectively your assistant addresses user needs
- Spot gaps in understanding or response quality
- Track conversation flow and intent transitions
- Find opportunities to improve the assistant's responses

This notebook walks through:
- Creating a Label Studio project with an intuitive labeling interface
- Importing your chat conversations for evaluation
- Setting up a comprehensive labeling schema to assess:
  - User intent classification
  - Response accuracy and helpfulness
  - Success in addressing user needs
- Analyzing the results with insightful visualizations and metrics


**Prerequisites:**

- A running Label Studio instance 
  - We'll use a local instance in this notebook
  - Get your API key from Account & Settings in the Label Studio UI
- Label Studio SDK and pandas (`pip install -r requirements.txt`) 
- Multi-turn chat conversations to analyze
  - We provide a sample dataset in this notebook
  - You can substitute your own conversation data

In [11]:
# Import the SDK and the client module
from label_studio_sdk.client import LabelStudio

# Define the URL where Label Studio is accessible and the API key for your user account
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = '<YOUR_API_KEY>'

# Create a client for the Label Studio API (no request is sent until the client is used)
client = LabelStudio(base_url=LABEL_STUDIO_URL, api_key=API_KEY)

## Creating the Label Configuration

To evaluate multi-turn chat conversations, we need a labeling configuration that defines the annotation interface. This configuration will:

1. Display the complete conversation for context
2. Break down each conversation turn individually for detailed analysis
3. Present specific questions and choices for evaluating each turn

### About MAX_CHAT_TURNS

The Label Studio interface requires us to specify a maximum number of turns (`MAX_CHAT_TURNS`) upfront because:
- Label Studio needs a static UI configuration
- We can't dynamically add UI elements as conversations get longer
- Each turn needs its own set of questions and styling

How it works:
- Short conversations: Empty turns are added as padding
- Long conversations: Must be split into multiple tasks
- Default setting: 5 turns (adjust based on your typical conversation length)
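
The handling of long conversations could be sketched as follows. This is a minimal illustration rather than part of the notebook's pipeline: `split_conversation` is a hypothetical helper, and it assumes messages strictly alternate between user and assistant so that two messages make up one turn.

```python
def split_conversation(messages, max_turns):
    """Split a message list into chunks of at most max_turns turns.

    Hypothetical helper: assumes each turn is one user message followed
    by one assistant message, i.e. two list entries per turn.
    """
    chunk_size = max_turns * 2  # two messages per turn
    return [
        {"messages": messages[start:start + chunk_size]}
        for start in range(0, len(messages), chunk_size)
    ]


# Example: a 7-turn conversation with a 5-turn limit yields two tasks.
long_chat = []
for n in range(1, 8):
    long_chat.append({"role": "user", "content": f"question {n}"})
    long_chat.append({"role": "assistant", "content": f"answer {n}"})

chunks = split_conversation(long_chat, max_turns=5)
print([len(c["messages"]) // 2 for c in chunks])  # prints [5, 2]
```

Each chunk is a dictionary with a `messages` list, so it can be treated as its own task. One trade-off of this approach is that annotators working on a later chunk no longer see the earlier part of the conversation.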

In [12]:
# Define the maximum number of chat turns
MAX_CHAT_TURNS = 5

In [13]:
def generate_colors(n):
    """Return n border/background color pairs, cycling through the palette if n exceeds its length."""
    # Curated color palette
    base_colors = [
        {"border": "#6A5ACD", "background": "#EDEDFD"},  # Slate Blue
        {"border": "#2E8B57", "background": "#EAF5F1"},  # Emerald Green
        {"border": "#FF4500", "background": "#FFF4EC"},  # Sunset Orange
        {"border": "#DC143C", "background": "#FDECEC"},  # Crimson Red
        {"border": "#4B0082", "background": "#F3EAFD"},  # Indigo Purple
        {"border": "#4682B4", "background": "#EDF6FA"},  # Steel Teal
        {"border": "#DAA520", "background": "#FFF8E4"},  # Goldenrod
        {"border": "#20B2AA", "background": "#E8FAF9"},  # Pale Aqua
    ]
    
    # If n exceeds the palette, repeat colors
    colors = [base_colors[i % len(base_colors)] for i in range(n)]
    return colors

# Generate colors for the number of chat turns
turn_colors = generate_colors(MAX_CHAT_TURNS)

# Generate CSS dynamically
css = """
.root {
    font-family: Arial, sans-serif;
    display: flex;
    flex-direction: column;
    height: 100vh; /* Full height of the viewport */
    margin: 0;
    padding: 0;
}

.container {
    display: flex;
    flex: 1;
    gap: 20px;
    height: 100%; /* Ensure it stretches to fill the root height */
    overflow: hidden; /* Prevent scrolling at the container level */
}

.column {
    flex: 1;
    display: flex;
    flex-direction: column;
    overflow: hidden; /* Prevent column itself from scrolling */
}

.dialogue {
    max-width: 750px;
    border: 1px solid #ccc;
    padding: 10px;
    border-radius: 5px;
    background-color: #f8f9fa;
    overflow-y: auto; /* Enable vertical scrolling */
    flex: 1; /* Stretch to fill the available height */
}

.questions {
    border: 1px solid #ddd;
    padding: 10px;
    border-radius: 5px;
    background-color: #f8f9fa;
    overflow-y: auto; /* Enable vertical scrolling */
    flex: 1; /* Stretch to fill the available height */
}

.panel {
    margin-bottom: 10px;
    padding: 10px;
    border: 1px solid #e9ecef;
    border-radius: 5px;
    background-color: #f8f9fa;
}

.panel-header {
    font-weight: bold;
    margin-bottom: 10px;
}

.section-header {
    margin-bottom: 10px;
}
"""

# Add turn-specific styles
for i, colors in enumerate(turn_colors, start=1):
    css += f"""
    .turn-{i} {{
        border: 2px solid {colors['border']};
        background-color: {colors['background']};
        padding: 10px;
        border-radius: 5px;
        margin-bottom: 20px;
    }}
    """

# Generate panels for each turn dynamically
panels = ""
for i in range(1, MAX_CHAT_TURNS + 1):
    panels += f"""
    <Panel value="Turn {i}" className="panel-header">
        <View className="panel-turn turn-{i}">
            <Paragraphs name="turn{i}_prg" value="$turn{i}_dialogue" layout="dialogue" nameKey="role" textKey="content" />
            
            <Header value="What is the user's intent in this turn?" />
            <Choices name="turn{i}_user_intent" toName="turn{i}_prg" choice="multiple">
                <Choice value="Product Inquiry" />
                <Choice value="Order Status" />
                <Choice value="Return/Exchange Inquiry" />
                <Choice value="Payment/Refund Inquiry" />
                <Choice value="Complaint" />
                <Choice value="Store/Location Information" />
                <Choice value="Other" />
            </Choices>

            <Header value="Did the assistant’s response address the user's intent?" />
            <Choices name="turn{i}_response_address_intent" toName="turn{i}_prg" choice="single">
                <Choice value="Fully Addressed" />
                <Choice value="Partially Addressed" />
                <Choice value="Not Addressed" />
            </Choices>

            <Header value="Is the assistant’s response accurate and helpful?" />
            <Choices name="turn{i}_response_accuracy_helpfulness" toName="turn{i}_prg" choice="single">
                <Choice value="Yes, Accurate and Helpful" />
                <Choice value="Yes, Accurate but Unhelpful" />
                <Choice value="No, Inaccurate" />
                <Choice value="No Response" />
            </Choices>

            <Header value="What action is implied by the assistant’s response (if any)?" />
            <Choices name="turn{i}_response_action" toName="turn{i}_prg" choice="multiple">
                <Choice value="Provide More Information to the User" />
                <Choice value="Request More Information from the User" />
                <Choice value="Escalate to Human Support" />
                <Choice value="Redirect to a Different Team/Resource" />
                <Choice value="Confirm Action Taken" />
                <Choice value="No Action/Response" />
            </Choices>
        </View>
    </Panel>
    """

# Combine everything into the final template
multi_turn_label_config = f"""
<View>
    <Style>
        {css}
    </Style>
    <View className="root">
        <Header value="Dialogue and Questions" />
        <View className="container">
            <View className="column">
                <View className="dialogue">
                    <Header value="Full Conversation" />
                    <Paragraphs name="prg" value="$messages" layout="dialogue" nameKey="role" textKey="content" />
                </View>
            </View>
            <View className="column">
                <View className="questions">
                    <Header value="Answer the questions for each turn" className="section-header" />
                    <Collapse>
                        {panels}
                    </Collapse>
                </View>
            </View>
        </View>
    </View>
</View>
"""

In [14]:
# Create a new project
multi_turn_project = client.projects.create(
    title='Multi-turn Chat Labeling',
    color='#CC6FBE',
    description='Labeling for multi-turn chat conversations',
    label_config=multi_turn_label_config
)

# Importing Multi-turn Chat Data

Now that the project exists, we can import tasks into it. First, we need to transform the data into the format that Label Studio expects.

## Data Structure
Each conversation in Label Studio needs:
- A full conversation view for context
- Individual turns broken out for analysis
- Each turn contains:
  - A user message
  - An assistant's response

## Sample Data Format
We'll use example conversations that follow the OpenAI chat format, where each message is a dictionary with:
- `role`: either "user" or "assistant"
- `content`: the message text

The data will be transformed to accommodate our `MAX_CHAT_TURNS` setting, with empty turns added as padding if needed.

## Customization
For your own use:
- Replace our sample data with your conversation logs
- Modify the labeling questions from the previous step to match your evaluation needs (e.g., specific intents, success metrics, or quality criteria)
- Keep the same data structure format for compatibility
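
When substituting your own logs, a quick structural check before import can catch conversations that do not fit the format described above. The following is a minimal sketch; `validate_conversation` is a hypothetical helper, and the rule that roles alternate starting with the user is an assumption taken from the turn layout used in this notebook.

```python
def validate_conversation(conversation):
    """Return a list of problems found in one conversation dict.

    Hypothetical helper: checks that roles alternate user, assistant,
    user, ... and that every message has string content.
    """
    problems = []
    messages = conversation.get("messages", [])
    if len(messages) % 2 != 0:
        problems.append("odd number of messages: last turn has no reply")
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {index}: expected role '{expected_role}'")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: 'content' must be a string")
    return problems


good = {"messages": [
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "Could you share the order number?"},
]}
bad = {"messages": [{"role": "assistant", "content": "Hello!"}]}

print(validate_conversation(good))  # prints []
print(validate_conversation(bad))   # two problems: odd length, wrong first role
```

Conversations that fail the check can be repaired or skipped before they are turned into tasks.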

In [15]:
multi_turn_data = [
    {"messages": [
        {
          "role": "user",
          "content": "Hi, do you have the new SmartWear fitness tracker in stock?"
        },
        {
          "role": "assistant",
          "content": "Let me check that for you. Which model are you interested in—SmartWear Pro or SmartWear Lite?"
        },
        {
          "role": "user",
          "content": "SmartWear Pro."
        },
        {
          "role": "assistant",
          "content": "Yes, the SmartWear Pro is in stock in both black and silver. Would you like to place an order or reserve one?"
        },
        {
          "role": "user",
          "content": "Can I reserve the black one?"
        },
        {
          "role": "assistant",
          "content": "Absolutely! I've reserved a black SmartWear Pro for you. You can pick it up anytime today."
        }
      ]},
    {"messages": [
        {
          "role": "user",
          "content": "What’s your return policy for online purchases?"
        },
        {
          "role": "assistant",
          "content": "We accept returns within 30 days of purchase for items in their original condition. Would you like details about how to start a return?"
        },
        {
          "role": "user",
          "content": "Yes, please."
        },
        {
          "role": "assistant",
          "content": "You can start a return by visiting our website, logging into your account, and selecting the item you want to return. Would you like me to send you the direct link to the return page?"
        },
        {
          "role": "user",
          "content": "That would be great, thanks!"
        },
        {
          "role": "assistant",
          "content": "Here you go: [Return Page Link]. Let me know if you need any further help."
        }
      ]},
    {"messages": [
        {
          "role": "user",
          "content": "I ordered a jacket last week, but it hasn’t arrived yet. Can you check the status?"
        },
        {
          "role": "assistant",
          "content": "Sure! Could you provide your order number?"
        },
        {
          "role": "user",
          "content": "It’s #56789."
        },
        {
          "role": "assistant",
          "content": "Thanks! I see your order was delayed due to weather conditions. It’s expected to arrive in 2 more days. I’m sorry for the inconvenience!"
        },
        {
          "role": "user",
          "content": "Can I get a refund on the shipping fee?"
        },
        {
          "role": "assistant",
          "content": "Yes, I’ve issued a refund for the shipping fee. It should reflect in your account within 3-5 business days. Is there anything else I can assist you with?"
        }
      ]}
]

In [16]:
def transform_data(data, max_turns):
    transformed = []
    for conversation in data:
        messages = conversation["messages"]
        formatted_conversation = {
            "data": {
                "messages": messages,
                **{
                    f"turn{i+1}_dialogue": messages[i * 2:(i + 1) * 2] if i < len(messages) // 2 else [
                        {"role": "", "content": ""},
                        {"role": "", "content": ""}
                    ]
                    for i in range(max_turns)
                }
            }
        }
        transformed.append(formatted_conversation)
    return transformed

In [17]:
multi_turn_tasks = transform_data(multi_turn_data, max_turns=MAX_CHAT_TURNS)

In [18]:
multi_turn_tasks

[{'data': {'messages': [{'role': 'user',
     'content': 'Hi, do you have the new SmartWear fitness tracker in stock?'},
    {'role': 'assistant',
     'content': 'Let me check that for you. Which model are you interested in—SmartWear Pro or SmartWear Lite?'},
    {'role': 'user', 'content': 'SmartWear Pro.'},
    {'role': 'assistant',
     'content': 'Yes, the SmartWear Pro is in stock in both black and silver. Would you like to place an order or reserve one?'},
    {'role': 'user', 'content': 'Can I reserve the black one?'},
    {'role': 'assistant',
     'content': "Absolutely! I've reserved a black SmartWear Pro for you. You can pick it up anytime today."}],
   'turn1_dialogue': [{'role': 'user',
     'content': 'Hi, do you have the new SmartWear fitness tracker in stock?'},
    {'role': 'assistant',
     'content': 'Let me check that for you. Which model are you interested in—SmartWear Pro or SmartWear Lite?'}],
   'turn2_dialogue': [{'role': 'user', 'content': 'SmartWear Pro.'},


## Import tasks into the project



In [19]:
# Import tasks into the project
for task in multi_turn_tasks:
    client.tasks.create(
        project=multi_turn_project.id,
        data=task['data']
    )

With the tasks imported, we can start labeling them in Label Studio.

# Analyzing Labeled Results

After completing the labeling tasks in Label Studio, we can analyze the results to evaluate our virtual assistant's performance. The analysis will help us understand:
- Common user intents and patterns
- Response effectiveness
- Areas needing improvement
- Conversation flow patterns

## Getting the Data
For this example:
- We'll use pre-labeled results from `labeled_results.json` included in this repository
- In practice, you would export your labeled data from the Label Studio UI as JSON
- The analysis code will work the same way with your exported data
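
For orientation, the part of the export that the analysis reads can be sketched as follows. The record below is hypothetical and heavily trimmed: it keeps only the fields the analysis code accesses (`id`, `annotations`, `result`, `from_name`, `value.choices`), while a real export carries more. The flattening loop emits one row per selected choice so that multi-select answers are preserved.

```python
# Hypothetical, trimmed export record; a real export has additional fields.
sample_export = [
    {
        "id": 1,
        "annotations": [
            {
                "result": [
                    {
                        "from_name": "turn1_user_intent",
                        "to_name": "turn1_prg",
                        "type": "choices",
                        "value": {"choices": ["Order Status", "Complaint"]},
                    },
                    {
                        "from_name": "turn1_response_address_intent",
                        "to_name": "turn1_prg",
                        "type": "choices",
                        "value": {"choices": ["Fully Addressed"]},
                    },
                ]
            }
        ],
    }
]

rows = []
for task in sample_export:
    for annotation in task["annotations"]:
        for result in annotation["result"]:
            # "turn1_user_intent" -> turn "turn1", question "user_intent"
            turn, _, question = result["from_name"].partition("_")
            # One row per selected choice, so multi-select answers are kept.
            for choice in result["value"].get("choices", []):
                rows.append((task["id"], turn, question, choice))

for row in rows:
    print(row)
```

The `from_name` values match the `name` attributes of the `Choices` tags in the labeling configuration, which is why the turn label can be recovered from the prefix.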

Let's load the labeled results and transform our labeled data into structured DataFrames for analysis.

In [20]:
import json

# Load the labeled results
with open('labeled_results.json', 'r') as file:
    labeled_results = json.load(file)


In [21]:
import pandas as pd

# Extract data into a structured format
turn_data = []
for task in labeled_results:
    for annotation in task["annotations"]:
        for result in annotation["result"]:
            # Multi-select fields can hold several choices; emit one row per choice
            for choice in result["value"].get("choices", []):
                turn_data.append({
                    "Task ID": task["id"],
                    "Turn": result["from_name"].split("_")[0],  # Extract turn (e.g., turn1, turn2)
                    "Field": result["from_name"],
                    "Choice": choice,
                })

# Convert to DataFrame
df = pd.DataFrame(turn_data)

In [22]:
# Display the table
df

,Task ID,Turn,Field,Choice
0,87004,turn1,turn1_user_intent,Product Inquiry
1,87004,turn1,turn1_response_address_intent,Partially Addressed
2,87004,turn1,turn1_response_accuracy_helpfulness,"Yes, Accurate and Helpful"
3,87004,turn1,turn1_response_action,Request More Information from the User
4,87004,turn2,turn2_user_intent,Product Inquiry
5,87004,turn2,turn2_response_address_intent,Fully Addressed
6,87004,turn2,turn2_response_accuracy_helpfulness,"Yes, Accurate and Helpful"
7,87004,turn2,turn2_response_action,Provide More Information to the User
8,87004,turn3,turn3_user_intent,Other
9,87004,turn3,turn3_response_address_intent,Fully Addressed


## Data Transformation and Metrics

Using the results from the labeling tasks, we can extract several metrics to help us understand the results. Here we will focus on the following metrics:
- User intents from each conversation turn
- How well the assistant addressed these intents
- Response quality metrics
- Actions suggested by the assistant

Each metric helps us understand different aspects of a single conversation turn, which can give us insights into what users are asking and how well the assistant is handling these requests.

In [23]:
from collections import Counter

# Format data for different analyses
# 1. User Intent Distribution
intent_df = df[df["Field"].str.contains("user_intent")]
intent_summary = intent_df["Choice"].value_counts().reset_index()
intent_summary.columns = ["User Intent", "Count"]

# 2. Intent Addressing Metrics
addressing_df = df[df["Field"].str.contains("response_address_intent")]
addressing_summary = addressing_df["Choice"].value_counts().reset_index()
addressing_summary.columns = ["Addressing Outcome", "Count"]

# 3. Accuracy and Helpfulness Metrics
accuracy_df = df[df["Field"].str.contains("response_accuracy_helpfulness")]
accuracy_summary = accuracy_df["Choice"].value_counts(normalize=True).reset_index()
accuracy_summary.columns = ["Accuracy and Helpfulness", "Percentage"]
accuracy_summary["Percentage"] = (accuracy_summary["Percentage"] * 100).round(2)

# 4. Suggested Actions
action_df = df[df["Field"].str.contains("response_action")]
action_summary = action_df["Choice"].value_counts().reset_index()
action_summary.columns = ["Suggested Action", "Count"]


# Visualization of Key Metrics

We'll display our analysis in a few tables that show:

### Intent Distribution
Shows which user intents are most common across conversations. This helps identify:
- Primary use cases (e.g., 40% of conversations deal with Product Inquiries)
- Gaps in coverage (e.g., few Store Location questions)

### Intent Addressing
Measures how well the assistant handles each intent type:
- Fully Addressed: Complete resolution (e.g., product information provided)
- Partially Addressed: Some gaps (e.g., product info given but price missing)
- Not Addressed: Failed responses (e.g., misunderstood user request)
- Helps identify which intents need improvement
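
To break the addressing outcome down by intent, the answers for the same task and turn need to be paired up first. The sketch below illustrates that pairing on hypothetical rows using only the standard library; it assumes one intent per turn, and with pandas the same table could be produced with `pd.crosstab` once the two fields are aligned per turn.

```python
from collections import Counter, defaultdict

# Hypothetical long-format rows: (task id, turn, field, choice)
rows = [
    (1, "turn1", "turn1_user_intent", "Order Status"),
    (1, "turn1", "turn1_response_address_intent", "Fully Addressed"),
    (1, "turn2", "turn2_user_intent", "Payment/Refund Inquiry"),
    (1, "turn2", "turn2_response_address_intent", "Partially Addressed"),
    (2, "turn1", "turn1_user_intent", "Order Status"),
    (2, "turn1", "turn1_response_address_intent", "Fully Addressed"),
]

# Group the answers that belong to the same task and turn.
per_turn = defaultdict(dict)
for task_id, turn, field, choice in rows:
    question = field.split("_", 1)[1]  # strip the "turnN_" prefix
    per_turn[(task_id, turn)][question] = choice

# Count (intent, outcome) pairs across all turns that have both answers.
outcome_by_intent = Counter(
    (answers["user_intent"], answers["response_address_intent"])
    for answers in per_turn.values()
    if "user_intent" in answers and "response_address_intent" in answers
)

for (intent, outcome), count in sorted(outcome_by_intent.items()):
    print(f"{intent:<25} {outcome:<20} {count}")
```

A table like this makes it easier to see whether partially addressed or unaddressed turns cluster around particular intents.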

### Response Quality
Evaluates accuracy and helpfulness of responses:
- "Accurate and Helpful": Perfect responses (e.g., correct product info with relevant details)
- "Accurate but Unhelpful": Missing context (e.g., correct but incomplete information)
- "Inaccurate": Wrong information (e.g., incorrect product details)
- Shows where assistant knowledge needs updating

### Action Analysis
Shows what actions the assistant commonly suggests or takes:
- Information Requests (e.g., "Can you provide your order number?")
- Escalations (e.g., "Let me connect you with support")
- Confirmations (e.g., "I've processed your refund")
- Helps understand typical conversation outcomes

### Understanding the Results
These metrics help you:
1. Identify strongest and weakest areas
2. Prioritize improvements
3. Track changes over time
4. Ensure consistent response quality

In [24]:
from IPython.display import display, HTML

# Display a DataFrame as a styled table with a color gradient
# (Styler.background_gradient requires matplotlib to be installed)
def display_table(title, dataframe, cmap="YlGnBu"):
    display(HTML(f"<h3>{title}</h3>"))
    display(dataframe.style.background_gradient(cmap=cmap)
            .set_table_attributes('style="width:50%; margin:auto;"')
            .set_caption(title)
            .set_properties(**{"text-align": "left"})
            .set_table_styles([{"selector": "caption", "props": [("caption-side", "top")]}]))
            

display_table("User Intent Distribution", intent_summary)
display_table("Intent Addressing Metrics", addressing_summary)
display_table("Accuracy and Helpfulness Metrics", accuracy_summary)
display_table("Suggested Actions", action_summary)

,User Intent,Count
0,Return/Exchange Inquiry,3
1,Product Inquiry,2
2,Order Status,2
3,Other,1
4,Payment/Refund Inquiry,1


,Addressing Outcome,Count
0,Fully Addressed,7
1,Partially Addressed,2


,Accuracy and Helpfulness,Percentage
0,"Yes, Accurate and Helpful",100.0


,Suggested Action,Count
0,Provide More Information to the User,4
1,Request More Information from the User,3
2,Confirm Action Taken,2


# Conversation Flow Analysis

While the previous tables helped us understand our results at the turn level, we may also want to see how conversations evolve over time. We'll use the same data to create a few tables that show:


### Transition Matrix
Shows how users move between intents during a conversation, which helps identify the common paths they take:
- Raw counts show actual number of transitions (e.g., 5 users went from Product Inquiry → Refund Inquiry)
- Normalized probabilities show likelihood (e.g., 30% of Order Status checks lead to Refund Inquiry)
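
The relationship between raw counts and normalized probabilities can be illustrated with a small worked example. The intent sequences below are hypothetical; each probability is the count for a transition divided by the total number of transitions leaving its source intent.

```python
from collections import Counter

# Hypothetical intent sequences, one list per conversation.
sequences = [
    ["Order Status", "Order Status", "Payment/Refund Inquiry"],
    ["Order Status", "Payment/Refund Inquiry"],
]

# Count each (current intent, next intent) pair.
counts = Counter()
for seq in sequences:
    counts.update(zip(seq, seq[1:]))

# Total number of transitions leaving each source intent.
row_totals = Counter()
for (source, _), n in counts.items():
    row_totals[source] += n

# Normalize each count by its row total.
probabilities = {
    pair: n / row_totals[pair[0]] for pair, n in counts.items()
}

for (source, target), p in probabilities.items():
    print(f"{source} -> {target}: {p:.2f}")
```

Here three transitions leave "Order Status": one stays on the same intent and two move to "Payment/Refund Inquiry", giving probabilities of 0.33 and 0.67. Intents that only appear at the end of a conversation have no outgoing transitions, so their row in a full matrix is all zeros.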

### Intent Consistency
Tells us if conversations stay on topic or switch between different subjects:
- Single-intent conversations (e.g., just discussing an order status)
- Multi-intent conversations (e.g., starting with a product question, ending with a store location question)
- Expected vs unexpected topic changes

### Pattern Analysis
Identifies common conversation journeys to:
- Find frequent paths (e.g., Complaint → Product Inquiry → Return/Exchange Inquiry)
- Spot potential pain points (e.g., many Product Inquiries leading to Complaints)
- Understand typical user behavior patterns

In [25]:
from itertools import pairwise  # Available in Python 3.10+

# Parse the data into a structured format
conversation_data = []
for task in labeled_results:
    task_id = task["id"]
    turn_intents = []
    
    for annotation in task["annotations"]:
        for result in annotation["result"]:
            if "user_intent" in result["from_name"]:
                turn = result["from_name"].split("_")[0]  # Get the turn label (e.g., turn1)
                intent = result["value"]["choices"][0]  # Use the first selected intent (the field allows several)
                turn_intents.append((turn, intent))
    
    conversation_data.append({"Task ID": task_id, "Intent Sequence": turn_intents})

# Analyze transitions between intents
transition_counts = Counter()
intent_sequences = []

for conversation in conversation_data:
    intents = [intent for _, intent in conversation["Intent Sequence"]]
    intent_sequences.append(intents)
    transitions = pairwise(intents)
    transition_counts.update(transitions)

# Create a Transition Matrix
unique_intents = list({intent for seq in intent_sequences for intent in seq})
transition_matrix = pd.DataFrame(0, index=unique_intents, columns=unique_intents)

for (start, end), count in transition_counts.items():
    transition_matrix.at[start, end] = count

# Normalize the matrix to show probabilities
transition_matrix_normalized = transition_matrix.div(transition_matrix.sum(axis=1), axis=0).fillna(0)

# Compute Intent Consistency
intent_consistency = {
    "Total Conversations": len(conversation_data),
    "Consistent Conversations": sum(1 for seq in intent_sequences if len(set(seq)) == 1),
    "Percentage Consistent": sum(1 for seq in intent_sequences if len(set(seq)) == 1) / len(conversation_data) * 100,
}

# Common Transition Patterns
common_patterns = Counter(tuple(seq) for seq in intent_sequences).most_common(5)

In [26]:
# Transition Matrix (Counts)
display_table("Transition Matrix (Counts)", transition_matrix)

# Transition Matrix (Normalized)
display_table("Transition Matrix (Normalized)", transition_matrix_normalized, cmap="Blues")

# Intent Consistency
display(HTML("<h3>Intent Consistency Metrics</h3>"))
intent_consistency_df = pd.DataFrame([intent_consistency])
display(intent_consistency_df.style.background_gradient(cmap="YlOrRd", subset=["Percentage Consistent"])
        .set_table_attributes('style="width:50%; margin:auto;"'))

# Top 5 Common Transition Patterns
display(HTML("<h3>Top 5 Common Transition Patterns</h3>"))
common_patterns_df = pd.DataFrame(common_patterns, columns=["Intent Sequence", "Count"])
display(common_patterns_df.style.background_gradient(cmap="Purples", subset=["Count"])
        .set_table_attributes('style="width:50%; margin:auto;"'))

,Other,Product Inquiry,Payment/Refund Inquiry,Return/Exchange Inquiry,Order Status
Other,0,0,0,0,0
Product Inquiry,1,1,0,0,0
Payment/Refund Inquiry,0,0,0,0,0
Return/Exchange Inquiry,0,0,0,2,0
Order Status,0,0,1,0,1


,Other,Product Inquiry,Payment/Refund Inquiry,Return/Exchange Inquiry,Order Status
Other,0.0,0.0,0.0,0.0,0.0
Product Inquiry,0.5,0.5,0.0,0.0,0.0
Payment/Refund Inquiry,0.0,0.0,0.0,0.0,0.0
Return/Exchange Inquiry,0.0,0.0,0.0,1.0,0.0
Order Status,0.0,0.0,0.5,0.0,0.5


,Total Conversations,Consistent Conversations,Percentage Consistent
0,3,1,33.333333


,Intent Sequence,Count
0,"('Product Inquiry', 'Product Inquiry', 'Other')",1
1,"('Return/Exchange Inquiry', 'Return/Exchange Inquiry', 'Return/Exchange Inquiry')",1
2,"('Order Status', 'Order Status', 'Payment/Refund Inquiry')",1


# Conclusion

This notebook has demonstrated how to systematically evaluate multi-turn chat conversations using Label Studio. By following this process, you can:

## Gain Valuable Insights
- Understand common user intents and conversation patterns
- Measure your virtual assistant's effectiveness
- Identify areas needing improvement
- Track how conversations flow and evolve

## Take Action
Use these insights to:
- Train your assistant on frequently occurring patterns
- Improve responses for poorly handled intents
- Optimize common conversation paths
- Address identified pain points
- Create better fallback strategies for unexpected transitions

## Next Steps
Consider:
- Running this analysis regularly to track improvements
- Customizing the labeling schema for your specific use case
- Adding more detailed metrics based on your findings
- Using the insights to prioritize development efforts

Remember: The goal is not just to collect metrics, but to continuously improve your virtual assistant's ability to handle complex, multi-turn conversations effectively.
