# Getting Started with Agentune Analyze

Welcome to Agentune Analyze! This tutorial will walk you through the fundamentals of using the library to analyze conversation data and generate insights.

## What You'll Learn

- Loading multi-table conversation data
- Running analysis on conversations
- Exploring discovered features and their predictive value
- Generating action recommendations from conversation patterns
- Managing resources properly with RunContext

**Estimated time**: 10-15 minutes

## Prerequisites

- Python >=3.12
- Agentune Analyze installed (`pip install agentune-analyze`)
- Jupyter Notebook installed (`pip install jupyter`)
- OpenAI API key ([get one here](https://platform.openai.com/api-keys))

**Note**: The sample data attached to this notebook is provided strictly for research and AI model development. Commercial use, resale, or redistribution is prohibited.

**Note for Mac users**: If you encounter errors related to lightgbm, you may need to install OpenMP first: `brew install libomp`. See the [LightGBM macOS installation guide](https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html) for details.

---

## 1. Setup and Imports

First, let's import the necessary libraries and set up our environment.

In [18]:
import os
from pathlib import Path

import polars as pl

# Agentune Analyze imports
from agentune.analyze.api.base import RunContext
from agentune.analyze.feature.problem import ProblemDescription

### Configure OpenAI API Key

In [19]:
# Recommended: Set the environment variable before starting Jupyter
# export OPENAI_API_KEY="your-api-key-here"

# Alternative: Set it in the notebook (not recommended for production)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Verify it's set
if 'OPENAI_API_KEY' not in os.environ:
    raise ValueError('Please set OPENAI_API_KEY environment variable')

print('✓ Environment configured')

✓ Environment configured
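
The check above only tests that the variable exists, so an empty or whitespace-only value would still pass. A minimal sketch of a stricter check follows, together with a helper for logging a key without exposing it. Both `require_env` and `mask_secret` are hypothetical helpers written for this tutorial, not part of Agentune Analyze.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise ValueError(f"Please set the {name} environment variable")
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Hide all but the last few characters so a key can be logged safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Usage in the notebook (commented out so the sketch runs without a real key):
# api_key = require_env("OPENAI_API_KEY")
# print(f"Using key {mask_secret(api_key)}")
```

Masking matters in notebooks in particular, because cell outputs are saved with the file and are easy to commit by accident.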


---

## 2. Load Sample Data

We'll work with auto insurance customer service conversations. The dataset consists of two tables:
- **Conversations table** (conversations.csv): One row per conversation with outcome and duration
- **Messages table** (messages.csv): Individual message turns within each conversation

In [20]:
# Define data paths
data_dir = Path('data')
conversations_path = data_dir / 'conversations.csv'
messages_path = data_dir / 'messages.csv'

# Load data using Polars
conversations_df = pl.read_csv(conversations_path)
messages_df = pl.read_csv(messages_path)

print(f'Loaded {len(conversations_df)} conversations')
print(f'Loaded {len(messages_df)} message turns')
print(f'\nConversations columns: {conversations_df.columns}')
print(f'Messages columns: {messages_df.columns}')

Loaded 101 conversations
Loaded 16823 message turns

Conversations columns: ['conversation_id', 'outcome', 'duration_seconds']
Messages columns: ['conversation_id', 'timestamp', 'message', 'author']


### Explore the Data

In [21]:
# Show outcome distribution
print('Outcome distribution:')
print(conversations_df.group_by('outcome').agg(pl.len()).sort('len', descending=True))

print(f'\nTotal conversations: {len(conversations_df)}')
print(f'Total messages: {len(messages_df)}')
print(f'Average messages per conversation: {len(messages_df) / len(conversations_df):.1f}')

Outcome distribution:
shape: (6, 2)
┌─────────────────────────────────┬─────┐
│ outcome                         ┆ len │
│ ---                             ┆ --- │
│ str                             ┆ u32 │
╞═════════════════════════════════╪═════╡
│ customer not interested         ┆ 32  │
│ process paused - customer need… ┆ 28  │
│ process paused - customer need… ┆ 17  │
│ customer objections not handle… ┆ 12  │
│ no quote - ineligible customer  ┆ 11  │
│ buy                             ┆ 1   │
└─────────────────────────────────┴─────┘

Total conversations: 101
Total messages: 16823
Average messages per conversation: 166.6


---

## 3. Create RunContext

The `RunContext` is your main entry point to Agentune Analyze. It manages all resources including database connections, HTTP clients, and LLM instances.

In [22]:
# Create context
ctx = await RunContext.create()

print('✓ RunContext created')
print('  - Database connection available')

✓ RunContext created
  - Database connection available


**Note**: The RunContext will be used throughout the tutorial. Remember to close it when you're done (we'll show this at the end).

---

## 4. Load Data into DuckDB

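Now we'll load the two CSV files into DuckDB tables, which Agentune Analyze uses for efficient data processing. Note that the tables are read from the CSV paths directly; the Polars DataFrames above were only used to inspect the data.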

In [23]:
# Load conversations table (main table: one row per conversation)
conversations_table = await ctx.data.from_csv(conversations_path).copy_to_table('conversations')

print(f'✓ Loaded conversations table: {conversations_table.name}')
print(f'  Schema: {conversations_table.schema}')

✓ Loaded conversations table: "memory"."main"."conversations"
  Schema: Schema(cols=(Field(name='conversation_id', dtype=SimpleDtype(name='str', duckdb_type=VARCHAR, polars_type=String)), Field(name='outcome', dtype=SimpleDtype(name='str', duckdb_type=VARCHAR, polars_type=String)), Field(name='duration_seconds', dtype=SimpleDtype(name='float64', duckdb_type=DOUBLE, polars_type=Float64))))


In [24]:
# Load messages table
messages_table = await ctx.data.from_csv(messages_path).copy_to_table('messages')

print(f'✓ Loaded messages table: {messages_table.name}')
print(f'  Schema: {messages_table.schema}')

✓ Loaded messages table: "memory"."main"."messages"
  Schema: Schema(cols=(Field(name='conversation_id', dtype=SimpleDtype(name='str', duckdb_type=VARCHAR, polars_type=String)), Field(name='timestamp', dtype=SimpleDtype(name='timestamp', duckdb_type=TIMESTAMP_MS, polars_type=Datetime(time_unit='ms', time_zone=None))), Field(name='message', dtype=SimpleDtype(name='str', duckdb_type=VARCHAR, polars_type=String)), Field(name='author', dtype=SimpleDtype(name='str', duckdb_type=VARCHAR, polars_type=String))))


---

## 5. Define Table Relationships

Now that both tables are loaded, we need to tell Agentune Analyze how they relate to each other for conversation analysis.

In [25]:
# Define how the messages table relates to conversations
join_strategy = messages_table.join_strategy.conversation(
    name='messages',
    main_table_key_col='conversation_id',
    key_col='conversation_id',
    timestamp_col='timestamp',
    role_col='author',
    content_col='message'
)

print('✓ Join strategy defined for multi-turn conversations')

✓ Join strategy defined for multi-turn conversations
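
To make the column roles concrete, here is a rough plain-Python sketch of what a conversation-style join conceptually produces: messages grouped by the key column, ordered by timestamp, and reduced to (role, content) pairs. This only illustrates the idea and says nothing about how Agentune Analyze implements it. The `group_conversations` function and the toy rows are invented for this example.

```python
from collections import defaultdict
from datetime import datetime

# Toy rows mirroring the messages table columns: conversation_id, timestamp, message, author.
messages = [
    {"conversation_id": "c2", "timestamp": "2024-01-01T10:00:05",
     "message": "Hi, I need a quote.", "author": "customer"},
    {"conversation_id": "c1", "timestamp": "2024-01-01T09:00:10",
     "message": "Sure, what car do you drive?", "author": "agent"},
    {"conversation_id": "c1", "timestamp": "2024-01-01T09:00:00",
     "message": "Hello, I want auto insurance.", "author": "customer"},
]


def group_conversations(rows, key_col, timestamp_col, role_col, content_col):
    """Group message rows by conversation key and order each group by timestamp."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_col]].append(row)
    return {
        key: [
            (turn[role_col], turn[content_col])
            for turn in sorted(turns, key=lambda t: datetime.fromisoformat(t[timestamp_col]))
        ]
        for key, turns in grouped.items()
    }


conversations = group_conversations(
    messages,
    key_col="conversation_id",
    timestamp_col="timestamp",
    role_col="author",
    content_col="message",
)
for conversation_id, turns in sorted(conversations.items()):
    print(conversation_id, turns)
```

Each main-table row (one conversation) is then associated with its ordered list of turns, which is why the join needs a key on both sides plus timestamp, role, and content columns.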


---

## 6. Split Data

Agentune Analyze splits your data into subsets for feature generation and evaluation.

In [26]:
# Split the conversations table (main table)
split_data = await conversations_table.split(train_fraction=0.9)

print('✓ Data split complete')

✓ Data split complete


**Note**: The data is split into training (90%) and test (10%) subsets for analysis.
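
For intuition, a 90/10 split can be sketched in plain Python as a seeded shuffle followed by a cut. This is an illustration under assumptions, not the library's algorithm: `split_ids`, the seed, and the rounding rule are all choices made for this example, and the library may assign rows differently.

```python
import random


def split_ids(ids, train_fraction=0.9, seed=42):
    """Shuffle ids reproducibly and split them into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 101 placeholder ids, matching the number of conversations in the sample data.
conversation_ids = [f"conv_{i}" for i in range(101)]
train_ids, test_ids = split_ids(conversation_ids, train_fraction=0.9)
print(len(train_ids), len(test_ids))
```

With only 101 conversations, a 10% test subset holds roughly ten rows, so statistics computed on it will be noisy. That is worth keeping in mind when reading the evaluation numbers later in the tutorial.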

---

## 7. Define the Problem

Tell Agentune Analyze what you're trying to predict. In our case, we want to predict the conversation outcome.

In [27]:
# Define the prediction problem
# Note: target_desired_outcome must exactly match a value from the outcome column
# (see the outcome distribution shown in section 2 above)
problem = ProblemDescription(
    target_column='outcome',
    problem_type='classification',
    target_desired_outcome='process paused - customer needs to consider the offer',  # The outcome we want to optimize for
    name='Customer Service Conversation Outcome Prediction',
    description='Analyze the outcome of auto insurance customer service conversations, and suggest improvements for increasing insurance sales',
    target_description='The final outcome of the conversation (buy, not interested, needs more info, etc.)'
)

print('✓ Problem defined')

✓ Problem defined
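
Because Polars truncates long strings when printing, the outcome table in section 2 does not show the full text of every value, so copying `target_desired_outcome` from it is error-prone. A small sketch of a check that fails early with the closest candidates is shown below. `check_target_value` is a hypothetical helper written for this tutorial and the sample values are illustrative.

```python
import difflib


def check_target_value(outcomes, desired):
    """Return `desired` if it is an observed outcome value, else raise ValueError."""
    observed = sorted(set(outcomes))
    if desired in observed:
        return desired
    hints = difflib.get_close_matches(desired, observed, n=3, cutoff=0.6)
    raise ValueError(
        f"{desired!r} is not a value of the target column. "
        f"Closest matches: {hints or 'none'}. All values: {observed}"
    )


# Toy values for illustration; the real column holds the full, untruncated strings.
sample_outcomes = ["buy", "customer not interested", "no quote - ineligible customer"]
check_target_value(sample_outcomes, "buy")  # passes silently
```

In the notebook, the first argument could be `conversations_df['outcome'].to_list()`, which passes the complete strings rather than the truncated display.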


---

## 8. Run Analysis

Now comes the exciting part! We'll run analysis, which will:
1. Generate candidate features from conversation analysis
2. Evaluate each feature's predictive power
3. Select the most valuable features

In [28]:
# Run analysis!
# This will take a few minutes to analyze conversation patterns

print('Starting analysis...')
print('This may take 5-10 minutes to analyze conversation patterns...')

results = await ctx.ops.analyze(
    problem_description=problem,
    main_input=split_data,
    secondary_tables=[messages_table],
    join_strategies=[join_strategy]
)

print('\n✓ Analysis complete!')
print(f'  Discovered {len(results.features)} features')

Starting analysis...
This may take 5-10 minutes to analyze conversation patterns...

✓ Analysis complete!
  Discovered 58 features


---

## 9. Explore the Results

Let's examine what features were discovered and how predictive they are.

In [29]:
# Show discovered features
print('Top 10 Discovered Features:\n')
for i, feature_with_stats in enumerate(results.features_with_train_stats[:10], 1):
    feature = feature_with_stats.feature
    stats = feature_with_stats.stats

    print(f'{i}. {feature.name}')
    print(f'   Description: {feature.description}')
    print(f'   Type: {feature.dtype}')
    print(f'   R²: {stats.relationship.r_squared:.4f}')
    print()

Top 10 Discovered Features:

1. customer_requested_time
   Description: Did the customer explicitly say they needed time to think, compare, or call back? (Y/N)
   Type: bool
   R²: 0.1027

2. commitment_step_reached
   Description: Did the call reach a commitment step (policy bind, payment, e-signature) before ending? (Y/N)
   Type: bool
   R²: 0.1690

3. quote_delivery_method
   Description: Through which channel was the final quote or link promised to be delivered? (Verbal/Text/Email/WebLink/Mail/None)
   Type: Enum[Text, None, Email, Verbal, Mail, Phone, WebLink, Text and Email, _other_]
   R²: 0.0803

4. customer_requested_delivery
   Description: Did the customer explicitly request a specific delivery method for the quote (e.g., email/mail)? (Y/N)
   Type: bool
   R²: 0.0474

5. documents_not_at_hand
   Description: Did the customer say they were not at home / at work / unable to access documents right now? (Y/N)
   Type: bool
   R²: 0.0551

6. customer_hold_wait
   Descripti

---

## 10. Generate Action Recommendations

Based on the conversation patterns, generate actionable recommendations for improving outcomes.

In [30]:
# Generate recommendations
print('Generating action recommendations...')

recommendations = await ctx.ops.recommend_conversation_actions(
    analyze_input=split_data,
    analyze_results=results
)

print('\n✓ Generated recommendations' if recommendations else '\n✓ No recommendations generated')

Generating action recommendations...

✓ Generated recommendations


In [31]:
# Display top 5 recommendations
print('\nTop 5 Recommendations:\n')
for i, rec in enumerate(recommendations.recommendations[:5], 1):
    print(f'{i}. {rec.title}')
    print(f'   Rationale: {rec.rationale}')
    print(f'   Description: {rec.description}')
    print(f'   Evidence: {rec.evidence}')
    print()


Top 5 Recommendations:

1. Mandatory “Next-Step” Scheduling Workflow
   Rationale: Most “process paused – customer needs to consider” calls end without a firm, dated follow-up commitment.
   Description: Embed a required “Next-Step” object in the agent UI.
   a. At any point the agent selects “Prospect needs time” the system must launch a follow-up scheduler (date, time-window, channel).
   b. The scheduler sends an iCalendar invite + SMS reminder to the customer; the agent’s dialer receives an auto-pop task.
   c. Callback window can’t be more than 72 hours (configurable SLA).
   d. Completion KPI (“next step booked”) becomes visible on agent scorecards; pause outcome cannot be saved without it.
   Evidence: • Feature #10 “Was a specific callback or future contact time agreed upon?” – SSE 0.0475
  ⇒ Highly predictive of a call moving forward; its absence dominates paused cases.
• Only 1 of 10 paused samples contained a concrete date/t

---

## 11. Visualize Results with Interactive Dashboard

Agentune Analyze includes a utility that generates an interactive HTML dashboard for exploring analysis results. The dashboard is a convenience for inspecting the outputs before you use them in your own application.

In [32]:
from utils.generate_analyze_dashboard import create_dashboard

# Generate the dashboard
dashboard_path = create_dashboard(
    results=results,
    output_file='analysis_dashboard.html',
    title='Auto Insurance Conversation Analysis Results'
)

print(f'✓ Dashboard generated: {dashboard_path}')
print('\nTo view the dashboard, open it in your browser')

✓ Dashboard generated: analysis_dashboard.html

To view the dashboard, open it in your browser


In [33]:
from utils import create_recommendations_dashboard

# Generate dashboard
dashboard_path = create_recommendations_dashboard(
    report=recommendations,
    output_file='recommendations_dashboard.html',
    title='Auto Insurance Action Recommendations'
)

# Access raw report directly from the report object if needed
print(recommendations.raw_report)

---------------------------------------------------------------
1. Mandatory “Next-Step” Scheduling Workflow
---------------------------------------------------------------
Finding (What):
Most “process paused – customer needs to consider” calls end without a firm, dated follow-up commitment.

Analysis & Impact (Why):
• Feature #10 “Was a specific callback or future contact time agreed upon?” – SSE 0.0475
  ⇒ Highly predictive of a call moving forward; its absence dominates paused cases.
• Only 1 of 10 paused samples contained a concrete date/time; the rest used vague language (“call me tomorrow”, “I’ll look it up”, “text me later”).
• Without a calendar entry, 67 % of these prospects never re-engage (internal CRM abandonment report).
• Lost revenue: each dead lead = ≈ $430 of foregone 1-year premium.

Strategic Recommendation (What Next & How):
Embed a required “Next-Step” object in the agent UI.
a. At any point the agent selects

### What You'll See in the Dashboard

The interactive dashboard includes:
- **Target Distribution Chart**: Visual breakdown of outcome classes (desired outcome highlighted with gold border)
- **Feature Performance Ranking**: Features sorted by R² (coefficient of determination)
- **Sortable Feature Table**: Detailed statistics for all features
- **Interactive Feature Comparison Tool**: Select multiple features to compare side-by-side
- **Detailed Statistics**: Click any feature row to expand and see:
  - R² score (variance explained)
  - Distribution statistics (mean, std, missing values, unique categories)
  - Relationship statistics (lift matrix, class distributions)
  - For numeric features: histogram visualization

**Note**: The R² (R-squared) metric shows the proportion of variance in the target outcome that each feature explains. Values range from 0 (no predictive power) to 1 (perfect prediction); higher values indicate more predictive features.

See example below:

![Analysis Results Dashboard](https://raw.githubusercontent.com/SparkBeyond/aoa/getting-started-guide/agentune-analyze/examples/screenshots/analysis_results_dashboard_screenshot.png)

---

## 12. Cleanup

Let's clean up the resources we've been using.

In [34]:
# Clean up resources: closes database connections, deletes temporary files, and frees memory
await ctx.aclose()

print('✓ Resources cleaned up')

✓ Resources cleaned up


### Best Practice: Context Manager

For production code, it's recommended to use the context manager pattern, which automatically handles cleanup:

```python
async with await RunContext.create() as ctx:
    # Load data
    conversations_table = await ctx.data.from_csv('data.csv').copy_to_table('conversations')

    # Split and analyze
    split_data = await conversations_table.split(train_fraction=0.9)
    problem = ProblemDescription(target_column='outcome', target_desired_outcome='buy')
    results = await ctx.ops.analyze(problem, split_data)

    # Use results...

# Resources are cleaned up automatically when the 'async with' block exits
```

This ensures resources are always cleaned up, even if an error occurs. However, for interactive notebook exploration, the explicit `create()` and `aclose()` pattern shown in this tutorial is often more convenient.
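
The cleanup guarantee can be demonstrated without the library. The sketch below uses `DemoContext`, a hypothetical stand-in that is not the real `RunContext`, to show that both `async with` and an explicit `try`/`finally` close the resource when the body raises. The second form is the one to reach for if you keep the explicit pattern in a script.

```python
import asyncio


class DemoContext:
    """Hypothetical stand-in for a resource holder; not the real RunContext."""

    def __init__(self):
        self.closed = False

    async def aclose(self):
        self.closed = True

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.aclose()
        return False  # do not swallow the exception


async def run_with_context_manager():
    ctx = DemoContext()
    try:
        async with ctx:
            raise RuntimeError("analysis failed")
    except RuntimeError:
        pass
    return ctx.closed


async def run_with_try_finally():
    ctx = DemoContext()
    try:
        try:
            raise RuntimeError("analysis failed")
        finally:
            await ctx.aclose()  # runs whether or not the body raised
    except RuntimeError:
        pass
    return ctx.closed


# In a script, asyncio.run drives the coroutines; in a notebook cell you would
# `await run_with_context_manager()` directly, because a loop is already running.
print(asyncio.run(run_with_context_manager()), asyncio.run(run_with_try_finally()))
```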

---

## Summary

In this tutorial, you learned how to:

* Load multi-table conversation data
* Create a RunContext for managing resources
* Split data for feature generation and evaluation
* Run analysis on conversations
* Explore discovered features and their predictive value
* Generate actionable recommendations
* Visualize results with an interactive dashboard

## Next Steps

More tutorials coming soon!

For detailed information, see the [Architecture Guide](../docs/).

## Questions?

Open an issue on GitHub or contact the maintainers. See the [main README](../README.md) for details.

---

**Tutorial complete!** 🎉