# Keywords AI Experiments Workflow Demo

This notebook demonstrates the complete workflow for managing experiments in Keywords AI, including creation, adding test cases and model configurations, running experiments, and evaluating results.

## What are Experiments in Keywords AI?

Keywords AI experiments are powerful tools for A/B testing different AI model configurations and prompts. They allow you to:

- **Compare multiple model configurations** (columns) across the same test cases
- **Test different prompts, temperatures, and parameters** systematically  
- **Run evaluations** to automatically score and compare results
- **Analyze performance** across different scenarios
- **Make data-driven decisions** about which configurations work best

Each experiment consists of:
- **Columns**: Different model configurations (prompts, models, parameters)
- **Rows**: Test cases with inputs and optional expected outputs
- **Results**: Generated outputs from running each column against each row
- **Evaluations**: Automated scoring of the results
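
The column and row structure described above can be pictured as a grid in which every column is run against every row. The following is a minimal sketch of that idea using plain dataclasses; `Column`, `Row`, and `cells` are hypothetical stand-ins for illustration and are not the SDK's types.

```python
from dataclasses import dataclass
from itertools import product


# Hypothetical stand-ins for illustration only; these are not the SDK's types.
@dataclass(frozen=True)
class Column:
    name: str
    model: str


@dataclass(frozen=True)
class Row:
    question: str


def cells(columns, rows):
    """Return every (column, row) pair that a full run would need to generate."""
    return list(product(columns, rows))


columns = [Column("Expert Assistant", "gpt-4"), Column("Expert Assistant", "gpt-4o")]
rows = [Row("What is machine learning?"), Row("How does deep learning work?")]

grid = cells(columns, rows)
print(len(grid))  # 2 columns x 2 rows = 4 results
```

The number of generated results grows as columns × rows, which is worth keeping in mind before adding many configurations to a large test set.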

## Prerequisites

- Keywords AI API key (`KEYWORDSAI_API_KEY`)
- Keywords AI base URL (`KEYWORDSAI_BASE_URL`) 
- Python packages: `keywordsai`, `python-dotenv` (installed via Poetry)
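
Before creating a client, it can help to check that both variables are present and to avoid printing the full API key. The sketch below shows one possible way to do that; `missing_vars` and `mask_secret` are hypothetical helper names, and the key value is made up.

```python
import os


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def mask_secret(value, head=4, tail=4):
    """Show only the ends of a secret; fully mask values too short to truncate safely."""
    if len(value) <= head + tail:
        return "*" * len(value)
    return f"{value[:head]}...{value[-tail:]}"


required = ["KEYWORDSAI_API_KEY", "KEYWORDSAI_BASE_URL"]
fake_env = {"KEYWORDSAI_API_KEY": "example-key-0123456789"}  # made-up value

print(missing_vars(required, fake_env))
print(mask_secret(fake_env["KEYWORDSAI_API_KEY"]))
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the check easy to test; in the notebook you would call `missing_vars(required)` after `load_dotenv()`.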


In [1]:
# Setup and imports
import os
import asyncio
from datetime import datetime, timezone
from dotenv import load_dotenv

# Load environment variables
load_dotenv(override=True)

# Keywords AI imports
from keywordsai.experiments.api import ExperimentAPI
from keywordsai_sdk.keywordsai_types.experiment_types import ExperimentType
from keywordsai import (
    ExperimentList,
    ExperimentCreate,
    ExperimentColumnType,
    ExperimentRowType,
    AddExperimentRowsRequest,
    AddExperimentColumnsRequest,
    RunExperimentRequest,
    RunExperimentEvalsRequest,
    ExperimentUpdate
)

# Check environment variables
api_key = os.getenv("KEYWORDSAI_API_KEY")
base_url = os.getenv("KEYWORDSAI_BASE_URL")

if not api_key or not base_url:
    print("❌ Missing environment variables!")
    print("Please set KEYWORDSAI_API_KEY and KEYWORDSAI_BASE_URL in your .env file")
else:
    print("✅ Environment variables loaded successfully")
    # Only show a masked key; never echo a key that is too short to truncate safely
    print(f"🔑 API Key: {api_key[:10]}...{api_key[-4:]}" if len(api_key) > 14 else "🔑 API Key: (set, hidden)")
    print(f"🌐 Base URL: {base_url}")

# Initialize the Experiment API client
client = ExperimentAPI(api_key=api_key, base_url=base_url)


✅ Environment variables loaded successfully
🔑 API Key: ptIwf5dO.g...FGoP
🌐 Base URL: http://localhost:8000/api


## Phase 1: Basic Experiment Creation

### Step 1: Create Initial Experiment

Let's start by creating a new, empty experiment with just a name and a description. Model configurations (columns) and test cases (rows) are added in the following steps.


In [2]:
print("📝 Creating a new Experiment...")

# Create a basic experiment (name and description only)
experiment_data = ExperimentType(
    name="Customer Support Assistant",
    description="AI assistant for handling customer support inquiries with professional tone",
)
experiment = await client.acreate(experiment_data)
experiment_id = experiment.id
print(f"✅ Created experiment with ID: {experiment.id}")

📝 Creating a new Experiment...
✅ Created experiment with ID: 48b6c352e80548d69137113ac3354569


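### Step 2: Create Columns

Add two columns (model configurations) to the experiment: one with its prompt messages written inline, and one whose prompt messages are loaded from the current version of an existing prompt. Replace `prompt_id` in the cell below with the ID of a prompt in your own account.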


In [19]:
# Step 2: Create a column (model configuration)
print("🤖 Step 2: Creating a column...")


# Create a column with specific parameters
column1 = ExperimentColumnType(
    model="gpt-4",
    name="Expert Assistant",
    temperature=0.3,
    max_completion_tokens=250,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    prompt_messages=[
        {
            "role": "system",
            "content": "You are an expert technical assistant. Provide accurate, detailed explanations with examples when helpful."
        },
        {
            "role": "user",
            "content": "{{question}}"
        }
    ],
    tools=[],
    tool_choice="auto",
    response_format={"type": "text"}
)
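
# Load an existing prompt so its messages can be reused as a column's prompt
from keywordsai.prompts.api import PromptAPI
prompt_id = "98db4236865d4abbb1ca5d88ce4ac3e3"  # replace with a prompt ID from your account
prompt_client = PromptAPI(api_key=api_key, base_url=base_url)
prompt = await prompt_client.aget(prompt_id)
print(prompt.current_version.messages)

def to_messages(msgs):
    """Flatten prompt Message objects into plain {"role", "content"} dicts."""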
    out = []
    for m in msgs:
        # m.content may be a list of TextContent(...) or a plain string
        if isinstance(m.content, list):
            text = "".join(getattr(c, "text", "") for c in m.content if getattr(c, "type", None) == "text")
        else:
            text = m.content
        out.append({"role": m.role, "content": text})
    return out

column2 = ExperimentColumnType(
    model="gpt-4",
    name="Expert Assistant",
    temperature=0.3,
    max_completion_tokens=250,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    prompt_messages=to_messages(prompt.current_version.messages),
    tools=[],
    tool_choice="auto",
    response_format={"type": "text"}
)

add_columns_request = AddExperimentColumnsRequest(columns=[column1, column2])
await client.aadd_columns(experiment_id, add_columns_request)
print("✅ Added Expert Assistant column (column1), and Expert Assistant column with prompt_messages from current version (column2)")

# Verify the column was added
updated_experiment = await client.aget(experiment_id)
print(f"📊 Experiment now has {len(updated_experiment.columns)} total columns")

# Show the column configuration
print(f"\n🔧 Column configuration for the latest column:")
latest_column = updated_experiment.columns[-1]  # Get the last added column
print(f"   Name: {latest_column.name}")
print(f"   Model: {latest_column.model}")
print(f"   Temperature: {latest_column.temperature}")
print(f"   Max tokens: {latest_column.max_completion_tokens}")
print(f"   System prompt: {latest_column.prompt_messages[0]['content']}")

print("\n💡 Columns created successfully! Ready to add test cases.")


🤖 Step 2: Creating a column...
[Message(role='system', content=[TextContent(type='text', text='You are a friendly customer support assistant. Use a warm, professional tone while solving customer problems efficiently.', cache_control=None)], name=None, tool_call_id=None, tool_calls=None, reasoning_content=None, thinking_blocks=None), Message(role='user', content=[TextContent(type='text', text='{{customer_inquiry}}', cache_control=None)], name=None, tool_call_id=None, tool_calls=None, reasoning_content=None, thinking_blocks=None)]
✅ Added Expert Assistant column (column1), and Expert Assistant column with prompt_messages from current version (column2)
📊 Experiment now has 12 total columns

🔧 Column Configuration:
   Name: Expert Assistant
   Model: gpt-4
   Temperature: 0.3
   Max tokens: 250
   System prompt: You are a friendly customer support assistant. Use a warm, professional tone while solving customer problems efficiently.

💡 Column created successfully! Ready to add test cases.
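
The `{{question}}` entry in `prompt_messages` is a template placeholder that is later filled from each row's `input`. How Keywords AI performs this substitution is not shown in this notebook; the following is a minimal sketch that assumes plain `{{name}}` replacement, with `render_messages` as a hypothetical helper.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_messages(messages, variables):
    """Return a copy of messages with {{name}} placeholders filled from variables.

    Placeholders without a matching variable are left untouched so they are easy to spot.
    """
    def fill(match):
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return [
        {**message, "content": PLACEHOLDER.sub(fill, message["content"])}
        for message in messages
    ]


prompt_messages = [
    {"role": "system", "content": "You are an expert technical assistant."},
    {"role": "user", "content": "{{question}}"},
]
rendered = render_messages(prompt_messages, {"question": "What is machine learning?"})
print(rendered[1]["content"])
```

The original `prompt_messages` list is not modified, so the same template can be rendered once per row.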


## Phase 2: Expanding the Experiment

### Step 3: Add Test Cases (Rows)

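Let's add test cases to evaluate our model configurations across different scenarios. Each row's `input` supplies values for the template variables used in the column prompts, and `ideal_output` is optional.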


In [4]:
# Add more test cases (rows)
print("📝 Step 3: Adding more test cases...")

new_rows = [
    ExperimentRowType(
        input={"question": "What is machine learning?"},
        ideal_output="ML is a subset of AI that learns from data."
    ),
    ExperimentRowType(
        input={"question": "Explain neural networks briefly."},
        ideal_output="Neural networks are computing systems inspired by biological neural networks."
    ),
    ExperimentRowType(
        input={"question": "What is the difference between AI and ML?"}
        # No ideal_output for this one - let's see how models handle it
    ),
    ExperimentRowType(
        input={"question": "How does deep learning work?"},
        ideal_output="Deep learning uses multi-layered neural networks to learn complex patterns."
    )
]

add_rows_request = AddExperimentRowsRequest(rows=new_rows)

await client.aadd_rows(experiment_id, add_rows_request)
print(f"✅ Added {len(new_rows)} new test cases")

# Verify the rows were added
updated_experiment = await client.aget(experiment_id)
print(f"📊 Experiment now has {len(updated_experiment.rows)} total rows")

# Show all test cases
print(f"\n📝 All Test Cases:")
for i, row in enumerate(updated_experiment.rows, 1):
    question = row.input.get('question', 'Unknown')
    print(f"   {i}. {question}")
    if hasattr(row, 'ideal_output') and row.ideal_output:
        print(f"      Expected: {row.ideal_output}")
    else:
        print(f"      Expected: (No ideal output provided)")

📝 Step 3: Adding more test cases...
✅ Added 4 new test cases
📊 Experiment now has 4 total rows

📝 All Test Cases:
   1. What is machine learning?
      Expected: ML is a subset of AI that learns from data.
   2. Explain neural networks briefly.
      Expected: Neural networks are computing systems inspired by biological neural networks.
   3. What is the difference between AI and ML?
      Expected: (No ideal output provided)
   4. How does deep learning work?
      Expected: Deep learning uses multi-layered neural networks to learn complex patterns.
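
The rows above supply only a `question` variable, while the prompt fetched in Step 2 printed a `{{customer_inquiry}}` placeholder. A quick local check can reveal this kind of mismatch before a run. The sketch below assumes `{{name}}` style placeholders; `required_variables` and `uncovered` are hypothetical helpers, not SDK functions.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_variables(messages):
    """Collect the placeholder names used across all message contents."""
    names = set()
    for message in messages:
        names.update(PLACEHOLDER.findall(message["content"]))
    return names


def uncovered(messages, row_inputs):
    """Map each row index to the placeholders its input dict does not supply."""
    needed = required_variables(messages)
    gaps = {}
    for index, row_input in enumerate(row_inputs):
        missing = needed - set(row_input)
        if missing:
            gaps[index] = sorted(missing)
    return gaps


row_inputs = [
    {"question": "What is machine learning?"},
    {"question": "How does deep learning work?"},
]
inline_prompt = [{"role": "user", "content": "{{question}}"}]
fetched_prompt = [{"role": "user", "content": "{{customer_inquiry}}"}]

print(uncovered(inline_prompt, row_inputs))
print(uncovered(fetched_prompt, row_inputs))
```

An empty dict means every row covers every placeholder; a non-empty dict lists, per row index, the variables that would remain unfilled.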


### Step 4: Add Model Configurations (Columns)

Now let's add another model configuration, this time using `gpt-4o` with a lower temperature, to compare different approaches.


In [5]:
# Add another model configuration (column)
print("🤖 Step 4: Adding another model configuration...")

# Create another column with a different model and parameters
expert_column = ExperimentColumnType(
    model="gpt-4o",
    name="Expert Assistant",
    temperature=0.2,  # Lower temperature for more focused responses
    max_completion_tokens=270,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    prompt_messages=[
        {
            "role": "system",
            "content": "You are an expert technical assistant. Provide accurate, detailed explanations with examples when helpful."
        },
        {
            "role": "user",
            "content": "{{question}}"
        }
    ],
    tools=[],
    tool_choice="auto",
    response_format={"type": "text"}
)

add_columns_request = AddExperimentColumnsRequest(columns=[expert_column])
await client.aadd_columns(experiment_id, add_columns_request)
print("✅ Added GPT-4o Expert Assistant configuration")

# Verify the column was added
updated_experiment = await client.aget(experiment_id)
print(f"📊 Experiment now has {len(updated_experiment.columns)} total columns")

# Show all configurations
print(f"\n🔧 Model Configurations:")
for i, col in enumerate(updated_experiment.columns, 1):
    print(f"   Column {i}: {col.name}")
    print(f"   - Model: {col.model}")
    print(f"   - Temperature: {col.temperature}")
    print(f"   - Max tokens: {col.max_completion_tokens}")
    print(f"   - System prompt: {col.prompt_messages[0]['content'][:50]}...")
    print()

print("💡 Now we can compare how GPT-4 and GPT-4o handle the same questions!")


🤖 Step 4: Adding another model configuration...
✅ Added GPT-4 Expert Assistant configuration
📊 Experiment now has 3 total columns

🔧 Model Configurations:
   Column 1: Expert Assistant
   - Model: gpt-4
   - Temperature: 0.3
   - Max tokens: 250
   - System prompt: You are an expert technical assistant. Provide acc...

   Column 2: Expert Assistant
   - Model: gpt-4
   - Temperature: 0.3
   - Max tokens: 250
   - System prompt: You are an expert technical assistant. Provide acc...

   Column 3: Expert Assistant
   - Model: gpt-4o
   - Temperature: 0.2
   - Max tokens: 270
   - System prompt: You are an expert technical assistant. Provide acc...

💡 Now we can compare how GPT-3.5 vs GPT-4 handle the same questions!


### Step 5: Update Experiment Metadata

Let's update the experiment name and description to reflect our changes.


In [6]:
# Update experiment metadata
print("✏️ Step 5: Updating experiment metadata...")

update_data = ExperimentUpdate(
    name="updated",
    description="Comprehensive AI comparison experiment with GPT-4 vs GPT-4o across multiple technical questions"
)

updated_experiment = await client.aupdate(experiment_id, update_data)
print(f"✅ Updated experiment name to: {updated_experiment.name}")
print(f"📋 New description: {updated_experiment.description}")

# Show final experiment structure
print(f"\n📊 Final Experiment Structure:")
print(f"   - Name: {updated_experiment.name}")
print(f"   - Columns: {len(updated_experiment.columns)} (model configurations)")
print(f"   - Rows: {len(updated_experiment.rows)} (test cases)")
print(f"   - Status: {updated_experiment.status}")
print(f"   - Total combinations: {len(updated_experiment.columns)} × {len(updated_experiment.rows)} = {len(updated_experiment.columns) * len(updated_experiment.rows)}")

print(f"\n🚀 The experiment is now ready to run!")


✏️ Step 5: Updating experiment metadata...
✅ Updated experiment name to: updated
📋 New description: 

📊 Final Experiment Structure:
   - Name: updated
   - Columns: 3 (model configurations)
   - Rows: 4 (test cases)
   - Status: 
   - Total combinations: 3 × 4 = 12

🚀 The experiment is now ready to run!


### Step 6: List All Experiments

Let's see how to manage multiple experiments and find the one we created.


In [7]:
listed_experiments = await client.alist(page_size=5)
print(listed_experiments)
print(len(listed_experiments.results))

results=[ExperimentType(id='4', column_count=1, columns=[], created_at='2025-07-21T19:34:21.997000Z', created_by=1, name='asdfadsf', organization=1, row_count=1, rows=[], status='ready', test_id='001Kun', updated_at='2025-07-21T19:34:22.034885Z', updated_by=1, variables=[], variable_definitions=[], starred=True, tags=[], description=''), ExperimentType(id='1a9f1edcc1b84f6390370313d41516b2', column_count=0, columns=[], created_at='2025-08-22T06:21:30.556743Z', created_by=1, name='Customer Support Assistant', organization=1, row_count=0, rows=[], status='', test_id='1a9f1edcc1b84f6390370313d41516b2', updated_at='2025-08-22T06:21:30.556775Z', updated_by=1, variables=[], variable_definitions=[], starred=False, tags=[], description=''), ExperimentType(id='91fdba26697f4ea1abd095a64e153afb', column_count=0, columns=[], created_at='2025-08-22T06:21:14.536972Z', created_by=1, name='Customer Support Assistant', organization=1, row_count=0, rows=[], status='', test_id='91fdba26697f4ea1abd095a64e1
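
To find the experiment we created among the listed results, one option is to match on its ID. The sketch below uses stand-in objects that carry only the `id` and `name` fields visible in the printed listing; `find_experiment` is a hypothetical helper, not part of the SDK.

```python
from types import SimpleNamespace


def find_experiment(results, experiment_id):
    """Return the first listed experiment whose id matches, or None."""
    return next((exp for exp in results if exp.id == experiment_id), None)


# Stand-in objects mimicking the id/name fields seen in the printed listing.
listed = [
    SimpleNamespace(id="4", name="asdfadsf"),
    SimpleNamespace(id="1a9f1edcc1b84f6390370313d41516b2", name="Customer Support Assistant"),
]

match = find_experiment(listed, "1a9f1edcc1b84f6390370313d41516b2")
print(match.name if match else "not found on this page")
```

In the notebook this would be called as `find_experiment(listed_experiments.results, experiment_id)`. With `page_size=5`, the experiment may not appear on the first page, so a `None` result does not necessarily mean it is missing.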

### Step 7: Run Experiment

Running the experiment generates an output for every column and row combination.


In [11]:
# Run the experiment; this call can take a while because every column is run against every row
print("🚀 Running experiment...")
print("loading...")
result = await client.arun_experiment(experiment_id)
print("✅ Run results:", result)

🚀 Running experiment...
loading...
✅ Run results: {'id': '48b6c352e80548d69137113ac3354569', 'updater': {'first_name': 'Leon', 'last_name': 'Lian', 'email': 'leonxlian@gmail.com'}, 'column_count': 2, 'columns': [{'id': 'a992e03837c94527aa2a010d429ac2bd', 'name': 'Expert Assistant', 'model': 'gpt-4', 'tools': [], 'top_p': 1.0, 'stream': False, 'temperature': 0.3, 'tool_choice': 'auto', 'prompt_messages': [{'role': 'system', 'content': 'You are an expert technical assistant. Provide accurate, detailed explanations with examples when helpful.'}, {'role': 'user', 'content': '{{question}}'}], 'response_format': {'type': 'text'}, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'max_completion_tokens': 250}, {'id': '3f0122282b8247af96577184e7bb3131', 'name': 'Expert Assistant', 'model': 'gpt-4', 'tools': [], 'top_p': 1.0, 'stream': False, 'temperature': 0.3, 'tool_choice': 'auto', 'prompt_messages': [{'role': 'system', 'content': 'You are an expert technical assistant. Provide accurate, de
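
Because a run can take a long time, it may be useful to put a client-side deadline on the awaited call. The following is a minimal sketch using the standard library's `asyncio.wait_for`; `run_with_deadline` and `fake_run` are hypothetical names, and `fake_run` merely simulates a slow call. This does not change the HTTP client's own read timeout, and cancelling the local await does not necessarily stop the run on the server.

```python
import asyncio


async def run_with_deadline(awaitable, timeout_s):
    """Await the given awaitable, returning None if it exceeds timeout_s seconds."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


# Hypothetical stand-in for a slow API call.
async def fake_run(delay_s):
    await asyncio.sleep(delay_s)
    return {"status": "done"}


finished = asyncio.run(run_with_deadline(fake_run(0.01), timeout_s=1.0))
timed_out = asyncio.run(run_with_deadline(fake_run(0.5), timeout_s=0.01))
print(finished, timed_out)
```

Inside a notebook an event loop is already running, so you would write `await run_with_deadline(client.arun_experiment(experiment_id), timeout_s=...)` instead of calling `asyncio.run`.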

## Phase 3: Managing Experiment Content

### Step 8: Delete Rows and Columns

Now let's learn how to remove test cases (rows) and model configurations (columns) from experiments. This is useful for cleaning up experiments or removing configurations that aren't working well.


In [9]:
# Delete the first 2 rows
print("🗑️ Step 8: Delete first 2 rows...")

from keywordsai import RemoveExperimentRowsRequest

# Get current experiment
experiment = await client.aget(experiment_id)
print(f"📊 Current rows: {len(experiment.rows)}")

# Show current rows
for i, row in enumerate(experiment.rows):
    question = row.input.get('question', 'Unknown')
    print(f"   {i+1}. {question}")

# Get first 2 row IDs
rows_to_delete = []
for row in experiment.rows[:2]:  # Take first 2 rows
    row_id = getattr(row, 'id', None)
    if row_id:
        rows_to_delete.append(row_id)
        print(f"🗑️ Will delete: {row.input.get('question', 'Unknown')}")

# Delete them
remove_request = RemoveExperimentRowsRequest(rows=rows_to_delete)
await client.aremove_rows(experiment_id, remove_request)

# Check result
updated = await client.aget(experiment_id)
print(f"✅ Done! Deleted {len(rows_to_delete)} rows. Now have {len(updated.rows)} rows")


🗑️ Step 7: Delete first 2 rows...
📊 Current rows: 4
   1. What is machine learning?
   2. Explain neural networks briefly.
   3. What is the difference between AI and ML?
   4. How does deep learning work?
🗑️ Will delete: What is machine learning?
🗑️ Will delete: Explain neural networks briefly.
✅ Done! Deleted 2 rows. Now have 2 rows


In [10]:
# Delete one column (the most recently added one)
print("🗑️ Delete one column...")

from keywordsai import RemoveExperimentColumnsRequest

# Get current experiment
experiment = await client.aget(experiment_id)
print(f"📊 Current columns: {len(experiment.columns)}")

# Show current columns
for i, col in enumerate(experiment.columns):
    print(f"   {i+1}. {col.name} ({col.model})")

last_column = experiment.columns[-1]
column_id = getattr(last_column, 'id', None)

if column_id:
    print(f"\n🗑️ Deleting: {last_column.name}")
    
    # Delete it
    remove_request = RemoveExperimentColumnsRequest(columns=[column_id])
    await client.aremove_columns(experiment_id, remove_request)
    
    # Check result
    updated = await client.aget(experiment_id)
    print(f"✅ Done! Now have {len(updated.columns)} columns")
else:
    print("⚠️ Could not find column ID")


🗑️ Delete one column...
📊 Current columns: 3
   1. Expert Assistant (gpt-4)
   2. Expert Assistant (gpt-4)
   3. Expert Assistant (gpt-4o)

🗑️ Deleting: Expert Assistant
✅ Done! Now have 2 columns
