# UI Text Search Benchmark - Arize

This notebook benchmarks text search in the Arize UI by:
1. Loading span data (manually downloaded from the UI)
2. Duplicating rows to reach a target number of records
3. Padding the input field of each row to 25,000 characters
4. Adding unique searchable keywords to specific rows
5. Spreading timestamps over the past 90 days
6. Uploading to Arize with the `log_spans` method (`arize[Tracing]`)
7. Manually testing search in the UI

## Prerequisites
- Download span data from Arize UI as CSV/DataFrame
- Set up environment variables for Arize API credentials
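
A quick way to confirm the credentials are in place before running anything else is sketched below. The variable names `ARIZE_SPACE_ID` and `ARIZE_API_KEY` are the ones this notebook reads in Step 4; the helper `missing_env_vars` is a hypothetical name introduced only for illustration. If the values live in a `.env` file, run the check after `load_dotenv()`.

```python
import os

# Names of the environment variables this notebook reads in Step 4.
REQUIRED_ENV_VARS = ("ARIZE_SPACE_ID", "ARIZE_API_KEY")


def missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Arize credentials are set.")
```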

## Quick Start
1. Download span data from Arize UI (CSV export)
2. Update `DATA_FILE_PATH` in the Step 1 cell to point to your downloaded file
3. Update `TARGET_ROWS` in the configuration cell to set the desired dataset size (default: 10,000)
4. Run all cells to prepare and upload data
5. Follow the manual testing instructions at the end
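
Before raising `TARGET_ROWS`, it helps to estimate how much raw text the input column alone will hold. The sketch below assumes one byte per character (reasonable for the ASCII filler text this notebook generates) and ignores pandas overhead and every other column, so treat the result as a lower bound. `estimate_text_gb` is a hypothetical helper name.

```python
def estimate_text_gb(target_rows, text_length=25_000, bytes_per_char=1):
    """Rough lower bound on input-text volume in gigabytes (10**9 bytes)."""
    return target_rows * text_length * bytes_per_char / 1e9


for rows in (10_000, 1_000_000, 10_000_000):
    print(f"{rows:>12,} rows -> ~{estimate_text_gb(rows):,.2f} GB of input text")
```

At 25,000 characters per row, 10,000 rows come to about 0.25 GB and 10 million rows to about 250 GB, so the largest sizes will likely not fit in memory as a single DataFrame on a typical workstation.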


In [None]:
!uv pip install pandas python-dotenv "arize[Tracing]" numpy

In [19]:
import time
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import uuid
import random
import string
from contextlib import contextmanager
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Benchmarking functions
@contextmanager
def timer():
    """Context manager that prints the elapsed wall-clock time of its block"""
    start_time = time.perf_counter()
    try:
        yield
    finally:
        # Report the duration even if the timed block raises
        execution_time = time.perf_counter() - start_time
        print(f"Execution time: {execution_time:.2f} seconds")

# Configuration
TARGET_ROWS = 10_000  # Change this to test different sizes (e.g., 1_000_000, 10_000_000, 20_000_000)
TEXT_LENGTH = 25_000  # 25k characters
UNIQUE_KEYWORD_ROWS = 10  # Number of rows with unique keywords
TIMESTAMP_SPREAD_DAYS = 90  # Spread over past 90 days

# Unique keywords for search testing
UNIQUE_KEYWORDS = [
    "BENCHMARK_UNIQUE_ALPHA",
    "BENCHMARK_UNIQUE_BETA", 
    "BENCHMARK_UNIQUE_GAMMA",
    "BENCHMARK_UNIQUE_DELTA",
    "BENCHMARK_UNIQUE_EPSILON",
    "BENCHMARK_UNIQUE_ZETA",
    "BENCHMARK_UNIQUE_ETA",
    "BENCHMARK_UNIQUE_THETA",
    "BENCHMARK_UNIQUE_IOTA",
    "BENCHMARK_UNIQUE_KAPPA"
]

print(f"Target dataset size: {TARGET_ROWS:,} rows")


Target dataset size: 10,000 rows


## Step 1: Load Downloaded Data

First, download span data from Arize UI and save it as a CSV file. Update the path below to point to your downloaded file.
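
A quick sanity check on the export can catch a wrong file early. The sketch below assumes the upload needs at least the columns that the Step 5 upload function reads without a fallback (`context.trace_id`, `context.span_id`, `name`, `start_time`). `missing_columns` is a hypothetical helper, and the short column list stands in for a real export.

```python
# Columns the upload step in this notebook accesses without a fallback.
EXPECTED_COLUMNS = ["context.trace_id", "context.span_id", "name", "start_time"]


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [col for col in expected if col not in present]


# Stand-in for a real export; in the notebook, pass df_original.columns instead.
sample_columns = ["name", "start_time", "latency_ms"]
print(missing_columns(sample_columns))
```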


In [20]:
# Load the downloaded span data
# UPDATE THIS PATH to your downloaded CSV file
DATA_FILE_PATH = "datasets/tracing_export.csv"  # Change this to your actual file path

try:
    df_original = pd.read_csv(DATA_FILE_PATH)
    print(f"Loaded {len(df_original)} rows from {DATA_FILE_PATH}")
    print(f"Columns: {list(df_original.columns)}")
except FileNotFoundError:
    print(f"ERROR: File not found at {DATA_FILE_PATH}")
    print("Please download span data from Arize UI and update the DATA_FILE_PATH variable")
    raise


Loaded 22 rows from datasets/tracing_export.csv
Columns: ['name', 'spanKind', 'statusCode', 'start_time', 'parent_id', 'latency_ms', 'context.trace_id', 'context.span_id', 'attributes.session.id', 'attributes.openinference.span.kind', 'totalTokenCount', 'attributes.input.value', 'attributes.output.value', 'attributes.llm.token_count.total', 'attributes.llm.token_count.prompt', ':arize-computed:tokenization.prompt.encoder', ':arize-computed:tokenization.prompt.method', 'attributes.llm.token_count.completion', ':arize-computed:tokenization.completion.encoder', ':arize-computed:tokenization.completion.method', 'status_code']


In [21]:
# =============================================================================
# DATA PREPARATION FUNCTIONS
# =============================================================================

def generate_large_text(base_text, target_length, unique_keyword=None):
    """Generate text of specified length with optional unique keyword"""
    # Start with unique keyword if provided
    parts = [f"SEARCHABLE_CONTENT: {unique_keyword}\n\n"] if unique_keyword else []
    # Treat missing values (None/NaN) as empty so they do not become the literal "nan"
    parts.append("" if pd.isna(base_text) else str(base_text))
    
    # Fill remaining space with lorem ipsum variations, tracking the running
    # length instead of re-summing every part on each iteration
    lorem_base = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    current_length = sum(len(p) for p in parts)
    while current_length < target_length:
        random_word = ''.join(random.choices(string.ascii_lowercase, k=random.randint(5, 15)))
        filler = f"{lorem_base}Random_{random_word}. "
        parts.append(filler)
        current_length += len(filler)
    
    return ''.join(parts)[:target_length]


def duplicate_rows(df, target_rows):
    """Duplicate dataframe rows to reach target number"""
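    current_rows = len(df)
    if current_rows == 0:
        raise ValueError("Cannot duplicate an empty dataframe")
    if current_rows >= target_rows:
        # Return a copy so later in-place edits do not modify the original export
        return df.copy()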
    
    # Calculate multiplication factor and duplicate
    multiplier = (target_rows // current_rows) + 1
    print(f"   Duplicating {current_rows} rows {multiplier}x to reach {target_rows:,}")
    
    df_list = [df.copy() for _ in range(multiplier)]
    df_final = pd.concat(df_list, ignore_index=True).iloc[:target_rows].copy()
    df_final['unique_id'] = [str(uuid.uuid4()) for _ in range(len(df_final))]
    
    return df_final


def spread_timestamps(df, days_back=90):
    """Spread timestamps over the past N days from today"""
    num_rows = len(df)
    end_time = datetime.now()
    start_time = end_time - timedelta(days=days_back)
    
    print(f"   Spreading {num_rows:,} timestamps from {start_time.strftime('%Y-%m-%d')} to {end_time.strftime('%Y-%m-%d')}")
    
    # Generate evenly spaced timestamps and shuffle them
    time_increment = timedelta(days=days_back) / num_rows
    timestamps = [start_time + (i * time_increment) for i in range(num_rows)]
    random.shuffle(timestamps)
    
    # Update timestamp column (prefer start_time for log_spans compatibility)
    timestamp_col = next((col for col in ['start_time', 'timestamp', 'time'] if col in df.columns), 'start_time')
    df[timestamp_col] = timestamps
    
    return df


def prepare_data_simplified(df_original, target_rows, text_length=TEXT_LENGTH):
    """Simplified data preparation pipeline"""
    print(f"\n=== Preparing {target_rows:,} rows ===")
    
    # Step 1: Duplicate to target size
    print("1. Duplicating rows...")
    with timer():
        df = duplicate_rows(df_original, target_rows)
    
    # Step 2: Spread timestamps over past 90 days
    print("2. Spreading timestamps...")
    with timer():
        df = spread_timestamps(df, TIMESTAMP_SPREAD_DAYS)
    
    # Step 3: Generate large text with keywords
    print(f"3. Generating {text_length:,}-char text...")
    with timer():
        input_col = next((col for col in ['attributes.input.value', 'input', 'prompt'] if col in df.columns), 'attributes.input.value')
        if input_col not in df.columns:
            # No known input column in the export: create an empty one to fill
            df[input_col] = ""
        input_col_pos = df.columns.get_loc(input_col)
        
        # Add one keyword to each of the first len(UNIQUE_KEYWORDS) rows for search testing
        for i in range(len(df)):
            keyword = UNIQUE_KEYWORDS[i] if i < len(UNIQUE_KEYWORDS) else None
            base_text = df.iat[i, input_col_pos]
            df.iat[i, input_col_pos] = generate_large_text(base_text, text_length, keyword)
            
            if keyword:
                print(f"   Row {i}: '{keyword}'")
    
    print(f"✅ Ready: {len(df):,} rows")
    return df


## Step 2: Data Preparation Functions


In [22]:
def prepare_benchmark_data(df_original, target_rows, text_length=TEXT_LENGTH):
    """Prepare data for benchmark with all transformations"""
    
    print(f"\n=== Preparing data for {target_rows:,} rows ===")
    
    # Step 1: Duplicate rows
    print("\n1. Duplicating rows...")
    with timer():
        df = duplicate_rows(df_original, target_rows)
    print(f"   Result: {len(df):,} rows")
    
    # Step 2: Spread timestamps over past 90 days
    print("\n2. Spreading timestamps...")
    with timer():
        df = spread_timestamps(df, TIMESTAMP_SPREAD_DAYS)
    
    # Verify timestamp spreading
    if 'start_time' in df.columns:
        print(f"   Timestamp range: {df['start_time'].min()} to {df['start_time'].max()}")
    elif 'timestamp' in df.columns:
        print(f"   Timestamp range: {df['timestamp'].min()} to {df['timestamp'].max()}")
    
    # Step 3: Generate large text for input fields
    print(f"\n3. Generating {text_length:,} character text for input fields...")
    print("   Adding unique keywords to search test rows...")
    
    # Find the input column (might be named differently)
    input_col = None
    for col in ['input', 'input.value', 'attributes.input.value', 'prompt']:
        if col in df.columns:
            input_col = col
            break
    
    if input_col is None:
        print("   WARNING: No input column found. Creating new 'input' column.")
        df['input'] = ""
        input_col = 'input'
    
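    # Select random rows for unique keywords and create mapping
    # (capped by the keyword list so UNIQUE_KEYWORDS[i] cannot go out of range)
    num_keyword_rows = min(UNIQUE_KEYWORD_ROWS, len(UNIQUE_KEYWORDS), len(df))
    keyword_sample = random.sample(range(len(df)), num_keyword_rows)
    keyword_mapping = {idx: UNIQUE_KEYWORDS[i] for i, idx in enumerate(keyword_sample)}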
    
    print(f"   Keywords will be placed at rows: {sorted(keyword_mapping.keys())[:5]}...")  # Show first 5
    
    # Generate base text pattern once
    lorem_variations = [
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ",
        "Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ",
        "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris. ",
        "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum. ",
        "Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia. ",
    ]
    
    # Create a large text block that we can reuse
    print("   Generating base text block...")
    base_text_block = ""
    while len(base_text_block) < text_length + 1000:  # Extra buffer
        random_word = ''.join(random.choices(string.ascii_lowercase, k=random.randint(5, 15)))
        base_text_block += random.choice(lorem_variations) + f"Random_{random_word}. "
    
    # Use numpy for efficient array operations
    print("   Creating text array...")
    text_array = np.empty(len(df), dtype=object)
    
    # Fill with base text
    text_array[:] = base_text_block[:text_length]
    
    # Add keywords to specific positions
    for idx, keyword in keyword_mapping.items():
        # Prepend the keyword marker and trim the filler so the total stays at text_length
        prefix = f"SEARCHABLE_CONTENT: {keyword}\n\n"
        text_array[idx] = prefix + base_text_block[:text_length - len(prefix)]
        print(f"   Added keyword '{keyword}' to row {idx}")
    
    # Assign all at once
    print("   Assigning text to dataframe...")
    with timer():
        df[input_col] = text_array
    
    print(f"\n4. Data preparation complete!")
    print(f"   Total rows: {len(df):,}")
    print(f"   Unique keywords added to {len(keyword_mapping)} rows")
    print(f"   Keywords used: {', '.join(UNIQUE_KEYWORDS[:len(keyword_mapping)])}")
    
    return df


## Step 3: Prepare Data


In [23]:
# Prepare dataset with target number of rows
df_prepared = prepare_data_simplified(df_original, TARGET_ROWS)



=== Preparing 10,000 rows ===
1. Duplicating rows...
   Duplicating 22 rows 455x to reach 10,000
Execution time: 0.05 seconds
2. Spreading timestamps...
   Spreading 10,000 timestamps from 2025-06-03 to 2025-09-01
Execution time: 0.01 seconds
3. Generating 25,000-char text...
   Row 0: 'BENCHMARK_UNIQUE_ALPHA'
   Row 1: 'BENCHMARK_UNIQUE_BETA'
   Row 2: 'BENCHMARK_UNIQUE_GAMMA'
   Row 3: 'BENCHMARK_UNIQUE_DELTA'
   Row 4: 'BENCHMARK_UNIQUE_EPSILON'
   Row 5: 'BENCHMARK_UNIQUE_ZETA'
   Row 6: 'BENCHMARK_UNIQUE_ETA'
   Row 7: 'BENCHMARK_UNIQUE_THETA'
   Row 8: 'BENCHMARK_UNIQUE_IOTA'
   Row 9: 'BENCHMARK_UNIQUE_KAPPA'
Execution time: 16.42 seconds
✅ Ready: 10,000 rows
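
Before uploading, it can be worth confirming that every keyword landed in exactly one row, since the manual search test depends on that. The sketch below is a plain substring count over any iterable of strings (a pandas Series works too). `keyword_row_counts` is a hypothetical helper, and the two short strings stand in for the prepared input column.

```python
def keyword_row_counts(texts, keywords):
    """Map each keyword to the number of texts that contain it as a substring."""
    texts = list(texts)
    return {kw: sum(kw in text for text in texts) for kw in keywords}


# Stand-in data; in the notebook, pass
# df_prepared["attributes.input.value"] and UNIQUE_KEYWORDS instead.
sample_texts = [
    "SEARCHABLE_CONTENT: BENCHMARK_UNIQUE_ALPHA\n\nLorem ipsum dolor sit amet",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
]
counts = keyword_row_counts(sample_texts, ["BENCHMARK_UNIQUE_ALPHA", "BENCHMARK_UNIQUE_BETA"])
print(counts)
```

On the prepared dataset, a count other than 1 for any keyword would suggest the preparation step did not behave as intended.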


## Step 4: Setup Arize


In [24]:
from arize.pandas.logger import Client

# Configuration for span logging
ARIZE_SPACE_ID = os.getenv("ARIZE_SPACE_ID")
ARIZE_API_KEY = os.getenv("ARIZE_API_KEY")

# Create a unique project name for this benchmark
PROJECT_NAME = f"TextSearchBench-{TARGET_ROWS}-{datetime.now().strftime('%H%M')}"

# Setup Arize client for logging spans
arize_client = Client(
    space_id=ARIZE_SPACE_ID,
    api_key=ARIZE_API_KEY,
)

print(f"Arize project name: {PROJECT_NAME}")
print("✅ Arize client setup complete!")


Arize project name: TextSearchBench-10000-1434
✅ Arize client setup complete!


## Step 5: Upload Spans to Arize
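
For the larger sizes mentioned in the configuration cell, sending the whole DataFrame in one call may be impractical. A minimal sketch of batching by row position follows, assuming the upload function accepts any DataFrame slice. `chunk_bounds` and the batch size are hypothetical and not part of the Arize SDK.

```python
def chunk_bounds(total_rows, chunk_size):
    """Yield (start, stop) row positions that cover range(total_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)


# Example: 10,000 rows in batches of 4,000 (the batch size is an arbitrary illustration).
for start, stop in chunk_bounds(10_000, 4_000):
    print(f"rows {start:,} to {stop:,}")
    # In the notebook this could become:
    # upload_spans_to_arize(df_prepared.iloc[start:stop])
```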


In [25]:
def upload_spans_to_arize(df):
    """Upload dataframe rows as spans to Arize using log_spans"""
    
    total_rows = len(df)
    print(f"\n=== Uploading {total_rows:,} spans to Arize using log_spans ===")
    print(f"Project: {PROJECT_NAME}")
    
    print("\nPreparing spans DataFrame for log_spans...")
    print("Using known column structure from tracing_export.csv format")
    
    with timer():
        # Create clean spans DataFrame with only the required columns
        spans_df = pd.DataFrame()
        
        # Required columns - ensure proper data types
        spans_df['context.trace_id'] = df['context.trace_id'].fillna('').astype(str)
        spans_df['context.span_id'] = df['context.span_id'].fillna('').astype(str)
        spans_df['name'] = df['name'].fillna('LLM_span').astype(str)
        
        # Handle timestamps
        spans_df['start_time'] = pd.to_datetime(df['start_time'], errors='coerce')
        
        # Calculate end times using latency_ms
        if 'latency_ms' in df.columns:
            # Ensure latency_ms is numeric
            latency_ms = pd.to_numeric(df['latency_ms'], errors='coerce').fillna(1000.0)
            spans_df['end_time'] = spans_df['start_time'] + pd.to_timedelta(latency_ms, unit='ms')
        else:
            # Generate random latency if not available
            latency_ms = [random.uniform(100, 2000) for _ in range(len(df))]
            spans_df['end_time'] = spans_df['start_time'] + pd.to_timedelta(latency_ms, unit='ms')
        
        # Handle input/output attributes - ensure they are strings
        if 'attributes.input.value' in df.columns:
            spans_df['attributes.input.value'] = df['attributes.input.value'].fillna('').astype(str)
        
        if 'attributes.output.value' in df.columns:
            spans_df['attributes.output.value'] = df['attributes.output.value'].fillna('').astype(str)
        
        # Handle status code
        if 'status_code' in df.columns:
            spans_df['status_code'] = df['status_code'].fillna('OK').astype(str)
        else:
            spans_df['status_code'] = 'OK'
        
        # Handle span kind
        if 'attributes.openinference.span.kind' in df.columns:
            spans_df['attributes.openinference.span.kind'] = df['attributes.openinference.span.kind'].fillna('LLM').astype(str)
        else:
            spans_df['attributes.openinference.span.kind'] = 'LLM'
        
        # Handle parent_id if present
        if 'parent_id' in df.columns:
            # Only include non-null parent_ids
            parent_ids = df['parent_id'].fillna('')
            spans_df['parent_id'] = parent_ids.astype(str)
            # Replace empty strings with None for proper parent relationship
            spans_df.loc[spans_df['parent_id'] == '', 'parent_id'] = None
        
        # Handle token counts if present (ensure they are numeric)
        if 'totalTokenCount' in df.columns:
            spans_df['attributes.llm.token_count.total'] = pd.to_numeric(df['totalTokenCount'], errors='coerce').fillna(0).astype(int)
        
        if 'attributes.llm.token_count.prompt' in df.columns:
            spans_df['attributes.llm.token_count.prompt'] = pd.to_numeric(df['attributes.llm.token_count.prompt'], errors='coerce').fillna(0).astype(int)
        
        if 'attributes.llm.token_count.completion' in df.columns:
            spans_df['attributes.llm.token_count.completion'] = pd.to_numeric(df['attributes.llm.token_count.completion'], errors='coerce').fillna(0).astype(int)
        
        # Add unique_id for tracking (from our benchmark preparation)
        if 'unique_id' in df.columns:
            spans_df['unique_id'] = df['unique_id'].astype(str)
        else:
            spans_df['unique_id'] = [str(uuid.uuid4()) for _ in range(len(spans_df))]
        
        print(f"   Prepared spans DataFrame with {len(spans_df)} rows and {len(spans_df.columns)} columns")
        print(f"   Key columns: {[col for col in ['context.trace_id', 'context.span_id', 'name', 'start_time', 'end_time', 'attributes.input.value'] if col in spans_df.columns]}")
        
        # Upload to Arize using log_spans
        print("\n🚀 Uploading spans to Arize...")
        response = arize_client.log_spans(
            dataframe=spans_df,
            model_id=PROJECT_NAME,
            model_version="1.0",
            validate=True,
            verbose=True
        )
        
        # Check response
        if response.status_code == 200:
            print(f"✅ Successfully uploaded {total_rows:,} spans to Arize!")
            print(f"   Project: {PROJECT_NAME}")
            print(f"   Response: {response.status_code}")
        else:
            print(f"❌ Upload failed with status code: {response.status_code}")
            print(f"   Response text: {response.text}")
            
    return response


In [26]:
# Upload the prepared dataset to Arize
upload_spans_to_arize(df_prepared)



=== Uploading 10,000 spans to Arize using log_spans ===
Project: TextSearchBench-10000-1434

Preparing spans DataFrame for log_spans...
Using known column structure from tracing_export.csv format
   Prepared spans DataFrame with 10000 rows and 14 columns
   Key columns: ['context.trace_id', 'context.span_id', 'name', 'start_time', 'end_time', 'attributes.input.value']

🚀 Uploading spans to Arize...
  arize.utils.logging | INFO | Performing direct input type validation.
  arize.utils.logging | INFO | Performing dataframe form validation.
  arize.utils.logging | INFO | The following columns are not part of the Open Inference Specification and will be ignored: unique_id
  arize.utils.logging | INFO | Performing values validation.
  arize.utils.logging | INFO | Sending file to Arize
  arize.utils.logging | INFO | Success! Check out your data at https://app.arize.com/organizations/QWNjb3VudE9yZ2FuaXphdGlvbjoyMzczNjpLYVg4/s

<Response [200]>