A comprehensive, enterprise-grade platform for conducting systematic research on Large Language Models (LLMs). Built with Rust for performance, reliability, and safety.
LLM Research Lab provides a complete infrastructure for:
- Experiment Management - Design, execute, and track LLM experiments with full reproducibility
- Model Registry - Manage multiple LLM providers and model configurations
- Dataset Management - Version-controlled datasets with schema validation
- Prompt Engineering - Template management with variable substitution and versioning
- Evaluation Framework - Comprehensive metrics including LLM-as-judge evaluations
- Analytics & Observability - Real-time metrics, alerting, and performance monitoring
```text
┌─────────────────────────────────────────────────────────────────────┐
│ LLM Research Lab │
├─────────────────────────────────────────────────────────────────────┤
│ CLI (llm-research) │ SDK (llm-research-sdk) │
│ ├── auth │ ├── LlmResearchClient │
│ ├── experiments │ ├── ExperimentsClient │
│ ├── models │ ├── ModelsClient │
│ ├── datasets │ ├── DatasetsClient │
│ ├── prompts │ ├── PromptsClient │
│ └── evaluations │ └── EvaluationsClient │
├─────────────────────────────────────────────────────────────────────┤
│ API Layer (Axum) │
│ ├── REST Endpoints │ ├── Authentication (JWT/API Key)│
│ ├── Rate Limiting │ ├── RBAC Authorization │
│ └── Request Validation │ └── Audit Logging │
├─────────────────────────────────────────────────────────────────────┤
│ Core Services │
│ ├── Experiment Engine │ ├── Workflow Orchestration │
│ ├── Metrics Calculator │ ├── Evaluation Runner │
│ └── Model Adapter │ └── Dataset Processor │
├─────────────────────────────────────────────────────────────────────┤
│ Storage Layer │
│ ├── PostgreSQL (metadata) │ ├── ClickHouse (time-series) │
│ └── S3 (artifacts/datasets) │ └── Redis (caching) │
└─────────────────────────────────────────────────────────────────────┘
```
- Reproducible Experiments - Full configuration tracking with random seeds
- Collaborative Research - Multi-user support with role-based access
- Experiment Runs - Execute multiple runs with configuration overrides
- Metrics Aggregation - Automatic statistical analysis across runs
- Multi-Provider Support - OpenAI, Anthropic, Google, Cohere, local models
- Version Tracking - Track model versions and configurations
- Custom Configurations - Temperature, top-p, max tokens, and more
- Multiple Formats - JSON, JSONL, CSV, Parquet, plain text
- Schema Validation - Enforce data structure consistency
- Version Control - Full versioning with changelogs
- Secure Storage - Pre-signed URLs for uploads/downloads
- Template System - Mustache-style `{{variable}}` substitution (see the sketch after this list)
- Variable Types - String, number, boolean, array, and object support
- Version History - Track template evolution over time
- Validation - Syntax checking and variable detection
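To illustrate the substitution model, here is a minimal sketch of the idea (not the platform's actual renderer):

```rust
use std::collections::HashMap;

// Minimal mustache-style substitution: each {{name}} in the template
// is replaced by the corresponding value from the variable map.
fn render(template: &str, vars: &HashMap<&str, String>) -> String {
    let mut out = template.to_string();
    for (name, value) in vars {
        out = out.replace(&format!("{{{{{name}}}}}"), value);
    }
    out
}

fn main() {
    let vars = HashMap::from([
        ("word_count", "100".to_string()),
        ("article", "Rust is a systems language...".to_string()),
    ]);
    let template = "Summarize the following article in {{word_count}} words:\n\n{{article}}";
    println!("{}", render(template, &vars));
}
```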
- Built-in Metrics
  - BLEU, ROUGE (1/2/L), METEOR
  - Perplexity, F1 Score, Exact Match
  - Latency, Throughput, Token Usage
- LLM-as-Judge - Use LLMs to evaluate response quality
- Custom Metrics - Extensible metric calculator interface
- Comparative Analysis - Compare evaluations across experiments
- Authentication - JWT tokens, API keys, OAuth support
- Authorization - Fine-grained RBAC with resource-level permissions
- Audit Logging - Complete audit trail for compliance
- Rate Limiting - Configurable rate limits per user/endpoint
- Circuit Breakers - Fault tolerance for external services
- Health Checks - Kubernetes-ready liveness/readiness probes
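The probes can be exercised directly; a sketch with assumed endpoint paths (`/health/live` and `/health/ready` are illustrative — confirm the actual routes in the API documentation):

```bash
# Hypothetical probe paths, shown for illustration only
curl -f http://localhost:8080/health/live    # liveness: process is up
curl -f http://localhost:8080/health/ready   # readiness: dependencies reachable
```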
- Rust 1.75 or later
- PostgreSQL 14+
- ClickHouse 23+
- Redis 7+
- S3-compatible storage (AWS S3, MinIO)
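For local development, the backing services can be started with Docker; a minimal sketch (images, tags, ports, and credentials are illustrative — align them with your `.env`):

```bash
docker run -d --name pg -p 5432:5432 \
  -e POSTGRES_USER=user -e POSTGRES_PASSWORD=password \
  -e POSTGRES_DB=llm_research postgres:14
docker run -d --name ch -p 8123:8123 clickhouse/clickhouse-server:23.8
docker run -d --name redis -p 6379:6379 redis:7
docker run -d --name minio -p 9000:9000 \
  -e MINIO_ROOT_USER=your-access-key -e MINIO_ROOT_PASSWORD=your-secret-key \
  minio/minio server /data
```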
```bash
# Clone the repository
git clone https://github.com/your-org/llm-research-lab.git
cd llm-research-lab
# Build all crates
cargo build --release
# Run tests
cargo test --workspace
```

Create a `.env` file or set environment variables:

```bash
# Database
DATABASE_URL=postgres://user:password@localhost:5432/llm_research
# ClickHouse
CLICKHOUSE_URL=http://localhost:8123
# S3 Storage
AWS_ACCESS_KEY_ID=your-access-key
AWS_SECRET_ACCESS_KEY=your-secret-key
S3_BUCKET=llm-research-data
S3_REGION=us-east-1
# Redis
REDIS_URL=redis://localhost:6379
# JWT Secret
JWT_SECRET=your-super-secret-key
# Server
API_HOST=0.0.0.0
API_PORT=8080
```

```bash
# Start the API server
./target/release/llm-research-lab
# Or with environment file
source .env && ./target/release/llm-research-lab
```

The `llm-research` CLI provides a comprehensive interface for interacting with the platform.

```text
llm-research [OPTIONS] <COMMAND>
Options:
-o, --output <FORMAT> Output format: table, json, yaml, compact [default: table]
--api-url <URL> API base URL [env: LLM_RESEARCH_API_URL]
--api-key <KEY> API key [env: LLM_RESEARCH_API_KEY]
-p, --profile <NAME> Configuration profile [env: LLM_RESEARCH_PROFILE]
-v, --verbose Enable verbose output
--no-color Disable colored output
-h, --help Print help
-V, --version      Print version
```

```bash
# Interactive login
llm-research auth login
# Login with API key
llm-research auth login --api-key YOUR_API_KEY
# Check authentication status
llm-research auth status
# Display current token
llm-research auth token
# Logout and clear credentials
llm-research auth logout
```

```bash
# Show current configuration
llm-research config show
# Set a configuration value
llm-research config set settings.timeout_secs 60
# Get a configuration value
llm-research config get settings.output_format
# List all profiles
llm-research config profiles
# Use a specific profile
llm-research config use-profile production
# Create a new profile
llm-research config create-profile staging --api-url https://staging-api.example.com
# Delete a profile
llm-research config delete-profile old-profile --force
# Show configuration file paths
llm-research config path
# Reset to defaults
llm-research config reset --force
```

```bash
# List experiments
llm-research experiments list
llm-research experiments list --status running --limit 50
# Get experiment details
llm-research experiments get <EXPERIMENT_ID>
# Create a new experiment
llm-research experiments create \
--name "GPT-4 vs Claude Comparison" \
--description "Comparing response quality" \
--hypothesis "GPT-4 will perform better on coding tasks" \
--tags "comparison,coding"
# Update an experiment
llm-research experiments update <ID> --name "Updated Name" --tags "new,tags"
# Delete an experiment
llm-research experiments delete <ID> --force
# Start an experiment
llm-research experiments start <ID>
# List experiment runs
llm-research experiments runs <ID>
# Create a new run
llm-research experiments run <ID> --overrides '{"temperature": 0.7}'
# Get experiment metrics
llm-research experiments metrics <ID>
```

```bash
# List models
llm-research models list
llm-research models list --provider openai
# Get model details
llm-research models get <MODEL_ID>
# Register a new model
llm-research models create \
--name "GPT-4 Turbo" \
--provider openai \
--identifier gpt-4-turbo-preview \
--version "0125" \
--config '{"temperature": 0.7, "max_tokens": 4096}'
# Update a model
llm-research models update <ID> --version "0409"
# Delete a model
llm-research models delete <ID> --force
# List available providers
llm-research models providers
```

```bash
# List datasets
llm-research datasets list
llm-research datasets list --format jsonl --tags "training"
# Get dataset details
llm-research datasets get <DATASET_ID>
# Create a new dataset
llm-research datasets create \
--name "Code Generation Benchmark" \
--format jsonl \
--description "10K coding problems" \
--schema '{"type": "object", "properties": {"prompt": {"type": "string"}}}' \
--tags "coding,benchmark"
# Update a dataset
llm-research datasets update <ID> --description "Updated description"
# Delete a dataset
llm-research datasets delete <ID> --force
# List dataset versions
llm-research datasets versions <ID>
# Create a new version
llm-research datasets create-version <ID> \
--version "2.0.0" \
--description "Added 5K more examples" \
--changelog "Expanded dataset with code review tasks"
# Get upload URL
llm-research datasets upload <ID> --filename data.jsonl --content-type application/jsonl
# Get download URL
llm-research datasets download <ID>
```
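The upload command returns a pre-signed URL; one way to complete the upload is with `curl`. A sketch — the JSON field name `.upload_url` is assumed from the SDK's `upload_info.upload_url`, so check the actual output shape:

```bash
# Request an upload URL, then PUT the file to it (illustrative)
UPLOAD_URL=$(llm-research datasets upload <ID> \
  --filename data.jsonl --content-type application/jsonl \
  -o json | jq -r '.upload_url')
curl -X PUT "$UPLOAD_URL" \
  -H "Content-Type: application/jsonl" \
  --data-binary @data.jsonl
```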
```bash
# List prompts
llm-research prompts list
llm-research prompts list --tags "summarization" --search "article"
# Get prompt details
llm-research prompts get <PROMPT_ID>
# Create a prompt template
llm-research prompts create \
--name "Article Summarizer" \
--template "Summarize the following article in {{word_count}} words:\n\n{{article}}" \
--description "Summarizes articles to specified length" \
--system "You are a professional summarizer. Be concise and accurate." \
--tags "summarization,articles"
# Update a prompt
llm-research prompts update <ID> --name "New Name" --tags "updated,tags"
# Delete a prompt
llm-research prompts delete <ID> --force
# List prompt versions
llm-research prompts versions <ID>
# Create a new version
llm-research prompts create-version <ID> \
--template "New template content with {{variables}}" \
--changelog "Improved prompt structure"
# Render a prompt with variables
llm-research prompts render <ID> \
--vars '{"word_count": 100, "article": "Your article text here..."}'
# Validate a template
llm-research prompts validate --template "Hello {{name}}, your order {{order_id}} is ready."
```

```bash
# List evaluations
llm-research evaluations list
llm-research evaluations list --experiment <EXPERIMENT_ID> --status completed
# Get evaluation details
llm-research evaluations get <EVALUATION_ID>
# Create an evaluation
llm-research evaluations create \
--name "Quality Assessment" \
--experiment-id <EXPERIMENT_ID> \
--dataset-id <DATASET_ID> \
--metrics "bleu,rouge_l,exact_match" \
--config '{"sample_size": 1000}'
# Update an evaluation
llm-research evaluations update <ID> --name "Updated Evaluation"
# Delete an evaluation
llm-research evaluations delete <ID> --force
# Run an evaluation
llm-research evaluations run <ID>
# Get evaluation results
llm-research evaluations results <ID>
# Compare evaluations
llm-research evaluations compare <ID1> <ID2> --metrics "bleu,rouge_l"
```

```bash
# Table format (default) - human-readable tables
llm-research experiments list
# JSON format - for programmatic processing
llm-research experiments list -o json
# YAML format - human-readable structured data
llm-research experiments list -o yaml
# Compact format - one line per item for scripting
llm-research experiments list -o compact
```
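When scripting, the JSON output pipes cleanly into standard tools; a small sketch (assuming `jq` is installed and that list responses carry a `data` array, as in the SDK examples below):

```bash
# Collect the IDs of all running experiments (illustrative)
llm-research experiments list --status running -o json \
  | jq -r '.data[].id'
```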
The Rust SDK provides a type-safe interface for integrating with the platform. Add to your `Cargo.toml`:
```toml
[dependencies]
llm-research-sdk = "0.1"
tokio = { version = "1", features = ["full"] }
```

```rust
use llm_research_sdk::{LlmResearchClient, CreateExperimentRequest};
use uuid::Uuid;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create client with API key
let client = LlmResearchClient::builder("https://api.example.com")
.with_api_key("your-api-key")
.build()?;
// List experiments
let experiments = client.experiments().list(None).await?;
println!("Found {} experiments", experiments.data.len());
// Create a new experiment
let request = CreateExperimentRequest::new(
"My Experiment",
Uuid::new_v4(), // owner_id
)
.with_description("Testing GPT-4 performance")
.with_tags(vec!["gpt4".to_string(), "benchmark".to_string()]);
let experiment = client.experiments().create(request).await?;
println!("Created experiment: {}", experiment.id);
Ok(())
}
```

```rust
use llm_research_sdk::{LlmResearchClient, AuthConfig};
use std::time::Duration;
// Basic configuration
let client = LlmResearchClient::builder("https://api.example.com")
.with_api_key("your-api-key")
.build()?;
// Full configuration
let client = LlmResearchClient::builder("https://api.example.com")
.with_auth(AuthConfig::BearerToken("jwt-token".to_string()))
.with_timeout(Duration::from_secs(60))
.with_connect_timeout(Duration::from_secs(10))
.with_max_retries(3)
.with_logging(true)
.with_header("X-Custom-Header", "value")
.build()?;
// Using environment variables
let client = LlmResearchClient::from_env()?;
```

```rust
use llm_research_sdk::{
CreateExperimentRequest, UpdateExperimentRequest,
ListExperimentsParams, ExperimentConfig, StartExperimentRequest
};
// List experiments with filters
let params = ListExperimentsParams::new()
.with_limit(20)
.with_offset(0)
.with_status("running")
.with_tags(vec!["production".to_string()]);
let experiments = client.experiments().list(Some(params)).await?;
// Get experiment by ID
let experiment = client.experiments().get(experiment_id).await?;
// Create experiment
let config = ExperimentConfig::new()
.with_model(model_id)
.with_dataset(dataset_id)
.with_prompt_template(prompt_id)
.with_parameter("temperature", serde_json::json!(0.7))
.with_metric("bleu");
let request = CreateExperimentRequest::new("My Experiment", owner_id)
.with_description("Description")
.with_hypothesis("Hypothesis")
.with_config(config);
let experiment = client.experiments().create(request).await?;
// Update experiment
let update = UpdateExperimentRequest::new()
.with_name("Updated Name")
.with_tags(vec!["updated".to_string()]);
let experiment = client.experiments().update(experiment_id, update).await?;
// Start experiment
let request = StartExperimentRequest::new();
let run = client.experiments().start(experiment_id, request).await?;
// Get runs
let runs = client.experiments().list_runs(experiment_id, None).await?;
// Get metrics
let metrics = client.experiments().get_metrics(experiment_id).await?;
println!("BLEU mean: {}", metrics.aggregated_metrics["bleu"].mean);
// Delete experiment
client.experiments().delete(experiment_id).await?;
```

```rust
use llm_research_sdk::{CreateModelRequest, UpdateModelRequest, ListModelsParams};
// List models
let params = ListModelsParams::new()
.with_provider("openai")
.with_limit(50);
let models = client.models().list(Some(params)).await?;
// Get model
let model = client.models().get(model_id).await?;
// Create model
let request = CreateModelRequest::new("GPT-4", "openai", "gpt-4-turbo")
.with_version("0125")
.with_config(serde_json::json!({
"temperature": 0.7,
"max_tokens": 4096
}));
let model = client.models().create(request).await?;
// Update model
let update = UpdateModelRequest::new()
.with_version("0409");
let model = client.models().update(model_id, update).await?;
// List providers
let providers = client.models().list_providers().await?;
// Delete model
client.models().delete(model_id).await?;
```

```rust
use llm_research_sdk::{
CreateDatasetRequest, DatasetFormat, CreateDatasetVersionRequest, UploadRequest
};
// List datasets
let datasets = client.datasets().list(None).await?;
// Create dataset
let request = CreateDatasetRequest::new("My Dataset", DatasetFormat::Jsonl)
.with_description("Training data")
.with_schema(serde_json::json!({
"type": "object",
"properties": {
"prompt": { "type": "string" },
"response": { "type": "string" }
}
}))
.with_tags(vec!["training".to_string()]);
let dataset = client.datasets().create(request).await?;
// Create version
let version_request = CreateDatasetVersionRequest::new("1.0.0")
.with_description("Initial release")
.with_changelog("First version with 10K examples");
let version = client.datasets().create_version(dataset_id, version_request).await?;
// Get upload URL
let upload = UploadRequest::new("data.jsonl", "application/jsonl");
let upload_info = client.datasets().get_upload_url(dataset_id, upload).await?;
println!("Upload to: {}", upload_info.upload_url);
// Get download URL
let download_info = client.datasets().get_download_url(dataset_id).await?;
println!("Download from: {}", download_info.download_url);use llm_research_sdk::{
CreatePromptRequest, CreatePromptVersionRequest,
RenderPromptRequest, ValidatePromptRequest
};
// List prompts
let prompts = client.prompts().list(None).await?;
// Create prompt
let request = CreatePromptRequest::new(
"Summarizer",
"Summarize in {{word_count}} words:\n\n{{content}}"
)
.with_description("Article summarization template")
.with_system_prompt("You are a professional summarizer.")
.with_tags(vec!["summarization".to_string()]);
let prompt = client.prompts().create(request).await?;
// Create new version
let version_request = CreatePromptVersionRequest::new(
"Summarize the following in {{word_count}} words, focusing on key points:\n\n{{content}}"
)
.with_system_prompt("You are an expert summarizer. Be concise and accurate.")
.with_changelog("Improved prompt clarity");
let version = client.prompts().create_version(prompt_id, version_request).await?;
// Render prompt
let mut variables = std::collections::HashMap::new();
variables.insert("word_count".to_string(), serde_json::json!(100));
variables.insert("content".to_string(), serde_json::json!("Your article here..."));
let render_request = RenderPromptRequest::new(variables);
let rendered = client.prompts().render(prompt_id, render_request).await?;
println!("Rendered: {}", rendered.rendered_template);
// Validate template
let validate_request = ValidatePromptRequest::new("Hello {{name}}!");
let validation = client.prompts().validate(validate_request).await?;
println!("Valid: {}, Variables: {:?}", validation.valid, validation.detected_variables);use llm_research_sdk::{
    CreateEvaluationRequest, MetricConfig, JudgeConfig, JudgeScale, CompareEvaluationsRequest
};
// List evaluations
let evaluations = client.evaluations().list(None).await?;
// Create evaluation with metrics
let metrics = vec![
MetricConfig::new("bleu"),
MetricConfig::new("rouge_l"),
MetricConfig::with_config("exact_match", serde_json::json!({"case_sensitive": false})),
];
let request = CreateEvaluationRequest::new("Quality Eval", experiment_id, dataset_id)
.with_metrics(metrics)
.with_sample_size(1000);
let evaluation = client.evaluations().create(request).await?;
// Create evaluation with LLM judge
let judge = JudgeConfig::new(judge_model_id)
.with_criteria(vec!["relevance".to_string(), "accuracy".to_string()])
.with_scale(JudgeScale::FivePoint);
let request = CreateEvaluationRequest::new("LLM Judge Eval", experiment_id, dataset_id)
.with_judge(judge);
let evaluation = client.evaluations().create(request).await?;
// Run evaluation
let run = client.evaluations().run(evaluation_id).await?;
// Get results
let results = client.evaluations().get_results(evaluation_id).await?;
println!("BLEU: {}", results.metrics["bleu"]);
// Compare evaluations
let compare_request = CompareEvaluationsRequest::new(vec![eval1_id, eval2_id])
.with_metrics(vec!["bleu".to_string(), "rouge_l".to_string()]);
let comparison = client.evaluations().compare(compare_request).await?;
```

```rust
use llm_research_sdk::{LlmResearchClient, SdkError, SdkResult};
use uuid::Uuid;

async fn handle_errors(client: &LlmResearchClient, experiment_id: Uuid) -> SdkResult<()> {
match client.experiments().get(experiment_id).await {
Ok(experiment) => {
println!("Found: {}", experiment.name);
}
        Err(SdkError::NotFound { resource_id, .. }) => {
println!("Experiment {} not found", resource_id);
}
Err(SdkError::AuthenticationError(msg)) => {
println!("Auth failed: {}", msg);
}
Err(SdkError::RateLimited { retry_after, .. }) => {
println!("Rate limited, retry after {} seconds", retry_after);
}
Err(SdkError::ValidationError(errors)) => {
for error in errors.errors {
println!("Field {}: {}", error.field, error.message);
}
}
Err(e) => {
println!("Error: {}", e);
}
}
Ok(())
}
```
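Building on the variants above, a retry helper for rate limits might look like this. A sketch only: it assumes `retry_after` is a number of seconds and that `list(None)` returns a page with a `data` vector, as in the quick-start example.

```rust
use std::time::Duration;
use llm_research_sdk::{LlmResearchClient, SdkError, SdkResult};

// Retry a call whenever the server reports rate limiting, sleeping for
// the server-suggested interval before trying again.
async fn list_with_retry(client: &LlmResearchClient) -> SdkResult<usize> {
    loop {
        match client.experiments().list(None).await {
            Ok(page) => return Ok(page.data.len()),
            Err(SdkError::RateLimited { retry_after, .. }) => {
                tokio::time::sleep(Duration::from_secs(retry_after)).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```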
| Crate | Description |
|---|---|
| `llm-research-lab` | Main application binary |
| `llm-research-core` | Core domain types and traits |
| `llm-research-api` | REST API with Axum |
| `llm-research-storage` | Database and storage implementations |
| `llm-research-metrics` | Metric calculators (BLEU, ROUGE, etc.) |
| `llm-research-workflow` | Workflow orchestration and pipelines |
| `llm-research-sdk` | Rust SDK for API clients |
| `llm-research-cli` | Command-line interface |
| Metric | Description | Range |
|---|---|---|
| BLEU | Bilingual Evaluation Understudy | 0-1 |
| ROUGE-1 | Unigram overlap | 0-1 |
| ROUGE-2 | Bigram overlap | 0-1 |
| ROUGE-L | Longest common subsequence | 0-1 |
| METEOR | Metric for Evaluation of Translation | 0-1 |
| Metric | Description | Range |
|---|---|---|
| Exact Match | Perfect string match | 0-1 |
| F1 Score | Harmonic mean of precision/recall | 0-1 |
| Metric | Description | Range |
|---|---|---|
| Perplexity | Model uncertainty | 1-∞ (lower is better) |
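As a concrete reference for the perplexity row: the value is the exponential of the mean negative log-likelihood of the tokens. A small illustrative computation (not the platform's internal code):

```rust
// perplexity = exp(-(1/N) * sum(ln p_i)); lower is better.
fn perplexity(token_log_probs: &[f64]) -> f64 {
    let mean_nll = -token_log_probs.iter().sum::<f64>() / token_log_probs.len() as f64;
    mean_nll.exp()
}

fn main() {
    // Tokens with probabilities 0.5, 0.25, 0.125 have a geometric
    // mean inverse probability of (2 * 4 * 8)^(1/3) = 4.0.
    let log_probs = [0.5f64.ln(), 0.25f64.ln(), 0.125f64.ln()];
    println!("perplexity = {:.1}", perplexity(&log_probs)); // 4.0
}
```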
| Metric | Description | Unit |
|---|---|---|
| Latency | Response time | milliseconds |
| Throughput | Requests per second | req/s |
| Token Usage | Tokens consumed | count |
```bash
# Debug build
cargo build
# Release build
cargo build --release
# Build specific crate
cargo build -p llm-research-sdk
```

```bash
# Run all tests
cargo test --workspace
# Run tests for specific crate
cargo test -p llm-research-sdk
# Run with output
cargo test -- --nocapture
```

```bash
# Format code
cargo fmt
# Run linter
cargo clippy --workspace
# Check documentation
cargo doc --workspace --no-deps
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the LLM Dev Ops Permanent Source-Available License. See LICENSE for details.
- Documentation: docs.example.com
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with Rust for performance, reliability, and safety.