# CERT Framework - Quick Start in Google Colab

This notebook shows you how to use CERT Framework to test LLM reliability.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Javihaus/cert-framework/blob/master/examples/colab/quickstart.ipynb)

## What is CERT?

CERT (Consistency Evaluation and Reliability Testing) helps you:
- 🎯 **Test Consistency**: Measure how reliably your LLM produces the same output
- ✅ **Test Accuracy**: Verify outputs match expected ground truth
- 🔍 **Diagnose Issues**: Get automatic diagnosis and actionable suggestions
- 📊 **Track Metrics**: Monitor performance over time

## Installation

In [None]:
# Install CERT Framework
!pip install cert-framework

# Optional: Install with extras (quoted so the shell does not treat [] as a glob pattern)
# !pip install "cert-framework[langchain]"  # For LangChain support
# !pip install "cert-framework[inspector]"  # For Web UI
# !pip install "cert-framework[all]"        # Everything

## Example 1: Test Consistency of a Simple Function

Let's test a function that simulates an LLM with some variance: each call returns one of three phrasings of the same answer, chosen at random.

In [None]:
from cert import TestRunner, TestConfig, GroundTruth
import random

# Create a test runner
runner = TestRunner()

# Define a function to test (simulating an LLM with some variance)
async def my_llm_function():
    """Simulated LLM that returns slightly different answers"""
    responses = [
        "The capital of France is Paris.",
        "Paris is the capital of France.",
        "France's capital is Paris.",
    ]
    return random.choice(responses)

# Add ground truth
runner.add_ground_truth(GroundTruth(
    id="simple-test",
    question="What is the capital of France?",
    expected="Paris",
    metadata={"correctPages": [1]}
))

# Test retrieval (required for layer enforcement)
retrieval_result = await runner.test_retrieval(
    "simple-test",
    lambda q: [{"pageNum": 1, "content": "Paris is the capital"}],
    {"precisionMin": 0.8}
)
print(f"✅ Retrieval: {retrieval_result.status}")

# Test accuracy
accuracy_result = await runner.test_accuracy(
    "simple-test",
    my_llm_function,
    {"threshold": 0.8}
)
print(f"✅ Accuracy: {accuracy_result.status} ({accuracy_result.accuracy:.2%})")

# Configure consistency test
config = TestConfig(
    n_trials=5,
    consistency_threshold=0.8,
    accuracy_threshold=0.8,
    semantic_comparison=True
)

# Run consistency test
result = await runner.test_consistency(
    "simple-test",
    my_llm_function,
    config
)

print(f"\n📊 Results:")
print(f"Status: {result.status}")
print(f"Consistency: {result.consistency:.2%}")
print(f"Unique outputs: {result.evidence.unique_count if result.evidence else 'N/A'}")

if result.status == 'fail':
    print(f"\n❌ Diagnosis: {result.diagnosis}")
    print(f"\n💡 Suggestions:")
    for suggestion in result.suggestions:
        print(f"  - {suggestion}")
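
The retrieval step above passes a `precisionMin` threshold and compares retrieved `pageNum` values against the `correctPages` metadata. As a rough illustration of what such a check measures, here is a minimal sketch of the standard precision definition (relevant retrieved items divided by all retrieved items). The helper `retrieval_precision` is hypothetical and is not CERT's implementation; how CERT computes its score internally is an assumption here.

```python
def retrieval_precision(retrieved_docs, correct_pages):
    """Fraction of retrieved documents whose pageNum is in correct_pages.

    Hypothetical helper for illustration only; not part of CERT.
    """
    if not retrieved_docs:
        return 0.0
    correct = set(correct_pages)
    hits = sum(1 for doc in retrieved_docs if doc.get("pageNum") in correct)
    return hits / len(retrieved_docs)


# Same shapes as in Example 1: one retrieved page, one correct page.
docs = [{"pageNum": 1, "content": "Paris is the capital"}]
precision = retrieval_precision(docs, correct_pages=[1])
passed = precision >= 0.8  # mirrors the {"precisionMin": 0.8} threshold

print(f"Precision: {precision:.2%}, passed: {passed}")
```

Under this definition, a retriever that returned pages 1 and 2 when only page 1 is correct would score 50% and fall below the 0.8 threshold.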

## Example 2: Test a More Realistic LLM

Now let's test a function with a real variance problem: it cycles through five differently worded answers, so consecutive calls never return the same string.

In [None]:
# Simulate an LLM with high variance; `counter` records how many calls were made
counter = 0

async def inconsistent_llm():
    """LLM that returns different answers each time"""
    global counter
    counter += 1
    
    # Different responses each time
    responses = [
        "42",
        "The answer is 42",
        "forty-two",
        "4 + 38 = 42",
        "6 * 7"
    ]
    return responses[counter % len(responses)]

# Add ground truth
runner.add_ground_truth(GroundTruth(
    id="variance-test",
    question="What is 6 * 7?",
    expected="42",
    metadata={"correctPages": [1]}
))

# Test retrieval
await runner.test_retrieval(
    "variance-test",
    lambda q: [{"pageNum": 1}],
    {"precisionMin": 0.8}
)

# Test accuracy
await runner.test_accuracy(
    "variance-test",
    inconsistent_llm,
    {"threshold": 0.8}
)

# Test consistency (should fail)
result = await runner.test_consistency(
    "variance-test",
    inconsistent_llm,
    TestConfig(
        n_trials=10,
        consistency_threshold=0.9,
        accuracy_threshold=0.8,
        semantic_comparison=True
    )
)

print(f"\n📊 Consistency Test:")
print(f"Status: {result.status}")
print(f"Consistency: {result.consistency:.2%}")

if result.evidence:
    print(f"\n🔍 Evidence:")
    print(f"Unique outputs: {result.evidence.unique_count}/{len(result.evidence.outputs)}")
    print(f"Examples:")
    for i, example in enumerate(result.evidence.examples[:3], 1):
        print(f"  {i}. {example}")

if result.diagnosis:
    print(f"\n❌ Diagnosis: {result.diagnosis}")

if result.suggestions:
    print(f"\n💡 Suggestions:")
    for suggestion in result.suggestions:
        print(f"  - {suggestion}")
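
To build intuition for the consistency number, here is a minimal sketch that scores the same five responses by exact string match: the share of trials that equal the most common output. This is an illustration under stated assumptions, not CERT's algorithm. `exact_match_consistency` is a hypothetical helper, and with `semantic_comparison=True` CERT may group equivalent phrasings, so the value it reports can differ.

```python
from collections import Counter


def exact_match_consistency(outputs):
    """Share of outputs equal to the most common output (1.0 = fully consistent).

    Hypothetical helper for illustration only; not part of CERT.
    """
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Reproduce 10 trials of the cycling behaviour from Example 2.
responses = ["42", "The answer is 42", "forty-two", "4 + 38 = 42", "6 * 7"]
outputs = [responses[i % len(responses)] for i in range(1, 11)]

score = exact_match_consistency(outputs)
unique_count = len(set(outputs))
print(f"Consistency: {score:.2%}, unique outputs: {unique_count}/{len(outputs)}")
```

Each of the five strings appears twice in ten trials, so the exact-match score is 20%, far below the 0.9 threshold used in Example 2.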

## Example 3: Using with LangChain (Optional)

If you have a LangChain chain, you can wrap it with CERT.

In [None]:
# Install dependencies
!pip install "cert-framework[langchain]" langchain-openai

from langchain_openai import OpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from cert.langchain_integration import wrap_chain
import os

# Set your OpenAI API key (required: uncomment and fill in, or set it in the environment)
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Create a simple chain
llm = OpenAI(temperature=0.7)
prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer this question concisely: {question}"
)
chain = LLMChain(llm=llm, prompt=prompt)

# Wrap with CERT
cert_chain = wrap_chain(chain, "langchain-test")
cert_chain = cert_chain.with_consistency(threshold=0.9, n_trials=5)

# Run the chain with testing
try:
    result = await cert_chain.ainvoke({"question": "What is 2+2?"})
    print(f"✅ Result: {result}")
except Exception as e:
    print(f"❌ Test failed: {e}")

## Example 4: Semantic Comparison

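CERT's `SemanticComparator` checks whether two outputs mean the same thing even when their formatting differs, for example different number formats or letter casing. The cell below prints the match decision and confidence for each pair.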

In [None]:
from cert import SemanticComparator

comparator = SemanticComparator()

# Test various equivalent formats
test_cases = [
    ("$391 billion", "391B"),
    ("$391 billion", "$391,000,000,000"),
    ("Paris", "paris"),
    ("Paris", "Paris, France"),
]

print("🔍 Semantic Comparison Tests:\n")
for expected, actual in test_cases:
    result = comparator.compare(expected, actual)
    status = "✅" if result.matched else "❌"
    print(f"{status} '{expected}' vs '{actual}'")
    print(f"   Matched: {result.matched}, Confidence: {result.confidence:.2f}")
    print()
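
For a sense of what comparing "equivalent formats" can involve, here is a minimal sketch that normalizes monetary amounts before comparing them and falls back to a case-insensitive string match. It is not how `SemanticComparator` works internally; `normalize_amount` and `loosely_equal` are hypothetical helpers written only for this illustration.

```python
import math
import re

# Hypothetical helpers for illustration only; not part of CERT.
_SUFFIXES = {
    "k": 1e3, "thousand": 1e3,
    "m": 1e6, "million": 1e6,
    "b": 1e9, "billion": 1e9,
}
# Optional "$", a number with optional thousands separators, optional suffix.
_AMOUNT_RE = re.compile(r"^\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([a-z]*)$")


def normalize_amount(text):
    """Return the numeric value of strings like '$391 billion', or None."""
    match = _AMOUNT_RE.match(text.strip().lower())
    if match is None:
        return None
    number, suffix = match.groups()
    if suffix and suffix not in _SUFFIXES:
        return None
    return float(number.replace(",", "")) * _SUFFIXES.get(suffix, 1.0)


def loosely_equal(expected, actual):
    """Compare as amounts when both parse, else case-insensitively as text."""
    a, b = normalize_amount(expected), normalize_amount(actual)
    if a is not None and b is not None:
        return math.isclose(a, b, rel_tol=1e-9)
    return expected.strip().lower() == actual.strip().lower()


for expected, actual in [
    ("$391 billion", "391B"),
    ("$391 billion", "$391,000,000,000"),
    ("Paris", "paris"),
    ("Paris", "Paris, France"),
]:
    print(f"'{expected}' vs '{actual}': {loosely_equal(expected, actual)}")
```

This rule-based sketch accepts the first three pairs and rejects `"Paris"` vs `"Paris, France"`, because it has no notion of one answer containing another. A semantic comparator may decide that case differently.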

## Next Steps

1. **Try with your own LLM**: Replace `my_llm_function` with your actual LLM calls
2. **Add more tests**: Test different scenarios and edge cases
3. **Track over time**: Use the SQLite storage to monitor degradation
4. **Visual inspection**: Install `cert-framework[inspector]` and run `cert inspect`
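
For the first step, the examples above pass a zero-argument `async` function to the runner. If your model client is blocking, one way to adapt it is sketched below. `call_my_model` is a placeholder standing in for your real client call and `make_async_llm` is a hypothetical helper; neither is part of CERT. The sketch assumes Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio


def call_my_model(prompt: str) -> str:
    """Placeholder for a blocking LLM client call; replace with your own."""
    return f"echo: {prompt}"


def make_async_llm(prompt: str):
    """Bind a prompt and return a zero-argument async callable.

    Hypothetical helper: the blocking call runs in a worker thread so it
    does not stall the event loop.
    """
    async def llm_function() -> str:
        return await asyncio.to_thread(call_my_model, prompt)

    return llm_function


my_real_llm = make_async_llm("What is the capital of France?")
```

In a notebook cell you can check it with `await my_real_llm()`, then pass `my_real_llm` wherever the examples use `my_llm_function`.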

## Resources

- [GitHub Repository](https://github.com/Javihaus/cert-framework)
- [Documentation](https://github.com/Javihaus/cert-framework#readme)
- [Report Issues](https://github.com/Javihaus/cert-framework/issues)

## Support

If you find CERT useful, please ⭐ the repository on GitHub!