# Experiments

## What I Learned
The session covered how to configure LangSmith experiments that run evaluations over a dataset, with control over concurrency, the number of repetitions, and which dataset splits are included. It also showed how experiments record metadata and how to evaluate against a specific dataset version.

## Changes in Code
I ran several experiments on hand-picked splits of my dataset, repeating runs to check whether the results were consistent. I also practiced attaching metadata to experiments and comparing results across dataset versions.
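The split, repetition, and concurrency controls mentioned above can be sketched roughly as follows. This is a minimal sketch, not the exact code from the session: `run_split_experiment` is a hypothetical helper, the split names are whatever your dataset defines, and it assumes a recent `langsmith` SDK in which `Client.list_examples` accepts `splits` and `evaluate` accepts `num_repetitions` and `max_concurrency`. The evaluate function and client are passed in as arguments so the helper can be exercised with stand-ins.

```python
from typing import Any, Callable, Sequence


def run_split_experiment(
    evaluate_fn: Callable[..., Any],
    client: Any,
    target: Callable[[dict], dict],
    evaluators: Sequence[Callable[..., dict]],
    dataset_name: str,
    splits: Sequence[str],
    repetitions: int = 3,
    concurrency: int = 4,
) -> Any:
    """Evaluate `target` on selected dataset splits, repeating each example.

    Hypothetical helper: `evaluate_fn` is expected to behave like
    `langsmith.evaluation.evaluate` and `client` like `langsmith.Client()`.
    """
    # Restrict the run to the chosen splits (assumed `splits` parameter).
    # To pin a dataset version, `list_examples` is also assumed to accept
    # an `as_of` tag or timestamp.
    examples = client.list_examples(dataset_name=dataset_name, splits=list(splits))

    return evaluate_fn(
        target,
        data=examples,
        evaluators=list(evaluators),
        experiment_prefix=f"{dataset_name}-{'-'.join(splits)}",
        num_repetitions=repetitions,   # run every example this many times
        max_concurrency=concurrency,   # cap on parallel target calls
        metadata={"splits": list(splits), "repetitions": repetitions},
    )
```

In practice you would pass `evaluate` from `langsmith.evaluation` as `evaluate_fn` and a `Client()` instance as `client`. Recording the splits and repetition count in `metadata` makes it easier to tell experiments apart when comparing them later.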

```python
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

client = Client()

# Define the target function to test
def qa_system(inputs: dict) -> dict:
    """Simple Q&A system using an LLM."""
    llm = ChatOpenAI(model="gpt-4o-mini")
    question = inputs["question"]
    response = llm.invoke(question)
    return {"answer": response.content}

# Define the evaluator
def accuracy_evaluator(run, example):
    """Score a run by the fraction of expected words found in the prediction."""
    # outputs can be None (e.g. if the run failed), so fall back to an empty dict
    predicted = (run.outputs or {}).get("answer", "")
    expected = (example.outputs or {}).get("answer", "")

    # Calculate similarity (simple word overlap)
    pred_words = set(predicted.lower().split())
    exp_words = set(expected.lower().split())

    if not exp_words:
        score = 0.0
    else:
        overlap = len(pred_words & exp_words)
        score = overlap / len(exp_words)

    return {"key": "word_overlap", "score": score}

# Run the experiment
dataset_name = "qa_examples"
results = evaluate(
    qa_system,
    data=dataset_name,
    evaluators=[accuracy_evaluator],
    experiment_prefix="qa_experiment",
    metadata={
        "model": "gpt-4o-mini",
        "version": "1.0"
    }
)

# evaluate() returns a results object, not a dict, so use attribute access
print(f"Experiment completed: {results.experiment_name}")
```
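To make the word-overlap score and the consistency check on repeated runs more concrete, here is a small offline sketch that needs no LangSmith or OpenAI access. The answers are made-up illustrations, not results from the session, and `word_overlap` and `summarize_repetitions` are hypothetical helpers that mirror the evaluator logic above.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def word_overlap(predicted: str, expected: str) -> float:
    """Fraction of expected words that also appear in the prediction."""
    pred_words = set(predicted.lower().split())
    exp_words = set(expected.lower().split())
    if not exp_words:
        return 0.0
    return len(pred_words & exp_words) / len(exp_words)


def summarize_repetitions(scores: Sequence[float]) -> Dict[str, float]:
    """Mean and spread (population std. dev.) of scores across repeated runs."""
    return {"mean": mean(scores), "spread": pstdev(scores)}


# Hypothetical answers from three repetitions of the same question.
expected = "Paris is the capital of France"
answers = [
    "The capital of France is Paris",
    "Paris",
    "It is Paris, the capital",
]

scores = [word_overlap(answer, expected) for answer in answers]
summary = summarize_repetitions(scores)
print(scores, summary)
```

A spread near zero suggests the system answers consistently across repetitions, while a large spread means a single run is not a reliable estimate. The third answer also shows a limitation of plain whitespace splitting: `"paris,"` keeps its comma and does not match `"paris"`, so only three of the six expected words count and the score is 0.5.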