# Evaluation Comparison Notebook

This notebook walks through the complete 4-step evaluation workflow:
1. **Initialize** the QA bot
2. **Generate predictions** for test questions
3. **Build the evaluation dataset** (questions + predictions + ground truth)
4. **Score with an evaluation framework** (LangChain, DeepEval, RAGAS, etc.)

---

## Prerequisites

**LangChain + Ollama setup (optional)**
- Install [Ollama](https://ollama.ai/download) and run `ollama pull llama3` (or any model you prefer).
- In your `.env`, set `LANGCHAIN_USE_OLLAMA=true`, `OLLAMA_MODEL=llama3`, and optionally `OLLAMA_BASE_URL`.
- Restart the notebook kernel after updating environment variables so LangChain picks up the new settings.
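
After restarting the kernel, a quick check like the one below can confirm that the settings are visible to Python. This is a minimal sketch, assuming `.env` has already been loaded into the process environment (by the project or by your shell). `read_ollama_settings` is a hypothetical helper, not part of the repository, and the fallback to `llama3` is an assumption made for illustration. The project's own configuration loading may parse these variables differently.

```python
import os
from typing import Mapping, Optional


def read_ollama_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Ollama-related settings named in the prerequisites.

    Hypothetical helper for illustration only; the project's own
    configuration loading may parse these variables differently.
    """
    env = os.environ if env is None else env
    return {
        # Treat "true" (any casing) as enabled; anything else as disabled.
        "use_ollama": env.get("LANGCHAIN_USE_OLLAMA", "").strip().lower() == "true",
        # Assumed default for this sketch: the model pulled in the step above.
        "model": env.get("OLLAMA_MODEL", "llama3"),
        # None means no base URL override was set.
        "base_url": env.get("OLLAMA_BASE_URL"),
    }


print(read_ollama_settings())
```

If `use_ollama` still prints `False` after you edit `.env`, the kernel is most likely running with the old environment.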

## Step 1: Initialize the QA Bot

Set up the QA bot with the sample technical documentation. The bot will use TF-IDF retrieval to answer questions.
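
To make "TF-IDF retrieval" concrete, here is a small standard-library sketch of the idea: weight each term by how often it occurs in a document and how rare it is across documents, then pick the document whose vector is closest to the query by cosine similarity. This is not `QABot`'s actual implementation, which is not shown in this notebook. The function names, the smoothing formula, and the sample documents are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms found in every document keep a positive weight.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: list[str]) -> int:
    """Return the index of the best-matching document (docs must be non-empty)."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the corpus carry no information and are dropped.
    query_vec = {
        term: tf * idf[term]
        for term, tf in Counter(tokenize(query)).items()
        if term in idf
    }
    scores = [cosine(query_vec, vec) for vec in vectors]
    return max(range(len(docs)), key=scores.__getitem__)


# Made-up documents for illustration only.
docs = [
    "Use the install command to add the package to your environment.",
    "Authentication requires an API key passed in the request header.",
    "Rate limits reset every minute for each API key.",
]
print(docs[retrieve("How do I install the package?", docs)])
```

With these sample documents the query matches the first one, because `install` and `package` occur only there.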

In [1]:
from pathlib import Path
import json
import sys

root = Path("..").resolve()
sys.path.insert(0, str(root))
sys.path.insert(0, str(root / "src"))

from src.qa_bot import QABot
from evaluations.base_evaluator import EvaluationInput

# Step 1: Initialize the bot with sample technical documentation
docs_path = root / "data" / "documents" / "sample_docs"
bot = QABot(documents_path=docs_path)
print(f"✓ QA bot initialized with documents from {docs_path}")

✓ QA bot initialized with documents from C:\Users\Owner\source\repos\LiteObject\eval-framework-sandbox\data\documents\sample_docs


## Steps 2 & 3: Generate Predictions & Build Evaluation Dataset

Ask the bot each test question to generate **predictions**, then pair them with **ground truth** answers to create the evaluation dataset.

### What is a prediction?

In this notebook, a **prediction** is simply the answer text produced by the QA bot for each test question. Evaluation frameworks compare that prediction against the corresponding **ground truth** answer to compute metrics.
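
As a rough illustration of comparing a prediction with its ground truth, the sketch below computes two simple string metrics: exact match and token-level F1. These are stand-ins chosen for illustration, not the metrics the frameworks in Step 4 compute (each framework applies its own graders). The function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase the text and keep only alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if both texts have identical normalized tokens, else 0.0."""
    return float(normalize(prediction) == normalize(ground_truth))


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred = normalize(prediction)
    truth = normalize(ground_truth)
    if not pred or not truth:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == truth)
    overlap = sum((Counter(pred) & Counter(truth)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


# Made-up strings for illustration only.
print(exact_match("Use pip install.", "use PIP install"))
print(token_f1("Run the install command", "Run the setup command"))
```

Exact match is strict and rewards only identical wording, while token F1 gives partial credit for overlapping words. Neither one judges meaning, which is one reason to use a dedicated evaluation framework.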

In [2]:
from evaluations.utils import load_dataset_from_files

questions_path = root / "data" / "test_questions.json"
ground_truth_path = root / "data" / "ground_truth.json"

# Step 2: Generate predictions from the bot for each test question
questions_data = json.loads(questions_path.read_text(encoding="utf-8"))

predictions: dict[str, str] = {}
for item in questions_data:
    response = bot.answer(item["question"])
    predictions[item["id"]] = response.response

print(f"✓ Generated {len(predictions)} predictions")

# Step 3: Build the evaluation dataset (predictions + ground truth)
eval_dataset = list(
    load_dataset_from_files(
        questions_path=questions_path,
        ground_truth_path=ground_truth_path,
        predictions=predictions,
    )
)

print(
    f"✓ Prepared {len(eval_dataset)} evaluation samples with predictions & ground truth"
)

✓ Generated 3 predictions
✓ Prepared 3 evaluation samples with predictions & ground truth
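
Before scoring, it can help to confirm that every question actually received a non-empty prediction, since an empty answer would otherwise just show up as a low score. The sketch below is a hypothetical helper, not part of `evaluations.utils`. It assumes only what the cell above shows: each question item has an `"id"` key, and `predictions` maps ids to answer text. The sample ids and texts are made up.

```python
def missing_prediction_ids(questions: list[dict], predictions: dict[str, str]) -> list[str]:
    """Return ids of questions whose prediction is absent or blank."""
    return [
        item["id"]
        for item in questions
        if not predictions.get(item["id"], "").strip()
    ]


# Made-up sample data for illustration only.
sample_questions = [
    {"id": "q1", "question": "How do I install the package?"},
    {"id": "q2", "question": "How do I authenticate?"},
    {"id": "q3", "question": "What are the rate limits?"},
]
sample_predictions = {"q1": "Run the install command.", "q2": "   "}

print(missing_prediction_ids(sample_questions, sample_predictions))
```

In the notebook itself, the equivalent call would be `missing_prediction_ids(questions_data, predictions)`; an empty list means every question has an answer to score.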


## Step 4: Score with Evaluation Frameworks

Run one or more evaluation frameworks to score how well your QA bot performed. Start with LangChain, then try others (DeepEval, RAGAS).

In [13]:
from evaluations.langchain_eval_runner import LangChainEvalRunner

# Step 4a: LangChain Evaluation
print("\n=== LangChain Evaluation ===")
langchain_runner = LangChainEvalRunner(output_dir=root / "results")
try:
    langchain_result = langchain_runner.evaluate(eval_dataset)
    langchain_summary = {
        "framework": langchain_result.framework,
        "score": langchain_result.score,
        "details": langchain_result.details,
    }
    print(f"✓ Score: {langchain_result.score}")
except Exception as exc:
    langchain_summary = {
        "framework": "langchain",
        "error": str(exc),
    }
    print(f"✗ Error: {exc}")

langchain_summary


=== LangChain Evaluation ===
✓ Score: 0.3333333333333333


{'framework': 'langchain',
 'score': 0.3333333333333333,
 'details': {'raw': [{'reasoning': 'CORRECT', 'value': 'CORRECT', 'score': 1},
   {'reasoning': 'INCORRECT', 'value': 'INCORRECT', 'score': 0},
   {'reasoning': 'INCORRECT', 'value': 'INCORRECT', 'score': 0}],
  'provider': 'ollama'}}
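
The aggregate score above is consistent with a plain mean of the per-sample `score` values in `details['raw']`: one correct answer out of three gives 1/3. The sketch below recomputes it from the output shown. That the runner aggregates this way is an inference from these numbers, not something the notebook states, and `mean_score` and `failed_indices` are hypothetical helper names.

```python
def mean_score(raw: list[dict]) -> float:
    """Average the per-sample scores; 0.0 for an empty list."""
    scores = [item["score"] for item in raw]
    return sum(scores) / len(scores) if scores else 0.0


def failed_indices(raw: list[dict]) -> list[int]:
    """Positions of samples that were graded with a score of 0."""
    return [index for index, item in enumerate(raw) if item["score"] == 0]


# Copied from the 'details' output above.
raw = [
    {"reasoning": "CORRECT", "value": "CORRECT", "score": 1},
    {"reasoning": "INCORRECT", "value": "INCORRECT", "score": 0},
    {"reasoning": "INCORRECT", "value": "INCORRECT", "score": 0},
]

print(mean_score(raw))      # 0.3333333333333333
print(failed_indices(raw))  # [1, 2]
```

Looking at which samples failed is usually more informative than the aggregate alone, especially with only three test questions.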

## Step 4b: Compare Framework Scores

Collect results from all evaluated frameworks and display a summary comparison.

In [None]:
# Collect results from all frameworks (extend this dict as you run more evaluations)
framework_results = {
    "LangChain": langchain_summary.get("score"),
    # "DeepEval": deepeval_summary.get("score"),
    # "RAGAS": ragas_summary.get("score"),
}

print("\n=== Framework Comparison ===")
for framework, score in framework_results.items():
    if score is not None:
        print(f"{framework}: {score:.3f}")
    else:
        print(f"{framework}: Not evaluated or error occurred")

# Optional: Create a simple bar chart if you have multiple results
scored = {name: score for name, score in framework_results.items() if score is not None}
if len(scored) > 1:
    try:
        import matplotlib.pyplot as plt

        plt.bar(list(scored.keys()), list(scored.values()))
        plt.ylabel("Score (0-1)")
        plt.title("Evaluation Framework Comparison")
        plt.show()
    except ImportError:
        print("Install matplotlib for visualization: pip install matplotlib")


=== Framework Comparison ===
LangChain: 0.333


## Optional: Try Other Frameworks

Uncomment and run any of the cells below to compare scores across frameworks.

In [None]:
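# from evaluations.deepeval_runner import DeepEvalRunner

# print("\n=== DeepEval Evaluation ===")
# deepeval_runner = DeepEvalRunner(output_dir=root / "results")
# try:
#     deepeval_result = deepeval_runner.evaluate(eval_dataset)
#     # Define deepeval_summary so the comparison cell can read its score
#     deepeval_summary = {
#         "framework": "deepeval",
#         "score": deepeval_result.score,
#         "details": deepeval_result.details,
#     }
#     print(f"✓ Score: {deepeval_result.score}")
# except Exception as exc:
#     deepeval_summary = {"framework": "deepeval", "error": str(exc)}
#     print(f"✗ Error: {exc}")

# deepeval_summary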

In [None]:
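# from evaluations.ragas_runner import RagasRunner

# print("\n=== RAGAS Evaluation ===")
# ragas_runner = RagasRunner(output_dir=root / "results")
# try:
#     ragas_result = ragas_runner.evaluate(eval_dataset)
#     # Define ragas_summary so the comparison cell can read its score
#     ragas_summary = {
#         "framework": "ragas",
#         "score": ragas_result.score,
#         "details": ragas_result.details,
#     }
#     print(f"✓ Score: {ragas_result.score}")
# except Exception as exc:
#     ragas_summary = {"framework": "ragas", "error": str(exc)}
#     print(f"✗ Error: {exc}")

# ragas_summary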