# RAGAS V2 Re-evaluation Notebook

This notebook regenerates answers for instruction models with improved prompts and evaluates them using RAGAS V2.

Goals:
1. Regenerate answers for all instruction models (using cached retrievals)
2. Use the improved prompt to get direct answers without reasoning
3. Evaluate with RAGAS V2, which adds continuous scoring and ground-truth comparison
4. Compare the results with the original RAGAS scores

In [10]:
import sys
sys.path.append("../")

In [11]:
from elasticsearch import Elasticsearch
from qdrant_client import QdrantClient
from cache.cache import Cache

qdrant_client = QdrantClient(host="localhost", port=6333)
es_client = Elasticsearch(
    hosts=["http://localhost:9200"],
)
cache = Cache()


In [12]:
from common.names import (
    OPENAI_EMBEDDING_MODEL_NAMES,
    PASSAGE_PREFIX_MAP,
    QUERY_PREFIX_MAP,
    INST_MODEL_PATHS,
    DATASET_SEED,
)
from repository.es_repository import ESRepository
from repository.qdrant_openai_repository import QdrantOpenAIRepository
from repository.qdrant_repository import QdrantRepository
from qdrant_client.models import Distance

from rerankers.hf_reranker import HFReranker
from retrievers.es_retriever import ESRetriever
from retrievers.hybrid_retriever import HybridRetriever
from retrievers.qdrant_retriever import QdrantRetriever
from retrievers.retriever import Retriever
from generators.instruction_generator import InstructionGenerator
from generators.openai_generator import OpenAIGenerator

from dataset.polqa_dataset_getter import PolqaDatasetGetter
from dataset.poquad_dataset_getter import PoquadDatasetGetter

## Define Retriever Functions

Using the same retrievers as in the manual evaluation notebook.

In [13]:
def get_best_poquad_retriever() -> tuple[Retriever, str]:
    dataset_key = "clarin-pl-poquad-100000"
    es_index = "morfologik_index"
    qdrant_model = "intfloat/multilingual-e5-large"
    reranker_model = "sdadas/polish-reranker-large-ranknet"
    alpha = 0.5

    es_repository = ESRepository(es_client, es_index, cache)
    passage_prefix = PASSAGE_PREFIX_MAP[qdrant_model]
    query_prefix = QUERY_PREFIX_MAP[qdrant_model]
    qdrant_repository = QdrantRepository.get_repository(
        qdrant_client,
        qdrant_model,
        Distance.COSINE,
        cache,
        passage_prefix,
        query_prefix,
    )
    reranker = HFReranker(reranker_model, cache)

    retriever = HybridRetriever(
        es_repository, qdrant_repository, dataset_key, alpha, reranker
    )

    return (
        retriever,
        "morfologik_index-intfloat/multilingual-e5-large-Cosine-clarin-pl-poquad-100000-0.5-sdadas/polish-reranker-large-ranknet",
    )


def get_50p_poquad_retriever() -> tuple[Retriever, str]:
    dataset_key = "clarin-pl-poquad-1000"
    qdrant_model = "sdadas/mmlw-retrieval-roberta-large"

    passage_prefix = PASSAGE_PREFIX_MAP[qdrant_model]
    query_prefix = QUERY_PREFIX_MAP[qdrant_model]
    qdrant_repository = QdrantRepository.get_repository(
        qdrant_client,
        qdrant_model,
        Distance.EUCLID,
        cache,
        passage_prefix,
        query_prefix,
    )

    retriever = QdrantRetriever(qdrant_repository, dataset_key)

    return (
        retriever,
        "sdadas/mmlw-retrieval-roberta-large-Euclid-clarin-pl-poquad-1000",
    )


def get_worst_poquad_retriever() -> tuple[Retriever, str]:
    dataset_key = "clarin-pl-poquad-500"
    es_index = "basic_index"

    es_repository = ESRepository(es_client, es_index, cache)

    retriever = ESRetriever(es_repository, dataset_key)

    return (retriever, "basic_index-clarin-pl-poquad-500")


def get_best_poquad_openai_retriever() -> tuple[Retriever, str]:
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.COSINE, cache
    )

    retriever = QdrantRetriever(repository, "clarin-pl-poquad-2000")

    return (retriever, "text-embedding-3-large-Cosine-clarin-pl-poquad-2000")


def get_worst_poquad_openai_retriever() -> tuple[Retriever, str]:
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.COSINE, cache
    )

    retriever = QdrantRetriever(repository, "clarin-pl-poquad-500")

    return (retriever, "text-embedding-3-large-Cosine-clarin-pl-poquad-500")


def get_best_polqa_retriever() -> tuple[Retriever, str]:
    dataset_key = "ipipan-polqa-1000"
    es_index = "morfologik_index"
    qdrant_model = "sdadas/mmlw-retrieval-roberta-large"
    reranker_model = "sdadas/polish-reranker-large-ranknet"
    alpha = 0.75

    es_repository = ESRepository(es_client, es_index, cache)
    passage_prefix = PASSAGE_PREFIX_MAP[qdrant_model]
    query_prefix = QUERY_PREFIX_MAP[qdrant_model]
    qdrant_repository = QdrantRepository.get_repository(
        qdrant_client,
        qdrant_model,
        Distance.COSINE,
        cache,
        passage_prefix,
        query_prefix,
    )
    reranker = HFReranker(reranker_model, cache)

    retriever = HybridRetriever(
        es_repository, qdrant_repository, dataset_key, alpha, reranker
    )

    return (
        retriever,
        "morfologik_index-sdadas/mmlw-retrieval-roberta-large-Cosine-ipipan-polqa-1000-0.75-sdadas/polish-reranker-large-ranknet",
    )


def get_50p_polqa_retriever() -> tuple[Retriever, str]:
    dataset_key = "ipipan-polqa-1000"
    es_index = "morfologik_index"
    qdrant_model = "sdadas/mmlw-retrieval-roberta-large"
    alpha = 0.75

    es_repository = ESRepository(es_client, es_index, cache)
    passage_prefix = PASSAGE_PREFIX_MAP[qdrant_model]
    query_prefix = QUERY_PREFIX_MAP[qdrant_model]
    qdrant_repository = QdrantRepository.get_repository(
        qdrant_client,
        qdrant_model,
        Distance.COSINE,
        cache,
        passage_prefix,
        query_prefix,
    )

    retriever = HybridRetriever(es_repository, qdrant_repository, dataset_key, alpha)

    return (
        retriever,
        "morfologik_index-sdadas/mmlw-retrieval-roberta-large-Cosine-ipipan-polqa-1000-0.75",
    )


def get_worst_polqa_retriever() -> tuple[Retriever, str]:
    dataset_key = "ipipan-polqa-500"
    es_index = "basic_index"

    es_repository = ESRepository(es_client, es_index, cache)

    retriever = ESRetriever(es_repository, dataset_key)

    return (
        retriever,
        "basic_index-ipipan-polqa-500",
    )


def get_best_polqa_openai_retriever() -> tuple[Retriever, str]:
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.EUCLID, cache
    )

    retriever = QdrantRetriever(repository, "ipipan-polqa-2000")

    return (retriever, "text-embedding-3-large-Euclid-ipipan-polqa-2000")


def get_worst_polqa_openai_retriever() -> tuple[Retriever, str]:
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.COSINE, cache
    )

    retriever = QdrantRetriever(repository, "ipipan-polqa-500")

    return (retriever, "text-embedding-3-large-Cosine-ipipan-polqa-500")

In [14]:
poquad_retriever_functions = [
    get_best_poquad_retriever,
    get_50p_poquad_retriever,
    get_worst_poquad_retriever,
]

poquad_openai_retriever_functions = [
    get_best_poquad_openai_retriever,
    get_worst_poquad_openai_retriever,
]

polqa_retriever_functions = [
    get_best_polqa_retriever,
    get_50p_polqa_retriever,
    get_worst_polqa_retriever,
]

polqa_openai_retriever_functions = [
    get_best_polqa_openai_retriever,
    get_worst_polqa_openai_retriever,
]

## Load Test Datasets

Using the same 100-question subset as in manual evaluation.
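
Slicing the first 100 items from a seeded 500-item sample gives the same questions on every run, provided the sampler is deterministic for a given seed. The sketch below illustrates that idea with the standard library only. `sample_then_slice` and the integer stand-in data are hypothetical, and the real `get_random_n_test` may sample differently.

```python
import random


def sample_then_slice(items, sample_size, seed, keep):
    """Draw a seeded random sample, then keep its first `keep` elements."""
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    sampled = rng.sample(items, sample_size)
    return sampled[:keep]


# Stand-in for a test split: 2000 integer question ids (illustrative only).
questions = list(range(2000))

first_run = sample_then_slice(questions, 500, seed=42, keep=100)
second_run = sample_then_slice(questions, 500, seed=42, keep=100)

# Same seed, same subset; the subset has no repeated questions.
print(len(first_run), first_run == second_run, len(set(first_run)))
```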

In [15]:
poquad_dataset_getter = PoquadDatasetGetter()
polqa_dataset_getter = PolqaDatasetGetter()

poquad_dataset = poquad_dataset_getter.get_random_n_test(500, DATASET_SEED)[:100]
polqa_dataset = polqa_dataset_getter.get_random_n_test(500, DATASET_SEED)[:100]

print(f"PoQuAD dataset: {len(poquad_dataset)} questions")
print(f"PolQA dataset: {len(polqa_dataset)} questions")

PoQuAD dataset: 100 questions
PolQA dataset: 100 questions


## Initialize RAGAS V2 Evaluator

We use the RAGAS V2 evaluator, configured with a Polish reranker, PLLuM-12B for question generation, and a multilingual-e5-large vectorizer.

In [16]:
from evaluation.ragas_evaulator_v2 import RAGASEvaluatorV2
from vectorizer.hf_vectorizer import HFVectorizer

# Initialize vectorizer for RAGAS
vectorizer = HFVectorizer("intfloat/multilingual-e5-large", cache)

# Initialize RAGAS V2
ragas_evaluator = RAGASEvaluatorV2(
    reranker_model_name="sdadas/polish-reranker-large-ranknet",
    cache=cache,
    generator_model_name=INST_MODEL_PATHS[2],  # PLLuM-12B - best for Polish question generation
    vectorizer=vectorizer,
)

print("RAGAS V2 evaluator initialized with PLLuM-12B for question generation!")

Vectorizer with model intfloat/multilingual-e5-large initialized
RAGAS V2 evaluator initialized with PLLuM-12B for question generation!


## Define Evaluation Function

This function will:
1. Use cached retrieval results
2. Regenerate answers under the `instruction_v3` cache key (new prompt)
3. Evaluate with RAGAS V2 including answer correctness
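
Regeneration in step 2 depends on the answer cache being keyed by a prompt version, so that a new version makes every lookup miss. A minimal sketch of that mechanism follows. The functions `answer_cache_key` and `get_or_generate`, the key layout, the placeholder version strings, and the `fake_generate` stand-in are all assumptions for illustration and are not the project's `Cache` class.

```python
import hashlib


def answer_cache_key(version: str, question: str, passages: list[str]) -> str:
    """Build a cache key that changes whenever the prompt version changes."""
    payload = "\x1f".join([question, *passages])  # unit separator between parts
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{version}:{digest}"


def get_or_generate(store: dict, version: str, question: str, passages: list[str], generate):
    """Return the cached answer for this version, generating it on a miss."""
    key = answer_cache_key(version, question, passages)
    if key not in store:
        store[key] = generate(question, passages)
    return store[key]


calls = []


def fake_generate(question, passages):
    # Stand-in for a model call; records each invocation.
    calls.append(question)
    return f"answer #{len(calls)}"


store = {}
question = "Kto napisał Pana Tadeusza?"
passages = ["Adam Mickiewicz napisał Pana Tadeusza."]

old = get_or_generate(store, "v_old", question, passages, fake_generate)
new = get_or_generate(store, "v_new", question, passages, fake_generate)    # miss: new version
again = get_or_generate(store, "v_new", question, passages, fake_generate)  # hit: no model call

print(old, "|", new, "|", again, "|", len(calls), "model calls")
```

Answers cached under the old version stay in the store, so earlier results remain reproducible.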

In [None]:
import csv
from tqdm import tqdm
import hashlib

def clean_text_for_csv(text):
    """Clean text to be CSV-safe by removing newlines and extra whitespace"""
    if text is None:
        return ""
    # Convert to string and replace newlines with spaces
    cleaned = str(text).replace('\n', ' ').replace('\r', ' ')
    # Replace multiple spaces with single space
    cleaned = ' '.join(cleaned.split())
    return cleaned

def create_safe_filename_v2(retriever_name, generator_name, generator_type, dataset_name, n):
    """Create a safe filename from the configuration using hash to avoid collisions"""
    # Clean names for filename
    clean_retriever = clean_text_for_csv(retriever_name).replace("/", "_").replace("-", "_")
    clean_generator = clean_text_for_csv(generator_name).replace("/", "_").replace("-", "_")
    
    # If the cleaned retriever name exceeds 50 chars, keep its first 40 chars + a short hash
    if len(clean_retriever) > 50:
        # Create a short hash of the full retriever name for uniqueness
        retriever_hash = hashlib.md5(clean_retriever.encode()).hexdigest()[:8]
        safe_retriever = f"{clean_retriever[:40]}_{retriever_hash}"
    else:
        safe_retriever = clean_retriever
    
    # Truncate generator name if needed
    safe_generator = clean_generator[:30]
    
    return f"ragas_v2_{dataset_name}_{safe_retriever}_{safe_generator}_{generator_type}_n{n}.csv"

def evaluate_generator_ragas_v2(
    retriever_name: str,
    retriever,
    generator_name: str,
    generator,
    dataset,
    dataset_name: str,
    n: int
):
    """
    Evaluate a generator with RAGAS V2, including answer correctness.

    Returns a (summary, per-question results, manual-evaluation rows) tuple.
    """
    results = []
    manual_eval_rows = []
    
    for i, entry in enumerate(tqdm(dataset, desc=f"{dataset_name} - {generator_name[:30]}")):
        question = entry.question
        correct_passage_id = entry.passage_id
        correct_answers = entry.answers
        
        # Get retrieval results (cached)
        retriever_result = retriever.get_relevant_passages(question)
        passages = [passage for (passage, _) in retriever_result.passages]
        top_n_passages = passages[:n]
        
        # Generate answer (will use instruction_v3 cache - NEW ANSWERS!)
        answer = generator.generate_answer(question, top_n_passages)
        
        # Check if correct passage is in top n
        retrieved_ids = [passage.id for passage in top_n_passages]
        has_correct_passages = str(correct_passage_id in retrieved_ids).upper()
        
        # Evaluate with RAGAS V2 including correctness
        ragas_score = ragas_evaluator.ragas(
            retriever_result,
            correct_passage_id,
            answer,
            correct_answers=correct_answers
        )
        
        # Also get individual metrics for analysis
        faithfulness = ragas_evaluator.faithfulness(retriever_result, answer)
        answer_relevance = ragas_evaluator.answer_relevance(question, answer)
        answer_correctness = ragas_evaluator.answer_correctness(answer, correct_answers)
        context_recall = ragas_evaluator.context_recall(retriever_result, correct_passage_id)
        
        results.append({
            'question': question,
            'answer': answer,
            'correct_answers': ' | '.join(correct_answers) if isinstance(correct_answers, list) else correct_answers,
            'ragas_v2': ragas_score,
            'faithfulness': faithfulness,
            'answer_relevance': answer_relevance,
            'answer_correctness': answer_correctness,
            'context_recall': context_recall,
        })
        
        # Prepare row for manual evaluation file
        question_id = f"{dataset_name}_q{i+1}"
        question_text_clean = clean_text_for_csv(question)
        answer_clean = clean_text_for_csv(answer)
        
        if isinstance(correct_answers, list):
            correct_answer_text = " | ".join([clean_text_for_csv(ans) for ans in correct_answers])
        else:
            correct_answer_text = clean_text_for_csv(str(correct_answers))
        
        manual_eval_rows.append([
            question_text_clean,
            question_id,
            has_correct_passages,
            f"{ragas_score:.4f}",  # RAGAS V2 overall score
            f"{faithfulness:.4f}",  # Faithfulness score
            f"{answer_relevance:.4f}",  # Answer relevance score
            f"{answer_correctness:.4f}",  # Answer correctness score
            f"{context_recall:.4f}",  # Context recall score
            answer_clean,
            correct_answer_text,
            ""  # Empty result column for manual evaluation
        ])
    
    # Calculate average scores
    avg_ragas = sum(r['ragas_v2'] for r in results) / len(results)
    avg_faithfulness = sum(r['faithfulness'] for r in results) / len(results)
    avg_relevance = sum(r['answer_relevance'] for r in results) / len(results)
    avg_correctness = sum(r['answer_correctness'] for r in results) / len(results)
    avg_recall = sum(r['context_recall'] for r in results) / len(results)
    
    summary = {
        'retriever': retriever_name,
        'generator': generator_name,
        'dataset': dataset_name,
        'n': n,
        'ragas_v2': avg_ragas,
        'faithfulness': avg_faithfulness,
        'answer_relevance': avg_relevance,
        'answer_correctness': avg_correctness,
        'context_recall': avg_recall,
        'num_questions': len(results)
    }
    
    return summary, results, manual_eval_rows

## Run Evaluations for All Combinations

This evaluates every instruction model, plus gpt-4o-mini on the OpenAI-embedding retrievers, with n in [1, 5] on both datasets.
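
The run covers a grid of retriever, generator, and n. The sketch below enumerates such a grid with `itertools.product` to show how the number of configurations is derived. The retriever labels and the four placeholder model names are assumptions, since the actual length of `INST_MODEL_PATHS` is not shown in this notebook.

```python
from itertools import product

ns = [1, 5]

# Placeholder labels; the notebook itself builds retrievers from factory functions.
local_retrievers = ["best", "50p", "worst"]
openai_retrievers = ["best_openai", "worst_openai"]

# Assumption: four instruction models. Adjust to len(INST_MODEL_PATHS).
instruction_models = ["model_a", "model_b", "model_c", "model_d"]


def configurations_for_dataset(dataset: str):
    """Yield (dataset, retriever, generator, n) tuples for one dataset."""
    yield from product([dataset], local_retrievers, instruction_models, ns)
    yield from product([f"{dataset}_openai"], openai_retrievers, ["gpt-4o-mini"], ns)


all_configs = [
    config
    for dataset in ("poquad", "polqa")
    for config in configurations_for_dataset(dataset)
]

per_dataset = len(all_configs) // 2
print(f"{per_dataset} configurations per dataset, {len(all_configs)} in total")
```

Under the four-model assumption this gives 28 configurations per dataset, which is consistent with the 56 manual evaluation files listed in the conclusion.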

In [None]:
import os
from datetime import datetime

# Create output directories
os.makedirs("../../output/ragas_v2", exist_ok=True)
os.makedirs("../../output/ragas_v2/manual_eval", exist_ok=True)

# Track all summaries
all_summaries = []
all_detailed_results = []

ns = [1, 5]

print("=" * 80)
print("STARTING RAGAS V2 EVALUATION WITH NEW ANSWERS")
print("Cache key: instruction_v3 (will regenerate all answers)")
print("=" * 80)

# PoQuAD dataset evaluations
print("\n### POQUAD DATASET ###\n")
for get_retriever in poquad_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    for inst_model_path in INST_MODEL_PATHS:
        print(f"  Generator: {inst_model_path}")
        generator = InstructionGenerator(inst_model_path, cache)
        
        for n in ns:
            print(f"    Evaluating with n={n}...")
            summary, detailed, manual_rows = evaluate_generator_ragas_v2(
                retriever_name,
                retriever,
                inst_model_path,
                generator,
                poquad_dataset,
                "poquad",
                n
            )
            
            all_summaries.append(summary)
            
            # Add metadata to detailed results
            for detail in detailed:
                detail['retriever'] = retriever_name
                detail['generator'] = inst_model_path
                detail['dataset'] = 'poquad'
                detail['n'] = n
            all_detailed_results.extend(detailed)
            
            # Save manual evaluation file
            filename = create_safe_filename_v2(retriever_name, inst_model_path, "INST", "poquad", n)
            filepath = f"../../output/ragas_v2/manual_eval/{filename}"
            
            with open(filepath, mode="w", newline="", encoding="utf-8") as file:
                # Write metadata
                file.write(f"# RETRIEVER: {clean_text_for_csv(retriever_name)}\n")
                file.write(f"# GENERATOR: {clean_text_for_csv(inst_model_path)}\n")
                file.write(f"# TYPE: INST\n")
                file.write(f"# DATASET: poquad\n")
                file.write(f"# TOP_N: {n}\n")
                file.write(f"# CACHE_VERSION: instruction_v3\n")
                file.write(f"# RAGAS_VERSION: v2\n")
                file.write("\n")
                
                # Write CSV data
                writer = csv.writer(file, quoting=csv.QUOTE_ALL)
                writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", "faithfulness", "answer_relevance", "answer_correctness", "context_recall", "answer", "correct_answer", "manual_result"])
                writer.writerows(manual_rows)
            
            print(f"      RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
            print(f"      Saved: {filename}")

# PoQuAD OpenAI evaluations
print("\n### POQUAD OPENAI DATASET ###\n")
for get_retriever in poquad_openai_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    print(f"  Generator: gpt-4o-mini")
    generator = OpenAIGenerator(cache)
    
    for n in ns:
        print(f"    Evaluating with n={n}...")
        summary, detailed, manual_rows = evaluate_generator_ragas_v2(
            retriever_name,
            retriever,
            "gpt-4o-mini",
            generator,
            poquad_dataset,
            "poquad_openai",
            n
        )
        
        all_summaries.append(summary)
        
        for detail in detailed:
            detail['retriever'] = retriever_name
            detail['generator'] = 'gpt-4o-mini'
            detail['dataset'] = 'poquad_openai'
            detail['n'] = n
        all_detailed_results.extend(detailed)
        
        # Save manual evaluation file
        filename = create_safe_filename_v2(retriever_name, "gpt-4o-mini", "INST", "poquad_openai", n)
        filepath = f"../../output/ragas_v2/manual_eval/{filename}"
        
        with open(filepath, mode="w", newline="", encoding="utf-8") as file:
            file.write(f"# RETRIEVER: {clean_text_for_csv(retriever_name)}\n")
            file.write(f"# GENERATOR: gpt-4o-mini\n")
            file.write(f"# TYPE: INST\n")
            file.write(f"# DATASET: poquad_openai\n")
            file.write(f"# TOP_N: {n}\n")
            file.write(f"# CACHE_VERSION: openai\n")
            file.write(f"# RAGAS_VERSION: v2\n")
            file.write("\n")
            
            writer = csv.writer(file, quoting=csv.QUOTE_ALL)
            writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", "faithfulness", "answer_relevance", "answer_correctness", "context_recall", "answer", "correct_answer", "manual_result"])
            writer.writerows(manual_rows)
        
        print(f"      RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
        print(f"      Saved: {filename}")

# PolQA dataset evaluations
print("\n### POLQA DATASET ###\n")
for get_retriever in polqa_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    for inst_model_path in INST_MODEL_PATHS:
        print(f"  Generator: {inst_model_path}")
        generator = InstructionGenerator(inst_model_path, cache)
        
        for n in ns:
            print(f"    Evaluating with n={n}...")
            summary, detailed, manual_rows = evaluate_generator_ragas_v2(
                retriever_name,
                retriever,
                inst_model_path,
                generator,
                polqa_dataset,
                "polqa",
                n
            )
            
            all_summaries.append(summary)
            
            for detail in detailed:
                detail['retriever'] = retriever_name
                detail['generator'] = inst_model_path
                detail['dataset'] = 'polqa'
                detail['n'] = n
            all_detailed_results.extend(detailed)
            
            # Save manual evaluation file
            filename = create_safe_filename_v2(retriever_name, inst_model_path, "INST", "polqa", n)
            filepath = f"../../output/ragas_v2/manual_eval/{filename}"
            
            with open(filepath, mode="w", newline="", encoding="utf-8") as file:
                file.write(f"# RETRIEVER: {clean_text_for_csv(retriever_name)}\n")
                file.write(f"# GENERATOR: {clean_text_for_csv(inst_model_path)}\n")
                file.write(f"# TYPE: INST\n")
                file.write(f"# DATASET: polqa\n")
                file.write(f"# TOP_N: {n}\n")
                file.write(f"# CACHE_VERSION: instruction_v3\n")
                file.write(f"# RAGAS_VERSION: v2\n")
                file.write("\n")
                
                writer = csv.writer(file, quoting=csv.QUOTE_ALL)
                writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", "faithfulness", "answer_relevance", "answer_correctness", "context_recall", "answer", "correct_answer", "manual_result"])
                writer.writerows(manual_rows)
            
            print(f"      RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
            print(f"      Saved: {filename}")

# PolQA OpenAI evaluations
print("\n### POLQA OPENAI DATASET ###\n")
for get_retriever in polqa_openai_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    print(f"  Generator: gpt-4o-mini")
    generator = OpenAIGenerator(cache)
    
    for n in ns:
        print(f"    Evaluating with n={n}...")
        summary, detailed, manual_rows = evaluate_generator_ragas_v2(
            retriever_name,
            retriever,
            "gpt-4o-mini",
            generator,
            polqa_dataset,
            "polqa_openai",
            n
        )
        
        all_summaries.append(summary)
        
        for detail in detailed:
            detail['retriever'] = retriever_name
            detail['generator'] = 'gpt-4o-mini'
            detail['dataset'] = 'polqa_openai'
            detail['n'] = n
        all_detailed_results.extend(detailed)
        
        # Save manual evaluation file
        filename = create_safe_filename_v2(retriever_name, "gpt-4o-mini", "INST", "polqa_openai", n)
        filepath = f"../../output/ragas_v2/manual_eval/{filename}"
        
        with open(filepath, mode="w", newline="", encoding="utf-8") as file:
            file.write(f"# RETRIEVER: {clean_text_for_csv(retriever_name)}\n")
            file.write(f"# GENERATOR: gpt-4o-mini\n")
            file.write(f"# TYPE: INST\n")
            file.write(f"# DATASET: polqa_openai\n")
            file.write(f"# TOP_N: {n}\n")
            file.write(f"# CACHE_VERSION: openai\n")
            file.write(f"# RAGAS_VERSION: v2\n")
            file.write("\n")
            
            writer = csv.writer(file, quoting=csv.QUOTE_ALL)
            writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", "faithfulness", "answer_relevance", "answer_correctness", "context_recall", "answer", "correct_answer", "manual_result"])
            writer.writerows(manual_rows)
        
        print(f"      RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
        print(f"      Saved: {filename}")

print("\n" + "=" * 80)
print("EVALUATION COMPLETE!")
print(f"Total configurations evaluated: {len(all_summaries)}")
print(f"Manual evaluation files saved to: ../../output/ragas_v2/manual_eval/")
print("=" * 80)

STARTING RAGAS V2 EVALUATION WITH NEW ANSWERS
Cache key: instruction_v3 (will regenerate all answers)

### POQUAD DATASET ###

Vectorizer with model intfloat/multilingual-e5-large initialized
Qdrant collection intfloat-multilingual-e5-large-Cosine repository initialized
Vectorizer with model sdadas/polish-reranker-large-ranknet initialized

Retriever: morfologik_index-intfloat/multilingual-e5-large-Cosine-clarin-pl-poquad-100000-0.5-sdadas/polish-reranker-large-ranknet
  Generator: ../../models/Bielik-11B-v2.2-Instruct-q4


Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x12180b140>>
Traceback (most recent call last):
  File "/Users/jakubkusiowski/Desktop/Workspace/polish-nl-qa/env/lib/python3.12/site-packages/ipykernel/ipkernel.py", line 770, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(

KeyboardInterrupt: 


    Evaluating with n=1...


poquad - ../../models/Bielik-11B-v2.2-I: 100%|██████████| 100/100 [00:02<00:00, 33.48it/s]


      RAGAS V2: 0.8021 | Correctness: 0.6654
      Saved: ragas_v2_poquad_morfologik_index_intfloat_multilingual_e_01c8b7a6_.._.._models_Bielik_11B_v2.2_I_INST_n1.csv
    Evaluating with n=5...


poquad - ../../models/Bielik-11B-v2.2-I:  19%|█▉        | 19/100 [00:00<00:00, 89.69it/s]

## Save Results to CSV

In [None]:
# Save summary results
summary_file = "../../output/ragas_v2/ragas_v2_summary.csv"
with open(summary_file, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=[
        'dataset', 'retriever', 'generator', 'n', 
        'ragas_v2', 'faithfulness', 'answer_relevance', 
        'answer_correctness', 'context_recall', 'num_questions'
    ])
    writer.writeheader()
    writer.writerows(all_summaries)

print(f"✅ Summary results saved to: {summary_file}")

# Save detailed results
detailed_file = "../../output/ragas_v2/ragas_v2_detailed.csv"
with open(detailed_file, mode="w", newline="", encoding="utf-8") as file:
    if all_detailed_results:
        writer = csv.DictWriter(file, fieldnames=all_detailed_results[0].keys(), quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(all_detailed_results)

print(f"✅ Detailed results saved to: {detailed_file}")

## Analysis: Top and Bottom Performers

In [None]:
import pandas as pd

# Load summary results
df = pd.DataFrame(all_summaries)

print("=" * 80)
print("TOP 10 CONFIGURATIONS BY RAGAS V2 SCORE")
print("=" * 80)
top_10 = df.nlargest(10, 'ragas_v2')[['dataset', 'generator', 'n', 'ragas_v2', 'answer_correctness']]
for idx, row in top_10.iterrows():
    print(f"{row['dataset']:15} | {row['generator'][:30]:30} | n={row['n']} | RAGAS: {row['ragas_v2']:.4f} | Correctness: {row['answer_correctness']:.4f}")

print("\n" + "=" * 80)
print("BOTTOM 10 CONFIGURATIONS BY RAGAS V2 SCORE")
print("=" * 80)
bottom_10 = df.nsmallest(10, 'ragas_v2')[['dataset', 'generator', 'n', 'ragas_v2', 'answer_correctness']]
for idx, row in bottom_10.iterrows():
    print(f"{row['dataset']:15} | {row['generator'][:30]:30} | n={row['n']} | RAGAS: {row['ragas_v2']:.4f} | Correctness: {row['answer_correctness']:.4f}")

print("\n" + "=" * 80)
print("AVERAGE SCORES BY GENERATOR")
print("=" * 80)
by_generator = df.groupby('generator')[['ragas_v2', 'answer_correctness', 'faithfulness', 'answer_relevance']].mean()
print(by_generator.sort_values('ragas_v2', ascending=False))

print("\n" + "=" * 80)
print("AVERAGE SCORES BY DATASET")
print("=" * 80)
by_dataset = df.groupby('dataset')[['ragas_v2', 'answer_correctness', 'faithfulness', 'answer_relevance']].mean()
print(by_dataset.sort_values('ragas_v2', ascending=False))

print("\n" + "=" * 80)
print("AVERAGE SCORES BY N")
print("=" * 80)
by_n = df.groupby('n')[['ragas_v2', 'answer_correctness', 'faithfulness', 'answer_relevance']].mean()
print(by_n.sort_values('ragas_v2', ascending=False))

## Sample Answers for Manual Inspection

Let's look at some actual answers to check whether the new prompt removed the unwanted reasoning text.

In [None]:
# Show a few random examples
import random

print("=" * 80)
print("SAMPLE ANSWERS FROM RAGAS V2 EVALUATION")
print("=" * 80)

sample_results = random.sample(all_detailed_results, min(10, len(all_detailed_results)))

for i, result in enumerate(sample_results, 1):
    print(f"\n--- EXAMPLE {i} ---")
    print(f"Dataset: {result['dataset']}")
    print(f"Generator: {result['generator'][:40]}")
    print(f"N: {result['n']}")
    print(f"\nQuestion: {result['question']}")
    print(f"\nGenerated Answer: {result['answer']}")
    print(f"\nCorrect Answer(s): {result['correct_answers']}")
    print(f"\nRAGAS V2: {result['ragas_v2']:.4f}")
    print(f"Correctness: {result['answer_correctness']:.4f}")
    print(f"Faithfulness: {result['faithfulness']:.4f}")
    print(f"Relevance: {result['answer_relevance']:.4f}")
    print("-" * 80)

## Conclusion

This notebook:
1. ✅ **Regenerated ALL answers** using improved prompt with cache key `instruction_v3` (NOT using old cache)
2. ✅ **Evaluated with RAGAS V2** including answer correctness compared to ground truth
3. ✅ **Saved 56 manual evaluation files** (28 PoQuAD + 28 PolQA) with RAGAS V2 scores in CSV format
4. ✅ **Saved summary and detailed results** to CSV files
5. ✅ **Provided comprehensive analysis** of performance

### Output Files:

**Summary Results:**
- `../../output/ragas_v2/ragas_v2_summary.csv` - Aggregate scores per configuration

**Detailed Results:**
- `../../output/ragas_v2/ragas_v2_detailed.csv` - Per-question results with all metrics

**Manual Evaluation Files (56 files):**
- `../../output/ragas_v2/manual_eval/ragas_v2_*.csv` - One file per configuration
- Each file contains: question, question_id, hasCorrectPassages, ragas_v2_score, faithfulness, answer_relevance, answer_correctness, context_recall, answer, correct_answer, manual_result
- Files include metadata headers showing retriever, generator, dataset, n, and versions
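
Because each manual evaluation file starts with `# KEY: value` metadata lines followed by a blank line, a plain CSV reader needs to skip that header first. The following is one possible way to read such a file with the standard library. `read_manual_eval` is a hypothetical helper, and the sample content uses made-up values and only a subset of the columns.

```python
import csv
import io


def read_manual_eval(stream):
    """Split a manual-evaluation file into (metadata dict, list of row dicts)."""
    metadata = {}
    data_lines = []
    in_header = True
    for line in stream:
        if in_header and line.startswith("# "):
            key, _, value = line[2:].rstrip("\n").partition(": ")
            metadata[key] = value
        elif in_header and not line.strip():
            in_header = False  # the blank line ends the metadata block
        else:
            in_header = False
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


# Made-up sample in the same layout (abbreviated column set).
sample = io.StringIO(
    "# RETRIEVER: basic_index-clarin-pl-poquad-500\n"
    "# GENERATOR: gpt-4o-mini\n"
    "# TOP_N: 1\n"
    "\n"
    '"question","question_id","ragas_v2_score","manual_result"\n'
    '"Pytanie # 1?","poquad_q1","0.7500",""\n'
)

metadata, rows = read_manual_eval(sample)
print(metadata["TOP_N"], rows[0]["question_id"], rows[0]["question"])
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` as the stream. A `#` inside a quoted field is preserved, because only the leading metadata block is treated as a header.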

### Key Improvements:

1. **New Answers**: All generated with improved prompt (instruction_v3) that emphasizes "ONLY answer, NO reasoning"
2. **Better Evaluation**: RAGAS V2 with continuous scoring and ground truth comparison
3. **Ready for Manual Review**: CSV files formatted for Excel with RAGAS V2 scores for comparison

Check the sample answers above to verify reasoning is removed, and use the manual evaluation files to compare your manual scores with RAGAS V2 automated scores!