# GPT-4o-mini RAGAS V2 Evaluation - ONLY GPT

This notebook evaluates **ONLY GPT-4o-mini** using RAGAS V2 metrics.

## Steps:
1. Load GPT answers from JSONL batch files into the cache
2. Evaluate GPT-4o-mini with RAGAS V2
3. Generate `manual_eval` CSV files for manual review

**Instruction-tuned models are excluded; this notebook covers GPT-4o-mini only.**

In [14]:
import sys
sys.path.append("../")

In [15]:
from elasticsearch import Elasticsearch
from qdrant_client import QdrantClient
from cache.cache import Cache

qdrant_client = QdrantClient(host="localhost", port=6333)
es_client = Elasticsearch(
    hosts=["http://localhost:9200"],
)
cache = Cache()



## Load GPT Answers to Cache

In [16]:
import json
from common.utils import replace_slash_with_dash

def load_openai_answers_to_cache(cache):
    """Load GPT-4o-mini answers from batch files to cache"""
    datasets = [
        "clarin-pl-poquad-1000",
        "clarin-pl-poquad-100000", 
        "clarin-pl-poquad-2000",
        "clarin-pl-poquad-500",
        "ipipan-polqa-1000",
        "ipipan-polqa-100000",
        "ipipan-polqa-2000", 
        "ipipan-polqa-500"
    ]
    
    loaded_count = 0
    for dataset_key in datasets:
        filename = replace_slash_with_dash(f"gpt-4o-mini_{dataset_key}.jsonl")
        filepath = f"../../openai_batches/{filename}"
        
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                for line in f:
                    data = json.loads(line)
                    custom_id = data["custom_id"]
                    answer = data["response"]["body"]["choices"][0]["message"]["content"]
                    cache.set(custom_id, answer)
                    loaded_count += 1
            print(f"✓ Loaded {filename}")
        except FileNotFoundError:
            print(f"✗ File not found: {filename}")
    
    print(f"\n✅ Loaded {loaded_count} GPT answers to cache")

load_openai_answers_to_cache(cache)

✓ Loaded gpt-4o-mini_clarin-pl-poquad-1000.jsonl
✓ Loaded gpt-4o-mini_clarin-pl-poquad-100000.jsonl
✓ Loaded gpt-4o-mini_clarin-pl-poquad-2000.jsonl
✓ Loaded gpt-4o-mini_clarin-pl-poquad-500.jsonl
✓ Loaded gpt-4o-mini_ipipan-polqa-1000.jsonl
✓ Loaded gpt-4o-mini_ipipan-polqa-100000.jsonl
✓ Loaded gpt-4o-mini_ipipan-polqa-2000.jsonl
✓ Loaded gpt-4o-mini_ipipan-polqa-500.jsonl

✅ Loaded 1898 GPT answers to cache
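
The loader above indexes straight into `data["response"]["body"]["choices"][0]["message"]["content"]`, which raises if a request in the batch failed. A minimal sketch of a more defensive extraction follows. The helper name `extract_answer` and both sample records are illustrative assumptions, not part of the project code, and the failed-record shape (a null `response` plus an `error` object) is assumed rather than taken from these files.

```python
import json


def extract_answer(line):
    """Return (custom_id, answer) for one batch output line, or None if unusable."""
    record = json.loads(line)
    custom_id = record.get("custom_id")
    # A failed request may carry a null "response", so fall back to empty containers.
    response = record.get("response") or {}
    body = response.get("body") or {}
    choices = body.get("choices") or []
    if custom_id is None or not choices:
        return None
    content = (choices[0].get("message") or {}).get("content")
    if content is None:
        return None
    return custom_id, content


# Illustrative records only; real batch lines contain more fields.
sample_ok = json.dumps({
    "custom_id": "abc123",
    "response": {"body": {"choices": [{"message": {"content": "Warszawa"}}]}},
})
sample_failed = json.dumps({
    "custom_id": "def456",
    "response": None,
    "error": {"code": "server_error"},
})

print(extract_answer(sample_ok))      # ('abc123', 'Warszawa')
print(extract_answer(sample_failed))  # None
```

Returning `None` instead of raising lets the caller count skipped records and report them next to the loaded total.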


In [17]:
# Load CURATED batch results from OpenAI
import os

curated_loaded = 0
batch_files = []

# Find all batch output files
for filename in os.listdir("../openai_batches"):
    if filename.startswith("batch_") and filename.endswith("_output.jsonl"):
        batch_files.append(os.path.join("../openai_batches", filename))

print(f"Found {len(batch_files)} batch output files")

# Load curated input files to map hashes
curated_input_files = []
for filename in os.listdir("../openai_batches"):
    if "curated" in filename and filename.startswith("gpt-4o-mini_") and not filename.endswith("_output.jsonl"):
        curated_input_files.append(os.path.join("../openai_batches", filename))

print(f"Found {len(curated_input_files)} curated input files")

# Build hash -> input file mapping
hash_to_input = {}
for input_file in curated_input_files:
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            custom_id = data["custom_id"]
            hash_to_input[custom_id] = os.path.basename(input_file)

print(f"Loaded {len(hash_to_input)} unique hashes from curated input files")

# Load batch outputs to cache
for batch_file in batch_files:
    # Check first line to see if it's a curated batch
    with open(batch_file, 'r', encoding='utf-8') as f:
        first_line = f.readline()
    if not first_line.strip():
        continue  # Skip empty batch files instead of failing in json.loads
    sample_hash = json.loads(first_line)["custom_id"]
    
    if sample_hash in hash_to_input:
        # This is a curated batch - load it
        count = 0
        with open(batch_file, 'r', encoding='utf-8') as f:
            for line in f:
                data = json.loads(line)
                custom_id = data["custom_id"]
                try:
                    answer = data["response"]["body"]["choices"][0]["message"]["content"]
                    cache.set(custom_id, answer)
                    count += 1
                    curated_loaded += 1
                except (KeyError, IndexError, TypeError):
                    # Skip failed or malformed responses (TypeError covers a null "response")
                    pass
        print(f"✓ Loaded {os.path.basename(batch_file)}: {count} answers")

print(f"\n✅ Loaded {curated_loaded} curated GPT answers to cache")

Found 7 batch output files
Found 7 curated input files
Loaded 1930 unique hashes from curated input files
✓ Loaded batch_68eeb9f42f288190bfe3ae2a4fe66c4c_output.jsonl: 433 answers
✓ Loaded batch_68eeb9dd63008190b27d1ddb326c4b1b_output.jsonl: 126 answers
✓ Loaded batch_68eeba02bdb8819087a65b0d57c86e99_output.jsonl: 442 answers
✓ Loaded batch_68eeba269b108190be495cb28e762328_output.jsonl: 171 answers
✓ Loaded batch_68eeb9e74b34819097b4a5740168a11a_output.jsonl: 8 answers
✓ Loaded batch_68eeb9cb29c08190b8584366091d9ffa_output.jsonl: 410 answers
✓ Loaded batch_68eeba0fe5f08190b0a9a7c0984603f8_output.jsonl: 340 answers

✅ Loaded 1930 curated GPT answers to cache
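
The cell above decides whether a batch file is curated from the `custom_id` on its first line only. As a hedged alternative, the sketch below measures what fraction of all IDs in a batch appear in the curated inputs, which would also expose mixed or partially matching files. The function name `curated_overlap` and the sample IDs are hypothetical.

```python
import json


def curated_overlap(batch_lines, known_ids):
    """Fraction of custom_ids in a batch output that appear in the curated inputs."""
    ids = [json.loads(line)["custom_id"] for line in batch_lines if line.strip()]
    if not ids:
        return 0.0
    return sum(1 for custom_id in ids if custom_id in known_ids) / len(ids)


# Hypothetical IDs standing in for the hashes collected in hash_to_input.
known_ids = {"h1", "h2", "h3"}
batch_lines = [
    json.dumps({"custom_id": "h1"}),
    json.dumps({"custom_id": "h2"}),
    json.dumps({"custom_id": "x9"}),
]

overlap = curated_overlap(batch_lines, known_ids)
print(f"{overlap:.2f}")
```

A batch could then be loaded only when the overlap is 1.0, or flagged for inspection when it falls between 0 and 1.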


## Load Datasets and Setup

In [18]:
from common.names import OPENAI_EMBEDDING_MODEL_NAMES
from repository.qdrant_openai_repository import QdrantOpenAIRepository
from retrievers.qdrant_retriever import QdrantRetriever
from generators.openai_generator import OpenAIGenerator
from qdrant_client.models import Distance
from dataset.curated_dataset_getter import CuratedDatasetGetter

# Load CURATED datasets (for which we just generated GPT answers)
poquad_dataset = CuratedDatasetGetter.get_curated_poquad()
polqa_dataset = CuratedDatasetGetter.get_curated_polqa()

print("✅ Using CURATED datasets (manually selected questions)")
print(f"PoQuAD dataset: {len(poquad_dataset)} questions")
print(f"PolQA dataset: {len(polqa_dataset)} questions")

✅ Using CURATED datasets (manually selected questions)
PoQuAD dataset: 100 questions
PolQA dataset: 100 questions


## Define OpenAI Retrievers

In [19]:
def get_best_poquad_openai_retriever():
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.EUCLID, cache
    )
    retriever = QdrantRetriever(repository, "clarin-pl-poquad-2000")
    return (retriever, "text-embedding-3-large-Euclid-clarin-pl-poquad-2000")

def get_worst_poquad_openai_retriever():
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.COSINE, cache
    )
    retriever = QdrantRetriever(repository, "clarin-pl-poquad-500")
    return (retriever, "text-embedding-3-large-Cosine-clarin-pl-poquad-500")

def get_best_polqa_openai_retriever():
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.EUCLID, cache
    )
    retriever = QdrantRetriever(repository, "ipipan-polqa-2000")
    return (retriever, "text-embedding-3-large-Euclid-ipipan-polqa-2000")

def get_worst_polqa_openai_retriever():
    repository = QdrantOpenAIRepository.get_repository(
        qdrant_client, OPENAI_EMBEDDING_MODEL_NAMES[0], Distance.COSINE, cache
    )
    retriever = QdrantRetriever(repository, "ipipan-polqa-500")
    return (retriever, "text-embedding-3-large-Cosine-ipipan-polqa-500")

poquad_openai_retriever_functions = [
    get_best_poquad_openai_retriever,
    get_worst_poquad_openai_retriever,
]

polqa_openai_retriever_functions = [
    get_best_polqa_openai_retriever,
    get_worst_polqa_openai_retriever,
]
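
The "best" and "worst" retrievers above differ in two ways at once: the distance metric (`Distance.EUCLID` vs `Distance.COSINE`) and the collection (`-2000` vs `-500`). If the stored vectors are unit length (an assumption here about how the collections were populated), Euclidean distance and cosine similarity produce the same ranking, because for unit vectors d² = 2 − 2·cos. The sketch below illustrates this with made-up three-dimensional vectors; none of the values come from the actual embeddings.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, normalized to unit length (illustrative only).
query = normalize([0.2, 0.9, 0.1])
passages = {
    "p1": normalize([0.1, 1.0, 0.0]),
    "p2": normalize([1.0, 0.1, 0.3]),
    "p3": normalize([0.3, 0.7, 0.6]),
}

# Higher cosine similarity is better; lower Euclidean distance is better.
by_cosine = sorted(passages, key=lambda k: cosine_similarity(query, passages[k]), reverse=True)
by_euclid = sorted(passages, key=lambda k: euclidean_distance(query, passages[k]))

print(by_cosine)
print(by_euclid)
```

Under that assumption, any difference between the two retrievers would come from the collection rather than from the metric.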

## Initialize RAGAS V2 Evaluator

In [20]:
from evaluation.ragas_evaulator_v2 import RAGASEvaluatorV2
from vectorizer.hf_vectorizer import HFVectorizer
from common.names import INST_MODEL_PATHS

vectorizer = HFVectorizer("intfloat/multilingual-e5-large", cache)
ragas_evaluator = RAGASEvaluatorV2(
    reranker_model_name="sdadas/polish-reranker-large-ranknet",
    cache=cache,
    generator_model_name=INST_MODEL_PATHS[2],  # PLLuM-12B for question generation
    vectorizer=vectorizer,
)

print("✅ RAGAS V2 evaluator initialized!")

Vectorizer with model intfloat/multilingual-e5-large initialized
✅ RAGAS V2 evaluator initialized!


## Evaluation Function

In [21]:
import csv
from tqdm import tqdm

def clean_text_for_csv(text):
    if text is None:
        return ""
    cleaned = str(text).replace('\n', ' ').replace('\r', ' ')
    cleaned = ' '.join(cleaned.split())
    return cleaned

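def evaluate_gpt_ragas_v2(retriever_name, retriever, dataset, dataset_name, n):
    """Score GPT-4o-mini answers with RAGAS V2 using the top-n retrieved passages.

    Returns a summary dict of metric averages and the rows for the manual_eval CSV.
    """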
    results = []
    manual_eval_rows = []
    generator = OpenAIGenerator(cache)
    
    for i, entry in enumerate(tqdm(dataset, desc=f"{dataset_name} - GPT-4o-mini")):
        question = entry.question
        correct_passage_id = entry.passage_id
        correct_answers = entry.answers
        
        # Get retrieval results
        retriever_result = retriever.get_relevant_passages(question)
        passages = [passage for (passage, _) in retriever_result.passages]
        top_n_passages = passages[:n]
        
        # Generate answer from cache
        answer = generator.generate_answer(question, top_n_passages)
        
        # Check if correct passage in top n
        retrieved_ids = [passage.id for passage in top_n_passages]
        has_correct_passages = str(correct_passage_id in retrieved_ids).upper()
        
        # RAGAS V2 evaluation
        ragas_score = ragas_evaluator.ragas(
            retriever_result,
            correct_passage_id,
            answer,
            correct_answers=correct_answers
        )
        
        faithfulness = ragas_evaluator.faithfulness(retriever_result, answer)
        answer_relevance = ragas_evaluator.answer_relevance(question, answer)
        answer_correctness = ragas_evaluator.answer_correctness(answer, correct_answers)
        context_recall = ragas_evaluator.context_recall(retriever_result, correct_passage_id)
        
        results.append({
            'ragas_v2': ragas_score,
            'faithfulness': faithfulness,
            'answer_relevance': answer_relevance,
            'answer_correctness': answer_correctness,
            'context_recall': context_recall,
        })
        
        # Manual eval row
        question_id = f"{dataset_name}_q{i+1}"
        question_text_clean = clean_text_for_csv(question)
        answer_clean = clean_text_for_csv(answer)
        
        if isinstance(correct_answers, list):
            correct_answer_text = " | ".join([clean_text_for_csv(ans) for ans in correct_answers])
        else:
            correct_answer_text = clean_text_for_csv(str(correct_answers))
        
        manual_eval_rows.append([
            question_text_clean,
            question_id,
            has_correct_passages,
            f"{ragas_score:.4f}",
            f"{faithfulness:.4f}",
            f"{answer_relevance:.4f}",
            f"{answer_correctness:.4f}",
            f"{context_recall:.4f}",
            answer_clean,
            correct_answer_text,
            ""
        ])
    
    # Calculate averages
    avg_ragas = sum(r['ragas_v2'] for r in results) / len(results)
    avg_faithfulness = sum(r['faithfulness'] for r in results) / len(results)
    avg_relevance = sum(r['answer_relevance'] for r in results) / len(results)
    avg_correctness = sum(r['answer_correctness'] for r in results) / len(results)
    avg_recall = sum(r['context_recall'] for r in results) / len(results)
    
    summary = {
        'retriever': retriever_name,
        'dataset': dataset_name,
        'n': n,
        'ragas_v2': avg_ragas,
        'faithfulness': avg_faithfulness,
        'answer_relevance': avg_relevance,
        'answer_correctness': avg_correctness,
        'context_recall': avg_recall,
    }
    
    return summary, manual_eval_rows
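
Two pieces of the function above can be looked at in isolation: the check for whether the correct passage is among the top `n`, and the per-metric averaging. The sketch below is a simplified, self-contained version of both. The helper names `hit_at_n` and `mean_metrics` are hypothetical, and the empty-list guard is an addition that the original function does not have.

```python
def hit_at_n(retrieved_ids, correct_id, n):
    """True if the correct passage ID is among the first n retrieved IDs."""
    return correct_id in retrieved_ids[:n]


def mean_metrics(results):
    """Average each metric across per-question result dicts.

    Returns an empty dict for an empty input instead of dividing by zero.
    """
    if not results:
        return {}
    return {
        key: sum(row[key] for row in results) / len(results)
        for key in results[0]
    }


# Illustrative values, not taken from the evaluation run.
retrieved = ["p7", "p2", "p9"]
print(hit_at_n(retrieved, "p2", 1))  # False
print(hit_at_n(retrieved, "p2", 5))  # True

scores = [
    {"faithfulness": 1.0, "answer_correctness": 0.5},
    {"faithfulness": 0.5, "answer_correctness": 0.0},
]
print(mean_metrics(scores))  # {'faithfulness': 0.75, 'answer_correctness': 0.25}
```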

## Run GPT Evaluations

In [22]:
import os

os.makedirs("../../output/ragas_v2", exist_ok=True)
os.makedirs("../../output/ragas_v2/manual_eval", exist_ok=True)

all_summaries = []
ns = [1, 5]

print("=" * 80)
print("GPT-4o-mini RAGAS V2 EVALUATION")
print("=" * 80)

# PoQuAD OpenAI
print("\n### POQUAD OPENAI ###\n")
for get_retriever in poquad_openai_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    for n in ns:
        print(f"  Evaluating n={n}...")
        summary, manual_rows = evaluate_gpt_ragas_v2(
            retriever_name,
            retriever,
            poquad_dataset,
            "poquad_openai",
            n
        )
        
        all_summaries.append(summary)
        
        # Save manual eval file
        filename = f"ragas_v2_poquad_openai_gpt_4o_mini_n{n}_{retriever_name.replace('/', '_')}.csv"
        filepath = f"../../output/ragas_v2/manual_eval/{filename}"
        
        with open(filepath, mode="w", newline="", encoding="utf-8") as file:
            file.write(f"# RETRIEVER: {retriever_name}\n")
            file.write("# GENERATOR: gpt-4o-mini\n")
            file.write("# DATASET: poquad_openai\n")
            file.write(f"# TOP_N: {n}\n")
            file.write("\n")
            
            writer = csv.writer(file, quoting=csv.QUOTE_ALL)
            writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", 
                           "faithfulness", "answer_relevance", "answer_correctness", "context_recall", 
                           "answer", "correct_answer", "manual_result"])
            writer.writerows(manual_rows)
        
        print(f"    RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
        print(f"    Saved: {filename}")

# PolQA OpenAI
print("\n### POLQA OPENAI ###\n")
for get_retriever in polqa_openai_retriever_functions:
    retriever, retriever_name = get_retriever()
    print(f"\nRetriever: {retriever_name}")
    
    for n in ns:
        print(f"  Evaluating n={n}...")
        summary, manual_rows = evaluate_gpt_ragas_v2(
            retriever_name,
            retriever,
            polqa_dataset,
            "polqa_openai",
            n
        )
        
        all_summaries.append(summary)
        
        # Save manual eval file
        filename = f"ragas_v2_polqa_openai_gpt_4o_mini_n{n}_{retriever_name.replace('/', '_')}.csv"
        filepath = f"../../output/ragas_v2/manual_eval/{filename}"
        
        with open(filepath, mode="w", newline="", encoding="utf-8") as file:
            file.write(f"# RETRIEVER: {retriever_name}\n")
            file.write("# GENERATOR: gpt-4o-mini\n")
            file.write("# DATASET: polqa_openai\n")
            file.write(f"# TOP_N: {n}\n")
            file.write("\n")
            
            writer = csv.writer(file, quoting=csv.QUOTE_ALL)
            writer.writerow(["question", "question_id", "hasCorrectPassages", "ragas_v2_score", 
                           "faithfulness", "answer_relevance", "answer_correctness", "context_recall", 
                           "answer", "correct_answer", "manual_result"])
            writer.writerows(manual_rows)
        
        print(f"    RAGAS V2: {summary['ragas_v2']:.4f} | Correctness: {summary['answer_correctness']:.4f}")
        print(f"    Saved: {filename}")

print("\n" + "=" * 80)
print("✅ GPT EVALUATION COMPLETE!")
print(f"Generated {len(all_summaries)} manual evaluation files")
print("=" * 80)

GPT-4o-mini RAGAS V2 EVALUATION

### POQUAD OPENAI ###

Vectorizer with model text-embedding-3-large initialized
Qdrant openai collection text-embedding-3-large-Euclid repository initialized

Retriever: text-embedding-3-large-Euclid-clarin-pl-poquad-2000
  Evaluating n=1...


poquad_openai - GPT-4o-mini: 100%|██████████| 100/100 [00:01<00:00, 72.63it/s]


    RAGAS V2: 0.8092 | Correctness: 0.6114
    Saved: ragas_v2_poquad_openai_gpt_4o_mini_n1_text-embedding-3-large-Euclid-clarin-pl-poquad-2000.csv
  Evaluating n=5...


poquad_openai - GPT-4o-mini: 100%|██████████| 100/100 [03:12<00:00,  1.93s/it]


    RAGAS V2: 0.8136 | Correctness: 0.6221
    Saved: ragas_v2_poquad_openai_gpt_4o_mini_n5_text-embedding-3-large-Euclid-clarin-pl-poquad-2000.csv
Vectorizer with model text-embedding-3-large initialized
Qdrant openai collection text-embedding-3-large-Cosine repository initialized

Retriever: text-embedding-3-large-Cosine-clarin-pl-poquad-500
  Evaluating n=1...


poquad_openai - GPT-4o-mini: 100%|██████████| 100/100 [04:38<00:00,  2.78s/it]


    RAGAS V2: 0.8033 | Correctness: 0.6091
    Saved: ragas_v2_poquad_openai_gpt_4o_mini_n1_text-embedding-3-large-Cosine-clarin-pl-poquad-500.csv
  Evaluating n=5...


poquad_openai - GPT-4o-mini: 100%|██████████| 100/100 [04:06<00:00,  2.47s/it]


    RAGAS V2: 0.8069 | Correctness: 0.6181
    Saved: ragas_v2_poquad_openai_gpt_4o_mini_n5_text-embedding-3-large-Cosine-clarin-pl-poquad-500.csv

### POLQA OPENAI ###

Vectorizer with model text-embedding-3-large initialized
Qdrant openai collection text-embedding-3-large-Euclid repository initialized

Retriever: text-embedding-3-large-Euclid-ipipan-polqa-2000
  Evaluating n=1...


polqa_openai - GPT-4o-mini: 100%|██████████| 100/100 [07:04<00:00,  4.24s/it]


    RAGAS V2: 0.8100 | Correctness: 0.6126
    Saved: ragas_v2_polqa_openai_gpt_4o_mini_n1_text-embedding-3-large-Euclid-ipipan-polqa-2000.csv
  Evaluating n=5...


polqa_openai - GPT-4o-mini: 100%|██████████| 100/100 [04:20<00:00,  2.61s/it]


    RAGAS V2: 0.8136 | Correctness: 0.6102
    Saved: ragas_v2_polqa_openai_gpt_4o_mini_n5_text-embedding-3-large-Euclid-ipipan-polqa-2000.csv
Vectorizer with model text-embedding-3-large initialized
Qdrant openai collection text-embedding-3-large-Cosine repository initialized

Retriever: text-embedding-3-large-Cosine-ipipan-polqa-500
  Evaluating n=1...


polqa_openai - GPT-4o-mini: 100%|██████████| 100/100 [01:02<00:00,  1.60it/s]


    RAGAS V2: 0.8090 | Correctness: 0.6136
    Saved: ragas_v2_polqa_openai_gpt_4o_mini_n1_text-embedding-3-large-Cosine-ipipan-polqa-500.csv
  Evaluating n=5...


polqa_openai - GPT-4o-mini: 100%|██████████| 100/100 [01:20<00:00,  1.24it/s]

    RAGAS V2: 0.8128 | Correctness: 0.6098
    Saved: ragas_v2_polqa_openai_gpt_4o_mini_n5_text-embedding-3-large-Cosine-ipipan-polqa-500.csv

✅ GPT EVALUATION COMPLETE!
Generated 8 manual evaluation files
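
Each manual-eval file starts with `# KEY: value` metadata lines and a blank line before the quoted CSV header, so a plain CSV reader will not parse it directly. The sketch below shows one way to read such a file back with the standard library. The function name `read_manual_eval` and the sample content are illustrative, and it relies on every record fitting on one line, which `clean_text_for_csv` is intended to ensure by stripping newlines.

```python
import csv
import io


def read_manual_eval(handle):
    """Split a manual-eval file into (metadata dict, list of row dicts)."""
    metadata = {}
    data_lines = []
    for line in handle:
        if line.startswith("#"):
            # Metadata line such as "# TOP_N: 5"; split on the first colon only.
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines))
    return metadata, rows


# Shortened, made-up sample with the same layout as the generated files.
sample = io.StringIO(
    "# RETRIEVER: demo-retriever\n"
    "# GENERATOR: gpt-4o-mini\n"
    "# TOP_N: 1\n"
    "\n"
    '"question","question_id","hasCorrectPassages","manual_result"\r\n'
    '"Example question?","demo_q1","TRUE",""\r\n'
)

metadata, rows = read_manual_eval(sample)
print(metadata["TOP_N"], rows[0]["question_id"])  # 1 demo_q1
```

Because `QUOTE_ALL` makes every data line start with a double quote, a leading `#` can only belong to a metadata line. When reading a real file, open it with `newline=""` as the `csv` module documentation recommends.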





## Summary Results

In [23]:
import pandas as pd

df = pd.DataFrame(all_summaries)
print("\nGPT-4o-mini RAGAS V2 Results:")
print("=" * 80)
for _, row in df.iterrows():
    print(f"{row['dataset']:20} | n={row['n']} | RAGAS: {row['ragas_v2']:.4f} | Correctness: {row['answer_correctness']:.4f}")

print("\n" + "=" * 80)
print("Average by N:")
print(df.groupby('n')[['ragas_v2', 'answer_correctness']].mean())

print("\n" + "=" * 80)
print("Average by Dataset:")
print(df.groupby('dataset')[['ragas_v2', 'answer_correctness']].mean())


GPT-4o-mini RAGAS V2 Results:
poquad_openai        | n=1 | RAGAS: 0.8092 | Correctness: 0.6114
poquad_openai        | n=5 | RAGAS: 0.8136 | Correctness: 0.6221
poquad_openai        | n=1 | RAGAS: 0.8033 | Correctness: 0.6091
poquad_openai        | n=5 | RAGAS: 0.8069 | Correctness: 0.6181
polqa_openai         | n=1 | RAGAS: 0.8100 | Correctness: 0.6126
polqa_openai         | n=5 | RAGAS: 0.8136 | Correctness: 0.6102
polqa_openai         | n=1 | RAGAS: 0.8090 | Correctness: 0.6136
polqa_openai         | n=5 | RAGAS: 0.8128 | Correctness: 0.6098

Average by N:
   ragas_v2  answer_correctness
n                              
1  0.807898            0.611684
5  0.811747            0.615045

Average by Dataset:
               ragas_v2  answer_correctness
dataset                                    
polqa_openai   0.811355            0.611550
poquad_openai  0.808291            0.615178
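
The per-row printout above shows the dataset and `n` but not the retriever, so the two rows per dataset and `n` can only be told apart by their order. Since each summary dict also carries a `retriever` key, the results can be grouped by it. The sketch below does this with the standard library, using the RAGAS V2 scores as printed above (rounded to four decimals, so the averages are approximate). The helper name `mean_by` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# RAGAS V2 scores copied from the printed results, paired with retriever names
# in the order the evaluations ran.
summaries = [
    {"retriever": "text-embedding-3-large-Euclid-clarin-pl-poquad-2000", "n": 1, "ragas_v2": 0.8092},
    {"retriever": "text-embedding-3-large-Euclid-clarin-pl-poquad-2000", "n": 5, "ragas_v2": 0.8136},
    {"retriever": "text-embedding-3-large-Cosine-clarin-pl-poquad-500", "n": 1, "ragas_v2": 0.8033},
    {"retriever": "text-embedding-3-large-Cosine-clarin-pl-poquad-500", "n": 5, "ragas_v2": 0.8069},
    {"retriever": "text-embedding-3-large-Euclid-ipipan-polqa-2000", "n": 1, "ragas_v2": 0.8100},
    {"retriever": "text-embedding-3-large-Euclid-ipipan-polqa-2000", "n": 5, "ragas_v2": 0.8136},
    {"retriever": "text-embedding-3-large-Cosine-ipipan-polqa-500", "n": 1, "ragas_v2": 0.8090},
    {"retriever": "text-embedding-3-large-Cosine-ipipan-polqa-500", "n": 5, "ragas_v2": 0.8128},
]


def mean_by(rows, key, metric):
    """Average `metric` over rows grouped by the value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[metric])
    return {group: mean(values) for group, values in groups.items()}


by_retriever = mean_by(summaries, "retriever", "ragas_v2")
for name, score in by_retriever.items():
    print(f"{name:55} | RAGAS: {score:.4f}")
```

With the notebook's DataFrame, `df.groupby('retriever')[['ragas_v2', 'answer_correctness']].mean()` would give the equivalent table.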
