# Cross-Encoder(evaluates the pair (query + chunk) using a BERT-style encoder with attention)

## Notebook Summary: Hybrid QA with Cross-Encoder Reranking + Compound + Procedural QA

This notebook implements a robust RAG-based extractive QA system for telecom documents, integrating multiple enhancements to improve factual accuracy and interpretability.

### Key Features:

1. **Cross-Encoder Reranking**  
   Improves retrieval quality by reranking initial FAISS results using `cross-encoder/ms-marco-MiniLM-L-6-v2`, allowing deeper semantic alignment between query and chunks.

2. **Compound Question Decomposition**  
   Automatically splits multi-clause questions (e.g., with "and", "or") and answers each clause individually using separate prompts.

3. **Procedural Multi-Span Extraction**  
   For procedural or stepwise queries, uses regex-based patterns to extract actionable steps directly from context.

4. **Auto-Routing Strategy**  
   The pipeline selects between standard extractive QA and procedural span extraction based on query structure.

5. **Evaluation Metrics**  
   Evaluated on 100 curated QA pairs using:
   - **SQuAD (EM / F1)**
   - **ROUGE-L**
   - **BLEU**

This setup demonstrates the most advanced variant of the centralized RAG system in the project, optimized for telecom-specific factual and procedural queries.

In [1]:
from pathlib import Path
import faiss
import pickle
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer

In [2]:
# Load FAISS index and chunks
index_path = "/mnt/data/RAG/3gpp_index.faiss"
chunks_path = "/mnt/data/RAG/3gpp_chunks.pkl"

index = faiss.read_index(index_path)
with open(chunks_path, "rb") as f:
    documents = pickle.load(f)

# Load embedding model used for indexing
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
def retrieve_context(query, top_k=3):
    query_emb = embedding_model.encode([query], normalize_embeddings=True)
    D, I = index.search(query_emb.astype("float32"), top_k)
    return [documents[i] for i in I[0]]

def retrieve_with_rerank(query, top_k=5):
    # Step 1 — initial FAISS search
    query_emb = embedding_model.encode([query], normalize_embeddings=True)
    D, I = index.search(np.array(query_emb).astype("float32"), top_k * 2)  # wider net

    initial_results = [documents[i] for i in I[0]]

    # Step 2 — prepare (query, chunk) pairs
    pairs = [(query, doc["content"]) for doc in initial_results]

    # Step 3 — rerank with cross-encoder
    scores = reranker.predict(pairs)
    reranked = sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)[:top_k]

    return [doc for _, doc in reranked]

In [3]:
from sentence_transformers import CrossEncoder

# Load Cross-Encoder model once
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

In [4]:
SYSTEM_PROMPT = (
    "You are a precise assistant. Extract the exact answer span from the context. "
    "Do not paraphrase, summarize, or add extra information. "
    "The answer must appear exactly in the context."
    "If the context lists multiple conditions, actions, or branches, include them all as written. "
    "Do not summarize or paraphrase — copy the exact text from the context, line by line."
)

def build_rag_prompt(context_chunks, question):
    combined_context = "\n\n".join([chunk['content'] for chunk in context_chunks])
    user_prompt = (
        f"Context: {combined_context}\n\n"
        f"Question: {question}\n"
        f"Answer from the context only:"
    )
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST]"

model_path = "/mnt/data/llama2_qa_lora_output5/final"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to("cuda")

qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


In [5]:
def clean_prediction(raw_text):
    # Remove everything before the last [INST]
    answer = raw_text.split("[/INST]")[-1].strip()

    # Remove strange characters
    answer = re.sub(r"[^\w\s\-.,:/()]", "", answer)

    # Remove repeating phrases like "The key is... The key is... The key is..."
    answer = re.sub(r'(\b.+?:)(\s*\1)+', r'\1', answer)

    # Trim repetitive word loops (e.g., "structured as follows" x 5)
    tokens = answer.split()
    for i in range(1, len(tokens) // 2):
        if tokens[:i] == tokens[i:2*i]:
            answer = " ".join(tokens[:i])
            break

    # Optionally truncate to sentence boundary
    sentence_end = re.search(r'[.?!]', answer)
    if sentence_end:
        answer = answer[:sentence_end.end()]

    return answer.strip()

In [6]:
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk

STOPWORDS = set(stopwords.words("english"))

import re

def normalize(text):
    return re.sub(r'\W+', ' ', text.lower())

def lexical_overlap(query, chunk):
    q_tokens = set(normalize(query).split()) - STOPWORDS
    c_tokens = set(normalize(chunk).split()) - STOPWORDS
    return len(q_tokens & c_tokens) / (len(q_tokens | c_tokens) + 1e-5)

def tfidf_score(query, chunk, vectorizer=None):
    docs = [query, chunk]
    if not vectorizer:
        vectorizer = TfidfVectorizer().fit(docs)
    vecs = vectorizer.transform(docs)
    return (vecs[0] @ vecs[1].T).A[0][0]

In [7]:
def split_compound_question(q):
    parts = re.split(r"\band\b|\bor\b|[,;]", q)
    return [p.strip() for p in parts if len(p.strip().split()) > 3]


def answer_with_rag_llama(question, top_k=5, verbose=False):
    retrieved = retrieve_with_rerank(question, top_k=top_k)

    sub_qs = split_compound_question(question)

    # Handle compound question (multi-prompt)
    if len(sub_qs) > 1:
        answers = []
        for sq in sub_qs:
            sub_prompt = build_rag_prompt(retrieved, sq)
            raw = qa_pipeline(
                sub_prompt, 
                max_new_tokens=160, 
                do_sample=False, 
                eos_token_id=tokenizer.eos_token_id, 
                pad_token_id=tokenizer.eos_token_id
            )[0]["generated_text"]

            ans = clean_prediction(raw)
            answers.append(f"→ {sq}: {ans}")

        full_answer = "\n".join(answers)

        # Context containment check (on full answer)
        all_context = " ".join([c["content"] for c in retrieved])
        if not any(ans.split(": ", 1)[-1] in all_context for ans in answers):
            print("🚨 One or more sub-answers not found in context — check retrieval or generation.")
        return full_answer, retrieved

    # Handle simple (single-clause) question
    prompt = build_rag_prompt(retrieved, question)
    raw_output = qa_pipeline(
        prompt, 
        max_new_tokens=160, 
        do_sample=False, 
        eos_token_id=tokenizer.eos_token_id, 
        pad_token_id=tokenizer.eos_token_id
    )[0]["generated_text"]

    answer = clean_prediction(raw_output)

    # Sanity check
    if len(answer.split()) < 2 or len(answer.split()) > 40:
        print("⚠️ Warning: Possibly bad output. Check content or retrieval.")

    # ✅ Context containment validation
    all_context = " ".join([c["content"] for c in retrieved])
    if answer not in all_context:
        print("🚨 Answer not found in retrieved context — check prompt or retrieval quality.")

    if verbose:
        print("📌 Prompt:\n", prompt)
        print("\n🧾 Raw Output:\n", raw_output)
        print("\n✅ Cleaned Answer:", answer)
        for i, chunk in enumerate(retrieved):
            print(f"\n--- Context {i+1} ---")
            print(chunk["content"])

    return answer, retrieved

import re

def extract_multi_spans(context: str) -> list:
    """Extract procedural-style sentences from telecom context using regex patterns."""
    spans = []

    # Patterns that catch conditional and action rules
    patterns = [
        r"(?i)(?:upon|when|if|after|before).+?shall.+?[.;]",  # conditional + shall
        r"(?i)the (?:ue|amf|network|nas|gnb).+?shall.+?[.;]",  # direct instructions
        r"(?i)-\s*.+?shall.+?[.;]",  # bullet points with 'shall'
        r"(?i)the (?:ue|amf|nas|network).+?enters.+?[.;]",     # entry triggers
    ]

    for pattern in patterns:
        matches = re.findall(pattern, context)
        spans.extend(matches)

    import difflib

    def is_similar(a, b, threshold=0.85):
        return difflib.SequenceMatcher(None, a, b).ratio() > threshold
    
    unique_spans = []
    for s in spans:
        cleaned = s.strip()
        if not any(is_similar(cleaned, u) for u in unique_spans):
            unique_spans.append(cleaned)
    
    return unique_spans[:6]

def answer_with_rag_llama_multispan(question, top_k=5, verbose=False):
    retrieved = retrieve_with_rerank(question, top_k=top_k)
    combined_context = " ".join([re.sub(r'\s+', ' ', c["content"]) for c in retrieved])
    spans = extract_multi_spans(combined_context)  # uses regex

    if not spans:
        return "⚠️ No clear steps found in context.", retrieved

    final = "\n".join([f"• {s.strip()}" for s in spans])
    return final, retrieved

def route_question_to_best_strategy(question, top_k=5, verbose=False):
    # Use multi-span for procedural questions
    if any(q in question.lower() for q in ["steps", "procedures", "when does", "what happens if", "if the ue"]):
        return answer_with_rag_llama_multispan(question, top_k=top_k, verbose=verbose)
    else:
        return answer_with_rag_llama(question, top_k=top_k, verbose=verbose)

In [8]:
import json
from tqdm import tqdm
from evaluate import load

# Load QA pairs
def load_qa_pairs(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

qa_pairs = load_qa_pairs("3gpp_qa_100_pairs.jsonl")

# Load metrics
squad_metric = load("squad")
rouge = load("rouge")
bleu = load("bleu")

bleu_predictions = []
bleu_references = []
results = []

for sample in tqdm(qa_pairs):
    question = sample["question"]
    reference = sample["answer"]

    try:
        prediction, _ = route_question_to_best_strategy(question)
    except Exception as e:
        print(f"⚠️ Error on: {question}\n{e}")
        prediction = ""

    # Add to metrics
    squad_metric.add(
        prediction={"id": str(hash(question)), "prediction_text": prediction},
        reference={"id": str(hash(question)), "answers": {"text": [reference], "answer_start": [0]}}
    )
    rouge.add(prediction=prediction, reference=reference)
    bleu_predictions.append(prediction)
    bleu_references.append([reference])
    results.append({
        "question": question,
        "reference": reference,
        "prediction": prediction
    })

# Compute final scores
squad_scores = squad_metric.compute()
rouge_scores = rouge.compute()
bleu_score = bleu.compute(predictions=bleu_predictions, references=bleu_references)["bleu"]

# Print results
print("\n📊 Final Evaluation Results (Setup 4 — Cross-Encoder + Compound + Procedural):")
print(f"Exact Match (EM): {squad_scores['exact_match']:.2f}")
print(f"F1 Score        : {squad_scores['f1']:.2f}")
print(f"ROUGE-L         : {rouge_scores['rougeL']:.4f}")
print(f"BLEU            : {bleu_score:.4f}")

  0%|                                                   | 0/100 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  1%|▍                                          | 1/100 [00:03<06:09,  3.73s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  2%|▊                                          | 2/100 [00:05<03:45,  2.30s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  3%|█▎                                         | 3/100 [00:12<07:51,  4.86s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  4%|█▋                                         | 4/100 [00:20<09:37,  6.02s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  5%|██▏                                        | 5/100 [00:24<08:21,  5.28s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  6%|██▌                                        | 6/100 [00:32<09:33,  6.10s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  7%|███                                        | 7/100 [00:40<10:21,  6.68s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  8%|███▍                                       | 8/100 [00:46<09:56,  6.48s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  9%|███▊                                       | 9/100 [00:54<10:29,  6.92s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 10%|████▏                                     | 10/100 [01:01<10:44,  7.16s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 11%|████▌                                     | 11/100 [01:09<10:51,  7.32s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 12%|█████                                     | 12/100 [01:18<11:19,  7.73s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 13%|█████▍                                    | 13/100 [01:26<11:13,  7.75s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 14%|█████▉                                    | 14/100 [01:33<11:06,  7.75s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 15%|██████▎                                   | 15/100 [01:41<11:01,  7.78s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 16%|██████▋                                   | 16/100 [01:42<07:59,  5.71s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 17%|███████▏                                  | 17/100 [01:50<08:39,  6.26s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 18%|███████▌                                  | 18/100 [01:57<09:11,  6.73s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 19%|███████▉                                  | 19/100 [02:06<09:38,  7.14s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 20%|████████▍                                 | 20/100 [02:13<09:43,  7.29s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 21%|████████▊                                 | 21/100 [02:15<07:35,  5.77s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 22%|█████████▏                                | 22/100 [02:23<08:16,  6.37s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 23%|█████████▋                                | 23/100 [02:31<08:40,  6.77s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 24%|██████████                                | 24/100 [02:38<08:50,  6.97s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 25%|██████████▌                               | 25/100 [02:46<09:07,  7.30s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 26%|██████████▉                               | 26/100 [02:47<06:40,  5.41s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 27%|███████████▎                              | 27/100 [02:55<07:34,  6.23s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 28%|███████████▊                              | 28/100 [03:03<08:05,  6.75s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 29%|████████████▏                             | 29/100 [03:11<08:19,  7.03s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 30%|████████████▌                             | 30/100 [03:19<08:26,  7.24s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 31%|█████████████                             | 31/100 [03:20<06:10,  5.37s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 32%|█████████████▍                            | 32/100 [03:21<04:32,  4.00s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 33%|█████████████▊                            | 33/100 [03:21<03:22,  3.02s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 34%|██████████████▎                           | 34/100 [03:28<04:22,  3.98s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 35%|██████████████▋                           | 35/100 [03:29<03:29,  3.22s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 36%|███████████████                           | 36/100 [03:37<04:51,  4.56s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 37%|███████████████▌                          | 37/100 [03:44<05:45,  5.49s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 38%|███████████████▉                          | 38/100 [03:45<04:04,  3.95s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 39%|████████████████▍                         | 39/100 [03:48<03:40,  3.61s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 40%|████████████████▊                         | 40/100 [03:50<03:07,  3.13s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 41%|█████████████████▏                        | 41/100 [03:57<04:25,  4.50s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 42%|█████████████████▋                        | 42/100 [04:05<05:19,  5.51s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 43%|██████████████████                        | 43/100 [04:10<05:07,  5.39s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 44%|██████████████████▍                       | 44/100 [04:18<05:43,  6.13s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 45%|██████████████████▉                       | 45/100 [04:26<06:03,  6.61s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 46%|███████████████████▎                      | 46/100 [04:28<04:41,  5.21s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 47%|███████████████████▋                      | 47/100 [04:35<05:12,  5.89s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 48%|████████████████████▏                     | 48/100 [04:36<03:47,  4.37s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 49%|████████████████████▌                     | 49/100 [04:37<02:55,  3.43s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 50%|█████████████████████                     | 50/100 [04:38<02:14,  2.69s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 51%|█████████████████████▍                    | 51/100 [04:39<01:47,  2.19s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 52%|█████████████████████▊                    | 52/100 [04:47<03:05,  3.87s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 53%|██████████████████████▎                   | 53/100 [04:48<02:20,  2.99s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 54%|██████████████████████▋                   | 54/100 [04:56<03:27,  4.52s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 55%|███████████████████████                   | 55/100 [05:01<03:31,  4.70s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 56%|███████████████████████▌                  | 56/100 [05:03<02:41,  3.67s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 58%|████████████████████████▎                 | 58/100 [05:10<02:25,  3.46s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 59%|████████████████████████▊                 | 59/100 [05:18<03:16,  4.78s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 60%|█████████████████████████▏                | 60/100 [05:26<03:45,  5.63s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 61%|█████████████████████████▌                | 61/100 [05:33<04:02,  6.23s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 62%|██████████████████████████                | 62/100 [05:34<02:57,  4.66s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 63%|██████████████████████████▍               | 63/100 [05:42<03:25,  5.57s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 64%|██████████████████████████▉               | 64/100 [05:50<03:44,  6.25s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 65%|███████████████████████████▎              | 65/100 [05:57<03:50,  6.59s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 66%|███████████████████████████▋              | 66/100 [06:05<03:55,  6.93s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 67%|████████████████████████████▏             | 67/100 [06:10<03:26,  6.26s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 68%|████████████████████████████▌             | 68/100 [06:10<02:23,  4.49s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 69%|████████████████████████████▉             | 69/100 [06:18<02:49,  5.45s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 71%|█████████████████████████████▊            | 71/100 [06:19<01:25,  2.96s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 72%|██████████████████████████████▏           | 72/100 [06:27<02:04,  4.45s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 74%|███████████████████████████████           | 74/100 [06:29<01:07,  2.60s/it]The following generation fla



 75%|███████████████████████████████▌          | 75/100 [06:36<01:41,  4.06s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 76%|███████████████████████████████▉          | 76/100 [06:44<02:03,  5.14s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 77%|████████████████████████████████▎         | 77/100 [06:47<01:40,  4.37s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 78%|████████████████████████████████▊         | 78/100 [06:51<01:38,  4.48s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 79%|█████████████████████████████████▏        | 79/100 [06:59<01:56,  5.56s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 80%|█████████████████████████████████▌        | 80/100 [07:07<02:04,  6.22s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 81%|██████████████████████████████████        | 81/100 [07:15<02:06,  6.66s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 82%|██████████████████████████████████▍       | 82/100 [07:16<01:29,  4.97s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 83%|██████████████████████████████████▊       | 83/100 [07:17<01:05,  3.88s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 84%|███████████████████████████████████▎      | 84/100 [07:23<01:08,  4.29s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 85%|███████████████████████████████████▋      | 85/100 [07:30<01:18,  5.24s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 86%|████████████████████████████████████      | 86/100 [07:33<01:02,  4.48s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 87%|████████████████████████████████████▌     | 87/100 [07:41<01:11,  5.49s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 88%|████████████████████████████████████▉     | 88/100 [07:43<00:53,  4.47s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 89%|█████████████████████████████████████▍    | 89/100 [07:51<01:00,  5.50s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 90%|█████████████████████████████████████▊    | 90/100 [07:58<01:01,  6.16s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 91%|██████████████████████████████████████▏   | 91/100 [07:58<00:39,  4.38s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 92%|██████████████████████████████████████▋   | 92/100 [08:02<00:32,  4.06s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 93%|███████████████████████████████████████   | 93/100 [08:10<00:36,  5.15s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 94%|███████████████████████████████████████▍  | 94/100 [08:17<00:35,  5.95s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 95%|███████████████████████████████████████▉  | 95/100 [08:21<00:26,  5.39s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 96%|████████████████████████████████████████▎ | 96/100 [08:25<00:19,  4.97s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 97%|████████████████████████████████████████▋ | 97/100 [08:27<00:12,  4.04s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 98%|█████████████████████████████████████████▏| 98/100 [08:29<00:06,  3.35s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 99%|█████████████████████████████████████████▌| 99/100 [08:32<00:03,  3.25s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
100%|█████████████████████████████████████████| 100/100 [08:33<00:00,  5.14s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.






📊 Final Evaluation Results (Setup 4 — Cross-Encoder + Compound + Procedural):
Exact Match (EM): 0.00
F1 Score        : 19.51
ROUGE-L         : 0.2018
BLEU            : 0.0092
