# Basic FAISS

## Notebook Summary: RAG + LLaMA-2 with Compound QA Support

This notebook extends centralized RAG-based extractive QA using a fine-tuned LLaMA-2 model to handle compound telecom questions.

### Key Enhancements Over Previous Version:

1. **Compound Question Handling**  
   Detects and splits multi-part queries using logical connectors (e.g., "and", "or"). Each clause is independently answered using the same RAG + LLaMA-2 pipeline.

2. **Multi-Answer Inference**  
   For compound queries, generates multiple precise answers and formats them as bullet-style sub-responses.

3. **Context Validation**  
   Verifies that each sub-answer appears in the retrieved context to ensure factual grounding.

4. **Evaluation**  
   Measures performance on 100 QA pairs using:
   - **Exact Match** / **F1** (SQuAD)
   - **ROUGE-L**
   - **BLEU**

This version improves robustness in handling complex telecom queries and offers more reliable QA outputs for your thesis RAG baseline.

In [1]:
from pathlib import Path
import faiss
import pickle
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from sentence_transformers import SentenceTransformer

In [2]:
# Load FAISS index and chunks
index_path = "/mnt/data/RAG/3gpp_index.faiss"
chunks_path = "/mnt/data/RAG/3gpp_chunks.pkl"

index = faiss.read_index(index_path)
with open(chunks_path, "rb") as f:
    documents = pickle.load(f)

# Load embedding model used for indexing
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
def retrieve_context(query, top_k=3):
    query_emb = embedding_model.encode([query], normalize_embeddings=True)
    D, I = index.search(query_emb.astype("float32"), top_k)
    return [documents[i] for i in I[0]]

In [3]:
SYSTEM_PROMPT = (
    "You are a precise assistant. Extract the exact answer span from the context. "
    "Do not paraphrase, summarize, or add extra information. "
    "The answer must appear exactly in the context."
)

def build_rag_prompt(context_chunks, question):
    combined_context = "\n\n".join([chunk['content'] for chunk in context_chunks])
    user_prompt = (
        f"Context: {combined_context}\n\n"
        f"Question: {question}\n"
        f"Answer from the context only:"
    )
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST]"

model_path = "/mnt/data/llama2_qa_lora_output5/final"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to("cuda")

qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


In [4]:
def clean_prediction(raw_text):
    # Remove everything before the last [INST]
    answer = raw_text.split("[/INST]")[-1].strip()

    # Remove strange characters
    answer = re.sub(r"[^\w\s\-.,:/()]", "", answer)

    # Remove repeating phrases like "The key is... The key is... The key is..."
    answer = re.sub(r'(\b.+?:)(\s*\1)+', r'\1', answer)

    # Trim repetitive word loops (e.g., "structured as follows" x 5)
    tokens = answer.split()
    for i in range(1, len(tokens) // 2):
        if tokens[:i] == tokens[i:2*i]:
            answer = " ".join(tokens[:i])
            break

    # Optionally truncate to sentence boundary
    sentence_end = re.search(r'[.?!]', answer)
    if sentence_end:
        answer = answer[:sentence_end.end()]

    return answer.strip()

In [5]:
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk

STOPWORDS = set(stopwords.words("english"))

import re

def normalize(text):
    return re.sub(r'\W+', ' ', text.lower())

def lexical_overlap(query, chunk):
    q_tokens = set(normalize(query).split()) - STOPWORDS
    c_tokens = set(normalize(chunk).split()) - STOPWORDS
    return len(q_tokens & c_tokens) / (len(q_tokens | c_tokens) + 1e-5)

def tfidf_score(query, chunk, vectorizer=None):
    docs = [query, chunk]
    if not vectorizer:
        vectorizer = TfidfVectorizer().fit(docs)
    vecs = vectorizer.transform(docs)
    return (vecs[0] @ vecs[1].T).A[0][0]

In [6]:
def rerank_chunks(chunks, query, alpha_overlap=0.7, beta_faiss=0.3, top_k=3):
    vectorizer = TfidfVectorizer().fit([query] + [c["content"] for c in chunks])
    reranked = []

    for idx, c in enumerate(chunks):
        overlap = lexical_overlap(query, c["content"])
        tfidf_sim = tfidf_score(query, c["content"], vectorizer)
        faiss_rank_bonus = (len(chunks) - idx) / len(chunks)

        # Final rerank score = weighted combination
        score = alpha_overlap * overlap + (1 - alpha_overlap) * tfidf_sim + beta_faiss * faiss_rank_bonus

        reranked.append((score, c))

    reranked.sort(reverse=True, key=lambda x: x[0])
    return [c for _, c in reranked[:top_k]]

In [7]:
def split_compound_question(q):
    parts = re.split(r"\band\b|\bor\b|[,;]", q)
    return [p.strip() for p in parts if len(p.strip().split()) > 3]


def answer_with_rag_llama(question, top_k=5, verbose=False):
    initial_chunks = retrieve_context(question, top_k=10)  # Increase FAISS recall
    retrieved = rerank_chunks(initial_chunks, question, top_k=top_k)  # Rerank + filter

    sub_qs = split_compound_question(question)

    # Handle compound question (multi-prompt)
    if len(sub_qs) > 1:
        answers = []
        for sq in sub_qs:
            sub_prompt = build_rag_prompt(retrieved, sq)
            raw = qa_pipeline(
                sub_prompt, 
                max_new_tokens=160, 
                do_sample=False, 
                eos_token_id=tokenizer.eos_token_id, 
                pad_token_id=tokenizer.eos_token_id
            )[0]["generated_text"]

            ans = clean_prediction(raw)
            answers.append(f"→ {sq}: {ans}")

        full_answer = "\n".join(answers)

        # Context containment check (on full answer)
        all_context = " ".join([c["content"] for c in retrieved])
        if not any(ans.split(": ", 1)[-1] in all_context for ans in answers):
            print("🚨 One or more sub-answers not found in context — check retrieval or generation.")
        return full_answer, retrieved

    # Handle simple (single-clause) question
    prompt = build_rag_prompt(retrieved, question)
    raw_output = qa_pipeline(
        prompt, 
        max_new_tokens=160, 
        do_sample=False, 
        eos_token_id=tokenizer.eos_token_id, 
        pad_token_id=tokenizer.eos_token_id
    )[0]["generated_text"]

    answer = clean_prediction(raw_output)

    # Sanity check
    if len(answer.split()) < 2 or len(answer.split()) > 40:
        print("⚠️ Warning: Possibly bad output. Check content or retrieval.")

    # Context containment validation
    all_context = " ".join([c["content"] for c in retrieved])
    if answer not in all_context:
        print("🚨 Answer not found in retrieved context — check prompt or retrieval quality.")

    if verbose:
        print("📌 Prompt:\n", prompt)
        print("\n🧾 Raw Output:\n", raw_output)
        print("\n✅ Cleaned Answer:", answer)
        for i, chunk in enumerate(retrieved):
            print(f"\n--- Context {i+1} ---")
            print(chunk["content"])

    return answer, retrieved

In [8]:
import json
from tqdm import tqdm
from evaluate import load

# Load QA pairs
def load_qa_pairs(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

qa_pairs = load_qa_pairs("3gpp_qa_100_pairs.jsonl")

# Load metrics
squad_metric = load("squad")
rouge = load("rouge")
bleu = load("bleu")

bleu_predictions = []
bleu_references = []
results = []

for sample in tqdm(qa_pairs):
    question = sample["question"]
    reference = sample["answer"]

    try:
        prediction, _ = answer_with_rag_llama(question)
    except Exception as e:
        print(f"⚠️ Error on: {question}\n{e}")
        prediction = ""

    # Add to metrics
    squad_metric.add(
        prediction={"id": str(hash(question)), "prediction_text": prediction},
        reference={"id": str(hash(question)), "answers": {"text": [reference], "answer_start": [0]}}
    )
    rouge.add(prediction=prediction, reference=reference)
    bleu_predictions.append(prediction)
    bleu_references.append([reference])
    results.append({
        "question": question,
        "reference": reference,
        "prediction": prediction
    })

# Compute final scores
squad_scores = squad_metric.compute()
rouge_scores = rouge.compute()
bleu_score = bleu.compute(predictions=bleu_predictions, references=bleu_references)["bleu"]

# Print results
print("\n📊 Final Evaluation Results (Compound QA Enabled):")
print(f"Exact Match (EM): {squad_scores['exact_match']:.2f}")
print(f"F1 Score        : {squad_scores['f1']:.2f}")
print(f"ROUGE-L         : {rouge_scores['rougeL']:.4f}")
print(f"BLEU            : {bleu_score:.4f}")

  0%|                                                   | 0/100 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  1%|▍                                          | 1/100 [00:08<13:56,  8.45s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  2%|▊                                          | 2/100 [00:16<13:13,  8.10s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  3%|█▎                                         | 3/100 [00:24<12:56,  8.01s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  4%|█▋                                         | 4/100 [00:25<08:18,  5.19s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  5%|██▏                                        | 5/100 [00:29<07:30,  4.74s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  6%|██▌                                        | 6/100 [00:36<09:00,  5.75s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  7%|███                                        | 7/100 [00:44<09:52,  6.37s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


  8%|███▍                                       | 8/100 [00:50<09:33,  6.24s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
  9%|███▊                                       | 9/100 [00:58<10:14,  6.75s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 10%|████▏                                     | 10/100 [00:59<07:35,  5.06s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 11%|████▌                                     | 11/100 [01:06<08:36,  5.80s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 12%|█████                                     | 12/100 [01:15<09:42,  6.62s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 13%|█████▍                                    | 13/100 [01:23<10:04,  6.95s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 14%|█████▉                                    | 14/100 [01:30<10:14,  7.14s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 15%|██████▎                                   | 15/100 [01:31<07:32,  5.33s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 16%|██████▋                                   | 16/100 [01:34<06:24,  4.58s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 17%|███████▏                                  | 17/100 [01:42<07:31,  5.44s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 18%|███████▌                                  | 18/100 [01:44<06:04,  4.44s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 19%|███████▉                                  | 19/100 [01:52<07:23,  5.48s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 20%|████████▍                                 | 20/100 [01:59<08:08,  6.11s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 21%|████████▊                                 | 21/100 [02:02<06:43,  5.10s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 22%|█████████▏                                | 22/100 [02:03<04:59,  3.84s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 23%|█████████▋                                | 23/100 [02:04<03:49,  2.98s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 24%|██████████                                | 24/100 [02:05<03:08,  2.48s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 25%|██████████▌                               | 25/100 [02:13<05:08,  4.12s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 26%|██████████▉                               | 26/100 [02:14<03:55,  3.19s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 27%|███████████▎                              | 27/100 [02:21<05:03,  4.16s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 28%|███████████▊                              | 28/100 [02:28<06:21,  5.29s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 29%|████████████▏                             | 29/100 [02:36<07:06,  6.00s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 30%|████████████▌                             | 30/100 [02:44<07:34,  6.49s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 31%|█████████████                             | 31/100 [02:45<05:33,  4.83s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 32%|█████████████▍                            | 32/100 [02:52<06:24,  5.66s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 33%|█████████████▊                            | 33/100 [03:00<07:00,  6.28s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 34%|██████████████▎                           | 34/100 [03:05<06:23,  5.81s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 35%|██████████████▋                           | 35/100 [03:12<06:53,  6.36s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 36%|███████████████                           | 36/100 [03:19<06:48,  6.39s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 37%|███████████████▌                          | 37/100 [03:26<07:05,  6.76s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 38%|███████████████▉                          | 38/100 [03:34<07:12,  6.97s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 39%|████████████████▍                         | 39/100 [03:41<07:15,  7.14s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 40%|████████████████▊                         | 40/100 [03:43<05:19,  5.32s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 41%|█████████████████▏                        | 41/100 [03:44<04:09,  4.24s/it]The following generation fla

🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 43%|██████████████████                        | 43/100 [03:59<05:27,  5.74s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 44%|██████████████████▍                       | 44/100 [04:07<05:54,  6.33s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 45%|██████████████████▉                       | 45/100 [04:14<06:10,  6.74s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 46%|███████████████████▎                      | 46/100 [04:22<06:16,  6.98s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 47%|███████████████████▋                      | 47/100 [04:29<06:18,  7.15s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 48%|████████████████████▏                     | 48/100 [04:37<06:21,  7.33s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 49%|████████████████████▌                     | 49/100 [04:38<04:40,  5.49s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 50%|█████████████████████                     | 50/100 [04:39<03:25,  4.11s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 51%|█████████████████████▍                    | 51/100 [04:40<02:38,  3.23s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 52%|█████████████████████▊                    | 52/100 [04:41<02:04,  2.59s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 53%|██████████████████████▎                   | 53/100 [04:49<03:11,  4.08s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 54%|██████████████████████▋                   | 54/100 [04:52<02:49,  3.67s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 55%|███████████████████████                   | 55/100 [04:59<03:39,  4.87s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 56%|███████████████████████▌                  | 56/100 [05:00<02:43,  3.70s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 57%|███████████████████████▉                  | 57/100 [05:02<02:09,  3.01s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 58%|████████████████████████▎                 | 58/100 [05:09<03:01,  4.33s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 59%|████████████████████████▊                 | 59/100 [05:17<03:40,  5.39s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 60%|█████████████████████████▏                | 60/100 [05:24<04:00,  6.02s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 61%|█████████████████████████▌                | 61/100 [05:32<04:12,  6.47s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 62%|██████████████████████████                | 62/100 [05:33<03:02,  4.81s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 63%|██████████████████████████▍               | 63/100 [05:41<03:29,  5.66s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 64%|██████████████████████████▉               | 64/100 [05:48<03:46,  6.28s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 65%|███████████████████████████▎              | 65/100 [05:56<03:52,  6.65s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 66%|███████████████████████████▋              | 66/100 [05:57<02:47,  4.93s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 67%|████████████████████████████▏             | 67/100 [06:04<03:09,  5.74s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.




 68%|████████████████████████████▌             | 68/100 [06:12<03:22,  6.32s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 69%|████████████████████████████▉             | 69/100 [06:20<03:28,  6.73s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 70%|█████████████████████████████▍            | 70/100 [06:21<02:29,  4.98s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 71%|█████████████████████████████▊            | 71/100 [06:28<02:48,  5.79s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 72%|██████████████████████████████▏           | 72/100 [06:36<02:59,  6.43s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 73%|██████████████████████████████▋           | 73/100 [06:45<03:11,  7.09s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 74%|███████████████████████████████           | 74/100 [06:53<03:15,  7.54s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 75%|███████████████████████████████▌          | 75/100 [07:01<03:08,  7.52s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 76%|███████████████████████████████▉          | 76/100 [07:09<03:01,  7.56s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 77%|████████████████████████████████▎         | 77/100 [07:13<02:29,  6.52s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 78%|████████████████████████████████▊         | 78/100 [07:20<02:30,  6.83s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 79%|█████████████████████████████████▏        | 79/100 [07:28<02:30,  7.17s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 80%|█████████████████████████████████▌        | 80/100 [07:29<01:46,  5.32s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 81%|██████████████████████████████████        | 81/100 [07:37<01:55,  6.07s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 82%|██████████████████████████████████▍       | 82/100 [07:38<01:22,  4.56s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 83%|██████████████████████████████████▊       | 83/100 [07:46<01:35,  5.62s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 84%|███████████████████████████████████▎      | 84/100 [07:51<01:24,  5.28s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 85%|███████████████████████████████████▋      | 85/100 [07:53<01:07,  4.47s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 86%|████████████████████████████████████      | 86/100 [08:01<01:15,  5.39s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 87%|████████████████████████████████████▌     | 87/100 [08:09<01:19,  6.11s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 88%|████████████████████████████████████▉     | 88/100 [08:16<01:17,  6.43s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 89%|█████████████████████████████████████▍    | 89/100 [08:19<00:58,  5.36s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 90%|█████████████████████████████████████▊    | 90/100 [08:26<01:00,  6.06s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 91%|██████████████████████████████████████▏   | 91/100 [08:34<00:59,  6.56s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 92%|██████████████████████████████████████▋   | 92/100 [08:42<00:55,  6.94s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 93%|███████████████████████████████████████   | 93/100 [08:49<00:49,  7.09s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 94%|███████████████████████████████████████▍  | 94/100 [08:57<00:43,  7.27s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 95%|███████████████████████████████████████▉  | 95/100 [09:05<00:36,  7.35s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
 96%|████████████████████████████████████████▎ | 96/100 [09:07<00:23,  5.91s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 97%|████████████████████████████████████████▋ | 97/100 [09:15<00:19,  6.44s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 98%|█████████████████████████████████████████▏| 98/100 [09:23<00:13,  6.84s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🚨 Answer not found in retrieved context — check prompt or retrieval quality.


 99%|█████████████████████████████████████████▌| 99/100 [09:30<00:07,  7.11s/it]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
100%|█████████████████████████████████████████| 100/100 [09:38<00:00,  5.78s/it]

🚨 Answer not found in retrieved context — check prompt or retrieval quality.






📊 Final Evaluation Results (Compound QA Enabled):
Exact Match (EM): 0.00
F1 Score        : 21.39
ROUGE-L         : 0.2242
BLEU            : 0.0231
