# Retrieval + Saliency rewriting with semantic deduplication + Prompt building + LLaMA-2 QA pipeline + Output cleaning + Confidence-weighted answer fusion

## Notebook Summary: Confidence-Weighted Semantic Fusion RAG (Setup 10)

This notebook implements a refined RAG QA architecture where multiple rewritten chunks are processed independently and their answers fused semantically for better accuracy, redundancy control, and resilience to hallucination.

### Key Components:

1. **Cross-Encoder Reranking + Saliency Rewriting**  
   FAISS retrieves 2 × `top_k` candidate chunks, a cross-encoder reranks them, and each kept chunk is rewritten to its top-3 TF-IDF-scored sentences. Near-duplicate rewrites (embedding cosine similarity ≥ 0.95) are dropped.

2. **Independent Per-Chunk Answer Generation**  
   Each salient chunk is individually passed to a LoRA-fine-tuned LLaMA-2 QA model, generating separate predictions for each.

3. **Semantic Clustering + Answer Fusion**  
   Answers are grouped greedily by embedding cosine similarity (≥ 0.85). The largest cluster wins, with ties broken by the longest answer a cluster contains, and the longest answer in the winning cluster is returned.

4. **Sentence Boundary Optimization**  
   Uses a Punkt tokenizer enhanced with telecom-specific abbreviations to accurately split sentences before scoring.

5. **Evaluation Metrics**  
   Final answers are evaluated against a 100-question telecom benchmark using:
   - **Exact Match (EM)** and **F1** (SQuAD)
   - **ROUGE-L**
   - **BLEU**

This configuration (Setup 10) is intended to improve precision and reduce hallucination by ensembling answers across independently processed context chunks, with high-assurance telecom QA tasks as the target use case.
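
The fusion rule in component 3 can be sketched without any model by using hand-made 2-D vectors in place of sentence embeddings. This is a minimal illustration, assuming greedy single-pass clustering against each cluster's first member; `fuse`, `cosine`, and `toy_vectors` are hypothetical names used only here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def fuse(answers, vectors, sim_threshold=0.85):
    """Greedy clustering: each unused answer seeds a cluster and absorbs
    later answers whose vector is within the threshold of the seed."""
    clusters, used = [], set()
    for i in range(len(answers)):
        if i in used:
            continue
        group = [i]
        used.add(i)
        for j in range(i + 1, len(answers)):
            if j not in used and cosine(vectors[i], vectors[j]) >= sim_threshold:
                group.append(j)
                used.add(j)
        clusters.append(group)
    # Largest cluster wins; ties go to the cluster holding the longest answer.
    best = max(clusters, key=lambda g: (len(g), max(len(answers[k]) for k in g)))
    # Within the winning cluster, return the longest answer.
    return max((answers[k] for k in best), key=len)

answers = ["the AMF", "the AMF selects the SMF", "the UPF"]
toy_vectors = [(1.0, 0.0), (0.95, 0.1), (0.0, 1.0)]  # hand-made stand-ins for embeddings
fused = fuse(answers, toy_vectors)
print(fused)
```

The first two answers fall into one cluster and the third stands alone, so the longer of the first two is returned. Note that the rule counts cluster members and compares lengths; it does not use a model confidence score.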

In [1]:
# Imports
import faiss, pickle, torch, re, json
import numpy as np
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, CrossEncoder
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [2]:
# Load FAISS index and chunked docs
index = faiss.read_index("/mnt/data/RAG/3gpp_index.faiss")
with open("/mnt/data/RAG/3gpp_chunks.pkl", "rb") as f:
    documents = pickle.load(f)

# Load models
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

In [3]:
import json
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Load abbreviation map
with open("abbreviation_master_map.json", "r") as f:
    abbrev_dict = json.load(f)

# Convert keys to lowercase and strip dots
abbrevs = set(k.lower().strip(".") for k in abbrev_dict.keys())

In [4]:
punkt_param = PunktParameters()
punkt_param.abbrev_types = abbrevs
sentence_splitter = PunktSentenceTokenizer(punkt_param)

def sent_tokenize(text):
    return sentence_splitter.tokenize(text.strip())

In [5]:
# Retrieval with reranking
def retrieve_with_rerank(query, top_k=5):
    query_vec = embedding_model.encode(query, normalize_embeddings=True)
    query_vec = np.array(query_vec).reshape(1, -1).astype("float32")
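    D, I = index.search(query_vec, top_k * 2)
    # FAISS pads missing neighbours with -1; skip them so documents[-1] is never picked by mistake
    initial_results = [documents[i] for i in I[0] if i >= 0]
    if not initial_results:
        return []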
    pairs = [(query, doc["content"]) for doc in initial_results]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(scores, initial_results), key=lambda x: x[0], reverse=True)[:top_k]
    return [doc for _, doc in reranked]
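
The over-fetch-then-rerank pattern used in `retrieve_with_rerank` can be shown on toy data. This is only a sketch: `rerank`, `first_stage`, and `toy_scores` are hypothetical stand-ins for the FAISS hits and the cross-encoder scores.

```python
def rerank(candidates, score_fn, top_k):
    """Rescore first-stage candidates and keep the top_k highest."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

# First-stage order as a vector index might return it (2 * top_k items).
first_stage = ["doc_a", "doc_b", "doc_c", "doc_d"]
# Hypothetical second-stage relevance scores.
toy_scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.1, "doc_d": 0.7}

top = rerank(first_stage, toy_scores.get, top_k=2)
print(top)
```

Fetching twice as many candidates as needed gives the second-stage scorer room to promote items that the first stage ranked low.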

In [6]:
# Saliency rewriting with semantic deduplication
def extract_salient_sentences(chunk_text, query, max_sentences=3):
    sentences = sent_tokenize(chunk_text)
    if len(sentences) <= max_sentences:
        return chunk_text.strip()
    vectorizer = TfidfVectorizer().fit([query] + sentences)
    query_vec = vectorizer.transform([query])
    sentence_vecs = vectorizer.transform(sentences)
    scores = (sentence_vecs @ query_vec.T).toarray().flatten()
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:max_sentences]
    salient = [sentences[i] for i in sorted(top_indices)]
    return " ".join(salient)

def rewrite_chunks_for_saliency(chunks, query, max_sentences=3, sim_threshold=0.95):
    seen_embeddings = []
    filtered_chunks = []
    for chunk in chunks:
        rewritten = extract_salient_sentences(chunk["content"], query, max_sentences).strip()
        if not rewritten:
            continue
        emb = embedding_model.encode([rewritten])[0]
        is_duplicate = any(
            cosine_similarity([emb], [prev_emb])[0][0] >= sim_threshold
            for prev_emb in seen_embeddings
        )
        if not is_duplicate:
            seen_embeddings.append(emb)
            filtered_chunks.append({
                "content": rewritten,
                "source": chunk.get("source", "unknown")
            })
    return filtered_chunks
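
The select-then-restore-order step in `extract_salient_sentences` is easy to miss, so here is a small sketch of it. It swaps TF-IDF for a plain word-overlap count to stay dependency-free; `overlap_score` and `top_k_in_document_order` are hypothetical names, and the scoring is a stand-in rather than the notebook's method.

```python
import re

def overlap_score(sentence, query):
    """Stand-in relevance score: number of distinct query words in the sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    return len(words & set(re.findall(r"\w+", query.lower())))

def top_k_in_document_order(sentences, scores, k):
    """Pick the k highest-scoring sentences, then restore original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

query = "selects the SMF UPF"
sentences = [
    "The AMF selects the SMF.",
    "Registration is handled first.",
    "The SMF then selects the UPF.",
    "Which function is used depends on policy.",
]
scores = [overlap_score(s, query) for s in sentences]
kept = top_k_in_document_order(sentences, scores, k=2)
print(kept)
```

The third sentence scores highest, yet the output keeps the first sentence ahead of it, so the rewritten chunk still reads in source order.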

In [7]:
# Prompt builder
SYSTEM_PROMPT = (
    "You are a precise assistant. Extract the exact answer span from the context. "
    "Do not paraphrase, summarize, or add extra information. "
    "The answer must appear exactly in the context. "
    "If the context lists multiple conditions, actions, or branches, include them all as written. "
    "Do not summarize or paraphrase — copy the exact text from the context, line by line."
)

def build_fusion_prompt(context_chunks, question):
    context_lines = []
    for chunk in context_chunks:
        source = chunk.get("source", "unknown").split("/")[-1]
        context_lines.append(f"[Source: {source}]\n-----\n{chunk['content'].strip()}")
    fused_context = "\n\n".join(context_lines)
    user_prompt = f"Context:\n{fused_context}\n\nQuestion: {question}\nAnswer from the context only:"
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_prompt} [/INST]"

In [8]:
# Output cleaning
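def clean_prediction(raw_text):
    # Keep only the text generated after the instruction block
    answer = raw_text.split("[/INST]")[-1].strip()
    # Drop characters outside a small whitelist (this also removes '?' and '!')
    answer = re.sub(r"[^\w\s\-.,:/()]", "", answer)
    # Collapse immediately repeated "label:" prefixes
    answer = re.sub(r'(\b.+?:)(\s*\1)+', r'\1', answer)
    # If the answer opens with a phrase repeated back-to-back, keep one copy
    tokens = answer.split()
    for i in range(1, len(tokens) // 2 + 1):  # + 1 so an exact doubling is also caught
        if tokens[:i] == tokens[i:2*i]:
            answer = " ".join(tokens[:i])
            break
    # Truncate at the first '.'; note this also cuts dotted identifiers such as "23.501"
    sentence_end = re.search(r'[.?!]', answer)
    if sentence_end:
        answer = answer[:sentence_end.end()]
    return answer.strip()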
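
The last step of `clean_prediction` cuts the answer at the first sentence terminator. The check below shows what that does to dotted identifiers, which are common in 3GPP text. The second pattern is a hypothetical alternative offered for comparison, not the notebook's behaviour, and `cut_at_first_terminator` is a helper introduced only for this sketch.

```python
import re

def cut_at_first_terminator(text, pattern):
    """Return text up to and including the first match of pattern."""
    match = re.search(pattern, text)
    return text[:match.end()] if match else text

sample = "See TS 23.501 clause 5.2 for details. Extra text."

# Pattern used in clean_prediction: any '.', '?' or '!'.
naive = cut_at_first_terminator(sample, r"[.?!]")

# Hypothetical alternative: the terminator must be followed by whitespace or end of text.
boundary = cut_at_first_terminator(sample, r"[.?!](?=\s|$)")

print(naive)     # See TS 23.
print(boundary)  # See TS 23.501 clause 5.2 for details.
```

Because exact-match scoring compares whole strings, truncation inside a specification number is enough to turn an otherwise correct span into a miss.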

In [9]:
# Load fine-tuned LLaMA-2 QA pipeline
model_path = "/mnt/data/llama2_qa_lora_output5/final"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to("cuda")
qa_pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer)

In [10]:
# Answer similarity + fusion
def string_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuse_answers_semantically(answers, sim_threshold=0.85):
    if not answers:
        return ""

    embeddings = embedding_model.encode(answers)
    clusters = []
    used = set()

    for i, emb in enumerate(embeddings):
        if i in used:
            continue
        group = [i]
        for j in range(i + 1, len(embeddings)):
            if j not in used and cosine_similarity([emb], [embeddings[j]])[0][0] >= sim_threshold:
                group.append(j)
                used.add(j)
        used.add(i)
        clusters.append(group)

    # Pick the cluster with most members, return longest answer
    clusters.sort(key=lambda g: (len(g), max(len(answers[i]) for i in g)), reverse=True)
    best_cluster = clusters[0]
    best_answer = max([answers[i] for i in best_cluster], key=len)
    return best_answer.strip()

In [11]:
# Main QA function with answer fusion
def answer_with_confidence_fusion_rag(question, top_k=5, max_sentences=3, verbose=False):
    chunks = retrieve_with_rerank(question, top_k=top_k)
    salient_chunks = rewrite_chunks_for_saliency(chunks, question, max_sentences=max_sentences)

    generated_answers = []
    for i, chunk in enumerate(salient_chunks):
        prompt = build_fusion_prompt([chunk], question)
        output = qa_pipeline(
            prompt,
            max_new_tokens=160,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id
        )[0]["generated_text"]

        answer = clean_prediction(output)
        if answer:
            generated_answers.append(answer)

        if verbose:
            print(f"\n--- Prompt for Context {i+1} ---\n{prompt[:500]}...\n")
            print(f"Answer: {answer}\n")

    final_answer = fuse_answers_semantically(generated_answers)

    if verbose:
        print("🔁 All Answers:", generated_answers)
        print("✅ Final Fused Answer:", final_answer)

    return final_answer, generated_answers, salient_chunks

In [12]:
import json
from tqdm import tqdm
from evaluate import load

# Load QA pairs
def load_qa_pairs(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

qa_pairs = load_qa_pairs("3gpp_qa_100_pairs.jsonl")

# Load metrics
squad_metric = load("squad")
rouge = load("rouge")
bleu = load("bleu")

bleu_predictions = []
bleu_references = []
results = []

for sample in tqdm(qa_pairs):
    question = sample["question"]
    reference = sample["answer"]

    try:
        prediction, _, _ = answer_with_confidence_fusion_rag(question)
    except Exception as e:
        print(f"⚠️ Error on: {question}\n{e}")
        prediction = ""

    # Add to metrics
    squad_metric.add(
        prediction={"id": str(hash(question)), "prediction_text": prediction},
        reference={"id": str(hash(question)), "answers": {"text": [reference], "answer_start": [0]}}
    )
    rouge.add(prediction=prediction, reference=reference)
    bleu_predictions.append(prediction)
    bleu_references.append([reference])
    results.append({
        "question": question,
        "reference": reference,
        "prediction": prediction
    })

# Compute final scores
squad_scores = squad_metric.compute()
rouge_scores = rouge.compute()
bleu_score = bleu.compute(predictions=bleu_predictions, references=bleu_references)["bleu"]

# Print results
print("\n📊 Final Evaluation Results (Setup 10 — Confidence-Weighted Semantic Fusion):")
print(f"Exact Match (EM): {squad_scores['exact_match']:.2f}")
print(f"F1 Score        : {squad_scores['f1']:.2f}")
print(f"ROUGE-L         : {rouge_scores['rougeL']:.4f}")
print(f"BLEU            : {bleu_score:.4f}")

  0%|                                                   | 0/100 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


📊 Final Evaluation Results (Setup 10 — Confidence-Weighted Semantic Fusion):
Exact Match (EM): 0.00
F1 Score        : 16.50
ROUGE-L         : 0.1781
BLEU            : 0.0144
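
An EM of 0.00 next to a non-zero F1 is possible because EM requires the normalized strings to be identical, while F1 gives credit for partial token overlap. The sketch below follows the usual SQuAD-style normalization (lowercase, drop punctuation and articles). It is a simplified re-implementation for illustration, not the code inside the `evaluate` metric, and it returns values on a 0 to 1 scale, whereas the scores printed above are percentages.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The AMF selects the SMF"
prediction = "AMF selects an SMF instance"  # made-up example pair
em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
print(em, round(f1, 3))
```

In this made-up pair a single extra token is enough to zero the exact match while the F1 stays high, which is why the two metrics can diverge so sharply on extractive answers.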
