In [2]:
# --- Install dependencies (run once) ---
# In Jupyter, comment this line out once the packages are installed:
!pip install datasets sentence-transformers faiss-cpu transformers accelerate tqdm

import os
import re
import json
import pickle
import random
import numpy as np
import torch
import faiss

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm

Collecting datasets
  Using cached datasets-4.2.0-py3-none-any.whl.metadata (18 kB)
Collecting sentence-transformers
  Using cached sentence_transformers-5.1.1-py3-none-any.whl.metadata (16 kB)
Collecting faiss-cpu
  Using cached faiss_cpu-1.12.0-cp310-cp310-win_amd64.whl.metadata (5.2 kB)
Collecting transformers
  Using cached transformers-4.57.1-py3-none-any.whl.metadata (43 kB)
Collecting accelerate
  Using cached accelerate-1.10.1-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=21.0.0 (from datasets)
  Using cached pyarrow-21.0.0-cp310-cp310-win_amd64.whl.metadata (3.4 kB)
Collecting dill<0.4.1,>=0.3.0 (from datasets)
  Using cached dill-0.4.0-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets)
  Using cached pandas-2.3.3-cp310-cp310-win_amd64.whl.metadata (19 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting huggingface-hub<2.0,>=0.25.0 (from datasets)
  Using cached hugging

In [3]:
SAVE_DIR = "./timeqa_tempralm"
os.makedirs(SAVE_DIR, exist_ok=True)
print("✅ Save directory set to:", SAVE_DIR)

✅ Save directory set to: ./timeqa_tempralm


Retrieval-side preprocessing stage of a Temporal RAG pipeline
Data ingestion + temporal feature extraction

This code loads the TimeQA dataset (a temporal question-answering dataset), extracts a year from each question and each context passage, adds those years to every example as metadata, and saves the processed dataset locally.

The extract_year function searches a text for the first four-digit year starting with 19 or 20 and returns it as an integer, or None if no such year is found. Years before 1900 are therefore never matched.
The add_years function extracts the year from the question (q_year) and from the context (d_year). If the question has no year but the context does, the context year is copied to q_year. It adds two new fields:
"query_year": the first year found in the question (or the context year as a fallback)
"doc_year": the first year found in the context

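To make the matching behaviour concrete, the sketch below applies the pattern from the next cell to a few strings and contrasts it with a word-bounded variant. The `bounded_regex` pattern, the `first_year` helper, and the third sample string are illustrative assumptions, not part of the original pipeline.

```python
import re

# Pattern used in the notebook: any "19xx" or "20xx" run, even inside a longer number.
year_regex = re.compile(r"(19|20)\d{2}")

# Hypothetical alternative (an assumption, not the notebook's code):
# a standalone four-digit year between 1000 and 2099.
bounded_regex = re.compile(r"\b(1\d{3}|20\d{2})\b")

def first_year(pattern, text):
    """Return the first year matched by `pattern` in `text`, or None."""
    if not text:
        return None
    m = pattern.search(text)
    return int(m.group(0)) if m else None

samples = [
    "Which position did Knox Cunningham hold from May 1955 to Apr 1956?",
    "What was the position of Bernard A. Maguire from 1866 to 1870?",
    "Record 4519871 was archived.",  # made-up string with an embedded "1987"
]
for s in samples:
    print(first_year(year_regex, s), first_year(bounded_regex, s), "|", s)
```

With the notebook's pattern, the 1866 question yields no year and the record number yields 1987; the bounded variant returns 1866 and None for those two strings.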
In [4]:
print("Loading dataset...")
ds = load_dataset("hugosousa/TimeQA")

year_regex = re.compile(r"(19|20)\d{2}")

def extract_year(text):
    if not text:
        return None
    m = year_regex.search(text)
    return int(m.group(0)) if m else None

def add_years(example):
    q_year = extract_year(example["question"])
    d_year = extract_year(example["context"])
    if q_year is None and d_year is not None:
        q_year = d_year
    example["query_year"] = q_year
    example["doc_year"] = d_year
    return example

ds = ds.map(add_years)
print("Temporal metadata extracted.")
print(ds["train"][0])

# Save processed dataset
ds.save_to_disk(f"{SAVE_DIR}/timeqa_with_years")
print("Saved processed dataset.")


Loading dataset...


README.md:   0%|          | 0.00/449 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


train.json:   0%|          | 0.00/300M [00:00<?, ?B/s]



dev.json:   0%|          | 0.00/65.5M [00:00<?, ?B/s]



test.json:   0%|          | 0.00/64.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/28989 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/6108 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6075 [00:00<?, ? examples/s]

Map:   0%|          | 0/28989 [00:00<?, ? examples/s]

Map:   0%|          | 0/6108 [00:00<?, ? examples/s]

Map:   0%|          | 0/6075 [00:00<?, ? examples/s]

Temporal metadata extracted.
{'targets': ['Ulster Unionist MP for South Antrim'], 'level': 'easy', 'question': 'Which position did Knox Cunningham hold from May 1955 to Apr 1956?', 'idx': '/wiki/Knox_Cunningham#P39#0', 'context': 'Knox Cunningham Sir Samuel Knox Cunningham , 1st Baronet , QC ( 3 April 1909 – 29 July 1976 ) , was a Northern Irish barrister , businessman and politician . As an Ulster Unionist politician at a time when the Unionists were part of the Conservative Party , he was also a significant figure in United Kingdom politics as Parliamentary Private Secretary to Harold Macmillan . His nephew was Sir Josias Cunningham . Early career . Cunningham was from an Ulster family . His father was Samuel Cunningham , and his mother was Janet Muir Knox ( nee McCosh ) of Dalry , Ayrshire . His elder brothers were Colonel James Glencairn Cunningham , Josias Cunningham stockbroker , Dunlop McCosh Cunningham owner of Murrays tobacco works , Belfast . He was sent to the Royal Belfast 

Saving the dataset (0/1 shards):   0%|          | 0/28989 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/6108 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/6075 [00:00<?, ? examples/s]

Saved processed dataset.


Retrieval data preparation phase - Passage building / indexing input prep

This code formats the contexts from the TimeQA dataset as temporally tagged passages that can later be embedded and indexed for retrieval in a RAG system. The train-split passages and their years are then pickled for the indexing step.

The build_passage function takes one example (row) from the dataset.
If the context has a known year (doc_year), it prepends a temporal token, for example: [DATE: 1989] The Berlin Wall fell in 1989...
If there is no year, it uses [DATE: unknown].
It returns a dictionary with a new key "passage" that stores this formatted text.
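A small illustration of the tagging on toy rows. The two example dictionaries are invented for this sketch; the function body restates the logic of the cell below.

```python
def build_passage(example):
    # Same logic as the notebook cell: prefix the context with a date token.
    year = example.get("doc_year")
    year_token = f"[DATE: {year}]" if year else "[DATE: unknown]"
    return {"passage": f"{year_token} {example.get('context', '')}"}

# Toy rows, invented for illustration only.
rows = [
    {"doc_year": 1989, "context": "The Berlin Wall fell in 1989."},
    {"doc_year": None, "context": "Ecgfrith was King of Northumbria."},
]
for row in rows:
    print(build_passage(row)["passage"])
```

Because doc_year is the first year matched in the context, a biography is usually tagged with the first date it mentions, often the subject's birth year, as the [DATE: 1909] example in the cell output shows.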

In [5]:
def build_passage(example):
    year_token = f"[DATE: {example['doc_year']}]" if example.get("doc_year") else "[DATE: unknown]"
    passage = f"{year_token} {example.get('context','')}"
    return {"passage": passage}

ds = ds.map(build_passage)
train_passages = [x["passage"] for x in ds["train"]]
train_years = [x["doc_year"] for x in ds["train"]]

with open(f"{SAVE_DIR}/train_passages.pkl", "wb") as f:
    pickle.dump({"passages": train_passages, "years": train_years}, f)

print("✅ Passages built and saved. Example:\n", train_passages[0][:200])


Map:   0%|          | 0/28989 [00:00<?, ? examples/s]

Map:   0%|          | 0/6108 [00:00<?, ? examples/s]

Map:   0%|          | 0/6075 [00:00<?, ? examples/s]

✅ Passages built and saved. Example:
 [DATE: 1909] Knox Cunningham Sir Samuel Knox Cunningham , 1st Baronet , QC ( 3 April 1909 – 29 July 1976 ) , was a Northern Irish barrister , businessman and politician . As an Ulster Unionist politic


Indexing and retrieval setup - Embedding & indexing

This code creates (or loads) a FAISS vector index of the temporally tagged training passages created earlier.

It uses a Sentence Transformer encoder (all-MiniLM-L6-v2) to turn each passage into a dense embedding vector, L2-normalizes the vectors so that inner product equals cosine similarity, and stores them in a flat inner-product FAISS index (IndexFlatIP) for exact similarity search during retrieval.

If the index files already exist → load them.
Else → build the index from scratch.
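The vectors are normalized before they go into an inner-product index because the inner product of two unit-length vectors equals their cosine similarity. The pure-Python sketch below checks this on toy vectors; the helper names and the vectors are illustrative, and FAISS itself is not used here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    denom = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / denom

# Toy "embeddings", invented for illustration.
query = [0.2, 0.9, 0.4]
doc = [0.1, 0.8, 0.7]

ip_after_norm = inner_product(l2_normalize(query), l2_normalize(doc))
print(ip_after_norm, cosine(query, doc))
```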

In [6]:
encoder_name = "all-MiniLM-L6-v2"
encoder = SentenceTransformer(encoder_name)
encoder.max_seq_length = 512

index_path = f"{SAVE_DIR}/timeqa_index.faiss"
meta_path = f"{SAVE_DIR}/index_meta.pkl"

if os.path.exists(index_path) and os.path.exists(meta_path):
    print("🔁 Found existing FAISS index — loading...")
    index = faiss.read_index(index_path)
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    train_years = meta["train_years"]
else:
    print("🧠 Encoding passages... (this may take a few hours on CPU)")
    embs = encoder.encode(train_passages, batch_size=256, show_progress_bar=True, convert_to_numpy=True)
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    faiss.write_index(index, index_path)
    with open(meta_path, "wb") as f:
        pickle.dump({"encoder": encoder_name, "train_years": train_years}, f)
    print("✅ FAISS index built and saved.")

print("📊 Index ready with", index.ntotal, "documents.")


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]



config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

🧠 Encoding passages... (this may take a few hours on CPU)


Batches:   0%|          | 0/114 [00:00<?, ?it/s]

✅ FAISS index built and saved.
📊 Index ready with 28989 documents.


Retrieval (Temporal-Aware) - retrieves semantically + temporally relevant docs

This code defines the temporally aware retrieval function for the Temporal RAG system.
It combines semantic similarity (from the FAISS inner-product search) with a temporal compatibility score between the query’s year (q_year) and each document’s year (d_year), so that retrieved passages are both semantically relevant and temporally consistent (not from the “future” relative to the query).

The temporal score is defined as follows:

If either year is missing → return 0.0 (neutral).

If the document year is after the query year → penalize heavily (-1e9) → prevents “future leakage.”
Otherwise, assign a score inversely proportional to how far apart the years are:
$$\mathrm{score}_{\text{temp}}(q, d) = \frac{\alpha}{1 + \lvert q_{\text{year}} - d_{\text{year}} \rvert}$$
So documents closer in time get higher scores.

Semantic retrieval (top 50 candidates) -> Extract candidate years and semantic scores -> Compute temporal scores -> Standardize temporal scores and rescale them to the mean and standard deviation of the semantic scores -> Add the two scores & select the top 5 results
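The scoring steps above can be traced on a toy candidate list. This is a simplified pure-Python restatement for illustration, assuming plain lists instead of FAISS output; `rerank` and `FUTURE_PENALTY` are names introduced here, and the edge-case handling is condensed relative to the notebook cell.

```python
from statistics import mean, pstdev

FUTURE_PENALTY = -1e9  # same sentinel value as the notebook

def temporal_score_raw(q_year, d_year, alpha=1.0):
    if q_year is None or d_year is None:
        return 0.0
    if d_year > q_year:
        return FUTURE_PENALTY
    return alpha / (1.0 + abs(q_year - d_year))

def rerank(sem_scores, doc_years, q_year, top_k=2):
    """Simplified restatement of the reranking step on plain lists."""
    raw = [temporal_score_raw(q_year, y) for y in doc_years]
    is_future = [r <= FUTURE_PENALTY for r in raw]
    valid = [r for r, fut in zip(raw, is_future) if not fut]

    if len(valid) < 2 or pstdev(valid) == 0:
        # No usable temporal spread: rank by semantic score only.
        rescaled = [0.0] * len(raw)
    else:
        # Standardize temporal scores, then map them onto the semantic scale.
        mu_t, sd_t = mean(valid), pstdev(valid)
        mu_s, sd_s = mean(sem_scores), pstdev(sem_scores)
        rescaled = [((r - mu_t) / sd_t) * sd_s + mu_s for r in raw]

    combined = [
        FUTURE_PENALTY if fut else s + t
        for s, t, fut in zip(sem_scores, rescaled, is_future)
    ]
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return order[:top_k], combined

# Toy candidates: semantic scores and document years, invented for illustration.
sem = [0.80, 0.78, 0.75]
years = [2010, 2005, 2006]
top, combined = rerank(sem, years, q_year=2006)
print(top, combined)
```

In this toy case the 2010 document is excluded as "future", and the 2006 document overtakes the semantically stronger 2005 one.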

In [7]:
ALPHA_SCALE = 1.0
OVER_RETRIEVE = 50
TOP_K = 5

def temporal_score_raw(q_year, d_year, alpha=ALPHA_SCALE):
    if q_year is None or d_year is None:
        return 0.0
    if d_year > q_year:
        return -1e9
    diff = abs(q_year - d_year)
    return alpha / (1.0 + diff)

def retrieve_tempralm(query_text, query_year=None, over_retrieve=OVER_RETRIEVE, top_k=TOP_K):
    q_emb = encoder.encode([query_text], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, over_retrieve)
    D, I = D[0], I[0]
    sem_scores = np.array(D, dtype=float)
    candidate_years = [train_years[i] for i in I]

    temp_raw = np.array([temporal_score_raw(query_year, y) for y in candidate_years], dtype=float)
    mask_future = temp_raw < -1e8

    if np.all(np.isclose(temp_raw, 0)) or np.nanstd(temp_raw) == 0:
        temp_norm = np.zeros_like(temp_raw)
        temp_norm[mask_future] = -1e9
    else:
        mu_tau = np.mean(temp_raw[~mask_future])
        sigma_tau = np.std(temp_raw[~mask_future])
        mu_s, sigma_s = np.mean(sem_scores), np.std(sem_scores)
        temp_norm = ((temp_raw - mu_tau) / (sigma_tau + 1e-12)) * sigma_s + mu_s
        temp_norm[mask_future] = -1e9

    combined = sem_scores + temp_norm
    topk = np.argsort(combined)[::-1][:top_k]
    final_docs = [train_passages[I[i]] for i in topk]
    final_scores = combined[topk]
    return final_docs, final_scores


Generation component of Retrieval-Augmented Generation (RAG) pipeline

Loads T5-base, a general-purpose seq2seq model, as the answer generator.
The generate_answer function takes a query (question) and a list of retrieved documents (contexts) from the temporal retriever.

Build the prompt -> Tokenize and truncate to 512 tokens -> Generate output -> Decode the tokens
Because the contexts are concatenated before truncation, long passages at the front of the list can push later ones out of the 512-token window.
temporal retriever (retrieve_tempralm) = retrieval module (R)
T5 generator (generate_answer) = generation module (G)
Together → form a RAG system with temporal awareness.
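Since the prompt is cut at a fixed length, one option is to clip each passage before joining them so that every retrieved document contributes some text. The sketch below is a hypothetical helper, not the notebook's method: `build_prompt` and `max_context_words` are assumed names, the budget of 300 is arbitrary, and whitespace-separated words are only a crude stand-in for T5 tokens.

```python
def build_prompt(query, contexts, max_context_words=300):
    """Hypothetical helper: give each passage an equal share of a word budget."""
    if not contexts:
        return f"Question: {query}\nContext: \nAnswer:"
    per_doc = max(1, max_context_words // len(contexts))
    # Keep only the first `per_doc` words of each passage.
    clipped = [" ".join(c.split()[:per_doc]) for c in contexts]
    return f"Question: {query}\nContext: {' '.join(clipped)}\nAnswer:"

# Three long toy passages, invented for illustration.
docs = ["alpha " * 400, "beta " * 400, "gamma " * 400]
prompt = build_prompt("Who held the post in 2006?", docs, max_context_words=300)
print(len(prompt.split()))
```

The prompt template is the same one generate_answer uses; only the per-passage clipping is new.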

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
GEN_MODEL = "t5-base"

tokenizer_gen = AutoTokenizer.from_pretrained(GEN_MODEL)
model_gen = AutoModelForSeq2SeqLM.from_pretrained(GEN_MODEL).to(device)

def generate_answer(query, contexts, max_new_tokens=64):
    context = " ".join(contexts)
    prompt = f"Question: {query}\nContext: {context}\nAnswer:"
    inputs = tokenizer_gen(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
    out = model_gen.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer_gen.decode(out[0], skip_special_tokens=True).strip()


config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]



spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/892M [00:00<?, ?B/s]

Evaluation / Comparison

Runs the full pipeline on a single validation example and prints the generated answer next to the dataset’s reference answer (targets).

In [12]:
sample = ds["validation"][20]
query = sample["question"]
q_year = sample["query_year"]
print("Query:", query, "| q_year:", q_year)

retrieved_docs, _ = retrieve_tempralm(query, q_year)
answer = generate_answer(query, retrieved_docs)

print("Generated Answer:", answer)
print("Ground Truth:", sample["targets"])


Query: Who was the chair of Odder Municipality from 2006 to Dec 2009? | q_year: 2006
Generated Answer: Henricus Gregorius Jozeph Henk Kamp
Ground Truth: ['']


“Evaluation” stage of the Temporal RAG pipeline

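This block compares how well a standard RAG model (semantic-only retrieval) and a temporal RAG model (semantic + time-aware retrieval) perform on a random subset of validation questions.
A prediction counts as correct when the first reference answer appears as a case-insensitive substring of the generated answer, so an empty reference answer always counts as a match.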

| Step                     | Module             | Function                             |
| ------------------------ | ------------------ | ------------------------------------ |
| **Retriever**            | Baseline retriever | `retrieve_semantic_only()`           |
| **Retriever (temporal)** | Temporal retriever | `retrieve_tempralm()`                |
| **Generator**            | T5 generator       | `generate_answer()`                  |
| **Evaluator**            | Comparison metric  | `exact_match()`, `evaluate_subset()` |
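For comparison with the containment check used in this notebook, here is a sketch of a stricter exact match with answer normalization (lowercasing, stripping punctuation and English articles). `normalize_answer` and `strict_exact_match` are names introduced for this example, and the normalization rules are an assumption rather than something the notebook defines.

```python
import re
import string

_PUNCT = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def contains_match(pred, gold):
    # The notebook's check: gold is a substring of pred (case-insensitive).
    return gold.lower().strip() in pred.lower().strip()

def strict_exact_match(pred, golds):
    """True only if the normalized prediction equals one normalized reference."""
    pred_n = normalize_answer(pred)
    return any(pred_n == normalize_answer(g) for g in golds)

pred = "Henricus Gregorius Jozeph Henk Kamp"
print(contains_match(pred, ""), strict_exact_match(pred, [""]))
print(strict_exact_match("The King of Northumbria.", ["King of Northumbria"]))
```

The first line of output shows the difference on an empty reference such as the `['']` target seen earlier: the containment check accepts any prediction, while the strict version only accepts an empty prediction.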


In [21]:
def retrieve_semantic_only(query, top_k=TOP_K):
    q_emb = encoder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    return [train_passages[i] for i in I[0]]

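def exact_match(pred, gold):
    # Lenient containment check, not strict equality: True when gold is a
    # substring of pred (case-insensitive). An empty gold always matches.
    return gold.lower().strip() in pred.lower().strip()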

def evaluate_subset(n=50, seed=42):
    random.seed(seed)
    subset = random.sample(list(ds["validation"]), n)
    baseline_correct = temp_correct = 0

    for ex in tqdm(subset):
        q, q_year = ex["question"], ex["query_year"]
        gold = ex["targets"]
        if isinstance(gold, list):
            gold = gold[0] if gold else ""

        base_docs = retrieve_semantic_only(q)
        base_ans = generate_answer(q, base_docs)
        if exact_match(base_ans, gold): baseline_correct += 1

        tr_docs, _ = retrieve_tempralm(q, q_year)
        tr_ans = generate_answer(q, tr_docs)
        if exact_match(tr_ans, gold): temp_correct += 1

    results = {
        "baseline_acc": baseline_correct / n,
        "tempralm_acc": temp_correct / n
    }
    return results

results = evaluate_subset(n=200)
print("✅ Results:", results)

with open(f"{SAVE_DIR}/eval_results.json", "w") as f:
    json.dump(results, f, indent=2)


100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [26:29<00:00,  7.95s/it]

✅ Results: {'baseline_acc': 0.1, 'tempralm_acc': 0.11}





The Results:
    
| Model                            | Accuracy (Exact Match) |
| -------------------------------- | ---------------------- |
| **Baseline (semantic-only RAG)** | 0.10 → **10%**         |
| **Temporal RAG**                 | 0.11 → **11%**         |

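Temporal RAG scores 1 percentage point higher than the plain semantic RAG (22 vs. 20 correct out of 200).
A gap of two questions on a 200-question sample is too small to conclude that temporal scoring helps; at most it hints that the retriever returns slightly more time-consistent documents.
Two caveats apply: exact_match() is a substring check that also accepts empty reference answers, and the index holds only training-split passages while the questions come from the validation split.
The retrieval/generation pipeline still needs more optimization, which motivates trying M-RAG, TA-RAG and variants.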
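To gauge how much a one-point gap means at this sample size, a rough confidence interval can be computed for each accuracy. The sketch below uses the normal-approximation (Wald) interval, which is a choice made here for simplicity; `wald_interval` is an illustrative helper, and the counts 20 and 22 follow from 0.10 and 0.11 of 200 questions.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy of correct/n."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

baseline = wald_interval(20, 200)   # accuracy 0.10
tempralm = wald_interval(22, 200)   # accuracy 0.11
overlap = baseline[1] >= tempralm[0] and tempralm[1] >= baseline[0]

print(f"baseline: {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"tempralm: {tempralm[0]:.3f} to {tempralm[1]:.3f}")
print("intervals overlap:", overlap)
```

Both systems were run on the same questions, so a paired comparison on per-question outcomes would be more informative than two separate intervals; evaluate_subset() does not store those outcomes.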

In [24]:
# ============= DIAGNOSTIC ANALYSIS =============

def analyze_failures(n=50, seed=42):
    """
    Analyze WHY the model is failing
    """
    random.seed(seed)
    subset = random.sample(list(ds["validation"]), min(n, len(ds["validation"])))
    
    failure_analysis = {
        'retrieval_failed': [],  # Gold not in retrieved docs
        'generation_failed': [],  # Gold in docs but wrong generation
        'both_succeeded': [],     # Got it right
        'partial_match': []       # Close but not exact
    }
    
    print("Analyzing failures...")
    
    for ex in tqdm(subset):
        q = ex["question"]
        q_year = ex["query_year"]
        gold = ex["targets"]
        if isinstance(gold, list):
            gold = gold[0] if gold else ""
        
        # Get TempRALM results
        tr_docs, _ = retrieve_tempralm(q, q_year)
        tr_ans = generate_answer(q, tr_docs)
        
        # Check if gold is in retrieved docs
        gold_in_docs = any(gold.lower() in doc.lower() for doc in tr_docs)
        
        # Check if answer is correct
        answer_correct = exact_match(tr_ans, gold)
        
        # Categorize
        if answer_correct:
            failure_analysis['both_succeeded'].append({
                'question': q,
                'gold': gold,
                'prediction': tr_ans
            })
        elif gold_in_docs and not answer_correct:
            failure_analysis['generation_failed'].append({
                'question': q,
                'gold': gold,
                'prediction': tr_ans,
                'retrieved_docs': tr_docs
            })
        elif not gold_in_docs:
            failure_analysis['retrieval_failed'].append({
                'question': q,
                'gold': gold,
                'prediction': tr_ans,
                'retrieved_docs': tr_docs
            })
        
        # Check partial matches
        if not answer_correct:
            pred_tokens = set(tr_ans.lower().split())
            gold_tokens = set(gold.lower().split())
            overlap = len(pred_tokens & gold_tokens)
            if overlap > 0:
                failure_analysis['partial_match'].append({
                    'question': q,
                    'gold': gold,
                    'prediction': tr_ans,
                    'overlap': overlap
                })
    
    # Print summary
    print("\n" + "="*60)
    print("FAILURE ANALYSIS")
    print("="*60)
    print(f"\nTotal analyzed: {n}")
    print(f"\n✓ Correct answers: {len(failure_analysis['both_succeeded'])} ({len(failure_analysis['both_succeeded'])/n*100:.1f}%)")
    print(f"\n✗ Retrieval failed (gold not in docs): {len(failure_analysis['retrieval_failed'])} ({len(failure_analysis['retrieval_failed'])/n*100:.1f}%)")
    print(f"✗ Generation failed (gold in docs, wrong answer): {len(failure_analysis['generation_failed'])} ({len(failure_analysis['generation_failed'])/n*100:.1f}%)")
    print(f"⚠ Partial matches: {len(failure_analysis['partial_match'])} ({len(failure_analysis['partial_match'])/n*100:.1f}%)")
    
    # Show examples
    print("\n" + "="*60)
    print("EXAMPLE RETRIEVAL FAILURES:")
    print("="*60)
    for i, ex in enumerate(failure_analysis['retrieval_failed'][:3]):
        print(f"\nExample {i+1}:")
        print(f"Question: {ex['question']}")
        print(f"Gold: {ex['gold']}")
        print(f"Predicted: {ex['prediction']}")
        print(f"Top retrieved doc: {ex['retrieved_docs'][0][:200]}...")
    
    print("\n" + "="*60)
    print("EXAMPLE GENERATION FAILURES:")
    print("="*60)
    for i, ex in enumerate(failure_analysis['generation_failed'][:3]):
        print(f"\nExample {i+1}:")
        print(f"Question: {ex['question']}")
        print(f"Gold: {ex['gold']}")
        print(f"Predicted: {ex['prediction']}")
        print(f"(Gold WAS in retrieved docs!)")
    
    print("\n" + "="*60)
    print("SUCCESSFUL EXAMPLES:")
    print("="*60)
    for i, ex in enumerate(failure_analysis['both_succeeded'][:3]):
        print(f"\nExample {i+1}:")
        print(f"Question: {ex['question']}")
        print(f"Gold: {ex['gold']}")
        print(f"Predicted: {ex['prediction']}")
    
    return failure_analysis


def check_temporal_coverage(n=100, seed=42):
    """
    Check if questions actually have temporal information
    """
    random.seed(seed)
    subset = random.sample(list(ds["validation"]), min(n, len(ds["validation"])))
    
    has_q_year = 0
    has_d_year = 0
    has_both = 0
    
    examples_no_year = []
    
    for ex in subset:
        q_year = ex.get('query_year')
        d_year = ex.get('doc_year')
        
        if q_year:
            has_q_year += 1
        if d_year:
            has_d_year += 1
        if q_year and d_year:
            has_both += 1
        
        if not q_year:
            examples_no_year.append(ex)
    
    print("\n" + "="*60)
    print("TEMPORAL COVERAGE ANALYSIS")
    print("="*60)
    print(f"\nQuery has year: {has_q_year}/{n} ({has_q_year/n*100:.1f}%)")
    print(f"Doc has year: {has_d_year}/{n} ({has_d_year/n*100:.1f}%)")
    print(f"Both have year: {has_both}/{n} ({has_both/n*100:.1f}%)")
    
    print(f"\n⚠️ Your temporal approach can only help {has_both}/{n} ({has_both/n*100:.1f}%) questions")
    
    if examples_no_year:
        print("\n" + "="*60)
        print("EXAMPLES WITHOUT QUERY YEAR:")
        print("="*60)
        for i, ex in enumerate(examples_no_year[:5]):
            print(f"\n{i+1}. {ex['question']}")
            print(f"   Answer: {ex['targets']}")
    
    return {
        'has_q_year': has_q_year,
        'has_d_year': has_d_year,
        'has_both': has_both,
        'coverage': has_both / n
    }


def compare_retrieval_directly(n=50, seed=42):
    """
    Compare what baseline vs temporal retrieves
    """
    random.seed(seed)
    subset = random.sample(list(ds["validation"]), min(n, len(ds["validation"])))
    
    better_retrieval = 0
    worse_retrieval = 0
    same_retrieval = 0
    
    print("\nComparing retrieval quality...")
    
    for ex in tqdm(subset):
        q = ex["question"]
        q_year = ex["query_year"]
        gold = ex["targets"]
        if isinstance(gold, list):
            gold = gold[0] if gold else ""
        
        # Baseline retrieval
        base_docs = retrieve_semantic_only(q, top_k=5)
        base_has_gold = any(gold.lower() in doc.lower() for doc in base_docs)
        
        # Temporal retrieval
        temp_docs, _ = retrieve_tempralm(q, q_year, top_k=5)
        temp_has_gold = any(gold.lower() in doc.lower() for doc in temp_docs)
        
        if temp_has_gold and not base_has_gold:
            better_retrieval += 1
        elif base_has_gold and not temp_has_gold:
            worse_retrieval += 1
        elif base_has_gold and temp_has_gold:
            same_retrieval += 1
    
    print("\n" + "="*60)
    print("RETRIEVAL COMPARISON")
    print("="*60)
    print(f"\nTempRALM retrieves gold better: {better_retrieval}/{n} ({better_retrieval/n*100:.1f}%)")
    print(f"Baseline retrieves gold better: {worse_retrieval}/{n} ({worse_retrieval/n*100:.1f}%)")
    print(f"Both retrieve gold: {same_retrieval}/{n} ({same_retrieval/n*100:.1f}%)")
    
    return {
        'better': better_retrieval,
        'worse': worse_retrieval,
        'same': same_retrieval
    }


def evaluate_with_relaxed_matching(n=100, seed=42):
    """
    Try relaxed matching (F1 score) instead of exact match
    """
    def token_set_f1(pred, gold):
        # F1 over sets of lowercased whitespace tokens (duplicates are ignored)
        pred_tokens = set(pred.lower().split())
        gold_tokens = set(gold.lower().split())
        if len(pred_tokens) == 0 or len(gold_tokens) == 0:
            return 0
        precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
        recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
        return 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    random.seed(seed)
    subset = random.sample(list(ds["validation"]), min(n, len(ds["validation"])))

    baseline_exact = []
    tempralm_exact = []
    baseline_f1 = []
    tempralm_f1 = []

    print("\nEvaluating with F1 scores...")

    for ex in tqdm(subset):
        q = ex["question"]
        q_year = ex["query_year"]
        gold = ex["targets"]
        if isinstance(gold, list):
            gold = gold[0] if gold else ""

        # Baseline: exact match and F1
        base_docs = retrieve_semantic_only(q)
        base_ans = generate_answer(q, base_docs)
        baseline_exact.append(1 if exact_match(base_ans, gold) else 0)
        baseline_f1.append(token_set_f1(base_ans, gold))

        # TempRALM: exact match and F1
        temp_docs, _ = retrieve_tempralm(q, q_year)
        temp_ans = generate_answer(q, temp_docs)
        tempralm_exact.append(1 if exact_match(temp_ans, gold) else 0)
        tempralm_f1.append(token_set_f1(temp_ans, gold))
    
    print("\n" + "="*60)
    print("EXACT MATCH vs F1 COMPARISON")
    print("="*60)
    print(f"\nBaseline:")
    print(f"  Exact Match: {np.mean(baseline_exact):.2%}")
    print(f"  F1 Score: {np.mean(baseline_f1):.2%}")
    
    print(f"\nTempRALM:")
    print(f"  Exact Match: {np.mean(tempralm_exact):.2%}")
    print(f"  F1 Score: {np.mean(tempralm_f1):.2%}")
    
    print(f"\nF1 Improvement: {(np.mean(tempralm_f1) - np.mean(baseline_f1))*100:+.1f} points")
    
    return {
        'baseline_f1': np.mean(baseline_f1),
        'tempralm_f1': np.mean(tempralm_f1)
    }


# ============= RUN DIAGNOSTICS =============

print("\n" + "="*80)
print(" RUNNING DIAGNOSTIC ANALYSIS")
print("="*80)

# 1. Check temporal coverage
coverage = check_temporal_coverage(n=100)

# 2. Analyze failures
failures = analyze_failures(n=50)

# 3. Compare retrieval directly
retrieval_comp = compare_retrieval_directly(n=50)

# 4. Try F1 scoring
f1_results = evaluate_with_relaxed_matching(n=100)

print("\n" + "="*80)
print(" DIAGNOSTIC SUMMARY")
print("="*80)
print(f"\n1. Temporal Coverage: {coverage['coverage']*100:.1f}% of questions have both query & doc years")
print(f"2. Retrieval helps: TempRALM retrieves gold better in {retrieval_comp['better']} cases")
print(f"3. With F1 metric: Baseline {f1_results['baseline_f1']:.1%}, TempRALM {f1_results['tempralm_f1']:.1%}")

print("\n Key Insights:")
if coverage['coverage'] < 0.7:
    print("Many questions lack temporal info - limits your approach's effectiveness")
if retrieval_comp['better'] < 5:
    print("Temporal scoring not improving retrieval much")
if f1_results['tempralm_f1'] > f1_results['baseline_f1'] * 1.1:
    print("✓ F1 shows better improvement than exact match - predictions are partially correct")


 RUNNING DIAGNOSTIC ANALYSIS

TEMPORAL COVERAGE ANALYSIS

Query has year: 97/100 (97.0%)
Doc has year: 97/100 (97.0%)
Both have year: 97/100 (97.0%)

⚠️ Your temporal approach can only help 97/100 (97.0%) questions

EXAMPLES WITHOUT QUERY YEAR:

1. What position did Ecgfrith of Northumbria take from 670 to 685?
   Answer: ['King of Northumbria']

2. What was the position of Bernard A. Maguire from 1866 to 1870?
   Answer: ['president of Georgetown University']

3. What position did Ecgfrith of Northumbria take from 664 to 670?
   Answer: ['King of Deira']
Analyzing failures...


100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [04:24<00:00,  5.30s/it]



FAILURE ANALYSIS

Total analyzed: 50

✓ Correct answers: 6 (12.0%)

✗ Retrieval failed (gold not in docs): 40 (80.0%)
✗ Generation failed (gold in docs, wrong answer): 4 (8.0%)
⚠ Partial matches: 9 (18.0%)

EXAMPLE RETRIEVAL FAILURES:

Example 1:
Question: Pablo Casado went to which school in Mar 2009?
Gold: King Juan Carlos University
Predicted: Wharton School
Top retrieved doc: [DATE: 1941] Raul Roco Raul Sagarbarria Roco ( October 26 , 1941 – August 5 , 2005 ) was a political figure in the Philippines . He was the standard-bearer of Aksyon Demokratiko , which he founded in ...

Example 2:
Question: Which team did Simon Lappin play for from 2004 to 2005?
Gold: Scotland Under-21
Predicted: Edinburgh Rugby
Top retrieved doc: [DATE: 1979] Simon Taylor ( rugby union ) Simon Marcus Taylor ( born 17 August 1979 ) is a Scottish retired professional rugby union footballer who played for Bath Rugby , Stade Français and Edinburgh...

Example 3:
Question: Which team did A. J. DeLaGarza play fo

100%|██████████████████████████████████████████████████████████████████████████████████| 50/50 [00:18<00:00,  2.68it/s]



RETRIEVAL COMPARISON

TempRALM retrieves gold better: 1/50 (2.0%)
Baseline retrieves gold better: 2/50 (4.0%)
Both retrieve gold: 8/50 (16.0%)

Evaluating with F1 scores...


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [17:18<00:00, 10.39s/it]


EXACT MATCH vs F1 COMPARISON

Baseline:
  Exact Match: 10.00%
  F1 Score: 7.85%

TempRALM:
  Exact Match: 12.00%
  F1 Score: 8.97%

F1 Improvement: +1.1 points

 DIAGNOSTIC SUMMARY

1. Temporal Coverage: 97.0% of questions have both query & doc years
2. Retrieval helps: TempRALM retrieves gold better in 1 cases
3. With F1 metric: Baseline 7.9%, TempRALM 9.0%

 Key Insights:
Temporal scoring not improving retrieval much
✓ F1 shows better improvement than exact match - predictions are partially correct
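The F1 in this notebook is computed over token sets and returns 0 whenever the reference answer is empty, whereas exact_match() accepts any prediction for an empty reference; that is one way the reported F1 can end up below the exact-match rate. For comparison, here is a sketch of a bag-of-tokens F1 that counts repeated tokens and treats two empty strings as a match. `token_f1` is a name introduced here, not a function from the notebook.

```python
from collections import Counter

def token_f1(pred, gold):
    """Bag-of-tokens F1 between a prediction and one reference answer."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("King of Deira", "King of Deira"))
print(token_f1("king", "King of Northumbria"))
print(token_f1("Edinburgh Rugby", "Scotland Under-21"))
```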





1. Added temporal metadata
Added time information (years) to both:
the query (query_year) → the first year mentioned in the question
the documents/passages (doc_year) → the first year mentioned in the context
This gives the retriever a temporal signal on which to compare queries and documents.

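2. Combined semantic + temporal similarity
In vanilla RAG, retrieval is based only on semantic similarity (cosine similarity of embeddings).
In Temporal RAG (TempRALM), the retrieval score also includes a temporal term.

So instead of:
$$\mathrm{sim}(q, d) = \cos(q, d)$$
the retriever uses:
$$\mathrm{sim}_{\text{temp}}(q, d) = \cos(q, d) + \tilde{\tau}(q, d), \qquad \tau(q, d) = \frac{\alpha}{1 + \lvert q_{\text{year}} - d_{\text{year}} \rvert}$$
where α = scaling factor (temporal weight strength)
q_year & d_year = years of query and document
τ̃ = τ standardized over the candidates and rescaled to the mean and standard deviation of the semantic scores

This means documents further away in time get lower scores, even if semantically similar, and documents dated after the query year are excluded with a large negative score.
This encodes temporal grounding into the retriever.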

3. Retrieved the top-k most relevant docs
Used FAISS to retrieve the 50 most semantically similar candidates, then:
Computed the temporally adjusted scores for those candidates
Kept the top k (5) documents
Passed them as context to the generator (T5 model)

4. Generated the final answer
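One check worth running on step 3: the index is built from the training split only, while the evaluation questions come from the validation split, so a question's own context can be retrieved only if the same text also occurs in training. The sketch below measures that overlap on toy data; `index_coverage` is a hypothetical helper and the strings are invented.

```python
def index_coverage(indexed_contexts, eval_contexts):
    """Fraction of evaluation contexts that also appear among the indexed contexts."""
    if not eval_contexts:
        return 0.0
    indexed = set(indexed_contexts)
    hits = sum(1 for c in eval_contexts if c in indexed)
    return hits / len(eval_contexts)

# Toy stand-ins for the train and validation contexts.
train_contexts = ["doc A", "doc B", "doc C"]
val_contexts = ["doc B", "doc D"]
coverage = index_coverage(train_contexts, val_contexts)
print(f"{coverage:.0%} of validation contexts are in the index")
```

In the notebook this could be called with the context values of the train and validation splits; a low value would point to the index contents, rather than the scoring, as a cause of retrieval failures.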