In [1]:
!pip install rank_bm25 transformers datasets torch
!pip install rouge_score



In [None]:
from datasets import load_dataset

# SQuAD dataset: Wikipedia paragraphs and questions asked about them
dataset = load_dataset("squad", split="validation")  # Full validation split (10,570 examples)

# Create the corpus
# The 'context' (paragraph) of every example is one document
# Remove duplicate paragraphs; dict.fromkeys keeps first-seen order, so the corpus is reproducible across runs
corpus = list(dict.fromkeys(dataset['context']))

print(f"Count of documents in library: {len(corpus)}")


Count of documents in library: 2067


In [3]:
from rank_bm25 import BM25Okapi
import string

# A simple tokenizer (word splitter)
def simple_tokenizer(text):
    # Convert to lowercase, strip punctuation, and split on whitespace
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return text.split()


tokenized_corpus = [simple_tokenizer(doc) for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)


# TEST
query = "What creates a magnetic field?"
tokenized_query = simple_tokenizer(query)
doc_scores = bm25.get_scores(tokenized_query)
best_docs = bm25.get_top_n(tokenized_query, corpus, n=1)

print(f"\nQuestion: {query}")
print(f"Found Document: {best_docs[0]}")


Question: What creates a magnetic field?
Found Document: Through combining the definition of electric current as the time rate of change of electric charge, a rule of vector multiplication called Lorentz's Law describes the force on a charge moving in a magnetic field. The connection between electricity and magnetism allows for the description of a unified electromagnetic force that acts on a charge. This force can be written as a sum of the electrostatic force (due to the electric field) and the magnetic force (due to the magnetic field). Fully stated, this is the law:
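
To make the ranking above less of a black box, here is a minimal pure-Python sketch of a textbook BM25 score. It is an illustration under stated assumptions, not the code inside `rank_bm25`: the function name `bm25_scores`, the toy documents, and the smoothed IDF variant are all assumptions, and the library may compute IDF differently.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula.

    Hypothetical helper for illustration; k1 and b are commonly used values.
    """
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones boosted
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            f = term_freq.get(term, 0)
            if f == 0:
                continue
            # Smoothed IDF (never negative): rare terms weigh more
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

# Toy corpus, already tokenized (assumed data, not taken from SQuAD)
toy_docs = [
    "a moving electric charge creates a magnetic field".split(),
    "the super bowl is an american football game".split(),
    "denver won the game".split(),
]
toy_scores = bm25_scores("what creates a magnetic field".split(), toy_docs)
best_index = max(range(len(toy_docs)), key=toy_scores.__getitem__)
print(best_index, [round(s, 3) for s in toy_scores])
```

Documents that share no token with the query score zero, which is why purely lexical retrieval can miss a paragraph that answers the question in different words.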


In [4]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_name = "google/flan-t5-small"
print(f"model name: {model_name} ")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

model name: google/flan-t5-small 



In [5]:
def rag_pipeline(question, n_docs=1):
    # STEP 1: Retrieval
    tokenized_query = simple_tokenizer(question)
    # Fetch the n_docs most relevant documents
    retrieved_docs = bm25.get_top_n(tokenized_query, corpus, n=n_docs)
    context = " ".join(retrieved_docs)  # Concatenate the documents

    # STEP 2: Prompt preparation (generation input)
    input_text = f"Answer the question based on the context.\nContext: {context}\nQuestion: {question}"

    # STEP 3: Answer generation
    # Truncation cuts the end of the prompt, so a context longer than 512 tokens can push the question out
    input_ids = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).input_ids.to(device)

    outputs = model.generate(input_ids, max_length=64, num_beams=4, early_stopping=True)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return answer, context

# TEST
question = "Which team won the Super Bowl 50?"
answer, source = rag_pipeline(question)

print(f"QUESTION: {question}")
print(f"ANSWER (RAG): {answer}")
print(f"USED SOURCE: {source[:200]}...")

QUESTION: Which team won the Super Bowl 50?
ANSWER (RAG): Denver
USED SOURCE: Super Bowl 50 featured numerous records from individuals and teams. Denver won despite being massively outgained in total yards (315 to 194) and first downs (21 to 11). Their 194 yards and 11 first do...
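
The prompt above puts the question after the context and truncates the whole input to 512 tokens, so with larger `n_docs` the question itself may be cut off. One possible guard is to trim the context before the prompt is assembled. The sketch below is an assumption-laden illustration: `build_prompt` and `max_context_words` are hypothetical names, and whitespace-separated words only approximate the model's subword tokens, so the budget of 300 words is a rough choice meant to leave headroom.

```python
def build_prompt(question, docs, max_context_words=300):
    """Assemble the prompt with the context trimmed to a word budget.

    Hypothetical helper: words are only a rough proxy for model tokens.
    """
    context_words = []
    for doc in docs:
        remaining = max_context_words - len(context_words)
        if remaining <= 0:
            break
        # Take as many words from this document as the budget still allows
        context_words.extend(doc.split()[:remaining])
    context = " ".join(context_words)
    return (
        "Answer the question based on the context.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )

# Two artificially long documents (toy data)
long_docs = ["alpha " * 400, "beta " * 400]
prompt = build_prompt("Which team won the Super Bowl 50?", long_docs)
print(len(prompt.split()), prompt.endswith("Which team won the Super Bowl 50?"))
```

Because earlier documents fill the budget first, the top-ranked paragraph is kept whole whenever it fits, and lower-ranked paragraphs are the ones that get shortened or dropped.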


In [6]:
!pip install evaluate rouge_score bert_score



In [7]:
import evaluate
import numpy as np
from tqdm import tqdm

# Load metrics
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# Test settings
TEST_SIZE = 100  # Number of questions to evaluate (raised from 20)
test_data = dataset.select(range(TEST_SIZE))


predictions = []
references = []
retrieval_hits = 0  # Number of questions whose source paragraph was retrieved

# Evaluation Loop
for item in tqdm(test_data):
    question = item['question']
    true_answer = item['answers']['text'][0]  # Reference answer (first of the annotated answers)
    true_context = item['context']            # Source paragraph the question was written for

    # RAG system
    # n_docs=1, so only the single most relevant document is used
    pred_answer, retrieved_context = rag_pipeline(question, n_docs=1)

    # Append to the lists used to score generation
    predictions.append(pred_answer)
    references.append(true_answer)

    # Retrieval success (Recall@1)
    # Heuristic: count a hit when the first 100 characters of the source paragraph
    # appear in the retrieved text. This also works when several documents are joined.
    if true_context[:100] in retrieved_context:
        retrieval_hits += 1

# Calculate scores

# Retrieval Score
recall_at_1 = retrieval_hits / TEST_SIZE

# Generation Scores
bleu_result = bleu.compute(predictions=predictions, references=references)
rouge_result = rouge.compute(predictions=predictions, references=references)

print("\n--- Calculating BERTScore ---")
bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")

# Reporting
print("\n" + "="*30)
print("   RAG SYSTEM")
print("="*30)
print(f"üîç Retrieval Success:")
print(f"   Recall@1: {recall_at_1:.2%} (Portion of finding right document)")
print("-" * 30)
print(f"üìù Generation Quality:")
print(f"   BLEU Score: {bleu_result['bleu']:.4f}")
print(f"   ROUGE-L   : {rouge_result['rougeL']:.4f}")
print(f"   BERTScore : {np.mean(bert_result['f1']):.4f}")
print("="*30)




--- Calculating BERTScore ---





   RAG SYSTEM
🔍 Retrieval Success:
   Recall@1: 41.00% (Portion of finding right document)
------------------------------
📝 Generation Quality:
   BLEU Score: 0.2799
   ROUGE-L   : 0.4177
   BERTScore : 0.9340
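
BLEU, ROUGE-L and BERTScore were designed for longer text, and the loop above compares against only the first reference answer. For short extractive answers, SQuAD-style Exact Match and token-level F1 against every annotated answer are a common complement. The sketch below is a simplified stand-alone version written for illustration, not the official SQuAD evaluation script; the function names are assumptions.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))

def token_f1(prediction, gold_answers):
    """Best token-overlap F1 between the prediction and any gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for gold in gold_answers:
        gold_tokens = normalize_answer(gold).split()
        # Multiset intersection counts each shared token at most as often as it occurs in both
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

gold = ["Santa Clara, California"]
print(exact_match("santa clara california", gold), token_f1("Santa Clara", gold))
```

In the evaluation loop, `item['answers']['text']` already holds the full list of annotated answers, so it could be passed as `gold_answers` and the per-question scores averaged.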


In [8]:
import pandas as pd

# Compute Recall@k (hit rate): the share of queries whose source paragraph is among the top-k results.
# With one relevant paragraph per question this is not Precision@k, which could never exceed 1/k.
def calculate_retrieval_metrics(queries, true_contexts, k=1):
    hits = 0
    for q, true_ctx in zip(queries, true_contexts):
        tokenized_query = simple_tokenizer(q)
        # Get the top-k documents
        top_k_docs = bm25.get_top_n(tokenized_query, corpus, n=k)

        # Count a hit if the source paragraph is among them
        if any(true_ctx == doc for doc in top_k_docs):
            hits += 1

    return hits / len(queries)

# Prepare the query and context lists
test_queries = [item['question'] for item in test_data]
test_contexts = [item['context'] for item in test_data]

# Scores
recall_at_1 = calculate_retrieval_metrics(test_queries, test_contexts, k=1)
recall_at_3 = calculate_retrieval_metrics(test_queries, test_contexts, k=3)

print("\n--- Retrieval Metrics ---")
print(f"Recall@1: {recall_at_1:.4f}")
print(f"Recall@3: {recall_at_3:.4f}")

# Error analysis
print(f"\n{'='*20} FAILURE CASES {'='*20}")

# Build a DataFrame for analysis
results_df = pd.DataFrame({
    "Question": test_queries,
    "True Context": test_contexts,
    "True Answer": references,
    "Generated": predictions,
})

# Collect the first three wrong predictions
failures = []
for idx, row in results_df.iterrows():
    # Simple substring match: the reference answer must appear in the generated text
    if row['True Answer'].lower() not in row['Generated'].lower():
        failures.append(row)
        if len(failures) >= 3:
            break

# Print examples
for i, fail in enumerate(failures):
    print(f"\n[Case {i+1}]")
    print(f"Q: {fail['Question']}")
    print(f"Ctx: {fail['True Context'][:150]}...")  # First 150 characters
    print(f"True: {fail['True Answer']}")
    print(f"Gen: {fail['Generated']}")
    print("-" * 40)


--- Retrieval Metrics ---
Recall@1: 0.4100
Recall@3: 0.7800


[Case 1]
Q: Where did Super Bowl 50 take place?
Ctx: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football...
True: Santa Clara, California
Gen: U.S.
----------------------------------------

[Case 2]
Q: What color was used to emphasize the 50th anniversary of the Super Bowl?
Ctx: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football...
True: gold
Gen: red
----------------------------------------

[Case 3]
Q: What was the theme of Super Bowl 50?
Ctx: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football...
True: "golden anniversary"
Gen: Super Bowl XLVII
----------------------------------------
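
The failure cases above show the source paragraph, not the paragraph that was actually retrieved, so they cannot tell whether the retriever or the generator is at fault. A small classifier can separate the two. This is a sketch under assumptions: `classify_outcome` is a hypothetical helper, the examples are toy values, and it presumes the retrieved context of each question was stored during evaluation, which the notebook does not currently do.

```python
from collections import Counter

def classify_outcome(true_answer, generated, true_context, retrieved_context):
    """Label one example as 'correct', 'retrieval_miss', or 'generation_miss'."""
    # Same substring test the failure filter uses
    if true_answer.lower() in generated.lower():
        return "correct"
    if true_context not in retrieved_context:
        return "retrieval_miss"   # the generator never saw the source paragraph
    return "generation_miss"      # source paragraph was retrieved, answer still wrong

# Toy examples: (true_answer, generated, true_context, retrieved_context)
toy_examples = [
    ("gold", "red", "paragraph A", "paragraph B"),
    ("gold", "red", "paragraph A", "paragraph A"),
    ("Denver", "Denver", "paragraph A", "paragraph A"),
]
outcome_counts = Counter(classify_outcome(*example) for example in toy_examples)
print(dict(outcome_counts))
```

A large share of retrieval misses would point toward raising `n_docs` or improving the retriever, while generation misses would point toward the prompt or the model size.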
