# Retrieval-Augmented Generation with Phi-2: Optimized Inference Notebook

This step-by-step notebook demonstrates a Retrieval-Augmented Generation (RAG) pipeline that uses the Microsoft Phi-2 language model as the answer generator. It is tuned for efficient inference on a range of GPUs (T4, L4, A100, etc.) through half-precision model weights and PyTorch's inference mode. The Gradio interface has been removed to focus on core performance. BERTScore and ROUGE-L are retained as evaluation metrics for the quality of generated answers.

# 1. Environment Setup and Dependencies

First, install and import the required libraries. We use Hugging Face Transformers for the Phi-2 model and tokenization, SentenceTransformers for embedding generation, FAISS for vector similarity search, and Hugging Face Evaluate (with bert-score and rouge-score backends) for metrics. We also ensure the GPU is utilized if available.

In [None]:
!pip install -U transformers accelerate sentence-transformers faiss-cpu evaluate rouge-score bert-score


In [None]:
import torch, faiss, numpy as np, time
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import evaluate

# Use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

Using device: cuda


# 2. Data Loading and Preparation

Next, load the knowledge documents that the RAG system will use to answer questions. In a real scenario, these could come from files or a database. For this demonstration, we load the first 100 examples of the SciQ dataset from the Hugging Face Hub and use the 'support' field of each example as a document. Each document is a text passage containing facts that can be used to answer questions.

In [None]:
# Step 2: Loading and Preparing Base Documents

from datasets import load_dataset

# Load the "sciq" dataset, which contains scientific questions, answers, and context
# Only the first 100 documents are used for speed and efficiency
dataset = load_dataset("sciq", split="train[:100]")

# Extract only the context texts (field 'support') as base documents
documents = dataset["support"]

# Display total document count and an example
print(f"✅ {len(documents)} documents loaded from the SciQ dataset.")
print("📄 Document example:\n")
print(documents[0])


✅ 100 documents loaded from the SciQ dataset.
📄 Document example:

Mesophiles grow best in moderate temperature, typically between 25°C and 40°C (77°F and 104°F). Mesophiles are often found living in or on the bodies of humans or other animals. The optimal growth temperature of many pathogenic mesophiles is 37°C (98°F), the normal human body temperature. Mesophilic organisms have important uses in food preparation, including cheese, yogurt, beer and wine.


# 3. Compute Document Embeddings

We convert each document into a vector embedding for similarity search. We use a pretrained SentenceTransformer model to obtain embeddings that capture semantic meaning. The embeddings are then L2-normalized so that we can use inner product as a proxy for cosine similarity. This step may be executed on GPU for speed if available.
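
The reason normalization lets inner product stand in for cosine similarity can be checked on two small vectors. The following is a minimal sketch in plain Python; the helper names (`l2_normalize`, `dot`, `cosine`) and the toy vectors are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def l2_normalize(vec):
    """Return a copy of vec scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (hypothetical values, chosen so the norms are 5 and 3)
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product of the normalized vectors matches the cosine of the originals
ip_of_normalized = dot(l2_normalize(a), l2_normalize(b))
assert math.isclose(ip_of_normalized, cosine(a, b))
print(round(ip_of_normalized, 4))  # 0.7333
```

Once every stored vector and every query vector has unit length, ranking by inner product and ranking by cosine similarity give the same order.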

In [None]:
# 3. Compute Document Embeddings

# Load an embedding model (SentenceTransformer) and encode documents
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

# Encode documents directly on the specified device (important for GPU)
doc_embeddings = embedding_model.encode(
    documents,
    convert_to_numpy=True,
    device=device,
    show_progress_bar=True  # Optional: shows a progress bar if documents list is large
)

# Normalize embeddings for cosine similarity search via inner product
faiss.normalize_L2(doc_embeddings)

print("Embedding dimension:", doc_embeddings.shape[1])


Embedding dimension: 384


# 4. Build FAISS Index for Retrieval

Using the document embeddings, we construct a FAISS index to enable fast nearest-neighbor search. We choose an index based on inner product (dot product) since our vectors are normalized (making dot product equivalent to cosine similarity). The index will store all document vectors and allow us to quickly retrieve the most relevant documents given a query vector.
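
Conceptually, a flat inner-product index scores the query against every stored vector and keeps the best k. The sketch below reproduces that idea in plain Python on toy 2-D vectors; `flat_ip_search` is a hypothetical helper written for illustration and says nothing about how FAISS implements the search internally.

```python
def flat_ip_search(doc_vectors, query_vector, k):
    """Score every document by inner product and return the k best (score, doc_id) pairs."""
    scored = [
        (sum(q * d for q, d in zip(query_vector, vec)), doc_id)
        for doc_id, vec in enumerate(doc_vectors)
    ]
    # Highest inner product first; ties are broken by the lower doc_id
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]

# Toy unit-length vectors (hypothetical values)
toy_docs = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
toy_query = [0.8, 0.6]

top2 = flat_ip_search(toy_docs, toy_query, k=2)
print([doc_id for _, doc_id in top2])  # [2, 0]
```

An exhaustive scan like this is exact, and for a collection of only 100 vectors its cost is small.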

In [None]:
# 4. Build FAISS Index for Retrieval

# Determine the dimensionality of the embeddings
dimension = doc_embeddings.shape[1]

# Create a FAISS index optimized for inner product (cosine similarity with normalized vectors)
index = faiss.IndexFlatIP(dimension)

# Add the normalized document embeddings to the index
index.add(doc_embeddings)

# Confirm index size
print(f"✅ FAISS index successfully built with {index.ntotal} vectors (dimension: {dimension}).")


✅ FAISS index successfully built with 100 vectors (dimension: 384).


# 5. Load the Phi-2 Language Model

Now we load the Phi-2 model and tokenizer from Hugging Face. Phi-2 is a 2.7B-parameter causal language model for text generation. We use half precision (fp16) to reduce memory usage and increase speed. The model is moved to the GPU and set to evaluation mode. We enable trust_remote_code=True because the Phi-2 model implementation may have custom code in its Hugging Face repository (ensuring proper loading).
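
A rough back-of-the-envelope calculation shows what half precision saves. The sketch below counts weight storage only, using the 2.7B parameter figure quoted above; activations, the key/value cache, and framework overhead are not included, so actual GPU usage will be higher.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 2.7e9  # parameter count stated above for Phi-2

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 4 bytes per float32 weight
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # 2 bytes per float16 weight
print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")  # fp32: 10.8 GB, fp16: 5.4 GB
```

Halving the bytes per weight halves the weight footprint, which leaves more headroom on smaller GPUs.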

In [None]:
# 5. Load the Phi-2 Language Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Define model name
model_name = "microsoft/phi-2"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Load the Phi-2 model with optimized settings
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # Automatically use GPU if available
    trust_remote_code=True
)

# Set padding token (required for batching or safe generation)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Set model to evaluation mode (disables dropout layers, etc.)
model.eval()

# Report current device
device = model.device
print(f"✅ Phi-2 model loaded and moved to device: {device}")


✅ Phi-2 model loaded and moved to device: cuda:0


# 6. Retrieval-Augmented Generation Function

We define a function to generate answers given a user query. This function implements the RAG workflow:

1. It embeds the query and retrieves the top relevant document(s) from the FAISS index.
2. It constructs a prompt containing the retrieved context and the question. We use an instruction-style prompt to guide the model to use the provided context (e.g., using "Instruct:" and "Output:" format for conciseness).
3. It encodes the prompt and uses the Phi-2 model to generate an answer. We wrap the generation in torch.inference_mode() to disable gradient tracking and improve speed.
4. The function returns the generated answer text.

We also utilize max_new_tokens to limit the length of the generated answer and do_sample=False for deterministic output (greedy decoding). All heavy operations (encoding, retrieval, generation) happen inside the function for each query.
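
The prompt-building and prompt-stripping steps can be tried in isolation, without loading any model. The sketch below uses two hypothetical helpers, `build_prompt` and `strip_prompt`, and toy token ids; it only illustrates the string template and the slicing idea described above.

```python
def build_prompt(query, retrieved_docs):
    """Join retrieved passages and wrap them in the Instruct/Output template."""
    context = "\n".join(retrieved_docs)
    return (
        "Instruct: Answer the question based on the given context.\n"
        f"Context: {context}\n"
        f"Question: {query}\n"
        "Output:"
    )

def strip_prompt(output_ids, prompt_length):
    """Keep only the newly generated token ids that follow the prompt."""
    return output_ids[prompt_length:]

example_prompt = build_prompt(
    "At what temperature do mesophiles grow best?",
    ["Mesophiles grow best in moderate temperature, typically between 25°C and 40°C."],
)
print(example_prompt)

# Toy token ids: the first three stand in for the prompt, the rest for the answer
print(strip_prompt([11, 12, 13, 21, 22], prompt_length=3))  # [21, 22]
```

A causal language model returns the prompt tokens followed by the continuation, so slicing off the prompt length leaves only the answer to decode.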

In [None]:
# 6. Retrieval-Augmented Generation Function (Optimized)

def generate_answer_phi2(query: str, top_k: int = 1, max_new_tokens: int = 100) -> str:
    """
    Generate an answer to the user query using RAG (Retrieval-Augmented Generation)
    with the Phi-2 model and FAISS document index.

    Parameters:
        query (str): The user question.
        top_k (int): Number of top documents to retrieve as context.
        max_new_tokens (int): Maximum number of tokens to generate in the answer.

    Returns:
        str: Generated answer.
    """

    # 1. Embed the user query and normalize it
    query_vector = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_vector)

    # 2. Retrieve top-k relevant documents using FAISS index
    _, indices = index.search(query_vector, top_k)
    retrieved_docs = [documents[i] for i in indices[0]]
    context = "\n".join(retrieved_docs)

    # 3. Build the prompt in instruct-style format
    prompt = (
        f"Instruct: Answer the question based on the given context.\n"
        f"Context: {context}\n"
        f"Question: {query}\n"
        f"Output:"
    )

    # 4. Tokenize input and move tensors to GPU
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    # 5. Run the model to generate a response
    with torch.inference_mode():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding; temperature only applies when sampling
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )

    # 6. Decode output tokens, skipping the prompt portion
    generated_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
    answer = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

    return answer

# 7. Quick Test on a Sample Query

Let's test the pipeline on a sample query to ensure everything is working. We ask a question, print the model's answer, and then time a second call. Note that this sample question is about machine learning, while the indexed documents are SciQ science passages, so the retrieved context may not be relevant and the answer may rely mostly on the model's own knowledge.

In [None]:
# 7. Quick Test on a Sample Query (Domain-Specific)

sample_question = "What is the purpose of regularization in machine learning?"
print("Question:", sample_question)
answer = generate_answer_phi2(sample_question)
print("Answer:", answer)

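# Time a second call; the call above doubles as a warm-up run
if torch.cuda.is_available():
    torch.cuda.synchronize()  # finish pending GPU work before starting the timer
start_time = time.perf_counter()

# Generate the answer
sample_answer = generate_answer_phi2(sample_question)

# Wait for the GPU to finish before stopping the timer
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed_time = time.perf_counter() - start_time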

# Display result
print("📌 Question:", sample_question)
print("\n💡 Answer:\n", sample_answer)
print(f"\n⏱️ Inference time: {elapsed_time:.2f} seconds")


Question: What is the purpose of regularization in machine learning?




Answer: The purpose of regularization in machine learning is to prevent overfitting and improve the generalization of the model.
📌 Question: What is the purpose of regularization in machine learning?

💡 Answer:
 The purpose of regularization in machine learning is to prevent overfitting and improve the generalization of the model.

⏱️ Inference time: 0.70 seconds


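Expected outcome: the model should output a short answer about regularization, like the one shown above. This confirms that the pipeline runs end to end. Because the question is about machine learning while the indexed SciQ passages cover science topics, it does not by itself confirm that the answer is grounded in the retrieved context; a question drawn from the documents (for example, about mesophiles) would be a better check of retrieval.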
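
One way to check whether retrieval actually found something relevant is to look at the similarity scores that the FAISS search returns alongside the document ids (the function above discards them). The sketch below is a hypothetical helper, `filter_hits`, run on made-up scores; the 0.3 threshold is an arbitrary assumption that would need tuning on real queries.

```python
def filter_hits(scores, indices, min_score):
    """Keep (doc_id, score) pairs whose similarity reaches min_score."""
    return [
        (int(doc_id), float(score))
        for score, doc_id in zip(scores, indices)
        if doc_id != -1 and score >= min_score  # FAISS uses -1 for missing results
    ]

# Toy values shaped like one row of search output (hypothetical numbers)
toy_scores = [0.71, 0.32, 0.05]
toy_ids = [12, 47, 3]
print(filter_hits(toy_scores, toy_ids, min_score=0.3))  # [(12, 0.71), (47, 0.32)]
```

If no hit passes the threshold, the caller could answer without context or report that the knowledge base does not cover the question.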

# 8. Inference Speed Benchmark

Now, we benchmark the inference speed of the pipeline. We measure the time it takes for the model to generate an answer, including retrieval and generation. To get a reliable estimate, we run the generation multiple times and take the average. We use torch.cuda.synchronize() to ensure we accurately capture the GPU computation time (waiting for all GPU kernels to finish before timing). The model is already in half-precision and inference mode to maximize throughput.

In [None]:
# 8. Inference Speed Benchmark (Optimized)

def benchmark_inference_time(query, n_runs=3):
    """
    Benchmarks the average inference time (retrieval + generation)
    for a given query over `n_runs` executions.
    """
    print(f"\n🧪 Benchmarking inference time for:\n📌 \"{query}\"")

    # Warm-up run (not timed)
    _ = generate_answer_phi2(query)
    torch.cuda.synchronize()

    # Timed runs
    times = []
    for _ in range(n_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        _ = generate_answer_phi2(query)
        torch.cuda.synchronize()
        end = time.perf_counter()
        times.append(end - start)

    avg_time = np.mean(times)
    print(f"⏱️ Average inference time over {n_runs} runs: {avg_time:.2f} seconds")

# Example benchmark using an ML-related query
benchmark_inference_time("How does overfitting affect model performance?", n_runs=3)


🧪 Benchmarking inference time for:
📌 "How does overfitting affect model performance?"
⏱️ Average inference time over 3 runs: 1.34 seconds


This will output the average time in seconds to process one query end-to-end. The retrieval overhead is minimal compared to the generation, so this gives a good sense of the Phi-2 model's response latency on the given GPU. (If needed, one could separate model generation time specifically by excluding the embedding+retrieval steps, but those are fast in this setup.)
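
If a per-stage breakdown is wanted, a small timing context manager is one option. The sketch below is an assumption about how this could be done, not part of the notebook: `timed` and `stage_times` are hypothetical names, and the workloads are stand-ins so the code runs without a GPU. In the notebook, `sync=torch.cuda.synchronize` could be passed for stages that run on the GPU.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_times = defaultdict(list)  # stage name -> list of elapsed seconds

@contextmanager
def timed(stage, sync=None):
    """Record wall-clock time for one stage; call `sync` before and after if given."""
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        stage_times[stage].append(time.perf_counter() - start)

# Stand-in workloads so the sketch runs without a GPU or model
with timed("retrieval"):
    sum(i * i for i in range(10_000))
with timed("generation"):
    time.sleep(0.01)

for stage, values in stage_times.items():
    print(f"{stage}: {sum(values) / len(values):.4f} s")
```

Wrapping the embedding, search, and generation steps separately would show how much each contributes to the end-to-end latency.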

# 9. Automatic Benchmarking with BERTScore and ROUGE-L

Finally, we evaluate the quality of the answers produced by our RAG system on a set of test queries. We define several questions (along with their expected reference answers) and have the model answer each. We then compute BERTScore and ROUGE-L metrics between the generated answers and the reference answers:

- BERTScore uses pretrained model embeddings to measure semantic similarity between the output and reference, producing Precision, Recall, and F1 scores. We will use the F1 score as an overall similarity measure.
- ROUGE-L measures the overlap based on the Longest Common Subsequence between output and reference, a common metric for QA and summarization quality.

Higher scores (closer to 1.0) indicate that the generated answer is more similar to the reference answer: semantically for BERTScore, lexically for ROUGE-L.
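
To make the ROUGE-L definition concrete, the sketch below computes an LCS-based F1 score in plain Python. It is a simplified illustration that assumes lowercase whitespace tokenization; the rouge-score package used later applies its own tokenizer, so its values can differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 with lowercase whitespace tokenization (a simplification)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)  # share of the prediction that matches
    recall = lcs / len(ref_tokens)      # share of the reference that is covered
    return 2 * precision * recall / (precision + recall)

score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))  # 0.8333
```

Because precision divides by the length of the prediction, a long generated answer scored against a one-sentence reference tends to receive a low ROUGE-L F1 even when its meaning is close, which may partly explain a low ROUGE-L next to a high BERTScore.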

In [None]:
# 9. Evaluation Metrics Benchmark with BERTScore and ROUGE-L (Corrected)

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import evaluate
import time

# Define test prompts and references
eval_queries = [
    "What is the difference between supervised and unsupervised learning?",
    "Explain the concept of overfitting in machine learning.",
    "What is the purpose of regularization in neural networks?",
    "How does the gradient descent algorithm work?",
    "What is a confusion matrix and how is it used?"
]

references = [
    "Supervised learning uses labeled data to train models, while unsupervised learning identifies patterns in unlabeled data.",
    "Overfitting occurs when a model learns the training data too well, including noise, resulting in poor generalization to new data.",
    "Regularization prevents overfitting by adding a penalty to the loss function, discouraging complex models.",
    "Gradient descent is an optimization algorithm that updates model parameters by minimizing the loss function using gradients.",
    "A confusion matrix is a table that summarizes classification results, showing true positives, false positives, true negatives, and false negatives."
]

# Generate predictions and measure inference time
predictions = []
inference_times = []

for query in eval_queries:
    torch.cuda.synchronize()
    start_time = time.time()
    answer = generate_answer_phi2(query)
    torch.cuda.synchronize()
    end_time = time.time()

    predictions.append(answer)
    inference_times.append(round(end_time - start_time, 2))

# Display generated answers
for q, ref, pred in zip(eval_queries, references, predictions):
    print(f"Q: {q}\nGenerated: {pred}\nReference: {ref}\n")

# Load and compute BERTScore
bertscore = evaluate.load("bertscore")
bertscore_results = bertscore.compute(predictions=predictions, references=references, lang="en")
f1_scores = bertscore_results["f1"]
avg_bertscore_f1 = float(np.mean(f1_scores))

# ✅ Compute ROUGE-L per pair (important fix)
rouge = evaluate.load("rouge")
rouge_scores = []
for pred, ref in zip(predictions, references):
    result = rouge.compute(predictions=[pred], references=[ref], rouge_types=["rougeL"])
    rouge_scores.append(result["rougeL"])

avg_rougeL = float(np.mean(rouge_scores))

print(f"✅ Avg BERTScore F1: {avg_bertscore_f1:.4f}")
print(f"✅ Avg ROUGE-L: {avg_rougeL:.4f}")


Q: What is the difference between supervised and unsupervised learning?
Generated: Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning that each input has a corresponding output or target value. The algorithm learns to map the inputs to the outputs and can then make predictions on new inputs. Unsupervised learning is a type of machine learning where the algorithm is trained on unlabeled data, meaning that there is no output or target value for each input. The algorithm learns to find patterns, clusters, or structures in the data and can then
Reference: Supervised learning uses labeled data to train models, while unsupervised learning identifies patterns in unlabeled data.

Q: Explain the concept of overfitting in machine learning.
Generated: Overfitting is a problem in machine learning when a model learns too much from the training data and fails to generalize well to new data. This means that the model performs well on the training

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Avg BERTScore F1: 0.9014
✅ Avg ROUGE-L: 0.2531


# 10. Benchmark Results Table

In [None]:
# Step 10 - DataFrame with benchmark results (Corrected with per-response ROUGE-L F1)
benchmark_results = pd.DataFrame({
    "Prompt": eval_queries,
    "Generated Answer": predictions,
    "Reference Answer": references,
    "Inference Time (s)": inference_times,
    "BERTScore F1": f1_scores,
    "ROUGE-L F1": rouge_scores,  # ✅ individual values per response
    "Model": ["Phi-2"] * len(eval_queries),
    "GPU": [torch.cuda.get_device_name(0)] * len(eval_queries),
    "Subjective Score": [np.nan] * len(eval_queries)
})

print("✅ Benchmark results stored in `benchmark_results`")
benchmark_results.head()


✅ Benchmark results stored in `benchmark_results`


|   | Prompt | Generated Answer | Reference Answer | Inference Time (s) | BERTScore F1 | ROUGE-L F1 | Model | GPU | Subjective Score |
|---|--------|------------------|------------------|--------------------|--------------|------------|-------|-----|------------------|
| 0 | What is the difference between supervised and ... | Supervised learning is a type of machine learn... | Supervised learning uses labeled data to train... | 2.90 | 0.886008 | 0.192308 | Phi-2 | Tesla T4 | NaN |
| 1 | Explain the concept of overfitting in machine ... | Overfitting is a problem in machine learning w... | Overfitting occurs when a model learns the tra... | 2.70 | 0.911868 | 0.240000 | Phi-2 | Tesla T4 | NaN |
| 2 | What is the purpose of regularization in neura... | Regularization is used to prevent overfitting ... | Regularization prevents overfitting by adding ... | 0.43 | 0.919550 | 0.173913 | Phi-2 | Tesla T4 | NaN |
| 3 | How does the gradient descent algorithm work? | The gradient descent algorithm is an optimizat... | Gradient descent is an optimization algorithm ... | 2.83 | 0.868304 | 0.166667 | Phi-2 | Tesla T4 | NaN |
| 4 | What is a confusion matrix and how is it used? | A confusion matrix is a table that is used to ... | A confusion matrix is a table that summarizes ... | 1.86 | 0.921123 | 0.492754 | Phi-2 | Tesla T4 | NaN |


# 11. Save Results

In [None]:
import os
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
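results_dir = "/content/drive/MyDrive/Benchmark_ChatbotRAG/results"
os.makedirs(results_dir, exist_ok=True)  # create the folder if it does not exist yet
benchmark_results.to_csv(os.path.join(results_dir, "benchmark_results_phi2.csv"), index=False)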
print("✅ Phi-2 benchmark results saved to 'Benchmark_ChatbotRAG/results/benchmark_results_phi2.csv'")

✅ Phi-2 benchmark results saved to 'Benchmark_ChatbotRAG/results/benchmark_results_phi2.csv'
