# Building a RAG-Enhanced Question Answering System with Fine-Tuned LLMs
## This notebook guides you through the process of building a comprehensive question answering system that combines:

### 1. Fine-tuning a pre-trained language model (Microsoft's Phi-3 Mini 4K Instruct) for enhanced question answering
### 2. Implementing a Retrieval-Augmented Generation (RAG) system to provide factual, accurate answers

### 3. We'll cover the entire workflow from model fine-tuning to creating a knowledge-enhanced QA system that leverages both the capabilities of the fine-tuned model and relevant information retrieved from a knowledge base.
## Overview
### In this tutorial, you'll learn how to:

* Set up the necessary libraries and environment
* Load and quantize a pre-trained model to reduce memory requirements
Configure Low-Rank Adapters (``` LoRA ```) to efficiently fine-tune the model
Format a dataset for fine-tuning
Train the model using the Supervised Fine-Tuning Trainer (SFTTrainer)
Generate text with your fine-tuned model
Save and share your adapter weights

### Let's get started!
1. Environment Setup
First, let's install the required packages. We'll use specific versions to ensure compatibility.

In [1]:
# Install required packages for fine-tuning
!pip install -q transformers==4.46.2 peft==0.13.2 accelerate==1.1.1 trl==0.12.1 bitsandbytes==0.44.1 datasets==3.1.0 huggingface-hub==0.26.2 safetensors==0.4.5 pandas==2.2.2 matplotlib==3.8.0 numpy==1.26.4

# Install additional packages for RAG
!pip install -q faiss-gpu==1.7.2 sentence-transformers==2.2.2 gradio==3.50.2

[31mERROR: Could not find a version that satisfies the requirement faiss-gpu==1.7.2 (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for faiss-gpu==1.7.2[0m[31m
[0m

### Now, let's import all the necessary libraries:


In [2]:
import os
import torch
import numpy as np
import pickle
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
from huggingface_hub import login, HfApi

## 2. Loading a Quantized Base Model
### Quantization reduces the model's memory footprint by representing weights with fewer bits. We'll use 4-bit quantization (NF4) to reduce memory requirements by approximately 8x.

In [3]:
# Configure BitsAndBytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32
)

# Define the model repository
repo_id = 'microsoft/Phi-3-mini-4k-instruct'

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="cuda:0",
    quantization_config=bnb_config
)

# Print model memory usage
print(f"Model memory footprint: {model.get_memory_footprint()/1e9:.2f} GB")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

Model memory footprint: 2.21 GB


## 3. Setting Up Low-Rank Adapters (LoRA)
### Instead of fine-tuning all parameters of the model, we'll use LoRA adapters. These are small, trainable matrices attached to the frozen quantized layers, significantly reducing the number of parameters we need to train.

In [4]:
# Prepare the model for k-bit training (improves numerical stability)
model = prepare_model_for_kbit_training(model)

# Configure LoRA adapters
lora_config = LoraConfig(
    r=8,                     # Rank - the lower, the fewer parameters to train
    lora_alpha=16,           # Alpha parameter, usually 2*r
    bias="none",             # Don't train bias parameters
    lora_dropout=0.05,       # Dropout probability for LoRA layers
    task_type="CAUSAL_LM",   # Specify that we're training a causal language model

    # Define which layers to apply LoRA to
    # For Phi-3, we need to specify these manually
    target_modules=['o_proj', 'qkv_proj', 'gate_up_proj', 'down_proj'],
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)

# Print information about trainable parameters
trainable_params, total_params = model.get_nb_trainable_parameters()
print(f"Trainable parameters: {trainable_params/1e6:.2f}M")
print(f"Total parameters: {total_params/1e6:.2f}M")
print(f"Percentage of trainable parameters: {100*trainable_params/total_params:.2f}%")

Trainable parameters: 12.58M
Total parameters: 3833.66M
Percentage of trainable parameters: 0.33%


## 4. Preparing the Dataset
### For this tutorial, we'll use a dataset of questions and answers from the Natural Questions dataset, which contains real questions from Google Search with answers from Wikipedia.


In [5]:
# Load the dataset
# dataset = load_dataset("lmqg/qags_nq", split="train") # Original line causing error
dataset = load_dataset("squad", split="train") # Changed to a built-in dataset "squad"

# Display dataset structure
print(dataset)
print(f"Number of examples: {len(dataset)}")
print(f"Sample example: {dataset[0]}")

# Let's select a smaller subset for faster training
dataset = dataset.select(range(1000))

# Format the dataset for instruction tuning
# dataset = dataset.map(
#     lambda examples: {
#         "prompt": examples["question"],
#         "completion": examples["answer"]
#     }
# )
# dataset = dataset.remove_columns(["question", "answer", "text", "id"])

# Format the dataset for instruction tuning to be compatible with "squad" dataset
dataset = dataset.map(
    lambda examples: {
        "prompt": examples["question"],
        "completion": examples["context"] # Changed to "context" for "squad" dataset
    }
)
dataset = dataset.remove_columns(["question", "id", "title", "context", "answers"]) # Changed removed columns


# Display formatted dataset
print("\nFormatted dataset:")
print(dataset)
print(f"Sample example after formatting: {dataset[0]}")

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})
Number of examples: 87599
Sample example: {'id': '5733be284776f41900661182', 'title': 'University_of_Notre_Dame', 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes 

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]


Formatted dataset:
Dataset({
    features: ['prompt', 'completion'],
    num_rows: 1000
})
Sample example after formatting: {'prompt': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'completion': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}


## 5. Setting Up the Tokenizer
### The tokenizer converts text to tokens (numbers) that the model can process. It also contains a chat template that specifies how to format conversations for instruction-tuned models.

In [6]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Display tokenizer information
print(f"Vocabulary size: {len(tokenizer)}")

# Create an example message to visualize tokenization
messages = [
    {"role": "user", "content": dataset[0]['prompt']},
    {"role": "assistant", "content": dataset[0]['completion']}
]

# Show how the chat template formats the conversation
print("\nChat template example:")
print(tokenizer.apply_chat_template(messages, tokenize=False))

tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

Vocabulary size: 32011

Chat template example:
<|user|>
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?<|end|>
<|assistant|>
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.<|end|>
<|endoftext|>


## 6. Configuring the SFTTrainer
### We'll use Hugging Face's Supervised Fine-Tuning Trainer (SFTTrainer) to handle the training loop and data processing.

In [7]:
# Configure the SFTTrainer
sft_config = SFTConfig(
    # Memory optimization
    gradient_checkpointing=True,                            # Saves memory by recomputing gradients
    gradient_checkpointing_kwargs={'use_reentrant': False}, # Required for newer PyTorch versions
    gradient_accumulation_steps=1,                          # Number of steps to accumulate gradients
    per_device_train_batch_size=16,                         # Batch size per device
    auto_find_batch_size=True,                              # Automatically reduce batch size if OOM

    # Dataset configuration
    max_seq_length=512,                                     # Maximum sequence length (increased for QA)
    packing=True,                                           # Packs sequences to improve efficiency

    # Training parameters
    num_train_epochs=3,                                     # Number of training epochs
    learning_rate=3e-4,                                     # Learning rate
    optim='paged_adamw_8bit',                              # Optimizer (8-bit Adam)

    # Logging and output
    logging_steps=10,                                       # Log every 10 steps
    logging_dir='./logs',                                   # Directory for logs
    output_dir='./phi3-mini-qa-adapter_QA',                    # Where to save the model
    report_to='none'                                        # Disable reporting to tools like W&B
)

# Create the trainer
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=sft_config,
    train_dataset=dataset,
)

# Examine a batch of data
train_dataloader = trainer.get_train_dataloader()
batch = next(iter(train_dataloader))
print(f"Input shape: {batch['input_ids'].shape}")
print(f"Are labels automatically created? {('labels' in batch)}")

Generating train split: 0 examples [00:00, ? examples/s]

Input shape: torch.Size([16, 512])
Are labels automatically created? True




## 7. Training the Model
### Now we're ready to fine-tune the model! This step will train the LoRA adapters while keeping the base model frozen.

In [None]:
# Start training
#trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,1.8422
20,1.5537
30,1.3921
40,1.1453
50,0.9458
60,0.7541
70,0.6223
80,0.5643
90,0.5502


TrainOutput(global_step=93, training_loss=1.0252717066836614, metrics={'train_runtime': 5350.5761, 'train_samples_per_second': 0.274, 'train_steps_per_second': 0.017, 'total_flos': 1.6832970064134144e+16, 'train_loss': 1.0252717066836614, 'epoch': 3.0})

## 8. Perom QA task with the Fine-Tuned Model
### Let's test our fine-tuned model by asking some questions .

In [8]:
# Define a function to format prompts properly
def gen_prompt(tokenizer, sentence):
    """Format a sentence into a chat prompt with generation token."""
    converted_sample = [{"role": "user", "content": sentence}]
    prompt = tokenizer.apply_chat_template(
        converted_sample, tokenize=False, add_generation_prompt=True
    )
    return prompt

# Define a function to generate text from the model
def generate(model, tokenizer, prompt, max_new_tokens=64, skip_special_tokens=False):
    """Generate text from the model given a prompt."""
    tokenized_input = tokenizer(
        prompt, add_special_tokens=False, return_tensors="pt"
    ).to(model.device)

    model.eval()
    gen_output = model.generate(
        **tokenized_input,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens
    )

    output = tokenizer.batch_decode(gen_output, skip_special_tokens=skip_special_tokens)
    return output[0]

# Test the model with a few examples
test_questions = [
    "Who was the first person to walk on the moon?",
    "What is the capital of France?",
    "When was the Declaration of Independence signed?",
    "What is photosynthesis?",
    "Who wrote the novel 'Pride and Prejudice'?"
]

for question in test_questions:
    prompt = gen_prompt(tokenizer, question)
    output = generate(model, tokenizer, prompt)
    print(f"Question: {question}")
    print(f"Answer: {output.split('<|assistant|>')[1].split('<|end|>')[0].strip()}")
    print("-" * 50)

Question: Who was the first person to walk on the moon?
Answer: The first person to walk on the moon was Neil Armstrong. He was an American astronaut and aerospace engineer who served as mission commander in the Apollo 11 lunar landing mission. On July 20, 1969, Armstrong became the first human to step onto the lun
--------------------------------------------------
Question: What is the capital of France?
Answer: The capital of France is Paris. It is not only the largest city in the country but also a global center for art, fashion, gastronomy, and culture. Its 19th-century cityscape is crisscrossed by wide boulevards and the River Seine. Paris is also known for
--------------------------------------------------
Question: When was the Declaration of Independence signed?
Answer: The Declaration of Independence was signed on July 4, 1776. This date is now celebrated as Independence Day in the United States. The document, primarily authored by Thomas Jefferson, was adopted by the Continen

## 9. Saving and Sharing the Model
### Finally, let's save our fine-tuned adapter weights and optionally share them on the Hugging Face Hub.

In [None]:
# Save the model locally
trainer.save_model('local-phi3-mini-qa-adapter_QA')
print("Model saved locally to 'local-phi3-mini-qa-adapter_QA'")

# Push to Hugging Face Hub with explicit token and repository configuration
from huggingface_hub import HfApi

# Set your Hugging Face token
hf_token = "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"  # Replace with your actual token
hf_username = "MHamdan"  # Your Hugging Face username
model_name = "phi3-mini-qa-adapter_QA"
repo_id = f"{hf_username}/{model_name}"

# Login with token
api = HfApi(token=hf_token)

# Create a new repository if it doesn't exist
try:
    api.create_repo(repo_id=repo_id, private=False, exist_ok=True)
    print(f"Repository {repo_id} is ready")
except Exception as e:
    print(f"Repository creation error: {e}")
    print("Attempting to push to existing repository or with different permissions...")

# Configure the trainer with the correct repository ID
trainer.args.hub_model_id = repo_id
trainer.args.hub_token = hf_token

# Push model to Hugging Face Hub
try:
    trainer.push_to_hub()
    print(f"Model successfully pushed to Hugging Face Hub at {repo_id}")
except Exception as e:
    print(f"Error pushing to hub: {e}")
    print("Trying alternative upload method...")

    # Alternative: Direct upload using the HfApi
    try:
        api.upload_folder(
            folder_path=trainer.args.output_dir,
            repo_id=repo_id,
            commit_message="Upload fine-tuned QA adapter"
        )
        print(f"Model successfully pushed to Hugging Face Hub at {repo_id} using direct upload")
    except Exception as e2:
        print(f"Direct upload also failed: {e2}")
        print("Please check your token permissions and try again")

Model saved locally to 'local-phi3-mini-qa-adapter_QA'
Repository MHamdan/phi3-mini-qa-adapter_QA is ready


adapter_model.safetensors:   0%|          | 0.00/50.4M [00:00<?, ?B/s]

Upload 3 LFS files:   0%|          | 0/3 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

training_args.bin:   0%|          | 0.00/5.62k [00:00<?, ?B/s]

Model successfully pushed to Hugging Face Hub at MHamdan/phi3-mini-qa-adapter_QA


## 10. Loading and Using Your Fine-Tuned Model

### Here's how you can load and use your fine-tuned model:



In [None]:

# This code shows how to load your fine-tuned model from either local storage or the Hub
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-mini-4k-instruct',
    device_map="cuda:0",
    quantization_config=bnb_config
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-4k-instruct')

# Option 1: Load from local storage
model_path = 'local-phi3-mini-qa-adapter_QA'



# Load the adapter onto the base model
model = PeftModel.from_pretrained(base_model, model_path)

# Now you can use the model to generate text
question = "What causes the Northern Lights?"
prompt = gen_prompt(tokenizer, question)
output = generate(model, tokenizer, prompt)
print(f"Question: {question}")
print(f"Answer: {output.split('<|assistant|>')[1].split('<|end|>')[0].strip()}")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Question: What causes the Northern Lights?
Answer: The Northern Lights, also known as the aurora borealis, are a natural light display in the Earth's sky predominantly seen in the high-latitude regions around the Arctic and Antarctic. The lights have been described as one of the most spectacular natural phenomena. The aur


## 10.1.   Load from Hub


In [9]:

from peft import PeftModel
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    'microsoft/Phi-3-mini-4k-instruct',
    device_map="cuda:0",
    quantization_config=bnb_config
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('microsoft/Phi-3-mini-4k-instruct')

# Option 2: Load from Hub
model_path = 'MHamdan/phi3-mini-qa-adapter_QA'

# Load the adapter onto the base model
model = PeftModel.from_pretrained(base_model, model_path)

# Now you can use the model to generate text
question = "What causes the Northern Lights?"
prompt = gen_prompt(tokenizer, question)
output = generate(model, tokenizer, prompt)
print(f"Question: {question}")
print(f"Answer: {output.split('<|assistant|>')[1].split('<|end|>')[0].strip()}")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_config.json:   0%|          | 0.00/696 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/50.4M [00:00<?, ?B/s]



Question: What causes the Northern Lights?
Answer: The Northern Lights, also known as the aurora borealis, are a natural light display in the Earth's sky predominantly seen in the high-latitude regions around the Arctic and Antarctic. The lights have been described as one of the most spectacular natural phenomena. The aur


##  11. Building a RAG System for Enhanced Question Answering
### Now that we have a fine-tuned model, let's enhance it with a Retrieval-Augmented Generation (RAG) system. RAG combines the strengths of retrieval-based and generation-based approaches to improve answer quality and factuality.

In [10]:
# Install required libraries for vector database and RAG
!pip install faiss-cpu sentence-transformers wikipedia --no-cache-dir


Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (30.7 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m304.9 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11679 sha256=0de151357ce58b9ec6daf920d9c83d4191bca10ad160de8eea7541c484dfe51e
  Stored in directory: /tmp/pip-ephem-wheel-cache-hzaioggb/wheels/8f/ab/cb/45ccc40522d3a1c41e1d2ad53b8f33a62f394011ec38cd71c6
Successfully built wikipedia
Installing collected packages: faiss-cpu, wikipedia
Successfully installed faiss-cpu-1.10.0 wikipedia-1.4.0


In [11]:

# Import libraries
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
import wikipedia
from datetime import datetime

## 11.1 Creating the Knowledge Base
### First, we'll build a knowledge base using Wikipedia articles that will serve as our source of factual information.

In [12]:
# Load a sentence embedding model
encoder = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Loaded sentence embedding model: {encoder.get_sentence_embedding_dimension()}-dimensional")

# Function to fetch Wikipedia articles and create a knowledge base
def build_knowledge_base(topics, max_paragraphs_per_topic=20):
    knowledge_base = {
        "texts": [],
        "sources": [],
        "embeddings": []
    }

    print(f"Building knowledge base from {len(topics)} topics...")

    for topic in topics:
        try:
            # Get Wikipedia page
            page = wikipedia.page(topic)
            print(f"Retrieved '{page.title}'")

            # Split content into paragraphs
            paragraphs = page.content.split('\n\n')

            # Keep only non-empty paragraphs, limited to max_paragraphs_per_topic
            valid_paragraphs = [p.strip() for p in paragraphs if len(p.strip()) > 50][:max_paragraphs_per_topic]

            # Add to knowledge base
            for paragraph in valid_paragraphs:
                knowledge_base["texts"].append(paragraph)
                knowledge_base["sources"].append(f"{page.title} (Wikipedia)")

            print(f"Added {len(valid_paragraphs)} paragraphs from '{page.title}'")

        except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError) as e:
            print(f"Error retrieving '{topic}': {e}")
            continue

    print(f"Knowledge base contains {len(knowledge_base['texts'])} paragraphs")

    # Generate embeddings
    print("Generating embeddings for knowledge base...")
    batch_size = 32
    for i in range(0, len(knowledge_base["texts"]), batch_size):
        batch = knowledge_base["texts"][i:i+batch_size]
        embeddings = encoder.encode(batch, convert_to_tensor=False)
        knowledge_base["embeddings"].extend(embeddings)
        if i % 100 == 0 and i > 0:
            print(f"Processed {i} paragraphs...")

    return knowledge_base

# Define topics to include in our knowledge base
topics = [
    "Moon landing", "Apollo 11", "Neil Armstrong",
    "Paris", "France", "French history",
    "United States Declaration of Independence", "American Revolution",
    "Photosynthesis", "Plant biology",
    "Jane Austen", "Pride and Prejudice",
    "Solar System", "Planets", "Mars", "Jupiter",
    "World War II", "World War I",
    "Climate change", "Global warming",
    "Artificial intelligence", "Machine learning",
    "DNA", "Genetics", "Human Genome Project",
    "Albert Einstein", "Theory of relativity",
    "Quantum mechanics", "Physics",
    "Leonardo da Vinci", "Renaissance art",
    "Internet", "World Wide Web", "Tim Berners-Lee",
    "COVID-19", "Coronavirus", "Pandemic",
    "Northern Lights", "Aurora Borealis",
    "Democracy", "Political systems"
]

# Build the knowledge base
knowledge_base = build_knowledge_base(topics)

# Convert embeddings to numpy array
embeddings_np = np.array(knowledge_base["embeddings"]).astype('float32')

# Build FAISS index with cosine similarity
embedding_dim = embeddings_np.shape[1]

# Normalize vectors for cosine similarity
faiss.normalize_L2(embeddings_np)
index = faiss.IndexFlatIP(embedding_dim)  # Inner product for cosine similarity with normalized vectors
index.add(embeddings_np)

print(f"Created FAISS index with {index.ntotal} vectors of dimension {embedding_dim}")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loaded sentence embedding model: 384-dimensional
Building knowledge base from 41 topics...
Retrieved 'Moon landing'
Added 20 paragraphs from 'Moon landing'
Retrieved 'Apollo 1'
Added 20 paragraphs from 'Apollo 1'
Retrieved 'Neil Armstrong'
Added 20 paragraphs from 'Neil Armstrong'
Retrieved 'Perić'
Added 3 paragraphs from 'Perić'
Retrieved 'France'
Added 20 paragraphs from 'France'
Retrieved 'History of France'
Added 20 paragraphs from 'History of France'
Retrieved 'United States Declaration of Independence'
Added 20 paragraphs from 'United States Declaration of Independence'
Retrieved 'American Revolution'
Added 20 paragraphs from 'American Revolution'
Retrieved 'Photosynthesis'
Added 20 paragraphs from 'Photosynthesis'
Retrieved 'Botany'
Added 20 paragraphs from 'Botany'




  lis = BeautifulSoup(html).find_all('li')


Error retrieving 'Jane Austen': "Jane Austin" may refer to: 
Jane Austen
Jane G. Austin
Jane Austin McCurtain
Retrieved 'Pride and Prejudice'
Added 20 paragraphs from 'Pride and Prejudice'
Error retrieving 'Solar System': Page id "soler system" does not match any pages. Try another id!
Retrieved 'Plant'
Added 20 paragraphs from 'Plant'
Error retrieving 'Mars': "mar" may refer to: 
Mar (title)
Earl of Mar
Mar.
Gospel of Mark
Mar (Scottish province)
Mesoamerican region
Mar, Isfahan
Mar, Markazi
Mar, Russia
Mid-Atlantic Ridge
Mar (surname)
Mar (singer)
Mar Abhai
Mar Amongo
Mar Cambrollé
Mar Roxas
MÄR
Mar (boat)
Minorities at Risk
Mixed Antiglobulin Reaction
Matrix attachment region
Medication Administration Record
Memory address register
LogMAR chart
Missing at random
Model Audit Rule 205
Molapo Armoured Regiment
Modified Aspect Ratio
Mouvement Action Renouveau
Morocco
Mauritius
Marathi language
La Chinita International Airport
Marriott International
Ju-On
Mars (disambiguation)
Marr (disa

## 11.2 Implementing the RAG Question Answering System
### Now, let's create our RAG-enhanced QA system:

In [13]:
def rag_answer(question, model, tokenizer, index, knowledge_base, k=3):
    """
    Answers questions using RAG approach that combines retrieval and generation.

    Args:
        question: The question to answer
        model: The fine-tuned QA model
        tokenizer: The tokenizer for the model
        index: FAISS index for retrieval
        knowledge_base: Dictionary containing knowledge base texts and sources
        k: Number of passages to retrieve

    Returns:
        Dictionary with model answer, RAG answer, and retrieved passages
    """
    # Generate embedding for the question
    question_embedding = encoder.encode([question], convert_to_tensor=False)

    # Normalize the query vector for cosine similarity
    question_embedding_np = np.array(question_embedding).astype('float32')
    faiss.normalize_L2(question_embedding_np)

    # Search the index
    similarities, indices = index.search(question_embedding_np, k)

    # Collect the retrieved passages
    retrieved_passages = []
    for i, idx in enumerate(indices[0]):
        retrieved_passages.append({
            "text": knowledge_base["texts"][idx],
            "source": knowledge_base["sources"][idx],
            "similarity": float(similarities[0][i])
        })

    # Generate answer using just the model
    prompt = gen_prompt(tokenizer, question)
    model_output = generate(model, tokenizer, prompt)
    model_answer = model_output.split('<|assistant|>')[1].split('<|end|>')[0].strip()

    # Prepare RAG prompt with retrieved passages
    rag_context = "I'll answer your question based on the following information:\n\n"
    for i, passage in enumerate(retrieved_passages):
        rag_context += f"Information {i+1} (from {passage['source']}):\n{passage['text']}\n\n"

    rag_context += f"Based on the above information, please answer this question: {question}"

    # Generate answer using RAG prompt
    rag_prompt = gen_prompt(tokenizer, rag_context)
    rag_output = generate(model, tokenizer, rag_prompt, max_new_tokens=512)
    rag_answer = rag_output.split('<|assistant|>')[1].split('<|end|>')[0].strip()

    return {
        "model_answer": model_answer,
        "rag_answer": rag_answer,
        "retrieved_passages": retrieved_passages
    }

## 11.3 Evaluating the RAG System
### Let's test our QA system with some examples:

In [14]:
# Prepare test questions that benefit from factual knowledge
test_questions = [
    "Who was the first person to walk on the moon?",
    "What is the capital of France?",
    "When was the Declaration of Independence signed?",
    "What is photosynthesis?",
    "Who wrote the novel 'Pride and Prejudice'?",
    "What causes the Northern Lights?",
    "How many planets are in our solar system?",
    "Who was Albert Einstein?",
    "What is DNA made of?",
    "When did World War II end?"
]

# Test both the model-only and RAG-enhanced answers
for question in test_questions:
    print(f"Question: {question}")
    print("-" * 80)

    # Get answers
    result = rag_answer(question, model, tokenizer, index, knowledge_base, k=3)

    # Print model-only answer
    print(f"Model-only answer: {result['model_answer']}")

    # Print RAG-enhanced answer
    print(f"\nRAG-enhanced answer: {result['rag_answer']}")

    # Print retrieved passages
    print("\nRetrieved passages:")
    for i, passage in enumerate(result['retrieved_passages']):
        print(f"Passage {i+1} (similarity: {passage['similarity']:.2f}) from {passage['source']}:")
        print(f"{passage['text'][:300]}..." if len(passage['text']) > 300 else passage['text'])
        print()

    print("=" * 80)

Question: Who was the first person to walk on the moon?
--------------------------------------------------------------------------------
Model-only answer: The first man to walk on the moon was Neil Armstrong, commander of Apollo 11, who stepped onto the lunar surface on July 20, 1969, at 02:56:15 UTC. His first words were "That's one small step for

RAG-enhanced answer: The first human-made object to touch the Moon was Luna 2 in 1959. The first spacecraft to orbit the Moon was Luna 10, which was launched on 16 March 1965 and began orbiting the Moon on 4 April 1965. The first successful landing on the Moon was by the United States' Apollo 11 mission on 20 July 1969. The first soft landing was accomplished by the Soviet Luna 9 probe on 3 January 1968. The first successful roving of the lunar surface was by the Soviet Luna 9 probe on 3 January 1968. The first color photograph of the Earth from the Moon was taken by Apollo 8 on Christmas Eve 1968. The first unmanned upward shot from the M

## 12. Creating a Simple Web Interface for the QA System
### Let's create a simple Gradio interface for our RAG-enhanced QA system:

In [15]:
# Install gradio
!pip install -q gradio


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.3/62.3 MB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m322.1/322.1 kB[0m [31m25.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.9/94.9 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m468.1/468.1 kB[0m [31m34.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.3/11.3 MB[0m [31m98.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m72.0/72.0 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.3/62.3 kB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[?25h

In [16]:

import gradio as gr

def answer_question(question, use_rag=True, num_passages=3):
    if not question.strip():
        return "Please enter a question", "", ""

    if use_rag:
        result = rag_answer(
            question, model, tokenizer, index, knowledge_base, k=num_passages
        )

        # Format retrieved passages for display
        passages_text = ""
        for i, passage in enumerate(result['retrieved_passages']):
            passages_text += f"Passage {i+1} (from {passage['source']}):\n"
            passages_text += f"{passage['text'][:500]}..." if len(passage['text']) > 500 else passage['text']
            passages_text += f"\n\nSimilarity score: {passage['similarity']:.2f}\n\n"

        return result['rag_answer'], result['model_answer'], passages_text
    else:
        # Model-only answer
        prompt = gen_prompt(tokenizer, question)
        output = generate(model, tokenizer, prompt)
        answer = output.split('<|assistant|>')[1].split('<|end|>')[0].strip()

        return answer, answer, "RAG not used."

# Create Gradio Interface
demo = gr.Interface(
    fn=answer_question,
    inputs=[
        gr.Textbox(lines=2, placeholder="Enter your question..."),
        gr.Checkbox(label="Use RAG enhancement", value=True),
        gr.Slider(minimum=1, maximum=5, value=3, step=1, label="Number of passages to retrieve")
    ],
    outputs=[
        gr.Textbox(label="Answer"),
        gr.Textbox(label="Model-only Answer (for comparison)"),
        gr.Textbox(label="Retrieved Passages")
    ],
    title="RAG-Enhanced Question Answering System",
    description="Ask a question and get an answer enhanced with relevant factual information from Wikipedia."
)

# Launch the interface
demo.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://508cde2cb6cb218f3f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## Conclusion
### Congratulations! You've successfully built a sophisticated question answering system that combines the best of both worlds:

### A fine-tuned language model that understands how to answer questions
### A RAG system that enhances answers with factual information from Wikipedia

### The RAG-enhanced system offers several significant advantages:

* Improved factuality: The model's answers are grounded in real information
Up-to-date
* knowledge: The knowledge base can be updated without retraining the model
* Verifiable responses: Users can see the sources of information
* Transparent reasoning: The system shows both the model-only and RAG-enhanced answers for comparison

### This approach is exactly how many modern AI assistants work behind the scenes, combining the fluent natural language capabilities of LLMs with the factual grounding of retrieved information.
## Some extensions you might want to try:

1. Expand the knowledge base with more topics or specialized domains
2. Implement citation linking in the answers to specific passages
3. Create a fact-checking system that verifies model outputs against retrieved information
4. Build a feedback loop to continuously improve retrieval quality
5. Add the ability to upload PDFs or documents to create a custom knowledge base

### The RAG pattern is one of the most powerful approaches in modern AI, enabling more accurate, helpful, and trustworthy AI systems!dataset("opus_books", "en-fr", split="train")
## Limit to 10,000 examples for memory considerations
