# Complete RAG Pipeline for Multiple PDFs

This notebook implements a complete Retrieval-Augmented Generation (RAG) system that:
- **Processes ALL PDF files** in `/content/data` directory
- Creates embeddings for all documents
- Builds a unified search index
- Enables Q&A across all documents

## Setup and Installation

In [1]:
# Install required packages for Google Colab
!pip install -q PyPDF2
!pip install -q sentence-transformers
!pip install -q faiss-cpu
!pip install -q transformers
!pip install -q torch

# Create directories for data storage
import os
os.makedirs('/content/data', exist_ok=True)
os.makedirs('/content/output', exist_ok=True)

print('✅ All dependencies installed successfully!')

✅ All dependencies installed successfully!


In [2]:
# Configuration and File Paths
import os

# Simple path configuration for Google Colab
DATA_DIR = '/content/data'  # Put all your PDF files here
OUTPUT_DIR = '/content/output'

# File paths for outputs
CHUNKS_FILE = os.path.join(OUTPUT_DIR, 'all_chunks.jsonl')
EMBEDDINGS_FILE = os.path.join(OUTPUT_DIR, 'chunks_with_embeddings.jsonl')
INDEX_FILE = os.path.join(OUTPUT_DIR, 'faiss_index.idx')

# Model configuration
EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'
GENERATION_MODEL = 'google/flan-t5-base'

# Chunking parameters
MAX_TOKENS = 256
OVERLAP_TOKENS = 32

# RAG parameters
TOP_K_RESULTS = 5
MAX_CONTEXT_LENGTH = 2048

print('Configuration loaded successfully!')
print(f'Data directory: {DATA_DIR}')
print(f'Output directory: {OUTPUT_DIR}')

Configuration loaded successfully!
Data directory: /content/data
Output directory: /content/output


## Part 1: Document Processing - All PDFs in Directory

In [3]:
import re
import unicodedata
import hashlib
from typing import List, Dict, Any
import glob

def normalize_slug(s: str, repl="-"):
    """Normalize string to create a clean slug"""
    if s is None:
        return ""
    s = unicodedata.normalize("NFKC", s).lower()
    s = re.sub(r"[^a-z0-9]+", repl, s)
    s = re.sub(rf"{repl}{{2,}}", repl, s)
    s = s.strip(repl)
    return s or "na"

def get_document_hash(content: str) -> str:
    """Generate SHA256 hash of document content"""
    return hashlib.sha256(content.encode()).hexdigest()

def get_all_pdfs(directory: str) -> List[str]:
    """Get all PDF files in a directory"""
    pdf_pattern = os.path.join(directory, '*.pdf')
    pdf_files = glob.glob(pdf_pattern)

    # Also check for PDFs with uppercase extension
    pdf_pattern_upper = os.path.join(directory, '*.PDF')
    pdf_files.extend(glob.glob(pdf_pattern_upper))

    return sorted(set(pdf_files))  # Remove duplicates; sort for a stable processing order

print('Helper functions loaded!')

Helper functions loaded!


In [4]:
from transformers import AutoTokenizer

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL)

def chunk_by_tokens(text: str, max_tokens=MAX_TOKENS, overlap=OVERLAP_TOKENS):
    """Split text into overlapping chunks based on token count"""
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []

    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append(chunk_text)

        if end >= len(tokens):
            break
        start += max_tokens - overlap

    return chunks

print(f'Tokenizer loaded: {EMBEDDING_MODEL}')
print(f'Chunking configured: max_tokens={MAX_TOKENS}, overlap={OVERLAP_TOKENS}')

Tokenizer loaded: sentence-transformers/all-MiniLM-L6-v2
Chunking configured: max_tokens=256, overlap=32
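
To make the window arithmetic concrete: with `max_tokens=256` and `overlap=32`, the window advances by 224 tokens per step. The sketch below reproduces only the index logic of `chunk_by_tokens` on a plain token count, so no tokenizer is needed. `chunk_spans` is a hypothetical helper name, and the guard against `overlap >= max_tokens` is an addition that the cell above does not have.

```python
from typing import List, Tuple


def chunk_spans(n_tokens: int, max_tokens: int = 256, overlap: int = 32) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for overlapping windows."""
    # Added guard (assumption): without it the window would never advance
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")

    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        # Advance by the stride, keeping `overlap` tokens of shared context
        start += max_tokens - overlap
    return spans


print(chunk_spans(600))  # [(0, 256), (224, 480), (448, 600)]
```

A 600-token page therefore yields three chunks, the last one shorter than the others.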


In [5]:
import PyPDF2
import json
from google.colab import files
import os

def extract_text_from_pdf(pdf_path: str) -> List[Dict[str, Any]]:
    """Extract text from a single PDF and return page data"""
    pages_data = []

    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            total_pages = len(reader.pages)

            for page_num in range(total_pages):
                page = reader.pages[page_num]
                # Fall back to an empty string if no text is returned for a page
                text = page.extract_text() or ''

                if text.strip():
                    pages_data.append({
                        'page': page_num + 1,
                        'text': text
                    })
    except Exception as e:
        print(f'⚠️ Error reading {pdf_path}: {e}')
        return []

    return pages_data

def process_all_pdfs(data_dir: str, output_path: str):
    """Process ALL PDFs in a directory and create chunks"""

    # Get all PDF files
    pdf_files = get_all_pdfs(data_dir)

    if not pdf_files:
        print(f'⚠️ No PDF files found in {data_dir}')
        print('Please upload PDF files to /content/data/')
        return 0

    print(f'Found {len(pdf_files)} PDF files:')
    for pdf_file in pdf_files:
        print(f'  - {os.path.basename(pdf_file)}')

    all_chunks = []
    total_chunks = 0

    # Process each PDF
    for pdf_idx, pdf_path in enumerate(pdf_files, 1):
        pdf_name = os.path.basename(pdf_path)
        print(f'\nProcessing [{pdf_idx}/{len(pdf_files)}]: {pdf_name}')

        # Extract pages from PDF
        pages = extract_text_from_pdf(pdf_path)

        if not pages:
            print(f'  ⚠️ No text extracted from {pdf_name}')
            continue

        # Get document hash
        full_text = ' '.join([p['text'] for p in pages])
        doc_hash = get_document_hash(full_text)[:8]

        # Create chunks for this PDF
        pdf_chunks = 0
        for page_data in pages:
            page_num = page_data['page']
            text = page_data['text']

            # Create chunks for this page
            chunks = chunk_by_tokens(text)

            for i, chunk in enumerate(chunks):
                chunk_data = {
                    'id': f'{doc_hash}_p{page_num}_c{i}',
                    'text': chunk,
                    'metadata': {
                        'page': page_num,
                        'chunk_index': i,
                        'doc_hash': doc_hash,
                        'source': pdf_name,
                        'pdf_index': pdf_idx
                    }
                }
                all_chunks.append(chunk_data)
                pdf_chunks += 1

        print(f'  ✅ Created {pdf_chunks} chunks from {len(pages)} pages')
        total_chunks += pdf_chunks

    # Save all chunks to file
    with open(output_path, 'w', encoding='utf-8') as f:
        for chunk in all_chunks:
            f.write(json.dumps(chunk) + '\n')

    print('\n' + '=' * 50)
    print(f'✅ Total: {total_chunks} chunks from {len(pdf_files)} PDFs')
    print(f'✅ Saved to: {output_path}')
    return total_chunks

# Option 1: Upload PDFs manually
print('Option 1: Upload PDF files manually')
print('Click the button below to upload multiple PDF files:')
uploaded = files.upload()

if uploaded:
    # Move uploaded files to data directory
    for filename in uploaded.keys():
        if filename.lower().endswith('.pdf'):
            dest_path = os.path.join(DATA_DIR, filename)
            os.rename(filename, dest_path)
            print(f'✅ Uploaded: {filename}')
        else:
            print(f'⚠️ Skipped (not a PDF): {filename}')

# Option 2: Process all PDFs already in /content/data/
print('\nOption 2: Process PDFs already in /content/data/')
print('Processing all PDF files in the data directory...')

# Process ALL PDFs in the directory
num_chunks = process_all_pdfs(DATA_DIR, CHUNKS_FILE)

if num_chunks == 0:
    print('\n⚠️ No PDFs processed. Creating sample data for testing...')
    # Create sample chunks for testing
    sample_chunks = [
        {'id': 'sample_1', 'text': 'This is a sample document for testing the RAG pipeline.',
         'metadata': {'page': 1, 'chunk_index': 0, 'source': 'sample.pdf'}},
        {'id': 'sample_2', 'text': 'RAG systems combine retrieval and generation for better answers.',
         'metadata': {'page': 1, 'chunk_index': 1, 'source': 'sample.pdf'}},
        {'id': 'sample_3', 'text': 'Multiple documents can be processed and searched simultaneously.',
         'metadata': {'page': 1, 'chunk_index': 2, 'source': 'sample.pdf'}}
    ]
    with open(CHUNKS_FILE, 'w', encoding='utf-8') as f:
        for chunk in sample_chunks:
            f.write(json.dumps(chunk) + '\n')
    print('✅ Created sample chunks for testing')

Option 1: Upload PDF files manually
Click the button below to upload multiple PDF files:



Option 2: Process PDFs already in /content/data/
Processing all PDF files in the data directory...
Found 1 PDF files:
  - UK student visa.pdf

Processing [1/1]: UK student visa.pdf


Token indices sequence length is longer than the specified maximum sequence length for this model (2899 > 512). Running this sequence through the model will result in indexing errors


  ✅ Created 255 chunks from 107 pages

✅ Total: 255 chunks from 1 PDFs
✅ Saved to: /content/output/all_chunks.jsonl
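
Each chunk is written as one JSON object per line, with an `id` built from the 8-character hash prefix, the page number, and the chunk index. The sketch below shows that record shape round-tripping through JSONL in memory. `make_chunk_record` is a hypothetical helper, the sample text is made up, and the `pdf_index` field is left out for brevity.

```python
import hashlib
import io
import json


def make_chunk_record(doc_text: str, source: str, page: int, index: int, chunk: str) -> dict:
    """Build one chunk record in the same shape the notebook writes."""
    doc_hash = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()[:8]
    return {
        "id": f"{doc_hash}_p{page}_c{index}",
        "text": chunk,
        "metadata": {
            "page": page,
            "chunk_index": index,
            "doc_hash": doc_hash,
            "source": source,
        },
    }


# Made-up document and chunks, for illustration only
records = [
    make_chunk_record("full document text", "example.pdf", 1, i, f"chunk {i}")
    for i in range(2)
]

# Write one JSON object per line, then read the lines back
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
print(loaded[0]["id"])
```

Because every line is an independent JSON document, the file can be read line by line without loading it all at once.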


In [6]:
# Check what PDFs were processed
import json

def analyze_chunks(chunks_file: str):
    """Analyze the chunks file to see what documents were processed"""
    chunks = []
    sources = {}

    with open(chunks_file, 'r', encoding='utf-8') as f:
        for line in f:
            chunk = json.loads(line)
            chunks.append(chunk)
            source = chunk['metadata'].get('source', 'unknown')
            sources[source] = sources.get(source, 0) + 1

    print('📊 Chunk Analysis:')
    print(f'Total chunks: {len(chunks)}')
    print(f'\nDocuments processed ({len(sources)}):')
    for source, count in sorted(sources.items()):
        print(f'  - {source}: {count} chunks')

    return chunks

# Analyze the chunks
chunks_data = analyze_chunks(CHUNKS_FILE)

📊 Chunk Analysis:
Total chunks: 255

Documents processed (1):
  - UK student visa.pdf: 255 chunks


## Part 2: Embedding Generation for All Documents

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
print('Loading embedding model...')
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(f'✅ Model loaded: {EMBEDDING_MODEL}')

def generate_embeddings(input_file: str, output_file: str):
    """Generate embeddings for all chunks from all documents"""
    chunks = []

    # Load chunks
    print('Loading chunks...')
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            chunks.append(json.loads(line))

    print(f'Loaded {len(chunks)} chunks')

    # Count chunks per document
    doc_counts = {}
    for chunk in chunks:
        source = chunk['metadata'].get('source', 'unknown')
        doc_counts[source] = doc_counts.get(source, 0) + 1

    print(f'From {len(doc_counts)} documents:')
    for doc, count in sorted(doc_counts.items()):
        print(f'  - {doc}: {count} chunks')

    # Generate embeddings in batches
    print('\nGenerating embeddings...')
    texts = [chunk['text'] for chunk in chunks]
    embeddings = embedding_model.encode(texts, show_progress_bar=True)

    # Save chunks with embeddings
    print('Saving embeddings...')
    with open(output_file, 'w', encoding='utf-8') as f:
        for chunk, embedding in zip(chunks, embeddings):
            chunk['embedding'] = embedding.tolist()
            f.write(json.dumps(chunk) + '\n')

    print(f'\n✅ Generated embeddings for {len(chunks)} chunks')
    print(f'✅ Saved to: {output_file}')
    print(f'Embedding dimension: {len(embeddings[0])}')
    return len(chunks)

# Generate embeddings for all documents
num_embeddings = generate_embeddings(CHUNKS_FILE, EMBEDDINGS_FILE)

Loading embedding model...


✅ Model loaded: sentence-transformers/all-MiniLM-L6-v2
Loading chunks...
Loaded 255 chunks
From 1 documents:
  - UK student visa.pdf: 255 chunks

Generating embeddings...


Saving embeddings...

✅ Generated embeddings for 255 chunks
✅ Saved to: /content/output/chunks_with_embeddings.jsonl
Embedding dimension: 384
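
As a rough size check, the raw vector data for 255 embeddings of dimension 384 is small. The sketch below only does the arithmetic; the 4 bytes per value assume `float32`, which is the dtype the pipeline uses when it loads the embeddings into NumPy. The JSONL file on disk will be larger because each float is stored as text.

```python
def raw_vector_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vector data alone (no metadata, no text)."""
    return num_vectors * dimension * bytes_per_value


size = raw_vector_bytes(255, 384)
print(size, "bytes")                      # 391680 bytes
print(round(size / 1024 ** 2, 2), "MiB")  # 0.37 MiB
```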


## Part 3: RAG Implementation - Search Across All Documents

In [8]:
import faiss

class MultiDocRAGPipeline:
    def __init__(self, embeddings_file: str):
        """Initialize RAG pipeline for multiple documents"""
        self.chunks = []
        self.embeddings = []
        self.index = None
        self.embedding_model = embedding_model
        self.doc_sources = set()

        # Load chunks and embeddings
        self.load_data(embeddings_file)
        # Build FAISS index
        self.build_index()

    def load_data(self, embeddings_file: str):
        """Load chunks with embeddings from all documents"""
        print('Loading data from all documents...')
        with open(embeddings_file, 'r', encoding='utf-8') as f:
            for line in f:
                chunk = json.loads(line)
                self.chunks.append(chunk)
                self.embeddings.append(chunk['embedding'])
                self.doc_sources.add(chunk['metadata'].get('source', 'unknown'))

        self.embeddings = np.array(self.embeddings, dtype='float32')
        print(f'Loaded {len(self.chunks)} chunks from {len(self.doc_sources)} documents')
        print(f'Documents: {", ".join(sorted(self.doc_sources))}')

    def build_index(self):
        """Build FAISS index for similarity search across all documents"""
        print('\nBuilding unified FAISS index...')

        # Normalize embeddings for cosine similarity
        faiss.normalize_L2(self.embeddings)

        # Create index
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner Product for cosine similarity
        self.index.add(self.embeddings)

        print(f'✅ Index built with {self.index.ntotal} vectors')
        print(f'✅ Ready to search across {len(self.doc_sources)} documents')

    def search(self, query: str, k: int = TOP_K_RESULTS, filter_source: str = None):
        """Search for relevant chunks across all or specific documents"""
        # Encode query
        query_embedding = self.embedding_model.encode([query])
        query_embedding = np.array(query_embedding, dtype='float32')
        faiss.normalize_L2(query_embedding)

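        # When filtering, search the whole index so that matches from the
        # requested source are not cut off by a small candidate pool
        search_k = self.index.ntotal if filter_source else min(k, self.index.ntotal)
        scores, indices = self.index.search(query_embedding, search_k)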

        # Gather results
        results = []
        for idx, score in zip(indices[0], scores[0]):
            if idx != -1:  # Valid result
                chunk = self.chunks[idx]
                # Filter by source if specified
                if filter_source and chunk['metadata'].get('source') != filter_source:
                    continue
                results.append({
                    'chunk': chunk,
                    'score': float(score)
                })
                if len(results) >= k:
                    break

        return results

    def search_by_document(self, query: str, k_per_doc: int = 2):
        """Get top results from each document separately"""
        all_results = []

        for source in sorted(self.doc_sources):
            results = self.search(query, k=k_per_doc, filter_source=source)
            if results:
                all_results.extend(results)

        # Sort by score
        all_results.sort(key=lambda x: x['score'], reverse=True)
        return all_results

    def format_context(self, results, show_source=True):
        """Format search results as context with source attribution"""
        context_parts = []
        seen_sources = set()

        for i, result in enumerate(results, 1):
            chunk = result['chunk']
            text = chunk['text']
            metadata = chunk.get('metadata', {})
            source = metadata.get('source', 'unknown')
            page = metadata.get('page', 'N/A')

            seen_sources.add(source)

            if show_source:
                context_parts.append(f"[{i}] (Source: {source}, Page: {page})\n{text}")
            else:
                context_parts.append(f"[{i}] {text}")

        context = "\n\n".join(context_parts)

        if seen_sources and show_source:
            sources_str = ", ".join(sorted(seen_sources))
            context = f"Sources: {sources_str}\n\n{context}"

        return context

# Initialize multi-document RAG pipeline
print('Initializing multi-document RAG pipeline...')
rag = MultiDocRAGPipeline(EMBEDDINGS_FILE)
print('\n✅ RAG pipeline ready for multi-document search!')

Initializing multi-document RAG pipeline...
Loading data from all documents...
Loaded 255 chunks from 1 documents
Documents: UK student visa.pdf

Building unified FAISS index...
✅ Index built with 255 vectors
✅ Ready to search across 1 documents

✅ RAG pipeline ready for multi-document search!
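
The index relies on the identity that the inner product of two L2-normalized vectors equals their cosine similarity. The stdlib-only sketch below illustrates that computation on toy 2-D vectors. It is a stand-in for the math, not a description of how FAISS is implemented, and `l2_normalize` and `flat_ip_search` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def flat_ip_search(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    """Exhaustive inner-product search over normalized vectors."""
    q = l2_normalize(query)
    scored = []
    for idx, vec in enumerate(vectors):
        v = l2_normalize(vec)
        # Inner product of unit vectors == cosine similarity
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
hits = flat_ip_search([2.0, 0.0], vectors, k=2)
print(hits)
```

Vector 0 points in the same direction as the query and scores 1.0 regardless of length; vector 2 sits at 45 degrees and scores about 0.707.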


In [9]:
from transformers import pipeline
import torch

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1
print(f"Using device: {'GPU' if device == 0 else 'CPU'}")

# Initialize text generation pipeline
print(f'Loading generation model: {GENERATION_MODEL}...')
try:
    generator = pipeline(
        'text2text-generation',
        model=GENERATION_MODEL,
        device=device,
        max_length=512
    )
    print('✅ Generation model loaded')
except Exception as e:
    print(f'⚠️ Could not load generation model: {e}')
    generator = None

def generate_answer(query: str, context: str, use_generator: bool = True):
    """Generate answer based on query and context from multiple documents"""

    # Create prompt
    prompt = f"""Context from multiple documents:
{context}

Question: {query}

Based on the context provided above from the documents, please answer the question.
If the answer cannot be found in the context, say 'I cannot find this information in the provided documents.'

Answer:"""

    if generator and use_generator:
        # Use LLM for generation
        try:
            response = generator(prompt, max_length=200, do_sample=False)[0]['generated_text']
            return response
        except Exception as e:
            print(f'Generation failed: {e}')
            return "Generation failed. Please check the context below.\n\n" + context
    else:
        # Fallback: return formatted context
        return f"Here is the relevant information found across documents:\n\n{context}"

print('✅ Generation component ready!')

Using device: CPU
Loading generation model: google/flan-t5-base...


Device set to use cpu


✅ Generation model loaded
✅ Generation component ready!
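
`MAX_CONTEXT_LENGTH` is set in the configuration but is not referenced by the cells shown here, and five chunks of up to 256 tokens each can add up to 1,280 tokens of context before the instructions are appended. If the prompt needs to stay within a fixed budget, one option is to keep the highest-ranked context parts until the budget is used up. The sketch below assumes whitespace-separated words as a rough proxy for tokens (a real tokenizer count would be more accurate); `fit_to_budget` is a hypothetical helper.

```python
from typing import Callable, List


def fit_to_budget(
    parts: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Keep ranked context parts, in order, until the budget would be exceeded."""
    kept = []
    used = 0
    for part in parts:
        cost = count_tokens(part)
        if used + cost > budget:
            break  # Lower-ranked parts are dropped
        kept.append(part)
        used += cost
    return kept


# Made-up context parts, ordered best match first
parts = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_to_budget(parts, budget=5))  # ['alpha beta gamma', 'delta epsilon']
```

Passing a tokenizer-based counter as `count_tokens` would make the budget match what the generation model actually sees.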


In [11]:
def ask_question(query: str, search_mode: str = 'all', use_llm: bool = True):
    """
    Complete RAG pipeline: search + generate answer

    Args:
        query: The question to answer
        search_mode: 'all' for unified search, 'per_doc' for per-document search
        use_llm: Whether to use LLM for answer generation
    """
    print(f"\n{'='*60}")
    print(f"Query: {query}")
    print(f"Search mode: {search_mode}")
    print('='*60)

    # Search for relevant chunks
    print('\n🔍 Searching across all documents...')

    if search_mode == 'per_doc':
        # Get top results from each document
        results = rag.search_by_document(query, k_per_doc=2)
    else:
        # Get top results across all documents
        results = rag.search(query, k=TOP_K_RESULTS)

    if not results:
        print('❌ No relevant information found in any document.')
        return "I couldn't find any relevant information for your query in the available documents."

    # Display search results
    print(f'\n✅ Found {len(results)} relevant chunks:')
    sources_found = set()
    for i, result in enumerate(results, 1):
        source = result['chunk']['metadata'].get('source', 'unknown')
        page = result['chunk']['metadata'].get('page', 'N/A')
        sources_found.add(source)
        print(f"  {i}. Score: {result['score']:.3f} - {source} (Page {page})")

    print(f'\n📚 Documents used: {", ".join(sorted(sources_found))}')

    # Format context
    context = rag.format_context(results, show_source=True)

    # Generate answer
    print('\n💭 Generating answer...')
    answer = generate_answer(query, context, use_generator=use_llm)

    print('\n' + '='*60)
    print('ANSWER:')
    print('='*60)
    print(answer)

    return answer

# Example usage
print('\n🚀 Multi-Document RAG System Ready!')
print('\nThe system will search across ALL PDFs in /content/data/')
print('\nExample questions:')
print('  - What are the main topics covered in these documents?')
print('  - Compare information about [topic] across documents')
print('  - What does [specific document] say about [topic]?')
print('  - What is the minimum English language requirement for studying a degree-level course in the UK?')
print('\n')


🚀 Multi-Document RAG System Ready!

The system will search across ALL PDFs in /content/data/

Example questions:
  - What are the main topics covered in these documents?
  - Compare information about [topic] across documents
  - What does [specific document] say about [topic]?
  - What is the minimum English language requirement for studying a degree-level course in the UK?




In [13]:
# Interactive Q&A Session - Search Across All Documents
# Modify the question below and run the cell

question = "Can you explain the financial requirements for international students applying for a UK Student visa?"  # <- Change this to your question

# Search modes:
# - 'all': Get top results across all documents (default)
# - 'per_doc': Get top results from each document separately

answer = ask_question(question, search_mode='all', use_llm=True)


Query: Can you explain the financial requirements for international students applying for a UK Student visa?
Search mode: all

🔍 Searching across all documents...

✅ Found 5 relevant chunks:
  1. Score: 0.658 - UK student visa.pdf (Page 40)
  2. Score: 0.652 - UK student visa.pdf (Page 39)
  3. Score: 0.649 - UK student visa.pdf (Page 39)
  4. Score: 0.633 - UK student visa.pdf (Page 102)
  5. Score: 0.614 - UK student visa.pdf (Page 43)

📚 Documents used: UK student visa.pdf

💭 Generating answer...

ANSWER:
international students are required to pay any course fees to the sponsoring institution


In [14]:
# Search within a specific document

def search_in_document(query: str, document_name: str):
    """Search for information within a specific document"""
    print(f"\n{'='*60}")
    print(f"Searching in: {document_name}")
    print(f"Query: {query}")
    print('='*60)

    # Check if document exists
    if document_name not in rag.doc_sources:
        print(f"\n❌ Document '{document_name}' not found.")
        print(f"Available documents: {', '.join(sorted(rag.doc_sources))}")
        return None

    # Search in specific document
    results = rag.search(query, k=5, filter_source=document_name)

    if not results:
        print(f"No relevant information found in {document_name}")
        return None

    print(f"\n✅ Found {len(results)} relevant chunks in {document_name}:")
    for i, result in enumerate(results, 1):
        page = result['chunk']['metadata'].get('page', 'N/A')
        print(f"  {i}. Score: {result['score']:.3f} - Page {page}")

    # Format and display results
    context = rag.format_context(results, show_source=False)
    print(f"\n{'='*60}")
    print("RELEVANT INFORMATION:")
    print('='*60)
    print(context)

    return results

# Example: Search in a specific PDF
# First, let's see what documents are available
print("Available documents:")
for doc in sorted(rag.doc_sources):
    print(f"  - {doc}")

# Now search in a specific document (change the document name as needed)
# search_in_document("What is the main topic?", "document.pdf")

Available documents:
  - UK student visa.pdf
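
Because the source filter is applied to the candidates that the index returns, the candidate pool has to be large enough to contain `k` chunks from the requested document. The sketch below mimics that post-filtering on a pre-ranked list of `(source, score)` pairs. The data and the `filter_ranked` helper are made up for illustration.

```python
from typing import List, Tuple


def filter_ranked(ranked: List[Tuple[str, float]], source: str, k: int, pool_size: int) -> List[Tuple[str, float]]:
    """Filter the top `pool_size` candidates by source, then keep the best k."""
    hits = [item for item in ranked[:pool_size] if item[0] == source]
    return hits[:k]


# Globally ranked candidates, best first (hypothetical scores)
ranked = [
    ("a.pdf", 0.9), ("a.pdf", 0.8), ("a.pdf", 0.7),
    ("b.pdf", 0.6), ("a.pdf", 0.5), ("b.pdf", 0.4),
]

print(len(filter_ranked(ranked, "b.pdf", k=2, pool_size=3)))            # 0
print(len(filter_ranked(ranked, "b.pdf", k=2, pool_size=len(ranked))))  # 2
```

With a pool of three candidates, `b.pdf` returns nothing even though two matching chunks exist further down the ranking.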


In [15]:
# Compare information across multiple documents

def compare_across_documents(query: str):
    """Find and compare information about a topic across all documents"""
    print(f"\n{'='*60}")
    print(f"Comparing across documents: {query}")
    print('='*60)

    # Get results from each document
    results_by_doc = {}

    for source in sorted(rag.doc_sources):
        results = rag.search(query, k=2, filter_source=source)
        if results:
            results_by_doc[source] = results

    if not results_by_doc:
        print("No relevant information found in any document.")
        return

    # Display comparison
    print(f"\n📊 Found relevant information in {len(results_by_doc)} documents:\n")

    for doc, results in results_by_doc.items():
        print(f"\n📄 {doc}:")
        print("-" * 40)
        for result in results:
            chunk = result['chunk']
            text = chunk['text'][:200] + "..." if len(chunk['text']) > 200 else chunk['text']
            page = chunk['metadata'].get('page', 'N/A')
            score = result['score']
            print(f"Page {page} (Score: {score:.3f}):")
            print(f"  {text}\n")

# Example comparison
# compare_across_documents("What are the requirements?")
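
The comparison view boils down to grouping ranked hits by source and keeping the best few per document. A single pass over a globally ranked list can produce the same grouping; the sketch below shows this with made-up hits. `top_k_per_source` is a hypothetical helper, and the flat `source`/`score` keys are a simplification of the nested result dictionaries used above.

```python
from collections import defaultdict
from typing import Dict, List


def top_k_per_source(ranked_hits: List[dict], k: int) -> Dict[str, List[dict]]:
    """Group hits (already sorted best first) by source, keeping up to k each."""
    grouped = defaultdict(list)
    for hit in ranked_hits:
        bucket = grouped[hit["source"]]
        if len(bucket) < k:
            bucket.append(hit)
    return dict(grouped)


# Hypothetical hits, sorted by descending score
ranked_hits = [
    {"source": "a.pdf", "page": 3, "score": 0.71},
    {"source": "b.pdf", "page": 9, "score": 0.66},
    {"source": "a.pdf", "page": 4, "score": 0.64},
    {"source": "a.pdf", "page": 8, "score": 0.60},
]

for source, hits_for_doc in sorted(top_k_per_source(ranked_hits, k=2).items()):
    print(source, [h["page"] for h in hits_for_doc])
```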

In [19]:
# Batch Q&A - Ask multiple questions across all documents

questions = [
    "Can Child Students bring dependants to the UK?",
    "What mandatory information must be included in a valid CAS?",
    "For how many consecutive days must students show they have held the required funds?"
]

print('\n' + '='*70)
print('BATCH Q&A SESSION - SEARCHING ALL DOCUMENTS')
print('='*70)

answers = {}
for i, q in enumerate(questions, 1):
    print(f"\n\n[Question {i}/{len(questions)}]")
    answer = ask_question(q, search_mode='all', use_llm=True)
    answers[q] = answer
    print('\n' + '-'*50)

# Summary
print('\n\n' + '='*70)
print('SUMMARY OF ANSWERS')
print('='*70)
for q, a in answers.items():
    print(f"\nQ: {q}")
    print(f"A: {a[:200]}..." if len(a) > 200 else f"A: {a}")


BATCH Q&A SESSION - SEARCHING ALL DOCUMENTS


[Question 1/3]

Query: Can Child Students bring dependants to the UK?
Search mode: all

🔍 Searching across all documents...

✅ Found 5 relevant chunks:
  1. Score: 0.656 - UK student visa.pdf (Page 101)
  2. Score: 0.654 - UK student visa.pdf (Page 98)
  3. Score: 0.652 - UK student visa.pdf (Page 98)
  4. Score: 0.644 - UK student visa.pdf (Page 74)
  5. Score: 0.629 - UK student visa.pdf (Page 11)

📚 Documents used: UK student visa.pdf

💭 Generating answer...

ANSWER:
cannot bring dependants with them to the uk

--------------------------------------------------


[Question 2/3]

Query: What mandatory information must be included in a valid CAS?
Search mode: all

🔍 Searching across all documents...

✅ Found 5 relevant chunks:
  1. Score: 0.645 - UK student visa.pdf (Page 3)
  2. Score: 0.609 - UK student visa.pdf (Page 82)
  3. Score: 0.602 - UK student visa.pdf (Page 20)
  4. Score: 0.597 - UK student visa.pdf (Page 5)
  5. Score: 0.597

In [20]:
# System Statistics and Document Overview

def show_statistics():
    """Display statistics about the indexed documents"""
    print('\n' + '='*60)
    print('📊 SYSTEM STATISTICS')
    print('='*60)

    # Count chunks per document
    doc_stats = {}
    total_text_length = 0

    for chunk in rag.chunks:
        source = chunk['metadata'].get('source', 'unknown')
        if source not in doc_stats:
            doc_stats[source] = {
                'chunks': 0,
                'pages': set(),
                'text_length': 0
            }
        doc_stats[source]['chunks'] += 1
        doc_stats[source]['pages'].add(chunk['metadata'].get('page', 0))
        doc_stats[source]['text_length'] += len(chunk['text'])
        total_text_length += len(chunk['text'])

    print(f"\n📚 Total documents indexed: {len(doc_stats)}")
    print(f"📄 Total chunks: {len(rag.chunks)}")
    print(f"📝 Total text: {total_text_length:,} characters")
    print(f"🔢 Embedding dimension: {rag.embeddings.shape[1]}")
    print(f"💾 Index size: {rag.index.ntotal} vectors")

    print("\n📖 Document Details:")
    print("-" * 60)

    for doc in sorted(doc_stats.keys()):
        stats = doc_stats[doc]
        num_pages = len(stats['pages'])
        avg_chunk_size = stats['text_length'] // stats['chunks']
        print(f"\n📄 {doc}")
        print(f"   Pages: {num_pages}")
        print(f"   Chunks: {stats['chunks']}")
        print(f"   Total text: {stats['text_length']:,} chars")
        print(f"   Avg chunk size: {avg_chunk_size} chars")

show_statistics()


📊 SYSTEM STATISTICS

📚 Total documents indexed: 1
📄 Total chunks: 255
📝 Total text: 215,433 characters
🔢 Embedding dimension: 384
💾 Index size: 255 vectors

📖 Document Details:
------------------------------------------------------------

📄 UK student visa.pdf
   Pages: 107
   Chunks: 255
   Total text: 215,433 chars
   Avg chunk size: 844 chars


In [21]:
# Save and Load the Multi-Document Index

def save_index(index_path: str = INDEX_FILE):
    """Save FAISS index and metadata to disk"""
    import pickle

    # Save FAISS index
    faiss.write_index(rag.index, index_path)
    print(f'✅ Index saved to: {index_path}')

    # Save metadata
    metadata_path = index_path.replace('.idx', '_metadata.pkl')
    metadata = {
        'num_chunks': len(rag.chunks),
        'num_documents': len(rag.doc_sources),
        'documents': list(rag.doc_sources),
        'embedding_dim': rag.embeddings.shape[1],
        'model': EMBEDDING_MODEL
    }

    with open(metadata_path, 'wb') as f:
        pickle.dump(metadata, f)
    print(f'✅ Metadata saved to: {metadata_path}')

    # Display what was saved
    print(f"\n📊 Saved index contains:")
    print(f"   - {metadata['num_chunks']} chunks")
    print(f"   - {metadata['num_documents']} documents")
    print(f"   - Documents: {', '.join(metadata['documents'])}")

def load_index(index_path: str = INDEX_FILE):
    """Load FAISS index from disk"""
    import pickle

    if os.path.exists(index_path):
        rag.index = faiss.read_index(index_path)
        print(f'✅ Index loaded from: {index_path}')
        print(f'   Vectors in index: {rag.index.ntotal}')

        # Load metadata
        metadata_path = index_path.replace('.idx', '_metadata.pkl')
        if os.path.exists(metadata_path):
            with open(metadata_path, 'rb') as f:
                metadata = pickle.load(f)
            print(f"\n📊 Index contains:")
            print(f"   - {metadata['num_chunks']} chunks")
            print(f"   - {metadata['num_documents']} documents")
            print(f"   - Documents: {', '.join(metadata['documents'])}")
    else:
        print(f'❌ Index file not found: {index_path}')

# Save the current index
save_index()

# Example: Load a previously saved index
# load_index()

✅ Index saved to: /content/output/faiss_index.idx
✅ Metadata saved to: /content/output/faiss_index_metadata.pkl

📊 Saved index contains:
   - 255 chunks
   - 1 documents
   - Documents: UK student visa.pdf
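
The metadata saved here is a flat dictionary of integers, strings, and a list of strings, so JSON can hold it as well. That may be preferable when the file is shared, since unpickling data from an untrusted source can execute arbitrary code. A minimal sketch, assuming hypothetical helper names and a `.json` file name that the notebook itself does not use:

```python
import json
import os
import tempfile


def save_metadata_json(metadata: dict, path: str) -> None:
    """Write index metadata as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)


def load_metadata_json(path: str) -> dict:
    """Read index metadata back from JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Same fields and values as the metadata printed above
metadata = {
    "num_chunks": 255,
    "num_documents": 1,
    "documents": ["UK student visa.pdf"],
    "embedding_dim": 384,
    "model": "sentence-transformers/all-MiniLM-L6-v2",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "faiss_index_metadata.json")  # hypothetical file name
    save_metadata_json(metadata, path)
    restored = load_metadata_json(path)

print(restored == metadata)  # True
```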


In [24]:
import os
import zipfile
from google.colab import files

OUTPUT_DIR = "/content/output"  # change this to your actual folder

def download_rag_files():
    """Create a zip file with all RAG outputs for download"""
    zip_path = '/content/multi_doc_rag_output.zip'

    print('Creating zip file with all outputs...')
    with zipfile.ZipFile(zip_path, 'w') as zipf:
        for file in os.listdir(OUTPUT_DIR):
            file_path = os.path.join(OUTPUT_DIR, file)
            if os.path.isfile(file_path):
                zipf.write(file_path, file)
                print(f'  Added: {file}')

    size_mb = os.path.getsize(zip_path) / (1024 * 1024)
    print(f'\n✅ Created zip file: {zip_path} ({size_mb:.2f} MB)')

    # Start download (only works if <500MB)
    files.download(zip_path)
    print('✅ Download started!')

# Run
download_rag_files()


Creating zip file with all outputs...
  Added: faiss_index_metadata.pkl
  Added: faiss_index.idx
  Added: chunks_with_embeddings.jsonl
  Added: all_chunks.jsonl

✅ Created zip file: /content/multi_doc_rag_output.zip (2.92 MB)


✅ Download started!
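
`zipfile.ZipFile(path, 'w')` stores files uncompressed by default (`ZIP_STORED`). Since the JSONL outputs are plain text, passing `compression=zipfile.ZIP_DEFLATED` may shrink the download noticeably. The sketch below runs on a temporary directory with made-up content; `zip_directory` is a hypothetical helper, and the actual savings depend on the data.

```python
import os
import tempfile
import zipfile


def zip_directory(src_dir: str, zip_path: str) -> list:
    """Zip the regular files in src_dir with DEFLATE compression."""
    added = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        for name in sorted(os.listdir(src_dir)):
            file_path = os.path.join(src_dir, name)
            if os.path.isfile(file_path):
                zipf.write(file_path, arcname=name)
                added.append(name)
    return added


with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "output")
    os.makedirs(out_dir)

    # Made-up, repetitive JSONL content standing in for real pipeline output
    sample_path = os.path.join(out_dir, "all_chunks.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"id": "sample_1"}\n' * 1000)

    zip_path = os.path.join(tmp_dir, "bundle.zip")
    names = zip_directory(out_dir, zip_path)
    raw_size = os.path.getsize(sample_path)
    zip_size = os.path.getsize(zip_path)

print(names, zip_size < raw_size)  # ['all_chunks.jsonl'] True
```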


## Summary

This notebook implements a complete multi-document RAG pipeline.

### Key Features
- **Processes ALL PDFs** in `/content/data` automatically
- **Unified search** across all documents
- **Document-specific search** when needed
- **Cross-document comparison** capabilities
- **Source attribution** for all results

### Pipeline Components
1. **Batch Document Processing**: Automatically processes all PDFs in the data directory
2. **Unified Embedding Generation**: Creates embeddings for all documents
3. **Multi-Document Search**: FAISS index for searching across all documents
4. **Answer Generation**: `google/flan-t5-base` answers from the retrieved context, which carries source and page attribution

### Usage Instructions

1. **Add PDFs to `/content/data/`**:
   - Upload multiple PDFs using the upload button
   - Or mount Google Drive and copy PDFs to the data folder

2. **Run all cells in sequence**:
   - The pipeline will automatically process all PDFs
   - Creates a unified search index

3. **Query your documents**:
   - Search across all documents
   - Search within specific documents
   - Compare information across documents
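
For the Drive route in step 1, the copy step could look roughly like the sketch below. It runs against temporary directories so it works outside Colab; `copy_pdfs` is a hypothetical helper, and the Drive folder path in the comment is an assumption that depends on where the PDFs actually live.

```python
import os
import shutil
import tempfile


def copy_pdfs(src_dir: str, dest_dir: str) -> list:
    """Copy every .pdf/.PDF file from src_dir into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if os.path.isfile(src_path) and name.lower().endswith(".pdf"):
            shutil.copy2(src_path, os.path.join(dest_dir, name))
            copied.append(name)
    return copied


# In Colab, Drive would typically be mounted first:
#   from google.colab import drive
#   drive.mount('/content/drive')
# and src_dir would point at a Drive folder, for example the
# hypothetical path '/content/drive/MyDrive/pdfs'.

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "drive_folder")
    dest = os.path.join(tmp_dir, "data")
    os.makedirs(src)
    for name in ("guide.pdf", "SCAN.PDF", "notes.txt"):
        with open(os.path.join(src, name), "wb") as f:
            f.write(b"placeholder")

    copied = copy_pdfs(src, dest)
    dest_files = sorted(os.listdir(dest))

print(copied)  # ['SCAN.PDF', 'guide.pdf']
```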

### Tips for Google Colab

- **GPU Runtime**: Enable GPU for faster processing (Runtime > Change runtime type > GPU)
- **Persistent Storage**: Mount Google Drive to save processed files
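- **Large Documents**: The flat FAISS index performs exact search over every vector, which suits collections of this size (255 chunks here); for much larger collections, an approximate FAISS index may be worth evaluating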
- **Batch Upload**: You can upload multiple PDFs at once

### Next Steps

- Experiment with different search modes
- Adjust chunking parameters for your documents
- Try different embedding models for domain-specific content
- Export the index for use in production systems