# PaperBot RAG Demo

This notebook demonstrates:
1. Loading a pretrained embedding model
2. Document chunking and embedding
3. Vector search and retrieval
4. RAG-based Q/A with citations


In [None]:
import os
import sys
import numpy as np
from pathlib import Path

# Add project root to path (assumes the notebook runs from a directory one level below the root)
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'paperbot.settings')
import django
django.setup()


## 1. Load Embedding Model


In [None]:
from sentence_transformers import SentenceTransformer

# Load embedding model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
embedding_model = SentenceTransformer(model_name)

print(f"Model loaded: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")


## 2. Sample Document Processing


In [None]:
# Sample research paper text (in production, this comes from PDF extraction)
sample_text = """
Machine Learning for Document Understanding

Abstract: This paper presents a novel approach to document understanding using 
machine learning techniques. We propose a transformer-based architecture that 
can extract structured information from unstructured documents.

1. Introduction
Document understanding is a critical task in information retrieval systems. 
Traditional methods rely on rule-based extraction, which is brittle and 
does not scale well.

2. Methodology
Our approach uses a pre-trained transformer model fine-tuned on document 
understanding tasks. We employ a two-stage process: first, we extract text 
from PDFs using OCR, then we apply semantic embeddings to enable similarity search.

3. Results
We evaluated our system on a dataset of 10,000 research papers. The system 
achieved 95% accuracy in information extraction tasks.

4. Related Work
Previous work in document understanding includes BERT-based models (Devlin et al., 2019) 
and GPT-based approaches (Brown et al., 2020). Our method builds upon these 
foundations while introducing novel improvements.

5. Conclusion
We have demonstrated that transformer-based models can effectively understand 
and extract information from documents. Future work will focus on multi-modal 
understanding incorporating images and tables.
"""

print(f"Sample text length: {len(sample_text)} characters")


## 3. Chunk the Text


In [None]:
from api.utils import PDFProcessor

# Chunk the text
chunks = PDFProcessor.chunk_text(sample_text, chunk_size=200, overlap=50)

print(f"Number of chunks: {len(chunks)}")
print("\nFirst chunk:")
print(chunks[0]['text'][:200] + "...")
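
The implementation of `PDFProcessor.chunk_text` is not shown in this notebook, so whether `chunk_size` and `overlap` count characters, words, or tokens is unknown here. As a rough illustration of the sliding-window idea only, the sketch below uses a hypothetical helper, `chunk_words`, that counts words and returns dictionaries with a `text` key like the ones used above. The `start_word` field is also an assumption made for this example.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping word windows (hypothetical helper, not the project's code)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"text": " ".join(window), "start_word": start})
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demo: 10 words, windows of 4 words that share 2 words with their neighbor
demo_words = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_words(demo_words, chunk_size=4, overlap=2)
for c in demo_chunks:
    print(c["start_word"], c["text"])
```

Overlap trades index size for recall: a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.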


## 4. Create Embeddings


In [None]:
# Create embeddings for all chunks
chunk_texts = [chunk['text'] for chunk in chunks]
embeddings = embedding_model.encode(chunk_texts, convert_to_numpy=True)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Number of chunks: {len(chunk_texts)}")
print(f"Embedding dimension: {embeddings.shape[1]}")


## 5. Build Vector Index (FAISS)


In [None]:
import faiss

# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
embeddings_float32 = embeddings.astype('float32')
index.add(embeddings_float32)

print(f"Index size: {index.ntotal}")
print(f"Index dimension: {index.d}")
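
`IndexFlatL2` stores the vectors as-is and compares the query against every one of them, reporting squared Euclidean distances. The pure-Python sketch below mimics that behavior on toy data to make the search step concrete. The function name `flat_l2_search` and the toy vectors are made up for this example. It is far slower than FAISS and only meant for intuition.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search by squared L2 distance (illustrative only)."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # ascending distance: closest vectors first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings"; the query lies closest to vector 1, then vector 0
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_indices, toy_distances)
```

A flat index is exact and needs no training, which suits a demo with a handful of chunks. For large collections, approximate index types are usually worth considering.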


## 6. Query and Retrieve


In [None]:
# Example query
query = "What is the methodology used in this paper?"

# Create query embedding
query_embedding = embedding_model.encode([query], convert_to_numpy=True).astype('float32')

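# Search for similar chunks (IndexFlatL2 returns squared L2 distances; lower is closer)
k = min(3, index.ntotal)  # Top 3 results, capped so FAISS never pads results with index -1
distances, indices = index.search(query_embedding, k)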

print(f"Query: {query}")
print(f"\nTop {k} similar chunks:")
print("=" * 80)

for i, (idx, dist) in enumerate(zip(indices[0], distances[0])):
    print(f"\nResult {i+1} (distance: {dist:.4f}):")
    print(f"Chunk {idx}:")
    print(chunks[idx]['text'][:300] + "...")
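
The distances printed above are squared L2 values, which are harder to interpret than a similarity score. If the embeddings are unit-normalized (an assumption here; `encode` accepts `normalize_embeddings=True` to enforce it), then for vectors a and b we have ||a - b||² = 2 - 2·cos(a, b), so cosine similarity can be recovered as 1 - d²/2. A small check of that identity, with helper names chosen for this example:

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_from_squared_l2(sq_dist):
    """Convert a squared L2 distance to cosine similarity; valid only for unit vectors."""
    return 1.0 - sq_dist / 2.0


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
direct_cosine = sum(x * y for x, y in zip(a, b))

# Both values should agree up to floating-point error
print(cosine_from_squared_l2(sq_dist), direct_cosine)
```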


## 7. RAG Prompt Construction


In [None]:
# Build RAG prompt with retrieved context.
# Chunks are labeled by retrieval rank (Chunk 1 = closest match), not by document position.
retrieved_chunks = [chunks[idx] for idx in indices[0]]

context = "\n\n".join([
    f"[Chunk {i+1}]\n{chunk['text']}" 
    for i, chunk in enumerate(retrieved_chunks)
])

prompt = f"""You are a research assistant. Answer the question based on the provided context.
Always cite which chunk you're referencing.

Context:
{context}

Question: {query}

Answer:"""

print("RAG Prompt:")
print("=" * 80)
print(prompt)
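
With longer chunks or a larger `k`, the assembled context can exceed what the LLM accepts. One possible way to handle this, sketched below, is to wrap prompt construction in a function that stops adding chunks once a budget is reached. The function name `build_rag_prompt`, the `max_context_chars` parameter, and the character-based budget are all assumptions for illustration. A real system would more likely count tokens.

```python
def build_rag_prompt(query, retrieved, max_context_chars=2000):
    """Assemble a RAG prompt, dropping lower-ranked chunks that exceed the budget."""
    parts, used = [], 0
    for rank, chunk in enumerate(retrieved, start=1):
        block = f"[Chunk {rank}]\n{chunk['text']}"
        if used + len(block) > max_context_chars:
            break  # retrieved is sorted best-first, so stop at the first overflow
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "You are a research assistant. Answer the question based on the provided context.\n"
        "Always cite which chunk you're referencing.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer:"
    )


# Three 50-character chunks with a budget that only fits the first two
demo_retrieved = [{"text": "a" * 50}, {"text": "b" * 50}, {"text": "c" * 50}]
demo_prompt = build_rag_prompt("What is the methodology?", demo_retrieved, max_context_chars=130)
print(demo_prompt)
```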


## 8. Simulate LLM Response (with Citations)


In [None]:
# Simulated LLM response with hardcoded chunk references (in production, this calls OpenAI/Anthropic)
answer = """
The methodology used in this paper involves a two-stage process:

1. Text Extraction: First, text is extracted from PDFs using OCR technology.
   [Reference: Chunk 2]

2. Semantic Embeddings: Then, semantic embeddings are applied to enable 
   similarity search and retrieval. [Reference: Chunk 2]

The approach uses a pre-trained transformer model that is fine-tuned on 
document understanding tasks. [Reference: Chunk 2]
"""

print("Generated Answer with Citations:")
print("=" * 80)
print(answer)

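# Build citation records from the retrieved chunks (score is the squared L2 distance; lower is better)
citations = [
    {
        'chunk_id': int(idx),  # cast NumPy int64 to a plain int so the record is JSON-serializable
        'snippet': chunks[idx]['text'][:200],
        'score': float(dist)
    }
    for idx, dist in zip(indices[0], distances[0])
]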

print("\n\nCitations:")
for i, cite in enumerate(citations, 1):
    print(f"\nCitation {i}:")
    print(f"  Chunk ID: {cite['chunk_id']}")
    print(f"  Score: {cite['score']:.4f}")
    print(f"  Snippet: {cite['snippet'][:150]}...")
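
The cell above lists every retrieved chunk, whether or not the answer actually cites it. To report only the chunks the model referenced, the answer text can be parsed for its citation markers. The sketch below assumes the `[Reference: Chunk N]` format used in the simulated answer and assumes that `N` is the 1-based rank from the prompt. Both helper functions are hypothetical.

```python
import re


def extract_cited_ranks(answer_text):
    """Return distinct chunk numbers cited as '[Reference: Chunk N]', in order of first use."""
    seen = []
    for match in re.finditer(r"\[Reference:\s*Chunk\s+(\d+)\]", answer_text):
        rank = int(match.group(1))
        if rank not in seen:
            seen.append(rank)
    return seen


def resolve_citations(answer_text, retrieved):
    """Map cited chunk numbers to retrieved chunks, skipping out-of-range numbers."""
    resolved = []
    for rank in extract_cited_ranks(answer_text):
        if 1 <= rank <= len(retrieved):
            resolved.append({"rank": rank, "snippet": retrieved[rank - 1]["text"][:200]})
    return resolved


# Demo: Chunk 2 is cited twice, and Chunk 5 does not exist among the 3 retrieved chunks
demo_answer = (
    "OCR comes first. [Reference: Chunk 2] Embeddings follow. "
    "[Reference: Chunk 2] See also [Reference: Chunk 5]"
)
cited_pool = [{"text": "intro"}, {"text": "methodology"}, {"text": "results"}]
demo_citations = resolve_citations(demo_answer, cited_pool)
print(demo_citations)
```

Skipping out-of-range numbers guards against a model citing a chunk that was never in the prompt. Logging such cases is a reasonable alternative.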


## 9. Integration with Django Models


In [None]:
from core.models import Document, Chunk, ChunkEmbedding, EmbeddingModel
from django.contrib.auth import get_user_model

User = get_user_model()

# Example: Get a document and its chunks
try:
    doc = Document.objects.filter(status='indexed').first()
    if doc:
        print(f"Document: {doc.title}")
        print(f"Status: {doc.status}")
        print(f"Chunks: {doc.chunks.count()}")
        
        # Get embeddings
        embeddings_count = ChunkEmbedding.objects.filter(chunk__document=doc).count()
        print(f"Embeddings: {embeddings_count}")
    else:
        print("No indexed documents found. Upload a document first.")
except Exception as e:
    print(f"Error: {e}")


## 10. Summary

This notebook demonstrated:
1. ✅ Loading embedding models
2. ✅ Text chunking
3. ✅ Creating embeddings
4. ✅ Building vector index (FAISS)
5. ✅ Querying and retrieval
6. ✅ RAG prompt construction
7. ✅ Citation extraction

In production, these steps are automated through:
- Celery tasks for async processing
- Django models for persistence
- API endpoints for Q/A and summarization
