# Phase 1: Embedding Pipeline

This notebook demonstrates the complete document processing pipeline for Phase 1:
1. **Data Fetching** (ArXiv metadata and PDFs)
2. **Document Loading** (combining metadata with text)
3. **Document Chunking** (splitting into manageable pieces)
4. **Document Embedding** (generating vector representations)
5. **Data Persistence** (saving processed chunks)
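
To make step 3 concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the project's `DocumentChunker`: the function name `chunk_text` and the `chunk_size` and `overlap` values are assumptions chosen purely for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so text cut at a
    boundary keeps some surrounding context in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of stand-in text
pieces = chunk_text(sample, chunk_size=20, overlap=5)
print(len(pieces), [len(p) for p in pieces])
```

A larger overlap preserves more context across boundaries at the cost of more chunks to embed and store.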

## ⚠️ IMPORTANT: Fetch Data First!

**Before running this notebook, you MUST fetch ArXiv data first!**

No data = No chunks = No embeddings

Run this command in your terminal:
```bash
# Fetch 100 papers (metadata only - for abstracts)
python scripts/fetch_arxiv_data.py --max-results 100

# OR fetch with PDFs (for full-text processing)
python scripts/fetch_arxiv_data.py --max-results 100 --download-pdfs
```

See `docs/phase_guides/phase1.md` for detailed instructions.

In [3]:
# Setup: Add src to path and configure logging
import sys
from pathlib import Path

project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Imports
import src.config
from src.utils.logging_config import setup_logging
from src.embedding.document_processor import DocumentProcessor, process_arxiv_abstracts
from src.constants import ChunkingStrategy

# Setup logging
logger = setup_logging('embedding_pipeline')
logger.info('Embedding pipeline notebook initialized')

[INFO] [embedding_pipeline] Embedding pipeline notebook initialized


## Step 1: Verify Data Availability

Before processing, let's check if we have ArXiv metadata files.

In [None]:
# Check if metadata files exist
from src import config
from src.constants import ARXIV_METADATA_SUBDIR

metadata_dir = config.RAW_DATA_DIR / ARXIV_METADATA_SUBDIR
metadata_files = list(metadata_dir.glob("*.json")) if metadata_dir.exists() else []

print(f"📁 Metadata directory: {metadata_dir}")
print(f"📄 Found {len(metadata_files)} metadata files")

if len(metadata_files) == 0:
    print("\n⚠️  WARNING: No metadata files found!")
    print("\nPlease run this command in your terminal first:")
    print("  python scripts/fetch_arxiv_data.py --max-results 100")
    print("\nOr see docs/phase_guides/phase1.md for detailed instructions.")
else:
    print("✅ Data ready! Sample files:")
    for f in metadata_files[:5]:
        print(f"   - {f.name}")
    if len(metadata_files) > 5:
        print(f"   ... and {len(metadata_files) - 5} more")

📁 Metadata directory: /mnt/data/lourvens/learning/research-agent/data/raw/arxiv_metadata
📄 Found 100 metadata files
✅ Data ready! Sample files:
   - 2010.09254v1.json
   - 1808.02632v1.json
   - 2011.02705v1.json
   - 1808.10568v2.json
   - 2010.00247v2.json
   ... and 95 more


## Step 2: Process Documents

Choose one of the following options:

### Option A: Process Abstracts Only (Fast, No PDFs Required)
- Uses only metadata (title + abstract)
- Fast processing (~15 seconds for 100 papers)
- No PDF downloads needed

### Option B: Process Full Papers (Slower, Requires PDFs)
- Includes full PDF text
- Slower processing (~5-10 minutes for 100 papers)
- Requires PDFs to be downloaded first


In [5]:
# Option A: Process Abstracts Only (Recommended for testing)
# This is fast and doesn't require PDF downloads

print("🔄 Processing abstracts only...")
print("=" * 60)

embedded_docs = process_arxiv_abstracts(max_documents=10)

print(f"\n✅ Processed {len(embedded_docs)} documents")
if len(embedded_docs) > 0:
    # len() gives the embedding dimension for both lists and 1-D numpy arrays
    embedding = embedded_docs[0].metadata['embedding']
    emb_dim = len(embedding)

    print(f"📊 Embedding dimension: {emb_dim}")
    print("📝 Sample document:")
    print(f"   Title: {embedded_docs[0].metadata.get('title', 'N/A')[:60]}...")
    print(f"   ArXiv ID: {embedded_docs[0].metadata.get('arxiv_id', 'N/A')}")
    print(f"   Content length: {len(embedded_docs[0].page_content)} chars")
else:
    print("⚠️  No documents were processed. Please ensure metadata files exist.")


🔄 Processing abstracts only...
[INFO] [pdf_processor] Initialized PDFProcessor with loader: pymupdf
[INFO] [chunking] Initialized DocumentChunker
[INFO] [embedder] Initializing DocumentEmbedder


  self.embeddings = HuggingFaceEmbeddings(


[INFO] [embedder] DocumentEmbedder initialized successfully
[INFO] [document_processor] Initialized DocumentProcessor pipeline
[INFO] [document_processor] Starting document processing pipeline
[INFO] [document_loader] Loading all documents
[INFO] [document_loader] Found 10 metadata files
[INFO] [document_loader] Completed document loading
[INFO] [document_loader] Saved loaded documents (one file per document)
[INFO] [document_loader] Saved loaded documents for testing/monitoring
[INFO] [chunking] Chunking 10 documents
[INFO] [chunking] Completed chunking
[INFO] [embedder] Starting document embedding
[INFO] [embedder] Completed document embedding
[INFO] [chunk_saver] Saved processed chunks
[INFO] [document_processor] Saved processed chunks to /mnt/data/lourvens/learning/research-agent/data/processed/arxiv/chunks/arxiv_chunks_2025-12-16_10-48-16_abstracts.json
[INFO] [document_processor] Document processing pipeline completed

✅ Processed 10 documents
📊 Embedding dimension: 384
📝

### Option B: Process Full Papers (Uncomment to use)

**Note**: This requires PDFs to be downloaded first. Run:
```bash
python scripts/fetch_arxiv_data.py --max-results 100 --download-pdfs
```


In [None]:
# Option B: Process Full Papers (Uncomment to use)
# WARNING: This is slow and requires PDFs to be downloaded first!

# processor = DocumentProcessor(
#     embedding_model="all-MiniLM-L6-v2",
#     chunk_strategy=ChunkingStrategy.RECURSIVE
# )
# 
# print("🔄 Processing full papers...")
# print("=" * 60)
# 
# embedded_docs = processor.process_documents(
#     include_full_text=True,
#     max_documents=10,
#     save_to_disk=True
# )
# 
# print(f"\n✅ Processed {len(embedded_docs)} documents")
# if len(embedded_docs) > 0:
#     # len() gives the embedding dimension for both lists and 1-D numpy arrays
#     embedding = embedded_docs[0].metadata['embedding']
#     emb_dim = len(embedding)
#
#     print(f"📊 Embedding dimension: {emb_dim}")
#     print("📝 Sample document:")
#     print(f"   Title: {embedded_docs[0].metadata.get('title', 'N/A')[:60]}...")
#     print(f"   ArXiv ID: {embedded_docs[0].metadata.get('arxiv_id', 'N/A')}")
#     print(f"   Content length: {len(embedded_docs[0].page_content)} chars")


## Step 3: Verify Results

Let's check what was saved and verify the processing was successful.


In [None]:
# Check processed chunks directory
from src.constants import DATA_SOURCE_ARXIV, PROCESSED_CHUNKS_SUBDIR

processed_dir = config.PROCESSED_DATA_DIR / DATA_SOURCE_ARXIV / PROCESSED_CHUNKS_SUBDIR
chunk_files = list(processed_dir.glob("*.json")) if processed_dir.exists() else []

print(f"📁 Processed chunks directory: {processed_dir}")
print(f"💾 Found {len(chunk_files)} chunk files")

if len(chunk_files) > 0:
    # Get the most recent file
    latest_file = max(chunk_files, key=lambda p: p.stat().st_mtime)
    print(f"\n📄 Latest chunk file: {latest_file.name}")
    print(f"   Size: {latest_file.stat().st_size / 1024:.2f} KB")
    
    # Load and inspect
    import json
    with latest_file.open("r", encoding="utf-8") as f:
        chunk_data = json.load(f)
    
    print("\n📊 Chunk file statistics:")
    print(f"   Source: {chunk_data.get('source', 'N/A')}")
    print(f"   Total chunks: {len(chunk_data.get('chunks', []))}")

    if chunk_data.get('chunks'):
        sample_chunk = chunk_data['chunks'][0]
        print("\n📝 Sample chunk:")
        print(f"   Chunk ID: {sample_chunk.get('chunk_id', 'N/A')}")
        print(f"   ArXiv ID: {sample_chunk.get('metadata', {}).get('arxiv_id', 'N/A')}")
        print(f"   Title: {sample_chunk.get('metadata', {}).get('title', 'N/A')[:60]}...")
        print(f"   Has embedding: {'embedding' in sample_chunk.get('metadata', {})}")
        if 'embedding' in sample_chunk.get('metadata', {}):
            emb = sample_chunk['metadata']['embedding']
            print(f"   Embedding dim: {len(emb) if isinstance(emb, list) else 'N/A'}")
else:
    print("\n⚠️  No chunk files found. Processing may have failed or save_to_disk=False.")


📁 Processed chunks directory: /mnt/data/lourvens/learning/research-agent/data/processed/arxiv/chunks
💾 Found 1 chunk files

📄 Latest chunk file: arxiv_chunks_2025-12-16_10-48-16_abstracts.json
   Size: 140.53 KB

📊 Chunk file statistics:
   Source: arxiv
   Total chunks: 10

📝 Sample chunk:
   Chunk ID: 2010.09254v1_chunk_0
   ArXiv ID: 2010.09254v1
   Title: Query-aware Tip Generation for Vertical Search...
   Has embedding: True
   Embedding dim: 384
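
The cell above inspects only the first chunk. A minimal sketch of checking every chunk in a file follows, assuming the layout seen in the output above (a top-level `chunks` list whose items carry `chunk_id` and `metadata.embedding`). The helper name `check_embeddings` and the synthetic demo file are assumptions, not part of the project.

```python
import json
import tempfile
from pathlib import Path


def check_embeddings(chunk_file: Path) -> dict:
    """Summarise embedding coverage and dimensions for one chunk file."""
    with chunk_file.open("r", encoding="utf-8") as f:
        chunks = json.load(f).get("chunks", [])

    dims = set()
    missing = []
    for chunk in chunks:
        emb = chunk.get("metadata", {}).get("embedding")
        if isinstance(emb, list) and emb:
            dims.add(len(emb))
        else:
            missing.append(chunk.get("chunk_id", "N/A"))
    return {"total": len(chunks), "dims": sorted(dims), "missing": missing}


# Demonstration on a tiny synthetic file that mimics the layout seen above.
demo = {
    "source": "arxiv",
    "chunks": [
        {"chunk_id": "a_chunk_0", "metadata": {"embedding": [0.1, 0.2, 0.3]}},
        {"chunk_id": "b_chunk_0", "metadata": {"embedding": [0.4, 0.5, 0.6]}},
        {"chunk_id": "c_chunk_0", "metadata": {}},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_chunks.json"
    path.write_text(json.dumps(demo), encoding="utf-8")
    report = check_embeddings(path)

print(report)
```

In the notebook, calling `check_embeddings(latest_file)` should report a single dimension and an empty `missing` list if every chunk was embedded with the same model.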


## Step 4: Inspect Document Metadata

Verify that all required metadata fields are preserved (per AGENT.md Rule 2).


In [8]:
# Verify metadata completeness
if 'embedded_docs' in globals() and len(embedded_docs) > 0:
    required_fields = [
        'source', 'arxiv_id', 'title', 'authors', 
        'published', 'pdf_url', 'embedding'
    ]
    
    sample_doc = embedded_docs[0]
    metadata = sample_doc.metadata
    
    print("📋 Required Metadata Fields (AGENT.md Rule 2):")
    print("=" * 60)
    
    missing_fields = []
    for field in required_fields:
        if field in metadata:
            value = metadata[field]
            if field == 'embedding':
                # Handle both list and numpy array
                if hasattr(value, 'shape'):
                    emb_info = f"{type(value).__name__} (shape: {value.shape})"
                else:
                    emb_info = f"{type(value).__name__} (dim: {len(value)})"
                print(f"   ✅ {field}: {emb_info}")
            elif field == 'authors':
                print(f"   ✅ {field}: {len(value)} author(s)")
            else:
                display_value = (str(value)[:50] + "...") if len(str(value)) > 50 else str(value)
                print(f"   ✅ {field}: {display_value}")
        else:
            print(f"   ❌ {field}: MISSING")
            missing_fields.append(field)
    
    if missing_fields:
        print(f"\n⚠️  Warning: {len(missing_fields)} required field(s) missing: {missing_fields}")
    else:
        print("\n✅ All required metadata fields present!")
else:
    print("⚠️  No documents to inspect. Run processing cells first.")


📋 Required Metadata Fields (AGENT.md Rule 2):
   ✅ source: arxiv
   ✅ arxiv_id: 2010.09254v1
   ✅ title: Query-aware Tip Generation for Vertical Search
   ✅ authors: 10 author(s)
   ✅ published: 2020-10-19T06:48:40+00:00
   ✅ pdf_url: https://arxiv.org/pdf/2010.09254v1
   ✅ embedding: list (dim: 384)

✅ All required metadata fields present!
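
The check above looks at the first document only. A sketch of applying the same required-field list to every record is shown below. `find_incomplete` is a hypothetical helper, and the records here are synthetic stand-ins for each document's `metadata` dict.

```python
REQUIRED_FIELDS = [
    "source", "arxiv_id", "title", "authors",
    "published", "pdf_url", "embedding",
]


def find_incomplete(metadatas: list[dict], required=REQUIRED_FIELDS) -> dict[int, list[str]]:
    """Map the index of each incomplete record to its missing field names."""
    problems = {}
    for i, meta in enumerate(metadatas):
        missing = [field for field in required if field not in meta]
        if missing:
            problems[i] = missing
    return problems


# Synthetic records standing in for `doc.metadata` of each processed document.
complete = {field: "x" for field in REQUIRED_FIELDS}
partial = {k: v for k, v in complete.items() if k not in ("pdf_url", "embedding")}
problems = find_incomplete([complete, partial])
print(problems)
```

With the notebook's objects, this could be called as `find_incomplete([d.metadata for d in embedded_docs])`; an empty dict would mean every document carries all required fields.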


## Step 5: Summary and Next Steps

### What We've Accomplished
- ✅ Fetched ArXiv metadata (via script)
- ✅ Loaded documents with metadata
- ✅ Generated embeddings for document chunks
- ✅ Saved processed chunks to disk
- ✅ Verified metadata completeness

### Next Steps
1. **Vector Store**: Create ChromaDB vector store (Phase 1 continuation)
2. **RAG Chain**: Build question-answering pipeline
3. **Phase 2**: Add multi-source integration
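
As a preview of what a vector store does with these embeddings, here is a sketch of brute-force cosine-similarity ranking in plain Python. It is not ChromaDB's API: the function names and the toy 3-dimensional vectors are assumptions for illustration, and a real store would index the vectors rather than scan them all.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], embeddings: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda cid: cosine_similarity(query, embeddings[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 3-dimensional vectors; the notebook's real embeddings have 384 dimensions.
store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
nearest = top_k([1.0, 0.0, 0.0], store, k=2)
print(nearest)
```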

### Resources
- 📖 [Phase 1 Guide](docs/phase_guides/phase1.md) - Detailed documentation
- 📝 [AGENT.md](AGENT.md) - Architecture rules
- 🧪 [Tests](tests/) - Run `pytest tests/ -v` to verify everything works

