# Document Ingestion for Baseline RAG

This notebook demonstrates how to ingest documents into the baseline RAG system using the document preprocessing utilities.

## Features
- Text extraction and cleaning
- Document chunking with configurable size and overlap
- Metadata preservation
- Batch processing for large document sets
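
To make "configurable size and overlap" concrete, here is a minimal sketch of character-based chunking. The function `chunk_text` and its parameters are hypothetical and written only for illustration; the actual `DocumentPreprocessor` may chunk differently (for example by tokens or sentences).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters.

    Hypothetical helper, not part of the notebook utilities.
    Consecutive chunks share `overlap` characters so that a sentence
    cut at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap keeps more shared context between neighboring chunks at the cost of storing and embedding more redundant text.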

## Usage
1. Load source material (text files, PDFs, etc.)
2. Process and chunk documents
3. Ingest into the RAG system
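
The loading and batching steps above can be sketched roughly as follows. Both `load_text_documents` and `batched` are hypothetical helpers, assumed here for illustration; they are not the API of `ingest_documents`, whose internals this notebook does not show. The `content`/`metadata` record shape mirrors the keys printed from the query result later in the notebook.

```python
from pathlib import Path
from typing import Dict, Iterator, List


def load_text_documents(directory: str, metadata: Dict[str, str]) -> List[dict]:
    """Read every .txt file in a directory into a content/metadata record.

    Hypothetical helper: shared metadata is copied onto each record and
    the file name is added as `source`, so provenance is preserved.
    """
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):
        content = path.read_text(encoding="utf-8").strip()
        if content:  # skip empty or whitespace-only files
            docs.append({
                "content": content,
                "metadata": {**metadata, "source": path.name},
            })
    return docs


def batched(items: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield successive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

Processing in batches bounds how many documents are held and sent to the index at once, which matters mainly for large document sets.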

In [None]:
import os
from pathlib import Path
from utils.notebook_utils.importable import notebook_to_module
from utils.notebook_utils.document_utils import DocumentPreprocessor, ingest_documents

# Import our RAG implementation
baseline_rag = notebook_to_module('implementation.ipynb')
BaselineRAG = baseline_rag.BaselineRAG

In [None]:
# Example usage
def test_ingestion():
    """Write two sample files, ingest them into a test index, and run one query."""
    # Create test documents
    os.makedirs('test_docs', exist_ok=True)
    
    with open('test_docs/doc1.txt', 'w', encoding='utf-8') as f:
        f.write("""
        Machine learning is a subset of artificial intelligence that focuses on developing systems that can learn from data.
        Deep learning is a type of machine learning that uses neural networks with multiple layers.
        Reinforcement learning is another type of machine learning where agents learn by interacting with an environment.
        """)
    
    with open('test_docs/doc2.txt', 'w', encoding='utf-8') as f:
        f.write("""
        Natural Language Processing (NLP) is a field of AI that focuses on interactions between computers and human language.
        Common NLP tasks include text classification, named entity recognition, and machine translation.
        """)
    
    try:
        # Initialize RAG system
        rag = BaselineRAG(index_name="test-rag-documents")
        
        # Ingest the files in test_docs with shared metadata, two at a time
        print("Ingesting documents...")
        ingest_documents(
            'test_docs',
            rag,
            metadata={'dataset': 'test', 'domain': 'AI/ML'},
            batch_size=2
        )
        
        # Test query
        print("\nTesting query...")
        result = rag.query("What is machine learning and deep learning?")
        
        print("\nResponse:", result['response'])
        print("\nContext used:")
        for doc in result['context']:
            print(f"- {doc['content']}")
            print(f"  Metadata: {doc['metadata']}")
    finally:
        # Cleanup: remove the temporary files even if ingestion or the query fails
        import shutil
        shutil.rmtree('test_docs')

if __name__ == "__main__":
    test_ingestion()