# 🔬 SciCO - The Zotero Library RAG System

Welcome to **SciCO** (Scientific Co-worker)! This interactive tutorial will guide you through:

1. **Configuration** - Setting up your environment
2. **Zotero Integration** - Connecting to your library and retrieving metadata
3. **PDF Processing** - Converting PDFs to searchable markdown
4. **Text Chunking** - Breaking documents into semantic pieces
5. **Vector Storage** - Creating embeddings for semantic search
6. **Retrieval** - Querying your knowledge base

---

## 🎯 What is RAG?

**Retrieval-Augmented Generation (RAG)** combines:
- **Vector databases** for semantic search
- **Your documents** as the knowledge source
- **LLMs** for intelligent question answering

This allows you to ask questions about your scientific papers and get answers grounded in your actual sources!
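
The three ingredients meet in the prompt: retrieved chunks are placed in front of the question before the LLM sees it. The following is a minimal sketch of that augmentation step, assuming the chunks have already been retrieved. The helper `build_rag_prompt` and the prompt wording are illustrative assumptions, not part of SciCO.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble an LLM prompt that grounds the answer in retrieved chunks."""
    # Number each chunk so the model can refer to its sources as [1], [2], ...
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder chunks; in the notebook these would come from the vector store.
prompt = build_rag_prompt(
    "What is an EEG microstate?",
    ["Chunk about microstates.", "Chunk about complexity measures."],
)
print(prompt)
```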

---
## 📋 Prerequisites

Before starting, ensure you have:

✅ **Ollama** installed and running ([https://ollama.ai](https://ollama.ai))  
✅ **Zotero** with a populated library  
✅ A `.env` file with required variables (see below)  
✅ Python packages installed (`pip install -r requirements.txt`)


---
# 1️⃣ Configuration

The project requires a `.env` file in your project root with these variables:

### 📝 Required Environment Variables

In [1]:
# Example .env file structure:
# Copy this to your .env file and fill in the paths

example_env = """
# Name of the collection in your Zotero library
COLLECTION_NAME='Your Collection Name'

# Path where markdown files will be saved
MARKDOWN_FOLDER_PATH='/path/to/markdown/output'

# Path to your Zotero data folder (contains zotero.sqlite)
ZOTERO_LIBRARY_PATH='/path/to/Zotero/data'

# Path to the ChromaDB index file (should end in .db)
INDEX_PATH='/path/to/index/chroma.db'

# (Optional) For testing
TEST_PDF_PATH='/path/to/test/paper.pdf'
"""

print("📄 Example .env configuration:")
print(example_env)



### 🔧 Import Dependencies and Setup

In [2]:
import sys
import os
from pathlib import Path
import requests
import json
from pprint import pprint
from dotenv import load_dotenv
from IPython.display import display, Markdown, HTML
import warnings
warnings.filterwarnings('ignore')

# Add project source to path
project_src = Path.cwd()
if str(project_src) not in sys.path:
    sys.path.insert(0, str(project_src))

# Load environment variables
load_dotenv()

print("✅ Dependencies imported successfully!")

✅ Dependencies imported successfully!


### 🏥 Health Check: Verify Ollama is Running

In [3]:
def ensure_ollama_running(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> dict:
    """Check if Ollama is running and return status info."""
    base_url = f"http://{host}:{port}"
    try:
        resp = requests.get(f"{base_url}/api/version", timeout=timeout)
        if resp.status_code == 200:
            version_info = resp.json()
            return {
                'status': 'running',
                'url': base_url,
                'version': version_info.get('version', 'unknown')
            }
        else:
            return {
                'status': 'error',
                'message': f"Ollama responded with status {resp.status_code}"
            }
    except requests.exceptions.RequestException as e:
        return {
            'status': 'not_running',
            'message': f"Cannot reach Ollama at {base_url}",
            'error': str(e)
        }


# Run health check
ollama_status = ensure_ollama_running()

if ollama_status['status'] == 'running':
    display(HTML(
        f"<div style='padding:10px; background-color:#1a4d2e; border-left:4px solid #28a745; border-radius:4px; color:#ffffff;'>"
        f"<strong>✅ Ollama is running!</strong><br>"
        f"🔗 URL: {ollama_status['url']}<br>"
        f"📦 Version: {ollama_status['version']}"
        f"</div>"))
else:
    display(HTML(
        f"<div style='padding:10px; background-color:#5c1a1a; border-left:4px solid #dc3545; border-radius:4px; color:#ffffff;'>"
        f"<strong>❌ Ollama is NOT running!</strong><br>"
        f"⚠️ {ollama_status.get('message', 'Unknown error')}<br>"
        f"<em>Please start Ollama before continuing.</em>"
        f"</div>"))
    raise RuntimeError("Ollama must be running to use this notebook.")
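
A running server does not guarantee that the model you need has been pulled. The following sketch checks a model name against the JSON that Ollama's `/api/tags` endpoint returns. The helper names and the sample model name are assumptions for illustration; the notebook does not state which model SciCO uses.

```python
def list_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches an installed model, with or without a tag."""
    for name in list_model_names(tags_payload):
        # "llama3" should match "llama3:latest"; an exact match also counts.
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


# Hand-written payload in the shape /api/tags returns; the model is only an example.
sample = {"models": [{"name": "llama3:latest"}]}
found = has_model(sample, "llama3")
missing = has_model(sample, "mistral")
print(found, missing)
```

In the notebook, the payload would come from `requests.get(f"{base_url}/api/tags", timeout=2.0).json()`, using the same `base_url` as the health check.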


### 📊 Verify Configuration

In [4]:
# Check if all required environment variables are set
required_vars = ['COLLECTION_NAME', 'MARKDOWN_FOLDER_PATH', 'ZOTERO_LIBRARY_PATH', 'INDEX_PATH']
config_status = {}

print("🔍 Configuration Status:\n")
for var in required_vars:
    value = os.getenv(var)
    config_status[var] = value
    status = "✅" if value else "❌"
    print(f"{status} {var}: {value if value else 'NOT SET'}")

all_set = all(config_status.values())
if all_set:
    print("\n✅ All required variables are configured!")
else:
    print("\n⚠️ Some variables are missing. Please update your .env file.")

🔍 Configuration Status:

✅ COLLECTION_NAME: scico-test
✅ MARKDOWN_FOLDER_PATH: explore_langchain
✅ ZOTERO_LIBRARY_PATH: /home/soenke/Zotero
✅ INDEX_PATH: explore_langchain/test.db

✅ All required variables are configured!
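
The check above only prints a warning, so later cells still run with missing settings. A possible variant, sketched here with a hypothetical helper `missing_settings`, returns the missing names so the caller can stop early. It accepts any mapping, which keeps it testable without touching the real environment.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("COLLECTION_NAME", "MARKDOWN_FOLDER_PATH", "ZOTERO_LIBRARY_PATH", "INDEX_PATH")


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with a partial configuration; an empty string counts as missing.
example = {"COLLECTION_NAME": "scico-test", "INDEX_PATH": ""}
missing = missing_settings(example)
print(missing)
```

Calling `missing_settings()` without arguments inspects `os.environ`; raising an error when the returned list is non-empty would stop the notebook before any path is used.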


---
# 2️⃣ Zotero Integration

Let's connect to your Zotero database and explore what's inside!

In [5]:
from src.ZoteroIntegration import ZoteroMetadataRetriever

# Initialize the Zotero connection
zotero_path = Path(os.getenv('ZOTERO_LIBRARY_PATH'))
retriever = ZoteroMetadataRetriever(zotero_path)

print("🔌 Connecting to Zotero database...")
retriever.initialize()
print("✅ Connected successfully!")
print(f"📂 Database path: {retriever.config.sqlite_path}")

🔌 Connecting to Zotero database...
✅ Connected successfully!
📂 Database path: /home/soenke/Zotero/zotero.sqlite


### 📚 Explore Collections and PDFs

In [6]:
# Get PDFs from the configured collection
collection_name = os.getenv('COLLECTION_NAME')
print(f"📖 Retrieving PDFs from collection: '{collection_name}'\n")

pdfs = retriever.get_pdfs_in_collection(collection_name)

if pdfs:
    print(f"✅ Found {len(pdfs)} PDF(s) in this collection\n")
    print("📄 First 3 PDFs:")
    print("-" * 80)
    for i, pdf in enumerate(pdfs[:3], 1):
        print(f"\n{i}. {pdf['pdf_name']}")
        print(f"   Citation Key: {pdf['citationkey'] or 'None'}")
        print(f"   Item ID: {pdf['itemID']}")
        print(f"   Path: {pdf['pdf_path'] or 'Not found in storage'}")
else:
    print(f"⚠️ No PDFs found in collection '{collection_name}'")
    print("Tip: Make sure your collection name matches exactly (case-sensitive)")

📖 Retrieving PDFs from collection: 'scico-test'

✅ Found 1 PDF(s) in this collection

📄 First 3 PDFs:
--------------------------------------------------------------------------------

1. Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf
   Citation Key: wegnerComplexityMeasuresEEG2023
   Item ID: 289
   Path: /home/soenke/Zotero/storage/9N8E7TQU/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf
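
The path in the output follows Zotero's storage layout, where each stored attachment lives in `storage/<attachment key>/`. The sketch below rebuilds such a path, assuming the attachment's path is recorded with a `storage:` prefix as in Zotero's SQLite database. `resolve_attachment_path` is a hypothetical helper and not necessarily how `ZoteroMetadataRetriever` resolves paths.

```python
from pathlib import Path
from typing import Optional


def resolve_attachment_path(zotero_dir: Path, attachment_key: str, stored_path: str) -> Optional[Path]:
    """Rebuild the on-disk location of a stored attachment, or None for linked files."""
    prefix = "storage:"
    if not stored_path.startswith(prefix):
        # Linked attachments live outside the Zotero storage folder.
        return None
    return zotero_dir / "storage" / attachment_key / stored_path[len(prefix):]


# Directory, key and file name are illustrative values.
resolved = resolve_attachment_path(Path("/data/Zotero"), "9N8E7TQU", "storage:paper.pdf")
print(resolved)
```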


### 🔍 Deep Dive: Get Full Metadata for a PDF

In [7]:
# Let's examine the full metadata for the first PDF (if available)
if pdfs and pdfs[0]['pdf_path']:
    sample_pdf_path = Path(pdfs[0]['pdf_path'])
    print(f"🔬 Analyzing: {sample_pdf_path.name}\n")
    
    metadata = retriever.get_metadata_for_pdf(sample_pdf_path)
    
    if metadata:
        print("📊 Full Metadata:")
        print("=" * 80)
        print(f"\n📖 Title: {metadata.get('title', 'N/A')}")
        print(f"\n✍️ Authors: {metadata.get('authors', 'N/A')}")
        print(f"\n📅 Year: {metadata.get('year', 'N/A')}")
        print(f"\n🔗 DOI: {metadata.get('doi', 'N/A')}")
        print(f"\n🌐 URL: {metadata.get('url', 'N/A')}")
        print(f"\n📰 Publication: {metadata.get('publication_title', 'N/A')}")
        
        if metadata.get('abstract'):
            abstract = metadata['abstract']
            print(f"\n📝 Abstract: {abstract[:200]}..." if len(abstract) > 200 else f"\n📝 Abstract: {abstract}")
        
        if metadata.get('tags'):
            print(f"\n🏷️ Tags: {', '.join(metadata['tags'])}")
        
        if metadata.get('collections'):
            print(f"\n📁 Collections:")
            for coll in metadata['collections']:
                print(f"   - {coll['name']}")
        
        print("\n" + "=" * 80)
    else:
        print("⚠️ Could not retrieve metadata")
else:
    print("⚠️ No valid PDF path found for analysis")
    print("Tip: Make sure PDFs exist in your Zotero storage folder")

🔬 Analyzing: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf

📊 Full Metadata:

📖 Title: Complexity measures for EEG microstate sequences - concepts and algorithms

✍️ Authors: Wegner, Frederic von; Wiemers, Milena; Hermann, Gesine; Tödt, Inken; Tagliazucchi, Enzo; Laufs, Helmut

📅 Year: 2023

🔗 DOI: 10.21203/rs.3.rs-2878411/v1

🌐 URL: https://www.researchsquare.com/article/rs-2878411/v1

📰 Publication: None

📝 Abstract: EEG microstate sequence analysis quantifies properties of ongoing brain electrical activity which is known to exhibit complex dynamics across many time scales. In this report we review recent developm...

📁 Collections:
   - PCI-From-Resting-State-Reconstruction
   - scico-test



---
# 3️⃣ PDF to Markdown Conversion

Now let's convert a PDF to structured Markdown using the `marker` library with Ollama.

In [18]:
from src.PdfToMarkdown import convert_pdf_to_markdown

# We'll use the first PDF from our collection (if available)
if pdfs and pdfs[0]['pdf_path']:
    pdf_path = pdfs[0]['pdf_path']
    output_folder = os.getenv('MARKDOWN_FOLDER_PATH')
    
    print(f"📄 Converting PDF: {Path(pdf_path).name}")
    print(f"📂 Output folder: {output_folder}")
    print("\n⏳ This may take a few minutes depending on PDF size...\n")
    
    try:
        # Create output directory if it doesn't exist
        os.makedirs(output_folder, exist_ok=True)

        # Expected location of the generated markdown file
        pdf_name = Path(pdf_path).stem
        markdown_path = Path(output_folder) / pdf_name / f"{pdf_name}.md"

        # Convert PDF to markdown; an earlier conversion is reused
        try:
            convert_pdf_to_markdown(pdf_path=pdf_path, output_path=output_folder)
        except FileExistsError:
            print('File is already processed')

        if markdown_path.exists():
            print("\n✅ Conversion successful!")
            print(f"📝 Markdown file: {markdown_path}")
            
            # Show a preview
            with open(markdown_path, 'r', encoding='utf-8') as f:
                content = f.read()
                preview_length = 500
                print(f"\n📖 Preview (first {preview_length} characters):")
                print("=" * 80)
                print(content[:preview_length])
                print("...")
                print("=" * 80)
                print(f"\n📊 Total characters: {len(content):,}")
        else:
            print("⚠️ Markdown file not found after conversion")
            
    except Exception as e:
        print(f"❌ Error during conversion: {e}")
        print("Tip: Make sure Ollama is running and the PDF is accessible")
else:
    print("⚠️ No PDF available for conversion")
    print("Skipping this step...")
    markdown_path = None

📄 Converting PDF: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf
📂 Output folder: explore_langchain

⏳ This may take a few minutes depending on PDF size...

File is already processed

✅ Conversion successful!
📝 Markdown file: explore_langchain/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.md

📖 Preview (first 500 characters):
![](_page_0_Picture_0.jpeg)

# Complexity measures for EEG microstate sequences - concepts and algorithms

Frederic von Wegner ( [f.vonwegner@unsw.edu.au \)](mailto:f.vonwegner@unsw.edu.au) UNSW Sydney Milena Wiemers Klinikum Lüneburg Gesine Hermann Kiel University Inken Tödt Kiel University Enzo Tagliazucchi University of Buenos Aires Helmut Laufs Kiel University

Research Article

Keywords:

Posted Date: May 10th, 2023

DOI: <https://doi.org/1

---
# 4️⃣ Markdown Chunking

Large documents need to be split into smaller chunks for effective embedding and retrieval.
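
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification for illustration: `split_with_overlap` is a hypothetical helper that cuts on character counts only, whereas the `MarkdownChunker` used in this notebook also takes the markdown structure into account.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic text of 1200 characters so the window boundaries are easy to check.
text = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(text)
print([len(p) for p in pieces])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.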

In [19]:
from src.MarkdownChunker import MarkdownChunker

# Use the markdown file we just created (or provide a path to an existing one)
if 'markdown_path' in locals() and markdown_path and markdown_path.exists():
    print(f"✂️ Chunking markdown file: {markdown_path.name}")

    # Initialize chunker
    chunker = MarkdownChunker(
        md_path=str(markdown_path),
        chunk_size=500,
        chunk_overlap=50
    )
    
    # Perform chunking
    chunks = chunker.chunk(method='markdown+recursive')
    
    print(f"✅ Created {len(chunks)} chunks\n")
    
    # Show statistics
    chunk_lengths = [c.metadata['length'] for c in chunks]
    print("📊 Chunk Statistics:")
    print(f"   Min length: {min(chunk_lengths)} chars")
    print(f"   Max length: {max(chunk_lengths)} chars")
    print(f"   Avg length: {sum(chunk_lengths) / len(chunk_lengths):.0f} chars")
    
    # Display first chunk as example
    print("\n📄 Example Chunk:")
    print("=" * 80)
    example_chunk = chunks[0]
    print(f"ID: {example_chunk.metadata['split_id']}")
    print(f"\nMetadata:")
    for key, value in example_chunk.metadata.items():
        print(f"   {key}: {value}")
    print(f"\nContent:\n{example_chunk.page_content}")
    print("=" * 80)
    
else:
    print("⚠️ No markdown file available for chunking")
    print("Skipping this step...")
    chunks = None

✂️ Chunking markdown file: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.md
✅ Created 347 chunks

📊 Chunk Statistics:
   Min length: 1 chars
   Max length: 853 chars
   Avg length: 275 chars

📄 Example Chunk:
ID: 0

Metadata:
   table: False
   split_id: 0
   length: 27

Content:
![](_page_0_Picture_0.jpeg)
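
The statistics show chunks as short as 1 character, and the example chunk is only an image placeholder. Such chunks carry no searchable text. One option, sketched here under assumptions, is to filter them out before embedding. `is_informative` and its 20-character threshold are illustrative choices, not part of `MarkdownChunker`.

```python
import re

# Matches a line that consists solely of a markdown image, e.g. ![](picture.jpeg)
IMAGE_ONLY = re.compile(r"^\s*!\[[^\]]*\]\([^)]*\)\s*$")


def is_informative(text: str, min_chars: int = 20) -> bool:
    """Reject image-only chunks and chunks shorter than min_chars (arbitrary threshold)."""
    stripped = text.strip()
    if IMAGE_ONLY.match(stripped):
        return False
    return len(stripped) >= min_chars


samples = [
    "![](_page_0_Picture_0.jpeg)",
    "a",
    "EEG microstate sequences exhibit complex dynamics.",
]
kept = [s for s in samples if is_informative(s)]
print(kept)
```

Applied to the notebook's chunks, this would read `chunks = [c for c in chunks if is_informative(c.page_content)]`.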


---
# 5️⃣ Vector Storage with ChromaDB

Now we'll create embeddings and store them in a vector database for semantic search.

In [20]:
from src.VectorStorage import ChromaStorage

# Initialize ChromaDB storage
index_path = os.getenv('INDEX_PATH')
collection_name = os.getenv('COLLECTION_NAME')

print(f"🗄️ Initializing vector storage...")
print(f"📂 Index path: {index_path}")
print(f"📚 Collection: {collection_name}\n")

storage = ChromaStorage(index_path=index_path, collection_name=collection_name)

print(f"✅ ChromaDB initialized!")
print(f"📊 Current collection size: {storage.collection.count()} documents")

🗄️ Initializing vector storage...
📂 Index path: explore_langchain/test.db
📚 Collection: scico-test

✅ ChromaDB initialized!
📊 Current collection size: 0 documents
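
Conceptually, a vector store maps each text to a vector and answers a query by returning the stored texts whose vectors lie closest to the query vector. The toy sketch below shows that idea with word counts and cosine similarity. It is not how ChromaDB or `ChromaStorage` work internally; a real system uses a neural embedding model, and all function names here are made up for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Return the k documents most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    "microstate sequences show complex dynamics",
    "the pdf was converted to markdown",
    "entropy rate of microstate sequences",
]
results = top_k("complexity of microstate sequences", docs)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Word counts only match identical words, so "complexity" does not match "complex" here. Learned embeddings are used in practice because they place related wording close together.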


### 📥 Add Chunks to Vector Database

In [21]:
# Add chunks to the vector database (if we have them)
if 'chunks' in locals() and chunks:
    print(f"📤 Adding {len(chunks)} chunks to vector database...")
    print("⏳ Creating embeddings (this may take a moment)...\n")
    
    try:
        storage.add_documents(chunks)
        print(f"✅ Successfully added chunks!")
        print(f"📊 Collection now contains: {storage.collection.count()} documents")
    except Exception as e:
        print(f"❌ Error adding chunks: {e}")
else:
    print("⚠️ No chunks available to add")
    print("You can still query existing documents if the database is not empty")

📤 Adding 347 chunks to vector database...
⏳ Creating embeddings (this may take a moment)...

✅ Successfully added chunks!
📊 Collection now contains: 347 documents


---
# 6️⃣ Semantic Search & Retrieval

Now comes the magic! Let's query our knowledge base.

### 🔍 Simple Query Example

In [22]:
# Define a query
query = "What is criticality in EEG signals?"
n_results = 3

print(f"🔍 Query: '{query}'")
print(f"📊 Retrieving top {n_results} results...\n")

try:
    results = storage.query(query_texts=[query], n_results=n_results)
    
    if results['documents'] and results['documents'][0]:
        print(f"✅ Found {len(results['documents'][0])} relevant chunks\n")
        print("=" * 80)
        
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ), 1):
            # Map the distance to a similarity score in (0, 1]; a smaller distance gives a higher score
            similarity = 1 / (1 + distance)
            
            print(f"\n📄 Result {i}")
            print(f"   Similarity: {similarity:.3f} (distance: {distance:.3f})")
            print(f"   Source: {metadata.get('citationkey', 'Unknown')}")
            print(f"   Section: {metadata.get('level1', 'N/A')}")
            if metadata.get('level2'):
                print(f"   Subsection: {metadata.get('level2')}")
            print(f"\n   Content:\n   {doc[:300]}..." if len(doc) > 300 else f"\n   Content:\n   {doc}")
            print("\n" + "-" * 80)
    else:
        print("⚠️ No results found. The database might be empty.")
        
except Exception as e:
    print(f"❌ Error during query: {e}")

🔍 Query: 'What is criticality in EEG signals?'
📊 Retrieving top 3 results...

✅ Found 3 relevant chunks


📄 Result 1
   Similarity: 0.661 (distance: 0.513)
   Source: Unknown
   Section: 916 Microstate sequence complexity in wake and sleep

   Content:
   967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 might explain why electrophysiological and imaging data are unable to give a unique answer to the question how close the brain is to a critical state in a defined condition. Another difference between...

--------------------------------------------------------------------------------

📄 Result 2
   Similarity: 0.659 (distance: 0.518)
   Source: Unknown
   Section: Complexity measures for EEG microstate sequences - concepts and algorithms

   Content:
   
| Complexity measures for EEG microstate
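
The similarity values in the output come from the mapping `1 / (1 + distance)` used in the query cell. The short worked example below isolates that conversion in a helper, named `distance_to_similarity` here for illustration, and applies it to the two distances reported above.

```python
def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to (0, 1]; a distance of 0 gives similarity 1."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


for d in (0.0, 0.513, 0.518):
    print(f"distance {d:.3f} -> similarity {distance_to_similarity(d):.3f}")
```

The distances 0.513 and 0.518 map to 0.661 and 0.659, matching Results 1 and 2. The score is a monotone rescaling for display, not a probability, and its absolute value depends on the distance metric the collection uses.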

### 🎯 Interactive Query Tool

In [13]:
def search_knowledge_base(query: str, n_results: int = 5) -> None:
    """
    Interactive search function with formatted output.
    """
    print(f"\n{'=' * 80}")
    print(f"🔍 SEARCH QUERY: {query}")
    print(f"{'=' * 80}\n")
    
    try:
        results = storage.query(query_texts=[query], n_results=n_results)
        
        if not results['documents'] or not results['documents'][0]:
            print("❌ No results found.")
            return
        
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ), 1):
            similarity_score = 1 / (1 + distance)
            
            # Create a visual similarity bar
            bar_length = int(similarity_score * 20)
            bar = "█" * bar_length + "░" * (20 - bar_length)
            
            print(f"\n{'▼' * 40}")
            print(f"RESULT #{i}")
            print(f"Relevance: {bar} {similarity_score*100:.1f}%")
            print(f"\n📁 Source: {metadata.get('filename', 'Unknown')}")
            print(f"📖 Section: {metadata.get('level1', 'N/A')}")
            if metadata.get('level2'):
                print(f"📑 Subsection: {metadata.get('level2')}")
            
            print(f"\n💡 Content Preview:")
            print(f"{'─' * 80}")
            # Truncate long chunks to a 400-character preview
            preview = doc[:400] + "..." if len(doc) > 400 else doc
            print(preview)
            print(f"{'─' * 80}")
        
        print(f"\n{'=' * 80}\n")
        
    except Exception as e:
        print(f"❌ Error: {e}")

# Example queries to try
example_queries = [
    "What is criticality?",
    "How is consciousness measured during anesthesia?",
    "What are the main findings of the study?",
    "What methods were used in the research?"
]

print("📝 Example queries you can try:")
for i, q in enumerate(example_queries, 1):
    print(f"   {i}. {q}")

print("\n💡 Try running: search_knowledge_base('your question here')")

📝 Example queries you can try:
   1. What is criticality?
   2. How is consciousness measured during anesthesia?
   3. What are the main findings of the study?
   4. What methods were used in the research?

💡 Try running: search_knowledge_base('your question here')


In [14]:
# Try your first search!
search_knowledge_base("What is criticality?", n_results=3)


🔍 SEARCH QUERY: What is criticality?

❌ No results found.


---
# 7️⃣ Using MainProcessor (All-in-One)

The `MainProcessor` class provides a convenient wrapper around all components.

In [15]:
from src.MainProcessor import MainProcessor

# Initialize the main processor
processor = MainProcessor(collection_name=os.getenv('COLLECTION_NAME'))

print("🎯 MainProcessor initialized!\n")
print("📋 Configuration:")
print(f"   📚 Collection: {os.getenv('COLLECTION_NAME')}")
print(f"   📂 Zotero Library: {processor.zotero_library_path}")
print(f"   📝 Markdown Folder: {processor.markdown_folder_path}")
print(f"   💾 Vector Index: {processor.index_path}")
print(f"\n   📊 Collection size: {processor.storage.collection.count()} documents")

🎯 MainProcessor initialized!

📋 Configuration:
   📚 Collection: scico-test
   📂 Zotero Library: /home/soenke/Zotero
   📝 Markdown Folder: explore_langchain
   💾 Vector Index: explore_langchain/test.db

   📊 Collection size: 0 documents


### 🔄 Query Using MainProcessor

In [16]:
# Use the processor to query
query = "What are the key findings?"
results = processor.query_vector_storage([query], n_results=3)

print(f"🔍 Query: '{query}'\n")
print(f"✅ Retrieved {len(results['documents'][0])} results\n")

for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0]), 1):
    print(f"Result {i}: {doc[:150]}...")
    print(f"Source: {meta.get('filename')}\n")

🔍 Query: 'What are the key findings?'

✅ Retrieved 0 results



---
# 🎓 Summary & Next Steps

## What We've Learned

✅ **Configuration**: Set up environment variables and verified Ollama  
✅ **Zotero Integration**: Connected to your library and retrieved metadata  
✅ **PDF Processing**: Converted PDFs to structured markdown  
✅ **Chunking**: Split documents into semantic pieces  
✅ **Vector Storage**: Created embeddings with ChromaDB  
✅ **Retrieval**: Performed semantic search on your knowledge base  

## 🚀 Next Steps

1. **Process More Documents**: Run the pipeline on your entire collection
2. **Fine-tune Chunking**: Adjust `chunk_size` and `chunk_overlap` for better results
3. **Build a RAG App**: Add LLM-powered answer generation
4. **Create a Web Interface**: Use Streamlit or Gradio for a user-friendly UI
5. **Add Query Optimization**: Implement the `RAGQuestionOptimizer` module

## 📚 Helpful Functions

```python
# Search your knowledge base
search_knowledge_base("your question", n_results=5)

# Get metadata for any PDF
retriever.get_metadata_for_pdf(Path("path/to/file.pdf"))

# List all PDFs in a collection
retriever.get_pdfs_in_collection("Collection Name")

# Query using the processor
processor.query_vector_storage(["query"], n_results=5)
```

---

## 🤝 Contributing

This is an evolving project! Future enhancements include:
- Query optimization with LLMs
- Answer generation with citations
- Multi-document synthesis
- Advanced RAG techniques

Happy researching! 🔬📚

---
# 🧪 Experimental: Batch Processing

Process multiple PDFs from your collection in one go.

In [17]:
def batch_process_collection(max_pdfs: int = 5) -> None:
    """
    Process multiple PDFs from the collection.
    WARNING: This can take a long time!
    """
    print(f"🔄 Starting batch processing (max {max_pdfs} PDFs)...\n")
    
    pdfs = retriever.get_pdfs_in_collection(os.getenv('COLLECTION_NAME'))
    pdfs_to_process = [p for p in pdfs if p['pdf_path']][:max_pdfs]
    
    total = len(pdfs_to_process)
    successful = 0
    failed = 0
    
    for i, pdf in enumerate(pdfs_to_process, 1):
        print(f"\n{'=' * 80}")
        print(f"Processing {i}/{total}: {pdf['pdf_name']}")
        print(f"{'=' * 80}")
        
        try:
            # Convert to markdown; an earlier conversion is reused rather than counted as a failure
            print("📄 Converting to markdown...")
            try:
                convert_pdf_to_markdown(
                    pdf_path=pdf['pdf_path'],
                    output_path=os.getenv('MARKDOWN_FOLDER_PATH')
                )
            except FileExistsError:
                print("File is already processed")
            
            # Find markdown file
            pdf_stem = Path(pdf['pdf_path']).stem
            md_path = Path(os.getenv('MARKDOWN_FOLDER_PATH')) / pdf_stem / f"{pdf_stem}.md"
            
            if md_path.exists():
                # Chunk
                print("✂️ Chunking...")
                chunker = MarkdownChunker(md_path=str(md_path), chunk_size=150, chunk_overlap=50)
                chunks = chunker.chunk()
                
                # Add to vector DB
                print(f"📤 Adding {len(chunks)} chunks to vector DB...")
                storage.add_documents(chunks)
                
                successful += 1
                print(f"✅ Success!")
            else:
                print(f"⚠️ Markdown file not found")
                failed += 1
                
        except Exception as e:
            print(f"❌ Error: {e}")
            failed += 1
    
    print(f"\n{'=' * 80}")
    print(f"📊 Batch Processing Complete!")
    print(f"   ✅ Successful: {successful}")
    print(f"   ❌ Failed: {failed}")
    print(f"   📚 Total documents in DB: {storage.collection.count()}")
    print(f"{'=' * 80}")

# Uncomment to run (WARNING: This will take time!)
# batch_process_collection(max_pdfs=3)
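
Running the batch function twice may add the same chunks again. Whether `storage.add_documents` already guards against duplicates is not shown in this notebook, so the following is only a sketch of one possible safeguard: derive a stable ID from each chunk's source and text, and skip IDs seen before. `chunk_id` and `drop_known` are hypothetical helpers.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Stable ID: the same source and text always hash to the same value."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]


def drop_known(chunks: list[tuple[str, str]], known_ids: set[str]) -> list[tuple[str, str]]:
    """Keep only (source, text) pairs not seen before, recording their IDs in known_ids."""
    fresh = []
    for source, text in chunks:
        cid = chunk_id(source, text)
        if cid not in known_ids:
            known_ids.add(cid)
            fresh.append((source, text))
    return fresh


seen: set[str] = set()
first_run = drop_known([("a.md", "x"), ("a.md", "y"), ("a.md", "x")], seen)
second_run = drop_known([("a.md", "x")], seen)
print(len(first_run), len(second_run))
```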