# 🔬 SciCO - The Zotero Library RAG System

Welcome to **SciCO** (Scientific Co-worker)! This interactive tutorial will guide you through:

1. **Configuration** - Setting up your environment
2. **Zotero Integration** - Connecting to your library and retrieving metadata
3. **PDF Processing** - Converting PDFs to searchable markdown
4. **Text Chunking** - Breaking documents into semantic pieces
5. **Vector Storage** - Creating embeddings for semantic search
6. **Retrieval** - Querying your knowledge base

---

## 🎯 What is RAG?

**Retrieval-Augmented Generation (RAG)** combines:
- **Vector databases** for semantic search
- **Your documents** as the knowledge source
- **LLMs** for intelligent question answering

This allows you to ask questions about your scientific papers and get answers grounded in your actual sources!

---
## 📋 Prerequisites

Before starting, ensure you have:

✅ **Ollama** installed and running ([https://ollama.ai](https://ollama.ai))  
✅ **Zotero** with a populated library  
✅ A `.env` file with required variables (see below)  
✅ Python packages installed (`pip install -r requirements.txt`)


---
# 1️⃣ Configuration

The project requires a `.env` file in your project root with these variables:

### 📝 Required Environment Variables

In [1]:
# Example .env file structure:
# Copy this to your .env file and fill in the paths

example_env = """
# Name of the collection in your Zotero library
COLLECTION_NAME='Your Collection Name'

# Path where markdown files will be saved
MARKDOWN_FOLDER_PATH='/path/to/markdown/output'

# Path to your Zotero data folder (contains zotero.sqlite)
ZOTERO_LIBRARY_PATH='/path/to/Zotero/data'

# Path to the ChromaDB index file (should end in .db)
INDEX_PATH='/path/to/index/chroma.db'

# (Optional) For testing
TEST_PDF_PATH='/path/to/test/paper.pdf'
"""

print("📄 Example .env configuration:")
print(example_env)

📄 Example .env configuration:

# Name of the collection in your Zotero library
COLLECTION_NAME='Your Collection Name'

# Path where markdown files will be saved
MARKDOWN_FOLDER_PATH='/path/to/markdown/output'

# Path to your Zotero data folder (contains zotero.sqlite)
ZOTERO_LIBRARY_PATH='/path/to/Zotero/data'

# Path to the ChromaDB index file (should end in .db)
INDEX_PATH='/path/to/index/chroma.db'

# (Optional) For testing
TEST_PDF_PATH='/path/to/test/paper.pdf'



### 🔧 Import Dependencies and Setup

In [2]:
import sys
import os
from pathlib import Path
import requests
from dotenv import load_dotenv
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

# Add project source to path
project_src = Path.cwd()
if str(project_src) not in sys.path:
    sys.path.insert(0, str(project_src))

# Load environment variables
load_dotenv()

print("✅ Dependencies imported successfully!")

✅ Dependencies imported successfully!


### 🏥 Health Check: Verify Ollama is Running

In [3]:
def ensure_ollama_running(host: str = "127.0.0.1", port: int = 11434, timeout: float = 2.0) -> dict:
    """Check if Ollama is running and return status info."""
    base_url = f"http://{host}:{port}"
    try:
        resp = requests.get(f"{base_url}/api/version", timeout=timeout)
        if resp.status_code == 200:
            version_info = resp.json()
            return {
                'status': 'running',
                'url': base_url,
                'version': version_info.get('version', 'unknown')
            }
        else:
            return {
                'status': 'error',
                'message': f"Ollama responded with status {resp.status_code}"
            }
    except requests.exceptions.RequestException as e:
        return {
            'status': 'not_running',
            'message': f"Cannot reach Ollama at {base_url}",
            'error': str(e)
        }


# Run health check
ollama_status = ensure_ollama_running()

if ollama_status['status'] == 'running':
    display(HTML(
        f"<div style='padding:10px; background-color:#1a4d2e; border-left:4px solid #28a745; border-radius:4px; color:#ffffff;'>"
        f"<strong>✅ Ollama is running!</strong><br>"
        f"🔗 URL: {ollama_status['url']}<br>"
        f"📦 Version: {ollama_status['version']}"
        f"</div>"))
else:
    display(HTML(
        f"<div style='padding:10px; background-color:#5c1a1a; border-left:4px solid #dc3545; border-radius:4px; color:#ffffff;'>"
        f"<strong>❌ Ollama is NOT running!</strong><br>"
        f"⚠️ {ollama_status.get('message', 'Unknown error')}<br>"
        f"<em>Please start Ollama before continuing.</em>"
        f"</div>"))
    raise RuntimeError("Ollama must be running to use this notebook.")


### 📊 Verify Configuration

In [4]:
# Check if all required environment variables are set
required_vars = ['COLLECTION_NAME', 'MARKDOWN_FOLDER_PATH', 'ZOTERO_LIBRARY_PATH', 'INDEX_PATH']
config_status = {}

print("🔍 Configuration Status:\n")
for var in required_vars:
    value = os.getenv(var)
    config_status[var] = value
    status = "✅" if value else "❌"
    print(f"{status} {var}: {value if value else 'NOT SET'}")

all_set = all(config_status.values())
if all_set:
    print("\n✅ All required variables are configured!")
else:
    print("\n⚠️ Some variables are missing. Please update your .env file.")

🔍 Configuration Status:

✅ COLLECTION_NAME: scico-test
✅ MARKDOWN_FOLDER_PATH: example/markdown-library
✅ ZOTERO_LIBRARY_PATH: /home/soenke/Zotero
✅ INDEX_PATH: example/zotero-vector-storage.db

✅ All required variables are configured!


---
# 2️⃣ Zotero Integration

Let's connect to your Zotero database and explore what's inside!

In [5]:
from legacy.ZoteroIntegration import ZoteroMetadataRetriever

# Initialize the Zotero connection
zotero_path = Path(os.getenv('ZOTERO_LIBRARY_PATH'))
retriever = ZoteroMetadataRetriever(zotero_path)

print("🔌 Connecting to Zotero database...")
retriever.initialize()
print("✅ Connected successfully!")
print(f"📂 Database path: {retriever.config.sqlite_path}")

🔌 Connecting to Zotero database...
✅ Connected successfully!
📂 Database path: /home/soenke/Zotero/zotero.sqlite


### 📚 Explore Collections and PDFs

In [6]:
# Get PDFs from the configured collection
collection_name = os.getenv('COLLECTION_NAME')
print(f"📖 Retrieving PDFs from collection: '{collection_name}'\n")

pdfs = retriever.get_pdfs_in_collection(collection_name)

if pdfs:
    print(f"✅ Found {len(pdfs)} PDF(s) in this collection\n")
    print("📄 First 3 PDFs:")
    print("-" * 80)
    for i, pdf in enumerate(pdfs[:3], 1):
        print(f"\n{i}. {pdf['pdf_name']}")
        print(f"   Citation Key: {pdf['citationkey'] or 'None'}")
        print(f"   Item ID: {pdf['itemID']}")
        print(f"   Path: {pdf['pdf_path'] or 'Not found in storage'}")
else:
    print(f"⚠️ No PDFs found in collection '{collection_name}'")
    print("Tip: Make sure your collection name matches exactly (case-sensitive)")

📖 Retrieving PDFs from collection: 'scico-test'

✅ Found 1 PDF(s) in this collection

📄 First 3 PDFs:
--------------------------------------------------------------------------------

1. Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf
   Citation Key: wegnerComplexityMeasuresEEG2023
   Item ID: 289
   Path: /home/soenke/Zotero/storage/9N8E7TQU/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf


### 🔍 Deep Dive: Get Full Metadata for a PDF

In [7]:
# Let's examine the full metadata for the first PDF (if available)
def metadata_parser_for_chunks(metadata):
    return {
        'title': metadata.get('title', 'N/A'),
        'authors': metadata.get('authors', 'N/A'),
        'year': metadata.get('year', 'N/A'),
        'doi': metadata.get('doi', 'N/A'),
        'url': metadata.get('url', 'N/A'),
        'citation_key': metadata.get('citation_key', 'N/A'),
        'collections' : '; '.join([col['name'] for col in metadata.get('collections', [])] if metadata.get('collections') else 'N/A'),
        'content_type': 'pdf-zotero',
    }

if pdfs and pdfs[0]['pdf_path']:
    sample_pdf_path = Path(pdfs[0]['pdf_path'])
    print(f"🔬 Analyzing: {sample_pdf_path.name}\n")
    
    metadata = retriever.get_metadata_for_pdf(sample_pdf_path)
    
    if metadata:
        zotero_metadata = metadata_parser_for_chunks(metadata)
        print("📊 Full Metadata:")
        print("=" * 80)
        print(f"\n📖 Title: {metadata.get('title', 'N/A')}")
        print(f"\n✍️ Authors: {metadata.get('authors', 'N/A')}")
        print(f"\n📅 Year: {metadata.get('year', 'N/A')}")
        print(f"\n🔗 DOI: {metadata.get('doi', 'N/A')}")
        print(f"\n🌐 URL: {metadata.get('url', 'N/A')}")
        print(f"\n📰 Publication: {metadata.get('publication_title', 'N/A')}")
        
        if metadata.get('abstract'):
            abstract = metadata['abstract']
            print(f"\n📝 Abstract: {abstract[:200]}..." if len(abstract) > 200 else f"\n📝 Abstract: {abstract}")
        
        if metadata.get('tags'):
            print(f"\n🏷️ Tags: {', '.join(metadata['tags'])}")
        
        if metadata.get('collections'):
            print(f"\n📁 Collections:")
            for coll in metadata['collections']:
                print(f"   - {coll['name']}")
        
        print("\n" + "=" * 80)
    else:
        print("⚠️ Could not retrieve metadata")
else:
    print("⚠️ No valid PDF path found for analysis")
    print("Tip: Make sure PDFs exist in your Zotero storage folder")

🔬 Analyzing: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf

📊 Full Metadata:

📖 Title: Complexity measures for EEG microstate sequences - concepts and algorithms

✍️ Authors: Wegner, Frederic von; Wiemers, Milena; Hermann, Gesine; Tödt, Inken; Tagliazucchi, Enzo; Laufs, Helmut

📅 Year: 2023

🔗 DOI: 10.21203/rs.3.rs-2878411/v1

🌐 URL: https://www.researchsquare.com/article/rs-2878411/v1

📰 Publication: None

📝 Abstract: EEG microstate sequence analysis quantifies properties of ongoing brain electrical activity which is known to exhibit complex dynamics across many time scales. In this report we review recent developm...

📁 Collections:
   - PCI-From-Resting-State-Reconstruction
   - scico-test



---
# 3️⃣ PDF to Markdown Conversion

Now let's convert a PDF to structured Markdown using the `marker` library with Ollama.

In [8]:
from src.Tools.PdfToMarkdown import convert_pdf_to_markdown

# We'll use the first PDF from our collection (if available)
if pdfs and pdfs[0]['pdf_path']:
    pdf_path = pdfs[0]['pdf_path']
    output_folder = os.getenv('MARKDOWN_FOLDER_PATH')
    
    print(f"📄 Converting PDF: {Path(pdf_path).name}")
    print(f"📂 Output folder: {output_folder}")
    print("\n⏳ This may take a few minutes depending on PDF size...\n")
    
    try:
        # Create output directory if it doesn't exist
        os.makedirs(output_folder, exist_ok=True)

        # Find the generated markdown file
        pdf_name = Path(pdf_path).stem
        markdown_path = Path(output_folder) / pdf_name / f"{pdf_name}.md"

        # Convert PDF to markdown
        try:
            convert_pdf_to_markdown(pdf_path=pdf_path, output_path=output_folder)
        except FileExistsError as e:
            print('File is already processed')

        if markdown_path.exists():
            print(f"\n✅ Conversion successful!")
            print(f"📝 Markdown file: {markdown_path}")
            
            # Show a preview
            with open(markdown_path, 'r', encoding='utf-8') as f:
                content = f.read()
                preview_length = 500
                print(f"\n📖 Preview (first {preview_length} characters):")
                print("=" * 80)
                print(content[:preview_length])
                print("...")
                print("=" * 80)
                print(f"\n📊 Total characters: {len(content):,}")
        else:
            print("⚠️ Markdown file not found after conversion")
            
    except Exception as e:
        print(f"❌ Error during conversion: {e}")
        print("Tip: Make sure Ollama is running and the PDF is accessible")
else:
    print("⚠️ No PDF available for conversion")
    print("Skipping this step...")
    markdown_path = None

📄 Converting PDF: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.pdf
📂 Output folder: example/markdown-library

⏳ This may take a few minutes depending on PDF size...

File is already processed

✅ Conversion successful!
📝 Markdown file: example/markdown-library/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms/Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.md

📖 Preview (first 500 characters):
![](_page_0_Picture_0.jpeg)

# Complexity measures for EEG microstate sequences - concepts and algorithms

Frederic von Wegner ( [f.vonwegner@unsw.edu.au \)](mailto:f.vonwegner@unsw.edu.au) UNSW Sydney Milena Wiemers Klinikum Lüneburg Gesine Hermann Kiel University Inken Tödt Kiel University Enzo Tagliazucchi University of Buenos Aires Helmut Laufs Kiel University

Research Article

Keywords:

Posted Date: May 10th, 2023

DOI: <https://doi.org/10.21

---
# 4️⃣ Markdown Chunking

Large documents need to be split into smaller chunks for effective embedding and retrieval.

In [9]:
from src.Tools.TextSplitter import MarkdownChunker

# Use the markdown file we just created (or provide a path to an existing one)
if 'markdown_path' in locals() and markdown_path and markdown_path.exists():
    print(f"✂️ Chunking markdown file: {markdown_path.name}")

    # Initialize chunker
    chunker = MarkdownChunker(
        md_path=str(markdown_path),
        chunk_size=500,
        chunk_overlap=50
    )
    
    # Perform chunking
    chunks = chunker.chunk(method='markdown+semantic')

    # Add zotero metadata to each chunk
    chunks = chunker.add_additional_metadata(metadata=zotero_metadata, splits=chunks)
    
    print(f"✅ Created {len(chunks)} chunks\n")
    
    # Show statistics
    chunk_lengths = [c.metadata['length'] for c in chunks]
    print("📊 Chunk Statistics:")
    print(f"   Min length: {min(chunk_lengths)} chars")
    print(f"   Max length: {max(chunk_lengths)} chars")
    print(f"   Avg length: {sum(chunk_lengths) / len(chunk_lengths):.0f} chars")
    
    # Display first chunk as example
    print("\n📄 Example Chunk:")
    print("=" * 80)
    example_chunk = chunks[0]
    print(f"ID: {example_chunk.metadata['split_id']}")
    print(f"\nMetadata:")
    for key, value in example_chunk.metadata.items():
        print(f"   {key}: {value}")
    print(f"\nContent:\n{example_chunk.page_content}")
    print("=" * 80)
    
else:
    print("⚠️ No markdown file available for chunking")
    print("Skipping this step...")
    chunks = None

✂️ Chunking markdown file: Wegner et al. - 2023 - Complexity measures for EEG microstate sequences - concepts and algorithms.md
✅ Created 230 chunks

📊 Chunk Statistics:
   Min length: 1 chars
   Max length: 5562 chars
   Avg length: 409 chars

📄 Example Chunk:
ID: 0

Metadata:
   table: False
   split_id: 0
   length: 27
   title: Complexity measures for EEG microstate sequences - concepts and algorithms
   authors: Wegner, Frederic von; Wiemers, Milena; Hermann, Gesine; Tödt, Inken; Tagliazucchi, Enzo; Laufs, Helmut
   year: 2023
   doi: 10.21203/rs.3.rs-2878411/v1
   url: https://www.researchsquare.com/article/rs-2878411/v1
   citation_key: wegnerComplexityMeasuresEEG2023
   collections: PCI-From-Resting-State-Reconstruction; scico-test
   content_type: pdf-zotero

Content:
![](_page_0_Picture_0.jpeg)


---
# 5️⃣ Vector Storage with ChromaDB

Now we'll create embeddings and store them in a vector database for semantic search.

In [10]:
from src.Tools.VectorStorage import ChromaStorage

# Initialize ChromaDB storage
index_path = os.getenv('INDEX_PATH')
collection_name = os.getenv('COLLECTION_NAME')

print(f"🗄️ Initializing vector storage...")
print(f"📂 Index path: {index_path}")
print(f"📚 Collection: {collection_name}\n")

storage = ChromaStorage(index_path=index_path, collection_name=collection_name)

print(f"✅ ChromaDB initialized!")
print(f"📊 Current collection size: {storage.collection.count()} documents")

🗄️ Initializing vector storage...
📂 Index path: example/zotero-vector-storage.db
📚 Collection: scico-test

✅ ChromaDB initialized!
📊 Current collection size: 0 documents


### 📥 Add Chunks to Vector Database

In [11]:
# Add chunks to the vector database (if we have them)
if 'chunks' in locals() and chunks:
    print(f"📤 Adding {len(chunks)} chunks to vector database...")
    print("⏳ Creating embeddings (this may take a moment)...\n")
    
    try:
        storage.add_documents(chunks)
        print(f"✅ Successfully added chunks!")
        print(f"📊 Collection now contains: {storage.collection.count()} documents")
    except Exception as e:
        print(f"❌ Error adding chunks: {e}")
else:
    print("⚠️ No chunks available to add")
    print("You can still query existing documents if the database is not empty")

📤 Adding 230 chunks to vector database...
⏳ Creating embeddings (this may take a moment)...

✅ Successfully added chunks!
📊 Collection now contains: 230 documents


---
# 6️⃣ Semantic Search & Retrieval

Now comes the magic! Let's query our knowledge base.

### 🔍 Simple Query Example

In [12]:
# Define a query
query = "What is Lemepl Ziv complexity?"
n_results = 5

print(f"🔍 Query: '{query}'")
print(f"📊 Retrieving top {n_results} results...\n")

try:
    results = storage.query(query_texts=[query], n_results=n_results)
    
    if results['documents'] and results['documents'][0]:
        print(f"✅ Found {len(results['documents'][0])} relevant chunks\n")
        print("=" * 80)
        
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ), 1):
            # Calculate similarity score (inverse of distance)
            similarity = 1 / (1 + distance)
            
            print(f"\n📄 Result {i}")
            print(f"   Similarity: {similarity:.3f} (distance: {distance:.3f})")
            print(f"   Source: {metadata.get('citation_key', 'Unknown')}")
            print(f"   Section: {metadata.get('level1', 'N/A')}")
            if metadata.get('level2'):
                print(f"   Subsection: {metadata.get('level2')}")
            print(f"\n   Content:\n   {doc}")
            print("\n" + "-" * 80)
    else:
        print("⚠️ No results found. The database might be empty.")
        
except Exception as e:
    print(f"❌ Error during query: {e}")

🔍 Query: 'What is Lemepl Ziv complexity?'
📊 Retrieving top 5 results...

✅ Found 5 relevant chunks


📄 Result 1
   Similarity: 0.640 (distance: 0.563)
   Source: wegnerComplexityMeasuresEEG2023
   Section: $\begin{array}{c} 362\\ 363 \end{array} \quad \text{Lempel-Ziv complexity (LZC)} \end{array}$

   Content:
   Fig. 4 Potts model Lempel-Ziv complexity (LZC) for Q = 4 (A) and Q = 5 (B). The same shape is observed for both models.

--------------------------------------------------------------------------------

📄 Result 2
   Similarity: 0.624 (distance: 0.602)
   Source: wegnerComplexityMeasuresEEG2023
   Section: Complexity measures for EEG microstate sequences - concepts and algorithms

   Content:
   ., is reproduced by a very short instruction such as 'print A, n times'. As will be explained further below, there are more practical approaches to measure Kolmogorov complexity than trying to find the actual program, namely entropy rate and Lempel-Ziv complexity.

-------------------

### 🎯 Interactive Query Tool

In [13]:
def search_knowledge_base(query: str, n_results: int = 5) -> None:
    """
    Interactive search function with formatted output.
    """
    print(f"\n{'=' * 80}")
    print(f"🔍 SEARCH QUERY: {query}")
    print(f"{'=' * 80}\n")
    
    try:
        results = storage.query(query_texts=[query], n_results=n_results)
        
        if not results['documents'] or not results['documents'][0]:
            print("❌ No results found.")
            return
        
        for i, (doc, metadata, distance) in enumerate(zip(
            results['documents'][0],
            results['metadatas'][0],
            results['distances'][0]
        ), 1):
            similarity_score = 1 / (1 + distance)
            
            # Create a visual similarity bar
            bar_length = int(similarity_score * 20)
            bar = "█" * bar_length + "░" * (20 - bar_length)
            
            print(f"\n{'▼' * 40}")
            print(f"RESULT #{i}")
            print(f"Relevance: {bar} {similarity_score*100:.1f}%")
            print(f"\n📁 Source: {metadata.get('title', 'Unknown')}")
            print(f"      Key: {metadata.get('citation_key', 'Unknown')}")
            print(f"📖 Section: {metadata.get('level1', 'N/A')}")
            if metadata.get('level2'):
                print(f"📑 Subsection: {metadata.get('level2')}")
            
            print(f"\n💡 Content:")
            print(f"{'─' * 80}")
            # Highlight query terms (simple version)
            print(doc)
            print(f"{'─' * 80}")
        
        print(f"\n{'=' * 80}\n")
        
    except Exception as e:
        print(f"❌ Error: {e}")

# Example queries to try
example_queries = [
    "What is criticality?",
    "How is consciousness measured during anesthesia?",
    "What are the main findings of the study?",
    "What methods were used in the research?"
]

print("📝 Example queries you can try:")
for i, q in enumerate(example_queries, 1):
    print(f"   {i}. {q}")

print("\n💡 Try running: search_knowledge_base('your question here')")

📝 Example queries you can try:
   1. What is criticality?
   2. How is consciousness measured during anesthesia?
   3. What are the main findings of the study?
   4. What methods were used in the research?

💡 Try running: search_knowledge_base('your question here')


In [14]:
# Try your first search!
search_knowledge_base("What is criticality?", n_results=3)


🔍 SEARCH QUERY: What is criticality?


▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
RESULT #1
Relevance: ███████████░░░░░░░░░ 56.7%

📁 Source: Complexity measures for EEG microstate sequences - concepts and algorithms
      Key: wegnerComplexityMeasuresEEG2023
📖 Section: $\begin{array}{c} 362\\ 363 \end{array} \quad \text{Lempel-Ziv complexity (LZC)} \end{array}$

💡 Content:
────────────────────────────────────────────────────────────────────────────────
437 438 439 440 441 Excess entropy peaks at the critical point and decays to lower, but non-zero values, away from the critical temperature (T < T<sup>c</sup> and T > Tc). Although asymmetric around the critical temperature, the shape of the excess entropy curve reflects the concept of statistical complexity whereas entropy rate reflects the Kolmogorov complexity concept.
────────────────────────────────────────────────────────────────────────────────

▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼▼
RESULT #2
Relevance: ███████████░░░░░░░░░ 56.6

---
# 7️⃣ Using MainProcessor (All-in-One)

The `MainProcessor` class provides a convenient wrapper around all components.

In [15]:
from legacy.MainProcessor import MainProcessor

# Initialize the main processor
processor = MainProcessor(collection_name=os.getenv('COLLECTION_NAME'))

print("🎯 MainProcessor initialized!\n")
print("📋 Configuration:")
print(f"   📚 Collection: {os.getenv('COLLECTION_NAME')}")
print(f"   📂 Zotero Library: {processor.zotero_library_path}")
print(f"   📝 Markdown Folder: {processor.markdown_folder_path}")
print(f"   💾 Vector Index: {processor.index_path}")
print(f"\n   📊 Collection size: {processor.storage.collection.count()} documents")

🎯 MainProcessor initialized!

📋 Configuration:
   📚 Collection: scico-test
   📂 Zotero Library: /home/soenke/Zotero
   📝 Markdown Folder: example/markdown-library
   💾 Vector Index: example/zotero-vector-storage.db

   📊 Collection size: 230 documents


### 🔄 Query Using MainProcessor

In [16]:
# Use the processor to query
query = "What are the key findings?"
results = processor.query_vector_storage([query], n_results=3)

print(f"🔍 Query: '{query}'\n")
print(f"✅ Retrieved {len(results['documents'][0])} results\n")

for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0]), 1):
    print(f"Result {i}: {doc[:150]}...")
    print(f"Source: {meta.get('filename')}\n")

🔍 Query: 'What are the key findings?'

✅ Retrieved 3 results

Result 1: The main results of this report can be summarized as follows:...
Source: None

Result 2: Keywords:...
Source: None

Result 3: EEG microstate research (Van de Ville et al, 2010; Jia et al, 2021), and fMRI studies(Bullmore et al, 2009; Tagliazucchi et al, 2013)....
Source: None



---
# 🎓 Summary & Next Steps

## What We've Learned

✅ **Configuration**: Set up environment variables and verified Ollama  
✅ **Zotero Integration**: Connected to your library and retrieved metadata  
✅ **PDF Processing**: Converted PDFs to structured markdown  
✅ **Chunking**: Split documents into semantic pieces  
✅ **Vector Storage**: Created embeddings with ChromaDB  
✅ **Retrieval**: Performed semantic search on your knowledge base  

## 🚀 Next Steps

1. **Process More Documents**: Run the pipeline on your entire collection
2. **Fine-tune Chunking**: Adjust `chunk_size` and `overlap` for better results
3. **Build a RAG App**: Add LLM-powered answer generation
4. **Create a Web Interface**: Use Streamlit or Gradio for a user-friendly UI
5. **Add Query Optimization**: Implement the `RAGQuestionOptimizer` module

## 📚 Helpful Functions

```python
# Search your knowledge base
search_knowledge_base("your question", n_results=5)

# Get metadata for any PDF
retriever.get_metadata_for_pdf(Path("path/to/file.pdf"))

# List all PDFs in a collection
retriever.get_pdfs_in_collection("Collection Name")

# Query using the processor
processor.query_vector_storage(["query"], n_results=5)
```

---

## 🤝 Contributing

This is an evolving project! Future enhancements include:
- Query optimization with LLMs
- Answer generation with citations
- Multi-document synthesis
- Advanced RAG techniques

Happy researching! 🔬📚

---
# 🧪 Experimental: Batch Processing

Process multiple PDFs from your collection in one go.

In [17]:
def batch_process_collection(max_pdfs: int = 5) -> None:
    """
    Process multiple PDFs from the collection.
    WARNING: This can take a long time!
    """
    print(f"🔄 Starting batch processing (max {max_pdfs} PDFs)...\n")
    
    pdfs = retriever.get_pdfs_in_collection(os.getenv('COLLECTION_NAME'))
    pdfs_to_process = [p for p in pdfs if p['pdf_path']][:max_pdfs]
    
    total = len(pdfs_to_process)
    successful = 0
    failed = 0
    
    for i, pdf in enumerate(pdfs_to_process, 1):
        print(f"\n{'=' * 80}")
        print(f"Processing {i}/{total}: {pdf['pdf_name']}")
        print(f"{'=' * 80}")
        
        try:
            # Convert to markdown
            print("📄 Converting to markdown...")
            convert_pdf_to_markdown(
                pdf_path=pdf['pdf_path'],
                output_path=os.getenv('MARKDOWN_FOLDER_PATH')
            )
            
            # Find markdown file
            pdf_stem = Path(pdf['pdf_path']).stem
            md_path = Path(os.getenv('MARKDOWN_FOLDER_PATH')) / pdf_stem / f"{pdf_stem}.md"
            
            if md_path.exists():
                # Chunk
                print("✂️ Chunking...")
                chunker = MarkdownChunker(md_path=str(md_path), chunk_size=150, chunk_overlap=50)
                chunks = chunker.chunk()
                
                # Add to vector DB
                print(f"📤 Adding {len(chunks)} chunks to vector DB...")
                storage.add_documents(chunks)
                
                successful += 1
                print(f"✅ Success!")
            else:
                print(f"⚠️ Markdown file not found")
                failed += 1
                
        except Exception as e:
            print(f"❌ Error: {e}")
            failed += 1
    
    print(f"\n{'=' * 80}")
    print(f"📊 Batch Processing Complete!")
    print(f"   ✅ Successful: {successful}")
    print(f"   ❌ Failed: {failed}")
    print(f"   📚 Total documents in DB: {storage.collection.count()}")
    print(f"{'=' * 80}")

# Uncomment to run (WARNING: This will take time!)
# batch_process_collection(max_pdfs=3)