# 🚀 Hierarchical RAG Pipeline - Ruhr Thesis (Colab Pro+ A100 Optimized)

**Optimized for Google Colab Pro+ with A100 GPU**

This notebook provides **50-100x faster** processing compared to CPU!

---

## 📋 Setup Checklist

Before running:
1. ✅ **Enable A100 GPU**: Runtime → Change runtime type → A100 GPU
2. ✅ **Upload files to Google Drive**: Your PDFs and data files
3. ✅ **Run cells in order**: Don't skip cells!

---

## ⚡ Expected Performance

| Task | CPU (Local) | A100 GPU (Colab) |
|------|-------------|------------------|
| 2 PDFs (78 nodes) | ~5-10 min | **~30 seconds** |
| 10 PDFs (~400 nodes) | ~30 min | **~2 minutes** |
| All 85 PDFs | ~2-3 hours | **~15-20 minutes** |

---

## 🔧 Step 1: Mount Google Drive & Setup

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("✅ Google Drive mounted successfully!")

Mounted at /content/drive
✅ Google Drive mounted successfully!


In [2]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"🎉 GPU Available: {gpu_name}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    if "A100" in gpu_name:
        print("   ⚡ A100 GPU detected - OPTIMAL PERFORMANCE!")
else:
    print("⚠️  No GPU found. Go to Runtime → Change runtime type → Select A100 GPU")

🎉 GPU Available: NVIDIA A100-SXM4-80GB
   Memory: 85.17 GB
   ⚡ A100 GPU detected - OPTIMAL PERFORMANCE!


## 📦 Step 2: Install Dependencies

In [3]:
%%capture
# Install required packages (suppressing output for cleaner notebook)
!pip install -q llama-index
!pip install -q llama-index-embeddings-huggingface
!pip install -q llama-index-vector-stores-chroma
!pip install -q chromadb
!pip install -q pymupdf
!pip install -q pandas
!pip install -q openpyxl
!pip install -q sentence-transformers

print("✅ All packages installed!")

## 📁 Step 3: Setup Paths

**IMPORTANT:** Update these paths to match your Google Drive structure!

In [4]:
import os
from pathlib import Path

# ========================================
# 🔧 YOUR GOOGLE DRIVE PATH
# ========================================

# YOUR ACTUAL FOLDER NAME IN GOOGLE DRIVE
BASE_PATH = "/content/drive/MyDrive/PPE_Master_Thesis"

# PDF folders by phase
PDF_FOLDERS = {
    "phase1": f"{BASE_PATH}/Phase 1 - Theoretical Foundation",
    "phase2": f"{BASE_PATH}/Phase 2 - Sectoral & Business Transitions",
    "phase3": f"{BASE_PATH}/Phase 3 - Context & Case Studies",
    "phase4": f"{BASE_PATH}/Phase 4 - Methodology",
    "phase5": f"{BASE_PATH}/Phase 5 - Business Formation Literature"
}

# Quantitative data folder
QUANTITATIVE_DATA_PATH = f"{BASE_PATH}/Quantitative_Data"

DATA_SUBFOLDERS = {
    "landesdatenbank": f"{QUANTITATIVE_DATA_PATH}/processed_thesis_data_landesdatenbank",
    "inkar": f"{QUANTITATIVE_DATA_PATH}/inkar_datasets",
    "THESIS_DATA_FINAL": f"{QUANTITATIVE_DATA_PATH}/THESIS_DATA_FINAL",
    "comprehensive": f"{QUANTITATIVE_DATA_PATH}/comprehensive downloads"
}

# Output folder
OUTPUT_FOLDER = "/content/outputs"
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# Verify paths
print("=" * 80)
print("📂 VERIFYING PATHS")
print("=" * 80)

if os.path.exists(BASE_PATH):
    print(f"\n✅ Base path found: {BASE_PATH}\n")
else:
    print(f"\n❌ Base path NOT found: {BASE_PATH}")
    print(f"   Update BASE_PATH to match your Google Drive folder name!\n")

# Count PDFs
print("📄 PDF FILES:")
total_pdfs = 0
for phase, path in PDF_FOLDERS.items():
    if os.path.exists(path):
        pdf_count = sum(1 for root, dirs, files in os.walk(path)
                       for file in files if file.endswith('.pdf'))
        print(f"  ✅ {phase}: {pdf_count} PDFs")
        total_pdfs += pdf_count
    else:
        print(f"  ❌ {phase}: Not found")

print(f"  📊 Total PDFs: {total_pdfs}\n")

# Count data files
print("📊 QUANTITATIVE DATA:")
total_data = 0
for data_source, path in DATA_SUBFOLDERS.items():
    if os.path.exists(path):
        data_count = sum(1 for root, dirs, files in os.walk(path)
                        for file in files if file.endswith(('.csv', '.xlsx', '.xls')))
        print(f"  ✅ {data_source}: {data_count} data files")
        total_data += data_count
    else:
        print(f"  ❌ {data_source}: Not found")

print(f"  📊 Total data files: {total_data}\n")
print("=" * 80)


📂 VERIFYING PATHS

✅ Base path found: /content/drive/MyDrive/PPE_Master_Thesis

📄 PDF FILES:
  ✅ phase1: 22 PDFs
  ✅ phase2: 14 PDFs
  ✅ phase3: 25 PDFs
  ✅ phase4: 13 PDFs
  ✅ phase5: 11 PDFs
  📊 Total PDFs: 85

📊 QUANTITATIVE DATA:
  ✅ landesdatenbank: 9 data files
  ✅ inkar: 7 data files
  ✅ THESIS_DATA_FINAL: 45 data files
  ✅ comprehensive: 23 data files
  📊 Total data files: 84



## 🔨 Step 4: Initialize Pipeline Components

**GPU-Optimized Configuration:**
- Uses GPU for embedding generation
- Larger batch sizes for faster processing
- Optimized chunk sizes for A100 memory

In [5]:
import sys
from pathlib import Path
from typing import List
import gc

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb

print("🔧 Initializing GPU-optimized embedding model...\n")

# Detect device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize embedding model (using same model as MCP system)
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",  # Same as MCP system (768 dim instead of 384)
    device=device,
    embed_batch_size=64  # Large batch for GPU
)

# Set global settings - MATCHING MCP SYSTEM
Settings.embed_model = embed_model
Settings.chunk_size = 1000      # Smaller chunks like MCP system (was 1024)
Settings.chunk_overlap = 200    # Same overlap as MCP system

print(f"✅ Embedding model loaded")
print(f"   Device: {device.upper()}")
print(f"   Model: BAAI/bge-base-en-v1.5")
print(f"   Chunk size: 1000 chars (matching MCP system)")
print(f"   Chunk overlap: 200 chars")
print(f"   Batch size: 64")
print()

# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path=f"{OUTPUT_FOLDER}/chromadb")
chroma_collection = chroma_client.get_or_create_collection("ppe_thesis_rag")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

print("✅ Vector database initialized\n")

# GPU Memory stats
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"💾 GPU Memory: {allocated:.2f}GB / {total_mem:.0f}GB used")


🔧 Initializing GPU-optimized embedding model...



modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

✅ Embedding model loaded
   Device: CUDA
   Model: BAAI/bge-small-en-v1.5
   Batch size: 64

✅ Vector database initialized

💾 GPU Memory: 0.13GB / 85GB used


## 📄 Step 5: PDF Processing Functions

In [7]:
import fitz  # PyMuPDF
from datetime import datetime
from tqdm.auto import tqdm

def process_pdf(pdf_path: str) -> List[Document]:
    """Process a single PDF file"""
    documents = []

    try:
        doc = fitz.open(pdf_path)
        filename = Path(pdf_path).name

        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()

            if text.strip():
                documents.append(Document(
                    text=text,
                    metadata={
                        "source": filename,
                        "page": page_num + 1,
                        "total_pages": len(doc),
                        "source_type": "pdf"
                    }
                ))

        doc.close()

    except Exception as e:
        print(f"❌ Error processing {pdf_path}: {e}")

    return documents

def process_multiple_pdfs(folder_path: str, max_pdfs: int = None) -> List[Document]:
    """Process multiple PDFs from a folder"""
    pdf_files = list(Path(folder_path).rglob("*.pdf"))

    if max_pdfs:
        pdf_files = pdf_files[:max_pdfs]

    print(f"\n📄 Processing {len(pdf_files)} PDFs from {folder_path}")

    all_documents = []

    for pdf_file in tqdm(pdf_files, desc="Processing PDFs"):
        docs = process_pdf(str(pdf_file))
        all_documents.extend(docs)

    print(f"✅ Created {len(all_documents)} document chunks")
    return all_documents

print("✅ PDF processing functions loaded")

✅ PDF processing functions loaded


## 🚀 Step 6: Run Quick Test (2 PDFs)

**This will take ~30-60 seconds on A100 GPU**

In [8]:
import time

start_time = time.time()

print("="*80)
print("🚀 QUICK TEST: Processing 2 PDFs")
print("="*80)

# Process 2 PDFs from Phase 4
test_folder = PDF_FOLDERS["phase4"]
documents = process_multiple_pdfs(test_folder, max_pdfs=2)

print(f"\n🔨 Creating hierarchical chunks (matching MCP system)...")

# Create hierarchical chunks - MATCHING MCP SYSTEM
# MCP uses: Parent 2048, Child 1024, with BOTH indexed
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 1024, 512]  # 3 levels for more granular retrieval
)

nodes = node_parser.get_nodes_from_documents(documents, show_progress=True)

# INDEX ALL NODES (not just leaf nodes) - this matches MCP system behavior
# MCP indexes both parent and child chunks for better retrieval
all_nodes = nodes  # Use ALL nodes, not just get_leaf_nodes(nodes)

print(f"\n✅ Created {len(all_nodes)} total chunks (hierarchical)")
print(f"   (MCP system approach: indexing parent + child chunks)")

print(f"\n🔨 Building vector index (GPU-accelerated)...")

# Build index with ALL nodes
index = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    show_progress=True
)

elapsed = time.time() - start_time

print(f"\n" + "="*80)
print(f"✅ QUICK TEST COMPLETED!")
print(f"⏱️  Total time: {elapsed:.1f} seconds")
print(f"📊 Processed: {len(documents)} documents → {len(all_nodes)} chunks")
print("="*80)

# Memory stats
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / 1e9
    print(f"💾 GPU Memory used: {allocated:.2f}GB")


🚀 QUICK TEST: Processing 2 PDFs

📄 Processing 2 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 4 - Methodology


Processing PDFs:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Created 80 document chunks

🔨 Creating hierarchical chunks...


Parsing documents into nodes:   0%|          | 0/80 [00:00<?, ?it/s]


✅ Created 87 leaf nodes from 80 documents

🔨 Building vector index (GPU-accelerated)...


Generating embeddings:   0%|          | 0/87 [00:00<?, ?it/s]


✅ QUICK TEST COMPLETED!
⏱️  Total time: 11.9 seconds
📊 Processed: 80 documents → 87 chunks
💾 GPU Memory used: 0.14GB


## 🔍 Step 7: Test Queries

Now let's test the RAG system!

In [None]:
# Install Groq integration for LlamaIndex (FREE alternative to OpenAI)
!pip install -q llama-index-llms-groq

print("✅ Groq integration installed")

In [12]:
# (Optional) Install Google Gemini as backup
# !pip install -q llama-index-llms-gemini

print("✅ Dependencies ready")

✅ OpenAI integration installed


In [13]:
import os
from google.colab import userdata
from llama_index.llms.groq import Groq
from llama_index.core import Settings

# Get API key from Colab secrets
# Add your Groq API key to Colab secrets as 'GROQ_API_KEY'
# Get free key at: https://console.groq.com/keys
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')

# Set the LLM - Using Llama 3.3 70B (FREE via Groq)
Settings.llm = Groq(
    model="llama-3.3-70b-versatile",  # Options: "llama-3.3-70b-versatile", "llama-3.1-8b-instant", "mixtral-8x7b-32768"
    temperature=0.1,
    api_key=os.environ["GROQ_API_KEY"]
)

print("✅ Groq Llama 3.3 70B configured (FREE tier)")
print("   Model: llama-3.3-70b-versatile")
print("   Rate limit: 30 requests/minute on free tier")

✅ OpenAI GPT-4o configured


In [14]:
# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

print("✅ Query engine ready!\n")

# Test query
test_query = "What is spatial econometrics?"
print(f"🔍 Query: {test_query}\n")

response = query_engine.query(test_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

# Show sources
print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')})")

✅ Query engine ready!

🔍 Query: What is spatial econometrics?

📝 Response:
Spatial econometrics is a methodological approach used in the social sciences to analyze how processes such as diffusion, learning, contagion, externalities, or interdependence contribute to phenomena of interest. It involves techniques that account for spatial dependence among units, which may be based on geographic distance, communication flows, or travel time. These techniques help in understanding the spatial structure of interdependence and are often used to model spatial relationships in data.

📚 Sources:
  1. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 4)
  2. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 34)
  3. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 17)
  4. Spatial Panel Data Models in R.pdf (Page 6)
  5. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 13)


In [17]:
# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

print("✅ Query engine ready!\n")

# Test query
test_query = "Create a research questionn based on the topics and conclusions of the two pdf's "
print(f"🔍 Query: {test_query}\n")

response = query_engine.query(test_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

# Show sources
print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')})")

✅ Query engine ready!

🔍 Query: Create a research questionn based on the topics and conclusions of the two pdf's 

📝 Response:
How does the consideration of spatial dependence in mixed-methods research enhance the understanding of causal mechanisms in political science, and what are the implications for refining spatial panel data models?

📚 Sources:
  1. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 42)
  2. Spatial Panel Data Models in R.pdf (Page 22)
  3. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 12)
  4. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 28)
  5. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 36)


## 🎯 Step 8: Interactive Query Cell

**Run this cell multiple times with different queries!**

In [16]:
# Enter your query here
user_query = "What are institutional complementarities in varieties of capitalism?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: What are institutional complementarities in varieties of capitalism?

📝 Response:
The context provided does not contain information about institutional complementarities in varieties of capitalism. Therefore, an answer cannot be generated based on the given context.

📚 Sources:
  1. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 6) [Score: 0.518]
  2. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 4) [Score: 0.513]
  3. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 7) [Score: 0.511]
  4. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 37) [Score: 0.494]
  5. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 9) [Score: 0.487]


## 🎉 Step 9: Process ALL PDFs (Full Pipeline)

**This will process all 85 PDFs - takes ~15-20 minutes on A100**

⚠️ **Only run this when ready for full processing!**

In [18]:
# Full pipeline - processes ALL PDFs
# This matches the MCP system's chunking strategy

start_time = time.time()

print("="*80)
print("🚀 FULL PIPELINE: Processing ALL PDFs")
print("="*80)

# Process all PDFs
all_documents = []
for phase, folder in PDF_FOLDERS.items():
    print(f"\n📂 Processing {phase}...")
    docs = process_multiple_pdfs(folder, max_pdfs=None)
    all_documents.extend(docs)
    gc.collect()  # Free memory

print(f"\n✅ Total documents: {len(all_documents)}")

# Create hierarchical chunks - MATCHING MCP SYSTEM
print(f"\n🔨 Creating hierarchical chunks (MCP-compatible)...")
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 1024, 512]  # 3 levels like MCP system
)

nodes = node_parser.get_nodes_from_documents(all_documents, show_progress=True)

# INDEX ALL NODES (parent + child) - matches MCP system
# This is the key difference: MCP indexes both hierarchical levels
all_nodes = nodes  # NOT get_leaf_nodes(nodes)

print(f"\n✅ Created {len(all_nodes)} total chunks")
print(f"   (Hierarchical: parent 2048 + child 1024 + leaf 512)")

# Build index
print(f"\n🔨 Building vector index (GPU-accelerated)...")
index = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    show_progress=True
)

elapsed = time.time() - start_time
print(f"\n" + "="*80)
print(f"✅ FULL PIPELINE COMPLETED!")
print(f"⏱️  Total time: {elapsed/60:.1f} minutes")
print(f"📊 Total: {len(all_documents)} documents → {len(all_nodes)} chunks")
print(f"📈 Expected: ~13,000+ chunks (matching MCP system)")
print("="*80)


🚀 FULL PIPELINE: Processing ALL PDFs

📂 Processing phase1...

📄 Processing 22 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 1 - Theoretical Foundation


Processing PDFs:   0%|          | 0/22 [00:00<?, ?it/s]

✅ Created 1170 document chunks

📂 Processing phase2...

📄 Processing 14 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 2 - Sectoral & Business Transitions


Processing PDFs:   0%|          | 0/14 [00:00<?, ?it/s]

✅ Created 314 document chunks

📂 Processing phase3...

📄 Processing 25 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 3 - Context & Case Studies


Processing PDFs:   0%|          | 0/25 [00:00<?, ?it/s]

✅ Created 625 document chunks

📂 Processing phase4...

📄 Processing 13 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 4 - Methodology


Processing PDFs:   0%|          | 0/13 [00:00<?, ?it/s]

✅ Created 381 document chunks

📂 Processing phase5...

📄 Processing 11 PDFs from /content/drive/MyDrive/PPE_Master_Thesis/Phase 5 - Business Formation Literature


Processing PDFs:   0%|          | 0/11 [00:00<?, ?it/s]

✅ Created 459 document chunks

✅ Total documents: 2949

🔨 Creating hierarchical chunks...


Parsing documents into nodes:   0%|          | 0/2949 [00:00<?, ?it/s]


✅ Created 3502 leaf nodes

🔨 Building vector index (GPU-accelerated)...


Generating embeddings:   0%|          | 0/2048 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/1454 [00:00<?, ?it/s]


✅ FULL PIPELINE COMPLETED!
⏱️  Total time: 9.1 minutes
📊 Total: 2949 documents → 3502 chunks


## 💾 Step 10: Save Index to Google Drive

**Save your work so you don't have to rebuild the index!**

In [None]:
# Save to Google Drive
save_path = f"{BASE_PATH}/Hierarchical_RAG_Pipeline/colab_index"
os.makedirs(save_path, exist_ok=True)

# Copy ChromaDB to Drive
import shutil
shutil.copytree(f"{OUTPUT_FOLDER}/chromadb", f"{save_path}/chromadb", dirs_exist_ok=True)

print(f"✅ Index saved to: {save_path}")
print("   You can reload this index in future sessions!")

✅ Index saved to: /content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/colab_index
   You can reload this index in future sessions!


## 📋 Citation System - Summary & Usage Guide

### ✅ What You Now Have

**Three Export Files** (saved to Google Drive):
1. **`citation_database.csv`** - Spreadsheet with all 85 citations, sortable by author/year/phase
2. **`citation_library.md`** - Human-readable reference list organized by research phase
3. **`citations.json`** - Machine-readable database with full page content index

**Interactive Tools**:
- `verify_citation(author, year, page, quote)` - Verify if a citation exists
- Search functionality to find PDFs by keyword
- Page-level content index for all 2,900+ pages

---

### 🔧 How to Use This System

#### **When Writing Your Thesis:**

1. **Before citing**, verify the citation exists:
   ```python
   result = verify_citation("Hall", "2001", 355, "your quote here")
   print(result['status'])  # Should be "✅ VERIFIED"
   ```

2. **Get the proper APA citation**:
   - Open `citation_library.md` or `citation_database.csv`
   - Find the author/year
   - Copy the exact APA citation

3. **Never fabricate page numbers**:
   - Use the verification tool to find the exact page
   - If quote not found, the tool will show you the actual page content

#### **Fixing Chapter 2 Citations:**

The Citation Verification Report identified these issues:
- ❌ **RWI, 2018, p. 54** - Not in your PDFs, remove it
- ❌ **Hayter et al., 2003** - Misattributed, should be Martin & Sunley
- ❌ **Crouch et al., 2009, p. 654** - Page doesn't exist (paper is only ~25 pages)

**Action:** Use the verification tool to find correct citations!

---

### 📊 Statistics

- **Total PDFs indexed:** 85
- **Total pages:** ~2,900
- **Citations generated:** 85 APA references
- **Processing time:** ~10-15 minutes (one-time)
- **Storage location:** `/content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/citations/`

---

### 💡 Tips

- **Always verify** before citing - don't trust memory or previous drafts
- **Use exact page numbers** from the verification tool
- **Check quote accuracy** - the tool will highlight mismatches
- **Save your work** - All files are in Google Drive for future sessions

---

**🎉 Your citations are now verifiable and properly formatted!**

In [19]:
  # 🔄 RELOAD EXISTING INDEX
import chromadb
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

print("🔄 Loading existing index from Google Drive...")

  # Point to the chromadb FOLDER (not individual UUID folders)
SAVED_INDEX_PATH = f"{BASE_PATH}/Hierarchical_RAG_Pipeline/colab_index/chromadb"

  # Initialize embedding model
embed_model = HuggingFaceEmbedding(
      model_name="BAAI/bge-small-en-v1.5",
      cache_folder="./models_cache"
  )

  # Connect to ChromaDB
chroma_client = chromadb.PersistentClient(path=SAVED_INDEX_PATH)

  # List all available collections (to see what you have)
print("\n📋 Available collections:")
collections = chroma_client.list_collections()
for coll in collections:
      print(f"   - {coll.name} (ID: {coll.id}, Count: {coll.count()})")

  # Get your specific collection by name
collection_name = "ppe_thesis_rag"  # This is the name from your notebook
chroma_collection = chroma_client.get_collection(collection_name)

print(f"\n✅ Loaded collection: {collection_name}")
print(f"   Documents: {chroma_collection.count()}")

  # Create vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

  # Load the index
index = VectorStoreIndex.from_vector_store(
      vector_store=vector_store,
      embed_model=embed_model
  )

print(f"✅ Index loaded and ready for queries!")


🔄 Loading existing index from Google Drive...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]


📋 Available collections:
   - ppe_thesis_rag (ID: d184f66d-67e2-4c2a-b13d-759f9fe6ee89, Count: 3676)

✅ Loaded collection: ppe_thesis_rag
   Documents: 3676
✅ Index loaded and ready for queries!


In [20]:
"""
Step 1: PDF Metadata Extraction
Extract citation information from all 85 PDFs
"""

import fitz  # PyMuPDF
from pathlib import Path
import re
from typing import Dict, Optional
import json

def extract_pdf_metadata(pdf_path: str) -> Dict:
    """Extract metadata from a single PDF"""
    try:
        doc = fitz.open(pdf_path)
        metadata = doc.metadata
        filename = Path(pdf_path).stem

        # Extract metadata fields
        citation_data = {
            "filename": Path(pdf_path).name,
            "filepath": pdf_path,
            "title": metadata.get("title", ""),
            "author": metadata.get("author", ""),
            "subject": metadata.get("subject", ""),
            "keywords": metadata.get("keywords", ""),
            "creator": metadata.get("creator", ""),
            "producer": metadata.get("producer", ""),
            "page_count": len(doc),
            "raw_filename": filename
        }

        doc.close()
        return citation_data

    except Exception as e:
        print(f"❌ Error extracting metadata from {pdf_path}: {e}")
        return None

# Extract metadata from all PDFs
print("="*80)
print("📚 EXTRACTING PDF METADATA FROM 85 PDFs")
print("="*80)

all_pdf_metadata = []

for phase, folder in PDF_FOLDERS.items():
    if os.path.exists(folder):
        pdf_files = list(Path(folder).rglob("*.pdf"))
        print(f"\n📂 {phase}: Processing {len(pdf_files)} PDFs...")

        for pdf_file in tqdm(pdf_files, desc=f"Extracting {phase}"):
            metadata = extract_pdf_metadata(str(pdf_file))
            if metadata:
                metadata["phase"] = phase
                all_pdf_metadata.append(metadata)

print(f"\n✅ Extracted metadata from {len(all_pdf_metadata)} PDFs")
print(f"📊 Sample metadata fields: {list(all_pdf_metadata[0].keys())}")

📚 EXTRACTING PDF METADATA FROM 85 PDFs

📂 phase1: Processing 22 PDFs...


Extracting phase1:   0%|          | 0/22 [00:00<?, ?it/s]


📂 phase2: Processing 14 PDFs...


Extracting phase2:   0%|          | 0/14 [00:00<?, ?it/s]


📂 phase3: Processing 25 PDFs...


Extracting phase3:   0%|          | 0/25 [00:00<?, ?it/s]


📂 phase4: Processing 13 PDFs...


Extracting phase4:   0%|          | 0/13 [00:00<?, ?it/s]


📂 phase5: Processing 11 PDFs...


Extracting phase5:   0%|          | 0/11 [00:00<?, ?it/s]


✅ Extracted metadata from 85 PDFs
📊 Sample metadata fields: ['filename', 'filepath', 'title', 'author', 'subject', 'keywords', 'creator', 'producer', 'page_count', 'raw_filename', 'phase']


In [21]:
"""
Step 2: Generate APA Citations Using Groq (FREE Alternative to Claude)
Parse metadata and filenames to create proper citations
"""

from llama_index.llms.groq import Groq
import time
import os

def generate_apa_citation_with_groq(metadata: Dict) -> Dict:
    """Use Groq (Llama 3.3 70B) to generate APA citation from metadata"""

    # Create prompt for Llama
    prompt = f"""Given this PDF metadata, generate a proper APA 7th edition citation.

PDF Metadata:
- Filename: {metadata['filename']}
- Title (from metadata): {metadata['title'] or 'Not available'}
- Author (from metadata): {metadata['author'] or 'Not available'}
- Pages: {metadata['page_count']}

Instructions:
1. If metadata has author/title/year, use it
2. If metadata is missing, parse the filename to extract:
   - Author name(s) (usually at start or after year)
   - Year (usually 4 digits like 2001, 2023, etc.)
   - Title (remaining text, convert underscores to spaces)
3. Determine source type (journal article, book, report, working paper)
4. Generate proper APA citation

Return ONLY a valid JSON object with these fields:
{{
  "authors": "Last, F. M., & Last2, F. M.",
  "year": "2001",
  "title": "Full title of the work",
  "source_type": "journal" or "book" or "report" or "working_paper",
  "journal": "Journal Name (if applicable)",
  "apa_citation": "Full APA formatted citation"
}}

Example filenames:
- "Varieties_of_Capitalism_hall_soskice.pdf" → Hall, P. A., & Soskice, D. (year from metadata)
- "2012_Thelen_Varieties_Liberalization.pdf" → Thelen, K. (2012)
- "foster-thelen-2024-coordination-rights.pdf" → Foster, A., & Thelen, K. (2024)

Be precise and follow APA 7th edition exactly. Return ONLY the JSON, no other text."""

    try:
        llm = Groq(
            model="llama-3.3-70b-versatile",
            temperature=0.1,
            api_key=os.environ.get("GROQ_API_KEY")
        )
        response = llm.complete(prompt)

        # Parse JSON response
        json_str = response.text.strip()
        # Extract JSON if wrapped in markdown code blocks
        if "```json" in json_str:
            json_str = json_str.split("```json")[1].split("```")[0].strip()
        elif "```" in json_str:
            json_str = json_str.split("```")[1].split("```")[0].strip()

        citation_info = json.loads(json_str)
        return citation_info

    except Exception as e:
        print(f"⚠️  Error generating citation for {metadata['filename']}: {e}")
        return {
            "authors": "Unknown",
            "year": "n.d.",
            "title": metadata['filename'],
            "source_type": "unknown",
            "journal": "",
            "apa_citation": f"Unknown. (n.d.). {metadata['filename']}."
        }

# Generate citations for all PDFs
print("="*80)
print("🤖 GENERATING APA CITATIONS WITH GROQ (Llama 3.3 70B)")
print("="*80)
print("⏱️  This may take 15-20 minutes for 85 PDFs (rate limited to 30 req/min)...\n")

citation_database = []
batch_size = 5  # Process in small batches to avoid rate limits

for i in tqdm(range(0, len(all_pdf_metadata), batch_size), desc="Generating citations"):
    batch = all_pdf_metadata[i:i+batch_size]

    for metadata in batch:
        citation_info = generate_apa_citation_with_groq(metadata)

        # Combine metadata with citation info
        complete_entry = {
            **metadata,
            **citation_info
        }
        citation_database.append(complete_entry)

    # Delay to respect Groq free tier rate limits (30 req/min)
    time.sleep(2)

print(f"\n✅ Generated {len(citation_database)} citations")
print("\n📋 Sample citation:")
print(f"   {citation_database[0]['apa_citation']}")


🤖 GENERATING APA CITATIONS WITH CLAUDE
⏱️  This may take 10-15 minutes for 85 PDFs...



Generating citations:   0%|          | 0/17 [00:00<?, ?it/s]

⚠️  Error generating citation for Regional health differences – developing a socioeconomic.pdf: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'Your credit balance is too low to access the Anthropic API. Please go to Plans & Billing to upgrade or purchase credits.'}, 'request_id': 'req_011CWpY2cEX69rWFVvJtkEE6'}
⚠️  Error generating citation for Just Transition for Regions and Generations Experiences from Structural Change in the Ruhr Area.pdf: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'Your credit balance is too low to access the Anthropic API. Please go to Plans & Billing to upgrade or purchase credits.'}, 'request_id': 'req_011CWpY2edsyF5rPQuDsfTza'}
⚠️  Error generating citation for Lessons from Germany s hard coal mining phase-out  policies and transition from 1950 to 2018.pdf: Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'Your credit balance is t

In [None]:
"""
Step 3: Build Page-Level Content Index
Create searchable index mapping citations to page content
"""

def build_page_content_index(citation_database: list) -> Dict:
    """Build index mapping filename → page → content"""

    page_index = {}

    print("="*80)
    print("📖 BUILDING PAGE-LEVEL CONTENT INDEX")
    print("="*80)

    for entry in tqdm(citation_database, desc="Indexing pages"):
        filename = entry['filename']
        filepath = entry['filepath']

        try:
            doc = fitz.open(filepath)
            page_index[filename] = {
                "citation": entry['apa_citation'],
                "authors": entry['authors'],
                "year": entry['year'],
                "pages": {}
            }

            # Extract text from each page
            for page_num in range(len(doc)):
                page = doc[page_num]
                text = page.get_text()

                if text.strip():
                    page_index[filename]["pages"][page_num + 1] = {
                        "text": text,
                        "word_count": len(text.split())
                    }

            doc.close()

        except Exception as e:
            print(f"⚠️  Error indexing {filename}: {e}")

    print(f"\n✅ Indexed {len(page_index)} PDFs with full page content")

    return page_index

# Build the index
page_content_index = build_page_content_index(citation_database)

# Calculate total pages indexed
total_pages = sum(len(pdf_data["pages"]) for pdf_data in page_content_index.values())
print(f"📊 Total pages indexed: {total_pages:,}")

📖 BUILDING PAGE-LEVEL CONTENT INDEX


Indexing pages:   0%|          | 0/85 [00:00<?, ?it/s]


✅ Indexed 82 PDFs with full page content
📊 Total pages indexed: 2,894


In [None]:
"""
Step 4: Citation Verification Tool
Verify citations against actual PDF content
"""

import difflib

def find_pdf_by_author_year(author_keyword: str, year: str, citation_database: list) -> Optional[Dict]:
    """Find PDF by author name and year"""
    for entry in citation_database:
        if (year in entry['year'] and
            author_keyword.lower() in entry['authors'].lower()):
            return entry
    return None

def verify_citation(author: str, year: str, page: int, quote: str = None) -> Dict:
    """
    Verify if a citation exists in the PDF collection

    Args:
        author: Author last name (e.g., "Hall", "Soskice")
        year: Publication year (e.g., "2001")
        page: Page number cited
        quote: Optional quote to verify (partial match OK)

    Returns:
        Dict with verification status and details
    """

    # Find the PDF
    pdf_entry = find_pdf_by_author_year(author, year, citation_database)

    if not pdf_entry:
        return {
            "status": "❌ NOT FOUND",
            "message": f"No PDF found matching author '{author}' and year '{year}'",
            "citation": None,
            "page_content": None
        }

    filename = pdf_entry['filename']

    # Check if PDF is in page index
    if filename not in page_content_index:
        return {
            "status": "❌ NOT INDEXED",
            "message": f"PDF found but not indexed: {filename}",
            "citation": pdf_entry['apa_citation'],
            "page_content": None
        }

    pdf_data = page_content_index[filename]

    # Check if page exists
    if page not in pdf_data['pages']:
        max_page = max(pdf_data['pages'].keys())
        return {
            "status": "⚠️ PAGE OUT OF RANGE",
            "message": f"Page {page} not found. PDF has {max_page} pages.",
            "citation": pdf_data['citation'],
            "available_pages": f"1-{max_page}",
            "page_content": None
        }

    page_text = pdf_data['pages'][page]['text']

    # If quote provided, verify it exists
    if quote:
        # Normalize quote and page text for comparison
        quote_normalized = ' '.join(quote.lower().split())
        text_normalized = ' '.join(page_text.lower().split())

        if quote_normalized in text_normalized:
            return {
                "status": "✅ VERIFIED",
                "message": f"Quote found on page {page}",
                "citation": pdf_data['citation'],
                "page": page,
                "page_content": page_text[:500] + "..." if len(page_text) > 500 else page_text,
                "quote_match": True
            }
        else:
            # Try fuzzy matching
            similarity = difflib.SequenceMatcher(None, quote_normalized, text_normalized).ratio()

            if similarity > 0.6:
                return {
                    "status": "⚠️ PARTIAL MATCH",
                    "message": f"Quote not found exactly, but page content is {similarity:.1%} similar",
                    "citation": pdf_data['citation'],
                    "page": page,
                    "page_content": page_text[:500] + "..." if len(page_text) > 500 else page_text,
                    "quote_match": False,
                    "similarity": f"{similarity:.1%}"
                }
            else:
                return {
                    "status": "❌ QUOTE NOT FOUND",
                    "message": f"Quote not found on page {page} (similarity: {similarity:.1%})",
                    "citation": pdf_data['citation'],
                    "page": page,
                    "page_content": page_text[:500] + "..." if len(page_text) > 500 else page_text,
                    "quote_match": False
                }
    else:
        # No quote to verify, just confirm page exists
        return {
            "status": "✅ PAGE EXISTS",
            "message": f"Page {page} exists in PDF",
            "citation": pdf_data['citation'],
            "page": page,
            "page_content": page_text[:500] + "..." if len(page_text) > 500 else page_text
        }

print("✅ Citation verification tool ready!")
print("\n📝 Example usage:")
print('   result = verify_citation("Hall", "2001", 355, "firms are often embedded")')
print('   print(result["status"])')
print('   print(result["citation"])')

✅ Citation verification tool ready!

📝 Example usage:
   result = verify_citation("Hall", "2001", 355, "firms are often embedded")
   print(result["status"])
   print(result["citation"])


In [None]:
"""
Step 5: Export Citation Database (CSV, Markdown, JSON)
"""

import pandas as pd
import json
from datetime import datetime

# Create output directory
output_dir = f"{BASE_PATH}/Hierarchical_RAG_Pipeline/citations"
os.makedirs(output_dir, exist_ok=True)

print("="*80)
print("💾 EXPORTING CITATION DATABASE")
print("="*80)

# ==========================================
# 1. CSV Export
# ==========================================
print("\n📊 Creating CSV export...")

csv_data = []
for entry in citation_database:
    csv_data.append({
        "Author": entry['authors'],
        "Year": entry['year'],
        "Title": entry['title'],
        "Source_Type": entry['source_type'],
        "Journal": entry.get('journal', ''),
        "Pages": entry['page_count'],
        "Phase": entry['phase'],
        "Filename": entry['filename'],
        "APA_Citation": entry['apa_citation']
    })

df = pd.DataFrame(csv_data)
csv_path = f"{output_dir}/citation_database.csv"
df.to_csv(csv_path, index=False, encoding='utf-8')
print(f"   ✅ Saved: {csv_path}")

# ==========================================
# 2. Markdown Export
# ==========================================
print("\n📝 Creating Markdown export...")

md_content = f"""# Citation Library - {len(citation_database)} Sources

**Generated:** {datetime.now().strftime("%Y-%m-%d %H:%M")}

This document contains all properly formatted APA citations for the 85 PDFs in your thesis research collection.

---

"""

# Group by phase
phases = {
    "phase1": "Phase 1: Theoretical Foundation",
    "phase2": "Phase 2: Sectoral & Business Transitions",
    "phase3": "Phase 3: Context & Case Studies",
    "phase4": "Phase 4: Methodology",
    "phase5": "Phase 5: Business Formation Literature"
}

for phase_key, phase_title in phases.items():
    phase_entries = [e for e in citation_database if e['phase'] == phase_key]

    if phase_entries:
        md_content += f"\n## {phase_title} ({len(phase_entries)} sources)\n\n"

        # Sort by author
        phase_entries.sort(key=lambda x: x['authors'])

        for i, entry in enumerate(phase_entries, 1):
            md_content += f"{i}. **{entry['apa_citation']}**\n"
            md_content += f"   - File: `{entry['filename']}`\n"
            md_content += f"   - Pages: {entry['page_count']}\n"
            md_content += f"   - Type: {entry['source_type']}\n\n"

md_path = f"{output_dir}/citation_library.md"
with open(md_path, 'w', encoding='utf-8') as f:
    f.write(md_content)
print(f"   ✅ Saved: {md_path}")

# ==========================================
# 3. JSON Export (with page index)
# ==========================================
print("\n🔧 Creating JSON export...")

json_data = {
    "metadata": {
        "generated": datetime.now().isoformat(),
        "total_pdfs": len(citation_database),
        "total_pages": sum(e['page_count'] for e in citation_database)
    },
    "citations": {}
}

for entry in citation_database:
    filename = entry['filename']

    # Add citation data
    json_data["citations"][filename] = {
        "apa": entry['apa_citation'],
        "authors": entry['authors'],
        "year": entry['year'],
        "title": entry['title'],
        "source_type": entry['source_type'],
        "journal": entry.get('journal', ''),
        "phase": entry['phase'],
        "pages": entry['page_count'],
        "filepath": entry['filepath']
    }

    # Add page content if available
    if filename in page_content_index:
        json_data["citations"][filename]["page_map"] = {
            str(page_num): {
                "word_count": page_data['word_count'],
                "preview": page_data['text'][:200] + "..." if len(page_data['text']) > 200 else page_data['text']
            }
            for page_num, page_data in page_content_index[filename]['pages'].items()
        }

json_path = f"{output_dir}/citations.json"
with open(json_path, 'w', encoding='utf-8') as f:
    json.dump(json_data, f, indent=2, ensure_ascii=False)
print(f"   ✅ Saved: {json_path}")

# ==========================================
# Summary
# ==========================================
print("\n" + "="*80)
print("✅ EXPORT COMPLETE!")
print("="*80)
print(f"\n📁 All files saved to: {output_dir}/")
print(f"\n📊 Files created:")
print(f"   1. citation_database.csv     ({len(citation_database)} rows)")
print(f"   2. citation_library.md       (Human-readable)")
print(f"   3. citations.json            (Machine-readable with page index)")
print(f"\n💾 Total size: ~{sum(e['page_count'] for e in citation_database):,} pages indexed")
print("="*80)

💾 EXPORTING CITATION DATABASE

📊 Creating CSV export...
   ✅ Saved: /content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/citations/citation_database.csv

📝 Creating Markdown export...
   ✅ Saved: /content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/citations/citation_library.md

🔧 Creating JSON export...
   ✅ Saved: /content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/citations/citations.json

✅ EXPORT COMPLETE!

📁 All files saved to: /content/drive/MyDrive/PPE_Master_Thesis/Hierarchical_RAG_Pipeline/citations/

📊 Files created:
   1. citation_database.csv     (85 rows)
   2. citation_library.md       (Human-readable)
   3. citations.json            (Machine-readable with page index)

💾 Total size: ~2,994 pages indexed


In [24]:
"""
Step 6: Test Citation Verification
Test the verification tool with known citations from your Chapter 2
"""

print("="*80)
print("🧪 TESTING CITATION VERIFICATION")
print("="*80)

# Test cases from Chapter 2 Citation Verification Report
test_cases = [
    {
        "name": "Hall & Soskice 2001, p. 355 - Non-market coordination quote",
        "author": "Hall",
        "year": "2001",
        "page": 355,
        "quote": "firms are often embedded in arrangements that involve more extensive relational"
    },
    {
        "name": "Hall & Gingerich 2009, p. 4 - Institutional complementarities",
        "author": "Hall",
        "year": "2009",
        "page": 4,
        "quote": "One set of institutions is said to be complementary to another"
    },
    {
        "name": "Foster & Thelen 2024, p. 1 - Competition law",
        "author": "Foster",
        "year": "2024",
        "page": 1,
        "quote": "competition law"
    },
    {
        "name": "Crouch et al. 2009, p. 654 - KNOWN BAD PAGE NUMBER",
        "author": "Crouch",
        "year": "2009",
        "page": 654,
        "quote": None  # This should fail - page doesn't exist
    }
]

print("\n🔍 Running test cases...\n")

for test in test_cases:
    print(f"{'='*60}")
    print(f"Test: {test['name']}")
    print(f"{'='*60}")

    result = verify_citation(
        author=test['author'],
        year=test['year'],
        page=test['page'],
        quote=test['quote']
    )

    print(f"Status: {result['status']}")
    print(f"Message: {result['message']}")

    if result.get('citation'):
        print(f"Citation: {result['citation'][:100]}...")

    if result.get('page_content'):
        print(f"Page preview: {result['page_content'][:150]}...")

    print()

print("="*80)
print("✅ Testing complete!")
print("="*80)

🧪 TESTING CITATION VERIFICATION

🔍 Running test cases...

Test: Hall & Soskice 2001, p. 355 - Non-market coordination quote


NameError: name 'verify_citation' is not defined

In [None]:
# ==========================================
# SEARCH ALL CITATIONS
# ==========================================
# Search for PDFs by keyword in title, author, or content

search_keyword = "varieties of capitalism"  # Change this to search

print("="*80)
print(f"🔎 SEARCHING FOR: '{search_keyword}'")
print("="*80)

matches = []

for entry in citation_database:
    # Search in title, authors, and keywords
    search_in = f"{entry['title']} {entry['authors']} {entry.get('keywords', '')}".lower()

    if search_keyword.lower() in search_in:
        matches.append(entry)

print(f"\n✅ Found {len(matches)} matching PDFs:\n")

for i, match in enumerate(matches, 1):
    print(f"{i}. {match['apa_citation']}")
    print(f"   File: {match['filename']}")
    print(f"   Pages: {match['page_count']}")
    print(f"   Phase: {match['phase']}\n")

if not matches:
    print("❌ No matches found. Try a different keyword.")

print("="*80)

🔎 SEARCHING FOR: 'varieties of capitalism'

✅ Found 8 matching PDFs:

1. Crouch, C. (n.d.). Regional and sectoral varieties of capitalism.
   File: Regional_and_Sectoral_Varieties_of_Capitalism_crouch.pdf
   Pages: 30
   Phase: phase1

2. Movahed, M. (2023). Varieties of capitalism and income inequality.
   File: movahed-2023-varieties-of-capitalism-and-income-inequality.pdf
   Pages: 38
   Phase: phase1

3. Foster, N., & Thelen, K. (2024). Coordination rights, competition law and varieties of capitalism.
   File: foster-thelen-2024-coordination-rights-competition-law-and-varieties-of-capitalism (1).pdf
   Pages: 39
   Phase: phase1

4. Hall, P. A., & Gingerich, D. W. (2009). Varieties of capitalism and institutional complementarities in the political economy: An empirical analysis. British Journal of Political Science, 39(3), 449-482.
   File: hallgingerich2009.pdf
   Pages: 34
   Phase: phase1

5. Author Unknown. (n.d.). An introduction to varieties of capitalism.
   File: An_introdu

In [23]:
# ==========================================
# VERIFY A CITATION
# ==========================================
# Change these values to verify your citations!

author_name = "Hall"        # Last name of author
pub_year = "2001"           # Publication year
page_number = 355           # Page number
quote_text = "firms are often embedded"  # Quote to verify (optional, set to None to skip)

# Run verification
result = verify_citation(author_name, pub_year, page_number, quote_text)

# Display results
print("="*80)
print(f"🔍 VERIFYING: {author_name} ({pub_year}), p. {page_number}")
print("="*80)
print(f"\nStatus: {result['status']}")
print(f"Message: {result['message']}\n")

if result.get('citation'):
    print(f"✅ APA Citation:\n   {result['citation']}\n")

if result.get('page_content'):
    print(f"📄 Page {page_number} content (first 400 chars):")
    print(f"   {result['page_content'][:400]}...\n")

if result.get('quote_match'):
    print("✅ Quote verified on this page!")
elif result.get('quote_match') == False:
    print("⚠️  Quote not found exactly - check page content above")

print("="*80)

NameError: name 'verify_citation' is not defined

## 📊 Performance Comparison

| Metric | Your Local CPU | Colab A100 GPU | Speedup |
|--------|----------------|----------------|----------|
| 2 PDFs (78 nodes) | ~10 minutes | **~30 seconds** | **20x faster** |
| Embedding speed | ~3-10 sec/node | **~0.05 sec/node** | **60-200x faster** |
| Memory efficient | Limited | 40GB GPU RAM | **Massive scale** |

---

## 🎓 Next Steps

1. ✅ **Quick test completed** - System works!
2. 🚀 **Run full pipeline** - Process all 75 PDFs (~15-20 min)
3. 💾 **Save index** - Never rebuild again!
4. 🔍 **Query your data** - Interactive research assistant ready!

---

## 📝 Sample Queries for Your Thesis

```python
# Theory questions
"What are the key concepts in varieties of capitalism?"
"Explain institutional complementarities"

# Methodology questions  
"What spatial econometric methods are discussed?"
"How to analyze panel data?"

# Literature questions
"What studies discuss Ruhr industrial decline?"
"Recent research on just transitions"
```

---

**🎉 Happy researching with GPU power! 🚀**

In [22]:
# Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

print("✅ Query engine ready!\n")

# Test query
test_query = "What is spatial econometrics?"
print(f"🔍 Query: {test_query}\n")

response = query_engine.query(test_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

# Show sources
print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')})")

✅ Query engine ready!

🔍 Query: What is spatial econometrics?

📝 Response:
Spatial econometrics is a field of study that focuses on the incorporation of spatial relationships and interactions into econometric models. It is used to analyze how spatial dependencies, such as geographic proximity or other forms of spatial interaction, influence economic phenomena. This approach is particularly useful for understanding processes like diffusion, learning, contagion, and externalities, which are inherently spatial in nature.

📚 Sources:
  1. gc_ws1819_Elhorst_presentation.pdf (Page 1)
  2. Geo-Nested Analysis_ Mixed-Methods Research with Spatially Depend.pdf (Page 4)
  3. Elhorst-Spatial-Panel-Data-Analysis-Encyclopedia-GIS-2nd-ed_Working-Paper-Version.pdf (Page 10)


**Query Block 1: CME & Post-Industrial Transitions**

In [None]:
# Enter your query here
user_query = "what are panel methods"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: what are panel methods

📝 Response:
Based on the context provided, panel methods refer to statistical techniques used to analyze panel data, which consists of observations on multiple cross-sectional units (such as individuals, regions, or countries) observed over multiple time periods.

Key characteristics of panel methods include:

1. **Data Structure**: Panel data involves observations indexed by both cross-sectional units and time periods. When each unit has observations for all time periods, the panel is "balanced."

2. **Spatial Panel Methods**: These are extensions that account for spatial relationships between units, where observations are associated with particular positions in space (like housing locations, countries, or regions).

3. **Spatial Weights Matrix**: A key component that represents the structure of interactions between spatial units, indicating which locations are neighbors and the intensity of their relationships.

4. **Estimation Approaches**: The conte

In [None]:
# Enter your query here
user_query = "can spatial methods be used a method of analysing quantitative datafo this research question   Title: Business Formation Patterns and Sectoral Transitions: A Comparative Analysis of 5 Ruhr Citie Operating Under Germany's CME Framework (2013-2024) Research Question: How do business formation patterns and sectoral transitions vary across 5 Ruhr cities operating under Germany's shared CME institutional framework, and what local factors explain this variation?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: can spatial methods be used a method of analysing quantitative datafo this research question   Title: Business Formation Patterns and Sectoral Transitions: A Comparative Analysis of 5 Ruhr Citie Operating Under Germany's CME Framework (2013-2024) Research Question: How do business formation patterns and sectoral transitions vary across 5 Ruhr cities operating under Germany's shared CME institutional framework, and what local factors explain this variation?

📝 Response:
Yes, spatial methods can be highly appropriate for analyzing quantitative data for this research question. Here's why:

## Relevance of Spatial Methods to Your Research

**1. Geographic Proximity and Regional Interdependence**
Your research focuses on 5 Ruhr cities, which are geographically proximate regions. The context indicates that spatial econometrics can analyze "processes of diffusion, learning, contagion, externalities or interdependence" - all of which are likely relevant when examining business formati

In [None]:
# Enter your query here
user_query = "Based on the context you have can you explain more about **3. Regional Variation Analysis** The context demonstrates that spatial methods are effective for examining pronounced differences in the magnitude of the effect across regions and analyzing regional differences in much more detail. This directly aligns with your research question about how patterns vary across 5 Ruhr cities and what local factors explain this variation."  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: Based on the context you have can you explain more about **3. Regional Variation Analysis** The context demonstrates that spatial methods are effective for examining pronounced differences in the magnitude of the effect across regions and analyzing regional differences in much more detail. This directly aligns with your research question about how patterns vary across 5 Ruhr cities and what local factors explain this variation.

📝 Response:
Based on the context provided, spatial methods offer powerful tools for examining regional variation through several key approaches:

**Geographically Weighted Regression (GWR) for Regional Analysis**

The context highlights GWR as particularly valuable for analyzing regional variation. Unlike conventional regression models that assume uniform relationships across all units, GWR "allows different relationships to exist at different points in space," enabling the analysis of spatial heterogeneity. This method estimates local coefficients and

In [None]:
# Enter your query here
user_query = "given the avaliable context what would this mean The research design employs panel logic—leveraging cross-sectional and temporal variation—without applying panel econometric methods"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: given the avaliable context what would this mean The research design employs panel logic—leveraging cross-sectional and temporal variation—without applying panel econometric methods

📝 Response:
Based on the provided context, this statement would mean that while the research design conceptually uses the structure of panel data (which includes observations across multiple units over multiple time periods), it does not utilize the specialized statistical techniques typically associated with panel data analysis.

The context describes several panel econometric methods, including:
- Fixed effects and random effects estimators
- Demeaning techniques for fixed effects estimation
- Generalized least squares (GLS) random effects estimator
- Panel-corrected standard errors

If a study employs "panel logic" without these methods, it would be using the cross-sectional (variation across different units/countries) and temporal (variation over time) dimensions of the data for descriptive or

In [None]:
# Enter your query here
user_query =  "base don the contect on methodology Does panel login This aligns with the **explicitly exploratory and descriptive** objective: documenting variation rather than estimating effects, identifying patterns requiring explanation rather than testing hypotheses"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: base don the contect on methodology Does panel login This aligns with the **explicitly exploratory and descriptive** objective: documenting variation rather than estimating effects, identifying patterns requiring explanation rather than testing hypotheses

📝 Response:
Based on the context provided, the methodological approaches described do **not** align with an explicitly exploratory and descriptive objective focused on documenting variation rather than estimating effects.

The context shows that the panel data methodologies employed are primarily **explanatory and hypothesis-testing** in nature:

1. **Fixed-effects models are used for causal estimation**: The Movahed study explicitly uses fixed-effects regression models to "adjust for unobserved, unit-specific and time-invariant confounders when estimating effects from observational data." This is clearly focused on estimating causal effects, not just documenting patterns.

2. **Formal hypothesis testing**: The research test

In [None]:
# Enter your query here
user_query = "does anywhere in the context talk about coordination effectiveness proxies and how they are derived?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: does anywhere in the context talk about coordination effectiveness proxies and how they are derived?

📝 Response:
Yes, the context discusses coordination effectiveness proxies and their derivation. Specifically, in the first source (hall2015_emergingtrends.pdf), it mentions that researchers have attempted to measure coordination types in the Varieties of Capitalism framework, which is "a difficult task because the typology turns on forms of coordination that can rarely be measured directly."

The text describes several approaches to deriving these proxies:

1. **Hall and Gingerich (2009)** constructed "a widely used but time-invariant index of (market-oriented vs strategic) coordination" using measures of coordination in labor relations and corporate governance.

2. **Schneider and Paunescu (2012)** developed "time-variant measures" as proxies for coordination.

3. **Geffen and Kenyon (2006)** used "a cluster analysis highly sensitive to the variables chosen as proxies" to ide

In [None]:
# Enter your query here
user_query = "What example variables are researchers supposed to select that serve as indirect indicators of coordination effectiveness"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: What example variables are researchers supposed to select that serve as indirect indicators of coordination effectiveness

📝 Response:
Based on the context information, researchers are supposed to select observable variables that serve as indirect indicators of coordination effectiveness in different spheres of the economy. Specifically:

**For Corporate Governance:**
- Variables that indicate the degree of strategic (versus market) coordination in corporate governance systems

**For Labour Relations:**
- Variables that indicate the degree of market coordination (or strategic coordination in reverse) in labor relations

The context indicates that these observable variables should reflect institutional conditions associated with different types of coordination. The factor analysis approach uses these observable indicators to measure the underlying latent variables representing coordination levels in each sphere. The factor loadings (λ11, λ12, λ13 for corporate governance and λ2

##3.1_research_design

In [None]:
# Enter your query here
user_query = "What are the specific strengths of using comparative case study design for analyzing post-industrial regional transitions like the Ruhr region? Why is it valuable to examine both successful and unsuccessful transformation cases?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: What are the specific strengths of using comparative case study design for analyzing post-industrial regional transitions like the Ruhr region? Why is it valuable to examine both successful and unsuccessful transformation cases?

📝 Response:
Based on the context provided, the comparative case study design offers several specific strengths for analyzing post-industrial regional transitions:

## Strengths of Comparative Case Study Design:

1. **Drawing out general findings**: By selecting cases that display both similarities and differences across different criteria, researchers can identify broader patterns and conclusions that extend beyond individual cases.

2. **Learning from both success and failure**: Examining unsuccessful cases provides policy lessons that are as important as those from successful cases. This dual approach helps determine whether factors explaining success also explain lack of success, offering a more complete understanding.

3. **Industry-specific vs. u

In [None]:
# Enter your query here
user_query = "In mixed-methods research with spatially dependent data, what are the key advantages of combining Large-N quantitative analysis with Small-N qualitative case studies? How does this approach specifically address spatial dependence issues?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In mixed-methods research with spatially dependent data, what are the key advantages of combining Large-N quantitative analysis with Small-N qualitative case studies? How does this approach specifically address spatial dependence issues?

📝 Response:
Based on the provided context, the key advantages of combining Large-N quantitative analysis with Small-N qualitative case studies in spatially dependent data research include:

## Primary Advantages:

1. **Uncovering Sources of Spatial Dependence**: The combination allows researchers to identify and understand the mechanisms behind spatial patterns. Specifically:
   - When diagnostics indicate a **spatial lag process**, the small-N analysis focuses on identifying "vectors of transmission"—the process by which outcomes in nearby areas affect outcomes locally
   - When diagnostics indicate a **spatial error process**, the small-N analysis uncovers spatially clustered omitted or unobserved variables, revealing "contextual effects"



In [None]:
# Enter your query here
user_query = "In studies of the Ruhr region's economic transformation, what makes cities like Dortmund, Essen, Duisburg, and Bochum suitable as comparable cases? How do these cities share similar post-industrial contexts while showing variation in entrepreneurial outcomes?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In studies of the Ruhr region's economic transformation, what makes cities like Dortmund, Essen, Duisburg, and Bochum suitable as comparable cases? How do these cities share similar post-industrial contexts while showing variation in entrepreneurial outcomes?

📝 Response:
Based on the context provided, Dortmund, Essen, Duisburg, and Bochum serve as suitable comparable cases for studying the Ruhr region's economic transformation for several key reasons:

## Shared Post-Industrial Context

These cities share fundamental similarities as part of the Ruhr metropolitan area:

1. **Common Industrial Heritage**: They are all part of Germany's largest metropolitan area that was historically the center of industrialization, particularly coal and steel industries, and are now undergoing a fundamental process of transformation.

2. **Similar Structural Challenges**: All four cities face the same regional challenges of digitalization, rapidly changing markets, and the need to exploit innov

## 3.2_case_selection

In [None]:
# Enter your query here
user_query = "In studies of business formation patterns in German regions, why is municipal-level analysis preferred over broader regional aggregation? What are the specific advantages of analyzing labor market regions or individual municipalities?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In studies of business formation patterns in German regions, why is municipal-level analysis preferred over broader regional aggregation? What are the specific advantages of analyzing labor market regions or individual municipalities?

📝 Response:
Based on the context provided, the analysis actually shows a preference for **labor market regions** over both municipal-level and broader aggregations, rather than municipal-level being preferred. Here are the specific advantages explained:

## Advantages of Labor Market Regions:

1. **Economically Related Areas**: Labor market regions represent economically related areas that take both administrative factors and commuting patterns into account, rather than being defined by random administrative borders.

2. **Capture Economic Connections**: They capture economically and socially connected regions, which is important for evaluating the relevance of local business networks (LBN) since connections are unlikely to end at random borders

In [None]:
# Enter your query here
user_query = "In spatial econometric research, how should researchers select cases to maximize outcome variation? What role do extreme spatial lag (rho) or spatial error (lambda) values play in identifying focal units for in-depth analysis?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In spatial econometric research, how should researchers select cases to maximize outcome variation? What role do extreme spatial lag (rho) or spatial error (lambda) values play in identifying focal units for in-depth analysis?

📝 Response:
Based on the context provided, researchers should employ a **deliberate case selection strategy** when meaningful variation in spatial lag (rho) exists, rather than random selection. Here's how to maximize outcome variation:

## Selection Based on Rho Values

**For High Positive Rho Values:**
Researchers should analyze cases with high rho values to uncover how outcomes in a focal unit interact with outcomes in neighboring units. These cases are particularly valuable for identifying vectors of transmission. For example, counties in West Virginia with the darkest colors and highest rho values represent good candidates to serve as focal units for spatial nested analysis (SNA).

**For Negative Rho Values:**
Cases with negative rho values are pro

In [None]:
# Enter your query here
user_query = "In studies of German regional entrepreneurship patterns from 2002-2020, what historical context makes this time period suitable for analyzing post-reunification business formation dynamics? How does this timeframe capture both institutional stability and structural changes?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In studies of German regional entrepreneurship patterns from 2002-2020, what historical context makes this time period suitable for analyzing post-reunification business formation dynamics? How does this timeframe capture both institutional stability and structural changes?

📝 Response:
Based on the provided context, the period from 2002-2020 would be particularly suitable for analyzing post-reunification business formation dynamics in German regions for several key reasons:

## Historical Context and Institutional Stability

The context indicates that around 1997-2000, East Germany entered "a new development phase," as evidenced by relatively small changes in the size structure during this period. This suggests that by the early 2000s, the most turbulent phase of post-reunification transformation was stabilizing, making 2002 onwards a period where institutional frameworks had become more established.

## Capturing Structural Changes

The documents emphasize that the forty yea

## 3.3_data_measurement

In [None]:
# Enter your query here
user_query = "In German regional studies, how is new business formation measured at the municipal level? What data sources distinguish between serial entrepreneurs (those who founded businesses previously) versus de-novo entrepreneurs (first-time founders)?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In German regional studies, how is new business formation measured at the municipal level? What data sources distinguish between serial entrepreneurs (those who founded businesses previously) versus de-novo entrepreneurs (first-time founders)?

📝 Response:
Based on the document, new business formation in German regional studies is measured using the following approach:

## Data Source

The **Mannheim Enterprise Panel (MUP)** is the primary data source used. This panel:
- Builds on the official German Business Registry, which records all newly founded firms, stakeholder information, and firm characteristics
- Is augmented with additional information from Creditreform, Germany's largest credit rating agency
- Contains the universe of all economically active firms in Germany
- Has been maintained since 1990 with full information available since 2002
- Includes information for 307,723,655 actors (natural or legal entities)

## Measurement Level

The analysis is conducted at the **

In [None]:
# Enter your query here
user_query = "In empirical studies of coordinated market economies, what specific indicators measure institutional quality and coordination effectiveness? Which of these institutional measures can be operationalized at sub-national or regional levels?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In empirical studies of coordinated market economies, what specific indicators measure institutional quality and coordination effectiveness? Which of these institutional measures can be operationalized at sub-national or regional levels?

📝 Response:
Based on the context provided, several specific indicators are used to measure institutional quality and coordination effectiveness in coordinated market economies:

## Specific Indicators Used:

1. **Labour Relations Measures** - These capture the degree of strategic coordination in employment relationships, including the presence of powerful workforce representatives and consensual decision-making styles.

2. **Corporate Governance Indicators** - These measure the extent of strategic coordination in financial markets and business networks, including aspects like board interlocking and ownership structures.

3. **Wage Setting Coordination Scores** - A specific measure developed by Lane Kenworthy that tracks coordination in wage-s

In [None]:
# Enter your query here
user_query = "What indicators are typically included in regional competitiveness indices that measure economic dynamism? Which indicators specifically capture institutional quality versus pure economic outcomes like GDP?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: What indicators are typically included in regional competitiveness indices that measure economic dynamism? Which indicators specifically capture institutional quality versus pure economic outcomes like GDP?

📝 Response:
Based on the context provided, regional competitiveness indices that measure economic dynamism typically include the following indicators:

**Indicators in Regional Competitiveness Indices:**

The Regional Competitiveness and Cohesion Index (RCCI) for Kazakhstan measures six different aspects:
1. Health and a basic standard of living
2. Higher education and training
3. Labour market efficiency
4. Market size
5. Technological readiness
6. Innovation

**Institutional Quality vs. Pure Economic Outcomes:**

The context indicates that competitiveness indicators "move beyond the realm of traditional economic metrics, incorporating social and institutional factors in order to reflect the diverse range of factors that influence regional productivity and development."



In [None]:
# Enter your query here
user_query = "In spatial econometric analysis of cross-sectional municipal data,what are the main approaches for constructing spatial weight matrices? What are the trade-offs between geographic contiguity, distance-based, and economic connectivity specifications?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In spatial econometric analysis of cross-sectional municipal data,what are the main approaches for constructing spatial weight matrices? What are the trade-offs between geographic contiguity, distance-based, and economic connectivity specifications?

📝 Response:
Based on the provided context, there are several main approaches for constructing spatial weight matrices in spatial econometric analysis:

## Main Approaches:

1. **Binary contiguity matrices** - These include p-order neighbors (first-order for immediate neighbors, second-order for neighbors of neighbors, etc.)

2. **Distance-based matrices** - Including:
   - Inverse distance matrices (with or without cut-off points)
   - Exponential distance decay matrices
   - The context provides an example where Wij = d⁻¹ij/ψ, where dij is the straight-line distance between units and ψ is the maximum eigenvalue

3. **q-nearest neighbor matrices** - Where q represents a specified number of closest neighbors

4. **Block diagonal or

In [None]:
# Enter your query here
user_query = "how are business formed in the Ruhr area? how have businesses been formed after the industrial decline"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: how are business formed in the Ruhr area? how have businesses been formed after the industrial decline

📝 Response:
Based on the context provided, businesses in the Ruhr area have been formed through several key mechanisms following the industrial decline:

## Evolution from Traditional Industries

New economic sectors often grew out of old industries rather than appearing entirely independently. For example, the environmental economy emerged from the mining industry due to increased environmental requirements since the 1970s, and this sector continues to exist even though mining no longer operates in the region.

## Network of Small and Medium Enterprises (SMEs)

After the decline, the region developed a strong SME ecosystem. Approximately 747,000 SMEs now provide employment to about 80% of the population in the region. These businesses operate across diverse sectors including chemicals, mechanical and pharmaceutical industries, food, metal production and processing, automoti

## 3.4_analytical_strategy

In [None]:
# Enter your query here
user_query = "In spatial panel data models, what is the full specification for a model that includes both spatial lag effects (spillovers from neighboring units) and spatial error correlation? How are these spatial parameters interpreted?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In spatial panel data models, what is the full specification for a model that includes both spatial lag effects (spillovers from neighboring units) and spatial error correlation? How are these spatial parameters interpreted?

📝 Response:
Based on the provided context, the full specification for a spatial panel data model that includes both spatial lag effects and spatial error correlation is:

**Model Specification:**

y = λ(IT ⊗WN)y + Xβ + u

where the disturbance term u has the structure:

u = (ιT ⊗IN)µ + ε

and the error term ε follows a spatial autoregressive process:

ε = ρ(IT ⊗WN)ε + ν

**Components:**
- y is an NT × 1 vector of observations on the dependent variable
- X is a NT × k matrix of observations on non-stochastic exogenous regressors
- WN is the N × N spatial weights matrix of known constants with diagonal elements set to zero
- IT is an identity matrix of dimension T
- IN is an N × N identity matrix
- ιT is a T × 1 vector of ones
- µ is a vector of time-invari

In [None]:
# Enter your query here
user_query = "In cross-sectional spatial analysis, why must researchers account for spatial dependence between observations? What biases occur when spatial relationships between units are ignored in the analysis?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In cross-sectional spatial analysis, why must researchers account for spatial dependence between observations? What biases occur when spatial relationships between units are ignored in the analysis?

📝 Response:
According to the context, researchers must account for spatial dependence in cross-sectional spatial analysis for several critical reasons:

## Why Spatial Dependence Must Be Accounted For:

1. **Threat to Statistical Inference**: Spatial dependence poses a significant threat to inference if not modeled adequately. In spatial econometrics, observations are not considered independent pieces of information but rather "a single realization of a process."

2. **Risk of Model Misspecification**: Ignoring spatial dependence during preliminary analysis runs the risk of beginning with a misspecified model, which is unlikely to yield "robust and satisfactory results."

## Biases That Occur When Spatial Relationships Are Ignored:

1. **Biased and Inconsistent Coefficients**: Sim

In [None]:
# Enter your query here
user_query = "In comparative case study research, how do researchers verify the temporal sequencing of events and processes across multiple cases? What role does process tracing play in identifying causal mechanisms?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In comparative case study research, how do researchers verify the temporal sequencing of events and processes across multiple cases? What role does process tracing play in identifying causal mechanisms?

📝 Response:
Based on the provided context, process tracing plays a crucial role in identifying causal mechanisms by shifting from data-set observations to causal process observations. The appropriate technique for process tracing depends on the specific research question at hand.

To verify temporal sequencing and identify causal mechanisms, researchers employ various qualitative techniques including:

1. **Archival research** - examining evidence of how events unfolded and were transmitted across units or locations
2. **Interviews and focus groups** - gathering firsthand accounts of processes and their timing
3. **Participatory observation** - directly observing ongoing processes
4. **Secondary literature and journalistic accounts** - analyzing documented sequences of events


In [None]:
# Enter your query here
user_query = "In geo-nested analysis that combines quantitative and qualitative methods,what is the iterative procedure for integrating findings from statistical models with in-depth case studies? How do insights from each methodological phase inform and refine the other?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In geo-nested analysis that combines quantitative and qualitative methods,what is the iterative procedure for integrating findings from statistical models with in-depth case studies? How do insights from each methodological phase inform and refine the other?

📝 Response:
In geo-nested analysis, the iterative procedure for integrating quantitative and qualitative findings follows a systematic cycle where insights from each phase directly inform and refine subsequent steps:

## The Iterative Integration Process

**Initial Quantitative Phase:**
The process begins with econometric analysis that includes regression diagnostics. These diagnostics serve a dual purpose: they help identify promising cases for in-depth study and reveal potential spatial dependence patterns in the data.

**Case Selection and Qualitative Investigation:**
Cases selected for qualitative analysis are embedded within the same sample used in the quantitative analysis. The nature of the qualitative investigatio

## 3.5_validity_triangulation

In [None]:
# Enter your query here
user_query = "In mixed-methods regional research, what qualitative techniques are used to probe and validate findings from quantitative spatial models? What specific methods help researchers understand mechanisms behind statistical patterns?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In mixed-methods regional research, what qualitative techniques are used to probe and validate findings from quantitative spatial models? What specific methods help researchers understand mechanisms behind statistical patterns?

📝 Response:
Based on the provided context, the following qualitative techniques are used in mixed-methods regional research to probe and validate findings from quantitative spatial models:

**Process Tracing**: The context references process tracing as moving "from metaphor to analytic tool," indicating it is used as a systematic qualitative method to understand mechanisms behind statistical patterns.

**Small-N Analysis**: When spatial diagnostics indicate specific patterns, small-N analysis is employed to:
- Uncover spatially clustered omitted or unobserved variables when a spatial error process is detected
- Shed light on "contextual effects" 
- Understand mechanisms operating locally when spatial lag processes are identified

The framework describe

In [None]:
# Enter your query here
user_query = "In case study research on regional economic transitions, how do researchers validate their theoretical explanations and findings? What role do local key informants play in ensuring the plausibility of research conclusions?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In case study research on regional economic transitions, how do researchers validate their theoretical explanations and findings? What role do local key informants play in ensuring the plausibility of research conclusions?

📝 Response:
Based on the context provided, researchers validate their theoretical explanations and findings through a process called process tracing during the Small-N Analysis (SNA) phase. The appropriate technique for process tracing depends on the specific research question at hand.

For validation, researchers employ several qualitative techniques including:

1. **Interviews and focus groups** - These allow researchers to gather direct evidence about causal processes and mechanisms
2. **Participatory observation** - This enables researchers to observe interactions and linkages firsthand
3. **Archival research** - Examining historical records and evidence of how information or influences spread between units
4. **Fieldwork** - Conducting personal intervi

In [None]:
# Enter your query here
user_query = "In comparative qualitative research, what procedures ensure reliability when coding interview or archival data? How do multiple researchers check for agreement and resolve coding discrepancies?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In comparative qualitative research, what procedures ensure reliability when coding interview or archival data? How do multiple researchers check for agreement and resolve coding discrepancies?

📝 Response:
Based on the context provided, several procedures ensure reliability when coding qualitative data:

## Independent Coding Process

The coding process begins with **two researchers independently coding the same data** in the initial phase. This independent approach helps establish objectivity and reduces individual bias.

## Achieving Inter-Coder Reliability

To ensure agreement between coders, researchers should:

1. **Hold frequent meetings** to compare and discuss the codes assigned by each researcher
2. **Analyze discrepancies iteratively** - differences in coding are discussed and examined repeatedly until consensus is reached
3. **Use analyst memos** - each researcher documents justifications for their thematic coding of particular statements, which helps explain their

## Query visualisation

---



In [None]:
# Enter your query here
user_query = "In studies of regional economic development paths, how are sequence index plots used to visualize how different regions evolve over time? What are the different ways to organize these plots (random order, sorted by initial state,or sorted by final state)?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: In studies of regional economic development paths, how are sequence index plots used to visualize how different regions evolve over time? What are the different ways to organize these plots (random order, sorted by initial state,or sorted by final state)?

📝 Response:
Based on the context provided, sequence index plots are used to visualize the heterogeneity of regional trajectories and how regions evolve over time in terms of their economic characteristics, specifically their complexity and relatedness density.

## Three Ways to Organize Sequence Index Plots:

1. **Random Order**: The full sequence data is visualized with regions arranged in random order, providing an unstructured view of all regional trajectories.

2. **Sorted by First State (Initial State)**: Regional trajectories are organized based on their starting conditions at the beginning of the observation period, allowing researchers to see how regions with similar initial characteristics evolved differently over t

In [None]:
# Enter your query here
user_query = "What visualization methods best support comparative analysis of multiple regional development trajectories? How do cluster visualizations help identify groups of regions with similar economic evolution patterns?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: What visualization methods best support comparative analysis of multiple regional development trajectories? How do cluster visualizations help identify groups of regions with similar economic evolution patterns?

📝 Response:
Based on the context, several visualization methods effectively support comparative analysis of regional development trajectories:

## Key Visualization Methods

**1. Sequence Index Plots**
These plots visualize the diversity of regional trajectories over time, showing how regions transition between different states of complexity and relatedness. They can be organized in different ways:
- Random order to show overall heterogeneity
- Sorted by first state to reveal initial conditions
- Sorted by last state to highlight end outcomes

However, sequence index plots have a limitation: "they do not reveal (statistical) similarities between distinct trajectories."

**2. Cluster Visualizations**
To address this limitation, cluster visualizations based on optimal m

In [None]:
# Enter your query here
user_query = "How should coordination effectiveness theoretically affect business formation?"  # ← Change this!

print(f"🔍 Query: {user_query}\n")

response = query_engine.query(user_query)

print("="*80)
print("📝 Response:")
print("="*80)
print(response.response)
print("="*80)

print("\n📚 Sources:")
for i, node in enumerate(response.source_nodes, 1):
    metadata = node.node.metadata
    score = node.score
    print(f"  {i}. {metadata.get('source', 'Unknown')} (Page {metadata.get('page', 'N/A')}) [Score: {score:.3f}]")

🔍 Query: How should coordination effectiveness theoretically affect business formation?

📝 Response:
Based on the context provided, coordination effectiveness should theoretically affect business formation through the efficiency of the market process. The documents indicate that:

**Positive effects of coordination effectiveness:**

The efficiency of the market process is judged by two key criteria:
1. How quickly and intensely incumbents react to actual or potential entry
2. How reliably the market mechanism discriminates between better and inferior solutions through survival-of-the-fittest scenarios

When coordination is effective, it can lead to:
- **Enhanced social capital**: Well-connected regions with strong business networks can build higher social capital, which serves as a valuable resource that can incentivize and facilitate entry by augmenting traditional input factors like physical capital and labor.

**Negative effects of coordination effectiveness:**

However, the context