# arXiv Paper Curator -  Infrastructure Setup

Build a production-grade RAG system using Docker, PostgreSQL, OpenSearch, FastAPI, Airflow, and Ollama.

## Technology Stack
| Component | Purpose | Port |
|-----------|---------|------|
| **FastAPI** | REST API | 8000 |
| **PostgreSQL** | Paper metadata storage | 5432 |
| **OpenSearch** | Hybrid search engine | 9200/5601 |
| **Apache Airflow** | Workflow automation | 8080 |
| **Ollama** | Local LLM inference | 11434 |

In [9]:
# Environment Check
import sys
from pathlib import Path

python_version = sys.version_info
print(f"Python Version: {python_version.major}.{python_version.minor}.{python_version.micro}")
print(f"Environment: {sys.executable}")

if python_version >= (3, 12):
    print("✓ Python version compatible")
else:
    print("✗ Need Python 3.12+")
    exit()

Python Version: 3.12.0
Environment: c:\Users\shubh\AppData\Local\Programs\Python\Python312\python.exe
✓ Python version compatible


In [10]:
# Find Project Root
current_dir = Path.cwd()
print(f"Current Directory: {current_dir}")
print(f"Directory Name: {current_dir.name}")
print(f"Parent Directory Name: {current_dir.parent.name}")
print(f"Grandparent Directory Name: {current_dir.parent.parent.name}")

if current_dir.name == "setup" and current_dir.parent.name == "notebooks":
    project_root = current_dir.parent.parent
elif (current_dir / "compose.yml").exists():
    project_root = current_dir
else:
    project_root = None

if project_root and (project_root / "compose.yml").exists():
    print(f"✓ Project root: {project_root}")
else:
    print("✗ Missing compose.yml - check directory")
    exit()
    

Current Directory: c:\Users\shubh\OneDrive\Documents\RAG_Research_Paper_Assistant\notebooks\setup
Directory Name: setup
Parent Directory Name: notebooks
Grandparent Directory Name: RAG_Research_Paper_Assistant
✓ Project root: c:\Users\shubh\OneDrive\Documents\RAG_Research_Paper_Assistant


In [11]:
# Check Docker
import subprocess

try:
    result = subprocess.run(["docker", "--version"], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print(f"✓ Docker: {result.stdout}")
    else:
        print("✗ Docker: Not working")
        exit()
except:
    print("✗ Docker: Not found")
    exit()

✓ Docker: Docker version 29.1.3, build f52814d



In [12]:
# Check Docker Compose
try:
    result = subprocess.run(["docker", "compose", "version"], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print(f"✓ Docker Compose: {result.stdout.split()[3]}")
    else:
        print("✗ Docker Compose: Not working")
        exit()
except:
    print("✗ Docker Compose: Not found")
    exit()

✓ Docker Compose: v5.0.1


In [13]:
# Check UV Package Manager
try:
    result = subprocess.run(["uv", "--version"], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print(f"✓ UV: {result.stdout.strip()}")
        print("\n✓ All required software ready!")
    else:
        print("✗ UV: Not working")
        exit()
except:
    print("✗ UV: Not found")
    exit()

✓ UV: uv 0.9.28 (0e1351e40 2026-01-29)

✓ All required software ready!


## Start Services

**Command to run (in terminal):**
```bash
cd [project-root]
docker compose up -d
```

**What this does:** Downloads images (first time) and starts all services in background.

In [14]:
# Check Docker Running
try:
    result = subprocess.run(["docker", "info"], capture_output=True, timeout=5)
    if result.returncode == 0:
        print("✓ Docker is running")
    else:
        print("✗ Docker not running - start Docker Desktop")
        exit()
except:
    print("✗ Docker daemon not accessible")
    exit()

✓ Docker is running


In [15]:
# Check Current Containers - Fixed encoding
try:
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        cwd=str(project_root),
        capture_output=True,
        text=True,
        encoding='utf-8',  # Force UTF-8 encoding
        errors='ignore',   # Ignore problematic characters
        timeout=10
    )
    
    if result.returncode == 0 and result.stdout.strip():
        print("Current containers:")
        for line in result.stdout.strip().split('\n'):
            if line.strip():
                try:
                    container = json.loads(line)
                    service = container.get('Service', 'unknown')
                    state = container.get('State', 'unknown')
                    print(f"  • {service}: {state}")
                except:
                    pass
    else:
        print("No containers running")
        
except Exception as e:
    print("Could not check containers")


Current containers:
  • api: running
  • clickhouse: running
  • opensearch-dashboards: running
  • langfuse-minio: running
  • langfuse-postgres: running
  • langfuse-redis: running
  • langfuse-web: running
  • langfuse-worker: running
  • ollama: running
  • opensearch: running
  • postgres: running
  • redis: running


## Service Health Verification

All services start automatically. Check their health status:

In [20]:
# Service Health Check
import json
import subprocess

EXPECTED_SERVICES = {
    'api': 'FastAPI REST API server',
    'postgres': 'PostgreSQL database',
    'opensearch': 'OpenSearch search engine', 
    'opensearch-dashboards': 'OpenSearch web dashboard',
    'ollama': 'Local LLM inference server',
    'airflow': 'Workflow automation (optional - may be off)'
}

try:
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        cwd=str(project_root),
        capture_output=True,
        text=True,
        encoding='utf-8',
        errors='ignore',
        timeout=15
    )
    
    if result.returncode == 0:
        print("SERVICE STATUS")
        print("=" * 70)
        print(f"{'Service':<20} {'State':<15} {'Status':<15} {'Notes'}")
        print("-" * 70)
    else:
        print("Could not get service status")
        
except Exception as e:
    print(f"Error checking services: {e}")

# Parse Service Status
if result.returncode == 0 and result.stdout and result.stdout.strip():
    for line in result.stdout.strip().split('\n'):
        if line.strip():
            try:
                container = json.loads(line)
                service = container.get('Service', 'unknown')
                state = container.get('State', 'unknown')
                health = container.get('Health', 'no check')
                
                if state == 'running' and health in ['healthy', 'no check']:
                    indicator = "✓"
                    notes = "Ready"
                elif state == 'running' and health == 'unhealthy':
                    indicator = "⚠"
                    notes = "Starting up..."
                elif state == 'exited':
                    indicator = "✗"
                    notes = "Failed to start"
                else:
                    indicator = "?"
                    notes = f"Status: {state}"
                
                print(f"{indicator} {service:<18} {state:<14} {health:<14} {notes}")
                
            except json.JSONDecodeError:
                pass


SERVICE STATUS
Service              State           Status          Notes
----------------------------------------------------------------------
⚠ api                running        unhealthy      Starting up...
✓ clickhouse         running        healthy        Ready
✓ opensearch-dashboards running        healthy        Ready
✓ langfuse-minio     running        healthy        Ready
✓ langfuse-postgres  running        healthy        Ready
✓ langfuse-redis     running        healthy        Ready
⚠ langfuse-web       running        unhealthy      Starting up...
? langfuse-worker    running                       Status: running
✓ ollama             running        healthy        Ready
✓ opensearch         running        healthy        Ready
✓ postgres           running        healthy        Ready
✓ redis              running        healthy        Ready


In [21]:
# Check Missing Services
missing_services = set(EXPECTED_SERVICES.keys()) - found_services

if missing_services:
    print("\nMISSING SERVICES:")
    print("-" * 70)
    for service in missing_services:
        description = EXPECTED_SERVICES[service]
        if service == 'airflow':
            print(f"⚠ {service:<18} not running    {'(Optional)':<14} {description}")
        else:
            print(f"✗ {service:<18} not running    {'Required':<14} {description}")

failed_services = [s for s, info in service_states.items() 
                  if info['state'] in ['exited', 'restarting'] or info['health'] == 'unhealthy']

if failed_services:
    print(f"\nTROUBLESHOOTING:")
    for service in failed_services:
        print(f"   docker compose logs {service}")
elif missing_services and 'airflow' not in missing_services:
    print(f"\nACTION NEEDED:")
    print("Start missing services: docker compose up -d")


MISSING SERVICES:
----------------------------------------------------------------------
✗ ollama             not running    Required       Local LLM inference server
✗ opensearch         not running    Required       OpenSearch search engine
✗ opensearch-dashboards not running    Required       OpenSearch web dashboard
✗ postgres           not running    Required       PostgreSQL database
✗ api                not running    Required       FastAPI REST API server
⚠ airflow            not running    (Optional)     Workflow automation (optional - may be off)


### 1. FastAPI - REST API Service

**Interactive Exploration:**

You can explore and test the FastAPI service in several ways:
- **API Documentation**: http://localhost:8000/docs (Interactive Swagger UI)
- **Alternative Docs**: http://localhost:8000/redoc (ReDoc interface)
- **Source Code**: Located in `src/routers/` directory

Let's test the API endpoints and explore the documentation: