
Scipher: Research Paper Processing Platform

A full-stack application for uploading, processing, and analyzing research papers with AI-powered extraction and summarization capabilities.

Overview

Scipher is a modern web application that simplifies research paper processing through:

  • PDF Upload & Processing: Drag-and-drop interface with background processing
  • Content Extraction: Converts PDFs to structured markdown using Docling
  • NER Extraction: Automatically extracts glossary terms using spaCy
  • Real-time Status: Live updates during document processing
  • Responsive Design: Works on desktop and mobile devices

Architecture

System Components

graph TB
    subgraph "Frontend (Next.js)"
        UI["Web Interface<br/>localhost:3000"]
        Upload["DocumentUploader"]
        Viewer["DocumentViewer"]
        Status["ProcessingStatus"]
    end
    
    subgraph "Backend (FastAPI)"
        API["REST API<br/>localhost:8080"]
        Routes["API Routes"]
        Processor["DocumentProcessor"]
        NER["NERExtractor"]
    end
    
    subgraph "Background Processing"
        BGTasks["BackgroundTasks"]
        ThreadPool["ThreadPoolExecutor"]
        Docling["Docling Converter"]
        SpaCy["spaCy NER"]
    end
    
    subgraph "Storage"
        DB["SQLite Database"]
        FS["File System<br/>uploads/, processed/"]
    end
    
    UI --> API
    Upload --> Routes
    Viewer --> Routes
    Status --> Routes
    
    Routes --> Processor
    Processor --> BGTasks
    BGTasks --> ThreadPool
    ThreadPool --> Docling
    ThreadPool --> SpaCy
    
    Processor --> DB
    Processor --> FS

Technology Stack

Backend [1]:

  • FastAPI (Web framework)
  • SQLAlchemy + aiosqlite (Database ORM)
  • Docling (PDF processing)
  • spaCy (Named Entity Recognition)
  • uvicorn (ASGI server)

Frontend [2]:

  • Next.js 15.5.4 (React framework)
  • TypeScript
  • Tailwind CSS
  • Axios (HTTP client)

Quick Start

Prerequisites

  • Python ≥3.13
  • Node.js ≥18.18.0
  • uv package manager

Installation

  1. Clone the repository:
git clone https://github.com/Arynshr/Scipher.git
cd Scipher
  2. Backend setup:
cd backend
uv sync
source .venv/bin/activate
python -m spacy download en_core_web_sm
  3. Frontend setup:
cd ../Web
npm install
  4. Start the application:
cd ..  # Return to project root
chmod +x start-dev.sh
./start-dev.sh

The startup script [3] handles:

  • Cache cleanup
  • Backend startup on port 8080
  • Frontend startup on port 3000
  • Graceful shutdown on Ctrl+C

Access Points

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:8080

Configuration

Backend Settings

Configuration is managed through the Settings class [4]:

| Setting | Default | Description |
| --- | --- | --- |
| DATABASE_URL | sqlite+aiosqlite:///./scipher.db | Database connection |
| UPLOAD_DIR | uploads/ | PDF upload directory |
| MAX_FILE_SIZE | 50 MB | Maximum upload size |
| ALLOWED_EXTENSIONS | .pdf | Permitted file types |
| PROCESSING_TIMEOUT | 180 s | Processing timeout |
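
As a rough illustration of how the MAX_FILE_SIZE and ALLOWED_EXTENSIONS defaults could be applied to an incoming upload, here is a minimal stdlib-only sketch. The `validate_upload` helper is hypothetical and not taken from the Scipher code, and the case-insensitive extension check is an assumption.

```python
from pathlib import Path

# Values copied from the documented defaults above.
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB in bytes
ALLOWED_EXTENSIONS = {".pdf"}


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the upload passes.

    Hypothetical helper for illustration, not part of the Scipher backend.
    """
    problems = []
    # Assumption: extensions are compared case-insensitively.
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"extension {extension or '(none)'} is not allowed")
    if size_bytes > MAX_FILE_SIZE:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_SIZE}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a rejected file at once.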

Frontend Environment

Create Web/.env.local:

NEXT_PUBLIC_API_URL=http://localhost:8080

API Endpoints

Document Management

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/upload | POST | Upload PDF document |
| /api/status/{doc_id} | GET | Check processing status |
| /api/document/{doc_id} | GET | Get processed document |
| /api/document/{doc_id}/sections | GET | Get document sections |
| /api/document/{doc_id}/text | GET | Get raw text |
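
To show what a client call against these endpoints might look like, the sketch below builds a multipart upload request with the Python standard library, without sending it. The form field name `file` is an assumption (check the API route for the real name), and `build_upload_request` and `status_url` are hypothetical helpers.

```python
import uuid
import urllib.request

API_URL = "http://localhost:8080"  # matches NEXT_PUBLIC_API_URL in this README


def build_upload_request(pdf_bytes: bytes, filename: str,
                         field_name: str = "file") -> urllib.request.Request:
    """Build (but do not send) a multipart POST for /api/upload.

    The form field name "file" is an assumption, not confirmed by the source.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/api/upload",
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def status_url(doc_id: str) -> str:
    """URL for GET /api/status/{doc_id}."""
    return f"{API_URL}/api/status/{doc_id}"
```

With the backend running, the request can be sent with `urllib.request.urlopen(request)`; the shape of the JSON response is defined by the backend and is not shown here.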

Glossary Management

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/document/{doc_id}/glossary | GET | Get glossary terms |
| /api/document/{doc_id}/glossary | POST | Add manual term |
| /api/document/{doc_id}/glossary/{term_id} | DELETE | Delete term |
| /api/document/{doc_id}/glossary/extract | POST | Re-extract terms |
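
The glossary routes share one URL prefix, so a small request builder can cover all four. This is only a sketch: `glossary_request` is a hypothetical helper, and the payload fields `term` and `definition` are assumptions, since the real schema lives in the backend models.

```python
import json
import urllib.request
from typing import Optional

API_URL = "http://localhost:8080"


def glossary_request(doc_id: str, method: str = "GET",
                     suffix: Optional[str] = None,
                     payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for one of the glossary endpoints.

    `suffix` is a term ID for DELETE, or "extract" for re-extraction.
    """
    url = f"{API_URL}/api/document/{doc_id}/glossary"
    if suffix is not None:
        url += f"/{suffix}"
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical payload; the real field names are defined by the backend.
add_term = glossary_request(
    "doc-1", "POST",
    payload={"term": "attention", "definition": "weighting of input tokens"},
)
delete_term = glossary_request("doc-1", "DELETE", suffix="42")
re_extract = glossary_request("doc-1", "POST", suffix="extract")
```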

Document Processing Pipeline

Background Processing Architecture

The system uses FastAPI's BackgroundTasks for asynchronous processing [5]:

  1. Upload: File saved to uploads/ directory
  2. Queue: Background task queued with document ID
  3. Processing: PDF → Markdown conversion via Docling
  4. Extraction: NER processing with spaCy
  5. Storage: Results saved to database and processed/ directory
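
The steps above can be sketched with the standard library alone: blocking work runs in a ThreadPoolExecutor while the event loop stays free, and the conversion is bounded by the processing timeout. The two worker functions are hypothetical stand-ins for Docling and spaCy, and the `FAILED` status name is an assumption; this is not the project's DocumentProcessor.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

PROCESSING_TIMEOUT = 180  # seconds, mirrors the documented default


def convert_pdf_to_markdown(pdf_path: str) -> str:
    """Hypothetical stand-in for the blocking Docling conversion."""
    return f"# Converted from {pdf_path}\n\nTransformer models use Attention."


def extract_terms(markdown: str) -> list[str]:
    """Hypothetical stand-in for spaCy NER: keeps capitalized words."""
    words = (w.strip(".,#") for w in markdown.split())
    return sorted({w for w in words if w[:1].isupper()})


async def process_document(doc_id: str, pdf_path: str,
                           pool: ThreadPoolExecutor) -> dict:
    """Run conversion and extraction off the event loop, with a timeout."""
    loop = asyncio.get_running_loop()
    result = {"doc_id": doc_id, "status": "PROCESSING"}
    try:
        markdown = await asyncio.wait_for(
            loop.run_in_executor(pool, convert_pdf_to_markdown, pdf_path),
            timeout=PROCESSING_TIMEOUT,
        )
        terms = await loop.run_in_executor(pool, extract_terms, markdown)
    except asyncio.TimeoutError:
        result["status"] = "FAILED"  # assumed status name
        return result
    result.update(status="COMPLETED", markdown=markdown, terms=terms)
    return result


def run_demo() -> dict:
    with ThreadPoolExecutor(max_workers=2) as pool:
        return asyncio.run(process_document("doc-1", "uploads/paper.pdf", pool))
```

Note that `asyncio.wait_for` stops waiting when the timeout expires, but a thread that is already running continues until its function returns.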

Processing Flow

sequenceDiagram
    participant Client
    participant API as Upload API
    participant BG as BackgroundTasks
    participant Proc as DocumentProcessor
    participant DB as Database
    
    Client->>API: POST /api/upload (PDF)
    API->>DB: CREATE Document (UPLOADED)
    API->>BG: add_task(process_document)
    API-->>Client: 200 OK {doc_id, status}
    
    Note over BG,Proc: Async processing
    BG->>Proc: process_document(doc_id)
    Proc->>DB: UPDATE status=PROCESSING
    Proc->>Proc: Convert PDF (Docling)
    Proc->>Proc: Extract entities (spaCy)
    Proc->>DB: INSERT glossary terms
    Proc->>DB: UPDATE status=COMPLETED
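
From the client's side, the flow above amounts to polling the status endpoint until processing finishes. A minimal sketch, assuming the status values shown in the diagram plus a `FAILED` value that is not confirmed by the source; `wait_for_document` is a hypothetical helper, and `fetch_status` is injected so the loop can be exercised without a running backend.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}  # "FAILED" is an assumed name


def wait_for_document(doc_id: str,
                      fetch_status: Callable[[str], str],
                      interval: float = 2.0,
                      timeout: float = 180.0) -> str:
    """Poll until the document reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(doc_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{doc_id} still {status} after {timeout}s")
        time.sleep(interval)


# Stand-in for a real GET /api/status/{doc_id} call, for demonstration only.
_fake_statuses = iter(["UPLOADED", "PROCESSING", "COMPLETED"])
final = wait_for_document("doc-1", lambda _id: next(_fake_statuses), interval=0.01)
```

In a real client, `fetch_status` would issue the GET request and read the status field from the JSON response.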

Development

Project Structure

Scipher/
├── backend/                 # FastAPI application
│   ├── src/scipher/        # Source code
│   │   ├── api/           # API routes
│   │   ├── core/          # Processing services
│   │   ├── models/        # Data models
│   │   └── config.py      # Configuration
│   ├── pyproject.toml     # Dependencies
│   └── main.py           # Application entry
├── Web/                   # Next.js frontend
│   ├── app/              # App router pages
│   ├── components/       # React components
│   ├── utils/           # Utilities
│   └── package.json     # Dependencies
└── start-dev.sh         # Development script

Running Tests

# Backend tests (from the project root)
cd backend
source .venv/bin/activate
pytest

# Frontend tests (from the project root)
cd Web
npm test

Code Quality

The project uses:

  • TypeScript for frontend type safety
  • Pydantic for backend data validation
  • SQLAlchemy for database ORM
  • Async/await patterns throughout

Deployment

Production Build

  1. Backend:
cd backend
uv sync --no-dev
PYTHONPATH=src uvicorn main:app --host 0.0.0.0 --port 8080
  2. Frontend:
cd Web
npm run build
npm start

Environment Variables

Production requires:

  • DATABASE_URL: Production database connection
  • CORS_ORIGINS: Allowed frontend origins
  • NEXT_PUBLIC_API_URL: Backend API URL
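
CORS_ORIGINS is declared as `list[str]` in the Settings class, and pydantic-settings generally reads list-typed fields from JSON-encoded environment values. The stdlib-only sketch below mimics that parsing so the expected format is visible; `load_backend_env` is a hypothetical helper, and `https://scipher.example.com` is a placeholder origin.

```python
import json
from typing import Mapping

# Defaults mirror the Settings class; the list is written as JSON text.
DEFAULTS = {
    "DATABASE_URL": "sqlite+aiosqlite:///./scipher.db",
    "CORS_ORIGINS": '["http://localhost:3000", "http://localhost:3001"]',
}


def load_backend_env(env: Mapping[str, str]) -> dict:
    """Read the backend variables from a mapping such as os.environ."""
    origins = json.loads(env.get("CORS_ORIGINS", DEFAULTS["CORS_ORIGINS"]))
    if not isinstance(origins, list) or not all(isinstance(o, str) for o in origins):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return {
        "DATABASE_URL": env.get("DATABASE_URL", DEFAULTS["DATABASE_URL"]),
        "CORS_ORIGINS": origins,
    }


# Placeholder production value, for illustration only.
config = load_backend_env({"CORS_ORIGINS": '["https://scipher.example.com"]'})
```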

Troubleshooting

Common Issues

Backend fails to start:

cd backend
source .venv/bin/activate
uv sync --reinstall

spaCy model not found:

python -m spacy download en_core_web_sm

Port conflicts:

lsof -i :8080  # Backend
lsof -i :3000  # Frontend
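
Where `lsof` is not available, a similar check can be done from Python. This is a minimal sketch that only tests whether something accepts TCP connections on the two default ports; `port_in_use` is a hypothetical helper and does not identify the owning process.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


for name, port in (("backend", 8080), ("frontend", 3000)):
    state = "in use" if port_in_use(port) else "free"
    print(f"{name} port {port}: {state}")
```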

Cache Cleanup

The startup script automatically clears caches [6]:

  • Python __pycache__ directories
  • Next.js .next build artifacts
  • npm cache
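
For environments where the bash script cannot run, the first two cleanup steps can be approximated in Python. This sketch is an assumption-labeled equivalent, not part of the project: the `clean_caches` function here is hypothetical, it removes only `__pycache__` directories under backend/ and the Web/.next directory, and it leaves the npm cache alone.

```python
import shutil
from pathlib import Path


def clean_caches(project_root: Path) -> list[Path]:
    """Remove __pycache__ directories under backend/ and Web/.next.

    Returns the paths that were removed. Does not touch the npm cache.
    """
    targets = list((project_root / "backend").rglob("__pycache__"))
    targets.append(project_root / "Web" / ".next")
    removed = []
    for path in targets:
        # Skip anything that is missing or was already removed with a parent.
        if path.is_dir():
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path)
    return removed
```

Calling `clean_caches(Path("."))` from the project root removes the same directories the script's first cleanup steps target.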

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Submit a pull request

License

This project is licensed under the MIT License.


Notes

This README is generated based on the current codebase structure and configuration. For the most up-to-date information, refer to the individual component documentation and inline code comments.


Citations

File: backend/pyproject.toml (L10-32)

dependencies = [
    "aiofiles>=24.1.0",
    "aiohttp>=3.13.0",
    "aiosqlite>=0.21.0",
    "alembic>=1.16.5",
    "docling>=2.58.0",
    "fastapi>=0.118.0",
    "fitz>=0.0.1.dev2",
    "hatchling>=1.27.0",
    "marker-pdf>=1.10.1",
    "onnxruntime>=1.23.2",
    "opencv-python-headless>=4.11.0.86",
    "pikepdf>=10.0.0",
    "pydantic>=2.11.9",
    "pydantic-settings>=2.11.0",
    "pymupdf>=1.26.5",
    "python-multipart>=0.0.20",
    "rapidocr>=3.4.2",
    "setuptools>=80.9.0",
    "spacy>=3.7.0",
    "sqlalchemy>=2.0.43",
    "uvicorn>=0.37.0",
]

File: start-dev.sh (L1-90)

#!/bin/bash

# ==========================================
# 🚀 Scipher Development Startup Script
# Starts both FastAPI (backend) and Next.js (Web)
# ==========================================

set -e  # Exit immediately on error

echo "🚀 Starting Scipher Development Environment"
echo "=========================================="

# Ensure we're in the project root
if [ ! -d "backend" ] || [ ! -d "Web" ]; then
    echo "❌ Error: Please run this script from the scipher project root directory."
    exit 1
fi

# --- Cache Cleanup Function ---
clean_caches() {
    echo "🧹 Cleaning caches for a fresh start..."
    
    # Backend: Python caches
    echo "  📦 Clearing Python caches..."
    find backend -name "__pycache__" -type d -exec rm -rf {} + 2>/dev/null || true
    find backend -name "*.pyc" -delete 2>/dev/null || true
    find backend -name "*.pyo" -delete 2>/dev/null || true
    
    # Backend: App-specific caches (e.g., processed data dir if exists)
    if [ -d "backend/src/processed_data" ]; then  # Adjust path based on settings.PROCESSED_DATA_DIR
        echo "  📁 Clearing processed data cache..."
        rm -rf backend/processed/*.json 2>/dev/null || true
    fi
    
    # Frontend: Next.js and npm caches
    echo "  🖥️  Clearing Next.js and npm caches..."
    cd Web
    rm -rf .next 2>/dev/null || true
    rm -rf node_modules/.cache 2>/dev/null || true
    npm cache clean --force 2>/dev/null || true
    cd ..
    
    # General: Any temp dirs (optional: add more as needed)
    rm -rf /tmp/scipher_* 2>/dev/null || true  # Temp files if used
    
    echo "✅ Caches cleared!"
}

# --- Backend startup ---
start_backend() {
    echo "📡 Starting FastAPI Backend..."
    cd backend

    # Activate venv if it exists (uv-compatible)
    if [ -d ".venv" ]; then
        echo "📦 Activating virtual environment..."
        source .venv/bin/activate
    fi

    # Install dependencies (prefer uv if available)
    if command -v uv &> /dev/null; then
        echo "📥 Installing Python dependencies via uv..."
        uv sync
    elif [ -f "requirements.txt" ]; then
        echo "📥 Installing Python dependencies via pip..."
        pip install -r requirements.txt
    fi

    echo "🚀 Starting backend server at http://localhost:8080"
    PYTHONPATH=src uvicorn main:app --reload --host 0.0.0.0 --port 8080 &
    BACKEND_PID=$!
    cd ..
    echo "✅ Backend started (PID: $BACKEND_PID)"
}

# --- Web (Next.js) startup ---
start_web() {
    echo "🖥️  Starting Next.js Web Frontend..."
    cd Web

    if [ ! -d "node_modules" ]; then
        echo "📦 Installing Node.js dependencies..."
        npm install
    fi

    echo "🚀 Starting frontend server at http://localhost:3000"
    npm run dev &
    FRONTEND_PID=$!
    cd ..
    echo "✅ Web frontend started (PID: $FRONTEND_PID)"

File: backend/src/scipher/config.py (L6-31)

class Settings(BaseSettings):
    APP_NAME: str = "Scipher API"
    APP_VERSION: str = "1.0.0"
    DEBUG: bool = True
    
    HOST: str = "0.0.0.0"
    PORT: int = 8080
    
    DATABASE_URL: str = "sqlite+aiosqlite:///./scipher.db"
    DB_ECHO: bool = False
    
    UPLOAD_DIR: Path = Path("uploads")
    MAX_FILE_SIZE: int = 50 * 1024 * 1024  # 50MB
    ALLOWED_EXTENSIONS: Set[str] = {".pdf"}
    
    CORS_ORIGINS: list[str] = ["http://localhost:3000", "http://localhost:3001"]
    
    PROCESSING_TIMEOUT: int = 180
    
    PROCESSED_DATA_DIR: Path = Path("processed")
    TEMP_DIR: Path = Path("temp")
    
    model_config = SettingsConfigDict(
        env_file=".env",
        case_sensitive=True
    )

File: backend/src/scipher/core/document_processor.py (L303-319)

    async def summarize_document(self, doc_id: str, summarizer: Optional[DocumentSummarizer] = None) -> SummaryResult:
        """Generate summaries for a processed document."""
        markdown_text = self.load_markdown_file(doc_id)
        if not markdown_text:
            raise ProcessingException("No processed markdown available for summarization")

        summarizer = summarizer or self.summarizer
        loop = asyncio.get_event_loop()
        try:
            result = await loop.run_in_executor(None, summarizer.summarize, markdown_text)
        except ValueError as exc:
            raise ProcessingException(str(exc)) from exc
        return result

# Singleton instance
document_processor = DocumentProcessor()
